cs.CV [Total: 240]
cs.CL [Total: 46]
cs.RO [Total: 7]
cs.GR [Total: 2]
cs.MA [Total: 1]
q-bio.QM [Total: 1]
cs.IR [Total: 1]
cs.MM [Total: 2]
eess.IV [Total: 3]
cs.AI [Total: 3]
cs.CR [Total: 1]
cs.LG [Total: 20]
cs.CY [Total: 1]
cs.HC [Total: 1]
cs.DB [Total: 1]
cs.DC [Total: 1]
cs.SD [Total: 3]
cs.SE [Total: 1]

cs.CV

[1] Multimodal AI for Body Fat Estimation: Computer Vision and Anthropometry with DEXA Benchmarks cs.CV | cs.AI | cs.LG

Rayan Aldajani

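TL;DR: This paper studies whether AI models can estimate body fat percentage from frontal body images and basic anthropometric measurements, as a low-cost alternative to DEXA scans. It develops ResNet-based image models and anthropometric regression models, and outlines a multimodal fusion framework for future work. The image-based model reaches an RMSE of 4.44% and an R² of 0.807.

Details

Motivation: DEXA scans, the gold standard for body fat estimation, are expensive and hard to access, so a low-cost, widely available method is needed. The study explores whether computer vision and anthropometric data can fill that role.

Result: The image-based model achieves an RMSE of 4.44% and an R² of 0.807, suggesting that AI models can serve as a low-cost option for body fat estimation.

Insight: 1. AI models show promise for body fat estimation but need further optimization and multimodal data; 2. fusing images with anthropometric data may improve performance once paired datasets exist; 3. data scarcity is a central challenge in this area.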

Abstract: Tracking body fat percentage is essential for effective weight management, yet gold-standard methods such as DEXA scans remain expensive and inaccessible for most people. This study evaluates the feasibility of artificial intelligence (AI) models as low-cost alternatives using frontal body images and basic anthropometric data. The dataset consists of 535 samples: 253 cases with recorded anthropometric measurements (weight, height, neck, ankle, and wrist) and 282 images obtained via web scraping from Reddit posts with self-reported body fat percentages, including some reported as DEXA-derived by the original posters. Because no public datasets exist for computer-vision-based body fat estimation, this dataset was compiled specifically for this study. Two approaches were developed: (1) ResNet-based image models and (2) regression models using anthropometric measurements. A multimodal fusion framework is also outlined for future expansion once paired datasets become available. The image-based model achieved a Root Mean Square Error (RMSE) of 4.44% and a Coefficient of Determination (R^2) of 0.807. These findings demonstrate that AI-assisted models can offer accessible and low-cost body fat estimates, supporting future consumer applications in health and fitness.
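The two metrics quoted in this entry can be reproduced from paired predictions and reference values. The following is a minimal sketch in plain Python; the sample values are hypothetical and are not taken from the paper's dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Square Error, in the same units as the inputs."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical body fat percentages (reference vs. model prediction).
reference = [12.0, 18.0, 25.0, 31.0]
predicted = [14.0, 17.0, 22.0, 33.0]

example_rmse = rmse(reference, predicted)
example_r2 = r_squared(reference, predicted)
print(f"RMSE = {example_rmse:.2f} points, R^2 = {example_r2:.3f}")
```

Because body fat is itself a percentage, an RMSE of 4.44% here reads as 4.44 percentage points of body fat, not a relative error.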

[2] Reconstruction-Driven Multimodal Representation Learning for Automated Media Understanding cs.CV | cs.AI

Yassir Benhammou, Suman Kalyan, Sujay Kumar

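TL;DR: The paper proposes a Multimodal Autoencoder (MMAE) that learns a unified representation of text, audio, and visual data, enabling end-to-end automation of metadata extraction and semantic clustering, with significant gains in clustering and alignment metrics over linear baselines.

Details

Motivation: Broadcast and media organizations rely on AI to automate content indexing and metadata generation, but existing systems typically handle a single modality, which limits their grasp of complex cross-modal relationships.

Result: The MMAE significantly outperforms linear baselines on clustering and alignment metrics (Silhouette, ARI, NMI), indicating its potential for metadata generation and cross-modal retrieval.

Insight: Reconstruction-driven multimodal learning can improve automation and content management efficiency in broadcast workflows.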

Abstract: Broadcast and media organizations increasingly rely on artificial intelligence to automate the labor-intensive processes of content indexing, tagging, and metadata generation. However, existing AI systems typically operate on a single modality (such as video, audio, or text), limiting their understanding of complex, cross-modal relationships in broadcast material. In this work, we propose a Multimodal Autoencoder (MMAE) that learns unified representations across text, audio, and visual data, enabling end-to-end automation of metadata extraction and semantic clustering. The model is trained on the recently introduced LUMA dataset, a fully aligned benchmark of multimodal triplets representative of real-world media content. By minimizing joint reconstruction losses across modalities, the MMAE discovers modality-invariant semantic structures without relying on large paired or contrastive datasets. We demonstrate significant improvements in clustering and alignment metrics (Silhouette, ARI, NMI) compared to linear baselines, indicating that reconstruction-based multimodal embeddings can serve as a foundation for scalable metadata generation and cross-modal retrieval in broadcast archives. These results highlight the potential of reconstruction-driven multimodal learning to enhance automation, searchability, and content management efficiency in modern broadcast workflows.

[3] BCWildfire: A Long-term Multi-factor Dataset and Deep Learning Benchmark for Boreal Wildfire Risk Prediction cs.CV

Zhengsen Xu, Sibo Cheng, Hongjie He, Lanying Wang, Wentao Sun

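TL;DR: The paper presents BCWildfire, a 25-year, multi-factor dataset for wildfire risk prediction, and benchmarks a range of time-series forecasting models on it.

Details

Motivation: Wildfire risk prediction depends on many interacting drivers and therefore needs datasets covering multimodal factors, yet public datasets remain limited in temporal span, spatial coverage, and multimodal drivers.

Result: The dataset and code are open-sourced and provide a benchmark for wildfire risk prediction.

Insight: The paper examines how position embeddings and the different fire-driving factors influence prediction performance.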

Abstract: Wildfire risk prediction remains a critical yet challenging task due to the complex interactions among fuel conditions, meteorology, topography, and human activity. Despite growing interest in data-driven approaches, publicly available benchmark datasets that support long-term temporal modeling, large-scale spatial coverage, and multimodal drivers remain scarce. To address this gap, we present a 25-year, daily-resolution wildfire dataset covering 240 million hectares across British Columbia and surrounding regions. The dataset includes 38 covariates, encompassing active fire detections, weather variables, fuel conditions, terrain features, and anthropogenic factors. Using this benchmark, we evaluate a diverse set of time-series forecasting models, including CNN-based, linear-based, Transformer-based, and Mamba-based architectures. We also investigate the effectiveness of position embedding and the relative importance of different fire-driving factors. The dataset and the corresponding code can be found at https://github.com/SynUW/mmFire

[4] Robustness of Structured Data Extraction from Perspectively Distorted Documents cs.CV | cs.CL | cs.LG

Hyakka Nakada, Yoshiyasu Tanaka

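TL;DR: The paper studies how perspective distortion affects document data extraction by multimodal large language models (here Gemini-1.5-pro). It proposes a reduced parameterization of the distortion and finds that structure-recognition accuracy degrades substantially under distortion but improves with a simple rotational correction.

Details

Motivation: Real-world document images are usually not only rotated in-plane but also perspectively distorted, and the effect of such distortion on the extraction accuracy of multimodal LLMs has not been studied sufficiently.

Result: Structure-recognition accuracy is significantly degraded by document distortion, and a simple rotational correction improves it.

Insight: Perspective distortion is a key factor in the document-extraction performance of multimodal LLMs, particularly for structure recognition; simple preprocessing such as rotational correction improves results and is practical advice for deployment.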

Abstract: Optical Character Recognition (OCR) for data extraction from documents is essential to intelligent informatics, such as digitizing medical records and recognizing road signs. Multi-modal Large Language Models (LLMs) can solve this task and have shown remarkable performance. Recently, it has been noticed that the accuracy of data extraction by multi-modal LLMs can be affected when in-plane rotations are present in the documents. However, real-world document images are usually not only in-plane rotated but also perspectively distorted. This study investigates the impacts of such perturbations on the data extraction accuracy for the state-of-the-art model, Gemini-1.5-pro. Because perspective distortions have a high degree of freedom, designing experiments in the same manner as single-parametric rotations is difficult. We observed typical distortions of document images and showed that most of them approximately follow an isosceles-trapezoidal transformation, which allows us to evaluate distortions with a small number of parameters. We were able to reduce the number of independent parameters from eight to two, i.e. rotation angle and distortion ratio. Then, specific entities were extracted from synthetically generated sample documents while varying these parameters. To measure LLM performance, we evaluated not only character-recognition accuracy but also structure-recognition accuracy. Whereas the former represents the classical indicators for optical character recognition, the latter is related to the correctness of reading order. In particular, the structure-recognition accuracy was found to be significantly degraded by document distortion. In addition, we found that this accuracy can be improved by a simple rotational correction. This insight will contribute to the practical use of multi-modal LLMs for OCR tasks.
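To make the two-parameter idea concrete, the sketch below computes where the four corners of a document would land under an isosceles-trapezoidal distortion followed by an in-plane rotation. The exact parameterization is an assumption made for illustration: here `ratio` is taken to be top-edge length divided by bottom-edge length, and the function name `trapezoid_corners` is hypothetical. The abstract does not give the authors' definition.

```python
import math

def trapezoid_corners(width, height, angle_deg, ratio):
    """Corners (TL, TR, BR, BL) of a centered page after an assumed
    isosceles-trapezoid squeeze of the top edge and an in-plane rotation."""
    half_w, half_h = width / 2.0, height / 2.0
    # Assumption: only the top edge is scaled by `ratio`; the bottom edge is kept.
    corners = [
        (-ratio * half_w, -half_h),  # top-left
        (ratio * half_w, -half_h),   # top-right
        (half_w, half_h),            # bottom-right
        (-half_w, half_h),           # bottom-left
    ]
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    # Standard 2D rotation about the page center.
    return [(c * x - s * y, s * x + c * y) for x, y in corners]

# Hypothetical page of size 2 x 4 with the top edge half as long as the bottom.
distorted = trapezoid_corners(2.0, 4.0, angle_deg=0.0, ratio=0.5)
rotated = trapezoid_corners(2.0, 4.0, angle_deg=90.0, ratio=0.5)
```

A general perspective transform has eight free parameters; fixing the shape to an isosceles trapezoid leaves only the angle and the ratio to sweep, which is what makes a grid experiment tractable.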

[5] 3D Ground Truth Reconstruction from Multi-Camera Annotations Using UKF cs.CV

Linh Van Ma, Unse Fatima, Tepy Sokun Chriv, Haroon Imran, Moongu Jeon

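TL;DR: The paper proposes a UKF-based method that fuses 2D bounding-box or keypoint annotations from multiple cameras, via homography-based projection and Unscented Kalman Filter fusion, into accurate 3D ground truth for autonomous driving, surveillance, and robotics.

Details

Motivation: Accurate 3D ground truth is essential for autonomous driving, surveillance, and robotics. Existing methods typically provide only ground-plane information and depend on costly 3D annotation, so a way to recover full 3D shape from inexpensive 2D annotations is needed.

Result: Evaluated on the CMC, Wildtrack, and Panoptic datasets, the method shows high 3D localization accuracy against the available 3D ground truth.

Insight: The method obtains 3D reconstructions from low-cost 2D annotations, giving multi-camera systems a practical, automated solution.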

Abstract: Accurate 3D ground truth estimation is critical for applications such as autonomous navigation, surveillance, and robotics. This paper introduces a novel method that uses an Unscented Kalman Filter (UKF) to fuse 2D bounding box or pose keypoint ground truth annotations from multiple calibrated cameras into accurate 3D ground truth. By leveraging human-annotated ground-truth 2D, our proposed method, a multi-camera single-object tracking algorithm, transforms 2D image coordinates into robust 3D world coordinates through homography-based projection and UKF-based fusion. Our proposed algorithm processes multi-view data to estimate object positions and shapes while effectively handling challenges such as occlusion. We evaluate our method on the CMC, Wildtrack, and Panoptic datasets, demonstrating high accuracy in 3D localization compared to the available 3D ground truth. Unlike existing approaches that provide only ground-plane information, our method also outputs the full 3D shape of each object. Additionally, the algorithm offers a scalable and fully automatic solution for multi-camera systems using only 2D image annotations.
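The homography-based projection step can be illustrated in a few lines. The sketch below maps an image point to ground-plane coordinates with a 3x3 homography and then combines two cameras with a plain average. The average is only a stand-in chosen for brevity: the paper fuses views with an Unscented Kalman Filter, which is not implemented here, and the matrices are hypothetical rather than real calibrations.

```python
def apply_homography(H, u, v):
    """Map pixel (u, v) to plane coordinates using a 3x3 homography H."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to get Cartesian plane coordinates.
    return x / w, y / w

def average_points(points):
    """Naive fusion (NOT the paper's UKF): arithmetic mean of per-camera estimates."""
    n = len(points)
    return sum(p[0] for p in points) / n, sum(p[1] for p in points) / n

# Hypothetical homographies for two calibrated cameras.
H_cam_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
H_cam_b = [[2.0, 0.0, 1.0], [0.0, 2.0, -1.0], [0.0, 0.0, 2.0]]

# The same object annotated at pixel (4, 6) in both views (illustrative values).
estimate_a = apply_homography(H_cam_a, 4.0, 6.0)
estimate_b = apply_homography(H_cam_b, 4.0, 6.0)
fused = average_points([estimate_a, estimate_b])
```

A filter-based fusion would additionally weight each camera by its uncertainty and carry state across frames, which is what lets it ride through occlusions.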

[6] Unified Low-Light Traffic Image Enhancement via Multi-Stage Illumination Recovery and Adaptive Noise Suppression cs.CV | cs.AI

Siddiqua Namrah

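TL;DR: The paper proposes an unsupervised multi-stage deep learning framework for low-light traffic image enhancement. Global and local brightness correction, noise suppression, and structural recovery improve image quality and the reliability of downstream tasks in low light.

Details

Motivation: In autonomous driving and intelligent transportation, low-light traffic images suffer from poor visibility caused by insufficient illumination, noise, and related degradations, which hampers perception tasks such as object detection and scene understanding.

Result: The method outperforms existing approaches on general and traffic-specific datasets, in both quantitative metrics and qualitative visual quality.

Insight: Combining multi-stage decomposition with unsupervised training addresses the compounded challenges of low-light traffic image enhancement and suits real-world deployment.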

Abstract: Enhancing low-light traffic images is crucial for reliable perception in autonomous driving, intelligent transportation, and urban surveillance systems. Nighttime and dimly lit traffic scenes often suffer from poor visibility due to low illumination, noise, motion blur, non-uniform lighting, and glare from vehicle headlights or street lamps, which hinder tasks such as object detection and scene understanding. To address these challenges, we propose a fully unsupervised multi-stage deep learning framework for low-light traffic image enhancement. The model decomposes images into illumination and reflectance components, progressively refined by three specialized modules: (1) Illumination Adaptation, for global and local brightness correction; (2) Reflectance Restoration, for noise suppression and structural detail recovery using spatial-channel attention; and (3) Over-Exposure Compensation, for reconstructing saturated regions and balancing scene luminance. The network is trained using self-supervised reconstruction, reflectance smoothness, perceptual consistency, and domain-aware regularization losses, eliminating the need for paired ground-truth images. Experiments on general and traffic-specific datasets demonstrate superior performance over state-of-the-art methods in both quantitative metrics (PSNR, SSIM, LPIPS, NIQE) and qualitative visual quality. Our approach enhances visibility, preserves structure, and improves downstream perception reliability in real-world low-light traffic scenarios.
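The illumination/reflectance decomposition mentioned in the abstract follows the Retinex view that a pixel is the product of reflectance and illumination. The sketch below uses a classical hand-crafted stand-in, not the paper's learned modules: illumination is estimated as the per-pixel maximum over RGB, and brightness is corrected with a gamma curve. The function names and the gamma value are assumptions for illustration.

```python
def decompose(pixel, eps=1e-6):
    """Split an RGB pixel in [0, 1] into (illumination, reflectance).
    Heuristic assumption: illumination = max channel value."""
    illumination = max(max(pixel), eps)
    reflectance = tuple(channel / illumination for channel in pixel)
    return illumination, reflectance

def enhance(pixel, gamma=0.5):
    """Brighten the illumination with a gamma curve, keep the reflectance."""
    illumination, reflectance = decompose(pixel)
    corrected = illumination ** gamma  # gamma < 1 lifts dark values
    return tuple(r * corrected for r in reflectance)

# Hypothetical dark pixel: its channel ratios (the color) survive enhancement.
dark_pixel = (0.04, 0.02, 0.01)
bright_pixel = enhance(dark_pixel, gamma=0.5)
```

Adjusting only the illumination term is what keeps colors stable while brightness changes; a learned pipeline replaces both the estimate and the correction with trained networks.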

[7] HSMix: Hard and Soft Mixing Data Augmentation for Medical Image Segmentation cs.CV | cs.LG

Danyang Sun, Fadi Dornaika, Nagore Barrena

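TL;DR: HSMix is a data augmentation method for medical image segmentation that combines hard and soft mixing, using local image editing to generate diverse augmented data while preserving semantics.

Details

Motivation: Annotating medical images is costly and samples for some diseases are rare, leading to data scarcity and overfitting. Local image editing augmentation has been little explored for segmentation, which motivates a more effective augmentation method.

Result: Experiments confirm the effectiveness of HSMix across a variety of medical segmentation tasks.

Insight: Through local image editing, HSMix increases data diversity while preserving semantic information, and it is a simple, plug-and-play augmentation.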

Abstract: Due to the high cost of annotation or the rarity of some diseases, medical image segmentation is often limited by data scarcity and the resulting overfitting problem. Self-supervised learning and semi-supervised learning can mitigate the data scarcity challenge to some extent. However, both of these paradigms are complex and require either hand-crafted pretexts or well-defined pseudo-labels. In contrast, data augmentation represents a relatively simple and straightforward approach to addressing data scarcity issues. It has led to significant improvements in image recognition tasks. However, the effectiveness of local image editing augmentation techniques in the context of segmentation has been less explored. We propose HSMix, a novel approach to local image editing data augmentation involving hard and soft mixing for medical semantic segmentation. In our approach, a hard-augmented image is created by combining homogeneous regions (superpixels) from two source images. A soft mixing method further adjusts the brightness of these composed regions with brightness mixing based on locally aggregated pixel-wise saliency coefficients. The ground-truth segmentation masks of the two source images undergo the same mixing operations to generate the associated masks for the augmented images. Our method fully exploits both the prior contour and saliency information, thus preserving local semantic information in the augmented images while enriching the augmentation space with more diversity. Our method is a plug-and-play solution that is model agnostic and applicable to a range of medical imaging modalities. Extensive experimental evidence has demonstrated its effectiveness in a variety of medical segmentation tasks. The source code is available at https://github.com/DanielaPlusPlus/HSMix.
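The hard-mixing step, where whole regions are swapped between two images and their masks receive the same swap, can be sketched as follows. This is an illustration under assumptions rather than the released implementation: the region label map is supplied by hand instead of coming from a superpixel algorithm, each region is taken from the second image with a fixed probability `p`, and the saliency-based soft brightness mixing is omitted.

```python
import random

def hard_mix(img_a, mask_a, img_b, mask_b, regions, rng, p=0.5):
    """Replace randomly chosen regions of A with the same regions of B,
    applying the identical choice to the segmentation masks."""
    region_ids = sorted({label for row in regions for label in row})
    from_b = {label for label in region_ids if rng.random() < p}

    def mix(a, b):
        # Pixel (i, j) comes from b when its region was selected, else from a.
        return [
            [b[i][j] if regions[i][j] in from_b else a[i][j]
             for j in range(len(regions[i]))]
            for i in range(len(regions))
        ]

    return mix(img_a, img_b), mix(mask_a, mask_b), from_b

# Hypothetical 2x4 grayscale images, binary masks, and a two-region label map.
img_a = [[10, 10, 10, 10], [10, 10, 10, 10]]
img_b = [[90, 90, 90, 90], [90, 90, 90, 90]]
mask_a = [[0, 0, 0, 0], [0, 0, 0, 0]]
mask_b = [[1, 1, 1, 1], [1, 1, 1, 1]]
regions = [[0, 0, 1, 1], [0, 0, 1, 1]]

mixed_img, mixed_mask, from_b = hard_mix(
    img_a, mask_a, img_b, mask_b, regions, random.Random(0)
)
```

Sharing one region selection between image and mask is the property that keeps every augmented pixel paired with its correct label.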

[8] Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach cs.CV

Ju-Young Oh

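TL;DR: The paper proposes FIQ, an embedding-integrated approach that generates foundational question-answer pairs to strengthen the reasoning of video question answering models and give them a more complete understanding of video content.

Details

Motivation: Existing video QA annotations are mostly event-centric and miss foundational scene information such as object categories and spatial configurations, which limits generalization and reasoning.

Result: FIQ achieves state-of-the-art performance on the SUTD-TrafficQA dataset.

Insight: Generating foundational Q&A pairs and aligning question embeddings with visual features improves the holistic understanding and reasoning of video QA models.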

Abstract: Conventional VQA approaches primarily rely on question-answer (Q&A) pairs to learn the spatio-temporal dynamics of video content. However, most existing annotations are event-centric, which restricts the model’s ability to capture the comprehensive context of a scene. The lack of fundamental information such as object categories, spatial configurations, and descriptive visual attributes prevents the model from forming a complete understanding of the environment, ultimately limiting its generalization and reasoning capability. In this paper, we introduce Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach (FIQ), a framework designed to enhance the reasoning capability of VQA models by improving their foundational comprehension of video content. FIQ generates Q&A pairs from descriptive information extracted directly from videos, thereby enriching the dataset with core scene-level attributes. These generated pairs help the model develop a more holistic understanding of the video, leading to improved generalizability and reasoning performance. In addition, we propose a VQ-CAlign module that aligns task-specific question embeddings with corresponding visual features, preserving essential contextual cues and enhancing adaptability to downstream tasks. Experimental results on the SUTD-TrafficQA dataset demonstrate that FIQ achieves state-of-the-art performance, surpassing existing baseline approaches.

[9] Upstream Probabilistic Meta-Imputation for Multimodal Pediatric Pancreatitis Classification cs.CV | cs.LG

Max A. Nelson, Elif Keles, Eminenur Sen Tasci, Merve Yazol, Halil Ertugrul Aktas

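TL;DR: The paper proposes UPMI, a lightweight augmentation strategy for pediatric pancreatitis diagnosis under limited samples and multimodal imaging complexity. It performs upstream probabilistic meta-imputation in a low-dimensional meta-feature space and improves classification performance.

Details

Motivation: Pediatric pancreatitis is difficult to diagnose clinically, and machine learning methods struggle because of scarce samples and multimodal imaging complexity.

Result: On paired T1W/T2W MRI from 67 pediatric subjects, UPMI reaches a mean AUC of 0.908, about a 5% relative gain over the real-only baseline (AUC 0.864).

Insight: Augmenting in a low-dimensional meta-feature space rather than image space is a lightweight option suited to sample-scarce problems.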

Abstract: Pediatric pancreatitis is a progressive and debilitating inflammatory condition, including acute pancreatitis and chronic pancreatitis, that presents significant clinical diagnostic challenges. Machine learning-based methods also face diagnostic challenges due to limited sample availability and multimodal imaging complexity. To address these challenges, this paper introduces Upstream Probabilistic Meta-Imputation (UPMI), a light-weight augmentation strategy that operates upstream of a meta-learner in a low-dimensional meta-feature space rather than in image space. Modality-specific logistic regressions (T1W and T2W MRI radiomics) produce probability outputs that are transformed into a 7-dimensional meta-feature vector. Class-conditional Gaussian mixture models (GMMs) are then fit within each cross-validation fold to sample synthetic meta-features that, combined with real meta-features, train a Random Forest (RF) meta-classifier. On 67 pediatric subjects with paired T1W/T2W MRIs, UPMI achieves a mean AUC of 0.908 $\pm$ 0.072, a $\sim$5% relative gain over a real-only baseline (AUC 0.864 $\pm$ 0.061).
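The augmentation idea, fitting a class-conditional density to the training fold's meta-features and sampling synthetic vectors from it, can be sketched with the standard library. As a simplification that should be read as an assumption, each class is modelled by a single diagonal Gaussian instead of the Gaussian mixture models used in the paper, and the three-dimensional vectors are hypothetical (the paper's meta-feature vectors are 7-dimensional).

```python
import random
import statistics

def fit_diag_gaussian(rows):
    """Per-dimension mean and standard deviation of a list of vectors."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds

def sample_synthetic(train_rows_by_class, n_per_class, rng):
    """Draw synthetic (vector, label) pairs from per-class diagonal Gaussians.
    Fit on the training fold only, so no validation data leaks in."""
    synthetic = []
    for label, rows in train_rows_by_class.items():
        means, stds = fit_diag_gaussian(rows)
        for _ in range(n_per_class):
            vector = [rng.gauss(m, s) for m, s in zip(means, stds)]
            synthetic.append((vector, label))
    return synthetic

# Hypothetical training-fold meta-features (probability-like values) per class.
train_fold = {
    0: [[0.20, 0.30, 0.25], [0.10, 0.35, 0.20], [0.15, 0.25, 0.30]],
    1: [[0.80, 0.70, 0.75], [0.90, 0.65, 0.85], [0.85, 0.75, 0.80]],
}
synthetic_rows = sample_synthetic(train_fold, n_per_class=5, rng=random.Random(0))
```

Refitting inside every cross-validation fold, as the abstract describes, matters because a density fitted on all subjects would let held-out information reach the meta-classifier.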

[10] SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in Long-horizon Embodied Scenarios cs.CV | cs.AI | cs.RO

Jieru Lin, Zhiwei Yu, Börje F. Karlsson

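TL;DR: SWITCH is a new benchmark that evaluates how well autonomous agents model and handle tangible control interfaces (TCIs) in long-horizon settings. Its tasks cover visual question answering, semantic UI grounding, action generation, state-transition prediction, and result verification.

Details

Motivation: Existing benchmarks rarely test interaction with the real world, particularly grounding, partial observability, and post-hoc verification for tangible control interfaces, where failures can have safety implications. SWITCH targets this gap.

Result: Commercial and open multimodal models perform inconsistently, even on single-step interactions, often over-relying on textual cues and under-using visual or video evidence.

Insight: These failures show that current multimodal systems over-rely on textual cues and under-use visual information, and that high aggregate scores can mask such failures.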

Abstract: Autonomous intelligence requires not only perception and reasoning, but critically, effective interaction with the existing world and its infrastructure. Everyday environments are rich in tangible control interfaces (TCIs), e.g., light switches, appliance panels, and embedded GUIs, that demand commonsense and physics reasoning, but also causal prediction and outcome verification in time and space (e.g., delayed heating, remote lights). Moreover, failures here have potential safety implications, yet current benchmarks rarely test grounding, partial observability (video), or post-hoc verification in situated settings. We introduce SWITCH (Semantic World Interface Tasks for Control and Handling), an embodied, task-driven benchmark created through iterative releases to probe these gaps. Its first iteration, SWITCH-Basic, evaluates five complementary abilities: task-aware VQA, semantic UI grounding, action generation, state-transition prediction, and result verification, under egocentric RGB video input and device diversity. Across 351 tasks spanning 98 real devices and appliances, commercial and open LMMs exhibit inconsistent performance even on single-step interactions, often over-relying on textual cues and under-using visual or video evidence (and high aggregate scores can mask such failures). SWITCH provides data, code, and held-out splits to enable reproducible evaluation and community contributions toward more challenging future iterations of the benchmark and the creation of training datasets. Benchmark resources are available at: https://github.com/BAAI-Agents/SWITCH.

[11] MedPEFT-CL: Dual-Phase Parameter-Efficient Continual Learning with Medical Semantic Adapter and Bidirectional Memory Consolidation cs.CV

Ziyuan Gao

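TL;DR: MedPEFT-CL is a parameter-efficient continual learning framework with a dual-phase architecture (an adaptive learning phase and a knowledge consolidation phase) that addresses catastrophic forgetting in medical vision-language segmentation, substantially reducing trainable parameters while maintaining cross-modal learning.

Details

Motivation: Medical vision-language segmentation models suffer catastrophic forgetting when learning new anatomical structures and then require complete retraining, which limits clinical deployment. Continual learning methods designed specifically for medical vision-language tasks remain underexplored.

Result: Experiments across diverse medical datasets show superior forgetting mitigation and performance retention with minimal parameter overhead.

Insight: Semantic similarity-based adapter allocation and bidirectional Fisher-memory coordination are the key components, offering a new direction for medical continual learning.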

Abstract: Medical vision-language segmentation models suffer from catastrophic forgetting when adapting to new anatomical structures, requiring complete retraining that limits their clinical deployment. Although continual learning approaches have been studied for various applications, targeted research on continual learning approaches specifically designed for medical vision-language tasks remains underexplored. We propose MedPEFT-CL, a parameter-efficient continual learning framework that addresses both efficient learning of new tasks and preservation of previous knowledge through a dual-phase architecture based on CLIPSeg. Our dual-phase architecture features an adaptive learning phase that employs semantic similarity-based adapter allocation and parameter-efficient fine-tuning for medical tasks through prompt similarity analysis, and a knowledge consolidation phase employing bi-directional Fisher-memory coordination. This creates a reinforcing cycle: consolidation directs replay priorities while new tasks provide challenging samples that improve retention strategies. Our key contributions are: (1) a semantic-driven adapter allocation mechanism that enables efficient learning of new medical tasks, (2) a bi-modal LoRA adaptation that significantly reduces trainable parameters while maintaining cross-modal learning, and (3) bidirectional Fisher-memory coordination that prevents catastrophic forgetting from previous medical tasks. Extensive experiments across diverse medical datasets demonstrate superior forgetting mitigation and performance retention with minimal parameter overhead, making the framework effective for continual learning in medical vision-language scenarios.
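The parameter savings attributed to LoRA adaptation come from replacing a full weight update with a low-rank product. The sketch below shows the generic LoRA update W' = W + (alpha / r) * B A and the resulting trainable-parameter count. It illustrates the general technique only: the paper's bi-modal variant and its adapter allocation are not reproduced, and the matrices and sizes are hypothetical.

```python
def matmul(a, b):
    """Multiply matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_weight(W, A, B, alpha):
    """Effective weight after a LoRA update: W + (alpha / r) * B @ A.
    W is d_out x d_in, B is d_out x r, A is r x d_in."""
    rank = len(A)
    scale = alpha / rank
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def lora_trainable_params(d_out, d_in, rank):
    """Only A and B are trained; the frozen W contributes nothing."""
    return rank * (d_in + d_out)

# Hypothetical frozen 2x3 weight with a rank-1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]   # r x d_in
B = [[1.0], [2.0]]      # d_out x r
adapted = lora_weight(W, A, B, alpha=2.0)

# Illustrative layer size: a 768 x 768 projection with rank 8.
full_count = 768 * 768
lora_count = lora_trainable_params(768, 768, 8)
```

For this illustrative layer the adapter trains 12,288 values instead of 589,824, roughly 2% of the full matrix, which is the kind of reduction that makes per-task adapters affordable in continual learning.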

[12] Person Recognition in Aerial Surveillance: A Decade Survey cs.CV

Kien Nguyen, Feng Liu, Clinton Fookes, Sridha Sridharan, Xiaoming Liu

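TL;DR: This survey systematically reviews 150+ computer vision and machine learning papers from the last 10 years on aerial surveillance, focusing on person recognition from drones and other airborne platforms.

Details

Motivation: With the rapid emergence of drones and other airborne platforms, aerial surveillance has become an active research topic because of its advantages in scale, mobility, and covert observation. The survey summarizes and analyzes how these technologies are applied to person recognition and the challenges involved.

Result: The survey identifies gaps in existing research and outlines future research directions.

Insight: Person recognition in aerial surveillance faces challenges such as viewpoint variation, low resolution, and occlusion; existing methods need further adaptation to these conditions.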

Abstract: The rapid emergence of airborne platforms and imaging sensors is enabling new forms of aerial surveillance due to their unprecedented advantages in scale, mobility, deployment, and covert observation capabilities. This paper provides a comprehensive overview of 150+ papers over the last 10 years of human-centric aerial surveillance tasks from a computer vision and machine learning perspective. It aims to provide readers with an in-depth systematic review and technical analysis of the current state of aerial surveillance tasks using drones, UAVs, and other airborne platforms. The object of interest is humans, where human subjects are to be detected, identified, and re-identified. More specifically, for each of these tasks, we first identify unique challenges in performing these tasks in an aerial setting compared to the popular ground-based setting and subsequently compile and analyze aerial datasets publicly available for each task. Most importantly, we delve deep into the approaches in the aerial surveillance literature with a focus on investigating how they presently address aerial challenges and techniques for improvement. We conclude the paper by discussing the gaps and open research questions to inform future research avenues.

Weiyi Lv, Ning Zhang, Hanyang Sun, Haoran Jiang, Kai Zhao

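TL;DR: The paper proposes VMRMOT, a multimodal framework that introduces a motion modality into Referring Multi-Object Tracking (RMOT) through multimodal large language models (MLLMs), addressing the mismatch between static references and the dynamic vision modality.

Details

Motivation: Existing RMOT benchmarks describe only an object's appearance, relative position, and initial motion state, ignoring dynamic changes such as velocity changes and direction shifts. This creates a temporal discrepancy between static references and the dynamic vision modality and constrains multimodal tracking performance.

Result: Experiments on multiple RMOT benchmarks show that VMRMOT outperforms existing state-of-the-art methods.

Insight: Introducing a motion modality addresses the temporal discrepancy between static references and dynamic vision, improving the robustness and accuracy of multimodal tracking.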

Abstract: Referring Multi-Object Tracking (RMOT) extends conventional multi-object tracking (MOT) by introducing natural language references for multi-modal fusion tracking. RMOT benchmarks only describe the object’s appearance, relative positions, and initial motion states. This so-called static regulation fails to capture dynamic changes of the object motion, including velocity changes and motion direction shifts. This limitation not only causes a temporal discrepancy between static references and dynamic vision modality but also constrains multi-modal tracking performance. To address this limitation, we propose a novel Vision-Motion-Reference aligned RMOT framework, named VMRMOT. It integrates a motion modality extracted from object dynamics to enhance the alignment between vision modality and language references through multi-modal large language models (MLLMs). Specifically, we introduce motion-aware descriptions derived from object dynamic behaviors and, leveraging the powerful temporal-reasoning capabilities of MLLMs, extract motion features as the motion modality. We further design a Vision-Motion-Reference Alignment (VMRA) module to hierarchically align visual queries with motion and reference cues, enhancing their cross-modal consistency. In addition, a Motion-Guided Prediction Head (MGPH) is developed to explore motion modality to enhance the performance of the prediction head. To the best of our knowledge, VMRMOT is the first approach to employ MLLMs in the RMOT task for vision-reference alignment. Extensive experiments on multiple RMOT benchmarks demonstrate that VMRMOT outperforms existing state-of-the-art methods.

[14] Understanding Counting Mechanisms in Large Language and Vision-Language Models cs.CV | cs.AI

Hosein Hasani, Amirmohammad Izadi, Fatemeh Askari, Mobin Bagherian, Sadegh Mohammadian

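TL;DR: Using controlled experiments and a tool called CountScope, the paper studies how large language models (LLMs) and large vision-language models (LVLMs) handle counting, revealing how count information is represented and transferred internally and how numerical representations emerge progressively across layers.

Details

Motivation: The internal mechanisms that large language and vision-language models use for counting are unclear; the paper aims to reveal how these models represent and compute numerical information, to improve understanding of their behavior.

Result: The models contain an internal counter mechanism; numerical representations emerge progressively, with lower layers encoding small counts and higher layers larger ones; in LVLMs, visual embeddings also carry numerical information; structural cues such as separators influence counting accuracy.

Insight: Counting in LLMs and LVLMs is a structured, layerwise process shaped by model architecture and input properties.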

Abstract: This paper examines how large language models (LLMs) and large vision-language models (LVLMs) represent and compute numerical information in counting tasks. We use controlled experiments with repeated textual and visual items and analyze model behavior through causal mediation and activation patching. To this end, we design a specialized tool, CountScope, for mechanistic interpretability of numerical content. Results show that individual tokens or visual features encode latent positional count information that can be extracted and transferred across contexts. Layerwise analyses reveal a progressive emergence of numerical representations, with lower layers encoding small counts and higher layers representing larger ones. We identify an internal counter mechanism that updates with each item, stored mainly in the final token or region and transferable between contexts. In LVLMs, numerical information also appears in visual embeddings, shifting between background and foreground regions depending on spatial composition. Models rely on structural cues such as separators in text, which act as shortcuts for tracking item counts and influence the accuracy of numerical predictions. Overall, counting emerges as a structured, layerwise process in LLMs and follows the same general pattern in LVLMs, shaped by the properties of the vision encoder.

[15] Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions cs.CV

Saurav Sengupta, Nazanin Moradinasab, Jiebei Liu, Donald E. Brown

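TL;DR: The paper studies how vision-language models (VLMs) perform on counting, builds a synthetic benchmark dataset and evaluation framework, analyzes how attention allocation varies with input parameters, and tests whether attention-based interventions improve counting performance.

Details

Motivation: Prior research suggests that VLMs often rely on biases learned during training when answering questions about visual properties, and that these biases are exacerbated in tasks such as counting. This motivates a systematic analysis of VLM counting ability and ways to improve it.

Result: Counting remains challenging for VLMs, especially under high visual or linguistic complexity, but certain attention interventions yield modest gains.

Insight: VLM counting performance depends strongly on visual and linguistic complexity, and attention interventions offer a direction for improvement on similar tasks.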

Abstract: Recent research suggests that Vision Language Models (VLMs) often rely on inherent biases learned during training when responding to queries about visual properties of images. These biases are exacerbated when VLMs are asked highly specific questions that require them to focus on particular areas of the image in tasks such as counting. We build upon this research by developing a synthetic benchmark dataset and evaluation framework to systematically determine how counting performance varies as image and prompt properties change. Using open-source VLMs, we then analyze how attention allocation fluctuates with varying input parameters (e.g. number of objects in the image, object color, background color, object texture, background texture, and prompt specificity). We further implement attention-based interventions to modulate focus on visual tokens at different layers and evaluate their impact on counting performance across a range of visual conditions. Our experiments reveal that while VLM counting performance remains challenging, especially under high visual or linguistic complexity, certain attention interventions can lead to modest gains in counting performance.
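One simple way to "modulate focus on visual tokens" is to rescale the attention weights that a query places on those tokens and renormalize the row. The sketch below shows that idea on a single attention row. It is an assumed form of intervention chosen for illustration; the abstract does not specify which interventions the authors use, and the function name and values are hypothetical.

```python
def boost_visual_attention(weights, visual_idx, factor):
    """Scale attention on the given token positions and renormalize to sum to 1."""
    scaled = [w * factor if i in visual_idx else w for i, w in enumerate(weights)]
    total = sum(scaled)
    return [w / total for w in scaled]

# Hypothetical attention row: positions 0-1 are visual tokens, 2-3 are text tokens.
attention_row = [0.1, 0.2, 0.3, 0.4]
visual_positions = {0, 1}

boosted_row = boost_visual_attention(attention_row, visual_positions, factor=2.0)
visual_mass_before = sum(attention_row[i] for i in visual_positions)
visual_mass_after = sum(boosted_row[i] for i in visual_positions)
```

In this example the share of attention on visual tokens rises from 0.30 to about 0.46, while the row remains a valid probability distribution. In a real model the same operation would be applied inside chosen layers and heads during the forward pass.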

[16] The Potential and Limitations of Vision-Language Models for Human Motion Understanding: A Case Study in Data-Driven Stroke Rehabilitation cs.CV

Victor Li, Naveenraj Kamalakannan, Avinash Parnandi, Heidi Schambra, Carlos Fernandez-Granda

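TL;DR: The paper examines the potential and limitations of vision-language models (VLMs) for data-driven stroke rehabilitation analysis, finding that current models lack fine-grained motion understanding but show promise for future improvement.

Details

Motivation: VLMs perform well across computer vision tasks, but their potential for clinical video analysis such as stroke rehabilitation is unclear; the authors address this through a concrete case study.

Result: Dose estimates are only comparable to a baseline that excludes visual information, and impairment scores cannot be reliably predicted. With optimized prompting and post-processing, however, VLMs classify high-level activities from a few frames, detect motion and grasp with moderate accuracy, and approximate dose counts within 25% of ground truth for mildly impaired and healthy participants.

Insight: 1. VLMs still need improvement for complex clinical tasks; 2. prompt optimization and post-processing improve performance; 3. working without task-specific finetuning makes rapid clinical deployment plausible.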

Abstract: Vision-language models (VLMs) have demonstrated remarkable performance across a wide range of computer-vision tasks, sparking interest in their potential for digital health applications. Here, we apply VLMs to two fundamental challenges in data-driven stroke rehabilitation: automatic quantification of rehabilitation dose and impairment from videos. We formulate these problems as motion-identification tasks, which can be addressed using VLMs. We evaluate our proposed framework on a cohort of 29 healthy controls and 51 stroke survivors. Our results show that current VLMs lack the fine-grained motion understanding required for precise quantification: dose estimates are comparable to a baseline that excludes visual information, and impairment scores cannot be reliably predicted. Nevertheless, several findings suggest future promise. With optimized prompting and post-processing, VLMs can classify high-level activities from a few frames, detect motion and grasp with moderate accuracy, and approximate dose counts within 25% of ground truth for mildly impaired and healthy participants, all without task-specific training or finetuning. These results highlight both the current limitations and emerging opportunities of VLMs for data-driven stroke rehabilitation and broader clinical video analysis.

[17] VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning cs.CV | cs.LG

Lingxiao Li, Yifan Wang, Xinyan Gao, Chen Tang, Xiangyu Yue

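TL;DR: VisReason is a large-scale dataset for advancing visual Chain-of-Thought (CoT) reasoning, with 489K annotated examples and a 165K high-quality subset, VisReason-Pro. Fine-tuning Qwen2.5-VL on it substantially improves the step-by-step reasoning and interpretability of multimodal large language models (MLLMs).

Details

Motivation: CoT has been successful in language models, but its potential in MLLMs remains largely untapped because large-scale, spatially grounded datasets are lacking. VisReason fills this gap and supports more systematic visual reasoning.

Result: Fine-tuning on VisReason and VisReason-Pro substantially improves step-by-step visual reasoning accuracy and interpretability, along with cross-benchmark generalization.

Insight: Building large-scale datasets is important for improving MLLM reasoning in multimodal settings. VisReason's annotation approach (3D spatial grounding and an expert-level GPT annotator) offers a reference for future dataset design.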

Abstract: Chain-of-Thought (CoT) prompting has proven remarkably effective for eliciting complex reasoning in large language models (LLMs). Yet, its potential in multimodal large language models (MLLMs) remains largely untapped, hindered by the absence of large-scale datasets that capture the rich, spatially grounded reasoning intrinsic to visual understanding. Existing visual-CoT resources are typically small, domain-specific, or lack the human-like stepwise structure necessary for compositional visual reasoning. In this paper, we introduce VisReason, a large-scale dataset designed to advance visual Chain-of-Thought reasoning. VisReason comprises 489K annotated examples spanning four diverse domains, each featuring multi-round, human-like rationales that guide MLLMs through interpretable visual reasoning steps. Building upon this, we curate VisReason-Pro, a 165K subset produced with a stronger expert-level GPT annotator, enriched with detailed reasoning traces and 3D spatial grounding via depth-informed annotations. Fine-tuning the state-of-the-art Qwen2.5-VL model on VisReason and VisReason-Pro yields substantial improvements in step-by-step visual reasoning accuracy, interpretability, and cross-benchmark generalization. These results demonstrate that VisReason equips MLLMs with more systematic and generalizable reasoning capabilities. We envision VisReason as a cornerstone for cultivating human-like visual reasoning, paving the way toward the next generation of multimodal intelligence.

[18] Towards Open-Ended Visual Scientific Discovery with Sparse Autoencoders cs.CVPDF

Samuel Stevens, Jacob Beattie, Tanya Berger-Wolf, Yu Su

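TL;DR: The paper asks whether sparse autoencoders (SAEs) can support open-ended scientific discovery from foundation model representations. Experiments show that SAEs surface fine-grained structure in ecological imagery, and the authors argue that the method is domain-agnostic and applicable to other sciences.

Details

Motivation: Scientific archives hold many undiscovered patterns, but most existing methods extract structure only for pre-specified targets and do not support open-ended discovery. SAEs may fill this gap.

Result: SAEs extract fine-grained anatomical structure from ecological imagery without segmentation or part labels. The method is described as applicable to other scientific domains (e.g., proteins, genomics, weather).

Insight: Sparse decomposition is a practical instrument for exploring what scientific foundation models have learned, a prerequisite for moving from confirmation to genuine discovery.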

Abstract: Scientific archives now contain hundreds of petabytes of data across genomics, ecology, climate, and molecular biology that could reveal undiscovered patterns if systematically analyzed at scale. Large-scale, weakly-supervised datasets in language and vision have driven the development of foundation models whose internal representations encode structure (patterns, co-occurrences and statistical regularities) beyond their training objectives. Most existing methods extract structure only for pre-specified targets; they excel at confirmation but do not support open-ended discovery of unknown patterns. We ask whether sparse autoencoders (SAEs) can enable open-ended feature discovery from foundation model representations. We evaluate this question in controlled rediscovery studies, where the learned SAE features are tested for alignment with semantic concepts on a standard segmentation benchmark and compared against strong label-free alternatives on concept-alignment metrics. Applied to ecological imagery, the same procedure surfaces fine-grained anatomical structure without access to segmentation or part labels, providing a scientific case study with ground-truth validation. While our experiments focus on vision with an ecology case study, the method is domain-agnostic and applicable to models in other sciences (e.g., proteins, genomics, weather). Our results indicate that sparse decomposition provides a practical instrument for exploring what scientific foundation models have learned, an important prerequisite for moving from confirmation to genuine discovery.

[19] AEGIS: Preserving privacy of 3D Facial Avatars with Adversarial Perturbations cs.CV | cs.AIPDF

Dawid Wolkiewicz, Anastasiya Pechko, Przemysław Spurek, Piotr Syga

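TL;DR: AEGIS is the first privacy-preserving framework for 3D Gaussian avatars. It conceals identity features with adversarial perturbations while preserving perceptual realism and functional integrity.

Details

Motivation: As photorealistic 3D facial avatars spread, especially those built on efficient 3D Gaussian Splatting representations, the risk of identity theft grows. Existing masking methods work for 2D images, but dynamic 3D avatars lack viewpoint-consistent protection.

Result: AEGIS achieves complete de-identification (face retrieval and verification accuracy drop to 0%) while maintaining high perceptual quality (SSIM = 0.9555, PSNR = 35.52 dB) and preserving key attributes such as age, race, gender, and emotion.

Insight: Adversarial perturbations are effective for 3D privacy protection and achieve multi-view consistency without modifying geometry, suggesting a new direction for 3D biometric security.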

Abstract: The growing adoption of photorealistic 3D facial avatars, particularly those utilizing efficient 3D Gaussian Splatting representations, introduces new risks of online identity theft, especially in systems that rely on biometric authentication. While effective adversarial masking methods have been developed for 2D images, a significant gap remains in achieving robust, viewpoint-consistent identity protection for dynamic 3D avatars. To address this, we present AEGIS, the first privacy-preserving identity masking framework for 3D Gaussian Avatars that maintains the subject’s perceived characteristics. Our method aims to conceal identity-related facial features while preserving the avatar’s perceptual realism and functional integrity. AEGIS applies adversarial perturbations to the Gaussian color coefficients, guided by a pre-trained face verification network, ensuring consistent protection across multiple viewpoints without retraining or modifying the avatar’s geometry. AEGIS achieves complete de-identification, reducing face retrieval and verification accuracy to 0%, while maintaining high perceptual quality (SSIM = 0.9555, PSNR = 35.52 dB). It also preserves key facial attributes such as age, race, gender, and emotion, demonstrating strong privacy protection with minimal visual distortion.
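
To put the reported PSNR of 35.52 dB in perspective, the sketch below computes PSNR from a mean squared error and inverts the formula. It assumes pixel values normalized to [0, 1] (so `max_value=1.0`), which the abstract does not state; the helper names are illustrative and not taken from the paper.

```python
import math

def psnr(reference, perturbed, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((r - p) ** 2 for r, p in zip(reference, perturbed)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_for_psnr(psnr_db, max_value=1.0):
    """Mean squared error implied by a given PSNR (inverse of the formula above)."""
    return max_value ** 2 / 10.0 ** (psnr_db / 10.0)

# Toy example: two pixels, each perturbed by 0.1.
reference = [0.0, 1.0]
perturbed = [0.1, 0.9]
print(psnr(reference, perturbed))

# Assumption: [0, 1] pixel range. MSE that a 35.52 dB PSNR would correspond to.
print(mse_for_psnr(35.52))
```

Under that normalization assumption, a PSNR in the mid-30s corresponds to a per-pixel mean squared error on the order of 1e-4, which is consistent with the abstract describing the distortion as minimal.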

[20] SPIDER: Spatial Image CorresponDence Estimator for Robust Calibration cs.CVPDF

Zhimin Shao, Abhay Yadav, Rama Chellappa, Cheng Peng

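TL;DR: SPIDER is a universal feature matching framework that combines 2D-based and 3D-based correspondence estimation to handle matching across domains with large changes in appearance, scale, and viewpoint. It significantly outperforms existing methods in large-baseline scenarios.

Details

Motivation: Image matching across domains is unreliable because of variation in appearance, scale, and viewpoint, particularly in scenes with large baselines and fine geometric detail.

Result: SPIDER significantly outperforms existing methods in large-baseline scenarios, showing strong ability as a universal image-matching method.

Insight: Matches from 3D foundation models concentrate on dominant planar regions and are less sensitive to fine-grained geometric detail. Combining 2D and 3D approaches handles cross-domain matching better.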

Abstract: Reliable image correspondences form the foundation of vision-based spatial perception, enabling recovery of 3D structure and camera poses. However, unconstrained feature matching across domains such as aerial, indoor, and outdoor scenes remains challenging due to large variations in appearance, scale and viewpoint. Feature matching has been conventionally formulated as a 2D-to-2D problem; however, recent 3D foundation models provide spatial feature matching properties based on two-view geometry. While powerful, we observe that these spatially coherent matches often concentrate on dominant planar regions, e.g., walls or ground surfaces, while being less sensitive to fine-grained geometric details, particularly under large viewpoint changes. To better understand these trade-offs, we first perform linear probe experiments to evaluate the performance of various vision foundation models for image matching. Building on these insights, we introduce SPIDER, a universal feature matching framework that integrates a shared feature extraction backbone with two specialized network heads for estimating both 2D-based and 3D-based correspondences from coarse to fine. Finally, we introduce an image-matching evaluation benchmark that focuses on unconstrained scenarios with large baselines. SPIDER significantly outperforms SoTA methods, demonstrating its strong ability as a universal image-matching method.

[21] CORA: Consistency-Guided Semi-Supervised Framework for Reasoning Segmentation cs.CVPDF

Prantik Howlader, Hoang Nguyen-Canh, Srijan Das, Jingyi Xu, Hieu Le

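TL;DR: CORA is a semi-supervised reasoning segmentation framework that learns jointly from limited labeled data and a large pool of unlabeled images. Using conditional visual instructions, consistency-based pseudo-label filtering, and token-level contrastive alignment, it delivers robust segmentation with minimal supervision and outperforms baselines under constrained annotation budgets.

Details

Motivation: Reasoning segmentation requires complex contextual understanding, and existing multimodal language models generalize poorly because high-quality annotations are scarce. CORA aims to reduce the dependence on labeled data through semi-supervised learning.

Result: CORA achieves state-of-the-art results with as few as 100 labeled images on Cityscapes and 180 on PanNuke, surpassing the baseline by +2.3% and +2.4% respectively.

Insight: Semi-supervised learning with consistency mechanisms can greatly reduce the need for large labeled datasets, particularly in complex reasoning tasks.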

Abstract: Reasoning segmentation seeks pixel-accurate masks for targets referenced by complex, often implicit instructions, requiring context-dependent reasoning over the scene. Recent multimodal language models have advanced instruction-following segmentation, yet generalization remains limited. The key bottleneck is the high cost of curating diverse, high-quality pixel annotations paired with rich linguistic supervision, leading to brittle performance under distribution shift. Therefore, we present CORA, a semi-supervised reasoning segmentation framework that jointly learns from limited labeled data and a large corpus of unlabeled images. CORA introduces three main components: 1) conditional visual instructions that encode spatial and contextual relationships between objects; 2) a noisy pseudo-label filter based on the consistency of a multimodal LLM’s outputs across semantically equivalent queries; and 3) a token-level contrastive alignment between labeled and pseudo-labeled samples to enhance feature consistency. These components enable CORA to perform robust reasoning segmentation with minimal supervision, outperforming existing baselines under constrained annotation settings. CORA achieves state-of-the-art results, requiring as few as 100 labeled images on Cityscapes, a benchmark dataset for urban scene understanding, surpassing the baseline by $+2.3\%$. Similarly, CORA improves performance by $+2.4\%$ with only 180 labeled images on PanNuke, a histopathology dataset.
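
The second component, filtering pseudo-labels by consistency across semantically equivalent queries, can be illustrated with a minimal sketch. It assumes that each paraphrased query yields a binary mask (represented here as a set of pixel coordinates) and that agreement is measured by pairwise IoU against a threshold. The abstract does not give the actual criterion, so `min_iou` and both function names are hypothetical.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    if not union:
        return 1.0  # two empty masks agree trivially
    return len(mask_a & mask_b) / len(union)

def is_consistent(masks, min_iou=0.8):
    """Keep a pseudo-label only if every pair of masks agrees well enough.

    `masks` holds one predicted mask per paraphrase of the same instruction.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    return all(
        iou(masks[i], masks[j]) >= min_iou
        for i in range(len(masks))
        for j in range(i + 1, len(masks))
    )

# Masks predicted for three paraphrases of one query (toy data).
stable = [{(0, 0), (0, 1)}, {(0, 0), (0, 1)}, {(0, 0), (0, 1)}]
unstable = [{(0, 0), (0, 1)}, {(0, 0), (0, 1)}, {(5, 5)}]
print(is_consistent(stable), is_consistent(unstable))
```

The intuition is that a prediction which changes when the instruction is merely rephrased is more likely to be a noisy label than one that stays stable.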

[22] Deepfake Geography: Detecting AI-Generated Satellite Images cs.CVPDF

Mansur Yerzhanuly

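TL;DR: The paper examines the threat that generative models such as StyleGAN2 and Stable Diffusion pose to the authenticity of satellite imagery, and compares Vision Transformers (ViTs) with convolutional neural networks (CNNs) for detecting AI-generated satellite images. ViTs significantly outperform CNNs in both accuracy and robustness.

Details

Motivation: With generative models advancing quickly, AI-generated satellite images may undermine decision-making in scientific and security domains. Deepfake research has focused mainly on faces, whereas satellite imagery poses distinct challenges, such as terrain-level inconsistencies and structural artifacts, that call for dedicated study.

Result: ViTs reach 95.11% accuracy in detecting AI-generated satellite images, significantly above the 87.02% of CNNs.

Insight: ViTs effectively capture the structural inconsistencies and repetitive textural patterns of synthetic imagery, indicating their promise for satellite image authenticity detection. Future work may extend to multispectral and SAR modalities.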

Abstract: The rapid advancement of generative models such as StyleGAN2 and Stable Diffusion poses a growing threat to the authenticity of satellite imagery, which is increasingly vital for reliable analysis and decision-making across scientific and security domains. While deepfake detection has been extensively studied in facial contexts, satellite imagery presents distinct challenges, including terrain-level inconsistencies and structural artifacts. In this study, we conduct a comprehensive comparison between Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for detecting AI-generated satellite images. Using a curated dataset of over 130,000 labeled RGB images from the DM-AER and FSI datasets, we show that ViTs significantly outperform CNNs in both accuracy (95.11 percent vs. 87.02 percent) and overall robustness, owing to their ability to model long-range dependencies and global semantic structures. We further enhance model transparency using architecture-specific interpretability methods, including Grad-CAM for CNNs and Chefer’s attention attribution for ViTs, revealing distinct detection behaviors and validating model trustworthiness. Our results highlight the ViT’s superior performance in detecting structural inconsistencies and repetitive textural patterns characteristic of synthetic imagery. Future work will extend this research to multispectral and SAR modalities and integrate frequency-domain analysis to further strengthen detection capabilities and safeguard satellite imagery integrity in high-stakes applications.

[23] Target-Bench: Can World Models Achieve Mapless Path Planning with Semantic Targets? cs.CV | cs.ROPDF

Dingrui Wang, Hongyuan Ye, Zhihao Liang, Zhexiao Sun, Zhaowei Lu

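TL;DR: Target-Bench is the first benchmark for evaluating world models on mapless path planning, covering 45 semantic categories and 450 video sequences. Existing models (e.g., Sora 2, Veo 3.1) show limited performance on robot path planning, and fine-tuning an open-source model yields a substantial improvement.

Details

Motivation: World models have recently excelled at video generation, but their performance on robot path planning has not been clearly evaluated or quantified. Target-Bench was designed to fill this gap.

Result: The best off-the-shelf model (Wan2.2-Flash) scores only 0.299 overall. Fine-tuning an open-source 5B-parameter model on 325 scenarios reaches 0.345, up from 0.066 for its base version.

Insight: Despite strong video generation, world models still fall well short in robot path planning. Fine-tuning effectively improves performance.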

Abstract: While recent world models generate highly realistic videos, their ability to perform robot path planning remains unclear and unquantified. We introduce Target-Bench, the first benchmark specifically designed to evaluate world models on mapless path planning toward semantic targets in real-world environments. Target-Bench provides 450 robot-collected video sequences spanning 45 semantic categories with SLAM-based ground truth trajectories. Our evaluation pipeline recovers camera motion from generated videos and measures planning performance using five complementary metrics that quantify target-reaching capability, trajectory accuracy, and directional consistency. We evaluate state-of-the-art models including Sora 2, Veo 3.1, and the Wan series. The best off-the-shelf model (Wan2.2-Flash) achieves only 0.299 overall score, revealing significant limitations in current world models for robotic planning tasks. We show that fine-tuning an open-source 5B-parameter model on only 325 scenarios from our dataset achieves 0.345 overall score – an improvement of more than 400% over its base version (0.066) and 15% higher than the best off-the-shelf model. We will open-source the code and dataset.
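
The two relative improvements quoted in the abstract can be checked directly from the three reported scores. The short sketch below only restates that arithmetic; the helper name is illustrative.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0

# Overall scores as reported in the abstract.
finetuned = 0.345
base = 0.066
best_off_the_shelf = 0.299

gain_over_base = relative_gain(finetuned, base)
gain_over_best = relative_gain(finetuned, best_off_the_shelf)
print(f"{gain_over_base:.0f}% over base, {gain_over_best:.0f}% over best off-the-shelf")
```

Both figures agree with the abstract: the gain over the base model exceeds 400%, and the gain over Wan2.2-Flash rounds to 15%.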

[24] Attention Guided Alignment in Efficient Vision-Language Models cs.CV | cs.LGPDF

Shweta Mahajan, Hoang Le, Hyojin Park, Farzad Farhadzadeh, Munawar Hayat

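TL;DR: The paper analyzes attention patterns in efficient vision-language models (VLMs) and proposes an attention-guided framework, AGE-VLM, which uses interleaved cross-attention layers and spatial knowledge distilled from the Segment Anything Model (SAM) to significantly reduce object hallucination.

Details

Motivation: Current efficient VLMs, particularly concatenation-based architectures, frequently fail to distinguish semantically matching from non-matching image-text pairs, which is a key factor in object hallucination.

Result: Across several vision-centric benchmarks, the method is better than or comparable to prior efficient VLMs.

Insight: Incorporating attention-guided spatial knowledge improves multimodal alignment in VLMs and reduces hallucination.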

Abstract: Large Vision-Language Models (VLMs) rely on effective multimodal alignment between pre-trained vision encoders and Large Language Models (LLMs) to integrate visual and textual information. This paper presents a comprehensive analysis of attention patterns in efficient VLMs, revealing that concatenation-based architectures frequently fail to distinguish between semantically matching and non-matching image-text pairs. This is a key factor for object hallucination in these models. To address this, we introduce Attention-Guided Efficient Vision-Language Models (AGE-VLM), a novel framework that enhances visual grounding through interleaved cross-attention layers to instill vision capabilities in pretrained small language models. This gives the VLM the ability to “look” at the correct image regions by leveraging spatial knowledge distilled from the Segment Anything Model (SAM), significantly reducing hallucination. We validate our approach across different vision-centric benchmarks where our method is better or comparable to prior work on efficient VLMs. Our findings provide valuable insights for future research aimed at achieving enhanced visual and linguistic understanding in VLMs.

[25] Pillar-0: A New Frontier for Radiology Foundation Models cs.CV | cs.AIPDF

Kumar Krishna Agrawal, Longchao Liu, Long Lian, Michael Nercessian, Natalia Harguindeguy

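TL;DR: Pillar-0 is a new radiology foundation model. Pretrained on large-scale CT and MRI data and paired with the RATE label-extraction framework, it substantially improves performance on radiologic finding tasks and outperforms existing models on internal and external datasets.

Details

Motivation: Radiology is central to modern medicine, but imaging volumes have grown far faster than the workforce. Existing medical foundation models are limited on volumetric CT and MRI (they process low-fidelity 2D slices and discard grayscale contrast information) and lack evaluation frameworks that reflect real clinical practice.

Result: Pillar-0 achieves mean AUROCs of 86.4, 88.0, 90.1, and 82.9 on abdomen-pelvis CT, chest CT, head CT, and breast MRI, 7.8-15.8 AUROC points above existing models. It also leads in external validation and improves on the state of the art in tasks such as lung cancer risk prediction.

Insight: By combining large-scale data with the RATE framework, Pillar-0 addresses computational, data, and evaluation constraints and provides an open, clinically rigorous foundation for high-performance radiology systems.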

Abstract: Radiology plays an integral role in modern medicine, yet rising imaging volumes have far outpaced workforce growth. Foundation models offer a path toward assisting with the full spectrum of radiology tasks, but existing medical models remain limited: they process volumetric CT and MRI as low-fidelity 2D slices, discard critical grayscale contrast information, and lack evaluation frameworks that reflect real clinical practice. We introduce Pillar-0, a radiology foundation model pretrained on 42,990 abdomen-pelvis CTs, 86,411 chest CTs, 14,348 head CTs, and 11,543 breast MRIs from a large academic center, together with RATE, a scalable framework that extracts structured labels for 366 radiologic findings with near-perfect accuracy using LLMs. Across internal test sets of 14,230 abdomen-pelvis CTs, 10,646 chest CTs, 4,906 head CTs, and 1,585 breast MRIs, Pillar-0 establishes a new performance frontier, achieving mean AUROCs of 86.4, 88.0, 90.1, and 82.9, outperforming MedGemma (Google), MedImageInsight (Microsoft), Lingshu (Alibaba), and Merlin (Stanford) by 7.8-15.8 AUROC points and ranking best in 87.2% (319/366) tasks. Pillar-0 similarly outperforms all baselines in an external validation on the Stanford Abdominal CT dataset, including Merlin (82.2 vs 80.6 AUROC). Pillar-0 extends to tasks beyond its pretraining, such as long-horizon lung cancer risk prediction, where it improves upon the state-of-the-art Sybil by 3.0 C-index points on NLST, and generalizes with gains of 5.9 (MGH) and 1.9 (CGMH). In brain hemorrhage detection, Pillar-0 obtained a >95 AUROC when using only 1/20th of the data of the next most sample efficient baseline. Pillar-0 and RATE together provide an open, clinically rigorous foundation for building high-performance radiology systems, enabling applications that were previously infeasible due to computational, data, and evaluation constraints.

[26] A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett-Luce Ranking cs.CV | cs.AIPDF

Chengan Che, Chao Wang, Xinyue Chen, Sophia Tsoka, Luis C. Garcia-Peraza-Herrera

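TL;DR: The paper proposes PL-Stitch, a self-supervised framework that uses the inherent temporal order of video frames as a supervisory signal. Two new objectives based on the Plackett-Luce model teach the ordering of procedural workflows, giving superior results on surgical and cooking tasks.

Details

Motivation: Existing self-supervised methods tend to overlook the temporal ordering of procedural activities. The paper confirms this with a motivating experiment and proposes a method to address the shortcoming.

Result: Significant gains in surgical phase recognition and cooking action segmentation (e.g., +11.4 percentage points in k-NN accuracy on Cholec80).

Insight: Temporal order is a defining feature of procedural activities, and modeling it explicitly markedly improves self-supervised learning.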

Abstract: Procedural activities, ranging from routine cooking to complex surgical operations, are highly structured as a set of actions conducted in a specific temporal order. Despite their success on static images and short clips, current self-supervised learning methods often overlook the procedural nature that underpins such activities. We expose the lack of procedural awareness in current SSL methods with a motivating experiment: models pretrained on forward and time-reversed sequences produce highly similar features, confirming that their representations are blind to the underlying procedural order. To address this shortcoming, we propose PL-Stitch, a self-supervised framework that harnesses the inherent temporal order of video frames as a powerful supervisory signal. Our approach integrates two novel probabilistic objectives based on the Plackett-Luce (PL) model. The primary PL objective trains the model to sort sampled frames chronologically, compelling it to learn the global workflow progression. The secondary objective, a spatio-temporal jigsaw loss, complements the learning by capturing fine-grained, cross-frame object correlations. Our approach consistently achieves superior performance across five surgical and cooking benchmarks. Specifically, PL-Stitch yields significant gains in surgical phase recognition (e.g., +11.4 pp k-NN accuracy on Cholec80) and cooking action segmentation (e.g., +5.7 pp linear probing accuracy on Breakfast), demonstrating its effectiveness for procedural video representation learning.
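
As a rough illustration of the primary objective, the sketch below computes the negative log-likelihood of a chronological ordering under the standard Plackett-Luce model, where each frame receives a scalar score and the earliest remaining frame is "chosen" at each step. How PL-Stitch produces the scores and samples frames is not specified in the abstract, so the function below is a generic textbook form under that assumption, not the implementation from the paper.

```python
import math

def plackett_luce_nll(scores_in_true_order):
    """Negative log-likelihood of an ordering under the Plackett-Luce model.

    scores_in_true_order[i] is the model's scalar score for the i-th frame
    in chronological order. At step i, the probability of picking frame i
    is softmax(score_i) over the frames not yet picked.
    """
    nll = 0.0
    for i in range(len(scores_in_true_order)):
        remaining = scores_in_true_order[i:]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(remaining)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in remaining))
        nll -= scores_in_true_order[i] - log_denominator
    return nll

# Scores that decrease with time match the chronological order well...
well_ordered = plackett_luce_nll([3.0, 2.0, 1.0])
# ...while scores that increase with time contradict it.
badly_ordered = plackett_luce_nll([1.0, 2.0, 3.0])
print(well_ordered, badly_ordered)
```

Minimizing this quantity pushes the model to assign scores that sort sampled frames chronologically, which is the behavior the abstract attributes to the primary PL objective.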

[27] QAL: A Loss for Recall Precision Balance in 3D Reconstruction cs.CV | cs.ROPDF

Pranay Meshram, Yash Turkar, Kartikeya Singh, Praveen Raj Masilamani, Charuvahan Adhivarahan

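TL;DR: The paper proposes Quality-Aware Loss (QAL), a loss function that addresses the imbalance between recall and precision in 3D reconstruction. Compared with the conventional Chamfer Distance and Earth Mover’s Distance, QAL improves both performance and stability.

Details

Motivation: Current training objectives for 3D reconstruction, such as Chamfer Distance (CD) and Earth Mover’s Distance (EMD), fail to balance recall and precision, so models perform poorly on thin structures and under-represented regions.

Result: QAL improves coverage by an average of +4.3 points over CD and +2.8 points over the best alternatives, recovering thin structures and under-represented regions. QAL-trained models also obtain higher grasp scores under GraspNet evaluation.

Insight: Explicitly optimizing recall and precision matters in 3D vision, especially in safety-critical robotics, where improved coverage translates directly into more reliable manipulation.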

Abstract: Volumetric learning underpins many 3D vision tasks such as completion, reconstruction, and mesh generation, yet training objectives still rely on Chamfer Distance (CD) or Earth Mover’s Distance (EMD), which fail to balance recall and precision. We propose Quality-Aware Loss (QAL), a drop-in replacement for CD/EMD that combines a coverage-weighted nearest-neighbor term with an uncovered-ground-truth attraction term, explicitly decoupling recall and precision into tunable components. Across diverse pipelines, QAL achieves consistent coverage gains, improving by an average of +4.3 pts over CD and +2.8 pts over the best alternatives. Though modest in percentage, these improvements reliably recover thin structures and under-represented regions that CD/EMD overlook. Extensive ablations confirm stable performance across hyperparameters and across output resolutions, while full retraining on PCN and ShapeNet demonstrates generalization across datasets and backbones. Moreover, QAL-trained completions yield higher grasp scores under GraspNet evaluation, showing that improved coverage translates directly into more reliable robotic manipulation. QAL thus offers a principled, interpretable, and practical objective for robust 3D vision and safety-critical robotics pipelines.
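
To see why recall and precision can be decoupled, it helps to split the standard Chamfer Distance into its two directed terms. The sketch below is not QAL (the abstract does not give its formula); it only shows a Chamfer-style baseline with separately weighted terms, where `w_precision` and `w_recall` are illustrative parameters.

```python
def _squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def directed_term(source, target):
    """Mean squared distance from each source point to its nearest target point."""
    return sum(
        min(_squared_distance(s, t) for t in target) for s in source
    ) / len(source)

def weighted_chamfer(pred, gt, w_precision=1.0, w_recall=1.0):
    """Chamfer-style loss with separately weighted directions.

    pred -> gt penalizes predicted points far from the ground truth (precision).
    gt -> pred penalizes ground-truth points left uncovered (recall).
    """
    precision_term = directed_term(pred, gt)
    recall_term = directed_term(gt, pred)
    return w_precision * precision_term + w_recall * recall_term

# A prediction that is accurate but covers only half of the ground truth.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 0.0)]
print(directed_term(pred, gt), directed_term(gt, pred))
print(weighted_chamfer(pred, gt, w_precision=1.0, w_recall=2.0))
```

In this toy case the precision term is zero while the recall term is not, so raising `w_recall` penalizes the missing coverage more strongly. That is the kind of trade-off the abstract says QAL exposes as tunable components.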

[28] Toward explainable AI approaches for breast imaging: adapting foundation models to diverse populations cs.CV | cs.AIPDF

Guilherme J. Cavalcante, José Gabriel A. Moreira, Gabriel A. B. do Nascimento, Vincent Dong, Alex Nguyen

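TL;DR: The study examines foundation models for breast imaging, specifically BI-RADS breast density classification, adapting BiomedCLIP with multi-modality mammographic data and assessing its generalization and interpretability.

Details

Motivation: The potential of foundation models in medical imaging remains underexplored, particularly in breast imaging, where making use of multi-modality data and handling class imbalance are the main concerns.

Result: Multi-modality and single-modality training reach similar accuracy (0.74 and 0.73), while the multi-modality model applies across more imaging modalities and keeps AUC consistently above 0.84 across BI-RADS categories. External validation AUC ranges from 0.80 to 0.93.

Insight: Foundation models can be adapted to breast imaging tasks. Multi-modality training broadens applicability, and GradCAM visualizations support interpretability, laying groundwork for future diagnostic tasks.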

Abstract: Foundation models hold promise for specialized medical imaging tasks, though their effectiveness in breast imaging remains underexplored. This study leverages BiomedCLIP as a foundation model to address challenges in model generalization. BiomedCLIP was adapted for automated BI-RADS breast density classification using multi-modality mammographic data (synthesized 2D images, digital mammography, and digital breast tomosynthesis). Using 96,995 images, we compared single-modality (s2D only) and multi-modality training approaches, addressing class imbalance through weighted contrastive learning. Both approaches achieved similar accuracy (multi-modality: 0.74, single-modality: 0.73), with the multi-modality model offering broader applicability across different imaging modalities and higher AUC values consistently above 0.84 across BI-RADS categories. External validation on the RSNA and EMBED datasets showed strong generalization capabilities (AUC range: 0.80-0.93). GradCAM visualizations confirmed consistent and clinically relevant attention patterns, highlighting the model’s interpretability and robustness. This research underscores the potential of foundation models for breast imaging applications, paving the way for future extensions for diagnostic tasks.

[29] Show Me: Unifying Instructional Image and Video Generation with Diffusion Models cs.CVPDF

Yujiang Pu, Zhanbo Huang, Vishnu Boddeti, Yu Kong

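TL;DR: The paper proposes ShowMe, a framework that unifies instructional image and video generation by selectively activating the spatial and temporal components of video diffusion models, and introduces structure and motion consistency rewards to improve structural fidelity and temporal coherence.

Details

Motivation: Existing methods treat text-guided image manipulation and video prediction in isolation: image manipulation overlooks how actions unfold over time, while video prediction often ignores the intended outcome. The paper aims to resolve this.

Result: Experiments on diverse benchmarks show that ShowMe outperforms expert models in both instructional image and video generation, supporting video diffusion models as a unified action-object state transformer.

Insight: Spatial knowledge from video pretraining improves consistency and realism in image edits, while instruction-guided manipulation strengthens goal-oriented reasoning for video prediction. The two are complementary.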

Abstract: Generating visual instructions in a given context is essential for developing interactive world simulators. While prior works address this problem through either text-guided image manipulation or video prediction, these tasks are typically treated in isolation. This separation reveals a fundamental issue: image manipulation methods overlook how actions unfold over time, while video prediction models often ignore the intended outcomes. To this end, we propose ShowMe, a unified framework that enables both tasks by selectively activating the spatial and temporal components of video diffusion models. In addition, we introduce structure and motion consistency rewards to improve structural fidelity and temporal coherence. Notably, this unification brings dual benefits: the spatial knowledge gained through video pretraining enhances contextual consistency and realism in non-rigid image edits, while the instruction-guided manipulation stage equips the model with stronger goal-oriented reasoning for video prediction. Experiments on diverse benchmarks demonstrate that our method outperforms expert models in both instructional image and video generation, highlighting the strength of video diffusion models as a unified action-object state transformer.

[30] JigsawComm: Joint Semantic Feature Encoding and Transmission for Communication-Efficient Cooperative Perception cs.CVPDF

Chenyi Wang, Zhaowei Li, Ming F. Li, Wujie Wen

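TL;DR: JigsawComm is a joint semantic feature encoding and transmission method that addresses the communication bandwidth limits of multi-agent cooperative perception. Semantic-aware feature selection and transmission deliver efficient communication with high perception accuracy.

Details

Motivation: Multi-agent cooperative perception (CP) can overcome the occlusion and sensing-range limits of single-agent systems, but limited communication bandwidth constrains it in practice. Existing methods do not consider semantic relevance or cross-agent redundancy, so they fail to maximize the contribution of each transmitted bit to the perception task.

Result: On the OPV2V and DAIR-V2X benchmarks, JigsawComm reduces total data volume by more than 500× while matching or exceeding the accuracy of state-of-the-art methods.

Insight: Semantic-aware feature selection and transmission removes redundant data while preserving perception accuracy, making multi-agent cooperative perception more practical.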

Abstract: Multi-agent cooperative perception (CP) promises to overcome the inherent occlusion and sensing-range limitations of single-agent systems (e.g., autonomous driving). However, its practicality is severely constrained by the limited communication bandwidth. Existing approaches attempt to improve bandwidth efficiency via compression or heuristic message selection, without considering the semantic relevance or cross-agent redundancy of sensory data. We argue that a practical CP system must maximize the contribution of every transmitted bit to the final perception task, by extracting and transmitting semantically essential and non-redundant data. In this paper, we formulate a joint semantic feature encoding and transmission problem, which aims to maximize CP accuracy under limited bandwidth. To solve this problem, we introduce JigsawComm, an end-to-end trained, semantic-aware, and communication-efficient CP framework that learns to “assemble the puzzle” of multi-agent feature transmission. It uses a regularized encoder to extract semantically-relevant and sparse features, and a lightweight Feature Utility Estimator to predict the contribution of each agent’s features to the final perception task. The resulting meta utility maps are exchanged among agents and leveraged to compute a provably optimal transmission policy, which selects features from agents with the highest utility score for each location. This policy inherently eliminates redundancy and achieves a scalable $\mathcal{O}(1)$ communication cost as the number of agents increases. On the benchmarks OPV2V and DAIR-V2X, JigsawComm reduces the total data volume by more than $500\times$ while achieving matching or superior accuracy compared to state-of-the-art methods.
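
The transmission policy described above, picking for each location the agent with the highest utility score, can be sketched in a few lines. The flat list-of-lists layout for the utility maps and the function name are assumptions made for illustration; the abstract does not describe the actual data structures.

```python
def select_transmitters(utility_maps):
    """Pick, for every location, the agent whose feature has the highest utility.

    utility_maps[a][loc] is the estimated utility of agent a's feature at
    location loc (a flattened spatial grid is assumed here).
    Returns one agent index per location.
    """
    num_agents = len(utility_maps)
    num_locations = len(utility_maps[0])
    return [
        max(range(num_agents), key=lambda a: utility_maps[a][loc])
        for loc in range(num_locations)
    ]

# Two agents, three locations (toy utilities).
utility_maps = [
    [0.9, 0.1, 0.4],  # agent 0
    [0.2, 0.8, 0.5],  # agent 1
]
assignment = select_transmitters(utility_maps)
print(assignment)
```

Because each location is served by exactly one agent, the number of transmitted features equals the number of locations no matter how many agents participate, which matches the constant communication cost claimed in the abstract.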

[31] Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation cs.CV | cs.AIPDF

Shihan Cheng, Nilesh Kulkarni, David Hyde, Dmitriy Smirnov

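TL;DR: The paper proposes a data-efficient approach that adds generative controls to text-to-video models by fine-tuning on sparse, low-quality synthetic data, with results superior to fine-tuning on real data.

Details

Motivation: Adding new generative controls (such as camera parameters) to text-to-video diffusion models typically requires vast, high-fidelity datasets that are difficult to acquire.

Result: The method learns effective generative controls from a small amount of low-quality data and outperforms models fine-tuned on photorealistic real data.

Insight: Low-quality synthetic data can sometimes be better suited to fine-tuning than high-quality real data. The paper provides a framework that justifies this both intuitively and quantitatively.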

Abstract: Fine-tuning large-scale text-to-video diffusion models to add new generative controls, such as those over physical camera parameters (e.g., shutter speed or aperture), typically requires vast, high-fidelity datasets that are difficult to acquire. In this work, we propose a data-efficient fine-tuning strategy that learns these controls from sparse, low-quality synthetic data. We show that not only does fine-tuning on such simple data enable the desired controls, it actually yields superior results to models fine-tuned on photorealistic “real” data. Beyond demonstrating these results, we provide a framework that justifies this phenomenon both intuitively and quantitatively.

[32] MGA-VQA: Secure and Interpretable Graph-Augmented Visual Question Answering with Memory-Guided Protection Against Unauthorized Knowledge Use cs.CV | cs.AIPDF

Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Dheeraj Kulshrestha, Rajiv Ramnath

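TL;DR: MGA-VQA is a multi-modal framework that integrates token-level encoding, spatial graph reasoning, memory-augmented inference, and question-guided compression to address spatial relationship modeling, multi-hop reasoning, and interpretability in DocVQA.

Details

Motivation: Existing DocVQA methods fall short in explicit spatial relationship modeling, efficiency on high-resolution documents, multi-hop reasoning, and interpretability.

Result: Across six benchmarks (FUNSD, CORD, SROIE, DocVQA, STE-VQA, and RICO), MGA-VQA shows consistent improvements in both answer prediction and spatial localization.

Insight: Combining graph structure with memory mechanisms improves the transparency and reasoning ability of VQA systems.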

Abstract: Document Visual Question Answering (DocVQA) requires models to jointly understand textual semantics, spatial layout, and visual features. Current methods struggle with explicit spatial relationship modeling, inefficiency with high-resolution documents, multi-hop reasoning, and limited interpretability. We propose MGA-VQA, a multi-modal framework that integrates token-level encoding, spatial graph reasoning, memory-augmented inference, and question-guided compression. Unlike prior black-box models, MGA-VQA introduces interpretable graph-based decision pathways and structured memory access for enhanced reasoning transparency. Evaluation across six benchmarks (FUNSD, CORD, SROIE, DocVQA, STE-VQA, and RICO) demonstrates superior accuracy and efficiency, with consistent improvements in both answer prediction and spatial localization.

[33] ArticFlow: Generative Simulation of Articulated Mechanisms cs.CV | cs.ROPDF

Jiong Lin, Jinchen Ruan, Hod Lipson

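TL;DR: ArticFlow is a generative simulation framework that learns a controllable velocity field from noise to target point sets under explicit action control, addressing the action-dependent deformation that static 3D shape generation does not handle.

Details

Motivation: Static 3D shape generation has advanced considerably, but articulated 3D generation remains difficult because of action-dependent deformations and limited datasets. ArticFlow aims to generate high-quality articulated mechanisms under action control.

Result: On MuJoCo Menagerie, ArticFlow achieves higher kinematic accuracy and better shape quality than object-specific simulators and an action-conditioned variant of static point-cloud generators.

Insight: Action-conditioned flow matching is a practical route to controllable, high-quality articulated mechanism generation.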

Abstract: Recent advances in generative models have produced strong results for static 3D shapes, whereas articulated 3D generation remains challenging due to action-dependent deformations and limited datasets. We introduce ArticFlow, a two-stage flow matching framework that learns a controllable velocity field from noise to target point sets under explicit action control. ArticFlow couples (i) a latent flow that transports noise to a shape-prior code and (ii) a point flow that transports points conditioned on the action and the shape prior, enabling a single model to represent diverse articulated categories and generalize across actions. On MuJoCo Menagerie, ArticFlow functions both as a generative model and as a neural simulator: it predicts action-conditioned kinematics from a compact prior and synthesizes novel morphologies via latent interpolation. Compared with object-specific simulators and an action-conditioned variant of static point-cloud generators, ArticFlow achieves higher kinematic accuracy and better shape quality. Results show that action-conditioned flow matching is a practical route to controllable and high-quality articulated mechanism generation.
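
For readers unfamiliar with flow matching, the sketch below shows a common conditional form of the training loss: sample a point on the straight line between a noise sample and a target, and regress the predicted velocity onto the constant displacement between them. Whether ArticFlow uses this exact linear path is not stated in the abstract, so the interpolation scheme, the `velocity_fn` signature, and the way `action` is passed are all assumptions.

```python
def flow_matching_loss(velocity_fn, noise, target, t, action):
    """Squared-error flow matching loss for one sample on a linear path.

    velocity_fn(x_t, t, action) is a hypothetical model returning a velocity
    with the same length as x_t. `noise` and `target` are flat coordinate lists.
    """
    # Point on the straight line from noise (t=0) to target (t=1).
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, target)]
    # Along a straight path the target velocity is the constant displacement.
    target_velocity = [x - n for n, x in zip(noise, target)]
    predicted = velocity_fn(x_t, t, action)
    return sum((p - v) ** 2 for p, v in zip(predicted, target_velocity)) / len(noise)

noise = [0.0, 0.0]
target = [1.0, 2.0]

# A model that outputs the exact displacement incurs zero loss...
oracle = lambda x_t, t, action: [1.0, 2.0]
# ...while a model that predicts no motion does not.
still = lambda x_t, t, action: [0.0, 0.0]

print(flow_matching_loss(oracle, noise, target, t=0.3, action=0.5))
print(flow_matching_loss(still, noise, target, t=0.3, action=0.5))
```

In ArticFlow, as described, two such flows are coupled: one transports noise to a shape-prior code, and the other transports points conditioned on the action and that prior.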

[34] FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning cs.CV | cs.LGPDF

Guoyang Xia, Yifeng Ding, Fengfa Li, Lei Ren, Wei Chen

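TL;DR: FastMMoE is a training-free acceleration framework that uses expert activation reduction and routing-aware token pruning to substantially cut the computation of MoE-based multimodal large language models while retaining most of their performance.

Details

Motivation: High-resolution visual inputs give multimodal large language models (MLLMs) long sequences of largely redundant visual tokens, raising compute and memory costs and hindering deployment in resource-constrained or latency-sensitive scenarios.

Result: FastMMoE reduces FLOPs by up to 55.0% while retaining approximately 95.5% of the original performance, outperforming dense-model pruning baselines such as FastV and SparseVLM.

Insight: Token pruning designed from a routing analysis perspective suits MoE-MLLMs better and markedly lowers compute while preserving performance.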

Abstract: Multimodal large language models (MLLMs) have achieved impressive performance, but high-resolution visual inputs result in long sequences of visual tokens and substantial inference latency. Reducing redundant visual tokens is critical to ease computational/memory burdens while preserving performance, enabling MLLM deployment in resource-constrained or latency-sensitive scenarios. Current visual token pruning methods mainly rely on attention-based redundancy analysis and are tailored to dense architectures. We propose Fast Multimodal Mixture-of-Experts (FastMMoE), a training-free acceleration framework for mixture-of-experts (MoE) based MLLMs, developed from a routing analysis perspective. FastMMoE combines two complementary strategies: (i) expert activation reduction for visual tokens to minimize unnecessary expert computation; and (ii) routing-aware token pruning that leverages similarity in routing probability distributions to identify and remove highly redundant visual tokens. Experiments on large-scale MoE-MLLMs such as DeepSeek-VL2 and InternVL3.5 demonstrate that FastMMoE can reduce FLOPs by up to 55.0% while retaining approximately 95.5% of the original performance, consistently outperforming dense-model pruning baselines including FastV and SparseVLM across multiple retention rates.
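
Strategy (ii) can be illustrated with a simplified sketch: treat two visual tokens as redundant when their routing probability distributions over experts are nearly identical, and keep only one of them. The greedy procedure, the cosine similarity measure, and the 0.95 threshold below are all assumptions chosen for illustration; the abstract says only that similarity in routing distributions is used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def prune_by_routing_similarity(routing_probs, threshold=0.95):
    """Greedily keep tokens whose routing distribution differs from all kept ones.

    routing_probs[i] is token i's probability distribution over experts.
    Returns the indices of the tokens that are kept.
    """
    kept = []
    for index, probs in enumerate(routing_probs):
        if all(cosine(probs, routing_probs[k]) < threshold for k in kept):
            kept.append(index)
    return kept

# Three visual tokens routed over three experts (toy data).
routing_probs = [
    [0.70, 0.20, 0.10],
    [0.69, 0.21, 0.10],  # nearly the same routing as token 0
    [0.10, 0.10, 0.80],
]
kept_tokens = prune_by_routing_similarity(routing_probs)
print(kept_tokens)
```

Tokens that the router sends to the same experts with the same weights are processed almost identically, so dropping near-duplicates is a plausible way to save expert computation without retraining.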

[35] When Better Teachers Don’t Make Better Students: Revisiting Knowledge Distillation for CLIP Models in VQA cs.CV | cs.CLPDF

Pume Tuchinda, Parinthapat Pengpun, Romrawin Chumpu, Sarana Nutanong, Peerat Limkonchotiwat

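TL;DR: The paper systematically studies knowledge distillation (KD) for CLIP-style models and finds that stronger teachers do not consistently yield better students. Existing distillation frameworks often fail to scale, notably on multimodal tasks such as visual question answering (VQA).

Details

Motivation: Vision-language models (VLMs) perform well on multimodal tasks, but their computational demands hinder efficient deployment. KD is an effective way to build lightweight models, yet its application to VLMs, especially CLIP-style models, remains limited and is usually restricted to small-scale teachers and narrow evaluation tasks such as classification or retrieval.

Result: Existing distillation frameworks often fail to scale, leading to degraded performance on downstream multimodal tasks such as VQA.

Insight: The findings challenge prevailing assumptions in KD and point to new directions for designing parameter-efficient multimodal models.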

Abstract: Vision-language models (VLMs) have achieved remarkable success across multimodal tasks, yet their substantial computational demands hinder efficient deployment. Knowledge distillation (KD) has emerged as a powerful approach for building lightweight but competitive models, with strong evidence from both language and vision domains. However, its application to VLMs, particularly CLIP-style models, remains limited, often constrained to small-scale teachers and narrow evaluation tasks such as classification or retrieval. In this work, we present the first systematic study of distillation across a range of CLIP-style teacher models, ranging from standard baselines to large-scale state-of-the-art models. Contrary to trends observed in NLP and vision, we find that stronger teachers do not consistently yield better students; in fact, existing distillation frameworks often fail to scale, leading to degraded performance in downstream multimodal tasks such as visual question answering. Our findings challenge prevailing assumptions in KD and point toward new directions for designing parameter-efficient multimodal models.

[36] CUS-GS: A Compact Unified Structured Gaussian Splatting Framework for Multimodal Scene Representation cs.CV | cs.RO

Yuhang Ming, Chenxin Fang, Xingyuan Yu, Fan Zhang, Weichen Dai

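TL;DR: CUS-GS proposes a compact, unified, structured Gaussian Splatting framework for multimodal scene representation. By combining semantic features with structured 3D geometry, it balances high performance against a small model size.

Details

Motivation: Existing Gaussian Splatting methods fall into semantics-oriented and structure-oriented camps. The former lack explicit 3D geometry modeling, while the latter offer limited semantic abstraction. CUS-GS aims to bridge this gap by combining the strengths of both.

Result: CUS-GS reaches performance competitive with state-of-the-art methods using as few as 6M parameters, far fewer than the closest rival at 35M.

Insight: By unifying multimodal features with structured geometry, CUS-GS demonstrates the potential of jointly modeling semantics and geometry in 3D scene representation.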

Abstract: Recent advances in Gaussian Splatting based 3D scene representation have shown two major trends: semantics-oriented approaches that focus on high-level understanding but lack explicit 3D geometry modeling, and structure-oriented approaches that capture spatial structures yet provide limited semantic abstraction. To bridge this gap, we present CUS-GS, a compact unified structured Gaussian Splatting representation, which connects multimodal semantic features with structured 3D geometry. Specifically, we design a voxelized anchor structure that constructs a spatial scaffold, while extracting multimodal semantic features from a set of foundation models (e.g., CLIP, DINOv2, SEEM). Moreover, we introduce a multimodal latent feature allocation mechanism to unify appearance, geometry, and semantics across heterogeneous feature spaces, ensuring a consistent representation across multiple foundation models. Finally, we propose a feature-aware significance evaluation strategy to dynamically guide anchor growing and pruning, effectively removing redundant or invalid anchors while maintaining semantic integrity. Extensive experiments show that CUS-GS achieves competitive performance compared to state-of-the-art methods using as few as 6M parameters, an order of magnitude smaller than the closest rival at 35M, highlighting the excellent trade-off between performance and model efficiency of the proposed framework.

[37] PA-FAS: Towards Interpretable and Generalizable Multimodal Face Anti-Spoofing via Path-Augmented Reinforcement Learning cs.CV | cs.AI

Yingjie Ma, Xun Lin, Yong Xu, Weicheng Xie, Zitong Yu

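TL;DR: PA-FAS proposes a multimodal face anti-spoofing (FAS) method based on path-augmented reinforcement learning. By extending reasoning paths and shuffling answers, it addresses the limits of existing methods in cross-modal verification and reasoning-path use, and significantly improves multimodal reasoning accuracy and cross-domain generalization.

Details

Motivation: Existing multimodal FAS methods based on supervised fine-tuning plus reinforcement learning (SFT+RL) suffer from limited reasoning paths and a mismatch between single-task supervision and diverse reasoning paths. As a result, models may ignore complementary modalities and rely on shallow cues for prediction.

Result: PA-FAS significantly improves reasoning accuracy and cross-domain generalization on multimodal FAS, and better unifies multimodal fusion, generalization, and interpretability.

Insight: 1. Extending reasoning paths effectively relaxes the restricted exploration space in reinforcement learning; 2. Answer shuffling helps the model learn deeper multimodal features and avoid shortcut learning.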

Abstract: Face anti-spoofing (FAS) has recently advanced in multimodal fusion, cross-domain generalization, and interpretability. With large language models and reinforcement learning (RL), strategy-based training offers new opportunities to jointly model these aspects. However, multimodal reasoning is more complex than unimodal reasoning, requiring accurate feature representation and cross-modal verification while facing scarce, high-quality annotations, which makes direct application of RL sub-optimal. We identify two key limitations of supervised fine-tuning plus RL (SFT+RL) for multimodal FAS: (1) limited multimodal reasoning paths restrict the use of complementary modalities and shrink the exploration space after SFT, weakening the effect of RL; and (2) mismatched single-task supervision versus diverse reasoning paths causes reasoning confusion, where models may exploit shortcuts by mapping images directly to answers and ignoring the intended reasoning. To address this, we propose PA-FAS, which enhances reasoning paths by constructing high-quality extended reasoning sequences from limited annotations, enriching paths and relaxing exploration constraints. We further introduce an answer-shuffling mechanism during SFT to force comprehensive multimodal analysis instead of using superficial cues, thereby encouraging deeper reasoning and mitigating shortcut learning. PA-FAS significantly improves multimodal reasoning accuracy and cross-domain generalization, and better unifies multimodal fusion, generalization, and interpretability for trustworthy FAS.

[38] MambaTAD: When State-Space Models Meet Long-Range Temporal Action Detection cs.CV | cs.AI

Hui Lu, Yi Yu, Shijian Lu, Deepu Rajan, Boon Poh Ng

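TL;DR: MambaTAD proposes a new state-space model for temporal action detection (TAD) that combines long-range modeling with global feature detection. Its DMBSS module and global feature fusion head address the weaknesses of conventional methods on long-span actions.

Details

Motivation: Conventional TAD methods struggle with long-span action instances because they lack global awareness and use inefficient detection heads. Structured state-space models such as Mamba offer long-range modeling, but still face temporal context decay and self-element conflict during global visual context modeling.

Result: Experiments show that MambaTAD consistently achieves superior TAD performance across multiple public benchmarks.

Insight: By combining the long-range modeling strength of state-space models with a global feature fusion strategy, MambaTAD offers an efficient and lightweight solution for long-span action detection.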

Abstract: Temporal Action Detection (TAD) aims to identify and localize actions by determining their starting and ending frames within untrimmed videos. Recent Structured State-Space Models such as Mamba have demonstrated potential in TAD due to their long-range modeling capability and linear computational complexity. On the other hand, structured state-space models often face two key challenges in TAD, namely, decay of temporal context due to recursive processing and self-element conflict during global visual context modeling, which become more severe when handling long-span action instances. Additionally, traditional methods for TAD struggle with detecting long-span action instances due to a lack of global awareness and inefficient detection heads. This paper presents MambaTAD, a new state-space TAD model that introduces long-range modeling and global feature detection capabilities for accurate temporal action detection. MambaTAD comprises two novel designs that complement each other with superior TAD performance. First, it introduces a Diagonal-Masked Bidirectional State-Space (DMBSS) module which effectively facilitates global feature fusion and temporal action detection. Second, it introduces a global feature fusion head that refines the detection progressively with multi-granularity features and global awareness. In addition, MambaTAD tackles TAD in an end-to-end one-stage manner using a new state-space temporal adapter (SSTA) which reduces network parameters and computation cost with linear complexity. Extensive experiments show that MambaTAD achieves superior TAD performance consistently across multiple public benchmarks.

[39] UniRSCD: A Unified Novel Architectural Paradigm for Remote Sensing Change Detection cs.CV

Yuan Qu, Zhipeng Zhang, Chaojun Xu, Qiao Wan, Mengying Xie

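TL;DR: The paper proposes UniRSCD, a unified framework for remote sensing change detection. With a frequency change prompt generator and a shared representation space, it removes the need for specialized decoders, adapts to multiple tasks, and achieves leading performance on several datasets.

Details

Motivation: Existing remote sensing change detection methods require a specialized decoder for each task, which complicates model selection and limits generality. The paper aims to solve these problems.

Result: Leading performance on multiple datasets, including LEVIR-CD, SECOND, and xBD.

Insight: A unified encoder and a shared representation space can effectively address information compensation across change detection tasks, improving both generality and performance.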

Abstract: In recent years, remote sensing change detection has garnered significant attention due to its critical role in resource monitoring and disaster assessment. Change detection tasks exist with different output granularities, such as binary change detection (BCD), semantic change detection (SCD), and building damage assessment (BDA). However, existing methods require substantial expert knowledge to design specialized decoders that compensate for information loss during encoding across different tasks. This not only introduces uncertainty into the process of selecting optimal models for abrupt change scenarios (such as disaster outbreaks) but also limits the universality of these architectures. To address these challenges, this paper proposes a unified, general change detection framework named UniRSCD. Building upon a state space model backbone, we introduce a frequency change prompt generator as a unified encoder. The encoder dynamically scans bitemporal global context information while integrating high-frequency details with low-frequency holistic information, thereby eliminating the need for specialized decoders for feature compensation. Subsequently, the unified decoder and prediction head establish a shared representation space through hierarchical feature interaction and task-adaptive output mapping. This integrates tasks such as binary change detection and semantic change detection into a unified architecture, thereby accommodating the differing output granularity requirements of distinct change detection tasks. Experimental results demonstrate that the proposed architecture can adapt to multiple change detection tasks and achieves leading performance on five datasets, including the binary change dataset LEVIR-CD, the semantic change dataset SECOND, and the building damage assessment dataset xBD.

[40] Novel View Synthesis from A Few Glimpses via Test-Time Natural Video Completion cs.CV | cs.GR

Yan Xu, Yixing Wang, Stella X. Yu

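TL;DR: The paper proposes a zero-shot framework built on pretrained video diffusion models. It generates pseudo views modulated by an uncertainty-aware mechanism and combines them with 3D Gaussian Splatting (3D-GS) for novel view synthesis from sparse inputs.

Details

Motivation: Reconstructing high-quality novel views from a few glimpses of a scene, while filling both spatial and temporal gaps, is a challenging computer vision problem. Conventional methods need many input views or scene-specific training, which limits their practicality.

Result: On LLFF, DTU, DL3DV, and MipNeRF-360, the method significantly outperforms existing 3D-GS baselines under extremely sparse inputs.

Insight: 1. Pretrained video diffusion models can serve as strong priors for completing video synthesis from sparse inputs;
2. Jointly optimizing 2D view generation and 3D geometry reconstruction lets each improve the other.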

Abstract: Given just a few glimpses of a scene, can you imagine the movie playing out as the camera glides through it? That’s the lens we take on sparse-input novel view synthesis, not only as filling spatial gaps between widely spaced views, but also as completing a natural video unfolding through space. We recast the task as test-time natural video completion, using powerful priors from pretrained video diffusion models to hallucinate plausible in-between views. Our zero-shot, generation-guided framework produces pseudo views at novel camera poses, modulated by an uncertainty-aware mechanism for spatial coherence. These synthesized frames densify supervision for 3D Gaussian Splatting (3D-GS) for scene reconstruction, especially in under-observed regions. An iterative feedback loop lets 3D geometry and 2D view synthesis inform each other, improving both the scene reconstruction and the generated views. The result is coherent, high-fidelity renderings from sparse inputs without any scene-specific training or fine-tuning. On LLFF, DTU, DL3DV, and MipNeRF-360, our method significantly outperforms strong 3D-GS baselines under extreme sparsity.

[41] SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System cs.CV

Zhiyu Xu, Weilong Yan, Yufei Shi, Xin Meng, Tao He

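TL;DR: SciEducator proposes a multi-agent system based on the Deming Cycle for scientific video understanding and education. An iterative self-evolving mechanism improves its interpretation of complex scientific activities, and it shows advantages in generating multimodal educational content.

Details

Motivation: Existing methods struggle to meet the demands of scientific video understanding and education for professional knowledge and step-wise reasoning, so a more capable system is needed.

Result: On SciVBench, SciEducator substantially outperforms leading MLLMs such as Gemini and GPT-4o, as well as video agent systems.

Insight: The Deming Cycle management philosophy can be applied effectively to scientific video understanding, and multi-agent systems show potential for handling complex scenarios.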

Abstract: Recent advancements in multimodal large language models (MLLMs) and video agent systems have significantly improved general video understanding. However, when applied to scientific video understanding and educating, a domain that demands external professional knowledge integration and rigorous step-wise reasoning, existing approaches often struggle. To bridge this gap, we propose SciEducator, the first iterative self-evolving multi-agent system for scientific video comprehension and education. Rooted in the classical Deming Cycle from management science, our design reformulates its Plan-Do-Study-Act philosophy into a self-evolving reasoning and feedback mechanism, which facilitates the interpretation of intricate scientific activities in videos. Moreover, SciEducator can produce multimodal educational content tailored to specific scientific processes, including textual instructions, visual guides, audio narrations, and interactive references. To support evaluation, we construct SciVBench, a benchmark consisting of 500 expert-verified and literature-grounded science QA pairs across five categories, covering physical, chemical, and everyday phenomena. Extensive experiments demonstrate that SciEducator substantially outperforms leading closed-source MLLMs (e.g., Gemini, GPT-4o) and state-of-the-art video agents on the benchmark, establishing a new paradigm for the community.

[42] Test-Time Temporal Sampling for Efficient MLLM Video Understanding cs.CV

Kaibin Wang, Mingbao Lin

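TL;DR: The paper proposes T3S, a training-free, plug-and-play inference method for efficient long-video understanding that reduces computational cost while improving accuracy and inference speed.

Details

Motivation: Multimodal large language models (MLLMs) struggle with long videos because self-attention is computationally expensive and inference is slow. Existing methods trade off accuracy, extra training, or inference speed.

Result: On long-video understanding benchmarks, T3S improves accuracy by up to 3.1% and reduces first-token delay by 2.04×, with no model modifications or fine-tuning.

Insight: By exploiting video redundancy and turning it into a computational advantage, T3S offers an efficient and scalable solution for long-video understanding.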

Abstract: Processing long videos with multimodal large language models (MLLMs) poses a significant computational challenge, as the model’s self-attention mechanism scales quadratically with the number of video tokens, resulting in high computational demand and slow inference speed. Current solutions, such as rule-based sub-sampling, learned frame selector, or memory-based summarization, often introduce their own trade-offs: they compromise accuracy, necessitate additional training, or decrease inference speed. In this paper, we propose Test-Time Temporal Sampling (T3S), a training-free, plug-and-play inference wrapper that enables MLLMs to process long videos both efficiently and effectively. T3S exploits spatiotemporal redundancy by generating multiple short and diverse subsequences of video tokens at inference time, packing them within a single forward pass, and aggregating their predictions. This multi-subsequence formulation broadens visual coverage while reducing the computational cost of self-attention from $O(L^2)$ to $O(\sum_{i=1}^m α_i^2L^2)$, where $\sum_{i=1}^m α_i^2 < 1$. Extensive experiments on long video understanding benchmarks demonstrate that T3S improves accuracy by up to 3.1% and reduces first token delay by $2.04\times$, all with minimal integration effort. Our approach operates entirely at inference time, requires no model modifications or fine-tuning, and is compatible with a wide range of pretrained MLLMs. T3S turns video redundancy into a computational advantage, offering a scalable solution for long-video understanding. The code is available at https://github.com/kaibinwang3/T3S.
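
Illustration (not from the paper): the cost expression in the abstract implies that the quadratic attention cost, relative to full attention over L tokens, is simply the sum of the squared fractions α_i, because L² cancels. The snippet below evaluates that ratio. The function name `relative_attention_cost` and the example fractions are assumptions chosen for illustration.

```python
def relative_attention_cost(alphas):
    """Ratio of sum_i (alpha_i * L)^2 to L^2; the L^2 factor cancels.

    Each alpha_i is the fraction of the L video tokens placed in
    subsequence i.
    """
    if any(a <= 0 or a > 1 for a in alphas):
        raise ValueError("each alpha must lie in (0, 1]")
    return sum(a * a for a in alphas)

# Four subsequences, each holding a quarter of the tokens (made-up setting).
ratio = relative_attention_cost([0.25, 0.25, 0.25, 0.25])
```

With four subsequences that each cover a quarter of the tokens, the ratio is 0.25, so under this formula the quadratic attention term costs a quarter of full attention.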

Liangyang Ouyang, Yifei Huang, Mingfang Zhang, Caixin Kang, Ryosuke Furuta

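TL;DR: The paper proposes a multimodal multi-speaker attention alignment method. Using dynamic cross-modal head selection and an adaptive social-aware attention bias, it improves MLLM performance in multi-speaker scenes and reaches state-of-the-art results on several benchmarks.

Details

Motivation: Understanding social interaction in video requires reasoning about the dynamic interplay between speakers, listeners, and non-verbal cues such as gaze or gestures. Existing MLLMs perform poorly in multi-speaker scenes, mainly because visual and textual tokens lack speaker-consistent cross-modal alignment.

Result: On three benchmarks (TVQA+, MMSI, and OnlineMMSI), the method significantly improves the multi-speaker social reasoning of MLLMs and achieves state-of-the-art performance.

Insight: Speaker-consistent alignment of vision and language is key to multimodal social reasoning, and existing MLLMs need targeted optimization for multi-speaker scenes.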

Abstract: Understanding social interaction in video requires reasoning over a dynamic interplay of verbal and non-verbal cues: who is speaking, to whom, and with what gaze or gestures. While Multimodal Large Language Models (MLLMs) are natural candidates, simply adding visual inputs yields surprisingly inconsistent gains on social tasks. Our quantitative analysis of cross-modal attention inside state-of-the-art MLLMs reveals a core failure mode: in multi-speaker scenes, visual and textual tokens lack speaker-consistent alignment, exhibiting substantially weaker cross-modal attention than in object-centric images. To address this, we propose a multimodal multi-speaker attention alignment method that can be integrated into existing MLLMs. First, we introduce dynamic cross-modal head selection to identify attention heads most responsible for grounding. Then, an adaptive social-aware attention bias, computed from existing attention patterns and speaker locations, is injected into the attention mechanism. This bias reinforces alignment between a speaker’s visual representation and their utterances without introducing trainable parameters or architectural changes. We integrate our method into three distinct MLLMs (LLaVA-NeXT-Video, Qwen2.5-VL, and InternVL3) and evaluate on three benchmarks (TVQA+, MMSI, OnlineMMSI). Across four social tasks, results demonstrate that our approach improves the ability of MLLMs and achieves state-of-the-art results. Attention visualizations confirm our method successfully focuses the model on speaker-relevant regions, enabling more robust multi-party social reasoning. Our implementation and model will be available at https://github.com/ut-vision/SocialInteraction.

[44] HEAL: Learning-Free Source Free Unsupervised Domain Adaptation for Cross-Modality Medical Image Segmentation cs.CV

Yulong Shi, Jiapeng Li, Lin Qi

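TL;DR: HEAL proposes a learning-free method for cross-modality medical image segmentation that combines hierarchical denoising, edge-guided selection, size-aware fusion, and a learning-free design, achieving state-of-the-art performance in the source-free, unsupervised setting.

Details

Motivation: Growing demands for clinical data privacy and storage constraints make source-free unsupervised domain adaptation (SFUDA) increasingly important. Existing methods, however, face challenges in cross-modality medical image segmentation: neither source data nor target-domain labels are available.

Result: HEAL outperforms existing SFUDA methods on cross-modality medical image segmentation, achieving state-of-the-art performance.

Insight: The effectiveness of a learning-free design in domain adaptation suggests that strategies without complex training can still markedly improve generalization to the target domain.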

Abstract: Growing demands for clinical data privacy and storage constraints have spurred advances in Source Free Unsupervised Domain Adaptation (SFUDA). SFUDA addresses the domain shift by adapting models from the source domain to the unseen target domain without accessing source data, even when target-domain labels are unavailable. However, SFUDA faces significant challenges: the absence of source domain data and label supervision in the target domain due to source free and unsupervised settings. To address these issues, we propose HEAL, a novel SFUDA framework that integrates Hierarchical denoising, Edge-guided selection, size-Aware fusion, and Learning-free characteristic. Large-scale cross-modality experiments demonstrate that our method outperforms existing SFUDA approaches, achieving state-of-the-art (SOTA) performance. The source code is publicly available at: https://github.com/derekshiii/HEAL.

[45] VITAL: Vision-Encoder-centered Pre-training for LMMs in Visual Quality Assessment cs.CV | cs.AI

Ziheng Jia, Linhan Cao, Jinliang Han, Zicheng Zhang, Jiaying Qian

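TL;DR: VITAL proposes a vision-encoder-centered pre-training approach that uses a large-scale dataset and multi-task training to improve the generalization and transferability of large multimodal models for visual quality assessment (VQualA).

Details

Motivation: Existing VQualA LMMs typically focus on a single task and rely on full-parameter fine-tuning, which makes them prone to overfitting and limits generalization and transferability.

Result: The model zoo shows strong zero-shot performance, and each decoder needs only a small amount of data (less than 1/1000 of the pre-training data) to reach performance comparable to the fully trained counterpart.

Insight: Vision-encoder-centered pre-training is an effective route to better generalization and transferability in VQualA LMMs.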

Abstract: Developing a robust visual quality assessment (VQualA) large multi-modal model (LMM) requires achieving versatility, powerfulness, and transferability. However, existing VQualA LMMs typically focus on a single task and rely on full-parameter fine-tuning, which makes them prone to overfitting on specific modalities or task types, thereby limiting their generalization capacity and transferability. To address this, we propose a vision-encoder-centered generative pre-training pipeline and develop the VITAL-Series LMMs. (1) We adopt a machine-executed annotation-scrutiny paradigm, constructing over 4.5M vision-language (VL) pairs, the largest VQualA training dataset to date. (2) We employ a multi-task training workflow that simultaneously enhances the model’s quantitative scoring precision and strengthens its capability for quality interpretation across both image and video modalities. (3) Building upon the vision encoder, we realize an efficient model zoo extension: the model zoo exhibits strong zero-shot performance, and each paired decoder requires only a swift warm-up using less than 1/1000 of the pre-training data to achieve performance comparable to the fully trained counterpart. Overall, our work lays a cornerstone for advancing toward the foundation LMM for VQualA.

[46] X-ReID: Multi-granularity Information Interaction for Video-Based Visible-Infrared Person Re-Identification cs.CV

Chenyang Yu, Xuehu Liu, Pingping Zhang, Huchuan Lu

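TL;DR: The paper proposes X-ReID, a framework for video-based visible-infrared person re-identification that uses cross-modality prototype collaboration and multi-granularity information interaction to significantly improve performance.

Details

Motivation: Large-scale vision-language models such as CLIP perform very well on multimodal retrieval tasks, but their potential in video-based visible-infrared person re-identification (VVI-ReID) remains largely untapped, particularly for narrowing the modality gap and exploiting spatiotemporal information.

Result: On two large-scale VVI-ReID benchmarks (HITSZ-VCM and BUPTCampus), X-ReID outperforms existing methods.

Insight: Combining cross-modality alignment with multi-granularity spatiotemporal information interaction effectively improves VVI-ReID performance.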

Abstract: Large-scale vision-language models (e.g., CLIP) have recently achieved remarkable performance in retrieval tasks, yet their potential for Video-based Visible-Infrared Person Re-Identification (VVI-ReID) remains largely unexplored. The primary challenges are narrowing the modality gap and leveraging spatiotemporal information in video sequences. To address the above issues, in this paper, we propose a novel cross-modality feature learning framework named X-ReID for VVI-ReID. Specifically, we first propose a Cross-modality Prototype Collaboration (CPC) to align and integrate features from different modalities, guiding the network to reduce the modality discrepancy. Then, a Multi-granularity Information Interaction (MII) is designed, incorporating short-term interactions from adjacent frames, long-term cross-frame information fusion, and cross-modality feature alignment to enhance temporal modeling and further reduce modality gaps. Finally, by integrating multi-granularity information, a robust sequence-level representation is achieved. Extensive experiments on two large-scale VVI-ReID benchmarks (i.e., HITSZ-VCM and BUPTCampus) demonstrate the superiority of our method over state-of-the-art methods. The source code is released at https://github.com/AsuradaYuci/X-ReID.

Yangyang Liu, Yuhao Wang, Pingping Zhang

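TL;DR: The paper proposes Signal, a new multi-modal object ReID framework that uses selective interaction and global-local alignment to address background interference and multi-modal consistency alignment in multi-modal feature fusion.

Details

Motivation: Existing multi-modal object ReID methods focus mainly on fusing multi-modal features and overlook the challenges of background interference and multi-modal consistency alignment.

Result: Experiments on three multi-modal object ReID benchmarks (RGBNT201, RGBNT100, and MSVR310) validate the method's effectiveness.

Insight: Selective interaction and global-local alignment effectively improve feature discriminability in multi-modal object ReID while reducing background interference and multi-modal inconsistency.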

Abstract: Multi-modal object Re-IDentification (ReID) is devoted to retrieving specific objects through the exploitation of complementary multi-modal image information. Existing methods mainly concentrate on the fusion of multi-modal features, yet neglect background interference. Besides, current multi-modal fusion methods often focus on aligning modality pairs but struggle with multi-modal consistency alignment. To address these issues, we propose a novel selective interaction and global-local alignment framework called Signal for multi-modal object ReID. Specifically, we first propose a Selective Interaction Module (SIM) to select important patch tokens with intra-modal and inter-modal information. These important patch tokens engage in the interaction with class tokens, thereby yielding more discriminative features. Then, we propose a Global Alignment Module (GAM) to simultaneously align multi-modal features by minimizing the volume of 3D polyhedra in the Gramian space. Meanwhile, we propose a Local Alignment Module (LAM) to align local features in a shift-aware manner. With these modules, our proposed framework can extract more discriminative features for object ReID. Extensive experiments on three multi-modal object ReID benchmarks (i.e., RGBNT201, RGBNT100, MSVR310) validate the effectiveness of our method. The source code is available at https://github.com/010129/Signal.
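
Illustration (not from the paper): one standard way to measure the volume spanned by three feature vectors is through their Gram matrix G, whose entries are pairwise inner products. The parallelotope they span has volume sqrt(det(G)), which shrinks toward zero as the vectors align. The abstract does not give the exact loss, so `gram_volume` below is a hypothetical helper showing only this geometric quantity.

```python
import math

def gram_volume(u, v, w):
    """Volume of the parallelotope spanned by three vectors: sqrt(det(G))."""
    vecs = [u, v, w]
    # Gram matrix of pairwise inner products.
    g = [[sum(a * b for a, b in zip(x, y)) for y in vecs] for x in vecs]
    # 3x3 determinant by cofactor expansion along the first row.
    det = (
        g[0][0] * (g[1][1] * g[2][2] - g[1][2] * g[2][1])
        - g[0][1] * (g[1][0] * g[2][2] - g[1][2] * g[2][0])
        + g[0][2] * (g[1][0] * g[2][1] - g[1][1] * g[2][0])
    )
    return math.sqrt(max(det, 0.0))  # clamp tiny negative rounding error

# Mutually orthogonal unit vectors span a unit cube.
orthogonal = gram_volume([1, 0, 0], [0, 1, 0], [0, 0, 1])
# Identical vectors span no volume at all.
aligned = gram_volume([1, 0, 0], [1, 0, 0], [1, 0, 0])
```

Minimizing such a volume pushes the three vectors toward a common direction, which is one way to read "simultaneously align" for more than two modalities.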

[48] Plan-X: Instruct Video Generation via Semantic Planning cs.CV | cs.AI

Lun Huang, You Xie, Hongyi Xu, Tianpei Gu, Chenxu Zhang

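TL;DR: Plan-X is a framework that guides video generation through explicit semantic planning, addressing the weaknesses of Diffusion Transformers in high-level semantic reasoning and long-horizon planning.

Details

Motivation: Existing Diffusion Transformers excel at visual synthesis but are limited in high-level semantic reasoning and long-horizon planning, leading to visual hallucinations and poor alignment with user instructions.

Result: Experiments show that Plan-X substantially reduces visual hallucinations and enables fine-grained, instruction-aligned video generation consistent with multimodal context.

Insight: Explicit semantic planning that combines the reasoning ability of language models with the generative ability of diffusion models can effectively improve the alignment and consistency of generated videos.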

Abstract: Diffusion Transformers have demonstrated remarkable capabilities in visual synthesis, yet they often struggle with high-level semantic reasoning and long-horizon planning. This limitation frequently leads to visual hallucinations and misalignments with user instructions, especially in scenarios involving complex scene understanding, human-object interactions, multi-stage actions, and in-context motion reasoning. To address these challenges, we propose Plan-X, a framework that explicitly enforces high-level semantic planning to instruct the video generation process. At its core lies a Semantic Planner, a learnable multimodal language model that reasons over the user’s intent from both text prompts and visual context, and autoregressively generates a sequence of text-grounded spatio-temporal semantic tokens. These semantic tokens, complementary to high-level text prompt guidance, serve as structured “semantic sketches” over time for the video diffusion model, whose strength lies in synthesizing high-fidelity visual details. Plan-X effectively integrates the strength of language models in multimodal in-context reasoning and planning, together with the strength of diffusion models in photorealistic video synthesis. Extensive experiments demonstrate that our framework substantially reduces visual hallucinations and enables fine-grained, instruction-aligned video generation consistent with multimodal context.

[49] HyM-UNet: Synergizing Local Texture and Global Context via Hybrid CNN-Mamba Architecture for Medical Image Segmentation cs.CV | cs.IR

Haodong Chen, Xianfei Han, Qwen

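TL;DR: The paper proposes HyM-UNet, a hybrid architecture for medical image segmentation that combines the local feature extraction of CNNs with the global modeling of Mamba. Experiments show that it outperforms existing methods on the ISIC 2018 dataset.

Details

Motivation: In medical image segmentation, CNNs struggle to capture global structure because of their local receptive fields, whereas Mamba models global dependencies efficiently. Combining the two can improve segmentation accuracy.

Result: On ISIC 2018, HyM-UNet significantly outperforms existing methods in Dice coefficient and IoU, with lower parameter counts and inference latency.

Insight: Combining the local strengths of CNNs with the global strengths of Mamba handles complex shapes and scale variation in medical images effectively while staying lightweight.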

Abstract: Accurate organ and lesion segmentation is a critical prerequisite for computer-aided diagnosis. Convolutional Neural Networks (CNNs), constrained by their local receptive fields, often struggle to capture complex global anatomical structures. To tackle this challenge, this paper proposes a novel hybrid architecture, HyM-UNet, designed to synergize the local feature extraction capabilities of CNNs with the efficient global modeling capabilities of Mamba. Specifically, we design a Hierarchical Encoder that utilizes convolutional modules in the shallow stages to preserve high-frequency texture details, while introducing Visual Mamba modules in the deep stages to capture long-range semantic dependencies with linear complexity. To bridge the semantic gap between the encoder and the decoder, we propose a Mamba-Guided Fusion Skip Connection (MGF-Skip). This module leverages deep semantic features as gating signals to dynamically suppress background noise within shallow features, thereby enhancing the perception of ambiguous boundaries. We conduct extensive experiments on the public benchmark dataset ISIC 2018. The results demonstrate that HyM-UNet significantly outperforms existing state-of-the-art methods in terms of Dice coefficient and IoU, while maintaining lower parameter counts and inference latency. This validates the effectiveness and robustness of the proposed method in handling medical segmentation tasks characterized by complex shapes and scale variations.
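
Illustration (not from the paper): the two metrics named in the abstract have standard definitions for binary masks, Dice = 2|A∩B| / (|A| + |B|) and IoU = |A∩B| / |A∪B|. The snippet below computes both on flattened masks. The helper name `dice_and_iou` and the convention of returning 1.0 when both masks are empty are assumptions, and the paper's own evaluation code may differ.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    p_sum = sum(1 for p in pred if p)
    t_sum = sum(1 for t in target if t)
    union = p_sum + t_sum - inter
    if union == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return 1.0, 1.0
    return 2 * inter / (p_sum + t_sum), inter / union

# Two 4-pixel masks that overlap on exactly one pixel.
dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

For this toy pair the Dice coefficient is 0.5 and the IoU is 1/3, which shows why Dice always reads at least as high as IoU on the same prediction.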

[50] SD-PSFNet: Sequential and Dynamic Point Spread Function Network for Image Deraining cs.CV

Jiayu Wang, Haoyu Bian, Haoran Sun, Shaoning Zeng

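TL;DR: SD-PSFNet is a new image deraining method that combines Point Spread Function (PSF) mechanisms with a multi-stage restoration architecture. It dynamically simulates rain streak optics, separates rain from background, and progressively refines the result across stages.

Details

Motivation: Image deraining is crucial for vision applications, but the complex multi-scale physics of rain and its coupling with scenes make it hard. Existing methods often fail to fully combine physical modeling with feature fusion.

Result: State-of-the-art PSNR/SSIM on the Rain100H, RealRain-1k-L, and RealRain-1k-H datasets.

Insight: SD-PSFNet performs well in complex scenes and dense rainfall, providing a new physics-aware approach to image deraining.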

Abstract: Image deraining is crucial for vision applications but is challenged by the complex multi-scale physics of rain and its coupling with scenes. To address this challenge, a novel approach inspired by multi-stage image restoration is proposed, incorporating Point Spread Function (PSF) mechanisms to reveal the image degradation process while combining dynamic physical modeling with sequential feature fusion transfer, named SD-PSFNet. Specifically, SD-PSFNet employs a sequential restoration architecture with three cascaded stages, allowing multiple dynamic evaluations and refinements of the degradation process estimation. The network utilizes components with learned PSF mechanisms to dynamically simulate rain streak optics, enabling effective rain-background separation while progressively enhancing outputs through novel PSF components at each stage. Additionally, SD-PSFNet incorporates adaptive gated fusion for optimal cross-stage feature integration, enabling sequential refinement from coarse rain removal to fine detail restoration. Our model achieves state-of-the-art PSNR/SSIM metrics on Rain100H (33.12dB/0.9371), RealRain-1k-L (42.28dB/0.9872), and RealRain-1k-H (41.08dB/0.9838). In summary, SD-PSFNet demonstrates excellent capability in complex scenes and dense rainfall conditions, providing a new physics-aware approach to image deraining.

[51] RAISECity: A Multimodal Agent Framework for Reality-Aligned 3D World Generation at City-Scale cs.CV

Shengyuan Wang, Zhiheng Zheng, Yu Shang, Lixuan He, Yangcheng Yu

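TL;DR: RAISECity proposes a multimodal agent framework for city-scale 3D world generation. It addresses the quality, fidelity, and scalability challenges of existing methods, and achieves high precision and reality alignment through dynamic data processing and multimodal tool invocation.

Details

Motivation: City-scale 3D generation is vital to the development of embodied intelligence and world models, but existing methods face challenges in quality, fidelity, and scalability.

Result: RAISECity performs strongly in real-world alignment, shape precision, texture fidelity, and aesthetics, with over a 90% win rate against existing baselines for overall perceptual quality.

Insight: Combining an agentic framework with multimodal tools offers a new approach to large-scale 3D generation, with potential applications in immersive media and embodied intelligence.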

Abstract: City-scale 3D generation is of great importance for the development of embodied intelligence and world models. Existing methods, however, face significant challenges regarding quality, fidelity, and scalability in 3D world generation. Thus, we propose RAISECity, a Reality-Aligned Intelligent Synthesis Engine that creates detailed, City-scale 3D worlds. We introduce an agentic framework that leverages diverse multimodal foundation tools to acquire real-world knowledge, maintain robust intermediate representations, and construct complex 3D scenes. This agentic design, featuring dynamic data processing, iterative self-reflection and refinement, and the invocation of advanced multimodal tools, minimizes cumulative errors and enhances overall performance. Extensive quantitative experiments and qualitative analyses validate the superior performance of RAISECity in real-world alignment, shape precision, texture fidelity, and aesthetics level, achieving over a 90% win-rate against existing baselines for overall perceptual quality. This combination of 3D quality, reality alignment, scalability, and seamless compatibility with computer graphics pipelines makes RAISECity a promising foundation for applications in immersive media, embodied intelligence, and world models.

[52] Is Complete Labeling Necessary? Understanding Active Learning in Longitudinal Medical Imaging cs.CV

Siteng Ma, Honghui Du, Prateek Mathur, Brendan S. Kelly, Ronan P. Killeen

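TL;DR: The paper proposes LMI-AL, a new deep active learning framework for change detection in longitudinal medical imaging. By selectively labeling the most informative sample pairs, it markedly reduces labeling cost while approaching the performance of fully labeled models.

Details

Motivation: Labeling longitudinal medical images is costly and time-consuming, and conventional deep active learning methods do not apply directly to change detection.

Result: With less than 8% of the data labeled, LMI-AL achieves performance comparable to models trained on fully labeled datasets.

Insight: LMI-AL provides an efficient labeling strategy for longitudinal medical imaging tasks and could be extended to other dynamic vision tasks.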

Abstract: Detecting changes in longitudinal medical imaging using deep learning requires a substantial amount of accurately labeled data. However, labeling these images is notably more costly and time-consuming than labeling other image types, as it requires labeling across various time points, where new lesions can be minor, and subtle changes are easily missed. Deep Active Learning (DAL) has shown promise in minimizing labeling costs by selectively querying the most informative samples, but existing studies have primarily focused on static tasks like classification and segmentation. Consequently, the conventional DAL approach cannot be directly applied to change detection tasks, which involve identifying subtle differences across multiple images. In this study, we propose a novel DAL framework, named Longitudinal Medical Imaging Active Learning (LMI-AL), tailored specifically for longitudinal medical imaging. By pairing and differencing all 2D slices from baseline and follow-up 3D images, LMI-AL iteratively selects the most informative pairs for labeling using DAL, training a deep learning model with minimal manual annotation. Experimental results demonstrate that, with less than 8% of the data labeled, LMI-AL can achieve performance comparable to models trained on fully labeled datasets. We also provide a detailed analysis of the method’s performance, as guidance for future research. The code is publicly available at https://github.com/HelenMa9998/Longitudinal_AL.
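
Illustration (not from the paper): the pair, difference, and select loop described in the abstract can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the released implementation. It assumes slices are paired by index, differenced pixel by pixel, and ranked by the entropy of a model's predicted change probability. The names `difference_slices` and `select_pairs`, the entropy criterion, and the toy probabilities are all hypothetical.

```python
import math

def difference_slices(baseline, followup):
    """Pair 2D slices by index and subtract them pixel by pixel."""
    return [
        [[f - b for b, f in zip(b_row, f_row)]
         for b_row, f_row in zip(b_slice, f_slice)]
        for b_slice, f_slice in zip(baseline, followup)
    ]

def binary_entropy(p):
    """Entropy in bits of a predicted change probability."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_pairs(change_probs, budget):
    """Indices of the `budget` most uncertain slice pairs."""
    ranked = sorted(range(len(change_probs)),
                    key=lambda i: binary_entropy(change_probs[i]),
                    reverse=True)
    return ranked[:budget]

# Two tiny 2x2 slices per scan; only the first slice changes.
baseline = [[[0, 0], [0, 0]], [[1, 1], [1, 1]]]
followup = [[[0, 1], [0, 0]], [[1, 1], [1, 1]]]
diffs = difference_slices(baseline, followup)

# Hypothetical model outputs: probability that each pair contains a change.
to_label = select_pairs([0.95, 0.5, 0.1, 0.6], budget=2)
```

In an active learning round, the selected pairs would be sent for annotation, the model retrained, and the ranking recomputed on the remaining unlabeled pairs.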

[53] RoadBench: Benchmarking MLLMs on Fine-Grained Spatial Understanding and Reasoning under Urban Road Scenarios cs.CV

Jun Zhang, Jie Feng, Long Chen, Junhui Wang, Zhicheng Liu

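TL;DR: This paper proposes RoadBench, a systematic benchmark for evaluating the fine-grained spatial understanding and reasoning of multimodal large language models (MLLMs) in urban road scenarios. Its six tasks and 9,121 test cases reveal significant shortcomings in existing MLLMs.

Details

Motivation: Existing MLLMs perform strongly on multimodal tasks, but their fine-grained spatial understanding and reasoning in complex urban scenarios have not been adequately studied or evaluated. This work aims to fill that gap.

Result: An evaluation of 14 mainstream MLLMs shows that on certain tasks they perform worse than rule-based or random-selection baselines, indicating significant limitations in current models.

Insight: RoadBench is not only a challenging benchmark; it also provides direction and data for improving the spatial understanding of MLLMs.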

Abstract: Multimodal large language models (MLLMs) have demonstrated powerful capabilities in general spatial understanding and reasoning. However, their fine-grained spatial understanding and reasoning capabilities in complex urban scenarios have not received significant attention in the fields of both research and industry. To fill this gap, we focus primarily on road markings as a typical example of fine-grained spatial elements under urban scenarios, given the essential role of the integrated road traffic network they form within cities. Around road markings and urban traffic systems, we propose RoadBench, a systematic benchmark that comprehensively evaluates MLLMs’ fine-grained spatial understanding and reasoning capabilities using BEV and FPV image inputs. This benchmark comprises six tasks consisting of 9,121 strictly manually verified test cases. These tasks form a systematic evaluation framework that bridges understanding at local spatial scopes to global reasoning. They not only test MLLMs’ capabilities in recognition, joint understanding, and reasoning but also assess their ability to integrate image information with domain knowledge. After evaluating 14 mainstream MLLMs, we confirm that RoadBench is a challenging benchmark for MLLMs while revealing significant shortcomings in existing MLLMs’ fine-grained spatial understanding and reasoning capabilities within urban scenarios. In certain tasks, their performance even falls short of simple rule-based or random selection baselines. These findings, along with RoadBench itself, will contribute to the comprehensive advancement of spatial understanding capabilities for MLLMs. The benchmark code, example datasets, and raw evaluation results are available in the supplementary material.

[54] Modeling Retinal Ganglion Cells with Neural Differential Equations cs.CV | cs.AI

Kacper Dobek, Daniel Jankowski, Krzysztof Krawiec

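TL;DR: This paper explores Liquid Time-Constant Networks (LTCs) and Closed-form Continuous-time Networks (CfCs) for modeling retinal ganglion cell activity in tiger salamanders. Across three datasets, they outperform a convolutional baseline and an LSTM, with lower MAE, faster convergence, and smaller model sizes.

Details

Motivation: The goal is to develop efficient, adaptable models for settings with limited data and frequent retraining, such as edge deployment in vision prosthetics.

Result: LTCs and CfCs beat the baselines on several metrics (MAE, convergence speed, model size), though with slightly lower Pearson correlation.

Insight: Continuous-time networks show promise for data-limited tasks, especially in edge computing scenarios that demand efficiency and adaptability.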

Abstract: This work explores Liquid Time-Constant Networks (LTCs) and Closed-form Continuous-time Networks (CfCs) for modeling retinal ganglion cell activity in tiger salamanders across three datasets. Compared to a convolutional baseline and an LSTM, both architectures achieved lower MAE, faster convergence, smaller model sizes, and favorable query times, though with slightly lower Pearson correlation. Their efficiency and adaptability make them well suited for scenarios with limited data and frequent retraining, such as edge deployments in vision prosthetics.
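
A lower MAE alongside a lower Pearson correlation is not contradictory, because the two metrics measure different things: MAE penalizes absolute deviation, while Pearson correlation rewards matching the shape of the response regardless of scale. The sketch below illustrates this with made-up firing rates; the numbers are purely illustrative and are not taken from the paper.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

# Illustrative (made-up) firing rates.
target = [0.0, 2.0, 4.0, 6.0]
close_but_noisy = [1.0, 1.0, 4.0, 6.0]      # small errors, imperfect shape
scaled_but_aligned = [0.0, 4.0, 8.0, 12.0]  # perfect shape, wrong scale
```

Here `close_but_noisy` has the lower MAE, while `scaled_but_aligned` has the higher Pearson correlation, which is why it helps to report both metrics.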

[55] MambaX: Image Super-Resolution with State Predictive Control cs.CV

Chenyu Li, Danfeng Hong, Bing Zhang, Zhaojie Pan, Naoto Yokoya

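TL;DR: MambaX is a new image super-resolution method that dynamically learns nonlinear state parameters to address the insufficient control of error propagation and accumulation at intermediate stages in existing methods. It performs well on both single-image and multimodal super-resolution tasks.

Details

Motivation: Existing image super-resolution methods focus mainly on improving the final resolution and neglect error control and flexibility at intermediate stages. Mamba introduced the idea of representing reconstruction as a state sequence, but its fixed linear mapper and limited receptive field restrict its performance on fine-grained images.

Result: MambaX performs well on single-image and multimodal super-resolution tasks, showing potential for spectrally generalized modeling across arbitrary dimensions and modalities.

Insight: Through dynamic learning and nonlinear state control, MambaX tackles error accumulation and multimodal fusion in image super-resolution and offers a new approach to spectrally generalized modeling.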

Abstract: Image super-resolution (SR) is a critical technology for overcoming the inherent hardware limitations of sensors. However, existing approaches mainly focus on directly enhancing the final resolution, often neglecting effective control over error propagation and accumulation during intermediate stages. Recently, Mamba has emerged as a promising approach that can represent the entire reconstruction process as a state sequence with multiple nodes, allowing for intermediate intervention. Nonetheless, its fixed linear mapper is limited by a narrow receptive field and restricted flexibility, which hampers its effectiveness in fine-grained images. To address this, we created a nonlinear state predictive control model, MambaX, that maps consecutive spectral bands into a latent state space and generalizes the SR task by dynamically learning the nonlinear state parameters of control equations. Compared to existing sequence models, MambaX 1) employs dynamic state predictive control learning to approximate the nonlinear differential coefficients of state-space models; 2) introduces a novel state cross-control paradigm for multimodal SR fusion; and 3) utilizes progressive transitional learning to mitigate heterogeneity caused by domain and modality shifts. Our evaluation demonstrates the superior performance of the dynamic spectrum-state representation model in both single-image SR and multimodal fusion-based SR tasks, highlighting its substantial potential to advance spectrally generalized modeling across arbitrary dimensions and modalities.

[56] Hybrid Event Frame Sensors: Modeling, Calibration, and Simulation cs.CV

Yunfan Lu, Nico Messikommer, Xiaogang Xu, Liming Chen, Yuhan Chen

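TL;DR: The paper proposes a unified statistical noise model for hybrid event frame sensors (APS + EVS), develops a calibration pipeline and the simulator HESIM, and validates their usefulness across multiple imaging tasks.

Details

Motivation: Hybrid event frame sensors combine the strengths of APS and EVS, but their complex circuit architecture introduces noise patterns that are not well understood. Existing work lacks a unified noise model, which limits the sensors' potential in applications.

Result: Experiments on two hybrid sensors validate the model, particularly on tasks such as video frame interpolation and deblurring, where simulated data transfers well to real scenes.

Insight: Unified noise modeling and calibration are essential for applying hybrid sensors; HESIM provides a reliable source of synthetic data for developing related algorithms.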

Abstract: Event frame hybrid sensors integrate an Active Pixel Sensor (APS) and an Event Vision Sensor (EVS) within a single chip, combining the high dynamic range and low latency of the EVS with the rich spatial intensity information from the APS. While this tight integration offers compact, temporally precise imaging, the complex circuit architecture introduces non-trivial noise patterns that remain poorly understood and unmodeled. In this work, we present the first unified, statistics-based imaging noise model that jointly describes the noise behavior of APS and EVS pixels. Our formulation explicitly incorporates photon shot noise, dark current noise, fixed-pattern noise, and quantization noise, and links EVS noise to illumination level and dark current. Based on this formulation, we further develop a calibration pipeline to estimate noise parameters from real data and offer a detailed analysis of both APS and EVS noise behaviors. Finally, we propose HESIM, a statistically grounded simulator that generates RAW frames and events under realistic, jointly calibrated noise statistics. Experiments on two hybrid sensors validate our model across multiple imaging tasks (e.g., video frame interpolation and deblurring), demonstrating strong transfer from simulation to real data.
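
The four noise sources named in this abstract can be combined in a textbook-style sketch of a single APS pixel reading. This is not the HESIM simulator or the paper's formulation: the function names, parameters, and the specific composition (Poisson shot and dark noise, an additive fixed-pattern offset, rounding to an ADC code) are assumptions for illustration, and the EVS side is omitted.

```python
import math
import random

def poisson(rng, lam):
    """Draw a Poisson sample with Knuth's method (adequate for small means)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_aps_pixel(rng, photon_rate, dark_rate, exposure,
                       fpn_offset=0.0, gain=1.0, bits=8):
    """Simulate one APS pixel reading in digital numbers (DN).
    All parameter names are hypothetical."""
    # Photon shot noise and dark current noise: both Poisson in electrons.
    electrons = (poisson(rng, photon_rate * exposure)
                 + poisson(rng, dark_rate * exposure))
    # Fixed-pattern noise: a per-pixel offset that is constant across frames.
    analog = gain * electrons + fpn_offset
    # Quantization noise: round to an integer code and clip to the ADC range.
    return max(0, min(2 ** bits - 1, round(analog)))
```

Averaging many readings of the same pixel recovers roughly `gain * (photon_rate + dark_rate) * exposure + fpn_offset`, which is the kind of relationship a calibration pipeline can exploit to estimate noise parameters from real data.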

[57] UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios cs.CV

Tian Ye, Song Fei, Lei Zhu

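TL;DR: UltraFlux is a Flux-based DiT model for native 4K text-to-image generation. Through data-model co-design, it addresses several coupled problems that existing diffusion transformers face at 4K resolution and achieves high-quality generation across diverse aspect ratios.

Details

Motivation: Existing diffusion transformers perform well around 1K resolution, but extending them to native 4K exposes coupled problems in positional encoding, VAE compression, and optimization. Fixing any one of these in isolation does little to improve overall quality, so data-model co-design is needed.

Result: On the Aesthetic-Eval 4096 benchmark and in multi-aspect-ratio 4K settings, UltraFlux outperforms open-source baselines on fidelity, aesthetic, and alignment metrics, and in some cases matches or surpasses the proprietary model Seedream 4.0.

Insight: Data-model co-design is key to solving high-resolution generation, particularly when handling complex aspect ratios. UltraFlux shows how the individual components of a diffusion transformer can be optimized at 4K resolution for stable, high-quality generation.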

Abstract: Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval at 4096 benchmark and multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics, and, with an LLM prompt refiner, matches or surpasses the proprietary Seedream 4.0.

[58] IE-Critic-R1: Advancing the Explanatory Measurement of Text-Driven Image Editing for Human Perception Alignment cs.CV | cs.AI | cs.CL

Bowen Qu, Shangkun Sun, Xiaoyu Liang, Wei Gao

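TL;DR: This paper proposes IE-Critic-R1 and IE-Bench, which improve the evaluation of text-driven image editing by using reinforcement learning to align with human ratings.

Details

Motivation: Existing methods for evaluating text-driven image editing usually focus only on text-image alignment and neglect consistency with human perception.

Result: Experiments show that IE-Critic-R1 outperforms existing methods at evaluating text-driven image editing.

Insight: Evaluating text-driven image editing requires considering both text-image alignment and consistency with human perception.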

Abstract: Recent advances in text-driven image editing have been significant, yet accurately evaluating the edited images remains a considerable challenge. Unlike text-driven image generation, text-driven image editing conditions simultaneously on both text and a source image. The edited images often retain an intrinsic connection to the original image, which changes dynamically with the semantics of the text. However, previous methods tend to focus solely on text-image alignment or are not well aligned with human perception. In this work, we introduce the Text-driven Image Editing Benchmark suite (IE-Bench) to enhance the assessment of text-driven edited images. IE-Bench includes a database containing diverse source images, various editing prompts, and the corresponding edited results from different editing methods, along with nearly 4,000 samples with corresponding Mean Opinion Scores (MOS) provided by 15 human subjects. Furthermore, we introduce IE-Critic-R1, which, benefiting from Reinforcement Learning from Verifiable Rewards (RLVR), provides more comprehensive and explainable quality assessment for text-driven image editing that aligns with human perception. Extensive experiments demonstrate IE-Critic-R1’s superior subjective alignment on the text-driven image editing task compared with previous metrics. Related data and code are publicly available.

[59] VK-Det: Visual Knowledge Guided Prototype Learning for Open-Vocabulary Aerial Object Detection cs.CV

Jianhang Yao, Yongbin Zheng, Siqi Lu, Wanying Xu, Peng Sun

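TL;DR: VK-Det is an open-vocabulary aerial object detection framework that needs no extra supervision. It uses visual-knowledge-guided prototype learning to address the semantic bias caused by the text dependence of existing methods.

Details

Motivation: Open-vocabulary aerial object detection (OVAD) usually relies on text supervision, which induces semantic bias and limits the model's generalization to undefined categories. The authors therefore propose a visual-knowledge-guided framework that requires no extra supervision.

Result: The method achieves 30.1 mAP on DIOR and 23.3 mAP on DOTA, surpassing existing methods, including those that use extra supervision.

Insight: Visual knowledge, rather than text, can guide open-vocabulary detection more effectively, reducing semantic bias and improving generalization to novel categories.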

Abstract: To identify objects beyond predefined categories, open-vocabulary aerial object detection (OVAD) leverages the zero-shot capabilities of visual-language models (VLMs) to generalize from base to novel categories. Existing approaches typically utilize self-learning mechanisms with weak text supervision to generate region-level pseudo-labels to align detectors with VLMs' semantic spaces. However, text dependence induces semantic bias, restricting open-vocabulary expansion to text-specified concepts. We propose VK-Det, a Visual Knowledge-guided open-vocabulary object Detection framework without extra supervision. First, we discover and leverage the vision encoder’s inherent informative region perception to attain fine-grained localization and adaptive distillation. Second, we introduce a novel prototype-aware pseudo-labeling strategy. It models inter-class decision boundaries through feature clustering and maps detection regions to latent categories via prototype matching. This enhances attention to novel objects while compensating for missing supervision. Extensive experiments show state-of-the-art performance, achieving 30.1 $\mathrm{mAP}^{N}$ on DIOR and 23.3 $\mathrm{mAP}^{N}$ on DOTA, outperforming even extra supervised methods.

[60] ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models cs.CV | cs.RO

Wencheng Ye, Tianshi Wang, Lei Zhu, Fengling Li, Guoli Yang

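TL;DR: ActDistill proposes an action-guided self-distillation framework for efficiently compressing Vision-Language-Action (VLA) models. It uses action priors to guide knowledge transfer and model compression, significantly reducing computation and inference latency.

Details

Motivation: Current VLA models offer strong flexibility and generalization, but their heavy computational overhead and high inference latency limit practical deployment in robotic manipulation.

Result: Experiments show that ActDistill reduces computation by over 50% and achieves up to a 1.67x inference speedup while maintaining performance.

Insight: Combining action-guided distillation with dynamic routing provides a general paradigm for efficient robotic intelligence.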

Abstract: Recent Vision-Language-Action (VLA) models have shown impressive flexibility and generalization, yet their deployment in robotic manipulation remains limited by heavy computational overhead and inference latency. In this work, we present ActDistill, a general action-guided self-derived distillation framework that transfers the action prediction capability of any existing VLA model to a lightweight counterpart. Unlike previous efficiency strategies that primarily emphasize vision-language correlations, ActDistill leverages action priors to guide knowledge transfer and model compression, achieving action-oriented efficiency for VLA models. Specifically, we employ a well-trained VLA model as the teacher and introduce a graph-structured encapsulation strategy to explicitly model the hierarchical evolution of action prediction. The student model, derived from the graph-encapsulated teacher, is further equipped with a dynamic router that adaptively selects computation paths based on action prediction demands, guided by hierarchical graph-informed supervision to ensure smooth and efficient evolution. During inference, graph-related auxiliary components are removed, allowing the student to execute only dynamically routed layers and predict high-precision actions with minimal computation and latency. Experiments on embodied benchmarks demonstrate that ActDistill achieves comparable or superior performance to full-scale VLA models while reducing computation by over 50% with up to 1.67 times speedup, thereby establishing a general paradigm toward efficient embodied intelligence.

[61] Less Is More: An Explainable AI Framework for Lightweight Malaria Classification cs.CV

Md Abdullah Al Kafi, Raka Moni, Sumit Kumar Banshal

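TL;DR: The paper proposes EMFE, a lightweight and explainable AI framework that extracts two morphological features (the number of non-background pixels and the number of holes in the cell) and combines them with Logistic Regression and Random Forest to classify malaria cell images efficiently, with performance comparable to complex deep learning models.

Details

Motivation: Deep learning models perform well on medical image classification but are computationally expensive and lack interpretability. This study examines whether complex neural networks are needed for the simple binary classification of malaria cell images.

Result: The single-variable Logistic Regression model achieves 94.80% test accuracy with a 1.2 kB model size and 2.3 ms inference latency; the ensemble raises accuracy to 97.15%. By comparison, the deep learning models require 13.6 to 44.7 MB of storage and have higher inference times (68 ms).

Insight: For simple medical image classification tasks, feature engineering combined with lightweight models can maintain high performance while substantially reducing computational cost and improving interpretability, which suits environments with limited computing resources.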

Abstract: Background and Objective: Deep learning models have high computational needs and lack interpretability but are often the first choice for medical image classification tasks. This study addresses whether complex neural networks are essential for the simple binary classification task of malaria. We introduce the Extracted Morphological Feature Engineered (EMFE) pipeline, a transparent, reproducible, and low compute machine learning approach tailored explicitly for simple cell morphology, designed to achieve deep learning performance levels on a simple CPU only setup with the practical aim of real world deployment. Methods: The study used the NIH Malaria Cell Images dataset, with two features extracted from each cell image: the number of non background pixels and the number of holes within the cell. Logistic Regression and Random Forest were compared against ResNet18, DenseNet121, MobileNetV2, and EfficientNet across accuracy, model size, and CPU inference time. An ensemble model was created by combining Logistic Regression and Random Forests to achieve higher accuracy while retaining efficiency. Results: The single variable Logistic Regression model achieved a test accuracy of 94.80 percent with a file size of 1.2 kB and negligible inference latency (2.3 ms). The two stage ensemble improved accuracy to 97.15 percent. In contrast, the deep learning methods require 13.6 MB to 44.7 MB of storage and show significantly higher inference times (68 ms). Conclusion: This study shows that a compact feature engineering approach can produce clinically meaningful classification performance while offering gains in transparency, reproducibility, speed, and deployment feasibility. The proposed pipeline demonstrates that simple interpretable features paired with lightweight models can serve as a practical diagnostic solution for environments with limited computational resources.
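
The two features named in the Methods can be computed with a short flood-fill routine. The sketch below is one plausible reading, not the authors' pipeline: it assumes a segmented image given as a nested list where `background` marks empty pixels, and it defines a hole as a 4-connected background region that does not touch the image border. The function name `extract_features` is hypothetical.

```python
from collections import deque

def extract_features(image, background=0):
    """Return (non-background pixel count, number of enclosed holes)."""
    height, width = len(image), len(image[0])
    foreground = sum(1 for row in image for v in row if v != background)

    seen = [[False] * width for _ in range(height)]
    holes = 0
    for r in range(height):
        for c in range(width):
            if image[r][c] != background or seen[r][c]:
                continue
            # Flood-fill one connected background region.
            touches_border = False
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                if y in (0, height - 1) or x in (0, width - 1):
                    touches_border = True
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and image[ny][nx] == background):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # A background region fully enclosed by the cell counts as a hole.
            if not touches_border:
                holes += 1
    return foreground, holes

cell = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 0, 1, 0, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
print(extract_features(cell))  # (13, 2)
```

The resulting pair of numbers is the entire input to a classifier in this style of pipeline, which is what keeps the model small and its decisions easy to inspect.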

[62] Together, Then Apart: Revisiting Multimodal Survival Analysis via a Min-Max Perspective cs.CV

Wenjing Liu, Qin Ren, Wen Zhang, Yuewei Lin, Chenyu You

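TL;DR: The paper proposes Together-Then-Apart (TTA), a new framework that addresses modality alignment and the preservation of modality-specific information in multimodal survival analysis through the dual perspective of minimizing semantic discrepancy and maximizing representational diversity.

Details

Motivation: Existing methods rely too heavily on cross-modal alignment and neglect modality-specific characteristics, leading to representation collapse and reduced diversity.

Result: TTA outperforms existing methods on five TCGA benchmarks.

Insight: Alignment and distinctiveness can be optimized jointly, which offers a new theoretical perspective and practical value for multimodal survival analysis.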

Abstract: Integrating heterogeneous modalities such as histopathology and genomics is central to advancing survival analysis, yet most existing methods prioritize cross-modal alignment through attention-based fusion mechanisms, often at the expense of modality-specific characteristics. This overemphasis on alignment leads to representation collapse and reduced diversity. In this work, we revisit multi-modal survival analysis via the dual lens of alignment and distinctiveness, positing that preserving modality-specific structure is as vital as achieving semantic coherence. We introduce Together-Then-Apart (TTA), a unified min-max optimization framework that simultaneously models shared and modality-specific representations. The Together stage minimizes semantic discrepancies by aligning embeddings via shared prototypes, guided by an unbalanced optimal transport objective that adaptively highlights informative tokens. The Apart stage maximizes representational diversity through modality anchors and a contrastive regularizer that preserve unique modality information and prevent feature collapse. Extensive experiments on five TCGA benchmarks show that TTA consistently outperforms state-of-the-art methods. Beyond empirical gains, our formulation provides a new theoretical perspective on how alignment and distinctiveness can be jointly achieved for robust, interpretable, and biologically meaningful multi-modal survival analysis.

[63] Spotlight: Identifying and Localizing Video Generation Errors Using VLMs cs.CV

Aditya Chinchure, Sahithya Ravi, Pushkar Shukla, Vered Shwartz, Leonid Sigal

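TL;DR: Spotlight introduces a new task for localizing and explaining errors in generated videos. Using 600 videos and more than 1,600 fine-grained annotated errors, it shows that VLMs fall short at identifying and localizing video errors.

Details

Motivation: Existing text-to-video (T2V) models can generate high-quality videos, but local, fine-grained errors still occur, and current evaluation methods cannot identify or describe them.

Result: VLMs perform significantly below humans at identifying and localizing video errors, but inference-time strategies improve their performance by nearly 2x.

Insight: The Spotlight task points toward fine-grained video evaluation tools and better reward models for video generators.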

Abstract: Current text-to-video models (T2V) can generate high-quality, temporally coherent, and visually realistic videos. Nonetheless, errors still often occur, and are more nuanced and local compared to the previous generation of T2V models. While current evaluation paradigms assess video models across diverse dimensions, they typically evaluate videos holistically without identifying when specific errors occur or describing their nature. We address this gap by introducing Spotlight, a novel task aimed at localizing and explaining video-generation errors. We generate 600 videos using 200 diverse textual prompts and three state-of-the-art video generators (Veo 3, Seedance, and LTX-2), and annotate over 1600 fine-grained errors across six types, including motion, physics, and prompt adherence. We observe that adherence and physics errors are predominant and persist across longer segments, whereas appearance-disappearance and body pose errors manifest in shorter segments. We then evaluate current VLMs on Spotlight and find that VLMs lag significantly behind humans in error identification and localization in videos. We propose inference-time strategies to probe the limits of current VLMs on our task, improving performance by nearly 2x. Our task paves a way forward to building fine-grained evaluation tools and more sophisticated reward models for video generators.

[64] Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning cs.CV

Xiaohong Liu, Xiufeng Song, Huayu Zheng, Lei Bai, Xiaoming Liu

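TL;DR: This paper proposes MM-Det++, a unified multimodal detection algorithm for detecting videos generated by diffusion models. It combines a spatio-temporal branch and a multimodal branch with a Unified Multimodal Learning module and performs well on video forgery detection.

Details

Motivation: The number of videos generated by diffusion models is surging and the associated information security risks are growing, yet existing forgery detection methods focus mainly on the image level, leaving generic video-level detection underexplored.

Result: Experiments show that MM-Det++ performs well at detecting diffusion-generated videos and demonstrate the effectiveness of unified multimodal learning.

Insight: Joint multimodal learning captures forgery traces more comprehensively from both spatio-temporal and semantic perspectives, improving detection performance.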

Abstract: The proliferation of videos generated by diffusion models has raised increasing concerns about information security, highlighting the urgent need for reliable detection of synthetic media. Existing methods primarily focus on image-level forgery detection, leaving generic video-level forgery detection largely underexplored. To advance video forensics, we propose a consolidated multimodal detection algorithm, named MM-Det++, specifically designed for detecting diffusion-generated videos. Our approach consists of two innovative branches and a Unified Multimodal Learning (UML) module. Specifically, the Spatio-Temporal (ST) branch employs a novel Frame-Centric Vision Transformer (FC-ViT) to aggregate spatio-temporal information for detecting diffusion-generated videos, where the FC-tokens enable the capture of holistic forgery traces from each video frame. In parallel, the Multimodal (MM) branch adopts a learnable reasoning paradigm to acquire Multimodal Forgery Representation (MFR) by harnessing the powerful comprehension and reasoning capabilities of Multimodal Large Language Models (MLLMs), which discerns the forgery traces from a flexible semantic perspective. To integrate multimodal representations into a coherent space, a UML module is introduced to consolidate the generalization ability of MM-Det++. In addition, we also establish a large-scale and comprehensive Diffusion Video Forensics (DVF) dataset to advance research in video forgery detection. Extensive experiments demonstrate the superiority of MM-Det++ and highlight the effectiveness of unified multimodal forgery learning in detecting diffusion-generated videos.

[65] Muskie: Multi-view Masked Image Modeling for 3D Vision Pre-training cs.CV

Wenyu Li, Sidun Liu, Peng Qiao, Yong Dou, Tongrui Hu

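TL;DR: Muskie is a native multi-view vision backbone designed for 3D vision tasks. By processing multiple views simultaneously and introducing multi-view consistency during pre-training, it outperforms existing frame-wise approaches.

Details

Motivation: Existing frame-wise models show limited multi-view consistency and cannot fully exploit the geometric information in multi-view images. Muskie addresses this with a native multi-view vision backbone.

Result: Muskie outperforms existing frame-wise backbones such as DINO in multi-view correspondence accuracy and on downstream tasks such as camera pose estimation and pointmap reconstruction.

Insight: Through a multi-view-consistent pretext task, the model implicitly learns view-invariant features and geometric understanding, providing a more effective backbone for 3D vision tasks.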

Abstract: We present Muskie, a native multi-view vision backbone designed for 3D vision tasks. Unlike existing models, which are frame-wise and exhibit limited multi-view consistency, Muskie is designed to process multiple views simultaneously and introduce multi-view consistency in the pre-training stage. Muskie is trained to reconstruct heavily masked content in one view by finding and utilizing geometric correspondences from other views. Through this pretext task and our proposed aggressive masking strategy, the model implicitly learns view-invariant features and develops strong geometric understanding without any 3D supervision. Compared with state-of-the-art frame-wise backbones such as DINO, Muskie achieves higher multi-view correspondence accuracy. Furthermore, we demonstrate that using Muskie as a backbone consistently enhances performance on downstream 3D tasks, including camera pose estimation and pointmap reconstruction. Code is publicly available at https://leo-frank.github.io/Muskie/

[66] PromptMoE: Generalizable Zero-Shot Anomaly Detection via Visually-Guided Prompt Mixtures cs.CV

Yuheng Shao, Lizhang Wang, Changhao Li, Peixian Chen, Qinyuan Liu

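TL;DR: PromptMoE is a zero-shot anomaly detection method built on a visually guided Mixture-of-Experts mechanism. It learns a pool of expert prompts and combines them dynamically under image guidance, addressing the generalization and representational bottlenecks of existing methods.

Details

Motivation: The prompt engineering strategies of existing zero-shot anomaly detection methods suffer from a representational bottleneck and overfitting, and struggle with the complexity and diversity of anomalies in unseen classes.

Result: Experiments on 15 industrial and medical datasets demonstrate PromptMoE's effectiveness and state-of-the-art performance.

Insight: Compositional prompt learning outperforms conventional single-prompt or fixed-prompt strategies and better captures the diversity and complexity of anomalies.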

Abstract: Zero-Shot Anomaly Detection (ZSAD) aims to identify and localize anomalous regions in images of unseen object classes. While recent methods based on vision-language models like CLIP show promise, their performance is constrained by existing prompt engineering strategies. Current approaches, whether relying on single fixed, learnable, or dense dynamic prompts, suffer from a representational bottleneck and are prone to overfitting on auxiliary data, failing to generalize to the complexity and diversity of unseen anomalies. To overcome these limitations, we propose PromptMoE. Our core insight is that robust ZSAD requires a compositional approach to prompt learning. Instead of learning monolithic prompts, PromptMoE learns a pool of expert prompts, which serve as a basis set of composable semantic primitives, and a visually-guided Mixture-of-Experts (MoE) mechanism to dynamically combine them for each instance. Our framework materializes this concept through a Visually-Guided Mixture of Prompt (VGMoP) that employs an image-gated sparse MoE to aggregate diverse normal and abnormal expert state prompts, generating semantically rich textual representations with strong generalization. Extensive experiments across 15 datasets in industrial and medical domains demonstrate the effectiveness and state-of-the-art performance of PromptMoE.
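
The general idea of an image-gated sparse mixture over a pool of prompt embeddings can be sketched as follows. This is a generic top-k MoE gate, not the paper's VGMoP module: the linear gate, the `gate_weights` matrix, and the function name are assumptions, and embeddings are plain Python lists.

```python
import math

def sparse_prompt_mixture(image_feature, gate_weights, expert_prompts, k=2):
    """Combine the top-k expert prompt embeddings for one image.

    image_feature:  list of floats describing the image
    gate_weights:   one weight row per expert (hypothetical linear gate)
    expert_prompts: one embedding per expert, all of the same length
    """
    # Gate logits: one score per expert, conditioned on the image.
    logits = [sum(w * x for w, x in zip(row, image_feature))
              for row in gate_weights]
    # Sparse routing: keep only the k highest-scoring experts.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected experts only (shifted for stability).
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    weights = {i: e / total for i, e in exps.items()}
    # Weighted sum of the selected expert prompt embeddings.
    dim = len(expert_prompts[0])
    mixed = [sum(weights[i] * expert_prompts[i][d] for i in top)
             for d in range(dim)]
    return mixed, weights
```

Because the gate depends on the image, two images can draw on different experts from the same pool, which is the compositional behavior the abstract contrasts with a single monolithic prompt.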

[67] MVS-TTA: Test-Time Adaptation for Multi-View Stereo via Meta-Auxiliary Learning cs.CV

Hannuo Zhang, Zhixiang Chi, Yang Wang, Xinxin Zuo

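TL;DR: MVS-TTA is an efficient test-time adaptation framework that uses a meta-auxiliary learning strategy and a self-supervised cross-view consistency loss to improve the adaptability of learning-based multi-view stereo methods.

Details

Motivation: Learning-based multi-view stereo methods have made remarkable data-driven progress, but their generalization is limited by fixed parameters and limited training data distributions. Optimization-based methods allow scene-specific adaptation but lack scalability and are computationally expensive.

Result: On standard datasets (DTU, BlendedMVS) and in cross-dataset generalization settings, MVS-TTA consistently improves performance, even for state-of-the-art MVS models.

Insight: By combining test-time adaptation with meta-learning, learning-based MVS methods can be optimized dynamically at inference time, improving generalization while avoiding the high cost of optimization-based methods.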

Abstract: Recent learning-based multi-view stereo (MVS) methods are data-driven and have achieved remarkable progress due to large-scale training data and advanced architectures. However, their generalization remains sub-optimal due to fixed model parameters trained on limited training data distributions. In contrast, optimization-based methods enable scene-specific adaptation but lack scalability and require costly per-scene optimization. In this paper, we propose MVS-TTA, an efficient test-time adaptation (TTA) framework that enhances the adaptability of learning-based MVS methods by bridging these two paradigms. Specifically, MVS-TTA employs a self-supervised, cross-view consistency loss as an auxiliary task to guide inference-time adaptation. We introduce a meta-auxiliary learning strategy to train the model to benefit from auxiliary-task-based updates explicitly. Our framework is model-agnostic and can be applied to a wide range of MVS methods with minimal architectural changes. Extensive experiments on standard datasets (DTU, BlendedMVS) and a challenging cross-dataset generalization setting demonstrate that MVS-TTA consistently improves performance, even when applied to state-of-the-art MVS models. To our knowledge, this is the first attempt to integrate optimization-based test-time adaptation into learning-based MVS using meta-learning. The code will be available at https://github.com/mart87987-svg/MVS-TTA.

[68] VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging cs.CV | cs.AI

Ming Zhong, Yuanlei Wang, Liuzhou Zhang, Arctanx An, Renrui Zhang

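TL;DR: VCU-Bridge proposes a framework for hierarchical visual connotation understanding that mimics human reasoning from foundational perception to abstract connotation. Using the HVCU-Bench benchmark and an MCTS-guided data generation method, it shows that a hierarchical reasoning pattern improves MLLM performance.

Details

Motivation: Existing MLLMs lack the hierarchical reasoning that humans naturally apply to visual information, and evaluation protocols often ignore the dependence of high-level reasoning on low-level perception, which obscures performance bottlenecks. The paper aims to close this gap.

Result: Experiments show that MLLM performance declines at higher reasoning levels, and that strengthening low-level capabilities yields clear gains at higher levels (improvements on HVCU-Bench, and +7.26% on MMStar).

Insight: A hierarchical reasoning pattern is crucial for improving MLLM capabilities; improvements in low-level perception directly drive gains in high-level abstract reasoning.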

Abstract: While Multimodal Large Language Models (MLLMs) excel on benchmarks, their processing paradigm differs from the human ability to integrate visual information. Unlike humans who naturally bridge details and high-level concepts, models tend to treat these elements in isolation. Prevailing evaluation protocols often decouple low-level perception from high-level reasoning, overlooking their semantic and causal dependencies, which yields non-diagnostic results and obscures performance bottlenecks. We present VCU-Bridge, a framework that operationalizes a human-like hierarchy of visual connotation understanding: multi-level reasoning that advances from foundational perception through semantic bridging to abstract connotation, with an explicit evidence-to-inference trace from concrete cues to abstract conclusions. Building on this framework, we construct HVCU-Bench, a benchmark for hierarchical visual connotation understanding with explicit, level-wise diagnostics. Comprehensive experiments demonstrate a consistent decline in performance as reasoning progresses to higher levels. We further develop a data generation pipeline for instruction tuning guided by Monte Carlo Tree Search (MCTS) and show that strengthening low-level capabilities yields measurable gains at higher levels. Interestingly, it not only improves on HVCU-Bench but also brings benefits on general benchmarks (average +2.53%), especially with substantial gains on MMStar (+7.26%), demonstrating the significance of the hierarchical thinking pattern and its effectiveness in enhancing MLLM capabilities. The project page is at https://vcu-bridge.github.io.

[69] Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models cs.CV | cs.AI | cs.CL | cs.LG

Dachuan Zhao, Weiyue Li, Zhenda Shen, Yushu Qiu, Bowen Xu

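TL;DR: This paper proposes SPD, a debiasing method that identifies and removes the subspace of linearly decodable bias, overcoming the limitations of conventional coordinate-replacement methods and achieving a better balance between fairness and performance.

Details

Motivation: The representations of vision-language models (VLMs) often encode and amplify demographic biases, leading to unfair predictions in downstream tasks. Existing methods replace only the embedding coordinates most correlated with an attribute, with limited effect.

Result: Across zero-shot classification, text-to-image retrieval, and image generation, SPD improves four fairness metrics by an average of 18.5% with minimal loss in task performance.

Insight: Bias has geometric structure: coordinate-replacement methods cannot fully remove it, whereas subspace-level intervention is more effective.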

Abstract: Vision-Language Models (VLMs) have become indispensable for multimodal reasoning, yet their representations often encode and amplify demographic biases, resulting in biased associations and misaligned predictions in downstream tasks. Such behavior undermines fairness and distorts the intended alignment between vision and language. Recent post-hoc approaches attempt to mitigate bias by replacing the most attribute-correlated embedding coordinates with neutral values. However, our systematic analysis reveals three critical failures of this coordinate-wise approach: feature entanglement, poor cross-dataset generalization, and incomplete bias removal. We find that bias is not localized to a few coordinates but is instead distributed across a few linear subspaces. To address these limitations, we propose Subspace Projection Debiasing (SPD), a geometrically principled framework that identifies and removes the entire subspace of linearly decodable bias while reinserting a neutral mean component to preserve semantic fidelity. Extensive experiments across zero-shot classification, text-to-image retrieval, and image generation validate the effectiveness of SPD: our method achieves more robust debiasing with an average improvement of 18.5% across four fairness metrics, while maintaining minimal loss in task performance compared to the best debiasing baseline.
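
The geometric step described here, removing a subspace and reinserting a neutral mean component, can be written in a few lines. This sketch covers only the projection: how the bias directions are found is not shown, so `bias_directions` is a hypothetical input (for example, weight vectors of linear attribute probes), and the function names are not from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: turn bias directions into an orthonormal basis."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip directions already spanned by the basis
            basis.append([wi / norm for wi in w])
    return basis

def project_out(x, basis, neutral_mean):
    """Replace the component of x inside the bias subspace with the
    corresponding component of a neutral mean embedding."""
    out = list(x)
    for b in basis:
        shift = dot(neutral_mean, b) - dot(x, b)
        out = [oi + shift * bi for oi, bi in zip(out, b)]
    return out

bias_directions = [[1.0, 1.0, 0.0]]  # hypothetical, e.g. from a linear probe
basis = orthonormalize(bias_directions)
debiased = project_out([3.0, 1.0, 2.0], basis, neutral_mean=[1.0, 1.0, 5.0])
```

After the projection, every embedding has the same coordinates inside the bias subspace (those of the neutral mean), while everything orthogonal to it is left untouched. Zeroing individual coordinates would not achieve this when the bias direction is spread across several coordinates, as it is in this example.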

[70] SFHand: A Streaming Framework for Language-guided 3D Hand Forecasting and Embodied Manipulation cs.CVPDF

Ruicong Liu, Yifei Huang, Liangyang Ouyang, Caixin Kang, Yoichi Sato

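TL;DR: SFHand is a real-time streaming framework for language-guided 3D hand forecasting and embodied manipulation. It combines video and language instructions to predict a comprehensive set of future 3D hand states and achieves substantial gains on multiple tasks.

Details

Motivation: Existing methods cannot meet the demands of real-time human-computer interaction: they typically require offline access to video sequences and cannot incorporate language guidance that conveys task intent. SFHand is designed to address these problems.

Result: SFHand outperforms prior methods on 3D hand forecasting by up to 35.8% and improves success rates on downstream embodied manipulation tasks by up to 13.4%.

Insight: Language guidance can markedly improve the accuracy of 3D hand forecasting, and the streaming architecture with an ROI-enhanced memory layer captures temporal context and salient hand regions effectively.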

Abstract: Real-time 3D hand forecasting is a critical component for fluid human-computer interaction in applications like AR and assistive robotics. However, existing methods are ill-suited for these scenarios, as they typically require offline access to accumulated video sequences and cannot incorporate language guidance that conveys task intent. To overcome these limitations, we introduce SFHand, the first streaming framework for language-guided 3D hand forecasting. SFHand autoregressively predicts a comprehensive set of future 3D hand states, including hand type, 2D bounding box, 3D pose, and trajectory, from a continuous stream of video and language instructions. Our framework combines a streaming autoregressive architecture with an ROI-enhanced memory layer, capturing temporal context while focusing on salient hand-centric regions. To enable this research, we also introduce EgoHaFL, the first large-scale dataset featuring synchronized 3D hand poses and language instructions. We demonstrate that SFHand achieves new state-of-the-art results in 3D hand forecasting, outperforming prior work by a significant margin of up to 35.8%. Furthermore, we show the practical utility of our learned representations by transferring them to downstream embodied manipulation tasks, improving task success rates by up to 13.4% on multiple benchmarks. Dataset page: https://huggingface.co/datasets/ut-vision/EgoHaFL, project page: https://github.com/ut-vision/SFHand.

[71] Video4Edit: Viewing Image Editing as a Degenerate Temporal Process cs.CVPDF

Xiaofan Li, Yanpeng Sun, Chenming Wu, Fan Duan, YuAn Wang

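TL;DR: This paper treats image editing as a degenerate temporal process and transfers single-frame evolution priors learned during video pre-training, matching the performance of mainstream editing models with very little supervised data.

Details

Motivation: Multimodal foundation models (e.g., diffusion/flow models) have made notable progress in image generation and editing, but editing pipelines still require large numbers of high-quality triplets (instruction, source image, edited image) to cover diverse user intents, and the fidelity of visual replacements depends on how precisely the instruction references the target semantics. Training is also costly.

Result: Experiments show performance on par with leading open-source baselines while using only about 1% of the supervision required by mainstream editing models.

Insight: 1. Temporal modeling offers a new perspective on image editing; 2. Priors from video pre-training can substantially improve the data efficiency of image editing; 3. The method reduces the dependence on high-quality triplet data, making practical deployment cheaper.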

Abstract: We observe that recent advances in multimodal foundation models have propelled instruction-driven image generation and editing into a genuinely cross-modal, cooperative regime. Nevertheless, state-of-the-art editing pipelines remain costly: beyond training large diffusion/flow models, they require curating massive high-quality triplets of {instruction, source image, edited image} to cover diverse user intents. Moreover, the fidelity of visual replacements hinges on how precisely the instruction references the target semantics. We revisit this challenge through the lens of temporal modeling: if video can be regarded as a full temporal process, then image editing can be seen as a degenerate temporal process. This perspective allows us to transfer single-frame evolution priors from video pre-training, enabling a highly data-efficient fine-tuning regime. Empirically, our approach matches the performance of leading open-source baselines while using only about one percent of the supervision demanded by mainstream editing models.

[72] SCALER: SAM-Enhanced Collaborative Learning for Label-Deficient Concealed Object Segmentation cs.CV | cs.AIPDF

Chunming He, Rihan Zhang, Longxiang Tang, Ziyun Yang, Kai Li

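TL;DR: SCALER is a unified collaborative framework for label-deficient concealed object segmentation (LDCOS) that combines consistency constraints with SAM-based supervision to jointly optimize a mean-teacher segmenter and a learnable SAM.

Details

Motivation: Existing LDCOS methods perform poorly because targets are intrinsically concealed and annotations are scarce. SCALER aims to exploit complementary information and strengthen the segmenter by using consistency constraints and SAM supervision together.

Result: Experiments show consistent performance gains across eight semi- and weakly-supervised COS tasks, suggesting that SCALER can serve as a general training paradigm.

Insight: The segmenter and SAM can strengthen each other through reciprocal guidance, which offers a new approach to training under label scarcity.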

Abstract: Existing methods for label-deficient concealed object segmentation (LDCOS) either rely on consistency constraints or Segment Anything Model (SAM)-based pseudo-labeling. However, their performance remains limited due to the intrinsic concealment of targets and the scarcity of annotations. This study investigates two key questions: (1) Can consistency constraints and SAM-based supervision be jointly integrated to better exploit complementary information and enhance the segmenter? and (2) beyond that, can the segmenter in turn guide SAM through reciprocal supervision, enabling mutual improvement? To answer these questions, we present SCALER, a unified collaborative framework toward LDCOS that jointly optimizes a mean-teacher segmenter and a learnable SAM. SCALER operates in two alternating phases. In Phase I, the segmenter is optimized under fixed SAM supervision using entropy-based image-level and uncertainty-based pixel-level weighting to select reliable pseudo-label regions and emphasize harder examples. In Phase II, SAM is updated via augmentation invariance and noise resistance losses, leveraging its inherent robustness to perturbations. Experiments demonstrate that SCALER yields consistent performance gains across eight semi- and weakly-supervised COS tasks. The results further suggest that SCALER can serve as a general training paradigm to enhance both lightweight segmenters and large foundation models under label-scarce conditions. Code will be released.

[73] Compact neural networks for astronomy with optimal transport bias correction cs.CVPDF

Shuhuan Wang, Yuzhen Xie, Jiayi Li

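TL;DR: WaveletMamba combines wavelet decomposition, state-space modeling, and bias correction to deliver high-resolution-level accuracy efficiently from low-resolution inputs in astronomical image classification.

Details

Motivation: Astronomical imaging faces an efficiency-resolution tradeoff that limits large-scale morphological classification and redshift prediction.

Result: The model reaches 81.72% classification accuracy at 64x64 resolution, matching high-resolution performance from low-resolution inputs, with a 9.7x gain in computational efficiency.

Insight: The authors argue that mathematical rigor enables unprecedented efficiency and comprehensive bias correction in scientific AI, supporting cross-disciplinary scientific discovery.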

Abstract: Astronomical imaging confronts an efficiency-resolution tradeoff that limits large-scale morphological classification and redshift prediction. We introduce WaveletMamba, a theory-driven framework integrating wavelet decomposition with state-space modeling, mathematical regularization, and multi-level bias correction. WaveletMamba achieves 81.72% +/- 0.53% classification accuracy at 64x64 resolution with only 3.54M parameters, delivering high-resolution performance (80.93% +/- 0.27% at 244x244) at low-resolution inputs with 9.7x computational efficiency gains. The framework exhibits Resolution Multistability, where models trained on low-resolution data achieve consistent accuracy across different input scales despite divergent internal representations. The framework’s multi-level bias correction synergizes HK distance (distribution-level optimal transport) with Color-Aware Weighting (sample-level fine-tuning), achieving 22.96% Log-MSE improvement and 26.10% outlier reduction without explicit selection function modeling. Here, we show that mathematical rigor enables unprecedented efficiency and comprehensive bias correction in scientific AI, bridging computer vision and astrophysics to revolutionize interdisciplinary scientific discovery.

Chunming He, Rihan Zhang, Zheng Chen, Bowen Yang, Chengyu Fang

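TL;DR: This paper proposes UnfoldLDM, which combines deep unfolding networks (DUNs) with a latent diffusion model (LDM) for blind image restoration (BIR). A multi-granularity degradation-aware (MGDA) module and a degradation-resistant LDM (DR-LDM) address the degradation-specific dependency and over-smoothing of existing DUNs.

Details

Motivation: Existing deep unfolding networks suffer from degradation-specific dependency and an over-smoothing bias in blind image restoration, which limits their performance. This paper aims to overcome both issues by integrating a latent diffusion model.

Result: Experiments show that UnfoldLDM achieves leading results on a range of BIR tasks and improves downstream task performance.

Insight: Combining model-driven and data-driven methods can effectively handle complex degradations in blind image restoration.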

Abstract: Deep unfolding networks (DUNs) combine the interpretability of model-based methods with the learning ability of deep networks, yet remain limited for blind image restoration (BIR). Existing DUNs suffer from: (1) Degradation-specific dependency, as their optimization frameworks are tied to a known degradation model, making them unsuitable for BIR tasks; and (2) Over-smoothing bias, resulting from the direct feeding of gradient descent outputs, dominated by low-frequency content, into the proximal term, suppressing fine textures. To overcome these issues, we propose UnfoldLDM to integrate DUNs with a latent diffusion model (LDM) for BIR. In each stage, UnfoldLDM employs a multi-granularity degradation-aware (MGDA) module as the gradient descent step. MGDA models BIR as an unknown degradation estimation problem and estimates both the holistic degradation matrix and its decomposed forms, enabling robust degradation removal. For the proximal step, we design a degradation-resistant LDM (DR-LDM) to extract compact degradation-invariant priors from the MGDA output. Guided by this prior, an over-smoothing correction transformer (OCFormer) explicitly recovers high-frequency components and enhances texture details. This unique combination ensures the final result is degradation-free and visually rich. Experiments show that our UnfoldLDM achieves a leading place on various BIR tasks and benefits downstream tasks. Moreover, our design is compatible with existing DUN-based methods, serving as a plug-and-play framework. Code will be released.

[75] Assessing the alignment between infants’ visual and linguistic experience using multimodal language models cs.CV | cs.CLPDF

Alvin Wei Ming Tan, Jane Yang, Tarun Sepuri, Khai Loong Aw, Robert Z. Sparks

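TL;DR: The study uses CLIP models to automatically analyze the temporal alignment between infants' visual and linguistic experience, finding that idealized aligned moments are relatively rare in infants' everyday experience and that alignment varies.

Details

Motivation: When infants learn language, temporal alignment between visual and linguistic experience is key to working out what words refer to. Manual annotation of this alignment is slow and labor-intensive, so an automated approach is needed.

Result: Idealized aligned moments (e.g., an object being mentioned while it is in view) are rare in infants' everyday experience, and alignment varies substantially across infants and contexts.

Insight: Infrequent alignment is a constraint for models of infant word learning and highlights the gap between current machine learning datasets and real infant experience. The approach also provides a new methodological tool for future research.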

Abstract: Figuring out which objects or concepts words refer to is a central language learning challenge for young children. Most models of this process posit that children learn early object labels from co-occurrences of words and their referents that occur when someone around them talks about an object in the immediate physical environment. But how aligned in time are children’s visual and linguistic experiences during everyday learning? To date, answers to this question have been limited by the need for labor-intensive manual annotations of vision-language co-occurrences. Here, we evaluate the use of contrastive language-image pretraining (CLIP) models to automatically characterize vision-language alignment in egocentric videos taken from the infant perspective in home environments. After validating CLIP alignment scores using human alignment judgments, we apply this metric to a large corpus of infant-perspective videos. We show that idealized aligned moments for learning (e.g., “look at the ball” with a ball present in the child’s view) are relatively rare in children’s everyday experiences compared to modern machine learning datasets, and highlight variability in alignment both within and across children. These findings suggest that infrequent alignment is a constraint for models describing early word learning and offer a new method for investigating children’s multimodal environment.
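
A CLIP-style alignment score is commonly computed as the cosine similarity between an image embedding and a text embedding. The sketch below assumes the embeddings have already been produced by some CLIP model (no model is loaded here) and that one frame embedding is paired with one utterance embedding. The `threshold` used to call a moment "aligned" is a hypothetical parameter, not a value from the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embeddings; 0.0 if either is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def alignment_scores(frame_embeddings: Sequence[Sequence[float]],
                     utterance_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Score each (frame, utterance) pair that occurs at the same moment."""
    if len(frame_embeddings) != len(utterance_embeddings):
        raise ValueError("expected one utterance embedding per frame embedding")
    return [cosine_similarity(f, u) for f, u in zip(frame_embeddings, utterance_embeddings)]


def fraction_aligned(scores: Sequence[float], threshold: float) -> float:
    """Share of moments whose score reaches the (assumed) alignment threshold."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s >= threshold) / len(scores)
```

Comparing `fraction_aligned` between an infant-perspective corpus and a captioned image dataset is one way to express the rarity of aligned moments as a single number.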

[76] Matching-Based Few-Shot Semantic Segmentation Models Are Interpretable by Design cs.CVPDF

Pasquale De Marinis, Uzay Kaymak, Rogier Brussee, Gennaro Vessio, Giovanna Castellano

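TL;DR: This paper proposes Affinity Explainer, a method dedicated to interpreting matching-based few-shot semantic segmentation (FSS) models. It exploits the models' structural properties to produce attribution maps that highlight the support-image pixels contributing most to the query segmentation prediction.

Details

Motivation: FSS models perform well in data-scarce settings, but their decision processes are largely opaque. Explainable AI has advanced in other computer vision tasks yet remains underexplored for FSS.

Result: Experiments show that the method significantly outperforms adapted standard attribution methods and yields structured, coherent attention patterns that aid model diagnosis and understanding.

Insight: Structure-aware explanations can reveal the basis of FSS model decisions, laying the groundwork for interpretable FSS research.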

Abstract: Few-Shot Semantic Segmentation (FSS) models achieve strong performance in segmenting novel classes with minimal labeled examples, yet their decision-making processes remain largely opaque. While explainable AI has advanced significantly in standard computer vision tasks, interpretability in FSS remains virtually unexplored despite its critical importance for understanding model behavior and guiding support set selection in data-scarce scenarios. This paper introduces the first dedicated method for interpreting matching-based FSS models by leveraging their inherent structural properties. Our Affinity Explainer approach extracts attribution maps that highlight which pixels in support images contribute most to query segmentation predictions, using matching scores computed between support and query features at multiple feature levels. We extend standard interpretability evaluation metrics to the FSS domain and propose additional metrics to better capture the practical utility of explanations in few-shot scenarios. Comprehensive experiments on FSS benchmark datasets, using different models, demonstrate that our Affinity Explainer significantly outperforms adapted standard attribution methods. Qualitative analysis reveals that our explanations provide structured, coherent attention patterns that align with model architectures and enable effective model diagnosis. This work establishes the foundation for interpretable FSS research, enabling better model understanding and diagnostics for more reliable few-shot segmentation systems. The source code is publicly available at https://github.com/pasqualedem/AffinityExplainer.

[77] Nested Unfolding Network for Real-World Concealed Object Segmentation cs.CV | cs.AIPDF

Chunming He, Rihan Zhang, Dingming Zhang, Fengyang Xiao, Deng-Ping Fan

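TL;DR: This paper proposes NUN, a nested unfolding network that tackles image degradation in real-world concealed object segmentation (COS). Embedding a degradation-resistant unfolding network (DeRUN) within a segmentation-oriented unfolding network (SODUN) decouples restoration from segmentation while allowing mutual refinement.

Details

Motivation: Existing methods based on deep unfolding networks (DUNs) couple background estimation with image restoration, which causes conflicting objectives, and they rely on pre-defined degradation types, which limits their real-world applicability.

Result: Experiments show that NUN achieves leading performance on both clean and degraded benchmarks.

Insight: Through its nested design and dynamic inference of degradation semantics, NUN shows stronger adaptability and robustness in real-world COS.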

Abstract: Deep unfolding networks (DUNs) have recently advanced concealed object segmentation (COS) by modeling segmentation as iterative foreground-background separation. However, existing DUN-based methods (RUN) inherently couple background estimation with image restoration, leading to conflicting objectives and requiring pre-defined degradation types, which are unrealistic in real-world scenarios. To address this, we propose the nested unfolding network (NUN), a unified framework for real-world COS. NUN adopts a DUN-in-DUN design, embedding a degradation-resistant unfolding network (DeRUN) within each stage of a segmentation-oriented unfolding network (SODUN). This design decouples restoration from segmentation while allowing mutual refinement. Guided by a vision-language model (VLM), DeRUN dynamically infers degradation semantics and restores high-quality images without explicit priors, whereas SODUN performs reversible estimation to refine foreground and background. Leveraging the multi-stage nature of unfolding, NUN employs image-quality assessment to select the best DeRUN outputs for subsequent stages, naturally introducing a self-consistency loss that enhances robustness. Extensive experiments show that NUN achieves a leading place on both clean and degraded benchmarks. Code will be released.

[78] EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses cs.CVPDF

Enrico Pallotta, Sina Mokhtarzadeh Azar, Lars Doorenbos, Serdar Ozsoy, Umar Iqbal

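TL;DR: EgoControl is a diffusion-based method for controllable egocentric video generation, with fine-grained control through 3D full-body pose sequences.

Details

Motivation: Embodied AI agents that simulate, predict, and plan actions need egocentric video generation with fine-grained control through body motion.

Result: Experiments show that EgoControl generates high-quality, pose-consistent egocentric videos.

Insight: EgoControl opens new possibilities for controllable embodied video simulation and understanding and shows the potential of diffusion models for pose-controlled generation.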

Abstract: Egocentric video generation with fine-grained control through body motion is a key requirement towards embodied AI agents that can simulate, predict, and plan actions. In this work, we propose EgoControl, a pose-controllable video diffusion model trained on egocentric data. We train a video prediction model to condition future frame generation on explicit 3D body pose sequences. To achieve precise motion control, we introduce a novel pose representation that captures both global camera dynamics and articulated body movements, and integrate it through a dedicated control mechanism within the diffusion process. Given a short sequence of observed frames and a sequence of target poses, EgoControl generates temporally coherent and visually realistic future frames that align with the provided pose control. Experimental results demonstrate that EgoControl produces high-quality, pose-consistent egocentric videos, paving the way toward controllable embodied video simulation and understanding.

[79] Unified Spherical Frontend: Learning Rotation-Equivariant Representations of Spherical Images from Any Camera cs.CVPDF

Mukai Yu, Mosam Dabhi, Liuyue Xie, Sebastian Scherer, László A. Jeni

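TL;DR: USF is a general spherical frontend that converts images from any calibrated camera into a unit-sphere representation and performs spherical resampling, convolution, and pooling directly in the spatial domain, avoiding costly spherical harmonic transforms.

Details

Motivation: Modern perception systems widely use wide field-of-view cameras, but conventional planar CNNs misrepresent spatial adjacency in spherical images and are sensitive to global rotations, while spherical harmonic transforms constrain resolution and efficiency.

Result: USF processes high-resolution spherical imagery efficiently, loses less than 1% performance under random test-time rotations, and generalizes zero-shot to unseen wide-FoV lenses.

Insight: Processing spherical images directly in the spatial domain is feasible and avoids the computational cost of harmonic-transform approaches, giving an efficient solution for wide-FoV imagery.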

Abstract: Modern perception increasingly relies on fisheye, panoramic, and other wide field-of-view (FoV) cameras, yet most pipelines still apply planar CNNs designed for pinhole imagery on 2D grids, where image-space neighborhoods misrepresent physical adjacency and models are sensitive to global rotations. Frequency-domain spherical CNNs partially address this mismatch but require costly spherical harmonic transforms that constrain resolution and efficiency. We introduce the Unified Spherical Frontend (USF), a lens-agnostic framework that transforms images from any calibrated camera into a unit-sphere representation via ray-direction correspondences, and performs spherical resampling, convolution, and pooling directly in the spatial domain. USF is modular: projection, location sampling, interpolation, and resolution control are fully decoupled. Its distance-only spherical kernels offer configurable rotation-equivariance (mirroring translation-equivariance in planar CNNs) while avoiding harmonic transforms entirely. We compare standard planar backbones with their spherical counterparts across classification, detection, and segmentation tasks on synthetic (Spherical MNIST) and real-world datasets (PANDORA, Stanford 2D-3D-S), and stress-test robustness to extreme lens distortions, varying FoV, and arbitrary rotations. USF processes high-resolution spherical imagery efficiently and maintains less than 1% performance drop under random test-time rotations, even without rotational augmentation, and even enables zero-shot generalization from one lens type to unseen wide-FoV lenses with minimal performance degradation.
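
The ray-direction correspondence mentioned in the abstract can be sketched for two standard camera models. This is an illustration under stated assumptions, not USF's code: the pinhole model and the equidistant fisheye model (r = f * theta) are textbook projections, and all function and parameter names here are hypothetical. The geodesic distance function shows why a kernel that depends only on angular distance is unaffected by a global rotation: rotating both directions leaves their dot product unchanged.

```python
import math
from typing import Sequence, Tuple

Ray = Tuple[float, float, float]


def pinhole_pixel_to_ray(u: float, v: float, fx: float, fy: float, cx: float, cy: float) -> Ray:
    """Back-project a pixel of a calibrated pinhole camera to a unit direction."""
    x, y, z = (u - cx) / fx, (v - cy) / fy, 1.0
    norm = math.sqrt(x * x + y * y + z * z)
    return (x / norm, y / norm, z / norm)


def equidistant_fisheye_pixel_to_ray(u: float, v: float, f: float, cx: float, cy: float) -> Ray:
    """Back-project a pixel under the equidistant fisheye model r = f * theta."""
    dx, dy = u - cx, v - cy
    r = math.hypot(dx, dy)
    if r == 0.0:
        return (0.0, 0.0, 1.0)  # the principal point looks along the optical axis
    theta = r / f  # angle between the ray and the optical axis
    s = math.sin(theta)
    return (s * dx / r, s * dy / r, math.cos(theta))


def geodesic_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Great-circle angle (radians) between two unit directions."""
    d = sum(x * y for x, y in zip(a, b))
    return math.acos(max(-1.0, min(1.0, d)))  # clamp against rounding error
```

Once every pixel of every lens type is expressed as a point on the same unit sphere, downstream layers no longer need to know which projection produced the image.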

[80] Early Lung Cancer Diagnosis from Virtual Follow-up LDCT Generation via Correlational Autoencoder and Latent Flow Matching cs.CVPDF

Yutong Wu, Yifan Wang, Qining Zhang, Chuan Zhou, Lei Ying

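TL;DR: The paper proposes CorrFlowNet, a generative method that produces a virtual one-year follow-up CT scan from a baseline CT scan, enabling earlier discrimination of malignant from benign lung nodules.

Details

Motivation: Early diagnosis of lung cancer is critical for survival, but clinical follow-up requires repeated CT examinations, which can delay treatment past the optimal window. Most existing AI methods extract features from a single CT scan and do not capture how lesions change over time.

Result: On a real clinical dataset, the method improves diagnostic accuracy significantly over baseline models and is comparable to using real follow-up scans.

Insight: Generating virtual follow-up images reduces the wait for clinical follow-ups and offers a new approach to early lung cancer diagnosis.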

Abstract: Lung cancer is one of the most commonly diagnosed cancers, and early diagnosis is critical because the survival rate declines sharply once the disease progresses to advanced stages. However, achieving an early diagnosis remains challenging, particularly in distinguishing subtle early signals of malignancy from those of benign conditions. In clinical practice, a patient with a high risk may need to undergo an initial baseline and several annual follow-up examinations (e.g., CT scans) before receiving a definitive diagnosis, which can result in missing the optimal treatment. Recently, Artificial Intelligence (AI) methods have been increasingly used for early diagnosis of lung cancer, but most existing algorithms focus on radiomic features extraction from single early-stage CT scans. Inspired by recent advances in diffusion models for image generation, this paper proposes a generative method, named CorrFlowNet, which creates a virtual, one-year follow-up CT scan after the initial baseline scan. This virtual follow-up would allow for an early detection of malignant/benign nodules, reducing the need to wait for clinical follow-ups. During training, our approach employs a correlational autoencoder to encode both early baseline and follow-up CT images into a latent space that captures the dynamics of nodule progression as well as the correlations between them, followed by a flow matching algorithm on the latent space with a neural ordinary differential equation. An auxiliary classifier is used to further enhance the diagnostic accuracy. Evaluations on a real clinical dataset show our method can significantly improve downstream lung nodule risk assessment compared with existing baseline models. Moreover, its diagnostic accuracy is comparable with real clinical CT follow-ups, highlighting its potential to improve cancer diagnosis.
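
The abstract does not spell out the flow matching objective, so the following is only a sketch of one common formulation (a straight-line path between paired latents, as in rectified-flow-style training), which may differ from the paper's. Here `x0` stands for a baseline-scan latent and `x1` for the follow-up latent; both names, and the use of plain Euler steps for the neural ODE, are assumptions for illustration.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def interpolate(x0: Sequence[float], x1: Sequence[float], t: float) -> Vector:
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Sequence[float], x1: Sequence[float]) -> Vector:
    """Velocity of the straight path; constant in t for this formulation."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted: Sequence[float], x0: Sequence[float], x1: Sequence[float]) -> float:
    """Mean squared error between a predicted velocity and the target velocity."""
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def euler_integrate(velocity_fn: Callable[[Vector, float], Sequence[float]],
                    x0: Sequence[float], steps: int) -> Vector:
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = [float(v) for v in x0]
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

At training time a network would be fitted to minimize `flow_matching_loss` at points drawn from `interpolate`; at inference time `euler_integrate` with the learned velocity carries a baseline latent toward a predicted follow-up latent.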

[81] From Pixels to Posts: Retrieval-Augmented Fashion Captioning and Hashtag Generation cs.CV | cs.AI | cs.CLPDF

Moazzam Umer Gondal, Hamad Ul Qudous, Daniya Siddiqui, Asma Ahmad Farhan

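TL;DR: This paper proposes a retrieval-augmented framework for automatically generating captions and hashtags for fashion images. It combines multi-garment detection, attribute reasoning, and large language model (LLM) prompting to address the attribute-fidelity and domain-generalization problems of end-to-end captioners.

Details

Motivation: Conventional end-to-end image captioning models suffer from low attribute fidelity and poor domain generalization in fashion, and fail to produce visually grounded, stylistically rich text.

Result: The YOLO detector reaches an mAP@0.5 of 0.71 across nine garment categories. The RAG-LLM pipeline achieves a mean attribute coverage of 0.80 for captions and full coverage at the 50% threshold for hashtags, while the fine-tuned BLIP baseline gives higher lexical overlap but lower generalization.

Insight: Retrieval augmentation shows better interpretability and scalability for fashion content generation and suggests an approach for visually grounded text generation in other domains.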

Abstract: This paper introduces a retrieval-augmented framework for automatic fashion caption and hashtag generation, combining multi-garment detection, attribute reasoning, and Large Language Model (LLM) prompting. The system aims to produce visually grounded, descriptive, and stylistically interesting text for fashion imagery, overcoming the limitations of end-to-end captioners that have problems with attribute fidelity and domain generalization. The pipeline combines a YOLO-based detector for multi-garment localization, k-means clustering for dominant color extraction, and a CLIP-FAISS retrieval module for fabric and gender attribute inference based on a structured product index. These attributes, together with retrieved style examples, create a factual evidence pack that is used to guide an LLM to generate human-like captions and contextually rich hashtags. A fine-tuned BLIP model is used as a supervised baseline model for comparison. Experimental results show that the YOLO detector is able to obtain a mean Average Precision (mAP@0.5) of 0.71 for nine categories of garments. The RAG-LLM pipeline generates expressive attribute-aligned captions and achieves mean attribute coverage of 0.80 with full coverage at the 50% threshold in hashtag generation, whereas BLIP gives higher lexical overlap and lower generalization. The retrieval-augmented approach exhibits better factual grounding, less hallucination, and great potential for scalable deployment in various clothing domains. These results demonstrate the use of retrieval-augmented generation as an effective and interpretable paradigm for automated and visually grounded fashion content generation.

[82] ARIAL: An Agentic Framework for Document VQA with Precise Answer Localization cs.CV | cs.AIPDF

Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Dheeraj Kulshrestha, Rajiv Ramnath

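TL;DR: ARIAL is a modular framework in which an LLM-based planning agent orchestrates specialized tools to achieve precise answer extraction and spatial localization for document VQA, reaching state-of-the-art results on multiple benchmarks.

Details

Motivation: Existing document VQA systems trade off textual accuracy against spatial grounding, sacrificing either performance or interpretability. ARIAL aims to improve both through a modular design.

Result: On DocVQA, FUNSD, CORD, and SROIE, ARIAL achieves state-of-the-art results, e.g., 88.7 ANLS and 50.1 mAP on DocVQA, surpassing DLaVA.

Insight: Modular, agentic orchestration of specialized tools can improve both the performance and the interpretability of document AI systems.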

Abstract: Document Visual Question Answering (VQA) requires models to not only extract accurate textual answers but also precisely localize them within document images, a capability critical for interpretability in high-stakes applications. However, existing systems achieve strong textual accuracy while producing unreliable spatial grounding, or sacrifice performance for interpretability. We present ARIAL (Agentic Reasoning for Interpretable Answer Localization), a modular framework that orchestrates specialized tools through an LLM-based planning agent to achieve both precise answer extraction and reliable spatial grounding. ARIAL decomposes Document VQA into structured subtasks: OCR-based text extraction with TrOCR, retrieval-augmented context selection using semantic search, answer generation via a fine-tuned Gemma 3-27B model, and explicit bounding-box localization through text-to-region alignment. This modular architecture produces transparent reasoning traces, enabling tool-level auditability and independent component optimization. We evaluate ARIAL on four benchmarks (DocVQA, FUNSD, CORD, and SROIE) using both textual accuracy (ANLS) and spatial precision (mAP at IoU 0.50 to 0.95). ARIAL achieves state-of-the-art results across all datasets: 88.7 ANLS and 50.1 mAP on DocVQA, 90.0 ANLS and 50.3 mAP on FUNSD, 85.5 ANLS and 60.2 mAP on CORD, and 93.1 ANLS on SROIE, surpassing the previous best method (DLaVA) by +2.8 ANLS and +3.9 mAP on DocVQA. Our work demonstrates how agentic orchestration of specialized tools can simultaneously improve performance and interpretability, providing a pathway toward trustworthy, explainable document AI systems.
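
For readers unfamiliar with the textual metric, the sketch below shows the usual definition of ANLS (Average Normalized Levenshtein Similarity) with the standard threshold of 0.5. It is a plain-Python illustration, not the evaluation code used in the paper; the lowercasing and whitespace stripping are common preprocessing choices that are assumed here. Scores come out in the range 0 to 1, so a reported 88.7 corresponds to 0.887 on this scale.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str], references: Sequence[Sequence[str]],
         threshold: float = 0.5) -> float:
    """Mean over questions of the best thresholded similarity to any reference."""
    if len(predictions) != len(references):
        raise ValueError("expected one list of reference answers per prediction")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            # Answers that are too far from the reference get no credit.
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / len(predictions)
```

The threshold means that a near-miss such as an OCR slip still earns partial credit, while an unrelated answer scores zero.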

[83] InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity cs.CVPDF

Haoming Wang, Qiyao Xue, Wei Gao

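TL;DR: InfiniBench is a fully automated, customizable, and user-friendly benchmark generator that synthesizes a theoretically infinite variety of 3D scenes with parameterized control over scene complexity. Its innovations are an LLM-driven agentic framework, a flexible layout optimizer, and a task-aware camera trajectory optimization method.

Details

Motivation: Existing visual spatial reasoning benchmarks lack diversity and customizability and cannot isolate and analyze the failure modes of vision-language models (VLMs) under specific spatial conditions. InfiniBench aims to fill this gap.

Result: InfiniBench outperforms existing methods in prompt fidelity and physical plausibility, especially in high-complexity scenes.

Insight: InfiniBench provides a highly customizable evaluation tool for visual spatial reasoning, allowing a more complete analysis of VLM capabilities, particularly in complex scenes.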

Abstract: Modern vision-language models (VLMs) are expected to have abilities of spatial reasoning with diverse scene complexities, but evaluating such abilities is difficult due to the lack of benchmarks that are not only diverse and scalable but also fully customizable. Existing benchmarks offer limited customizability over the scene complexity and are incapable of isolating and analyzing specific VLM failure modes under distinct spatial conditions. To address this gap, instead of individually presenting benchmarks for different scene complexities, in this paper we present InfiniBench, a fully automated, customizable and user-friendly benchmark generator that can synthesize a theoretically infinite variety of 3D scenes with parameterized control on scene complexity. InfiniBench uniquely translates scene descriptions in natural language into photo-realistic videos with complex and physically plausible 3D layouts. This is achieved through three key innovations: 1) an LLM-based agentic framework that iteratively refines procedural scene constraints from scene descriptions; 2) a flexible cluster-based layout optimizer that generates dense and cluttered scenes previously intractable for procedural methods; and 3) a task-aware camera trajectory optimization method that renders scenes into videos with full object coverage as VLM input. Experiments demonstrate that InfiniBench outperforms state-of-the-art procedural and LLM-based 3D generation methods in prompt fidelity and physical plausibility, especially in high-complexity scenarios. We further showcase the usefulness of InfiniBench by generating benchmarks for representative spatial reasoning tasks including measurement, perspective-taking and spatiotemporal tracking.

[84] Large-Scale Pre-training Enables Multimodal AI Differentiation of Radiation Necrosis from Brain Metastasis Progression on Routine MRI cs.CVPDF

Ahmed Gomaa, Annette Schwarz, Ludwig Singer, Arnd Dörfler, Matthias Stefan May

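TL;DR: The paper proposes a self-supervised Vision Transformer (ViT) that, with large-scale pre-training and a multimodal input (T1CE MRI and segmentation masks), performs well at differentiating radiation necrosis from brain metastasis progression.

Details

Motivation: Conventional supervised learning is limited by scarce labeled data when differentiating radiation necrosis (RN) from brain metastasis progression. This work uses large-scale unlabeled MRI data for self-supervised pre-training to improve performance.

Result: The self-supervised model achieves AUCs of 0.916 and 0.764 on the same-center and external test sets, outperforming the supervised ViT (0.624/0.496) and radiomics (0.807/0.691). Multimodal integration further improves performance (0.947/0.821).

Insight: Large-scale pre-training on unlabeled data substantially improves model performance; multimodal input improves robustness; interpretability supports clinical decision-making.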

Abstract: Background: Differentiating radiation necrosis (RN) from tumor progression after stereotactic radiosurgery (SRS) remains a critical challenge in brain metastases. While histopathology represents the gold standard, its invasiveness limits feasibility. Conventional supervised deep learning approaches are constrained by scarce biopsy-confirmed training data. Self-supervised learning (SSL) overcomes this by leveraging the growing availability of large-scale unlabeled brain metastases imaging datasets. Methods: In a two-phase deep learning strategy inspired by the foundation model paradigm, a Vision Transformer (ViT) was pre-trained via SSL on 10,167 unlabeled multi-source T1CE MRI sub-volumes. The pre-trained ViT was then fine-tuned for RN classification using a two-channel input (T1CE MRI and segmentation masks) on the public MOLAB dataset (n=109) using 20% of datasets as same-center held-out test set. External validation was performed on a second-center test cohort (n=28). Results: The self-supervised model achieved an AUC of 0.916 on the same-center test set and 0.764 on the second center test set, surpassing the fully supervised ViT (AUC 0.624/0.496; p=0.001/0.008) and radiomics (AUC 0.807/0.691; p=0.005/0.014). Multimodal integration further improved performance (AUC 0.947/0.821; p=0.073/0.001). Attention map visualizations enabled interpretability showing the model focused on clinically relevant lesion subregions. Conclusion: Large-scale pre-training on increasingly available unlabeled brain metastases datasets substantially improves AI model performance. A two-phase multimodal deep learning strategy achieved high accuracy in differentiating radiation necrosis from tumor progression using only routine T1CE MRI and standard clinical data, providing an interpretable, clinically accessible solution that warrants further validation.
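
The AUC values reported above have a direct probabilistic reading: the chance that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it by comparing all positive-negative pairs, which is adequate for small cohorts such as the n=28 external set. The label convention (1 for the positive class, 0 for the negative class) is an assumption of this example.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # correctly ranked pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))
```

Under this reading, an AUC of 0.496 for the supervised ViT on the external set is indistinguishable from random ranking.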

[85] Using MLIR Transform to Design Sliced Convolution Algorithm cs.CV | cs.LG | cs.PFPDF

Victor Ferrari, Marcio Pereira, Lucas Alvarenga, Gustavo Leite, Guido Araujo

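TL;DR: This paper proposes SConvTransform, an MLIR Transform dialect extension for optimizing 2D convolutions. Its main operation, SConvOp, lowers Linalg convolutions into tiled and packed generic operations through a fully declarative transformation pipeline.

Details

Motivation: Existing convolution optimizations often lack flexibility and reusability. SConvTransform aims to address this through MLIR's modularity and extensibility.

Result: The generated code reaches up to 60% of peak performance on ARM SME and up to 67% on Intel AVX512.

Insight: Combining static shape analysis with structured tiling strategies is effective within the MLIR framework, and the modular design eases future extension.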

Abstract: This paper proposes SConvTransform, a Transform dialect extension that provides operations for optimizing 2D convolutions in MLIR. Its main operation, SConvOp, lowers Linalg convolutions into tiled and packed generic operations through a fully declarative transformation pipeline. The process is guided by a Convolution Slicing Analysis that determines tile sizes and data layout strategies based on input and filter shapes, as well as target architecture parameters. SConvOp handles edge cases by splitting irregular regions and adjusting affine maps where needed. All packing and tiling operations are derived from a parametric set of affine equations, enabling reusable and analyzable transformations. Although functional correctness was the primary goal of this work, the experimental evaluation demonstrates the effectiveness of SConvTransform, achieving good enough performance across different target architectures. Future work will focus on optimizing performance and porting to other target devices. When applied to standard convolution configurations, the generated code achieves up to 60% of peak performance on ARM SME and 67% on Intel AVX512. These results validate the benefit of combining static shape analysis with structured tiling and packing strategies within the MLIR Transform dialect. Furthermore, the modular design of SConvTransform facilitates integration with future extensions, enabling continued optimization of convolution workloads through MLIR’s extensible compilation infrastructure.
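
To make the idea of tiling with edge-case handling concrete, here is a small Python sketch. It does not use MLIR and does not reproduce SConvOp or its slicing analysis: it only shows that iterating over fixed-size output tiles, with a clipped final tile for the irregular border, yields the same result as an untiled convolution. The tile sizes are free parameters here, whereas the paper derives them from shapes and target architecture parameters.

```python
from typing import List, Sequence

Matrix = List[List[int]]


def conv_at(image: Sequence[Sequence[int]], kernel: Sequence[Sequence[int]], i: int, j: int) -> int:
    """One output element of a stride-1, unpadded 2D convolution (cross-correlation form)."""
    return sum(image[i + a][j + b] * kernel[a][b]
               for a in range(len(kernel)) for b in range(len(kernel[0])))


def conv2d_direct(image: Sequence[Sequence[int]], kernel: Sequence[Sequence[int]]) -> Matrix:
    """Reference implementation: visit outputs in plain row-major order."""
    out_h = len(image) - len(kernel) + 1
    out_w = len(image[0]) - len(kernel[0]) + 1
    return [[conv_at(image, kernel, i, j) for j in range(out_w)] for i in range(out_h)]


def conv2d_tiled(image: Sequence[Sequence[int]], kernel: Sequence[Sequence[int]],
                 tile_h: int, tile_w: int) -> Matrix:
    """Same computation, visiting the output tile by tile."""
    out_h = len(image) - len(kernel) + 1
    out_w = len(image[0]) - len(kernel[0]) + 1
    out = [[0] * out_w for _ in range(out_h)]
    for ti in range(0, out_h, tile_h):
        for tj in range(0, out_w, tile_w):
            # Border tiles are clipped so irregular regions are still covered once.
            i_end = min(ti + tile_h, out_h)
            j_end = min(tj + tile_w, out_w)
            for i in range(ti, i_end):
                for j in range(tj, j_end):
                    out[i][j] = conv_at(image, kernel, i, j)
    return out
```

In a compiled setting the benefit of the tiled order is data reuse in cache or registers; in Python the two versions differ only in traversal order, which is the point being illustrated.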

[86] Parallel qMRI Reconstruction from 4x Accelerated Acquisitions cs.CVPDF

Mingi Kang

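TL;DR: The paper proposes an end-to-end deep learning framework that jointly reconstructs images and estimates coil sensitivity maps from 4x accelerated MRI acquisitions.

Details

Motivation: MRI acquisitions are slow, which limits patient throughput and increases susceptibility to motion artifacts. Traditional methods such as SENSE require pre-computed coil sensitivity maps, and existing methods do not jointly optimize sensitivity estimation and reconstruction.

Result: Reconstructions are visually smoother than conventional SENSE output, but PSNR/SSIM scores are lower.

Insight: Addressing spatial misalignment between acceleration factors is a key direction for future improvement.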

Abstract: Magnetic Resonance Imaging (MRI) acquisitions require extensive scan times, limiting patient throughput and increasing susceptibility to motion artifacts. Accelerated parallel MRI techniques reduce acquisition time by undersampling k-space data, but require robust reconstruction methods to recover high-quality images. Traditional approaches like SENSE require both undersampled k-space data and pre-computed coil sensitivity maps. We propose an end-to-end deep learning framework that jointly estimates coil sensitivity maps and reconstructs images from only undersampled k-space measurements at 4x acceleration. Our two-module architecture consists of a Coil Sensitivity Map (CSM) estimation module and a U-Net-based MRI reconstruction module. We evaluate our method on multi-coil brain MRI data from 10 subjects with 8 echoes each, using 2x SENSE reconstructions as ground truth. Our approach produces visually smoother reconstructions compared to conventional SENSE output, achieving comparable visual quality despite lower PSNR/SSIM metrics. We identify key challenges including spatial misalignment between different acceleration factors and propose future directions for improved reconstruction quality.

[87] EgoVITA: Learning to Plan and Verify for Egocentric Video Reasoning cs.CVPDF

Yogesh Kulkarni, Pooyan Fazli

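TL;DR: EgoVITA is a reinforcement learning framework that uses structured planning and verification to address the difficulty multimodal large language models have in reasoning about intentions and actions from an egocentric perspective.

Details

Motivation: Egocentric videos reflect the actor's first-person viewpoint and are characterized by partial observability, a limited field of view, and self-referenced motion, all of which make reasoning harder for multimodal large language models.

Result: EgoVITA significantly outperforms the baseline on egocentric reasoning tasks, for example by +7.7 on EgoBlind, while maintaining strong generalization on exocentric (third-person) video tasks.

Insight: Alternating between first-person and third-person reasoning can improve model performance in dynamic, partially observable environments.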

Abstract: Reasoning about intentions and actions from a first-person (egocentric) perspective remains a fundamental challenge for multimodal large language models (MLLMs). Unlike third-person (exocentric) videos that capture scenes from an outside observer, egocentric videos reflect the actor’s continuously changing viewpoint, introducing partial observability, limited field of view, and self-referenced motion. We introduce $\textbf{EgoVITA}$, a reinforcement learning framework that enables MLLMs to reason through structured planning and verification. Built on Group Relative Policy Optimization (GRPO), EgoVITA alternates between two stages: (1) an $\textbf{egocentric planning phase}$, where the model reasons from a first-person viewpoint to predict a step-by-step plan of future actions, and (2) an $\textbf{exocentric verification phase}$, where it switches to a third-person perspective to check the visual and logical consistency of that plan. Through GRPO, the model learns to make plans that are causally predictive of upcoming visual observations, leading to more coherent and visually grounded reasoning. EgoVITA achieves significant gains on egocentric reasoning tasks, outperforming the baseline Qwen2.5-VL-7B by $\mathbf{+7.7}$ on EgoBlind and $\mathbf{+4.4}$ on EgoOrient, while maintaining strong generalization on exocentric video tasks.
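
The abstract builds on GRPO, whose central idea is to score each sampled response relative to the other responses drawn for the same prompt. The following is a minimal sketch of that group-relative normalization only, not EgoVITA's training code; the reward values and the function name `group_relative_advantages` are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four plans sampled for one egocentric video question.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Plans that score above the group average receive a positive advantage and are reinforced; plans below it are penalized, so no separate value network is needed to provide a baseline.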

[88] UniFlow: Towards Zero-Shot LiDAR Scene Flow for Autonomous Vehicles via Cross-Domain Generalization cs.CVPDF

Siyi Li, Qingwen Zhang, Ishan Khatri, Kyle Vedder, Deva Ramanan

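TL;DR: UniFlow is a zero-shot LiDAR scene flow method built for cross-domain generalization; training on diverse datasets improves generality and significantly outperforms existing methods.

Details

Motivation: Existing LiDAR scene flow methods are typically trained and evaluated on a single-sensor dataset and generalize poorly across sensors. This paper studies the potential of cross-dataset training and finds that motion estimation is relatively insensitive to sensor configuration.

Result: UniFlow improves over prior work by 5.1% on Waymo and 35.2% on nuScenes, and outperforms TruckScenes-specific models by 30.1% on the unseen TruckScenes dataset.

Insight: Low-level tasks such as motion estimation may be more robust to sensor differences, and cross-dataset training helps learn general motion priors.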

Abstract: LiDAR scene flow is the task of estimating per-point 3D motion between consecutive point clouds. Recent methods achieve centimeter-level accuracy on popular autonomous vehicle (AV) datasets, but are typically only trained and evaluated on a single sensor. In this paper, we aim to learn general motion priors that transfer to diverse and unseen LiDAR sensors. However, prior work in LiDAR semantic segmentation and 3D object detection demonstrates that naively training on multiple datasets yields worse performance than single-dataset models. Interestingly, we find that this conventional wisdom does not hold for motion estimation, and that state-of-the-art scene flow methods greatly benefit from cross-dataset training. We posit that low-level tasks such as motion estimation may be less sensitive to sensor configuration; indeed, our analysis shows that models trained on fast-moving objects (e.g., from highway datasets) perform well on fast-moving objects, even across different datasets. Informed by our analysis, we propose UniFlow, a family of feedforward models that unifies and trains on multiple large-scale LiDAR scene flow datasets with diverse sensor placements and point cloud densities. Our frustratingly simple solution establishes a new state-of-the-art on Waymo and nuScenes, improving over prior work by 5.1% and 35.2% respectively. Moreover, UniFlow achieves state-of-the-art accuracy on unseen datasets like TruckScenes, outperforming prior TruckScenes-specific models by 30.1%.

[89] Sequence-Adaptive Video Prediction in Continuous Streams using Diffusion Noise Optimization cs.CVPDF

Sina Mokhtarzadeh Azar, Emad Bahrami, Enrico Pallotta, Gianpiero Francesca, Radu Timofte

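TL;DR: The paper proposes Sequence Adaptive Video Prediction with Diffusion Noise Optimization (SAVi-DNO) for continuous video streams. Rather than fine-tuning model parameters, it optimizes the diffusion noise to adapt predictions to the stream, and it validates the approach on several datasets.

Details

Motivation: For future-frame prediction on continuous video streams, conventional diffusion models cannot readily take advantage of newly observed training samples. The authors seek a lightweight way to adapt to the stream without changing model parameters.

Result: SAVi-DNO improves FVD, SSIM, and PSNR on Ego4D, OpenDV-YouTube, UCF-101, and SkyTimelapse.

Insight: Diffusion noise optimization is a lightweight way to adapt prediction to continuous video streams while keeping model parameters frozen, offering a new approach for real-time video processing.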

Abstract: In this work, we investigate diffusion-based video prediction models, which forecast future video frames, for continuous video streams. In this context, the models observe continuously new training samples, and we aim to leverage this to improve their predictions. We thus propose an approach that continuously adapts a pre-trained diffusion model to a video stream. Since fine-tuning the parameters of a large diffusion model is too expensive, we refine the diffusion noise during inference while keeping the model parameters frozen, allowing the model to adaptively determine suitable sampling noise. We term the approach Sequence Adaptive Video Prediction with Diffusion Noise Optimization (SAVi-DNO). To validate our approach, we introduce a new evaluation setting on the Ego4D dataset, focusing on simultaneous adaptation and evaluation on long continuous videos. Empirical results demonstrate improved performance based on FVD, SSIM, and PSNR metrics on long videos of Ego4D and OpenDV-YouTube, as well as videos of UCF-101 and SkyTimelapse, showcasing SAVi-DNO’s effectiveness.

[90] MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation cs.CVPDF

Tao Shen, Xin Wan, Taicai Chen, Rui Zhang, Junwen Pan

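TL;DR: MammothModa2 (Mammoth2) is a unified AR-Diffusion framework that couples autoregressive semantic modeling with diffusion-based high-fidelity image generation, unifying multimodal understanding and generation.

Details

Motivation: Unified multimodal models face a gap between discrete semantic reasoning and high-fidelity visual synthesis. Mammoth2 aims to bridge it by combining autoregressive and diffusion models.

Result: On public benchmarks, Mammoth2 performs strongly on text-to-image generation and instruction-based editing (GenEval 0.87, DPGBench 87.2, ImgEdit 4.06) while remaining competitive on multimodal understanding tasks.

Insight: A carefully designed AR-Diffusion architecture can deliver high-fidelity generation and editing in a single model while preserving multimodal understanding.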

Abstract: Unified multimodal models aim to integrate understanding and generation within a single framework, yet bridging the gap between discrete semantic reasoning and high-fidelity visual synthesis remains challenging. We present MammothModa2 (Mammoth2), a unified autoregressive-diffusion (AR-Diffusion) framework designed to effectively couple autoregressive semantic planning with diffusion-based generation. Mammoth2 adopts a serial design: an AR path equipped with generation experts performs global semantic modeling over discrete tokens, while a single-stream Diffusion Transformer (DiT) decoder handles high-fidelity image synthesis. A carefully designed AR-Diffusion feature alignment module combines multi-layer feature aggregation, unified condition encoding, and in-context conditioning to stably align AR’s representations with the diffusion decoder’s continuous latents. Mammoth2 is trained end-to-end with joint Next-Token Prediction and Flow Matching objectives, followed by supervised fine-tuning and reinforcement learning over both generation and editing. With roughly 60M supervised generation samples and no reliance on pre-trained generators, Mammoth2 delivers strong text-to-image and instruction-based editing performance on public benchmarks, achieving 0.87 on GenEval, 87.2 on DPGBench, and 4.06 on ImgEdit, while remaining competitive with understanding-only backbones (e.g., Qwen3-VL-8B) on multimodal understanding tasks. These results suggest that a carefully coupled AR-Diffusion architecture can provide high-fidelity generation and editing while maintaining strong multimodal comprehension within a single, parameter- and data-efficient model.

[91] SatSAM2: Motion-Constrained Video Object Tracking in Satellite Imagery using Promptable SAM2 and Kalman Priors cs.CVPDF

Ruijie Fan, Junyan Ye, Huan Chen, Zilong Huang, Xiaolei Wang

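TL;DR: SatSAM2 is a zero-shot satellite video tracker built on SAM2. It combines a Kalman filter with a state machine to address poor generalization and occlusion in satellite video tracking, and its performance is validated on the synthetic MVOT benchmark.

Details

Motivation: Existing satellite video tracking methods generalize poorly, require scenario-specific training, and tend to lose the target under occlusion.

Result: On the OOTB dataset, SatSAM2 improves AUC by 5.84% over state-of-the-art methods.

Insight: With motion constraints and a state-machine design, SatSAM2 substantially improves satellite video tracking in the zero-shot setting.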

Abstract: Existing satellite video tracking methods often struggle with generalization, requiring scenario-specific training to achieve satisfactory performance, and are prone to track loss in the presence of occlusion. To address these challenges, we propose SatSAM2, a zero-shot satellite video tracker built on SAM2, designed to adapt foundation models to the remote sensing domain. SatSAM2 introduces two core modules: a Kalman Filter-based Constrained Motion Module (KFCMM) to exploit temporal motion cues and suppress drift, and a Motion-Constrained State Machine (MCSM) to regulate tracking states based on motion dynamics and reliability. To support large-scale evaluation, we propose MatrixCity Video Object Tracking (MVOT), a synthetic benchmark containing 1,500+ sequences and 157K annotated frames with diverse viewpoints, illumination, and occlusion conditions. Extensive experiments on two satellite tracking benchmarks and MVOT show that SatSAM2 outperforms both traditional and foundation model-based trackers, including SAM2 and its variants. Notably, on the OOTB dataset, SatSAM2 achieves a 5.84% AUC improvement over state-of-the-art methods. Our code and dataset will be publicly released to encourage further research.
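
To illustrate how a Kalman prior can carry a track through occlusion, here is a minimal sketch of a generic constant-velocity Kalman filter on a single image coordinate. It is not the paper's KFCMM; the class name, noise variances, initial covariance, and measurements are all hypothetical values chosen for illustration.

```python
class ConstantVelocityKalman1D:
    """Kalman filter for one coordinate with state [position, velocity]."""

    def __init__(self, position, process_var=0.01, meas_var=1.0):
        self.p, self.v = position, 0.0
        # Covariance [[p00, p01], [p10, p11]]; velocity starts very uncertain.
        self.P = [[1.0, 0.0], [0.0, 100.0]]
        self.q, self.r = process_var, meas_var

    def predict(self, dt=1.0):
        """Propagate the state one step ahead: P <- F P F^T + q*I."""
        (a, b), (c, d) = self.P
        self.p += dt * self.v
        self.P = [
            [a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
            [c + dt * d, d + self.q],
        ]
        return self.p

    def update(self, z):
        """Correct the state with a position measurement z."""
        (a, b), (c, d) = self.P
        innovation = z - self.p
        s = a + self.r                 # innovation variance
        k0, k1 = a / s, c / s          # Kalman gain for position and velocity
        self.p += k0 * innovation
        self.v += k1 * innovation
        self.P = [[(1 - k0) * a, (1 - k0) * b], [c - k1 * a, d - k1 * b]]


# Hypothetical target moving about 2 pixels per frame along one axis.
kf = ConstantVelocityKalman1D(position=0.0)
for z in [2.0, 4.0, 6.0, 8.0]:
    kf.predict()
    kf.update(z)
occluded_guess = kf.predict()  # no measurement available in this frame
```

When the detector output is missing or unreliable, the predicted position can stand in for the measurement, which is the general role a motion prior plays in suppressing drift.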

[92] Beyond Words and Pixels: A Benchmark for Implicit World Knowledge Reasoning in Generative Models cs.CV | cs.AIPDF

Tianyang Han, Junhao Su, Junjie Hu, Peizhen Yang, Hengyu Shi

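TL;DR: This paper introduces PicWorld, the first comprehensive benchmark for evaluating implicit world knowledge and physical causal reasoning in text-to-image (T2I) models, with fine-grained evaluation provided by the multi-agent evaluator PW-Agent. All 17 mainstream T2I models studied show limitations to varying degrees.

Details

Motivation: Current T2I models generate photorealistic, instruction-following images but perform poorly on prompts that require implicit world knowledge. Existing evaluation methods either focus on compositional alignment or rely on single-round VQA-based scoring, leaving key dimensions such as knowledge grounding, multi-physics interactions, and auditable evidence undertested.

Result: Experiments show that all 17 mainstream T2I models have limitations in implicit world knowledge and physical causal reasoning, to varying degrees.

Insight: Future T2I systems need reasoning-aware, knowledge-integrative architectures to address these weaknesses in complex world-knowledge reasoning.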

Abstract: Text-to-image (T2I) models today are capable of producing photorealistic, instruction-following images, yet they still frequently fail on prompts that require implicit world knowledge. Existing evaluation protocols either emphasize compositional alignment or rely on single-round VQA-based scoring, leaving critical dimensions such as knowledge grounding, multi-physics interactions, and auditable evidence substantially undertested. To address these limitations, we introduce PicWorld, the first comprehensive benchmark that assesses the grasp of implicit world knowledge and physical causal reasoning of T2I models. This benchmark consists of 1,100 prompts across three core categories. To facilitate fine-grained evaluation, we propose PW-Agent, an evidence-grounded multi-agent evaluator to hierarchically assess images on their physical realism and logical consistency by decomposing prompts into verifiable visual evidence. We conduct a thorough analysis of 17 mainstream T2I models on PicWorld, illustrating that they universally exhibit a fundamental limitation in their capacity for implicit world knowledge and physical causal reasoning to varying degrees. The findings highlight the need for reasoning-aware, knowledge-integrative architectures in future T2I systems.

[93] Vision Token Masking Alone Cannot Prevent PHI Leakage in Medical Document OCR: A Systematic Evaluation cs.CV | cs.CRPDF

Richard J. Young

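TL;DR: This paper systematically evaluates the limits of vision token masking for protecting health information in medical document OCR, shows that it cannot fully prevent leakage of structured identifiers, and proposes a hybrid architecture as a remedy.

Details

Motivation: The use of large vision-language models (VLMs) in healthcare raises the risk of protected health information (PHI) leakage, so the effectiveness of vision token masking as a privacy mechanism needs to be evaluated.

Result: All masking strategies achieve only 42.9% PHI reduction: they are 100% effective on long-form identifiers (such as names and addresses) but ineffective on short structured identifiers (such as medical record numbers and social security numbers). A simulated hybrid architecture raises the reduction to 88.6%.

Insight: Vision masking offers limited privacy protection; leakage of structured identifiers is driven mainly by the language model's contextual inference. Future research should focus on decoder-level fine-tuning and hybrid defense architectures.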

Abstract: Large vision-language models (VLMs) are increasingly deployed for optical character recognition (OCR) in healthcare settings, raising critical concerns about protected health information (PHI) exposure during document processing. This work presents the first systematic evaluation of inference-time vision token masking as a privacy-preserving mechanism for medical document OCR using DeepSeek-OCR. We introduce seven masking strategies (V3-V9) targeting different architectural layers (SAM encoder blocks, compression layers, dual vision encoders, projector fusion) and evaluate PHI reduction across HIPAA-defined categories using 100 synthetic medical billing statements (drawn from a corpus of 38,517 annotated documents) with perfect ground-truth annotations. All masking strategies converge to 42.9% PHI reduction, successfully suppressing long-form spatially-distributed identifiers (patient names, dates of birth, physical addresses at 100% effectiveness) while failing to prevent short structured identifiers (medical record numbers, social security numbers, email addresses, account numbers at 0% effectiveness). Ablation studies varying mask expansion radius (r=1,2,3) demonstrate that increased spatial coverage does not improve reduction beyond this ceiling, indicating that language model contextual inference - not insufficient visual masking - drives structured identifier leakage. A simulated hybrid architecture combining vision masking with NLP post-processing achieves 88.6% total PHI reduction (assuming 80% NLP accuracy on remaining identifiers). This negative result establishes boundaries for vision-only privacy interventions in VLMs, provides guidance distinguishing PHI types amenable to vision-level versus language-level redaction, and redirects future research toward decoder-level fine-tuning and hybrid defense-in-depth architectures for HIPAA-compliant medical document processing.
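
The 88.6% hybrid figure appears to follow from simple arithmetic on the two numbers quoted in the abstract: vision masking removes 42.9% of PHI, and NLP post-processing is assumed to catch 80% of what remains. The sketch below is one reading of that calculation, not code from the paper; the function name `hybrid_phi_reduction` is hypothetical.

```python
def hybrid_phi_reduction(vision_reduction, nlp_accuracy):
    """Total PHI reduction when NLP post-processing runs on what masking leaves behind."""
    remaining = 1.0 - vision_reduction
    return vision_reduction + nlp_accuracy * remaining


# 42.9% from vision masking, then 80% of the remaining 57.1% from NLP post-processing.
total = hybrid_phi_reduction(0.429, 0.80)
print(f"{total:.1%}")  # prints 88.6%
```

Under this reading, the vision-only ceiling determines how much work is left for the language-level stage, which is why the abstract distinguishes identifier types by the level at which they can be redacted.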

[94] Point-to-Point: Sparse Motion Guidance for Controllable Video Editing cs.CVPDF

Yeji Song, Jaehyun Lee, Mijin Koo, JunHoo Lee, Nojun Kwak

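TL;DR: The paper proposes Point-to-Point, a video editing method that uses sparse motion guidance for controllable editing and addresses the trade-off between edit fidelity and motion fidelity in existing methods.

Details

Motivation: When editing a subject while preserving motion, existing methods often trade edit fidelity against motion fidelity. The motion representations they rely on are either overfitted to the layout or only implicitly defined.

Result: Experiments show that anchor tokens yield more controllable and semantically aligned video edits, with strong edit and motion fidelity.

Insight: The prior of a video diffusion model can be used to capture and encode motion dynamics compactly, enabling more flexible and precise video editing.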

Abstract: Accurately preserving motion while editing a subject remains a core challenge in video editing tasks. Existing methods often face a trade-off between edit and motion fidelity, as they rely on motion representations that are either overfitted to the layout or only implicitly defined. To overcome this limitation, we revisit point-based motion representation. However, identifying meaningful points remains challenging without human input, especially across diverse video scenarios. To address this, we propose a novel motion representation, anchor tokens, that capture the most essential motion patterns by leveraging the rich prior of a video diffusion model. Anchor tokens encode video dynamics compactly through a small number of informative point trajectories and can be flexibly relocated to align with new subjects. This allows our method, Point-to-Point, to generalize across diverse scenarios. Extensive experiments demonstrate that anchor tokens lead to more controllable and semantically aligned video edits, achieving superior performance in terms of edit and motion fidelity.

[95] RoadSceneVQA: Benchmarking Visual Question Answering in Roadside Perception Systems for Intelligent Transportation System cs.CVPDF

Runwei Guan, Rongsheng Hu, Shangshu Chen, Ningyuan Xiao, Xue Xia

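TL;DR: The paper introduces the RoadSceneVQA dataset and the RoadMind baseline model for visual question answering in roadside scenes, combining a multimodal large language model (MLLM) with Assisted Decoupled Chain-of-Thought (AD-CoT) to improve traffic perception and reasoning.

Details

Motivation: Existing roadside perception systems focus only on instance-level perception and lack natural-language interaction and contextual reasoning. RoadSceneVQA aims to fill this gap and advance interaction and reasoning tasks in intelligent transportation systems.

Result: RoadMind achieves state-of-the-art performance on structured traffic perception and reasoning tasks while also improving computational efficiency.

Insight: Together, the dataset and method open a direction for natural-language interaction and complex reasoning in intelligent transportation systems and illustrate the potential of MLLMs on multimodal tasks.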

Abstract: Current roadside perception systems mainly focus on instance-level perception, which falls short of enabling interaction via natural language and reasoning about traffic behaviors in context. To bridge this gap, we introduce RoadSceneVQA, a large-scale and richly annotated visual question answering (VQA) dataset specifically tailored for roadside scenarios. The dataset comprises 34,736 diverse QA pairs collected under varying weather, illumination, and traffic conditions, targeting not only object attributes but also the intent, legality, and interaction patterns of traffic participants. RoadSceneVQA challenges models to perform both explicit recognition and implicit commonsense reasoning, grounded in real-world traffic rules and contextual dependencies. To fully exploit the reasoning potential of Multi-modal Large Language Models (MLLMs), we further propose CogniAnchor Fusion (CAF), a vision-language fusion module inspired by human-like scene anchoring mechanisms. Moreover, we propose the Assisted Decoupled Chain-of-Thought (AD-CoT) to enhance reasoning via CoT prompting and multi-task learning. Based on the above, we propose the baseline model RoadMind. Experiments on RoadSceneVQA and the CODA-LM benchmark show that the pipeline consistently improves both reasoning accuracy and computational efficiency, allowing the MLLM to achieve state-of-the-art performance in structural traffic perception and reasoning tasks.

[96] DiVE-k: Differential Visual Reasoning for Fine-grained Image Recognition cs.CVPDF

Raja Kumar, Arka Sadhu, Ram Nevatia

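TL;DR: DiVE-k is a fine-grained image recognition framework based on differential visual reasoning. It uses the model's own top-k predictions as a training signal, avoiding the memorization seen in existing reinforcement learning methods and improving generalization.

Details

Motivation: Large vision-language models (LVLMs) struggle to use their text knowledge for fine-grained image recognition, and conventional reinforcement learning methods tend to memorize training categories rather than generalize.

Result: Experiments on five fine-grained datasets show that DiVE-k significantly outperforms existing methods (such as QWEN2.5-VL-7B and ViRFT) in the base-to-novel generalization setting.

Insight: Differential reasoning with a multiple-choice training signal reduces memorization and improves generalization to unseen categories.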

Abstract: Large Vision Language Models (LVLMs) possess extensive text knowledge but struggle to utilize this knowledge for fine-grained image recognition, often failing to differentiate between visually similar categories. Existing fine-tuning methods using Reinforcement Learning (RL) with exact-match reward signals are often brittle, encourage memorization of training categories, and fail to elicit the differential reasoning needed for generalization to unseen classes. To address this, we propose $\textbf{DiVE-k}$, $\textbf{Di}$fferential $\textbf{V}$isual r$\textbf{E}$asoning using top-$\textbf{k}$ generations, a framework that leverages the model’s own top-k predictions as a training signal. For each training image, DiVE-k creates a multiple-choice question from the model’s top-k outputs and uses RL to train the model to select the correct answer. This approach requires the model to perform fine-grained differential reasoning among plausible options and provides a simple, verifiable reward signal that mitigates memorization and improves generalization. Experiments on five standard fine-grained datasets show that our method significantly outperforms existing approaches. In the standard base-to-novel generalization setting, DiVE-k surpasses the QWEN2.5-VL-7B and ViRFT by 10.04% and 6.16% on the Harmonic Mean metric, respectively. Further experiments show similar gains in mixed-domain and few-shot scenarios.
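
The multiple-choice construction described above can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: the helper names, the bird labels, and the rule of overwriting the last option when the true label is missing from the top-k list are all assumptions. The `harmonic_mean` helper assumes the Harmonic Mean metric combines base-class and novel-class accuracy, as is common in base-to-novel evaluation.

```python
import random


def build_mcq(top_k_labels, true_label, rng):
    """Turn top-k predictions into lettered options that include the true label."""
    options = list(dict.fromkeys(top_k_labels))  # drop duplicates, keep order
    if true_label not in options:
        options[-1] = true_label  # assumption: force the correct answer into the options
    rng.shuffle(options)
    choices = dict(zip("ABCDEFGHIJ", options))
    answer = next(letter for letter, label in choices.items() if label == true_label)
    return choices, answer


def reward(chosen_letter, answer_letter):
    """Verifiable reward: 1 for picking the correct option, 0 otherwise."""
    return 1.0 if chosen_letter == answer_letter else 0.0


def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy."""
    total = base_acc + novel_acc
    return 2 * base_acc * novel_acc / total if total > 0 else 0.0


# Hypothetical top-4 predictions for an image whose true label is "Indigo Bunting".
rng = random.Random(0)
top_k = ["Blue Grosbeak", "Indigo Bunting", "Lazuli Bunting", "Blue Jay"]
choices, answer = build_mcq(top_k, "Indigo Bunting", rng)
```

Because the distractors come from the model's own near-miss predictions, a correct choice requires telling apart visually similar categories rather than recalling a memorized class name.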

[97] ScriptViT: Vision Transformer-Based Personalized Handwriting Generation cs.CV | cs.AI | cs.LGPDF

Sajjan Acharya, Rajendra Baskota

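TL;DR: The paper proposes ScriptViT, a Vision Transformer-based framework for personalized handwriting generation. Global style encoding and cross-attention address the limitations of existing methods in capturing long-range spatial dependencies and maintaining style consistency, and explicit attention analysis improves interpretability.

Details

Motivation: Existing GAN, transformer, and diffusion models struggle to capture global style attributes in handwriting (such as slant, curvature, and stroke pressure), so generated text is stylistically inconsistent or inaccurate. This paper aims to address that problem.

Result: The generated personalized handwriting outperforms existing methods in style consistency and accuracy, and SSAA improves the model's interpretability.

Insight: A Vision Transformer can capture the global characteristics of handwriting style, cross-attention is key to fusing style with content, and SSAA offers an interpretable view of style transfer.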

Abstract: Styled handwriting generation aims to synthesize handwritten text that looks both realistic and aligned with a specific writer’s style. While recent approaches involving GAN, transformer and diffusion-based models have made progress, they often struggle to capture the full spectrum of writer-specific attributes, particularly global stylistic patterns that span long-range spatial dependencies. As a result, capturing subtle writer-specific traits such as consistent slant, curvature or stroke pressure, while keeping the generated text accurate is still an open problem. In this work, we present a unified framework designed to address these limitations. We introduce a Vision Transformer-based style encoder that learns global stylistic patterns from multiple reference images, allowing the model to better represent long-range structural characteristics of handwriting. We then integrate these style cues with the target text using a cross-attention mechanism, enabling the system to produce handwritten images that more faithfully reflect the intended style. To make the process more interpretable, we utilize Salient Stroke Attention Analysis (SSAA), which reveals the stroke-level features the model focuses on during style transfer. Together, these components lead to handwriting synthesis that is not only more stylistically coherent, but also easier to understand and analyze.

[98] Stro-VIGRU: Defining the Vision Recurrent-Based Baseline Model for Brain Stroke Classification cs.CVPDF

Subhajeet Das, Pritam Paul, Rohit Bahadur, Sohan Das

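TL;DR: This paper proposes Stro-VIGRU, a model based on a pre-trained Vision Transformer and a Bi-GRU for early brain stroke classification. By partially freezing the ViT, adding a Bi-GRU layer, and applying data augmentation, it reaches 94.06% classification accuracy.

Details

Motivation: Stroke is a leading cause of death and disability worldwide, and early recognition is critical to treatment. CT scans are fast and readily available, but manual analysis is time-consuming and error-prone, so efficient automated classification is needed.

Result: The model achieves 94.06% classification accuracy on the Stroke Dataset.

Insight: 1. Combining a ViT with a Bi-GRU can effectively extract sequential and spatial features; 2. Partially freezing the ViT balances pre-trained knowledge against task-specific features; 3. Data augmentation is a key technique for handling class imbalance in medical data.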

Abstract: Stroke is a major cause of death and disability worldwide, and early recognition is key to successful treatment. Strokes are commonly diagnosed with CT scanning, which is fast and readily available; however, manual analysis takes time and can lead to mistakes. In this work, a pre-trained Vision Transformer-based transfer learning framework is proposed for the early identification of brain stroke. A few of the encoder blocks of the ViT model are frozen, and the rest are fine-tuned to learn brain stroke-specific features. The extracted features are fed to a single-layer Bi-GRU for classification. Class imbalance is handled by data augmentation. The model achieves 94.06% accuracy in classifying brain stroke on the Stroke Dataset.

[99] General vs Domain-Specific CNNs: Understanding Pretraining Effects on Brain MRI Tumor Classification cs.CV | cs.AIPDF

Helia Abedini, Saba Rahimi, Reza Vaziri

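TL;DR: The paper compares general-purpose and medical-domain pretrained CNNs for brain MRI tumor classification. The general-purpose ConvNeXt-Tiny performs best on a small dataset, while the medically pretrained RadImageNet DenseNet121 generalizes poorly.

Details

Motivation: The study asks which kind of pretrained model is better suited to brain MRI tumor classification when only a small dataset is available: models pretrained on domain-specific medical data or models pretrained on large general datasets.

Result: ConvNeXt-Tiny performs best, followed by EfficientNetV2S; RadImageNet DenseNet121 generalizes poorly, with lower accuracy and higher loss.

Insight: Under small-data conditions, medical-domain pretraining may not generalize well, whereas modern general-purpose pretrained CNNs can perform better on medical imaging tasks.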

Abstract: Brain tumor detection from MRI scans plays a crucial role in early diagnosis and treatment planning. Deep convolutional neural networks (CNNs) have demonstrated strong performance in medical imaging tasks, particularly when pretrained on large datasets. However, it remains unclear which type of pretrained model performs better when only a small dataset is available: those trained on domain-specific medical data or those pretrained on large general datasets. In this study, we systematically evaluate three pretrained CNN architectures for brain tumor classification: RadImageNet DenseNet121 with medical-domain pretraining, EfficientNetV2S, and ConvNeXt-Tiny, which are modern general-purpose CNNs. All models were trained and fine-tuned under identical conditions using a limited-size brain MRI dataset to ensure a fair comparison. Our results reveal that ConvNeXt-Tiny achieved the highest accuracy, followed by EfficientNetV2S, while RadImageNet DenseNet121, despite being pretrained on domain-specific medical data, exhibited poor generalization with lower accuracy and higher loss. These findings suggest that domain-specific pretraining may not generalize well under small-data conditions. In contrast, modern, deeper general-purpose CNNs pretrained on large-scale datasets can offer superior transfer learning performance in specialized medical imaging tasks.

[100] ConsistCompose: Unified Multimodal Layout Control for Image Composition cs.CVPDF

Xuanke Shi, Boxuan Li, Xiaoyang Han, Zhongang Cai, Lei Yang

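TL;DR: ConsistCompose is a unified multimodal framework that embeds layout coordinates into language prompts, enabling layout-controlled multi-instance image generation.

Details

Motivation: Existing multimodal models mainly focus on visual grounding (aligning language with image regions), while layout-controlled generation (such as linguistic-embedded layout-grounded generation) remains underexplored, which limits precise compositional control.

Result: ConsistCompose outperforms baselines on COCO-Position and MS-Bench, substantially improving spatial accuracy while preserving identity fidelity and multimodal understanding.

Insight: A unified approach of embedding layout in language prompts offers a new paradigm for layout-controlled image generation.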

Abstract: Unified multimodal models that couple visual understanding with image generation have advanced rapidly, yet most systems still focus on visual grounding (aligning language with image regions), while their generative counterpart, linguistic-embedded layout-grounded generation (LELG) for layout-controllable multi-instance generation, remains underexplored, which limits precise compositional control. We present ConsistCompose, a unified multimodal framework that embeds layout coordinates directly into language prompts, enabling layout-controlled multi-instance image generation from Interleaved Image-Text within a single generative interface. We further construct ConsistCompose3M, a 3.4M multi-instance generation dataset with layout and identity annotations (2.6M text-guided and 0.8M image-guided data pairs) that provides large-scale supervision for layout-conditioned generation. Within this framework, LELG is instantiated through instance-coordinate binding prompts and coordinate-aware classifier-free guidance, which translate linguistic layout cues into precise spatial control without task-specific branches. Experiments on COCO-Position and MS-Bench show that ConsistCompose substantially improves spatial accuracy over layout-controlled baselines while preserving identity fidelity and competitive general multimodal understanding, establishing a unified paradigm for layout-controllable multimodal image generation.

Tianyang Xu, Jinjie Gu, Xuefeng Zhu, XiaoJun Wu, Josef Kittler

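TL;DR: This paper presents MM-UAV, the first large-scale multi-modal UAV tracking dataset, along with a purpose-built baseline framework whose alignment and fusion modules and event-enhanced association mechanism significantly improve tracking performance.

Details

Motivation: As low-altitude UAVs proliferate, the need for robust visual multi-object tracking in complex environments is growing. Single-modality tracking performs poorly under low illumination, cluttered backgrounds, and rapid motion, and the lack of dedicated multi-modal datasets has held back research.

Result: Experiments show that the proposed framework significantly outperforms existing methods across multiple scenarios, supporting the value of multi-modal data and the proposed modules.

Insight: Fusing multi-modal data and exploiting motion information from event signals can improve the robustness and accuracy of UAV tracking.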

Abstract: With the proliferation of low-altitude unmanned aerial vehicles (UAVs), visual multi-object tracking is becoming a critical security technology, demanding significant robustness even in complex environmental conditions. However, tracking UAVs using a single visual modality often fails in challenging scenarios, such as low illumination, cluttered backgrounds, and rapid motion. Although multi-modal multi-object UAV tracking is more resilient, the development of effective solutions has been hindered by the absence of dedicated public datasets. To bridge this gap, we release MM-UAV, the first large-scale benchmark for Multi-Modal UAV Tracking, integrating three key sensing modalities: RGB, infrared (IR), and event signals. The dataset spans over 30 challenging scenarios, with 1,321 synchronised multi-modal sequences, and more than 2.8 million annotated frames. Accompanying the dataset, we provide a novel multi-modal multi-UAV tracking framework, designed specifically for UAV tracking applications and serving as a baseline for future research. Our framework incorporates two key technical innovations: an offset-guided adaptive alignment module to resolve spatial mismatches across sensors, and an adaptive dynamic fusion module to balance complementary information conveyed by different modalities. Furthermore, to overcome the limitations of conventional appearance modelling in multi-object tracking, we introduce an event-enhanced association mechanism that leverages motion cues from the event modality for more reliable identity maintenance. Comprehensive experiments demonstrate that the proposed framework consistently outperforms state-of-the-art methods. To foster further research in multi-modal UAV tracking, both the dataset and source code will be made publicly available at https://xuefeng-zhu5.github.io/MM-UAV/.

[102] FlowPortal: Residual-Corrected Flow for Training-Free Video Relighting and Background Replacement cs.CVPDF

Wenshuo Gao, Junyi Fan, Jiangyue Zeng, Shuai Yang

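TL;DR: FlowPortal is a training-free, flow-based framework for video relighting and background replacement that uses a Residual-Corrected Flow mechanism to improve spatio-temporal consistency and illumination naturalness.

Details

Motivation: Video relighting with background replacement is important in film production, but existing methods struggle to balance temporal consistency, spatial fidelity, and illumination naturalness. FlowPortal aims to address this.

Result: Experiments show that FlowPortal performs strongly in temporal coherence, structural preservation, and lighting realism while remaining efficient.

Insight: Residual-Corrected Flow and the strategy of separating foreground from background offer new design ideas for video editing tasks.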

Abstract: Video relighting with background replacement is a challenging task critical for applications in film production and creative media. Existing methods struggle to balance temporal consistency, spatial fidelity, and illumination naturalness. To address these issues, we introduce FlowPortal, a novel training-free flow-based video relighting framework. Our core innovation is a Residual-Corrected Flow mechanism that transforms a standard flow-based model into an editing model, guaranteeing perfect reconstruction when input conditions are identical and enabling faithful relighting when they differ, resulting in high structural consistency. This is further enhanced by a Decoupled Condition Design for precise lighting control and a High-Frequency Transfer mechanism for detail preservation. Additionally, a masking strategy isolates foreground relighting from the pure generation of the background. Experiments demonstrate that FlowPortal achieves superior performance in temporal coherence, structural preservation, and lighting realism, while maintaining high efficiency. Project Page: https://gaowenshuo.github.io/FlowPortalProject/.

[103] MagicWand: A Universal Agent for Generation and Evaluation Aligned with User Preference cs.CVPDF

Zitong Xu, Dake Shen, Yaosong Du, Kexiang Hao, Jinghan Huang

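TL;DR: The paper proposes MagicWand, a universal generation and evaluation agent that incorporates user preferences, supporting high-quality content generation and preference-aligned evaluation and refinement.

Details

Motivation: Existing AIGC models can generate content, but users find it hard to express their preferences precisely through prompts, and there is no mechanism for retaining those preferences.

Result: On UniPreferBench, MagicWand generates content and evaluations that are well aligned with user preferences across a wide range of scenarios.

Insight: Incorporating user preference data can substantially improve the usefulness and user experience of AIGC models.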

Abstract: Recent advances in AIGC (Artificial Intelligence Generated Content) models have enabled significant progress in image and video generation. However, users still struggle to obtain content that aligns with their preferences due to the difficulty of crafting detailed prompts and the lack of mechanisms to retain their preferences. To address these challenges, we construct \textbf{UniPrefer-100K}, a large-scale dataset comprising images, videos, and associated text that describes the styles users tend to prefer. Based on UniPrefer-100K, we propose \textbf{MagicWand}, a universal generation and evaluation agent that enhances prompts based on user preferences, leverages advanced generation models for high-quality content, and applies preference-aligned evaluation and refinement. In addition, we introduce \textbf{UniPreferBench}, the first large-scale benchmark with over 120K annotations for assessing user preference alignment across diverse AIGC tasks. Experiments on UniPreferBench demonstrate that MagicWand consistently generates content and evaluations that are well aligned with user preferences across a wide range of scenarios.

[104] TRANSPORTER: Transferring Visual Semantics from VLM Manifolds cs.CVPDF

Alexandros Stergiou

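TL;DR: The paper proposes TRANSPORTER, a method that generates videos to understand and control the internal prediction process of vision-language models (VLMs), targeting VLM interpretability.

Details

Motivation: Current VLMs handle complex scenes well, but their internal prediction processes remain hard to understand and control. The paper aims to expose the semantic rules behind VLM predictions by generating videos, improving interpretability.

Result: Quantitative and qualitative experiments show that the L2V task offers a novel, high-fidelity direction for VLM interpretability.

Insight: Explaining VLM predictions through video generation improves interpretability and opens a new direction for future research.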

Abstract: How do video understanding models acquire their answers? Although current Vision Language Models (VLMs) reason over complex scenes with diverse objects, action performances, and scene dynamics, understanding and controlling their internal processes remains an open challenge. Motivated by recent advancements in text-to-video (T2V) generative models, this paper introduces a logits-to-video (L2V) task alongside a model-independent approach, TRANSPORTER, to generate videos that capture the underlying rules behind VLMs’ predictions. Given the high-visual-fidelity produced by T2V models, TRANSPORTER learns an optimal transport coupling to VLM’s high-semantic embedding spaces. In turn, logit scores define embedding directions for conditional video generation. TRANSPORTER generates videos that reflect caption changes over diverse object attributes, action adverbs, and scene context. Quantitative and qualitative evaluations across VLMs demonstrate that L2V can provide a fidelity-rich, novel direction for model interpretability that has not been previously explored.

[105] Alias-free 4D Gaussian Splatting cs.CVPDF

Zilong Chen, Huan-ang Gao, Delin Qu, Haohan Chi, Hao Tang

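TL;DR: The paper proposes a method for eliminating high-frequency artifacts in 4D Gaussian Splatting, using a maximum sampling frequency formulation and a 4D scale-adaptive filter to address artifacts caused by changes in rendering resolution.

Details

Motivation: Existing Gaussian Splatting methods for dynamic scene reconstruction introduce artifacts when the camera's focal length or the distance between Gaussian primitives and the camera changes, owing to frequency constraints and scale mismatch, which limits their flexibility.

Result: Experiments on monocular and multi-view video reconstruction show that the method effectively eliminates high-frequency artifacts and reduces redundant Gaussian primitives.

Insight: A systematic analysis of the frequency constraints and scale issues yields a more flexible solution for dynamic scene reconstruction.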

Abstract: Existing dynamic scene reconstruction methods based on Gaussian Splatting enable real-time rendering and generate realistic images. However, adjusting the camera’s focal length or the distance between Gaussian primitives and the camera to modify rendering resolution often introduces strong artifacts, stemming from the frequency constraints of 4D Gaussians and Gaussian scale mismatch induced by the 2D dilated filter. To address this, we derive a maximum sampling frequency formulation for 4D Gaussian Splatting and introduce a 4D scale-adaptive filter and scale loss, which flexibly regulate the sampling frequency of 4D Gaussian Splatting. Our approach eliminates high-frequency artifacts under increased rendering frequencies while effectively reducing redundant Gaussians in multi-view video reconstruction. We validate the proposed method through monocular and multi-view video reconstruction experiments. Project page: https://4d-alias-free.github.io/4D-Alias-free/

[106] MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models cs.CVPDF

Xiyang Wu, Zongxia Li, Jihui Jin, Guangyao Shi, Gouthaman KV

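TL;DR: The paper proposes MASS, which translates the dynamics and spatial interactions of the physical world into representations that vision-language models (VLMs) can interpret, improving VLM performance on physics reasoning and comprehension tasks; the accompanying MASS-Bench benchmark is used to validate it.

Details

Motivation: VLMs perform poorly on tasks involving physical dynamics and spatial interactions, which limits their ability to interpret real or AI-generated videos and to generate physically consistent content. The paper aims to close this gap.

Result: The refined VLMs outperform baseline models and prior SOTA models on physics reasoning by 8.7% and 6.0% respectively, with performance close to closed-source SOTA models such as Gemini-2.5-Flash.

Insight: Explicitly modeling physical dynamics and spatial interactions effectively improves VLM performance on complex physics reasoning tasks.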

Abstract: Vision Language Models (VLMs) perform well on standard video tasks but struggle with physics-driven reasoning involving motion dynamics and spatial interactions. This limitation reduces their ability to interpret real or AI-generated content (AIGC) videos and to generate physically consistent content. We present an approach that addresses this gap by translating physical-world context cues into interpretable representations aligned with VLMs’ perception, comprehension, and reasoning. We introduce MASS-Bench, a comprehensive benchmark consisting of 4,350 real-world and AIGC videos and 8,361 free-form video question-answering pairs focused on physics-related comprehension tasks, with detailed annotations including visual detections, sub-segment grounding, and full-sequence 3D motion tracking of entities. We further present MASS, a model-agnostic method that injects spatial-temporal signals into the VLM language space via depth-based 3D encoding and visual grounding, coupled with a motion tracker for object dynamics. To strengthen cross-modal alignment and reasoning, we apply reinforcement fine-tuning. Experiments and ablations show that our refined VLMs outperform comparable and larger baselines, as well as prior state-of-the-art models, by 8.7% and 6.0%, respectively, achieving performance comparable to closed-source SoTA VLMs such as Gemini-2.5-Flash on physics reasoning and comprehension. These results validate the effectiveness of our approach.

[107] Synthetic Curriculum Reinforces Compositional Text-to-Image Generation cs.CVPDF

Shijian Wang, Runhao Fu, Siyi Zhao, Qingqin Zhan, Xingjian Wang

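TL;DR: The paper proposes CompGen, a new framework that uses reinforcement learning and curriculum learning to improve the compositional generation ability of text-to-image (T2I) models.

Details

Motivation: Existing T2I models struggle with compositional generation for complex scenes containing multiple objects with diverse attributes and spatial and semantic relationships.

Result: Experiments show that CompGen significantly improves compositional generation for both diffusion-based and autoregressive T2I models, with the easy-to-hard and Gaussian sampling strategies performing best.

Insight: Difficulty-aware curriculum learning can significantly improve compositional generation, and the scheduling strategy has a substantial effect on performance.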

Abstract: Text-to-Image (T2I) generation has long been an open problem, with compositional synthesis remaining particularly challenging. This task requires accurate rendering of complex scenes containing multiple objects that exhibit diverse attributes as well as intricate spatial and semantic relationships, demanding both precise object placement and coherent inter-object interactions. In this paper, we propose a novel compositional curriculum reinforcement learning framework named CompGen that addresses compositional weakness in existing T2I models. Specifically, we leverage scene graphs to establish a novel difficulty criterion for compositional ability and develop a corresponding adaptive Markov Chain Monte Carlo graph sampling algorithm. This difficulty-aware approach enables the synthesis of training curriculum data that progressively optimize T2I models through reinforcement learning. We integrate our curriculum learning approach into Group Relative Policy Optimization (GRPO) and investigate different curriculum scheduling strategies. Our experiments reveal that CompGen exhibits distinct scaling curves under different curriculum scheduling strategies, with easy-to-hard and Gaussian sampling strategies yielding superior scaling performance compared to random sampling. Extensive experiments demonstrate that CompGen significantly enhances compositional generation capabilities for both diffusion-based and auto-regressive T2I models, highlighting its effectiveness in improving the compositional T2I generation systems.
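
The abstract compares easy-to-hard, Gaussian, and random curriculum schedules but does not spell out how each one picks training samples. The following is a minimal sketch of one plausible reading, assuming every sample carries a difficulty score normalized to [0, 1]. The function names, the linear difficulty ramp, and the Gaussian width `sigma` are hypothetical choices for illustration, not details taken from the paper.

```python
import math
import random


def easy_to_hard(step: int, total_steps: int) -> float:
    """Target difficulty grows linearly from 0 (easy) to 1 (hard). Assumed ramp."""
    return step / max(total_steps - 1, 1)


def gaussian_weights(difficulties, center, sigma=0.15):
    """Unnormalised Gaussian weights around a moving difficulty centre."""
    return [math.exp(-((d - center) ** 2) / (2 * sigma ** 2)) for d in difficulties]


def sample_batch(difficulties, step, total_steps, strategy, batch_size, rng):
    """Return indices of training samples chosen under the given schedule."""
    indices = list(range(len(difficulties)))
    if strategy == "random":
        # Uniform sampling with replacement, ignoring difficulty.
        return rng.choices(indices, k=batch_size)
    center = easy_to_hard(step, total_steps)
    if strategy == "easy_to_hard":
        # Pick the samples whose difficulty is closest to the current target.
        return sorted(indices, key=lambda i: abs(difficulties[i] - center))[:batch_size]
    if strategy == "gaussian":
        # Sample with replacement, favouring difficulties near the target.
        weights = gaussian_weights(difficulties, center)
        return rng.choices(indices, weights=weights, k=batch_size)
    raise ValueError(f"unknown strategy: {strategy}")
```

In this sketch the Gaussian schedule still visits easier and harder samples occasionally, whereas the easy-to-hard schedule is deterministic for a given step.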

[108] RNN as Linear Transformer: A Closer Investigation into Representational Potentials of Visual Mamba Models cs.CVPDF

Timing Yang, Guoyizhe Wei, Alan Yuille, Feng Wang

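TL;DR: The paper systematically studies Mamba's representational capacity on vision tasks, relates it to Softmax and Linear Attention, and proposes a new evaluation method, demonstrating Mamba's ability to model long-range dependencies and its potential for interpretability.

Details

Motivation: Mamba performs well on vision tasks, but how it works is not well understood. The study aims to fill this gap and explore its representational capacity and potential advantages.

Result: Mamba reaches 78.5% linear probing accuracy on ImageNet, confirming its strong performance.

Insight: Mamba can be viewed as a low-rank approximation of Softmax Attention; it combines long-range dependency modeling with potential interpretability, suggesting directions for future vision architecture research.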

Abstract: Mamba has recently garnered attention as an effective backbone for vision tasks. However, its underlying mechanism in visual domains remains poorly understood. In this work, we systematically investigate Mamba’s representational properties and make three primary contributions. First, we theoretically analyze Mamba’s relationship to Softmax and Linear Attention, confirming that it can be viewed as a low-rank approximation of Softmax Attention and thereby bridging the representational gap between Softmax and Linear forms. Second, we introduce a novel binary segmentation metric for activation map evaluation, extending qualitative assessments to a quantitative measure that demonstrates Mamba’s capacity to model long-range dependencies. Third, by leveraging DINO for self-supervised pretraining, we obtain clearer activation maps than those produced by standard supervised approaches, highlighting Mamba’s potential for interpretability. Notably, our model also achieves a 78.5 percent linear probing accuracy on ImageNet, underscoring its strong performance. We hope this work can provide valuable insights for future investigations of Mamba-based vision architectures.
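
The connection between recurrent models and Linear Attention that this analysis builds on can be illustrated with a well-known identity: causal linear attention can be computed either in a parallel, attention-style form or as a recurrence over a running state. The sketch below shows only that generic identity in pure Python. The feature map `phi` (ELU + 1) is an illustrative assumption, and Mamba's selective state-space update is not reproduced here.

```python
import math


def phi(x):
    """Positive feature map (ELU + 1); an illustrative choice, not Mamba's."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def causal_linear_attention(qs, ks, vs):
    """Parallel form: each query attends to all keys at or before its position."""
    d_v = len(vs[0])
    out = []
    for i, q in enumerate(qs):
        scores = [dot(phi(q), phi(ks[j])) for j in range(i + 1)]
        norm = sum(scores)
        out.append([sum(s * vs[j][d] for j, s in enumerate(scores)) / norm
                    for d in range(d_v)])
    return out


def recurrent_linear_attention(qs, ks, vs):
    """RNN form: a running state S (d_k x d_v) and a normaliser z (d_k)."""
    d_k, d_v = len(ks[0]), len(vs[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    out = []
    for q, k, v in zip(qs, ks, vs):
        fk = phi(k)
        # State update: S += phi(k) v^T, z += phi(k).
        for a in range(d_k):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
        fq = phi(q)
        norm = dot(fq, z)
        # Readout: phi(q)^T S, normalised by phi(q)^T z.
        out.append([sum(fq[a] * S[a][b] for a in range(d_k)) / norm
                    for b in range(d_v)])
    return out
```

Both functions return the same outputs up to floating-point error; the recurrent form keeps a fixed-size state instead of revisiting all earlier keys.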

[109] ViMix-14M: A Curated Multi-Source Video-Text Dataset with Long-Form, High-Quality Captions and Crawl-Free Access cs.CVPDF

Timing Yang, Sucheng Ren, Alan Yuille, Feng Wang

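TL;DR: ViMix-14M is a high-quality, multi-source video-text dataset of about 14 million pairs, built to address the lack of open video-text data.

Details

Motivation: Existing public datasets typically require manually crawling YouTube and suffer from link rot, access limits, and licensing uncertainty, which limits data quality and usability.

Result: The dataset yields consistent improvements over counterpart datasets on multimodal retrieval, text-to-video generation, and video question answering.

Insight: A high-quality, crawl-free dataset is key to training open-source video foundation models, and multi-source merging with re-captioning can significantly improve data quality.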

Abstract: Text-to-video generation has surged in interest since Sora, yet open-source models still face a data bottleneck: there is no large, high-quality, easily obtainable video-text corpus. Existing public datasets typically require manual YouTube crawling, which yields low usable volume due to link rot and access limits, and raises licensing uncertainty. This work addresses this challenge by introducing ViMix-14M, a curated multi-source video-text dataset of around 14 million pairs that provides crawl-free, download-ready access and long-form, high-quality captions tightly aligned to video. ViMix-14M is built by merging diverse open video sources, followed by unified de-duplication and quality filtering, and a multi-granularity, ground-truth-guided re-captioning pipeline that refines descriptions to better match actions, scenes, and temporal structure. We evaluate the dataset on multimodal retrieval, text-to-video generation, and video question answering tasks, observing consistent improvements over counterpart datasets. We hope this work can help remove the key barrier to training and fine-tuning open-source video foundation models, and provide insights into building high-quality and generalizable video-text datasets.

Chuang Peng, Renshuai Tao, Zhongwei Ren, Xianglong Liu, Yunchao Wei

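TL;DR: The paper proposes a new method for X-ray prohibited item detection that treats the second-view image as a language-like modality and combines geometric and semantic cross-modal reasoning to improve detection.

Details

Motivation: Traditional X-ray prohibited item detection relies mainly on a single visual modality and performs poorly on complex threats. Recent work introduces language to guide detection, yet human inspectors in practice typically use dual-view images. The paper asks whether the second view can provide constraints like a language modality does, and proposes a new cross-modal reasoning method accordingly.

Result: Comprehensive evaluation on DualXrayBench shows that GSR achieves significant improvements across all X-ray tasks.

Insight: The second-view image can provide semantic constraints much as language does, and multimodal learning that combines geometric and semantic information can significantly improve X-ray prohibited item detection, offering a new perspective for real-world security inspection.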

Abstract: Automatic X-ray prohibited items detection is vital for security inspection and has been widely studied. Traditional methods rely on visual modality, often struggling with complex threats. While recent studies incorporate language to guide single-view images, human inspectors typically use dual-view images in practice. This raises the question: can the second view provide constraints similar to a language modality? In this work, we introduce DualXrayBench, the first comprehensive benchmark for X-ray inspection that includes multiple views and modalities. It supports eight tasks designed to test cross-view reasoning. In DualXrayBench, we introduce a caption corpus consisting of 45,613 dual-view image pairs across 12 categories with corresponding captions. Building upon these data, we propose the Geometric (cross-view)-Semantic (cross-modality) Reasoner (GSR), a multimodal model that jointly learns correspondences between cross-view geometry and cross-modal semantics, treating the second-view images as a “language-like modality”. To enable this, we construct the GSXray dataset, with structured Chain-of-Thought sequences: , , . Comprehensive evaluations on DualXrayBench demonstrate that GSR achieves significant improvements across all X-ray tasks, offering a new perspective for real-world X-ray inspection.

[111] Exploring Weak-to-Strong Generalization for CLIP-based Classification cs.CVPDF

Jinhao Li, Sarah M. Erfani, Lei Feng, James Bailey, Feng Liu

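TL;DR: The paper explores weak-to-strong generalization for CLIP-based classification and proposes class prototype learning (CPL), which strengthens the CLIP model's classification ability by learning more representative class prototypes.

Details

Motivation: Existing methods rely on human supervision, which becomes impractical as model complexity grows. Weak-to-strong generalization uses the evaluation ability of weaker models to reduce the human supervision burden and offers a new approach for vision-language models.

Result: Experiments show that CPL performs well in the targeted scenarios, particularly when pretraining is limited, improving on strong baseline methods by 3.67%.

Insight: Weak-to-strong generalization extends to multimodal tasks, and even under simple supervision, learning representative prototypes can yield notable gains.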

Abstract: Aligning large-scale commercial models with user intent is crucial to preventing harmful outputs. Current methods rely on human supervision but become impractical as model complexity increases. When models surpass human knowledge, providing accurate feedback becomes challenging and inefficient. A novel solution proposed recently is using a weaker model to supervise a stronger model. This concept leverages the ability of weaker models to perform evaluations, thereby reducing the workload on human supervisors. Previous work has shown the effectiveness of weak-to-strong generalization in the context of language-only models. Extending this concept to vision-language models leverages these insights, adapting the proven benefits to a multi-modal context. In our study, we explore weak-to-strong generalization for CLIP-based classification. We propose a method, class prototype learning (CPL), which aims to enhance the classification capabilities of the CLIP model, by learning more representative prototypes for each category. Our findings indicate that, despite using a simple loss function under weak supervision, CPL yields robust improvements in targeted scenarios, particularly when pretraining is limited. Extensive experiments demonstrate that our approach is effective under these settings, achieving a 3.67% improvement over strong baseline methods.
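
The abstract does not describe how CPL learns its prototypes, so the following only illustrates the general idea of prototype-based classification under weak supervision. It assumes prototypes are formed by averaging the normalized embeddings that a weaker model assigned to each class, and that inputs are classified by cosine similarity. Both the averaging rule and the function names are hypothetical stand-ins, not the CPL loss.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def build_prototypes(embeddings, weak_labels, num_classes):
    """Average the unit-normalised embeddings that a weak model assigned to each class."""
    dim = len(embeddings[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    for emb, label in zip(embeddings, weak_labels):
        for d, x in enumerate(normalize(emb)):
            sums[label][d] += x
    return [normalize(s) for s in sums]


def classify(embedding, prototypes):
    """Return the index of the prototype with the highest cosine similarity."""
    e = normalize(embedding)
    sims = [sum(a * b for a, b in zip(e, p)) for p in prototypes]
    return max(range(len(prototypes)), key=sims.__getitem__)
```

In a CLIP setting the embeddings would come from the image encoder, and the prototypes would play the role usually taken by text-derived class embeddings.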

Yuxiang Nie, Han Wang, Yongjie Ye, Haiyang Yu, Weitao Jia

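TL;DR: The paper introduces ChineseVideoBench, a benchmark for evaluating multimodal large language models (MLLMs) on Chinese video question answering. It fills the gap left by the lack of a comprehensive, culturally aware evaluation framework, providing complex Chinese video content and tailored evaluation metrics.

Details

Motivation: As demand for sophisticated video analysis grows, a comprehensive and culturally aware evaluation framework is needed to assess MLLMs on Chinese video question answering.

Result: Gemini 2.5 Pro performs best in the evaluation with an overall score of 77.9%, while InternVL-38B is the most competitive open-source model.

Insight: Chinese video question answering requires not only deep video understanding but also sensitivity to Chinese language and culture, which points to important directions for future research.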

Abstract: This paper introduces ChineseVideoBench, a pioneering benchmark specifically designed for evaluating Multimodal Large Language Models (MLLMs) in Chinese Video Question Answering. The growing demand for sophisticated video analysis capabilities highlights the critical need for comprehensive, culturally-aware evaluation frameworks. ChineseVideoBench addresses this gap by providing a robust dataset and tailored evaluation metrics, enabling rigorous assessment of state-of-the-art MLLMs on complex Chinese video content. Specifically, ChineseVideoBench comprises 8 main classes and 12 sub-classes, encompassing tasks that demand both deep video understanding and nuanced Chinese linguistic and cultural awareness. Our empirical evaluations reveal that ChineseVideoBench presents a significant challenge to current MLLMs. Among the models assessed, Gemini 2.5 Pro achieves the highest performance with an overall score of 77.9%, while InternVL-38B emerges as the most competitive open-source model.

[113] NeuroVascU-Net: A Unified Multi-Scale and Cross-Domain Adaptive Feature Fusion U-Net for Precise 3D Segmentation of Brain Vessels in Contrast-Enhanced T1 MRI cs.CV | cs.LGPDF

Mohammad Jafari Vayeghan, Niloufar Delfan, Mehdi Tale Masouleh, Mansour Parvaresh Rizi, Behzad Moshiri

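TL;DR: NeuroVascU-Net is a deep learning architecture designed for T1CE MRI of brain tumor patients; it is the first to segment cerebrovascular structures directly from clinically standard T1CE MRI, filling a gap in prior work.

Details

Motivation: Manual segmentation of cerebral vasculature is time-consuming and subject to inter-observer variability, and existing automated methods often trade accuracy against computational cost, limiting clinical use. An efficient and accurate solution is needed.

Result: A Dice score of 0.8609 and precision of 0.8841 with only 12.4M parameters, significantly fewer than Transformer-based models such as Swin U-NetR.

Insight: NeuroVascU-Net improves segmentation accuracy while remaining computationally efficient, offering a practical solution for clinical computer-assisted neurosurgical planning.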

Abstract: Precise 3D segmentation of cerebral vasculature from T1-weighted contrast-enhanced (T1CE) MRI is crucial for safe neurosurgical planning. Manual delineation is time-consuming and prone to inter-observer variability, while current automated methods often trade accuracy for computational cost, limiting clinical use. We present NeuroVascU-Net, the first deep learning architecture specifically designed to segment cerebrovascular structures directly from clinically standard T1CE MRI in neuro-oncology patients, addressing a gap in prior work dominated by TOF-MRA-based approaches. NeuroVascU-Net builds on a dilated U-Net and integrates two specialized modules: a Multi-Scale Contextual Feature Fusion ($MSC^2F$) module at the bottleneck and a Cross-Domain Adaptive Feature Fusion ($CDA^2F$) module at deeper hierarchical layers. $MSC^2F$ captures both local and global information via multi-scale dilated convolutions, while $CDA^2F$ dynamically integrates domain-specific features, enhancing representation while keeping computation low. The model was trained and validated on a curated dataset of T1CE scans from 137 brain tumor biopsy patients, annotated by a board-certified functional neurosurgeon. NeuroVascU-Net achieved a Dice score of 0.8609 and precision of 0.8841, accurately segmenting both major and fine vascular structures. Notably, it requires only 12.4M parameters, significantly fewer than transformer-based models such as Swin U-NetR. This balance of accuracy and efficiency positions NeuroVascU-Net as a practical solution for computer-assisted neurosurgical planning.
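
For readers unfamiliar with the two reported metrics, the sketch below computes them from binary masks using the standard definitions, Dice = 2TP / (2TP + FP + FN) and precision = TP / (TP + FP). It assumes flattened voxel masks of equal length. How the paper aggregates scores across volumes is not stated in the abstract, and the function name is hypothetical.

```python
def dice_and_precision(pred, target):
    """Compute Dice and precision for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    # Two empty masks agree perfectly, so Dice is defined as 1.0 in that case.
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    # With no predicted foreground, precision is reported as 0.0 by convention here.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return dice, precision
```

Dice penalizes both missed vessel voxels and false detections, while precision only reflects how many predicted vessel voxels are correct, which is why the two values are reported together.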

Avishka Perera, Kumal Hewagamage, Saeedha Nazar, Kavishka Abeywardana, Hasitha Gallella

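TL;DR: CrossJEPA is an efficient cross-modal joint embedding predictive architecture that leverages knowledge from 2D image data by predicting the embeddings of 2D views from 3D point clouds, avoiding large, complex model designs and enabling efficient 3D representation learning.

Details

Motivation: 3D representation learning suffers from a scarcity of large-scale 3D datasets, and current methods usually depend on large models that are computationally expensive and hard to deploy. The paper aims to improve the efficiency of 3D representation learning by exploiting the simplicity and efficiency of the Joint Embedding Predictive Architecture (JEPA) in a cross-modal setting.

Result: New SOTA in linear probing on ModelNet40 (94.2%) and ScanObjectNN (88.3%), using only 14.1M pretraining parameters and about 6 hours of pretraining on a single GPU.

Insight: JEPA-style pretraining can gain efficiency through cross-modal tasks; the frozen-teacher design and one-time caching mechanism significantly reduce compute; cross-modal knowledge distillation can mitigate the scarcity of 3D data.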

Abstract: Image-to-point cross-modal learning has emerged to address the scarcity of large-scale 3D datasets in 3D representation learning. However, current methods that leverage 2D data often result in large, slow-to-train models, making them computationally expensive and difficult to deploy in resource-constrained environments. The architecture design of such models is therefore critical, determining their performance, memory footprint, and compute efficiency. The Joint-embedding Predictive Architecture (JEPA) has gained wide popularity in self-supervised learning for its simplicity and efficiency, but has been under-explored in cross-modal settings, partly due to the misconception that masking is intrinsic to JEPA. In this light, we propose CrossJEPA, a simple Cross-modal Joint Embedding Predictive Architecture that harnesses the knowledge of an image foundation model and trains a predictor to infer embeddings of specific rendered 2D views from corresponding 3D point clouds, thereby introducing a JEPA-style pretraining strategy beyond masking. By conditioning the predictor on cross-domain projection information, CrossJEPA purifies the supervision signal from semantics exclusive to the target domain. We further exploit the frozen teacher design with a one-time target embedding caching mechanism, yielding amortized efficiency. CrossJEPA achieves a new state-of-the-art in linear probing on the synthetic ModelNet40 (94.2%) and the real-world ScanObjectNN (88.3%) benchmarks, using only 14.1M pretraining parameters (8.5M in the point encoder), and about 6 pretraining hours on a standard single GPU. These results position CrossJEPA as a performant, memory-efficient, and fast-to-train framework for 3D representation learning via knowledge distillation. We analyze CrossJEPA intuitively, theoretically, and empirically, and extensively ablate our design choices. Code will be made available.

[115] LungX: A Hybrid EfficientNet-Vision Transformer Architecture with Multi-Scale Attention for Accurate Pneumonia Detection cs.CVPDF

Mansur Yerzhanuly

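TL;DR: LungX is a hybrid architecture that combines EfficientNet's multi-scale features, CBAM attention, and Vision Transformer global context modeling to improve pneumonia detection accuracy.

Details

Motivation: Pneumonia remains a leading cause of death worldwide, and timely diagnosis is critical. Existing methods leave room for improvement in accuracy and in localizing lesions.

Result: On a dataset of 20,000 chest X-rays from RSNA and CheXpert, LungX reaches 86.5% accuracy and 0.943 AUC, a 6.7% AUC improvement over EfficientNet-B0 baselines.

Insight: Combining multiple techniques (CNN, Transformer, attention) can improve detection performance on medical images while providing interpretable attention maps that support clinical diagnosis.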

Abstract: Pneumonia remains a leading global cause of mortality where timely diagnosis is critical. We introduce LungX, a novel hybrid architecture combining EfficientNet’s multi-scale features, CBAM attention mechanisms, and Vision Transformer’s global context modeling for enhanced pneumonia detection. Evaluated on 20,000 curated chest X-rays from RSNA and CheXpert, LungX achieves state-of-the-art performance (86.5 percent accuracy, 0.943 AUC), representing a 6.7 percent AUC improvement over EfficientNet-B0 baselines. Visual analysis demonstrates superior lesion localization through interpretable attention maps. Future directions include multi-center validation and architectural optimizations targeting 88 percent accuracy for clinical deployment as an AI diagnostic aid.

[116] DocPTBench: Benchmarking End-to-End Photographed Document Parsing and Translation cs.CV | cs.AIPDF

Yongkun Du, Pinxuan Chen, Xuye Ying, Zhineng Chen

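TL;DR: DocPTBench is a benchmark focused on parsing and translating photographed documents. It fills the gap left by existing benchmarks, which do not cover real-world capture challenges; experiments show that existing models degrade substantially on this task.

Details

Motivation: Existing document parsing and translation benchmarks (such as OmniDocBench and DITrans) mainly target scanned or digital-born documents and do not reflect challenges of real photographed scenes such as geometric distortion and photometric variation. DocPTBench aims to fill this gap.

Result: Moving from digital-born to photographed documents, popular Multimodal Large Language Models (MLLMs) lose 18% accuracy on average in end-to-end parsing and 12% in translation; specialized document parsing models decline even more (25% on average).

Insight: Real-world photographed documents pose a significant challenge to existing models and expose their limited robustness in complex conditions.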

Abstract: The advent of Multimodal Large Language Models (MLLMs) has unlocked the potential for end-to-end document parsing and translation. However, prevailing benchmarks such as OmniDocBench and DITrans are dominated by pristine scanned or digital-born documents, and thus fail to adequately represent the intricate challenges of real-world capture conditions, such as geometric distortions and photometric variations. To fill this gap, we introduce DocPTBench, a comprehensive benchmark specifically designed for Photographed Document Parsing and Translation. DocPTBench comprises over 1,300 high-resolution photographed documents from multiple domains, includes eight translation scenarios, and provides meticulously human-verified annotations for both parsing and translation. Our experiments demonstrate that transitioning from digital-born to photographed documents results in a substantial performance decline: popular MLLMs exhibit an average accuracy drop of 18% in end-to-end parsing and 12% in translation, while specialized document parsing models show a significant average decrease of 25%. This substantial performance gap underscores the unique challenges posed by documents captured in real-world conditions and reveals the limited robustness of existing models. Dataset and code are available at https://github.com/Topdu/DocPTBench.

[117] When Generative Replay Meets Evolving Deepfakes: Domain-Aware Relative Weighting for Incremental Face Forgery Detection cs.CVPDF

Hao Shen, Jikang Cheng, Renye Yan, Zhongyuan Wang, Wei Peng

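TL;DR: The paper studies generative replay for incremental face forgery detection and proposes a Domain-Aware Relative Weighting (DARW) strategy that handles domain-risky and domain-safe samples with dynamically adjusted supervision, improving detection performance.

Details

Motivation: As face generation techniques advance rapidly, forgery methods grow more diverse. Conventional sample-replay incremental learning is limited by low sample diversity and privacy concerns, and generative replay offers a possible solution.

Result: Experiments show that DARW improves incremental learning performance under different generative replay settings and alleviates the adverse impact of domain overlap.

Insight: The feasibility of generative replay for forgery detection depends on how closely the replay generator resembles the forgery model; DARW addresses domain confusion by dynamically managing how samples are supervised.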

Abstract: The rapid advancement of face generation techniques has led to a growing variety of forgery methods. Incremental forgery detection aims to gradually update existing models with new forgery data, yet current sample replay-based methods are limited by low diversity and privacy concerns. Generative replay offers a potential solution by synthesizing past data, but its feasibility for forgery detection remains unclear. In this work, we systematically investigate generative replay and identify two scenarios: when the replay generator closely resembles the new forgery model, generated real samples blur the domain boundary, creating domain-risky samples; when the replay generator differs significantly, generated samples can be safely supervised, forming domain-safe samples. To exploit generative replay effectively, we propose a novel Domain-Aware Relative Weighting (DARW) strategy. DARW directly supervises domain-safe samples while applying a Relative Separation Loss to balance supervision and potential confusion for domain-risky samples. A Domain Confusion Score dynamically adjusts this tradeoff according to sample reliability. Extensive experiments demonstrate that DARW consistently improves incremental learning performance for forgery detection under different generative replay settings and alleviates the adverse impact of domain overlap.

[118] Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning cs.CVPDF

Chi Zhang, Haibo Qiu, Qiming Zhang, Yufei Xu, Zhixiong Zeng

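TL;DR: PEARL (Perceptual-Evidence Anchored Reinforced Learning) uses a dual-branch perception-reasoning synergy mechanism to explicitly anchor multimodal reasoning to verified visual evidence, addressing the neglect of visual perception in vanilla RLVR for vision-language models.

Details

Motivation: Vanilla RLVR for vision-language models verifies only the final textual output and neglects the foundational role of visual perception, leading to visual hallucinations and reward hacking. PEARL aims to make multimodal reasoning more reliable by explicitly anchoring it to visual evidence.

Result: Experiments show that PEARL substantially outperforms baselines on multimodal reasoning benchmarks (for example, 6.6% higher than GRPO on MathVerse).

Insight: Visual perception is a key foundation of multimodal reasoning, and explicitly verifying visual evidence can reduce hallucinations and improve reasoning reliability.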

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) and is now being applied to Vision-Language Models (VLMs). However, vanilla RLVR for VLMs verifies only the final textual output, critically neglecting the foundational step of visual perception. This oversight leads to visual hallucinations and reward hacking, as reasoning built upon flawed perception is inherently unreliable. To address this, we propose PEARL (Perceptual-Evidence Anchored Reinforced Learning), a dual-branch, perception-reasoning synergistic framework that strengthens multimodal reasoning by explicitly anchoring it to verified visual evidence. For each reasoning-oriented QA instance, PEARL first derives a perception checklist: a set of perception-oriented sub-questions with verifiable answers that probe the model’s understanding of key visual evidence. During training, auxiliary rollouts on this checklist yield a perceptual reward that both directly reinforces the model’s perception ability and acts as a fidelity gate for reasoning. If the model passes the perception check, its policy update is biased towards evidence-anchored reasoning. Otherwise, the process is halted to prevent reasoning from flawed premises. PEARL can be seamlessly integrated with popular RL methods like GRPO and DAPO. Comprehensive experiments show PEARL achieves substantial gains on multimodal reasoning benchmarks, e.g., a +9.7% improvement over the baseline and +6.6% over GRPO on MathVerse.
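
The fidelity gate described in this abstract can be pictured with a small sketch. It assumes the perceptual reward is the fraction of checklist sub-questions answered correctly under case-insensitive exact match, and that the reasoning reward is only passed on when that fraction reaches a threshold. The matching rule, the `pass_threshold` value, and the function names are all hypothetical; the abstract does not give the actual reward definition or how the policy update is biased.

```python
def perceptual_reward(model_answers, reference_answers):
    """Fraction of checklist sub-questions answered correctly (exact match, assumed)."""
    if not reference_answers:
        return 0.0
    correct = sum(
        1 for got, ref in zip(model_answers, reference_answers)
        if got.strip().lower() == ref.strip().lower()
    )
    return correct / len(reference_answers)


def gated_reward(model_answers, reference_answers, reasoning_reward, pass_threshold=1.0):
    """Return (perceptual_reward, reasoning_reward or None).

    None signals that the reasoning update is skipped because the
    perception check failed. The threshold value is an assumption.
    """
    p = perceptual_reward(model_answers, reference_answers)
    if p >= pass_threshold:
        return p, reasoning_reward
    return p, None
```

A trainer built on GRPO or DAPO could then skip or down-weight rollouts whose second element is `None`, which is one way to read "the process is halted".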

[119] SineProject: Machine Unlearning for Stable Vision Language Alignment cs.CVPDF

Arpit Garg, Hemanth Saratchandran, Simon Lucey

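TL;DR: The paper proposes SineProject, which adds sinusoidally modulated trainable parameters to the frozen projector network, improving the stability of vision-language alignment during machine unlearning while achieving complete forgetting of harmful information and a high acceptance rate for benign queries.

Details

Motivation: Multimodal large language models (MLLMs) need to selectively forget harmful or private information, but existing methods tend to disrupt vision-language alignment during unlearning, causing the model to refuse benign queries. The authors trace the problem to unstable optimization of the projector network.

Result: On safety and privacy unlearning benchmarks with LLaVA v1.5 7B and 13B, SineProject reduces benign query refusals while completely forgetting the targeted information, achieving state-of-the-art forget-retain trade-offs.

Insight: Improving the conditioning of the projector's Jacobian is key to stable vision-language alignment during unlearning. Sinusoidal modulation achieves this with negligible computational overhead.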

Abstract: Multimodal Large Language Models (MLLMs) increasingly need to forget specific knowledge such as unsafe or private information without requiring full retraining. However, existing unlearning methods often disrupt vision-language alignment, causing models to reject both harmful and benign queries. We trace this failure to the projector network: during unlearning, its Jacobian becomes severely ill-conditioned, leading to unstable optimization and drift in cross-modal embeddings. We introduce SineProject, a simple method that augments the frozen projector with sinusoidally modulated trainable parameters, improving the Jacobian’s spectral conditioning and stabilizing alignment throughout unlearning. Across standard safety and privacy unlearning benchmarks using LLaVA v1.5 7B and 13B, SineProject reduces benign query refusals while achieving complete forgetting of targeted information, yielding state-of-the-art forget-retain trade-offs with negligible computational overhead.

[120] EventBench: Towards Comprehensive Benchmarking of Event-based MLLMs cs.CVPDF

Shaoyu Liu, Jianing Li, Guanghui Zhao, Yunjian Zhang, Xiangyang Ji

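TL;DR: The paper presents EventBench, a comprehensive benchmark for event-based MLLMs with diverse tasks and a large-scale dataset, and uses it to assess the strengths and weaknesses of current models.

Details

Motivation: Existing event-based MLLMs perform well on multimodal tasks, but there is no unified benchmark to evaluate their capabilities comprehensively.

Result: Current event-based MLLMs perform strongly on event stream understanding but still struggle with fine-grained recognition and spatial reasoning.

Insight: The design of spatial reasoning tasks is a novel contribution that exposes the limitations of current models in this area and points the way for future research.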

Abstract: Multimodal large language models (MLLMs) have made significant advancements in event-based vision, yet the comprehensive evaluation of their capabilities within a unified benchmark remains largely unexplored. In this work, we introduce EventBench, a benchmark that offers eight diverse task metrics together with a large-scale event stream dataset. EventBench differs from existing event-based benchmarks in four key aspects: (1) openness in accessibility, releasing all raw event streams and task instructions across eight evaluation metrics; (2) diversity in task coverage, spanning understanding, recognition, and spatial reasoning tasks for comprehensive capability assessment; (3) integration in spatial dimensions, pioneering the design of 3D spatial reasoning tasks for event-based MLLMs; and (4) scale in data volume, with an accompanying training set of over one million event-text pairs supporting large-scale training and evaluation. Using EventBench, we evaluate state-of-the-art closed-source models such as GPT-5 and Gemini-2.5 Pro, leading open-source models including Qwen2.5-VL and InternVL3, and event-based MLLMs such as EventGPT that directly process raw event streams. Extensive evaluation reveals that while current event-based MLLMs demonstrate strong performance in event stream understanding, they continue to struggle with fine-grained recognition and spatial reasoning.

[121] NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering cs.CVPDF

Loick Chambon, Paul Couairon, Eloi Zablocki, Alexandre Boulch, Nicolas Thome

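TL;DR: NAF is a zero-shot feature upsampling method that uses neighborhood attention filtering and rotary position embeddings to upsample features from any VFM without retraining, outperforming VFM-specific methods on multiple downstream tasks.

Details

Motivation: Existing feature upsampling methods face a trade-off: classical filters are fast and broadly applicable but use fixed forms, while modern methods are more accurate but rely on VFM-specific training. NAF aims to bridge this gap with an efficient, general solution that needs no retraining.

Result: Achieves SOTA performance on multiple downstream tasks, scales to 2K feature maps, reconstructs intermediate-resolution maps at 18 FPS, and shows promise on image restoration.

Insight: NAF shows that learning adaptive weights through attention enables efficient zero-shot feature upsampling while retaining broad applicability and high performance.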

Abstract: Vision Foundation Models (VFMs) extract spatially downsampled representations, posing challenges for pixel-level tasks. Existing upsampling approaches face a fundamental trade-off: classical filters are fast and broadly applicable but rely on fixed forms, while modern upsamplers achieve superior accuracy through learnable, VFM-specific forms at the cost of retraining for each VFM. We introduce Neighborhood Attention Filtering (NAF), which bridges this gap by learning adaptive spatial-and-content weights through Cross-Scale Neighborhood Attention and Rotary Position Embeddings (RoPE), guided solely by the high-resolution input image. NAF operates zero-shot: it upsamples features from any VFM without retraining, making it the first VFM-agnostic architecture to outperform VFM-specific upsamplers and achieve state-of-the-art performance across multiple downstream tasks. It maintains high efficiency, scaling to 2K feature maps and reconstructing intermediate-resolution maps at 18 FPS. Beyond feature upsampling, NAF demonstrates strong performance on image restoration, highlighting its versatility. Code and checkpoints are available at https://github.com/valeoai/NAF.

[122] Alternating Perception-Reasoning for Hallucination-Resistant Video Understanding cs.CVPDF

Bowei Pu, Chuanbin Liu, Yifan Ge, Peichen Zhou, Yiwei Sun

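TL;DR: The paper proposes Video-PLR, a framework that alternates perception and reasoning, using a perception loop and an anti-hallucination reward to address insufficient perception and hallucination in existing video reasoning LLMs, and achieves SOTA performance.

Details

Motivation: Existing video reasoning LLMs use a single-step perception paradigm, which leads to insufficient perceptual evidence and a risk of hallucination; a new framework is needed.

Result: Experiments show that Video-PLR achieves SOTA at both the 3B and 7B parameter scales and has the best data efficiency.

Insight: Iterative perception and an anti-hallucination reward are an effective way to tackle hallucination in video reasoning.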

Abstract: Sufficient visual perception is the foundation of video reasoning. Nevertheless, existing Video Reasoning LLMs suffer from perception shortcuts, relying on a flawed single-step perception paradigm. This paradigm describes the video and then conducts reasoning, which runs the risk of insufficient evidence and emergent hallucinations. To address these issues, we introduce a new framework that integrates a loop-based paradigm with an anti-hallucination reward. First, to address the insufficient evidence, we introduce the Perception Loop Reasoning (PLR) paradigm. Instead of describing the video at once, each loop requires the model to describe a video segment with precise timestamps, analyze this segment, and decide the next action. Second, for the risk of hallucinations, the Factual-Aware Evaluator (FAE) evaluates each perception result as a reliable anti-hallucination reward. This reward encourages the model to provide sufficient and precise video evidence. Our FAE, which performs comparably to GPT-4o, is tuned on our AnetHallu-117K, a large-scale hallucination judgment preference dataset. Extensive experiments show that our Video-PLR achieves the state-of-the-art in both 3B and 7B parameter scales and has the best data efficiency. Our code, models, and datasets are released on: https://github.com/BoweiPu/VideoPLR.

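[123] Robust Posterior Diffusion-based Sampling via Adaptive Guidance Scale cs.CV

Liav Hen, Tom Tirer, Raja Giryes, Shady Abu-Hussein

TL;DR: The paper proposes Adaptive Posterior diffusion Sampling (AdaPS), which uses an adaptive guidance scale to balance the prior against data fidelity in inverse problems and improves image reconstruction quality.

Details

Motivation: Diffusion models perform well on inverse problems, but balancing the prior with data fidelity is a key challenge. Overly aggressive likelihood updates can introduce artifacts, while conservative updates can slow convergence or yield suboptimal reconstructions.

Result: On CelebA-HQ and ImageNet-256, AdaPS outperforms existing diffusion-based baselines on super-resolution, Gaussian deblurring, and motion deblurring, without task-specific tuning.

Insight: The adaptive strategy improves the robustness of diffusion models on inverse problems, with stable behavior across the number of diffusion steps, observation noise levels, and degrees of stochasticity.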

Abstract: Diffusion models have recently emerged as powerful generative priors for solving inverse problems, achieving state-of-the-art results across various imaging tasks. A central challenge in this setting lies in balancing the contribution of the prior with the data fidelity term: overly aggressive likelihood updates may introduce artifacts, while conservative updates can slow convergence or yield suboptimal reconstructions. In this work, we propose an adaptive likelihood step-size strategy to guide the diffusion process for inverse-problem formulations. Specifically, we develop an observation-dependent weighting scheme based on the agreement between two different approximations of the intractable intermediate likelihood gradients, that adapts naturally to the diffusion schedule, time re-spacing, and injected stochasticity. The resulting approach, Adaptive Posterior diffusion Sampling (AdaPS), is hyperparameter-free and improves reconstruction quality across diverse imaging tasks - including super-resolution, Gaussian deblurring, and motion deblurring - on CelebA-HQ and ImageNet-256 validation sets. AdaPS consistently surpasses existing diffusion-based baselines in perceptual quality with minimal or no loss in distortion, without any task-specific tuning. Extensive ablation studies further demonstrate its robustness to the number of diffusion steps, observation noise levels, and varying stochasticity.

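[124] Extreme Model Compression for Edge Vision-Language Models: Sparse Temporal Token Fusion and Adaptive Neural Compression cs.CV

Md Tasnin Tanvir, Soumitra Das, Sk Md Abidar Rahaman, Ali Shiri Sichani

TL;DR: The paper proposes two adaptive compression techniques, Sparse Temporal Token Fusion (STTF) and Adaptive Neural Compression (ANC), for vision-language models on resource-constrained edge devices, improving both efficiency and performance.

Details

Motivation: Edge devices increasingly need real-time, resource-efficient vision-language processing, and traditional static pruning or uniform scaling cannot adapt to dynamic scenes.

Result: On the COCO 2017 test set, TinyGPT-STTF surpasses LLaVA-1.5 7B by 17.6 CIDEr points with 62x fewer on-device FLOPs; on the DVS128 Gesture dataset, STTF reduces the token count by 84% while preserving 95.6% accuracy.

Insight: Adaptive compression allocates compute according to scene complexity, balancing performance and efficiency for deployment on edge devices.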

Abstract: The demand for edge AI in vision-language tasks requires models that achieve real-time performance on resource-constrained devices with limited power and memory. This paper proposes two adaptive compression techniques – Sparse Temporal Token Fusion (STTF) and Adaptive Neural Compression (ANC) – that integrate algorithmic innovations with hardware-aware optimizations. Unlike previous approaches relying on static pruning or uniform scaling, STTF dynamically reuses visual tokens through event-driven change detection, while ANC conditionally activates encoder branches via a learned router, enabling fine-grained adaptation to scene complexity. Our 3B-parameter TinyGPT-STTF achieves CIDEr 131.2, BLEU-4 0.38, METEOR 0.31, and ROUGE-L 0.56 on the COCO 2017 test set, surpassing LLaVA-1.5 7B by 17.6 CIDEr points while using 2.3x fewer parameters and 62x fewer on-device FLOPs. TinyGPT-ANC reaches CIDEr 128.5. On event-based vision tasks, STTF reduces average token count by 84% (from 196 to 31 tokens) while preserving 95.6% accuracy on the DVS128 Gesture dataset, and ANC cuts FLOPs by up to 90% in low-motion scenes. Compared to strong baselines, our models improve accuracy by up to 4.4% and reduce latency by up to 13x. These results enable efficient deployment of capable vision-language models on real-world edge devices.
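
Token reuse through change detection can be sketched as follows. This is an illustrative toy, not the paper's STTF: the function `fuse_tokens`, the mean-absolute-difference change measure, the `threshold` parameter, and the `toy_encode` stand-in encoder are all hypothetical. The helper `token_reduction` simply checks the arithmetic behind the reported drop from 196 to 31 tokens.

```python
def fuse_tokens(prev_patches, curr_patches, cached_tokens, encode, threshold):
    """Re-encode only the patches that changed; reuse cached tokens elsewhere."""
    tokens = []
    recomputed = 0
    for old, new, cached in zip(prev_patches, curr_patches, cached_tokens):
        # Change measure (assumed): mean absolute difference between frames.
        change = sum(abs(a - b) for a, b in zip(old, new)) / len(new)
        if change > threshold:
            tokens.append(encode(new))
            recomputed += 1
        else:
            tokens.append(cached)
    return tokens, recomputed

def token_reduction(before, after):
    """Fractional reduction in token count."""
    return 1.0 - after / before

def toy_encode(patch):
    """Hypothetical encoder: the token is just the patch mean."""
    return sum(patch) / len(patch)

prev = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
curr = [[0.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
cache = [toy_encode(p) for p in prev]
tokens, recomputed = fuse_tokens(prev, curr, cache, toy_encode, threshold=0.1)
print(recomputed, "of", len(curr), "patches re-encoded")
print(f"token reduction: {token_reduction(196, 31):.1%}")
```

The sketch only shows the reuse decision. How STTF turns that decision into a smaller token sequence is not specified in the abstract.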

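[125] Multimodal Continual Learning with MLLMs from Multi-scenario Perspectives cs.CV | cs.AI

Kai Jiang, Siqi Huang, Xiangyu Chen, Jiawei Shao, Hongyuan Zhang

TL;DR: The paper proposes UNIFIER, a multimodal continual learning method that uses a branch structure and a consistency constraint to mitigate catastrophic forgetting in MLLMs under scenario shifts.

Details

Motivation: To address catastrophic forgetting when MLLMs adapt to dynamic, multi-scenario conditions, and to improve their continual learning ability on complex visual tasks.

Result: Experiments on the MSVQA dataset show that UNIFIER alleviates cross-scenario forgetting and accumulates knowledge within the same scenario.

Insight: A branch structure combined with feature-space consistency is an effective strategy for mitigating catastrophic forgetting in multimodal continual learning.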

Abstract: Continual learning in visual understanding aims to deal with catastrophic forgetting in Multimodal Large Language Models (MLLMs). MLLMs deployed on devices have to continuously adapt to dynamic scenarios in downstream tasks, such as variations in background and perspective, to effectively perform complex visual tasks. To this end, we construct a multimodal visual understanding dataset (MSVQA) encompassing four different scenarios and perspectives including high altitude, underwater, low altitude and indoor, to investigate the catastrophic forgetting in MLLMs under the dynamics of scenario shifts in real-world data streams. Furthermore, we propose mUltimodal coNtInual learning with MLLMs From multi-scenarIo pERspectives (UNIFIER) to address visual discrepancies while learning different scenarios. Specifically, it decouples the visual information from different scenarios into distinct branches within each vision block and projects them into the same feature space. A consistency constraint is imposed on the features of each branch to maintain the stability of visual representations across scenarios. Extensive experiments on the MSVQA dataset demonstrate that UNIFIER effectively alleviates forgetting of cross-scenario tasks and achieves knowledge accumulation within the same scenario.

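[126] Unified Deep Learning Platform for Dust and Fault Diagnosis in Solar Panels Using Thermal and Visual Imaging cs.CV

Abishek Karthik, Sreya Mynampati, Pandiyaraju V

TL;DR: This paper presents a unified deep learning platform that detects dust and faults on solar panels using thermal and visual imaging, improving maintenance efficiency.

Details

Motivation: Solar panel output is affected by many factors (such as dust and faults), and traditional inspection methods are inefficient. The paper aims to provide efficient, centralized detection with deep learning models.

Result: Experiments indicate that the model is more efficient and accurate than existing models at dust and fault detection.

Insight: A unified deep learning platform can support efficient solar panel maintenance across different scales and geographic environments.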

Abstract: Solar energy is one of the most abundant and widely tapped sources of renewable energy, with enormous future potential. Solar panel output can vary widely with factors such as intensity, temperature, dirt, and debris. We have implemented a model for detecting dust and faults on solar panels. These two applications are centralized on a single platform that can be used for routine maintenance and other checks. Detections are checked against various parameters such as power output, sinusoidal wave (I-V component of the solar cell), and the voltage across each solar cell. First, we filter and preprocess the acquired images using gamma removal and Gaussian filtering, alongside predefined steps such as normalization. The first application detects whether a solar cell is dusty based on predetermined indicators such as shadowing, leaves, droppings, air pollution, and other human activity, down to the level of fine-grained solar modules. The second application uses thermal imaging to detect faults on solar panels, such as cracks and cell malfunctions. This centralized platform can be vital because solar panel efficiency differs across geographies (owing to the effects of air and heat), and it can serve needs ranging from small-scale household installations to large-scale solar farm maintenance. It incorporates CNN and ResNet models with self-attention mechanisms (the KerNet model), which are used for classification, resulting in a fine-tuned system that detects dust or any fault that occurs. The multi-application model thus proves efficient at detecting dust and faults on solar panels. Our comparisons demonstrate that the model achieves better efficiency and accuracy overall than existing models.

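[127] Breaking Forgetting: Training-Free Few-Shot Class-Incremental Learning via Conditional Diffusion cs.CV

Haidong Kang, Ketong Qian, Yi Lu

TL;DR: This paper proposes CD-FSCIL, a training-free Few-Shot Class-Incremental Learning (FSCIL) method that replaces conventional gradient optimization with a generative process based on conditional diffusion, addressing the catastrophic forgetting and training cost that gradient learning incurs in few-shot incremental learning.

Details

Motivation: Current FSCIL methods rely mainly on gradient optimization, which causes catastrophic forgetting and sharply rising training cost, and limits performance especially in few-shot settings. The paper explores an FSCIL approach that requires no gradient optimization at all.

Result: Experiments show that CD-FSCIL achieves SOTA performance while greatly reducing computational and memory overhead.

Insight: Diffusion models can replace gradient optimization for incremental learning, and multimodal learning can alleviate the limitations of few-shot data.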

Abstract: Efforts to overcome catastrophic forgetting in Few-Shot Class-Incremental Learning (FSCIL) have primarily focused on developing more effective gradient-based optimization strategies. In contrast, little attention has been paid to the training cost explosion that inevitably arises as the number of novel classes increases, a consequence of relying on gradient learning even under extreme data scarcity. More critically, since FSCIL typically provides only a few samples for each new class, gradient-based updates not only induce severe catastrophic forgetting on base classes but also hinder adaptation to novel ones. This paper seeks to break this long-standing limitation by asking: Can we design a training-free FSCIL paradigm that entirely removes gradient optimization? We provide an affirmative answer by uncovering an intriguing connection between gradient-based optimization and the Conditional Diffusion process. Building on this observation, we propose a Conditional Diffusion-driven FSCIL (CD-FSCIL) framework that substitutes the conventional gradient update process with a diffusion-based generative transition, enabling training-free incremental adaptation while effectively mitigating forgetting. Furthermore, to enhance representation under few-shot constraints, we introduce a multimodal learning strategy that integrates visual features with natural language descriptions automatically generated by Large Language Models (LLMs). This synergy substantially alleviates the sample scarcity issue and improves generalization across novel classes. Extensive experiments on mainstream FSCIL benchmarks demonstrate that our method not only achieves state-of-the-art performance but also drastically reduces computational and memory overhead, marking a paradigm shift toward training-free continual adaptation.

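[128] HiFi-MambaV2: Hierarchical Shared-Routed MoE for High-Fidelity MRI Reconstruction cs.CV

Pengcheng Fang, Hongli Chen, Guangzhen Yao, Jian Shi, Fangfang Tang

TL;DR: HiFi-MambaV2 is a hierarchical shared-routed MoE Mamba architecture for high-fidelity MRI reconstruction. It combines frequency decomposition with content-adaptive computation, improving reconstruction quality and structural fidelity.

Details

Motivation: MRI reconstruction must recover high-frequency details from undersampled k-space data while maintaining anatomical coherence; conventional approaches (CNNs, Transformers) fall short in performance and stability.

Result: Across multiple datasets and acceleration factors, HiFi-MambaV2 outperforms CNN, Transformer, and Mamba baselines in PSNR, SSIM, and NMSE.

Insight: Combining frequency decomposition with content-adaptive computation is key to better MRI reconstruction quality, and the sparse dispatch of the MoE architecture improves computational efficiency.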

Abstract: Reconstructing high-fidelity MR images from undersampled k-space data requires recovering high-frequency details while maintaining anatomical coherence. We present HiFi-MambaV2, a hierarchical shared-routed Mixture-of-Experts (MoE) Mamba architecture that couples frequency decomposition with content-adaptive computation. The model comprises two core components: (i) a separable frequency-consistent Laplacian pyramid (SF-Lap) that delivers alias-resistant, stable low- and high-frequency streams; and (ii) a hierarchical shared-routed MoE that performs per-pixel top-1 sparse dispatch to shared experts and local routers, enabling effective specialization with stable cross-depth behavior. A lightweight global context path is fused into an unrolled, data-consistency-regularized backbone to reinforce long-range reasoning and preserve anatomical coherence. Evaluated on fastMRI, CC359, ACDC, M4Raw, and Prostate158, HiFi-MambaV2 consistently outperforms CNN-, Transformer-, and prior Mamba-based baselines in PSNR, SSIM, and NMSE across single- and multi-coil settings and multiple acceleration factors, with consistent improvements in high-frequency detail and overall structural fidelity. These results demonstrate that HiFi-MambaV2 enables reliable and robust MRI reconstruction.
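
Per-pixel top-1 sparse dispatch, in its generic form, routes each pixel to the single expert with the highest router score. The sketch below illustrates only that generic mechanism. The function `top1_dispatch`, the scalar "pixels", and the lambda experts are hypothetical, and the paper's shared experts, local routers, and any gating-weight scaling are not modeled.

```python
def top1_dispatch(pixels, router_logits, experts):
    """Send each pixel to the one expert with the largest router logit."""
    outputs, assignments = [], []
    for x, logits in zip(pixels, router_logits):
        k = max(range(len(logits)), key=lambda i: logits[i])  # argmax over experts
        outputs.append(experts[k](x))  # only the selected expert runs for this pixel
        assignments.append(k)
    return outputs, assignments

# Two toy experts standing in for learned sub-networks.
experts = [lambda x: x + 1.0, lambda x: x * 10.0]
outputs, chosen = top1_dispatch([1.0, 2.0], [[0.9, 0.1], [0.2, 0.8]], experts)
print(outputs, chosen)
```

Since exactly one expert runs per pixel, compute per pixel stays roughly constant as the number of experts grows, which is the usual motivation for sparse MoE dispatch.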

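[129] Zero-Shot Video Deraining with Video Diffusion Models cs.CV

Tuomas Varanka, Juan Luis Gonzalez, Hyeongwoo Kim, Pablo Garrido, Xu Yao

TL;DR: This paper proposes a zero-shot video deraining method that uses a pretrained text-to-video diffusion model to derain complex dynamic scenes without synthetic data or fine-tuning. Latent-space inversion and negative prompting, combined with an attention switching mechanism, improve deraining on real-world datasets.

Details

Motivation: Existing video deraining methods depend on synthetic data or data captured by static cameras, which limits generalization, and fine-tuning diffusion models weakens the generative prior. A training-free, general deraining method for dynamic scenes is needed.

Result: The method outperforms existing methods on real-world datasets, showing strong generalization without supervised training.

Insight: The generative prior of pretrained diffusion models can be used effectively for zero-shot tasks, and the attention switching mechanism is key to handling dynamic backgrounds.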

Abstract: Existing video deraining methods are often trained on paired datasets, either synthetic, which limits their ability to generalize to real-world rain, or captured by static cameras, which restricts their effectiveness in dynamic scenes with background and camera motion. Furthermore, recent works in fine-tuning diffusion models have shown promising results, but the fine-tuning tends to weaken the generative prior, limiting generalization to unseen cases. In this paper, we introduce the first zero-shot video deraining method for complex dynamic scenes that requires neither synthetic data nor model fine-tuning, by leveraging a pretrained text-to-video diffusion model that demonstrates strong generalization capabilities. By inverting an input video into the latent space of diffusion models, we can intervene in its reconstruction process and push it away from the model’s concept of rain using negative prompting. At the core of our approach is an attention switching mechanism that we found is crucial for maintaining dynamic backgrounds as well as structural consistency between the input and the derained video, mitigating artifacts introduced by naive negative prompting. Our approach is validated through extensive experiments on real-world rain datasets, demonstrating substantial improvements over prior methods and showcasing robust generalization without the need for supervised training.
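
Negative prompting is commonly implemented as classifier-free guidance in which the noise prediction for the negative prompt takes the place of the unconditional prediction, so each denoising step moves away from the negative concept. The sketch below shows that standard combination rule on plain lists. It is an assumption that the paper uses this exact form, the function name `guided_noise` is hypothetical, and the attention switching mechanism is not modeled.

```python
def guided_noise(eps_positive, eps_negative, guidance_scale):
    """Combine two noise predictions so the result is pushed away from the negative one.

    eps_positive: prediction conditioned on the desired (or empty) prompt.
    eps_negative: prediction conditioned on the negative prompt, e.g. a rain description.
    """
    return [n + guidance_scale * (p - n) for p, n in zip(eps_positive, eps_negative)]

# With guidance_scale = 1 the result equals the positive prediction;
# larger scales extrapolate further away from the negative prediction.
print(guided_noise([1.0, 2.0], [0.0, 2.0], guidance_scale=3.0))
```

Components where the two predictions agree are left unchanged, which is why naive negative prompting mainly alters the regions the model associates with the negative concept.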

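[130] C3Po: Cross-View Cross-Modality Correspondence by Pointmap Prediction cs.CV

Kuan Wei Huang, Brandon Li, Bharath Hariharan, Noah Snavely

TL;DR: The paper introduces a new dataset, C3, for establishing correspondences between ground-level photos and floor plans, and shows experimentally both the shortcomings of existing methods on this task and the room for improvement.

Details

Motivation: Existing geometric models perform poorly when inputs come from very different viewpoints or modalities, and existing datasets either lack varying modalities (e.g., VIGOR) or lack correspondence annotations (e.g., WAFFLE). A new dataset is needed to support research on cross-view, cross-modality geometric reasoning.

Result: Existing models perform poorly on C3, while training on C3 improves on the best-performing method by 34% in RMSE.

Insight: Cross-modal geometric reasoning remains an open challenge, and the C3 dataset provides support for future research.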

Abstract: Geometric models like DUSt3R have shown great advances in understanding the geometry of a scene from pairs of photos. However, they fail when the inputs are from vastly different viewpoints (e.g., aerial vs. ground) or modalities (e.g., photos vs. abstract drawings) compared to what was observed during training. This paper addresses a challenging version of this problem: predicting correspondences between ground-level photos and floor plans. Current datasets for joint photo–floor plan reasoning are limited, either lacking in varying modalities (VIGOR) or lacking in correspondences (WAFFLE). To address these limitations, we introduce a new dataset, C3, created by first reconstructing a number of scenes in 3D from Internet photo collections via structure-from-motion, then manually registering the reconstructions to floor plans gathered from the Internet, from which we can derive correspondence between images and floor plans. C3 contains 90K paired floor plans and photos across 597 scenes with 153M pixel-level correspondences and 85K camera poses. We find that state-of-the-art correspondence models struggle on this task. By training on our new data, we can improve on the best performing method by 34% in RMSE. We also identify open challenges in cross-modal geometric reasoning that our dataset aims to help address.
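
For reference, a root-mean-square error over 2D correspondences and a relative improvement can be computed as below. The abstract does not state how RMSE is normalized or in which units it is reported, so `correspondence_rmse` uses plain Euclidean error on (x, y) pairs as an assumption, and the numbers passed to `relative_improvement` are illustrative rather than taken from the paper.

```python
import math

def correspondence_rmse(predicted, target):
    """RMSE of Euclidean errors between predicted and ground-truth 2D points."""
    squared = [(px - tx) ** 2 + (py - ty) ** 2
               for (px, py), (tx, ty) in zip(predicted, target)]
    return math.sqrt(sum(squared) / len(squared))

def relative_improvement(baseline_error, new_error):
    """Fraction by which the error dropped relative to the baseline."""
    return (baseline_error - new_error) / baseline_error

print(correspondence_rmse([(3.0, 4.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)]))
print(f"{relative_improvement(10.0, 6.6):.0%}")  # illustrative values only
```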

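[131] PhysGS: Bayesian-Inferred Gaussian Splatting for Physical Property Estimation cs.CV | cs.RO

Samarth Chopra, Jing Liang, Gershom Seneviratne, Dinesh Manocha

TL;DR: PhysGS is a Bayesian-inferred extension of 3D Gaussian Splatting that estimates dense physical properties such as friction, hardness, and material composition from visual cues and vision-language priors. It iteratively refines per-splat property beliefs through Bayesian inference and models uncertainty, improving the accuracy of physical property estimation.

Details

Motivation: Existing 3D reconstruction methods focus on geometry and appearance and cannot infer physical properties (such as friction and hardness). These properties are essential for robots to interact safely with their environment, so a method that combines 3D reconstruction with physical property estimation is needed.

Result: Across ABO-500, indoor, and outdoor datasets, PhysGS improves mass estimation accuracy by up to 22.8%, reduces Shore hardness error by up to 61.2%, and lowers kinetic friction error by up to 18.1% compared to deterministic baselines.

Insight: PhysGS unifies 3D reconstruction, uncertainty modeling, and physical reasoning in a single framework, offering a new approach to giving robots an understanding of the physical properties of their environment.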

Abstract: Understanding physical properties such as friction, stiffness, hardness, and material composition is essential for enabling robots to interact safely and effectively with their surroundings. However, existing 3D reconstruction methods focus on geometry and appearance and cannot infer these underlying physical properties. We present PhysGS, a Bayesian-inferred extension of 3D Gaussian Splatting that estimates dense, per-point physical properties from visual cues and vision–language priors. We formulate property estimation as Bayesian inference over Gaussian splats, where material and property beliefs are iteratively refined as new observations arrive. PhysGS also models aleatoric and epistemic uncertainties, enabling uncertainty-aware object and scene interpretation. Across object-scale (ABO-500), indoor, and outdoor real-world datasets, PhysGS improves accuracy of the mass estimation by up to 22.8%, reduces Shore hardness error by up to 61.2%, and lowers kinetic friction error by up to 18.1% compared to deterministic baselines. Our results demonstrate that PhysGS unifies 3D reconstruction, uncertainty modeling, and physical reasoning in a single, spatially continuous framework for dense physical property estimation. Additional results are available at https://samchopra2003.github.io/physgs.
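
The iterative refinement of material beliefs can be illustrated with a textbook categorical Bayes update for a single splat. This is a generic sketch, not the PhysGS formulation: the functions `update_belief`, `expected_property`, and `entropy`, the two-material vocabulary, the likelihood values, and the density table are all hypothetical.

```python
import math

def update_belief(prior, likelihood):
    """Posterior over materials: prior times observation likelihood, renormalized."""
    unnormalized = {m: prior[m] * likelihood.get(m, 0.0) for m in prior}
    total = sum(unnormalized.values())
    if total == 0.0:
        return dict(prior)  # uninformative observation: keep the prior
    return {m: v / total for m, v in unnormalized.items()}

def expected_property(belief, property_table):
    """Belief-weighted estimate of a physical property."""
    return sum(p * property_table[m] for m, p in belief.items())

def entropy(belief):
    """Shannon entropy of the belief, a simple proxy for remaining uncertainty."""
    return -sum(p * math.log(p) for p in belief.values() if p > 0.0)

belief = {"wood": 0.5, "steel": 0.5}
observation = {"wood": 0.9, "steel": 0.1}   # hypothetical per-view likelihood
density = {"wood": 700.0, "steel": 7850.0}  # illustrative values in kg/m^3
for _ in range(2):                           # two observations arrive in sequence
    belief = update_belief(belief, observation)
    print(belief, expected_property(belief, density), entropy(belief))
```

Each consistent observation sharpens the belief and lowers its entropy, which mirrors the abstract's description of beliefs being refined as new observations arrive.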

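[132] Zero-Reference Joint Low-Light Enhancement and Deblurring via Visual Autoregressive Modeling with VLM-Derived Modulation cs.CV

Wei Dong, Han Zhou, Junwei Lin, Jun Chen

TL;DR: The paper proposes an unsupervised generative framework based on visual autoregressive (VAR) modeling that uses perceptual priors from a vision-language model (VLM) to jointly perform low-light enhancement and deblurring.

Details

Motivation: Real-world dark images suffer not only from low visibility and contrast but also from complex noise and blur. Existing methods rely on paired data or fail to model dynamic illumination and blur characteristics, which leads to poor generalization.

Result: The framework achieves state-of-the-art performance on benchmark datasets.

Insight: Combining VLM perceptual priors with dynamic modeling can improve low-light image restoration, including in the fully unsupervised setting.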

Abstract: Real-world dark images commonly exhibit not only low visibility and contrast but also complex noise and blur, posing significant restoration challenges. Existing methods often rely on paired data or fail to model dynamic illumination and blur characteristics, leading to poor generalization. To tackle this, we propose a generative framework based on visual autoregressive (VAR) modeling, guided by perceptual priors from the vision-language model (VLM). Specifically, to supply informative conditioning cues for VAR models, we deploy an adaptive curve estimation scheme to modulate the diverse illumination based on VLM-derived visibility scores. In addition, we integrate dynamic and spatial-frequency-aware Rotary Positional Encodings (SF-RoPE) into VAR to enhance its ability to model structures degraded by blur. Furthermore, we propose a recursive phase-domain modulation strategy that mitigates blur-induced artifacts in the phase domain via bounded iterative refinement guided by VLM-assessed blur scores. Our framework is fully unsupervised and achieves state-of-the-art performance on benchmark datasets.

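[133] NeAR: Coupled Neural Asset-Renderer Stack cs.CV

Hong Li, Chongjie Ye, Houyuan Chen, Weiqing Xiao, Ziyang Yan

TL;DR: NeAR is a unified framework that couples neural assets with a neural renderer. Jointly designing the asset representation and the renderer improves rendering fidelity, consistency, and efficiency.

Details

Motivation: Neural asset authoring and neural rendering are usually designed independently, which limits the potential of the graphics stack. NeAR explores coupling the two to obtain an end-to-end learnable graphics stack.

Result: NeAR surpasses existing methods on all four tasks: G-buffer-based forward rendering, random-lit single-image reconstruction, unknown-lit single-image relighting, and novel-view relighting.

Insight: Jointly designing neural assets and renderers can improve graphics stack performance and points to a new direction for future research.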

Abstract: Neural asset authoring and neural rendering have emerged as fundamentally disjoint threads: one generates digital assets using neural networks for traditional graphics pipelines, while the other develops neural renderers that map conventional assets to images. However, the potential of jointly designing the asset representation and renderer remains largely unexplored. We argue that coupling them can unlock an end-to-end learnable graphics stack with benefits in fidelity, consistency, and efficiency. In this paper, we explore this possibility with NeAR: a Coupled Neural Asset-Renderer Stack. On the asset side, we build on Trellis-style Structured 3D Latents and introduce a lighting-homogenized neural asset: from a casually lit input, a rectified-flow backbone predicts a Lighting-Homogenized SLAT that encodes geometry and intrinsic material cues in a compact, view-agnostic latent. On the renderer side, we design a lighting-aware neural renderer that uses this neural asset, along with explicit view embeddings and HDR environment maps, to achieve real-time, relightable rendering. We validate NeAR on four tasks: (1) G-buffer-based forward rendering, (2) random-lit single-image reconstruction, (3) unknown-lit single-image relighting, and (4) novel-view relighting. Our coupled stack surpasses state-of-the-art baselines in both quantitative metrics and perceptual quality. We hope this coupled asset-renderer perspective inspires future graphics stacks that view neural assets and renderers as co-designed components instead of independent entities.

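[134] RigAnyFace: Scaling Neural Facial Mesh Auto-Rigging with Unlabeled Data cs.CV

Wenchao Ma, Dario Kneubuehler, Maurice Chu, Ian Sachs, Haomiao Jiang

TL;DR: RigAnyFace (RAF) is a scalable neural auto-rigging framework for facial meshes of diverse topologies, including those with multiple disconnected components. By combining data rigged by professional artists with a tailored 2D supervision strategy, RAF outperforms prior methods in accuracy and generalizability.

Details

Motivation: Training facial mesh rigging models requires large amounts of data rigged by professional artists, which is costly and limited in scale, constraining generalization.

Result: RAF performs well on facial meshes of diverse topologies, supports multiple disconnected components (such as eyeballs), and surpasses prior methods in accuracy and generalizability.

Insight: Combining a limited set of professionally rigged data with 2D supervision on a large set of unlabeled meshes effectively improves generalization.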

Abstract: In this paper, we present RigAnyFace (RAF), a scalable neural auto-rigging framework for facial meshes of diverse topologies, including those with multiple disconnected components. RAF deforms a static neutral facial mesh into industry-standard FACS poses to form an expressive blendshape rig. Deformations are predicted by a triangulation-agnostic surface learning network augmented with our tailored architecture design to condition on FACS parameters and efficiently process disconnected components. For training, we curated a dataset of facial meshes, with a subset meticulously rigged by professional artists to serve as accurate 3D ground truth for deformation supervision. Due to the high cost of manual rigging, this subset is limited in size, constraining the generalization ability of models trained exclusively on it. To address this, we design a 2D supervision strategy for unlabeled neutral meshes without rigs. This strategy increases data diversity and allows for scaled training, thereby enhancing the generalization ability of models trained on this augmented data. Extensive experiments demonstrate that RAF is able to rig meshes of diverse topologies on not only our artist-crafted assets but also in-the-wild samples, outperforming previous works in accuracy and generalizability. Moreover, our method advances beyond prior work by supporting multiple disconnected components, such as eyeballs, for more detailed expression animation. Project page: https://wenchao-m.github.io/RigAnyFace.github.io
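
Once per-pose deformations exist, a blendshape rig is typically evaluated as the neutral mesh plus a weighted sum of per-pose vertex offsets. The sketch below shows that standard linear blendshape formula on a one-vertex toy mesh. The function `blend`, the pose name `jawOpen`, and the vertex data are hypothetical, and the network that predicts the poses is not modeled.

```python
def blend(neutral, poses, weights):
    """Linear blendshape evaluation: neutral + sum_k w_k * (pose_k - neutral)."""
    result = [list(vertex) for vertex in neutral]
    for name, weight in weights.items():
        for i, (posed, rest) in enumerate(zip(poses[name], neutral)):
            for axis in range(3):
                result[i][axis] += weight * (posed[axis] - rest[axis])
    return result

neutral = [[0.0, 0.0, 0.0]]               # one-vertex toy mesh
poses = {"jawOpen": [[0.0, -1.0, 0.0]]}   # hypothetical FACS-style pose
print(blend(neutral, poses, {"jawOpen": 0.5}))
```

Because the formula is per-vertex and linear, it applies unchanged to meshes with several disconnected components, as long as every pose shares the neutral mesh's vertex ordering.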

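[135] Functional Localization Enforced Deep Anomaly Detection Using Fundus Images cs.CV | cs.LG

Jan Benedikt Ruhland, Thorsten Papenbrock, Jan-Peter Sowa, Ali Canbay, Nicole Eter

TL;DR: The paper studies a Vision Transformer (ViT) classifier and a GANomaly anomaly detector for detecting retinal diseases in fundus images, demonstrating the advantages of the ViT under multi-dataset training and the explainability and generalization of GANomaly.

Details

Motivation: Reliable detection of retinal diseases in fundus images is challenged by variable imaging quality, subtle early-stage manifestations, and domain shift across datasets, calling for more effective classification and anomaly detection methods.

Result: On the Papila dataset, the ViT classifier reaches an AUC of 0.91, outperforming convolutional ensemble baselines (0.87); the GANomaly anomaly detector reaches an AUC of 0.76 while offering explainability and generalization.

Insight: The ViT performs well under multi-dataset training, and geometric and color augmentations give the most stable improvements; GANomaly's explainability can support clinical decision-making.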

Abstract: Reliable detection of retinal diseases from fundus images is challenged by the variability in imaging quality, subtle early-stage manifestations, and domain shift across datasets. In this study, we systematically evaluated a Vision Transformer (ViT) classifier under multiple augmentation and enhancement strategies across several heterogeneous public datasets, as well as the AEyeDB dataset, a high-quality fundus dataset created in-house and made available for the research community. The ViT demonstrated consistently strong performance, with accuracies ranging from 0.789 to 0.843 across datasets and diseases. Diabetic retinopathy and age-related macular degeneration were detected reliably, whereas glaucoma remained the most frequently misclassified disease. Geometric and color augmentations provided the most stable improvements, while histogram equalization benefited datasets dominated by structural subtlety. Laplacian enhancement reduced performance across different settings. On the Papila dataset, the ViT with geometric augmentation achieved an AUC of 0.91, outperforming previously reported convolutional ensemble baselines (AUC of 0.87), underscoring the advantages of transformer architectures and multi-dataset training. To complement the classifier, we developed a GANomaly-based anomaly detector, achieving an AUC of 0.76 while providing inherent reconstruction-based explainability and robust generalization to unseen data. Probabilistic calibration using GUESS enabled threshold-independent decision support for future clinical implementation.

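[136] From Healthy Scans to Annotated Tumors: A Tumor Fabrication Framework for 3D Brain MRI Synthesis cs.CV

Nayu Dong, Townim Chowdhury, Hieu Phan, Mark Jenkinson, Johan Verjans

TL;DR: The paper proposes Tumor Fabrication (TF), a two-stage framework for unpaired 3D brain tumor synthesis that targets the scarcity of MRI tumor data and uses synthetic data to improve downstream tumor segmentation.

Details

Motivation: The scarcity of MRI tumor data is a major obstacle to accurate, automated tumor segmentation. Existing synthesis methods either require manual modeling (labor-intensive and dependent on expert knowledge) or rely on large amounts of paired data (impractical when clinical data are limited).

Result: Experiments show that image-label pairs synthesized by TF significantly improve tumor segmentation in low-data regimes, offering a scalable solution to data scarcity in clinical AI applications.

Insight: TF shows how automatically synthesizing high-quality annotated data can improve downstream performance in data-scarce medical imaging, suggesting an approach for training AI models in similar domains.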

Abstract: The scarcity of annotated Magnetic Resonance Imaging (MRI) tumor data presents a major obstacle to accurate and automated tumor segmentation. While existing data synthesis methods offer promising solutions, they often suffer from key limitations. Manual modeling is labor-intensive and requires expert knowledge. Deep generative models may be used to augment data and annotation, but they typically demand large amounts of training pairs in the first place, which is impractical in data-limited clinical settings. In this work, we propose Tumor Fabrication (TF), a novel two-stage framework for unpaired 3D brain tumor synthesis. The framework comprises a coarse tumor synthesis process followed by a refinement process powered by a generative model. TF is fully automated and leverages only healthy image scans along with a limited amount of real annotated data to synthesize large volumes of paired synthetic data for enriching downstream supervised segmentation training. We demonstrate that our synthetic image-label pairs used as data enrichment can significantly improve performance on downstream tumor segmentation tasks in low-data regimes, offering a scalable and reliable solution for medical image enrichment and addressing critical challenges in data scarcity for clinical AI applications.

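[137] Data Augmentation Strategies for Robust Lane Marking Detection cs.CV | eess.IV

Flora Lian, Dinh Quang Huynh, Hector Penades, J. Stephany Berrio Perez, Mao Shan

TL;DR: This paper proposes a generative-AI-based data augmentation method that improves the robustness of lane marking detection models across camera viewpoints, synthesizing viewpoint-specific data through geometric transformation, inpainting, and vehicle body overlays.

Details

Motivation: Lane detection models trained on public datasets (such as CULane) generalize poorly to specific setups such as side-mounted cameras. This domain shift must be addressed to make detection reliable in real deployments.

Result: For both SCNN and UFLDv2, training with the augmented data improves precision, recall, and F1 score and increases robustness to conditions such as shadows.

Insight: Generative data augmentation that bridges the gap between public datasets and deployment scenarios is a scalable and practical solution.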

Abstract: Robust lane detection is essential for advanced driver assistance and autonomous driving, yet models trained on public datasets such as CULane often fail to generalise across different camera viewpoints. This paper addresses the challenge of domain shift for side-mounted cameras used in lane-wheel monitoring by introducing a generative AI-based data enhancement pipeline. The approach combines geometric perspective transformation, AI-driven inpainting, and vehicle body overlays to simulate deployment-specific viewpoints while preserving lane continuity. We evaluated the effectiveness of the proposed augmentation on two state-of-the-art models, SCNN and UFLDv2. When trained with the augmented data, both models show improved robustness to different conditions, including shadows. The experimental results demonstrate gains in precision, recall, and F1 score compared to the pre-trained model. By bridging the gap between widely available datasets and deployment-specific scenarios, our method provides a scalable and practical framework to improve the reliability of lane detection in a pilot deployment scenario.
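
A geometric perspective transformation is usually expressed as a 3x3 homography applied in homogeneous coordinates. The sketch below applies one to lane annotation points, on the assumption that the same matrix used to warp the image is also applied to its labels so that lane continuity is preserved. The function `warp_points` and the example matrix are hypothetical, and the paper's actual transformation parameters are not given in the abstract.

```python
def warp_points(homography, points):
    """Apply a 3x3 homography to 2D points using homogeneous coordinates."""
    warped = []
    for x, y in points:
        u = homography[0][0] * x + homography[0][1] * y + homography[0][2]
        v = homography[1][0] * x + homography[1][1] * y + homography[1][2]
        w = homography[2][0] * x + homography[2][1] * y + homography[2][2]
        warped.append((u / w, v / w))  # perspective divide
    return warped

# Hypothetical viewpoint change: points farther down the image are scaled more.
H = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.5, 1.0]]
lane = [(2.0, 0.0), (2.0, 2.0)]
print(warp_points(H, lane))
```

Warping the annotations with the same matrix as the pixels keeps labels aligned with the transformed image, so no manual relabeling is needed for the synthesized viewpoint.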

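[138] Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers cs.CV

Yiqing Shi, Yiren Song, Mike Zheng Shou

TL;DR: Edit2Perceive adapts image editing diffusion models to dense perception, building on the FLUX.1 Kontext architecture with a consistency loss and achieving SOTA results on depth, normal, and matting tasks.

Details

Motivation: Most dense perception methods rely on text-to-image generators, which are designed for stochastic generation and lack image-to-image consistency. The paper argues that image editing diffusion models are a better fit for dense perception.

Result: The method achieves comprehensive SOTA results on depth, normal, and matting tasks, and single-step inference gives a faster runtime.

Insight: Image editing diffusion models have stronger geometry awareness and suit dense vision tasks; editing-oriented diffusion transformers show strong potential for tasks that demand consistency.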

Abstract: Recent advances in diffusion transformers have shown remarkable generalization in visual synthesis, yet most dense perception methods still rely on text-to-image (T2I) generators designed for stochastic generation. We revisit this paradigm and show that image editing diffusion models are inherently image-to-image consistent, providing a more suitable foundation for dense perception tasks. We introduce Edit2Perceive, a unified diffusion framework that adapts editing models for depth, normal, and matting. Built upon the FLUX.1 Kontext architecture, our approach employs full-parameter fine-tuning and a pixel-space consistency loss to enforce structure-preserving refinement across intermediate denoising states. Moreover, our single-step deterministic inference yields a faster runtime while training on relatively small datasets. Extensive experiments demonstrate comprehensive state-of-the-art results across all three tasks, revealing the strong potential of editing-oriented diffusion transformers for geometry-aware perception.

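[139] MedVision: Dataset and Benchmark for Quantitative Medical Image Analysis cs.CV | cs.AI

Yongcheng Yao, Yongshuo Zong, Raman Dutt, Yongxin Yang, Sotirios A Tsaftaris

TL;DR: This paper introduces MedVision, a large-scale dataset and benchmark for quantitative medical image analysis that addresses the weakness of current vision-language models (VLMs) on quantitative tasks.

Details

Motivation: Clinical decisions often rely on quantitative assessments (such as measuring tumor size), but existing medical VLMs mainly target categorical or qualitative tasks, leaving quantitative ability underexplored and poorly supported.

Result: Experiments show that existing VLMs perform poorly on quantitative tasks, but after fine-tuning on MedVision, error rates drop and precision improves on detection, tumor/lesion size estimation, and angle/distance measurement.

Insight: MedVision lays a foundation for developing VLMs with quantitative reasoning in medical imaging and highlights the importance of tailored datasets and tasks for improving model performance.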

Abstract: Current vision-language models (VLMs) in medicine are primarily designed for categorical question answering (e.g., “Is this normal or abnormal?”) or qualitative descriptive tasks. However, clinical decision-making often relies on quantitative assessments, such as measuring the size of a tumor or the angle of a joint, from which physicians draw their own diagnostic conclusions. This quantitative reasoning capability remains underexplored and poorly supported in existing VLMs. In this work, we introduce MedVision, a large-scale dataset and benchmark specifically designed to evaluate and improve VLMs on quantitative medical image analysis. MedVision spans 22 public datasets covering diverse anatomies and modalities, with 30.8 million image-annotation pairs. We focus on three representative quantitative tasks: (1) detection of anatomical structures and abnormalities, (2) tumor/lesion (T/L) size estimation, and (3) angle/distance (A/D) measurement. Our benchmarks show that current off-the-shelf VLMs perform poorly on these tasks. However, with supervised fine-tuning on MedVision, we significantly enhance their performance across detection, T/L estimation, and A/D measurement, demonstrating reduced error rates and improved precision. This work provides a foundation for developing VLMs with robust quantitative reasoning capabilities in medical imaging. Code and data are available at https://medvision-vlm.github.io.

[140] Neural Geometry Image-Based Representations with Optimal Transport (OT) cs.CVPDF

Xiang Gao, Yuanpeng Liu, Xinmu Wang, Jiazhi Li, Minghao Guo

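TL;DR: This paper proposes a geometry image-based neural representation that uses Optimal Transport (OT) to convert irregular 3D meshes into a regular image grid, enabling efficient storage and restoration.

Details

Motivation: Existing neural representations for 3D meshes rely on neural overfitting and complex decoder networks, which are computationally expensive and storage-inefficient. A regular image structure is better suited to efficient neural processing, but applying it to mesh data is difficult.

Result: Experiments show that the method achieves state-of-the-art storage efficiency (CR) and restoration accuracy (CD, HD).

Insight: The OT-based geometry-image conversion moves the complexity of mesh processing into the regular image domain, greatly improving efficiency and flexibility.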

Abstract: Neural representations for 3D meshes are emerging as an effective solution for compact storage and efficient processing. Existing methods often rely on neural overfitting, where a coarse mesh is stored and progressively refined through multiple decoder networks. While this can restore high-quality surfaces, it is computationally expensive due to successive decoding passes and the irregular structure of mesh data. In contrast, images have a regular structure that enables powerful super-resolution and restoration frameworks, but applying these advantages to meshes is difficult because their irregular connectivity demands complex encoder-decoder architectures. Our key insight is that a geometry image-based representation transforms irregular meshes into a regular image grid, making efficient image-based neural processing directly applicable. Building on this idea, we introduce our neural geometry image-based representation, which is decoder-free, storage-efficient, and naturally suited for neural processing. It stores a low-resolution geometry-image mipmap of the surface, from which high-quality meshes are restored in a single forward pass. To construct geometry images, we leverage Optimal Transport (OT), which resolves oversampling in flat regions and undersampling in feature-rich regions, and enables continuous levels of detail (LoD) through geometry-image mipmapping. Experimental results demonstrate state-of-the-art storage efficiency and restoration accuracy, measured by compression ratio (CR), Chamfer distance (CD), and Hausdorff distance (HD).
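
The three metrics named in this abstract can be illustrated with a minimal sketch. It assumes meshes are compared as sampled point sets and uses one common convention for each metric (unsquared distances, mean over both directions for CD); the paper's exact definitions may differ, and all function names here are illustrative.

```python
import math


def nearest_distance(point, cloud):
    """Euclidean distance from `point` to its nearest neighbour in `cloud`."""
    return min(math.dist(point, other) for other in cloud)


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both directions.

    Assumed convention; some papers use squared distances or a different scaling.
    """
    a_to_b = sum(nearest_distance(p, b) for p in a) / len(a)
    b_to_a = sum(nearest_distance(q, a) for q in b) / len(b)
    return a_to_b + b_to_a


def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance: worst-case nearest-neighbour distance."""
    return max(
        max(nearest_distance(p, b) for p in a),
        max(nearest_distance(q, a) for q in b),
    )


def compression_ratio(original_bytes, compressed_bytes):
    """Storage efficiency: size of the original divided by size of the stored form."""
    return original_bytes / compressed_bytes


# Toy point sets standing in for samples of the original and restored surfaces.
original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
restored = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.5)]
print(chamfer_distance(original, restored), hausdorff_distance(original, restored))
```

CD averages the error over the surface while HD reports the single worst deviation, so the two together separate overall fidelity from localized failures.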

[141] Now You See It, Now You Don’t - Instant Concept Erasure for Safe Text-to-Image and Video Generation cs.CVPDF

Shristi Das Biswas, Arani Roy, Kaushik Roy

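TL;DR: ICE is a training-free, modality-agnostic, one-shot weight modification method for precise concept erasure in text-to-image and text-to-video generation, avoiding the costly retraining and adversarial vulnerabilities of existing methods.

Details

Motivation: Existing concept-erasure methods for text-to-image and video generation suffer from high cost, inference overhead, and vulnerability to adversarial attacks, and they struggle to handle the semantic overlap between the target concept and surrounding content.

Result: ICE achieves strong erasure of artistic styles, objects, identities, and sensitive content while preserving the original generative abilities, and is effective on both text-to-image and text-to-video models.

Insight: By mathematically modeling semantic overlap and using closed-form optimization, ICE provides an efficient and robust concept-erasure scheme that overcomes the limitations of conventional methods.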

Abstract: Robust concept removal for text-to-image (T2I) and text-to-video (T2V) models is essential for their safe deployment. Existing methods, however, suffer from costly retraining, inference overhead, or vulnerability to adversarial attacks. Crucially, they rarely model the latent semantic overlap between the target erase concept and surrounding content – causing collateral damage post-erasure – and even fewer methods work reliably across both T2I and T2V domains. We introduce Instant Concept Erasure (ICE), a training-free, modality-agnostic, one-shot weight modification approach that achieves precise, persistent unlearning with zero overhead. ICE defines erase and preserve subspaces using anisotropic energy-weighted scaling, then explicitly regularises against their intersection using a unique, closed-form overlap projector. We pose a convex and Lipschitz-bounded Spectral Unlearning Objective, balancing erasure fidelity and intersection preservation, that admits a stable and unique analytical solution. This solution defines a dissociation operator that is translated to the model’s text-conditioning layers, making the edit permanent and runtime-free. Across targeted removals of artistic styles, objects, identities, and explicit content, ICE efficiently achieves strong erasure with improved robustness to red-teaming, all while causing only minimal degradation of original generative abilities in both T2I and T2V models.

[142] Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents cs.CV | cs.ROPDF

Dayong Liu, Chao Xu, Weihong Chen, Suyu Zhang, Juncheng Wang

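TL;DR: This paper presents CFG-Bench, a new benchmark for evaluating multimodal large language models (MLLMs) on fine-grained action understanding and higher-order reasoning. It reveals the limitations of current models and uses supervised fine-tuning to validate the data's potential for embodied-agent tasks.

Details

Motivation: Existing benchmarks focus mainly on high-level planning or spatial reasoning and neglect the fine-grained action understanding that embodied agents need. To address this, the authors propose CFG-Bench, which systematically evaluates models on physical interaction, temporal-causal relations, intentional understanding, and evaluative judgment.

Result: Experiments show that existing MLLMs perform poorly at generating fine-grained action instructions and at higher-order reasoning (e.g., intention and evaluation), but supervised fine-tuning significantly improves performance, especially on embodied tasks.

Insight: 1) Fine-grained action understanding is a key capability for embodied agents; 2) higher-order reasoning tasks expose the limitations of MLLMs; 3) CFG-Bench data is valuable for improving model performance.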

Abstract: Multimodal Large Language Models (MLLMs) show promising results as decision-making engines for embodied agents operating in complex, physical environments. However, existing benchmarks often prioritize high-level planning or spatial reasoning, leaving the fine-grained action intelligence required for embodied physical interaction underexplored. To address this gap, we introduce CFG-Bench, a new benchmark designed to systematically evaluate this crucial capability. CFG-Bench consists of 1,368 curated videos paired with 19,562 three-modality question-answer pairs targeting four cognitive abilities: 1) Physical Interaction, 2) Temporal-Causal Relation, 3) Intentional Understanding, and 4) Evaluative Judgment. Together, these dimensions provide a systematic framework for assessing a model’s ability to translate visual observations into actionable knowledge, moving beyond mere surface-level recognition. Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions and exhibit profound limitations in the higher-order reasoning of intention and evaluation. Moreover, supervised fine-tuning (SFT) on our data demonstrates that teaching MLLMs to articulate fine-grained actions directly translates to significant performance gains on established embodied benchmarks. Our analysis highlights these limitations and offers insights for developing more capable and grounded embodied agents.

[143] EVCC: Enhanced Vision Transformer-ConvNeXt-CoAtNet Fusion for Classification cs.CVPDF

Kazi Reyazul Hasan, Md Nafiu Rahman, Wasif Jalal, Sadif Ahmed, Shahriar Raj

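TL;DR: EVCC is a new multi-branch architecture that efficiently fuses Transformers and CNNs through techniques such as adaptive token pruning and bidirectional cross-attention, reaching SOTA accuracy on multiple datasets while substantially reducing computational overhead.

Details

Motivation: To address the high computational cost of existing hybrid vision architectures, the paper proposes an efficient way to fuse Transformers and CNNs.

Result: On CIFAR-100 and other datasets, EVCC improves accuracy by up to 2 percentage points and reduces FLOPs by 25-35%.

Insight: Dynamically adjusting computational demands provides flexibility for real-world deployment while balancing accuracy and efficiency.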

Abstract: Hybrid vision architectures combining Transformers and CNNs have significantly advanced image classification, but they usually do so at significant computational cost. We introduce EVCC (Enhanced Vision Transformer-ConvNeXt-CoAtNet), a novel multi-branch architecture integrating the Vision Transformer, lightweight ConvNeXt, and CoAtNet through key innovations: (1) adaptive token pruning with information preservation, (2) gated bidirectional cross-attention for enhanced feature refinement, (3) auxiliary classification heads for multi-task learning, and (4) a dynamic router gate employing context-aware confidence-driven weighting. Experiments across the CIFAR-100, Tobacco3482, CelebA, and Brain Cancer datasets demonstrate EVCC’s superiority over powerful models like DeiT-Base, MaxViT-Base, and CrossViT-Base by consistently achieving state-of-the-art accuracy with improvements of up to 2 percentage points, while reducing FLOPs by 25 to 35%. Our adaptive architecture adjusts computational demands to deployment needs by dynamically reducing token count, efficiently balancing the accuracy-efficiency trade-off while combining global context, local details, and hierarchical features for real-world applications. The source code of our implementation is available at https://anonymous.4open.science/r/EVCC.

[144] Dendritic Convolution for Noise Image Recognition cs.CV | cs.LGPDF

Jiarui Xue, Dongjian Yang, Ye Sun, Gang Liu

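TL;DR: The paper proposes dendritic convolution, a new method that mimics the dendritic structure of biological neurons to make the convolution operation more robust to noise, significantly improving recognition of noisy data in image classification and object detection.

Details

Motivation: Existing methods tackle noisy image recognition mainly by adjusting networks or training strategies, but their anti-noise performance has reached a bottleneck. This paper explores an anti-interference solution from the neuronal perspective, designing a convolution operation that mimics dendritic structure.

Result: In image classification (EfficientNet-B0) and object detection (YOLOv8), DDC improves accuracy by a relative 11.23% and mAP by 19.80%, respectively, significantly outperforming traditional convolution.

Insight: The computational logic of biological dendrites offers a new approach to designing noise-robust convolutions, suggesting that the noise robustness of low-level operations is crucial in complex noisy environments.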

Abstract: In real-world scenarios of image recognition, there exists substantial noise interference. Existing works primarily focus on methods such as adjusting networks or training strategies to address noisy image recognition, and the anti-noise performance has reached a bottleneck. However, little is known about the exploration of anti-interference solutions from a neuronal perspective. This paper proposes an anti-noise neuronal convolution. This convolution mimics the dendritic structure of neurons, integrates the neighborhood interaction computation logic of dendrites into the underlying design of convolutional operations, and simulates the XOR logic preprocessing function of biological dendrites through nonlinear interactions between input features, thereby fundamentally reconstructing the mathematical paradigm of feature extraction. Unlike traditional convolution where noise directly interferes with feature extraction and exerts a significant impact, DDC mitigates the influence of noise by focusing on the interaction of neighborhood information. Experimental results demonstrate that in image classification tasks (using YOLOv11-cls, VGG16, and EfficientNet-B0) and object detection tasks (using YOLOv11, YOLOv8, and YOLOv5), after replacing traditional convolution with the dendritic convolution, the accuracy of the EfficientNet-B0 model on noisy datasets is relatively improved by 11.23%, and the mean Average Precision (mAP) of YOLOv8 is increased by 19.80%. The consistency between the computation method of this convolution and the dendrites of biological neurons enables it to perform significantly better than traditional convolution in complex noisy environments.

[145] ObjectAlign: Neuro-Symbolic Object Consistency Verification and Correction cs.CV | cs.AI | cs.FL | cs.LGPDF

Mustafa Munir, Harsh Goel, Xiwen Wei, Minkyu Choi, Sahil Shah

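TL;DR: ObjectAlign is a framework that combines perceptual metrics with symbolic reasoning to detect, verify, and correct object consistency problems in video, significantly improving the quality of video editing.

Details

Motivation: Video editing and synthesis often introduce object inconsistencies (such as frame flicker and identity drift) that degrade perceptual quality, creating a need for a system that automatically detects and corrects them.

Result: On the DAVIS and Pexels datasets, ObjectAlign improves CLIP Score and warp error by up to 1.4 and 6.1 points, respectively, over existing methods.

Insight: Combining neural and symbolic methods effectively addresses consistency problems in video editing, while adaptive interpolation makes repair flexible and efficient.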

Abstract: Video editing and synthesis often introduce object inconsistencies, such as frame flicker and identity drift, that degrade perceptual quality. To address these issues, we introduce ObjectAlign, a novel framework that seamlessly blends perceptual metrics with symbolic reasoning to detect, verify, and correct object-level and temporal inconsistencies in edited video sequences. The novel contributions of ObjectAlign are as follows: First, we propose learnable thresholds for metrics characterizing object consistency (i.e. CLIP-based semantic similarity, LPIPS perceptual distance, histogram correlation, and SAM-derived object-mask IoU). Second, we introduce a neuro-symbolic verifier that combines two components: (a) a formal, SMT-based check that operates on masked object embeddings to provably guarantee that object identity does not drift, and (b) a temporal fidelity check that uses a probabilistic model checker to verify the video’s formal representation against a temporal logic specification. A frame transition is subsequently deemed “consistent” based on a single logical assertion that requires satisfying both the learned metric thresholds and this unified neuro-symbolic constraint, ensuring both low-level stability and high-level temporal correctness. Finally, for each contiguous block of flagged frames, we propose a neural network based interpolation for adaptive frame repair, dynamically choosing the interpolation depth based on the number of frames to be corrected. This enables reconstruction of the corrupted frames from the last valid and next valid keyframes. Our results show up to 1.4 point improvement in CLIP Score and up to 6.1 point improvement in warp error compared to SOTA baselines on the DAVIS and Pexels video datasets.
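
The consistency assertion and the grouping of flagged frames described in this abstract might look roughly like the sketch below. The threshold values are hypothetical placeholders (the paper learns them), the neuro-symbolic verifier is reduced to a boolean input, and all function names are illustrative rather than taken from the authors' code.

```python
def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two binary masks (equal-sized nested lists of 0/1)."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += a & b
            union += a | b
    return inter / union if union else 1.0


# Hypothetical threshold values; in the paper these are learnable.
THRESHOLDS = {"clip_sim": 0.90, "lpips": 0.20, "hist_corr": 0.85, "mask_iou": 0.70}


def is_consistent(metrics, verifier_ok):
    """A transition is consistent only if every metric passes AND the verifier agrees."""
    metrics_ok = (
        metrics["clip_sim"] >= THRESHOLDS["clip_sim"]
        and metrics["lpips"] <= THRESHOLDS["lpips"]  # LPIPS is a distance: lower is better
        and metrics["hist_corr"] >= THRESHOLDS["hist_corr"]
        and metrics["mask_iou"] >= THRESHOLDS["mask_iou"]
    )
    return metrics_ok and verifier_ok


def flagged_blocks(flags):
    """Group indices of flagged frames into contiguous (start, end) blocks, inclusive."""
    blocks, start = [], None
    for i, flagged in enumerate(flags):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            blocks.append((start, i - 1))
            start = None
    if start is not None:
        blocks.append((start, len(flags) - 1))
    return blocks
```

Each block returned by `flagged_blocks` is bounded by the last valid frame before it and the next valid frame after it, which is what an interpolation-based repair step would take as keyframes; the block length is the quantity the abstract says drives the interpolation depth.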

[146] Modality-Collaborative Low-Rank Decomposers for Few-Shot Video Domain Adaptation cs.CV | cs.AIPDF

Yuyang Wanyan, Xiaoshan Yang, Weiming Dong, Changsheng Xu

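TL;DR: This paper proposes Modality-Collaborative Low-Rank Decomposers (MC-LRD) for Few-Shot Video Domain Adaptation (FSVDA). By decomposing modality-unique and modality-shared features and combining a cross-domain activation consistency loss with Multimodal Decomposition Routers (MDR), it significantly improves performance.

Details

Motivation: The multimodal nature of video poses unique challenges for few-shot domain adaptation. Existing methods overlook the need to consider domain alignment and modality collaboration together, which limits target-domain performance.

Result: On three public benchmarks, the method significantly outperforms existing methods.

Insight: Decomposing features with different domain shifts and handling modality relations collaboratively can effectively improve few-shot domain adaptation.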

Abstract: In this paper, we study the challenging task of Few-Shot Video Domain Adaptation (FSVDA). The multimodal nature of videos introduces unique challenges, necessitating the simultaneous consideration of both domain alignment and modality collaboration in a few-shot scenario, which is ignored in previous literature. We observe that, under the influence of domain shift, the generalization performance on the target domain of each individual modality, as well as that of fused multimodal features, is constrained. This is because each modality comprises coupled features with multiple components that exhibit different domain shifts. This variability increases the complexity of domain adaptation, thereby reducing the effectiveness of multimodal feature integration. To address these challenges, we introduce a novel framework of Modality-Collaborative Low-Rank Decomposers (MC-LRD) to decompose modality-unique and modality-shared features with different domain shift levels from each modality that are more friendly for domain alignment. The MC-LRD comprises multiple decomposers for each modality and Multimodal Decomposition Routers (MDR). Each decomposer has progressively shared parameters across different modalities. The MDR is leveraged to selectively activate the decomposers to produce modality-unique and modality-shared features. To ensure efficient decomposition, we apply orthogonal decorrelation constraints separately to decomposers and subrouters, enhancing their diversity. Furthermore, we propose a cross-domain activation consistency loss to guarantee that target and source samples of the same category exhibit consistent activation preferences of the decomposers, thereby facilitating domain alignment. Extensive experimental results on three public benchmarks demonstrate that our model achieves significant improvements over existing methods.

[147] DriveFlow: Rectified Flow Adaptation for Robust 3D Object Detection in Autonomous Driving cs.CVPDF

Hongbin Lin, Yiming Yang, Chaoda Zheng, Yifan Zhang, Shuaicheng Niu

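TL;DR: The paper proposes DriveFlow, a rectified flow adaptation method built on pre-trained text-to-image flow models. It enhances training data for 3D object detection in autonomous driving through high-frequency foreground preservation and dual-frequency background optimization, improving performance in OOD scenarios.

Details

Motivation: In autonomous driving, vision-based 3D object detection suffers from insufficient training data and the out-of-distribution (OOD) problem. Training-free image editing offers a solution that requires no modification to pre-trained diffusion models, but current methods fall short in preserving 3D geometric accuracy and in editing flexibility.

Result: Experiments show that DriveFlow achieves comprehensive performance gains on all categories in OOD scenarios.

Insight: Through frequency decomposition and targeted optimization strategies, DriveFlow improves the robustness of 3D object detection without modifying the pre-trained model, offering a new approach to data augmentation.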

Abstract: In autonomous driving, vision-centric 3D object detection recognizes and localizes 3D objects from RGB images. However, due to high annotation costs and diverse outdoor scenes, training data often fails to cover all possible test scenarios, known as the out-of-distribution (OOD) issue. Training-free image editing offers a promising solution for improving model robustness by training data enhancement without any modifications to pre-trained diffusion models. Nevertheless, inversion-based methods often suffer from limited effectiveness and inherent inaccuracies, while recent rectified-flow-based approaches struggle to preserve objects with accurate 3D geometry. In this paper, we propose DriveFlow, a Rectified Flow Adaptation method for training data enhancement in autonomous driving based on pre-trained Text-to-Image flow models. Based on frequency decomposition, DriveFlow introduces two strategies to adapt noise-free editing paths derived from text-conditioned velocities. 1) High-Frequency Foreground Preservation: DriveFlow incorporates a high-frequency alignment loss for foreground to maintain precise 3D object geometry. 2) Dual-Frequency Background Optimization: DriveFlow also conducts dual-frequency optimization for background, balancing editing flexibility and semantic consistency. Comprehensive experiments validate the effectiveness and efficiency of DriveFlow, demonstrating comprehensive performance improvements on all categories across OOD scenarios. Code is available at https://github.com/Hongbin98/DriveFlow.

[148] Seeing What Matters: Visual Preference Policy Optimization for Visual Generation cs.CVPDF

Ziqi Ni, Yuanzhi Liang, Rui Li, Yi Zhou, Haibing Huang

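TL;DR: This paper proposes Visual Preference Policy Optimization (ViPO), an improved GRPO method that replaces the single scalar reward with pixel-level advantage maps to optimize visual generative models at finer granularity.

Details

Motivation: Existing GRPO methods rely on a single scalar reward and ignore the spatial and temporal structure of visual content, so they cannot effectively correct localized artifacts or model fine-grained perceptual cues.

Result: ViPO outperforms standard GRPO on both image and video benchmarks, improving alignment with human preferences and generalization.

Insight: Structured pixel-level reward signals guide generative models more effectively, improving local detail and overall consistency.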

Abstract: Reinforcement learning (RL) has become a powerful tool for post-training visual generative models, with Group Relative Policy Optimization (GRPO) increasingly used to align generators with human preferences. However, existing GRPO pipelines rely on a single scalar reward per sample, treating each image or video as a holistic entity and ignoring the rich spatial and temporal structure of visual content. This coarse supervision hinders the correction of localized artifacts and the modeling of fine-grained perceptual cues. We introduce Visual Preference Policy Optimization (ViPO), a GRPO variant that lifts scalar feedback into structured, pixel-level advantages. ViPO employs a Perceptual Structuring Module that uses pretrained vision backbones to construct spatially and temporally aware advantage maps, redistributing optimization pressure toward perceptually important regions while preserving the stability of standard GRPO. Across both image and video benchmarks, ViPO consistently outperforms vanilla GRPO, improving in-domain alignment with human-preference rewards and enhancing generalization on out-of-domain evaluations. The method is architecture-agnostic, lightweight, and fully compatible with existing GRPO training pipelines, providing a more expressive and informative learning signal for visual generation.
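
One way to picture "lifting scalar feedback into pixel-level advantages" is the sketch below. The group-relative normalization is the standard GRPO form; the redistribution step, which scales each sample's advantage by an importance map normalized to mean 1, is only an assumption made for illustration, since the abstract does not specify how the Perceptual Structuring Module builds its maps.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO advantage: reward standardized within its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def advantage_map(advantage, importance):
    """Spread one scalar advantage over pixels (hypothetical redistribution rule).

    `importance` is a non-negative 2D map with at least one positive entry.
    Normalizing it to mean 1 keeps the average per-pixel advantage equal to the
    original scalar, so the overall update magnitude is unchanged.
    """
    flat = [v for row in importance for v in row]
    mean_weight = sum(flat) / len(flat)
    return [[advantage * v / mean_weight for v in row] for row in importance]


rewards = [1.0, 2.0, 3.0]            # one scalar reward per sample in the group
advantages = group_relative_advantages(rewards)
importance = [[1.0, 3.0], [0.0, 0.0]]  # toy 2x2 map, e.g. from a vision backbone
print(advantage_map(advantages[-1], importance))
```

Keeping the map's mean at 1 is one simple way to shift optimization pressure toward important regions without changing the scale of the scalar signal.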

[149] GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving cs.CVPDF

Lin Liu, Caiyan Jia, Guanyi Yu, Ziying Song, JunQiao Li

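TL;DR: GuideFlow is a new planning framework for end-to-end autonomous driving that uses constrained flow matching to address multimodal trajectory collapse, embeds safety and physical constraints directly, and incorporates an energy-based model to strengthen optimization.

Details

Motivation: Existing imitative end-to-end planners are prone to multimodal trajectory collapse, while generative planners struggle to embed constraints directly in the generation process.

Result: Achieves SOTA on the NavSim test hard split (Navhard) with an EPDMS of 43.0.

Insight: Combining explicit constraints with generative models can address both multimodal collapse and constraint satisfaction, and flexible parameterization makes trajectory generation controllable.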

Abstract: Driving planning is a critical component of end-to-end (E2E) autonomous driving. However, prevailing Imitative E2E Planners often suffer from multimodal trajectory mode collapse, failing to produce diverse trajectory proposals. Meanwhile, Generative E2E Planners struggle to incorporate crucial safety and physical constraints directly into the generative process, necessitating an additional optimization stage to refine their outputs. In this paper, we propose GuideFlow, a novel planning framework that leverages Constrained Flow Matching. Concretely, GuideFlow explicitly models the flow matching process, which inherently mitigates mode collapse and allows for flexible guidance from various conditioning signals. Our core contribution lies in directly enforcing explicit constraints within the flow matching generation process, rather than relying on implicit constraint encoding. Crucially, GuideFlow unifies the training of the flow matching with the Energy-Based Model (EBM) to enhance the model’s autonomous optimization capability to robustly satisfy physical constraints. Secondly, GuideFlow parameterizes driving aggressiveness as a control signal during generation, enabling precise manipulation of trajectory style. Extensive evaluations on major driving benchmarks (Bench2Drive, NuScenes, NavSim and ADV-NuScenes) validate the effectiveness of GuideFlow. Notably, on the NavSim test hard split (Navhard), GuideFlow achieved SOTA with an EPDMS score of 43.0. The code will be released.

[150] Yo’City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion cs.CV | cs.AIPDF

Keyang Lu, Sifan Zhou, Hongbin Xu, Gang Xu, Zhifei Yang

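TL;DR: Yo’City proposes an agentic framework for personalized, boundlessly expandable 3D city generation. It achieves high-quality generation through hierarchical planning and a self-critic loop, and supports user-interactive city expansion.

Details

Motivation: Existing methods rely on a single diffusion model and struggle to generate personalized, scalable city scenes; Yo’City aims to solve this problem.

Result: Experiments show that Yo’City outperforms existing methods across multi-dimensional evaluation metrics.

Insight: Hierarchical planning and user interaction are key to high-quality, personalized city generation.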

Abstract: Realistic 3D city generation is fundamental to a wide range of applications, including virtual reality and digital twins. However, most existing methods rely on training a single diffusion model, which limits their ability to generate personalized and boundless city-scale scenes. In this paper, we present Yo’City, a novel agentic framework that enables user-customized and infinitely expandable 3D city generation by leveraging the reasoning and compositional capabilities of off-the-shelf large models. Specifically, Yo’City first conceptualizes the city through a top-down planning strategy that defines a hierarchical “City-District-Grid” structure. The Global Planner determines the overall layout and potential functional districts, while the Local Designer further refines each district with detailed grid-level descriptions. Subsequently, the grid-level 3D generation is achieved through a “produce-refine-evaluate” isometric image synthesis loop, followed by image-to-3D generation. To simulate continuous city evolution, Yo’City further introduces a user-interactive, relationship-guided expansion mechanism, which performs scene graph-based distance- and semantics-aware layout optimization, ensuring spatially coherent city growth. To comprehensively evaluate our method, we construct a diverse benchmark dataset and design six multi-dimensional metrics that assess generation quality from the perspectives of semantics, geometry, texture, and layout. Extensive experiments demonstrate that Yo’City consistently outperforms existing state-of-the-art methods across all evaluation aspects.

[151] Thinking Ahead: Foresight Intelligence in MLLMs and World Models cs.CV | cs.AIPDF

Zhantao Gong, Liaoyuan Fan, Qing Guo, Xun Xu, Xulei Yang

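TL;DR: This paper defines Foresight Intelligence and introduces the FSU-QA dataset for evaluating and enhancing multimodal language models on foresight tasks. Experiments show that existing models perform poorly on such tasks, while small models fine-tuned on FSU-QA substantially outperform much larger ones.

Details

Motivation: Existing research largely overlooks the ability to predict future events (Foresight Intelligence), which is essential in applications such as autonomous driving. The authors aim to fill this gap.

Result: Current models perform poorly on foresight tasks, but small models fine-tuned on FSU-QA substantially outperform larger models, confirming the dataset's effectiveness.

Insight: Foresight Intelligence is an important but overlooked research direction; FSU-QA lays a foundation for developing next-generation models capable of anticipating future events.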

Abstract: In this work, we define Foresight Intelligence as the capability to anticipate and interpret future events, an ability essential for applications such as autonomous driving, yet largely overlooked by existing research. To bridge this gap, we introduce FSU-QA, a new Visual Question-Answering (VQA) dataset specifically designed to elicit and evaluate Foresight Intelligence. Using FSU-QA, we conduct the first comprehensive study of state-of-the-art Vision-Language Models (VLMs) under foresight-oriented tasks, revealing that current models still struggle to reason about future situations. Beyond serving as a benchmark, FSU-QA also enables the assessment of world models by measuring the semantic coherence of their generated predictions, quantified through performance gains when VLMs are augmented with such outputs. Our experiments further demonstrate that FSU-QA can effectively enhance foresight reasoning: even small VLMs fine-tuned on FSU-QA surpass much larger, advanced models by a substantial margin. Together, these findings position FSU-QA as a principled foundation for developing next-generation models capable of truly anticipating and understanding future events.

[152] ProxT2I: Efficient Reward-Guided Text-to-Image Generation via Proximal Diffusion cs.CV | cs.AI | cs.LGPDF

Zhenghan Fang, Jian Zheng, Qiaozi Gao, Xiaofeng Gao, Jeremias Sulam

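TL;DR: ProxT2I is a text-to-image diffusion model based on backward discretization. It replaces conventional score functions with conditional proximal operators and uses reinforcement learning to optimize task-specific rewards, significantly improving sampling efficiency and human-preference alignment.

Details

Motivation: Conventional diffusion models rely on forward discretization of the reverse diffusion process and on score functions, which leads to many sampling steps, low efficiency, and instability. ProxT2I aims to address these problems and improve generation efficiency and quality.

Result: ProxT2I outperforms conventional methods in sampling efficiency and human-preference alignment, and performs on par with existing SOTA while using less compute and a smaller model.

Insight: Combining backward discretization with proximal operators may be a new direction for making diffusion models efficient; task-specific reward optimization helps improve generation quality.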

Abstract: Diffusion models have emerged as a dominant paradigm for generative modeling across a wide range of domains, including prompt-conditional generation. The vast majority of samplers, however, rely on forward discretization of the reverse diffusion process and use score functions that are learned from data. Such forward and explicit discretizations can be slow and unstable, requiring a large number of sampling steps to produce good-quality samples. In this work we develop a text-to-image (T2I) diffusion model based on backward discretizations, dubbed ProxT2I, relying on learned and conditional proximal operators instead of score functions. We further leverage recent advances in reinforcement learning and policy optimization to optimize our samplers for task-specific rewards. Additionally, we develop a new large-scale and open-source dataset comprising 15 million high-quality human images with fine-grained captions, called LAION-Face-T2I-15M, for training and evaluation. Our approach consistently enhances sampling efficiency and human-preference alignment compared to score-based baselines, and achieves results on par with existing state-of-the-art and open-source text-to-image models while requiring lower compute and smaller model size, offering a lightweight yet performant solution for human text-to-image generation.
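
The stability gap between forward (explicit) and backward (implicit) discretization can be seen on a toy problem. The sketch below is not the paper's sampler: it assumes a one-dimensional gradient flow x' = -k·x for the quadratic f(x) = (k/2)·x², where the backward Euler step coincides with the proximal operator of h·f and has a closed form. ProxT2I instead learns conditional proximal operators from data.

```python
def forward_euler_step(x, h, k):
    """Explicit step: x_next = x - h * grad f(x), with grad f(x) = k * x."""
    return x - h * k * x


def proximal_step(x, h, k):
    """Implicit (backward Euler) step, equal to prox_{h f}(x).

    prox_{h f}(x) = argmin_y f(y) + (y - x)^2 / (2h); for f(y) = (k/2) y^2
    the minimizer is x / (1 + h k).
    """
    return x / (1.0 + h * k)


def run(step, x0, h, k, n_steps):
    """Apply `step` repeatedly, starting from x0."""
    x = x0
    for _ in range(n_steps):
        x = step(x, h, k)
    return x


# With a large step size (h * k = 5), the explicit scheme multiplies x by -4
# each step and diverges, while the proximal scheme multiplies by 1/6 and converges.
k, h = 10.0, 0.5
print(run(forward_euler_step, 1.0, h, k, 10))
print(run(proximal_step, 1.0, h, k, 10))
```

The explicit scheme is only stable when h·k < 2, so it needs many small steps; the proximal scheme is stable for any positive h, which is the intuition behind using fewer sampling steps.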

[153] Any4D: Open-Prompt 4D Generation from Natural Language and Images cs.CV | cs.AIPDF

Hao Li, Qiao Sun

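TL;DR: PEWM is a video generation method based on primitive motions. By generating video over fixed short time windows, it addresses the scarcity and complexity of large-scale embodied interaction data while improving language-action alignment granularity, learning efficiency, and data collection efficiency.

Details

Motivation: Current video-generation-based embodied world models depend on large-scale data; data scarcity and high dimensionality limit the alignment granularity between language and actions and hinder long-horizon video generation.

Result: PEWM improves language-action alignment granularity, learning efficiency, and data efficiency, and supports compositional generalization to complex tasks.

Insight: The diversity of primitive motions is far lower than that of embodied data; restricting the time window simplifies the problem, and combining semantic and spatiotemporal priors improves model performance.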

Abstract: While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation, hindering generative models from achieving a “GPT moment” in the embodied domain. There is a naive observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose Primitive Embodied World Models (PEWM), which restrict video generation to fixed shorter horizons. Our approach 1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, 2) reduces learning complexity, 3) improves data efficiency in embodied data collection, and 4) decreases inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.

[154] VAOT: Vessel-Aware Optimal Transport for Retinal Fundus Enhancement cs.CVPDF

Xuanzhao Dong, Wenhui Zhu, Yujian Xiong, Xiwen Chen, Hao Wang

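TL;DR: VAOT is a retinal fundus image enhancement framework that combines an optimal-transport objective with structure-preserving regularization, aiming to reduce noise while preserving the integrity of vessel structure.

Details

Motivation: Fundus photograph quality often degrades due to factors such as illumination changes, and existing unpaired enhancement methods (e.g., GANs) can damage vessel topology, so a method is needed that enhances images while preserving vessel structure.

Result: On a synthetic degradation benchmark and on downstream vessel and lesion segmentation tasks, VAOT outperforms several state-of-the-art baselines.

Insight: VAOT shows that structure-preserving regularization can effectively enhance image quality and protect clinically critical structures without relying on paired data.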

Abstract: Color fundus photography (CFP) is central to diagnosing and monitoring retinal disease, yet its acquisition variability (e.g., illumination changes) often degrades image quality, which motivates robust enhancement methods. Unpaired enhancement pipelines are typically GAN-based; however, they can distort clinically critical vasculature, altering vessel topology and endpoint integrity. Motivated by these structural alterations, we propose Vessel-Aware Optimal Transport (VAOT), a framework that combines an optimal-transport objective with two structure-preserving regularizers: (i) a skeleton-based loss to maintain global vascular connectivity and (ii) an endpoint-aware loss to stabilize local termini. These constraints guide learning in the unpaired setting, reducing noise while preserving vessel structure. Experimental results on a synthetic degradation benchmark and downstream evaluations in vessel and lesion segmentation demonstrate the superiority of the proposed method over several state-of-the-art baselines. The code is available at https://github.com/Retinal-Research/VAOT

[155] NI-Tex: Non-isometric Image-based Garment Texture Generation cs.CVPDF

Hui Shan, Ming Li, Haitao Yang, Kai Zheng, Sizhe Zheng

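TL;DR: This paper proposes NI-Tex for garment texture generation conditioned on non-isometric images, removing existing methods' dependence on topological consistency and accurate deformation. By constructing the 3D Garment Videos dataset and using Nano Banana for high-quality image editing, it achieves cross-pose and cross-topology texture generation.

Details

Motivation: Existing 3D garment meshes have limited texture diversity, and generative methods usually require strict topological consistency between the input image and the 3D mesh, or accurate deformation matching, which limits texture generation quality and flexibility.

Result: Experiments show that the method generates versatile, spatially aligned PBR materials suitable for industry-level 3D garment design.

Insight: Combining a physically simulated dataset with high-quality image editing enables texture generation across poses and under non-isometric conditions, giving 3D garment design more flexibility.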

Abstract: Existing industrial 3D garment meshes already cover most real-world clothing geometries, yet their texture diversity remains limited. To acquire more realistic textures, generative methods are often used to extract Physically-based Rendering (PBR) textures and materials from large collections of wild images and project them back onto garment meshes. However, most image-conditioned texture generation approaches require strict topological consistency between the input image and the input 3D mesh, or rely on accurate mesh deformation to match the image poses, which significantly constrains the texture generation quality and flexibility. To address the challenging problem of non-isometric image-based garment texture generation, we construct 3D Garment Videos, a physically simulated, garment-centric dataset that provides consistent geometry and material supervision across diverse deformations, enabling robust cross-pose texture learning. We further employ Nano Banana for high-quality non-isometric image editing, achieving reliable cross-topology texture generation between non-isometric image-geometry pairs. Finally, we propose an iterative baking method via uncertainty-guided view selection and reweighting that fuses multi-view predictions into seamless, production-ready PBR textures. Through extensive experiments, we demonstrate that our feedforward dual-branch architecture generates versatile and spatially aligned PBR materials suitable for industry-level 3D garment design.

[156] Rethinking Garment Conditioning in Diffusion-based Virtual Try-On cs.CV | cs.AIPDF

Kihyun Na, Jinyoung Choi, Injung Kim

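TL;DR: This paper proposes Re-CatVTON, an efficient virtual try-on (VTON) model that uses improved conditioning and a single UNet architecture to significantly improve performance while reducing computation and memory overhead.

Details

Motivation: Existing diffusion-based VTON models (e.g., Dual UNet architectures) perform well but incur large computation and memory overhead. Through analysis and hypotheses, the authors propose a more efficient single UNet solution.

Result: Re-CatVTON outperforms CatVTON on FID, KID, and LPIPS, with lower computation and memory overhead than the Dual UNet model Leffa and only a slight drop in SSIM.

Insight: With well-designed conditioning, a single UNet model can strike a better balance between performance and efficiency; the improved guidance strategy and error-control method are particularly important for VTON.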

Abstract: Virtual Try-On (VTON) is the task of synthesizing an image of a person wearing a target garment, conditioned on a person image and a garment image. While diffusion-based VTON models featuring a Dual UNet architecture demonstrate superior fidelity compared to single UNet models, they incur substantial computational and memory overhead due to their heavy structure. In this study, through visualization analysis and theoretical analysis, we derived three hypotheses regarding the learning of context features to condition the denoising process. Based on these hypotheses, we developed Re-CatVTON, an efficient single UNet model that achieves high performance. We further enhance the model by introducing a modified classifier-free guidance strategy tailored for VTON’s spatial concatenation conditioning, and by directly injecting the ground-truth garment latent derived from the clean garment latent to prevent the accumulation of prediction error. The proposed Re-CatVTON significantly improves performance compared to its predecessor (CatVTON) and requires less computation and memory than the high-performance Dual UNet model, Leffa. Our results demonstrate improved FID, KID, and LPIPS scores, with only a marginal decrease in SSIM, establishing a new efficiency-performance trade-off for single UNet VTON models.
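
For readers unfamiliar with classifier-free guidance, the standard (unmodified) rule is sketched below on plain Python lists. The abstract does not describe how Re-CatVTON modifies this rule for spatial concatenation conditioning, so the sketch shows only the baseline form, and the function and argument names are illustrative.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Standard CFG: extrapolate from the unconditional toward the conditional prediction.

    eps_uncond, eps_cond: noise predictions (flat lists of floats) from the same
    network, without and with the conditioning input.
    scale: guidance strength; 0 gives the unconditional prediction, 1 the
    conditional one, and values above 1 push further along the same direction.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 1.0]
print(classifier_free_guidance(eps_uncond, eps_cond, 3.0))
```

Only the components where the two predictions disagree are amplified, which is why the choice of what counts as the "unconditional" input matters when the condition is spatially concatenated with the image.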

[157] ConceptGuard: Proactive Safety in Text-and-Image-to-Video Generation through Multimodal Risk Detection cs.CV | cs.AI

Ruize Ma, Minghong Cai, Yilei Jiang, Jiaming Han, Yi Feng

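TL;DR: ConceptGuard is a safety framework for multimodal video generation that proactively identifies and reduces unsafe semantics through contrastive detection and semantic suppression.

Details

Motivation: Advances in multimodal video generation bring new safety risks. Existing methods are mostly text-only or act as post-hoc auditors, and cannot proactively address risks arising from multimodal interaction.

Result: On the ConceptRisk and T2VSafetyBench-TI2V benchmarks, ConceptGuard outperforms existing baselines and achieves state-of-the-art results.

Insight: Multimodal safety risks call for a structured concept space, and proactively intervening in the generation process is an effective route to safe video generation.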

Abstract: Recent progress in video generative models has enabled the creation of high-quality videos from multimodal prompts that combine text and images. While these systems offer enhanced controllability, they also introduce new safety risks, as harmful content can emerge from individual modalities or their interaction. Existing safety methods are often text-only, require prior knowledge of the risk category, or operate as post-generation auditors, struggling to proactively mitigate such compositional, multimodal risks. To address this challenge, we present ConceptGuard, a unified safeguard framework for proactively detecting and mitigating unsafe semantics in multimodal video generation. ConceptGuard operates in two stages: First, a contrastive detection module identifies latent safety risks by projecting fused image-text inputs into a structured concept space; Second, a semantic suppression mechanism steers the generative process away from unsafe concepts by intervening in the prompt’s multimodal conditioning. To support the development and rigorous evaluation of this framework, we introduce two novel benchmarks: ConceptRisk, a large-scale dataset for training on multimodal risks, and T2VSafetyBench-TI2V, the first benchmark adapted from T2VSafetyBench for the Text-and-Image-to-Video (TI2V) safety setting. Comprehensive experiments on both benchmarks show that ConceptGuard consistently outperforms existing baselines, achieving state-of-the-art results in both risk detection and safe video generation.

[158] A Novel Dual-Stream Framework for dMRI Tractography Streamline Classification with Joint dMRI and fMRI Data cs.CV | cs.AI

Haotian Yan, Bocheng Guo, Jianzhong He, Nir A. Sochen, Ofer Pasternak

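TL;DR: Proposes a novel dual-stream framework that combines dMRI and fMRI data for streamline classification, improving white matter tract parcellation through functional coherence.

Details

Motivation: Existing streamline classification methods rely mainly on geometric features of streamline trajectories and struggle to distinguish fiber tracts that are functionally distinct but follow similar pathways. To address this, the study proposes a new method that incorporates functional MRI data.

Result: Ablation studies and comparisons with existing methods show the method's superior performance on the task of subdividing the corticospinal tract.

Insight: Combining functional and geometric information can significantly improve streamline classification accuracy, offering a new tool for functional studies of white matter tracts.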

Abstract: Streamline classification is essential to identify anatomically meaningful white matter tracts from diffusion MRI (dMRI) tractography. However, current streamline classification methods rely primarily on the geometric features of the streamline trajectory, failing to distinguish between functionally distinct fiber tracts with similar pathways. To address this, we introduce a novel dual-stream streamline classification framework that jointly analyzes dMRI and functional MRI (fMRI) data to enhance the functional coherence of tract parcellation. We design a novel network that performs streamline classification using a pretrained backbone model for full streamline trajectories, while augmenting with an auxiliary network that processes fMRI signals from fiber endpoint regions. We demonstrate our method by parcellating the corticospinal tract (CST) into its four somatotopic subdivisions. Experimental results from ablation studies and comparisons with state-of-the-art methods demonstrate our approach’s superior performance.

[159] STCDiT: Spatio-Temporally Consistent Diffusion Transformer for High-Quality Video Super-Resolution cs.CV

Junyang Chen, Jiangxin Dong, Long Sun, Yixin Yang, Jinshan Pan

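TL;DR: STCDiT is a video super-resolution framework built on a pre-trained video diffusion model. It aims to restore structurally faithful and temporally stable videos from degraded inputs, particularly in scenes with complex camera motion.

Details

Motivation: The main challenges in video super-resolution are maintaining temporal stability during reconstruction and preserving structural fidelity during generation. STCDiT addresses them with segment-wise reconstruction and anchor-frame guidance.

Result: Experiments show that STCDiT outperforms existing methods in structural fidelity and temporal consistency and handles complex-motion scenes.

Insight: Combining segment-wise reconstruction with anchor-frame guidance effectively improves video super-resolution quality, especially in scenes with complex motion.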

Abstract: We present STCDiT, a video super-resolution framework built upon a pre-trained video diffusion model, aiming to restore structurally faithful and temporally stable videos from degraded inputs, even under complex camera motions. The main challenges lie in maintaining temporal stability during reconstruction and preserving structural fidelity during generation. To address these challenges, we first develop a motion-aware VAE reconstruction method that performs segment-wise reconstruction, with each segment clip exhibiting uniform motion characteristics, thereby effectively handling videos with complex camera motions. Moreover, we observe that the first-frame latent extracted by the VAE encoder in each clip, termed the anchor-frame latent, remains unaffected by temporal compression and retains richer spatial structural information than subsequent frame latents. We further develop an anchor-frame guidance approach that leverages structural information from anchor frames to constrain the generation process and improve the structural fidelity of video features. Coupling these two designs enables the video diffusion model to achieve high-quality video super-resolution. Extensive experiments show that STCDiT outperforms state-of-the-art methods in terms of structural fidelity and temporal consistency.

[160] Understanding Task Transfer in Vision-Language Models cs.CV | cs.LG

Bhuvan Sachdeva, Karan Uppal, Abhinav Java, Vineeth N. Balasubramanian

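TL;DR: This paper systematically studies task transferability in vision-language models (VLMs). It proposes a metric, PGF (Perfection Gap Factor), that quantifies the breadth and magnitude of transfer between tasks, and experimentally reveals transfer relationships among perception tasks.

Details

Motivation: Existing VLMs perform well on multimodal benchmarks but lag behind humans and specialized models on visual perception tasks such as depth estimation or object counting. Finetuning on one task can unpredictably affect zero-shot performance on others, which makes task-specific finetuning difficult.

Result: Experiments uncover patterns of positive and negative transfer between tasks, construct a task-transfer graph that reveals relationships among tasks, and show that PGF can guide data selection for more efficient training.

Insight: The study highlights both opportunities for positive transfer and risks of negative interference, offering practical guidance for further VLM development.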

Abstract: Vision-Language Models (VLMs) perform well on multimodal benchmarks but lag behind humans and specialized models on visual perception tasks like depth estimation or object counting. Finetuning on one task can unpredictably affect performance on others, making task-specific finetuning challenging. In this paper, we address this challenge through a systematic study of task transferability. We examine how finetuning a VLM on one perception task affects its zero-shot performance on others. To quantify these effects, we introduce Perfection Gap Factor (PGF), a metric that captures both the breadth and magnitude of transfer. Using three open-weight VLMs evaluated across 13 perception tasks, we construct a task-transfer graph that reveals previously unobserved relationships among perception tasks. Our analysis uncovers patterns of positive and negative transfer, identifies groups of tasks that mutually influence each other, organizes tasks into personas based on their transfer behavior and demonstrates how PGF can guide data selection for more efficient training. These findings highlight both opportunities for positive transfer and risks of negative interference, offering actionable guidance for advancing VLMs.

[161] StereoDETR: Stereo-based Transformer for 3D Object Detection cs.CV

Shiyi Mu, Zichong Gu, Zhiqi Ai, Anqi Liu, Yilin Gao

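TL;DR: StereoDETR is an efficient DETR-based stereo 3D object detection framework that combines a monocular branch and a stereo branch. It achieves real-time inference through low-cost multi-scale disparity features and a depth sampling strategy, and reaches competitive accuracy on the KITTI benchmark.

Details

Motivation: Compared with monocular 3D object detection, stereo methods are more accurate but suffer from high computational overhead and latency. StereoDETR aims to improve both the speed and the accuracy of stereo 3D detection efficiently.

Result: StereoDETR achieves real-time inference, surpasses monocular methods in speed, and sets new SOTA results on the pedestrian and cyclist subsets of the KITTI benchmark.

Insight: Efficiently coupling monocular and stereo branches is key to improving 3D object detection, while low-cost multi-scale features and a differentiable sampling strategy significantly reduce computational overhead.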

Abstract: Compared to monocular 3D object detection, stereo-based 3D methods offer significantly higher accuracy but still suffer from high computational overhead and latency. The state-of-the-art stereo 3D detection method achieves twice the accuracy of monocular approaches, yet its inference speed is only half as fast. In this paper, we propose StereoDETR, an efficient stereo 3D object detection framework based on DETR. StereoDETR consists of two branches: a monocular DETR branch and a stereo branch. The DETR branch is built upon 2D DETR with additional channels for predicting object scale, orientation, and sampling points. The stereo branch leverages low-cost multi-scale disparity features to predict object-level depth maps. These two branches are coupled solely through a differentiable depth sampling strategy. To handle occlusion, we introduce a constrained supervision strategy for sampling points without requiring extra annotations. StereoDETR achieves real-time inference and is the first stereo-based method to surpass monocular approaches in speed. It also achieves competitive accuracy on the public KITTI benchmark, setting new state-of-the-art results on pedestrian and cyclist subsets. The code is available at https://github.com/shiyi-mu/StereoDETR-OPEN.

[162] PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion cs.CV

Yichen Yang, Hong Li, Haodong Zhu, Linin Yang, Guojun Lei

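TL;DR: PartDiffuser is a semi-autoregressive diffusion framework whose part-wise design addresses the difficulty existing autoregressive methods have in balancing global structural consistency with high-fidelity local detail when generating 3D meshes.

Details

Motivation: Existing autoregressive methods struggle to balance global structural consistency with local detail when generating artist-designed meshes, and are susceptible to error accumulation. PartDiffuser aims to solve this with a part-wise design.

Result: Experiments show that PartDiffuser significantly outperforms existing SOTA models in generating richly detailed 3D meshes and is well suited to real-world applications.

Insight: Decoupling global and local generation through a part-wise design effectively overcomes the limitations of autoregressive methods in 3D mesh generation. Combining semi-autoregression with diffusion offers a new approach to generating detail-rich models.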

Abstract: Existing autoregressive (AR) methods for generating artist-designed meshes struggle to balance global structural consistency with high-fidelity local details, and are susceptible to error accumulation. To address this, we propose PartDiffuser, a novel semi-autoregressive diffusion framework for point-cloud-to-mesh generation. The method first performs semantic segmentation on the mesh and then operates in a “part-wise” manner: it employs autoregression between parts to ensure global topology, while utilizing a parallel discrete diffusion process within each semantic part to precisely reconstruct high-frequency geometric features. PartDiffuser is based on the DiT architecture and introduces a part-aware cross-attention mechanism, using point clouds as hierarchical geometric conditioning to dynamically control the generation process, thereby effectively decoupling the global and local generation tasks. Experiments demonstrate that this method significantly outperforms state-of-the-art (SOTA) models in generating 3D meshes with rich detail, exhibiting exceptional detail representation suitable for real-world applications.

[163] Mitigating Long-Tail Bias in HOI Detection via Adaptive Diversity Cache cs.CV | cs.AI

Yuqiu Jiang, Xiaozhen Qiao, Tianyu Mei, Haojian Huang, Yifan Chen

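TL;DR: This paper proposes the Adaptive Diversity Cache (ADC), a training-free, plug-and-play module that mitigates long-tail bias in HOI detection. It requires no additional training or fine-tuning and significantly improves detection of rare categories.

Details

Motivation: In long-tailed scenarios, HOI detection suffers from a severe shortage of samples for rare interactions. Existing VLM-based methods require additional training or prompt tuning, incurring high computational cost and poor scalability, so an efficient training-free method is needed.

Result: On the HICO-DET and V-COCO datasets, ADC improves rare-category detection by up to +8.57% mAP while preserving overall performance (up to +4.39% mAP on the full dataset).

Insight: ADC shows that dynamic feature caching and category-adaptive mechanisms can effectively mitigate the long-tail problem in HOI detection, suggesting a new direction for training-free methods.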

Abstract: Human-Object Interaction (HOI) detection is a fundamental task in computer vision, empowering machines to comprehend human-object relationships in diverse real-world scenarios. Recent advances in VLMs have significantly improved HOI detection by leveraging rich cross-modal representations. However, most existing VLM-based approaches rely heavily on additional training or prompt tuning, resulting in substantial computational overhead and limited scalability, particularly in long-tailed scenarios where rare interactions are severely underrepresented. In this paper, we propose the Adaptive Diversity Cache (ADC) module, a novel training-free and plug-and-play mechanism designed to mitigate long-tail bias in HOI detection. ADC constructs class-specific caches that accumulate high-confidence and diverse feature representations during inference. The method incorporates frequency-aware cache adaptation that favors rare categories and is designed to enable robust prediction calibration without requiring additional training or fine-tuning. Extensive experiments on HICO-DET and V-COCO datasets show that ADC consistently improves existing HOI detectors, achieving up to +8.57% mAP gain on rare categories and +4.39% on the full dataset, demonstrating its effectiveness in mitigating long-tail bias while preserving overall performance.

[164] DetAny4D: Detect Anything 4D Temporally in a Streaming RGB Video cs.CV

Jiawei Hou, Shenghao Zhang, Can Wang, Zheng Gu, Yonggen Ling

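TL;DR: DetAny4D is an end-to-end framework for 4D object detection in streaming video (3D detection plus temporal consistency) that improves performance by fusing multi-modal features with a geometry-aware spatiotemporal decoder.

Details

Motivation: Existing methods for 4D object detection in streaming video either lack temporal consistency or rely on complex multi-stage pipelines, and large-scale annotated datasets are missing.

Result: Experiments show that DetAny4D achieves competitive detection accuracy and significantly improves temporal stability, addressing jitter and inconsistency.

Insight: Large-scale annotated data and an end-to-end design are crucial for temporal consistency in 4D detection, and multi-modal feature fusion further improves performance.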

Abstract: Reliable 4D object detection, which refers to 3D object detection in streaming video, is crucial for perceiving and understanding the real world. Existing open-set 4D object detection methods typically make predictions on a frame-by-frame basis without modeling temporal consistency, or rely on complex multi-stage pipelines that are prone to error propagation across cascaded stages. Progress in this area has been hindered by the lack of large-scale datasets that capture continuous reliable 3D bounding box (b-box) annotations. To overcome these challenges, we first introduce DA4D, a large-scale 4D detection dataset containing over 280k sequences with high-quality b-box annotations collected under diverse conditions. Building on DA4D, we propose DetAny4D, an open-set end-to-end framework that predicts 3D b-boxes directly from sequential inputs. DetAny4D fuses multi-modal features from pre-trained foundational models and designs a geometry-aware spatiotemporal decoder to effectively capture both spatial and temporal dynamics. Furthermore, it adopts a multi-task learning architecture coupled with a dedicated training strategy to maintain global consistency across sequences of varying lengths. Extensive experiments show that DetAny4D achieves competitive detection accuracy and significantly improves temporal stability, effectively addressing long-standing issues of jitter and inconsistency in 4D object detection. Data and code will be released upon acceptance.

[165] Disc3D: Automatic Curation of High-Quality 3D Dialog Data via Discriminative Object Referring cs.CV

Siyuan Wei, Chunjie Wang, Xiao Liu, Xiaosheng Yan, Zhishan Zhou

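TL;DR: The paper proposes Disc3D, a fully automated pipeline that combines rule-based constraints with multimodal large language models (MLLMs) and large language models (LLMs) to efficiently generate high-quality 3D dialogue data, addressing the poor performance of existing 3D MLLMs caused by data scarcity and ambiguity.

Details

Motivation: 3D MLLMs lag behind 2D models mainly because large-scale 3D scene-dialogue data is lacking. Traditional approaches depend on expensive human annotation and fail to resolve viewpoint ambiguity and object referring ambiguity.

Result: The resulting Disc3D dataset contains over 2 million samples across 25K hybrid 3D scenes and significantly improves 3D MLLM performance on public benchmarks and the Disc3D-QA tasks.

Insight: An automated pipeline that combines rules and models can efficiently address 3D data scarcity while improving model performance.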

Abstract: 3D Multi-modal Large Language Models (MLLMs) still lag behind their 2D peers, largely because large-scale, high-quality 3D scene-dialogue datasets remain scarce. Prior efforts hinge on expensive human annotation and leave two key ambiguities unresolved: viewpoint ambiguity, where spatial language presumes unknown camera poses, and object referring ambiguity, where non-exclusive descriptions blur the line between targets and distractors. We therefore present a fully automated pipeline that converts raw 3D scans into unambiguous, high-quality dialogue data at a fraction of the previous cost. By synergizing rule-based constraints with 2D MLLMs and LLMs, the pipeline enables controllable, scalable generation without human intervention. The pipeline comprises four stages: (1) meta-annotation collection harvesting object-, frame-, and scene-level captions, (2) scene graph construction with relation correction to capture proximal object relations, (3) discriminative object referring that generates exclusive and compact descriptions, and (4) multi-task data generation synthesizing diverse dialogues. Our pipeline systematically mitigates inherent flaws in source datasets and produces the final Disc3D dataset, over 2 million samples in 25K hybrid 3D scenes, spanning scene, view, and object captioning, visual grounding, and five object-centric QA tasks. Extensive experiments demonstrate that training with Disc3D yields consistent, significant improvements on both public benchmarks and our multifaceted Disc3D-QA tasks. Code, data, and models will be publicly available.

[166] VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models cs.CV

Fufangchen Zhao, Liao Zhang, Daiqi Shi, Yuanjun Gao, Chen Ye

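TL;DR: VideoPerceiver is a novel video multimodal large language model (VMLLM) that significantly improves fine-grained action understanding and rare-event captioning through a two-stage training framework (supervised fine-tuning and reinforcement learning) and a new data augmentation method.

Details

Motivation: Existing VMLLMs have limited ability to understand brief actions in short clips or rare events in long videos; VideoPerceiver aims to address this.

Result: It significantly outperforms existing VMLLMs on fine-grained action understanding and rare-event captioning while maintaining performance on standard tasks.

Insight: By focusing on task-relevant visual features, the work rethinks how video-language models are trained for fine-grained perception.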

Abstract: We propose VideoPerceiver, a novel video multimodal large language model (VMLLM) that enhances fine-grained perception in video understanding, addressing VMLLMs’ limited ability to reason about brief actions in short clips or rare transient events in long videos. VideoPerceiver adopts a two-stage training framework. During supervised fine-tuning (SFT), we construct “key-information-missing” videos by extracting event-action keywords from captions, identifying corresponding key frames, and replacing them with adjacent frames. We jointly encode original and modified video tokens with text tokens, aligning intermediate visual representations with keywords via an auxiliary contrastive loss to enhance sensitivity to fine-grained motion cues. In reinforcement learning (RL), both video variants are fed into the model to generate descriptions, and a novel relative reward ensures responses from complete videos outperform those from degraded inputs, explicitly training the model to recover temporally precise action details. We also curate a dataset of 80,000 videos with fine-grained actions and transient events. Experiments show VideoPerceiver substantially outperforms state-of-the-art VMLLMs on fine-grained action understanding and rare event captioning benchmarks, while maintaining strong performance on standard tasks. By prioritizing task-relevant visual features, our work redefines video-language model training for fine-grained perception.

[167] Q-Save: Towards Scoring and Attribution for Generated Video Evaluation cs.CV

Xiele Wu, Zicheng Zhang, Mingtao Chen, Yixian Liu, Yiming Liu

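TL;DR: Proposes Q-Save, a new dataset and model for quality assessment of AI-generated video (AIGV), supporting multi-dimensional annotation (visual quality, dynamic quality, and text-video alignment) and explainable scoring. The model builds on the SlowFast framework and uses a Chain-of-Thought (CoT) training strategy for efficient, accurate evaluation.

Details

Motivation: Existing quality assessment for AI-generated video (AIGV) lacks comprehensiveness and explainability; a method that provides detailed scores and explanations is needed.

Result: The model achieves SOTA performance in video quality prediction and provides human-aligned, interpretable explanations.

Insight: Multi-dimensional annotation and explainable scoring are key to better AIGV evaluation; the SlowFast framework and the GRPO optimization strategy contribute substantially to model performance.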

Abstract: We present Q-Save, a new benchmark dataset and model for holistic and explainable evaluation of AI-generated video (AIGV) quality. The dataset contains nearly 10,000 videos, each annotated with a scalar mean opinion score (MOS) and fine-grained attribution labels along three core dimensions: visual quality, dynamic quality, and text-video alignment. These multi-aspect annotations enable both accurate quality assessment and interpretable reasoning behind the scores. To leverage this data, we propose a unified evaluation model that jointly performs quality scoring and attribution-based explanation. The model adopts the SlowFast framework to distinguish between fast frames and slow frames: slow frames are processed at high resolution while fast frames use low resolution, balancing evaluation accuracy and computational efficiency. For training, we use data formatted in Chain-of-Thought (CoT) style and employ a multi-stage strategy: we first conduct Supervised Fine-Tuning (SFT), then further enhance the model with Group Relative Policy Optimization (GRPO), and finally perform SFT again to improve model stability. Experimental results demonstrate that our model achieves state-of-the-art performance in video quality prediction while also providing human-aligned, interpretable justifications. Our dataset and model establish a strong foundation for explainable evaluation in generative video research, contributing to the development of multimodal generation and trustworthy AI. Code and dataset will be released upon publication.

[168] Leveraging Metaheuristic Approaches to Improve Deep Learning Systems for Anxiety Disorder Detection cs.CV

Mohammadreza Amiri, Monireh Hosseini

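TL;DR: This paper proposes a new hybrid model that combines metaheuristic optimization with deep learning to detect anxiety disorders from multimodal and wearable-device data, significantly improving detection performance and generalization.

Details

Motivation: Traditional anxiety disorder detection relies on subjective assessment, which is time-consuming and inconsistent. AI techniques enable more consistent, automated detection, but current methods still need better feature selection and hyperparameter tuning.

Result: The hybrid model significantly outperforms deep learning alone in detection accuracy and generalization, enabling more efficient and clinically meaningful anxiety disorder detection.

Insight: Combining metaheuristic optimization with deep learning can effectively address feature selection and model optimization on complex medical data, offering a new approach to automated detection of psychological disorders.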

Abstract: Despite being among the most common psychological disorders, anxiety-related conditions are still primarily identified through subjective assessments, such as clinical interviews and self-evaluation questionnaires. These conventional methods often require significant time and may vary depending on the evaluator. However, the emergence of advanced artificial intelligence techniques has created new opportunities for detecting anxiety in a more consistent and automated manner. To address the limitations of traditional approaches, this study introduces a comprehensive model that integrates deep learning architectures with optimization strategies inspired by swarm intelligence. Using multimodal and wearable-sensor datasets, the framework analyzes physiological, emotional, and behavioral signals. Swarm intelligence techniques, including genetic algorithms and particle swarm optimization, are incorporated to refine the feature space and optimize hyperparameters. Meanwhile, deep learning components are tasked with deriving layered and discriminative representations from sequential, multi-source inputs. Our evaluation shows that the fusion of these two computational paradigms significantly enhances detection performance compared with using deep networks alone. The hybrid model achieves notable improvements in accuracy and demonstrates stronger generalization across various individuals. Overall, the results highlight the potential of combining metaheuristic optimization with deep learning to develop scalable, objective, and clinically meaningful solutions for assessing anxiety disorders.

[169] VideoCompressa: Data-Efficient Video Understanding via Joint Temporal Compression and Spatial Reconstruction cs.CV

Shaobo Wang, Tianle Niu, Runkang Yang, Deshan Liu, Xu He

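TL;DR: VideoCompressa proposes a novel video data synthesis framework that jointly optimizes a keyframe selector and a latent compression network, significantly improving the data efficiency of video understanding: a tiny fraction of the data matches or exceeds full-dataset performance.

Details

Motivation: The high storage and computational costs of large-scale video datasets limit the scalability of video understanding models, and existing methods synthesize video data inefficiently, mainly because of frame-level redundancy and complex spatiotemporal dynamics.

Result: On UCF101, it surpasses full-dataset performance by 2.34 percentage points using only 0.13% of the data, with a more than 5800x speedup; on HMDB51, it matches full-data performance using only 0.41% of the data.

Insight: The efficiency bottleneck in video datasets is frame-level redundancy rather than inter-sample redundancy, and dynamic latent compression is an effective way to address it.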

Abstract: The scalability of video understanding models is increasingly limited by the prohibitive storage and computational costs of large-scale video datasets. While data synthesis has improved data efficiency in the image domain, its extension to video remains challenging due to pervasive temporal redundancy and complex spatiotemporal dynamics. In this work, we uncover a critical insight: the primary source of inefficiency in video datasets is not inter-sample redundancy, but intra-sample frame-level redundancy. To leverage this insight, we introduce VideoCompressa, a novel framework for video data synthesis that reframes the problem as dynamic latent compression. Specifically, VideoCompressa jointly optimizes a differentiable keyframe selector, implemented as a lightweight ConvNet with Gumbel-Softmax sampling, to identify the most informative frames, and a pretrained, frozen Variational Autoencoder (VAE) to compress these frames into compact, semantically rich latent codes. These latent representations are then fed into a compression network, enabling end-to-end backpropagation. Crucially, the keyframe selector and synthetic latent codes are co-optimized to maximize retention of task-relevant information. Experiments show that our method achieves unprecedented data efficiency: on UCF101 with ConvNets, VideoCompressa surpasses full-data training by 2.34 percentage points using only 0.13% of the original data, with over 5800x speedup compared to traditional synthesis methods. Moreover, when fine-tuning Qwen2.5-7B-VL on HMDB51, VideoCompressa matches full-data performance using just 0.41% of the training data, outperforming the zero-shot baseline by 10.61%.
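
As a rough illustration of the Gumbel-Softmax sampling mentioned in the abstract, the sketch below draws relaxed one-hot weights over a handful of frames. It is a generic, pure-Python version of the technique, not VideoCompressa's selector: the per-frame scores, the temperature, and the function name are all assumptions made for the example.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Return relaxed one-hot weights over frames via Gumbel-Softmax."""
    noisy = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed scores.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical per-frame scores, as a selector network might produce.
frame_scores = [0.1, 2.0, -1.0, 0.5]
weights = gumbel_softmax(frame_scores, temperature=0.5, rng=random.Random(0))
keyframe = max(range(len(weights)), key=weights.__getitem__)
print(keyframe, weights)
```

Lower temperatures push the weights toward a hard one-hot choice, while the soft weights are what keep the selection differentiable during training.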

[170] FVAR: Visual Autoregressive Modeling via Next Focus Prediction cs.CV

Xiaofan Li, Chenming Wu, Yanpeng Sun, Jiaming Zhou, Delin Qu

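TL;DR: FVAR replaces conventional next-scale prediction with next-focus prediction, resolving aliasing in multi-scale pyramids and improving detail preservation and text readability in generated images.

Details

Motivation: Conventional visual autoregressive models build multi-scale pyramids by uniform downsampling, which introduces artifacts such as jaggies and moiré patterns that degrade detail. FVAR aims to eliminate these artifacts by mimicking a camera's natural blur-to-clarity focusing process.

Result: On ImageNet, it substantially reduces aliasing artifacts and improves detail preservation and text readability while remaining compatible with existing VAR frameworks.

Insight: Mimicking the natural focusing process effectively avoids the artifacts caused by naive downsampling, and high-frequency residual learning offers a new approach to detail generation.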

Abstract: Visual autoregressive models achieve remarkable generation quality through next-scale predictions across multi-scale token pyramids. However, the conventional method uses uniform scale downsampling to build these pyramids, leading to aliasing artifacts that compromise fine details and introduce unwanted jaggies and moiré patterns. To tackle this issue, we present FVAR, which reframes the paradigm from next-scale prediction to next-focus prediction, mimicking the natural process of camera focusing from blur to clarity. Our approach introduces three key innovations: 1) Next-Focus Prediction Paradigm that transforms multi-scale autoregression by progressively reducing blur rather than simply downsampling; 2) Progressive Refocusing Pyramid Construction that uses physics-consistent defocus kernels to build clean, alias-free multi-scale representations; and 3) High-Frequency Residual Learning that employs a specialized residual teacher network to effectively incorporate alias information during training while maintaining deployment simplicity. Specifically, we construct optical low-pass views using defocus point spread function (PSF) kernels with decreasing radius, creating smooth blur-to-clarity transitions that eliminate aliasing at its source. To further enhance detail generation, we introduce a High-Frequency Residual Teacher that learns from both clean structure and alias residuals, distilling this knowledge to a vanilla VAR deployment network for seamless inference. Extensive experiments on ImageNet demonstrate that FVAR substantially reduces aliasing artifacts, improves fine detail preservation, and enhances text readability, achieving superior performance with perfect compatibility with existing VAR frameworks.
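
To make the idea of a blur-to-clarity pyramid concrete, the sketch below blurs a tiny grayscale image with disk ("pillbox") kernels of decreasing radius. The disk is a common idealized defocus PSF and is assumed here for illustration; the abstract does not specify FVAR's actual kernels, radii, or edge handling, and the function names are hypothetical.

```python
def disk_kernel(radius):
    """Normalized disk (pillbox) kernel, an idealized defocus PSF."""
    size = 2 * radius + 1
    mask = [
        [1.0 if (x - radius) ** 2 + (y - radius) ** 2 <= radius ** 2 else 0.0
         for x in range(size)]
        for y in range(size)
    ]
    total = sum(map(sum, mask))
    return [[v / total for v in row] for row in mask]


def blur(image, kernel):
    """Convolve a 2D list with a square kernel, clamping at the borders."""
    r = len(kernel) // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky, row in enumerate(kernel):
                for kx, kv in enumerate(row):
                    yy = min(max(y + ky - r, 0), h - 1)
                    xx = min(max(x + kx - r, 0), w - 1)
                    acc += kv * image[yy][xx]
            out[y][x] = acc
    return out


def refocus_pyramid(image, radii=(3, 2, 1, 0)):
    """Full-resolution views from most blurred to sharp (radius 0 = original)."""
    return [blur(image, disk_kernel(r)) for r in radii]


image = [[0.0, 1.0], [2.0, 3.0]]
levels = refocus_pyramid(image, radii=(2, 1, 0))
print(len(levels), levels[-1])
```

Unlike a downsampling pyramid, every level here keeps the full resolution; only the amount of blur changes from one level to the next.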

[171] Personalized Federated Segmentation with Shared Feature Aggregation and Boundary-Focused Calibration cs.CV | cs.AI

Ishmam Tashdeed, Md. Atiqur Rahman, Sabrina Islam, Md. Azam Hossain

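TL;DR: The paper proposes FedOAP, a new personalized federated learning method that uses cross-attention and a boundary loss to improve tumor segmentation on heterogeneous data.

Details

Motivation: Existing personalized federated learning methods do not fully exploit features shared across clients, particularly in medical image segmentation, where organ data from different clients may be related.

Result: FedOAP outperforms existing federated learning and personalized segmentation methods on tumor segmentation tasks across multiple organs.

Insight: Cross-client feature sharing and boundary calibration are key to improving medical image segmentation in federated learning.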

Abstract: Personalized federated learning (PFL) possesses the unique capability of preserving data confidentiality among clients while tackling the data heterogeneity problem of non-independent and identically distributed (Non-IID) data. Its advantages have led to widespread adoption in domains such as medical image segmentation. However, the existing approaches mostly overlook the potential benefits of leveraging shared features across clients, where each client contains segmentation data of different organs. In this work, we introduce a novel personalized federated approach for organ agnostic tumor segmentation (FedOAP), that utilizes cross-attention to model long-range dependencies among the shared features of different clients and a boundary-aware loss to improve segmentation consistency. FedOAP employs a decoupled cross-attention (DCA), which enables each client to retain local queries while attending to globally shared key-value pairs aggregated from all clients, thereby capturing long-range inter-organ feature dependencies. Additionally, we introduce perturbed boundary loss (PBL) which focuses on the inconsistencies of the predicted mask’s boundary for each client, forcing the model to localize the margins more precisely. We evaluate FedOAP on diverse tumor segmentation tasks spanning different organs. Extensive experiments demonstrate that FedOAP consistently outperforms existing state-of-the-art federated and personalized segmentation methods.

[172] Robust Long-term Test-Time Adaptation for 3D Human Pose Estimation through Motion Discretization cs.CV

Yilin Wen, Kechuan Dong, Yusuke Sugano

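TL;DR: This paper proposes a method for robust long-term test-time adaptation in 3D human pose estimation based on motion discretization, targeting error accumulation in self-supervised learning. Anchor motions derived by unsupervised clustering, combined with a soft-reset mechanism, significantly improve long-term test-time adaptation.

Details

Motivation: Existing online test-time adaptation methods for 3D human pose estimation rely on self-supervision, so imperfect predictions cause errors to accumulate and long-term performance to degrade. The paper proposes a more robust adaptation strategy to address this.

Result: Experiments show that the method outperforms existing online test-time adaptation methods and validate its design choices.

Insight: Combining motion discretization with a soft-reset mechanism effectively reduces error accumulation, while the persistence of personal traits is exploited to further improve performance.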

Abstract: Online test-time adaptation addresses the train-test domain gap by adapting the model on unlabeled streaming test inputs before making the final prediction. However, online adaptation for 3D human pose estimation suffers from error accumulation when relying on self-supervision with imperfect predictions, leading to degraded performance over time. To mitigate this fundamental challenge, we propose a novel solution that highlights the use of motion discretization. Specifically, we employ unsupervised clustering in the latent motion representation space to derive a set of anchor motions, whose regularity aids in supervising the human pose estimator and enables efficient self-replay. Additionally, we introduce an effective and efficient soft-reset mechanism by reverting the pose estimator to its exponential moving average during continuous adaptation. We examine long-term online adaptation by continuously adapting to out-of-domain streaming test videos of the same individual, which allows for the capture of consistent personal shape and motion traits throughout the streaming observation. By mitigating error accumulation, our solution enables robust exploitation of these personal traits for enhanced accuracy. Experiments demonstrate that our solution outperforms previous online test-time adaptation methods and validate our design choices.
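
The soft-reset idea, reverting the live model to its exponential moving average (EMA), can be sketched in a few lines. This is a minimal illustration with weights held in plain lists; the fixed reset interval, the decay value, and the function names are assumptions, since the abstract does not say when or how often the reset is triggered.

```python
def ema_update(ema, weights, decay=0.99):
    """Blend the current weights into their exponential moving average."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


def adapt_with_soft_reset(weights, updates, decay=0.99, reset_every=3):
    """Apply a stream of weight updates, tracking an EMA and periodically
    reverting the live weights to it.

    `updates` stands in for per-step self-supervised gradient steps, and the
    fixed `reset_every` interval is an assumption made for illustration.
    """
    ema = list(weights)
    for step, delta in enumerate(updates, start=1):
        weights = [w + d for w, d in zip(weights, delta)]
        ema = ema_update(ema, weights, decay)
        if step % reset_every == 0:
            weights = list(ema)  # soft reset: revert to the smoothed weights
    return weights, ema


final_weights, final_ema = adapt_with_soft_reset(
    [0.0], [[1.0], [1.0], [1.0]], decay=0.5, reset_every=3
)
print(final_weights, final_ema)
```

Because the EMA lags behind the live weights, reverting to it discards the most recent, possibly drifted, updates without throwing away everything learned during adaptation.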

[173] Deep Hybrid Model for Region of Interest Detection in Omnidirectional Videos cs.CV | cs.AI

Sana Alamgeer

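TL;DR: The paper proposes a hybrid saliency model for detecting regions of interest (ROIs) in 360-degree videos, with the aim of optimizing video streaming and the viewing experience.

Details

Motivation: ROI detection in 360-degree videos is important for streaming, for example to predict viewports and to cut videos intelligently, reducing bandwidth use while improving the user experience.

Result: The proposed method is evaluated against the subjective annotations of the 360RAT dataset.

Insight: A hybrid saliency model can detect ROIs in 360-degree videos, providing technical support for video streaming and the viewing experience.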

Abstract: The main goal of the project is to design a new model that predicts regions of interest in 360° videos. The region of interest (ROI) plays an important role in 360° video streaming. For example, ROIs are used to predict viewports and to cut videos intelligently for live streaming so that less bandwidth is used. Detecting viewports in advance helps reduce head movement while streaming and watching a video on a head-mounted device. Intelligent cuts, in turn, help improve the efficiency of streaming the video to users and enhance the quality of their viewing experience. This report covers the secondary task of identifying ROIs, for which we design, train, and test a hybrid saliency model. In this work, we use salient regions to represent the regions of interest. The method consists of three steps: preprocessing the video to obtain frames, developing a hybrid saliency model for predicting the region of interest, and post-processing the model's output predictions to obtain the region of interest for each frame. We then compare the performance of the proposed method with the subjective annotations of the 360RAT dataset.

[174] Rethinking Long-tailed Dataset Distillation: A Uni-Level Framework with Unbiased Recovery and Relabeling cs.CV

Xiao Cui, Yulei Qin, Xinyue Li, Wengang Zhou, Hongsheng Li

TL;DR: 该论文提出了一种新的长尾数据集蒸馏框架，通过无偏恢复和重新标记的方法，解决了现有方法在长尾分布数据集上的局限性。

Details

Motivation: 现有的数据集蒸馏方法在平衡数据集上表现良好，但在长尾分布数据集中，由于类别不平衡导致模型表示偏差和统计估计（如批归一化统计）的污染，性能下降。

Result: 实验结果显示，在IPC=10和IF=10条件下，CIFAR-100-LT的top-1准确率提高了15.6%，Tiny-ImageNet-LT提高了11.8%。

Insight: 论文的见解在于：1) 长尾数据集蒸馏的核心挑战是模型表示偏差和统计污染；2) 通过联合优化统计对齐和软标签生成可以有效解决这些问题。

Abstract: Dataset distillation creates a small distilled set that enables efficient training by capturing key information from the full dataset. While existing dataset distillation methods perform well on balanced datasets, they struggle under long-tailed distributions, where imbalanced class frequencies induce biased model representations and corrupt statistical estimates such as Batch Normalization (BN) statistics. In this paper, we rethink long-tailed dataset distillation by revisiting the limitations of trajectory-based methods, and instead adopt the statistical alignment perspective to jointly mitigate model bias and restore fair supervision. To this end, we introduce three dedicated components that enable unbiased recovery of distilled images and soft relabeling: (1) enhancing expert models (an observer model for recovery and a teacher model for relabeling) to enable reliable statistics estimation and soft-label generation; (2) recalibrating BN statistics via a full forward pass with dynamically adjusted momentum to reduce representation skew; (3) initializing synthetic images by incrementally selecting high-confidence and diverse augmentations via a multi-round mechanism that promotes coverage and diversity. Extensive experiments on four long-tailed benchmarks show consistent improvements over state-of-the-art methods across varying degrees of class imbalance. Notably, our approach improves top-1 accuracy by 15.6% on CIFAR-100-LT and 11.8% on Tiny-ImageNet-LT under IPC=10 and IF=10.
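Component (2) above mentions recalibrating BN statistics with a dynamically adjusted momentum. The abstract does not give the schedule, so the sketch below assumes one common choice: a momentum of 1/k at the k-th batch, which turns the running estimate into a cumulative average over the whole pass. The function names and the scalar "batch means" are illustrative only.

```python
from typing import Iterable


def recalibrated_running_mean(batch_means: Iterable[float]) -> float:
    """Running mean with momentum 1/k at batch k (assumed schedule).

    With this schedule every batch ends up with equal weight, so the
    result equals the plain average of the batch means.
    """
    running = 0.0
    for k, batch_mean in enumerate(batch_means, start=1):
        momentum = 1.0 / k
        running = (1.0 - momentum) * running + momentum * batch_mean
    return running


def fixed_momentum_running_mean(batch_means: Iterable[float],
                                momentum: float = 0.1) -> float:
    """Conventional fixed-momentum update, starting from zero."""
    running = 0.0
    for batch_mean in batch_means:
        running = (1.0 - momentum) * running + momentum * batch_mean
    return running


means = [1.0, 2.0, 3.0, 6.0]
print(recalibrated_running_mean(means))    # 3.0, the plain average
print(fixed_momentum_running_mean(means))  # about 1.10, biased toward the start value
```

The comparison shows why a fixed momentum can leave the statistics skewed after a short pass, while the adjusted schedule weights all batches equally.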

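[175] DualGazeNet: A Biologically Inspired Dual-Gaze Query Network for Salient Object Detection cs.CV

Yu Zhang, Haoan Ping, Yuchen Li, Zhenshan Bing, Fuchun Sun

TL;DR: DualGazeNet is a pure Transformer framework inspired by the biological visual system that achieves efficient salient object detection (SOD) by modeling the dual-pathway processing of human vision.

Details

Motivation: Existing SOD methods rely on complex architectures and modules to improve performance, which introduces feature redundancy and interference, whereas the human visual system detects salient objects efficiently. The authors therefore aim to design a simpler yet efficient biologically inspired framework.

Result: On five RGB SOD benchmarks, DualGazeNet surpasses 25 state-of-the-art methods. Compared with four Transformer-based baselines of similar capacity, it achieves on average about 60% higher inference speed and 53.4% fewer FLOPs.

Insight: Biologically inspired design can simplify complex AI models while maintaining high performance and efficiency, offering a new direction for SOD.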

Abstract: Recent salient object detection (SOD) methods aim to improve performance in four key directions: semantic enhancement, boundary refinement, auxiliary task supervision, and multi-modal fusion. In pursuit of continuous gains, these approaches have evolved toward increasingly sophisticated architectures with multi-stage pipelines, specialized fusion modules, edge-guided learning, and elaborate attention mechanisms. However, this complexity paradoxically introduces feature redundancy and cross-component interference that obscure salient cues, ultimately reaching performance bottlenecks. In contrast, human vision achieves efficient salient object identification without such architectural complexity. This contrast raises a fundamental question: can we design a biologically grounded yet architecturally simple SOD framework that dispenses with most of this engineering complexity, while achieving state-of-the-art accuracy, computational efficiency, and interpretability? In this work, we answer this question affirmatively by introducing DualGazeNet, a biologically inspired pure Transformer framework that models the dual biological principles of robust representation learning and magnocellular-parvocellular dual-pathway processing with cortical attention modulation in the human visual system. Extensive experiments on five RGB SOD benchmarks show that DualGazeNet consistently surpasses 25 state-of-the-art CNN- and Transformer-based methods. On average, DualGazeNet achieves about 60% higher inference speed and 53.4% fewer FLOPs than four Transformer-based baselines of similar capacity (VST++, MDSAM, Sam2unet, and BiRefNet). Moreover, DualGazeNet exhibits strong cross-domain generalization, achieving leading or highly competitive performance on camouflaged and underwater SOD benchmarks without relying on additional modalities.

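[176] HunyuanVideo 1.5 Technical Report cs.CV

Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang

TL;DR: HunyuanVideo 1.5 is a lightweight yet powerful open-source video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters and supports efficient inference on consumer-grade GPUs.

Details

Motivation: To lower the barrier to high-quality video generation so that it can be used more widely by the community and in research.

Result: The model performs well across multiple durations and resolutions, setting a new state of the art among open-source video generation models.

Insight: Combining a lightweight architecture with efficient design can substantially improve video generation performance while reducing compute requirements.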

Abstract: We present HunyuanVideo 1.5, a lightweight yet powerful open-source video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture featuring selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source video generation models. By releasing the code and model weights, we provide the community with a high-performance foundation that lowers the barrier to video creation and research, making advanced video generation accessible to a broader audience. All open-source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.

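[177] Parallel Vision Token Scheduling for Fast and Accurate Multimodal LMMs Inference cs.CV | cs.MM

Wengyi Zhan, Mingbao Lin, Zhihang Lin, Rongrong Ji

TL;DR: ParVTS is a training-free parallel vision token scheduling framework that partitions visual tokens, processes the groups in parallel, and discards the low-priority ones, significantly improving the inference speed and computational efficiency of MLLMs.

Details

Motivation: High-resolution images produce large numbers of visual tokens in current MLLMs, sharply increasing the cost of self-attention. Conventional token pruning can strip away contextual information and hurt accuracy.

Result: Experiments show that ParVTS prunes up to 88.9% of visual tokens with minimal performance drop, achieving a 1.77x speedup and a 70% reduction in FLOPs.

Insight: Partitioned parallel processing combined with a dynamic discarding strategy is an effective way to reduce the high computational cost of MLLMs while preserving multimodal reasoning accuracy.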

Abstract: Multimodal large language models (MLLMs) deliver impressive vision-language reasoning but suffer steep inference latency because self-attention scales quadratically with sequence length and high-resolution images contribute thousands of visual tokens. Naively pruning less-informative visual tokens reduces this burden, yet indiscriminate removal can strip away contextual cues essential for background or fine-grained questions, undermining accuracy. In this paper, we present ParVTS (Parallel Vision Token Scheduling), a training-free scheduling framework that partitions visual tokens into subject and non-subject groups, processes them in parallel to transfer their semantics into question tokens, and discards the non-subject path mid-inference to reduce computation. This scheduling reduces computational complexity, requires no heuristics or additional modules, and is compatible with diverse existing MLLM architectures. Experiments across multiple MLLM backbones show that ParVTS prunes up to 88.9% of visual tokens with minimal performance drop, achieving a 1.77x speedup and a 70% FLOPs reduction.
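To make the quadratic-scaling argument concrete, the following sketch works through the arithmetic for the pairwise attention term only. The token counts (576 visual, 64 text) are hypothetical, and the result is not meant to reproduce the paper's reported 70% FLOPs reduction, which covers the whole model and a non-subject path that is dropped only partway through inference.

```python
def kept_tokens(num_visual: int, prune_ratio: float) -> int:
    """Number of visual tokens left after pruning a given fraction."""
    return round(num_visual * (1.0 - prune_ratio))


def attention_cost_ratio(num_visual: int, num_text: int,
                         prune_ratio: float) -> float:
    """Ratio of pairwise attention cost after vs. before pruning.

    Self-attention compares every token with every other token, so the
    cost of this term grows with the square of the sequence length.
    """
    before = num_visual + num_text
    after = kept_tokens(num_visual, prune_ratio) + num_text
    return (after / before) ** 2


# Hypothetical sequence: 576 visual tokens and 64 text tokens.
print(kept_tokens(576, 0.889))                         # 64
print(round(attention_cost_ratio(576, 64, 0.889), 4))  # 0.04
```

Under these assumed counts, shrinking the sequence from 640 to 128 tokens cuts the pairwise attention term to about 4% of its original cost, which is why pruning visual tokens is attractive even when other costs shrink only linearly.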

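[178] MagicWorld: Interactive Geometry-driven Video World Exploration cs.CV

Guangyuan Li, Siming Zheng, Shuolin Xu, Jinwei Chen, Bo Li

TL;DR: MagicWorld is an interactive video world model that combines 3D geometric priors with historical retrieval, addressing the structural instability under viewpoint changes and the forgetting of information during multi-step interaction seen in existing methods.

Details

Motivation: Existing interactive video world models fail to fully exploit the correspondence between instruction-driven scene motion and the underlying 3D geometry, and they easily forget historical information during multi-step interaction, which causes semantic and structural drift.

Result: Experiments show that MagicWorld achieves notable improvements in scene stability and in continuity across interaction iterations.

Insight: Explicitly introducing 3D geometric constraints and historical information retrieval can effectively improve the robustness and consistency of interactive video generation.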

Abstract: Recent interactive video world model methods generate scene evolution conditioned on user instructions. Although they achieve impressive results, two key limitations remain. First, they fail to fully exploit the correspondence between instruction-driven scene motion and the underlying 3D geometry, which results in structural instability under viewpoint changes. Second, they easily forget historical information during multi-step interaction, resulting in error accumulation and progressive drift in scene semantics and structure. To address these issues, we propose MagicWorld, an interactive video world model that integrates 3D geometric priors and historical retrieval. MagicWorld starts from a single scene image, employs user actions to drive dynamic scene evolution, and autoregressively synthesizes continuous scenes. We introduce the Action-Guided 3D Geometry Module (AG3D), which constructs a point cloud from the first frame of each interaction and the corresponding action, providing explicit geometric constraints for viewpoint transitions and thereby improving structural consistency. We further propose a History Cache Retrieval (HCR) mechanism, which retrieves relevant historical frames during generation and injects them as conditioning signals, helping the model utilize past scene information and mitigate error accumulation. Experimental results demonstrate that MagicWorld achieves notable improvements in scene stability and continuity across interaction iterations.

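[179] MetaDCSeg: Robust Medical Image Segmentation via Meta Dynamic Center Weighting cs.CV | cs.AI

Chenyu Mu, Guihai Chen, Xun Yang, Erkun Yang, Cheng Deng

TL;DR: This paper proposes MetaDCSeg, a framework that uses meta dynamic center weighting to address the training instability caused by noisy annotations and ambiguous boundaries in medical image segmentation.

Details

Motivation: Medical image segmentation is crucial for clinical applications, but noisy annotations and ambiguous boundaries make model training unstable. Existing methods rely on global noise assumptions or confidence-based sample selection and cannot effectively mitigate the performance degradation caused by annotation noise.

Result: Experiments on four benchmark datasets with varying noise levels show that MetaDCSeg outperforms existing state-of-the-art methods.

Insight: By dynamically learning pixel-wise weights and explicitly modeling boundaries, the method offers a new approach to handling noisy annotations in medical image segmentation.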

Abstract: Medical image segmentation is crucial for clinical applications, but it is frequently disrupted by noisy annotations and ambiguous anatomical boundaries, which lead to instability in model training. Existing methods typically rely on global noise assumptions or confidence-based sample selection, which inadequately mitigate the performance degradation caused by annotation noise, especially in challenging boundary regions. To address this issue, we propose MetaDCSeg, a robust framework that dynamically learns optimal pixel-wise weights to suppress the influence of noisy ground-truth labels while preserving reliable annotations. By explicitly modeling boundary uncertainty through a Dynamic Center Distance (DCD) mechanism, our approach utilizes weighted feature distances for foreground, background, and boundary centers, directing the model’s attention toward hard-to-segment pixels near ambiguous boundaries. This strategy enables more precise handling of structural boundaries, which are often overlooked by existing methods, and significantly enhances segmentation performance. Extensive experiments across four benchmark datasets with varying noise levels demonstrate that MetaDCSeg consistently outperforms existing state-of-the-art methods.

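[180] Learning What to Trust: Bayesian Prior-Guided Optimization for Visual Generation cs.CV | cs.AI

Ruiying Liu, Yuanzhi Liang, Haibin Huang, Tianshu Yu, Chi Zhang

TL;DR: BPGO is an optimization method for visual generation that explicitly models reward uncertainty through a Bayesian prior anchor, improving semantic alignment and generation quality.

Details

Motivation: Existing GRPO methods are limited by the many-to-many correspondence between text and visuals: reward signals are ambiguous, so optimization underuses reliable feedback and tends to overfit noise.

Result: On image and video generation tasks, BPGO achieves stronger semantic alignment, higher perceptual fidelity, and faster convergence than GRPO and its variants.

Insight: Explicitly modeling reward uncertainty and guiding optimization with a Bayesian prior can effectively improve the performance of visual generative models.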

Abstract: Group Relative Policy Optimization (GRPO) has emerged as an effective and lightweight framework for post-training visual generative models. However, its performance is fundamentally limited by the ambiguity of textual-visual correspondence: a single prompt may validly describe diverse visual outputs, and a single image or video may support multiple equally correct interpretations. This many-to-many relationship leads reward models to generate uncertain and weakly discriminative signals, causing GRPO to underutilize reliable feedback and overfit noisy ones. We introduce Bayesian Prior-Guided Optimization (BPGO), a novel extension of GRPO that explicitly models reward uncertainty through a semantic prior anchor. BPGO adaptively modulates optimization trust at two levels: inter-group Bayesian trust allocation emphasizes updates from groups consistent with the prior while down-weighting ambiguous ones, and intra-group prior-anchored renormalization sharpens sample distinctions by expanding confident deviations and compressing uncertain scores. Across both image and video generation tasks, BPGO delivers consistently stronger semantic alignment, enhanced perceptual fidelity, and faster convergence than standard GRPO and recent variants.
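For readers unfamiliar with the baseline that BPGO extends, the sketch below shows the group-relative advantage commonly used in GRPO: each sample's reward is standardized against the other samples generated for the same prompt. It does not implement BPGO's trust allocation or prior-anchored renormalization, whose formulas are not given in the abstract. The use of the population standard deviation and the `eps` value are assumptions.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of samples for a prompt.

    Samples above the group mean get positive advantages and samples
    below it get negative ones. `eps` guards against a zero spread.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


print(group_relative_advantages([1.0, 2.0, 3.0]))
# roughly [-1.2247, 0.0, 1.2247]
print(group_relative_advantages([0.5, 0.5, 0.5]))
# all zeros: a group with identical rewards carries no learning signal
```

The second call illustrates the weak-signal problem the abstract describes: when a reward model scores all samples in a group almost identically, the standardized advantages are dominated by noise or vanish altogether.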

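[181] EventSTU: Event-Guided Efficient Spatio-Temporal Understanding for Video Large Language Models cs.CV

Wenhao Xu, Xin Dong, Yue Li, Haoyuan Shi, Zhiwei Xiong

TL;DR: EventSTU is an event-guided, training-free framework for efficient video understanding that reduces computation through keyframe sampling and token pruning while improving performance.

Details

Motivation: Current video large language models incur high computational costs on long videos, where redundant frames and tokens make inference inefficient. Inspired by event cameras, the authors use event information to optimize spatio-temporal understanding and reduce redundancy.

Result: Experiments show that EventSTU achieves a 3.01x FLOPs reduction and a 3.10x prefilling speedup over the strongest baseline while still improving performance.

Insight: Event information can serve as a zero-cost prior for efficient computation, and support for simulated events extends the approach to general video understanding.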

Abstract: Video large language models have demonstrated strong video understanding capabilities but suffer from high inference costs due to the massive number of tokens in long videos. Inspired by event-based vision, we propose an event-guided, training-free framework for efficient spatio-temporal understanding, named EventSTU. In the temporal domain, we design a coarse-to-fine keyframe sampling algorithm that exploits the change-triggered property of event cameras to eliminate redundant frames. In the spatial domain, we design an adaptive token pruning algorithm that leverages the visual saliency of events as a zero-cost prior to guide spatial reduction. From a holistic spatio-temporal perspective, we further integrate question relevance from keyframe sampling to adaptively allocate token pruning budgets. To facilitate evaluation, we construct EventBench, the first event-inclusive, human-annotated multimodal benchmark that covers diverse real-world scenarios. Beyond physical event cameras, EventSTU also supports general video understanding using simulated events. Comprehensive experiments show that EventSTU achieves 3.01x FLOPs reduction and 3.10x prefilling speedup over the strongest baseline while still improving performance.

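[182] BackdoorVLM: A Benchmark for Backdoor Attacks on Vision-Language Models cs.CV

Juncheng Li, Yige Li, Hanxun Huang, Yunhao Chen, Xin Wang

TL;DR: This paper introduces BackdoorVLM, the first comprehensive benchmark for evaluating backdoor attacks on vision-language models (VLMs), and shows that VLMs are highly sensitive to textual instructions and that text triggers dominate the backdoor mapping.

Details

Motivation: Backdoor attacks have been studied extensively in unimodal settings, but their impact on multimodal foundation models, particularly vision-language models, remains underexplored.

Result: Experiments show that VLMs are highly sensitive to textual instructions and that in bimodal backdoors the text trigger typically dominates the backdoor mapping. Some attacks reach a success rate above 90% with a poisoning rate of only 1%.

Insight: The work exposes underexplored vulnerabilities in current VLMs, especially the strong potential of backdoor attacks in the text domain, and provides an important benchmark for future defense research.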

Abstract: Backdoor attacks undermine the reliability and trustworthiness of machine learning systems by injecting hidden behaviors that can be maliciously activated at inference time. While such threats have been extensively studied in unimodal settings, their impact on multimodal foundation models, particularly vision-language models (VLMs), remains largely underexplored. In this work, we introduce BackdoorVLM, the first comprehensive benchmark for systematically evaluating backdoor attacks on VLMs across a broad range of settings. It adopts a unified perspective that injects and analyzes backdoors across core vision-language tasks, including image captioning and visual question answering. BackdoorVLM organizes multimodal backdoor threats into 5 representative categories: targeted refusal, malicious injection, jailbreak, concept substitution, and perceptual hijack. Each category captures a distinct pathway through which an adversary can manipulate a model’s behavior. We evaluate these threats using 12 representative attack methods spanning text, image, and bimodal triggers, tested on 2 open-source VLMs and 3 multimodal datasets. Our analysis reveals that VLMs exhibit strong sensitivity to textual instructions, and in bimodal backdoors the text trigger typically overwhelms the image trigger when forming the backdoor mapping. Notably, backdoors involving the textual modality remain highly potent, with poisoning rates as low as 1% yielding over 90% success across most tasks. These findings highlight significant, previously underexplored vulnerabilities in current VLMs. We hope that BackdoorVLM can serve as a useful benchmark for analyzing and mitigating multimodal backdoor threats. Code is available at: https://github.com/bin015/BackdoorVLM

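[183] One4D: Unified 4D Generation and Reconstruction via Decoupled LoRA Control cs.CV

Zhenxing Mi, Yuxin Wang, Dan Xu

TL;DR: One4D is a unified framework for 4D generation and reconstruction that uses a decoupled LoRA control mechanism to generate synchronized RGB frames and pointmaps. It switches seamlessly between generation from a single image, reconstruction from a full video, and mixed tasks on sparse frames.

Details

Motivation: Existing approaches to joint RGB and pointmap generation tend to degrade the base video generation model and struggle to handle conditioning frames of varying sparsity. One4D aims to solve these problems and enable high-quality 4D content modeling.

Result: Trained on a mixture of synthetic and real 4D datasets, One4D produces high-quality RGB frames and accurate pointmaps across both generation and reconstruction tasks.

Insight: Modality-decoupled LoRA control preserves the performance of the video generation model while achieving cross-modal consistency, offering an efficient and general solution for 4D modeling.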

Abstract: We present One4D, a unified framework for 4D generation and reconstruction that produces dynamic 4D content as synchronized RGB frames and pointmaps. By consistently handling varying sparsities of conditioning frames through a Unified Masked Conditioning (UMC) mechanism, One4D can seamlessly transition between 4D generation from a single image, 4D reconstruction from a full video, and mixed generation and reconstruction from sparse frames. Our framework adapts a powerful video generation model for joint RGB and pointmap generation, with carefully designed network architectures. The commonly used diffusion finetuning strategies for depthmap or pointmap reconstruction often fail on joint RGB and pointmap generation, quickly degrading the base video model. To address this challenge, we introduce Decoupled LoRA Control (DLC), which employs two modality-specific LoRA adapters to form decoupled computation branches for RGB frames and pointmaps, connected by lightweight, zero-initialized control links that gradually learn mutual pixel-level consistency. Trained on a mixture of synthetic and real 4D datasets under modest computational budgets, One4D produces high-quality RGB frames and accurate pointmaps across both generation and reconstruction tasks. This work represents a step toward general, high-quality geometry-based 4D world modeling using video diffusion models. Project page: https://mizhenxing.github.io/One4D

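[184] FineXtrol: Controllable Motion Generation via Fine-Grained Text cs.CV

Keming Shen, Bizhu Wu, Junliang Chen, Xiaoqin Wang, Linlin Shen

TL;DR: FineXtrol is a control framework for efficient motion generation that uses fine-grained textual signals to precisely control the movements of specific body parts.

Details

Motivation: Existing text-driven motion generation methods suffer from misaligned details or high computational cost. FineXtrol aims to address both.

Result: Quantitative and qualitative analyses show that FineXtrol performs strongly in controllable motion generation.

Insight: Fine-grained textual signals and hierarchical contrastive learning can effectively improve the precision and controllability of motion generation.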

Abstract: Recent works have sought to enhance the controllability and precision of text-driven motion generation. Some approaches leverage large language models (LLMs) to produce more detailed texts, while others incorporate global 3D coordinate sequences as additional control signals. However, the former often introduces misaligned details and lacks explicit temporal cues, and the latter incurs significant computational cost when converting coordinates to standard motion representations. To address these issues, we propose FineXtrol, a novel control framework for efficient motion generation guided by temporally-aware, precise, user-friendly, and fine-grained textual control signals that describe specific body part movements over time. In support of this framework, we design a hierarchical contrastive learning module that encourages the text encoder to produce more discriminative embeddings for our novel control signals, thereby improving motion controllability. Quantitative results show that FineXtrol achieves strong performance in controllable motion generation, while qualitative analysis demonstrates its flexibility in directing specific body part movements.

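[185] Human-Centric Open-Future Task Discovery: Formulation, Benchmark, and Scalable Tree-Based Search cs.CV

Zijian Song, Xiaoxin Lin, Tao Pu, Zhenlong Yuan, Guangrun Wang

TL;DR: This paper formulates the problem of Human-centric Open-future Task Discovery (HOTD) and builds HOTD-Bench, a benchmark with over 2,000 real-world videos. It also proposes the Collaborative Multi-Agent Search Tree (CMAST) framework, which improves task discovery through a multi-agent system and a scalable search tree module.

Details

Motivation: Research based on Large Multimodal Models (LMMs) has focused mainly on closed scenarios, while handling the highly concurrent and dynamic human intentions of open-future scenarios remains underexplored. The paper uses the HOTD problem to advance LMMs in open-future scenarios, discovering tasks that reduce human effort.

Result: CMAST achieves the best performance on HOTD-Bench, significantly surpassing existing LMMs. The framework also integrates well with existing LMMs and consistently improves their performance.

Insight: Structured search and multi-agent collaboration can efficiently handle dynamic task discovery in open-future scenarios. HOTD-Bench provides an important benchmark for future research in this area.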

Abstract: Recent progress in robotics and embodied AI is largely driven by Large Multimodal Models (LMMs). However, a key challenge remains underexplored: how can we advance LMMs to discover tasks that directly assist humans in open-future scenarios, where human intentions are highly concurrent and dynamic? In this work, we formalize the problem of Human-centric Open-future Task Discovery (HOTD), focusing particularly on identifying tasks that reduce human effort across multiple plausible futures. To facilitate this study, we propose HOTD-Bench, which features over 2K real-world videos, a semi-automated annotation pipeline, and a simulation-based protocol tailored for open-set future evaluation. Additionally, we propose the Collaborative Multi-Agent Search Tree (CMAST) framework, which decomposes the complex reasoning through a multi-agent system and structures the reasoning process through a scalable search tree module. In our experiments, CMAST achieves the best performance on HOTD-Bench, significantly surpassing existing LMMs. It also integrates well with existing LMMs, consistently improving performance.

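[186] VeCoR - Velocity Contrastive Regularization for Flow Matching cs.CV

Zong-Wei Hong, Jing-lun Li, Lin-Ze Li, Shen Zhang, Yao Tang

TL;DR: VeCoR improves Flow Matching models with contrastive regularization, enhancing stability and generation quality.

Details

Motivation: In low-step or lightweight configurations, standard Flow Matching can drive samples off the data manifold and degrade generation quality. VeCoR aims to improve stability and generalization through two-sided supervision.

Result: VeCoR significantly reduces FID on ImageNet-1K and MS-COCO, particularly in low-step and lightweight settings.

Insight: Two-sided supervision is an effective strategy for improving the stability and generation quality of Flow Matching.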

Abstract: Flow Matching (FM) has recently emerged as a principled and efficient alternative to diffusion models. Standard FM encourages the learned velocity field to follow a target direction; however, it may accumulate errors along the trajectory and drive samples off the data manifold, leading to perceptual degradation, especially in lightweight or low-step configurations. To enhance stability and generalization, we extend FM into a balanced attract-repel scheme that provides explicit guidance on both “where to go” and “where not to go.” Formally, we propose Velocity Contrastive Regularization (VeCoR), a complementary training scheme for flow-based generative modeling that augments the standard FM objective with contrastive, two-sided supervision. VeCoR not only aligns the predicted velocity with a stable reference direction (positive supervision) but also pushes it away from inconsistent, off-manifold directions (negative supervision). This contrastive formulation transforms FM from a purely attractive, one-sided objective into a two-sided training signal, regularizing trajectory evolution and improving perceptual fidelity across datasets and backbones. On ImageNet-1K 256$\times$256, VeCoR yields 22% and 35% relative FID reductions on SiT-XL/2 and REPA-SiT-XL/2 backbones, respectively, and achieves further FID gains (32% relative) on MS-COCO text-to-image generation, demonstrating consistent improvements in stability, convergence, and image quality, particularly in low-step and lightweight settings. Project page: https://p458732.github.io/VeCoR_Project_Page/
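The attract-repel idea can be illustrated with a generic two-sided objective. The abstract does not state VeCoR's actual loss, so everything below is an assumption for illustration: the squared-distance form, the weight `lam`, and the way a negative direction is supplied are hypothetical choices, not the authors' formulation.

```python
from typing import Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared component differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def attract_repel_loss(v_pred: Sequence[float],
                       v_pos: Sequence[float],
                       v_neg: Sequence[float],
                       lam: float = 0.1) -> float:
    """Generic two-sided objective (illustrative, not the paper's loss).

    The first term pulls the predicted velocity toward the reference
    direction, as in standard Flow Matching. The second term rewards
    distance from a negative, off-manifold direction.
    """
    attract = squared_distance(v_pred, v_pos)
    repel = squared_distance(v_pred, v_neg)
    return attract - lam * repel


# A prediction matching the positive direction scores lower (better)
# than one that matches the negative direction.
print(attract_repel_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], lam=0.5))  # -1.0
print(attract_repel_loss([0.0, 1.0], [1.0, 0.0], [0.0, 1.0], lam=0.5))  # 2.0
```

An unbounded repel term like this one can dominate training if `lam` is large, which is one reason practical contrastive objectives usually bound or normalize the negative term.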

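[187] Eevee: Towards Close-up High-resolution Video-based Virtual Try-on cs.CV

Jianhao Zeng, Yancheng Bai, Ruidong Chen, Xuanpu Zhang, Lei Sun

TL;DR: This paper presents a dataset and a consistency metric for high-resolution video virtual try-on, addressing the lack of texture detail that results from using a single input image and ignoring close-up requirements.

Details

Motivation: Existing virtual try-on techniques rely on a single garment image as input and focus only on full-shot videos, which does not meet the commercial demand for high-resolution close-up videos.

Result: Experiments validate the new dataset and the VGID metric, and show significantly improved detail in virtual try-on results.

Insight: Close-up videos place higher demands on consistency, requiring fidelity in both texture and structure.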

Abstract: Video virtual try-on technology provides a cost-effective solution for creating marketing videos in fashion e-commerce. However, its practical adoption is hindered by two critical limitations. First, the reliance on a single garment image as input in current virtual try-on datasets limits the accurate capture of realistic texture details. Second, most existing methods focus solely on generating full-shot virtual try-on videos, neglecting the business’s demand for videos that also provide detailed close-ups. To address these challenges, we introduce a high-resolution dataset for video-based virtual try-on. This dataset offers two key features. First, it provides more detailed information on the garments, which includes high-fidelity images with detailed close-ups and textual descriptions. Second, it uniquely includes full-shot and close-up try-on videos of real human models. Furthermore, accurately assessing consistency becomes significantly more critical for the close-up videos, which demand high-fidelity preservation of garment details. To facilitate such fine-grained evaluation, we propose a new garment consistency metric VGID (Video Garment Inception Distance) that quantifies the preservation of both texture and structure. Our experiments validate these contributions. We demonstrate that by utilizing the detailed images from our dataset, existing video generation models can extract and incorporate texture features, significantly enhancing the realism and detail fidelity of virtual try-on results. Furthermore, we conduct a comprehensive benchmark of recent models. The benchmark effectively identifies the texture and structural preservation problems among current methods.

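[188] CataractCompDetect: Intraoperative Complication Detection in Cataract Surgery cs.CV

Bhuvan Sachdeva, Sneha Kumari, Rudransh Agarwal, Shalaka Kumaraswamy, Niharika Singri Prasad

TL;DR: This paper proposes CataractCompDetect, a complication detection framework for cataract surgery that combines phase-aware localization, SAM 2-based tracking, complication-specific risk scoring, and vision-language reasoning. It reports an average F1 score of 70.63% on the authors' CataComp dataset.

Details

Motivation: Cataract surgery is one of the most commonly performed surgeries worldwide, but intraoperative complications such as iris prolapse, posterior capsule rupture (PCR), and vitreous loss can lead to adverse outcomes. Automatically detecting these events could enable early warning and objective training feedback.

Result: On the CataComp dataset, the average F1 score is 70.63%, with per-complication scores of 81.8% for iris prolapse, 60.87% for PCR, and 69.23% for vitreous loss.

Insight: Combining structured surgical priors with vision-language reasoning can effectively identify rare but high-impact intraoperative events, offering a new direction for medical AI.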

Abstract: Cataract surgery is one of the most commonly performed surgeries worldwide, yet intraoperative complications such as iris prolapse, posterior capsule rupture (PCR), and vitreous loss remain major causes of adverse outcomes. Automated detection of such events could enable early warning systems and objective training feedback. In this work, we propose CataractCompDetect, a complication detection framework that combines phase-aware localization, SAM 2-based tracking, complication-specific risk scoring, and vision-language reasoning for final classification. To validate CataractCompDetect, we curate CataComp, the first cataract surgery video dataset annotated for intraoperative complications, comprising 53 surgeries, including 23 with clinical complications. On CataComp, CataractCompDetect achieves an average F1 score of 70.63%, with per-complication performance of 81.8% (Iris Prolapse), 60.87% (PCR), and 69.23% (Vitreous Loss). These results highlight the value of combining structured surgical priors with vision-language reasoning for recognizing rare but high-impact intraoperative events. Our dataset and code will be publicly released upon acceptance.
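The reported average is consistent with an unweighted (macro) mean of the three per-complication F1 scores, as the short check below shows. That the authors averaged this way is an inference from the numbers, not something the abstract states. The `f1_score` helper uses the standard definition, and the counts passed to it are made up.

```python
from typing import Dict


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)


def macro_average(scores: Dict[str, float]) -> float:
    """Unweighted mean over classes, so rare classes count equally."""
    return sum(scores.values()) / len(scores)


# Per-complication F1 scores (in percent) as reported in the abstract.
reported = {"Iris Prolapse": 81.8, "PCR": 60.87, "Vitreous Loss": 69.23}
print(round(macro_average(reported), 2))  # 70.63

# Hypothetical counts, only to show how a single F1 value arises.
print(f1_score(tp=8, fp=2, fn=2))  # 0.8
```

A macro average is a natural choice for rare-event detection because it prevents the most frequent complication from dominating the headline number.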

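[189] Zero-shot segmentation of skin tumors in whole-slide images with vision-language foundation models cs.CV

Santiago Moreno, Pablo Meseguer, Rocío del Amor, Valery Naranjo

TL;DR: This paper proposes ZEUS, a zero-shot vision-language segmentation framework for skin tumors in whole-slide images (WSIs) that reduces annotation burden and provides scalable tumor delineation.

Details

Motivation: Accurately annotating skin tumor biopsies is challenging because of their wide morphological variability, overlapping histological patterns, and the subtle distinctions between benign and malignant lesions. Existing vision-language foundation models (VLMs) struggle to produce fine-grained segmentations in whole-slide images.

Result: ZEUS shows competitive performance on two in-house datasets (primary spindle cell neoplasms and cutaneous metastases), and the study analyzes how prompt design, domain shift, and institutional variability affect VLMs.

Insight: ZEUS markedly reduces annotation burden and offers scalable, explainable tumor delineation for downstream diagnostic workflows.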

Abstract: Accurate annotation of cutaneous neoplasm biopsies represents a major challenge due to their wide morphological variability, overlapping histological patterns, and the subtle distinctions between benign and malignant lesions. Vision-language foundation models (VLMs), pre-trained on paired image-text corpora, learn joint representations that bridge visual features and diagnostic terminology, enabling zero-shot localization and classification of tissue regions without pixel-level labels. However, most existing VLM applications in histopathology remain limited to slide-level tasks or rely on coarse interactive prompts, and they struggle to produce fine-grained segmentations across gigapixel whole-slide images (WSIs). In this work, we introduce a zero-shot visual-language segmentation pipeline for whole-slide images (ZEUS), a fully automated, zero-shot segmentation framework that leverages class-specific textual prompt ensembles and frozen VLM encoders to generate high-resolution tumor masks in WSIs. By partitioning each WSI into overlapping patches, extracting visual embeddings, and computing cosine similarities against text prompts, we generate a final segmentation mask. We demonstrate competitive performance on two in-house datasets, primary spindle cell neoplasms and cutaneous metastases, highlighting the influence of prompt design, domain shifts, and institutional variability in VLMs for histopathology. ZEUS markedly reduces annotation burden while offering scalable, explainable tumor delineation for downstream diagnostic workflows.
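The patch-level step described above (embed each patch, compare it against class-specific prompt ensembles by cosine similarity) can be sketched with toy vectors. This is a simplified illustration under stated assumptions: the 2-D embeddings are made up, averaging prompt embeddings is one common way to form an ensemble and may differ from what ZEUS does, and stitching overlapping patches into a full mask is not shown.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_embedding(embeddings: List[Vector]) -> List[float]:
    """Average several prompt embeddings into one class vector."""
    return [sum(column) / len(embeddings) for column in zip(*embeddings)]


def classify_patches(patches: List[Vector],
                     class_prompts: Dict[str, List[Vector]]) -> List[str]:
    """Label each patch with the class whose prompt ensemble is closest."""
    class_vectors = {name: mean_embedding(embs)
                     for name, embs in class_prompts.items()}
    return [max(class_vectors, key=lambda name: cosine(p, class_vectors[name]))
            for p in patches]


# Toy embeddings standing in for frozen VLM encoder outputs (assumption).
prompts = {
    "tumor": [[1.0, 0.0], [0.9, 0.1]],
    "normal": [[0.0, 1.0], [0.1, 0.9]],
}
print(classify_patches([[0.8, 0.2], [0.1, 0.7]], prompts))
# ['tumor', 'normal']
```

In a real pipeline the per-patch labels (or the raw similarity scores) would be written back to each patch's location in the slide to form the segmentation mask.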

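[190] UMCL: Unimodal-generated Multimodal Contrastive Learning for Cross-compression-rate Deepfake Detection cs.CV

Ching-Yi Lai, Chih-Yu Jian, Pei-Cheng Chuang, Chia-Ming Lee, Chih-Chung Hsu

TL;DR: The paper proposes UMCL, a unimodal-generated multimodal contrastive learning framework that addresses the inconsistent data quality caused by social media compression in deepfake detection. It transforms a single visual modality into complementary features and combines an affinity-driven semantic alignment strategy with a cross-quality similarity learning strategy, improving robustness across compression rates and manipulation types.

Details

Motivation: Compression applied by social media platforms challenges the generalization and reliability of deepfake detection models. Existing single-modal methods perform poorly on compressed data, while multimodal methods face costly data collection and labeling and inconsistent modal quality.

Result: Experiments show that UMCL performs strongly across compression rates and manipulation types and maintains high detection accuracy even when individual features degrade.

Insight: By explicitly aligning inter-modal relationships, UMCL improves performance and also provides interpretable analysis of feature relationships, offering a new direction for robust deepfake detection.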

Abstract: In deepfake detection, the varying degrees of compression employed by social media platforms pose significant challenges for model generalization and reliability. Although existing methods have progressed from single-modal to multimodal approaches, they face critical limitations: single-modal methods struggle with feature degradation under data compression in social media streaming, while multimodal approaches require expensive data collection and labeling and suffer from inconsistent modal quality or accessibility in real-world scenarios. To address these challenges, we propose a novel Unimodal-generated Multimodal Contrastive Learning (UMCL) framework for robust cross-compression-rate (CCR) deepfake detection. In the training stage, our approach transforms a single visual modality into three complementary features: compression-robust rPPG signals, temporal landmark dynamics, and semantic embeddings from pre-trained vision-language models. These features are explicitly aligned through an affinity-driven semantic alignment (ASA) strategy, which models inter-modal relationships through affinity matrices and optimizes their consistency through contrastive learning. Subsequently, our cross-quality similarity learning (CQSL) strategy enhances feature robustness across compression rates. Extensive experiments demonstrate that our method achieves superior performance across various compression rates and manipulation types, establishing a new benchmark for robust deepfake detection. Notably, our approach maintains high detection accuracy even when individual features degrade, while providing interpretable insights into feature relationships through explicit alignment.

[191] Rethinking Plant Disease Diagnosis: Bridging the Academic-Practical Gap with Vision Transformers and Zero-Shot Learning cs.CV | cs.AIPDF

Wassim Benabbas, Mohammed Brahimi, Samir Akhrouf, Bilal Fortas

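TL;DR: This paper examines how Vision Transformers and zero-shot learning can bridge the gap between academic datasets and practical use in plant disease classification. It compares CNNs, Vision Transformers, and CLIP models and finds that CLIP performs well without task-specific training.

Details

Motivation: Most existing work is built on the PlantVillage dataset. Models perform very well on this academic dataset but generalize poorly to real field images, which disconnects academic results from practical application.

Result: CNNs show limited robustness under domain shift, Vision Transformers perform better because they capture global features, and CLIP models demonstrate the potential of zero-shot learning.

Insight: Zero-shot learning can serve as a practical strategy for adapting to diverse field environments, with a clear advantage when labeled data is scarce.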

Abstract: Recent advances in deep learning have enabled significant progress in plant disease classification using leaf images. Much of the existing research in this field has relied on the PlantVillage dataset, which consists of well-centered plant images captured against uniform, uncluttered backgrounds. Although models trained on this dataset achieve high accuracy, they often fail to generalize to real-world field images, such as those submitted by farmers to plant diagnostic systems. This has created a significant gap between published studies and practical application requirements, highlighting the necessity of investigating and addressing this issue. In this study, we investigate whether attention-based architectures and zero-shot learning approaches can bridge the gap between curated academic datasets and real-world agricultural conditions in plant disease classification. We evaluate three model categories: Convolutional Neural Networks (CNNs), Vision Transformers, and Contrastive Language-Image Pre-training (CLIP)-based zero-shot models. While CNNs exhibit limited robustness under domain shift, Vision Transformers demonstrate stronger generalization by capturing global contextual features. Most notably, CLIP models classify diseases directly from natural language descriptions without any task-specific training, offering strong adaptability and interpretability. These findings highlight the potential of zero-shot learning as a practical and scalable domain adaptation strategy for plant health diagnosis in diverse field environments.
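
The zero-shot step described in the abstract (classifying from natural language descriptions) can be sketched as a similarity lookup. This is a minimal illustration, not the authors' pipeline: the 3-dimensional embeddings, the prompt texts, and the `temperature` value are hypothetical stand-ins for the outputs of real CLIP image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Return (best_label, probabilities) from image/text embedding similarity."""
    labels = list(text_embs)
    logits = [cosine(image_emb, text_embs[k]) / temperature for k in labels]
    m = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = {k: e / total for k, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs

# Toy embeddings standing in for real encoder outputs (hypothetical values).
text_embs = {
    "a leaf with early blight lesions": [0.9, 0.1, 0.0],
    "a healthy green leaf": [0.0, 0.2, 0.9],
}
image_emb = [0.8, 0.3, 0.1]
label, probs = zero_shot_classify(image_emb, text_embs)
print(label)
```

Because the class set is just a dictionary of prompts, adding a new disease only requires a new description, which is the adaptability the abstract refers to.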

[192] View-Consistent Diffusion Representations for 3D-Consistent Video Generation cs.CVPDF

Duolikun Danier, Ge Gao, Steven McDonagh, Changjian Li, Hakan Bilen

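TL;DR: The paper proposes ViCoDR, a method that learns multi-view consistent diffusion representations to improve 3D consistency in video generation and thereby reduce visual artifacts.

Details

Motivation: Video generation models have made progress in generating realistic content, but their outputs still contain visual artifacts caused by 3D inconsistencies (such as deforming objects), which hurt user experience and simulation fidelity. Building on recent work on representation alignment for diffusion models, the paper hypothesizes that improving multi-view consistency will improve 3D consistency.

Result: Experiments in several settings (image-to-video, text-to-video, and multi-view generation) show that ViCoDR significantly improves the 3D consistency of generated videos.

Insight: Multi-view consistency of the representations is a key factor in 3D-consistent video generation, and optimizing the diffusion representations effectively reduces visual artifacts.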

Abstract: Video generation models have made significant progress in generating realistic content, enabling applications in simulation, gaming, and film making. However, current generated videos still contain visual artifacts arising from 3D inconsistencies, e.g., objects and structures deforming under changes in camera pose, which can undermine user experience and simulation fidelity. Motivated by recent findings on representation alignment for diffusion models, we hypothesize that improving the multi-view consistency of video diffusion representations will yield more 3D-consistent video generation. Through detailed analysis on multiple recent camera-controlled video diffusion models we reveal strong correlations between 3D-consistent representations and videos. We also propose ViCoDR, a new approach for improving the 3D consistency of video models by learning multi-view consistent diffusion representations. We evaluate ViCoDR on camera controlled image-to-video, text-to-video, and multi-view generation models, demonstrating significant improvements in the 3D consistency of the generated videos. Project page: https://danier97.github.io/ViCoDR.

[193] AuViRe: Audio-visual Speech Representation Reconstruction for Deepfake Temporal Localization cs.CVPDF

Christos Koutlis, Symeon Papadopoulos

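TL;DR: This paper proposes AuViRe, a method that temporally localizes deepfakes in video through audio-visual speech representation reconstruction. It uses cross-modal reconstruction discrepancies as detection cues and clearly outperforms prior work.

Details

Motivation: With synthetic audio-visual content advancing rapidly, ensuring the authenticity of digital media has become critical. Existing deepfake detection methods fall short on temporal localization, which AuViRe aims to address.

Result: Improvements over the state of the art of 8.9 AP@0.95 on LAV-DF, 9.6 AP@0.5 on AV-Deepfake1M, and 5.1 AUC in an in-the-wild experiment.

Insight: Cross-modal reconstruction discrepancy is an effective cue for deepfake detection, particularly for temporal localization.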

Abstract: With the rapid advancement of sophisticated synthetic audio-visual content, e.g., for subtle malicious manipulations, ensuring the integrity of digital media has become paramount. This work presents a novel approach to temporal localization of deepfakes by leveraging Audio-Visual Speech Representation Reconstruction (AuViRe). Specifically, our approach reconstructs speech representations from one modality (e.g., lip movements) based on the other (e.g., audio waveform). Cross-modal reconstruction is significantly more challenging in manipulated video segments, leading to amplified discrepancies, thereby providing robust discriminative cues for precise temporal forgery localization. AuViRe outperforms the state of the art by +8.9 AP@0.95 on LAV-DF, +9.6 AP@0.5 on AV-Deepfake1M, and +5.1 AUC on an in-the-wild experiment. Code available at https://github.com/mever-team/auvire.
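
To make the localization idea concrete, the sketch below turns a sequence of per-frame reconstruction discrepancies into time segments. It is an assumption-laden illustration: the function name, the fixed threshold, and the toy scores are hypothetical, and the paper's model presumably learns its localization head rather than thresholding.

```python
def localize_segments(discrepancy, threshold, fps=25.0):
    """Group consecutive frames whose score exceeds the threshold
    into (start_seconds, end_seconds) segments."""
    segments, start = [], None
    for i, score in enumerate(discrepancy):
        if score > threshold and start is None:
            start = i                                  # a suspicious run begins
        elif score <= threshold and start is not None:
            segments.append((start / fps, i / fps))    # the run ended at frame i
            start = None
    if start is not None:                              # run reaches the last frame
        segments.append((start / fps, len(discrepancy) / fps))
    return segments

# Hypothetical per-frame discrepancies between reconstructed and observed speech features.
scores = [0.1, 0.1, 0.8, 0.9, 0.7, 0.1, 0.2, 0.9]
segments = localize_segments(scores, threshold=0.5, fps=1.0)
print(segments)
```

With one frame per second, the example flags frames 2 to 4 and the final frame as manipulated spans.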

[194] A Self-Conditioned Representation Guided Diffusion Model for Realistic Text-to-LiDAR Scene Generation cs.CVPDF

Wentao Qu, Guofeng Mei, Yang Wu, Yongshun Gong, Xiaoshui Huang

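TL;DR: This paper proposes T2LDM, a diffusion model guided by self-conditioned representations for text-to-LiDAR scene generation, addressing scarce training data and low-quality text descriptions.

Details

Motivation: Text-to-LiDAR generation can supply rich 3D data for downstream tasks, but scarce training data and low-quality text descriptions lead to overly smooth scenes and poor controllability.

Result: T2LDM outperforms existing methods in both unconditional and conditional generation, achieving state-of-the-art scene generation.

Insight: 1. SCRG lets the model perceive rich geometric structure; 2. high-quality text prompts are critical to generation quality; 3. a directional position prior noticeably improves scene fidelity.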

Abstract: Text-to-LiDAR generation can customize 3D data with rich structures and diverse scenes for downstream tasks. However, the scarcity of Text-LiDAR pairs often causes insufficient training priors, generating overly smooth 3D scenes. Moreover, low-quality text descriptions may degrade generation quality and controllability. In this paper, we propose a Text-to-LiDAR Diffusion Model for scene generation, named T2LDM, with a Self-Conditioned Representation Guidance (SCRG). Specifically, SCRG, by aligning to the real representations, provides the soft supervision with reconstruction details for the Denoising Network (DN) in training, while decoupled in inference. In this way, T2LDM can perceive rich geometric structures from data distribution, generating detailed objects in scenes. Meanwhile, we construct a content-composable Text-LiDAR benchmark, T2nuScenes, along with a controllability metric. Based on this, we analyze the effects of different text prompts for LiDAR generation quality and controllability, providing practical prompt paradigms and insights. Furthermore, a directional position prior is designed to mitigate street distortion, further improving scene fidelity. Additionally, by learning a conditional encoder via frozen DN, T2LDM can support multiple conditional tasks, including Sparse-to-Dense, Dense-to-Sparse, and Semantic-to-LiDAR generation. Extensive experiments in unconditional and conditional generation demonstrate that T2LDM outperforms existing methods, achieving state-of-the-art scene generation.

[195] Dynamic Granularity Matters: Rethinking Vision Transformers Beyond Fixed Patch Splitting cs.CVPDF

Qiyang Yu, Yu Fang, Tianrui Li, Xuemei Cao, Yan Chen

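TL;DR: The paper proposes Grc-ViT, a dynamic-granularity Vision Transformer that adaptively adjusts visual granularity to address the weakness of ViTs in capturing local detail.

Details

Motivation: Vision Transformers handle global dependencies well but are inefficient at capturing fine-grained local details. Existing methods rely on fixed patch sizes and introduce redundant computation.

Result: Grc-ViT improves fine-grained discrimination while achieving a better trade-off between accuracy and computational efficiency.

Insight: Dynamic granularity adjustment is an effective way to improve ViT performance, especially on complex visual tasks.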

Abstract: Vision Transformers (ViTs) have demonstrated strong capabilities in capturing global dependencies but often struggle to efficiently represent fine-grained local details. Existing multi-scale approaches alleviate this issue by integrating hierarchical or hybrid features; however, they rely on fixed patch sizes and introduce redundant computation. To address these limitations, we propose Granularity-driven Vision Transformer (Grc-ViT), a dynamic coarse-to-fine framework that adaptively adjusts visual granularity based on image complexity. It comprises two key stages: (1) Coarse Granularity Evaluation module, which assesses visual complexity using edge density, entropy, and frequency-domain cues to estimate suitable patch and window sizes; (2) Fine-grained Refinement module, which refines attention computation according to the selected granularity, enabling efficient and precise feature learning. Two learnable parameters, α and β, are optimized end-to-end to balance global reasoning and local perception. Comprehensive evaluations demonstrate that Grc-ViT enhances fine-grained discrimination while achieving a superior trade-off between accuracy and computational efficiency.
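
As a rough illustration of complexity-driven granularity, the sketch below uses only one of the cues the abstract lists (entropy) to pick a patch size. The thresholds `low` and `high` and the candidate sizes 32, 16, and 8 are hypothetical; the paper's evaluation module also uses edge density and frequency-domain cues and is not reproduced here.

```python
import math
from collections import Counter

def shannon_entropy(pixels):
    """Shannon entropy in bits of a flat list of integer grey levels."""
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def choose_patch_size(pixels, low=2.0, high=5.0):
    """Map entropy to a patch size: simple images get coarse patches,
    complex images get fine patches (thresholds are illustrative)."""
    h = shannon_entropy(pixels)
    if h < low:
        return 32
    if h < high:
        return 16
    return 8

flat_image = [128] * 64        # constant image, entropy 0 bits
busy_image = list(range(64))   # 64 distinct grey levels, entropy 6 bits

print(choose_patch_size(flat_image), choose_patch_size(busy_image))
```

A coarse patch on the constant image avoids spending attention on redundant tokens, while the high-entropy image gets the finest partition.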

Long Tang, Guoquan Zhen, Jie Hao, Jianbo Zhang, Huiyu Duan

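TL;DR: This paper proposes Life-IQA, which improves blind image quality assessment (BIQA) through GCN-enhanced layer interaction and MoE-based feature decoupling, addressing the unequal feature contributions and weak decoding architectures of earlier methods.

Details

Motivation: Existing BIQA methods usually fuse shallow and deep features but overlook their unequal contributions, and they lack an effective quality decoding architecture.

Result: Life-IQA achieves state-of-the-art performance on multiple BIQA benchmarks, with a better balance between accuracy and cost.

Insight: Interaction between and decoupling of shallow and deep features substantially improves BIQA performance; combining GCN and MoE is a promising direction.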

Abstract: Blind image quality assessment (BIQA) plays a crucial role in evaluating and optimizing visual experience. Most existing BIQA approaches fuse shallow and deep features extracted from backbone networks, while overlooking the unequal contributions to quality prediction. Moreover, while various vision encoder backbones are widely adopted in BIQA, the effective quality decoding architectures remain underexplored. To address these limitations, this paper investigates the contributions of shallow and deep features to BIQA, and proposes an effective quality feature decoding framework via GCN-enhanced layer interaction and MoE-based feature decoupling, termed Life-IQA. Specifically, the GCN-enhanced layer interaction module utilizes the GCN-enhanced deepest-layer features as query and the penultimate-layer features as key and value, then performs cross-attention to achieve feature interaction. Moreover, a MoE-based feature decoupling module is proposed to decouple fused representations through different experts specialized for specific distortion types or quality dimensions. Extensive experiments demonstrate that Life-IQA shows a more favorable balance between accuracy and cost than a vanilla Transformer decoder and achieves state-of-the-art performance on multiple BIQA benchmarks. The code is available at: https://github.com/TANGLONG2/Life-IQA/tree/main.

[197] Benchmarking Corruption Robustness of LVLMs: A Discriminative Benchmark and Robustness Alignment Metric cs.CVPDF

Xiangjie Sui, Songyang Li, Hanwei Zhu, Baoliang Chen, Yuming Fang

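TL;DR: The paper introduces the Bench-C benchmark and the RAS metric for evaluating the robustness of large vision-language models under visual corruption. It stresses the importance of discriminative samples and reveals distinct model behavior patterns under corruption.

Details

Motivation: Existing evaluation is dominated by low-discriminative samples, and conventional accuracy metrics cannot capture degradation of the prediction structure, so new evaluation methods and metrics are needed.

Result: Experiments show that models exhibit distinct behaviors under corruption, such as erroneous confidence and hesitation. Subtle corruption can slightly raise accuracy while the prediction structure still degrades.

Insight: Under visual corruption, degradation of a model's prediction structure may not track changes in accuracy. Decomposing robustness into destructive and corrective components helps reveal failure and recovery patterns.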

Abstract: Despite the remarkable reasoning abilities of large vision-language models (LVLMs), their robustness under visual corruptions remains insufficiently studied. Existing evaluation paradigms exhibit two major limitations: 1) the dominance of low-discriminative samples in current datasets masks the real robustness gap between models; and 2) conventional accuracy-based metrics fail to capture the degradation of the underlying prediction structure. To bridge these gaps, we introduce Bench-C, a comprehensive benchmark emphasizing discriminative samples for assessing corruption robustness, where a selection strategy is proposed to jointly consider the prediction inconsistency under corruption and the semantic diversity. Furthermore, we propose the Robustness Alignment Score (RAS), a unified metric that measures degradation in logit-level prediction structure by considering the shifts in prediction uncertainty and calibration alignment. Comprehensive experiments and analysis reveal several interesting findings: 1) model behaviors exhibit distinct patterns under corruptions, such as erroneous confidence and hesitation; 2) although subtle corruption may lead to a slight accuracy gain, the overall prediction structure still degrades; 3) by decomposing corruption robustness into destructive and corrective components, the distinct failure and recovery patterns across models can be revealed.
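
The point that accuracy can hide degradation of the prediction structure can be shown with a small calculation. This is not the RAS metric, whose exact definition is not given in the abstract; it only computes the shift in softmax entropy between clean and corrupted logits, and the logit values are made up for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical logits for one multiple-choice question.
clean_logits = [4.0, 1.0, 0.5, 0.2]     # confident and correct (option 0)
corrupt_logits = [1.6, 1.4, 1.2, 1.0]   # same answer, far less certain

same_answer = argmax(clean_logits) == argmax(corrupt_logits)
entropy_shift = entropy(softmax(corrupt_logits)) - entropy(softmax(clean_logits))
print(same_answer, round(entropy_shift, 3))
```

Accuracy is unchanged here because the top choice is the same, yet the positive entropy shift shows the model hesitating, which is the kind of behavior an accuracy-only metric misses.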

[198] ReEXplore: Improving MLLMs for Embodied Exploration with Contextualized Retrospective Experience Replay cs.CVPDF

Gengyuan Zhang, Mingcong Ding, Jingpei Wu, Ruotong Liao, Volker Tresp

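TL;DR: ReEXplore is a training-free framework that improves the performance and efficiency of MLLMs in embodied exploration through retrospective experience replay and hierarchical frontier selection.

Details

Motivation: MLLMs in embodied exploration depend on pre-trained knowledge, are costly to train, and face a complex action space. ReEXplore aims to address these challenges.

Result: On multiple benchmarks, ReEXplore outperforms baseline methods, with up to 3x higher success rate and navigation efficiency.

Insight: With dynamically updated experience and hierarchical decision making, MLLMs can adapt to new environments more efficiently in embodied exploration and rely less on pre-trained knowledge.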

Abstract: Embodied exploration is a target-driven process that requires embodied agents to possess fine-grained perception and knowledge-enhanced decision making. While recent attempts leverage MLLMs for exploration due to their strong perceptual and reasoning abilities, we find that MLLM-based embodied agents remain suboptimal in exploring new environments: (i) they rely on profound but stale pre-trained knowledge, (ii) training-based approaches such as imitation learning or reinforcement learning are expensive for long-horizon tasks with sparse outcome rewards, and (iii) frontier-based exploration yields a large, visually nuanced action space in which it is difficult for MLLMs to make reliable decisions. We address these challenges with ReEXplore, a training-free framework that performs retrospective experience replay to inject distilled, abstract experience at inference time, and hierarchical frontier selection to decompose frontier ranking into coarse-to-fine decisions. Our approach enables robust, traceable, and efficient exploration. Across multiple embodied exploration benchmarks, ReEXplore yields great improvements over strong MLLM baselines, up to 3x higher performance in both success rate and navigation efficiency under open-source backbones.

[199] MedSAM3: Delving into Segment Anything with Medical Concepts cs.CV | cs.AIPDF

Anglin Liu, Rundong Xue, Xu R. Cao, Yifan Shen, Yi Lu

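TL;DR: MedSAM3 introduces text prompts and medical concepts to improve the generality and precision of medical image segmentation, and integrates multimodal large language models for complex reasoning and iterative refinement.

Details

Motivation: Existing medical image segmentation methods generalize poorly and require extensive manual annotation. To make segmentation more flexible and precise, the authors propose MedSAM3, which supports segmentation from text prompts and uses multimodal large language models to strengthen reasoning.

Result: Experiments show that MedSAM3 significantly outperforms existing specialist and foundation models across medical imaging modalities including X-ray, MRI, ultrasound, and CT.

Insight: Combining text prompts with multimodal large language models can significantly improve the flexibility and precision of medical image segmentation, and the idea may carry over to general segmentation tasks in other domains.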

Abstract: Medical image segmentation is fundamental for biomedical discovery. Existing methods lack generalizability and demand extensive, time-consuming manual annotation for new clinical applications. Here, we propose MedSAM-3, a text-promptable medical segmentation model for medical image and video segmentation. By fine-tuning the Segment Anything Model (SAM) 3 architecture on medical images paired with semantic conceptual labels, our MedSAM-3 enables medical Promptable Concept Segmentation (PCS), allowing precise targeting of anatomical structures via open-vocabulary text descriptions rather than solely geometric prompts. We further introduce the MedSAM-3 Agent, a framework that integrates Multimodal Large Language Models (MLLMs) to perform complex reasoning and iterative refinement in an agent-in-the-loop workflow. Comprehensive experiments across diverse medical imaging modalities, including X-ray, MRI, Ultrasound, CT, and video, demonstrate that our approach significantly outperforms existing specialist and foundation models. We will release our code and model at https://github.com/Joey-S-Liu/MedSAM3.

[200] Beyond Reward Margin: Rethinking and Resolving Likelihood Displacement in Diffusion Models via Video Generation cs.CVPDF

Ruojun Xu, Yu Kai, Xuhua Ren, Jiaxiang Cheng, Bing Ma

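TL;DR: The paper shows that Direct Preference Optimization (DPO) suffers from likelihood displacement in diffusion models and proposes PG-DPO, which combines Adaptive Rejection Scaling (ARS) and Implicit Preference Regularization (IPR), to address it.

Details

Motivation: DPO works well in generative tasks, but when training diffusion models it causes likelihood displacement, in which the probability of chosen samples decreases and generation quality suffers. The paper targets this problem, particularly its impact on video generation.

Result: PG-DPO outperforms existing methods in both quantitative and qualitative evaluation, offering a more robust preference alignment solution for video generation.

Insight: The failure modes of DPO in diffusion models stem from the size of the reward margin and call for dynamic adjustment and regularization. The approach may extend to other generative tasks.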

Abstract: Direct Preference Optimization (DPO) has shown promising results in aligning generative outputs with human preferences by distinguishing between chosen and rejected samples. However, a critical limitation of DPO is likelihood displacement, where the probabilities of chosen samples paradoxically decrease during training, undermining the quality of generation. Although this issue has been investigated in autoregressive models, its impact within diffusion-based models remains largely unexplored. This gap leads to suboptimal performance in tasks involving video generation. To address this, we conduct a formal analysis of DPO loss through updating policy within the diffusion framework, which describes how the updating of specific training samples influences the model’s predictions on other samples. Using this tool, we identify two main failure modes: (1) Optimization Conflict, which arises from small reward margins between chosen and rejected samples, and (2) Suboptimal Maximization, caused by large reward margins. Informed by these insights, we introduce a novel solution named Policy-Guided DPO (PG-DPO), combining Adaptive Rejection Scaling (ARS) and Implicit Preference Regularization (IPR) to effectively mitigate likelihood displacement. Experiments show that PG-DPO outperforms existing methods in both quantitative metrics and qualitative evaluations, offering a robust solution for improving preference alignment in video generation tasks.
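
Likelihood displacement is easier to see with numbers. The sketch below evaluates the standard DPO loss for a single preference pair using made-up log-probabilities; it is not PG-DPO, and diffusion variants typically replace exact log-likelihoods with denoising-error proxies, which this example does not model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given policy and reference log-probabilities."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

REF_CHOSEN, REF_REJECTED = -10.0, -12.0   # hypothetical reference log-probabilities

# Policy equal to the reference: zero margin, loss = log 2.
before = dpo_loss(-10.0, -12.0, REF_CHOSEN, REF_REJECTED)
# Healthy update: chosen log-probability rises, rejected falls.
good = dpo_loss(-9.0, -13.0, REF_CHOSEN, REF_REJECTED)
# Displaced update: chosen log-probability FALLS, rejected falls faster.
displaced = dpo_loss(-11.0, -15.0, REF_CHOSEN, REF_REJECTED)

print(round(before, 4), round(good, 4), round(displaced, 4))
```

The healthy and displaced updates yield the same loss because the loss depends only on the margin between the two samples. That is why the objective alone cannot prevent the chosen sample's probability from dropping.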

[201] Granular Computing-driven SAM: From Coarse-to-Fine Guidance for Prompt-Free Segmentation cs.CVPDF

Qiyang Yu, Yu Fang, Tianrui Li, Xuemei Cao, Yan Chen

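TL;DR: Grc-SAM is a hierarchical framework based on granular computing that uses a coarse-to-fine strategy to segment images without manual prompts, addressing the localizability and scalability limitations of SAM.

Details

Motivation: Conventional SAM models generate prompts at a single granularity level and lack both a mechanism for autonomous region localization and fine-grained modeling at high resolution. Grc-SAM aims to solve these problems.

Result: Experiments show that Grc-SAM outperforms baseline methods in both accuracy and scalability.

Insight: By combining granular computing with multi-granularity attention, Grc-SAM offers a new perspective on prompt-free segmentation and shows the potential of hierarchical modeling for high-resolution tasks.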

Abstract: Prompt-free image segmentation aims to generate accurate masks without manual guidance. Typical pre-trained models, notably the Segment Anything Model (SAM), generate prompts directly at a single granularity level. However, this approach has two limitations: (1) Localizability, lacking mechanisms for autonomous region localization; (2) Scalability, limited fine-grained modeling at high resolution. To address these challenges, we introduce Granular Computing-driven SAM (Grc-SAM), a coarse-to-fine framework motivated by Granular Computing (GrC). First, the coarse stage adaptively extracts high-response regions from features to achieve precise foreground localization and reduce reliance on external prompts. Second, the fine stage applies finer patch partitioning with sparse local swin-style attention to enhance detail modeling and enable high-resolution segmentation. Third, refined masks are encoded as latent prompt embeddings for the SAM decoder, replacing handcrafted prompts with an automated reasoning process. By integrating multi-granularity attention, Grc-SAM bridges granular computing with vision transformers. Extensive experimental results demonstrate that Grc-SAM outperforms baseline methods in both accuracy and scalability. It offers a unique granular computational perspective for prompt-free segmentation.

[202] DynaMix: Generalizable Person Re-identification via Dynamic Relabeling and Mixed Data Sampling cs.CV | cs.AI | cs.LGPDF

Timur Mamedov, Anton Konushin, Vadim Konushin

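TL;DR: DynaMix is a new person re-identification method that uses dynamic relabeling and mixed data sampling to combine labeled multi-camera data with pseudo-labeled single-camera data, significantly improving generalization.

Details

Motivation: Existing methods rely mainly on limited labeled multi-camera data, which limits generalization. DynaMix addresses this by adding large amounts of pseudo-labeled data and dynamically adapting to the structure and noise of the training data.

Result: Experiments show that DynaMix outperforms existing methods on generalizable person re-identification.

Insight: With a design that adapts to data noise and scales to large training sets, DynaMix shows the potential of combining pseudo-labeled and labeled data and offers a new approach to the generalization problem.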

Abstract: Generalizable person re-identification (Re-ID) aims to recognize individuals across unseen cameras and environments. While existing methods rely heavily on limited labeled multi-camera data, we propose DynaMix, a novel method that effectively combines manually labeled multi-camera and large-scale pseudo-labeled single-camera data. Unlike prior works, DynaMix dynamically adapts to the structure and noise of the training data through three core components: (1) a Relabeling Module that refines pseudo-labels of single-camera identities on-the-fly; (2) an Efficient Centroids Module that maintains robust identity representations under a large identity space; and (3) a Data Sampling Module that carefully composes mixed data mini-batches to balance learning complexity and intra-batch diversity. All components are specifically designed to operate efficiently at scale, enabling effective training on millions of images and hundreds of thousands of identities. Extensive experiments demonstrate that DynaMix consistently outperforms state-of-the-art methods in generalizable person Re-ID.

[203] Graph-based 3D Human Pose Estimation using WiFi Signals cs.CVPDF

Jichao Chen, YangYang Qu, Ruibo Tang, Dirk Slock

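TL;DR: GraphPose-Fi is a graph-based framework for 3D human pose estimation from WiFi signals that explicitly models skeletal topology and significantly improves performance.

Details

Motivation: Existing WiFi-based pose estimation methods usually ignore the topological relationships among joints, which limits performance. GraphPose-Fi models these relationships explicitly with a graph structure.

Result: Significantly outperforms existing methods on the MM-Fi dataset.

Insight: Explicitly modeling joint topology is key to improving WiFi-based pose estimation.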

Abstract: WiFi-based human pose estimation (HPE) has attracted increasing attention due to its resilience to occlusion and its privacy-preserving nature compared to camera-based methods. However, existing WiFi-based HPE approaches often employ regression networks that directly map WiFi channel state information (CSI) to 3D joint coordinates, ignoring the inherent topological relationships among human joints. In this paper, we present GraphPose-Fi, a graph-based framework that explicitly models skeletal topology for WiFi-based 3D HPE. Our framework comprises a CNN encoder shared across antennas for subcarrier-time feature extraction, a lightweight attention module that adaptively reweights features over time and across antennas, and a graph-based regression head that combines GCN layers with self-attention to capture local topology and global dependencies. Our proposed method significantly outperforms existing methods on the MM-Fi dataset in various settings. The source code is available at: https://github.com/Cirrick/GraphPose-Fi.
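
The role of skeletal topology in a GCN layer can be illustrated with a tiny graph. This is a generic graph convolution with symmetric normalization, not the paper's regression head; the three-joint chain, the feature values, and the identity weight matrix are assumptions chosen for readability.

```python
import math

def normalized_adjacency(num_joints, bones):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected skeleton graph."""
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)] for i in range(num_joints)]
    for i, j in bones:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_joints)]
            for i in range(num_joints)]

def gcn_layer(a_hat, x, w):
    """One graph convolution, ReLU(A_hat @ X @ W), written with plain lists."""
    cols = range(len(w[0]))
    xw = [[sum(row[k] * w[k][c] for k in range(len(w))) for c in cols] for row in x]
    out = [[sum(a_hat[i][j] * xw[j][c] for j in range(len(x))) for c in cols]
           for i in range(len(x))]
    return [[max(0.0, v) for v in row] for row in out]

# Three-joint chain (shoulder, elbow, wrist) with 2-d features per joint.
a_hat = normalized_adjacency(3, bones=[(0, 1), (1, 2)])
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[1.0, 0.0], [0.0, 1.0]]   # identity weights, so only the graph mixing is visible
h = gcn_layer(a_hat, x, w)
print(h[0])
```

Each joint's output mixes its own features with those of the joints it is connected to by a bone, so the shoulder sees the elbow but not the wrist. A direct regression network has no such structural constraint.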

[204] HABIT: Human Action Benchmark for Interactive Traffic in CARLA cs.CVPDF

Mohan Ramesh, Mark Azer, Fabian B. Flohr

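TL;DR: HABIT is a high-fidelity human behavior simulation benchmark that addresses the lack of diverse, realistic pedestrian behavior in autonomous driving simulation. By integrating real-world human motion data, it exposes critical weaknesses of current autonomous driving agents in complex interactive scenarios.

Details

Motivation: Existing autonomous driving simulations lack realistic modeling of complex pedestrian behavior, which limits safety and reliability testing. HABIT aims to fill this gap.

Result: HABIT exposes significant weaknesses of current autonomous driving agents (InterFuser, TransFuser, and BEVDriver) in complex interactions, with collision and unnecessary-braking rates well above those seen in scripted simulation.

Insight: HABIT highlights the importance of realistic, diverse human behavior for autonomous driving testing. Current agents perform poorly in unscripted environments and need further improvement.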

Abstract: Current autonomous driving (AD) simulations are critically limited by their inadequate representation of realistic and diverse human behavior, which is essential for ensuring safety and reliability. Existing benchmarks often simplify pedestrian interactions, failing to capture complex, dynamic intentions and varied responses critical for robust system deployment. To overcome this, we introduce HABIT (Human Action Benchmark for Interactive Traffic), a high-fidelity simulation benchmark. HABIT integrates real-world human motion, sourced from mocap and videos, into CARLA (Car Learning to Act, a full autonomous driving simulator) via a modular, extensible, and physically consistent motion retargeting pipeline. From an initial pool of approximately 30,000 retargeted motions, we curate 4,730 traffic-compatible pedestrian motions, standardized in SMPL format for physically consistent trajectories. HABIT seamlessly integrates with CARLA’s Leaderboard, enabling automated scenario generation and rigorous agent evaluation. Our safety metrics, including Abbreviated Injury Scale (AIS) and False Positive Braking Rate (FPBR), reveal critical failure modes in state-of-the-art AD agents missed by prior evaluations. Evaluating three state-of-the-art autonomous driving agents, InterFuser, TransFuser, and BEVDriver, demonstrates how HABIT exposes planner weaknesses that remain hidden in scripted simulations. Despite achieving close to or equal to zero collisions per kilometer on the CARLA Leaderboard, the autonomous agents perform notably worse on HABIT, with up to 7.43 collisions/km and a 12.94% AIS 3+ injury risk, and they brake unnecessarily in up to 33% of cases. All components are publicly released to support reproducible, pedestrian-aware AI research.
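
Two of the reported quantities are simple ratios, sketched below. The definitions are assumptions, since the abstract does not give formulas: collisions per kilometer is taken as collision count over distance driven, and the false positive braking rate as the share of braking events that were not needed. The input numbers are invented.

```python
def collisions_per_km(num_collisions, distance_m):
    """Collision count normalized by distance driven, in collisions per kilometer."""
    return num_collisions / (distance_m / 1000.0)

def false_positive_braking_rate(brake_events):
    """Share of braking events that were unnecessary.

    brake_events is a list of booleans, True when the braking was needed.
    """
    if not brake_events:
        return 0.0
    unnecessary = sum(1 for needed in brake_events if not needed)
    return unnecessary / len(brake_events)

# Hypothetical log of one evaluation route.
rate = collisions_per_km(num_collisions=3, distance_m=1500.0)
fpbr = false_positive_braking_rate([True, False, True])
print(rate, round(fpbr, 3))
```

Normalizing by distance makes routes of different lengths comparable, and tracking unnecessary braking guards against agents that avoid collisions simply by stopping too often.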

[205] DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection cs.CVPDF

Hai Ci, Ziheng Peng, Pei Yang, Yingxin Xuan, Mike Zheng Shou

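TL;DR: The paper introduces DiffSeg30k, a dataset focused on localizing diffusion-edited regions to support fine-grained detection of AI-generated content, and recasts detection as a semantic segmentation task.

Details

Motivation: Existing AIGC detection benchmarks focus on whole-image classification and overlook the localization of diffusion-based edits, so a fine-grained dataset and method are needed.

Result: Experiments show that segmentation models perform very well at whole-image classification and generalize across generators, but semantic segmentation itself remains challenging, particularly under image distortions.

Insight: Semantic segmentation methods can be used for pixel-level localization and also act as effective whole-image classifiers, suggesting a new direction for AIGC detection.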

Abstract: Diffusion-based editing enables realistic modification of local image regions, making AI-generated content harder to detect. Existing AIGC detection benchmarks focus on classifying entire images, overlooking the localization of diffusion-based edits. We introduce DiffSeg30k, a publicly available dataset of 30k diffusion-edited images with pixel-level annotations, designed to support fine-grained detection. DiffSeg30k features: 1) In-the-wild images–we collect images or image prompts from COCO to reflect real-world content diversity; 2) Diverse diffusion models–local edits using eight SOTA diffusion models; 3) Multi-turn editing–each image undergoes up to three sequential edits to mimic real-world sequential editing; and 4) Realistic editing scenarios–a vision-language model (VLM)-based pipeline automatically identifies meaningful regions and generates context-aware prompts covering additions, removals, and attribute changes. DiffSeg30k shifts AIGC detection from binary classification to semantic segmentation, enabling simultaneous localization of edits and identification of the editing models. We benchmark three baseline segmentation approaches, revealing significant challenges in semantic segmentation tasks, particularly concerning robustness to image distortions. Experiments also reveal that segmentation models, despite being trained for pixel-level localization, emerge as highly reliable whole-image classifiers of diffusion edits, outperforming established forgery classifiers while showing great potential in cross-generator generalization. We believe DiffSeg30k will advance research in fine-grained localization of AI-generated content by demonstrating the promise and limitations of segmentation-based methods. DiffSeg30k is released at: https://huggingface.co/datasets/Chaos2629/Diffseg30k
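
One simple way a segmentation output can double as a whole-image classifier is to threshold the fraction of pixels predicted as edited. The abstract does not say how the authors derive image-level decisions, so the rule, the `min_fraction` value, and the label convention (0 for real, k > 0 for editing model k) are assumptions.

```python
def edited_fraction(mask):
    """Fraction of pixels predicted as edited.

    mask is a list of rows; 0 means real, any positive label is an editing model id.
    """
    total = sum(len(row) for row in mask)
    edited = sum(1 for row in mask for label in row if label != 0)
    return edited / total

def is_edited(mask, min_fraction=0.01):
    """Image-level decision derived from the pixel-level prediction."""
    return edited_fraction(mask) >= min_fraction

# Hypothetical 4x4 prediction: two pixels attributed to editing model 3.
predicted = [
    [0, 0, 0, 0],
    [0, 3, 3, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
clean = [[0] * 4 for _ in range(4)]
print(edited_fraction(predicted), is_edited(predicted), is_edited(clean))
```

Because the mask also stores which model produced each edited pixel, the same output supports localization, model attribution, and the binary decision at once.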

[206] MonoSR: Open-Vocabulary Spatial Reasoning from Monocular Images cs.CVPDF

Qirui Wang, Jingyi He, Yining Pan, Si Yong Yeo, Xulei Yang

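TL;DR: MonoSR is a large-scale monocular spatial reasoning dataset covering indoor, outdoor, and object-centric scenarios with multiple question types, providing a foundation for open-world monocular spatial reasoning.

Details

Motivation: Existing spatial reasoning research relies mainly on multi-view observations and focuses on indoor environments, which limits its applicability to monocular images and outdoor scenes.

Result: MonoSR provides a benchmark for the task, reveals the limitations of existing models, and analyzes the role of auxiliary information.

Insight: Monocular spatial reasoning has broad application potential in open-world settings but requires stronger model designs and better use of auxiliary information.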

Abstract: Spatial reasoning (SR), the ability to infer 3D spatial information from 2D inputs, is essential for real-world applications such as embodied AI and autonomous driving. However, existing research primarily focuses on indoor environments and typically relies on multi-view observations, which limits their generalizability to outdoor scenarios and constrains their applicability to monocular images, the most common real-world setting. In this work, we propose MonoSR, a large-scale monocular spatial reasoning dataset that spans diverse scenarios including indoor, outdoor, and object-centric settings, and supports multiple question types. MonoSR provides a path toward open-world monocular spatial reasoning. Beyond introducing the dataset, we evaluate advanced vision-language models to reveal their limitations on this challenging task. We further analyze whether auxiliary information is crucial for monocular spatial reasoning and offer practical guidance for designing future models. These contributions collectively establish a foundation for advancing monocular spatial reasoning in real-world, open-world environments.

[207] MambaRefine-YOLO: A Dual-Modality Small Object Detector for UAV Imagery cs.CVPDF

Shuyu Cao, Minxin Chen, Yucheng Song, Zhaozhong Chen, Xinyou Zhang

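TL;DR: MambaRefine-YOLO is a dual-modality small object detector for UAV imagery that uses the DGC-MFM and HFAN modules to balance high accuracy with computational efficiency.

Details

Motivation: Small object detection in UAV imagery is challenging because of low resolution and background clutter, and existing methods face a trade-off between cross-modal interaction and computational efficiency.

Result: Achieves 83.2% mAP on the DroneVehicle dataset, an improvement of 7.9% over the baseline, and also performs well on the VisDrone dataset.

Insight: The method strikes a favorable balance between accuracy and speed, making it suitable for real-world UAV applications.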

Abstract: Small object detection in Unmanned Aerial Vehicle (UAV) imagery is a persistent challenge, hindered by low resolution and background clutter. While fusing RGB and infrared (IR) data offers a promising solution, existing methods often struggle with the trade-off between effective cross-modal interaction and computational efficiency. In this letter, we introduce MambaRefine-YOLO. Its core contributions are a Dual-Gated Complementary Mamba fusion module (DGC-MFM) that adaptively balances RGB and IR modalities through illumination-aware and difference-aware gating mechanisms, and a Hierarchical Feature Aggregation Neck (HFAN) that uses a "refine-then-fuse" strategy to enhance multi-scale features. Our comprehensive experiments validate this dual-pronged approach. On the dual-modality DroneVehicle dataset, the full model achieves a state-of-the-art mAP of 83.2%, an improvement of 7.9% over the baseline. On the single-modality VisDrone dataset, a variant using only the HFAN also shows significant gains, demonstrating its general applicability. Our work presents a superior balance between accuracy and speed, making it highly suitable for real-world UAV applications.
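
The idea of illumination-aware gating can be sketched with a scalar gate. This is a deliberately simplified stand-in, not the DGC-MFM: the real module uses Mamba blocks and learned gates, whereas here the gate is a fixed sigmoid of mean brightness, and `midpoint` and `sharpness` are hypothetical parameters.

```python
import math

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def illumination_gate(rgb_pixels, midpoint=0.5, sharpness=10.0):
    """Weight in [0, 1] given to RGB features; grows with mean brightness.

    rgb_pixels is a flat list of intensities scaled to [0, 1].
    """
    mean = sum(rgb_pixels) / len(rgb_pixels)
    return sigmoid(sharpness * (mean - midpoint))

def fuse(rgb_feat, ir_feat, gate):
    """Convex combination of the two modalities, element by element."""
    return [gate * r + (1.0 - gate) * i for r, i in zip(rgb_feat, ir_feat)]

night_gate = illumination_gate([0.1, 0.1, 0.1, 0.1])   # dark scene
day_gate = illumination_gate([0.9, 0.9, 0.9, 0.9])     # bright scene

fused_night = fuse([1.0, 1.0], [0.0, 0.0], night_gate)
print(round(night_gate, 3), round(day_gate, 3), fused_night)
```

In the dark scene the gate is close to zero, so the fused feature is dominated by infrared. In daylight the weighting flips toward RGB.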

[208] ABM-LoRA: Activation Boundary Matching for Fast Convergence in Low-Rank Adaptation cs.CVPDF

Dongha Lee, Jinhee Park, Minjun Kim, Junseok Kwon

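TL;DR: ABM-LoRA is an initialization strategy based on activation boundary matching that substantially accelerates the convergence of low-rank adapters. It aligns the adapter's activation boundaries with those of the pretrained model to reduce information loss in gradient updates, giving fast convergence and strong performance across multiple tasks.

Details

Motivation: Low-rank adaptation (LoRA) is parameter-efficient, but its random initialization restricts gradient updates to a mismatched tangent space, causing information loss and slowing early convergence. ABM-LoRA aims to fix this by improving adapter initialization.

Result: ABM-LoRA performs well across GLUE, WizardLM, and VTAB-1K. On VTAB-1K it achieves the highest accuracy among all methods, with strong gains on tasks requiring geometric understanding.

Insight: Aligning activation boundaries effectively reduces gradient information loss at initialization, which significantly improves the training efficiency and final performance of low-rank adapters.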

Abstract: We propose Activation Boundary Matching for Low-Rank Adaptation (ABM-LoRA), a principled initialization strategy that substantially accelerates the convergence of low-rank adapters. While LoRA offers high parameter efficiency, its random initialization restricts gradient updates to a mismatched tangent space, causing significant information loss and hindering early convergence. Our ABM-LoRA addresses this by aligning the adapter’s activation boundaries with those of the pretrained model before downstream training, thereby maximizing the projection of full-parameter gradients into the adapter subspace. This alignment sharply reduces information loss at initialization, yields a lower starting loss, and accelerates convergence. We demonstrate ABM-LoRA’s effectiveness across diverse architectures and tasks: language understanding (T5-Base on GLUE), dialogue generation (LLaMA2-7B on WizardLM), and vision recognition (ViT-B/16 on VTAB-1K). On VTAB-1K, it achieves the highest accuracy among all methods, with strong gains on structured reasoning tasks requiring geometric understanding.

[209] Test-Time Preference Optimization for Image Restoration cs.CVPDF

Bingchen Li, Xin Li, Jiaqi Xu, Jiaming Guo, Wenbo Li

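TL;DR: The paper proposes Test-Time Preference Optimization (TTPO), which improves the perceptual quality of restored images in image restoration without retraining the model or collecting large amounts of preference data.

Details

Motivation: Existing image restoration methods often fail to align with human preferences, producing results users may not favor. The authors want to optimize restored image quality on the fly without retraining.

Result: The method's effectiveness is validated across various image restoration tasks and models, demonstrating improved perceptual quality and flexibility.

Insight: Test-time optimization requires neither retraining nor large-scale data collection, making it an efficient way to adapt image restoration to human preferences.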

Abstract: Image restoration (IR) models are typically trained to recover high-quality images using L1 or LPIPS loss. To handle diverse unknown degradations, zero-shot IR methods have also been introduced. However, existing pre-trained and zero-shot IR approaches often fail to align with human preferences, resulting in restored images that may not be favored. This highlights the critical need to enhance restoration quality and adapt flexibly to various image restoration tasks or backbones without requiring model retraining and ideally without labor-intensive preference data collection. In this paper, we propose the first Test-Time Preference Optimization (TTPO) paradigm for image restoration, which enhances perceptual quality, generates preference data on-the-fly, and is compatible with any IR model backbone. Specifically, we design a training-free, three-stage pipeline: (i) generate candidate preference images online using diffusion inversion and denoising based on the initially restored image; (ii) select preferred and dispreferred images using automated preference-aligned metrics or human feedback; and (iii) use the selected preference images as reward signals to guide the diffusion denoising process, optimizing the restored image to better align with human preferences. Extensive experiments across various image restoration tasks and models demonstrate the effectiveness and flexibility of the proposed pipeline.

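[210] Evaluating Deep Learning and Traditional Approaches Used in Source Camera Identification cs.CV

Mansur Ozaman

TL;DR: Compares three source camera identification (SCI) techniques, namely PRNU, JPEG compression artifact analysis, and CNNs, focusing on device classification accuracy and on what real-world deployment requires.

Details

Motivation: Source camera identification is an important computer vision task, and the competing methods need to be compared to find the best approach.

Result: The study reports that CNNs achieve the best device classification accuracy, while the PRNU and JPEG-based methods retain advantages with few samples or in specific scenarios.

Insight: Future work should combine deep learning with traditional methods to suit different application scenarios.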

Abstract: One of the most important tasks in computer vision is identifying the device with which an image was taken, which facilitates further comprehensive analysis of the image. This paper presents a comparative analysis of three techniques used in source camera identification (SCI): Photo Response Non-Uniformity (PRNU), JPEG compression artifact analysis, and convolutional neural networks (CNNs). It evaluates each method in terms of device classification accuracy. Furthermore, the research discusses the scientific developments needed to implement these methods in real-life scenarios.

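[211] SpectraNet: FFT-assisted Deep Learning Classifier for Deepfake Face Detection cs.CV | cs.LG

Nithira Jayarathne, Naveen Basnayake, Keshawa Jayasundara, Pasindu Dodampegama, Praveen Wijesinghe

TL;DR: The paper presents SpectraNet, a lightweight deep learning classifier based on EfficientNet-B6 that uses fine-tuning and preprocessing strategies to handle class imbalance and detect deepfakes effectively.

Details

Motivation: Misuse of deepfake images in the spread of information is growing, which calls for a lightweight, generalizable, and efficient detector that helps non-experts identify fake images.

Result: The model achieves high accuracy, stability, and generalization, but adding Fourier-transform features brings little improvement.

Insight: 1. Lightweight model design combined with efficient preprocessing is key; 2. The role of Fourier-transform features in deepfake detection needs further study.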

Abstract: Detecting deepfake images is crucial in combating misinformation. We present a lightweight, generalizable binary classification model based on EfficientNet-B6, fine-tuned with transformation techniques to address severe class imbalances. By leveraging robust preprocessing, oversampling, and optimization strategies, our model achieves high accuracy, stability, and generalization. While incorporating Fourier transform-based phase and amplitude features showed minimal impact, our proposed framework helps non-experts to effectively identify deepfake images, making significant strides toward accessible and reliable deepfake detection.

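[212] Three-Dimensional Anatomical Data Generation Based on Artificial Neural Networks cs.CV | cs.RO

Ann-Sophia Müller, Moonkwang Jeong, Meng Zhang, Jiyuan Tian, Arkadiusz Miernik

TL;DR: The paper proposes an artificial-neural-network-based workflow for generating 3D anatomical data to address the shortage of 3D data for surgical planning and training, using a 3D generative adversarial network (GAN) and biomimetic models to produce diverse 3D anatomical models.

Details

Motivation: Surgical planning and training need large numbers of 3D anatomical models, but obtaining data from real patients faces legal, ethical, and technical challenges, especially for soft-tissue organs with poor imaging contrast such as the prostate.

Result: The neural-network segmentation outperforms conventional computer vision techniques in IoU.

Insight: 3D GANs can be an effective tool against the scarcity of medical imaging data, and biomimetic models provide a controllable experimental setting for surgical training.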

Abstract: Surgical planning and training based on machine learning require a large number of 3D anatomical models reconstructed from medical imaging, which is currently one of the major bottlenecks. Obtaining these data from real patients and during surgery is very demanding, if possible at all, due to legal, ethical, and technical challenges. It is especially difficult for soft tissue organs with poor imaging contrast, such as the prostate. To overcome these challenges, we present a novel workflow for automated 3D anatomical data generation using data obtained from physical organ models. We additionally use a 3D Generative Adversarial Network (GAN) to obtain a manifold of 3D models useful for other downstream machine learning tasks that rely on 3D data. We demonstrate our workflow using an artificial prostate model made of biomimetic hydrogels with imaging contrast in multiple zones. This is used to physically simulate endoscopic surgery. For evaluation and 3D data generation, we place it into a customized ultrasound scanner that records the prostate before and after the procedure. A neural network is trained to segment the recorded ultrasound images, which outperforms conventional, non-learning-based computer vision techniques in terms of intersection over union (IoU). Based on the segmentations, a 3D mesh model is reconstructed, and performance feedback is provided.
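
For readers unfamiliar with the metric named in the abstract, intersection over union (IoU) for binary segmentation masks can be sketched as follows. This is a minimal illustration and not the authors' evaluation code; the function name `iou`, the nested-list mask format, and the convention for two empty masks are assumptions.

```python
def iou(pred, target):
    """Intersection over union of two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0   # pixel set in both masks
            union += 1 if (p or t) else 0    # pixel set in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


# Toy 2x3 masks (hypothetical data): 2 shared pixels, 4 pixels in the union.
pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [1, 1, 0]]
score = iou(pred, target)
```

A higher score means the predicted segmentation overlaps more closely with the reference mask; 1.0 is a perfect match.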

Teodora Popordanoska, Jiameng Li, Matthew B. Blaschko

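TL;DR: The paper introduces CLASH, a new benchmark for multimodal contradiction detection that pairs COCO images with contradictory captions to evaluate how well models detect object-level or attribute-level contradictions. It exposes the limitations of existing models and shows that targeted fine-tuning substantially improves performance.

Details

Motivation: Real-world multimodal inputs are often contradictory, yet existing benchmarks typically assume consistent inputs and do not evaluate contradiction detection, which can lead to hallucinations or unreliable outputs. CLASH aims to fill this gap.

Result: The analysis finds substantial limitations in how existing models detect cross-modal contradictions; targeted fine-tuning substantially improves conflict detection.

Insight: Cross-modal contradiction detection is a key indicator of model reliability, and current models' shortcomings on this task point to directions for further research.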

Abstract: Contradictory multimodal inputs are common in real-world settings, yet existing benchmarks typically assume input consistency and fail to evaluate cross-modal contradiction detection - a fundamental capability for preventing hallucinations and ensuring reliability. We introduce CLASH, a novel benchmark for multimodal contradiction detection, featuring COCO images paired with contradictory captions containing controlled object-level or attribute-level contradictions. The samples include targeted questions evaluated in both multiple-choice and open-ended formats. The benchmark provides an extensive fine-tuning set filtered through automated quality checks, alongside a smaller human-verified diagnostic set. Our analysis of state-of-the-art models reveals substantial limitations in recognizing cross-modal conflicts, exposing systematic modality biases and category-specific weaknesses. Furthermore, we empirically demonstrate that targeted fine-tuning on CLASH substantially enhances conflict detection capabilities.

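[214] Can Modern Vision Models Understand the Difference Between an Object and a Look-alike? cs.CV

Itay Cohen, Ethan Fetaya, Amir Rosenfeld

TL;DR: The paper asks whether modern vision models can tell a real object from a look-alike (e.g., toys, statues). It estimates a real-versus-lookalike direction in CLIP's embedding space and finds that applying it improves cross-modal retrieval and image captioning.

Details

Motivation: Current computer vision models perform well on recognition but still struggle to distinguish real objects from look-alikes such as toys and statues. The paper studies whether vision-language models such as CLIP have this ability.

Result: Applying the direction estimated in CLIP's embedding space improves real-versus-lookalike discrimination and also yields better results in cross-modal retrieval and image captioning.

Insight: The paper reveals limits in the semantic understanding of modern vision models, especially when distinguishing real objects from look-alikes. Adjusting embeddings along the estimated direction can further strengthen semantic discrimination.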

Abstract: Recent advances in computer vision have yielded models with strong performance on recognition benchmarks; however, significant gaps remain in comparison to human perception. One subtle ability is to judge whether an image looks like a given object without being an instance of that object. We study whether vision-language models such as CLIP capture this distinction. We curated a dataset named RoLA (Real or Lookalike) of real and lookalike exemplars (e.g., toys, statues, drawings, pareidolia) across multiple categories, and first evaluate a prompt-based baseline with paired “real”/“lookalike” prompts. We then estimate a direction in CLIP’s embedding space that moves representations between real and lookalike. Applying this direction to image and text embeddings improves discrimination in cross-modal retrieval on Conceptual12M, and also enhances captions produced by a CLIP prefix captioner.
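
The abstract does not say how the real-to-lookalike direction is estimated. One common way to obtain such a direction is the normalized difference of class means, sketched below on toy vectors. Everything here is an assumption for illustration: the helper names, the mean-difference estimator, the step size `alpha`, and the 3-dimensional stand-ins for CLIP embeddings.

```python
import math


def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def real_lookalike_direction(real_embs, lookalike_embs):
    """Unit vector pointing from the mean 'real' embedding to the mean 'lookalike' one."""
    mu_real = mean(real_embs)
    mu_look = mean(lookalike_embs)
    return normalize([a - b for a, b in zip(mu_look, mu_real)])


def shift(embedding, direction, alpha):
    """Move an embedding along the direction by alpha, then re-normalize."""
    return normalize([e + alpha * d for e, d in zip(embedding, direction)])


# Toy 3-D stand-ins for embeddings (hypothetical data, not CLIP outputs).
real = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
lookalike = [[1.0, 0.0, 1.0], [0.9, 0.1, 0.8]]

direction = real_lookalike_direction(real, lookalike)
shifted = shift(real[0], direction, alpha=0.5)
```

In this toy setup the two groups differ only in the third coordinate, so the estimated direction is the third axis, and shifting a "real" embedding along it moves it toward the lookalike group.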

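[215] ReAlign: Text-to-Motion Generation via Step-Aware Reward-Guided Alignment cs.CV

Wanjiang Weng, Xiaofeng Tan, Junbo Wang, Guo-Sen Xie, Pan Zhou

TL;DR: The paper proposes a reward-guided alignment method (ReAlign) that uses a step-aware reward model and an improved sampling strategy to address the misalignment between text and motion distributions in diffusion models, significantly improving text-to-motion generation quality.

Details

Motivation: Existing diffusion-based text-to-motion methods produce diverse and realistic motion, but the misalignment between text and motion distributions can lead to semantically inconsistent or low-quality results.

Result: On motion generation and retrieval tasks, ReAlign significantly improves text-motion alignment and motion quality, surpassing existing methods.

Insight: Assessing alignment at each sampling step and guiding the sampling process accordingly can improve the generation quality of diffusion models, particularly for tasks that align text with complex outputs.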

Abstract: Text-to-motion generation, which synthesizes 3D human motions from text inputs, holds immense potential for applications in gaming, film, and robotics. Recently, diffusion-based methods have been shown to generate more diverse and realistic motion. However, there exists a misalignment between text and motion distributions in diffusion models, which leads to semantically inconsistent or low-quality motions. To address this limitation, we propose Reward-guided sampling Alignment (ReAlign), comprising a step-aware reward model to assess alignment quality during the denoising sampling and a reward-guided strategy that directs the diffusion process toward an optimally aligned distribution. This reward model integrates step-aware tokens and combines a text-aligned module for semantic consistency and a motion-aligned module for realism, refining noisy motions at each timestep to balance probability density and alignment. Extensive experiments on both motion generation and retrieval tasks demonstrate that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods.

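[216] Are Large Vision Language Models Truly Grounded in Medical Images? Evidence from Italian Clinical Visual Question Answering cs.CV | cs.AI

Federico Felizzi, Olivia Riccomi, Michele Ferramola, Francesco Andrea Causio, Manuel Del Medico

TL;DR: The study examines whether frontier VLMs actually rely on visual information when answering Italian medical visual questions, and finds that reliance on visual input varies markedly across models.

Details

Motivation: VLMs perform well on medical visual question answering, but how much they rely on visual information is unclear, which may affect their reliability in clinical settings.

Result: GPT-4o depends most on the image (accuracy drops by 27.9 percentage points), while the other models depend on it less (GPT-5-mini, Gemini, and Claude drop by 8.5, 2.4, and 5.6 percentage points, respectively).

Insight: VLM performance on medical tasks may rest heavily on textual cues rather than visual analysis, so blind trust in clinical settings should be avoided.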

Abstract: Large vision language models (VLMs) have achieved impressive performance on medical visual question answering benchmarks, yet their reliance on visual information remains unclear. We investigate whether frontier VLMs demonstrate genuine visual grounding when answering Italian medical questions by testing four state-of-the-art models: Claude Sonnet 4.5, GPT-4o, GPT-5-mini, and Gemini 2.0 flash exp. Using 60 questions from the EuropeMedQA Italian dataset that explicitly require image interpretation, we substitute correct medical images with blank placeholders to test whether models truly integrate visual and textual information. Our results reveal striking variability in visual dependency: GPT-4o shows the strongest visual grounding with a 27.9pp accuracy drop (83.2% [74.6%, 91.7%] to 55.3% [44.1%, 66.6%]), while GPT-5-mini, Gemini, and Claude maintain high accuracy with modest drops of 8.5pp, 2.4pp, and 5.6pp respectively. Analysis of model-generated reasoning reveals confident explanations for fabricated visual interpretations across all models, suggesting varying degrees of reliance on textual shortcuts versus genuine visual analysis. These findings highlight critical differences in model robustness and the need for rigorous evaluation before clinical deployment.
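
The "pp" figures in the abstract are percentage-point differences between accuracy with the correct image and accuracy with a blank placeholder. As a small worked example using only the GPT-4o point estimates quoted above (the function names are illustrative, and the relative drop is an extra quantity not reported in the abstract):

```python
def drop_pp(acc_with_image, acc_blank):
    """Absolute accuracy drop in percentage points."""
    return acc_with_image - acc_blank


def relative_drop(acc_with_image, acc_blank):
    """Drop expressed as a fraction of the original accuracy."""
    return (acc_with_image - acc_blank) / acc_with_image


# GPT-4o point estimates from the abstract: 83.2% with images, 55.3% with blanks.
gpt4o_pp = drop_pp(83.2, 55.3)
gpt4o_rel = relative_drop(83.2, 55.3)

print(f"{gpt4o_pp:.1f} pp absolute, {gpt4o_rel:.1%} relative")
```

The absolute figure matches the 27.9pp reported in the abstract; the relative figure shows that roughly a third of GPT-4o's accuracy disappears once the image is withheld.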

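[217] Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving cs.CV

Jianhua Han, Meng Tian, Jiangtong Zhu, Fan He, Huixin Zhang

TL;DR: Percept-WAM is a perception-enhanced World-Awareness-Action Model, the first to implicitly integrate 2D/3D scene understanding within a single vision-language model (VLM), which significantly improves the robustness and accuracy of autonomous driving.

Details

Motivation: Current autonomous driving systems have limited spatial perception and are unstable in long-tail scenarios and complex interactions. Vision-language models (VLMs) are weak at spatial understanding and grounding, which limits their effectiveness.

Result: It achieves 51.7/58.9 mAP on COCO 2D detection and nuScenes BEV 3D detection, respectively, and its planning performance surpasses DiffusionDrive (by 2.1 PMDS on NAVSIM).

Insight: Percept-WAM shows the potential of unifying perception tasks in a single model while retaining general reasoning ability, offering a new approach to long-tail scenarios and complex interactions in autonomous driving.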

Abstract: Autonomous driving heavily relies on accurate and robust spatial perception. Many failures arise from inaccuracies and instability, especially in long-tail scenarios and complex interactions. However, current vision-language models are weak at spatial grounding and understanding, and VLA systems built on them therefore show limited perception and localization ability. To address these challenges, we introduce Percept-WAM, a perception-enhanced World-Awareness-Action Model that is the first to implicitly integrate 2D/3D scene understanding abilities within a single vision-language model (VLM). Instead of relying on QA-style spatial reasoning, Percept-WAM unifies 2D/3D perception tasks into World-PV and World-BEV tokens, which encode both spatial coordinates and confidence. We propose a grid-conditioned prediction mechanism for dense object perception, incorporating IoU-aware scoring and parallel autoregressive decoding, improving stability in long-tail, far-range, and small-object scenarios. Additionally, Percept-WAM leverages pretrained VLM parameters to retain general intelligence (e.g., logical reasoning) and can output perception results and trajectory control outputs directly. Experiments show that Percept-WAM matches or surpasses classical detectors and segmenters on downstream perception benchmarks, achieving 51.7/58.9 mAP on COCO 2D detection and nuScenes BEV 3D detection. When integrated with trajectory decoders, it further improves planning performance on nuScenes and NAVSIM, e.g., surpassing DiffusionDrive by 2.1 in PMDS on NAVSIM. Qualitative results further highlight its strong open-vocabulary and long-tail generalization.

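[218] Learning Plug-and-play Memory for Guiding Video Diffusion Models cs.CV | cs.AI

Selena Song, Ziming Xu, Zijun Zhang, Kun Zhou, Jiaxian Guo

TL;DR: The paper explores how to inject explicit world knowledge into Diffusion Transformer-based video generation models, using a learnable memory encoder to improve physical rule following and video quality.

Details

Motivation: Existing Diffusion Transformer video generation models deliver good visual quality and temporal coherence but still frequently violate physical laws and commonsense dynamics, indicating a lack of explicit world knowledge.

Result: Experiments show that the method improves physical rule following and video fidelity with few trainable parameters (150M) and only 10K training samples.

Insight: Low-pass and high-pass filters in the embedding space naturally disentangle low-level appearance from high-level physical/semantic cues, enabling targeted guidance.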

Abstract: Diffusion Transformer (DiT)-based video generation models have recently achieved impressive visual quality and temporal coherence, but they still frequently violate basic physical laws and commonsense dynamics, revealing a lack of explicit world knowledge. In this work, we explore how to equip them with a plug-and-play memory that injects useful world knowledge. Motivated by in-context memory in Transformer-based LLMs, we conduct empirical studies to show that DiT can be steered via interventions on its hidden states, and that simple low-pass and high-pass filters in the embedding space naturally disentangle low-level appearance and high-level physical/semantic cues, enabling targeted guidance. Building on these observations, we propose a learnable memory encoder DiT-Mem, composed of stacked 3D CNNs, low-/high-pass filters, and self-attention layers. The encoder maps reference videos into a compact set of memory tokens, which are concatenated as the memory within the DiT self-attention layers. During training, we keep the diffusion backbone frozen, and only optimize the memory encoder. This yields an efficient training process with few trainable parameters (150M) and 10K data samples, and enables plug-and-play usage at inference time. Extensive experiments on state-of-the-art models demonstrate the effectiveness of our method in improving physical rule following and video fidelity. Our code and data are publicly released here: https://thrcle421.github.io/DiT-Mem-Web/.

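[219] IDSplat: Instance-Decomposed 3D Gaussian Splatting for Driving Scenes cs.CV

Carl Lindström, Mahan Rafidashti, Maryam Fatemi, Lars Hammarstrand, Martin R. Oswald

TL;DR: IDSplat is a self-supervised 3D Gaussian Splatting framework for reconstructing dynamic driving scenes that achieves instance decomposition and learns motion trajectories without human annotations.

Details

Motivation: Reconstructing dynamic driving scenes is essential for developing autonomous driving systems, but existing methods either rely on costly human annotations or cannot explicitly separate static from dynamic elements. IDSplat aims to solve these problems with a more efficient approach that needs no manual intervention.

Result: Experiments on the Waymo Open Dataset show that IDSplat achieves competitive reconstruction quality with instance-level decomposition, and generalizes across diverse sequences and view densities without retraining.

Insight: Modeling dynamic objects as coherent instances undergoing rigid transformations, rather than as unstructured time-varying primitives, lets IDSplat reconstruct and decompose dynamic scenes more efficiently.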

Abstract: Reconstructing dynamic driving scenes is essential for developing autonomous systems through sensor-realistic simulation. Although recent methods achieve high-fidelity reconstructions, they either rely on costly human annotations for object trajectories or use time-varying representations without explicit object-level decomposition, leading to intertwined static and dynamic elements that hinder scene separation. We present IDSplat, a self-supervised 3D Gaussian Splatting framework that reconstructs dynamic scenes with explicit instance decomposition and learnable motion trajectories, without requiring human annotations. Our key insight is to model dynamic objects as coherent instances undergoing rigid transformations, rather than unstructured time-varying primitives. For instance decomposition, we employ zero-shot, language-grounded video tracking anchored to 3D using lidar, and estimate consistent poses via feature correspondences. We introduce a coordinated-turn smoothing scheme to obtain temporally and physically consistent motion trajectories, mitigating pose misalignments and tracking failures, followed by joint optimization of object poses and Gaussian parameters. Experiments on the Waymo Open Dataset demonstrate that our method achieves competitive reconstruction quality while maintaining instance-level decomposition and generalizes across diverse sequences and view densities without retraining, making it practical for large-scale autonomous driving applications. Code will be released.

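[220] Adversarial Patch Attacks on Vision-Based Cargo Occupancy Estimation via Differentiable 3D Simulation cs.CV | cs.AI

Mohamed Rissal Hedna, Sesugh Samuel Nder

TL;DR: The paper studies adversarial patch attacks on computer-vision-based cargo occupancy estimation, optimizing patch textures through differentiable 3D simulation and demonstrating high attack success rates in 3D environments.

Details

Motivation: Computer vision is widely adopted in modern logistics, but its security may be threatened by adversarial patch attacks. The paper aims to test the feasibility of such attacks in simulated 3D environments.

Result: 3D-optimized patches reach a success rate of 84.94% in the denial-of-service attack (empty to full), while the concealment attack (full to empty) succeeds 30.32% of the time.

Insight: The study reveals potential security vulnerabilities in automated logistics systems and points to directions for strengthening physical robustness.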

Abstract: Computer vision systems are increasingly adopted in modern logistics operations, including the estimation of trailer occupancy for planning, routing, and billing. Although effective, such systems may be vulnerable to physical adversarial attacks, particularly adversarial patches that can be printed and placed on interior surfaces. In this work, we study the feasibility of such attacks on a convolutional cargo-occupancy classifier using fully simulated 3D environments. Using Mitsuba 3 for differentiable rendering, we optimize patch textures across variations in geometry, lighting, and viewpoint, and compare their effectiveness to a 2D compositing baseline. Our experiments demonstrate that 3D-optimized patches achieve high attack success rates, especially in a denial-of-service scenario (empty to full), where success reaches 84.94 percent. Concealment attacks (full to empty) prove more challenging but still reach 30.32 percent. We analyze the factors influencing attack success, discuss implications for the security of automated logistics pipelines, and highlight directions for strengthening physical robustness. To our knowledge, this is the first study to investigate adversarial patch attacks for cargo-occupancy estimation in physically realistic, fully simulated 3D scenes.

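[221] LAST: LeArning to Think in Space and Time for Generalist Vision-Language Models cs.CV

Shuai Wang, Daoan Zhang, Tianyi Bai, Shitong Shao, Jiebo Luo

TL;DR: The paper proposes LAST, which makes vision-language models think in space and time to jointly improve 3D spatial and long-video understanding without separate specialized architectures.

Details

Motivation: Existing vision-language models perform poorly on 3D spatial and long-video understanding and typically need specialized designs; LAST aims to address both with a unified method.

Result: LAST brings substantial gains on 3D spatial, video, and image understanding tasks: 15.8% on EgoSchema in the zero-shot setting and 8.3 on VSI-Bench in the fine-tuning setting.

Insight: With a unified visual thinking mechanism, LAST shows the potential to jointly improve spatial and temporal understanding in general vision-language models.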

Abstract: Humans can perceive and understand 3D space and long videos from sequential visual observations. But can vision-language models (VLMs)? Recent work demonstrates that even state-of-the-art VLMs still struggle to understand 3D space and long videos, although they are powerful in typical vision-language tasks. Current methods often rely on specialized architectural designs to improve performance for 3D tasks and video understanding tasks separately. In contrast, we propose LAST, short for LeArn to Think in Space and Time, to jointly improve 3D spatial and long video understanding for general VLMs with only a set of 2D images as inputs. LAST makes VLMs think in space and time rather than only with text before giving the final answer, building visual thinking trajectories in 3D space and the temporal dimension. We demonstrate the effectiveness of LAST in two scenarios: 1) zero-shot, where we directly prompt proprietary models; and 2) fine-tuning general VLMs with data that include thinking trajectories in 3D space and time. We show that LAST brings substantial gains in various benchmarks, including 3 spatial understanding, 4 video understanding, and 3 image understanding tasks. Notably, LAST yields 15.8% gains on EgoSchema with GPT-4o in a zero-shot manner and 8.3 gains on VSI-Bench compared with Qwen2.5-VL-7B.

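[222] Diffusion Reconstruction-based Data Likelihood Estimation for Core-Set Selection cs.CV

Mingyang Chen, Jiawei Du, Bo Huang, Yi Wang, Xiaobo Zhang

TL;DR: The paper proposes a diffusion-model-based data likelihood estimation method for core-set selection, scoring data importance by the reconstruction deviation induced by partial reverse denoising; it consistently outperforms heuristic baselines.

Details

Motivation: Existing core-set selection methods mostly rely on heuristic scoring signals (such as training dynamics or model uncertainty) and lack explicit modeling of data likelihood, so they may miss critical distributional structure.

Result: Experiments on ImageNet show that the method closely matches full-data training using only 50% of the data and consistently outperforms the baselines.

Insight: Likelihood-based scoring offers a new perspective on data selection and sheds light on the relationship between data distribution characteristics and model learning preferences.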

Abstract: Existing core-set selection methods predominantly rely on heuristic scoring signals such as training dynamics or model uncertainty, lacking explicit modeling of data likelihood. This omission may hinder the constructed subset from capturing subtle yet critical distributional structures that underpin effective model training. In this work, we propose a novel, theoretically grounded approach that leverages diffusion models to estimate data likelihood via reconstruction deviation induced by partial reverse denoising. Specifically, we establish a formal connection between reconstruction error and data likelihood, grounded in the Evidence Lower Bound (ELBO) of Markovian diffusion processes, thereby enabling a principled, distribution-aware scoring criterion for data selection. Complementarily, we introduce an efficient information-theoretic method to identify the optimal reconstruction timestep, ensuring that the deviation provides a reliable signal indicative of underlying data likelihood. Extensive experiments on ImageNet demonstrate that reconstruction deviation offers an effective scoring criterion, consistently outperforming existing baselines across selection ratios, and closely matching full-data training using only 50% of the data. Further analysis shows that the likelihood-informed nature of our score reveals informative insights in data selection, shedding light on the interplay between data distributional characteristics and model learning preferences.

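[223] ReMatch: Boosting Representation through Matching for Multimodal Retrieval cs.CV

Qianying Liu, Xiao Liang, Zhiqiang Zhang, Yibo Chen, Xu Tang

TL;DR: ReMatch is a framework that strengthens multimodal retrieval representations through matching. It exploits the generative ability of MLLMs, judging relevance autoregressively from multi-view inputs, and improves zero-shot generalization.

Details

Motivation: Existing methods treat the MLLM only as an encoder, ignoring its generative and compositional reasoning abilities. ReMatch aims to make full use of the MLLM's generative nature to improve the performance and generalization of multimodal retrieval.

Result: It sets a new state of the art on the MMEB benchmark and shows particularly strong zero-shot generalization on five datasets.

Insight: ReMatch demonstrates the importance of MLLMs' generative ability for multimodal retrieval; combining autoregressive relevance judgment with multi-view inputs improves robustness and generalization.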

Abstract: We present ReMatch, a framework that leverages the generative strength of MLLMs for multimodal retrieval. Previous approaches treated an MLLM as a simple encoder, ignoring its generative nature and under-utilising its compositional reasoning and world knowledge. We instead train the embedding MLLM end-to-end with a chat-style generative matching stage. The matching stage uses the same MLLM to autoregressively decide relevance from multi-view inputs, including both raw data and its own projected embeddings for each query and document. It provides instance-wise discrimination supervision that complements a standard contrastive loss, offering stronger gradients on hard negatives and preserving the compositional strengths of the original MLLM. To obtain semantically richer multimodal embeddings, we use multiple learnable tokens to augment each input, generating fine-grained contextual, mutually orthogonal embeddings with low inference cost. Leveraging our established high-performance baseline, we assemble the ideas mentioned above into a powerful training recipe and achieve a new state-of-the-art on the Massive Multimodal Embedding Benchmark (MMEB). Our experiments show particularly strong zero-shot generalization results on five datasets, highlighting the robustness and transferability of ReMatch.
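
The abstract refers to "a standard contrastive loss" without naming it. A widely used form is InfoNCE, sketched below for a single query to make the role of hard negatives concrete. This is an assumption about what "standard" means here, not ReMatch's actual objective; the function name, the temperature value, and the toy similarity scores are all illustrative.

```python
import math


def info_nce(similarities, positive_index, temperature=0.05):
    """InfoNCE loss for one query: -log softmax(sim / T) at the positive document."""
    logits = [s / temperature for s in similarities]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[positive_index]


# Hypothetical cosine similarities: index 0 is the matching document.
easy = info_nce([0.9, 0.1], positive_index=0)   # negative is far from the query
hard = info_nce([0.9, 0.85], positive_index=0)  # negative is almost as close
```

With an easy negative the loss is close to zero, while a hard negative that scores nearly as high as the positive keeps the loss, and therefore the training signal, substantial.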

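[224] IDEAL-M3D: Instance Diversity-Enriched Active Learning for Monocular 3D Detection cs.CV

Johannes Meier, Florian Günther, Riccardo Marin, Oussema Dhaouadi, Jacques Kaiser

TL;DR: IDEAL-M3D proposes an instance-level active learning method for monocular 3D detection that addresses the inefficiency and depth bias of conventional methods.

Details

Motivation: Annotation for monocular 3D detection is costly; conventional active learning selects entire images, which is inefficient, and relying on uncertainty introduces a bias toward depth ambiguity.

Result: With only 60% of the annotations, it matches or exceeds the AP3D of training on the full dataset.

Insight: Diversity-driven, instance-level selection is key to annotation efficiency in monocular 3D detection.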

Abstract: Monocular 3D detection relies on just a single camera and is therefore easy to deploy. Yet, achieving reliable 3D understanding from monocular images requires substantial annotation, and 3D labels are especially costly. To maximize performance under constrained labeling budgets, it is essential to prioritize annotating samples expected to deliver the largest performance gains. This prioritization is the focus of active learning. Curiously, we observed two significant limitations in active learning algorithms for 3D monocular object detection. First, previous approaches select entire images, which is inefficient, as non-informative instances contained in the same image also need to be labeled. Secondly, existing methods rely on uncertainty-based selection, which in monocular 3D object detection creates a bias toward depth ambiguity. Consequently, distant objects are selected, while nearby objects are overlooked. To address these limitations, we propose IDEAL-M3D, the first instance-level pipeline for monocular 3D detection. For the first time, we demonstrate that an explicitly diverse, fast-to-train ensemble improves diversity-driven active learning for monocular 3D. We induce diversity with heterogeneous backbones and task-agnostic features, loss weight perturbation, and time-dependent bagging. IDEAL-M3D shows superior performance and significant resource savings: with just 60% of the annotations, we achieve similar or better AP3D on KITTI validation and test set results compared to training the same detector on the whole dataset.

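[225] Evaluating Dataset Watermarking for Fine-tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach cs.CV | cs.AI

Xincheng Wang, Hanchi Sun, Wenjun Sun, Kejun Xue, Wangqiu Zhou

TL;DR: The paper proposes a comprehensive framework for evaluating dataset watermarking, reveals the shortcomings of existing methods under real-world threat scenarios, and presents a practical watermark removal method.

Details

Motivation: As fine-tuning techniques for diffusion models advance, the ability of customized models to reproduce specific image sets raises copyright and security concerns. Dataset watermarking is seen as a potential solution, but a unified evaluation framework is lacking.

Result: Existing watermarking methods perform well in universality and transmissibility but are not robust enough under real-world threat scenarios. The proposed removal method fully eliminates the watermarks without affecting fine-tuning.

Insight: The paper exposes the fragility of current dataset watermarking in real-world use and points to directions for future work, particularly improving watermark robustness and resistance to attacks.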

Abstract: Recent fine-tuning techniques for diffusion models enable them to reproduce specific image sets, such as particular faces or artistic styles, but also introduce copyright and security risks. Dataset watermarking has been proposed to ensure traceability by embedding imperceptible watermarks into training images, which remain detectable in outputs even after fine-tuning. However, current methods lack a unified evaluation framework. To address this, this paper establishes a general threat model and introduces a comprehensive evaluation framework encompassing Universality, Transmissibility, and Robustness. Experiments show that existing methods perform well in universality and transmissibility, and exhibit some robustness against common image processing operations, yet still fall short under real-world threat scenarios. To reveal these vulnerabilities, the paper further proposes a practical watermark removal method that fully eliminates dataset watermarks without affecting fine-tuning, highlighting a key challenge for future research.

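[226] SyncMV4D: Synchronized Multi-view Joint Diffusion of Appearance and Motion for Hand-Object Interaction Synthesis cs.CV

Lingwei Dang, Zonghan Li, Juntong Li, Hongwen Zhang, Liang An

TL;DR: SyncMV4D jointly generates synchronized multi-view hand-object interaction videos and 4D motions, overcoming the limitations of single-view methods and of dependence on 3D data, and unifying visual realism with motion plausibility.

Details

Motivation: Existing hand-object interaction generation methods are mostly single-view or depend on high-quality 3D data captured in laboratories, which makes them hard to generalize to real-world scenes. SyncMV4D combines multi-view geometry with motion dynamics to make interaction synthesis more complete and realistic.

Result: Experiments show that SyncMV4D outperforms existing methods in visual realism, motion plausibility, and multi-view consistency.

Insight: Combining multi-view geometry with motion dynamics markedly improves interaction synthesis quality, and a closed-loop cycle is an effective way to unify appearance and motion.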

Abstract: Hand-Object Interaction (HOI) generation plays a critical role in advancing applications across animation and robotics. Current video-based methods are predominantly single-view, which impedes comprehensive 3D geometry perception and often results in geometric distortions or unrealistic motion patterns. While 3D HOI approaches can generate dynamically plausible motions, their dependence on high-quality 3D data captured in controlled laboratory settings severely limits their generalization to real-world scenarios. To overcome these limitations, we introduce SyncMV4D, the first model that jointly generates synchronized multi-view HOI videos and 4D motions by unifying visual prior, motion dynamics, and multi-view geometry. Our framework features two core innovations: (1) a Multi-view Joint Diffusion (MJD) model that co-generates HOI videos and intermediate motions, and (2) a Diffusion Points Aligner (DPA) that refines the coarse intermediate motion into globally aligned 4D metric point tracks. To tightly couple 2D appearance with 4D dynamics, we establish a closed-loop, mutually enhancing cycle. During the diffusion denoising process, the generated video conditions the refinement of the 4D motion, while the aligned 4D point tracks are reprojected to guide next-step joint generation. Experimentally, our method demonstrates superior performance to state-of-the-art alternatives in visual realism, motion plausibility, and multi-view consistency.

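[227] SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation cs.CV

Jiaming Zhang, Shengming Cao, Rui Li, Xiaotong Zhao, Yutao Cui

TL;DR: SteadyDancer is a framework based on the Image-to-Video (I2V) paradigm that achieves first-frame preservation and precise motion control through a Condition-Reconciliation Mechanism and Synergistic Pose Modulation Modules, and performs strongly in experiments.

Details

Motivation: The existing Reference-to-Video (R2V) paradigm suffers from identity drift and visual artifacts when handling spatio-temporal misalignment, and cannot achieve first-frame preservation and precise motion control at the same time.

Result: SteadyDancer achieves state-of-the-art appearance fidelity and motion control while requiring fewer training resources.

Insight: By reconciling conditions and optimizing in stages, the I2V paradigm can effectively resolve the spatio-temporal misalignment problems of the R2V paradigm.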

Abstract: Preserving first-frame identity while ensuring precise motion control is a fundamental challenge in human image animation. The Image-to-Motion Binding process of the dominant Reference-to-Video (R2V) paradigm overlooks critical spatio-temporal misalignments common in real-world applications, leading to failures such as identity drift and visual artifacts. We introduce SteadyDancer, an Image-to-Video (I2V) paradigm-based framework that achieves harmonized and coherent animation and is the first to ensure first-frame preservation robustly. Firstly, we propose a Condition-Reconciliation Mechanism to harmonize the two conflicting conditions, enabling precise control without sacrificing fidelity. Secondly, we design Synergistic Pose Modulation Modules to generate an adaptive and coherent pose representation that is highly compatible with the reference image. Finally, we employ a Staged Decoupled-Objective Training Pipeline that hierarchically optimizes the model for motion fidelity, visual quality, and temporal coherence. Experiments demonstrate that SteadyDancer achieves state-of-the-art performance in both appearance fidelity and motion control, while requiring significantly fewer training resources than comparable methods.

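[228] MonoMSK: Monocular 3D Musculoskeletal Dynamics Estimation cs.CV

Farnoosh Koleini, Hongfei Xue, Ahmed Helmy, Pu Wang

TL;DR: MonoMSK is a hybrid framework that combines data-driven learning with physics-based simulation to estimate biomechanically realistic 3D human motion from monocular video, recovering both kinematics and kinetics.

Details

Motivation: Current monocular methods use oversimplified, anatomically inaccurate models (such as SMPL) and ignore physical constraints, which limits their biomechanical fidelity. A method is needed that recovers both kinematics and kinetics with higher biomechanical accuracy.

Result: Experiments on the BML-MoVi, BEDLAM, and OpenCap datasets show that MonoMSK significantly outperforms existing methods in kinematic accuracy and, for the first time, enables precise monocular kinetics estimation.

Insight: By combining data-driven learning with physics simulation, MonoMSK shows that biomechanically accurate motion estimation is possible from monocular video, opening a new direction for future research.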

Abstract: Reconstructing biomechanically realistic 3D human motion - recovering both kinematics (motion) and kinetics (forces) - is a critical challenge. While marker-based systems are lab-bound and slow, popular monocular methods use oversimplified, anatomically inaccurate models (e.g., SMPL) and ignore physics, fundamentally limiting their biomechanical fidelity. In this work, we introduce MonoMSK, a hybrid framework that bridges data-driven learning and physics-based simulation for biomechanically realistic 3D human motion estimation from monocular video. MonoMSK jointly recovers both kinematics (motions) and kinetics (forces and torques) through an anatomically accurate musculoskeletal model. By integrating transformer-based inverse dynamics with differentiable forward kinematics and dynamics layers governed by ODE-based simulation, MonoMSK establishes a physics-regulated inverse-forward loop that enforces biomechanical causality and physical plausibility. A novel forward-inverse consistency loss further aligns motion reconstruction with the underlying kinetic reasoning. Experiments on BML-MoVi, BEDLAM, and OpenCap show that MonoMSK significantly outperforms state-of-the-art methods in kinematic accuracy, while for the first time enabling precise monocular kinetics estimation.

[229] POUR: A Provably Optimal Method for Unlearning Representations via Neural Collapse cs.CVPDF

Anjie Le, Can Peng, Yuyuan Liu, J. Alison Noble

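TL;DR: POUR uses Neural Collapse theory to unlearn visual concepts optimally at the representation level. It comprises a closed-form projection (POUR-P) and a feature-level unlearning variant under a distillation scheme (POUR-D), and experiments on CIFAR and PathMNIST show that it outperforms existing methods.

Details

Motivation: Existing machine unlearning methods often modify only the classifier and leave internal representations intact, which results in incomplete forgetting. POUR aims for more thorough forgetting at the representation level while preserving other knowledge.

Result: On CIFAR-10/100 and PathMNIST, POUR outperforms existing methods on both classification-level and representation-level metrics, unlearning effectively while preserving retained knowledge.

Insight: Representation-level unlearning is more thorough than adjusting the classifier alone, and Neural Collapse theory provides a theoretical guarantee for optimal forgetting.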

Abstract: In computer vision, machine unlearning aims to remove the influence of specific visual concepts or training images without retraining from scratch. Studies show that existing approaches often modify the classifier while leaving internal representations intact, resulting in incomplete forgetting. In this work, we extend the notion of unlearning to the representation level, deriving a three-term interplay between forgetting efficacy, retention fidelity, and class separation. Building on Neural Collapse theory, we show that the orthogonal projection of a simplex Equiangular Tight Frame (ETF) remains an ETF in a lower dimensional space, yielding a provably optimal forgetting operator. We further introduce the Representation Unlearning Score (RUS) to quantify representation-level forgetting and retention fidelity. Building on this, we introduce POUR (Provably Optimal Unlearning of Representations), a geometric projection method with a closed-form variant (POUR-P) and a feature-level unlearning variant under a distillation scheme (POUR-D). Experiments on CIFAR-10/100 and PathMNIST demonstrate that POUR achieves effective unlearning while preserving retained knowledge, outperforming state-of-the-art unlearning methods on both classification-level and representation-level metrics.
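
The geometric claim in this abstract (projecting a simplex ETF orthogonally yields an ETF in a lower-dimensional space) can be checked numerically. The sketch below is not the authors' POUR-P implementation; it assumes idealized class means that already form a perfect simplex ETF, and the helper names (`simplex_etf`, `project_out`, `pairwise_cosines`) are hypothetical. For K classes the pairwise cosine is -1/(K-1); after projecting out the forgotten class mean, the K-1 remaining means should have pairwise cosine -1/(K-2).

```python
import math

def simplex_etf(k):
    """Return k unit vectors in R^k forming a simplex ETF (pairwise cosine -1/(k-1))."""
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(vectors, direction):
    """Orthogonally project each vector onto the complement of `direction`."""
    norm_sq = dot(direction, direction)
    return [[a - dot(v, direction) / norm_sq * d for a, d in zip(v, direction)]
            for v in vectors]

def pairwise_cosines(vectors):
    """Cosine similarity for every unordered pair of vectors."""
    cosines = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            denom = math.sqrt(dot(vectors[i], vectors[i]) * dot(vectors[j], vectors[j]))
            cosines.append(dot(vectors[i], vectors[j]) / denom)
    return cosines

k = 5
means = simplex_etf(k)                       # idealized class means (assumption)
forget = means[-1]                           # class mean of the concept to forget
retained = project_out(means[:-1], forget)   # remaining means after projection
cosines = pairwise_cosines(retained)         # each should be close to -1/(k-2)
```

All retained pairs share the same angle, which is the equiangular property the abstract relies on; real features only approximate this structure.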

[230] Syn-GRPO: Self-Evolving Data Synthesis for MLLM Perception Reasoning cs.CVPDF

Qihan Huang, Haofei Zhang, Rong Wei, Yi Wang, Rui Tang

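TL;DR: Syn-GRPO is a self-evolving data synthesis method that uses an online data generator to supply high-quality, diverse training data for GRPO-based reinforcement learning of MLLMs, addressing the low data quality of existing methods.

Details

Motivation: Existing reinforcement learning methods for MLLMs (e.g., GRPO) suffer from low data quality, which restricts the model's exploration scope. A solution that addresses the problem at its root is needed.

Result: Across three visual perception tasks, Syn-GRPO improves data quality by a large margin and outperforms existing MLLM perception methods.

Insight: Syn-GRPO shows promise for scaling long-term self-evolving RL and offers a new approach to improving MLLM training data quality.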

Abstract: Reinforcement learning (RL) methods (e.g., GRPO) for improving the perception ability of multimodal LLMs (MLLMs) have attracted wide research interest owing to their remarkable generalization ability. Nevertheless, existing reinforcement learning methods still face the problem of low data quality, where data samples cannot elicit diverse responses from MLLMs, thus restricting the exploration scope for MLLM reinforcement learning. Some methods attempt to mitigate this problem by imposing constraints on entropy, but none address it at its root. Therefore, to tackle this problem, this work proposes Syn-GRPO (Synthesis-GRPO), which employs an online data generator to synthesize high-quality training data with diverse responses in GRPO training. Specifically, Syn-GRPO consists of two components: (1) a data server and (2) a GRPO workflow. The data server synthesizes new samples from existing ones using an image generation model, featuring a decoupled and asynchronous scheme to achieve high generation efficiency. The GRPO workflow provides the data server with the new image descriptions, and it leverages a diversity reward to supervise the MLLM to predict image descriptions for synthesizing samples with diverse responses. Experimental results across three visual perception tasks demonstrate that Syn-GRPO improves the data quality by a large margin, achieving significantly superior performance to existing MLLM perception methods, and that Syn-GRPO presents promising potential for scaling long-term self-evolving RL. Our code is available at https://github.com/hqhQAQ/Syn-GRPO.

[231] Growing with the Generator: Self-paced GRPO for Video Generation cs.CVPDF

Rui Li, Yuanzhi Liang, Ziqi Ni, Haibing Huang, Chi Zhang

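TL;DR: The paper proposes Self-Paced GRPO, a competence-aware Group Relative Policy Optimization framework for video generation in which reward feedback is adjusted dynamically during training, alleviating the distributional bias and training instability caused by static rewards in conventional GRPO.

Details

Motivation: Conventional GRPO pipelines rely on static reward models whose evaluation behavior stays fixed throughout training. This introduces distributional bias and reward saturation, limiting the effectiveness and stability of reinforcement-based alignment.

Result: Experiments on VBench across multiple video generation backbones show that Self-Paced GRPO improves on static-reward GRPO baselines in both visual quality and semantic alignment.

Insight: The reward mechanism should co-evolve with the generator's competence; dynamically shifting the reward emphasis can substantially improve both alignment quality and training stability.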

Abstract: Group Relative Policy Optimization (GRPO) has emerged as a powerful reinforcement learning paradigm for post-training video generation models. However, existing GRPO pipelines rely on static, fixed-capacity reward models whose evaluation behavior is frozen during training. Such rigid rewards introduce distributional bias, saturate quickly as the generator improves, and ultimately limit the stability and effectiveness of reinforcement-based alignment. We propose Self-Paced GRPO, a competence-aware GRPO framework in which reward feedback co-evolves with the generator. Our method introduces a progressive reward mechanism that automatically shifts its emphasis from coarse visual fidelity to temporal coherence and fine-grained text-video semantic alignment as generation quality increases. This self-paced curriculum alleviates reward-policy mismatch, mitigates reward exploitation, and yields more stable optimization. Experiments on VBench across multiple video generation backbones demonstrate consistent improvements in both visual quality and semantic alignment over GRPO baselines with static rewards, validating the effectiveness and generality of Self-Paced GRPO.
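
To make the idea of a progressive reward concrete, here is a minimal sketch, not the authors' method. The group-relative advantage (reward minus group mean, divided by group standard deviation) is the standard GRPO normalization. Everything else is an assumption for illustration: the `competence` scalar in [0, 1], the linear weighting schedule in `reward_weights`, and the three score names.

```python
import statistics

def reward_weights(competence):
    """Hypothetical schedule: move weight from fidelity to coherence and alignment."""
    c = min(max(competence, 0.0), 1.0)
    return {"fidelity": 1.0 - c, "coherence": c / 2, "alignment": c / 2}

def mixed_reward(scores, competence):
    """Weighted sum of per-aspect scores under the current schedule."""
    weights = reward_weights(competence)
    return sum(weights[name] * scores[name] for name in weights)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Three sampled videos for the same prompt (made-up scores for illustration).
group = [
    {"fidelity": 0.9, "coherence": 0.2, "alignment": 0.3},
    {"fidelity": 0.5, "coherence": 0.8, "alignment": 0.9},
    {"fidelity": 0.7, "coherence": 0.5, "alignment": 0.5},
]
early_adv = group_relative_advantages([mixed_reward(s, 0.0) for s in group])
late_adv = group_relative_advantages([mixed_reward(s, 1.0) for s in group])
```

With these made-up scores, the sample favored early in training (high fidelity) differs from the one favored late (high coherence and alignment), which is the kind of shift in emphasis the abstract describes.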

[232] An Anatomy Aware Hybrid Deep Learning Framework for Lung Cancer Tumor Stage Classification cs.CV | cs.AIPDF

Saniah Kayenat Chowdhury, Rusab Sarmun, Muhammad E. H. Chowdhury, Sohaib Bassam Zoghoul, Israa Al-Hashimi

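TL;DR: The paper proposes a deep learning framework that incorporates anatomical knowledge for lung cancer tumor stage classification. Rather than treating staging as pure image classification, it explicitly measures tumor size and distance properties, reaching 91.36% classification accuracy.

Details

Motivation: Accurate lung cancer staging is crucial for prognosis and treatment planning. Existing end-to-end deep learning approaches often overlook the spatial and anatomical information central to the tumor-node-metastasis (TNM) system, which makes staging difficult for them.

Result: On the Lung-PET-CT-Dx dataset, overall classification accuracy reaches 91.36%, with per-stage F1-scores of 0.93 (T1), 0.89 (T2), 0.96 (T3), and 0.90 (T4).

Insight: Explicitly embedding anatomical knowledge and quantitative rules in a deep learning pipeline can markedly improve both the accuracy and the interpretability of lung cancer staging.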

Abstract: Accurate lung cancer tumor staging is crucial for prognosis and treatment planning. However, it remains challenging for end-to-end deep learning approaches, as such approaches often overlook spatial and anatomical information that is central to the tumor-node-metastasis system. The tumor stage depends on multiple quantitative criteria, including the tumor size and its proximity to the nearest anatomical structures, and small variations can alter the staging outcome. We propose a medically grounded hybrid pipeline that performs staging by explicitly measuring the tumor’s size and distance properties rather than treating it as a pure image classification task. Our method employs specialized encoder-decoder networks to precisely segment the lung and adjacent anatomy, including the lobes, tumor, mediastinum, and diaphragm. Subsequently, we extract the necessary tumor properties, i.e., we measure the largest tumor dimension and calculate the distance between the tumor and neighboring anatomical structures through a quantitative analysis of the segmentation masks. Finally, we apply rule-based tumor staging aligned with the medical guidelines. This novel framework has been evaluated on the Lung-PET-CT-Dx dataset, demonstrating superior performance compared to traditional deep learning models, achieving an overall classification accuracy of 91.36%. We report the per-stage F1-scores of 0.93 (T1), 0.89 (T2), 0.96 (T3), and 0.90 (T4), a critical evaluation aspect often omitted in prior literature. To our knowledge, this is the first study that embeds explicit clinical context into tumor stage classification. Unlike standard convolutional neural networks that operate in an uninterpretable “black box” manner, our method offers both state-of-the-art performance and transparent decision support.
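
The measure-then-apply-rules step can be illustrated with a deliberately simplified sketch. It is not the paper's rule set. The size cut-offs (3, 5, and 7 cm) follow the TNM 8th edition T categories; the single invasion flag, the axis-aligned extent measurement, and the function names are assumptions, and real staging involves additional criteria.

```python
def largest_extent_cm(voxels, spacing_mm):
    """Largest axis-aligned extent of a tumor mask, given (z, y, x) voxel indices.

    Assumption: an axis-aligned bounding box is a good enough proxy for the
    largest tumor dimension; a real pipeline may measure it differently.
    """
    extents_mm = []
    for axis in range(3):
        coords = [v[axis] for v in voxels]
        extents_mm.append((max(coords) - min(coords) + 1) * spacing_mm[axis])
    return max(extents_mm) / 10.0

def t_stage(size_cm, invades_mediastinum_or_diaphragm=False):
    """Simplified T category from tumor size plus one invasion flag."""
    if invades_mediastinum_or_diaphragm or size_cm > 7.0:
        return "T4"
    if size_cm > 5.0:
        return "T3"
    if size_cm > 3.0:
        return "T2"
    return "T1"

# Toy mask: a line of 40 voxels along x at 1 mm spacing, i.e. 4.0 cm.
toy_voxels = [(0, 0, x) for x in range(40)]
size = largest_extent_cm(toy_voxels, spacing_mm=(1.0, 1.0, 1.0))
stage = t_stage(size)
```

Because the decision is a chain of explicit comparisons, each prediction can be traced back to a measured quantity, which is the transparency argument the abstract makes.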

[233] UISearch: Graph-Based Embeddings for Multimodal Enterprise UI Screenshots Retrieval cs.CVPDF

Maroun Ayli, Youssef Bakouny, Tushar Sharma, Nader Jalloul, Hani Seifeddine

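TL;DR: The paper presents UISearch, a graph-based embedding approach for multimodal retrieval of enterprise UI screenshots. Its graph representation encodes hierarchical relationships and spatial arrangements and achieves better discriminative power than existing vision encoders.

Details

Motivation: Enterprise software accumulates large numbers of UI screens, yet existing approaches rely only on visual similarity or text semantics and do not explicitly model the structural properties of a UI.

Result: On 20,396 financial software UIs, UISearch achieves 0.92 Top-5 accuracy with a median latency of 47.5 ms.

Insight: Structural embeddings markedly improve the discriminative power of UI representations and support complex queries and fine-grained distinctions.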

Abstract: Enterprise software companies maintain thousands of user interface screens across products and versions, creating critical challenges for design consistency, pattern discovery, and compliance checks. Existing approaches rely on visual similarity or text semantics, lacking explicit modeling of structural properties fundamental to user interface (UI) composition. We present a novel graph-based representation that converts UI screenshots into attributed graphs encoding hierarchical relationships and spatial arrangements, potentially generalizable to document layouts, architectural diagrams, and other structured visual domains. A contrastive graph autoencoder learns embeddings preserving multi-level similarity across visual, structural, and semantic properties. A comprehensive analysis demonstrates that our structural embeddings achieve better discriminative power than state-of-the-art vision encoders, representing a fundamental advance in the expressiveness of the UI representation. We implement this representation in UISearch, a multi-modal search framework that combines structural embeddings with semantic search through a composable query language. On 20,396 financial software UIs, UISearch achieves 0.92 Top-5 accuracy with 47.5 ms median latency (P95: 124 ms), scaling to 20,000+ screens. The hybrid indexing architecture enables complex queries and supports fine-grained UI distinction impossible with vision-only approaches.

[234] In-Video Instructions: Visual Signals as Generative Control cs.CV | cs.AIPDF

Gongfan Fang, Xinyin Ma, Xinchao Wang

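TL;DR: The paper proposes In-Video Instruction, a new paradigm that embeds visual signals (such as overlaid text, arrows, or trajectories) into video frames to control image-to-video generation, addressing the global and ambiguous nature of text prompts.

Details

Motivation: Existing video generation models are controlled mainly through text prompts, but textual descriptions are inherently global and coarse, which makes it hard to assign precise actions to different objects in complex scenes. Embedding instructions directly in the visual signal addresses this.

Result: Experiments show that video generation models can reliably interpret and execute visually embedded instructions, particularly in complex multi-object scenarios.

Insight: Visual signals can serve directly as control signals for generative models, providing more precise spatial and semantic information and a new route to controllable video generation in complex scenes.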

Abstract: Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatial-aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Extensive experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.

[235] Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens cs.CV | cs.AI | cs.LGPDF

Yiming Qin, Bomin Wei, Jiaxin Ge, Konstantinos Kallidromitis, Stephanie Fu

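TL;DR: The paper proposes Chain-of-Visual-Thought (COVT), a framework that uses continuous visual tokens to strengthen the spatial perception and geometric understanding of vision-language models (VLMs), improving performance on a range of perception tasks.

Details

Motivation: Existing VLMs reason well in linguistic space but fall short on tasks that require dense visual perception, such as spatial reasoning and geometric understanding, mainly because they have limited mechanisms for capturing dense information across spatial dimensions.

Result: Across more than ten perception benchmarks, COVT improves VLM performance by 3% to 16%.

Insight: Continuous visual tokens are a compact latent representation that encodes rich perceptual cues, giving VLMs more precise, grounded, and interpretable multimodal reasoning.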

Abstract: Vision-Language Models (VLMs) excel at reasoning in linguistic space but struggle with perceptual understanding that requires dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs have limited mechanisms to capture dense visual information across spatial dimensions. We introduce Chain-of-Visual-Thought (COVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens: compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, COVT distills knowledge from lightweight vision experts, capturing complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, the VLM with COVT autoregressively predicts these visual tokens to reconstruct dense supervision signals (e.g., depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual token space, preserving efficiency while optionally decoding dense predictions for interpretability. Evaluated across more than ten diverse perception benchmarks, including CV-Bench, MMVP, RealWorldQA, MMStar, WorldMedQA, and HRBench, integrating COVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance by 3% to 16% and demonstrates that compact continuous visual thinking enables more precise, grounded, and interpretable multimodal intelligence.

[236] SAM3-Adapter: Efficient Adaptation of Segment Anything 3 for Camouflage Object Segmentation, Shadow Detection, and Medical Image Segmentation cs.CVPDF

Tianrun Chen, Runlong Cao, Xinda Yu, Lanyun Zhu, Chaotao Ding

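TL;DR: SAM3-Adapter is an adapter framework designed for Segment Anything 3 (SAM3) that targets low-level segmentation tasks such as camouflaged object segmentation, shadow detection, and medical image segmentation, improving both accuracy and efficiency.

Details

Motivation: SAM and its successor perform poorly on fine-grained, low-level segmentation tasks such as camouflaged object detection and medical image segmentation. SAM3 brings higher efficiency and performance, but still needs adaptation to reach its full potential.

Result: Experiments show that SAM3-Adapter outperforms all prior SAM-based adaptations in accuracy, robustness, and efficiency, and sets new state-of-the-art results on multiple tasks.

Insight: SAM3-Adapter shows how an adapter framework can efficiently adapt a foundation model, providing a useful reference for future research and practical segmentation applications.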

Abstract: The rapid rise of large-scale foundation models has reshaped the landscape of image segmentation, with models such as Segment Anything achieving unprecedented versatility across diverse vision tasks. However, previous generations, including SAM and its successor, still struggle with fine-grained, low-level segmentation challenges such as camouflaged object detection, medical image segmentation, cell image segmentation, and shadow detection. To address these limitations, we originally proposed SAM-Adapter in 2023, demonstrating substantial gains on these difficult scenarios. With the emergence of Segment Anything 3 (SAM3), a more efficient and higher-performing evolution with a redesigned architecture and improved training pipeline, we revisit these long-standing challenges. In this work, we present SAM3-Adapter, the first adapter framework tailored for SAM3 that unlocks its full segmentation capability. SAM3-Adapter not only reduces computational overhead but also consistently surpasses both SAM and SAM2-based solutions, establishing new state-of-the-art results across multiple downstream tasks, including medical imaging, camouflaged (concealed) object segmentation, and shadow detection. Built upon the modular and composable design philosophy of the original SAM-Adapter, SAM3-Adapter provides stronger generalizability, richer task adaptability, and significantly improved segmentation precision. Extensive experiments confirm that integrating SAM3 with our adapter yields superior accuracy, robustness, and efficiency compared to all prior SAM-based adaptations. We hope SAM3-Adapter can serve as a foundation for future research and practical segmentation applications. Code, pre-trained models, and data processing pipelines are available.

[237] Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution cs.CVPDF

Dingkang Liang, Cheng Zhang, Xiaopeng Xu, Jianzhong Ju, Zhenbo Luo

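TL;DR: The paper introduces the ORS3D task, which combines language understanding, 3D grounding, and efficiency optimization; builds the ORS3D-60K dataset; and develops GRANT, a model for efficient task scheduling and grounded action generation.

Details

Motivation: Existing datasets simplify task planning by ignoring operations research (OR) knowledge and 3D spatial grounding, so agents cannot execute complex tasks efficiently. The ORS3D task is designed to close this gap.

Result: Experiments on ORS3D-60K validate the effectiveness of GRANT across language understanding, 3D grounding, and scheduling efficiency.

Insight: Combining operations research knowledge with 3D grounding can markedly improve an agent's task execution efficiency, especially when subtasks can run in parallel.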

Abstract: Task scheduling is critical for embodied AI, enabling agents to follow natural language instructions and execute actions efficiently in 3D physical worlds. However, existing datasets often simplify task planning by ignoring operations research (OR) knowledge and 3D spatial grounding. In this work, we propose Operations Research knowledge-based 3D Grounded Task Scheduling (ORS3D), a new task that requires the synergy of language understanding, 3D grounding, and efficiency optimization. Unlike prior settings, ORS3D demands that agents minimize total completion time by leveraging parallelizable subtasks, e.g., cleaning the sink while the microwave operates. To facilitate research on ORS3D, we construct ORS3D-60K, a large-scale dataset comprising 60K composite tasks across 4K real-world scenes. Furthermore, we propose GRANT, an embodied multi-modal large language model equipped with a simple yet effective scheduling token mechanism to generate efficient task schedules and grounded actions. Extensive experiments on ORS3D-60K validate the effectiveness of GRANT across language understanding, 3D grounding, and scheduling efficiency. The code is available at https://github.com/H-EmbodVis/GRANT

[238] Breaking the Likelihood-Quality Trade-off in Diffusion Models by Merging Pretrained Experts cs.CV | cs.LG | stat.MLPDF

Yasin Esfandiari, Stefan Bauer, Sebastian U. Stich, Andrea Dittadi

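TL;DR: The paper proposes a simple plug-and-play sampling method that combines two pretrained diffusion experts, one applied at high noise levels and one at low noise levels, to resolve the trade-off between likelihood and sample quality in diffusion models.

Details

Motivation: Diffusion models for image generation often trade perceptual sample quality against data likelihood: training objectives that emphasize high-noise denoising steps yield realistic images but poor likelihoods, whereas likelihood-oriented training overweights low-noise steps and harms visual fidelity.

Result: On CIFAR-10 and ImageNet32, the merged model matches or outperforms its base experts in both likelihood and sample quality.

Insight: Switching experts across noise levels is an effective way to break the usual likelihood-quality trade-off in diffusion models.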

Abstract: Diffusion models for image generation often exhibit a trade-off between perceptual sample quality and data likelihood: training objectives emphasizing high-noise denoising steps yield realistic images but poor likelihoods, whereas likelihood-oriented training overweights low-noise steps and harms visual fidelity. We introduce a simple plug-and-play sampling method that combines two pretrained diffusion experts by switching between them along the denoising trajectory. Specifically, we apply an image-quality expert at high noise levels to shape global structure, then switch to a likelihood expert at low noise levels to refine pixel statistics. The approach requires no retraining or fine-tuning – only the choice of an intermediate switching step. On CIFAR-10 and ImageNet32, the merged model consistently matches or outperforms its base components, improving or preserving both likelihood and sample quality relative to each expert alone. These results demonstrate that expert switching across noise levels is an effective way to break the likelihood-quality trade-off in image diffusion models.
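
The switching rule described above fits in a few lines. The sketch below is schematic and assumes each expert is exposed as a function that performs one denoising update `(x, t) -> x`; the toy experts, the integer timestep convention (larger `t` means more noise), and the name `sample_with_expert_switch` are hypothetical, and the paper's actual sampler is not specified in the abstract.

```python
def sample_with_expert_switch(x, timesteps, quality_expert, likelihood_expert, switch_t):
    """Run one denoising trajectory, choosing the expert by noise level.

    `timesteps` is iterated from high noise to low noise. Steps with
    t > switch_t use the image-quality expert; the rest use the likelihood expert.
    """
    for t in timesteps:
        expert = quality_expert if t > switch_t else likelihood_expert
        x = expert(x, t)
    return x

# Toy stand-ins that only log which expert handled which step.
log = []

def quality_expert(x, t):
    log.append(("quality", t))
    return 0.5 * x

def likelihood_expert(x, t):
    log.append(("likelihood", t))
    return 0.9 * x

result = sample_with_expert_switch(
    1.0, range(10, 0, -1), quality_expert, likelihood_expert, switch_t=4
)
```

The only free choice is `switch_t`, which matches the abstract's statement that the approach needs no retraining, only an intermediate switching step.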

[239] Are Image-to-Video Models Good Zero-Shot Image Editors? cs.CVPDF

Zechuan Zhang, Zhenyuan Chen, Zongxin Yang, Yi Yang

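TL;DR: The paper proposes IF-Edit, a tuning-free method that repurposes pretrained image-to-video diffusion models for instruction-driven image editing. With prompt enhancement, temporal latent compression, and late-stage frame sharpening, IF-Edit performs strongly on reasoning-centric tasks.

Details

Motivation: Large video diffusion models show strong world simulation and temporal reasoning abilities, but their use for zero-shot image editing remains underexplored. The paper investigates how to turn such models into effective image editors.

Result: IF-Edit performs strongly on non-rigid editing and on physical and temporal reasoning tasks, while remaining competitive on general-purpose edits.

Insight: Video diffusion models can serve as a unified tool for video-image generative reasoning; prompt enhancement and latent compression are key to zero-shot editing performance.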

Abstract: Large-scale video diffusion models show strong world simulation and temporal reasoning abilities, but their use as zero-shot image editors remains underexplored. We introduce IF-Edit, a tuning-free framework that repurposes pretrained image-to-video diffusion models for instruction-driven image editing. IF-Edit addresses three key challenges: prompt misalignment, redundant temporal latents, and blurry late-stage frames. It includes (1) a chain-of-thought prompt enhancement module that transforms static editing instructions into temporally grounded reasoning prompts; (2) a temporal latent dropout strategy that compresses frame latents after the expert-switch point, accelerating denoising while preserving semantic and temporal coherence; and (3) a self-consistent post-refinement step that sharpens late-stage frames using a short still-video trajectory. Experiments on four public benchmarks, covering non-rigid editing, physical and temporal reasoning, and general instruction edits, show that IF-Edit performs strongly on reasoning-centric tasks while remaining competitive on general-purpose edits. Our study provides a systematic view of video diffusion models as image editors and highlights a simple recipe for unified video-image generative reasoning.

[240] VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection cs.CV | cs.AI | cs.LG | cs.MMPDF

Qiang Wang, Xinyuan Gao, SongLin Dong, Jizhou Han, Jiangyang Li

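TL;DR: VDC-Agent is a self-evolving framework for video detailed captioning that needs neither human annotations nor larger teacher models, producing high-quality captions through self-reflection and refinement.

Details

Motivation: Conventional video captioning methods depend on large amounts of human annotation or on teacher models, which is costly and inefficient. VDC-Agent aims to reduce this dependence with an automated closed loop while improving efficiency and quality.

Result: VDC-Agent-7B reaches 49.08% average accuracy and a score of 2.50 on the VDC benchmark, outperforming other video captioning methods.

Insight: Self-reflection combined with an automated refinement loop can markedly improve video captioning performance while reducing reliance on human annotation.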

Abstract: We present VDC-Agent, a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. The agent forms a closed loop of caption generation, principle-guided scoring (score and textual suggestions), and prompt refinement. When caption quality regresses, a self-reflection path leverages the previous chain-of-thought to amend the update. Running this process on unlabeled videos produces trajectories of (caption, score) pairs. We convert the trajectories into preference tuples and filter out samples with JSON parsing errors, resulting in VDC-Agent-19K, which contains 18,886 automatically constructed pairs. We then fine-tune the base MLLM on this dataset using an easy-to-hard curriculum direct preference optimization. Built on Qwen2.5-VL-7B-Instruct, our VDC-Agent-7B attains state-of-the-art performance on the VDC benchmark with 49.08% average accuracy and 2.50 score, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score at similar inference cost.
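
The step that turns (caption, score) trajectories into preference tuples could look roughly like the sketch below. The abstract does not state the pairing or curriculum rules, so both are assumptions here: the highest-scoring caption is taken as "chosen" and the lowest as "rejected", and "easy" is interpreted as a large score gap. The function names and sample data are hypothetical.

```python
def to_preference_pair(trajectory):
    """Build one preference tuple from a list of (caption, score) pairs.

    Assumed rule: best caption is chosen, worst is rejected; trajectories
    with no score difference carry no preference signal and are skipped.
    """
    best = max(trajectory, key=lambda item: item[1])
    worst = min(trajectory, key=lambda item: item[1])
    if best[1] == worst[1]:
        return None
    return {"chosen": best[0], "rejected": worst[0], "gap": best[1] - worst[1]}

def easy_to_hard(pairs):
    """Assumed curriculum: larger score gaps (easier comparisons) come first."""
    return sorted(pairs, key=lambda pair: pair["gap"], reverse=True)

trajectories = [
    [("a dog runs", 40), ("a brown dog runs across a beach", 70)],
    [("two people talk", 55), ("two people talk", 55)],
    [("a car", 20), ("a red car turns left at a crossing", 90)],
]
pairs = [p for p in map(to_preference_pair, trajectories) if p is not None]
curriculum = easy_to_hard(pairs)
```

The resulting list is in the shape that direct preference optimization trainers generally expect (a chosen and a rejected response per example), ordered so that clear-cut comparisons are seen first.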

cs.CL [Back]

[241] $A^3$: Attention-Aware Accurate KV Cache Fusion for Fast Large Language Model Serving cs.CL | cs.AIPDF

Yuechi Zhou, Yi Su, Jianxin Zhang, Juntao Li, Qingrong Xia

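TL;DR: The paper proposes $A^3$, a KV Cache fusion algorithm that selectively fuses cached chunks based on attention relevance, reducing the decoding latency and memory overhead of large language models (LLMs) while maintaining task performance.

Details

Motivation: Although LLMs can handle long-context tasks, their decoding latency and memory overhead remain substantial, which limits real-world deployment. Existing KV Cache reuse methods partly mitigate these costs but suffer notable performance degradation.

Result: Experiments on multiple benchmarks and LLMs show that $A^3$ achieves the best task performance among four baselines while reducing time-to-first-token (TTFT) by 2×.

Insight: Selective KV Cache fusion guided by attention relevance can substantially improve serving efficiency without sacrificing task performance.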

Abstract: Large language models (LLMs) have demonstrated strong capabilities in processing long contexts, enabling them to tackle tasks involving long textual inputs such as multi-turn conversations, legal documents, or retrieved documents in Retrieval-Augmented Generation (RAG) systems. However, despite their ability to handle long sequences, the resulting decoding latency and memory overhead remain substantial, posing challenges for real-world deployment. Recent advances in KV Cache reuse have shown potential to mitigate these costs, but still suffer from notable performance degradation. To address this issue, we conduct an in-depth investigation of recomputation-based reuse methods and observe that the recomputed tokens often fail to align with the context segments most relevant to the question. This misalignment hinders proper updates to the critical contextual representations. Therefore, we propose the $\textbf{A}$ttention-$\textbf{A}$ware $\textbf{A}$ccurate KV Cache Fusion algorithm ($A^3$), which precomputes and selectively fuses the KV Cache of text chunks based on their relevance to the question, achieving accurate integration with minimal computational overhead. Extensive experiments on various benchmarks and LLMs demonstrate that $A^3$ achieves the best task performance compared to four baselines while reducing the time-to-first-token (TTFT) by 2$\times$.

[242] ChineseErrorCorrector3-4B: State-of-the-Art Chinese Spelling and Grammar Corrector cs.CL | cs.AIPDF

Wei Tian, Yuhao Zhou

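TL;DR: The paper introduces ChineseErrorCorrector3-4B, a unified Chinese spelling and grammatical error correction model based on Qwen3-4B. It performs strongly on general text correction and achieves state-of-the-art results on authoritative benchmarks such as SIGHAN-2015.

Details

Motivation: Existing Chinese spelling and grammar correction models are usually designed separately and do not fully exploit a unified pretrained model. This work builds a unified, efficient corrector on top of Qwen3-4B.

Result: On SIGHAN-2015 and other benchmark datasets, the model's F1 and F0.5 scores significantly surpass those of existing publicly available models, and it ranks first in both spelling and grammatical error correction.

Insight: Combining a unified pretrained model with a multi-task learning strategy can effectively improve Chinese text correction and suggests a direction for follow-up research.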

Abstract: This paper introduces ChineseErrorCorrector3-4B, a unified model for Chinese spelling and grammatical error correction based on Qwen3-4B. The model demonstrates outstanding performance in general text correction tasks and achieves state-of-the-art results in both spelling correction (CSC) and grammatical correction (CGC). On several authoritative benchmark datasets – including SIGHAN-2015, EC-LAW, MCSC, and NaCGEC – the model’s F1 and F0.5 scores significantly surpass existing publicly available models, ranking first in both spelling and grammatical error correction tasks.

[243] A superpersuasive autonomous policy debating system cs.CL | cs.AI | cs.CY | cs.HC | cs.MAPDF

Allen Roush, Devin Gonier, John Hines, Judah Goldfeder, Philippe Martin Wyder

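TL;DR: DeepDebater is a novel autonomous system that can participate in and win a full policy debate. It uses hierarchical multi-agent workflows in which LLM agents collaborate and self-correct to produce high-quality debate content.

Details

Motivation: Highly complex, evidence-based, and strategically adaptive persuasion remains a major challenge for AI. Earlier work such as IBM Project Debater targeted simplified debate formats, whereas DeepDebater aims at the full, unmodified debate setting.

Result: In simulated rounds, DeepDebater produces argumentative components judged superior to human-authored cases and wins the rounds; expert human debate coaches also prefer its arguments.

Insight: Multi-agent collaboration with iterative self-correction works well on complex debating tasks, and the hybrid human-AI mode adds flexibility for real-world use.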

Abstract: The capacity for highly complex, evidence-based, and strategically adaptive persuasion remains a formidable challenge for artificial intelligence. Previous work, like IBM Project Debater, focused on generating persuasive speeches in simplified and shortened debate formats intended for relatively lay audiences. We introduce DeepDebater, a novel autonomous system capable of participating in and winning a full, unmodified, two-team competitive policy debate. Our system employs a hierarchical architecture of specialized multi-agent workflows, where teams of LLM-powered agents collaborate and critique one another to perform discrete argumentative tasks. Each workflow utilizes iterative retrieval, synthesis, and self-correction using a massive corpus of policy debate evidence (OpenDebateEvidence) and produces complete speech transcripts, cross-examinations, and rebuttals. We introduce a live, interactive end-to-end presentation pipeline that renders debates with AI speech and animation: transcripts are surface-realized and synthesized to audio with OpenAI TTS, and then displayed as talking-head portrait videos with EchoMimic V1. Beyond fully autonomous matches (AI vs AI), DeepDebater supports hybrid human-AI operation: human debaters can intervene at any stage, and humans can optionally serve as opponents against AI in any speech, allowing AI-human and AI-AI rounds. In preliminary evaluations against human-authored cases, DeepDebater produces qualitatively superior argumentative components and consistently wins simulated rounds as adjudicated by an independent autonomous judge. Expert human debate coaches also prefer the arguments, evidence, and cases constructed by DeepDebater. We open source all code, generated speech transcripts, audio, and talking-head video here: https://github.com/Hellisotherpeople/DeepDebater/tree/main

[244] Principled Context Engineering for RAG: Statistical Guarantees via Conformal Prediction cs.CL | cs.AI | cs.IRPDF

Debashish Chakraborty, Eugene Yang, Daniel Khashabi, Dawn Lawrie, Kevin Duh

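TL;DR: The paper presents a context engineering approach for Retrieval-Augmented Generation (RAG) based on conformal prediction, which filters out irrelevant content with statistical guarantees while retaining key evidence.

Details

Motivation: LLM accuracy in RAG systems degrades when contexts are long or noisy, and existing filters offer no statistical control over what is retained. The paper aims to provide reliable, coverage-controlled filtering.

Result: Conformal filtering meets its target coverage and reduces retained context by 2-3x; downstream factual accuracy (ARGUE F1) improves under strict filtering and remains stable at moderate coverage.

Insight: Conformal prediction gives RAG systems a reliable, coverage-controlled way to reduce context, and most of the discarded material is redundant or irrelevant.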

Abstract: Retrieval-Augmented Generation (RAG) enhances factual grounding in large language models (LLMs) by incorporating retrieved evidence, but LLM accuracy declines when long or noisy contexts exceed the model’s effective attention span. Existing pre-generation filters rely on heuristics or uncalibrated LLM confidence scores, offering no statistical control over retained evidence. We evaluate and demonstrate context engineering through conformal prediction, a coverage-controlled filtering framework that removes irrelevant content while preserving recall of supporting evidence. Using both embedding- and LLM-based scoring functions, we test this approach on the NeuCLIR and RAGTIME collections. Conformal filtering consistently meets its target coverage, ensuring that a specified fraction of relevant snippets are retained, and reduces retained context by 2-3x relative to unfiltered retrieval. On NeuCLIR, downstream factual accuracy measured by ARGUE F1 improves under strict filtering and remains stable at moderate coverage, indicating that most discarded material is redundant or irrelevant. These results demonstrate that conformal prediction enables reliable, coverage-controlled context reduction in RAG, offering a model-agnostic and principled approach to context engineering.
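
A coverage-controlled filter of this kind can be sketched with standard split conformal calibration. This is a generic illustration, not the paper's exact procedure: it assumes a held-out set of relevance scores for snippets known to be relevant, and that new relevant snippets are exchangeable with them. The function names and the sample scores are made up.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Score threshold that retains a new relevant snippet with probability >= 1 - alpha.

    Uses the k-th smallest calibration score with k = floor(alpha * (n + 1)).
    If k is 0 there is too little calibration data to discard anything safely.
    """
    n = len(calibration_scores)
    k = math.floor(alpha * (n + 1))
    if k < 1:
        return float("-inf")
    return sorted(calibration_scores)[k - 1]

def filter_snippets(snippets, threshold):
    """Keep the text of every (text, score) snippet scoring at or above the threshold."""
    return [text for text, score in snippets if score >= threshold]

# Scores of snippets known to be relevant (calibration set, made-up values).
calibration = [0.9, 0.2, 0.7, 0.4, 0.8, 0.6, 0.5]
threshold = conformal_threshold(calibration, alpha=0.25)

retrieved = [("snippet A", 0.95), ("snippet B", 0.30), ("snippet C", 0.40)]
kept = filter_snippets(retrieved, threshold)
```

Lowering `alpha` raises the coverage target and therefore keeps more context; the guarantee is marginal over the calibration draw and does not depend on which scoring function (embedding-based or LLM-based) produced the scores.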

Yuliang Zhan, Xinyu Tang, Han Wan, Jian Li, Ji-Rong Wen

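TL;DR: The paper proposes L2V-CoT, a training-free method that transfers the Chain-of-Thought (CoT) reasoning ability of LLMs to VLMs through latent intervention in the frequency domain.

Details

Motivation: VLMs struggle with multi-step reasoning, and existing transfer methods either incur high training costs or require architectural alignment. The paper studies how to transfer CoT reasoning efficiently while keeping the models independent.

Result: In extensive experiments, L2V-CoT outperforms training-free baselines and even surpasses supervised methods.

Insight: The similarity between the low-frequency CoT representations of LLMs and VLMs is the key to cross-modal reasoning transfer, and frequency-domain intervention is a new route to efficient transfer.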

Abstract: Recently, Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs), but Vision-Language Models (VLMs) still struggle with multi-step reasoning tasks due to limited multimodal reasoning data. To bridge this gap, researchers have explored methods to transfer CoT reasoning from LLMs to VLMs. However, existing approaches either need high training costs or require architectural alignment. In this paper, we use Linear Artificial Tomography (LAT) to empirically show that LLMs and VLMs share similar low-frequency latent representations of CoT reasoning despite architectural differences. Based on this insight, we propose L2V-CoT, a novel training-free latent intervention approach that transfers CoT reasoning from LLMs to VLMs. L2V-CoT extracts and resamples low-frequency CoT representations from LLMs in the frequency domain, enabling dimension matching and latent injection into VLMs during inference to enhance reasoning capabilities. Extensive experiments demonstrate that our approach consistently outperforms training-free baselines and even surpasses supervised methods.

[246] Towards Efficient LLM-aware Heterogeneous Graph Learning cs.CL | cs.AIPDF

Wenda Li, Tongya Zheng, Shunyu Liu, Yu Wang, Kaixuan Chen

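TL;DR: The paper proposes ELLA, an Efficient LLM-Aware framework for heterogeneous graph learning. It uses an LLM to encode multi-hop, multi-type relations and capture complex semantics, reduces computational complexity, and uses task-aware Chain-of-Thought prompts to bridge the semantic gap between pre-training and fine-tuning.

Details

Motivation: The diversity of node and relation types in heterogeneous graphs leads to complex semantics, and existing methods are limited by predefined semantic dependencies and scarce supervision signals. LLMs can address these semantic issues through their reasoning ability, but their computational cost makes them hard to apply directly to heterogeneous graphs.

Result: On four heterogeneous graphs, ELLA outperforms state-of-the-art methods, scales to 13B-parameter LLMs, and achieves up to a 4x speedup over existing LLM-based methods.

Insight: The reasoning ability of LLMs can be brought to heterogeneous graph learning through an efficient framework, with reduced computational complexity as the key enabler. Task-aware CoT prompts effectively improve consistency between pre-training and fine-tuning.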

Abstract: Heterogeneous graphs are widely present in real-world complex networks, where the diversity of node and relation types leads to complex and rich semantics. Efforts to model complex relation semantics in heterogeneous graphs are restricted by the limitations of predefined semantic dependencies and the scarcity of supervised signals. The advanced pre-training and fine-tuning paradigm leverages graph structure to provide rich self-supervised signals, but introduces semantic gaps between tasks. Large Language Models (LLMs) offer significant potential to address the semantic issues of relations and tasks in heterogeneous graphs through their strong reasoning capabilities in the textual modality, but their incorporation into heterogeneous graphs is largely limited by computational complexity. Therefore, in this paper, we propose an Efficient LLM-Aware (ELLA) framework for heterogeneous graphs, addressing the above issues. To capture complex relation semantics, we propose an LLM-aware Relation Tokenizer that leverages an LLM to encode multi-hop, multi-type relations. To reduce computational complexity, we further employ a Hop-level Relation Graph Transformer, which helps reduce the complexity of LLM-aware relation reasoning from exponential to linear. To bridge semantic gaps between pre-training and fine-tuning tasks, we introduce fine-grained task-aware textual Chain-of-Thought (CoT) prompts. Extensive experiments on four heterogeneous graphs show that our proposed ELLA outperforms state-of-the-art methods in both performance and efficiency. In particular, ELLA scales up to 13B-parameter LLMs and achieves up to a 4x speedup compared with existing LLM-based methods. Our code is publicly available at https://github.com/l-wd/ELLA.

[247] SPINE: Token-Selective Test-Time Reinforcement Learning with Entropy-Band Regularization cs.CL | cs.LGPDF

Jianghao Wu, Yasmeen George, Jin Ye, Yicheng Wu, Daniel F. Schmidt

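TL;DR: SPINE is a token-selective test-time reinforcement learning method with entropy-band regularization. By updating only at high-entropy forking points, it avoids the collapse seen in earlier methods and consistently improves Pass@1.

Details

Motivation: Large language models (LLMs) and multimodal LLMs (MLLMs) excel at chain-of-thought reasoning, but distribution shift at test time and the lack of verifiable supervision remain challenges. Existing test-time reinforcement learning (TTRL) methods tend to collapse: the majority-vote reward dominates, responses shorten, and Pass@1 declines.

Result: Across ten benchmarks (including multimodal VQA and mathematical reasoning), SPINE consistently improves Pass@1 over TTRL, avoids response-length collapse, and yields more stable training dynamics.

Insight: Aligning updates with chain-of-thought branch points is an effective mechanism for stable test-time adaptation and improves performance without additional supervision.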

Abstract: Large language models (LLMs) and multimodal LLMs (MLLMs) excel at chain-of-thought reasoning but face distribution shift at test-time and a lack of verifiable supervision. Recent test-time reinforcement learning (TTRL) methods derive label-free pseudo-rewards from self-consistency voting over sampled trajectories, yet they often collapse: the majority-vote reward prevails, responses shorten, and Pass@1 declines. We trace this to uniform sequence updates in which most tokens are low-entropy followers, while a small high-entropy subset determines the reasoning branches. Thus we propose SPINE, a token-selective test-time reinforcement learning framework that (i) updates only forking tokens, the high-entropy branch points identified from forward-pass statistics, and (ii) applies an entropy-band regularizer at those tokens to sustain exploration when entropy is too low and to suppress noisy supervision when it is too high. SPINE plugs into GRPO-style objectives, optionally with a KL anchor, and requires no labels or reward models. Across ten benchmarks spanning multimodal VQA, general and expert QA, mathematical reasoning, and medical QA, SPINE consistently improves Pass@1 over TTRL while avoiding response-length collapse and yielding more stable training dynamics on both LLM and MLLM backbones. These results indicate that aligning updates with chain-of-thought branch points is a simple and label-free mechanism for stable and effective test-time adaptation in reasoning models. Code is available at https://github.com/JianghaoWu/SPINE.
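
The token-selection idea in the abstract above can be illustrated with a rough sketch: compute the entropy of each next-token distribution, keep only the highest-entropy positions as "forking tokens", and check whether each one falls inside an entropy band. The function names, the top-fraction selection rule, and the band limits below are assumptions made for illustration; the abstract does not specify SPINE's actual selection statistics or regularizer.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_forking_tokens(dists, top_fraction=0.2):
    """Return indices of the highest-entropy positions (hypothetical rule)."""
    entropies = [token_entropy(d) for d in dists]
    k = max(1, round(top_fraction * len(dists)))
    ranked = sorted(range(len(dists)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k]), entropies

def band_penalty(entropy, low=0.3, high=1.0):
    """Zero inside [low, high]; grows linearly outside (illustrative band)."""
    if entropy < low:
        return low - entropy
    if entropy > high:
        return entropy - high
    return 0.0

# Toy next-token distributions for a five-token response.
dists = [
    [0.97, 0.01, 0.01, 0.01],      # confident "follower" token
    [0.25, 0.25, 0.25, 0.25],      # maximally uncertain branch point
    [0.90, 0.05, 0.03, 0.02],
    [0.50, 0.45, 0.03, 0.02],      # two plausible continuations
    [0.99, 0.005, 0.003, 0.002],
]
forking, entropies = select_forking_tokens(dists, top_fraction=0.4)
penalties = {i: band_penalty(entropies[i]) for i in forking}
```

In this toy case only positions 1 and 3 would receive updates; position 1 lies above the assumed band and is penalized, while position 3 lies inside it.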

[248] Measuring the Impact of Lexical Training Data Coverage on Hallucination Detection in Large Language Models cs.CL | cs.AIPDF

Shuo Zhang, Fabrizio Gotti, Fengran Mo, Jian-Yun Nie

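TL;DR: The paper examines how pretraining-data coverage affects hallucination detection in large language models (LLMs). Using suffix arrays to measure the lexical coverage of questions and generated answers, it finds that coverage features improve hallucination detection when combined with other signals such as log-probabilities.

Details

Motivation: Prior work detects hallucination mainly through model-internal signals such as token-level entropy or generation consistency, while the link between pretraining-data coverage and hallucination is underexplored. The paper asks whether lexical coverage provides an additional signal for hallucination detection.

Result: Lexical coverage features are weak predictors on their own but yield modest gains when combined with log-probabilities, particularly on datasets with higher intrinsic model uncertainty.

Insight: Lexical coverage features can serve as a complementary signal for hallucination detection. Combining multiple signals gives more complete detection, especially for long-tail knowledge or high-stakes tasks.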

Abstract: Hallucination in large language models (LLMs) is a fundamental challenge, particularly in open-domain question answering. Prior work attempts to detect hallucination with model-internal signals such as token-level entropy or generation consistency, while the connection between pretraining data exposure and hallucination is underexplored. Existing studies show that LLMs underperform on long-tail knowledge, i.e., the accuracy of the generated answer drops for the ground-truth entities that are rare in pretraining. However, examining whether data coverage itself can serve as a detection signal is overlooked. We propose a complementary question: Does lexical training-data coverage of the question and/or generated answer provide additional signal for hallucination detection? To investigate this, we construct scalable suffix arrays over RedPajama’s 1.3-trillion-token pretraining corpus to retrieve $n$-gram statistics for both prompts and model generations. We evaluate their effectiveness for hallucination detection across three QA benchmarks. Our observations show that while occurrence-based features are weak predictors when used alone, they yield modest gains when combined with log-probabilities, particularly on datasets with higher intrinsic model uncertainty. These findings suggest that lexical coverage features provide a complementary signal for hallucination detection. All code and suffix-array infrastructure are provided at https://github.com/WWWonderer/ostd.
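
As a minimal sketch of what occurrence-based features could look like, the snippet below counts $n$-grams in a tiny in-memory corpus and derives coverage statistics for a piece of text. The dictionary index is only a stand-in for the suffix arrays described in the abstract, and the feature names (`coverage`, `min_count`, `mean_count`) are hypothetical rather than taken from the paper.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_index(corpus_docs, n):
    """Count n-grams in a toy corpus (stand-in for a suffix-array lookup)."""
    counts = {}
    for doc in corpus_docs:
        for gram in ngrams(doc.lower().split(), n):
            counts[gram] = counts.get(gram, 0) + 1
    return counts

def coverage_features(text, index, n):
    """Occurrence statistics of the text's n-grams in the indexed corpus."""
    grams = ngrams(text.lower().split(), n)
    if not grams:
        return {"coverage": 0.0, "min_count": 0, "mean_count": 0.0}
    hits = [index.get(gram, 0) for gram in grams]
    return {
        "coverage": sum(1 for h in hits if h > 0) / len(grams),
        "min_count": min(hits),
        "mean_count": sum(hits) / len(hits),
    }

corpus = ["the capital of france is paris", "paris is the capital of france"]
index = build_index(corpus, n=2)
seen = coverage_features("the capital of france", index, n=2)
unseen = coverage_features("the capital of atlantis", index, n=2)
```

Features like these would then be fed to a detector alongside log-probabilities; on their own they only say whether the wording was seen, not whether the answer is correct.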

[249] MTikGuard System: A Transformer-Based Multimodal System for Child-Safe Content Moderation on TikTok cs.CLPDF

Dat Thanh Nguyen, Nguyen Hung Lam, Anh Hoang-Thi Nguyen, Trong-Hop Do

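TL;DR: The paper presents MTikGuard, a Transformer-based multimodal system for child-safe content moderation on TikTok. It combines an extended dataset, a multimodal classification framework, and a real-time deployment architecture to detect harmful content.

Details

Motivation: TikTok's rapid growth has brought a large volume of harmful content that can negatively affect children and teenagers. Traditional moderation struggles with the scale and real-time nature of uploads, so an efficient multimodal detection system is needed.

Result: The system reaches 89.37% accuracy and an 89.45% F1-score on the multimodal classification task.

Insight: Combining dataset expansion, multimodal fusion, and a real-time deployment architecture can improve both the effectiveness and the efficiency of social media content moderation.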

Abstract: With the rapid rise of short-form videos, TikTok has become one of the most influential platforms among children and teenagers, but also a source of harmful content that can affect their perception and behavior. Such content, often subtle or deceptive, challenges traditional moderation methods due to the massive volume and real-time nature of uploads. This paper presents MTikGuard, a real-time multimodal harmful content detection system for TikTok, with three key contributions: (1) an extended TikHarm dataset expanded to 4,723 labeled videos by adding diverse real-world samples, (2) a multimodal classification framework integrating visual, audio, and textual features to achieve state-of-the-art performance with 89.37% accuracy and 89.45% F1-score, and (3) a scalable streaming architecture built on Apache Kafka and Apache Spark for real-time deployment. The results demonstrate the effectiveness of combining dataset expansion, advanced multimodal fusion, and robust deployment for practical large-scale social media content moderation. The dataset is available at https://github.com/ntdat-8324/MTikGuard-System.git.

Gowtham, Sai Rupesh, Sanjay Kumar, Saravanan, Venkata Chaithanya

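TL;DR: Blu-WERP is a data preprocessing pipeline designed to improve the quality of Common Crawl WARC files for large language model (LLM) training.

Details

Motivation: High-quality training data is essential to LLM performance, yet existing preprocessing pipelines struggle to remove noise and unstructured content from web-scale data.

Result: At the 1B-parameter scale, Blu-WERP improves aggregate performance by 4.0% over DCLM and 9.5% over Fineweb, with gains in all three task categories: World Knowledge & Reasoning, Language Understanding, and Commonsense Reasoning.

Insight: Preprocessing pipeline design has a clear effect on LLM capability. Blu-WERP offers a practical tool for data-centric AI research aimed at better training efficiency and model performance.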

Abstract: High-quality training data is fundamental to large language model (LLM) performance, yet existing preprocessing pipelines often struggle to effectively remove noise and unstructured content from web-scale corpora. This paper presents Blu-WERP, a novel data preprocessing pipeline designed to optimize the quality of Common Crawl WARC files for LLM training. We demonstrate that Blu-WERP significantly outperforms established baselines including DCLM across multiple model scales and evaluation benchmarks. Our pipeline processes CC WARC dumps, implementing advanced filtering and quality assessment mechanisms. We conducted comprehensive evaluations using models with 150M, 400M, 530M, 750M, and 1B parameters, testing against nine standard benchmarks categorized as World Knowledge & Reasoning, Language Understanding, and Commonsense Reasoning. Results show Blu-WERP consistently achieved superior performance across all model scales. At the 1B-parameter scale, Blu-WERP demonstrates relative aggregate improvements of 4.0% and 9.5% over DCLM and Fineweb, respectively, while achieving a quality-per-token efficiency gain. Categorical analysis reveals a 2.4% improvement in World Knowledge & Reasoning, a 6.2% improvement in Language Understanding, and a 4.2% improvement in Commonsense Reasoning. These results establish Blu-WERP as a state-of-the-art preprocessing pipeline that substantially improves LLM training data quality and downstream model performance with reduced computational cost. Our findings contribute to the growing body of research on data-centric AI, demonstrating that preprocessing pipeline design significantly impacts LLM capabilities. The Blu-WERP pipeline represents a practical advancement in data quality optimization, offering researchers and practitioners an effective solution for improving LLM training efficiency and model performance.

[251] GeeSanBhava: Sentiment Tagged Sinhala Music Video Comment Data Set cs.CLPDF

Yomal De Mel, Nisansa de Silva

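TL;DR: The paper introduces GeeSanBhava, a manually annotated dataset of Sinhala song comments tagged with Russell's Valence-Arousal model, with high inter-annotator agreement (Fleiss kappa = 84.96%). It also examines the differences between comment-based and song-based emotions and reports a ROC-AUC of 0.887 using pre-trained models.

Details

Motivation: The work aims to fill the resource gap for Sinhala in music emotion recognition and to examine the challenges of sentiment analysis on user-generated content.

Result: An optimized MLP achieves a ROC-AUC of 0.887 on the emotion classification task.

Insight: Sentiment analysis of user-generated content must account for differences from the emotion of the original media, and high-quality annotated datasets are critical for NLP tasks.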

Abstract: This study introduces GeeSanBhava, a high-quality data set of Sinhala song comments extracted from YouTube and manually tagged using Russell's Valence-Arousal model by three independent human annotators. The human annotators achieve a substantial inter-annotator agreement (Fleiss kappa = 84.96%). The analysis revealed distinct emotional profiles for different songs, highlighting the importance of comment-based emotion mapping. The study also addressed the challenges of comparing comment-based and song-based emotions, mitigating biases inherent in user-generated content. A number of machine learning and deep learning models were pre-trained on a related large data set of Sinhala News comments in order to report the zero-shot result of our Sinhala YouTube comment data set. An optimized Multi-Layer Perceptron model, after extensive hyperparameter tuning, achieved a ROC-AUC score of 0.887. The model is a three-layer MLP with a configuration of 256, 128, and 64 neurons. This research contributes a valuable annotated dataset and provides insights for future work in Sinhala Natural Language Processing and music emotion recognition.
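
For readers unfamiliar with the agreement statistic quoted above, here is a small sketch of the standard Fleiss' kappa computation for a fixed number of annotators per item. The toy ratings table (five comments, three annotators, four categories) is invented for illustration and is not drawn from the GeeSanBhava data.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings[i][j] = annotators who put item i in category j."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])          # assumes the same count for every item
    n_cats = len(ratings[0])
    # Observed agreement: mean over items of pairwise rater agreement.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_cats = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cats)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical table: 5 comments, 3 annotators, 4 emotion categories.
toy_ratings = [
    [3, 0, 0, 0],
    [0, 3, 0, 0],
    [0, 0, 3, 0],
    [0, 0, 0, 3],
    [2, 1, 0, 0],   # one annotator disagrees on this comment
]
kappa = fleiss_kappa(toy_ratings)
```

A single disagreement among fifteen ratings already pulls kappa noticeably below 1 here, which gives some intuition for how demanding the statistic is.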

[252] Vector Arithmetic in Concept and Token Subspaces cs.CLPDF

Sheridan Feucht, Byron Wallace, David Bau

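TL;DR: Using concept and token induction heads, the paper shows that semantic and surface-level information in LLM hidden states can be extracted through specific transformations, which makes vector arithmetic more accurate.

Details

Motivation: Large language models (LLMs) must represent both semantic and surface-level information, but how to extract and use each explicitly remains an open question.

Result: In the concept subspace, nearest-neighbor accuracy on vector arithmetic rises from 47% to 80%. In the token subspace, surface-level word operations can be performed as well.

Insight: Attention heads can reveal structured subspaces of model activations that correspond to semantic and surface-level information, offering a new angle on model interpretability.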

Abstract: In order to predict the next token, LLMs must represent semantic and surface-level information about the current word. Previous work identified two types of attention heads that disentangle this information: (i) Concept induction heads, which copy word meanings, and (ii) Token induction heads, which copy literal token representations (Feucht et al., 2025). We show that these heads can be used to identify subspaces of model activations that exhibit coherent semantic structure in Llama-2-7b. Specifically, when we transform hidden states using the attention weights of concept heads, we are able to more accurately perform parallelogram arithmetic (Mikolov et al., 2013) on the resulting hidden states, e.g., showing that “Athens” - “Greece” + “China” = “Beijing”. This transformation allows for much higher nearest-neighbor accuracy (80%) than direct use of raw hidden states (47%). Analogously, we show that token heads allow for transformations that reveal surface-level word information in hidden states, allowing for operations like “coding” - “code” + “dance” = “dancing”.
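
The parallelogram arithmetic mentioned in the abstract can be sketched with hand-made vectors: form vec(a) - vec(b) + vec(c) and return the nearest remaining word by cosine similarity. The three-dimensional vectors below are invented for illustration; the paper applies the operation to transformed Llama-2-7b hidden states, which this sketch does not attempt to reproduce.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, a, b, c):
    """Nearest word to vec(a) - vec(b) + vec(c), excluding the three inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy embedding: axis 0 ~ "Greek", axis 1 ~ "Chinese", axis 2 ~ "is a capital".
vectors = {
    "Athens": [1.0, 0.0, 1.0],
    "Greece": [1.0, 0.0, 0.0],
    "Beijing": [0.0, 1.0, 1.0],
    "China": [0.0, 1.0, 0.0],
    "Tokyo": [0.0, -1.0, 1.0],
    "Japan": [0.0, -1.0, 0.0],
}
answer = analogy(vectors, "Athens", "Greece", "China")
```

Nearest-neighbor accuracy, as reported in the abstract, would be the fraction of such analogy queries whose top candidate matches the expected word.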

[253] Rethinking Retrieval: From Traditional Retrieval Augmented Generation to Agentic and Non-Vector Reasoning Systems in the Financial Domain for Large Language Models cs.CLPDF

Elias Lumer, Matt Melich, Olivia Zino, Elena Kim, Sara Dieter

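TL;DR: The paper systematically compares vector-based and non-vector Retrieval-Augmented Generation (RAG) architectures in the financial domain. Vector-based agentic RAG beats the non-vector approach on answer quality, and cross-encoder reranking and small-to-big chunk retrieval give further gains.

Details

Motivation: Existing work lacks a systematic comparison of vector-based and non-vector RAG architectures on financial documents, and the effect of advanced RAG techniques on retrieval accuracy, answer quality, latency, and cost is unclear.

Result: Vector-based agentic RAG achieves a 68% win rate over hierarchical node-based systems with comparable latency (5.2 vs. 5.98 seconds). Cross-encoder reranking yields a 59% absolute improvement in MRR@5, and small-to-big retrieval achieves a 65% win rate over baseline chunking with only 0.2 seconds of additional latency.

Insight: Advanced RAG techniques can improve financial question-answering systems, but cost-performance tradeoffs must be weighed in production.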

Abstract: Recent advancements in Retrieval-Augmented Generation (RAG) have enabled Large Language Models to answer financial questions using external knowledge bases of U.S. SEC filings, earnings reports, and regulatory documents. However, existing work lacks systematic comparison of vector-based and non-vector RAG architectures for financial documents, and the empirical impact of advanced RAG techniques on retrieval accuracy, answer quality, latency, and cost remains unclear. We present the first systematic evaluation comparing vector-based agentic RAG using hybrid search and metadata filtering against hierarchical node-based systems that traverse document structure without embeddings. We evaluate two enhancement techniques applied to the vector-based architecture, i) cross-encoder reranking for retrieval precision, and ii) small-to-big chunk retrieval for context completeness. Across 1,200 SEC 10-K, 10-Q, and 8-K filings on a 150-question benchmark, we measure retrieval metrics (MRR, Recall@5), answer quality through LLM-as-a-judge pairwise comparisons, latency, and preprocessing costs. Vector-based agentic RAG achieves a 68% win rate over hierarchical node-based systems with comparable latency (5.2 compared to 5.98 seconds). Cross-encoder reranking achieves a 59% absolute improvement at optimal parameters (10, 5) for MRR@5. Small-to-big retrieval achieves a 65% win rate over baseline chunking with only 0.2 seconds additional latency. Our findings reveal that applying advanced RAG techniques to financial Q&A systems improves retrieval accuracy and answer quality, with cost-performance tradeoffs to be considered in production.
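
The retrieval metrics named in the abstract (MRR and Recall@k) follow standard definitions, sketched below. The document IDs and relevance sets are made up for illustration, and the paper's exact evaluation protocol (for example, how relevance is judged per question) is not stated in the abstract.

```python
def mrr_at_k(ranked_lists, relevant_sets, k):
    """Mean reciprocal rank of the first relevant hit within the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, relevant_sets, k):
    """Mean fraction of relevant documents that appear in the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        found = len(set(ranked[:k]) & relevant)
        total += found / len(relevant)
    return total / len(ranked_lists)

# Three toy queries with hypothetical chunk IDs.
ranked = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
relevant = [{"d1"}, {"d2", "d4"}, {"d0"}]

mrr3 = mrr_at_k(ranked, relevant, k=3)
recall3 = recall_at_k(ranked, relevant, k=3)
recall1 = recall_at_k(ranked, relevant, k=1)
```

Because MRR only credits the first relevant hit, a reranker that moves one correct chunk from rank 2 to rank 1 raises that query's contribution from 0.5 to 1.0 without changing Recall@3.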

[254] Table Comprehension in Building Codes using Vision Language Models and Domain-Specific Fine-Tuning cs.CLPDF

Mohammad Aqib, Mohd Hamza, Ying Hei Chui, Qipei Mei

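TL;DR: The paper studies how to extract tabular information from building codes using Vision Language Models (VLMs) and domain-specific fine-tuning. It compares a direct-input and an indirect-input method and improves model performance with Low Rank Adaptation (LoRA) fine-tuning.

Details

Motivation: Extracting tabular data from building codes is important for the efficiency and accuracy of automated question answering systems, but traditional natural language processing techniques and VLMs struggle with complex table layouts and semantic relationships.

Result: The direct-input method generally yields higher accuracy than the indirect-input method. Fine-tuned models improve substantially, with Qwen2.5-VL-3B-Instruct achieving relative accuracy gains exceeding 100%.

Insight: Parameter-efficient fine-tuning methods such as LoRA can substantially improve VLMs on complex structured-data understanding tasks such as building code interpretation.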

Abstract: Building codes contain critical information for ensuring safety, regulatory compliance, and informed decision-making in construction and engineering. Automated question answering systems over such codes enable quick and accurate access to specific regulatory clauses, improving efficiency and reducing errors. Retrieval-Augmented Generation (RAG) systems are essential for this task as they combine the precision of information retrieval with the generative capabilities of language models. However, tabular data are challenging to extract as they often involve complex layouts, merged cells, multi-row headers, and embedded semantic relationships that are not easily captured by traditional natural language processing techniques and Vision Language Models (VLMs). This paper explores and compares two methods for extracting information from tabular data in building codes using several pre-trained VLMs. First, a direct input method is used, where the image of the page is input directly into the VLMs, which are then tasked with answering questions based on the image. Second, an indirect input method is introduced, which involves converting an image of a page containing tables into LaTeX code and then answering inquiries based on the LaTeX-based input. The experiments find that the direct input method generally resulted in higher accuracy than the indirect input method. To further improve the performance, we fine-tuned each VLM using Low Rank Adaptation (LoRA) on a domain-specific tabular dataset. The fine-tuned models exhibited substantial improvements, with Qwen2.5-VL-3B-Instruct achieving relative accuracy gains exceeding 100%. Our results highlight the potential of parameter-efficient fine-tuning methods to adapt powerful VLMs for understanding complex structured data in specialized fields, such as building code interpretation and regulatory compliance.

[255] Path-Constrained Retrieval: A Structural Approach to Reliable LLM Agent Reasoning Through Graph-Scoped Semantic Search cs.CL | cs.DB | cs.IR | cs.LGPDF

Joseph Oladokun

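TL;DR: The paper proposes Path-Constrained Retrieval (PCR), a retrieval method that combines knowledge-graph structure with semantic search so that retrieved information stays logically consistent with an LLM agent's reasoning state, improving the reliability of reasoning.

Details

Motivation: LLM agents often retrieve context from knowledge bases that is structurally inconsistent with their current reasoning state, which leads to incoherent reasoning chains. A method is needed that preserves the structure of logical relationships during retrieval.

Result: On the PathRAG-6 benchmark, PCR achieves full structural consistency (versus 24-32% for baselines), and in the technology domain it obtains full relevance at rank 10 with full structural consistency. The average graph distance of retrieved context is 78% lower than for baselines.

Insight: Structure-aware retrieval can improve the reliability and coherence of LLM agent reasoning and points to a direction for future knowledge-driven language model systems.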

Abstract: Large Language Model agents often retrieve context from knowledge bases that lack structural consistency with the agent’s current reasoning state, leading to incoherent reasoning chains. We introduce Path-Constrained Retrieval (PCR), a retrieval method that combines structural graph constraints with semantic search to ensure retrieved information maintains logical relationships within a knowledge graph. PCR restricts the search space to nodes reachable from an anchor node, preventing retrieval of structurally disconnected information that may lead to inconsistent reasoning. We evaluate PCR on PathRAG-6, a benchmark spanning six domains with 180 nodes and 360 edges. Our results show that PCR achieves full structural consistency compared to 24-32 percent in baseline methods, while maintaining strong relevance scores. On the technology domain, PCR obtains full relevance at rank 10 with full structural consistency, significantly outperforming vector search and hybrid retrieval. PCR reduces the average graph distance of retrieved context by 78 percent compared to baselines, demonstrating retrieval of more structurally consistent information. These findings suggest that path-constrained retrieval is an effective approach for improving the reliability and coherence of LLM agent reasoning systems.
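
One way to picture the constraint described in the abstract is to restrict candidates to nodes reachable from an anchor and only then rank them by a semantic score. The sketch below uses breadth-first search and precomputed scores; the graph, the scores, the exclusion of the anchor itself, and the function names are assumptions for illustration, not the paper's implementation.

```python
from collections import deque

def reachable(graph, anchor):
    """Nodes reachable from the anchor by following directed edges (BFS)."""
    seen = {anchor}
    queue = deque([anchor])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def path_constrained_retrieve(graph, scores, anchor, k):
    """Rank only anchor-reachable nodes by their semantic similarity score."""
    allowed = reachable(graph, anchor) - {anchor}
    ranked = sorted(allowed, key=lambda n: scores.get(n, 0.0), reverse=True)
    return ranked[:k]

# Toy knowledge graph with two disconnected topics.
graph = {
    "auth": ["tokens", "sessions"],
    "tokens": ["jwt"],
    "billing": ["invoices"],
}
# Hypothetical query-to-node similarity scores from some embedding model.
scores = {"jwt": 0.7, "tokens": 0.6, "sessions": 0.4, "invoices": 0.9, "billing": 0.8}

result = path_constrained_retrieve(graph, scores, anchor="auth", k=2)
```

Plain vector search would return `invoices` first because of its higher score; the reachability filter drops it because it is disconnected from the `auth` anchor.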

[256] Towards Robust and Fair Next Visit Diagnosis Prediction under Noisy Clinical Notes with Large Language Models cs.CLPDF

Heejoon Koo

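TL;DR: The paper studies how to make next-visit diagnosis prediction with large language models (LLMs) more robust and fair under noisy clinical notes, and proposes a hierarchical chain-of-thought (CoT) strategy and a clinical label-reduction scheme.

Details

Motivation: Clinical text is often corrupted by human error or failures in automated pipelines, which affects the reliability and fairness of AI-assisted decision-making, yet systematic study of this noise remains limited.

Result: The approach improves robustness to noisy inputs and reduces prediction instability across demographic subgroups.

Insight: The study highlights the potential impact of noise on clinical decision support and shows, through the hierarchical CoT strategy, how the interpretability and fairness of model reasoning can be improved.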

Abstract: A decade of rapid advances in artificial intelligence (AI) has opened new opportunities for clinical decision support systems (CDSS), with large language models (LLMs) demonstrating strong reasoning abilities on timely medical tasks. However, clinical texts are often degraded by human errors or failures in automated pipelines, raising concerns about the reliability and fairness of AI-assisted decision-making. Yet the impact of such degradations remains under-investigated, particularly regarding how noise-induced shifts can heighten predictive uncertainty and unevenly affect demographic subgroups. We present a systematic study of state-of-the-art LLMs under diverse text corruption scenarios, focusing on robustness and equity in next-visit diagnosis prediction. To address the challenge posed by the large diagnostic label space, we introduce a clinically grounded label-reduction scheme and a hierarchical chain-of-thought (CoT) strategy that emulates clinicians’ reasoning. Our approach improves robustness and reduces subgroup instability under degraded inputs, advancing the reliable use of LLMs in CDSS. We release code at https://github.com/heejkoo9/NECHOv3.

[257] SmolKalam: Ensemble Quality-Filtered Translation at Scale for High Quality Arabic Post-Training Data cs.CL | cs.AIPDF

Sultan Alrashed, Chadi Helwe, Francesco Orabona

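TL;DR: SmolKalam uses a multi-model ensemble translation pipeline with quality filtering to produce a high-quality multi-turn dataset for Arabic post-training.

Details

Motivation: Large-scale, high-quality multi-turn Arabic datasets are lacking, especially for reasoning and tool calling, and naive translation does not meet the quality bar that post-training requires.

Result: The work produces a large-scale, high-quality multi-turn Arabic dataset suited to post-training.

Insight: Ensemble translation and quality filtering are key to raising the quality of Arabic datasets, particularly for the post-training stage.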

Abstract: Although the community has tackled the acquisition of high-quality Arabic pretraining data, we still lack large-scale, multi-turn Arabic datasets that include reasoning and tool calling. Naive translation can work at the pretraining scale, but post-training demands much higher quality, which requires a stricter approach to dataset curation. In this work, we introduce SmolKalam, a translation of Smoltalk2 that uses a multi-model ensemble translation pipeline, applies quality filtering, and examines effective translation techniques for traditional decoder-only models through ablations.

[258] Multi-Agent Collaborative Filtering: Orchestrating Users and Items for Agentic Recommendations cs.CL | cs.IRPDF

Yu Xia, Sungchul Kim, Tong Yu, Ryan A. Rossi, Julian McAuely

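TL;DR: The paper proposes the Multi-Agent Collaborative Filtering (MACF) framework, which combines traditional collaborative filtering with LLM-based multi-agent collaboration to improve agentic recommendations.

Details

Motivation: Most existing agentic recommender systems focus on single-agent workflows or multi-agent task decomposition and underuse the collaborative signals in the user-item interaction history, which leads to unsatisfying recommendations.

Result: On datasets from three different domains, MACF outperforms strong agentic recommendation baselines.

Insight: Combining multi-agent collaboration with collaborative filtering allows more flexible use of user and item interaction signals and can improve the personalization and accuracy of recommendations.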

Abstract: Agentic recommendations cast recommenders as large language model (LLM) agents that can plan, reason, use tools, and interact with users of varying preferences in web applications. However, most existing agentic recommender systems focus on generic single-agent plan-execute workflows or multi-agent task decomposition pipelines. Without recommendation-oriented design, they often underuse the collaborative signals in the user-item interaction history, leading to unsatisfying recommendation results. To address this, we propose the Multi-Agent Collaborative Filtering (MACF) framework for agentic recommendations, drawing an analogy between traditional collaborative filtering algorithms and LLM-based multi-agent collaboration. Specifically, given a target user and query, we instantiate similar users and relevant items as LLM agents with unique profiles. Each agent is able to call retrieval tools, suggest candidate items, and interact with other agents. Different from the static preference aggregation in traditional collaborative filtering, MACF employs a central orchestrator agent to adaptively manage the collaboration between user and item agents via dynamic agent recruitment and personalized collaboration instruction. Experimental results on datasets from three different domains show the advantages of our MACF framework compared to strong agentic recommendation baselines.

[259] General Agentic Memory Via Deep Research cs.CL | cs.AI | cs.IR | cs.LGPDF

B. Y. Yan, Chaofan Li, Hongjin Qian, Shuqi Lu, Zheng Liu

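TL;DR: The paper proposes general agentic memory (GAM), a framework that builds optimized contexts at runtime to address the information loss of static memory, and reports gains over existing memory systems in experiments.

Details

Motivation: Static memory systems prepare memory in advance during the offline stage, which causes severe information loss at runtime. The paper proposes a memory framework that optimizes context dynamically to address this.

Result: GAM substantially outperforms existing memory systems across various memory-grounded task completion scenarios.

Insight: Optimizing memory context at runtime can mitigate the information loss of static memory while making use of the agentic capabilities and test-time scalability of frontier large language models.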

Abstract: Memory is critical for AI agents, yet the widely-adopted static memory, aiming to create readily available memory in advance, is inevitably subject to severe information loss. To address this limitation, we propose a novel framework called general agentic memory (GAM). GAM follows the principle of “just-in-time (JIT) compilation” where it focuses on creating optimized contexts for its client at runtime while keeping only simple but useful memory during the offline stage. To this end, GAM employs a duo-design with the following components. 1) Memorizer, which highlights key historical information using a lightweight memory, while maintaining complete historical information within a universal page-store. 2) Researcher, which retrieves and integrates useful information from the page-store for its online request guided by the pre-constructed memory. This design allows GAM to effectively leverage the agentic capabilities and test-time scalability of frontier large language models (LLMs), while also facilitating end-to-end performance optimization through reinforcement learning. In our experimental study, we demonstrate that GAM achieves substantial improvement on various memory-grounded task completion scenarios against existing memory systems.

[260] MindEval: Benchmarking Language Models on Multi-turn Mental Health Support cs.CL | cs.AIPDF

José Pombal, Maya D’Eon, Nuno M. Guerreiro, Pedro Henrique Martins, António Farinhas

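TL;DR: MindEval is an automated framework for evaluating language models in multi-turn mental health support conversations, addressing gaps in existing benchmarks. It relies on patient simulation and LLM-based automatic evaluation, validated against human-generated text and expert judgments, and finds that all 12 state-of-the-art LLMs tested struggle.

Details

Motivation: Current AI mental health support systems have limitations such as sycophancy and the reinforcement of maladaptive beliefs, and benchmarks that capture the complexity of real therapeutic conversations are scarce.

Result: The 12 LLMs score below 4 out of 6 on average, with particular weaknesses in problematic AI-specific communication patterns. Reasoning capabilities and model scale do not guarantee better performance.

Insight: Mental health support conversations need purpose-built evaluation, since general language model capability alone is not enough. Longer interactions and patients with severe symptoms are especially challenging.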

Abstract: Demand for mental health support through AI chatbots is surging, though current systems present several limitations, like sycophancy or overvalidation, and reinforcement of maladaptive beliefs. A core obstacle to the creation of better systems is the scarcity of benchmarks that capture the complexity of real therapeutic interactions. Most existing benchmarks either only test clinical knowledge through multiple-choice questions or assess single responses in isolation. To bridge this gap, we present MindEval, a framework designed in collaboration with Ph.D-level Licensed Clinical Psychologists for automatically evaluating language models in realistic, multi-turn mental health therapy conversations. Through patient simulation and automatic evaluation with LLMs, our framework balances resistance to gaming with reproducibility via its fully automated, model-agnostic design. We begin by quantitatively validating the realism of our simulated patients against human-generated text and by demonstrating strong correlations between automatic and human expert judgments. Then, we evaluate 12 state-of-the-art LLMs and show that all models struggle, scoring below 4 out of 6, on average, with particular weaknesses in problematic AI-specific patterns of communication. Notably, reasoning capabilities and model scale do not guarantee better performance, and systems deteriorate with longer interactions or when supporting patients with severe symptoms. We release all code, prompts, and human evaluation data.

[261] Toward Trustworthy Difficulty Assessments: Large Language Models as Judges in Programming and Synthetic Tasks cs.CLPDF

H. M. Shadman Tabib, Jaber Ahmed Deedar

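TL;DR: The paper studies how trustworthy large language models (LLMs) are when assessing the difficulty of programming problems. Compared with a LightGBM model trained on explicit features, GPT-4o is far less accurate at predicting LeetCode problem difficulty (37.75% vs. 86%) and tends to overlook key features such as numeric constraints.

Details

Motivation: As LLMs are widely used for natural language and code generation, their potential as judges of task difficulty and model outputs has drawn interest. Their behavior on structured tasks such as programming difficulty assessment remains unclear and needs systematic study.

Result: LightGBM reaches 86% accuracy, whereas GPT-4o reaches only 37.75%. GPT-4o tends to underrate genuinely Hard problems, yet it labels almost all of its own synthetic Hard problems as Medium.

Insight: Numeric constraints such as input size limits and acceptance rates are crucial for separating difficulty levels, and GPT-4o often overlooks them. This exposes concrete failure modes that must be addressed before LLMs can be trusted as judges.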

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in natural language and code generation, and are increasingly deployed as automatic judges of model outputs and learning activities. Yet, their behavior on structured tasks such as predicting the difficulty of competitive programming problems remains under-explored. We conduct a systematic comparison of GPT-4o, used purely as a natural-language difficulty assessor, against an interpretable LightGBM ensemble trained on explicit numeric and textual features. On a dataset of 1,825 LeetCode problems labeled Easy, Medium, or Hard, LightGBM attains 86% accuracy, whereas GPT-4o reaches only 37.75%. Detailed analyses, including confusion matrices and SHAP-based interpretability, show that numeric constraints, such as input size limits and acceptance rates, play a crucial role in separating Hard problems from easier ones. By contrast, GPT-4o often overlooks these cues and exhibits a strong bias toward simpler categories. We further probe GPT-4o through a synthetic Hard-problem generation protocol. Surprisingly, GPT-4o labels almost all of its own synthetic Hard problems as Medium, contradicting its tendency to downgrade real Hard problems to Easy. Our findings connect to recent work on LLMs-as-judges and automatic difficulty estimation in programming and education, and highlight concrete failure modes that must be addressed before LLM-based judges can be considered trustworthy in competitive programming, educational platforms, or reinforcement-learning pipelines.

[262] A Benchmark for Zero-Shot Belief Inference in Large Language Models cs.CLPDF

Joseph Malone, Rachith Aiyappa, Byunghwee Lee, Haewoon Kwak, Jisun An

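TL;DR: The study introduces a systematic, reproducible benchmark for evaluating how well large language models (LLMs) predict individuals' stances on a wide range of topics in a zero-shot setting. Using data from an online debate platform, it finds that more background information improves predictive accuracy, but performance varies substantially across belief domains.

Details

Motivation: Beliefs are central to human reasoning, communication, and social connection, yet existing computational approaches are confined to narrow sociopolitical contexts and rely on fine-tuning. As LLMs are used across more disciplines, how well they generalize across diverse belief domains remains unclear.

Result: Providing more background information about an individual improves predictive accuracy, but model performance varies substantially across belief domains, revealing both the capacity and the limitations of LLMs in emulating human reasoning.

Insight: Current LLMs show some ability to emulate human belief reasoning, but their performance is uneven across domains, and future work should focus on improving generalization.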

Abstract: Beliefs are central to how humans reason, communicate, and form social connections, yet most computational approaches to studying them remain confined to narrow sociopolitical contexts and rely on fine-tuning for optimal performance. Despite the growing use of large language models (LLMs) across disciplines, how well these systems generalize across diverse belief domains remains unclear. We introduce a systematic, reproducible benchmark that evaluates the ability of LLMs to predict individuals’ stances on a wide range of topics in a zero-shot setting using data from an online debate platform. The benchmark includes multiple informational conditions that isolate the contribution of demographic context and known prior beliefs to predictive success. Across several small- to medium-sized models, we find that providing more background information about an individual improves predictive accuracy, but performance varies substantially across belief domains. These findings reveal both the capacity and limitations of current LLMs to emulate human reasoning, advancing the study of machine behavior and offering a scalable framework for modeling belief systems beyond the sociopolitical sphere.

[263] A Unified BERT-CNN-BiLSTM Framework for Simultaneous Headline Classification and Sentiment Analysis of Bangla News cs.CL | cs.AIPDF

Mirza Raquib, Munazer Montasir Akash, Tawhid Ahmed, Saydul Akbar Murad, Farida Siddiqi Prity

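TL;DR: The paper proposes a hybrid transfer learning model, BERT-CNN-BiLSTM, for simultaneous classification and sentiment analysis of Bangla news headlines, and reports better performance than baseline models.

Details

Motivation: Classifying headlines and analyzing their sentiment helps readers quickly grasp what the news is about and its emotional tone, but Bangla is a low-resource language with little related research or tooling.

Result: The model performs strongly on the classification tasks, reaching 81.37% on headline classification and 64.46% on sentiment analysis.

Insight: 1. The hybrid model performs well on a low-resource language; 2. Dataset balancing techniques have a notable effect on performance.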

Abstract: In our daily lives, newspapers are an essential information source that impacts how the public talks about present-day issues. However, effectively navigating the vast amount of news content from different newspapers and online news portals can be challenging. Newspaper headlines with sentiment analysis tell us what the news is about (e.g., politics, sports) and how the news makes us feel (positive, negative, neutral). This helps us quickly understand the emotional tone of the news. This research presents a state-of-the-art approach to Bangla news headline classification combined with sentiment analysis applying Natural Language Processing (NLP) techniques, particularly the hybrid transfer learning model BERT-CNN-BiLSTM. We have explored a dataset called BAN-ABSA of 9014 news headlines, and this is the first time it has been used for simultaneous headline and sentiment categorization of Bengali newspapers. Over this imbalanced dataset, we applied two experimental strategies: technique-1, where undersampling and oversampling are applied before splitting, and technique-2, where undersampling and oversampling are applied after splitting. In technique-1, oversampling provided the strongest performance for both headline and sentiment classification (78.57% and 73.43%, respectively), while technique-2 delivered its highest results when trained directly on the original imbalanced dataset (81.37% and 64.46%, respectively). The proposed model BERT-CNN-BiLSTM significantly outperforms all baseline models in classification tasks, and achieves new state-of-the-art results for Bangla news headline classification and sentiment analysis. These results demonstrate the importance of leveraging both the headline and sentiment datasets, and provide a strong baseline for Bangla text classification in low-resource settings.

[264] Prompt Optimization as a State-Space Search Problem cs.CLPDF

Maanas Taneja

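TL;DR: This paper proposes modeling prompt optimization as a state-space search problem, using a graph model and search algorithms to improve language model performance.

Details

Motivation: Language models are highly sensitive to small changes in the input prompt, which can cause performance to collapse. Existing libraries such as DSPy address this with demonstration-based optimization; this paper proposes a more systematic, search-based approach.

Result: Experiments show that shallow search (beam width = 2, depth = 2) yields clear gains on the development set (e.g., reasoning accuracy rises from 0.40 to 0.80), but test-set gains are smaller (0.20 to 0.50).

Insight: Successful optimization paths show that transformations that make prompts concise are the most effective, while verbosity operators are never selected. Future directions include more compute and better evaluation metrics.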

Abstract: Language Models are extremely susceptible to performance collapse with even small changes to input prompt strings. Libraries such as DSPy (from Stanford NLP) avoid this problem through demonstration-based prompt optimisation. Inspired by this, I propose an alternative approach that treats prompt optimisation as a classical state-space search problem. I model the prompt space as a graph where nodes represent prompt states and edges correspond to deliberate transformations such as shortening, adding examples, or reordering content. Using beam search and random walk algorithms, I systematically explore this space, evaluating candidates on development sets and pruning unpromising branches. Across five NLP tasks (sentiment classification, question answering, summarisation, reasoning, and natural language inference), I find that even shallow search configurations (beam width=2, depth=2) improve upon seed prompts on development sets. For instance, beam search achieves development accuracy gains from 0.40 to 0.80 on reasoning tasks, though test set improvements are more modest (0.20 to 0.50), indicating overfitting to the development heuristic. Analysis of successful optimisation paths reveals that transformations that make prompts concise appear most frequently, while verbosity operators are never selected. My results validate prompt optimization as a search problem and suggest that with greater computational resources and improved evaluation metrics, deeper exploration could yield more robust prompts that generalize beyond development sets. Code and implementation are available at [https://github.com/MaanasTaneja/PromptOptimiser].
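The search described in this abstract can be illustrated with a minimal beam search sketch. It is not the author's implementation: the operator names (`shorten`, `add_example`, `make_verbose`) and the `toy_score` function are hypothetical stand-ins, and in a real setup the score would be accuracy on a development set.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical transformation operators: each one is an edge in the prompt graph.
def shorten(prompt: str) -> str:
    """Keep only the first sentence of the prompt."""
    return prompt.split(". ")[0].rstrip(".") + "."

def add_example(prompt: str) -> str:
    """Append a (toy) demonstration to the prompt."""
    return prompt + " Example: 2 + 2 = 4."

def make_verbose(prompt: str) -> str:
    """Append extra instructions, making the prompt longer."""
    return prompt + " Please think very carefully and explain everything in detail."

OPERATORS: Dict[str, Callable[[str], str]] = {
    "shorten": shorten,
    "add_example": add_example,
    "make_verbose": make_verbose,
}

Candidate = Tuple[float, str, List[str]]  # (score, prompt, path of operator names)

def beam_search(seed: str, score: Callable[[str], float],
                beam_width: int = 2, depth: int = 2) -> Candidate:
    """Expand every prompt in the beam with every operator, keep the top
    `beam_width` candidates per level, and return the best prompt seen."""
    beam: List[Candidate] = [(score(seed), seed, [])]
    best = beam[0]
    for _ in range(depth):
        candidates: List[Candidate] = []
        for _, prompt, path in beam:
            for name, op in OPERATORS.items():
                new_prompt = op(prompt)
                candidates.append((score(new_prompt), new_prompt, path + [name]))
        # Prune: only the highest-scoring candidates survive to the next level.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
        if beam[0][0] > best[0]:
            best = beam[0]
    return best

def toy_score(prompt: str) -> float:
    """Assumed stand-in for development-set accuracy: favours short prompts."""
    return 1.0 / (1.0 + len(prompt.split()))

if __name__ == "__main__":
    seed = "Answer the question. Be precise and give your reasoning step by step."
    best_score, best_prompt, path = beam_search(seed, toy_score)
    print(best_score, best_prompt, path)
```

Because candidates are ranked only by the development score, the sketch has the same weakness the abstract reports: a prompt can win the search by fitting the development heuristic without generalizing to the test set.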

[265] Evaluating Large Language Models on the 2026 Korean CSAT Mathematics Exam: Measuring Mathematical Ability in a Zero-Data-Leakage Setting cs.CL

Goun Pyeon, Inbum Heo, Jeesu Jung, Taewook Hwang, Hyuk Namgoong

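TL;DR: This study systematically evaluates the mathematical reasoning ability of large language models (LLMs) on the 2026 Korean CSAT Mathematics exam in a setting free of data leakage. GPT-5 Codex performs best, but models run with low reasoning intensity may be more practical in real-world use.

Details

Motivation: Existing benchmarks suffer from data leakage and cannot accurately measure the true ability of LLMs, so this study builds a new contamination-free evaluation environment.

Result: GPT-5 Codex achieves the only perfect score (100 points), and geometry is the weakest domain (77.7% on average). Text input outperforms image input, and the reasoning-enhancement experiments show that higher reasoning intensity improves scores but reduces efficiency.

Insight: In practice, models with low reasoning intensity may be more cost-effective; geometry and high-difficulty problems remain weak spots for LLMs.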

Abstract: This study systematically evaluated the mathematical reasoning capabilities of Large Language Models (LLMs) using the 2026 Korean College Scholastic Ability Test (CSAT) Mathematics section, ensuring a completely contamination-free evaluation environment. To address data leakage issues in existing benchmarks, we digitized all 46 questions (22 common and 24 elective) within two hours of the exam’s public release, eliminating any possibility of inclusion in model training data. We conducted comprehensive evaluations of 24 state-of-the-art LLMs across varying input modalities (text, image, text+figure) and prompt languages (Korean, English). GPT-5 Codex achieved the only perfect score (100 points) with text input and Korean prompts, while Grok 4, GPT-5, and Deepseek R1 scored above 95 points. Notably, gpt-oss-20B achieved 95.7 points despite its relatively small size, demonstrating high cost-effectiveness. Problem-specific analysis revealed geometry as the weakest domain (77.7% average) with significant performance degradation on 4-point high-difficulty problems. Text input consistently outperformed image input, while prompt language effects varied by model scale. In reasoning enhancement experiments with GPT-5 series, increased reasoning intensity improved performance (from 82.6 to 100 points) but quadrupled token usage and drastically reduced efficiency, suggesting that models with minimal reasoning may be more practical. This research contributes: (1) implementation of a completely unexposed evaluation environment, (2) a real-exam-based LLM assessment framework, and (3) a practical evaluation perspective integrating performance, cost, and time considerations. Detailed results and model comparisons are available at the 2026 Korean CSAT LLM Evaluation Leaderboard (https://isoft.cnu.ac.kr/csat2026/).

[266] CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning cs.CL

Jie He, Richard He Bai, Sinead Williamson, Jeff Z. Pan, Navdeep Jaitly

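TL;DR: CLaRa proposes a continuous latent reasoning framework that unifies retrieval and generation. It improves retrieval-augmented generation through semantic compression and end-to-end optimization and achieves SOTA performance on multiple QA benchmarks.

Details

Motivation: Existing retrieval-augmented generation (RAG) methods suffer from long contexts and from optimizing retrieval and generation separately. CLaRa aims for unified optimization in a shared continuous latent space.

Result: On multiple QA benchmarks, CLaRa achieves SOTA compression and reranking performance, often surpassing text-based fine-tuned baselines.

Insight: CLaRa's unified optimization aligns retrieval relevance with answer quality, a property the paper argues theoretically, and offers a new optimization approach for retrieval-augmented generation.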

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but still suffers from long contexts and disjoint retrieval-generation optimization. In this work, we propose CLaRa (Continuous Latent Reasoning), a unified framework that performs embedding-based compression and joint optimization in a shared continuous space. To obtain semantically rich and retrievable compressed vectors, we introduce SCP, a key-preserving data synthesis framework using QA and paraphrase supervision. CLaRa then trains the reranker and generator end-to-end via a single language modeling loss, with gradients flowing through both modules using a differentiable top-k estimator. Theoretically, this unified optimization aligns retrieval relevance with answer quality. Experiments across multiple QA benchmarks show that CLaRa achieves state-of-the-art compression and reranking performance, often surpassing text-based fine-tuned baselines.

[267] RhinoInsight: Improving Deep Research through Control Mechanisms for Model Behavior and Context cs.CL | cs.AI

Yu Lei, Shuzheng Si, Wei Wang, Yifei Wu, Gang Chen

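TL;DR: RhinoInsight is a deep research framework that adds two control mechanisms, a verifiable checklist and an evidence audit, to improve robustness, traceability, and overall quality without parameter updates.

Details

Motivation: Current LLM-based deep research systems suffer from error accumulation and context rot, mainly because they lack explicit control over model behavior and context.

Result: Experiments show that RhinoInsight achieves state-of-the-art performance on deep research tasks while remaining competitive on deep search tasks.

Insight: Explicit control mechanisms such as checklists and audits are an effective way to improve model reliability and reduce hallucinations.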

Abstract: Large language models are evolving from single-turn responders into tool-using agents capable of sustained reasoning and decision-making for deep research. Prevailing systems adopt a linear pipeline of planning, searching, and then writing a report, which suffers from error accumulation and context rot due to the lack of explicit control over both model behavior and context. We introduce RhinoInsight, a deep research framework that adds two control mechanisms to enhance robustness, traceability, and overall quality without parameter updates. First, a Verifiable Checklist module transforms user requirements into traceable and verifiable sub-goals, incorporates human or LLM critics for refinement, and compiles a hierarchical outline to anchor subsequent actions and prevent non-executable planning. Second, an Evidence Audit module structures search content, iteratively updates the outline, and prunes noisy context, while a critic ranks and binds high-quality evidence to drafted content to ensure verifiability and reduce hallucinations. Our experiments demonstrate that RhinoInsight achieves state-of-the-art performance on deep research tasks while remaining competitive on deep search tasks.

[268] Large Language Models Require Curated Context for Reliable Political Fact-Checking – Even with Reasoning and Web Search cs.CL | cs.CY | cs.IR

Matthew R. DeVerna, Kai-Cheng Yang, Harry Yaojun Yan, Filippo Menczer

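TL;DR: This paper studies how large language models (LLMs) perform at political fact-checking. Standard models perform poorly, reasoning and web search bring only limited improvements, and a RAG system supplied with high-quality context (PolitiFact summaries) improves performance substantially.

Details

Motivation: As mainstream chatbots gain reasoning capabilities and web search tools, users increasingly rely on them for fact-checking. Prior results on this task are mixed, so rigorous evaluation is urgently needed.

Result: Standard models perform poorly, and reasoning and web search offer limited gains, whereas the curated RAG system improves macro F1 by 233% on average across model variants.

Insight: High-quality context, such as summaries from authoritative fact-checkers, is key to making LLM fact-checking reliable.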

Abstract: Large language models (LLMs) have raised hopes for automated end-to-end fact-checking, but prior studies report mixed results. As mainstream chatbots increasingly ship with reasoning capabilities and web search tools – and millions of users already rely on them for verification – rigorous evaluation is urgent. We evaluate 15 recent LLMs from OpenAI, Google, Meta, and DeepSeek on more than 6,000 claims fact-checked by PolitiFact, comparing standard models with reasoning- and web-search variants. Standard models perform poorly, reasoning offers minimal benefits, and web search provides only moderate gains, despite fact-checks being available on the web. In contrast, a curated RAG system using PolitiFact summaries improved macro F1 by 233% on average across model variants. These findings suggest that giving models access to curated high-quality context is a promising path for automated fact-checking.

[269] Robust Multimodal Sentiment Analysis with Distribution-Based Feature Recovery and Fusion cs.CL

Daiqing Wu, Dongbao Yang, Can Ma, Yu Zhou

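TL;DR: This paper proposes a Distribution-based feature Recovery and Fusion (DRF) method that makes multimodal sentiment analysis more robust, particularly when modalities are low-quality or missing.

Details

Motivation: Image-text pairs on social media may contain low-quality or missing modalities, which hurts sentiment analysis accuracy. Existing methods do not adequately account for this.

Result: On three public datasets, DRF outperforms existing methods under both disruption strategies (low-quality and missing modalities), confirming its robustness.

Insight: 1. Approximating each modality's feature distribution is key to robust multimodal analysis; 2. Inter-modal mapping helps recover missing modalities; 3. Quantitatively estimating modality quality improves fusion.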

Abstract: As posts on social media increase rapidly, analyzing the sentiments embedded in image-text pairs has become a popular research topic in recent years. Although existing works achieve impressive results in simultaneously harnessing image and text information, they do not account for possible low-quality and missing modalities. In real-world applications, these issues may occur frequently, creating an urgent need for models capable of predicting sentiment robustly. Therefore, we propose a Distribution-based feature Recovery and Fusion (DRF) method for robust multimodal sentiment analysis of image-text pairs. Specifically, we maintain a feature queue for each modality to approximate its feature distribution, through which we can simultaneously handle low-quality and missing modalities in a unified framework. For low-quality modalities, we reduce their contributions to the fusion by quantitatively estimating modality qualities based on the distributions. For missing modalities, we build inter-modal mapping relationships supervised by samples and distributions, thereby recovering the missing modalities from available ones. In experiments, two disruption strategies that corrupt and discard some modalities in samples are adopted to mimic the low-quality and missing modalities in various real-world scenarios. Through comprehensive experiments on three publicly available image-text datasets, we demonstrate the consistent improvements of DRF over SOTA methods under both strategies, validating its effectiveness in robust multimodal sentiment analysis.
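The idea of a per-modality feature queue that down-weights low-quality inputs can be sketched as follows. This is only an illustration under strong assumptions, not the DRF method itself: the abstract does not give the quality estimator, so the `quality` score below (an exponential of the squared distance to the queue mean) and the class and function names are hypothetical. Recovery of missing modalities is not covered.

```python
import math
from collections import deque
from typing import Deque, List, Sequence, Tuple

class FeatureQueue:
    """Fixed-size queue of recent features, used as a crude approximation
    of one modality's feature distribution."""

    def __init__(self, maxlen: int = 1024) -> None:
        self.items: Deque[List[float]] = deque(maxlen=maxlen)

    def push(self, feature: Sequence[float]) -> None:
        self.items.append(list(feature))

    def mean(self) -> List[float]:
        n = len(self.items)
        dim = len(self.items[0])
        return [sum(f[i] for f in self.items) / n for i in range(dim)]

    def quality(self, feature: Sequence[float]) -> float:
        """Hypothetical quality score in (0, 1]: features far from the
        queue mean are treated as lower quality."""
        mu = self.mean()
        d2 = sum((x - m) ** 2 for x, m in zip(feature, mu)) / len(mu)
        return math.exp(-d2)

def fuse(img_feat: Sequence[float], txt_feat: Sequence[float],
         img_q: FeatureQueue, txt_q: FeatureQueue
         ) -> Tuple[List[float], Tuple[float, float]]:
    """Weighted average of the two modality features, with weights
    proportional to the estimated quality of each modality."""
    qi = img_q.quality(img_feat)
    qt = txt_q.quality(txt_feat)
    w_img = qi / (qi + qt)
    w_txt = 1.0 - w_img
    fused = [w_img * a + w_txt * b for a, b in zip(img_feat, txt_feat)]
    return fused, (w_img, w_txt)

if __name__ == "__main__":
    img_q, txt_q = FeatureQueue(), FeatureQueue()
    for f in ([1.0, 0.0], [0.9, 0.1], [1.1, -0.1]):
        img_q.push(f)
    for f in ([0.0, 1.0], [0.1, 0.9], [-0.1, 1.1]):
        txt_q.push(f)
    # A corrupted image feature receives a much smaller fusion weight.
    print(fuse([3.0, 2.0], [0.0, 1.0], img_q, txt_q))
```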

[270] Context-Aware Whisper for Arabic ASR Under Linguistic Varieties cs.CL

Bashar Talafha, Amin Abu Alhassan, Muhammad Abdul-Mageed

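TL;DR: This paper proposes context-aware prompting strategies for low-resource Arabic automatic speech recognition (ASR). They adapt the Whisper model without retraining and substantially reduce word error rate (WER) on Arabic.

Details

Motivation: Arabic has wide dialectal variation and limited labeled data, which makes low-resource ASR challenging. A method that adapts to this linguistic diversity without retraining is needed.

Result: Across nine Arabic linguistic conditions, the approach reduces WER by up to 22.3% on Modern Standard Arabic and 9.2% on dialectal speech, while mitigating hallucinations and speaker mismatch.

Insight: Context-aware prompting can substantially improve ASR for low-resource languages without retraining the model, especially where linguistic variation is high.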

Abstract: Low-resource ASR remains a challenging problem, especially for languages like Arabic that exhibit wide dialectal variation and limited labeled data. We propose context-aware prompting strategies to adapt OpenAI’s Whisper for Arabic speech recognition without retraining. Our methods include decoder prompting with first-pass transcriptions or retrieved utterances, and encoder prefixing using speech synthesized in the target speaker’s voice. We introduce techniques such as prompt reordering, speaker-aware prefix synthesis, and modality-specific retrieval (lexical, semantic, acoustic) to improve transcription in real-world, zero-shot settings. Evaluated on nine Arabic linguistic conditions, our approach reduces WER by up to 22.3% on Modern Standard Arabic and 9.2% on dialectal speech, significantly mitigating hallucinations and speaker mismatch.
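Since the results above are stated in terms of WER, a short reference sketch of the metric may help. It uses the standard word-level edit distance definition; the `relative_reduction` helper is an assumption added for illustration, because the abstract does not say whether the reported reductions are relative or absolute.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) divided by
    the number of reference words, computed with word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def relative_reduction(baseline: float, improved: float) -> float:
    """Relative WER reduction, e.g. 0.5 -> 0.4 is a 20% relative reduction."""
    return (baseline - improved) / baseline

if __name__ == "__main__":
    print(wer("a b c d", "a x c"))       # one substitution and one deletion
    print(relative_reduction(0.5, 0.4))
```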

[271] HyperbolicRAG: Enhancing Retrieval-Augmented Generation with Hyperbolic Representations cs.CL | cs.AI

Cao Linxiao, Wang Ruitao, Li Jindong, Zhou Zhipeng, Yang Menglin

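TL;DR: HyperbolicRAG improves graph-based RAG by introducing hyperbolic geometry, strengthening its ability to capture hierarchical relationships during knowledge retrieval.

Details

Motivation: Conventional RAG methods built on Euclidean embeddings are limited in capturing the hierarchical structure of knowledge and cannot effectively represent the abstraction relationships in complex knowledge graphs.

Result: On multiple QA benchmarks, HyperbolicRAG outperforms both standard RAG and graph-augmented baselines.

Insight: Hyperbolic geometry better captures hierarchy and abstraction relationships in knowledge, opening a new direction for optimizing RAG.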

Abstract: Retrieval-augmented generation (RAG) enables large language models (LLMs) to access external knowledge, helping mitigate hallucinations and enhance domain-specific expertise. Graph-based RAG enhances structural reasoning by introducing explicit relational organization that enables information propagation across semantically connected text units. However, these methods typically rely on Euclidean embeddings that capture semantic similarity but lack a geometric notion of hierarchical depth, limiting their ability to represent abstraction relationships inherent in complex knowledge graphs. To capture both fine-grained semantics and global hierarchy, we propose HyperbolicRAG, a retrieval framework that integrates hyperbolic geometry into graph-based RAG. HyperbolicRAG introduces three key designs: (1) a depth-aware representation learner that embeds nodes within a shared Poincare manifold to align semantic similarity with hierarchical containment, (2) an unsupervised contrastive regularization that enforces geometric consistency across abstraction levels, and (3) a mutual-ranking fusion mechanism that jointly exploits retrieval signals from Euclidean and hyperbolic spaces, emphasizing cross-space agreement during inference. Extensive experiments across multiple QA benchmarks demonstrate that HyperbolicRAG outperforms competitive baselines, including both standard RAG and graph-augmented baselines.
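To make the "geometric notion of hierarchical depth" concrete, here is a small sketch of the standard distance function on the Poincaré ball, d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). It shows only this well-known formula; the paper's representation learner, contrastive regularization, and mutual-ranking fusion are not reproduced, and the function name is illustrative.

```python
import math
from typing import Sequence

def poincare_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Geodesic distance between two points inside the unit Poincaré ball."""
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))

if __name__ == "__main__":
    origin = [0.0, 0.0]
    # Distances grow rapidly towards the boundary, which leaves room for
    # general concepts near the origin and specific ones near the edge.
    print(poincare_distance(origin, [0.5, 0.0]))
    print(poincare_distance(origin, [0.9, 0.0]))
    print(poincare_distance([0.9, 0.0], [-0.9, 0.0]))
```

In the Euclidean plane the points (0.9, 0) and (−0.9, 0) are 1.8 apart, while their hyperbolic distance is several times larger. This boundary stretching is what lets tree-like hierarchies embed with low distortion.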

[272] Concept than Document: Context Compression via AMR-based Conceptual Entropy cs.CL

Kaize Shi, Xueyao Sun, Xiaohui Tao, Lin Li, Qika Lin

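TL;DR: This paper proposes an unsupervised context compression framework based on Abstract Meaning Representation (AMR). It quantifies the conceptual entropy of nodes in AMR graphs to retain core semantic information and remove redundant content.

Details

Motivation: Large language models (LLMs) face information overload when handling long contexts, especially in retrieval-augmented generation (RAG), where lengthy supporting documents reduce reasoning accuracy and increase computational overhead.

Result: Experiments on the PopQA and EntityQuestions datasets show that the method outperforms baselines, achieving higher accuracy with shorter contexts.

Insight: Conceptual entropy over AMR graphs effectively identifies and retains core semantic information, which helps cut computational overhead and improve model performance.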

Abstract: Large Language Models (LLMs) face information overload when handling long contexts, particularly in Retrieval-Augmented Generation (RAG) where extensive supporting documents often introduce redundant content. This issue not only weakens reasoning accuracy but also increases computational overhead. We propose an unsupervised context compression framework that exploits Abstract Meaning Representation (AMR) graphs to preserve semantically essential information while filtering out irrelevant text. By quantifying node-level entropy within AMR graphs, our method estimates the conceptual importance of each node, enabling the retention of core semantics. Specifically, we construct AMR graphs from raw contexts, compute the conceptual entropy of each node, and select the most informative nodes to form a context that is more condensed and semantically focused than the raw documents. Experiments on the PopQA and EntityQuestions datasets show that our method outperforms vanilla and other baselines, achieving higher accuracy while substantially reducing context length. To the best of our knowledge, this is the first work introducing AMR-based conceptual entropy for context compression, demonstrating the potential of stable linguistic features in context engineering.

[273] A Reproducible Framework for Neural Topic Modeling in Focus Group Analysis cs.CL | cs.HC | cs.LG

Heger Arfaoui, Mohammed Iheb Hergli, Beya Benzina, Slimane BenMiled

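TL;DR: This paper presents a reproducible framework for neural topic modeling of focus group discussions, addressing hyperparameter sensitivity, model stability, and validation of interpretability. Applying BERTopic to ten real focus groups demonstrates the framework's feasibility and effectiveness.

Details

Motivation: Traditional focus group analysis relies on manual coding, which is labor-intensive and hard to reproduce. The paper aims to improve scalability and reproducibility with a computational framework.

Result: The hierarchical merging strategy reaches a topic coherence of 0.558, compared with 0.539 for direct extraction. Human evaluation shows good inter-rater reliability (ICC = 0.79, weighted Cohen's kappa = 0.578).

Insight: Hyperparameter choices and the choice of evaluation metric strongly affect the results; a hierarchical strategy effectively balances topic stability and interpretability.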

Abstract: Focus group discussions generate rich qualitative data but their analysis traditionally relies on labor-intensive manual coding that limits scalability and reproducibility. We present a rigorous, reproducible computational framework for applying neural topic modeling to focus group transcripts, addressing fundamental methodological challenges: hyperparameter sensitivity, model stability, and validation of interpretability. Using BERTopic applied to ten focus groups exploring HPV vaccine perceptions in Tunisia (1,076 utterances), we conducted systematic evaluation across 27 hyperparameter configurations, assessed stability through bootstrap resampling with 30 replicates per configuration, and validated interpretability through formal human evaluation by three domain experts. Our analysis demonstrates substantial sensitivity to hyperparameter choices and reveals that metric selection for stability assessment must align with analytical goals. A hierarchical merging strategy (extracting fine-grained topics for stability then consolidating for interpretability) effectively navigates the stability-coherence tradeoff, achieving coherence of 0.558 compared to 0.539 for direct extraction. Human validation confirmed topic quality with very good inter-rater reliability (ICC = 0.79, weighted Cohen’s kappa = 0.578). Our framework provides practical guidelines that researchers can adapt to their own qualitative research contexts. All code, data processing scripts, and evaluation protocols are publicly available to support reproduction and extension of this work.

[274] Large Language Models for the Summarization of Czech Documents: From History to the Present cs.CL

Václav Tran, Jakub Šmíd, Ladislav Lenc, Jean-Pierre Salmon, Pavel Král

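TL;DR: This paper studies large language models (LLMs) for summarizing Czech documents, both historical and modern, proposes a translation-based approach, and introduces a new dataset.

Details

Motivation: Summarization has been studied extensively for English and other high-resource languages, but Czech summarization remains underexplored, especially for historical documents, because of the language's complexity and the lack of high-quality annotated datasets.

Result: LLMs achieve new state-of-the-art results on the SumeCzech dataset, demonstrating their effectiveness for medium-resource languages such as Czech. The paper also provides initial baselines for the Posel od Čerchova dataset.

Insight: LLMs perform well on morphologically rich, medium-resource languages such as Czech; the translation-based approach offers a workable route to summarization in low-resource languages.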

Abstract: Text summarization is the task of automatically condensing longer texts into shorter, coherent summaries while preserving the original meaning and key information. Although this task has been extensively studied in English and other high-resource languages, Czech summarization, particularly in the context of historical documents, remains underexplored. This is largely due to the inherent linguistic complexity of Czech and the lack of high-quality annotated datasets. In this work, we address this gap by leveraging the capabilities of Large Language Models (LLMs), specifically Mistral and mT5, which have demonstrated strong performance across a wide range of natural language processing tasks and multilingual settings. We also propose a translation-based approach that first translates Czech texts into English, summarizes them using an English-language model, and then translates the summaries back into Czech. Our study makes the following main contributions: (1) We demonstrate that LLMs achieve new state-of-the-art results on the SumeCzech dataset, a benchmark for modern Czech text summarization, showing the effectiveness of multilingual LLMs even for morphologically rich, medium-resource languages like Czech. (2) We introduce a new dataset, Posel od Čerchova, designed for the summarization of historical Czech texts. This dataset is derived from digitized 19th-century publications and annotated for abstractive summarization. (3) We provide initial baselines using modern LLMs to facilitate further research in this underrepresented area. By combining cutting-edge models with both modern and historical Czech datasets, our work lays the foundation for further progress in Czech summarization and contributes valuable resources for future research in Czech historical document processing and low-resource summarization more broadly.

[275] Cognitive Alpha Mining via LLM-Driven Code-Based Evolution cs.CL

Fengyuan Liu, Huang Yi, Sichun Luo, Yuqi Wang, Yazheng Yang

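TL;DR: The paper proposes CogAlpha, a framework that combines code-level representation, LLM-driven reasoning, and evolutionary search to discover effective predictive signals in financial data. Experiments show it outperforms existing methods in predictive accuracy, robustness, and generalization.

Details

Motivation: The high dimensionality and extremely low signal-to-noise ratio of financial data make it very hard to discover effective predictive signals ("alphas"). Existing approaches (deep learning, genetic programming, and LLM-generated factors) explore only a narrow part of the search space and struggle to balance logical consistency with creativity.

Result: Experiments on A-share equities show that the alphas discovered by CogAlpha outperform those of existing methods in predictive accuracy, robustness, and generalization.

Insight: The study shows the promise of combining evolutionary optimization with LLM reasoning for automated and explainable alpha discovery.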

Abstract: Discovering effective predictive signals, or "alphas," from financial data with high dimensionality and extremely low signal-to-noise ratio remains a difficult open problem. Despite progress in deep learning, genetic programming, and, more recently, large language model (LLM)-based factor generation, existing approaches still explore only a narrow region of the vast alpha search space. Neural models tend to produce opaque and fragile patterns, while symbolic or formula-based methods often yield redundant or economically ungrounded expressions that generalize poorly. Although different in form, these paradigms share a key limitation: none can conduct broad, structured, and human-like exploration that balances logical consistency with creative leaps. To address this gap, we introduce the Cognitive Alpha Mining Framework (CogAlpha), which combines code-level alpha representation with LLM-driven reasoning and evolutionary search. Treating LLMs as adaptive cognitive agents, our framework iteratively refines, mutates, and recombines alpha candidates through multi-stage prompts and financial feedback. This synergistic design enables deeper thinking, richer structural diversity, and economically interpretable alpha discovery, while greatly expanding the effective search space. Experiments on A-share equities demonstrate that CogAlpha consistently discovers alphas with superior predictive accuracy, robustness, and generalization over existing methods. Our results highlight the promise of aligning evolutionary optimization with LLM-based reasoning for automated and explainable alpha discovery. All source code will be released.

[276] Think Before You Prune: Selective Self-Generated Calibration for Pruning Large Reasoning Models cs.CL

Yang Xiang, Yixin Ji, Juntao Li, Min Zhang

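TL;DR: This paper presents the first empirical study of pruning large reasoning models (LRMs). It finds that directly applying existing pruning techniques to LRMs works poorly and proposes a Selective Self-Generated Reasoning (SSGR) data construction strategy to improve pruning results.

Details

Motivation: LRMs perform well on complex reasoning tasks, but their long reasoning processes are computationally expensive. Existing pruning work targets large language models (LLMs), and pruning LRMs remains unexplored.

Result: On the DeepSeek-R1-Distill model series, SSGR improves the reasoning ability of pruned LRMs by 10%-13% compared with general pruning methods.

Insight: Challenging, moderately long self-generated reasoning data make ideal calibration data for pruning, and the SSGR strategy substantially improves LRM pruning.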

Abstract: Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning benchmarks. However, their long chain-of-thought reasoning processes incur significant inference overhead. Pruning has emerged as a promising approach to reducing computational costs. However, existing efforts have primarily focused on large language models (LLMs), while pruning LRMs remains unexplored. In this work, we conduct the first empirical study on pruning LRMs and show that directly applying existing pruning techniques fails to yield satisfactory results. Our findings indicate that using self-generated reasoning data for calibration can substantially improve pruning performance. We further investigate how the difficulty and length of reasoning data affect pruning outcomes. Our analysis reveals that challenging and moderately long self-generated reasoning data serve as ideal calibration data. Based on these insights, we propose a Selective Self-Generated Reasoning (SSGR) data construction strategy to provide effective calibration data for pruning LRMs. Experimental results on the DeepSeek-R1-Distill model series validate that our strategy improves the reasoning ability of pruned LRMs by 10%-13% compared to general pruning methods.

[277] CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation cs.CL | cs.AI

Jingqian Zhao, Bingbing Wang, Geng Tu, Yice Zhang, Qianlong Wang

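TL;DR: CoreEval proposes an automated strategy that builds contamination-resilient evaluation datasets by integrating up-to-date real-world knowledge, addressing the effect of data contamination on LLM evaluation.

Details

Motivation: Data contamination seriously undermines the fairness of LLM evaluation, and existing methods either fail to fully eliminate models' pre-existing knowledge or fail to preserve the semantic complexity of the original datasets.

Result: Experiments show that CoreEval effectively mitigates the performance overestimation caused by data contamination, making evaluation more reliable.

Insight: Dynamically integrating up-to-date knowledge is an effective way to counter data contamination, provided that semantic coherence and task relevance are maintained.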

Abstract: Data contamination poses a significant challenge to the fairness of LLM evaluations in natural language processing tasks by inadvertently exposing models to test data during training. Current studies attempt to mitigate this issue by modifying existing datasets or generating new ones from freshly collected information. However, these methods fall short of ensuring contamination-resilient evaluation, as they fail to fully eliminate pre-existing knowledge from models or preserve the semantic complexity of the original datasets. To address these limitations, we propose CoreEval, a Contamination-resilient Evaluation strategy for automatically updating data with real-world knowledge. This approach begins by extracting entity relationships from the original data and leveraging the GDELT database to retrieve relevant, up-to-date knowledge. The retrieved knowledge is then recontextualized and integrated with the original data, which is refined and restructured to ensure semantic coherence and enhanced task relevance. Ultimately, a robust data reflection mechanism is employed to iteratively verify and refine labels, ensuring consistency between the updated and original datasets. Extensive experiments on updated datasets validate the robustness of CoreEval, demonstrating its effectiveness in mitigating performance overestimation caused by data contamination.

[278] GraphMind: Theorem Selection and Conclusion Generation Framework with Dynamic GNN for LLM Reasoning cs.CL | cs.AI

Yutong Li, Yitian Zhou, Xudong Wang, GuoChen, Caiyan Qin

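TL;DR: GraphMind proposes a dynamic GNN-based framework that works with LLMs on multi-step reasoning, modeling the reasoning process as a heterogeneous graph to make theorem selection and conclusion generation more context-aware.

Details

Motivation: Existing LLMs lack an explicit, dynamic mechanism for representing intermediate states in multi-step reasoning, which limits context-aware and iterative reasoning.

Result: GraphMind outperforms baselines on multiple QA datasets, supporting the method's effectiveness and generalizability.

Insight: A dynamic graph structure better represents the intermediate states of reasoning and improves the structured reasoning ability of LLMs.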

Abstract: Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, including multi-step reasoning such as mathematical proving. However, existing approaches often lack an explicit and dynamic mechanism to structurally represent and evolve intermediate reasoning states, which limits their ability to perform context-aware theorem selection and iterative conclusion generation. To address these challenges, we propose GraphMind, a novel dynamic graph-based framework that integrates the graph neural network (GNN) with LLMs to iteratively select theorems and generate intermediate conclusions for multi-step reasoning. Our method models the reasoning process as a heterogeneous evolving graph, where nodes represent conditions, theorems, and conclusions, while edges capture logical dependencies between nodes. By encoding the current reasoning state with GNN and leveraging semantic matching for theorem selection, our framework enables context-aware, interpretable, and structured reasoning in a closed-loop manner. Experiments on various question-answering (QA) datasets demonstrate that our proposed GraphMind method achieves consistent performance improvements and significantly outperforms existing baselines in multi-step reasoning, validating the effectiveness and generalizability of our approach.

[279] A Multi-Agent LLM Framework for Multi-Domain Low-Resource In-Context NER via Knowledge Retrieval, Disambiguation and Reflective Analysis cs.CL

Wenxuan Mu, Jinzhong Ning, Di Zhao, Yijia Zhang

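TL;DR: The paper proposes KDR-Agent, a multi-agent framework for low-resource, multi-domain named entity recognition (NER) that improves performance through knowledge retrieval, disambiguation, and reflective analysis.

Details

Motivation: Existing in-context learning (ICL) methods for NER depend on dynamically retrieving annotated examples, generalize poorly to unseen domains, and lack external knowledge and disambiguation mechanisms.

Result: Experiments on ten datasets from five domains show that KDR-Agent significantly outperforms existing zero-shot and few-shot ICL baselines.

Insight: Introducing external knowledge and multi-agent coordination effectively improves LLM performance on low-resource, multi-domain NER.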

Abstract: In-context learning (ICL) with large language models (LLMs) has emerged as a promising paradigm for named entity recognition (NER) in low-resource scenarios. However, existing ICL-based NER methods suffer from three key limitations: (1) reliance on dynamic retrieval of annotated examples, which is problematic when annotated data is scarce; (2) limited generalization to unseen domains due to the LLM’s insufficient internal domain knowledge; and (3) failure to incorporate external knowledge or resolve entity ambiguities. To address these challenges, we propose KDR-Agent, a novel multi-agent framework for multi-domain low-resource in-context NER that integrates Knowledge retrieval, Disambiguation, and Reflective analysis. KDR-Agent leverages natural-language type definitions and a static set of entity-level contrastive demonstrations to reduce dependency on large annotated corpora. A central planner coordinates specialized agents to (i) retrieve factual knowledge from Wikipedia for domain-specific mentions, (ii) resolve ambiguous entities via contextualized reasoning, and (iii) reflect on and correct model predictions through structured self-assessment. Experiments across ten datasets from five domains demonstrate that KDR-Agent significantly outperforms existing zero-shot and few-shot ICL baselines across multiple LLM backbones. The code and data can be found at https://github.com/MWXGOD/KDR-Agent.

[280] DeCoRL: Decoupling Reasoning Chains via Parallel Sub-Step Generation and Cascaded Reinforcement for Interpretable and Scalable RLHF cs.CL

Ziyuan Gao, Di Liang, Xianjie Wu, Philippe Morel, Minlong Peng

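TL;DR: DeCoRL proposes a framework that decouples reasoning chains, generating reasoning sub-steps in parallel and coordinating them with cascaded reinforcement learning. It addresses limitations of existing RL methods and improves inference speed, interpretability, and energy efficiency.

Details

Motivation: Existing reinforcement learning methods for Chain-of-Thought reasoning have two problems: black-box reward signals hinder error diagnosis, and sequential decoding has O(n) time complexity, which makes real-time deployment impractical.

Result: DeCoRL reaches SOTA on RM-Bench, RMB, and RewardBench, with 3.8 times faster inference, a 22.7% improvement in interpretability, a 72.4% reduction in energy consumption, and a 68% increase in throughput.

Insight: A modular, parallel design is key to efficient reasoning, and independent per-step reward functions improve transparency and error diagnosis.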

Abstract: Existing reinforcement learning methods for Chain-of-Thought reasoning suffer from two critical limitations. First, they operate as monolithic black boxes that provide undifferentiated reward signals, obscuring individual step contributions and hindering error diagnosis. Second, sequential decoding has O(n) time complexity. This makes real-time deployment impractical for complex reasoning tasks. We present DeCoRL (Decoupled Reasoning Chains via Coordinated Reinforcement Learning), a novel framework that transforms reasoning from sequential processing into collaborative modular orchestration. DeCoRL trains lightweight specialized models to generate reasoning sub-steps concurrently, eliminating sequential bottlenecks through parallel processing. To enable precise error attribution, the framework designs modular reward functions that score each sub-step independently. Cascaded DRPO optimization then coordinates these rewards while preserving inter-step dependencies. Comprehensive evaluation demonstrates state-of-the-art results across RM-Bench, RMB, and RewardBench, outperforming existing methods including large-scale models. DeCoRL delivers 3.8 times faster inference while maintaining superior solution quality and offers a 22.7% improvement in interpretability through explicit reward attribution. These advancements, combined with a 72.4% reduction in energy consumption and a 68% increase in throughput, make real-time deployment of complex reasoning systems a reality.

[281] A symbolic Perl algorithm for the unification of Nahuatl word spellings cs.CL

Juan-José Guzmán-Landa, Jesús Vázquez-Osorio, Juan-Manuel Torres-Moreno, Ligia Quintana Torres, Miguel Figueroa-Saavedra

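TL;DR: This paper presents a symbolic, regular-expression-based algorithm for automatically unifying the spelling of Nahuatl texts and validates it with a manual evaluation protocol.

Details

Motivation: Nahuatl texts are written in several orthographies, so an automatic unification method is needed to make text processing more efficient.

Result: Evaluators gave encouraging results for most of the desired features of the unified sentences.

Insight: Combining symbolic linguistic rules with regular expressions handles orthographic unification effectively, and the human evaluation supports the method's practical value.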

Abstract: In this paper, we describe a symbolic model for the automatic orthographic unification of Nawatl text documents. Our model is based on algorithms that we have previously used to analyze sentences in Nawatl, and on the corpus called $π$-yalli, consisting of texts in several Nawatl orthographies. Our automatic unification algorithm implements linguistic rules in symbolic regular expressions. We also present a manual evaluation protocol that we have proposed and implemented to assess the quality of the unified sentences generated by our algorithm, by testing them in a sentence semantic task. We have obtained encouraging results from the evaluators for most of the desired features of our artificially unified sentences.
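To make the idea of rule-based orthographic unification concrete, here is a minimal Python sketch (the paper itself uses Perl). The rewrite rules and the target spelling are illustrative assumptions chosen for this example, not the authors' rule set, and `unify` is a hypothetical helper name.

```python
import re

# Illustrative rewrite rules only (NOT the authors' rules): each pair maps a
# spelling found in some Nahuatl orthographies to one hypothetical target form.
# Order matters: multi-letter patterns run before single-letter ones.
RULES = [
    (re.compile(r"qu(?=[ei])"), "k"),     # que/qui -> ke/ki
    (re.compile(r"cu(?=[aeio])"), "kw"),  # cua -> kwa
    (re.compile(r"hu(?=[aeio])"), "w"),   # hua -> wa
    (re.compile(r"uh(?![aeio])"), "w"),   # syllable-final uh -> w
    (re.compile(r"tz"), "ts"),            # tz -> ts
    (re.compile(r"c(?=[ei])"), "s"),      # ce/ci -> se/si
    (re.compile(r"z"), "s"),              # z -> s
    (re.compile(r"c(?!h)"), "k"),         # remaining c -> k, but keep "ch"
]


def unify(text):
    """Apply every rule in order and return the unified spelling."""
    text = text.lower()
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


print(unify("Nahuatl"))    # nawatl
print(unify("cuauhtli"))   # kwawtli
print(unify("quetzalli"))  # ketsalli
```

Ordered lookahead rules keep each substitution local and easy to audit, which is the main appeal of a symbolic approach over a learned one when the corpus is small.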

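[282] On the Optimality of Discrete Object Naming: a Kinship Case Study cs.CL | cs.AI

Phong Le, Mees Lindeman, Raquel G. Alhama

TL;DR: Using an information-theoretic framework, the paper studies optimality in discrete object naming systems. It proves that an optimal trade-off between informativeness and complexity is achievable if and only if the listener's decoder is equivalent to the Bayesian decoder of the speaker, and validates this in the kinship domain.

Details

Motivation: The work studies how naming systems in natural languages balance informativeness against complexity, and addresses two simplifications of prior work: the assumption of optimal listeners and of a universal communicative need across languages.

Result: In the kinship domain, experiments show that the theoretically optimal naming systems also emerge empirically in learned communication systems.

Insight: A naming system is optimal only when the listener's decoder matches the speaker's Bayesian decoder, which gives both theoretical and empirical grounding for understanding naming systems in natural language.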

Abstract: The structure of naming systems in natural languages hinges on a trade-off between high informativeness and low complexity. Prior work capitalizes on information theory to formalize these notions; however, these studies generally rely on two simplifications: (i) optimal listeners, and (ii) universal communicative need across languages. Here, we address these limitations by introducing an information-theoretic framework for discrete object naming systems, and we use it to prove that an optimal trade-off is achievable if and only if the listener’s decoder is equivalent to the Bayesian decoder of the speaker. Adopting a referential game setup from emergent communication, and focusing on the semantic domain of kinship, we show that our notion of optimality is not only theoretically achievable but also emerges empirically in learned communication systems.
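The "Bayesian decoder of the speaker" can be illustrated with plain Bayes' rule: given a need distribution P(m) over referents and a speaker encoder P(w | m), the listener inverts it as P(m | w) ∝ P(m) P(w | m). The sketch below is a toy under that standard reading. The kinship referents, the probabilities, and the function name `bayesian_decoder` are assumptions for illustration, and the paper's formal definitions may differ.

```python
def bayesian_decoder(prior, encoder):
    """Return listener[w][m] = P(m | w) from P(m) and the speaker's P(w | m)."""
    words = {w for dist in encoder.values() for w in dist}
    listener = {}
    for w in words:
        # Joint probability P(m, w) for every referent m.
        joint = {m: prior[m] * encoder[m].get(w, 0.0) for m in prior}
        total = sum(joint.values())  # marginal P(w)
        if total > 0:
            listener[w] = {m: p / total for m, p in joint.items()}
    return listener


# Hypothetical need distribution over three kin referents.
prior = {"mother": 0.5, "maternal_aunt": 0.3, "paternal_aunt": 0.2}
# Hypothetical speaker: one word covers both aunts.
encoder = {
    "mother": {"mom": 1.0},
    "maternal_aunt": {"aunt": 1.0},
    "paternal_aunt": {"aunt": 1.0},
}

listener = bayesian_decoder(prior, encoder)
print(listener["aunt"])  # maternal_aunt: 0.6, paternal_aunt: 0.4, mother: 0.0
```

A listener that decodes "aunt" with any distribution other than this posterior loses information relative to what the speaker's encoder makes available, which is the intuition behind the equivalence condition.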

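[283] Eliciting Chain-of-Thought in Base LLMs via Gradient-Based Representation Optimization cs.CL

Zijian Wang, Yanxiang Ma, Chang Xu

TL;DR: The paper proposes a gradient-based representation optimization method that manipulates hidden states, framed as probabilistic conditional generation, to elicit chain-of-thought (CoT) reasoning in base large language models (LLMs), improving their performance on complex multi-step tasks.

Details

Motivation: Base LLMs are pre-trained on general text corpora and, lacking specialized training, often struggle with multi-step reasoning. Recent studies show latent reasoning potential in their hidden states, but existing manipulation methods such as linear activation steering are rigid and unconstrained, which leads to distribution shift and degraded text quality.

Result: On mathematical, commonsense, and logical reasoning benchmarks, the method consistently outperforms existing hidden-state steering methods, supporting its effectiveness and theoretical grounding.

Insight: 1. The hidden states of base LLMs carry latent reasoning ability that optimization can bring out; 2. A constrained optimization framework balances reasoning ability against text quality and avoids the limitations of earlier steering methods.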

Abstract: Chain-of-Thought (CoT) reasoning is a critical capability for large language models (LLMs), enabling them to tackle complex multi-step tasks. While base LLMs, pre-trained on general text corpora, often struggle with reasoning due to a lack of specialized training, recent studies reveal their latent reasoning potential tied to hidden states. However, existing hidden state manipulation methods, such as linear activation steering, suffer from limitations due to their rigid and unconstrained nature, often leading to distribution shifts and degraded text quality. In this work, we propose a novel approach for eliciting CoT reasoning from base LLMs through hidden state manipulation grounded in probabilistic conditional generation. By reformulating the challenge as an optimization problem with a balanced likelihood and prior regularization framework, our method guides hidden states toward reasoning-oriented trajectories while preserving linguistic coherence. Extensive evaluations across mathematical, commonsense, and logical reasoning benchmarks demonstrate that our approach consistently outperforms existing steering methods, offering a theoretically principled and effective solution for enhancing reasoning capabilities in base LLMs.
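The balance between a likelihood term and a prior regularizer can be pictured with a toy gradient ascent on a single hidden vector. This is a stand-in under loud assumptions, not the paper's objective: the "likelihood" is a hypothetical linear probe scored with a sigmoid, the "prior" is a quadratic penalty that keeps the vector near its original value `h0`, and `steer`, `lam`, and `lr` are names invented for the sketch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def steer(h0, w, lam=0.5, lr=0.1, steps=200):
    """Maximize log sigmoid(w . h) - lam * ||h - h0||^2 by gradient ascent.

    h0 : original hidden vector (the prior mean in this toy setting)
    w  : hypothetical probe direction that scores "reasoning-like" states
    """
    h = list(h0)
    for _ in range(steps):
        score = sum(wi * hi for wi, hi in zip(w, h))
        g_like = 1.0 - sigmoid(score)  # derivative of log sigmoid(score)
        h = [
            hi + lr * (g_like * wi - 2.0 * lam * (hi - h0i))
            for hi, wi, h0i in zip(h, w, h0)
        ]
    return h


h0 = [0.0, 0.0]
w = [1.0, -1.0]
h = steer(h0, w)
before = sigmoid(sum(wi * hi for wi, hi in zip(w, h0)))
after = sigmoid(sum(wi * hi for wi, hi in zip(w, h)))
print(before, after)  # the probe score rises while h stays close to h0
```

With `lam = 0` the vector would drift along `w` without bound, which is the unconstrained behavior the abstract associates with degraded text quality; the penalty term is what keeps the optimized state near the original distribution.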

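[284] Learning to Reason: Training LLMs with GPT-OSS or DeepSeek R1 Reasoning Traces cs.CL

Shaltiel Shmidman, Asher Fredman, Oleg Sudakov, Meriem Bendris

TL;DR: The paper studies how intermediate reasoning traces generated by gpt-oss and DeepSeek-R1 affect the performance of medium-sized LLMs on math problems, comparing accuracy and inference efficiency.

Details

Motivation: Reasoning traces generated by frontier LLMs such as DeepSeek-R1 and gpt-oss can serve as high-quality supervised data for teaching reasoning to small and medium-sized LLMs without expensive human curation.

Result: The paper analyzes how the two kinds of reasoning traces differ in their effect on model accuracy and inference efficiency.

Insight: Reasoning traces from frontier LLMs can act as low-cost, high-quality supervision that improves the reasoning ability of small and medium-sized models.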

Abstract: Test-time scaling, which leverages additional computation during inference to improve model accuracy, has enabled a new class of Large Language Models (LLMs) that are able to reason through complex problems by understanding the goal, turning this goal into a plan, working through intermediate steps, and checking their own work before answering. Frontier large language models with reasoning capabilities, such as DeepSeek-R1 and OpenAI’s gpt-oss, follow the same procedure when solving complex problems by generating intermediate reasoning traces before giving the final answer. Today, these models are being increasingly used to generate reasoning traces that serve as high-quality supervised data for post-training of small and medium-sized language models to teach reasoning capabilities without requiring expensive human curation. In this work, we compare the performance of medium-sized LLMs on math problems after post-training on two kinds of reasoning traces. We compare the impact of reasoning traces generated by DeepSeek-R1 and gpt-oss LLMs in terms of accuracy and inference efficiency.

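[285] DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research cs.CL | cs.AI | cs.LG

Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore

TL;DR: The paper proposes RLER, which combines reinforcement learning with rubrics that are updated during training, and uses it to train DR Tulu-8B, the first open model trained directly for long-form deep research. It outperforms existing open models across several domains and matches proprietary systems.

Details

Motivation: Most open deep research models are trained with RLVR on easily verifiable short-form QA, which does not extend to realistic long-form tasks. RLER addresses this with evolving rubrics.

Result: On four long-form deep research benchmarks, DR Tulu-8B outperforms existing open models and matches or exceeds proprietary systems.

Insight: Evolving rubrics are key to better long-form performance, and the open release of data, models, and code supports further work on deep research systems.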

Abstract: Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.

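[286] Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration cs.CL | cs.AI | cs.LG

James Y. Huang, Sheng Zhang, Qianchu Liu, Guanghui Qin, Tinghui Zhu

TL;DR: BeMyEyes is a modular, multi-agent framework in which an efficient vision-language model (VLM) collaborates with a powerful LLM, extending the LLM to multimodal reasoning without the cost of training a large-scale multimodal model.

Details

Motivation: LLMs are strong at knowledge-intensive reasoning, but extending them to a new modality usually requires costly development of large-scale VLMs, which is also inflexible. BeMyEyes aims to achieve multimodal reasoning through collaboration while preserving the LLM's generalization and reasoning abilities.

Result: On a wide range of knowledge-intensive multimodal tasks, a lightweight, fully open-source setup (text-only DeepSeek-R1 with a Qwen2.5-VL-7B perceiver) outperforms large-scale proprietary VLMs such as GPT-4o.

Insight: Multi-agent collaboration is a low-cost, flexible way to combine the strengths of existing models and avoid training a large model from scratch.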

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in challenging, knowledge-intensive reasoning tasks. However, extending LLMs to perceive and reason over a new modality (e.g., vision) often requires costly development of large-scale vision language models (VLMs) with LLMs as backbones. Smaller VLMs are more efficient and adaptable but often lack the broad knowledge and reasoning capabilities of frontier LLMs. In this work, we propose BeMyEyes, a modular, multi-agent framework for extending LLMs to multimodal reasoning by orchestrating collaboration between efficient, adaptable VLMs as perceivers and powerful LLMs as reasoners through conversations. We then introduce a data synthesis and supervised fine-tuning pipeline to train the perceiver agent to effectively collaborate with the reasoner agent. By combining the complementary strengths of perception and reasoning agents, BeMyEyes avoids the need for training large-scale multimodal models, preserves the generalization and reasoning capabilities of LLMs, and allows flexible extension to new domains and modalities. Experiments show that our framework unlocks multimodal reasoning capabilities for LLMs, enabling a lightweight and fully open-source solution, i.e., text-only DeepSeek-R1 equipped with a Qwen2.5-VL-7B perceiver, to outperform large-scale proprietary VLMs such as GPT-4o on a wide range of knowledge-intensive multimodal tasks. These results demonstrate the effectiveness, modularity, and scalability of our multi-agent approach for building future multimodal reasoning systems.

cs.RO

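[287] MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots cs.RO | cs.CV

Ting Huang, Dongjian Li, Rui Yang, Zeyu Zhang, Zida Yang

TL;DR: MobileVLA-R1 is a unified vision-language-action framework that tackles the problem of mapping natural-language instructions to continuous control for quadruped robots, using a large-scale dataset and two-stage training to improve reasoning consistency and control stability.

Details

Motivation: Existing methods fail to bridge high-level semantic reasoning and low-level actuation, leading to unstable grounding and weak generalization in the real world.

Result: Outperforms strong baselines on VLN and VLA tasks by about 5%, and shows robust performance on a real quadruped robot.

Insight: Multi-granularity reasoning supervision and two-stage training are key to better control stability and generalization.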

Abstract: Grounding natural-language instructions into continuous control for quadruped robots remains a fundamental challenge in vision-language-action (VLA) research. Existing methods struggle to bridge high-level semantic reasoning and low-level actuation, leading to unstable grounding and weak generalization in the real world. To address these issues, we present MobileVLA-R1, a unified vision-language-action framework that enables explicit reasoning and continuous control for quadruped robots. We construct MobileVLA-CoT, a large-scale dataset of multi-granularity chain-of-thought (CoT) for embodied trajectories, providing structured reasoning supervision for alignment. Built upon this foundation, we introduce a two-stage training paradigm that combines supervised CoT alignment with GRPO reinforcement learning to enhance reasoning consistency, control stability, and long-horizon execution. Extensive evaluations on VLN and VLA tasks demonstrate superior performance over strong baselines, with approximately a 5% improvement. Real-world deployment on a quadruped robot validates robust performance in complex environments. Code: https://github.com/AIGeeksGroup/MobileVLA-R1. Website: https://aigeeksgroup.github.io/MobileVLA-R1.

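[288] Observer Actor: Active Vision Imitation Learning with Sparse View Gaussian Splatting cs.RO | cs.CV | cs.LG

Yilong Wang, Cheng Qian, Ruomeng Fan, Edward Johns

TL;DR: ObAct dynamically assigns observer and actor roles and uses 3D Gaussian Splatting to optimize visual observations for active vision imitation learning. Experiments show large gains over static-camera setups in trajectory transfer and behavior cloning.

Details

Motivation: In robot manipulation, a static camera's view can be occluded or unclear, which hurts imitation learning. ObAct moves the observing camera to improve observation quality and thereby make policies more robust.

Result: Trajectory transfer improves by 145% without occlusion and 233% with occlusion, and behavior cloning by 75% and 143% respectively, relative to static-camera setups.

Insight: Actively adjusting the viewpoint is an effective way to improve imitation learning, especially under occlusion, where better observations markedly raise policy success rates.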

Abstract: We propose Observer Actor (ObAct), a novel framework for active vision imitation learning in which the observer moves to optimal visual observations for the actor. We study ObAct on a dual-arm robotic system equipped with wrist-mounted cameras. At test time, ObAct dynamically assigns observer and actor roles: the observer arm constructs a 3D Gaussian Splatting (3DGS) representation from three images, virtually explores this to find an optimal camera pose, then moves to this pose; the actor arm then executes a policy using the observer’s observations. This formulation enhances the clarity and visibility of both the object and the gripper in the policy’s observations. As a result, we enable the training of ambidextrous policies on observations that remain closer to the occlusion-free training distribution, leading to more robust policies. We study this formulation with two existing imitation learning methods – trajectory transfer and behavior cloning – and experiments show that ObAct significantly outperforms static-camera setups: trajectory transfer improves by 145% without occlusion and 233% with occlusion, while behavior cloning improves by 75% and 143%, respectively. Videos are available at https://obact.github.io.

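[289] Learning Visually Interpretable Oscillator Networks for Soft Continuum Robots from Video cs.RO | cs.CV | cs.LG

Henrik Krauss, Johann Licher, Naoya Takeishi, Annika Raatz, Takehisa Yairi

TL;DR: The paper combines data-driven learning with physical interpretability to learn dynamics models of soft continuum robots, using an attention mechanism and oscillator networks for accurate dynamics prediction and on-image visualization.

Details

Motivation: Data-driven methods lack physical interpretability, while model-based methods require prior knowledge and can be computationally expensive. The paper aims to bridge this gap with dynamics models that are both flexible and physically interpretable.

Result: Validated on single- and double-segment soft continuum robots. ABCD-based models improve multi-step prediction accuracy on the two-segment robot (5.7x error reduction for Koopman operators, 3.5x for oscillator networks), and the learned oscillator network autonomously discovers a chain structure.

Insight: A fully data-driven approach can yield compact, physically interpretable models suitable for control, and ABCD models extrapolate smoothly in latent space beyond the training data.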

Abstract: Data-driven learning of soft continuum robot (SCR) dynamics from high-dimensional observations offers flexibility but often lacks physical interpretability, while model-based approaches require prior knowledge and can be computationally expensive. We bridge this gap by introducing (1) the Attention Broadcast Decoder (ABCD), a plug-and-play module for autoencoder-based latent dynamics learning that generates pixel-accurate attention maps localizing each latent dimension’s contribution while filtering static backgrounds. (2) By coupling these attention maps to 2D oscillator networks, we enable direct on-image visualization of learned dynamics (masses, stiffness, and forces) without prior knowledge. We validate our approach on single- and double-segment SCRs, demonstrating that ABCD-based models significantly improve multi-step prediction accuracy: 5.7x error reduction for Koopman operators and 3.5x for oscillator networks on the two-segment robot. The learned oscillator network autonomously discovers a chain structure of oscillators. Unlike standard methods, ABCD models enable smooth latent space extrapolation beyond training data. This fully data-driven approach yields compact, physically interpretable models suitable for control applications.

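[290] Enhancing UAV Search under Occlusion using Next Best View Planning cs.RO | cs.CV

Sigrid Helene Strand, Thomas Wiedemann, Bram Burczek, Dmitriy Shutin

TL;DR: The paper presents an optimized planning strategy and algorithm for the next best view problem for UAVs in occluded environments. It proposes a geometry heuristic and a visibility heuristic to improve search performance, with the visibility heuristic performing better.

Details

Motivation: Search and rescue missions are critical after natural disasters and in high-risk environments, but occlusion (for example, dense forest) makes search harder. UAVs need to optimize camera position and perspective to search effectively.

Result: The visibility heuristic identifies over 90% of hidden objects in simulated forests, achieves 10% better detection rates than the geometry heuristic, and provides better coverage under the canopy in real-world experiments.

Insight: The visibility heuristic is the more promising of the two in occluded environments and can improve search and rescue missions.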

Abstract: Search and rescue missions are often critical following sudden natural disasters or in high-risk environmental situations. The most challenging search and rescue missions involve difficult-to-access terrains, such as dense forests with high occlusion. Deploying unmanned aerial vehicles for exploration can significantly enhance search effectiveness, facilitate access to challenging environments, and reduce search time. However, in dense forests, the effectiveness of unmanned aerial vehicles depends on their ability to capture clear views of the ground, necessitating a robust search strategy to optimize camera positioning and perspective. This work presents an optimized planning strategy and an efficient algorithm for the next best view problem in occluded environments. Two novel optimization heuristics, a geometry heuristic and a visibility heuristic, are proposed to enhance search performance by selecting optimal camera viewpoints. Comparative evaluations in both simulated and real-world settings reveal that the visibility heuristic achieves greater performance, identifying over 90% of hidden objects in simulated forests and offering 10% better detection rates than the geometry heuristic. Additionally, real-world experiments demonstrate that the visibility heuristic provides better coverage under the canopy, highlighting its potential for improving search and rescue missions in occluded environments.
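A next-best-view loop can be sketched as greedy coverage: at each step, pick the candidate viewpoint that reveals the most ground cells not yet seen. The sketch below is a generic illustration of visibility-driven view selection, not the paper's heuristic. The candidate names and visible-cell sets are made up, and in practice the sets would come from ray casting through an occlusion map.

```python
def next_best_view(candidates, seen):
    """Return (viewpoint, gain) for the view adding the most unseen cells.

    candidates : dict mapping viewpoint name -> set of visible ground cells
    seen       : set of ground cells already observed
    """
    best, best_gain = None, 0
    for name in sorted(candidates):  # sorted for deterministic tie-breaking
        gain = len(candidates[name] - seen)
        if gain > best_gain:
            best, best_gain = name, gain
    return best, best_gain


def plan(candidates, budget):
    """Greedily choose up to `budget` views; stop early when nothing new is visible."""
    seen, route = set(), []
    for _ in range(budget):
        view, _gain = next_best_view(candidates, seen)
        if view is None:
            break
        route.append(view)
        seen |= candidates[view]
    return route, seen


# Hypothetical visibility sets (cell ids visible through canopy gaps).
candidates = {
    "A": {1, 2, 3},
    "B": {3, 4},
    "C": {4, 5, 6, 7},
    "D": {1, 7},
}
route, seen = plan(candidates, budget=3)
print(route)  # ['C', 'A']
```

Scoring views by newly visible cells rather than by raw footprint is what makes the selection sensitive to occlusion: a view that overlaps heavily with what has already been seen scores low even if it covers a large area.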

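[291] AutoFocus-IL: VLM-based Saliency Maps for Data-Efficient Visual Imitation Learning without Extra Human Annotations cs.RO | cs.CV

Litian Gong, Fatemeh Bahrani, Yutai Zhou, Amin Banayeeanzade, Jiachen Li

TL;DR: AutoFocus-IL is a data-efficient visual imitation learning method that needs no extra human annotation. It uses vision-language models to generate temporal saliency maps that guide the policy toward task-relevant features, and outperforms standard behavior cloning and baselines that rely on human supervision.

Details

Motivation: Visual imitation learning often suffers from low data efficiency and poor generalization because policies attend to distractors and spurious correlations. Existing saliency regularization methods depend on costly human supervision such as gaze data or saliency annotations. AutoFocus-IL instead generates saliency maps automatically with vision-language models.

Result: In the CARLA simulator and real-robot experiments, AutoFocus-IL outperforms standard behavior cloning and baselines that rely on human supervision, with better data efficiency and generalization.

Insight: Vision-language models can generate task-relevant saliency maps automatically, cutting annotation cost while improving the performance and robustness of imitation learning.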

Abstract: AutoFocus-IL is a simple yet effective method to improve data efficiency and generalization in visual imitation learning by guiding policies to attend to task-relevant features rather than distractors and spurious correlations. Although saliency regularization has emerged as a promising way to achieve this, existing approaches typically require costly supervision such as human gaze data or manual saliency annotations. In contrast, AutoFocus-IL leverages vision-language models (VLMs) to automatically identify and track key objects in demonstrations, generating temporal saliency maps that highlight causal visual signals while suppressing distractors. These maps are then used to regularize behavior cloning policies, yielding stronger alignment between visual attention and task-relevant cues. Experiments in both the CARLA simulator and real-robot manipulation tasks demonstrate that AutoFocus-IL not only outperforms standard behavior cloning but also surpasses state-of-the-art baselines that assume privileged access to human supervision, such as gaze data. Code, datasets, and trained policy videos are available at https://AutoFocus-IL.github.io/.

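[292] Compressor-VLA: Instruction-Guided Visual Token Compression for Efficient Robotic Manipulation cs.RO | cs.CV | cs.LG

Juntao Gao, Feiyang Ye, Jing Zhang, Wenjing Qian

TL;DR: Compressor-VLA is an instruction-guided visual token compression framework that reduces the computational overhead of redundant visual tokens in robotic manipulation while preserving task-relevant visual information.

Details

Motivation: Vision-Language-Action (VLA) models are a powerful paradigm in Embodied AI, but the high cost of processing redundant visual tokens is a bottleneck for real-time robotic deployment. Existing task-agnostic token pruning methods struggle to preserve task-critical information.

Result: On the LIBERO benchmark, Compressor-VLA keeps a competitive success rate while reducing FLOPs by 59% and the visual token count by over 3x. Real-robot deployment validates its sim-to-real transferability.

Insight: Instruction guidance dynamically steers the model's perceptual focus toward task-relevant objects, which supports the effectiveness of the approach.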

Abstract: Vision-Language-Action (VLA) models have emerged as a powerful paradigm in Embodied AI. However, the significant computational overhead of processing redundant visual tokens remains a critical bottleneck for real-time robotic deployment. While standard token pruning techniques can alleviate this, these task-agnostic methods struggle to preserve task-critical visual information. To address this challenge while preserving both the holistic context and the fine-grained details needed for precise action, we propose Compressor-VLA, a novel hybrid instruction-conditioned token compression framework designed for efficient, task-oriented compression of visual information in VLA models. The proposed Compressor-VLA framework consists of two token compression modules: a Semantic Task Compressor (STC) that distills holistic, task-relevant context, and a Spatial Refinement Compressor (SRC) that preserves fine-grained spatial details. This compression is dynamically modulated by the natural language instruction, allowing for the adaptive condensation of task-relevant visual information. Extensive evaluations demonstrate that Compressor-VLA achieves a competitive success rate on the LIBERO benchmark while reducing FLOPs by 59% and the visual token count by over 3x compared to its baseline. Real-robot deployments on a dual-arm robot platform validate the model’s sim-to-real transferability and practical applicability. Moreover, qualitative analyses reveal that our instruction guidance dynamically steers the model’s perceptual focus toward task-relevant objects, thereby validating the effectiveness of our approach.

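[293] Mixture of Horizons in Action Chunking cs.RO | cs.AI | cs.CV

Dong Jing, Gang Wang, Jiaqi Liu, Weiliang Tang, Zelong Sun

TL;DR: The paper proposes a mixture of horizons (MoH) strategy that processes action chunk segments of different lengths in parallel and fuses the outputs, resolving the performance trade-off caused by a fixed action chunk length in VLA models.

Details

Motivation: VLA models for robotic manipulation are sensitive to the action chunk length (horizon): longer horizons give stronger global foresight but reduce fine-grained accuracy, while shorter horizons sharpen local control but struggle on long-term tasks. A single fixed horizon is therefore suboptimal.

Result: MoH yields consistent gains across the policies π0, π0.5, and πreg. In the mixed-task setting, π0.5 with MoH reaches a state-of-the-art 99% average success rate on LIBERO after only 30k training iterations.

Insight: Combining action horizons dynamically is an effective way to improve VLA models. MoH balances performance and efficiency and offers a flexible solution for complex tasks.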

Abstract: Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the action chunk length used during training, termed the horizon. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but degrade fine-grained accuracy, while shorter ones sharpen local control yet struggle on long-term tasks, implying that a fixed choice of a single horizon is suboptimal. To mitigate the trade-off, we propose a mixture of horizons (MoH) strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses outputs with a light linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalizability to complex tasks. 2) MoH is plug-and-play for full-attention action modules with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5x higher throughput than baselines while preserving superior performance. Extensive experiments over flow-based policies $\pi_0$, $\pi_{0.5}$, and one-step regression policy $\pi_{\text{reg}}$ demonstrate that MoH yields consistent and significant gains on both simulations and real-world tasks. Notably, under the mixed-task setting, $\pi_{0.5}$ with MoH reaches a new state-of-the-art with 99% average success rate on LIBERO after only 30k training iterations. Project page: https://github.com/Timsty1/MixtureOfHorizons

cs.GR

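[294] Inverse Rendering for High-Genus Surface Meshes from Multi-View Images cs.GR | cs.CV

Xiang Gao, Xinmu Wang, Xiaolong Wu, Jiazhi Li, Jingyu Shi

TL;DR: The paper proposes a topology-informed inverse rendering method for reconstructing high-genus surface meshes from multi-view images. Using adaptive V-cycle remeshing and a re-parametrized Adam optimizer, it addresses the topological and geometric shortcomings of existing methods and markedly improves reconstruction of high-genus surfaces.

Details

Motivation: Existing mesh-based inverse rendering methods often fail on high-genus surfaces, losing key topological features, and oversmooth low-genus surfaces, losing detail. The authors attribute this to overreliance on Adam-based optimizers, which can lead to vanishing and exploding gradients.

Result: The method achieves significant improvements in Chamfer Distance and Volume IoU over the current state-of-the-art method, particularly on high-genus surfaces, while also preserving more detail on low-genus surfaces.

Insight: 1. Topological information is central to inverse rendering, and geometry- and topology-aware optimization improves reconstruction quality. 2. Adaptive remeshing together with an improved optimizer is an effective way to mitigate gradient problems.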

Abstract: We present a topology-informed inverse rendering approach for reconstructing high-genus surface meshes from multi-view images. Compared to 3D representations like voxels and point clouds, mesh-based representations are preferred as they enable the application of differential geometry theory and are optimized for modern graphics pipelines. However, existing inverse rendering methods often fail catastrophically on high-genus surfaces, leading to the loss of key topological features, and tend to oversmooth low-genus surfaces, resulting in the loss of surface details. This failure stems from their overreliance on Adam-based optimizers, which can lead to vanishing and exploding gradients. To overcome these challenges, we introduce an adaptive V-cycle remeshing scheme in conjunction with a re-parametrized Adam optimizer to enhance topological and geometric awareness. By periodically coarsening and refining the deforming mesh, our method informs mesh vertices of their current topology and geometry before optimization, mitigating gradient issues while preserving essential topological features. Additionally, we enforce topological consistency by constructing topological primitives with genus numbers that match those of the ground truth using the Gauss-Bonnet theorem. Experimental results demonstrate that our inverse rendering approach outperforms the current state-of-the-art method, achieving significant improvements in Chamfer Distance and Volume IoU, particularly for high-genus surfaces, while also enhancing surface details for low-genus surfaces.
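The link between genus and the Gauss-Bonnet theorem can be checked on a mesh directly. For a closed, connected, orientable surface the total Gaussian curvature equals 2πχ, and for a triangle mesh the Euler characteristic is χ = V − E + F = 2 − 2g. The sketch below computes g from a face list. It illustrates the identity only and says nothing about how the paper builds its topological primitives; `genus` and `torus_faces` are hypothetical helper names.

```python
def genus(faces):
    """Genus of a closed, connected, orientable triangle mesh.

    faces : list of (i, j, k) vertex-index triples
    Uses chi = V - E + F and chi = 2 - 2g.
    """
    vertices = {v for face in faces for v in face}
    edges = {
        frozenset((face[i], face[(i + 1) % 3]))
        for face in faces
        for i in range(3)
    }
    chi = len(vertices) - len(edges) + len(faces)
    return (2 - chi) // 2


def torus_faces(n, m):
    """Triangulate an n x m grid with wrap-around in both directions (a torus)."""
    faces = []
    for i in range(n):
        for j in range(m):
            a = i * m + j
            b = ((i + 1) % n) * m + j
            c = i * m + (j + 1) % m
            d = ((i + 1) % n) * m + (j + 1) % m
            faces.append((a, b, d))
            faces.append((a, d, c))
    return faces


tetrahedron = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
print(genus(tetrahedron))        # 0  (sphere-like: V=4, E=6, F=4)
print(genus(torus_faces(4, 4)))  # 1  (V=16, E=48, F=32)
```

Because the genus depends only on vertex, edge, and face counts, it is cheap to verify after every remeshing step that a deforming mesh still has the intended topology.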

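[295] ChronoGS: Disentangling Invariants and Changes in Multi-Period Scenes cs.GR | cs.CV

Zhongtao Wang, Jiaqi Dai, Qingtian Zhu, Yilong Li, Mai Su

TL;DR: ChronoGS is a temporally modulated Gaussian representation that reconstructs multi-period scenes and disentangles stable from changing components. The authors also release the ChronoScene dataset as a benchmark.

Details

Motivation: Multi-period image collections are common in real-world applications, but existing methods break down under long-term, discontinuous changes. A solution is needed that reconstructs all periods in a unified way while separating stable and changing parts.

Result: Experiments show that ChronoGS outperforms baselines in reconstruction quality and temporal consistency.

Insight: By separating invariant from changing components, ChronoGS gives an efficient and consistent solution for reconstructing multi-period scenes.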

Abstract: Multi-period image collections are common in real-world applications. Cities are re-scanned for mapping, construction sites are revisited for progress tracking, and natural regions are monitored for environmental change. Such data form multi-period scenes, where geometry and appearance evolve. Reconstructing such scenes is an important yet underexplored problem. Existing pipelines rely on incompatible assumptions: static and in-the-wild methods enforce a single geometry, while dynamic ones assume smooth motion, both failing under long-term, discontinuous changes. To solve this problem, we introduce ChronoGS, a temporally modulated Gaussian representation that reconstructs all periods within a unified anchor scaffold. It is also designed to disentangle stable and evolving components, achieving temporally consistent reconstruction of multi-period scenes. To catalyze relevant research, we release the ChronoScene dataset, a benchmark of real and synthetic multi-period scenes, capturing geometric and appearance variation. Experiments demonstrate that ChronoGS consistently outperforms baselines in reconstruction quality and temporal consistency. Our code and the ChronoScene dataset are publicly available at https://github.com/ZhongtaoWang/ChronoGS.

cs.MA

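[296] From Competition to Coordination: Market Making as a Scalable Framework for Safe and Aligned Multi-Agent LLM Systems cs.MA | cs.AI | cs.CL

Brendan Gho, Suman Muppavarapu, Afnan Shaik, Tyson Tsay, James Begin

TL;DR: The paper proposes a market-based framework for coordinating multi-agent large language model (LLM) systems, in which agents pursue collective epistemic goals through economic exchange, improving accuracy, transparency, and interpretability.

Details

Motivation: As foundation models are deployed more widely in multi-agent systems, traditional coordination mechanisms such as centralized oversight or adversarial adjudication struggle to scale and lack transparency. A new approach is needed to ensure trustworthiness, transparency, and accountability in multi-agent LLM systems.

Result: Market-based coordination improves accuracy by up to 10% over single-shot baselines while preserving the interpretability and transparency of intermediate reasoning steps.

Insight: Market coordination principles offer a scalable, self-correcting path for multi-agent LLM systems toward socially responsible AI that maintains trust and oversight in real-world deployment.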

Abstract: As foundation models are increasingly deployed as interacting agents in multi-agent systems, their collective behavior raises new challenges for trustworthiness, transparency, and accountability. Traditional coordination mechanisms, such as centralized oversight or adversarial adjudication, struggle to scale and often obscure how decisions emerge. We introduce a market-making framework for multi-agent large language model (LLM) coordination that organizes agent interactions as structured economic exchanges. In this setup, each agent acts as a market participant, updating and trading probabilistic beliefs to converge toward shared, truthful outcomes. By aligning local incentives with collective epistemic goals, the framework promotes self-organizing, verifiable reasoning without requiring external enforcement. Empirically, we evaluate this approach across factual reasoning, ethical judgment, and commonsense inference tasks. Market-based coordination yields accuracy gains of up to 10% over single-shot baselines while preserving interpretability and transparency of intermediate reasoning steps. Beyond these improvements, our findings demonstrate that economic coordination principles can operationalize accountability and robustness in multi-agent LLM systems, offering a scalable pathway toward self-correcting, socially responsible AI capable of maintaining trust and oversight in real-world deployment scenarios.

q-bio.QM

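[297] TeamPath: Building MultiModal Pathology Experts with Reasoning AI Copilots q-bio.QM | cs.CV

Tianyu Liu, Weihao Xuan, Hao Wu, Peter Humphrey, Marcello DiStasio

TL;DR: The paper presents TeamPath, a system built on reinforcement learning and router-enhanced solutions with large-scale multimodal pathology data. It acts as an AI assistant that flexibly handles diagnosis, information summarization, and cross-modality generation, making pathologists' work more efficient.

Details

Motivation: Current pathology-specific vision-language models fall short on diagnosis that requires rigorous reasoning paths and on handling diverse tasks, so they do not yet meet the needs of AI copilots in real scenarios.

Result: TeamPath flexibly chooses the best settings for a given need and helps pathologists work more efficiently by identifying and correcting expert conclusions and reasoning paths.

Insight: Combining multimodal data with reinforcement learning lets an AI system show stronger adaptability and reliability on complex tasks.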

Abstract: Advances in AI have introduced several strong models in computational pathology, ushering it into the era of multi-modal diagnosis, analysis, and interpretation. However, current pathology-specific visual language models still lack the capacity to make diagnoses with rigorous reasoning paths and to handle divergent tasks, so challenges in building AI Copilots for real scenarios remain. Here we introduce TeamPath, an AI system powered by reinforcement learning and router-enhanced solutions based on large-scale histopathology multimodal datasets, to work as a virtual assistant for expert-level disease diagnosis, patch-level information summarization, and cross-modality generation to integrate transcriptomic information for clinical use. We also collaborate with pathologists from Yale School of Medicine to demonstrate that TeamPath can assist them in working more efficiently by identifying and correcting expert conclusions and reasoning paths. Overall, TeamPath can flexibly choose the best settings according to the needs, and serve as an innovative and reliable system for information communication across different modalities and experts.

cs.IR

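[298] Generative Query Expansion with Multilingual LLMs for Cross-Lingual Information Retrieval cs.IR | cs.AI | cs.CL

Olivia Macmillan-Scott, Roksana Goworek, Eda B. Özyiğit

TL;DR: The paper studies how generative query expansion with multilingual large language models (mLLMs) affects cross-lingual information retrieval, analyzing different prompting techniques and showing how linguistic disparities and training data format influence performance.

Details

Motivation: Query expansion is an important retrieval technique. Traditional methods rely on synonyms and related words, whereas mLLMs make it possible to generate pseudo-documents. The study examines what mLLMs contribute to cross-lingual retrieval and which factors drive performance.

Result: 1) Query length largely determines which prompting technique is effective; 2) linguistic disparities, especially between languages written in different scripts, cause substantial performance gaps; 3) fine-tuning helps only when training and test data have a similar format.

Insight: Cross-lingual query expansion helps most for languages with the weakest baselines, which underlines the need for more balanced multilingual training and evaluation resources.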

Abstract: Query expansion is the reformulation of a user query by adding semantically related information, and is an essential component of monolingual and cross-lingual information retrieval used to ensure that relevant documents are not missed. Recently, multilingual large language models (mLLMs) have shifted query expansion from semantic augmentation with synonyms and related words to pseudo-document generation. Pseudo-documents both introduce additional relevant terms and bridge the gap between short queries and long documents, which is particularly beneficial in dense retrieval. This study evaluates recent mLLMs and fine-tuned variants across several generative expansion strategies to identify factors that drive cross-lingual retrieval performance. Results show that query length largely determines which prompting technique is effective, and that more elaborate prompts often do not yield further gains. Substantial linguistic disparities persist: cross-lingual query expansion can produce the largest improvements for languages with the weakest baselines, yet retrieval is especially poor between languages written in different scripts. Fine-tuning is found to lead to performance gains only when the training and test data are of similar format. These outcomes underline the need for more balanced multilingual and cross-lingual training and evaluation resources.

cs.MM

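[299] Self-Empowering VLMs: Achieving Hierarchical Consistency via Self-Elicited Knowledge Distillation cs.MM | cs.CV

Wei Yang, Yiran Zhu, Zilin Li, Xunjia Zhang, Hongtao Wang

TL;DR: The paper proposes SEKD, a self-distillation method in which a vision-language model (VLM) reasons step by step and exposes the resulting knowledge as teacher signals, which a single-pass student then distills, improving hierarchical consistency.

Details

Motivation: Existing VLMs perform poorly on hierarchical understanding tasks. The main problem is their inability to maintain cross-level state, not a lack of taxonomic knowledge.

Result: SEKD improves in-domain path consistency (HCA) by up to +29.50 percentage points and raises zero-shot HCA on an unseen taxonomy from 4.15% to 42.26%.

Insight: Self-distillation is an efficient way to improve hierarchical consistency and scales to new taxonomies and datasets without annotation cost.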

Abstract: Vision-language models (VLMs) possess rich knowledge but often fail on hierarchical understanding tasks, where the goal is to predict a coarse-to-fine taxonomy path that remains consistent across all levels. We compare three inference paradigms for hierarchical VQA and find that stepwise reasoning, when conditioned on prior answers, significantly outperforms single-pass prompting. Further analysis indicates that the main limitation of current VLMs is their inability to maintain cross-level state, rather than a lack of taxonomic knowledge. Motivated by this diagnosis, we propose Self-Elicited Knowledge Distillation (SEKD), which requires no human labels or external tools: the same VLM is prompted to reason step by step and act as a teacher by exposing its hard labels, soft distributions, and decoder hidden states, while a single-pass student distills these signals. The student VLM remains efficient while approaching the accuracy of its multi-step teacher. It improves in-domain path consistency (HCA) by up to +29.50 percentage points, raises zero-shot HCA on an unseen taxonomy from 4.15% to 42.26%, and yields gains on challenging mathematical benchmarks. Because all supervision is self-elicited, SEKD scales to new taxonomies and datasets without annotation cost, providing a practical route to imbue compact VLMs with dependency-aware multi-step reasoning.

[300] Towards Generalizable Deepfake Detection via Forgery-aware Audio-Visual Adaptation: A Variational Bayesian Approach cs.MM | cs.CV

Fan Nie, Jiangqun Ni, Jian Zhang, Bin Zhang, Weizhe Zhang

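TL;DR: The paper proposes FoVB, a variational-Bayes-based audio-visual deepfake detection method that detects forgeries by capturing cross-modal inconsistencies.

Details

Motivation: The widespread use of AIGC content brings security risks such as audio-visual deepfakes, creating an urgent need for generalizable multi-modal forgery detection methods.

Result: FoVB outperforms existing methods on multiple benchmarks.

Insight: Variational Bayes combined with orthogonality-constrained feature factorization can improve both performance and generalization in cross-modal forgery detection.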

Abstract: The widespread application of AIGC contents has brought not only unprecedented opportunities, but also potential security concerns, e.g., audio-visual deepfakes. Therefore, it is of great importance to develop an effective and generalizable method for multi-modal deepfake detection. Typically, the audio-visual correlation learning could expose subtle cross-modal inconsistencies, e.g., audio-visual misalignment, which serve as crucial clues in deepfake detection. In this paper, we reformulate the correlation learning with variational Bayesian estimation, where audio-visual correlation is approximated as a Gaussian distributed latent variable, and thus develop a novel framework for deepfake detection, i.e., Forgery-aware Audio-Visual Adaptation with Variational Bayes (FoVB). Specifically, given the prior knowledge of pre-trained backbones, we adopt two core designs to estimate audio-visual correlations effectively. First, we exploit various difference convolutions and a high-pass filter to discern local and global forgery traces from both modalities. Second, with the extracted forgery-aware features, we estimate the latent Gaussian variable of audio-visual correlation via variational Bayes. Then, we factorize the variable into modality-specific and correlation-specific ones with orthogonality constraint, allowing them to better learn intra-modal and cross-modal forgery traces with less entanglement. Extensive experiments demonstrate that our FoVB outperforms other state-of-the-art methods in various benchmarks.

eess.IV

[301] Robust Detection of Retinal Neovascularization in Widefield Optical Coherence Tomography eess.IV | cs.CV

Jinyi Hao, Jie Wang, Kotaro Tsuboi, Liqin Gao, Tristan T. Hormel

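TL;DR: This paper presents a novel method for detecting retinal neovascularization (RNV) in widefield optical coherence tomography (OCT) that improves diagnostic and staging accuracy by treating detection as a direct binary localization task.

Details

Motivation: RNV is a vision-threatening development in diabetic retinopathy (DR), and timely intervention can prevent vision loss. Most current algorithms are optimized for narrow-field OCTA images, and effective analysis methods for widefield OCT/OCTA are lacking.

Result: Depending on the device, the method achieves an area under the curve (AUC) of 0.96 to 0.99 for RNV diagnosis and a mean intersection over union (IOU) of 0.76 to 0.88 for segmentation, demonstrating high accuracy in RNV diagnosis and staging.

Insight: Deep learning shows great potential for widefield OCTA image analysis and could provide a valuable tool for RNV screening and management.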

Abstract: Retinal neovascularization (RNV) is a vision threatening development in diabetic retinopathy (DR). Vision loss associated with RNV is preventable with timely intervention, making RNV clinical screening and monitoring a priority. Optical coherence tomography (OCT) angiography (OCTA) provides high-resolution imaging and high-sensitivity detection of RNV lesions. With recent commercial devices introducing widefield OCTA imaging to the clinic, the technology stands to improve early detection of RNV pathology. However, to meet clinical requirements these imaging capabilities must be combined with effective RNV detection and quantification, but existing algorithms for OCTA images are optimized for conventional, i.e. narrow, fields of view. Here, we present a novel approach for RNV diagnosis and staging on widefield OCT/OCTA. Unlike conventional methods dependent on multi-layer retinal segmentation, our model reframes RNV identification as a direct binary localization task. Our fully automated approach was trained and validated on 589 widefield scans (17x17-mm to 26x21-mm) collected from multiple devices at multiple clinics. Our method achieved a device-dependent area under curve (AUC) ranging from 0.96 to 0.99 for RNV diagnosis, and mean intersection over union (IOU) ranging from 0.76 to 0.88 for segmentation. We also demonstrate our method’s ability to monitor lesion growth longitudinally. Our results indicate that deep learning-based analysis for widefield OCTA images could offer a valuable means for improving RNV screening and management.
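
The segmentation figures above are reported as mean intersection over union (IOU). As a reminder of what that metric measures, here is a minimal sketch for a single pair of binary masks. The function name `mask_iou`, the nested-list mask format, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; the abstract does not describe the authors' evaluation code.

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0   # pixel marked in both masks
            union += 1 if (p or t) else 0    # pixel marked in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


predicted = [[1, 1],
             [0, 0]]
reference = [[1, 0],
             [1, 0]]
print(mask_iou(predicted, reference))
```

A mean IOU such as the 0.76 to 0.88 range quoted above would then be the average of this value over the evaluated scans.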

[302] Shape-Adapting Gated Experts: Dynamic Expert Routing for Colonoscopic Lesion Segmentation eess.IV | cs.AI | cs.CV

Gia Huy Thai, Hoang-Nguyen Vu, Anh-Minh Phan, Quang-Thinh Ly, Tram Dinh

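TL;DR: This paper proposes SAGE, a dynamic expert routing framework for colonoscopic lesion segmentation that uses dynamic routing and hierarchical gating to improve computational efficiency and adaptability.

Details

Motivation: In colonoscopic lesion segmentation, the diversity of cell scale and form leaves existing CNN-Transformer hybrids, which rely on static computation graphs, with limited adaptability and redundant computation.

Result: On three medical benchmarks, EBHI, DigestPath, and GlaS, the model achieves Dice scores of 95.57%, 95.16%, and 94.17%, respectively, outperforming existing methods.

Insight: Dynamic expert routing and hierarchical gating are effective ways to improve adaptability and computational efficiency, particularly for the diverse cell morphologies found in medical images.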

Abstract: The substantial diversity in cell scale and form remains a primary challenge in computer-aided cancer detection on gigapixel Whole Slide Images (WSIs), attributable to cellular heterogeneity. Existing CNN-Transformer hybrids rely on static computation graphs with fixed routing, which consequently causes redundant computation and limits their adaptability to input variability. We propose Shape-Adapting Gated Experts (SAGE), an input-adaptive framework that enables dynamic expert routing in heterogeneous visual networks. SAGE reconfigures static backbones into dynamically routed expert architectures. SAGE’s dual-path design features a backbone stream that preserves representation and selectively activates an expert path through hierarchical gating. This gating mechanism operates at multiple hierarchical levels, performing a two-level, hierarchical selection between shared and specialized experts to modulate model logits for Top-K activation. Our Shape-Adapting Hub (SA-Hub) harmonizes structural and semantic representations across the CNN and the Transformer module, effectively bridging diverse modules. Embodied as SAGE-UNet, our model achieves superior segmentation on three medical benchmarks: EBHI, DigestPath, and GlaS, yielding state-of-the-art Dice Scores of 95.57%, 95.16%, and 94.17%, respectively, and robustly generalizes across domains by adaptively balancing local refinement and global context. SAGE provides a scalable foundation for dynamic expert routing, enabling flexible visual reasoning.

[303] Neural B-Frame Coding: Tackling Domain Shift Issues with Lightweight Online Motion Resolution Adaptation eess.IV | cs.CV | cs.MM

Sang NguyenQuang, Xiem HoangVan, Wen-Hsiao Peng

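TL;DR: This paper proposes a lightweight online motion resolution adaptation method that uses classifiers to predict the downsampling factor, addressing the domain shift caused by mismatched training and test GOP sizes in B-frame codecs while significantly reducing computational complexity.

Details

Motivation: Existing learned B-frame codecs suffer from mismatched GOP sizes between training and testing, which leads to inaccurate motion estimation, especially for large motion. Conventional downsampling approaches require costly rate-distortion optimization; this work aims to predict the downsampling factor efficiently with lightweight classifiers.

Result: Experiments show that the classifier-based methods achieve coding performance close to exhaustive search while significantly reducing computational complexity.

Insight: Lightweight classifiers can efficiently address domain shift in B-frame coding without retraining the codec, offering a practical option for real-world use.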

Abstract: Learned B-frame codecs with hierarchical temporal prediction often encounter the domain-shift issue due to mismatches between the Group-of-Pictures (GOP) sizes for training and testing, leading to inaccurate motion estimates, particularly for large motion. A common solution is to turn large motion into small motion by downsampling video frames during motion estimation. However, determining the optimal downsampling factor typically requires costly rate-distortion optimization. This work introduces lightweight classifiers to predict downsampling factors. These classifiers leverage simple state signals from current and reference frames to balance rate-distortion performance with computational cost. Three variants are proposed: (1) a binary classifier (Bi-Class) trained with Focal Loss to choose between high and low resolutions, (2) a multi-class classifier (Mu-Class) trained with novel soft labels based on rate-distortion costs, and (3) a co-class approach (Co-Class) that combines the predictive capability of the multi-class classifier with the selective search of the binary classifier. All classifier methods can work seamlessly with existing B-frame codecs without requiring codec retraining. Experimental results show that they achieve coding performance comparable to exhaustive search methods while significantly reducing computational complexity. The code is available at: https://github.com/NYCU-MAPL/Fast-OMRA.git.
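
The abstract mentions that the binary classifier (Bi-Class) is trained with Focal Loss. For readers unfamiliar with it, the following is a minimal per-sample sketch of the standard binary focal loss. The function name and the default values `gamma=2.0` and `alpha=0.25` are illustrative assumptions; the abstract does not state the hyperparameters or the exact variant the authors use.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one sample.

    p: predicted probability of class 1, y: true label (0 or 1).
    gamma and alpha are assumed defaults, not values from the paper.
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
    p_t = p if y == 1 else 1.0 - p           # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # The (1 - p_t) ** gamma factor shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy sample (p_t = 0.9) contributes far less than a hard one (p_t = 0.6).
print(binary_focal_loss(0.9, 1), binary_focal_loss(0.6, 1))
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the ordinary cross-entropy, which makes the role of the modulating factor easy to see.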

cs.AI

[304] Bridging Symbolic Control and Neural Reasoning in LLM Agents: The Structured Cognitive Loop cs.AI | cs.CL

Myung Ho Kim

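TL;DR: The paper proposes SCL, a modular architecture that separates agent cognition into five phases (R-CCAM) and adds a Soft Symbolic Control mechanism to address entangled reasoning and execution, memory volatility, and uncontrolled action sequences in LLM agents.

Details

Motivation: To address fundamental architectural problems in LLM agents (entangled reasoning and execution, memory volatility, and uncontrolled action sequences), the paper proposes the SCL architecture, which aims to improve explainability and controllability through modular design and symbolic control.

Result: On multi-step conditional reasoning tasks, SCL achieves zero policy violations, eliminates redundant tool calls, and maintains complete decision traceability.

Insight: By combining expert-system principles with modern LLM capabilities, SCL offers a theoretical and practical path toward reliable, explainable, and governable AI agents.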

Abstract: Large language model agents suffer from fundamental architectural problems: entangled reasoning and execution, memory volatility, and uncontrolled action sequences. We introduce Structured Cognitive Loop (SCL), a modular architecture that explicitly separates agent cognition into five phases: Retrieval, Cognition, Control, Action, and Memory (R-CCAM). At the core of SCL is Soft Symbolic Control, an adaptive governance mechanism that applies symbolic constraints to probabilistic inference, preserving neural flexibility while restoring the explainability and controllability of classical symbolic systems. Through empirical validation on multi-step conditional reasoning tasks, we demonstrate that SCL achieves zero policy violations, eliminates redundant tool calls, and maintains complete decision traceability. These results address critical gaps in existing frameworks such as ReAct, AutoGPT, and memory-augmented approaches. Our contributions are threefold: (1) we situate SCL within the taxonomy of hybrid intelligence, differentiating it from prompt-centric and memory-only approaches; (2) we formally define Soft Symbolic Control and contrast it with neuro-symbolic AI; and (3) we derive three design principles for trustworthy agents: modular decomposition, adaptive symbolic governance, and transparent state management. We provide a complete open-source implementation demonstrating the R-CCAM loop architecture, alongside a live GPT-4o-powered travel planning agent. By connecting expert system principles with modern LLM capabilities, this work offers a practical and theoretically grounded path toward reliable, explainable, and governable AI agents. Code: https://github.com/enkiluv/scl-core-experiment Demo: https://scl-travel-planner.streamlit.app/

[305] PRInTS: Reward Modeling for Long-Horizon Information Seeking cs.AI | cs.CL | cs.LG

Jaewoo Lee, Archiki Prasad, Justin Chih-Yao Chen, Zaid Khan, Elias Stengel-Eskin

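TL;DR: PRInTS is a generative process reward model (PRM) that uses dense scoring and trajectory summarization to improve AI agents on long-horizon information-seeking tasks, outperforming existing reward models.

Details

Motivation: Existing PRMs are limited on multi-step information-seeking tasks: they cannot adequately evaluate tool interactions and reasoning over tool outputs, nor can they handle the rapidly growing context of long-horizon tasks.

Result: On the FRAMES, GAIA, and WebWalkerQA benchmarks, PRInTS significantly improves open-source models and specialized agents, matching or surpassing frontier models and outperforming other reward modeling baselines.

Insight: A generative PRM design can effectively address context management and multi-dimensional evaluation in long-horizon tasks.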

Abstract: Information-seeking is a core capability for AI agents, requiring them to gather and reason over tool-generated information across long trajectories. However, such multi-step information-seeking tasks remain challenging for agents backed by language models. While process reward models (PRMs) can guide agents by ranking candidate steps at test-time, existing PRMs, designed for short reasoning with binary judgment, cannot capture richer dimensions of information-seeking steps, such as tool interactions and reasoning over tool outputs, nor handle the rapidly growing context in long-horizon tasks. To address these limitations, we introduce PRInTS, a generative PRM trained with dual capabilities: (1) dense scoring based on the PRM’s reasoning across multiple step quality dimensions (e.g., interpretation of tool outputs, tool call informativeness) and (2) trajectory summarization that compresses the growing context while preserving essential information for step evaluation. Extensive evaluations across FRAMES, GAIA (levels 1-3), and WebWalkerQA (easy-hard) benchmarks on multiple models, along with ablations, reveal that best-of-n sampling with PRInTS enhances information-seeking abilities of open-source models as well as specialized agents, matching or surpassing the performance of frontier models with a much smaller backbone agent and outperforming other strong reward modeling baselines.

[306] GContextFormer: A global context-aware hybrid multi-head attention approach with scaled additive aggregation for multimodal trajectory prediction cs.AI | cs.CV | cs.LG | cs.MA | cs.RO | cs.SI

Yuzhi Chen, Yuanchang Xie, Lei Zhao, Pan Liu, Yajie Zou

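TL;DR: GContextFormer is a map-free multimodal trajectory prediction method that achieves intention-aligned multimodal prediction through global context-aware hybrid attention and scaled additive aggregation.

Details

Motivation: Conventional HD map-dependent models suffer from costly data acquisition, delayed updates, and prediction failures caused by corrupted inputs, while map-free methods lack global context, which leads to motion-intention misalignment.

Result: On eight highway-ramp scenarios from the TOD-VT dataset, GContextFormer outperforms existing methods and is more robust in high-curvature and transition zones.

Insight: 1. Global context and hybrid attention can compensate for the weaknesses of map-free methods; 2. The modular design supports extension to cross-domain multimodal reasoning tasks.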

Abstract: Multimodal trajectory prediction generates multiple plausible future trajectories to address vehicle motion uncertainty from intention ambiguity and execution variability. However, HD map-dependent models suffer from costly data acquisition, delayed updates, and vulnerability to corrupted inputs, causing prediction failures. Map-free approaches lack global context, with pairwise attention over-amplifying straight patterns while suppressing transitional patterns, resulting in motion-intention misalignment. This paper proposes GContextFormer, a plug-and-play encoder-decoder architecture with global context-aware hybrid attention and scaled additive aggregation achieving intention-aligned multimodal prediction without map reliance. The Motion-Aware Encoder builds scene-level intention prior via bounded scaled additive aggregation over mode-embedded trajectory tokens and refines per-mode representations under shared global context, mitigating inter-mode suppression and promoting intention alignment. The Hierarchical Interaction Decoder decomposes social reasoning into dual-pathway cross-attention: a standard pathway ensures uniform geometric coverage over agent-mode pairs while a neighbor-context-enhanced pathway emphasizes salient interactions, with gating module mediating their contributions to maintain coverage-focus balance. Experiments on eight highway-ramp scenarios from TOD-VT dataset show GContextFormer outperforms state-of-the-art baselines. Compared to existing transformer models, GContextFormer achieves greater robustness and concentrated improvements in high-curvature and transition zones via spatial distributions. Interpretability is achieved through motion mode distinctions and neighbor context modulation exposing reasoning attribution. The modular architecture supports extensibility toward cross-domain multimodal reasoning tasks. Source: https://fenghy-chen.github.io/sources/.

cs.CR

[307] FedPoisonTTP: A Threat Model and Poisoning Attack for Federated Test-Time Personalization cs.CR | cs.CV

Md Akil Raihan Iftee, Syed Md. Ahnaf Hasan, Amin Ahsan Ali, AKM Mahbubur Rahman, Sajib Mistry

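TL;DR: The paper introduces FedPoisonTTP, a grey-box attack framework targeting federated test-time personalization (TTP) that uses data poisoning during local adaptation to degrade both global and per-client performance.

Details

Motivation: Existing federated learning work overlooks the security risks of local adaptation at test time, in particular the vulnerabilities created by heterogeneous domain arrivals, diverse adaptation algorithms, and limited cross-client visibility.

Result: On corrupted vision benchmarks, compromised participants can substantially reduce test-time performance.

Insight: Federated learning has security vulnerabilities in the test-time adaptation phase, and more robust defenses are needed to prevent such attacks.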

Abstract: Test-time personalization in federated learning enables models at clients to adjust online to local domain shifts, enhancing robustness and personalization in deployment. Yet, existing federated learning work largely overlooks the security risks that arise when local adaptation occurs at test time. Heterogeneous domain arrivals, diverse adaptation algorithms, and limited cross-client visibility create vulnerabilities where compromised participants can craft poisoned inputs and submit adversarial updates that undermine both global and per-client performance. To address this threat, we introduce FedPoisonTTP, a realistic grey-box attack framework that explores test-time data poisoning in the federated adaptation setting. FedPoisonTTP distills a surrogate model from adversarial queries, synthesizes in-distribution poisons using feature-consistency, and optimizes attack objectives to generate high-entropy or class-confident poisons that evade common adaptation filters. These poisons are injected during local adaptation and spread through collaborative updates, leading to broad degradation. Extensive experiments on corrupted vision benchmarks show that compromised participants can substantially diminish overall test-time performance.

cs.LG

[308] Practical Machine Learning for Aphasic Discourse Analysis cs.LG | cs.AI | cs.CL

Jason M. Pittman, Anton Phillips, Yesenia Medina-Santos, Brielle C. Stark

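TL;DR: This study explores using machine learning (ML) to automate Correct Information Unit (CIU) analysis of speech from persons with aphasia, reducing clinicians' manual labor. An evaluation of five ML models finds that they excel at distinguishing words from non-words but still struggle to identify CIUs.

Details

Motivation: Clinical discourse analysis for persons with aphasia, such as CIU analysis, depends on manual coding by clinicians, which is time-consuming and labor-intensive. ML could automate this task and improve efficiency.

Result: All models perform very well on word vs. non-word classification (accuracy 0.995), but CIU identification accuracy is lower and more variable (k-NN is highest at 0.824). The AUC ranges also indicate that CIU identification is the harder task.

Insight: Current ML models are of limited use for automated CIU analysis, suggesting that future work should improve model design or introduce richer features to better capture the semantic and contextual information in discourse.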

Abstract: Analyzing spoken discourse is a valid means of quantifying language ability in persons with aphasia. There are many ways to quantify discourse, one common way being to evaluate the informativeness of the discourse. That is, given the total number of words produced, how many of those are context-relevant and accurate. This type of analysis is called Correct Information Unit (CIU) analysis and is one of the most prevalent discourse analyses used by speech-language pathologists (SLPs). Despite this, CIU analysis in the clinic remains limited due to the manual labor needed by SLPs to code and analyze collected speech. Recent advances in machine learning (ML) seek to augment such labor by automating modeling of propositional, macrostructural, pragmatic, and multimodal dimensions of discourse. To that end, this study evaluated five ML models for reliable identification of Correct Information Units (CIUs, Nicholas & Brookshire, 1993), during a picture description task. The five supervised ML models were trained using randomly selected human-coded transcripts and accompanying words and CIUs from persons with aphasia. The baseline model training produced a high accuracy across transcripts for word vs non-word, with all models achieving near perfect performance (0.995) with high AUC range (0.914 min, 0.995 max). In contrast, CIU vs non-CIU showed a greater variability, with the k-nearest neighbor (k-NN) model the highest accuracy (0.824) and second highest AUC (0.787). These findings indicate that while the supervised ML models can distinguish word from not word, identifying CIUs is challenging.

[309] Efficient Mathematical Reasoning Models via Dynamic Pruning and Knowledge Distillation cs.LG | cs.AI | cs.CL

Fengming Yu, Qingyu Meng, Haiwei Pan, Kejia Zhang

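TL;DR: This paper proposes a lightweight optimization method that combines dynamic attention head pruning with knowledge distillation, significantly reducing the computational and storage costs of large language models while maintaining strong performance on mathematical reasoning tasks.

Details

Motivation: Large language models perform well on complex tasks such as mathematical reasoning, but their high computational and storage costs limit practical deployment. The paper aims to make models efficient and lightweight through dynamic pruning and knowledge distillation.

Result: On Math23k with a 30% pruning ratio, parameters are reduced by 18.7%, inference speed improves by 27.5%, FLOPs drop by 19.3%, and accuracy falls by only 0.7 percentage points (84.4% → 83.7%).

Insight: Combining dynamic pruning with knowledge distillation is a viable way to slim down large language models, substantially reducing computational overhead while preserving reasoning ability.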

Abstract: With the rapid development of deep learning, large language models have shown strong capabilities in complex reasoning tasks such as mathematical equation solving. However, their substantial computational and storage costs hinder practical deployment. This paper proposes a lightweight optimization method that integrates dynamic attention head pruning with knowledge distillation. The approach dynamically evaluates the importance of each attention head in the multi-head attention mechanism using a combination of weight norms and entropy, and prunes redundant heads in real time to reduce computational overhead. To mitigate performance degradation, knowledge distillation transfers information from the original model to the pruned student, enabling the smaller model to preserve reasoning ability. Experiments conducted on both Math23k and ASDiv-A verify the effectiveness of the proposed method. For example, on Math23k with a 30% pruning ratio, parameters are reduced by 18.7%, inference speed is improved by 27.5%, FLOPs are reduced by 19.3%, and accuracy drops only 0.7% (from 84.4% to 83.7%). These results demonstrate that the method achieves substantial efficiency gains while maintaining strong reasoning performance, providing a practical solution for efficient deployment of large language models in mathematical reasoning tasks.
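
The abstract says head importance is computed from "a combination of weight norms and entropy" but does not give the formula. The sketch below shows one plausible way such a score could be built, purely as an illustration. The weighted-sum form, the mixing weight `lam`, the choice to treat low attention entropy as a sign of importance, and all function names are assumptions, not the authors' method.

```python
import math


def attention_entropy(probs):
    """Shannon entropy of one attention distribution (a list of probabilities)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def head_scores(weight_norms, attn_rows, lam=0.5):
    """Assumed score: lam * normalized weight norm + (1 - lam) * attention focus.

    attn_rows holds one attention distribution per head, each with at least
    two entries so that the maximum entropy log(n) is non-zero.
    """
    max_norm = max(weight_norms)
    max_entropy = math.log(len(attn_rows[0]))  # entropy of a uniform distribution
    scores = []
    for norm, row in zip(weight_norms, attn_rows):
        focus = 1.0 - attention_entropy(row) / max_entropy  # 0 = uniform, 1 = one-hot
        scores.append(lam * norm / max_norm + (1.0 - lam) * focus)
    return scores


def heads_to_keep(scores, prune_ratio):
    """Indices of the heads that survive after dropping the lowest-scoring ones."""
    n_prune = int(len(scores) * prune_ratio)
    order = sorted(range(len(scores)), key=lambda h: scores[h])
    return sorted(order[n_prune:])


norms = [1.0, 0.2, 0.9, 0.8]
attention = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],   # uniform attention and a small weight norm
    [0.70, 0.10, 0.10, 0.10],
    [0.85, 0.05, 0.05, 0.05],
]
print(heads_to_keep(head_scores(norms, attention), prune_ratio=0.3))
```

In a real model the scores would be recomputed during inference and the pruned student would then be trained with a distillation loss against the original model; both steps are outside the scope of this sketch.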

[310] Llamazip: Leveraging LLaMA for Lossless Text Compression and Training Dataset Detection cs.LG | cs.CL

Sören Dréano, Derek Molloy, Noel Murphy

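TL;DR: Llamazip is a lossless text compression algorithm based on the LLaMA3 language model that significantly reduces data size by storing only the tokens the model fails to predict; the paper also analyzes how quantization and context window size affect its performance. In addition, Llamazip can identify whether a document was part of a language model's training dataset.

Details

Motivation: The aim is to use the predictive capabilities of the LLaMA3 language model for efficient lossless text compression and to address concerns about data provenance and transparency in language model training.

Result: Experiments indicate that Llamazip performs well in both compression efficiency and training-dataset detection.

Insight: The study shows that a language model's predictive capability can be used for lossless compression and data provenance analysis, providing a new tool for model transparency.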

Abstract: This work introduces Llamazip, a novel lossless text compression algorithm based on the predictive capabilities of the LLaMA3 language model. Llamazip achieves significant data reduction by only storing tokens that the model fails to predict, optimizing storage efficiency without compromising data integrity. Key factors affecting its performance, including quantization and context window size, are analyzed, revealing their impact on compression ratios and computational requirements. Beyond compression, Llamazip demonstrates the potential to identify whether a document was part of the training dataset of a language model. This capability addresses critical concerns about data provenance, intellectual property, and transparency in language model training.
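
The core idea, storing only the tokens the model fails to predict, can be illustrated with a toy round trip. In the sketch below a trivial deterministic predictor stands in for LLaMA3; `predict_next`, `compress`, `decompress`, and the use of `None` as the "predicted correctly" marker are all hypothetical and are not taken from the Llamazip implementation, whose encoding details the abstract does not describe.

```python
def predict_next(context):
    """Hypothetical stand-in for the language model.

    Predicts the token that most recently followed the last context token,
    or None when there is no earlier occurrence to copy from.
    """
    if not context:
        return None
    last = context[-1]
    for i in range(len(context) - 2, -1, -1):
        if context[i] == last:
            return context[i + 1]
    return None


def compress(tokens):
    """Emit None where the predictor is right, the literal token where it is wrong."""
    stream = []
    for i, token in enumerate(tokens):
        stream.append(None if predict_next(tokens[:i]) == token else token)
    return stream


def decompress(stream):
    """Rebuild the text by re-running the same predictor on the decoded prefix."""
    tokens = []
    for item in stream:
        tokens.append(predict_next(tokens) if item is None else item)
    return tokens


text = "the cat sat on the mat and the cat sat on the mat".split()
packed = compress(text)
print(packed)
print(decompress(packed) == text)
```

Losslessness depends on the compressor and decompressor running exactly the same deterministic predictor, which is presumably why the abstract singles out factors such as quantization and context window size. A stronger predictor leaves fewer literal tokens to store.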

[311] DeepCoT: Deep Continual Transformers for Real-Time Inference on Data Streams cs.LG | cs.CL | cs.CV

Ginés Carreto Picón, Peng Yuan Zhou, Qi Zhang, Alexandros Iosifidis

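TL;DR: DeepCoT is a deep continual Transformer designed for real-time inference on data streams; it achieves efficient, low-latency inference by eliminating redundant computation and applies to audio, video, and text stream tasks.

Details

Motivation: As Transformer models grow, demand for low-latency inference on resource-constrained devices is increasing, especially for stream tasks, where conventional sliding-window inference causes heavy redundant computation. Existing Continual Transformers work only for shallow models, which limits their generalization power.

Result: On audio, video, and text stream tasks, DeepCoT performs comparably to its non-continual baselines while reducing running time by up to two orders of magnitude.

Insight: By eliminating redundant computation, DeepCoT offers an efficient solution for real-time stream tasks and demonstrates the potential of deep continual Transformers.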

Abstract: Transformer-based models have dramatically increased their size and parameter count to tackle increasingly complex tasks. At the same time, there is a growing demand for low-latency inference on resource-constrained devices that achieves high performance. In particular, stream data inference is typically performed over a sliding temporal window, leading to highly redundant computations. The recent Continual Transformers have addressed this issue, but they can only be effectively used in shallow models, which limits their scope and generalization power. In this paper, we propose the Deep Continual Transformer (DeepCoT), a redundancy-free encoder-only model that can be applied over existing deep encoder architectures with minimal changes. In our experiments over audio, video, and text streams, we show that DeepCoTs retain comparative performance to their non-continual baselines while offering a linear computational cost for all Transformer layers, which reduces up to two orders of magnitude in the running time compared to previous efficient models.

[312] Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch cs.LG | cs.CL | stat.ML

Ziyang Zhang, Xinheng Ding, Jiayi Yuan, Rixin Liu, Huizi Mao

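TL;DR: The paper proposes Tree-Based Invariant Kernels (TBIK), a set of matrix multiplication and reduction primitives that guarantee bit-wise identical results across tensor parallel (TP) sizes, addressing the non-deterministic outputs caused by mismatched training and inference configurations.

Details

Motivation: Demand for deterministic inference in large language model (LLM) applications is growing, but existing frameworks produce non-deterministic outputs because floating-point arithmetic is non-associative and reduction orders are inconsistent across GPUs. Mismatched training and inference configurations (such as TP size) particularly hurt reinforcement learning (RL).

Result: Experiments confirm zero probability divergence and bit-wise reproducibility across TP sizes, as well as bit-wise identical outputs under different parallel strategies in RL training.

Insight: Aligning reduction orders to resolve cross-GPU inconsistency is the key to deterministic inference, especially where training and inference configurations differ, as in RL.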

Abstract: Deterministic inference is increasingly critical for large language model (LLM) applications such as LLM-as-a-judge evaluation, multi-agent systems, and Reinforcement Learning (RL). However, existing LLM serving frameworks exhibit non-deterministic behavior: identical inputs can yield different outputs when system configurations (e.g., tensor parallel (TP) size, batch size) vary, even under greedy decoding. This arises from the non-associativity of floating-point arithmetic and inconsistent reduction orders across GPUs. While prior work has addressed batch-size-related nondeterminism through batch-invariant kernels, determinism across different TP sizes remains an open problem, particularly in RL settings, where the training engine typically uses Fully Sharded Data Parallel (i.e., TP = 1) while the rollout engine relies on multi-GPU TP to maximize the inference throughput, creating a natural mismatch between the two. This precision mismatch problem may lead to suboptimal performance or even collapse for RL training. We identify and analyze the root causes of TP-induced inconsistency and propose Tree-Based Invariant Kernels (TBIK), a set of TP-invariant matrix multiplication and reduction primitives that guarantee bit-wise identical results regardless of TP size. Our key insight is to align intra- and inter-GPU reduction orders through a unified hierarchical binary tree structure. We implement these kernels in Triton and integrate them into vLLM and FSDP. Experiments confirm zero probability divergence and bit-wise reproducibility for deterministic inference across different TP sizes. Also, we achieve bit-wise identical results between vLLM and FSDP in RL training pipelines with different parallel strategy. Code is available at https://github.com/nanomaoli/llm_reproducibility.
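
The root cause named in the abstract, non-associative floating-point addition combined with varying reduction order, can be reproduced in a few lines, as can the remedy of fixing one binary reduction tree. The sketch below is a CPU-only toy under stated assumptions: the function names are hypothetical, each simulated "rank" holds a contiguous power-of-two shard, and nothing here reflects the actual Triton kernels in the authors' repository.

```python
import random


def tree_sum(values):
    """Pairwise binary-tree sum; len(values) must be a power of two."""
    vals = list(values)
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]


def sharded_tree_sum(values, tp_size):
    """Simulate tp_size ranks: each tree-sums its shard, then partials are tree-summed.

    With contiguous power-of-two shards, the intra-rank and inter-rank trees
    together form the same global binary tree for every tp_size.
    """
    shard = len(values) // tp_size
    partials = [tree_sum(values[r * shard:(r + 1) * shard]) for r in range(tp_size)]
    return tree_sum(partials)


# Floating-point addition is not associative: grouping changes the result.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))

# Values spanning many magnitudes make rounding differences likely.
rng = random.Random(0)
values = [rng.uniform(-1.0, 1.0) * 10.0 ** rng.randint(-6, 6) for _ in range(16)]
results = {tp: sharded_tree_sum(values, tp) for tp in (1, 2, 4, 8)}
print(len(set(results.values())) == 1)
```

Because every simulated TP size performs the same additions in the same order, the results agree bit for bit. A naive per-rank left-to-right accumulation offers no such guarantee, since the grouping of additions changes with the shard size.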

[313] RAVEN++: Pinpointing Fine-Grained Violations in Advertisement Videos with Active Reinforcement Reasoning cs.LG | cs.CL

Deyi Ji, Yuekui Yang, Liqun Liu, Peng Shu, Haiyang Wu

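TL;DR: RAVEN++ is a novel framework that improves detection of fine-grained violations in advertisement videos through active reinforcement learning, fine-grained violation understanding, and progressive multi-stage training.

Details

Motivation: Existing models such as RAVEN perform well on coarse-grained violation detection but fall short in fine-grained understanding, explainability, and generalization.

Result: On public and proprietary datasets, RAVEN++ outperforms general-purpose LLMs and specialized models such as RAVEN in fine-grained understanding, reasoning capability, and generalization.

Insight: Active reinforcement learning and multi-stage training are effective ways to improve fine-grained violation detection.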

Abstract: Advertising (Ad) is a cornerstone of the digital economy, yet the moderation of video advertisements remains a significant challenge due to their complexity and the need for precise violation localization. While recent advancements, such as the RAVEN model, have improved coarse-grained violation detection, critical gaps persist in fine-grained understanding, explainability, and generalization. To address these limitations, we propose RAVEN++, a novel framework that introduces three key innovations: 1) Active Reinforcement Learning (RL), which dynamically adapts training to samples of varying difficulty; 2) Fine-Grained Violation Understanding, achieved through hierarchical reward functions and reasoning distillation; and 3) Progressive Multi-Stage Training, which systematically combines knowledge injection, curriculum-based passive RL, and active RL. Extensive experiments on both public and proprietary datasets, on both offline scenarios and online deployed A/B Testing, demonstrate that RAVEN++ outperforms general-purpose LLMs and specialized models like RAVEN in terms of fine-grained violation understanding, reasoning capabilities, and generalization ability.

[314] A Nutrition Multimodal Photoplethysmography Language Model cs.LG | cs.AI | cs.CL

Kyle Verrier, Achille Nazaret, Joseph Futoma, Andrew C. Miller, Guillermo Sapiro

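TL;DR: This paper proposes a nutrition multimodal photoplethysmography language model (NPLM) that combines PPG data from wearables with meal descriptions, using embedding projection to enable joint reasoning over physiology and meal context and improving the accuracy of caloric intake prediction.

Details

Motivation: Hunger and satiety dynamics strongly influence dietary behavior and metabolic health but are hard to capture in everyday settings. The authors aim to enable noninvasive dietary monitoring at scale by combining wearable PPG data with meal descriptions.

Result: NPLM improves daily caloric intake prediction by 11% over text-only baselines, and accuracy is maintained even when 80% of the meal text is removed. An independent validation study (n=140) replicated these findings.

Insight: The study demonstrates the potential of combining physiological data from consumer wearables with meal information, pointing to a new approach for large-scale noninvasive dietary monitoring.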

Abstract: Hunger and satiety dynamics shape dietary behaviors and metabolic health, yet remain difficult to capture in everyday settings. We present a Nutrition Photoplethysmography Language Model (NPLM), integrating continuous photoplethysmography (PPG) from wearables with meal descriptions. NPLM projects PPG into embeddings interpretable by language models, enabling joint reasoning over physiology and meal context. Trained on 19,340 participants and 1.1 million meal-PPG pairs, the model improved daily caloric intake prediction by 11% over text-only baselines, with accuracy maintained when 80% of meal text was removed. In an independent validation study (n=140) with controlled dining and detailed meal information, the model replicated these findings. These results demonstrate the value of integrating physiological measurements from consumer wearables with meal information for noninvasive dietary monitoring at scale.

[315] CDLM: Consistency Diffusion Language Models For Faster Sampling cs.LG | cs.CL

Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun

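TL;DR: CDLM uses consistency modeling and a block-wise causal attention mask to sharply reduce the sampling steps of diffusion language models and enable KV caching, achieving 3.6x-14.5x lower latency while maintaining competitive accuracy.

Details

Motivation: Diffusion language models (DLMs) are strong at parallel generation but slow at inference, because they need many refinement steps and cannot use standard KV caching.

Result: On math and coding tasks, CDLM achieves 3.6x-14.5x lower latency while maintaining competitive accuracy.

Insight: As a training-based acceleration method, CDLM shows that diffusion models can be efficient for language generation.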

Abstract: Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and the inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language Models), a training-based acceleration method that simultaneously tackles both bottlenecks. CDLM integrates consistency modeling to drastically reduce the number of required sampling steps by enabling multi-token finalization. Furthermore, we enforce a block-wise causal attention mask during fine-tuning, making the model fully compatible with KV caching. Experiments show CDLM achieves 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks. The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.
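
To make the "block-wise causal attention mask" concrete, here is a small sketch of the usual construction: tokens attend freely within their own block and to all earlier blocks, but never to later blocks, so keys and values of finished blocks can be cached. The function name and the fixed-size contiguous blocks are assumptions; the abstract does not spell out the exact mask CDLM uses.

```python
def block_causal_mask(seq_len, block_size):
    """mask[i][j] is True when query position i may attend to key position j.

    Assumed rule: j is visible to i if j's block is not later than i's block.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Print the mask for 6 tokens in blocks of 2 ('x' = may attend, '.' = masked).
for row in block_causal_mask(seq_len=6, block_size=2):
    print("".join("x" if allowed else "." for allowed in row))
```

With `block_size=1` this reduces to the ordinary causal mask, and with `block_size=seq_len` it becomes full bidirectional attention, so the block size interpolates between the two regimes.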

[316] BOOD: Boundary-based Out-Of-Distribution Data Generation cs.LG | cs.CV

Qilin Liao, Shuo Yang, Bo Zhao, Ping Luo, Hengshuang Zhao

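TL;DR: BOOD is a boundary-based OOD data generation framework that synthesizes high-quality OOD features and outlier images with diffusion models, significantly improving OOD detection performance.

Details

Motivation: Existing OOD data generation methods struggle to extract effective features beyond the ID boundary in latent space, which limits gains in OOD detection performance.

Result: On CIFAR-100, BOOD significantly outperforms existing methods, lowering average FPR95 by 29.64 percentage points (40.31% vs. 10.67%) and raising average AUROC by 7.27 percentage points (90.15% vs. 97.42%).

Insight: Generating OOD data by perturbing features near the ID decision boundary is an efficient and effective approach that significantly improves OOD detection.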

Abstract: Harnessing the power of diffusion models to synthesize auxiliary training data based on latent space features has proven effective in enhancing out-of-distribution (OOD) detection performance. However, extracting effective features outside the in-distribution (ID) boundary in latent space remains challenging due to the difficulty of identifying decision boundaries between classes. This paper proposes a novel framework called Boundary-based Out-Of-Distribution data generation (BOOD), which synthesizes high-quality OOD features and generates human-compatible outlier images using diffusion models. BOOD first learns a text-conditioned latent feature space from the ID dataset, selects ID features closest to the decision boundary, and perturbs them to cross the decision boundary to form OOD features. These synthetic OOD features are then decoded into images in pixel space by a diffusion model. Compared to previous works, BOOD provides a more training efficient strategy for synthesizing informative OOD features, facilitating clearer distinctions between ID and OOD data. Extensive experimental results on common benchmarks demonstrate that BOOD surpasses the state-of-the-art method significantly, achieving a 29.64% decrease in average FPR95 (40.31% vs. 10.67%) and a 7.27% improvement in average AUROC (90.15% vs. 97.42%) on the CIFAR-100 dataset.
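
The two feature-space steps described above, picking ID features closest to the decision boundary and perturbing them until they cross it, can be illustrated on a toy linear classifier. Everything in this sketch is an assumption made for illustration: the linear boundary `w . x + b = 0`, the function names, and the closed-form step with a small `overshoot`. BOOD itself works in a learned text-conditioned latent space and then decodes the perturbed features with a diffusion model, which is not modeled here.

```python
import math


def margin(x, w, b):
    """Signed score of feature x under the linear classifier w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def closest_to_boundary(features, w, b, k):
    """Return the k features with the smallest distance to the boundary."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return sorted(features, key=lambda x: abs(margin(x, w, b)) / norm)[:k]


def push_across(x, w, b, overshoot=0.1):
    """Move x along the boundary normal until it lands slightly on the other side."""
    m = margin(x, w, b)
    norm_sq = sum(wi * wi for wi in w)
    scale = (m / norm_sq) * (1.0 + overshoot)   # projection step plus a small overshoot
    return [xi - scale * wi for xi, wi in zip(x, w)]


w, b = [1.0, 0.0], 0.0
id_features = [[2.0, 1.0], [0.5, -1.0], [3.0, 0.0]]
for x in closest_to_boundary(id_features, w, b, k=2):
    moved = push_across(x, w, b)
    print(x, margin(x, w, b), "->", moved, margin(moved, w, b))
```

After the step the margin equals `-overshoot` times the original margin, so the sign always flips for features that do not lie exactly on the boundary.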

[317] Classification of Transient Astronomical Object Light Curves Using LSTM Neural Networks cs.LG | astro-ph.IM | cs.AI | cs.CVPDF

Guilherme Grancho D. Fernandes, Marco A. Barroca, Mateus dos Santos, Rafael S. Oliveira

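TL;DR: This paper proposes a bidirectional LSTM neural network for classifying transient astronomical object light curves from the PLAsTiCC dataset. Regrouping the classes and preprocessing improve performance on some categories, but class imbalance and limited temporal information remain problems.

Details

Motivation: Classifying astronomical transients is essential for understanding cosmic phenomena, but class imbalance and the complexity of temporal information limit the effectiveness of traditional methods.

Result: The model performs well on the S-Like and Periodic classes (ROC AUC of 0.95 and 0.99, respectively) but poorly on the Fast and Long classes (ROC AUC of 0.68 for the Long class). Classification performance drops substantially on partial light curve data.

Insight: Class imbalance and limited temporal information are the main limiting factors; class balancing strategies and preprocessing focused on detection moments could further improve performance.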

Abstract: This study presents a bidirectional Long Short-Term Memory (LSTM) neural network for classifying transient astronomical object light curves from the Photometric LSST Astronomical Time-series Classification Challenge (PLAsTiCC) dataset. The original fourteen object classes were reorganized into five generalized categories (S-Like, Fast, Long, Periodic, and Non-Periodic) to address class imbalance. After preprocessing with padding, temporal rescaling, and flux normalization, a bidirectional LSTM network with masking layers was trained and evaluated on a test set of 19,920 objects. The model achieved strong performance for S-Like and Periodic classes, with ROC area under the curve (AUC) values of 0.95 and 0.99, and Precision-Recall AUC values of 0.98 and 0.89, respectively. However, performance was significantly lower for Fast and Long classes (ROC AUC of 0.68 for Long class), and the model exhibited difficulty distinguishing between Periodic and Non-Periodic objects. Evaluation on partial light curve data (5, 10, and 20 days from detection) revealed substantial performance degradation, with increased misclassification toward the S-Like class. These findings indicate that class imbalance and limited temporal information are primary limitations, suggesting that class balancing strategies and preprocessing techniques focusing on detection moments could improve performance.

Zhiwen Qiu, Ziang Liu, Wenqian Niu, Tapomayukh Bhattacharjee, Saleh Kalantari

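TL;DR: EgoCogNav is a multimodal egocentric navigation framework that fuses scene features with sensory cues to predict perceived path uncertainty and jointly forecast trajectories and head motion.

Details

Motivation: Most existing methods focus on motion forecasting in fully observed scenes and neglect the human factors that capture how people feel about and respond to space.

Result: EgoCogNav learns a perceived uncertainty that correlates strongly with human-like behaviors such as scanning, hesitation, and backtracking, and it generalizes to unseen environments.

Insight: Human perceived uncertainty is a key latent factor in multimodal data and is essential for modeling navigation behavior.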

Abstract: Modeling the cognitive and experiential factors of human navigation is central to deepening our understanding of human-environment interaction and to enabling safe social navigation and effective assistive wayfinding. Most existing methods focus on forecasting motions in fully observed scenes and often neglect human factors that capture how people feel and respond to space. To address this gap, we propose EgoCogNav, a multimodal egocentric navigation framework that predicts perceived path uncertainty as a latent state and jointly forecasts trajectories and head motion by fusing scene features with sensory cues. To facilitate research in the field, we introduce the Cognition-aware Egocentric Navigation (CEN) dataset, consisting of 6 hours of real-world egocentric recordings that capture diverse navigation behaviors. Experiments show that EgoCogNav learns a perceived uncertainty that correlates strongly with human-like behaviors such as scanning, hesitation, and backtracking while generalizing to unseen environments.

[319] PaSE: Prototype-aligned Calibration and Shapley-based Equilibrium for Multimodal Sentiment Analysis cs.LG | cs.AI | cs.CVPDF

Kang He, Boyu Chen, Yuzhe Ding, Fei Li, Chong Teng

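TL;DR: PaSE proposes a Prototype-aligned Calibration and Shapley-optimized framework for multimodal sentiment analysis, using prototype-guided calibration learning and Shapley-based gradient modulation to mitigate modality competition and improve performance.

Details

Motivation: In multimodal sentiment analysis, modalities compete (dominant modalities overshadow weaker ones), which degrades performance. PaSE aims to strengthen cross-modal collaboration through prototype alignment and Shapley-based optimization.

Result: PaSE performs strongly on IEMOCAP, MOSI, and MOSEI and effectively alleviates modality competition.

Insight: Prototype alignment and Shapley-based optimization can markedly improve collaboration in multimodal fusion.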

Abstract: Multimodal Sentiment Analysis (MSA) seeks to understand human emotions by integrating textual, acoustic, and visual signals. Although multimodal fusion is designed to leverage cross-modal complementarity, real-world scenarios often exhibit modality competition: dominant modalities tend to overshadow weaker ones, leading to suboptimal performance. In this paper, we propose PaSE, a novel Prototype-aligned Calibration and Shapley-optimized Equilibrium framework, which enhances collaboration while explicitly mitigating modality competition. PaSE first applies Prototype-guided Calibration Learning (PCL) to refine unimodal representations and align them through an Entropic Optimal Transport mechanism that ensures semantic consistency. To further stabilize optimization, we introduce a Dual-Phase Optimization strategy. A prototype-gated fusion module is first used to extract shared representations, followed by Shapley-based Gradient Modulation (SGM), which adaptively adjusts gradients according to the contribution of each modality. Extensive experiments on IEMOCAP, MOSI, and MOSEI confirm that PaSE achieves superior performance and effectively alleviates modality competition.
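The idea of measuring each modality's contribution with Shapley values can be sketched as follows. This is a generic exact Shapley computation, not PaSE's SGM module: the `subset_accuracy` table holds made-up validation accuracies for each subset of modalities, and how the resulting values would modulate gradients is left out because the abstract does not specify it.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley value of each player for a set function value(frozenset)."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(n):
            # Weight of a coalition of size r that p joins.
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for subset in combinations(others, r):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

# Hypothetical accuracies when only the listed modalities are available.
subset_accuracy = {
    frozenset(): 0.25,
    frozenset({"text"}): 0.70,
    frozenset({"audio"}): 0.50,
    frozenset({"vision"}): 0.45,
    frozenset({"text", "audio"}): 0.74,
    frozenset({"text", "vision"}): 0.73,
    frozenset({"audio", "vision"}): 0.55,
    frozenset({"text", "audio", "vision"}): 0.76,
}

contributions = shapley_values(["text", "audio", "vision"], subset_accuracy.__getitem__)
dominant = max(contributions, key=contributions.get)
```

With these invented numbers the text modality receives the largest share, which is the kind of imbalance a contribution-aware gradient rule would aim to correct. The values sum to the gain of the full model over the empty set, a standard property of Shapley values.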

[320] pFedBBN: A Personalized Federated Test-Time Adaptation with Balanced Batch Normalization for Class-Imbalanced Data cs.LG | cs.CVPDF

Md Akil Raihan Iftee, Syed Md. Ahnaf Hasan, Mir Sazzat Hossain, Rakibul Hasan Rajib, Amin Ahsan Ali

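TL;DR: The paper proposes pFedBBN, a personalized federated test-time adaptation framework for class-imbalanced data that addresses distribution shift and class imbalance through balanced batch normalization and domain-aware collaboration.

Details

Motivation: Test-time adaptation (TTA) in federated learning is crucial for handling distribution shift and class imbalance across clients, but existing methods do not address these problems effectively.

Result: Experiments show that pFedBBN outperforms existing federated learning and TTA methods in robustness and minority-class performance.

Insight: Balanced batch normalization and domain-aware collaboration are effective ways to address class imbalance and distribution shift in federated learning.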

Abstract: Test-time adaptation (TTA) in federated learning (FL) is crucial for handling unseen data distributions across clients, particularly when faced with domain shifts and skewed class distributions. Class Imbalance (CI) remains a fundamental challenge in FL, where rare but critical classes are often severely underrepresented in individual client datasets. Although prior work has addressed CI during training through reliable aggregation and local class distribution alignment, these methods typically rely on access to labeled data or coordination among clients, and none address unsupervised adaptation to dynamic domains or distribution shifts at inference time under federated CI constraints. After revealing the failure of state-of-the-art TTA for federated client adaptation in CI scenarios, we propose pFedBBN, a personalized federated test-time adaptation framework that employs balanced batch normalization (BBN) during local client adaptation to mitigate prediction bias by treating all classes equally. It also enables client collaboration guided by BBN similarity, ensuring that clients with similar balanced representations reinforce each other and that adaptation remains aligned with domain-specific characteristics. pFedBBN supports fully unsupervised local adaptation and introduces a class-aware model aggregation strategy that enables personalized inference without compromising privacy. It addresses both distribution shifts and class imbalance through balanced feature normalization and domain-aware collaboration, without requiring any labeled or raw data from clients. Extensive experiments across diverse baselines show that pFedBBN consistently enhances robustness and minority-class performance over state-of-the-art FL and TTA methods.

[321] Coherent Multi-Agent Trajectory Forecasting in Team Sports with CausalTraj cs.LG | cs.CVPDF

Wei Zhen Teoh

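TL;DR: CausalTraj is a temporally causal multi-agent trajectory forecasting model designed to generate joint predictions. By emphasizing joint metrics (minJADE, minJFDE), it produces accurate collective forecasts in team sports and achieves the best recorded results on those joint metrics.

Details

Motivation: Existing multi-agent trajectory forecasting models are evaluated mainly on per-agent accuracy metrics (minADE, minFDE), which ignore whether the joint prediction is plausible. In scenarios with complex interactions such as team sports, this can yield incoherent or unrealistic forecasts.

Result: On NBA SportVU, Basketball-U, and Football-U, CausalTraj achieves competitive per-agent accuracy and the best recorded results on joint metrics, and its outputs are more coherent and realistic.

Insight: Joint metrics (minJADE, minJFDE) reflect a model's true performance on multi-agent forecasting better than per-agent metrics, especially in complex interaction scenarios.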

Abstract: Jointly forecasting trajectories of multiple interacting agents is a core challenge in sports analytics and other domains involving complex group dynamics. Accurate prediction enables realistic simulation and strategic understanding of gameplay evolution. Most existing models are evaluated solely on per-agent accuracy metrics (minADE, minFDE), which assess each agent independently on its best-of-k prediction. However, these metrics overlook whether the model learns which predicted trajectories can jointly form a plausible multi-agent future. Many state-of-the-art models are designed and optimized primarily based on these metrics. As a result, they may underperform on joint predictions and also fail to generate coherent, interpretable multi-agent scenarios in team sports. We propose CausalTraj, a temporally causal, likelihood-based model that is built to generate jointly probable multi-agent trajectory forecasts. To better assess collective modeling capability, we emphasize joint metrics (minJADE, minJFDE) that measure joint accuracy across agents within the best generated scenario sample. Evaluated on the NBA SportVU, Basketball-U, and Football-U datasets, CausalTraj achieves competitive per-agent accuracy and the best recorded results on joint metrics, while yielding qualitatively coherent and realistic gameplay evolutions.
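The difference between the per-agent and joint metrics named in the abstract can be made concrete with a small sketch. It follows the usual definitions (best-of-k chosen per agent for minADE, one shared best sample for minJADE); the data layout `samples[k][agent]` and the toy coordinates are assumptions for illustration.

```python
from math import dist

def ade(pred, truth):
    """Average displacement error between two trajectories of (x, y) points."""
    return sum(dist(p, t) for p, t in zip(pred, truth)) / len(truth)

def min_ade(samples, truth):
    """Per-agent metric: each agent picks its own best sample, then average over agents."""
    num_agents = len(truth)
    return sum(min(ade(s[a], truth[a]) for s in samples)
               for a in range(num_agents)) / num_agents

def min_jade(samples, truth):
    """Joint metric: all agents must share the same sample; take the best such sample."""
    num_agents = len(truth)
    return min(sum(ade(s[a], truth[a]) for a in range(num_agents)) / num_agents
               for s in samples)

# Two agents, two sampled futures, one timestep each (toy data).
truth = [[(0.0, 0.0)], [(0.0, 0.0)]]
samples = [
    [[(0.0, 0.0)], [(4.0, 0.0)]],  # sample 0: agent 0 right, agent 1 off by 4
    [[(4.0, 0.0)], [(0.0, 0.0)]],  # sample 1: agent 0 off by 4, agent 1 right
]

per_agent_score = min_ade(samples, truth)
joint_score = min_jade(samples, truth)
```

Here minADE is 0 because each agent is perfect in some sample, yet no single sample gets both agents right, so minJADE is 2. This is the gap between per-agent accuracy and joint plausibility that the abstract points to.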

[322] TRIDENT: A Trimodal Cascade Generative Framework for Drug and RNA-Conditioned Cellular Morphology Synthesis cs.LG | cs.CV | q-bio.QMPDF

Rui Peng, Ziru Liu, Lingyuan Ye, Yuxing Lu, Boxin Shi

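TL;DR: TRIDENT is a trimodal cascade generative framework that synthesizes cellular morphology conditioned on drug and RNA. It significantly outperforms existing methods, and RNA-guided synthesis confirms its high fidelity.

Details

Motivation: Existing methods typically model only direct associations between a perturbation and either RNA or morphology, overlooking the causal link from RNA to morphology, so they cannot accurately model cellular phenotypic changes.

Result: TRIDENT achieves up to a 7-fold improvement with strong generalization to unseen compounds, and an RNA-guided case study on docetaxel validates the model's accuracy.

Insight: Explicitly modeling the transcriptome-to-phenome mapping is key to a predictive virtual cell, and RNA conditioning is essential for high-fidelity synthesis.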

Abstract: Accurately modeling the relationship between perturbations, transcriptional responses, and phenotypic changes is essential for building an AI Virtual Cell (AIVC). However, existing methods, typically constrained to modeling direct associations such as Perturbation $\rightarrow$ RNA or Perturbation $\rightarrow$ Morphology, overlook the crucial causal link from RNA to morphology. To bridge this gap, we propose TRIDENT, a cascade generative framework that synthesizes realistic cellular morphology by conditioning on both the perturbation and the corresponding gene expression profile. To train and evaluate this task, we construct MorphoGene, a new dataset pairing L1000 gene expression with Cell Painting images for 98 compounds. TRIDENT significantly outperforms state-of-the-art approaches, achieving up to 7-fold improvement with strong generalization to unseen compounds. In a case study on docetaxel, we validate that RNA-guided synthesis accurately produces the corresponding phenotype. An ablation study further confirms that this RNA conditioning is essential for the model’s high fidelity. By explicitly modeling transcriptome-phenome mapping, TRIDENT provides a powerful in silico tool and moves us closer to a predictive virtual cell.

Duncan Stothers, Ben Stothers, Emily Schaeffer, Kishore Mulpuri

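TL;DR: The paper proposes an ultrasound-first, radiation-sparing strategy for diagnosing pediatric developmental dysplasia of the hip (DDH) that requests a radiograph only when needed. It combines pretrained modality-specific encoders, small fitted heads, and a calibrated deferral rule to achieve high coverage and tunable selective imaging.

Details

Motivation: DDH diagnosis often requires radiographs, whose radiation is harmful to children. The paper aims to develop an ultrasound-first strategy that reduces unnecessary radiographs while maintaining diagnostic accuracy.

Result: On the evaluation set, ultrasound measurement error is modest (e.g., alpha MAE of about 9.7 degrees), and the radiographic heads reach AI and CE MAEs of about 7.6 and 8.9 degrees. The selective imaging policy can be tuned to trade coverage against ultrasound-only throughput.

Insight: The approach turns limited labeled data into interpretable measurements and a tunable imaging policy, giving clinicians a flexible decision tool while reducing reliance on radiographs.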

Abstract: We study an ultrasound-first, radiation-preserving policy for developmental dysplasia of the hip (DDH) that requests a radiograph only when needed. We (i) pretrain modality-specific encoders (ResNet-18) with SimSiam on a large unlabelled registry (37186 ultrasound; 19546 radiographs), (ii) freeze the backbones and fit small, measurement-faithful heads on DDH-relevant landmarks and measurements, and (iii) calibrate a one-sided conformal deferral rule on ultrasound predictions that provides finite-sample coverage guarantees under exchangeability, using a held-out calibration set. Ultrasound heads predict Graf alpha, beta, and femoral head coverage; X-ray heads predict acetabular index (AI), center-edge (CE) angle, and IHDI grade. On our held-out labeled evaluation set, ultrasound measurement error is modest (e.g., alpha MAE ~= 9.7 degrees, coverage MAE ~= 14.0%), while radiographic probes achieve AI and CE MAEs of ~= 7.6 degrees and ~= 8.9 degrees, respectively. The calibrated US-only policy is explored across rule families (alpha-only; alpha OR coverage; alpha AND coverage), uncertainty inflation factors, and per-utility trade-offs using decision-curve analysis. Conservative settings yield high coverage with near-zero US-only rates; permissive settings (e.g., alpha OR coverage at larger deltas) achieve non-zero US-only throughput with expected coverage tradeoffs. The result is a simple, reproducible pipeline that turns limited labels into interpretable measurements and tunable selective imaging curves suitable for clinical handoff and future external validation.
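A one-sided split-conformal deferral rule of the kind mentioned in step (iii) might look like the sketch below. It is a generic textbook construction, not the authors' code: the score definition (prediction minus truth), the `miscoverage` parameter, the 60-degree threshold, and the calibration numbers are all illustrative assumptions.

```python
from math import ceil, inf

def one_sided_conformal_margin(cal_preds, cal_truths, miscoverage):
    """Margin q such that truth >= pred - q holds with probability >= 1 - miscoverage
    for a new exchangeable case (standard split-conformal finite-sample quantile)."""
    scores = sorted(p - t for p, t in zip(cal_preds, cal_truths))  # overestimation errors
    n = len(scores)
    rank = ceil((n + 1) * (1 - miscoverage))
    return scores[rank - 1] if rank <= n else inf  # too few calibration points: always defer

def needs_radiograph(pred_alpha, margin, normal_threshold=60.0):
    """Defer to a radiograph when the conformal lower bound falls below the threshold."""
    return pred_alpha - margin < normal_threshold

# Hypothetical held-out calibration set: predicted vs. measured Graf alpha (degrees).
cal_preds = [62, 58, 65, 70, 55, 61, 66]
cal_truths = [60, 59, 61, 69, 57, 60, 60]

margin = one_sided_conformal_margin(cal_preds, cal_truths, miscoverage=0.25)
confident_case = needs_radiograph(66, margin)   # lower bound 62: ultrasound only
borderline_case = needs_radiograph(62, margin)  # lower bound 58: request radiograph
```

A smaller `miscoverage` widens the margin and sends more cases to radiography, which mirrors the conservative versus permissive settings described in the abstract.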

[324] VLM in a flash: I/O-Efficient Sparsification of Vision-Language Model via Neuron Chunking cs.LG | cs.AI | cs.CV | cs.PFPDF

Kichang Yang, Seonjun Kim, Minjae Kim, Nairan Zhang, Chi Zhang

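TL;DR: The paper proposes Neuron Chunking, an I/O-efficient sparsification method that couples neuron importance with storage access cost to optimize edge deployment of vision-language models (VLMs).

Details

Motivation: Conventional sparsification selects neurons solely by activation magnitude and ignores how storage access patterns affect performance, resulting in poor I/O efficiency. A more efficient method is needed to reduce VLM I/O overhead on edge devices.

Result: Neuron Chunking improves I/O efficiency by up to 4.65x on Jetson Orin Nano and 5.76x on Jetson AGX Orin.

Insight: Sparsification should account not only for neuron activation characteristics but also for hardware storage access patterns to achieve more efficient edge deployment.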

Abstract: Edge deployment of large Vision-Language Models (VLMs) increasingly relies on flash-based weight offloading, where activation sparsification is used to reduce I/O overhead. However, conventional sparsification remains model-centric, selecting neurons solely by activation magnitude and neglecting how access patterns influence flash performance. We present Neuron Chunking, an I/O-efficient sparsification strategy that operates on chunks (i.e., groups of contiguous neurons in memory) and couples neuron importance with storage access cost. The method models I/O latency through a lightweight abstraction of access contiguity and selects chunks with high utility, defined as neuron importance normalized by estimated latency. By aligning sparsification decisions with the underlying storage behavior, Neuron Chunking improves I/O efficiency by up to 4.65x and 5.76x on Jetson Orin Nano and Jetson AGX Orin, respectively.
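The utility definition in the abstract (neuron importance normalized by estimated latency) can be sketched with a simple greedy selector. Everything beyond that definition is assumed here: the latency model (a fixed per-read cost plus a per-neuron cost), the candidate chunk boundaries, the latency budget, and the greedy strategy are hypothetical stand-ins, not the paper's algorithm.

```python
def chunk_latency(length, seek_cost=1.0, per_neuron_cost=0.25):
    """Assumed latency model: one fixed cost per contiguous read plus a per-neuron cost."""
    return seek_cost + per_neuron_cost * length

def select_chunks(importance, candidates, latency_budget):
    """Greedily pick chunks (start, length) by utility = importance / latency."""
    scored = []
    for start, length in candidates:
        latency = chunk_latency(length)
        utility = sum(importance[start:start + length]) / latency
        scored.append((utility, start, length, latency))
    selected, spent = [], 0.0
    for utility, start, length, latency in sorted(scored, reverse=True):
        if spent + latency <= latency_budget:
            selected.append((start, length))
            spent += latency
    return sorted(selected), spent

# Hypothetical per-neuron importance (e.g., activation magnitude) and candidate chunks.
importance = [0.9, 0.1, 0.1, 0.8, 0.05, 0.05, 0.7, 0.6]
candidates = [(0, 4), (4, 2), (6, 2)]

chosen, used_latency = select_chunks(importance, candidates, latency_budget=4.0)
```

Because each read pays the fixed cost once, a long chunk containing a few low-importance neurons can still rank above a short chunk, which is the behavior a purely magnitude-based selector would miss.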

[325] GRIT-LP: Graph Transformer with Long-Range Skip Connection and Partitioned Spatial Graphs for Accurate Ice Layer Thickness Prediction cs.LG | cs.CVPDF

Zesheng Liu, Maryam Rahnemoonfar

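TL;DR: The paper proposes GRIT-LP, a graph transformer for predicting ice layer thickness from polar radar imagery. A partitioned spatial graph construction and a long-range skip connection mechanism significantly improve prediction accuracy.

Details

Motivation: Accurately estimating ice layer thickness is critical for understanding snow accumulation, reconstructing past climate patterns, and reducing uncertainty in projections of future ice sheet evolution and sea level rise. Existing graph transformers are limited by oversmoothing and weak long-range dependency modeling.

Result: Experiments show that GRIT-LP improves root mean squared error by 24.92% over existing methods, demonstrating its effectiveness at capturing both local features and long-range dependencies.

Insight: By combining spatial coherence with long-range dependency modeling, graph transformers show promise for spatio-temporal pattern analysis, particularly for data-driven understanding of cryospheric processes.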

Abstract: Graph transformers have demonstrated remarkable capability on complex spatio-temporal tasks, yet their depth is often limited by oversmoothing and weak long-range dependency modeling. To address these challenges, we introduce GRIT-LP, a graph transformer explicitly designed for ice-layer thickness estimation from polar radar imagery. Accurately estimating ice layer thickness is critical for understanding snow accumulation, reconstructing past climate patterns, and reducing uncertainties in projections of future ice sheet evolution and sea level rise. GRIT-LP combines an inductive geometric graph learning framework with a self-attention mechanism, and introduces two major innovations that jointly address challenges in modeling the spatio-temporal patterns of ice layers: a partitioned spatial graph construction strategy that forms overlapping, fully connected local neighborhoods to preserve spatial coherence and suppress noise from irrelevant long-range links, and a long-range skip connection mechanism within the transformer that improves information flow and mitigates oversmoothing in deeper attention layers. We conducted extensive experiments, demonstrating that GRIT-LP outperforms current state-of-the-art methods with a 24.92% improvement in root mean squared error. These results highlight the effectiveness of graph transformers in modeling spatio-temporal patterns by capturing both localized structural features and long-range dependencies across internal ice layers, and demonstrate their potential to advance data-driven understanding of cryospheric processes.

[326] AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention cs.LG | cs.CV | cs.ROPDF

Lei Xiao, Jifeng Li, Juntao Gao, Feiyang Ye, Yan Jin

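TL;DR: AVA-VLA is an improved vision-language-action model that dynamically modulates visual attention to focus processing on task-relevant visual tokens, yielding better results in dynamic sequential decision-making.

Details

Motivation: Existing vision-language-action models do not use historical context when processing dense visual inputs, which makes them less effective on dynamic tasks.

Result: AVA-VLA achieves state-of-the-art performance on robotic benchmarks including LIBERO and CALVIN, and deployments on a real dual-arm robot platform validate its practicality.

Insight: Dynamic visual attention can significantly improve performance on complex tasks, especially those that require historical context.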

Abstract: Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in embodied AI tasks. However, existing VLA models, often built upon Vision-Language Models (VLMs), typically process dense visual inputs independently at each timestep. This approach implicitly models the task as a Markov Decision Process (MDP). This history-agnostic design is suboptimal for effective visual token processing in dynamic sequential decision-making, as it fails to leverage the context of history. To address this limitation, we reformulate the problem from a Partially Observable Markov Decision Process (POMDP) perspective and propose a novel framework named AVA-VLA. Inspired by the POMDP principle that action generation should be conditioned on the belief state, AVA-VLA introduces Active Visual Attention (AVA) to dynamically modulate visual processing. It achieves this by leveraging the recurrent state, which is a neural approximation of the agent’s belief state derived from the previous decision step. Specifically, the AVA module uses the recurrent state to compute soft weights to actively process task-relevant visual tokens based on its historical context. Comprehensive evaluations demonstrate that AVA-VLA achieves state-of-the-art performance across popular robotic benchmarks, including LIBERO and CALVIN. Furthermore, real-world deployments on a dual-arm robot platform validate the framework’s practical applicability and robust sim-to-real transferability.

[327] UniGame: Turning a Unified Multimodal Model Into Its Own Adversary cs.LG | cs.AI | cs.CVPDF

Zhaolong Su, Wang Lu, Hao Chen, Sharon Li, Jindong Wang

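TL;DR: UniGame proposes a self-adversarial post-training framework that applies a lightweight perturber at the shared token interface to improve the consistency between understanding and generation in unified multimodal models, significantly improving robustness and performance.

Details

Motivation: Unified multimodal models (UMMs) are inconsistent across understanding and generation tasks, mainly because the two favor different embeddings. This inconsistency leads to misaligned decision boundaries, degraded cross-modal coherence, and vulnerability to distributional and adversarial shifts. UniGame aims to address this problem.

Result: Experiments show that UniGame improves consistency (+4.6%), understanding (+3.6%), and generation (+0.02), and strengthens out-of-distribution and adversarial robustness (+4.8% on NaturalBench and +6.2% on AdVQA).

Insight: Self-adversarial training is a general and effective principle for improving the consistency and robustness of multimodal models, and the method is complementary to existing post-training techniques.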

Abstract: Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets the inconsistencies. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek and challenge fragile understanding, turning the model itself into its own adversary. Experiments demonstrate that UniGame significantly improves the consistency (+4.6%). Moreover, it also achieves substantial improvements in understanding (+3.6%), generation (+0.02), out-of-distribution and adversarial robustness (+4.8% and +6.2% on NaturalBench and AdVQA). The framework is architecture-agnostic, introduces less than 1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models. The official code is available at: https://github.com/AIFrontierLab/UniGame

cs.CY [Back]

[328] Animated Territorial Data Extractor (ATDE): A Computer-Vision Method for Extracting Territorial Data from Animated Historical Maps cs.CY | cs.CVPDF

Hamza Alshamy, Isaiah Woram, Advay Mishra, Zihan Xia, Pascal Wallisch

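TL;DR: ATDE is a computer vision tool that extracts territorial data from animated historical map videos using HSV color segmentation, RGB channel filtering, and Direct-Neighbor Filtering, and converts it into time-series data.

Details

Motivation: Territorial changes shown in historical maps are often hard to quantify. ATDE aims to extract territorial data from animated maps automatically, filling a gap left by traditional methods.

Result: The tool was demonstrated on ten Chinese dynasties (200 BCE to 1912 CE), and the extracted data align with expected historical patterns.

Insight: ATDE is suited to educational demonstrations, preliminary data exploration, and comparative analysis of territorial dynamics, but it is not a substitute for authoritative historical datasets.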

Abstract: We present Animated Territorial Data Extractor (ATDE), a computer vision tool that extracts quantitative territorial data from animated historical map videos. ATDE employs HSV-based color segmentation, RGB channel filtering, and Direct-Neighbor Filtering to identify and count pixels representing territorial control. Combined with preprocessing for temporal alignment and cross-video scaling, the pipeline converts animated videos into structured time-series data. We demonstrate the tool on ten Chinese dynasties (200 BCE - 1912 CE), producing year-by-year pixel counts that align with expected historical patterns. While not a substitute for authoritative historical datasets, ATDE is well-suited for educational demonstrations, preliminary data exploration, and comparative analysis of territorial dynamics. The tool requires no pre-existing shapefiles and can be applied to any animated map video given seed colors and basic configuration. Code and examples are available on GitHub.
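The pixel-counting idea behind the pipeline can be sketched for a single frame. This is not ATDE's code: the tolerance values, the seed color, and in particular the reading of "Direct-Neighbor Filtering" as "keep a matching pixel only if a 4-connected neighbor also matches" are assumptions; the RGB channel filtering and temporal alignment steps are omitted.

```python
import colorsys

def matches_seed(rgb, seed_rgb, hue_tol=0.05, sat_min=0.4, val_min=0.3):
    """HSV test: hue close to the seed color, and saturated and bright enough."""
    h, s, v = colorsys.rgb_to_hsv(*(c / 255 for c in rgb))
    seed_h, _, _ = colorsys.rgb_to_hsv(*(c / 255 for c in seed_rgb))
    hue_gap = abs(h - seed_h)
    hue_gap = min(hue_gap, 1 - hue_gap)  # hue wraps around the color wheel
    return hue_gap <= hue_tol and s >= sat_min and v >= val_min

def count_territory_pixels(frame, seed_rgb):
    """Count matching pixels that have at least one matching 4-neighbor."""
    height, width = len(frame), len(frame[0])
    mask = [[matches_seed(px, seed_rgb) for px in row] for row in frame]
    count = 0
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            neighbors = ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
            if any(0 <= ny < height and 0 <= nx < width and mask[ny][nx]
                   for ny, nx in neighbors):
                count += 1
    return count

# Toy 3x3 frame: two adjacent red pixels plus one isolated red pixel on a white map.
RED, WHITE = (200, 30, 30), (240, 240, 240)
frame = [
    [RED, RED, WHITE],
    [WHITE, WHITE, WHITE],
    [WHITE, WHITE, RED],
]

pixel_count = count_territory_pixels(frame, seed_rgb=(200, 30, 30))
```

Running the same count on every frame of a video, with one seed color per polity, would give the year-by-year pixel series the abstract describes. The isolated pixel is dropped as probable noise under the assumed neighbor rule.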

cs.HC [Back]

[329] Deep Learning-based Lightweight RGB Object Tracking for Augmented Reality Devices cs.HC | cs.CVPDF

Alice Smith, Bob Johnson, Xiaoyu Zhu, Carol Lee

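TL;DR: The paper proposes a lightweight RGB object tracking algorithm designed for resource-constrained AR devices, achieving real-time tracking through a compact Siamese neural network architecture and optimization techniques.

Details

Motivation: AR applications need to track objects in the user's environment in real time to overlay virtual content correctly, but existing deep learning trackers are too computationally expensive for lightweight devices.

Result: On standard tracking benchmarks, the method achieves accuracy comparable to state-of-the-art trackers while running in real time at about 30 FPS on a mobile AR headset, more than an order of magnitude faster than prior high-performance trackers on the same hardware.

Insight: Lightweight design and optimization make high-performance tracking feasible on resource-constrained devices, enabling more dynamic and interactive AR experiences.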

Abstract: Augmented Reality (AR) applications often require robust real-time tracking of objects in the user’s environment to correctly overlay virtual content. Recent advances in computer vision have produced highly accurate deep learning-based object trackers, but these models are typically too heavy in computation and memory for wearable AR devices. In this paper, we present a lightweight RGB object tracking algorithm designed specifically for resource-constrained AR platforms. The proposed tracker employs a compact Siamese neural network architecture and incorporates optimization techniques such as model pruning, quantization, and knowledge distillation to drastically reduce model size and inference cost while maintaining high tracking accuracy. We train the tracker offline on large video datasets using deep convolutional neural networks and then deploy it on-device for real-time tracking. Experimental results on standard tracking benchmarks show that our approach achieves comparable accuracy to state-of-the-art trackers, yet runs in real-time on a mobile AR headset at around 30 FPS – more than an order of magnitude faster than prior high-performance trackers on the same hardware. This work enables practical, robust object tracking for AR use-cases, opening the door to more interactive and dynamic AR experiences on lightweight devices.

cs.DB [Back]

[330] LLM and Agent-Driven Data Analysis: A Systematic Approach for Enterprise Applications and System-level Deployment cs.DB | cs.AI | cs.CLPDF

Xi Wang, Xianyao Ling, Kun Li, Gang Yin, Liang Zhang

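TL;DR: The paper examines how generative AI and agent technologies are transforming enterprise data management and analytics, focusing on SQL generation driven by large language models (LLMs) and AI agents and on key challenges in enterprise applications and system deployment.

Details

Motivation: Enterprise data management and analytics are changing rapidly under generative AI and agent technologies, and core problems such as semantic querying, data security, and SQL generation need to be addressed.

Result: Representative use cases are used to discuss key challenges in distributed deployment, data security, and SQL generation.

Insight: LLM and agent technologies hold great potential for enterprise data management, but challenges such as data security and task complexity need attention.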

Abstract: The rapid progress in Generative AI and Agent technologies is profoundly transforming enterprise data management and analytics. Traditional database applications and system deployment are fundamentally impacted by AI-driven tools, such as Retrieval-Augmented Generation (RAG) and vector database technologies, which provide new pathways for semantic querying over enterprise knowledge bases. Meanwhile, data security and compliance are top priorities for organizations adopting AI technologies. For enterprise data analysis, SQL generation powered by large language models (LLMs) and AI agents has emerged as a key bridge connecting natural language with structured data, effectively lowering the barrier to enterprise data access and improving analytical efficiency. This paper focuses on enterprise data analysis applications and system deployment, covering a range of innovative frameworks that enable complex query understanding, multi-agent collaboration, security verification, and computational efficiency. Through representative use cases, key challenges related to distributed deployment, data security, and inherent difficulties in SQL generation tasks are discussed.

cs.DC [Back]

[331] AVERY: Adaptive VLM Split Computing through Embodied Self-Awareness for Efficient Disaster Response Systems cs.DC | cs.AR | cs.CV | cs.LG | cs.NIPDF

Rajat Bhattacharjya, Sing-Yao Wu, Hyunwoo Oh, Chaewon Nam, Suyeon Koo

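TL;DR: AVERY proposes an adaptive VLM split-computing framework that uses a dual-stream split and a lightweight self-aware controller to deploy VLMs efficiently in disaster response systems, improving accuracy and energy efficiency.

Details

Motivation: Unmanned aerial vehicles (UAVs) in disaster response need complex semantic reasoning that existing CNNs cannot provide. VLMs are resource-hungry, and cloud offloading fails under low bandwidth, so an adaptive solution is needed.

Result: In an edge-cloud scenario, AVERY achieves 11.2% higher accuracy than raw image compression and 93.98% lower energy consumption than full-edge execution.

Insight: Functional splitting and adaptive control can relieve the deployment bottleneck of VLMs in resource-constrained environments and provide real-time intelligence for dynamic missions.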

Abstract: Unmanned Aerial Vehicles (UAVs) in disaster response require complex, queryable intelligence that on-board CNNs cannot provide. While Vision-Language Models (VLMs) offer this semantic reasoning, their high resource demands make on-device deployment infeasible, and naive cloud offloading fails under the low-bandwidth networks common in disaster zones. We present AVERY, a framework that enables VLM deployment through adaptive split computing. We advance the split computing paradigm beyond traditional depth-wise partitioning by introducing a functional, cognitive-inspired dual-stream split that separates the VLM into a high-frequency, low-resolution “context stream” for real-time awareness and a low-frequency, high-fidelity “insight stream” for deep analysis. A lightweight, self-aware on-board controller manages this architecture, monitoring network conditions and operator intent to dynamically select from pre-trained compression models, navigating the fundamental accuracy-throughput trade-off. Evaluated using the VLM LISA-7B across an edge-cloud scenario under fluctuating network conditions, AVERY consistently outperforms static configurations, achieving 11.2% higher accuracy than raw image compression and 93.98% lower energy consumption compared to full-edge execution, thereby enhancing mission efficiency and enabling real-time, queryable intelligence on resource-constrained platforms in dynamic environments.

cs.SD [Back]

[332] Multimodal Real-Time Anomaly Detection and Industrial Applications cs.SD | cs.AI | cs.CV | cs.LG | cs.MMPDF

Aman Verma, Keshav Samdani, Mohd. Samiuddin Shafi

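TL;DR: The paper presents a multimodal real-time anomaly detection system that combines video and audio processing; two design iterations significantly improve accuracy and industrial applicability.

Details

Motivation: The authors designed and refined the system to address the challenges of real-time activity recognition and anomaly detection in multimodal settings, particularly for industrial environments.

Result: Experiments show that the system achieves real-time performance with high accuracy and suits both general monitoring and industrial safety scenarios.

Insight: Multimodal fusion and multi-model ensembles are key to improving anomaly detection performance; industrial applications must balance real-time operation with robustness.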

Abstract: This paper presents the design, implementation, and evolution of a comprehensive multimodal room-monitoring system that integrates synchronized video and audio processing for real-time activity recognition and anomaly detection. We describe two iterations of the system: an initial lightweight implementation using YOLOv8, ByteTrack, and the Audio Spectrogram Transformer (AST), and an advanced version that incorporates multi-model audio ensembles, hybrid object detection, bidirectional cross-modal attention, and multi-method anomaly detection. The evolution demonstrates significant improvements in accuracy, robustness, and industrial applicability. The advanced system combines three audio models (AST, Wav2Vec2, and HuBERT) for comprehensive audio understanding, dual object detectors (YOLO and DETR) for improved accuracy, and sophisticated fusion mechanisms for enhanced cross-modal learning. Experimental evaluation shows the system’s effectiveness in general monitoring scenarios as well as specialized industrial safety applications, achieving real-time performance on standard hardware while maintaining high accuracy.

[333] PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation cs.SD | cs.CV | eess.AS | eess.IVPDF

Huadai Liu, Kaicheng Luo, Wen Wang, Qian Chen, Peiwen Sun

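TL;DR: PrismAudio proposes a new video-to-audio generation framework that addresses objective entanglement with decomposed Chain-of-Thought (CoT) reasoning and multi-dimensional reward functions, optimized with reinforcement learning.

Details

Motivation: Existing video-to-audio methods suffer from objective entanglement: a single loss function conflates competing goals (semantic consistency, audio-visual temporal synchrony, aesthetic quality, and spatial accuracy), and they lack human preference alignment.

Result: On the VGGSound test set and the AudioCanvas benchmark, PrismAudio achieves state-of-the-art performance across all four perceptual dimensions.

Insight: Decomposed reasoning paired with multi-dimensional rewards handles the multi-objective optimization in video-to-audio generation more effectively while preserving interpretability.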

Abstract: Video-to-Audio (V2A) generation requires balancing four critical perceptual dimensions: semantic consistency, audio-visual temporal synchrony, aesthetic quality, and spatial accuracy; yet existing methods suffer from objective entanglement that conflates competing goals in single loss functions and lack human preference alignment. We introduce PrismAudio, the first framework to integrate Reinforcement Learning into V2A generation with specialized Chain-of-Thought (CoT) planning. Our approach decomposes monolithic reasoning into four specialized CoT modules (Semantic, Temporal, Aesthetic, and Spatial CoT), each paired with targeted reward functions. This CoT-reward correspondence enables multidimensional RL optimization that guides the model to jointly generate better reasoning across all perspectives, solving the objective entanglement problem while preserving interpretability. To make this optimization computationally practical, we propose Fast-GRPO, which employs hybrid ODE-SDE sampling that dramatically reduces the training overhead compared to existing GRPO implementations. We also introduce AudioCanvas, a rigorous benchmark that is more distributionally balanced and covers more realistically diverse and challenging scenarios than existing datasets, with 300 single-event classes and 501 multi-event samples. Experimental results demonstrate that PrismAudio achieves state-of-the-art performance across all four perceptual dimensions on both the in-domain VGGSound test set and out-of-domain AudioCanvas benchmark. The project page is available at https://PrismAudio-Project.github.io.

[334] Real-Time Object Tracking with On-Device Deep Learning for Adaptive Beamforming in Dynamic Acoustic Environments cs.SD | cs.AI | cs.CVPDF

Jorge Ortigoso-Narro, Jose A. Belloch, Adrian Amor-Martin, Sandra Roger, Maximo Cobos

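TL;DR: The paper presents an embedded system that combines deep learning-based tracking with beamforming to achieve precise sound source localization and directional audio capture in dynamic environments.

Details

Motivation: Sound source localization and directional audio capture in dynamic environments matter for surveillance, human-computer interaction, and robotics. Existing systems fall short in efficiency and robustness, so a real-time, adaptive approach is needed.

Result: Experiments show significant gains in signal-to-interference ratio, making the system well suited to teleconferencing, smart home devices, and assistive technologies.

Insight: Combining deep learning with acoustic processing to steer the beam dynamically can significantly improve audio capture in complex environments.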

Abstract: Advances in object tracking and acoustic beamforming are driving new capabilities in surveillance, human-computer interaction, and robotics. This work presents an embedded system that integrates deep learning-based tracking with beamforming to achieve precise sound source localization and directional audio capture in dynamic environments. The approach combines single-camera depth estimation and stereo vision to enable accurate 3D localization of moving objects. A planar concentric circular microphone array constructed with MEMS microphones provides a compact, energy-efficient platform supporting 2D beam steering across azimuth and elevation. Real-time tracking outputs continuously adapt the array’s focus, synchronizing the acoustic response with the target’s position. By uniting learned spatial awareness with dynamic steering, the system maintains robust performance in the presence of multiple or moving sources. Experimental evaluation demonstrates significant gains in signal-to-interference ratio, making the design well-suited for teleconferencing, smart home devices, and assistive technologies.

cs.SE [Back]

[335] From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence cs.SE | cs.CLPDF

Jian Yang, Wei Zhang, Shark Liu, Jiajun Wu, Shawn Guo

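TL;DR: This paper offers a comprehensive synthesis and practical guide to code LLMs. It analyzes the full model life cycle from data curation to post-training, compares the capabilities of general-purpose and code-specialized LLMs, and discusses the gap between research and practice.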

Details

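Motivation: With LLMs widely used in automated software development and tools such as GitHub Copilot gaining broad adoption, there is a pressing need to systematically summarize and analyze the implementation pipeline and technical challenges of code LLMs, and to bridge the gap between academia and industry.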

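Result: The paper notes that code LLMs now reach success rates above 95% on benchmarks such as HumanEval, while highlighting open challenges in code correctness, security, and contextual awareness of large codebases.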

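Insight: 1. Pre-training and fine-tuning are critical to performance gains; 2. Code-specialized LLMs perform better on specific tasks; 3. The gap between research and practice still needs to be bridged; 4. Reinforcement learning and autonomous agents are important future directions.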

Abstract: Large language models (LLMs) have fundamentally transformed automated software development by enabling direct translation of natural language descriptions into functional code, driving commercial adoption through tools like GitHub Copilot (Microsoft), Cursor (Anysphere), Trae (ByteDance), and Claude Code (Anthropic). The field has evolved dramatically from rule-based systems to Transformer-based architectures, with success rates on benchmarks like HumanEval improving from single digits to over 95%. In this work, we provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs, systematically examining the complete model life cycle from data curation to post-training through advanced prompting paradigms, code pre-training, supervised fine-tuning, reinforcement learning, and autonomous coding agents. We analyze the code capability of the general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder), critically examining the techniques, design decisions, and trade-offs. Further, we articulate the research-practice gap between academic research (e.g., benchmarks and tasks) and real-world deployment (e.g., software-related code tasks), including code correctness, security, contextual awareness of large codebases, and integration with development workflows, and map promising research directions to practical needs. Last, we conduct a series of experiments to provide a comprehensive analysis of code pre-training, supervised fine-tuning, and reinforcement learning, covering scaling law, framework selection, hyperparameter sensitivity, model architectures, and dataset comparisons.
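
For readers unfamiliar with how HumanEval success rates are typically computed, the following is a minimal sketch of the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the unit tests. The abstract does not say which metric lies behind its figures, so treating "success rate" as pass@k is an assumption here, and the function names and sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, passes the tests."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical per-problem outcomes: (samples generated, samples passing).
    results = [(10, 5), (10, 10), (10, 0)]
    print(benchmark_pass_at_k(results, k=1))
```

With k = 1 the estimator reduces to the fraction of passing samples per problem, averaged over the benchmark.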

Table of Contents

cs.CV [Back]

[1] Multimodal AI for Body Fat Estimation: Computer Vision and Anthropometry with DEXA Benchmarks cs.CV | cs.AI | cs.LGPDF

[2] Reconstruction-Driven Multimodal Representation Learning for Automated Media Understanding cs.CV | cs.AIPDF

[3] BCWildfire: A Long-term Multi-factor Dataset and Deep Learning Benchmark for Boreal Wildfire Risk Prediction cs.CVPDF

[4] Robustness of Structured Data Extraction from Perspectively Distorted Documents cs.CV | cs.CL | cs.LGPDF

[5] 3D Ground Truth Reconstruction from Multi-Camera Annotations Using UKF cs.CVPDF

[6] Unified Low-Light Traffic Image Enhancement via Multi-Stage Illumination Recovery and Adaptive Noise Suppression cs.CV | cs.AIPDF

[7] HSMix: Hard and Soft Mixing Data Augmentation for Medical Image Segmentation cs.CV | cs.LGPDF

[8] Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach cs.CVPDF

[9] Upstream Probabilistic Meta-Imputation for Multimodal Pediatric Pancreatitis Classification cs.CV | cs.LGPDF

[10] SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in Long-horizon Embodied Scenarios cs.CV | cs.AI | cs.ROPDF

[11] MedPEFT-CL: Dual-Phase Parameter-Efficient Continual Learning with Medical Semantic Adapter and Bidirectional Memory Consolidation cs.CVPDF

[12] Person Recognition in Aerial Surveillance: A Decade Survey cs.CVPDF

[13] Vision-Motion-Reference Alignment for Referring Multi-Object Tracking via Multi-Modal Large Language Models cs.CVPDF

[14] Understanding Counting Mechanisms in Large Language and Vision-Language Models cs.CV | cs.AIPDF

[15] Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions cs.CVPDF

[16] The Potential and Limitations of Vision-Language Models for Human Motion Understanding: A Case Study in Data-Driven Stroke Rehabilitation cs.CVPDF

[17] VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning cs.CV | cs.LGPDF

[18] Towards Open-Ended Visual Scientific Discovery with Sparse Autoencoders cs.CVPDF

[19] AEGIS: Preserving privacy of 3D Facial Avatars with Adversarial Perturbations cs.CV | cs.AIPDF

[20] SPIDER: Spatial Image CorresponDence Estimator for Robust Calibration cs.CVPDF

[21] CORA: Consistency-Guided Semi-Supervised Framework for Reasoning Segmentation cs.CVPDF

[22] Deepfake Geography: Detecting AI-Generated Satellite Images cs.CVPDF

[23] Target-Bench: Can World Models Achieve Mapless Path Planning with Semantic Targets? cs.CV | cs.ROPDF

[24] Attention Guided Alignment in Efficient Vision-Language Models cs.CV | cs.LGPDF

[25] Pillar-0: A New Frontier for Radiology Foundation Models cs.CV | cs.AIPDF

[26] A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett-Luce Ranking cs.CV | cs.AIPDF

[27] QAL: A Loss for Recall Precision Balance in 3D Reconstruction cs.CV | cs.ROPDF

[28] Toward explainable AI approaches for breast imaging: adapting foundation models to diverse populations cs.CV | cs.AIPDF

[29] Show Me: Unifying Instructional Image and Video Generation with Diffusion Models cs.CVPDF

[30] JigsawComm: Joint Semantic Feature Encoding and Transmission for Communication-Efficient Cooperative Perception cs.CVPDF

[31] Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation cs.CV | cs.AIPDF

[32] MGA-VQA: Secure and Interpretable Graph-Augmented Visual Question Answering with Memory-Guided Protection Against Unauthorized Knowledge Use cs.CV | cs.AIPDF

[33] ArticFlow: Generative Simulation of Articulated Mechanisms cs.CV | cs.ROPDF

[34] FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning cs.CV | cs.LGPDF

[35] When Better Teachers Don’t Make Better Students: Revisiting Knowledge Distillation for CLIP Models in VQA cs.CV | cs.CLPDF

[36] CUS-GS: A Compact Unified Structured Gaussian Splatting Framework for Multimodal Scene Representation cs.CV | cs.ROPDF

[37] PA-FAS: Towards Interpretable and Generalizable Multimodal Face Anti-Spoofing via Path-Augmented Reinforcement Learning cs.CV | cs.AIPDF

[38] MambaTAD: When State-Space Models Meet Long-Range Temporal Action Detection cs.CV | cs.AIPDF

[39] UniRSCD: A Unified Novel Architectural Paradigm for Remote Sensing Change Detection cs.CVPDF

[40] Novel View Synthesis from A Few Glimpses via Test-Time Natural Video Completion cs.CV | cs.GRPDF

[41] SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System cs.CVPDF

[42] Test-Time Temporal Sampling for Efficient MLLM Video Understanding cs.CVPDF

[43] Multi-speaker Attention Alignment for Multimodal Social Interaction cs.CVPDF

[44] HEAL: Learning-Free Source Free Unsupervised Domain Adaptation for Cross-Modality Medical Image Segmentation cs.CVPDF

[45] VITAL: Vision-Encoder-centered Pre-training for LMMs in Visual Quality Assessment cs.CV | cs.AIPDF

[46] X-ReID: Multi-granularity Information Interaction for Video-Based Visible-Infrared Person Re-Identification cs.CVPDF

[47] Signal: Selective Interaction and Global-local Alignment for Multi-Modal Object Re-Identification cs.CV | cs.MMPDF

[48] Plan-X: Instruct Video Generation via Semantic Planning cs.CV | cs.AIPDF

[49] HyM-UNet: Synergizing Local Texture and Global Context via Hybrid CNN-Mamba Architecture for Medical Image Segmentation cs.CV | cs.IRPDF

[50] SD-PSFNet: Sequential and Dynamic Point Spread Function Network for Image Deraining cs.CVPDF

[51] RAISECity: A Multimodal Agent Framework for Reality-Aligned 3D World Generation at City-Scale cs.CVPDF

[52] Is Complete Labeling Necessary? Understanding Active Learning in Longitudinal Medical Imaging cs.CVPDF

[53] RoadBench: Benchmarking MLLMs on Fine-Grained Spatial Understanding and Reasoning under Urban Road Scenarios cs.CVPDF

[54] Modeling Retinal Ganglion Cells with Neural Differential Equations cs.CV | cs.AIPDF

[55] MambaX: Image Super-Resolution with State Predictive Control cs.CVPDF

[56] Hybrid Event Frame Sensors: Modeling, Calibration, and Simulation cs.CVPDF

[57] UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios cs.CVPDF

[58] IE-Critic-R1: Advancing the Explanatory Measurement of Text-Driven Image Editing for Human Perception Alignment cs.CV | cs.AI | cs.CLPDF

[59] VK-Det: Visual Knowledge Guided Prototype Learning for Open-Vocabulary Aerial Object Detection cs.CVPDF

[60] ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models cs.CV | cs.ROPDF

[61] Less Is More: An Explainable AI Framework for Lightweight Malaria Classification cs.CVPDF

[62] Together, Then Apart: Revisiting Multimodal Survival Analysis via a Min-Max Perspective cs.CVPDF

[63] Spotlight: Identifying and Localizing Video Generation Errors Using VLMs cs.CVPDF

[64] Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning cs.CVPDF

[65] Muskie: Multi-view Masked Image Modeling for 3D Vision Pre-training cs.CVPDF

[66] PromptMoE: Generalizable Zero-Shot Anomaly Detection via Visually-Guided Prompt Mixtures cs.CVPDF

[67] MVS-TTA: Test-Time Adaptation for Multi-View Stereo via Meta-Auxiliary Learning cs.CVPDF

[68] VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging cs.CV | cs.AIPDF

[69] Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models cs.CV | cs.AI | cs.CL | cs.LGPDF

[70] SFHand: A Streaming Framework for Language-guided 3D Hand Forecasting and Embodied Manipulation cs.CVPDF

[71] Video4Edit: Viewing Image Editing as a Degenerate Temporal Process cs.CVPDF

[72] SCALER: SAM-Enhanced Collaborative Learning for Label-Deficient Concealed Object Segmentation cs.CV | cs.AIPDF

[73] Compact neural networks for astronomy with optimal transport bias correction cs.CVPDF

[74] UnfoldLDM: Deep Unfolding-based Blind Image Restoration with Latent Diffusion Priors cs.CV | cs.AIPDF

[75] Assessing the alignment between infants’ visual and linguistic experience using multimodal language models cs.CV | cs.CLPDF

[76] Matching-Based Few-Shot Semantic Segmentation Models Are Interpretable by Design cs.CVPDF

[77] Nested Unfolding Network for Real-World Concealed Object Segmentation cs.CV | cs.AIPDF

[78] EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses cs.CVPDF