cs.CV [Total: 133]
cs.CL [Total: 32]
cs.DL [Total: 1]
cs.LO [Total: 1]
cs.CY [Total: 3]
eess.IV [Total: 4]
cs.GR [Total: 5]
cs.AI [Total: 9]
cs.DB [Total: 1]
cs.CR [Total: 1]
cs.SD [Total: 4]
cs.RO [Total: 7]
cs.MM [Total: 1]
cs.FL [Total: 1]
cs.IR [Total: 2]
physics.geo-ph [Total: 1]
eess.AS [Total: 1]
cs.LG [Total: 13]

cs.CV

[1] SRKD: Towards Efficient 3D Point Cloud Segmentation via Structure- and Relation-aware Knowledge Distillation [cs.CV]

Yuqi Li, Junhao Dong, Zeyu Dong, Chuanguang Yang, Zhulin An

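TL;DR: The paper proposes SRKD, a structure- and relation-aware knowledge distillation framework for efficient 3D point cloud segmentation. It transfers geometric and semantic knowledge from a large teacher model to a lightweight student model, substantially reducing model complexity.

Details

Motivation: Large Transformer-based models for 3D point cloud segmentation are computationally complex and hard to deploy, which calls for an efficient lightweight model.

Result: The student stays lightweight (<15M parameters) while achieving state-of-the-art performance, markedly improving efficiency in real-world deployment.

Insight: Structure- and relation-aware distillation can significantly improve lightweight models, and cross-sample contrastive learning helps generalization.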

Abstract: 3D point cloud segmentation faces practical challenges due to the computational complexity and deployment limitations of large-scale transformer-based models. To address this, we propose a novel Structure- and Relation-aware Knowledge Distillation framework, named SRKD, that transfers rich geometric and semantic knowledge from a large frozen teacher model (>100M) to a lightweight student model (<15M). Specifically, we propose an affinity matrix-based relation alignment module, which distills structural dependencies from the teacher to the student through point-wise similarity matching, enhancing the student’s capability to learn contextual interactions. Meanwhile, we introduce a cross-sample mini-batch construction strategy that enables the student to perceive stable and generalized geometric structure by aligning with the teacher across diverse point cloud instances rather than within a single sample. Additionally, KL divergence is applied to align semantic distributions, and ground-truth supervision further reinforces accurate segmentation. Our method achieves state-of-the-art performance with significantly reduced model complexity, demonstrating its effectiveness and efficiency in real-world deployment scenarios. Our code is available at https://github.com/itsnotacie/SRKD.
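The two distillation terms named in the abstract, affinity-matrix relation alignment and KL divergence on semantic distributions, can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the cosine similarity, the mean-squared matching, and the function names (`affinity`, `relation_loss`, `kl_divergence`) are assumptions chosen for clarity.

```python
import math

def normalize(v):
    """Scale a feature vector to unit length (zero vectors are left as-is)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def affinity(features):
    """N x N matrix of pairwise cosine similarities between point features."""
    unit = [normalize(f) for f in features]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]

def relation_loss(student_feats, teacher_feats):
    """Mean squared difference between student and teacher affinity matrices.

    Assumed form of the matching; the paper may use a different distance.
    """
    s, t = affinity(student_feats), affinity(teacher_feats)
    n = len(s)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    z = sum(e)
    return [x / z for x in e]

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) over the class distributions of one point."""
    p, q = softmax(teacher_logits), softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Because each affinity matrix is N x N regardless of feature width, the teacher and student can use different embedding sizes, which is one plausible reason to match relations rather than raw features.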

[2] Fine-Scale Soil Mapping in Alaska with Multimodal Machine Learning [cs.CV | cs.LG]

Yijun Lin, Theresa Chen, Colby Brungard, Grunwald Sabine, Sue Ives

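TL;DR: The paper presents MISO, a vision-based machine learning model for producing high-resolution soil maps of Alaska, focusing on near-surface permafrost and soil taxonomy. By combining a geospatial foundation model, implicit neural representations, and contrastive learning, it outperforms the traditional Random Forest approach in performance and generalization.

Details

Motivation: Soil mapping in Alaska has traditionally relied on fieldwork and localized simulations, which lack sufficient resolution. As climate change accelerates permafrost thaw, high-resolution soil maps are urgently needed to support ecological protection and infrastructure planning.

Result: In spatial cross-validation and regional analysis, MISO outperforms Random Forest, generalizing better and achieving higher recall, which makes it suitable for permafrost monitoring and the study of environmental processes.

Insight: Advanced machine learning methods show promise for soil mapping, particularly for multimodal data fusion and spatial prediction, and the work offers practical guidance for future soil sampling and infrastructure planning.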

Abstract: Fine-scale soil mapping in Alaska, traditionally relying on fieldwork and localized simulations, remains a critical yet underdeveloped task, despite the region’s ecological importance and extensive permafrost coverage. As permafrost thaw accelerates due to climate change, it threatens infrastructure stability and key ecosystem services, such as soil carbon storage. High-resolution soil maps are essential for characterizing permafrost distribution, identifying vulnerable areas, and informing adaptation strategies. We present MISO, a vision-based machine learning (ML) model to produce statewide fine-scale soil maps for near-surface permafrost and soil taxonomy. The model integrates a geospatial foundation model for visual feature extraction, implicit neural representations for continuous spatial prediction, and contrastive learning for multimodal alignment and geo-location awareness. We compare MISO with Random Forest (RF), a traditional ML model that has been widely used in soil mapping applications. Spatial cross-validation and regional analysis across Permafrost Zones and Major Land Resource Areas (MLRAs) show that MISO generalizes better to remote, unseen locations and achieves higher recall than RF, which is critical for monitoring permafrost thaw and related environmental processes. These findings demonstrate the potential of advanced ML approaches for fine-scale soil mapping and provide practical guidance for future soil sampling and infrastructure planning in permafrost-affected landscapes. The project will be released at https://github.com/knowledge-computing/Peatland-permafrost.

[3] RadarSeq: A Temporal Vision Framework for User Churn Prediction via Radar Chart Sequences [cs.CV | cs.AI]

Sina Najafi, M. Hadi Sepanj, Fahimeh Jafari

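TL;DR: The paper proposes RadarSeq, a temporal vision framework that predicts user churn from sequences of radar charts. By combining a CNN encoder with a bidirectional LSTM, it captures spatial and temporal patterns and substantially improves performance.

Details

Motivation: On non-subscription gig platforms, churn is hard to predict because explicit labels are absent and user behavior changes over time. Existing methods mostly rely on static representations and ignore key temporal information.

Result: Experiments on a real-world dataset show clear gains over classical models and ViT-based baselines (+17.7 F1, +29.4 precision, +16.1 AUC).

Insight: The framework's modular design and explainability tools make it suitable for large-scale churn modeling on dynamic gig platforms.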

Abstract: Predicting user churn in non-subscription gig platforms, where disengagement is implicit, poses unique challenges due to the absence of explicit labels and the dynamic nature of user behavior. Existing methods often rely on aggregated snapshots or static visual representations, which obscure temporal cues critical for early detection. In this work, we propose a temporally-aware computer vision framework that models user behavioral patterns as a sequence of radar chart images, each encoding day-level behavioral features. By integrating a pretrained CNN encoder with a bidirectional LSTM, our architecture captures both spatial and temporal patterns underlying churn behavior. Extensive experiments on a large real-world dataset demonstrate that our method outperforms classical models and ViT-based radar chart baselines, yielding gains of 17.7 in F1 score, 29.4 in precision, and 16.1 in AUC, along with improved interpretability. The framework’s modular design, explainability tools, and efficient deployment characteristics make it suitable for large-scale churn modeling in dynamic gig-economy platforms.
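To make the input representation concrete, the sketch below turns day-level feature vectors into radar chart polygons, one polygon per day. It covers only the chart geometry; rendering to images and the CNN plus BiLSTM model are out of scope. The assumption that features are pre-scaled to [0, 1], and the function names, are illustrative rather than taken from the paper.

```python
import math

def radar_vertices(features, radius=1.0):
    """Place each feature on its own evenly spaced axis and return (x, y) vertices.

    Assumes every feature value has already been scaled to [0, 1].
    """
    k = len(features)
    vertices = []
    for i, value in enumerate(features):
        angle = 2 * math.pi * i / k  # axis i points in this direction
        vertices.append((radius * value * math.cos(angle),
                         radius * value * math.sin(angle)))
    return vertices

def radar_sequence(daily_features):
    """One polygon per day, in temporal order, ready to be drawn frame by frame."""
    return [radar_vertices(day) for day in daily_features]
```

Each polygon would then be rasterized and encoded by the image model, with the ordered sequence passed to the temporal model.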

[4] P2MFDS: A Privacy-Preserving Multimodal Fall Detection System for Elderly People in Bathroom Environments [cs.CV | cs.AI]

Haitian Wang, Yiren Wang, Xinyu Wang, Yumeng Miao, Yuliang Zhang

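TL;DR: P2MFDS is a privacy-preserving multimodal fall detection system designed for elderly people in bathroom environments. It fuses millimeter-wave radar with 3D vibration sensing and uses a dual-stream network to improve detection accuracy.

Details

Motivation: With the global population aging and bathrooms being a high-risk environment for falls, existing unimodal systems lose accuracy because of environmental interference and system bias. A privacy-preserving, more reliable multimodal solution is needed.

Result: P2MFDS significantly outperforms existing methods in accuracy and recall; the dataset and models will be released publicly.

Insight: Multimodal fusion can overcome the limitations of unimodal systems, and the dual-stream design combining spatiotemporal features offers a new approach to fall detection in complex environments.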

Abstract: By 2050, people aged 65 and over are projected to make up 16 percent of the global population. Aging is closely associated with increased fall risk, particularly in wet and confined environments such as bathrooms, where over 80 percent of falls occur. Although recent research has increasingly focused on non-intrusive, privacy-preserving approaches that do not rely on wearable devices or video-based monitoring, these efforts have not fully overcome the limitations of existing unimodal systems (e.g., WiFi-, infrared-, or mmWave-based), which are prone to reduced accuracy in complex environments. These limitations stem from fundamental constraints in unimodal sensing, including system bias and environmental interference, such as multipath fading in WiFi-based systems and drastic temperature changes in infrared-based methods. To address these challenges, we propose a Privacy-Preserving Multimodal Fall Detection System for Elderly People in Bathroom Environments. First, we develop a sensor evaluation framework to select and fuse millimeter-wave radar with 3D vibration sensing, and use it to construct and preprocess a large-scale, privacy-preserving multimodal dataset in real bathroom settings, which will be released upon publication. Second, we introduce P2MFDS, a dual-stream network combining a CNN-BiLSTM-Attention branch for radar motion dynamics with a multi-scale CNN-SEBlock-Self-Attention branch for vibration impact detection. By uniting macro- and micro-scale features, P2MFDS delivers significant gains in accuracy and recall over state-of-the-art approaches. Code and pretrained models will be made available at: https://github.com/HaitianWang/P2MFDS-A-Privacy-Preserving-Multimodal-Fall-Detection-Network-for-Elderly-Individuals-in-Bathroom.

[5] A Novel Multi-layer Task-centric and Data Quality Framework for Autonomous Driving [cs.CV | cs.AI]

Yuhan Zhou, Haihua Chen, Kewei Sha

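TL;DR: The paper proposes a novel multi-layer, task-centric data quality framework that aims to guarantee functionality, efficiency, and trustworthiness for autonomous driving systems, and validates its potential to improve task performance through a case study.

Details

Motivation: Research and practice in autonomous driving focus heavily on models and algorithms while neglecting the quality of multisource, multimodal data. To meet the demands that dynamic environments and heterogeneous data streams place on next-generation systems, data quality must be addressed systematically.

Result: Experiments show that partially removing redundant data can improve object detection performance, and they reveal redundancy in multimodal data.

Insight: Systematic data quality management is critical to the performance of autonomous driving systems, and the framework opens new directions for future research.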

Abstract: The next-generation autonomous vehicles (AVs), embedded with frequent real-time decision-making, will rely heavily on a large volume of multisource and multimodal data. In real-world settings, the data quality (DQ) of different sources and modalities usually varies due to unexpected environmental factors or sensor issues. However, both researchers and practitioners in the AV field overwhelmingly concentrate on models/algorithms while undervaluing the DQ. To fulfill the needs of the next-generation AVs with guarantees of functionality, efficiency, and trustworthiness, this paper proposes a novel task-centric and data quality-based framework which consists of five layers: data layer, DQ layer, task layer, application layer, and goal layer. The proposed framework aims to map DQ with task requirements and performance goals. To illustrate, a case study investigating redundancy on the nuScenes dataset shows that partially removing redundancy on multisource image data could improve YOLOv8 object detection task performance. Analysis on multimodal data of image and LiDAR further reveals existing redundancy-related DQ issues. This paper opens up a range of critical but unexplored challenges at the intersection of DQ, task orchestration, and performance-oriented system development in AVs. It is expected to guide the AV community toward building more adaptive, explainable, and resilient AVs that respond intelligently to dynamic environments and heterogeneous data streams. Code, data, and implementation details are publicly available at: https://anonymous.4open.science/r/dq4av-framework/README.md.

[6] Efficient Feedback Gate Network for Hyperspectral Image Super-Resolution [cs.CV | cs.LG]

Xufei Wang, Mingjian Zhang, Fei Ge, Jinchen Zhu, Wen Sha

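TL;DR: The paper proposes the Efficient Feedback Gate Network for single hyperspectral image super-resolution (SHSR). It combines feedback and gate operations with large-kernel convolutions and spectral interactions, markedly improving spatial resolution and spectral fidelity.

Details

Motivation: Existing SHSR methods do not fully exploit inter-band coherence and spatial-spectral information, which limits performance. The authors therefore propose a method that explores this information more thoroughly to improve reconstruction quality.

Result: Experiments on three hyperspectral datasets show that the method outperforms state-of-the-art algorithms in spectral fidelity and spatial content reconstruction.

Insight: Through feedback gate operations and a grouping strategy, the method effectively integrates spatial and spectral information, which explains its strong performance on hyperspectral super-resolution.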

Abstract: Even without auxiliary images, single hyperspectral image super-resolution (SHSR) methods can be designed to improve the spatial resolution of hyperspectral images. However, failing to thoroughly explore coherence along bands and spatial-spectral information limits SHSR performance. In this study, we propose a novel group-based SHSR method termed the efficient feedback gate network, which uses various feedbacks and gate operations involving large kernel convolutions and spectral interactions. In particular, by providing different guidance for neighboring groups, we can learn rich band information and hierarchical hyperspectral spatial information using channel shuffling and dilated convolution in the shuffled and progressive dilated fusion module (SPDFM). Moreover, we develop a wide-bound perception gate block and a spectrum enhancement gate block to construct the spatial-spectral reinforcement gate module (SSRGM) and obtain highly representative spatial-spectral features efficiently. Additionally, we apply a three-dimensional SSRGM to enhance holistic information and coherence for hyperspectral data. The experimental results on three hyperspectral datasets demonstrate the superior performance of the proposed network over the state-of-the-art methods in terms of spectral fidelity and spatial content reconstruction.
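Channel shuffling, one of the operations the abstract names, interleaves channels from different groups so that later group-wise layers see information from every group. The sketch below shows the standard reshape, transpose, flatten form of the operation on a plain list of band indices. How SPDFM applies it is not described in the abstract, so the group count and the function name are assumptions.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups.

    Equivalent to reshaping to (groups, per_group), transposing, and flattening.
    """
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]

# Six spectral bands split into two groups: [0, 1, 2] and [3, 4, 5].
shuffled = channel_shuffle([0, 1, 2, 3, 4, 5], groups=2)
print(shuffled)  # [0, 3, 1, 4, 2, 5]
```

After the shuffle, each consecutive pair holds one band from each original group, which is what lets neighboring groups exchange band information.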

[7] From Drawings to Decisions: A Hybrid Vision-Language Framework for Parsing 2D Engineering Drawings into Structured Manufacturing Knowledge [cs.CV | cs.AI | cs.IR]

Muhammad Tayyab Khan, Lequn Chen, Zane Yong, Jun Ming Tan, Wenhe Feng

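TL;DR: The paper proposes a hybrid vision-language framework that extracts structured manufacturing knowledge from 2D engineering drawings, addressing the limitations of conventional OCR models on complex drawings.

Details

Motivation: Manually extracting key information from 2D engineering drawings (such as geometric dimensioning and tolerancing and material specifications) is slow and error-prone, while generic OCR models handle complex layouts and rotated text poorly.

Result: Donut outperforms Florence-2 on the parsing task, with 88.5% precision, 99.2% recall, a 93.5% F1-score, and an 11.5% hallucination rate.

Insight: The hybrid vision-language framework has practical value for industrial tasks: it supports downstream manufacturing tasks such as process and tool selection, and its lightweight design suits settings with limited compute.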

Abstract: Efficient and accurate extraction of key information from 2D engineering drawings is essential for advancing digital manufacturing workflows. Such information includes geometric dimensioning and tolerancing (GD&T), measures, material specifications, and textual annotations. Manual extraction is slow and labor-intensive, while generic OCR models often fail due to complex layouts, engineering symbols, and rotated text, leading to incomplete and unreliable outputs. To address these challenges, we propose a hybrid vision-language framework that integrates a rotation-aware object detection model (YOLOv11-OBB) with a transformer-based vision-language parser. Our structured pipeline applies YOLOv11-OBB to localize annotations and extract oriented bounding box (OBB) patches, which are then parsed into structured outputs using a fine-tuned, lightweight vision-language model (VLM). We curate a dataset of 1,367 2D mechanical drawings annotated across nine key categories. YOLOv11-OBB is trained on this dataset to detect OBBs and extract annotation patches. These are parsed using two open-source VLMs: Donut and Florence-2. Both models are lightweight and well-suited for specialized industrial tasks under limited computational overhead. Following fine-tuning of both models on the curated dataset of image patches paired with structured annotation labels, a comparative experiment is conducted to evaluate parsing performance across four key metrics. Donut outperforms Florence-2, achieving 88.5% precision, 99.2% recall, and a 93.5% F1-score, with a hallucination rate of 11.5%. Finally, a case study demonstrates how the extracted structured information supports downstream manufacturing tasks such as process and tool selection, showcasing the practical utility of the proposed framework in modernizing 2D drawing interpretation.
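The reported Donut metrics can be cross-checked with the standard F1 formula, the harmonic mean of precision and recall. The short check below uses only the numbers quoted in the abstract. The observation that the hallucination rate equals one minus precision is an inference from those numbers, not a definition stated in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

donut_f1 = f1_score(0.885, 0.992)
print(round(donut_f1 * 100, 1))  # 93.5, matching the reported F1-score

# 1 - precision gives 0.115, the same value as the reported 11.5% hallucination rate.
print(round(1 - 0.885, 3))  # 0.115
```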

[8] Spatial-Temporal Pre-Training for Embryo Viability Prediction Using Time-Lapse Videos [cs.CV]

Zhiyi Shi, Junsik Kim, Helen Y. Yang, Yonghyun Song, Hyun-Jic Oh

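TL;DR: The paper proposes Spatial-Temporal Pre-Training (STPT) for embryo time-lapse videos. By training the spatial and temporal encoders in separate stages, it addresses the challenges of long videos and temporal misalignment and improves embryo viability prediction.

Details

Motivation: Embryo viability prediction is critical in in vitro fertilization (IVF), but labeled data are scarce and conventional self-supervised learning methods struggle with long videos and temporal misalignment. An efficient self-supervised method is therefore needed.

Result: On 23,027 time-lapse videos (3,286 labeled), STPT achieves the highest AUC of 0.635 (95% CI: 0.632-0.638), outperforming the baselines.

Insight: Staged training and avoiding frame alignment across videos handle long videos and temporal misalignment effectively, offering a new approach for similar medical video analysis tasks.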

Abstract: Automating embryo viability prediction for in vitro fertilization (IVF) is important but challenging due to the limited availability of labeled pregnancy outcome data, as only a small fraction of embryos are labeled after transfer. Self-supervised learning (SSL) can leverage both labeled and unlabeled data to improve prediction. However, existing SSL methods for videos are not directly applicable to embryo development videos due to two challenges: (1) embryo time-lapse videos contain hundreds of frames, requiring significant GPU memory for conventional SSL; (2) the dataset contains videos with varying lengths and many outlier frames, causing traditional video alignment methods to struggle with semantic misalignment. We propose Spatial-Temporal Pre-Training (STPT) to address these challenges. STPT includes two stages: spatial and temporal. In each stage, only one encoder is trained while the other is frozen, reducing memory demands. To handle temporal misalignment, STPT avoids frame-by-frame alignment across videos. The spatial stage learns from alignments within each video and its temporally consistent augmentations. The temporal stage then models relationships between video embeddings. Our method efficiently handles long videos and temporal variability. On 23,027 time-lapse videos (3,286 labeled), STPT achieves the highest AUC of 0.635 (95% CI: 0.632-0.638) compared to baselines, with limited computational resources.
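For readers less familiar with the headline metric, AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so 0.5 is chance level and 0.635 is a modest but real ranking signal. A minimal pairwise implementation is sketched below; the toy labels and scores are made up for illustration and are unrelated to the paper's data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of samples,
    which is fine for a small illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ordered correctly.
print(roc_auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6]))  # 0.75
```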

[9] VMRA-MaR: An Asymmetry-Aware Temporal Framework for Longitudinal Breast Cancer Risk Prediction [cs.CV]

Zijun Sun, Solveig Thrun, Michael Kampffmeyer

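TL;DR: The paper proposes VMRA-MaR, a temporal framework based on Vision Mamba RNN (VMRNN) and a state-space model (SSM) for longitudinal breast cancer risk prediction. It adds an asymmetry module to capture bilateral breast differences and improves prediction for high-risk groups, most notably for high-density breasts and longer-term prediction.

Details

Motivation: Breast cancer is a leading cause of death worldwide, yet most automated risk prediction methods consider only the most recent screening and ignore the dynamic changes available in longitudinal series. The paper proposes a temporal framework to better capture trends in breast tissue evolution and to mirror clinical practice.

Result: VMRA-MaR performs strongly in predicting cancer onset, clearly outperforming existing methods on high-density breasts and longer-term prediction, and shows potential for early detection and personalized screening.

Insight: The key novelty is combining temporal modeling with the morphological asymmetry that clinicians attend to, which offers a new approach to longitudinal medical image analysis and supports more dynamic, personalized cancer screening strategies.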

Abstract: Breast cancer remains a leading cause of mortality worldwide and is typically detected via screening programs where healthy people are invited at regular intervals. Automated risk prediction approaches have the potential to improve this process by facilitating dynamic screening of high-risk groups. While most models focus solely on the most recent screening, there is growing interest in exploiting temporal information to capture evolving trends in breast tissue, as inspired by clinical practice. Early methods typically relied on two time steps, and although recent efforts have extended this to multiple time steps using Transformer architectures, challenges remain in fully harnessing the rich temporal dynamics inherent in longitudinal imaging data. In this work, we propose to instead leverage Vision Mamba RNN (VMRNN) with a state-space model (SSM) and LSTM-like memory mechanisms to effectively capture nuanced trends in breast tissue evolution. To further enhance our approach, we incorporate an asymmetry module that utilizes a Spatial Asymmetry Detector (SAD) and Longitudinal Asymmetry Tracker (LAT) to identify clinically relevant bilateral differences. This integrated framework demonstrates notable improvements in predicting cancer onset, especially for the more challenging high-density breast cases, and achieves superior performance at extended time points (years four and five), highlighting its potential to advance early breast cancer recognition and enable more personalized screening strategies. Our code is available at https://github.com/Mortal-Suen/VMRA-MaR.git.

[10] Trans$^2$-CBCT: A Dual-Transformer Framework for Sparse-View CBCT Reconstruction [cs.CV | cs.AI]

Minmin Yang, Huantao Ren, Senem Velipasalar

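TL;DR: Trans$^2$-CBCT is a dual-Transformer framework for sparse-view CBCT reconstruction that combines CNN and Transformer features with point-based geometry reasoning to markedly improve reconstruction quality.

Details

Motivation: Sparse-view CBCT enables fast scans at low radiation dose but causes severe artifacts and poor spatial coverage. The authors propose a unified framework to address these challenges.

Result: On the LUNA16 dataset, Trans-CBCT improves on baselines by 1.17 dB PSNR and 0.0163 SSIM; adding the Point Transformer yields a further 0.63 dB PSNR and 0.0117 SSIM.

Insight: Combining the local feature extraction of CNNs, the global modeling of Transformers, and point-based geometry reasoning is an effective way to improve sparse-view CBCT reconstruction.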

Abstract: Cone-beam computed tomography (CBCT) using only a few X-ray projection views enables faster scans with lower radiation dose, but the resulting severe under-sampling causes strong artifacts and poor spatial coverage. We address these challenges in a unified framework. First, we replace conventional UNet/ResNet encoders with TransUNet, a hybrid CNN-Transformer model. Convolutional layers capture local details, while self-attention layers enhance global context. We adapt TransUNet to CBCT by combining multi-scale features, querying view-specific features per 3D point, and adding a lightweight attenuation-prediction head. This yields Trans-CBCT, which surpasses prior baselines by 1.17 dB PSNR and 0.0163 SSIM on the LUNA16 dataset with six views. Second, we introduce a neighbor-aware Point Transformer to enforce volumetric coherence. This module uses 3D positional encoding and attention over k-nearest neighbors to improve spatial consistency. The resulting model, Trans$^2$-CBCT, provides an additional gain of 0.63 dB PSNR and 0.0117 SSIM. Experiments on LUNA16 and ToothFairy show consistent gains from six to ten views, validating the effectiveness of combining CNN-Transformer features with point-based geometry reasoning for sparse-view CBCT reconstruction.
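The gains above are reported in PSNR, which is a logarithmic function of mean squared error. The sketch below gives the standard definition on flattened intensity lists, plus a helper that converts a dB gain into the equivalent MSE reduction factor. The `data_range` default of 1.0 assumes intensities normalized to [0, 1], which the abstract does not specify.

```python
import math

def psnr(reference, reconstruction, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized intensity lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical volumes
    return 10 * math.log10(data_range ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10 ** (gain_db / 10)
```

Because the scale is logarithmic, a fixed dB gain always corresponds to the same relative drop in MSE, regardless of the starting quality.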

[11] Enhancing Wireless Device Identification through RF Fingerprinting: Leveraging Transient Energy Spectrum Analysis [cs.CV]

Nisar Ahmed, Gulshan Saleem, Hafiz Muhammad Shahzad Asif, Muhammad Usman Younus, Kalsoom Safdar

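TL;DR: The paper proposes an RF device identification method based on transient energy spectrum analysis, paired with a hybrid CNN-Bi-GRU deep learning model, that substantially improves the accuracy of wireless device identification.

Details

Motivation: With the rapid growth of IoT technologies and 5G networks, the number of radiating devices in complex electromagnetic environments has surged, making accurate identification and classification of these devices a key challenge.

Result: In 10-fold cross-validation, the method achieves 99.33% precision, 99.53% recall, a 99.43% F1-score, and 99.17% classification accuracy.

Insight: Hybrid deep learning models such as CNN-Bi-GRU perform well in RF fingerprinting and can exploit transient features to improve device identification accuracy.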

Abstract: In recent years, the rapid growth of the Internet of Things technologies and the widespread adoption of 5G wireless networks have led to an exponential increase in the number of radiation devices operating in complex electromagnetic environments. A key challenge in managing and securing these devices is accurate identification and classification. To address this challenge, specific emitter identification techniques have emerged as a promising solution that aims to provide reliable and efficient means of identifying individual radiation devices in a unified and standardized manner. This research proposes an approach that leverages transient energy spectrum analysis using the General Linear Chirplet Transform to extract features from RF devices. A dataset comprising nine RF devices is utilized, with each sample containing 900 attributes and a total of 1080 equally distributed samples across the devices. These features are then used in a classification modeling framework. To overcome the limitations of conventional machine learning methods, we introduce a hybrid deep learning model called the CNN-Bi-GRU for learning the identification of RF devices based on their transient characteristics. The proposed approach provided a 10-fold cross-validation performance with a precision of 99.33%, recall of 99.53%, F1-score of 99.43%, and classification accuracy of 99.17%. The results demonstrate the promising classification performance of the CNN-Bi-GRU approach, indicating its suitability for accurately identifying RF devices based on their transient characteristics and its potential for enhancing device identification and classification in complex wireless environments.
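The dataset arithmetic implies 1080 / 9 = 120 samples per device, so a 10-fold split gives 108 test samples per fold. The sketch below builds such a split with equal class representation in every fold. Whether the authors stratified or shuffled their folds is not stated, so both the stratification and the deterministic round-robin assignment here are assumptions.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign sample indices to k folds so each class is spread evenly.

    Samples of each class are dealt round-robin across folds (no shuffling).
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# 9 devices with 120 samples each, as described in the abstract.
labels = [device for device in range(9) for _ in range(120)]
folds = stratified_folds(labels, k=10)
print([len(fold) for fold in folds])  # ten folds of 108 samples each
```

Each fold then serves once as the held-out test set while the other nine folds (972 samples) are used for training.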

[12] AQUA20: A Benchmark Dataset for Underwater Species Classification under Challenging Conditions [cs.CV]

Taufikur Rahman Fuad, Sabbir Ahmed, Shahriar Ivan

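TL;DR: AQUA20 is a benchmark dataset for underwater species classification, comprising 8,171 images across 20 marine species and designed to capture difficult underwater visual conditions. Of the 13 deep learning models evaluated, ConvNeXt performs best.

Details

Motivation: Visual recognition underwater is highly challenging because of complex distortions such as turbidity, low illumination, and occlusion, which limit the performance of existing vision systems.

Result: ConvNeXt performs best, with 98.82% Top-3 accuracy, 90.69% Top-1 accuracy, and an F1-score of 88.92%.

Insight: There is a trade-off between model complexity and performance, underwater species recognition still has room for improvement, and AQUA20 provides a foundation for future research.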

Abstract: Robust visual recognition in underwater environments remains a significant challenge due to complex distortions such as turbidity, low illumination, and occlusion, which severely degrade the performance of standard vision systems. This paper introduces AQUA20, a comprehensive benchmark dataset comprising 8,171 underwater images across 20 marine species reflecting real-world environmental challenges such as illumination, turbidity, occlusions, etc., providing a valuable resource for underwater visual understanding. Thirteen state-of-the-art deep learning models, including lightweight CNNs (SqueezeNet, MobileNetV2) and transformer-based architectures (ViT, ConvNeXt), were evaluated to benchmark their performance in classifying marine species under challenging conditions. Our experimental results show ConvNeXt achieving the best performance, with a Top-3 accuracy of 98.82% and a Top-1 accuracy of 90.69%, as well as the highest overall F1-score of 88.92% with moderately large parameter size. The results obtained from our other benchmark models also demonstrate trade-offs between complexity and performance. We also provide an extensive explainability analysis using GRAD-CAM and LIME for interpreting the strengths and pitfalls of the models. Our results reveal substantial room for improvement in underwater species recognition and demonstrate the value of AQUA20 as a foundation for future research in this domain. The dataset is publicly available at: https://huggingface.co/datasets/taufiktrf/AQUA20.
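The gap between the Top-1 and Top-3 figures is easier to interpret with the metric written out: a prediction counts as correct under Top-k if the true class is among the k highest-scoring classes. A minimal sketch follows, with made-up scores for three classes rather than the benchmark's twenty.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)

# Hypothetical class scores for three images.
scores = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2],
          [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
print(top_k_accuracy(scores, labels, k=1))  # 1 of 3 correct
print(top_k_accuracy(scores, labels, k=2))  # 2 of 3 correct
```

A large Top-3 minus Top-1 gap, as reported for ConvNeXt, suggests that many errors are confusions between a few visually similar species rather than arbitrary mistakes.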

[13] When Every Millisecond Counts: Real-Time Anomaly Detection via the Multimodal Asynchronous Hybrid Network [cs.CV]

Dong Xiao, Guangyao Chen, Peixi Peng, Yangru Huang, Yifan Zhao

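TL;DR: The paper proposes a multimodal asynchronous hybrid network for real-time anomaly detection in autonomous driving that balances high accuracy with low response time.

Details

Motivation: Existing anomaly detection methods usually emphasize accuracy and neglect response time, which is critical in time-sensitive driving scenarios.

Result: Experiments on benchmark datasets show that the method outperforms existing methods in both accuracy and response time, achieving millisecond-level real-time performance.

Insight: A multimodal asynchronous architecture can address the temporal and spatial aspects of anomaly detection together, providing an effective solution for real-time applications.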

Abstract: Anomaly detection is essential for the safety and reliability of autonomous driving systems. Current methods often focus on detection accuracy but neglect response time, which is critical in time-sensitive driving scenarios. In this paper, we introduce real-time anomaly detection for autonomous driving, prioritizing both minimal response time and high accuracy. We propose a novel multimodal asynchronous hybrid network that combines event streams from event cameras with image data from RGB cameras. Our network utilizes the high temporal resolution of event cameras through an asynchronous Graph Neural Network and integrates it with spatial features extracted by a CNN from RGB images. This combination effectively captures both the temporal dynamics and spatial details of the driving environment, enabling swift and precise anomaly detection. Extensive experiments on benchmark datasets show that our approach outperforms existing methods in both accuracy and response time, achieving millisecond-level real-time performance.

[14] Few-Shot, Now for Real: Medical VLMs Adaptation without Balanced Sets or Validation [cs.CV]

Julio Silva-Rodríguez, Fereshteh Shakeri, Houda Bahig, Jose Dolz, Ismail Ben Ayed

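TL;DR: The paper studies few-shot adaptation of vision-language models (VLMs) for medical image analysis, challenges the unrealistic assumption of balanced datasets and validation sets made by existing methods, and proposes an adaptation approach that needs neither.

Details

Motivation: Medical data are naturally imbalanced and often lack validation sets, yet existing few-shot VLM adaptation methods depend on balanced support sets and validation sets, limiting their practical use.

Result: Experiments across multiple modalities and downstream tasks show that existing methods degrade under realistic conditions, while the proposed baseline remains robust in challenging scenarios.

Insight: Few-shot adaptation in the medical domain needs more realistic assumptions, and a training-free method is an efficient and robust solution.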

Abstract: Vision-language models (VLMs) are gaining attention in medical image analysis. These are pre-trained on large, heterogeneous data sources, yielding rich and transferable representations. Notably, the combination of modality-specialized VLMs with few-shot adaptation has provided fruitful results, enabling the efficient deployment of high-performing solutions. However, previous works on this topic make strong assumptions about the distribution of adaptation data, which are unrealistic in the medical domain. First, prior art assumes access to a balanced support set, a condition that breaks the natural imbalance in disease prevalence found in real-world scenarios. Second, these works typically assume the presence of an additional validation set to fix critical hyper-parameters, which is highly data-inefficient. This work challenges these favorable deployment scenarios and introduces a realistic, imbalanced, validation-free adaptation setting. Our extensive benchmark across various modalities and downstream tasks demonstrates that current methods systematically compromise their performance when operating under realistic conditions, occasionally even performing worse than zero-shot inference. Also, we introduce a training-free linear probe that adaptively blends visual and textual supervision. Detailed studies demonstrate that the proposed solver is a strong, efficient baseline, enabling robust adaptation in challenging scenarios.

[15] Trustworthy Few-Shot Transfer of Medical VLMs through Split Conformal Prediction [cs.CV]

Julio Silva-Rodríguez, Ismail Ben Ayed, Jose Dolz

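TL;DR: The paper proposes transductive split conformal adaptation (SCA-T), a new method that provides trustworthiness guarantees for few-shot transfer of medical vision-language models (VLMs). It performs unsupervised transductive adaptation jointly on calibration and test data, addressing the limitations of standard split conformal prediction (SCP) in transfer learning.

Details

Motivation: Medical VLMs show strong transfer capabilities in few-shot image classification, but their reliability has not been studied in depth. In the standard SCP framework, the generalist nature of pre-training may degrade the quality of task-specific conformal sets.

Result: Compared with SCP, SCA-T offers better efficiency and conditional coverage while maintaining the same empirical guarantees.

Insight: In transfer learning, adapting the model directly on the calibration data can undermine the theoretical basis of conformal prediction; unsupervised transductive adaptation is an effective remedy.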

Abstract: Medical vision-language models (VLMs) have demonstrated unprecedented transfer capabilities and are being increasingly adopted for data-efficient image classification. Despite their growing popularity, their reliability remains largely unexplored. This work explores the split conformal prediction (SCP) framework to provide trustworthiness guarantees when transferring such models based on a small labeled calibration set. Despite its potential, the generalist nature of the VLMs’ pre-training could negatively affect the properties of the predicted conformal sets for specific tasks. While common practice in transfer learning for discriminative purposes involves an adaptation stage, we observe that deploying such a solution for conformal purposes is suboptimal since adapting the model using the available calibration data breaks the rigid exchangeability assumptions for test data in SCP. To address this issue, we propose transductive split conformal adaptation (SCA-T), a novel pipeline for transfer learning on conformal scenarios, which performs an unsupervised transductive adaptation jointly on calibration and test data. We present comprehensive experiments utilizing medical VLMs across various image modalities, transfer tasks, and non-conformity scores. Our framework offers consistent gains in efficiency and conditional coverage compared to SCP, maintaining the same empirical guarantees.
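As background for the SCP baseline the abstract builds on, the sketch below implements plain split conformal prediction for classification. It uses one common non-conformity score, one minus the probability assigned to the true class; the paper evaluates several scores, and this sketch does not implement SCA-T or any adaptation step. Function names are illustrative.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Score threshold from a labeled calibration set for target coverage 1 - alpha.

    Non-conformity score: 1 - probability of the true class (an assumed choice).
    Uses the finite-sample rank ceil((n + 1) * (1 - alpha)).
    """
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points: every class must be included
    return scores[rank - 1]

def prediction_set(probs, threshold):
    """All classes whose non-conformity score does not exceed the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= threshold]
```

The coverage guarantee relies on calibration and test scores being exchangeable, which is the assumption the abstract says is broken when the model is adapted on the calibration data alone.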

[16] Learning golf swing signatures from a single wrist-worn inertial sensor [cs.CV]

Jessy Lauer

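TL;DR: The paper proposes a personalized golf swing analysis framework based on a single wrist-worn inertial sensor. It builds a large dataset of professional players from public videos, generates synthetic inertial data, trains neural networks to infer motion, and learns a discrete vocabulary of motion primitives to detect technical flaws.

Details

Motivation: Golf swing analysis is currently limited by isolated metrics, a lack of data from professional athletes, and the absence of rich, interpretable movement representations. The paper takes a data-driven approach to these problems to deliver personalized swing analysis.

Result: The system accurately estimates full-body kinematics and swing events, captures technical progress, and provides targeted feedback. Explainability methods reveal subtle, individualized movement signatures, and longitudinal tracking demonstrates practical value.

Insight: The study challenges common assumptions such as swing consistency and the existence of a single ideal swing. It also indicates that movement variability is a hallmark of skilled performance, opening new directions for sports biomechanics research.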

Abstract: Despite its importance for performance and injury prevention, golf swing analysis is limited by isolated metrics, underrepresentation of professional athletes, and a lack of rich, interpretable movement representations. We address these gaps with a holistic, data-driven framework for personalized golf swing analysis from a single wrist-worn sensor. We build a large dataset of professional swings from publicly available videos, reconstruct full-body 3D kinematics using biologically accurate human mesh recovery, and generate synthetic inertial data to train neural networks that infer motion and segment swing phases from wrist-based input. We learn a compositional, discrete vocabulary of motion primitives that facilitates the detection and visualization of technical flaws, and is expressive enough to predict player identity, club type, sex, and age. Our system accurately estimates full-body kinematics and swing events from wrist data, delivering lab-grade motion analysis on-course and supporting early detection of anomalous movement patterns. Explainability methods reveal subtle, individualized movement signatures, reinforcing the view that variability is a hallmark of skilled performance. Longitudinal tracking demonstrates practical value: as one player’s handicap improved from 50 to 2.2 over 1.5 years, our system captured measurable technical progress and provided targeted, actionable feedback. Our findings challenge common assumptions, such as swing consistency across clubs and the existence of a single “ideal” swing, and uncover latent biomarkers shaped by both intrinsic traits and task-specific constraints. This work bridges lab and field-based biomechanics, offering scalable, accessible, high-fidelity motion analysis for research, coaching, and injury prevention, while opening new directions in movement-based phenotyping, personalized equipment design, and motor skill development.

[17] Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations cs.CVPDF

Zhihao Yuan, Shuyi Jiang, Chun-Mei Feng, Yaolun Zhang, Shuguang Cui

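TL;DR: Scene-R1 is a video-grounded framework that uses reinforcement learning and two-stage grounding to reason about 3D scenes without 3D annotations.

Details

Motivation: Existing 3D-aware large language models rely on pre-trained 3D detectors for object proposals, and their decision process is opaque. Scene-R1 aims for transparent 3D scene reasoning from video, without 3D annotations.

Result: Scene-R1 surpasses existing open-vocabulary baselines on multiple datasets while providing a transparent reasoning process, showing that RGB-D video and reinforcement learning alone suffice for effective 3D scene understanding.

Insight: Combining reinforcement learning with video data offers a practical, annotation-efficient route to 3D scene understanding and improves model transparency and interpretability.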

Abstract: Currently, utilizing large language models to understand the 3D world is becoming popular. Yet existing 3D-aware LLMs act as black boxes: they output bounding boxes or textual answers without revealing how those decisions are made, and they still rely on pre-trained 3D detectors to supply object proposals. We introduce Scene-R1, a video-grounded framework that learns to reason about 3D scenes without any point-wise 3D instance supervision by pairing reinforcement-learning-driven reasoning with a two-stage grounding pipeline. In the temporal grounding stage, we explicitly reason about the video and select the video snippets most relevant to an open-ended query. In the subsequent image grounding stage, we analyze the image and predict the 2D bounding box. After that, we track the object using SAM2 to produce pixel-accurate masks in RGB frames, and project them back into 3D, thereby eliminating the need for 3D detector-based proposals while capturing fine geometry and material cues. Scene-R1 can also adapt to the 3D visual question answering task to answer free-form questions directly from video. Our training pipeline only needs task-level 2D boxes or textual labels without dense 3D point-wise labels. Scene-R1 surpasses existing open-vocabulary baselines on multiple datasets, while delivering transparent, step-by-step rationales. These results show that reinforcement-learning-based reasoning combined with RGB-D video alone offers a practical, annotation-efficient route to trustworthy 3D scene understanding.
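Editor's sketch: the step of projecting 2D masks "back into 3D" can be illustrated with a standard pinhole back-projection over an RGB-D frame. This is a minimal sketch and not the authors' code; the function name `backproject_mask`, the intrinsics `fx, fy, cx, cy`, and the toy depth map are assumptions made for illustration.

```python
import numpy as np

def backproject_mask(depth, mask, fx, fy, cx, cy):
    """Lift masked pixels of a depth map to 3D camera-frame points.

    Assumes a pinhole camera; depth is in metres and 0 marks missing depth.
    """
    v, u = np.nonzero(mask & (depth > 0))  # rows (v) and columns (u) of valid masked pixels
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)     # shape (num_points, 3)

# Hypothetical 4x4 frame with constant depth and a 2x2 object mask.
depth = np.full((4, 4), 2.0)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
points = backproject_mask(depth, mask, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
print(points.shape)
```

Merging points from several tracked frames would additionally need the camera pose of each frame, which is omitted here.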

[18] SynDaCaTE: A Synthetic Dataset For Evaluating Part-Whole Hierarchical Inference cs.CV | cs.AI | cs.LGPDF

Jake Levi, Mark van der Wilk

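TL;DR: SynDaCaTE is a synthetic dataset for evaluating part-whole hierarchical inference. It targets the difficulty of verifying what existing capsule network models actually learn, and shows that self-attention is effective for parts-to-wholes inference.

Details

Motivation: When capsule network models are trained and evaluated, it is hard to verify whether they actually learn part-whole hierarchies, so a dedicated dataset is needed to test and evaluate this ability.

Result: The dataset exposes a bottleneck in a prominent existing capsule model for part-whole hierarchical inference, and shows that self-attention performs very well on this task.

Insight: Self-attention suggests new directions for designing inductive biases for computer vision tasks.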

Abstract: Learning to infer object representations, and in particular part-whole hierarchies, has been the focus of extensive research in computer vision, in pursuit of improving data efficiency, systematic generalisation, and robustness. Models which are designed to infer part-whole hierarchies, often referred to as capsule networks, are typically trained end-to-end on supervised tasks such as object classification, in which case it is difficult to evaluate whether such a model actually learns to infer part-whole hierarchies, as claimed. To address this difficulty, we present a SYNthetic DAtaset for CApsule Testing and Evaluation, abbreviated as SynDaCaTE, and establish its utility by (1) demonstrating the precise bottleneck in a prominent existing capsule model, and (2) demonstrating that permutation-equivariant self-attention is highly effective for parts-to-wholes inference, which motivates future directions for designing effective inductive biases for computer vision.
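Editor's sketch: "permutation-equivariant" means that reordering the input parts reorders the outputs in the same way and changes nothing else. The check below uses a generic single-head self-attention without positional encodings. It is not the model from the paper; the weight matrices and sizes are arbitrary assumptions.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over the rows of x."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                      # 5 "parts", 8 features each
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
perm = rng.permutation(5)

out = self_attention(x, wq, wk, wv)
out_perm = self_attention(x[perm], wq, wk, wv)

# Permuting the inputs permutes the outputs identically.
is_equivariant = np.allclose(out_perm, out[perm])
print(is_equivariant)
```

The property holds because no operation in the function depends on row position; adding positional encodings would break it.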

[19] VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models cs.CV | cs.AI | cs.ROPDF

Chongkai Gao, Zixuan Liu, Zhenghao Chi, Junshan Huang, Xin Fei

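TL;DR: The paper introduces VLA-OS, a unified family of multimodal (vision-language-action) architectures, and uses systematic experiments to study how planning paradigms and representations affect task performance. Visually grounded planning representations and the hierarchical paradigm perform best.

Details

Motivation: Existing VLA models differ widely in planning paradigms, representations, and other design choices, which makes it hard to pinpoint where performance gains come from and what to improve. A systematic approach is needed that controls for network architecture and training data and focuses on the role of planning paradigms and representations.

Result: Visually grounded planning representations outperform language-based ones. The hierarchical paradigm is better on task performance, generalization, and continual learning, but is slower in training and inference.

Insight: The choice of planning representation and paradigm has a key effect on VLA model performance, which points to directions for future improvement.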

Abstract: Recent studies on Vision-Language-Action (VLA) models have shifted from the end-to-end action-generation paradigm toward a pipeline involving task planning followed by action generation, demonstrating improved performance on various complex, long-horizon manipulation tasks. However, existing approaches vary significantly in terms of network architectures, planning paradigms, representations, and training data sources, making it challenging for researchers to identify the precise sources of performance gains and components to be further improved. To systematically investigate the impacts of different planning paradigms and representations in isolation from network architectures and training data, in this paper, we introduce VLA-OS, a unified VLA architecture series capable of various task planning paradigms, and design a comprehensive suite of controlled experiments across diverse object categories (rigid and deformable), visual modalities (2D and 3D), environments (simulation and real-world), and end-effectors (grippers and dexterous hands). Our results demonstrate that: 1) visually grounded planning representations are generally better than language planning representations; 2) the Hierarchical-VLA paradigm generally achieves performance superior or comparable to other paradigms on task performance, pretraining, generalization ability, scalability, and continual learning ability, albeit at the cost of slower training and inference speeds.

[20] LLM-driven Medical Report Generation via Communication-efficient Heterogeneous Federated Learning cs.CV | cs.CLPDF

Haoxuan Che, Haibo Jin, Zhengrui Guo, Yi Lin, Cheng Jin

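TL;DR: The paper proposes FedMRG, the first framework to use federated learning (FL) for privacy-preserving, multi-center development of medical report generation (MRG) models, addressing communication efficiency and heterogeneity under multi-modal data.

Details

Motivation: Medical report generation (MRG) requires large numbers of medical image-report pairs scattered across centers, but privacy regulations make centralizing the training data difficult, which hinders LLM-driven MRG models. FedMRG aims to address this with federated learning.

Result: Experiments demonstrate the generalizability and adaptability of FedMRG on multi-center data while maintaining communication efficiency and clinical accuracy.

Insight: Combining federated learning with LLMs can address medical data privacy constraints, and low-rank factorization together with heterogeneity-aware adaptation offers a new approach to FL on multi-modal data.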

Abstract: LLMs have demonstrated significant potential in Medical Report Generation (MRG), yet their development requires large amounts of medical image-report pairs, which are commonly scattered across multiple centers. Centralizing these data is exceptionally challenging due to privacy regulations, thereby impeding model development and broader adoption of LLM-driven MRG models. To address this challenge, we present FedMRG, the first framework that leverages Federated Learning (FL) to enable privacy-preserving, multi-center development of LLM-driven MRG models, specifically designed to overcome the critical challenge of communication-efficient LLM training under multi-modal data heterogeneity. To start with, our framework tackles the fundamental challenge of communication overhead in FL-LLM tuning by employing low-rank factorization to efficiently decompose parameter updates, significantly reducing gradient transmission costs and making LLM-driven MRG feasible in bandwidth-constrained FL settings. Furthermore, we observed the dual heterogeneity in MRG under the FL scenario: varying image characteristics across medical centers, as well as diverse reporting styles and terminology preferences. To address this, we further enhance FedMRG with (1) client-aware contrastive learning in the MRG encoder, coupled with diagnosis-driven prompts, which capture both globally generalizable and locally distinctive features while maintaining diagnostic accuracy; and (2) a dual-adapter mutual boosting mechanism in the MRG decoder that harmonizes generic and specialized adapters to address variations in reporting styles and terminology. Through extensive evaluation of our established FL-MRG benchmark, we demonstrate the generalizability and adaptability of FedMRG, underscoring its potential in harnessing multi-center data and generating clinically accurate reports while maintaining communication efficiency.
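Editor's sketch: the communication saving from low-rank factorization can be seen by counting the numbers a client would transmit. The example below uses a truncated SVD as a stand-in; the abstract does not say how FedMRG computes its factors, so the function `low_rank_factors`, the matrix sizes, and the rank are all assumptions.

```python
import numpy as np

def low_rank_factors(delta, rank):
    """Return factors (a, b) with a @ b approximating delta at the given rank."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # shape (rows, rank), singular values folded in
    b = vt[:rank]                # shape (rank, cols)
    return a, b

rng = np.random.default_rng(0)
# Hypothetical 256x128 parameter update that is exactly rank 4.
delta = rng.normal(size=(256, 4)) @ rng.normal(size=(4, 128))

a, b = low_rank_factors(delta, rank=4)
full_cost = delta.size             # values sent without factorization
low_rank_cost = a.size + b.size    # values sent with factorization
rel_error = np.linalg.norm(delta - a @ b) / np.linalg.norm(delta)
print(full_cost, low_rank_cost)
```

For an m x n update at rank r the transmitted size drops from m*n to r*(m+n), which is the source of the saving whenever r is much smaller than both dimensions.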

[21] HalluRNN: Mitigating Hallucinations via Recurrent Cross-Layer Reasoning in Large Vision-Language Models cs.CV | cs.AI | cs.LGPDF

Le Yu, Kaishen Wang, Jianlong Xiong, Yue Cao, Tao He

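TL;DR: HalluRNN introduces a Dual-Gated Depth Propagation Unit (DG-DPU) module that improves model stability through recurrent cross-layer reasoning and reduces hallucinations in large vision-language models. Fine-tuning only the DG-DPU module yields strong results across multiple benchmarks.

Details

Motivation: Large vision-language models (LVLMs) perform well but are prone to visually ungrounded hallucinations. Existing methods mostly rely on data-centric fine-tuning or decoding strategies, which are resource-intensive and need task-specific configuration.

Result: HalluRNN, with only the DG-DPU module fine-tuned, achieves strong and robust performance across multiple benchmarks.

Insight: Architecture-level changes such as recurrent cross-layer reasoning can mitigate hallucinations without heavy resource costs, offering a new direction for improving LVLMs.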

Abstract: Though Large Vision-Language Models (LVLMs) have achieved remarkable performance across various tasks, they are still prone to hallucinations: generating outputs that are textually plausible but visually ungrounded. While prior approaches generally address this issue through data-centric fine-tuning or innovative decoding strategies, these methods often require substantial resources or task-specific configurations. In this work, we introduce an architecture-level solution, HalluRNN, which enhances model stability through recurrent cross-layer reasoning. Specifically, we propose a novel Dual-Gated Depth Propagation Unit (DG-DPU) module, which is shared across layers and recurrently refines hidden states. This allows for the adaptive propagation of information throughout the model, enforces consistency across layers, and mitigates hallucinations caused by representational drift. By fine-tuning only the DG-DPU module, HalluRNN achieves strong and robust performance across multiple benchmarks.

[22] DRAMA-X: A Fine-grained Intent Prediction and Risk Reasoning Benchmark For Driving cs.CV | cs.AI | cs.ROPDF

Mihir Godbole, Xiangbo Gao, Zhengzhong Tu

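TL;DR: DRAMA-X is a driving benchmark for fine-grained intent prediction and risk reasoning. It provides multi-class intent annotations and evaluation on safety-critical scenarios, and proposes SGG-Intent as a lightweight baseline.

Details

Motivation: Fine-grained intent reasoning with vision-language models is underexplored, and no benchmark evaluates multi-class intent prediction, particularly in safety-critical scenarios.

Result: Experiments show that scene-graph-based reasoning improves intent prediction and risk assessment, especially when context is explicitly modeled.

Insight: Explicit context modeling is essential for intent and risk reasoning in safety-critical scenarios, and a lightweight framework can perform this reasoning efficiently.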

Abstract: Understanding the short-term motion of vulnerable road users (VRUs) like pedestrians and cyclists is critical for safe autonomous driving, especially in urban scenarios with ambiguous or high-risk behaviors. While vision-language models (VLMs) have enabled open-vocabulary perception, their utility for fine-grained intent reasoning remains underexplored. Notably, no existing benchmark evaluates multi-class intent prediction in safety-critical situations. To address this gap, we introduce DRAMA-X, a fine-grained benchmark constructed from the DRAMA dataset via an automated annotation pipeline. DRAMA-X contains 5,686 accident-prone frames labeled with object bounding boxes, a nine-class directional intent taxonomy, binary risk scores, expert-generated action suggestions for the ego vehicle, and descriptive motion summaries. These annotations enable a structured evaluation of four interrelated tasks central to autonomous decision-making: object detection, intent prediction, risk assessment, and action suggestion. As a reference baseline, we propose SGG-Intent, a lightweight, training-free framework that mirrors the ego vehicle’s reasoning pipeline. It sequentially generates a scene graph from visual input using VLM-backed detectors, infers intent, assesses risk, and recommends an action using a compositional reasoning stage powered by a large language model. We evaluate a range of recent VLMs, comparing performance across all four DRAMA-X tasks. Our experiments demonstrate that scene-graph-based reasoning enhances intent prediction and risk assessment, especially when contextual cues are explicitly modeled.

[23] SELFI: Selective Fusion of Identity for Generalizable Deepfake Detection cs.CVPDF

Younghun Kim, Minsuk Jang, Myung-Joon Kwon, Wonjun Lee, Changick Kim

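TL;DR: SELFI is a deepfake detection framework that selectively fuses face identity information. By dynamically modulating how identity features are used, it substantially improves generalization across manipulation methods.

Details

Motivation: Prior work disagrees on the role of identity in deepfake detection: some methods suppress identity features to reduce bias, while others rely on identity as forensic evidence. The paper aims to reconcile these views and examines how well identity features generalize across manipulation methods.

Result: Across four benchmarks, SELFI improves on prior methods by an average of 3.1% AUC, and on the DFDC dataset it exceeds the previous best by 6%.

Insight: The usefulness of identity information is context-dependent, so it should be dynamically modulated rather than blindly relied on or suppressed.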

Abstract: Face identity provides a powerful signal for deepfake detection. Prior studies show that even when not explicitly modeled, classifiers often learn identity features implicitly. This has led to conflicting views: some suppress identity cues to reduce bias, while others rely on them as forensic evidence. To reconcile these views, we analyze two hypotheses: (1) whether face identity alone is discriminative for detecting deepfakes, and (2) whether such identity features generalize poorly across manipulation methods. Our experiments confirm that identity is informative but context-dependent. While some manipulations preserve identity-consistent artifacts, others distort identity cues and harm generalization. We argue that identity features should neither be blindly suppressed nor relied upon, but instead be explicitly modeled and adaptively controlled based on per-sample relevance. We propose SELFI (SELective Fusion of Identity), a generalizable detection framework that dynamically modulates identity usage. SELFI consists of: (1) a Forgery-Aware Identity Adapter (FAIA) that extracts identity embeddings from a frozen face recognition model and projects them into a forgery-relevant space via auxiliary supervision; and (2) an Identity-Aware Fusion Module (IAFM) that selectively integrates identity and visual features using a relevance-guided fusion mechanism. Experiments on four benchmarks show that SELFI improves cross-manipulation generalization, outperforming prior methods by an average of 3.1% AUC. On the challenging DFDC dataset, SELFI exceeds the previous best by 6%. Code will be released upon paper acceptance.
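Editor's sketch: one simple way to picture "relevance-guided fusion" is a per-sample gate in (0, 1) that scales how much of the identity embedding is added to the visual feature. The abstract does not specify the IAFM design, so everything below (the function `gated_fusion`, the sigmoid gate, the additive fusion, the feature sizes) is a hypothetical illustration and not the method from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(visual, identity, w_gate, b_gate):
    """Add identity features to visual features, scaled by a per-sample gate."""
    joint = np.concatenate([visual, identity], axis=1)  # shape (N, 2D)
    gate = sigmoid(joint @ w_gate + b_gate)             # shape (N, 1), values in (0, 1)
    return visual + gate * identity, gate

rng = np.random.default_rng(0)
visual = rng.normal(size=(3, 4))     # hypothetical visual features
identity = rng.normal(size=(3, 4))   # hypothetical projected identity embeddings
w_gate = rng.normal(size=(8, 1))     # gate weights, learned in a real model

fused, gate = gated_fusion(visual, identity, w_gate, b_gate=0.0)
print(fused.shape, gate.shape)
```

A gate near 0 leaves the visual feature untouched (identity ignored), while a gate near 1 passes the identity signal through in full, which matches the "neither blindly suppressed nor relied upon" argument.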

[24] A Multimodal In Vitro Diagnostic Method for Parkinson’s Disease Combining Facial Expressions and Behavioral Gait Data cs.CVPDF

Wei Huang, Yinxuan Xu, Yintao Zhou, Zhengyu Li, Jing Huang

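TL;DR: The paper proposes a multimodal in vitro diagnostic method for Parkinson's disease that combines facial expressions and behavioral gait. It addresses the limited data, poor generalizability, and single-modality misdiagnosis of existing methods, and validates its effectiveness experimentally.

Details

Motivation: Early, non-invasive diagnosis of Parkinson's disease is urgently needed, but existing methods suffer from limited training data, heavy dependence on specialized equipment, and misdiagnosis risk when relying on a single modality.

Result: Experimental results validate the effectiveness of the proposed method and its improvement in diagnostic accuracy.

Insight: A multimodal approach compensates for the limitations of single modalities, and pairing it with a lightweight model enables efficient deployment, offering a new approach to early Parkinson's disease diagnosis.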

Abstract: Parkinson’s disease (PD), characterized by its incurable nature, rapid progression, and severe disability, poses significant challenges to the lives of patients and their families. Given the aging population, the need for early detection of PD is increasing. In vitro diagnosis has garnered attention due to its non-invasive nature and low cost. However, existing methods present several challenges: 1) limited training data for facial expression diagnosis; 2) specialized equipment and acquisition environments required for gait diagnosis, resulting in poor generalizability; 3) the risk of misdiagnosis or missed diagnosis when relying on a single modality. To address these issues, we propose a novel multimodal in vitro diagnostic method for PD, leveraging facial expressions and behavioral gait. Our method employs a lightweight deep learning model for feature extraction and fusion, aimed at improving diagnostic accuracy and facilitating deployment on mobile devices. Furthermore, we have established the largest multimodal PD dataset in collaboration with a hospital and conducted extensive experiments to validate the effectiveness of our proposed method.

[25] OpenMAP-BrainAge: Generalizable and Interpretable Brain Age Predictor cs.CVPDF

Pengyu Kan, Craig Jones, Kenichi Oishi

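TL;DR: The paper proposes OpenMAP-BrainAge, a transformer-based model for brain age prediction that is interpretable and robust to demographic and technological variance.

Details

Motivation: To develop an age prediction model that handles demographic and technological variance in brain MRI scans while remaining interpretable and generalizable.

Result: MAE is 3.65 years on the ADNI2 & 3 and OASIS3 test set and 3.54 years on the AIBL dataset. The brain age gap (BAG) is significantly negatively correlated with cognitive scores.

Insight: The model successfully fuses multi-view information, and gradient-based attribution highlights ventricles and white matter structures as key regions in brain aging.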

Abstract: Purpose: To develop an age prediction model which is interpretable and robust to demographic and technological variances in brain MRI scans. Materials and Methods: We propose a transformer-based architecture that leverages self-supervised pre-training on large-scale datasets. Our model processes pseudo-3D T1-weighted MRI scans from three anatomical views and incorporates brain volumetric information. By introducing a stem architecture, we reduce the conventional quadratic complexity of transformer models to linear complexity, enabling scalability for high-dimensional MRI data. We trained our model on ADNI2 & 3 (N=1348) and OASIS3 (N=716) datasets (age range: 42 - 95) from North America, with an 8:1:1 split for train, validation and test. Then, we validated it on the AIBL dataset (N=768, age range: 60 - 92) from Australia. Results: We achieved an MAE of 3.65 years on the ADNI2 & 3 and OASIS3 test set and high generalizability, with an MAE of 3.54 years on AIBL. There was a notable increase in brain age gap (BAG) across cognitive groups, with a mean of 0.15 years (95% CI: [-0.22, 0.51]) in CN, 2.55 years ([2.40, 2.70]) in MCI, and 6.12 years ([5.82, 6.43]) in AD. Additionally, a significant negative correlation between BAG and cognitive scores was observed, with correlation coefficients of -0.185 (p < 0.001) for MoCA and -0.231 (p < 0.001) for MMSE. Gradient-based feature attribution highlighted ventricles and white matter structures as key regions influenced by brain aging. Conclusion: Our model effectively fused information from different views and volumetric information to achieve state-of-the-art brain age prediction accuracy, improved generalizability and interpretability with association to neurodegenerative disorders.
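Editor's sketch: the two quantities reported above can be computed as follows, assuming the usual definitions (MAE as the mean absolute difference between predicted and chronological age, BAG as predicted minus chronological age). The abstract does not spell out these formulas, and the ages below are made-up numbers, not data from the paper.

```python
import numpy as np

def mean_absolute_error(predicted_age, true_age):
    """Mean absolute error in years."""
    return float(np.mean(np.abs(predicted_age - true_age)))

def brain_age_gap(predicted_age, true_age):
    """Per-subject gap in years; positive means the brain looks older than the subject."""
    return predicted_age - true_age

# Hypothetical subjects.
true_age = np.array([60.0, 70.0, 80.0])
predicted_age = np.array([63.0, 68.0, 85.0])

error = mean_absolute_error(predicted_age, true_age)
bag = brain_age_gap(predicted_age, true_age)
print(error, bag)
```

Note that MAE discards the sign while BAG keeps it, which is why a low MAE and a group-level BAG shift (as reported for MCI and AD) can coexist.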

[26] HIRE: Lightweight High-Resolution Image Feature Enrichment for Multimodal LLMs cs.CVPDF

Nikitha SR, Aradhya Neeraj Mathur, Tarun Ram Menta, Rishabh Jain, Mausoom Sarkar

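TL;DR: The paper proposes HIRE, a lightweight shallow feature enricher that makes high-resolution image feature integration in multimodal large language models more efficient and substantially reduces computational cost.

Details

Motivation: Integrating high-resolution image features into multimodal large language models improves fine-grained visual understanding, but reliance on large image encoders such as ViT significantly increases computational cost. The authors seek an efficient way around this.

Result: HIRE stays competitive in performance while saving up to 1.5x in FLOPs.

Insight: A shallow feature enricher can serve as an efficient alternative to large image encoders, cutting computational cost without noticeably hurting performance.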

Abstract: The integration of high-resolution image features in modern multimodal large language models has demonstrated significant improvements in fine-grained visual understanding tasks, achieving high performance across multiple benchmarks. Since these features are obtained from large image encoders like ViT, they come with a significant increase in computational costs due to multiple calls to these encoders. In this work, we first develop an intuition for feature upsampling as a natural extension of high-resolution feature generation. Through extensive experiments and ablations, we demonstrate how a shallow feature enricher can achieve competitive results with tremendous reductions in training and inference time as well as computational cost, with up to 1.5x savings in FLOPs.

[27] JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent cs.CVPDF

Yunlong Lin, Zixu Lin, Kunjie Lin, Jinbin Bai, Panwang Pan

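TL;DR: JarvisArt is an intelligent photo retouching agent built on a multimodal large language model. It mimics the reasoning process of professional artists and coordinates over 200 tools in Lightroom for user-friendly image editing. It is trained in two stages and outperforms GPT-4o on MMArt-Bench.

Details

Motivation: Existing AI retouching tools are highly automated but lack adjustability and generalization, while professional tools such as Lightroom require substantial manual effort and expertise. JarvisArt aims to combine the strengths of both in an intelligent, personalized retouching solution.

Result: JarvisArt outperforms GPT-4o on MMArt-Bench (a 60% improvement in average pixel-level metrics) while maintaining comparable instruction-following ability, and supports fine-grained global and local adjustments.

Insight: Combining a multimodal large language model with professional retouching tools can markedly improve the quality and flexibility of AI retouching while meeting personalized needs.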

Abstract: Photo retouching has become integral to contemporary visual storytelling, enabling users to capture aesthetics and express creativity. While professional tools such as Adobe Lightroom offer powerful capabilities, they demand substantial expertise and manual effort. In contrast, existing AI-based solutions provide automation but often suffer from limited adjustability and poor generalization, failing to meet diverse and personalized editing needs. To bridge this gap, we introduce JarvisArt, a multi-modal large language model (MLLM)-driven agent that understands user intent, mimics the reasoning process of professional artists, and intelligently coordinates over 200 retouching tools within Lightroom. JarvisArt undergoes a two-stage training process: an initial Chain-of-Thought supervised fine-tuning to establish basic reasoning and tool-use skills, followed by Group Relative Policy Optimization for Retouching (GRPO-R) to further enhance its decision-making and tool proficiency. We also propose the Agent-to-Lightroom Protocol to facilitate seamless integration with Lightroom. To evaluate performance, we develop MMArt-Bench, a novel benchmark constructed from real-world user edits. JarvisArt demonstrates user-friendly interaction, superior generalization, and fine-grained control over both global and local adjustments, paving a new avenue for intelligent photo retouching. Notably, it outperforms GPT-4o with a 60% improvement in average pixel-level metrics on MMArt-Bench for content fidelity, while maintaining comparable instruction-following capabilities. Project Page: https://jarvisart.vercel.app/.
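Editor's sketch: the abstract names Group Relative Policy Optimization for Retouching (GRPO-R) but gives no details. The snippet below shows only the group-relative advantage used in standard GRPO, where each sampled response is scored against the mean and standard deviation of its own group. Whether GRPO-R changes this step is not stated, and the reward values are made up.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical rewards for three retouching attempts on the same photo.
advantages = group_relative_advantages([0.2, 0.5, 0.8])
print(advantages)
```

Responses that beat their group average get a positive advantage and are reinforced, and the normalization removes the need for a separate learned value function.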

[28] CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning cs.CV | cs.AI | cs.CLPDF

Kailing Li, Qi’ao Xu, Tianwen Qian, Yuqian Fu, Yang Jiao

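TL;DR: CLiViS is a new training-free framework that builds a dynamic cognitive map through linguistic-visual synergy for embodied visual reasoning (EVR), addressing complex instructions and spatiotemporal dynamics in long videos.

Details

Motivation: EVR is challenged by the diversity of complex instructions and the dynamic spatiotemporal relations in long videos. Existing methods either apply LLMs to static video captions, which omit visual details, or use end-to-end VLMs, which struggle with stepwise reasoning. CLiViS combines the reasoning ability of LLMs with the perception ability of VLMs around a dynamic cognitive map.

Result: Strong performance across multiple benchmarks, especially on tasks with long-term visual dependencies.

Insight: LLMs and VLMs working together compensate for each other's weaknesses, and the dynamic cognitive map provides a new structured representation for EVR.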

Abstract: Embodied Visual Reasoning (EVR) seeks to follow complex, free-form instructions based on egocentric video, enabling semantic understanding and spatiotemporal reasoning in dynamic environments. Despite its promising potential, EVR encounters significant challenges stemming from the diversity of complex instructions and the intricate spatiotemporal dynamics in long-term egocentric videos. Prior solutions either employ Large Language Models (LLMs) over static video captions, which often omit critical visual details, or rely on end-to-end Vision-Language Models (VLMs) that struggle with stepwise compositional reasoning. Considering the complementary strengths of LLMs in reasoning and VLMs in perception, we propose CLiViS, a novel training-free framework that leverages LLMs for high-level task planning and orchestrates VLM-driven open-world visual perception to iteratively update the scene context. Building on this synergy, the core of CLiViS is a dynamic Cognitive Map that evolves throughout the reasoning process. This map constructs a structured representation of the embodied scene, bridging low-level perception and high-level reasoning. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generality of CLiViS, especially in handling long-term visual dependencies. Code is available at https://github.com/Teacher-Tom/CLiViS.

[29] Optimization-Free Patch Attack on Stereo Depth Estimation cs.CVPDF

Hangcheng Liu, Xu Kuang, Xingshuo Han, Xingwan Wu, Haoran Ou

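TL;DR: The paper proposes PatchHunter, a new optimization-free adversarial patch attack on stereo depth estimation (SDE). It uses reinforcement learning to search over visual patterns and achieves high transferability and real-world applicability.

Details

Motivation: Existing adversarial attacks on SDE are mostly optimization-based and limited to digital perturbations in unrealistic static scenes, making them hard to apply in practice. A physically realizable, scene-adaptive attack is needed.

Result: On the KITTI dataset, in the CARLA simulator, and in real-world vehicle deployment, PatchHunter outperforms optimization-based methods, with especially strong black-box transferability and robustness under low light (D1-all > 0.4).

Insight: Patterns, rather than pixel-level perturbations, are key to transferable attacks; optimization-free methods have the advantage under realistic constraints.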

Abstract: Stereo Depth Estimation (SDE) is essential for scene understanding in vision-based systems like autonomous driving. However, recent studies show that SDE models are vulnerable to adversarial attacks, which are often limited to unrealistic settings, e.g., digital perturbations on separate stereo views in static scenes, restricting their real-world applicability. This raises a critical question: how can we design physically realizable, scene-adaptive, and transferable attacks against SDE under realistic constraints? To answer this, we make two key contributions. First, we propose a unified attack framework that extends optimization-based techniques to four core stages of stereo matching: feature extraction, cost-volume construction, cost aggregation, and disparity regression. A comprehensive stage-wise evaluation across 9 mainstream SDE models, under constraints like photometric consistency, reveals that optimization-based patches suffer from poor transferability. Interestingly, partially transferable patches suggest that patterns, rather than pixel-level perturbations, may be key to generalizable attacks. Motivated by this, we present PatchHunter, the first optimization-free adversarial patch attack against SDE. PatchHunter formulates patch generation as a reinforcement learning-driven search over a structured space of visual patterns crafted to disrupt SDE assumptions. We validate PatchHunter across three levels: the KITTI dataset, the CARLA simulator, and real-world vehicle deployment. PatchHunter not only surpasses optimization-based methods in effectiveness but also achieves significantly better black-box transferability. Even under challenging physical conditions like low light, PatchHunter maintains high attack success (e.g., D1-all > 0.4), whereas optimization-based methods fail.
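Editor's sketch: D1-all, the attack-success measure quoted above, is the KITTI 2015 stereo outlier rate: the fraction of valid pixels whose disparity error exceeds both 3 px and 5% of the ground-truth disparity. The function below follows that definition; the tiny arrays are made-up values, and treating `gt_disp > 0` as the validity mask is an assumption of this sketch.

```python
import numpy as np

def d1_all(pred_disp, gt_disp, valid=None):
    """Fraction of valid pixels that are disparity outliers (KITTI 2015 D1 rule)."""
    if valid is None:
        valid = gt_disp > 0                      # assume 0 marks missing ground truth
    err = np.abs(pred_disp - gt_disp)
    outlier = (err > 3.0) & (err > 0.05 * gt_disp)
    return float(outlier[valid].mean())

# Hypothetical disparities in pixels; the last pixel has no ground truth.
gt = np.array([10.0, 100.0, 50.0, 0.0])
pred = np.array([14.0, 104.0, 51.0, 7.0])

score = d1_all(pred, gt)
print(score)
```

In this toy case only the first pixel is an outlier: the second has a 4 px error but that is below 5% of its 100 px disparity. A D1-all above 0.4 therefore means more than 40% of evaluated pixels are corrupted by the attack.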

[30] Adaptive Multi-prompt Contrastive Network for Few-shot Out-of-distribution Detection cs.CV | cs.AIPDF

Xiang Fang, Arvind Easwaran, Blaise Genest

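TL;DR: The paper proposes the Adaptive Multi-prompt Contrastive Network (AMCN) for few-shot out-of-distribution (OOD) detection. It combines text and images to learn inter- and intra-class distributions and clearly outperforms existing methods.

Details

Motivation: Existing OOD detection methods usually need many in-distribution (ID) samples for training, whereas real-world settings often offer only a few labeled ID samples. Existing methods also overlook the differences in diversity between classes.

Result: Experiments show that AMCN outperforms other recent methods on few-shot OOD detection.

Insight: Combining text and image information and using CLIP to generate adaptive prompts effectively addresses the challenges of few-shot OOD detection.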

Abstract: Out-of-distribution (OOD) detection attempts to distinguish outlier samples to prevent models trained on the in-distribution (ID) dataset from producing unavailable outputs. Most OOD detection methods require many ID samples for training, which seriously limits their real-world applications. To this end, we target a challenging setting: few-shot OOD detection, where only a few labeled ID samples are available. Therefore, few-shot OOD detection is much more challenging than the traditional OOD detection setting. Previous few-shot OOD detection works ignore the distinct diversity between different classes. In this paper, we propose a novel network: Adaptive Multi-prompt Contrastive Network (AMCN), which adapts the ID-OOD separation boundary by learning inter- and intra-class distribution. To compensate for the absence of OOD and scarcity of ID image samples, we leverage CLIP, connecting text with images, engineering learnable ID and OOD textual prompts. Specifically, we first generate adaptive prompts (learnable ID prompts, label-fixed OOD prompts and label-adaptive OOD prompts). Then, we generate an adaptive class boundary for each class by introducing a class-wise threshold. Finally, we propose a prompt-guided ID-OOD separation module to control the margin between ID and OOD prompts. Experimental results show that AMCN outperforms other state-of-the-art works.

[31] Histopathology Image Report Generation by Vision Language Model with Multimodal In-Context Learning cs.CVPDF

Shih-Wen Liu, Hsuan-Yu Fan, Wei-Ta Chu, Fu-En Yang, Yu-Chiang Frank Wang

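TL;DR: The paper proposes PathGenIC, a multimodal in-context learning framework for automated report generation from histopathology images. It dynamically retrieves semantically similar WSI-report pairs and incorporates adaptive feedback, which significantly improves report generation.

Details

Motivation: Automated medical report generation is a critical challenge that requires effective visual representations and domain-specific knowledge. Inspired by the practices of human experts, the paper addresses it with an in-context learning mechanism.

Result: State-of-the-art results on the HistGen benchmark, with significant improvements in BLEU, METEOR, and ROUGE-L, and robustness across report lengths and disease categories.

Insight: By maximizing the utility of training data and bridging vision and language with ICL, the work offers a solution for AI-driven histopathology reporting and lays a foundation for future multimodal clinical applications.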

Abstract: Automating medical report generation from histopathology images is a critical challenge requiring effective visual representations and domain-specific knowledge. Inspired by the common practices of human experts, we propose an in-context learning framework called PathGenIC that integrates context derived from the training set with a multimodal in-context learning (ICL) mechanism. Our method dynamically retrieves semantically similar whole slide image (WSI)-report pairs and incorporates adaptive feedback to enhance contextual relevance and generation quality. Evaluated on the HistGen benchmark, the framework achieves state-of-the-art results, with significant improvements across BLEU, METEOR, and ROUGE-L metrics, and demonstrates robustness across diverse report lengths and disease categories. By maximizing training data utility and bridging vision and language with ICL, our work offers a solution for AI-driven histopathology reporting, setting a strong foundation for future advancements in multimodal clinical applications.
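Editor's sketch: retrieving "semantically similar" WSI-report pairs is commonly done by nearest-neighbour search over slide embeddings. The example below uses cosine similarity and top-k selection. The abstract does not state which similarity PathGenIC uses, so the function `retrieve_top_k`, the 2D embeddings, and the report strings are assumptions for illustration only.

```python
import numpy as np

def retrieve_top_k(query, bank, k):
    """Return indices and cosine similarities of the k bank rows closest to query."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q
    order = np.argsort(-sims)[:k]    # highest similarity first
    return order, sims[order]

# Hypothetical training-set slide embeddings with their paired reports.
bank = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
reports = ["report A", "report B", "report C"]

indices, sims = retrieve_top_k(np.array([1.0, 0.1]), bank, k=2)
context = [reports[i] for i in indices]   # in-context examples for the prompt
print(context)
```

The retrieved reports would then be placed in the model prompt as in-context examples alongside the query slide.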

[32] MDSAM: Memory-Driven Sparse Attention Matrix for LVLMs Hallucination Mitigation cs.CVPDF

Shuaiye Lu, Linjiang Zhou, Xiaochuan Shi

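TL;DR: The paper proposes the training-free Memory-Driven Sparse Attention Matrix (MDSAM) method, which reduces hallucinations in large vision-language models (LVLMs) by dynamically adjusting attention allocation.

Details

Motivation: The sensitivity of LVLMs to image tokens during decoding is a main cause of hallucinations. The paper aims to reduce the chance of generating erroneous entities by dynamically adjusting attention allocation.

Result: On image captioning and visual question answering, MDSAM consistently reduces hallucinations and improves model reliability.

Insight: MDSAM shows that dynamic attention refinement can improve model reliability, and it applies broadly across LVLM architectures.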

Abstract: Hallucinations in large vision-language models (LVLMs) often stem from the model’s sensitivity to image tokens during decoding, as evidenced by attention peaks observed when generating both real and hallucinated entities. To address this, we propose Memory-Driven Sparse Attention Matrix (MDSAM), a novel training-free approach that dynamically captures and refines the attention allocated to image tokens at each layer. MDSAM memorizes attention patterns and activates updates through alignment during decoding, enhancing focus on relevant image tokens while effectively reducing hallucinations. We evaluate MDSAM on multiple benchmarks for tasks such as image captioning and visual question answering, demonstrating its ability to consistently reduce hallucinations and improve reliability. Compatible with various LVLM architectures, MDSAM highlights its adaptability and effectiveness in mitigating hallucinations without requiring additional training or external tools.

[33] CSDN: A Context-Gated Self-Adaptive Detection Network for Real-Time Object Detection cs.CVPDF

Wei Haolin

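TL;DR: The paper proposes CSDN, a Transformer-based, context-gated self-adaptive detection network that addresses the poor use of global information in CNN-based object detection caused by limited receptive fields.

Details

Motivation: Limited receptive fields make it hard for CNNs to capture global context in object detection, and conventional attention mechanisms suffer from information redundancy.

Result: CSDN can directly replace the native heads of various CNN-based detectors and significantly improves detection accuracy with only a small amount of fine-tuning.

Insight: The gating mechanism not only strengthens global context modeling but also avoids the redundant computation of conventional attention.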

Abstract: Convolutional neural networks (CNNs) have long been the cornerstone of object detection, but they are constrained by limited receptive fields, which hinders their ability to capture global contextual information. This paper argues that the effective utilization of extracted features is as important as the feature extraction process itself. We critically re-evaluate the DETR-inspired detection head architecture, question the indispensability of its self-attention mechanism, and find significant information redundancy. To address these problems, we introduce the Context-Gated Scale-Adaptive Detection Network (CSDN), a Transformer-based detection head inspired by natural language processing architectures and human visual perception. CSDN aims to make efficient use of the features of the CNN backbone by replacing the traditional stacked self-attention and cross-attention layers with a novel gating mechanism. This mechanism enables each region of interest (ROI) to adaptively select and combine feature dimensions and scale information from multiple attention patterns. CSDN provides more powerful global context modeling capabilities and can better adapt to objects of different sizes and structures. Our proposed detection head can directly replace the native heads of various CNN-based detectors, and only a few rounds of fine-tuning on the pre-trained weights significantly improve detection accuracy, thus avoiding the need for extensive re-training of various layer modules to achieve small improvements.

[34] Domain Generalization using Action Sequences for Egocentric Action Recognition cs.CVPDF

Amirshayan Nasirimajd, Chiara Plizzari, Simone Alberto Peirone, Marco Ciccone, Giuseppe Averta

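TL;DR: The paper proposes SeqDG, a domain generalization method that exploits the consistency of action sequences to improve egocentric action recognition in unseen environments.

Details

Motivation: Egocentric action recognition suffers performance drops from the variety of illumination, viewpoint, and environment, so models generalize poorly to unseen environments.

Result: On EPIC-KITCHENS-100, cross-domain action recognition improves by 2.4% (relative average); on EGTEA, the model reaches state-of-the-art Top-1 accuracy, +0.6% over the prior SOTA.

Insight: Action sequences reflect consistent user intent, which helps models generalize across visual domains.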

Abstract: Recognizing human activities from visual inputs, particularly through a first-person viewpoint, is essential for enabling robots to replicate human behavior. Egocentric vision, characterized by cameras worn by observers, captures diverse changes in illumination, viewpoint, and environment. This variability leads to a notable drop in the performance of Egocentric Action Recognition models when tested in environments not seen during training. In this paper, we tackle these challenges by proposing a domain generalization approach for Egocentric Action Recognition. Our insight is that action sequences often reflect consistent user intent across visual domains. By leveraging action sequences, we aim to enhance the model’s generalization ability across unseen environments. Our proposed method, named SeqDG, introduces a visual-text sequence reconstruction objective (SeqRec) that uses contextual cues from both text and visual inputs to reconstruct the central action of the sequence. Additionally, we enhance the model’s robustness by training it on mixed sequences of actions from different domains (SeqMix). We validate SeqDG on the EGTEA and EPIC-KITCHENS-100 datasets. Results on EPIC-KITCHENS-100 show that SeqDG leads to +2.4% relative average improvement in cross-domain action recognition in unseen environments, and on EGTEA the model achieved +0.6% Top-1 accuracy over SOTA in intra-domain action recognition.

[35] SSAVSV: Towards Unified Model for Self-Supervised Audio-Visual Speaker Verification cs.CVPDF

Gnana Praveen Rajasekhar, Jahangir Alam

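TL;DR: This paper proposes a unified self-supervised framework for audio-visual speaker verification that learns robust multimodal feature representations through contrastive learning and masked data modeling.

Details

Motivation: Conventional audio-visual speaker verification methods need large amounts of labeled data and separate modality-specific architectures, which makes them computationally expensive and hard to scale. The authors aim to address this with self-supervised learning and a shared backbone.

Result: Experiments show the method achieves performance competitive with conventional approaches without labeled data while reducing computational cost.

Insight: The paper shows the potential of self-supervised learning and unified architectures for audio-visual tasks, offering an efficient and scalable approach to multimodal learning.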

Abstract: Conventional audio-visual methods for speaker verification rely on large amounts of labeled data and separate modality-specific architectures, which is computationally expensive, limiting their scalability. To address these problems, we propose a self-supervised learning framework based on contrastive learning with asymmetric masking and masked data modeling to obtain robust audiovisual feature representations. In particular, we employ a unified framework for self-supervised audiovisual speaker verification using a single shared backbone for audio and visual inputs, leveraging the versatility of vision transformers. The proposed unified framework can handle audio, visual, or audiovisual inputs using a single shared vision transformer backbone during training and testing while being computationally efficient and robust to missing modalities. Extensive experiments demonstrate that our method achieves competitive performance without labeled data while reducing computational costs compared to traditional approaches.

[36] DreamJourney: Perpetual View Generation with Video Diffusion Models cs.CVPDF

Bo Pan, Yang Chen, Yingwei Pan, Ting Yao, Wei Chen

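TL;DR: DreamJourney proposes a two-stage framework that uses video diffusion models for perpetual view generation of dynamic scenes, addressing existing methods' lack of 3D awareness and their inability to model object motion.

Details

Motivation: Existing methods generate new views with pretrained text-to-image diffusion models, which lack 3D awareness and cannot capture the motion of dynamic objects. DreamJourney aims to solve these problems and enable perpetual view generation of dynamic scenes.

Result: Experiments show DreamJourney outperforms existing methods both quantitatively and qualitatively, producing high-quality long-term views of dynamic scenes.

Insight: DreamJourney demonstrates the potential of video diffusion models for dynamic scene generation and introduces early stopping and view padding to improve stability.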

Abstract: Perpetual view generation aims to synthesize a long-term video corresponding to an arbitrary camera trajectory solely from a single input image. Recent methods commonly utilize a pre-trained text-to-image diffusion model to synthesize new content of previously unseen regions along camera movement. However, the underlying 2D diffusion model lacks 3D awareness and results in distorted artifacts. Moreover, they are limited to generating views of static 3D scenes, neglecting to capture object movements within the dynamic 4D world. To alleviate these issues, we present DreamJourney, a two-stage framework that leverages the world simulation capacity of video diffusion models to trigger a new perpetual scene view generation task with both camera movements and object dynamics. Specifically, in stage I, DreamJourney first lifts the input image to a 3D point cloud and renders a sequence of partial images from a specific camera trajectory. A video diffusion model is then utilized as a generative prior to complete the missing regions and enhance visual coherence across the sequence, producing a cross-view consistent video that adheres to the 3D scene and camera trajectory. Meanwhile, we introduce two simple yet effective strategies (early stopping and view padding) to further stabilize the generation process and improve visual quality. Next, in stage II, DreamJourney leverages a multimodal large language model to produce a text prompt describing object movements in the current view, and uses a video diffusion model to animate the current view with object movements. Stages I and II are repeated recurrently, enabling perpetual dynamic scene view generation. Extensive experiments demonstrate the superiority of our DreamJourney over state-of-the-art methods both quantitatively and qualitatively. Our project page: https://dream-journey.vercel.app.

[37] Programmable-Room: Interactive Textured 3D Room Meshes Generation Empowered by Large Language Models cs.CV | cs.AI | cs.MMPDF

Jihyun Kim, Junho Park, Kyeongbo Kong, Suk-Ju Kang

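TL;DR: Programmable-Room is a framework that interactively generates and edits 3D room meshes from natural language instructions, using visual programming and a large language model to decompose the task and improving panorama image generation quality.

Details

Motivation: Existing 3D room generation methods usually require complex user input or specialized tools and are hard to control precisely through natural language interaction. This paper aims to provide a flexible, highly interactive solution.

Result: The framework is flexible in generating and editing 3D room meshes, and quantitative and qualitative experiments show it outperforms an existing model.

Insight: Natural language instructions and modular decomposition can simplify complex 3D generation tasks; a bidirectional LSTM plays a key role in improving panorama image generation quality.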

Abstract: We present Programmable-Room, a framework which interactively generates and edits a 3D room mesh, given natural language instructions. For precise control of each attribute of a room, we decompose the challenging task into simpler steps such as creating plausible 3D coordinates for room meshes, generating panorama images for the texture, constructing 3D meshes by integrating the coordinates and panorama texture images, and arranging furniture. To support the various decomposed tasks with a unified framework, we incorporate visual programming (VP). VP is a method that utilizes a large language model (LLM) to write a Python-like program which is an ordered list of necessary modules for the various tasks given in natural language. We develop most of the modules. In particular, for the texture generation module, we utilize a pretrained large-scale diffusion model to generate panorama images conditioned on text and visual prompts (i.e., layout, depth, and semantic map) simultaneously. Specifically, we enhance the panorama image generation quality by optimizing the training objective with a 1D representation of a panorama scene obtained from a bidirectional LSTM. We demonstrate Programmable-Room’s flexibility in generating and editing 3D room meshes, and prove our framework’s superiority to an existing model quantitatively and qualitatively. Project page is available at https://jihyun0510.github.io/Programmable_Room_Page/.

[38] PDC-Net: Pattern Divide-and-Conquer Network for Pelvic Radiation Injury Segmentation cs.CVPDF

Xinyu Xiong, Wuteng Cao, Zihuang Wu, Lei Zhang, Chong Gao

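TL;DR: PDC-Net is a new network architecture that tackles pelvic radiation injury (PRI) segmentation with a divide-and-conquer strategy, combining a Multi-Direction Aggregation (MDA) module, a Memory-Guided Context (MGC) module, and an Adaptive Fusion Decoder (AFD) to significantly improve segmentation accuracy.

Details

Motivation: Accurate segmentation of pelvic radiation injury (PRI) is crucial for prognosis assessment and personalized treatment, but automated segmentation remains challenging because of complex organ morphology and confusing context.

Result: Outperforms existing methods on the first large-scale PRI dataset.

Insight: Divide-and-conquer and dynamic feature selection are effective strategies for complex medical image segmentation tasks.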

Abstract: Accurate segmentation of Pelvic Radiation Injury (PRI) from Magnetic Resonance Images (MRI) is crucial for more precise prognosis assessment and the development of personalized treatment plans. However, automated segmentation remains challenging due to factors such as complex organ morphologies and confusing context. To address these challenges, we propose a novel Pattern Divide-and-Conquer Network (PDC-Net) for PRI segmentation. The core idea is to use different network modules to “divide” various local and global patterns and, through flexible feature selection, to “conquer” the Regions of Interest (ROI) during the decoding phase. Specifically, considering that our ROI often manifests as strip-like or circular-like structures in MR slices, we introduce a Multi-Direction Aggregation (MDA) module. This module enhances the model’s ability to fit the shape of the organ by applying strip convolutions in four distinct directions. Additionally, to mitigate the challenge of confusing context, we propose a Memory-Guided Context (MGC) module. This module explicitly maintains a memory parameter to track cross-image patterns at the dataset level, thereby enhancing the distinction between global patterns associated with the positive and negative classes. Finally, we design an Adaptive Fusion Decoder (AFD) that dynamically selects features from different patterns based on the Mixture-of-Experts (MoE) framework, ultimately generating the final segmentation results. We evaluate our method on the first large-scale pelvic radiation injury dataset, and the results demonstrate the superiority of our PDC-Net over existing approaches.

[39] YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception cs.CVPDF

Mengqi Lei, Siqi Li, Yihong Wu, Han Hu, You Zhou

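TL;DR: YOLOv13 is a hypergraph-based real-time object detector. Its hypergraph-based adaptive correlation enhancement mechanism addresses the limited ability of earlier YOLO models to capture global high-order correlations in complex scenes, its full-pipeline aggregation-and-distribution paradigm provides efficient feature fusion, and it improves performance on the MS COCO benchmark.

Details

Motivation: YOLO models perform well in real-time object detection, but their convolutional architectures and self-attention mechanisms model only local and pairwise correlations and cannot capture global many-to-many high-order correlations, which limits detection performance in complex scenes.

Result: On MS COCO, YOLOv13-N improves mAP by 3.0% over YOLO11-N and by 1.5% over YOLOv12-N, with fewer parameters and FLOPs.

Insight: Hypergraph modeling offers a new approach to high-order correlations, the full-pipeline design improves feature flow, and depthwise separable convolutions balance performance and efficiency.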

Abstract: The YOLO series models reign supreme in real-time object detection due to their superior accuracy and computational efficiency. However, both the convolutional architectures of YOLO11 and earlier versions and the area-based self-attention mechanism introduced in YOLOv12 are limited to local information aggregation and pairwise correlation modeling, lacking the capability to capture global multi-to-multi high-order correlations, which limits detection performance in complex scenarios. In this paper, we propose YOLOv13, an accurate and lightweight object detector. To address the above-mentioned challenges, we propose a Hypergraph-based Adaptive Correlation Enhancement (HyperACE) mechanism that adaptively exploits latent high-order correlations and overcomes the limitation of previous methods that are restricted to pairwise correlation modeling based on hypergraph computation, achieving efficient global cross-location and cross-scale feature fusion and enhancement. Subsequently, we propose a Full-Pipeline Aggregation-and-Distribution (FullPAD) paradigm based on HyperACE, which effectively achieves fine-grained information flow and representation synergy within the entire network by distributing correlation-enhanced features to the full pipeline. Finally, we propose to leverage depthwise separable convolutions to replace vanilla large-kernel convolutions, and design a series of blocks that significantly reduce parameters and computational complexity without sacrificing performance. We conduct extensive experiments on the widely used MS COCO benchmark, and the experimental results demonstrate that our method achieves state-of-the-art performance with fewer parameters and FLOPs. Specifically, our YOLOv13-N improves mAP by 3.0% over YOLO11-N and by 1.5% over YOLOv12-N. The code and models of our YOLOv13 model are available at: https://github.com/iMoonLab/yolov13.
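
Illustrative sketch (not from the paper): the abstract's last design step replaces vanilla large-kernel convolutions with depthwise separable ones. The arithmetic below shows why this cuts the weight count. The layer sizes are hypothetical and chosen only for illustration; they are not YOLOv13's actual configuration.

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weight count of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 64 -> 128 channels with a 7 x 7 kernel.
standard = conv_params(64, 128, 7)
separable = depthwise_separable_params(64, 128, 7)
print(standard, separable, separable / standard)
```

The ratio equals 1/c_out + 1/k^2, so the saving grows with both the kernel size and the number of output channels.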

[40] PhysID: Physics-based Interactive Dynamics from a Single-view Image cs.CVPDF

Sourabh Vasant Gothe, Ayon Chattopadhyay, Gunturi Venkata Sai Phani Kiran, Pratik, Vibhav Agarwal

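TL;DR: PhysID uses generative models to produce a 3D mesh and physical properties from a single-view image, enabling physics-based real-time interactive dynamics. It significantly lowers the barrier to 3D modeling and property calibration and is suited to mobile devices.

Details

Motivation: Current methods for generating interactive dynamics from static images rely on multi-view image input or pre-recorded videos, which limits their application scenarios. PhysID aims to deliver real-time, physically plausible interaction from a single-view image.

Result: Experiments verify that the modules of PhysID's end-to-end framework work together effectively, showing its potential for real-time interaction and personalized applications.

Insight: Simplifying the 3D modeling pipeline with generative models and combining it with the real-time rendering of a physics engine can greatly improve development efficiency and user experience for interactive dynamics on mobile devices.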

Abstract: Transforming static images into interactive experiences remains a challenging task in computer vision. Tackling this challenge holds the potential to elevate mobile user experiences, notably through interactive and AR/VR applications. Current approaches aim to achieve this either using pre-recorded video responses or requiring multi-view images as input. In this paper, we present PhysID, which streamlines the creation of physics-based interactive dynamics from a single-view image by leveraging large generative models for 3D mesh generation and physical property prediction. This significantly reduces the expertise required for engineering-intensive tasks like 3D modeling and intrinsic property calibration, enabling the process to be scaled with minimal manual intervention. We integrate an on-device physics-based engine for physically plausible real-time rendering with user interactions. PhysID represents a leap forward in mobile-based interactive dynamics, offering real-time, non-deterministic interactions and user-personalization with efficient on-device memory consumption. Experiments evaluate the zero-shot capabilities of various Multimodal Large Language Models (MLLMs) on diverse tasks and the performance of 3D reconstruction models. These results demonstrate the cohesive functioning of all modules within the end-to-end framework, contributing to its effectiveness.

[41] LoLA-SpecViT: Local Attention SwiGLU Vision Transformer with LoRA for Hyperspectral Imaging cs.CVPDF

Fadi Abdeladhim Zidi, Djamel Eddine Boukhari, Abdellah Zakaria Sellam, Abdelkrim Ouafi, Cosimo Distante

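TL;DR: This paper proposes LoLA-SpecViT, a lightweight vision Transformer that combines local attention with low-rank adaptation (LoRA) for hyperspectral image classification, substantially reducing trainable parameters while improving classification accuracy.

Details

Motivation: Hyperspectral image classification faces high-dimensional data, redundant bands, and limited labeled samples, and existing Transformer models scale and adapt poorly under label-scarce conditions.

Result: Strong performance on three benchmark datasets, with up to 99.91% accuracy, over 80% fewer trainable parameters, and greater robustness under label-scarce conditions.

Insight: Combining local attention with LoRA gives a lightweight and efficient solution for hyperspectral image classification, applicable to real-world uses such as agriculture and environmental monitoring.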

Abstract: Hyperspectral image classification remains a challenging task due to the high dimensionality of spectral data, significant inter-band redundancy, and the limited availability of annotated samples. While recent transformer-based models have improved the global modeling of spectral-spatial dependencies, their scalability and adaptability under label-scarce conditions remain limited. In this work, we propose LoLA-SpecViT (Low-rank adaptation Local Attention Spectral Vision Transformer), a lightweight spectral vision transformer that addresses these limitations through a parameter-efficient architecture tailored to the unique characteristics of hyperspectral imagery. Our model combines a 3D convolutional spectral front-end with local window-based self-attention, enhancing both spectral feature extraction and spatial consistency while reducing computational complexity. To further improve adaptability, we integrate low-rank adaptation (LoRA) into attention and projection layers, enabling fine-tuning with over 80% fewer trainable parameters. A novel cyclical learning rate scheduler modulates LoRA adaptation strength during training, improving convergence and generalisation. Extensive experiments on three benchmark datasets WHU-Hi LongKou, WHU-Hi HongHu, and Salinas demonstrate that LoLA-SpecViT consistently outperforms state-of-the-art baselines, achieving up to 99.91% accuracy with substantially fewer parameters and enhanced robustness under low-label regimes. The proposed framework provides a scalable and generalizable solution for real-world HSI applications in agriculture, environmental monitoring, and remote sensing analytics. Our code is available at https://github.com/FadiZidiDz/LoLA-SpecViT.
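
Illustrative sketch (not the authors' code): LoRA keeps a pretrained weight matrix W frozen and trains only a low-rank update, so a layer computes W x + (alpha / r) B A x. The class below is a minimal pure-Python version of that generic idea. The name `LoRALinear`, the layer sizes, the rank, and the initialization scale are all assumptions for illustration and are unrelated to the paper's reported 80% figure.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, d_in, d_out, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # W stands in for a pretrained weight and would stay frozen.
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # A and B are the only trainable parameters; B starts at zero so the
        # adapted layer initially matches the frozen layer.
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


layer = LoRALinear(d_in=64, d_out=64, rank=4)
full = 64 * 64
print(layer.trainable_params(), full, layer.trainable_params() / full)
```

With rank r, the trainable count is r * (d_in + d_out) instead of d_in * d_out, which is where the parameter saving comes from.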

[42] Incorporating Rather Than Eliminating: Achieving Fairness for Skin Disease Diagnosis Through Group-Specific Expert cs.CVPDF

Gelei Xu, Yuying Duan, Zheyuan Liu, Xueyang Li, Meng Jiang

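TL;DR: FairMoE dynamically routes data to different expert modules, addressing the loss of diagnostic performance that conventional fairness methods incur by eliminating sensitive attributes in skin disease diagnosis; it improves accuracy while keeping fairness metrics comparable.

Details

Motivation: Existing AI-based skin disease diagnosis systems achieve high accuracy but are biased across demographic groups, leading to inequitable healthcare outcomes. Conventional methods try to eliminate the correlation between sensitive attributes and predictions, but this discards clinically relevant diagnostic cues and degrades performance.

Result: Experiments show FairMoE substantially improves diagnostic accuracy while preserving comparable fairness metrics, in contrast to previous fairness approaches that reduce performance.

Insight: Dynamically using sensitive attributes rather than eliminating them can preserve diagnostic performance in fairness tasks, suggesting a direction for fairness research in other domains.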

Abstract: AI-based systems have achieved high accuracy in skin disease diagnostics but often exhibit biases across demographic groups, leading to inequitable healthcare outcomes and diminished patient trust. Most existing bias mitigation methods attempt to eliminate the correlation between sensitive attributes and diagnostic prediction, but those methods often degrade performance due to the loss of clinically relevant diagnostic cues. In this work, we propose an alternative approach that incorporates sensitive attributes to achieve fairness. We introduce FairMoE, a framework that employs layer-wise mixture-of-experts modules to serve as group-specific learners. Unlike traditional methods that rigidly assign data based on group labels, FairMoE dynamically routes data to the most suitable expert, making it particularly effective for handling cases near group boundaries. Experimental results show that, unlike previous fairness approaches that reduce performance, FairMoE achieves substantial accuracy improvements while preserving comparable fairness metrics.
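
Illustrative sketch (not the authors' code): the contrast between assigning a sample to an expert by its group label and routing it by a learned gate can be shown with a generic softmax gate. The functions `route` and `mixture_output`, the gate weights, and the feature vectors below are hypothetical; the abstract does not specify FairMoE's gating function.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(features, gate_weights):
    """Return (index of the most probable expert, gate probabilities)."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in gate_weights]
    probs = softmax(logits)
    expert = max(range(len(probs)), key=lambda i: probs[i])
    return expert, probs


def mixture_output(expert_outputs, probs):
    """Soft combination of scalar expert outputs weighted by the gate."""
    return sum(p * out for p, out in zip(probs, expert_outputs))


# Hypothetical gate with two experts over two-dimensional features.
gate = [[4.0, 0.0], [0.0, 4.0]]
print(route([0.9, 0.1], gate))    # clearly closer to expert 0
print(route([0.48, 0.52], gate))  # near the boundary: probabilities close to 0.5
```

For a sample near the boundary, the gate spreads probability over both experts instead of committing to one group, which is the behavior the abstract highlights for cases near group boundaries.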

[43] Time-Contrastive Pretraining for In-Context Image and Video Segmentation cs.CVPDF

Assefa Wahd, Jacob Jaremko, Abhilash Hareendranathan

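TL;DR: This paper proposes Temporal, a time-contrastive self-supervised method for visual in-context learning (ICL), and reformulates ICL as a video object segmentation (VOS) task, removing grid-based methods' limits on the number and resolution of context images.

Details

Motivation: Mainstream ICL methods rely on a gridding strategy that lacks the flexibility vision applications need. This paper aims to overcome that limitation with time-contrastive self-supervised learning and improve in-context learning for vision tasks.

Result: On MICCAI FLARE 2022, the Dice score reaches 90.95% for image segmentation (a 10.64% improvement) and 92.45% for video segmentation (a 14.88% improvement).

Insight: Combining ICL with VOS is an effective way to overcome the limits of in-context learning in vision tasks, and time-contrastive learning helps improve context selection.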

Abstract: In-context learning (ICL) enables generalization to new tasks with minimal labeled data. However, mainstream ICL approaches rely on a gridding strategy, which lacks the flexibility required for vision applications. We introduce Temporal, a time-contrastive self-supervised objective that pretrains a prompt retriever for visual ICL, and formulate ICL as a video object segmentation (VOS) task. Temporal addresses key limitations of grid-based methods that restrict the number and resolution of context images. By reframing ICL as a VOS problem, our approach supports a variable number of context images while preserving their full resolution. To address the challenge of selecting optimal context sets for queries, we pretrain a prompt retriever on videos via self-supervised learning, where adjacent frames serve as positives and distant frames as negatives. For image segmentation, the prompt retriever selects relevant sequences that, when combined with the query, form coherent videos for VOS processing. For video segmentation, it identifies keyframes, predicts their masks using our ICL pipeline, and propagates them throughout the sequence. When evaluated on MICCAI FLARE 2022, our method achieves substantial improvements over baselines: 90.95% Dice score for image segmentation (10.64% improvement) and 92.45% Dice for video segmentation (14.88% improvement).
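
Illustrative sketch (not the authors' code): the abstract's pretraining rule, adjacent frames as positives and distant frames as negatives, can be written as an InfoNCE-style loss over frame embeddings. The window sizes `pos_window` and `neg_margin`, the temperature, and the use of cosine similarity are assumptions; the abstract does not give these details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def time_contrastive_loss(frames, anchor, pos_window=1, neg_margin=3, temperature=0.1):
    """InfoNCE-style loss for one anchor frame.

    frames: list of embedding vectors in temporal order.
    Frames within pos_window of the anchor are positives;
    frames at least neg_margin away are negatives.
    """
    positives = [i for i in range(len(frames))
                 if i != anchor and abs(i - anchor) <= pos_window]
    negatives = [i for i in range(len(frames)) if abs(i - anchor) >= neg_margin]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative frame")

    def score(i):
        return math.exp(cosine(frames[anchor], frames[i]) / temperature)

    negative_sum = sum(score(i) for i in negatives)
    losses = [-math.log(score(p) / (score(p) + negative_sum)) for p in positives]
    return sum(losses) / len(losses)


# Toy video: embeddings drift smoothly over time.
video = [[math.cos(0.3 * t), math.sin(0.3 * t)] for t in range(8)]
print(time_contrastive_loss(video, anchor=0))
```

The loss is small when neighboring frames are embedded close together and distant frames far apart, which is the property a retriever would need to pick context that forms a coherent sequence with the query.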

[44] Robust Foreground-Background Separation for Severely-Degraded Videos Using Convolutional Sparse Representation Modeling cs.CV | eess.IVPDF

Kazuki Naganuma, Shunsuke Ono

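TL;DR: This paper proposes a robust foreground-background separation (FBS) method based on convolutional sparse representation (CSR) for severely degraded videos. By combining data-specific and general features with explicit noise modeling, it significantly improves separation quality.

Details

Motivation: Existing FBS methods perform poorly at low frame rates and under various types of noise and cannot accurately separate foreground from background; the paper aims to solve this problem.

Result: Experiments show the method outperforms existing methods on infrared and microscope videos.

Insight: The CSR model effectively captures spatial structure, and multiconvex optimization with explicit noise modeling is key to robust FBS.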

Abstract: This paper proposes a foreground-background separation (FBS) method with a novel foreground model based on convolutional sparse representation (CSR). In order to analyze the dynamic and static components of videos acquired under undesirable conditions, such as hardware, environmental, and power limitations, it is essential to establish an FBS method that can handle videos with low frame rates and various types of noise. Existing FBS methods have two limitations that prevent us from accurately separating foreground and background components from such degraded videos. First, they only capture either data-specific or general features of the components. Second, they do not include explicit models for various types of noise to remove them in the FBS process. To this end, we propose a robust FBS method with a CSR-based foreground model. This model can adaptively capture specific spatial structures scattered in imaging data. Then, we formulate FBS as a constrained multiconvex optimization problem that incorporates CSR, functions that capture general features, and explicit noise characterization functions for multiple types of noise. Thanks to these functions, our method captures both data-specific and general features to accurately separate the components from various types of noise even under low frame rates. To obtain a solution of the optimization problem, we develop an algorithm that alternately solves its two convex subproblems by newly established algorithms. Experiments demonstrate the superiority of our method over existing methods using two types of degraded videos: infrared and microscope videos.

[45] Fetuses Made Simple: Modeling and Tracking of Fetal Shape and Pose cs.CVPDF

Yingcheng Liu, Peiqi Wang, Sebastian Diaz, Esra Abaci Turk, Benjamin Billot

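TL;DR: This paper proposes an SMPL-based 3D statistical fetal body model that estimates fetal shape and pose jointly, addressing the limitations of existing methods in motion analysis and shape capture.

Details

Motivation: Analyzing fetal motion and shape is vital in prenatal diagnostics. Existing methods (keypoints or volumetric segmentation) each fall short: the former ignores full-body shape detail, and the latter complicates temporal analysis because of large fetal movements.

Result: Achieves a surface alignment error of 3.2 mm on unseen data (with 3 mm MRI voxel size); the code is open source.

Insight: The model not only improves fetal motion and shape analysis but also supports intuitive visualization, providing a new tool for prenatal diagnostics.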

Abstract: Analyzing fetal body motion and shape is paramount in prenatal diagnostics and monitoring. Existing methods for fetal MRI analysis mainly rely on anatomical keypoints or volumetric body segmentations. Keypoints simplify body structure to facilitate motion analysis, but may ignore important details of full-body shape. Body segmentations capture complete shape information but complicate temporal analysis due to large non-local fetal movements. To address these limitations, we construct a 3D articulated statistical fetal body model based on the Skinned Multi-Person Linear Model (SMPL). Our algorithm iteratively estimates body pose in the image space and body shape in the canonical pose space. This approach improves robustness to MRI motion artifacts and intensity distortions, and reduces the impact of incomplete surface observations due to challenging fetal poses. We train our model on segmentations and keypoints derived from 19,816 MRI volumes across 53 subjects. Our model captures body shape and motion across time series and provides intuitive visualization. Furthermore, it enables automated anthropometric measurements traditionally difficult to obtain from segmentations and keypoints. When tested on unseen fetal body shapes, our method yields a surface alignment error of 3.2 mm for 3 mm MRI voxel size. To our knowledge, this represents the first 3D articulated statistical fetal body model, paving the way for enhanced fetal motion and shape analysis in prenatal diagnostics. The code is available at https://github.com/MedicalVisionGroup/fetal-smpl.

Xiaodong Guo, Zi’ang Lin, Luwen Hu, Zhihong Deng, Tong Liu

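TL;DR: The paper proposes CM-SSM, an efficient RGB-thermal semantic segmentation architecture that uses cross-modal state space modeling to reduce the computational overhead of multi-source data processing and achieves linear computational complexity.

Details

Motivation: In wild environments, fusing RGB and thermal data can significantly improve semantic segmentation, but existing methods (such as Transformer-based models) are computationally expensive and especially ill-suited to resource-constrained systems.

Result: Achieves state-of-the-art performance on the CART dataset with fewer parameters and lower computational cost; generalizability is validated on the PST900 dataset.

Insight: By using state space modeling, CM-SSM avoids the computational complexity of Transformers and offers an efficient multimodal semantic segmentation approach for resource-constrained systems.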

Abstract: The integration of RGB and thermal data can significantly improve semantic segmentation performance in wild environments for field robots. Nevertheless, multi-source data processing (e.g. Transformer-based approaches) imposes significant computational overhead, presenting challenges for resource-constrained systems. To resolve this critical limitation, we introduced CM-SSM, an efficient RGB-thermal semantic segmentation architecture leveraging a cross-modal state space modeling (SSM) approach. Our framework comprises two key components. First, we introduced a cross-modal 2D-selective-scan (CM-SS2D) module to establish SSM between RGB and thermal modalities, which constructs cross-modal visual sequences and derives hidden state representations of one modality from the other. Second, we developed a cross-modal state space association (CM-SSA) module that effectively integrates global associations from CM-SS2D with local spatial features extracted through convolutional operations. In contrast with Transformer-based approaches, CM-SSM achieves linear computational complexity with respect to image resolution. Experimental results show that CM-SSM achieves state-of-the-art performance on the CART dataset with fewer parameters and lower computational cost. Further experiments on the PST900 dataset demonstrate its generalizability. Codes are available at https://github.com/xiaodonguo/CMSSM.
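
Illustrative sketch (not the authors' code): the linear-complexity claim follows from how a state space model processes a sequence, with one constant-cost state update per element rather than the pairwise comparisons of self-attention. The scalar recurrence below is a deliberately simplified, single-modality stand-in; the coefficients `a`, `b`, and `c` are hypothetical, and the CM-SS2D module itself is not reproduced here.

```python
def ssm_scan(inputs, a, b, c):
    """Run the linear recurrence h_t = a * h_{t-1} + b * x_t, y_t = c * h_t.

    One pass over the sequence, so the cost grows linearly with its length.
    """
    hidden = 0.0
    outputs = []
    for x in inputs:
        hidden = a * hidden + b * x
        outputs.append(c * hidden)
    return outputs


# An impulse at the first position decays geometrically through the state.
print(ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0))
```

A sequence of length L needs L state updates here, whereas full self-attention over the same sequence compares all L * L pairs of positions.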

[47] SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model cs.CV | cs.AIPDF

Guankun Wang, Wenjin Mo, Junyi Wang, Long Bai, Kun Yuan

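TL;DR: SurgVidLM is the first video language model designed for both full and fine-grained surgical video understanding; with the SVU-31K dataset and the StageFocus mechanism, it significantly improves surgical video analysis.

Details

Motivation: There is a lack of specialized video language models for fine-grained surgical video understanding, even though surgical videos require capturing complex temporal information and details. SurgVidLM aims to fill this gap.

Result: Experiments show SurgVidLM outperforms existing Vid-LLMs on both full and fine-grained video understanding tasks, demonstrating a strong ability to capture complex procedural context.

Insight: Through its multi-grained design and purpose-built dataset, SurgVidLM offers a new approach to surgical video understanding and is particularly strong at detailed analysis and at capturing the overall procedure.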

Abstract: Recent advances in Multimodal Large Language Models have demonstrated great potential in the medical domain, facilitating users to understand surgical scenes and procedures. Beyond image-based methods, the exploration of Video Large Language Models (Vid-LLMs) has emerged as a promising avenue for capturing the complex sequences of information involved in surgery. However, there is still a lack of Vid-LLMs specialized for fine-grained surgical video understanding tasks, which is crucial for analyzing specific processes or details within a surgical procedure. To bridge this gap, we propose SurgVidLM, the first video language model designed to address both full and fine-grained surgical video comprehension. To train our SurgVidLM, we construct the SVU-31K dataset which consists of over 31K video-instruction pairs, enabling both holistic understanding and detailed analysis of surgical procedures. Furthermore, we introduce the StageFocus mechanism which is a two-stage framework performing the multi-grained, progressive understanding of surgical videos. We also develop the Multi-frequency Fusion Attention to effectively integrate low and high-frequency visual tokens, ensuring the retention of critical information. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs in both full and fine-grained video understanding tasks, showcasing its superior capability in capturing complex procedural contexts.

[48] StainPIDR: A Pathological Image Decoupling and Reconstruction Method for Stain Normalization Based on Color Vector Quantization and Structure Restaining cs.CV | cs.AIPDF

Zheng Chen

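TL;DR: This paper proposes StainPIDR, a stain normalization method that decouples a pathological image into structure features and vector-quantized color features, restains the structure features, and decodes them into a normalized image to remove color discrepancies. It also designs a template image selection algorithm to improve performance.

Details

Motivation: The color appearance of pathological images depends on imaging protocols, dye proportions, and scanning devices, so computer-aided diagnostic systems may degrade on color-variant images. StainPIDR aims to eliminate these color discrepancies.

Result: Experiments validate the effectiveness of StainPIDR, which performs well on the stain normalization task.

Insight: A fixed color codebook and a cross-attention mechanism help handle color variation efficiently, and the template selection algorithm further improves normalization.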

Abstract: The color appearance of a pathological image is highly related to the imaging protocols, the proportion of different dyes, and the scanning devices. Computer-aided diagnostic systems may deteriorate when facing these color-variant pathological images. In this work, we propose a stain normalization method called StainPIDR. We try to eliminate this color discrepancy by decoupling the image into structure features and vector-quantized color features, restaining the structure features with the target color features, and decoding the stained structure features to normalized pathological images. We assume that color features decoupled by different images with the same color should be exactly the same. Under this assumption, we train a fixed color vector codebook to which the decoupled color features will map. In the restaining part, we utilize the cross-attention mechanism to efficiently stain the structure features. As the target color (decoupled from a selected template image) will also affect the performance of stain normalization, we further design a template image selection algorithm to select a template from a given dataset. In our extensive experiments, we validate the effectiveness of StainPIDR and the template image selection algorithm. All the results show that our method can perform well in the stain normalization task. The code of StainPIDR will be publicly available later.
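
Illustrative sketch (not the authors' code): mapping decoupled color features to a fixed codebook is a nearest-neighbor lookup, the core step of vector quantization. The three-dimensional features, the `CODEBOOK` entries, and the squared Euclidean distance below are assumptions for illustration; the abstract does not state the feature dimension or the distance used.

```python
def quantize(feature, codebook):
    """Return (index, code) of the codebook entry nearest to feature."""
    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    index = min(range(len(codebook)),
                key=lambda i: squared_distance(feature, codebook[i]))
    return index, codebook[index]


# Hypothetical fixed codebook of three color vectors.
CODEBOOK = [
    [0.65, 0.32, 0.60],
    [0.20, 0.15, 0.45],
    [0.90, 0.85, 0.88],
]

# Two slightly different features snap to the same code.
print(quantize([0.62, 0.30, 0.58], CODEBOOK))
print(quantize([0.66, 0.33, 0.61], CODEBOOK))
```

Because nearby features collapse onto one codebook entry, images with the same stain color end up with identical quantized color features, matching the assumption stated in the abstract.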

[49] Cloud-Aware SAR Fusion for Enhanced Optical Sensing in Space Missions cs.CV | cs.LGPDF

Trong-An Bui, Thanh-Thoai Le

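TL;DR: The paper proposes a Cloud-Attentive Reconstruction Framework that combines SAR-optical feature fusion with deep learning-based reconstruction to generate high-quality cloud-free optical images.

Details

Motivation: Cloud contamination severely reduces the usability of optical satellite imagery, limiting critical applications such as environmental monitoring, disaster response, and land-use analysis.

Result: Experiments show the method outperforms existing approaches on PSNR, SSIM, and MAE, reaching a PSNR of 31.01 dB, an SSIM of 0.918, and an MAE of 0.017.

Insight: Fusing the structural information of SAR with the spectral characteristics of optical data can significantly improve cloud-free optical image quality, providing more reliable data for satellite remote sensing applications.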

Abstract: Cloud contamination significantly impairs the usability of optical satellite imagery, affecting critical applications such as environmental monitoring, disaster response, and land-use analysis. This research presents a Cloud-Attentive Reconstruction Framework that integrates SAR-optical feature fusion with deep learning-based image reconstruction to generate cloud-free optical imagery. The proposed framework employs an attention-driven feature fusion mechanism to align complementary structural information from Synthetic Aperture Radar (SAR) with spectral characteristics from optical data. Furthermore, a cloud-aware model update strategy introduces adaptive loss weighting to prioritize cloud-occluded regions, enhancing reconstruction accuracy. Experimental results demonstrate that the proposed method outperforms existing approaches, achieving a PSNR of 31.01 dB, SSIM of 0.918, and MAE of 0.017. These outcomes highlight the framework’s effectiveness in producing high-fidelity, spatially and spectrally consistent cloud-free optical images.
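
Illustrative sketch (not the authors' code): two of the reported metrics, MAE and PSNR, follow standard definitions, and a loss that prioritizes cloud-occluded pixels can be sketched as a weighted L1 term. The function `cloud_weighted_l1` and its fixed `cloud_weight` are assumptions; the abstract describes adaptive loss weighting without giving its form. Images are flattened to lists of values in [0, 1].

```python
import math


def mae(pred, target):
    """Mean absolute error between two equal-length pixel lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def cloud_weighted_l1(pred, target, cloud_mask, cloud_weight=2.0):
    """L1 loss where cloud-covered pixels count cloud_weight times as much."""
    weights = [cloud_weight if covered else 1.0 for covered in cloud_mask]
    total = sum(w * abs(p - t) for w, p, t in zip(weights, pred, target))
    return total / sum(weights)


pred = [0.0, 0.0]
target = [0.1, 0.3]
mask = [False, True]  # the second pixel is under cloud
print(mae(pred, target), cloud_weighted_l1(pred, target, mask))
```

Since the larger error sits under the cloud mask in this toy case, the weighted loss exceeds the plain MAE, so the optimizer is pushed harder on the occluded region.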

[50] EgoWorld: Translating Exocentric View to Egocentric View using Rich Exocentric Observations cs.CV | cs.AIPDF

Junho Park, Andrew Sangwoo Ye, Taein Kwon

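TL;DR: EgoWorld proposes a novel two-stage framework that translates exocentric (third-person) views into egocentric (first-person) views. It uses rich exocentric observations (projected point clouds, 3D hand poses, and textual descriptions) to overcome the limitations of existing methods, and requires neither synchronized multi-view input nor an initial egocentric frame.

Details

Motivation: Egocentric vision is essential for both human and machine visual understanding, but existing methods depend on 2D cues, synchronized multi-view settings, and other restrictive assumptions. EgoWorld aims to lift these restrictions through 3D reconstruction and diffusion-based generation, benefiting AR/VR and robotics applications.

Result: Outperforms existing methods on the H2O and TACO datasets and generalizes to unlabeled real-world data.

Insight: By bringing in 3D information, EgoWorld achieves more robust egocentric view generation and suggests a new direction for AR/VR and robotics applications.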

Abstract: Egocentric vision is essential for both human and machine visual understanding, particularly in capturing the detailed hand-object interactions needed for manipulation tasks. Translating third-person views into first-person views significantly benefits augmented reality (AR), virtual reality (VR) and robotics applications. However, current exocentric-to-egocentric translation methods are limited by their dependence on 2D cues, synchronized multi-view settings, and unrealistic assumptions such as the need for an initial egocentric frame and relative camera poses during inference. To overcome these challenges, we introduce EgoWorld, a novel two-stage framework that reconstructs an egocentric view from rich exocentric observations, including projected point clouds, 3D hand poses, and textual descriptions. Our approach reconstructs a point cloud from estimated exocentric depth maps, reprojects it into the egocentric perspective, and then applies diffusion-based inpainting to produce dense, semantically coherent egocentric images. Evaluated on the H2O and TACO datasets, EgoWorld achieves state-of-the-art performance and demonstrates robust generalization to new objects, actions, scenes, and subjects. Moreover, EgoWorld shows promising results even on unlabeled real-world examples.
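
Illustrative sketch (not the authors' code): reconstructing a point cloud from a depth map and reprojecting it into another view uses standard pinhole camera geometry. The intrinsics, the shared camera model for both views, and the rigid transform passed in as a known input are all assumptions; the abstract does not say how EgoWorld obtains the exocentric-to-egocentric transform.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    return [(u - cx) * depth / fx, (v - cy) * depth / fy, depth]


def transform(point, rotation, translation):
    """Apply a rigid transform: rotation (3 x 3 row lists) then translation."""
    return [sum(r * p for r, p in zip(row, point)) + t
            for row, t in zip(rotation, translation)]


def project(point, fx, fy, cx, cy):
    """Project a 3D point to pixel coordinates; None if it is behind the camera."""
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical intrinsics shared by both views.
K = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

point = backproject(100.0, 80.0, 2.0, **K)
# Shifting the point 0.1 m along x moves its projection horizontally.
print(project(transform(point, identity, [0.1, 0.0, 0.0]), **K))
```

Pixels that no reprojected point lands on remain empty, which is why an inpainting stage is needed to produce a dense egocentric image.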

[51] PostAlign: Multimodal Grounding as a Corrective Lens for MLLMs cs.CVPDF

Yixuan Wu, Yang Zhang, Jian Wu, Philip Torr, Jindong Gu

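TL;DR: The paper proposes MMGrounded-PostAlign, a post-multimodal alignment framework that strengthens the visual understanding of MLLMs and reduces hallucinations through a multimodal grounding module and a negative rejection mechanism.

Details

Motivation: In vision-language tasks, MLLMs over-rely on spurious correlations, especially linguistic priors, which leads the model to neglect the actual visual information. The paper proposes a post-alignment framework to address this.

Result: Performs strongly on benchmarks such as POPE and HaloQuest, with significant improvements in fine-grained visual understanding and hallucination suppression.

Insight: A post-alignment framework can effectively correct the reliance bias of MLLMs and strengthen vision-text coordination, offering a more reliable technical path for multimodal tasks.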

Abstract: Multimodal Large Language Models (MLLMs) excel in vision-language tasks, such as image captioning and visual question answering. However, they often suffer from over-reliance on spurious correlations, primarily due to linguistic priors that distract the model from leveraging actual visual information. To address these issues, we introduce MMGrounded-PostAlign, a post-multimodal alignment framework designed to enhance the visual understanding capabilities and mitigate the hallucinations of MLLMs. Our framework incorporates a multimodal grounding module for both visual grounding, which identifies the referred object in the image, and textual grounding, which generates the rationale for the final answer, ensuring that outputs are anchored in both visual and textual evidence. To mitigate the hallucinations, we introduce a negative rejection mechanism in the visual grounding module to distinguish grounded entities from non-existent objects influenced by linguistic biases. On the textual grounding side, we propose a selective reasoning mechanism that adjusts the model’s reasoning strategy based on query complexity. Extensive evaluations are conducted on benchmarks such as POPE, HaloQuest, VQAv2, MME, and MMBench showing significant improvements in fine-grained visual understanding and hallucination suppression.

[52] Cause-Effect Driven Optimization for Robust Medical Visual Question Answering with Language Biases cs.CV | cs.AIPDF

Huanjia Zhu, Yishu Liu, Xiaozhao Fang, Guangming Lu, Bingzhi Chen

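TL;DR: The paper proposes CEDO, a cause-effect driven optimization framework that uses three mechanisms (MHO, GMS, and DLR) to comprehensively mitigate language bias in medical visual question answering from both the cause and effect sides, and reports strong experimental results.

Details

Motivation: Medical visual question answering (Med-VQA) models often perform poorly because of language biases, such as spurious correlations between question types and answer categories. The paper aims to address this with a cause-effect driven optimization method.

Result: On multiple traditional and bias-sensitive benchmarks, CEDO shows better robustness than existing methods.

Insight: Optimizing from both the cause and effect perspectives addresses language bias more comprehensively; modality synergy and dynamic loss weighting are the key innovations.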

Abstract: Existing Medical Visual Question Answering (Med-VQA) models often suffer from language biases, where spurious correlations between question types and answer categories are inadvertently established. To address these issues, we propose a novel Cause-Effect Driven Optimization framework called CEDO, that incorporates three well-established mechanisms, i.e., Modality-driven Heterogeneous Optimization (MHO), Gradient-guided Modality Synergy (GMS), and Distribution-adapted Loss Rescaling (DLR), for comprehensively mitigating language biases from both causal and effectual perspectives. Specifically, MHO employs adaptive learning rates for specific modalities to achieve heterogeneous optimization, thus enhancing robust reasoning capabilities. Additionally, GMS leverages the Pareto optimization method to foster synergistic interactions between modalities and enforce gradient orthogonality to eliminate bias updates, thereby mitigating language biases from the effect side, i.e., shortcut bias. Furthermore, DLR is designed to assign adaptive weights to individual losses to ensure balanced learning across all answer categories, effectively alleviating language biases from the cause side, i.e., imbalance biases within datasets. Extensive experiments on multiple traditional and bias-sensitive benchmarks consistently demonstrate the robustness of CEDO over state-of-the-art competitors.
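One common way to enforce gradient orthogonality is to project an update onto the subspace orthogonal to an unwanted direction. The sketch below shows that generic projection; treating `bias_grad` as the gradient of a language-only (shortcut) branch is an assumption made for illustration, since the abstract does not specify which gradients GMS orthogonalizes.

```python
import numpy as np

def remove_bias_component(grad, bias_grad, eps=1e-12):
    """Project grad onto the subspace orthogonal to bias_grad.

    bias_grad is a hypothetical 'shortcut' direction; the result has
    (numerically) zero inner product with it.
    """
    grad = np.asarray(grad, dtype=float)
    bias_grad = np.asarray(bias_grad, dtype=float)
    scale = np.dot(grad, bias_grad) / (np.dot(bias_grad, bias_grad) + eps)
    return grad - scale * bias_grad
```

For example, projecting the gradient `[3, 4]` away from the direction `[1, 0]` leaves approximately `[0, 4]`: the component along the bias direction is removed and the rest of the update is kept.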

[53] Feedback Driven Multi Stereo Vision System for Real-Time Event Analysis cs.CV | cs.AIPDF

Mohamed Benkedadra, Matei Mancas, Sidi Ahmed Mahmoudi

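TL;DR: The paper proposes a real-time event analysis system based on multiple 3D stereo cameras. It fuses several 3D cameras for full scene reconstruction, supports tasks such as event recognition and subject tracking, and uses a feedback mechanism to improve its decisions.

Details

Motivation: Existing 2D cameras and short-range 3D cameras are not reliable enough in large, complex scenes and cannot meet the needs of interactive systems, so a more reliable 3D stereo vision solution is needed.

Result: Preliminary experiments validate the feasibility of the pipeline and show its potential for tasks such as event recognition and subject tracking.

Insight: Combining multi-camera fusion with a feedback mechanism can significantly improve the robustness and adaptability of stereo vision systems in complex environments.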

Abstract: 2D cameras are often used in interactive systems. Other systems like gaming consoles provide more powerful 3D cameras for short range depth sensing. Overall, these cameras are not reliable in large, complex environments. In this work, we propose a 3D stereo vision based pipeline for interactive systems that is able to handle both ordinary and sensitive applications through robust scene understanding. We explore the fusion of multiple 3D cameras to do full scene reconstruction, which allows for performing a wide range of tasks, like event recognition, subject tracking, and notification. Using possible feedback approaches, the system can receive data from the subjects present in the environment, to learn to make better decisions, or to adapt to completely new environments. Throughout the paper, we introduce the pipeline and explain our preliminary experimentation and results. Finally, we draw the roadmap for the next steps that need to be taken in order to get this pipeline into production.

[54] PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis cs.CV | cs.MMPDF

Chuhao Jin, Haosen Li, Bingzi Zhang, Che Liu, Xiting Wang

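TL;DR: PlanMoGPT addresses the performance bottleneck of LLMs in text-to-motion generation through progressive planning and flow-enhanced fine-grained motion tokenization, improving both quality and diversity.

Details

Motivation: Current LLM-based text-to-motion methods lag far behind non-LLM methods, mainly because of the granularity of motion tokenization: fine-grained tokenization causes local dependency issues, while coarse-grained tokenization sacrifices motion details.

Result: On long-sequence generation, the FID score improves by 63.8% and motion diversity improves by 49.9%, significantly outperforming existing methods.

Insight: Progressive planning and flow-enhanced fine-grained tokenization are key to resolving the diversity-quality trade-off in text-to-motion generation.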

Abstract: Recent advances in large language models (LLMs) have enabled breakthroughs in many multimodal generation tasks, but a significant performance gap still exists in text-to-motion generation, where LLM-based methods lag far behind non-LLM methods. We identify the granularity of motion tokenization as a critical bottleneck: fine-grained tokenization induces local dependency issues, where LLMs overemphasize short-term coherence at the expense of global semantic alignment, while coarse-grained tokenization sacrifices motion details. To resolve this issue, we propose PlanMoGPT, an LLM-based framework integrating progressive planning and flow-enhanced fine-grained motion tokenization. First, our progressive planning mechanism leverages LLMs’ autoregressive capabilities to hierarchically generate motion tokens by starting from sparse global plans and iteratively refining them into full sequences. Second, our flow-enhanced tokenizer doubles the downsampling resolution and expands the codebook size by eight times, minimizing detail loss during discretization, while a flow-enhanced decoder recovers motion nuances. Extensive experiments on text-to-motion benchmarks demonstrate that it achieves state-of-the-art performance, improving FID scores by 63.8% (from 0.380 to 0.141) on long-sequence generation while enhancing motion diversity by 49.9% compared to existing methods. The proposed framework successfully resolves the diversity-quality trade-off that plagues current non-LLM approaches, establishing new standards for text-to-motion generation.

[55] IDAL: Improved Domain Adaptive Learning for Natural Images Dataset cs.CV | cs.AI | cs.LGPDF

Ravi Kant Gupta, Shounak Das, Amit Sethi

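TL;DR: This paper proposes IDAL, an improved unsupervised domain adaptation (UDA) method that combines a ResNet and FPN architecture with a novel loss function to address the alignment of multimodal distributions in natural image data.

Details

Motivation: Existing adversarial domain adaptation methods may fail to align the representations of different domains in classification problems with multimodal distributions. To address this, the authors propose an improved method that uses both content and style features.

Result: Outperforms existing CNN-based methods on Office-Home, Office-31, and VisDA-2017, and performs comparably on DomainNet.

Insight: Handling content and style features together aligns natural image domains with multimodal distributions more effectively. A tailored loss design can significantly improve domain adaptation performance and training efficiency.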

Abstract: We present a novel approach for unsupervised domain adaptation (UDA) for natural images. A commonly-used objective for UDA schemes is to enhance domain alignment in representation space even if there is a domain shift in the input space. Existing adversarial domain adaptation methods may not effectively align different domains of multimodal distributions associated with classification problems. Our approach has two main features. Firstly, its neural architecture uses the deep structure of ResNet and the effective separation of scales of the feature pyramid network (FPN) to work with both content and style features. Secondly, it uses a combination of a novel loss function and judiciously selected existing loss functions to train the network architecture. This tailored combination is designed to address challenges inherent to natural images, such as scale, noise, and style shifts, that occur on top of a multi-modal (multi-class) distribution. The combined loss function not only enhances model accuracy and robustness on the target domain but also speeds up training convergence. Our proposed UDA scheme generalizes better than state-of-the-art CNN-based methods on the Office-Home, Office-31, and VisDA-2017 datasets and is comparable on the DomainNet dataset.

[56] GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning cs.CV | cs.AIPDF

Bo Liu, Xiangyu Zhao, Along He, Yidi Chen, Huazhu Fu

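TL;DR: The paper proposes the ThinkVG dataset and a verifiable reward mechanism for reinforcement learning to improve the reliability and interpretability of medical VQA models.

Details

Motivation: Existing medical VQA methods suffer from limited answer reliability and poor interpretability, which limits trust in them for clinical decision-making.

Result: Achieves comparable performance with only one-eighth of the training data, demonstrating the efficiency and effectiveness of the method.

Insight: Intermediate reasoning steps explicitly grounded in visual regions, combined with a reinforcement learning reward mechanism, may be an effective route to more reliable and interpretable medical VQA.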

Abstract: Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. While recent advances in multi-modal learning have significantly improved performance, current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model-generated answers. To address this, this work first proposes a Thinking with Visual Grounding (ThinkVG) dataset wherein the answer generation is decomposed into intermediate reasoning steps that explicitly ground relevant visual regions of the medical image, thereby providing fine-grained explainability. Furthermore, we introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model’s reasoning process and its final answer. Remarkably, our method achieves comparable performance using only one-eighth of the training data, demonstrating the efficiency and effectiveness of the proposal. The dataset is available at https://huggingface.co/datasets/BoKelvin/GEMeX-ThinkVG.

[57] SegChange-R1: Augmented Reasoning for Remote Sensing Change Detection via Large Language Models cs.CVPDF

Fei Zhou

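TL;DR: The paper proposes SegChange-R1, an LLM-augmented reasoning method for remote sensing change detection that improves detection by integrating textual descriptions. It also designs BEV, a spatial transformation module based on linear attention, and constructs DVCD, the first building change detection dataset from UAV viewpoints.

Details

Motivation: Remote sensing change detection is widely used across many fields, but existing methods fall short in modality alignment and detection efficiency, so a more efficient and accurate method is needed.

Result: On four widely used change detection datasets, SegChange-R1 significantly outperforms existing methods.

Insight: Integrating textual information from the LLM and the BEV spatial transformation effectively improves the accuracy and convergence speed of remote sensing change detection.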

Abstract: Remote sensing change detection is widely used in a variety of fields such as urban planning, terrain and geomorphology analysis, and environmental monitoring. It mainly analyzes significant changes in features (e.g., building changes) within the same spatial region at different times. In this paper, we propose a large language model (LLM) augmented inference approach (SegChange-R1), which enhances the detection capability by integrating textual descriptive information and aims at guiding the model to segment the change regions of greater interest, thus accelerating the convergence speed. Moreover, we design a spatial transformation module (BEV) based on linear attention, which solves the problem of modal misalignment in change detection by unifying features from different temporal perspectives onto the BEV space. In addition, we construct the first dataset for building change detection from UAV viewpoints (DVCD), and our experiments on four widely-used change detection datasets show a significant improvement over existing methods. The code and pre-trained models are available at https://github.com/Yu-Zhouz/SegChange-R1.
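For readers unfamiliar with linear attention, the sketch below shows the generic kernelized form, in which keys and values are summarized once so the cost grows linearly with sequence length. It uses the widely used `elu(x) + 1` feature map as an assumption; it is not the paper's BEV module, whose design is not detailed in the abstract.

```python
import numpy as np

def feature_map(x):
    """elu(x) + 1, which keeps every feature strictly positive."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(Q, K, V):
    """Kernelized attention computed without forming the N x N matrix.

    Q, K: (N, d) queries and keys; V: (N, d_v) values.
    """
    Qf, Kf = feature_map(Q), feature_map(K)
    kv = Kf.T @ V              # (d, d_v) summary of keys and values
    z = Kf.sum(axis=0)         # (d,) normalizer
    return (Qf @ kv) / (Qf @ z)[:, None]
```

The output equals row-normalized attention with similarity `feature_map(q) · feature_map(k)`, but the `N x N` attention matrix is never materialized.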

[58] Classification of Tents in Street Bazaars Using CNN cs.CVPDF

Azamat Ibragimov, Ruslan Isaev, Remudin Reshid Mekuria, Gulnaz Gimaletdinova, Dim Shaiakhmetov

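TL;DR: This paper proposes an improved deep learning model for classifying tents in street bazaars, compares a custom CNN with EfficientNetB0, and demonstrates the effectiveness of transfer learning.

Details

Motivation: Street bazaars are important economic hubs in many regions, but their unstructured nature makes automatic classification of infrastructure such as tents challenging. Traditional manual methods are inefficient, so the application of deep learning methods is worth exploring.

Result: The custom CNN reaches 92.8% accuracy and EfficientNetB0 reaches 98.4%, showing that a pre-trained model significantly improves classification performance; confusion matrix analysis reveals the strengths and weaknesses of each model.

Insight: Pre-trained models such as EfficientNetB0 offer higher accuracy and better generalization on specific tasks such as bazaar tent classification, and are suited to transfer learning in similar scenarios.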

Abstract: This research paper proposes an improved deep learning model for classifying tents in street bazaars, comparing a custom Convolutional Neural Network (CNN) with EfficientNetB0. Tent classification is a critical task for market organization, but past manual methods have been inefficient. Street bazaars represent a vital economic hub in many regions, yet their unstructured nature poses significant challenges for the automated classification of market infrastructure, such as tents. In Kyrgyzstan, more than a quarter of the country’s GDP is derived from bazaars. While CNNs have been widely applied to object recognition, their application to bazaar-specific tasks remains underexplored. Here, we build upon our original approach by training on an extended set of 126 original photographs that were augmented to generate additional images. This dataset is publicly available for download on Kaggle. A variety of performance metrics, such as accuracy, precision, recall, F1 score, and mean average precision (mAP), were used to assess the models comparatively, providing a more extensive analysis of classification performance. The results show that the custom CNN model achieved 92.8% accuracy and EfficientNetB0 achieved 98.4% accuracy, confirming the effectiveness of transfer learning in bazaar image classification. In addition, confusion matrix analysis reveals the weaknesses and strengths of each model. These findings suggest that using a pre-trained model such as EfficientNetB0 significantly improves classification accuracy and generalization.
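Several of the metrics named in the abstract (accuracy, precision, recall, F1) can be read directly off a confusion matrix. The sketch below shows one way to do that; the row/column convention (rows are true classes, columns are predicted classes) is an assumption of this example, and the paper's own evaluation code is not shown in this digest.

```python
import numpy as np

def per_class_metrics(cm):
    """Accuracy plus per-class precision, recall, and F1 from a confusion matrix.

    Assumes cm[i, j] counts samples of true class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / np.maximum(cm.sum(axis=0), 1e-12)  # column sums: predicted counts
    recall = tp / np.maximum(cm.sum(axis=1), 1e-12)     # row sums: true counts
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision, recall, f1
```

The per-class values expose which class a model confuses with which, information that a single accuracy figure hides.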

[59] Mobile Image Analysis Application for Mantoux Skin Test cs.CVPDF

Liong Gele, Tan Chye Cheah

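TL;DR: The paper presents a newly developed mobile application for diagnosing latent tuberculosis infection (LTBI) with the Mantoux skin test (TST), addressing the low follow-up rates and subjective interpretation of traditional methods.

Details

Motivation: Traditional TST methods suffer from low follow-up rates, patient discomfort, and subjective interpretation errors; the ball-point pen method in particular is prone to misdiagnosis.

Result: In evaluation against standard clinical practice, the application shows better accuracy and reliability, providing an efficient tool for tuberculosis management.

Insight: Automation can significantly improve the efficiency of tuberculosis diagnosis in resource-limited regions; future work focuses on refining the algorithms and expanding functionality.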

Abstract: This paper presents a newly developed mobile application designed to diagnose Latent Tuberculosis Infection (LTBI) using the Mantoux Skin Test (TST). Traditional TST methods often suffer from low follow-up return rates, patient discomfort, and subjective manual interpretation, particularly with the ball-point pen method, leading to misdiagnosis and delayed treatment. Moreover, unlike previously developed mobile applications that used 3D reconstruction, this app utilizes scaling stickers as reference objects for induration measurement. This mobile application integrates advanced image processing technologies, including ARCore, and machine learning algorithms such as DeepLabv3 for robust image segmentation and precise measurement of skin indurations indicative of LTBI. The system employs an edge detection algorithm to enhance accuracy. The application was evaluated against standard clinical practices, demonstrating significant improvements in accuracy and reliability. This innovation is crucial for effective tuberculosis management, especially in resource-limited regions. By automating and standardizing TST evaluations, the application enhances the accessibility and efficiency of TB diagnostics. Future work will focus on refining machine learning models, optimizing measurement algorithms, expanding functionalities to include comprehensive patient data management, and enhancing ARCore’s performance across various lighting conditions and operational settings.
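The idea of using a sticker of known size as a reference object reduces to a scale conversion from pixels to millimetres. The sketch below illustrates that arithmetic only; the function name, the argument names, and the numbers in the usage note are hypothetical, and the app's actual measurement pipeline (segmentation, edge detection, ARCore) is not reproduced.

```python
def induration_mm(induration_px, sticker_px, sticker_mm):
    """Convert an induration width from pixels to millimetres.

    sticker_px is the measured pixel width of the reference sticker and
    sticker_mm its known physical width; both are assumed to lie in the
    same image plane as the induration.
    """
    if sticker_px <= 0:
        raise ValueError("sticker_px must be positive")
    mm_per_px = sticker_mm / sticker_px
    return induration_px * mm_per_px
```

For instance, if a 20 mm sticker spans 200 pixels, each pixel corresponds to 0.1 mm, so an induration spanning 120 pixels measures about 12 mm.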

[60] OSDMamba: Enhancing Oil Spill Detection from Remote Sensing Images Using Selective State Space Model cs.CVPDF

Shuaiyu Chen, Fu Wang, Peng Ren, Chunbo Luo, Zeyu Fu

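TL;DR: OSDMamba is the first Mamba-based architecture for oil spill detection. It uses a selective state space model to improve oil spill detection in remote sensing images, addresses limited samples and class imbalance, and achieves significant gains on two public datasets.

Details

Motivation: Existing CNN-based methods struggle to detect small oil spill areas and capture limited global information, while scarce samples and class imbalance further reduce detection accuracy.

Result: On two public datasets, OSDMamba achieves improvements of 8.9% and 11.8%, reaching state-of-the-art performance.

Insight: Selective state space models show promise for remote sensing image segmentation, particularly for small targets and global context.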

Abstract: Semantic segmentation is commonly used for Oil Spill Detection (OSD) in remote sensing images. However, the limited availability of labelled oil spill samples and class imbalance present significant challenges that can reduce detection accuracy. Furthermore, most existing methods, which rely on convolutional neural networks (CNNs), struggle to detect small oil spill areas due to their limited receptive fields and inability to effectively capture global contextual information. This study explores the potential of State-Space Models (SSMs), particularly Mamba, to overcome these limitations, building on their recent success in vision applications. We propose OSDMamba, the first Mamba-based architecture specifically designed for oil spill detection. OSDMamba leverages Mamba’s selective scanning mechanism to effectively expand the model’s receptive field while preserving critical details. Moreover, we designed an asymmetric decoder incorporating ConvSSM and deep supervision to strengthen multi-scale feature fusion, thereby enhancing the model’s sensitivity to minority class samples. Experimental results show that the proposed OSDMamba achieves state-of-the-art performance, yielding improvements of 8.9% and 11.8% in OSD across two publicly available datasets.

[61] On the Robustness of Human-Object Interaction Detection against Distribution Shift cs.CV | cs.MMPDF

Chi Xie, Shuang Liang, Jie Li, Feng Zhu, Rui Zhao

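TL;DR: The paper studies the robustness of human-object interaction (HOI) detection models under distribution shift, proposes an automated approach for building a benchmark, and proposes two enhancements: cross-domain data augmentation and a feature fusion strategy.

Details

Motivation: Existing HOI detection research focuses on the standard setting with ideal images and natural distributions, overlooking real-world distribution shifts and limiting practical applicability.

Result: Experiments show the proposed approach significantly improves the robustness of various models, with gains on standard benchmarks as well.

Insight: Robustness in HOI detection differs from that in other tasks; cross-domain data augmentation and feature fusion are simple and effective solutions.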

Abstract: Human-Object Interaction (HOI) detection has seen substantial advances in recent years. However, existing works focus on the standard setting with ideal images and natural distribution, far from practical scenarios with inevitable distribution shifts. This hampers the practical applicability of HOI detection. In this work, we investigate this issue by benchmarking, analyzing, and enhancing the robustness of HOI detection models under various distribution shifts. We start by proposing a novel automated approach to create the first robustness evaluation benchmark for HOI detection. Subsequently, we evaluate more than 40 existing HOI detection models on this benchmark, showing their insufficiency, analyzing the features of different frameworks, and discussing how the robustness in HOI is different from other tasks. With the insights from such analyses, we propose to improve the robustness of HOI detection methods through: (1) a cross-domain data augmentation integrated with mixup, and (2) a feature fusion strategy with frozen vision foundation models. Both are simple, plug-and-play, and applicable to various methods. Our experimental results demonstrate that the proposed approach significantly increases the robustness of various methods, with benefits on standard benchmarks, too. The dataset and code will be released.
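Mixup, which the first proposed enhancement builds on, forms convex combinations of pairs of inputs and their labels. The sketch below shows plain mixup; pairing a sample from one domain (`x_a`) with a sample from another (`x_b`) is an assumption about how the cross-domain variant might be applied, and the default `alpha` is a hypothetical choice.

```python
import numpy as np

def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=None):
    """Blend two samples and their label vectors with a Beta-distributed weight.

    x_a, x_b: arrays of the same shape (e.g. images from two domains).
    y_a, y_b: label vectors of the same shape (e.g. one-hot or multi-hot).
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    x = lam * x_a + (1.0 - lam) * x_b
    y = lam * y_a + (1.0 - lam) * y_b
    return x, y, lam
```

Because the same weight `lam` is applied to inputs and labels, the blended label reflects how much each source sample contributes to the blended input.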

[62] PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding cs.CV | cs.AI | cs.CLPDF

Kui Huang, Xinrong Chen, Wenyu Lv, Jincheng Liao, Guanzhong Wang

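TL;DR: PP-DocBee2 is an improved version of PP-DocBee that aims to improve multimodal document understanding through higher-quality synthetic data, an improved visual feature fusion strategy, and optimized inference.

Details

Motivation: In multimodal document understanding, data quality and the model's feature fusion strategy strongly affect performance; PP-DocBee2 targets both.

Result: Performance on Chinese business document tasks improves by 11.4% and inference latency drops by 73%.

Insight: Data quality and fuller use of intermediate features are key directions for improving multimodal document understanding.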

Abstract: This report introduces PP-DocBee2, an advanced version of PP-DocBee, designed to enhance multimodal document understanding. Built on a large multimodal model architecture, PP-DocBee2 addresses the limitations of its predecessor through key technological improvements, including enhanced synthetic data quality, an improved visual feature fusion strategy, and optimized inference methodologies. These enhancements yield an 11.4% performance boost on internal benchmarks for Chinese business documents, and reduce inference latency by 73.0% compared to the vanilla version. A key innovation of our work is a data quality optimization strategy for multimodal document tasks. By employing a large-scale multimodal pre-trained model to evaluate data, we apply a novel statistical criterion to filter outliers, ensuring high-quality training data. Inspired by insights into underutilized intermediate features in multimodal models, we enhance the ViT representational capacity by decomposing it into layers and applying a novel feature fusion strategy to improve complex reasoning. The source code and pre-trained model are available at https://github.com/PaddlePaddle/PaddleMIX.

[63] MiCo: Multiple Instance Learning with Context-Aware Clustering for Whole Slide Image Analysis cs.CVPDF

Junjian Li, Hulin Kuang, Jin Liu, Hailin Yue, Mengshen He

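TL;DR: The paper proposes MiCo, a multiple instance learning framework that uses context-aware clustering to improve cross-regional tissue correlations in whole slide image (WSI) analysis. MiCo strengthens correlations between tissues through clustering and semantic anchors and significantly outperforms existing methods.

Details

Motivation: The inherent spatial heterogeneity of whole slide images means morphologically similar tissue types are dispersed across distant anatomical regions, and conventional MIL methods struggle to model such distributions and cross-regional interactions.

Result: Experiments on nine large-scale public cancer datasets show that MiCo clearly outperforms existing state-of-the-art methods.

Insight: Context-aware clustering and a semantic anchor mechanism model the complex spatial distributions and semantic relationships in whole slide images more effectively.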

Abstract: Multiple instance learning (MIL) has shown significant promise in histopathology whole slide image (WSI) analysis for cancer diagnosis and prognosis. However, the inherent spatial heterogeneity of WSIs presents critical challenges, as morphologically similar tissue types are often dispersed across distant anatomical regions. Conventional MIL methods struggle to model these scattered tissue distributions and capture cross-regional spatial interactions effectively. To address these limitations, we propose a novel Multiple instance learning framework with Context-Aware Clustering (MiCo), designed to enhance cross-regional intra-tissue correlations and strengthen inter-tissue semantic associations in WSIs. MiCo begins by clustering instances to distill discriminative morphological patterns, with cluster centroids serving as semantic anchors. To enhance cross-regional intra-tissue correlations, MiCo employs a Cluster Route module, which dynamically links instances of the same tissue type across distant regions via feature similarity. These semantic anchors act as contextual hubs, propagating semantic relationships to refine instance-level representations. To eliminate semantic fragmentation and strengthen inter-tissue semantic associations, MiCo integrates a Cluster Reducer module, which consolidates redundant anchors while enhancing information exchange between distinct semantic groups. Extensive experiments on two challenging tasks across nine large-scale public cancer datasets demonstrate the effectiveness of MiCo, showcasing its superiority over state-of-the-art methods. The code is available at https://github.com/junjianli106/MiCo.

[64] Pre-Trained LLM is a Semantic-Aware and Generalizable Segmentation Booster cs.CV | cs.AI | cs.MMPDF

Fenghe Tang, Wenxin Ma, Zhiyang He, Xiaodong Tao, Zihang Jiang

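TL;DR: The paper finds that a pre-trained large language model (LLM) can be used for medical image segmentation. It proposes LLM4Seg, a hybrid structure that improves segmentation performance with a frozen LLM layer while adding very few trainable parameters.

Details

Motivation: The work aims to use the semantic understanding of pre-trained LLMs to enhance medical image segmentation and to explore the potential of LLMs in vision tasks.

Result: Experiments show that the method improves segmentation performance across modalities (ultrasound, dermoscopy, polypscopy, and CT) and is effective with multiple LLMs, such as LLaMA and DeepSeek.

Insight: The key finding is that an LLM's semantic awareness can transfer to vision tasks, improving both global understanding and local modeling.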

Abstract: With the advancement of Large Language Model (LLM) for natural language processing, this paper presents an intriguing finding: a frozen pre-trained LLM layer can process visual tokens for medical image segmentation tasks. Specifically, we propose a simple hybrid structure that integrates a pre-trained, frozen LLM layer within the CNN encoder-decoder segmentation framework (LLM4Seg). Surprisingly, this design improves segmentation performance with a minimal increase in trainable parameters across various modalities, including ultrasound, dermoscopy, polypscopy, and CT scans. Our in-depth analysis reveals the potential of transferring LLM’s semantic awareness to enhance segmentation tasks, offering both improved global understanding and better local modeling capabilities. The improvement proves robust across different LLMs, validated using LLaMA and DeepSeek.

Dongdong Meng, Sheng Li, Hao Wu, Suqing Tian, Wenjun Ma

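TL;DR: CmFNet is a novel 3D weakly supervised cross-modal medical image segmentation method. It integrates multi-modal image information through modality-specific and cross-modal feature learning networks, and improves segmentation with a hybrid-supervised learning strategy.

Details

Motivation: The work addresses the performance degradation and overfitting caused by sparse annotations in medical image segmentation, proposing a more efficient weakly supervised learning method.

Result: Experiments show that CmFNet outperforms existing methods in both weakly supervised and fully supervised settings, and is particularly strong on challenging small tumor regions.

Insight: Effectively integrating multi-modal information and combining it with a hybrid supervision strategy can significantly improve the robustness and performance of medical image segmentation.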

Abstract: Accurate automatic medical image segmentation relies on high-quality, dense annotations, which are costly and time-consuming. Weakly supervised learning provides a more efficient alternative by leveraging sparse and coarse annotations instead of dense, precise ones. However, segmentation performance degradation and overfitting caused by sparse annotations remain key challenges. To address these issues, we propose CmFNet, a novel 3D weakly supervised cross-modal medical image segmentation approach. CmFNet consists of three main components: a modality-specific feature learning network, a cross-modal feature learning network, and a hybrid-supervised learning strategy. Specifically, the modality-specific feature learning network and the cross-modal feature learning network effectively integrate complementary information from multi-modal images, enhancing shared features across modalities to improve segmentation performance. Additionally, the hybrid-supervised learning strategy guides segmentation through scribble supervision, intra-modal regularization, and inter-modal consistency, modeling spatial and contextual relationships while promoting feature alignment. Our approach effectively mitigates overfitting, delivering robust segmentation results. It excels in segmenting both challenging small tumor regions and common anatomical structures. Extensive experiments on a clinical cross-modal nasopharyngeal carcinoma (NPC) dataset (including CT and MR imaging) and the publicly available CT Whole Abdominal Organ dataset (WORD) show that our approach outperforms state-of-the-art weakly supervised methods. In addition, our approach also outperforms fully supervised methods when full annotation is used. Our approach can facilitate clinical therapy and benefit various specialists, including physicists, radiologists, pathologists, and oncologists.

[66] CLGRPO: Reasoning Ability Enhancement for Small VLMs cs.CVPDF

Fanyi Wang, Binzhi Dong, Haotian Hu, Jinjin Xu, Zhiwang Zhang

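TL;DR: The paper proposes an incremental training strategy (CLGRPO) that builds chain-of-thought (COT) data in a self-supervised manner and optimizes the reasoning ability of small vision language models (SVLMs) in stages, significantly improving the performance of a 1B-parameter model.

Details

Motivation: Small vision language models (SVLMs) have limited reasoning ability because of their parameter count, but their low cost and high commercial value make further optimization worthwhile.

Result: On the EMOSet-118K dataset, the 1B SVLM gains 2.77 in accuracy and 0.69 in recall, approaching the performance of 8B models.

Insight: Staged optimization and constraining the capture space can effectively improve the reasoning ability of small models, balancing compute and performance.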

Abstract: Small Vision Language Models (SVLMs) generally refer to models with parameter sizes less than or equal to 2B. Their low cost and power consumption characteristics confer high commercial value. However, their reasoning abilities are limited by the number of parameters. To address this issue, this paper proposes a post-training optimization paradigm called the Incremental Training Strategy to enhance the reasoning ability of SVLMs. Firstly, we constructed a Self-Supervised Chain-of-Thought (COT) Data Construction System, which leverages multiple LVLMs with 7B parameters or more to transform original data into COT data in a self-supervised manner. Our proposed Incremental Training Strategy consists of four stages. Stage 1 injects domain knowledge by performing Supervised Fine-Tuning (SFT) to the pretrained model on the COT data. Stage 2 aligns the COT data format by conducting a small amount of Group Relative Policy Optimization (GRPO) training constrained only by format rewards on the COT data. Stage 3 enhances reasoning ability by applying GRPO training on the COT data with constraints on both format and accuracy rewards. The resulting model shows significant improvement compared to the baseline. Stage 4 addresses the limited capacity of the SVLMs and the weak ability to capture complex patterns by proposing ClipLow GRPO (CLGRPO) to constrain the capture space of the training process. We conducted extensive comparative and ablation experiments on the abstract semantic recognition dataset EMOSet-118K. Experimental results demonstrate that our method significantly improves the reasoning ability of 1B SVLM. Compared to the baseline model fine-tuned on the original data, accuracy increased by 2.77 and recall by 0.69, achieving performance comparable to that of 8B models.
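Stages 2 and 3 rely on GRPO, which scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows the standard group-normalized advantage and a simple combined reward; the reward weights `w_format` and `w_acc` are hypothetical, and the ClipLow modification of Stage 4 is not reproduced because the abstract does not specify it.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def total_reward(format_ok, answer_correct, w_format=0.5, w_acc=1.0):
    """Combine a format reward and an accuracy reward (weights are assumptions)."""
    return w_format * float(format_ok) + w_acc * float(answer_correct)
```

Setting `w_acc=0` corresponds to training with format rewards only, as in Stage 2, while non-zero values of both weights correspond to the format-plus-accuracy setting of Stage 3.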

[67] Deep Supervised LSTM for 3D morphology estimation from Multi-View RGB Images of Wheat Spikes cs.CVPDF

Olivia Zumsteg, Nico Graf, Aaron Haeusler, Norbert Kirchgessner, Nicola Storni

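TL;DR: This paper proposes a deep learning model that combines DINOv2, a self-supervised Vision Transformer, with a unidirectional LSTM to estimate the 3D morphology of wheat spikes from multi-view RGB images, outperforming conventional geometric methods in both indoor and field settings.

Details

Motivation: Estimating 3D morphology from 2D RGB images faces challenges such as loss of depth information, projection distortions, and occlusions, especially for complex geometries such as wheat spikes. The paper aims to address these with deep learning.

Result: The MAPE on six-view indoor images is 6.46%, better than the area-based projection (9.36%) and geometric reconstruction (13.98%) baselines; after fine-tuning on single-view field images the MAPE is 10.82%.

Insight: Objects with complex geometry, such as wheat spikes, are harder for conventional geometric methods, while deep learning models capture their morphology better.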

Abstract: Estimating three-dimensional morphological traits from two-dimensional RGB images presents inherent challenges due to the loss of depth information, projection distortions, and occlusions under field conditions. In this work, we explore multiple approaches for non-destructive volume estimation of wheat spikes, using RGB image sequences and structured-light 3D scans as ground truth references. Due to the complex geometry of the spikes, we propose a neural network approach for volume estimation in 2D images, employing a transfer learning pipeline that combines DINOv2, a self-supervised Vision Transformer, with a unidirectional Long Short-Term Memory (LSTM) network. By using deep supervision, the model is able to learn more robust intermediate representations, which enhances its generalisation ability across varying evaluation sequences. We benchmark our model against two conventional baselines: a 2D area-based projection and a geometric reconstruction using axis-aligned cross-sections. Our deep supervised model achieves a mean absolute percentage error (MAPE) of 6.46% on six-view indoor images, outperforming the area (9.36%) and geometric (13.98%) baselines. Fine-tuning the model on field-based single-image data enables domain adaptation, yielding a MAPE of 10.82%. We demonstrate that object shape significantly impacts volume prediction accuracy, with irregular geometries such as wheat spikes posing greater challenges for geometric methods compared to our deep learning approach.
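The headline metric, mean absolute percentage error (MAPE), averages the absolute relative error over all samples and expresses it as a percentage. A minimal sketch follows; the volumes in the usage note are made-up numbers, not data from the paper.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; y_true must be non-zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(100.0 * np.mean(np.abs((y_pred - y_true) / y_true)))
```

For example, predicting 110 and 180 for reference volumes of 100 and 200 gives relative errors of 10% each, so the MAPE is 10%.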

[68] Training-free Test-time Improvement for Explainable Medical Image Classification cs.CVPDF

Hangzhou He, Jiachen Tang, Lei Zhu, Kaiwen Li, Yanye Lu

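TL;DR: The paper proposes a training-free test-time improvement method that boosts the performance of explainable medical image classification models in new environments, correcting confused concepts using minimal new data (only 4 images per class).

Details

Motivation: When medical image classification models are deployed to new environments, changes in imaging protocols and staining methods can cause concept-level shifts. In addition, because concept bottleneck model (CBM) training requires explicit concept annotations, fine-tuning with image-level labels alone can reduce the accuracy and faithfulness of concept predictions.

Result: The method is validated on skin and white blood cell image data and improves classification performance in new environments.

Insight: Correcting concept predictions without training offers an efficient, low-cost answer to the high cost of concept annotation in the medical domain.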

Abstract: Deep learning-based medical image classification techniques are rapidly advancing in medical image analysis, making it crucial to develop accurate and trustworthy models that can be efficiently deployed across diverse clinical scenarios. Concept Bottleneck Models (CBMs), which first predict a set of explainable concepts from images and then perform classification based on these concepts, are increasingly being adopted for explainable medical image classification. However, the inherent explainability of CBMs introduces new challenges when deploying trained models to new environments. Variations in imaging protocols and staining methods may induce concept-level shifts, such as alterations in color distribution and scale. Furthermore, since CBM training requires explicit concept annotations, fine-tuning models solely with image-level labels could compromise concept prediction accuracy and faithfulness - a critical limitation given the high cost of acquiring expert-annotated concept labels in medical domains. To address these challenges, we propose a training-free confusion concept identification strategy. By leveraging minimal new data (e.g., 4 images per class) with only image-level labels, our approach enhances out-of-domain performance without sacrificing source domain accuracy through two key operations: masking misactivated confounding concepts and amplifying under-activated discriminative concepts. The efficacy of our method is validated on both skin and white blood cell images. Our code is available at: https://github.com/riverback/TF-TTI-XMed.
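The two operations named in the abstract, masking misactivated concepts and amplifying under-activated ones, can be pictured as an edit to the concept vector before the final classifier of a concept bottleneck model. The sketch below illustrates that idea only: how the concept indices are identified, the `gain` value, and the linear head `W`, `b` are all assumptions, not details taken from the paper.

```python
import numpy as np

def adjust_concepts(concepts, mask_idx, boost_idx, gain=2.0):
    """Zero out selected concept activations and scale up others.

    concepts:  (..., C) concept activations from the bottleneck.
    mask_idx:  indices of concepts to suppress (hypothetical selection).
    boost_idx: indices of concepts to amplify by `gain`.
    """
    c = np.array(concepts, dtype=float)  # copy so the input is untouched
    c[..., np.asarray(mask_idx, dtype=int)] = 0.0
    c[..., np.asarray(boost_idx, dtype=int)] *= gain
    return c

def classify(concepts, W, b):
    """Linear classification head on top of the concept activations."""
    return np.argmax(concepts @ W + b, axis=-1)
```

Since only the concept vector is edited and no weights are updated, an adjustment of this kind requires no training.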

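[69] MUPA: Towards Multi-Path Agentic Reasoning for Grounded Video Question Answering cs.CV | cs.AI

Jisheng Dang, Huilin Song, Junbin Xiao, Bimei Wang, Han Peng

TL;DR: MUPA proposes a multi-path agentic reasoning approach in which cooperating agents improve both reasoning and alignment with visual evidence in video question answering, outperforming existing models.

Details

Motivation: Existing multimodal models rely on linguistic priors and spurious correlations in video question answering, so their predictions lack supporting visual evidence.

Result: At 2B parameters it outperforms 7B-scale competitors; at 7B it reaches state-of-the-art results on NExT-GQA and DeVE-QA.

Insight: Multi-path agent collaboration markedly improves grounding fidelity while maintaining answer accuracy.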

Abstract: Grounded Video Question Answering (Grounded VideoQA) requires aligning textual answers with explicit visual evidence. However, modern multimodal models often rely on linguistic priors and spurious correlations, resulting in poorly grounded predictions. In this work, we propose MUPA, a cooperative MUlti-Path Agentic approach that unifies video grounding, question answering, answer reflection and aggregation to tackle Grounded VideoQA. MUPA features three distinct reasoning paths on the interplay of grounding and QA agents in different chronological orders, along with a dedicated reflection agent to judge and aggregate the multi-path results to accomplish consistent QA and grounding. This design markedly improves grounding fidelity without sacrificing answer accuracy. Despite using only 2B parameters, our method outperforms all 7B-scale competitors. When scaled to 7B parameters, MUPA establishes new state-of-the-art results, with Acc@GQA of 30.3% and 47.4% on NExT-GQA and DeVE-QA respectively, demonstrating MUPA’s effectiveness towards trustworthy video-language understanding. Our code is available at https://github.com/longmalongma/MUPA.

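[70] TEM^3-Learning: Time-Efficient Multimodal Multi-Task Learning for Advanced Assistive Driving cs.CV

Wenzhuo Liu, Yicheng Qiao, Zhen Wang, Qiannan Guo, Zilong Chen

TL;DR: The paper proposes the TEM^3-Learning framework, which combines multimodal and multi-task learning to handle several assistive-driving tasks efficiently, achieving high accuracy in real time.

Details

Motivation: Existing multi-task learning methods are limited to a single modality and use inefficient architectures, which hinders comprehensive scene understanding and real-time deployment.

Result: On the AIDE dataset, the model achieves state-of-the-art accuracy on all four tasks with fewer than 6 million parameters and an inference speed of 142.32 FPS.

Insight: Combining multimodal and multi-task learning can markedly improve assistive driving systems, and efficient architecture design is essential for real-time deployment.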

Abstract: Multi-task learning (MTL) can advance assistive driving by exploring inter-task correlations through shared representations. However, existing methods face two critical limitations: single-modality constraints limiting comprehensive scene understanding and inefficient architectures impeding real-time deployment. This paper proposes TEM^3-Learning (Time-Efficient Multimodal Multi-task Learning), a novel framework that jointly optimizes driver emotion recognition, driver behavior recognition, traffic context recognition, and vehicle behavior recognition through a two-stage architecture. The first component, the mamba-based multi-view temporal-spatial feature extraction subnetwork (MTS-Mamba), introduces a forward-backward temporal scanning mechanism and global-local spatial attention to efficiently extract low-cost temporal-spatial features from multi-view sequential images. The second component, the MTL-based gated multimodal feature integrator (MGMI), employs task-specific multi-gating modules to adaptively highlight the most relevant modality features for each task, effectively alleviating the negative transfer problem in MTL. Evaluated on the AIDE dataset, our proposed model achieves state-of-the-art accuracy across all four tasks, maintaining a lightweight architecture with fewer than 6 million parameters and delivering an impressive 142.32 FPS inference speed. Rigorous ablation studies further validate the effectiveness of the proposed framework and the independent contributions of each module. The code is available at https://github.com/Wenzhuo-Liu/TEM3-Learning.
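
The idea of task-specific gating over modality features can be illustrated with a generic softmax gate. This sketch is an assumption about the general pattern, not the MGMI module itself: the gate logits are passed in by hand here, whereas a real model would predict them from the inputs, and the names `softmax` and `gated_fusion` are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D array."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - np.max(x))
    return e / e.sum()

def gated_fusion(modality_feats, gate_logits):
    """Fuse modality features with one task's softmax gate (illustrative only).

    modality_feats: array of shape (num_modalities, feature_dim)
    gate_logits:    array of shape (num_modalities,), one gate per task
    """
    weights = softmax(gate_logits)
    feats = np.asarray(modality_feats, dtype=float)
    return weights @ feats, weights

# Two modalities with 2-dimensional features; each task has its own gate logits.
feats = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
fused_a, weights_a = gated_fusion(feats, gate_logits=[0.0, 0.0])     # equal weighting
fused_b, weights_b = gated_fusion(feats, gate_logits=[10.0, -10.0])  # favors modality 0
```

Giving each task its own gate lets one task lean on a modality that another task largely ignores, which is one common way to reduce negative transfer between tasks.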

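[71] ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation cs.CV | cs.AI | cs.LG

Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji

TL;DR: ShareGPT-4o-Image is a multimodal dataset of images generated with GPT-4o, used to train the Janus-4o model. Janus-4o performs well on image generation and supports both text-to-image and text-and-image-to-image generation.

Details

Motivation: Today's leading multimodal generative models (such as GPT-4o-Image) are mostly closed-source, which limits open research. The paper aims to advance multimodal image generation research through an open dataset and model.

Result: Janus-4o outperforms its predecessor Janus-Pro on text-to-image generation and newly supports text-and-image-to-image generation, reaching strong performance with only 91K samples and 6 hours of training.

Insight: 1. Synthetic data can effectively improve model performance. 2. Multimodal generation can achieve strong results with little data and efficient training. 3. Open datasets and models help advance research in the field.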

Abstract: Recent advances in multimodal generative models have unlocked photorealistic, instruction-aligned image generation, yet leading systems like GPT-4o-Image remain proprietary and inaccessible. To democratize these capabilities, we present ShareGPT-4o-Image, the first dataset comprising 45K text-to-image and 46K text-and-image-to-image data, all synthesized using GPT-4o’s image generation capabilities for distilling its advanced image generation abilities. Leveraging this dataset, we develop Janus-4o, a multimodal large language model capable of both text-to-image and text-and-image-to-image generation. Janus-4o not only significantly improves text-to-image generation over its predecessor, Janus-Pro, but also newly supports text-and-image-to-image generation. Notably, it achieves impressive performance in text-and-image-to-image generation from scratch, using only 91K synthetic samples and 6 hours of training on an 8 A800-GPU machine. We hope the release of ShareGPT-4o-Image and Janus-4o will foster open research in photorealistic, instruction-aligned image generation.

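[72] Enhancing VICReg: Random-Walk Pairing for Improved Generalization and Better Global Semantics Capturing cs.CV | cs.LG

Idan Simai, Ronen Talmon, Uri Shaham

TL;DR: The paper proposes SAG-VICReg, which enhances VICReg with random-walk pairing to improve generalization and the capture of global semantics.

Details

Motivation: The paper examines the generalization problem caused by VICReg's overreliance on training data in self-supervised learning and proposes improvements that strengthen its capture of global semantics.

Result: SAG-VICReg is superior on global semantic understanding metrics and competitive on local evaluation metrics, matching or surpassing existing self-supervised learning baselines.

Insight: Analyzing VICReg's limitations through the lens of spectral embedding offers a new solution to the generalization problem in self-supervised learning.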

Abstract: In this paper, we argue that viewing VICReg, a popular self-supervised learning (SSL) method, through the lens of spectral embedding reveals a potential source of sub-optimality: it may struggle to generalize robustly to unseen data due to overreliance on the training data. This observation invites a closer look at how well this method achieves its goal of producing meaningful representations of images outside of the training set as well. Here, we investigate this issue and introduce SAG-VICReg (Stable and Generalizable VICReg), a method that builds on VICReg by incorporating new training techniques. These enhancements improve the model’s ability to capture global semantics within the data and strengthen the generalization capabilities. Experiments demonstrate that SAG-VICReg effectively addresses the generalization challenge while matching or surpassing diverse state-of-the-art SSL baselines. Notably, our method exhibits superior performance on metrics designed to evaluate global semantic understanding, while simultaneously maintaining competitive results on local evaluation metrics. Furthermore, we propose a new standalone evaluation metric for embeddings that complements the standard evaluation methods and accounts for the global data structure without requiring labels, a key issue when tagged data is scarce or not available.

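[73] See-in-Pairs: Reference Image-Guided Comparative Vision-Language Models for Medical Diagnosis cs.CV

Ruinan Jin, Gexin Huang, Xinwei Shen, Qiong Zhang, Yan Shuo Tan

TL;DR: See-in-Pairs proposes reference image-guided comparative vision-language models for medical diagnosis; combining medical domain knowledge with multi-image comparative reasoning significantly improves diagnostic accuracy.

Details

Motivation: In medical imaging diagnosis, diseases can mimic normal anatomy and vary widely between patients. Existing medical vision-language models lack mechanisms for comparative analysis, while general-purpose vision-language models can compare multiple images but lack medical domain knowledge.

Result: Multi-image comparative analysis significantly outperforms single-image baselines, and diagnostic accuracy improves most after supervised fine-tuning.

Insight: Comparative reasoning is essential in medical diagnosis; combining domain knowledge with reference images helps models identify subtle abnormalities.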

Abstract: Medical imaging diagnosis presents inherent challenges due to diseases that mimic normal anatomy and exhibit significant inter-patient variability. Clinicians routinely employ comparative reasoning, using reference images from healthy controls or previous patient examinations, to discern subtle yet diagnostically critical abnormalities. However, existing medical vision-language models (VLMs) focus primarily on single-image or single-series analyses and lack explicit mechanisms for comparative reasoning. Conversely, general-purpose VLMs demonstrate strong multi-image comparative reasoning capabilities but lack essential medical-domain knowledge to identify nuanced clinical differences. This work aims to bridge this gap by exploring clinically-inspired comparative analysis within VLMs, leveraging reference images to enhance diagnostic accuracy. Through extensive empirical analysis, we show that providing general-purpose VLMs with query and normative matched reference images, accompanied by clinically-informed comparative prompts, significantly improves diagnostic outcomes compared to single-image baselines, especially after supervised finetuning (SFT). Our contributions highlight the clinical relevance of comparative analysis, introduce novel strategies for leveraging reference images in VLMs, empirically demonstrate enhanced performance across multiple medical visual question answering (VQA) tasks, and provide theoretical insights into the efficacy of comparative image analysis in medical diagnosis.

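[74] Pattern-Based Phase-Separation of Tracer and Dispersed Phase Particles in Two-Phase Defocusing Particle Tracking Velocimetry cs.CV | physics.app-ph | physics.flu-dyn

Christian Sax, Jochen Kriegseis

TL;DR: The paper proposes a convolutional neural network (CNN)-based post-processing method that distinguishes tracer particles from dispersed-phase particles (such as bubbles or droplets) in two-phase defocusing particle tracking velocimetry. Labeled datasets are generated with a generative adversarial network, and high-precision detection and classification are validated across multiple datasets.

Details

Motivation: Traditional wavelength-, size-, or correlation-based methods struggle to distinguish tracer particles from dispersed-phase particles in two-phase defocusing particle tracking velocimetry. To address this, the paper proposes automatic classification based on differences in image patterns.

Result: Across six synthetic and real datasets, the method achieves 95-100% detection precision and classification accuracy, demonstrating robustness under domain shift.

Insight: CNN-based pattern recognition can effectively replace traditional methods, offering a new automated solution for two-phase flow velocimetry.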

Abstract: This work investigates the feasibility of a post-processing-based approach for phase separation in defocusing particle tracking velocimetry for dispersed two-phase flows. The method enables the simultaneous 3D localization determination of both tracer particles and particles of the dispersed phase, using a single-camera setup. The distinction between phases is based on pattern differences in defocused particle images, which arise from distinct light scattering behaviors of tracer particles and bubbles or droplets. Convolutional neural networks, including Faster R-CNN and YOLOv4 variants, are trained to detect and classify particle images based on these pattern features. To generate large, labeled training datasets, a generative adversarial network based framework is introduced, allowing the generation of auto-labeled data that more closely reflects experiment-specific visual appearance. Evaluation across six datasets, comprising synthetic two-phase and real single- and two-phase flows, demonstrates high detection precision and classification accuracy (95-100%), even under domain shifts. The results confirm the viability of using CNNs for robust phase separation in disperse two-phase DPTV, particularly in scenarios where traditional wavelength-, size-, or ensemble correlation-based methods are impractical.

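[75] CDG-MAE: Learning Correspondences from Diffusion Generated Views cs.CV

Varun Belagali, Pierre Marza, Srikar Yellapragada, Zilinghan Li, Tarak Nath Nandi

TL;DR: CDG-MAE is an MAE-based self-supervised method that uses a diffusion model to generate diverse synthetic views from static images, addressing the shortage of training data in traditional methods and significantly improving dense correspondence learning.

Details

Motivation: Dense correspondence learning needs large amounts of annotated data, but manual annotation is time-consuming and hard to scale. Traditional self-supervised methods rely on limited video data or simple image crops, which lack sufficient viewpoint variation.

Result: CDG-MAE significantly outperforms image-only MAE methods on dense correspondence tasks and approaches the performance of video-based methods.

Insight: Synthetic data generated by diffusion models can effectively relieve the shortage of training data and opens a new avenue for self-supervised learning.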

Abstract: Learning dense correspondences, critical for applications such as video label propagation, is hindered by tedious and unscalable manual annotation. Self-supervised methods address this by using a cross-view pretext task, often modeled with a masked autoencoder, where a masked target view is reconstructed from an anchor view. However, acquiring effective training data remains a challenge: collecting diverse video datasets is difficult and costly, while simple image crops lack necessary pose variations. This paper introduces CDG-MAE, a novel MAE-based self-supervised method that uses diverse synthetic views generated from static images via an image-conditioned diffusion model. These generated views exhibit substantial changes in pose and perspective, providing a rich training signal that overcomes the limitations of video and crop-based anchors. We present a quantitative method to evaluate local and global consistency of generated images, discussing their use for cross-view self-supervised pretraining. Furthermore, we enhance the standard single-anchor MAE setting to a multi-anchor strategy to effectively modulate the difficulty of the pretext task. CDG-MAE significantly outperforms state-of-the-art MAE methods reliant only on images and substantially narrows the performance gap to video-based approaches.

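[76] Multimodal Fusion SLAM with Fourier Attention cs.CV | cs.AI

Youjie Zhou, Guofeng Mei, Yiming Wang, Yi Wan, Fabio Poiesi

TL;DR: The paper proposes FMF-SLAM, an efficient multimodal fusion SLAM method that uses the fast Fourier transform (FFT) and novel Fourier-based self-attention and cross-attention mechanisms to improve efficiency, and performs well under noisy, varying-light, and dark conditions.

Details

Motivation: Traditional optical flow-based visual SLAM methods perform poorly under noise, lighting changes, and darkness, and consume substantial computational resources, so a more efficient solution is needed.

Result: Validated on TUM, TartanAir, and self-collected data, FMF-SLAM performs strongly under noisy, varying-light, and dark conditions.

Insight: Fusing the Fourier transform with attention mechanisms opens a new direction for efficiency optimization in multimodal SLAM and demonstrates the feasibility of hardware integration.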

Abstract: Visual SLAM is particularly challenging in environments affected by noise, varying lighting conditions, and darkness. Learning-based optical flow algorithms can leverage multiple modalities to address these challenges, but traditional optical flow-based visual SLAM approaches often require significant computational resources. To overcome this limitation, we propose FMF-SLAM, an efficient multimodal fusion SLAM method that utilizes fast Fourier transform (FFT) to enhance the algorithm efficiency. Specifically, we introduce a novel Fourier-based self-attention and cross-attention mechanism to extract features from RGB and depth signals. We further enhance the interaction of multimodal features by incorporating multi-scale knowledge distillation across modalities. We also demonstrate the practical feasibility of FMF-SLAM in real-world scenarios with real-time performance by integrating it into a security robot and fusing it with a GNSS-RTK global positioning module and global Bundle Adjustment. Our approach is validated using video sequences from TUM, TartanAir, and our real-world datasets, showcasing state-of-the-art performance under noisy, varying lighting, and dark conditions. Our code and datasets are available at https://github.com/youjie-zhou/FMF-SLAM.git.

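[77] Limitations of NERF with pre-trained Vision Features for Few-Shot 3D Reconstruction cs.CV

Ankit Sanjyal

TL;DR: The paper systematically evaluates DINO-enhanced NeRF models on extreme few-shot 3D reconstruction and finds that all DINO variants underperform the baseline NeRF and may even introduce harmful biases, revealing the limits of pre-trained vision features in few-shot settings.

Details

Motivation: To explore whether pre-trained vision features (such as DINO) can improve NeRF in few-shot 3D reconstruction, especially in extreme few-shot cases.

Result: The DINO variants reach PSNR values (12.9-13.0) well below the baseline NeRF (14.71), indicating that pre-trained features may be harmful.

Insight: In few-shot 3D reconstruction, pre-trained vision features may be unsuitable; simpler architectures that focus on geometric consistency may be more effective.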

Abstract: Neural Radiance Fields (NeRF) have revolutionized 3D scene reconstruction from sparse image collections. Recent work has explored integrating pre-trained vision features, particularly from DINO, to enhance few-shot reconstruction capabilities. However, the effectiveness of such approaches remains unclear, especially in extreme few-shot scenarios. In this paper, we present a systematic evaluation of DINO-enhanced NeRF models, comparing baseline NeRF, frozen DINO features, LoRA fine-tuned features, and multi-scale feature fusion. Surprisingly, our experiments reveal that all DINO variants perform worse than the baseline NeRF, achieving PSNR values around 12.9 to 13.0 compared to the baseline’s 14.71. This counterintuitive result suggests that pre-trained vision features may not be beneficial for few-shot 3D reconstruction and may even introduce harmful biases. We analyze potential causes including feature-task mismatch, overfitting to limited data, and integration challenges. Our findings challenge common assumptions in the field and suggest that simpler architectures focusing on geometric consistency may be more effective for few-shot scenarios.
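
To put the reported PSNR gap in perspective, the standard definition PSNR = 10 log10(MAX^2 / MSE) can be inverted to compare mean squared errors. The sketch below assumes pixel values normalized to [0, 1] (so MAX = 1), which the abstract does not state; the helper names `psnr` and `mse_from_psnr` are hypothetical.

```python
import numpy as np

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    diff = np.asarray(reference, dtype=float) - np.asarray(rendered, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

def mse_from_psnr(psnr_db, max_value=1.0):
    """Invert the PSNR definition to recover the mean squared error."""
    return max_value ** 2 / (10.0 ** (psnr_db / 10.0))

# Sanity check on a toy image: a constant error of 0.1 gives MSE = 0.01.
toy_psnr = psnr(np.zeros(4), np.full(4, 0.1))

# PSNR values quoted in the abstract (baseline NeRF vs. the better DINO variants).
baseline_mse = mse_from_psnr(14.71)
dino_mse = mse_from_psnr(13.0)
mse_ratio = dino_mse / baseline_mse  # how much larger the DINO variants' error is
```

Because the ratio depends only on the dB difference, the choice of `max_value` cancels out: a gap of about 1.7 dB corresponds to a mean squared error that is larger by a factor of roughly 1.5.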

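[78] Cross-Architecture Knowledge Distillation (KD) for Retinal Fundus Image Anomaly Detection on NVIDIA Jetson Nano cs.CV | cs.AI | cs.LG | 68T07 | I.2.6; I.5.1; J.3

Berk Yilmaz, Aniruddh Aiyengar

TL;DR: This paper proposes a cross-architecture knowledge distillation (KD) method for retinal fundus image anomaly detection, deploying a lightweight CNN student model on edge devices such as the NVIDIA Jetson Nano for efficient disease classification.

Details

Motivation: Resource-constrained settings need a lightweight, efficient solution for diagnosing retinal disease. Cross-architecture knowledge distillation can compress a model while maintaining high accuracy.

Result: The student model has 97.4% fewer parameters than the teacher, yet reaches 89% classification accuracy and retains about 93% of the teacher's performance.

Insight: Cross-architecture knowledge distillation markedly reduces model complexity while keeping accuracy high, making it suitable for edge deployment and a viable option for medical diagnosis in resource-limited regions.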

Abstract: Early and accurate identification of retinal ailments is crucial for averting ocular decline; however, access to dependable diagnostic devices is not often available in low-resourced settings. This project proposes to solve that by developing a lightweight, edge-device deployable disease classifier using cross-architecture knowledge distillation. We first train a high-capacity vision transformer (ViT) teacher model, pre-trained using I-JEPA self-supervised learning, to classify fundus images into four classes: Normal, Diabetic Retinopathy, Glaucoma, and Cataract. We kept an Internet of Things (IoT) focus when compressing to a CNN-based student model for deployment in resource-limited conditions, such as the NVIDIA Jetson Nano. This was accomplished using a novel framework which included a Partitioned Cross-Attention (PCA) projector, a Group-Wise Linear (GL) projector, and a multi-view robust training method. The teacher model has 97.4 percent more parameters than the student model, with the student achieving 89 percent classification accuracy, roughly 93 percent of the teacher model’s diagnostic performance. The retention of clinical classification behavior supports our method’s initial aim: compression of the ViT while retaining accuracy. Our work serves as an example of a scalable, AI-driven triage solution for retinal disorders in under-resourced areas.

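[79] Make It Efficient: Dynamic Sparse Attention for Autoregressive Image Generation cs.CV | cs.AI

Xunzhi Xiang, Qi Fan

TL;DR: The paper proposes a dynamic sparse attention mechanism (ADSA) for autoregressive image generation that significantly reduces KV-cache memory overhead and computational latency.

Details

Motivation: Autoregressive image generation models face memory and compute inefficiency with long contexts, where the KV cache causes significant memory overhead and latency. The paper aims to optimize context management with dynamic sparse attention.

Result: Experiments show that ADSA outperforms existing methods in both generation quality and resource efficiency.

Insight: Dynamic sparse attention can efficiently balance global semantics against local texture, markedly improving the efficiency of autoregressive image generation.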

Abstract: Autoregressive conditional image generation models have emerged as a dominant paradigm in text-to-image synthesis. These methods typically convert images into one-dimensional token sequences and leverage the self-attention mechanism, which has achieved remarkable success in natural language processing, to capture long-range dependencies, model global context, and ensure semantic coherence. However, excessively long contexts during inference lead to significant memory overhead caused by KV-cache and computational delays. To alleviate these challenges, we systematically analyze how global semantics, spatial layouts, and fine-grained textures are formed during inference, and propose a novel training-free context optimization method called Adaptive Dynamic Sparse Attention (ADSA). Conceptually, ADSA dynamically identifies historical tokens crucial for maintaining local texture consistency and those essential for ensuring global semantic coherence, thereby efficiently streamlining attention computation. Additionally, we introduce a dynamic KV-cache update mechanism tailored for ADSA, reducing GPU memory consumption during inference by approximately $50\%$. Extensive qualitative and quantitative experiments demonstrate the effectiveness and superiority of our approach in terms of both generation quality and resource efficiency.
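
The general idea of attending to a sparse subset of cached tokens can be sketched as below. The abstract does not specify how ADSA selects tokens, so the rule used here (a recent local window plus the top-k highest-scoring older tokens) is an assumption made purely for illustration, and `select_tokens` and `sparse_attention` are hypothetical names.

```python
import numpy as np

def select_tokens(scores, window, top_k):
    """Pick cached token indices to keep: a recent window plus top-k older tokens.

    The selection rule is an illustrative assumption, not the ADSA criterion.
    """
    n = len(scores)
    local = set(range(max(0, n - window), n))             # most recent tokens
    older = [i for i in range(n) if i not in local]
    ranked = sorted(older, key=lambda i: scores[i], reverse=True)
    return sorted(local | set(ranked[:top_k]))

def sparse_attention(query, keys, values, window=4, top_k=2):
    """Single-query attention restricted to the selected subset of the cache."""
    scores = keys @ query / np.sqrt(query.shape[-1])
    keep = select_tokens(scores, window, top_k)
    kept_scores = scores[keep]
    weights = np.exp(kept_scores - kept_scores.max())
    weights /= weights.sum()
    return weights @ values[keep], keep

# Toy cache of 8 tokens with hand-picked relevance scores.
toy_scores = np.array([5.0, 0.0, 1.0, 4.0, 0.0, 0.0, 0.0, 0.0])
kept = select_tokens(toy_scores, window=4, top_k=2)

rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 4))
values = rng.normal(size=(8, 3))
query = rng.normal(size=4)
output, used = sparse_attention(query, keys, values, window=4, top_k=2)
```

Entries of the KV cache that are never selected could in principle be evicted, which is where memory savings of the kind reported in the abstract would come from; the eviction policy itself is not modeled in this sketch.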

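[80] Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning cs.CV | cs.RO

Yue Li, Meng Tian, Dechang Zhu, Jiangtong Zhu, Zhenyu Lin

TL;DR: Drive-R1 is a vision-language model (VLM) designed for autonomous driving. By combining supervised fine-tuning with reinforcement learning, it addresses existing VLMs' overreliance on history inputs in planning tasks and the mismatch between their reasoning and planning outcomes.

Details

Motivation: Existing VLMs have two main problems in autonomous driving planning: they rely too heavily on history inputs while neglecting visual understanding, and their reasoning process is disconnected from the planning outcome. Drive-R1 aims to solve both.

Result: On the nuScenes and DriveLM-nuScenes benchmarks, Drive-R1 significantly outperforms existing VLMs, validating the approach.

Insight: Drive-R1 shows the potential of bridging reasoning and planning and offers a methodological reference for future autonomous driving research.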

Abstract: Large vision-language models (VLMs) for autonomous driving (AD) are evolving beyond perception and cognition tasks toward motion planning. However, we identify two critical challenges in this direction: (1) VLMs tend to learn shortcuts by relying heavily on history input information, achieving seemingly strong planning results without genuinely understanding the visual inputs; and (2) the chain-of-thought (COT) reasoning processes are always misaligned with the motion planning outcomes, and how to effectively leverage the complex reasoning capability to enhance planning remains largely underexplored. In this paper, we start from a small-scale domain-specific VLM and propose Drive-R1, designed to bridge scenario reasoning and motion planning for AD. Drive-R1 first undergoes supervised finetuning on an elaborate dataset containing both long and short COT data. Drive-R1 is encouraged to reason step-by-step from visual input to final planning decisions. Subsequently, Drive-R1 is trained within a reinforcement learning framework that incentivizes the discovery of reasoning paths that are more informative for planning, guided by rewards based on predicted trajectories and meta actions. Experimental evaluations on the nuScenes and DriveLM-nuScenes benchmarks demonstrate that Drive-R1 achieves superior performance compared to existing state-of-the-art VLMs. We believe that Drive-R1 presents a promising direction for bridging reasoning and planning in AD, offering methodological insights for future research and applications.

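[81] Referring Expression Instance Retrieval and A Strong End-to-End Baseline cs.CV

Xiangzhao Hao, Kuan Zhu, Hongyu Guo, Haiyun Guo, Ming Tang

TL;DR: The paper proposes a new task, Referring Expression Instance Retrieval (REIR), which combines instance-level retrieval and localization, and introduces a large-scale benchmark, REIRCOCO, and a baseline method, CLARE.

Details

Motivation: Existing Text-Image Retrieval (TIR) lacks precision and Referring Expression Comprehension (REC) lacks scalability, while real-world applications need instance-level retrieval and localization together.

Result: CLARE achieves state-of-the-art performance on REIR and generalizes well to TIR and REC.

Insight: The paper shows the potential of combining instance-level retrieval with localization, offering a new approach for multimodal tasks in practical applications.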

Abstract: Natural language querying of visual content underpins many vision-language tasks, typically categorized by text granularity and visual search scope. Text-Image Retrieval (TIR) retrieves whole images using coarse descriptions, while Referring Expression Comprehension (REC) localizes objects using fine-grained expressions within a single image. However, real-world scenarios often require both instance-level retrieval and localization across large galleries – tasks where TIR lacks precision and REC lacks scalability. To address this gap, we propose a new task: Referring Expression Instance Retrieval (REIR), which jointly supports instance-level retrieval and localization. We introduce REIRCOCO, a large-scale benchmark constructed by prompting vision-language models to generate fine-grained expressions for MSCOCO and RefCOCO instances. We also present a baseline method, CLARE, featuring a dual-stream architecture with a Mix of Relation Experts (MORE) module for capturing inter-instance relationships. CLARE integrates object detection and REC pretraining with Contrastive Language-Instance Alignment (CLIA) for end-to-end optimization. Experiments show that CLARE achieves state-of-the-art performance on REIR and generalizes well to TIR and REC, highlighting its effectiveness and versatility.

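[82] Improving Weakly Supervised Temporal Action Localization by Exploiting Multi-resolution Information in Temporal Domain cs.CV

Rui Su, Dong Xu, Luping Zhou, Wanli Ouyang

TL;DR: The paper proposes a two-stage method that exploits multi-resolution information in the temporal domain to generate high-quality frame-level pseudo labels for weakly supervised temporal action localization.

Details

Motivation: Weakly supervised temporal action localization relies only on video-level annotations, and the lack of frame-level annotations makes the task very challenging. The paper aims to improve pseudo-label quality with multi-resolution information and thereby improve localization performance.

Result: Multi-resolution information exchange and pseudo-label refinement significantly improve weakly supervised temporal action localization performance.

Insight: Multi-resolution information not only helps generate high-quality pseudo labels but also further improves the model through alternating refinement between streams.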

Abstract: Weakly supervised temporal action localization is a challenging task as only the video-level annotation is available during the training process. To address this problem, we propose a two-stage approach to fully exploit multi-resolution information in the temporal domain and generate high quality frame-level pseudo labels based on both appearance and motion streams. Specifically, in the first stage, we generate reliable initial frame-level pseudo labels, and in the second stage, we iteratively refine the pseudo labels and use a set of selected frames with highly confident pseudo labels to train neural networks and better predict action class scores at each frame. We fully exploit temporal information at multiple scales to improve temporal action localization performance. Specifically, in order to obtain reliable initial frame-level pseudo labels, in the first stage, we propose an Initial Label Generation (ILG) module, which leverages temporal multi-resolution consistency to generate high quality class activation sequences (CASs), which consist of a number of sequences with each sequence measuring how likely each video frame is to belong to one specific action class. In the second stage, we propose a Progressive Temporal Label Refinement (PTLR) framework. In our PTLR framework, two networks called Network-OTS and Network-RTS, which are respectively used to generate CASs for the original temporal scale and the reduced temporal scales, are used as two streams (i.e., the OTS stream and the RTS stream) to refine the pseudo labels in turn. In this way, the multi-resolution information in the temporal domain is exchanged at the pseudo label level, and our work can help improve each stream (i.e., the OTS/RTS stream) by exploiting the refined pseudo labels from another stream (i.e., the RTS/OTS stream).

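[83] YouTube-Occ: Learning Indoor 3D Semantic Occupancy Prediction from YouTube Videos cs.CV

Haoming Chen, Lichen Yuan, TianFang Sun, Jingyu Gong, Xin Tan

TL;DR: The paper proposes a method for learning indoor 3D semantic occupancy prediction from YouTube videos, without camera parameters or precise geometric relationships, enabling self-supervised training.

Details

Motivation: Traditional 3D semantic occupancy prediction requires precise geometric relationships and data annotation, but in complex indoor environments large-scale data collection and annotation are costly and impractical. The paper addresses this by using web videos (such as YouTube house tours) as the data source.

Result: Achieves state-of-the-art zero-shot performance on two benchmarks, NYUv2 and OccScanNet.

Insight: Web videos can serve as an unlabeled data source that effectively supports self-supervised learning for 3D vision tasks, and superpixel grouping is key to 2D-to-3D knowledge distillation.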

Abstract: 3D semantic occupancy prediction in the past was considered to require precise geometric relationships in order to enable effective training. However, in complex indoor environments, the large-scale and widespread collection of data, along with the necessity for fine-grained annotations, becomes impractical due to the complexity of data acquisition setups and privacy concerns. In this paper, we demonstrate that 3D spatially-accurate training can be achieved using only indoor Internet data, without the need for any pre-knowledge of intrinsic or extrinsic camera parameters. In our framework, we collect a web dataset, YouTube-Occ, which comprises house tour videos from YouTube, providing abundant real house scenes for 3D representation learning. Building on this web dataset, we establish a fully self-supervised model to leverage accessible 2D prior knowledge for reaching powerful 3D indoor perception. Specifically, we harness the advantages of the prosperous vision foundation models, distilling the 2D region-level knowledge into the occupancy network by grouping the similar pixels into superpixels. Experimental results show that our method achieves state-of-the-art zero-shot performance on two popular benchmarks (NYUv2 and OccScanNet).

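[84] ThermalLoc: A Vision Transformer-Based Approach for Robust Thermal Camera Relocalization in Large-Scale Environments cs.CV

Yu Liu, Yangtao Meng, Xianfei Pan, Jie Jiang, Changhao Chen

TL;DR: ThermalLoc is a Vision Transformer-based end-to-end deep learning method designed for thermal camera relocalization in large-scale environments. It combines EfficientNet and Transformers to extract local and global features and shows superior performance on a public dataset.

Details

Motivation: Traditional visual relocalization methods depend on visible-light images, whereas thermal cameras capture data through heat emission, a completely different mechanism, so existing methods do not apply directly. Thermal camera relocalization remains underexplored and needs dedicated methods.

Result: Tested on a public thermal dataset and the authors' own dataset, ThermalLoc outperforms AtLoc, MapNet, PoseNet, and RobustLoc in accuracy and robustness.

Insight: Thermal relocalization needs both local and global features; Transformers excel at capturing global context, offering a new approach for thermal imaging tasks.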

Abstract: Thermal cameras capture environmental data through heat emission, a fundamentally different mechanism compared to visible light cameras, which rely on pinhole imaging. As a result, traditional visual relocalization methods designed for visible light images are not directly applicable to thermal images. Despite significant advancements in deep learning for camera relocalization, approaches specifically tailored for thermal camera-based relocalization remain underexplored. To address this gap, we introduce ThermalLoc, a novel end-to-end deep learning method for thermal image relocalization. ThermalLoc effectively extracts both local and global features from thermal images by integrating EfficientNet with Transformers, and performs absolute pose regression using two MLP networks. We evaluated ThermalLoc on both the publicly available thermal-odometry dataset and our own dataset. The results demonstrate that ThermalLoc outperforms existing representative methods employed for thermal camera relocalization, including AtLoc, MapNet, PoseNet, and RobustLoc, achieving superior accuracy and robustness.

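[85] Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations? cs.CV | cs.LG

Yiwei Yang, Chung Peng Lee, Shangbin Feng, Dora Zhao, Bingbing Wen

TL;DR: The paper introduces the SpuriVerse benchmark for studying the robustness of large vision-language models (LVLMs) to spurious correlations in real-world visual question answering. Even state-of-the-art closed-source models perform poorly on it, but fine-tuning on synthetic data significantly improves performance.

Details

Motivation: The paper aims to explore how spurious correlations learned by LVLMs without explicit task supervision affect their generalization, and to fill the gap left by existing studies that lack real-world tasks.

Result: Models perform poorly on SpuriVerse (at best 37.1% accuracy), but fine-tuning raises accuracy to 78.40%.

Insight: 1. LVLMs may rely on spurious correlations in real-world tasks. 2. Fine-tuning on synthetic data helps models avoid "shortcut learning" and attend to the overall image context.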

Abstract: Finetuning can cause spurious correlations to arise between non-essential features and the target labels, but benchmarks to study these effects involve contrived settings and narrow tasks. In contrast, we consider spurious correlations in multi-modal Large Vision Language Models (LVLMs) pretrained on extensive and diverse datasets without explicit task supervision. We develop a benchmark by sourcing GPT-4o errors on real-world visual-question-answering (VQA) benchmarks, then curating a subset through LVLM-human annotation and synthetic counterfactual evaluation to identify errors caused by spurious correlations. This process yields SpuriVerse, a novel benchmark comprised of 124 distinct types of spurious correlations extracted from real-world datasets, each containing 1 realistic and 10 synthetic VQA samples for a total of 1364 multiple choice questions. We evaluate 15 open and closed-source LVLMs on SpuriVerse, finding that even state-of-the-art closed-source models struggle significantly, achieving at best only 37.1% accuracy. Fine-tuning on synthetic examples that emphasize the spurious correlation improves performance to 78.40%, suggesting that training on diverse spurious patterns generalizes to unseen situations: models appear to learn to avoid “shortcuts” and attend to the overall image context.
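
The benchmark size quoted in the abstract follows directly from its composition, as this small check shows. The variable names are illustrative; the three input numbers are the ones stated in the abstract.

```python
# Composition of SpuriVerse as described in the abstract.
spurious_types = 124      # distinct types of spurious correlations
realistic_per_type = 1    # realistic VQA samples per type
synthetic_per_type = 10   # synthetic VQA samples per type

samples_per_type = realistic_per_type + synthetic_per_type
total_questions = spurious_types * samples_per_type
synthetic_share = synthetic_per_type / samples_per_type  # fraction that is synthetic
```

So 124 types with 11 samples each give the 1364 multiple choice questions, of which ten in every eleven are synthetic.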

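[86] A Multi-Scale Spatial Attention-Based Zero-Shot Learning Framework for Low-Light Image Enhancement cs.CV | cs.AI

Muhammad Azeem Aslam, Hassan Khalid, Nisar Ahmed

TL;DR: The paper proposes LucentVisionNet, a zero-shot learning framework based on multi-scale spatial attention for low-light image enhancement, which combines several novel components to significantly improve image quality and practicality.

Details

Motivation: Low-light image enhancement is challenging, especially without paired training data. The limitations of traditional and deep learning methods led the authors to propose a zero-shot learning framework that needs no paired training data.

Result: Experiments on multiple benchmark datasets show that LucentVisionNet outperforms existing methods in both quality and efficiency.

Insight: 1. Combining multi-scale spatial attention with a deep curve estimation network effectively improves image quality. 2. Zero-shot learning shows promise for low-light image enhancement. 3. A no-reference loss function helps model human visual perception.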

Abstract: Low-light image enhancement remains a challenging task, particularly in the absence of paired training data. In this study, we present LucentVisionNet, a novel zero-shot learning framework that addresses the limitations of traditional and deep learning-based enhancement methods. The proposed approach integrates multi-scale spatial attention with a deep curve estimation network, enabling fine-grained enhancement while preserving semantic and perceptual fidelity. To further improve generalization, we adopt a recurrent enhancement strategy and optimize the model using a composite loss function comprising six tailored components, including a novel no-reference image quality loss inspired by human visual perception. Extensive experiments on both paired and unpaired benchmark datasets demonstrate that LucentVisionNet consistently outperforms state-of-the-art supervised, unsupervised, and zero-shot methods across multiple full-reference and no-reference image quality metrics. Our framework achieves high visual quality, structural consistency, and computational efficiency, making it well-suited for deployment in real-world applications such as mobile photography, surveillance, and autonomous navigation.

[87] Sequential keypoint density estimator: an overlooked baseline of skeleton-based video anomaly detection cs.CVPDF

Anja Delić, Matej Grcić, Siniša Šegvić

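TL;DR: SeeKer is a skeleton-sequence anomaly detection method that models the joint distribution of keypoints through autoregressive factorization and achieves leading performance on several datasets.

Details

Motivation: Detecting anomalous behaviour is important in safety-critical applications, and anomalies often show up as unusual human poses. The paper proposes a skeleton-sequence method to address this.

Result: Surpasses all previous methods on the UBnormal and MSAD-HR datasets and is competitive on ShanghaiTech.

Insight: A simple but effective density estimator can substantially improve skeleton-based anomaly detection; keypoint confidence weights have an important effect on the results.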

Abstract: Detecting anomalous human behaviour is an important visual task in safety-critical applications such as healthcare monitoring, workplace safety, or public surveillance. In these contexts, abnormalities are often reflected with unusual human poses. Thus, we propose SeeKer, a method for detecting anomalies in sequences of human skeletons. Our method formulates the skeleton sequence density through autoregressive factorization at the keypoint level. The corresponding conditional distributions represent probable keypoint locations given prior skeletal motion. We formulate the joint distribution of the considered skeleton as causal prediction of conditional Gaussians across its constituent keypoints. A skeleton is flagged as anomalous if its keypoint locations surprise our model (i.e. receive a low density). In practice, our anomaly score is a weighted sum of per-keypoint log-conditionals, where the weights account for the confidence of the underlying keypoint detector. Despite its conceptual simplicity, SeeKer surpasses all previous methods on the UBnormal and MSAD-HR datasets while delivering competitive performance on the ShanghaiTech dataset.
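
The scoring rule described in this abstract (a confidence-weighted sum of per-keypoint log-conditionals) can be sketched as follows. This is only an illustration: the isotropic 2D Gaussian, the function names, and the convention that a higher score means "more anomalous" are assumptions made here, and the predicted means and standard deviations would come from the paper's autoregressive model, which is not reproduced.

```python
import math

def gaussian_log_density(point, mean, sigma):
    """Log-density of an isotropic 2D Gaussian (assumed form, for illustration only)."""
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    return -math.log(2.0 * math.pi * sigma ** 2) - (dx * dx + dy * dy) / (2.0 * sigma ** 2)

def anomaly_score(keypoints, pred_means, pred_sigmas, confidences):
    """Negative confidence-weighted sum of per-keypoint log-densities.

    keypoints   : observed (x, y) locations
    pred_means  : predicted (x, y) means, one per keypoint (hypothetical model output)
    pred_sigmas : predicted standard deviations, one per keypoint
    confidences : keypoint-detector confidences used as weights
    """
    total = 0.0
    for p, m, s, w in zip(keypoints, pred_means, pred_sigmas, confidences):
        total += w * gaussian_log_density(p, m, s)
    return -total  # low density -> high score
```

A keypoint with zero detector confidence contributes nothing to the score, which is the practical point of the weighting.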

Yeongtak Oh, Jisoo Mok, Dohyun Chung, Juhyeon Shin, Sangha Park

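TL;DR: The paper proposes RePIC, a reinforcement learning-based post-training framework that improves multimodal large language models (MLLMs) on personalized image captioning and addresses the limitations of conventional supervised fine-tuning (SFT) in complex scenarios.

Details

Motivation: Even after SFT on high-quality annotated data, existing MLLMs still struggle to generate accurate, personalized captions in real-world settings such as multi-concept image captioning. Large-scale, high-quality annotations are costly and hard to obtain, which motivates more data-efficient post-training methods.

Result: RePIC significantly outperforms existing SFT baselines on personalized image captioning, especially on multi-concept captioning.

Insight: Reinforcement learning can make MLLM post-training more efficient, particularly when data are scarce or the task is complex, and offers a new direction for personalized content generation.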

Abstract: Recent multi-modal large language models (MLLMs) often struggle to generate personalized image captions, even when trained on high-quality captions. In this work, we observe that such limitations persist in existing post-training-based MLLM personalization methods. Specifically, despite being post-tuned with large-scale caption data through supervised fine-tuning (SFT), these models frequently fail to produce faithful descriptions in real-world scenarios, such as multi-concept image captioning. However, acquiring large-scale, high-quality captions for such complex settings is both costly and difficult. To address the data-centric nature of SFT, we propose a reinforcement learning (RL)-based post-training framework. To the best of our knowledge, this is the first RL-based approach to post-train MLLMs for personalized image captioning. Our method significantly enhances both visual recognition and personalized generation capabilities of MLLMs, and consistently outperforms existing SFT-based baselines, especially in the challenging multi-concept image captioning task.

[89] OpenEvents V1: Large-Scale Benchmark Dataset for Multimodal Event Grounding cs.CVPDF

Hieu Nguyen, Phuc-Tan Nguyen, Thien-Phuc Tran, Minh-Quang Nguyen, Tam V. Nguyen

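TL;DR: OpenEvents V1 is a large-scale multimodal event benchmark dataset focused on event-aware image captioning and event-relevant image retrieval, supporting the development of deep learning models.

Details

Motivation: Existing captioning and retrieval datasets typically focus on surface-level descriptions and lack deeper contextual and temporal understanding. OpenEvents V1 aims to fill this gap and advance event-centric multimodal understanding.

Result: OpenEvents V1 provides a foundation for multimodal models to reason deeply about complex real-world events and advances event-related tasks.

Insight: The scale and diversity of the dataset offer a rich resource for multimodal research; an event-centric perspective helps models understand content at a deeper level.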

Abstract: We introduce OpenEvents V1, a large-scale benchmark dataset aimed at advancing event-centric vision-language understanding. Unlike conventional image captioning and retrieval datasets that emphasize surface-level descriptions, OpenEvents V1 focuses on contextual and temporal grounding through two primary tasks: (1) generating rich, event-aware image captions and (2) retrieving event-relevant images based on narrative-style textual queries. The dataset contains over 200,000 news articles and 400,000 associated images sourced from CNN and The Guardian, spanning diverse domains and time periods. We provide extensive baseline results and standardized evaluation protocols for both tasks. OpenEvents V1 establishes a robust foundation for developing multimodal models capable of deep reasoning over complex real-world events. The dataset is available at https://ltnghia.github.io/eventa/openevents-v1

[90] InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models cs.CVPDF

Nianchen Deng, Lixin Gu, Shenglong Ye, Yinan He, Zhe Chen

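TL;DR: The paper introduces InternSpatial, currently the largest open-source dataset for spatial reasoning, and InternSpatial-Bench, the corresponding evaluation benchmark, to improve spatial reasoning in vision-language models (VLMs).

Details

Motivation: Existing datasets are limited in scale, visual diversity, and instruction expressiveness, which constrains VLMs on spatial reasoning tasks.

Result: Models trained on InternSpatial improve by 12.1% on InternSpatial-Bench and by 10.7% on VSI-Bench while maintaining strong performance on general-purpose tasks.

Insight: Large, diverse datasets and targeted task design can markedly improve spatial reasoning and support practical applications such as robotics.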

Abstract: Recent benchmarks and datasets have been proposed to improve spatial reasoning in vision-language models (VLMs), yet existing open resources remain limited in scale, visual diversity, and instruction expressiveness. In this work, we introduce InternSpatial, the largest open-source dataset for spatial reasoning in VLMs, along with InternSpatial-Bench, a corresponding evaluation benchmark designed to assess spatial understanding under diverse instruction formats. InternSpatial comprises 12 million QA pairs spanning both single-view and multi-view settings, drawn from diverse visual environments and supporting 19 instruction formats that reflect varied query styles. For evaluation, we propose InternSpatial-Bench for single-view tasks and expand multi-view reasoning by introducing a novel rotation angle prediction task that has not been explored in prior work. Experimental results show that models trained on InternSpatial achieve 12.1% improvement on InternSpatial-Bench and 10.7% on VSI-Bench, while maintaining strong performance on general-purpose benchmarks. We hope these resources will support the development of spatially capable VLMs in practical applications such as robotics and embodied AI.

[91] Distributed Poisson multi-Bernoulli filtering via generalised covariance intersection cs.CV | math.ST | stat.THPDF

Ángel F. García-Fernández, Giorgio Battistelli

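TL;DR: The paper proposes a distributed Poisson multi-Bernoulli (PMB) filter based on the generalised covariance intersection (GCI) fusion rule for distributed multi-object tracking. Approximating the power of a PMB density as an unnormalised PMB density makes GCI fusion tractable.

Details

Motivation: Exact GCI fusion of PMB densities is intractable in distributed multi-object tracking; the work proposes a tractable approximation to make the filter practical.

Result: Experiments show that the approach outperforms other distributed multi-object filters.

Insight: The work offers a workable scheme for distributed multi-object tracking, simplifying a hard problem through approximation and closed-form expressions.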

Abstract: This paper presents the distributed Poisson multi-Bernoulli (PMB) filter based on the generalised covariance intersection (GCI) fusion rule for distributed multi-object filtering. Since the exact GCI fusion of two PMB densities is intractable, we derive a principled approximation. Specifically, we approximate the power of a PMB density as an unnormalised PMB density, which corresponds to an upper bound of the PMB density. Then, the GCI fusion rule corresponds to the normalised product of two unnormalised PMB densities. We show that the result is a Poisson multi-Bernoulli mixture (PMBM), which can be expressed in closed form. Future prediction and update steps in each filter preserve the PMBM form, which can be projected back to a PMB density before the next fusion step. Experimental results show the benefits of this approach compared to other distributed multi-object filters.
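
For reference, the GCI fusion rule named in this abstract is commonly written as the normalised weighted geometric mean of the two local multi-object densities. The symbols below (local densities f_1 and f_2, fusion weight ω, set integral with δX') are notation assumed here for illustration and are not taken from the paper.

```latex
f_{\omega}(X) \;=\;
\frac{f_1(X)^{\omega}\, f_2(X)^{1-\omega}}
     {\int f_1(X')^{\omega}\, f_2(X')^{1-\omega}\, \delta X'},
\qquad \omega \in [0, 1]
```

The powers f_1^ω and f_2^{1-ω} are the intractable part for PMB densities; per the abstract, each power is approximated by an unnormalised PMB density so that the normalised product can be expressed in closed form as a PMBM.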

[92] Frequency-Domain Fusion Transformer for Image Inpainting cs.CVPDF

Sijin He, Guangfeng Lin, Tao Li, Yajun Chen

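TL;DR: The paper proposes a Transformer-based image inpainting method that uses frequency-domain fusion to better preserve high-frequency detail while reducing computational cost.

Details

Motivation: Traditional inpainting methods struggle with complex textures and large occlusions, while existing Transformer methods fail to preserve high-frequency detail because of the low-pass nature of self-attention and are computationally expensive. The paper aims to address both problems.

Result: Experiments show that the method improves inpainting quality by preserving more high-frequency information.

Insight: Frequency-domain fusion compensates for weaknesses of Transformers, particularly in high-frequency detail preservation and computational efficiency; the idea may extend to other vision tasks.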

Abstract: Image inpainting plays a vital role in restoring missing image regions and supporting high-level vision tasks, but traditional methods struggle with complex textures and large occlusions. Although Transformer-based approaches have demonstrated strong global modeling capabilities, they often fail to preserve high-frequency details due to the low-pass nature of self-attention and suffer from high computational costs. To address these challenges, this paper proposes a Transformer-based image inpainting method incorporating frequency-domain fusion. Specifically, an attention mechanism combining wavelet transform and Gabor filtering is introduced to enhance multi-scale structural modeling and detail preservation. Additionally, a learnable frequency-domain filter based on the fast Fourier transform is designed to replace the feedforward network, enabling adaptive noise suppression and detail retention. The model adopts a four-level encoder-decoder structure and is guided by a novel loss strategy to balance global semantics and fine details. Experimental results demonstrate that the proposed method effectively improves the quality of image inpainting by preserving more high-frequency information.

[93] OmniGen2: Exploration to Advanced Multimodal Generation cs.CV | cs.AI | cs.CLPDF

Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo

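TL;DR: OmniGen2 is a versatile open-source generative model that unifies several generation tasks, including text-to-image, image editing, and in-context generation. It features two decoding pathways, a decoupled image tokenizer, and dedicated data pipelines, and performs well across multiple tasks.

Details

Motivation: Existing generative models often need to re-adapt parameters or sacrifice some capabilities on multimodal tasks. OmniGen2 aims to provide a unified, efficient solution while preserving the original text generation capabilities.

Result: Performs well on text-to-image and image editing, and achieves state-of-the-art consistency among open-source models on the OmniContext benchmark.

Insight: A decoupled design (e.g., unshared parameters) helps balance multi-task demands; the reflection mechanism may improve generation quality.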

Abstract: In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link: https://github.com/VectorSpaceLab/OmniGen2

[94] Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations cs.CV | cs.AI | cs.CL | cs.MMPDF

Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao

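TL;DR: The paper proposes Tar, a multimodal framework that uses a Text-Aligned Tokenizer (TA-Tok) to unify vision and text in a shared discrete semantic representation, handling visual understanding and generation jointly.

Details

Motivation: Existing methods usually rely on separate modules for visual understanding and generation, which splits the modalities. The paper aims to bridge this gap with a unified discrete representation.

Result: Experiments show that Tar performs strongly on visual understanding and generation, with faster convergence and greater training efficiency.

Insight: A unified discrete semantic representation can effectively integrate multimodal tasks; text-aligned encoding offers a new approach to cross-modal interaction.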

Abstract: This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model’s (LLM) vocabulary. By integrating vision and text into a unified space with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input and output through a shared interface, without the need for modality-specific designs. Additionally, we propose scale-adaptive encoding and decoding to balance efficiency and visual detail, along with a generative de-tokenizer to produce high-fidelity visual outputs. To address diverse decoding needs, we utilize two complementary de-tokenizers: a fast autoregressive model and a diffusion-based model. To enhance modality fusion, we investigate advanced pre-training tasks, demonstrating improvements in both visual understanding and generation. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency. Code, models, and data are available at https://tar.csuhan.com

[95] DIP: Unsupervised Dense In-Context Post-training of Visual Representations cs.CVPDF

Sophia Sirko-Galouchenko, Spyros Gidaris, Antonin Vobecky, Andrei Bursuc, Nicolas Thome

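TL;DR: DIP is a novel unsupervised post-training method that enhances dense visual representations by simulating downstream task scenarios; it outperforms prior methods and the initial vision encoder.

Details

Motivation: To improve the dense representations of vision encoders in an unsupervised setting without relying on complex self-distillation architectures.

Result: DIP performs well across a variety of in-context scene understanding tasks and trains in under 9 hours on a single A100 GPU.

Insight: By automatically generating pseudo-tasks that simulate real scenarios, DIP demonstrates the potential of unsupervised optimization of dense representations.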

Abstract: We introduce DIP, a novel unsupervised post-training method designed to enhance dense image representations in large-scale pretrained vision encoders for in-context scene understanding. Unlike prior approaches that rely on complex self-distillation architectures, our method trains the vision encoder using pseudo-tasks that explicitly simulate downstream in-context scenarios, inspired by meta-learning principles. To enable post-training on unlabeled data, we propose an automatic mechanism for generating in-context tasks that combines a pretrained diffusion model and the vision encoder itself. DIP is simple, unsupervised, and computationally efficient, requiring less than 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it achieves strong performance across a wide variety of downstream real-world in-context scene understanding tasks. It outperforms both the initial vision encoder and prior methods, offering a practical and effective solution for improving dense representations. Code available here: https://github.com/sirkosophia/DIP

[96] AViLA: Asynchronous Vision-Language Agent for Streaming Multimodal Data Interaction cs.CVPDF

Gengyuan Zhang, Tanveer Hannan, Hermine Kleiner, Beste Aydemir, Xinyu Xie

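TL;DR: The paper proposes AViLA (Asynchronous Vision-Language Agent) to handle query-evidence asynchrony in streaming multimodal data interaction, using three core modules to improve temporal awareness and accuracy.

Details

Motivation: In real-world applications such as autonomous driving and embodied agents, user queries and their supporting evidence often arrive asynchronously, and existing models struggle to handle such dynamic data-stream interaction efficiently.

Result: Experiments show that AViLA significantly improves response accuracy and temporal awareness over existing models.

Insight: Handling dynamic data streams requires attending to historical, current, and future information; asynchronous mechanisms and modular design are key.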

Abstract: An ideal vision-language agent serves as a bridge between the human users and their surrounding physical world in real-world applications like autonomous driving and embodied agents, and proactively provides accurate and timely responses given user intents. An intriguing challenge arises when agents interact with the world as a dynamic data stream and ad-hoc queries from users: supporting knowledge for queries, namely evidence, usually appears asynchronously with the arrival time of queries, and agents need to ground their responses in historical data, present observations, and even future streams. We frame this challenge as Query-Evidence Asynchrony, where user queries and their supporting evidence typically arrive asynchronously in the streaming setting. This setting requires not only strong reasoning capabilities but also the ability to retain past observations and respond to queries with temporal awareness. In this paper, we introduce a diagnostic benchmark that evaluates Multimodal Large Language Models (MLLMs) on their ability to handle interaction with streaming data. Further, we present AViLA, Asynchronous Video-Language Agent for streaming data interaction that can handle ad-hoc queries and give time-aware responses. For this purpose, AViLA consists of three key modules: comprehensive memory retention, evidence identification, and evidence-grounded trigger, that are designed to maintain a general-purpose memory and respond readily and timely to queries. Our experiments show that existing models often fail to respond at appropriate times, while AViLA significantly improves both accuracy and temporal awareness. Our code and dataset will be publicly available.

[97] Context Consistency Learning via Sentence Removal for Semi-Supervised Video Paragraph Grounding cs.CVPDF

Yaokun Zhong, Siyu Jiang, Jian Zhu, Jian-Fang Hu

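TL;DR: The paper proposes Context Consistency Learning (CCL), a sentence-removal-based framework for Semi-Supervised Video Paragraph Grounding (SSVPG) that unifies consistency regularization and pseudo-labeling and substantially improves performance.

Details

Motivation: Existing SSVPG methods focus on teacher-student consistency learning and video-level contrastive loss but overlook the value of perturbing query contexts to generate strong supervisory signals.

Result: Experiments show that CCL outperforms existing methods by a large margin.

Insight: Perturbing the query context (e.g., by removing sentences) is an effective way to generate strong supervisory signals and improve semi-supervised learning.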

Abstract: Semi-Supervised Video Paragraph Grounding (SSVPG) aims to localize multiple sentences in a paragraph from an untrimmed video with limited temporal annotations. Existing methods focus on teacher-student consistency learning and video-level contrastive loss, but they overlook the importance of perturbing query contexts to generate strong supervisory signals. In this work, we propose a novel Context Consistency Learning (CCL) framework that unifies the paradigms of consistency regularization and pseudo-labeling to enhance semi-supervised learning. Specifically, we first conduct teacher-student learning where the student model takes as inputs strongly-augmented samples with sentences removed and is enforced to learn from the adequately strong supervisory signals from the teacher model. Afterward, we conduct model retraining based on the generated pseudo labels, where the mutual agreement between the original and augmented views’ predictions is utilized as the label confidence. Extensive experiments show that CCL outperforms existing methods by a large margin.

[98] ShowFlow: From Robust Single Concept to Condition-Free Multi-Concept Generation cs.CVPDF

Trong-Vu Hoang, Quang-Binh Nguyen, Thanh-Toan Do, Tam V. Nguyen, Minh-Triet Tran

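TL;DR: ShowFlow is a comprehensive framework comprising ShowFlow-S and ShowFlow-M for single-concept and multi-concept image generation, respectively, addressing the challenges of identity preservation and prompt alignment.

Details

Motivation: Customized image generation is a core challenge in controllable image synthesis; identity loss and concept omission are especially problematic in single-concept and multi-concept scenarios.

Result: Experiments and user studies indicate ShowFlow's potential in real-world applications such as advertising and virtual dressing.

Insight: ShowFlow offers a new approach to condition-free multi-concept generation and achieves efficient model reuse through modular design.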

Abstract: Customizing image generation remains a core challenge in controllable image synthesis. For single-concept generation, maintaining both identity preservation and prompt alignment is challenging. In multi-concept scenarios, relying solely on a prompt without additional conditions like layout boxes or semantic masks, often leads to identity loss and concept omission. In this paper, we introduce ShowFlow, a comprehensive framework designed to tackle these challenges. We propose ShowFlow-S for single-concept image generation, and ShowFlow-M for handling multiple concepts. ShowFlow-S introduces a KronA-WED adapter, which integrates a Kronecker adapter with weight and embedding decomposition, and employs a disentangled learning approach with a novel attention regularization objective to enhance single-concept generation. Building on this foundation, ShowFlow-M directly reuses the learned models from ShowFlow-S to support multi-concept generation without extra conditions, incorporating a Subject-Adaptive Matching Attention (SAMA) and a layout consistency strategy as the plug-and-play module. Extensive experiments and user studies validate ShowFlow’s effectiveness, highlighting its potential in real-world applications like advertising and virtual dressing.

[99] Biased Teacher, Balanced Student cs.CVPDF

Seonghak Kim

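TL;DR: The paper proposes LTKD, a knowledge distillation framework for long-tailed data distributions that decomposes the KL divergence and introduces rebalanced loss functions, mitigating the bias of conventional knowledge distillation against tail classes.

Details

Motivation: Conventional knowledge distillation performs poorly on long-tailed distributions because the teacher is strongly biased toward head classes, leaving tail classes under-supervised. The paper aims to solve this problem.

Result: On multiple long-tailed datasets, LTKD outperforms existing methods, with significant gains in both overall and tail-class accuracy.

Insight: Teacher bias can be calibrated through rebalanced loss functions, which improves knowledge distillation on long-tailed data.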

Abstract: Knowledge Distillation (KD) is a widely adopted model compression technique where a compact student model learns from the output of a larger, pre-trained teacher. While effective in balanced settings, conventional KD suffers significantly when applied to long-tailed data distributions, as the teacher model tends to be biased toward head classes and provides limited supervision for tail classes. In this paper, we propose Long-Tailed Knowledge Distillation (LTKD), a novel framework tailored for class-imbalanced scenarios. We begin by reformulating the standard KD objective into two components: inter-group and intra-group Kullback-Leibler (KL) divergence, corresponding to the prediction distributions across and within class groups (head, medium, tail), respectively. This decomposition allows us to identify and quantify the sources of teacher bias. To address them, we introduce (1) a rebalanced inter-group loss that calibrates the teacher’s group-level predictions and (2) a uniform intra-group loss that ensures equal contribution from all groups during distillation. Extensive experiments on CIFAR-100-LT, TinyImageNet-LT, and ImageNet-LT show that LTKD consistently outperforms existing KD methods, achieving significant gains in both overall accuracy and tail-class performance. Our results demonstrate that LTKD enables effective knowledge transfer even from biased teachers, making it a strong candidate for real-world deployment in resource-constrained and imbalanced settings.
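
The inter-group / intra-group split mentioned in this abstract follows from the chain rule for KL divergence: the class-level KL equals the KL between group-level distributions plus the teacher-weighted KL between within-group conditionals. The sketch below only checks that identity numerically; it is not the LTKD loss itself (the paper's rebalancing and uniform weighting are not reproduced), and the function names, the example probabilities, and the head/medium/tail grouping are hypothetical. It assumes strictly positive student probabilities.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def decompose_kl(teacher, student, groups):
    """Split KL(teacher || student) into inter-group and intra-group terms.

    groups is a partition of class indices, e.g. head / medium / tail.
    """
    t_group = [sum(teacher[i] for i in g) for g in groups]
    s_group = [sum(student[i] for i in g) for g in groups]
    inter = kl(t_group, s_group)  # divergence across groups
    intra = 0.0
    for g, tg, sg in zip(groups, t_group, s_group):
        t_cond = [teacher[i] / tg for i in g]  # teacher distribution within the group
        s_cond = [student[i] / sg for i in g]  # student distribution within the group
        intra += tg * kl(t_cond, s_cond)       # weighted by teacher group mass
    return inter, intra
```

Note that the intra-group term is weighted by the teacher's group mass, so a teacher that puts little probability on the tail group also down-weights tail supervision; this is the kind of bias the abstract says the rebalanced and uniform losses are designed to counter.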

[100] Generalizing Vision-Language Models to Novel Domains: A Comprehensive Survey cs.CV | cs.AIPDF

Xinyao Li, Jingjing Li, Fengling Li, Lei Zhu, Yang Yang

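TL;DR: This survey comprehensively summarizes the methods, settings, benchmarks, and results for generalizing vision-language models (VLMs) to novel domains, covers prompt-based, parameter-based, and feature-based approaches categorized by transferred module, and analyzes the relations and differences between VLMs and multimodal large language models (MLLMs).

Details

Motivation: Although vision-language pretrained models perform well on zero-shot tasks, their performance drops on domain-specific or specialized tasks. Studying how to transfer the rich knowledge in VLMs to downstream applications is therefore important.

Result: The survey finds that methods perform differently across tasks and benchmarks, and that parameter-based and prompt-based methods show potential for domain generalization.

Insight: Future work could combine the strengths of modular methods and MLLMs to further improve VLM generalization.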

Abstract: Recently, vision-language pretraining has emerged as a transformative technique that integrates the strengths of both visual and textual modalities, resulting in powerful vision-language models (VLMs). Leveraging web-scale pretraining data, these models exhibit strong zero-shot capabilities. However, their performance often deteriorates when confronted with domain-specific or specialized generalization tasks. To address this, a growing body of research focuses on transferring or generalizing the rich knowledge embedded in VLMs to various downstream applications. This survey aims to comprehensively summarize the generalization settings, methodologies, benchmarking and results in VLM literatures. Delving into the typical VLM structures, current literatures are categorized into prompt-based, parameter-based and feature-based methods according to the transferred modules. The differences and characteristics in each category are furthered summarized and discussed by revisiting the typical transfer learning (TL) settings, providing novel interpretations for TL in the era of VLMs. Popular benchmarks for VLM generalization are further introduced with thorough performance comparisons among the reviewed methods. Following the advances in large-scale generalizable pretraining, this survey also discusses the relations and differences between VLMs and up-to-date multimodal large language models (MLLM), e.g., DeepSeek-VL. By systematically reviewing the surging literatures in vision-language research from a novel and practical generalization prospective, this survey contributes to a clear landscape of current and future multimodal researches.

[101] MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis cs.CVPDF

Yuting Zhang, Kaishen Yuan, Hao Lu, Yutao Yue, Jintai Chen

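TL;DR: MedTVT-R1 is a multimodal large language model (MLLM) that integrates clinical multimodal data to improve medical reasoning and multi-disease diagnosis.

Details

Motivation: Current diagnostic approaches mostly rely on single-modal data and struggle to understand complex diseases comprehensively, so a solution that integrates multimodal data is needed.

Result: Experiments show that MedTVT-R1 performs strongly in multimodal feature utilization and multi-disease diagnosis.

Insight: Combining multimodal data with reinforcement learning is promising for improving the accuracy and interpretability of medical diagnosis.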

Abstract: Accurate and interpretable multi-disease diagnosis remains a critical challenge in medical research, particularly when leveraging heterogeneous multimodal medical data. Current approaches often rely on single-modal data, limiting their ability to comprehensively understand complex diseases. To address this, we propose MedTVT-R1, a novel Multimodal Large Language Model (MLLM) framework designed to integrate clinical multimodal data for reasoning and diagnosing multiple diseases. We construct MedTVT-QA, a curated instruction dataset that provides question-answer pairs for physiological-level interpretations and disease-level diagnoses with a Chain of Evidence approach. MedTVT-R1 incorporates a modality perception layer to capture inter-modal dependencies and adaptively weight modality contributions. Additionally, we employ Group Relative Policy Optimization (GRPO)-based Reinforcement Fine-Tuning with a Jaccard Reward function to enhance diagnostic reasoning. Experimental results demonstrate MedTVT-R1’s superiority in multimodal feature utilization and multi-disease diagnosis, offering significant potential for clinical applications such as diagnostic report generation and comorbidity reasoning. The dataset and code are available at https://github.com/keke-nice/MedTVT-R1.
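
The abstract names a "Jaccard Reward function" without defining it. One plausible reading, sketched below purely as an assumption, is the Jaccard index between the set of diseases the model predicts and the reference set. The function name, the string labels, and the convention for two empty sets are illustrative choices, not the paper's definition.

```python
def jaccard_reward(predicted, reference):
    """Jaccard index between predicted and reference diagnosis sets (assumed reward form)."""
    pred, ref = set(predicted), set(reference)
    union = pred | ref
    if not union:
        return 1.0  # convention chosen here: two empty sets agree perfectly
    return len(pred & ref) / len(union)
```

Under this reading, a partially correct multi-disease prediction earns partial credit instead of a binary reward, which suits multi-label diagnosis.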

[102] Multi-Scale Representation of Follicular Lymphoma Pathology Images in a Single Hyperbolic Space cs.CVPDF

Kei Taguchi, Kazumasa Ohara, Tatsuya Yokota, Hiroaki Miyoshi, Noriaki Hashimoto

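TL;DR: The paper proposes a multi-scale method that uses self-supervised learning to represent malignant lymphoma pathology images in a hyperbolic space (the Poincaré ball), learning disease features by capturing hierarchical relationships across scales.

Details

Motivation: Multi-scale analysis of pathology images (from cell nuclei to tissue) is essential for understanding disease progression, but conventional methods struggle to represent these hierarchical relationships in a unified way. Hyperbolic space encodes hierarchical structure effectively, which makes it a natural choice for this problem.

Result: Experiments show that the learned features capture variations in disease state and differences between cell types.

Insight: Hyperbolic space has distinct advantages for hierarchical data such as the multi-scale relationships in pathology images; self-supervised learning enables effective feature learning without annotated data.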

Abstract: We propose a method for representing malignant lymphoma pathology images, from high-resolution cell nuclei to low-resolution tissue images, within a single hyperbolic space using self-supervised learning. To capture morphological changes that occur across scales during disease progression, our approach embeds tissue and corresponding nucleus images close to each other based on inclusion relationships. Using the Poincaré ball as the feature space enables effective encoding of this hierarchical structure. The learned representations capture both disease state and cell type variations.
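
To make the role of the Poincaré ball concrete, here is a minimal sketch of the standard geodesic distance on the unit ball with curvature -1. It illustrates the geometry only; the paper's embedding model and training loss are not reproduced, and the function name is an assumption.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincaré ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))
```

The same Euclidean gap corresponds to a much larger hyperbolic distance near the boundary than near the origin, which is why tree-like hierarchies (here, tissue images containing nucleus images) embed with low distortion.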

[103] Auto-Regressively Generating Multi-View Consistent Images cs.CVPDF

JiaKui Hu, Yuxiao Yang, Jialun Liu, Jinbo Wu, Chen Zhao

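TL;DR: MV-AR uses an auto-regressive model to generate multi-view consistent images, handles multi-modal inputs with condition injection modules and a progressive training strategy, and expands the training data through data augmentation, performing on par with diffusion models.

Details

Motivation: Multi-view image generation is essential for 3D content creation, but existing methods struggle to guarantee global consistency across views and to handle diverse input conditions.

Result: Experiments show that MV-AR generates consistent multi-view images and performs on par with leading diffusion models.

Insight: Auto-regressive models show promise for progressive generation and multi-modal conditioning; data augmentation is essential for addressing the scarcity of high-quality data.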

Abstract: Generating multi-view images from human instructions is crucial for 3D content creation. The primary challenges involve maintaining consistency across multiple views and effectively synthesizing shapes and textures under diverse conditions. In this paper, we propose the Multi-View Auto-Regressive (MV-AR) method, which leverages an auto-regressive model to progressively generate consistent multi-view images from arbitrary prompts. Firstly, the next-token-prediction capability of the AR model significantly enhances its effectiveness in facilitating progressive multi-view synthesis. When generating widely-separated views, MV-AR can utilize all its preceding views to extract effective reference information. Subsequently, we propose a unified model that accommodates various prompts via architecture designing and training strategies. To address multiple conditions, we introduce condition injection modules for text, camera pose, image, and shape. To manage multi-modal conditions simultaneously, a progressive training strategy is employed. This strategy initially adopts the text-to-multi-view (t2mv) model as a baseline to enhance the development of a comprehensive X-to-multi-view (X2mv) model through the randomly dropping and combining conditions. Finally, to alleviate the overfitting problem caused by limited high-quality data, we propose the “Shuffle View” data augmentation technique, thus significantly expanding the training data by several magnitudes. Experiments demonstrate the performance and versatility of our MV-AR, which consistently generates consistent multi-view images across a range of conditions and performs on par with leading diffusion-based multi-view image generation models. Code and models will be released at https://github.com/MILab-PKU/MVAR.

[104] Geometry-aware Distance Measure for Diverse Hierarchical Structures in Hyperbolic Spaces cs.CVPDF

Pengxiang Li, Yuwei Wu, Zhi Gao, Xiaomeng Fan, Wei Wu

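TL;DR: The paper proposes a geometry-aware, dynamic distance measure in hyperbolic spaces that adapts to different hierarchical structures and significantly improves few-shot learning performance.

Details

Motivation: Real-world hierarchical structures are diverse, whereas existing hyperbolic learning methods usually assume a uniform hierarchy across all data points, which limits model performance.

Result: Consistently outperforms methods with fixed distance measures on image classification, hierarchical classification, and few-shot learning, with gains of over 5% on mini-ImageNet few-shot tasks.

Insight: Adaptive distance measures better capture diverse hierarchical structures; visualizations show clearer class boundaries and better prototype separation.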

Abstract: Learning in hyperbolic spaces has attracted increasing attention due to its superior ability to model hierarchical structures of data. Most existing hyperbolic learning methods use fixed distance measures for all data, assuming a uniform hierarchy across all data points. However, real-world hierarchical structures exhibit significant diversity, making this assumption overly restrictive. In this paper, we propose a geometry-aware distance measure in hyperbolic spaces, which dynamically adapts to varying hierarchical structures. Our approach derives the distance measure by generating tailored projections and curvatures for each pair of data points, effectively mapping them to an appropriate hyperbolic space. We introduce a revised low-rank decomposition scheme and a hard-pair mining mechanism to mitigate the computational cost of pair-wise distance computation without compromising accuracy. We present an upper bound on the low-rank approximation error using Talagrand’s concentration inequality, ensuring theoretical robustness. Extensive experiments on standard image classification (MNIST, CIFAR-10 and CIFAR-100), hierarchical classification (5-level CIFAR-100), and few-shot learning tasks (mini-ImageNet, tiered-ImageNet) demonstrate the effectiveness of our method. Our approach consistently outperforms learning methods that use fixed distance measures, with notable improvements on few-shot learning tasks, where it achieves over 5% gains on mini-ImageNet. The results reveal that adaptive distance measures better capture diverse hierarchical structures, with visualization showing clearer class boundaries and improved prototype separation in hyperbolic spaces.

[105] Normality Prior Guided Multi-Semantic Fusion Network for Unsupervised Image Anomaly Detection cs.CVPDF

Muhao Xu, Xueying Zhou, Xizhan Gao, Weiye Song, Guang Feng

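TL;DR: The paper proposes an unsupervised image anomaly detection method based on multi-semantic fusion; it guides reconstruction with multi-semantic features of normal samples and significantly improves the detection of logical anomalies.

Details

Motivation: Existing encoder-decoder methods are limited on logical anomalies: abnormal semantics can propagate through the low-dimensional bottleneck, so anomalous images are reconstructed with misleading fidelity. A more effective way to separate anomalous from normal patterns is needed.

Result: Achieves SOTA performance on the MVTec LOCO AD dataset, improving pixel-level sPRO by 5.7% and image-level AUROC by 2.6%.

Insight: Using multi-semantic features of normal samples as prior knowledge suppresses the propagation of abnormal semantics and makes the model more robust to logical anomalies.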

Abstract: Recently, detecting logical anomalies is becoming a more challenging task compared to detecting structural ones. Existing encoder decoder based methods typically compress inputs into low-dimensional bottlenecks on the assumption that the compression process can effectively suppress the transmission of logical anomalies to the decoder. However, logical anomalies present a particular difficulty because, while their local features often resemble normal semantics, their global semantics deviate significantly from normal patterns. Thanks to the generalisation capabilities inherent in neural networks, these abnormal semantic features can propagate through low-dimensional bottlenecks. This ultimately allows the decoder to reconstruct anomalous images with misleading fidelity. To tackle the above challenge, we propose a novel normality prior guided multi-semantic fusion network for unsupervised anomaly detection. Instead of feeding the compressed bottlenecks to the decoder directly, we introduce the multi-semantic features of normal samples into the reconstruction process. To this end, we first extract abstract global semantics of normal cases by a pre-trained vision-language network, then the learnable semantic codebooks are constructed to store representative feature vectors of normal samples by vector quantisation. Finally, the above multi-semantic features are fused and employed as input to the decoder to guide the reconstruction of anomalies to approximate normality. Extensive experiments are conducted to validate the effectiveness of our proposed method, and it achieves the SOTA performance on the MVTec LOCO AD dataset with improvements of 5.7% in pixel-sPRO and 2.6% in image-AUROC. The source code is available at https://github.com/Xmh-L/NPGMF.

[106] Object-aware Sound Source Localization via Audio-Visual Scene Understanding cs.CV

Sung Jin Um, Dongjin Kim, Sangmin Lee, Jung Uk Kim

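TL;DR: The paper proposes an audio-visual scene understanding framework built on Multimodal Large Language Models (MLLMs) that generates fine-grained semantic information to distinguish sound-making foreground objects from silent background objects, and introduces two new loss functions (OCA and ORI) to improve localization accuracy.

Details

Motivation: Existing methods struggle to distinguish sound-making objects from visually similar silent objects in complex scenes, mainly because they rely on simple audio-visual correspondence and ignore semantic differences.

Result: Significantly outperforms existing methods on the MUSIC and VGGSound datasets, in both single-source and multi-source localization.

Insight: Combining semantic information with contrastive learning helps distinguish sound-making objects in complex scenes and makes audio-visual localization more robust.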

Abstract: Audio-visual sound source localization task aims to spatially localize sound-making objects within visual scenes by integrating visual and audio cues. However, existing methods struggle with accurately localizing sound-making objects in complex scenes, particularly when visually similar silent objects coexist. This limitation arises primarily from their reliance on simple audio-visual correspondence, which does not capture fine-grained semantic differences between sound-making and silent objects. To address these challenges, we propose a novel sound source localization framework leveraging Multimodal Large Language Models (MLLMs) to generate detailed contextual information that explicitly distinguishes between sound-making foreground objects and silent background objects. To effectively integrate this detailed information, we introduce two novel loss functions: Object-aware Contrastive Alignment (OCA) loss and Object Region Isolation (ORI) loss. Extensive experimental results on MUSIC and VGGSound datasets demonstrate the effectiveness of our approach, significantly outperforming existing methods in both single-source and multi-source localization scenarios. Code and generated detailed contextual information are available at: https://github.com/VisualAIKHU/OA-SSL.

[107] VQ-Insight: Teaching VLMs for AI-Generated Video Quality Understanding via Progressive Visual Reinforcement Learning cs.CV

Xuanyu Zhang, Weiqi Li, Shijie Zhao, Junlin Li, Li Zhang

TL;DR: VQ-Insight is a novel framework for assessing AI-generated video quality using progressive visual reinforcement learning, addressing limitations of current VLMs by integrating multi-dimension rewards and enhancing temporal awareness.

Details

Motivation: Current methods for evaluating AI-generated video quality face challenges like limited generalization, lack of temporal awareness, and reliance on large annotated datasets. Supervised finetuning of VLMs often fails to integrate understanding and generation effectively.

Result: VQ-Insight outperforms state-of-the-art baselines in preference comparison, multi-dimension scoring, and natural video scoring, enhancing video generation tasks.

Insight: Integrating reinforcement learning with VLMs and temporal modeling can significantly improve AI-generated video quality assessment, bridging the gap between understanding and generation.

Abstract: Recent advances in AI-generated content (AIGC) have led to the emergence of powerful text-to-video generation models. Despite these successes, evaluating the quality of AIGC-generated videos remains challenging due to limited generalization, lack of temporal awareness, heavy reliance on large-scale annotated datasets, and the lack of effective interaction with generation models. Most current approaches rely on supervised finetuning of vision-language models (VLMs), which often require large-scale annotated datasets and tend to decouple understanding and generation. To address these shortcomings, we propose VQ-Insight, a novel reasoning-style VLM framework for AIGC video quality assessment. Our approach features: (1) a progressive video quality learning scheme that combines image quality warm-up, general task-specific temporal learning, and joint optimization with the video generation model; (2) the design of multi-dimension scoring rewards, preference comparison rewards, and temporal modeling rewards to enhance both generalization and specialization in video quality evaluation. Extensive experiments demonstrate that VQ-Insight consistently outperforms state-of-the-art baselines in preference comparison, multi-dimension scoring, and natural video scoring, bringing significant improvements for video generation tasks.

[108] VisualChef: Generating Visual Aids in Cooking via Mask Inpainting cs.CV

Oleh Kuzyk, Zuoyue Li, Marc Pollefeys, Xi Wang

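TL;DR: VisualChef is a mask-inpainting approach to generating visual aids for cooking scenarios. It identifies and classifies action-relevant objects to make targeted modifications while keeping the environment consistent. The method simplifies visual-textual alignment and uses an automated pipeline to extract high-quality frames.

Details

Motivation: Visual guidance for cooking is often inconsistent, and existing methods depend on detailed textual descriptions and additional annotations, which makes it hard to generate context-relevant visual aids efficiently.

Result: Quantitative and qualitative evaluation on three egocentric video datasets shows that VisualChef improves over existing methods.

Insight: Mask inpainting can generate visual aids efficiently while reducing the need for complex textual alignment; the automated pipeline improves the efficiency and quality of data extraction.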

Abstract: Cooking requires not only following instructions but also understanding, executing, and monitoring each step - a process that can be challenging without visual guidance. Although recipe images and videos offer helpful cues, they often lack consistency in focus, tools, and setup. To better support the cooking process, we introduce VisualChef, a method for generating contextual visual aids tailored to cooking scenarios. Given an initial frame and a specified action, VisualChef generates images depicting both the action’s execution and the resulting appearance of the object, while preserving the initial frame’s environment. Previous work aims to integrate knowledge extracted from large language models by generating detailed textual descriptions to guide image generation, which requires fine-grained visual-textual alignment and involves additional annotations. In contrast, VisualChef simplifies alignment through mask-based visual grounding. Our key insight is identifying action-relevant objects and classifying them to enable targeted modifications that reflect the intended action and outcome while maintaining a consistent environment. In addition, we propose an automated pipeline to extract high-quality initial, action, and final state frames. We evaluate VisualChef quantitatively and qualitatively on three egocentric video datasets and show its improvements over state-of-the-art methods.

[109] Resampling Augmentation for Time Series Contrastive Learning: Application to Remote Sensing cs.CV

Antoine Saget, Baptiste Lafabregue, Antoine Cornuéjols, Pierre Gançarski

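TL;DR: The paper proposes a resampling-based augmentation strategy for time series contrastive learning, aimed in particular at satellite image time series. It generates positive pairs by upsampling a series and extracting disjoint subsequences, which improves classification performance.

Details

Motivation: Unlabeled Satellite Image Time Series (SITS) are abundant while labeled data are scarce, which makes self-supervised contrastive learning an effective way to exploit them. Designing effective data augmentations for time series, however, remains challenging.

Result: On multiple agricultural classification benchmarks the method outperforms common augmentations, and it reaches state-of-the-art performance on the S2-Agri100 dataset without using spatial information or temporal encodings.

Insight: A simple resampling augmentation can substantially improve time series contrastive learning, especially in remote sensing, and suggests a new way to exploit large volumes of unlabeled data.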

Abstract: Given the abundance of unlabeled Satellite Image Time Series (SITS) and the scarcity of labeled data, contrastive self-supervised pretraining emerges as a natural tool to leverage this vast quantity of unlabeled data. However, designing effective data augmentations for contrastive learning remains challenging for time series. We introduce a novel resampling-based augmentation strategy that generates positive pairs by upsampling time series and extracting disjoint subsequences while preserving temporal coverage. We validate our approach on multiple agricultural classification benchmarks using Sentinel-2 imagery, showing that it outperforms common alternatives such as jittering, resizing, and masking. Further, we achieve state-of-the-art performance on the S2-Agri100 dataset without employing spatial information or temporal encodings, surpassing more complex masked-based SSL frameworks. Our method offers a simple, yet effective, contrastive learning augmentation for remote sensing time series.
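The augmentation described above (upsample, then extract disjoint subsequences that still span the whole series) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: linear interpolation as the upsampler, one sample per view drawn from each consecutive block of the upsampled series, and the helper names `upsample_linear` and `resampled_pair` are all hypothetical choices.

```python
import random


def upsample_linear(series, factor):
    """Linearly interpolate `series` to factor * len(series) points over the same span."""
    n = len(series)
    m = n * factor
    out = []
    for j in range(m):
        pos = j * (n - 1) / (m - 1)      # fractional position in the original index space
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(series[lo] * (1 - frac) + series[hi] * frac)
    return out


def resampled_pair(series, factor=2, seed=None):
    """Build two views with disjoint indices, each covering the full time span.

    The upsampled series is cut into len(series) consecutive blocks of `factor`
    points; each view takes a different point from every block.
    """
    if factor < 2:
        raise ValueError("factor must be >= 2 so that two disjoint views exist")
    rng = random.Random(seed)
    dense = upsample_linear(series, factor)
    view_a, view_b = [], []
    for start in range(0, len(dense), factor):
        i, j = rng.sample(range(start, start + factor), 2)
        view_a.append((i, dense[i]))     # (index in upsampled series, value)
        view_b.append((j, dense[j]))
    return view_a, view_b


# Toy single-band series; a real SITS sample would have one such series per spectral band.
view_a, view_b = resampled_pair([0.2, 0.5, 0.9, 0.4], factor=2, seed=0)
print(view_a)
print(view_b)
```

The two views would then be fed to a contrastive objective as a positive pair. Sampling one point per block is just one way to keep temporal coverage; the paper may select subsequences differently.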

[110] RDPO: Real Data Preference Optimization for Physics Consistency Video Generation cs.CV | I.2.6; I.2.10

Wenxu Qian, Chaoyue Wang, Hou Peng, Zhiyu Tan, Hao Li

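TL;DR: RDPO is an annotation-free framework that extracts physical priors from real videos to improve the physical consistency of video generation models.

Details

Motivation: Video generation has made remarkable progress in visual quality but still falls short on physical consistency. Existing approaches rely on costly human-annotated data or on reward models that are not yet feasible, so a more efficient method is needed.

Result: Across multiple benchmarks and human evaluations, RDPO significantly improves the action coherence and physical realism of generated videos.

Insight: By mining dynamic information directly from real videos, RDPO offers an efficient, low-cost way to optimize the physical consistency of video generation.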

Abstract: Video generation techniques have achieved remarkable advancements in visual quality, yet faithfully reproducing real-world physics remains elusive. Preference-based model post-training may improve physical consistency, but requires costly human-annotated datasets or reward models that are not yet feasible. To address these challenges, we present Real Data Preference Optimisation (RDPO), an annotation-free framework that distills physical priors directly from real-world videos. Specifically, the proposed RDPO reverse-samples real video sequences with a pre-trained generator to automatically build preference pairs that are statistically distinguishable in terms of physical correctness. A multi-stage iterative training schedule then guides the generator to obey physical laws increasingly well. Benefiting from the dynamic information explored from real videos, our proposed RDPO significantly improves the action coherence and physical realism of the generated videos. Evaluations on multiple benchmarks and human evaluations have demonstrated that RDPO achieves improvements across multiple dimensions. The source code and demonstration of this paper are available at: https://wwenxu.github.io/RDPO/

Ling Zhang, Boxiang Yun, Qingli Li, Yan Wang

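TL;DR: The paper proposes BiGen, a historical-report-guided bi-modal concurrent learning framework for generating pathology reports from whole slide images (WSIs). A knowledge retrieval mechanism and a bi-modal learning strategy address the lack of semantic content in visual features and the information redundancy in WSIs.

Details

Motivation: Automated pathology report generation is challenging because the visual features of WSIs lack semantic content and the images themselves are highly redundant. The authors emulate pathologists' diagnostic reasoning and propose a bi-modal learning framework to address both problems.

Result: On the PathText (BRCA) dataset, BiGen achieves a 7.4% relative improvement in NLP metrics and a 19.1% improvement in classification metrics for Her-2 prediction. Ablation studies confirm that the proposed modules are necessary.

Insight: By combining knowledge retrieval with bi-modal learning, BiGen enriches semantic content and reduces information redundancy in WSIs, offering a new approach to pathology report generation.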

Abstract: Automated pathology report generation from Whole Slide Images (WSIs) faces two key challenges: (1) lack of semantic content in visual features and (2) inherent information redundancy in WSIs. To address these issues, we propose a novel Historical Report Guided Bi-modal Concurrent Learning Framework for Pathology Report Generation (BiGen) emulating pathologists’ diagnostic reasoning, consisting of: (1) a knowledge retrieval mechanism to provide rich semantic content, which retrieves WSI-relevant knowledge from a pre-built medical knowledge bank by matching high-attention patches, and (2) a bi-modal concurrent learning strategy instantiated via a learnable visual token and a learnable textual token to dynamically extract key visual features and retrieved knowledge, where weight-shared layers enable cross-modal alignment between visual features and knowledge features. Our multi-modal decoder integrates both modalities for comprehensive diagnostic report generation. Experiments on the PathText (BRCA) dataset demonstrate our framework’s superiority, achieving state-of-the-art performance with a 7.4% relative improvement in NLP metrics and a 19.1% enhancement in classification metrics for Her-2 prediction versus existing methods. Ablation studies validate the necessity of our proposed modules, highlighting our method’s ability to provide WSI-relevant rich semantic content and suppress information redundancy in WSIs. Code is publicly available at https://github.com/DeepMed-Lab-ECNU/BiGen.

[112] MedSeg-R: Medical Image Segmentation with Clinical Reasoning cs.CV

Hao Shao, Qibin Hou

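TL;DR: MedSeg-R is a lightweight dual-stage framework that incorporates clinical reasoning. It improves medical image segmentation through structured semantic priors and a SAM backbone, markedly increasing sensitivity to small lesions.

Details

Motivation: Medical image segmentation is hampered by small lesions, ambiguous boundaries between overlapping anatomical structures, and class imbalance. Existing methods rely on local cues or user prompts, lack semantic priors, and generalize poorly.

Result: On challenging benchmarks, MedSeg-R significantly improves segmentation accuracy (higher Dice scores) for overlapping and ambiguous structures.

Insight: By embedding fine-grained clinical semantic priors, MedSeg-R mitigates inter-class confusion and low sensitivity to small lesions, and it is compatible with SAM-based systems.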

Abstract: Medical image segmentation is challenging due to overlapping anatomies with ambiguous boundaries and a severe imbalance between the foreground and background classes, which particularly affects the delineation of small lesions. Existing methods, including encoder-decoder networks and prompt-driven variants of the Segment Anything Model (SAM), rely heavily on local cues or user prompts and lack integrated semantic priors, thus failing to generalize well to low-contrast or overlapping targets. To address these issues, we propose MedSeg-R, a lightweight, dual-stage framework inspired by clinical reasoning. Its cognitive stage interprets the medical report into structured semantic priors (location, texture, shape), which are fused via a transformer block. In the perceptual stage, these priors modulate the SAM backbone: spatial attention highlights likely lesion regions, dynamic convolution adapts feature filters to expected textures, and deformable sampling refines spatial support. By embedding this fine-grained guidance early, MedSeg-R disentangles inter-class confusion and amplifies minority-class cues, greatly improving sensitivity to small lesions. In challenging benchmarks, MedSeg-R produces large Dice improvements in overlapping and ambiguous structures, demonstrating plug-and-play compatibility with SAM-based systems.

[113] MCN-SLAM: Multi-Agent Collaborative Neural SLAM with Hybrid Implicit Neural Scene Representation cs.CV | cs.RO

Tianchen Deng, Guole Shen, Xun Chen, Shenghai Yuan, Hongming Shen

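TL;DR: The paper proposes MCN-SLAM, a multi-agent collaborative neural SLAM framework that combines a hybrid implicit neural scene representation, distributed camera tracking, intra-to-inter loop closure, and online distillation, addressing the problems of existing single-agent SLAM in large scenes and long sequences. It also contributes DES, the first real-world multi-agent SLAM dataset.

Details

Motivation: Existing neural implicit SLAM algorithms are limited to single-agent scenarios and perform poorly in large scenes and long sequences, while NeRF-based multi-agent SLAM frameworks cannot meet communication bandwidth constraints. The paper therefore proposes a distributed multi-agent collaborative framework.

Result: Experiments show that the method is superior in mapping, tracking, and communication.

Insight: The hybrid representation and online distillation offer a new solution for multi-agent SLAM, and the DES dataset should advance research in SLAM and 3D reconstruction.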

Abstract: Neural implicit scene representations have recently shown promising results in dense visual SLAM. However, existing implicit SLAM algorithms are constrained to single-agent scenarios and face difficulties in large-scale scenes and long sequences. Existing NeRF-based multi-agent SLAM frameworks cannot meet the constraints of communication bandwidth. To this end, we propose the first distributed multi-agent collaborative neural SLAM framework with hybrid scene representation, distributed camera tracking, intra-to-inter loop closure, and online distillation for multiple submap fusion. A novel triplane-grid joint scene representation method is proposed to improve scene reconstruction. A novel intra-to-inter loop closure method is designed to achieve local (single-agent) and global (multi-agent) consistency. We also design a novel online distillation method to fuse the information of different submaps to achieve global consistency. Furthermore, to the best of our knowledge, there is no real-world dataset for NeRF-based/GS-based SLAM that provides both continuous-time trajectory ground truth and high-accuracy 3D mesh ground truth. To this end, we propose the first real-world Dense SLAM (DES) dataset covering both single-agent and multi-agent scenarios, ranging from small rooms to large-scale outdoor scenes, with high-accuracy ground truth for both 3D mesh and continuous-time camera trajectory. This dataset can advance research in SLAM, 3D reconstruction, and visual foundation models. Experiments on various datasets demonstrate the superiority of the proposed method in mapping, tracking, and communication. The dataset and code will be open-sourced at https://github.com/dtc111111/mcnslam.

[114] MARL-MambaContour: Unleashing Multi-Agent Deep Reinforcement Learning for Active Contour Optimization in Medical Image Segmentation cs.CV

Ruicheng Zhang, Yu Sun, Zeyu Zhang, Jinai Li, Xiaofan Liu

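TL;DR: MARL-MambaContour uses Multi-Agent Reinforcement Learning (MARL) to optimize active contours for medical image segmentation, addressing the lack of topological constraints in traditional pixel-based methods. It introduces a dynamically adjusted entropy regularization mechanism and a bidirectional cross-attention fusion strategy to achieve high-accuracy segmentation.

Details

Motivation: Traditional medical image segmentation methods are typically pixel-based and do not account for topological consistency or holistic structural awareness, so they perform poorly on blurred edges and complex morphologies. The paper proposes a contour-based multi-agent reinforcement learning framework to address these issues.

Result: Experiments on five medical imaging datasets show that MARL-MambaContour achieves state-of-the-art performance, particularly on blurred edges and complex morphologies.

Insight: Modeling segmentation as a multi-agent cooperation problem markedly improves the handling of topological consistency and structural complexity, offering a new approach to medical image segmentation.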

Abstract: We introduce MARL-MambaContour, the first contour-based medical image segmentation framework based on Multi-Agent Reinforcement Learning (MARL). Our approach reframes segmentation as a multi-agent cooperation task focused on generating topologically consistent object-level contours, addressing the limitations of traditional pixel-based methods, which can lack topological constraints and holistic structural awareness of anatomical regions. Each contour point is modeled as an autonomous agent that iteratively adjusts its position to align precisely with the target boundary, enabling adaptation to blurred edges and intricate morphologies common in medical images. This iterative adjustment process is optimized by a contour-specific Soft Actor-Critic (SAC) algorithm, further enhanced with the Entropy Regularization Adjustment Mechanism (ERAM), which dynamically balances agent exploration with contour smoothness. Furthermore, the framework incorporates a Mamba-based policy network featuring a novel Bidirectional Cross-attention Hidden-state Fusion Mechanism (BCHFM). This mechanism mitigates potential memory confusion limitations associated with long-range modeling in state space models, thereby facilitating more accurate inter-agent information exchange and informed decision-making. Extensive experiments on five diverse medical imaging datasets demonstrate the state-of-the-art performance of MARL-MambaContour, highlighting its potential as an accurate and robust clinical application.

[115] SIM-Net: A Multimodal Fusion Network Using Inferred 3D Object Shape Point Clouds from RGB Images for 2D Classification cs.CV | cs.AI

Youcef Sklab, Hanane Ariouat, Eric Chenin, Edi Prifti, Jean-Daniel Zucker

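TL;DR: SIM-Net is a novel 2D image classification architecture that infers 3D point cloud representations from RGB images and fuses texture and geometric features to improve classification, performing especially well on digitized herbarium specimens.

Details

Motivation: Conventional 2D image-based classifiers are limited by heterogeneous backgrounds, non-plant elements, and occlusions, whereas 3D geometric information provides additional cues that can improve classification.

Result: On herbarium datasets, SIM-Net outperforms ResNet101 by up to 9.9% in accuracy and 12.3% in F-score, and it also surpasses several transformer-based state-of-the-art models.

Insight: Bringing 3D structural reasoning into 2D image classification effectively improves performance in complex scenes.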

Abstract: We introduce the Shape-Image Multimodal Network (SIM-Net), a novel 2D image classification architecture that integrates 3D point cloud representations inferred directly from RGB images. Our key contribution lies in a pixel-to-point transformation that converts 2D object masks into 3D point clouds, enabling the fusion of texture-based and geometric features for enhanced classification performance. SIM-Net is particularly well-suited for the classification of digitized herbarium specimens, a task made challenging by heterogeneous backgrounds, non-plant elements, and occlusions that compromise conventional image-based models. To address these issues, SIM-Net employs a segmentation-based preprocessing step to extract object masks prior to 3D point cloud generation. The architecture comprises a CNN encoder for 2D image features and a PointNet-based encoder for geometric features, which are fused into a unified latent space. Experimental evaluations on herbarium datasets demonstrate that SIM-Net consistently outperforms ResNet101, achieving gains of up to 9.9% in accuracy and 12.3% in F-score. It also surpasses several transformer-based state-of-the-art architectures, highlighting the benefits of incorporating 3D structural reasoning into 2D image classification tasks.

[116] Matrix-Game: Interactive World Foundation Model cs.CV | cs.AI

Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu

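TL;DR: Matrix-Game is an interactive world foundation model that uses two-stage training for controllable game world generation and performs strongly on Minecraft data.

Details

Motivation: Existing game world generation models fall short in controllability, visual quality, and physical consistency; Matrix-Game is proposed to address these shortcomings.

Result: Matrix-Game outperforms prior models such as Oasis and MineWorld in visual quality, temporal coherence, and controllability, and human evaluations confirm its advantage.

Insight: Two-stage training and large-scale annotated data are key to improving interactive world generation; controllability and physical rule understanding are important dimensions for evaluating such models.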

Abstract: We introduce Matrix-Game, an interactive world foundation model for controllable game world generation. Matrix-Game is trained using a two-stage pipeline that first performs large-scale unlabeled pretraining for environment understanding, followed by action-labeled training for interactive video generation. To support this, we curate Matrix-Game-MC, a comprehensive Minecraft dataset comprising over 2,700 hours of unlabeled gameplay video clips and over 1,000 hours of high-quality labeled clips with fine-grained keyboard and mouse action annotations. Our model adopts a controllable image-to-world generation paradigm, conditioned on a reference image, motion context, and user actions. With over 17 billion parameters, Matrix-Game enables precise control over character actions and camera movements, while maintaining high visual quality and temporal coherence. To evaluate performance, we develop GameWorld Score, a unified benchmark measuring visual quality, temporal quality, action controllability, and physical rule understanding for Minecraft world generation. Extensive experiments show that Matrix-Game consistently outperforms prior open-source Minecraft world models (including Oasis and MineWorld) across all metrics, with particularly strong gains in controllability and physical consistency. Double-blind human evaluations further confirm the superiority of Matrix-Game, highlighting its ability to generate perceptually realistic and precisely controllable videos across diverse game scenarios. To facilitate future research on interactive image-to-world generation, we will open-source the Matrix-Game model weights and the GameWorld Score benchmark at https://github.com/SkyworkAI/Matrix-Game.

[117] Deep CNN Face Matchers Inherently Support Revocable Biometric Templates cs.CV | cs.AI | cs.CR

Aman Bhatta, Michael C. King, Kevin W. Bowyer

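TL;DR: Modern deep CNN face matchers inherently support revocable biometric templates: one can generate multiple models with equivalent recognition power whose templates are mutually incompatible. The paper also finds that ViT backbones are less suitable than ResNet for this purpose.

Details

Motivation: A common critique of biometric authentication is that once a biometric is compromised, the user cannot revoke or replace it. Revocable biometrics aims to solve this problem, and the paper shows that modern deep CNN face matchers inherently support it.

Result: The different model instances have equivalent recognition power and mutually incompatible templates; ViT backbones are less suitable.

Insight: Deep CNNs inherently support revocable biometrics, whereas ViT may perform worse on this task because of its architecture.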

Abstract: One common critique of biometric authentication is that if an individual’s biometric is compromised, then the individual has no recourse. The concept of revocable biometrics was developed to address this concern. A biometric scheme is revocable if an individual can have their current enrollment in the scheme revoked, so that the compromised biometric template becomes worthless, and the individual can re-enroll with a new template that has similar recognition power. We show that modern deep CNN face matchers inherently allow for a robust revocable biometric scheme. For a given state-of-the-art deep CNN backbone and training set, it is possible to generate an unlimited number of distinct face matcher models that have both (1) equivalent recognition power, and (2) strongly incompatible biometric templates. The equivalent recognition power extends to the point of generating impostor and genuine distributions that have the same shape and placement on the similarity dimension, meaning that the models can share a similarity threshold for a 1-in-10,000 false match rate. The biometric templates from different model instances are so strongly incompatible that the cross-instance similarity score for images of the same person is typically lower than the same-instance similarity score for images of different persons. That is, a stolen biometric template that is revoked is of less value in attempting to match the re-enrolled identity than the average impostor template. We also explore the feasibility of using a Vision Transformer (ViT) backbone-based face matcher in the revocable biometric system proposed in this work and demonstrate that it is less suitable compared to typical ResNet-based deep CNN backbones.

[118] SWA-SOP: Spatially-aware Window Attention for Semantic Occupancy Prediction in Autonomous Driving cs.CV | cs.AI | cs.RO

Helin Cao, Rafael Materla, Sven Behnke

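TL;DR: The paper proposes Spatially-aware Window Attention (SWA) to improve Semantic Occupancy Prediction (SOP) in autonomous driving. It incorporates local spatial context into the attention computation, improving performance in sparse or occluded regions.

Details

Motivation: Existing transformer-based SOP methods do not explicitly model spatial structure in the attention computation, which limits geometric awareness and leads to poor performance in sparse or occluded regions.

Result: SWA achieves state-of-the-art results on LiDAR-based SOP benchmarks and also yields consistent gains in camera-based SOP.

Insight: Explicitly modeling spatial structure can markedly improve predictions in sparse or occluded regions, and the method generalizes across modalities.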

Abstract: Perception systems in autonomous driving rely on sensors such as LiDAR and cameras to perceive the 3D environment. However, due to occlusions and data sparsity, these sensors often fail to capture complete information. Semantic Occupancy Prediction (SOP) addresses this challenge by inferring both occupancy and semantics of unobserved regions. Existing transformer-based SOP methods lack explicit modeling of spatial structure in attention computation, resulting in limited geometric awareness and poor performance in sparse or occluded areas. To this end, we propose Spatially-aware Window Attention (SWA), a novel mechanism that incorporates local spatial context into attention. SWA significantly improves scene completion and achieves state-of-the-art results on LiDAR-based SOP benchmarks. We further validate its generality by integrating SWA into a camera-based SOP pipeline, where it also yields consistent gains across modalities.

[119] Focus Your Attention: Towards Data-Intuitive Lightweight Vision Transformers cs.CV | cs.LG

Suyash Gaurav, Muhammad Farhan Humayun, Jukka Heikkonen, Jatin Chaudhary

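TL;DR: The paper proposes a lightweight Vision Transformer approach that uses Super-Pixel Based Patch Pooling (SPPP) and a Light Latent Attention (LLA) module to substantially reduce computational complexity and improve efficiency, making it suitable for edge deployment.

Details

Motivation: Despite their success in many domains, Vision Transformers depend on extensive computational and memory resources, which limits their use, especially on edge devices. The paper addresses this by improving the attention mechanism and generating data-intuitive embeddings.

Result: Experiments show significant gains in computational efficiency with performance comparable to state-of-the-art approaches.

Insight: Lightweight, data-intuitive design can markedly improve Vision Transformer efficiency without sacrificing performance, making it suitable for edge deployment.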

Abstract: The evolution of Vision Transformers has led to their widespread adaptation to different domains. Despite large-scale success, there remain significant challenges including their reliance on extensive computational and memory resources for pre-training on huge datasets as well as difficulties in task-specific transfer learning. These limitations coupled with energy inefficiencies mainly arise due to the computation-intensive self-attention mechanism. To address these issues, we propose a novel Super-Pixel Based Patch Pooling (SPPP) technique that generates context-aware, semantically rich, patch embeddings to effectively reduce the architectural complexity and improve efficiency. Additionally, we introduce the Light Latent Attention (LLA) module in our pipeline by integrating latent tokens into the attention mechanism allowing cross-attention operations to significantly reduce the time and space complexity of the attention module. By leveraging the data-intuitive patch embeddings coupled with dynamic positional encodings, our approach adaptively modulates the cross-attention process to focus on informative regions while maintaining the global semantic structure. This targeted attention improves training efficiency and accelerates convergence. Notably, the SPPP module is lightweight and can be easily integrated into existing transformer architectures. Extensive experiments demonstrate that our proposed architecture provides significant improvements in terms of computational efficiency while achieving comparable results with the state-of-the-art approaches, highlighting its potential for energy-efficient transformers suitable for edge deployment. (The code is available on our GitHub repository: https://github.com/zser092/Focused-Attention-ViT).
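The complexity argument for attention through latent tokens can be made concrete with a small sketch. It assumes a generic arrangement in which M latent tokens act as queries over N patch embeddings, so the score matrix has M x N entries instead of the N x N of full self-attention. This is not the authors' LLA module: the single-head formulation without learned projections, the helper names, and the toy values are all illustrative assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs


# Toy setup: N = 4 patch embeddings, M = 1 latent token (hypothetical values).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
latents = [[0.0, 0.0]]

summary = cross_attention(latents, patches, patches)

# Number of attention scores computed in each scheme.
latent_scores = len(latents) * len(patches)   # M * N
self_scores = len(patches) ** 2               # N * N
print(summary, latent_scores, self_scores)
```

With the all-zero latent query every score is equal, so the output is simply the mean of the patch values; a trained latent would weight informative patches more heavily. The saving grows with N as long as M stays small and fixed.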

[120] ViDAR: Video Diffusion-Aware 4D Reconstruction From Monocular Inputs cs.CV

Michal Nazarczuk, Sibi Catley-Chandar, Thomas Tanay, Zhensong Zhang, Gregory Slabaugh

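TL;DR: ViDAR is a 4D reconstruction framework based on personalised diffusion models. It generates a pseudo multi-view supervision signal and combines it with a Gaussian splatting representation to improve dynamic novel view synthesis from monocular video.

Details

Motivation: Dynamic novel view synthesis from monocular video is difficult because structure and motion are hard to disentangle and supervision is scarce. ViDAR uses a diffusion model to generate pseudo-supervision to alleviate this.

Result: On the DyCheck benchmark, ViDAR outperforms existing methods in visual quality and geometric consistency, with especially strong improvements in dynamic regions.

Insight: Diffusion models can serve as effective generators of pseudo-supervision for 4D reconstruction from monocular video, but their spatio-temporal inconsistency still has to be handled explicitly.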

Abstract: Dynamic Novel View Synthesis aims to generate photorealistic views of moving subjects from arbitrary viewpoints. This task is particularly challenging when relying on monocular video, where disentangling structure from motion is ill-posed and supervision is scarce. We introduce Video Diffusion-Aware Reconstruction (ViDAR), a novel 4D reconstruction framework that leverages personalised diffusion models to synthesise a pseudo multi-view supervision signal for training a Gaussian splatting representation. By conditioning on scene-specific features, ViDAR recovers fine-grained appearance details while mitigating artefacts introduced by monocular ambiguity. To address the spatio-temporal inconsistency of diffusion-based supervision, we propose a diffusion-aware loss function and a camera pose optimisation strategy that aligns synthetic views with the underlying scene geometry. Experiments on DyCheck, a challenging benchmark with extreme viewpoint variation, show that ViDAR outperforms all state-of-the-art baselines in visual quality and geometric consistency. We further highlight ViDAR’s strong improvement over baselines on dynamic regions and provide a new benchmark to compare performance in reconstructing motion-rich parts of the scene. Project page: https://vidar-4d.github.io

[121] OC-SOP: Enhancing Vision-Based 3D Semantic Occupancy Prediction by Object-Centric Awareness cs.CV | cs.AI | cs.RO

Helin Cao, Sven Behnke

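TL;DR: OC-SOP enhances vision-based 3D semantic occupancy prediction with object-centric awareness, significantly improving prediction accuracy for dynamic foreground objects and achieving state-of-the-art performance on the SemanticKITTI dataset.

Details

Motivation: Occlusions and incomplete scene data pose major challenges for autonomous driving perception. Conventional camera-based methods treat all categories equally and rely mainly on local features, which leads to suboptimal predictions, especially for dynamic foreground objects.

Result: On SemanticKITTI, OC-SOP achieves state-of-the-art performance across all categories and significantly improves prediction accuracy for foreground objects.

Insight: An object-centric approach effectively addresses the weakness of conventional methods on dynamic foreground objects and offers a new way to understand complex scenes in autonomous driving perception.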

Abstract: Autonomous driving perception faces significant challenges due to occlusions and incomplete scene data in the environment. To overcome these issues, the task of semantic occupancy prediction (SOP) is proposed, which aims to jointly infer both the geometry and semantic labels of a scene from images. However, conventional camera-based methods typically treat all categories equally and primarily rely on local features, leading to suboptimal predictions, especially for dynamic foreground objects. To address this, we propose Object-Centric SOP (OC-SOP), a framework that integrates high-level object-centric cues extracted via a detection branch into the semantic occupancy prediction pipeline. This object-centric integration significantly enhances the prediction accuracy for foreground objects and achieves state-of-the-art performance among all categories on SemanticKITTI.

[122] PicoSAM2: Low-Latency Segmentation In-Sensor for Edge Vision Applications cs.CV

Pietro Bonazzi, Nicola Farronato, Stefan Zihlmann, Haotong Qi, Michele Magno

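TL;DR: PicoSAM2 is a lightweight (1.3M parameters) segmentation model optimized for edge and in-sensor execution, enabling low-latency, privacy-preserving real-time segmentation. It reaches 51.9% and 44.9% mIoU on COCO and LVIS, respectively. The quantized model is only 1.22MB and runs in 14.3 ms on the Sony IMX500.

Details

Motivation: Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications such as smart glasses and IoT devices. Existing methods depend on cloud or host processing and struggle to meet low-latency and privacy requirements.

Result: Achieves 51.9% and 44.9% mIoU on COCO and LVIS, respectively. The quantized model runs in 14.3 ms on the IMX500 (86 MACs/cycle).

Insight: Efficient segmentation directly on the sensor is feasible without cloud or host processing, enabling privacy-preserving vision applications.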

Abstract: Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications like smart glasses and IoT devices. We introduce PicoSAM2, a lightweight (1.3M parameters, 336M MACs) promptable segmentation model optimized for edge and in-sensor execution, including the Sony IMX500. It builds on a depthwise separable U-Net, with knowledge distillation and fixed-point prompt encoding to learn from the Segment Anything Model 2 (SAM2). On COCO and LVIS, it achieves 51.9% and 44.9% mIoU, respectively. The quantized model (1.22MB) runs at 14.3 ms on the IMX500, achieving 86 MACs/cycle, making it the only model meeting both memory and compute constraints for in-sensor deployment. Distillation boosts LVIS performance by +3.5% mIoU and +5.1% mAP. These results demonstrate that efficient, promptable segmentation is feasible directly on-camera, enabling privacy-preserving vision without cloud or host processing.
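The throughput figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated there; the derived quantities rest on assumptions that the abstract does not confirm, namely that the 336M MAC count applies to the quantized model as deployed, that 86 MACs/cycle is an average over the whole inference, and that "MB" means 10^6 bytes.

```python
# Figures taken from the abstract.
MACS_PER_INFERENCE = 336e6   # multiply-accumulate operations per forward pass
MACS_PER_CYCLE = 86          # reported utilisation on the IMX500
LATENCY_S = 14.3e-3          # reported latency in seconds
PARAMS = 1.3e6               # parameter count
MODEL_BYTES = 1.22e6         # quantized size, assuming 1 MB = 10**6 bytes

# Derived quantities (valid only under the assumptions stated above).
cycles = MACS_PER_INFERENCE / MACS_PER_CYCLE   # cycles needed per inference
implied_clock_hz = cycles / LATENCY_S          # effective cycle rate
frames_per_second = 1.0 / LATENCY_S            # back-to-back inference rate
bytes_per_param = MODEL_BYTES / PARAMS         # storage per weight

print(f"cycles per inference : {cycles:,.0f}")
print(f"implied clock        : {implied_clock_hz / 1e6:.1f} MHz")
print(f"inferences per second: {frames_per_second:.1f}")
print(f"bytes per parameter  : {bytes_per_param:.2f}")
```

Under these assumptions the figures imply roughly 3.9 million cycles per inference, an effective rate near 273 MHz, about 70 inferences per second, and a little under one byte per parameter, which would be consistent with 8-bit weights. These are back-of-the-envelope inferences, not values reported by the authors.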

[123] 4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation cs.CV

Chaoyang Wang, Ashkan Mirzaei, Vidit Goel, Willi Menapace, Aliaksandr Siarohin

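TL;DR: The paper proposes 4Real-Video-V2, the first framework to generate a 4D spatio-temporal grid of video frames and 3D Gaussian particles with a feed-forward architecture. Its novelty lies in fusing spatial and temporal attention within a single layer and in extending 3D reconstruction algorithms.

Details

Motivation: Existing 4D video generation methods handle spatial and temporal attention either sequentially or in a parallel two-stream design, which limits efficiency and quality. The authors therefore propose a fused attention mechanism and new reconstruction techniques.

Result: Experiments show that the method improves both visual quality and reconstruction capability in 4D generation, establishing a new state of the art.

Insight: Fused spatial-temporal attention is more efficient than conventional sequential or parallel processing, and the Gaussian head gives more flexibility for reconstructing dynamic scenes.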

Abstract: We propose the first framework capable of computing a 4D spatio-temporal grid of video frames and 3D Gaussian particles for each time step using a feed-forward architecture. Our architecture has two main components, a 4D video model and a 4D reconstruction model. In the first part, we analyze current 4D video diffusion architectures that perform spatial and temporal attention either sequentially or in parallel within a two-stream design. We highlight the limitations of existing approaches and introduce a novel fused architecture that performs spatial and temporal attention within a single layer. The key to our method is a sparse attention pattern, where tokens attend to others in the same frame, at the same timestamp, or from the same viewpoint. In the second part, we extend existing 3D reconstruction algorithms by introducing a Gaussian head, a camera token replacement algorithm, and additional dynamic layers and training. Overall, we establish a new state of the art for 4D generation, improving both visual quality and reconstruction capability.
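The sparse attention pattern described in this abstract can be illustrated with a small mask builder. This is a minimal sketch, not the authors' implementation: the function name `fused_view_time_mask`, the token layout `(view, time, position)`, and the use of a dense boolean mask are all assumptions made for illustration. Since "same frame" means same view and same timestamp, the rule reduces to "same view or same timestamp".

```python
from itertools import product


def fused_view_time_mask(num_views, num_times, tokens_per_frame):
    """Build a boolean attention mask over a view x time grid of frames.

    mask[i][j] is True when token i may attend to token j, i.e. when the
    two tokens share a viewpoint or share a timestamp (tokens of the same
    frame share both, so they are always allowed).
    Hypothetical helper: layout and naming are illustrative assumptions.
    """
    # Each token is identified by (view index, time index, position in frame).
    tokens = list(product(range(num_views), range(num_times), range(tokens_per_frame)))
    mask = []
    for view_i, time_i, _ in tokens:
        row = [view_i == view_j or time_i == time_j for view_j, time_j, _ in tokens]
        mask.append(row)
    return tokens, mask


tokens, mask = fused_view_time_mask(num_views=2, num_times=2, tokens_per_frame=1)
# Token (view 0, time 0) cannot see token (view 1, time 1): they share neither axis.
print(mask[0])
```

With a V x T grid, each token attends to roughly (V + T - 1) frames rather than all V x T frames, which is where the sparsity comes from.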

[124] Phantom-Data : Towards a General Subject-Consistent Video Generation Dataset cs.CVPDF

Zhuowei Chen, Bingchuan Li, Tianxiang Ma, Lijie Liu, Mingcong Liu

TL;DR: Phantom-Data is the first cross-pair subject-to-video consistency dataset designed to address the copy-paste problem in video generation by disentangling subject identity from contextual attributes, improving prompt alignment and visual quality.

Details

Motivation: Existing video generation models suffer from the copy-paste problem due to the in-pair training paradigm, which entangles subject identity with background and contextual attributes. This limits their ability to follow textual instructions faithfully.

Result: Experiments demonstrate that training with Phantom-Data significantly enhances prompt alignment and visual quality while maintaining identity consistency, outperforming in-pair baselines.

Insight: Disentangling subject identity from context through cross-pair data is crucial for improving the generalization and fidelity of subject-to-video generation models.

Abstract: Subject-to-video generation has witnessed substantial progress in recent years. However, existing models still face significant challenges in faithfully following textual instructions. This limitation, commonly known as the copy-paste problem, arises from the widely used in-pair training paradigm. This approach inherently entangles subject identity with background and contextual attributes by sampling reference images from the same scene as the target video. To address this issue, we introduce Phantom-Data, the first general-purpose cross-pair subject-to-video consistency dataset, containing approximately one million identity-consistent pairs across diverse categories. Our dataset is constructed via a three-stage pipeline: (1) a general and input-aligned subject detection module, (2) large-scale cross-context subject retrieval from more than 53 million videos and 3 billion images, and (3) prior-guided identity verification to ensure visual consistency under contextual variation. Comprehensive experiments show that training with Phantom-Data significantly improves prompt alignment and visual quality while preserving identity consistency on par with in-pair baselines.

[125] RAG-6DPose: Retrieval-Augmented 6D Pose Estimation via Leveraging CAD as Knowledge Base cs.CVPDF

Kuanning Wang, Yuqian Fu, Tianyu Wang, Yanwei Fu, Longfei Liang

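TL;DR: RAG-6DPose is a retrieval-augmented 6D pose estimation method that uses CAD models as a knowledge base and combines visual and geometric cues to improve the accuracy and robustness of pose estimation.

Details

Motivation: Accurate 6D pose estimation is essential for robotic manipulation tasks such as grasping, but existing methods perform poorly under occlusion and viewpoint changes. To address this, the authors use CAD models as a knowledge base and improve pose accuracy through retrieval augmentation.

Result: The method is validated on standard benchmarks and real-world robotic tasks, and performs particularly well under occlusion and from novel viewpoints.

Insight: Retrieval augmentation can exploit the prior knowledge in CAD models to improve pose estimation in complex scenes, giving robotic applications a more reliable pose estimator.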

Abstract: Accurate 6D pose estimation is key for robotic manipulation, enabling precise object localization for tasks like grasping. We present RAG-6DPose, a retrieval-augmented approach that leverages 3D CAD models as a knowledge base by integrating both visual and geometric cues. Our RAG-6DPose roughly contains three stages: 1) Building a Multi-Modal CAD Knowledge Base by extracting 2D visual features from multi-view CAD rendered images and also attaching 3D points; 2) Retrieving relevant CAD features from the knowledge base based on the current query image via our ReSPC module; and 3) Incorporating retrieved CAD information to refine pose predictions via retrieval-augmented decoding. Experimental results on standard benchmarks and real-world robotic tasks demonstrate the effectiveness and robustness of our approach, particularly in handling occlusions and novel viewpoints. Supplementary material is available on our project website: https://sressers.github.io/RAG-6DPose .

[126] TAMMs: Temporal-Aware Multimodal Model for Satellite Image Change Understanding and Forecasting cs.CV | cs.AIPDF

Zhongbin Guo, Yuhao Wang, Ping Jian, Xinyue Chen, Wei Peng

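TL;DR: TAMMs is a temporal-aware multimodal model for change understanding and future scene forecasting in satellite imagery. It uses lightweight temporal modules and a Semantic-Fused Control Injection mechanism to improve multimodal large language models (MLLMs) on spatio-temporal reasoning tasks.

Details

Motivation: Existing MLLMs perform poorly on satellite image time-series analysis, especially on spatio-temporal change understanding and future scene generation. TAMMs targets this challenge and explores the potential of MLLMs for modeling complex multimodal spatio-temporal dynamics.

Result: Experiments show that TAMMs outperforms existing MLLM baselines on both temporal change understanding and future image forecasting, validating its spatio-temporal reasoning and semantic fusion design.

Insight: Carefully designed temporal reasoning and semantic fusion mechanisms can unlock the potential of MLLMs on complex spatio-temporal tasks. TAMMs offers a new, efficient tool for satellite image analysis.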

Abstract: Satellite image time-series analysis demands fine-grained spatial-temporal reasoning, which remains a challenge for existing multimodal large language models (MLLMs). In this work, we study the capabilities of MLLMs on a novel task that jointly targets temporal change understanding and future scene generation, aiming to assess their potential for modeling complex multimodal dynamics over time. We propose TAMMs, a Temporal-Aware Multimodal Model for satellite image change understanding and forecasting, which enhances frozen MLLMs with lightweight temporal modules for structured sequence encoding and contextual prompting. To guide future image generation, TAMMs introduces a Semantic-Fused Control Injection (SFCI) mechanism that adaptively combines high-level semantic reasoning and structural priors within an enhanced ControlNet. This dual-path conditioning enables temporally consistent and semantically grounded image synthesis. Experiments demonstrate that TAMMs outperforms strong MLLM baselines in both temporal change understanding and future image forecasting tasks, highlighting how carefully designed temporal reasoning and semantic fusion can unlock the full potential of MLLMs for spatio-temporal understanding.

[127] OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation cs.CV | cs.AI | cs.MMPDF

Qijun Gan, Ruizi Yang, Jianke Zhu, Shaofei Xue, Steven Hoi

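TL;DR: OmniAvatar is an audio-driven full-body video generation model. A pixel-wise multi-hierarchical audio embedding strategy and LoRA-based training significantly improve lip-sync accuracy and motion naturalness while preserving text-prompt control.

Details

Motivation: Existing audio-driven human animation methods focus mostly on facial motion, struggle to generate naturally synchronized full-body animation, and lack fine-grained prompt control. OmniAvatar aims to address these issues.

Result: Experiments show that OmniAvatar outperforms existing models in facial and semi-body video generation and applies to domains such as podcasts, human interactions, dynamic scenes, and singing.

Insight: Hierarchical audio embedding combined with LoRA training can improve both the synchronization of audio-driven animation and text-based control, suggesting a new approach to multi-scenario video generation.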

Abstract: Significant progress has been made in audio-driven human animation, while most existing methods focus mainly on facial movements, limiting their ability to create full-body animations with natural synchronization and fluidity. They also struggle with precise prompt control for fine-grained generation. To tackle these challenges, we introduce OmniAvatar, an innovative audio-driven full-body video generation model that enhances human animation with improved lip-sync accuracy and natural movements. OmniAvatar introduces a pixel-wise multi-hierarchical audio embedding strategy to better capture audio features in the latent space, enhancing lip-syncing across diverse scenes. To preserve the capability for prompt-driven control of foundation models while effectively incorporating audio features, we employ a LoRA-based training approach. Extensive experiments show that OmniAvatar surpasses existing models in both facial and semi-body video generation, offering precise text-based control for creating videos in various domains, such as podcasts, human interactions, dynamic scenes, and singing. Our project page is https://omni-avatar.github.io/.

[128] Let Your Video Listen to Your Music! cs.CV | cs.MMPDF

Xinyu Zhang, Dong Gong, Zicheng Duan, Anton van den Hengel, Lingqiao Liu

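TL;DR: This paper proposes MVAA, a framework that automatically edits a video to align with the beats of a music track while preserving the original visual content, improving the synchronization between video and musical rhythm.

Details

Motivation: Existing methods rely on manual cutting or heuristic editing techniques and lack flexibility, while generative models tend to entangle the video and music modalities. An efficient and flexible way to automatically align video with music beats is therefore needed.

Result: Experiments show that MVAA completes adaptation within 10 minutes on a single NVIDIA 4090 GPU and achieves high-quality beat alignment and visual smoothness.

Insight: Modularizing the task and adopting a pretrain-then-fine-tune strategy makes it possible to synchronize video with musical rhythm efficiently while preserving the video's semantic content.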

Abstract: Aligning the rhythm of visual motion in a video with a given music track is a practical need in multimedia production, yet remains an underexplored task in autonomous video editing. Effective alignment between motion and musical beats enhances viewer engagement and visual appeal, particularly in music videos, promotional content, and cinematic editing. Existing methods typically depend on labor-intensive manual cutting, speed adjustments, or heuristic-based editing techniques to achieve synchronization. While some generative models handle joint video and music generation, they often entangle the two modalities, limiting flexibility in aligning video to music beats while preserving the full visual content. In this paper, we propose a novel and efficient framework, termed MVAA (Music-Video Auto-Alignment), that automatically edits video to align with the rhythm of a given music track while preserving the original visual content. To enhance flexibility, we modularize the task into a two-step process in our MVAA: aligning motion keyframes with audio beats, followed by rhythm-aware video inpainting. Specifically, we first insert keyframes at timestamps aligned with musical beats, then use a frame-conditioned diffusion model to generate coherent intermediate frames, preserving the original video’s semantic content. Since comprehensive test-time training can be time-consuming, we adopt a two-stage strategy: pretraining the inpainting module on a small video set to learn general motion priors, followed by rapid inference-time fine-tuning for video-specific adaptation. This hybrid approach enables adaptation within 10 minutes with one epoch on a single NVIDIA 4090 GPU using CogVideoX-5b-I2V as the backbone. Extensive experiments show that our approach can achieve high-quality beat alignment and visual smoothness.

Zeqian Li, Shangzhe Di, Zhonghua Zhai, Weilin Huang, Yanfeng Wang

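TL;DR: This paper proposes UniTime, a universal video temporal grounding model built on multimodal large language models (MLLMs) that accurately localizes temporal segments in videos from natural language queries. It performs well across video types and lengths and surpasses prior work on multiple benchmarks.

Details

Motivation: Existing video temporal grounding methods are usually restricted to specific video domains or durations and lack generality. This paper aims to use the strong vision-language understanding of MLLMs to achieve universal grounding across domains and durations.

Result: Experiments show that UniTime outperforms existing methods on multiple benchmarks and significantly improves performance on long-video question answering.

Insight: MLLMs show strong potential for video temporal grounding, and adaptive frame scaling is an effective way to handle videos of different lengths.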

Abstract: This paper presents a computational model for universal video temporal grounding, which accurately localizes temporal moments in videos based on natural language queries (e.g., questions or descriptions). Unlike existing methods that are often limited to specific video domains or durations, we propose UniTime, a robust and universal video grounding model leveraging the strong vision-language understanding capabilities of generative Multi-modal Large Language Models (MLLMs). Our model effectively handles videos of diverse views, genres, and lengths while comprehending complex language queries. The key contributions include: (i) We consider steering strong MLLMs for temporal grounding in videos. To enable precise timestamp outputs, we incorporate temporal information by interleaving timestamp tokens with video tokens. (ii) By training the model to handle videos with different input granularities through adaptive frame scaling, our approach achieves robust temporal grounding for both short and long videos. (iii) Comprehensive experiments show that UniTime outperforms state-of-the-art approaches in both zero-shot and dataset-specific finetuned settings across five public temporal grounding benchmarks. (iv) When employed as a preliminary moment retriever for long-form video question-answering (VideoQA), UniTime significantly improves VideoQA accuracy, highlighting its value for complex video understanding tasks.
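Contribution (i) in this abstract, interleaving timestamp tokens with video tokens, can be pictured with a toy sequence builder. This is a minimal sketch under stated assumptions: the function name `interleave_timestamps`, the `<t=...s>` token format, and the representation of visual tokens as strings are all hypothetical and not taken from the paper.

```python
def interleave_timestamps(frame_tokens, timestamps):
    """Place one timestamp token in front of each frame's visual tokens.

    frame_tokens: list of per-frame token lists (placeholder strings here).
    timestamps:   list of frame times in seconds, one per frame.
    The "<t=...s>" format is an illustrative assumption.
    """
    if len(frame_tokens) != len(timestamps):
        raise ValueError("need exactly one timestamp per frame")
    sequence = []
    for seconds, tokens in zip(timestamps, frame_tokens):
        sequence.append(f"<t={seconds:.1f}s>")  # textual time marker the LLM can copy
        sequence.extend(tokens)                 # visual tokens for this frame
    return sequence


frames = [["v0_a", "v0_b"], ["v1_a"]]
print(interleave_timestamps(frames, [0.0, 2.0]))
```

Because each frame is preceded by an explicit time marker, a generative model can answer a grounding query by emitting timestamps it has already seen in context instead of regressing continuous values.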

[130] FilMaster: Bridging Cinematic Principles and Generative AI for Automated Film Generation cs.CVPDF

Kaiyi Huang, Yukun Huang, Xintao Wang, Zinan Lin, Xuefei Ning

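TL;DR: FilMaster is an end-to-end AI system that applies real-world cinematic principles to generate professional-grade films, addressing the lack of diverse camera language and cinematic rhythm in existing systems.

Details

Motivation: Existing AI film generation systems do not implement cinematic principles, which leads to low-quality output and monotonous narratives. FilMaster aims to make generated content more professional and editable by incorporating those principles.

Result: Experiments show that FilMaster performs strongly in camera language design and cinematic rhythm control, making the generated films more engaging and professional.

Insight: FilMaster demonstrates the potential of combining domain expertise (cinematography) with generative AI and points to a new direction for automated film generation.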

Abstract: AI-driven content creation has shown potential in film production. However, existing film generation systems struggle to implement cinematic principles and thus fail to generate professional-quality films, particularly lacking diverse camera language and cinematic rhythm. This results in templated visuals and unengaging narratives. To address this, we introduce FilMaster, an end-to-end AI system that integrates real-world cinematic principles for professional-grade film generation, yielding editable, industry-standard outputs. FilMaster is built on two key principles: (1) learning cinematography from extensive real-world film data and (2) emulating professional, audience-centric post-production workflows. Inspired by these principles, FilMaster incorporates two stages: a Reference-Guided Generation Stage which transforms user input to video clips, and a Generative Post-Production Stage which transforms raw footage into audiovisual outputs by orchestrating visual and auditory elements for cinematic rhythm. Our generation stage highlights a Multi-shot Synergized RAG Camera Language Design module to guide the AI in generating professional camera language by retrieving reference clips from a vast corpus of 440,000 film clips. Our post-production stage emulates professional workflows by designing an Audience-Centric Cinematic Rhythm Control module, including Rough Cut and Fine Cut processes informed by simulated audience feedback, for effective integration of audiovisual elements to achieve engaging content. The system is empowered by generative AI models like (M)LLMs and video generation models. Furthermore, we introduce FilmEval, a comprehensive benchmark for evaluating AI-generated films. Extensive experiments show FilMaster’s superior performance in camera language design and cinematic rhythm control, advancing generative AI in professional filmmaking.

[131] From Virtual Games to Real-World Play cs.CVPDF

Wenqiang Sun, Fangyun Wei, Jinjing Zhao, Xi Chen, Zilong Chen

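TL;DR: RealPlay is a neural network-based real-world game engine that generates interactive video from user control signals, aiming for photorealistic and temporally consistent video sequences.

Details

Motivation: Prior work focuses mainly on game-style visuals, whereas RealPlay aims to generate photorealistic video that resembles real-world footage and responds to user interaction.

Result: Experiments show that RealPlay maps virtual control signals to real-world scenarios and generalizes to non-vehicle entities such as bicycles and pedestrians.

Insight: The study demonstrates that a model trained with labels from virtual data can generalize to real-world scenes, offering a new approach to interactive video generation.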

Abstract: We introduce RealPlay, a neural network-based real-world game engine that enables interactive video generation from user control signals. Unlike prior works focused on game-style visuals, RealPlay aims to produce photorealistic, temporally consistent video sequences that resemble real-world footage. It operates in an interactive loop: users observe a generated scene, issue a control command, and receive a short video chunk in response. To enable such realistic and responsive generation, we address key challenges including iterative chunk-wise prediction for low-latency feedback, temporal consistency across iterations, and accurate control response. RealPlay is trained on a combination of labeled game data and unlabeled real-world videos, without requiring real-world action annotations. Notably, we observe two forms of generalization: (1) control transfer: RealPlay effectively maps control signals from virtual to real-world scenarios; and (2) entity transfer: although training labels originate solely from a car racing game, RealPlay generalizes to control diverse real-world entities, including bicycles and pedestrians, beyond vehicles. The project page can be found at https://wenqsun.github.io/RealPlay/

[132] VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory cs.CVPDF

Runjia Li, Philip Torr, Andrea Vedaldi, Tomas Jakab

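TL;DR: The paper proposes VMem, a new memory mechanism for building video generators that can explore environments interactively. By indexing past views geometrically through the 3D surface elements (surfels) they observed, VMem retrieves relevant views efficiently and generates consistent long-horizon video at a much lower computational cost.

Details

Motivation: Existing methods have problems with long-horizon scene generation: incremental 3D reconstruction from 2D views accumulates errors, while video generators with a short context window struggle to keep the scene consistent. VMem aims to address both.

Result: On long-term scene synthesis benchmarks, VMem shows better scene coherence and camera control than existing methods.

Insight: Managing remembered views efficiently through a geometric index can significantly improve the long-term consistency of interactive video generation while reducing the computational burden.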

Abstract: We propose a novel memory mechanism to build video generators that can explore environments interactively. Similar results have previously been achieved by out-painting 2D views of the scene while incrementally reconstructing its 3D geometry, which quickly accumulates errors, or by video generators with a short context window, which struggle to maintain scene coherence over the long term. To address these limitations, we introduce Surfel-Indexed View Memory (VMem), a mechanism that remembers past views by indexing them geometrically based on the 3D surface elements (surfels) they have observed. VMem enables the efficient retrieval of the most relevant past views when generating new ones. By focusing only on these relevant views, our method produces consistent explorations of imagined environments at a fraction of the computational cost of using all past views as context. We evaluate our approach on challenging long-term scene synthesis benchmarks and demonstrate superior performance compared to existing methods in maintaining scene coherence and camera control.
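The idea of indexing past views by the surfels they observed, then retrieving the most relevant views for a new camera, can be sketched as an inverted index with voting. This is an illustrative toy and not the paper's method: the class name `SurfelViewMemory`, the use of integer surfel IDs, and the simple vote-counting retrieval are assumptions, and the step that determines which surfels are visible from the new viewpoint is taken as given.

```python
from collections import Counter


class SurfelViewMemory:
    """Toy inverted index: surfel id -> set of past view ids that observed it."""

    def __init__(self):
        self.views_by_surfel = {}

    def add_view(self, view_id, observed_surfels):
        # Record that this past view observed each of the given surfels.
        for surfel in observed_surfels:
            self.views_by_surfel.setdefault(surfel, set()).add(view_id)

    def retrieve(self, visible_surfels, k):
        # Each surfel visible from the new camera votes for the views that saw it.
        votes = Counter()
        for surfel in visible_surfels:
            votes.update(self.views_by_surfel.get(surfel, ()))
        # Highest vote count first; ties broken by view id for determinism.
        ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
        return [view_id for view_id, _ in ranked[:k]]


memory = SurfelViewMemory()
memory.add_view(0, {1, 2, 3})
memory.add_view(1, {3, 4})
memory.add_view(2, {9})
print(memory.retrieve({3, 4}, k=1))
```

Only the top-k retrieved views would be passed to the generator as context, which is how a scheme like this avoids conditioning on every past view.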

[133] TC-Light: Temporally Consistent Relighting for Dynamic Long Videos cs.CVPDF

Yang Liu, Chuanchen Luo, Zimo Tang, Yingyan Li, Yuran Yang

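TL;DR: TC-Light proposes a novel two-stage post-optimization mechanism for temporally consistent relighting of long, dynamic videos, significantly improving the physical plausibility and computational efficiency of illumination editing.

Details

Motivation: Illumination editing in long videos is valuable for content creation and data augmentation, but existing methods are limited to portrait videos or are bottlenecked by temporal consistency and computational efficiency.

Result: Experiments show that TC-Light produces physically plausible and temporally consistent relighting at low computational cost.

Insight: The two-stage optimization mechanism is an effective approach to illumination editing in dynamic videos, and the Unique Video Tensor (UVT) opens a new direction for video representation.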

Abstract: Editing illumination in long videos with complex dynamics has significant value in various downstream tasks, including visual content creation and manipulation, as well as data scaling up for embodied AI through sim2real and real2real transfer. Nevertheless, existing video relighting techniques are predominantly limited to portrait videos or fall into the bottleneck of temporal consistency and computation efficiency. In this paper, we propose TC-Light, a novel paradigm characterized by the proposed two-stage post optimization mechanism. Starting from the video preliminarily relighted by an inflated video relighting model, it optimizes appearance embedding in the first stage to align global illumination. Then it optimizes the proposed canonical video representation, i.e., Unique Video Tensor (UVT), to align fine-grained texture and lighting in the second stage. To comprehensively evaluate performance, we also establish a long and highly dynamic video benchmark. Extensive experiments show that our method enables physically plausible relighting results with superior temporal coherence and low computation cost. The code and video demos are available at https://dekuliutesla.github.io/tclight/.

cs.CL [Back]

[134] Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs cs.CL | cs.CRPDF

Xiang Li, Chong Zhang, Jia Wang, Fangyu Wu, Yushi Li

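TL;DR: This paper proposes adversarial prompt distillation, which transfers the jailbreak capability of large language models (LLMs) to small language models (SLMs), addressing the low efficiency, high computational cost, and poor cross-model adaptability of current jailbreak attack methods.

Details

Motivation: Current jailbreak attacks on LLMs are inefficient, computationally expensive, and slow to adapt to new models and defense strategies. The paper aims to propose an efficient and stealthy jailbreak attack method.

Result: Experiments show that the method performs strongly in attack success rate and harmfulness while being resource-efficient and adaptable across models.

Insight: The study exposes vulnerabilities in LLMs, offers a new perspective for security research, and demonstrates that distilling LLM capabilities into SLMs is feasible.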

Abstract: Attacks on large language models (LLMs) in jailbreaking scenarios raise many security and ethical issues. Current jailbreak attack methods face problems such as low efficiency, high computational cost, and poor cross-model adaptability and versatility, which make it difficult to cope with the rapid development of LLMs and new defense strategies. Our work proposes Adversarial Prompt Distillation, which combines masked language modeling, reinforcement learning, and dynamic temperature control through a prompt generation and distillation method. It enables small language models (SLMs) to carry out jailbreak attacks on mainstream LLMs. The experimental results verify the superiority of the proposed method in terms of attack success rate and harm, and reflect its resource efficiency and cross-model adaptability. This research explores the feasibility of distilling the jailbreak ability of LLMs to SLMs, reveals the models' vulnerability, and provides a new idea for LLM security research.

[135] AI-Generated Game Commentary: A Survey and a Datasheet Repository cs.CL | cs.AI | cs.LGPDF

Qirui Zheng, Xingbo Wang, Keyuan Cheng, Yunlong Lu, Wenxin Li

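TL;DR: This paper surveys the state of research on AI-Generated Game Commentary (AIGGC), proposes a general framework, classifies and compares 45 datasets and methods, and provides an open metadata repository to support future research.

Details

Motivation: AIGGC has attracted wide attention because of its market potential and technical challenges, but it lacks a systematic survey and standardized resources.

Result: The paper systematically analyzes 45 datasets and methods and releases a structured metadata repository.

Insight: AIGGC is a multimodal NLP task that must balance accuracy, logical reasoning, and generation speed; standardized resources will help the field advance.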

Abstract: AI-Generated Game Commentary (AIGGC) has gained increasing attention due to its market potential and inherent technical challenges. As a comprehensive multimodal Natural Language Processing (NLP) task, AIGGC imposes substantial demands on language models, including factual accuracy, logical reasoning, expressive text generation, generation speed, and context management. In this paper, we introduce a general framework for AIGGC and present a comprehensive survey of 45 existing game commentary datasets and methods according to the key challenges they aim to address in this domain. We further classify and compare various evaluation metrics commonly used in this domain. To support future research and benchmarking, we also provide a structured datasheet summarizing the essential attributes of these datasets in the appendix, which is also publicly available in an open repository.

[136] Semantic uncertainty in advanced decoding methods for LLM generation cs.CL | cs.AIPDF

Darius Foodeei, Simin Fan, Martin Jaggi

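TL;DR: This study examines how different decoding methods affect the semantic uncertainty of large language model (LLM) outputs, focusing on emerging techniques such as speculative sampling and chain-of-thought (CoT) decoding. Experiments show that CoT decoding markedly increases semantic diversity while lowering predictive entropy, and that speculative sampling performs well on summarization.

Details

Motivation: The study analyzes how different decoding methods affect the diversity and reliability of LLM outputs, in order to address the trade-off between diversity and accuracy in practical applications.

Result: CoT decoding improves the Pass@2 rate in code generation by 48.8%, and speculative sampling achieves higher ROUGE scores on summarization while maintaining semantic diversity.

Insight: Carefully designed decoding methods can increase semantic exploration while maintaining or improving output quality, suggesting new directions for optimizing practical deployments.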

Abstract: This study investigates semantic uncertainty in large language model (LLM) outputs across different decoding methods, focusing on emerging techniques like speculative sampling and chain-of-thought (CoT) decoding. Through experiments on question answering, summarization, and code generation tasks, we analyze how different decoding strategies affect both the diversity and reliability of model outputs. Our findings reveal that while CoT decoding demonstrates higher semantic diversity, it maintains lower predictive entropy, suggesting that structured exploration can lead to more confident and accurate outputs. This is evidenced by a 48.8% improvement in code generation Pass@2 rates, despite lower alignment with reference solutions. For summarization tasks, speculative sampling proved particularly effective, achieving superior ROUGE scores while maintaining moderate semantic diversity. Our results challenge conventional assumptions about trade-offs between diversity and accuracy in language model outputs, demonstrating that properly structured decoding methods can increase semantic exploration while maintaining or improving output quality. These findings have significant implications for deploying language models in practical applications where both reliability and diverse solution generation are crucial.
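One common way to quantify the kind of semantic uncertainty discussed here is to group sampled outputs into meaning-equivalent clusters and compute the entropy of the cluster distribution. The sketch below assumes the clustering has already been done and that each sample carries a cluster label; the function name `cluster_entropy` is hypothetical, and the abstract does not state the exact metric definitions used in the paper.

```python
import math
from collections import Counter


def cluster_entropy(cluster_labels):
    """Shannon entropy (in nats) of the empirical distribution over clusters.

    cluster_labels: one label per sampled output; outputs judged to mean the
    same thing share a label. How labels are obtained is outside this sketch.
    """
    total = len(cluster_labels)
    if total == 0:
        raise ValueError("need at least one sample")
    counts = Counter(cluster_labels)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Four samples split evenly across two meanings: entropy is log(2).
print(cluster_entropy(["paris", "paris", "lyon", "lyon"]))
# All samples agree in meaning: entropy is zero.
print(cluster_entropy(["paris"] * 4))
```

Under this measure, a decoder whose samples differ in wording but agree in meaning yields low entropy, while samples that commit to different answers yield high entropy.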

[137] DuaShepherd: Integrating Stepwise Correctness and Potential Rewards for Mathematical Reasoning cs.CLPDF

Yuanhao Wu, Juntong Song, Hanning Zhang, Tong Zhang, Cheng Niu

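TL;DR: DuaShepherd is a new reward modeling framework that combines two reward signals, correctness and potential, to improve the mathematical reasoning of large language models (LLMs).

Details

Motivation: Existing reward models usually attend only to stepwise correctness and overlook a step's potential contribution to the final answer. DuaShepherd aims to close this gap.

Result: Outperforms single-reward models on MATH500 and ProcessBench, reaching state-of-the-art performance.

Insight: Combining complementary reward signals can significantly improve a model's mathematical reasoning.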

Abstract: In this paper, we propose DuaShepherd, a novel reward modeling framework that integrates two complementary reward signals, correctness and potential, to enhance the mathematical reasoning capabilities of Large Language Models (LLMs). While correctness-based signals emphasize identification of stepwise errors, potential-based signals focus on the likelihood of reaching the correct final answer. We developed an automated pipeline for constructing a large-scale reward modeling dataset with both signals. A unified, multi-head architecture was explored to train the two reward models in a multi-task setup, demonstrating benefits from learning both correctness and potential in parallel. By combining these two signals into a compound probability, our model achieves consistent performance improvements across multiple benchmarks. Empirical evaluations on MATH500 and ProcessBench confirm that this combined reward significantly outperforms models trained on either reward type alone, achieving state-of-the-art performance under comparable resource constraints.
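To make the "compound probability" idea concrete, the sketch below combines a per-step correctness probability and a per-step potential probability and uses the result to pick among candidate solutions. The abstract does not give the combination rule, so the product used here, the min-over-steps aggregation, and the names `compound_reward`, `score_solution`, and `pick_best` are all assumptions for illustration.

```python
def compound_reward(p_correct, p_potential):
    """Combine the two signals for one reasoning step.

    Assumption: a simple product, treating the step as good only if it is
    both locally correct and likely to lead to the right final answer.
    """
    return p_correct * p_potential


def score_solution(steps):
    """Score a solution by its weakest step (an assumed aggregation rule).

    steps: list of (p_correct, p_potential) pairs, one per reasoning step.
    """
    return min(compound_reward(c, p) for c, p in steps)


def pick_best(candidates):
    """Return the index of the highest-scoring candidate solution."""
    return max(range(len(candidates)), key=lambda i: score_solution(candidates[i]))


candidates = [
    [(0.9, 0.9), (0.2, 0.9)],  # contains one step that is likely wrong
    [(0.8, 0.7), (0.9, 0.6)],  # no step is clearly bad
]
print(pick_best(candidates))
```

A product penalizes a step if either signal is low, which matches the intuition that a step should be both free of errors and on a path to the correct answer.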

[138] Probing for Phonology in Self-Supervised Speech Representations: A Case Study on Accent Perception cs.CLPDF

Nitin Venkateswaran, Kevin Tang, Ratree Wayland

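TL;DR: This paper examines how self-supervised learning (SSL) speech representations encode the gradient variation in phonological features that shapes accent perception. It focuses on three specific segments, and probing analyses show that accent strength can be predicted from a subset of representation features.

Details

Motivation: Traditional models of accent perception underestimate the role of gradient variation in phonological features in listeners' accent judgments. The paper investigates how SSL speech representations encode this variation.

Result: Accent strength can be predicted from a subset of representation features, and distances from the accent baselines are strongly associated with accent ratings in the expected directions.

Insight: Self-supervised speech representations capture gradient variation in phonological features effectively, offering a new perspective for accent perception research.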

Abstract: Traditional models of accent perception underestimate the role of gradient variations in phonological features which listeners rely upon for their accent judgments. We investigate how pretrained representations from current self-supervised learning (SSL) models of speech encode phonological feature-level variations that influence the perception of segmental accent. We focus on three segments: the labiodental approximant, the rhotic tap, and the retroflex stop, which are uniformly produced in the English of native speakers of Hindi as well as other languages in the Indian sub-continent. We use the CSLU Foreign Accented English corpus (Lander, 2007) to extract, for these segments, phonological feature probabilities using Phonet (Vásquez-Correa et al., 2019) and pretrained representations from Wav2Vec2-BERT (Barrault et al., 2023) and WavLM (Chen et al., 2022) along with accent judgements by native speakers of American English. Probing analyses show that accent strength is best predicted by a subset of the segment's pretrained representation features, in which perceptually salient phonological features that contrast the expected American English and realized non-native English segments are given prominent weighting. A multinomial logistic regression of pretrained representation-based segment distances from American and Indian English baselines on accent ratings reveals strong associations between the odds of accent strength and distances from the baselines, in the expected directions. These results highlight the value of self-supervised speech representations for modeling accent perception using interpretable phonological features.

[139] AgriCHN: A Comprehensive Cross-domain Resource for Chinese Agricultural Named Entity Recognition cs.CLPDF

Lingxiao Zeng, Yiqi Tong, Wei Guo, Huarui Wu, Lihao Ge

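TL;DR: The paper presents AgriCHN, a comprehensive Chinese dataset for agricultural named entity recognition that fills the gap left by the lack of high-quality datasets in the agricultural domain and adds entities from related fields such as hydrology and meteorology to increase entity diversity.

Details

Motivation: Agricultural named entity recognition lacks high-quality Chinese datasets, and most prior work overlooks the connections between agriculture and fields such as hydrology and meteorology. The authors therefore propose the AgriCHN dataset.

Result: Experiments show that AgriCHN is challenging, has potential for further research, and offers better data quality than existing resources.

Insight: Entity recognition in agriculture should account for cross-domain links (such as hydrology and meteorology) to improve model generalization and practical value.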

Abstract: Agricultural named entity recognition is a specialized task focusing on identifying distinct agricultural entities within vast bodies of text, including crops, diseases, pests, and fertilizers. It plays a crucial role in enhancing information extraction from extensive agricultural text resources. However, the scarcity of high-quality agricultural datasets, particularly in Chinese, has resulted in suboptimal performance when employing mainstream methods for this purpose. Most earlier works focus only on annotating agricultural entities while overlooking the profound correlation of agriculture with hydrology and meteorology. To fill this gap, we present AgriCHN, a comprehensive open-source Chinese resource designed to promote the accuracy of automated agricultural entity annotation. The AgriCHN dataset has been meticulously curated from a wealth of agricultural articles, comprising a total of 4,040 sentences and encapsulating 15,799 agricultural entity mentions spanning 27 diverse entity categories. Furthermore, it encompasses entities from hydrology to meteorology, thereby enriching the diversity of entities considered. Data validation reveals that, compared with relevant resources, AgriCHN demonstrates outstanding data quality, attributable to its richer agricultural entity types and more fine-grained entity divisions. A benchmark task has also been constructed using several state-of-the-art neural NER models. Extensive experimental results highlight the significant challenge posed by AgriCHN and its potential for further research.

[140] Answer-Centric or Reasoning-Driven? Uncovering the Latent Memory Anchor in LLMs cs.CLPDF

Yang Wu, Yifan Zhang, Yiwei Wang, Yujun Cai, Yurong Wu

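TL;DR: The paper asks whether large language models (LLMs) rely more on memorized final answers than on genuine reasoning. Experiments show that model performance drops sharply when answer cues are masked, suggesting that the displayed reasoning may be post-hoc rationalization rather than true inference.

Details

Motivation: The authors seek to determine whether LLM performance on reasoning tasks truly stems from reasoning ability or merely from memorized answer-reasoning patterns.

Result: When answer cues are masked, model performance drops by 26.90%, even when the reasoning chain is complete.

Insight: LLM reasoning may depend more on memorized answer patterns than on a genuine inference process, which calls the models' inferential depth into question.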

Abstract: While Large Language Models (LLMs) demonstrate impressive reasoning capabilities, growing evidence suggests much of their success stems from memorized answer-reasoning patterns rather than genuine inference. In this work, we investigate a central question: are LLMs primarily anchored to final answers or to the textual pattern of reasoning chains? We propose a five-level answer-visibility prompt framework that systematically manipulates answer cues and probes model behavior through indirect, behavioral analysis. Experiments across state-of-the-art LLMs reveal a strong and consistent reliance on explicit answers. The performance drops by 26.90% when answer cues are masked, even with complete reasoning chains. These findings suggest that much of the reasoning exhibited by LLMs may reflect post-hoc rationalization rather than true inference, calling into question their inferential depth. Our study uncovers the answer-anchoring phenomenon with rigorous empirical validation and underscores the need for a more nuanced understanding of what constitutes reasoning in LLMs.

[141] Resource-Friendly Dynamic Enhancement Chain for Multi-Hop Question Answering cs.CLPDF

Binquan Ji, Haibo Luo, Yifei Lu, Lei Hei, Jiaqi Wang

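TL;DR: The paper proposes the DEC framework, which decomposes complex questions into sub-questions through a dynamic enhancement chain and uses a lightweight keyword extraction module to improve retrieval, significantly improving multi-hop question answering in resource-constrained settings.

Details

Motivation: Multi-hop question answering requires integrating evidence from multiple sources, but lightweight LLMs are prone to hallucination and semantic drift when handling long contexts, so a resource-friendly solution is needed.

Result: On three multi-hop QA datasets, DEC matches or surpasses the state of the art while significantly reducing token consumption, and achieves the best results on 8B-parameter models.

Insight: Dynamic decomposition and lightweight retrieval are effective ways to improve multi-hop QA performance in resource-constrained environments.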

Abstract: Knowledge-intensive multi-hop question answering (QA) tasks, which require integrating evidence from multiple sources to address complex queries, often necessitate multiple rounds of retrieval and iterative generation by large language models (LLMs). However, incorporating many documents and extended contexts poses challenges, such as hallucinations and semantic drift, for lightweight LLMs with fewer parameters. This work proposes a novel framework called DEC (Dynamic Enhancement Chain). DEC first decomposes complex questions into logically coherent subquestions to form a hallucination-free reasoning chain. It then iteratively refines these subquestions through context-aware rewriting to generate effective query formulations. For retrieval, we introduce a lightweight discriminative keyword extraction module that leverages extracted keywords to achieve targeted, precise document recall with relatively low computational overhead. Extensive experiments on three multi-hop QA datasets demonstrate that DEC performs on par with or surpasses state-of-the-art benchmarks while significantly reducing token consumption. Notably, our approach attains state-of-the-art results on models with 8B parameters, showcasing its effectiveness in various scenarios, particularly in resource-constrained environments.

[142] KAG-Thinker: Teaching Large Language Models to Think with Human-like Reasoning Process cs.CL | cs.AIPDF

Dalong Zhang, Jun Xu, Jun Zhou, Lei Liang, Lin Yuan

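TL;DR: KAG-Thinker simulates human cognitive mechanisms to improve the logical coherence and contextual consistency of large language models on question answering over domain-specific knowledge bases.

Details

Motivation: To address the logical and contextual inconsistency of large language models when reasoning over complex questions.

Result: Supervised fine-tuning on multi-turn dialogues aligns the model with the structured reasoning paradigm and avoids excessive reflection.

Insight: Treating the LLM and external knowledge sources as equivalent knowledge bases, and explicitly modeling dependencies through logical function interfaces, makes reasoning more flexible and accurate.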

Abstract: In this paper, we introduce KAG-Thinker, a novel human-like reasoning framework built upon a parameter-light large language model (LLM). Our approach enhances the logical coherence and contextual consistency of the thinking process in question-answering (Q&A) tasks on domain-specific knowledge bases (KBs) within LLMs. This framework simulates human cognitive mechanisms for handling complex problems by establishing a structured thinking process. Continuing the Logical Form guided retrieval and reasoning technology route of KAG v0.7, it first decomposes complex questions into independently solvable sub-problems (also referred to as logical forms) through breadth decomposition. Each sub-problem is represented in two equivalent forms, natural language and logical function, and is further classified as either a Knowledge Retrieval or a Reasoning Analysis task, with dependencies and variable passing explicitly modeled via logical function interfaces. In the solving process, the Retrieval function is used to perform knowledge retrieval tasks, while the Math and Deduce functions are used to perform reasoning analysis tasks. Second, in the Knowledge Retrieval sub-problem tasks, LLMs and external knowledge sources are regarded as equivalent KBs. We use the knowledge boundary model to determine the optimal source using self-regulatory mechanisms such as confidence calibration and reflective reasoning, and use the depth solving model to enhance the comprehensiveness of knowledge acquisition. Finally, instead of utilizing reinforcement learning, we employ supervised fine-tuning with multi-turn dialogues to align the model with our structured inference paradigm, thereby avoiding excessive reflection. This is supported by a data evaluation framework and iterative corpus synthesis, which facilitate the generation of detailed reasoning trajectories…

[143] THCM-CAL: Temporal-Hierarchical Causal Modelling with Conformal Calibration for Clinical Risk Prediction cs.CL | cs.AIPDF

Xin Zhang, Qiyu Wei, Yingjie Zhu, Fanyi Wu, Sophia Ananiadou

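TL;DR: THCM-CAL combines a temporal-hierarchical causal model with conformal calibration for clinical risk prediction. It integrates the structured diagnostic codes and unstructured narrative text found in electronic health records, and uses hierarchical causal discovery together with conformal calibration to make predictions more reliable.

Details

Motivation: Existing methods typically process the structured diagnostic codes and unstructured narrative text in electronic health records separately, or rely on simple fusion strategies, ignoring hierarchical causal interactions and their effect on risk propagation.

Result: Experiments on the MIMIC-III and MIMIC-IV datasets show that THCM-CAL outperforms existing methods.

Insight: Combining hierarchical causal modeling with conformal calibration improves the reliability and interpretability of clinical risk prediction and offers a new approach to multimodal data fusion.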

Abstract: Automated clinical risk prediction from electronic health records (EHRs) demands modeling both structured diagnostic codes and unstructured narrative notes. However, most prior approaches either handle these modalities separately or rely on simplistic fusion strategies that ignore the directional, hierarchical causal interactions by which narrative observations precipitate diagnoses and propagate risk across admissions. In this paper, we propose THCM-CAL, a Temporal-Hierarchical Causal Model with Conformal Calibration. Our framework constructs a multimodal causal graph where nodes represent clinical entities from two modalities: textual propositions extracted from notes and ICD codes mapped to textual descriptions. Through hierarchical causal discovery, THCM-CAL infers three clinically grounded interactions: intra-slice same-modality sequencing, intra-slice cross-modality triggers, and inter-slice risk propagation. To enhance prediction reliability, we extend conformal prediction to multi-label ICD coding, calibrating per-code confidence intervals under complex co-occurrences. Experimental results on MIMIC-III and MIMIC-IV demonstrate the superiority of THCM-CAL.
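
The abstract does not spell out how conformal prediction is extended to multi-label ICD coding. As a rough illustration of the general idea only, the sketch below applies textbook split conformal prediction independently per code. The function names, the example codes, the calibration probabilities, and the miscoverage level `alpha` are all hypothetical and are not taken from the paper.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)), capped at n.
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    return sorted(cal_scores)[rank - 1]

def predict_code_set(probs, thresholds):
    """Keep every code whose nonconformity (1 - p) is within its threshold."""
    return {code for code, p in probs.items() if 1.0 - p <= thresholds[code]}

# Hypothetical calibration data: model probabilities on held-out admissions
# where each code was truly present.
cal_probs = {
    "I10": [0.9, 0.8, 0.7, 0.95, 0.6, 0.85, 0.75, 0.9, 0.65],
    "E11": [0.5, 0.4, 0.7, 0.6, 0.3, 0.55, 0.45, 0.65, 0.35],
}
alpha = 0.2  # hypothetical target miscoverage per code
thresholds = {
    code: conformal_threshold([1.0 - p for p in ps], alpha)
    for code, ps in cal_probs.items()
}

# A new admission: I10 is predicted confidently enough, E11 is not.
prediction = predict_code_set({"I10": 0.7, "E11": 0.2}, thresholds)
```

Calibrating each code separately ignores the co-occurrence structure the abstract emphasizes; it is shown here only because it is the simplest baseline form of the technique.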

[144] How Alignment Shrinks the Generative Horizon cs.CL | cs.AI | cs.LGPDF

Chenghao Yang, Ari Holtzman

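TL;DR: This paper introduces the Branching Factor (BF) to investigate why aligned large language models (LLMs) produce less diverse outputs. It finds that alignment tuning substantially lowers BF, which helps explain the stability of their generations.

Details

Motivation: Aligned LLMs are highly capable, yet their outputs often lack diversity. To understand what drives this, the authors study probability concentration in the model's output distribution.

Result: Alignment tuning substantially reduces BF (e.g., from 12 to 1.2), making generation more stable. CoT models reach later, lower-BF stages of generation by producing longer reasoning chains, which also makes their outputs more stable.

Insight: The authors hypothesize that alignment tuning achieves stability by steering the model onto low-entropy trajectories (for example through particular stylistic tokens) rather than by changing its core behavior. This offers a new perspective for understanding and controlling LLM outputs.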

Abstract: Despite their impressive capabilities, aligned large language models (LLMs) often generate outputs that lack diversity. What drives this stability in the generation? We investigate this phenomenon through the lens of probability concentration in the model’s output distribution. To quantify this concentration, we introduce the Branching Factor (BF) – a token-invariant measure of the effective number of plausible next steps during generation. Our empirical analysis reveals two key findings: (1) BF often decreases as generation progresses, suggesting that LLMs become more predictable as they generate. (2) alignment tuning substantially sharpens the model’s output distribution from the outset, reducing BF by nearly an order of magnitude (e.g., from 12 to 1.2) relative to base models. This stark reduction helps explain why aligned models often appear less sensitive to decoding strategies. Building on this insight, we find this stability has surprising implications for complex reasoning. Aligned Chain-of-Thought (CoT) models (e.g., DeepSeek-distilled models), for instance, leverage this effect; by generating longer reasoning chains, they push generation into later, more deterministic (lower BF) stages, resulting in more stable outputs. We hypothesize that alignment tuning does not fundamentally change a model’s behavior, but instead steers it toward stylistic tokens (e.g., “Sure”) that unlock low-entropy trajectories already present in the base model. This view is supported by nudging experiments, which show that prompting base models with such tokens can similarly reduce BF. Together, our findings establish BF as a powerful diagnostic for understanding and controlling LLM outputs - clarifying how alignment reduces variability, how CoT promotes stable generations, and how base models can be steered away from diversity.
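
The abstract describes BF only as "the effective number of plausible next steps". One common way to turn a probability distribution into an effective count is the exponential of its Shannon entropy, and the sketch below uses that as an assumption. It is not claimed to be the paper's exact definition, and the two example distributions are made up.

```python
import math

def effective_branching(probs):
    """exp(Shannon entropy): the effective number of equally likely options."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return math.exp(entropy)

# Hypothetical next-token distributions.
uniform = [0.25, 0.25, 0.25, 0.25]   # four equally plausible continuations
peaked = [0.97, 0.01, 0.01, 0.01]    # one dominant continuation

bf_uniform = effective_branching(uniform)  # close to 4
bf_peaked = effective_branching(peaked)    # only slightly above 1
```

Under this reading, a drop in BF from roughly 12 to 1.2 would mean the model goes from about a dozen comparably plausible continuations to barely more than one.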

[145] PDF Retrieval Augmented Question Answering cs.CLPDF

Thi Thu Uyen Hoang, Viet Anh Nguyen

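TL;DR: This paper presents a question-answering system built on a Retrieval Augmented Generation (RAG) framework that focuses on efficiently extracting multimodal information from PDF files, addressing the limitation that existing QA systems mainly target textual content.

Details

Motivation: PDF files contain rich multimodal data (text, images, vector diagrams, graphs, tables), but existing QA systems are designed mainly for plain text and struggle with such complex queries. A QA system that integrates multimodal information effectively is therefore needed.

Result: Experiments show that the system extracts multimodal information from PDFs accurately and gives accurate answers across different types of complex queries.

Insight: The work extends the capabilities of retrieval-augmented QA systems and lays a foundation for further research on multimodal data integration and processing.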

Abstract: This paper presents an advancement in Question-Answering (QA) systems using a Retrieval Augmented Generation (RAG) framework to enhance information extraction from PDF files. The richness and diversity of data within PDFs, including text, images, vector diagrams, graphs, and tables, poses unique challenges for existing QA systems, which are primarily designed for textual content. We seek to develop a comprehensive RAG-based QA system that will effectively address complex multimodal questions, where several data types are combined in the query. This is mainly achieved by refining approaches to processing and integrating non-textual elements in PDFs into the RAG framework to derive precise and relevant answers, as well as fine-tuning large language models to better adapt to our system. We provide an in-depth experimental evaluation of our solution, demonstrating its capability to extract accurate information that can be applied to different types of content across PDFs. This work not only pushes the boundaries of retrieval-augmented QA systems but also lays a foundation for further research in multimodal data integration and processing.

[146] InspireDebate: Multi-Dimensional Subjective-Objective Evaluation-Guided Reasoning and Optimization for Debating cs.CLPDF

Fuyu Wang, Jiangtong Li, Kun Zhu, Changjun Jiang

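TL;DR: This paper proposes InspireDebate, a dual-component framework that improves performance on debating tasks through multi-dimensional evaluation and optimization.

Details

Motivation: Existing LLM-based debating systems neglect objective assessments (such as authenticity and logical validity) and lack a structured optimization approach, which limits their effectiveness.

Result: InspireScore achieves 44% higher correlation with expert judgments than existing methods, and InspireDebate outperforms baseline models by 57%.

Insight: Combining subjective and objective evaluation with phased optimization is essential to improving debating systems.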

Abstract: With the rapid advancements in large language models (LLMs), debating tasks, such as argument quality assessment and debate process simulation, have made significant progress. However, existing LLM-based debating systems focus on responding to specific arguments while neglecting objective assessments such as authenticity and logical validity. Furthermore, these systems lack a structured approach to optimize across various dimensions, including evaluation metrics, chain-of-thought (CoT) reasoning, and multi-turn debate refinement, thereby limiting their effectiveness. To address these interconnected challenges, we propose a dual-component framework: (1) InspireScore, a novel evaluation system that establishes a multi-dimensional assessment architecture incorporating four subjective criteria (emotional appeal, argument clarity, argument arrangement, and topic relevance) alongside two objective metrics (fact authenticity and logical validity); and (2) InspireDebate, an optimized debating framework employing a phased optimization approach through CoT reasoning enhancement, multi-dimensional Direct Preference Optimization (DPO), and real-time knowledge grounding via web-based Retrieval Augmented Generation (Web-RAG). Empirical evaluations demonstrate that InspireScore achieves 44% higher correlation with expert judgments compared to existing methods, while InspireDebate shows significant improvements, outperforming baseline models by 57%. Source code is available at https://github.com/fywang12/InspireDebate.

[147] Mental Health Equity in LLMs: Leveraging Multi-Hop Question Answering to Detect Amplified and Silenced Perspectives cs.CL | cs.AI | cs.CYPDF

Batool Haider, Atmika Gorti, Aman Chadha, Manas Gaur

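TL;DR: This paper proposes a multi-hop question answering (MHQA) framework for detecting biases in large language models (LLMs) in the mental health domain, and demonstrates the effectiveness of two debiasing techniques.

Details

Motivation: LLMs used in mental healthcare may propagate biases and thereby increase harm to marginalized groups. Existing systematic detection methods are limited, so more effective tools are needed to identify and mitigate these biases.

Result: Experiments show that the MHQA approach detects bias more effectively than conventional methods. Two debiasing techniques, applied through few-shot prompting with BBQ dataset examples, reduce bias by 66-94%.

Insight: The paper reveals the complexity of LLM bias in mental health, in particular how biases are amplified through sequential reasoning. The success of the debiasing techniques offers a practical path toward fairer AI.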

Abstract: Large Language Models (LLMs) in mental healthcare risk propagating biases that reinforce stigma and harm marginalized groups. While previous research identified concerning trends, systematic methods for detecting intersectional biases remain limited. This work introduces a multi-hop question answering (MHQA) framework to explore LLM response biases in mental health discourse. We analyze content from the Interpretable Mental Health Instruction (IMHI) dataset across symptom presentation, coping mechanisms, and treatment approaches. Using systematic tagging across age, race, gender, and socioeconomic status, we investigate bias patterns at demographic intersections. We evaluate four LLMs: Claude 3.5 Sonnet, Jamba 1.6, Gemma 3, and Llama 4, revealing systematic disparities across sentiment, demographics, and mental health conditions. Our MHQA approach demonstrates superior detection compared to conventional methods, identifying amplification points where biases magnify through sequential reasoning. We implement two debiasing techniques: Roleplay Simulation and Explicit Bias Reduction, achieving 66-94% bias reductions through few-shot prompting with BBQ dataset examples. These findings highlight critical areas where LLMs reproduce mental healthcare biases, providing actionable insights for equitable AI development.

[148] Deciphering Emotions in Children Storybooks: A Comparative Analysis of Multimodal LLMs in Educational Applications cs.CL | cs.CV | cs.HCPDF

Bushra Asseri, Estabraq Abdelaziz, Maha Al Mogren, Tayef Alhefdhi, Areej Al-Wabil

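TL;DR: This paper compares GPT-4o and Gemini 1.5 Pro on emotion recognition in Arabic children's storybook illustrations, finds that GPT-4o performs better, and exposes the limits of current models' cultural understanding.

Details

Motivation: In Arabic-language educational technology, emotion recognition is crucial for building culturally sensitive learning tools, yet the area remains underexplored.

Result: GPT-4o outperforms Gemini under all tested conditions, with a best macro F1-score of 59% versus 43% for Gemini. Both models make systematic errors on culturally nuanced emotions and ambiguous narrative contexts.

Insight: Current models have limited cultural understanding. More culturally sensitive training approaches are needed to develop emotion-aware educational technologies for Arabic-speaking learners.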

Abstract: Emotion recognition capabilities in multimodal AI systems are crucial for developing culturally responsive educational technologies, yet remain underexplored for Arabic language contexts where culturally appropriate learning tools are critically needed. This study evaluates the emotion recognition performance of two advanced multimodal large language models, GPT-4o and Gemini 1.5 Pro, when processing Arabic children’s storybook illustrations. We assessed both models across three prompting strategies (zero-shot, few-shot, and chain-of-thought) using 75 images from seven Arabic storybooks, comparing model predictions with human annotations based on Plutchik’s emotional framework. GPT-4o consistently outperformed Gemini across all conditions, achieving the highest macro F1-score of 59% with chain-of-thought prompting compared to Gemini’s best performance of 43%. Error analysis revealed systematic misclassification patterns, with valence inversions accounting for 60.7% of errors, while both models struggled with culturally nuanced emotions and ambiguous narrative contexts. These findings highlight fundamental limitations in current models’ cultural understanding and emphasize the need for culturally sensitive training approaches to develop effective emotion-aware educational technologies for Arabic-speaking learners.
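
The headline numbers above are macro F1-scores, which average the per-class F1 so that rare emotions count as much as frequent ones. A minimal sketch of the computation follows. The label lists are hypothetical toy data using Plutchik emotion names, not the study's annotations.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical human annotations and model predictions for six illustrations.
human = ["joy", "joy", "fear", "sadness", "joy", "fear"]
model = ["joy", "sadness", "fear", "sadness", "joy", "joy"]
score = macro_f1(human, model)
```

In this toy case each of the three classes has an F1 of 2/3, so the macro average is also 2/3.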

[149] Enhancing Entity Aware Machine Translation with Multi-task Learning cs.CLPDF

An Trieu, Phuong Nguyen, Minh Le Nguyen

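TL;DR: This paper proposes a method that uses multi-task learning to optimize the named entity recognition and machine translation subtasks, improving performance on entity-aware machine translation (EAMT).

Details

Motivation: Entity-aware machine translation is a complex task: translation data related to entities is scarce, and the context that must be processed is complex.

Result: Experiments on the dataset provided for Task 2 of the SemEval 2025 competition validate the method.

Insight: Multi-task learning can effectively combine related subtasks to improve the main task, especially when data is scarce.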

Abstract: Entity-aware machine translation (EAMT) is a complicated task in natural language processing, owing both to the shortage of translation data for the entities to be translated and to the complexity of the context that must be processed when translating those entities. In this paper, we propose a method that applies multi-task learning to optimize the performance of two subtasks, named entity recognition and machine translation, which improves the final performance of the entity-aware machine translation task. Results and analysis are reported on the dataset provided by the organizer of Task 2 of the SemEval 2025 competition.

[150] TranslationCorrect: A Unified Framework for Machine Translation Post-Editing with Predictive Error Assistance cs.CLPDF

Syed Mekael Wasti, Shou-Yi Hung, Christopher Collins, En-Shiun Annie Lee

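TL;DR: TranslationCorrect is an integrated framework for machine translation post-editing that combines MT generation, automated error prediction, and an intuitive post-editing interface, aiming to improve translation efficiency and research data collection.

Details

Motivation: Existing machine translation post-editing and data collection workflows are inefficient and disconnected, and a unified solution is needed.

Result: A user study shows that TranslationCorrect significantly improves translation efficiency and user satisfaction over traditional annotation methods.

Insight: By integrating MT generation, error prediction, and a post-editing interface, TranslationCorrect improves translation efficiency and also yields high-quality annotations for research data collection.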

Abstract: Machine translation (MT) post-editing and research data collection often rely on inefficient, disconnected workflows. We introduce TranslationCorrect, an integrated framework designed to streamline these tasks. TranslationCorrect combines MT generation using models like NLLB, automated error prediction using models like XCOMET or LLM APIs (providing detailed reasoning), and an intuitive post-editing interface within a single environment. It is built with human-computer interaction (HCI) principles in mind to minimize cognitive load, as confirmed by a user study. For translators, it enables them to correct errors and batch translate efficiently. For researchers, TranslationCorrect exports high-quality span-based annotations in the Error Span Annotation (ESA) format, using an error taxonomy inspired by Multidimensional Quality Metrics (MQM). These outputs are compatible with state-of-the-art error detection models and suitable for training MT or post-editing systems. Our user study confirms that TranslationCorrect significantly improves translation efficiency and user satisfaction over traditional annotation methods.

[151] Less Data Less Tokens: Multilingual Unification Learning for Efficient Test-Time Reasoning in LLMs cs.CLPDF

Kang Chen, Mengdi Zhang, Yixin Cao

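TL;DR: This paper proposes L², a multilingual unification learning method that exploits the diversity of multilingual data to improve the reasoning efficiency and performance of large language models (LLMs) while reducing the data and inference tokens required.

Details

Motivation: Test-time scaling of LLMs faces challenges in both data and inference efficiency. Reasoning processes differ across languages, and these differences may be mutually beneficial for efficiency and performance.

Result: L² significantly reduces the required data and inference tokens while maintaining comparable performance, and it is orthogonal to other data-efficient methods.

Insight: (1) The diversity of multilingual data matters for LLM performance and efficiency. (2) Even small amounts of multilingual data can yield significant gains. (3) Diverse data selection is crucial for efficient learning.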

Abstract: This paper explores the challenges of test-time scaling of large language models (LLMs), regarding both the data and inference efficiency. We highlight the diversity of multi-lingual reasoning based on our pilot studies, and then introduce a novel approach, L² multi-lingual unification learning with a decoding intervention strategy, for further investigation. The basic idea of L² is that the reasoning process varies across different languages, which may be mutually beneficial to enhance both model performance and efficiency. Specifically, there are two types of multi-lingual data: the entire long chain-of-thought annotations in different languages and the step-wise mixture of languages. By further tuning based on them, we show that even small amounts of data can significantly improve reasoning capabilities. Our findings suggest that multilingual learning reduces both the required data and the number of inference tokens while maintaining a comparable performance. Furthermore, L² is orthogonal to other data efficient methods. Thus, we also emphasize the importance of diverse data selection. The L² method offers a promising solution to the challenges of data collection and test-time compute efficiency in LLMs.

[152] Evaluating Causal Explanation in Medical Reports with LLM-Based and Human-Aligned Metrics cs.CL | cs.AIPDF

Yousang Cho, Key-Sun Choi

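TL;DR: This study examines how accurately different evaluation metrics capture the quality of causal explanations in automatically generated diagnostic reports. GPT-Black is best at identifying logically coherent and clinically valid explanations, and the choice and weighting of metrics affect evaluation outcomes.

Details

Motivation: To find out how to evaluate the quality of causal explanations in automatically generated medical reports more accurately, particularly with respect to logical coherence and clinical validity.

Result: GPT-Black shows the strongest discriminative power in identifying logically coherent and clinically valid causal narratives, while similarity-based metrics diverge from clinical reasoning quality.

Insight: LLM-based evaluation (especially GPT-Black) shows promise for tasks requiring interpretability and causal reasoning, and metric selection should take task characteristics into account.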

Abstract: This study investigates how accurately different evaluation metrics capture the quality of causal explanations in automatically generated diagnostic reports. We compare six metrics: BERTScore, Cosine Similarity, BioSentVec, GPT-White, GPT-Black, and expert qualitative assessment across two input types: observation-based and multiple-choice-based report generation. Two weighting strategies are applied: one reflecting task-specific priorities, and the other assigning equal weights to all metrics. Our results show that GPT-Black demonstrates the strongest discriminative power in identifying logically coherent and clinically valid causal narratives. GPT-White also aligns well with expert evaluations, while similarity-based metrics diverge from clinical reasoning quality. These findings emphasize the impact of metric selection and weighting on evaluation outcomes, supporting the use of LLM-based evaluation for tasks requiring interpretability and causal reasoning.

[153] TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models cs.CL | cs.AIPDF

Ce Li, Xiaofan Liu, Zhiyan Song, Ce Chi, Chen Zhao

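TL;DR: TReB is a comprehensive benchmark for evaluating how well large language models understand and reason over tabular data, covering 26 sub-tasks. Using a high-quality dataset and an evaluation framework, experiments show that existing models still have room to improve on complex table tasks.

Details

Motivation: Most business and industry data is stored in tables, but large language models face challenges in table reasoning such as hidden semantics and structural complexity. The lack of an effective evaluation benchmark is one of the main problems.

Result: Experiments show that existing large language models still have significant room for improvement on complex table tasks.

Insight: TReB provides a standardized tool for evaluating table reasoning, exposes model limitations on structured data, and points to directions for future research.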

Abstract: The majority of data in businesses and industries is stored in tables, databases, and data warehouses. Reasoning with table-structured data poses significant challenges for large language models (LLMs) due to its hidden semantics, inherent complexity, and structured nature. One of these challenges is the lack of an effective evaluation benchmark that fairly reflects the performance of LLMs on broad table reasoning abilities. In this paper, we fill this gap, presenting a comprehensive table reasoning evolution benchmark, TReB, which measures both shallow table understanding abilities and deep table reasoning abilities, a total of 26 sub-tasks. We construct a high-quality dataset through an iterative data processing procedure. We create an evaluation framework to robustly measure table reasoning capabilities with three distinct inference modes, TCoT, PoT and ICoT. Further, we benchmark over 20 state-of-the-art LLMs using this framework and prove its effectiveness. Experimental results reveal that existing LLMs still have significant room for improvement in addressing complex, real-world table-related tasks. Both the dataset and evaluation framework are publicly available, with the dataset hosted on [HuggingFace] and the framework on [GitHub].

[154] MeRF: Motivation-enhanced Reinforcement Finetuning for Large Reasoning Models cs.CL | cs.AIPDF

Junjie Zhang, Guozheng Ma, Shunyu Liu, Haoyu Wang, Jiaxing Huang

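TL;DR: MeRF combines reinforcement learning with in-context learning by injecting the reward specification directly into the prompt, motivating the LLM to generate better outputs. It performs strongly on a logical reasoning benchmark.

Details

Motivation: Existing RLVR methods overlook the in-context learning ability of LLMs (as shown by the success of CoT prompting). The authors aim to combine reinforcement learning with in-context learning to improve LLM reasoning.

Result: On the Knights and Knaves logic puzzle benchmark, MeRF substantially outperforms baselines and can adapt to misleading motivations.

Insight: Consistency between the in-context motivation and the external reward is key to performance gains, and LLMs can adapt to inconsistent motivations through reinforcement learning.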

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful learn-to-reason paradigm for Large Language Models (LLMs) to tackle complex reasoning tasks. However, existing RLVR methods overlook one of the most distinctive capabilities of LLMs, their in-context learning ability, as prominently demonstrated by the success of Chain-of-Thought (CoT) prompting. This motivates us to explore how reinforcement learning can be effectively combined with in-context learning to better improve the reasoning capabilities of LLMs. In this paper, we introduce Motivation-enhanced Reinforcement Finetuning (MeRF), an intuitive yet effective method enhancing reinforcement learning of LLMs by involving “telling LLMs the rules of the game”. Specifically, MeRF directly injects the reward specification into the prompt, which serves as an in-context motivation for the model to improve its responses with awareness of the optimization objective. This simple modification leverages the in-context learning ability of LLMs, aligning generation with optimization, thereby incentivizing the model to generate desired outputs from both inner motivation and external reward. Empirical evaluations on the Knights and Knaves (K&K) logic puzzle reasoning benchmark demonstrate that MeRF achieves substantial performance gains over baselines. Moreover, ablation studies show that performance improves with greater consistency between the in-context motivation and the external reward function, while the model also demonstrates an ability to adapt to misleading motivations through reinforcement learning.

[155] Comparative Evaluation of ChatGPT and DeepSeek Across Key NLP Tasks: Strengths, Weaknesses, and Domain-Specific Performance cs.CL | cs.AIPDF

Wael Etaiwi, Bushra Alhijawi

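TL;DR: This paper experimentally compares ChatGPT and DeepSeek on five key NLP tasks, revealing their respective strengths and limitations and providing a basis for task-oriented model selection.

Details

Motivation: As large language models (LLMs) are widely used in NLP tasks, comprehensively evaluating their performance across tasks and domains has become important for helping users choose a suitable model.

Result: DeepSeek is stronger in classification stability and logical reasoning, while ChatGPT performs better on tasks requiring nuanced understanding and flexibility.

Insight: Model choice should depend on the requirements of the task: DeepSeek suits more structured, logic-heavy tasks, while ChatGPT suits tasks requiring flexibility and creative understanding.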

Abstract: The increasing use of large language models (LLMs) in natural language processing (NLP) tasks has sparked significant interest in evaluating their effectiveness across diverse applications. While models like ChatGPT and DeepSeek have shown strong results in many NLP domains, a comprehensive evaluation is needed to understand their strengths, weaknesses, and domain-specific abilities. This is critical as these models are applied to various tasks, from sentiment analysis to more nuanced tasks like textual entailment and translation. This study aims to evaluate ChatGPT and DeepSeek across five key NLP tasks: sentiment analysis, topic classification, text summarization, machine translation, and textual entailment. A structured experimental protocol is used to ensure fairness and minimize variability. Both models are tested with identical, neutral prompts and evaluated on two benchmark datasets per task, covering domains like news, reviews, and formal/informal texts. The results show that DeepSeek excels in classification stability and logical reasoning, while ChatGPT performs better in tasks requiring nuanced understanding and flexibility. These findings provide valuable insights for selecting the appropriate LLM based on task requirements.

[156] Parallel Continuous Chain-of-Thought with Jacobi Iteration cs.CLPDF

Haoyi Wu, Zhihao Teng, Kewei Tu

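TL;DR: The paper proposes Parallel Continuous Chain-of-Thought (PCCoT), which uses Jacobi iteration to update latent thought tokens in parallel rather than sequentially, improving training and inference efficiency.

Details

Motivation: Continuous chain-of-thought (CoT) reasons implicitly through latent tokens, but the sequential dependencies between them lead to long training times and limit efficiency.

Result: With comparable or better performance, PCCoT saves nearly 50% of training and inference time, and training is more stable.

Insight: Updating latent thought tokens in parallel is feasible, and Jacobi iteration offers an efficient way around the sequential dependency.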

Abstract: Continuous chain-of-thought has been shown to be effective in saving reasoning tokens for large language models. By reasoning with continuous latent thought tokens, continuous CoT is able to perform implicit reasoning in a compact manner. However, the sequential dependencies between latent thought tokens spoil parallel training, leading to long training time. In this paper, we propose Parallel Continuous Chain-of-Thought (PCCoT), which performs Jacobi iteration on the latent thought tokens, updating them iteratively in parallel instead of sequentially and thus improving both training and inference efficiency of continuous CoT. Experiments demonstrate that by choosing the proper number of iterations, we are able to achieve comparable or even better performance while saving nearly 50% of the training and inference time. Moreover, PCCoT shows better stability and robustness in the training process. Our code is available at https://github.com/whyNLP/PCCoT.
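
To make the contrast between sequential and Jacobi-style updates concrete, here is a toy sketch. It is not the PCCoT model: the scalar `step` function is a hypothetical stand-in for the transformer pass that produces one latent thought from the previous one, and the initial guess of zeros is arbitrary.

```python
def step(prev):
    """Toy stand-in for producing one latent thought from the previous one."""
    return 0.5 * prev + 1.0

def sequential(h0, num_tokens):
    """Baseline: each token waits for the one before it."""
    out, prev = [], h0
    for _ in range(num_tokens):
        prev = step(prev)
        out.append(prev)
    return out

def jacobi(h0, num_tokens, num_iters):
    """Jacobi iteration: refresh all tokens at once from the previous iterate."""
    tokens = [0.0] * num_tokens  # arbitrary initial guess
    for _ in range(num_iters):
        inputs = [h0] + tokens[:-1]         # token i reads the old value of token i-1
        tokens = [step(x) for x in inputs]  # independent updates, parallelizable
    return tokens

exact = sequential(4.0, 5)     # [3.0, 2.5, 2.25, 2.125, 2.0625]
approx = jacobi(4.0, 5, 2)     # only the first two tokens are exact so far
converged = jacobi(4.0, 5, 5)  # matches the sequential result
```

After k iterations the first k tokens agree with the sequential result, so the number of iterations trades accuracy against the amount of sequential work. This matches the abstract's remark that the number of iterations must be chosen properly.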

[157] Reply to “Emergent LLM behaviors are observationally equivalent to data leakage” cs.CL | cs.GT | cs.MAPDF

Ariel Flint Ashery, Luca Maria Aiello, Andrea Baronchelli

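TL;DR: The reply rebuts a critique concerning data leakage in studies of LLM populations, arguing that although data contamination is a potential concern, self-organisation and emergent behavior in LLMs can still be studied.

Details

Motivation: To respond to Barrie and Törnberg's challenge to emergent behavior in LLM populations and to clarify that data contamination does not preclude the study of genuinely emergent dynamics.

Result: Argues that data contamination does not prevent the study of emergent behavior in LLM populations, and supports this with the specific case of social conventions.

Insight: Data contamination is a potential concern, but studying emergent behavior in LLMs remains feasible and has empirical grounding.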

Abstract: A potential concern when simulating populations of large language models (LLMs) is data contamination, i.e. the possibility that training data may shape outcomes in unintended ways. While this concern is important and may hinder certain experiments with multi-agent models, it does not preclude the study of genuinely emergent dynamics in LLM populations. The recent critique by Barrie and Törnberg [1] of the results of Flint Ashery et al. [2] offers an opportunity to clarify that self-organisation and model-dependent emergent dynamics can be studied in LLM populations, highlighting how such dynamics have been empirically observed in the specific case of social conventions.

[158] Semantic similarity estimation for domain specific data using BERT and other techniques cs.CL | stat.APPDF

R. Prashanth

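TL;DR: This paper studies several techniques for semantic similarity estimation, including USE, InferSent, and BERT, and finds BERT superior on both a domain-specific dataset and a public dataset.

Details

Motivation: Semantic similarity estimation has broad applications in natural language processing and understanding, and the needs of domain-specific data motivate a search for the best method.

Result: BERT performs best, outperforming the other methods. The authors attribute this to its fine-tuning procedure, which lets it learn patterns in the training data.

Insight: The authors conclude that BERT is the best choice for domain-specific data, with fine-tuning significantly improving semantic similarity estimation.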

Abstract: Estimation of semantic similarity is an important research problem in both natural language processing and natural language understanding, with broad application to downstream tasks such as question answering, semantic search, information retrieval, document clustering, word-sense disambiguation and machine translation. In this work, we carry out the estimation of semantic similarity using different state-of-the-art techniques including the USE (Universal Sentence Encoder), InferSent and the most recent BERT, or Bidirectional Encoder Representations from Transformers, models. We use two question-pair datasets for the analysis: one is a domain-specific in-house dataset and the other is a public dataset, Quora’s question pairs dataset. We observe that the BERT model gave much superior performance compared to the other methods. This is likely because of the fine-tuning procedure involved in its training process, which allows it to learn patterns from the training data. This work demonstrates the applicability of BERT on domain-specific datasets. We infer from the analysis that BERT is the best technique to use in the case of domain-specific data.

[159] The Anatomy of Speech Persuasion: Linguistic Shifts in LLM-Modified Speeches cs.CLPDF

Alisa Barkar, Mathieu Chollet, Matthieu Labeau, Beatrice Biancardi, Chloe Clavel

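TL;DR: The study uses an LLM to modify speeches and analyzes the linguistic shifts that occur when persuasiveness is enhanced or diminished, finding that GPT-4o applies systematic stylistic adjustments rather than human-like optimization.

Details

Motivation: To investigate how large language models understand and manipulate persuasiveness in public speaking, filling a gap in research on the linguistic features and rhetorical strategies of LLMs.

Result: GPT-4o amplifies rhetorical impact by manipulating emotional lexicon and syntactic structures (such as interrogative and exclamatory clauses) rather than optimizing persuasiveness as a human would.

Insight: In generation, LLMs favor stylistic uniformity over content optimization, and the way they manipulate persuasiveness differs markedly from human strategies.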

Abstract: This study examines how large language models understand the concept of persuasiveness in public speaking by modifying speech transcripts from PhD candidates in the “Ma These en 180 Secondes” competition, using the 3MT French dataset. Our contributions include a novel methodology and an interpretable textual feature set integrating rhetorical devices and discourse markers. We prompt GPT-4o to enhance or diminish persuasiveness and analyze linguistic shifts between original and generated speech in terms of the new features. Results indicate that GPT-4o applies systematic stylistic modifications rather than optimizing persuasiveness in a human-like manner. Notably, it manipulates emotional lexicon and syntactic structures (such as interrogative and exclamatory clauses) to amplify rhetorical impact.

[160] ByteSpan: Information-Driven Subword Tokenisation cs.CLPDF

Zébulon Goriely, Suchir Salhan, Pietro Lesci, Julius Cheng, Paula Buttery

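TL;DR: ByteSpan is a new information-driven subword tokenisation method that uses an external byte-level language model to identify predictable byte sequences and group them into subwords. Experiments show higher morphological alignment than BPE for English.

Details

Motivation: Inspired by word segmentation models that determine lexical boundaries from prediction error, the authors explore whether grouping predictable bytes (rather than only pooling their representations) can yield a more useful fixed subword vocabulary.

Result: ByteSpan vocabularies show higher morphological alignment than BPE for English, and similar compression and Rényi efficiency across 25 languages.

Insight: Grouping predictable bytes, rather than only pooling their representations, can yield an efficient subword vocabulary, and the behavior holds up in multilingual settings.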

Abstract: Recent dynamic tokenisation methods operate directly on bytes and pool their latent representations into patches. This bears similarities to computational models of word segmentation that determine lexical boundaries using spikes in an autoregressive model’s prediction error. Inspired by this connection, we explore whether grouping predictable bytes - rather than pooling their representations - can yield a useful fixed subword vocabulary. We propose a new information-driven subword tokeniser, ByteSpan, that uses an external byte-level LM during training to identify contiguous predictable byte sequences and group them into subwords. Experiments show that ByteSpan yields efficient vocabularies with higher morphological alignment scores than BPE for English. Multilingual experiments show similar compression and Rényi efficiency for 25 languages.
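
The abstract does not say which rule ByteSpan uses to decide where a predictable span ends. As one plausible reading, the sketch below starts a new span whenever a byte's surprisal exceeds a fixed threshold. The threshold rule, the function name, and the surprisal values are assumptions made for illustration, not the paper's procedure.

```python
def group_predictable_bytes(data, surprisal, threshold):
    """Split data so that each span starts at a high-surprisal byte."""
    spans, start = [], 0
    for i in range(1, len(data)):
        if surprisal[i] > threshold:  # unpredictable byte: close the current span
            spans.append(data[start:i])
            start = i
    spans.append(data[start:])
    return spans

text = b"the cat"
# Hypothetical per-byte surprisal (in bits) from an external byte-level LM.
bits = [5.0, 1.0, 0.5, 0.8, 4.5, 1.2, 0.9]
spans = group_predictable_bytes(text, bits, threshold=3.0)
```

Here the spike at the byte `c` opens a new span, mirroring the word-segmentation idea of placing boundaries at spikes in prediction error. Building a fixed vocabulary from such spans would still require a further selection step that the abstract does not describe.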

[161] Existing LLMs Are Not Self-Consistent For Simple Tasks cs.CLPDF

Zhenru Lin, Jiawen Tao, Yang Yuan, Andrew Chi-Chih Yao

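TL;DR: The study finds that large language models (LLMs) are inconsistent even on simple tasks, and that even advanced models such as DeepSeek-R1 and GPT-o4-mini are not fully self-consistent. The paper introduces inconsistency metrics and proposes two automated methods to mitigate the problem.

Details

Motivation: However capable LLMs are, the transparency and trustworthiness of their decisions depend on self-consistency (no contradictions in their internal reasoning). The finding that models are inconsistent even on simple tasks underlines the importance of addressing the problem.

Result: Existing LLMs, including state-of-the-art models, are inconsistent on simple tasks. The proposed methods partially improve the situation but do not fully resolve it.

Insight: Self-consistency matters for the reliability and interpretability of AI, and automated methods alone struggle to fully solve this complex problem.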

Abstract: Large Language Models (LLMs) have grown increasingly powerful, yet ensuring their decisions remain transparent and trustworthy requires self-consistency – no contradictions in their internal reasoning. Our study reveals that even on simple tasks, such as comparing points on a line or a plane, or reasoning in a family tree, all smaller models are highly inconsistent, and even state-of-the-art models like DeepSeek-R1 and GPT-o4-mini are not fully self-consistent. To quantify and mitigate these inconsistencies, we introduce inconsistency metrics and propose two automated methods – a graph-based and an energy-based approach. While these fixes provide partial improvements, they also highlight the complexity and importance of self-consistency in building more reliable and interpretable AI. The code and data are available at https://github.com/scorpio-nova/llm-self-consistency.
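
For a task such as comparing points on a line, one simple way to expose a contradiction is to treat each answer "a is left of b" as a directed edge and look for a cycle. The sketch below illustrates that idea only. It is not the paper's inconsistency metric or its graph-based method, and the answer sets are invented.

```python
def has_cycle(edges):
    """Return True if the 'a < b' answers contain a directed cycle."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    state = {node: 0 for node in graph}  # 0 = unvisited, 1 = in progress, 2 = done

    def visit(node):
        state[node] = 1
        for nxt in graph[node]:
            # Reaching a node that is still in progress closes a cycle.
            if state[nxt] == 1 or (state[nxt] == 0 and visit(nxt)):
                return True
        state[node] = 2
        return False

    return any(state[n] == 0 and visit(n) for n in graph)

# Hypothetical model answers of the form (left point, right point).
consistent = [("A", "B"), ("B", "C"), ("A", "C")]
contradictory = [("A", "B"), ("B", "C"), ("C", "A")]
```

The second answer set claims A < B < C < A, which no arrangement of points on a line can satisfy, so `has_cycle` flags it.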

[162] STU-PID: Steering Token Usage via PID Controller for Efficient Large Language Model Reasoning cs.CLPDF

Aryasomayajula Ram Bharadwaj

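TL;DR: STU-PID proposes a PID-controller-based dynamic steering method that reduces redundant computation in LLM chain-of-thought reasoning while improving performance.

Details

Motivation: LLMs performing long chain-of-thought reasoning often overthink and generate redundant steps, which increases computational cost and may degrade performance. Static steering methods cannot adapt to reasoning quality in real time.

Result: On GSM8K, STU-PID improves accuracy by 6% while reducing token usage by 32%, outperforming static steering baselines.

Insight: Dynamic steering mechanisms such as PID control are an effective tool for optimizing LLM reasoning efficiency and could extend to other reasoning tasks.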

Abstract: Large Language Models employing extended chain-of-thought (CoT) reasoning often suffer from the overthinking phenomenon, generating excessive and redundant reasoning steps that increase computational costs while potentially degrading performance. While recent work has explored static steering approaches to mitigate this issue, they lack the adaptability to dynamically adjust intervention strength based on real-time reasoning quality. We propose STUPID (Steering Token Usage via PID controller), a novel training-free method that employs a PID controller to dynamically modulate activation steering strength during inference. Our approach combines a chunk-level classifier for detecting redundant reasoning patterns with a PID control mechanism that adaptively adjusts steering intensity based on the predicted redundancy probability. Experimental evaluation on GSM8K demonstrates that STUPID achieves a 6% improvement in accuracy while reducing token usage by 32%, outperforming static steering baselines. Our method provides a principled framework for dynamic reasoning calibration that maintains reasoning quality while significantly improving computational efficiency.
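
The control loop described above can be sketched with a textbook discrete PID controller that maps a predicted redundancy probability to a steering strength. This is a minimal sketch, not the authors' implementation: the gains, the target probability, the output clamp, and the class name are all assumptions, and anti-windup handling is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class PIDController:
    """Discrete PID controller; all parameter values are illustrative."""

    kp: float  # proportional gain
    ki: float  # integral gain
    kd: float  # derivative gain
    target: float  # desired redundancy probability (assumed setpoint)
    min_out: float = 0.0  # lower clamp on steering strength
    max_out: float = 1.0  # upper clamp on steering strength
    _integral: float = field(default=0.0, init=False, repr=False)
    _prev_error: float = field(default=0.0, init=False, repr=False)

    def update(self, redundancy_prob: float) -> float:
        """Return a steering strength for the next chunk of reasoning."""
        error = redundancy_prob - self.target
        self._integral += error
        # On the first call the previous error is 0, so this equals `error`.
        derivative = error - self._prev_error
        self._prev_error = error
        raw = self.kp * error + self.ki * self._integral + self.kd * derivative
        return max(self.min_out, min(self.max_out, raw))


if __name__ == "__main__":
    pid = PIDController(kp=0.8, ki=0.1, kd=0.05, target=0.3)
    # Hypothetical classifier outputs for successive reasoning chunks.
    for prob in [0.9, 0.7, 0.4, 0.2]:
        print(round(pid.update(prob), 3))
```

In such a setup the returned value would scale an activation steering vector at inference time: stronger when the chunk-level classifier predicts redundancy, weaker when reasoning looks useful.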

[163] LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning cs.CL | cs.AI | cs.LGPDF

Yuhao Wu, Yushi Bai, Zhiqiang Hu, Roy Ka-Wei Lee, Juanzi Li

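TL;DR: The paper proposes LongWriter-Zero, a reinforcement-learning approach that trains an LLM from scratch without relying on annotated or synthetic data, improving ultra-long text generation.

Details

Motivation: Existing ultra-long text generation methods rely on supervised fine-tuning (SFT) on synthetic data, which is costly to construct and of limited quality. The authors aim to solve this by applying reinforcement learning directly to a base model.

Result: On the WritingBench and Arena-Write benchmarks, LongWriter-Zero surpasses traditional SFT methods and 100B+ models, achieving state-of-the-art performance.

Insight: Reinforcement learning can replace SFT on synthetic data, training a model from scratch and performing well on ultra-long text generation.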

Abstract: Ultra-long generation by large language models (LLMs) is a widely demanded scenario, yet it remains a significant challenge due to their maximum generation length limit and overall quality degradation as sequence length increases. Previous approaches, exemplified by LongWriter, typically rely on “teaching”, which involves supervised fine-tuning (SFT) on synthetic long-form outputs. However, this strategy heavily depends on synthetic SFT data, which is difficult and costly to construct, often lacks coherence and consistency, and tends to be overly artificial and structurally monotonous. In this work, we propose an incentivization-based approach that, starting entirely from scratch and without relying on any annotated or synthetic data, leverages reinforcement learning (RL) to foster the emergence of ultra-long, high-quality text generation capabilities in LLMs. We perform RL training starting from a base model, similar to R1-Zero, guiding it to engage in reasoning that facilitates planning and refinement during the writing process. To support this, we employ specialized reward models that steer the LLM towards improved length control, writing quality, and structural formatting. Experimental evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B, consistently outperforms traditional SFT methods on long-form writing tasks, achieving state-of-the-art results across all metrics on WritingBench and Arena-Write, and even surpassing 100B+ models such as DeepSeek R1 and Qwen3-235B. We open-source our data and model checkpoints under https://huggingface.co/THU-KEG/LongWriter-Zero-32B

[164] OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization cs.CL | cs.AIPDF

Yiyou Sun, Shawn Hu, Georgia Zhou, Ken Zheng, Hannaneh Hajishirzi

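TL;DR: OMEGA is a new benchmark that evaluates whether large language models (LLMs) can generalize in mathematics along exploratory, compositional, and transformative axes. It finds that existing models degrade sharply as complexity increases and are especially weak at adopting novel strategies.

Details

Motivation: Existing LLMs rely on a limited set of problem-solving strategies for math problems and struggle with problems that require new ideas, so the OMEGA benchmark was designed to evaluate their generalization systematically.

Result: Frontier LLMs degrade markedly as problem complexity increases; fine-tuning Qwen models improves exploratory generalization but yields limited gains in compositional and transformative generalization.

Insight: OMEGA exposes LLMs' shortcomings in creative mathematical problem solving and offers direction for improving models' mathematical creativity.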

Abstract: Recent large-scale language models (LLMs) with long Chain-of-Thought reasoning, such as DeepSeek-R1, have achieved impressive results on Olympiad-level mathematics benchmarks. However, they often rely on a narrow set of strategies and struggle with problems that require a novel way of thinking. To systematically investigate these limitations, we introduce OMEGA (Out-of-distribution Math Problems Evaluation with 3 Generalization Axes), a controlled yet diverse benchmark designed to evaluate three axes of out-of-distribution generalization, inspired by Boden’s typology of creativity: (1) Exploratory: applying known problem-solving skills to more complex instances within the same problem domain; (2) Compositional: combining distinct reasoning skills, previously learned in isolation, to solve novel problems that require integrating these skills in new and coherent ways; and (3) Transformative: adopting novel, often unconventional strategies by moving beyond familiar approaches to solve problems more effectively. OMEGA consists of programmatically generated training-test pairs derived from templated problem generators across geometry, number theory, algebra, combinatorics, logic, and puzzles, with solutions verified using symbolic, numerical, or graphical methods. We evaluate frontier (or top-tier) LLMs and observe sharp performance degradation as problem complexity increases. Moreover, we fine-tune the Qwen-series models across all generalization settings and observe notable improvements in exploratory generalization, while compositional generalization remains limited and transformative reasoning shows little to no improvement. By isolating and quantifying these fine-grained failures, OMEGA lays the groundwork for advancing LLMs toward genuine mathematical creativity beyond mechanical proficiency.

[165] ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs cs.CLPDF

Jiaru Zou, Ling Yang, Jingwen Gu, Jiahao Qiu, Ke Shen

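TL;DR: ReasonFlux-PRM proposes a new trajectory-aware PRM for evaluating intermediate reasoning steps, significantly improving LLM performance on complex reasoning tasks.

Details

Motivation: Existing PRMs are trained mainly on models' final outputs and struggle to evaluate intermediate reasoning trajectories robustly, especially for the trajectory-response outputs of frontier reasoning models.

Result: Strong results on benchmarks such as AIME, with average gains of 12.1% (supervised fine-tuning), 4.5% (reinforcement learning), and 6.3% (test-time scaling).

Insight: Trajectory awareness and fine-grained reward assignment are key to improving LLM reasoning, and the lightweight model offers a solution for resource-constrained settings.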

Abstract: Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on model final output responses and struggle to evaluate intermediate thinking trajectories robustly, especially in the emerging setting of trajectory-response outputs generated by frontier reasoning models like DeepSeek-R1. In this work, we introduce ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate the trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. We adapt ReasonFlux-PRM to support reward supervision under both offline and online settings, including (i) selecting high-quality model distillation data for downstream supervised fine-tuning of smaller models, (ii) providing dense process-level rewards for policy optimization during reinforcement learning, and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond demonstrate that ReasonFlux-PRM-7B selects higher quality data than strong PRMs (e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling. We also release our efficient ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment. Projects: https://github.com/Gen-Verse/ReasonFlux
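
Reward-guided Best-of-N selection, mentioned in (iii), can be sketched in a few lines. The weighted average used here to combine step-level and trajectory-level rewards is an assumption made for illustration, as are the function names and the `alpha` parameter; the abstract does not specify how ReasonFlux-PRM aggregates its scores.

```python
from typing import Callable, Sequence


def combined_reward(
    step_rewards: Sequence[float], trajectory_reward: float, alpha: float = 0.5
) -> float:
    """Blend the mean step-level reward with a trajectory-level reward.
    The linear blend and `alpha` are illustrative assumptions."""
    if not step_rewards:
        raise ValueError("step_rewards must not be empty")
    mean_step = sum(step_rewards) / len(step_rewards)
    return alpha * mean_step + (1.0 - alpha) * trajectory_reward


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate response with the highest reward score."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    return max(candidates, key=score)


if __name__ == "__main__":
    # Hypothetical rewards for three sampled responses.
    rewards = {
        "response A": combined_reward([0.2, 0.4], 0.9),
        "response B": combined_reward([0.8, 0.9], 0.7),
        "response C": combined_reward([0.5, 0.5], 0.5),
    }
    print(best_of_n(list(rewards), rewards.__getitem__))
```

In practice `score` would call a reward model on each sampled response; here a lookup table of made-up numbers stands in for it.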

cs.DL [Back]

[166] Unfolding the Past: A Comprehensive Deep Learning Approach to Analyzing Incunabula Pages cs.DL | cs.CVPDF

Klaudia Ropel, Krzysztof Kutt, Luiz do Valle Miranda, Grzegorz J. Nalepa

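TL;DR: The paper proposes a deep-learning-based automated method for analyzing the page structure and content of early printed books (incunabula), using object detection, OCR, and image classification, and reports high performance metrics.

Details

Motivation: Analysis of early printed books (incunabula) usually relies on manual work, which is slow and inefficient. The paper aims to automate the analysis of page structure and content with deep learning.

Result: 1. YOLO11n reaches an F1 score of 0.94 on the custom dataset; 2. Tesseract OCR outperforms Kraken; 3. ResNet18 achieves 98.7% accuracy on the image classification task; 4. CLIP successfully generates semantic descriptions of illustrations.

Insight: Machine learning has great potential for incunabula analysis, but OCR performance and visual content understanding still need improvement. An integrated approach could provide efficient tools for cultural heritage digitization.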

Abstract: We developed a proof-of-concept method for the automatic analysis of the structure and content of incunabula pages. A custom dataset comprising 500 annotated pages from five different incunabula was created using resources from the Jagiellonian Digital Library. Each page was manually labeled with five predefined classes: Text, Title, Picture, Table, and Handwriting. Additionally, the publicly available DocLayNet dataset was utilized as supplementary training data. To perform object detection, YOLO11n and YOLO11s models were employed and trained using two strategies: a combined dataset (DocLayNet and the custom dataset) and the custom dataset alone. The highest performance (F1 = 0.94) was achieved by the YOLO11n model trained exclusively on the custom data. Optical character recognition was then conducted on regions classified as Text, using both Tesseract and Kraken OCR, with Tesseract demonstrating superior results. Subsequently, image classification was applied to the Picture class using a ResNet18 model, achieving an accuracy of 98.7% across five subclasses: Decorative_letter, Illustration, Other, Stamp, and Wrong_detection. Furthermore, the CLIP model was utilized to generate semantic descriptions of illustrations. The results confirm the potential of machine learning in the analysis of early printed books, while emphasizing the need for further advancements in OCR performance and visual content interpretation.

cs.LO [Back]

[167] Beyond Prediction – Structuring Epistemic Integrity in Artificial Reasoning Systems cs.LO | cs.CL | math.LO | 68T27, 03B70 | I.2.4; I.2.3PDF

Craig Steven Wright

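TL;DR: The author proposes a framework for AI systems that emphasizes structured reasoning under strict epistemic constraints, rather than stochastic language prediction alone.

Details

Motivation: To address the shortcomings of existing AI systems in epistemic integrity and structured reasoning, the author aims for a comprehensive framework that supports propositional commitment, contradiction detection, and truth preservation.

Result: The framework supports structured reasoning and contradiction detection while ensuring truth preservation and epistemic integrity.

Insight: Combining symbolic inference with knowledge graphs and blockchain technology can give AI systems stricter epistemic constraints and auditability.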

Abstract: This paper develops a comprehensive framework for artificial intelligence systems that operate under strict epistemic constraints, moving beyond stochastic language prediction to support structured reasoning, propositional commitment, and contradiction detection. It formalises belief representation, metacognitive processes, and normative verification, integrating symbolic inference, knowledge graphs, and blockchain-based justification to ensure truth-preserving, auditably rational epistemic agents.

cs.CY [Back]

[168] MAARTA: Multi-Agentic Adaptive Radiology Teaching Assistant cs.CY | cs.CV | cs.LGPDF

Akash Awasthi, Brandon V. Chang, Anh M. Vu, Ngan Le, Rishi Agrawal

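TL;DR: MAARTA is a multi-agent framework that analyzes gaze patterns and radiology reports to give radiology students personalized feedback, helping them correct errors in visual search and diagnostic interpretation.

Details

Motivation: Because expert mentorship time is limited, radiology students struggle to develop perceptual expertise, leading to errors in visual search and diagnostic interpretation. Existing AI systems focus on diagnostic accuracy but cannot explain why errors occur.

Result: MAARTA provides personalized feedback that helps students correct perceptual and diagnostic errors.

Insight: A multi-agent framework can analyze complex errors more flexibly and improve teaching through step-by-step prompting.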

Abstract: Radiology students often struggle to develop perceptual expertise due to limited expert mentorship time, leading to errors in visual search and diagnostic interpretation. These perceptual errors, such as missed fixations, short dwell times, or misinterpretations, are not adequately addressed by current AI systems, which focus on diagnostic accuracy but fail to explain how and why errors occur. To address this gap, we introduce MAARTA (Multi-Agentic Adaptive Radiology Teaching Assistant), a multi-agent framework that analyzes gaze patterns and radiology reports to provide personalized feedback. Unlike single-agent models, MAARTA dynamically selects agents based on error complexity, enabling adaptive and efficient reasoning. By comparing expert and student gaze behavior through structured graphs, the system identifies missed findings and assigns Perceptual Error Teacher agents to analyze discrepancies. MAARTA then uses step-by-step prompting to help students understand their errors and improve diagnostic reasoning, advancing AI-driven radiology education.

[169] AI-based Multimodal Biometrics for Detecting Smartphone Distractions: Application to Online Learning cs.CY | cs.AI | cs.CV | cs.HCPDF

Alvaro Becerra, Roberto Daza, Ruth Cobos, Aythami Morales, Mutlu Cukurova

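TL;DR: The paper proposes a multimodal biometric approach that combines physiological signals and head pose data to detect when students are distracted by smartphone use during online learning. The multimodal approach is significantly more accurate than any single signal.

Details

Motivation: In online learning, students are easily distracted by smartphone use and other factors, and traditional learning platforms lack detailed behavioral data. Multimodal Learning Analytics (MMLA) and biosensors offer a new way to address this problem.

Result: Experiments show that the multimodal model's detection accuracy (91%) is higher than that of single signals (e.g., 87% for head pose alone, and lower for physiological signals).

Insight: Multimodal biometrics have an advantage in detecting distraction and open new technical possibilities for real-time support in online learning, but practical deployment must still account for privacy and technical limitations.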

Abstract: This work investigates the use of multimodal biometrics to detect distractions caused by smartphone use during tasks that require sustained attention, with a focus on computer-based online learning. Although the methods are applicable to various domains, such as autonomous driving, we concentrate on the challenges learners face in maintaining engagement amid internal (e.g., motivation), system-related (e.g., course design) and contextual (e.g., smartphone use) factors. Traditional learning platforms often lack detailed behavioral data, but Multimodal Learning Analytics (MMLA) and biosensors provide new insights into learner attention. We propose an AI-based approach that leverages physiological signals and head pose data to detect phone use. Our results show that single biometric signals, such as brain waves or heart rate, offer limited accuracy, while head pose alone achieves 87%. A multimodal model combining all signals reaches 91% accuracy, highlighting the benefits of integration. We conclude by discussing the implications and limitations of deploying these models for real-time support in online learning environments.

[170] Multimodal Political Bias Identification and Neutralization cs.CY | cs.AI | cs.CVPDF

Cedric Bernard, Xavier Pleimling, Amun Kharel, Chase Vickery

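TL;DR: The paper proposes a multimodal model for identifying and neutralizing political bias that combines text and image bias analysis and works in four steps (image-text alignment, image bias scoring, text de-biasing, and final de-biasing). Preliminary results suggest the method is promising but needs more training time and resources.

Details

Motivation: Political echo chambers lead to subjective bias and emotionally charged language in political articles; prior work has focused only on text and overlooked images as a medium of information.

Result: Text de-biasing effectively identifies biased words, the ViT model trains well, and semantic alignment is efficient, but more training resources are needed. A human evaluation is proposed to ensure the consistency of generated content.

Insight: Handling bias multimodally is more comprehensive; combining automated techniques with human evaluation can improve de-biasing, and future work should optimize the training pipeline and resource allocation.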

Abstract: Due to the presence of political echo chambers, it becomes imperative to detect and remove subjective bias and emotionally charged language from both the text and images of political articles. However, prior work has focused solely on the text portion of the bias rather than on both the text and image portions. This is a problem because images are just as powerful a medium for communicating information as text is. To that end, we present a model that leverages both text and image bias and consists of four steps. Image Text Alignment focuses on semantically aligning images based on their bias through CLIP models. Image Bias Scoring determines the appropriate bias score of images via a ViT classifier. Text De-Biasing focuses on detecting biased words and phrases and neutralizing them through BERT models. These three steps culminate in the final step of debiasing, which replaces the text and the image with neutralized or reduced counterparts; for images, this is done by comparing the bias scores. The results so far indicate that this approach is promising, with the text debiasing strategy able to identify many potentially biased words and phrases, and the ViT model showcasing effective training. The semantic alignment model is also efficient. However, more time, particularly in training, and more resources are needed to obtain better results. A human evaluation portion was also proposed to ensure semantic consistency of the newly generated text and images.

eess.IV [Back]

[171] Can Common VLMs Rival Medical VLMs? Evaluation and Strategic Insights eess.IV | cs.AI | cs.CVPDF

Yuan Zhong, Ruinan Jin, Xiaoxiao Li, Qi Dou

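TL;DR: This study examines whether general-purpose vision-language models (VLMs) can, with fine-tuning, match medical VLMs trained specifically for medical tasks. Evaluating general-purpose VLMs such as CLIP and LLaVA on disease diagnosis and visual question answering, it finds that lightweight fine-tuning (e.g., LoRA) lets general-purpose VLMs outperform specialized models on some tasks, with particularly strong adaptability on cross-modality tasks.

Details

Motivation: Medical VLMs require substantial compute and data, whereas general-purpose VLMs may handle medical tasks after fine-tuning. The study aims to verify whether general-purpose VLMs can replace medical VLMs, lowering cost and improving scalability.

Result: Medical VLMs are stronger on in-domain tasks, but general-purpose VLMs can match or exceed them after fine-tuning (especially with LoRA); on out-of-domain tasks, general-purpose VLMs show strong adaptability.

Insight: Lightweight fine-tuning of general-purpose VLMs is an efficient, low-cost alternative that challenges the conventional assumption that medical-specific pre-training is necessary, offering new directions for medical imaging research.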

Abstract: Medical vision-language models (VLMs) leverage large-scale pretraining for diverse imaging tasks but require substantial computational and data resources. Meanwhile, common or general-purpose VLMs (e.g., CLIP, LLaVA), though not trained for medical use, show promise with fine-tuning. This raises a key question: Can efficient fine-tuned common VLMs rival generalist medical VLMs for solving specific medical imaging tasks? This study systematically evaluates common and medical VLMs across disease diagnosis and visual question answering (VQA). Using CLIP-based and LLaVA-based models, we examine (1) off-the-shelf performance gaps in in-domain (ID) settings, (2) whether fine-tuning bridges these gaps, and (3) generalization to out-of-domain (OOD) tasks on unseen medical modalities. While medical-specific pretraining provides advantages in ID settings, common VLMs match or surpass medical-specific models after lightweight fine-tuning, with LoRA-based adaptation proving highly effective among different tasks. In OOD tasks, common VLMs demonstrate strong adaptability in some tasks, challenging the assumption that medical-specific pre-training is essential. These findings suggest that leveraging common VLMs with fine-tuning offers a scalable and cost-effective alternative to developing large-scale medical VLMs, providing crucial insights for future research in the medical imaging field.
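
For readers unfamiliar with LoRA, the core idea is to keep a pretrained weight matrix W frozen and learn a low-rank update, so the effective weight becomes W + (alpha / r) * B A with B of shape d x r and A of shape r x k. The pure-Python sketch below illustrates that arithmetic only; the helper names and the tiny matrices are hypothetical, and the study's actual training setup is not described in this abstract.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product (a is n x m, b is m x p)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(
    w: Matrix, b: Matrix, a: Matrix, alpha: float, rank: int
) -> Matrix:
    """Return W + (alpha / rank) * (B @ A); W stays frozen, only B and A train."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2 x 2 pretrained weight
    b = [[1.0], [2.0]]  # 2 x 1 trainable factor
    a = [[3.0, 4.0]]  # 1 x 2 trainable factor
    print(lora_effective_weight(w, b, a, alpha=2.0, rank=1))
```

With rank r much smaller than the matrix dimensions, B and A together hold far fewer parameters than W, which is why this style of adaptation counts as lightweight.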

[172] LVPNet: A Latent-variable-based Prediction-driven End-to-end Framework for Lossless Compression of Medical Images eess.IV | cs.CVPDF

Chenyue Song, Chen Hui, Qing Lin, Wei Zhang, Siqiao Li

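TL;DR: LVPNet proposes a latent-variable-based, prediction-driven end-to-end framework for lossless compression of medical images, addressing the low utilization of latent variables and the even distribution of information in existing methods.

Details

Motivation: Existing methods achieve lossless compression through sub-image autoregression and latent variable modeling, but latent-variable information is distributed evenly, causing posterior collapse and inefficient utilization. LVPNet aims to solve these problems.

Result: Outperforms existing lossless compression methods on multiple benchmarks, with competitive inference speed.

Insight: Global latent variables and multi-scale sensing make better use of image information, and quantization compensation further improves compression efficiency.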

Abstract: Autoregressive Initial Bits is a framework that integrates sub-image autoregression and latent variable modeling, demonstrating its advantages in lossless medical image compression. However, in existing methods, the image segmentation process leads to an even distribution of latent variable information across each sub-image, which in turn causes posterior collapse and inefficient utilization of latent variables. To deal with these issues, we propose a prediction-based end-to-end lossless medical image compression method named LVPNet, leveraging global latent variables to predict pixel values and encoding predicted probabilities for lossless compression. Specifically, we introduce the Global Multi-scale Sensing Module (GMSM), which extracts compact and informative latent representations from the entire image, effectively capturing spatial dependencies within the latent space. Furthermore, to mitigate the information loss introduced during quantization, we propose the Quantization Compensation Module (QCM), which learns the distribution of quantization errors and refines the quantized features to compensate for quantization loss. Extensive experiments on challenging benchmarks demonstrate that our method achieves superior compression efficiency compared to state-of-the-art lossless image compression approaches, while maintaining competitive inference speed. The code is at https://github.com/Anonymity00000/Anonymity-repository/.
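
The link between predicted probabilities and compressed size can be made concrete: an ideal entropy coder (for example arithmetic coding) spends about -log2 p bits on a symbol that the model predicted with probability p. The sketch below computes that ideal total; the function name and the example probabilities are hypothetical, and it does not model LVPNet's actual coder or network.

```python
import math
from typing import Iterable


def ideal_code_length_bits(probs: Iterable[float]) -> float:
    """Sum of -log2(p) over the probabilities a model assigned to the
    pixel values that actually occurred (ideal entropy-coder cost)."""
    total = 0.0
    for p in probs:
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must lie in (0, 1]")
        total += -math.log2(p)
    return total


if __name__ == "__main__":
    # Made-up probabilities assigned to the true value of three pixels.
    print(ideal_code_length_bits([0.5, 0.25, 0.125]))  # 1 + 2 + 3 = 6.0 bits
```

Sharper predictions for the true pixel values therefore translate directly into fewer bits, which is why better latent representations can improve the compression ratio.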

[173] Multimodal Medical Image Binding via Shared Text Embeddings eess.IV | cs.AI | cs.CVPDF

Yunhao Liu, Suyang Xi, Shiqi Liu, Hong Ding, Chicheng Jin

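TL;DR: The paper proposes M³Bind, a new pre-training framework that achieves unsupervised alignment of multimodal medical images through a shared text embedding space, avoiding reliance on explicitly paired data, and performs well on multiple downstream tasks.

Details

Motivation: Medical image analysis needs to integrate images from multiple modalities to improve diagnostic accuracy, but existing methods such as CLIP require explicitly paired data, which is hard to obtain in medicine. M³Bind aims to address this challenge.

Result: Experiments show that M³Bind outperforms existing CLIP-like methods on multiple tasks across X-ray, CT, retina, ECG, and pathology images, validating its cross-modal alignment.

Insight: A shared text embedding space enables unsupervised alignment of multimodal medical images, offering a new approach for medical image analysis.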

Abstract: Medical image analysis increasingly relies on the integration of multiple imaging modalities to capture complementary anatomical and functional information, enabling more accurate diagnosis and treatment planning. Achieving aligned feature representations across these diverse modalities is therefore important for effective multimodal analysis. While contrastive language-image pre-training (CLIP) and its variants have enabled image-text alignments, they require explicitly paired data between any two modalities, which is difficult to acquire in medical contexts. To address this gap, we present Multimodal Medical Image Binding with Text (M³Bind), a novel pre-training framework that enables seamless alignment of multiple medical imaging modalities through a shared text representation space without requiring explicit paired data between any two medical image modalities. Specifically, based on the insight that different images can naturally bind with text, M³Bind first fine-tunes pre-trained CLIP-like image-text models to align their modality-specific text embedding space while preserving their original image-text alignments. Subsequently, we distill these modality-specific text encoders into a unified model, creating a shared text embedding space. Experiments on X-ray, CT, retina, ECG, and pathological images on multiple downstream tasks demonstrate that M³Bind achieves state-of-the-art performance in zero-shot, few-shot classification and cross-modal retrieval tasks compared to its CLIP-like counterparts. These results validate M³Bind’s effectiveness in achieving cross-image-modal alignment for medical analysis.

[174] Taming Vision-Language Models for Medical Image Analysis: A Comprehensive Review eess.IV | cs.CVPDF

Haoneng Lin, Cheng Xu, Jing Qin

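TL;DR: This review systematically summarizes progress in applying vision-language models (VLMs) to medical image analysis, covering core learning strategies, five major adaptation strategies, and practical applications across 11 medical tasks, and identifies current challenges and future research directions.

Details

Motivation: Medical image analysis requires multimodal integration, and adapting general-purpose VLMs to medicine faces challenges such as large domain gaps and complex pathological variation, so a summary of existing adaptation strategies and research progress is needed.

Result: The review shows the potential of VLMs in medical image analysis while revealing practical challenges such as domain gaps and task diversity.

Insight: Future research should focus on narrowing domain gaps and improving model robustness and safety, to advance innovative applications of VLMs in clinical practice.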

Abstract: Modern Vision-Language Models (VLMs) exhibit unprecedented capabilities in cross-modal semantic understanding between visual and textual modalities. Given the intrinsic need for multi-modal integration in clinical applications, VLMs have emerged as a promising solution for a wide range of medical image analysis tasks. However, adapting general-purpose VLMs to the medical domain poses numerous challenges, such as large domain gaps, complicated pathological variations, and the diversity and uniqueness of different tasks. The central purpose of this review is to systematically summarize recent advances in adapting VLMs for medical image analysis, analyze current challenges, and recommend promising yet urgent directions for further investigation. We begin by introducing core learning strategies for medical VLMs, including pretraining, fine-tuning, and prompt learning. We then categorize five major VLM adaptation strategies for medical image analysis. These strategies are further analyzed across eleven medical imaging tasks to illustrate their current practical implementations. Furthermore, we analyze key challenges that impede the effective adaptation of VLMs to clinical applications and discuss potential directions for future research. We also provide an open-access repository of related literature to facilitate further research, available at https://github.com/haonenglin/Awesome-VLM-for-MIA. It is anticipated that this article can help researchers who are interested in harnessing VLMs in medical image analysis tasks gain a better understanding of their capabilities and limitations, as well as current technical barriers, to promote their innovative, robust, and safe application in clinical practice.

cs.GR [Back]

[175] BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing cs.GR | cs.CVPDF

Jiacheng Chen, Ramin Mehran, Xuhui Jia, Saining Xie, Sanghyun Woo

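TL;DR: BlenderFusion is a generative visual compositing framework that uses a 3D-grounded layering-editing-compositing pipeline to flexibly edit and composite the objects, camera, and background of a scene.

Details

Motivation: Existing visual editing methods are limited when compositing and editing complex scenes; BlenderFusion aims to provide a 3D-grounded framework that supports more flexible and controllable visual compositing.

Result: Significantly outperforms prior methods on complex scene editing tasks.

Insight: Combining 3D-grounded editing with generative models improves the flexibility and controllability of visual compositing and supports more complex scene reconstruction and editing tasks.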

Abstract: We present BlenderFusion, a generative visual compositing framework that synthesizes new scenes by recomposing objects, camera, and background. It follows a layering-editing-compositing pipeline: (i) segmenting and converting visual inputs into editable 3D entities (layering), (ii) editing them in Blender with 3D-grounded control (editing), and (iii) fusing them into a coherent scene using a generative compositor (compositing). Our generative compositor extends a pre-trained diffusion model to process both the original (source) and edited (target) scenes in parallel. It is fine-tuned on video frames with two key training strategies: (i) source masking, enabling flexible modifications like background replacement; (ii) simulated object jittering, facilitating disentangled control over objects and camera. BlenderFusion significantly outperforms prior methods in complex compositional scene editing tasks.

[176] Morse: Dual-Sampling for Lossless Acceleration of Diffusion Models cs.GR | cs.AI | cs.CVPDF

Chao Li, Jiawei Fan, Anbang Yao

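TL;DR: Morse proposes a dual-sampling framework that losslessly accelerates diffusion model generation through fast jump sampling and adaptive residual feedback, achieving average speedups of 1.78X to 3.31X.

Details

Motivation: Diffusion models generate high-quality output, but their iterative generation process is slow, which limits practical use. Morse aims to improve the sampling efficiency of diffusion models losslessly.

Result: On 6 image generation tasks, relative to 9 baseline diffusion models, Morse achieves average speedups of 1.78X to 3.31X and generalizes to the LCM-SDXL model.

Insight: Morse's dual-sampling strategy decouples efficiency from performance, offering a flexible and lossless approach to accelerating diffusion models.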

Abstract: In this paper, we present Morse, a simple dual-sampling framework for accelerating diffusion models losslessly. The key insight of Morse is to reformulate the iterative generation (from noise to data) process by taking advantage of fast jump sampling and adaptive residual feedback strategies. Specifically, Morse involves two models called Dash and Dot that interact with each other. The Dash model is just the pre-trained diffusion model of any type, but operates in a jump sampling regime, creating sufficient space for sampling efficiency improvement. The Dot model is significantly faster than the Dash model, and is trained to generate residual feedback conditioned on the observations at the current jump sampling point on the trajectory of the Dash model, lifting the noise estimate to easily match the next-step estimate of the Dash model without jump sampling. By chaining the outputs of the Dash and Dot models run in a time-interleaved fashion, Morse exhibits the merit of flexibly attaining desired image generation performance while improving overall runtime efficiency. With our proposed weight sharing strategy between the Dash and Dot models, Morse is efficient for training and inference. Our method shows a lossless speedup of 1.78X to 3.31X on average over a wide range of sampling step budgets relative to 9 baseline diffusion models on 6 image generation tasks. Furthermore, we show that our method can also be generalized to improve the Latent Consistency Model (LCM-SDXL, which is already accelerated with a consistency distillation technique) tailored for few-step text-to-image synthesis. The code and models are available at https://github.com/deep-optimization/Morse.

[177] What You Think Is What You Get: Bridge User Intent and Transfer Function Design through Multimodal Large Language Models cs.GR | cs.CVPDF

Yiyao Wang, Bo Pan, Ke Wang, Han Liu, Jinyuan Mao

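TL;DR: The paper proposes the WYTWYG framework, which uses multimodal large language models (MLLMs) to guide transfer function (TF) optimization based on user intent, addressing the large exploration space and weak generalizability of existing methods.

Details

Motivation: Transfer function (TF) design in direct volume rendering (DVR) suffers from a semantic gap, and traditional methods struggle to capture user intent intuitively, so a smarter optimization method is needed.

Result: Three case studies and extensive experiments validate the framework's general applicability and the effectiveness of each component.

Insight: MLLMs can effectively bridge the semantic gap between user intent and TF design, improving the efficiency and intuitiveness of interactive volume rendering.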

Abstract: Direct volume rendering (DVR) is a fundamental technique for visualizing volumetric data, with transfer functions (TFs) playing a crucial role in extracting meaningful structures. However, designing effective TFs remains unintuitive due to the semantic gap between user intent and TF parameter space. Researchers have developed numerous TF optimization methods to bridge this gap. However, existing methods still face two challenges: large exploration space and weak generalizability. To address these issues, we propose the What You Think is What You Get (WYTWYG) framework, which leverages Multimodal Large Language Models (MLLMs) to guide the TF optimization based on user intent. Specifically, we first introduce a novel TF optimization approach comprising two core components: (1) an evolution-based explorer for effective exploration of the TF space, and (2) a volume rendering quality evaluator based on MLLMs to provide generalizable visual guidance. We further propose a TF interactive design system based on this approach. We demonstrate the general applicability of our framework through three case studies, and validate the effectiveness of each component through extensive experiments. Our code is available at: https://github.com/wyysteelhead/TFevolve.

[178] BulletGen: Improving 4D Reconstruction with Bullet-Time Generation cs.GR | cs.AI | cs.CV | cs.LGPDF

Denys Rozumnyi, Jonathon Luiten, Numair Khan, Johannes Schönberger, Peter Kontschieder

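TL;DR: BulletGen uses generative models to correct errors and complete missing information in 4D dynamic scene reconstruction, optimizing a Gaussian model with content generated by a diffusion model, and achieves state-of-the-art performance on novel-view synthesis and tracking tasks.

Details

Motivation: Reconstructing dynamic scenes from monocular video is a highly ill-posed problem, with challenges such as reconstructing unseen regions and ambiguity in depth estimation.

Result: State-of-the-art results on novel-view synthesis and 2D/3D tracking tasks.

Insight: Combining generative models with dynamic scene representations can markedly improve reconstruction quality and task performance.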

Abstract: Transforming casually captured, monocular videos into fully immersive dynamic experiences is a highly ill-posed task, and comes with significant challenges, e.g., reconstructing unseen regions, and dealing with the ambiguity in monocular depth estimation. In this work we introduce BulletGen, an approach that takes advantage of generative models to correct errors and complete missing information in a Gaussian-based dynamic scene representation. This is done by aligning the output of a diffusion-based video generation model with the 4D reconstruction at a single frozen “bullet-time” step. The generated frames are then used to supervise the optimization of the 4D Gaussian model. Our method seamlessly blends generative content with both static and dynamic scene components, achieving state-of-the-art results on both novel-view synthesis, and 2D/3D tracking tasks.

[179] DuetGen: Music Driven Two-Person Dance Generation via Hierarchical Masked Modeling cs.GR | cs.CV | cs.SD | eess.ASPDF

Anindita Ghosh, Bing Zhou, Rishabh Dabral, Jian Wang, Vladislav Golyanik

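TL;DR: DuetGen proposes a novel hierarchical masked modeling framework that generates music-driven two-person dance motion in two stages, achieving high-quality interaction and music synchronization.

Details

Motivation: The key challenge in two-person dance generation is synchronizing the partners with each other and with the music; existing methods struggle to capture the intricate interaction details.

Result: On a benchmark dataset, DuetGen achieves state-of-the-art performance in motion realism, music-dance alignment, and partner coordination.

Insight: Hierarchical modeling and a unified representation effectively capture the complexity of two-person dance interaction, and masked prediction improves the diversity and synchronization of the generated motion.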

Abstract: We present DuetGen, a novel framework for generating interactive two-person dances from music. The key challenge of this task lies in the inherent complexities of two-person dance interactions, where the partners need to synchronize both with each other and with the music. Inspired by the recent advances in motion synthesis, we propose a two-stage solution: encoding two-person motions into discrete tokens and then generating these tokens from music. To effectively capture intricate interactions, we represent both dancers’ motions as a unified whole to learn the necessary motion tokens, and adopt a coarse-to-fine learning strategy in both the stages. Our first stage utilizes a VQ-VAE that hierarchically separates high-level semantic features at a coarse temporal resolution from low-level details at a finer resolution, producing two discrete token sequences at different abstraction levels. Subsequently, in the second stage, two generative masked transformers learn to map music signals to these dance tokens: the first producing high-level semantic tokens, and the second, conditioned on music and these semantic tokens, producing the low-level tokens. We train both transformers to learn to predict randomly masked tokens within the sequence, enabling them to iteratively generate motion tokens by filling an empty token sequence during inference. Through the hierarchical masked modeling and dedicated interaction representation, DuetGen achieves the generation of synchronized and interactive two-person dances across various genres. Extensive experiments and user studies on a benchmark duet dance dataset demonstrate state-of-the-art performance of DuetGen in motion realism, music-dance alignment, and partner coordination.

cs.AI [Back]

Shahab Rahimirad, Guven Gergerli, Lucia Romero, Angela Qian, Matthew Lyle Olson

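TL;DR: This paper studies the limits of large language models (LLMs) on social reasoning tasks and proposes a hybrid reasoning framework that combines a structured probabilistic model with an LLM, substantially improving performance and achieving a 67% win rate against human players.

Details

Motivation: Current LLMs are limited on social reasoning tasks such as inferring other agents' beliefs and intentions: the largest models need extensive test-time inference, and performance degrades sharply in smaller, real-time-capable variants, so a more efficient approach is needed.

Result: The framework is competitive with much larger models in Agent-Agent play and is the first language agent to defeat human players in a controlled study, with a 67% win rate and higher qualitative ratings than both reasoning baselines and human teammates.

Insight: Splitting a complex social reasoning task into structured probabilistic inference and language understanding can substantially improve LLM performance on such tasks while preserving real-time operation.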

Abstract: Social reasoning - inferring unobservable beliefs and intentions from partial observations of other agents - remains a challenging task for large language models (LLMs). We evaluate the limits of current reasoning language models in the social deduction game Avalon and find that while the largest models demonstrate strong performance, they require extensive test-time inference and degrade sharply when distilled to smaller, real-time-capable variants. To address this, we introduce a hybrid reasoning framework that externalizes belief inference to a structured probabilistic model, while using an LLM for language understanding and interaction. Our approach achieves competitive performance with much larger models in Agent-Agent play and, notably, is the first language agent to defeat human players in a controlled study - achieving a 67% win rate and receiving higher qualitative ratings than both reasoning baselines and human teammates. We release code, models, and a dataset to support future work on social reasoning in LLM agents, which can be found at https://camp-lab-purdue.github.io/bayesian-social-deduction/
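As a rough illustration of externalizing belief inference to a probabilistic model, the sketch below performs a Bayesian update over which players are evil after one observed mission. It is a simplified toy, not the paper's model: the five-player, two-evil setup, the rule that good players never play a fail card, and the `p_fail` parameter are all assumptions.

```python
from itertools import combinations
from math import comb


def update_beliefs(prior, team, fails, p_fail=0.8):
    """Posterior over hypotheses (sets of evil players) after one mission.

    Assumes good players never fail a mission and each evil player on the
    team independently plays a fail card with probability `p_fail`.
    The observation must be possible under at least one hypothesis.
    """
    posterior = {}
    for evil_set, p in prior.items():
        k = len(evil_set & team)  # evil players on the mission team
        if fails > k:
            likelihood = 0.0
        else:
            likelihood = comb(k, fails) * p_fail**fails * (1 - p_fail) ** (k - fails)
        posterior[evil_set] = p * likelihood
    total = sum(posterior.values())
    return {h: v / total for h, v in posterior.items()}


def marginal_evil(beliefs, player):
    """Probability that `player` is evil under the current beliefs."""
    return sum(p for h, p in beliefs.items() if player in h)


players = range(5)
hypotheses = [frozenset(c) for c in combinations(players, 2)]
prior = {h: 1 / len(hypotheses) for h in hypotheses}
posterior = update_beliefs(prior, team=frozenset({0, 1}), fails=1)
```

After a single fail on a team of players 0 and 1, every hypothesis with no evil player on that team drops to zero probability, and the marginal suspicion on players 0 and 1 rises above the prior of 0.4.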

[181] Evolving Prompts In-Context: An Open-ended, Self-replicating Perspective cs.AI | cs.CL | cs.LG | cs.NE | cs.ROPDF

Jianyu Wang, Zhiqiang Hu, Lidong Bing

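TL;DR: The paper proposes PromptQuine, a new prompt design paradigm that prunes random demonstrations into seemingly meaningless prompts, substantially improving large language model (LLM) performance and matching or surpassing conventional prompt optimization techniques.

Details

Motivation: Conventional prompt design relies on carefully crafted instructions and demonstrations, but this work finds that "gibberish" prompts obtained by pruning random demonstrations can markedly improve task performance, challenging conventional wisdom.

Result: PromptQuine performs well across classification, multiple-choice question answering, generation, and math reasoning tasks, with decent runtime efficiency.

Insight: The study suggests that the in-context learning ability of LLMs may go beyond conventional understanding, pointing toward more open-ended prompt design algorithms.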

Abstract: We propose a novel prompt design paradigm that challenges conventional wisdom in large language model (LLM) prompting. While conventional wisdom prioritizes well-crafted instructions and demonstrations for in-context learning (ICL), we show that pruning random demonstrations into seemingly incoherent “gibberish” can remarkably improve performance across diverse tasks. Notably, the “gibberish” always matches or surpasses state-of-the-art automatic prompt optimization techniques, achieving substantial gains regardless of LLM alignment. Nevertheless, discovering an effective pruning strategy is non-trivial, as existing attribution methods and prompt compression algorithms fail to deliver robust results, let alone human intuition. To this end, we propose PromptQuine, a self-discovering prompt optimization framework based on evolutionary search that finds the pruning strategy automatically using only low-data regimes. Much like the emergent complexity in nature (such as symbiosis and self-organization) arising in response to resource constraints, our framework evolves and refines unconventional yet highly effective prompts by leveraging only the tokens present within the context. We demonstrate its effectiveness across classification, multi-choice question answering, generation and math reasoning tasks across LLMs, while achieving decent runtime efficiency. We hope our findings can guide mechanistic studies on in-context learning, and provide a call to action, to pave the way for more open-ended search algorithms for more effective LLM prompting.
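The idea of evolving a prompt purely by deleting tokens already in the context can be sketched as a small evolutionary loop. This is a toy under stated assumptions, not PromptQuine itself: `toy_fitness` is a hypothetical scoring function standing in for task accuracy on a few labeled examples, and the single-token-deletion mutation and elitist selection are illustrative choices.

```python
import random


def evolve_pruned_prompt(tokens, fitness, generations=30, pop_size=8, seed=0):
    """Search for a pruned prompt by repeatedly deleting single tokens.

    Individuals are tuples of kept token indices, so every candidate is a
    subsequence of the original prompt (no new tokens are ever introduced).
    """
    rng = random.Random(seed)
    population = [tuple(range(len(tokens)))]  # start from the unpruned prompt
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            parent = rng.choice(population)
            if len(parent) > 1:
                drop = rng.randrange(len(parent))
                children.append(parent[:drop] + parent[drop + 1:])
            else:
                children.append(parent)
        # Elitist selection: parents compete with children, best survive.
        pool = set(population) | set(children)
        population = sorted(
            pool,
            key=lambda kept: (-fitness([tokens[i] for i in kept]), kept),
        )[:pop_size]
    return [tokens[i] for i in population[0]]


def toy_fitness(prompt_tokens):
    """Hypothetical score: reward label words, penalize prompt length."""
    keywords = {"great", "positive", "terrible", "negative"}
    return sum(t in keywords for t in prompt_tokens) - 0.1 * len(prompt_tokens)


demo = "Review : the movie was great Sentiment : positive".split()
pruned = evolve_pruned_prompt(demo, toy_fitness)
```

Because parents stay in the selection pool, the best score never decreases from one generation to the next; a real setup would replace `toy_fitness` with an evaluation of the LLM on held-out examples.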

[182] SE-Merging: A Self-Enhanced Approach for Dynamic Model Merging cs.AI | cs.CLPDF

Zijun Chen, Zhanpeng Zhou, Bo Zhang, Weinan Zhang, Xi Sun

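TL;DR: SE-Merging is a self-enhanced dynamic model merging method that identifies the task of each sample and adaptively rescales the merging coefficients, improving multi-task ability without additional training.

Details

Motivation: Although model merging shows promise for multi-task learning, its underlying mechanism remains unclear. This work aims to explain how model merging works and proposes a self-enhanced method to improve it.

Result: Experiments show that SE-Merging significantly improves multi-task performance and remains compatible with existing merging techniques.

Insight: The multi-task ability of a merged model depends on distinguishing samples from different tasks and adapting to the corresponding expert model; dynamically adjusting the merging coefficients improves performance further.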

Abstract: Model merging has gained increasing attention due to its intriguing property: interpolating the parameters of different task-specific fine-tuned models leads to multi-task abilities. However, despite its empirical success, the underlying mechanisms of model merging remain poorly understood. In this work, we delve into the mechanism behind model merging from a representation perspective. Our analysis reveals that model merging achieves multi-task abilities through two key capabilities: i) distinguishing samples from different tasks, and ii) adapting to the corresponding expert model for each sample. These two capabilities allow the merged model to retain task-specific expertise, enabling efficient multi-task adaptation. Building on these insights, we propose \texttt{SE-Merging}, a self-enhanced model merging framework that leverages these two characteristics to dynamically identify the corresponding task for each sample and then adaptively rescales the merging coefficients to further enhance task-specific expertise in the merged model. Notably, \texttt{SE-Merging} achieves dynamic model merging without additional training. Extensive experiments demonstrate that \texttt{SE-Merging} achieves significant performance improvements while remaining compatible with existing model merging techniques.
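One plausible reading of per-sample coefficient rescaling is sketched below. It is not the authors' algorithm: the use of Euclidean distance between a merged-model representation and each expert's representation, the softmax with a `temperature`, and the `base_coeff` value are all assumptions made for illustration. Parameters are flat lists of floats to keep the example self-contained.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def merge(base, experts, coeffs):
    """theta = base + sum_i coeffs[i] * (experts[i] - base), element-wise."""
    merged = list(base)
    for expert, lam in zip(experts, coeffs):
        for j, (e, b) in enumerate(zip(expert, base)):
            merged[j] += lam * (e - b)
    return merged


def dynamic_coeffs(merged_rep, expert_reps, base_coeff=0.3, temperature=1.0):
    """Shift coefficient mass toward the expert whose representation of the
    current sample is closest to the merged model's representation.

    The coefficients always sum to base_coeff * number_of_experts.
    """
    dists = [math.dist(merged_rep, rep) for rep in expert_reps]
    weights = softmax([-d / temperature for d in dists])
    return [base_coeff * len(expert_reps) * w for w in weights]


# Hypothetical representations of one sample under two expert models.
coeffs = dynamic_coeffs([1.0, 0.0], [[1.0, 0.1], [0.0, 1.0]])
theta = merge([0.0, 0.0], [[1.0, 0.0], [0.0, 2.0]], coeffs)
```

When the sample is equally close to every expert, the coefficients fall back to the uniform `base_coeff`, which corresponds to static merging.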

[183] Reasoning about Uncertainty: Do Reasoning Models Know When They Don’t Know? cs.AI | cs.CLPDF

Zhiting Mei, Christina Zhang, Tenny Yin, Justin Lidard, Ola Shorinwa

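TL;DR: This paper examines the calibration of reasoning language models under uncertainty and finds that they are typically overconfident, especially on incorrect answers; it also explores improving calibration through introspection (introspective UQ).

Details

Motivation: Reasoning language models excel on many benchmarks but are prone to producing plausible yet incorrect responses (hallucinations). Deploying them safely in real-world applications requires studying their uncertainty quantification (UQ) abilities.

Result: 1. Reasoning models are typically overconfident; self-verbalized confidence for incorrect answers often exceeds 85%;
2. Deeper reasoning makes the overconfidence worse;
3. Introspective UQ improves calibration for some models (e.g., o3-Mini, DeepSeek R1) but not uniformly (e.g., Claude 3.7 Sonnet becomes more poorly calibrated).

Insight: 1. Model calibration needs further research;
2. Introspection is a promising direction for improvement;
3. Dedicated UQ benchmarks need to be designed.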

Abstract: Reasoning language models have set state-of-the-art (SOTA) records on many challenging benchmarks, enabled by multi-step reasoning induced using reinforcement learning. However, like previous language models, reasoning models are prone to generating confident, plausible responses that are incorrect (hallucinations). Knowing when and how much to trust these models is critical to the safe deployment of reasoning models in real-world applications. To this end, we explore uncertainty quantification of reasoning models in this work. Specifically, we ask three fundamental questions: First, are reasoning models well-calibrated? Second, does deeper reasoning improve model calibration? Finally, inspired by humans’ innate ability to double-check their thought processes to verify the validity of their answers and their confidence, we ask: can reasoning models improve their calibration by explicitly reasoning about their chain-of-thought traces? We introduce introspective uncertainty quantification (UQ) to explore this direction. In extensive evaluations on SOTA reasoning models across a broad range of benchmarks, we find that reasoning models: (i) are typically overconfident, with self-verbalized confidence estimates often greater than 85% particularly for incorrect responses, (ii) become even more overconfident with deeper reasoning, and (iii) can become better calibrated through introspection (e.g., o3-Mini and DeepSeek R1) but not uniformly (e.g., Claude 3.7 Sonnet becomes more poorly calibrated). Lastly, we conclude with important research directions to design necessary UQ benchmarks and improve the calibration of reasoning models.
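Calibration of self-verbalized confidence is commonly summarized with expected calibration error (ECE). The sketch below computes the standard equal-width-bin ECE; the abstract does not state which calibration metric the authors use, so treating ECE as the measure here is an assumption, and the sample data is made up.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |avg confidence - accuracy|.

    `confidences` are in [0, 1]; `correct` holds 1 for a right answer and
    0 for a wrong one.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Made-up example: the model reports 90% confidence but is right half the time.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 1])
```

A model that reports 90% confidence while being right only half the time gets an ECE of about 0.4, whereas a perfectly calibrated model gets 0.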

[184] Airalogy: AI-empowered universal data digitization for research automation cs.AI | cs.CE | cs.CLPDF

Zijie Yang, Qiji Zhou, Fang Guo, Sijie Zhang, Yexun Xi

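TL;DR: The paper presents Airalogy, an AI-driven universal data digitization platform that balances universality and standardization to support digitizing and managing research data across disciplines.

Details

Motivation: Current AI applications are limited to a few fields, and research data are fragmented and lack unified standards, which hinders AI empowerment across disciplines.

Result: Airalogy is already deployed in laboratories across all four schools of Westlake University, demonstrating its potential to accelerate and automate scientific innovation.

Insight: By combining domain knowledge with computing skills, Airalogy addresses the barriers to research data standardization and AI empowerment and promotes multidisciplinary scientific innovation.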

Abstract: Research data are the foundation of Artificial Intelligence (AI)-driven science, yet current AI applications remain limited to a few fields with readily available, well-structured, digitized datasets. Achieving comprehensive AI empowerment across multiple disciplines is still out of reach. Present-day research data collection is often fragmented, lacking unified standards, inefficiently managed, and difficult to share. Creating a single platform for standardized data digitization needs to overcome the inherent challenge of balancing between universality (supporting the diverse, ever-evolving needs of various disciplines) and standardization (enforcing consistent formats to fully enable AI). No existing platform accommodates both facets. Building a truly multidisciplinary platform requires integrating scientific domain knowledge with sophisticated computing skills. Researchers often lack the computational expertise to design customized and standardized data recording methods, whereas platform developers rarely grasp the intricate needs of multiple scientific domains. These gaps impede research data standardization and hamper AI-driven progress. In this study, we address these challenges by developing Airalogy (https://airalogy.com), the world’s first AI- and community-driven platform that balances universality and standardization for digitizing research data across multiple disciplines. Airalogy represents entire research workflows using customizable, standardized data records and offers an advanced AI research copilot for intelligent Q&A, automated data entry, analysis, and research automation. Already deployed in laboratories across all four schools of Westlake University, Airalogy has the potential to accelerate and automate scientific innovation in universities, industry, and the global research community, ultimately benefiting humanity as a whole.

[185] Programming by Backprop: LLMs Acquire Reusable Algorithmic Abstractions During Code Training cs.AI | cs.CL | cs.LGPDF

Jonathan Cook, Silvia Sapora, Arash Ahmadian, Akbir Khan, Tim Rocktaschel

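TL;DR: The paper provides experimental evidence that, by training on source code alone, large language models (LLMs) can learn to evaluate programs implicitly, acquiring reusable algorithmic abstractions that improve reasoning; the authors call this Programming by Backprop (PBB).

Details

Motivation: To study how code training improves the general-purpose reasoning of LLMs and to probe the underlying mechanism: whether a model can learn to evaluate a program from its source code alone.

Result: LLMs trained on source code alone show some ability to evaluate programs implicitly; this works better when programs are given as code rather than as natural-language descriptions, and more reliably with chain-of-thought.

Insight: Code training lets LLMs internalise reusable algorithmic abstractions, opening new directions for improving model learning and alignment through symbolic procedures.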

Abstract: Training large language models (LLMs) on source code significantly enhances their general-purpose reasoning abilities, but the mechanisms underlying this generalisation are poorly understood. In this paper, we propose Programming by Backprop (PBB) as a potential driver of this effect - teaching a model to evaluate a program for inputs by training on its source code alone, without ever seeing I/O examples. To explore this idea, we finetune LLMs on two sets of programs representing simple maths problems and algorithms: one with source code and I/O examples (w/ IO), the other with source code only (w/o IO). We find evidence that LLMs have some ability to evaluate w/o IO programs for inputs in a range of experimental settings, and make several observations. Firstly, PBB works significantly better when programs are provided as code rather than semantically equivalent language descriptions. Secondly, LLMs can produce outputs for w/o IO programs directly, by implicitly evaluating the program within the forward pass, and more reliably when stepping through the program in-context via chain-of-thought. We further show that PBB leads to more robust evaluation of programs across inputs than training on I/O pairs drawn from a distribution that mirrors naturally occurring data. Our findings suggest a mechanism for enhanced reasoning through code training: it allows LLMs to internalise reusable algorithmic abstractions. Significant scope remains for future work to enable LLMs to more effectively learn from symbolic procedures, and progress in this direction opens other avenues like model alignment by training on formal constitutional principles.

[186] ConciseHint: Boosting Efficient Reasoning via Continuous Concise Hints during Generation cs.AI | cs.CL | cs.CVPDF

Siao Tang, Xinyin Ma, Gongfan Fang, Xinchao Wang

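TL;DR: ConciseHint injects concise hints dynamically during reasoning generation, substantially reducing redundant output from large reasoning models while maintaining performance.

Details

Motivation: Large reasoning models tend to produce overly long, redundant reasoning, which hurts efficiency. Existing methods mostly optimize before reasoning begins and overlook the potential of intervening directly during generation.

Result: On GSM8K with Qwen-3 4B, reasoning length is reduced by 65% with nearly no accuracy loss, supporting the effectiveness of the method.

Insight: The conciseness of reasoning can be optimized directly during generation through dynamic hint injection, balancing efficiency and performance.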

Abstract: Recent advancements in large reasoning models (LRMs) like DeepSeek-R1 and OpenAI o1 series have achieved notable performance enhancements on complex reasoning tasks by scaling up the generation length through Chain-of-Thought (CoT). However, an emerging issue is their inclination to produce excessively verbose reasoning processes, leading to inefficiency. Existing literature on improving efficiency mainly adheres to the before-reasoning paradigms such as prompting and reasoning or fine-tuning and reasoning, but ignores the promising direction of directly encouraging the model to speak concisely by intervening during the generation of reasoning. To fill this gap, we propose a framework dubbed ConciseHint, which continuously encourages the reasoning model to speak concisely by injecting the textual hint (manually designed or trained on the concise data) during the token generation of the reasoning process. Besides, ConciseHint is adaptive to the complexity of the query by adaptively adjusting the hint intensity, which ensures it will not undermine model performance. Experiments on the state-of-the-art LRMs, including DeepSeek-R1 and Qwen-3 series, demonstrate that our method can effectively produce concise reasoning processes while maintaining performance well. For instance, we achieve a reduction ratio of 65% for the reasoning length on GSM8K benchmark with Qwen-3 4B with nearly no accuracy loss.

[187] jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval cs.AI | cs.CL | cs.IR | 68T50 | I.2.7PDF

Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu

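TL;DR: The paper introduces jina-embeddings-v4, a 3.8 billion parameter multimodal embedding model that unifies text and image representations through a novel architecture, supports single-vector and multi-vector embeddings, uses LoRA adapters to optimize performance, and achieves SOTA results on multimodal retrieval tasks.

Details

Motivation: Multimodal retrieval needs a unified embedding model that handles complex interactions between text and images, but existing methods perform poorly on visually rich content (such as charts and mixed media).

Result: The model performs strongly on single-modal and cross-modal retrieval tasks, with particular strength on visually rich content.

Insight: A unified embedding architecture and task-specific adapters are key to improving multimodal retrieval performance, especially on complex visual content.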

Abstract: We introduce jina-embeddings-v4, a 3.8 billion parameter multimodal embedding model that unifies text and image representations through a novel architecture supporting both single-vector and multi-vector embeddings in the late interaction style. The model incorporates task-specific Low-Rank Adaptation (LoRA) adapters to optimize performance across diverse retrieval scenarios, including query-based information retrieval, cross-modal semantic similarity, and programming code search. Comprehensive evaluations demonstrate that jina-embeddings-v4 achieves state-of-the-art performance on both single-modal and cross-modal retrieval tasks, with particular strength in processing visually rich content such as tables, charts, diagrams, and mixed-media formats. To facilitate evaluation of this capability, we also introduce Jina-VDR, a novel benchmark specifically designed for visually rich image retrieval.
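The difference between single-vector scoring and multi-vector late interaction can be shown with a small sketch. It uses the generic MaxSim formulation popularized by ColBERT-style retrievers, with plain Python lists as stand-in embeddings; it does not call the jina-embeddings-v4 model, and the vectors are made up.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def single_vector_score(query_vec, doc_vec):
    """Cosine similarity between one query vector and one document vector."""
    return dot(normalize(query_vec), normalize(doc_vec))


def late_interaction_score(query_vecs, doc_vecs):
    """MaxSim: each query token vector takes its best-matching document
    token vector, and the per-token maxima are summed."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


# Made-up token embeddings: two query tokens, three document tokens.
score = late_interaction_score(
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
)
```

Single-vector scoring costs one dot product per document, while late interaction keeps one vector per token and trades extra storage and compute for finer-grained matching.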

Xinzge Gao, Chuanrui Hu, Bin Chen, Teng Li

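TL;DR: The paper proposes Chain-of-Memory (CoM), a new approach that explicitly models short-term and long-term memory to improve GUI agents on cross-application tasks.

Details

Motivation: Existing GUI agents rely on historical screenshots or actions to represent task state implicitly, which makes complex task states hard to understand accurately, and they lack an effective mechanism for storing critical information.

Result: Experiments show that CoM significantly improves GUI agent performance on cross-application tasks, and the GUI Odyssey-CoM dataset enables 7B models to achieve memory management capabilities comparable to 72B models.

Insight: Explicit memory representation is key to improving an agent's task understanding, and dataset annotation quality has a significant effect on model capability.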

Abstract: Multimodal large language models (MLLMs) are attracting growing attention in the development of Graphical User Interface (GUI) agents. Existing approaches often rely on historical screenshots or actions to implicitly represent the task state. This reliance poses challenges for GUI agents in accurately understanding task states and underscores the absence of effective mechanisms to store critical information in complex and lengthy cross-app tasks. To address these challenges, we propose Chain-of-Memory (CoM), a novel approach for explicitly modeling short-term and long-term memory in GUI agents. CoM achieves this by capturing action descriptions, integrating task-relevant screen information, and maintaining a dedicated memory module to store and manage this information. By leveraging explicit memory representations, CoM enables GUI agents to better understand task states and retain critical historical information persistently. To equip GUI agents with memory management capabilities and evaluate the effectiveness of CoM, we developed the GUI Odyssey-CoM, a dataset comprising 111k screen-action pairs annotated with Chain-of-Memory. Experimental results demonstrate that CoM significantly improves GUI agents’ performance in cross-application tasks. Additionally, GUI Odyssey-CoM enables 7B models to achieve memory management capabilities comparable to 72B models. The dataset and code will be open-sourced.

cs.DB [Back]

[189] LIGHTHOUSE: Fast and precise distance to shoreline calculations from anywhere on earth cs.DB | cs.CV | cs.LGPDF

Patrick Beukema, Henry Herzog, Yawen Zhang, Hunter Pitelka, Favyen Bastani

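TL;DR: This paper presents a new global coastline dataset and an algorithm (Lighthouse) that deliver high-resolution (10 m) and efficient distance calculations suited to real-time applications in resource-constrained settings.

Details

Motivation: Existing global coastline datasets are coarse (1-4 km), which limits their practical use. Combining satellite imagery with computer vision enables much higher precision.

Result: Lighthouse achieves millisecond online inference with only 1 CPU and 2 GB of RAM, making it suitable for real-time applications.

Insight: Combining high-resolution satellite imagery with computer vision opens new possibilities for geospatial computation, and Lighthouse's design shows the potential for efficient computation in resource-constrained environments.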

Abstract: We introduce a new dataset and algorithm for fast and efficient coastal distance calculations from Anywhere on Earth (AoE). Existing global coastal datasets are only available at coarse resolution (e.g. 1-4 km) which limits their utility. Publicly available satellite imagery combined with computer vision enable much higher precision. We provide a global coastline dataset at 10 meter resolution, a 100+ fold improvement in precision over existing data. To handle the computational challenge of querying at such an increased scale, we introduce a new library: Layered Iterative Geospatial Hierarchical Terrain-Oriented Unified Search Engine (Lighthouse). Lighthouse is both exceptionally fast and resource-efficient, requiring only 1 CPU and 2 GB of RAM to achieve millisecond online inference, making it well suited for real-time applications in resource-constrained environments.
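For context, the underlying query can be written as a brute-force baseline: take the great-circle (haversine) distance from a point to every coastline sample and keep the minimum. This is not the Lighthouse algorithm, which the abstract describes as a layered hierarchical search; the coastline samples below are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_shoreline_km(lat, lon, coastline_points):
    """Brute-force nearest distance; O(n) in the number of coastline samples."""
    return min(haversine_km(lat, lon, clat, clon) for clat, clon in coastline_points)


coast = [(0.0, 1.0), (10.0, 10.0)]  # hypothetical coastline samples (lat, lon)
nearest = distance_to_shoreline_km(0.0, 0.0, coast)
```

A linear scan like this becomes impractical at 10 m resolution, which is the scaling problem a hierarchical index is meant to address. The "100+ fold" figure matches the resolution ratio: 1 km divided by 10 m is 100.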

cs.CR [Back]

[190] Shrinking the Generation-Verification Gap with Weak Verifiers cs.CR | cs.CLPDF

Jon Saad-Falcon, E. Kelly Buchanan, Mayee F. Chen, Tzu-Heng Huang, Brendan McLaughlin

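TL;DR: The Weaver framework combines multiple weak verifiers to strengthen verification for language models, uses weak supervision to reduce dependence on labeled data, and significantly improves selection among generated candidate responses.

Details

Motivation: Existing verifiers (such as humans or tools) are either unscalable or limited in utility, and a performance gap remains between LM judges and reward models on the one hand and oracle verifiers on the other; a new method is needed to narrow it.

Result: In test-time repeated sampling, Weaver significantly improves candidate selection, reaching o3-mini-level accuracy (87.7% average vs. 86.7% for o3-mini) with Llama 3.3 70B Instruct as the generator; the authors also train a 400M cross-encoder to reduce the cost of the verifier ensemble.

Insight: Combining weak verifiers through weighting and weak supervision can significantly improve verification without relying on expensive labeled data or large-scale finetuning, offering a new approach to verifier design.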

Abstract: Verifiers can improve language model capabilities by scoring and ranking responses from generated candidates. Currently, high-quality verifiers are either unscalable (e.g., humans) or limited in utility (e.g., tools like Lean). While LM judges and reward models have become broadly useful as general-purpose verifiers, a significant performance gap remains between them and oracle verifiers (verifiers with perfect accuracy). To help close this gap, we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. We find weighted ensembles of verifiers, which typically require learning from labeled data, significantly outperform unweighted combinations due to differences in verifier accuracies. To reduce dependency on labeled data, Weaver leverages weak supervision to estimate each verifier’s accuracy and combines outputs into a unified score that better reflects true response quality. However, directly applying weak supervision algorithms poses challenges, including inconsistent verifier output formats and handling low-quality verifiers. Weaver addresses these using dataset statistics to normalize outputs and filter specific verifiers. We study Weaver’s effectiveness in test-time repeated sampling, where a model generates multiple candidate responses and selects one. Our evaluations show Weaver significantly improves over Pass@1 (performance when selecting the first candidate) across reasoning and math tasks, achieving o3-mini-level accuracy with Llama 3.3 70B Instruct as generator, and an ensemble of 70B or smaller judge and reward models as verifiers (87.7% average). This gain mirrors the jump between GPT-4o and o3-mini (69.0% vs. 86.7%), which required extensive finetuning and post-training. To reduce computational costs of verifier ensembles, we train a 400M cross-encoder using Weaver’s combined output scores.
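Why a weighted ensemble can beat an unweighted vote is easy to see in a small sketch. It uses textbook log-odds weighting of binary votes; the accuracies are simply given as inputs here, whereas Weaver estimates them with weak supervision, so this is an illustration of the weighting idea rather than Weaver's procedure. The vote and accuracy values are made up.

```python
import math


def log_odds_weight(accuracy, eps=1e-6):
    """Weight of a verifier's vote; a 50%-accurate verifier gets weight 0."""
    a = min(max(accuracy, eps), 1 - eps)  # clamp to avoid division by zero
    return math.log(a / (1 - a))


def combined_score(votes, accuracies):
    """Sum of signed weights: +w for an accept vote (1), -w for a reject (0)."""
    return sum(
        log_odds_weight(a) * (1 if v else -1) for v, a in zip(votes, accuracies)
    )


def select_best(candidate_votes, accuracies):
    """Index of the candidate with the highest combined verifier score."""
    scores = [combined_score(votes, accuracies) for votes in candidate_votes]
    return max(range(len(scores)), key=scores.__getitem__)


accuracies = [0.9, 0.6, 0.6]            # one strong verifier, two weak ones
candidate_votes = [[1, 0, 0], [0, 1, 1]]  # votes per candidate, per verifier
best = select_best(candidate_votes, accuracies)
```

An unweighted majority vote would pick the second candidate (two accepts against one), while the weighted score follows the single reliable verifier and picks the first.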

cs.SD [Back]

[191] Zero-Shot Cognitive Impairment Detection from Speech Using AudioLLM cs.SD | cs.AI | cs.CL | cs.MM | eess.ASPDF

Mostafa Shahin, Beena Ahmed, Julien Epps

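TL;DR: The paper proposes a zero-shot cognitive impairment detection method based on an AudioLLM, using prompt-based instructions to detect cognitive impairment from speech without labeled data, and shows generalization across languages and tasks.

Details

Motivation: Early detection of cognitive impairment is important for public health, but traditional methods depend on labeled data and specific features and generalize poorly. This study aims to use an AudioLLM to develop a zero-shot detection method that needs no annotation.

Result: Experiments on two datasets (one English, one multilingual) show performance comparable to supervised methods, with promising generalizability and consistency across languages and tasks.

Insight: Zero-shot methods that need no labeled data show promise for cognitive impairment detection, especially when resources and data are limited.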

Abstract: Cognitive impairment (CI) is of growing public health concern, and early detection is vital for effective intervention. Speech has gained attention as a non-invasive and easily collectible biomarker for assessing cognitive decline. Traditional CI detection methods typically rely on supervised models trained on acoustic and linguistic features extracted from speech, which often require manual annotation and may not generalise well across datasets and languages. In this work, we propose the first zero-shot speech-based CI detection method using the Qwen2-Audio AudioLLM, a model capable of processing both audio and text inputs. By designing prompt-based instructions, we guide the model in classifying speech samples as indicative of normal cognition or cognitive impairment. We evaluate our approach on two datasets: one in English and another multilingual, spanning different cognitive assessment tasks. Our results show that the zero-shot AudioLLM approach achieves performance comparable to supervised methods and exhibits promising generalizability and consistency across languages, tasks, and datasets.

[192] AI-Generated Song Detection via Lyrics Transcripts cs.SD | cs.AI | cs.CLPDF

Markus Frohmann, Elena V. Epure, Gabriel Meseguer-Brocal, Markus Schedl, Romain Hennequin

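TL;DR: This paper proposes detecting AI-generated songs from lyrics transcripts, addressing the generalization problems of audio-based detectors, and performs well on multilingual, multi-genre lyrics.

Details

Motivation: The rapid progress of AI music generation tools has created a need to detect AI-generated content. Existing audio-based detectors generalize poorly and are fragile under audio perturbation, while existing lyrics-based detection depends on perfectly formatted lyrics that are rarely available in practice, so a more applicable method is needed.

Result: The method shows strong detection performance on multilingual, multi-genre lyrics and is more robust than existing audio-based detectors under audio perturbation and across different music generators.

Insight: Lyrics transcription combined with language model embeddings is an effective way to detect AI-generated music and is more practical in real-world use.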

Abstract: The recent rise in capabilities of AI-based music generation tools has created an upheaval in the music industry, necessitating the creation of accurate methods to detect such AI-generated content. This can be done using audio-based detectors; however, it has been shown that they struggle to generalize to unseen generators or when the audio is perturbed. Furthermore, recent work used accurate and cleanly formatted lyrics sourced from a lyrics provider database to detect AI-generated music. However, in practice, such perfect lyrics are not available (only the audio is); this leaves a substantial gap in applicability in real-life use cases. In this work, we instead propose solving this gap by transcribing songs using general automatic speech recognition (ASR) models. We do this using several detectors. The results on diverse, multi-genre, and multi-lingual lyrics show generally strong detection performance across languages and genres, particularly for our best-performing model using Whisper large-v2 and LLM2Vec embeddings. In addition, we show that our method is more robust than state-of-the-art audio-based ones when the audio is perturbed in different ways and when evaluated on different music generators. Our code is available at https://github.com/deezer/robust-AI-lyrics-detection.

[193] Smooth Operators: LLMs Translating Imperfect Hints into Disfluency-Rich Transcripts cs.SD | cs.AI | cs.CL | eess.ASPDF

Duygu Altinok

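TL;DR: This paper proposes using large language models (LLMs) to turn imperfect hints into disfluency-annotated transcripts; experiments show that even when the input text is imperfect, LLMs can produce fully disfluency-annotated transcripts as long as the text includes timestamp cues.

Details

Motivation: Detecting disfluencies in spoken language is crucial for improving automatic speech and language processing systems. Existing transcription tools annotate disfluencies poorly, so a more robust method for handling imperfect input is needed.

Result: Experiments show that even with imperfect input text, LLMs can produce fully disfluency-annotated transcripts as long as timestamp cues are present, demonstrating robustness to imperfect hints.

Insight: LLMs perform well when integrating multimodal inputs (such as acoustic and textual), completing the task even with imperfect input and opening new possibilities for speech and language processing systems.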

Abstract: Accurate detection of disfluencies in spoken language is crucial for enhancing the performance of automatic speech and language processing systems, as well as fostering the development of more inclusive speech and language technologies. Leveraging the growing trend of large language models (LLMs) as versatile learners capable of processing both lexical and non-lexical inputs (e.g., audio and video), we propose a novel approach to transcribing disfluencies as explicit tokens with timestamps, enabling the generation of fully annotated disfluency-rich transcripts. Our method integrates acoustic representations extracted from an audio encoder with textual inputs of varying quality: clean transcriptions without disfluencies, time-aligned transcriptions from aligners, or outputs from phoneme-based ASR models – all of which may contain imperfections. Importantly, our experiments demonstrate that textual inputs do not need to be flawless. As long as they include timestamp-related cues, LLMs can effectively smooth the input and produce fully disfluency-annotated transcripts, underscoring their robustness in handling imperfect hints.

[194] USAD: Universal Speech and Audio Representation via Distillation cs.SD | cs.CL | eess.ASPDF

Heng-Jui Chang, Saurabhchand Bhati, James Glass, Alexander H. Liu

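TL;DR: USAD is a self-supervised representation learning framework that uses distillation to learn unified representations of speech and general audio (including music and sound), with competitive performance.

Details

Motivation: Existing self-supervised learning methods are usually confined to a specific domain (speech or non-speech tasks) and lack a unified audio representation model. USAD aims to address this.

Result: USAD performs competitively across benchmarks including SUPERB and HEAR, with near state-of-the-art results from a single encoder.

Insight: Distillation can effectively unify representations of different audio types, offering a general solution for multi-domain audio tasks.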

Abstract: Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. In this work, we present Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that integrates diverse audio types - speech, sound, and music - into a single model. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. USAD offers competitive performance across various benchmarks and datasets, including frame and instance-level speech processing tasks, audio tagging, and sound classification, achieving near state-of-the-art results with a single encoder on SUPERB and HEAR benchmarks.

cs.RO [Back]

[195] RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation cs.RO | cs.AI | cs.CL | cs.CV | cs.MAPDF

Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu

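TL;DR: RoboTwin 2.0 is a scalable simulation framework for generating diverse, realistic data to support research on robust dual-arm manipulation. It combines multimodal large language models with simulation-in-the-loop refinement to generate task-level code automatically, and uses structured domain randomization to improve data diversity and policy robustness.

Details

Motivation: Existing synthetic datasets fall short for dual-arm manipulation, mainly because efficient data generation methods are lacking and simulation environments are oversimplified. RoboTwin 2.0 aims to solve these problems and improve sim-to-real transfer.

Result: Code generation success improves by 10.9%; a VLA model fine-tuned on the dataset achieves a 367% relative improvement (42.0% vs. 9.0%) on unseen-scene real-world tasks, and zero-shot models trained only on the synthetic data achieve a 228% relative gain.

Insight: Structured domain randomization and multimodal generation effectively improve the diversity of simulated data and policy robustness, supporting generalization to real-world tasks without real-world supervision.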

Abstract: Simulation-based data synthesis has emerged as a powerful paradigm for enhancing real-world robotic manipulation. However, existing synthetic datasets remain insufficient for robust bimanual manipulation due to two challenges: (1) the lack of an efficient, scalable data generation method for novel tasks, and (2) oversimplified simulation environments that fail to capture real-world complexity. We present RoboTwin 2.0, a scalable simulation framework that enables automated, large-scale generation of diverse and realistic data, along with unified evaluation protocols for dual-arm manipulation. We first construct RoboTwin-OD, a large-scale object library comprising 731 instances across 147 categories, each annotated with semantic and manipulation-relevant labels. Building on this foundation, we develop an expert data synthesis pipeline that combines multimodal large language models (MLLMs) with simulation-in-the-loop refinement to generate task-level execution code automatically. To improve sim-to-real transfer, RoboTwin 2.0 incorporates structured domain randomization along five axes: clutter, lighting, background, tabletop height and language instructions, thereby enhancing data diversity and policy robustness. We instantiate this framework across 50 dual-arm tasks spanning five robot embodiments, and pre-collect over 100,000 domain-randomized expert trajectories. Empirical results show a 10.9% gain in code generation success and improved generalization to novel real-world scenarios. A VLA model fine-tuned on our dataset achieves a 367% relative improvement (42.0% vs. 9.0%) on unseen scene real-world tasks, while zero-shot models trained solely on our synthetic data achieve a 228% relative gain, highlighting strong generalization without real-world supervision. We release the data generator, benchmark, dataset, and code to support scalable research in robust bimanual manipulation.
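The 367% figure quoted in the abstract is a relative gain over the baseline success rate, not an absolute difference in percentage points. A short worked check, using only the two success rates stated in the abstract (the function name is illustrative):

```python
def relative_gain_percent(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100


# Success rates reported in the abstract: 42.0% vs. 9.0%.
gain = relative_gain_percent(42.0, 9.0)
absolute_gain = 42.0 - 9.0  # the same change in percentage points
```

The absolute gain is 33 percentage points; dividing by the 9.0% baseline gives roughly 366.7%, which rounds to the reported 367%.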

[196] A workflow for generating synthetic LiDAR datasets in simulation environments cs.RO | cs.CVPDF

Abhishek Phadke, Shakib Mahmud Dipto, Pratip Rana

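TL;DR: This paper proposes a simulation workflow for generating synthetic LiDAR datasets to support autonomous vehicle perception, robotics research, and sensor security analysis. Using the CoppeliaSim simulation environment and its Python API, it integrates LiDAR, image sensors, and two-dimensional scanners to produce synchronized multimodal data, and examines security vulnerabilities.

Details

Motivation: To provide high-quality synthetic LiDAR datasets for autonomous driving and robotics research while exploring security vulnerabilities in LiDAR data and defense strategies.

Result: The workflow generates large-scale point clouds with corresponding RGB and depth images, illustrates potential security vulnerabilities, and shows how synthetic data can support the evaluation of defense strategies.

Insight: Synthetic datasets can efficiently support perception research and security analysis, but environmental realism, sensor noise modeling, and computational scalability need improvement. Future work could add weather effects and real-world terrain models.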

Abstract: This paper presents a simulation workflow for generating synthetic LiDAR datasets to support autonomous vehicle perception, robotics research, and sensor security analysis. Leveraging the CoppeliaSim simulation environment and its Python API, we integrate time-of-flight LiDAR, image sensors, and two dimensional scanners onto a simulated vehicle platform operating within an urban scenario. The workflow automates data capture, storage, and annotation across multiple formats (PCD, PLY, CSV), producing synchronized multimodal datasets with ground truth pose information. We validate the pipeline by generating large-scale point clouds and corresponding RGB and depth imagery. The study examines potential security vulnerabilities in LiDAR data, such as adversarial point injection and spoofing attacks, and demonstrates how synthetic datasets can facilitate the evaluation of defense strategies. Finally, limitations related to environmental realism, sensor noise modeling, and computational scalability are discussed, and future research directions, such as incorporating weather effects, real-world terrain models, and advanced scanner configurations, are proposed. The workflow provides a versatile, reproducible framework for generating high-fidelity synthetic LiDAR datasets to advance perception research and strengthen sensor security in autonomous systems. Documentation and examples accompany this framework; samples of animated cloud returns and image sensor data can be found at this Link.
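One of the storage formats mentioned, PCD, has a simple ASCII variant that is easy to emit by hand. The sketch below serializes XYZ points following the PCD v0.7 header layout; it is a standalone illustration, not the paper's pipeline, and it makes no CoppeliaSim API calls. The function name and the sample points are hypothetical.

```python
def to_ascii_pcd(points):
    """Serialize (x, y, z) tuples as an unorganized ASCII PCD v0.7 cloud."""
    header = [
        "# .PCD v0.7 - Point Cloud Data file format",
        "VERSION 0.7",
        "FIELDS x y z",
        "SIZE 4 4 4",        # bytes per field
        "TYPE F F F",        # floating-point fields
        "COUNT 1 1 1",
        f"WIDTH {len(points)}",
        "HEIGHT 1",          # HEIGHT 1 marks an unorganized cloud
        "VIEWPOINT 0 0 0 1 0 0 0",
        f"POINTS {len(points)}",
        "DATA ascii",
    ]
    body = [f"{x:.4f} {y:.4f} {z:.4f}" for x, y, z in points]
    return "\n".join(header + body) + "\n"


# Hypothetical LiDAR returns in metres.
pcd = to_ascii_pcd([(1.0, 2.0, 3.0), (4.0, 5.0, 6.5)])
```

The resulting string can be written to a `.pcd` file; extra per-point fields such as intensity would need matching entries in the FIELDS, SIZE, TYPE, and COUNT lines.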

Bernard Lange, Anil Yildiz, Mansur Arief, Shehryar Khattak, Mykel Kochenderfer

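TL;DR: This paper proposes ARNA, a general-purpose robot navigation framework that uses a large vision-language model (LVLM) to coordinate perception, reasoning, and action dynamically, enabling effective navigation and reasoning in unknown environments.

Details

Motivation: Existing robot navigation systems usually rely on task-specific neural networks and fixed data flows, which limits generalization. The general knowledge and reasoning ability of LVLMs can address this.

Result: ARNA achieves state-of-the-art performance on the HM-EQA benchmark in Habitat Lab, demonstrating navigation and question answering without predefined maps or fixed input representations.

Insight: The dynamic planning and multimodal reasoning abilities of LVLMs enable more flexible and general robot navigation systems.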

Abstract: Developing general-purpose navigation policies for unknown environments remains a core challenge in robotics. Most existing systems rely on task-specific neural networks and fixed data flows, limiting generalizability. Large Vision-Language Models (LVLMs) offer a promising alternative by embedding human-like knowledge suitable for reasoning and planning. Yet, prior LVLM-robot integrations typically depend on pre-mapped spaces, hard-coded representations, and myopic exploration. We introduce the Agentic Robotic Navigation Architecture (ARNA), a general-purpose navigation framework that equips an LVLM-based agent with a library of perception, reasoning, and navigation tools available within modern robotic stacks. At runtime, the agent autonomously defines and executes task-specific workflows that iteratively query the robotic modules, reason over multimodal inputs, and select appropriate navigation actions. This approach enables robust navigation and reasoning in previously unmapped environments, providing a new perspective on robotic stack design. Evaluated in Habitat Lab on the HM-EQA benchmark, ARNA achieves state-of-the-art performance, demonstrating effective exploration, navigation, and embodied question answering without relying on handcrafted plans, fixed input representations, or pre-existing maps.

[198] EASE: Embodied Active Event Perception via Self-Supervised Energy Minimization cs.RO | cs.CVPDF

Zhou Chen, Sanjoy Kundu, Harsimran S. Baweja, Sathyanarayanan N. Aakur

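TL;DR: EASE is a framework based on self-supervised energy minimization that uses prediction error and entropy as intrinsic signals to perceive and track dynamic events, without external annotations or rewards.

Details

Motivation: Existing event perception methods depend on predefined action spaces, annotated data, and external rewards, which makes them hard to adapt to dynamic real-world scenes. EASE, inspired by cognitive theories and predictive coding, aims to address this problem.

Result: Experiments show that EASE achieves privacy-preserving event perception in both simulation and real-world settings, and exhibits behaviors such as implicit memory and target continuity.

Insight: Through self-supervised energy minimization, EASE shows the potential of adaptive event perception and points to a new direction for embodied systems in dynamic environments.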

Abstract: Active event perception, the ability to dynamically detect, track, and summarize events in real time, is essential for embodied intelligence in tasks such as human-AI collaboration, assistive robotics, and autonomous navigation. However, existing approaches often depend on predefined action spaces, annotated datasets, and extrinsic rewards, limiting their adaptability and scalability in dynamic, real-world scenarios. Inspired by cognitive theories of event perception and predictive coding, we propose EASE, a self-supervised framework that unifies spatiotemporal representation learning and embodied control through free energy minimization. EASE leverages prediction errors and entropy as intrinsic signals to segment events, summarize observations, and actively track salient actors, operating without explicit annotations or external rewards. By coupling a generative perception model with an action-driven control policy, EASE dynamically aligns predictions with observations, enabling emergent behaviors such as implicit memory, target continuity, and adaptability to novel environments. Extensive evaluations in simulation and real-world settings demonstrate EASE’s ability to achieve privacy-preserving and scalable event perception, providing a robust foundation for embodied systems in unscripted, dynamic tasks.

[199] Radar and Event Camera Fusion for Agile Robot Ego-Motion Estimation cs.RO | cs.CVPDF

Yang Lyu, Zhenghao Zou, Yanfeng Li, Chunhui Zhao, Quan Pan

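TL;DR: The paper proposes a novel IMU-free method that combines an event camera and a millimeter-wave radar for ego-motion estimation of agile robots. It needs no complex feature association and uses raw events and Doppler measurements directly, yielding efficient and stable velocity estimates.

Details

Motivation: In highly dynamic scenarios, conventional sensors (such as IMUs) often fail to provide reliable motion estimates because of measurement blurring, distortion, and delay. Combining an event camera with a millimeter-wave radar opens a new way to address this problem.

Result: Experiments show that the framework is robust in texture-less and structureless environments and is computationally efficient, making it suitable for edge devices.

Insight: Combining an event camera with a millimeter-wave radar can overcome sensor limitations in dynamic scenes and deliver efficient motion estimation without a complex association process.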

Abstract: Achieving reliable ego-motion estimation for agile robots, e.g., aerobatic aircraft, remains challenging because most robot sensors fail to respond promptly and clearly to highly dynamic robot motions, often resulting in measurement blurring, distortion, and delays. In this paper, we propose an IMU-free and feature-association-free framework to achieve aggressive ego-motion velocity estimation of a robot platform in highly dynamic scenarios by combining two types of exteroceptive sensors, an event camera and a millimeter-wave radar. First, we use instantaneous raw events and Doppler measurements to derive rotational and translational velocities directly. Without a sophisticated association process between measurement frames, the proposed method is more robust in texture-less and structureless environments and is more computationally efficient for edge computing devices. Then, in the back-end, we propose a continuous-time state-space model to fuse the hybrid time-based and event-based measurements to estimate the ego-motion velocity in a fixed-lag smoother fashion. Finally, we validate our velometer framework extensively on self-collected experiment datasets. The results indicate that our IMU-free and association-free ego-motion estimation framework can achieve reliable and efficient velocity output in challenging environments. The source code, illustrative video and dataset are available at https://github.com/ZzhYgwh/TwistEstimator.

[200] TDACloud: Point Cloud Recognition Using Topological Data Analysis cs.RO | cs.CG | cs.CVPDF

Anirban Ghosh, Ian Dahlin, Ayan Dutta

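TL;DR: TDACloud is a local descriptor extraction method for point cloud recognition based on Topological Data Analysis (TDA). It requires no GPU-intensive training and performs well under noise and transformations.

Details

Motivation: Point cloud recognition matters for applications such as autonomous driving and scene reconstruction, but existing methods perform poorly when the input is noisy or transformed (e.g., rotated), and they often depend on resource-intensive training.

Result: Tested on real-world datasets (e.g., Oxford RobotCar, KITTI-360) and synthetic data (e.g., ShapeNet), TDACloud reaches high recognition accuracy under noise and transformations and outperforms the baselines by up to about 14%.

Insight: TDA offers a lightweight and robust approach to point cloud recognition, with particular promise in resource-constrained or noisy settings.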

Abstract: Point cloud-based object/place recognition remains a problem of interest in applications such as autonomous driving, scene reconstruction, and localization. Extracting meaningful local descriptors from a query point cloud that can be matched with the descriptors of the collected point clouds is a challenging problem. Furthermore, when the query point cloud is noisy or has been transformed (e.g., rotated), it adds to the complexity. To this end, we propose a novel methodology, named TDACloud, using Topological Data Analysis (TDA) for local descriptor extraction from a point cloud, which does not need resource-intensive GPU-based machine learning training. More specifically, we used the ATOL vectorization method to generate vectors for point clouds. Unlike voxelization, our proposed technique can take raw point clouds as inputs and outputs a fixed-size TDA-descriptor vector. To test the quality of the proposed TDACloud technique, we have implemented it on multiple real-world (e.g., Oxford RobotCar, KITTI-360) and realistic (e.g., ShapeNet) point cloud datasets for object and place recognition. We have also tested TDACloud on noisy and transformed test cases where the query point cloud has been scaled, translated, or rotated. Our results demonstrate high recognition accuracies in noisy conditions and large-scale real-world place recognition while outperforming the baselines by up to approximately 14%.

[201] GRAND-SLAM: Local Optimization for Globally Consistent Large-Scale Multi-Agent Gaussian SLAM cs.RO | cs.CVPDF

Annika Thomas, Aneesa Sonawalla, Alex Rose, Jonathan P. How

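TL;DR: GRAND-SLAM is a multi-agent Gaussian SLAM method that uses local optimization and loop closure to reconstruct large-scale outdoor environments efficiently, outperforming existing methods.

Details

Motivation: Current multi-agent Gaussian SLAM methods are limited to small-scale indoor environments and cannot meet the needs of large-scale outdoor scenes.

Result: PSNR is 28% higher on the Replica dataset, and multi-agent tracking error is 91% lower on the Kimera-Multi dataset.

Insight: Gaussian splatting can be extended to large-scale multi-agent settings through local optimization and loop closure, showing its potential in complex scenes.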

Abstract: 3D Gaussian splatting has emerged as an expressive scene representation for RGB-D visual SLAM, but its application to large-scale, multi-agent outdoor environments remains unexplored. Multi-agent Gaussian SLAM is a promising approach to rapid exploration and reconstruction of environments, offering scalable environment representations, but existing approaches are limited to small-scale, indoor environments. To that end, we propose Gaussian Reconstruction via Multi-Agent Dense SLAM, or GRAND-SLAM, a collaborative Gaussian splatting SLAM method that integrates i) an implicit tracking module based on local optimization over submaps and ii) an approach to inter- and intra-robot loop closure integrated into a pose-graph optimization framework. Experiments show that GRAND-SLAM provides state-of-the-art tracking performance and 28% higher PSNR than existing methods on the Replica indoor dataset, as well as 91% lower multi-agent tracking error and improved rendering over existing multi-agent methods on the large-scale, outdoor Kimera-Multi dataset.

cs.MM [Back]

[202] Can Generated Images Serve as a Viable Modality for Text-Centric Multimodal Learning? cs.MM | cs.CVPDF

Yuesheng Huang, Peng Zhang, Riliang Liu, Jiaqi Liang

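TL;DR: The paper investigates whether images generated on the fly by text-to-image (T2I) models can serve as a complementary modality for text-centric tasks, and tests the idea on text classification. This “synthetic perception” approach can yield significant gains under specific conditions, but its effect depends on the semantic alignment between text and generated image, the visual groundability of the task, and the generation quality of the T2I model.

Details

Motivation: Text-only data is abundant and multimodal models are increasingly capable, yet a “modality gap” separates the two. The study explores whether generated images can act as an effective complementary modality for text-centric tasks and so bridge this gap.

Result: Synthetic images can significantly improve performance, but the effect depends heavily on semantic alignment, the visual groundability of the task, and the generation quality of the T2I model.

Insight: Generated images can serve as an effective complementary modality for text-centric tasks when specific conditions are met, which suggests a new direction for multimodal learning.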

Abstract: A significant “modality gap” exists between the abundance of text-only data and the increasing power of multimodal models. This work systematically investigates whether images generated on-the-fly by Text-to-Image (T2I) models can serve as a valuable complementary modality for text-centric tasks. Through a comprehensive evaluation framework on text classification, we analyze the impact of critical variables, including T2I model quality, prompt engineering strategies, and multimodal fusion architectures. Our findings demonstrate that this “synthetic perception” can yield significant performance gains, even when augmenting strong large language model baselines. However, we find the effectiveness of this approach is highly conditional, depending critically on the semantic alignment between text and the generated image, the inherent “visual groundability” of the task, and the generative fidelity of the T2I model. Our work establishes the first rigorous benchmark for this paradigm, providing a clear analysis of its potential and current limitations, and demonstrating its viability as a pathway to enrich language understanding in traditionally unimodal scenarios.

cs.FL [Back]

[203] Tutorial: $\varphi$-Transductions in OpenFst via the Gallic Semiring cs.FL | cs.CLPDF

Marco Cognetta, Cyril Allauzen

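TL;DR: The tutorial shows how to implement $\varphi$-transductions correctly in OpenFst via the Gallic semiring, using the MaxMatch (WordPiece) tokenization algorithm as a demonstration.

Details

Motivation: The OpenFst library supports $\varphi$-transitions, but an implementation constraint prevents them from being used directly with transducers. The tutorial aims to provide an effective way around this problem.

Result: A working use case of $\varphi$-transductions is implemented, and the MaxMatch algorithm confirms that the approach is feasible.

Insight: The algebraic properties of semirings offer a flexible way to work around technical limitations in finite-state transducer libraries.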

Abstract: OpenFst, a popular finite-state transducer library, supports $\varphi$-transitions but, due to an implementation constraint, they cannot be used with transducers in a straightforward way. In this short tutorial, we describe how one can use other functionality provided by OpenFst (namely, the Gallic semiring) to correctly implement $\varphi$-transductions and demonstrate it by implementing the MaxMatch (WordPiece) tokenization algorithm (Devlin et al., 2019; Song et al., 2021). Accompanying self-contained code examples are provided. https://www.openfst.org/twiki/pub/Contrib/FstContrib/phi_transduction_tutorial_code.tgz
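
For readers unfamiliar with MaxMatch, the greedy longest-match-first rule that the tutorial encodes as a $\varphi$-transduction can be sketched in plain Python. This is only a reference sketch of the tokenization rule, not the OpenFst or Gallic-semiring construction from the tutorial; the function name `max_match`, the `##` continuation prefix, the `[UNK]` token, and the toy vocabulary are assumptions chosen for illustration.

```python
def max_match(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first tokenization of a single word.

    Hypothetical helper: non-initial pieces are looked up with a
    continuation prefix, following the common WordPiece convention.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No vocabulary entry matches here: the whole word is unknown.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


toy_vocab = {"un", "##aff", "##able", "##a", "##b"}
print(max_match("unaffable", toy_vocab))  # ['un', '##aff', '##able']
```

The fallback to a shorter match when the longest candidate fails is the behavior that failure ($\varphi$) transitions capture in the automaton setting.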

cs.IR [Back]

[204] Enhancing Document Retrieval in COVID-19 Research: Leveraging Large Language Models for Hidden Relation Extraction cs.IR | cs.CLPDF

Hoang-An Trieu, Dinh-Truong Do, Chau Nguyen, Vu Tran, Minh Le Nguyen

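TL;DR: The paper proposes a method that uses large language models (LLMs) to extract hidden relations, improving the efficiency and quality of a COVID-19 literature retrieval system.

Details

Motivation: The COVID-19 pandemic caused a surge in related publications, and traditional retrieval tools cannot efficiently extract hidden relations, which slows access to research information.

Result: The system returns more high-quality search results, helping researchers obtain useful information more efficiently when a pandemic breaks out.

Insight: LLMs can compensate for the shortcomings of traditional tools in literature mining, especially on large-scale data with complex relations.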

Abstract: In recent years, with the emergence of the COVID-19 pandemic, numerous publications relevant to this disease have been issued. Because of the massive volume of publications, an efficient retrieval system is necessary to provide researchers with useful information when an unexpected pandemic such as COVID-19 strikes suddenly. In this work, we present a method that helps a retrieval system, the Covrelex-SE system, provide more high-quality search results. We exploit the power of large language models (LLMs) to extract hidden relationships inside unlabeled publications that cannot be found by the parsing tools the system currently uses. This gives the system more useful information during the retrieval process.

[205] LLM-Enhanced Multimodal Fusion for Cross-Domain Sequential Recommendation cs.IR | cs.CVPDF

Wangyu Wu, Zhenhong Chen, Xianglin Qiu, Siqi Song, Xiaowei Huang

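TL;DR: The paper proposes LLM-EMF, a new method that enriches item information with large language model knowledge and fuses multimodal data, significantly improving cross-domain sequential recommendation.

Details

Motivation: Cross-domain sequential recommendation (CDSR) must model cross-domain preferences and capture item relationships within and across sequences, but existing methods make limited use of multimodal data.

Result: On four e-commerce datasets, LLM-EMF outperforms existing methods, showing the benefit of multimodal data fusion.

Insight: Combining LLM knowledge with multimodal data captures user interests more fully and improves cross-domain recommendation.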

Abstract: Cross-Domain Sequential Recommendation (CDSR) predicts user behavior by leveraging historical interactions across multiple domains, focusing on modeling cross-domain preferences and capturing both intra- and inter-sequence item relationships. We propose LLM-Enhanced Multimodal Fusion for Cross-Domain Sequential Recommendation (LLM-EMF), a novel and advanced approach that enhances textual information with Large Language Models (LLM) knowledge and significantly improves recommendation performance through the fusion of visual and textual data. Using the frozen CLIP model, we generate image and text embeddings, thereby enriching item representations with multimodal data. A multiple attention mechanism jointly learns both single-domain and cross-domain preferences, effectively capturing and understanding complex user interests across diverse domains. Evaluations conducted on four e-commerce datasets demonstrate that LLM-EMF consistently outperforms existing methods in modeling cross-domain user preferences, thereby highlighting the effectiveness of multimodal data integration and its advantages in enhancing sequential recommendation systems. Our source code will be released.

physics.geo-ph [Back]

[206] Pix2Geomodel: A Next-Generation Reservoir Geomodeling with Property-to-Property Translation physics.geo-ph | cs.CE | cs.CV | cs.LG | cs.NEPDF

Abdulrahman Al-Fakih, Ardiansyah Koeshidayatullah, Nabil A. Saraih, Tapan Mukerji, Rayan Kanfar

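TL;DR: Pix2Geomodel is a Pix2Pix-based cGAN framework for predicting reservoir properties from geological data, demonstrating high-accuracy property mapping.

Details

Motivation: Traditional geomodeling methods struggle with complex subsurface heterogeneity and with conditioning to observed data, so a more effective approach is needed.

Result: Accuracy is high for facies and water saturation prediction (PA of 0.88 and 0.96, respectively) but only moderate for porosity and permeability.

Insight: The framework captures spatial variability but is limited by 2D modeling and microstructural variability; future work needs to extend it to 3D and multimodal data.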

Abstract: Accurate geological modeling is critical for reservoir characterization, yet traditional methods struggle with complex subsurface heterogeneity and with conditioning to observed data. This study introduces Pix2Geomodel, a novel conditional generative adversarial network (cGAN) framework based on Pix2Pix, designed to predict reservoir properties (facies, porosity, permeability, and water saturation) from the Rotliegend reservoir of the Groningen gas field. Utilizing a 7.6 million-cell dataset from the Nederlandse Aardolie Maatschappij, accessed via EPOS-NL, the methodology included data preprocessing, augmentation to generate 2,350 images per property, and training with a U-Net generator and PatchGAN discriminator over 19,000 steps. Evaluation metrics, including pixel accuracy (PA), mean intersection over union (mIoU), and frequency weighted intersection over union (FWIoU), together with visualizations, were used to assess performance in masked property prediction and property-to-property translation tasks. Results demonstrated high accuracy for facies (PA 0.88, FWIoU 0.85) and water saturation (PA 0.96, FWIoU 0.95), with moderate success for porosity (PA 0.70, FWIoU 0.55) and permeability (PA 0.74, FWIoU 0.60), and robust translation performance (e.g., facies-to-facies PA 0.98, FWIoU 0.97). The framework captured spatial variability and geological realism, as validated by variogram analysis, and training loss curves for the generator and discriminator were calculated for each property. Compared to traditional methods, Pix2Geomodel offers enhanced fidelity in direct property mapping. Limitations include challenges with microstructural variability and 2D constraints, suggesting future integration of multi-modal data and 3D modeling (Pix2Geomodel v2.0). This study advances the application of generative AI in geoscience, supporting improved reservoir management and open science initiatives.
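
The three metrics named in the abstract (PA, mIoU, FWIoU) have standard definitions in semantic segmentation, sketched below with NumPy. The function name `segmentation_metrics` and the toy labels are illustrative assumptions; the paper's own evaluation code may differ, for instance in how it treats classes that never occur.

```python
import numpy as np


def segmentation_metrics(y_true, y_pred, num_classes):
    """Return (pixel accuracy, mean IoU, frequency-weighted IoU).

    Assumes integer class labels in [0, num_classes). Classes absent
    from both maps are excluded from the averages (an assumption).
    """
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    # confusion[i, j] counts pixels of true class i predicted as class j.
    confusion = np.bincount(
        num_classes * y_true + y_pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)

    tp = np.diag(confusion).astype(float)
    true_count = confusion.sum(axis=1).astype(float)
    pred_count = confusion.sum(axis=0).astype(float)
    union = true_count + pred_count - tp
    present = union > 0

    iou = np.zeros(num_classes)
    iou[present] = tp[present] / union[present]

    total = confusion.sum()
    pixel_accuracy = tp.sum() / total
    mean_iou = iou[present].mean()
    # Each class IoU is weighted by how often the class occurs in y_true.
    fw_iou = (true_count[present] * iou[present]).sum() / total
    return pixel_accuracy, mean_iou, fw_iou


pa, miou, fwiou = segmentation_metrics([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
print(pa, miou, fwiou)
```

Because FWIoU weights each class by its frequency, it can sit well above mIoU when a dominant class is predicted accurately, which is worth keeping in mind when reading the PA and FWIoU pairs reported above.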

eess.AS [Back]

[207] Enhancing Few-shot Keyword Spotting Performance through Pre-Trained Self-supervised Speech Models eess.AS | cs.CL | cs.SDPDF

Alican Gok, Oguzhan Buyuksolak, Osman Erman Okman, Murat Saraclar

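TL;DR: The paper proposes a training scheme that uses self-supervised learning models to improve few-shot keyword spotting. Through feature extraction, dimensionality reduction, and knowledge distillation, it substantially raises classification accuracy.

Details

Motivation: Traditional few-shot keyword spotting systems lack accuracy and adaptability on edge devices and perform especially poorly in resource-constrained environments.

Result: On the GSC dataset, 10-shot classification accuracy rises from 33.4% to 74.1%, well above traditional methods.

Insight: Combining self-supervised learning with knowledge distillation improves few-shot performance and has practical potential on resource-constrained devices.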

Abstract: Keyword Spotting plays a critical role in enabling hands-free interaction for battery-powered edge devices. Few-Shot Keyword Spotting (FS-KWS) addresses the scalability and adaptability challenges of traditional systems by enabling recognition of custom keywords with only a few examples. However, existing FS-KWS systems achieve subpar accuracy at desirable false acceptance rates, particularly in resource-constrained edge environments. To address these issues, we propose a training scheme that leverages self-supervised learning models for robust feature extraction, dimensionality reduction, and knowledge distillation. The teacher model, based on Wav2Vec 2.0, is trained using Sub-center ArcFace loss, which enhances inter-class separability and intra-class compactness. To enable efficient deployment on edge devices, we introduce attention-based dimensionality reduction and train a standard lightweight ResNet15 student model. We evaluate the proposed approach on the English portion of the Multilingual Spoken Words Corpus (MSWC) and the Google Speech Commands (GSC) datasets. Notably, the proposed training method improves the 10-shot classification accuracy from 33.4% to 74.1% on 11 classes at 1% false alarm accuracy on the GSC dataset, thus making it significantly better suited to real use cases.

cs.LG [Back]

[208] FaithfulSAE: Towards Capturing Faithful Features with Sparse Autoencoders without External Dataset Dependencies cs.LG | cs.AI | cs.CLPDF

Seonglae Cho, Harryn Oh, Donghyun Lee, Luis Eduardo Rodrigues Vieira, Andrew Bermingham

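TL;DR: FaithfulSAE is a new method for training sparse autoencoders (SAEs) on a dataset synthesized by the model itself. It addresses the feature instability and fake features of conventional SAEs and removes the dependency on external datasets.

Details

Motivation: Conventional SAEs are trained on external datasets, which can lead to unstable features (varying with the initialization seed) and to the capture of spurious features (“Fake Features”). These problems arise because external data may not match the model's generalization capabilities.

Result: Experiments show that FaithfulSAE has a lower Fake Feature Ratio in 5 of 7 models and outperforms SAEs trained on web-based datasets in the SAE probing task.

Insight: The work highlights the importance of the SAE training dataset and argues that data generated by the model itself better captures genuine model features, offering a new angle on improving interpretability.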

Abstract: Sparse Autoencoders (SAEs) have emerged as a promising solution for decomposing large language model representations into interpretable features. However, Paulo and Belrose (2025) have highlighted instability across different initialization seeds, and Heap et al. (2025) have pointed out that SAEs may not capture model-internal features. These problems likely stem from training SAEs on external datasets - either collected from the Web or generated by another model - which may contain out-of-distribution (OOD) data beyond the model’s generalisation capabilities. This can result in hallucinated SAE features, which we term “Fake Features”, that misrepresent the model’s internal activations. To address these issues, we propose FaithfulSAE, a method that trains SAEs on the model’s own synthetic dataset. Using FaithfulSAEs, we demonstrate that training SAEs on less-OOD instruction datasets results in SAEs being more stable across seeds. Notably, FaithfulSAEs outperform SAEs trained on web-based datasets in the SAE probing task and exhibit a lower Fake Feature Ratio in 5 out of 7 models. Overall, our approach eliminates the dependency on external datasets, advancing interpretability by better capturing model-internal features while highlighting the often neglected importance of SAE training datasets.

[209] Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach cs.LG | cs.AI | cs.CLPDF

Xinnan Zhang, Chenliang Li, Siliang Zeng, Jiaxiang Li, Zhongruo Wang

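TL;DR: The paper proposes IRO, a reinforcement learning framework that aligns frozen LLMs without modifying their parameters, addressing the fact that conventional methods cannot be applied at test time or without access to model weights.

Details

Motivation: Conventional methods (such as RLHF and DPO) optimize model parameters directly, so they cannot be used at test time or when the weights are inaccessible. Test-time methods, in turn, incur high inference costs and rely on imperfect reward functions, which leads to suboptimal outputs.

Result: Users can align a model on their own dataset (similar to RFT) without access to the model weights.

Insight: The IRO framework offers a new route to alignment that removes the limitations of conventional methods regarding test-time optimization and weight access.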

Abstract: Aligning large language models (LLMs) with human preferences usually requires fine-tuning methods such as RLHF and DPO. These methods directly optimize the model parameters, so they cannot be used in test-time to improve model performance, nor are they applicable when the model weights are not accessible. In contrast, test-time methods sidestep weight updates by leveraging reward functions to guide and improve output quality. However, they incur high inference costs, and their one-shot guidance is often based on imperfect reward or value functions, leading to suboptimal outputs. In this work, we present a method named Iterative Reweight-then-Optimize (IRO), a reinforcement learning (RL) framework that performs RL-style alignment of the (frozen) base model without touching its parameters. During training, each iteration (i) samples candidates from the base model, (ii) resamples using current value functions, and (iii) trains a new lightweight value function that guides the next decoding pass. At test time, the value functions are used to guide the base model generation via a search-based optimization process. Notably, users can apply IRO to align a model on their own dataset, similar to OpenAI’s reinforcement fine-tuning (RFT), but without requiring access to the model weights.

[210] AdapThink: Adaptive Thinking Preferences for Reasoning Language Model cs.LG | cs.AI | cs.CLPDF

Xu Wan, Wei Wang, Wenyue Xu, Wotao Yin, Jie Song

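TL;DR: AdapThink is an adaptive post-training framework that improves the reasoning efficiency of language models while maintaining performance, by dynamically adjusting reflection preferences and using a diversity-aware sampling mechanism.

Details

Motivation: Conventional RL-based post-training has a reasoning-efficiency problem: models over-compute on simple questions and shift reasoning prematurely on complex ones. Static budgets and predefined rules lack adaptability.

Result: Experiments on several mathematical reasoning datasets confirm that AdapThink improves reasoning efficiency and reduces these inefficiencies.

Insight: Dynamic adjustment and diversity-aware sampling are effective ways to improve the reasoning efficiency of language models.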

Abstract: Reinforcement Learning (RL)-based post-training has significantly advanced the complex reasoning capabilities of language models, fostering sophisticated self-reflection processes. However, this “slow thinking” paradigm presents a critical challenge to reasoning efficiency: models may expend excessive computation on simple questions and shift reasoning prematurely for complex ones. Previous mechanisms typically rely on static length budgets or predefined rules, lacking the adaptability for varying question complexities and models’ evolving capabilities. To this end, we propose AdapThink, an adaptive post-training framework designed to induce more efficient thinking while maintaining the performance of reasoning language models. Specifically, AdapThink incorporates two key mechanisms: 1) A group-relative reward function that leverages model confidence and response’s characteristic to dynamically adjust the preference of reflection-related transition words without resorting to a fixed length preference. 2) A diversity-aware sampling mechanism that balances the training group’s solution accuracy with reasoning diversity via an entropy-guided score. Experiments on several mathematical reasoning datasets with DeepSeek-distilled models demonstrate AdapThink’s advantages in enabling adaptive reasoning patterns and mitigating the inefficiencies.

[211] RLPR: Extrapolating RLVR to General Domains without Verifiers cs.LG | cs.AI | cs.CLPDF

Tianyu Yu, Bo Ji, Shouli Wang, Shu Yao, Zefan Wang

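TL;DR: RLPR is a verifier-free reinforcement learning framework that uses the LLM's own probability of generating the answer as the reward signal, extending RLVR to broader general domains and significantly improving reasoning ability.

Details

Motivation: RLVR (Reinforcement Learning with Verifiable Rewards) shows promise in mathematics and code, but its reliance on domain-specific verifiers brings complexity and limits scalability. RLPR aims to remove the verifier dependency and extend RLVR to broader domains.

Result: On four general-domain and three mathematical benchmarks, RLPR significantly improves the reasoning of Gemma, Llama, and Qwen models, outperforming VeriFree and the verifier-dependent General-Reasoner.

Insight: The intrinsic probability an LLM assigns to an answer can replace an external verifier and provide a reliable reward signal for reinforcement learning, extending RLVR to broader domains.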

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates promising potential in advancing the reasoning capabilities of LLMs. However, its success remains largely confined to mathematical and code domains. This primary limitation stems from the heavy reliance on domain-specific verifiers, which results in prohibitive complexity and limited scalability. To address the challenge, our key observation is that LLM’s intrinsic probability of generating a correct free-form answer directly indicates its own evaluation of the reasoning reward (i.e., how well the reasoning process leads to the correct answer). Building on this insight, we propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. RLPR uses the LLM’s own token probability scores for reference answers as the reward signal and maximizes the expected reward during training. We find that addressing the high variance of this noisy probability reward is crucial to make it work, and propose prob-to-reward and stabilizing methods to ensure a precise and stable reward from LLM intrinsic probabilities. Comprehensive experiments in four general-domain benchmarks and three mathematical benchmarks show that RLPR consistently improves reasoning capabilities in both areas for Gemma, Llama, and Qwen based models. Notably, RLPR outperforms concurrent VeriFree by 7.6 points on TheoremQA and 7.5 points on Minerva, and even surpasses strong verifier-model-dependent approaches General-Reasoner by 1.6 average points across seven benchmarks.
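
As a rough illustration of a probability-based reward, the sketch below turns the per-token log-probabilities that a model assigns to a reference answer into a scalar. Averaging token probabilities is an assumption made here for illustration, and the abstract's prob-to-reward and stabilizing steps are not reproduced; the function names are hypothetical, and in practice the log-probabilities would come from a forward pass of the policy model.

```python
import math


def probability_reward(token_logprobs):
    """Mean per-token probability of the reference answer (assumed aggregation)."""
    if not token_logprobs:
        raise ValueError("reference answer must contain at least one token")
    probs = [math.exp(lp) for lp in token_logprobs]
    return sum(probs) / len(probs)


def sequence_likelihood(token_logprobs):
    """Joint probability of the whole answer, shown for comparison."""
    return math.exp(sum(token_logprobs))


# Hypothetical log-probabilities for a three-token reference answer.
logprobs = [math.log(0.9), math.log(0.8), math.log(0.1)]
print(probability_reward(logprobs))
print(sequence_likelihood(logprobs))
```

In this toy case the averaged score is far less dominated by the single low-probability token than the joint likelihood, which is one plausible reason to prefer an averaged form; the abstract itself only states that the raw probability reward is noisy and needs stabilizing.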

[212] Confucius3-Math: A Lightweight High-Performance Reasoning LLM for Chinese K-12 Mathematics Learning cs.LG | cs.AI | cs.CLPDF

Lixin Wu, Na Cai, Qiao Cheng, Jiachen Wang, Yitao Duan

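TL;DR: Confucius3-Math is a lightweight, high-performance reasoning LLM designed for Chinese K-12 mathematics learning. Built through post-training with large-scale reinforcement learning, it runs efficiently and performs strongly.

Details

Motivation: The goal is to use AI to improve education and knowledge dissemination, specifically to serve Chinese K-12 mathematics learning.

Result: The model outperforms many considerably larger models on mathematical reasoning tasks, at low cost.

Insight: The work shows that strong reasoning models can be built for a specific domain at low cost, providing a practical tool for AI in education.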

Abstract: We introduce Confucius3-Math, an open-source large language model with 14B parameters that (1) runs efficiently on a single consumer-grade GPU; (2) achieves SOTA performance on a range of mathematical reasoning tasks, outperforming many models with significantly larger sizes. In particular, as part of our mission to enhance education and knowledge dissemination with AI, Confucius3-Math is specifically committed to mathematics learning for Chinese K-12 students and educators. Built via post-training with large-scale reinforcement learning (RL), Confucius3-Math aligns with the national curriculum and excels at solving mainstream Chinese K-12 mathematical problems at low cost. In this report we share our development recipe, the challenges we encountered, and the techniques we developed to overcome them. In particular, we introduce three technical innovations: Targeted Entropy Regularization, Recent Sample Recovery and Policy-Specific Hardness Weighting. These innovations encompass a new entropy regularization, a novel data scheduling policy, and an improved group-relative advantage estimator. Collectively, they significantly stabilize the RL training, improve data efficiency, and boost performance. Our work demonstrates the feasibility of building strong reasoning models in a particular domain at low cost. We open-source our model and code at https://github.com/netease-youdao/Confucius3-Math.

[213] SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation cs.LG | cs.CLPDF

Zichong Li, Chen Liang, Zixuan Zhang, Ilgee Hong, Young Jin Kim

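TL;DR: SlimMoE is a multi-stage compression framework that uses expert slimming and knowledge distillation to compress large MoE models into efficient smaller ones, reducing compute and memory requirements while preserving performance.

Details

Motivation: The high memory requirements of large MoE models make them hard to fine-tune or deploy in resource-constrained environments. SlimMoE aims to solve this problem.

Result: The compressed Phi-mini-MoE and Phi-tiny-MoE models can be fine-tuned on a single GPU, outperform models of similar size, and come close to larger models.

Insight: Structured pruning combined with staged distillation is an effective way to compress MoE models and helps MoE architectures spread to resource-constrained settings.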

Abstract: The Mixture of Experts (MoE) architecture has emerged as a powerful paradigm for scaling large language models (LLMs) while maintaining inference efficiency. However, their enormous memory requirements make them prohibitively expensive to fine-tune or deploy in resource-constrained environments. To address this challenge, we introduce SlimMoE, a multi-stage compression framework for transforming large MoE models into much smaller, efficient variants without incurring the prohibitive costs of training from scratch. Our method systematically reduces parameter counts by slimming experts and transferring knowledge through intermediate stages, effectively mitigating the performance degradation common in one-shot pruning approaches. Using this framework, we compress Phi 3.5-MoE (41.9B total/6.6B activated parameters) to create Phi-mini-MoE (7.6B total/2.4B activated parameters) and Phi-tiny-MoE (3.8B total/1.1B activated parameters) using only 400B tokens–less than 10% of the original model’s training data. These compressed models can be fine-tuned on a single GPU (A100 for Phi-mini-MoE, A6000 for Phi-tiny-MoE), making them highly suitable for academic and resource-limited settings. Our experiments demonstrate that these compressed models outperform others of similar size and remain competitive with larger models. For instance, Phi-mini-MoE achieves similar or better performance to Phi-3-mini using only 2/3 of the activated parameters and yields comparable MMLU scores to Llama 3.1 8B despite having significantly lower latency. Our findings demonstrate that structured pruning combined with staged distillation offers an effective path to creating high-quality, compact MoE models, paving the way for broader adoption of MoE architectures. We make our models publicly available at https://huggingface.co/microsoft/Phi-mini-MoE-instruct and https://huggingface.co/microsoft/Phi-tiny-MoE-instruct .

[214] No Training Wheels: Steering Vectors for Bias Correction at Inference Time cs.LG | cs.CL | cs.CVPDF

Aviral Gupta, Armaan Sethi, Ameesh Sethi

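TL;DR: The paper proposes a cheap, training-free method that defines a “bias vector” as the difference in mean activations between majority and minority groups and subtracts it from the model's residual stream at inference time, reducing classification bias and improving worst-group accuracy.

Details

Motivation: Neural network classifiers trained on datasets with uneven group representation tend to inherit class biases and spurious correlations, so they perform poorly on atypical groups. Existing methods usually require retraining or substantial compute, which limits their practicality.

Result: Experiments show that the method significantly reduces classification bias and improves worst-group classification accuracy.

Insight: Steering vectors, a technique used with generative models, also work for classification tasks and provide a low-cost, immediate way to correct model bias.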

Abstract: Neural network classifiers trained on datasets with uneven group representation often inherit class biases and learn spurious correlations. These models may perform well on average but consistently fail on atypical groups. For example, in hair color classification, datasets may over-represent females with blond hair, reinforcing stereotypes. Although various algorithmic and data-centric methods have been proposed to address such biases, they often require retraining or significant compute. In this work, we propose a cheap, training-free method inspired by steering vectors used to edit behaviors in large language models. We compute the difference in mean activations between majority and minority groups to define a “bias vector,” which we subtract from the model’s residual stream. This leads to reduced classification bias and improved worst-group accuracy. We explore multiple strategies for extracting and applying these vectors in transformer-like classifiers, showing that steering vectors, traditionally used in generative models, can also be effective in classification. More broadly, we showcase an extremely cheap, inference time, training free method to mitigate bias in classification models.
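
The core operation described in the abstract, a difference of group means subtracted from activations, can be sketched on plain NumPy arrays. The helper names `bias_vector` and `steer`, the `strength` parameter, and the toy activations are assumptions for illustration; in the paper the vector is extracted from and applied to a transformer-like classifier's residual stream, and the layer and extraction strategy are design choices the authors explore.

```python
import numpy as np


def bias_vector(majority_acts, minority_acts):
    """Difference of mean activations between the two groups."""
    majority_mean = np.mean(majority_acts, axis=0)
    minority_mean = np.mean(minority_acts, axis=0)
    return majority_mean - minority_mean


def steer(activations, vector, strength=1.0):
    """Subtract the (scaled) bias vector from every activation row.

    `strength` is a hypothetical knob, not a parameter from the paper.
    """
    return np.asarray(activations, dtype=float) - strength * vector


# Toy 2-dimensional activations standing in for residual-stream states.
majority = np.array([[2.0, 0.0], [4.0, 0.0]])
minority = np.array([[1.0, 0.0], [1.0, 0.0]])
vec = bias_vector(majority, minority)
print(vec)
print(steer(np.array([[5.0, 1.0]]), vec))
```

No gradient step is involved: the vector is computed once from stored activations and applied at inference time, which is what makes the approach training-free.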

[215] ReDit: Reward Dithering for Improved LLM Policy Optimization cs.LG | cs.AI | cs.CLPDF

Chenxing Wei, Jiarui Yu, Ying Tiffany He, Hande Dong, Yao Shu

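TL;DR: The paper proposes ReDit, which adds random noise to discrete reward signals to address gradient problems in LLM policy training, speeding up convergence and improving performance.

Details

Motivation: Discrete reward functions can cause gradient anomalies, unstable optimization, and slow convergence, which limits the efficiency of LLM policy optimization.

Result: Experiments show that ReDit matches vanilla GRPO with only about 10% of the training steps and still improves performance by 4% when trained for a similar duration.

Insight: The injected noise smooths gradient updates and also strengthens exploration, providing an effective remedy for optimization with discrete rewards.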

Abstract: DeepSeek-R1 has successfully enhanced Large Language Model (LLM) reasoning capabilities through its rule-based reward system. While it’s a “perfect” reward system that effectively mitigates reward hacking, such reward functions are often discrete. Our experimental observations suggest that discrete rewards can lead to gradient anomalies, unstable optimization, and slow convergence. To address this issue, we propose ReDit (Reward Dithering), a method that dithers the discrete reward signal by adding simple random noise. With this perturbed reward, exploratory gradients are continuously provided throughout the learning process, enabling smoother gradient updates and accelerating convergence. The injected noise also introduces stochasticity into flat reward regions, encouraging the model to explore novel policies and escape local optima. Experiments across diverse tasks demonstrate the effectiveness and efficiency of ReDit. On average, ReDit achieves performance comparable to vanilla GRPO with only approximately 10% of the training steps, and furthermore, still exhibits a 4% performance improvement over vanilla GRPO when trained for a similar duration. Visualizations confirm significant mitigation of gradient issues with ReDit. Moreover, theoretical analyses are provided to further validate these advantages.
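
A minimal sketch of the dithering idea follows, assuming zero-mean Gaussian noise and a group-normalized advantage of the kind commonly used with GRPO. The noise distribution, the `sigma` value, and both function names are assumptions; the abstract only specifies "simple random noise".

```python
import random
import statistics


def dither_rewards(rewards, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise to each discrete reward (assumed noise model)."""
    rng = rng or random.Random()
    return [r + rng.gauss(0.0, sigma) for r in rewards]


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A group where every sampled response earned the same discrete reward.
flat_group = [1.0, 1.0, 1.0, 1.0]
print(group_advantages(flat_group))  # all zeros: no learning signal
print(group_advantages(dither_rewards(flat_group, rng=random.Random(0))))
```

With identical discrete rewards every advantage is zero, so the group contributes no gradient; after dithering the advantages differ from zero, which illustrates the "flat reward region" effect the abstract describes.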

Jie Li, Shifei Ding, Lili Guo, Xuan Li

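TL;DR: The paper proposes MAGTKD, which combines a multi-modal anchor gated transformer with knowledge distillation for emotion recognition in conversation. It uses prompt learning and knowledge distillation to strengthen modality representations and achieves state-of-the-art performance on multiple datasets.

Details

Motivation: Emotion recognition in conversation (ERC) suffers from inefficient modality representations and uneven modality contributions, and existing methods overlook the differences between modalities and the complexity involved.

Result: On the IEMOCAP and MELD datasets, MAGTKD improves performance significantly and reaches the state of the art.

Insight: Knowledge distillation can effectively strengthen the representations of weaker modalities, and the multi-modal anchor gating mechanism offers a new approach to cross-modal information fusion.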

Abstract: Emotion Recognition in Conversation (ERC) aims to detect the emotions of individual utterances within a conversation. Generating efficient and modality-specific representations for each utterance remains a significant challenge. Previous studies have proposed various models to integrate features extracted using different modality-specific encoders. However, they neglect the varying contributions of modalities to this task and introduce high complexity by aligning modalities at the frame level. To address these challenges, we propose the Multi-modal Anchor Gated Transformer with Knowledge Distillation (MAGTKD) for the ERC task. Specifically, prompt learning is employed to enhance textual modality representations, while knowledge distillation is utilized to strengthen representations of weaker modalities. Furthermore, we introduce a multi-modal anchor gated transformer to effectively integrate utterance-level representations across modalities. Extensive experiments on the IEMOCAP and MELD datasets demonstrate the effectiveness of knowledge distillation in enhancing modality representations and achieve state-of-the-art performance in emotion recognition. Our code is available at: https://github.com/JieLi-dd/MAGTKD.

[217] PCaM: A Progressive Focus Attention-Based Information Fusion Method for Improving Vision Transformer Domain Adaptation cs.LG | cs.AI | cs.CVPDF

Zelin Zang, Fei Wang, Liangyu Li, Jinlin Wu, Chunshui Zhao

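TL;DR: The paper proposes PCaM, a progressive focus cross-attention mechanism that addresses foreground object mismatch in ViT-based domain adaptation. It also introduces an attentional guidance loss to improve cross-domain attention consistency, yielding significant performance gains.

Details

Motivation: Existing ViT-based unsupervised domain adaptation methods suffer from attention mismatch caused by inconsistent foreground object size and spatial distribution across domains, which degrades cross-domain alignment.

Result: Experiments on multiple datasets show that PCaM significantly improves adaptation performance and sets a new SOTA.

Insight: Attention-guided fusion of foreground semantics effectively resolves the object mismatch problem of ViTs in domain adaptation.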

Abstract: Unsupervised Domain Adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain. Recent UDA methods based on Vision Transformers (ViTs) have achieved strong performance through attention-based feature alignment. However, we identify a key limitation: foreground object mismatch, where the discrepancy in foreground object size and spatial distribution across domains weakens attention consistency and hampers effective domain alignment. To address this issue, we propose the Progressive Focus Cross-Attention Mechanism (PCaM), which progressively filters out background information during cross-attention, allowing the model to focus on and fuse discriminative foreground semantics across domains. We further introduce an attentional guidance loss that explicitly directs attention toward task-relevant regions, enhancing cross-domain attention consistency. PCaM is lightweight, architecture-agnostic, and easy to integrate into existing ViT-based UDA pipelines. Extensive experiments on Office-Home, DomainNet, VisDA-2017, and remote sensing datasets demonstrate that PCaM significantly improves adaptation performance and achieves new state-of-the-art results, validating the effectiveness of attention-guided foreground fusion for domain adaptation.

[218] DRIMV_TSK: An Interpretable Surgical Evaluation Model for Incomplete Multi-View Rectal Cancer Data cs.LG | cs.CVPDF

Wei Zhang, Zi Wang, Hanwen Zhou, Zhaohong Deng, Weiping Ding

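TL;DR: The paper proposes DRIMV_TSK, an interpretable multi-view surgical evaluation model for rectal cancer. It handles incomplete multi-view data by combining dual representation learning with a TSK fuzzy system, significantly improving evaluation quality.

Details

Motivation: Current assessment of surgical difficulty in rectal cancer relies mainly on clinical data, but advances in technology make additional data, such as MRI images, available for evaluation. Meanwhile, artificial intelligence makes a more comprehensive assessment possible.

Result: On the MVRC dataset, DRIMV_TSK outperforms other advanced algorithms and achieves the best results.

Insight: Combining multi-view data with an interpretable model design is key to improving surgical evaluation for rectal cancer. Dual representation learning handles data incompleteness effectively, while the TSK fuzzy system improves interpretability.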

Abstract: A reliable evaluation of surgical difficulty can improve the success of the treatment for rectal cancer and the current evaluation method is based on clinical data. However, more data about rectal cancer can be collected with the development of technology. Meanwhile, with the development of artificial intelligence, its application in rectal cancer treatment is becoming possible. In this paper, a multi-view rectal cancer dataset is first constructed to give a more comprehensive view of patients, including the high-resolution MRI image view, pressed-fat MRI image view, and clinical data view. Then, an interpretable incomplete multi-view surgical evaluation model is proposed, considering that it is hard to obtain extensive and complete patient data in real application scenarios. Specifically, a dual representation incomplete multi-view learning model is first proposed to extract the common information between views and specific information in each view. In this model, the missing view imputation is integrated into representation learning, and second-order similarity constraint is also introduced to improve the cooperative learning between these two parts. Then, based on the imputed multi-view data and the learned dual representation, a multi-view surgical evaluation model with the TSK fuzzy system is proposed. In the proposed model, a cooperative learning mechanism is constructed to explore the consistent information between views, and Shannon entropy is also introduced to adapt the view weight. On the MVRC dataset, we compared it with several advanced algorithms and DRIMV_TSK obtained the best results.
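
The abstract does not say how Shannon entropy enters the view weighting. One common formulation, shown here only as an assumption, minimizes the weighted sum of per-view losses plus `lam` times the negative entropy of the weights, subject to the weights summing to one. That problem has a closed-form softmax-style solution, sketched below; `view_losses`, `lam`, and the function name are hypothetical.

```python
import math


def entropy_regularized_weights(view_losses, lam=1.0):
    """Closed-form minimizer of sum(w*L) + lam*sum(w*ln w) with sum(w) = 1.

    The solution is w_v proportional to exp(-L_v / lam). Subtracting the
    minimum loss first keeps the exponentials numerically stable.
    """
    m = min(view_losses)
    scores = [math.exp(-(loss - m) / lam) for loss in view_losses]
    total = sum(scores)
    return [s / total for s in scores]


# Three views with illustrative losses: two reliable views and one weak view.
weights = entropy_regularized_weights([0.2, 0.2, 1.0], lam=0.5)

# A very large lam pushes the weights toward uniform.
near_uniform = entropy_regularized_weights([0.2, 0.2, 1.0], lam=1e6)
```

Under this assumed formulation, views with lower loss receive larger weights, and `lam` controls how sharply the weights concentrate on the best views.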

[219] Decoding Federated Learning: The FedNAM+ Conformal Revolution cs.LG | cs.CVPDF

Sree Bhargavi Balija, Amitash Nanda, Debashis Sahoo

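TL;DR: The paper proposes FedNAM+, a federated learning framework that combines Neural Additive Models (NAMs) with conformal prediction for interpretable and reliable uncertainty estimation. It uses a dynamic level adjustment technique and gradient-based sensitivity maps to deliver accurate predictions with transparent uncertainty measures.

Details

Motivation: Existing federated learning frameworks lack a comprehensive solution that combines uncertainty quantification, interpretability, and robustness. FedNAM+ aims to fill this gap.

Result: Validated on CT scan, MNIST, and CIFAR datasets, FedNAM+ achieves high prediction accuracy (e.g., only a 0.1% loss on MNIST), provides visual uncertainty measures, and is more computationally efficient than Monte Carlo Dropout.

Insight: The global uncertainty estimates and interpretability of FedNAM+ improve the transparency of federated learning, and low-confidence regions can guide the collection of additional data to improve model performance.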

Abstract: Federated learning has significantly advanced distributed training of machine learning models across decentralized data sources. However, existing frameworks often lack comprehensive solutions that combine uncertainty quantification, interpretability, and robustness. To address this, we propose FedNAM+, a federated learning framework that integrates Neural Additive Models (NAMs) with a novel conformal prediction method to enable interpretable and reliable uncertainty estimation. Our method introduces a dynamic level adjustment technique that utilizes gradient-based sensitivity maps to identify key input features influencing predictions. This facilitates both interpretability and pixel-wise uncertainty estimates. Unlike traditional interpretability methods such as LIME and SHAP, which do not provide confidence intervals, FedNAM+ offers visual insights into prediction reliability. We validate our approach through experiments on CT scan, MNIST, and CIFAR datasets, demonstrating high prediction accuracy with minimal loss (e.g., only 0.1% on MNIST), along with transparent uncertainty measures. Visual analysis highlights variable uncertainty intervals, revealing low-confidence regions where model performance can be improved with additional data. Compared to Monte Carlo Dropout, FedNAM+ delivers efficient and global uncertainty estimates with reduced computational overhead, making it particularly suitable for federated learning scenarios. Overall, FedNAM+ provides a robust, interpretable, and computationally efficient framework that enhances trust and transparency in decentralized predictive modeling.
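
For readers unfamiliar with conformal prediction, the sketch below shows standard split conformal prediction for a scalar regression target. It is not the FedNAM+ method, whose dynamic level adjustment is not specified in the abstract. The calibration data and function names are illustrative assumptions.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    If that rank exceeds n, the interval must be unbounded to keep coverage.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def predict_interval(y_hat, q):
    """Symmetric prediction interval around a point prediction."""
    return (y_hat - q, y_hat + q)


# Held-out calibration set: true targets and model predictions (toy values).
cal_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
cal_pred = [1.5, 2.0, 2.0, 4.5, 5.0, 8.0, 7.0, 8.5, 9.0]

# Nonconformity score: absolute residual on the calibration set.
scores = [abs(y - p) for y, p in zip(cal_true, cal_pred)]

q = conformal_quantile(scores, alpha=0.25)   # targets at least 75% coverage
interval = predict_interval(10.0, q)
```

Under the usual exchangeability assumption, intervals built this way cover the true value with probability at least `1 - alpha`, which is the kind of guarantee that distinguishes conformal methods from explanation tools such as LIME and SHAP.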

[220] Adapting Vision-Language Models for Evaluating World Models cs.LG | cs.AI | cs.CVPDF

Mariya Hendriksen, Tabish Rashid, David Bignell, Raluca Georgescu, Abdelhak Lemkhenter

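TL;DR: The paper presents UNIVERSE, a method that adapts vision-language models (VLMs) to the fine-grained, temporally sensitive evaluation of world models, and validates it across multiple task formats and training strategies.

Details

Motivation: Evaluating world models requires fine-grained, temporally sensitive capabilities that existing metrics do not capture. VLMs are promising automatic evaluators thanks to their strong multimodal reasoning, but they need targeted adaptation for such tasks.

Result: UNIVERSE matches the performance of task-specific baselines, and human studies confirm strong agreement with human judgments.

Insight: With appropriate adaptation, VLMs can handle fine-grained temporal evaluation tasks, providing a scalable, semantics-aware tool for evaluating world models.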

Abstract: World models – generative models that simulate environment dynamics conditioned on past observations and actions – are gaining prominence in planning, simulation, and embodied AI. However, evaluating their rollouts remains a fundamental challenge, requiring fine-grained, temporally grounded assessment of action alignment and semantic consistency – capabilities not captured by existing metrics. Vision-Language Models (VLMs) have shown promise as automatic evaluators of generative content due to their strong multimodal reasoning abilities. Yet, their use in fine-grained, temporally sensitive evaluation tasks remains limited and requires targeted adaptation. We introduce an evaluation protocol targeting two recognition tasks – action recognition and character recognition – each assessed across binary, multiple-choice, and open-ended formats. To support this, we present UNIVERSE (UNIfied Vision-language Evaluator for Rollouts in Simulated Environments), a method for adapting VLMs to rollout evaluation under data and compute constraints. We conduct a large-scale study comparing full, partial, and parameter-efficient finetuning across task formats, context lengths, sampling strategies, and data compositions. The resulting unified evaluator matches the performance of task-specific baselines using a single checkpoint. Human studies confirm strong alignment with human judgments, establishing UNIVERSE as a scalable, semantics-aware evaluator for world models.

Table of Contents

cs.CV [Back]

[1] SRKD: Towards Efficient 3D Point Cloud Segmentation via Structure- and Relation-aware Knowledge Distillation cs.CVPDF

[2] Fine-Scale Soil Mapping in Alaska with Multimodal Machine Learning cs.CV | cs.LGPDF

[3] RadarSeq: A Temporal Vision Framework for User Churn Prediction via Radar Chart Sequences cs.CV | cs.AIPDF

[4] P2MFDS: A Privacy-Preserving Multimodal Fall Detection System for Elderly People in Bathroom Environments cs.CV | cs.AIPDF

[5] A Novel Multi-layer Task-centric and Data Quality Framework for Autonomous Driving cs.CV | cs.AIPDF

[6] Efficient Feedback Gate Network for Hyperspectral Image Super-Resolution cs.CV | cs.LGPDF

[7] From Drawings to Decisions: A Hybrid Vision-Language Framework for Parsing 2D Engineering Drawings into Structured Manufacturing Knowledge cs.CV | cs.AI | cs.IRPDF

[8] Spatial-Temporal Pre-Training for Embryo Viability Prediction Using Time-Lapse Videos cs.CVPDF

[9] VMRA-MaR: An Asymmetry-Aware Temporal Framework for Longitudinal Breast Cancer Risk Prediction cs.CVPDF

[10] Trans${^2}$-CBCT: A Dual-Transformer Framework for Sparse-View CBCT Reconstruction cs.CV | cs.AIPDF

[11] Enhancing Wireless Device Identification through RF Fingerprinting: Leveraging Transient Energy Spectrum Analysis cs.CVPDF

[12] AQUA20: A Benchmark Dataset for Underwater Species Classification under Challenging Conditions cs.CVPDF

[13] When Every Millisecond Counts: Real-Time Anomaly Detection via the Multimodal Asynchronous Hybrid Network cs.CVPDF

[14] Few-Shot, Now for Real: Medical VLMs Adaptation without Balanced Sets or Validation cs.CVPDF

[15] Trustworthy Few-Shot Transfer of Medical VLMs through Split Conformal Prediction cs.CVPDF

[16] Learning golf swing signatures from a single wrist-worn inertial sensor cs.CVPDF

[17] Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations cs.CVPDF

[18] SynDaCaTE: A Synthetic Dataset For Evaluating Part-Whole Hierarchical Inference cs.CV | cs.AI | cs.LGPDF

[19] VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models cs.CV | cs.AI | cs.ROPDF

[20] LLM-driven Medical Report Generation via Communication-efficient Heterogeneous Federated Learning cs.CV | cs.CLPDF

[21] HalluRNN: Mitigating Hallucinations via Recurrent Cross-Layer Reasoning in Large Vision-Language Models cs.CV | cs.AI | cs.LGPDF

[22] DRAMA-X: A Fine-grained Intent Prediction and Risk Reasoning Benchmark For Driving cs.CV | cs.AI | cs.ROPDF

[23] SELFI: Selective Fusion of Identity for Generalizable Deepfake Detection cs.CVPDF

[24] A Multimodal In Vitro Diagnostic Method for Parkinson’s Disease Combining Facial Expressions and Behavioral Gait Data cs.CVPDF

[25] OpenMAP-BrainAge: Generalizable and Interpretable Brain Age Predictor cs.CVPDF

[26] HIRE: Lightweight High-Resolution Image Feature Enrichment for Multimodal LLMs cs.CVPDF

[27] JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent cs.CVPDF

[28] CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning cs.CV | cs.AI | cs.CLPDF

[29] Optimization-Free Patch Attack on Stereo Depth Estimation cs.CVPDF

[30] Adaptive Multi-prompt Contrastive Network for Few-shot Out-of-distribution Detection cs.CV | cs.AIPDF

[31] Histopathology Image Report Generation by Vision Language Model with Multimodal In-Context Learning cs.CVPDF

[32] MDSAM: Memory-Driven Sparse Attention Matrix for LVLMs Hallucination Mitigation cs.CVPDF

[33] CSDN: A Context-Gated Self-Adaptive Detection Network for Real-Time Object Detection cs.CVPDF

[34] Domain Generalization using Action Sequences for Egocentric Action Recognition cs.CVPDF

[35] SSAVSV: Towards Unified Model for Self-Supervised Audio-Visual Speaker Verification cs.CVPDF

[36] DreamJourney: Perpetual View Generation with Video Diffusion Models cs.CVPDF

[37] Programmable-Room: Interactive Textured 3D Room Meshes Generation Empowered by Large Language Models cs.CV | cs.AI | cs.MMPDF

[38] PDC-Net: Pattern Divide-and-Conquer Network for Pelvic Radiation Injury Segmentation cs.CVPDF

[39] YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception cs.CVPDF

[40] PhysID: Physics-based Interactive Dynamics from a Single-view Image cs.CVPDF

[41] LoLA-SpecViT: Local Attention SwiGLU Vision Transformer with LoRA for Hyperspectral Imaging cs.CVPDF

[42] Incorporating Rather Than Eliminating: Achieving Fairness for Skin Disease Diagnosis Through Group-Specific Expert cs.CVPDF

[43] Time-Contrastive Pretraining for In-Context Image and Video Segmentation cs.CVPDF

[44] Robust Foreground-Background Separation for Severely-Degraded Videos Using Convolutional Sparse Representation Modeling cs.CV | eess.IVPDF

[45] Fetuses Made Simple: Modeling and Tracking of Fetal Shape and Pose cs.CVPDF

[46] Cross-modal State Space Modeling for Real-time RGB-thermal Wild Scene Semantic Segmentation cs.CV | cs.ROPDF

[47] SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model cs.CV | cs.AIPDF

[48] StainPIDR: A Pathological Image Decoupling and Reconstruction Method for Stain Normalization Based on Color Vector Quantization and Structure Restaining cs.CV | cs.AIPDF

[49] Cloud-Aware SAR Fusion for Enhanced Optical Sensing in Space Missions cs.CV | cs.LGPDF

[50] EgoWorld: Translating Exocentric View to Egocentric View using Rich Exocentric Observations cs.CV | cs.AIPDF

[51] PostAlign: Multimodal Grounding as a Corrective Lens for MLLMs cs.CVPDF

[52] Cause-Effect Driven Optimization for Robust Medical Visual Question Answering with Language Biases cs.CV | cs.AIPDF

[53] Feedback Driven Multi Stereo Vision System for Real-Time Event Analysis cs.CV | cs.AIPDF

[54] PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis cs.CV | cs.MMPDF

[55] IDAL: Improved Domain Adaptive Learning for Natural Images Dataset cs.CV | cs.AI | cs.LGPDF

[56] GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning cs.CV | cs.AIPDF

[57] SegChange-R1: Augmented Reasoning for Remote Sensing Change Detection via Large Language Models cs.CVPDF

[58] Classification of Tents in Street Bazaars Using CNN cs.CVPDF

[59] Mobile Image Analysis Application for Mantoux Skin Test cs.CVPDF

[60] OSDMamba: Enhancing Oil Spill Detection from Remote Sensing Images Using Selective State Space Model cs.CVPDF

[61] On the Robustness of Human-Object Interaction Detection against Distribution Shift cs.CV | cs.MMPDF

[62] PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding cs.CV | cs.AI | cs.CLPDF

[63] MiCo: Multiple Instance Learning with Context-Aware Clustering for Whole Slide Image Analysis cs.CVPDF

[64] Pre-Trained LLM is a Semantic-Aware and Generalizable Segmentation Booster cs.CV | cs.AI | cs.MMPDF

[65] CmFNet: Cross-modal Fusion Network for Weakly-supervised Segmentation of Medical Images cs.CVPDF

[66] CLGRPO: Reasoning Ability Enhancement for Small VLMs cs.CVPDF

[67] Deep Supervised LSTM for 3D morphology estimation from Multi-View RGB Images of Wheat Spikes cs.CVPDF

[68] Training-free Test-time Improvement for Explainable Medical Image Classification cs.CVPDF

[69] MUPA: Towards Multi-Path Agentic Reasoning for Grounded Video Question Answering cs.CV | cs.AIPDF

[70] TEM^3-Learning: Time-Efficient Multimodal Multi-Task Learning for Advanced Assistive Driving cs.CVPDF

[71] ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation cs.CV | cs.AI | cs.LGPDF

[72] Enhancing VICReg: Random-Walk Pairing for Improved Generalization and Better Global Semantics Capturing cs.CV | cs.LGPDF

[73] See-in-Pairs: Reference Image-Guided Comparative Vision-Language Models for Medical Diagnosis cs.CVPDF

[74] Pattern-Based Phase-Separation of Tracer and Dispersed Phase Particles in Two-Phase Defocusing Particle Tracking Velocimetry cs.CV | physics.app-ph | physics.flu-dynPDF

[75] CDG-MAE: Learning Correspondences from Diffusion Generated Views cs.CVPDF

[76] Multimodal Fusion SLAM with Fourier Attention cs.CV | cs.AIPDF

[77] Limitations of NERF with pre-trained Vision Features for Few-Shot 3D Reconstruction cs.CVPDF

[78] Cross-Architecture Knowledge Distillation (KD) for Retinal Fundus Image Anomaly Detection on NVIDIA Jetson Nano cs.CV | cs.AI | cs.LG | 68T07 | I.2.6; I.5.1; J.3PDF