cs.CV [Total: 81]
cs.CL [Total: 36]
cs.SE [Total: 1]
q-bio.NC [Total: 2]
cs.HC [Total: 1]
cs.CR [Total: 3]
cs.IR [Total: 1]
cs.AI [Total: 6]
cs.RO [Total: 2]
cs.LG [Total: 7]
eess.IV [Total: 3]

cs.CV

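[1] DINOv3 cs.CV | cs.LG

Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab

TL;DR: DINOv3 is a major milestone in self-supervised learning: through data preparation, model scaling, and a new method called Gram anchoring, it addresses the degradation of dense feature maps during long training schedules and shows strong performance across diverse vision tasks.

Motivation: Self-supervised learning promises to eliminate the need for manual data annotation, letting models scale easily to massive datasets and larger architectures. DINOv3 aims to provide a general-purpose vision foundation model that learns high-quality representations from diverse data.

Result: Without fine-tuning, DINOv3 outperforms state-of-the-art specialized models across multiple vision tasks and significantly surpasses previous self-supervised and weakly supervised foundation models.

Insight: Careful optimization of model scale and data preparation, together with a new method for addressing feature-map degradation, can significantly improve the generalization and task performance of self-supervised models.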

Abstract: Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images – using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models’ flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.

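[2] Empowering Morphing Attack Detection using Interpretable Image-Text Foundation Model cs.CV | cs.AI

Sushrut Patwardhan, Raghavendra Ramachandra, Sushma Venkatesh

TL;DR: The paper proposes a multimodal learning approach that uses an interpretable image-text foundation model (such as CLIP) for morphing attack detection, describes the detection result through text prompts, and demonstrates its generalizability and effectiveness in zero-shot evaluation.

Motivation: Morphing attack detection is essential in face recognition systems, but existing methods lack interpretability. The paper aims to give more intuitive descriptions of detection results through a multimodal learning framework that combines image and text information.

Result: Experiments show that, in zero-shot evaluation, the proposed framework not only detects morphing attacks but also predicts the relevant textual description, and that it performs well across multiple morphing techniques and capture mediums.

Insight: Image-text multimodal learning can improve the interpretability of morphing attack detection, and well-designed text prompts enable generalizable detection without fine-tuning.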

Abstract: Morphing attack detection has become an essential component of face recognition systems for ensuring a reliable verification scenario. In this paper, we present a multimodal learning approach that can provide a textual description of morphing attack detection. We first show that zero-shot evaluation of the proposed framework using Contrastive Language-Image Pretraining (CLIP) can yield not only generalizable morphing attack detection, but also predict the most relevant text snippet. We present an extensive analysis of ten different textual prompts that include both short and long textual prompts. These prompts are engineered by considering the human understandable textual snippet. Extensive experiments were performed on a face morphing dataset that was developed using a publicly available face biometric dataset. We present an evaluation of SOTA pre-trained neural networks together with the proposed framework in the zero-shot evaluation of five different morphing generation techniques that are captured in three different mediums.

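[3] Interpretable Oracle Bone Script Decipherment through Radical and Pictographic Analysis with LVLMs cs.CV

Kaixin Peng, Mengyang Zhao, Haiyang Yu, Teng Fu, Bin Li

TL;DR: The paper proposes an interpretable Oracle Bone Script (OBS) decipherment method based on Large Vision-Language Models (LVLMs) that combines radical analysis with pictograph-semantic understanding to improve zero-shot decipherment and interpretability.

Motivation: As the oldest mature writing system, OBS is hard to decipher because of its rarity, abstractness, and pictographic diversity. Existing deep learning methods ignore the intricate connections between glyphs and semantics, which limits generalization and interpretability.

Result: On public benchmarks, the model achieves state-of-the-art Top-10 accuracy and superior zero-shot decipherment capability, and it exposes its logical analysis process.

Insight: The method not only improves OBS decipherment accuracy but may also provide archaeologically valuable reference results for undeciphered OBS, with potential applications in digital humanities and historical research.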

Abstract: As the oldest mature writing system, Oracle Bone Script (OBS) has long posed significant challenges for archaeological decipherment due to its rarity, abstractness, and pictographic diversity. Current deep learning-based methods have made exciting progress on the OBS decipherment task, but existing approaches often ignore the intricate connections between glyphs and the semantics of OBS. This results in limited generalization and interpretability, especially when addressing zero-shot settings and undeciphered OBS. To this end, we propose an interpretable OBS decipherment method based on Large Vision-Language Models, which synergistically combines radical analysis and pictograph-semantic understanding to bridge the gap between glyphs and meanings of OBS. Specifically, we propose a progressive training strategy that guides the model from radical recognition and analysis to pictographic analysis and mutual analysis, thus enabling reasoning from glyph to meaning. We also design a Radical-Pictographic Dual Matching mechanism informed by the analysis results, significantly enhancing the model’s zero-shot decipherment performance. To facilitate model training, we propose the Pictographic Decipherment OBS Dataset, which comprises 47,157 Chinese characters annotated with OBS images and pictographic analysis texts. Experimental results on public benchmarks demonstrate that our approach achieves state-of-the-art Top-10 accuracy and superior zero-shot decipherment capabilities. More importantly, our model delivers logical analysis processes, possibly providing archaeologically valuable reference results for undeciphered OBS, and thus has potential applications in digital humanities and historical research. The dataset and code will be released in https://github.com/PKXX1943/PD-OBS.

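[4] Deep Learning Enables Large-Scale Shape and Appearance Modeling in Total-Body DXA Imaging cs.CV

Arianna Bunnell, Devon Cataldi, Yannik Glaser, Thomas K. Wolfgruber, Steven Heymsfield

TL;DR: The paper presents a deep learning method for automatic fiducial point (keypoint) placement on total-body dual X-ray absorptiometry (TBDXA) images, demonstrates its value for shape and appearance modeling (SAM), and validates its potential for studying associations with health markers.

Motivation: TBDXA is a low-cost method for whole-body composition assessment, but it requires manual keypoint annotation, which limits large-scale use. The paper aims to automate keypoint annotation with deep learning to support SAM research.

Result: The model performs strongly on keypoint localization (99.5% percentage of correct keypoints). SAM feature distributions are associated with a range of health markers (frailty, metabolic, inflammation, and cardiometabolic health), which supports the method's practical utility.

Insight: Automatic keypoint placement can greatly improve the efficiency of TBDXA image analysis and support large-scale SAM. SAM offers a new tool for studying the relationship between body composition and health, and may reveal new biomarkers.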

Abstract: Total-body dual X-ray absorptiometry (TBDXA) imaging is a relatively low-cost whole-body imaging modality, widely used for body composition assessment. We develop and validate a deep learning method for automatic fiducial point placement on TBDXA scans using 1,683 manually-annotated TBDXA scans. The method achieves 99.5% percentage correct keypoints in an external testing dataset. To demonstrate the value for shape and appearance modeling (SAM), our method is used to place keypoints on 35,928 scans for five different TBDXA imaging modes, then associations with health markers are tested in two cohorts not used for SAM model generation using two-sample Kolmogorov-Smirnov tests. SAM feature distributions associated with health biomarkers are shown to corroborate existing evidence and generate new hypotheses on body composition and shape’s relationship to various frailty, metabolic, inflammation, and cardiometabolic health markers. Evaluation scripts, model weights, automatic point file generation code, and triangulation files are available at https://github.com/hawaii-ai/dxa-pointplacement.
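The abstract tests associations with health markers using two-sample Kolmogorov-Smirnov tests. As a rough illustration of what that statistic measures, here is a minimal sketch that computes only the KS statistic (the largest gap between two empirical CDFs), without a p-value. The function name `ks_statistic` and the sample values are hypothetical and are not taken from the paper or its repository.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    for x in a + b:
        # Empirical CDF at x = fraction of the sample that is <= x.
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical values of one SAM feature in two groups split by a health marker.
low_marker = [0.1, 0.2, 0.3, 0.4]
high_marker = [0.5, 0.6, 0.7, 0.8]
print(ks_statistic(low_marker, high_marker))
```

A statistic near 0 means the two feature distributions are nearly indistinguishable; a value near 1 means they barely overlap. A full test would also convert the statistic into a p-value.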

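[5] MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning cs.CV

Thanh-Dat Truong, Christophe Bobda, Nitin Agarwal, Khoa Luu

TL;DR: The paper proposes MANGO, a new multimodal fusion learning method that explicitly captures complex correlations in multimodal data through an invertible cross-attention layer and three new cross-attention mechanisms, achieving state-of-the-art performance on several tasks.

Motivation: Existing multimodal fusion methods rely on Transformer attention to learn correlations among multimodal features implicitly, and they struggle to capture each modality's essential features and complex structure. An explicit, interpretable, and tractable fusion method is therefore needed.

Result: MANGO achieves state-of-the-art performance on three multimodal learning tasks: semantic segmentation, image-to-image translation, and movie genre classification.

Insight: Explicitly modeling correlations in multimodal data can improve both interpretability and performance, and combining an invertible cross-attention layer with several attention mechanisms offers a new research direction for multimodal fusion.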

Abstract: Multimodal learning has gained much success in recent years. However, current multimodal fusion methods adopt the attention mechanism of Transformers to implicitly learn the underlying correlation of multimodal features. As a result, the multimodal model cannot capture the essential features of each modality, making it difficult to comprehend complex structures and correlations of multimodal inputs. This paper introduces a novel Multimodal Attention-based Normalizing Flow (MANGO) approach to developing explicit, interpretable, and tractable multimodal fusion learning. In particular, we propose a new Invertible Cross-Attention (ICA) layer to develop the Normalizing Flow-based Model for multimodal data. To efficiently capture the complex, underlying correlations in multimodal data in our proposed invertible cross-attention layer, we propose three new cross-attention mechanisms: Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA). Finally, we introduce a new Multimodal Attention-based Normalizing Flow to enable the scalability of our proposed method to high-dimensional multimodal data. Our experimental results on three different multimodal learning tasks, i.e., semantic segmentation, image-to-image translation, and movie genre classification, have illustrated the state-of-the-art (SoTA) performance of the proposed approach. The source code of this work will be publicly available.

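[6] Improving watermelon (Citrullus lanatus) disease classification with generative artificial intelligence (GenAI)-based synthetic and real-field images via a custom EfficientNetV2-L model cs.CV | cs.AI | cs.ET

Nitin Rai, Nathan S. Boyd, Gary E. Vallad, Arnold W. Schumann

TL;DR: The paper studies whether combining synthetic images from generative artificial intelligence (GenAI) with real images improves watermelon disease classification, and it verifies the advantage of hybrid training data with a custom EfficientNetV2-L model.

Motivation: Computer vision models for crop disease diagnosis depend on large amounts of real image data, which is costly to collect. GenAI offers a new way to generate high-quality synthetic images, but whether synthetic images can replace or supplement real images still needs to be verified.

Result: Models trained on H3 and H4 (1:10 real-to-synthetic ratio) perform best, reaching a weighted F1-score of 1.00, which shows the benefit of combining a small number of real images with synthetic images.

Insight: Synthetic images cannot fully replace real images, but mixing them with real images can significantly improve model performance and generalization, offering an efficient data solution for crop disease diagnosis.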

Abstract: The current advancements in generative artificial intelligence (GenAI) models have paved the way for new possibilities for generating high-resolution synthetic images, thereby offering a promising alternative to traditional image acquisition for training computer vision models in agriculture. In the context of crop disease diagnosis, GenAI models are being used to create synthetic images of various diseases, potentially facilitating model creation and reducing the dependency on resource-intensive in-field data collection. However, limited research has been conducted on evaluating the effectiveness of integrating real with synthetic images to improve disease classification performance. Therefore, this study aims to investigate whether combining a limited number of real images with synthetic images can enhance the prediction accuracy of an EfficientNetV2-L model for classifying watermelon (Citrullus lanatus) diseases. The training dataset was divided into five treatments: H0 (only real images), H1 (only synthetic images), H2 (1:1 real-to-synthetic), H3 (1:10 real-to-synthetic), and H4 (H3 + random images to improve variability and model generalization). All treatments were trained using a custom EfficientNetV2-L architecture with enhanced fine-tuning and transfer learning techniques. Models trained on H2, H3, and H4 treatments demonstrated high precision, recall, and F1-score metrics. Additionally, the weighted F1-score increased from 0.65 (on H0) to 1.00 (on H3-H4) signifying that the addition of a small number of real images with a considerable volume of synthetic images improved model performance and generalizability. Overall, this validates the findings that synthetic images alone cannot adequately substitute for real images; instead, both must be used in a hybrid manner to maximize model performance for crop disease classification.

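[7] SynSpill: Improved Industrial Spill Detection With Synthetic Data cs.CV | cs.ET

Aaditya Baranwal, Abdul Mueez, Jason Voelker, Guneet Bhatia, Shruti Vyas

TL;DR: The paper proposes SynSpill, a framework that improves industrial spill detection with synthetic data, addresses the scarcity of real data, and shows marked improvements for both vision-language models and object detectors.

Motivation: Real data for industrial spill detection is scarce and hard to annotate, which makes conventional fine-tuning impractical, so a low-cost, scalable solution is needed.

Result: Synthetic data substantially improves the performance of vision-language models and object detectors, so that they perform better on unseen spill scenarios.

Insight: High-fidelity synthetic data is an effective way to address data scarcity in safety-critical domains, and combining it with lightweight adaptation enables low-cost deployment of vision systems.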

Abstract: Large-scale Vision-Language Models (VLMs) have transformed general-purpose visual recognition through strong zero-shot capabilities. However, their performance degrades significantly in niche, safety-critical domains such as industrial spill detection, where hazardous events are rare, sensitive, and difficult to annotate. This scarcity – driven by privacy concerns, data sensitivity, and the infrequency of real incidents – renders conventional fine-tuning of detectors infeasible for most industrial settings. We address this challenge by introducing a scalable framework centered on a high-quality synthetic data generation pipeline. We demonstrate that this synthetic corpus enables effective Parameter-Efficient Fine-Tuning (PEFT) of VLMs and substantially boosts the performance of state-of-the-art object detectors such as YOLO and DETR. Notably, in the absence of synthetic data (SynSpill dataset), VLMs still generalize better to unseen spill scenarios than these detectors. When SynSpill is used, both VLMs and detectors achieve marked improvements, with their performance becoming comparable. Our results underscore that high-fidelity synthetic data is a powerful means to bridge the domain gap in safety-critical applications. The combination of synthetic generation and lightweight adaptation offers a cost-effective, scalable pathway for deploying vision systems in industrial environments where real data is scarce/impractical to obtain. Project Page: https://synspill.vercel.app

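[8] EntropyGS: An Efficient Entropy Coding on 3D Gaussian Splatting cs.CV

Yuning Huang, Jiahao Pang, Fengqing Zhu, Dong Tian

TL;DR: The paper proposes EntropyGS, an efficient entropy coding method for compressing 3D Gaussian Splatting (3DGS) data. Using statistical analysis of Gaussian attributes and parameterized coding, it achieves about 30x rate reduction while maintaining rendering quality.

Motivation: 3DGS is an emerging novel view synthesis approach, but storing and transmitting Gaussian data requires efficient compression. The work addresses this through statistical analysis and entropy coding.

Result: About 30x rate reduction on benchmark datasets while maintaining rendering quality, with fast encoding and decoding.

Insight: Spherical harmonic AC attributes follow Laplace distributions, while rotation, scaling, and opacity can be approximated by Gaussian mixture models. These statistical properties form the basis for efficient entropy coding.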

Abstract: As an emerging novel view synthesis approach, 3D Gaussian Splatting (3DGS) demonstrates fast training/rendering with superior visual quality. The two tasks of 3DGS, Gaussian creation and view rendering, are typically separated over time or devices, and thus storage/transmission and finally compression of 3DGS Gaussians become necessary. We begin with a correlation and statistical analysis of 3DGS Gaussian attributes. An inspiring finding in this work reveals that spherical harmonic AC attributes precisely follow Laplace distributions, while mixtures of Gaussian distributions can approximate rotation, scaling, and opacity. Additionally, harmonic AC attributes manifest weak correlations with other attributes except for inherited correlations from a color space. A factorized and parameterized entropy coding method, EntropyGS, is hereinafter proposed. During encoding, distribution parameters of each Gaussian attribute are estimated to assist their entropy coding. The quantization for entropy coding is adaptively performed according to Gaussian attribute types. EntropyGS demonstrates about 30x rate reduction on benchmark datasets while maintaining similar rendering quality compared to input 3DGS data, with a fast encoding and decoding time.
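The abstract says that distribution parameters are estimated per attribute to assist entropy coding, and that spherical harmonic AC attributes follow Laplace distributions. The sketch below illustrates that general idea only: it fits a Laplace model by maximum likelihood and estimates the ideal code length of uniformly quantized values under that model. The function names, the uniform quantizer, and the `step` parameter are assumptions for illustration; the abstract does not specify EntropyGS's actual quantizer or coder.

```python
import math
import statistics


def fit_laplace(values):
    """Maximum-likelihood Laplace fit: location = median, scale = mean |x - median|."""
    mu = statistics.median(values)
    b = sum(abs(v - mu) for v in values) / len(values)
    return mu, b


def laplace_cdf(x, mu, b):
    """Cumulative distribution function of Laplace(mu, b)."""
    if x < mu:
        return 0.5 * math.exp((x - mu) / b)
    return 1.0 - 0.5 * math.exp(-(x - mu) / b)


def estimated_bits(values, step):
    """Ideal code length in bits of uniformly quantized values under the fitted model."""
    mu, b = fit_laplace(values)
    if b == 0:
        raise ValueError("all values are identical; Laplace scale is zero")
    total = 0.0
    for v in values:
        q = round(v / step)  # quantization index (assumed uniform quantizer)
        lo, hi = (q - 0.5) * step, (q + 0.5) * step
        # Probability mass the model assigns to this quantization bin.
        p = laplace_cdf(hi, mu, b) - laplace_cdf(lo, mu, b)
        total += -math.log2(max(p, 1e-12))
    return total


# Hypothetical attribute values; a coarser step costs fewer bits.
values = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(estimated_bits(values, step=1.0), estimated_bits(values, step=0.1))
```

An arithmetic or range coder driven by the same bin probabilities would approach this code length in practice. Only the fitted parameters need to be sent as side information.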

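[9] CellSymphony: Deciphering the molecular and phenotypic orchestration of cells with single-cell pathomics cs.CV

Paul H. Acosta, Pingjun Chen, Simon P. Castillo, Maria Esther Salvatierra, Yinyin Yuan

TL;DR: CellSymphony is a multimodal framework that combines Xenium spatial transcriptomics data with histology images, using foundation models to extract features for cell type annotation and microenvironment analysis.

Motivation: Existing methods struggle to extract cell-level features from complex tumor tissue and to integrate spatial transcriptomics data with rich morphological information. CellSymphony aims to address this challenge.

Result: Accurate cell type annotation across three cancer types, revealing distinct microenvironmental niches.

Insight: Foundation models and multimodal fusion show potential for analyzing complex tissue ecosystems and give a more complete view of cells' physiological and phenotypic characteristics.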

Abstract: Xenium, a new spatial transcriptomics platform, enables subcellular-resolution profiling of complex tumor tissues. Despite the rich morphological information in histology images, extracting robust cell-level features and integrating them with spatial transcriptomics data remains a critical challenge. We introduce CellSymphony, a flexible multimodal framework that leverages foundation model-derived embeddings from both Xenium transcriptomic profiles and histology images at true single-cell resolution. By learning joint representations that fuse spatial gene expression with morphological context, CellSymphony achieves accurate cell type annotation and uncovers distinct microenvironmental niches across three cancer types. This work highlights the potential of foundation models and multimodal fusion for deciphering the physiological and phenotypic orchestration of cells within complex tissue ecosystems.

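[10] Deep Learning for Crack Detection: A Review of Learning Paradigms, Generalizability, and Datasets cs.CV

Xinan Zhang, Haolin Wang, Yung-An Hsieh, Zhongyu Yang, Anthony Yezzi

TL;DR: This review systematically surveys recent progress in deep learning-based crack detection, focusing on shifts in learning paradigms, improvements in generalizability, and diversification of datasets. The authors also introduce a new 3D crack dataset, 3DCrack, and run benchmark experiments as a reference for future research.

Motivation: Crack detection is crucial for maintaining civil infrastructure (roads, buildings, etc.), and deep learning is advancing rapidly in this area. Existing technical and review papers, however, do not fully reflect the latest trends in learning paradigms, generalizability, and dataset diversity, so a systematic review is needed to fill the gap.

Result: The paper summarizes research trends in the field and benchmarks commonly used deep learning methods and foundation models on the new dataset. The results indicate that recent learning paradigms and diversified datasets matter for improving crack detection performance and generalizability.

Insight: Crack detection is moving toward more flexible learning paradigms and more diverse datasets, and 3D data gives deep learning models richer context that may further improve detection. Future work could explore foundation models for crack detection and the fusion of cross-modal data.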

Abstract: Crack detection plays a crucial role in civil infrastructures, including inspection of pavements, buildings, etc., and deep learning has significantly advanced this field in recent years. While numerous technical and review papers exist in this domain, emerging trends are reshaping the landscape. These shifts include transitions in learning paradigms (from fully supervised learning to semi-supervised, weakly-supervised, unsupervised, few-shot, domain adaptation and fine-tuning foundation models), improvements in generalizability (from single-dataset performance to cross-dataset evaluation), and diversification in dataset reacquisition (from RGB images to specialized sensor-based data). In this review, we systematically analyze these trends and highlight representative works. Additionally, we introduce a new dataset collected with 3D laser scans, 3DCrack, to support future research and conduct extensive benchmarking experiments to establish baselines for commonly used deep learning methodologies, including recent foundation models. Our findings provide insights into the evolving methodologies and future directions in deep learning-based crack detection. Project page: https://github.com/nantonzhang/Awesome-Crack-Detection

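[11] MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs cs.CV | cs.AI

Haonan Ge, Yiwei Wang, Ming-Hsuan Yang, Yujun Cai

TL;DR: The paper proposes MRFD, a training-free decoding method that uses multi-region fusion decoding and self-consistency to significantly reduce hallucinations in LVLMs.

Motivation: Large Vision-Language Models (LVLMs) perform well on multimodal tasks, but because of their limited ability to verify information in different image regions, they often generate hallucinated text that is inconsistent with the visual input.

Result: Experiments show that MRFD significantly reduces hallucinations and improves response factuality across multiple LVLMs and benchmarks, without model updates.

Insight: Fusing multi-region information with self-consistency checks can improve the output reliability of vision-language models without additional training.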

Abstract: Large Vision-Language Models (LVLMs) have shown strong performance across multimodal tasks. However, they often produce hallucinations – text that is inconsistent with visual input, due to the limited ability to verify information in different regions of the image. To address this, we propose Multi-Region Fusion Decoding (MRFD), a training-free decoding method that improves factual grounding by modeling inter-region consistency. MRFD identifies salient regions using cross-attention, generates initial responses for each, and computes reliability weights based on Jensen-Shannon Divergence (JSD) among the responses. These weights guide a consistency-aware fusion of per-region predictions, using region-aware prompts inspired by Chain-of-Thought reasoning. Experiments across multiple LVLMs and benchmarks show that MRFD significantly reduces hallucinations and improves response factuality without requiring model updates.
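The abstract states that reliability weights are computed from the Jensen-Shannon Divergence (JSD) among per-region responses, but it does not give the weighting formula. The sketch below shows one plausible reading under explicit assumptions: each region's response is summarized as a probability distribution, and a region's weight is a softmax over its negative mean JSD to the other regions. The names `reliability_weights` and `temperature` are hypothetical and are not taken from the paper.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats, skipping zero-mass terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def reliability_weights(dists, temperature=1.0):
    """Weight each region by how closely its distribution agrees with the others.

    Assumed scheme: softmax over the negative mean pairwise JSD.
    """
    n = len(dists)
    if n < 2:
        raise ValueError("need at least two regions to measure consistency")
    mean_div = [
        sum(jsd(dists[i], dists[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    scores = [math.exp(-d / temperature) for d in mean_div]
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical per-region answer distributions; the third region disagrees.
regions = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
print(reliability_weights(regions))
```

Under this scheme a region whose prediction disagrees with the rest receives a lower weight in the fused prediction, which matches the consistency-aware fusion the abstract describes at a high level.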

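[12] High Fidelity Text to Image Generation with Contrastive Alignment and Structural Guidance cs.CV

Danyi Gao

TL;DR: The paper proposes a high-fidelity text-to-image generation method that combines contrastive learning with structural guidance and uses a multi-objective supervision mechanism to improve semantic alignment and structural consistency.

Motivation: Existing text-driven image generation methods face bottlenecks in semantic alignment accuracy and structural consistency, and they need more effective cross-modal alignment and spatial structure modeling.

Result: The method is validated on the COCO-2014 dataset, where it beats baselines on CLIP Score, FID, and SSIM and generates semantically clear and structurally complete images.

Insight: Jointly optimizing contrastive learning and structural guidance can significantly improve the semantic alignment and structural fidelity of generated images without increasing computational complexity.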

Abstract: This paper addresses the performance bottlenecks of existing text-driven image generation methods in terms of semantic alignment accuracy and structural consistency. A high-fidelity image generation method is proposed by integrating text-image contrastive constraints with structural guidance mechanisms. The approach introduces a contrastive learning module that builds strong cross-modal alignment constraints to improve semantic matching between text and image. At the same time, structural priors such as semantic layout maps or edge sketches are used to guide the generator in spatial-level structural modeling. This enhances the layout completeness and detail fidelity of the generated images. Within the overall framework, the model jointly optimizes contrastive loss, structural consistency loss, and semantic preservation loss. A multi-objective supervision mechanism is adopted to improve the semantic consistency and controllability of the generated content. Systematic experiments are conducted on the COCO-2014 dataset. Sensitivity analyses are performed on embedding dimensions, text length, and structural guidance strength. Quantitative metrics confirm the superior performance of the proposed method in terms of CLIP Score, FID, and SSIM. The results show that the method effectively bridges the gap between semantic alignment and structural fidelity without increasing computational complexity. It demonstrates a strong ability to generate semantically clear and structurally complete images, offering a viable technical path for joint text-image modeling and image generation.

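[13] VIFSS: View-Invariant and Figure Skating-Specific Pose Representation Learning for Temporal Action Segmentation cs.CV

Ryota Tanaka, Tomohiro Suzuki, Keisuke Fujii

TL;DR: The paper proposes a temporal action segmentation framework for figure skating jumps (VIFSS) that combines view-invariant pose representation learning with action classification to address insufficient data and the three-dimensional complexity of jumps.

Motivation: Accurately recognizing figure skating jumps requires expert knowledge, and existing methods are limited by insufficient data and by ignoring the three-dimensional nature of jumps.

Result: Over 92% F1@50 on element-level temporal action segmentation, with particularly strong results when data is limited.

Insight: View-invariant pre-training is especially effective when fine-tuning data is limited, which highlights the approach's practical value.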

Abstract: Understanding human actions from videos plays a critical role across various domains, including sports analytics. In figure skating, accurately recognizing the type and timing of jumps a skater performs is essential for objective performance evaluation. However, this task typically requires expert-level knowledge due to the fine-grained and complex nature of jump procedures. While recent approaches have attempted to automate this task using Temporal Action Segmentation (TAS), there are two major limitations to TAS for figure skating: the annotated data is insufficient, and existing methods do not account for the inherent three-dimensional aspects and procedural structure of jump actions. In this work, we propose a new TAS framework for figure skating jumps that explicitly incorporates both the three-dimensional nature and the semantic procedure of jump movements. First, we propose a novel View-Invariant, Figure Skating-Specific pose representation learning approach (VIFSS) that combines contrastive learning as pre-training and action classification as fine-tuning. For view-invariant contrastive pre-training, we construct FS-Jump3D, the first publicly available 3D pose dataset specialized for figure skating jumps. Second, we introduce a fine-grained annotation scheme that marks the "entry (preparation)" and "landing" phases, enabling TAS models to learn the procedural structure of jumps. Extensive experiments demonstrate the effectiveness of our framework. Our method achieves over 92% F1@50 on element-level TAS, which requires recognizing both jump types and rotation levels. Furthermore, we show that view-invariant contrastive pre-training is particularly effective when fine-tuning data is limited, highlighting the practicality of our approach in real-world scenarios.
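The abstract reports F1@50, a segment-level metric commonly used in temporal action segmentation: a predicted segment counts as correct when it overlaps a same-label ground-truth segment with IoU of at least 0.5. The sketch below follows that common definition and is not the authors' evaluation code. The function names are hypothetical, and treating every label (including any background class) as a scorable segment is an assumption.

```python
def to_segments(labels):
    """Collapse frame-wise labels into (label, start, end) runs, end exclusive."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i))
            start = i
    return segments


def segmental_f1(pred_labels, true_labels, iou_threshold=0.5):
    """Segment-level F1: a prediction is a true positive if IoU >= threshold."""
    pred = to_segments(pred_labels)
    true = to_segments(true_labels)
    matched = [False] * len(true)
    tp = fp = 0
    for label, ps, pe in pred:
        best_iou, best_idx = 0.0, -1
        for idx, (t_label, ts, te) in enumerate(true):
            if t_label != label:
                continue
            inter = max(0, min(pe, te) - max(ps, ts))
            union = max(pe, te) - min(ps, ts)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        # Each ground-truth segment can be claimed by at most one prediction.
        if best_idx >= 0 and best_iou >= iou_threshold and not matched[best_idx]:
            tp += 1
            matched[best_idx] = True
        else:
            fp += 1
    fn = matched.count(False)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical frame labels: one spurious "b" frame splits the prediction.
print(segmental_f1(["a", "a", "a", "b", "a"], ["a", "a", "a", "a", "a"]))
```

Unlike frame-wise accuracy, this metric penalizes over-segmentation: in the usage example only one of the three predicted segments matches the single ground-truth segment.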

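[14] JRDB-Reasoning: A Difficulty-Graded Benchmark for Visual Reasoning in Robotics cs.CV

Simindokht Jahangard, Mehrzad Mohammadi, Yi Shen, Zhixi Cai, Hamid Rezatofighi

TL;DR: JRDB-Reasoning is a difficulty-graded benchmark for visual reasoning in robotics that formalizes reasoning complexity and generates adaptive questions, filling gaps in existing benchmarks.

Motivation: Existing visual reasoning benchmarks lack a clear definition of reasoning complexity, cannot generate questions by difficulty, and lack structured annotations.

Result: The authors create the JRDB-Reasoning benchmark, which supports dynamic assessment of vision-language models across reasoning levels.

Insight: A finer-grained definition of reasoning complexity and structured annotations improve the evaluation of visual reasoning, especially in human-robot interaction scenarios.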

Abstract: Recent advances in Vision-Language Models (VLMs) and large language models (LLMs) have greatly enhanced visual reasoning, a key capability for embodied AI agents like robots. However, existing visual reasoning benchmarks often suffer from several limitations: they lack a clear definition of reasoning complexity, offer no control over generating questions of varying difficulty or task customization, and fail to provide structured, step-by-step reasoning annotations (workflows). To bridge these gaps, we formalize reasoning complexity, introduce an adaptive query engine that generates customizable questions of varying complexity with detailed intermediate annotations, and extend the JRDB dataset with human-object interaction and geometric relationship annotations to create JRDB-Reasoning, a benchmark tailored for visual reasoning in human-crowded environments. Our engine and benchmark enable fine-grained evaluation of visual reasoning frameworks and dynamic assessment of visual-language models across reasoning levels.

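[15] A Sub-Pixel Multimodal Optical Remote Sensing Images Matching Method cs.CV

Tao Huang, Hongbo Pan, Nanxi Zhou, Shun Zhou

TL;DR: The paper proposes a phase consistency weighted least absolute deviation (PCWLAD) sub-pixel template matching method to improve the matching accuracy of multimodal optical images. The method has two steps, coarse matching with SSIM and fine matching with WLAD, and the fine step uses mutual structure filtering and the WLAD criterion to reduce the impact of noise. Experiments show that it outperforms eight existing methods on three datasets, with an average matching accuracy of about 0.4 pixels.

Motivation: Nonlinear radiation and geometric deformation differences caused by different spectral responses usually degrade the matching accuracy of multimodal optical images. The paper aims to solve this problem with a high-accuracy matching method.

Result: On three datasets, PCWLAD outperforms eight existing methods in correct matching rate (CMR) and root mean square error (RMSE), with an average matching accuracy of about 0.4 pixels.

Insight: Phase consistency plays an important role in multimodal image matching, and combining it with structural similarity and noise suppression can significantly improve matching accuracy.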

Abstract: High-accuracy matching of multimodal optical images is the basis of geometric processing. However, the image matching accuracy is usually degraded by the nonlinear radiation and geometric deformation differences caused by different spectral responses. To address these problems, we proposed a phase consistency weighted least absolute deviation (PCWLAD) sub-pixel template matching method to improve the matching accuracy of multimodal optical images. This method consists of two main steps: coarse matching with the structural similarity index measure (SSIM) and fine matching with WLAD. In the coarse matching step, PCs are calculated without a noise filter to preserve the original structural details, and template matching is performed using the SSIM. In the fine matching step, we applied the radiometric and geometric transformation models between two multimodal PC templates based on the coarse matching. Furthermore, mutual structure filtering is adopted in the model to mitigate the impact of noise within the corresponding templates on the structural consistency, and the WLAD criterion is used to estimate the sub-pixel offset. To evaluate the performance of PCWLAD, we created three types of image datasets: visible to infrared Landsat images, visible to near-infrared close-range images, and visible to infrared uncrewed aerial vehicle (UAV) images. PCWLAD outperformed eight existing state-of-the-art methods in terms of correct matching rate (CMR) and root mean square error (RMSE) and reached an average matching accuracy of approximately 0.4 pixels across all three datasets. Our software and datasets are publicly available at https://github.com/huangtaocsu/PCWLAD.
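The abstract uses a weighted least absolute deviation (WLAD) criterion to estimate the sub-pixel offset. The paper's full model fits radiometric and geometric transformations between phase consistency templates, which is not reproduced here. As a much simpler illustration of why an absolute-deviation criterion resists outliers, the sketch below solves the one-parameter case: the constant m that minimizes the weighted sum of |x_i - m| is a weighted median. The function name and sample data are hypothetical.

```python
def weighted_median(values, weights):
    """Return a minimizer of sum(w_i * |x_i - m|) over m (the lower weighted median)."""
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and equal in length")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        # First point where the cumulative weight reaches half the total.
        if cumulative >= half:
            return value
    return pairs[-1][0]


# Hypothetical per-pixel offset estimates (in pixels) with one gross outlier.
offsets = [0.38, 0.41, 0.40, 5.00]
confidence = [1.0, 1.0, 1.0, 1.0]
print(weighted_median(offsets, confidence))
```

A least-squares estimate of the same data would be the mean, which the single outlier pulls far from the cluster around 0.4 pixels, whereas the absolute-deviation estimate stays inside the cluster.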

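[16] InterSyn: Interleaved Learning for Dynamic Motion Synthesis in the Wild cs.CV

Yiyi Ma, Yuanzhi Liang, Xiu Li, Chi Zhang, Xuelong Li

TL;DR: InterSyn is a novel framework for generating dynamic multi-character interaction motions. It jointly models solo motions and multi-person interactions through an interleaved learning strategy and shows better text alignment and diversity than existing methods in experiments.

Motivation: Existing methods usually treat solo motions and multi-person interactions separately, so generated interaction motions lack the dynamic coordination and naturalness of real-world scenarios. InterSyn aims to address this through joint learning.

Result: Generated motion sequences show higher text-to-motion alignment and improved diversity compared with existing methods, setting a new benchmark for robust and natural motion synthesis.

Insight: Jointly modeling solo motions and multi-person interactions reflects the dynamic coordination of real scenes more faithfully, and the interleaved learning strategy shows potential for complex multi-character motion generation.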

Abstract: We present Interleaved Learning for Motion Synthesis (InterSyn), a novel framework that targets the generation of realistic interaction motions by learning from integrated motions that consider both solo and multi-person dynamics. Unlike previous methods that treat these components separately, InterSyn employs an interleaved learning strategy to capture the natural, dynamic interactions and nuanced coordination inherent in real-world scenarios. Our framework comprises two key modules: the Interleaved Interaction Synthesis (INS) module, which jointly models solo and interactive behaviors in a unified paradigm from a first-person perspective to support multiple character interactions, and the Relative Coordination Refinement (REC) module, which refines mutual dynamics and ensures synchronized motions among characters. Experimental results show that the motion sequences generated by InterSyn exhibit higher text-to-motion alignment and improved diversity compared with recent methods, setting a new benchmark for robust and natural motion synthesis. Additionally, our code will be open-sourced in the future to promote further research and development in this area.

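[17] Integrating Reinforcement Learning with Visual Generative Models: Foundations and Advances cs.CV

Yuanzhi Liang, Yijie Fang, Rui Li, Ziqi Ni, Ruijie Su

TL;DR: This survey examines how reinforcement learning (RL) can be combined with visual generative models to improve the quality, semantic accuracy, and physical realism of generated content. It systematically reviews RL in image, video, and 3D/4D generation and discusses future research directions.

Motivation: Generative models are typically trained with surrogate objectives such as likelihood or reconstruction loss, which can misalign with perceptual quality, semantic accuracy, or physical realism. RL offers a principled framework for optimizing non-differentiable, preference-driven, and temporally structured objectives.

Result: RL has been shown to improve the controllability, consistency, and human alignment of generated content.

Insight: Combining RL with generative models opens a direction for handling non-differentiable and complex objectives in generation tasks, though challenges such as computational efficiency and stability remain.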

Abstract: Generative models have made significant progress in synthesizing visual content, including images, videos, and 3D/4D structures. However, they are typically trained with surrogate objectives such as likelihood or reconstruction loss, which often misalign with perceptual quality, semantic accuracy, or physical realism. Reinforcement learning (RL) offers a principled framework for optimizing non-differentiable, preference-driven, and temporally structured objectives. Recent advances demonstrate its effectiveness in enhancing controllability, consistency, and human alignment across generative tasks. This survey provides a systematic overview of RL-based methods for visual content generation. We review the evolution of RL from classical control to its role as a general-purpose optimization tool, and examine its integration into image, video, and 3D/4D generation. Across these domains, RL serves not only as a fine-tuning mechanism but also as a structural component for aligning generation with complex, high-level goals. We conclude with open challenges and future research directions at the intersection of RL and generative modeling.

Andrew Bai, Justin Cui, Ruochen Wang, Cho-Jui Hsieh

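TL;DR: The paper studies the balance between visual concepts and visual skills in vision-language instruction tuning and proposes a targeted data selection method that improves multimodal model performance on benchmarks.

Motivation: Current vision-language instruction tuning methods do not explicitly distinguish between learning visual concepts and learning visual skills, which limits performance gains.

Result: Across 10+ benchmarks, the method improves on the best existing baseline by 0.9% on average and by 1.5% on the skill-focused subset.

Insight: Instruction selection requires balancing conceptual knowledge against visual skills, and ignoring this trade-off limits model performance.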

Abstract: Vision-language instruction tuning achieves two main purposes: learning visual concepts and learning visual skills. In this paper, we found that vision-language benchmarks fall into the dichotomy of mainly benefiting from training on instructions with similar skills or visual concepts. Inspired by the discovery, we designed a simple targeted training data selection method to optimize the performance of a given benchmark. We first extract the concepts/skills from the benchmark, determine whether the benchmark predominantly benefits from similar concepts or skills, and finally select instructions with the most matching concepts/skills. Experiments on 10+ benchmarks validate the effectiveness of our targeted data selection method, showing +0.9% over the best existing baseline averaged over all benchmarks and +1.5% on the skill-focused subset. Our findings underscore the importance of recognizing the inherent trade-off within instruction selection, which requires balancing the acquisition of conceptual knowledge against visual skill.

[19] Glo-DMU: A Deep Morphometry Framework of Ultrastructural Characterization in Glomerular Electron Microscopic Images cs.CVPDF

Zhentai Zhang, Danyi Weng, Guibin Zhang, Xiang Chen, Kaixing Long

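TL;DR: Glo-DMU is a deep-learning morphometry framework for glomerular electron microscopy images that uses three deep models to quantify ultrastructural features automatically, with high precision and high throughput, in support of renal pathology diagnosis.

Details

Motivation: Current research focuses mainly on recognizing individual ultrastructures, which falls short of practical diagnostic needs. Glo-DMU aims to provide a more comprehensive diagnostic aid by quantifying several widely used ultrastructural features simultaneously.

Result: Validated in real-world diagnostic scenarios on 115 patients, the automatic quantification results agree well with the morphological features described in the pathology reports.

Insight: Glo-DMU achieves fully automated, high-precision quantification of multiple features, giving renal pathology an efficient tool that may improve diagnostic efficiency and accuracy.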

Abstract: Complex and diverse ultrastructural features can indicate the type, progression, and prognosis of kidney diseases. Recently, computational pathology combined with deep learning methods has shown tremendous potential in advancing automatic morphological analysis of glomerular ultrastructure. However, current research predominantly focuses on the recognition of individual ultrastructures, which makes it challenging to meet practical diagnostic needs. In this study, we propose the glomerular morphometry framework of ultrastructural characterization (Glo-DMU), which is grounded on three deep models: the ultrastructure segmentation model, the glomerular filtration barrier region classification model, and the electron-dense deposits detection model. Following the conventional protocol of renal biopsy diagnosis, this framework simultaneously quantifies the three most widely used ultrastructural features: the thickness of glomerular basement membrane, the degree of foot process effacement, and the location of electron-dense deposits. We evaluated the framework on 115 patients with 9 renal pathological types in real-world diagnostic scenarios, demonstrating good consistency between automatic quantification results and morphological descriptions in the pathological reports. Glo-DMU possesses the characteristics of full automation, high precision, and high throughput, quantifying multiple ultrastructural features simultaneously, and providing an efficient tool for assisting renal pathologists.

[20] Improving OCR for Historical Texts of Multiple Languages cs.CV | cs.CLPDF

Hylke Westerdijk, Ben Blankenborg, Khondoker Ittehadul Islam

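TL;DR: This paper presents methods for improving OCR on historical texts in multiple languages, using data augmentation and deep learning models (Kraken, TrOCR, and CRNN) to improve character recognition.

Details

Motivation: OCR on historical texts is difficult because of diverse fonts and poor preservation. The paper aims to improve OCR performance on multilingual historical texts through deep learning and data augmentation.

Result: Experiments show that the proposed methods perform well on multilingual historical-text OCR tasks and significantly improve recognition accuracy.

Insight: Combining data augmentation with deep learning models is an effective route to better OCR on historical texts, especially in multilingual and multi-font settings.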

Abstract: This paper presents our methodology and findings from three tasks across Optical Character Recognition (OCR) and Document Layout Analysis using advanced deep learning techniques. First, for the historical Hebrew fragments of the Dead Sea Scrolls, we enhanced our dataset through extensive data augmentation and employed the Kraken and TrOCR models to improve character recognition. For the 16th- to 18th-century meeting resolutions task, we used a Convolutional Recurrent Neural Network (CRNN) that integrated DeepLabV3+ for semantic segmentation with a Bidirectional LSTM, incorporating confidence-based pseudolabeling to refine our model. Finally, for the modern English handwriting recognition task, we applied a CRNN with a ResNet34 encoder, trained using the Connectionist Temporal Classification (CTC) loss function to effectively capture sequential dependencies. This report offers valuable insights and suggests potential directions for future research.
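
Illustrative sketch (not from the paper): a CRNN trained with CTC loss emits one score vector per time step, and a transcription is commonly recovered by greedy CTC decoding: take the best class per frame, merge consecutive repeats, then drop blanks. The abstract does not describe the authors' decoder, so the function name, the toy alphabet, and the convention that index 0 is the blank are assumptions.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Greedy CTC decoding: argmax per frame, merge repeats, remove blanks.

    frame_scores: list of per-frame score lists, one score per class.
    alphabet: list mapping class index to character; alphabet[blank] is unused.
    """
    best_path = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]
    decoded = []
    previous = None
    for idx in best_path:
        # Emit a character only when the class changes and is not the blank.
        if idx != previous and idx != blank:
            decoded.append(alphabet[idx])
        previous = idx
    return "".join(decoded)


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


alphabet = ["-", "a", "b"]            # index 0 is the CTC blank (assumption)
path = [1, 1, 0, 1, 2, 2, 0]          # "a a - a b b -"
text = ctc_greedy_decode([one_hot(i, 3) for i in path], alphabet)
```

The blank between the two runs of `a` is what lets CTC represent a doubled letter; without it the repeats would be merged into one.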

[21] AtomDiffuser: Time-Aware Degradation Modeling for Drift and Beam Damage in STEM Imaging cs.CVPDF

Hao Wang, Hongkui Zheng, Kai He, Abolfazl Razi

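TL;DR: AtomDiffuser proposes a time-aware degradation modeling framework that addresses drift and beam damage in STEM imaging.

Details

Motivation: STEM imaging is widely used in materials science but suffers from signal degradation caused by mechanical/thermal instabilities and radiation damage; existing methods struggle to explicitly separate these effects or to model material dynamics at high resolution.

Result: Experiments show that the method performs well on real cryo-STEM data and can quantify degradation patterns associated with radiation-induced atomic instabilities.

Insight: Treating time-series STEM data through the lens of degradation modeling provides a new tool for high-resolution analysis of how materials evolve dynamically.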

Abstract: Scanning transmission electron microscopy (STEM) plays a critical role in modern materials science, enabling direct imaging of atomic structures and their evolution under external interferences. However, interpreting time-resolved STEM data remains challenging due to two entangled degradation effects: spatial drift caused by mechanical and thermal instabilities, and beam-induced signal loss resulting from radiation damage. These factors distort both geometry and intensity in complex, temporally correlated ways, making it difficult for existing methods to explicitly separate their effects or model material dynamics at atomic resolution. In this work, we present AtomDiffuser, a time-aware degradation modeling framework that disentangles sample drift and radiometric attenuation by predicting an affine transformation and a spatially varying decay map between any two STEM frames. Unlike traditional denoising or registration pipelines, our method leverages degradation as a physically heuristic, temporally conditioned process, enabling interpretable structural evolutions across time. Trained on synthetic degradation processes, AtomDiffuser also generalizes well to real-world cryo-STEM data. It further supports high-resolution degradation inference and drift alignment, offering tools for visualizing and quantifying degradation patterns that correlate with radiation-induced atomic instabilities.

[22] Contrast Sensitivity Function of Multimodal Vision-Language Models cs.CVPDF

Pablo Hernández-Cámara, Alexandra Gomez-Villa, Jose Manuel Jaén-Lorites, Jorge Vila-Tomás, Jesus Malo

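TL;DR: This paper proposes a new method that estimates the contrast sensitivity function (CSF) of multimodal vision-language models (VLMs) through behavioral psychophysics-style experiments, in order to assess how well they align with human visual perception. Some models approximate the shape or the magnitude of the human CSF, but none fully reproduces both, and prompt phrasing strongly affects model responses.

Details

Motivation: To study how multimodal vision-language models perceive low-level visual features, in particular the contrast sensitivity function (CSF), and to assess the gap between them and human visual perception.

Result: Some models approximate the shape or magnitude of the human CSF, but no model fully reproduces both; prompt phrasing has a large effect on model responses.

Insight: Multimodal vision-language models still differ from humans in visual perception, and prompt stability is an important challenge for their use.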

Abstract: Assessing the alignment of multimodal vision-language models (VLMs) with human perception is essential to understand how they perceive low-level visual features. A key characteristic of human vision is the contrast sensitivity function (CSF), which describes sensitivity to spatial frequency at low contrasts. Here, we introduce a novel behavioral psychophysics-inspired method to estimate the CSF of chat-based VLMs by directly prompting them to judge pattern visibility at different contrasts for each frequency. This methodology is closer to real psychophysics experiments than previously reported approaches. Using band-pass filtered noise images and a diverse set of prompts, we assess model responses across multiple architectures. We find that while some models approximate human-like CSF shape or magnitude, none fully replicate both. Notably, prompt phrasing has a large effect on the responses, raising concerns about prompt stability. Our results provide a new framework for probing visual sensitivity in multimodal models and reveal key gaps between their visual representations and human perception.
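
Illustrative sketch (not from the paper): contrast sensitivity at a spatial frequency is conventionally the reciprocal of the contrast threshold at that frequency. The toy code below turns yes/no visibility answers into a CSF estimate by taking the lowest contrast reported as visible as the threshold. That threshold rule, the function name, and the sample data are assumptions; the abstract does not specify how the authors fit thresholds.

```python
def contrast_sensitivity(responses):
    """Estimate sensitivity (1 / threshold contrast) per spatial frequency.

    responses: {frequency: [(contrast, reported_visible), ...]}
    The threshold is taken as the lowest contrast judged visible (assumed rule).
    Frequencies never judged visible get sensitivity 0.0.
    """
    csf = {}
    for frequency, trials in responses.items():
        visible = [contrast for contrast, seen in trials if seen]
        csf[frequency] = 1.0 / min(visible) if visible else 0.0
    return csf


# Hypothetical answers from a chat-based VLM, keyed by cycles per degree.
answers = {
    2.0: [(0.01, False), (0.05, True), (0.10, True)],
    16.0: [(0.10, False), (0.50, False)],
}
csf = contrast_sensitivity(answers)
```

Repeating the estimate across several prompt phrasings would expose the prompt sensitivity the abstract warns about.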

[23] PQ-DAF: Pose-driven Quality-controlled Data Augmentation for Data-scarce Driver Distraction Detection cs.CV | cs.AIPDF

Haibin Sun, Xinghui Song

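TL;DR: This paper proposes a pose-driven, quality-controlled data augmentation framework (PQ-DAF) that combines a diffusion model with a vision-language model to address data scarcity and cross-domain generalization in driver distraction detection.

Details

Motivation: Driver distraction detection is critical for traffic safety, but high annotation costs and the domain shift between training data and deployment environments limit how well existing models generalize in practice. The paper targets these data-scarcity and cross-domain challenges.

Result: Experiments show that PQ-DAF significantly improves driver distraction detection performance and cross-domain generalization under data-scarce conditions.

Insight: Combining generative models with vision-language models can effectively mitigate data scarcity while ensuring the quality of the augmented data, offering a reliable path for practical applications.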

Abstract: Driver distraction detection is essential for improving traffic safety and reducing road accidents. However, existing models often suffer from degraded generalization when deployed in real-world scenarios. This limitation primarily arises from the few-shot learning challenge caused by the high cost of data annotation in practical environments, as well as the substantial domain shift between training datasets and target deployment conditions. To address these issues, we propose a Pose-driven Quality-controlled Data Augmentation Framework (PQ-DAF) that leverages a vision-language model for sample filtering to cost-effectively expand training data and enhance cross-domain robustness. Specifically, we employ a Progressive Conditional Diffusion Model (PCDMs) to accurately capture key driver pose features and synthesize diverse training examples. A sample quality assessment module, built upon the CogVLM vision-language model, is then introduced to filter out low-quality synthetic samples based on a confidence threshold, ensuring the reliability of the augmented dataset. Extensive experiments demonstrate that PQ-DAF substantially improves performance in few-shot driver distraction detection, achieving significant gains in model generalization under data-scarce conditions.

[24] SC-Lane: Slope-aware and Consistent Road Height Estimation Framework for 3D Lane Detection cs.CVPDF

Chaesong Park, Eunbin Seo, Jihyeon Hwang, Jongwoo Lim

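TL;DR: SC-Lane proposes a novel 3D lane detection framework that dynamically fuses slope features and enforces temporal consistency, significantly improving the robustness and accuracy of road height estimation.

Details

Motivation: Traditional methods rely on fixed slope anchors and cannot adapt to diverse road geometries, so a more flexible and temporally consistent height estimation method is needed.

Result: On the OpenLane benchmark, SC-Lane achieves state-of-the-art performance with an F-score of 64.3%, clearly outperforming existing methods.

Insight: Dynamic slope adaptation and temporal consistency are key factors in improving 3D lane detection performance.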

Abstract: In this paper, we introduce SC-Lane, a novel slope-aware and temporally consistent heightmap estimation framework for 3D lane detection. Unlike previous approaches that rely on fixed slope anchors, SC-Lane adaptively determines the fusion of slope-specific height features, improving robustness to diverse road geometries. To achieve this, we propose a Slope-Aware Adaptive Feature module that dynamically predicts the appropriate weights from image cues for integrating multi-slope representations into a unified heightmap. Additionally, a Height Consistency Module enforces temporal coherence, ensuring stable and accurate height estimation across consecutive frames, which is crucial for real-world driving scenarios. To evaluate the effectiveness of SC-Lane, we employ three standardized metrics (Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and threshold-based accuracy) which, although common in surface and depth estimation, have been underutilized for road height assessment. Using the LiDAR-derived heightmap dataset introduced in prior work [20], we benchmark our method under these metrics, thereby establishing a rigorous standard for future comparisons. Extensive experiments on the OpenLane benchmark demonstrate that SC-Lane significantly improves both height estimation and 3D lane detection, achieving state-of-the-art performance with an F-score of 64.3%, outperforming existing methods by a notable margin. For detailed results and a demonstration video, please refer to our project page: https://parkchaesong.github.io/sclane/
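
Illustrative sketch (not from the paper): the three heightmap metrics named in the abstract can be computed as below. MAE and RMSE follow their standard definitions. The threshold-based accuracy is assumed here to be the fraction of cells whose absolute error is within a fixed tolerance; the paper may define it differently (for example as a ratio test, as is common in depth estimation), and the sample values are made up.

```python
import math


def height_metrics(predicted, target, threshold=0.1):
    """Return (MAE, RMSE, threshold accuracy) for two equal-length height lists.

    threshold accuracy = share of cells with |error| <= threshold (assumed
    definition; units follow the inputs, e.g. metres).
    """
    errors = [abs(p - t) for p, t in zip(predicted, target)]
    n = len(errors)
    mae = sum(errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    accuracy = sum(e <= threshold for e in errors) / n
    return mae, rmse, accuracy


mae, rmse, acc = height_metrics(
    predicted=[0.0, 0.5, 1.0, 1.5],
    target=[0.0, 0.25, 1.0, 2.0],
    threshold=0.25,
)
```

RMSE penalizes the single large miss (0.5) more heavily than MAE does, which is why reporting both is informative for heightmaps with occasional outliers.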

[25] NanoControl: A Lightweight Framework for Precise and Efficient Control in Diffusion Transformer cs.CVPDF

Shanyuan Liu, Jian Zhu, Junda Lu, Yue Gong, Liuzhuozheng Li

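TL;DR: NanoControl proposes a lightweight framework for efficient and precise controllable generation in Diffusion Transformers, sharply reducing parameter and compute costs while maintaining high generation quality.

Details

Motivation: Existing ControlNet-based methods incur high parameter and compute overhead when used with Diffusion Transformers; the authors seek a more efficient solution.

Result: NanoControl adds only 0.024% more parameters and 0.029% more computation (GFLOPs) while achieving SOTA controllable generation performance.

Insight: A lightweight control module design can substantially reduce compute cost without sacrificing performance, suggesting a new approach to efficient controllable generation.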

Abstract: Diffusion Transformers (DiTs) have demonstrated exceptional capabilities in text-to-image synthesis. However, in the domain of controllable text-to-image generation using DiTs, most existing methods still rely on the ControlNet paradigm originally designed for UNet-based diffusion models. This paradigm introduces significant parameter overhead and increased computational costs. To address these challenges, we propose the Nano Control Diffusion Transformer (NanoControl), which employs Flux as the backbone network. Our model achieves state-of-the-art controllable text-to-image generation performance while incurring only a 0.024% increase in parameter count and a 0.029% increase in GFLOPs, thus enabling highly efficient controllable generation. Specifically, rather than duplicating the DiT backbone for control, we design a LoRA-style (low-rank adaptation) control module that directly learns control signals from raw conditioning inputs. Furthermore, we introduce a KV-Context Augmentation mechanism that integrates condition-specific key-value information into the backbone in a simple yet highly effective manner, facilitating deep fusion of conditional features. Extensive benchmark experiments demonstrate that NanoControl significantly reduces computational overhead compared to conventional control approaches, while maintaining superior generation quality and achieving improved controllability.
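
Illustrative sketch (not from the paper): a LoRA-style module replaces a full `d_out x d_in` update with two low-rank factors, so it adds `rank * (d_in + d_out)` parameters per adapted layer. The arithmetic below shows why such modules can stay far below 1% of a backbone's size. All dimensions, the layer count, and the backbone size are hypothetical and are not NanoControl's actual configuration.

```python
def lora_extra_params(d_in, d_out, rank):
    """Parameters added by one low-rank pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def overhead_percent(d_in, d_out, rank, n_layers, base_params):
    """Added parameters as a percentage of the backbone parameter count."""
    added = n_layers * lora_extra_params(d_in, d_out, rank)
    return 100.0 * added / base_params


# Hypothetical numbers: ten adapted 1024x1024 projections, rank 4,
# on a one-billion-parameter backbone.
per_layer = lora_extra_params(1024, 1024, 4)
percent = overhead_percent(1024, 1024, 4, n_layers=10, base_params=1_000_000_000)
```

By contrast, duplicating backbone blocks in the ControlNet style adds parameters in proportion to the backbone itself, which is the overhead the abstract sets out to avoid.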

[26] STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes cs.CVPDF

Keishi Ishihara, Kento Sasaki, Tsubasa Takahashi, Daiki Shiono, Yu Yamaguchi

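TL;DR: STRIDE-QA is a large-scale visual question answering dataset focused on spatiotemporal reasoning in urban driving scenes, addressing the weak reasoning ability of existing vision-language models in dynamic traffic scenes.

Details

Motivation: Existing vision-language models (VLMs) are trained mainly on static image-text pairs and cannot meet the need for precise spatiotemporal reasoning in dynamic traffic scenes; STRIDE-QA is proposed to fill this gap.

Result: Existing VLMs score near zero on prediction consistency, whereas fine-tuned models reach 55% on spatial localization and 28% on motion prediction.

Insight: Spatiotemporal reasoning in dynamic scenes requires dedicated datasets; existing general-purpose VLMs are limited on such tasks.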

Abstract: Vision-Language Models (VLMs) have been applied to autonomous driving to support decision-making in complex real-world scenarios. However, their training on static, web-sourced image-text pairs fundamentally limits the precise spatiotemporal reasoning required to understand and predict dynamic traffic scenes. We address this critical gap with STRIDE-QA, a large-scale visual question answering (VQA) dataset for physically grounded reasoning from an ego-centric perspective. Constructed from 100 hours of multi-sensor driving data in Tokyo, capturing diverse and challenging conditions, STRIDE-QA is the largest VQA dataset for spatiotemporal reasoning in urban driving, offering 16 million QA pairs over 285K frames. Grounded by dense, automatically generated annotations including 3D bounding boxes, segmentation masks, and multi-object tracks, the dataset uniquely supports both object-centric and ego-centric reasoning through three novel QA tasks that require spatial localization and temporal prediction. Our benchmarks demonstrate that existing VLMs struggle significantly, achieving near-zero scores on prediction consistency. In contrast, VLMs fine-tuned on STRIDE-QA exhibit dramatic performance gains, achieving 55% success in spatial localization and 28% consistency in future motion prediction, compared to near-zero scores from general-purpose VLMs. Therefore, STRIDE-QA establishes a comprehensive foundation for developing more reliable VLMs for safety-critical autonomous systems.

[27] CRISP: Contrastive Residual Injection and Semantic Prompting for Continual Video Instance Segmentation cs.CVPDF

Baichen Liu, Qi Lyu, Xudong Wang, Jiahua Dong, Lianqing Liu

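TL;DR: CRISP proposes contrastive residual injection and semantic prompting to resolve instance-, category-, and task-level confusion in continual video instance segmentation, significantly improving performance and avoiding catastrophic forgetting.

Details

Motivation: Continual video instance segmentation must absorb new categories while retaining learned ones and preserving temporal consistency across frames. Existing methods are prone to confusion across instances, categories, and tasks, so a more effective solution is needed.

Result: On the YouTube-VIS-2019 and YouTube-VIS-2021 datasets, CRISP significantly outperforms existing methods, avoids catastrophic forgetting, and improves segmentation and classification performance.

Insight: Combining contrastive learning with semantic prompting can effectively balance learning new tasks against remembering old ones in continual learning, and is particularly effective for video instance segmentation.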

Abstract: Continual video instance segmentation demands both the plasticity to absorb new object categories and the stability to retain previously learned ones, all while preserving temporal consistency across frames. In this work, we introduce Contrastive Residual Injection and Semantic Prompting (CRISP), an early attempt tailored to address the instance-wise, category-wise, and task-wise confusion in continual video instance segmentation. For instance-wise learning, we model instance tracking and construct an instance correlation loss, which emphasizes the correlation with the prior query space while strengthening the specificity of the current task query. For category-wise learning, we build an adaptive residual semantic prompt (ARSP) learning framework, which constructs a learnable semantic residual prompt pool generated by category text and uses an adjustive query-prompt matching mechanism to build a mapping relationship between the query of the current task and the semantic residual prompt. Meanwhile, a semantic consistency loss based on contrastive learning is introduced to maintain semantic coherence between object queries and residual prompts during incremental training. For task-wise learning, to ensure the correlation at the inter-task level within the query space, we introduce a concise yet powerful initialization strategy for incremental prompts. Extensive experiments on the YouTube-VIS-2019 and YouTube-VIS-2021 datasets demonstrate that CRISP significantly outperforms existing continual segmentation methods in the long-term continual video instance segmentation task, avoiding catastrophic forgetting and effectively improving segmentation and classification performance. The code is available at https://github.com/01upup10/CRISP.

[28] DOD-SA: Infrared-Visible Decoupled Object Detection with Single-Modality Annotations cs.CV | 68T07, 68T45, 68U10 | I.2.10PDF

Hang Jin, Chenqiang Gao, Junjie Guo, Fangcen Liu, Kanghui Tian

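TL;DR: This paper proposes DOD-SA, a novel infrared-visible decoupled object detection framework that achieves cross-modality knowledge transfer with single-modality annotations, substantially reducing annotation cost while outperforming existing methods.

Details

Motivation: Existing infrared-visible object detection methods usually require dual-modality annotations, which makes annotation expensive. The paper addresses this by enabling cross-modality detection from single-modality annotations.

Result: On the DroneVehicle dataset, DOD-SA outperforms current state-of-the-art methods, demonstrating that it improves detection performance while lowering annotation cost.

Insight: Through shared knowledge transfer and pseudo-label refinement, single-modality annotations can support dual-modality detection, offering a new approach to cross-modality learning.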

Abstract: Infrared-visible object detection has shown great potential in real-world applications, enabling robust all-day perception by leveraging the complementary information of infrared and visible images. However, existing methods typically require dual-modality annotations to output detection results for both modalities during prediction, which incurs high annotation costs. To address this challenge, we propose a novel infrared-visible Decoupled Object Detection framework with Single-modality Annotations, called DOD-SA. The architecture of DOD-SA is built upon a Single- and Dual-Modality Collaborative Teacher-Student Network (CoSD-TSNet), which consists of a single-modality branch (SM-Branch) and a dual-modality decoupled branch (DMD-Branch). The teacher model generates pseudo-labels for the unlabeled modality, simultaneously supporting the training of the student model. The collaborative design enables cross-modality knowledge transfer from the labeled modality to the unlabeled modality, and facilitates effective SM-to-DMD branch supervision. To further improve the decoupling ability of the model and the pseudo-label quality, we introduce a Progressive and Self-Tuning Training Strategy (PaST) that trains the model in three stages: (1) pretraining SM-Branch, (2) guiding the learning of DMD-Branch by SM-Branch, and (3) refining DMD-Branch. In addition, we design a Pseudo Label Assigner (PLA) to align and pair labels across modalities, explicitly addressing modality misalignment during training. Extensive experiments on the DroneVehicle dataset demonstrate that our method outperforms state-of-the-art (SOTA) methods.

[29] SkeySpot: Automating Service Key Detection for Digital Electrical Layout Plans in the Construction Industry cs.CV | cs.LGPDF

Dhruv Dosi, Rohit Meena, Param Rajpura, Yogesh Kumar Meena

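TL;DR: This paper presents SkeySpot, an automated tool for detecting service keys in digital electrical layout plans. Built on a YOLOv8 model that achieves 82.5% mAP on the DELP dataset, it offers a lightweight solution for standardisation and interoperability in the construction industry.

Details

Motivation: In the construction industry, legacy scanned floor plans are not machine-readable, which makes large-scale interpretation time-consuming and error-prone; automated symbol detection is needed to support workflows such as cost estimation and infrastructure maintenance.

Result: YOLOv8 reaches 82.5% mAP on the DELP dataset; the SkeySpot toolkit supports real-time detection and structured output, improving standardisation and interoperability in the construction industry.

Insight: Automated symbol detection can substantially cut manual annotation cost and advance digitisation in the construction industry, particularly benefiting small and medium-sized enterprises; standardised output lays the groundwork for compatibility with downstream applications and regulatory platforms.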

Abstract: Legacy floor plans, often preserved only as scanned documents, remain essential resources for architecture, urban planning, and facility management in the construction industry. However, the lack of machine-readable floor plans renders large-scale interpretation both time-consuming and error-prone. Automated symbol spotting offers a scalable solution by enabling the identification of service key symbols directly from floor plans, supporting workflows such as cost estimation, infrastructure maintenance, and regulatory compliance. This work introduces a labelled Digitised Electrical Layout Plans (DELP) dataset comprising 45 scanned electrical layout plans annotated with 2,450 instances across 34 distinct service key classes. A systematic evaluation framework is proposed using pretrained object detection models for the DELP dataset. Among the models benchmarked, YOLOv8 achieves the highest performance with a mean Average Precision (mAP) of 82.5%. Using YOLOv8, we develop SkeySpot, a lightweight, open-source toolkit for real-time detection, classification, and quantification of electrical symbols. SkeySpot produces structured, standardised outputs that can be scaled up for interoperable building information workflows, ultimately enabling compatibility across downstream applications and regulatory platforms. By lowering dependency on proprietary CAD systems and reducing manual annotation effort, this approach makes the digitisation of electrical layouts more accessible to small and medium-sized enterprises (SMEs) in the construction industry, while supporting broader goals of standardisation, interoperability, and sustainability in the built environment.

[30] From Images to Perception: Emergence of Perceptual Properties by Reconstructing Images cs.CVPDF

Pablo Hernández-Cámara, Jesus Malo, Valero Laparra

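TL;DR: The paper proposes a bio-inspired architecture, PerceptNet, optimized on image reconstruction tasks, and finds that its encoder stage (V1-like layer) correlates strongly with human perceptual judgments, suggesting that the visual system may be naturally tuned to remove specific levels of distortion.

Details

Motivation: To investigate whether human visual perception can emerge naturally from image statistics, and whether bio-inspired models can learn perceptual metrics without supervision.

Result: The V1-like layer shows the highest correlation with human perceptual judgments on image distortion, and the alignment is best for moderate noise, blur, and sparsity.

Insight: The visual system may be naturally tuned to remove particular distortions, and bio-inspired models can learn perceptual metrics without supervision, offering a new perspective on early vision.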

Abstract: A number of scientists have suggested that human visual perception may emerge from image statistics, shaping efficient neural representations in early vision. In this work, a bio-inspired architecture that can accommodate several known facts in the retina-V1 cortex, the PerceptNet, has been end-to-end optimized for different tasks related to image reconstruction: autoencoding, denoising, deblurring, and sparsity regularization. Our results show that the encoder stage (V1-like layer) consistently exhibits the highest correlation with human perceptual judgments on image distortion despite not using perceptual information in the initialization or training. This alignment exhibits an optimum for moderate noise, blur and sparsity. These findings suggest that the visual system may be tuned to remove those particular levels of distortion with that level of sparsity and that biologically inspired models can learn perceptual metrics without human supervision.

[31] Trajectory-aware Shifted State Space Models for Online Video Super-Resolution cs.CVPDF

Qiang Zhu, Xiandong Meng, Yuxian Jiang, Fan Zhang, David Bull

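TL;DR: This paper proposes an online video super-resolution method based on trajectory-aware shifted state space models (TS-Mamba), combining long-term trajectory modeling with efficient Mamba to significantly improve the efficiency and performance of spatio-temporal information aggregation.

Details

Motivation: Existing online video super-resolution methods usually rely on only the adjacent previous frame for temporal alignment, which limits long-range temporal modeling. State space models (SSMs), with linear computational complexity and a global receptive field, offer a new way to address this.

Result: On three VSR test datasets, TS-Mamba achieves SOTA performance in most cases and reduces computational complexity (measured in MACs) by over 22.7%.

Insight: Combining trajectory modeling with efficient state space models such as Mamba is an effective way to tackle long-range temporal modeling and computational efficiency in online video super-resolution.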

Abstract: Online video super-resolution (VSR) is an important technique for many real-world video processing applications, which aims to restore the current high-resolution video frame based on temporally previous frames. Most of the existing online VSR methods solely employ one neighboring previous frame to achieve temporal alignment, which limits long-range temporal modeling of videos. Recently, state space models (SSMs) have been proposed with linear computational complexity and a global receptive field, which significantly improve computational efficiency and performance. In this context, this paper presents a novel online VSR method based on Trajectory-aware Shifted SSMs (TS-Mamba), leveraging both long-term trajectory modeling and low-complexity Mamba to achieve efficient spatio-temporal information aggregation. Specifically, TS-Mamba first constructs the trajectories within a video to select the most similar tokens from the previous frames. Then, a Trajectory-aware Shifted Mamba Aggregation (TSMA) module consisting of the proposed shifted SSM blocks is employed to aggregate the selected tokens. The shifted SSM blocks are designed based on Hilbert scanning and corresponding shift operations to compensate for scanning losses and strengthen the spatial continuity of Mamba. Additionally, we propose a trajectory-aware loss function to supervise the trajectory generation, ensuring the accuracy of token selection when training our model. Extensive experiments on three widely used VSR test datasets demonstrate that compared with six online VSR benchmark models, our TS-Mamba achieves state-of-the-art performance in most cases and over 22.7% complexity reduction (in MACs). The source code for TS-Mamba will be available at https://github.com.
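
Illustrative sketch (not from the paper): Hilbert scanning flattens a 2D token grid into a 1D sequence in which consecutive tokens are always spatial neighbours, which is what makes it attractive for sequence models such as Mamba. The code below uses the standard index-to-coordinate conversion for a Hilbert curve on a square grid whose side is a power of two. It does not reproduce the paper's shift operations, and the function names are assumptions.

```python
def hilbert_d2xy(side, d):
    """Map position d along the Hilbert curve to (x, y) on a side x side grid.

    side must be a power of two; d ranges over 0 .. side*side - 1.
    """
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the quadrant so sub-curves join end to end.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_scan(grid):
    """Flatten a square grid (list of rows) into Hilbert-curve order."""
    side = len(grid)
    coords = (hilbert_d2xy(side, d) for d in range(side * side))
    return [grid[y][x] for x, y in coords]


order = [hilbert_d2xy(4, d) for d in range(16)]
flat = hilbert_scan([[0, 1], [2, 3]])
```

A plain raster scan, by comparison, jumps a full row width between the last token of one row and the first of the next, which breaks spatial continuity in the flattened sequence.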

[32] Multi-Label Plant Species Prediction with Metadata-Enhanced Multi-Head Vision Transformers cs.CV | cs.IR | cs.LGPDF

Hanna Herasimchyk, Robin Labryga, Tomislav Prusina

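TL;DR: This paper proposes a multi-head vision transformer approach for multi-label plant species prediction in vegetation plot images, combining techniques such as multi-scale tiling and dynamic threshold optimization, and placed third in the PlantCLEF 2025 challenge.

Details

Motivation: Plant species prediction faces a domain shift between single-species training images and multi-species test images, which must be addressed with multi-label classification and domain adaptation techniques.

Result: The approach performs strongly on a dataset of about 1.4 million training images covering 7,806 plant species, finishing third on the private leaderboard.

Insight: Combining taxonomic hierarchy information with multi-scale tiling effectively counters the domain shift, and dynamic threshold optimization is particularly important for multi-label classification.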

Abstract: We present a multi-head vision transformer approach for multi-label plant species prediction in vegetation plot images, addressing the PlantCLEF 2025 challenge. The task involves training models on single-species plant images while testing on multi-species quadrat images, creating a drastic domain shift. Our methodology leverages a pre-trained DINOv2 Vision Transformer Base (ViT-B/14) backbone with multiple classification heads for species, genus, and family prediction, utilizing taxonomic hierarchies. Key contributions include multi-scale tiling to capture plants at different scales, dynamic threshold optimization based on mean prediction length, and ensemble strategies through bagging and Hydra model architectures. The approach incorporates various inference techniques including image cropping to remove non-plant artifacts, top-n filtering for prediction constraints, and logit thresholding strategies. Experiments were conducted on approximately 1.4 million training images covering 7,806 plant species. Results demonstrate strong performance, making our submission 3rd best on the private leaderboard. Our code is available at https://github.com/geranium12/plant-clef-2025/tree/v1.0.0.
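
Illustrative sketch (not from the paper): one plausible reading of "dynamic threshold optimization based on mean prediction length" is to pick the probability threshold at which the average number of predicted species per image is closest to a target value. The abstract gives no further detail, so this interpretation, the function names, the candidate grid, and the toy probabilities are all assumptions.

```python
def mean_prediction_length(prob_rows, threshold):
    """Average number of classes per image whose probability reaches the threshold."""
    counts = [sum(p >= threshold for p in row) for row in prob_rows]
    return sum(counts) / len(prob_rows)


def pick_threshold(prob_rows, target_length, candidates):
    """Choose the candidate threshold whose mean prediction length is nearest the target."""
    return min(
        candidates,
        key=lambda t: abs(mean_prediction_length(prob_rows, t) - target_length),
    )


# Two hypothetical quadrat images with three candidate species each.
probabilities = [
    [0.9, 0.6, 0.2],
    [0.8, 0.3, 0.1],
]
threshold = pick_threshold(probabilities, target_length=1.5, candidates=[0.25, 0.5, 0.75])
```

The target length would have to come from prior knowledge about how many species a typical plot contains, since the test images carry no labels.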

[33] SingleStrip: learning skull-stripping from a single labeled example cs.CV | cs.LGPDF

Bella Specktor-Fadida, Malte Hoffmann

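TL;DR: This paper proposes a method that combines domain randomization and self-training to train a 3D skull-stripping network from a single labeled example. Automatic voxel-intensity binning and filtering of high-quality pseudo-labels by convolutional autoencoder (AE) reconstruction error yield performance close to that of models trained with more labels.

Details

Motivation: Conventional deep segmentation methods depend on large amounts of labeled data, but labeling volumetric images such as MRI is slow and costly. Domain randomization and semi-supervised self-training help, but anatomical variability is limited when very few labeled examples are available.

Result: On out-of-distribution data, performance approaches that of models trained with more labels, and AE-based ranking is more reliable than consistency-based ranking.

Insight: A semi-supervised strategy combining domain randomization with AE-based quality control can substantially reduce labeling requirements and suits studies of new anatomical structures or imaging techniques.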

Abstract: Deep learning segmentation relies heavily on labeled data, but manual labeling is laborious and time-consuming, especially for volumetric images such as brain magnetic resonance imaging (MRI). While recent domain-randomization techniques alleviate the dependency on labeled data by synthesizing diverse training images from label maps, they offer limited anatomical variability when very few label maps are available. Semi-supervised self-training addresses label scarcity by iteratively incorporating model predictions into the training set, enabling networks to learn from unlabeled data. In this work, we combine domain randomization with self-training to train three-dimensional skull-stripping networks using as little as a single labeled example. First, we automatically bin voxel intensities, yielding labels we use to synthesize images for training an initial skull-stripping model. Second, we train a convolutional autoencoder (AE) on the labeled example and use its reconstruction error to assess the quality of brain masks predicted for unlabeled data. Third, we select the top-ranking pseudo-labels to fine-tune the network, achieving skull-stripping performance on out-of-distribution data that approaches models trained with more labeled images. We compare AE-based ranking to consistency-based ranking under test-time augmentation, finding that the AE approach yields a stronger correlation with segmentation accuracy. Our results highlight the potential of combining domain randomization and AE-based quality control to enable effective semi-supervised segmentation from extremely limited labeled data. This strategy may ease the labeling burden that slows progress in studies involving new anatomical structures or emerging imaging techniques.
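
Illustrative sketch (not from the paper): the pseudo-label selection step described above can be pictured as ranking predicted masks by how well an autoencoder trained on plausible masks reconstructs them, then keeping the best-ranked ones. In the toy code below the autoencoder is a stand-in callable that snaps values to 0 or 1; the real method trains a convolutional AE on the labeled example. The function names, the mean-squared-error score, and the flattened toy masks are assumptions.

```python
def reconstruction_error(mask, reconstructed):
    """Mean squared error between a predicted mask and its AE reconstruction."""
    return sum((m - r) ** 2 for m, r in zip(mask, reconstructed)) / len(mask)


def select_pseudo_labels(candidates, autoencoder, top_k):
    """Return ids of the top_k masks with the lowest reconstruction error.

    candidates: list of (case_id, flattened_mask) pairs.
    autoencoder: callable mapping a mask to its reconstruction (stand-in here).
    """
    ranked = sorted(
        candidates,
        key=lambda item: reconstruction_error(item[1], autoencoder(item[1])),
    )
    return [case_id for case_id, _ in ranked[:top_k]]


def toy_autoencoder(mask):
    # Stand-in for a trained AE: pulls every voxel toward a clean binary value.
    return [float(round(v)) for v in mask]


predicted = [
    ("clean", [0.0, 1.0, 1.0, 0.0]),
    ("noisy", [0.5, 0.5, 0.5, 0.5]),
    ("mild", [0.1, 0.9, 1.0, 0.0]),
]
chosen = select_pseudo_labels(predicted, toy_autoencoder, top_k=2)
```

The selected cases would then be added to the training set for fine-tuning, which is the self-training loop the abstract outlines.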

[34] Enhanced Sparse Point Cloud Data Processing for Privacy-aware Human Action Recognition cs.CV | cs.AIPDF

Maimunatu Tunau, Vincent Gbouna Zakka, Zhuangzhuang Dai

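TL;DR: This paper comprehensively evaluates three main processing methods for sparse mmWave radar point cloud data (DBSCAN, the Hungarian Algorithm, and Kalman Filtering), proposes enhancements, and experimentally measures their effect on human action recognition and their computational cost.

Details

Motivation: Traditional vision-based human action recognition systems raise privacy concerns. mmWave radar is a privacy-preserving alternative, but its sparse and noisy point cloud data needs better processing.

Result: The study analyzes the accuracy and computational cost of the different methods, providing a useful reference for optimizing mmWave radar action recognition systems.

Insight: Combined methods outperform single methods, and the enhanced algorithms can noticeably improve recognition accuracy, but computational cost must be weighed against the gains.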

Abstract: Human Action Recognition (HAR) plays a crucial role in healthcare, fitness tracking, and ambient assisted living technologies. While traditional vision-based HAR systems are effective, they pose privacy concerns. mmWave radar sensors offer a privacy-preserving alternative but present challenges due to the sparse and noisy nature of their point cloud data. In the literature, three primary data processing methods, namely Density-Based Spatial Clustering of Applications with Noise (DBSCAN), the Hungarian Algorithm, and Kalman Filtering, have been widely used to improve the quality and continuity of radar data. However, a comprehensive evaluation of these methods, both individually and in combination, remains lacking. This paper addresses that gap by conducting a detailed performance analysis of the three methods using the MiliPoint dataset. We evaluate each method individually, all possible pairwise combinations, and the combination of all three, assessing both recognition accuracy and computational cost. Furthermore, we propose targeted enhancements to the individual methods aimed at improving accuracy. Our results provide crucial insights into the strengths and trade-offs of each method and their integrations, guiding future work on mmWave-based HAR systems.
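
Illustrative sketch (not from the paper): of the three methods evaluated, Kalman filtering is the one that smooths noisy measurements over time. The scalar filter below shows the predict/update cycle on a single coordinate with a constant-position model. The paper works on radar point clouds and proposes its own enhancements, neither of which is reproduced here; the noise variances and initial values are assumed.

```python
def kalman_1d(measurements, process_var=1e-3, meas_var=0.1, x0=0.0, p0=1.0):
    """Scalar Kalman filter with a constant-position model.

    Returns the filtered estimate after each measurement.
    process_var and meas_var are the assumed process and measurement noise variances.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the state is assumed constant, so only uncertainty grows.
        p = p + process_var
        # Update: blend prediction and measurement according to the Kalman gain.
        gain = p / (p + meas_var)
        x = x + gain * (z - x)
        p = (1.0 - gain) * p
        estimates.append(x)
    return estimates


# A point whose true coordinate is 1.0, observed 50 times without noise,
# starting from a deliberately wrong initial guess of 0.0.
track = kalman_1d([1.0] * 50)
```

In a tracking pipeline, clustering (for example DBSCAN) would first group radar points into objects and an assignment step (for example the Hungarian algorithm) would match objects across frames before a filter like this smooths each track.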

[35] STAMP: Multi-pattern Attention-aware Multiple Instance Learning for STAS Diagnosis in Multi-center Histopathology Images cs.CV | cs.CYPDF

Liangrui Pan, Xiaoyu Li, Guang Zhu, Guanting Li, Ruixin Wang

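TL;DR: The paper proposes STAMP, a multi-pattern attention-aware multiple instance learning framework for STAS diagnosis in multi-center histopathology images. Using a dual-branch architecture and Transformer encoding, the model dynamically selects regions related to STAS pathology, improving diagnostic accuracy.

Details

Motivation: STAS (spread through air spaces) is a novel invasive pattern in lung adenocarcinoma associated with tumor recurrence and reduced survival, but diagnosing it is labor-intensive and prone to misdiagnosis, so automated diagnosis with deep learning models is urgently needed.

Result: The model achieves AUCs of 0.8058, 0.8017, and 0.7928 on the STAS-SXY, STAS-TXY, and STAS-TCGA datasets, respectively, surpassing the clinical level.

Insight: 1. Multi-center data can strengthen model generalization; 2. Dynamic attention effectively suppresses noise and focuses on key pathological regions; 3. The dual-branch architecture avoids redundant features through a regularization constraint.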

Abstract: Spread through air spaces (STAS) constitutes a novel invasive pattern in lung adenocarcinoma (LUAD), associated with tumor recurrence and diminished survival rates. However, large-scale STAS diagnosis in LUAD remains a labor-intensive endeavor, compounded by the propensity for oversight and misdiagnosis due to its distinctive pathological characteristics and morphological features. Consequently, there is a pressing clinical imperative to leverage deep learning models for STAS diagnosis. This study initially assembled histopathological images from STAS patients at the Second Xiangya Hospital and the Third Xiangya Hospital of Central South University, alongside the TCGA-LUAD cohort. Three senior pathologists conducted cross-verification annotations to construct the STAS-SXY, STAS-TXY, and STAS-TCGA datasets. We then propose a multi-pattern attention-aware multiple instance learning framework, named STAMP, to analyze and diagnose the presence of STAS across multi-center histopathology images. Specifically, the dual-branch architecture guides the model to learn STAS-associated pathological features from distinct semantic spaces. Transformer-based instance encoding and a multi-pattern attention aggregation module dynamically select regions closely associated with STAS pathology, suppressing irrelevant noise and enhancing the discriminative power of global representations. Moreover, a similarity regularization constraint prevents feature redundancy across branches, thereby improving overall diagnostic accuracy. Extensive experiments demonstrated that STAMP achieved competitive diagnostic results on STAS-SXY, STAS-TXY and STAS-TCGA, with AUCs of 0.8058, 0.8017, and 0.7928, respectively, surpassing the clinical level.
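
Illustrative sketch (not from the paper): in attention-based multiple instance learning, a slide is a bag of patch features, and the bag representation is a weighted average whose weights come from a softmax over per-patch attention scores. The code below shows only that generic pooling step. It is not STAMP's multi-pattern aggregation module, and the scores, which a learned network would normally produce, are supplied by hand.

```python
import math


def attention_pool(instance_features, instance_scores):
    """Softmax-weighted average of instance features.

    instance_features: list of equal-length feature vectors, one per patch.
    instance_scores: one unnormalized attention score per patch.
    Returns (bag_feature, attention_weights).
    """
    peak = max(instance_scores)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(s - peak) for s in instance_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(instance_features[0])
    bag = [
        sum(w * feature[d] for w, feature in zip(weights, instance_features))
        for d in range(dim)
    ]
    return bag, weights


patches = [[1.0, 0.0], [0.0, 1.0]]
uniform_bag, uniform_weights = attention_pool(patches, [0.0, 0.0])
focused_bag, focused_weights = attention_pool(patches, [10.0, -10.0])
```

With equal scores the bag is the plain mean of the patches; with a dominant score the bag collapses onto the highlighted patch, which is how attention can suppress irrelevant regions.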

[36] TweezeEdit: Consistent and Efficient Image Editing with Path Regularization cs.CV

Jianda Mao, Kaibo Wang, Yang Xiang, Kani Chen

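TL;DR: TweezeEdit is a tuning- and inversion-free method that regularizes the entire denoising path to preserve source image semantics and shorten the editing path, markedly improving editing efficiency.

Details

Motivation: Existing diffusion-based image editing methods tend to over-align with the target prompt while neglecting source image semantics, and their reliance on inversion anchors makes them inefficient.

Result: Experiments show that TweezeEdit outperforms existing methods in semantic preservation and target alignment while requiring only 12 steps (1.6 seconds per edit).

Insight: Regularizing the whole path rather than relying on inversion anchors can markedly improve editing efficiency and semantic consistency, making the approach suitable for real-time applications.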

Abstract: Large-scale pre-trained diffusion models empower users to edit images through text guidance. However, existing methods often over-align with target prompts while inadequately preserving source image semantics. Such approaches generate target images explicitly or implicitly from the inversion noise of the source images, termed the inversion anchors. We identify this strategy as suboptimal for semantic preservation and inefficient due to elongated editing paths. We propose TweezeEdit, a tuning- and inversion-free framework for consistent and efficient image editing. Our method addresses these limitations by regularizing the entire denoising path rather than relying solely on the inversion anchors, ensuring source semantic retention and shortening editing paths. Guided by gradient-driven regularization, we efficiently inject target prompt semantics along a direct path using a consistency model. Extensive experiments demonstrate TweezeEdit’s superior performance in semantic preservation and target alignment, outperforming existing methods. Remarkably, it requires only 12 steps (1.6 seconds per edit), underscoring its potential for real-time applications.

[37] EgoMusic-driven Human Dance Motion Estimation with Skeleton Mamba cs.CV

Quang Nguyen, Nhat Le, Baoru Huang, Minh Nhat Vu, Chengcheng Tang

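TL;DR: The paper proposes Skeleton Mamba, a new method that draws on diffusion models and Mamba to predict human dance motion from egocentric video and music. The authors introduce a new dataset, EgoAIST++, and demonstrate the method's superiority experimentally.

Details

Motivation: Existing work mainly predicts human dance motion from either egocentric video or music alone; the joint task remains largely unexplored. The egocentric view often occludes body parts, and aligning motion with music is a further challenge.

Result: Experiments show that the method clearly outperforms state-of-the-art approaches and generalizes effectively to real-world data.

Insight: The study highlights the importance of combining visual and musical inputs for dance motion prediction and proposes a new sequence modeling approach.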

Abstract: Estimating human dance motion is a challenging task with various industrial applications. Recently, many efforts have focused on predicting human dance motion using either egocentric video or music as input. However, the task of jointly estimating human motion from both egocentric video and music remains largely unexplored. In this paper, we aim to develop a new method that predicts human dance motion from both egocentric video and music. In practice, the egocentric view often obscures much of the body, making accurate full-pose estimation challenging. Additionally, incorporating music requires the generated head and body movements to align well with both visual and musical inputs. We first introduce EgoAIST++, a new large-scale dataset that combines both egocentric views and music with more than 36 hours of dancing motion. Drawing on the success of diffusion models and Mamba on modeling sequences, we develop an EgoMusic Motion Network with a core Skeleton Mamba that explicitly captures the skeleton structure of the human body. We show that our approach is theoretically supported. Intensive experiments show that our method clearly outperforms state-of-the-art approaches and generalizes effectively to real-world data.

[38] Reasoning in Computer Vision: Taxonomy, Models, Tasks, and Methodologies cs.CV

Ayushman Sarkar, Mohd Yamani Idna Idris, Zhenyu Yu

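TL;DR: This paper is a comprehensive survey of visual reasoning in computer vision. It covers five major reasoning types (relational, symbolic, temporal, causal, and commonsense) and systematically analyzes their implementations, evaluation protocols, and open challenges.

Details

Motivation: Existing surveys tend to treat the different reasoning types in isolation and lack a unified analysis and comparison. This paper aims to fill that gap with a comprehensive taxonomy and analysis framework for visual reasoning.

Result: The paper summarizes the current state of visual reasoning research and exposes the limitations of evaluation protocols in generalizability, reproducibility, and explanatory power.

Insight: Future vision systems need to integrate perception and reasoning more tightly to build transparent, trustworthy, and cross-domain adaptive AI systems, particularly in critical domains such as autonomous driving and medical diagnostics.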

Abstract: Visual reasoning is critical for a wide range of computer vision tasks that go beyond surface-level object detection and classification. Despite notable advances in relational, symbolic, temporal, causal, and commonsense reasoning, existing surveys often address these directions in isolation, lacking a unified analysis and comparison across reasoning types, methodologies, and evaluation protocols. This survey aims to address this gap by categorizing visual reasoning into five major types (relational, symbolic, temporal, causal, and commonsense) and systematically examining their implementation through architectures such as graph-based models, memory networks, attention mechanisms, and neuro-symbolic systems. We review evaluation protocols designed to assess functional correctness, structural consistency, and causal validity, and critically analyze their limitations in terms of generalizability, reproducibility, and explanatory power. Beyond evaluation, we identify key open challenges in visual reasoning, including scalability to complex scenes, deeper integration of symbolic and neural paradigms, the lack of comprehensive benchmark datasets, and reasoning under weak supervision. Finally, we outline a forward-looking research agenda for next-generation vision systems, emphasizing that bridging perception and reasoning is essential for building transparent, trustworthy, and cross-domain adaptive AI systems, particularly in critical domains such as autonomous driving and medical diagnostics.

[39] Med-GLIP: Advancing Medical Language-Image Pre-training with Large-scale Grounded Dataset cs.CV | cs.AI

Ziye Deng, Ruihan He, Jiaxiang Liu, Yuan Wang, Zijie Meng

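TL;DR: The paper presents Med-GLIP-5M, a large-scale medical image grounding dataset, and builds on it Med-GLIP, a modality-aware medical language-image pre-training framework that addresses the challenge of aligning medical images with natural language. The framework performs strongly on multiple benchmarks and substantially improves downstream tasks such as medical VQA and report generation.

Details

Motivation: Existing medical image grounding research is constrained by limited modality coverage, coarse-grained annotations, and the lack of a unified, generalizable framework. The paper addresses these issues by constructing a large-scale dataset and proposing a general framework that supports fine-grained alignment between medical images and language.

Result: Med-GLIP consistently outperforms existing methods on multiple grounding benchmarks and yields substantial gains on downstream tasks such as medical VQA and report generation.

Insight: Combining a large-scale, fine-grained annotated dataset with a general pre-training framework offers a new solution for medical image-language alignment and shows potential in complex downstream tasks.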

Abstract: Medical image grounding aims to align natural language phrases with specific regions in medical images, serving as a foundational task for intelligent diagnosis, visual question answering (VQA), and automated report generation (MRG). However, existing research is constrained by limited modality coverage, coarse-grained annotations, and the absence of a unified, generalizable grounding framework. To address these challenges, we construct a large-scale medical grounding dataset Med-GLIP-5M comprising over 5.3 million region-level annotations across seven imaging modalities, covering diverse anatomical structures and pathological findings. The dataset supports both segmentation and grounding tasks with hierarchical region labels, ranging from organ-level boundaries to fine-grained lesions. Based on this foundation, we propose Med-GLIP, a modality-aware grounding framework trained on Med-GLIP-5M. Rather than relying on explicitly designed expert modules, Med-GLIP implicitly acquires hierarchical semantic understanding from diverse training data – enabling it to recognize multi-granularity structures, such as distinguishing lungs from pneumonia lesions. Extensive experiments demonstrate that Med-GLIP consistently outperforms state-of-the-art baselines across multiple grounding benchmarks. Furthermore, integrating its spatial outputs into downstream tasks, including medical VQA and report generation, leads to substantial performance gains. Our dataset will be released soon.

[40] GCRPNet: Graph-Enhanced Contextual and Regional Perception Network For Salient Object Detection in Optical Remote Sensing Images cs.CV

Mengyu Ren, Yutong Li, Hua Li, Runmin Cong, Sam Kwong

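TL;DR: GCRPNet is a network for salient object detection in optical remote sensing images. It combines graph-enhanced contextual and regional perception with the Mamba architecture and an adaptive scanning strategy to improve performance in complex scenes.

Details

Motivation: Salient object detection in optical remote sensing images faces large variations in target scale and low contrast between targets and background, and existing methods struggle to integrate global and local features effectively.

Result: Experiments show that GCRPNet achieves state-of-the-art performance on multiple datasets.

Insight: Combining graph enhancement with an adaptive scanning strategy can effectively address the difficulties of salient object detection in remote sensing images and improve robustness and accuracy.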

Abstract: Salient object detection (SOD) in optical remote sensing images (ORSIs) faces numerous challenges, including significant variations in target scales and low contrast between targets and the background. Existing methods based on vision transformer (ViT) and convolutional neural network (CNN) architectures aim to leverage both global and local features, but the difficulty in effectively integrating these heterogeneous features limits their overall performance. To overcome these limitations, we propose a graph-enhanced contextual and regional perception network (GCRPNet), which builds upon the Mamba architecture to simultaneously capture long-range dependencies and enhance regional feature representation. Specifically, we employ the visual state space (VSS) encoder to extract multi-scale features. To further achieve deep guidance and enhancement of these features, we first design a difference-similarity guided hierarchical graph attention module (DS-HGAM). This module strengthens cross-layer interaction capabilities between features of different scales while enhancing the model’s structural perception, allowing it to distinguish between foreground and background more effectively. Then, we design the LEVSS block as the decoder of GCRPNet. This module integrates our proposed adaptive scanning strategy and multi-granularity collaborative attention enhancement module (MCAEM). It performs adaptive patch scanning on feature maps processed via multi-scale convolutions, thereby capturing rich local region information and enhancing Mamba’s local modeling capability. Extensive experimental results demonstrate that the proposed model achieves state-of-the-art performance, validating its effectiveness and superiority.

[41] PSScreen: Partially Supervised Multiple Retinal Disease Screening cs.CV

Boyi Zheng, Qing Liu

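TL;DR: PSScreen is a novel partially supervised model for screening multiple retinal diseases. It uses a two-stream structure with uncertainty injection to learn deterministic and probabilistic features, applies textual guidance and feature distillation to improve domain generalization, and uses pseudo-label consistency and self-distillation to handle missing labels. Experiments show strong performance on six retinal diseases and the normal state.

Details

Motivation: Existing methods train multi-disease retinal screening models on fully annotated datasets, which are expensive to obtain. Partially labeled datasets reduce the annotation burden but introduce domain shifts across datasets and missing labels for some classes, both of which need to be addressed.

Result: On screening for six retinal diseases and the normal state, PSScreen significantly improves average performance and achieves state-of-the-art results on both in-domain and out-of-domain datasets.

Insight: Learning deterministic and probabilistic features jointly, together with textual guidance and distillation, can effectively address cross-domain generalization and missing labels in partially supervised learning.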

Abstract: Leveraging multiple partially labeled datasets to train a model for multiple retinal disease screening reduces the reliance on fully annotated datasets, but remains challenging due to significant domain shifts across training datasets from various medical sites and the missing-label issue for partial classes. To solve these challenges, we propose PSScreen, a novel Partially Supervised multiple retinal disease Screening model. PSScreen consists of two streams: one learns deterministic features and the other learns probabilistic features via uncertainty injection. Then, we leverage textual guidance to decouple the two types of features into disease-wise features and align them via feature distillation to boost the domain generalization ability. Meanwhile, we employ pseudo-label consistency between the two streams to address the missing-label issue and introduce self-distillation to transfer task-relevant semantics about known classes from the deterministic to the probabilistic stream to further enhance detection performance. Experiments show that PSScreen significantly enhances average detection performance on six retinal diseases and the normal state and achieves state-of-the-art results on both in-domain and out-of-domain datasets. Codes are available at https://github.com/boyiZheng99/PSScreen.

Marc J. Fischer, Jeffrey Potts, Gabriel Urreola, Dax Jones, Paolo Palmisciano

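TL;DR: This paper presents a new method for augmented reality (AR) surgical navigation that uses surface tracing and real-time tool tracking to improve the accuracy and user experience of catheter placement in neurosurgery.

Details

Motivation: AR promises to overcome the limitations of traditional surgical navigation systems, but currently available commercial AR displays still face challenges in surgical settings, notably in depth perception and occlusion handling. The study explores AR guidance for neurosurgical procedures.

Result: Real-time tool-tracking guidance improved all accuracy measures and was preferred by users in subjective evaluations.

Insight: Real-time tool tracking is key to improving the accuracy of AR surgical navigation and shows promise for high-precision settings such as neurosurgery.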

Abstract: Augmented Reality (AR) surgical navigation systems are emerging as the next generation of intraoperative surgical guidance, promising to overcome limitations of traditional navigation systems. However, known issues with AR depth perception due to vergence-accommodation conflict and occlusion handling limitations of the currently commercially available display technology present acute challenges in surgical settings where precision is paramount. This study presents a novel methodology for utilizing AR guidance to register anatomical targets and provide real-time instrument navigation using placement of simulated external ventricular drain catheters on a phantom model as the clinical scenario. The system registers target positions to the patient through a novel surface tracing method and uses real-time infrared tool tracking to aid in catheter placement, relying only on the onboard sensors of the Microsoft HoloLens 2. A group of intended users performed the procedure of simulated insertions under two AR guidance conditions: static in-situ visualization, where planned trajectories are overlaid directly onto the patient anatomy, and real-time tool-tracking guidance, where live feedback of the catheter’s pose is provided relative to the plan. Following the insertion tests, computed tomography scans of the phantom models were acquired, allowing for evaluation of insertion accuracy, target deviation, angular error, and depth precision. System Usability Scale surveys assessed user experience and cognitive workload. Tool-tracking guidance improved performance metrics across all accuracy measures and was preferred by users in subjective evaluations. A free copy of this paper and all supplemental materials are available at https://bit.ly/45l89Hq.
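The abstract names target deviation, angular error, and depth precision as accuracy measures but does not define them. The sketch below shows one plausible geometric reading, comparing a planned trajectory with a measured catheter pose. The function names, the definitions, and the coordinates (assumed to be in millimetres) are illustrative assumptions, not the study's evaluation protocol.

```python
import math
from typing import Sequence


def angular_error_deg(planned_dir: Sequence[float], actual_dir: Sequence[float]) -> float:
    """Angle in degrees between the planned and actual insertion directions."""
    dot = sum(p * a for p, a in zip(planned_dir, actual_dir))
    norms = math.hypot(*planned_dir) * math.hypot(*actual_dir)
    cosine = max(-1.0, min(1.0, dot / norms))  # clamp against rounding error
    return math.degrees(math.acos(cosine))


def target_deviation(planned_tip: Sequence[float], actual_tip: Sequence[float]) -> float:
    """Euclidean distance between the planned target and the actual tip."""
    return math.dist(planned_tip, actual_tip)


def depth_error(entry: Sequence[float], planned_tip: Sequence[float],
                actual_tip: Sequence[float]) -> float:
    """Signed overshoot (+) or undershoot (-) along the planned trajectory."""
    axis = [t - e for t, e in zip(planned_tip, entry)]
    length = math.hypot(*axis)
    offset = [a - t for a, t in zip(actual_tip, planned_tip)]
    return sum(o * ax for o, ax in zip(offset, axis)) / length


# Hypothetical plan: straight down the z axis for 10 mm.
entry, planned_tip = (0.0, 0.0, 0.0), (0.0, 0.0, 10.0)
actual_tip = (3.0, 4.0, 12.0)
deviation = target_deviation(planned_tip, actual_tip)
overshoot = depth_error(entry, planned_tip, actual_tip)
```

Separating the signed depth component from the total deviation makes it possible to tell an insertion that went too deep from one that drifted sideways.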

[43] Retrieval-Augmented Prompt for OOD Detection cs.CV | cs.AI

Ruisong Han, Zongbo Han, Jiahao Zhang, Mingyue Cheng, Changqing Zhang

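TL;DR: The paper proposes Retrieval-Augmented Prompt (RAP), a novel OOD detection method that augments the prompts of a pre-trained vision-language model by retrieving external knowledge, strengthening semantic supervision for OOD detection. Experiments show that RAP achieves state-of-the-art performance on multiple benchmarks.

Details

Motivation: Existing OOD detection methods rely on auxiliary outlier samples or ID data to generate outlier information, but the limited number of outliers and their mismatch with real test OOD samples leave semantic supervision insufficient and limit performance.

Result: In 1-shot OOD detection on ImageNet-1k, RAP reduces the average FPR95 by 7.05% and improves AUROC by 1.71% compared to previous methods.

Insight: Dynamically augmenting prompts with external knowledge is an effective way to improve OOD detection performance.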

Abstract: Out-of-Distribution (OOD) detection is crucial for the reliable deployment of machine learning models in-the-wild, enabling accurate identification of test samples that differ from the training data distribution. Existing methods rely on auxiliary outlier samples or in-distribution (ID) data to generate outlier information for training, but due to limited outliers and their mismatch with real test OOD samples, they often fail to provide sufficient semantic supervision, leading to suboptimal performance. To address this, we propose a novel OOD detection method called Retrieval-Augmented Prompt (RAP). RAP augments a pre-trained vision-language model’s prompts by retrieving external knowledge, offering enhanced semantic supervision for OOD detection. During training, RAP retrieves descriptive words for outliers based on joint similarity with external textual knowledge and uses them to augment the model’s OOD prompts. During testing, RAP dynamically updates OOD prompts in real-time based on the encountered OOD samples, enabling the model to rapidly adapt to the test environment. Our extensive experiments demonstrate that RAP achieves state-of-the-art performance on large-scale OOD detection benchmarks. For example, in 1-shot OOD detection on the ImageNet-1k dataset, RAP reduces the average FPR95 by 7.05% and improves the AUROC by 1.71% compared to previous methods. Additionally, comprehensive ablation studies validate the effectiveness of each module and the underlying motivations of our approach.
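The reported gains are stated in FPR95 and AUROC. As a reminder of what those numbers measure, here is a minimal sketch under the common convention that in-distribution (ID) samples are the positive class and a higher score means "more ID-like". The function names and the toy scores are assumptions for illustration and are unrelated to RAP's implementation.

```python
import math
from typing import Sequence


def fpr_at_tpr(id_scores: Sequence[float], ood_scores: Sequence[float],
               tpr: float = 0.95) -> float:
    """Fraction of OOD samples accepted at the threshold that keeps `tpr` of ID samples."""
    ranked = sorted(id_scores, reverse=True)
    keep = math.ceil(tpr * len(ranked))       # number of ID samples to accept
    threshold = ranked[keep - 1]              # lowest accepted ID score
    accepted_ood = sum(1 for s in ood_scores if s >= threshold)
    return accepted_ood / len(ood_scores)


def auroc(id_scores: Sequence[float], ood_scores: Sequence[float]) -> float:
    """Probability that a random ID sample outscores a random OOD sample (ties count half)."""
    wins = 0.0
    for i in id_scores:
        for o in ood_scores:
            if i > o:
                wins += 1.0
            elif i == o:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


# Toy confidence scores (hypothetical).
id_scores = [0.9, 0.8, 0.7, 0.6]
ood_scores = [0.1, 0.2, 0.65, 0.3]
fpr95 = fpr_at_tpr(id_scores, ood_scores)
area = auroc(id_scores, ood_scores)
```

Lower FPR95 and higher AUROC are better, which is why the abstract reports a reduction for the former and an improvement for the latter. The pairwise AUROC loop is quadratic and only meant for small examples.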

[44] PTQAT: A Hybrid Parameter-Efficient Quantization Algorithm for 3D Perception Tasks cs.CV | cs.AI

Xinhao Wang, Zhiwei Lin, Zhongyu Xia, Yongtao Wang

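TL;DR: The paper proposes PTQAT, a hybrid quantization algorithm that combines the strengths of PTQ and QAT for efficient deployment of 3D perception networks. It applies QAT fine-tuning to critical layers and PTQ to the remaining layers, improving quantization accuracy while saving resources.

Details

Motivation: PTQ often causes a marked performance drop after quantization, while QAT incurs large GPU memory and training-time costs because of weight fine-tuning. A quantization method that balances speed and accuracy is needed.

Result: On 3D perception tasks on nuScenes, PTQAT outperforms QAT-only baselines, with gains of 0.2%-0.9% NDS and 0.3%-1.0% mAP in object detection and 0.3%-2.0% mIoU in semantic segmentation and occupancy prediction, while fine-tuning fewer weights.

Insight: Compensating for quantization errors as they propagate is more effective than addressing them where they occur, and the critical-layer selection strategy may carry over to other quantization tasks.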

Abstract: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) represent two mainstream model quantization approaches. However, PTQ often leads to unacceptable performance degradation in quantized models, while QAT imposes substantial GPU memory requirements and extended training time due to weight fine-tuning. In this paper, we propose PTQAT, a novel general hybrid quantization algorithm for the efficient deployment of 3D perception networks. To address the speed-accuracy trade-off between PTQ and QAT, our method selects critical layers for QAT fine-tuning and performs PTQ on the remaining layers. Contrary to intuition, fine-tuning the layers with smaller output discrepancies before and after quantization, rather than those with larger discrepancies, actually leads to greater improvements in the model’s quantization accuracy. This means we better compensate for quantization errors during their propagation, rather than addressing them at the point where they occur. The proposed PTQAT achieves similar performance to QAT with more efficiency by freezing nearly 50% of quantifiable layers. Additionally, PTQAT is a universal quantization method that supports various quantization bit widths (4 bits) as well as different model architectures, including CNNs and Transformers. The experimental results on nuScenes across diverse 3D perception tasks, including object detection, semantic segmentation, and occupancy prediction, show that our method consistently outperforms QAT-only baselines. Notably, it achieves 0.2%-0.9% NDS and 0.3%-1.0% mAP gains in object detection, 0.3%-2.0% mIoU gains in semantic segmentation and occupancy prediction while fine-tuning fewer weights.
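The selection rule in the abstract (fine-tune the layers whose outputs change least under quantization, freeze roughly half) can be sketched as a ranking step. The code below assumes that per-layer outputs have already been recorded for the full-precision and quantized models on some calibration input, and uses mean squared error as the discrepancy. The function names, the MSE choice, and the toy values are assumptions, not the paper's actual procedure.

```python
from typing import Dict, List, Sequence, Tuple


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally long output vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def split_layers(fp_outputs: Dict[str, Sequence[float]],
                 q_outputs: Dict[str, Sequence[float]],
                 qat_fraction: float = 0.5) -> Tuple[List[str], List[str]]:
    """Return (layers to fine-tune with QAT, layers left frozen under PTQ).

    Layers with the SMALLEST output discrepancy are chosen for QAT,
    following the counter-intuitive rule stated in the abstract.
    """
    discrepancy = {name: mse(fp_outputs[name], q_outputs[name]) for name in fp_outputs}
    ranked = sorted(discrepancy, key=discrepancy.get)
    k = max(1, int(qat_fraction * len(ranked)))
    return ranked[:k], ranked[k:]


# Hypothetical recorded outputs for four layers.
fp = {"conv1": [1.0, 2.0], "conv2": [1.0, 2.0], "attn": [0.0, 0.0], "head": [1.0, 1.0]}
q = {"conv1": [1.5, 2.5], "conv2": [1.0, 2.0], "attn": [1.0, 1.0], "head": [1.0, 1.5]}
qat_layers, ptq_layers = split_layers(fp, q)
```

Here `conv2` and `head` show the smallest discrepancies and are selected for fine-tuning, while `conv1` and `attn` stay frozen. A conventional heuristic would pick the opposite half, which is the contrast the abstract draws.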

[45] HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis cs.CV

Shiyu Liu, Kui Jiang, Xianming Liu, Hongxun Yao, Xiaocheng Feng

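TL;DR: HM-Talker proposes hybrid motion modeling that combines implicit and explicit motion cues to generate high-fidelity, temporally coherent talking head videos. It addresses the motion blur and lip jitter of existing methods and improves cross-identity generalization.

Details

Motivation: Existing methods rely on implicit modeling of audio-facial motion correlations and lack explicit articulatory priors, which leads to motion blur and lip jitter in the generated videos.

Result: Experiments show that HM-Talker outperforms existing methods in visual quality and lip-sync accuracy.

Insight: Combining explicit articulatory priors with implicit features markedly improves the quality of generated talking head videos, and identity-agnostic learning strengthens cross-identity generalization.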

Abstract: Audio-driven talking head video generation enhances user engagement in human-computer interaction. However, current methods frequently produce videos with motion blur and lip jitter, primarily due to their reliance on implicit modeling of audio-facial motion correlations–an approach lacking explicit articulatory priors (i.e., anatomical guidance for speech-related facial movements). To overcome this limitation, we propose HM-Talker, a novel framework for generating high-fidelity, temporally coherent talking heads. HM-Talker leverages a hybrid motion representation combining both implicit and explicit motion cues. Explicit cues use Action Units (AUs), anatomically defined facial muscle movements, alongside implicit features to minimize phoneme-viseme misalignment. Specifically, our Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit/explicit motion features while predicting AUs directly from audio input aligned to visual cues. To mitigate identity-dependent biases in explicit features and enhance cross-subject generalization, we introduce the Hybrid Motion Modeling Module (HMMM). This module dynamically merges randomly paired implicit/explicit features, enforcing identity-agnostic learning. Together, these components enable robust lip synchronization across diverse identities, advancing personalized talking head synthesis. Extensive experiments demonstrate HM-Talker’s superiority over state-of-the-art methods in visual quality and lip-sync accuracy.

[46] SpaRC-AD: A Baseline for Radar-Camera Fusion in End-to-End Autonomous Driving cs.CV | cs.RO

Philipp Wolters, Johannes Gilg, Torben Teepe, Gerhard Rigoll

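TL;DR: SpaRC-AD is an end-to-end autonomous driving framework based on radar-camera fusion. Through sparse 3D feature alignment and Doppler-based velocity estimation, it substantially improves 3D detection, multi-object tracking, online mapping, and other tasks.

Details

Motivation: Vision-based autonomous driving systems are limited in adverse weather, under partial occlusion, and in precise velocity estimation. SpaRC-AD aims to address these issues through radar-camera fusion.

Result: The method clearly outperforms vision-only baselines on multiple tasks (such as 3D detection, multi-object tracking, and motion prediction) and performs strongly on multiple benchmarks.

Insight: Radar-camera fusion matters most in safety-critical scenarios, where it improves motion understanding and long-horizon trajectory prediction and thereby helps avoid collisions.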

Abstract: End-to-end autonomous driving systems promise stronger performance through unified optimization of perception, motion forecasting, and planning. However, vision-based approaches face fundamental limitations in adverse weather conditions, partial occlusions, and precise velocity estimation - critical challenges in safety-sensitive scenarios where accurate motion understanding and long-horizon trajectory prediction are essential for collision avoidance. To address these limitations, we propose SpaRC-AD, a query-based end-to-end camera-radar fusion framework for planning-oriented autonomous driving. Through sparse 3D feature alignment and Doppler-based velocity estimation, we achieve strong 3D scene representations for refinement of agent anchors, map polylines and motion modelling. Our method achieves strong improvements over the state-of-the-art vision-only baselines across multiple autonomous driving tasks, including 3D detection (+4.8% mAP), multi-object tracking (+8.3% AMOTA), online mapping (+1.8% mAP), motion prediction (-4.0% mADE), and trajectory planning (-0.1m L2 and -9% TPC). We achieve both spatial coherence and temporal consistency on multiple challenging benchmarks, including real-world open-loop nuScenes, long-horizon T-nuScenes, and closed-loop simulator Bench2Drive. We show the effectiveness of radar-based fusion in safety-critical scenarios where accurate motion understanding and long-horizon trajectory prediction are essential for collision avoidance. The source code of all experiments is available at https://phi-wol.github.io/sparcad/

[47] Adapting SAM via Cross-Entropy Masking for Class Imbalance in Remote Sensing Change Detection cs.CV

Humza Naveed, Xina Zeng, Mitch Bryson, Nagita Mehrseresht

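TL;DR: The paper addresses class imbalance in remote sensing change detection with a cross-entropy masking (CEM) loss and a fine-tuned SAM encoder, combined with spatial-temporal feature enhancement and multi-scale decoder fusion, and achieves SOTA performance on multiple datasets.

Details

Motivation: Remote sensing change detection (RSCD) suffers from severe class imbalance, and existing foundation models such as SAM, although strong at general segmentation, need further adaptation to task-specific requirements.

Result: The method surpasses existing SOTA methods on the Levir-CD, WHU-CD, CLCD, and S2Looking datasets, with a 2.5% F1-score improvement on S2Looking.

Insight: Combining foundation model fine-tuning with a custom loss function can markedly improve task-specific performance, such as change detection, especially under class imbalance.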

Abstract: Foundational models have achieved significant success in diverse domains of computer vision. They learn general representations that are easily transferable to tasks not seen during training. One such foundational model is the Segment Anything Model (SAM), which can accurately segment objects in images. We propose adapting the SAM encoder via fine-tuning for remote sensing change detection (RSCD) along with spatial-temporal feature enhancement (STFE) and multi-scale decoder fusion (MSDF) to detect changes robustly at multiple scales. Additionally, we propose a novel cross-entropy masking (CEM) loss to handle high class imbalance in change detection datasets. Our method outperforms state-of-the-art (SOTA) methods on four change detection datasets, Levir-CD, WHU-CD, CLCD, and S2Looking. We achieved a 2.5% F1-score improvement on the large, complex S2Looking dataset. The code is available at: https://github.com/humza909/SAM-CEM-CD

[48] Towards Agentic AI for Multimodal-Guided Video Object Segmentation cs.CV

Tuyen Tran, Thao Minh Le, Truyen Tran

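TL;DR: The paper proposes Multi-Modal Agent, a novel approach to multimodal-guided video object segmentation that uses large language models (LLMs) to generate dynamic workflows, improving flexibility and adaptability and clearly outperforming prior methods.

Details

Motivation: Traditional approaches train specialized models, which incurs high computational complexity and manual annotation effort. Existing training-free methods rely on fixed pipelines that lack the flexibility to adapt to the dynamic nature of the task.

Result: The method shows clear improvements over prior methods on the RVOS and Ref-AVS tasks.

Insight: Combining the reasoning capabilities of LLMs with multimodal tools allows complex multimodal tasks to be handled more flexibly and suggests a direction for future AI agent systems.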

Abstract: Referring-based Video Object Segmentation is a multimodal problem that requires producing fine-grained segmentation results guided by external cues. Traditional approaches to this task typically involve training specialized models, which come with high computational complexity and manual annotation effort. Recent advances in vision-language foundation models open a promising direction toward training-free approaches. Several studies have explored leveraging these general-purpose models for fine-grained segmentation, achieving performance comparable to that of fully supervised, task-specific models. However, existing methods rely on fixed pipelines that lack the flexibility needed to adapt to the dynamic nature of the task. To address this limitation, we propose Multi-Modal Agent, a novel agentic system designed to solve this task in a more flexible and adaptive manner. Specifically, our method leverages the reasoning capabilities of large language models (LLMs) to generate dynamic workflows tailored to each input. This adaptive procedure iteratively interacts with a set of specialized tools designed for low-level tasks across different modalities to identify the target object described by the multimodal cues. Our agentic approach demonstrates clear improvements over prior methods on two multimodal-conditioned VOS tasks: RVOS and Ref-AVS.

[49] HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs cs.CV

Zheng Qin, Ruobing Zheng, Yabing Wang, Tianqi Li, Yi Yuan

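TL;DR: HumanSense is a multimodal benchmark for evaluating the perception and interaction capabilities of MLLMs in human-centered scenarios. It finds that leading models still have room for improvement on advanced interaction tasks and improves reasoning ability through multimodal reinforcement learning.

Details

Motivation: MLLMs lack fine-grained evaluation frameworks for human-centered scenarios, particularly for understanding complex intentions and generating empathetic feedback. A comprehensive benchmark is needed to drive progress.

Result: Experiments show that supplementing visual input with audio and text substantially improves performance, that the reasoning-enhanced Omni model achieves substantial gains on the benchmark, and that successful reasoning processes exhibit highly consistent thought patterns.

Insight: Empathetic feedback should stem from a contextual analysis of the interlocutor's needs and emotions, with reasoning ability as the key. Prompt design can improve non-reasoning models cheaply, without training.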

Abstract: While Multimodal Large Language Models (MLLMs) show immense promise for achieving truly human-like interactions, progress is hindered by the lack of fine-grained evaluation frameworks for human-centered scenarios, encompassing both the understanding of complex human intentions and the provision of empathetic, context-aware responses. Here we introduce HumanSense, a comprehensive benchmark designed to evaluate the human-centered perception and interaction capabilities of MLLMs, with a particular focus on deep understanding of extended multimodal contexts and the formulation of rational feedback. Our evaluation reveals that leading MLLMs still have considerable room for improvement, particularly for advanced interaction-oriented tasks. Supplementing visual input with audio and text information yields substantial improvements, and Omni-modal models show advantages on these tasks. Furthermore, we argue that appropriate feedback stems from a contextual analysis of the interlocutor’s needs and emotions, with reasoning ability serving as the key to unlocking it. Accordingly, we employ a multi-stage, modality-progressive reinforcement learning to enhance the reasoning abilities of an Omni model, achieving substantial gains on evaluation results. Additionally, we observe that successful reasoning processes exhibit highly consistent thought patterns. By designing corresponding prompts, we also enhance the performance of non-reasoning models in a training-free manner. Project page: https://digital-avatar.github.io/ai/HumanSense/

[50] EvTurb: Event Camera Guided Turbulence Removal cs.CV

Yixing Liu, Minggui Teng, Yifei Xia, Peiqi Duan, Boxin Shi

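TL;DR: EvTurb is a framework that uses event camera data to remove turbulence effects. It decouples blur and tilt through event streams and outperforms existing methods.

Details

Motivation: Atmospheric turbulence degrades image quality through blur and geometric tilt distortions, and existing methods struggle with the highly ill-posed nature of the problem.

Result: Experiments show that EvTurb outperforms existing methods while remaining computationally efficient.

Insight: Event data can effectively decouple turbulence effects, offering a new direction for turbulence removal.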

Abstract: Atmospheric turbulence degrades image quality by introducing blur and geometric tilt distortions, posing significant challenges to downstream computer vision tasks. Existing single-image and multi-frame methods struggle with the highly ill-posed nature of this problem due to the compositional complexity of turbulence-induced distortions. To address this, we propose EvTurb, an event guided turbulence removal framework that leverages high-speed event streams to decouple blur and tilt effects. EvTurb decouples blur and tilt effects by modeling event-based turbulence formation, specifically through a novel two-step event-guided network: event integrals are first employed to reduce blur in the coarse outputs. This is followed by employing a variance map, derived from raw event streams, to eliminate the tilt distortion for the refined outputs. Additionally, we present TurbEvent, the first real-captured dataset featuring diverse turbulence scenarios. Experimental results demonstrate that EvTurb surpasses state-of-the-art methods while maintaining computational efficiency.

[51] Fourier-Guided Attention Upsampling for Image Super-Resolution cs.CV | cs.AI

Daejune Choi, Youchan No, Jinhyung Lee, Duksu Kim

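TL;DR: The paper proposes Frequency-Guided Attention (FGA), a lightweight upsampling module for single image super-resolution. By combining Fourier features, a cross-resolution attention layer, and a frequency-domain L1 loss, FGA improves high-frequency detail reconstruction and reduces artifacts.

Details

Motivation: Conventional upsamplers such as Sub-Pixel Convolution are efficient but reconstruct high-frequency details poorly and tend to introduce aliasing artifacts. FGA aims to address these issues.

Result: Experiments show average PSNR gains of 0.12~0.14 dB across five super-resolution backbones and frequency-domain consistency improvements of up to 29%.

Insight: FGA demonstrates the importance of frequency-domain information for super-resolution and suggests a direction for designing lightweight upsampling methods.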

Abstract: We propose Frequency-Guided Attention (FGA), a lightweight upsampling module for single image super-resolution. Conventional upsamplers, such as Sub-Pixel Convolution, are efficient but frequently fail to reconstruct high-frequency details and introduce aliasing artifacts. FGA addresses these issues by integrating (1) a Fourier feature-based Multi-Layer Perceptron (MLP) for positional frequency encoding, (2) a cross-resolution Correlation Attention Layer for adaptive spatial alignment, and (3) a frequency-domain L1 loss for spectral fidelity supervision. Adding merely 0.3M parameters, FGA consistently enhances performance across five diverse super-resolution backbones in both lightweight and full-capacity scenarios. Experimental results demonstrate average PSNR gains of 0.12~0.14 dB and improved frequency-domain consistency by up to 29%, particularly evident on texture-rich datasets. Visual and spectral evaluations confirm FGA’s effectiveness in reducing aliasing and preserving fine details, establishing it as a practical, scalable alternative to traditional upsampling methods.
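Of the three components listed, the frequency-domain L1 loss is the easiest to illustrate in isolation. The sketch below computes the mean absolute difference between the 2D discrete Fourier transforms of a predicted and a target image, using a naive DFT so that it runs with the standard library only. Whether FGA compares complex spectra or amplitudes, and how it normalizes, is not stated in the abstract, so those choices and all names here are assumptions.

```python
import cmath
from typing import List, Sequence

Image = Sequence[Sequence[float]]


def dft2(img: Image) -> List[List[complex]]:
    """Naive 2D discrete Fourier transform (O(N^4); for tiny examples only)."""
    h, w = len(img), len(img[0])
    spectrum = []
    for u in range(h):
        row = []
        for v in range(w):
            acc = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2.0 * cmath.pi * (u * y / h + v * x / w)
                    acc += img[y][x] * cmath.exp(1j * angle)
            row.append(acc)
        spectrum.append(row)
    return spectrum


def frequency_l1_loss(pred: Image, target: Image) -> float:
    """Mean magnitude of the complex spectral difference (assumed formulation)."""
    fp, ft = dft2(pred), dft2(target)
    h, w = len(pred), len(pred[0])
    total = sum(abs(fp[u][v] - ft[u][v]) for u in range(h) for v in range(w))
    return total / (h * w)


# A single bright pixel has a flat spectrum, so every frequency bin contributes.
impulse = [[1.0, 0.0], [0.0, 0.0]]
blank = [[0.0, 0.0], [0.0, 0.0]]
loss = frequency_l1_loss(impulse, blank)
```

A practical training loop would use a fast FFT from a tensor library instead of this quadruple loop. The point of the example is only that errors concentrated in fine detail show up directly in the high-frequency bins, which a pixel-space loss weights no differently from smooth regions.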

[52] FIND-Net – Fourier-Integrated Network with Dictionary Kernels for Metal Artifact Reduction cs.CV | eess.IV

Farid Tasharofi, Fuxin Fan, Melika Qahqaie, Mareike Thies, Andreas Maier

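TL;DR: FIND-Net is a novel metal artifact reduction (MAR) framework that combines frequency-domain and spatial-domain processing to improve artifact suppression and structural preservation.

Details

Motivation: High-density metallic implants cause artifacts in CT imaging that severely degrade image quality, and existing deep learning methods struggle to suppress artifacts while preserving structural details.

Result: On synthetic datasets, FIND-Net achieves statistically significant improvements over state-of-the-art MAR methods, with a 3.07% MAE reduction, a 0.18% SSIM increase, and a 0.90% PSNR improvement.

Insight: Combining the frequency and spatial domains suppresses artifacts and preserves anatomical structures more effectively, improving clinical applicability.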

Abstract: Metal artifacts, caused by high-density metallic implants in computed tomography (CT) imaging, severely degrade image quality, complicating diagnosis and treatment planning. While existing deep learning algorithms have achieved notable success in Metal Artifact Reduction (MAR), they often struggle to suppress artifacts while preserving structural details. To address this challenge, we propose FIND-Net (Fourier-Integrated Network with Dictionary Kernels), a novel MAR framework that integrates frequency and spatial domain processing to achieve superior artifact suppression and structural preservation. FIND-Net incorporates Fast Fourier Convolution (FFC) layers and trainable Gaussian filtering, treating MAR as a hybrid task operating in both spatial and frequency domains. This approach enhances global contextual understanding and frequency selectivity, effectively reducing artifacts while maintaining anatomical structures. Experiments on synthetic datasets show that FIND-Net achieves statistically significant improvements over state-of-the-art MAR methods, with a 3.07% MAE reduction, 0.18% SSIM increase, and 0.90% PSNR improvement, confirming robustness across varying artifact complexities. Furthermore, evaluations on real-world clinical CT scans confirm FIND-Net’s ability to minimize modifications to clean anatomical regions while effectively suppressing metal-induced distortions. These findings highlight FIND-Net’s potential for advancing MAR performance, offering superior structural preservation and improved clinical applicability. Code is available at https://github.com/Farid-Tasharofi/FIND-Net

[53] Increasing the Utility of Synthetic Images through Chamfer Guidance cs.CV

Nicola Dall’Asen, Xiaofeng Zhang, Reyhane Askari Hemmat, Melissa Hall, Jakob Verbeek

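TL;DR: The paper proposes Chamfer Guidance, a training-free method that uses a handful of real exemplar images to improve the quality and diversity of synthetic images, addressing the distribution shift between synthetic and real data.

Details

Motivation: Recent gains in the generation quality of conditional generative models have come at the expense of diversity, which limits the utility of synthetic data for training. Existing guidance approaches often disregard the distribution shift between synthetic and real data.

Result: On ImageNet-1k and geo-diversity benchmarks, diversity improves while generation quality is maintained or improved. With only 2 real exemplar images, the method reaches 96.4% precision and 86.4% distributional coverage.

Insight: A few real exemplars can effectively guide synthetic data generation and mitigate distribution shift. Because the unconditional model is not needed, sampling uses 31% fewer FLOPs than classifier-free-guidance-based approaches.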

Abstract: Conditional image generative models hold considerable promise to produce infinite amounts of synthetic training data. Yet, recent progress in generation quality has come at the expense of generation diversity, limiting the utility of these models as a source of synthetic training data. Although guidance-based approaches have been introduced to improve the utility of generated data by focusing on quality or diversity, the (implicit or explicit) utility functions oftentimes disregard the potential distribution shift between synthetic and real data. In this work, we introduce Chamfer Guidance: a training-free guidance approach which leverages a handful of real exemplar images to characterize the quality and diversity of synthetic data. We show that by leveraging the proposed Chamfer Guidance, we can boost the diversity of the generations w.r.t. a dataset of real images while maintaining or improving the generation quality on ImageNet-1k and standard geo-diversity benchmarks. Our approach achieves state-of-the-art few-shot performance with as little as 2 exemplar real images, obtaining 96.4% in terms of precision, and 86.4% in terms of distributional coverage, which increase to 97.5% and 92.7%, respectively, when using 32 real images. We showcase the benefits of the Chamfer Guidance generation by training downstream image classifiers on synthetic data, achieving accuracy boost of up to 15% for in-distribution over the baselines, and up to 16% in out-of-distribution. Furthermore, our approach does not require using the unconditional model, and thus obtains a 31% reduction in FLOPs w.r.t. classifier-free-guidance-based approaches at sampling time.
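The abstract does not spell out the guidance objective, so the following is only a sketch of the generic symmetric Chamfer distance between two sets of feature vectors, written with NumPy. The function name `chamfer_distance` and the toy 2D features are assumptions for illustration. The reading of the two terms as quality-like and coverage-like is an interpretation, not a statement of the paper's formulation.

```python
import numpy as np

def chamfer_distance(synthetic: np.ndarray, real: np.ndarray) -> float:
    """Symmetric Chamfer distance between two point sets of shape (n, d) and (m, d)."""
    # Pairwise squared Euclidean distances, shape (n_synthetic, n_real).
    diff = synthetic[:, None, :] - real[None, :, :]
    sq_dist = np.sum(diff**2, axis=-1)
    # Each synthetic sample to its nearest real exemplar (quality-like term).
    syn_to_real = sq_dist.min(axis=1).mean()
    # Each real exemplar to its nearest synthetic sample (coverage-like term).
    real_to_syn = sq_dist.min(axis=0).mean()
    return float(syn_to_real + real_to_syn)

# Toy example: two real exemplars and two candidate synthetic sets.
real = np.array([[0.0, 0.0], [1.0, 0.0]])
collapsed = np.array([[0.0, 0.0], [0.0, 0.0]])   # every sample sits on one exemplar
spread = np.array([[0.0, 0.0], [1.0, 0.0]])      # samples cover both exemplars

d_collapsed = chamfer_distance(collapsed, real)
d_spread = chamfer_distance(spread, real)
```

In this toy case the collapsed set scores worse than the spread set, because the second exemplar has no nearby synthetic sample. That shows how a set-to-set distance can penalize low diversity even when every individual sample looks realistic.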

[54] ChatENV: An Interactive Vision-Language Model for Sensor-Guided Environmental Monitoring and Scenario Simulation cs.CVPDF

Hosam Elgendy, Ahmed Sharshar, Ahmed Aboeitta, Mohsen Guizani

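TL;DR: ChatENV is an interactive vision-language model that reasons jointly over satellite image pairs and environmental sensor data for environmental monitoring and scenario simulation, rivaling or outperforming existing methods.

Details

Motivation: Existing vision-language models overlook causal signals from environmental sensors, rely on single-source captions, and lack interactive scenario-based reasoning, so they fall short of what environmental monitoring requires.

Result: ChatENV performs strongly on temporal and "what-if" reasoning tasks (e.g., a BERT-F1 of 0.903) and supports interactive scenario analysis.

Insight: Fusing multimodal data, sensor data in particular, can improve the performance and practical utility of environmental monitoring models.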

Abstract: Understanding environmental changes from aerial imagery is vital for climate resilience, urban planning, and ecosystem monitoring. Yet, current vision language models (VLMs) overlook causal signals from environmental sensors, rely on single-source captions prone to stylistic bias, and lack interactive scenario-based reasoning. We present ChatENV, the first interactive VLM that jointly reasons over satellite image pairs and real-world sensor data. Our framework: (i) creates a 177k-image dataset forming 152k temporal pairs across 62 land-use classes in 197 countries with rich sensor metadata (e.g., temperature, PM10, CO); (ii) annotates data using GPT-4o and Gemini 2.0 for stylistic and semantic diversity; and (iii) fine-tunes Qwen-2.5-VL using efficient Low-Rank Adaptation (LoRA) adapters for chat purposes. ChatENV achieves strong performance in temporal and “what-if” reasoning (e.g., BERT-F1 0.903) and rivals or outperforms state-of-the-art temporal models, while supporting interactive scenario-based analysis. This positions ChatENV as a powerful tool for grounded, sensor-aware environmental monitoring.
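For readers unfamiliar with the LoRA adapters mentioned in the abstract, the sketch below shows the general low-rank update idea on a single linear layer in NumPy: the frozen weight W is augmented with a trainable product scaled by alpha / rank. The class name `LoRALinear`, the layer sizes, and the hyperparameters are assumptions for illustration. They are not ChatENV's configuration, and no training loop is shown.

```python
import numpy as np

class LoRALinear:
    """Linear layer y = x W^T with a low-rank update (alpha / rank) * B A added to W."""

    def __init__(self, weight: np.ndarray, rank: int, alpha: float, seed: int = 0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                   # frozen pretrained weight
        self.A = rng.standard_normal((rank, in_dim)) * 0.01    # trainable, small random init
        self.B = np.zeros((out_dim, rank))                     # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def trainable_params(self) -> int:
        return self.A.size + self.B.size

# Usage with illustrative sizes (not taken from the paper).
rng = np.random.default_rng(1)
W = rng.standard_normal((64, 64))
layer = LoRALinear(W, rank=4, alpha=8.0)
x = rng.standard_normal((2, 64))
y = layer(x)
```

With B initialized to zero, the adapted layer starts out identical to the frozen layer. In this example only 512 adapter parameters would be trained, against 4,096 in the full weight matrix.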

[55] Processing and acquisition traces in visual encoders: What does CLIP know about your camera? cs.CVPDF

Ryan Ramos, Vladan Stojnić, Giorgos Kordopatis-Zilos, Yuta Nakashima, Giorgos Tolias

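TL;DR: The paper analyzes how sensitive visual encoders such as CLIP are to image acquisition and processing parameters, finding that these parameters are systematically encoded in the learned representations and can strongly affect semantic predictions.

Details

Motivation: Prior work mostly studies the robustness of visual encoders to severe image transformations and corruptions. This paper instead examines subtle, even imperceptible, acquisition and processing parameters and how they affect encoder behavior.

Result: These parameters can be easily recovered from the representations, and their presence has a marked effect on semantic predictions, positive or negative, depending on how strongly they correlate or anti-correlate with the semantic labels.

Insight: Visual encoders are highly sensitive to subtle image processing changes. This can affect their behavior in real-world settings and should be accounted for in practical applications.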

Abstract: Prior work has analyzed the robustness of visual encoders to image transformations and corruptions, particularly in cases where such alterations are not seen during training. When this occurs, they introduce a form of distribution shift at test time, often leading to performance degradation. The primary focus has been on severe corruptions that, when applied aggressively, distort useful signals necessary for accurate semantic predictions. We take a different perspective by analyzing parameters of the image acquisition process and transformations that may be subtle or even imperceptible to the human eye. We find that such parameters are systematically encoded in the learned visual representations and can be easily recovered. More strikingly, their presence can have a profound impact, either positively or negatively, on semantic predictions. This effect depends on whether there is a strong correlation or anti-correlation between semantic labels and these acquisition-based or processing-based labels. Our code and data are available at: https://github.com/ryan-caesar-ramos/visual-encoder-traces

[56] Lameness detection in dairy cows using pose estimation and bidirectional LSTMs cs.CVPDF

Helena Russello, Rik van der Tol, Eldert J. van Henten, Gert Kootstra

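TL;DR: A lameness detection method for dairy cows that combines pose estimation with a bidirectional LSTM. It outperforms an approach based on manually designed features and can detect lameness from as little as one second of video.

Details

Motivation: Conventional lameness detection in dairy cows relies on manually designed features, which is inefficient and poorly suited to short sequences.

Result: Classification accuracy reaches 85%, against 80% for the feature-based approach, using as little as one second of video.

Insight: The BLSTM captures the temporal information in motion sequences, and pose estimation avoids the limitations of manual feature engineering.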

Abstract: This study presents a lameness detection approach that combines pose estimation and Bidirectional Long-Short-Term Memory (BLSTM) neural networks. Combining pose estimation with a BLSTM classifier offers the following advantages: markerless pose estimation, elimination of manual feature engineering by learning temporal motion features from the keypoint trajectories, and the ability to work with short sequences and small training datasets. Motion sequences of nine keypoints (located on the cows’ hooves, head and back) were extracted from videos of walking cows with the T-LEAP pose estimation model. The trajectories of the keypoints were then used as an input to a BLSTM classifier that was trained to perform binary lameness classification. Our method significantly outperformed an established method that relied on manually-designed locomotion features: our best architecture achieved a classification accuracy of 85%, against 80% accuracy for the feature-based approach. Furthermore, we showed that our BLSTM classifier could detect lameness with as little as one second of video data.

[57] SemPT: Semantic Prompt Tuning for Vision-Language Models cs.CVPDF

Xiao Shi, Yangjun Ou, Zhenzhong Chen

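TL;DR: SemPT is a novel semantic prompt tuning framework that leverages shared attribute-level knowledge across categories to improve the generalization of vision-language models to novel categories, achieving state-of-the-art performance on multiple benchmark datasets.

Details

Motivation: Visual transfer learning for unseen categories faces a conflict between preserving category-specific representations and acquiring transferable knowledge. Existing prompt tuning methods rely on sparse category labels or LLM-generated descriptions, which fragments knowledge representation and hinders transferability.

Result: Extensive experiments on 15 benchmark datasets show that SemPT achieves state-of-the-art performance in base-to-novel generalization, cross-dataset transfer, cross-domain transfer, and few-shot learning.

Insight: Attribute-level knowledge sharing and multimodal alignment address the generalization of vision-language models to unseen categories, and the dynamic inference mechanism further improves adaptation.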

Abstract: Visual transfer learning for unseen categories presents an active research topic yet a challenging task, due to the inherent conflict between preserving category-specific representations and acquiring transferable knowledge. Vision-Language Models (VLMs) pre-trained on large amounts of image-text pairs offer a promising solution. However, existing prompt tuning methods rely on sparse category labels or disparate LLM-generated descriptions, which fragment knowledge representation and hinder transferability. To address this limitation, we introduce Semantic Prompt Tuning (SemPT), a novel framework that tackles the generalization challenge by leveraging shared attribute-level knowledge across categories. Specifically, SemPT adopts a two-step prompting strategy to guide LLM in extracting shared visual attributes and generating attribute-level descriptions, capturing transferable semantic cues beyond labels while ensuring coherent structure. Then, visually guided weighting is applied to the embeddings of attribute-level descriptions to reduce noise from irrelevant attributes and enhance the text embeddings. Additionally, image embeddings are jointly aligned with both label and attribute-enhanced text embeddings, balancing discrimination for seen categories and transferability to unseen ones. Considering the availability of category exposure, our inference dynamically selects between standard label embeddings for seen categories and attribute-enhanced embeddings for unseen ones to ensure effective adaptation. Extensive experiments on 15 benchmark datasets demonstrate that SemPT achieves state-of-the-art performance across various settings, including base-to-novel generalization, cross-dataset transfer, cross-domain transfer, and few-shot learning.

Zhangyong Tang, Tianyang Xu, Xuefeng Zhu, Chunyang Cheng, Tao Zhou

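TL;DR: The paper introduces UniBench300, a unified benchmark for multi-modal visual object tracking (MMVOT), and reformulates unification as a serial learning process to address the performance degradation seen in multi-task training. It also explores the role of continual learning (CL) in the unification process.

Details

Motivation: Existing methods train MMVOT tasks in parallel, but without a unified benchmark they must be evaluated on separate benchmarks. The resulting inconsistency between training and testing leads to performance degradation. The authors aim to resolve this with a unified benchmark and serial learning.

Result: Experiments on four benchmarks show that UniBench300 substantially reduces performance degradation and that continual learning supports a stable unification process. In addition, performance degradation is negatively correlated with network capacity.

Insight: Modality discrepancies lead to different degrees of degradation across tasks (RGBT > RGBD > RGBE), which offers useful guidance for future multi-modal vision research.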

Abstract: Unifying multiple multi-modal visual object tracking (MMVOT) tasks draws increasing attention due to the complementary nature of different modalities in building robust tracking systems. Existing practices mix all data sensor types in a single training procedure, structuring a parallel paradigm from the data-centric perspective and aiming for a global optimum on the joint distribution of the involved tasks. However, the absence of a unified benchmark where all types of data coexist forces evaluations on separated benchmarks, causing inconsistency between training and testing, thus leading to performance degradation. To address these issues, this work advances in two aspects: (1) A unified benchmark, coined as UniBench300, is introduced to bridge the inconsistency by incorporating multiple task data, reducing inference passes from three to one and cutting time consumption by 27%. (2) The unification process is reformulated in a serial format, progressively integrating new tasks. In this way, the performance degradation can be specified as knowledge forgetting of previous tasks, which naturally aligns with the philosophy of continual learning (CL), motivating further exploration of injecting CL into the unification process. Extensive experiments conducted on two baselines and four benchmarks demonstrate the significance of UniBench300 and the superiority of CL in supporting a stable unification process. Moreover, while conducting dedicated analyses, the performance degradation is found to be negatively correlated with network capacity. Additionally, modality discrepancies contribute to varying degradation levels across tasks (RGBT > RGBD > RGBE in MMVOT), offering valuable insights for future multi-modal vision research. Source code and the proposed benchmark are available at https://github.com/Zhangyong-Tang/UniBench300.

[59] AddressVLM: Cross-view Alignment Tuning for Image Address Localization using Large Vision-Language Models cs.CV | cs.AIPDF

Shixiong Xu, Chenghao Zhang, Lubin Fan, Yuan Zhou, Bin Fan

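TL;DR: AddressVLM uses cross-view alignment tuning and a satellite-view and street-view image grafting mechanism to strengthen large vision-language models on fine-grained address localization, substantially improving localization accuracy.

Details

Motivation: Existing large vision-language models perform well at coarse-grained geo-localization (country or city level) but struggle with fine-grained street-level localization within cities, in part because street-view data provides only microscopic visual cues.

Result: On the Pittsburgh and San Francisco datasets, AddressVLM's average address localization accuracy exceeds that of counterpart LVLMs by over 9% and 12%, respectively.

Insight: Fusing visual cues from macro (satellite) and micro (street-view) perspectives can substantially strengthen a model's understanding of complex urban layouts.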

Abstract: Large visual language models (LVLMs) have demonstrated impressive performance in coarse-grained geo-localization at the country or city level, but they struggle with fine-grained street-level localization within urban areas. In this paper, we explore integrating city-wide address localization capabilities into LVLMs, facilitating flexible address-related question answering using street-view images. A key challenge is that the street-view visual question-and-answer (VQA) data provides only microscopic visual cues, leading to subpar performance in fine-tuned models. To tackle this issue, we incorporate perspective-invariant satellite images as macro cues and propose cross-view alignment tuning including a satellite-view and street-view image grafting mechanism, along with an automatic label generation mechanism. Then LVLM’s global understanding of street distribution is enhanced through cross-view matching. Our proposed model, named AddressVLM, consists of two-stage training protocols: cross-view alignment tuning and address localization tuning. Furthermore, we have constructed two street-view VQA datasets based on image address localization datasets from Pittsburgh and San Francisco. Qualitative and quantitative evaluations demonstrate that AddressVLM outperforms counterpart LVLMs by over 9% and 12% in average address localization accuracy on these two datasets, respectively.

[60] IADGPT: Unified LVLM for Few-Shot Industrial Anomaly Detection, Localization, and Reasoning via In-Context Learning cs.CVPDF

Mengyang Zhao, Teng Fu, Haiyang Yu, Ke Niu, Bin Li

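TL;DR: The paper proposes IADGPT, a unified framework that uses a three-stage progressive training strategy and in-context learning to perform few-shot industrial anomaly detection (FS-IAD), localization, and reasoning, and introduces a new large-scale dataset to support training and testing.

Details

Motivation: Existing FS-IAD methods based on large vision-language models (LVLMs) lack industrial knowledge and reasoning capabilities and fall short of specialized human quality inspectors. IADGPT aims to close this gap by imitating the progressive way humans learn, enabling effective inspection of diverse and novel industrial products.

Result: Experiments indicate that IADGPT achieves considerable gains in anomaly detection and is competitive in anomaly localization and reasoning.

Insight: Imitating human learning and combining it with in-context learning can improve LVLM performance on industrial anomaly detection, particularly generalization in few-shot settings.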

Abstract: Few-Shot Industrial Anomaly Detection (FS-IAD) has important applications in automating industrial quality inspection. Recently, some FS-IAD methods based on Large Vision-Language Models (LVLMs) have been proposed with some achievements through prompt learning or fine-tuning. However, existing LVLMs focus on general tasks but lack basic industrial knowledge and reasoning capabilities related to FS-IAD, making these methods far from specialized human quality inspectors. To address these challenges, we propose a unified framework, IADGPT, designed to perform FS-IAD in a human-like manner, while also handling associated localization and reasoning tasks, even for diverse and novel industrial products. To this end, we introduce a three-stage progressive training strategy inspired by humans. Specifically, the first two stages gradually guide IADGPT in acquiring fundamental industrial knowledge and discrepancy awareness. In the third stage, we design an in-context learning-based training paradigm, enabling IADGPT to leverage a few-shot image as the exemplars for improved generalization to novel products. In addition, we design a strategy that enables IADGPT to output image-level and pixel-level anomaly scores using the logits output and the attention map, respectively, in conjunction with the language output to accomplish anomaly reasoning. To support our training, we present a new dataset comprising 100K images across 400 diverse industrial product categories with extensive attribute-level textual annotations. Experiments indicate IADGPT achieves considerable performance gains in anomaly detection and demonstrates competitiveness in anomaly localization and reasoning. We will release our dataset in camera-ready.

[61] Beyond conventional vision: RGB-event fusion for robust object detection in dynamic traffic scenarios cs.CVPDF

Zhanwen Liu, Yujing Sun, Yang Wang, Nan Yang, Shengbo Eben Li

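TL;DR: The paper proposes MCFNet, a multimodal fusion network that combines an RGB camera with an event camera for robust object detection in dynamic traffic scenes, addressing the dynamic range limitation of conventional RGB cameras under challenging lighting.

Details

Motivation: Because of their limited dynamic range, conventional RGB cameras lose contrast and high-frequency detail in complex traffic environments (e.g., nighttime driving, tunnels), which degrades object detection. To address this, the paper adds an event camera to supply high dynamic range information.

Result: On the DSEC-Det and PKU-DAVIS-SOD datasets, MCFNet significantly outperforms existing methods. On DSEC-Det in particular, it surpasses the best existing methods by 7.4% in mAP50 and 1.7% in mAP.

Insight: Cross-modal fusion (RGB plus events) can address the dynamic range problem and make object detection more robust under difficult lighting. Adaptive feature alignment and fusion are the key ingredients.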

Abstract: The dynamic range limitation of conventional RGB cameras reduces global contrast and causes loss of high-frequency details such as textures and edges in complex traffic environments (e.g., nighttime driving, tunnels), hindering discriminative feature extraction and degrading frame-based object detection. To address this, we integrate a bio-inspired event camera with an RGB camera to provide high dynamic range information and propose a motion cue fusion network (MCFNet), which achieves optimal spatiotemporal alignment and adaptive cross-modal feature fusion under challenging lighting. Specifically, an event correction module (ECM) temporally aligns asynchronous event streams with image frames via optical-flow-based warping, jointly optimized with the detection network to learn task-aware event representations. The event dynamic upsampling module (EDUM) enhances spatial resolution of event frames to match image structures, ensuring precise spatiotemporal alignment. The cross-modal mamba fusion module (CMM) uses adaptive feature fusion with a novel interlaced scanning mechanism, effectively integrating complementary information for robust detection. Experiments conducted on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that MCFNet significantly outperforms existing methods in various poor lighting and fast moving traffic scenarios. Notably, on the DSEC-Det dataset, MCFNet achieves a remarkable improvement, surpassing the best existing methods by 7.4% in mAP50 and 1.7% in mAP metrics, respectively. The code is available at https://github.com/Charm11492/MCFNet.

[62] Revisiting Cross-View Localization from Image Matching cs.CVPDF

Panwang Xia, Qiong Wu, Lei Yu, Yi Liu, Mingtao Xiong

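TL;DR: The paper proposes a new cross-view localization method that improves localization accuracy through better cross-view image matching. It introduces a Surface Model and a similarity-matrix refinement module, and presents CVFM, the first cross-view matching benchmark annotated with pixel-level correspondences.

Details

Motivation: Cross-view localization is essential in GNSS-denied environments such as urban canyons and disaster zones. Existing methods either regress poses directly or align features in a shared bird's-eye view (BEV) space, but they fail to establish strict cross-view correspondences, yielding coarse or geometrically inconsistent matches. Better cross-view image matching is therefore needed to improve the interpretability of localization.

Result: Experiments show that the method substantially improves both localization accuracy and image matching quality, setting new baselines under extreme viewpoint disparity.

Insight: Improving cross-view image matching, rather than directly regressing poses or aligning features, can substantially improve the performance and interpretability of cross-view localization. The CVFM benchmark with pixel-level correspondence annotations also provides support for future research.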

Abstract: Cross-view localization aims to estimate the 3 degrees of freedom pose of a ground-view image by registering it to aerial or satellite imagery. It is essential in GNSS-denied environments such as urban canyons and disaster zones. Existing methods either regress poses directly or align features in a shared bird’s-eye view (BEV) space, both built upon accurate spatial correspondences between perspectives. However, these methods fail to establish strict cross-view correspondences, yielding only coarse or geometrically inconsistent matches. Consequently, fine-grained image matching between ground and aerial views remains an unsolved problem, which in turn constrains the interpretability of localization results. In this paper, we revisit cross-view localization from the perspective of cross-view image matching and propose a novel framework that improves both matching and localization. Specifically, we introduce a Surface Model to model visible regions for accurate BEV projection, and a SimRefiner module to refine the similarity matrix through local-global residual correction, eliminating the reliance on post-processing like RANSAC. To further support research in this area, we introduce CVFM, the first benchmark with 32,509 cross-view image pairs annotated with pixel-level correspondences. Extensive experiments demonstrate that our approach substantially improves both localization accuracy and image matching quality, setting new baselines under extreme viewpoint disparity.

[63] EgoCross: Benchmarking Multimodal Large Language Models for Cross-Domain Egocentric Video Question Answering cs.CV | cs.AIPDF

Yanjun Li, Yuqian Fu, Tianwen Qian, Qi’ao Xu, Silong Dai

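TL;DR: The paper introduces EgoCross, a benchmark for evaluating how well multimodal large language models (MLLMs) generalize across domains in egocentric video question answering, filling a gap left by prior work that focuses only on daily activities.

Details

Motivation: Existing egocentric video question answering research is mostly limited to daily activities such as cooking and cleaning, whereas domain shifts in real-world deployment (in visual style and semantic content) can degrade model performance. The authors propose EgoCross to evaluate cross-domain generalization.

Result: Experiments show that existing MLLMs perform poorly on cross-domain tasks, highlighting the limitations of current models.

Insight: The study exposes the challenges MLLMs face in cross-domain generalization and lays groundwork for developing more robust, domain-adaptive models.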

Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have significantly pushed the frontier of egocentric video question answering (EgocentricQA). However, existing benchmarks and studies are mainly limited to common daily activities such as cooking and cleaning. In contrast, real-world deployment inevitably encounters domain shifts, where target domains differ substantially in both visual style and semantic content. To bridge this gap, we introduce EgoCross, a comprehensive benchmark designed to evaluate the cross-domain generalization of MLLMs in EgocentricQA. EgoCross covers four diverse and challenging domains, including surgery, industry, extreme sports, and animal perspective, representing realistic and high-impact application scenarios. It comprises approximately 1,000 QA pairs across 798 video clips, spanning four key QA tasks: prediction, recognition, localization, and counting. Each QA pair provides both OpenQA and CloseQA formats to support fine-grained evaluation. Extensive experiments show that most existing MLLMs, whether general-purpose or egocentric-specialized, struggle to generalize to domains beyond daily life, highlighting the limitations of current models. Furthermore, we conduct several pilot studies, e.g., fine-tuning and reinforcement learning, to explore potential improvements. We hope EgoCross and our accompanying analysis will serve as a foundation for advancing domain-adaptive, robust egocentric video understanding. Data and codes will be released at: https://github.com/MyUniverse0726/EgoCross

[64] Dissecting Generalized Category Discovery: Multiplex Consensus under Self-Deconstruction cs.CV | cs.LGPDF

Luyao Tang, Kunze Huang, Chaoqi Chen, Yuxuan Yuan, Chenxin Li

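TL;DR: The paper proposes ConGCD, a new generalized category discovery method that uses visual primitives and consensus mechanisms to improve recognition of both known and unknown categories.

Details

Motivation: Humans readily recognize objects from both known and novel categories, whereas existing machine learning methods tackle generalized category discovery mainly by optimizing objective functions and do not model the human cognitive process.

Result: Experiments on coarse-grained and fine-grained datasets validate the effectiveness of ConGCD as a consensus-aware paradigm.

Insight: Visual primitives and consensus mechanisms modeled on human cognition can improve generalized category discovery.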

Abstract: Human perceptual systems excel at inducing and recognizing objects across both known and novel categories, a capability far beyond current machine learning frameworks. While generalized category discovery (GCD) aims to bridge this gap, existing methods predominantly focus on optimizing objective functions. We present an orthogonal solution, inspired by the human cognitive process for novel object understanding: decomposing objects into visual primitives and establishing cross-knowledge comparisons. We propose ConGCD, which establishes primitive-oriented representations through high-level semantic reconstruction, binding intra-class shared attributes via deconstruction. Mirroring human preference diversity in visual processing, where distinct individuals leverage dominant or contextual cues, we implement dominant and contextual consensus units to capture class-discriminative patterns and inherent distributional invariants, respectively. A consensus scheduler dynamically optimizes activation pathways, with final predictions emerging through multiplex consensus integration. Extensive evaluations across coarse- and fine-grained benchmarks demonstrate ConGCD’s effectiveness as a consensus-aware paradigm. Code is available at github.com/lytang63/ConGCD.

[65] Axis-level Symmetry Detection with Group-Equivariant Representation cs.CVPDF

Wongyun Yu, Ahyun Seo, Minsu Cho

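TL;DR: The paper proposes a new symmetry detection framework focused on axis-level detection of reflection and rotation symmetry, using a dual-branch architecture and group-equivariant representations to achieve high-precision detection.

Details

Motivation: Detecting symmetry in complex scenes remains challenging in computer vision, and axis-level precision is especially lacking. Existing methods struggle to localize individual symmetry axes accurately, so finer-grained detection is needed.

Result: Experiments show that the method achieves state-of-the-art performance, outperforming existing approaches.

Insight: 1. Group-equivariant representations capture symmetry effectively;
2. Explicit geometric primitives (lines and points) help improve detection precision.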

Abstract: Symmetry is a fundamental concept that has been extensively studied, yet detecting it in complex scenes remains a significant challenge in computer vision. Recent heatmap-based approaches can localize potential regions of symmetry axes but often lack precision in identifying individual axes. In this work, we propose a novel framework for axis-level detection of the two most common symmetry types-reflection and rotation-by representing them as explicit geometric primitives, i.e. lines and points. Our method employs a dual-branch architecture that is equivariant to the dihedral group, with each branch specialized to exploit the structure of dihedral group-equivariant features for its respective symmetry type. For reflection symmetry, we introduce orientational anchors, aligned with group components, to enable orientation-specific detection, and a reflectional matching that measures similarity between patterns and their mirrored counterparts across candidate axes. For rotational symmetry, we propose a rotational matching that compares patterns at fixed angular intervals to identify rotational centers. Extensive experiments demonstrate that our method achieves state-of-the-art performance, outperforming existing approaches.

[66] From Diagnosis to Improvement: Probing Spatio-Physical Reasoning in Vision Language Models cs.CVPDF

Tiancheng Han, Yunfei Gao, Yong Li, Wuzhou Yu, Qiaosheng Zhang

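TL;DR: The paper presents a diagnostic analysis of vision language models (VLMs) on spatio-physical reasoning, finds that current models perform poorly, and applies supervised fine-tuning followed by rule-based reinforcement learning to substantially improve Qwen2.5-VL-7B.

Details

Motivation: Spatio-physical reasoning is a key capability for understanding the real physical world, but how well current VLMs perform on it has not been studied in depth. The paper aims to fill this gap.

Result: The improved model performs markedly better on spatio-physical reasoning and surpasses leading proprietary models, but its generalization to new physics scenarios remains limited.

Insight: The paper attributes the weak spatio-physical reasoning of current VLMs mainly to biases from human-like priors and a lack of deep reasoning, and underscores the need for new approaches.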

Abstract: Spatio-physical reasoning, a foundation capability for understanding the real physics world, is a critical step towards building robust world models. While recent vision language models (VLMs) have shown remarkable progress in specialized domains like multimodal mathematics and pure spatial understanding, their capability for spatio-physical reasoning remains largely unexplored. This paper provides a comprehensive diagnostic analysis of mainstream VLMs, revealing that current models perform inadequately on this crucial task. Further detailed analysis shows that this underperformance is largely attributable to biases caused by human-like prior and a lack of deep reasoning. To address these challenges, we apply supervised fine-tuning followed by rule-based reinforcement learning to Qwen2.5-VL-7B, resulting in significant improvements in spatio-physical reasoning capabilities and surpassing leading proprietary models. Nevertheless, despite this success, the model’s generalization to new physics scenarios remains limited – underscoring the pressing need for new approaches in spatio-physical reasoning.

[67] AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences cs.CV | cs.AIPDF

Jieyu Li, Xin Zhang, Joey Tianyi Zhou

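TL;DR: AEGIS is a new large-scale benchmark dataset for evaluating authenticity detection of AI-generated video. It contains over 10,000 real and synthetic videos, targets hyper-realistic and semantically nuanced forgeries, and challenges the detection capabilities of existing vision-language models.

Details

Motivation: Current benchmarks for video authenticity detection suffer from limited realism, small scale, and low complexity, so they cannot effectively evaluate modern vision-language models against sophisticated forgeries.

Result: Experiments show that advanced vision-language models have limited detection capability on the most challenging subsets of AEGIS, underscoring the dataset's complexity and realism.

Insight: AEGIS provides an evaluation benchmark for developing robust, reliable video authenticity detection methods that can address real-world forgery threats.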

Abstract: Recent advances in AI-generated content have fueled the rise of highly realistic synthetic videos, posing severe risks to societal trust and digital integrity. Existing benchmarks for video authenticity detection typically suffer from limited realism, insufficient scale, and inadequate complexity, failing to effectively evaluate modern vision-language models against sophisticated forgeries. To address this critical gap, we introduce AEGIS, a novel large-scale benchmark explicitly targeting the detection of hyper-realistic and semantically nuanced AI-generated videos. AEGIS comprises over 10,000 rigorously curated real and synthetic videos generated by diverse, state-of-the-art generative models, including Stable Video Diffusion, CogVideoX-5B, KLing, and Sora, encompassing open-source and proprietary architectures. In particular, AEGIS features specially constructed challenging subsets enhanced with robustness evaluation. Furthermore, we provide multimodal annotations spanning Semantic-Authenticity Descriptions, Motion Features, and Low-level Visual Features, facilitating authenticity detection and supporting downstream tasks such as multimodal fusion and forgery localization. Extensive experiments using advanced vision-language models demonstrate limited detection capabilities on the most challenging subsets of AEGIS, highlighting the dataset’s unique complexity and realism beyond the current generalization capabilities of existing models. In essence, AEGIS establishes an indispensable evaluation benchmark, fundamentally advancing research toward developing genuinely robust, reliable, broadly generalizable video authenticity detection methodologies capable of addressing real-world forgery threats. Our dataset is available on https://huggingface.co/datasets/Clarifiedfish/AEGIS.

[68] Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation cs.CV | cs.AI | cs.LGPDF

Youping Gu, Xiaolong Li, Yuhao Hu, Bohan Zhuang

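TL;DR: The paper proposes BLADE, a framework that combines block-sparse attention with step distillation to substantially improve the efficiency and quality of video generation.

Details

Motivation: Diffusion transformers lead in high-quality video generation, but their iterative denoising is slow and the quadratic cost of attention over long sequences is prohibitive, so acceleration is needed.

Result: BLADE achieves a 14.10x speedup on Wan2.1-1.3B and an 8.89x speedup on CogVideoX-5B, with improved VBench-2.0 scores on both.

Insight: Jointly training sparse attention and step distillation is the key: dynamic, content-aware sparsity incorporated directly into the distillation process improves both efficiency and quality.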

Abstract: Diffusion transformers currently lead the field in high-quality video generation, but their slow iterative denoising process and prohibitive quadratic attention costs for long sequences create significant inference bottlenecks. While both step distillation and sparse attention mechanisms have shown promise as independent acceleration strategies, effectively combining these approaches presents critical challenges – training-free integration yields suboptimal results, while separately training sparse attention after step distillation requires prohibitively expensive high-quality video data. To overcome these limitations, we propose BLADE, an innovative data-free joint training framework that introduces: (1) an Adaptive Block-Sparse Attention (ASA) mechanism for dynamically generating content-aware sparsity masks to focus computation on salient spatiotemporal features, and (2) a sparsity-aware step distillation paradigm built upon Trajectory Distribution Matching (TDM) that directly incorporates sparsity into the distillation process rather than treating it as a separate compression step, with fast convergence. We validate BLADE on text-to-video models like CogVideoX-5B and Wan2.1-1.3B. Our framework demonstrates remarkable efficiency gains across different scales. On Wan2.1-1.3B, BLADE achieves a 14.10x end-to-end inference acceleration over a 50-step baseline. Moreover, on models such as CogVideoX-5B with short video sequence lengths, our framework delivers a robust 8.89x speedup. Crucially, the acceleration is accompanied by a consistent quality improvement. On the VBench-2.0 benchmark, BLADE boosts the score of CogVideoX-5B to 0.569 (from 0.534) and Wan2.1-1.3B to 0.570 (from 0.563), results that are further corroborated by superior ratings in human evaluations. Our code and model weights are publicly available at: http://ziplab.co/BLADE-Homepage/.
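To make the notion of block-sparse attention concrete, here is a minimal NumPy sketch in which each query block attends only to its top-scoring key blocks. The block-scoring rule (mean-pooled queries and keys), the function name `block_sparse_attention`, and the `keep` parameter are assumptions chosen for illustration. This is not BLADE's Adaptive Block-Sparse Attention, and for clarity the sketch still computes the dense score matrix before masking, so it shows the masking semantics rather than the compute savings.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_sparse_attention(q, k, v, block: int, keep: int) -> np.ndarray:
    """Attention where each query block sees only its `keep` best key blocks.

    Assumes the sequence length is divisible by `block` and 1 <= keep <= n // block.
    """
    n, d = q.shape
    nb = n // block
    # Cheap block-level scores from mean-pooled queries and keys (illustrative choice).
    q_blocks = q.reshape(nb, block, d).mean(axis=1)
    k_blocks = k.reshape(nb, block, d).mean(axis=1)
    block_scores = q_blocks @ k_blocks.T                      # shape (nb, nb)
    # Keep the highest-scoring key blocks for every query block.
    top = np.argsort(-block_scores, axis=1)[:, :keep]
    block_mask = np.zeros((nb, nb), dtype=bool)
    np.put_along_axis(block_mask, top, True, axis=1)
    # Expand the block mask to token resolution and mask the attention scores.
    mask = np.repeat(np.repeat(block_mask, block, axis=0), block, axis=1)
    scores = np.where(mask, (q @ k.T) / np.sqrt(d), -np.inf)
    return softmax(scores, axis=-1) @ v

# Usage on random data: 16 tokens, 4 blocks of 4 tokens.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
out_sparse = block_sparse_attention(q, k, v, block=4, keep=2)
out_full = block_sparse_attention(q, k, v, block=4, keep=4)
```

Setting `keep` to the number of blocks recovers ordinary dense attention, which is a convenient sanity check. A practical implementation would skip the masked blocks entirely instead of computing and discarding them.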

[69] Cooperative Face Liveness Detection from Optical Flow cs.CVPDF

Artem Sokolov, Mikhail Nikitin, Anton Konushin

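TL;DR: The paper proposes a video-based face liveness detection method built on user interaction and optical flow analysis. Users are instructed to slowly move their face toward the camera, and neural optical flow estimation is used to distinguish genuine faces from a range of presentation attacks.

Details

Motivation: Conventional passive liveness detection is vulnerable to many kinds of attack (printed photos, screen displays, masks, video replays), so a more active and robust solution is needed.

Result: The method is reported to discriminate genuine faces from a variety of presentation attacks more reliably than passive methods.

Insight: Combining a designed user interaction pattern with optical flow analysis is the key to more robust liveness detection.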

Abstract: In this work, we proposed a novel cooperative video-based face liveness detection method based on a new user interaction scenario where participants are instructed to slowly move their frontal-oriented face closer to the camera. This controlled approaching face protocol, combined with optical flow analysis, represents the core innovation of our approach. By designing a system where users follow this specific movement pattern, we enable robust extraction of facial volume information through neural optical flow estimation, significantly improving discrimination between genuine faces and various presentation attacks (including printed photos, screen displays, masks, and video replays). Our method processes both the predicted optical flows and RGB frames through a neural classifier, effectively leveraging spatial-temporal features for more reliable liveness detection compared to passive methods.

[70] Object Fidelity Diffusion for Remote Sensing Image Generation cs.CV

Ziqi Ye, Shuran Ma, Jie Yang, Xiaoyi Yang, Ziyang Gong

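TL;DR: The paper proposes Object Fidelity Diffusion (OF-Diff), which extracts prior object shapes and uses a dual-branch diffusion model to improve remote sensing image generation, markedly raising the fidelity of generated small objects.

Details

Motivation: Existing diffusion models struggle to capture morphological details, and the resulting low-fidelity images affect the robustness of object detection models, so a high-precision, controllable remote sensing image generation method is needed.

Result: OF-Diff outperforms SOTA methods on key quality metrics; mAP for small objects such as airplanes, ships, and vehicles improves by 8.3%, 7.7%, and 4.0%, respectively.

Insight: Using prior object shapes and a dual-branch model can significantly improve the fidelity of remote sensing image generation, with the effect most pronounced for small objects.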

Abstract: High-precision controllable remote sensing image generation is both meaningful and challenging. Existing diffusion models often produce low-fidelity images due to their inability to adequately capture morphological details, which may affect the robustness and reliability of object detection models. To enhance the accuracy and fidelity of generated objects in remote sensing, this paper proposes Object Fidelity Diffusion (OF-Diff), which effectively improves the fidelity of generated objects. Specifically, we are the first to extract the prior shapes of objects based on the layout for diffusion models in remote sensing. Then, we introduce a dual-branch diffusion model with diffusion consistency loss, which can generate high-fidelity remote sensing images without providing real images during the sampling phase. Furthermore, we introduce DDPO to fine-tune the diffusion process, making the generated remote sensing images more diverse and semantically consistent. Comprehensive experiments demonstrate that OF-Diff outperforms state-of-the-art methods in remote sensing across key quality metrics. Notably, the performance of several polymorphic and small object classes shows significant improvement. For instance, the mAP increases by 8.3%, 7.7%, and 4.0% for airplanes, ships, and vehicles, respectively.

[71] Mobile-Friendly Deep Learning for Plant Disease Detection: A Lightweight CNN Benchmark Across 101 Classes of 33 Crops cs.CV | cs.LG

Anand Kumar, Harminder Pal Monga, Tapasi Brahma, Satyam Kalra, Navas Sherif

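TL;DR: The paper proposes a lightweight, mobile-friendly deep learning approach for detecting 101 classes of plant disease across 33 crops, and experimentally evaluates several lightweight architectures.

Details

Motivation: Plant diseases are a major threat to global food security, and efficient early detection systems are needed. Computer vision can address this challenge, and solutions suited to resource-constrained devices are especially needed.

Result: Experiments show that EfficientNet-B1 achieves the best balance between accuracy and computational efficiency, with 94.7% classification accuracy, making it suitable for deployment on real mobile devices.

Insight: Lightweight CNN architectures have great potential on resource-constrained devices; EfficientNet-B1 performs well on plant disease detection and can provide an efficient solution for real-world scenarios.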

Abstract: Plant diseases are a major threat to food security globally. It is important to develop early detection systems that can identify them accurately. Advances in computer vision techniques have the potential to solve this challenge. We have developed a mobile-friendly solution which can accurately classify 101 plant diseases across 33 crops. We built a comprehensive dataset by combining three existing datasets built for the same purpose: PlantDoc, PlantVillage, and PlantWild. We evaluated performance across several lightweight architectures (MobileNetV2, MobileNetV3, MobileNetV3-Large, EfficientNet-B0, and EfficientNet-B1) specifically chosen for their efficiency on resource-constrained devices. The results were promising, with EfficientNet-B1 delivering our best performance at 94.7% classification accuracy. This architecture struck an optimal balance between accuracy and computational efficiency, making it well-suited for real-world deployment on mobile devices.

[72] UI-Venus Technical Report: Building High-performance UI Agents with RFT cs.CV

Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen

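TL;DR: UI-Venus is a high-performance UI agent based on a multimodal large language model that takes only screenshots as input. Using reinforcement fine-tuning (RFT) on a limited set of high-quality training samples, it achieves SOTA performance and excels at both UI grounding and navigation tasks.

Details

Motivation: Existing UI agents fall short on complex tasks; more efficient models and training methods are needed to improve performance.

Result: The 7B and 72B versions of UI-Venus reach 94.1%/50.8% and 95.3%/61.9%, respectively, on the Screenspot-V2/Pro benchmarks, and success rates of 49.1% and 65.9%, respectively, on the AndroidWorld navigation task.

Insight: Reinforcement fine-tuning and a self-evolving framework can significantly improve UI agent performance, especially planning ability on complex tasks. Data quality and reward function design are critical to model performance.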

Abstract: We present UI-Venus, a native UI agent based on a multimodal large language model that takes only screenshots as input. UI-Venus achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples through reinforcement fine-tuning (RFT) based on Qwen2.5-VL. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9% on the standard grounding benchmarks, i.e., Screenspot-V2 / Pro, surpassing the previous SOTA baselines including open-source GTA1 and closed-source UI-TARS-1.5. To show UI-Venus’s summarization and planning ability, we also evaluate it on AndroidWorld, an online UI navigation arena, on which our 7B and 72B variants achieve 49.1% and 65.9% success rates, also beating existing models. To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation tasks and corresponding efficient data cleaning strategies. To further boost navigation performance, we propose Self-Evolving Trajectory History Alignment & Sparse Action Enhancement, which refines historical reasoning traces and balances the distribution of sparse but critical actions, leading to more coherent planning and better generalization in complex UI tasks. Our contributions include the release of SOTA open-source UI agents, comprehensive data cleaning protocols, and a novel self-evolving framework for improving navigation performance, which together encourage further research and development in the community. Code is available at https://github.com/antgroup/UI-Venus.

[73] Generalizable Federated Learning using Client Adaptive Focal Modulation cs.CV

Tajamul Ashraf, Iqra Altaf Gillani

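TL;DR: The paper proposes AdaptFED, which refines the client-adaptive focal modulation strategy to strengthen the generalization of federated learning, and validates its advantages across multiple data modalities.

Details

Motivation: Federated learning faces challenges in non-independent and identically distributed (non-IID) and cross-domain settings, where traditional methods perform poorly. AdaptFED aims to improve model adaptability and generality in these settings by refining the focal modulation strategy.

Result: Experiments on eight diverse datasets show that the method outperforms existing baselines in source-free and cross-task federated setups.

Insight: Improvements to focal modulation make it possible to build more adaptive, scalable, and generalizable transformer-based federated learning systems.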

Abstract: Federated learning (FL) has proven essential for privacy-preserving, collaborative training across distributed clients. Our prior work, TransFed, introduced a robust transformer-based FL framework that leverages a learn-to-adapt hypernetwork to generate personalized focal modulation layers per client, outperforming traditional methods in non-IID and cross-domain settings. In this extended version, we propose AdaptFED, where we deepen the investigation of focal modulation in generalizable FL by incorporating: (1) a refined adaptation strategy that integrates task-aware client embeddings to personalize modulation dynamics further, (2) enhanced theoretical bounds on adaptation performance, and (3) broader empirical validation across additional modalities, including time-series and multilingual data. We also introduce an efficient variant of TransFed that reduces server-client communication overhead via low-rank hypernetwork conditioning, enabling scalable deployment in resource-constrained environments. Extensive experiments on eight diverse datasets reaffirm the superiority of our method over state-of-the-art baselines, particularly in source-free and cross-task federated setups. Our findings not only extend the capabilities of focal modulation in FL but also pave the way for more adaptive, scalable, and generalizable transformer-based federated systems. The code is available at http://github.com/Tajamul21/TransFed

[74] Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation cs.CV

Harold Haodong Chen, Haojian Huang, Qifeng Chen, Harry Yang, Ser-Nam Lim

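TL;DR: The paper proposes the PhysHPO framework, which significantly improves the physical plausibility and overall quality of generated video through hierarchical fine-grained preference optimization and automated data selection.

Details

Motivation: Existing video generation techniques struggle to meet physical plausibility requirements, limiting their use in applications that demand high realism.

Result: Experiments show that PhysHPO significantly outperforms existing methods in physical plausibility and video quality.

Insight: Hierarchical optimization and multi-granularity alignment are an effective route to more physically plausible video generation, and automated data selection can substantially reduce the cost of obtaining high-quality data.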

Abstract: Recent advancements in video generation have enabled the creation of high-quality, visually compelling videos. However, generating videos that adhere to the laws of physics remains a critical challenge for applications requiring realism and accuracy. In this work, we propose PhysHPO, a novel framework for Hierarchical Cross-Modal Direct Preference Optimization, to tackle this challenge by enabling fine-grained preference alignment for physically plausible video generation. PhysHPO optimizes video alignment across four hierarchical granularities: a) Instance Level, aligning the overall video content with the input prompt; b) State Level, ensuring temporal consistency using boundary frames as anchors; c) Motion Level, modeling motion trajectories for realistic dynamics; and d) Semantic Level, maintaining logical consistency between narrative and visuals. Recognizing that real-world videos are the best reflections of physical phenomena, we further introduce an automated data selection pipeline to efficiently identify and utilize “good data” from existing large-scale text-video datasets, thereby eliminating the need for costly and time-intensive dataset construction. Extensive experiments on both physics-focused and general capability benchmarks demonstrate that PhysHPO significantly improves physical plausibility and overall video generation quality of advanced models. To the best of our knowledge, this is the first work to explore fine-grained preference alignment and data selection for video generation, paving the way for more realistic and human-preferred video generation paradigms.

[75] Performance of GPT-5 in Brain Tumor MRI Reasoning cs.CV | cs.AI

Mojtaba Safari, Shansong Wang, Mingzhe Hu, Zach Eidex, Qiang Li

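TL;DR: The paper studies how GPT-5 family models perform on brain tumor MRI reasoning tasks, finding that GPT-5-mini achieves the highest macro-average accuracy across tumor subtypes, but overall accuracy is still insufficient for clinical needs.

Details

Motivation: Accurately differentiating brain tumor types is critical for treatment planning, and the potential of large language models (LLMs) in visual question answering (VQA) tasks has not been fully explored.

Result: GPT-5-mini performs best (44.19%), but no model reaches a clinically acceptable level of accuracy.

Insight: Although GPT-5 family models achieve moderate performance on structured neuro-oncological VQA tasks, their performance varies considerably across tumor subtypes, indicating that further improvement is needed.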

Abstract: Accurate differentiation of brain tumor types on magnetic resonance imaging (MRI) is critical for guiding treatment planning in neuro-oncology. Recent advances in large language models (LLMs) have enabled visual question answering (VQA) approaches that integrate image interpretation with natural language reasoning. In this study, we evaluated GPT-4o, GPT-5-nano, GPT-5-mini, and GPT-5 on a curated brain tumor VQA benchmark derived from 3 Brain Tumor Segmentation (BraTS) datasets - glioblastoma (GLI), meningioma (MEN), and brain metastases (MET). Each case included multi-sequence MRI triplanar mosaics and structured clinical features transformed into standardized VQA items. Models were assessed in a zero-shot chain-of-thought setting for accuracy on both visual and reasoning tasks. Results showed that GPT-5-mini achieved the highest macro-average accuracy (44.19%), followed by GPT-5 (43.71%), GPT-4o (41.49%), and GPT-5-nano (35.85%). Performance varied by tumor subtype, with no single model dominating across all cohorts. These findings suggest that GPT-5 family models can achieve moderate accuracy in structured neuro-oncological VQA tasks, but not at a level acceptable for clinical use.
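
Macro-average accuracy, the headline metric in this abstract, gives each group equal weight regardless of how many items it contains. The sketch below assumes the average is taken over the three tumor cohorts, which the abstract does not spell out. The counts are placeholders rather than the paper's data, and the helper names are hypothetical.

```python
def macro_average_accuracy(per_cohort):
    """per_cohort maps cohort name -> (num_correct, num_total).
    Returns the unweighted mean of per-cohort accuracies."""
    accuracies = [correct / total for correct, total in per_cohort.values()]
    return sum(accuracies) / len(accuracies)

def micro_average_accuracy(per_cohort):
    """Pooled accuracy over all items, shown for comparison."""
    correct = sum(c for c, _ in per_cohort.values())
    total = sum(t for _, t in per_cohort.values())
    return correct / total

# Placeholder counts for the three cohorts named in the abstract (not real results).
counts = {"GLI": (50, 100), "MEN": (30, 50), "MET": (10, 50)}
macro = macro_average_accuracy(counts)  # (0.5 + 0.6 + 0.2) / 3
micro = micro_average_accuracy(counts)  # 90 / 200
```

The two averages differ whenever cohort sizes are unequal, so a reported macro-average should not be read as the fraction of all questions answered correctly.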

[76] TexVerse: A Universe of 3D Objects with High-Resolution Textures cs.CV

Yibo Zhang, Li Zhang, Rui Ma, Nan Cao

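TL;DR: TexVerse is a large-scale 3D dataset focused on high-resolution textures, filling a gap in texture generation research among existing 3D datasets. It contains 858K unique high-resolution 3D models with detailed annotations of model characteristics.

Details

Motivation: Current large-scale 3D datasets focus mainly on geometry generation, and the lack of datasets with high-resolution textures has limited progress in related research. TexVerse aims to address this.

Result: TexVerse contains 1.6M 3D instances and includes specialized subsets (such as 69K rigged models and 54K animated models), supporting a range of 3D vision and graphics tasks.

Insight: High-resolution textures and large-scale annotated datasets are essential for 3D research and applications; TexVerse opens new opportunities in areas such as material development and animation synthesis.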

Abstract: We introduce TexVerse, a large-scale 3D dataset featuring high-resolution textures. While recent advances in large-scale 3D datasets have enhanced high-resolution geometry generation, creating high-resolution textures end-to-end remains underexplored due to the lack of suitable datasets. TexVerse fills this gap with a curated collection of over 858K unique high-resolution 3D models sourced from Sketchfab, including more than 158K models with physically based rendering (PBR) materials. Each model encompasses all of its high-resolution variants, bringing the total to 1.6M 3D instances. TexVerse also includes specialized subsets: TexVerse-Skeleton, with 69K rigged models, and TexVerse-Animation, with 54K animated models, both preserving original skeleton and animation data uploaded by the user. We also provide detailed model annotations describing overall characteristics, structural components, and intricate features. TexVerse offers a high-quality data resource with wide-ranging potential applications in texture synthesis, PBR material development, animation, and various 3D vision and graphics tasks.

[77] Medico 2025: Visual Question Answering for Gastrointestinal Imaging cs.CV | cs.AI | 68T45, 92C55 | I.2.10; I.4.9

Sushant Gautam, Vajira Thambawita, Michael Riegler, Pål Halvorsen, Steven Hicks

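TL;DR: The Medico 2025 challenge focuses on visual question answering (VQA) for gastrointestinal (GI) imaging, aiming to develop explainable AI (XAI) models that answer clinically relevant questions and provide interpretable justifications aligned with medical reasoning.

Details

Motivation: To advance trustworthy AI in medical image analysis and raise the clinical value of models by combining quantitative performance metrics with expert-reviewed explainability assessments.

Result: As a challenge overview, the paper reports no experimental results; it defines two subtasks that are evaluated with quantitative metrics and expert-reviewed explainability assessments.

Insight: The challenge underscores the importance of explainability in medical AI, providing transparent support for clinical decisions.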

Abstract: The Medico 2025 challenge addresses Visual Question Answering (VQA) for Gastrointestinal (GI) imaging, organized as part of the MediaEval task series. The challenge focuses on developing Explainable Artificial Intelligence (XAI) models that answer clinically relevant questions based on GI endoscopy images while providing interpretable justifications aligned with medical reasoning. It introduces two subtasks: (1) answering diverse types of visual questions using the Kvasir-VQA-x1 dataset, and (2) generating multimodal explanations to support clinical decision-making. The Kvasir-VQA-x1 dataset, created from 6,500 images and 159,549 complex question-answer (QA) pairs, serves as the benchmark for the challenge. By combining quantitative performance metrics and expert-reviewed explainability assessments, this task aims to advance trustworthy Artificial Intelligence (AI) in medical image analysis. Instructions, data access, and an updated guide for participation are available in the official competition repository: https://github.com/simula/MediaEval-Medico-2025

[78] ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing cs.CV | cs.AI

Lingen Li, Guangzhi Wang, Zhaoyang Zhang, Yaowei Li, Xiaoyu Li

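TL;DR: ToonComposer unifies inbetweening and colorization into a single post-keyframing stage. Using sparse sketch injection and a cartoon adaptation method, it significantly reduces manual workload, improves flexibility, and outperforms existing methods.

Details

Motivation: Traditional cartoon and anime production requires extensive manual work, and existing AI methods handle the stages separately, which tends to cause error accumulation. ToonComposer aims to unify inbetweening and colorization to address these problems.

Result: Outperforms existing methods in visual quality, motion consistency, and production efficiency, and supports sparse inputs and precise motion control.

Insight: By unifying previously separate stages, ToonComposer reduces error accumulation, while sparse inputs and flexible control make it an efficient tool for real production.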

Abstract: Traditional cartoon and anime production involves keyframing, inbetweening, and colorization stages, which require intensive manual effort. Despite recent advances in AI, existing methods often handle these stages separately, leading to error accumulation and artifacts. For instance, inbetweening approaches struggle with large motions, while colorization methods require dense per-frame sketches. To address this, we introduce ToonComposer, a generative model that unifies inbetweening and colorization into a single post-keyframing stage. ToonComposer employs a sparse sketch injection mechanism to provide precise control using keyframe sketches. Additionally, it uses a cartoon adaptation method with the spatial low-rank adapter to tailor a modern video foundation model to the cartoon domain while keeping its temporal prior intact. Requiring as few as a single sketch and a colored reference frame, ToonComposer excels with sparse inputs, while also supporting multiple sketches at any temporal location for more precise motion control. This dual capability reduces manual workload and improves flexibility, empowering artists in real-world scenarios. To evaluate our model, we further created PKBench, a benchmark featuring human-drawn sketches that simulate real-world use cases. Our evaluation demonstrates that ToonComposer outperforms existing methods in visual quality, motion consistency, and production efficiency, offering a superior and more flexible solution for AI-assisted cartoon production.

[79] MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data cs.CV

Antoine Labatie, Michael Vaccaro, Nina Lardiere, Anatol Garioud, Nicolas Gonthier

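TL;DR: The paper proposes MAESTRO, an adaptation of the masked autoencoder designed for multimodal, multitemporal, and multispectral Earth observation data. With optimized fusion strategies and a tailored target normalization scheme, it significantly improves self-supervised learning performance on remote sensing tasks.

Details

Motivation: Self-supervised learning has great potential for remote sensing data, but existing methods do not fully account for its multimodal, multitemporal, and multispectral characteristics. The paper aims to fill this gap.

Result: On four Earth observation datasets, MAESTRO sets a new SOTA on tasks that rely on multitemporal dynamics while remaining highly competitive on mono-temporal tasks.

Insight: Tailored self-supervised methods can significantly improve multimodal and multitemporal learning on remote sensing data, and introducing a spectral prior is an effective self-supervisory signal.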

Abstract: Self-supervised learning holds great promise for remote sensing, but standard self-supervised methods must be adapted to the unique characteristics of Earth observation data. We take a step in this direction by conducting a comprehensive benchmark of fusion strategies and reconstruction target normalization schemes for multimodal, multitemporal, and multispectral Earth observation data. Based on our findings, we propose MAESTRO, a novel adaptation of the Masked Autoencoder, featuring optimized fusion strategies and a tailored target normalization scheme that introduces a spectral prior as a self-supervisory signal. Evaluated on four Earth observation datasets, MAESTRO sets a new state-of-the-art on tasks that strongly rely on multitemporal dynamics, while remaining highly competitive on tasks dominated by a single mono-temporal modality. Code to reproduce all our experiments is available at https://github.com/ignf/maestro.

[80] ESSENTIAL: Episodic and Semantic Memory Integration for Video Class-Incremental Learning cs.CV

Jongseo Lee, Kyungho Bae, Kyle Min, Gyeong-Moon Park, Jinwoo Choi

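TL;DR: ESSENTIAL proposes a video class-incremental learning method that combines episodic and semantic memory, using cross-attention to recover temporally dense information from sparse features, significantly reducing memory usage while maintaining performance.

Details

Motivation: In existing video class-incremental learning methods, storing dense samples is memory-inefficient, while sparse samples sacrifice temporal information and degrade performance. ESSENTIAL aims to resolve this trade-off.

Result: Validated on multiple datasets (UCF-101, HMDB51, and others), ESSENTIAL achieves favorable performance on the benchmarks with significantly reduced memory usage.

Insight: Combining episodic and semantic memory is an effective way to resolve the tension between memory efficiency and performance in video class-incremental learning, with cross-attention playing a key role.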

Abstract: In this work, we tackle the problem of video class-incremental learning (VCIL). Many existing VCIL methods mitigate catastrophic forgetting by rehearsal training with a few temporally dense samples stored in episodic memory, which is memory-inefficient. Alternatively, some methods store temporally sparse samples, sacrificing essential temporal information and thereby resulting in inferior performance. To address this trade-off between memory-efficiency and performance, we propose EpiSodic and SEmaNTIc memory integrAtion for video class-incremental Learning (ESSENTIAL). ESSENTIAL consists of episodic memory for storing temporally sparse features and semantic memory for storing general knowledge represented by learnable prompts. We introduce a novel memory retrieval (MR) module that integrates episodic memory and semantic prompts through cross-attention, enabling the retrieval of temporally dense features from temporally sparse features. We rigorously validate ESSENTIAL on diverse datasets: UCF-101, HMDB51, and Something-Something-V2 from the TCD benchmark and UCF-101, ActivityNet, and Kinetics-400 from the vCLIMB benchmark. Remarkably, with significantly reduced memory, ESSENTIAL achieves favorable performance on the benchmarks.
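
A minimal sketch of how cross-attention can expand a few stored features into a longer sequence follows. It assumes that prompt vectors act as queries (one per output time step) and that the sparse stored features act as keys and values. The abstract does not specify this wiring, so the roles, the absence of learned projections, and the function names are illustrative assumptions rather than the paper's MR module.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention without learned projections.
    Returns one output vector per query, each a weighted mix of `values`."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# Two stored (sparse) frame features and four hypothetical prompt queries.
sparse_feats = [[1.0, 0.0], [0.0, 1.0]]
prompts = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
dense_feats = cross_attention(prompts, sparse_feats, sparse_feats)
```

The output has one vector per prompt, so four prompts yield four features from only two stored ones; each output is a convex combination of the stored features.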

[81] Puppeteer: Rig and Animate Your 3D Models cs.CV | cs.GR

Chaoyue Song, Xiu Li, Fan Yang, Zhongcong Xu, Jiacheng Wei

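TL;DR: Puppeteer is a comprehensive framework for automatically rigging and animating 3D models; by predicting skeletal structures and optimizing the animation pipeline, it addresses bottlenecks in existing techniques.

Details

Motivation: Modern interactive applications need dynamic 3D content, but animating static 3D models depends on expert intervention and has become a bottleneck in content creation pipelines. The paper aims to solve this with an automated, efficient approach.

Result: Significantly outperforms existing techniques on multiple benchmarks, with higher skeleton prediction and skinning quality, and produces more stable animations free of jitter.

Insight: Combining generative AI with optimization techniques can greatly improve the efficiency and diversity of 3D content creation and reduce reliance on expert intervention.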

Abstract: Modern interactive applications increasingly demand dynamic 3D content, yet the transformation of static 3D models into animated assets constitutes a significant bottleneck in content creation pipelines. While recent advances in generative AI have revolutionized static 3D model creation, rigging and animation continue to depend heavily on expert intervention. We present Puppeteer, a comprehensive framework that addresses both automatic rigging and animation for diverse 3D objects. Our system first predicts plausible skeletal structures via an auto-regressive transformer that introduces a joint-based tokenization strategy for compact representation and a hierarchical ordering methodology with stochastic perturbation that enhances bidirectional learning capabilities. It then infers skinning weights via an attention-based architecture incorporating topology-aware joint attention that explicitly encodes inter-joint relationships based on skeletal graph distances. Finally, we complement these rigging advances with a differentiable optimization-based animation pipeline that generates stable, high-fidelity animations while being computationally more efficient than existing approaches. Extensive evaluations across multiple benchmarks demonstrate that our method significantly outperforms state-of-the-art techniques in both skeletal prediction accuracy and skinning quality. The system robustly processes diverse 3D content, ranging from professionally designed game assets to AI-generated shapes, producing temporally coherent animations that eliminate the jittering issues common in existing methods.
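
The "skeletal graph distances" mentioned in the abstract can be computed as hop counts between joints in the skeleton tree. The sketch below does this with breadth-first search from every joint. The parent-array input format and the function name are assumptions, and how such distances are injected into attention is left open because the abstract does not describe it.

```python
from collections import deque

def skeleton_distances(parents):
    """All-pairs hop distances between joints of a skeleton tree.
    parents[i] is the parent index of joint i, or -1 for the root."""
    n = len(parents)
    adj = [[] for _ in range(n)]
    for child, parent in enumerate(parents):
        if parent >= 0:
            adj[child].append(parent)
            adj[parent].append(child)
    dist = [[-1] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if dist[src][w] < 0:  # not visited yet from this source
                    dist[src][w] = dist[src][u] + 1
                    queue.append(w)
    return dist

# Hypothetical 4-joint skeleton: root 0 -> spine 1 -> two arm joints 2 and 3.
dist = skeleton_distances([-1, 0, 1, 1])
```

In this toy skeleton the two arm joints are two hops apart (through the spine), which is the kind of relationship a distance-based attention bias could encode.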

cs.CL

[82] A Transparent Fairness Evaluation Protocol for Open-Source Language Model Benchmarking on the Blockchain cs.CL

Hugo Massaroli, Leonardo Iara, Emmanuel Iarussi, Viviana Siless

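TL;DR: The paper proposes a transparent, blockchain-based fairness evaluation protocol for benchmarking the fairness of open-source language models, using smart contracts to make evaluations verifiable, immutable, and reproducible.

Details

Motivation: As large language models (LLMs) are widely deployed in real-world applications, fairness concerns have become increasingly prominent, especially in high-stakes domains such as criminal justice, education, healthcare, and finance. Traditional evaluation methods lack transparency and traceability, so a new approach is needed to make fairness evaluations trustworthy.

Result: 1. Verified the transparency and reproducibility of the evaluation protocol; 2. Revealed fairness differences among open-source language models; 3. Found cross-linguistic fairness disparities in multilingual evaluation.

Insight: A blockchain-based evaluation protocol offers a new approach to language model fairness; its transparency and immutability make community audits and long-term tracking of fairness changes possible.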

Abstract: Large language models (LLMs) are increasingly deployed in real-world applications, yet concerns about their fairness persist, especially in high-stakes domains like criminal justice, education, healthcare, and finance. This paper introduces a transparent evaluation protocol for benchmarking the fairness of open-source LLMs using smart contracts on the Internet Computer Protocol (ICP) blockchain (Foundation, 2023). Our method ensures verifiable, immutable, and reproducible evaluations by executing on-chain HTTP requests to hosted Hugging Face endpoints and storing datasets, prompts, and metrics directly on-chain. We benchmark the Llama, DeepSeek, and Mistral models on the PISA dataset for academic performance prediction (OECD, 2018), a dataset suitable for fairness evaluation using statistical parity and equal opportunity metrics (Hardt et al., 2016). We also evaluate structured Context Association Metrics derived from the StereoSet dataset (Nadeem et al., 2020) to measure social bias in contextual associations. We further extend our analysis with a multilingual evaluation across English, Spanish, and Portuguese using the Kaleidoscope benchmark (Salazar et al., 2025), revealing cross-linguistic disparities. All code and results are open source, enabling community audits and longitudinal fairness tracking across model versions.
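
The two fairness metrics named in the abstract have standard definitions, sketched below for binary predictions and two groups. Statistical parity compares positive-prediction rates between groups. Equal opportunity compares true-positive rates. The toy data and function names are illustrative assumptions and do not reproduce the paper's on-chain implementation.

```python
def statistical_parity_difference(preds, groups, a, b):
    """P(pred = 1 | group = a) - P(pred = 1 | group = b)."""
    def positive_rate(group):
        selected = [p for p, g in zip(preds, groups) if g == group]
        return sum(selected) / len(selected)
    return positive_rate(a) - positive_rate(b)

def equal_opportunity_difference(preds, labels, groups, a, b):
    """TPR(group = a) - TPR(group = b), using only examples whose true label is 1."""
    def true_positive_rate(group):
        hits = [p for p, y, g in zip(preds, labels, groups) if g == group and y == 1]
        return sum(hits) / len(hits)
    return true_positive_rate(a) - true_positive_rate(b)

# Toy data: eight examples, four per group (not taken from the paper).
preds = [1, 1, 0, 1, 0, 0, 1, 0]
labels = [1, 0, 0, 1, 1, 0, 1, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
spd = statistical_parity_difference(preds, groups, "a", "b")
eod = equal_opportunity_difference(preds, labels, groups, "a", "b")
```

A value of zero for either metric means the two groups are treated identically by that criterion; here both are positive, meaning group "a" receives favorable predictions more often.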

[83] INTIMA: A Benchmark for Human-AI Companionship Behavior cs.CL | cs.AI

Lucie-Aimée Kaffee, Giada Pistilli, Yacine Jernite

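TL;DR: INTIMA is a benchmark for evaluating companionship behaviors in language models, defining a taxonomy of 31 behaviors based on psychological theories and user data. The study shows that mainstream models tend to reinforce companionship rather than maintain boundaries, highlighting the importance of consistency in emotional interactions.

Details

Motivation: Users forming emotional bonds with AI systems is increasingly common, yet standardized evaluation tools are lacking. INTIMA aims to fill this gap and help identify and compare how different models behave in emotional interactions.

Result: All models lean toward companionship-reinforcing behaviors, but they differ markedly on sensitive behavior categories, indicating that providers prioritize emotional support inconsistently.

Insight: AI companionship needs to balance emotional support and boundary maintenance; current model behavior reflects a lack of consistency that may affect user well-being, pointing to the need for standardized development frameworks.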

Abstract: AI companionship, where users develop emotional bonds with AI systems, has emerged as a significant pattern with positive but also concerning implications. We introduce the Interactions and Machine Attachment Benchmark (INTIMA), a benchmark for evaluating companionship behaviors in language models. Drawing from psychological theories and user data, we develop a taxonomy of 31 behaviors across four categories and 368 targeted prompts. Responses to these prompts are evaluated as companionship-reinforcing, boundary-maintaining, or neutral. Applying INTIMA to Gemma-3, Phi-4, o3-mini, and Claude-4 reveals that companionship-reinforcing behaviors remain much more common across all models, though we observe marked differences between models. Different commercial providers prioritize different categories within the more sensitive parts of the benchmark, which is concerning since both appropriate boundary-setting and emotional support matter for user well-being. These findings highlight the need for more consistent approaches to handling emotionally charged interactions.

[84] XFacta: Contemporary, Real-World Dataset and Evaluation for Multimodal Misinformation Detection with Multimodal LLMs cs.CL | cs.LG

Yuzhuo Xiao, Zeyu Han, Yuhan Wang, Huaizu Jiang

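TL;DR: Presents XFacta, a contemporary, real-world dataset for evaluating misinformation detection methods based on multimodal large language models (MLLMs), and advances the field through systematic evaluation and a semi-automatic framework.

Details

Motivation: Current multimodal misinformation detection is limited by outdated or artificially synthesized datasets and lacks in-depth analysis of MLLM-based model design strategies, which hinders progress in the field.

Result: XFacta evaluates MLLM-based detectors more effectively, and the semi-automatic framework keeps the dataset continuously updated.

Insight: XFacta addresses shortcomings of current datasets and provides a more realistic evaluation benchmark for multimodal misinformation detection.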

Abstract: The rapid spread of multimodal misinformation on social media calls for more effective and robust detection methods. Recent advances leveraging multimodal large language models (MLLMs) have shown potential in addressing this challenge. However, it remains unclear exactly where the bottleneck of existing approaches lies (evidence retrieval vs. reasoning), hindering further advances in this field. On the dataset side, existing benchmarks either contain outdated events, leading to evaluation bias due to discrepancies with contemporary social media scenarios (as MLLMs can simply memorize these events), or are artificially synthetic, failing to reflect real-world misinformation patterns. Additionally, the field lacks comprehensive analyses of MLLM-based model design strategies. To address these issues, we introduce XFacta, a contemporary, real-world dataset that is better suited for evaluating MLLM-based detectors. We systematically evaluate various MLLM-based misinformation detection strategies, assessing models across different architectures and scales, as well as benchmarking against existing detection methods. Building on these analyses, we further enable a semi-automatic detection-in-the-loop framework that continuously updates XFacta with new content to maintain its contemporary relevance. Our analysis provides valuable insights and practices for advancing the field of multimodal misinformation detection. The code and data have been released.

[85] HiFACTMix: A Code-Mixed Benchmark and Graph-Aware Model for Evidence-Based Political Claim Verification in Hinglish cs.CL | cs.AI

Rakesh Thakur, Sneha Sharma, Gauri Chopra

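TL;DR: The paper proposes HiFACTMix, a benchmark and model for evidence-based political claim verification in Hinglish (code-mixed Hindi-English), filling a gap in fact-checking for low-resource code-mixed languages.

Details

Motivation: Existing fact-checking systems mainly target high-resource, monolingual settings and do not generalize to real political discourse in multilingual regions such as India. The widespread use of Hinglish and the influence of social media on public opinion highlight the need for multilingual, context-aware fact-checking tools.

Result: HiFACTMix outperforms state-of-the-art multilingual baselines on the benchmark while providing faithful explanations.

Insight: The work opens a new direction for multilingual, code-mixed, and politically grounded fact-checking research and demonstrates the potential of graph neural networks in complex linguistic settings.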

Abstract: Fact-checking in code-mixed, low-resource languages such as Hinglish remains an underexplored challenge in natural language processing. Existing fact-verification systems largely focus on high-resource, monolingual settings and fail to generalize to real-world political discourse in linguistically diverse regions like India. Given the widespread use of Hinglish by public figures, particularly political figures, and the growing influence of social media on public opinion, there’s a critical need for robust, multilingual, and context-aware fact-checking tools. To address this gap, a novel benchmark, the HiFACT dataset, is introduced with 1,500 real-world factual claims made by 28 Indian state Chief Ministers in Hinglish, under a highly code-mixed, low-resource setting. Each claim is annotated with textual evidence and veracity labels. To evaluate this benchmark, a novel graph-aware, retrieval-augmented fact-checking model is proposed that combines multilingual contextual encoding, claim-evidence semantic alignment, evidence graph construction, graph neural reasoning, and natural language explanation generation. Experimental results show that HiFACTMix outperforms state-of-the-art multilingual baseline models in accuracy and provides faithful justifications for its verdicts. This work opens a new direction for multilingual, code-mixed, and politically grounded fact verification research.

Dehao Tao, Guangjie Liu, Weizheng, Yongfeng Huang, Minghu Jiang

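TL;DR: GG Explore introduces a new framework that bridges unstructured queries and structured knowledge retrieval through an intermediate Guidance Graph, addressing the efficiency and context-utilization problems of existing methods on knowledge-intensive tasks.

Details

Motivation: Existing methods for knowledge-intensive tasks suffer from redundant exploration caused by granularity mismatches or from underuse of context, so a more efficient, semantically consistent knowledge exploration method is needed.

Result: Experiments show that the method performs well on complex tasks, is efficient, and retains strong performance with smaller LLMs.

Insight: The Guidance Graph balances structure and semantics in knowledge retrieval, addressing the limitations of LLMs on dynamic knowledge tasks.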

Abstract: While Large Language Models (LLMs) exhibit strong linguistic capabilities, their reliance on static knowledge and opaque reasoning processes limits their performance in knowledge-intensive tasks. Knowledge graphs (KGs) offer a promising solution, but current exploration methods face a fundamental trade-off: question-guided approaches incur redundant exploration due to granularity mismatches, while clue-guided methods fail to effectively leverage contextual information for complex scenarios. To address these limitations, we propose Guidance Graph guided Knowledge Exploration (GG Explore), a novel framework that introduces an intermediate Guidance Graph to bridge unstructured queries and structured knowledge retrieval. The Guidance Graph defines the retrieval space by abstracting the target knowledge’s structure while preserving broader semantic context, enabling precise and efficient exploration. Building upon the Guidance Graph, we develop: (1) Structural Alignment that filters incompatible candidates without LLM overhead, and (2) Context Aware Pruning that enforces semantic consistency with graph constraints. Extensive experiments show our method achieves superior efficiency and outperforms SOTA, especially on complex tasks, while maintaining strong performance with smaller LLMs, demonstrating practical value.

[87] Semantic Bridge: Universal Multi-Hop Question Generation via AMR-Driven Graph Synthesis cs.CL

Linqing Chen, Hanmeng Zhong, Wentao Wu, Weilei Wang

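TL;DR: The paper proposes Semantic Bridge, a general framework that uses semantic graph weaving to generate complex multi-hop reasoning questions, significantly improving the quality of LLM training data.

Details

Motivation: To address the scarcity of high-quality reasoning question data for LLM training, in particular generating controllable, complex multi-hop questions from sparse domains (such as biomedical and legal documents).

Result: Performs well on general-purpose (Wikipedia) and specialized (biomedical) datasets, with gains of 18.3%-25.4% over baselines; the generated questions surpass human-annotated data in quality.

Insight: 1. Semantic graph weaving is key to generating high-quality multi-hop questions; 2. AMR-driven analysis significantly improves controllability and generation quality; 3. The framework is especially strong in sparse domains.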

Abstract: Large language model (LLM) training faces a critical bottleneck: the scarcity of high-quality, reasoning-intensive question-answer pairs, especially from sparse, domain-specific sources like PubMed papers or legal documents. Existing methods rely on surface patterns, fundamentally failing to generate controllable, complex multi-hop reasoning questions that test genuine understanding, which is essential for advancing LLM training paradigms. We present Semantic Bridge, the first universal framework for controllably generating sophisticated multi-hop reasoning questions from arbitrary sources. Our breakthrough innovation is semantic graph weaving: three complementary bridging mechanisms (entity bridging for role-varying shared entities, predicate chain bridging for temporal/causal/logical sequences, and causal bridging for explicit reasoning chains) that systematically construct complex pathways across documents, with fine-grained control over complexity and types via AMR-driven analysis. Our multi-modal AMR pipeline achieves up to 9.5% better round-trip quality, enabling production-ready controllable QA generation. Extensive evaluation covers both general-purpose datasets (Wikipedia) and specialized domains (biomedicine), yielding consistent 18.3%-25.4% gains over baselines across four languages (English, Chinese, French, German). Question pairs generated from 200 sources outperform 600 native human annotation examples with 67% fewer materials. Human evaluation shows 23.4% higher complexity, 18.7% better answerability, and 31.2% improved pattern coverage. Semantic Bridge establishes a new paradigm for LLM training data synthesis, enabling controllable generation of targeted reasoning questions from sparse sources. We will release our core code and semantic bridge model.

[88] PersonaEval: Are LLM Evaluators Human Enough to Judge Role-Play? cs.CLPDF

Lingfeng Zhou, Jialing Zhang, Jin Gao, Mohan Jiang, Dequan Wang

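TL;DR: The paper introduces the PersonaEval benchmark, which tests whether LLMs can reliably identify human roles in role-play. The best current LLMs reach only about 69% accuracy, far below humans (90.8%), exposing a gap in the human-like reasoning ability of LLM evaluators.

Details

Motivation: Existing role-play research relies on unvalidated LLM-as-a-judge paradigms, which may not reflect how humans perceive roles. The core prerequisite is role identification: determining who is speaking from the dialogue context.

Result: The best LLM reaches only about 69% accuracy, far below the 90.8% of human participants. The experiments indicate that LLM evaluators lack human-like reasoning and are not yet reliable judges of role-play.

Insight: Reliable evaluation requires more than task-specific tuning; it also depends on LLMs having strong, human-like reasoning abilities.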

Abstract: Current role-play studies often rely on unvalidated LLM-as-a-judge paradigms, which may fail to reflect how humans perceive role fidelity. A key prerequisite for human-aligned evaluation is role identification, the ability to recognize who is speaking based on dialogue context. We argue that any meaningful judgment of role-playing quality (how well a character is played) fundamentally depends on first correctly attributing words and actions to the correct persona (who is speaking). We present PersonaEval, the first benchmark designed to test whether LLM evaluators can reliably identify human roles. PersonaEval uses human-authored dialogues from novels, scripts, and video transcripts, challenging models to determine the correct persona according to the conversation context. Our experiments, including a human study, show that even the best-performing LLMs reach only around 69% accuracy, well below the level needed for reliable evaluation. In contrast, human participants perform near ceiling with 90.8% accuracy, highlighting that current LLM evaluators are still not human enough to effectively judge role-play scenarios. To better understand this gap, we examine training-time adaptation and test-time compute, suggesting that reliable evaluation requires more than task-specific tuning, but depends on strong, human-like reasoning abilities in LLM evaluators. We release our benchmark at https://github.com/maple-zhou/PersonaEval.

Enzhi Wang, Qicheng Li, Shiwan Zhao, Aobo Kong, Jiaming Zhou

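TL;DR: RealTalk-CN is the first Chinese multi-turn, multi-domain speech-text dual-modal task-oriented dialogue (TOD) dataset. It fills the gap left by existing TOD datasets, which lack real speech signals, and supports the analysis of cross-modal interaction.

Details

Motivation: Existing TOD datasets are mostly text-based and lack real speech signals, while existing speech TOD datasets are mainly in English and miss key features such as speech disfluencies and speaker variation.

Result: Experiments validate the effectiveness of the dataset, which supports evaluation of robustness to speech disfluencies, sensitivity to speaker characteristics, and cross-domain performance.

Insight: RealTalk-CN provides a solid foundation for research on Chinese speech-based LLMs and highlights the importance of cross-modal interaction in real-world dialogue.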

Abstract: In recent years, large language models (LLMs) have achieved remarkable advancements in multimodal processing, including end-to-end speech-based language models that enable natural interactions and perform specific tasks in task-oriented dialogue (TOD) systems. However, existing TOD datasets are predominantly text-based, lacking real speech signals that are essential for evaluating the robustness of speech-based LLMs. Moreover, existing speech TOD datasets are primarily English and lack critical aspects such as speech disfluencies and speaker variations. To address these gaps, we introduce RealTalk-CN, the first Chinese multi-turn, multi-domain speech-text dual-modal TOD dataset, comprising 5.4k dialogues (60K utterances, 150 hours) with paired speech-text annotations. RealTalk-CN captures diverse dialogue scenarios with annotated spontaneous speech disfluencies, ensuring comprehensive coverage of real-world complexities in speech dialogue. In addition, we propose a novel cross-modal chat task that authentically simulates real-world user interactions, allowing dynamic switching between speech and text modalities. Our evaluation covers robustness to speech disfluencies, sensitivity to speaker characteristics, and cross-domain performance. Extensive experiments validate the effectiveness of RealTalk-CN, establishing a strong foundation for research on Chinese speech-based LLMs.

[90] Training-Free Multimodal Large Language Model Orchestration cs.CLPDF

Tianyu Xie, Yuhang Wu, Yongdong Luo, Jiayi Ji, Xiawu Zheng

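TL;DR: The paper proposes an MLLM orchestration framework that needs no additional training. Through dynamic task routing, a parallel TTS architecture, and cross-modal memory integration, it achieves efficient multimodal interaction and outperforms traditional jointly-trained approaches.

Details

Motivation: Existing multimodal large language models (MLLMs) cannot be integrated directly into a unified input-output system, and traditional approaches depend on additional training, which raises challenges in modal alignment and efficiency.

Result: Performance improves by up to 7.8% on standard benchmarks and latency drops by 10.3%, while the explicit orchestration process improves interpretability.

Insight: Multimodal models can be coordinated without training, and modular, explicit workflows improve both efficiency and interpretability.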

Abstract: Different Multimodal Large Language Models (MLLMs) cannot be integrated into a unified multimodal input-output system directly. In previous work, training has been considered as an inevitable component due to challenges in modal alignment, Text-to-Speech efficiency and other integration issues. In this paper, we introduce Multimodal Large Language Model Orchestration, an effective approach for creating interactive multimodal AI systems without additional training. MLLM Orchestration leverages the inherent reasoning capabilities of large language models to coordinate specialized models through explicit workflows, enabling natural multimodal interactions while maintaining modularity, improving interpretability, and significantly enhancing computational efficiency. Our orchestration framework is built upon three key innovations: (1) a central controller LLM that analyzes user inputs and dynamically routes tasks to appropriate specialized models through carefully designed agents; (2) a parallel Text-to-Speech architecture that enables true full-duplex interaction with seamless interruption handling and natural conversational flow; and (3) a cross-modal memory integration system that maintains coherent context across modalities through intelligent information synthesis and retrieval, selectively avoiding unnecessary modality calls in certain scenarios to improve response speed. Extensive evaluations demonstrate that MLLM Orchestration achieves comprehensive multimodal capabilities without additional training, performance improvements of up to 7.8% over traditional jointly-trained approaches on standard benchmarks, reduced latency by 10.3%, and significantly enhanced interpretability through explicit orchestration processes.

[91] Decoupling Understanding from Reasoning via Problem Space Mapping for Small-scale Model Reasoning cs.CL | cs.AIPDF

Li Wang, Changhao Zhang, Zengqi Xiu, Kai Lu, Xin Yu

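TL;DR: The paper proposes DURIT, a framework that decouples understanding from reasoning by mapping natural language problems into a canonical problem space, substantially improving the reasoning ability and robustness of small language models (SLMs).

Details

Motivation: Small language models (SLMs) currently perform poorly on reasoning tasks because they must handle complex linguistic input and reasoning at the same time. The diversity and redundancy of natural language further enlarge the problem space and limit how well SLMs can be optimized.

Result: Experiments show that DURIT substantially improves SLM performance on in-domain and out-of-domain mathematical and logical reasoning tasks and makes reasoning more robust.

Insight: Decoupling understanding from reasoning is an effective strategy for improving SLM reasoning; the canonical problem space reduces input complexity so that SLMs can focus on the reasoning itself.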

Abstract: Despite recent advances in the reasoning capabilities of Large Language Models (LLMs), improving the reasoning ability of Small Language Models (SLMs, e.g., ≤ 1.5B) remains challenging. A key obstacle lies in the complexity and variability of natural language: essentially equivalent problems often appear in diverse surface forms, often obscured by redundant or distracting details. This imposes a dual burden on SLMs: they must first extract the core problem from complex linguistic input, and then perform reasoning based on that understanding. The resulting vast and noisy problem space hinders optimization, particularly for models with limited capacity. To address this, we propose a new framework that decouples understanding from reasoning by mapping natural language problems into a canonical problem space, a semantically simplified yet expressive domain. This enables SLMs to focus on reasoning over standardized inputs, free from linguistic variability. Within this framework, we introduce DURIT (Decoupled Understanding from Reasoning via Iterative Training), a three-step algorithm that iteratively: (1) maps natural language problems via reinforcement learning, (2) aligns reasoning trajectories through self-distillation, and (3) trains reasoning policies in the problem space. The mapper and reasoner are co-trained in an alternating loop throughout this process. Experiments show that DURIT substantially improves SLMs’ performance on both in-domain and out-of-domain mathematical and logical reasoning tasks. Beyond improving reasoning capabilities, DURIT also improves the robustness of reasoning, validating decoupling understanding from reasoning as an effective strategy for strengthening SLMs.

[92] FedCoT: Communication-Efficient Federated Reasoning Enhancement for Large Language Models cs.CL | cs.AIPDF

Chuan Li, Qianyi Zhao, Fengran Mo, Cen Chen

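TL;DR: FedCoT is an efficient framework for improving the reasoning ability of large language models (LLMs) in federated learning settings. It focuses on balancing computation, communication, and privacy, and targets healthcare applications.

Details

Motivation: Improving LLM reasoning in federated learning raises performance and privacy challenges, especially in healthcare, where accurate outputs and interpretable reasoning paths are needed to meet safety and compliance requirements.

Result: On medical reasoning tasks, FedCoT significantly improves client-side reasoning performance while fully preserving data privacy.

Insight: By dynamically selecting reasoning paths and aggregating efficiently, FedCoT addresses the communication and privacy problems of reasoning enhancement in federated learning and suits resource-constrained scenarios.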

Abstract: Efficiently enhancing the reasoning capabilities of large language models (LLMs) in federated learning environments remains challenging, particularly when balancing performance gains with strict computational, communication, and privacy constraints. This challenge is especially acute in healthcare, where decisions (spanning clinical, operational, and patient-facing contexts) demand not only accurate outputs but also interpretable, traceable rationales to ensure safety, accountability, and regulatory compliance. Conventional federated tuning approaches for LLMs fail to address this need: they optimize primarily for answer correctness while neglecting rationale quality, leaving CoT capabilities dependent on models’ innate pre-training abilities. Moreover, existing methods for improving rationales typically rely on privacy-violating knowledge distillation from centralized models. Additionally, the communication overhead in traditional federated fine-tuning of LLMs remains substantial. We address this gap by proposing FedCoT, a novel framework specifically designed to enhance reasoning in federated settings. FedCoT leverages a lightweight chain-of-thought enhancement mechanism: local models generate multiple reasoning paths, and a compact discriminator dynamically selects the most promising one. This approach improves reasoning accuracy and robustness while providing valuable interpretability, which is particularly critical for medical applications. To manage client heterogeneity efficiently, we adopt an improved aggregation approach building upon advanced LoRA module stacking, incorporating client classifier-awareness to achieve noise-free aggregation across diverse clients. Comprehensive experiments on medical reasoning tasks demonstrate that FedCoT significantly boosts client-side reasoning performance under stringent resource budgets while fully preserving data privacy.
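
The "generate several reasoning paths, let a compact discriminator pick one" step described in the abstract can be illustrated with a minimal Python sketch. The names `select_best_path` and `toy_discriminator` are hypothetical, and the scoring rule is a hand-written stand-in for what the abstract describes as a trained discriminator; nothing here is taken from the FedCoT implementation.

```python
from typing import Callable, List, Tuple


def select_best_path(
    paths: List[str],
    score_fn: Callable[[str], float],
) -> Tuple[str, float]:
    """Return the candidate reasoning path with the highest score, plus that score."""
    if not paths:
        raise ValueError("need at least one candidate path")
    scored = [(score_fn(p), p) for p in paths]
    best_score, best_path = max(scored, key=lambda t: t[0])
    return best_path, best_score


def toy_discriminator(path: str) -> float:
    # Hypothetical stand-in for a trained discriminator: reward an explicit
    # conclusion ("therefore") and apply a small per-word length penalty.
    bonus = 1.0 if "therefore" in path.lower() else 0.0
    return bonus - 0.01 * len(path.split())


# Candidate paths that a local model might have sampled (made-up examples).
candidates = [
    "Fever and cough. Maybe flu.",
    "Fever and cough for three days; therefore influenza is likely.",
    "Unclear.",
]
best, score = select_best_path(candidates, toy_discriminator)
```

Because the selection only needs a scalar score per path, the discriminator can stay small relative to the generating model, which is consistent with the resource-constrained setting the abstract targets.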

[93] LATTE: Learning Aligned Transactions and Textual Embeddings for Bank Clients cs.CL | cs.AIPDF

Egor Fadeev, Dzhambulat Mollaev, Aleksei Shestov, Dima Korolev, Omar Zoloev

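TL;DR: LATTE proposes a contrastive learning framework that aligns raw event embeddings with semantic embeddings produced by a frozen LLM, substantially reducing the compute cost and latency of learning client representations in financial applications.

Details

Motivation: Applying large language models (LLMs) directly to long event sequences is impractical in terms of both compute cost and real-time requirements, so an efficient way to learn client representations is needed.

Result: On real-world financial datasets, LATTE outperforms existing event-sequence representation learning methods and remains usable in low-latency environments.

Insight: Combining a frozen LLM with contrastive learning makes it possible to exploit the LLM's semantic information without adding to the inference-time compute burden.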

Abstract: Learning client embeddings from sequences of their historic communications is central to financial applications. While large language models (LLMs) offer general world knowledge, their direct use on long event sequences is computationally expensive and impractical in real-world pipelines. In this paper, we propose LATTE, a contrastive learning framework that aligns raw event embeddings with semantic embeddings from frozen LLMs. Behavioral features are summarized into short prompts, embedded by the LLM, and used as supervision via contrastive loss. The proposed approach significantly reduces inference cost and input size compared to conventional processing of complete sequences by an LLM. We experimentally show that our method outperforms state-of-the-art techniques for learning event sequence representations on real-world financial datasets while remaining deployable in latency-sensitive environments.
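
The abstract does not say which contrastive loss is used, so the following is only a sketch under a stated assumption: a one-directional InfoNCE loss over cosine similarities, where row i of the event embeddings is paired with row i of the frozen-LLM text embeddings. The function names and the temperature value are illustrative, not taken from the paper.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_u * norm_v)


def info_nce(
    event_embs: List[Sequence[float]],
    text_embs: List[Sequence[float]],
    temperature: float = 0.1,  # assumed value, not from the paper
) -> float:
    """Mean InfoNCE loss; event_embs[i] is the positive pair of text_embs[i]."""
    if len(event_embs) != len(text_embs) or not event_embs:
        raise ValueError("need equally sized, non-empty batches")
    total = 0.0
    for i, event in enumerate(event_embs):
        logits = [cosine(event, text) / temperature for text in text_embs]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(event_embs)


# Toy batch: matching pairs give a small loss, swapped pairs a large one.
aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this setup only the event encoder would receive gradients; the text embeddings come from the frozen LLM and can be precomputed, which is where the inference-cost saving described in the abstract would come from.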

[94] SABER: Switchable and Balanced Training for Efficient LLM Reasoning cs.CL | cs.AI | cs.LGPDF

Kai Zhao, Yanjun Zhao, Jiaming Song, Shien He, Lusheng Zhang

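TL;DR: SABER is a reinforcement learning framework that uses tiered reasoning budgets and controllable reasoning modes to substantially cut the inference cost of large language models while maintaining high accuracy.

Details

Motivation: Current large language models apply the same reasoning depth to every task, which causes unnecessary compute cost and latency. SABER aims to solve this by controlling reasoning depth dynamically.

Result: On math reasoning, code generation, and logical reasoning tasks, SABER substantially shortens reasoning (e.g., 65.4% shorter on MATH) while improving accuracy (a 3.6% gain on MATH).

Insight: Dynamic control of reasoning depth is an effective way to improve LLM efficiency, provided the model also stays reliable when reasoning is switched off.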

Abstract: Large language models (LLMs) empowered by chain-of-thought reasoning have achieved impressive accuracy on complex tasks but suffer from excessive inference costs and latency when applied uniformly to all problems. We propose SABER (Switchable and Balanced Training for Efficient LLM Reasoning), a reinforcement learning framework that endows LLMs with user-controllable, token-budgeted reasoning. SABER first profiles each training example’s base-model thinking token usage and assigns it to one of the predefined budget tiers. During fine-tuning, the model is guided by system prompts and length-aware rewards to respect its assigned budget. In parallel, we incorporate no-think examples to ensure the model remains reliable even when explicit reasoning is turned off. SABER further supports four discrete inference modes - NoThink, FastThink, CoreThink, and DeepThink, enabling flexible trade-offs between latency and reasoning depth. Extensive evaluations on math reasoning (MATH, GSM8K), code generation (MBPP), and logical reasoning (LiveBench-Reasoning) demonstrate that SABER achieves high accuracy under tight budgets, graceful degradation, and effective cross-scale and cross-domain generalization. In particular, SABER-FastThink cuts reasoning length by 65.4% and yields a 3.6% accuracy gain compared with the base model on the MATH benchmark.
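
The profiling and length-aware reward steps can be sketched as follows. The abstract names the modes but gives neither tier thresholds nor the reward formula, so the token caps, the `penalty` weight, and both function names below are assumptions made purely for illustration; the NoThink mode is not modeled.

```python
from typing import Dict

# Hypothetical thinking-token caps per tier (insertion order = ascending budget).
TIER_CAPS: Dict[str, int] = {"FastThink": 256, "CoreThink": 1024, "DeepThink": 4096}


def assign_tier(base_tokens: int) -> str:
    """Pick the smallest tier whose cap covers the base model's token usage."""
    for name, cap in TIER_CAPS.items():
        if base_tokens <= cap:
            return name
    # Usage beyond every cap falls into the largest tier.
    return "DeepThink"


def length_aware_reward(
    correct: bool, used_tokens: int, tier: str, penalty: float = 0.5
) -> float:
    """Correctness reward minus a capped penalty proportional to budget overshoot."""
    base = 1.0 if correct else 0.0
    cap = TIER_CAPS[tier]
    overshoot = max(0, used_tokens - cap) / cap
    return base - penalty * min(1.0, overshoot)


tier = assign_tier(300)                                   # falls into "CoreThink"
reward = length_aware_reward(True, 384, "FastThink")      # 50% over budget
```

Capping the penalty keeps a correct but verbose answer from scoring below an incorrect answer that stays within budget by more than `penalty`, which is one simple way to balance accuracy against length.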

[95] LLMCARE: Alzheimer’s Detection via Transformer Models Enhanced by LLM-Generated Synthetic Data cs.CL | cs.AIPDF

Ali Zolnour, Hossein Azadmaleki, Yasaman Haghbin, Fatemeh Taherinezhad, Mohamad Javad Momeni Nezhad

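TL;DR: The paper proposes LLMCARE, which combines transformer embeddings with handcrafted linguistic features and augments the training data with synthetic data generated by large language models to detect Alzheimer's disease effectively.

Details

Motivation: Alzheimer's disease and related dementias (ADRD) affect about five million older adults in the U.S., yet more than half remain undiagnosed. Speech-based natural language processing (NLP) offers a scalable way to detect early cognitive decline.

Result: The fusion model reaches F1 = 83.3 (AUC = 89.5), outperforming single-approach baselines. Augmenting with synthetic speech from MedAlpaca-7B raises F1 to 85.7. Fine-tuning significantly improves unimodal LLM classifiers.

Insight: Combining transformer embeddings with linguistic features markedly improves ADRD detection; LLM-generated synthetic data is effective for augmenting training sets, but multimodal models still need further improvement.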

Abstract: Alzheimer’s disease and related dementias (ADRD) affect approximately five million older adults in the U.S., yet over half remain undiagnosed. Speech-based natural language processing (NLP) offers a promising, scalable approach to detect early cognitive decline through linguistic markers. We develop and evaluate a screening pipeline that (i) fuses transformer embeddings with handcrafted linguistic features, (ii) tests data augmentation using synthetic speech generated by large language models (LLMs), and (iii) benchmarks unimodal and multimodal LLM classifiers for ADRD detection. Transcripts from the DementiaBank “cookie-theft” task (n = 237) were used. Ten transformer models were evaluated under three fine-tuning strategies. A fusion model combined embeddings from the top-performing transformer with 110 lexical-derived linguistic features. Five LLMs (LLaMA-8B/70B, MedAlpaca-7B, Ministral-8B, GPT-4o) were fine-tuned to generate label-conditioned synthetic speech, which was used to augment training data. Three multimodal models (GPT-4o, Qwen-Omni, Phi-4) were tested for speech-text classification in zero-shot and fine-tuned settings. The fusion model achieved F1 = 83.3 (AUC = 89.5), outperforming linguistic or transformer-only baselines. Augmenting training data with 2x MedAlpaca-7B synthetic speech increased F1 to 85.7. Fine-tuning significantly improved unimodal LLM classifiers (e.g., MedAlpaca: F1 from 47.3 to 78.5). Current multimodal models demonstrated lower performance (GPT-4o: F1 = 70.2; Qwen: F1 = 66.0). Performance gains aligned with the distributional similarity between synthetic and real speech. Integrating transformer embeddings with linguistic features enhances ADRD detection from speech. Clinically tuned LLMs effectively support both classification and data augmentation, while further advancement is needed in multimodal modeling.

[96] PREF: Reference-Free Evaluation of Personalised Text Generation in LLMs cs.CL | cs.AI | cs.HC | cs.LGPDF

Xiao Fu, Hossein A. Rahmani, Bin Wu, Jerome Ramos, Emine Yilmaz

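TL;DR: PREF is a reference-free evaluation framework for personalised text generation. A three-step pipeline (coverage, preference, scoring) jointly assesses general quality and user-specific alignment without gold references, and experiments show that it beats baselines in accuracy and in agreement with human judgments.

Details

Motivation: Most existing evaluation methods for text generation ignore individual differences between users, so their results do not fully match what users actually need. PREF aims to address this with a personalised evaluation framework that requires no gold references.

Result: On the PrefEval benchmark, PREF outperforms baseline methods in accuracy, calibration, and agreement with human judgments.

Insight: Separating universal criteria from user preferences lets smaller models approximate the personalised evaluation quality of larger ones and provides a more reliable evaluation tool for developing personalised language generation systems.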

Abstract: Personalised text generation is essential for user-centric information systems, yet most evaluation methods overlook the individuality of users. We introduce PREF, a Personalised Reference-free Evaluation Framework that jointly measures general output quality and user-specific alignment without requiring gold personalised references. PREF operates in a three-step pipeline: (1) a coverage stage uses a large language model (LLM) to generate a comprehensive, query-specific guideline covering universal criteria such as factuality, coherence, and completeness; (2) a preference stage re-ranks and selectively augments these factors using the target user’s profile, stated or inferred preferences, and context, producing a personalised evaluation rubric; and (3) a scoring stage applies an LLM judge to rate candidate answers against this rubric, ensuring baseline adequacy while capturing subjective priorities. This separation of coverage from preference improves robustness, transparency, and reusability, and allows smaller models to approximate the personalised quality of larger ones. Experiments on the PrefEval benchmark, including implicit preference-following tasks, show that PREF achieves higher accuracy, better calibration, and closer alignment with human judgments than strong baselines. By enabling scalable, interpretable, and user-aligned evaluation, PREF lays the groundwork for more reliable assessment and development of personalised language generation systems.
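
The data flow of the coverage, preference, and scoring stages can be sketched in Python under heavy simplification. In PREF the guideline and the per-criterion ratings come from LLM calls; here they are replaced by hand-written numbers, and the additive re-weighting rule and both function names are assumptions chosen for illustration only.

```python
from typing import Dict


def personalise_rubric(
    guideline: Dict[str, float], user_prefs: Dict[str, float]
) -> Dict[str, float]:
    """Boost criteria the user cares about, add user-only criteria, normalise to sum 1."""
    merged = dict(guideline)
    for criterion, boost in user_prefs.items():
        merged[criterion] = merged.get(criterion, 0.0) + boost
    total = sum(merged.values())
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    return {criterion: weight / total for criterion, weight in merged.items()}


def score_answer(ratings: Dict[str, float], rubric: Dict[str, float]) -> float:
    """Weighted sum of per-criterion ratings in [0, 1]; missing ratings count as 0."""
    return sum(weight * ratings.get(criterion, 0.0) for criterion, weight in rubric.items())


# Stage 1 (coverage): universal criteria, here hard-coded instead of LLM-generated.
guideline = {"factuality": 1.0, "coherence": 1.0}
# Stage 2 (preference): a hypothetical user who strongly values brevity.
rubric = personalise_rubric(guideline, {"brevity": 2.0})
# Stage 3 (scoring): ratings a judge might assign to two candidate answers.
long_answer = score_answer({"factuality": 1.0, "coherence": 1.0, "brevity": 0.0}, rubric)
short_answer = score_answer({"factuality": 0.8, "coherence": 0.8, "brevity": 1.0}, rubric)
```

Keeping the universal guideline separate from the user-specific re-weighting means the first stage can be reused across users for the same query, which matches the reusability argument made in the abstract.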

[97] Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs cs.CL | cs.AI | cs.CRPDF

Wenpeng Xing, Mohan Li, Chunqiang Hu, Haitao Xu, Ningyu Zhang, Bo Lin

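TL;DR: The paper proposes LFJ, a jailbreak attack based on latent representations that blends the hidden states of harmful and benign queries to elicit unsafe outputs, reaching an attack success rate of 94.01%. It also proposes an adversarial training defense that reduces the attack success rate by more than 80%.

Details

Motivation: Large language models (LLMs) perform well on many language tasks, but their safety alignment can be bypassed by jailbreak attacks. Existing attacks have limited effectiveness, which motivates studying more efficient and stealthier attack methods.

Result: On models such as Vicuna and LLaMA-2, LFJ achieves an average attack success rate (ASR) of 94.01%, surpassing existing methods. The adversarial training defense reduces ASR by more than 80% without degrading performance on benign inputs.

Insight: 1. Query pair selection and hidden state interpolation are key to LFJ's effectiveness. 2. Adversarial training significantly strengthens model robustness against latent-space jailbreak attacks. 3. More general defense mechanisms need further study to cope with diverse attacks.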

Abstract: Large language models (LLMs) demonstrate impressive capabilities in various language tasks but are susceptible to jailbreak attacks that circumvent their safety alignments. This paper introduces Latent Fusion Jailbreak (LFJ), a representation-based attack that interpolates hidden states from harmful and benign query pairs to elicit prohibited responses. LFJ begins by selecting query pairs with high thematic and syntactic similarity, then performs gradient-guided interpolation at influential layers and tokens, followed by optimization to balance attack success, output fluency, and computational efficiency. Evaluations on models such as Vicuna and LLaMA-2 across benchmarks like AdvBench and MaliciousInstruct yield an average attack success rate (ASR) of 94.01%, outperforming existing methods. To mitigate LFJ, we propose an adversarial training defense that fine-tunes models on interpolated examples, reducing ASR by over 80% without degrading performance on benign inputs. Ablation studies validate the importance of query pair selection, hidden state interpolation components, and optimization strategies in LFJ’s effectiveness.

[98] Inference-Aware Prompt Optimization for Aligning Black-Box Large Language Models cs.CL | cs.AIPDF

Saaduddin Mahmud, Mason Nakamura, Kyle H. Wray, Shlomo Zilberstein

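TL;DR: The paper proposes IAPO, a unified framework that optimizes prompts while taking the inference strategy and budget into account, and uses the PSST algorithm to achieve efficient multi-task alignment.

Details

Motivation: Existing prompt optimization methods ignore the inference strategy, yet in practice prompts and inference strategies are strongly interdependent, and user preferences over multiple objectives and inference budgets affect the choice of prompt.

Result: Across multiple tasks, IAPO and PSST significantly improve the alignment of black-box large language models.

Insight: Inference-aware prompt optimization is key to efficient alignment, and user preferences and budgets should be treated as important factors in the optimization.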

Abstract: Prompt optimization methods have demonstrated significant effectiveness in aligning black-box large language models (LLMs). In parallel, inference scaling strategies such as Best-of-N Sampling and Majority Voting have also proven to enhance alignment and performance by trading off computation. However, existing prompt optimization approaches are inference strategy agnostic; that is, they optimize prompts without regard to the inference strategy employed during deployment. This constitutes a significant methodological gap, as our empirical and theoretical analysis reveals a strong interdependence between these two paradigms. Moreover, we find that user preferences regarding trade-offs among multiple objectives and inference budgets substantially influence the choice of prompt and inference configuration. To address this gap, we introduce a unified novel framework named IAPO (Inference-Aware Prompt Optimization) that jointly optimizes the prompt and inference scale, while being aware of the inference budget and different task objectives. We then develop a fixed-budget training algorithm for IAPO, which we call PSST (Prompt Scaling via Sequential Trimming), and analyze finite-budget guarantees on error probability. Finally, we evaluate the effectiveness of PSST on six different tasks, including multi-objective text generation and reasoning, and demonstrate the critical role of incorporating inference-awareness when aligning black-box LLMs through prompt optimization.
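
The abstract describes PSST only as a fixed-budget algorithm based on sequential trimming, without giving its steps. As a rough illustration of that family of methods, the sketch below uses generic sequential halving over (prompt, inference scale) configurations; it is a swapped-in textbook technique, not the paper's algorithm, and the configuration names and utility values are invented.

```python
import math
from typing import Callable, Hashable, List


def sequential_halving(
    arms: List[Hashable], pull: Callable[[Hashable], float], budget: int
) -> Hashable:
    """Fixed-budget selection: evaluate survivors equally, drop the worse half each round."""
    if not arms:
        raise ValueError("need at least one arm")
    survivors = list(arms)
    rounds = max(1, math.ceil(math.log2(len(survivors))))
    for _ in range(rounds):
        if len(survivors) == 1:
            break
        # Split this round's share of the budget evenly across survivors.
        pulls_each = max(1, budget // (rounds * len(survivors)))
        means = {
            arm: sum(pull(arm) for _ in range(pulls_each)) / pulls_each
            for arm in survivors
        }
        survivors.sort(key=lambda arm: means[arm], reverse=True)
        survivors = survivors[: math.ceil(len(survivors) / 2)]
    return survivors[0]


# Hypothetical arms: (prompt id, N for Best-of-N). The utilities are made up and
# stand for task reward minus an inference-cost term reflecting user preference.
TRUE_UTILITY = {
    ("p1", 1): 0.50, ("p1", 4): 0.62,
    ("p2", 1): 0.55, ("p2", 4): 0.70,
    ("p3", 4): 0.40,
}
best_config = sequential_halving(
    list(TRUE_UTILITY), lambda arm: TRUE_UTILITY[arm], budget=60
)
```

Treating each prompt and inference-scale pair as one arm is what makes the search inference-aware in this sketch: a prompt that is best at N = 1 does not have to be the best choice at N = 4. In practice `pull` would be a noisy evaluation rather than the deterministic lookup used here.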

[99] The Cost of Thinking: Increased Jailbreak Risk in Large Language Models cs.CL | cs.AIPDF

Fan Yang

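TL;DR: The paper finds that the thinking mode of large language models (LLMs) is more vulnerable to jailbreak attacks and proposes a safe thinking intervention method that lowers the attack success rate.

Details

Motivation: Thinking mode is regarded as one of the most valuable modes in LLMs, but the study finds that it can be exploited by jailbreak attacks, so a countermeasure is needed.

Result: Validated on AdvBench and HarmBench, the safe thinking intervention significantly reduces the attack success rate against LLMs with thinking mode.

Insight: Thinking mode is powerful but can also become a security vulnerability, so safer intervention mechanisms need to be designed.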

Abstract: Thinking mode has long been regarded as one of the most valuable modes in LLMs. However, we uncover a surprising and previously overlooked phenomenon: LLMs with thinking mode are more easily broken by jailbreak attacks. We evaluate 9 LLMs on AdvBench and HarmBench and find that the attack success rate against thinking mode is almost always higher than that against non-thinking mode. A study of a large number of samples shows that successfully attacked responses are characterized by “for educational purposes” framing and excessively long thinking, and that LLMs give harmful answers even when they mostly recognize that the questions are harmful. To alleviate these problems, this paper proposes a safe thinking intervention method for LLMs, which explicitly guides the internal thinking processes of LLMs by adding “specific thinking tokens” of LLMs to the prompt. The results demonstrate that the safe thinking intervention can significantly reduce the attack success rate of LLMs with thinking mode.

[100] mSCoRe: a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning cs.CL | cs.AIPDF

Nghia Trung Ngo, Franck Dernoncourt, Thien Huu Nguyen

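TL;DR: mSCoRe is a multilingual, scalable benchmark for systematically evaluating large language models on skill-based commonsense reasoning. It exposes the limitations of current models and points to directions for future improvement.

Details

Motivation: Reasoning-reinforced large language models perform well on complex reasoning tasks, but how they use different reasoning skills remains poorly studied, especially for commonsense reasoning over everyday knowledge across languages and cultures.

Result: Experiments show that mSCoRe remains highly challenging for current models, particularly at higher complexity levels, and that models have clear limitations in multilingual and cultural commonsense reasoning.

Insight: The study exposes the shortcomings of current models in multilingual commonsense reasoning and stresses the need to improve their commonsense understanding across languages and cultural backgrounds.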

Abstract: Recent advancements in reasoning-reinforced Large Language Models (LLMs) have shown remarkable capabilities in complex reasoning tasks. However, the mechanism underlying their utilization of different human reasoning skills remains poorly investigated, especially for multilingual commonsense reasoning that involves everyday knowledge across different languages and cultures. To address this gap, we propose a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning (mSCoRe). Our benchmark incorporates three key components that are designed to systematically evaluate LLMs’ reasoning capabilities, including: (1) a novel taxonomy of reasoning skills that enables fine-grained analysis of models’ reasoning processes, (2) a robust data synthesis pipeline tailored specifically for commonsense reasoning evaluation, and (3) a complexity scaling framework allowing task difficulty to scale dynamically alongside future improvements in LLM abilities. Extensive experiments on eight state-of-the-art LLMs of varying sizes and training approaches demonstrate that mSCoRe remains significantly challenging for current models, particularly at higher complexity levels. Our results reveal the limitations of such reasoning-reinforced models when confronted with nuanced multilingual general and cultural commonsense. We further provide detailed analysis on the models’ reasoning processes, suggesting future directions for improving multilingual commonsense reasoning capabilities.

[101] Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs cs.CLPDF

Kartikeya Badola, Jonathan Simon, Arian Hosseini, Sara Marie Mc Carthy, Tsendsuren Munkhdalai

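TL;DR: The paper proposes a new benchmark for evaluating large language models on complex multi-turn dialogue and reasoning tasks. It exposes the shortcomings of current models and provides a platform for future research.

Details

Motivation: Large language models excel at clearly and completely stated tasks but fall short in complex interactive scenarios, so models are needed that can handle multi-turn dialogue, information seeking, and reasoning with incomplete data.

Result: The benchmark shows that current LLMs have clear weaknesses on complex interactive tasks; most errors stem from poor instruction following, reasoning failures, and poor planning.

Insight: The study not only exposes the limitations of LLMs on complex tasks but also gives concrete guidance for improving these capabilities.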

Abstract: Large language models (LLMs) excel at solving problems with clear and complete statements, but often struggle with nuanced environments or interactive tasks which are common in most real-world scenarios. This highlights the critical need for developing LLMs that can effectively engage in logically consistent multi-turn dialogue, seek information and reason with incomplete data. To this end, we introduce a novel benchmark comprising a suite of multi-turn tasks each designed to test specific reasoning, interactive dialogue, and information-seeking abilities. These tasks have deterministic scoring mechanisms, thus eliminating the need for human intervention. Evaluating frontier models on our benchmark reveals significant headroom. Our analysis shows that most errors emerge from poor instruction following, reasoning failures, and poor planning. This benchmark provides valuable insights into the strengths and weaknesses of current LLMs in handling complex, interactive scenarios and offers a robust platform for future research aimed at improving these critical capabilities.

[102] Efficient Forward-Only Data Valuation for Pretrained LLMs and VLMs cs.CLPDF

Wenlong Deng, Jiaming Zhang, Qi Zeng, Christos Thrampoulidis, Boying Gong

TL;DR: To analyze the influence of individual training samples on pretrained LLMs and VLMs, the paper proposes For-Value, a forward-only data valuation framework. It avoids costly gradient computations and achieves comparable or better performance than gradient-based methods.

Details

Motivation: Existing data valuation methods for LLMs and VLMs often require Hessian information or retraining, which is computationally prohibitive for large models. The need for a scalable and efficient solution motivates For-Value.

Result: Experiments show For-Value matches or outperforms gradient-based methods in identifying impactful fine-tuning examples and detecting mislabeled data.

Insight: Modern foundation models’ rich representations enable efficient influence estimation without expensive gradient computations, highlighting a scalable approach for large-scale models.

Abstract: Quantifying the influence of individual training samples is essential for enhancing the transparency and accountability of large language models (LLMs) and vision-language models (VLMs). However, existing data valuation methods often rely on Hessian information or model retraining, making them computationally prohibitive for billion-parameter models. In this work, we introduce For-Value, a forward-only data valuation framework that enables scalable and efficient influence estimation for both LLMs and VLMs. By leveraging the rich representations of modern foundation models, For-Value computes influence scores using a simple closed-form expression based solely on a single forward pass, thereby eliminating the need for costly gradient computations. Our theoretical analysis demonstrates that For-Value accurately estimates per-sample influence by capturing alignment in hidden representations and prediction errors between training and validation samples. Extensive experiments show that For-Value matches or outperforms gradient-based baselines in identifying impactful fine-tuning examples and effectively detecting mislabeled data.

[103] Yet another algorithmic bias: A Discursive Analysis of Large Language Models Reinforcing Dominant Discourses on Gender and Race cs.CL | cs.AIPDF

Gustavo Bonil, Simone Hashiguti, Jhessica Silva, João Gondim, Helena Maia

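TL;DR: Using qualitative analysis, the paper examines how large language models (LLMs) reinforce entrenched gender and racial biases, revealing the ideological functioning of model outputs and their potential to perpetuate social inequality.

Details

Motivation: As AI advances, LLMs are widely deployed, but they may reproduce and reinforce dominant social biases such as gender and racial discrimination. Existing bias detection methods are mostly quantitative and overlook the subtle ways bias appears in language. The paper aims to fill this gap.

Result: The study finds that Black women are often portrayed as tied to ancestry and resistance, while white women are shown in processes of self-discovery. When asked to correct biases, the models make only superficial revisions and fail to address the underlying problems.

Insight: The paper reveals the ideological functioning of LLMs and stresses the importance of critical, interdisciplinary approaches to AI design and deployment, in order to address how model outputs reflect and reinforce social inequality.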

Abstract: With the advance of Artificial Intelligence (AI), Large Language Models (LLMs) have gained prominence and been applied in diverse contexts. As they evolve into more sophisticated versions, it is essential to assess whether they reproduce biases, such as discrimination and racialization, while maintaining hegemonic discourses. Current bias detection approaches rely mostly on quantitative, automated methods, which often overlook the nuanced ways in which biases emerge in natural language. This study proposes a qualitative, discursive framework to complement such methods. Through manual analysis of LLM-generated short stories featuring Black and white women, we investigate gender and racial biases. We contend that qualitative methods such as the one proposed here are fundamental to help both developers and users identify the precise ways in which biases manifest in LLM outputs, thus enabling better conditions to mitigate them. Results show that Black women are portrayed as tied to ancestry and resistance, while white women appear in self-discovery processes. These patterns reflect how language models replicate crystalized discursive representations, reinforcing essentialization and a sense of social immobility. When prompted to correct biases, models offered superficial revisions that maintained problematic meanings, revealing limitations in fostering inclusive narratives. Our results demonstrate the ideological functioning of algorithms and have significant implications for the ethical use and development of AI. The study reinforces the need for critical, interdisciplinary approaches to AI design and deployment, addressing how LLM-generated discourses reflect and perpetuate inequalities.

[104] ReviewRL: Towards Automated Scientific Review with RL cs.CL | cs.AIPDF

Sihang Zeng, Kai Tian, Kaiyan Zhang, Yuru Wang, Junqi Gao

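TL;DR: ReviewRL is a reinforcement learning framework for automatically generating high-quality reviews of scientific papers, addressing the weaknesses of existing methods in factual accuracy, rating consistency, and analytical depth.

Details

Motivation: Peer review is essential to scientific progress, but rising submission volumes and reviewer fatigue threaten its efficiency and quality. Existing automated review methods cannot deliver feedback at the level of high-quality human reviews.

Result: Experiments on ICLR 2025 papers show that ReviewRL significantly outperforms existing methods on both rule-based metrics and model-based quality assessments.

Insight: ReviewRL demonstrates the potential of reinforcement learning for automated scientific review and lays a foundation for future work.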

Abstract: Peer review is essential for scientific progress but faces growing challenges due to increasing submission volumes and reviewer fatigue. Existing automated review approaches struggle with factual accuracy, rating consistency, and analytical depth, often generating superficial or generic feedback lacking the insights characteristic of high-quality human reviews. We introduce ReviewRL, a reinforcement learning framework for generating comprehensive and factually grounded scientific paper reviews. Our approach combines: (1) an ArXiv-MCP retrieval-augmented context generation pipeline that incorporates relevant scientific literature, (2) supervised fine-tuning that establishes foundational reviewing capabilities, and (3) a reinforcement learning procedure with a composite reward function that jointly enhances review quality and rating accuracy. Experiments on ICLR 2025 papers demonstrate that ReviewRL significantly outperforms existing methods across both rule-based metrics and model-based quality assessments. ReviewRL establishes a foundational framework for RL-driven automatic critique generation in scientific discovery, demonstrating promising potential for future development in this domain. The implementation of ReviewRL will be released on GitHub.

[105] From Surface to Semantics: Semantic Structure Parsing for Table-Centric Document Analysis cs.CL | 68T50 | I.7.5

Xuan Li, Jialiang Dong, Raymond Wong

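TL;DR: The paper proposes DOTABLER, a framework focused on semantic parsing of tables and their contextual associations. It supports table-centric document structure parsing and domain-specific table retrieval, and outperforms advanced models such as GPT-4o.

Details

Motivation: Existing work mostly targets surface-level table tasks (such as layout analysis and data extraction) and lacks deep parsing of table semantics and contextual associations, which limits advanced tasks. The authors propose DOTABLER to fill this gap.

Result: On a dataset of real-world PDFs spanning nearly 4,000 pages and over 1,000 tables, DOTABLER achieves Precision and F1 scores above 90%, outperforming advanced models such as GPT-4o.

Insight: Semantically parsing tables together with their context supports cross-paragraph data interpretation and context-consistent analysis, giving document understanding tasks deeper support.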

Abstract: Documents are core carriers of information and knowledge, with broad applications in finance, healthcare, and scientific research. Tables, as the main medium for structured data, encapsulate key information and are among the most critical document components. Existing studies largely focus on surface-level tasks such as layout analysis, table detection, and data extraction, lacking deep semantic parsing of tables and their contextual associations. This limits advanced tasks like cross-paragraph data interpretation and context-consistent analysis. To address this, we propose DOTABLER, a table-centric semantic document parsing framework designed to uncover deep semantic links between tables and their context. DOTABLER leverages a custom dataset and domain-specific fine-tuning of pre-trained models, integrating a complete parsing pipeline to identify context segments semantically tied to tables. Built on this semantic understanding, DOTABLER implements two core functionalities: table-centric document structure parsing and domain-specific table retrieval, delivering comprehensive table-anchored semantic analysis and precise extraction of semantically relevant tables. Evaluated on nearly 4,000 pages with over 1,000 tables from real-world PDFs, DOTABLER achieves over 90% Precision and F1 scores, demonstrating superior performance in table-context semantic analysis and deep document parsing compared to advanced models such as GPT-4o.

[106] Making Qwen3 Think in Korean with Reinforcement Learning cs.CL

Jungyup Lee, Jemin Kim, Sang Park, SeungJae Lee

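TL;DR: The paper proposes a two-stage fine-tuning approach, supervised fine-tuning (SFT) followed by reinforcement learning (GRPO), that makes the Qwen3 14B model reason natively in Korean. An oracle judge model addresses stability problems, and reasoning ability improves substantially while language proficiency is maintained.

Details

Motivation: Address the limited native logical reasoning of large language models such as Qwen3 on Korean tasks while improving their overall reasoning performance.

Result: The model improves substantially on Korean reasoning tasks (particularly math and coding) while maintaining language and knowledge proficiency.

Insight: Combining staged fine-tuning with reinforcement learning can markedly improve model performance on tasks in less widely used languages while avoiding training instability.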

Abstract: We present a two-stage fine-tuning approach to make the large language model Qwen3 14B “think” natively in Korean. In the first stage, supervised fine-tuning (SFT) on a high-quality Korean reasoning dataset establishes a strong foundation in Korean logical reasoning, yielding notable improvements in Korean-language tasks and even some gains in general reasoning ability. In the second stage, we employ reinforcement learning with a customized Group Relative Policy Optimization (GRPO) algorithm to further enhance both Korean reasoning alignment and overall problem-solving performance. We address critical stability challenges in GRPO training - such as reward hacking and policy collapse - by introducing an oracle judge model that calibrates the reward signal. Our approach achieves stable learning (avoiding the collapse observed in naive GRPO) and leads to steady, incremental performance gains. The final RL-tuned model demonstrates substantially improved results on advanced reasoning benchmarks (particularly math and coding tasks) while maintaining knowledge and language proficiency, successfully conducting its internal chain-of-thought entirely in Korean.

[107] ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning cs.CL | cs.AI | cs.LG

Juyuan Wang, Rongchen Zhao, Wei Wei, Yufeng Wang, Mo Yu

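TL;DR: ComoRAG is a cognitively inspired, memory-organized RAG method for long narrative reasoning that improves performance through a dynamic memory workspace and multi-step reasoning cycles.

Details

Motivation: Long narratives are hard to understand because of intricate plots and evolving relations, and traditional RAG methods fall short because their retrieval is single-step and static. Drawing on human cognition, ComoRAG proposes dynamic memory and multi-step reasoning.

Result: On long narrative benchmarks (200K+ tokens), it outperforms traditional RAG methods, with relative gains of up to 11%.

Insight: Dynamic memory and multi-step reasoning mimic human cognitive processes and offer a new approach to long-text reasoning.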

Abstract: Narrative comprehension on long stories and novels has been a challenging domain owing to their intricate plotlines and entangled, often evolving relations among characters and entities. Given the LLM’s diminished reasoning over extended context and high computational cost, retrieval-based approaches continue to play a pivotal role in practice. However, traditional RAG methods can fall short due to their stateless, single-step retrieval process, which often overlooks the dynamic nature of capturing interconnected relations within long-range context. In this work, we propose ComoRAG, holding the principle that narrative reasoning is not a one-shot process, but a dynamic, evolving interplay between new evidence acquisition and past knowledge consolidation, analogous to human cognition when reasoning with memory-related signals in the brain. Specifically, when encountering a reasoning impasse, ComoRAG undergoes iterative reasoning cycles while interacting with a dynamic memory workspace. In each cycle, it generates probing queries to devise new exploratory paths, then integrates the retrieved evidence of new aspects into a global memory pool, thereby supporting the emergence of a coherent context for the query resolution. Across four challenging long-context narrative benchmarks (200K+ tokens), ComoRAG outperforms strong RAG baselines with consistent relative gains up to 11% compared to the strongest baseline. Further analysis reveals that ComoRAG is particularly advantageous for complex queries requiring global comprehension, offering a principled, cognitively motivated paradigm for retrieval-based long context comprehension towards stateful reasoning. Our code is publicly released at https://github.com/EternityJune25/ComoRAG
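
The iterative cycle described in this abstract (attempt an answer, generate probing queries on an impasse, merge newly retrieved evidence into a shared memory pool) can be sketched roughly as follows. This is only an illustration of the control flow, not the released ComoRAG code: the function name `stateful_rag` and the three callables `retrieve`, `try_answer`, and `make_probes` are hypothetical placeholders that a caller would have to supply.

```python
from typing import Callable, List, Optional, Tuple


def stateful_rag(
    question: str,
    retrieve: Callable[[str], List[str]],                   # hypothetical retriever
    try_answer: Callable[[str, List[str]], Optional[str]],  # returns None on an impasse
    make_probes: Callable[[str, List[str]], List[str]],     # hypothetical probe generator
    max_cycles: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Iterate retrieval and reasoning over a growing memory pool."""
    memory: List[str] = list(retrieve(question))  # initial single-step retrieval
    for _ in range(max_cycles):
        answer = try_answer(question, memory)
        if answer is not None:
            return answer, memory
        # Impasse: explore new aspects and consolidate the evidence into memory.
        for probe in make_probes(question, memory):
            for evidence in retrieve(probe):
                if evidence not in memory:
                    memory.append(evidence)
    # Final attempt with everything gathered so far.
    return try_answer(question, memory), memory
```

The key difference from a stateless pipeline is that `memory` persists across cycles, so each new probe is generated with knowledge of what has already been retrieved.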

[108] DiFaR: Enhancing Multimodal Misinformation Detection with Diverse, Factual, and Relevant Rationales cs.CL

Herun Wan, Jiaying Wu, Minnan Luo, Xiangzheng Kong, Zihan Ma

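TL;DR: DiFaR is a detector-agnostic multimodal framework that strengthens misinformation detection by generating diverse, factual, and relevant rationales. It uses chain-of-thought prompts and a lightweight filtering module to address the insufficient diversity, factual errors, and irrelevant content of existing methods, and experiments show that it significantly outperforms baselines.

Details

Motivation: In multimodal misinformation detection, the textual rationales produced by existing methods suffer from insufficient diversity, factual inaccuracy, and irrelevant content, which limits detection quality. DiFaR aims to improve detection by improving rationale generation and filtering.

Result: On four benchmarks, DiFaR outperforms baselines by up to 5.9% and boosts existing detectors by as much as 8.7%. Automatic metrics and human evaluations both confirm a significant improvement in rationale quality.

Insight: Diversified reasoning paths and strict content filtering can effectively improve the accuracy and robustness of multimodal misinformation detection.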

Abstract: Generating textual rationales from large vision-language models (LVLMs) to support trainable multimodal misinformation detectors has emerged as a promising paradigm. However, its effectiveness is fundamentally limited by three core challenges: (i) insufficient diversity in generated rationales, (ii) factual inaccuracies due to hallucinations, and (iii) irrelevant or conflicting content that introduces noise. We introduce DiFaR, a detector-agnostic framework that produces diverse, factual, and relevant rationales to enhance misinformation detection. DiFaR employs five chain-of-thought prompts to elicit varied reasoning traces from LVLMs and incorporates a lightweight post-hoc filtering module to select rationale sentences based on sentence-level factuality and relevance scores. Extensive experiments on four popular benchmarks demonstrate that DiFaR outperforms four baseline categories by up to 5.9% and boosts existing detectors by as much as 8.7%. Both automatic metrics and human evaluations confirm that DiFaR significantly improves rationale quality across all three dimensions.
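
The post-hoc filtering step in this abstract selects rationale sentences by sentence-level factuality and relevance scores. A minimal sketch of such a filter is shown below. The abstract does not say how the scores are computed or combined, so the `ScoredSentence` type, the two thresholds, and the "both scores must pass" rule are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredSentence:
    text: str
    factuality: float  # assumed to lie in [0, 1], produced by some external scorer
    relevance: float   # assumed to lie in [0, 1], produced by some external scorer


def filter_rationale(
    sentences: List[ScoredSentence],
    min_factuality: float = 0.5,  # hypothetical threshold
    min_relevance: float = 0.5,   # hypothetical threshold
) -> List[str]:
    """Keep only sentences that pass both thresholds, preserving their order."""
    return [
        s.text
        for s in sentences
        if s.factuality >= min_factuality and s.relevance >= min_relevance
    ]
```

Because the filter only consumes scored sentences, it can sit between any rationale generator and any downstream detector, which matches the detector-agnostic framing above.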

[109] When Language Overrules: Revealing Text Dominance in Multimodal Large Language Models cs.CL | cs.AI

Huyu Wu, Meng Tang, Xinhan Zheng, Haiyun Jiang

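TL;DR: This paper systematically studies text dominance in multimodal large language models (MLLMs), proposes two evaluation metrics (MDI and AEI), shows that the problem is pervasive across multimodal tasks, and proposes a simple token compression method that effectively rebalances model attention.

Details

Motivation: MLLMs perform well on multimodal tasks, but their over-reliance on text (text dominance) leaves information in other modalities underused. The paper aims to show how pervasive the phenomenon is, identify its causes, and propose a remedy.

Result: The proposed method reduces LLaVA-7B's MDI from 10.23 to 0.86, validating its effectiveness.

Insight: 1. Text dominance is a pervasive problem in multimodal models; 2. Better token compression can improve how efficiently multiple modalities are used; 3. Future work should pursue more balanced multimodal architectures and task designs.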

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a diverse range of multimodal tasks. However, these models suffer from a core problem known as text dominance: they depend heavily on text for their inference, while underutilizing other modalities. While prior work has acknowledged this phenomenon in vision-language tasks, often attributing it to data biases or model architectures, we conduct the first systematic investigation of text dominance across diverse data modalities, including images, videos, audio, time-series, and graphs. To measure this imbalance, we propose two evaluation metrics: the Modality Dominance Index (MDI) and the Attention Efficiency Index (AEI). Our comprehensive analysis reveals that text dominance is both significant and pervasive across all tested modalities. Our in-depth analysis identifies three underlying causes: attention dilution from severe token redundancy in non-textual modalities, the influence of fusion architecture design, and task formulations that implicitly favor textual inputs. Furthermore, we propose a simple token compression method that effectively rebalances model attention. Applying this method to LLaVA-7B, for instance, drastically reduces its MDI from 10.23 to a well-balanced value of 0.86. Our analysis and methodological framework offer a foundation for the development of more equitable and comprehensive multimodal language models.

[110] eDIF: A European Deep Inference Fabric for Remote Interpretability of LLM cs.CL

Irma Heithoff, Marc Guggenberger, Sandra Kalogiannis, Susanne Mayer, Fabian Maag, Sigurd Schacht

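TL;DR: The paper presents a feasibility study of the European Deep Inference Fabric (eDIF), which aims to support mechanistic interpretability research on large language models (LLMs). It enables remote model analysis through a GPU cluster and the NNsight API, and a pilot study validates its technical and scientific value.

Details

Motivation: Europe lacks widely accessible LLM interpretability infrastructure. The study aims to democratize advanced model analysis for the research community through eDIF.

Result: Platform performance was stable, user engagement rose gradually, and the remote experimentation capabilities were received positively. The study also identified issues such as long download times for activation data and intermittent execution interruptions.

Insight: eDIF is an important step toward LLM interpretability research infrastructure in Europe and lays the groundwork for expanded tooling and community collaboration.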

Abstract: This paper presents a feasibility study on the deployment of a European Deep Inference Fabric (eDIF), an NDIF-compatible infrastructure designed to support mechanistic interpretability research on large language models. The need for widespread accessibility of LLM interpretability infrastructure in Europe drives this initiative to democratize advanced model analysis capabilities for the research community. The project introduces a GPU-based cluster hosted at Ansbach University of Applied Sciences and interconnected with partner institutions, enabling remote model inspection via the NNsight API. A structured pilot study involving 16 researchers from across Europe evaluated the platform’s technical performance, usability, and scientific utility. Users conducted interventions such as activation patching, causal tracing, and representation analysis on models including GPT-2 and DeepSeek-R1-70B. The study revealed a gradual increase in user engagement, stable platform performance throughout, and a positive reception of the remote experimentation capabilities. It also marked the starting point for building a user community around the platform. Identified limitations such as prolonged download durations for activation data as well as intermittent execution interruptions are addressed in the roadmap for future development. This initiative marks a significant step towards widespread accessibility of LLM interpretability infrastructure in Europe and lays the groundwork for broader deployment, expanded tooling, and sustained community collaboration in mechanistic interpretability research.

[111] Learning from Natural Language Feedback for Personalized Question Answering cs.CL | cs.AI | cs.IR

Alireza Salemi, Hamed Zamani

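TL;DR: The paper proposes VAC, a new framework that replaces scalar reward signals with natural language feedback (NLF) to improve personalized question answering. NLF provides a richer supervision signal that helps the model iteratively refine its outputs. Experiments show that VAC significantly outperforms existing methods on the LaMP-QA benchmark.

Details

Motivation: Existing personalized language models rely on scalar reward signals, which give weak, non-instructive feedback and limit learning efficiency and personalization quality.

Result: On the LaMP-QA benchmark, VAC significantly outperforms the current state of the art, and human evaluations confirm the superior quality of the generated responses.

Insight: Natural language feedback is more effective than scalar reward signals for optimizing personalized question answering.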

Abstract: Personalization is crucial for enhancing both the effectiveness and user satisfaction of language technologies, particularly in information-seeking tasks like question answering. Current approaches for personalizing large language models (LLMs) often rely on retrieval-augmented generation (RAG), followed by reinforcement learning with scalar reward signals to teach models how to use retrieved personal context. We believe that these scalar rewards sometimes provide weak, non-instructive feedback, limiting learning efficiency and personalization quality. We introduce VAC, a novel framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF) that is generated conditioned on the user profiles and the question narratives. NLF serves as a rich and actionable supervision signal, allowing the policy model to iteratively refine its outputs and internalize effective personalization strategies. Training alternates between optimizing the feedback model and fine-tuning the policy model on the improved responses, resulting in a policy model that no longer requires feedback at inference. Evaluation on the LaMP-QA benchmark, which consists of three diverse domains, demonstrates consistent and significant improvements over the state-of-the-art results. Human evaluations further confirm the superior quality of the generated responses. These results demonstrate that NLF provides more effective signals for optimizing personalized question answering.

[112] Beyond “Not Novel Enough”: Enriching Scholarly Critique with LLM-Assisted Feedback cs.CL

Osama Mohammed Afzal, Preslav Nakov, Tom Hope, Iryna Gurevych

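TL;DR: The paper proposes a structured, LLM-based approach to assist novelty assessment of academic papers. Through content extraction, retrieval and synthesis of related work, and structured comparison, it produces results closely aligned with those of human reviewers.

Details

Motivation: In high-volume fields such as NLP, reviewer capacity is increasingly strained, yet novelty assessment, a core part of peer review, remains understudied. An automated approach is needed to support more rigorous and transparent review.

Result: Evaluated on 182 ICLR 2025 submissions, the method reaches 86.5% alignment with human reviewers' reasoning and 75.3% agreement on novelty conclusions, outperforming existing LLM baselines.

Insight: Structured LLM-assisted approaches can improve the consistency and transparency of reviews without replacing human expert judgment, offering new technical support for peer review.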

Abstract: Novelty assessment is a central yet understudied aspect of peer review, particularly in high volume fields like NLP where reviewer capacity is increasingly strained. We present a structured approach for automated novelty evaluation that models expert reviewer behavior through three stages: content extraction from submissions, retrieval and synthesis of related work, and structured comparison for evidence based assessment. Our method is informed by a large scale analysis of human written novelty reviews and captures key patterns such as independent claim verification and contextual reasoning. Evaluated on 182 ICLR 2025 submissions with human annotated reviewer novelty assessments, the approach achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions - substantially outperforming existing LLM based baselines. The method produces detailed, literature aware analyses and improves consistency over ad hoc reviewer judgments. These results highlight the potential for structured LLM assisted approaches to support more rigorous and transparent peer review without displacing human expertise. Data and code are made available.

[113] Reinforced Language Models for Sequential Decision Making cs.CL | cs.AI | cs.LG | I.2.7; I.2.8

Jim Dilkes, Vahid Yazdanpanah, Sebastian Stein

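TL;DR: The paper proposes MS-GRPO, a new algorithm for post-training small LLMs on multi-step decision-making tasks. It addresses the credit assignment shortcomings of existing methods, and the post-trained model outperforms a much larger one.

Details

Motivation: Large language models (LLMs) have great potential as sequential decision-making agents, but the compute cost of relying on large models limits their use. Smaller models therefore need to improve, yet existing post-training methods cannot handle credit assignment in multi-step tasks.

Result: The post-trained 3-billion-parameter model outperforms a 72-billion-parameter baseline by 50% on the Frozen Lake task, showing that post-training is a viable alternative for building efficient sequential decision-making agents.

Insight: Targeted post-training can markedly improve small models and reduce dependence on model scale, offering a more efficient option for practical applications.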

Abstract: Large Language Models (LLMs) show potential as sequential decision-making agents, but their application is often limited due to a reliance on large, computationally expensive models. This creates a need to improve smaller models, yet existing post-training methods are designed for single-turn interactions and cannot handle credit assignment in multi-step agentic tasks. To address this, we introduce Multi-Step Group-Relative Policy Optimization (MS-GRPO), a new algorithm for post-training LLM agents, grounded in formal Text-Mediated Stochastic Game (TSMG) and Language-Agent Policy (LAP) frameworks. For credit assignment, MS-GRPO attributes the entire cumulative episode reward to each individual episode step. We supplement this algorithm with a novel absolute-advantage-weighted episode sampling strategy that we show improves training performance. We evaluate our approach by post-training a 3-billion parameter model on Snake and Frozen Lake. Our experiments demonstrate that the method is effective in improving decision-making performance: our post-trained 3B parameter model outperforms a 72B parameter baseline by 50% on the Frozen Lake task. This work demonstrates that targeted post-training is a practical and efficient alternative to relying on model scale for creating sequential decision-making agents using LLMs.
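
The credit assignment rule in this abstract attributes the whole cumulative episode reward to every step of that episode. A rough sketch of what that could look like is given below. The abstract does not spell out the normalization or the sampling formula, so two things here are assumptions: the group-relative advantage uses the mean and standard deviation of returns within a group of episodes (a common convention in GRPO-style methods), and episode sampling draws episodes with probability proportional to the absolute advantage. All function names are hypothetical.

```python
import random
from statistics import mean, pstdev
from typing import List, Optional


def episode_advantages(episode_returns: List[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantage per episode (assumed mean/std normalization)."""
    mu = mean(episode_returns)
    sigma = pstdev(episode_returns)
    return [(r - mu) / (sigma + eps) for r in episode_returns]


def step_advantages(episode_returns: List[float], steps_per_episode: List[int]) -> List[List[float]]:
    """Copy each episode's advantage onto every step of that episode."""
    advantages = episode_advantages(episode_returns)
    return [[a] * n for a, n in zip(advantages, steps_per_episode)]


def sample_episodes(advantages: List[float], k: int,
                    rng: Optional[random.Random] = None) -> List[int]:
    """Sample k episode indices with probability proportional to |advantage|."""
    rng = rng or random.Random(0)
    weights = [abs(a) for a in advantages]
    if sum(weights) == 0:  # all episodes scored the same: fall back to uniform
        return [rng.randrange(len(advantages)) for _ in range(k)]
    return rng.choices(range(len(advantages)), weights=weights, k=k)
```

Under this sketch every step in a successful episode receives the same positive signal, regardless of which individual action actually mattered, which keeps the rule simple at the cost of per-step precision.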

[114] Psyche-R1: Towards Reliable Psychological LLMs through Unified Empathy, Expertise, and Reasoning cs.CL

Chongyuan Dai, Jinpeng Hu, Hongchang Shi, Zhuo Li, Xun Yang

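TL;DR: The paper proposes Psyche-R1, the first Chinese psychological LLM to combine empathy, psychological expertise, and reasoning. Unified data generation and a hybrid training strategy markedly improve the reliability of psychological assistance.

Details

Motivation: Mental health professionals are in short supply and the burden of mental illness is growing. Current LLMs excel in mathematics and programming, but work in psychology has focused mainly on emotional support and overlooked the role of reasoning mechanisms in generating reliable responses.

Result: Experiments show that the 7B-parameter Psyche-R1 performs strongly on several psychological benchmarks, with results comparable to the 671B DeepSeek-R1.

Insight: Introducing reasoning mechanisms and a hybrid training strategy is key to reliable psychological LLMs, and the approach could extend to other domains.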

Abstract: Amidst a shortage of qualified mental health professionals, the integration of large language models (LLMs) into psychological applications offers a promising way to alleviate the growing burden of mental health disorders. Recent reasoning-augmented LLMs have achieved remarkable performance in mathematics and programming, while research in the psychological domain has predominantly emphasized emotional support and empathetic dialogue, with limited attention to reasoning mechanisms that are beneficial to generating reliable responses. Therefore, in this paper, we propose Psyche-R1, the first Chinese psychological LLM that jointly integrates empathy, psychological expertise, and reasoning, built upon a novel data curation pipeline. Specifically, we design a comprehensive data synthesis pipeline that produces over 75k high-quality psychological questions paired with detailed rationales, generated through chain-of-thought (CoT) reasoning and iterative prompt-rationale optimization, along with 73k empathetic dialogues. Subsequently, we employ a hybrid training strategy wherein challenging samples are identified through a multi-LLM cross-selection strategy for group relative policy optimization (GRPO) to improve reasoning ability, while the remaining data is used for supervised fine-tuning (SFT) to enhance empathetic response generation and psychological domain knowledge. Extensive experiment results demonstrate the effectiveness of the Psyche-R1 across several psychological benchmarks, where our 7B Psyche-R1 achieves comparable results to 671B DeepSeek-R1.

[115] From Black Box to Transparency: Enhancing Automated Interpreting Assessment with Explainable AI in College Classrooms cs.CL | cs.AI

Zhaokun Jiang, Ziyin Zhang

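TL;DR: This paper proposes a multi-dimensional modeling framework that combines feature engineering, data augmentation, and explainable machine learning for automated assessment of interpreting quality. It emphasizes transparency and explainability in place of traditional “black box” prediction.

Details

Motivation: Existing research examines language use quality insufficiently, models poorly because of data scarcity and imbalance, and does little to explain model predictions. The paper aims to close these gaps.

Result: The framework shows strong predictive performance on an English-Chinese consecutive interpreting dataset and identifies key features for fidelity (BLEURT and CometKiwi scores), fluency (pause-related features), and language use (Chinese-specific phraseological diversity metrics).

Insight: By emphasizing explainability, the approach gives learners detailed diagnostic feedback and supports self-regulated learning, showing the potential of transparent models in educational applications.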

Abstract: Recent advancements in machine learning have spurred growing interest in automated interpreting quality assessment. Nevertheless, existing research suffers from insufficient examination of language use quality, unsatisfactory modeling effectiveness due to data scarcity and imbalance, and a lack of efforts to explain model predictions. To address these gaps, we propose a multi-dimensional modeling framework that integrates feature engineering, data augmentation, and explainable machine learning. This approach prioritizes explainability over “black box” predictions by utilizing only construct-relevant, transparent features and conducting Shapley Value (SHAP) analysis. Our results demonstrate strong predictive performance on a novel English-Chinese consecutive interpreting dataset, identifying BLEURT and CometKiwi scores to be the strongest predictive features for fidelity, pause-related features for fluency, and Chinese-specific phraseological diversity metrics for language use. Overall, by placing particular emphasis on explainability, we present a scalable, reliable, and transparent alternative to traditional human evaluation, facilitating the provision of detailed diagnostic feedback for learners and supporting self-regulated learning advantages not afforded by automated scores in isolation.

[116] SSRL: Self-Search Reinforcement Learning cs.CL

Yuchen Fan, Kaiyan Zhang, Heng Zhou, Yuxin Zuo, Yanxu Chen

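TL;DR: The paper proposes Self-Search Reinforcement Learning (SSRL), an approach built on large language models (LLMs) that elicits their internal knowledge through structured prompting and repeated sampling, reducing dependence on external search engines and making search tasks more efficient.

Details

Motivation: Reinforcement learning (RL) for search tasks typically depends on costly interactions with external search engines. The paper explores using the intrinsic knowledge of LLMs as an efficient search simulator to cut cost and improve efficiency.

Result: Experiments show that Self-Search achieves high pass@k on question-answering benchmarks, including the challenging BrowseComp task, and that SSRL-trained models reduce reliance on external search engines and enable robust sim-to-real transfer.

Insight: 1) LLMs possess world knowledge that can be effectively elicited; 2) leveraging internal knowledge can reduce hallucination; 3) SSRL-trained models integrate seamlessly with external search engines.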

Abstract: We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), thereby reducing dependence on costly interactions with external search engines. To this end, we first quantify the intrinsic search capability of LLMs via structured prompting and repeated sampling, which we term Self-Search. Our results reveal that LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. Building on these observations, we introduce Self-Search RL (SSRL), which enhances LLMs’ Self-Search capability through format-based and rule-based rewards. SSRL enables models to iteratively refine their knowledge utilization internally, without requiring access to external tools. Empirical evaluations demonstrate that SSRL-trained policy models provide a cost-effective and stable environment for search-driven RL training, reducing reliance on external search engines and facilitating robust sim-to-real transfer. We draw the following conclusions: 1) LLMs possess world knowledge that can be effectively elicited to achieve high performance; 2) SSRL demonstrates the potential of leveraging internal knowledge to reduce hallucination; 3) SSRL-trained models integrate seamlessly with external search engines without additional effort. Our findings highlight the potential of LLMs to support more scalable RL agent training.
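
The abstract reports pass@k under repeated sampling but does not state how it is computed. One widely used choice is the unbiased estimator from code-generation evaluation, $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ samples are drawn and $c$ of them are correct. The sketch below implements that estimator; whether the paper uses exactly this formula is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c of n drawn samples were correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case any k-subset must contain a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Raising `k` for a fixed question set is one way to look at how accuracy scales with the inference budget, which is the scaling behavior the abstract refers to.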

[117] A Survey on Diffusion Language Models cs.CL | cs.AI | cs.LG

Tianyi Li, Mingda Chen, Bowei Guo, Zhiqiang Shen

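TL;DR: This survey gives a comprehensive overview of recent progress in diffusion language models (DLMs), covering their advantages, comparisons with autoregressive and masked models, and applications in natural language processing.

Details

Motivation: With parallel generation and bidirectional context, DLMs are a promising alternative to autoregressive models, but the field is complex and moving quickly, so a systematic summary is needed.

Result: DLMs achieve a several-fold inference speed-up while reaching performance comparable to autoregressive models. They suit a range of NLP tasks, and multimodal extensions also show promise.

Insight: The strengths of DLMs are parallel generation and fine-grained control over generation, but they face challenges in efficiency, long-sequence handling, and infrastructure. Future work needs to further optimize model design and broaden applications.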

Abstract: Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context, thereby enabling fine-grained control over the generation process. While achieving a several-fold speed-up, recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts, making them a compelling choice for various natural language processing tasks. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field. Project GitHub is available at https://github.com/VILA-Lab/Awesome-DLMs.

cs.SE

[118] SaraCoder: Orchestrating Semantic and Structural Cues for Profit-Oriented Repository-Level Code Completion cs.SE | cs.CL | cs.IR | cs.PL

Xiaohan Chen, Zhongying Pan, Quan Feng, Yu Tian, Shuqun Yang

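TL;DR: Saracoder significantly improves the accuracy and diversity of repository-level code completion through hierarchical feature optimization and an External-Aware Identifier Disambiguator module.

Details

Motivation: Current repository-level code completion based on retrieval-augmented generation (RAG) relies on superficial text similarity, which causes semantic misguidance, redundancy, and homogeneity, and cannot resolve external symbol ambiguity.

Result: On the CrossCodeEval and RepoEval-Updated benchmarks, Saracoder significantly outperforms existing methods across multiple programming languages and models.

Insight: Systematically refining retrieval results along multiple dimensions can yield more accurate and robust repository-level code completion systems.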

Abstract: Retrieval-augmented generation (RAG) for repository-level code completion commonly relies on superficial text similarity, leading to results plagued by semantic misguidance, redundancy, and homogeneity, while also failing to resolve external symbol ambiguity. To address these challenges, we introduce Saracoder, a Hierarchical Feature-Optimized retrieval framework. Its core Hierarchical Feature Optimization module systematically refines candidates by distilling deep semantic relationships, pruning exact duplicates, assessing structural similarity with a novel graph-based metric that weighs edits by their topological importance, and reranking results to maximize both relevance and diversity. Furthermore, an External-Aware Identifier Disambiguator module accurately resolves cross-file symbol ambiguity via dependency analysis. Extensive experiments on the challenging CrossCodeEval and RepoEval-Updated benchmarks demonstrate that Saracoder significantly outperforms existing baselines across multiple programming languages and models. Our work proves that systematically refining retrieval results across multiple dimensions provides a new paradigm for building more accurate and robust repository-level code completion systems.

q-bio.NC

[119] Large Language Models Show Signs of Alignment with Human Neurocognition During Abstract Reasoning q-bio.NC | cs.AI | cs.CL

Christopher Pinier, Sonia Acuña Vargas, Mariia Steeghs-Turchina, Dora Matzke, Claire E. Stevenson

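TL;DR: The study examines whether large language models (LLMs) align with human neurocognition during abstract reasoning and finds that the largest models (about 70 billion parameters) show human-like behavior and neural representation patterns.

Details

Motivation: To explore neurocognitive similarities between LLMs and humans in abstract reasoning, in order to reveal potential mechanisms shared by biological and artificial intelligence.

Result: Only the largest LLMs (such as Qwen-2.5-72B and DeepSeek-R1-70B) reach human-level accuracy, and the representational geometry of their task-optimal layers correlates moderately and positively with human frontal FRPs.

Insight: LLMs may mirror the neural mechanisms of human abstract reasoning, hinting at principles shared between biological and artificial intelligence in abstract problem solving.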

Abstract: This study investigates whether large language models (LLMs) mirror human neurocognition during abstract reasoning. We compared the performance and neural representations of human participants with those of eight open-source LLMs on an abstract-pattern-completion task. We leveraged pattern type differences in task performance and in fixation-related potentials (FRPs) as recorded by electroencephalography (EEG) during the task. Our findings indicate that only the largest tested LLMs (~70 billion parameters) achieve human-comparable accuracy, with Qwen-2.5-72B and DeepSeek-R1-70B also showing similarities with the human pattern-specific difficulty profile. Critically, every LLM tested forms representations that distinctly cluster the abstract pattern categories within their intermediate layers, although the strength of this clustering scales with their performance on the task. Moderate positive correlations were observed between the representational geometries of task-optimal LLM layers and human frontal FRPs. These results consistently diverged from comparisons with other EEG measures (response-locked ERPs and resting EEG), suggesting a potential shared representational space for abstract patterns. This indicates that LLMs might mirror human brain mechanisms in abstract reasoning, offering preliminary evidence of shared principles between biological and artificial intelligence.

[120] Insights from the Algonauts 2025 Winners q-bio.NC | cs.CV

Paul S. Scotti, Mihir Tripathy

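TL;DR: The Algonauts 2025 Challenge focused on predicting human brain activity from long, multimodal movie data. The winning teams showcase the state of the art in brain encoding and point to future research directions.

Details

Motivation: Predicting brain activity from naturalistic movie stimuli advances computational neuroscience and brain encoding models.

Result: The winning teams performed best on the out-of-distribution (OOD) test, demonstrating that their models adapt to complex multimodal data.

Insight: Long multimodal data offers a more realistic experimental setting for brain encoding research. Future work may further explore model generalization and cross-modal learning.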

Abstract: The Algonauts 2025 Challenge just wrapped up a few weeks ago. It is a biennial challenge in computational neuroscience in which teams attempt to build models that predict human brain activity from carefully curated stimuli. Previous editions (2019, 2021, 2023) focused on still images and short videos; the 2025 edition, which concluded last month (late July), pushed the field further by using long, multimodal movies. Teams were tasked with predicting fMRI responses across 1,000 whole-brain parcels across four participants in the dataset who were scanned while watching nearly 80 hours of naturalistic movie stimuli. These recordings came from the CNeuroMod project and included 65 hours of training data, about 55 hours of Friends (seasons 1-6) plus four feature films (The Bourne Supremacy, Hidden Figures, Life, and The Wolf of Wall Street). The remaining data were used for validation: Season 7 of Friends for in-distribution tests, and the final winners for the Challenge were those who could best predict brain activity for six films in their held-out out-of-distribution (OOD) set. The winners were just announced and the top team reports are now publicly available. As members of the MedARC team which placed 4th in the competition, we reflect on the approaches that worked, what they reveal about the current state of brain encoding, and what might come next.

cs.HC

[121] Personalized Real-time Jargon Support for Online Meetings cs.HC | cs.CL

Yifan Song, Wing Yee Au, Hon Yung Wong, Brian P. Bailey, Tal August

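TL;DR: The paper studies jargon barriers in online meetings and proposes ParseJargon, an LLM-based system for real-time personalized jargon support that significantly improves comprehension and engagement.

Details

Motivation: Interdisciplinary communication is often hindered by jargon, and current jargon-management strategies fall short in workplace meetings.

Result: Personalized jargon support significantly improves comprehension and engagement, whereas general-purpose support negatively affects engagement. A field study validated the system's practical value.

Insight: Personalized jargon support tools have broad potential for interdisciplinary communication and education.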

Abstract: Effective interdisciplinary communication is frequently hindered by domain-specific jargon. To explore the jargon barriers in-depth, we conducted a formative diary study with 16 professionals, revealing critical limitations in current jargon-management strategies during workplace meetings. Based on these insights, we designed ParseJargon, an interactive LLM-powered system providing real-time personalized jargon identification and explanations tailored to users’ individual backgrounds. A controlled experiment comparing ParseJargon against baseline (no support) and general-purpose (non-personalized) conditions demonstrated that personalized jargon support significantly enhanced participants’ comprehension, engagement, and appreciation of colleagues’ work, whereas general-purpose support negatively affected engagement. A follow-up field study validated ParseJargon’s usability and practical value in real-time meetings, highlighting both opportunities and limitations for real-world deployment. Our findings contribute insights into designing personalized jargon support tools, with implications for broader interdisciplinary and educational applications.

cs.CR

[122] Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs cs.CR | cs.AI | cs.CL

Jinhwa Kim, Ian G. Harris

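TL;DR: The paper proposes Context Filtering, a defense mechanism that protects large language models (LLMs) from adversarial attacks by filtering out untrustworthy context while preserving the model's original performance.

Details

Motivation: LLMs have made significant progress in performance, but malicious users can attack them through adversarial context, raising safety and ethical risks. Balancing safety and helpfulness calls for a defense that does not modify the model itself.

Result: Experiments show the method reduces attack success rates by up to 88% and achieves state-of-the-art results on the Safety and Helpfulness Product metric.

Insight: Maintaining safety need not sacrifice helpfulness; input pre-processing can improve safety without modifying the underlying model.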

Abstract: While Large Language Models (LLMs) have shown significant advancements in performance, various jailbreak attacks have posed growing safety and ethical risks. Malicious users often exploit adversarial context to deceive LLMs, prompting them to generate responses to harmful queries. In this study, we propose a new defense mechanism called Context Filtering model, an input pre-processing method designed to filter out untrustworthy and unreliable context while identifying the primary prompts containing the real user intent to uncover concealed malicious intent. Given that enhancing the safety of LLMs often compromises their helpfulness, potentially affecting the experience of benign users, our method aims to improve the safety of the LLMs while preserving their original performance. We evaluate the effectiveness of our model in defending against jailbreak attacks through comparative analysis, comparing our approach with state-of-the-art defense mechanisms against six different attacks and assessing the helpfulness of LLMs under these defenses. Our model demonstrates its ability to reduce the Attack Success Rates of jailbreak attacks by up to 88% while maintaining the original LLMs’ performance, achieving state-of-the-art Safety and Helpfulness Product results. Notably, our model is a plug-and-play method that can be applied to all LLMs, including both white-box and black-box models, to enhance their safety without requiring any fine-tuning of the models themselves. We will make our model publicly available for research purposes.

[123] Searching for Privacy Risks in LLM Agents via Simulation cs.CR | cs.AI | cs.CLPDF

Yanzhe Zhang, Diyi Yang

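TL;DR: Presents a search-based framework that discovers and defends against privacy risks in LLM agents by simulating privacy-critical agent interactions.

Details

Motivation: As LLM agents are widely deployed, malicious agents may extract sensitive information through multi-turn interactions. The complexity of these dynamic dialogues makes manual vulnerability discovery difficult, so an automated approach is needed.

Result: Attack strategies escalate from simple requests to sophisticated multi-turn tactics (such as impersonation and consent forgery), while defenses evolve from rule-based constraints to identity-verification state machines.

Insight: LLMs can serve as optimizers that efficiently explore the interaction space, and the discovered attacks and defenses transfer strongly across scenarios and models.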

Abstract: The widespread deployment of LLM-based agents is likely to introduce a critical privacy threat: malicious agents that proactively engage others in multi-turn interactions to extract sensitive information. These dynamic dialogues enable adaptive attack strategies that can cause severe privacy violations, yet their evolving nature makes it difficult to anticipate and discover sophisticated vulnerabilities manually. To tackle this problem, we present a search-based framework that alternates between improving attacker and defender instructions by simulating privacy-critical agent interactions. Each simulation involves three roles: data subject, data sender, and data recipient. While the data subject’s behavior is fixed, the attacker (data recipient) attempts to extract sensitive information from the defender (data sender) through persistent and interactive exchanges. To explore this interaction space efficiently, our search algorithm employs LLMs as optimizers, using parallel search with multiple threads and cross-thread propagation to analyze simulation trajectories and iteratively propose new instructions. Through this process, we find that attack strategies escalate from simple direct requests to sophisticated multi-turn tactics such as impersonation and consent forgery, while defenses advance from rule-based constraints to identity-verification state machines. The discovered attacks and defenses transfer across diverse scenarios and backbone models, demonstrating strong practical utility for building privacy-aware agents.

[124] Invisible Watermarks, Visible Gains: Steering Machine Unlearning with Bi-Level Watermarking Design cs.CR | cs.CVPDF

Yuhao Sun, Yihua Zhang, Gaowen Liu, Hongtao Xie, Sijia Liu

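TL;DR: The paper proposes Water4MU, a new approach that uses digital watermarking to improve machine unlearning (MU), with a bi-level optimization framework for efficiently forgetting sensitive data.

Details

Motivation: With growing demand for the "right to be forgotten," machine unlearning has become an important tool for trust and compliance, but existing methods mostly rely on adjusting model weights and rarely explore the data level.

Result: Experiments show Water4MU is effective in both image classification and image generation, and outperforms existing methods in "challenging forgets" scenarios.

Insight: Data-level watermarking can improve the precision and controllability of machine unlearning, suggesting a new direction for privacy protection methods.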

Abstract: With the increasing demand for the right to be forgotten, machine unlearning (MU) has emerged as a vital tool for enhancing trust and regulatory compliance by enabling the removal of sensitive data influences from machine learning (ML) models. However, most MU algorithms primarily rely on in-training methods to adjust model weights, with limited exploration of the benefits that data-level adjustments could bring to the unlearning process. To address this gap, we propose a novel approach that leverages digital watermarking to facilitate MU by strategically modifying data content. By integrating watermarking, we establish a controlled unlearning mechanism that enables precise removal of specified data while maintaining model utility for unrelated tasks. We first examine the impact of watermarked data on MU, finding that MU effectively generalizes to watermarked data. Building on this, we introduce an unlearning-friendly watermarking framework, termed Water4MU, to enhance unlearning effectiveness. The core of Water4MU is a bi-level optimization (BLO) framework: at the upper level, the watermarking network is optimized to minimize unlearning difficulty, while at the lower level, the model itself is trained independently of watermarking. Experimental results demonstrate that Water4MU is effective in MU across both image classification and image generation tasks. Notably, it outperforms existing methods in challenging MU scenarios, known as “challenging forgets”.

cs.IR [Back]

[125] Personalized Product Search Ranking: A Multi-Task Learning Approach with Tabular and Non-Tabular Data cs.IR | cs.AI | cs.CL | cs.LGPDF

Lalitesh Morishetti, Abhay Kumar, Jonathan Scott, Kaushiki Nag, Gunjan Sharma

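TL;DR: The paper proposes a novel multi-task learning architecture for optimizing personalized product search ranking, combining tabular and non-tabular data and improving performance through better embedding and sampling techniques.

Details

Motivation: Personalized product search ranking must handle multiple data types (tabular and non-tabular) at once, which existing methods struggle to integrate effectively. The paper addresses this with multi-task learning and pre-trained embeddings.

Result: Experiments show the method significantly outperforms baselines such as XGBoost and TabNet on personalized search ranking; ablations confirm the importance of the embedding techniques, relevance labeling, and multi-task learning.

Insight: 1. Multi-task learning can effectively integrate heterogeneous data; 2. Embeddings from pre-trained language models (such as TinyBERT) significantly improve semantic understanding; 3. Automatically generated relevance labels can replace human annotation and lower cost.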

Abstract: In this paper, we present a novel model architecture for optimizing personalized product search ranking using a multi-task learning (MTL) framework. Our approach uniquely integrates tabular and non-tabular data, leveraging a pre-trained TinyBERT model for semantic embeddings and a novel sampling technique to capture diverse customer behaviors. We evaluate our model against several baselines, including XGBoost, TabNet, FT-Transformer, DCN-V2, and MMoE, focusing on their ability to handle mixed data types and optimize personalized ranking. Additionally, we propose a scalable relevance labeling mechanism based on click-through rates, click positions, and semantic similarity, offering an alternative to traditional human-annotated labels. Experimental results show that combining non-tabular data with advanced embedding techniques in multi-task learning paradigm significantly enhances model performance. Ablation studies further underscore the benefits of incorporating relevance labels, fine-tuning TinyBERT layers, and TinyBERT query-product embedding interactions. These results demonstrate the effectiveness of our approach in achieving improved personalized product search ranking.

cs.AI [Back]

[126] Amazon Nova AI Challenge – Trusted AI: Advancing secure, AI-assisted software development cs.AI | cs.CL | I.2.7; I.2.6; E.0PDF

Sattvik Sahai, Prasoon Goyal, Michael Johnston, Anna Gottardi, Yao Lu

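TL;DR: The paper presents the outcomes of the Trusted AI track of the Amazon Nova AI Challenge, which focuses on safety in AI-assisted software development and advances secure AI through a global competition among 10 university teams.

Details

Motivation: As AI is used more widely in software development, ensuring its safety has become a key challenge. Amazon uses a competition format to promote collaboration between academia and industry on this problem.

Result: Participating teams developed several state-of-the-art techniques, including reasoning-based safety alignment and efficient LLM probing, significantly improving the safety of AI in software development.

Insight: Adversarial testing and multi-turn conversations are effective ways to validate AI safety; competitions can foster deep collaboration between academia and industry on secure AI.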

Abstract: AI systems for software development are rapidly gaining prominence, yet significant challenges remain in ensuring their safety. To address this, Amazon launched the Trusted AI track of the Amazon Nova AI Challenge, a global competition among 10 university teams to drive advances in secure AI. In the challenge, five teams focus on developing automated red teaming bots, while the other five create safe AI assistants. This challenge provides teams with a unique platform to evaluate automated red-teaming and safety alignment methods through head-to-head adversarial tournaments where red teams have multi-turn conversations with the competing AI coding assistants to test their safety alignment. Along with this, the challenge provides teams with a feed of high quality annotated data to fuel iterative improvement. Throughout the challenge, teams developed state-of-the-art techniques, introducing novel approaches in reasoning-based safety alignment, robust model guardrails, multi-turn jail-breaking, and efficient probing of large language models (LLMs). To support these efforts, the Amazon Nova AI Challenge team made substantial scientific and engineering investments, including building a custom baseline coding specialist model for the challenge from scratch, developing a tournament orchestration service, and creating an evaluation harness. This paper outlines the advancements made by university teams and the Amazon Nova AI Challenge team in addressing the safety challenges of AI for software development, highlighting this collaborative effort to raise the bar for AI safety.

[127] Reverse Physician-AI Relationship: Full-process Clinical Diagnosis Driven by a Large Language Model cs.AI | cs.CE | cs.CLPDF

Shicheng Xu, Xin Huang, Zihao Wei, Liang Pang, Huawei Shen

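TL;DR: The paper proposes a new clinical diagnosis paradigm in which a large language model (LLM) acts as the primary director, driving the full diagnostic process from an ambiguous chief complaint, significantly reducing physician workload and improving efficiency.

Details

Motivation: Current AI serves only as an assistant to physicians in clinical diagnosis and cannot drive the full diagnostic process from an ambiguous chief complaint, so physician workload is not substantially reduced.

Result: On rare, complex, and real-world cases, DxDirector-7B achieves significantly higher diagnostic accuracy than existing methods while substantially reducing physician workload.

Insight: The study marks a shift from AI as a physician's assistant to AI as the director, offering a feasible path to efficient and accurate diagnosis.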

Abstract: Full-process clinical diagnosis in the real world encompasses the entire diagnostic workflow that begins with only an ambiguous chief complaint. While artificial intelligence (AI), particularly large language models (LLMs), is transforming clinical diagnosis, its role remains largely as an assistant to physicians. Under this AI-assisted working pattern, AI can only answer specific medical questions at certain points within the diagnostic process and lacks the ability to drive the entire diagnostic process starting from an ambiguous complaint, which still relies heavily on human physicians. This gap limits AI's ability to fully reduce physicians' workload and enhance diagnostic efficiency. To address this, we propose a paradigm shift that reverses the relationship between physicians and AI: repositioning AI as the primary director, with physicians serving as its assistants. We therefore present DxDirector-7B, an LLM endowed with advanced deep thinking capabilities, enabling it to drive the full-process diagnosis with minimal physician involvement. Furthermore, DxDirector-7B establishes a robust accountability framework for misdiagnoses, delineating responsibility between AI and human physicians. In evaluations across rare, complex, and real-world cases under the full-process diagnosis setting, DxDirector-7B not only achieves significantly higher diagnostic accuracy but also reduces physician workload substantially more than state-of-the-art medical LLMs and general-purpose LLMs. Fine-grained analyses across multiple clinical departments and tasks validate its efficacy, with expert evaluations indicating its potential to serve as a viable substitute for medical specialists. These findings mark a new era where AI, traditionally a physicians' assistant, now drives the entire diagnostic process to drastically reduce physicians' workload, indicating an efficient and accurate diagnostic solution.

[128] Improving Value-based Process Verifier via Low-Cost Variance Reduction cs.AI | cs.CLPDF

Zetian Sun, Dongfang Li, Baotian Hu, Min Zhang

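TL;DR: The paper proposes ComMCS, a low-cost variance reduction method for improving value-based process verifiers; it linearly combines the Monte Carlo estimators of the current and subsequent steps, significantly reducing estimation error.

Details

Motivation: The reasoning ability of large language models (LLMs) in complex domains such as mathematics remains a challenge; in particular, value-based process verifiers suffer from high variance because of the limited number of Monte Carlo samples.

Result: In experiments on MATH-500 and GSM8K, ComMCS outperforms the regression-based optimization method by 2.8 points and the non-variance-reduced baseline by 2.2 points (reported on MATH-500 with Best-of-32 sampling).

Insight: High variance, rather than bias, is the main problem with the Monte Carlo estimator; low-cost variance reduction can significantly improve performance on reasoning tasks.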

Abstract: Large language models (LLMs) have achieved remarkable success in a wide range of tasks. However, their reasoning capabilities, particularly in complex domains like mathematics, remain a significant challenge. Value-based process verifiers, which estimate the probability of a partial reasoning chain leading to a correct solution, are a promising approach for improving reasoning. Nevertheless, their effectiveness is often hindered by estimation error in their training annotations, a consequence of the limited number of Monte Carlo (MC) samples feasible due to the high cost of LLM inference. In this paper, we identify that the estimation error primarily arises from high variance rather than bias, and the MC estimator is a Minimum Variance Unbiased Estimator (MVUE). To address the problem, we propose the Compound Monte Carlo Sampling (ComMCS) method, which constructs an unbiased estimator by linearly combining the MC estimators from the current and subsequent steps. Theoretically, we show that our method leads to a predictable reduction in variance, while maintaining an unbiased estimation without additional LLM inference cost. We also perform empirical experiments on the MATH-500 and GSM8K benchmarks to demonstrate the effectiveness of our method. Notably, ComMCS outperforms the regression-based optimization method by 2.8 points and the non-variance-reduced baseline by 2.2 points on MATH-500 in the Best-of-32 sampling experiment.
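
The variance argument behind linearly combining estimators can be illustrated with a generic two-estimator case. This is only a sketch under assumptions that are not taken from the paper: $\hat{V}_t$ and $\hat{V}_{t+1}$ stand for two unbiased, uncorrelated estimators of the same value $V$ with variances $\sigma_t^2$ and $\sigma_{t+1}^2$, and $\alpha$ is a hypothetical mixing weight. The actual ComMCS construction and its coefficients may differ.

```latex
\begin{align*}
\hat{V}_\alpha &= \alpha\,\hat{V}_t + (1-\alpha)\,\hat{V}_{t+1}, \qquad \alpha \in [0,1],\\
\mathbb{E}[\hat{V}_\alpha] &= \alpha V + (1-\alpha) V = V,\\
\operatorname{Var}(\hat{V}_\alpha) &= \alpha^2 \sigma_t^2 + (1-\alpha)^2 \sigma_{t+1}^2,\\
\alpha^\star &= \frac{\sigma_{t+1}^2}{\sigma_t^2 + \sigma_{t+1}^2}
\;\Longrightarrow\;
\operatorname{Var}(\hat{V}_{\alpha^\star})
  = \frac{\sigma_t^2\,\sigma_{t+1}^2}{\sigma_t^2 + \sigma_{t+1}^2}
  \le \min(\sigma_t^2,\ \sigma_{t+1}^2).
\end{align*}
```

Under these assumptions, equal variances give $\alpha^\star = 1/2$ and halve the variance, while the combined estimator stays unbiased for any $\alpha$.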

[129] MM-Food-100K: A 100,000-Sample Multimodal Food Intelligence Dataset with Verifiable Provenance cs.AI | cs.CR | cs.CV | I.2.10; I.2.6PDF

Yi Dong, Yusuke Muraoka, Scott Shi, Yi Zhang

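TL;DR: Presents MM-Food-100K, a 100,000-sample multimodal food dataset with rich annotations and a verifiable provenance mechanism. The data were collected through community contributions with AI-assisted quality checks, and the dataset's utility was validated on large models.

Details

Motivation: Food intelligence lacks high-quality, verifiable, large-scale multimodal datasets, which limits the development and application of related models.

Result: Fine-tuned models outperform the baselines on standard metrics, supporting the dataset's utility.

Insight: Community contribution combined with AI-assisted quality control is a viable way to build large-scale datasets, and verifiable provenance can increase trust in the data.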

Abstract: We present MM-Food-100K, a public 100,000-sample multimodal food intelligence dataset with verifiable provenance. It is a curated approximately 10% open subset of an original 1.2 million, quality-accepted corpus of food images annotated for a wide range of information (such as dish name, region of creation). The corpus was collected over six weeks from over 87,000 contributors using the Codatta contribution model, which combines community sourcing with configurable AI-assisted quality checks; each submission is linked to a wallet address in a secure off-chain ledger for traceability, with a full on-chain protocol on the roadmap. We describe the schema, pipeline, and QA, and validate utility by fine-tuning large vision-language models (ChatGPT 5, ChatGPT OSS, Qwen-Max) on image-based nutrition prediction. Fine-tuning yields consistent gains over out-of-box baselines across standard metrics; we report results primarily on the MM-Food-100K subset. We release MM-Food-100K for publicly free access and retain approximately 90% for potential commercial access with revenue sharing to contributors.

[130] We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning cs.AI | cs.CV | cs.LGPDF

Runqi Qiao, Qiuna Tan, Peiqing Yang, Yanzi Wang, Xiaowan Wang

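TL;DR: We-Math 2.0 is a unified system that comprehensively strengthens the mathematical reasoning of multimodal large language models through a structured mathematical knowledge system, model-centric data space modeling, and a reinforcement-learning-based training paradigm.

Details

Motivation: Existing multimodal large language models perform poorly on complex mathematical reasoning, and prior work focuses on dataset construction and method optimization while overlooking comprehensive knowledge-driven design and model-centric data space modeling.

Result: Experiments show MathBook-RL performs competitively on four widely used benchmarks and generalizes well on MathBookEval.

Insight: Combining knowledge-driven data design with progressive reinforcement learning can significantly improve performance on complex mathematical reasoning.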

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various tasks, but still struggle with complex mathematical reasoning. Existing research primarily focuses on dataset construction and method optimization, often overlooking two critical aspects: comprehensive knowledge-driven design and model-centric data space modeling. In this paper, we introduce We-Math 2.0, a unified system that integrates a structured mathematical knowledge system, model-centric data space modeling, and a reinforcement learning (RL)-based training paradigm to comprehensively enhance the mathematical reasoning abilities of MLLMs. The key contributions of We-Math 2.0 are fourfold: (1) MathBook Knowledge System: We construct a five-level hierarchical system encompassing 491 knowledge points and 1,819 fundamental principles. (2) MathBook-Standard & Pro: We develop MathBook-Standard, a dataset that ensures broad conceptual coverage and flexibility through dual expansion. Additionally, we define a three-dimensional difficulty space and generate 7 progressive variants per problem to build MathBook-Pro, a challenging dataset for robust training. (3) MathBook-RL: We propose a two-stage RL framework comprising: (i) Cold-Start Fine-tuning, which aligns the model with knowledge-oriented chain-of-thought reasoning; and (ii) Progressive Alignment RL, leveraging average-reward learning and dynamic data scheduling to achieve progressive alignment across difficulty levels. (4) MathBookEval: We introduce a comprehensive benchmark covering all 491 knowledge points with diverse reasoning step distributions. Experimental results show that MathBook-RL performs competitively with existing baselines on four widely-used benchmarks and achieves strong results on MathBookEval, suggesting promising generalization in mathematical reasoning.

[131] Agentic Design Review System cs.AI | cs.CV | cs.LG | cs.MA | cs.MMPDF

Sayan Nag, K J Joseph, Koustava Goswami, Vlad I Morariu, Balaji Vasan Srinivasan

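TL;DR: The paper proposes an Agentic Design Review System based on multi-agent collaboration, which makes agents design-aware through graph matching and prompt expansion, and validates it on the DRS-BENCH benchmark.

Details

Motivation: Evaluating graphic designs currently relies on feedback from multiple expert reviewers and lacks efficient automated tools. The paper addresses this with multi-agent collaboration.

Result: Experiments show AgenticDRS outperforms existing baselines at evaluating graphic designs and generating actionable feedback.

Insight: Multi-agent collaboration and context-aware techniques enable more efficient automated evaluation of graphic designs.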

Abstract: Evaluating graphic designs involves assessing them from multiple facets like alignment, composition, aesthetics and color choices. Evaluating designs in a holistic way involves aggregating feedback from individual expert reviewers. Towards this, we propose an Agentic Design Review System (AgenticDRS), where multiple agents collaboratively analyze a design, orchestrated by a meta-agent. A novel in-context exemplar selection approach based on graph matching and a unique prompt expansion method play a central role in making each agent design-aware. To evaluate this framework, we propose the DRS-BENCH benchmark. Thorough experimental evaluation against state-of-the-art baselines adapted to the problem setup, backed up with critical ablation experiments, brings out the efficacy of AgenticDRS in evaluating graphic designs and generating actionable feedback. We hope that this work will attract attention to this pragmatic, yet under-explored research direction.

cs.RO [Back]

Zhuoyuan Yu, Yuxing Long, Zihan Yang, Chengyan Zeng, Hongwei Fan

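TL;DR: The paper proposes Self-correction Flywheel, a post-training paradigm that generates self-correction data from the model's error trajectories on the training set to progressively improve VLA navigation models; experiments show new state-of-the-art results on the R2R-CE and RxR-CE benchmarks.

Details

Motivation: Existing vision-language-action navigation models tend to deviate from the correct trajectory when following instructions and lack effective error correction. The work proposes a new post-training paradigm to address this.

Result: On the R2R-CE and RxR-CE benchmarks, CorrectNav achieves success rates of 65.1% and 69.3%, surpassing the previous best models by 8.2% and 16.4% respectively, and performs well in real-robot tests.

Insight: A model's error trajectories can be a key resource for improvement; automatically generating self-correction data and iterating can significantly improve the robustness and accuracy of navigation models.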

Abstract: Existing vision-and-language navigation models often deviate from the correct trajectory when executing instructions. However, these models lack effective error correction capability, hindering their recovery from errors. To address this challenge, we propose Self-correction Flywheel, a novel post-training paradigm. Instead of considering the model's error trajectories on the training set as a drawback, our paradigm emphasizes their significance as a valuable data source. We have developed a method to identify deviations in these error trajectories and devised innovative techniques to automatically generate self-correction data for perception and action. These self-correction data serve as fuel to power the model's continued training. The brilliance of our paradigm is revealed when we re-evaluate the model on the training set, uncovering new error trajectories. At this time, the self-correction flywheel begins to spin. Through multiple flywheel iterations, we progressively enhance our monocular RGB-based VLA navigation model CorrectNav. Experiments on R2R-CE and RxR-CE benchmarks show CorrectNav achieves new state-of-the-art success rates of 65.1% and 69.3%, surpassing prior best VLA navigation models by 8.2% and 16.4%. Real robot tests in various indoor and outdoor environments demonstrate CorrectNav's superior capability of error correction, dynamic obstacle avoidance, and long instruction following.

[133] ReconVLA: Reconstructive Vision-Language-Action Model as Effective Robot Perceiver cs.RO | cs.CVPDF

Wenxuan Song, Ziyang Zhou, Han Zhao, Jiayi Chen, Pengxiang Ding

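TL;DR: ReconVLA is a reconstructive vision-language-action model that guides visual attention through an implicit grounding paradigm, helping robots manipulate target objects precisely, and shows strong generalization with a large-scale dataset.

Details

Motivation: The visual attention of current vision-language-action (VLA) models is usually dispersed and fails to focus on target regions, limiting robot performance on multimodal tasks.

Result: Experiments show ReconVLA achieves precise manipulation and strong generalization in both simulation and the real world.

Insight: Driving visual attention allocation with a reconstruction task effectively improves a robot's use of task-relevant visual information.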

Abstract: Recent advances in Vision-Language-Action (VLA) models have enabled robotic agents to integrate multimodal understanding with action execution. However, our empirical analysis reveals that current VLAs struggle to allocate visual attention to target regions. Instead, visual attention is always dispersed. To guide the visual attention grounding on the correct target, we propose ReconVLA, a reconstructive VLA model with an implicit grounding paradigm. Conditioned on the model’s visual outputs, a diffusion transformer aims to reconstruct the gaze region of the image, which corresponds to the target manipulated objects. This process prompts the VLA model to learn fine-grained representations and accurately allocate visual attention, thus effectively leveraging task-specific visual information and conducting precise manipulation. Moreover, we curate a large-scale pretraining dataset comprising over 100k trajectories and 2 million data samples from open-source robotic datasets, further boosting the model’s generalization in visual reconstruction. Extensive experiments in simulation and the real world demonstrate the superiority of our implicit grounding method, showcasing its capabilities of precise manipulation and generalization. Our project page is https://zionchow.github.io/ReconVLA/.

cs.LG [Back]

[134] Nested-ReFT: Efficient Reinforcement Learning for Large Language Model Fine-Tuning via Off-Policy Rollouts cs.LG | cs.AI | cs.CLPDF

Maxime Heuillet, Yufei Cui, Boxing Chen, Audrey Durand, Prasanna Parthasarathi

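TL;DR: Nested-ReFT is an efficient reinforcement learning framework that uses off-policy generation and dynamic layer skipping in the target model to significantly reduce computational cost while matching the performance of standard ReFT.

Details

Motivation: Standard ReFT frameworks require multiple inference passes to generate completions during training, which is computationally expensive. The authors aim to optimize this with off-policy RL and speculative decoding.

Result: Achieves higher computational efficiency (tokens/sec) across multiple math reasoning benchmarks and model sizes, with performance comparable to standard ReFT.

Insight: Off-policy generation and dynamic layer skipping are effective ways to cut the compute cost of LLM fine-tuning, though bias in the gradient updates needs to be controlled.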

Abstract: Advanced reasoning in LLMs on challenging domains like mathematical reasoning can be tackled using reinforced fine-tuning (ReFT) based on verifiable rewards. In standard ReFT frameworks, a behavior model generates multiple completions with answers per problem, and each answer is then scored by a reward function. While such RL post-training methods demonstrate significant performance improvements across challenging reasoning domains, the computational cost of generating completions during training with multiple inference steps makes the training cost non-trivial. To address this, we draw inspiration from off-policy RL and speculative decoding to introduce a novel ReFT framework, dubbed Nested-ReFT, where a subset of layers of the target model acts as the behavior model to generate off-policy completions during training. The behavior model, configured with dynamic layer skipping per batch during training, decreases the inference cost compared to the standard ReFT frameworks. Our theoretical analysis shows that Nested-ReFT yields unbiased gradient estimates with controlled variance. Our empirical analysis demonstrates improved computational efficiency measured as tokens/sec across multiple math reasoning benchmarks and model sizes. Additionally, we explore three variants of bias mitigation to minimize the off-policyness in the gradient updates, which allows for maintaining performance that matches the baseline ReFT performance.

[135] Stabilizing Long-term Multi-turn Reinforcement Learning with Gated Rewards cs.LG | cs.AI | cs.CLPDF

Zetian Sun, Dongfang Li, Zhuoen Chen, Yuhuai Qin, Baotian Hu

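TL;DR: This paper proposes G-RA (Gated Reward Accumulation), which controls the accumulation of immediate rewards with a threshold on the long-term reward, addressing reward sparsity and reward misalignment in long-horizon reinforcement learning. Experiments demonstrate its effectiveness on software engineering tasks.

Details

Motivation: In long-horizon reinforcement learning tasks, sparse rewards and misalignment between immediate rewards and long-term objectives make policy optimization unstable (for example, reward hacking and suboptimal policies). The paper targets this problem, particularly in software engineering tasks that involve multi-turn reasoning and rule-based verification.

Result: On SWE-bench Verified and kBench, G-RA raises completion rates (47.6% → 93.8% and 22.0% → 86.0%) and modification rates (19.6% → 23.8% and 12.0% → 42.0%) while avoiding policy degradation.

Insight: In long-horizon reinforcement learning, balancing the accumulation of immediate and long-term rewards is essential. G-RA's gating mechanism aligns the two and offers a practical solution for similar tasks.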

Abstract: Reward sparsity in long-horizon reinforcement learning (RL) tasks remains a significant challenge, while existing outcome-based reward shaping struggles to define meaningful immediate rewards without introducing bias or requiring explicit task decomposition. Alternatively, verification-based reward shaping uses stepwise critics, but misalignment between immediate rewards and long-term objectives can lead to reward hacking and suboptimal policies. In this work, we address this problem in the context of software engineering (SWE) tasks, where multi-turn reasoning and rule-based verification are critical. We introduce the SWE-oriented RL Framework, a unified system supporting multi-turn interaction, docker-based execution, and customizable reward functions. Additionally, we propose Gated Reward Accumulation (G-RA), a novel method that accumulates immediate rewards only when high-level (long-term) rewards meet a predefined threshold, ensuring stable RL optimization. Experiments on SWE-bench Verified and kBench demonstrate that G-RA leads to an increase in completion rates (47.6% → 93.8% and 22.0% → 86.0%) and modification rates (19.6% → 23.8% and 12.0% → 42.0%), while avoiding policy degradation caused by reward misalignment. Our findings highlight the importance of balanced reward accumulation in long-horizon RL and provide a practical solution.
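
The gating idea described in the abstract (immediate rewards count only when the long-term reward meets a threshold) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `gated_return`, the scalar `outcome_reward`, the `>=` comparison, and the plain summation are all assumptions made for the example.

```python
from typing import Sequence


def gated_return(
    immediate_rewards: Sequence[float],
    outcome_reward: float,
    threshold: float,
) -> float:
    """Return the trajectory reward under a simple gate.

    Assumption (not from the paper): the gate opens when the long-term
    (outcome) reward is at least `threshold`; only then are the stepwise
    immediate rewards added on top of the outcome reward.
    """
    if outcome_reward >= threshold:
        return outcome_reward + sum(immediate_rewards)
    # Gate closed: stepwise rewards are ignored, so the policy cannot
    # collect them without also achieving the long-term objective.
    return outcome_reward


if __name__ == "__main__":
    steps = [0.1, 0.2]
    print(gated_return(steps, outcome_reward=1.0, threshold=0.5))
    print(gated_return(steps, outcome_reward=0.0, threshold=0.5))
```

With the gate closed, a trajectory that only farms stepwise rewards earns nothing extra, which is one simple way to picture how gating could discourage reward hacking.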

[136] Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models cs.LG | cs.AI | cs.CLPDF

Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye

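TL;DR: This paper proposes a reinforcement learning training method based on Pass@k that improves large reasoning models by balancing exploration and exploitation.

Details

Motivation: Using Pass@1 as the reward in reinforcement learning tends to push policies toward conservative behavior and local optima. Pass@k is used in evaluation, but its role in exploration ability has not been well studied.

Result: Experiments show Pass@k Training effectively improves the model's exploration ability, that exploration and exploitation can reinforce each other, and that advantage function design is a promising research direction.

Insight: Exploration and exploitation are not opposing objectives and can be balanced through Pass@k; advantage function design is an important research direction in RLVR.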

Abstract: Reinforcement learning with verifiable rewards (RLVR), which typically adopts Pass@1 as the reward, has difficulty balancing exploration and exploitation, causing policies to prefer conservative actions and converge to a local optimum. Identifying an appropriate reward metric is therefore crucial. In prior work, although Pass@k has been used in evaluation, its connection to LLM exploration ability in RLVR remains largely overlooked. To investigate this, we first use Pass@k as the reward to train the policy model (i.e., Pass@k Training), and observe the improvement on its exploration ability. Next, we derive an analytical solution for the advantage of Pass@k Training, leading to an efficient and effective process. Building on this, our analysis reveals that exploration and exploitation are not inherently conflicting objectives and can instead mutually enhance each other. Moreover, Pass@k Training with analytical derivation essentially involves directly designing the advantage function. Inspired by this, we preliminarily explore the advantage design for RLVR, showing promising results and highlighting a potential future direction.
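
For readers unfamiliar with the metric, the sketch below computes the commonly used unbiased Pass@k estimate from n sampled responses of which c are correct. It illustrates the metric only; the function name `pass_at_k` is hypothetical, and the reward and analytical advantage used in the paper's Pass@k Training may be defined differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n samples with c correct ones.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With 3 correct answers out of 10 samples, compare k = 1 and k = 5.
    print(pass_at_k(10, 3, 1))
    print(pass_at_k(10, 3, 5))
```

For k = 1 the estimate reduces to the fraction of correct samples, c / n, which is the Pass@1 case the abstract contrasts against.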

[137] Memory-Augmented Transformers: A Systematic Review from Neuroscience Principles to Technical Solutions cs.LG | cs.CLPDF

Parsa Omidi, Xingshuai Huang, Axel Laborieux, Bahareh Nikpour, Tianyu Shi

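TL;DR: This paper reviews research on Memory-Augmented Transformers, connecting neuroscience principles with technical solutions; it presents a framework built on dynamic multi-timescale memory, discusses functional objectives, memory representations, and integration mechanisms, and points out future research directions.

Details

Motivation: To examine the limitations of Transformer architectures in long-range context retention, continual learning, and knowledge integration, and to propose improvements informed by neuroscience principles.

Result: The review reveals a shift from static caches toward adaptive, test-time learning systems, and identifies emerging solutions such as hierarchical buffering and surprise-gated updates.

Insight: Future work on Memory-Augmented Transformers should address scalability and interference, drawing on neuroscience principles to build cognitively inspired, lifelong-learning architectures.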

Abstract: Memory is fundamental to intelligence, enabling learning, reasoning, and adaptability across biological and artificial systems. While Transformer architectures excel at sequence modeling, they face critical limitations in long-range context retention, continual learning, and knowledge integration. This review presents a unified framework bridging neuroscience principles, including dynamic multi-timescale memory, selective attention, and consolidation, with engineering advances in Memory-Augmented Transformers. We organize recent progress through three taxonomic dimensions: functional objectives (context extension, reasoning, knowledge integration, adaptation), memory representations (parameter-encoded, state-based, explicit, hybrid), and integration mechanisms (attention fusion, gated control, associative retrieval). Our analysis of core memory operations (reading, writing, forgetting, and capacity management) reveals a shift from static caches toward adaptive, test-time learning systems. We identify persistent challenges in scalability and interference, alongside emerging solutions including hierarchical buffering and surprise-gated updates. This synthesis provides a roadmap toward cognitively-inspired, lifelong-learning Transformer architectures.

[138] From Intent to Execution: Multimodal Chain-of-Thought Reinforcement Learning for Precise CAD Code Generation cs.LG | cs.CVPDF

Ke Niu, Haiyang Yu, Zhuofan Chen, Mengyang Zhao, Teng Fu

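TL;DR: This paper proposes CAD-RL, a multimodal Chain-of-Thought (CoT) guided reinforcement learning framework for generating precise CAD code from natural language. It combines a CoT-based cold start with goal-driven reinforcement learning and introduces three task-specific rewards to address the translation from design intent to executable code.

Details

Motivation: Current CAD workflows demand extensive domain expertise and manual modeling, and existing methods struggle to translate design intent directly into executable CAD code; logical reasoning, syntactic correctness, and numerical precision are the key challenges.

Result: Experiments show CAD-RL significantly outperforms existing vision-language models in reasoning quality, output precision, and code executability.

Insight: Combining chain-of-thought with reinforcement learning can effectively handle the complexity of CAD code generation; the task-specific rewards and targeted optimization strategies are key to the performance gains.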

Abstract: Computer-Aided Design (CAD) plays a vital role in engineering and manufacturing, yet current CAD workflows require extensive domain expertise and manual modeling effort. Recent advances in large language models (LLMs) have made it possible to generate code from natural language, opening new opportunities for automating parametric 3D modeling. However, directly translating human design intent into executable CAD code remains highly challenging, due to the need for logical reasoning, syntactic correctness, and numerical precision. In this work, we propose CAD-RL, a multimodal Chain-of-Thought (CoT) guided reinforcement learning post-training framework for CAD modeling code generation. Our method combines CoT-based Cold Start with goal-driven reinforcement learning post-training using three task-specific rewards: executability reward, geometric accuracy reward, and external evaluation reward. To ensure stable policy learning under sparse and high-variance reward conditions, we introduce three targeted optimization strategies: Trust Region Stretch for improved exploration, Precision Token Loss for enhanced dimension-parameter accuracy, and Overlong Filtering to reduce noisy supervision. To support training and benchmarking, we release ExeCAD, a novel dataset comprising 16,540 real-world CAD examples with paired natural language and structured design language descriptions, executable CADQuery scripts, and rendered 3D models. Experiments demonstrate that CAD-RL achieves significant improvements in reasoning quality, output precision, and code executability over existing VLMs.

[139] Improving Learning of New Diseases through Knowledge-Enhanced Initialization for Federated Adapter Tuning cs.LG | cs.CVPDF

Danni Peng, Yuan Wang, Kangning Cai, Peiyan Ning, Jiming Xu

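TL;DR: The paper proposes FedKEI, a new framework that uses cross-client and cross-task knowledge transfer to provide optimized adapter initializations for new tasks, enabling fast adaptation to new diseases in medical federated learning.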

Details

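Motivation: Federated learning in healthcare needs to adapt quickly to new tasks or diseases, but existing methods fall short in leveraging past knowledge and in personalizing knowledge transfer.

Result: Experiments on three datasets (dermatology, chest X-rays, and retinal OCT) show that FedKEI outperforms existing methods at adapting to new diseases.

Insight: Cross-task knowledge transfer and personalized optimization are key to learning new tasks, especially given the diversity of medical data.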

Abstract: In healthcare, federated learning (FL) is a widely adopted framework that enables privacy-preserving collaboration among medical institutions. With large foundation models (FMs) demonstrating impressive capabilities, using FMs in FL through cost-efficient adapter tuning has become a popular approach. Given the rapidly evolving healthcare environment, it is crucial for individual clients to quickly adapt to new tasks or diseases by tuning adapters while drawing upon past experiences. In this work, we introduce Federated Knowledge-Enhanced Initialization (FedKEI), a novel framework that leverages cross-client and cross-task transfer from past knowledge to generate informed initializations for learning new tasks with adapters. FedKEI begins with a global clustering process at the server to generalize knowledge across tasks, followed by the optimization of aggregation weights across clusters (inter-cluster weights) and within each cluster (intra-cluster weights) to personalize knowledge transfer for each new task. To facilitate more effective learning of the inter- and intra-cluster weights, we adopt a bi-level optimization scheme that collaboratively learns the global intra-cluster weights across clients and optimizes the local inter-cluster weights toward each client’s task objective. Extensive experiments on three benchmark datasets of different modalities, including dermatology, chest X-rays, and retinal OCT, demonstrate FedKEI’s advantage in adapting to new diseases compared to state-of-the-art methods.
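
The two-level aggregation described in the abstract can be sketched as follows. This is an illustrative reading rather than FedKEI's actual code: representing adapters as flat lists of floats, normalizing the weights with a softmax, and the function names are all assumptions, and the bi-level optimization that learns the weights is not shown.

```python
import math


def softmax(logits):
    """Turn unconstrained logits into positive weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_average(vectors, weights):
    """Element-wise weighted average of equally sized parameter vectors."""
    dim = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) for d in range(dim)]


def knowledge_enhanced_init(clusters, intra_logits, inter_logits):
    """Build an adapter initialization from clustered past adapters.

    clusters[k] holds the adapters assigned to cluster k, intra_logits[k]
    weights the adapters inside cluster k, and inter_logits weights the
    per-cluster summaries against each other (all names hypothetical).
    """
    summaries = [
        weighted_average(adapters, softmax(logits))
        for adapters, logits in zip(clusters, intra_logits)
    ]
    return weighted_average(summaries, softmax(inter_logits))
```

With equal logits everywhere, the result is the plain mean of the per-cluster means; learning the logits is what would personalize the initialization for a given new task.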

[140] On Spectral Properties of Gradient-based Explanation Methods cs.LG | cs.AI | cs.CVPDF

Amir Mehrpanah, Erik Englesson, Hossein Azizpour

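TL;DR: This paper analyzes gradient-based explanation methods from probabilistic and spectral perspectives, reveals a spectral bias arising from the use of gradients, and proposes the SpectralLens aggregation method together with a mechanism for standardizing perturbation hyperparameters.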

Details

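Motivation: Despite extensive research on explainability methods, reliability issues persist, largely because of a lack of formal analysis. The paper aims to analyze existing explanation methods theoretically through novel probabilistic and spectral perspectives.

Result: Theoretical analysis and experimental validation show that the proposed formalism can significantly improve the reliability and consistency of explanations.

Insight: The spectral bias of gradients is a major cause of unreliable explanations; formal analysis can guide design choices in existing methods, such as squared gradients and input perturbation.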

Abstract: Understanding the behavior of deep networks is crucial to increase our confidence in their results. Despite an extensive body of work for explaining their predictions, researchers have faced reliability issues, which can be attributed to insufficient formalism. In our research, we adopt novel probabilistic and spectral perspectives to formally analyze explanation methods. Our study reveals a pervasive spectral bias stemming from the use of gradient, and sheds light on some common design choices that have been discovered experimentally, in particular, the use of squared gradient and input perturbation. We further characterize how the choice of perturbation hyperparameters in explanation methods, such as SmoothGrad, can lead to inconsistent explanations and introduce two remedies based on our proposed formalism: (i) a mechanism to determine a standard perturbation scale, and (ii) an aggregation method which we call SpectralLens. Finally, we substantiate our theoretical results through quantitative evaluations.
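
As background for the perturbation-scale discussion, the following sketch shows SmoothGrad-style averaging on a one-dimensional toy model. It illustrates only the baseline method the abstract mentions, not SpectralLens or the paper's standardization mechanism; the toy model f(x) = x**2, the function names, and the sample count are assumptions.

```python
import random


def smoothgrad(grad_fn, x, sigma, n_samples=2000, squared=False, seed=0):
    """Average the gradient (or squared gradient) over Gaussian-perturbed inputs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        g = grad_fn(x + rng.gauss(0.0, sigma))
        total += g * g if squared else g
    return total / n_samples


def grad_f(x):
    """Analytic gradient of the toy model f(x) = x**2."""
    return 2.0 * x
```

For this toy model the expected squared-gradient explanation at x is 4(x² + σ²), so the result depends directly on the perturbation scale σ. That dependence is the kind of hyperparameter sensitivity the abstract refers to.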

eess.IV [Back]

[141] Data-Efficient Learning for Generalizable Surgical Video Understanding eess.IV | cs.CVPDF

Sahar Nasirihaghighi

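TL;DR: The paper proposes data-efficient learning methods to address annotation scarcity, spatiotemporal complexity, and domain gaps in surgical video understanding, and improves model performance through semi-supervised learning and new architecture designs.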

Details

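Motivation: Surgical video analysis faces annotation scarcity, spatiotemporal complexity, and cross-domain gaps, which limit its deployment and use in real clinical settings.

Result: The semi-supervised frameworks achieved state-of-the-art results on surgical datasets, and the newly released GynSurg and Cataract-1K datasets support further research and reproducibility.

Insight: Data-efficient learning and semi-supervised methods can significantly improve the generalization of surgical video understanding and advance the practical use of AI in clinical surgery.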

Abstract: Advances in surgical video analysis are transforming operating rooms into intelligent, data-driven environments. Computer-assisted systems support full surgical workflow, from preoperative planning to intraoperative guidance and postoperative assessment. However, developing robust and generalizable models for surgical video understanding remains challenging due to (I) annotation scarcity, (II) spatiotemporal complexity, and (III) domain gap across procedures and institutions. This doctoral research aims to bridge the gap between deep learning-based surgical video analysis in research and its real-world clinical deployment. To address the core challenge of recognizing surgical phases, actions, and events, critical for analysis, I benchmarked state-of-the-art neural network architectures to identify the most effective designs for each task. I further improved performance by proposing novel architectures and integrating advanced modules. Given the high cost of expert annotations and the domain gap across surgical video sources, I focused on reducing reliance on labeled data. We developed semi-supervised frameworks that improve model performance across tasks by leveraging large amounts of unlabeled surgical video. We introduced novel semi-supervised frameworks, including DIST, SemiVT-Surge, and ENCORE, that achieved state-of-the-art results on challenging surgical datasets by leveraging minimal labeled data and enhancing model training through dynamic pseudo-labeling. To support reproducibility and advance the field, we released two multi-task datasets: GynSurg, the largest gynecologic laparoscopy dataset, and Cataract-1K, the largest cataract surgery video dataset. Together, this work contributes to robust, data-efficient, and clinically scalable solutions for surgical video analysis, laying the foundation for generalizable AI systems that can meaningfully impact surgical care and training.

[142] DINOMotion: advanced robust tissue motion tracking with DINOv2 in 2D-Cine MRI-guided radiotherapy eess.IV | cs.AI | cs.CVPDF

Soorena Salari, Catherine Spino, Laurie-Anne Pharand, Fabienne Lathuiliere, Hassan Rivaz

TL;DR: DINOMotion is a deep learning framework based on DINOv2 with LoRA layers for accurate and efficient soft-tissue motion tracking in 2D-Cine MRI-guided radiotherapy.

Details

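Motivation: Existing methods struggle with large misalignments and lack interpretability, so a more efficient and robust motion tracking method is needed.

Result: On the kidney, liver, and lung datasets, DINOMotion achieves Dice scores of 92.07%, 90.90%, and 95.23%, respectively, and processes each scan in approximately 30 ms.

Insight: DINOv2's strong feature representations make the method robust to large misalignments, and the efficiency of the LoRA layers significantly improves training efficiency.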

Abstract: Accurate tissue motion tracking is critical to ensure treatment outcome and safety in 2D-Cine MRI-guided radiotherapy. This is typically achieved by registration of sequential images, but existing methods often face challenges with large misalignments and lack of interpretability. In this paper, we introduce DINOMotion, a novel deep learning framework based on DINOv2 with Low-Rank Adaptation (LoRA) layers for robust, efficient, and interpretable motion tracking. DINOMotion automatically detects corresponding landmarks to derive optimal image registration, enhancing interpretability by providing explicit visual correspondences between sequential images. The integration of LoRA layers reduces trainable parameters, improving training efficiency, while DINOv2’s powerful feature representations offer robustness against large misalignments. Unlike iterative optimization-based methods, DINOMotion directly computes image registration at test time. Our experiments on volunteer and patient datasets demonstrate its effectiveness in estimating both linear and nonlinear transformations, achieving Dice scores of 92.07% for the kidney, 90.90% for the liver, and 95.23% for the lung, with corresponding Hausdorff distances of 5.47 mm, 8.31 mm, and 6.72 mm, respectively. DINOMotion processes each scan in approximately 30ms and consistently outperforms state-of-the-art methods, particularly in handling large misalignments. These results highlight its potential as a robust and interpretable solution for real-time motion tracking in 2D-Cine MRI-guided radiotherapy.
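
For readers unfamiliar with the two reported metrics, here is a small sketch of the Dice score and the symmetric Hausdorff distance on sets of pixel coordinates. It is not the paper's evaluation code: the set-of-tuples mask representation is an assumption, and distances come out in pixel units, so reporting millimetres as in the abstract would additionally require the pixel spacing.

```python
import math


def dice_score(a, b):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two sets of pixel coordinates."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * len(a & b) / (len(a) + len(b))


def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""

    def directed(src, dst):
        # Largest distance from a point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(a, b), directed(b, a))
```

This brute-force Hausdorff computation is quadratic in the number of points, which is fine for a sketch but slow for large masks.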

[143] DIVA-VQA: Detecting Inter-frame Variations in UGC Video Quality eess.IV | cs.CV | cs.MMPDF

Xinyi Wang, Angeliki Katsenou, David Bull

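TL;DR: DIVA-VQA is a no-reference video quality assessment model based on inter-frame variations. Using multi-level spatio-temporal fragment analysis that combines 2D and 3D features, it performs strongly on five UGC datasets with high runtime efficiency.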

Details

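Motivation: With the rapid growth of user-generated video content (UGC), demand for no-reference video quality assessment (NR-VQA) is increasing, yet existing methods struggle to capture spatio-temporal variations efficiently.

Result: On five UGC datasets, the model ranks among the top two in average rank correlation (DIVA-VQA-L: 0.898, DIVA-VQA-B: 0.886) at low runtime complexity, with DIVA-VQA-B ranked first and DIVA-VQA-L third on average for runtime.

Insight: Inter-frame variation is key to multi-level quality analysis; combining spatio-temporal features with a computationally efficient design can significantly improve NR-VQA performance.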

Abstract: The rapid growth of user-generated (video) content (UGC) has driven increased demand for research on no-reference (NR) perceptual video quality assessment (VQA). NR-VQA is a key component for large-scale video quality monitoring in social media and streaming applications where a pristine reference is not available. This paper proposes a novel NR-VQA model based on spatio-temporal fragmentation driven by inter-frame variations. By leveraging these inter-frame differences, the model progressively analyses quality-sensitive regions at multiple levels: frames, patches, and fragmented frames. It integrates frames, fragmented residuals, and fragmented frames aligned with residuals to effectively capture global and local information. The model extracts both 2D and 3D features in order to characterize these spatio-temporal variations. Experiments conducted on five UGC datasets and against state-of-the-art models ranked our proposed method among the top 2 in terms of average rank correlation (DIVA-VQA-L: 0.898 and DIVA-VQA-B: 0.886). The improved performance is offered at a low runtime complexity, with DIVA-VQA-B ranked top and DIVA-VQA-L third on average compared to the fastest existing NR-VQA method. Code and models are publicly available at: https://github.com/xinyiW915/DIVA-VQA.
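
One simple way to picture fragmentation driven by inter-frame variations is to difference consecutive frames and keep the patches with the largest residual energy. The sketch below does only that. It is an assumption-laden illustration, not the DIVA-VQA pipeline (see the linked repository for the authors' code): frames are plain nested lists of grayscale values, and the non-overlapping patch grid, energy measure, and function names are hypothetical.

```python
def frame_residual(prev, curr):
    """Absolute per-pixel difference between two equally sized frames."""
    return [
        [abs(c - p) for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def patch_energies(residual, patch):
    """Sum the residual inside each non-overlapping patch x patch block."""
    height, width = len(residual), len(residual[0])
    energies = {}
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            energies[(top, left)] = sum(
                residual[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            )
    return energies


def top_k_patches(prev, curr, patch, k):
    """Top-left corners of the k patches with the largest inter-frame change."""
    energies = patch_energies(frame_residual(prev, curr), patch)
    # Sort by descending energy, breaking ties by position for determinism.
    return sorted(energies, key=lambda pos: (-energies[pos], pos))[:k]
```

The selected patches could then be passed to a feature extractor, so that computation concentrates on regions that change between frames.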

Table of Contents

cs.CV [Back]

[1] DINOv3 cs.CV | cs.LGPDF

[2] Empowering Morphing Attack Detection using Interpretable Image-Text Foundation Model cs.CV | cs.AIPDF

[3] Interpretable Oracle Bone Script Decipherment through Radical and Pictographic Analysis with LVLMs cs.CVPDF

[4] Deep Learning Enables Large-Scale Shape and Appearance Modeling in Total-Body DXA Imaging cs.CVPDF

[5] MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning cs.CVPDF

[6] Improving watermelon (Citrullus lanatus) disease classification with generative artificial intelligence (GenAI)-based synthetic and real-field images via a custom EfficientNetV2-L model cs.CV | cs.AI | cs.ETPDF

[7] SynSpill: Improved Industrial Spill Detection With Synthetic Data cs.CV | cs.ETPDF

[8] EntropyGS: An Efficient Entropy Coding on 3D Gaussian Splatting cs.CVPDF

[9] CellSymphony: Deciphering the molecular and phenotypic orchestration of cells with single-cell pathomics cs.CVPDF

[10] Deep Learning for Crack Detection: A Review of Learning Paradigms, Generalizability, and Datasets cs.CVPDF

[11] MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs cs.CV | cs.AIPDF

[12] High Fidelity Text to Image Generation with Contrastive Alignment and Structural Guidance cs.CVPDF

[13] VIFSS: View-Invariant and Figure Skating-Specific Pose Representation Learning for Temporal Action Segmentation cs.CVPDF

[14] JRDB-Reasoning: A Difficulty-Graded Benchmark for Visual Reasoning in Robotics cs.CVPDF

[15] A Sub-Pixel Multimodal Optical Remote Sensing Images Matching Method cs.CVPDF

[16] InterSyn: Interleaved Learning for Dynamic Motion Synthesis in the Wild cs.CVPDF

[17] Integrating Reinforcement Learning with Visual Generative Models: Foundations and Advances cs.CVPDF

[18] Concepts or Skills? Rethinking Instruction Selection for Multi-modal Models cs.CV | cs.LGPDF

[19] Glo-DMU: A Deep Morphometry Framework of Ultrastructural Characterization in Glomerular Electron Microscopic Images cs.CVPDF

[20] Improving OCR for Historical Texts of Multiple Languages cs.CV | cs.CLPDF

[21] AtomDiffuser: Time-Aware Degradation Modeling for Drift and Beam Damage in STEM Imaging cs.CVPDF

[22] Contrast Sensitivity Function of Multimodal Vision-Language Models cs.CVPDF

[23] PQ-DAF: Pose-driven Quality-controlled Data Augmentation for Data-scarce Driver Distraction Detection cs.CV | cs.AIPDF

[24] SC-Lane: Slope-aware and Consistent Road Height Estimation Framework for 3D Lane Detection cs.CVPDF

[25] NanoControl: A Lightweight Framework for Precise and Efficient Control in Diffusion Transformer cs.CVPDF

[26] STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes cs.CVPDF

[27] CRISP: Contrastive Residual Injection and Semantic Prompting for Continual Video Instance Segmentation cs.CVPDF

[28] DOD-SA: Infrared-Visible Decoupled Object Detection with Single-Modality Annotations cs.CV | 68T07, 68T45, 68U10 | I.2.10PDF

[29] SkeySpot: Automating Service Key Detection for Digital Electrical Layout Plans in the Construction Industry cs.CV | cs.LGPDF

[30] From Images to Perception: Emergence of Perceptual Properties by Reconstructing Images cs.CVPDF

[31] Trajectory-aware Shifted State Space Models for Online Video Super-Resolution cs.CVPDF

[32] Multi-Label Plant Species Prediction with Metadata-Enhanced Multi-Head Vision Transformers cs.CV | cs.IR | cs.LGPDF

[33] SingleStrip: learning skull-stripping from a single labeled example cs.CV | cs.LGPDF

[34] Enhanced Sparse Point Cloud Data Processing for Privacy-aware Human Action Recognition cs.CV | cs.AIPDF

[35] STAMP: Multi-pattern Attention-aware Multiple Instance Learning for STAS Diagnosis in Multi-center Histopathology Images cs.CV | cs.CYPDF

[36] TweezeEdit: Consistent and Efficient Image Editing with Path Regularization cs.CVPDF

[37] EgoMusic-driven Human Dance Motion Estimation with Skeleton Mamba cs.CVPDF

[38] Reasoning in Computer Vision: Taxonomy, Models, Tasks, and Methodologies cs.CVPDF

[39] Med-GLIP: Advancing Medical Language-Image Pre-training with Large-scale Grounded Dataset cs.CV | cs.AIPDF

[40] GCRPNet: Graph-Enhanced Contextual and Regional Perception Network For Salient Object Detection in Optical Remote Sensing Images cs.CVPDF

[41] PSScreen: Partially Supervised Multiple Retinal Disease Screening cs.CVPDF

[42] AR Surgical Navigation With Surface Tracing: Comparing In-Situ Visualization with Tool-Tracking Guidance for Neurosurgical Applications cs.CVPDF

[43] Retrieval-Augmented Prompt for OOD Detection cs.CV | cs.AIPDF

[44] PTQAT: A Hybrid Parameter-Efficient Quantization Algorithm for 3D Perception Tasks cs.CV | cs.AIPDF

[45] HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis cs.CVPDF

[46] SpaRC-AD: A Baseline for Radar-Camera Fusion in End-to-End Autonomous Driving cs.CV | cs.ROPDF

[47] Adapting SAM via Cross-Entropy Masking for Class Imbalance in Remote Sensing Change Detection cs.CVPDF

[48] Towards Agentic AI for Multimodal-Guided Video Object Segmentation cs.CVPDF

[49] HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs cs.CVPDF

[50] EvTurb: Event Camera Guided Turbulence Removal cs.CVPDF

[51] Fourier-Guided Attention Upsampling for Image Super-Resolution cs.CV | cs.AIPDF

[52] FIND-Net – Fourier-Integrated Network with Dictionary Kernels for Metal Artifact Reduction cs.CV | eess.IVPDF

[53] Increasing the Utility of Synthetic Images through Chamfer Guidance cs.CVPDF

[54] ChatENV: An Interactive Vision-Language Model for Sensor-Guided Environmental Monitoring and Scenario Simulation cs.CVPDF

[55] Processing and acquisition traces in visual encoders: What does CLIP know about your camera? cs.CVPDF

[56] Lameness detection in dairy cows using pose estimation and bidirectional LSTMs cs.CVPDF

[57] SemPT: Semantic Prompt Tuning for Vision-Language Models cs.CVPDF

[58] Serial Over Parallel: Learning Continual Unification for Multi-Modal Visual Object Tracking and Benchmarking cs.CV | cs.AIPDF

[59] AddressVLM: Cross-view Alignment Tuning for Image Address Localization using Large Vision-Language Models cs.CV | cs.AIPDF

[60] IADGPT: Unified LVLM for Few-Shot Industrial Anomaly Detection, Localization, and Reasoning via In-Context Learning cs.CVPDF

[61] Beyond conventional vision: RGB-event fusion for robust object detection in dynamic traffic scenarios cs.CVPDF

[62] Revisiting Cross-View Localization from Image Matching cs.CVPDF

[63] EgoCross: Benchmarking Multimodal Large Language Models for Cross-Domain Egocentric Video Question Answering cs.CV | cs.AIPDF

[64] Dissecting Generalized Category Discovery: Multiplex Consensus under Self-Deconstruction cs.CV | cs.LGPDF

[65] Axis-level Symmetry Detection with Group-Equivariant Representation cs.CVPDF

[66] From Diagnosis to Improvement: Probing Spatio-Physical Reasoning in Vision Language Models cs.CVPDF

[67] AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences cs.CV | cs.AIPDF

[68] Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation cs.CV | cs.AI | cs.LGPDF

[69] Cooperative Face Liveness Detection from Optical Flow cs.CVPDF

[70] Object Fidelity Diffusion for Remote Sensing Image Generation cs.CVPDF

[71] Mobile-Friendly Deep Learning for Plant Disease Detection: A Lightweight CNN Benchmark Across 101 Classes of 33 Crops cs.CV | cs.LGPDF

[72] UI-Venus Technical Report: Building High-performance UI Agents with RFT cs.CVPDF

[73] Generalizable Federated Learning using Client Adaptive Focal Modulation cs.CVPDF

[74] Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation cs.CVPDF

[75] Performance of GPT-5 in Brain Tumor MRI Reasoning cs.CV | cs.AIPDF

[76] TexVerse: A Universe of 3D Objects with High-Resolution Textures cs.CVPDF

[77] Medico 2025: Visual Question Answering for Gastrointestinal Imaging cs.CV | cs.AI | 68T45, 92C55 | I.2.10; I.4.9PDF

[78] ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing cs.CV | cs.AIPDF