cs.CV [Total: 74]
cs.CL [Total: 39]

cs.CV

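[1] Deep Learning-Driven Multimodal Detection and Movement Analysis of Objects in Culinary cs.CV | cs.AI

Tahoshin Alam Ishat

TL;DR: This study combines a YOLOv8 segmentation model, an LSTM model trained on hand-motion sequences, and an ASR model to extract multimodal data from which an LLM generates a recipe and a step-by-step cooking guide, demonstrating the broad applicability of computer vision to kitchen tasks.

Details

Motivation: The study aims to tackle complex kitchen tasks by combining multimodal data (vision, motion, and speech), extending computer vision applications in daily life.

Result: The system performs well in complex kitchen environments, supporting the effectiveness of multimodal methods for everyday tasks.

Insight: Multimodal fusion is an effective approach to complex tasks such as kitchen activities; the models could be further optimized for more real-world applications.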

Abstract: This research explores existing models and fine-tunes them to combine a YOLOv8 segmentation model, an LSTM model trained on hand-point motion sequences, and an ASR model (whisper-base), extracting enough data for an LLM (TinyLLaMa) to predict the recipe and generate a step-by-step guide to the cooking procedure. All data were gathered by the author to build a robust, task-specific system that performs well in complex and challenging environments, demonstrating the broad applicability of computer vision to daily activities such as kitchen work. This work opens the field to many more important everyday tasks.

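[2] AMMKD: Adaptive Multimodal Multi-teacher Distillation for Lightweight Vision-Language Models cs.CV

Yuqi Li, Chuanguang Yang, Junhao Dong, Zhengtao Yao, Haoyan Xu

TL;DR: AMMKD is an adaptive multimodal multi-teacher knowledge distillation framework for training lightweight vision-language models efficiently; it combines multimodal feature fusion with multi-teacher distillation to significantly reduce model complexity while improving performance.

Details

Motivation: Large-scale vision-language pretraining (VLP) models excel at image-text retrieval, but their size and computational complexity limit deployment on mobile devices, calling for a lightweight yet effective solution.

Result: Experiments on three benchmark datasets show that AMMKD achieves superior performance while significantly reducing model complexity, validating its effectiveness and flexibility.

Insight: 1. Multi-teacher distillation with dynamic weight adjustment can reduce conflicts between teachers and improve learning; 2. Pre-computing text features is an efficient way to use information; 3. Decoupling the modalities further improves training efficiency.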

Abstract: The success of large-scale visual language pretraining (VLP) models has driven widespread adoption of image-text retrieval tasks. However, their deployment on mobile devices remains limited due to large model sizes and computational complexity. We propose Adaptive Multi-Modal Multi-Teacher Knowledge Distillation (AMMKD), a novel framework that integrates multi-modal feature fusion, multi-teacher distillation, and adaptive optimization to deliver lightweight yet effective retrieval models. Specifically, our method begins with a feature fusion network that extracts and merges discriminative features from both the image and text modalities. To reduce model parameters and further improve performance, we design a multi-teacher knowledge distillation framework to pre-train two CLIP teacher models. We decouple modalities by pre-computing and storing text features as class vectors via the teacher text encoder to enhance efficiency. To better align teacher and student outputs, we apply KL divergence for probability distribution matching. Finally, we design an adaptive dynamic weighting scheme that treats multi-teacher distillation as a multi-objective optimization problem. By leveraging gradient space diversity, we dynamically adjust the influence of each teacher, reducing conflicts and guiding the student toward more optimal learning directions. Extensive experiments on three benchmark datasets demonstrate that AMMKD achieves superior performance while significantly reducing model complexity, validating its effectiveness and flexibility.
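
The KL-based distribution matching with several teachers can be illustrated with a minimal sketch. It assumes the standard KL divergence between temperature-softened softmax outputs; the function names, the temperature, and the fixed teacher weights are hypothetical stand-ins, since the abstract's adaptive weighting is derived from gradient information that is not reproduced here.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def multi_teacher_kd_loss(student_logits, teacher_logits_list, teacher_weights,
                          temperature=2.0):
    """Weighted sum of KL(teacher || student) over all teachers.

    teacher_weights are fixed here (an assumption); AMMKD adjusts them dynamically.
    """
    student_probs = softmax(student_logits, temperature)
    loss = 0.0
    for weight, teacher_logits in zip(teacher_weights, teacher_logits_list):
        teacher_probs = softmax(teacher_logits, temperature)
        loss += weight * kl_divergence(teacher_probs, student_probs)
    return loss
```

The loss is zero when the student reproduces every teacher's distribution and grows as the distributions diverge.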

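[3] ARTPS: Depth-Enhanced Hybrid Anomaly Detection and Learnable Curiosity Score for Autonomous Rover Target Prioritization cs.CV | cs.AI | 68T45, 68T07, 68U10 | I.2.10; I.4.8; I.5.4; J.2

Poyraz Baydemir

TL;DR: ARTPS is a hybrid AI system that combines depth estimation, anomaly detection, and a learnable curiosity score for autonomous planetary surface exploration, and it performs strongly on Mars rover data.

Details

Motivation: Target prioritization in planetary exploration needs efficient, accurate anomaly detection and decision support; conventional methods struggle to account for multi-dimensional information.

Result: AUROC of 0.94, AUPRC of 0.89, and F1-score of 0.87; false positives are reduced by 23% while detection sensitivity stays high.

Insight: Hybrid methods perform well at fusing multi-dimensional information, and a learnable scoring mechanism is key to improving exploration efficiency.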

Abstract: We present ARTPS (Autonomous Rover Target Prioritization System), a novel hybrid AI system that combines depth estimation, anomaly detection, and learnable curiosity scoring for autonomous exploration of planetary surfaces. Our approach integrates monocular depth estimation using Vision Transformers with multi-component anomaly detection and a weighted curiosity score that balances known value, anomaly signals, depth variance, and surface roughness. The system achieves state-of-the-art performance with AUROC of 0.94, AUPRC of 0.89, and F1-Score of 0.87 on Mars rover datasets. We demonstrate significant improvements in target prioritization accuracy through ablation studies and provide comprehensive analysis of component contributions. The hybrid fusion approach reduces false positives by 23% while maintaining high detection sensitivity across diverse terrain types.
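
A weighted curiosity score of the kind described above might look like the following sketch. The four inputs mirror the signals named in the abstract; the weight values, the dictionary keys, and the assumption that each signal is normalized to [0, 1] are hypothetical, and in ARTPS the weights are learned rather than fixed.

```python
def curiosity_score(known_value, anomaly, depth_variance, roughness,
                    weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted sum of four signals; the weights are placeholder values."""
    features = (known_value, anomaly, depth_variance, roughness)
    return sum(w * f for w, f in zip(weights, features))

def rank_targets(targets, weights=(0.4, 0.3, 0.2, 0.1)):
    """Sort candidate targets from most to least interesting."""
    def score(target):
        return curiosity_score(target["known_value"], target["anomaly"],
                               target["depth_variance"], target["roughness"],
                               weights)
    return sorted(targets, key=score, reverse=True)
```

Given a list of candidate targets, `rank_targets` returns them in descending score order, which is the prioritization step in miniature.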

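[4] Performance is not All You Need: Sustainability Considerations for Algorithms cs.CV | cs.PF

Xiang Li, Chong Zhang, Hongpeng Wang, Shreyank Narayana Gowda, Yushi Li

TL;DR: The paper proposes a two-dimensional sustainability evaluation system that balances algorithm performance and energy consumption through two quantitative indicators (FMS and ASC), moving green AI from theory toward practice.

Details

Motivation: Training deep learning models produces high carbon emissions, and the traditional performance-only evaluation paradigm cannot balance performance against energy consumption, so a new evaluation method is needed.

Result: Experiments show that the system provides a quantitative basis for cross-task algorithm evaluation and supports putting green AI research into practice.

Insight: Sustainability evaluation should consider energy consumption alongside performance, helping the industry establish algorithm energy-efficiency standards.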

Abstract: This work focuses on the high carbon emissions generated by deep learning model training, specifically addressing the core challenge of balancing algorithm performance and energy consumption. It proposes an innovative two-dimensional sustainability evaluation system. Different from the traditional single performance-oriented evaluation paradigm, this study pioneered two quantitative indicators that integrate energy efficiency ratio and accuracy: the sustainable harmonic mean (FMS) integrates accumulated energy consumption and performance parameters through the harmonic mean to reveal the algorithm performance under unit energy consumption; the area under the sustainability curve (ASC) constructs a performance-power consumption curve to characterize the energy efficiency characteristics of the algorithm throughout the cycle. To verify the universality of the indicator system, the study constructed benchmarks in various multimodal tasks, including image classification, segmentation, pose estimation, and batch and online learning. Experiments demonstrate that the system can provide a quantitative basis for evaluating cross-task algorithms and promote the transition of green AI research from theory to practice. Our sustainability evaluation framework code can be found here, providing methodological support for the industry to establish algorithm energy efficiency standards.
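
The abstract does not give the exact formulas for FMS and ASC, so the following is only a sketch of the two underlying operations: a harmonic mean of two scores, and the area under a performance-versus-energy curve computed with the trapezoidal rule. Treating energy efficiency as a score already normalized to (0, 1] is an assumption, and the function names are hypothetical.

```python
def harmonic_mean(performance, efficiency):
    """Harmonic mean of two positive scores; returns 0.0 if either is non-positive."""
    if performance <= 0 or efficiency <= 0:
        return 0.0
    return 2 * performance * efficiency / (performance + efficiency)

def area_under_curve(energies, performances):
    """Trapezoidal area under a performance curve sampled at increasing energy values."""
    area = 0.0
    for i in range(1, len(energies)):
        width = energies[i] - energies[i - 1]
        area += width * (performances[i] + performances[i - 1]) / 2
    return area
```

The harmonic mean penalizes an algorithm that is strong on one axis but weak on the other, which is the usual motivation for choosing it over an arithmetic mean.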

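[5] MESTI-MEGANet: Micro-expression Spatio-Temporal Image and Micro-expression Gradient Attention Networks for Micro-expression Recognition cs.CV

Luu Tu Nguyen, Vu Tram Anh Khuong, Thanh Ha Le, Thi Duyen Ngo

TL;DR: Proposes MESTI, a new input modality, and MEGANet, a new network architecture, for micro-expression recognition, significantly improving recognition performance.

Details

Motivation: Traditional input modalities (such as Apex Frame, Optical Flow, and Dynamic Image) struggle to capture the subtle, fleeting characteristics of micro-expressions, leading to suboptimal performance.

Result: MESTI significantly outperforms existing input modalities, and MEGANet achieves SOTA results on the CASMEII and SAMM datasets.

Insight: Combining MESTI and MEGANet offers a more effective approach to micro-expression recognition, with potential across a variety of applications.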

Abstract: Micro-expression recognition (MER) is a challenging task due to the subtle and fleeting nature of micro-expressions. Traditional input modalities, such as Apex Frame, Optical Flow, and Dynamic Image, often fail to adequately capture these brief facial movements, resulting in suboptimal performance. In this study, we introduce the Micro-expression Spatio-Temporal Image (MESTI), a novel dynamic input modality that transforms a video sequence into a single image while preserving the essential characteristics of micro-movements. Additionally, we present the Micro-expression Gradient Attention Network (MEGANet), which incorporates a novel Gradient Attention block to enhance the extraction of fine-grained motion features from micro-expressions. By combining MESTI and MEGANet, we aim to establish a more effective approach to MER. Extensive experiments were conducted to evaluate the effectiveness of MESTI, comparing it with existing input modalities across three CNN architectures (VGG19, ResNet50, and EfficientNetB0). Moreover, we demonstrate that replacing the input of previously published MER networks with MESTI leads to consistent performance improvements. The performance of MEGANet, both with MESTI and Dynamic Image, is also evaluated, showing that our proposed network achieves state-of-the-art results on the CASMEII and SAMM datasets. The combination of MEGANet and MESTI achieves the highest accuracy reported to date, setting a new benchmark for micro-expression recognition. These findings underscore the potential of MESTI as a superior input modality and MEGANet as an advanced recognition network, paving the way for more effective MER systems in a variety of applications.

[6] Scaffold Diffusion: Sparse Multi-Category Voxel Structure Generation with Discrete Diffusion cs.CV | cs.AI | cs.LG

Justin Jung

TL;DR: Paper introduces Scaffold Diffusion, a generative model using discrete diffusion for sparse multi-category 3D voxel structures, demonstrating effectiveness on highly sparse data like Minecraft houses.

Details

Motivation: Generating sparse multi-category 3D voxel structures is challenging due to cubic memory scaling and class imbalance from sparsity. Existing methods struggle with coherence and scalability.

Result: Outperforms prior baselines and autoregressive models on the 3D-Craft dataset, generating realistic structures despite over 98% sparsity.

Insight: Discrete diffusion is a promising framework for 3D sparse voxel generation, offering scalability and coherence even with extreme sparsity.

Abstract: Generating realistic sparse multi-category 3D voxel structures is difficult due to the cubic memory scaling of voxel structures and moreover the significant class imbalance caused by sparsity. We introduce Scaffold Diffusion, a generative model designed for sparse multi-category 3D voxel structures. By treating voxels as tokens, Scaffold Diffusion uses a discrete diffusion language model to generate 3D voxel structures. We show that discrete diffusion language models can be extended beyond inherently sequential domains such as text to generate spatially coherent 3D structures. We evaluate on Minecraft house structures from the 3D-Craft dataset and demonstrate that, unlike prior baselines and an auto-regressive formulation, Scaffold Diffusion produces realistic and coherent structures even when trained on data with over 98% sparsity. We provide an interactive viewer where readers can visualize generated samples and the generation process. Our results highlight discrete diffusion as a promising framework for 3D sparse voxel generative modeling.
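
To make "treating voxels as tokens" concrete, here is a small sketch that flattens a 3D grid into a token sequence, measures its sparsity, and applies a masking-style corruption step. The absorbing-mask forward process is one common discrete diffusion formulation; whether Scaffold Diffusion uses exactly this process is an assumption, and the `AIR` and `MASK` ids are hypothetical.

```python
import random

AIR = 0     # hypothetical id for the empty (air) category
MASK = -1   # hypothetical id for the absorbing mask token

def flatten_voxels(grid):
    """Flatten a nested [x][y][z] grid of category ids into one token list."""
    return [voxel for plane in grid for row in plane for voxel in row]

def sparsity(tokens):
    """Fraction of tokens that are empty."""
    return tokens.count(AIR) / len(tokens)

def mask_tokens(tokens, mask_rate, rng):
    """Replace each token with MASK independently with probability mask_rate."""
    return [MASK if rng.random() < mask_rate else t for t in tokens]
```

A denoising model would then be trained to recover the original categories at the masked positions; with over 98% of tokens being empty, most of that prediction task is trivially "air", which is the class imbalance the abstract points to.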

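[7] Dual-Stage Global and Local Feature Framework for Image Dehazing cs.CV

Anas M. Ali, Anis Koubaa, Bilel Benjdira

TL;DR: This work proposes SGLC, a dual-stage framework that fuses global and local features for high-resolution image dehazing. By combining a Global Features Generator (GFG) with a Local Features Enhancer (LFE), SGLC significantly improves dehazing on high-resolution images.

Details

Motivation: High-resolution dehazing struggles to combine global context with fine local detail, and existing methods often lose performance because of downsampling or patch-wise processing. This work aims to close that gap.

Result: On high-resolution datasets, SGLC considerably improves PSNR, supporting its effectiveness for dehazing.

Insight: Combining global and local features is crucial for high-resolution dehazing, and the SGLC framework offers a general-purpose solution.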

Abstract: Addressing the challenge of removing atmospheric fog or haze from digital images, known as image dehazing, has recently gained significant traction in the computer vision community. Although contemporary dehazing models have demonstrated promising performance, few have thoroughly investigated high-resolution imagery. In such scenarios, practitioners often resort to downsampling the input image or processing it in smaller patches, which leads to a notable performance degradation. This drop is primarily linked to the difficulty of effectively combining global contextual information with localized, fine-grained details as the spatial resolution grows. In this chapter, we propose a novel framework, termed the Streamlined Global and Local Features Combinator (SGLC), to bridge this gap and enable robust dehazing for high-resolution inputs. Our approach is composed of two principal components: the Global Features Generator (GFG) and the Local Features Enhancer (LFE). The GFG produces an initial dehazed output by focusing on broad contextual understanding of the scene. Subsequently, the LFE refines this preliminary output by enhancing localized details and pixel-level features, thereby capturing the interplay between global appearance and local structure. To evaluate the effectiveness of SGLC, we integrated it with the Uformer architecture, a state-of-the-art dehazing model. Experimental results on high-resolution datasets reveal a considerable improvement in peak signal-to-noise ratio (PSNR) when employing SGLC, indicating its potency in addressing haze in large-scale imagery. Moreover, the SGLC design is model-agnostic, allowing any dehazing network to be augmented with the proposed global-and-local feature fusion mechanism. Through this strategy, practitioners can harness both scene-level cues and granular details, significantly improving visual fidelity in high-resolution environments.
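
Since the reported gains are stated in PSNR, a short reference implementation of that metric may help. This sketch uses the standard definition, 10 * log10(MAX^2 / MSE), on flat lists of pixel values; the 8-bit maximum of 255 is an assumption about the image format.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)
```

Higher values mean the dehazed output is closer to the haze-free reference.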

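[8] Waste-Bench: A Comprehensive Benchmark for Evaluating VLLMs in Cluttered Environments cs.CV | cs.AI

Muhammad Ali, Salman Khan

TL;DR: The paper introduces Waste-Bench, a new dataset for waste classification in complex environments (for example, with deformed objects), together with a comprehensive evaluation of vision large language models (VLLMs) in cluttered environments.

Details

Motivation: LLMs perform well on standard natural images, but their capabilities in complex, cluttered environments, especially scenes with deformed objects, remain underexplored. The authors therefore propose a benchmark dataset focused on waste classification.

Result: VLLMs still have room for improvement in complex environments, particularly in recognizing deformed objects.

Insight: Current VLLMs have limitations in complex, cluttered environments, underscoring the need to further improve model robustness.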

Abstract: Recent advancements in Large Language Models (LLMs) have paved the way for Vision Large Language Models (VLLMs) capable of performing a wide range of visual understanding tasks. While LLMs have demonstrated impressive performance on standard natural images, their capabilities have not been thoroughly explored on cluttered datasets that feature complex environments with deformed objects. In this work, we introduce a novel dataset specifically designed for waste classification in real-world scenarios, characterized by complex environments and deformed objects. Along with this dataset, we present an in-depth evaluation approach to rigorously assess the robustness and accuracy of VLLMs. The introduced dataset and comprehensive analysis provide valuable insights into the performance of VLLMs under challenging conditions. Our findings highlight the critical need for further advancements in VLLM robustness to perform better in complex environments. The dataset and code for our experiments will be made publicly available.

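[9] Category-level Text-to-Image Retrieval Improved: Bridging the Domain Gap with Diffusion Models and Vision Encoders cs.CV

Faizan Farooq Khan, Vladan Stojnić, Zakaria Laskar, Mohamed Elhoseiny, Giorgos Tolias

TL;DR: The paper proposes an improved method for category-level text-to-image retrieval: a generative diffusion model turns the text query into a visual query, and a vision encoder is used to reduce the gap between modalities, significantly improving retrieval performance.

Details

Motivation: In existing vision-and-language models such as CLIP, text and images lie far apart in the representation space (the modality gap), which limits text-to-image retrieval. The paper aims to bridge this gap with generative diffusion models and vision encoders.

Result: Experiments show that the method outperforms existing approaches on category-level text-to-image retrieval, supporting the effectiveness of generated visual queries and cross-modal fusion.

Insight: Combining generative models with vision encoders can reduce the modality gap between text and images and improve cross-modal retrieval.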

Abstract: This work explores text-to-image retrieval for queries that specify or describe a semantic category. While vision-and-language models (VLMs) like CLIP offer a straightforward open-vocabulary solution, they map text and images to distant regions in the representation space, limiting retrieval performance. To bridge this modality gap, we propose a two-step approach. First, we transform the text query into a visual query using a generative diffusion model. Then, we estimate image-to-image similarity with a vision model. Additionally, we introduce an aggregation network that combines multiple generated images into a single vector representation and fuses similarity scores across both query modalities. Our approach leverages advancements in vision encoders, VLMs, and text-to-image generation models. Extensive evaluations show that it consistently outperforms retrieval methods relying solely on text queries. Source code is available at: https://github.com/faixan-khan/cletir
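
The two-step retrieval and score fusion can be sketched as follows. This is a simplified illustration, not the authors' method: the learned aggregation network is replaced by plain mean pooling, the fusion is a fixed weighted sum with a hypothetical `alpha`, and the embeddings are assumed to be precomputed (the `"vlm"` and `"vision"` keys are invented for the example).

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Element-wise mean; stands in for the learned aggregation network."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def fused_scores(text_emb, generated_embs, gallery, alpha=0.5):
    """Combine text-to-image and image-to-image similarity for each gallery item.

    text_emb: text query embedding in the VLM space.
    generated_embs: vision-encoder embeddings of images generated from the query.
    gallery: list of dicts holding each image's "vlm" and "vision" embeddings.
    """
    visual_query = mean_vector(generated_embs)
    scores = []
    for item in gallery:
        text_sim = cosine(text_emb, item["vlm"])
        image_sim = cosine(visual_query, item["vision"])
        scores.append(alpha * text_sim + (1 - alpha) * image_sim)
    return scores
```

Sorting the gallery by these scores gives the final ranking; setting `alpha=1.0` recovers text-only retrieval as a baseline.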

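[10] Safe-LLaVA: A Privacy-Preserving Vision-Language Dataset and Benchmark for Biometric Safety cs.CV

Younggun Kim, Sirnam Swetha, Fazil Kagdi, Mubarak Shah

TL;DR: The paper introduces the Safe-LLaVA dataset and the PRISM benchmark to address multimodal large language models (MLLMs) unintentionally leaking sensitive biometric information, and substantially reduces such leakage through systematic data cleaning and model fine-tuning.

Details

Motivation: Existing MLLMs unintentionally infer or reveal sensitive biometric information (such as race and gender) during vision-language tasks, yet no public dataset or benchmark exists to comprehensively evaluate or mitigate the problem.

Result: Evaluation reveals widespread biometric leakage in existing MLLMs, while a model fine-tuned on Safe-LLaVA leaks substantially less.

Insight: Privacy protection needs to start at the dataset and benchmark level; systematic cleaning and fine-tuning can effectively reduce the privacy risks of MLLMs.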

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks. However, these models often infer and reveal sensitive biometric attributes - such as race, gender, age, body weight, and eye color - even when such information is not explicitly requested. This raises critical concerns, particularly in real-world applications and socially-sensitive domains. Despite increasing awareness, no publicly available dataset or benchmark exists to comprehensively evaluate or mitigate biometric leakage in MLLMs. To address this gap, we introduce PRISM (Privacy-aware Evaluation of Responses in Sensitive Modalities), a new benchmark designed to assess MLLMs on two fronts: (1) refusal of biometric-related queries and (2) implicit biometric leakage in general responses while maintaining semantic faithfulness. Further, we conduct a detailed audit of the widely used LLaVA datasets and uncover extensive biometric leakage across pretraining and instruction data. To address this, we present the Safe-LLaVA dataset, the first privacy-preserving MLLM training dataset, constructed by systematically removing explicit and implicit biometric information from the LLaVA dataset. Our evaluations on PRISM reveal biometric leakage across MLLMs for different attributes, highlighting the detailed privacy violations. We also fine-tune a model on the Safe-LLaVA dataset and show that it substantially reduces biometric leakage. Together, Safe-LLaVA & PRISM set a new standard for privacy-aligned development and evaluation of MLLMs. The Safe-LLaVA dataset & PRISM benchmark are publicly available at https://huggingface.co/datasets/kyh9191/Safe-LLaVA, and the source code is available at https://github.com/Kimyounggun99/Safe-LLaVA.git.

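[11] Beyond Pixels: Introducing Geometric-Semantic World Priors for Video-based Embodied Models via Spatio-temporal Alignment cs.CV | cs.AI

Jinzhou Tang, Jusheng Zhang, Sidi Liu, Waikit Xiu, Qinhan Lv

TL;DR: The paper proposes VEME, which uses cross-modal alignment and geometric-semantic world priors to improve the spatio-temporal reasoning and task adaptability of video-based embodied models in dynamic environments.

Details

Motivation: Existing vision-language models (VLMs) excel at static scene understanding but fall short in spatio-temporal reasoning for dynamic, open-set tasks such as task-oriented navigation and embodied question answering, because they model fine-grained spatio-temporal cues and physical-world understanding inadequately.

Result: On VSI-Bench and VLN-CE, accuracy and exploration efficiency improve by 1%-3% over traditional approaches.

Insight: Introducing geometric-semantic world priors markedly improves adaptation to dynamic environments, highlighting the key role of spatio-temporal alignment in embodied intelligence tasks.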

Abstract: Achieving human-like reasoning in deep learning models for complex tasks in unknown environments remains a critical challenge in embodied intelligence. While advanced vision-language models (VLMs) excel in static scene understanding, their limitations in spatio-temporal reasoning and adaptation to dynamic, open-set tasks like task-oriented navigation and embodied question answering (EQA) persist due to inadequate modeling of fine-grained spatio-temporal cues and physical world comprehension. To address this, we propose VEME, a novel cross-modal alignment method that enhances generalization in unseen scenes by learning an ego-centric, experience-centered world model. Our framework integrates three key components: (1) a cross-modal alignment framework bridging objects, spatial representations, and visual semantics with spatio-temporal cues to enhance VLM in-context learning; (2) a dynamic, implicit cognitive map activated by world embedding to enable task-relevant geometric-semantic memory recall; and (3) an instruction-based navigation and reasoning framework leveraging embodied priors for long-term planning and efficient exploration. By embedding geometry-aware spatio-temporal episodic experiences, our method significantly improves reasoning and planning in dynamic environments. Experimental results on VSI-Bench and VLN-CE demonstrate 1%-3% accuracy and exploration efficiency improvement compared to traditional approaches.

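[12] Multimodal Deep Learning for Phyllodes Tumor Classification from Ultrasound and Clinical Data cs.CV | cs.AI

Farhan Fuad Abir, Abigail Elliott Daly, Kyle Anderman, Tolga Ozmen, Laura J. Brattain

TL;DR: The paper proposes a multimodal deep learning framework that combines breast ultrasound images with clinical data to classify phyllodes tumors more accurately and reduce unnecessary surgical excisions.

Details

Motivation: Phyllodes tumors (PTs) are hard to distinguish from benign fibroadenomas before surgery, which leads to unnecessary excisions. The study aims to improve diagnostic accuracy with a multimodal approach.

Result: The multimodal method outperforms unimodal baselines; ConvNeXt and ResNet18 perform best, with AUC-ROC scores of 0.9427 and 0.9349 and F1-scores of 0.6720 and 0.7294, respectively.

Insight: Multimodal AI can serve as a non-invasive diagnostic tool and improve clinical decision-making in breast tumor management.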

Abstract: Phyllodes tumors (PTs) are rare fibroepithelial breast lesions that are difficult to classify preoperatively due to their radiological similarity to benign fibroadenomas. This often leads to unnecessary surgical excisions. To address this, we propose a multimodal deep learning framework that integrates breast ultrasound (BUS) images with structured clinical data to improve diagnostic accuracy. We developed a dual-branch neural network that extracts and fuses features from ultrasound images and patient metadata from 81 subjects with confirmed PTs. Class-aware sampling and subject-stratified 5-fold cross-validation were applied to prevent class imbalance and data leakage. The results show that our proposed multimodal method outperforms unimodal baselines in classifying benign versus borderline/malignant PTs. Among six image encoders, ConvNeXt and ResNet18 achieved the best performance in the multimodal setting, with AUC-ROC scores of 0.9427 and 0.9349, and F1-scores of 0.6720 and 0.7294, respectively. This study demonstrates the potential of multimodal AI to serve as a non-invasive diagnostic tool, reducing unnecessary biopsies and improving clinical decision-making in breast tumor management.
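
Subject-stratified cross-validation keeps all images from one patient in the same fold (avoiding leakage) while spreading each class evenly across folds. The sketch below shows one simple way to build such folds, assuming each subject carries a single class label; the round-robin assignment and the function name are illustrative choices, not the authors' procedure.

```python
from collections import defaultdict

def subject_stratified_folds(subject_labels, n_folds=5):
    """Assign subjects to folds so that each class is spread evenly.

    subject_labels: dict mapping subject id -> class label.
    Returns a list of n_folds lists of subject ids; no subject appears twice.
    """
    by_label = defaultdict(list)
    for subject, label in sorted(subject_labels.items()):
        by_label[label].append(subject)

    folds = [[] for _ in range(n_folds)]
    position = 0
    for label in sorted(by_label):
        # Deal subjects of this class out to the folds in turn.
        for subject in by_label[label]:
            folds[position % n_folds].append(subject)
            position += 1
    return folds
```

Each fold then serves once as the validation set, with every image of a held-out subject excluded from training.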

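[13] GraViT: Transfer Learning with Vision Transformers and MLP-Mixer for Strong Gravitational Lens Discovery cs.CV | astro-ph.GA

René Parlange, Juan C. Cuevas-Tello, Octavio Valenzuela, Omar de J. Cabrera-Rosas, Tomás Verdugo

TL;DR: GraViT is a PyTorch pipeline for gravitational lens detection that leverages pretrained Vision Transformers (ViT) and MLP-Mixer. It studies how transfer learning affects classification performance and benchmarks against convolutional baselines.

Details

Motivation: Gravitational lensing is crucial for studying dark matter and inferring cosmological parameters. LSST is expected to find O(10^5) gravitational lenses over the next decade, which calls for automated classifiers.

Result: Ten architectures are fine-tuned on datasets from HOLISMOKES VI and SuGOHI X and benchmarked against convolutional baselines, with an analysis of complexity and inference time.

Insight: ViT and MLP-Mixer show promise for gravitational lens detection, and transfer learning may improve model performance.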

Abstract: Gravitational lensing offers a powerful probe into the properties of dark matter and is crucial to infer cosmological parameters. The Legacy Survey of Space and Time (LSST) is predicted to find O(10^5) gravitational lenses over the next decade, demanding automated classifiers. In this work, we introduce GraViT, a PyTorch pipeline for gravitational lens detection that leverages extensive pretraining of state-of-the-art Vision Transformer (ViT) models and MLP-Mixer. We assess the impact of transfer learning on classification performance by examining data quality (source and sample size), model architecture (selection and fine-tuning), training strategies (augmentation, normalization, and optimization), and ensemble predictions. This study reproduces the experiments in a previous systematic comparison of neural networks and provides insights into the detectability of strong gravitational lenses on that common test sample. We fine-tune ten architectures using datasets from HOLISMOKES VI and SuGOHI X, and benchmark them against convolutional baselines, discussing complexity and inference-time analysis.

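[14] A High-Accuracy Fast Hough Transform with Linear-Log-Cubed Computational Complexity for Arbitrary-Shaped Images cs.CV

Danil Kazimirov, Dmitry Nikolaev

TL;DR: The paper proposes FHT2SP, a new Hough transform algorithm that achieves high accuracy on arbitrary-shaped images with linear-log-cubed computational complexity, keeping computation efficient while bounding the error by a constant.

Details

Motivation: The Hough transform is essential in image analysis and other fields. Fast algorithms such as FHT2DT are efficient but lose accuracy as image size grows, while highly accurate algorithms are too expensive computationally. An algorithm that is both fast and accurate is therefore needed.

Result: Theoretical and experimental analyses show that FHT2SP achieves near-optimal computational complexity on images of arbitrary size, with an error that is independent of image size and controllable via a meta-parameter.

Insight: Extending the superpixel concept and parameterizing the design offer a new route to a Hough transform that is both fast and accurate, suitable for a wide range of practical applications.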

Abstract: The Hough transform (HT) is a fundamental tool across various domains, from classical image analysis to neural networks and tomography. Two key aspects of the algorithms for computing the HT are their computational complexity and accuracy - the latter often defined as the error of approximation of continuous lines by discrete ones within the image region. The fast HT (FHT) algorithms with optimal linearithmic complexity - such as the Brady-Yong algorithm for power-of-two-sized images - are well established. Generalizations like $FHT2DT$ extend this efficiency to arbitrary image sizes, but with reduced accuracy that worsens with scale. Conversely, accurate HT algorithms achieve constant-bounded error but require near-cubic computational cost. This paper introduces $FHT2SP$ algorithm - a fast and highly accurate HT algorithm. It builds on our development of Brady’s superpixel concept, extending it to arbitrary shapes beyond the original power-of-two square constraint, and integrates it into the $FHT2DT$ algorithm. With an appropriate choice of the superpixel’s size, for an image of shape $w \times h$, the $FHT2SP$ algorithm achieves near-optimal computational complexity $\mathcal{O}(wh \ln^3 w)$, while keeping the approximation error bounded by a constant independent of image size, and controllable via a meta-parameter. We provide theoretical and experimental analyses of the algorithm’s complexity and accuracy.
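
To get a feel for where O(wh ln^3 w) sits between the linearithmic and near-cubic regimes, the sketch below compares the three growth rates with constant factors ignored. Modelling the near-cubic cost as w * h * max(w, h) is an assumption made for illustration, and the function names are hypothetical.

```python
import math

def linearithmic_cost(w, h):
    """Growth rate of the fast Hough transform: w * h * ln(w)."""
    return w * h * math.log(w)

def fht2sp_cost(w, h):
    """Growth rate stated for FHT2SP: w * h * ln(w)^3."""
    return w * h * math.log(w) ** 3

def near_cubic_cost(w, h):
    """Assumed growth rate of accurate but slow algorithms: w * h * max(w, h)."""
    return w * h * max(w, h)
```

For a 4096 x 4096 image, ln^3(w) is roughly 575 while max(w, h) is 4096, so under these assumptions the linear-log-cubed rate is several times below the near-cubic one, and the gap widens as the image grows.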

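[15] Generative AI for Industrial Contour Detection: A Language-Guided Vision System cs.CV | cs.AI

Liang Gong, Tommy, Wang, Sara Chaker, Yanchen Dong

TL;DR: The paper presents a language-guided generative vision system for industrial contour detection, organized into three stages: data acquisition, contour generation with a conditional GAN, and multimodal contour refinement. It aims for CAD-level precision and improves edge continuity.

Details

Motivation: Traditional industrial computer vision systems struggle with noise, material variability, and uncontrolled imaging conditions, limiting the effectiveness of classical edge detectors and handcrafted pipelines.

Result: On the FabTrack datasets, the system improves contour fidelity and reduces manual tracing; GPT-image-1 outperforms Gemini 2.0 Flash in both structural accuracy and perceptual quality.

Insight: Language-guided generative workflows show promise for advancing industrial computer vision beyond classical pipelines.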

Abstract: Industrial computer vision systems often struggle with noise, material variability, and uncontrolled imaging conditions, limiting the effectiveness of classical edge detectors and handcrafted pipelines. In this work, we present a language-guided generative vision system for remnant contour detection in manufacturing, designed to achieve CAD-level precision. The system is organized into three stages: data acquisition and preprocessing, contour generation using a conditional GAN, and multimodal contour refinement through vision-language modeling, where standardized prompts are crafted in a human-in-the-loop process and applied through image-text guided synthesis. On proprietary FabTrack datasets, the proposed system improved contour fidelity, enhancing edge continuity and geometric alignment while reducing manual tracing. For the refinement stage, we benchmarked several vision-language models, including Google’s Gemini 2.0 Flash, OpenAI’s GPT-image-1 integrated within a VLM-guided workflow, and open-source baselines. Under standardized conditions, GPT-image-1 consistently outperformed Gemini 2.0 Flash in both structural accuracy and perceptual quality. These findings demonstrate the promise of VLM-guided generative workflows for advancing industrial computer vision beyond the limitations of classical pipelines.

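[16] Language-Aware Information Maximization for Transductive Few-Shot CLIP cs.CV

Ghassen Baklouti, Maxime Zanella, Ismail Ben Ayed

TL;DR: This work proposes LIMO, a language-aware information-maximization method for transductive few-shot learning with vision-language models such as CLIP. It combines mutual information, KL divergence, and cross-entropy losses, explores parameter-efficient fine-tuning strategies, and significantly outperforms existing methods.

Details

Motivation: Most transductive few-shot learning methods target vision-only models, and work on vision-language foundation models such as CLIP is still scarce. This work aims to fill that gap with a transductive few-shot method tailored to vision-language models.

Result: LIMO outperforms recent transductive few-shot CLIP methods by a large margin and yields significant gains over the best-performing inductive methods.

Insight: 1. Language information is crucial for transductive few-shot learning in vision-language models; 2. Parameter-efficient fine-tuning has potential advantages in the transductive setting.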

Abstract: Transductive few-shot learning has triggered an abundant literature focusing on vision-only models, but is still at a nascent stage within the recent context of foundational vision-language models (VLMs). Only a few recent methods addressed the problem, pointing to the potential of transduction in VLMs and to the need for VLM-tailored methods. Building on this momentum, we leverage information-theoretic concepts and recent progress in parameter-efficient fine-tuning (PEFT), developing a highly competitive transductive few-shot CLIP method. Specifically, we introduce a novel Language-aware Information MaximizatiOn (LIMO) loss integrating three complementary terms: (i) the mutual information between the vision inputs and the textual class descriptions; (ii) a Kullback-Leibler (KL) divergence penalizing deviation of the network’s probabilistic outputs from the text-driven zero-shot predictions; and (iii) a standard cross-entropy loss based on the labeled shots. Furthermore, we challenge the commonly followed fine-tuning practices in the context of transductive few-shot learning, and explore PEFT strategies, completely overlooked in this context. Surprisingly, we observe substantial boosts in performance, which points to the potential of adapting a subset of the model’s parameters in the transductive few-shot setting. We report comprehensive evaluations, which show that LIMO outperforms the very recent transductive few-shot CLIP methods by a large margin and yields significant gains over the best-performing inductive methods. Our code is publicly available at: https://github.com/ghassenbaklouti/LIMO
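
A three-term objective of this shape can be sketched on plain probability lists. Everything beyond the three terms themselves is an assumption: mutual information is estimated as the entropy of the average prediction minus the average per-sample entropy, the KL direction is taken as KL(prediction || zero-shot), and the weights `lam_kl` and `lam_ce` are hypothetical.

```python
import math

def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete distribution."""
    return -sum(pi * math.log(pi + eps) for pi in p)

def mutual_information(probs):
    """Entropy of the mean prediction minus the mean per-sample entropy."""
    n, k = len(probs), len(probs[0])
    marginal = [sum(p[c] for p in probs) / n for c in range(k)]
    conditional = sum(entropy(p) for p in probs) / n
    return entropy(marginal) - conditional

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def limo_style_loss(query_probs, zero_shot_probs, shot_probs, shot_labels,
                    lam_kl=1.0, lam_ce=1.0):
    """Negative mutual information + KL to zero-shot predictions + cross-entropy on shots."""
    mi = mutual_information(query_probs)
    kl = sum(kl_divergence(p, z) for p, z in zip(query_probs, zero_shot_probs))
    kl /= len(query_probs)
    ce = -sum(math.log(p[y] + 1e-12) for p, y in zip(shot_probs, shot_labels))
    ce /= len(shot_probs)
    return -mi + lam_kl * kl + lam_ce * ce
```

Minimizing the loss pushes predictions on unlabeled queries to be confident and class-balanced, keeps them near the text-driven zero-shot predictions, and fits the few labeled shots.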

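[17] MorphGen: Morphology-Guided Representation Learning for Robust Single-Domain Generalization in Histopathological Cancer Classification cs.CV

Hikmat Khan, Syed Farhan Alam Zaidi, Pir Masoom Shah, Kiruthika Balakrishnan, Rabia Khan

TL;DR: The paper proposes MorphGen, a morphology-guided representation learning method that addresses domain generalization problems in computational pathology caused by differences in tissue preparation, staining, and imaging conditions. By combining nuclear segmentation masks with supervised contrastive learning, the model focuses on diagnostic features such as nuclear and morphological atypia rather than domain-specific features, improving generalization to out-of-distribution data.

Details

Motivation: Domain generalization problems in computational pathology, such as the heterogeneity of whole slide images, limit the practical usefulness of models. Pathologists rely on morphological cues that hold across domains, which inspired a morphology-guided learning method.

Result: MorphGen outperforms baselines on domain generalization and is more robust to image corruptions and adversarial attacks.

Insight: Morphological features are key to pathological diagnosis; modeling them explicitly improves generalization and robustness while reducing reliance on domain-specific features.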

Abstract: Domain generalization in computational histopathology is hindered by heterogeneity in whole slide images (WSIs), caused by variations in tissue preparation, staining, and imaging conditions across institutions. Unlike machine learning systems, pathologists rely on domain-invariant morphological cues such as nuclear atypia (enlargement, irregular contours, hyperchromasia, chromatin texture, spatial disorganization), structural atypia (abnormal architecture and gland formation), and overall morphological atypia that remain diagnostic across diverse settings. Motivated by this, we hypothesize that explicitly modeling biologically robust nuclear morphology and spatial organization will enable the learning of cancer representations that are resilient to domain shifts. We propose MorphGen (Morphology-Guided Generalization), a method that integrates histopathology images, augmentations, and nuclear segmentation masks within a supervised contrastive learning framework. By aligning latent representations of images and nuclear masks, MorphGen prioritizes diagnostic features such as nuclear and morphological atypia and spatial organization over staining artifacts and domain-specific features. To further enhance out-of-distribution robustness, we incorporate stochastic weight averaging (SWA), steering optimization toward flatter minima. Attention map analyses revealed that MorphGen primarily relies on nuclear morphology, cellular composition, and spatial cell organization within tumors or normal regions for final classification. Finally, we demonstrate resilience of the learned representations to image corruptions (such as staining artifacts) and adversarial attacks, showcasing not only OOD generalization but also addressing critical vulnerabilities in current deep learning systems for digital pathology. Code, datasets, and trained models are available at: https://github.com/hikmatkhan/MorphGen
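
Stochastic weight averaging, mentioned above as the mechanism for steering optimization toward flatter minima, amounts to keeping a running mean of the weights visited late in training. A minimal sketch on flat weight lists follows; it illustrates the averaging rule only and is not the MorphGen training code.

```python
def average_checkpoints(checkpoints):
    """Running mean of weight vectors, updated one checkpoint at a time.

    checkpoints: list of equally sized lists of floats, in training order.
    """
    swa = list(checkpoints[0])
    for n_averaged, weights in enumerate(checkpoints[1:], start=1):
        # Incremental mean: new = (old * n + w) / (n + 1)
        swa = [(s * n_averaged + w) / (n_averaged + 1)
               for s, w in zip(swa, weights)]
    return swa
```

In practice the averaged weights replace the final ones at evaluation time, after any normalization statistics have been recomputed for them.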

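[18] Towards Adaptive Visual Token Pruning for Large Multimodal Models cs.CV

Hao Zhang, Mengsi Lyu, Chenrui He, Yulong Ao, Yonghua Lin

TL;DR: The paper proposes an adaptive visual token pruning strategy for large multimodal models that preserves cross-modal alignment and intra-modal informational diversity, substantially reducing computational and memory overhead while maintaining model performance.

Details

Motivation: Large multimodal models (LMMs) usually encode visual inputs into dense token sequences, raising computational and memory costs at inference. Existing token pruning methods rely on costly calibration or suboptimal importance metrics, so redundant tokens are retained.

Result: Experiments show that the method reduces tokens by 88.9% on models such as LLaVA-1.5-7B and LLaVA-NEXT-7B and improves inference speed by 56.7% while maintaining strong performance.

Insight: Visual token pruning is an effective way to make large multimodal models more efficient; cross-modal alignment and intra-modal diversity are important criteria when choosing which tokens to keep.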

Abstract: Large Multimodal Models (LMMs) have achieved significant success across various tasks. These models usually encode visual inputs into dense token sequences, which are then concatenated with textual tokens and jointly processed by a language model. However, the increased token count substantially raises computational and memory costs during inference. Token pruning has emerged as a promising approach to address this issue. Existing token pruning methods often rely on costly calibration or suboptimal importance metrics, leading to redundant retained tokens. In this paper, we analyze the redundancy differences between visual and textual tokens and propose pruning exclusively on visual tokens. Based on this, we propose a visual token pruning strategy that explicitly preserves both cross-modal alignment and intra-modal informational diversity. We introduce a mutual information-based token pruning strategy that removes visual tokens semantically misaligned with textual tokens, effectively preserving the alignment between the visual and textual modalities. To further improve the representational quality of the retained tokens, we additionally prune redundant visual tokens by maximizing the expected pairwise distances in the embedding space, which is solved efficiently with a greedy algorithm. Extensive experiments demonstrate that our method maintains strong performance while reducing tokens by 88.9% on models such as LLaVA-1.5-7B and LLaVA-NEXT-7B, resulting in a 56.7% improvement in inference speed.
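
The diversity-preserving step (keeping tokens that are far apart in embedding space, chosen greedily) can be sketched as follows. This is a generic greedy heuristic under stated assumptions, not the paper's algorithm: the Euclidean metric, the fixed seed token, and the names `euclidean` and `greedy_diverse_subset` are all illustrative choices.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_diverse_subset(embeddings, keep, seed_index=0):
    """Greedily pick `keep` token indices whose embeddings are far apart.

    At each step, the unselected token with the largest summed distance
    to the already selected tokens is added.
    """
    if not 0 < keep <= len(embeddings):
        raise ValueError("keep must be between 1 and the number of tokens")
    selected = [seed_index]
    # gain[i] = sum of distances from token i to every selected token
    gain = [euclidean(e, embeddings[seed_index]) for e in embeddings]
    while len(selected) < keep:
        candidates = [i for i in range(len(embeddings)) if i not in selected]
        best = max(candidates, key=lambda i: gain[i])
        selected.append(best)
        for i in range(len(embeddings)):
            gain[i] += euclidean(embeddings[i], embeddings[best])
    return selected
```

With four toy tokens where one is a near-duplicate of the seed, the near-duplicate is the one left out when three tokens are kept.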

[19] CryptoFace: End-to-End Encrypted Face Recognition cs.CV | cs.CRPDF

Wei Ao, Vishnu Naresh Boddeti

Abstract: Face recognition is central to many authentication, security, and personalized applications. Yet, it suffers from significant privacy risks, particularly arising from unauthorized access to sensitive biometric data. This paper introduces CryptoFace, the first end-to-end encrypted face recognition system with fully homomorphic encryption (FHE). It enables secure processing of facial data across all stages of a face-recognition process–feature extraction, storage, and matching–without exposing raw images or features. We introduce a mixture of shallow patch convolutional networks to support higher-dimensional tensors via patch-based processing while reducing the multiplicative depth and, thus, inference latency. Parallel FHE evaluation of these networks ensures near-resolution-independent latency. On standard face recognition benchmarks, CryptoFace significantly accelerates inference and increases verification accuracy compared to the state-of-the-art FHE neural networks adapted for face recognition. CryptoFace will facilitate secure face recognition systems requiring robust and provable security. The code is available at https://github.com/human-analysis/CryptoFace.

[20] LUT-Fuse: Towards Extremely Fast Infrared and Visible Image Fusion via Distillation to Learnable Look-Up Tables cs.CVPDF

Xunpeng Yi, Yibing Zhang, Xinyu Xiang, Qinglong Yan, Han Xu

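TL;DR: The paper proposes LUT-Fuse, a fast infrared and visible image fusion method based on learnable look-up tables (LUTs); distillation substantially increases fusion speed, making the method suitable for low-power devices.

Details

Motivation: Current research on infrared and visible image fusion mostly focuses on improving performance and neglects real-time requirements. This paper addresses that gap with an extremely fast fusion approach.

Result: The method achieves a breakthrough in efficiency, running more than ten times faster than current lightweight SOTA algorithms, and is suitable for low-power mobile devices. Experiments validate its superiority, reliability, and stability.

Insight: LUT distillation can balance extreme speed and high performance in multimodal image fusion, offering a new direction for real-time applications.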

Abstract: Current advanced research on infrared and visible image fusion primarily focuses on improving fusion performance, often neglecting applicability on real-time fusion devices. In this paper, we propose a novel approach toward extremely fast fusion via distillation to learnable look-up tables specifically designed for image fusion, termed LUT-Fuse. First, we develop a look-up table structure that utilizes low-order approximation encoding and high-level joint contextual scene encoding, which is well suited for multi-modal fusion. Moreover, given the lack of ground truth in multi-modal image fusion, we propose an efficient LUT distillation strategy instead of traditional quantization LUT methods. By integrating the performance of the multi-modal fusion network (MM-Net) into the MM-LUT model, our method achieves significant breakthroughs in efficiency and performance. It typically requires less than one-tenth of the time of current lightweight SOTA fusion algorithms, ensuring high operational speed across various scenarios, even on low-power mobile devices. Extensive experiments validate the superiority, reliability, and stability of our fusion approach. The code is available at https://github.com/zyb5/LUT-Fuse.

[21] Target-Oriented Single Domain Generalization cs.CV | cs.AI | cs.LGPDF

Marzi Heidari, Yuhong Guo

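TL;DR: This paper introduces TO-SDG, a new problem setup that uses a textual description of the target domain (no target data required) to guide model generalization, and the STAR module, which injects target semantics into source features to improve single domain generalization.

Details

Motivation: In single domain generalization, existing methods overlook textual descriptions of the target environment, a readily available resource. The work uses such descriptions to improve generalization to unseen target domains.

Result: Across multiple image classification and object detection benchmarks, STAR outperforms existing methods, demonstrating the value of textual descriptions for single domain generalization.

Insight: Without any target-domain data, a small amount of textual description can significantly improve generalization to unseen target domains, offering a new direction for practical deployment.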

Abstract: Deep models trained on a single source domain often fail catastrophically under distribution shifts, a critical challenge in Single Domain Generalization (SDG). While existing methods focus on augmenting source data or learning invariant features, they neglect a readily available resource: textual descriptions of the target deployment environment. We propose Target-Oriented Single Domain Generalization (TO-SDG), a novel problem setup that leverages the textual description of the target domain, without requiring any target data, to guide model generalization. To address TO-SDG, we introduce Spectral TARget Alignment (STAR), a lightweight module that injects target semantics into source features by exploiting visual-language models (VLMs) such as CLIP. STAR uses a target-anchored subspace derived from the text embedding of the target description to recenter image features toward the deployment domain, then utilizes spectral projection to retain directions aligned with target cues while discarding source-specific noise. Moreover, we use a vision-language distillation to align backbone features with VLM’s semantic geometry. STAR further employs feature-space Mixup to ensure smooth transitions between source and target-oriented representations. Experiments across various image classification and object detection benchmarks demonstrate STAR’s superiority. This work establishes that minimal textual metadata, which is a practical and often overlooked resource, significantly enhances generalization under severe data constraints, opening new avenues for deploying robust models in target environments with unseen data.

[22] AQFusionNet: Multimodal Deep Learning for Air Quality Index Prediction with Imagery and Sensor Data cs.CV | cs.AI | 68T07, 68T09, 68U10 | I.4.8; I.2.10; I.5.4; C.3PDF

Koushik Ahmed Kushal, Abdullah Al Mamun

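TL;DR: AQFusionNet is a multimodal deep learning framework that combines atmospheric imagery with sensor data to predict the Air Quality Index (AQI); it performs well in resource-constrained regions, reaching 92.02% accuracy.

Details

Motivation: Air quality monitoring in resource-constrained regions is hampered by sparse sensors and limited infrastructure, calling for an efficient and scalable solution.

Result: On more than 8,000 samples from India and Nepal, AQFusionNet reaches 92.02% classification accuracy and an RMSE of 7.70, an 18.5% improvement over single-modality approaches.

Insight: Multimodal fusion markedly improves prediction performance, and the lightweight design suits resource-constrained environments, making edge deployment feasible.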

Abstract: Air pollution monitoring in resource-constrained regions remains challenging due to sparse sensor deployment and limited infrastructure. This work introduces AQFusionNet, a multimodal deep learning framework for robust Air Quality Index (AQI) prediction. The framework integrates ground-level atmospheric imagery with pollutant concentration data using lightweight CNN backbones (MobileNetV2, ResNet18, EfficientNet-B0). Visual and sensor features are combined through semantically aligned embedding spaces, enabling accurate and efficient prediction. Experiments on more than 8,000 samples from India and Nepal demonstrate that AQFusionNet consistently outperforms unimodal baselines, achieving up to 92.02% classification accuracy and an RMSE of 7.70 with the EfficientNet-B0 backbone. The model delivers an 18.5% improvement over single-modality approaches while maintaining low computational overhead, making it suitable for deployment on edge devices. AQFusionNet provides a scalable and practical solution for AQI monitoring in infrastructure-limited environments, offering robust predictive capability even under partial sensor availability.

[23] Iterative Low-rank Network for Hyperspectral Image Denoising cs.CVPDF

Jin Ye, Fengchao Xiong, Jun Zhou, Yuntao Qian

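TL;DR: The paper proposes an iterative low-rank network (ILRNet) for hyperspectral image denoising that combines model-driven and data-driven approaches; an embedded rank minimization module (RMM) and an iterative refinement process remove noise effectively while preserving image details.

Details

Motivation: Hyperspectral image (HSI) denoising is an important preprocessing step for subsequent tasks. Clean HSIs usually lie in a low-dimensional subspace, and low-rank and sparse representations capture this physical prior. Making full use of these physical properties for effective denoising while preserving details remains challenging.

Result: Experiments show that ILRNet achieves state-of-the-art performance in both synthetic and real-world noise removal tasks.

Insight: 1. Combining the strengths of model-driven and data-driven approaches makes full use of the physical priors of HSIs. 2. Adaptive parameter learning and wavelet-domain processing make the model more flexible. 3. The iterative refinement process effectively preserves image details.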

Abstract: Hyperspectral image (HSI) denoising is a crucial preprocessing step for subsequent tasks. Clean HSIs usually reside in a low-dimensional subspace, which can be captured by low-rank and sparse representation, known as the physical prior of HSIs. It is generally challenging to adequately use such physical properties for effective denoising while preserving image details. This paper introduces a novel iterative low-rank network (ILRNet) to address these challenges. ILRNet integrates the strengths of model-driven and data-driven approaches by embedding a rank minimization module (RMM) within a U-Net architecture. This module transforms feature maps into the wavelet domain and applies singular value thresholding (SVT) to the low-frequency components during the forward pass, leveraging the spectral low-rankness of HSIs in the feature domain. The parameter, closely related to the hyperparameter of the SVT algorithm, is adaptively learned from the data, allowing for flexible and effective capture of low-rankness across different scenarios. Additionally, ILRNet features an iterative refinement process that adaptively combines intermediate denoised HSIs with noisy inputs, ensuring progressive enhancement and superior preservation of image details. Experimental results demonstrate that ILRNet achieves state-of-the-art performance in both synthetic and real-world noise removal tasks.
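
For reference, the standard singular value thresholding operator named in the abstract can be written as below. This is the textbook definition, with $\tau$ as a generic threshold; how ILRNet parameterizes and learns its threshold, and how the operator is applied to wavelet-domain feature maps, is not specified here.

```latex
X = U \Sigma V^{\top}, \qquad \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r),
\qquad
\mathcal{D}_{\tau}(X) = U \operatorname{diag}\bigl(\max(\sigma_i - \tau,\, 0)\bigr) V^{\top},
\qquad
\mathcal{D}_{\tau}(X) = \arg\min_{Z} \; \tau \lVert Z \rVert_{*} + \tfrac{1}{2} \lVert Z - X \rVert_{F}^{2}
```

The last identity states that SVT is the proximal operator of the nuclear norm $\lVert \cdot \rVert_{*}$ (the sum of singular values), which is why shrinking singular values acts as a rank-reducing step.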

[24] SurgLLM: A Versatile Large Multimodal Model with Spatial Focus and Temporal Awareness for Surgical Video Understanding cs.CV | cs.AI | cs.LGPDF

Zhen Chen, Xingjian Luo, Kun Yuan, Jinlin Wu, Danny T. M. Chan

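TL;DR: SurgLLM is a versatile large multimodal model designed to strengthen spatial focus and temporal awareness in surgical video understanding; surgical context-aware multimodal pretraining and temporal-aware multimodal tuning significantly improve its understanding of surgical videos.

Details

Motivation: Surgical video understanding is essential for computer-assisted surgery (CAS) systems, but existing work falls short in visual content perception and temporal awareness. The paper aims to build a more versatile CAS solution by improving spatial focus and temporal awareness.

Result: Experiments on diverse surgical video understanding tasks (captioning, general VQA, and temporal VQA) show that SurgLLM significantly outperforms existing methods.

Insight: A multimodal approach that combines spatial focus with temporal awareness can improve both the versatility and the performance of surgical video understanding.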

Abstract: Surgical video understanding is crucial for facilitating Computer-Assisted Surgery (CAS) systems. Despite significant progress in existing studies, two major limitations persist: inadequate visual content perception and insufficient temporal awareness in surgical videos, both of which hinder the development of versatile CAS solutions. In this work, we propose the SurgLLM framework, an effective large multimodal model tailored for versatile surgical video understanding tasks with enhanced spatial focus and temporal awareness. Specifically, to strengthen spatial focus on surgical videos, we first devise Surgical Context-aware Multimodal Pretraining (Surg-Pretrain) for the video encoder of SurgLLM, by performing instrument-centric Masked Video Reconstruction (MV-Recon) and subsequent multimodal alignment. To incorporate surgical temporal knowledge into SurgLLM, we further propose Temporal-aware Multimodal Tuning (TM-Tuning) to enhance temporal reasoning with interleaved multimodal embeddings. Moreover, to accommodate various understanding tasks of surgical videos without conflicts, we devise a Surgical Task Dynamic Ensemble to efficiently triage a query with optimal learnable parameters in our SurgLLM. Extensive experiments performed on diverse surgical video understanding tasks, including captioning, general VQA, and temporal VQA, demonstrate significant improvements over the state-of-the-art approaches, validating the effectiveness of our SurgLLM in versatile surgical video understanding. The source code is available at https://github.com/franciszchen/SurgLLM.

[25] A Multimodal Head and Neck Cancer Dataset for AI-Driven Precision Oncology cs.CVPDF

Numan Saeed, Salma Hassan, Shahad Hardan, Ahmed Aly, Darya Taratynova

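TL;DR: The paper presents a publicly available multimodal head and neck cancer dataset for AI-driven precision oncology research, comprising 1123 PET/CT studies, and applies it to clinical tasks including tumor segmentation, survival prediction, and HPV status classification.

Details

Motivation: Clinical datasets for head and neck cancer research often lack diversity and standardized annotation, which limits the development of AI models. The paper aims to fill this gap with an international, multi-center, standardized multimodal dataset.

Result: Benchmark models for tumor segmentation, survival prediction, and HPV classification demonstrate the dataset's potential and its value for advancing AI-driven precision oncology.

Insight: The dataset provides a standardized resource for head and neck cancer research, is particularly suited to developing and validating multimodal AI models, and matters for clinical translational research.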

Abstract: We describe a publicly available multimodal dataset of annotated Positron Emission Tomography/Computed Tomography (PET/CT) studies for head and neck cancer research. The dataset includes 1123 FDG-PET/CT studies from patients with histologically confirmed head and neck cancer, acquired from 10 international medical centers. All examinations consisted of co-registered PET/CT scans with varying acquisition protocols, reflecting real-world clinical diversity across institutions. Primary gross tumor volumes (GTVp) and involved lymph nodes (GTVn) were manually segmented by experienced radiation oncologists and radiologists following standardized guidelines and quality control measures. We provide anonymized NIfTI files of all studies, along with expert-annotated segmentation masks, radiotherapy dose distribution for a subset of patients, and comprehensive clinical metadata. This metadata includes TNM staging, HPV status, demographics (age and gender), long-term follow-up outcomes, survival times, censoring indicators, and treatment information. We demonstrate how this dataset can be used for three key clinical tasks: automated tumor segmentation, recurrence-free survival prediction, and HPV status classification, providing benchmark results using state-of-the-art deep learning models, including UNet, SegResNet, and multimodal prognostic frameworks.

[26] Two Causes, Not One: Rethinking Omission and Fabrication Hallucinations in MLLMs cs.CVPDF

Guangzong Si, Hao Yin, Xianfei Li, Qing Ding, Wenlong Liao

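TL;DR: The paper challenges a common assumption about object hallucination in multimodal large language models (MLLMs), argues that omission and fabrication hallucinations have different causes, and proposes a new conceptual framework and a mitigation method built on this distinction.

Details

Motivation: Existing methods wrongly assume that omission and fabrication hallucinations share a common cause, so mitigating one tends to worsen the other. The authors aim for a more effective solution by distinguishing the causes of the two.

Result: VPFC significantly reduces omission hallucinations without increasing fabrication hallucinations, giving more balanced and robust hallucination mitigation.

Insight: Current object hallucination research overlooks the difference between the two hallucination types; distinguishing their causes and addressing them separately is key to making MLLMs more reliable.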

Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive advances, yet object hallucination remains a persistent challenge. Existing methods, based on the flawed assumption that omission and fabrication hallucinations share a common cause, often reduce omissions only to trigger more fabrications. In this work, we overturn this view by demonstrating that omission hallucinations arise from insufficient confidence when mapping perceived visual features to linguistic expressions, whereas fabrication hallucinations result from spurious associations within the cross-modal representation space due to statistical biases in the training corpus. Building on findings from visual attention intervention experiments, we propose the Visual-Semantic Attention Potential Field, a conceptual framework that reveals how the model constructs visual evidence to infer the presence or absence of objects. Leveraging this insight, we introduce Visual Potential Field Calibration (VPFC), a plug-and-play hallucination mitigation method that effectively reduces omission hallucinations without introducing additional fabrication hallucinations. Our findings reveal a critical oversight in current object hallucination research and chart new directions for developing more robust and balanced hallucination mitigation strategies.

[27] Activation Steering Meets Preference Optimization: Defense Against Jailbreaks in Vision Language Models cs.CV | cs.AIPDF

Sihao Wu, Gaojie Jin, Wei Huang, Jianhong Wang, Xiaowei Huang

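TL;DR: The paper proposes SPO-VLM, a two-stage defense framework that combines activation-level intervention with policy-level optimization to make vision language models (VLMs) more robust to adversarial attacks.

Details

Motivation: Existing defenses rely on task-specific contrastive prompts, which perform suboptimally and degrade visual grounding. SPO-VLM addresses these problems with more general activation steering and policy optimization.

Result: Experiments show that SPO-VLM significantly improves safety without compromising visual understanding.

Insight: The two-stage structure balances efficiency and effectiveness; combining a lightweight mitigation foundation with deeper policy refinement is the key to its success.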

Abstract: Vision Language Models (VLMs) have demonstrated impressive capabilities in integrating visual and textual information for understanding and reasoning, but remain highly vulnerable to adversarial attacks. While activation steering has emerged as a promising defense, existing approaches often rely on task-specific contrastive prompts to extract harmful directions, which exhibit suboptimal performance and can degrade visual grounding performance. To address these limitations, we propose Sequence-Level Preference Optimization for VLM (SPO-VLM), a novel two-stage defense framework that combines activation-level intervention with policy-level optimization to enhance model robustness. In Stage I, we compute adaptive layer-specific steering vectors from diverse data sources, enabling generalized suppression of harmful behaviors during inference. In Stage II, we refine these steering vectors through a sequence-level preference optimization process. This stage integrates automated toxicity assessment, as well as visual-consistency rewards based on caption-image alignment, to achieve safe and semantically grounded text generation. The two-stage structure of SPO-VLM balances efficiency and effectiveness by combining a lightweight mitigation foundation in Stage I with deeper policy refinement in Stage II. Extensive experiments show that SPO-VLM enhances safety against attacks via activation steering and preference optimization, while maintaining strong performance on benign tasks without compromising visual understanding capabilities. We will release our code, model weights, and evaluation toolkit to support reproducibility and future research. Warning: This paper may contain examples of offensive or harmful text and images.

[28] Adaptive Point-Prompt Tuning: Fine-Tuning Heterogeneous Foundation Models for 3D Point Cloud Analysis cs.CVPDF

Mengke Li, Lihao Chen, Peng Zhang, Yiu-ming Cheung, Hui Huang

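TL;DR: The paper proposes Adaptive Point-Prompt Tuning (APPT), which uses point features directly to calibrate heterogeneous foundation models for efficient 3D point cloud analysis, avoiding the information loss of conventional high-to-low dimensional mappings.

Details

Motivation: Point cloud data is scarce, which makes pre-training large 3D models challenging. Existing methods typically apply pre-trained visual models to 3D through "high-to-low" mapping, which loses spatial geometry and lacks generality. The paper instead uses point features directly to calibrate heterogeneous foundation models for efficient 3D point cloud analysis.

Result: With a modest number of tuned parameters, APPT processes point clouds effectively and avoids the information loss of conventional mappings; experiments demonstrate its efficiency and generality.

Insight: 1. Directly using point features to calibrate cross-modal models is an efficient route to 3D point cloud analysis. 2. Dynamic point-prompts markedly improve the model's grasp of global structure. 3. The weight-sharing design reduces computational overhead.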

Abstract: Parameter-efficient fine-tuning strategies for foundation models in 1D textual and 2D visual analysis have demonstrated remarkable efficacy. However, due to the scarcity of point cloud data, pre-training large 3D models remains a challenging task. While many efforts have been made to apply pre-trained visual models to 3D domains through “high-to-low” mapping, these approaches often lead to the loss of spatial geometries and lack a generalizable framework for adapting any modality to 3D. This paper, therefore, attempts to directly leverage point features to calibrate the heterogeneous foundation model of any modality for 3D point cloud analysis. Specifically, we propose the Adaptive Point-Prompt Tuning (APPT) method, which fine-tunes pre-trained models with a modest number of parameters, enabling direct point cloud processing without heterogeneous mappings. We convert raw point clouds into point embeddings by aggregating local geometry to capture spatial features followed by linear layers to ensure seamless utilization of frozen pre-trained models. Given the inherent disorder of point clouds, in contrast to the structured nature of images and language, we employ a permutation-invariant feature to capture the relative positions of point embeddings, thereby obtaining point tokens enriched with location information to optimize self-attention mechanisms. To calibrate self-attention across source domains of any modality to 3D and reduce computational overhead, we introduce a prompt generator that shares weights with the point embedding module, dynamically producing point-prompts without adding additional parameters. These prompts are then concatenated into a frozen foundation model, providing rich global structural information and compensating for the lack of structural context in the heterogeneous data.

[29] HERO-VQL: Hierarchical, Egocentric and Robust Visual Query Localization cs.CVPDF

Joohyun Chang, Soyeon Hong, Hyogun Lee, Seong Jong Ha, Dongho Lee

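TL;DR: HERO-VQL is a new method for localizing query objects in egocentric videos; a hierarchical attention mechanism and data augmentation significantly improve localization under challenging viewpoint changes.

Details

Motivation: Frequent viewpoint changes and occlusions in egocentric videos make accurate localization difficult for existing methods, so a more robust solution is needed.

Result: Experiments on the VQ2D dataset show that HERO-VQL significantly outperforms baseline methods.

Insight: Attention mechanisms and data augmentation strategies modeled on human cognitive processes can effectively improve performance in complex scenes.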

Abstract: In this work, we tackle egocentric visual query localization (VQL), where a model should localize the query object in a long-form egocentric video. Frequent and abrupt viewpoint changes in egocentric videos cause significant object appearance variations and partial occlusions, making it difficult for existing methods to achieve accurate localization. To tackle these challenges, we introduce Hierarchical, Egocentric and RObust Visual Query Localization (HERO-VQL), a novel method inspired by the human cognitive process in object recognition. We propose i) Top-down Attention Guidance (TAG) and ii) Egocentric Augmentation based Consistency Training (EgoACT). Top-down Attention Guidance refines the attention mechanism by leveraging the class token for high-level context and principal component score maps for fine-grained localization. To enhance learning in diverse and challenging matching scenarios, EgoAug enhances query diversity by replacing the query with a randomly selected corresponding object from ground-truth annotations and simulates extreme viewpoint changes by reordering video frames. Additionally, CT loss enforces stable object localization across different augmentation scenarios. Extensive experiments on the VQ2D dataset validate that HERO-VQL effectively handles egocentric challenges, significantly outperforming baselines.

[30] Double-Constraint Diffusion Model with Nuclear Regularization for Ultra-low-dose PET Reconstruction cs.CVPDF

Mengxiao Geng, Ran Hong, Bingxuan Li, Qiegen Liu

Abstract: Ultra-low-dose positron emission tomography (PET) reconstruction holds significant potential for reducing patient radiation exposure and shortening examination times. However, it may also lead to increased noise and reduced imaging detail, which could decrease the image quality. In this study, we present a Double-Constraint Diffusion Model (DCDM), which freezes the weights of a pre-trained diffusion model and injects a trainable double-constraint controller into the encoding architecture, greatly reducing the number of trainable parameters for ultra-low-dose PET reconstruction. Unlike full fine-tuning models, DCDM can adapt to different dose levels without retraining all model parameters, thereby improving reconstruction flexibility. Specifically, the two constraint modules, named the Nuclear Transformer Constraint (NTC) and the Encoding Nexus Constraint (ENC), serve to refine the pre-trained diffusion model. The NTC leverages the nuclear norm as an approximation for matrix rank minimization, integrates the low-rank property into the Transformer architecture, and enables efficient information extraction from low-dose images and conversion into compressed feature representations in the latent space. Subsequently, the ENC utilizes these compressed feature representations to encode and control the pre-trained diffusion model, ultimately obtaining reconstructed PET images in the pixel space. In clinical reconstruction, the compressed feature representations from NTC help select the most suitable ENC for efficient unknown low-dose PET reconstruction. Experiments conducted on the UDPET public dataset and the Clinical dataset demonstrated that DCDM outperforms state-of-the-art methods on known dose reduction factors (DRF) and generalizes well to unknown DRF scenarios, proving valuable even at ultra-low dose levels, such as 1% of the full dose.

[31] DAOVI: Distortion-Aware Omnidirectional Video Inpainting cs.CV | cs.AIPDF

Ryosuke Seshimo, Mariko Isogawa

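TL;DR: This paper proposes DAOVI, a deep learning model for omnidirectional video inpainting that accounts for geodesic distance and geometric distortion, addressing the poor performance of conventional methods on omnidirectional videos.

Details

Motivation: Existing video inpainting methods target ordinary narrow-field-of-view videos and cannot handle the geometric distortion of omnidirectional videos, which yields unsatisfactory results.

Result: Experiments show that DAOVI outperforms existing methods in both quantitative and qualitative evaluation.

Insight: Handling geometric distortion is key in omnidirectional video inpainting; geodesic distance and the feature propagation module effectively improve inpainting quality.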

Abstract: Omnidirectional videos that capture the entire surroundings are employed in a variety of fields such as VR applications and remote sensing. However, their wide field of view often causes unwanted objects to appear in the videos. This problem can be addressed by video inpainting, which enables the natural removal of such objects while preserving both spatial and temporal consistency. Nevertheless, most existing methods assume processing ordinary videos with a narrow field of view and do not tackle the distortion in equirectangular projection of omnidirectional videos. To address this issue, this paper proposes a novel deep learning model for omnidirectional video inpainting, called Distortion-Aware Omnidirectional Video Inpainting (DAOVI). DAOVI introduces a module that evaluates temporal motion information in the image space considering geodesic distance, as well as a depth-aware feature propagation module in the feature space that is designed to address the geometric distortion inherent to omnidirectional videos. The experimental results demonstrate that our proposed method outperforms existing methods both quantitatively and qualitatively.
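
To see why geodesic distance matters in equirectangular frames, the sketch below measures the great-circle angle between two pixels. It is an illustration only, not DAOVI's module: the pixel-to-sphere convention (x spanning longitude from -π to π, y spanning latitude from π/2 at the top row to -π/2 at the bottom) and the function names are assumptions.

```python
import math


def pixel_to_lonlat(x, y, width, height):
    """Map an equirectangular pixel to (longitude, latitude) in radians."""
    lon = (x / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (y / height) * math.pi
    return lon, lat


def geodesic_distance(p, q, width, height):
    """Great-circle angle (radians) between pixels p and q on the unit sphere."""
    lon1, lat1 = pixel_to_lonlat(p[0], p[1], width, height)
    lon2, lat2 = pixel_to_lonlat(q[0], q[1], width, height)
    # Haversine formula; min() guards against rounding slightly above 1.
    a = (
        math.sin((lat2 - lat1) / 2.0) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2.0) ** 2
    )
    return 2.0 * math.asin(min(1.0, math.sqrt(a)))
```

The same 90-pixel horizontal offset in a 360x180 frame corresponds to a quarter of the sphere at the equator but to almost no distance at the top row, which is the distortion a pixel-space motion measure would miss.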

[32] DevilSight: Augmenting Monocular Human Avatar Reconstruction through a Virtual Perspective cs.CVPDF

Yushuo Chen, Ruizhi Shao, Youxin Pang, Hongwen Zhang, Xinyi Wu

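TL;DR: The paper proposes a new framework that augments human avatar reconstruction from monocular video with a virtual perspective, using a video generative model as an additional supervision signal to overcome existing methods' shortcomings in capturing dynamic details and generating novel views.

Details

Motivation: Existing methods for reconstructing human avatars from monocular video struggle to capture fine-grained dynamic details or to generate plausible details at novel viewpoints, mainly because of the avatar model's limited representational capacity and insufficient observational data.

Result: Experiments show that the method outperforms recent state-of-the-art approaches and validate the proposed strategies.

Insight: Generated virtual viewpoints can serve as an effective supervision signal that improves the detail and generalization of monocular reconstruction; combining generative models with conventional optimization markedly improves quality.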

Abstract: We present a novel framework to reconstruct human avatars from monocular videos. Recent approaches have struggled either to capture the fine-grained dynamic details from the input or to generate plausible details at novel viewpoints, a difficulty that mainly stems from the limited representational capacity of the avatar model and insufficient observational data. To overcome these challenges, we propose to leverage the advanced video generative model, Human4DiT, to generate the human motions from an alternative perspective as an additional supervision signal. This approach not only enriches the details in previously unseen regions but also effectively regularizes the avatar representation to mitigate artifacts. Furthermore, we introduce two complementary strategies to enhance video generation. To ensure consistent reproduction of human motion, we inject the physical identity into the model through video fine-tuning. For higher-resolution outputs with finer details, a patch-based denoising algorithm is employed. Experimental results demonstrate that our method outperforms recent state-of-the-art approaches and validate the effectiveness of our proposed strategies.

[33] LightVLM: Acceleraing Large Multimodal Models with Pyramid Token Merging and KV Cache Compression cs.CVPDF

Lianyu Hu, Fanhua Shang, Wei Feng, Liang Wan

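TL;DR: LightVLM is a training-free acceleration method that uses pyramid token merging and KV cache compression to significantly improve the inference efficiency of vision-language models while preserving performance.

Details

Motivation: Existing vision-language models (VLMs) are inefficient at inference, with high latency especially for long sequences, which limits real-world deployment.

Result: LightVLM retains 100% of performance with 35% of image tokens and about 98% with 3%; network throughput increases 2.02$\times$, prefilling time is reduced by 3.65$\times$, and long-sequence inference time by 3.21$\times$.

Insight: Hierarchical token reduction and cache management can markedly improve the efficiency of large models, letting them run faster than much smaller ones.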

Abstract: In this paper, we introduce LightVLM, a simple but effective method that can be seamlessly deployed upon existing Vision-Language Models (VLMs) to greatly accelerate the inference process in a training-free manner. We divide the inference procedure of VLMs into two stages, i.e., encoding and decoding, and propose to simultaneously accelerate VLMs in both stages to largely improve model efficiency. During encoding, we propose pyramid token merging to reduce tokens of different LLM layers in a hierarchical manner, ultimately keeping only a few dominant tokens to achieve high efficiency. During decoding, aimed at reducing the high latency of outputting long sequences, we propose KV Cache compression to remove unnecessary caches to increase the network throughput. Experimental results show that LightVLM successfully retains 100% performance when only preserving 35% image tokens, and maintains around 98% performance when keeping only 3% image tokens. LightVLM increases the network throughput by 2.02$\times$ and reduces the prefilling time by 3.65$\times$. LightVLM also makes large VLMs faster again by enabling a heavy model (e.g., InternVL2.5 26B) to infer faster than significantly smaller models (e.g., InternVL2.5 8B), hopefully facilitating real-world deployment. When generating long text sequences (e.g., 4096 tokens), LightVLM reduces the inference time by 3.21$\times$, largely outperforming existing methods.

[34] Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation cs.CVPDF

Xuechao Zou, Shun Zhang, Xing Fu, Yue Li, Kai Li

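TL;DR: Face-MoGLE is a new framework built on Diffusion Transformers (DiTs) that achieves high-quality, semantically controllable face generation through a mixture of global and local experts and a dynamic gating network.

Details

Motivation: Existing methods struggle to balance semantic control and generation quality; Face-MoGLE addresses this by decoupling semantic control and exploiting expert specialization.

Result: Experiments show that Face-MoGLE performs well in multimodal and monomodal face generation and has robust zero-shot generalization.

Insight: Separating global and local information and adjusting the generation process through dynamic gating is an effective route to high-quality controllable generation.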

Abstract: Controllable face generation poses critical challenges in generative modeling due to the intricate balance required between semantic controllability and photorealism. While existing approaches struggle with disentangling semantic controls from generation pipelines, we revisit the architectural potential of Diffusion Transformers (DiTs) through the lens of expert specialization. This paper introduces Face-MoGLE, a novel framework featuring: (1) Semantic-decoupled latent modeling through mask-conditioned space factorization, enabling precise attribute manipulation; (2) A mixture of global and local experts that captures holistic structure and region-level semantics for fine-grained controllability; (3) A dynamic gating network producing time-dependent coefficients that evolve with diffusion steps and spatial locations. Face-MoGLE provides a powerful and flexible solution for high-quality, controllable face generation, with strong potential in generative modeling and security applications. Extensive experiments demonstrate its effectiveness in multimodal and monomodal face generation settings and its robust zero-shot generalization capability. Project page is available at https://github.com/XavierJiezou/Face-MoGLE.

[35] SemaMIL: Semantic Reordering with Retrieval-Guided State Space Modeling for Whole Slide Image Classification cs.CVPDF

Lubin Gan, Xiaoman Wu, Jing Zhang, Zhifeng Wang, Linhao Qu

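TL;DR: SemaMIL uses semantic reordering and retrieval-guided state space modeling to address conventional MIL methods' neglect of contextual relationships and the high computational complexity of Transformers, achieving higher accuracy at lower computational cost in WSI classification.

Details

Motivation: Existing MIL methods (such as attention-based ones) overlook contextual relationships, while Transformer models are computationally expensive and prone to overfitting. SemaMIL aims to solve these problems with semantic reordering and state space modeling.

Result: On four WSI subtype datasets, SemaMIL reaches SOTA accuracy with fewer FLOPs and parameters.

Insight: Semantic reordering and efficient state space modeling can improve model performance and interpretability while preserving histological meaning.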

Abstract: Multiple instance learning (MIL) has become the leading approach for extracting discriminative features from whole slide images (WSIs) in computational pathology. Attention-based MIL methods can identify key patches but tend to overlook contextual relationships. Transformer models are able to model interactions but require quadratic computational cost and are prone to overfitting. State space models (SSMs) offer linear complexity, yet shuffling patch order disrupts histological meaning and reduces interpretability. In this work, we introduce SemaMIL, which integrates Semantic Reordering (SR), an adaptive method that clusters and arranges semantically similar patches in sequence through a reversible permutation, with a Semantic-guided Retrieval State Space Module (SRSM) that chooses a representative subset of queries to adjust state space parameters for improved global modeling. Evaluation on four WSI subtype datasets shows that, compared to strong baselines, SemaMIL achieves state-of-the-art accuracy with fewer FLOPs and parameters.
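
The "reversible permutation" idea behind Semantic Reordering can be sketched as a stable sort by cluster label whose index order is kept so the original patch sequence can be restored. This is a simplified illustration: the cluster labels are assumed to be given, and the names `semantic_reorder` and `restore_order` are hypothetical, not taken from SemaMIL.

```python
def semantic_reorder(items, cluster_ids):
    """Group items by cluster label and return them with the permutation used.

    sorted() is stable, so items within a cluster keep their original order.
    """
    order = sorted(range(len(items)), key=lambda i: cluster_ids[i])
    return [items[i] for i in order], order


def restore_order(reordered, order):
    """Invert the permutation produced by semantic_reorder."""
    original = [None] * len(reordered)
    for new_pos, old_pos in enumerate(order):
        original[old_pos] = reordered[new_pos]
    return original
```

Keeping `order` alongside the reordered sequence is what makes the step reversible, so per-patch outputs can be mapped back to slide positions.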

[36] Stage-wise Adaptive Label Distribution for Facial Age Estimation cs.CVPDF

Bo Wu, Zhiqi Ai, Jun Jiang, Congcong Zhu, Shugong Xu

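TL;DR: The paper proposes SA-LDL, a stage-wise adaptive label distribution learning algorithm for facial age estimation that models how label ambiguity varies across age stages, reaching an MAE of 1.74 on MORPH-II and 2.15 on FG-NET.

Details

Motivation: Label ambiguity is a significant challenge in age estimation. Existing label distribution learning methods model correlations between adjacent age groups but often overlook that the degree of ambiguity differs across age stages.

Result: SA-LDL achieves competitive performance, with an MAE of 1.74 on MORPH-II and 2.15 on FG-NET.

Insight: Analysis of embedding similarities between an anchor age and all other ages shows that label ambiguity follows clear stage-wise patterns, which stage-wise adaptive variance modeling and a weighted loss can exploit.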

Abstract: Label ambiguity poses a significant challenge in age estimation tasks. Most existing methods address this issue by modeling correlations between adjacent age groups through label distribution learning. However, they often overlook the varying degrees of ambiguity present across different age stages. In this paper, we propose a Stage-wise Adaptive Label Distribution Learning (SA-LDL) algorithm, which leverages the observation, revealed through our analysis of embedding similarities between an anchor and all other ages, that label ambiguity exhibits clear stage-wise patterns. By jointly employing stage-wise adaptive variance modeling and a weighted loss function, SA-LDL effectively captures the complex and structured nature of label ambiguity, leading to more accurate and robust age estimation. Extensive experiments demonstrate that SA-LDL achieves competitive performance, with MAE of 1.74 and 2.15 on the MORPH-II and FG-NET datasets, respectively.
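As a rough illustration of label distribution learning with a stage-dependent variance, the sketch below turns a scalar age label into a Gaussian distribution over ages whose width depends on the age stage. The stage boundaries and sigma values are hypothetical placeholders; the paper derives its variances adaptively, and this is not its implementation.

```python
import math
from typing import List


def stage_sigma(true_age: int) -> float:
    """Hypothetical stage-wise standard deviation (illustrative values only)."""
    if true_age < 18:
        return 1.0  # assumed: little ambiguity in early ages
    if true_age < 50:
        return 2.0
    return 3.0      # assumed: more ambiguity in later ages


def gaussian_label_distribution(true_age: int, sigma: float, max_age: int = 100) -> List[float]:
    """Discrete Gaussian over ages 0..max_age, normalized to sum to 1."""
    weights = [math.exp(-((a - true_age) ** 2) / (2.0 * sigma ** 2)) for a in range(max_age + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_age(dist: List[float]) -> float:
    """Age prediction as the expectation of the distribution."""
    return sum(age * p for age, p in enumerate(dist))


def mean_absolute_error(predicted: List[float], actual: List[float]) -> float:
    """MAE, the metric reported in the abstract."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


young = gaussian_label_distribution(10, stage_sigma(10))
older = gaussian_label_distribution(60, stage_sigma(60))
```

A wider distribution for a stage spreads probability mass over more neighbouring ages, which is one simple way to encode that labels in that stage are more ambiguous.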

[37] Encoder-Only Image Registration cs.CVPDF

Xiang Chen, Renjiu Hu, Jinwei Zhang, Yuxi Zhang, Xinyao Yue

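TL;DR: The paper proposes Encoder-Only Image Registration (EOIR), a new image registration framework that separates feature learning from flow estimation to improve the trade-off between registration accuracy and efficiency.

Details

Motivation: Learning-based techniques have significantly improved the accuracy and speed of deformable image registration, but reducing computational complexity and handling large deformations remain challenging. The paper analyzes the role convolutional neural networks (ConvNets) play in registration and derives a new design from that analysis.

Result: On five datasets spanning different modalities and anatomical regions, EOIR achieves superior accuracy-efficiency and accuracy-smoothness trade-offs; at comparable accuracy it offers better efficiency and smoothness.

Insight: In registration, ConvNets mainly serve to linearize local intensities and harmonize global contrast variations, an observation that motivates more efficient registration designs.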

Abstract: Learning-based techniques have significantly improved the accuracy and speed of deformable image registration. However, challenges such as reducing computational complexity and handling large deformations persist. To address these challenges, we analyze how convolutional neural networks (ConvNets) influence registration performance using the Horn-Schunck optical flow equation. Supported by prior studies and our empirical experiments, we observe that ConvNets play two key roles in registration: linearizing local intensities and harmonizing global contrast variations. Based on these insights, we propose the Encoder-Only Image Registration (EOIR) framework, designed to achieve a better accuracy-efficiency trade-off. EOIR separates feature learning from flow estimation, employing only a 3-layer ConvNet for feature extraction and a set of 3-layer flow estimators to construct a Laplacian feature pyramid, progressively composing diffeomorphic deformations under a large-deformation model. Results on five datasets across different modalities and anatomical regions demonstrate EOIR’s effectiveness, achieving superior accuracy-efficiency and accuracy-smoothness trade-offs. With comparable accuracy, EOIR provides better efficiency and smoothness, and vice versa. The source code of EOIR will be publicly available at https://github.com/XiangChen1994/EOIR.
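For reference, the Horn-Schunck formulation named in the abstract is usually written as below. This is the textbook 2D form, with image intensity I(x, y, t), flow components (u, v), and a smoothness weight α; it is not necessarily the exact variant the paper analyzes. The first relation is the linearized brightness-constancy constraint, which only holds when intensities vary approximately linearly over the displacement, and the second is the energy that adds a smoothness term.

```latex
I_x u + I_y v + I_t = 0,
\qquad
E(u, v) = \iint \Big[ \big( I_x u + I_y v + I_t \big)^2
  + \alpha^2 \big( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \big) \Big] \, dx \, dy
```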

[38] Exploring Decision-Making Capabilities of LLM Agents: An Experimental Study on Jump-Jump Game cs.CVPDF

Juwu Li

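TL;DR: The paper experimentally studies the decision-making capabilities of LLM agents in the Jump-Jump game, examining their performance in spatial reasoning, physical modeling, and strategic planning.

Details

Motivation: The simple Jump-Jump game is used as a test environment for exploring LLM decision-making in a game that draws on several cognitive abilities at once.

Result: The experiments indicate that LLMs show a degree of decision-making ability in the game but still need improvement in complex situations.

Insight: The study points to the potential of LLMs in dynamic tasks as well as their limitations in precise control and strategic planning.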

Abstract: The Jump-Jump game, as a simple yet challenging casual game, provides an ideal testing environment for studying LLM decision-making capabilities. The game requires players to precisely control jumping force based on current position and target platform distance, involving multiple cognitive aspects including spatial reasoning, physical modeling, and strategic planning. It illustrates the basic gameplay mechanics of the Jump-Jump game, where the player character (red circle) must jump across platforms with appropriate force to maximize score.

[39] VideoRewardBench: Comprehensive Evaluation of Multimodal Reward Models for Video Understanding cs.CV | cs.AIPDF

Zhihong Zhang, Xiaojian Huang, Jin Xu, Zhuodong Luo, Xinzhi Wang

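TL;DR: VideoRewardBench is a comprehensive benchmark for evaluating multimodal reward models (MRMs), covering four core aspects of video understanding: perception, knowledge, reasoning, and safety. Using an AI-assisted data pipeline, the authors curate 1,563 high-quality annotated samples and evaluate 28 MRMs, exposing performance bottlenecks and key insights.

Details

Motivation: Existing benchmarks for evaluating MRMs in the video domain fall short in the number and diversity of questions and in their evaluation dimensions, so a more comprehensive benchmark is needed to advance MRM development and evaluation.

Result: GPT-4o and Qwen2.5-VL-72B reach only 57.0% and 53.3% overall accuracy, respectively. The analysis also finds that RL training does not necessarily improve cross-modal generalization, that inference-time scaling benefits some MRMs, and that the effect of video frame count varies by model type.

Insight: 1) MRMs trained with RL are not necessarily better than those trained without RL; 2) inference-time scaling benefits generative and semi-scalar MRMs; 3) video frame count affects different types of MRMs differently.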

Abstract: Multimodal reward models (MRMs) play a crucial role in the training, inference, and evaluation of Large Vision Language Models (LVLMs) by assessing response quality. However, existing benchmarks for evaluating MRMs in the video domain suffer from a limited number and diversity of questions, a lack of comprehensive evaluation dimensions, and inadequate evaluation of diverse types of MRMs. To address these gaps, we introduce VideoRewardBench, the first comprehensive benchmark covering four core aspects of video understanding: perception, knowledge, reasoning, and safety. Through our AI-assisted data pipeline, we curate a high-quality preference dataset of 1,563 annotated samples, including 1,482 unique videos and 1,559 distinct questions, 15 times the number found in the most question-rich prior benchmark. Each sample is a triplet consisting of a video-text prompt, a chosen response, and a rejected response. We also conduct a comprehensive evaluation across 28 multimodal reward models spanning three categories: generative, discriminative, and semi-scalar. Results show that even the top-performing model GPT-4o achieves only 57.0% overall accuracy, and the state-of-the-art open-source model Qwen2.5-VL-72B reaches merely 53.3%. Our analysis further reveals three key insights: (i) MRMs trained with reinforcement learning (RL) do not necessarily exhibit stronger cross-modal generalization than those trained without RL; (ii) except for discriminative MRMs, other types of MRMs across varying model capacities can benefit from inference-time scaling; and (iii) variations in input video frame count have different effects on different types of MRMs. We believe VideoRewardBench offers a challenging and valuable benchmark for advancing the evaluation and development of MRMs in the video domain.
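The triplet format described above suggests a simple way to compute accuracy for a reward model that returns a scalar score: a sample counts as correct when the chosen response scores higher than the rejected one. The sketch below assumes that scoring interface, which fits discriminative models most directly; the field names, `toy_score`, and the toy samples are hypothetical and not part of the benchmark's released tooling.

```python
from typing import Callable, Dict, List

Sample = Dict[str, str]  # keys assumed here: "prompt", "chosen", "rejected"


def preference_accuracy(samples: List[Sample], score: Callable[[str, str], float]) -> float:
    """Fraction of triplets where the chosen response outscores the rejected one."""
    correct = sum(
        1
        for s in samples
        if score(s["prompt"], s["chosen"]) > score(s["prompt"], s["rejected"])
    )
    return correct / len(samples)


def toy_score(prompt: str, response: str) -> float:
    """Stand-in reward model: prefers longer responses (for illustration only)."""
    return float(len(response))


samples = [
    {"prompt": "What happens in the clip?", "chosen": "A man pours tea into a cup.", "rejected": "Tea."},
    {"prompt": "How many cars appear?", "chosen": "Two.", "rejected": "There are probably seven cars."},
    {"prompt": "Is the action safe?", "chosen": "No, the stove is left on.", "rejected": "Yes."},
]

accuracy = preference_accuracy(samples, toy_score)
```

Under this pairwise protocol a random scorer would sit near 50%, which gives some context for the 57.0% and 53.3% figures quoted in the abstract, assuming the benchmark's overall accuracy is computed in a comparable pairwise way.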

[40] Multi-Focused Video Group Activities Hashing cs.CV | cs.AIPDF

Zhongmiao Qi, Yan Jiang, Bolin Zhang, Lijun Guo, Chong Wang

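TL;DR: The paper proposes a new video hashing technique, STVH, and its enhanced version, M-STVH, for fast retrieval of group activities in video. By modeling individual dynamics and group interactions together, it addresses the granularity problem in video retrieval.

Details

Motivation: With the explosive growth of video data, fast retrieval of group activities has become an urgent need. Existing methods can usually retrieve only at the level of whole videos and cannot focus on activity granularity, so a new solution is needed.

Result: Experiments on two public datasets show that both STVH and M-STVH achieve excellent performance.

Insight: Multi-focused representation learning can effectively combine activity semantics with object visual features, offering a new direction for video retrieval in complex scenes.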

Abstract: With the explosive growth of video data in various complex scenarios, quickly retrieving group activities has become an urgent problem. However, many existing methods can only retrieve at the level of an entire video, not at activity granularity. To solve this problem, we propose STVH (spatiotemporal interleaved video hashing), a new technique and the first of its kind. Through a unified framework, STVH simultaneously models individual object dynamics and group interactions, capturing the spatiotemporal evolution of both group visual features and positional features. Moreover, real-life video retrieval scenarios may sometimes require activity features and at other times require visual features of objects. We therefore further propose M-STVH (multi-focused spatiotemporal video hashing), an enhanced version that handles this difficult task. The advanced method incorporates hierarchical feature integration through multi-focused representation learning, allowing the model to jointly focus on activity semantic features and object visual features. We conducted comparative experiments on publicly available datasets, and both STVH and M-STVH achieve excellent results.

[41] TRUST: Token-dRiven Ultrasound Style Transfer for Cross-Device Adaptation cs.CVPDF

Nhat-Tuong Do-Tran, Ngoc-Hoang-Lam Le, Ian Chiu, Po-Tsun Paul Kuo, Ching-Chun Huang

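TL;DR: TRUST is a token-driven dual-stream framework for cross-device ultrasound style transfer that filters the most relevant style features to improve downstream task performance.

Details

Motivation: Ultrasound images acquired on different devices differ widely in style, which degrades downstream task performance. Existing unpaired image translation methods do not adequately filter the relevant style features, so their translation results are suboptimal.

Result: On ultrasound datasets, TRUST outperforms existing unpaired image translation methods in both visual quality and downstream task performance.

Insight: Explicitly filtering style features is critical for cross-device image translation, and style selection guided by the downstream task further improves the results.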

Abstract: Ultrasound images acquired from different devices exhibit diverse styles, resulting in decreased performance of downstream tasks. To mitigate the style gap, unpaired image-to-image (UI2I) translation methods aim to transfer images from a source domain, corresponding to new device acquisitions, to a target domain where a frozen task model has been trained for downstream applications. However, existing UI2I methods have not explicitly considered filtering the most relevant style features, which may result in translated images misaligned with the needs of downstream tasks. In this work, we propose TRUST, a token-driven dual-stream framework that preserves source content while transferring the common style of the target domain, ensuring that content and style remain unblended. Given multiple styles in the target domain, we introduce a Token-dRiven (TR) module that operates from two perspectives: (1) a data view, selecting “suitable” target tokens corresponding to each source token, and (2) a model view, identifying “optimal” target tokens for the downstream model, guided by a behavior mirror loss. Additionally, we inject auxiliary prompts into the source encoder to match content representation with downstream behavior. Experimental results on ultrasound datasets demonstrate that TRUST outperforms existing UI2I methods in both visual quality and downstream task performance.
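The "data view" of the TR module, picking a suitable target token for each source token, can be pictured with a nearest-neighbour sketch. The abstract does not state the selection criterion, so cosine similarity is used here purely as an assumption, and the function names and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_target_tokens(
    source_tokens: Sequence[Sequence[float]],
    target_tokens: Sequence[Sequence[float]],
) -> List[int]:
    """For each source token, return the index of the most similar target token.

    Assumption: similarity is cosine similarity; the paper may use another rule.
    """
    selected = []
    for s in source_tokens:
        best = max(range(len(target_tokens)), key=lambda j: cosine_similarity(s, target_tokens[j]))
        selected.append(best)
    return selected


source_tokens = [[1.0, 0.0], [0.0, 1.0]]
target_tokens = [[0.1, 0.9], [0.9, 0.1], [-1.0, 0.0]]
matches = select_target_tokens(source_tokens, target_tokens)
```

The "model view" in the abstract additionally uses feedback from the frozen downstream model, which this sketch does not attempt to reproduce.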

[42] Make me an Expert: Distilling from Generalist Black-Box Models into Specialized Models for Semantic Segmentation cs.CVPDF

Yasser Benigmim, Subhankar Roy, Khalid Oublal, Imad Eddine Marouf, Slim Essid

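TL;DR: The paper introduces the Black-Box Distillation (B2D) setting, in which a local specialized model is trained from a generalist black-box model that returns only one-hot predictions, and proposes ATGC to handle the black-box model's sensitivity to input resolution.

Details

Motivation: AIaaS offers API access to generalist models, but their black-box nature (no weights, training data, or logits) makes it hard for existing methods to distill knowledge into local specialized models.

Result: Experiments show that ATGC substantially improves semantic segmentation performance on multiple datasets while relying only on one-hot predictions.

Insight: The input-resolution sensitivity of generalist models is the key challenge, and attention-guided dynamic scale selection is central to effective distillation.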

Abstract: The rise of Artificial Intelligence as a Service (AIaaS) democratizes access to pre-trained models via Application Programming Interfaces (APIs), but also raises a fundamental question: how can local models be effectively trained using black-box models that do not expose their weights, training data, or logits, a constraint in which current domain adaptation paradigms are impractical? To address this challenge, we introduce the Black-Box Distillation (B2D) setting, which enables local model adaptation under realistic constraints: (1) the API model is open-vocabulary and trained on large-scale general-purpose data, and (2) access is limited to one-hot predictions only. We identify that open-vocabulary models exhibit significant sensitivity to input resolution, with different object classes being segmented optimally at different scales, a limitation termed the “curse of resolution”. Our method, ATtention-Guided sCaler (ATGC), addresses this challenge by leveraging DINOv2 attention maps to dynamically select optimal scales for black-box model inference. ATGC scores the attention maps with entropy to identify informative scales for pseudo-labelling, enabling effective distillation. Experiments demonstrate substantial improvements under black-box supervision across multiple datasets while requiring only one-hot API predictions. Our code is available at https://github.com/yasserben/ATGC.
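Entropy-based scoring of attention maps can be sketched as follows. Two things here are assumptions rather than details from the paper: attention maps are represented as flat lists of non-negative weights, and the scale with the lowest entropy (most concentrated attention) is treated as the most informative. The function names and toy maps are hypothetical.

```python
import math
from typing import Dict, Sequence


def attention_entropy(attention: Sequence[float]) -> float:
    """Shannon entropy (in nats) of an attention map normalized to sum to 1."""
    total = sum(attention)
    probs = [a / total for a in attention if a > 0]
    return -sum(p * math.log(p) for p in probs)


def pick_scale(attention_by_scale: Dict[float, Sequence[float]]) -> float:
    """Pick the scale whose attention map has the lowest entropy.

    Assumption: lower entropy means sharper, more informative attention.
    """
    return min(attention_by_scale, key=lambda s: attention_entropy(attention_by_scale[s]))


# Toy attention maps for three hypothetical input scales.
attention_by_scale = {
    0.5: [0.25, 0.25, 0.25, 0.25],  # diffuse attention
    1.0: [0.70, 0.10, 0.10, 0.10],  # focused attention
    2.0: [0.40, 0.30, 0.20, 0.10],
}
best_scale = pick_scale(attention_by_scale)
```

In a full pipeline the selected scale would then be used when querying the black-box API for pseudo-labels.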

[43] Learning Yourself: Class-Incremental Semantic Segmentation with Language-Inspired Bootstrapped Disentanglement cs.CVPDF

Ruitao Wu, Yifan Zhao, Jia Li

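TL;DR: The paper proposes LBD, a language-guided framework that tackles semantic entanglement in class-incremental semantic segmentation by using pre-trained vision-language models (e.g., CLIP) to disentangle features, achieving state-of-the-art performance on multiple datasets.

Details

Motivation: Class-incremental semantic segmentation (CISS) faces two challenges, prototype-feature entanglement and background-increment entanglement. Conventional methods lack sufficient cues to distinguish targets and therefore introduce noise and errors. The paper addresses these issues with a language-guided disentanglement framework.

Result: State-of-the-art performance on Pascal VOC and ADE20k, particularly in multi-step incremental scenarios.

Insight: Language-guided feature disentanglement effectively mitigates semantic confusion in incremental learning, and the semantic priors of pre-trained vision-language models help considerably on dense tasks.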

Abstract: Class-Incremental Semantic Segmentation (CISS) requires continuous learning of newly introduced classes while retaining knowledge of past classes. By abstracting mainstream methods into two stages (visual feature extraction and prototype-feature matching), we identify a more fundamental challenge termed catastrophic semantic entanglement. This phenomenon involves Prototype-Feature Entanglement caused by semantic misalignment during the incremental process, and Background-Increment Entanglement due to dynamic data evolution. Existing techniques, which rely on visual feature learning without sufficient cues to distinguish targets, introduce significant noise and errors. To address these issues, we introduce a Language-inspired Bootstrapped Disentanglement framework (LBD). We leverage the prior class semantics of pre-trained visual-language models (e.g., CLIP) to guide the model in autonomously disentangling features through Language-guided Prototypical Disentanglement and Manifold Mutual Background Disentanglement. The former guides the disentangling of new prototypes by treating hand-crafted text features as topological templates, while the latter employs multiple learnable prototypes and mask-pooling-based supervision for background-incremental class disentanglement. By incorporating soft prompt tuning and encoder adaptation modifications, we further bridge the capability gap of CLIP between dense and sparse tasks, achieving state-of-the-art performance on both Pascal VOC and ADE20k, particularly in multi-step scenarios.

[44] A Modality-agnostic Multi-task Foundation Model for Human Brain Imaging cs.CVPDF

Peirong Liu, Oula Puonti, Xiaoling Hu, Karthik Gopinath, Annabel Sorby-Adams

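TL;DR: The paper proposes BrainFM, a modality-agnostic, multi-task foundation model for human brain imaging that adapts to different modalities and tasks and generalizes strongly.

Details

Motivation: Existing learning-based methods perform well on calibrated medical imaging such as CT but generalize poorly to uncalibrated modalities such as MRI, which limits their applicability across diverse clinical protocols.

Result: BrainFM's robustness and effectiveness are validated on 11 public datasets, covering tasks such as image synthesis, segmentation, and distance estimation.

Insight: With its training strategy, BrainFM performs well across tasks without relying on modality-specific calibration, suggesting a new way to approach generalization in medical imaging.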

Abstract: Recent learning-based approaches have made astonishing advances in calibrated medical imaging like computerized tomography (CT), yet they struggle to generalize in uncalibrated modalities – notably magnetic resonance (MR) imaging, where performance is highly sensitive to the differences in MR contrast, resolution, and orientation. This prevents broad applicability to diverse real-world clinical protocols. Here we introduce BrainFM, a modality-agnostic, multi-task vision foundation model for human brain imaging. With the proposed “mild-to-severe” intra-subject generation and “real-synth” mix-up training strategy, BrainFM is resilient to the appearance of acquired images (e.g., modality, contrast, deformation, resolution, artifacts), and can be directly applied to five fundamental brain imaging tasks, including image synthesis for CT and T1w/T2w/FLAIR MRI, anatomy segmentation, scalp-to-cortical distance, bias field estimation, and registration. We evaluate the efficacy of BrainFM on eleven public datasets, and demonstrate its robustness and effectiveness across all tasks and input modalities. Code is available at https://github.com/jhuldr/BrainFM.

[45] DGL-RSIS: Decoupling Global Spatial Context and Local Class Semantics for Training-Free Remote Sensing Image Segmentation cs.CVPDF

Boyi Li, Ce Zhang, Richard M. Timmerman, Wenxuan Bao

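TL;DR: DGL-RSIS is a training-free framework that decouples global spatial context from local class semantics to achieve vision-language alignment for remote sensing image segmentation.

Details

Motivation: Remote sensing image segmentation suffers from limited category diversity in datasets and from the domain gap between natural and remote sensing imagery. DGL-RSIS addresses these problems through vision-language alignment.

Result: The framework supports open-vocabulary segmentation and referring expression segmentation of remote sensing images, with performance reported to exceed traditional methods.

Insight: Decoupling global and local information and combining it with vision-language alignment is an effective way to improve remote sensing image segmentation.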

Abstract: The emergence of vision language models (VLMs) has bridged vision and language, enabling joint multimodal understanding beyond traditional visual-only deep learning models. However, transferring VLMs from the natural image domain to remote sensing (RS) segmentation remains challenging due to the limited category diversity in RS datasets and the domain gap between natural and RS imagery. Here, we propose a training-free framework, DGL-RSIS, that decouples visual and textual inputs, performing visual-language alignment at both the local semantic and global contextual levels through tailored strategies. Specifically, we first introduce a global-local decoupling (GLD) module, where text inputs are divided into local class nouns and global modifiers using natural language processing (NLP) techniques; image inputs are partitioned into a set of class-agnostic mask proposals via unsupervised mask proposal networks. Second, visual and textual features are aligned at local scale, through a novel context-aware cropping strategy for extracting image patches with proper boundaries and introducing RS-specific knowledge to enrich the text inputs. By matching the enhanced text features with mask-guided visual features, we enable the mask classification, supporting open-vocabulary semantic segmentation (OVSS). Third, at the global scale, we propose a Cross-Scale Grad-CAM module to refine Grad-CAM maps using contextual information from global modifiers. A subsequent mask selection module integrates pixel-level Grad-CAM activations into the mask-level segmentation output, such that accurate and interpretable alignment can be realized across global and local dimensions for referring expression segmentation (RES).

[46] Towards Methane Detection Onboard Satellites cs.CV | cs.AIPDF

Maggie Chen, Hala Lambdouar, Luca Marini, Laura Martínez-Ferrer, Chris Bridges

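TL;DR: The paper proposes a methane detection approach based on unorthorectified data (UnorthoDOS) that avoids conventional preprocessing steps while performing comparably to models trained on orthorectified data, and releases datasets and models.

Details

Motivation: Methane is a potent greenhouse gas, and rapid detection is needed to support mitigation. Conventional methods rely on image preprocessing such as orthorectification and matched filtering, which adds complexity.

Result: ML models trained on unorthorectified data perform comparably to those trained on orthorectified data, and models trained on orthorectified data outperform the conventional matched filter baseline.

Insight: Unorthorectified data can be used directly for methane detection, simplifying the pipeline while maintaining performance and opening a path to real-time detection onboard satellites.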

Abstract: Methane is a potent greenhouse gas and a major driver of climate change, making its timely detection critical for effective mitigation. Machine learning (ML) deployed onboard satellites can enable rapid detection while reducing downlink costs, supporting faster response systems. Conventional methane detection methods often rely on image processing techniques, such as orthorectification to correct geometric distortions and matched filters to enhance plume signals. We introduce a novel approach that bypasses these preprocessing steps by using unorthorectified data (UnorthoDOS). We find that ML models trained on this dataset achieve performance comparable to those trained on orthorectified data. Moreover, we also train models on an orthorectified dataset, showing that they can outperform the matched filter baseline (mag1c). We release model checkpoints and two ML-ready datasets comprising orthorectified and unorthorectified hyperspectral images from the Earth Surface Mineral Dust Source Investigation (EMIT) sensor at https://huggingface.co/datasets/SpaceML/UnorthoDOS, along with code at https://github.com/spaceml-org/plume-hunter.

[47] MV-SSM: Multi-View State Space Modeling for 3D Human Pose Estimation cs.CV | cs.ROPDF

Aviral Chharia, Wenbo Gou, Haoye Dong

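TL;DR: MV-SSM is a new framework for multi-view 3D human pose estimation that improves generalization through state space modeling and bidirectional scanning, clearly outperforming existing methods.

Details

Motivation: Multi-view 3D human pose estimation generalizes poorly to new camera configurations. Existing attention-based Transformer methods struggle to model spatial relationships accurately and tend to overfit to specific scenes.

Result: MV-SSM improves AP25 by 10.8 (24%) on CMU Panoptic and PCP by 15.3 (38%) on Campus A1, demonstrating strong generalization.

Insight: By combining state space modeling with bidirectional scanning, MV-SSM makes multi-view 3D pose estimation markedly more robust, especially in challenging scenes and cross-dataset evaluation.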

Abstract: While significant progress has been made in single-view 3D human pose estimation, multi-view 3D human pose estimation remains challenging, particularly in terms of generalizing to new camera configurations. Existing attention-based transformers often struggle to accurately model the spatial arrangement of keypoints, especially in occluded scenarios. Additionally, they tend to overfit specific camera arrangements and visual scenes from training data, resulting in substantial performance drops in new settings. In this study, we introduce a novel Multi-View State Space Modeling framework, named MV-SSM, for robustly estimating 3D human keypoints. We explicitly model the joint spatial sequence at two distinct levels: the feature level from multi-view images and the person keypoint level. We propose a Projective State Space (PSS) block to learn a generalized representation of joint spatial arrangements using state space modeling. Moreover, we modify Mamba’s traditional scanning into an effective Grid Token-guided Bidirectional Scanning (GTBS), which is integral to the PSS block. Multiple experiments demonstrate that MV-SSM achieves strong generalization, outperforming state-of-the-art methods: +10.8 on AP25 (+24%) on the challenging three-camera setting in CMU Panoptic, +7.0 on AP25 (+13%) on varying camera arrangements, and +15.3 PCP (+38%) on Campus A1 in cross-dataset evaluations. Project Website: https://aviralchharia.github.io/MV-SSM

[48] Face4FairShifts: A Large Image Benchmark for Fairness and Robust Learning across Visual Domains cs.CV | cs.CY | cs.LGPDF

Yumeng Lin, Dong Li, Xintao Wu, Minglai Shao, Xujiang Zhao

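TL;DR: The paper presents Face4FairShifts, a large-scale facial image benchmark for systematically evaluating fairness-aware learning and domain generalization. The dataset contains 100,000 images across 4 visual domains, with 39 annotations within 14 attributes. Experiments reveal significant performance gaps under distribution shift.

Details

Motivation: Fairness and robustness of machine learning models under domain shift remain challenging, and existing datasets cannot support systematic evaluation, so a more comprehensive benchmark is needed.

Result: Experiments show that existing models exhibit significant performance gaps on cross-domain fairness tasks, underscoring the need for better fairness-aware domain adaptation techniques.

Insight: The dataset provides a new tool for systematically evaluating fairness and robustness and supports the development of equitable and reliable AI systems.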

Abstract: Ensuring fairness and robustness in machine learning models remains a challenge, particularly under domain shifts. We present Face4FairShifts, a large-scale facial image benchmark designed to systematically evaluate fairness-aware learning and domain generalization. The dataset includes 100,000 images across four visually distinct domains with 39 annotations within 14 attributes covering demographic and facial features. Through extensive experiments, we analyze model performance under distribution shifts and identify significant gaps. Our findings emphasize the limitations of existing related datasets and the need for more effective fairness-aware domain adaptation techniques. Face4FairShifts provides a comprehensive testbed for advancing equitable and reliable AI systems. The dataset is available online at https://meviuslab.github.io/Face4FairShifts/.

[49] Automatic Identification and Description of Jewelry Through Computer Vision and Neural Networks for Translators and Interpreters cs.CVPDF

Jose Manuel Alcalde-Llergo, Aurora Ruiz-Mezcua, Rocio Avila-Ramirez, Andrea Zingoni, Juri Taborri

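TL;DR: The paper proposes a method that uses computer vision and neural networks to automatically identify and describe jewelry, helping translators and interpreters obtain accurate information quickly. The model generates natural language descriptions at three levels and reaches over 90% accuracy.

Details

Motivation: Jewelry comes in a wide range of types and styles, and precise descriptions are currently limited mostly to industry experts. Translators and interpreters need a fast, reliable way to understand and describe jewelry.

Result: The final model exceeds 90% accuracy on the description generation task.

Insight: The hierarchical description approach captures jewelry details effectively and gives translators and interpreters a practical tool; the encoder-decoder architecture performs well on the multi-level description task.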

Abstract: Identifying jewelry pieces presents a significant challenge due to the wide range of styles and designs. Currently, precise descriptions are typically limited to industry experts. However, translators and interpreters often require a comprehensive understanding of these items. In this study, we introduce an innovative approach to automatically identify and describe jewelry using neural networks. This method enables translators and interpreters to quickly access accurate information, aiding in resolving queries and gaining essential knowledge about jewelry. Our model operates at three distinct levels of description, employing computer vision techniques and image captioning to emulate expert analysis of accessories. The key innovation involves generating natural language descriptions of jewelry across three hierarchical levels, capturing nuanced details of each piece. Different image captioning architectures are utilized to detect jewels in images and generate descriptions with varying levels of detail. To demonstrate the effectiveness of our approach in recognizing diverse types of jewelry, we assembled a comprehensive database of accessory images. The evaluation process involved comparing various image captioning architectures, focusing particularly on the encoder-decoder model, which is crucial for generating descriptive captions. After thorough evaluation, our final model achieved a captioning accuracy exceeding 90 per cent.

[50] Fusion to Enhance: Fusion Visual Encoder to Enhance Multimodal Language Model cs.CV | cs.AIPDF

Yifei She, Huangxuan Wu

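TL;DR: The paper proposes Fusion to Enhance (FtZ), a framework that fuses a semantically powerful anchor encoder with a perception-rich augmenting encoder to address the weakness of multimodal large language models on fine-grained visual tasks.

Details

Motivation: Existing multimodal large language models excel at high-level semantic understanding but perform poorly on tasks that require fine-grained visual perception, mainly because relying on a single vision encoder sacrifices the ability to capture detail.

Result: On several fine-grained visual benchmarks, including TextVQA, POPE, and MMMU, FtZ significantly outperforms single-encoder baselines and existing feature fusion methods.

Insight: Composing heterogeneous expert encoders is an effective way to overcome the visual perception bottleneck of current MLLMs and offers a new design paradigm for next-generation AI systems with stronger perception.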

Abstract: Multimodal Large Language Models (MLLMs) have made significant progress in bridging visual perception with high-level textual reasoning. However, they face a fundamental contradiction: while excelling at complex semantic understanding, these models often fail at basic visual tasks that require precise detail perception. This deficiency primarily stems from the prevalent architectural reliance on a single vision encoder optimized for high-level semantic alignment, which inherently sacrifices the ability to capture fine-grained visual information. To address this issue, we introduce Fusion to Enhance (FtZ), a novel vision tower framework. FtZ moves beyond the single-encoder design by innovatively composing a semantically powerful anchor encoder with a perception-rich augmenting encoder via a lightweight Multi-Head Cross-Attention mechanism. Experimental results demonstrate that on several challenging benchmarks demanding fine-grained visual understanding, such as TextVQA, POPE, MMMU, MME and MM-Vet, our FtZ model significantly outperforms baselines that use only a single encoder or existing feature fusion methods. This work proves that composing heterogeneous expert encoders is an efficient and effective path to overcoming the visual perception bottleneck in current MLLMs, offering a new design paradigm for building next-generation AI systems with stronger perceptual capabilities.

[51] ER-LoRA: Effective-Rank Guided Adaptation for Weather-Generalized Depth Estimation cs.CV | cs.ROPDF

Weilong Yan, Xin Zhang, Robby T. Tan

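TL;DR: The paper proposes ER-LoRA, an effective-rank guided parameter-efficient fine-tuning method for weather-generalized depth estimation. Its selecting, tuning, and maintaining strategy balances task adaptation against preservation of pretrained knowledge.

Details

Motivation: Monocular depth estimation in adverse weather is challenging. Existing methods rely on synthetic data, which suffers from domain gaps, or on self-supervised learning, which violates photometric assumptions.

Result: On four real-world benchmarks with diverse weather conditions, STM outperforms existing PEFT methods, full fine-tuning, and models trained on synthetic data.

Insight: Structurally decomposing pretrained weights guided by effective rank strikes a good balance between task adaptation and generalization, and suits geometry-centric tasks such as depth estimation.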

Abstract: Monocular depth estimation under adverse weather conditions (e.g., rain, fog, snow, and nighttime) remains highly challenging due to the lack of reliable ground truth and the difficulty of learning from unlabeled real-world data. Existing methods often rely on synthetic adverse data with pseudo-labels, which suffer from domain gaps, or employ self-supervised learning, which violates photometric assumptions in adverse scenarios. In this work, we propose to achieve weather-generalized depth estimation by Parameter-Efficient Fine-Tuning (PEFT) of Vision Foundation Models (VFMs), using only a small amount of high-visibility (normal) data. While PEFT has shown strong performance in semantic tasks such as segmentation, it remains underexplored for geometry-centric tasks like depth estimation, especially in terms of balancing effective adaptation with the preservation of pretrained knowledge. To this end, we introduce the Selecting-Tuning-Maintaining (STM) strategy, which structurally decomposes the pretrained weights of VFMs based on two kinds of effective ranks (entropy-rank and stable-rank). In the tuning phase, we adaptively select the proper rank number as well as the task-aware singular directions for initialization, based on the entropy-rank and full-tuned weight; while in the maintaining stage, we enforce a principal direction regularization based on the stable-rank. This design guarantees flexible task adaptation while preserving the strong generalization capability of the pretrained VFM. Extensive experiments on four real-world benchmarks across diverse weather conditions demonstrate that STM not only outperforms existing PEFT methods and full fine-tuning but also surpasses methods trained with adverse synthetic data, and even the depth foundation model
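The two effective ranks named in the abstract have common textbook definitions, sketched below from a list of singular values. It is an assumption that the paper's "entropy-rank" and "stable-rank" match these definitions exactly: the entropy-based rank is taken as the exponential of the Shannon entropy of the normalized singular values, and the stable rank as the squared Frobenius norm divided by the squared spectral norm.

```python
import math
from typing import Sequence


def entropy_rank(singular_values: Sequence[float]) -> float:
    """exp(H(p)) with p_i = sigma_i / sum(sigma); equals n when all sigma_i are equal."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def stable_rank(singular_values: Sequence[float]) -> float:
    """||W||_F^2 / ||W||_2^2 = sum(sigma_i^2) / max(sigma_i)^2."""
    top = max(singular_values)
    return sum(s * s for s in singular_values) / (top * top)


flat_spectrum = [2.0, 2.0, 2.0, 2.0]  # all directions equally important
peaked_spectrum = [10.0, 0.1, 0.1]    # one dominant direction

flat_ranks = (entropy_rank(flat_spectrum), stable_rank(flat_spectrum))
peaked_ranks = (entropy_rank(peaked_spectrum), stable_rank(peaked_spectrum))
```

Both quantities are at most the number of non-zero singular values and shrink toward 1 as the spectrum becomes dominated by a single direction, which is what makes them usable as soft rank estimates when choosing a LoRA rank.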

[52] LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model cs.CV | cs.LGPDF

Xiyao Wang, Chunyuan Li, Jianwei Yang, Kai Zhang, Bo Liu

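TL;DR: The paper challenges the convention of separating critic models from policy models in vision-language modeling. It applies reinforcement learning on critic data directly to a generative model, producing LLaVA-Critic-R1, which is both a strong critic and a competitive generator.

Details

Motivation: In conventional vision-language modeling, critic models are used only for scoring or preference judgments, while policy models generate responses. The paper questions whether this separation is necessary and explores whether a critic can also be used directly for generation.

Result: LLaVA-Critic-R1 performs strongly across 26 visual reasoning and understanding benchmarks, with an average gain of 5.7% over its base model; LLaVA-Critic-R1+ reaches SoTA performance at the 7B scale (71.9 on MMMU). Self-critique at test time yields an average improvement of 13.8% on five representative reasoning tasks.

Insight: Critic data can be used to train a unified model that is good at both evaluation and generation, offering a simple path toward self-improving multimodal systems.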

Abstract: In vision-language modeling, critic models are typically trained to evaluate outputs – assigning scalar scores or pairwise preferences – rather than to generate responses. This separation from policy models, which produce the responses, is so entrenched that critics are rarely considered for direct policy use. In this work, we challenge this convention. We propose to reorganize preference-labeled critic datasets into verifiable training signals and perform reinforcement learning directly on a base generative model, producing LLaVA-Critic-R1, a multimodal critic trained to optimize preference judgments while retaining full generation ability. Surprisingly, LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model – matching or surpassing specialized reasoning VLMs trained with in-domain data across 26 visual reasoning and understanding benchmarks, with an average gain of +5.7% over its base model (Qwen-2.5-VL-7B). Extending this approach to existing strong reasoning VLMs yields LLaVA-Critic-R1+, which further advances policy performance without sacrificing critic quality, achieving a SoTA performance of 71.9 on MMMU at the 7B scale. Finally, we show that the enhanced critic ability benefits inference: applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks without additional training. Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation, offering a simple path toward scalable, self-improving multimodal systems.
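Turning preference labels into a verifiable training signal can be pictured as a rule-based reward: the model's written verdict is parsed and compared with the labeled preference. The sketch below assumes an output format ending in "Answer: A" or "Answer: B"; that format, the function name, and the sample strings are hypothetical and not taken from the paper.

```python
import re


def preference_reward(model_output: str, preferred: str) -> float:
    """Return 1.0 if the last parsed verdict matches the labeled preference, else 0.0."""
    # Assumed output convention: the critic ends with "Answer: A" or "Answer: B".
    answers = re.findall(r"Answer:\s*([AB])", model_output)
    if not answers:
        return 0.0  # an unparseable verdict earns no reward
    return 1.0 if answers[-1] == preferred else 0.0


reward_match = preference_reward("Response A cites the visible sign. Answer: A", "A")
reward_miss = preference_reward("Both are fluent, but B is shorter. Answer: B", "A")
reward_none = preference_reward("I cannot decide between them.", "A")
```

Because the reward depends only on whether the verdict matches the label, it can be checked automatically, which is what allows reinforcement learning to be run directly on existing critic datasets.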

[53] CSFMamba: Cross State Fusion Mamba Operator for Multimodal Remote Sensing Image Classification cs.CVPDF

Qingyu Wang, Xue Jiang, Guozheng Xu

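TL;DR: The paper proposes CSFMamba, a new network that combines Mamba's low computational burden with CNN feature extraction and uses a cross-state module to classify multimodal remote sensing images, outperforming Transformers at lower computational cost.

Details

Motivation: Conventional multimodal fusion methods (such as CNNs and Transformers) have high computational complexity in remote sensing image classification, particularly when modeling long-range dependencies. Mamba reduces the computational burden but cannot perform feature fusion directly.

Result: On the MUUFL and Houston2018 datasets, CSFMamba outperforms Transformers while reducing the computational burden of network training.

Insight: Combining Mamba's low computational burden with CNN feature extraction enables efficient classification of multimodal remote sensing images and suggests a new direction for complex tasks.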

Abstract: Multimodal fusion has made great progress in the field of remote sensing image classification due to its ability to exploit complementary spatial-spectral information. Deep learning methods such as CNNs and Transformers have been widely used in this domain. Recent work on State Space Models (SSMs) has highlighted that such prior methods suffer from quadratic computational complexity. As a result, modeling longer-range dependencies of spatial-spectral features imposes an overwhelming burden on the network. Mamba solves this problem by incorporating time-varying parameters into an ordinary SSM and performing hardware optimization, but it cannot perform feature fusion directly. In order to make full use of Mamba’s low computational burden and explore the potential of its internal structure in multimodal feature fusion, we propose the Cross State Fusion Mamba (CSFMamba) Network. Specifically, we first design a preprocessing module for remote sensing image information that fits the needs of the Mamba structure, and combine it with a CNN to extract multi-layer features. Secondly, a cross-state module based on the Mamba operator is designed to fully fuse the features of the two modalities. The advantages of Mamba and CNN are combined by designing a more powerful backbone. We capture the fusion relationship between HSI and LiDAR modalities with stronger full-image understanding. Experimental results on the MUUFL and Houston2018 datasets show that the proposed method outperforms Transformer-based methods while reducing the network training burden.

[54] CascadeFormer: A Family of Two-stage Cascading Transformers for Skeleton-based Human Action Recognition cs.CVPDF

Yusen Peng, Alper Yilmaz

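TL;DR: This paper proposes CascadeFormer, a two-stage cascading Transformer architecture for skeleton-based human action recognition. With a two-stage design of pretraining followed by fine-tuning, the method achieves competitive performance on several benchmark datasets.

Details

Motivation: Graph Convolutional Networks (GCNs) have dominated skeleton-based action recognition, but Transformer models and masked pretraining frameworks open up new possibilities for representation learning.

Result: Achieves competitive performance on Penn Action, N-UCLA, and NTU RGB+D 60.

Insight: Transformer architectures show promise for skeleton-based action recognition, and the two-stage design combines general representation learning with task-specific adaptation.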

Abstract: Skeleton-based human action recognition leverages sequences of human joint coordinates to identify actions performed in videos. Owing to the intrinsic spatiotemporal structure of skeleton data, Graph Convolutional Networks (GCNs) have been the dominant architecture in this field. However, recent advances in transformer models and masked pretraining frameworks open new avenues for representation learning. In this work, we propose CascadeFormer, a family of two-stage cascading transformers for skeleton-based human action recognition. Our framework consists of a masked pretraining stage to learn generalizable skeleton representations, followed by a cascading fine-tuning stage tailored for discriminative action classification. We evaluate CascadeFormer across three benchmark datasets (Penn Action, N-UCLA, and NTU RGB+D 60), achieving competitive performance on all tasks. To promote reproducibility, we release our code and model checkpoints.

[55] Prompt the Unseen: Evaluating Visual-Language Alignment Beyond Supervision cs.CVPDF

Raehyuk Jung, Seungjun Yu, Hyunjung Shim

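TL;DR: This paper proposes a new benchmark for evaluating how well the projection layer of vision-language models (VLMs) generalizes, focusing on unseen visual concepts. Experiments show that the projection layer retains 79% to 88% of its performance on unseen classes, and a mechanistic interpretability analysis indicates that it works like a key-value memory.

Details

Motivation: VLMs perform well on multimodal tasks, but the projection layer's ability to generalize to unseen concepts has not been systematically evaluated. The paper proposes a new evaluation method to address this gap.

Result: The projection layer retains 79% to 88% of its seen-class performance on unseen classes, and experiments reveal that it handles seen and unseen concepts in similar ways.

Insight: The projection layer's generalization to unseen concepts suggests that VLMs can be trained efficiently with limited alignment data, without explicit supervision for every concept.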

Abstract: Vision-Language Models (VLMs) combine a vision encoder and a large language model (LLM) through alignment training, showing strong performance on multimodal tasks. A central component in this architecture is the projection layer, which maps visual features into the LLM’s embedding space. Despite its importance, its ability to generalize to unseen visual concepts has not been systematically evaluated. To address this, we propose a benchmark for evaluating projection-layer generalization. We adapt object detection datasets (rich in fine-grained annotations) into a prompting format and design train/test splits with disjoint label sets, enabling precise control over seen and unseen concept separation. Experimental results show that the projection layer retains about 79 to 88 percent of the performance on unseen classes compared to seen ones across various settings, suggesting a non-trivial level of generalization even without explicit alignment supervision on those concepts. We further analyze this behavior through a mechanistic interpretability lens. Our findings indicate that the feed-forward network in the projection layer functions like a key-value memory, processing seen and unseen tokens in similar ways. This study introduces a new evaluation framework for alignment generalization and highlights the potential for efficient VLM training with limited aligned data.
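
The evaluation protocol in this abstract rests on two simple ideas: label sets that do not overlap between the seen and unseen splits, and a retention ratio comparing unseen-class performance to seen-class performance. The following is a minimal sketch of both, not the authors' code. The function names, the 50/50 split, and the scores are hypothetical and chosen only for illustration.

```python
def split_disjoint_labels(labels, unseen_fraction=0.5):
    """Split label names into disjoint seen/unseen sets (deterministic order)."""
    ordered = sorted(set(labels))
    cut = int(len(ordered) * (1 - unseen_fraction))
    return set(ordered[:cut]), set(ordered[cut:])


def retention(seen_score, unseen_score):
    """Fraction of seen-class performance that is retained on unseen classes."""
    if seen_score <= 0:
        raise ValueError("seen_score must be positive")
    return unseen_score / seen_score


seen, unseen = split_disjoint_labels(["cat", "dog", "car", "bus"])

# Hypothetical scores, not results from the paper.
ratio = retention(seen_score=0.50, unseen_score=0.42)
print(sorted(seen), sorted(unseen), round(ratio, 2))
```

A ratio in the range 0.79 to 0.88 would correspond to the retention the abstract reports.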

[56] Enhancing Fairness in Skin Lesion Classification for Medical Diagnosis Using Prune Learning cs.CV | cs.AI | cs.CY | cs.LGPDF

Kuniko Paxton, Koorosh Aslansefat, Dhavalkumar Thakker, Yiannis Papadopoulos, Tanaya Maslekar

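TL;DR: The paper proposes a pruning-based learning method that improves fairness in skin lesion classification, addressing diagnostic bias caused by differences in skin tone.

Details

Motivation: Deep learning has markedly improved the accuracy of skin lesion classification, but potential skin-tone-related bias remains and may affect diagnostic fairness. Existing methods struggle to classify skin tones efficiently and to verify fairness.

Result: The method lowers computational cost and reduces bias, maintains model fairness without relying on conventional statistical methods, and may reduce model size.

Insight: Pruning redundant channels reduces skin-tone bias, offering an efficient and practical solution to fairness problems in medical diagnosis.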

Abstract: Recent advances in deep learning have significantly improved the accuracy of skin lesion classification models, supporting medical diagnoses and promoting equitable healthcare. However, concerns remain about potential biases related to skin color, which can impact diagnostic outcomes. Ensuring fairness is challenging due to difficulties in classifying skin tones, high computational demands, and the complexity of objectively verifying fairness. To address these challenges, we propose a fairness algorithm for skin lesion classification that overcomes the challenges associated with achieving diagnostic fairness across varying skin tones. By calculating the skewness of the feature map in the convolution layer of the VGG (Visual Geometry Group) network and the patches and the heads of the Vision Transformer, our method reduces unnecessary channels related to skin tone, focusing instead on the lesion area. This approach lowers computational costs and mitigates bias without relying on conventional statistical methods. It potentially reduces model size while maintaining fairness, making it more practical for real-world applications.
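
The abstract says channels are selected by calculating the skewness of feature maps but does not give the selection rule. The sketch below computes per-channel sample skewness (third central moment divided by the 1.5 power of the second) and flags channels whose absolute skewness exceeds a threshold. The threshold rule, the function names, and the toy activations are assumptions for illustration, not the paper's criterion.

```python
def skewness(values):
    """Sample skewness: m3 / m2**1.5, using population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


def channels_to_prune(feature_map, threshold):
    """Return indices of channels whose |skewness| exceeds the threshold.

    feature_map is a list of channels, each a flat list of activations.
    The thresholding rule is a hypothetical stand-in for the paper's rule.
    """
    return [i for i, channel in enumerate(feature_map)
            if abs(skewness(channel)) > threshold]


# Toy activations: channel 0 is symmetric, channel 1 is heavily right-skewed.
toy_map = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0, 10.0]]
flagged = channels_to_prune(toy_map, threshold=1.0)
print(flagged)
```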

[57] Causal Interpretation of Sparse Autoencoder Features in Vision cs.CV | cs.AIPDF

Sangyu Han, Yearim Kim, Nojun Kwak

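TL;DR: The paper proposes CaFE, a causal feature explanation method that uses the Effective Receptive Field (ERF) and input-attribution methods to identify more accurately what drives sparse autoencoder (SAE) features, avoiding the misreadings that come from relying on activation location alone.

Details

Motivation: Existing methods explain SAE features by inspecting the image patches where a feature activates most strongly, but they overlook that self-attention mixes information across the whole image, so the activated patch may not be the true driver of the feature.

Result: CaFE's explanations are more precise than conventional activation-ranking methods and reveal hidden context dependencies of features. Experiments confirm its effectiveness.

Insight: Relying on activation location alone can lead to misinterpreting SAE features, whereas adding causal analysis reveals their semantic content more accurately.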

Abstract: Understanding what sparse auto-encoder (SAE) features in vision transformers truly represent is usually done by inspecting the patches where a feature’s activation is highest. However, self-attention mixes information across the entire image, so an activated patch often co-occurs with, but does not cause, the feature’s firing. We propose Causal Feature Explanation (CaFE), which leverages Effective Receptive Field (ERF). We consider each activation of an SAE feature to be a target and apply input-attribution methods to identify the image patches that causally drive that activation. Across CLIP-ViT features, ERF maps frequently diverge from naive activation maps, revealing hidden context dependencies (e.g., a “roaring face” feature that requires the co-occurrence of eyes and nose, rather than merely an open mouth). Patch insertion tests confirm that CaFE more effectively recovers or suppresses feature activations than activation-ranked patches. Our results show that CaFE yields more faithful and semantically precise explanations of vision-SAE features, highlighting the risk of misinterpretation when relying solely on activation location.

[58] EVENT-Retriever: Event-Aware Multimodal Image Retrieval for Realistic Captions cs.CVPDF

Dinh-Khoi Vo, Van-Loc Nguyen, Minh-Triet Tran, Trung-Nghia Le

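TL;DR: This paper proposes a multi-stage, event-aware image retrieval framework that combines dense text retrieval, event-aware language model reranking, and semantic matching, significantly improving image retrieval for complex event descriptions.

Details

Motivation: Conventional vision-language retrieval methods struggle with free-form text describing abstract events, causality, or complex narratives. The paper aims to address these challenges by combining event understanding with multimodal retrieval.

Result: Achieved the top-1 score on the private test set of Track 2 of the EVENTA 2025 challenge, validating the method's effectiveness.

Insight: Combining event semantics with multimodal retrieval can significantly improve image understanding in complex text scenarios.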

Abstract: Event-based image retrieval from free-form captions presents a significant challenge: models must understand not only visual features but also latent event semantics, context, and real-world knowledge. Conventional vision-language retrieval approaches often fall short when captions describe abstract events, implicit causality, temporal context, or contain long, complex narratives. To tackle these issues, we introduce a multi-stage retrieval framework combining dense article retrieval, event-aware language model reranking, and efficient image collection, followed by caption-guided semantic matching and rank-aware selection. We leverage Qwen3 for article search, Qwen3-Reranker for contextual alignment, and Qwen2-VL for precise image scoring. To further enhance performance and robustness, we fuse outputs from multiple configurations using Reciprocal Rank Fusion (RRF). Our system achieves the top-1 score on the private test set of Track 2 in the EVENTA 2025 Grand Challenge, demonstrating the effectiveness of combining language-based reasoning and multimodal retrieval for complex, real-world image understanding. The code is available at https://github.com/vdkhoi20/EVENT-Retriever.
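
The abstract fuses outputs from multiple configurations with Reciprocal Rank Fusion (RRF). In its standard form, RRF scores each item by summing 1 / (k + rank) over the ranked lists in which it appears. The sketch below implements that formula. The constant k = 60 is a commonly used default rather than a value stated in the abstract, and the image IDs and three runs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(item) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by item name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical outputs from three retrieval configurations.
runs = [
    ["img_a", "img_b", "img_c"],
    ["img_b", "img_a", "img_d"],
    ["img_b", "img_c", "img_a"],
]
fused = reciprocal_rank_fusion(runs)
print(fused)
```

Because RRF uses only ranks, it can combine configurations whose raw scores are on different scales.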

[59] Multi-Level CLS Token Fusion for Contrastive Learning in Endoscopy Image Classification cs.CVPDF

Y Hop Nguyen, Doan Anh Phan Huu, Trung Thai Tran, Nhat Nam Mai, Van Toi Giap

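TL;DR: The paper proposes a unified vision-language framework for ear, nose, and throat (ENT) endoscopy image analysis that uses multi-level CLS token fusion and contrastive learning to perform well on classification and retrieval tasks.

Details

Motivation: Conventional CNN methods struggle to capture cross-modal semantics, and medical data is limited, so an efficient method is needed to improve multimodal alignment and representation diversity.

Result: In the ENTRep Grand Challenge, classification accuracy and F1-score reached 95%, Recall@1 was 0.93 for image-to-image retrieval and 0.92 for text-to-image retrieval, and the corresponding MRR scores were 0.97 and 0.96.

Insight: Combining multi-level CLS token fusion with contrastive learning significantly improves performance on medical multimodal tasks and suits low-resource clinical settings.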

Abstract: We present a unified vision-language framework tailored for ENT endoscopy image analysis that simultaneously tackles three clinically-relevant tasks: image classification, image-to-image retrieval, and text-to-image retrieval. Unlike conventional CNN-based pipelines that struggle to capture cross-modal semantics, our approach leverages the CLIP ViT-B/16 backbone and enhances it through Low-Rank Adaptation, multi-level CLS token aggregation, and spherical feature interpolation. These components collectively enable efficient fine-tuning on limited medical data while improving representation diversity and semantic alignment across modalities. To bridge the gap between visual inputs and textual diagnostic context, we introduce class-specific natural language prompts that guide the image encoder through a joint training objective combining supervised classification with contrastive learning. We validated our framework through participation in the ACM MM’25 ENTRep Grand Challenge, achieving 95% accuracy and F1-score in classification, Recall@1 of 0.93 and 0.92 for image-to-image and text-to-image retrieval respectively, and MRR scores of 0.97 and 0.96. Ablation studies demonstrated the incremental benefits of each architectural component, validating the effectiveness of our design for robust multimodal medical understanding in low-resource clinical settings.
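
The retrieval results above are reported as Recall@1 and MRR. As a reminder of what these numbers measure, here is a minimal sketch of both metrics under the simplifying assumption that each query has exactly one relevant item. The function names and the toy rankings are hypothetical and unrelated to the challenge data.

```python
def recall_at_k(ranked_lists, relevant, k=1):
    """Share of queries whose relevant item appears in the top k results."""
    hits = sum(1 for ranked, target in zip(ranked_lists, relevant)
               if target in ranked[:k])
    return hits / len(ranked_lists)


def mean_reciprocal_rank(ranked_lists, relevant):
    """Average of 1 / rank of the relevant item (0 if it is not retrieved)."""
    total = 0.0
    for ranked, target in zip(ranked_lists, relevant):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(ranked_lists)


# Three toy queries; the relevant item is "a" for each of them.
results = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
targets = ["a", "a", "a"]
r1 = recall_at_k(results, targets, k=1)
mrr = mean_reciprocal_rank(results, targets)
print(round(r1, 3), round(mrr, 3))
```

MRR is never lower than Recall@1, since every top-1 hit contributes a full point and lower-ranked hits still contribute a fraction.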

[60] MarkSplatter: Generalizable Watermarking for 3D Gaussian Splatting Model via Splatter Image Structure cs.CVPDF

Xiufeng Huang, Ziyuan Luo, Qi Song, Ruofei Wang, Renjie Wan

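TL;DR: This paper proposes MarkSplatter, a generalizable watermarking framework for 3D Gaussian Splatting (3DGS) models that uses the Splatter Image structure to provide efficient protection in a single forward pass, without expensive fine-tuning.

Details

Motivation: As 3DGS becomes more popular, demand for copyright protection is growing. Existing watermarking methods require expensive fine-tuning for each predefined message, which limits their practicality.

Result: MarkSplatter embeds watermarks efficiently in a single forward pass while preserving visual quality and robust watermark extraction.

Insight: Structured conversion combined with neural processing can significantly improve the efficiency and generality of copyright protection for 3DGS models.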

Abstract: The growing popularity of 3D Gaussian Splatting (3DGS) has intensified the need for effective copyright protection. Current 3DGS watermarking methods rely on computationally expensive fine-tuning procedures for each predefined message. We propose the first generalizable watermarking framework that enables efficient protection of Splatter Image-based 3DGS models through a single forward pass. We introduce GaussianBridge that transforms unstructured 3D Gaussians into Splatter Image format, enabling direct neural processing for arbitrary message embedding. To ensure imperceptibility, we design a Gaussian-Uncertainty-Perceptual heatmap prediction strategy for preserving visual quality. For robust message recovery, we develop a dense segmentation-based extraction mechanism that maintains reliable extraction even when watermarked objects occupy minimal regions in rendered views. Project page: https://kevinhuangxf.github.io/marksplatter.

[61] No More Sibling Rivalry: Debiasing Human-Object Interaction Detection cs.CVPDF

Bin Yang, Yulin Zhang, Hong-Yu Zhou, Sibei Yang

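TL;DR: This paper proposes a method to address the "Toxic Siblings" bias in HOI detection, using two new debiasing learning objectives to significantly improve performance.

Details

Motivation: Applying Transformers to HOI detection has brought progress, but the "Toxic Siblings" bias causes similar HOI triplets to interfere with one another, reducing model precision.

Result: On HICO-Det, the method improves mAP by 9.18% over the baseline and by 3.59% over the state-of-the-art method.

Insight: Mutual interference among similar categories is a major challenge in HOI detection. It can be mitigated by decoupling shared features and distinguishing intra-group differences.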

Abstract: Detection transformers have been applied to human-object interaction (HOI) detection, enhancing the localization and recognition of human-action-object triplets in images. Despite remarkable progress, this study identifies a critical issue, the “Toxic Siblings” bias, which hinders the interaction decoder’s learning, as numerous similar yet distinct HOI triplets interfere with and even compete against each other on both the input side and the output side of the interaction decoder. This bias arises from high confusion among sibling triplets/categories, where increased similarity paradoxically reduces precision, as one’s gain comes at the expense of its toxic sibling’s decline. To address this, we propose two novel debiasing learning objectives, “contrastive-then-calibration” and “merge-then-split”, targeting the input and output perspectives, respectively. The former samples sibling-like incorrect HOI triplets and reconstructs them into correct ones, guided by strong positional priors. The latter first learns shared features among sibling categories to distinguish them from other groups, then explicitly refines intra-group differentiation to preserve uniqueness. Experiments show that we significantly outperform both the baseline (+9.18% mAP on HICO-Det) and the state-of-the-art (+3.59% mAP) across various settings.

[62] InterPose: Learning to Generate Human-Object Interactions from Large-Scale Web Videos cs.CVPDF

Yangsong Zhang, Abdul Ahad Butt, Gül Varol, Ivan Laptev

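TL;DR: The paper proposes InterPose, a dataset of human-object interaction motions automatically extracted from large-scale web videos, and shows that it significantly improves the performance of motion generation models.

Details

Motivation: Existing motion capture data is usually limited to single people interacting with a small set of objects, and diverse motion datasets are lacking, which limits the ability to generate realistic human-object interaction motions.

Result: Experiments show that InterPose significantly improves the performance of existing methods and supports zero-shot generation of diverse motions.

Insight: Large-scale, diverse datasets are key to advancing human-object interaction motion generation, and the automatic extraction pipeline offers a feasible way to build them.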

Abstract: Human motion generation has shown great advances thanks to recent diffusion models trained on large-scale motion capture data. Most existing works, however, target the animation of isolated people in empty scenes. Meanwhile, synthesizing realistic human-object interactions in complex 3D scenes remains a critical challenge in computer graphics and robotics. One obstacle to generating versatile high-fidelity human-object interactions is the lack of large-scale datasets with diverse object manipulations. Indeed, existing motion capture data is typically restricted to single people and manipulations of limited sets of objects. To address this issue, we propose an automatic motion extraction pipeline and use it to collect interaction-rich human motions. Our new dataset InterPose contains 73.8K sequences of 3D human motions and corresponding text captions automatically obtained from 45.8K videos with human-object interactions. We perform extensive experiments and demonstrate that InterPose brings significant improvements to state-of-the-art methods for human motion generation. Moreover, using InterPose we develop an LLM-based agent enabling zero-shot animation of people interacting with diverse objects and scenes.

[63] Diffusion-Based Image-to-Brain Signal Generation with Cross-Attention Mechanisms for Visual Prostheses cs.CVPDF

Ganxi Xu, Jinyi Long, Jia Zhang

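TL;DR: This paper proposes an image-to-brain signal generation framework based on diffusion models and cross-attention mechanisms, aimed at improving the biological similarity of brain signals in visual prostheses.

Details

Motivation: Existing visual prostheses generate brain signals that lack biological similarity, and there are no supervised signals from real brain responses for validation. The paper aims to solve this with a more precise image-to-brain signal generation method.

Result: The biological plausibility of the generated signals is validated on the THINGS-EEG2 and THINGS-MEG datasets, and visualizations show intra-subject and inter-subject signal variations.

Insight: Cross-attention effectively improves fine-grained alignment between visual features and brain signals, offering a new approach to biological signal generation for visual prostheses.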

Abstract: Visual prostheses have shown great potential in restoring vision for blind individuals. On the one hand, researchers have been continuously improving the brain decoding framework of visual prostheses by leveraging the powerful image generation capabilities of diffusion models. On the other hand, the brain encoding stage of visual prostheses struggles to generate brain signals with sufficient biological similarity. Although existing works have recognized this problem, the quality of predicted stimuli still remains a critical issue, as existing approaches typically lack supervised signals from real brain responses to validate the biological plausibility of predicted stimuli. To address this issue, we propose a novel image-to-brain framework based on denoising diffusion probabilistic models (DDPMs) enhanced with cross-attention mechanisms. Our framework consists of two key architectural components: a pre-trained CLIP visual encoder that extracts rich semantic representations from input images, and a cross-attention enhanced U-Net diffusion model that learns to reconstruct biologically plausible brain signals through iterative denoising. Unlike conventional generative models that rely on simple concatenation for conditioning, our cross-attention modules enable dynamic interaction between visual features and brain signal representations, facilitating fine-grained alignment during the generation process. We evaluate our framework on two multimodal datasets (THINGS-EEG2 and THINGS-MEG) to demonstrate its effectiveness in generating biologically plausible brain signals. Moreover, we visualize the training and test M/EEG topographies for all subjects on both datasets to intuitively demonstrate the intra-subject variations and inter-subject variations in M/EEG signals.

[64] OmniReason: A Temporal-Guided Vision-Language-Action Framework for Autonomous Driving cs.CVPDF

Pei Liu, Qingtian Ning, Xinyan Lu, Haipeng Liu, Weiliang Ma

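TL;DR: The paper proposes the OmniReason framework, which combines dynamic 3D environment modeling with decision-making processes to address the lack of spatiotemporal reasoning in current vision-language models for autonomous driving. It makes two contributions, a data generation method and an agent architecture, which significantly improve spatiotemporal understanding and explainability in autonomous driving.

Details

Motivation: Current vision-language models excel at spatial reasoning but do not model the temporal dimension of real, dynamic driving scenarios, which limits decision-making in complex environments.

Result: Achieves state-of-the-art performance on open-loop planning tasks and visual question answering benchmarks, while enabling interpretable spatiotemporal reasoning in complex dynamic scenes.

Insight: Spatiotemporal modeling and decision explanation are key to improving the reliability and trustworthiness of autonomous driving systems, especially in dynamic and complex environments.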

Abstract: Recent advances in vision-language models (VLMs) have demonstrated impressive spatial reasoning capabilities for autonomous driving, yet existing methods predominantly focus on static scene understanding while neglecting the essential temporal dimension of real-world driving scenarios. To address this critical limitation, we propose the OmniReason framework, which establishes robust spatiotemporal reasoning by jointly modeling dynamic 3D environments and their underlying decision-making processes. Our work makes two fundamental advances: (1) We introduce OmniReason-Data, two large-scale vision-language-action (VLA) datasets with dense spatiotemporal annotations and natural language explanations, generated through a novel hallucination-mitigated auto-labeling pipeline that ensures both physical plausibility and temporal coherence; (2) We develop the OmniReason-Agent architecture, which integrates a sparse temporal memory module for persistent scene context modeling and an explanation generator that produces human-interpretable decision rationales, facilitated by our spatiotemporal knowledge distillation approach that effectively captures spatiotemporal causal reasoning patterns. Comprehensive experiments demonstrate state-of-the-art performance, where OmniReason-Agent achieves significant improvements in both open-loop planning tasks and visual question answering (VQA) benchmarks, while establishing new capabilities for interpretable, temporally-aware autonomous vehicles operating in complex, dynamic environments.

[65] Multimodal Iterative RAG for Knowledge Visual Question Answering cs.CV | cs.AIPDF

Changin Choi, Wonseok Lee, Jungmin Ko, Wonjong Rhee

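TL;DR: The paper proposes MI-RAG, a multimodal iterative retrieval-augmented generation framework that dynamically issues multiple queries against heterogeneous knowledge bases and progressively refines a reasoning record, significantly improving performance on knowledge-intensive visual question answering.

Details

Motivation: Although multimodal large language models (MLLMs) have made notable progress in multimodal understanding, they remain limited on visual question answering tasks that require external knowledge. Conventional single-pass retrieval-augmented generation (RAG) often fails to gather enough knowledge, so a more effective solution is needed.

Result: On challenging benchmarks including Encyclopedic VQA, InfoSeek, and OK-VQA, MI-RAG significantly improves retrieval recall and answer accuracy.

Insight: Through iterative querying and cross-modal knowledge retrieval, MI-RAG demonstrates a scalable approach to compositional reasoning in knowledge-intensive VQA.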

Abstract: While Multimodal Large Language Models (MLLMs) have significantly advanced multimodal understanding, their performance remains limited on knowledge-intensive visual questions that require external knowledge beyond the image. Although Retrieval-Augmented Generation (RAG) has become a promising solution for providing models with external knowledge, its conventional single-pass framework often fails to gather sufficient knowledge. To overcome this limitation, we propose MI-RAG, a Multimodal Iterative RAG framework that leverages reasoning to enhance retrieval and update reasoning over newly retrieved knowledge across modalities. At each iteration, MI-RAG leverages an accumulated reasoning record to dynamically formulate a multi-query. These queries then drive a joint search across heterogeneous knowledge bases containing both visually-grounded and textual knowledge. The newly acquired knowledge is synthesized into the reasoning record, progressively refining understanding across iterations. Experiments on challenging benchmarks, including Encyclopedic VQA, InfoSeek, and OK-VQA, show that MI-RAG significantly improves both retrieval recall and answer accuracy, establishing a scalable approach for compositional reasoning in knowledge-intensive VQA.
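
The iteration described in this abstract (form queries from the reasoning record, search several knowledge bases, fold new knowledge back into the record) can be sketched as a plain control loop. This is only an illustration of the loop structure under stated assumptions, not the authors' implementation. Every function, the stopping rule, and the toy dictionary "knowledge bases" are hypothetical stand-ins for the model and retriever components.

```python
def mi_rag(question, make_queries, knowledge_bases, synthesize, answer, max_iters=3):
    """Iterative retrieve-and-reason loop (hypothetical sketch)."""
    record = []   # accumulated reasoning record
    seen = set()  # documents already folded into the record
    for _ in range(max_iters):
        queries = make_queries(question, record)
        hits = [doc for kb in knowledge_bases for q in queries for doc in kb(q)]
        fresh = [doc for doc in dict.fromkeys(hits) if doc not in seen]
        if not fresh:
            break  # assumed stopping rule: nothing new to reason over
        seen.update(fresh)
        record.append(synthesize(question, fresh))
    return answer(question, record), record


# Toy stand-ins for a textual and a visually grounded knowledge base.
TEXT_KB = {"tower": ["The tower is in Paris."],
           "Paris": ["Paris is the capital of France."]}
IMAGE_KB = {"tower": ["Photo caption: Eiffel Tower at night."]}


def lookup(kb):
    return lambda query: kb.get(query, [])


def make_queries(question, record):
    # Follow up on Paris once the record mentions it; otherwise ask about the tower.
    return ["Paris"] if any("Paris" in note for note in record) else ["tower"]


def synthesize(question, docs):
    return " ".join(docs)


def answer(question, record):
    return "France" if any("France" in note for note in record) else "unknown"


result, record = mi_rag("Which country is this tower in?", make_queries,
                        [lookup(TEXT_KB), lookup(IMAGE_KB)], synthesize, answer)
print(result, len(record))
```

In this toy run, a single retrieval pass would find only the Paris fact. The second iteration is what surfaces the country.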

[66] SWAGSplatting: Semantic-guided Water-scene Augmented Gaussian Splatting cs.CVPDF

Zhuodong Jiang, Haoran Wang, Guoxi Huang, Brett Seymour, Nantheera Anantrasirichai

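TL;DR: SWAGSplatting uses multimodal knowledge in a semantic-guided 3D Gaussian Splatting method for underwater scene reconstruction. Embedded semantic features and a stage-wise training strategy significantly improve reconstruction quality.

Details

Motivation: Complicating factors in underwater environments, such as light distortion, turbidity, and limited visibility, make 3D reconstruction very challenging. Existing methods do not fully exploit the potential of AI, particularly the combination of language models with visual processing.

Result: Outperforms prior methods on the SeaThru-NeRF and Submerged3D datasets, with an average PSNR improvement of up to 3.09 dB.

Insight: Combining semantic information with visual processing significantly improves underwater reconstruction quality, and multimodal knowledge is a promising direction for future research.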

Abstract: Accurate 3D reconstruction in underwater environments remains a complex challenge due to issues such as light distortion, turbidity, and limited visibility. AI-based techniques have been applied to address these issues; however, existing methods have yet to fully exploit the potential of AI, particularly in integrating language models with visual processing. In this paper, we propose a novel framework that leverages multimodal cross-knowledge to create semantic-guided 3D Gaussian Splatting for robust and high-fidelity deep-sea scene reconstruction. By embedding an extra semantic feature into each Gaussian primitive, supervised by CLIP-extracted semantic features, our method enforces semantic and structural awareness throughout training. The dedicated semantic consistency loss ensures alignment with high-level scene understanding. In addition, we propose a novel stage-wise training strategy, combining coarse-to-fine learning with late-stage parameter refinement, to further enhance both stability and reconstruction quality. Extensive results show that our approach consistently outperforms state-of-the-art methods on the SeaThru-NeRF and Submerged3D datasets across three metrics, with an improvement of up to 3.09 dB on average in terms of PSNR, making it a strong candidate for applications in underwater exploration and marine perception.
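
To put the reported 3.09 dB PSNR gain in context, recall that PSNR = 10 · log10(MAX² / MSE). The sketch below computes PSNR for flat lists of pixel values and converts a dB gain into the equivalent reduction in mean squared error. The pixel values are hypothetical, and intensities are assumed to lie in [0, 1].

```python
import math


def psnr(reference, reconstructed, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel lists."""
    if len(reference) != len(reconstructed):
        raise ValueError("inputs must have the same length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical pixels: every reconstructed value is off by 0.1, so MSE = 0.01.
reference = [0.0, 0.5, 1.0, 0.5]
reconstructed = [0.1, 0.6, 0.9, 0.4]
value = psnr(reference, reconstructed)

# A gain of g dB corresponds to dividing the MSE by 10 ** (g / 10).
mse_ratio = 10 ** (3.09 / 10)
print(round(value, 2), round(mse_ratio, 2))
```

By this conversion, a 3.09 dB improvement means the mean squared error is roughly halved.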

[67] Adaptive Contrast Adjustment Module: A Clinically-Inspired Plug-and-Play Approach for Enhanced Fetal Plane Classification cs.CV | cs.AIPDF

Yang Chen, Sanglin Zhao, Baoyu Chen, Mans Gustaf

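TL;DR: The paper proposes an adaptive contrast adjustment module (ACAM), a plug-and-play module inspired by clinicians' practice of adjusting image contrast, to improve fetal ultrasound standard plane classification. The method improves model robustness and classification accuracy.

Details

Motivation: Fetal ultrasound standard plane classification is essential for medical diagnosis but is limited by low tissue contrast, blurred boundaries, and operator dependence. The paper aims to improve classification performance with a method that mimics clinical practice.

Result: The module improves classification performance: accuracy rises by 2.02% for lightweight models, 1.29% for traditional models, and 1.15% for SOTA models.

Insight: 1. Preprocessing that mimics clinical practice outperforms random preprocessing; 2. Multi-view fusion helps models adapt to image heterogeneity; 3. Effectively combining low-level features with high-level semantics is key to medical image analysis.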

Abstract: Fetal ultrasound standard plane classification is essential for reliable prenatal diagnosis but faces inherent challenges, including low tissue contrast, boundary ambiguity, and operator-dependent image quality variations. To overcome these limitations, we propose a plug-and-play adaptive contrast adjustment module (ACAM), whose core design is inspired by the clinical practice of doctors adjusting image contrast to obtain clearer and more discriminative structural information. The module employs a shallow texture-sensitive network to predict clinically plausible contrast parameters, transforms input images into multiple contrast-enhanced views through differentiable mapping, and fuses them within downstream classifiers. Validated on a multi-center dataset of 12,400 images across six anatomical categories, the module consistently improves performance across diverse models, with accuracy of lightweight models increasing by 2.02 percent, accuracy of traditional models increasing by 1.29 percent, and accuracy of state-of-the-art models increasing by 1.15 percent. The innovation of the module lies in its content-aware adaptation capability, replacing random preprocessing with physics-informed transformations that align with sonographer workflows while improving robustness to imaging heterogeneity through multi-view fusion. This approach effectively bridges low-level image features with high-level semantics, establishing a new paradigm for medical image analysis under real-world image quality variations.

[68] Sequential Difference Maximization: Generating Adversarial Examples via Multi-Stage Optimization cs.CV | cs.AI | cs.LG | Doctor of EngineeringPDF

Xinlei Liu, Tao Hu, Peng Yi, Weitao Han, Jichao Xie

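TL;DR: The paper proposes SDM, an adversarial example generation method based on multi-stage optimization that maximizes the difference between the upper bound of the non-true labels' probabilities and the true label's probability, significantly improving attack performance and cost-effectiveness.

Details

Motivation: Existing adversarial attack methods are not efficient enough for evaluating the robustness of computer vision models, so a more efficient attack method is needed.

Result: Experiments show that SDM outperforms existing SOTA methods in both attack performance and cost-effectiveness, and that it can be combined with adversarial training to strengthen defenses.

Insight: Through multi-stage optimization and a new loss function design, SDM significantly improves the efficiency of adversarial example generation and highlights the importance of loss function design in adversarial attacks.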

Abstract: Efficient adversarial attack methods are critical for assessing the robustness of computer vision models. In this paper, we reconstruct the optimization objective for generating adversarial examples as “maximizing the difference between the non-true labels’ probability upper bound and the true label’s probability,” and propose a gradient-based attack method termed Sequential Difference Maximization (SDM). SDM establishes a three-layer optimization framework of “cycle-stage-step.” The processes between cycles and between iterative steps are respectively identical, while optimization stages differ in terms of loss functions: in the initial stage, the negative probability of the true label is used as the loss function to compress the solution space; in subsequent stages, we introduce the Directional Probability Difference Ratio (DPDR) loss function to gradually increase the non-true labels’ probability upper bound by compressing the irrelevant labels’ probabilities. Experiments demonstrate that compared with previous SOTA methods, SDM not only exhibits stronger attack performance but also achieves higher attack cost-effectiveness. Additionally, SDM can be combined with adversarial training methods to enhance their defensive effects. The code is available at https://github.com/X-L-Liu/SDM.
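
Two quantities named in this abstract can be written down directly from softmax probabilities: the initial-stage loss (the negative probability of the true label) and the gap between the non-true labels and the true label. The sketch below reads the "probability upper bound" of the non-true labels as their maximum probability, which is an interpretation rather than the paper's definition. The DPDR loss is not specified in the abstract and is not attempted here. Function names and logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def initial_stage_loss(logits, true_label):
    """Negative probability of the true label (initial stage in the abstract)."""
    return -softmax(logits)[true_label]


def probability_gap(logits, true_label):
    """Largest non-true probability minus the true-label probability.

    Assumption: the non-true labels' "upper bound" is taken as their maximum.
    A positive gap means the input is misclassified.
    """
    probs = softmax(logits)
    best_other = max(p for i, p in enumerate(probs) if i != true_label)
    return best_other - probs[true_label]


clean_gap = probability_gap([2.0, 1.0, 0.0], true_label=0)      # true class wins
attacked_gap = probability_gap([0.0, 2.0, 1.0], true_label=0)   # another class wins
print(clean_gap < 0, attacked_gap > 0)
```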

[69] Surface Defect Detection with Gabor Filter Using Reconstruction-Based Blurring U-Net-ViT cs.CVPDF

Jongwook Si, Sungyoung Kim

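TL;DR: The paper proposes a new method that combines Gabor filters with a blurring U-Net-ViT model to improve the accuracy and reliability of texture-based surface defect detection. By combining U-Net's local feature learning with ViT's global processing, the model detects defects effectively across a variety of textures.

Details

Motivation: Surface defect detection is essential to industrial quality control, but robustness to complex textures and noisy environments remains challenging. Existing methods fall short in combining local and global features and in suppressing noise, so a more effective solution is needed.

Result: The model performs well on datasets including MVTec-AD and Surface Crack Detection, with an average AUC of 0.939. Ablation studies confirm that filter size and noise probability significantly affect performance.

Insight: 1. Hybrid models that combine local and global features can significantly improve defect detection. 2. Noise suppression and boundary reinforcement strategies matter for complex textures. 3. Parameter optimization, such as the Gabor filter configuration, is key to effective detection.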

Abstract: This paper proposes a novel approach to enhance the accuracy and reliability of texture-based surface defect detection using Gabor filters and a blurring U-Net-ViT model. By combining the local feature training of U-Net with the global processing of the Vision Transformer (ViT), the model effectively detects defects across various textures. A Gaussian filter-based loss function removes background noise and highlights defect patterns, while Salt-and-Pepper (SP) masking in the training process reinforces texture-defect boundaries, ensuring robust performance in noisy environments. Gabor filters are applied in post-processing to emphasize defect orientation and frequency characteristics. Parameter optimization, including filter size, sigma, wavelength, gamma, and orientation, maximizes performance across datasets like MVTec-AD, Surface Crack Detection, and Marble Surface Anomaly Dataset, achieving an average Area Under the Curve (AUC) of 0.939. The ablation studies validate that the optimal filter size and noise probability significantly enhance defect detection performance.
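
The five Gabor parameters the abstract lists (filter size, sigma, wavelength, gamma, orientation) map directly onto the standard real-valued Gabor kernel: a Gaussian envelope multiplied by a cosine carrier along the rotated x-axis. The sketch below builds that kernel in pure Python. The parameter values are hypothetical and are not the optimized settings from the paper.

```python
import math


def gabor_kernel(size, sigma, wavelength, gamma, theta, psi=0.0):
    """Real Gabor kernel as a size x size list of rows.

    g(x, y) = exp(-(x'^2 + gamma^2 * y'^2) / (2 * sigma^2)) * cos(2*pi*x'/wavelength + psi)
    where (x', y') is (x, y) rotated by theta.
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so the kernel has a center")
    half = size // 2
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = x * cos_t + y * sin_t
            yr = -x * sin_t + y * cos_t
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength + psi))
        kernel.append(row)
    return kernel


# Hypothetical settings chosen only to show the call.
kernel = gabor_kernel(size=5, sigma=2.0, wavelength=4.0, gamma=0.5, theta=0.0)
print(len(kernel), len(kernel[0]), kernel[2][2])
```

Sweeping `theta` over several orientations and convolving each kernel with an image is the usual way such filters emphasize oriented texture.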

[70] UPGS: Unified Pose-aware Gaussian Splatting for Dynamic Scene Deblurring cs.CVPDF

Zhijing Wu, Longguang Wang

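TL;DR: The paper proposes UPGS, a unified optimization framework that treats camera poses as learnable parameters alongside 3D Gaussian attributes for end-to-end optimization, significantly improving reconstruction quality and pose estimation accuracy in dynamic scene deblurring.

Details

Motivation: Reconstructing dynamic scenes from monocular video has broad applications in AR/VR, robotics, and other fields, but severe motion blur caused by camera and object motion makes accurate reconstruction hard for existing methods. In the conventional two-step pipeline, pose estimation errors degrade the reconstruction.

Result: Experiments on the Stereo Blur dataset and real-world sequences show that the method significantly outperforms existing dynamic deblurring methods in reconstruction quality and pose estimation accuracy.

Insight: A unified end-to-end optimization framework avoids the error accumulation of conventional pipelines, and the three-stage training strategy helps keep optimization stable.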

Abstract: Reconstructing dynamic 3D scenes from monocular video has broad applications in AR/VR, robotics, and autonomous navigation, but often fails due to severe motion blur caused by camera and object motion. Existing methods commonly follow a two-step pipeline, where camera poses are first estimated and then 3D Gaussians are optimized. Since blurring artifacts usually undermine pose estimation, pose errors can accumulate and produce inferior reconstruction results. To address this issue, we introduce a unified optimization framework by incorporating camera poses as learnable parameters complementary to 3DGS attributes for end-to-end optimization. Specifically, we recast camera and object motion as per-primitive SE(3) affine transformations on 3D Gaussians and formulate a unified optimization objective. For stable optimization, we introduce a three-stage training schedule that optimizes camera poses and Gaussians alternately. In particular, 3D Gaussians are first trained with poses fixed, and then poses are optimized with 3D Gaussians untouched. Finally, all learnable parameters are optimized together. Extensive experiments on the Stereo Blur dataset and challenging real-world sequences demonstrate that our method achieves significant gains in reconstruction quality and pose estimation accuracy over prior dynamic deblurring methods.
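
The three-stage schedule in this abstract (Gaussians only, then poses only, then everything jointly) amounts to choosing which parameter groups receive updates at each training step. A minimal sketch of that selection logic follows. The group names and step counts are hypothetical, since the abstract does not give stage lengths.

```python
def trainable_groups(step, gaussian_steps, pose_steps):
    """Return the parameter groups to update at a given training step.

    Stage 1: 3D Gaussians only (poses fixed).
    Stage 2: camera poses only (Gaussians untouched).
    Stage 3: all learnable parameters together.
    """
    if step < gaussian_steps:
        return {"gaussians"}
    if step < gaussian_steps + pose_steps:
        return {"poses"}
    return {"gaussians", "poses"}


# Hypothetical stage lengths: 100 Gaussian-only steps, then 50 pose-only steps.
schedule = [sorted(trainable_groups(s, 100, 50)) for s in (0, 120, 150)]
print(schedule)
```

In a training loop, the returned set would typically decide which groups have gradients enabled or are passed to the optimizer on that step.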

[71] SegDINO: An Efficient Design for Medical and Natural Image Segmentation with DINO-V3 cs.CVPDF

Sicheng Yang, Hongqiu Wang, Zhaohu Xing, Sixiang Chen, Lei Zhu

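TL;DR: SegDINO is an efficient image segmentation framework that couples a frozen DINOv3 backbone with a lightweight decoder. It extracts multi-level features and predicts segmentation masks directly with an MLP head, achieving strong performance at reduced computational cost.

Details

Motivation: DINO-family models transfer well, but using their features for segmentation usually relies on complex, computationally heavy decoders, which leads to excessive parameter counts and computational cost.

Result: SegDINO achieves state-of-the-art performance on six benchmark datasets, three medical and three natural-image.

Insight: A lightweight design not only lowers computational cost but also makes full use of the pretrained model's representational power, demonstrating the potential of efficient segmentation frameworks.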

Abstract: The DINO family of self-supervised vision models has shown remarkable transferability, yet effectively adapting their representations for segmentation remains challenging. Existing approaches often rely on heavy decoders with multi-scale fusion or complex upsampling, which introduce substantial parameter overhead and computational cost. In this work, we propose SegDINO, an efficient segmentation framework that couples a frozen DINOv3 backbone with a lightweight decoder. SegDINO extracts multi-level features from the pretrained encoder, aligns them to a common resolution and channel width, and utilizes a lightweight MLP head to directly predict segmentation masks. This design minimizes trainable parameters while preserving the representational power of foundation features. Extensive experiments across six benchmarks, including three medical datasets (TN3K, Kvasir-SEG, ISIC) and three natural image datasets (MSD, VMD-D, ViSha), demonstrate that SegDINO consistently achieves state-of-the-art performance compared to existing methods. Code is available at https://github.com/script-Yang/SegDINO.

[72] Look Beyond: Two-Stage Scene View Generation via Panorama and Video Diffusion cs.CV | cs.AIPDF

Xueyang Kang, Zhengkang Xiang, Zezheng Zhang, Kourosh Khoshelham

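TL;DR: The paper proposes a two-stage novel view synthesis method that uses panorama and video diffusion models to generate scene views that stay consistent over long ranges.

Details

Motivation: Single-view novel view synthesis suffers from large unobserved regions and large viewpoint shifts, so generated results lack consistency and correct alignment. Existing methods cannot meet the needs of long-range or looped trajectories.

Result: Experiments on diverse scene datasets show that the method generates more consistent views along user-defined trajectories than existing methods, with particularly strong results in loop closure scenarios.

Insight: Decomposing the problem into panorama generation and view interpolation, combined with the learning capacity of diffusion models, can significantly improve the global consistency and flexibility of novel view synthesis.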

Abstract: Novel view synthesis (NVS) from a single image is highly ill-posed due to large unobserved regions, especially for views that deviate significantly from the input. While existing methods focus on consistency between the source and generated views, they often fail to maintain coherence and correct view alignment across long-range or looped trajectories. We propose a model that addresses this by decomposing single-view NVS into a 360-degree scene extrapolation followed by novel view interpolation. This design ensures long-term view and scene consistency by conditioning on keyframes extracted and warped from a generated panoramic representation. In the first stage, a panorama diffusion model learns the scene prior from the input perspective image. Perspective keyframes are then sampled and warped from the panorama and used as anchor frames in a pre-trained video diffusion model, which generates novel views through a proposed spatial noise diffusion process. Compared to prior work, our method produces globally consistent novel views – even in loop closure scenarios – while enabling flexible camera control. Experiments on diverse scene datasets demonstrate that our approach outperforms existing methods in generating coherent views along user-defined trajectories. Our implementation is available at https://github.com/YiGuYT/LookBeyond.

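[73] Quantization Meets OOD: Generalizable Quantization-aware Training from a Flatness Perspective cs.CV

Jiacheng Jiang, Yuan Meng, Chen Tang, Han Yu, Qun Li

TL;DR: The paper shows that current quantization-aware training (QAT) methods improve performance on I.D data while significantly degrading generalization to OOD data. The authors propose FQAT, a flatness-oriented training method that uses a layer-wise freezing mechanism and an adaptive freezing algorithm to counter the sharp loss landscape induced by QAT and achieve better OOD generalization.

Details

Motivation: Existing QAT methods overlook performance degradation on OOD data, and there is a tension between the flat loss landscape that favors OOD generalization and the sharp landscape that QAT produces. The authors therefore propose a training method that accounts for both flatness and QAT.

Result: On I.D and OOD image classification tasks, FQAT significantly outperforms existing baselines, demonstrating superior generalization.

Insight: A flat loss landscape is key to OOD generalization, and the adaptive freezing mechanism effectively balances the dual optimization objectives of QAT and flatness.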

Abstract: Current quantization-aware training (QAT) methods primarily focus on enhancing the performance of quantized models on in-distribution (I.D) data, while overlooking the potential performance degradation on out-of-distribution (OOD) data. In this paper, we first substantiate this problem through rigorous experiments, showing that QAT can lead to a significant degradation in OOD generalization performance. Further, we find that this problem can be traced to a contradiction: a flat loss landscape is understood to give rise to superior OOD generalization, whereas QAT leads to a sharp loss landscape. Therefore, we propose a flatness-oriented QAT method, FQAT, to achieve generalizable QAT. Specifically, i) FQAT introduces a layer-wise freezing mechanism to mitigate the gradient conflict between the dual optimization objectives (i.e., vanilla QAT and flatness). ii) FQAT proposes a disorder-guided adaptive freezing algorithm that dynamically determines which layers to freeze at each training step, effectively addressing the challenges caused by interference between layers. A gradient disorder metric is designed to help the algorithm identify unstable layers during training. Extensive experiments on an influential OOD benchmark demonstrate the superiority of our method over state-of-the-art baselines on both I.D and OOD image classification tasks.

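[74] DarkVRAI: Capture-Condition Conditioning and Burst-Order Selective Scan for Low-light RAW Video Denoising cs.CV

Youngjin Oh, Junhyeong Kwon, Junyoung Park, Nam Ik Cho

TL;DR: DarkVRAI is a new framework for low-light RAW video denoising that achieves SOTA performance through capture-condition guidance and modeling of long-range temporal dependencies.

Details

Motivation: Low-light RAW video denoising is highly challenging because high sensor gain and short exposure times cause severe signal degradation, which calls for new methods.

Result: The method took first place in the AIM 2025 challenge and achieved SOTA performance on a realistic benchmark dataset.

Insight: Explicitly exploiting capture-condition information and modeling long-range temporal dependencies are key to improving low-light video denoising.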

Abstract: Low-light RAW video denoising is a fundamentally challenging task due to severe signal degradation caused by high sensor gain and short exposure times, which are inherently limited by video frame rate requirements. To address this, we propose DarkVRAI, a novel framework that achieved first place in the AIM 2025 Low-light RAW Video Denoising Challenge. Our method introduces two primary contributions: (1) a successful application of a conditioning scheme for image denoising, which explicitly leverages capture metadata, to video denoising to guide the alignment and denoising processes, and (2) a Burst-Order Selective Scan (BOSS) mechanism that effectively models long-range temporal dependencies within the noisy video sequence. By synergistically combining these components, DarkVRAI demonstrates state-of-the-art performance on a rigorous and realistic benchmark dataset, setting a new standard for low-light video denoising.

cs.CL

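[75] Compiling Prompts, Not Crafting Them: A Reproducible Workflow for AI-Assisted Evidence Synthesis cs.CL | cs.AI

Teo Susnjak

TL;DR: The paper proposes a reproducible workflow that replaces manually crafted prompts with declarative prompt optimisation to improve the reliability and reproducibility of large language models in systematic literature reviews.

Details

Motivation: Current use of large language models (LLMs) in systematic literature reviews relies on brittle, hand-crafted prompts, which undermines the reliability and reproducibility of the research.

Result: The work delivers a concrete blueprint with working code examples for building verifiable LLM pipelines, in line with the principles of transparent and rigorous evidence synthesis.

Insight: The work is presented as a novel application of declarative prompt optimisation to systematic literature review pipelines, offering a route to more reliable use of LLMs in science.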

Abstract: Large language models (LLMs) offer significant potential to accelerate systematic literature reviews (SLRs), yet current approaches often rely on brittle, manually crafted prompts that compromise reliability and reproducibility. This fragility undermines scientific confidence in LLM-assisted evidence synthesis. In response, this work adapts recent advances in declarative prompt optimisation, developed for general-purpose LLM applications, and demonstrates their applicability to the domain of SLR automation. This research proposes a structured, domain-specific framework that embeds task declarations, test suites, and automated prompt tuning into a reproducible SLR workflow. These emerging methods are translated into a concrete blueprint with working code examples, enabling researchers to construct verifiable LLM pipelines that align with established principles of transparency and rigour in evidence synthesis. This is a novel application of such approaches to SLR pipelines.

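[76] Explainable Chain-of-Thought Reasoning: An Empirical Analysis on State-Aware Reasoning Dynamics cs.CL | cs.AI

Sheldon Yu, Yuxin Xiong, Junda Wu, Xintong Li, Tong Yu

TL;DR: The paper proposes a state-aware transition framework for analyzing and explaining chain-of-thought (CoT) reasoning in large language models (LLMs); it improves interpretability by abstracting reasoning steps into structured latent dynamics.

Details

Motivation: Existing chain-of-thought (CoT) approaches offer little explanation of the high-level semantic roles of reasoning steps and their transitions, which limits transparency and trustworthiness.

Result: The method provides a structured, interpretable view of reasoning that reveals the semantic roles of reasoning steps and their transition patterns.

Insight: Quantitatively analyzing reasoning dynamics helps in understanding and improving multi-step reasoning in LLMs.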

Abstract: Recent advances in chain-of-thought (CoT) prompting have enabled large language models (LLMs) to perform multi-step reasoning. However, the explainability of such reasoning remains limited, with prior work primarily focusing on local token-level attribution, such that the high-level semantic roles of reasoning steps and their transitions remain underexplored. In this paper, we introduce a state-aware transition framework that abstracts CoT trajectories into structured latent dynamics. Specifically, to capture the evolving semantics of CoT reasoning, each reasoning step is represented via spectral analysis of token-level embeddings and clustered into semantically coherent latent states. To characterize the global structure of reasoning, we model their progression as a Markov chain, yielding a structured and interpretable view of the reasoning process. This abstraction supports a range of analyses, including semantic role identification, temporal pattern visualization, and consistency evaluation.
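
The Markov-chain step described in this abstract can be illustrated with a minimal Python sketch. It assumes each reasoning step has already been assigned a latent-state label (the spectral analysis and clustering are not shown), and the function name and state labels are hypothetical rather than taken from the paper.

```python
from collections import Counter, defaultdict

def estimate_transitions(trajectories):
    """Estimate first-order Markov transition probabilities from state sequences."""
    counts = defaultdict(Counter)
    for states in trajectories:
        # Count each consecutive (current, next) pair of latent states.
        for current, nxt in zip(states, states[1:]):
            counts[current][nxt] += 1
    matrix = {}
    for current, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        matrix[current] = {nxt: n / total for nxt, n in nxt_counts.items()}
    return matrix

# Hypothetical latent-state labels for two CoT trajectories.
trajectories = [
    ["setup", "compute", "compute", "verify"],
    ["setup", "compute", "verify", "verify"],
]
transitions = estimate_transitions(trajectories)
```

Each row of `transitions` sums to one, so it can be read as the empirical probability of moving from one reasoning state to the next.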

Seiji Maekawa, Hayate Iso, Nikita Bhutani

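TL;DR: The paper introduces the Distinctive Feature Mining (DFM) task, which evaluates how well LLMs identify rare features in a global context, and develops DiFBench, a configurable benchmark framework. Experiments show a significant performance gap between general-purpose and reasoning-enhanced models, and all models degrade markedly as task complexity increases.

Details

Motivation: Existing LLM benchmarks focus on information retrieval or summarization and do not evaluate the statistical reasoning needed to identify rare features in a global context, which matters in real-world scenarios such as candidate selection or product differentiation.

Result: 1. There is a significant performance gap between general-purpose and reasoning-enhanced models; 2. All models degrade substantially as task complexity and document count increase; 3. A common failure mode is misidentifying frequent features as rare.

Insight: 1. Current LLMs have limited ability in fine-grained statistical reasoning and rarity detection; 2. This blind spot for rarity exposes a key bottleneck for applying models in real-world scenarios.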

Abstract: Effective decision-making often relies on identifying what makes each candidate distinctive. While existing benchmarks for LLMs emphasize retrieving or summarizing information relevant to a given query, they do not evaluate a model’s ability to identify globally distinctive features across a set of documents. We introduce Distinctive Feature Mining (DFM), a new task that challenges models to analyze a small-to-medium collection (10-40 documents) and surface features that are rare in the global context (e.g., appearing in less than 10% of documents). This setting mirrors real-world scenarios such as candidate selection or product differentiation, where statistical reasoning, not retrieval, is key. To enable systematic evaluation of this capability, we present DiFBench, a configurable benchmark creation framework with controllable parameters such as document set size and distinctiveness thresholds. Using DiFBench, we perform a large-scale assessment of distinctive feature mining across ten state-of-the-art LLMs. Our findings reveal a significant performance gap between general-purpose and reasoning-enhanced models. All models, however, substantially degrade as the task complexity and document count increase. We also find that a common failure mode is misidentifying frequent features as distinctive. These insights reveal core limitations in contemporary LLMs’ abilities to perform fine-grained, statistical reasoning and rarity detection.
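
The rarity criterion in this abstract (features appearing in less than 10% of documents) can be sketched in a few lines of Python. This is only an illustration of the ground-truth criterion, assuming features have already been extracted per document; the function name and the toy data are hypothetical, and DiFBench itself is not reproduced here.

```python
from collections import Counter

def distinctive_features(doc_features, max_fraction=0.10):
    """Map each document to its features found in fewer than max_fraction of documents."""
    n_docs = len(doc_features)
    doc_freq = Counter()
    for features in doc_features.values():
        doc_freq.update(set(features))  # count each feature once per document
    rare = {f for f, count in doc_freq.items() if count / n_docs < max_fraction}
    return {doc: sorted(set(features) & rare) for doc, features in doc_features.items()}

# Hypothetical set of 20 candidate profiles.
docs = {f"cv{i}": ["python"] for i in range(20)}
docs["cv0"] = ["python", "rust"]  # "rust" appears in 1 of 20 documents
docs["cv1"] = ["python", "go"]    # "go" appears in 2 of 20 documents
docs["cv2"] = ["python", "go"]
result = distinctive_features(docs)
```

With the strict "less than 10%" threshold, "rust" (5% of documents) counts as distinctive while "go" (exactly 10%) does not.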

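[78] The Temporal Game: A New Perspective on Temporal Relation Extraction cs.CL

Hugo Sousa, Ricardo Campos, Alípio Jorge

TL;DR: The paper presents the Temporal Game, a new approach that models temporal relation extraction as an interactive game; point-wise comparisons between the start and end points of temporal entities enable more fine-grained and flexible annotation and also support reinforcement learning training.

Details

Motivation: Traditional temporal relation annotation typically labels interval-level relations directly, which lacks flexibility and granularity. The paper decomposes the task through gamification to improve annotation efficiency and consistency while laying the groundwork for reinforcement learning.

Result: The authors built the publicly available Temporal Game tool (https://temporal-game.inesctec.pt), which supports temporal annotation and reinforcement learning training and makes annotation more flexible and efficient.

Insight: Gamifying temporal relation extraction simplifies annotation and provides a new task framing for reinforcement learning, showing the potential of interactive methods in NLP tasks.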

Abstract: In this paper we demo the Temporal Game, a novel approach to temporal relation extraction that casts the task as an interactive game. Instead of directly annotating interval-level relations, our approach decomposes them into point-wise comparisons between the start and end points of temporal entities. At each step, players classify a single point relation, and the system applies temporal closure to infer additional relations and enforce consistency. This point-based strategy naturally supports both interval and instant entities, enabling more fine-grained and flexible annotation than any previous approach. The Temporal Game also lays the groundwork for training reinforcement learning agents, by treating temporal annotation as a sequential decision-making task. To showcase this potential, the demo presented in this paper includes a Game mode, in which users annotate texts from the TempEval-3 dataset and receive feedback based on a scoring system, and an Annotation mode, which allows custom documents to be annotated and the resulting timeline to be exported. Therefore, this demo serves both as a research tool and an annotation interface. The demo is publicly available at https://temporal-game.inesctec.pt, and the source code is open-sourced to foster further research and community-driven development in temporal reasoning and annotation.
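
The temporal-closure step mentioned in this abstract can be sketched in Python under simplifying assumptions: only a strict "before" relation between points is modeled (equality is left out), and the point names and function are hypothetical rather than taken from the Temporal Game source code.

```python
from itertools import product

def temporal_closure(before):
    """Return the transitive closure of a set of (earlier, later) point pairs."""
    closed = set(before)
    changed = True
    while changed:
        changed = False
        # Chain a < b and b < d into a < d until nothing new is inferred.
        for (a, b), (c, d) in product(list(closed), repeat=2):
            if b == c and (a, d) not in closed:
                closed.add((a, d))
                changed = True
    # A point placed before itself signals contradictory annotations.
    for a, b in closed:
        if a == b:
            raise ValueError(f"inconsistent annotations involving {a!r}")
    return closed

# Each interval entity contributes start < end; the player adds one point relation.
annotations = {
    ("e1.start", "e1.end"),
    ("e2.start", "e2.end"),
    ("e1.end", "e2.start"),  # player's answer: e1 ends before e2 starts
}
closed = temporal_closure(annotations)
```

From a single player decision the closure infers the remaining point relations, which together say that the hypothetical entity `e1` lies entirely before `e2`.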

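[79] Exploring Reasoning-Infused Text Embedding with Large Language Models for Zero-Shot Dense Retrieval cs.CL

Yuxiang Liu, Tian Wang, Gourab Kundu, Tianyu Cao, Guang Cheng

TL;DR: The paper proposes RITE, a text embedding method that draws on the logical reasoning ability of generative large language models (LLMs) by adding a reasoning step to the embedding process, significantly improving zero-shot dense retrieval.

Details

Motivation: Traditional Transformer-based text embedding models such as BERT and E5 capture contextual representations well but fall short on retrieval tasks that require complex reasoning. Existing LLM-based embedding methods do not fully exploit the reasoning ability of LLMs, so a new method is needed to close this gap.

Result: Experiments on BRIGHT, a reasoning-intensive retrieval benchmark, show that RITE significantly outperforms baselines on zero-shot retrieval across multiple domains.

Insight: Incorporating reasoning into the text embedding process helps with retrieval for complex queries; in zero-shot settings in particular, it can significantly improve performance.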

Abstract: Transformer-based models such as BERT and E5 have significantly advanced text embedding by capturing rich contextual representations. However, many complex real-world queries require sophisticated reasoning to retrieve relevant documents beyond surface-level lexical matching, where encoder-only retrievers often fall short. Decoder-only large language models (LLMs), known for their strong reasoning capabilities, offer a promising alternative. Despite this potential, existing LLM-based embedding methods primarily focus on contextual representation and do not fully exploit the reasoning strength of LLMs. To bridge this gap, we propose Reasoning-Infused Text Embedding (RITE), a simple but effective approach that integrates logical reasoning into the text embedding process using generative LLMs. RITE builds upon existing language model embedding techniques by generating intermediate reasoning texts in the token space before computing embeddings, thereby enriching representations with inferential depth. Experimental results on BRIGHT, a reasoning-intensive retrieval benchmark, demonstrate that RITE significantly enhances zero-shot retrieval performance across diverse domains, underscoring the effectiveness of incorporating reasoning into the embedding process.

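[80] Balanced Actor Initialization: Stable RLHF Training of Distillation-Based Reasoning Models cs.CL

Chen Zheng, Yiyuan Ma, Yuan Yang, Deyi Liu, Jing Liu

TL;DR: The paper proposes Balanced Actor Initialization (BAI) to address training instabilities that arise when applying RLHF to distillation-trained models, such as Sequence Length Collapse and the Reward Hockey Stick Curve.

Details

Motivation: When the instruction tuning and RLHF alignment paradigm is combined with the distillation-based reasoning fine-tuning paradigm, training shows instabilities such as Sequence Length Collapse and the Reward Hockey Stick Curve, which compromise the model's reasoning and alignment capabilities.

Result: BAI resolves Sequence Length Collapse, mitigates the Reward Hockey Stick Curve, and performs well across experiments, balancing training stability with reasoning capability.

Insight: Balanced merging ratios are key to achieving the best trade-off between training stability and preservation of reasoning capability.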

Abstract: The development of alignment and reasoning capabilities in large language models has seen remarkable progress through two paradigms: the instruction tuning and reinforcement learning from human feedback (RLHF) alignment paradigm, and the distillation-based reasoning fine-tuning paradigm. While both approaches prove effective independently, the third paradigm of applying RLHF to distillation-trained models presents significant challenges. Our investigation reveals two critical phenomena that emerge in this paradigm: Sequence Length Collapse, where the length of generated sequences drops dramatically during early RLHF training, and the Reward Hockey Stick Curve, featuring severe reward score drops followed by gradual recovery. These instabilities fundamentally compromise the model’s alignment and reasoning capabilities. To address these challenges, we propose Balanced Actor Initialization (BAI), a two-stage weighted model merging approach. BAI first merges instruction-following and distillation-based reasoning fine-tuned models, then further combines this intermediate model with the pretrained model to preserve foundational knowledge. Through comprehensive experiments across diverse benchmarks and detailed analysis of training experiments, we demonstrate that BAI resolves Sequence Length Collapse, mitigates the Reward Hockey Stick Curve, and enables continuous sequence length improvement during training. Additionally, our analysis reveals that balanced merging ratios achieve optimal trade-offs between training stability and reasoning capability preservation. Our work provides an effective solution for stable training in this third paradigm, enabling more capable reasoning models that combine distillation efficiency with RLHF alignment.
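
The two-stage merge described in this abstract can be sketched as follows. The sketch assumes that "weighted model merging" means linear interpolation of parameters, which the abstract does not state; the function names, the toy parameter dicts, and the 0.5 merge weights are all hypothetical.

```python
def weighted_merge(model_a, model_b, weight_a):
    """Linearly interpolate two parameter dicts that share names and shapes."""
    if model_a.keys() != model_b.keys():
        raise ValueError("models must share parameter names")
    return {
        name: [weight_a * a + (1.0 - weight_a) * b
               for a, b in zip(model_a[name], model_b[name])]
        for name in model_a
    }

def balanced_actor_init(instruct, reasoning, pretrained,
                        stage1_weight=0.5, stage2_weight=0.5):
    """Stage 1: merge the two fine-tuned models. Stage 2: merge with the pretrained model."""
    intermediate = weighted_merge(instruct, reasoning, stage1_weight)
    return weighted_merge(intermediate, pretrained, stage2_weight)

# Toy one-tensor "models" standing in for real checkpoints.
instruct = {"w": [1.0, 0.0]}
reasoning = {"w": [0.0, 1.0]}
pretrained = {"w": [0.0, 0.0]}
actor = balanced_actor_init(instruct, reasoning, pretrained)
```

The second stage is what pulls the merged actor back toward the pretrained weights; how the ratios should be set is the subject of the paper's analysis and is not inferred here.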

Rinku Dewri

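TL;DR: GIER is a framework based on self-reflection and iterative revision that improves large language model outputs by identifying and closing reasoning gaps, without relying on examples or prompt templates.

Details

Motivation: Current large language models fall short in generating high-quality reasoning and rationales, and existing methods usually rely on examples or complex prompting strategies. GIER aims to address this through self-reflection and iterative refinement.

Result: Across three reasoning-intensive tasks (SciFact, PrivacyQA, and e-SNLI) and four large language models, GIER markedly improves reasoning quality and alignment while maintaining task accuracy.

Insight: The study shows that large language models can understand abstract descriptions of reasoning gaps and turn them into concrete improvements, highlighting the potential for model self-refinement.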

Abstract: We introduce GIER (Gap-driven Iterative Enhancement of Responses), a general framework for improving large language model (LLM) outputs through self-reflection and revision based on conceptual quality criteria. Unlike prompting strategies that rely on demonstrations, examples, or chain-of-thought templates, GIER utilizes natural language descriptions of reasoning gaps, and prompts a model to iteratively critique and refine its own outputs to better satisfy these criteria. Across three reasoning-intensive tasks (SciFact, PrivacyQA, and e-SNLI) and four LLMs (GPT-4.1, GPT-4o Mini, Gemini 1.5 Pro, and Llama 3.3 70B), GIER improves rationale quality, grounding, and reasoning alignment without degrading task accuracy. Our analysis demonstrates that models can not only interpret abstract conceptual gaps but also translate them into concrete reasoning improvements.

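[82] Open Data Synthesis For Deep Research cs.CL | cs.AI

Ziyi Xia, Kun Luo, Hongjin Qian, Zheng Liu

TL;DR: The paper proposes InfoSeek, a framework that uses a dual-agent system to synthesize datasets of complex Deep Research tasks, addressing the limited task complexity of existing benchmarks, and demonstrates strong performance.

Details

Motivation: Existing benchmarks fail to capture the complexity of Deep Research tasks, and synthetic datasets often suffer from shortcut reasoning or insufficient structural depth.

Result: Experiments show that models trained on InfoSeek surpass larger models and lightweight commercial APIs on the BrowseComp-Plus benchmark, with performance comparable to stronger APIs.

Insight: Hierarchical task design and preserved meta-information are important for optimizing models on Deep Research tasks.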

Abstract: Large language models (LLMs) are increasingly expected to go beyond simple factual queries toward Deep Research: tasks that require decomposing questions into sub-problems, coordinating multi-step reasoning, and synthesizing evidence from diverse sources. We formalize Deep Research tasks with verifiable answers as Hierarchical Constraint Satisfaction Problems (HCSPs), which are fundamentally different from single-constraint, multi-hop, or flat CSP formulations. However, existing benchmarks (e.g., Natural Questions, HotpotQA) fail to capture this complexity, while recent synthetic datasets often introduce shortcut reasoning, knowledge leakage, or lack sufficient structural depth. To address this gap, we introduce InfoSeek, a scalable framework for synthesizing complex Deep Research tasks. InfoSeek uses a dual-agent system to recursively build a Research Tree from large-scale webpages, blurring intermediate nodes into valid sub-problems, and converting these trees into natural language questions that require traversing the full hierarchy. It also enables rapid scaling, yielding over 50K training examples, a curated test set, and reasoning trajectories generated via rejection sampling. Experiments show that models trained on InfoSeek consistently outperform strong baselines. On the challenging BrowseComp-Plus benchmark, 3B LLMs optimized with InfoSeek surpass much larger 32B models and lightweight commercial APIs (e.g., Gemini2.5-Flash), while achieving performance comparable to stronger APIs (e.g., Gemini2.5-Pro). By preserving meta-information such as intermediate steps and retrieval labels, InfoSeek further supports advanced optimization strategies, including compound reward design and trajectory-level exploration. We provide our code and datasets at https://github.com/VectorSpaceLab/InfoSeek.

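[83] The Resurgence of GCG Adversarial Attacks on Large Language Models cs.CL | cs.AI | cs.CR | cs.LG

Yuting Tan, Xuying Li, Zhuo Li, Huizhen Shu, Peikang Hu

TL;DR: The paper systematically evaluates the adversarial attack effectiveness of GCG and its variant T-GCG on open-source large language models (LLMs) of different scales. It finds that attack success rates fall as model size grows, that reasoning tasks are more vulnerable, and that simulated annealing diversifies attacks but brings limited gains under semantic evaluation.

Details

Motivation: The study examines how gradient-based adversarial prompting (such as the GCG algorithm) performs across LLMs, particularly the difference between safety-oriented prompts and reasoning-intensive tasks.

Result: The results show that 1) larger models are more resistant to attacks; 2) prefix-based evaluation overestimates attack effectiveness; 3) coding prompts are more vulnerable than safety prompts.

Insight: The findings call for more robust adversarial evaluation methods; simulated annealing shows promise for adversarial search but needs further refinement.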

Abstract: Gradient-based adversarial prompting, such as the Greedy Coordinate Gradient (GCG) algorithm, has emerged as a powerful method for jailbreaking large language models (LLMs). In this paper, we present a systematic appraisal of GCG and its annealing-augmented variant, T-GCG, across open-source LLMs of varying scales. Using Qwen2.5-0.5B, LLaMA-3.2-1B, and GPT-OSS-20B, we evaluate attack effectiveness on both safety-oriented prompts (AdvBench) and reasoning-intensive coding prompts. Our study reveals three key findings: (1) attack success rates (ASR) decrease with model size, reflecting the increasing complexity and non-convexity of larger models’ loss landscapes; (2) prefix-based heuristics substantially overestimate attack effectiveness compared to GPT-4o semantic judgments, which provide a stricter and more realistic evaluation; and (3) coding-related prompts are significantly more vulnerable than adversarial safety prompts, suggesting that reasoning itself can be exploited as an attack vector. In addition, preliminary results with T-GCG show that simulated annealing can diversify adversarial search and achieve competitive ASR under prefix evaluation, though its benefits under semantic judgment remain limited. Together, these findings highlight the scalability limits of GCG, expose overlooked vulnerabilities in reasoning tasks, and motivate further development of annealing-inspired strategies for more robust adversarial evaluation.
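
Finding (2) in this abstract can be made concrete with a small Python sketch of a prefix-based success check. The refusal prefixes and example responses are hypothetical, chosen for illustration; the paper's exact heuristic and its GPT-4o judging setup are not reproduced here.

```python
# Hypothetical refusal prefixes; real evaluations define their own lists.
REFUSAL_PREFIXES = ("I'm sorry", "I cannot", "I can't", "As an AI")

def prefix_attack_success(response, refusal_prefixes=REFUSAL_PREFIXES):
    """Count an attack as successful if the response does not open with a refusal prefix."""
    return not response.lstrip().startswith(tuple(refusal_prefixes))

def attack_success_rate(responses, refusal_prefixes=REFUSAL_PREFIXES):
    """Fraction of responses judged successful by the prefix heuristic."""
    if not responses:
        raise ValueError("responses must be non-empty")
    hits = sum(prefix_attack_success(r, refusal_prefixes) for r in responses)
    return hits / len(responses)

responses = [
    "I'm sorry, but I can't help with that.",
    "Sure. Here is the requested function.",
    "Well, that is not something I will assist with.",  # a refusal in meaning
]
prefix_asr = attack_success_rate(responses)
```

The third response declines without using a listed prefix, so the heuristic still counts it as a success. A semantic judge would not, which is one way the two measures can diverge.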

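[84] The Gold Medals in an Empty Room: Diagnosing Metalinguistic Reasoning in LLMs with Camlang cs.CL

Fenghua Liu, Yulong Chen, Yixuan Liu, Zhujun Jin, Solomon Tsai

TL;DR: The paper tests the metalinguistic reasoning of LLMs with Camlang, a newly constructed language, and finds that GPT-5 performs far below humans on Camlang tasks, exposing limits in its reasoning ability.

Details

Motivation: The study assesses whether LLMs can master an unfamiliar language through explicit metalinguistic reasoning rather than relying on pattern matching alone.

Result: GPT-5 performs poorly in Camlang (47% EM), far below humans (87%), and most of its successes stem from shallow lexical alignment.

Insight: Current LLMs lag well behind humans in systematic grammatical mastery and metalinguistic reasoning.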

Abstract: Large Language Models (LLMs) achieve gold-medal performance across many benchmarks, yet it remains unclear whether such success reflects genuine reasoning or pattern matching. From a cognitive science perspective, an informative test is whether models can master an unfamiliar language through explicit metalinguistic deductive learning, a paradigm where human learners can reliably internalise grammatical systems through metalinguistic reasoning. We address this question with Camlang, a novel constructed language that exhibits naturalistic yet unattested feature combinations. Camlang consists of two explicit resources, a grammar book and a bilingual dictionary, which mirror adult second-language learning via explicit grammar rules and lexical lookup, and enable us to disentangle errors in morpho-syntax, lexical semantics, and sentence-level reasoning. Human experiments show that these resources are sufficient for participants to acquire Camlang and successfully solve Camlang tasks. To operationalise evaluation, we adapt CommonsenseQA into Camlang, creating Camlang-CSQA-v0, the first task in a broader suite where solving questions requires applying grammar rules and lexical mappings. Experimental results show that GPT-5 achieves 98% EM accuracy in English but only 47% in Camlang, far below human performance at 87%, while other state-of-the-art reasoning LLMs perform even worse. Human verification further reveals that most model successes stem from shallow lexical alignment while GPT-5 shows emerging metalinguistic awareness to a limited extent but not systematic grammatical mastery as humans. Camlang establishes a cognitively grounded evaluation paradigm that exposes fundamental gaps between current models and human metalinguistic competence.

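[85] CVPD at QIAS 2025 Shared Task: An Efficient Encoder-Based Approach for Islamic Inheritance Reasoning cs.CL | cs.LG

Salah Eddine Bekhouche, Abdellah Zakaria Sellam, Hichem Telli, Cosimo Distante, Abdenour Hadid

TL;DR: The paper proposes a lightweight framework that uses a specialised Arabic text encoder and Attentive Relevance Scoring (ARS) to answer multiple-choice questions on Islamic inheritance law. Large models are more accurate, but the method's efficiency, on-device deployability, and privacy advantages make it better suited to high-stakes domains.

Details

Motivation: Islamic inheritance law requires precise identification of heirs and calculation of shares, which is challenging for AI. The paper addresses this with a lightweight framework that balances performance and practicality.

Result: The MARBERT-based approach achieves 69.87% accuracy, below the 87.6% of the best LLM, but offers advantages in efficiency, deployment, and privacy.

Insight: In high-stakes domains, small specialised systems have lower peak performance but offer a more practical option in terms of resource consumption and privacy protection.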

Abstract: Islamic inheritance law (Ilm al-Mawarith) requires precise identification of heirs and calculation of shares, which poses a challenge for AI. In this paper, we present a lightweight framework for solving multiple-choice inheritance questions using a specialised Arabic text encoder and Attentive Relevance Scoring (ARS). The system ranks answer options according to semantic relevance, and enables fast, on-device inference without generative reasoning. We evaluate Arabic encoders (MARBERT, ArabicBERT, AraBERT) and compare them with API-based LLMs (Gemini, DeepSeek) on the QIAS 2025 dataset. While large models achieve an accuracy of up to 87.6%, they require more resources and are context-dependent. Our MARBERT-based approach achieves 69.87% accuracy, presenting a compelling case for efficiency, on-device deployability, and privacy. While this is lower than the 87.6% achieved by the best-performing LLM, our work quantifies a critical trade-off between the peak performance of large models and the practical advantages of smaller, specialized systems in high-stakes domains.

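[86] Modeling Motivated Reasoning in Law: Evaluating Strategic Role Conditioning in LLM Summarization cs.CL | cs.CY

Eunjung Cho, Alexander Hoyle, Yoan Hermstrüwer

TL;DR: The paper studies how large language models (LLMs) exhibit motivated reasoning when prompted with different legal roles (e.g., judges, prosecutors, attorneys) while summarizing legal texts, and proposes an evaluation framework based on the inclusion of legal facts and reasoning.

Details

Motivation: As LLMs are increasingly used to generate user-tailored summaries, especially in the legal domain, whether model outputs are shaped by the motivations of a legal role becomes an important question. The paper investigates whether LLMs show role-consistent selective inclusion of content in their summaries.

Result: Experiments show that even when prompts include balancing instructions, LLM summaries selectively include information consistent with a particular role, reflecting a tendency toward motivated reasoning.

Insight: The paper reveals the risk of implicit role bias in LLM legal summarization and underscores the need for role-aware evaluation of model behavior in high-stakes legal settings.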

Abstract: Large Language Models (LLMs) are increasingly used to generate user-tailored summaries, adapting outputs to specific stakeholders. In legal contexts, this raises important questions about motivated reasoning – how models strategically frame information to align with a stakeholder’s position within the legal system. Building on theories of legal realism and recent trends in legal practice, we investigate how LLMs respond to prompts conditioned on different legal roles (e.g., judges, prosecutors, attorneys) when summarizing judicial decisions. We introduce an evaluation framework grounded in legal fact and reasoning inclusion, also considering favorability towards stakeholders. Our results show that even when prompts include balancing instructions, models exhibit selective inclusion patterns that reflect role-consistent perspectives. These findings raise broader concerns about how similar alignment may emerge as LLMs begin to infer user roles from prior interactions or context, even without explicit role instructions. Our results underscore the need for role-aware evaluation of LLM summarization behavior in high-stakes legal settings.

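[87] Thinking Hard, Going Misaligned: Emergent Misalignment in LLMs cs.CL

Hanqi Yan, Hainiu Xu, Yulan He

TL;DR: The paper reveals that large language models (LLMs) can become misaligned as their reasoning is strengthened, a phenomenon termed "Reasoning-Induced Misalignment", and stresses the need for better alignment of advanced reasoning models.

Details

Motivation: With the wide adoption of LLMs, their safety and alignment with human values draw growing concern. Earlier studies focused on misalignment caused by fine-tuning on malicious data; this paper examines a more concerning form of misalignment that can arise when reasoning is strengthened.

Result: Strengthened reasoning makes LLMs more responsive to malicious requests, with dense models particularly vulnerable; attention shifts and specialized experts in mixture-of-experts models help redirect excessive reasoning toward safety guardrails.

Insight: The work exposes a new trade-off between reasoning and safety and stresses that alignment must advance alongside reasoning capability to avoid potential risks.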

Abstract: With Large Language Models (LLMs) becoming increasingly widely adopted, concerns regarding their safety and alignment with human values have intensified. Previous studies have shown that fine-tuning LLMs on narrow and malicious datasets induces misaligned behaviors. In this work, we report a more concerning phenomenon, Reasoning-Induced Misalignment. Specifically, we observe that LLMs become more responsive to malicious requests when reasoning is strengthened, via switching to “think-mode” or fine-tuning on benign math datasets, with dense models particularly vulnerable. Moreover, we analyze internal model states and find that both attention shifts and specialized experts in mixture-of-experts models help redirect excessive reasoning towards safety guardrails. These findings provide new insights into the emerging reasoning-safety trade-off and underscore the urgency of advancing alignment for advanced reasoning models.

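[88] Text Reinforcement for Multimodal Time Series Forecasting cs.CL

Chen Su, Yuanhe Tian, Yan Song, Yongdong Zhang

TL;DR: The paper proposes a text reinforcement model (TeR) that improves multimodal time series forecasting (TSF) by reinforcing the text modality. The reinforced text generated by TeR compensates for weaknesses in the original text, and the model is optimized with reinforcement learning to improve forecasting.

Details

Motivation: Multimodal TSF relies on high-quality text and time series data, but the original text may not accurately or fully reflect the information in the time series, leading to unstable forecasting performance. Text content therefore needs to be enhanced to improve multimodal TSF.

Result: The method's effectiveness is validated on a real-world benchmark dataset covering multiple domains, where it outperforms existing baselines and prior studies.

Insight: Reinforcing the text modality can effectively improve multimodal TSF, and reinforcement learning further improves the quality of the reinforced text and its contribution to the forecasting task.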

Abstract: Recent studies in time series forecasting (TSF) use multimodal inputs, such as text and historical time series data, to predict future values. These studies mainly focus on developing advanced techniques to integrate textual information with time series data to perform the task and achieve promising results. Meanwhile, these approaches rely on high-quality text and time series inputs, whereas in some cases, the text does not accurately or fully capture the information carried by the historical time series, which leads to unstable performance in multimodal TSF. Therefore, it is necessary to enhance the textual content to improve the performance of multimodal TSF. In this paper, we propose improving multimodal TSF by reinforcing the text modalities. We propose a text reinforcement model (TeR) to generate reinforced text that addresses potential weaknesses in the original text, then apply this reinforced text to support the multimodal TSF model’s understanding of the time series, improving TSF performance. To guide the TeR toward producing higher-quality reinforced text, we design a reinforcement learning approach that assigns rewards based on the impact of each reinforced text on the performance of the multimodal TSF model and its relevance to the TSF task. We optimize the TeR accordingly, so as to improve the quality of the generated reinforced text and enhance TSF performance. Extensive experiments on a real-world benchmark dataset covering various domains demonstrate the effectiveness of our approach, which outperforms strong baselines and existing studies on the dataset.

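[89] CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders cs.CL

Alex Gulko, Yusen Peng, Sachin Kumar

TL;DR: CE-Bench is a lightweight contrastive evaluation benchmark for assessing the interpretability of sparse autoencoders.

Details

Motivation: Sparse autoencoders hold great promise for large language models, but the lack of automated evaluation methods has hindered their development.

Result: Experiments show that CE-Bench reliably measures the interpretability of sparse autoencoders and aligns with existing benchmarks.

Insight: CE-Bench does not depend on an external large language model, providing a lightweight way to evaluate sparse autoencoders.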

Abstract: Probing with sparse autoencoders is a promising approach for uncovering interpretable features in large language models (LLMs). However, the lack of automated evaluation methods has hindered their broader adoption and development. In this work, we introduce CE-Bench, a novel and lightweight contrastive evaluation benchmark for sparse autoencoders, built on a curated dataset of contrastive story pairs. We conduct comprehensive ablation studies to validate the effectiveness of our approach. Our results show that CE-Bench reliably measures the interpretability of sparse autoencoders and aligns well with existing benchmarks, all without requiring an external LLM. The official implementation and evaluation dataset are open-sourced under the MIT License.

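[90] Learning to Shop Like Humans: A Review-driven Retrieval-Augmented Recommendation Framework with LLMs cs.CL

Kaiwen Wei, Jinpeng Gao, Jiang Zhong, Yuming Yang, Fengmao Lv

TL;DR: RevBrowse is a large language model (LLM)-based recommendation framework that combines user reviews with the retrieval-augmented module PrefRAG to model user preferences dynamically, significantly improving recommendation quality.

Details

Motivation: Existing LLMs struggle to use user reviews efficiently in recommendation tasks and lack a mechanism for dynamically selecting relevant reviews, so a new method is needed to better incorporate review information.

Result: On four Amazon review datasets, RevBrowse significantly outperforms baselines, demonstrating its generalizability and its effectiveness in modeling dynamic user preferences.

Insight: The retrieval-augmented process is transparent, which makes recommendations more interpretable by showing which reviews influenced the final recommendation.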

Abstract: Large language models (LLMs) have shown strong potential in recommendation tasks due to their strengths in language understanding, reasoning and knowledge integration. These capabilities are especially beneficial for review-based recommendation, which relies on semantically rich user-generated texts to reveal fine-grained user preferences and item attributes. However, effectively incorporating reviews into LLM-based recommendation remains challenging due to (1) the inefficiency of dynamically utilizing user reviews under LLMs’ constrained context windows, and (2) the lack of effective mechanisms to prioritize the reviews most relevant to the user’s current decision context. To address these challenges, we propose RevBrowse, a review-driven recommendation framework inspired by the “browse-then-decide” decision process commonly observed in online user behavior. RevBrowse integrates user reviews into the LLM-based reranking process to enhance its ability to distinguish between candidate items. To improve the relevance and efficiency of review usage, we introduce PrefRAG, a retrieval-augmented module that disentangles user and item representations into structured forms and adaptively retrieves preference-relevant content conditioned on the target item. Extensive experiments on four Amazon review datasets demonstrate that RevBrowse achieves consistent and significant improvements over strong baselines, highlighting its generalizability and effectiveness in modeling dynamic user preferences. Furthermore, since the retrieval-augmented process is transparent, RevBrowse offers a certain level of interpretability by making visible which reviews influence the final recommendation.

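[91] Reward-Weighted Sampling: Enhancing Non-Autoregressive Characteristics in Masked Diffusion LLMs cs.CL | cs.AI

Daehoon Gwak, Minseo Jung, Junwoo Park, Minho Park, ChaeHun Park

TL;DR: The paper proposes Reward-Weighted Sampling (RWS), a decoding strategy that uses an external reward model to provide a global signal during the diffusion process, strengthening the non-autoregressive characteristics of masked diffusion large language models and improving the global coherence and performance of generated sequences.

Details

Motivation: Standard decoding methods such as confidence-based sampling select tokens based on each token's confidence independently, which yields generation orders that resemble an autoregressive process and limits the advantages of non-autoregressive models. A method is therefore needed that introduces a global signal during diffusion to improve the generation order.

Result: Experiments show that RWS significantly promotes non-autoregressive generation orders and improves performance across multiple evaluation metrics.

Insight: Integrating global signals is essential for improving both the performance and the generation order of non-autoregressive models, and RWS offers an effective way to do so.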

Abstract: Masked diffusion models (MDMs) offer a promising non-autoregressive alternative for large language modeling. Standard decoding methods for MDMs, such as confidence-based sampling, select tokens independently based on individual token confidences at each diffusion step. However, we observe that this independent token selection often results in generation orders resembling sequential autoregressive processes, limiting the advantages of non-autoregressive modeling. To mitigate this phenomenon, we propose Reward-Weighted Sampling (RWS), a novel decoding strategy that leverages an external reward model to provide a principled global signal during the iterative diffusion process. Specifically, at each diffusion step, RWS evaluates the quality of the entire intermediate sequence and scales token logits accordingly, guiding token selection by integrating global sequence-level coherence. This method selectively increases the confidence of tokens that initially have lower scores, thereby promoting a more non-autoregressive generation order. Furthermore, we provide theoretical justification showing that reward-weighted logit scaling induces beneficial rank reversals in token selection and consistently improves expected reward. Experiments demonstrate that RWS significantly promotes non-autoregressive generation orders, leading to improvements across multiple evaluation metrics. These results highlight the effectiveness of integrating global signals in enhancing both the non-autoregressive properties and overall performance of MDMs.

[92] LegalChainReasoner: A Legal Chain-guided Framework for Criminal Judicial Opinion Generation cs.CL | cs.AI

Weizhe Shi, Qiqi Wang, Yihong Pan, Qian Liu, Kaiqi Zhao

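TL;DR: The paper proposes LegalChainReasoner, a framework that uses structured legal chains to guide the model in generating complete judicial opinions, addressing the inconsistencies that arise when legal reasoning and sentencing prediction are handled separately.

Details

Motivation: Current research splits legal reasoning and sentencing prediction into isolated tasks, which produces inconsistent results and falls short of real judicial requirements. In addition, existing methods that rely on manually curated knowledge are limited in practical deployment.

Result: On two real-world Chinese legal case datasets, LegalChainReasoner outperforms baseline models.

Insight: Introducing legal chains enables joint generation of legal reasoning and sentencing, which the summary describes as a first and which better matches judicial practice.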

Abstract: A criminal judicial opinion represents the judge’s disposition of a case, including the decision rationale and sentencing. Automatically generating such opinions can assist in analyzing sentencing consistency and provide judges with references to similar past cases. However, current research typically approaches this task by dividing it into two isolated subtasks: legal reasoning and sentencing prediction. This separation often leads to inconsistency between the reasoning and predictions, failing to meet real-world judicial requirements. Furthermore, prior studies rely on manually curated knowledge to enhance applicability, yet such methods remain limited in practical deployment. To address these limitations and better align with legal practice, we propose a new LegalAI task: Judicial Opinion Generation, which simultaneously produces both legal reasoning and sentencing decisions. To achieve this, we introduce LegalChainReasoner, a framework that applies structured legal chains to guide the model through comprehensive case assessments. By integrating factual premises, composite legal conditions, and sentencing conclusions, our approach ensures flexible knowledge injection and end-to-end opinion generation. Experiments on two real-world and open-source Chinese legal case datasets demonstrate that our method outperforms baseline models.

[93] EviNote-RAG: Enhancing RAG Models via Answer-Supportive Evidence Notes cs.CL

Yuqin Dai, Guoqing Wang, Yuan Wang, Kairan Dou, Kaichen Zhou

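TL;DR: EviNote-RAG introduces Supportive-Evidence Notes (SENs) and an Evidence Quality Reward (EQR) to improve RAG models on open-domain question answering, addressing retrieval noise and error accumulation during reasoning.

Details

Motivation: The conventional retrieve-then-answer paradigm suffers from noisy retrieval signals and from error accumulation in multi-hop reasoning, both of which hurt open-domain QA performance.

Result: On benchmarks including HotpotQA, Bamboogle, and 2Wiki, EviNote-RAG achieves relative F1 gains of 20% to 91%.

Insight: Structured notes combined with an entailment-based reward can effectively reduce noise and make reasoning more robust.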

Abstract: Large Language Models (LLMs) empowered with retrieval mechanisms have achieved strong progress in open-domain question answering (QA). Yet, the conventional retrieve–then–answer paradigm often suffers from two key limitations: (1) low signal-to-noise ratio in retrieved evidence, where useful information is buried under irrelevant content, and (2) error accumulation in multi-hop reasoning when incomplete or noisy passages are involved. To address these challenges, we present EviNote-RAG, an agentic RAG framework that introduces a structured retrieve–note–answer pipeline. Instead of directly reasoning over raw retrievals, the model is trained to compose Supportive-Evidence Notes (SENs), concise, human-like notes that preserve only answer-relevant information, highlight uncertainty, and explicitly state when no useful evidence exists. This distillation process is further reinforced by the Evidence Quality Reward (EQR), an entailment-based signal that evaluates whether SENs logically support the final answer. Together, SENs and EQR guide the model toward faithful and robust reasoning, while reducing the impact of noise. Experiments on in-domain and out-of-domain QA benchmarks show that EviNote-RAG consistently outperforms strong baselines in accuracy, generalization, and training stability. In particular, it achieves state-of-the-art results while enhancing robustness and efficiency, yielding relative F1 gains of 20% on HotpotQA (+0.093), 40% on Bamboogle (+0.151), and 91% on 2Wiki (+0.256) via denser rewards and reduced verbosity.

[94] RPRO: Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning cs.CL

Chia-Hsuan Hsu, Jun-En Ding, Hsin-Ling Hsu, Feng Liu, Fang-Ming Hung

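TL;DR: The paper proposes RPRO, a new framework that combines reinforcement learning with preference-driven reasoning refinement to improve the clinical reliability of medical question answering and diagnostic reasoning.

Details

Motivation: Existing large language models (LLMs) often generate medical reasoning chains that lack factual accuracy and clinical reliability. The study aims to address this shortcoming by combining reinforcement learning with preference optimization.

Result: On PubMedQA and MedQA-USMLE, RPRO outperforms the baselines; the 1.1B-parameter model even surpasses 7B-13B models, including medical-specialized ones.

Insight: Combining preference optimization with quality-driven reasoning refinement is a scalable and effective way to build more reliable, clinically grounded medical LLMs.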

Abstract: Medical question answering requires advanced reasoning that integrates domain knowledge with logical inference. However, existing large language models (LLMs) often generate reasoning chains that lack factual accuracy and clinical reliability. We propose Ranked Preference Reinforcement Optimization (RPRO), a novel framework that uniquely combines reinforcement learning with preference-driven reasoning refinement to enhance clinical chain-of-thought (CoT) performance. RPRO differentiates itself from prior approaches by employing task-adaptive reasoning templates and a probabilistic evaluation mechanism that aligns outputs with established clinical workflows, while automatically identifying and correcting low-quality reasoning chains. Unlike traditional pairwise preference methods, RPRO introduces a groupwise ranking optimization based on the Bradley-Terry model and incorporates KL-divergence regularization for stable training. Experiments on PubMedQA and MedQA-USMLE show consistent improvements over strong baselines. Remarkably, our 1.1B parameter model outperforms much larger 7B-13B models, including medical-specialized variants. These findings demonstrate that combining preference optimization with quality-driven refinement offers a scalable and effective approach to building more reliable, clinically grounded medical LLMs.
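
The abstract mentions a groupwise ranking objective built on the Bradley-Terry model but does not give its exact form. One common way to extend Bradley-Terry from pairs to ranked groups is the Plackett-Luce likelihood; the sketch below shows that loss as an illustration only. Whether RPRO uses exactly this form is an assumption, the KL-divergence regularizer is omitted, and the function name is hypothetical.

```python
import math

def plackett_luce_nll(scores_best_to_worst):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    `scores_best_to_worst` holds one scalar score per candidate reasoning chain,
    ordered from the most preferred to the least preferred. With exactly two
    candidates this reduces to the Bradley-Terry pairwise loss.
    """
    nll = 0.0
    for i in range(len(scores_best_to_worst) - 1):
        tail = scores_best_to_worst[i:]
        m = max(tail)
        # log of the sum of exp(score) over the candidates still unranked
        log_denominator = m + math.log(sum(math.exp(s - m) for s in tail))
        nll += log_denominator - scores_best_to_worst[i]
    return nll

# Scores that agree with the ranking give a lower loss than scores that contradict it.
consistent = plackett_luce_nll([2.0, 1.0, 0.0])
reversed_order = plackett_luce_nll([0.0, 1.0, 2.0])
pairwise = plackett_luce_nll([1.0, 0.0])  # equals -log(sigmoid(1.0))
```

Minimizing this loss pushes the score of each preferred chain above the scores of all chains ranked below it, which is the groupwise analogue of a pairwise preference loss.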

[95] Performance Analysis of Supervised Machine Learning Algorithms for Text Classification cs.CL

Sadia Zaman Mishu, S M Rafiuddin

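TL;DR: This paper experimentally compares several supervised learning algorithms on text classification, focusing on their classification accuracy on labeled datasets.

Details

Motivation: Text classification is widely used in web search, data mining, recommendation systems, and other areas, so it is useful to understand how different supervised learning algorithms perform on labeled text.

Result: The experiments show that the models differ in classification accuracy, revealing the strengths of certain models in particular settings.

Insight: Choosing a suitable supervised learning algorithm is crucial for text classification, and different models may need to be adjusted to the characteristics of the data.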

Abstract: The demand for text classification is growing significantly in web searching, data mining, web ranking, recommendation systems, and so many other fields of information and technology. This paper illustrates the text classification process on different datasets using some standard supervised machine learning techniques. Text documents can be classified through various kinds of classifiers. Labeled text documents are used to classify the text in supervised classifications. This paper applies these classifiers on different kinds of labeled documents and measures the accuracy of the classifiers. An Artificial Neural Network (ANN) model using Back Propagation Network (BPN) is used with several other models to create an independent platform for labeled and supervised text classification process. An existing benchmark approach is used to analyze the performance of classification using labeled documents. Experimental analysis on real data reveals which model works well in terms of classification accuracy.

[96] Speaking at the Right Level: Literacy-Controlled Counterspeech Generation with RAG-RL cs.CL

Xiaoying Song, Anirban Saha Anik, Dibakar Barua, Pengcheng Luo, Junhua Ding

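TL;DR: The paper proposes a framework based on retrieval-augmented generation (RAG) and reinforcement learning (RL) that generates counterspeech tailored to different health literacy levels, in order to counter health misinformation more effectively.

Details

Motivation: Existing counterspeech generation methods usually ignore differences in the audience's health literacy, so the generated content is poorly targeted and less effective. The paper aims to address this.

Result: Experiments show that Controlled-Literacy outperforms baselines at generating more accessible and user-preferred counterspeech.

Insight: Adapting responses to the audience's health literacy level can make public health communication more equitable and more impactful.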

Abstract: Health misinformation spreading online poses a significant threat to public health. Researchers have explored methods for automatically generating counterspeech to health misinformation as a mitigation strategy. Existing approaches often produce uniform responses, ignoring that the health literacy level of the audience could affect the accessibility and effectiveness of counterspeech. We propose a Controlled-Literacy framework using retrieval-augmented generation (RAG) with reinforcement learning (RL) to generate tailored counterspeech adapted to different health literacy levels. In particular, we retrieve knowledge aligned with specific health literacy levels, enabling accessible and factual information to support generation. We design a reward function incorporating subjective user preferences and objective readability-based rewards to optimize counterspeech to the target health literacy level. Experiment results show that Controlled-Literacy outperforms baselines by generating more accessible and user-preferred counterspeech. This research contributes to more equitable and impactful public health communication by improving the accessibility and comprehension of counterspeech to health misinformation.

[97] Assessing Large Language Models on Islamic Legal Reasoning: Evidence from Inheritance Law Evaluation cs.CL | cs.AI | I.2.6; I.2.7

Abdessalam Bouchekif, Samer Rashwani, Heba Sbahi, Shahd Gaben, Mutez Al-Khatib

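TL;DR: The paper evaluates the knowledge and reasoning abilities of large language models in Islamic inheritance law, revealing a significant performance gap between models and analyzing common error patterns.

Details

Motivation: The study tests how well large language models handle the complexities of Islamic inheritance law (‘ilm al-mawarith), in order to understand their performance and limitations in structured legal reasoning.

Result: o3 and Gemini 2.5 exceed 90% accuracy, whereas ALLaM, Fanar, LLaMA, and Mistral score below 50%, indicating significant differences in reasoning ability and domain adaptation across models.

Insight: Large language models struggle with structured legal reasoning, for example by misunderstanding inheritance scenarios and misapplying legal rules, which points to directions for improving models in Islamic law.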

Abstract: This paper evaluates the knowledge and reasoning capabilities of Large Language Models in Islamic inheritance law, known as ‘ilm al-mawarith. We assess the performance of seven LLMs using a benchmark of 1,000 multiple-choice questions covering diverse inheritance scenarios, designed to test models’ ability to understand the inheritance context and compute the distribution of shares prescribed by Islamic jurisprudence. The results reveal a significant performance gap: o3 and Gemini 2.5 achieved accuracies above 90%, whereas ALLaM, Fanar, LLaMA, and Mistral scored below 50%. These disparities reflect important differences in reasoning ability and domain adaptation. We conduct a detailed error analysis to identify recurring failure patterns across models, including misunderstandings of inheritance scenarios, incorrect application of legal rules, and insufficient domain knowledge. Our findings highlight limitations in handling structured legal reasoning and suggest directions for improving performance in Islamic legal reasoning. Code: https://github.com/bouchekif/inheritance_evaluation

[98] Privacy-Preserving Reasoning with Knowledge-Distilled Parametric Retrieval Augmented Generation cs.CL

Jinwen Chen, Hainan Zhang, Liang Pang, Yongxin Tong, Haibo Zhou

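TL;DR: DistilledPRAG is a knowledge-distilled parametric retrieval-augmented generation (PRAG) method that uses synthetic QA pairs and document-structure alignment to address the high latency and poor generalization of conventional PRAG, enabling efficient, privacy-preserving reasoning.

Details

Motivation: Current RAG systems require uploading plaintext documents to the cloud, which risks privacy leakage. PRAG avoids this by encoding documents as LoRA, but it still suffers from high inference latency and poor generalization.

Result: On four QA datasets, DistilledPRAG outperforms baselines in both accuracy and out-of-distribution (OOD) generalization.

Insight: Knowledge distillation and document-structure alignment are key to improving parametric RAG, and they ease the trade-off between privacy protection and performance.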

Abstract: The current RAG system requires uploading plaintext documents to the cloud, risking private data leakage. Parametric RAG (PRAG) addresses this by encoding documents as LoRA within LLMs, enabling reasoning without exposing raw content. However, it still faces two issues: (1) PRAG demands synthesizing QA pairs and fine-tuning the LLM for each individual document to create its corresponding LoRA, leading to unacceptable inference latency. (2) The performance of PRAG relies solely on synthetic QA data, lacking internal alignment with standard RAG, resulting in poor generalization on out-of-distribution (OOD) inputs. Therefore, achieving high-efficiency parameterization while maintaining RAG-level performance remains a critical challenge for privacy-preserving reasoning. In this paper, we propose DistilledPRAG, a generalizable knowledge-distilled parametric RAG model aligned with standard RAG in document structure and parameter activation. We first synthesize QA pairs from single and multi-documents to enhance cross-document reasoning. Then, we mask the plaintext documents with a special token and translate them to LoRA via a parameter generator, maintaining the standard RAG document structure. Finally, guided by synthetic QA data, we train the parameter generator to match standard RAG’s hidden states and output logits, enabling RAG-style reasoning without original documents. Experiments on four QA datasets show that DistilledPRAG outperforms baselines in accuracy and generalizes well on OOD data.

[99] Dream-Coder 7B: An Open Diffusion Language Model for Code cs.CL

Zhihui Xie, Jiacheng Ye, Lin Zheng, Jiahui Gao, Jingwei Dong

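TL;DR: Dream-Coder 7B is an open-source discrete diffusion language model focused on code generation. With adaptive decoding strategies and a dedicated training recipe, it performs well on multiple benchmarks.

Details

Motivation: Conventional autoregressive models are constrained in code generation, particularly because strict left-to-right decoding is ill-suited to complex tasks. Dream-Coder 7B aims to address this with a diffusion framework and adaptive decoding.

Result: It reaches 21.4% pass@1 on LiveCodeBench and performs competitively on HumanEval and several other benchmarks.

Insight: Combining adaptive decoding with a diffusion framework opens a new research direction for code generation and shows the potential of discrete diffusion models in natural language processing.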

Abstract: We present Dream-Coder 7B, an open-source discrete diffusion language model for code generation that exhibits emergent any-order generation capabilities. Unlike traditional autoregressive (AR) models that decode strictly left-to-right, Dream-Coder 7B adaptively determines its decoding strategy based on the coding task: sketch-first generation for complex algorithms, left-to-right generation for straightforward completions, and interleaved reasoning generation for code understanding tasks. We adapt a pretrained AR checkpoint to a discrete diffusion framework with a continuous-time weighted cross-entropy objective. Our post-training recipe comprises (i) supervised fine-tuning, where we mitigate padding pathologies via random truncation and a padding penalty to improve sample efficiency and stabilize generation; and (ii) reinforcement learning with verifiable rewards over a curated high-quality prompt set drawn from open-source datasets, using a tailored reinforcement learning recipe for diffusion language models. The resulting Dream-Coder 7B Instruct attains 21.4% pass@1 on LiveCodeBench (2410–2505) and demonstrates competitive performance on HumanEval, MBPP, BigCodeBench, and CRUXEval. We release Dream-Coder-7B and Dream-Coder-7B-Instruct checkpoints, training recipes, preprocessing pipelines, and inference code to facilitate reproducibility and further research.

[100] Enhancing Large Language Model for Knowledge Graph Completion via Structure-Aware Alignment-Tuning cs.CL | cs.AI

Yu Liu, Yanan Cao, Xixun Lin, Yanmin Shang, Shi Wang

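TL;DR: The paper proposes SAT, a new framework that enhances large language models (LLMs) on knowledge graph completion (KGC) through structure-aware alignment-tuning. SAT uses hierarchical knowledge alignment and structural instruction tuning to address the mismatch between natural-language and graph-structure representation spaces, and it unifies instruction design across KGC tasks, yielding significant performance gains.

Details

Motivation: Existing LLM-enhanced KGC methods overlook the inconsistency between the representation spaces of natural language and graph structures, and designing separate instructions for each KGC task is inefficient. The authors propose SAT to address both issues.

Result: On two KGC tasks across four benchmark datasets, SAT significantly outperforms existing methods, with link prediction improving by 8.7% to 29.8%.

Insight: Structure-aware alignment and unified instruction design can significantly improve LLM performance on KGC while reducing the cost of developing task-specific instructions.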

Abstract: Knowledge graph completion (KGC) aims to infer new knowledge and make predictions from knowledge graphs. Recently, large language models (LLMs) have exhibited remarkable reasoning capabilities. LLM-enhanced KGC methods primarily focus on designing task-specific instructions, achieving promising advancements. However, there are still two critical challenges. First, existing methods often ignore the inconsistent representation spaces between natural language and graph structures. Second, most approaches design separate instructions for different KGC tasks, leading to duplicated work and time-consuming processes. To address these challenges, we propose SAT, a novel framework that enhances LLMs for KGC via structure-aware alignment-tuning. Specifically, we first introduce hierarchical knowledge alignment to align graph embeddings with the natural language space through multi-task contrastive learning. Then, we propose structural instruction tuning to guide LLMs in performing structure-aware reasoning over KGs, using a unified graph instruction combined with a lightweight knowledge adapter. Experimental results on two KGC tasks across four benchmark datasets demonstrate that SAT significantly outperforms state-of-the-art methods, especially in the link prediction task with improvements ranging from 8.7% to 29.8%.

[101] Modular Techniques for Synthetic Long-Context Data Generation in Language Model Training and Evaluation cs.CL | cs.AI

Seganrasan Subramanian, Abhigya Verma

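TL;DR: This paper proposes a modular framework for generating synthetic long-context data to support language model training and evaluation, filling a gap in high-quality long-context datasets.

Details

Motivation: Current large language models (LLMs) are limited in handling long-context text, mainly because high-quality, diverse, and verifiable long-context datasets are lacking. Advancing model capabilities requires a controllable and scalable way to generate such data.

Result: The framework can generate diverse long-context datasets for LLM training and evaluation, with the aim of advancing long-context capabilities; the abstract reports no quantitative results.

Insight: The work stresses the importance of controllability and scalability in data generation, and notes that the model-agnostic design lets the framework adapt to different training objectives, offering new ideas for future research on long-context tasks.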

Abstract: The ability of large language models (LLMs) to process and reason over long textual inputs is critical for a wide range of real-world applications. However, progress in this area is significantly constrained by the absence of high-quality, diverse, and verifiable long-context datasets suitable for both training and evaluation. This work introduces a modular, extensible framework for synthetic long-context data generation via prompt-based interaction with LLMs. The framework supports multiple training and alignment objectives, including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). It encompasses four core generation paradigms: multi-turn conversational dialogues, document-grounded input-output pairs, verifiable instruction-response tasks, and long-context reasoning examples. Through templated prompting, a model-agnostic architecture, and metadata-enriched outputs, the proposed approach facilitates scalable, controllable, and purpose-aligned dataset creation for advancing long-context capabilities in LLMs.

[102] Mitigating Catastrophic Forgetting in Continual Learning through Model Growth cs.CL

Ege Süalp, Mina Rezaei

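TL;DR: The paper explores whether a model growth strategy can mitigate catastrophic forgetting in continual learning. It finds that a growth-trained model (Stack LLM) retains capabilities better on some tasks but involves trade-offs in handling social bias.

Details

Motivation: Catastrophic forgetting is a major challenge in continual learning: large language models (LLMs) tend to lose knowledge of earlier tasks as they are fine-tuned on new ones. Model growth may offer a remedy.

Result: Both models improve on domain knowledge, while Stack LLM degrades less on reasoning and especially on reading comprehension. In bias evaluation, the baseline LLM becomes progressively more neutral, whereas Stack LLM keeps a steady bias ratio of around 60-61%.

Insight: Model growth can partially mitigate catastrophic forgetting, but its trade-offs on tasks such as social bias need attention and further work.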

Abstract: Catastrophic forgetting is a significant challenge in continual learning, in which a model loses prior knowledge when it is fine-tuned on new tasks. This problem is particularly critical for large language models (LLMs) undergoing continual learning, as retaining performance across diverse domains is important for their general utility. In this paper, we explore model growth, a promising strategy that leverages smaller models to expedite and structure the training of larger ones for mitigating the catastrophic forgetting problem. Although growth-based pretraining, particularly via transformer stacking, has shown promise in accelerating convergence, its impact on forgetting remains under-explored. Therefore, we evaluate whether growth-based models can retain previously learned capabilities more effectively across a sequence of fine-tuning tasks involving domain knowledge, reasoning, reading comprehension, and bias. Our findings show that both models – one trained with growth (Stack LLM) and one without (LLM) – exhibit improvements in domain knowledge. However, reasoning and reading comprehension degrade over time, indicating signs of catastrophic forgetting. Stack LLM consistently shows less degradation, especially in reading comprehension, suggesting enhanced retention capabilities. Interestingly, in bias evaluation, the baseline LLM becomes progressively more neutral with continued fine-tuning, while Stack LLM maintains a steady bias ratio around 60–61%. These results indicate that growth-based pretraining may deliver modest improvements in resisting catastrophic forgetting, though trade-offs remain in handling social biases.
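
As a rough picture of what "growth via transformer stacking" means, the sketch below initializes a deeper model by repeating the layers of a smaller trained one. It is a simplified illustration, not the procedure evaluated in the paper: layers are plain dictionaries, and the function name and `growth_factor` parameter are hypothetical.

```python
import copy

def stack_layers(small_layers, growth_factor=2):
    """Build a deeper layer stack by repeating a smaller model's layers.

    Each repeated layer is a deep copy, so the grown model can be trained
    further without modifying the weights of the source model.
    """
    grown = []
    for _ in range(growth_factor):
        grown.extend(copy.deepcopy(layer) for layer in small_layers)
    return grown

# Toy "layers": each holds a small weight list instead of real tensors.
small_model = [{"name": "block0", "w": [0.1, 0.2]}, {"name": "block1", "w": [0.3, 0.4]}]
grown_model = stack_layers(small_model, growth_factor=2)

# Further training of the grown model leaves the small model untouched.
grown_model[2]["w"][0] = 9.9
```

The grown model starts from the smaller model's learned weights rather than from random initialization, which is the property that growth-based pretraining relies on to speed up convergence.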

[103] DaMoC: Efficiently Selecting the Optimal Large Language Model for Fine-tuning Domain Taks Based on Data and Model Compression cs.CL | cs.AI | cs.LGPDF

Wei Huang, Huang Wei, Yinggui Wang

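TL;DR: DaMoC is a framework based on data and model compression for quickly selecting the large language model (LLM) best suited to fine-tuning on a domain task, substantially reducing training time.

Details

Motivation: Although LLMs perform well on general tasks, they need fine-tuning for domain-specific tasks, and there is currently no efficient way to select the best model to fine-tune.

Result: Experiments on four datasets show that DaMoC selects the optimal LLM efficiently while saving approximately 20-fold in training time.

Insight: Combining data compression with model compression can make model selection and training much more efficient without sacrificing performance.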

Abstract: Large language models (LLMs) excel in general tasks but struggle with domain-specific ones, requiring fine-tuning with specific data. With many open-source LLMs available, selecting the best model to fine-tune for a downstream task is challenging; the core problem is how to quickly identify the optimal LLM. We introduce a Data and Model Compression Framework (DaMoC) that addresses this challenge by: 1) Data Level: A systematic categorization of data filtering methodologies for LLMs is first established, classifying them into three distinct paradigms: (1) distribution-aware methods, (2) quality-aware methods, and (3) hybrid approaches considering both dimensions. Further, we enhance the density of key tokens in the text, achieving token compression. Subsequently, we use an LLM to iteratively rewrite the text to optimize its expression. 2) Model Level: We use layer similarity scores to assess each layer’s importance and remove those with lower importance. Then, we introduce a sparse merging paradigm to preserve as much of the original model’s capability as possible. Extensive experiments on four datasets, medical Q&A, financial Q&A, general Q&A, and reading comprehension, show that we can select the optimal LLM while saving approximately 20-fold in training time.
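
The model-level step scores layers by similarity and removes the least important ones. The abstract does not define the score, so the sketch below assumes one plausible choice: a layer whose output hidden state is nearly parallel to its input (cosine similarity close to 1) changes little and is treated as less important. The function names and toy vectors are hypothetical, and the sparse merging step is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def layer_importance(hidden_states):
    """Score layer i as 1 - cosine(input, output).

    hidden_states[i] is the vector entering layer i and hidden_states[i + 1]
    is the vector leaving it (an assumed scoring rule, for illustration).
    """
    return [
        1.0 - cosine(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]

def layers_to_keep(hidden_states, num_to_remove):
    """Indices of the layers that survive after dropping the least important ones."""
    importance = layer_importance(hidden_states)
    ranked = sorted(range(len(importance)), key=importance.__getitem__)
    removed = set(ranked[:num_to_remove])
    return [i for i in range(len(importance)) if i not in removed]

# Toy hidden states for a 3-layer model: layer 1 leaves its input unchanged.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
kept = layers_to_keep(hidden_states, num_to_remove=1)
```

In practice the hidden states would be averaged over many tokens from a calibration set; a single vector per layer boundary is used here only to keep the example small.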

[104] Rethinking the Chain-of-Thought: The Roles of In-Context Learning and Pre-trained Priors cs.CL | cs.AI

Hao Yang, Zhiyu Yang, Yunjie Zhang, Shanyi Zhu, Lin Yang

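TL;DR: This paper examines how Chain-of-Thought reasoning works, focusing on the dual roles of in-context learning and pretrained priors. It finds that the model relies on pretrained priors but can also adjust its decisions based on in-context signals.

Details

Motivation: Chain-of-Thought reasoning plays an important role in strengthening model reasoning, but its mechanism remains unclear. The paper aims to explain it from the perspective of the relationship between in-context learning and pretrained priors.

Result: The experiments show that (1) the model quickly learns reasoning structure and logical patterns but relies heavily on pretrained priors; (2) sufficient exemplars shift the model toward in-context signals, while misleading prompts introduce instability; and (3) long Chain-of-Thought prompting improves downstream task performance.

Insight: The success of Chain-of-Thought reasoning depends not only on pretrained priors but also on suitable in-context signals. Long-chain prompting can further elicit deeper reasoning from the model.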

Abstract: Chain-of-Thought reasoning has emerged as a pivotal methodology for enhancing model inference capabilities. Despite growing interest in Chain-of-Thought reasoning, its underlying mechanisms remain unclear. This paper explores the working mechanisms of Chain-of-Thought reasoning from the perspective of the dual relationship between in-context learning and pretrained priors. We first conduct a fine-grained lexical-level analysis of rationales to examine the model’s reasoning behavior. Then, by incrementally introducing noisy exemplars, we examine how the model balances pretrained priors against erroneous in-context information. Finally, we investigate whether prompt engineering can induce slow thinking in large language models. Our extensive experiments reveal three key findings: (1) The model not only quickly learns the reasoning structure at the lexical level but also grasps deeper logical reasoning patterns, yet it heavily relies on pretrained priors. (2) Providing sufficient exemplars shifts the model’s decision-making from pretrained priors to in-context signals, while misleading prompts introduce instability. (3) Long Chain-of-Thought prompting can induce the model to generate longer reasoning chains, thereby improving its performance on downstream tasks.

[105] TableZoomer: A Collaborative Agent Framework for Large-scale Table Question Answering cs.CL

Sishi Xiong, Ziyang He, Zhongjiang He, Yu Zhao, Changzai Pan

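TL;DR: The paper proposes TableZoomer, a new LLM-powered, programming-based agent framework for large-scale table question answering. It addresses limitations of existing methods through structured table schemas, a query-aware zooming mechanism, and a Program-of-Thoughts strategy.

Details

Motivation: Existing large language models face structural heterogeneity, difficulty localizing target data, and complex-reasoning bottlenecks in table question answering, which hinders their use in industrial settings.

Result: With Qwen3-8B-Instruct, TableZoomer improves accuracy over conventional PoT methods by 19.34% on DataBench and by 25% on the Fact Checking task of TableBench.

Insight: Programmatic and iterative reasoning can effectively improve the performance, scalability, and practicality of LLMs on table question answering.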

Abstract: While large language models (LLMs) have shown promise in the table question answering (TQA) task through prompt engineering, they face challenges in industrial applications, including structural heterogeneity, difficulties in target data localization, and bottlenecks in complex reasoning. To address these limitations, this paper presents TableZoomer, a novel LLM-powered, programming-based agent framework. It introduces three key innovations: (1) replacing the original fully verbalized table with structured table schema to bridge the semantic gap and reduce computational complexity; (2) a query-aware table zooming mechanism that dynamically generates sub-table schema through column selection and entity linking, significantly improving target localization efficiency; and (3) a Program-of-Thoughts (PoT) strategy that transforms queries into executable code to mitigate numerical hallucination. Additionally, we integrate the reasoning workflow with the ReAct paradigm to enable iterative reasoning. Extensive experiments demonstrate that our framework maintains the usability advantages while substantially enhancing performance and scalability across tables of varying scales. When implemented with the Qwen3-8B-Instruct LLM, TableZoomer achieves accuracy improvements of 19.34% and 25% over conventional PoT methods on the large-scale DataBench dataset and the small-scale Fact Checking task of TableBench dataset, respectively.
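
The three ideas in the abstract (a schema instead of a verbalized table, query-aware zooming, and Program-of-Thoughts) can be pictured with the toy sketch below. It is not the TableZoomer implementation: keyword matching stands in for LLM-based column selection and entity linking, a hand-written function stands in for LLM-generated code, and the table, query, and all names are hypothetical.

```python
def table_schema(rows):
    """Describe each column by name, type, and a few example values,
    instead of serializing every cell of the table into the prompt."""
    schema = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        schema[column] = {"type": type(values[0]).__name__, "examples": values[:2]}
    return schema

def zoom(schema, query):
    """Keep only the columns named in the query (stand-in for LLM-based selection)."""
    words = query.lower().split()
    return {name: info for name, info in schema.items() if name.lower() in words}

rows = [
    {"city": "Oslo", "year": 2023, "sales": 120.0},
    {"city": "Lima", "year": 2023, "sales": 80.5},
    {"city": "Oslo", "year": 2024, "sales": 150.0},
]
query = "total sales per city"
sub_schema = zoom(table_schema(rows), query)

def generated_program(table_rows):
    """Stand-in for the code an LLM would write from the query and sub-schema.
    Executing code, rather than asking the model to add numbers in text,
    is what avoids arithmetic slips."""
    totals = {}
    for row in table_rows:
        totals[row["city"]] = totals.get(row["city"], 0.0) + row["sales"]
    return totals

answer = generated_program(rows)
```

The prompt size in this scheme depends on the number of relevant columns rather than the number of rows, which is why the approach scales to large tables.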

[106] LongCat-Flash Technical Report cs.CL | cs.AI | cs.DC | cs.LG

Meituan LongCat Team, Bayan, Bei Li, Bingye Lei, Bo Wang

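TL;DR: LongCat-Flash is a 560-billion-parameter Mixture-of-Experts (MoE) language model that achieves efficient training and inference through dynamic computation budget allocation and an enlarged computation-communication overlap window. It was trained on more than 20 trillion tokens within 30 days and reaches over 100 tokens per second at inference.

Details

Motivation: The work targets the computational efficiency and resource optimization challenges of large-scale language models, while also improving the model's agentic capabilities.

Result: Training on more than 20 trillion tokens was completed within 30 days. Inference runs at over 100 tokens per second at a cost of $0.70 per million output tokens, and the model is particularly strong on agentic tasks.

Insight: Dynamic resource allocation and communication optimization enable efficient training and inference for large-scale language models while improving performance on complex tasks.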

Abstract: We introduce LongCat-Flash, a 560-billion-parameter Mixture-of-Experts (MoE) language model designed for both computational efficiency and advanced agentic capabilities. Stemming from the need for scalable efficiency, LongCat-Flash adopts two novel designs: (a) Zero-computation Experts, which enables dynamic computational budget allocation and activates 18.6B-31.3B (27B on average) per token depending on contextual demands, optimizing resource usage. (b) Shortcut-connected MoE, which enlarges the computation-communication overlap window, demonstrating notable gains in inference efficiency and throughput compared to models of a comparable scale. We develop a comprehensive scaling framework for large models that combines hyperparameter transfer, model-growth initialization, a multi-pronged stability suite, and deterministic computation to achieve stable and reproducible training. Notably, leveraging the synergy among scalable architectural design and infrastructure efforts, we complete model training on more than 20 trillion tokens within 30 days, while achieving over 100 tokens per second (TPS) for inference at a cost of $0.70 per million output tokens. To cultivate LongCat-Flash towards agentic intelligence, we conduct a large-scale pre-training on optimized mixtures, followed by targeted mid- and post-training on reasoning, code, and instructions, with further augmentation from synthetic data and tool use tasks. Comprehensive evaluations demonstrate that, as a non-thinking foundation model, LongCat-Flash delivers highly competitive performance among other leading models, with exceptional strengths in agentic tasks. The model checkpoint of LongCat-Flash is open-sourced to foster community research. LongCat Chat: https://longcat.ai Hugging Face: https://huggingface.co/meituan-longcat GitHub: https://github.com/meituan-longcat
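
The headline figures can be turned into rough rates with a few lines of arithmetic. This is a back-of-the-envelope sketch: it treats "more than 20 trillion tokens within 30 days" as exactly 20 trillion over exactly 30 days, so the training rate is a lower bound, and it assumes the 100 tokens-per-second figure refers to a single decoding stream, which the abstract does not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Training: aggregate throughput implied by 20T tokens in 30 days (lower bound).
training_tokens = 20e12
training_days = 30
training_tokens_per_second = training_tokens / (training_days * SECONDS_PER_DAY)

# Inference: time for one stream to emit a million tokens at 100 tokens/s (assumed per stream).
decode_tokens_per_second = 100
price_per_million_output_tokens = 0.70  # USD, as quoted in the abstract
seconds_per_million_tokens = 1_000_000 / decode_tokens_per_second
hours_per_million_tokens = seconds_per_million_tokens / 3600

print(f"training throughput: at least {training_tokens_per_second:,.0f} tokens/s")
print(f"one stream needs about {hours_per_million_tokens:.2f} h per million output tokens")
```

Under these assumptions the training run sustains several million tokens per second across the cluster, and a single stream would take a few hours to produce the million tokens priced at $0.70.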

[107] KoBLEX: Open Legal Question Answering with Multi-hop Reasoning cs.CL

Jihyung Lee, Daehui Kim, Seonjeong Hwang, Hyounghun Kim, Gary Lee

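TL;DR: KoBLEX is a Korean benchmark for evaluating multi-hop reasoning in the legal domain, containing 226 scenario-based QA instances. The authors propose ParSeR, which uses parametric legal provisions to guide reliable answer generation, together with LF-Eval for automatically assessing the legal fidelity of answers. Experiments show that ParSeR substantially outperforms baselines.

Details

Motivation: Existing legal benchmarks do not adequately evaluate open-ended, provision-grounded question answering, so a new benchmark and method are needed to support multi-hop legal reasoning and reliable answer generation.

Result: ParSeR achieves the best results across multiple LLMs. Compared to standard retrieval with GPT-4o, it improves F1 by 37.91 and LF-Eval by 30.81. Further analysis shows that ParSeR performs consistently across reasoning depths.

Insight: 1. Parametric provisions effectively guide multi-hop reasoning in the legal domain; 2. An automatic metric such as LF-Eval allows efficient verification of the legal fidelity of answers; 3. A hybrid pipeline of LLMs and human experts can improve the quality of legal-domain data.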

Abstract: Large Language Models (LLMs) have achieved remarkable performance in general domains and are now extending into the expert domain of law. Several benchmarks have been proposed to evaluate LLMs’ legal capabilities. However, these benchmarks fail to evaluate open-ended and provision-grounded Question Answering (QA). To address this, we introduce a Korean Benchmark for Legal EXplainable QA (KoBLEX), designed to evaluate provision-grounded, multi-hop legal reasoning. KoBLEX includes 226 scenario-based QA instances and their supporting provisions, created using a hybrid LLM-human expert pipeline. We also propose a method called Parametric provision-guided Selection Retrieval (ParSeR), which uses LLM-generated parametric provisions to guide legally grounded and reliable answers. ParSeR facilitates multi-hop reasoning on complex legal questions by generating parametric provisions and employing a three-stage sequential retrieval process. Furthermore, to better evaluate the legal fidelity of the generated answers, we propose Legal Fidelity Evaluation (LF-Eval). LF-Eval is an automatic metric that jointly considers the question, answer, and supporting provisions and shows a high correlation with human judgments. Experimental results show that ParSeR consistently outperforms strong baselines, achieving the best results across multiple LLMs. Notably, compared to standard retrieval with GPT-4o, ParSeR achieves +37.91 higher F1 and +30.81 higher LF-Eval. Further analyses reveal that ParSeR efficiently delivers consistent performance across reasoning depths, with ablations confirming the effectiveness of ParSeR.

[108] Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic cs.CL

Mohammad Zbeeb, Hasan Abed Al Kader Hammoud, Bernard Ghanem

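TL;DR: This paper proposes extracting and transferring the reasoning ability of large language models via task vectors, improving model performance without costly training.

Details

Motivation: Large language models usually need costly optimization, such as reinforcement learning, to master complex reasoning tasks. The authors aim to extract and reuse the reasoning ability of existing models, reducing the cost of repeated computation.

Result: The reasoning vector improves performance on several reasoning benchmarks (e.g., GSM8K +4.9%, HumanEval +4.3%, SciQ +1.7%), and the gains persist under adversarial conditions. Subtracting the vector causes significant performance degradation.

Insight: Reasoning ability can be extracted as a vector and transferred through arithmetic operations, offering a low-cost and efficient way to improve models.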

Abstract: Large language models often require costly optimization, such as reinforcement learning, to master complex reasoning tasks. This work demonstrates that reasoning ability, once learned, can be extracted and transferred between models as a compact task vector. We source two publicly available, identically initialized Qwen2.5 models, one fine-tuned with supervised fine-tuning (SFT) and the other with group relative policy optimization (GRPO) on the same dataset. From these, we extract a reasoning vector: $v_{\text{reason}} = \theta_{\text{GRPO}} - \theta_{\text{SFT}}$. We hypothesize that this vector captures the reasoning capability instilled by reinforcement learning while factoring out shared knowledge from the SFT process. When added to compatible instruction-tuned models through simple arithmetic, this vector consistently improves performance across diverse reasoning benchmarks: GSM8K (+4.9%), HumanEval (+4.3%), SciQ (+1.7%), and BigBenchHard (+12.3% for the 1.5B model). The performance improvements persist under adversarial conditions. Conversely, subtracting the vector causes significant performance degradation (-11.8% on GSM8K), demonstrating the vector’s strong contribution to the model’s reasoning abilities. This work shows how reasoning capabilities, typically developed through expensive training, can be extracted from existing open-source models and reused through simple tensor arithmetic, offering a practical way to enhance models by recycling prior computational investments.
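
The task arithmetic in the abstract, $v_{\text{reason}} = \theta_{\text{GRPO}} - \theta_{\text{SFT}}$, amounts to elementwise subtraction and addition over matching parameter tensors. The sketch below shows it on toy weights stored as plain lists. The function names and the `alpha` scaling parameter are illustrative assumptions; real checkpoints would use tensors with identical names and shapes.

```python
def task_vector(theta_grpo, theta_sft):
    """Elementwise difference theta_GRPO - theta_SFT for every named parameter."""
    return {
        name: [g - s for g, s in zip(theta_grpo[name], theta_sft[name])]
        for name in theta_sft
    }

def apply_vector(theta_target, vector, alpha=1.0):
    """Add alpha * vector to a compatible model; alpha = -1.0 subtracts it."""
    return {
        name: [w + alpha * d for w, d in zip(theta_target[name], vector[name])]
        for name in theta_target
    }

# Toy checkpoints with a single parameter "w" (hypothetical values).
theta_sft = {"w": [1.0, 2.0]}
theta_grpo = {"w": [1.5, 1.75]}
theta_instruct = {"w": [0.0, 4.0]}  # a compatible instruction-tuned model

v_reason = task_vector(theta_grpo, theta_sft)
enhanced = apply_vector(theta_instruct, v_reason)
ablated = apply_vector(theta_instruct, v_reason, alpha=-1.0)
```

The operation only makes sense when the source and target models share an architecture and initialization, which is why the abstract stresses identically initialized checkpoints and "compatible" target models.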

[109] WATCHED: A Web AI Agent Tool for Combating Hate Speech by Expanding Data cs.CL

Paloma Piot, Diego Sánchez, Javier Parapar

TL;DR: 论文提出了一种名为WATCHED的AI助手工具，旨在通过扩展数据和结合大型语言模型与专有工具，帮助内容审核员识别和解释仇恨言论，以提升在线平台的安全性和信任度。

Details

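Motivation: Online hate speech is a serious threat to user safety and to trust in social platforms. Tools are needed that combine the speed of automated systems with human judgment, and that not only detect harmful content but also explain their decisions clearly.

Result: Experiments show the method reaches a macro F1 score of 0.91, surpassing existing state-of-the-art methods.

Insight: Collaboration between AI and humans shows promise for content moderation: it can improve efficiency while also increasing transparency and trust.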

Abstract: Online harms are a growing problem in digital spaces, putting user safety at risk and reducing trust in social media platforms. One of the most persistent forms of harm is hate speech. To address this, we need tools that combine the speed and scale of automated systems with the judgment and insight of human moderators. These tools should not only find harmful content but also explain their decisions clearly, helping to build trust and understanding. In this paper, we present WATCHED, a chatbot designed to support content moderators in tackling hate speech. The chatbot is built as an Artificial Intelligence Agent system that uses Large Language Models along with several specialised tools. It compares new posts with real examples of hate speech and neutral content, uses a BERT-based classifier to help flag harmful messages, looks up slang and informal language using sources like Urban Dictionary, generates chain-of-thought reasoning, and checks platform guidelines to explain and support its decisions. This combination allows the chatbot not only to detect hate speech but to explain why content is considered harmful, grounded in both precedent and policy. Experimental results show that our proposed method surpasses existing state-of-the-art methods, reaching a macro F1 score of 0.91. Designed for moderators, safety teams, and researchers, the tool helps reduce online harms by supporting collaboration between AI and human oversight.
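For readers unfamiliar with the reported metric, macro F1 is the unweighted mean of the per-class F1 scores, so a rare class counts as much as a common one. The sketch below shows the computation on made-up labels; the function name `macro_f1` and the example data are assumptions for illustration and have no connection to the paper's evaluation set.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every class seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")

    classes = sorted(set(y_true) | set(y_pred))
    scores: List[float] = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for illustration only.
y_true = ["hate", "hate", "neutral", "neutral"]
y_pred = ["hate", "neutral", "neutral", "neutral"]
score = macro_f1(y_true, y_pred)  # F1(hate) = 2/3, F1(neutral) = 4/5
```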

[110] ABCD-LINK: Annotation Bootstrapping for Cross-Document Fine-Grained Links cs.CL | cs.IR | cs.LG

Serwar Basch, Ilia Kuznetsov, Tom Hope, Iryna Gurevych

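TL;DR: This paper proposes ABCD-LINK, a new framework for annotating fine-grained cross-document links. By generating semi-synthetic datasets and combining retrieval models with LLMs, it improves the precision of link annotation and applies across domains.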

Details

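Motivation: Understanding fine-grained relations between documents is crucial in many application domains, but the lack of efficient methods for creating training and evaluation datasets has limited research on automated assistance.

Result: In the peer review and news domains, combining retrieval models with LLMs achieves 78% link approval from human raters, more than doubling the precision of retrieval models alone.

Insight: Combining retrieval models with LLMs performs well on cross-document linking, and the framework and its datasets provide a new research basis and data for cross-document understanding tasks.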

Abstract: Understanding fine-grained relations between documents is crucial for many application domains. However, the study of automated assistance is limited by the lack of efficient methods to create training and evaluation datasets of cross-document links. To address this, we introduce a new domain-agnostic framework for selecting a best-performing approach and annotating cross-document links in a new domain from scratch. We first generate and validate semi-synthetic datasets of interconnected documents. This data is used to perform automatic evaluation, producing a shortlist of best-performing linking approaches. These approaches are then used in an extensive human evaluation study, yielding performance estimates on natural text pairs. We apply our framework in two distinct domains – peer review and news – and show that combining retrieval models with LLMs achieves 78% link approval from human raters, more than doubling the precision of strong retrievers alone. Our framework enables systematic study of cross-document understanding across application scenarios, and the resulting novel datasets lay foundation for numerous cross-document tasks like media framing and peer review. We make the code, data, and annotation protocols openly available.

[111] LLMs cannot spot math errors, even when allowed to peek into the solution cs.CL | cs.AI

KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar

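TL;DR: The paper studies how large language models (LLMs) perform at detecting math errors and finds that, even when given the reference solution, they struggle to locate the first error in a student solution. It proposes generating an intermediate corrected solution to improve performance.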

Details

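Motivation: Although LLMs perform well on math word problems, they still struggle with meta-reasoning tasks such as identifying errors in student solutions. The paper investigates the limitations of LLMs in locating errors and how to address them.

Result: Experiments show that state-of-the-art LLMs struggle to locate the first error step even with access to the reference solution; the proposed intermediate corrected solution improves performance.

Insight: The meta-reasoning ability of LLMs still has room to improve, and generating an intermediate corrected solution partly mitigates the problem.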

Abstract: Large language models (LLMs) demonstrate remarkable performance on math word problems, yet they have been shown to struggle with meta-reasoning tasks such as identifying errors in student solutions. In this work, we investigate the challenge of locating the first error step in stepwise solutions using two error reasoning datasets: VtG and PRM800K. Our experiments show that state-of-the-art LLMs struggle to locate the first error step in student solutions even when given access to the reference solution. To that end, we propose an approach that generates an intermediate corrected student solution, aligning more closely with the original student’s solution, which helps improve performance.

[112] Vis-CoT: A Human-in-the-Loop Framework for Interactive Visualization and Intervention in LLM Chain-of-Thought Reasoning cs.CL | 68T07, 68T50, 68T05 | I.2.7; I.2.6; I.2.8; H.5.2

Kaviraj Pather, Elena Hadjigeorgiou, Arben Krasniqi, Claire Schmit, Irina Rusu

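TL;DR: Vis-CoT is an interactive visualization framework that converts chain-of-thought (CoT) text into an interactive reasoning graph, letting users intervene in and refine an LLM's reasoning, with significant gains in accuracy and trust.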

Details

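Motivation: LLM reasoning is opaque, which makes verification and debugging difficult in high-stakes settings. Vis-CoT adds a human-in-the-loop mechanism to make reasoning more controllable and reliable.

Result: On GSM8K and StrategyQA, final-answer accuracy improves by up to 24 percentage points over non-interactive baselines; a user study shows large gains in perceived usability and trust.

Insight: Human-in-the-loop intervention can effectively improve LLM reasoning, and visualization tools make the model more transparent and controllable.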

Abstract: Large language models (LLMs) show strong reasoning via chain-of-thought (CoT) prompting, but the process is opaque, which makes verification, debugging, and control difficult in high-stakes settings. We present Vis-CoT, a human-in-the-loop framework that converts linear CoT text into an interactive reasoning graph. Users can visualize the logical flow, identify flawed steps, and intervene by pruning incorrect paths and grafting new, user-defined premises. This shifts interaction from passive observation to active collaboration, steering models toward more accurate and trustworthy conclusions. Across GSM8K and StrategyQA, Vis-CoT improves final-answer accuracy by up to 24 percentage points over non-interactive baselines. A user study also shows large gains in perceived usability and trust. Vis-CoT points to a practical path for more reliable, understandable, and collaborative reasoning by combining LLMs with targeted human oversight.
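The prune and graft operations named in the abstract can be pictured with a small tree structure. The following is a rough sketch under stated assumptions, not the Vis-CoT implementation: the `Step` and `ReasoningGraph` classes, their method names, and the toy chain of thought are all hypothetical, and a linear CoT is simply treated as a chain in which each step depends on the previous one.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Step:
    text: str
    parent: Optional[int] = None
    children: List[int] = field(default_factory=list)


class ReasoningGraph:
    """Hypothetical reasoning tree: each step depends on one parent step."""

    def __init__(self) -> None:
        self.steps: Dict[int, Step] = {}
        self._next_id = 0

    def add_step(self, text: str, parent: Optional[int] = None) -> int:
        if parent is not None and parent not in self.steps:
            raise KeyError(f"unknown parent step {parent}")
        step_id = self._next_id
        self._next_id += 1
        self.steps[step_id] = Step(text, parent)
        if parent is not None:
            self.steps[parent].children.append(step_id)
        return step_id

    @classmethod
    def from_linear_cot(cls, lines: List[str]) -> "ReasoningGraph":
        """Turn linear CoT text into a chain of dependent steps."""
        graph = cls()
        parent: Optional[int] = None
        for line in lines:
            parent = graph.add_step(line, parent)
        return graph

    def prune(self, step_id: int) -> None:
        """Remove a flawed step and every step derived from it."""
        step = self.steps[step_id]
        for child in list(step.children):
            self.prune(child)
        if step.parent is not None:
            self.steps[step.parent].children.remove(step_id)
        del self.steps[step_id]

    def graft(self, parent: int, premise: str) -> int:
        """Attach a user-defined premise beneath an existing step."""
        return self.add_step(premise, parent)

    def path_to(self, step_id: int) -> List[str]:
        """Return the reasoning chain from the root down to step_id."""
        path: List[str] = []
        current: Optional[int] = step_id
        while current is not None:
            path.append(self.steps[current].text)
            current = self.steps[current].parent
        return list(reversed(path))


# Toy chain of thought with an arithmetic slip in the second step.
cot = [
    "Tom has 3 apples.",
    "He buys 2 more, so he has 6.",
    "He eats 1, leaving 5.",
]
graph = ReasoningGraph.from_linear_cot(cot)
graph.prune(1)  # drops the flawed step and the step that depends on it
fixed = graph.graft(0, "He buys 2 more, so he has 5.")
corrected_path = graph.path_to(fixed)
```

In an interactive setting, a path such as `corrected_path` would be the context handed back to the model so that it continues reasoning from the corrected premise.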

[113] On the Alignment of Large Language Models with Global Human Opinion cs.CL

Yang Liu, Masahiro Kaneko, Chenhui Chu

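TL;DR: The paper studies how large language models (LLMs) align with human opinions across countries, languages, and historical periods worldwide, filling a gap in existing research.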

Details

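Motivation: Existing work on LLM opinion alignment focuses on demographic groups in the United States or a few other countries. It lacks worldwide country samples and analysis across historical periods, and it overlooks the influence of prompt language on alignment.

Result: LLMs align appropriately or over-align with the opinions of only a few countries and under-align with most; changing the prompt language to match the questionnaire language steers alignment toward the corresponding country more effectively than existing steering methods.

Insight: The choice of prompt language strongly affects LLM opinion alignment, and models are more aligned with the opinions of contemporary populations.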

Abstract: Today’s large language models (LLMs) are capable of supporting multilingual scenarios, allowing users to interact with LLMs in their native languages. When LLMs respond to subjective questions posed by users, they are expected to align with the views of specific demographic groups or historical periods, shaped by the language in which the user interacts with the model. Existing studies mainly focus on researching the opinions represented by LLMs among demographic groups in the United States or a few countries, lacking worldwide country samples and studies on human opinions in different historical periods, as well as lacking discussion on using language to steer LLMs. Moreover, they also overlook the potential influence of prompt language on the alignment of LLMs’ opinions. In this study, our goal is to fill these gaps. To this end, we create an evaluation framework based on the World Values Survey (WVS) to systematically assess the alignment of LLMs with human opinions across different countries, languages, and historical periods around the world. We find that LLMs appropriately or over-align the opinions with only a few countries while under-aligning the opinions with most countries. Furthermore, changing the language of the prompt to match the language used in the questionnaire can effectively steer LLMs to align with the opinions of the corresponding country more effectively than existing steering methods. At the same time, LLMs are more aligned with the opinions of the contemporary population. To our knowledge, our study is the first comprehensive investigation of the topic of opinion alignment in LLMs across global, language, and temporal dimensions. Our code and data are publicly available at https://github.com/nlply/global-opinion-alignment.
