cs.CV [Total: 61]
cs.CL [Total: 12]
eess.IV [Total: 1]
eess.AS [Total: 1]
q-fin.CP [Total: 1]
cs.RO [Total: 1]
cs.LG [Total: 9]
cs.CR [Total: 1]
cs.GR [Total: 1]
cs.AI [Total: 5]
cs.SD [Total: 1]

cs.CV

[1] Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges cs.CV | cs.AI

Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma

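TL;DR: This paper surveys methods that use large language models (LLMs) to detect crashes in video data. It summarizes fusion strategies, key datasets, model architectures, and performance comparisons, and discusses challenges and opportunities.

Details

Motivation: Crash detection is an important problem in intelligent transportation systems. Recent progress in LLMs and vision-language models (VLMs) offers new approaches to processing multimodal information.

Result: The survey finds that LLMs perform well at multimodal information processing, but challenges such as data scarcity and computational complexity remain.

Insight: LLMs hold considerable promise for multimodal tasks, but data and computational-efficiency problems must be solved before they can be applied more widely.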

Abstract: Crash detection from video feeds is a critical problem in intelligent transportation systems. Recent developments in large language models (LLMs) and vision-language models (VLMs) have transformed how we process, reason about, and summarize multimodal information. This paper surveys recent methods leveraging LLMs for crash detection from video data. We present a structured taxonomy of fusion strategies, summarize key datasets, analyze model architectures, compare performance benchmarks, and discuss ongoing challenges and opportunities. Our review provides a foundation for future research in this fast-growing intersection of video understanding and foundation models.

[2] Underwater Monocular Metric Depth Estimation: Real-World Benchmarks and Synthetic Fine-Tuning cs.CV

Zijie Cai, Christopher Metzler

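TL;DR: This paper presents a comprehensive benchmark for underwater monocular depth estimation and tests existing models in zero-shot and fine-tuned settings. Fine-tuning Depth Anything V2 on synthetic underwater data significantly improves its performance in real underwater scenes.

Details

Motivation: The complexity of underwater environments (light attenuation, scattering, and color distortion) and the lack of high-quality ground-truth data cause existing monocular depth estimation models to perform poorly underwater. This paper aims to address that problem.

Result: The fine-tuned model outperforms the baselines on all benchmarks, particularly in real underwater scenes, which validates domain adaptation and fine-tuning on synthetic data.

Insight: Underwater depth estimation needs domain-specific data augmentation and fine-tuning strategies, and synthetic data can effectively compensate for scarce real data. Scale-aware supervision is essential for robustness and generalization.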

Abstract: Monocular depth estimation has recently advanced to provide not only relative but also metric depth predictions. However, its reliability in underwater environments remains limited due to light attenuation and scattering, color distortion, turbidity, and the lack of high-quality metric ground-truth data. In this paper, we present a comprehensive benchmark of zero-shot and fine-tuned monocular metric depth estimation models on real-world underwater datasets with metric depth annotations, such as FLSea and SQUID. We evaluate a diverse set of state-of-the-art models across a range of underwater conditions with different ranges. Our results show that large-scale models trained on terrestrial (real or synthetic) data, while effective in in-air settings, perform poorly underwater due to significant domain shifts. To address this, we fine-tune Depth Anything V2 with a ViT-S backbone encoder on a synthetic underwater variant of the Hypersim dataset, which we generated using a physically based underwater image formation model. We demonstrate our fine-tuned model consistently improves performance across all benchmarks and outperforms baselines trained only on the clean in-air Hypersim dataset. Our study provides a detailed evaluation and visualization for monocular metric depth estimation in underwater scenes, highlighting the importance of domain adaptation and scale-aware supervision for achieving robust and generalizable metric depth predictions in challenging underwater environments for future research.
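
The abstract does not spell out its physically based underwater image formation model. As a rough illustration only, the sketch below uses the widely cited simplified attenuation-plus-backscatter form, I = J * exp(-beta * d) + B * (1 - exp(-beta * d)), applied per color channel. The function name, the coefficients, and the pixel values are all hypothetical, and the model the authors used may differ.

```python
import math

def underwater_pixel(clean_rgb, depth_m, beta, backscatter_rgb):
    """Apply a simplified attenuation-plus-backscatter model to one RGB pixel.

    Per channel c: I_c = J_c * exp(-beta_c * d) + B_c * (1 - exp(-beta_c * d)).
    This is an assumed textbook form, not necessarily the paper's model.
    """
    out = []
    for j, b, veil in zip(clean_rgb, beta, backscatter_rgb):
        t = math.exp(-b * depth_m)  # transmission for this channel
        out.append(j * t + veil * (1.0 - t))
    return tuple(out)

# Hypothetical coefficients: red is attenuated fastest in water.
BETA = (0.60, 0.12, 0.08)
BACKSCATTER = (0.05, 0.35, 0.45)
CLEAN = (0.8, 0.5, 0.3)

near = underwater_pixel(CLEAN, 0.5, BETA, BACKSCATTER)
far = underwater_pixel(CLEAN, 10.0, BETA, BACKSCATTER)
```

With these assumed values, the distant pixel drifts toward the backscatter color, which is the kind of depth-dependent shift a synthetic underwater variant of an in-air dataset would need to reproduce.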

[3] ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning cs.CV | cs.AI | cs.CL

Xiao Wang, Jingtao Jiang, Qiang Chen, Lan Chen, Lin Zhu

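TL;DR: ESTR-CoT is an event-stream-based scene text recognition framework that uses chain-of-thought reasoning to improve interpretability and accuracy. It combines a vision encoder with a large language model to generate both the answer and the reasoning process, and is validated on three benchmark datasets.

Details

Motivation: Existing event-stream scene text recognition methods lack interpretability and contextual logical reasoning. The authors aim to address these issues with chain-of-thought reasoning.

Result: Experiments on three benchmark datasets (EventSTR, WordArt*, and IC15*) validate the effectiveness and interpretability of ESTR-CoT.

Insight: Chain-of-thought reasoning can significantly improve the interpretability and logical reasoning of event-stream scene text recognition, and large-scale datasets are essential for model training.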

Abstract: Event stream based scene text recognition is an emerging research topic. Event cameras perform better than the widely used RGB cameras in extremely challenging scenarios, especially low illumination and fast motion. Existing works adopt either an end-to-end encoder-decoder framework or large language models for enhanced recognition; however, they are still limited by insufficient interpretability and weak contextual logical reasoning. In this work, we propose a novel chain-of-thought reasoning based event stream scene text recognition framework, termed ESTR-CoT. Specifically, we first adopt the vision encoder EVA-CLIP (ViT-G/14) to transform the input event stream into tokens and utilize a Llama tokenizer to encode the given generation prompt. A Q-former is used to align the vision tokens to the pre-trained large language model Vicuna-7B, which outputs both the answer and the chain-of-thought (CoT) reasoning process simultaneously. Our framework can be optimized using supervised fine-tuning in an end-to-end manner. In addition, we also propose a large-scale CoT dataset to train our framework via a three-stage process (i.e., generation, polish, and expert verification). This dataset provides a solid data foundation for the development of subsequent reasoning-based large models. Extensive experiments on three event stream STR benchmark datasets (i.e., EventSTR, WordArt*, IC15*) validate the effectiveness and interpretability of our proposed framework. The source code and pre-trained models will be released on https://github.com/Event-AHU/ESTR-CoT.

[4] Team RAS in 9th ABAW Competition: Multimodal Compound Expression Recognition Approach cs.CV

Elena Ryumina, Maxim Markitantov, Alexandr Axyonov, Dmitry Ryumin, Mikhail Dolgushin

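TL;DR: This paper proposes a zero-shot multimodal approach to compound expression recognition (CER) that combines six heterogeneous modalities and, through dynamic weighting and interpretable modules, achieves performance comparable to supervised methods.

Details

Motivation: CER matters in affective computing, but traditional methods rely on task-specific training data. This paper uses a zero-shot approach to reduce the need for domain-adaptation data.

Result: In zero-shot testing, the approach reaches F1 scores of 46.95% on AffWild2, 49.02% on AFEW, and 34.85% on C-EXPR-DB, comparable to supervised methods.

Insight: Zero-shot methods can capture compound expressions effectively without domain adaptation, which opens a new research direction for multimodal affective computing.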

Abstract: Compound Expression Recognition (CER), a subfield of affective computing, aims to detect complex emotional states formed by combinations of basic emotions. In this work, we present a novel zero-shot multimodal approach for CER that combines six heterogeneous modalities into a single pipeline: static and dynamic facial expressions, scene and label matching, scene context, audio, and text. Unlike previous approaches relying on task-specific training data, our approach uses zero-shot components, including Contrastive Language-Image Pretraining (CLIP)-based label matching and Qwen-VL for semantic scene understanding. We further introduce a Multi-Head Probability Fusion (MHPF) module that dynamically weights modality-specific predictions, followed by a Compound Expressions (CE) transformation module that uses Pair-Wise Probability Aggregation (PPA) and Pair-Wise Feature Similarity Aggregation (PFSA) methods to produce interpretable compound emotion outputs. Evaluated under multi-corpus training, the proposed approach shows F1 scores of 46.95% on AffWild2, 49.02% on Acted Facial Expressions in The Wild (AFEW), and 34.85% on C-EXPR-DB via zero-shot testing, which is comparable to the results of supervised approaches trained on target data. This demonstrates the effectiveness of the proposed approach for capturing CE without domain adaptation. The source code is publicly available.

[5] SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers cs.CV | cs.CL | cs.LG

Takuro Kawada, Shunsuke Kitada, Sota Nemoto, Hitoshi Iyatomi

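TL;DR: This paper introduces SciGA-145k, a dataset of about 145,000 scientific papers and 1.14 million figures intended to support research on selecting, recommending, and automatically generating graphical abstracts (GAs). It also proposes two tasks (Intra-GA and Inter-GA recommendation) and a new evaluation metric, CAR.

Details

Motivation: GAs are important for scientific communication, but designing and selecting effective ones requires advanced visualization skills, which limits their adoption. Existing research has underexplored the potential of GAs, and a large-scale dataset is needed to support this work.

Result: SciGA-145k provides a data foundation for GA selection, recommendation, and generation. The proposed tasks and the CAR metric open new directions for follow-up research.

Insight: Automated recommendation and generation of GAs is a key direction for making scientific communication more efficient, and CAR offers a more reasonable evaluation for multi-label recommendation settings.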

Abstract: Graphical Abstracts (GAs) play a crucial role in visually conveying the key findings of scientific papers. While recent research has increasingly incorporated visual materials such as Figure 1 as de facto GAs, their potential to enhance scientific communication remains largely unexplored. Moreover, designing effective GAs requires advanced visualization skills, creating a barrier to their widespread adoption. To tackle these challenges, we introduce SciGA-145k, a large-scale dataset comprising approximately 145,000 scientific papers and 1.14 million figures, explicitly designed for supporting GA selection and recommendation as well as facilitating research in automated GA generation. As a preliminary step toward GA design support, we define two tasks: 1) Intra-GA recommendation, which identifies figures within a given paper that are well-suited to serve as GAs, and 2) Inter-GA recommendation, which retrieves GAs from other papers to inspire the creation of new GAs. We provide reasonable baseline models for these tasks. Furthermore, we propose Confidence Adjusted top-1 ground truth Ratio (CAR), a novel recommendation metric that offers a fine-grained analysis of model behavior. CAR addresses limitations in traditional ranking-based metrics by considering cases where multiple figures within a paper, beyond the explicitly labeled GA, may also serve as GAs. By unifying these tasks and metrics, our SciGA-145k establishes a foundation for advancing visual scientific communication while contributing to the development of AI for Science.

[6] Understanding Trade offs When Conditioning Synthetic Data cs.CV | cs.AI

Brandon Trabucco, Qasim Wani, Benjamin Pikus, Vasu Sharma

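TL;DR: The paper studies two conditioning strategies for synthetic data generation, prompt-based and layout-based. It finds that layout-based conditioning becomes superior as the diversity of conditioning cues grows and can significantly improve object detection performance.

Details

Motivation: In industrial vision systems, collecting high-quality training data is slow and costly. Synthetic data is one solution, but existing pipelines (such as 3D engines) render slowly and suffer from a simulation-to-reality gap, while diffusion models are efficient but hard to control precisely. The study aims to understand how different conditioning strategies affect synthetic data quality.

Result: Layout conditioning works better when conditioning cues are highly diverse, and combining synthetic with real data significantly raises mAP (by 34% on average and by up to 177%).

Insight: The choice of conditioning strategy should take data diversity into account. Layout conditioning has the advantage in complex scenes, offering a new approach to generating high-quality synthetic data efficiently.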

Abstract: Learning robust object detectors from only a handful of images is a critical challenge in industrial vision systems, where collecting high quality training data can take months. Synthetic data has emerged as a key solution for data efficient visual inspection and pick and place robotics. Current pipelines rely on 3D engines such as Blender or Unreal, which offer fine control but still require weeks to render a small dataset, and the resulting images often suffer from a large gap between simulation and reality. Diffusion models promise a step change because they can generate high quality images in minutes, yet precise control, especially in low data regimes, remains difficult. Although many adapters now extend diffusion beyond plain text prompts, the effect of different conditioning schemes on synthetic data quality is poorly understood. We study eighty diverse visual concepts drawn from four standard object detection benchmarks and compare two conditioning strategies: prompt based and layout based. When the set of conditioning cues is narrow, prompt conditioning yields higher quality synthetic data; as diversity grows, layout conditioning becomes superior. When layout cues match the full training distribution, synthetic data raises mean average precision by an average of thirty four percent and by as much as one hundred seventy seven percent compared with using real data alone.

[7] High-Fidelity Differential-information Driven Binary Vision Transformer cs.CV

Tian Gao, Zhiyuan Zhang, Kaijie Yin, Xu-Cheng Zhong, Hui Kong

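TL;DR: The paper proposes DIDB-ViT, a binary vision transformer that introduces differential information and frequency decomposition to address the performance degradation of existing binary ViT methods, and performs well on multiple tasks.

Details

Motivation: Binarizing vision transformers (ViTs) reduces compute and storage requirements and suits edge-device deployment, but existing methods often cause a significant performance drop or depend on full-precision modules.

Result: DIDB-ViT significantly outperforms existing network quantization methods on image classification and segmentation across multiple ViT architectures.

Insight: Differential information and frequency decomposition can effectively mitigate the information loss caused by binarization, and the improved activation function substantially increases the model's representational capacity.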

Abstract: The binarization of vision transformers (ViTs) offers a promising approach to addressing the trade-off between high computational/storage demands and the constraints of edge-device deployment. However, existing binary ViT methods often suffer from severe performance degradation or rely heavily on full-precision modules. To address these issues, we propose DIDB-ViT, a novel binary ViT that is highly informative while maintaining the original ViT architecture and computational efficiency. Specifically, we design an informative attention module incorporating differential information to mitigate information loss caused by binarization and enhance high-frequency retention. To preserve the fidelity of the similarity calculations between binary Q and K tensors, we apply frequency decomposition using the discrete Haar wavelet and integrate similarities across different frequencies. Additionally, we introduce an improved RPReLU activation function to restructure the activation distribution, expanding the model’s representational capacity. Experimental results demonstrate that our DIDB-ViT significantly outperforms state-of-the-art network quantization methods in multiple ViT architectures, achieving superior image classification and segmentation performance.
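
To make the frequency-decomposition idea concrete, here is a minimal sketch of a single-level orthonormal Haar transform applied to toy binarized query and key vectors. It only illustrates that the full dot-product similarity splits exactly into a low-frequency part and a high-frequency part. The vectors, the function names, and the choice of a one-dimensional transform are assumptions, and the abstract does not say how DIDB-ViT weights or integrates the per-frequency similarities.

```python
import math

def haar_1d(x):
    """Single-level orthonormal Haar transform of an even-length sequence."""
    s = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]   # averages
    high = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]  # differences
    return low, high

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))

# Toy binarized (+1/-1) query and key vectors; values are hypothetical.
q = [1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0, -1.0]
k = [1.0, 1.0, -1.0, 1.0, -1.0, 1.0, 1.0, -1.0]

q_low, q_high = haar_1d(q)
k_low, k_high = haar_1d(k)

full_similarity = dot(q, k)
low_similarity = dot(q_low, k_low)
high_similarity = dot(q_high, k_high)
```

Because the transform is orthonormal, `low_similarity + high_similarity` equals `full_similarity`, so a method is free to treat the two bands differently without losing the original similarity.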

[8] FMOcc: TPV-Driven Flow Matching for 3D Occupancy Prediction with Selective State Space Model cs.CV

Jiangxia Chen, Tongyuan Huang, Ke Song

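TL;DR: FMOcc proposes a flow matching method based on TPV and a selective state space model to improve the accuracy of 3D semantic occupancy prediction in autonomous driving, especially in occluded and distant scenes.

Details

Motivation: Most existing methods improve 3D semantic occupancy prediction by fusing historical frames, which requires additional data and substantial compute. This paper aims to improve prediction with few-frame input through better feature generation and selective filtering.

Result: FMOcc outperforms existing methods on two datasets (Occ3D-nuScenes and OpenOcc). With two-frame input it reaches 43.1% RayIoU and 39.8% mIoU on Occ3D-nuScenes, with 5.4 G of inference memory and 330 ms inference time.

Insight: Combining flow matching with a selective state space model can effectively improve few-frame 3D occupancy prediction while reducing redundant computation.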

Abstract: 3D semantic occupancy prediction plays a pivotal role in autonomous driving. However, inherent limitations of few-frame images and redundancy in 3D space compromise prediction accuracy for occluded and distant scenes. Existing methods enhance performance by fusing historical frame data, which requires additional data and significant computational resources. To address these issues, this paper proposes FMOcc, a Tri-perspective View (TPV) refinement occupancy network with a flow matching selective state space model for few-frame 3D occupancy prediction. First, to generate missing features, we design a feature refinement module based on a flow matching model, called the Flow Matching SSM module (FMSSM). Furthermore, by designing the TPV SSM layer and Plane Selective SSM (PS3M), we selectively filter TPV features to reduce the impact of air voxels on non-air voxels, thereby enhancing the overall efficiency of the model and its prediction capability for distant scenes. Finally, we design the Mask Training (MT) method to enhance the robustness of FMOcc and address the issue of sensor data loss. Experimental results on the Occ3D-nuScenes and OpenOcc datasets show that our FMOcc outperforms existing state-of-the-art methods. Our FMOcc with two-frame input achieves 43.1% RayIoU and 39.8% mIoU on Occ3D-nuScenes validation and 42.6% RayIoU on OpenOcc, with 5.4 G inference memory and 330 ms inference time.
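
For readers unfamiliar with flow matching, the sketch below shows the common straight-path formulation: a training point is a linear interpolation between a source and a target, the regression target is their difference, and sampling integrates the learned velocity with Euler steps. This is a generic illustration under stated assumptions, not FMOcc's FMSSM module. The vectors are hypothetical, and the `oracle` function stands in for a trained network.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path; the regression target for a model."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(x0, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for n in range(steps):
        v = velocity_fn(x, n * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

coarse = [0.0, 0.0, 0.0]    # hypothetical incomplete feature vector
refined = [1.0, -2.0, 0.5]  # hypothetical complete feature vector

def oracle(x, t):
    """Stand-in for a trained network: returns the exact path velocity."""
    return target_velocity(coarse, refined)

midpoint = interpolate(coarse, refined, 0.5)
result = euler_sample(coarse, oracle, steps=10)
```

With the exact velocity, Euler integration lands on the target feature. A real model would only approximate that velocity, which is where the number of integration steps starts to matter.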

[9] SurgVisAgent: Multimodal Agentic Model for Versatile Surgical Visual Enhancement cs.CV | cs.AI

Zeyu Lei, Hongyuan Yu, Jinlin Wu, Zhen Chen

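TL;DR: SurgVisAgent is an intelligent surgical vision agent built on multimodal large language models. It dynamically identifies the distortion category and severity in endoscopic images and performs a range of enhancement tasks, such as low-light enhancement, overexposure correction, motion blur elimination, and smoke removal.

Details

Motivation: Traditional surgical vision enhancement algorithms are usually designed for a single task and cannot meet the diverse demands of complex real-world scenarios. To address this, the authors propose SurgVisAgent as a unified solution.

Result: On a benchmark simulating real-world surgical distortions, SurgVisAgent outperforms traditional single-task models, showing its potential as a unified solution for surgical assistance.

Insight: Through multimodal and in-context learning, SurgVisAgent addresses the diverse requirements of surgical vision enhancement and offers a new approach for intelligent surgical assistance systems.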

Abstract: Precise surgical interventions are vital to patient safety, and advanced enhancement algorithms have been developed to assist surgeons in decision-making. Despite significant progress, these algorithms are typically designed for single tasks in specific scenarios, limiting their effectiveness in complex real-world situations. To address this limitation, we propose SurgVisAgent, an end-to-end intelligent surgical vision agent built on multimodal large language models (MLLMs). SurgVisAgent dynamically identifies distortion categories and severity levels in endoscopic images, enabling it to perform a variety of enhancement tasks such as low-light enhancement, overexposure correction, motion blur elimination, and smoke removal. Specifically, to achieve superior surgical scenario understanding, we design a prior model that provides domain-specific knowledge. Additionally, through in-context few-shot learning and chain-of-thought (CoT) reasoning, SurgVisAgent delivers customized image enhancements tailored to a wide range of distortion types and severity levels, thereby addressing the diverse requirements of surgeons. Furthermore, we construct a comprehensive benchmark simulating real-world surgical distortions, on which extensive experiments demonstrate that SurgVisAgent surpasses traditional single-task models, highlighting its potential as a unified solution for surgical assistance.

[10] Cross-domain Hyperspectral Image Classification based on Bi-directional Domain Adaptation cs.CV | eess.IV

Yuxiang Zhang, Wei Li, Wen Jia, Mengmeng Zhang, Ran Tao

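TL;DR: This paper proposes a Bi-directional Domain Adaptation (BiDA) framework for cross-domain hyperspectral image classification. Using a triple-branch transformer architecture and a coupled multi-head cross-attention mechanism, it extracts domain-invariant features and domain-specific information and significantly improves classification performance.

Details

Motivation: When hyperspectral images are acquired in different scenes or at different times, the spectral signature of the same class varies significantly, so traditional methods perform poorly in cross-domain classification. The authors propose BiDA to address this challenge.

Result: On the cross-temporal tree species classification task, BiDA is 3% to 5% higher than the most advanced method.

Insight: Extracting domain-invariant and domain-specific features together, combined with a feature interaction mechanism, effectively improves cross-domain hyperspectral image classification.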

Abstract: Utilizing hyperspectral remote sensing technology enables the extraction of fine-grained land cover classes. Typically, satellite or airborne images used for training and testing are acquired from different regions or times, where the same class has significant spectral shifts in different scenes. In this paper, we propose a Bi-directional Domain Adaptation (BiDA) framework for cross-domain hyperspectral image (HSI) classification, which focuses on extracting both domain-invariant features and domain-specific information in the independent adaptive space, thereby enhancing the adaptability and separability to the target scene. In the proposed BiDA, a triple-branch transformer architecture (the source branch, target branch, and coupled branch) with a semantic tokenizer is designed as the backbone. Specifically, the source branch and target branch independently learn the adaptive space of the source and target domains, while a Coupled Multi-head Cross-attention (CMCA) mechanism is developed in the coupled branch for feature interaction and inter-domain correlation mining. Furthermore, a bi-directional distillation loss is designed to guide adaptive space learning using inter-domain correlation. Finally, we propose an Adaptive Reinforcement Strategy (ARS) to encourage the model to focus on specific generalized feature extraction within both source and target scenes under noisy conditions. Experimental results on cross-temporal/scene airborne and satellite datasets demonstrate that the proposed BiDA performs significantly better than some state-of-the-art domain adaptation approaches. In the cross-temporal tree species classification task, the proposed BiDA is 3% to 5% higher than the most advanced method. The code will be available at https://github.com/YuxiangZhang-BIT/IEEE_TCSVT_BiDA.

[11] MAC-Lookup: Multi-Axis Conditional Lookup Model for Underwater Image Enhancement cs.CV

Fanghai Yi, Zehong Zheng, Zexiao Liang, Yihang Dong, Xiyang Fang

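TL;DR: The MAC-Lookup model tackles underwater image enhancement with a conditional 3D lookup table and multi-axis adaptive enhancement, improving color accuracy and detail recovery.

Details

Motivation: Underwater images suffer from visibility and color problems caused by light changes, turbidity, and bubbles. Traditional methods and deep learning struggle with them because of inherent limitations or insufficient data.

Result: Experiments show that MAC-Lookup outperforms existing methods in color restoration and detail enhancement.

Insight: MAC-Lookup's multi-stage processing balances color correction and detail enhancement, offering a new approach to underwater image enhancement.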

Abstract: Enhancing underwater images is crucial for exploration. These images face visibility and color issues due to light changes, water turbidity, and bubbles. Traditional prior-based methods and pixel-based methods often fail, while deep learning lacks sufficient high-quality datasets. We introduce the Multi-Axis Conditional Lookup (MAC-Lookup) model, which enhances visual quality by improving color accuracy, sharpness, and contrast. It includes Conditional 3D Lookup Table Color Correction (CLTCC) for preliminary color and quality correction and Multi-Axis Adaptive Enhancement (MAAE) for detail refinement. This model prevents over-enhancement and saturation while handling underwater challenges. Extensive experiments show that MAC-Lookup excels in enhancing underwater images by restoring details and colors better than existing methods. The code is available at https://github.com/onlycatdoraemon/MAC-Lookup.

[12] Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation cs.CV | cs.AI | cs.MM

Feizhen Huang, Yu Wu, Yutian Lin, Bo Du

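TL;DR: The paper proposes a simple self-distillation approach that extends video-to-audio generation models to handle partially visible targets in cinematic-language scenarios.

Details

Motivation: Existing video-to-audio generation methods overlook cinematic language, a key component of artistic expression in film, so their performance degrades when targets are only partially visible.

Result: The method significantly improves performance under partial visibility on all evaluation metrics and also improves performance on the large-scale VGGSound dataset.

Insight: Self-distillation training that introduces cinematic-language variations can effectively improve video-to-audio generation under partial visibility.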

Abstract: Video-to-Audio (V2A) Generation has achieved significant progress and plays a crucial role in film and video post-production. However, current methods overlook the cinematic language, a critical component of artistic expression in filmmaking. As a result, their performance deteriorates in scenarios where Foley targets are only partially visible. To address this challenge, we propose a simple self-distillation approach to extend V2A models to cinematic language scenarios. By simulating the cinematic language variations, the student model learns to align the video features of training pairs with the same audio-visual correspondences, enabling it to effectively capture the associations between sounds and partial visual information. Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large-scale V2A dataset, VGGSound.

[13] LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models cs.CV

Juntao Liu, Liqiang Niu, Wenchao Chen, Jie Zhou, Fandong Meng

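TL;DR: The paper proposes LaCo, a new framework that compresses visual tokens efficiently within the intermediate layers of the vision encoder, significantly improving training and inference efficiency.

Details

Motivation: Existing visual token compression methods mostly operate as post-encoder modules, which limits their potential efficiency gains. The authors seek further gains by compressing within the intermediate layers of the vision encoder.

Result: LaCo outperforms existing methods at compressing tokens in intermediate layers, improving training efficiency by more than 20% and inference throughput by more than 15%.

Insight: Compressing tokens within the intermediate layers of the vision encoder is more effective than post-encoder compression and can significantly improve the efficiency of multimodal large language models.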

Abstract: Existing visual token compression methods for Multimodal Large Language Models (MLLMs) predominantly operate as post-encoder modules, limiting their potential for efficiency gains. To address this limitation, we propose LaCo (Layer-wise Visual Token Compression), a novel framework that enables effective token compression within the intermediate layers of the vision encoder. LaCo introduces two core components: 1) a layer-wise pixel-shuffle mechanism that systematically merges adjacent tokens through space-to-channel transformations, and 2) a residual learning architecture with non-parametric shortcuts that preserves critical visual information during compression. Extensive experiments indicate that our LaCo outperforms all existing methods when compressing tokens in the intermediate layers of the vision encoder, demonstrating superior effectiveness. In addition, compared to external compression, our method improves training efficiency by more than 20% and inference throughput by more than 15% while maintaining strong performance.
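
The space-to-channel merge named in the abstract can be pictured with a small sketch: every 2 x 2 block of neighboring tokens is concatenated into one longer token, so the token count drops by a factor of four while the channel width grows by the same factor. This is a minimal illustration only. The function name, the block size, and the toy grid are assumptions, and the sketch omits any projection after merging as well as LaCo's residual shortcut.

```python
def space_to_channel(grid, r=2):
    """Merge each r x r block of neighboring tokens into one longer token.

    grid: H x W nested list of token vectors (each a list of C floats).
    Returns an (H/r) x (W/r) grid whose tokens have r * r * C channels.
    """
    h, w = len(grid), len(grid[0])
    if h % r or w % r:
        raise ValueError("grid height and width must be divisible by r")
    merged = []
    for i in range(0, h, r):
        row = []
        for j in range(0, w, r):
            token = []
            # Concatenate the block's tokens in row-major order.
            for di in range(r):
                for dj in range(r):
                    token.extend(grid[i + di][j + dj])
            row.append(token)
        merged.append(row)
    return merged

# Toy 4 x 4 grid of 2-channel tokens; each token stores its own (row, col).
tokens = [[[float(i), float(j)] for j in range(4)] for i in range(4)]
compressed = space_to_channel(tokens)
```

Here 16 tokens of width 2 become 4 tokens of width 8, and no values are discarded, which is why the operation is described as a rearrangement from space to channels.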

[14] Prompt Disentanglement via Language Guidance and Representation Alignment for Domain Generalization cs.CV | cs.LG

De Cheng, Zhipeng Xu, Xinyang Jiang, Dongsheng Li, Nannan Wang

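TL;DR: The paper proposes a prompt disentanglement method for domain generalization (DG) based on language guidance and representation alignment. It uses a large language model (LLM) to disentangle text prompts and strengthens domain invariance through visual prompts. Experiments show it outperforms state-of-the-art methods.

Details

Motivation: Existing domain prompt tuning methods based on Visual Foundation Models (VFMs) struggle to disentangle invariant features across domains. The authors address this through the flexibility and controllability of language prompts.

Result: The method outperforms state-of-the-art DG methods on datasets including PACS, VLCS, OfficeHome, DomainNet, and TerraInc.

Insight: The text modality is easier to disentangle, whereas the complexity of visual features calls for additional representation alignment constraints to ensure generalization.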

Abstract: Domain Generalization (DG) seeks to develop a versatile model capable of performing effectively on unseen target domains. Notably, recent advances in pre-trained Visual Foundation Models (VFMs), such as CLIP, have demonstrated considerable potential in enhancing the generalization capabilities of deep learning models. Despite the increasing attention toward VFM-based domain prompt tuning within DG, the effective design of prompts capable of disentangling invariant features across diverse domains remains a critical challenge. In this paper, we propose addressing this challenge by leveraging the controllable and flexible language prompt of the VFM. Noting that the text modality of VFMs is naturally easier to disentangle, we introduce a novel framework for text feature-guided visual prompt tuning. This framework first automatically disentangles the text prompt using a large language model (LLM) and then learns domain-invariant visual representation guided by the disentangled text feature. However, relying solely on language to guide visual feature disentanglement has limitations, as visual features can sometimes be too complex or nuanced to be fully captured by descriptive text. To address this, we introduce Worst Explicit Representation Alignment (WERA), which extends text-guided visual prompts by incorporating an additional set of abstract prompts. These prompts enhance source domain diversity through stylized image augmentations, while alignment constraints ensure that visual representations remain consistent across both the original and augmented distributions. Experiments conducted on major DG datasets, including PACS, VLCS, OfficeHome, DomainNet, and TerraInc, demonstrate that our proposed method outperforms state-of-the-art DG methods.

[15] DreamComposer++: Empowering Diffusion Models with Multi-View Conditions for 3D Content Generation cs.CV

Yunhan Yang, Shuo Chen, Yukun Huang, Xiaoyang Wu, Yuan-Chen Guo

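TL;DR: DreamComposer++ is a framework for controllable 3D content generation that strengthens diffusion models with multi-view conditions. It combines 3D lifting with a multi-view feature fusion module and significantly improves novel view synthesis.

Details

Motivation: When generating high-quality novel views from a single image, existing methods lack multi-view information, which limits view control. The authors therefore propose a framework that incorporates multi-view conditions to improve controllability.

Result: Experiments show that DreamComposer++ integrates seamlessly with recent view-aware diffusion models and significantly improves their ability to generate controllable novel views from multi-view conditions.

Insight: Introducing multi-view conditions is key to more controllable 3D content generation, and the framework's flexibility lets it adapt to a wide range of diffusion models.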

Abstract: Recent advancements in leveraging pre-trained 2D diffusion models achieve the generation of high-quality novel views from a single in-the-wild image. However, existing works face challenges in producing controllable novel views due to the lack of information from multiple views. In this paper, we present DreamComposer++, a flexible and scalable framework designed to improve current view-aware diffusion models by incorporating multi-view conditions. Specifically, DreamComposer++ utilizes a view-aware 3D lifting module to extract 3D representations of an object from various views. These representations are then aggregated and rendered into the latent features of target view through the multi-view feature fusion module. Finally, the obtained features of target view are integrated into pre-trained image or video diffusion models for novel view synthesis. Experimental results demonstrate that DreamComposer++ seamlessly integrates with cutting-edge view-aware diffusion models and enhances their abilities to generate controllable novel views from multi-view conditions. This advancement facilitates controllable 3D object reconstruction and enables a wide range of applications.

[16] Perception Activator: An intuitive and portable framework for brain cognitive exploration cs.CV

Le Xu, Qi Zhang, Qixian Zhang, Hongyun Zhang, Duoqian Miao

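TL;DR: This paper proposes a framework called Perception Activator that injects fMRI representations into multi-scale image features, verifying that fMRI signals provide semantic enhancement in downstream tasks.

Details

Motivation: Existing brain-vision decoding methods rely mainly on pixel-level alignment and lack fine-grained semantic alignment, which distorts reconstructions. This paper explores the multi-object semantic cues in fMRI signals and their effect on downstream tasks.

Result: Experiments show that incorporating fMRI signals improves detection and segmentation accuracy, confirming that fMRI contains rich multi-object semantic cues and coarse spatial localization information.

Insight: fMRI signals carry not only low-level perceptual information but also underexploited high-level semantic cues, which points to new directions for brain-vision decoding methods.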

Abstract: Recent advances in brain-vision decoding have driven significant progress in reconstructing perceived visual stimuli with high fidelity from neural activity, e.g., functional magnetic resonance imaging (fMRI), in the human visual cortex. Most existing methods decode the brain signal using a two-level strategy, i.e., pixel-level and semantic-level. However, these methods rely heavily on low-level pixel alignment yet lack sufficient and fine-grained semantic alignment, resulting in obvious reconstruction distortions of multiple semantic objects. To better understand the brain’s visual perception patterns and how current decoding models process semantic objects, we have developed an experimental framework that uses fMRI representations as intervention conditions. By injecting these representations into multi-scale image features via cross-attention, we compare both downstream performance and intermediate feature changes on object detection and instance segmentation tasks with and without fMRI information. Our results demonstrate that incorporating fMRI signals enhances the accuracy of downstream detection and segmentation, confirming that fMRI contains rich multi-object semantic cues and coarse spatial localization information, elements that current models have yet to fully exploit or integrate.

[17] MAGIC: Mask-Guided Diffusion Inpainting with Multi-Level Perturbations and Context-Aware Alignment for Few-Shot Anomaly Generation cs.CV | cs.AI

JaeHyuck Choi, MinJun Kim, JeHyeong Hong

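TL;DR: MAGIC is a diffusion-based few-shot anomaly generation method that uses multi-level perturbations and context-aware alignment to address background corruption, mask alignment, and semantic plausibility.

Details

Motivation: Anomaly data is scarce in industrial quality control, and existing methods cannot simultaneously keep the background intact, align with the mask, and remain semantically plausible.

Result: MAGIC outperforms existing methods on the MVTec-AD dataset.

Insight: Multi-level perturbations and context-aware alignment can effectively address the key challenges of few-shot anomaly generation.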

Abstract: Few-shot anomaly generation is emerging as a practical solution for augmenting the scarce anomaly data in industrial quality control settings. An ideal generator would meet three demands at once, namely (i) keep the normal background intact, (ii) inpaint anomalous regions to tightly overlap with the corresponding anomaly masks, and (iii) generate anomalous regions in a semantically valid location, while still producing realistic, diverse appearances from only a handful of real examples. Existing diffusion-based methods usually satisfy at most two of these requirements: global anomaly generators corrupt the background, whereas mask-guided ones often falter when the mask is imprecise or misplaced. We propose MAGIC (Mask-guided inpainting with multi-level perturbations and Context-aware alignment) to resolve all three issues. At its core, MAGIC fine-tunes a Stable Diffusion inpainting backbone that preserves normal regions and ensures strict adherence of the synthesized anomaly to the supplied mask, directly addressing background corruption and misalignment. To offset the diversity loss that fine-tuning can cause, MAGIC adds two complementary perturbation strategies: (i) Gaussian prompt-level perturbation applied during fine-tuning and inference that broadens the global appearance of anomalies while avoiding low-fidelity textual appearances, and (ii) mask-guided spatial noise injection that enriches local texture variations. Additionally, the context-aware mask alignment module forms semantic correspondences and relocates masks so that every anomaly remains plausibly contained within the host object, eliminating out-of-boundary artifacts. Under a consistent evaluation protocol on the MVTec-AD dataset, MAGIC outperforms previous state-of-the-art methods in downstream anomaly tasks.
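
As a rough picture of what mask-guided spatial noise injection could look like, the sketch below adds Gaussian noise to a toy latent map only at positions where the anomaly mask is 1 and leaves the rest untouched. The abstract gives no formula, so the additive form, the function name, the noise scale, and the toy values are all assumptions rather than MAGIC's actual procedure.

```python
import random

def mask_guided_noise(latent, mask, sigma, rng):
    """Add zero-mean Gaussian noise to latent values where mask is 1.

    Positions with mask 0 are returned unchanged (assumed additive form).
    """
    noisy = []
    for latent_row, mask_row in zip(latent, mask):
        noisy.append([
            value + sigma * rng.gauss(0.0, 1.0) * m
            for value, m in zip(latent_row, mask_row)
        ])
    return noisy

rng = random.Random(0)  # fixed seed so the sketch is repeatable
latent = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]  # toy 2 x 3 latent map
mask = [[0, 1, 1], [0, 0, 1]]                 # 1 marks the anomaly region
perturbed = mask_guided_noise(latent, mask, sigma=0.1, rng=rng)
```

Restricting the noise to the masked region keeps the normal background intact while still varying local texture inside the anomaly, which matches the two goals the abstract pairs together.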

[18] Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos cs.CV

Zecheng Zhao, Selena Song, Tong Chen, Zhi Chen, Shazia Sadiq

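TL;DR: The paper introduces the SynTVA dataset and benchmark for evaluating the utility of synthetic videos in downstream tasks such as text-to-video retrieval, and proposes an automatic evaluation tool, Auto-Evaluator.

Details

Motivation: Existing evaluation metrics focus on visual quality and temporal consistency and say little about how synthetic videos perform in downstream tasks, so a new evaluation framework is needed.

Result: SynTVA makes it possible to select high-quality synthetic samples that measurably improve text-to-video retrieval.

Insight: Selection criteria for synthetic videos should consider semantic alignment rather than visual quality alone, which can effectively improve downstream task performance.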

Abstract: Text-to-video (T2V) synthesis has advanced rapidly, yet current evaluation metrics primarily capture visual quality and temporal consistency, offering limited insight into how synthetic videos perform in downstream tasks such as text-to-video retrieval (TVR). In this work, we introduce SynTVA, a new dataset and benchmark designed to evaluate the utility of synthetic videos for building retrieval models. Based on 800 diverse user queries derived from MSRVTT training split, we generate synthetic videos using state-of-the-art T2V models and annotate each video-text pair along four key semantic alignment dimensions: Object & Scene, Action, Attribute, and Prompt Fidelity. Our evaluation framework correlates general video quality assessment (VQA) metrics with these alignment scores, and examines their predictive power for downstream TVR performance. To explore pathways of scaling up, we further develop an Auto-Evaluator to estimate alignment quality from existing metrics. Beyond benchmarking, our results show that SynTVA is a valuable asset for dataset augmentation, enabling the selection of high-utility synthetic samples that measurably improve TVR outcomes. Project page and dataset can be found at https://jasoncodemaker.github.io/SynTVA/.
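
The abstract says the framework correlates general VQA metrics with alignment scores but does not name the statistic. As one possible reading, the sketch below computes a Pearson correlation between a VQA metric and an alignment score over a handful of videos. The choice of Pearson, the function name, and every number in the example are assumptions made for illustration, not data from SynTVA.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical per-video scores: a generic VQA metric and an annotated
# alignment score for one dimension (for example, Action).
vqa_scores = [0.62, 0.71, 0.55, 0.80, 0.67]
alignment_scores = [3.0, 4.0, 2.5, 4.5, 3.0]

r = pearson(vqa_scores, alignment_scores)
```

A coefficient near 1 would suggest the existing metric is a usable proxy for that alignment dimension, while a value near 0 would suggest it captures something else.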

[19] Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback cs.CV

Nina Konovalova, Maxim Nikolaev, Andrey Kuznetsov, Aibek Alanov

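TL;DR: The paper proposes InnerControl, which enforces spatial consistency across all denoising steps of the diffusion model to improve the spatial control that ControlNet and ControlNet++ exert over generated images.

Details

Motivation: Existing methods such as ControlNet++ apply the cycle consistency loss only at the final denoising steps, neglecting spatial alignment in intermediate generation stages and limiting model performance.

Result: Combined with techniques such as ControlNet++, InnerControl achieves state-of-the-art performance across multiple conditioning methods (e.g., edges, depth).

Insight: Feedback from features at intermediate generation stages is essential for precise spatial control, and lightweight probes can efficiently extract signals from noisy latents.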

Abstract: Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. ControlNet addresses this by introducing an auxiliary conditioning module, while ControlNet++ further refines alignment through a cycle consistency loss applied only to the final denoising steps. However, this approach neglects intermediate generation stages, limiting its effectiveness. We propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Our method trains lightweight convolutional probes to reconstruct input control signals (e.g., edges, depth) from intermediate UNet features at every denoising step. These probes efficiently extract signals even from highly noisy latents, enabling pseudo ground truth controls for training. By minimizing the discrepancy between predicted and target conditions throughout the entire diffusion process, our alignment loss improves both control fidelity and generation quality. Combined with established techniques like ControlNet++, InnerControl achieves state-of-the-art performance across diverse conditioning methods (e.g., edges, depth).
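
To make the "loss at every denoising step" idea concrete, here is a toy sketch in plain Python. It is not the authors' implementation: the probe is a per-pixel linear map (a stand-in for the lightweight convolutional probes), features are nested lists rather than UNet tensors, and the names `probe`, `mse`, and `alignment_loss` are hypothetical.

```python
def probe(features, weight, bias):
    """Hypothetical 1x1 'convolution': a per-pixel linear map over channels.

    `features` is a list of pixels, each a list of channel values.
    """
    return [sum(w * c for w, c in zip(weight, pixel)) + bias for pixel in features]


def mse(pred, target):
    """Mean squared error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def alignment_loss(features_per_step, control, weight, bias):
    """Average the probe reconstruction error over every denoising step.

    A final-step-only loss would use just features_per_step[-1]; here each
    intermediate step contributes equally.
    """
    losses = [mse(probe(f, weight, bias), control) for f in features_per_step]
    return sum(losses) / len(losses)
```

Averaging over all steps means that a misaligned intermediate state is penalized even if the final state happens to match the control signal.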

[20] Two-Steps Neural Networks for an Automated Cerebrovascular Landmark Detection cs.CV | cs.AI

Rafic Nader, Vincent L’Allinec, Romain Bourcier, Florent Autrusseau

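TL;DR: Proposes a two-stage neural network method for detecting cerebrovascular landmarks: an object detection network first identifies regions of interest (ROIs), then a modified U-Net precisely locates the bifurcations, effectively reducing false and missed detections.

Details

Motivation: Intracranial aneurysms (ICA) commonly occur in specific segments of the Circle of Willis (CoW), especially at the 13 major arterial bifurcations. Accurately detecting these landmarks is essential for fast, efficient diagnosis.

Result: Experiments on two cerebral MRA datasets show that the method achieves the best performance on the bifurcation detection task.

Insight: The two-stage approach not only improves detection accuracy but also handles anatomical variability and closely spaced landmarks, making it suitable for clinical practice.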

Abstract: Intracranial aneurysms (ICA) commonly occur in specific segments of the Circle of Willis (CoW), primarily at thirteen major arterial bifurcations. Accurate detection of these critical landmarks is necessary for a prompt and efficient diagnosis. We introduce a fully automated landmark detection approach for CoW bifurcations using a two-step neural network process. Initially, an object detection network identifies regions of interest (ROIs) proximal to the landmark locations. Subsequently, a modified U-Net with deep supervision is used to accurately locate the bifurcations. This two-step method reduces various problems, such as the missed detections caused by two landmarks being close to each other and having similar visual characteristics, especially when processing the complete MRA Time-of-Flight (TOF). Additionally, it accounts for the anatomical variability of the CoW, which affects the number of detectable landmarks per scan. We assessed the effectiveness of our approach using two cerebral MRA datasets: our in-house dataset, which had varying numbers of landmarks, and a public dataset with a standardized landmark configuration. Our experimental results demonstrate that our method achieves the highest level of performance on a bifurcation detection task.
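
The hand-off between the two steps can be sketched as follows, assuming the second-stage network outputs a heatmap over the ROI crop and the landmark is taken at the heatmap peak. This is a 2D simplification (MRA volumes are 3D), and `roi_origin`, `heatmap`, and `locate_landmark` are hypothetical names rather than anything taken from the paper.

```python
def locate_landmark(roi_origin, heatmap):
    """Map the heatmap peak inside an ROI crop back to full-image coordinates.

    `roi_origin` is the (row, col) of the crop's top-left corner, as a
    first-stage detector might return it. `heatmap` is the second-stage
    output for that crop, given as a list of rows.
    Returns (row, col, peak_value) in full-image coordinates.
    """
    best_row, best_col, best_val = 0, 0, float("-inf")
    for r, row in enumerate(heatmap):
        for c, val in enumerate(row):
            if val > best_val:
                best_row, best_col, best_val = r, c, val
    return roi_origin[0] + best_row, roi_origin[1] + best_col, best_val
```

Because each ROI is processed separately, two nearby landmarks that fall into different crops cannot be confused at the peak-finding stage, which is one plausible reading of why the two-step design reduces missed detections.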

[21] Lightweight Shrimp Disease Detection Research Based on YOLOv8n cs.CV

Fei Yuhuan, Wang Gengchen, Liu Fenghao, Zang Ran, Sun Xufei

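TL;DR: Proposes a lightweight network architecture based on YOLOv8n that, by improving the detection head and the self-attention mechanism, reduces computational complexity and raises detection accuracy, achieving efficient and accurate shrimp disease detection.

Details

Motivation: Shrimp disease is one of the main causes of economic loss in shrimp farming, and efficient, lightweight intelligent detection methods are urgently needed to limit disease transmission and improve detection efficiency.

Result: Model parameters are reduced by 32.3%, mAP@0.5 reaches 92.7% (a 3% improvement), and mAP@0.5 improves by 4.1% on the URPC2020 dataset.

Insight: Lightweight design can significantly improve model efficiency, while the improved self-attention mechanism effectively strengthens feature extraction, making the model suitable for resource-constrained settings such as edge computing.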

Abstract: Shrimp diseases are one of the primary causes of economic losses in shrimp aquaculture. To prevent disease transmission and enhance intelligent detection efficiency in shrimp farming, this paper proposes a lightweight network architecture based on YOLOv8n. First, by designing the RLDD detection head and C2f-EMCM module, the model reduces computational complexity while maintaining detection accuracy, improving computational efficiency. Subsequently, an improved SegNext_Attention self-attention mechanism is introduced to further enhance the model’s feature extraction capability, enabling more precise identification of disease characteristics. Extensive experiments, including ablation studies and comparative evaluations, are conducted on a self-constructed shrimp disease dataset, with generalization tests extended to the URPC2020 dataset. Results demonstrate that the proposed model achieves a 32.3% reduction in parameters compared to the original YOLOv8n, with a mAP@0.5 of 92.7% (3% improvement over YOLOv8n). Additionally, the model outperforms other lightweight YOLO-series models in mAP@0.5, parameter count, and model size. Generalization experiments on the URPC2020 dataset further validate the model’s robustness, showing a 4.1% increase in mAP@0.5 compared to YOLOv8n. The proposed method achieves an optimal balance between accuracy and efficiency, providing reliable technical support for intelligent disease detection in shrimp aquaculture.

[22] LocalDyGS: Multi-view Global Dynamic Scene Modeling via Adaptive Local Implicit Feature Decoupling cs.CV

Jiahao Wu, Rui Peng, Jianbo Jiao, Jiayu Yang, Luyang Tang

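TL;DR: LocalDyGS proposes a new dynamic scene modeling method that decomposes a complex dynamic scene into local spaces and decouples static and dynamic features, so as to handle both large-scale and fine-scale motion.

Details

Motivation: Complex dynamic motion in the real world makes video synthesis from multi-view inputs challenging. Existing methods (such as neural radiance fields or 3D Gaussian splatting) struggle to model both fine-scale and large-scale motion, limiting their application.

Result: Performs well on fine-scale datasets and is the first attempt to model larger, more complex highly dynamic scenes.

Insight: Through local space decomposition and feature decoupling, LocalDyGS models complex dynamic scenes more realistically, addressing the weakness of existing methods in modeling large-scale dynamic motion.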

Abstract: Due to the complex and highly dynamic motions in the real world, synthesizing dynamic videos from multi-view inputs for arbitrary viewpoints is challenging. Previous works based on neural radiance field or 3D Gaussian splatting are limited to modeling fine-scale motion, greatly restricting their application. In this paper, we introduce LocalDyGS, which consists of two parts to adapt our method to both large-scale and fine-scale motion scenes: 1) We decompose a complex dynamic scene into streamlined local spaces defined by seeds, enabling global modeling by capturing motion within each local space. 2) We decouple static and dynamic features for local space motion modeling. A static feature shared across time steps captures static information, while a dynamic residual field provides time-specific features. These are combined and decoded to generate Temporal Gaussians, modeling motion within each local space. As a result, we propose a novel dynamic scene reconstruction framework to model highly dynamic real-world scenes more realistically. Our method not only demonstrates competitive performance on various fine-scale datasets compared to state-of-the-art (SOTA) methods, but also represents the first attempt to model larger and more complex highly dynamic scenes. Project page: https://wujh2001.github.io/LocalDyGS/.

[23] UVLM: Benchmarking Video Language Model for Underwater World Understanding cs.CV

Xizhe Xue, Yang Zhou, Dawei Yan, Ying Li, Haokui Zhang

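TL;DR: The paper proposes UVLM, a video language model benchmark dataset designed for underwater world understanding, which addresses the restriction of existing models to terrestrial scenes and is validated through its multi-perspective design and experiments.

Details

Motivation: Existing video language models (VidLMs) focus mainly on terrestrial scenes, while the application needs of underwater observation have been overlooked. The authors therefore aim to fill this gap by building a high-quality, diverse underwater dataset.

Result: Experiments show that VidLMs fine-tuned on UVLM substantially improve underwater understanding, with slight gains on in-air benchmarks (such as VideoMME and Perception text) as well.

Insight: UVLM demonstrates the importance of designing dedicated datasets and tasks for specific domains (such as underwater), providing a reference for developing future domain-specific models.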

Abstract: Recently, the remarkable success of large language models (LLMs) has had a profound impact on the field of artificial intelligence. Numerous advanced works based on LLMs have been proposed and applied in various scenarios. Among them, video language models (VidLMs) are particularly widely used. However, existing works primarily focus on terrestrial scenarios, overlooking the highly demanding application needs of underwater observation. To close this gap, we introduce UVLM, an underwater observation benchmark that is built through a collaborative approach combining human expertise and AI models. To ensure data quality, we considered the design in depth from multiple perspectives. First, to address the unique challenges of underwater environments, we selected videos that represent typical underwater challenges including light variations, water turbidity, and diverse viewing angles to construct the dataset. Second, to ensure data diversity, the dataset covers a wide range of frame rates, resolutions, 419 classes of marine animals, and various static plants and terrains. Next, for task diversity, we adopted a structured design where observation targets are categorized into two major classes: biological and environmental. Each category includes content observation and change/action observation, totaling 20 distinct task types. Finally, we designed several challenging evaluation metrics to enable quantitative comparison and analysis of different methods. Experiments on two representative VidLMs demonstrate that fine-tuning VidLMs on UVLM significantly improves underwater world understanding while also showing potential for slight improvements on existing in-air VidLM benchmarks, such as VideoMME and Perception text. The dataset and prompt engineering will be released publicly.

[24] PLOT: Pseudo-Labeling via Video Object Tracking for Scalable Monocular 3D Object Detection cs.CV

Seokyeong Lee, Sithu Aung, Junyong Choi, Seungryong Kim, Ig-Jae Kim

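TL;DR: The paper proposes PLOT, a pseudo-labeling method based on video object tracking, to address data scarcity in monocular 3D object detection (M3OD) and improve detection robustness and scalability.

Details

Motivation: Monocular 3D object detection suffers from data scarcity caused by high annotation costs and 2D-to-3D ambiguity; existing methods mostly rely on domain-specific learning or on information from a single observation, limiting their practicality.

Result: Experiments show that the method performs well in reliability and scalability, offering a practical solution for monocular 3D object detection.

Insight: Temporal information in video can effectively alleviate the ambiguity of monocular 3D detection, and pseudo-labeling has considerable potential when data is scarce.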

Abstract: Monocular 3D object detection (M3OD) has long faced challenges due to data scarcity caused by high annotation costs and inherent 2D-to-3D ambiguity. Although various weakly supervised methods and pseudo-labeling methods have been proposed to address these issues, they are mostly limited by domain-specific learning or rely solely on shape information from a single observation. In this paper, we propose a novel pseudo-labeling framework that uses only video data and is more robust to occlusion, without requiring a multi-view setup, additional sensors, camera poses, or domain-specific training. Specifically, we explore a technique for aggregating the pseudo-LiDARs of both static and dynamic objects across temporally adjacent frames using object point tracking, enabling 3D attribute extraction in scenarios where 3D data acquisition is infeasible. Extensive experiments demonstrate that our method ensures reliable accuracy and strong scalability, making it a practical and effective solution for M3OD.

[25] Continual Multiple Instance Learning with Enhanced Localization for Histopathological Whole Slide Image Analysis cs.CV

Byung Hyun Lee, Wongi Jeong, Woojae Han, Kyoungbun Lee, Se Young Chun

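TL;DR: The paper proposes CoMEL, a continual multiple instance learning framework for analyzing histopathological whole slide images, aimed at the forgetting and limited localization accuracy of existing methods in continual tasks.

Details

Motivation: Existing MIL methods reduce annotation costs on large-scale images (such as histopathological whole slide images), but they are prone to forgetting in continual tasks and localize poorly, especially in pathology images that lack global relationships.

Result: On three public datasets, CoMEL improves bag-level accuracy by up to 11% and localization accuracy by up to 23.4% in the continual MIL setting.

Insight: By combining efficient instance encoding, reliable pseudo-labeling, and forgetting mitigation, CoMEL significantly improves performance on continual pathology image tasks.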

Abstract: Multiple instance learning (MIL) significantly reduced annotation costs via bag-level weak labels for large-scale images, such as histopathological whole slide images (WSIs). However, its adaptability to continual tasks with minimal forgetting has been rarely explored, especially on instance classification for localization. Weakly incremental learning for semantic segmentation has been studied for continual localization, but it focused on natural images, leveraging global relationships among hundreds of small patches (e.g., $16 \times 16$) using pre-trained models. This approach seems infeasible for MIL localization due to enormous amounts ($\sim 10^5$) of large patches (e.g., $256 \times 256$) and no available global relationships such as cancer cells. To address these challenges, we propose Continual Multiple Instance Learning with Enhanced Localization (CoMEL), an MIL framework for both localization and adaptability with minimal forgetting. CoMEL consists of (1) Grouped Double Attention Transformer (GDAT) for efficient instance encoding, (2) Bag Prototypes-based Pseudo-Labeling (BPPL) for reliable instance pseudo-labeling, and (3) Orthogonal Weighted Low-Rank Adaptation (OWLoRA) to mitigate forgetting in both bag and instance classification. Extensive experiments on three public WSI datasets demonstrate superior performance of CoMEL, outperforming the prior arts by up to $11.00\%$ in bag-level accuracy and up to $23.4\%$ in localization accuracy under the continual MIL setup.

[26] Beyond Spatial Frequency: Pixel-wise Temporal Frequency-based Deepfake Video Detection cs.CV | cs.AI

Taehoon Kim, Jongwook Choi, Yonghyun Jeong, Haeun Noh, Jaejun Yoo

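TL;DR: The paper proposes a deepfake video detection method based on pixel-wise temporal frequency, which extracts temporal inconsistency features with a 1D Fourier transform and combines an attention mechanism with a joint transformer module to significantly improve detection performance.

Details

Motivation: Traditional spatial frequency-based detectors overlook pixel-level temporal inconsistencies and cannot effectively detect forgery traces along the time dimension. The paper aims to solve this by using temporal frequency features to improve detection accuracy.

Result: The method is robust across diverse and challenging scenarios and significantly outperforms traditional spatial frequency-based detectors.

Insight: Temporal frequency features are key to detecting deepfake videos, and pixel-level analysis captures forgery traces more precisely.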

Abstract: We introduce a deepfake video detection approach that exploits pixel-wise temporal inconsistencies, which traditional spatial frequency-based detectors often overlook. Traditional detectors represent temporal information merely by stacking spatial frequency spectra across frames, resulting in the failure to detect temporal artifacts in the pixel plane. Our approach performs a 1D Fourier transform on the time axis for each pixel, extracting features highly sensitive to temporal inconsistencies, especially in areas prone to unnatural movements. To precisely locate regions containing the temporal artifacts, we introduce an attention proposal module trained in an end-to-end manner. Additionally, our joint transformer module effectively integrates pixel-wise temporal frequency features with spatio-temporal context features, expanding the range of detectable forgery artifacts. Our framework represents a significant advancement in deepfake video detection, providing robust performance across diverse and challenging detection scenarios.
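
The core operation described above, a 1D Fourier transform along the time axis for each pixel, can be sketched in plain Python. This is only an illustration of that one step (a direct O(n²) DFT on nested lists), not the paper's feature extractor; the `video[t][row][col]` layout and both function names are assumptions.

```python
import cmath


def temporal_spectrum(series):
    """Magnitude of the 1D DFT of one pixel's intensity over time.

    Returns the non-negative frequency bins 0 .. n // 2.
    """
    n = len(series)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(series)))
        for k in range(n // 2 + 1)
    ]


def pixelwise_temporal_spectra(video):
    """Apply the temporal DFT independently to every pixel.

    `video` is indexed as video[t][row][col]; the result is indexed as
    spectra[row][col][k], where k is the temporal frequency bin.
    """
    frames, rows, cols = len(video), len(video[0]), len(video[0][0])
    return [
        [temporal_spectrum([video[t][r][c] for t in range(frames)]) for c in range(cols)]
        for r in range(rows)
    ]
```

A pixel that stays constant over time has all its energy in bin 0, while a pixel that flickers between frames concentrates energy in the high-frequency bins, which is the kind of signal a temporal-frequency detector can pick up. In practice an FFT routine from a numerical library would replace the direct DFT.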

[27] TABNet: A Triplet Augmentation Self-Recovery Framework with Boundary-Aware Pseudo-Labels for Medical Image Segmentation cs.CV | cs.LG

Peilin Zhang, Shaouxan Wua, Jun Feng, Zhuo Jin, Zhizezhang Gao

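TL;DR: TABNet proposes a weakly supervised medical image segmentation framework based on triplet augmentation self-recovery and boundary-aware pseudo-labels, which significantly outperforms existing methods under sparse (scribble) annotation and approaches fully supervised performance.

Details

Motivation: In medical image segmentation, fully annotated data is costly and time-consuming to obtain, while sparse annotations (such as scribbles) are efficient to produce but lack boundary supervision and limit feature learning of the target region, constraining segmentation performance.

Result: Outperforms existing weakly supervised methods on the ACDC and MSCMR seg datasets, with performance close to that of fully supervised models.

Insight: Using diverse augmentations and a boundary-aware loss, TABNet shows that strengthened feature learning and boundary refinement can bring segmentation under sparse annotation close to fully supervised performance.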

Abstract: Background and objective: Medical image segmentation is a core task in various clinical applications. However, acquiring large-scale, fully annotated medical image datasets is both time-consuming and costly. Scribble annotations, as a form of sparse labeling, provide an efficient and cost-effective alternative for medical image segmentation. However, the sparsity of scribble annotations limits the feature learning of the target region and lacks sufficient boundary supervision, which poses significant challenges for training segmentation networks. Methods: We propose TAB Net, a novel weakly-supervised medical image segmentation framework, consisting of two key components: the triplet augmentation self-recovery (TAS) module and the boundary-aware pseudo-label supervision (BAP) module. The TAS module enhances feature learning through three complementary augmentation strategies: intensity transformation improves the model’s sensitivity to texture and contrast variations, cutout forces the network to capture local anatomical structures by masking key regions, and jigsaw augmentation strengthens the modeling of global anatomical layout by disrupting spatial continuity. By guiding the network to recover complete masks from diverse augmented inputs, TAS promotes a deeper semantic understanding of medical images under sparse supervision. The BAP module enhances pseudo-supervision accuracy and boundary modeling by fusing dual-branch predictions into a loss-weighted pseudo-label and introducing a boundary-aware loss for fine-grained contour refinement. Results: Experimental evaluations on two public datasets, ACDC and MSCMR seg, demonstrate that TAB Net significantly outperforms state-of-the-art methods for scribble-based weakly supervised segmentation. Moreover, it achieves performance comparable to that of fully supervised methods.
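
The three augmentation types named for the TAS module can be sketched on a small 2D image as below. These are generic textbook versions, not the paper's exact transforms: the gamma-style intensity change, the square cutout, the tile-shuffling jigsaw, and all parameter names are assumptions, and `jigsaw` assumes the image size is divisible by `grid`.

```python
import random


def intensity_transform(image, gamma):
    """Gamma-style intensity change for pixel values in [0, 1]."""
    return [[pixel ** gamma for pixel in row] for row in image]


def cutout(image, top, left, size):
    """Zero out a size x size square to hide a local region."""
    out = [row[:] for row in image]
    for r in range(top, min(top + size, len(out))):
        for c in range(left, min(left + size, len(out[0]))):
            out[r][c] = 0.0
    return out


def jigsaw(image, grid, rng):
    """Split the image into grid x grid tiles and shuffle their positions."""
    tile_h, tile_w = len(image) // grid, len(image[0]) // grid
    order = list(range(grid * grid))
    rng.shuffle(order)
    out = [row[:] for row in image]
    for dst, src in enumerate(order):
        dr, dc = divmod(dst, grid)
        sr, sc = divmod(src, grid)
        for r in range(tile_h):
            for c in range(tile_w):
                out[dr * tile_h + r][dc * tile_w + c] = image[sr * tile_h + r][sc * tile_w + c]
    return out


if __name__ == "__main__":
    img = [[(4 * r + c) / 15 for c in range(4)] for r in range(4)]
    views = [
        intensity_transform(img, 0.5),
        cutout(img, 1, 1, 2),
        jigsaw(img, 2, random.Random(0)),
    ]
    print(len(views))
```

In a self-recovery setup, the network would receive each of the three views and be asked to predict the segmentation mask of the original image, so each view stresses a different cue (contrast, local structure, global layout).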

[28] Wildlife Target Re-Identification Using Self-supervised Learning in Non-Urban Settings cs.CV | cs.AI | cs.LG

Mufhumudzi Muthivhi, Terence L. van Zyl

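TL;DR: The paper explores self-supervised learning for wildlife re-identification in non-urban settings, proposing a method that needs no annotated data and learns features automatically from temporal image pairs, with performance exceeding that of supervised methods.

Details

Motivation: Existing wildlife re-identification methods rely on supervised learning and require large amounts of annotated data, limiting generalization and practicality. Self-supervised learning reduces the dependence on annotated data and improves robustness and adaptability in open-world scenarios.

Result: Self-supervised models are more robust when data is limited and outperform supervised methods on all downstream tasks.

Insight: Self-supervised learning can effectively reduce dependence on annotated data and improve the generalization of wildlife re-identification, offering a new direction for practical applications.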

Abstract: Wildlife re-identification aims to match individuals of the same species across different observations. Current state-of-the-art (SOTA) models rely on class labels to train supervised models for individual classification. This dependence on annotated data has driven the curation of numerous large-scale wildlife datasets. This study investigates Self-Supervised Learning (SSL) for wildlife re-identification. We automatically extract two distinct views of an individual using temporal image pairs from camera trap data without supervision. The image pairs train a self-supervised model from a potentially endless stream of video data. We evaluate the learnt representations against supervised features on open-world scenarios and transfer learning in various wildlife downstream tasks. The analysis of the experimental results shows that self-supervised models are more robust even with limited data. Moreover, self-supervised features outperform supervision across all downstream tasks. The code is available at https://github.com/pxpana/SSLWildlife.

[29] PosDiffAE: Position-aware Diffusion Auto-encoder For High-Resolution Brain Tissue Classification Incorporating Artifact Restoration cs.CV

Ayantika Das, Moitreya Chaudhuri, Koushik Bhat, Keerthi Ram, Mihail Bota

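TL;DR: This paper proposes PosDiffAE, a position-aware diffusion auto-encoder that combines a diffusion model with an auto-encoder for high-resolution brain tissue classification with artifact restoration. By structuring the latent space and implementing unsupervised artifact restoration techniques, the method performs well in brain tissue recognition and image restoration.

Details

Motivation: Diffusion models excel at generating high-fidelity images but lack the ability to extract semantic representations of images, which auto-encoders provide. Combining the two yields a model that both generates high-quality images and extracts semantic representations.

Result: Experiments show that PosDiffAE performs well on brain tissue classification and artifact restoration, effectively distinguishing tissue types and restoring image artifacts.

Insight: Combining the strengths of diffusion models and auto-encoders enables both high-quality image generation and semantic representation extraction. Structuring the latent space can further improve interpretability and performance.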

Abstract: Denoising diffusion models produce high-fidelity image samples by capturing the image distribution in a progressive manner while initializing with a simple distribution and compounding the distribution complexity. Although these models have unlocked new applications, the sampling mechanism of diffusion does not offer means to extract image-specific semantic representation, which is inherently provided by auto-encoders. The encoding component of auto-encoders enables mapping between a specific image and its latent space, thereby offering explicit means of enforcing structures in the latent space. By integrating an encoder with the diffusion model, we establish an auto-encoding formulation, which learns image-specific representations and offers means to organize the latent space. In this work, first, we devise a mechanism to structure the latent space of a diffusion auto-encoding model, towards recognizing region-specific cellular patterns in brain images. We enforce the representations to regress positional information of the patches from high-resolution images. This creates a conducive latent space for differentiating tissue types of the brain. Second, we devise an unsupervised tear artifact restoration technique based on neighborhood awareness, utilizing latent representations and the constrained generation capability of diffusion models during inference. Third, through representational guidance and leveraging the inference time steerable noising and denoising capability of diffusion, we devise an unsupervised JPEG artifact restoration technique.

[30] AvatarMakeup: Realistic Makeup Transfer for 3D Animatable Head Avatars cs.CV

Yiming Zhong, Xiaolin Zhang, Ligang Liu, Yao Zhao, Yunchao Wei

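TL;DR: The paper proposes AvatarMakeup, which uses a pretrained diffusion model to transfer makeup patterns from a single reference photo onto 3D animatable head avatars, addressing the failure of existing methods to maintain consistency, preserve identity, and offer fine control under dynamic expressions and multiple views.

Details

Motivation: Existing 3D Gaussian editing methods fall short in achieving realistic makeup effects, particularly with respect to dynamic expressions, identity preservation, and fine control. The paper aims to develop a 3D makeup method that meets these requirements.

Result: Experiments show that AvatarMakeup achieves state-of-the-art makeup transfer quality and animation consistency.

Insight: Diffusion models hold promise for 3D makeup tasks but need targeted design to ensure dynamic and multi-view consistency. A coarse-to-fine optimization pipeline is key to high-quality 3D makeup.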

Abstract: Similar to facial beautification in real life, 3D virtual avatars require personalized customization to enhance their visual appeal, yet this area remains insufficiently explored. Although current 3D Gaussian editing methods can be adapted for facial makeup purposes, these methods fail to meet the fundamental requirements for achieving realistic makeup effects: 1) ensuring a consistent appearance during drivable expressions, 2) preserving the identity throughout the makeup process, and 3) enabling precise control over fine details. To address these, we propose a specialized 3D makeup method named AvatarMakeup, leveraging a pretrained diffusion model to transfer makeup patterns from a single reference photo of any individual. We adopt a coarse-to-fine idea to first maintain the consistent appearance and identity, and then to refine the details. In particular, the diffusion model is employed to generate makeup images as supervision. Due to the uncertainties in diffusion process, the generated images are inconsistent across different viewpoints and expressions. Therefore, we propose a Coherent Duplication method to coarsely apply makeup to the target while ensuring consistency across dynamic and multiview effects. Coherent Duplication optimizes a global UV map by recoding the averaged facial attributes among the generated makeup images. By querying the global UV map, it easily synthesizes coherent makeup guidance from arbitrary views and expressions to optimize the target avatar. Given the coarse makeup avatar, we further enhance the makeup by incorporating a Refinement Module into the diffusion model to achieve high makeup quality. Experiments demonstrate that AvatarMakeup achieves state-of-the-art makeup transfer quality and consistency throughout animation.

[31] From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding cs.CV | cs.CL

Xiangfeng Wang, Xiao Li, Yadong Wei, Xueyu Song, Yang Song

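TL;DR: This paper proposes HIVE, a human-inspired automatic video editing framework that uses multimodal narrative understanding to condense long videos into engaging short clips, significantly outperforming existing baselines.

Details

Motivation: With the rise of short-video platforms, efficiently editing long videos into high-quality short clips has become an important need, yet existing methods rely mainly on textual cues and neglect the rich visual context, producing incoherent results.

Result: Experiments show that HIVE outperforms existing baselines on both general and advertisement editing tasks, significantly narrowing the quality gap between automatic and human-edited videos.

Insight: Multimodal understanding is key to improving the coherence and appeal of video editing, and dataset quality directly affects model performance.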

Abstract: The rapid growth of online video content, especially on short video platforms, has created a growing demand for efficient video editing techniques that can condense long-form videos into concise and engaging clips. Existing automatic editing methods predominantly rely on textual cues from ASR transcripts and end-to-end segment selection, often neglecting the rich visual context and leading to incoherent outputs. In this paper, we propose a human-inspired automatic video editing framework (HIVE) that leverages multimodal narrative understanding to address these limitations. Our approach incorporates character extraction, dialogue analysis, and narrative summarization through multimodal large language models, enabling a holistic understanding of the video content. To further enhance coherence, we apply scene-level segmentation and decompose the editing process into three subtasks: highlight detection, opening/ending selection, and pruning of irrelevant content. To facilitate research in this area, we introduce DramaAD, a novel benchmark dataset comprising over 800 short drama episodes and 500 professionally edited advertisement clips. Experimental results demonstrate that our framework consistently outperforms existing baselines across both general and advertisement-oriented editing tasks, significantly narrowing the quality gap between automatic and human-edited videos.

[32] Weakly-supervised Contrastive Learning with Quantity Prompts for Moving Infrared Small Target Detection cs.CV

Weiwei Duan, Luping Ji, Shengjia Chen, Sicheng Zhu, Jianghong Huang

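TL;DR: The paper proposes a weakly supervised contrastive learning scheme (WeCoL) for moving infrared small target detection that needs only target quantity prompts as training input, greatly reducing manual annotation. By combining potential target mining with contrastive learning, together with long-short term motion-aware learning, the scheme reaches over 90% of the performance of state-of-the-art fully supervised methods.

Details

Motivation: Moving infrared small target detection is highly challenging because targets are tiny and background contrast is low. Most existing methods are fully supervised and depend on large amounts of manual annotation, which is expensive and time-consuming. The paper aims to reduce annotation requirements through weak supervision.

Result: Experiments on the DAUB and ITSDT-15K datasets show that WeCoL outperforms early fully supervised methods and reaches over 90% of the performance of SOTA fully supervised methods.

Insight: Weakly supervised methods show promise for infrared small target detection: they effectively reduce annotation costs and improve detection reliability through contrastive learning and motion modeling.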

Abstract: Different from general object detection, moving infrared small target detection faces huge challenges due to tiny target size and weak background contrast. Currently, most existing methods are fully-supervised, heavily relying on a large number of manual target-wise annotations. However, manually annotating video sequences is often expensive and time-consuming, especially for low-quality infrared frame images. Inspired by general object detection, non-fully supervised strategies (e.g., weakly supervised) are believed to have potential for reducing annotation requirements. To break through traditional fully-supervised frameworks, as the first exploration work, this paper proposes a new weakly-supervised contrastive learning (WeCoL) scheme, which requires only simple target quantity prompts during model training. Specifically, in our scheme, based on the pretrained segment anything model (SAM), a potential target mining strategy is designed to integrate target activation maps and multi-frame energy accumulation. Besides, contrastive learning is adopted to further improve the reliability of pseudo-labels, by calculating the similarity between positive and negative samples in feature subspace. Moreover, we propose a long-short term motion-aware learning scheme to simultaneously model the local motion patterns and global motion trajectory of small targets. The extensive experiments on two public datasets (DAUB and ITSDT-15K) verify that our weakly-supervised scheme could often outperform early fully-supervised methods. Its performance could even reach over 90% of state-of-the-art (SOTA) fully-supervised ones.

[33] Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection cs.CV | cs.CL | cs.CR

Ziqi Miao, Yi Ding, Lijun Li, Jing Shao

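TL;DR: This paper proposes VisCo, a new vision-centric attack method that dynamically generates images and refines attack prompts to induce multimodal large language models (MLLMs) to produce harmful responses, significantly raising toxicity scores and attack success rates.

Details

Motivation: Multimodal large language models (MLLMs) perform well on vision-language tasks, but security vulnerabilities in their visual modality make deployment in open-world environments challenging. Existing attack methods rely mostly on textual semantics and lack grounding in realistic visual context.

Result: VisCo achieves a toxicity score of 4.78 and an attack success rate of 85%, significantly outperforming the baseline (toxicity score 2.48, success rate 22.2%).

Insight: The importance of the visual modality as an attack vector is underestimated; VisCo demonstrates the potential of dynamic image generation and context refinement for increasing attack effectiveness.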

Abstract: With the emergence of strong visual-language capabilities, multimodal large language models (MLLMs) have demonstrated tremendous potential for real-world applications. However, the security vulnerabilities exhibited by the visual modality pose significant challenges to deploying such models in open-world environments. Recent studies have successfully induced harmful responses from target MLLMs by encoding harmful textual semantics directly into visual inputs. However, in these approaches, the visual modality primarily serves as a trigger for unsafe behavior, often exhibiting semantic ambiguity and lacking grounding in realistic scenarios. In this work, we define a novel setting: visual-centric jailbreak, where visual information serves as a necessary component in constructing a complete and realistic jailbreak context. Building on this setting, we propose the VisCo (Visual Contextual) Attack. VisCo fabricates contextual dialogue using four distinct visual-focused strategies, dynamically generating auxiliary images when necessary to construct a visual-centric jailbreak scenario. To maximize attack effectiveness, it incorporates automatic toxicity obfuscation and semantic refinement to produce a final attack prompt that reliably triggers harmful responses from the target black-box MLLMs. Specifically, VisCo achieves a toxicity score of 4.78 and an Attack Success Rate (ASR) of 85% on MM-SafetyBench against GPT-4o, significantly outperforming the baseline, which achieves a toxicity score of 2.48 and an ASR of 22.2%. The code is available at https://github.com/Dtc7w3PQ/Visco-Attack.

[34] CrowdTrack: A Benchmark for Difficult Multiple Pedestrian Tracking in Real Scenarios cs.CV | cs.AI

Teng Fu, Yuwen Chen, Zhuofan Chen, Mengyang Zhao, Bin Li

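TL;DR: The paper presents CrowdTrack, a large-scale, difficult multi-pedestrian tracking dataset that addresses the simple and unrealistic scenes of existing datasets and supports research on pedestrian tracking algorithms in complex scenarios.

Details

Motivation: Existing multi-object tracking datasets have relatively simple or unrealistic scenes and cannot meet the needs of algorithm research in complex scenarios, especially for a high-value application such as pedestrian tracking.

Result: The dataset provides a platform for research on pedestrian tracking algorithms in complex scenarios, and existing models were tested on it to assess their performance.

Insight: Multi-pedestrian tracking in complex scenes needs further research, especially on handling occlusion and partial visibility; existing algorithms still have room to improve in real, complex scenes.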

Abstract: Multi-object tracking is a classic field in computer vision. Within it, pedestrian tracking has extremely high application value and has become the most popular research category. Existing methods mainly use motion or appearance information for tracking, which is often difficult in complex scenarios. For the motion information, mutual occlusions between objects often prevent updating of the motion state; for the appearance information, non-robust results are often obtained due to reasons such as only partial visibility of the object or blurred images. Although learning how to perform tracking in these situations from annotated data is the simplest solution, the existing MOT datasets fail to support it. Existing datasets mainly have two drawbacks: relatively simple scene composition and non-realistic scenarios. Although some of the video sequences in existing datasets do not have the above-mentioned drawbacks, their number is far from adequate for research purposes. To this end, we propose a difficult large-scale dataset for multi-pedestrian tracking, shot mainly from the first-person view and all from real-life complex scenarios. We name it "CrowdTrack" because there are numerous objects in most of the sequences. Our dataset consists of 33 videos, containing a total of 5,185 trajectories. Each object is annotated with a complete bounding box and a unique object ID. The dataset will provide a platform to facilitate the development of algorithms that remain effective in complex situations. We analyzed the dataset comprehensively and tested multiple SOTA models on our dataset. Besides, we analyzed the performance of the foundation models on our dataset. The dataset and project code are released at: https://github.com/loseevaya/CrowdTrack.

[35] MedFormer: Hierarchical Medical Vision Transformer with Content-Aware Dual Sparse Selection Attention cs.CV

Zunhui Xia, Hongxing Li, Libin Lan

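TL;DR: MedFormer is an efficient medical vision transformer that uses a pyramid scaling structure and content-aware Dual Sparse Selection Attention (DSSA) to address task specificity and high computational cost, significantly improving performance on medical image classification and dense prediction tasks.

Details

Motivation: Existing medical vision transformer methods are highly task-specific and computationally expensive, which limits their generality and efficiency. MedFormer aims to address these problems with a more general and efficient solution.

Result: On datasets spanning multiple imaging modalities, MedFormer shows significant performance gains in image classification, semantic segmentation, and lesion detection.

Insight: 1. Combining a pyramid structure with a dynamic sparse attention mechanism can significantly improve the efficiency and performance of medical vision tasks. 2. Content-aware attention design helps the model stay robust in noisy conditions.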

Abstract: Medical image recognition serves as a key way to aid in clinical diagnosis, enabling more accurate and timely identification of diseases and abnormalities. Vision transformer-based approaches have proven effective in handling various medical recognition tasks. However, these methods encounter two primary challenges. First, they are often task-specific and architecture-tailored, limiting their general applicability. Second, they usually either adopt full attention to model long-range dependencies, resulting in high computational costs, or rely on handcrafted sparse attention, potentially leading to suboptimal performance. To tackle these issues, we present MedFormer, an efficient medical vision transformer with two key ideas. First, it employs a pyramid scaling structure as a versatile backbone for various medical image recognition tasks, including image classification and dense prediction tasks such as semantic segmentation and lesion detection. This structure facilitates hierarchical feature representation while reducing the computation load of feature maps, highly beneficial for boosting performance. Second, it introduces a novel Dual Sparse Selection Attention (DSSA) with content awareness to improve computational efficiency and robustness against noise while maintaining high performance. As the core building technique of MedFormer, DSSA is explicitly designed to attend to the most relevant content. In addition, a detailed theoretical analysis has been conducted, demonstrating that MedFormer has superior generality and efficiency in comparison to existing medical vision transformers. Extensive experiments on a variety of imaging modality datasets consistently show that MedFormer is highly effective in enhancing performance across all three above-mentioned medical image recognition tasks. The code is available at https://github.com/XiaZunhui/MedFormer.

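[36] Temporally-Aware Supervised Contrastive Learning for Polyp Counting in Colonoscopy cs.CV | cs.AI

Luca Parolari, Andrea Cherubini, Lamberto Ballan, Carlo Biffi

TL;DR: The paper proposes a polyp counting method built on temporally-aware supervised contrastive learning. By introducing temporally-aware soft targets and a temporal adjacency constraint, it substantially reduces errors in polyp tracking and clustering and outperforms prior methods.

Details

Motivation: Automated polyp counting matters for colonoscopy, but existing methods rely mainly on self-supervised learning and neglect temporal relationships, which degrades tracking and clustering. The authors therefore propose a temporally-aware supervised contrastive learning approach.

Result: Validated on public datasets, the method achieves a 2.2x reduction in fragmentation rate, setting a new state of the art.

Insight: Temporal information is crucial for polyp counting; exploiting temporal relationships can markedly improve the robustness of tracking and clustering.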

Abstract: Automated polyp counting in colonoscopy is a crucial step toward automated procedure reporting and quality control, aiming to enhance the cost-effectiveness of colonoscopy screening. Counting polyps in a procedure involves detecting and tracking polyps, and then clustering tracklets that belong to the same polyp entity. Existing methods for polyp counting rely on self-supervised learning and primarily leverage visual appearance, neglecting temporal relationships in both tracklet feature learning and clustering stages. In this work, we introduce a paradigm shift by proposing a supervised contrastive loss that incorporates temporally-aware soft targets. Our approach captures intra-polyp variability while preserving inter-polyp discriminability, leading to more robust clustering. Additionally, we improve tracklet clustering by integrating a temporal adjacency constraint, reducing false positive re-associations between visually similar but temporally distant tracklets. We train and validate our method on publicly available datasets and evaluate its performance with a leave-one-out cross-validation strategy. Results demonstrate a 2.2x reduction in fragmentation rate compared to prior approaches. Our results highlight the importance of temporal awareness in polyp counting, establishing a new state-of-the-art. Code is available at https://github.com/lparolari/temporally-aware-polyp-counting.
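
As a minimal sketch of the temporal adjacency constraint described in the abstract, the Python snippet below merges two tracklets only when their appearance embeddings are similar and the frame gap between them is small. The `Tracklet` fields, the cosine threshold `sim_thresh`, and the `max_gap` value are illustrative assumptions, not the authors' implementation or settings.

```python
import math
from dataclasses import dataclass


@dataclass
class Tracklet:
    start: int       # first frame index
    end: int         # last frame index
    embedding: list  # appearance feature (hypothetical, e.g. from an encoder)


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def temporal_gap(a, b):
    """Number of frames separating two tracklets (0 if they overlap)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def cluster_tracklets(tracklets, sim_thresh=0.8, max_gap=100):
    """Union-find clustering: link pairs that look alike AND are close in time."""
    parent = list(range(len(tracklets)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(tracklets)):
        for j in range(i + 1, len(tracklets)):
            a, b = tracklets[i], tracklets[j]
            similar = cosine(a.embedding, b.embedding) >= sim_thresh
            adjacent = temporal_gap(a, b) <= max_gap
            if similar and adjacent:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(tracklets))]


tracklets = [
    Tracklet(0, 50, [1.0, 0.0]),
    Tracklet(60, 120, [0.9, 0.1]),      # similar and 10 frames later: merged
    Tracklet(5000, 5100, [1.0, 0.05]),  # similar but temporally distant: kept apart
]
labels = cluster_tracklets(tracklets)
polyp_count = len(set(labels))
```

In this toy input the third tracklet looks like the first two but is thousands of frames away, so the constraint keeps it as a separate polyp instead of re-associating it.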

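[37] Automatic Labelling for Low-Light Pedestrian Detection cs.CV

Dimitrios Bouzoulas, Eerik Alamikkotervo, Risto Ojala

TL;DR: The paper proposes an automatic labeling method for low-light pedestrian detection: pedestrians are detected in infrared images and the labels are transferred to the paired RGB images. Experiments show that detectors trained on the generated labels outperform those trained on manual labels in most cases.

Details

Motivation: To address the lack of public datasets for pedestrian detection in low-light conditions.

Result: On the KAIST dataset, models trained on generated labels outperform models trained on manual labels in 6 out of 9 cases.

Insight: An automatic labeling pipeline can efficiently produce annotations for low-light conditions and improve detector performance.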

Abstract: Pedestrian detection in RGB images is a key task in pedestrian safety, as the most common sensor in autonomous vehicles and advanced driver assistance systems is the RGB camera. One challenge in RGB pedestrian detection, for which there do not appear to be large public datasets, is low-light conditions. As a solution, in this research, we propose an automated infrared-RGB labeling pipeline. The proposed pipeline consists of 1) infrared detection, where a fine-tuned model for infrared pedestrian detection is used; 2) a label transfer process from the infrared detections to their RGB counterparts; and 3) training object detection models using the generated labels for low-light RGB pedestrian detection. The research was performed using the KAIST dataset. For the evaluation, object detection models were trained on the generated autolabels and ground truth labels. When compared on a previously unseen image sequence, the results showed that the models trained on generated labels outperformed the ones trained on ground-truth labels in 6 out of 9 cases for the mAP@50 and mAP@50-95 metrics. The source code for this research is available at https://github.com/BouzoulasDimitrios/IR-RGB-Automated-LowLight-Pedestrian-Labeling

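[38] IMASHRIMP: Automatic White Shrimp (Penaeus vannamei) Biometrical Analysis from Laboratory Images Using Computer Vision and Deep Learning cs.CV | I.2.10; I.4.8

Abiam Remache González, Meriem Chagour, Timon Bijan Rüth, Raúl Trapiella Cañedo, Marina Martínez Soler

TL;DR: IMASHRIMP is an automated system for white shrimp morphological analysis that combines computer vision and deep learning to optimize genetic selection tasks in aquaculture.

Details

Motivation: Manual morphological analysis of white shrimp is prone to human error and inefficient, so an automated method is needed to improve accuracy and efficiency.

Result: The system reaches an mAP of 97.94% for pose estimation and a pixel-to-centimeter conversion error of 0.07 (+/- 0.1) cm, significantly reducing human error.

Insight: By pairing human and AI judgments in a "two-factor authentication" scheme, IMASHRIMP shows the potential of automation in aquaculture and supports more sustainable practices.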

Abstract: This paper introduces IMASHRIMP, an adapted system for the automated morphological analysis of white shrimp (Penaeus vannamei), aimed at optimizing genetic selection tasks in aquaculture. Existing deep learning and computer vision techniques were modified to address the specific challenges of shrimp morphology analysis from RGBD images. IMASHRIMP incorporates two discrimination modules, based on a modified ResNet-50 architecture, to classify images by the point of view and determine rostrum integrity. A “two-factor authentication (human and AI)” system is proposed; it reduces human error in view classification from 0.97% to 0% and in rostrum detection from 12.46% to 3.64%. Additionally, a pose estimation module was adapted from VitPose to predict 23 key points on the shrimp’s skeleton, with separate networks for lateral and dorsal views. A morphological regression module, using a Support Vector Machine (SVM) model, was integrated to convert pixel measurements to centimeter units. Experimental results show that the system effectively reduces human error, achieving a mean average precision (mAP) of 97.94% for pose estimation and a pixel-to-centimeter conversion error of 0.07 (+/- 0.1) cm. IMASHRIMP demonstrates the potential to automate and accelerate shrimp morphological analysis, enhancing the efficiency of genetic selection and contributing to more sustainable aquaculture practices. The code is available at https://github.com/AbiamRemacheGonzalez/ImaShrimp-public
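
The "two-factor authentication" idea can be illustrated with a small Python sketch: a label is accepted only when the human annotation and the classifier prediction agree, and disagreements are set aside. The abstract does not say how disagreements are resolved, so flagging them for re-inspection is an assumption here, as are the function name and the label strings.

```python
def two_factor_check(human_labels, model_labels):
    """Accept labels on which human and model agree; flag the rest.

    Both arguments map an image id to a label string. Images missing
    from model_labels are treated as disagreements (an assumption).
    """
    accepted, flagged = {}, []
    for image_id, human in human_labels.items():
        if model_labels.get(image_id) == human:
            accepted[image_id] = human
        else:
            flagged.append(image_id)
    return accepted, flagged


human = {"img1": "lateral", "img2": "dorsal", "img3": "lateral"}
model = {"img1": "lateral", "img2": "lateral", "img3": "lateral"}
accepted, flagged = two_factor_check(human, model)
```

Here `img2` is flagged because the two sources disagree, which is the kind of case a second look could catch before the image reaches the pose estimation stage.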

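[39] Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning cs.CV

Buzhen Huang, Chen Li, Chongyang Xu, Dongyue Lu, Jinnan Chen

TL;DR: The paper proposes a dual-branch optimization framework that combines appearance and proxemics reasoning to address the failure of existing methods to recover plausible poses in complex interaction scenes. A diffusion model learns proxemic behavior and pose priors and, combined with several constraints, enables accurate estimation of interactive motion from in-the-wild videos.

Details

Motivation: Existing human pose estimation methods struggle to recover plausible poses in complex interaction scenes with visual ambiguity and inter-person occlusion. Even large foundation models such as SAM cannot accurately distinguish human semantics in such challenging scenes.

Result: Experiments show that the method outperforms existing approaches on several benchmarks and accurately estimates interactive motion from in-the-wild videos captured in complex environments.

Insight: 1. Human appearance provides a straightforward cue for resolving complex interaction scenes. 2. Proxemics and physical laws are essential for recovering plausible poses.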

Abstract: Due to visual ambiguities and inter-person occlusions, existing human pose estimation methods cannot recover plausible close interactions from in-the-wild videos. Even state-of-the-art large foundation models (e.g., SAM) cannot accurately distinguish human semantics in such challenging scenarios. In this work, we find that human appearance can provide a straightforward cue to address these obstacles. Based on this observation, we propose a dual-branch optimization framework to reconstruct accurate interactive motions with plausible body contacts constrained by human appearances, social proxemics, and physical laws. Specifically, we first train a diffusion model to learn the human proxemic behavior and pose prior knowledge. The trained network and two optimizable tensors are then incorporated into a dual-branch optimization framework to reconstruct human motions and appearances. Several constraints based on 3D Gaussians, 2D keypoints, and mesh penetrations are also designed to assist the optimization. With the proxemics prior and diverse constraints, our method is capable of estimating accurate interactions from in-the-wild videos captured in complex environments. We further build a dataset with pseudo ground-truth interaction annotations, which may promote future research on pose estimation and human behavior understanding. Experimental results on several benchmarks demonstrate that our method outperforms existing approaches. The code and data are available at https://www.buzhenhuang.com/works/CloseApp.html.

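[40] Structure-aware Semantic Discrepancy and Consistency for 3D Medical Image Self-supervised Learning cs.CV

Tan Pan, Zhaorui Tan, Kaiyu Guo, Dongli Xu, Weidi Xu

TL;DR: The paper proposes a self-supervised learning framework named $S^2DC$ that strengthens semantic discrepancy through an optimal transport strategy and enforces structure-level semantic consistency using the neighborhood similarity distribution, thereby learning structure-aware representations of 3D medical images.

Details

Motivation: Existing self-supervised methods for 3D medical images typically partition images into fixed-size patches, ignoring variations in the location, scale, and morphology of anatomical structures.

Result: Evaluated on 10 datasets, 4 tasks, and 3 modalities, $S^2DC$ consistently outperforms existing methods.

Insight: Structure-aware semantic discrepancy and consistency are key to improving self-supervised learning on 3D medical images.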

Abstract: 3D medical image self-supervised learning (mSSL) holds great promise for medical analysis. Effectively supporting broader applications requires considering anatomical structure variations in location, scale, and morphology, which are crucial for capturing meaningful distinctions. However, previous mSSL methods partition images with fixed-size patches, often ignoring the structure variations. In this work, we introduce a novel perspective on 3D medical images with the goal of learning structure-aware representations. We assume that patches within the same structure share the same semantics (semantic consistency) while those from different structures exhibit distinct semantics (semantic discrepancy). Based on this assumption, we propose an mSSL framework named $S^2DC$, achieving Structure-aware Semantic Discrepancy and Consistency in two steps. First, $S^2DC$ enforces distinct representations for different patches to increase semantic discrepancy by leveraging an optimal transport strategy. Second, $S^2DC$ advances semantic consistency at the structural level based on neighborhood similarity distribution. By bridging patch-level and structure-level representations, $S^2DC$ achieves structure-aware representations. Thoroughly evaluated across 10 datasets, 4 tasks, and 3 modalities, our proposed method consistently outperforms the state-of-the-art methods in mSSL.

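[41] AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding cs.CV

Weili Xu, Enxin Song, Wenhao Chai, Xuexiang Wen, Tian Ye

TL;DR: AuroraLong replaces the Transformer-based LLM with an efficient linear RNN language model to cut the compute and memory cost of long video understanding, and further improves efficiency through visual token merging.

Details

Motivation: High computational complexity and memory cost limit practical long video understanding; Transformer-based models in particular scale quadratically and struggle with long input sequences. AuroraLong aims to lower this barrier with a linear RNN.

Result: Despite having only 2B parameters and being trained on public data, AuroraLong performs comparably to Transformer-based models of similar size on multiple video benchmarks, demonstrating the potential of linear RNNs for long video understanding.

Insight: Linear RNNs offer significant compute and memory efficiency on long-sequence tasks and could make long video understanding more widely accessible by lowering the computational barrier.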

Abstract: The challenge of long video understanding lies in its high computational complexity and prohibitive memory cost, since the memory and computation required by transformer-based LLMs scale quadratically with input sequence length. We propose AuroraLong to address this challenge by replacing the LLM component in MLLMs with a linear RNN language model that handles input sequences of arbitrary length with constant-size hidden states. To further increase throughput and efficiency, we combine visual token merge with linear RNN models by reordering the visual tokens by their sizes in ascending order. Despite having only 2B parameters and being trained exclusively on public data, AuroraLong achieves performance comparable to Transformer-based models of similar size trained on private datasets across multiple video benchmarks. This demonstrates the potential of efficient, linear RNNs to democratize long video understanding by lowering its computational entry barrier. To the best of our knowledge, we are the first to use a linear RNN based LLM backbone in a LLaVA-like model for open-ended video understanding.
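
Two ideas from the abstract, reordering merged visual tokens by size and processing them with a constant-size recurrent state, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: `size` is taken to mean the number of patches merged into a token, and the recurrence `h = decay * h + x` is a generic linear RNN, not the model used in the paper.

```python
def reorder_by_size(tokens):
    """Sort merged visual tokens by size in ascending order (stable for ties)."""
    return sorted(tokens, key=lambda t: t["size"])


def linear_rnn(features, decay=0.5):
    """Toy linear recurrence h_t = decay * h_{t-1} + x_t.

    The state h has the same length as one feature vector no matter
    how many tokens are consumed, so memory does not grow with the
    sequence length.
    """
    h = [0.0] * len(features[0])
    for x in features:
        h = [decay * hi + xi for hi, xi in zip(h, x)]
    return h


tokens = [
    {"size": 4, "feat": [1.0, 0.0]},
    {"size": 1, "feat": [0.0, 1.0]},
    {"size": 2, "feat": [1.0, 1.0]},
]
ordered = reorder_by_size(tokens)
sizes = [t["size"] for t in ordered]
state = linear_rnn([t["feat"] for t in ordered])
```

Because a recurrent model weights recent inputs more heavily, the order in which tokens arrive matters; placing the largest merged tokens last is one plausible reading of why the ascending order is used.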

Riccardo Gallon, Fabian Schiemenz, Alessandra Menicucci, Eberhard Gill

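TL;DR: The paper addresses camera sensor faults in vision-based navigation and proposes a simulation framework that generates a dataset of faulty images for training and testing AI-based fault detection algorithms.

Details

Motivation: Vision-based navigation (VBN) is increasingly important in space missions, but sensor faults can cause navigation algorithms to fail or produce erroneous data, and representative datasets to support AI-based fault detection are lacking.

Result: A synthetic dataset of fault-injected images was generated for training and testing AI fault detection algorithms, providing a valuable tool for research on the reliability of vision-based navigation.

Insight: Synthetic data can fill the gap left by scarce real-world fault data and opens a new avenue for AI-based fault detection in safety-critical systems.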

Abstract: The increasing importance of Vision-Based Navigation (VBN) algorithms in space missions raises numerous challenges in ensuring their reliability and operational robustness. Sensor faults can lead to inaccurate outputs from navigation algorithms or even complete data processing faults, potentially compromising mission objectives. Artificial Intelligence (AI) offers a powerful solution for detecting such faults, overcoming many of the limitations associated with traditional fault detection methods. However, the primary obstacle to the adoption of AI in this context is the lack of sufficient and representative datasets containing faulty image data. This study addresses these challenges by focusing on an interplanetary exploration mission scenario. A comprehensive analysis of potential fault cases in camera sensors used within the VBN pipeline is presented. The causes and effects of these faults are systematically characterized, including their impact on image quality and navigation algorithm performance, as well as commonly employed mitigation strategies. To support this analysis, a simulation framework is introduced to recreate faulty conditions in synthetically generated images, enabling a systematic and controlled reproduction of faulty data. The resulting dataset of fault-injected images provides a valuable tool for training and testing AI-based fault detection algorithms. The final link to the dataset will be added after an embargo period. For peer-reviewers, this private link is available.

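[43] AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models cs.CV

Ziyin Zhou, Yunpeng Luo, Yuanchen Wu, Ke Sun, Jiayi Ji

TL;DR: The paper proposes AIGI-Holmes, a model that detects AI-generated images (AIGI) with multimodal large language models (MLLMs), addressing the lack of explainability and generalization in existing methods.

Details

Motivation: With the rapid development of AI-generated content (AIGC) technology, highly realistic AI-generated images (AIGI) are being misused to spread misinformation, threatening public information security. Existing detection techniques lack explainability and generalize poorly to new generation techniques.

Result: Experiments on three benchmarks validate the effectiveness of AIGI-Holmes.

Insight: Combining the perception of a visual expert with the semantic reasoning of a large language model enables AI-generated image detection with human-verifiable explanations while improving generalization.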

Abstract: The rapid development of AI-generated content (AIGC) technology has led to the misuse of highly realistic AI-generated images (AIGI) in spreading misinformation, posing a threat to public information security. Although existing AIGI detection techniques are generally effective, they face two issues: 1) a lack of human-verifiable explanations, and 2) a lack of generalization in the latest generation technology. To address these issues, we introduce a large-scale and comprehensive dataset, Holmes-Set, which includes the Holmes-SFTSet, an instruction-tuning dataset with explanations on whether images are AI-generated, and the Holmes-DPOSet, a human-aligned preference dataset. Our work introduces an efficient data annotation method called the Multi-Expert Jury, enhancing data generation through structured MLLM explanations and quality control via cross-model evaluation, expert defect filtering, and human preference modification. In addition, we propose Holmes Pipeline, a meticulously designed three-stage training framework comprising visual expert pre-training, supervised fine-tuning, and direct preference optimization. Holmes Pipeline adapts multimodal large language models (MLLMs) for AIGI detection while generating human-verifiable and human-aligned explanations, ultimately yielding our model AIGI-Holmes. During the inference stage, we introduce a collaborative decoding strategy that integrates the model perception of the visual expert with the semantic reasoning of MLLMs, further enhancing the generalization capabilities. Extensive experiments on three benchmarks validate the effectiveness of our AIGI-Holmes.

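[44] APT: Adaptive Personalized Training for Diffusion Models with Limited Data cs.CV | cs.AI | 60J60, 68T07 | I.2.6; I.2.10; I.4.9

JungWoo Chae, Jiyoon Kim, JaeWoong Choi, Kyungyul Kim, Sangheum Hwang

TL;DR: APT is a personalization training framework for diffusion models with limited data. Through adaptive training strategies and representation regularization, it mitigates overfitting while preserving the prior knowledge of the pretrained model.

Details

Motivation: Personalizing diffusion models with limited data tends to cause overfitting, loss of prior knowledge, and degraded text alignment; APT aims to address these problems.

Result: Experiments show that APT generates high-quality, diverse images from limited data and outperforms existing methods.

Insight: Detecting overfitting during training and adjusting the training strategy dynamically is crucial when data is limited, while representation regularization and attention alignment effectively preserve the model's semantic coherence.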

Abstract: Personalizing diffusion models using limited data presents significant challenges, including overfitting, loss of prior knowledge, and degradation of text alignment. Overfitting leads to shifts in the noise prediction distribution, disrupting the denoising trajectory and causing the model to lose semantic coherence. In this paper, we propose Adaptive Personalized Training (APT), a novel framework that mitigates overfitting by employing adaptive training strategies and regularizing the model’s internal representations during fine-tuning. APT consists of three key components: (1) Adaptive Training Adjustment, which introduces an overfitting indicator to detect the degree of overfitting at each time step bin and applies adaptive data augmentation and adaptive loss weighting based on this indicator; (2) Representation Stabilization, which regularizes the mean and variance of intermediate feature maps to prevent excessive shifts in noise prediction; and (3) Attention Alignment for Prior Knowledge Preservation, which aligns the cross-attention maps of the fine-tuned model with those of the pretrained model to maintain prior knowledge and semantic coherence. Through extensive experiments, we demonstrate that APT effectively mitigates overfitting, preserves prior knowledge, and outperforms existing methods in generating high-quality, diverse images with limited reference data.
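
The Representation Stabilization component regularizes the mean and variance of intermediate feature maps. The abstract does not give the exact form of the regularizer, so the Python sketch below assumes one simple possibility: a squared penalty on how far the fine-tuned feature statistics drift from those of a reference (pretrained) model. The function name and the flat-list representation of a feature map are hypothetical.

```python
from statistics import fmean, pvariance


def stabilization_penalty(finetuned, reference):
    """Squared drift of the mean and variance of a feature map.

    Both arguments are flat lists of activations from the same layer.
    Returns 0.0 when the two sets of statistics coincide.
    """
    d_mean = fmean(finetuned) - fmean(reference)
    d_var = pvariance(finetuned) - pvariance(reference)
    return d_mean ** 2 + d_var ** 2


reference = [0.0, 2.0, 4.0]  # mean 2, variance 8/3
shifted = [1.0, 3.0, 5.0]    # mean 3, same variance
penalty = stabilization_penalty(shifted, reference)
no_drift = stabilization_penalty(reference, reference)
```

In a training loop, a term like this would be added to the diffusion loss with a weighting coefficient, so that fine-tuning on a handful of images is discouraged from shifting the feature statistics.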

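[45] CanonSwap: High-Fidelity and Consistent Video Face Swapping via Canonical Space Modulation cs.CV

Xiangyang Luo, Ye Zhu, Yunfei Liu, Lijian Lin, Cong Wan

TL;DR: CanonSwap proposes a new video face-swapping framework that decouples motion from appearance and performs identity replacement in a unified canonical space while preserving the dynamic attributes of the target face, achieving high visual quality and temporal consistency.

Details

Motivation: Existing video face-swapping methods struggle to preserve the target face's dynamic attributes (such as pose and expression) during identity transfer, leading to inconsistent results. CanonSwap aims to solve this problem.

Result: Experiments show that CanonSwap outperforms existing methods in visual quality, temporal consistency, and identity preservation.

Insight: Decoupling motion from appearance is key to high-quality video face swapping; partial identity modulation and fine-grained synchronization metrics can further improve result quality.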

Abstract: Video face swapping aims to address two primary challenges: effectively transferring the source identity to the target video and accurately preserving the dynamic attributes of the target face, such as head poses, facial expressions, lip-sync, etc. Existing methods mainly focus on achieving high-quality identity transfer but often fall short in maintaining the dynamic attributes of the target face, leading to inconsistent results. We attribute this issue to the inherent coupling of facial appearance and motion in videos. To address this, we propose CanonSwap, a novel video face-swapping framework that decouples motion information from appearance information. Specifically, CanonSwap first eliminates motion-related information, enabling identity modification within a unified canonical space. Subsequently, the swapped feature is reintegrated into the original video space, ensuring the preservation of the target face’s dynamic attributes. To further achieve precise identity transfer with minimal artifacts and enhanced realism, we design a Partial Identity Modulation module that adaptively integrates source identity features using a spatial mask to restrict modifications to facial regions. Additionally, we introduce several fine-grained synchronization metrics to comprehensively evaluate the performance of video face swapping methods. Extensive experiments demonstrate that our method significantly outperforms existing approaches in terms of visual quality, temporal consistency, and identity preservation. Our project page is publicly available at https://luoxyhappy.github.io/CanonSwap/.

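[46] UniMC: Taming Diffusion Transformer for Unified Keypoint-Guided Multi-Class Image Generation cs.CV

Qin Guo, Ailing Zeng, Dongxu Yue, Ceyuan Yang, Yang Cao

TL;DR: UniMC proposes a DiT-based framework for unified control of multi-class image generation. By combining instance-level and keypoint-level conditions, it addresses the limitations of existing methods on non-rigid objects and multi-class generation. The authors also release the HAIG-2.9M dataset to support high-quality human and animal image generation.

Details

Motivation: Existing keypoint-guided text-to-image diffusion models perform poorly on non-rigid objects (such as animals) and on scenes with multiple overlapping classes, mainly because of the limitations of existing control methods and the lack of suitable datasets.

Result: Experiments show that UniMC performs well under heavy occlusion and in multi-class scenes, and the high quality of the HAIG-2.9M dataset supports the method's effectiveness.

Insight: Unified condition encoding and a high-quality dataset enable more general keypoint-guided image generation, especially for multi-class scenes and non-rigid objects.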

Abstract: Although significant advancements have been achieved in the progress of keypoint-guided Text-to-Image diffusion models, existing mainstream keypoint-guided models encounter challenges in controlling the generation of more general non-rigid objects beyond humans (e.g., animals). Moreover, it is difficult to generate multiple overlapping humans and animals based on keypoint controls solely. These challenges arise from two main aspects: the inherent limitations of existing controllable methods and the lack of suitable datasets. First, we design a DiT-based framework, named UniMC, to explore unifying controllable multi-class image generation. UniMC integrates instance- and keypoint-level conditions into compact tokens, incorporating attributes such as class, bounding box, and keypoint coordinates. This approach overcomes the limitations of previous methods that struggled to distinguish instances and classes due to their reliance on skeleton images as conditions. Second, we propose HAIG-2.9M, a large-scale, high-quality, and diverse dataset designed for keypoint-guided human and animal image generation. HAIG-2.9M includes 786K images with 2.9M instances. This dataset features extensive annotations such as keypoints, bounding boxes, and fine-grained captions for both humans and animals, along with rigorous manual inspection to ensure annotation accuracy. Extensive experiments demonstrate the high quality of HAIG-2.9M and the effectiveness of UniMC, particularly in heavy occlusions and multi-class scenarios.

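[47] FairHuman: Boosting Hand and Face Quality in Human Image Generation with Minimum Potential Delay Fairness in Diffusion Models cs.CV | cs.AI

Yuxuan Wang, Tianwei Cao, Huayu Zhang, Zhongjiang He, Kongming Liang

TL;DR: FairHuman proposes a multi-objective fine-tuning method that uses the minimum potential delay fairness criterion to improve the quality of hands and faces generated by diffusion models.

Details

Motivation: Existing generative models produce insufficient quality in local details such as faces and hands when generating human images, because these regions lack targeted supervision.

Result: Experiments show that the method significantly improves the quality of generated hand and face details while maintaining overall quality.

Insight: Multi-objective optimization combined with a fairness criterion effectively addresses poor local detail quality in generative models.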

Abstract: Image generation has achieved remarkable progress with the development of large-scale text-to-image models, especially diffusion-based models. However, generating human images with plausible details, such as faces or hands, remains challenging due to insufficient supervision of local regions during training. To address this issue, we propose FairHuman, a multi-objective fine-tuning approach designed to enhance both global and local generation quality fairly. Specifically, we first construct three learning objectives: a global objective derived from the default diffusion objective function and two local objectives for hands and faces based on pre-annotated positional priors. Subsequently, we derive the optimal parameter updating strategy under the guidance of the Minimum Potential Delay (MPD) criterion, thereby attaining fairness-aware optimization for this multi-objective problem. Based on this, our proposed method can achieve significant improvements in generating challenging local details while maintaining overall quality. Extensive experiments showcase the effectiveness of our method in improving the performance of human image generation under different scenarios.

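[48] Prompt learning with bounding box constraints for medical image segmentation cs.CV

Mélanie Gaillochet, Mehrdad Noori, Sahar Dastani, Christian Desrosiers, Hervé Lombaert

TL;DR: This paper proposes a framework that combines the representational power of foundation models with the annotation efficiency of weakly supervised segmentation. It generates prompts automatically from bounding box annotations alone, with an optimization scheme that integrates box constraints and pseudo-labels, and performs well across multimodal datasets.

Details

Motivation: Medical image segmentation usually requires laborious pixel-level annotation, whereas bounding box annotations are easier to obtain. Existing prompt learning methods depend on fully annotated segmentation masks, which limits their applicability. This paper aims to improve segmentation efficiency through weak supervision that relies only on bounding box annotations.

Result: Achieves an average Dice score of 84.90% across multimodal datasets in a limited data setting, outperforming existing fully-supervised and weakly-supervised approaches.

Insight: Weak supervision combined with foundation models can substantially reduce annotation cost, and prompt learning shows promise for medical image segmentation.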

Abstract: Pixel-wise annotations are notoriously laborious and costly to obtain in the medical domain. To mitigate this burden, weakly supervised approaches based on bounding box annotations, which are much easier to acquire, offer a practical alternative. Vision foundation models have recently shown noteworthy segmentation performance when provided with prompts such as points or bounding boxes. Prompt learning exploits these models by adapting them to downstream tasks and automating segmentation, thereby reducing user intervention. However, existing prompt learning approaches depend on fully annotated segmentation masks. This paper proposes a novel framework that combines the representational power of foundation models with the annotation efficiency of weakly supervised segmentation. More specifically, our approach automates prompt generation for foundation models using only bounding box annotations. Our proposed optimization scheme integrates multiple constraints derived from box annotations with pseudo-labels generated by the prompted foundation model. Extensive experiments across multimodal datasets reveal that our weakly supervised method achieves an average Dice score of 84.90% in a limited data setting, outperforming existing fully-supervised and weakly-supervised approaches. The code is available at https://github.com/Minimel/box-prompt-learning-VFM.git
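
The abstract mentions "multiple constraints derived from box annotations" without listing them. As a hedged illustration, the Python sketch below checks two priors that are commonly used with box supervision and are assumed here, not taken from the paper: no foreground may fall outside the box, and every row and column inside the box should contain at least one foreground pixel (a tightness prior). In training these would be turned into differentiable penalties; the snippet only counts violations on a binary mask.

```python
def box_violations(mask, box):
    """Count violations of two box-derived priors on a binary mask.

    mask: 2D list of 0/1 values; box: (top, left, bottom, right), inclusive.
    """
    top, left, bottom, right = box
    # Prior 1: foreground pixels outside the box are not allowed.
    outside = sum(
        mask[r][c]
        for r in range(len(mask))
        for c in range(len(mask[0]))
        if not (top <= r <= bottom and left <= c <= right)
    )
    # Prior 2 (tightness): each row/column of the box should touch the object.
    empty_rows = sum(
        1 for r in range(top, bottom + 1) if not any(mask[r][left:right + 1])
    )
    empty_cols = sum(
        1 for c in range(left, right + 1)
        if not any(mask[r][c] for r in range(top, bottom + 1))
    )
    return {"outside": outside, "empty_rows": empty_rows, "empty_cols": empty_cols}


mask = [
    [0, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
box = (1, 1, 3, 2)
violations = box_violations(mask, box)
```

In the example, one foreground pixel lies outside the box and the bottom row of the box is empty, so a box-constrained loss would push the prediction to drop the stray pixel and extend downward.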

[49] DexVLG: Dexterous Vision-Language-Grasp Model at Scale cs.CV | cs.RO

Jiawei He, Danshi Li, Xinqiang Yu, Zekun Qi, Wenyao Zhang

TL;DR: DexVLG is a large-scale vision-language-grasp model for dexterous grasp pose prediction aligned with language instructions, trained on a synthetic dataset of 170 million grasp poses. It shows strong zero-shot generalization and part-grasp accuracy in both simulation and real-world experiments.

Details

Motivation: Existing vision-language-action systems focus on simple grippers, lacking research on dexterous hands for functional grasping. This paper addresses the gap by leveraging large-scale synthetic data to train a model for human-like dexterous grasping.

Result: Achieves over 76% zero-shot execution success rate and state-of-the-art part-grasp accuracy in simulation. Successful part-aligned grasps in real-world scenarios.

Insight: Large-scale synthetic data can effectively train dexterous grasp models, enabling robust zero-shot generalization and alignment with language instructions.

Abstract: As large models gain traction, vision-language-action (VLA) systems are enabling robots to tackle increasingly complex tasks. However, limited by the difficulty of data collection, progress has mainly focused on controlling simple gripper end-effectors. There is little research on functional grasping with large models for human-like dexterous hands. In this paper, we introduce DexVLG, a large Vision-Language-Grasp model for Dexterous grasp pose prediction aligned with language instructions using single-view RGBD input. To accomplish this, we generate a dataset of 170 million dexterous grasp poses mapped to semantic parts across 174,000 objects in simulation, paired with detailed part-level captions. This large-scale dataset, named DexGraspNet 3.0, is used to train a VLM and flow-matching-based pose head capable of producing instruction-aligned grasp poses for tabletop objects. To assess DexVLG’s performance, we create benchmarks in physics-based simulations and conduct real-world experiments. Extensive testing demonstrates DexVLG’s strong zero-shot generalization capabilities, achieving over 76% zero-shot execution success rate and state-of-the-art part-grasp accuracy in simulation, as well as successful part-aligned grasps on physical objects in real-world scenarios.

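[50] Linear Attention with Global Context: A Multipole Attention Mechanism for Vision and Physics cs.CV | cs.AI | cs.LG

Alex Colagrande, Paul Caillon, Eva Feillet, Alexandre Allauzen

TL;DR: The paper proposes MANO, a new linear attention mechanism that computes attention in a distance-based multiscale fashion, sharply reducing computational complexity while performing well on image classification and physics simulation tasks.

Details

Motivation: The quadratic complexity of standard Transformers makes them impractical for high-resolution inputs, and existing variants often sacrifice fine detail. Inspired by n-body numerical simulations, this paper proposes an attention mechanism with linear complexity.

Result: On image classification and Darcy flow tasks, MANO rivals ViT and Swin Transformer while greatly reducing compute and memory costs.

Insight: Borrowing the multiscale idea from physical simulation to restructure attention offers a new route to efficient processing of high-resolution inputs.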

Abstract: Transformers have become the de facto standard for a wide range of tasks, from image classification to physics simulations. Despite their impressive performance, the quadratic complexity of standard Transformers in both memory and time with respect to the input length makes them impractical for processing high-resolution inputs. Therefore, several variants have been proposed, the most successful relying on patchification, downsampling, or coarsening techniques, often at the cost of losing the finest-scale details. In this work, we take a different approach. Inspired by state-of-the-art techniques in $n$-body numerical simulations, we cast attention as an interaction problem between grid points. We introduce the Multipole Attention Neural Operator (MANO), which computes attention in a distance-based multiscale fashion. MANO maintains, in each attention head, a global receptive field and achieves linear time and memory complexity with respect to the number of grid points. Empirical results on image classification and Darcy flows demonstrate that MANO rivals state-of-the-art models such as ViT and Swin Transformer, while reducing runtime and peak memory usage by orders of magnitude. We open source our code for reproducibility at https://github.com/AlexColagrande/MANO.

Danrong Zhang, Huili Huang, N. Simrill Smith, Nimisha Roy, J. David Frost

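TL;DR: The paper reframes damage severity assessment in post-earthquake social media images as a semantic segmentation problem, yielding a more objective and comprehensive approach. Using a fine-tuned SegFormer model and a newly defined damage severity scoring system, it quantifies the degree of damage.

Details

Motivation: Traditional earthquake damage assessment relies mainly on classification methods, which are subjective and cannot accurately reflect the varying degrees of damage across regions of an image. The study therefore proposes a semantic segmentation framework for more precise damage analysis.

Result: The method assesses damage severity in post-earthquake social media images more objectively and provides more precise guidance for disaster response.

Insight: The semantic segmentation framework shows promise for earthquake damage assessment and could be extended to other natural disasters.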

Abstract: In the aftermath of earthquakes, social media images have become a crucial resource for disaster reconnaissance, providing immediate insights into the extent of damage. Traditional approaches to damage severity assessment in post-earthquake social media images often rely on classification methods, which are inherently subjective and incapable of accounting for the varying extents of damage within an image. Addressing these limitations, this study proposes a novel approach by framing damage severity assessment as a semantic segmentation problem, aiming for a more objective analysis of damage in earthquake-affected areas. The methodology involves the construction of a segmented damage severity dataset, categorizing damage into three degrees: undamaged structures, damaged structures, and debris. Utilizing this dataset, the study fine-tunes a SegFormer model to generate damage severity segmentations for post-earthquake social media images. Furthermore, a new damage severity scoring system is introduced, quantifying damage by considering the varying degrees of damage across different areas within images, adjusted for depth estimation. The application of this approach allows for the quantification of damage severity in social media images in a more objective and comprehensive manner. By providing a nuanced understanding of damage, this study enhances the ability to offer precise guidance to disaster reconnaissance teams, facilitating more effective and targeted response efforts in the aftermath of earthquakes.
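
The abstract describes a score that combines per-region damage degrees and adjusts for estimated depth, but gives no formula. The Python sketch below is one hypothetical way to build such a score: each segmented pixel contributes a class weight, scaled by the square of its estimated depth on the assumption that a pixel farther from the camera covers more real-world area. The class weights, the depth-squared scaling, and the function name are all assumptions for illustration.

```python
# Hypothetical severity weights for the three classes named in the abstract.
SEVERITY = {"undamaged": 0.0, "damaged": 0.5, "debris": 1.0}


def damage_score(labels, depths):
    """Depth-weighted mean severity over labeled pixels, in [0, 1].

    labels: per-pixel class names; depths: per-pixel estimated depth.
    Pixels whose label is not in SEVERITY (e.g. background) are ignored.
    """
    num = den = 0.0
    for label, depth in zip(labels, depths):
        if label not in SEVERITY:
            continue
        weight = depth ** 2  # assumed area correction for distant pixels
        num += SEVERITY[label] * weight
        den += weight
    return num / den if den else 0.0


labels = ["undamaged", "damaged", "debris", "background"]
depths = [1.0, 1.0, 2.0, 5.0]
score = damage_score(labels, depths)
```

With these inputs the single debris pixel dominates because it is twice as far away, which shows how a depth adjustment changes the result compared with a plain pixel count.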

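[52] No time to train! Training-Free Reference-Based Instance Segmentation cs.CV

Miguel Espinosa, Chenhongyi Yang, Linus Ericsson, Steven McDonagh, Elliot J. Crowley

TL;DR: The paper proposes a training-free, reference-based instance segmentation method that uses the semantic priors of foundation models to generate instance-level segmentation masks automatically, significantly improving segmentation performance.

Details

Motivation: Traditional image segmentation models depend on large amounts of annotated data. SAM reduces the annotation burden but still requires manual prompts. To address this, the paper studies object segmentation given only a small set of reference images.

Result: The method performs strongly on COCO FSOD, PASCAL VOC Few-Shot, and Cross-Domain FSOD, surpassing existing training-free approaches.

Insight: Leveraging the semantic priors of foundation models effectively reduces dependence on annotated data or manual prompts, suggesting a new direction for unsupervised or few-shot segmentation.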

Abstract: The performance of image segmentation models has historically been constrained by the high cost of collecting large-scale annotated data. The Segment Anything Model (SAM) alleviates this original problem through a promptable, semantics-agnostic, segmentation paradigm and yet still requires manual visual-prompts or complex domain-dependent prompt-generation rules to process a new image. Towards reducing this new burden, our work investigates the task of object segmentation when provided with, alternatively, only a small set of reference images. Our key insight is to leverage strong semantic priors, as learned by foundation models, to identify corresponding regions between a reference and a target image. We find that correspondences enable automatic generation of instance-level segmentation masks for downstream tasks and instantiate our ideas via a multi-stage, training-free method incorporating (1) memory bank construction; (2) representation aggregation and (3) semantic-aware feature matching. Our experiments show significant improvements on segmentation metrics, leading to state-of-the-art performance on COCO FSOD (36.8% nAP), PASCAL VOC Few-Shot (71.2% nAP50) and outperforming existing training-free approaches on the Cross-Domain FSOD benchmark (22.4% nAP).
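
The three stages named in the abstract (memory bank construction, representation aggregation, feature matching) can be pictured with a small Python sketch. It assumes that each class is summarized by the mean of its normalized reference features and that a target feature is assigned to the class with the highest cosine similarity. The toy 2-D vectors stand in for foundation-model features, and all function names are hypothetical; the paper's actual pipeline is more involved.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def build_memory_bank(reference_features):
    """Aggregate the reference features of each class into one unit prototype."""
    bank = {}
    for cls, feats in reference_features.items():
        unit = [normalize(f) for f in feats]
        mean = [sum(col) / len(unit) for col in zip(*unit)]
        bank[cls] = normalize(mean)
    return bank


def match(feature, bank):
    """Return the best-matching class and its cosine similarity."""
    f = normalize(feature)
    scores = {
        cls: sum(a * b for a, b in zip(f, proto)) for cls, proto in bank.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


references = {
    "cat": [[1.0, 0.0], [0.8, 0.2]],
    "dog": [[0.0, 1.0], [0.1, 0.9]],
}
bank = build_memory_bank(references)
label, score = match([0.9, 0.1], bank)
```

No parameters are updated anywhere in this flow, which is what makes the approach training-free: adding a new class only requires adding its reference features to the bank.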

[53] HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars cs.CV | cs.GRPDF

Gent Serifi, Marcel C. Bühler

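TL;DR: The paper proposes HyperGaussians, which extend 3D Gaussians to high-dimensional multivariate Gaussians and improve rendering quality for animatable face avatars, especially for high-frequency details and complex deformations.

Details

Motivation: Existing 3D Gaussian Splatting (3DGS) renders static faces well but still struggles with dynamic facial deformations, complex lighting, and high-frequency details. The paper aims to increase expressivity by improving the Gaussian representation itself.

Result: Experiments on 19 subjects from 4 face datasets show that HyperGaussians outperform 3DGS numerically and visually, particularly on high-frequency details such as eyeglass frames, teeth, complex expressions, and specular reflections.

Insight: Rethinking the Gaussian representation itself shows that a high-dimensional extension can substantially improve rendering quality. The inverse covariance trick makes the high-dimensional computation efficient and may apply more broadly.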

Abstract: We introduce HyperGaussians, a novel extension of 3D Gaussian Splatting for high-quality animatable face avatars. Creating such detailed face avatars from videos is a challenging problem and has numerous applications in augmented and virtual reality. While tremendous successes have been achieved for static faces, animatable avatars from monocular videos still fall in the uncanny valley. The de facto standard, 3D Gaussian Splatting (3DGS), represents a face through a collection of 3D Gaussian primitives. 3DGS excels at rendering static faces, but the state of the art still struggles with nonlinear deformations, complex lighting effects, and fine details. While most related works focus on predicting better Gaussian parameters from expression codes, we rethink the 3D Gaussian representation itself and how to make it more expressive. Our insights lead to a novel extension of 3D Gaussians to high-dimensional multivariate Gaussians, dubbed ‘HyperGaussians’. The higher dimensionality increases expressivity through conditioning on a learnable local embedding. However, splatting HyperGaussians is computationally expensive because it requires inverting a high-dimensional covariance matrix. We solve this by reparameterizing the covariance matrix, dubbed the ‘inverse covariance trick’. This trick boosts the efficiency so that HyperGaussians can be seamlessly integrated into existing models. To demonstrate this, we plug HyperGaussians into the state of the art in fast monocular face avatars: FlashAvatar. Our evaluation on 19 subjects from 4 face datasets shows that HyperGaussians outperform 3DGS numerically and visually, particularly for high-frequency details like eyeglass frames, teeth, complex facial movements, and specular reflections.

[54] LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion cs.CVPDF

Fangfu Liu, Hao Li, Jiawei Chi, Hanyang Wang, Minghui Yang

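TL;DR: LangScene-X is a generative framework that uses a TriMap video diffusion model and a Language Quantized Compressor (LQC) to reconstruct generalizable 3D language-embedded scenes from sparse views, addressing the rendering and semantic synthesis failures of earlier methods when few views are available.

Details

Motivation: Traditional dense-view 3D reconstruction methods suffer from rendering artifacts and inaccurate semantic synthesis under sparse views. LangScene-X aims to overcome this by generating consistent multimodal information (appearance, geometry, semantics).

Result: Experiments on real-world data show that LangScene-X outperforms existing methods in quality and generalizability.

Insight: Combining generative models with language embeddings enables high-quality, generalizable 3D reconstruction and scene understanding even from sparse inputs.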

Abstract: Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and semantics (segmentation maps) from sparse inputs through progressive knowledge integration. Furthermore, we propose a Language Quantized Compressor (LQC), trained on large-scale image datasets, to efficiently encode language embeddings, enabling cross-scene generalization without per-scene retraining. Finally, we reconstruct the language surface fields by aligning language information onto the surface of 3D scenes, enabling open-ended language queries. Extensive experiments on real-world data demonstrate the superiority of our LangScene-X over state-of-the-art methods in terms of quality and generalizability. Project Page: https://liuff19.github.io/LangScene-X.

[55] Confidence-driven Gradient Modulation for Multimodal Human Activity Recognition: A Dynamic Contrastive Dual-Path Learning Approach cs.CVPDF

Panpan Ji, Junni Song, Hang Xiao, Hanyu Liu, Chao Li

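TL;DR: The paper proposes DCDP-HAR, a framework that addresses cross-modal feature alignment and imbalanced modality contributions in multimodal human activity recognition. It combines dual-path feature extraction, multi-stage contrastive learning, and confidence-driven gradient modulation, with momentum-based gradient accumulation for training stability, and is evaluated experimentally.

Details

Motivation: Multimodal human activity recognition systems face difficulties in cross-modal feature alignment and imbalanced modality contributions, and need a strategy that adjusts learning intensity dynamically.

Result: Experiments on four public datasets show the framework's advantages.

Insight: Dynamically adjusting the learning intensity of each modality can alleviate modality competition and improve recognition performance.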

Abstract: Sensor-based Human Activity Recognition (HAR) is a core technology that enables intelligent systems to perceive and interact with their environment. However, multimodal HAR systems still encounter key challenges, such as difficulties in cross-modal feature alignment and imbalanced modality contributions. To address these issues, we propose a novel framework called the Dynamic Contrastive Dual-Path Network (DCDP-HAR). The framework comprises three key components. First, a dual-path feature extraction architecture is employed, where ResNet and DenseNet branches collaboratively process multimodal sensor data. Second, a multi-stage contrastive learning mechanism is introduced to achieve progressive alignment from local perception to semantic abstraction. Third, we present a confidence-driven gradient modulation strategy that dynamically monitors and adjusts the learning intensity of each modality branch during backpropagation, effectively alleviating modality competition. In addition, a momentum-based gradient accumulation strategy is adopted to enhance training stability. We conduct ablation studies to validate the effectiveness of each component and perform extensive comparative experiments on four public benchmark datasets.

[56] AnyI2V: Animating Any Conditional Image with Motion Control cs.CVPDF

Ziye Li, Hao Luo, Xincheng Shuai, Henghui Ding

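TL;DR: AnyI2V is a training-free framework that animates any conditional image along user-defined motion trajectories, supports multiple input modalities, and offers flexible control over generation.

Details

Motivation: Existing text-to-video (T2V) and image-to-video (I2V) methods fall short in integrating dynamic motion signals and spatial constraints, and lack precise control over, and editability of, the generated content.

Result: Experiments show that AnyI2V performs strongly in spatial- and motion-controlled video generation.

Insight: AnyI2V broadens the input modalities and editing options available for video generation and opens a new direction for future work.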

Abstract: Recent advancements in video generation, particularly in diffusion models, have driven notable progress in text-to-video (T2V) and image-to-video (I2V) synthesis. However, challenges remain in effectively integrating dynamic motion signals and flexible spatial constraints. Existing T2V methods typically rely on text prompts, which inherently lack precise control over the spatial layout of generated content. In contrast, I2V methods are limited by their dependence on real images, which restricts the editability of the synthesized content. Although some methods incorporate ControlNet to introduce image-based conditioning, they often lack explicit motion control and require computationally expensive training. To address these limitations, we propose AnyI2V, a training-free framework that animates any conditional images with user-defined motion trajectories. AnyI2V supports a broader range of modalities as the conditional image, including data types such as meshes and point clouds that are not supported by ControlNet, enabling more flexible and versatile video generation. Additionally, it supports mixed conditional inputs and enables style transfer and editing via LoRA and text prompts. Extensive experiments demonstrate that the proposed AnyI2V achieves superior performance and provides a new perspective in spatial- and motion-controlled video generation. Code is available at https://henghuiding.com/AnyI2V/.

[57] Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation cs.CVPDF

Jiaer Xia, Bingkui Tong, Yuhang Zang, Rui Shao, Kaiyang Zhou

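TL;DR: Proposes Grounded Chain-of-Thought (GCoT), which injects grounding information (such as bounding boxes) into reasoning steps to improve the adaptation of multimodal large language models (MLLMs) to specialized vision tasks when data is limited.

Details

Motivation: MLLMs adapt poorly to specialized vision tasks such as chart understanding, and retraining on large-scale data is costly. Pre-training data and downstream task data are mismatched.

Result: Under data-limited regimes, GCoT significantly improves performance on specialized vision tasks over fine-tuning and distillation.

Insight: Grounding information can bridge the gap between pre-training data and specialized tasks, and improves reasoning over non-object images.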

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in interpreting images using natural language. However, without using large-scale datasets for retraining, these models are difficult to adapt to specialized vision tasks, e.g., chart understanding. This problem is caused by a mismatch between pre-training and downstream datasets: pre-training datasets primarily concentrate on scenes and objects but contain limited information about specialized, non-object images, such as charts and tables. In this paper, we share an interesting finding that training an MLLM with Chain-of-Thought (CoT) reasoning data can facilitate model adaptation in specialized vision tasks, especially under data-limited regimes. However, we identify a critical issue within CoT data distilled from pre-trained MLLMs, i.e., the data often contains multiple factual errors in the reasoning steps. To address the problem, we propose Grounded Chain-of-Thought (GCoT), a simple bootstrapping-based approach that aims to inject grounding information (i.e., bounding boxes) into CoT data, essentially making the reasoning steps more faithful to input images. We evaluate our approach on five specialized vision tasks, which cover a variety of visual formats including charts, tables, receipts, and reports. The results demonstrate that under data-limited regimes our approach significantly improves upon fine-tuning and distillation.

[58] Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching cs.CVPDF

Xin Zhou, Dingkang Liang, Kaijin Chen, Tianrui Feng, Xiwu Chen

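TL;DR: EasyCache is a training-free acceleration framework that uses a runtime-adaptive caching mechanism to cut redundant computation in video diffusion models, speeding up inference while preserving output quality.

Details

Motivation: Iterative denoising makes video generation models slow and computationally expensive at inference, which limits adoption. EasyCache addresses this with a lightweight caching mechanism.

Result: On several large-scale video generation models, inference is accelerated by 2.1-3.3× over the original baselines, with up to 36% higher PSNR than the previous state-of-the-art method and high visual fidelity.

Insight: A lightweight runtime caching mechanism is key to accelerating video diffusion models efficiently. It needs no training or extensive tuning, which suits practical deployment.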

Abstract: Video generation models have demonstrated remarkable performance, yet their broader adoption remains constrained by slow inference speeds and substantial computational costs, primarily due to the iterative nature of the denoising process. Addressing this bottleneck is essential for democratizing advanced video synthesis technologies and enabling their integration into real-world applications. This work proposes EasyCache, a training-free acceleration framework for video diffusion models. EasyCache introduces a lightweight, runtime-adaptive caching mechanism that dynamically reuses previously computed transformation vectors, avoiding redundant computations during inference. Unlike prior approaches, EasyCache requires no offline profiling, pre-computation, or extensive parameter tuning. We conduct comprehensive studies on various large-scale video generation models, including OpenSora, Wan2.1, and HunyuanVideo. Our method achieves leading acceleration performance, reducing inference time by 2.1-3.3$\times$ compared to the original baselines while maintaining high visual fidelity, with a PSNR improvement of up to 36% over the previous SOTA method. This improvement makes EasyCache an efficient and highly accessible solution for high-quality video generation in both research and practical applications. The code is available at https://github.com/H-EmbodVis/EasyCache.
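
The abstract does not spell out the caching rule, so the following is only a toy sketch of the general idea of runtime-adaptive caching, not EasyCache's actual algorithm. It assumes that a "transformation vector" is the difference between a step's output and its input, and that the cached vector is reused while the accumulated relative change of the input stays under a tolerance. The names `AdaptiveCache`, `threshold`, and `full_calls`, and the change criterion itself, are hypothetical.

```python
from typing import Callable, List, Optional

Vector = List[float]


def l1_norm(v: Vector) -> float:
    """Sum of absolute values of the entries."""
    return sum(abs(x) for x in v)


class AdaptiveCache:
    """Toy runtime-adaptive cache around an expensive per-step function."""

    def __init__(self, expensive_fn: Callable[[Vector], Vector], threshold: float):
        self.expensive_fn = expensive_fn
        self.threshold = threshold  # hypothetical tolerance, not taken from the paper
        self.cached_delta: Optional[Vector] = None  # last computed (output - input)
        self.last_input: Optional[Vector] = None
        self.accumulated_change = 0.0
        self.full_calls = 0  # how often the expensive function really ran

    def step(self, x: Vector) -> Vector:
        if self.cached_delta is not None and self.last_input is not None:
            # Relative change of the input since the previous step.
            diff = [a - b for a, b in zip(x, self.last_input)]
            change = l1_norm(diff) / (l1_norm(self.last_input) + 1e-12)
            self.accumulated_change += change
            if self.accumulated_change < self.threshold:
                # Input has drifted little: reuse the cached transformation.
                self.last_input = x
                return [a + d for a, d in zip(x, self.cached_delta)]
        # Otherwise run the full computation and refresh the cache.
        y = self.expensive_fn(x)
        self.full_calls += 1
        self.cached_delta = [b - a for a, b in zip(x, y)]
        self.last_input = x
        self.accumulated_change = 0.0
        return y
```

In this sketch the decision is made at run time from the inputs alone, which is one way to avoid offline profiling; a real system would wrap a denoising network rather than a plain function.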

[59] LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans cs.CV | cs.AI | cs.GRPDF

Zhening Huang, Xiaoyang Wu, Fangcheng Zhong, Hengshuang Zhao, Matthias Nießner

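TL;DR: LiteReality is a pipeline that converts RGB-D scans of indoor environments into compact, realistic, and interactive 3D virtual replicas for applications such as AR/VR, gaming, and robotics.

Details

Motivation: Existing methods are limited in producing realistic 3D scenes that are ready for graphics pipelines. LiteReality aims to fill this gap while supporting object individuality and physical interaction.

Result: State-of-the-art similarity performance on the Scan2CAD benchmark. The generated scenes are compatible with standard graphics pipelines and suit applications across multiple domains.

Insight: Combining scene understanding with training-free asset retrieval yields high-quality, interactive scene reconstruction.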

Abstract: We propose LiteReality, a novel pipeline that converts RGB-D scans of indoor environments into compact, realistic, and interactive 3D virtual replicas. LiteReality not only reconstructs scenes that visually resemble reality but also supports key features essential for graphics pipelines – such as object individuality, articulation, high-quality physically based rendering materials, and physically based interaction. At its core, LiteReality first performs scene understanding and parses the results into a coherent 3D layout and objects with the help of a structured scene graph. It then reconstructs the scene by retrieving the most visually similar 3D artist-crafted models from a curated asset database. Next, the Material Painting module enhances realism by recovering high-quality, spatially varying materials. Finally, the reconstructed scene is integrated into a simulation engine with basic physical properties to enable interactive behavior. The resulting scenes are compact, editable, and fully compatible with standard graphics pipelines, making them suitable for applications in AR/VR, gaming, robotics, and digital twins. In addition, LiteReality introduces a training-free object retrieval module that achieves state-of-the-art similarity performance on the Scan2CAD benchmark, along with a robust material painting module capable of transferring appearances from images of any style to 3D assets – even under severe misalignment, occlusion, and poor lighting. We demonstrate the effectiveness of LiteReality on both real-life scans and public datasets. Project page: https://litereality.github.io; Video: https://www.youtube.com/watch?v=ecK9m3LXg2c

[60] RefTok: Reference-Based Tokenization for Video Generation cs.CVPDF

Xiang Fan, Xiaohang Sun, Kushan Thakkar, Zhu Liu, Vimal Bhat

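TL;DR: RefTok is a reference-based tokenization method that encodes and decodes sets of video frames conditioned on an unquantized reference frame. It addresses the failure of existing methods to capture temporal dependencies and redundancy, and improves video generation quality and compression efficiency.

Details

Motivation: Existing video models usually process sets of frames independently and fail to capture temporal dependencies and redundancy, which limits generation quality. RefTok addresses this by conditioning on a reference frame.

Result: On four datasets (K600, UCF-101, BAIR Robot Pushing, and DAVIS), RefTok significantly outperforms Cosmos and MAGVIT, improving metrics by 36.7% on average. On BAIR, generations from a model trained on RefTok latents even outperform MAGVIT-L, which has 4x more parameters.

Insight: Reference-based encoding captures temporal redundancy and dynamics effectively, offering a new approach for video generation and compression. Similar methods could be extended to other temporal tasks.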

Abstract: Effectively handling temporal redundancy remains a key challenge in learning video models. Prevailing approaches often treat each set of frames independently, failing to effectively capture the temporal dependencies and redundancies inherent in videos. To address this limitation, we introduce RefTok, a novel reference-based tokenization method capable of capturing complex temporal dynamics and contextual information. Our method encodes and decodes sets of frames conditioned on an unquantized reference frame. When decoded, RefTok preserves the continuity of motion and the appearance of objects across frames. For example, RefTok retains facial details despite head motion, reconstructs text correctly, preserves small patterns, and maintains the legibility of handwriting from the context. Across 4 video datasets (K600, UCF-101, BAIR Robot Pushing, and DAVIS), RefTok significantly outperforms current state-of-the-art tokenizers (Cosmos and MAGVIT) and improves all evaluated metrics (PSNR, SSIM, LPIPS) by an average of 36.7% at the same or higher compression ratios. When a video generation model is trained using RefTok’s latents on the BAIR Robot Pushing task, the generations outperform not only MAGVIT-B but also the larger MAGVIT-L, which has 4x more parameters, across all generation metrics by an average of 27.9%.

[61] Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory cs.CV | cs.AI | cs.LGPDF

Yuqi Wu, Wenzhao Zheng, Jie Zhou, Jiwen Lu

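TL;DR: Point3R proposes an explicit spatial pointer memory for online streaming 3D reconstruction, addressing the limited capacity of implicit memory and the loss of information from earlier frames.

Details

Motivation: Existing 3D reconstruction methods rely on implicit memory, which tends to lose information from earlier frames because of its limited capacity, hurting the continuity and accuracy of reconstruction.

Result: Competitive or state-of-the-art performance on various tasks. The code is open source.

Insight: An explicit memory design retains scene information more effectively and suits streaming reconstruction.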

Abstract: Dense 3D scene reconstruction from an ordered sequence or unordered image collections is a critical step when bringing research in computer vision into practical scenarios. Following the paradigm introduced by DUSt3R, which unifies an image pair densely into a shared coordinate system, subsequent methods maintain an implicit memory to achieve dense 3D reconstruction from more images. However, such implicit memory is limited in capacity and may suffer from information loss of earlier frames. We propose Point3R, an online framework targeting dense streaming 3D reconstruction. To be specific, we maintain an explicit spatial pointer memory directly associated with the 3D structure of the current scene. Each pointer in this memory is assigned a specific 3D position and aggregates scene information nearby in the global coordinate system into a changing spatial feature. Information extracted from the latest frame interacts explicitly with this pointer memory, enabling dense integration of the current observation into the global coordinate system. We design a 3D hierarchical position embedding to promote this interaction and design a simple yet effective fusion mechanism to ensure that our pointer memory is uniform and efficient. Our method achieves competitive or state-of-the-art performance on various tasks with low training costs. Code is available at: https://github.com/YkiWu/Point3R.

cs.CL [Back]

[62] Reasoning or Not? A Comprehensive Evaluation of Reasoning LLMs for Dialogue Summarization cs.CL | cs.AIPDF

Keyan Jin, Yapeng Wang, Leonel Santos, Tao Fang, Xu Yang

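TL;DR: The paper comprehensively evaluates reasoning LLMs on dialogue summarization and finds that explicit step-by-step reasoning does not necessarily improve summary quality and can instead lead to verbosity and inconsistency.

Details

Motivation: Dialogue summarization has significant practical value, but the performance of reasoning LLMs (such as Long Chain-of-Thought models) has not been well studied in dialogue settings.

Result: Explicit reasoning does not consistently improve summary quality and tends to produce verbose, factually inconsistent, and less concise summaries.

Insight: Current reasoning LLMs may fail in complex dialogue scenarios. Models and evaluation strategies need to be tailored to real-world needs.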

Abstract: Dialogue summarization is a challenging task with significant practical value in customer service, meeting analysis, and conversational AI. Although large language models (LLMs) have achieved substantial progress in summarization tasks, the performance of step-by-step reasoning architectures, specifically Long Chain-of-Thought (CoT) implementations such as OpenAI-o1 and DeepSeek-R1, remains unexplored for dialogue scenarios requiring concurrent abstraction and conciseness. In this work, we present the first comprehensive and systematic evaluation of state-of-the-art reasoning LLMs and non-reasoning LLMs across three major paradigms: generic, role-oriented, and query-oriented dialogue summarization. Our study spans diverse languages, domains, and summary lengths, leveraging strong benchmarks (SAMSum, DialogSum, CSDS, and QMSum) and advanced evaluation protocols that include both LLM-based automatic metrics and human-inspired criteria. Contrary to trends in other reasoning-intensive tasks, our findings show that explicit stepwise reasoning does not consistently improve dialogue summarization quality. Instead, reasoning LLMs are often prone to verbosity, factual inconsistencies, and less concise summaries compared to their non-reasoning counterparts. Through scenario-specific analyses and detailed case studies, we further identify when and why explicit reasoning may fail to benefit, or even hinder, summarization in complex dialogue contexts. Our work provides new insights into the limitations of current reasoning LLMs and highlights the need for targeted modeling and evaluation strategies for real-world dialogue summarization.

[63] Latent Chain-of-Thought? Decoding the Depth-Recurrent Transformer cs.CL | cs.AI | cs.LGPDF

Wenquan Lu, Yuechuan Yang, Kyle Lee, Yanshu Li, Enqi Liu

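TL;DR: The paper investigates whether Huginn-3.5B, a depth-recurrent Transformer, performs chain-of-thought (CoT) reasoning in latent space, and finds limited evidence of it, with inconsistent interpretability of the latent states.

Details

Motivation: To examine whether recurrent Transformers can carry out CoT reasoning in latent space for efficiency, without explicitly generating natural-language reasoning steps as conventional models do.

Result: Huginn-3.5B shows limited evidence of latent CoT. The interpretability of hidden states depends on layer index and decoding method, and increasing recurrence depth yields only marginal gains.

Insight: Models that explicitly externalize reasoning steps still outperform latent CoT, possibly because the latent space lacks interpretability and consistency.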

Abstract: Chain-of-thought (CoT) reasoning has enabled transformer-based language models to excel at complex mathematics and multi-step planning. However, in standard decoder-only architectures, these reasoning steps are externalized in natural language, improving interpretability at the cost of efficiency. To capture reasoning that is not easily represented in words, many works have explored recurrent architectures that aim to internalize reasoning in latent space, potentially supporting latent CoT. In this paper, we investigate whether such reasoning structures emerge in Huginn-3.5B, a depth-recurrent Transformer that reuses layers at inference time without increasing parameter count. We examine the model’s internal behavior on arithmetic tasks using a suite of probing techniques including the Logit Lens and Coda Lens. Our findings reveal limited evidence of interpretable latent CoT by tracking rank trajectories of final and intermediate result tokens. Furthermore, we uncover significant probing inconsistencies across recurrent blocks, where the interpretability of hidden states depends heavily on both the layer index and the decoding method. Finally, we empirically show that increasing recurrence depth yields only marginal gains and falls well short of models that explicitly externalize reasoning steps. The code is available at https://github.com/wenquanlu/huginn-latent-cot.
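
As a rough illustration of the rank-trajectory probing mentioned in the abstract, the sketch below decodes each intermediate hidden state directly through an unembedding matrix, in the style of the Logit Lens, and records the rank of a target token. It is a simplified stand-in and not the authors' code: real implementations usually apply the model's final normalization before unembedding, and the function names and tiny matrices here are illustrative assumptions.

```python
from typing import List


def logit_lens_rank(hidden: List[float], unembed: List[List[float]], target: int) -> int:
    """Rank (1 = most likely) of token `target` when `hidden` is decoded
    directly through the unembedding matrix (one row per vocabulary token)."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    # Rank is one plus the number of tokens with a strictly higher logit.
    return 1 + sum(1 for value in logits if value > logits[target])


def rank_trajectory(hidden_states: List[List[float]],
                    unembed: List[List[float]],
                    target: int) -> List[int]:
    """Rank of `target` after each layer or recurrence step."""
    return [logit_lens_rank(h, unembed, target) for h in hidden_states]
```

A trajectory that falls toward 1 over successive steps would suggest that the token is gradually being resolved in latent space; a flat or erratic trajectory would not.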

[64] Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models cs.CLPDF

Christian Jaumann, Annemarie Friedrich, Rainer Lienhart

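TL;DR: The paper presents a system based on multimodal large language models that uses few-shot example retrieval and confidence-informed ensembling, ranking third in the SciVQA 2025 shared task with an average F1 score of 85.12.

Details

Motivation: To address model selection and ensembling for scientific visual question answering and to improve few-shot performance.

Result: Ranks third on the blind test data with an F1 score of 85.12 (averaged over ROUGE-1, ROUGE-L, and BERTScore).

Insight: Few-shot example retrieval and model ensembling can effectively improve performance on multimodal tasks.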

Abstract: This paper describes our system for the SciVQA 2025 Shared Task on Scientific Visual Question Answering. Our system employs an ensemble of two Multimodal Large Language Models and various few-shot example retrieval strategies. The model and few-shot setting are selected based on the figure and question type. We also select answers based on the models’ confidence levels. On the blind test data, our system ranks third out of seven with an average F1 score of 85.12 across ROUGE-1, ROUGE-L, and BERTScore. Our code is publicly available.
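
One simple reading of selecting answers by confidence is to keep the answer from whichever model reports the highest confidence. The sketch below shows that rule only. How the system actually derives and compares confidences is not stated in the abstract, so `Prediction` and `select_by_confidence` are hypothetical names, and the confidence values are assumed to be comparable across models.

```python
from typing import List, Tuple

# (answer text, confidence in [0, 1]); how confidence is obtained is assumed.
Prediction = Tuple[str, float]


def select_by_confidence(predictions: List[Prediction]) -> str:
    """Return the answer of the most confident model.

    Ties go to the earliest prediction in the list.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    answer, _ = max(predictions, key=lambda p: p[1])
    return answer
```

For example, `select_by_confidence([("42", 0.61), ("41", 0.87)])` picks the second model's answer.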

[65] IndianBailJudgments-1200: A Multi-Attribute Dataset for Legal NLP on Indian Bail Orders cs.CL | cs.AI | cs.LG | 91B14, 68T50 | I.2.7; K.4.1; K.5.2PDF

Sneha Deshmukh, Prathmesh Kamble

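TL;DR: The paper introduces IndianBailJudgments-1200, a multi-attribute dataset of 1200 Indian court bail judgments that supports legal NLP tasks.

Details

Motivation: Legal NLP is underdeveloped in regions like India, in part because structured datasets are scarce.

Result: Releases the IndianBailJudgments-1200 dataset as a new resource for legal NLP research.

Insight: The dataset fills a gap in Indian legal NLP and enables research on a range of bail-judgment tasks.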

Abstract: Legal NLP remains underdeveloped in regions like India due to the scarcity of structured datasets. We introduce IndianBailJudgments-1200, a new benchmark dataset comprising 1200 Indian court judgments on bail decisions, annotated across 20+ attributes including bail outcome, IPC sections, crime type, and legal reasoning. Annotations were generated using a prompt-engineered GPT-4o pipeline and verified for consistency. This resource supports a wide range of legal NLP tasks such as outcome prediction, summarization, and fairness analysis, and is the first publicly available dataset focused specifically on Indian bail jurisprudence.

[66] WebSailor: Navigating Super-human Reasoning for Web Agent cs.CL | cs.AIPDF

Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou

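TL;DR: WebSailor is a post-training method that, by generating high-uncertainty tasks and optimizing the training policy, brings open-source models close to proprietary agents on complex information-seeking tasks.

Details

Motivation: Existing open-source models lag behind proprietary agents on complex information-seeking tasks, mainly because they lack the ability to navigate high uncertainty systematically.

Result: WebSailor outperforms all open-source agents on complex information-seeking tasks and matches the performance of proprietary agents.

Insight: The ability to systematically reduce high uncertainty is key to the success of proprietary agents and can be reproduced through task generation and policy optimization.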

Abstract: Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents’ performance and closing the capability gap.

[67] Revisiting Active Learning under (Human) Label Variation cs.CL | cs.HC | cs.LG | stat.MLPDF

Cornelia Gruber, Helen Alber, Bernd Bischl, Göran Kauermann, Barbara Plank

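TL;DR: The paper discusses how to handle human label variation (HLV) in active learning (AL), proposes a framework that treats HLV as signal rather than noise, and discusses integrating large language models (LLMs) as annotators.

Details

Motivation: High annotation cost is a bottleneck for supervised learning, and existing AL and annotation frameworks usually assume a single ground-truth label, overlooking the complexity and informational value of HLV. The paper aims to address this.

Result: The paper offers a conceptual framework that handles real-world annotation complexity, such as HLV, more realistically.

Insight: HLV should be treated as information rather than noise, AL frameworks need to accommodate annotation diversity, and LLMs may become efficient annotators.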

Abstract: Access to high-quality labeled data remains a limiting factor in applied supervised learning. While label variation (LV), i.e., differing labels for the same instance, is common, especially in natural language processing, annotation frameworks often still rest on the assumption of a single ground truth. This overlooks human label variation (HLV), the occurrence of plausible differences in annotations, as an informative signal. Similarly, active learning (AL), a popular approach to optimizing the use of limited annotation budgets in training ML models, often relies on at least one of several simplifying assumptions, which rarely hold in practice when acknowledging HLV. In this paper, we examine foundational assumptions about truth and label nature, highlighting the need to decompose observed LV into signal (e.g., HLV) and noise (e.g., annotation error). We survey how the AL and (H)LV communities have addressed – or neglected – these distinctions and propose a conceptual framework for incorporating HLV throughout the AL loop, including instance selection, annotator choice, and label representation. We further discuss the integration of large language models (LLM) as annotators. Our work aims to lay a conceptual foundation for HLV-aware active learning, better reflecting the complexities of real-world annotation.

Ken Tsui

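TL;DR: The paper studies the ‘Self-Correction Blind Spot’ in large language models (LLMs): they can identify errors in user input but fail to correct the same errors in their own output. Testing 14 models with the Self-Correction Bench framework yields an average blind spot rate of 64.5%, and simply appending “Wait” substantially reduces it.

Details

Motivation: Despite strong generative abilities, LLMs are notably weak at self-correction, especially at correcting errors in their own output. This limitation may undermine their reliability and trustworthiness in practice.

Result: 1) The average blind spot rate is 64.5%; 2) human training demonstrations predominantly show error-free responses rather than error-correction sequences; 3) appending “Wait” reduces blind spots by 89.3%.

Insight: Current LLMs have a systematic limitation in self-correction, but the latent capability can be activated with a simple prompt change. Future LLM training should include more error-correction examples to improve reliability.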

Abstract: Although large language models (LLMs) have become transformative, they still make mistakes and can explore unproductive reasoning paths. Self-correction is an important capability for a trustworthy LLM, particularly an autoregressive LLM. While LLMs can identify errors in user input, they exhibit a systematic ‘Self-Correction Blind Spot’: they fail to correct identical errors in their own outputs. To study this phenomenon, we introduce Self-Correction Bench, a systematic framework that measures it through controlled error injection at three complexity levels. Testing 14 models, we find an average 64.5% blind spot rate. We find multiple lines of evidence that this limitation relates to training data composition: human training demonstrations predominantly show error-free responses rather than error-correction sequences, unlike RL-trained models that learn error correction through outcome feedback. Remarkably, simply appending “Wait” reduces blind spots by 89.3%, suggesting that the capability exists but requires activation. Our work highlights a critical limitation in current LLMs and offers potential avenues for improving their reliability and trustworthiness.

[69] Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models cs.CLPDF

Riccardo Cantini, Nicola Gabriele, Alessio Orsino, Domenico Talia

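TL;DR: The paper studies the robustness of Reasoning Language Models (RLMs) against social bias elicitation and finds that models with explicit reasoning are actually more vulnerable to it, challenging the assumption that reasoning improves robustness.

Details

Motivation: To examine whether the complex reasoning capabilities of RLMs improve robustness to bias, or whether they may unintentionally amplify it.

Result: Reasoning models are more vulnerable to bias elicitation than base models without reasoning mechanisms, particularly models that rely on CoT prompting.

Insight: Reasoning design needs to pay more attention to bias. Reasoning cannot be assumed to always improve robustness.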

Abstract: Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks through mechanisms such as Chain-of-Thought (CoT) prompting or fine-tuned reasoning traces. While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear. In this work, we leverage the CLEAR-Bias benchmark, originally designed for Large Language Models (LLMs), to investigate the adversarial robustness of RLMs to bias elicitation. We systematically evaluate state-of-the-art RLMs across diverse sociocultural dimensions, using an LLM-as-a-judge approach for automated safety scoring and leveraging jailbreak techniques to assess the strength of built-in safety mechanisms. Our evaluation addresses three key questions: (i) how the introduction of reasoning capabilities affects model fairness and robustness; (ii) whether models fine-tuned for reasoning exhibit greater safety than those relying on CoT prompting at inference time; and (iii) how the success rate of jailbreak attacks targeting bias elicitation varies with the reasoning mechanisms employed. Our findings reveal a nuanced relationship between reasoning capabilities and bias safety. Surprisingly, models with explicit reasoning, whether via CoT prompting or fine-tuned reasoning traces, are generally more vulnerable to bias elicitation than base models without such mechanisms, suggesting reasoning may unintentionally open new pathways for stereotype reinforcement. Reasoning-enabled models appear somewhat safer than those relying on CoT prompting, which are particularly prone to contextual reframing attacks through storytelling prompts, fictional personas, or reward-shaped instructions. These results challenge the assumption that reasoning inherently improves robustness and underscore the need for more bias-aware approaches to reasoning design.

[70] Multimodal Mathematical Reasoning with Diverse Solving Perspective cs.CLPDF

Wenhao Shi, Zhiqiang Hu, Yi Bin, Yang Yang, See-Kiong Ng

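TL;DR: The MathV-DP dataset and the Qwen-VL-DP model capture diverse solution trajectories and use reinforcement learning to improve the accuracy and diversity of multimodal mathematical reasoning.

Details

Motivation: Existing MLLMs rely on one-to-one image-text pairs and single-solution supervision, overlooking the diversity of reasoning perspectives and limiting model potential.

Result: On the MathVista and Math-V benchmarks, Qwen-VL-DP significantly outperforms prior base MLLMs in both accuracy and generative diversity.

Insight: Multimodal mathematical reasoning benefits from combining diversity-aware supervision with reinforcement learning to capture different solving perspectives.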

Abstract: Recent progress in large-scale reinforcement learning (RL) has notably enhanced the reasoning capabilities of large language models (LLMs), especially in mathematical domains. However, current multimodal LLMs (MLLMs) for mathematical reasoning often rely on one-to-one image-text pairs and single-solution supervision, overlooking the diversity of valid reasoning perspectives and internal reflections. In this work, we introduce MathV-DP, a novel dataset that captures multiple diverse solution trajectories for each image-question pair, fostering richer reasoning supervision. We further propose Qwen-VL-DP, a model built upon Qwen-VL, fine-tuned with supervised learning and enhanced via group relative policy optimization (GRPO), a rule-based RL approach that integrates correctness discrimination and diversity-aware reward functions. Our method emphasizes learning from varied reasoning perspectives and distinguishing between correct yet distinct solutions. Extensive experiments on MathVista’s minitest and the Math-V benchmark demonstrate that Qwen-VL-DP significantly outperforms prior base MLLMs in both accuracy and generative diversity, highlighting the importance of incorporating diverse perspectives and reflective reasoning in multimodal mathematical reasoning.
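
The core of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt, rather than by a learned value function. The sketch below shows that group-relative normalization in its commonly described form. The reward values would come from whatever scalar the method assigns (here, the paper's correctness and diversity-aware rewards, whose exact form is not given in the abstract), and the use of the population standard deviation and the `eps` term are implementation assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    are judged against their siblings rather than on an absolute scale.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason reward functions that separate distinct correct solutions can matter.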

[71] SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model cs.CL | cs.AI | cs.LGPDF

Wencheng Zhang, Shiqin Qiao, Lingjie Luo, Yinfeng Li, Chuanyang Zheng

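TL;DR: SynapseRoute is an automatic route-switching framework for dual-state large language models (LLMs) that dynamically assigns each query to a "thinking" or "non-thinking" mode to balance accuracy, cost, and user experience. Experiments show that it improves accuracy while substantially reducing inference time and token consumption.

Details

Motivation: In practical LLM deployments, model selection must balance performance against operational cost. The dichotomy in problem complexity (about 58% of medical questions can be answered accurately by the non-thinking mode alone) points to the potential of dynamic routing.

Result: Compared with the thinking mode alone, SynapseRoute improves accuracy (0.8390 vs. 0.8272), reduces inference time by 36.8%, and cuts token consumption by 39.66%.

Insight: Over-reasoning on simple questions can cause unnecessary latency and even lower accuracy; dynamic routing avoids this pitfall.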

Abstract: With the widespread adoption of large language models (LLMs) in practical applications, selecting an appropriate model requires balancing not only performance but also operational cost. The emergence of reasoning-capable models has further widened the cost gap between “thinking” (high reasoning) and “non-thinking” (fast, low-cost) modes. In this work, we reveal that approximately 58% of medical questions can be accurately answered by the non-thinking mode alone, without requiring the high-cost reasoning process. This highlights a clear dichotomy in problem complexity and suggests that dynamically routing queries to the appropriate mode based on complexity could optimize accuracy, cost-efficiency, and overall user experience. Based on this, we further propose SynapseRoute, a machine learning-based dynamic routing framework that intelligently assigns input queries to either thinking or non-thinking modes. Experimental results on several medical datasets demonstrate that SynapseRoute not only improves overall accuracy (0.8390 vs. 0.8272) compared to the thinking mode alone but also reduces inference time by 36.8% and token consumption by 39.66%. Importantly, qualitative analysis indicates that over-reasoning on simpler queries can lead to unnecessary delays and even decreased accuracy, a pitfall avoided by our adaptive routing. Finally, this work further introduces the Accuracy-Inference-Token (AIT) index to comprehensively evaluate the trade-offs among accuracy, latency, and token cost.
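
The routing idea in the abstract can be illustrated with a minimal sketch. The abstract only says the router is machine learning-based, so everything below is an assumption for illustration: `route_query`, `toy_complexity`, and the 0.5 threshold are hypothetical names and values, and the word-count heuristic merely stands in for a trained complexity classifier.

```python
from typing import Callable


def toy_complexity(query: str) -> float:
    """Hypothetical stand-in for a trained classifier.

    Longer questions get a higher score, capped at 1.0.
    """
    return min(len(query.split()) / 20.0, 1.0)


def route_query(
    query: str,
    complexity_score: Callable[[str], float],
    threshold: float = 0.5,
) -> str:
    """Send queries scored at or above the threshold to the costly mode."""
    if complexity_score(query) >= threshold:
        return "thinking"
    return "non-thinking"
```

A real router would replace `toy_complexity` with a learned model; the point of the sketch is only that the mode decision is made per query, before any expensive reasoning tokens are generated.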

[72] Generalizing Verifiable Instruction Following cs.CLPDF

Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang

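TL;DR: This paper introduces IFBench, a new benchmark for evaluating how well language models generalize in precise instruction following, and shows that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following.

Details

Motivation: Although today's strongest language models follow human instructions well, they still struggle to satisfy specific output constraints (such as answering only yes or no). The authors find that most models overfit to a small set of verifiable constraints from existing benchmarks and fail to generalize to unseen constraints.

Result: RLVR significantly improves precise instruction following. IFBench evaluates generalization on 58 new, diverse, out-of-domain verifiable constraints.

Insight: Precise instruction following can be improved with carefully designed constraint verification modules and reinforcement learning; future work should focus on generalizing to more diverse constraints.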

Abstract: A crucial factor for successful human and AI interaction is the ability of language models or chatbots to follow human instructions precisely. A common feature of instructions is output constraints like “only answer with yes or no” or “mention the word ‘abrakadabra’ at least 3 times” that the user adds to craft a more useful answer. Even today’s strongest models struggle with fulfilling such constraints. We find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities, a skill called precise instruction following, and are not able to generalize well to unseen output constraints. We introduce a new benchmark, IFBench, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints. In addition, we perform an extensive analysis of how and on what data models can be trained to improve precise instruction following generalization. Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following. In addition to IFBench, we release 29 additional new hand-annotated training constraints and verification functions, RLVR training prompts, and code.
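
To make the notion of a verifiable constraint concrete, here is a small sketch of what verification functions for the two constraints quoted in the abstract might look like. These are not the released IFBench verifiers; the function names and the binary reward are assumptions chosen for illustration.

```python
import re
from typing import Callable, Iterable


def verify_yes_no(response: str) -> bool:
    """True if the response is exactly 'yes' or 'no' (case-insensitive)."""
    return response.strip().lower().rstrip(".!") in {"yes", "no"}


def verify_min_word_count(response: str, word: str, min_count: int) -> bool:
    """True if `word` appears as a whole word at least `min_count` times."""
    pattern = rf"\b{re.escape(word)}\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= min_count


def verifiable_reward(
    response: str, checks: Iterable[Callable[[str], bool]]
) -> float:
    """Hypothetical binary reward: 1.0 only if every constraint check passes."""
    return 1.0 if all(check(response) for check in checks) else 0.0
```

Because each check is a deterministic function of the output text, a reward like this can be computed without a learned judge, which is what makes such constraints usable as RL training signals.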

[73] MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs cs.CL | cs.AI | cs.IT | cs.LG | cs.SY | eess.SY | math.ITPDF

Purbesh Mitra, Sennur Ulukus

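TL;DR: MOTIF is a reinforcement fine-tuning method that lets large language models (LLMs) generate reasoning tokens over multiple rounds, working around the context-size limit and improving reasoning accuracy.

Details

Motivation: The context size of current LLMs limits reasoning that needs many tokens. MOTIF aims to ease this bottleneck through modular, multi-round thinking.

Result: On MATH500 and AIME2024, MOTIF improves over vanilla GRPO training by 3.8% and 3.3%, respectively, while using only 15% of the samples.

Insight: Modular reasoning is an effective way to work around the context-size limit while remaining sample-efficient.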

Abstract: Recent advancements in the reasoning capabilities of large language models (LLMs) show that employing the group relative policy optimization (GRPO) algorithm for reinforcement learning (RL) training allows the models to use more thinking/reasoning tokens for generating better responses. However, LLMs can generate only a finite number of tokens while maintaining attention to the previously generated tokens. This limit, also known as the context size of an LLM, is a bottleneck in LLM reasoning with an arbitrarily large number of tokens. To think beyond the limit of context size, an LLM must employ a modular thinking strategy to reason over multiple rounds. In this work, we propose MOTIF: Modular Thinking via Reinforcement Finetuning, an RL training method for generating thinking tokens in multiple rounds, effectively allowing the model to think with additional context size. We trained the open-source model Qwen2.5-3B-Instruct on the GSM8K dataset via parameter-efficient fine-tuning and tested its accuracy on the MATH500 and AIME2024 benchmarks. Our experiments show 3.8% and 3.3% improvements over vanilla GRPO-based training on the respective benchmarks. Furthermore, this improvement was achieved with only 15% of samples, thus demonstrating the sample efficiency of MOTIF. Our code and models are available at https://github.com/purbeshmitra/MOTIF and https://huggingface.co/purbeshmitra/MOTIF, respectively.
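
For readers unfamiliar with the GRPO baseline mentioned above, the sketch below shows the group-relative advantage normalization commonly described for GRPO: each sampled response's reward is standardized against the other responses for the same prompt. It does not implement MOTIF's multi-round scheme. The function name, the use of the population standard deviation, and the `eps` term are assumptions made for this illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Standardize each reward against its own sampling group.

    `rewards` holds one scalar reward per response sampled for a single
    prompt. `eps` avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With rewards `[1.0, 0.0, 0.0, 1.0]`, the two correct responses receive an advantage of about +1 and the two incorrect ones about -1. When every response in the group gets the same reward, all advantages are zero and the group contributes no learning signal.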

eess.IV [Back]

[74] 3D Heart Reconstruction from Sparse Pose-agnostic 2D Echocardiographic Slices eess.IV | cs.CVPDF

Zhurong Chen, Jinhua Chen, Wei Zhuo, Wufeng Xue, Dong Ni

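TL;DR: This paper proposes a framework for reconstructing a 3D heart from sparse, pose-agnostic 2D echocardiographic slices. It alternates between 3D pose estimation of the slices and their 3D integration, significantly improving left ventricle volume estimation.

Details

Motivation: Echocardiography typically provides only 2D images, which makes accurate estimation of clinical parameters such as left ventricle (LV) volume difficult, while 3D ultrasound suffers from low spatial and temporal resolution and demanding manual delineation. An efficient method for reconstructing a 3D heart model from sparse 2D slices is therefore needed.

Result: With six planes, the LV volume estimation error is 1.98%, compared with 20.24% for the bi-plane method. The framework can also estimate right ventricle (RV) volume from 2D slices (error 5.75%), which the authors describe as an important breakthrough.

Insight: The study offers a new route to personalized 3D structural and functional analysis from cardiac ultrasound, with strong potential for clinical use.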

Abstract: Echocardiography (echo) plays an indispensable role in the clinical practice of heart diseases. However, ultrasound imaging typically provides only two-dimensional (2D) cross-sectional images from a few specific views, making it challenging to interpret and inaccurate for estimation of clinical parameters like the volume of the left ventricle (LV). 3D ultrasound imaging provides an alternative for 3D quantification, but is still limited by the low spatial and temporal resolution and the highly demanding manual delineation. To address these challenges, we propose an innovative framework for reconstructing personalized 3D heart anatomy from 2D echo slices that are frequently used in clinical practice. Specifically, a novel 3D reconstruction pipeline is designed, which alternately optimizes between the 3D pose estimation of these 2D slices and the 3D integration of these slices using an implicit neural network, progressively transforming a prior 3D heart shape into a personalized 3D heart model. We validate the method with two datasets. When six planes are used, the reconstructed 3D heart can lead to a significant improvement for LV volume estimation over the bi-plane method (error in percent: 1.98% vs. 20.24%). In addition, the whole reconstruction framework achieves an important breakthrough in that it can estimate RV volume from 2D echo slices (with an error of 5.75%). This study provides a new way for personalized 3D structure and function analysis from cardiac ultrasound and is of great potential in clinical practice.

eess.AS [Back]

Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang

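TL;DR: DeSTA2.5-Audio is a general-purpose Large Audio Language Model (LALM) that uses a self-generated cross-modal alignment strategy to avoid task-specific instruction tuning while preserving the backbone's original language abilities, and it performs strongly on a range of audio-language benchmarks.

Details

Motivation: Existing LALMs typically add auditory capabilities to a Large Language Model (LLM) by training on audio-instruction datasets, which often causes the LLM to lose its original language abilities. This paper aims to address that problem.

Result: Achieves state-of-the-art or competitive performance on Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench.

Insight: A self-generated data construction strategy mitigates catastrophic forgetting while establishing effective audio-text alignment, offering practical guidance for building general-purpose LALMs.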

Abstract: We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. Recent LALMs typically augment Large Language Models (LLMs) with auditory capabilities by training on large-scale, manually curated or LLM-synthesized audio-instruction datasets. However, these approaches have often suffered from the catastrophic forgetting of the LLM’s original language abilities. To address this, we revisit the data construction pipeline and propose DeSTA, a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets. This approach preserves the LLM’s native language proficiency while establishing effective audio-text alignment, thereby enabling zero-shot generalization without task-specific tuning. Using DeSTA, we construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms widely adopted data construction and training strategies in both auditory perception and instruction-following capabilities. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.

q-fin.CP [Back]

[76] FinAI-BERT: A Transformer-Based Model for Sentence-Level Detection of AI Disclosures in Financial Reports q-fin.CP | cs.CL | econ.GN | q-fin.EC | q-fin.GNPDF

Muhammad Bilal Zafar

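TL;DR: FinAI-BERT is a transformer-based language model for sentence-level detection of AI disclosures in financial reports. It reaches 99.37% accuracy while also addressing interpretability and robustness.

Details

Motivation: As AI spreads through financial services, existing approaches (such as keyword expansion or document-level classification) fall short in granularity, interpretability, and robustness.

Result: FinAI-BERT achieves 99.37% accuracy and an F1 score of 0.993, outperforming traditional baselines (Logistic Regression, Naive Bayes, Random Forest, and XGBoost). Robustness checks indicate stable behavior across sentence lengths, adversarial inputs, and temporal samples.

Insight: The study offers a practical recipe for fine-grained classification in financial NLP and gives regulators and analysts a transparent, scalable tool for monitoring AI disclosures.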

Abstract: The proliferation of artificial intelligence (AI) in financial services has prompted growing demand for tools that can systematically detect AI-related disclosures in corporate filings. While prior approaches often rely on keyword expansion or document-level classification, they fall short in granularity, interpretability, and robustness. This study introduces FinAI-BERT, a domain-adapted transformer-based language model designed to classify AI-related content at the sentence level within financial texts. The model was fine-tuned on a manually curated and balanced dataset of 1,586 sentences drawn from 669 annual reports of U.S. banks (2015 to 2023). FinAI-BERT achieved near-perfect classification performance (accuracy of 99.37 percent, F1 score of 0.993), outperforming traditional baselines such as Logistic Regression, Naive Bayes, Random Forest, and XGBoost. Interpretability was ensured through SHAP-based token attribution, while bias analysis and robustness checks confirmed the model’s stability across sentence lengths, adversarial inputs, and temporal samples. Theoretically, the study advances financial NLP by operationalizing fine-grained, theme-specific classification using transformer architectures. Practically, it offers a scalable, transparent solution for analysts, regulators, and scholars seeking to monitor the diffusion and framing of AI across financial institutions.

cs.RO [Back]

[77] MultiGen: Using Multimodal Generation in Simulation to Learn Multimodal Policies in Real cs.RO | cs.CVPDF

Renhao Wang, Haoran Geng, Tingle Li, Feishi Wang, Gopala Anumanchipalli

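TL;DR: MultiGen is a framework that uses generative models to synthesize multimodal data (such as vision and audio) inside a physics simulator, training multimodal robot policies without any real robot data and transferring them zero-shot to real-world tasks.

Details

Motivation: Robots must integrate multiple sensory modalities to act effectively in the real world, but learning multimodal policies at scale remains challenging. Simulation is a viable route, yet modalities such as sound are hard to simulate, so multimodal sim-to-real transfer is still largely unrealized.

Result: Demonstrates effective zero-shot transfer of a multimodal policy to real-world pouring with novel containers and liquids.

Insight: Generative models can simulate hard-to-model modalities and help close the multimodal sim-to-real gap, opening a new direction for multimodal robot learning.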

Abstract: Robots must integrate multiple sensory modalities to act effectively in the real world. Yet, learning such multimodal policies at scale remains challenging. Simulation offers a viable solution, but while vision has benefited from high-fidelity simulators, other modalities (e.g. sound) can be notoriously difficult to simulate. As a result, sim-to-real transfer has succeeded primarily in vision-based tasks, with multimodal transfer still largely unrealized. In this work, we tackle these challenges by introducing MultiGen, a framework that integrates large-scale generative models into traditional physics simulators, enabling multisensory simulation. We showcase our framework on the dynamic task of robot pouring, which inherently relies on multimodal feedback. By synthesizing realistic audio conditioned on simulation video, our method enables training on rich audiovisual trajectories – without any real robot data. We demonstrate effective zero-shot transfer to real-world pouring with novel containers and liquids, highlighting the potential of generative modeling to both simulate hard-to-model modalities and close the multimodal sim-to-real gap.

cs.LG [Back]

Gautam Kishore Shahi

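TL;DR: This study examines early fusion of multimodal features (text, images, and social features) for misinformation detection on social media. Combining unsupervised and supervised machine learning improves performance by 15% over unimodal models and by 5% over bimodal models.

Details

Motivation: Social media is flooded with misinformation during elections and crises. Existing research mostly focuses on a single modality (text or images); few studies explore multimodal feature combinations.

Result: The multimodal model outperforms unimodal models (by 15%) and bimodal models (by 5%). The study also analyzes propagation patterns based on the characteristics of misinformation tweets and the users who spread them.

Insight: Fusing multimodal features and adding social features are key to misinformation detection, and early fusion shows promise for improving performance.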

Abstract: Amid a tidal wave of misinformation flooding social media during elections and crises, extensive research has been conducted on misinformation detection, primarily focusing on text-based or image-based approaches. However, only a few studies have explored multimodal feature combinations, such as integrating text and images for building a classification model to detect misinformation. This study investigates the effectiveness of different multimodal feature combinations, incorporating text, images, and social features using an early fusion approach for the classification model. This study analyzed 1,529 tweets containing both text and images during the COVID-19 pandemic and election periods collected from Twitter (now X). A data enrichment process was applied to extract additional social features, as well as visual features, through techniques such as object detection and optical character recognition (OCR). The results show that combining unsupervised and supervised machine learning models improves classification performance by 15% compared to unimodal models and by 5% compared to bimodal models. Additionally, the study analyzes the propagation patterns of misinformation based on the characteristics of misinformation tweets and the users who disseminate them.
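
Early fusion, as the term is generally used, means combining per-modality features into one vector before a single classifier sees them. The sketch below shows only that concatenation step; the function name and the feature values in the usage note are hypothetical, and the paper's actual feature extraction and classifier are not reproduced here.

```python
from typing import List, Sequence


def early_fusion(
    text_features: Sequence[float],
    image_features: Sequence[float],
    social_features: Sequence[float],
) -> List[float]:
    """Concatenate per-modality feature vectors into one input vector."""
    return list(text_features) + list(image_features) + list(social_features)
```

For example, `early_fusion([0.1, 0.2], [1.0], [3.0, 4.0])` yields a five-dimensional vector that a single downstream classifier can consume, in contrast to late fusion, where each modality gets its own model and only the predictions are combined.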

[79] Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions cs.LG | cs.CL | cs.CYPDF

Eitan Anzenberg, Arunava Samajpati, Sivasankaran Chandrasekar, Varun Kacholia

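TL;DR: The paper evaluates the accuracy and fairness of large language models (LLMs) in hiring and finds that a domain-specific hiring model, Match Score, outperforms general-purpose LLMs on both. The results suggest that specialized models mitigate societal bias more effectively and underline the importance of domain-specific modeling and bias auditing when deploying AI in high-stakes domains.

Details

Motivation: Using LLMs in hiring raises concerns about accuracy and algorithmic bias, so their behavior needs to be evaluated in realistic hiring scenarios and compared with a specialized model.

Result: Match Score outperforms the LLMs on ROC AUC (0.85 vs. 0.77) and on fairness (minimum race-wise impact ratio of 0.957 vs. 0.809 or lower for the best LLMs).

Insight: A bespoke supervised model can outperform general-purpose LLMs in high-stakes domains such as hiring, and the results indicate that accuracy and fairness need not be traded off against each other.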

Abstract: The use of large language models (LLMs) in hiring promises to streamline candidate screening, but it also raises serious concerns regarding accuracy and algorithmic bias where sufficient safeguards are not in place. In this work, we benchmark several state-of-the-art foundational LLMs - including models from OpenAI, Anthropic, Google, Meta, and Deepseek, and compare them with our proprietary domain-specific hiring model (Match Score) for job candidate matching. We evaluate each model’s predictive accuracy (ROC AUC, Precision-Recall AUC, F1-score) and fairness (impact ratio of cut-off analysis across declared gender, race, and intersectional subgroups). Our experiments on a dataset of roughly 10,000 real-world recent candidate-job pairs show that Match Score outperforms the general-purpose LLMs on accuracy (ROC AUC 0.85 vs 0.77) and achieves significantly more equitable outcomes across demographic groups. Notably, Match Score attains a minimum race-wise impact ratio of 0.957 (near-parity), versus 0.809 or lower for the best LLMs, (0.906 vs 0.773 for the intersectionals, respectively). We discuss why pretraining biases may cause LLMs with insufficient safeguards to propagate societal biases in hiring scenarios, whereas a bespoke supervised model can more effectively mitigate these biases. Our findings highlight the importance of domain-specific modeling and bias auditing when deploying AI in high-stakes domains such as hiring, and caution against relying on off-the-shelf LLMs for such tasks without extensive fairness safeguards. Furthermore, we show with empirical evidence that there shouldn’t be a dichotomy between choosing accuracy and fairness in hiring: a well-designed algorithm can achieve both accuracy in hiring and fairness in outcomes.
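
The impact ratio used as the fairness metric above is commonly computed as each group's selection rate divided by the highest group's selection rate, so 1.0 means parity with the most-selected group. The sketch below follows that common definition; the paper's exact cut-off analysis is not specified in the abstract, and the function name and input layout are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def impact_ratios(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute per-group impact ratios from (group, was_selected) records."""
    totals: Dict[str, int] = defaultdict(int)
    selected: Dict[str, int] = defaultdict(int)
    for group, was_selected in records:
        totals[group] += 1
        if was_selected:
            selected[group] += 1
    # Selection rate per group, then normalize by the best-treated group.
    rates = {group: selected[group] / totals[group] for group in totals}
    best_rate = max(rates.values())
    if best_rate == 0:
        raise ValueError("no group has any selected candidates")
    return {group: rate / best_rate for group, rate in rates.items()}
```

If group A has 50 of 100 candidates selected and group B has 40 of 100, the ratios are 1.0 for A and 0.8 for B, so the minimum impact ratio is 0.8. The 0.957 and 0.809 figures quoted in the abstract are minimum ratios of this kind across race groups.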

[80] Energy-Based Transformers are Scalable Learners and Thinkers cs.LG | cs.AI | cs.CL | cs.CVPDF

Alexi Gladstone, Ganesh Nanduru, Md Mofijul Islam, Peixuan Han, Hyeonjeong Ha

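TL;DR: The paper introduces Energy-Based Transformers (EBTs), which learn inference-time computation (analogous to human System 2 Thinking) from unsupervised learning alone and outperform the Transformer++ approach and Diffusion Transformers in several settings.

Details

Motivation: Existing inference-time computation methods are mostly modality-specific or problem-specific, or require additional supervised training. The paper asks whether models can learn to think solely from unsupervised learning.

Result: (1) During training, EBTs achieve an up to 35% higher scaling rate than Transformer++. (2) At inference, System 2 Thinking improves EBT performance on language tasks by 29% more than it improves Transformer++. (3) EBTs outperform Diffusion Transformers on image denoising while using fewer forward passes.

Insight: EBTs achieve better results on most downstream tasks given the same or worse pretraining performance, suggesting stronger generalization. They offer a new direction for scaling both the learning and thinking capabilities of models.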

Abstract: Inference-time computation techniques, analogous to human System 2 Thinking, have recently become popular for improving model performances. However, most existing approaches suffer from several limitations: they are modality-specific (e.g., working only in text), problem-specific (e.g., verifiable domains like math and coding), or require additional supervision/training on top of unsupervised pretraining (e.g., verifiers or verifiable rewards). In this paper, we ask the question “Is it possible to generalize these System 2 Thinking approaches, and develop models that learn to think solely from unsupervised learning?” Interestingly, we find the answer is yes, by learning to explicitly verify the compatibility between inputs and candidate-predictions, and then re-framing prediction problems as optimization with respect to this verifier. Specifically, we train Energy-Based Transformers (EBTs) – a new class of Energy-Based Models (EBMs) – to assign an energy value to every input and candidate-prediction pair, enabling predictions through gradient descent-based energy minimization until convergence. Across both discrete (text) and continuous (visual) modalities, we find EBTs scale faster than the dominant Transformer++ approach during training, achieving an up to 35% higher scaling rate with respect to data, batch size, parameters, FLOPs, and depth. During inference, EBTs improve performance with System 2 Thinking by 29% more than the Transformer++ on language tasks, and EBTs outperform Diffusion Transformers on image denoising while using fewer forward passes. Further, we find that EBTs achieve better results than existing models on most downstream tasks given the same or worse pretraining performance, suggesting that EBTs generalize better than existing approaches. Consequently, EBTs are a promising new paradigm for scaling both the learning and thinking capabilities of models.
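
The core inference idea, predicting by descending an energy function over candidate predictions, can be shown on a toy problem. The sketch below is not an EBT: it uses a hand-written quadratic energy with an analytic gradient in place of a learned transformer, and all names, the step size, and the stopping tolerance are assumptions for illustration.

```python
def energy(x: float, y: float) -> float:
    """Toy energy: lowest when the candidate y equals 2 * x."""
    return (y - 2.0 * x) ** 2


def energy_grad_y(x: float, y: float) -> float:
    """Analytic gradient of the toy energy with respect to y."""
    return 2.0 * (y - 2.0 * x)


def predict(
    x: float,
    y_init: float = 0.0,
    step_size: float = 0.1,
    tol: float = 1e-8,
    max_steps: int = 1000,
) -> float:
    """Refine a candidate prediction by gradient descent on the energy."""
    y = y_init
    for _ in range(max_steps):
        grad = energy_grad_y(x, y)
        if abs(grad) < tol:  # treat a vanishing gradient as convergence
            break
        y -= step_size * grad
    return y
```

Calling `predict(3.0)` converges to a value very close to 6.0, the minimizer of the toy energy. Running more descent steps corresponds to spending more inference-time compute on a single prediction, which is the knob the abstract associates with System 2 Thinking.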

[81] OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding cs.LG | cs.CLPDF

Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Shaojie Zhuo, Chen Feng, Yicheng Lin

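TL;DR: OmniDraft is a cross-vocabulary, online adaptive drafting framework for speculative decoding. It addresses target-model compatibility and dynamic adaptation, lets a single draft model pair with any target model, and provides up to 1.5-2x speedup.

Details

Motivation: Existing speculative decoding methods usually need a draft model that is pretrained or distilled offline for a particular target model series. Online deployment raises two main challenges: incompatibility between the draft and target models, and the expectation that latency improves with usage over time.

Result: OmniDraft lets a single Llama-68M model pair with several target models (Vicuna-7B, Qwen2-7B, and Llama3-8B) and provides up to 1.5-2x speedup.

Insight: (1) Adapting one draft model to multiple target models is feasible. (2) Online adaptation and cross-vocabulary matching are key to speculative decoding performance. (3) Model efficiency and user customization on edge devices deserve further study.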

Abstract: Speculative decoding generally dictates having a small, efficient draft model that is either pretrained or distilled offline to a particular target model series, for instance, Llama or Qwen models. However, within online deployment settings, there are two major challenges: 1) usage of a target model that is incompatible with the draft model; 2) expectation of latency improvements over usage and time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch across draft and target models; and further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications where model cost, efficiency and user customization are the major points of contention. This further highlights the need to tackle the above challenges and motivates the “one drafter for all” paradigm. We showcase the proficiency of the OmniDraft framework by performing online learning on math reasoning, coding and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models including Vicuna-7B, Qwen2-7B and Llama3-8B models for speculative decoding; and additionally provides up to 1.5-2x speedup.

[82] ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning cs.LG | cs.CLPDF

Ruiyang Zhou, Shuozhe Li, Amy Zhang, Liu Leqi

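TL;DR: The paper proposes ExPO, which generates self-explanations conditioned on the ground-truth answer, overcoming the reliance of existing RL methods on the model's initial ability to produce positive samples on hard reasoning tasks.

Details

Motivation: Existing RL methods depend on the model's initial ability to generate positive samples, so they struggle on problems where the model initially fails, especially in early-stage training and on challenging tasks.

Result: ExPO improves learning efficiency and final performance on reasoning benchmarks, and it surpasses expert-demonstration-based methods in challenging settings such as MATH level-5, where the model initially struggles most.

Insight: Expert demonstrations are often ineffective in RL post-training, whereas self-generated explanations conditioned on the answer can work better. Effective positive samples should be likely under the current policy and should increase the likelihood of the correct answer.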

Abstract: Recent advances in large language models have been driven by reinforcement learning (RL)-style post-training, which improves reasoning by optimizing model outputs based on reward or preference signals. GRPO-style approaches implement this by using self-generated samples labeled by an outcome-based verifier. However, these methods depend heavily on the model’s initial ability to produce positive samples. They primarily refine what the model already knows (distribution sharpening) rather than enabling the model to solve problems where it initially fails. This limitation is especially problematic in early-stage RL training and on challenging reasoning tasks, where positive samples are unlikely to be generated. To unlock reasoning ability in such settings, the model must explore new reasoning trajectories beyond its current output distribution. Such exploration requires access to sufficiently good positive samples to guide the learning. While expert demonstrations seem like a natural solution, we find that they are often ineffective in RL post-training. Instead, we identify two key properties of effective positive samples: they should (1) be likely under the current policy, and (2) increase the model’s likelihood of predicting the correct answer. Based on these insights, we propose Self-Explanation Policy Optimization (ExPO), a simple and modular framework that generates such samples by conditioning on the ground-truth answer. ExPO enables efficient exploration and guides the model to produce reasoning trajectories more aligned with its policy than expert-written CoTs, while ensuring higher quality than its own (incorrect) samples. Experiments show that ExPO improves both learning efficiency and final performance on reasoning benchmarks, surpassing expert-demonstration-based methods in challenging settings such as MATH level-5, where the model initially struggles the most.

[83] Generative Latent Diffusion for Efficient Spatiotemporal Data Reduction cs.LG | cs.CVPDF

Xiao Li, Liangji Zhu, Anand Rangarajan, Sanjay Ranka

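TL;DR: The paper proposes an efficient latent diffusion framework that combines a variational autoencoder with a conditional diffusion model to achieve high compression ratios and accurate spatiotemporal reconstruction.

Details

Motivation: Generative models perform well in conditional settings, but their limited controllability and reconstruction accuracy restrict their practical use for data compression, so a more efficient approach is needed.

Result: Across multiple datasets, the method achieves up to 10 times higher compression ratios than rule-based state-of-the-art compressors such as SZ3, and up to 63% better performance than leading learning-based methods under the same reconstruction error.

Insight: Replacing per-frame latent storage with generative interpolation from a few keyframes greatly reduces storage while preserving reconstruction accuracy, offering a new approach to spatiotemporal data compression.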

Abstract: Generative models have demonstrated strong performance in conditional settings and can be viewed as a form of data compression, where the condition serves as a compact representation. However, their limited controllability and reconstruction accuracy restrict their practical application to data compression. In this work, we propose an efficient latent diffusion framework that bridges this gap by combining a variational autoencoder with a conditional diffusion model. Our method compresses only a small number of keyframes into latent space and uses them as conditioning inputs to reconstruct the remaining frames via generative interpolation, eliminating the need to store latent representations for every frame. This approach enables accurate spatiotemporal reconstruction while significantly reducing storage costs. Experimental results across multiple datasets show that our method achieves up to 10 times higher compression ratios than rule-based state-of-the-art compressors such as SZ3, and up to 63 percent better performance than leading learning-based methods under the same reconstruction error.

[84] Holistic Continual Learning under Concept Drift with Adaptive Memory Realignment cs.LG | cs.AI | cs.CVPDF

Alif Ashrafee, Jedrzej Kozal, Michal Wozniak, Bartosz Krawczyk

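TL;DR: This paper proposes Adaptive Memory Realignment (AMR), a lightweight method for handling data distributions that change under concept drift in continual learning. By removing outdated samples from the memory buffer and adding a small number of up-to-date instances, AMR matches the performance of Full Relearning (FR) while reducing annotation and computation. The paper also introduces four concept-drift variants of standard vision benchmarks for evaluation.

Details

Motivation: Traditional continual learning methods assume that the data distribution of previously learned tasks is static, ignoring the dynamics of real-world data streams such as concept drift. Models must then adapt quickly to the new distribution while remaining stable. The paper aims to resolve this tension.

Result: On the four newly introduced datasets, AMR consistently counters concept drift across several rehearsal-based baselines, matching FR while reducing the need for labeled data and computation by orders of magnitude.

Insight: Continual learning has to handle concept drift and catastrophic forgetting together. AMR's lightweight memory adjustment balances stability and plasticity, giving an efficient solution for non-stationary environments.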

Abstract: Traditional continual learning methods prioritize knowledge retention and focus primarily on mitigating catastrophic forgetting, implicitly assuming that the data distribution of previously learned tasks remains static. This overlooks the dynamic nature of real-world data streams, where concept drift permanently alters previously seen data and demands both stability and rapid adaptation. We introduce a holistic framework for continual learning under concept drift that simulates realistic scenarios by evolving task distributions. As a baseline, we consider Full Relearning (FR), in which the model is retrained from scratch on newly labeled samples from the drifted distribution. While effective, this approach incurs substantial annotation and computational overhead. To address these limitations, we propose Adaptive Memory Realignment (AMR), a lightweight alternative that equips rehearsal-based learners with a drift-aware adaptation mechanism. AMR selectively removes outdated samples of drifted classes from the replay buffer and repopulates it with a small number of up-to-date instances, effectively realigning memory with the new distribution. This targeted resampling matches the performance of FR while reducing the need for labeled data and computation by orders of magnitude. To enable reproducible evaluation, we introduce four concept-drift variants of standard vision benchmarks: Fashion-MNIST-CD, CIFAR10-CD, CIFAR100-CD, and Tiny-ImageNet-CD, where previously seen classes reappear with shifted representations. Comprehensive experiments on these datasets using several rehearsal-based baselines show that AMR consistently counters concept drift, maintaining high accuracy with minimal overhead. These results position AMR as a scalable solution that reconciles stability and plasticity in non-stationary continual learning environments.
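
The buffer update that the abstract describes for AMR (drop outdated samples of drifted classes, then add a few fresh ones) can be sketched as follows. This is a simplified reading of the abstract, not the authors' implementation: the function name, the `(sample, label)` tuple layout, and the first-come selection of fresh samples are assumptions, and drift detection itself is taken as given.

```python
from typing import Any, Iterable, List, Set, Tuple

Sample = Tuple[Any, int]  # (data, class label)


def realign_memory(
    buffer: Iterable[Sample],
    drifted_classes: Set[int],
    fresh_samples: Iterable[Sample],
    per_class: int,
) -> List[Sample]:
    """Replace stale replay entries of drifted classes with fresh ones."""
    # Keep every stored sample whose class has not drifted.
    realigned = [(x, y) for x, y in buffer if y not in drifted_classes]
    added = {label: 0 for label in drifted_classes}
    # Add at most `per_class` newly labeled samples per drifted class.
    for x, y in fresh_samples:
        if y in added and added[y] < per_class:
            realigned.append((x, y))
            added[y] += 1
    return realigned
```

Only the drifted classes need new labels, and only `per_class` of them each, which is where the savings over retraining from scratch on the drifted distribution would come from.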

[85] Fair Deepfake Detectors Can Generalize cs.LG | cs.CVPDF

Harry Cheng, Ming-Hui Liu, Yangyang Guo, Tianyi Wang, Liqiang Nie

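TL;DR: This paper is the first to uncover a causal relationship between fairness and generalization in deepfake detection. It proposes DAID, a plug-and-play framework that controls for confounders to improve both at once.

Details

Motivation: Existing deepfake detection approaches often treat generalization to unseen manipulations and demographic fairness as conflicting objectives. The paper addresses this tension from a causal perspective.

Result: Across three cross-domain benchmarks, DAID outperforms several state-of-the-art detectors on both fairness and generalization, supporting its theoretical foundation and practical effectiveness.

Insight: Once confounders are controlled for, fairness and generalization stop being conflicting goals and can be optimized together. This causal view offers a new direction for related research.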

Abstract: Deepfake detection models face two critical challenges: generalization to unseen manipulations and demographic fairness among population groups. However, existing approaches often demonstrate that these two objectives are inherently conflicting, revealing a trade-off between them. In this paper, we, for the first time, uncover and formally define a causal relationship between fairness and generalization. Building on the back-door adjustment, we show that controlling for confounders (data distribution and model capacity) enables improved generalization via fairness interventions. Motivated by this insight, we propose Demographic Attribute-insensitive Intervention Detection (DAID), a plug-and-play framework composed of: i) Demographic-aware data rebalancing, which employs inverse-propensity weighting and subgroup-wise feature normalization to neutralize distributional biases; and ii) Demographic-agnostic feature aggregation, which uses a novel alignment loss to suppress sensitive-attribute signals. Across three cross-domain benchmarks, DAID consistently achieves superior performance in both fairness and generalization compared to several state-of-the-art detectors, validating both its theoretical foundation and practical effectiveness.
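
Inverse-propensity weighting, which the abstract names as part of the data rebalancing step, weights each sample by the inverse of the probability of observing its group. The sketch below uses the simplest possible propensity, the marginal frequency of each demographic group in the dataset; how DAID actually estimates propensities is not stated in the abstract, so the function name and this choice are assumptions.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def inverse_propensity_weights(groups: Sequence[Hashable]) -> List[float]:
    """Weight each sample by 1 / P(group), with P taken as group frequency."""
    counts = Counter(groups)
    n = len(groups)
    # n / counts[g] equals 1 / (counts[g] / n), the inverse group frequency.
    return [n / counts[g] for g in groups]
```

With groups `["a", "a", "a", "b"]`, the three majority samples each get weight 4/3 and the single minority sample gets weight 4, so both groups contribute the same total weight to a weighted loss.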

Francesco Di Salvo, Hanh Huyen My Nguyen, Christian Ledig

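TL;DR: This paper proposes an embedding-based federated data-sharing method in which a Differentially Private Conditional Variational Autoencoder (DP-CVAE) models a privacy-aware global data distribution, supporting diverse downstream tasks while improving privacy, scalability, and efficiency.

Details

Motivation: Data scarcity and privacy regulations constrain the use of deep learning in medical imaging. Federated learning enables decentralized training but suffers from high communication costs and is often restricted to a single downstream task, which reduces flexibility.

Result: The method outperforms traditional federated learning classifiers while ensuring differential privacy, and DP-CVAE produces higher-fidelity embeddings than DP-CGAN while requiring 5x fewer parameters.

Insight: Combining differentially private generative models with federated learning offers a privacy-preserving and efficient route to data sharing in fields such as medical imaging, and it supports multiple downstream tasks.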

Abstract: Deep Learning (DL) has revolutionized medical imaging, yet its adoption is constrained by data scarcity and privacy regulations, limiting access to diverse datasets. Federated Learning (FL) enables decentralized training but suffers from high communication costs and is often restricted to a single downstream task, reducing flexibility. We propose a data-sharing method via Differentially Private (DP) generative models. By adopting foundation models, we extract compact, informative embeddings, reducing redundancy and lowering computational overhead. Clients collaboratively train a Differentially Private Conditional Variational Autoencoder (DP-CVAE) to model a global, privacy-aware data distribution, supporting diverse downstream tasks. Our approach, validated across multiple feature extractors, enhances privacy, scalability, and efficiency, outperforming traditional FL classifiers while ensuring differential privacy. Additionally, DP-CVAE produces higher-fidelity embeddings than DP-CGAN while requiring $5{\times}$ fewer parameters.

cs.CR [Back]

[87] Early Signs of Steganographic Capabilities in Frontier LLMs cs.CR | cs.AI | cs.CL | cs.LGPDF

Artur Zolkowski, Kei Nishimura-Gasparian, Robert McCarthy, Roland S. Zimmermann, David Lindner

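TL;DR: The paper investigates whether frontier large language models (LLMs) have steganographic capabilities, that is, whether they can hide messages or reasoning inside seemingly benign outputs. Current models cannot encode short messages without a monitor noticing under standard affordances, but they can succeed when given additional affordances such as an unmonitored scratchpad and prior coordination on an encoding scheme. Models also show early signs of basic encoded reasoning on a simple state-tracking problem, yet they can rarely hide that reasoning well enough to fool a monitor.

Details

Motivation: Monitoring LLM outputs is crucial for mitigating risks from misuse and misalignment. Steganography could let models evade monitoring, so the steganographic capabilities of frontier LLMs need to be evaluated to understand the risk.

Result: Under standard affordances, current LLMs cannot hide short messages from a monitor, but they can succeed with additional affordances. They show early signs of basic encoded reasoning but can rarely hide it subtly within a cover task.

Insight: Frontier LLMs exhibit nascent steganographic capabilities. These are likely insufficient to bypass well-designed monitors at present, but this could change as models become more capable.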

Abstract: Monitoring Large Language Model (LLM) outputs is crucial for mitigating risks from misuse and misalignment. However, LLMs could evade monitoring through steganography: Encoding hidden information within seemingly benign generations. In this paper, we evaluate the steganography capabilities in frontier LLMs to better understand the risk they pose. We focus on two types of steganography: passing encoded messages and performing encoded reasoning. We find that current models are unable to encode short messages in their outputs without a monitor noticing under standard affordances. They can succeed, however, if given additional affordances such as using an unmonitored scratchpad and coordinating on what encoding scheme to use. We additionally find early signs that models can perform basic encoded reasoning in a simple state-tracking problem. This includes some ability to reason with their own and pre-defined schemes, including encoding schemes such as Hexadecimal. Despite this, they can rarely hide reasoning subtly within a cover task to fool a monitor. Overall, our results indicate that current LLMs exhibit nascent steganographic capabilities. While these capabilities are likely insufficient to bypass well-designed monitors at present, this could change in the future.
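
To make the idea of an encoding scheme concrete, the following is a minimal sketch of the Hexadecimal scheme mentioned in the abstract, paired with a deliberately naive monitor. The function names (`encode_hex`, `decode_hex`, `naive_monitor`) and the `min_run` threshold are hypothetical and chosen for illustration; they are not the paper's evaluation setup.

```python
import re


def encode_hex(message: str) -> str:
    """Encode a message as a hexadecimal string of its UTF-8 bytes."""
    return message.encode("utf-8").hex()


def decode_hex(payload: str) -> str:
    """Invert encode_hex."""
    return bytes.fromhex(payload).decode("utf-8")


def naive_monitor(text: str, min_run: int = 8) -> bool:
    """Flag text containing a run of at least `min_run` hex digits.

    This is an illustrative heuristic only (an assumption), far simpler
    than the LLM-based monitors the paper discusses.
    """
    return re.search(r"[0-9a-fA-F]{%d,}" % min_run, text) is not None


if __name__ == "__main__":
    secret = encode_hex("hi there")
    print(secret, "->", decode_hex(secret))
    print("flagged:", naive_monitor(f"The build id is {secret}."))
```

A raw hex payload like this is easy to flag, which is in line with the abstract's finding that models rarely hide encoded content subtly enough to fool a monitor.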

cs.GR

[88] Real-time Image-based Lighting of Glints cs.GR | cs.CV

Tom Kneiphof, Reinhard Klein

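TL;DR: Proposes an efficient real-time image-based lighting method for materials with a sparkling or glittering appearance, supporting fully dynamic material properties and environment maps.

Details

Motivation: Reproducing the shading of sparkling or glittering materials is a particularly challenging case in real-time rendering, and existing methods struggle to handle dynamic environment lighting efficiently.

Result: The approximation is validated as close to ground-truth renderings across a range of material properties and lighting conditions, with little overhead over rendering glints from a single directional light.

Insight: Approximation combined with hierarchical sampling makes real-time glint rendering feasible and offers a new approach to rendering complex materials under dynamic lighting.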

Abstract: Image-based lighting is a widely used technique to reproduce shading under real-world lighting conditions, especially in real-time rendering applications. A particularly challenging scenario involves materials exhibiting a sparkling or glittering appearance, caused by discrete microfacets scattered across their surface. In this paper, we propose an efficient approximation for image-based lighting of glints, enabling fully dynamic material properties and environment maps. Our novel approach is grounded in real-time glint rendering under area light illumination and employs standard environment map filtering techniques. Crucially, our environment map filtering process is sufficiently fast to be executed on a per-frame basis. Our method assumes that the environment map is partitioned into few homogeneous regions of constant radiance. By filtering the corresponding indicator functions with the normal distribution function, we obtain the probabilities for individual microfacets to reflect light from each region. During shading, these probabilities are utilized to hierarchically sample a multinomial distribution, facilitated by our novel dual-gated Gaussian approximation of binomial distributions. We validate that our real-time approximation is close to ground-truth renderings for a range of material properties and lighting conditions, and demonstrate robust and stable performance, with little overhead over rendering glints from a single directional light. Compared to rendering smooth materials without glints, our approach requires twice as much memory to store the prefiltered environment map.
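
The step of sampling a multinomial distribution through a chain of binomial draws can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: it uses exact Bernoulli sums where the paper uses its dual-gated Gaussian approximation, it chains the draws sequentially where the paper's hierarchy may be organized differently, and the names `sample_binomial` and `sample_multinomial` are hypothetical. The input `probs` stands for the per-region reflection probabilities and is assumed to sum to 1.

```python
import random


def sample_binomial(n: int, p: float, rng: random.Random) -> int:
    """Exact binomial draw as a sum of n Bernoulli trials (slow but simple)."""
    p = min(max(p, 0.0), 1.0)  # guard against floating-point drift
    return sum(1 for _ in range(n) if rng.random() < p)


def sample_multinomial(n: int, probs: list[float], rng: random.Random) -> list[int]:
    """Split n microfacets across regions using conditional binomial draws.

    Region i receives Binomial(remaining_n, p_i / remaining_p) facets,
    and the last region receives whatever is left.
    """
    counts = []
    remaining_n = n
    remaining_p = 1.0
    for p in probs[:-1]:
        conditional = p / remaining_p if remaining_p > 0.0 else 0.0
        k = sample_binomial(remaining_n, conditional, rng)
        counts.append(k)
        remaining_n -= k
        remaining_p -= p
    counts.append(remaining_n)
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    # Three homogeneous environment regions with assumed probabilities.
    print(sample_multinomial(1000, [0.5, 0.3, 0.2], rng))
```

Replacing `sample_binomial` with a fast Gaussian-style approximation is where a real-time implementation would save most of its cost, since the exact draw above is linear in the number of microfacets.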

cs.AI

[89] STELLA: Self-Evolving LLM Agent for Biomedical Research cs.AI | cs.CL | q-bio.BM

Ruofan Jin, Zaixi Zhang, Mengdi Wang, Le Cong

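TL;DR: STELLA is a self-evolving LLM agent that uses a multi-agent architecture to dynamically expand its capabilities and toolset, substantially improving performance on biomedical tasks.

Details

Motivation: The rapid growth of biomedical data and tools outpaces human expertise, and existing AI agents rely on static toolsets that limit their ability to adapt and scale.

Result: STELLA outperforms existing models on several biomedical benchmarks (approximately 26% on Humanity's Last Exam: Biomedicine, 54% on LAB-Bench: DBQA, and 63% on LAB-Bench: LitQA), and its performance improves markedly with experience.

Insight: STELLA demonstrates the potential of AI agents to improve continually through dynamic learning and tool expansion, offering a new paradigm for biomedical research.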

Abstract: The rapid growth of biomedical data, tools, and literature has created a fragmented research landscape that outpaces human expertise. While AI agents offer a solution, they typically rely on static, manually curated toolsets, limiting their ability to adapt and scale. Here, we introduce STELLA, a self-evolving AI agent designed to overcome these limitations. STELLA employs a multi-agent architecture that autonomously improves its own capabilities through two core mechanisms: an evolving Template Library for reasoning strategies and a dynamic Tool Ocean that expands as a Tool Creation Agent automatically discovers and integrates new bioinformatics tools. This allows STELLA to learn from experience. We demonstrate that STELLA achieves state-of-the-art accuracy on a suite of biomedical benchmarks, scoring approximately 26% on Humanity’s Last Exam: Biomedicine, 54% on LAB-Bench: DBQA, and 63% on LAB-Bench: LitQA, outperforming leading models by up to 6 percentage points. More importantly, we show that its performance systematically improves with experience; for instance, its accuracy on the Humanity’s Last Exam benchmark almost doubles with increased trials. STELLA represents a significant advance towards AI Agent systems that can learn and grow, dynamically scaling their expertise to accelerate the pace of biomedical discovery.

[90] Strategic Intelligence in Large Language Models: Evidence from evolutionary Game Theory cs.AI | cs.CL | cs.GT

Kenneth Payne, Baptiste Alloui-Cros

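TL;DR: Using Iterated Prisoner's Dilemma tournaments from evolutionary game theory, the paper provides a first evaluation of the strategic intelligence of large language models (LLMs) in competitive settings and finds that LLMs from different companies show distinctive strategic traits.

Details

Motivation: The study asks whether LLMs possess the strategic intelligence to reason about goals in competitive settings, addressing a gap in research on LLM behavior in game theory.

Result: LLMs are competitive in complex ecosystems, and each company's models show distinctive strategic traits: Google's Gemini models are more aggressive, OpenAI's models are more cooperative, and Anthropic's Claude is the most forgiving reciprocator.

Insight: The study shows that LLMs actively reason about the time horizon and their opponent's likely strategy, offering a new view of algorithmic decision-making under uncertainty.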

Abstract: Are Large Language Models (LLMs) a new form of strategic intelligence, able to reason about goals in competitive settings? We present compelling supporting evidence. The Iterated Prisoner’s Dilemma (IPD) has long served as a model for studying decision-making. We conduct the first ever series of evolutionary IPD tournaments, pitting canonical strategies (e.g., Tit-for-Tat, Grim Trigger) against agents from the leading frontier AI companies OpenAI, Google, and Anthropic. By varying the termination probability in each tournament (the “shadow of the future”), we introduce complexity and chance, confounding memorisation. Our results show that LLMs are highly competitive, consistently surviving and sometimes even proliferating in these complex ecosystems. Furthermore, they exhibit distinctive and persistent “strategic fingerprints”: Google’s Gemini models proved strategically ruthless, exploiting cooperative opponents and retaliating against defectors, while OpenAI’s models remained highly cooperative, a trait that proved catastrophic in hostile environments. Anthropic’s Claude emerged as the most forgiving reciprocator, showing remarkable willingness to restore cooperation even after being exploited or successfully defecting. Analysis of nearly 32,000 prose rationales provided by the models reveals that they actively reason about both the time horizon and their opponent’s likely strategy, and we demonstrate that this reasoning is instrumental to their decisions. This work connects classic game theory with machine psychology, offering a rich and granular view of algorithmic decision-making under uncertainty.
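
The role of the termination probability (the "shadow of the future") can be illustrated with a single Iterated Prisoner's Dilemma match between canonical strategies. This is a minimal sketch, not the paper's tournament code: the payoff values (5, 3, 1, 0) are the conventional textbook ones and are an assumption here, and `play_match` and its `max_rounds` cap are hypothetical names. It covers one match only, not the evolutionary population dynamics.

```python
import random

# Conventional payoffs (assumed): temptation 5, reward 3, punishment 1, sucker 0.
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def tit_for_tat(own, opp):
    """Cooperate first, then copy the opponent's previous move."""
    return opp[-1] if opp else "C"


def grim_trigger(own, opp):
    """Cooperate until the opponent defects once, then defect forever."""
    return "D" if "D" in opp else "C"


def always_defect(own, opp):
    return "D"


def play_match(strategy_a, strategy_b, termination_prob, rng, max_rounds=10_000):
    """Play rounds until a random stop; return (score_a, score_b, rounds)."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    while True:
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
        # After each round the match ends with probability termination_prob.
        if rng.random() < termination_prob or len(hist_a) >= max_rounds:
            return score_a, score_b, len(hist_a)


if __name__ == "__main__":
    rng = random.Random(42)
    print(play_match(tit_for_tat, always_defect, termination_prob=0.1, rng=rng))
```

A lower `termination_prob` means longer expected matches, so sustained cooperation pays off more; a higher one shortens matches and favors early defection.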

[91] Decoupled Planning and Execution: A Hierarchical Reasoning Framework for Deep Search cs.AI | cs.CL | cs.IR

Jiajie Jin, Xiaoxi Li, Guanting Dong, Yuyao Zhang, Yutao Zhu

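TL;DR: HiRA is a hierarchical reasoning framework that decomposes complex search tasks into subtasks executed by specialized agents, significantly improving performance on deep search tasks.

Details

Motivation: Complex real-world search needs demand deep reasoning and knowledge synthesis across diverse sources, which traditional retrieval-augmented generation (RAG) pipelines struggle to handle. Existing reasoning-based approaches use a single model for both planning and execution, which leads to inefficient reasoning and limited scalability.

Result: On four complex, cross-modal deep search benchmarks, HiRA significantly outperforms state-of-the-art RAG and agent-based systems, improving both answer quality and system efficiency.

Insight: Decoupling planning from execution handles multi-step information-seeking tasks effectively, and specialized agents improve overall performance.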

Abstract: Complex information needs in real-world search scenarios demand deep reasoning and knowledge synthesis across diverse sources, which traditional retrieval-augmented generation (RAG) pipelines struggle to address effectively. Current reasoning-based approaches suffer from a fundamental limitation: they use a single model to handle both high-level planning and detailed execution, leading to inefficient reasoning and limited scalability. In this paper, we introduce HiRA, a hierarchical framework that separates strategic planning from specialized execution. Our approach decomposes complex search tasks into focused subtasks, assigns each subtask to domain-specific agents equipped with external tools and reasoning capabilities, and coordinates the results through a structured integration mechanism. This separation prevents execution details from disrupting high-level reasoning while enabling the system to leverage specialized expertise for different types of information processing. Experiments on four complex, cross-modal deep search benchmarks demonstrate that HiRA significantly outperforms state-of-the-art RAG and agent-based systems. Our results show improvements in both answer quality and system efficiency, highlighting the effectiveness of decoupled planning and execution for multi-step information seeking tasks. Our code is available at https://github.com/ignorejjj/HiRA.

[92] StepHint: Multi-level Stepwise Hints Enhance Reinforcement Learning to Reason cs.AI | cs.CL | cs.LG

Kaiyi Zhang, Ang Lv, Jinpeng Li, Yongbo Wang, Feng Wang

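TL;DR: StepHint uses multi-level stepwise hints to improve reasoning under reinforcement learning, addressing the near-miss reward problem and exploration stagnation in existing methods, and performs strongly on several mathematical benchmarks.

Details

Motivation: Existing reinforcement learning methods face two challenges on complex reasoning tasks, the near-miss reward problem and exploration stagnation, which hurt training efficiency and limit reasoning ability.

Result: StepHint outperforms competing methods on six mathematical benchmarks and shows better generalization and stronger out-of-domain performance.

Insight: Multi-level stepwise hints not only improve training efficiency but also help the model move beyond its comfort zone, strengthening its reasoning ability.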

Abstract: Reinforcement learning with verifiable rewards (RLVR) is a promising approach for improving the complex reasoning abilities of large language models (LLMs). However, current RLVR methods face two significant challenges: the near-miss reward problem, where a small mistake can invalidate an otherwise correct reasoning process, greatly hindering training efficiency; and exploration stagnation, where models tend to focus on solutions within their "comfort zone," lacking the motivation to explore potentially more effective alternatives. To address these challenges, we propose StepHint, a novel RLVR algorithm that utilizes multi-level stepwise hints to help models explore the solution space more effectively. StepHint generates valid reasoning chains from stronger models and partitions these chains into reasoning steps using our proposed adaptive partitioning method. The initial few steps are used as hints, and simultaneously, multiple-level hints (each comprising a different number of steps) are provided to the model. This approach directs the model's exploration toward a promising solution subspace while preserving its flexibility for independent exploration. By providing hints, StepHint mitigates the near-miss reward problem, thereby improving training efficiency. Additionally, the external reasoning pathways help the model develop better reasoning abilities, enabling it to move beyond its "comfort zone" and mitigate exploration stagnation. StepHint outperforms competitive RLVR enhancement methods across six mathematical benchmarks, while also demonstrating superior generalization and excelling over baselines on out-of-domain benchmarks.
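
The idea of multi-level hints, where each hint is a prefix of the reasoning chain with a different number of steps, can be sketched as below. This is a rough illustration under assumptions: the paper's adaptive partitioning method is not reproduced, so steps are simply split on line breaks, and the evenly spaced prefix lengths are a hypothetical choice. The names `split_steps` and `multi_level_hints` are not from the paper.

```python
def split_steps(chain: str) -> list[str]:
    """Naive stand-in for step partitioning: one non-empty line per step."""
    return [line.strip() for line in chain.splitlines() if line.strip()]


def multi_level_hints(steps: list[str], num_levels: int) -> list[list[str]]:
    """Build hint prefixes of increasing length.

    Prefix lengths are spread evenly and always stay below len(steps),
    so no hint reveals the complete chain.
    """
    if len(steps) < 2 or num_levels < 1:
        return []
    lengths = sorted(
        {max(1, (i * len(steps)) // (num_levels + 1)) for i in range(1, num_levels + 1)}
    )
    return [steps[:k] for k in lengths]


if __name__ == "__main__":
    chain = "Let x be the unknown.\nWrite 2x + 3 = 11.\nSubtract 3.\nDivide by 2.\nSo x = 4."
    for hint in multi_level_hints(split_steps(chain), num_levels=3):
        print(len(hint), "step hint:", " ".join(hint))
```

Each hint would be prepended to the prompt for a separate rollout, so the model sees the same problem with varying amounts of guidance.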

[93] Grounding Intelligence in Movement cs.AI | cs.CV | cs.LG | cs.RO

Melanie Segado, Felipe Parodi, Jordan K. Matelsky, Michael L. Platt, Eva B. Dyer

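TL;DR: The paper argues that movement should be a primary modeling target for AI, stresses its importance in biological and artificial systems, and calls for models that learn from and generalize across diverse movement data.

Details

Motivation: Movement is central to intelligence in biological systems, yet current machine learning approaches often treat it as an afterthought. The paper aims to advance AI capabilities in generative modeling and control by making movement a primary modeling target.

Result: The paper reports no concrete experimental results. It anticipates that modeling movement will improve AI performance on generation and control tasks and provide a cross-domain foundation for understanding behavior.

Insight: Movement is not merely an output of behavior; it is a window into how intelligent systems engage with the world. Modeling movement can establish a shared framework for understanding behavior across biological and artificial systems.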

Abstract: Recent advances in machine learning have dramatically improved our ability to model language, vision, and other high-dimensional data, yet they continue to struggle with one of the most fundamental aspects of biological systems: movement. Across neuroscience, medicine, robotics, and ethology, movement is essential for interpreting behavior, predicting intent, and enabling interaction. Despite its core significance in our intelligence, movement is often treated as an afterthought rather than as a rich and structured modality in its own right. This reflects a deeper fragmentation in how movement data is collected and modeled, often constrained by task-specific goals and domain-specific assumptions. But movement is not domain-bound. It reflects shared physical constraints, conserved morphological structures, and purposeful dynamics that cut across species and settings. We argue that movement should be treated as a primary modeling target for AI. It is inherently structured and grounded in embodiment and physics. This structure, often allowing for compact, lower-dimensional representations (e.g., pose), makes it more interpretable and computationally tractable to model than raw, high-dimensional sensory inputs. Developing models that can learn from and generalize across diverse movement data will not only advance core capabilities in generative modeling and control, but also create a shared foundation for understanding behavior across biological and artificial systems. Movement is not just an outcome, it is a window into how intelligent systems engage with the world.

cs.SD

[94] ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning cs.SD | cs.AI | cs.CL | eess.AS

Junyu Wang, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang

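TL;DR: The paper proposes an Audio Spectrogram Differential Attention mechanism (ASDA) that uses dual-softmax operations and tuned differential coefficients to address the Transformer attention mechanism's tendency to assign weight to irrelevant information, improving self-supervised representation learning.

Details

Motivation: In audio self-supervised representation learning, Transformer attention often allocates part of its weight to irrelevant information, which impairs the model's discriminative ability. The authors propose a differential attention mechanism to improve attention allocation.

Result: ASDA achieves state-of-the-art (SOTA) performance on audio classification (49.0% mAP on AS-2M, 41.5% mAP on AS20K), keyword spotting (98.3% accuracy on SPC-2), and environmental sound classification (96.1% accuracy on ESC-50).

Insight: Optimizing the attention mechanism can markedly improve model performance on audio tasks and offers a new direction for self-supervised representation learning.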

Abstract: In recent advancements in audio self-supervised representation learning, the standard Transformer architecture has emerged as the predominant approach, yet its attention mechanism often allocates a portion of attention weights to irrelevant information, potentially impairing the model’s discriminative ability. To address this, we introduce a differential attention mechanism, which effectively mitigates ineffective attention allocation through the integration of dual-softmax operations and appropriately tuned differential coefficients. Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks, including audio classification (49.0% mAP on AS-2M, 41.5% mAP on AS20K), keyword spotting (98.3% accuracy on SPC-2), and environmental sound classification (96.1% accuracy on ESC-50). These results highlight ASDA’s effectiveness in audio tasks, paving the way for broader applications.
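
The abstract does not give the exact ASDA formulation, so the following is only a sketch of the general differential-attention form it alludes to: two softmax attention maps are computed and the second is subtracted from the first, scaled by a coefficient. The function names and the scalar coefficient `lam` are assumptions for illustration, and the code uses plain Python lists rather than a tensor library.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Scaled dot-product attention weights, one softmax row per query."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]


def differential_attention(q1, k1, q2, k2, values, lam):
    """Compute (softmax(q1 k1^T) - lam * softmax(q2 k2^T)) @ values.

    The subtraction is intended to cancel attention mass that both maps
    assign to irrelevant positions; lam is the differential coefficient.
    """
    w1 = attention_weights(q1, k1)
    w2 = attention_weights(q2, k2)
    dim = len(values[0])
    out = []
    for row1, row2 in zip(w1, w2):
        diff = [a - lam * b for a, b in zip(row1, row2)]
        out.append([sum(w * v[j] for w, v in zip(diff, values)) for j in range(dim)])
    return out


if __name__ == "__main__":
    q = [[1.0, 0.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 2.0], [3.0, 4.0]]
    print(differential_attention(q, k, q, k, v, lam=0.5))
```

With `lam = 0` this reduces to standard softmax attention, which makes the coefficient a convenient knob for comparing the two behaviors.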

Table of Contents

cs.CV

[1] Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges cs.CV | cs.AI

[2] Underwater Monocular Metric Depth Estimation: Real-World Benchmarks and Synthetic Fine-Tuning cs.CV

[3] ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning cs.CV | cs.AI | cs.CL

[4] Team RAS in 9th ABAW Competition: Multimodal Compound Expression Recognition Approach cs.CV

[5] SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers cs.CV | cs.CL | cs.LG

[6] Understanding Trade offs When Conditioning Synthetic Data cs.CV | cs.AI

[7] High-Fidelity Differential-information Driven Binary Vision Transformer cs.CV

[8] FMOcc: TPV-Driven Flow Matching for 3D Occupancy Prediction with Selective State Space Model cs.CV

[9] SurgVisAgent: Multimodal Agentic Model for Versatile Surgical Visual Enhancement cs.CV | cs.AI

[10] Cross-domain Hyperspectral Image Classification based on Bi-directional Domain Adaptation cs.CV | eess.IV

[11] MAC-Lookup: Multi-Axis Conditional Lookup Model for Underwater Image Enhancement cs.CV

[12] Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation cs.CV | cs.AI | cs.MM

[13] LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models cs.CV

[14] Prompt Disentanglement via Language Guidance and Representation Alignment for Domain Generalization cs.CV | cs.LG

[15] DreamComposer++: Empowering Diffusion Models with Multi-View Conditions for 3D Content Generation cs.CV

[16] Perception Activator: An intuitive and portable framework for brain cognitive exploration cs.CV

[17] MAGIC: Mask-Guided Diffusion Inpainting with Multi-Level Perturbations and Context-Aware Alignment for Few-Shot Anomaly Generation cs.CV | cs.AI

[18] Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos cs.CV

[19] Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback cs.CV

[20] Two-Steps Neural Networks for an Automated Cerebrovascular Landmark Detection cs.CV | cs.AI

[21] Lightweight Shrimp Disease Detection Research Based on YOLOv8n cs.CV

[22] LocalDyGS: Multi-view Global Dynamic Scene Modeling via Adaptive Local Implicit Feature Decoupling cs.CV

[23] UVLM: Benchmarking Video Language Model for Underwater World Understanding cs.CV

[24] PLOT: Pseudo-Labeling via Video Object Tracking for Scalable Monocular 3D Object Detection cs.CV

[25] Continual Multiple Instance Learning with Enhanced Localization for Histopathological Whole Slide Image Analysis cs.CV

[26] Beyond Spatial Frequency: Pixel-wise Temporal Frequency-based Deepfake Video Detection cs.CV | cs.AI

[27] TABNet: A Triplet Augmentation Self-Recovery Framework with Boundary-Aware Pseudo-Labels for Medical Image Segmentation cs.CV | cs.LG

[28] Wildlife Target Re-Identification Using Self-supervised Learning in Non-Urban Settings cs.CV | cs.AI | cs.LG

[29] PosDiffAE: Position-aware Diffusion Auto-encoder For High-Resolution Brain Tissue Classification Incorporating Artifact Restoration cs.CV

[30] AvatarMakeup: Realistic Makeup Transfer for 3D Animatable Head Avatars cs.CV

[31] From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding cs.CV | cs.CL

[32] Weakly-supervised Contrastive Learning with Quantity Prompts for Moving Infrared Small Target Detection cs.CV

[33] Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection cs.CV | cs.CL | cs.CR

[34] CrowdTrack: A Benchmark for Difficult Multiple Pedestrian Tracking in Real Scenarios cs.CV | cs.AI

[35] MedFormer: Hierarchical Medical Vision Transformer with Content-Aware Dual Sparse Selection Attention cs.CV

[36] Temporally-Aware Supervised Contrastive Learning for Polyp Counting in Colonoscopy cs.CV | cs.AI

[37] Automatic Labelling for Low-Light Pedestrian Detection cs.CV

[38] IMASHRIMP: Automatic White Shrimp (Penaeus vannamei) Biometrical Analysis from Laboratory Images Using Computer Vision and Deep Learning cs.CV | I.2.10; I.4.8

[39] Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning cs.CV

[40] Structure-aware Semantic Discrepancy and Consistency for 3D Medical Image Self-supervised Learning cs.CV

[41] AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding cs.CV

[42] Addressing Camera Sensors Faults in Vision-Based Navigation: Simulation and Dataset Development cs.CV | cs.AI

[43] AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models cs.CV

[44] APT: Adaptive Personalized Training for Diffusion Models with Limited Data cs.CV | cs.AI | 60J60, 68T07 | I.2.6; I.2.10; I.4.9

[45] CanonSwap: High-Fidelity and Consistent Video Face Swapping via Canonical Space Modulation cs.CV

[46] UniMC: Taming Diffusion Transformer for Unified Keypoint-Guided Multi-Class Image Generation cs.CV

[47] FairHuman: Boosting Hand and Face Quality in Human Image Generation with Minimum Potential Delay Fairness in Diffusion Models cs.CV | cs.AI

[48] Prompt learning with bounding box constraints for medical image segmentation cs.CV

[49] DexVLG: Dexterous Vision-Language-Grasp Model at Scale cs.CV | cs.RO

[50] Linear Attention with Global Context: A Multipole Attention Mechanism for Vision and Physics cs.CV | cs.AI | cs.LG

[51] From Pixels to Damage Severity: Estimating Earthquake Impacts Using Semantic Segmentation of Social Media Images cs.CV | cs.SI

[52] No time to train! Training-Free Reference-Based Instance Segmentation cs.CV

[53] HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars cs.CV | cs.GR

[54] LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion cs.CV

[55] Confidence-driven Gradient Modulation for Multimodal Human Activity Recognition: A Dynamic Contrastive Dual-Path Learning Approach cs.CV

[56] AnyI2V: Animating Any Conditional Image with Motion Control cs.CV

[57] Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation cs.CV

[58] Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching cs.CV

[59] LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans cs.CV | cs.AI | cs.GR

[60] RefTok: Reference-Based Tokenization for Video Generation cs.CV

[61] Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory cs.CV | cs.AI | cs.LG

cs.CL

[62] Reasoning or Not? A Comprehensive Evaluation of Reasoning LLMs for Dialogue Summarization cs.CL | cs.AI

[63] Latent Chain-of-Thought? Decoding the Depth-Recurrent Transformer cs.CL | cs.AI | cs.LG

[64] Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models cs.CL

[65] IndianBailJudgments-1200: A Multi-Attribute Dataset for Legal NLP on Indian Bail Orders cs.CL | cs.AI | cs.LG | 91B14, 68T50 | I.2.7; K.4.1; K.5.2

[66] WebSailor: Navigating Super-human Reasoning for Web Agent cs.CL | cs.AI

[67] Revisiting Active Learning under (Human) Label Variation cs.CL | cs.HC | cs.LG | stat.ML

[68] Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs cs.CL | cs.AI | cs.LG

[69] Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models cs.CL

[70] Multimodal Mathematical Reasoning with Diverse Solving Perspective cs.CL

[71] SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model cs.CL | cs.AI | cs.LG

[72] Generalizing Verifiable Instruction Following cs.CL

[73] MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs cs.CL | cs.AI | cs.IT | cs.LG | cs.SY | eess.SY | math.IT

eess.IV

[74] 3D Heart Reconstruction from Sparse Pose-agnostic 2D Echocardiographic Slices eess.IV | cs.CV

eess.AS

[75] DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment eess.AS | cs.CL | cs.SD