Table of Contents

cs.CV

[1] Next-Embedding Prediction Makes Strong Vision Learners cs.CV

Sihan Xu, Ziqiao Ma, Wenhao Chai, Xuweiyi Chen, Weiyang Jin

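TL;DR: This paper proposes NEPA (Next-Embedding Predictive Autoregression), a self-supervised visual learning method whose core idea is to train the model to predict future patch embeddings rather than to output features for downstream use. Using only a plain Transformer with causal masking, pretrained on ImageNet-1K, it obtains strong visual representations without pixel reconstruction, discrete tokens, contrastive losses, or task-specific heads.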

Details

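Motivation: Inspired by the success of generative pretraining in natural language processing, the paper asks whether the same principles can yield strong self-supervised visual learners. The aim is to shift from learning static representations to learning models that perform predictive tasks directly, namely predicting future embeddings.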

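Result: NEPA achieves strong results on ImageNet-1K: after fine-tuning, ViT-B and ViT-L backbones reach 83.8% and 85.3% top-1 accuracy, respectively. It also transfers effectively to semantic segmentation on ADE20K, which indicates that the learned representations generalize beyond classification.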

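Insight: The main contribution is the paradigm of "generative pretraining from embeddings", which reduces the self-supervised objective to predicting future embeddings and thereby avoids elaborate loss design (as in contrastive learning) and extra preprocessing steps (such as discretization into tokens). This offers an architecturally simple, scalable, and potentially modality-agnostic alternative for visual self-supervised learning.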

Abstract: Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop gradient, which we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1k with next embedding prediction as its sole learning objective is effective - no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. This formulation retains architectural simplicity and scalability, without requiring additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative to visual self-supervised learning.
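
The objective described in the abstract (predict the next patch embedding from past ones under a causal mask) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the function names, the single-head attention with identity projections, and the negative-cosine-similarity loss are all assumptions made for illustration. NumPy has no autodiff, so the stop gradient is only marked in a comment; in a real framework the targets would be detached.

```python
import numpy as np

def causal_mask(n):
    # True where patch i may attend to patch j, i.e. only j <= i (the past).
    return np.tril(np.ones((n, n), dtype=bool))

def causal_attention(x, mask):
    # Toy single-head self-attention with identity projections (assumption).
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)      # hide future patches
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

def next_embedding_loss(pred, target):
    # pred[t] is compared with target[t + 1]; the last prediction has no target.
    # In an autodiff framework, `target` would sit behind a stop gradient.
    p = pred[:-1]
    z = target[1:]
    p = p / np.linalg.norm(p, axis=-1, keepdims=True)
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)
    # Negative cosine similarity, averaged over positions (assumed loss form).
    return -np.mean(np.sum(p * z, axis=-1))

rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 8))                 # 4 patch embeddings, dim 8
mask = causal_mask(4)
pred = causal_attention(patches, mask)
loss = next_embedding_loss(pred, patches)
```

Because of the mask, changing a later patch leaves the predictions at earlier positions unchanged, which is what makes the next-embedding target non-trivial.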