2021 AI Chronicle: Self-Supervised Learning, SSL (Wav2Vec 2.0, HuBERT, MAE)

Published 2021-04-15 | Category: algorithm

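In 2021, self-supervised learning (SSL) made breakthroughs in speech and vision: Wav2Vec 2.0, HuBERT, and the Masked Autoencoder (MAE). A bilingual (Chinese/English) walkthrough of contrastive learning, masked prediction, and architecture trends.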

2021 AI 编年史：自监督学习 SSL | Self-Supervised Learning in 2021

一、概述与背景知识 | Overview & Background

English

Self-Supervised Learning (SSL) learns useful representations from unlabeled data by defining pretext tasks where labels are derived from the data itself. In 2021, SSL pretraining followed by fine-tuning matched or beat purely supervised training on major speech and vision benchmarks (for example LibriSpeech and ImageNet), reducing dependence on expensive human annotations.

Three landmark contributions (Wav2Vec 2.0 was published in 2020 but shaped most speech work in 2021):

Wav2Vec 2.0 (Meta/Facebook) — contrastive learning on raw audio waveforms with quantization
HuBERT (Meta) — masked prediction of hidden-unit clusters (BERT for speech)
MAE (Meta) — Masked Autoencoder for vision: mask 75% of image patches, reconstruct pixels

Key terms:

Term	Definition
Pretext task	Surrogate objective (e.g., predict masked regions) enabling unsupervised learning
Contrastive learning	Pull similar samples together, push dissimilar apart in embedding space
Negative sampling	Non-matching audio/visual segments used as contrastive negatives
Quantization	Discretizing continuous latent representations into codebook entries
Fine-tuning	Adapting pretrained representations to labeled downstream tasks (ASR, classification)
Linear probe	Evaluating frozen features with a linear classifier — measures representation quality
Mask ratio	Fraction of input hidden during pretraining (MAE uses 75%)

中文

自监督学习（SSL） 通过构造 前置任务（标签来自数据本身）从 无标注数据 学习有效表示。2021 年 SSL 在语音与视觉主要 benchmark 上 媲美监督学习，降低对昂贵人工标注的依赖。

三项标志性工作：

Wav2Vec 2.0（Meta）— 原始音频 对比学习 + 量化
HuBERT（Meta）— 预测 隐单元聚类 的 掩码建模（语音版 BERT）
MAE（Meta）— 视觉 掩码自编码器：mask 75% 图像块并重建像素

核心术语：

术语	含义
前置任务	代理目标（如预测被 mask 区域）驱动无监督学习
对比学习	相似样本在嵌入空间拉近，不相似推远
负采样	用作对比负例的不匹配音/视频片段
量化	将连续 latent 离散化为码本条目
微调	将预训练表示适配到有标注下游任务（ASR、分类）
线性探测	冻结特征 + 线性分类器评估表示质量
Mask 比例	预训练时隐藏输入的比例（MAE 为 75%）

SSL 成为 2021 年 Foundation Model 预训练的标准范式——与 CLIP（对比）、BERT（掩码）共同构成现代 multimodal 栈的基础。

二、技术架构 | Architecture

2.1 Wav2Vec 2.0

flowchart LR
  subgraph Raw["Raw Audio 16kHz"]
    W[Waveform Chunks]
  end
  subgraph Enc["Encoder"]
    CN[CNN Feature Extractor]
    MK[Mask Latent Spans]
    TR[Transformer Context Network]
  end
  subgraph Quant["Quantization"]
    GVQ[Gumbel-Softmax Quantizer]
    CB[Codebook Z]
  end
  subgraph Contrast["Contrastive Loss"]
    POS["Positive: quantized latent at the masked step"]
    NEG["Negatives: quantized latents from other masked steps, same utterance"]
  end
  W --> CN
  CN --> MK
  MK --> TR
  CN --> GVQ
  TR --> Contrast
  GVQ --> CB
  CB --> Contrast
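
As a rough check on how the CNN feature extractor turns 16 kHz audio into latent frames, the sketch below computes the total stride and receptive field of stacked 1-D convolutions. The kernel widths and strides are the values listed in the wav2vec 2.0 paper; the function name `stride_and_receptive_field` is hypothetical and used only for illustration.

```python
# Kernel widths and strides of the seven CNN layers, as listed in the wav2vec 2.0 paper.
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # Hz


def stride_and_receptive_field(kernels, strides):
    """Return (total stride, receptive field) in input samples for stacked 1-D convs."""
    total_stride, receptive_field = 1, 1
    for k, s in zip(kernels, strides):
        # Each layer widens the receptive field by (k - 1) steps of the current stride.
        receptive_field += (k - 1) * total_stride
        total_stride *= s
    return total_stride, receptive_field


hop, rf = stride_and_receptive_field(KERNELS, STRIDES)
hop_ms = 1000 * hop / SAMPLE_RATE
rf_ms = 1000 * rf / SAMPLE_RATE
print(hop, rf, hop_ms, rf_ms)  # 320 400 20.0 25.0
```

So one latent frame is emitted every 320 samples (20 ms), and each frame sees 400 samples (25 ms) of audio.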

English

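CNN frontend downsamples the raw waveform to latent speech frames, one roughly every 20 ms
Spans of latent frames are masked before they enter the context network
Context network (Transformer) produces contextualized representations
Quantization module maps the unmasked latents to discrete codes via Gumbel-Softmax
Contrastive loss: at each masked step, the context vector must identify the correct quantized target among K distractors drawn from other masked steps of the same utterance
Fine-tuning for ASR: CTC loss on labeled speech. With only 10 minutes of labeled data (after pretraining on 53k hours of unlabeled audio), the paper reports 4.8/8.2 WER on LibriSpeech test-clean/test-other, which makes ultra-low-resource ASR feasible but still trails systems fine-tuned on the full 960 hours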
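
The contrastive objective can be sketched in plain Python as follows. This is a minimal illustration, not the fairseq implementation: it scores the true quantized target against distractors with cosine similarity and a temperature, and it omits the codebook diversity term that the paper adds to the loss. The function names, the toy 3-dimensional vectors, and the use of three distractors are assumptions made for readability (the paper samples K = 100).

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-softmax score of the true target among target + distractors."""
    logits = [cosine(context, q) / temperature for q in [target] + distractors]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy vectors: the context vector points in nearly the same direction as the true target.
c_t = [1.0, 0.2, 0.0]
q_t = [0.9, 0.1, 0.1]
negatives = [[0.0, 1.0, 0.0], [-1.0, 0.0, 0.3], [0.1, -0.8, 0.5]]

good = contrastive_loss(c_t, q_t, negatives)
# Pretend a distractor were the target: the loss should be much larger.
bad = contrastive_loss(c_t, negatives[1], [q_t, negatives[0], negatives[2]])
print(good < bad)  # True
```

The loss is small when the context vector is closer to its own quantized target than to any distractor, which is what pretraining pushes toward.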

中文

CNN 前端 将原始波形下采样为 latent 语音帧（约 20ms）
上下文网络（Transformer）产生 contextualized 表示
量化模块 经 Gumbel-Softmax 映射到离散码
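对比损失：在每个被 mask 的时间步，context 向量须从 K 个干扰项（取自同一语音中其他被 mask 的时间步）中识别正确量化目标
ASR 微调：CTC 损失。仅用 10 分钟标注数据（先在 5.3 万小时无标注音频上预训练），论文报告 LibriSpeech test-clean/test-other 的 WER 为 4.8/8.2，使超低资源 ASR 成为可能，但仍落后于用全部 960 小时标注微调的系统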
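
The Gumbel-Softmax quantization step from 2.1 can be illustrated with a forward-pass-only sketch. It assumes a single codebook group with four toy entries, whereas the paper concatenates entries from several groups, and plain Python has no autograd, so the straight-through gradient used in training is not shown. `gumbel_softmax`, `quantize`, and the value `tau=0.5` are illustrative choices rather than the reference implementation.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Sample relaxed one-hot weights over codebook entries."""
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1 - 1e-12)  # keep log() finite
        gumbel = -math.log(-math.log(u))              # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def quantize(logits, codebook, tau, rng):
    """Pick the codebook entry with the largest sampled weight (hard selection)."""
    weights = gumbel_softmax(logits, tau, rng)
    index = max(range(len(weights)), key=weights.__getitem__)
    return index, codebook[index], weights


rng = random.Random(0)
codebook = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [-1.0, 0.0]]  # toy: 4 entries, 2 dims
logits = [0.1, 4.0, 0.3, -2.0]  # in a real model, a linear layer on the CNN latent

index, code, weights = quantize(logits, codebook, tau=0.5, rng=rng)
print(index, code)
```

Because entry 1 has by far the largest logit, it is selected in most draws, while the noise still lets other entries be chosen occasionally.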

2.2 HuBERT：迭代聚类 + 掩码预测 | Iterative Clustering + Masked Prediction

HuBERT Training Loop
┌────────────────────────────────────────────────────────────┐
│ Step 1: K-means on MFCC features → frame-level targets     │
│ Step 2: Mask spans of frames (8% starts, 10 frames each)   │
│ Step 3: Predict cluster IDs of the masked frames           │
│ Step 4: Re-cluster on encoder features → repeat Steps 2-3  │
└────────────────────────────────────────────────────────────┘

Architecture: CNN → Transformer → Classification over clusters
Loss: Cross-entropy on masked positions only

English

HuBERT decouples target generation (offline clustering) from prediction (online Transformer). Iterative refinement of clusters improves target quality — analogous to BERT’s MLM but with learned discrete speech units instead of text tokens.
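
The prediction side can be sketched as span masking plus a cross-entropy computed on masked frames only. This is a toy illustration under stated assumptions: random numbers stand in for the Transformer's logits and for the k-means cluster IDs, and the helper names are hypothetical. The span parameters (8% start probability, spans of 10 frames) follow the HuBERT paper.

```python
import math
import random


def sample_span_mask(num_frames, start_prob, span_len, rng):
    """Mark frames as masked: each frame starts a span with probability start_prob."""
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < start_prob:
            for t in range(start, min(start + span_len, num_frames)):
                mask[t] = True
    return mask


def masked_cross_entropy(logits, targets, mask):
    """Average cross-entropy over masked frames only; unmasked frames are ignored."""
    losses = []
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue
        m = max(frame_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in frame_logits))
        losses.append(log_z - frame_logits[target])
    return sum(losses) / len(losses) if losses else 0.0


rng = random.Random(0)
num_frames, num_clusters = 100, 5
mask = sample_span_mask(num_frames, start_prob=0.08, span_len=10, rng=rng)
targets = [rng.randrange(num_clusters) for _ in range(num_frames)]  # stand-in for k-means IDs
logits = [[rng.gauss(0, 1) for _ in range(num_clusters)]            # stand-in for model output
          for _ in range(num_frames)]
print(sum(mask), masked_cross_entropy(logits, targets, mask))
```

Overlapping spans mean that roughly half of the frames end up masked, far more than BERT's 15% of tokens.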

中文

HuBERT 将 目标生成（离线聚类）与预测（在线 Transformer）解耦。迭代聚类提升目标质量 — 类似 BERT MLM，但目标是 学习到的离散语音单元 而非文本 token。
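
The offline target-generation step can be illustrated with a tiny k-means in plain Python. This is a sketch, not the fairseq pipeline: two synthetic 2-D blobs stand in for MFCC frames, and `kmeans` and `assign` are hypothetical helper names. In a later iteration the same procedure would be run on features taken from the trained encoder instead of MFCCs.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(frames, centroids):
    """Nearest-centroid index for every frame: these indices are the pseudo-labels."""
    return [min(range(len(centroids)), key=lambda k: squared_distance(f, centroids[k]))
            for f in frames]


def kmeans(frames, num_clusters, num_iters, rng):
    centroids = rng.sample(frames, num_clusters)  # initialise from random frames
    for _ in range(num_iters):
        labels = assign(frames, centroids)
        for k in range(num_clusters):
            members = [f for f, lab in zip(frames, labels) if lab == k]
            if members:  # keep the old centroid if a cluster went empty
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


rng = random.Random(0)
# Toy "acoustic features": two well-separated blobs standing in for MFCC frames.
frames = ([[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(50)]
          + [[rng.gauss(5, 0.1), rng.gauss(5, 0.1)] for _ in range(50)])

centroids = kmeans(frames, num_clusters=2, num_iters=10, rng=rng)
targets = assign(frames, centroids)  # one cluster ID per frame, used as prediction targets
print(targets[0], targets[-1])
```

The cluster IDs carry no phonetic meaning by themselves; they only need to be consistent enough for masked prediction to be a useful task.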

2.3 MAE：非对称编解码视觉 SSL | Asymmetric Encoder-Decoder SSL for Vision

flowchart TB
  subgraph Input["Image"]
    P[Patchify 16x16]
    M[Random Mask 75%]
  end
  subgraph Enc["Encoder ViT"]
    VE[Visible Patches Only]
    VT[ViT Blocks]
  end
  subgraph Dec["Lightweight Decoder"]
    MP[Mask Tokens Inserted]
    DR[Reconstruct Pixel Values]
  end
  P --> M
  M --> VE
  VE --> VT
  VT --> MP
  MP --> DR

English

High mask ratio (75%): Forces semantic compression — cannot cheat via local interpolation
Asymmetric design: Heavy encoder on visible patches only; lightweight decoder reconstructs all patches
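Result: after 1600-epoch pretraining, a fine-tuned ViT-Huge reaches 87.8% top-1 on ImageNet-1K, the best reported among methods using only ImageNet-1K data at the time. Linear-probe accuracy trails contrastive methods such as MoCo v3, but the recipe is simpler than contrastive pipelines (MoCo, SimCLR)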
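
The masking and the asymmetric encoder/decoder split can be sketched with index bookkeeping alone. The example assumes a 224x224 image with 16x16 patches; `random_masking` is a hypothetical helper that follows the shuffle-and-restore idea, and strings stand in for real token tensors.

```python
import random


def random_masking(num_patches, mask_ratio, rng):
    """Return (ids_keep, ids_masked, ids_restore) for one image."""
    num_keep = int(num_patches * (1 - mask_ratio))
    ids_shuffle = list(range(num_patches))
    rng.shuffle(ids_shuffle)
    ids_keep = ids_shuffle[:num_keep]     # patches the encoder sees
    ids_masked = ids_shuffle[num_keep:]   # patches hidden from the encoder
    # ids_restore[i] = position of original patch i in the shuffled order
    ids_restore = [0] * num_patches
    for pos, patch in enumerate(ids_shuffle):
        ids_restore[patch] = pos
    return ids_keep, ids_masked, ids_restore


image_size, patch_size = 224, 16
num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196
ids_keep, ids_masked, ids_restore = random_masking(num_patches, 0.75, random.Random(0))
print(num_patches, len(ids_keep), len(ids_masked))  # 196 49 147

# Decoder input: encoder outputs for visible patches, then mask tokens, put back in order.
shuffled = [("enc", p) for p in ids_keep] + [("mask", p) for p in ids_masked]
decoder_input = [shuffled[ids_restore[i]] for i in range(num_patches)]
```

The encoder therefore processes only 49 of 196 tokens per image, which is where most of the pretraining savings come from.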

中文

高 mask 比例（75%）：迫使语义压缩 — 无法靠局部插值作弊
非对称设计：重型 encoder 仅处理可见 patch；轻量 decoder 重建全部
结果：经 1600 epoch 预训练，ViT-Huge 微调在 ImageNet-1K 上达到 87.8% top-1（当时仅使用 ImageNet-1K 数据的方法中最佳）；线性探测精度低于 MoCo v3 等对比方法，但流程比 MoCo、SimCLR 等对比方法更简洁
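
The MAE paper computes the reconstruction loss on masked patches only, and a minimal sketch of that loss follows. Patches are flat lists of pixel values here, `masked_mse` is a hypothetical name, and the per-patch pixel normalization that the paper reports as helpful is left out.

```python
def masked_mse(pred, target, is_masked):
    """Mean squared error averaged over masked patches only.

    pred, target: lists of patches, each a flat list of pixel values.
    is_masked: list of bools, True where the patch was hidden from the encoder.
    """
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, is_masked):
        if m:
            total += sum((a - b) ** 2 for a, b in zip(p, t)) / len(t)
            count += 1
    return total / count if count else 0.0


target = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
pred = [[9.0, 9.0], [1.0, 2.0], [2.0, 2.0], [3.0, 5.0]]
mask = [False, True, False, True]  # patches 0 and 2 were visible, so their errors are ignored

print(masked_mse(pred, target, mask))  # 1.25
```

The large error on visible patch 0 does not contribute, so the model is not rewarded for copying inputs it already saw.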

2.4 三大方法对比 | Comparison of the Three Methods

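Method	Modality	Core mechanism	Typical downstream tasks
Wav2Vec 2.0	Speech	Contrastive + quantization	ASR, speaker ID
HuBERT	Speech	Masked prediction + cluster targets	ASR, speech emotion
MAE	Vision	Masked reconstruction	Classification, detection, segmentation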

三、发展趋势 | Trends

English

Unified SSL paradigms: Contrastive (Wav2Vec), predictive (HuBERT/MAE), and generative objectives converging in multimodal models.
Low-label fine-tuning: SSL + small labeled sets became standard for ASR in low-resource languages.
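Scaling ViT pretraining: MAE showed that self-supervised pretraining lets large ViTs (ViT-L/H) train on ImageNet-1K alone and outperform their supervised-pretraining counterparts.
Speech as the new NLP: HuBERT/Wav2Vec paved the way for later large speech models such as Whisper (2022, trained with weak supervision rather than SSL) and speech LLMs.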
Efficiency focus: Distilled SSL models (DistilHuBERT) for on-device deployment.
Multi-modal SSL: ImageBind, CLIP extensions combining SSL objectives across modalities.

中文

SSL 范式统一：对比（Wav2Vec）、预测（HuBERT/MAE）、生成目标在多模态模型中融合。
低标注微调：SSL + 少量标注成为 低资源语言 ASR 标准方案。
ViT 预训练缩放：MAE 表明自监督预训练可让大型 ViT（ViT-L/H）仅用 ImageNet-1K 训练，并取得优于监督预训练的效果。
语音成新 NLP：HuBERT/Wav2Vec 为 Whisper（2022）与语音 LLM 铺路。
效率导向：蒸馏 SSL（DistilHuBERT）服务端侧部署。
多模态 SSL：ImageBind 等跨模态 SSL 扩展。

四、优缺点分析 | Pros & Cons

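Dimension	Advantages	Disadvantages
Labeling cost	Exploits large amounts of unlabeled data	Pretraining is compute-intensive
Generalization	Strong transfer across tasks	Fine-tuning still needed under domain shift
Wav2Vec 2.0	State of the art for low-resource ASR at release	Contrastive objective is sensitive to hyperparameters
HuBERT	Simple BERT-style objective	Depends on cluster quality and iterative re-clustering
MAE	Simple architecture, scales well	Linear-probe accuracy trails contrastive methods; training with a high mask ratio may need tuning
Reproducibility	Official implementations available (e.g., fairseq)	High resource barrier for large-scale pretraining
Deployment	Fine-tuned models can be compact	High memory footprint during pretraining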

五、应用场景 | Use Cases

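Scenario	Description
Automatic speech recognition (ASR)	Low-resource languages, dialects, domain adaptation
Voice assistants	Robust speech understanding that does not depend on a wake word
Medical speech	Clinical note dictation with limited labeled data
ImageNet-scale classification	MAE pretraining followed by a short fine-tuning schedule
Object detection / segmentation	MAE/ViT backbones transferred to Detectron2
Content moderation	Audio/video SSL features for anomaly detection
Industrial quality inspection	MAE pretraining for vision + few-shot defect detection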

六、开源项目与工具 | Open Source & Tools

Project	Description	URL
fairseq	Official Wav2Vec 2.0 and HuBERT implementations	https://github.com/facebookresearch/fairseq
MAE	Official Masked Autoencoder code	https://github.com/facebookresearch/mae
Hugging Face Transformers	Pretrained Wav2Vec2 and Hubert weights	https://github.com/huggingface/transformers
torchaudio	Audio preprocessing and Wav2Vec2 pipelines	https://github.com/pytorch/audio
timm	Collection of pretrained ViT and MAE models	https://github.com/huggingface/pytorch-image-models
OpenSelfSup	General SSL toolbox (contrastive, rotation prediction, etc.); since renamed MMSelfSup	https://github.com/open-mmlab/OpenSelfSup
SpeechBrain	Toolkit for downstream speech SSL tasks	https://github.com/speechbrain/speechbrain

七、参考文献 | References

Baevski, A., et al. “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.” NeurIPS 2020 (widely influential in 2021). https://arxiv.org/abs/2006.11477
Hsu, W.-N., et al. “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units.” IEEE/ACM TASLP, 2021. https://arxiv.org/abs/2106.07447
He, K., et al. “Masked Autoencoders Are Scalable Vision Learners.” CVPR 2022 (arXiv 2021-11). https://arxiv.org/abs/2111.06377
Chen, T., et al. “A Simple Framework for Contrastive Learning of Visual Representations (SimCLR).” ICML 2020. https://arxiv.org/abs/2002.05709
Dosovitskiy, A., et al. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT).” ICLR 2021. https://arxiv.org/abs/2010.11929
Radford, A., et al. “Learning Transferable Visual Models From Natural Language Supervision (CLIP).” ICML 2021. https://arxiv.org/abs/2103.00020
Fairseq Documentation — Wav2Vec 2.0. https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec

English Summary: 2021 SSL breakthroughs in speech (Wav2Vec 2.0, HuBERT) and vision (MAE) proved that self-supervised pretraining could match or exceed supervised baselines — establishing the default recipe for foundation models before the LLM explosion.

中文总结：2021 年语音（Wav2Vec 2.0、HuBERT）与视觉（MAE）的 SSL 突破证明自监督预训练可媲美甚至超越监督基线，成为 LLM 爆发前 Foundation Model 的默认配方。