AI 技术编年史 2024:万卡异构智算集群
万卡异构智算集群 | 10k GPU Heterogeneous AI Clusters
一、背景与核心概念 | Background and Core Concepts
English
Training frontier LLMs in 2024 required 10,000+ GPU clusters — a scale once reserved for national labs. Meta’s Llama 3 reportedly trained on 16k H100s; xAI’s Colossus reached 100k H100s by late 2024. Heterogeneous computing adds complexity: mixing H100, A100, AMD MI300X, and domestic accelerators (Huawei Ascend, Cambricon) within unified scheduling layers.
Key concepts:
- 3D parallelism: data parallel (DP) + tensor parallel (TP) + pipeline parallel (PP); MoE models add expert parallel (EP) as a fourth dimension
- Network fabric: InfiniBand NDR 400G, NVLink NVSwitch for intra-node, fat-tree topology
- Checkpointing: distributed async checkpoint to parallel filesystem (Lustre, GPFS)
- Fault tolerance: recovery from node failures and spot-instance preemption, straggler mitigation, hot spare nodes
- Power and cooling: 10k GPUs ≈ 7–15 MW; liquid cooling is increasingly necessary at this power density
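The parallelism bullet above implies that the GPU count factors as world_size = DP × PP × TP. The sketch below shows one way to derive the DP degree and a rank's position in that grid. The class name `ParallelLayout` and the rank ordering (TP varies fastest) are assumptions made for illustration; real frameworks define their own ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical helper: factor a cluster into DP x PP x TP groups."""

    world_size: int  # total number of GPUs (global ranks)
    tp: int          # tensor-parallel degree
    pp: int          # pipeline-parallel degree

    @property
    def dp(self) -> int:
        """Data-parallel degree implied by world_size = dp * pp * tp."""
        model_parallel = self.tp * self.pp
        if self.world_size % model_parallel != 0:
            raise ValueError("world_size must be divisible by tp * pp")
        return self.world_size // model_parallel

    def coords(self, rank: int) -> tuple[int, int, int]:
        """Return (dp_index, pp_index, tp_index) for a global rank.

        Assumed ordering: TP varies fastest, so one TP group occupies
        consecutive ranks (which usually share an NVLink domain).
        """
        if not 0 <= rank < self.world_size:
            raise ValueError("rank out of range")
        _ = self.dp  # raises if the layout does not divide evenly
        tp_index = rank % self.tp
        pp_index = (rank // self.tp) % self.pp
        dp_index = rank // (self.tp * self.pp)
        return dp_index, pp_index, tp_index


layout = ParallelLayout(world_size=16384, tp=8, pp=16)
print(layout.dp, layout.coords(9))
```

Keeping TP innermost reflects the usual preference for placing the most communication-heavy dimension on the fastest links; EP for MoE models would add a further factor that this sketch leaves out.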
中文
2024 年训练前沿大模型需 万卡级 GPU 集群——曾为国家级实验室专属。Meta Llama 3 据称用 1.6 万 H100;xAI Colossus 年底达 10 万 H100。异构智算增加复杂度:H100、A100、AMD MI300X 与昇腾、寒武纪等国产芯片在同一调度层混合。
核心概念:3D 并行(DP+TP+PP,MoE 另加 EP);InfiniBand NDR、NVLink NVSwitch;分布式 checkpoint;容错与 straggler;万卡功耗约 7–15 MW,液冷日益必要。
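| 术语 Term | 中文说明 | English Description |
|---|---|---|
| MFU | Model FLOPs Utilization,有效算力利用率 | Model FLOPs Utilization: share of peak FLOP/s spent on useful model compute |
| All-Reduce | 梯度同步集合通信 | Collective operation used to synchronize gradients |
| ZeRO | 分片优化器状态降显存 | Shards optimizer state across ranks to cut per-GPU memory |
| Pod | 典型 256–2048 GPU 物理单元 | Physical building block, typically 256–2048 GPUs |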
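As a rough illustration of the MFU entry in the table above, the sketch below divides achieved model FLOP/s by aggregate peak FLOP/s. It assumes the common approximation of about 6 FLOPs per parameter per token for a dense transformer (forward plus backward, attention ignored). The function name and the example numbers, including the per-GPU peak, are assumptions and not measurements from the source.

```python
def model_flops_utilization(
    n_params: float,
    tokens_per_second: float,
    n_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    """Return MFU as a fraction between 0 and 1.

    Achieved FLOP/s is approximated as 6 * N * tokens/s, a common
    rule of thumb for dense transformers (an assumption here).
    """
    if min(n_params, tokens_per_second, n_gpus, peak_flops_per_gpu) <= 0:
        raise ValueError("all inputs must be positive")
    achieved_flops = 6.0 * n_params * tokens_per_second
    return achieved_flops / (n_gpus * peak_flops_per_gpu)


# Hypothetical run: 70B dense model, 1M tokens/s on 1,024 GPUs,
# assuming roughly 989 TFLOP/s dense BF16 peak per GPU.
mfu = model_flops_utilization(70e9, 1.0e6, 1024, 989e12)
print(f"MFU = {mfu:.1%}")
```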
1.1 规模对照 | Scale Reference
English
| Cluster | GPUs (approx.) | Notable use (2024) |
|---|---|---|
| Meta Llama 3 | ~16,000 H100 | Dense pretrain |
| xAI Colossus | ~100,000 H100 | Grok training |
| Microsoft Azure AI | Multi-10k regions | OpenAI + enterprise |
| Chinese domestic AI computing centers | 10k-scale Ascend / Cambricon | Industry-specific LLMs |
A single H100 consumes ~700W TDP; 10k GPUs ≈ 7 MW before CPU, networking, and cooling overhead — datacenter site selection became a power contract negotiation.
中文
| 集群 | GPU(约) | 2024 用途 |
|---|---|---|
| Meta Llama 3 | ~1.6 万 H100 | 稠密预训练 |
| xAI Colossus | ~10 万 H100 | Grok |
| 微软 Azure AI | 区域级万卡 | OpenAI+企业 |
| 国内智算 | 万卡国产芯片 | 行业大模型 |
单 H100 约 700W TDP;万卡约 7 MW(不含 CPU、网络、冷却)——选址成电力合同谈判。
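The power arithmetic above (700 W × 10,000 GPUs ≈ 7 MW before overhead) can be sketched as a small estimator. The `host_overhead` and `pue` defaults are illustrative assumptions, chosen only to show how the GPU-only figure grows toward the upper end of the 7–15 MW range quoted earlier.

```python
def cluster_power_mw(
    n_gpus: int,
    gpu_tdp_w: float = 700.0,
    host_overhead: float = 0.5,
    pue: float = 1.3,
) -> float:
    """Estimate facility power in megawatts.

    host_overhead: extra IT power per GPU watt for CPUs, NICs,
        switches, and storage (assumed value, not from the source).
    pue: power usage effectiveness, i.e. facility power divided by
        IT power, covering cooling and distribution (assumed value).
    """
    gpu_watts = n_gpus * gpu_tdp_w
    it_watts = gpu_watts * (1.0 + host_overhead)
    return it_watts * pue / 1e6


gpu_only = cluster_power_mw(10_000, host_overhead=0.0, pue=1.0)
with_overhead = cluster_power_mw(10_000)
print(f"GPU only: {gpu_only:.2f} MW, with overhead: {with_overhead:.2f} MW")
```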
二、架构设计 | Architecture
English
Typical 10k GPU cluster architecture:
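Slurm/Kubernetes scheduling with Megatron/DeepSpeed orchestration → multiple Pods (256 GPUs each, NVSwitch) → InfiniBand interconnect across Pods → Lustre/DAOS parallel filesystem.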
Heterogeneous layer: the scheduler maps jobs to GPU SKUs by memory capacity, FP8 support, and interconnect bandwidth; gradient accumulation settings are tuned per node group so that slower nodes keep pace with synchronous steps.
中文
典型万卡架构:Slurm/K8s 调度 + Megatron/DeepSpeed 编排 → 多 Pod(每 Pod 256 GPU、NVSwitch)→ InfiniBand 全互联 → Lustre/DAOS 并行文件系统。异构层按显存、FP8 支持、互联带宽将任务映射到 GPU SKU;按节点组调整梯度累积,使慢节点跟上同步步调。
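The heterogeneous layer described above matches jobs to GPU SKUs by memory, FP8 support, and interconnect bandwidth. A minimal sketch of that filter follows. The `GpuSku` and `Job` types, the `pick_sku` function, the tie-breaking rule, and the bandwidth and free-GPU figures are all hypothetical; production schedulers such as Slurm or Kubernetes express this through their own constraint mechanisms.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GpuSku:
    name: str
    memory_gb: int
    supports_fp8: bool
    interconnect_gbps: int  # per-GPU scale-out bandwidth (illustrative)
    free_gpus: int


@dataclass(frozen=True)
class Job:
    name: str
    gpus: int
    min_memory_gb: int
    needs_fp8: bool
    min_interconnect_gbps: int


def pick_sku(job: Job, pools: list[GpuSku]) -> Optional[GpuSku]:
    """Return the least capable pool that still satisfies the job.

    Picking the smallest adequate SKU is an assumed policy that keeps
    larger-memory, faster parts free for jobs that need them.
    """
    eligible = [
        p for p in pools
        if p.free_gpus >= job.gpus
        and p.memory_gb >= job.min_memory_gb
        and (p.supports_fp8 or not job.needs_fp8)
        and p.interconnect_gbps >= job.min_interconnect_gbps
    ]
    return min(
        eligible,
        key=lambda p: (p.memory_gb, p.interconnect_gbps),
        default=None,
    )


# Hypothetical inventory; bandwidth and free counts are made up.
pools = [
    GpuSku("A100", 80, False, 200, 2048),
    GpuSku("H100", 80, True, 400, 1024),
    GpuSku("MI300X", 192, True, 400, 256),
]
job = Job("fp8-pretrain", gpus=512, min_memory_gb=80,
          needs_fp8=True, min_interconnect_gbps=400)
chosen = pick_sku(job, pools)
print(chosen.name if chosen else "no pool fits")
```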
2.1 并行策略选择 | Parallelism Selection
| 模型规模 Model scale | 推荐策略 Recommended strategy |
|---|---|
| 7B–70B dense | TP(8) + DP + ZeRO-3 |
| 400B+ dense | TP + PP + DP |
| MoE (8×7B) | EP + TP + DP |
| 多模态 Multimodal | 分离 vision/text 并行组 / Separate parallel groups for vision and text |
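To see why ZeRO-3 appears in the first row of the table above, it helps to estimate per-GPU model-state memory. The sketch below uses the accounting from the ZeRO paper for mixed-precision Adam: 2 bytes of FP16 weights, 2 bytes of FP16 gradients, and 12 bytes of FP32 optimizer state per parameter. The function name is hypothetical, and activations, buffers, and the extra sharding from TP are deliberately ignored, so treat the numbers as a lower bound.

```python
def zero_model_state_gb(n_params: float, dp_degree: int, stage: int) -> float:
    """Per-GPU memory (GB) for weights, gradients, and optimizer state.

    stage 0: nothing sharded (plain data parallel)
    stage 1: optimizer state sharded across dp_degree ranks
    stage 2: gradients sharded as well
    stage 3: weights sharded as well
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if dp_degree < 1:
        raise ValueError("dp_degree must be >= 1")
    weights, grads, optimizer = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optimizer /= dp_degree
    if stage >= 2:
        grads /= dp_degree
    if stage >= 3:
        weights /= dp_degree
    return n_params * (weights + grads + optimizer) / 1e9


for stage in range(4):
    gb = zero_model_state_gb(70e9, dp_degree=64, stage=stage)
    print(f"ZeRO stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions a 70B model needs about 1,120 GB of model state unsharded, which no single current GPU holds, while stage 3 across 64 ranks brings that to roughly 17.5 GB per GPU.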
2.2 国产异构栈 | Domestic Heterogeneous Stack
English: Ascend 910B clusters use the CANN + MindSpeed stack; cross-vendor scheduling relies on FlagOS-style abstraction layers, which appear in 2025–2026 roadmaps.
中文:昇腾 910B 用 CANN + MindSpeed;FlagOS 类抽象层在 2025–2026 路线图中出现。
2.3 可靠性工程 | Reliability Engineering
English
At 10k scale, even a long per-node mean time between failures (MTBF) makes node loss a near-daily event. Training jobs checkpoint every N minutes to distributed storage, and torch.distributed.elastic (launched via torchrun) restarts workers after a failure. Stragglers (nodes slowed by thermal throttling or a bad NIC) are mitigated by tuning gradient accumulation to absorb the skew and, optionally, by node exclusion lists.
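Two pieces of arithmetic sit behind the paragraph above: how per-node MTBF shrinks at cluster scale, and how to choose the checkpoint interval N. The sketch below assumes independent node failures (so cluster MTBF ≈ node MTBF / node count) and uses Young's classic approximation, interval ≈ sqrt(2 × checkpoint cost × MTBF). The source does not prescribe this formula, and the node MTBF and checkpoint cost in the example are hypothetical.

```python
import math


def cluster_mtbf_hours(node_mtbf_hours: float, n_nodes: int) -> float:
    """Cluster-level MTBF, assuming independent node failures."""
    if node_mtbf_hours <= 0 or n_nodes < 1:
        raise ValueError("inputs must be positive")
    return node_mtbf_hours / n_nodes


def young_checkpoint_interval_s(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation for the checkpoint interval in seconds."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("inputs must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Hypothetical: 1,250 nodes (8 GPUs each), 30,000 h MTBF per node,
# and 60 s of stalled training per checkpoint.
mtbf_h = cluster_mtbf_hours(30_000, 1_250)
interval_s = young_checkpoint_interval_s(60.0, mtbf_h * 3600.0)
print(f"cluster MTBF = {mtbf_h:.0f} h, checkpoint every {interval_s / 60:.0f} min")
```

Asynchronous checkpointing lowers the effective checkpoint cost, which in turn shortens the interval this formula suggests.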
中文
万卡规模下,即使单节点 MTBF 很长,节点失效也近乎每日发生。训练每 N 分钟 checkpoint 到分布式存储;torch.distributed.elastic(通过 torchrun 启动)在故障后重启 worker。Straggler(热节流、坏网卡导致的慢节点)通过调整梯度累积吸收偏斜,并可选用节点排除列表缓解。
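One simple way to build the node exclusion list mentioned above is to compare each node's recent step time against the cluster median. This is a minimal sketch under that assumption: the function name, the 1.2× tolerance, and the node names are hypothetical, and real systems usually also look at NIC counters and GPU clock telemetry.

```python
from statistics import median


def find_stragglers(step_times_s: dict[str, float], tolerance: float = 1.2) -> list[str]:
    """Return nodes whose step time exceeds tolerance * median.

    step_times_s maps a node name to its average step time in seconds
    over a recent window; how that is collected is left to the caller.
    """
    if not step_times_s:
        return []
    baseline = median(step_times_s.values())
    return sorted(
        node for node, t in step_times_s.items() if t > tolerance * baseline
    )


# Hypothetical measurements: node-d is throttling.
times = {"node-a": 1.00, "node-b": 1.02, "node-c": 0.99, "node-d": 1.60}
print(find_stragglers(times))
```

Using the median rather than the mean keeps a single very slow node from shifting the baseline it is judged against.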
三、产业趋势 | Industry Trends
English
2024 cluster trends:
- H100 supply crunch — lead times drove multi-vendor strategies
- Sovereign AI clouds — nations fund domestic 10k+ clusters
- FP8 training — Transformer Engine on H100 cuts memory and boosts MFU
- Training/inference colocation — the same datacenter serves both workloads
- Energy limits — grid constraints cap cluster expansion in some regions
- Open cluster software — PyTorch FSDP2, Megatron-LM, DeepSpeed MoE mature
中文
2024 趋势:H100 供应紧张推动多 vendor;主权 AI 云投资万卡;FP8 训练降显存提 MFU;训推同机房;电网限制扩张;FSDP2、Megatron、DeepSpeed MoE 软件成熟。
四、优缺点分析 | Pros and Cons
4.1 优点 | Advantages
- 训练速度 — 周级完成万亿 token / Week-scale trillion-token runs
- 规模定律延续 — 更大集群 → 更强模型 / Scaling laws continue
- 容错工程成熟 — checkpoint 恢复分钟级 / Resilient long jobs
- 软件栈统一 — PyTorch 生态主导 / Unified PyTorch stack
- 异构弹性 — 多 SKU 提高利用率 / Mixed SKUs improve utilization
- 推理共部署 — 摊薄基础设施 / Shared infra amortization
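The "week-scale" claim in the first bullet above can be checked with a back-of-envelope estimate: total training FLOPs divided by sustained cluster throughput. The sketch assumes about 6 FLOPs per parameter per token for a dense model. The function name and every number in the example (model size, token count, per-GPU peak, MFU) are hypothetical inputs.

```python
def training_days(
    n_params: float,
    n_tokens: float,
    n_gpus: int,
    peak_flops_per_gpu: float,
    mfu: float,
) -> float:
    """Estimate wall-clock training time in days, ignoring restarts."""
    if not 0.0 < mfu <= 1.0:
        raise ValueError("mfu must be in (0, 1]")
    if min(n_params, n_tokens, n_gpus, peak_flops_per_gpu) <= 0:
        raise ValueError("all inputs must be positive")
    total_flops = 6.0 * n_params * n_tokens          # dense-model rule of thumb
    sustained_flops = n_gpus * peak_flops_per_gpu * mfu
    return total_flops / sustained_flops / 86_400.0  # seconds per day


# Hypothetical: 70B parameters, 15T tokens, 10,000 GPUs at an assumed
# 989 TFLOP/s peak each, running at 40% MFU.
days = training_days(70e9, 15e12, 10_000, 989e12, 0.40)
print(f"about {days:.1f} days")
```

Halving the GPU count or the MFU doubles the estimate, which is why utilization gets as much attention as raw cluster size.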
4.2 缺点 | Disadvantages
- 资本密集 — 万卡 ≈ 数亿美元 / Hundreds of millions of USD in capex for a 10k-GPU cluster
- 能耗与碳排 — ESG 压力 / Energy and carbon footprint
- 运维复杂 — 网络、存储、调度耦合 / Operational complexity
- 供应商锁定 — NVIDIA 生态依赖 / Vendor concentration risk
- 异构效率损失 — 跨芯片性能不一致 / Heterogeneity overhead
- 人才稀缺 — 分布式系统专家少 / Scarce distributed systems talent
五、典型应用场景 | Use Cases
| 场景 Scenario | 中文说明 | English Description |
|---|---|---|
| 前沿 LLM 预训练 | 1T+ token 基座模型 | Frontier base model pre-training |
| MoE 超大规模 | 混合专家万亿参数 | Trillion-parameter MoE runs |
| 多模态联合训练 | 图文音视频统一 | Multimodal joint training |
| RLHF 大规模 | 并行 rollout + RM | Large-scale alignment training |
| 科学计算 AI | 气候、蛋白质联合 HPC+AI | HPC-AI fusion workloads |
| 云厂商租售 | 按 GPU·时售卖算力 | GPU-hour cloud offerings |
六、GitHub 与开源生态 | GitHub and Open Source
English
Cluster training frameworks:
- NVIDIA/Megatron-LM: 3D parallelism reference
- microsoft/DeepSpeed: ZeRO, MoE, inference
- pytorch/pytorch: FSDP2 native
- hpcaitech/ColossalAI: heterogeneous and hybrid parallel
- ray-project/ray: distributed orchestration
中文
开源框架:Megatron-LM、DeepSpeed、PyTorch FSDP2、ColossalAI、Ray。
| 仓库 | 说明 |
|---|---|
| NVIDIA/Megatron-LM | 3D 并行参考实现 |
| microsoft/DeepSpeed | ZeRO / MoE |
| pytorch/pytorch | FSDP2 |
| hpcaitech/ColossalAI | 异构并行 |
七、参考链接 | References
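- Meta Llama 3 训练基础设施博客 | Meta Llama 3 training infrastructure blog
- NVIDIA H100 Tensor Core GPU 架构白皮书 | NVIDIA H100 Tensor Core GPU architecture whitepaper
- InfiniBand NDR 400G 规范 | InfiniBand NDR 400G specification
- xAI Colossus 集群公开报道 | Public reporting on the xAI Colossus cluster
- DeepSpeed ZeRO 论文与文档 | DeepSpeed ZeRO paper and documentation
- 中国智算中心建设政策文件(2024) | Chinese policy documents on AI computing center construction (2024)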
八、2025 展望 | Outlook for 2025
English
Next-generation clusters adopt GB200 NVL72 rack-scale designs, FP4/FP6 numerics, and train-inference disaggregation on shared fabrics. Heterogeneous scheduling (FlagOS-class abstractions) becomes mandatory as export controls diversify chip supply. Energy caps push dynamic power capping and geo-shifting training jobs to regions with surplus renewables. Mid-size labs rent 8–512 GPU slices via cloud rather than owning 10k — but frontier labs remain capex-heavy.
中文
下一代集群采用 GB200 NVL72 机架设计、FP4/FP6 数值、训推分离共享 fabric。异构调度(FlagOS 类抽象)在出口管制分化芯片供应下成刚需。能耗上限推动动态功耗封顶与地理迁移训练至富余可再生能源区。中型实验室云租 8–512 GPU 而非自持万卡——前沿实验室仍重 capex。
English Summary: 2024 cemented 10k GPU clusters as the training baseline for frontier AI — heterogeneous scheduling and network fabric became as critical as chip FLOPs.
中文总结:2024 确立万卡集群为前沿 AI 训练基线——异构调度与网络 fabric 与芯片算力同等关键。