AI 技术编年史 2026:合成数据成为主力训练源 | Synthetic Data as Primary Training Source


一、背景 | Background

English

By late 2026, leading labs and enterprise trainers reported that verified synthetic data constituted 50–70% of tokens in major pretraining and fine-tuning mixes, crossing the threshold from “augmentation” to primary training source. The causes were structural: high-quality human web text was largely exhausted; licensing battles restricted crawl corpora; and domain-specific human data (medicine, law, code internals) remained expensive and gated. Meanwhile, frontier models and simulators produced synthetic text, code, multimodal pairs, and tool traces at roughly 100× lower marginal cost, with improving fidelity.

The concept evolved from naive self-play (a model trained on its own outputs, which leads to collapse) to Synthetic Data 2.0: multi-model consensus filtering, executable verification (code runs, proofs check, physics sims validate), provenance graphs, and quality tiers that enter revised scaling laws explicitly (see scaling-laws-moe post). Regulators began asking for “synthetic %” disclosures for high-risk models.
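Multi-model consensus filtering can be pictured as a simple vote across independent checkers. The sketch below is a minimal illustration, not any lab's actual pipeline: `consensus_filter`, the `min_agree` threshold, and the three toy verifiers are hypothetical stand-ins for real critic models.

```python
from typing import Callable, List, Sequence

Verifier = Callable[[str], bool]


def consensus_filter(samples: Sequence[str],
                     verifiers: Sequence[Verifier],
                     min_agree: int) -> List[str]:
    """Keep only the samples that at least `min_agree` verifiers accept."""
    if not 1 <= min_agree <= len(verifiers):
        raise ValueError("min_agree must be between 1 and the number of verifiers")
    kept = []
    for sample in samples:
        votes = sum(1 for verify in verifiers if verify(sample))
        if votes >= min_agree:
            kept.append(sample)
    return kept


# Toy stand-ins for independent critic models (assumptions for illustration).
def long_enough(text: str) -> bool:
    return len(text.split()) >= 4


def no_placeholder(text: str) -> bool:
    return "TODO" not in text


def ends_cleanly(text: str) -> bool:
    return text.rstrip().endswith((".", "?", "!"))


VERIFIERS = [long_enough, no_placeholder, ends_cleanly]
```

Requiring all verifiers to agree gives the strictest filter; lowering `min_agree` trades precision for yield.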

中文

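By the end of 2026, leading labs and enterprise trainers disclosed that verified synthetic data made up 50–70% of the tokens in major pretraining and fine-tuning mixes, moving it from "augmentation" to primary training source. The causes were structural: high-quality human web text was largely exhausted; licensing lawsuits restricted crawl corpora; domain-specific human data (medicine, law, internal code) was expensive and tightly gated; and frontier models plus simulators produced synthetic text, code, multimodal pairs, and tool traces at a marginal cost two orders of magnitude lower, with steadily improving fidelity.

The concept evolved from naive self-play (a model consuming its own outputs, leading to collapse) to Synthetic Data 2.0: multi-model consensus filtering, executable verification (code runs, proofs check, physics simulations validate), provenance graphs, and quality tiers that enter the revised scaling laws explicitly. Regulators began requiring high-risk models to disclose their "synthetic share".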


二、架构 | Architecture

English

Synthetic data factory architecture (2026):

Seed Sources
├── Licensed human slices (high-trust anchor)
├── Public textbooks / papers (structured)
└── Simulator states (games, CAD, lab logs)

Generators
├── Teacher LLM ensemble (diverse architectures)
├── Programmatic templaters (grammar-guided)
├── Diffusion / video synth for multimodal
└── Agent trace replay (tool calls + outcomes)

Verification Layer
├── Executors (unit tests, sandboxes, formal checkers)
├── Critic models (reject hallucination / toxicity)
├── Deduplication + near-duplicate purge
└── Human spot audit (statistical sampling)

Curation & Mixing
├── Quality tier labels (T0 anchor human → T3 bulk synth)
├── Dynamic mixer (scaling law optimizer)
└── Provenance metadata per shard

Training Consumption
└── Pretrain / SFT / RL with tier-weighted sampling

Anti-collapse rules: Minimum 30% T0/T1 human-anchor in frontier mixes; never train exclusively on single-generator outputs; periodic human eval regression on held-out real benchmarks.
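The first two anti-collapse rules lend themselves to an automated check before a mix is released for training. The following is a minimal sketch under stated assumptions: the shard dictionary layout (`tier`, `generator`, `tokens`) and the function name `check_mix` are hypothetical, and "never train exclusively on single-generator outputs" is interpreted here as "the synthetic portion must come from at least two generators".

```python
HUMAN_ANCHOR_TIERS = {"T0", "T1"}
MIN_ANCHOR_SHARE = 0.30  # the 30% floor stated in the rules above


def check_mix(shards):
    """Return a list of rule violations for a proposed training mix.

    Each shard is a dict with keys 'tier', 'generator' and 'tokens'
    (a hypothetical layout chosen for this illustration).
    """
    total = sum(s["tokens"] for s in shards)
    if total == 0:
        return ["mix is empty"]

    violations = []

    # Rule 1: human-anchor tiers must cover at least 30% of tokens.
    anchor = sum(s["tokens"] for s in shards if s["tier"] in HUMAN_ANCHOR_TIERS)
    share = anchor / total
    if share < MIN_ANCHOR_SHARE:
        violations.append(
            f"human anchor share {share:.0%} is below {MIN_ANCHOR_SHARE:.0%}"
        )

    # Rule 2 (our interpretation): synthetic shards need more than one generator.
    synthetic = [s for s in shards if s["tier"] not in HUMAN_ANCHOR_TIERS]
    generators = {s["generator"] for s in synthetic}
    if synthetic and len(generators) < 2:
        violations.append("all synthetic shards come from a single generator")

    return violations
```

The third rule (periodic regression on held-out real benchmarks) is an evaluation step rather than a property of the mix, so it is left out of this check.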

中文

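2026 synthetic data factory: seeds (licensed human slices, structured public sources, simulator states) → generators of several kinds (teacher LLM ensemble, templaters, diffusion, agent traces) → verification layer (executors, critic models, deduplication, human spot audit) → tiered mixing (T0 human anchor → T3 bulk synthetic) → training consumption.

Anti-collapse rules: at least 30% T0/T1 human anchor in frontier mixes; no single generator may supply the whole mix; periodic human eval regression.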

Tier    Source                        Typical share (2026 frontier)
T0      Human expert                  10–20%
T1      Human + light synth verify    15–25%
T2      Verified synthetic            40–50%
T3      Bulk synthetic (filtered)     10–20%
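Tier-weighted sampling over these shares can be sketched with the standard library. The weights below are illustrative values picked from inside each range of the table so that they sum to 1; they are an assumption, not reported figures, and `sample_tiers` is a hypothetical helper.

```python
import random

# Illustrative weights chosen inside the table's ranges (assumption).
TIER_WEIGHTS = {"T0": 0.15, "T1": 0.20, "T2": 0.45, "T3": 0.20}


def sample_tiers(n, weights=TIER_WEIGHTS, seed=0):
    """Draw the tier for each of `n` training examples, proportional to weight."""
    rng = random.Random(seed)  # seeded so a data order can be reproduced
    tiers = list(weights)
    return rng.choices(tiers, weights=[weights[t] for t in tiers], k=n)
```

A real mixer would then pick a shard within the drawn tier. With these weights T0 and T1 together account for 35% of draws, which stays above the 30% human-anchor floor.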

English

  1. Synthetic data marketplaces: buy verified shards by domain (finance QA, ICD-10 traces).
  2. Sim-to-text pipelines: Unity/Unreal logs → caption + reasoning datasets.
  3. Legal precedents: courts rule on the copyright status of synthetic data generated from copyrighted prompts (rulings split by jurisdiction).
  4. Enterprise default: internal fine-tunes use company-specific synthetic data generated from redacted docs + agents.
  5. Benchmark shift: “real-world holdout” suites gain prestige over synthetic-friendly benchmarks.
  6. Alignment synthetic: preference pairs generated and verified by debate models, plus a human audit sample.

中文

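  1. Synthetic data marketplaces: buy verified shards by domain.
  2. Sim-to-text: game/simulation logs → caption + reasoning datasets.
  3. Legal precedents: whether synthetic data infringes the copyright of its prompts (jurisdictions diverge).
  4. Enterprise default: internal fine-tunes use company synthetic data generated from redacted documents + agents.
  5. Benchmark shift: "real-world holdout" suites carry more weight.
  6. Alignment synthetic: preferences generated by debate models + a human audit sample.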

四、优缺点 | Pros and Cons

English

Pros: Unlimited scale; domain coverage; privacy (no raw PII in mix); balanced long-tail tasks; reproducible dataset versioning; cost efficiency.

Cons: Model collapse if verification weak; bias amplification from teacher models; legal uncertainty; eval overfitting to synthetic-friendly metrics; anchor drift if human slice too small; trust erosion if undisclosed synthetic %.

中文

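Pros: unlimited scale; domain coverage; privacy-friendly; long-tail tasks can be balanced; reproducible versions; low cost.

Cons: collapse when verification is weak; amplification of teacher-model bias; legal uncertainty; eval overfitting to synthetic-friendly metrics; drift when the anchor is too small; erosion of trust when the synthetic share is undisclosed.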


五、应用场景 | Use Cases

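Scenario        How synthetic data is used
Code LLMs       Synthetic repos filtered by executable unit tests
Medical NLP     De-identification + clinical notes synthesized from EHR structure templates
Multilingual    Back-translation + critic for low-resource languages
Robotics        Simulated trajectories → language-annotated action data
Finance         Synthetic transactions with balanced fraud labels
Alignment       Synthetic preferences + constitutional AI rule checks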

六、GitHub 生态 | GitHub Ecosystem

Repository                            Role
pytorch/pytorch                       Training loops with dynamic data mixing
NVIDIA NeMo Curator / similar         Large-scale synthetic curation pipelines
ekzhu/datasketch / dedupe tools       Near-duplicate purge at billion scale
EleutherAI lm-data-preparation        Open recipes for tier mixing
anthropics/claude-code                Generate verified code shards via agent+tests
Argilla / Label Studio                Human spot audit UI

Synthetic provenance: Emerging data-card.json standard in repos documents generator model hash, verifier version, and tier — adopted by FlagOpen ecosystem trainers.
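What such a card might contain can be sketched from the three items named above. The text does not give a schema for `data-card.json`, so the field names, the `DataCard` class, and the example values below are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

VALID_TIERS = {"T0", "T1", "T2", "T3"}


@dataclass
class DataCard:
    shard: str                  # file the card describes
    generator_model_hash: str   # hash identifying the generator model
    verifier_version: str       # version of the verifier that approved it
    tier: str                   # quality tier, T0..T3


def make_card(shard, generator_weights: bytes, verifier_version, tier):
    """Build a card, hashing the generator bytes with SHA-256."""
    if tier not in VALID_TIERS:
        raise ValueError(f"unknown tier: {tier}")
    digest = hashlib.sha256(generator_weights).hexdigest()
    return DataCard(shard, digest, verifier_version, tier)


# Placeholder inputs; a real pipeline would hash the actual model artifact.
card = make_card("finance-qa-0001.parquet", b"placeholder weights",
                 "critic-1.2.0", "T2")
card_json = json.dumps(asdict(card), indent=2)
```

Writing `card_json` next to the shard keeps the provenance record versioned together with the data it describes.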


七、深入探讨 | Extended Discussion

English

Synthetic Data 2.0 distinguishes generators from verifiers — often different model families to reduce self-reinforcing bias. Code synthetic pipelines run pytest + mutation testing; math pipelines use SymPy / Lean checkers; medical text passes UMLS consistency + clinician sample review. Provenance graphs link each shard to {generator, verifier, seed_hash, tier} stored beside parquet in HuggingFace-style repos.
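The executable-verification step for code can be sketched as "run the candidate together with its tests and keep it only if the process exits cleanly". This is a simplified illustration: it uses plain `assert` statements instead of pytest and mutation testing, the helper name `passes_tests` is hypothetical, and a subprocess with a timeout is not a real sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(candidate_src: str, test_src: str, timeout_s: float = 10.0) -> bool:
    """Return True if the candidate code plus its tests run with exit code 0."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "check.py"
        script.write_text(candidate_src + "\n\n" + test_src, encoding="utf-8")
        try:
            # NOTE: no isolation beyond a separate process; production
            # pipelines would run untrusted code inside a proper sandbox.
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False  # treat hangs as failures
        return result.returncode == 0
```

For example, `passes_tests("def add(a, b):\n    return a + b", "assert add(2, 3) == 5")` accepts the sample, while the same test rejects an implementation that subtracts.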

Enterprise trainers built internal synthetic factories on redacted Confluence pages and PDFs: an agent extracts facts → generates Q&A → a critic rejects unsupported claims → only approved shards enter the mix. Legal signed off when no verbatim PII left the enclave and the synthetic data did not regurgitate memorized source text (tested via membership inference probes).
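A much simpler proxy for the regurgitation test is to measure verbatim n-gram overlap between a synthetic sample and its source. To be clear, this is not a membership inference probe, only a cheap first-pass filter; the function names, the 8-token window, and the 5% threshold are assumptions made for illustration.

```python
def ngrams(text, n):
    """Set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(synthetic, source, n=8):
    """Fraction of the synthetic sample's n-grams that also occur in the source."""
    syn = ngrams(synthetic, n)
    if not syn:
        return 0.0  # sample shorter than n tokens: nothing to compare
    return len(syn & ngrams(source, n)) / len(syn)


def approve(synthetic, source, max_overlap=0.05, n=8):
    """Approve the sample only if verbatim overlap stays under the threshold."""
    return verbatim_overlap(synthetic, source, n) <= max_overlap
```

Samples shorter than `n` tokens always pass this check, so a real gate would pair it with the stronger probes the text mentions.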

Regulatory disclosure: EU AI Act annex templates ask for synthetic % by tier; US FDA draft guidance on AI medical devices requests data lineage including sim sources. Fears of benchmark gaming led to REAL-Bench 2026: holdout human-collected tasks never shown to major generators.

中文

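Synthetic Data 2.0 separates generators from verifiers, which often come from different model families to guard against self-reinforcing bias. Code synthesis runs pytest plus mutation testing; math uses SymPy/Lean; medical text passes UMLS consistency checks plus clinician sample review. Provenance graphs link each shard to {generator, verifier, seed_hash, tier}, stored beside the parquet files in HF-style repos.

Enterprise trainers built internal synthetic factories on redacted Confluence/PDF: an agent extracts facts → generates Q&A → a critic model rejects unsupported claims → only approved shards enter the mix. Legal signs off when no verbatim PII leaves the enclave and the synthetic data cannot regurgitate memorized source text (tested with membership inference probes).

Regulatory disclosure: EU AI Act annex templates ask for the synthetic share by tier; the FDA draft on AI devices requires data lineage that includes sim sources. Concerns about benchmark gaming gave rise to REAL-Bench 2026: holdout human tasks that generators have never seen.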

7.1 合成占比与 benchmark 表现 | Synthetic % vs. Benchmark (illustrative)

Synth %            MMLU-real-holdout     Code-live
20%                baseline              baseline
50%                −0.5%                 +2%
70%                −2.5%                 +4%
90% (no anchor)    −8%, collapse risk    overfit


Summary | 总结

In 2026, synthetic data is not a cheat code — it is the main fuel, governed by verification tiers, human anchors, and provenance — without which frontier scaling stalls.

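In 2026, synthetic data is not a shortcut but the main fuel, governed by verification tiers, human anchors, and provenance; where these are missing, frontier scaling stalls.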