AI 技术编年史 2026:通用空间基础大模型(2D+3D+物理)| Unified Spatial Foundation Model


一、背景 | Background

English

Spatial intelligence, meaning the ability to understand and manipulate the physical world across 2D images, 3D scenes, and dynamical simulations, was fragmented across separate model families through 2024–2025 (vision transformers, NeRF/Gaussian splatting, physics engines, robotics policies). In 2026, Spatial Foundation Models (SFMs) emerged as unified encoders and decoders trained on multimodal spatial tokens in a shared latent space: pixels, depth, point clouds, meshes, signed distance fields, and physics state vectors.

Led by research from major labs and robotics OEMs, SFMs enabled queries such as “If I move this cabinet 30 cm, will the door still open?” that require joint geometric and physical reasoning. The models integrate differentiable physics priors (contact, friction, fluid approximations) but do not replace full FEM simulators. Instead, they warm-start simulations and predict feasibility orders of magnitude faster than traditional pipelines.
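To make the cabinet-door query concrete, here is a purely geometric stand-in for the kind of check such a query implies. It is not how an SFM answers the question; it is a 2D sampling sketch under stated assumptions: the door is a line segment hinged at one point, it sweeps from 0° to 90°, and the only obstacle is an axis-aligned box. The function name `door_can_open` and all parameters are hypothetical.

```python
import math

def door_can_open(hinge, door_width, obstacle,
                  start_deg=0.0, end_deg=90.0, steps=90, samples=20):
    """Return True if the swept door never enters the obstacle box.

    hinge:    (x, y) position of the hinge in metres (top-down view)
    obstacle: (xmin, ymin, xmax, ymax) axis-aligned box in metres
    The door is sampled at `steps + 1` angles and `samples` points per angle,
    so very thin obstacles could be missed (an accepted simplification).
    """
    hx, hy = hinge
    xmin, ymin, xmax, ymax = obstacle
    for i in range(steps + 1):
        theta = math.radians(start_deg + (end_deg - start_deg) * i / steps)
        for j in range(1, samples + 1):
            r = door_width * j / samples          # distance from the hinge
            x = hx + r * math.cos(theta)
            y = hy + r * math.sin(theta)
            if xmin <= x <= xmax and ymin <= y <= ymax:
                return False                       # door hits the obstacle
    return True

# A 0.5 m door facing a wall 0.6 m away, then the same wall 0.3 m closer.
wall_before_move = (-1.0, 0.6, 1.0, 0.7)
wall_after_move = (-1.0, 0.3, 1.0, 0.4)
print(door_can_open((0.0, 0.0), 0.5, wall_before_move))
print(door_can_open((0.0, 0.0), 0.5, wall_after_move))
```

A learned model would replace the hand-written geometry with a prediction over the full scene, but the input and output have the same shape: a proposed change goes in, a feasibility answer comes out.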

Commercial SFMs shipped with SDK hooks for Unity, Unreal, and ROS2, letting game developers and robot OEMs share one pretrained backbone instead of licensing separate vision and physics modules. Energy and construction sectors used SFMs for site layout compliance — verifying crane reach and egress paths from phone-captured point clouds.

中文

空间智能——在 2D 图像、3D 场景与动力学仿真中理解与操控物理世界——在 2024–2025 年分散于视觉 Transformer、NeRF/高斯溅射、物理引擎与机器人策略等模型族。2026 年,空间基础大模型(SFM)以统一编解码器的形态出现,在共享潜空间中训练多模态空间 token:像素、深度、点云、网格、SDF 与物理状态向量。

SFM 支持如 「将此柜移 30cm,门还能开吗?」 的几何+物理联合推理。模型集成 可微物理先验(接触、摩擦、流体近似),不取代完整 FEM 仿真,而是 热启动 仿真并 预测可行性,比传统流水线快数个数量级。

商用 SFM 附带 Unity、Unreal、ROS2 SDK 挂钩,游戏与机器人 OEM 共享单一预训练骨干,而非分别授权视觉与物理模块。能源与建筑业用 SFM 做 工地布局合规 — 从手机点云验证吊臂 reach 与疏散路径。


二、架构 | Architecture

English

SFM unified architecture:

Multimodal Spatial Tokenizer
├── 2D patch encoder(ViT-style)
├── 3D point/voxel encoder(Pillar / sparse conv)
├── Mesh/SDF encoder
└── Physics state encoder(q, q̇, contact flags)

Shared Spatial Transformer(cross-attn across modalities)

Latent Scene Graph + Physical Property Heads

Decoders
├── 2D segmentation / depth / flow
├── 3D reconstruction / novel view
├── Trajectory rollout(short-horizon physics)
└── Affordance / grasp / navigation plans

Training recipe: Stage 1 — massive 2D+3D alignment (images ↔ point clouds ↔ CAD); Stage 2 — video + physics sim rollouts (Isaac, MuJoCo, custom GPU sims); Stage 3 — robot teleop fine-tuning with action heads. Losses combine reconstruction, contrastive spatial alignment, physics consistency (predicted vs. sim next-state), and optional RL from embodied tasks.
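The way these loss terms combine can be illustrated with a small sketch. It is not the actual SFM objective. The function names (`mse`, `info_nce`, `sfm_loss`), the InfoNCE form chosen for the contrastive term, the temperature, and the weights are all assumptions for illustration, and the optional RL term is left out.

```python
import math

def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def info_nce(sim_matrix, temperature=0.07):
    """Contrastive loss over a square similarity matrix.

    Row i holds similarities between sample i in one modality and every
    sample in the other; the diagonal entry is the matched pair.
    """
    total = 0.0
    for i, row in enumerate(sim_matrix):
        logits = [s / temperature for s in row]
        m = max(logits)                                   # for numerical stability
        log_sum = math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - (logits[i] - m)                # -log softmax of the match
    return total / len(sim_matrix)

def sfm_loss(recon_pred, recon_target, sim_matrix,
             next_state_pred, next_state_sim, weights=(1.0, 0.5, 1.0)):
    """Weighted sum of reconstruction, contrastive, and physics-consistency terms."""
    w_rec, w_con, w_phys = weights                        # hypothetical weights
    terms = {
        "reconstruction": mse(recon_pred, recon_target),
        "contrastive": info_nce(sim_matrix),
        "physics": mse(next_state_pred, next_state_sim),  # predicted vs. sim next state
    }
    terms["total"] = (w_rec * terms["reconstruction"]
                      + w_con * terms["contrastive"]
                      + w_phys * terms["physics"])
    return terms
```

Returning the individual terms alongside the total makes it easier to see which objective dominates at each training stage.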

中文

SFM 统一架构: 多模态空间 Tokenizer(2D/3D/网格/物理状态)→ 共享 Spatial Transformer(跨模态交叉注意力)→ 潜场景图+物理属性头 → 多解码器(2D/3D/轨迹/ affordance)。

训练配方: 三阶段——2D+3D 对齐;视频+物理仿真 rollout;机器人遥操作微调。损失含重建、对比对齐、物理一致性与可选 RL。

模块 (Module) | 输入 (Input) | 输出 (Output)
Spatial Tokenizer | RGB-D, LiDAR, CAD | Unified tokens
Scene Graph Head | Tokens | Objects, relations, materials
Physics Head | Tokens + action | Next state, feasibility score
Action Head | Tokens + goal | EE pose, base velocity
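The module table above can be read as a set of typed interfaces. The sketch below expresses that reading in Python. Every class, field, and method name here is hypothetical, and the constant-output head exists only to show that the interface can be satisfied, not to suggest how a real head behaves.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence, runtime_checkable

Tokens = list[list[float]]  # assumption: one embedding vector per spatial token

@dataclass
class SceneGraph:
    objects: list[str]
    relations: list[tuple[str, str, str]]   # (subject, predicate, object)
    materials: dict[str, str]               # object name -> material label

@dataclass
class PhysicsPrediction:
    next_state: list[float]
    feasibility: float                      # score in [0, 1]

@dataclass
class ActionCommand:
    ee_pose: tuple[float, ...]              # end-effector pose, e.g. x, y, z + quaternion
    base_velocity: tuple[float, float, float]  # vx, vy, yaw rate

@runtime_checkable
class PhysicsHead(Protocol):
    """Tokens + action -> next state and feasibility score."""
    def predict(self, tokens: Tokens, action: Sequence[float]) -> PhysicsPrediction: ...

class ConstantPhysicsHead:
    """Placeholder head: echoes the action as the next state with a fixed score."""
    def predict(self, tokens: Tokens, action: Sequence[float]) -> PhysicsPrediction:
        return PhysicsPrediction(next_state=list(action), feasibility=0.5)

head = ConstantPhysicsHead()
print(isinstance(head, PhysicsHead))
print(head.predict([[0.0, 1.0]], [0.1, 0.2]))
```

Keeping each head behind a narrow interface mirrors the "shared backbone, task-specific heads" layout: the backbone produces tokens once, and heads can be swapped without touching the tokenizer.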

English

  1. Single model for AR, robotics, and autonomous driving — shared backbone, task-specific heads.
  2. Real-time SFM on edge NPUs — distilled 1–3B spatial models at 30 FPS for drones.
  3. Generative spatial editing — text + drag to rearrange furniture with physical plausibility.
  4. Digital twin sync — SFM latent state ↔ live factory twin bidirectional update.
  5. Benchmark consolidation — SpatialBench 2026 unifies 2D, 3D, and physics metrics.
  6. Open weights — community models rivaling closed SFMs on indoor scenes.

中文

  1. AR/机器人/自动驾驶共用骨干,任务专用头。
  2. 边缘 NPU 实时 SFM — 蒸馏 1–3B 模型 30 FPS。
  3. 生成式空间编辑 — 文本+拖拽重排家具且物理 plausible。
  4. 数字孪生同步 — 潜状态与工厂孪生双向更新。
  5. 基准整合 — SpatialBench 2026 统一 2D/3D/物理指标。
  6. 开源权重 — 室内场景媲美闭源 SFM。

四、优缺点 | Pros and Cons

English

Pros: One model reduces integration cost; cross-modal transfer improves data efficiency; fast feasibility checks for planning; natural interface for human operators (language + sketch).

Cons: the physics head suffers from a sim-to-real gap; geometry can be hallucinated in occluded regions; tokenizing a full scene is compute-heavy; a wrong feasibility score is a safety risk in critical robotics; licensing of CAD and simulation training data raises open questions.

中文

优点: 单模型降低集成成本;跨模态迁移提升数据效率;规划可行性快检;自然的人机接口。

缺点:物理头存在 sim-to-real 差距;遮挡区域出现几何幻觉;全场景 token 算力开销高;可行性评分出错会给关键机器人任务带来安全风险;CAD/仿真训练数据存在许可问题。


五、应用场景 | Use Cases

场景 (Scenario) | English
仓储机器人路径规划 | Warehouse robots: feasibility-aware path planning
AR 家具摆放 | AR furniture placement with collision + stability
自动驾驶场景理解 | AV scene understanding: objects + affordances + weather physics
工业数字孪生 | Factory digital twin: predict jam before physical failure
游戏/元宇宙 | Procedural worlds with consistent physics
手术规划 | Surgical planning from CT + instrument reachability

六、GitHub 生态 | GitHub Ecosystem

Repository | Role
pytorch/pytorch | 3D ops, differentiable rendering backends
NVIDIA Isaac Sim / Orbit | Sim data generation for SFM stage 2
OpenScene / Uni3D research repos | Unified 2D–3D pretraining code
FlagOpen/FlagOS | Deploy spatial transformers on heterogeneous edge+cloud
Meta SAM 3D / successor forks | Segmentation → spatial token pipelines

七、深入探讨 | Extended Discussion

English

SFMs in 2026 inherit world-model ambitions but prioritize actionable spatial queries over pixel-perfect video generation. A warehouse robot asks the SFM: “Can this pallet fit on shelf B3 considering fork width?” — the model returns {feasible: true, confidence: 0.92, blocking_objects: []} without running full motion planning. Full planners consume SFM outputs as heuristic seeds, cutting planning time 5–20×.
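One way a planner might consume such a response is sketched below. The field names follow the example response in the text. The `choose_planning_mode` function, the mode labels, and the 0.9 confidence threshold are assumptions for illustration. Because a wrong feasibility score is a safety concern, the sketch falls back to full planning whenever confidence is low.

```python
from dataclasses import dataclass, field

@dataclass
class FeasibilityResult:
    feasible: bool
    confidence: float
    blocking_objects: list[str] = field(default_factory=list)

def choose_planning_mode(result: FeasibilityResult, min_confidence: float = 0.9) -> str:
    """Decide how much to trust the fast feasibility answer.

    "full"   : confidence too low, run the complete planner from scratch
    "seeded" : feasible with high confidence, use the answer as a heuristic seed
    "replan" : infeasible with high confidence, plan around the blocking objects
    """
    if result.confidence < min_confidence:
        return "full"
    if result.feasible:
        return "seeded"
    return "replan"

# The pallet query from the text: feasible, confidence 0.92, nothing blocking.
pallet_query = FeasibilityResult(feasible=True, confidence=0.92)
print(choose_planning_mode(pallet_query))
```

In every branch the full planner still has the final say; the fast answer only changes where the search starts.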

Representation unification uses hybrid tri-plane + point tokens for indoor scenes and NeRF-Gaussian latent codes for outdoor driving. Physics heads predict contact manifolds, learned from simulation with friction randomization. Material property prediction (mass, friction, deformability) enables grasp planning even for novel objects.
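To see why predicted mass and friction matter for grasping, consider a worked example under simple assumptions: a two-finger parallel gripper holding an object against gravity, with Coulomb friction at each contact. The friction forces must carry the weight, so n·μ·N ≥ m·g, which gives a minimum normal force per finger of N = s·m·g / (n·μ) with safety factor s. The function name and default values below are hypothetical.

```python
G = 9.81  # gravitational acceleration in m/s^2

def min_grip_force(mass_kg: float, friction_coeff: float,
                   safety_factor: float = 2.0, contacts: int = 2) -> float:
    """Minimum normal force per contact (newtons) to hold an object vertically.

    Assumes Coulomb friction and identical contacts: contacts * mu * N >= m * g.
    """
    if mass_kg <= 0 or friction_coeff <= 0 or contacts <= 0:
        raise ValueError("mass, friction coefficient, and contacts must be positive")
    return safety_factor * mass_kg * G / (contacts * friction_coeff)

# A 1 kg object with mu = 0.5: 9.81 N per finger before the safety margin.
print(min_grip_force(1.0, 0.5, safety_factor=1.0))
```

If the material head underestimates friction, the computed force is only more conservative; if it overestimates friction, the object may slip, which is one reason a safety factor is included.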

Training data governance remains contentious: CAD models from manufacturers, synthetic room layouts, and robot teleop logs dominate mixes; raw consumer video usage declined after privacy settlements. Open SFMs lag closed models on outdoor long-tail (weather, construction sites) but match on structured indoor benchmarks.

中文

2026 SFM 继承 世界模型 雄心但优先 可行动空间查询 而非像素级视频。仓储机器人问:「考虑叉宽,此托盘能放 B3 架吗?」 — 模型返回 {feasible, confidence, blocking_objects} 无需完整运动规划。完整规划器将 SFM 输出作 启发 seed,规划时间缩短 5–20×。

表示统一:室内场景采用 tri-plane+点混合 token,室外驾驶采用 NeRF-高斯潜码。物理头预测经仿真摩擦随机化学习得到的接触流形。材料属性预测(质量、摩擦、可变形性)支持对新物体的抓取规划。

训练数据治理仍有争议:厂商 CAD、合成房间布局与机器人遥操作日志主导训练数据配比;隐私和解之后,消费者原始视频的使用下降。开源 SFM 在室外长尾场景落后于闭源模型,在结构化室内 benchmark 上持平。

7.1 与机器人栈集成 | Robotics Stack Integration

SFM latent → MoveIt / Isaac motion planner
→ GraspNet action head
→ Nav2 costmap overlay (affordance heatmap)

English: SFMs become the shared spatial memory across perception, planning, and NL instruction — replacing siloed depth nets + semantic seg + separate physics engine calls.

中文: SFM 成为感知、规划与自然语言指令间的 共享空间记忆 — 取代孤立的深度网络+语义分割+独立物理引擎调用。
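The costmap overlay step can be sketched as a cell-wise conversion from an affordance heatmap to navigation costs. The cost constants follow the values conventionally used by Nav2's costmap layer (0 free, 254 lethal, 255 unknown). The linear mapping, the function names, and the "keep the higher cost" merge rule are assumptions for illustration and not an actual Nav2 plugin.

```python
FREE_SPACE = 0
MAX_NON_LETHAL = 252      # highest cost that is still traversable
LETHAL_OBSTACLE = 254
NO_INFORMATION = 255

def affordance_to_cost(affordance):
    """Map an affordance score in [0, 1] to a cost; None means no prediction."""
    if affordance is None:
        return NO_INFORMATION
    a = min(1.0, max(0.0, affordance))          # clamp out-of-range scores
    return round((1.0 - a) * MAX_NON_LETHAL)    # high affordance -> low cost

def overlay(heatmap, static_costmap):
    """Merge the heatmap into an existing costmap, keeping the higher cost.

    Cells without a prediction keep their static cost, so the overlay can
    raise costs but never erase a known obstacle.
    """
    merged = []
    for heat_row, cost_row in zip(heatmap, static_costmap):
        row = []
        for a, c in zip(heat_row, cost_row):
            row.append(c if a is None else max(c, affordance_to_cost(a)))
        merged.append(row)
    return merged

heatmap = [[1.0, 0.0], [None, 0.5]]
static = [[FREE_SPACE, FREE_SPACE], [LETHAL_OBSTACLE, 200]]
print(overlay(heatmap, static))
```

Taking the maximum keeps the overlay conservative: a confident affordance prediction cannot make a cell cheaper than the static map already says it is.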


八、参考链接 | References

  • Fei-Fei Li, “Spatial Intelligence” keynote themes (2025–2026)
  • UniSim, Genie, and world-model survey papers
  • SpatialBench 2026 leaderboard
  • 本系列:ai-timeline-2025-spatial-intelligence

Summary | 总结

2026 SFMs unify 2D, 3D, and physics in one latent space — the algorithmic backbone for embodied AI, AR, and industrial twins.

2026 SFM 在统一潜空间中融合 2D、3D 与物理 — 具身 AI、AR 与工业孪生的算法骨干。