AI 技术编年史 2025:端侧大模型与 NPU 部署 — Edge LLM on NPU
端侧大模型与 NPU 部署 | Edge LLM and NPU Deployment
English Title: AI Technology Timeline 2025 — Edge LLM on NPU
一、背景 | Background
English
By May 2025, running multi-billion-parameter LLMs on device NPUs (Neural Processing Units) had shifted from demos to shipping features: on-device summarization, coding assistance, visual Q&A, and real-time translation, all without cloud round-trips. Edge LLM means inference on a phone, PC, car, or robot SoC rather than on a datacenter GPU. NPU refers to a dedicated AI accelerator (Apple Neural Engine, Qualcomm Hexagon, Intel NPU, AMD XDNA, MediaTek APU).
Drivers: latency, privacy, offline availability, inference cost, and data-residency regulation. Cloud LLMs still handle the hardest tasks; the edge handles high-frequency, low-latency interactions and falls back to the cloud on demand.
Keywords:
| Term | Meaning |
|---|---|
| NPU | Fixed-function or programmable AI accelerator on the SoC |
| INT4 / INT8 quantization | Lower-precision weights/activations to raise throughput |
| Speculative decoding | Small draft model proposes tokens; large model verifies them on NPU/GPU |
| KV cache compression | Reduce memory for long context on edge |
| Split inference | Sensitive prefix on device; heavy reasoning in cloud |
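The speculative decoding entry in the table above can be illustrated with a minimal greedy sketch. It assumes each "model" is simply a function from a token prefix to the next token. The names `speculative_decode`, `draft_model`, and `target_model` are hypothetical and do not belong to any runtime's API. On real hardware the verification step is one batched forward pass of the large model; here it is written as a loop for clarity.

```python
from typing import Callable, List

# Assumption: a "model" is any function mapping a token prefix to the next token.
NextToken = Callable[[List[int]], int]


def speculative_decode(draft: NextToken, target: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding; output equals target-only greedy decoding."""
    tokens = list(prompt)
    goal = len(prompt) + max_new
    while len(tokens) < goal:
        # 1) The small draft model proposes up to k tokens autoregressively.
        n = min(k, goal - len(tokens))
        proposal: List[int] = []
        for _ in range(n):
            proposal.append(draft(tokens + proposal))
        # 2) The large model checks every proposed position.
        for i, tok in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if tok != expected:
                # First mismatch: keep the accepted prefix plus the target's token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            # Every proposed token was accepted.
            tokens.extend(proposal)
    return tokens


# Toy models: the target counts upward modulo 10; the draft is wrong after a 4.
def target_model(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def draft_model(prefix: List[int]) -> int:
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


result = speculative_decode(draft_model, target_model, [0], max_new=8)
```

Because every emitted token is either confirmed or supplied by the target model, the draft model affects only speed, never the output. The gain depends on how often the draft's proposals are accepted.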
中文
至 2025 年 5 月,在设备 NPU 上运行数十亿参数 LLM 已从演示变为量产功能:端侧摘要、编码辅助、视觉问答、实时翻译,均无需云端往返。端侧 LLM 指在手机、PC、车载或机器人 SoC 上推理,而非在数据中心 GPU 上推理。NPU 指专用 AI 加速器(Apple Neural Engine、高通 Hexagon、Intel NPU、AMD XDNA、联发科 APU)。
驱动因素:延迟、隐私、离线可用、推理成本、数据驻留法规。云端 LLM 仍处理最难任务;端侧负责高频低延迟交互,并按需回退云端。
关键词:
| 术语 | 含义 |
|---|---|
| NPU | SoC 上固定功能或可编程 AI 加速器 |
| INT4/INT8 量化 | 降低精度以换吞吐 |
| 投机解码 | 小 draft 模型 + 大模型校验 |
| KV cache 压缩 | 端侧长上下文省内存 |
| 分裂推理 | 敏感前缀在端;重推理在云 |
二、架构 | Architecture
English
Application layer: OS copilot, camera, IDE plugin
2025 deployment stack layers:
- Model preparation: Prune, quantize (GPTQ/AWQ), and distill a 70B teacher into a 7B student
- Graph capture: Export to ONNX or a vendor IR; fuse RoPE, RMSNorm, and SwiGLU
- Memory planning: Statically allocate KV cache slots; batch=1 is the interactive default
- Hybrid routing: A policy engine keeps PII-heavy prompts local and sends complex reasoning to the cloud
- Update channel: Signed OTA model packs; A/B performance regression testing on a device farm
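The hybrid routing layer in the list above might look like the following sketch. Everything here is an assumption made for illustration: the two regex patterns stand in for a vetted PII detector, prompt length is a crude proxy for "complex reasoning", and `route_prompt`, `Route`, and `max_local_words` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Assumption: two illustrative patterns, not a complete PII detector.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),         # US SSN-like number
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),  # email address
]


@dataclass
class Route:
    target: str  # "local" or "remote"
    reason: str


def route_prompt(prompt: str, online: bool = True,
                 max_local_words: int = 200) -> Route:
    """Decide where a prompt runs; privacy rules are checked first."""
    # PII-heavy prompts stay on the device regardless of complexity.
    if any(p.search(prompt) for p in PII_PATTERNS):
        return Route("local", "pii")
    # With no connectivity the only option is the local model.
    if not online:
        return Route("local", "offline")
    # Assumption: long prompts are treated as heavy reasoning and sent to the cloud.
    if len(prompt.split()) > max_local_words:
        return Route("remote", "long_prompt")
    return Route("local", "default")
```

Checking the privacy rule before the complexity rule means a sensitive prompt is never sent to the cloud just because it is long.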
中文
应用层:系统 Copilot、相机、IDE 插件
2025 部署栈:
- 模型准备: 剪枝、量化(GPTQ/AWQ)、70B→7B 蒸馏
- 图捕获: 导出 ONNX 或厂商 IR;融合 RoPE、RMSNorm、SwiGLU
- 内存规划: KV cache 静态分配;默认 batch=1 交互
- 混合路由: 策略引擎将 PII 重提示留本地,复杂推理送远程
- 更新通道: OTA 签名模型包;设备农场 A/B 性能回归
三、趋势 | Trends
English
| Trend | Description |
|---|---|
| 7B–8B as edge sweet spot | Quality close to 2023 cloud GPT-3.5; fits in 8–16 GB unified memory |
| NPU TOPS marketing → tokens/s reality | Industry reports sustained tok/s under thermal limits rather than peak TOPS |
| Multimodal on NPU | Small VLM (3B) for screen/camera understanding alongside the text LLM |
| AI PCs | Copilot+ class devices; a 40+ TOPS NPU is mandatory for OEM badges |
| Automotive cockpit LLM | Local voice assistant; maps and cloud used only for live traffic |
| Cross-vendor compiler pressure | ONNX and open IRs reduce lock-in; see the 2025 NPU compiler post |
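The memory claim in the first row can be checked with back-of-the-envelope arithmetic. The sketch below assumes an illustrative 7B-class configuration (32 layers, 8 KV heads, head dimension 128, 8192-token context, FP16 cache). These values are assumptions for the example, not the specification of any particular model. It also ignores quantization scale tables, activations, and runtime overhead.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: int) -> float:
    """Bytes needed to store the weights at a given precision."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes for the KV cache at batch=1; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Assumed configuration: 7e9 parameters quantized to INT4.
weights = weight_bytes(7e9, bits_per_weight=4)
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
total_gib = (weights + kv) / GIB
print(f"weights {weights / GIB:.2f} GiB, KV cache {kv / GIB:.2f} GiB, "
      f"total {total_gib:.2f} GiB")
```

Under these assumptions the weights take 3.5 GB and the cache about 1 GiB, which leaves headroom on an 8 GB device. The same model at FP16 would need 14 GB for weights alone.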
中文
| 趋势 | 说明 |
|---|---|
| 7B–8B 端侧甜点 | 质量接近 2023 云端 GPT-3.5;适配 8–16 GB 统一内存 |
| TOPS 营销 → 真实 tok/s | 产业在热限制下公布持续 tok/s,而非峰值 TOPS |
| NPU 多模态 | 小 VLM(3B)与文本 LLM 并存,理解屏幕/相机 |
| AI PC | Copilot+ 类设备,40+ TOPS NPU 成 OEM 认证门槛 |
| 车载座舱 LLM | 本地语音助手;仅实时路况使用云端 |
| 跨厂商编译器压力 | ONNX + 开放 IR 减锁定;见 2025 NPU 编译器文 |
四、优缺点 | Pros/Cons
English
Pros
- Sub-100 ms first-token latency for assistants; works in airplane mode
- On local paths, user data never leaves the device, which eases GDPR/CCPA compliance
- Zero marginal cloud cost per query at scale (consumer devices)
- Always-on wake word plus LLM without draining the battery, provided the NPU is power-efficient
Cons
- Model freshness lags the cloud; OTA size is constrained (GB-scale updates annoy users)
- Quality gap vs. frontier cloud models on reasoning and coding
- Fragmentation: each SoC needs vendor-specific graph tuning
- Thermal throttling reduces tok/s on sustained loads (gaming + LLM)
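Two numbers from the lists above, first-token latency and sustained tok/s under throttling, can both be derived from per-token timestamps. The sketch below assumes the timestamps were collected with something like `time.perf_counter()` and are sorted ascending. The function name `ttft_and_windows` and the 10-second window are illustrative choices.

```python
from typing import List, Tuple


def ttft_and_windows(start: float, token_times: List[float],
                     window_s: float = 10.0) -> Tuple[float, List[float]]:
    """Return time-to-first-token and tokens/s for each consecutive window."""
    if not token_times:
        raise ValueError("no tokens were generated")
    ttft = token_times[0] - start
    # Bucket each token into the window in which it was emitted.
    n_windows = int((token_times[-1] - start) // window_s) + 1
    counts = [0] * n_windows
    for t in token_times:
        counts[int((t - start) // window_s)] += 1
    return ttft, [c / window_s for c in counts]


# Synthetic run: 2 tok/s for the first 10 s, then throttled to 1 tok/s.
times = [0.05 + 0.5 * i for i in range(20)] + [10.0 + i for i in range(10)]
ttft, rates = ttft_and_windows(0.0, times)
throttle_ratio = min(rates) / max(rates)
```

Reporting per-window rates rather than one average makes throttling visible. Note that the last window may be only partly filled, so its rate can understate true throughput.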
中文
优点
- 助手首 token <100 ms;飞行模式可用
- 本地路径用户数据不出设备——友好 GDPR/CCPA
- 规模上零边际云成本(消费设备)
- 常开唤醒 + LLM,NPU 高效时不拖垮电池
缺点
- 模型新鲜度落后云端;OTA 体积限制(GB 级更新用户反感)
- 推理/编码质量仍落后前沿云模型
- 碎片化:每 SoC 需厂商图调优
- 持续负载(游戏 + LLM)下热节流降低 tok/s
五、应用场景 | Use Cases
English
| Device | Edge LLM function |
|---|---|
| Smartphone | Message rewrite, photo search, live caption |
| Laptop AI PC | Offline code completion, meeting notes |
| Smart glasses | Visual Q&A on NPU + low-power display |
| Robot / drone | Mission replanning when link drops |
| Industrial handheld | Manual Q&A in factories without Wi-Fi |
| Car IVI | Natural language HVAC, nav, vehicle settings |
中文
| 设备 | 端侧 LLM 功能 |
|---|---|
| 智能手机 | 消息改写、照片搜索、实时字幕 |
| AI PC 笔记本 | 离线代码补全、会议纪要 |
| 智能眼镜 | NPU 视觉问答 + 低功耗显示 |
| 机器人 / 无人机 | 链路断开时任务重规划 |
| 工业手持 | 无 Wi-Fi 工厂手册问答 |
| 车载 IVI | 自然语言空调、导航、车控 |
六、GitHub 开源生态 | GitHub
English
| Repository | Role |
|---|---|
| ggml-org/llama.cpp | Cross-platform CPU/NPU-friendly LLM inference |
| pytorch/executorch | Meta edge runtime for mobile NPU deployment |
| microsoft/onnxruntime-genai | ONNX Runtime generative AI on NPUs |
| NVIDIA/Cosmos-Tokenizer | Image and video tokenizers that can feed on-device multimodal stacks |
中文
| 仓库 | 作用 |
|---|---|
| ggml-org/llama.cpp | 跨平台 CPU/NPU 友好 LLM 推理 |
| pytorch/executorch | Meta 移动端 NPU 部署运行时 |
| microsoft/onnxruntime-genai | NPU 上 ONNX 生成式 AI |
| NVIDIA/Cosmos-Tokenizer | 可用于端侧多模态栈的图像与视频 tokenizer |
七、参考资料 | References
- Qualcomm — Hexagon NPU LLM benchmark whitepapers (2025)
- Apple — WWDC Neural Engine and Core ML generative models
- Intel — AI PC program and OpenVINO NPU plugin docs
- MLC LLM — Universal LLM deployment project
- IEEE — Edge LLM survey (2024–2025)
八、产业观察与深度解读 | Industry Observations and Deep Dive
English
Supply chain and talent: By the second half of 2025, enterprises stopped treating edge LLM deployment as a pilot KPI and moved it into annual operating plans. Procurement asked for three-year TCO, not demo accuracy. System integrators packaged reference architectures with SLA-backed support, mirroring how cloud migrations matured a decade earlier.
Interoperability: Open interfaces and formats (MCP, ONNX, MLIR dialects where relevant) reduced lock-in, but data gravity still tied customers to the platforms with the best vertical corpus or compiler backend. Winners combined open runtimes with proprietary gold datasets or silicon-tuned kernels.
Risk register (2025 common items): (1) Evaluation gap—public benchmarks no longer predict production; (2) Security—prompt injection and tool abuse in agentic stacks; (3) Regulatory—algorithm filing, EU AI Act high-risk categories; (4) Talent—shortage of engineers who understand both ML and domain workflows.
Research frontiers carrying into 2026: Tighter world-model / spatial / sim integration; self-evolving alignment with human audit; cross-chip compilers (see 2026 timeline). Teams that invested in measurement—latency, cost per task, failure replay—outperformed teams chasing parameter counts.
中文
供应链与人才: 2025 年下半年,企业不再将端侧 LLM 部署仅作试点 KPI,而是写入年度经营计划。采购要求三年 TCO,而非 demo 准确率。系统集成商打包带 SLA 的参考架构,类似十年前的云迁移成熟路径。
互操作: 开放接口与格式(MCP、ONNX、相关 MLIR dialect)降低锁定,但数据重力仍把客户绑在拥有最佳垂直语料或编译后端的平台上。胜者将开放运行时与专有黄金数据集或芯片级调优内核相结合。
风险登记(2025 共性): (1) 评估鸿沟:公开 benchmark 不再预测生产表现;(2) 安全:Agent 栈中的提示注入与工具滥用;(3) 监管:算法备案、EU AI Act 高风险类别;(4) 人才:既懂 ML 又懂领域工作流的工程师短缺。
延续至 2026 的研究前沿: 世界模型、空间与仿真更紧密耦合;带人工审计的自演化对齐;跨芯片编译器(见 2026 时间线)。投资于度量(延迟、单任务成本、失败回放)的团队胜过追逐参数量的团队。
Glossary | 术语表
| EN | 中文 | One-line |
|---|---|---|
| Foundation model | 基础模型 | Large pretrained model that is adapted to downstream tasks |
| Finetune | 微调 | Update weights on domain data |
| RAG | 检索增强生成 | Retrieve docs then generate grounded answers |
| Sim2real | 仿真到真实 | Transfer policies from simulator to physical world |
| TCO | 总拥有成本 | Full cost of ownership over deployment lifetime |
九、实施路线图(2025 Q2–Q4)| Implementation Roadmap
English
| Phase | Actions | Success metric |
|---|---|---|
| Assess | Inventory data, latency, compliance | Gap report signed by domain lead |
| Pilot | One workflow, HITL, private eval | >80% task success on golden set |
| Harden | SLO, monitoring, rollback | p95 latency and cost per task stable 4 weeks |
| Scale | Multi-site rollout, train-the-trainer | Adoption without support ticket spike |
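The Pilot and Harden success metrics in the table above can be turned into automated gates. The sketch below makes two assumptions that the table does not state: "stable" means each of the last four weekly p95 values stays within 10% of their mean, and only latency is checked (cost per task would follow the same pattern). The function names are hypothetical.

```python
from statistics import quantiles
from typing import List, Sequence


def pilot_gate(results: Sequence[bool], threshold: float = 0.80) -> bool:
    """Pilot passes when golden-set task success strictly exceeds the threshold."""
    return len(results) > 0 and sum(results) / len(results) > threshold


def p95(samples: Sequence[float]) -> float:
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return quantiles(samples, n=20)[-1]


def harden_gate(weekly_latencies_ms: List[Sequence[float]],
                tolerance: float = 0.10) -> bool:
    """Assumed rule: the last 4 weekly p95 values stay within tolerance of their mean."""
    if len(weekly_latencies_ms) < 4:
        return False
    p95s = [p95(week) for week in weekly_latencies_ms[-4:]]
    mean = sum(p95s) / len(p95s)
    return all(abs(v - mean) <= tolerance * mean for v in p95s)
```

A gate like this could run in CI against the private eval set and the serving metrics, so that phase transitions rest on recorded numbers instead of a one-off demo.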
Team roles: Product owner (workflow), ML engineer (model/compiler), domain expert (gold labels), and SRE (serving). Production needs at least these four roles, not a lone prompt engineer.
中文
| 阶段 | 行动 | 成功指标 |
|---|---|---|
| 评估 | 清点数据、延迟、合规 | 领域负责人签字差距报告 |
| 试点 | 单工作流、HITL、私有 eval | 黄金集任务成功率 >80% |
| 加固 | SLO、监控、回滚 | p95 延迟与单任务成本稳定 4 周 |
| 推广 | 多站点、培训 | 支持工单无尖峰 |
团队角色: 产品负责人(工作流)、ML 工程师(模型/编译器)、领域专家(黄金标注)、SRE(serving)。生产环境至少需要这四个角色,而非单打独斗的提示工程师。
总结 | Summary
中文: 截至 2025 年 5 月,端侧 LLM + NPU 是 AI 民主化的硬件基础:7B 级模型进口袋,编译器与量化决定真实体验。云边协同而非云边二选一,是产业共识。
English: As of May 2025, edge LLM on NPU is the hardware base for democratized AI: 7B-class models fit in your pocket, and compilers and quantization determine the real user experience. Cloud-edge collaboration, rather than an either-or choice, is the industry consensus.