AI 技术编年史 2025:端侧大模型与 NPU 部署 — Edge LLM on NPU

端侧大模型与 NPU 部署 | Edge LLM and NPU Deployment

English Title: AI Technology Timeline 2025 — Edge LLM on NPU


一、背景 | Background

English

By May 2025, running multi-billion-parameter LLMs on device NPUs (Neural Processing Units) had shifted from demos to shipping features: on-device summarization, coding assistance, visual Q&A, and real-time translation, all without cloud round-trips. "Edge LLM" here means inference on a phone, PC, car, or robot SoC rather than on a datacenter GPU. "NPU" refers to a dedicated AI accelerator such as the Apple Neural Engine, Qualcomm Hexagon, Intel NPU, AMD XDNA, or MediaTek APU.

Drivers: latency, privacy, offline availability, inference cost, and data-residency regulation. Cloud LLMs still handle the hardest tasks; the edge handles high-frequency, low-latency interactions and falls back to the cloud on demand.
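
The edge-first split with cloud fallback described above can be sketched as a small routing function. This is only an illustration under stated assumptions: the `Route` type, the two regular expressions standing in for a PII detector, and the 512-word local budget are all hypothetical and do not come from any vendor's policy engine.

```python
import re
from dataclasses import dataclass

# Hypothetical PII patterns for illustration only; a real policy engine
# would use a proper detector rather than two regular expressions.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),           # email-like strings
    re.compile(r"\b\d{3}[- ]?\d{3,4}[- ]?\d{4}\b"),   # phone-like digit runs
]


@dataclass
class Route:
    target: str  # "device" or "cloud"
    reason: str


def route_prompt(prompt: str, online: bool, max_local_words: int = 512) -> Route:
    """Decide where a prompt runs: privacy first, then connectivity, then size."""
    if any(p.search(prompt) for p in PII_PATTERNS):
        return Route("device", "contains PII-like text")
    if not online:
        return Route("device", "offline")
    if len(prompt.split()) > max_local_words:
        return Route("cloud", "prompt exceeds local budget")
    return Route("device", "default low-latency path")


print(route_prompt("Rewrite this note for a@b.com", online=True))
print(route_prompt("word " * 600, online=True))
```

The checks are ordered so that privacy wins over everything else: a PII-like prompt stays on the device even when a cloud model would answer better.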

Keywords:

Term | Meaning
NPU | Fixed-function or programmable AI accelerator on the SoC
INT4 / INT8 quantization | Lower-precision weights and activations, trading some accuracy for throughput and memory
Speculative decoding | A small draft model proposes tokens; a larger model verifies them
KV cache compression | Reduces the memory needed for long context on edge devices
Split inference | Sensitive prefix stays on device; heavy reasoning runs in the cloud
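
To make the speculative decoding entry concrete, here is a minimal greedy sketch. The two "models" are toy next-token functions invented for this example (`toy_target`, `toy_draft`), not real networks, and the verification step loops position by position where a real runtime would score all draft positions in one batched pass. Under greedy decoding the result should match what the target model would produce alone, which the last line checks.

```python
from typing import Callable, List

# A toy stand-in for a model: maps the token sequence so far to the next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: draft proposes k tokens, target verifies."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1) The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The target checks each proposed position in order.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # fix the first mismatch, then stop
                break
            accepted.append(tok)
        else:
            # Every proposal matched: the verify pass yields one bonus token.
            if len(out) + len(accepted) < goal:
                accepted.append(target(out + accepted))
        out.extend(accepted)
    return out


def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 7


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except after token 5, to force a rejection.
    return 0 if seq[-1] == 5 else (seq[-1] * 3 + 1) % 7


print(speculative_decode(toy_target, toy_draft, [1], 12) == greedy_decode(toy_target, [1], 12))
```

The speedup in practice comes from the target accepting several draft tokens per verification pass; when the draft is often wrong, the scheme degrades toward ordinary decoding plus draft overhead.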

中文

至 2025 年 5 月,在设备 NPU 上运行数十亿参数 LLM 已从演示变为量产功能:端侧摘要、编码辅助、视觉问答、实时翻译,均无需云端往返。端侧 LLM 指在手机、PC、车载或机器人 SoC 上推理,而非数据中心 GPU。NPU 指专用 AI 加速器(Apple Neural Engine、高通 Hexagon、Intel NPU、AMD XDNA、联发科 APU)。

驱动因素:延迟、隐私、离线可用、推理成本、数据驻留法规。云端 LLM 仍处理最难任务;端侧负责高频低延迟交互,并按需回退云端。

关键词:

术语 | 含义
NPU | SoC 上固定功能或可编程 AI 加速器
INT4/INT8 量化 | 降低精度以换吞吐
投机解码 | 小 draft 模型 + 大模型校验
KV cache 压缩 | 端侧长上下文省内存
分裂推理 | 敏感前缀在端;重推理在云

二、架构 | Architecture

English

Application (OS copilot, camera, IDE plugin)

Runtime (ONNX Runtime, ExecuTorch, llama.cpp, MLC-LLM)

Graph compiler (TensorRT-LLM, OpenVINO, QNN, Core ML)

NPU driver + firmware (operator kernels, memory tiling)

Hardware NPU + shared DRAM / unified memory

2025 deployment stack layers:

  1. Model preparation: Prune, quantize (GPTQ/AWQ), and distill a 70B teacher into a 7B student
  2. Graph capture: Export to ONNX or a vendor IR; fuse RoPE, RMSNorm, and SwiGLU
  3. Memory planning: Statically allocate KV cache slots; default to batch=1 for interactive use
  4. Hybrid routing: A policy engine keeps PII-heavy prompts local and sends complex reasoning to the cloud
  5. Update channel: Signed OTA model packs; A/B performance regression on a device farm
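
Step 1 names GPTQ and AWQ, which choose quantized values using calibration data. The sketch below is neither of those: it is plain symmetric round-to-nearest INT4 with one scale per group, shown only to illustrate what "quantize" does to a row of weights. The function names and the group size of 4 are assumptions for the example (real group sizes are usually much larger).

```python
from typing import List, Tuple

Packed = List[Tuple[float, List[int]]]  # (scale, int4 values) per group


def quantize_int4(row: List[float], group: int = 4) -> Packed:
    """Symmetric round-to-nearest INT4 with one float scale per group."""
    packed: Packed = []
    for i in range(0, len(row), group):
        chunk = row[i:i + group]
        # Map the largest magnitude in the group to 7; use 1.0 for all-zero groups.
        scale = max(abs(w) for w in chunk) / 7.0 or 1.0
        # Clamp to the signed 4-bit range [-8, 7].
        q = [max(-8, min(7, round(w / scale))) for w in chunk]
        packed.append((scale, q))
    return packed


def dequantize(packed: Packed) -> List[float]:
    """Recover approximate float weights from scales and integer codes."""
    return [scale * v for scale, q in packed for v in q]


weights = [0.12, -0.50, 0.33, 0.07, 1.90, -2.10, 0.40, 0.00]
packed = quantize_int4(weights)
restored = dequantize(packed)
worst = max(abs(w - r) for w, r in zip(weights, restored))
print(f"worst absolute error: {worst:.4f}")
```

Per-group scales keep one large weight from flattening the resolution of the whole row; the rounding error in each group is bounded by half of that group's scale.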

中文

应用(系统 Copilot、相机、IDE 插件)

运行时(ONNX Runtime、ExecuTorch、llama.cpp、MLC-LLM)

图编译器(TensorRT-LLM、OpenVINO、QNN、Core ML)

NPU 驱动 + 固件(算子内核、内存 tiling)

硬件 NPU + 共享 DRAM / 统一内存

2025 部署栈:

  1. 模型准备: 剪枝、量化(GPTQ/AWQ)、70B→7B 蒸馏
  2. 图捕获: 导出 ONNX 或厂商 IR;融合 RoPE、RMSNorm、SwiGLU
  3. 内存规划: KV cache 静态分配;默认 batch=1 交互
  4. 混合路由: 策略引擎将 PII 重提示留本地,复杂推理送远程
  5. 更新通道: OTA 签名模型包;设备农场 A/B 性能回归

三、趋势 | Trends

English

Trend | Description
7B–8B as edge sweet spot | Quality near the 2023 cloud GPT-3.5 level; fits in 8–16 GB of unified memory
NPU TOPS marketing → tokens/s reality | Vendors publish sustained tok/s under thermal limits
Multimodal on NPU | A small VLM (3B) for screen/camera understanding runs alongside the text LLM
AI PCs | Copilot+ class devices; a 40+ TOPS NPU is mandatory for the OEM badge
Automotive cockpit LLM | Local voice assistant; only live map traffic goes to the cloud
Cross-vendor compiler pressure | ONNX and open IRs reduce lock-in; see the 2025 NPU compiler post
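
A rough back-of-envelope check of the "fits in 8–16 GB" claim for a 7B-class model. The layer count, KV head count, head dimension, and context length below are assumed values for a hypothetical 7B-class model with grouped-query attention, not figures from any specific model card, and the estimate ignores quantization scales, activations, and runtime overhead.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, ignoring scales and zero points."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size; the factor 2 covers one K and one V tensor per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


# Assumed shapes for illustration: 7e9 params at INT4, FP16 KV cache.
weights = weight_bytes(7e9, bits_per_weight=4)
kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=8192)

print(f"weights:  {weights / 1e9:.2f} GB")         # 3.50 GB
print(f"KV cache: {kv / 1e9:.2f} GB")              # 1.07 GB
print(f"total:    {(weights + kv) / 1e9:.2f} GB")  # 4.57 GB
```

Under these assumptions the model and a full 8K-token cache take about 4.6 GB, which leaves headroom on an 8 GB device only if the OS and other apps stay modest; the same model at FP16 weights (14 GB) would not fit.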

中文

趋势 | 说明
7B–8B 端侧甜点 | 质量接近 2023 云端 GPT-3.5;适配 8–16 GB 统一内存
TOPS 营销 → 真实 tok/s | 产业在热限制下公布 sustained tok/s
NPU 多模态 | 小 VLM(3B)与文本 LLM 并存,理解屏幕/相机
AI PC | Copilot+ 类设备,40+ TOPS NPU 成 OEM 认证门槛
车载 cockpit LLM | 本地语音助手;仅 live 交通用云
跨厂商编译器压力 | ONNX + 开放 IR 减锁定;见 2025 NPU 编译器文

四、优缺点 | Pros/Cons

English

Pros

  • Sub-100 ms first-token latency for assistants; works in airplane mode
  • On local paths, user data never leaves the device, which simplifies GDPR/CCPA compliance
  • No marginal cloud cost per query, even at consumer-device scale
  • Always-on wake word plus LLM without draining the battery, provided the NPU is efficient

Cons

  • Model freshness lags the cloud; OTA size is constrained (GB-scale updates annoy users)
  • Quality gap versus frontier cloud models on reasoning and coding
  • Fragmentation: each SoC needs vendor-specific graph tuning
  • Thermal throttling reduces tok/s under sustained load (for example, gaming plus LLM)

中文

优点

  • 助手首 token <100 ms;飞行模式可用
  • 本地路径用户数据不出设备——友好 GDPR/CCPA
  • 规模上零边际云成本(消费设备)
  • 常开唤醒 + LLM,NPU 高效时不拖垮电池

缺点

  • 模型新鲜度落后云端;OTA 体积限制(GB 级更新用户反感)
  • 推理/编码质量仍落后前沿云模型
  • 碎片化:每 SoC 需厂商图调优
  • sustained 负载热节流降 tok/s

五、应用场景 | Use Cases

English

Device | Edge LLM function
Smartphone | Message rewrite, photo search, live caption
Laptop AI PC | Offline code completion, meeting notes
Smart glasses | Visual Q&A on NPU + low-power display
Robot / drone | Mission replanning when link drops
Industrial handheld | Manual Q&A in factories without Wi-Fi
Car IVI | Natural language HVAC, nav, vehicle settings

中文

设备 | 端侧 LLM 功能
智能手机 | 消息改写、照片搜索、实时字幕
AI PC 笔记本 | 离线代码补全、会议纪要
智能眼镜 | NPU 视觉问答 + 低功耗显示
机器人 / 无人机 | 链路断开时任务重规划
工业手持 | 无 Wi-Fi 工厂手册问答
车载 IVI | 自然语言空调、导航、车控

六、GitHub 开源生态 | GitHub

English

Repository | Role
ggml-org/llama.cpp | Cross-platform CPU/NPU-friendly LLM inference
pytorch/executorch | Meta edge runtime for mobile NPU deployment
microsoft/onnxruntime-genai | ONNX Runtime generative AI on NPUs
NVIDIA/Cosmos-Tokenizer | Image and video tokenizers applicable to multimodal stacks

中文

仓库 | 作用
ggml-org/llama.cpp | 跨平台 CPU/NPU 友好 LLM 推理
pytorch/executorch | Meta 移动端 NPU 部署运行时
microsoft/onnxruntime-genai | NPU 上 ONNX 生成式 AI
NVIDIA/Cosmos-Tokenizer | 可用于多模态栈的图像/视频 tokenizer

八、产业观察与深度解读 | Industry Observations and Deep Dive

English

Supply chain and talent: By the second half of 2025, enterprises had stopped treating edge LLM deployment as a pilot KPI and moved it into annual operating plans. Procurement asked for three-year TCO rather than demo accuracy. System integrators packaged reference architectures with SLA-backed support, mirroring how cloud migration matured a decade earlier.

Interoperability: Open standards (MCP, ONNX, and MLIR dialects where relevant) reduced lock-in, but data gravity still tied customers to the platforms with the best vertical corpus or compiler backend. Winners combined open runtimes with proprietary gold datasets or silicon-tuned kernels.

Risk register (2025 common items): (1) Evaluation gap—public benchmarks no longer predict production; (2) Security—prompt injection and tool abuse in agentic stacks; (3) Regulatory—algorithm filing, EU AI Act high-risk categories; (4) Talent—shortage of engineers who understand both ML and domain workflows.

Research frontiers carrying into 2026: Tighter world-model / spatial / sim integration; self-evolving alignment with human audit; cross-chip compilers (see 2026 timeline). Teams that invested in measurement—latency, cost per task, failure replay—outperformed teams chasing parameter counts.

中文

供应链与人才: 2025 年下半年,企业不再将此主题仅作试点 KPI,而是写入年度经营计划。采购要求三年 TCO,而非 demo 准确率。系统集成商打包带 SLA 的参考架构,类似十年前的云迁移成熟路径。

互操作: 开放 API(MCP、ONNX、相关 MLIR dialect)降低锁定,但数据重力仍把客户绑在拥有最佳垂直语料或编译后端的平台上。胜者 = 开放运行时 + 专有 gold 数据或硅片级调优内核。

风险登记(2025 共性): (1) 评估鸿沟——公开 benchmark 不再预测生产;(2) 安全——Agent 栈提示注入与工具滥用;(3) 监管——算法备案、EU AI Act 高风险类;(4) 人才——既懂 ML 又懂领域 workflow 的工程师短缺。

延续至 2026 的研究前沿: 世界模型 / 空间 / 仿真更紧耦合;带人工 audit 的自演化对齐;跨芯片编译器(见 2026 时间线)。投资度量(延迟、单任务成本、失败回放)的团队胜过追逐参数量的团队。

Glossary | 术语表

EN | 中文 | One-line definition
Foundation model | 基础模型 | Large pretrained model finetuned for downstream tasks
Finetune | 微调 | Update weights on domain data
RAG | 检索增强生成 | Retrieve docs, then generate grounded answers
Sim2real | 仿真到真实 | Transfer policies from simulator to physical world
TCO | 总拥有成本 | Full cost of ownership over deployment lifetime

九、实施路线图(2025 Q2–Q4)| Implementation Roadmap

English

Phase | Actions | Success metric
Assess | Inventory data, latency, compliance | Gap report signed by domain lead
Pilot | One workflow, HITL, private eval | >80% task success on golden set
Harden | SLO, monitoring, rollback | p95 latency and cost per task stable for 4 weeks
Scale | Multi-site rollout, train-the-trainer | Adoption without support ticket spike
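
The Harden-phase metric could be checked with something like the sketch below. The roadmap does not define "stable", so the rule used here is an assumption for illustration: every weekly p95 must meet the SLO, and each weekly cost must stay within 10% of the four-week mean. The function names, the nearest-rank percentile method, and both thresholds are hypothetical.

```python
import math
from typing import List, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def harden_gate(weekly_p95_ms: List[float], weekly_cost: List[float],
                slo_ms: float, weeks: int = 4, max_drift: float = 0.10) -> bool:
    """Pass only if the last `weeks` weeks meet the SLO and cost stays near its mean."""
    if len(weekly_p95_ms) < weeks or len(weekly_cost) < weeks:
        return False  # not enough history to call it stable
    recent_p95 = weekly_p95_ms[-weeks:]
    recent_cost = weekly_cost[-weeks:]
    mean_cost = sum(recent_cost) / weeks
    latency_ok = all(v <= slo_ms for v in recent_p95)
    cost_ok = all(abs(c - mean_cost) <= max_drift * mean_cost for c in recent_cost)
    return latency_ok and cost_ok


latencies_ms = [10 * i for i in range(1, 21)]  # 10, 20, ..., 200
print(percentile(latencies_ms, 95))            # 190
print(harden_gate([180, 175, 190, 185], [0.010, 0.011, 0.010, 0.011], slo_ms=200))
```

Gating on p95 rather than the mean keeps a few slow, thermally throttled responses from hiding behind many fast ones.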

Team roles: product owner (workflow), ML engineer (model/compiler), domain expert (gold labels), and SRE (serving). Production needs at least these four roles, not a lone prompt engineer.

中文

阶段 | 行动 | 成功指标
评估 | 清点数据、延迟、合规 | 领域负责人签字差距报告
试点 | 单工作流、HITL、私有 eval | 黄金集任务成功率 >80%
加固 | SLO、监控、回滚 | p95 延迟与单任务成本稳定 4 周
推广 | 多站点、培训 | 支持工单无尖峰

团队角色: 产品负责人(工作流)、ML 工程师(模型/编译器)、领域专家(gold 标注)、SRE(serving)——生产最少四人,非 lone prompt engineer。


总结 | Summary

中文: 2025 年 5 月,端侧 LLM + NPU 是 AI 民主化 的硬件基础:7B 级模型进口袋,编译器与量化决定真实体验。云边协同而非云边二选一,是产业共识。

English: As of May 2025, edge LLM on NPU is the hardware base for democratized AI: 7B-class models fit in your pocket, and compilers and quantization define the real user experience. The consensus is cloud-edge collaboration, not either-or.