LangGraph 深度指南:从图状态机到生产级 Agent(中英文对照)
English Title: LangGraph Deep Dive — From Graph State Machines to Production-Grade Agents
如果你已经用 LangChain 写过 ReAct Agent,却在生产环境遇到 状态丢失、崩溃无法恢复、无法人工审批、循环失控 等问题,LangGraph 就是为此而生的。本文从架构原理到可运行代码,带你系统掌握 LangGraph。
If you’ve built ReAct agents with LangChain but hit state loss, crash recovery gaps, missing human approval, and runaway loops in production, LangGraph was built for exactly these problems. This article covers architecture and runnable code to help you master LangGraph systematically.
1. LangGraph 是什么?| What Is LangGraph?
中文: LangGraph 是 LangChain 团队开发的 有向图状态机运行时,将 Agent 工作流建模为:
- State(状态) — 贯穿全流程的共享数据结构
- Node(节点) — 处理步骤(LLM 调用、工具执行、自定义函数)
- Edge(边) — 节点间的流转逻辑(含条件分支)
English: LangGraph is a directed graph state-machine runtime from the LangChain team. It models agent workflows with:
- State — Shared data structure flowing through the pipeline
- Node — Processing steps (LLM calls, tool execution, custom functions)
- Edge — Transition logic between nodes (including conditional branches)
1 | ┌──────────┐ |
与 LangChain 的 create_agent 相比,LangGraph 是 底层运行时 而非高层封装——你获得完全的控制权,代价是更多样板代码。
Compared to LangChain’s create_agent, LangGraph is a low-level runtime, not a high-level wrapper — you get full control at the cost of more boilerplate.
2. 核心概念详解 | Core Concepts
2.1 State — 状态模式
中文: State 通常用 TypedDict 定义,每个节点读取并返回状态更新。LangGraph 自动合并(merge)各节点的返回值。
English: State is typically defined with TypedDict. Each node reads and returns state updates; LangGraph automatically merges node outputs.
1 | from typing import Annotated, TypedDict |
add_messages 是 LangGraph 内置的 reducer:新消息追加到列表,而非覆盖。
add_messages is a built-in reducer: new messages are appended, not overwritten.
2.2 Node — 节点函数
中文: 节点是普通 Python 函数,接收当前 State,返回 State 更新字典。
English: A node is a plain Python function that receives the current State and returns a state update dict.
1 | def call_llm(state: AgentState) -> dict: |
2.3 Edge — 条件路由
中文: 条件边根据 State 动态决定下一个节点,这是 LangGraph 超越简单 Chain 的关键能力。
English: Conditional edges dynamically choose the next node based on State — LangGraph’s key advantage over simple chains.
1 | def should_continue(state: AgentState) -> str: |
3. 完整示例:带工具调用的 Agent | Full Example
中文: 以下是一个最小可运行的 LangGraph Agent,包含 LLM 推理、工具执行和循环控制。
English: Below is a minimal runnable LangGraph agent with LLM reasoning, tool execution, and loop control.
1 | from langgraph.graph import StateGraph, END |
4. 生产级特性 | Production Features
4.1 Checkpointing — 状态持久化
中文: Checkpointing 将每个状态转换持久化到数据库(SQLite、PostgreSQL 等),实现:
- 崩溃恢复 — 从最后检查点继续执行
- Time-travel Debugging — 回溯任意历史状态
- 多用户会话 — 通过
thread_id隔离不同用户
English: Checkpointing persists each state transition to a database (SQLite, PostgreSQL, etc.), enabling:
- Crash recovery — Resume from the last checkpoint
- Time-travel debugging — Replay any historical state
- Multi-user sessions — Isolate users via
thread_id
1 | from langgraph.checkpoint.sqlite import SqliteSaver |
4.2 Human-in-the-Loop — 人工介入
中文: 在关键节点设置 断点(Breakpoint),图执行到此处暂停,等待人工输入后恢复。
English: Set breakpoints at critical nodes; the graph pauses and resumes after human input.
1 | from langgraph.types import interrupt |
1 | # 编译时设置断点 |
4.3 流式输出 | Streaming
中文: LangGraph 支持 Token 级流式输出,改善用户体验。
English: LangGraph supports token-level streaming for better UX.
1 | for event in app.stream(input_state, config, stream_mode="messages"): |
4.4 子图嵌套 | Sub-graph Composition
中文: 一个完整的图可以作为另一个图的节点,实现模块化复用。
English: A complete graph can become a node in a parent graph for modular reuse.
1 | research_graph = build_research_subgraph() |
5. LangGraph vs LangChain vs CrewAI | Comparison
| 维度 Dimension | LangChain | LangGraph | CrewAI |
|---|---|---|---|
| 抽象层级 Level | 高层 High | 底层 Low | 中高层 Mid-high |
| 状态管理 State | 无原生 No native | 显式持久 Explicit | 内置记忆 Built-in |
| 循环/分支 Loops | 有限 Limited | 原生 Native | 有限 Limited |
| Checkpoint | ❌ | ✅ | ❌ |
| HITL | 中间件 Middleware | 原生 Native | 可选 Optional |
| 上手难度 Learning curve | 低 Low | 高 High | 低 Low |
| 生产就绪 Production | 中 Medium | 高 High | 中 Medium |
| 可观测性 Observability | LangSmith | LangSmith | 第三方 3rd-party |
选型建议 Selection advice:
- 简单 RAG / 单轮工具调用 → LangChain
create_agent - 复杂工作流 / 生产系统 → LangGraph
- 快速多 Agent 原型 → CrewAI
- 高可靠性 API → PydanticAI + LangGraph
6. 与 LangSmith 集成 | LangSmith Integration
中文: LangSmith 为 LangGraph 提供全链路可观测性:
1 | import os |
在 LangSmith 控制台可查看:
- 每个节点的输入/输出
- LLM 调用的 Prompt 和 Response
- 工具执行的参数和返回值
- 端到端延迟和 Token 消耗
English: LangSmith provides full-chain observability for LangGraph. In the console you can inspect per-node I/O, LLM prompts/responses, tool args/results, and end-to-end latency and token usage.
7. 生产部署清单 | Production Checklist
中文:
| 检查项 | 说明 |
|---|---|
| ✅ 终止条件 | step_count 上限、超时、Token 预算 |
| ✅ Checkpoint | 生产环境用 PostgreSQL 而非 SQLite |
| ✅ HITL | 资金/删除/外发操作必经审批 |
| ✅ 工具权限 | 最小权限原则,避免 Agent 越权 |
| ✅ 结构化输出 | 关键节点强制 JSON Schema |
| ✅ 错误处理 | 节点内 try/catch + 图级 fallback 边 |
| ✅ 可观测性 | LangSmith Trace + 告警 |
| ✅ 评估集 | Golden Dataset 回归测试 |
English:
| Check | Description |
|---|---|
| ✅ Termination | step_count cap, timeout, token budget |
| ✅ Checkpoint | PostgreSQL in production, not SQLite |
| ✅ HITL | Human approval for financial/delete/outbound ops |
| ✅ Tool permissions | Least privilege; prevent escalation |
| ✅ Structured output | Enforce JSON Schema at critical nodes |
| ✅ Error handling | try/catch in nodes + graph-level fallback edges |
| ✅ Observability | LangSmith traces + alerts |
| ✅ Evaluation | Golden Dataset regression tests |
8. 常见陷阱 | Common Pitfalls
中文:
- 状态膨胀 —
messages列表无限增长,需定期摘要压缩 - 循环死锁 — 忘记设置
step_count上限,Agent 反复调用同一工具 - Checkpoint 膨胀 — 高频写入导致存储暴涨,需设置保留策略
- 过度设计 — 简单顺序流程不需要 LangGraph,用 LangChain Chain 即可
- 忽略评估 — 没有 Golden Dataset,Prompt 微调后无法验证回归
English:
- State bloat — Unbounded
messages; summarize periodically - Loop deadlock — Missing
step_countcap; agent retries the same tool forever - Checkpoint bloat — High-frequency writes; set retention policies
- Over-engineering — Simple sequential flows don’t need LangGraph
- Skipping evaluation — No Golden Dataset means no regression verification
9. 总结 | Conclusion
中文: LangGraph 的核心价值在于 将 Agent 工作流从黑盒循环变成可审计、可恢复、可介入的状态机。学习曲线虽陡,但对于需要生产级可靠性的系统,这是目前最成熟的开源选择。推荐路径:
- 用 LangChain
create_agent验证业务价值 - 遇到状态/循环/审批需求时迁移到 LangGraph
- 配合 LangSmith 建设可观测性与评估体系
- 复杂 Agent 逻辑用 PydanticAI 保证类型安全
English: LangGraph’s core value is turning agent workflows from opaque loops into auditable, recoverable, interruptible state machines. The learning curve is steep, but for production reliability it remains the most mature open-source choice. Recommended path:
- Validate business value with LangChain
create_agent - Migrate to LangGraph when you need state, loops, or approvals
- Build observability and evaluation with LangSmith
- Use PydanticAI for type-safe complex agent logic
相关阅读 Related reading:LLM Agent 架构全景:LangChain 生态设计与实践