AI 技术编年史 2024:Microsoft GraphRAG 图谱检索
Microsoft GraphRAG 图谱检索 | Microsoft GraphRAG
一、背景与核心概念 | Background and Core Concepts
English
In April 2024, Microsoft Research published the GraphRAG paper, and the code was open-sourced on GitHub that July. GraphRAG addresses a known weakness of naive vector RAG: global and cross-document reasoning. Standard chunk retrieval excels at fact lookup (“What is X?”) but struggles with questions like “What are the main themes across 10,000 earnings calls?” or “How do entities A and B relate across reports?”
GraphRAG builds a knowledge graph from unstructured corpora using LLM extraction, then applies community detection (Leiden algorithm) to cluster related entities. Each community receives an LLM-generated summary. At query time, the system chooses between local search (entity-neighborhood retrieval) and global search (map-reduce over community summaries).
中文
2024 年 4 月,微软研究院发表 GraphRAG 论文,并于同年 7 月在 GitHub 开源代码,针对朴素向量 RAG 的弱点:全局与跨文档推理。标准分块检索擅长事实查询(“X 是什么?”),却难以回答“一万场财报电话会议的主要主题是什么?”或“实体 A 与 B 在各报告中如何关联?”
GraphRAG 用 LLM 从非结构化语料抽取知识图谱,以 Leiden 算法做社区发现,并为每个社区生成 LLM 摘要。查询时可在局部搜索(实体邻域)与全局搜索(社区摘要 map-reduce)间切换。
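| 概念 Concept | 中文说明 | English Description |
|---|---|---|
| Entity Extraction | LLM 抽取实体、关系、声明 | LLM extracts entities, relationships, and claims |
| Community Summary | 社区级语义压缩 | Community-level semantic compression |
| Local Search | 围绕 query 相关实体的子图检索 | Subgraph retrieval around query-relevant entities |
| Global Search | 跨社区摘要聚合回答宏观问题 | Aggregates community summaries to answer corpus-level questions |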
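To make the community idea concrete, here is a minimal sketch in plain Python. It is not the GraphRAG implementation: connected components are used as a simple stand-in for Leiden clustering, `summarize` is a placeholder for the LLM summary call, and the entity names are made-up toy data.

```python
from collections import defaultdict

# Toy extracted relationships (hypothetical data, not from a real corpus).
EDGES = [
    ("Acme", "Project X"),
    ("Project X", "Budget Overrun"),
    ("Globex", "Project Y"),
    ("Project Y", "Supply Risk"),
]


def build_graph(edges):
    """Return an undirected adjacency map from (source, target) pairs."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)
        graph[dst].add(src)
    return graph


def connected_components(graph):
    """Group nodes into components (a simple stand-in for Leiden clustering)."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            members.append(node)
            for nbr in sorted(graph[node]):
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        communities.append(sorted(members))
    return communities


def summarize(community):
    """Placeholder for the LLM call that would write a community summary."""
    return "Community covering: " + ", ".join(community)


communities = connected_components(build_graph(EDGES))
summaries = [summarize(c) for c in communities]
```

Leiden additionally splits a connected graph into densely linked groups at several levels of granularity, which connected components cannot do; the sketch only shows where clustering and summarization sit relative to each other.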
1.1 朴素 RAG 的全局盲区 | Global Blind Spot of Naive RAG
English
Microsoft’s evaluation showed that vector RAG on private datasets handled local fact queries well (“What did Project X budget?”) but did poorly on thematic questions (“What risks appear across all projects?”), because top-k chunk retrieval does not aggregate evidence across documents. GraphRAG addresses this by pre-computing community-level abstractions, trading upfront indexing cost for query-time global reasoning.
中文
微软 benchmark 显示向量 RAG 在局部事实(「项目 X 预算是多少?」)表现好,在主题问题(「所有项目的共同风险?」)表现差——分块缺乏跨文档聚合。GraphRAG 通过预计算社区级抽象解决——以索引成本换查询时全局推理。
二、架构设计 | Architecture
English
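The GraphRAG pipeline has two phases, indexing and query:
Indexing Phase
- Raw documents → text units → LLM graph extraction (entities, relationships, claims) → community detection (Leiden) → community summaries → embeddings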
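The first indexing steps (documents to text units to an extracted graph) can be sketched as follows. This is an illustration under stated assumptions: `split_into_text_units` and `extract_relationships` are hypothetical helpers, the word-window sizes are arbitrary, and the extractor is a co-occurrence stub standing in for the LLM extraction prompt.

```python
from itertools import combinations


def split_into_text_units(text, size=8, overlap=2):
    """Split text into overlapping word windows (the 'text units')."""
    words = text.split()
    step = size - overlap
    units = []
    for start in range(0, len(words), step):
        units.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return units


def extract_relationships(unit, known_entities):
    """Stub for LLM extraction: link known entities that co-occur in a unit."""
    found = sorted(e for e in known_entities if e in unit)
    return list(combinations(found, 2))


# Hypothetical toy document and entity list.
doc = "Acme funds Orion. Orion depends on Vega. Vega is audited by Zenith."
entities = {"Acme", "Orion", "Vega", "Zenith"}

units = split_into_text_units(doc)
edges = set()
for unit in units:
    edges.update(extract_relationships(unit, entities))
```

A real extractor would also return relationship descriptions and claims; the point here is only that the graph is assembled unit by unit, which is why indexing cost grows with the number of text units.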
Query Phase
- Local mode: embed query → retrieve entities → expand subgraph → LLM answer with context
- Global mode: batch community summaries → map step generates scored partial answers → reduce step synthesizes the final answer
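The two query modes can be contrasted with a small sketch. Everything here is illustrative: the graph, the summaries, and the function names are hypothetical, entity matching is plain substring lookup rather than embedding search, and `map_step` scores by word overlap where the real system would call an LLM.

```python
# Hypothetical toy index: an entity graph plus pre-computed community summaries.
GRAPH = {
    "Project X": {"Acme", "Budget Overrun"},
    "Acme": {"Project X"},
    "Budget Overrun": {"Project X"},
    "Project Y": {"Supply Risk"},
    "Supply Risk": {"Project Y"},
}
COMMUNITY_SUMMARIES = [
    "Project X at Acme faces budget overrun risk.",
    "Project Y is exposed to supply risk.",
    "The cafeteria menu changed in March.",
]


def local_search(query, graph, hops=1):
    """Seed with entities named in the query, then expand their neighborhood."""
    frontier = {e for e in graph if e.lower() in query.lower()}
    context = set(frontier)
    for _ in range(hops):
        frontier = {n for e in frontier for n in graph[e]} - context
        context |= frontier
    return sorted(context)


def map_step(query, summary):
    """Stub for the LLM map call: score a summary by query-word overlap."""
    q_words = set(query.lower().replace("?", "").split())
    s_words = set(summary.lower().replace(".", "").split())
    return len(q_words & s_words), summary


def global_search(query, summaries):
    """Map over every summary, drop zero-score partials, then reduce."""
    partials = [map_step(query, s) for s in summaries]
    kept = sorted((p for p in partials if p[0] > 0), reverse=True)
    return " ".join(text for _, text in kept)


local_context = local_search("What did Project X budget?", GRAPH)
global_answer = global_search("What risk appears across all projects?",
                              COMMUNITY_SUMMARIES)
```

Local search touches only the neighborhood of the named entity, while global search reads every community summary, which is the source of its higher latency and cost.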
中文
GraphRAG 分索引与查询两阶段。索引:文档 → 文本单元 → LLM 抽图谱 → 社区发现 → 社区摘要 → 向量化。查询:局部模式扩展实体子图;全局模式对社区摘要做 map-reduce 综合。
2.1 与朴素 RAG 对比 | vs. Naive RAG
| 维度 | Vector RAG | GraphRAG |
|---|---|---|
| 索引结构 | Flat chunks | Graph + communities |
| 擅长问题 | 点对点事实 | 主题、趋势、关联 |
| 索引成本 | 低 | 高(多次 LLM 调用) |
| 可解释性 | 文档引用 | 实体 + 社区路径 |
2.2 配置与调优 | Configuration and Tuning
English
GraphRAG exposes a settings.yaml file for chunk size, entity extraction prompts, community size thresholds, and embedding models. Production teams report that indexing a 10 GB corpus can cost hundreds of dollars in GPT-4 API calls, which pushes them toward a cheaper extractor (GPT-4o-mini) for the first pass and a stronger model for summarization only. Incremental updates remain research-grade; many 2024 deployments re-index nightly rather than perform a true incremental graph merge.
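A back-of-envelope estimate shows why the choice of extraction model dominates the bill. The function below is a rough sketch, not a pricing tool: it assumes one extraction call per chunk, and every number in the example (corpus size in tokens, prompt overhead, output length, per-1k-token prices) is a placeholder rather than a real price list.

```python
def estimate_indexing_cost(corpus_tokens, chunk_tokens, prompt_overhead_tokens,
                           output_tokens_per_chunk, usd_per_1k_input,
                           usd_per_1k_output):
    """Estimate extraction cost: one LLM call per chunk, priced per 1k tokens."""
    n_chunks = -(-corpus_tokens // chunk_tokens)  # ceiling division
    input_tokens = n_chunks * (chunk_tokens + prompt_overhead_tokens)
    output_tokens = n_chunks * output_tokens_per_chunk
    return (input_tokens / 1000 * usd_per_1k_input
            + output_tokens / 1000 * usd_per_1k_output)


# Placeholder figures for illustration only (not actual vendor pricing).
strong = estimate_indexing_cost(10_000_000, 1000, 500, 300, 0.01, 0.03)
cheap = estimate_indexing_cost(10_000_000, 1000, 500, 300, 0.001, 0.002)
```

With these placeholder inputs the stronger model comes to about 240 USD and the cheaper one to about 21 USD for the extraction pass alone; community summarization and embedding calls would add to both figures.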
中文
GraphRAG 通过 settings.yaml 配置分块、抽取 prompt、社区阈值、Embedding 模型。生产团队称 10GB 语料索引或耗数百美元 GPT-4 API 费用,因此常用 GPT-4o-mini 做抽取、强模型仅做摘要。增量更新仍偏研究阶段;2024 年多数部署选择夜间全量重索引,而非真正的增量图合并。
三、产业趋势 | Industry Trends
English
GraphRAG sparked a “Graph + RAG” wave in 2024:
- Vendors integrated graph stores (Neo4j, FalkorDB) with vector search
- LightRAG, KAG (OpenSPG), and RAPTOR offered lighter or hierarchical alternatives
- Enterprises with existing knowledge graphs explored hybrid pipelines
- Research focused on reducing indexing cost via smaller models and incremental graph updates
The trend reflects a broader shift: RAG is not one pattern but a family of retrieval architectures chosen by query type.
中文
GraphRAG 引发 2024 “图谱 + RAG” 浪潮:Neo4j 等与向量检索融合;LightRAG、KAG、RAPTOR 等轻量/层次替代方案;已有知识图谱企业探索混合流水线;研究聚焦降索引成本与增量更新。
趋势表明:RAG 是检索架构族,需按查询类型选型。
四、优缺点分析 | Pros and Cons
4.1 优点 | Advantages
- 全局问答 — 跨 corpus 主题与趋势 / Strong on holistic questions
- 关系推理 — 显式实体边支持多跳 / Explicit relational structure
- 可解释路径 — 子图可视化推理链 / Explainable graph paths
- 开源可复现 — Microsoft 完整 pipeline / Full open pipeline
- 与向量检索互补 — 可 hybrid 部署 / Complements flat RAG
- 社区摘要 — 层次化语义压缩 / Hierarchical semantic compression
4.2 缺点 | Disadvantages
- 索引成本高 — 大量 LLM 调用建图 / Expensive indexing
- 抽取质量依赖 LLM — 错误实体污染图谱 / Extraction errors propagate
- 延迟较高 — 全局 map-reduce 慢 / Global search latency
- 动态更新复杂 — 增量图维护难 / Hard to incrementally update
- 小 corpus 过度设计 — 小数据集收益有限 / Overkill for small datasets
- 配置门槛 — prompt、社区参数需调优 / Tuning complexity
五、典型应用场景 | Use Cases
| 场景 Scenario | 中文说明 | English Description |
|---|---|---|
| 研报综合分析 | 跨数千份报告的主题与风险 | Thematic analysis across research reports |
| 情报与 OSINT | 人物、组织关系网络 | Entity relationship intelligence |
| 合规调查 | 跨邮件/文档关联实体 | Cross-document compliance investigation |
| 医学文献 | 疾病-基因-药物关联 | Biomedical entity linking |
| 企业 M&A | 尽职调查材料全局摘要 | Due diligence corpus summarization |
| 新闻监测 | 事件演化与叙事追踪 | Narrative tracking across news archives |
六、GitHub 与开源生态 | GitHub and Open Source
English
The primary repository is microsoft/graphrag — Python package with CLI, configurable prompts, and Azure/OpenAI integration.
Related projects:
- Neo4j LLM Graph Builder: visual graph construction
- LightRAG: simplified graph-augmented retrieval
- LangChain GraphRAG integrations: community adapters
中文
主仓库 microsoft/graphrag 提供 Python CLI、可配置 prompt 与 Azure/OpenAI 集成。相关:Neo4j LLM Graph Builder、LightRAG、LangChain 适配器。
| 仓库 | 说明 |
|---|---|
| microsoft/graphrag | 官方 GraphRAG 实现 |
| neo4j-labs/llm-graph-builder | 可视化建图 |
| HKUDS/LightRAG | 轻量图谱 RAG |
安装示例 / Quick Start
pip install graphrag
七、参考链接 | References
- GraphRAG 论文 / 项目页:microsoft.github.io/graphrag
- GitHub:github.com/microsoft/graphrag
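- From Local to Global: A Graph RAG Approach to Query-Focused Summarization (Edge et al., Microsoft Research, 2024)
- Leiden 社区发现算法:Traag, Waltman, and van Eck, "From Louvain to Leiden: guaranteeing well-connected communities" (2019)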
- Neo4j GraphRAG 集成文档
八、2025 展望 | Outlook for 2025
English
GraphRAG indexing costs are expected to drop through smaller extractors, cached entity graphs, and incremental community updates; Microsoft and competitors already ship lighter variants that run global search only on demand. Expect fusion with vector RAG behind a single query router: factual questions go to local vector retrieval, thematic questions to global graph search. Knowledge graph vendors (Neo4j, FalkorDB) are embedding GraphRAG patterns natively. For engineers: prototype GraphRAG on high-value corpora (research, legal) where global questions justify the indexing cost, and keep vector RAG for FAQ-style support.
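A query router of this kind could start as simply as the sketch below. It is a deliberately naive illustration: the cue list, the route names, and `route_query` are all hypothetical, and a production router would more likely use a classifier or an LLM call than keyword matching.

```python
# Hypothetical cue phrases that suggest a corpus-wide, thematic question.
THEMATIC_CUES = ("themes", "trends", "across", "overall", "summarize", "main risks")


def route_query(query):
    """Return 'global_graph' for thematic questions, else 'local_vector'."""
    text = query.lower()
    if any(cue in text for cue in THEMATIC_CUES):
        return "global_graph"
    return "local_vector"


thematic_route = route_query("What are the main themes across all reports?")
factual_route = route_query("What is the refund policy?")
```

Keeping the router separate from both retrievers means the cheaper vector path stays the default and the graph path is paid for only when the question calls for it.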
中文
GraphRAG 索引成本将通过小抽取模型、实体图缓存、增量社区更新下降——微软与竞品已推轻量变体。预期与向量 RAG 统一 query 路由:事实→局部向量;主题→全局图。Neo4j、FalkorDB 等原生嵌入 GraphRAG。工程师:在高价值语料(研究、法律)上原型 GraphRAG——全局问题 justify 索引成本;FAQ 支持仍用向量 RAG。
English Summary: GraphRAG extended RAG from chunk lookup to graph-aware global reasoning — a framework milestone for enterprise knowledge systems in 2024.
中文总结:GraphRAG 将 RAG 从分块查表扩展为图谱感知的全局推理——2024 企业知识系统的重要框架里程碑。