AI Technology Chronicle 2024: Microsoft GraphRAG and Graph-Based Retrieval



1. Background and Core Concepts

Microsoft Research described GraphRAG in an April 2024 paper and open-sourced the implementation in July 2024. It targets a known weakness of naive vector RAG: global and cross-document reasoning. Standard chunk retrieval excels at fact lookup (“What is X?”) but struggles with questions such as “What are the main themes across 10,000 earnings calls?” or “How do entities A and B relate across reports?”

GraphRAG uses an LLM to extract entities and relationships from an unstructured corpus into a knowledge graph, then applies community detection (the Leiden algorithm) to cluster related entities into a hierarchy of communities. Each community receives an LLM-generated summary. At query time, the caller selects either local search (retrieval around the entities relevant to the query) or global search (map-reduce over community summaries).

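| Concept | Description |
| --- | --- |
| Entity Extraction | An LLM extracts entities, relationships, and claims |
| Community Summary | Community-level semantic compression |
| Local Search | Subgraph retrieval around the entities relevant to the query |
| Global Search | Aggregation over community summaries to answer corpus-wide questions |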
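The concepts above can be made concrete with a toy data model. The sketch below is illustrative only: the triples are hand-written instead of LLM-extracted, all names are hypothetical, and connected components stand in for Leiden community detection (a real pipeline would use a Leiden implementation, which yields finer, hierarchical clusters).

```python
from collections import defaultdict

# Hypothetical (source, relation, target) triples. In GraphRAG these would
# come from LLM extraction over text units, not from a hand-written list.
TRIPLES = [
    ("Acme Corp", "acquired", "Beta Labs"),
    ("Beta Labs", "develops", "Widget X"),
    ("Gamma Bank", "finances", "Delta Mining"),
]


def build_graph(triples):
    """Build an undirected adjacency map: entity -> set of neighbours."""
    graph = defaultdict(set)
    for source, _relation, target in triples:
        graph[source].add(target)
        graph[target].add(source)
    return graph


def connected_components(graph):
    """Group entities into components (a crude stand-in for Leiden)."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            members.append(node)
            for neighbour in sorted(graph[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        communities.append(sorted(members))
    return communities


graph = build_graph(TRIPLES)
communities = connected_components(graph)
for index, members in enumerate(communities):
    print(f"community {index}: {members}")
```

Each resulting group is the unit that would receive an LLM-written community summary in the indexing phase.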

1.1 Global Blind Spot of Naive RAG

Microsoft’s evaluation on private datasets found that vector RAG handles local fact queries well (such as “What was the budget for Project X?”) but does poorly on thematic questions (such as “What risks appear across all projects?”), because retrieving a handful of chunks provides no cross-document aggregation. GraphRAG addresses this by precomputing community-level abstractions, trading upfront indexing cost for global reasoning at query time.


2. Architecture

The GraphRAG pipeline has two phases: indexing and query.

Indexing Phase

Raw Documents
      ↓
Text Units (chunks)
      ↓
LLM Entity/Relationship Extraction → Knowledge Graph
      ↓
Community Detection (Leiden) → Hierarchical Communities
      ↓
LLM Summarization per Community → Summary Index
      ↓
Vector Embeddings (entities, text units, summaries)
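The first indexing step, splitting documents into overlapping text units, can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace-separated words stand in for real tokenizer output, and the function name, `size`, and `overlap` values are hypothetical, not GraphRAG defaults.

```python
def split_into_text_units(tokens, size=300, overlap=50):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    stride = size - overlap
    units = []
    for start in range(0, len(tokens), stride):
        units.append(tokens[start:start + size])
        # Stop once the current window already reaches the end of the input.
        if start + size >= len(tokens):
            break
    return units


# Whitespace splitting is only a stand-in for a real tokenizer.
words = "GraphRAG builds a knowledge graph from text units before clustering".split()
for unit in split_into_text_units(words, size=4, overlap=1):
    print(unit)
```

The overlap keeps an entity mention that straddles a boundary visible in at least one unit, at the cost of some repeated LLM input.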

Query Phase

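  • Local mode: embed query → retrieve matching entities → expand their subgraph → LLM answers with that context
  • Global mode: take community summaries at a chosen hierarchy level → map (the LLM drafts a scored partial answer per batch of summaries) → reduce (the LLM merges the top-scored partial answers)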
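The map-reduce shape of global mode can be sketched in a few lines. This is a toy under stated assumptions: the LLM calls are replaced by a deterministic word-overlap score so the example runs offline, and every function name here is hypothetical instead of being part of the GraphRAG API.

```python
def map_step(question, summary):
    """Stand-in for an LLM call: score one summary against the question."""
    question_words = set(question.lower().split())
    summary_words = set(summary.lower().split())
    return {"score": len(question_words & summary_words), "point": summary}


def reduce_step(partials, top_n=2):
    """Drop zero-score partial answers, keep the best few, and merge them."""
    useful = [p for p in partials if p["score"] > 0]
    useful.sort(key=lambda p: p["score"], reverse=True)
    return " | ".join(p["point"] for p in useful[:top_n])


def global_search(question, community_summaries, top_n=2):
    """Map over every community summary, then reduce to one answer."""
    partials = [map_step(question, s) for s in community_summaries]
    return reduce_step(partials, top_n)


summaries = [
    "supply chain risks affect several projects",
    "marketing budget grew in 2023",
    "regulatory risks delay projects in europe",
]
answer = global_search("which risks affect projects", summaries)
print(answer)
```

Because the map step touches every summary at the chosen level, cost and latency grow with the number of communities, which is why global mode is the slower of the two.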

The two phases at a glance:

┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│  Documents  │ ──→ │  KG + Graph  │ ──→ │ Community Index │
└─────────────┘     └──────────────┘     └─────────────────┘
                           │                      │
                     Local Search          Global Search
                           └──────────┬───────────┘
                                      ↓
                                LLM Response
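The subgraph expansion behind local search can be sketched as a hop-limited breadth-first traversal. This is an illustration under stated assumptions: the seed entities are given directly (a real system would pick them by embedding similarity to the query), and the graph, names, and `max_hops` parameter are hypothetical.

```python
from collections import deque


def expand_subgraph(graph, seeds, max_hops=1):
    """Return {entity: hop distance} for entities within max_hops of a seed."""
    distance = {seed: 0 for seed in seeds if seed in graph}
    queue = deque(distance)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


# Hypothetical entity graph as an adjacency map.
toy_graph = {
    "Acme Corp": {"Beta Labs"},
    "Beta Labs": {"Acme Corp", "Widget X"},
    "Widget X": {"Beta Labs"},
    "Gamma Bank": set(),
}
one_hop = expand_subgraph(toy_graph, ["Acme Corp"], max_hops=1)
two_hops = expand_subgraph(toy_graph, ["Acme Corp"], max_hops=2)
print(one_hop)
print(two_hops)
```

The entities returned, together with their relationships and source text units, would form the context passed to the LLM.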

2.1 Comparison with Naive RAG

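| Dimension | Vector RAG | GraphRAG |
| --- | --- | --- |
| Index structure | Flat chunks | Graph + communities |
| Best suited for | Point-lookup facts | Themes, trends, relationships |
| Indexing cost | Low (embedding only) | High (many LLM calls) |
| Explainability | Document citations | Entity and community paths |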

2.2 Configuration and Tuning

GraphRAG exposes a settings.yaml file for chunk size, entity extraction prompts, community size thresholds, and embedding models. Production teams report that indexing a 10 GB corpus can cost hundreds of dollars in GPT-4 API calls, which pushes many of them to use a cheaper extractor (GPT-4o-mini) for the first pass and reserve stronger models for summarization. Incremental updates remain research-grade; many 2024 deployments re-index nightly instead of merging changes into the graph incrementally.
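A back-of-the-envelope estimate helps when weighing extractor choices. The sketch below is only an assumption-laden formula, not a GraphRAG utility: it covers a single extraction pass (no community summarization or embedding cost), and every number in the example call, including the per-million-token prices, is a placeholder to be replaced with real values.

```python
import math


def estimate_extraction_cost(corpus_tokens, chunk_size, overlap,
                             prompt_overhead, output_per_chunk,
                             usd_per_m_input, usd_per_m_output):
    """Return (chunk count, estimated USD) for one LLM extraction pass."""
    stride = chunk_size - overlap
    n_chunks = max(1, math.ceil((corpus_tokens - overlap) / stride))
    # Each call sends the chunk plus the extraction prompt; this slightly
    # overestimates because the final chunk may be shorter than chunk_size.
    input_tokens = n_chunks * (chunk_size + prompt_overhead)
    output_tokens = n_chunks * output_per_chunk
    cost = (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000
    return n_chunks, cost


# Placeholder inputs: a 1M-token corpus and made-up prices.
n_chunks, cost = estimate_extraction_cost(
    corpus_tokens=1_000_000, chunk_size=1200, overlap=100,
    prompt_overhead=1500, output_per_chunk=500,
    usd_per_m_input=1.0, usd_per_m_output=2.0,
)
print(n_chunks, round(cost, 4))
```

Running the same call with two price pairs gives a quick comparison between a cheap first-pass extractor and a stronger model.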


3. Industry Impact and Trends

GraphRAG sparked a “Graph + RAG” wave in 2024:

  • Vendors integrated graph stores (Neo4j, FalkorDB) with vector search
  • LightRAG, KAG (OpenSPG), and RAPTOR offered lighter or hierarchical alternatives
  • Enterprises with existing knowledge graphs explored hybrid pipelines
  • Research focused on reducing indexing cost via smaller models and incremental graph updates

The trend reflects a broader shift: RAG is not a single pattern but a family of retrieval architectures, each chosen to match the query type.


4. Pros and Cons

4.1 Advantages

  1. Global question answering: captures themes and trends across the whole corpus
  2. Relational reasoning: explicit entity edges support multi-hop questions
  3. Explainable paths: the retrieved subgraph can be visualized as a reasoning chain
  4. Open source and reproducible: Microsoft publishes the full pipeline
  5. Complements vector retrieval: can be deployed in a hybrid setup
  6. Community summaries: hierarchical semantic compression

4.2 Disadvantages

  1. High indexing cost: building the graph takes many LLM calls
  2. Extraction quality depends on the LLM: wrong entities pollute the graph and the errors propagate
  3. Higher latency: global map-reduce search is slow
  4. Complex dynamic updates: incremental graph maintenance is hard
  5. Overkill for small corpora: limited benefit on small datasets
  6. Configuration barrier: prompts and community parameters need tuning

5. Use Cases

| Scenario | Description |
| --- | --- |
| Research report synthesis | Thematic and risk analysis across thousands of research reports |
| Intelligence and OSINT | Relationship networks among people and organizations |
| Compliance investigation | Linking entities across emails and documents |
| Medical literature | Disease-gene-drug associations |
| Corporate M&A | Global summarization of due diligence materials |
| News monitoring | Tracking how events and narratives evolve across news archives |

6. GitHub and Open Source

The primary repository is microsoft/graphrag, a Python package with a CLI, configurable prompts, and Azure/OpenAI integration.

Related projects:

  • Neo4j LLM Graph Builder: visual graph construction
  • LightRAG: simplified graph-augmented retrieval
  • LangChain GraphRAG integrations: community adapters

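| Repository | Description |
| --- | --- |
| microsoft/graphrag | Official GraphRAG implementation |
| neo4j-labs/llm-graph-builder | Visual graph construction |
| HKUDS/LightRAG | Lightweight graph RAG |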

Quick Start

pip install graphrag
graphrag init --root ./ragtest
graphrag index --root ./ragtest
graphrag query --root ./ragtest --method global --query "What are the top themes?"

7. References


8. Outlook for 2025

GraphRAG indexing costs are expected to drop through smaller extraction models, cached entity graphs, and incremental community updates; Microsoft and competitors already ship lighter variants (for example, running global search only on demand). Expect GraphRAG and vector RAG to sit behind a single query router: factual questions go to local vector retrieval, thematic questions go to global graph search. Knowledge graph vendors (Neo4j, FalkorDB) are embedding GraphRAG patterns natively. For engineers: prototype GraphRAG on high-value corpora (research, legal) where global questions justify the indexing cost, and keep vector RAG for FAQ-style support.
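The query router idea can be prototyped with a simple heuristic before investing in a learned classifier. The sketch below is a hypothetical illustration: the cue words, route names, and function are assumptions made for this example, and a production router would more likely use an LLM or a trained classifier.

```python
import re

# Hypothetical cue words that suggest a corpus-wide, thematic question.
GLOBAL_CUES = {"theme", "themes", "trend", "trends", "overall",
               "across", "summarize", "common"}


def route_query(question):
    """Return 'global_graph' for thematic questions, else 'local_vector'."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    return "global_graph" if words & GLOBAL_CUES else "local_vector"


thematic = route_query("What are the main themes across all reports?")
factual = route_query("What was the budget for Project X?")
print(thematic, factual)
```

A keyword list like this misroutes paraphrased questions, so it is best treated as a baseline to measure a better router against.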


Summary: GraphRAG extended RAG from chunk lookup to graph-aware global reasoning, a framework milestone for enterprise knowledge systems in 2024.