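AI Technology Chronicle 2024: Microsoft GraphRAG Graph-Based Retrieval

Published 2024-03-10 | Category: framework

Microsoft GraphRAG, released in 2024, combines a knowledge graph, community summaries, and global queries to address the limits of naive RAG in cross-document reasoning.

Microsoft GraphRAG: Graph-Based Retrieval

1. Background and Core Concepts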

Microsoft Research presented GraphRAG in an April 2024 paper and released the open-source implementation in July 2024. It targets a known weakness of naive vector RAG: global and cross-document reasoning. Standard chunk retrieval excels at fact lookup (“What is X?”) but struggles with questions like “What are the main themes across 10,000 earnings calls?” or “How do entities A and B relate across reports?”

GraphRAG uses an LLM to extract a knowledge graph (entities, relationships, claims) from an unstructured corpus, then applies community detection (the Leiden algorithm) to cluster related entities. Each community receives an LLM-generated summary. At query time, a question is answered with either local search (entity-neighborhood retrieval) or global search (map-reduce over community summaries).

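Concept	Description
Entity Extraction	LLM extracts entities, relationships, and claims
Community Summary	Semantic compression at the community level
Local Search	Subgraph retrieval around the entities relevant to the query
Global Search	Aggregation across community summaries to answer corpus-level questions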
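
The concepts in the table can be illustrated with a minimal sketch in plain Python. Everything here is an assumption made for illustration: the triples are hand-written stand-ins for LLM extraction output, connected components serve as a crude stand-in for the Leiden algorithm (they are not what GraphRAG uses), and `summarize` is a hypothetical placeholder for an LLM summarization call.

```python
from collections import defaultdict

# Hand-written (subject, relation, object) triples standing in for LLM extraction output.
TRIPLES = [
    ("Acme", "acquired", "Bolt"),
    ("Bolt", "supplies", "Cargo"),
    ("Delta", "partners_with", "Echo"),
]

def build_graph(triples):
    """Return an undirected adjacency map: entity -> set of neighboring entities."""
    graph = defaultdict(set)
    for subj, _rel, obj in triples:
        graph[subj].add(obj)
        graph[obj].add(subj)
    return graph

def detect_communities(graph):
    """Group entities by connected component (a crude stand-in for Leiden)."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph[node] - members)
        seen |= members
        communities.append(sorted(members))
    return communities

def summarize(members, triples):
    """Placeholder for the LLM summary: list the relations inside the community."""
    facts = [f"{s} {r} {o}" for s, r, o in triples if s in members and o in members]
    return "; ".join(facts)

graph = build_graph(TRIPLES)
communities = detect_communities(graph)
summaries = {tuple(c): summarize(set(c), TRIPLES) for c in communities}
for members, text in summaries.items():
    print(members, "->", text)
```

The point of the sketch is the shape of the index: one summary per community, so a later global query can read a handful of summaries instead of every chunk.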

1.1 Global Blind Spot of Naive RAG

Microsoft’s benchmark showed that vector RAG on private datasets scored well on local fact queries (“What did Project X budget?”) but poorly on thematic questions (“What risks appear across all projects?”), because retrieving a handful of chunks cannot aggregate evidence spread across documents. GraphRAG addresses this by pre-computing community-level abstractions, trading upfront indexing cost for global reasoning at query time.
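
A toy example may help show why top-k chunk retrieval has this blind spot. The sketch below assumes a bag-of-words similarity in place of a real embedding model, and the four chunks are invented for illustration. The fact query finds its chunk, but the thematic query still receives only k chunks even though the answer depends on all four.

```python
import math
import re
from collections import Counter

# Invented chunks, one per project.
CHUNKS = [
    "Project A budget is 2 million; supplier delays are a risk",
    "Project B budget is 5 million; supplier delays are a risk",
    "Project C budget is 1 million; supplier delays are a risk",
    "Project D budget is 3 million; staffing gaps are a risk",
]

def embed(text):
    """Toy bag-of-words 'embedding' (an assumption; real systems use a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

fact_hits = top_k("What is the Project B budget?", CHUNKS)
theme_hits = top_k("What risk is shared across every project?", CHUNKS)

print("fact query, best chunk:", fact_hits[0])
print(f"thematic query sees {len(theme_hits)} of {len(CHUNKS)} chunks")
```

Raising k only postpones the problem: on a large corpus the context window fills long before every relevant chunk is included.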

2. Architecture

The GraphRAG pipeline has two phases, indexing and query:

Indexing Phase

Raw Documents
    ↓
Text Units (chunks)
    ↓
LLM Entity/Relationship Extraction → Knowledge Graph
    ↓
Community Detection (Leiden) → Hierarchical Communities
    ↓
LLM Summarization per Community → Summary Index
    ↓
Vector Embeddings (entities, text units, summaries)

Query Phase

Local mode: embed query → retrieve matching entities → expand their subgraph → LLM answer with that context
Global mode: split community summaries into batches → map the query over each batch to get scored partial answers → reduce the top-scoring partial answers into a final response
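
Local mode can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not GraphRAG's implementation: the edge list is invented, entity matching uses keyword lookup instead of embedding similarity, and the function names (`match_entities`, `expand_subgraph`, `build_context`) are hypothetical.

```python
from collections import defaultdict

# Toy graph: (source, relation, target) edges, assumed to come from the indexing phase.
EDGES = [
    ("Acme", "acquired", "Bolt"),
    ("Bolt", "supplies", "Cargo"),
    ("Cargo", "ships_to", "Delta"),
    ("Echo", "competes_with", "Foxtrot"),
]

def neighbors_index(edges):
    """Map each entity to the edges it takes part in."""
    index = defaultdict(list)
    for src, rel, dst in edges:
        index[src].append((src, rel, dst))
        index[dst].append((src, rel, dst))
    return index

def match_entities(query, index):
    """Pick entities named in the query (stand-in for embedding similarity)."""
    words = {w.strip("?.,").lower() for w in query.split()}
    return [e for e in sorted(index) if e.lower() in words]

def expand_subgraph(seeds, index, hops=1):
    """Collect every edge within `hops` steps of the seed entities."""
    frontier, visited, edges = set(seeds), set(seeds), []
    for _ in range(hops):
        next_frontier = set()
        for entity in sorted(frontier):
            for edge in index[entity]:
                if edge not in edges:
                    edges.append(edge)
                for node in (edge[0], edge[2]):
                    if node not in visited:
                        visited.add(node)
                        next_frontier.add(node)
        frontier = next_frontier
    return edges

def build_context(query, edges):
    """Assemble the text context that would be passed to the LLM."""
    index = neighbors_index(edges)
    seeds = match_entities(query, index)
    facts = expand_subgraph(seeds, index, hops=1)
    return "\n".join(f"{s} {r} {t}" for s, r, t in facts)

context = build_context("Who supplies Cargo?", EDGES)
print(context)
```

Only the neighborhood of the matched entity reaches the prompt, which keeps local answers cheap but also explains why local mode cannot answer corpus-wide questions.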

┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│  Documents  │ ──→ │  KG + Graph  │ ──→ │ Community Index │
└─────────────┘     └──────────────┘     └─────────────────┘
                            │                      │
                     Local Search            Global Search
                            └──────────┬───────────┘
                                       ↓
                                  LLM Response
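
The global branch of the diagram follows a map-reduce shape, sketched below in plain Python. The summaries are invented, and `map_step` scores by keyword overlap purely as a stand-in; in GraphRAG the partial answers and their scores would come from LLM calls. All function names are hypothetical.

```python
import re

# Community summaries assumed to come from the indexing phase.
SUMMARIES = [
    "Logistics cluster: repeated supplier delays threaten delivery dates.",
    "Finance cluster: budget overruns reported in three projects.",
    "HR cluster: annual picnic planning and office seating.",
    "Vendor cluster: supplier contracts lack penalty clauses for delays.",
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def map_step(query, batch):
    """Stand-in for one LLM call: return (score, point) pairs for a batch."""
    terms = set(tokenize(query))
    points = []
    for summary in batch:
        score = len(terms & set(tokenize(summary)))
        if score:  # drop summaries with nothing relevant
            points.append((score, summary))
    return points

def reduce_step(query, points, max_points=2):
    """Stand-in for the final LLM call: keep the best points and join them."""
    ranked = sorted(points, key=lambda p: p[0], reverse=True)[:max_points]
    return f"Q: {query}\n" + "\n".join(f"- {text}" for _score, text in ranked)

def global_search(query, summaries, batch_size=2):
    points = []
    for batch in batches(summaries, batch_size):
        points.extend(map_step(query, batch))
    return reduce_step(query, points)

answer = global_search("Which supplier delays recur?", SUMMARIES)
print(answer)
```

Because every batch is processed independently, the map step parallelizes well, but the number of LLM calls grows with the number of community summaries, which is where the global-search latency comes from.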

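2.1 Comparison with Naive RAG

Dimension	Vector RAG	GraphRAG
Index structure	Flat chunks	Graph + communities
Best suited to	Point lookups of individual facts	Themes, trends, relationships
Indexing cost	Low	High (many LLM calls)
Explainability	Document citations	Entity + community paths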

2.2 Configuration and Tuning

GraphRAG exposes a settings.yaml file for chunk size, entity extraction prompts, community size thresholds, and embedding models. Production teams report that indexing a 10 GB corpus can cost hundreds of dollars in GPT-4 API calls, which has pushed many of them to use a cheaper model (GPT-4o-mini) for extraction and reserve a stronger model for summarization. Incremental updates remain research-grade; many 2024 deployments re-index the full corpus nightly rather than merging changes into the graph incrementally.
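
The cost trade-off between a cheap and a strong model can be estimated with back-of-the-envelope arithmetic. The sketch below is only that: every default (bytes per token, prices, output ratio, number of passes) is a hypothetical placeholder, not a measured GraphRAG figure or a quoted vendor price, and should be replaced with real numbers before drawing conclusions.

```python
def estimate_indexing_cost(
    corpus_bytes: int,
    bytes_per_token: float = 4.0,      # rough rule of thumb for English text (assumption)
    input_price_per_m: float = 0.15,   # hypothetical USD per 1M input tokens
    output_price_per_m: float = 0.60,  # hypothetical USD per 1M output tokens
    output_ratio: float = 0.25,        # assumed output tokens per input token
    passes: float = 1.0,               # how many times each token is sent to the LLM
) -> float:
    """Return a rough USD estimate for LLM calls during indexing."""
    input_tokens = corpus_bytes / bytes_per_token * passes
    output_tokens = input_tokens * output_ratio
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000

corpus = 100 * 1024**2  # a 100 MiB corpus, chosen only for illustration
cheap = estimate_indexing_cost(corpus)
strong = estimate_indexing_cost(corpus, input_price_per_m=10.0, output_price_per_m=30.0)
print(f"cheap extractor: ${cheap:.2f}, strong model: ${strong:.2f}")
```

Under these placeholder numbers the gap between the two models is large enough to explain why teams split the work: extraction touches every chunk, while summarization touches only the much smaller set of communities.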

3. Industry Trends

GraphRAG sparked a “Graph + RAG” wave in 2024:

Vendors integrated graph stores (Neo4j, FalkorDB) with vector search
LightRAG, KAG (OpenSPG), and RAPTOR offered lighter or hierarchical alternatives
Enterprises with existing knowledge graphs explored hybrid pipelines
Research focused on reducing indexing cost via smaller models and incremental graph updates

The trend reflects a broader shift: RAG is not one pattern but a family of retrieval architectures, and the right one depends on the query type.

4. Pros and Cons

4.1 Advantages

Global question answering: themes and trends across the whole corpus
Relational reasoning: explicit entity edges support multi-hop questions
Explainable paths: the retrieved subgraph can be visualized as a reasoning chain
Open and reproducible: Microsoft publishes the full pipeline
Complements vector retrieval: can be deployed in a hybrid setup
Community summaries: hierarchical semantic compression

4.2 Disadvantages

High indexing cost: building the graph takes many LLM calls
Extraction quality depends on the LLM: wrong entities pollute the graph and the errors propagate
Higher latency: global map-reduce is slow
Complex dynamic updates: incremental graph maintenance is hard
Overkill for small corpora: limited benefit on small datasets
Configuration barrier: prompts and community parameters need tuning

5. Use Cases

Scenario	Description
Research report synthesis	Thematic and risk analysis across thousands of research reports
Intelligence and OSINT	Relationship networks among people and organizations
Compliance investigation	Linking entities across emails and documents
Medical literature	Disease-gene-drug associations (biomedical entity linking)
Enterprise M&A	Corpus-wide summarization of due diligence materials
News monitoring	Tracking event evolution and narratives across news archives

6. GitHub and Open Source

The primary repository is microsoft/graphrag, a Python package with a CLI, configurable prompts, and Azure/OpenAI integration.

Related projects:

Neo4j LLM Graph Builder: visual graph construction
LightRAG: simplified graph-augmented retrieval
LangChain GraphRAG integrations: community adapters

Repository	Description
microsoft/graphrag	Official GraphRAG implementation
neo4j-labs/llm-graph-builder	Visual graph construction
HKUDS/LightRAG	Lightweight graph RAG

Quick Start

pip install graphrag
graphrag init --root ./ragtest
# add source documents under ./ragtest/input before indexing
graphrag index --root ./ragtest
graphrag query --root ./ragtest --method global --query "What are the top themes?"

7. References

GraphRAG paper / project page: microsoft.github.io/graphrag
GitHub: github.com/microsoft/graphrag
From Local to Global: A Graph RAG Approach to Query-Focused Summarization (Microsoft Research)
Leiden community detection algorithm: Traag et al.
Neo4j GraphRAG integration documentation

8. Outlook for 2025

GraphRAG indexing costs should drop through smaller extraction models, cached entity graphs, and incremental community updates; Microsoft and competitors already ship lighter variants (for example, running Global Search only on demand). Expect fusion with vector RAG behind a single query router: factual questions go to local vector retrieval, thematic questions to global graph search. Knowledge graph vendors (Neo4j, FalkorDB) are embedding GraphRAG patterns natively. For engineers: prototype GraphRAG on high-value corpora (research, legal) where global questions justify the indexing cost, and keep vector RAG for FAQ-style support.
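
The routing idea (factual to local vector, thematic to global graph) can be sketched with a keyword heuristic. This is a hypothetical illustration: the cue list and the route names `local_vector` and `global_graph` are invented here, and a production router would more likely use an LLM or a trained classifier.

```python
import re

# Hypothetical cue words that suggest a corpus-wide, thematic question.
THEMATIC_CUES = {
    "theme", "themes", "trend", "trends", "overall",
    "across", "common", "summarize", "main",
}

def route(query: str) -> str:
    """Return 'global_graph' for corpus-wide questions, else 'local_vector'."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    return "global_graph" if words & THEMATIC_CUES else "local_vector"

for q in ("What is the Project X budget?",
          "What are the main themes across all reports?"):
    print(f"{route(q):13s} <- {q}")
```

Routing before retrieval keeps the expensive global path off the hot path for FAQ-style traffic, which is the cost argument made above.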

Summary: GraphRAG extended RAG from chunk lookup to graph-aware global reasoning, a framework milestone for enterprise knowledge systems in 2024.
