Academic QA System with GraphRAG


Overview

This tutorial demonstrates how to build a QA system that makes better use of academic paper content by using GraphRAG.

GraphRAG is a novel system introduced by Microsoft that utilizes a graph to extract both local and global information from text, providing more contextually rich answers.

However, Microsoft’s official GraphRAG implementation is not integrated with LangChain, which makes it difficult to use in LangChain-based pipelines.

To solve this, we use langchain-graphrag, which allows us to implement GraphRAG within LangChain.

In this tutorial, we’ll learn how to build a QA system for the latest AI papers using langchain-graphrag.

GraphRAG

Source: arXiv, "From Local to Global: A Graph RAG Approach to Query-Focused Summarization"



Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup helpers, functions, and utilities for tutorials.

  • You can check out the langchain-opentutorial for more details.

You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.
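As a rough illustration of the .env approach, the sketch below reads KEY=VALUE lines into os.environ using only the standard library. The helper name load_env_file is hypothetical; in practice a package such as python-dotenv is the more common choice, and this is only a minimal stand-in for it.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env", override: bool = False) -> dict:
    """Read KEY=VALUE lines from a .env file into os.environ.

    Hypothetical stdlib-only helper: it handles comments, blank lines,
    and optional surrounding quotes, nothing more.
    Returns the keys that were actually set.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")
        # Keep values that are already set unless override is requested.
        if override or key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded
```

Calling load_env_file() once at the top of the notebook is enough; keys that were already exported in the shell are left untouched by default.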

Download and Load arXiv PDFs

In this tutorial, we will use arXiv data. arXiv is an online archive for the latest research papers, all available in PDF format. There is an official GitHub repository containing all PDFs, but it is about 1 TB in total size and can only be downloaded from AWS. We will therefore download only a few selected PDF files.

  • Link to the full dataset: https://github.com/mattbierbaum/arxiv-public-datasets
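Downloading a handful of papers can be sketched as follows. This is a minimal example under assumptions: the function names and the data directory are hypothetical, and it relies on arXiv serving PDFs at https://arxiv.org/pdf/<id>. Extracting text from the saved PDFs still needs a PDF loader, which is not shown here.

```python
import urllib.request
from pathlib import Path

# Assumed URL pattern for arXiv PDFs.
ARXIV_PDF_URL = "https://arxiv.org/pdf/{arxiv_id}"


def arxiv_pdf_url(arxiv_id: str) -> str:
    """Build the PDF URL for an arXiv identifier such as '2404.16130'."""
    return ARXIV_PDF_URL.format(arxiv_id=arxiv_id.strip())


def download_arxiv_pdfs(arxiv_ids, target_dir="data"):
    """Download each paper once and return the local file paths."""
    out_dir = Path(target_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for arxiv_id in arxiv_ids:
        arxiv_id = arxiv_id.strip()
        # Old-style ids contain a slash, which is not valid in a file name.
        pdf_path = out_dir / f"{arxiv_id.replace('/', '_')}.pdf"
        if not pdf_path.exists():  # skip files that are already cached
            urllib.request.urlretrieve(arxiv_pdf_url(arxiv_id), str(pdf_path))
        paths.append(pdf_path)
    return paths
```

For example, download_arxiv_pdfs(["2404.16130"]) would fetch the GraphRAG paper itself. Files already present in the target directory are not downloaded again.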

Text Chunking and Text Extracting

In this step, we split the loaded papers into chunks and extract text units from them. These text units are the input to every later stage of GraphRAG indexing.

  • Text Chunking: Splits each document into smaller pieces of manageable size.

  • Text Extracting: Stores each piece as a text unit with its own id and the document_id of the paper it came from.

The resulting table has one row per text unit:

     document_id                           id                                    text_unit
0    1e3e4efb-27de-44c1-9854-4b4176ed618b  6fe80d15-f2b2-47f2-8570-b5a64042d2f5  From Local to Global: A Graph RAG Approach to\...
1    1e3e4efb-27de-44c1-9854-4b4176ed618b  7a73bcf9-3874-40eb-ad70-e6c1070dd207  tion from an external knowledge source enables...
2    1e3e4efb-27de-44c1-9854-4b4176ed618b  63e41d22-ac60-48b8-8e70-d3e88ea2a7bc  RAG systems. To combine the strengths of these...
3    1e3e4efb-27de-44c1-9854-4b4176ed618b  861b4797-2cfd-4811-9cdf-efe0eb6a2bee  question, each community summary is used to ge...
4    1e3e4efb-27de-44c1-9854-4b4176ed618b  d543cef6-4565-4b73-9364-075b325ae290  approaches is forthcoming at https://aka .ms/g...
...  ...                                   ...                                   ...
117  65133966-6472-4c24-8520-40fccbfd666b  52cd8901-0e27-4f0e-93e2-bfeade1cdad9  with chain-of-thought reasoning for knowledge-...
118  65133966-6472-4c24-8520-40fccbfd666b  36449b6c-150a-451a-b6fc-1a3081757068  Wang, Y ., Lipka, N., Rossi, R. A., Siu, A., Z...
119  65133966-6472-4c24-8520-40fccbfd666b  7f1c0194-ba84-4fe5-b81f-659aa62d89cf  Empirical Methods in Natural Language Processi...
120  c4225aec-3b51-4bd6-bb1b-53c66d42229c  e73e8ca0-417d-462c-8c84-bce611bf50e4  Yao, L., Peng, J., Mao, C., and Luo, Y . (2023...
121  c4225aec-3b51-4bd6-bb1b-53c66d42229c  94b7803f-07dc-45a3-a12f-cec090a7f721  Zheng, L., Chiang, W.-L., Sheng, Y ., Zhuang, ...

122 rows × 3 columns
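The table above has one row per text unit, keyed by the document it came from. A minimal sketch of how such rows could be produced is shown below. It is an illustration only: the function name, the character-based splitting, and the default chunk_size and chunk_overlap values are assumptions, not the settings used to produce the output above.

```python
import uuid


def split_into_text_units(documents, chunk_size=1200, chunk_overlap=100):
    """Split documents into overlapping character chunks.

    documents maps a document_id to the full text of that document.
    Returns a list of rows with document_id, id, and text_unit keys.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    rows = []
    for document_id, text in documents.items():
        for start in range(0, len(text), step):
            rows.append(
                {
                    "document_id": document_id,
                    "id": str(uuid.uuid4()),  # unique id per text unit
                    "text_unit": text[start:start + chunk_size],
                }
            )
            # Stop once this chunk already reaches the end of the text.
            if start + chunk_size >= len(text):
                break
    return rows
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring text units, at the cost of some duplicated text.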

Entity Relationship Extraction

GraphRAG extracts entities and relationships from the text chunks to automatically build a knowledge graph.

An LLM is used to construct the knowledge graph. In this tutorial, we use gpt-4o-mini for performance and cost reasons. The LLM uses a predefined prompt to extract entity and relationship information from each text unit.
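To make the idea of prompt-driven extraction concrete, here is a small sketch of a prompt and a parser for the model's reply. Everything in it is hypothetical: the prompt wording, the pipe-delimited record format, and the Entity and Relationship classes are illustrative assumptions and are not the prompt or data model used by langchain-graphrag.

```python
from dataclasses import dataclass

# Hypothetical prompt; fill the placeholder with EXTRACTION_PROMPT.format(text=...).
EXTRACTION_PROMPT = """Extract entities and relationships from the text.
Write one record per line, using exactly these formats:
entity|NAME|TYPE|DESCRIPTION
relationship|SOURCE|TARGET|DESCRIPTION|STRENGTH (a number from 1 to 10)

Text:
{text}
"""


@dataclass(frozen=True)
class Entity:
    name: str
    type: str
    description: str


@dataclass(frozen=True)
class Relationship:
    source: str
    target: str
    description: str
    strength: float


def parse_extraction(llm_output: str):
    """Parse pipe-delimited records, silently skipping malformed lines."""
    entities, relationships = [], []
    for line in llm_output.splitlines():
        fields = [part.strip() for part in line.split("|")]
        kind = fields[0].lower()
        if kind == "entity" and len(fields) == 4:
            # Upper-case names so later stages can match mentions reliably.
            entities.append(Entity(fields[1].upper(), fields[2].upper(), fields[3]))
        elif kind == "relationship" and len(fields) == 5:
            try:
                strength = float(fields[4])
            except ValueError:
                continue  # a non-numeric strength makes the record unusable
            relationships.append(
                Relationship(fields[1].upper(), fields[2].upper(), fields[3], strength)
            )
    return entities, relationships
```

A description that itself contains "|" would be dropped by this parser, which is one reason real prompts tend to use less common delimiters.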

Graph Generation

GraphRAG does not use the extracted entities and relationships individually; it merges them into a more comprehensive structure. We call this process element summarization.

Through element summarization, GraphRAG enhances search functionality by improving global context understanding.
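The merging step can be sketched with plain dictionaries. This is a simplified illustration under assumptions: the function name and tuple layouts are hypothetical, edges are treated as undirected, and the collected descriptions are only gathered here, whereas a full pipeline would typically ask an LLM to condense them into a single summary.

```python
from collections import defaultdict


def merge_graph_elements(entities, relationships):
    """Merge repeated entity and relationship mentions.

    entities: iterable of (name, type, description) tuples.
    relationships: iterable of (source, target, description, weight) tuples.
    Returns (nodes, edges) as plain dictionaries.
    """
    nodes = defaultdict(lambda: {"type": None, "descriptions": []})
    for name, entity_type, description in entities:
        node = nodes[name.upper()]  # case-insensitive match on the name
        node["type"] = node["type"] or entity_type
        if description not in node["descriptions"]:
            node["descriptions"].append(description)

    edges = defaultdict(lambda: {"weight": 0.0, "descriptions": []})
    for source, target, description, weight in relationships:
        # Sort the endpoints so (A, B) and (B, A) share one undirected edge.
        key = tuple(sorted((source.upper(), target.upper())))
        edge = edges[key]
        edge["weight"] += weight  # repeated mentions strengthen the edge
        if description not in edge["descriptions"]:
            edge["descriptions"].append(description)

    return dict(nodes), dict(edges)
```

After merging, each node and edge carries every description found across the text units, which is the material a summarization prompt would work from.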

Graph Index Build

  • We run all steps from Text Chunking to Community Detection and Community Summarization in code.

  • For community detection, we use the Leiden algorithm, which improves on the Louvain method and guarantees well-connected communities.

  • In GraphRAG, we create an index called an artifact. We ultimately store the artifact using the save_artifact function.
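What storing an artifact might look like is sketched below. The IndexArtifact fields, the JSON format, and the load_artifact helper are assumptions made for illustration; the save_artifact function used in this tutorial may store different data in a different format.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class IndexArtifact:
    """Hypothetical container for everything the indexing steps produce."""

    text_units: list = field(default_factory=list)
    entities: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)
    # community id -> entity names, and community id -> summary text
    communities: dict = field(default_factory=dict)
    community_reports: dict = field(default_factory=dict)


def save_artifact(artifact: IndexArtifact, path) -> Path:
    """Write the artifact to disk as JSON and return the file path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(asdict(artifact), indent=2), encoding="utf-8")
    return path


def load_artifact(path) -> IndexArtifact:
    """Rebuild an artifact from a file written by save_artifact."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return IndexArtifact(**data)
```

Persisting the index this way means the LLM-heavy extraction and summarization steps run once, and later search sessions only need to load the file. JSON requires string keys, so community ids are kept as strings in this sketch.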

Local Search through Knowledge Graph

We perform a local search using the Knowledge Graph built by GraphRAG. Local Search is helpful for retrieving specific passages or details. Compared with an answer from plain gpt-4o-mini, the GraphRAG-based answer is much more detailed and grounded in the paper content.
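The idea behind local search can be sketched as follows: find the entities a question mentions, expand to their graph neighbours, and collect the text units linked to them as context for the LLM. This is a toy illustration; the function name and data layout are hypothetical, and plain keyword overlap is swapped in for the embedding-based entity matching a real implementation would normally use.

```python
import re


def local_search_context(query, entities, relationships, text_units, max_units=3):
    """Collect graph context for a query.

    entities: name -> {"description": str, "text_unit_ids": [ids]}
    relationships: list of (source, target, description) tuples
    text_units: id -> text
    """
    query_terms = set(re.findall(r"\w+", query.lower()))
    # Keyword overlap stands in for embedding similarity here.
    matched = [
        name for name in entities
        if set(re.findall(r"\w+", name.lower())) & query_terms
    ]

    neighbours, facts = set(), []
    for source, target, description in relationships:
        if source in matched or target in matched:
            neighbours.update((source, target))
            facts.append(f"{source} -> {target}: {description}")

    selected = sorted(set(matched) | neighbours)
    unit_ids = []
    for name in selected:
        for unit_id in entities.get(name, {}).get("text_unit_ids", []):
            if unit_id not in unit_ids:
                unit_ids.append(unit_id)

    return {
        "entities": selected,
        "relationships": facts,
        "text_units": [text_units[u] for u in unit_ids[:max_units] if u in text_units],
    }
```

The returned dictionary would then be formatted into the prompt, so the model answers from the retrieved passages rather than from memory alone.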

Global Search through Knowledge Graph

We can also perform a global search using the Knowledge Graph built by GraphRAG. A global search is useful for getting answers with broader context. However, global search passes many community summaries to the model, so it requires a model with sufficiently large token limits. For example, gpt-4o-mini (max output of about 16k tokens) may not handle the entire content of a large paper-based graph. If you hit that limit, try a larger model such as gpt-4o.
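The token pressure comes from the map-reduce shape of global search: community summaries are answered in batches, and the partial answers are then combined. The sketch below illustrates that shape under assumptions: the function names are hypothetical, llm stands for any callable that maps a prompt string to a reply string, and the four-characters-per-token estimate is only a rough heuristic.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (about four characters per token)."""
    return max(1, len(text) // 4)


def batch_by_token_budget(summaries, max_tokens):
    """Group summaries so each batch stays within the token budget."""
    batches, current, used = [], [], 0
    for summary in summaries:
        cost = estimate_tokens(summary)
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(summary)  # an oversized summary gets its own batch
        used += cost
    if current:
        batches.append(current)
    return batches


def global_search(query, community_summaries, llm, max_tokens=8000):
    """Map: answer per batch of summaries. Reduce: combine the partial answers."""
    partial_answers = [
        llm(f"Question: {query}\n\nCommunity reports:\n" + "\n\n".join(batch))
        for batch in batch_by_token_budget(community_summaries, max_tokens)
    ]
    if not partial_answers:
        return ""
    if len(partial_answers) == 1:
        return partial_answers[0]
    return llm(
        f"Question: {query}\n\nCombine these partial answers:\n"
        + "\n\n".join(partial_answers)
    )
```

A smaller max_tokens produces more batches and more LLM calls, while the final combine prompt still has to fit within the model's limits, which is why larger graphs tend to need a model with more headroom.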
