This tutorial demonstrates how to build a QA system over academic papers using GraphRAG.
GraphRAG, introduced by Microsoft, builds a knowledge graph over the source text so that both local details and global themes can be retrieved, producing more contextually rich answers.
However, Microsoft's official GraphRAG implementation does not integrate readily with LangChain, which makes it difficult to use in LangChain-based projects.
To work around this, we use langchain-graphrag, a library that implements GraphRAG on top of LangChain.
In this tutorial, we'll build a QA system over a recent AI paper using langchain-graphrag.
Reference: "From Local to Global: A Graph RAG Approach to Query-Focused Summarization" (arXiv:2404.16130)
# Install required packages.
# 'langchain_opentutorial.package.install' is a helper function that installs
# the specified packages inside this environment.
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain_community",
        "langchain-graphrag",
        "langchain_chroma",
        "jq",
    ],
    verbose=False,
    upgrade=False,
)
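If you are not using the helper, an equivalent notebook install (assuming a standard pip setup) is:
# Equivalent plain-pip install of the same packages
%pip install -qU langsmith langchain langchain_core langchain_community langchain-graphrag langchain_chroma jq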
You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.
[Note] This is not necessary if you've already set the required API keys in previous steps.
# Load environment variables from a .env file (e.g. OPENAI_API_KEY)
from dotenv import load_dotenv
load_dotenv(override=True)
True
Download and Load arXiv PDFs
In this tutorial, we will use arXiv data. arXiv is an open-access archive of research preprints, all available in PDF format. The arxiv-public-datasets project linked below provides access to the full PDF collection, but the data is roughly 1 TB in total and must be downloaded from AWS. For this tutorial, we therefore selectively download just one PDF instead.
Link to the full dataset: https://github.com/mattbierbaum/arxiv-public-datasets
# Download and save a sample PDF file to the ./data directory
import os

import requests


def download_pdf(url, save_path):
    """
    Downloads a PDF file from the given URL and saves it to the specified path.

    Args:
        url (str): The URL of the PDF file to download.
        save_path (str): The full path (including file name) where the file will be saved.
    """
    try:
        # Ensure the target directory exists
        os.makedirs(os.path.dirname(save_path), exist_ok=True)

        # Download the file in streaming mode
        response = requests.get(url, stream=True)
        response.raise_for_status()  # Raise an error for bad status codes

        # Save the file to the specified path in chunks
        with open(save_path, "wb") as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)

        print(f"PDF downloaded and saved to: {save_path}")
    except Exception as e:
        print(f"An error occurred while downloading the file: {e}")


# Configuration for the PDF file
pdf_url = "https://arxiv.org/pdf/2404.16130v1"
file_path = "./data/2404.16130v1.pdf"

# Download the PDF
download_pdf(pdf_url, file_path)
# Load the GraphRAG paper using PyPDFLoader.
# PyPDFLoader loads PDF content on a per-page basis.
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(file_path)
docs = loader.load()
print(f"Loaded {len(docs)} documents.")
print(docs[0].page_content)
PDF downloaded and saved to: ./data/2404.16130v1.pdf
Loaded 15 documents.
From Local to Global: A Graph RAG Approach to
Query-Focused Summarization
Darren Edge1†Ha Trinh1†Newman Cheng2Joshua Bradley2Alex Chao3
Apurva Mody3Steven Truitt2
Jonathan Larson1
1Microsoft Research
2Microsoft Strategic Missions and Technologies
3Microsoft Office of the CTO
{daedge,trinhha,newmancheng,joshbradley,achao,moapurva,steventruitt,jolarso }
@microsoft.com
†These authors contributed equally to this work
Abstract
The use of retrieval-augmented generation (RAG) to retrieve relevant informa-
tion from an external knowledge source enables large language models (LLMs)
to answer questions over private and/or previously unseen document collections.
However, RAG fails on global questions directed at an entire text corpus, such
as “What are the main themes in the dataset?”, since this is inherently a query-
focused summarization (QFS) task, rather than an explicit retrieval task. Prior
QFS methods, meanwhile, fail to scale to the quantities of text indexed by typical
RAG systems. To combine the strengths of these contrasting methods, we propose
a Graph RAG approach to question answering over private text corpora that scales
with both the generality of user questions and the quantity of source text to be in-
dexed. Our approach uses an LLM to build a graph-based text index in two stages:
first to derive an entity knowledge graph from the source documents, then to pre-
generate community summaries for all groups of closely-related entities. Given a
question, each community summary is used to generate a partial response, before
all partial responses are again summarized in a final response to the user. For a
class of global sensemaking questions over datasets in the 1 million token range,
we show that Graph RAG leads to substantial improvements over a na ¨ıve RAG
baseline for both the comprehensiveness and diversity of generated answers. An
open-source, Python-based implementation of both global and local Graph RAG
approaches is forthcoming at https://aka .ms/graphrag .
1 Introduction
Human endeavors across a range of domains rely on our ability to read and reason about large
collections of documents, often reaching conclusions that go beyond anything stated in the source
texts themselves. With the emergence of large language models (LLMs), we are already witnessing
attempts to automate human-like sensemaking in complex domains like scientific discovery (Mi-
crosoft, 2023) and intelligence analysis (Ranade and Joshi, 2023), where sensemaking is defined as
Preprint. Under review.arXiv:2404.16130v1 [cs.CL] 24 Apr 2024
Text Chunking and Text Unit Extraction
In this step, we split the loaded pages into text units (chunks). GraphRAG operates on these text units: each unit is the context from which entities and relationships are later extracted, so the chunk size and overlap directly affect the granularity of the resulting knowledge graph.
We use RecursiveCharacterTextSplitter to produce overlapping chunks, and TextUnitExtractor to convert them into the text-unit format that langchain-graphrag expects.
# Split the loaded documents into text chunks.
# 'RecursiveCharacterTextSplitter' is a commonly used LangChain utility for
# chunking text with some overlap.
from langchain_graphrag.indexing import TextUnitExtractor
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
text_unit_extractor = TextUnitExtractor(text_splitter=splitter)

# Run the text splitting logic on the loaded PDF pages
df_text_units = text_unit_extractor.run(docs)
df_text_units
Extracting text units ...: 100%|██████████| 6/6 [00:00<00:00, 10437.92it/s]
Extracting text units ...: 100%|██████████| 8/8 [00:00<00:00, 15148.73it/s]
Extracting text units ...: 100%|██████████| 8/8 [00:00<00:00, 23746.94it/s]
Extracting text units ...: 100%|██████████| 10/10 [00:00<00:00, 113359.57it/s]
Extracting text units ...: 100%|██████████| 7/7 [00:00<00:00, 23413.18it/s]
Extracting text units ...: 100%|██████████| 8/8 [00:00<00:00, 43976.98it/s]
Extracting text units ...: 100%|██████████| 9/9 [00:00<00:00, 46603.38it/s]
Extracting text units ...: 100%|██████████| 11/11 [00:00<00:00, 43815.14it/s]
Extracting text units ...: 100%|██████████| 8/8 [00:00<00:00, 11188.54it/s]
Extracting text units ...: 100%|██████████| 10/10 [00:00<00:00, 149263.49it/s]
Extracting text units ...: 100%|██████████| 10/10 [00:00<00:00, 79891.50it/s]
Extracting text units ...: 100%|██████████| 8/8 [00:00<00:00, 22429.43it/s]
Extracting text units ...: 100%|██████████| 9/9 [00:00<00:00, 21794.88it/s]
Extracting text units ...: 100%|██████████| 8/8 [00:00<00:00, 97541.95it/s]
Extracting text units ...: 100%|██████████| 2/2 [00:00<00:00, 39199.10it/s]
Processing documents ...: 100%|██████████| 15/15 [00:00<00:00, 506.48it/s]
                              document_id                                    id                                          text_unit
0    1e3e4efb-27de-44c1-9854-4b4176ed618b  6fe80d15-f2b2-47f2-8570-b5a64042d2f5  From Local to Global: A Graph RAG Approach to\...
1    1e3e4efb-27de-44c1-9854-4b4176ed618b  7a73bcf9-3874-40eb-ad70-e6c1070dd207  tion from an external knowledge source enables...
2    1e3e4efb-27de-44c1-9854-4b4176ed618b  63e41d22-ac60-48b8-8e70-d3e88ea2a7bc  RAG systems. To combine the strengths of these...
3    1e3e4efb-27de-44c1-9854-4b4176ed618b  861b4797-2cfd-4811-9cdf-efe0eb6a2bee  question, each community summary is used to ge...
4    1e3e4efb-27de-44c1-9854-4b4176ed618b  d543cef6-4565-4b73-9364-075b325ae290  approaches is forthcoming at https://aka .ms/g...
..                                    ...                                   ...                                                ...
117  65133966-6472-4c24-8520-40fccbfd666b  52cd8901-0e27-4f0e-93e2-bfeade1cdad9  with chain-of-thought reasoning for knowledge-...
118  65133966-6472-4c24-8520-40fccbfd666b  36449b6c-150a-451a-b6fc-1a3081757068  Wang, Y ., Lipka, N., Rossi, R. A., Siu, A., Z...
119  65133966-6472-4c24-8520-40fccbfd666b  7f1c0194-ba84-4fe5-b81f-659aa62d89cf  Empirical Methods in Natural Language Processi...
120  c4225aec-3b51-4bd6-bb1b-53c66d42229c  e73e8ca0-417d-462c-8c84-bce611bf50e4  Yao, L., Peng, J., Mao, C., and Luo, Y . (2023...
121  c4225aec-3b51-4bd6-bb1b-53c66d42229c  94b7803f-07dc-45a3-a12f-cec090a7f721  Zheng, L., Chiang, W.-L., Sheng, Y ., Zhuang, ...

122 rows × 3 columns
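The extractor returns a pandas DataFrame with document_id, id, and text_unit columns, so individual chunks are easy to inspect:
# Inspect the first text unit produced by the splitter
print(df_text_units.iloc[0]["text_unit"][:300])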
Entity Relationship Extraction
GraphRAG extracts entities and relationships from the text chunks and uses them to automatically build a knowledge graph.
An LLM performs this extraction, guided by a predefined prompt that asks for entity and relationship information. In this tutorial, we use gpt-4o-mini to balance performance and cost.
# This process can take about 20 minutes.
# EntityRelationshipExtractor uses a language model to identify entities & relationships in the text.
from langchain_graphrag.indexing.graph_generation import EntityRelationshipExtractor
from langchain_openai import ChatOpenAI
# Instantiate ChatOpenAI with the 'gpt-4o-mini' model.
er_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)
# Build the default entity-relationship extractor.
extractor = EntityRelationshipExtractor.build_default(llm=er_llm)
# Invoke extractor on the text units (chunks)
text_unit_graphs = extractor.invoke(df_text_units)
Extracting entities and relationships ...: 100%|██████████| 122/122 [12:53<00:00, 6.34s/it]
# Display the graph information (nodes/edges) extracted from each chunk.
for index, g in enumerate(text_unit_graphs):
    if index == 5:  # show the first 5 graphs
        break
    print("---------------------------------")
    print(f"Graph: {index}")
    print(f"Number of nodes - {len(g.nodes)}")
    print(f"Number of edges - {len(g.edges)}")
    print(g.nodes())
    print(g.edges())
    print("---------------------------------")
---------------------------------
Graph: 0
Number of nodes - 11
Number of edges - 8
['DARREN EDGE', 'HA TRINH', 'NEWMAN CHENG', 'JOSHUA BRADLEY', 'ALEX CHAO', 'APURVA MODY', 'STEVEN TRUITT', 'JONATHAN LARSON', 'MICROSOFT RESEARCH', 'MICROSOFT STRATEGIC MISSIONS AND TECHNOLOGIES', 'MICROSOFT OFFICE OF THE CTO']
[('DARREN EDGE', 'MICROSOFT RESEARCH'), ('HA TRINH', 'MICROSOFT RESEARCH'), ('NEWMAN CHENG', 'MICROSOFT STRATEGIC MISSIONS AND TECHNOLOGIES'), ('JOSHUA BRADLEY', 'MICROSOFT STRATEGIC MISSIONS AND TECHNOLOGIES'), ('ALEX CHAO', 'MICROSOFT OFFICE OF THE CTO'), ('APURVA MODY', 'MICROSOFT OFFICE OF THE CTO'), ('STEVEN TRUITT', 'MICROSOFT STRATEGIC MISSIONS AND TECHNOLOGIES'), ('JONATHAN LARSON', 'MICROSOFT RESEARCH')]
---------------------------------
---------------------------------
Graph: 1
Number of nodes - 3
Number of edges - 2
['LARGE LANGUAGE MODELS', 'RAG', 'QFS']
[('LARGE LANGUAGE MODELS', 'RAG'), ('RAG', 'QFS')]
---------------------------------
---------------------------------
Graph: 2
Number of nodes - 3
Number of edges - 2
['RAG SYSTEMS', 'GRAPH RAG', 'LLM']
[('RAG SYSTEMS', 'GRAPH RAG'), ('GRAPH RAG', 'LLM')]
---------------------------------
---------------------------------
Graph: 3
Number of nodes - 3
Number of edges - 2
['GRAPH RAG', 'PYTHON', 'USER']
[('GRAPH RAG', 'USER'), ('GRAPH RAG', 'PYTHON')]
---------------------------------
---------------------------------
Graph: 4
Number of nodes - 2
Number of edges - 1
['HUMAN ENDEAVORS', 'LARGE LANGUAGE MODELS']
[('HUMAN ENDEAVORS', 'LARGE LANGUAGE MODELS')]
---------------------------------
# Example: search for a specific extracted node
text_unit_graphs[2].nodes["GRAPH RAG"]
{'type': 'ORGANIZATION',
'description': ['Graph RAG is an approach that utilizes a graph-based text index to enhance question answering capabilities.'],
'text_unit_ids': ['63e41d22-ac60-48b8-8e70-d3e88ea2a7bc']}
# Example: check the relationship (edge) between two extracted entities
text_unit_graphs[2].edges[("GRAPH RAG", "LLM")]
{'weight': 1.0,
'description': ['LLM is used in the Graph RAG approach to build a graph-based text index'],
'text_unit_ids': ['63e41d22-ac60-48b8-8e70-d3e88ea2a7bc']}
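The per-chunk graphs support dict-style node and edge access, which suggests standard networkx graphs. Assuming that, a minimal sketch counts the entity types extracted from one chunk:
# Count extracted entity types in one chunk's graph
# (assumes the extractor returns networkx-style graphs, as the access above suggests)
from collections import Counter

type_counts = Counter(
    data.get("type") for _, data in text_unit_graphs[2].nodes(data=True)
)
print(type_counts)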
Graph Generation
GraphRAG does not keep every extracted entity and relationship as a separate record; it merges duplicates across chunks into a single consolidated graph and then condenses each element's collected descriptions with an LLM. This step is called element summarization.
Through element summarization, GraphRAG improves its understanding of global context, which in turn improves search quality.
# This process can take about 22 minutes.
# We merge all local graphs into a single consolidated graph, then run summarization.
from langchain_graphrag.indexing.graph_generation import (
    EntityRelationshipDescriptionSummarizer,
    GraphGenerator,
    GraphsMerger,
)

graphs_merger = GraphsMerger()
es_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)
summarizer = EntityRelationshipDescriptionSummarizer.build_default(llm=es_llm)

graph_generator = GraphGenerator(
    er_extractor=extractor,
    graphs_merger=graphs_merger,
    er_description_summarizer=summarizer,
)

# Execute the graph generation.
graph = graph_generator.run(df_text_units)
# Check how many nodes and edges are in the final merged+summarized graph.
print(f"Number of nodes - {len(graph[0].nodes)}")
print(f"Number of edges - {len(graph[0].edges)}")
Number of nodes - 482
Number of edges - 689
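Since the merged result behaves like a networkx graph (as the node and edge counts above suggest), a quick sketch, assuming networkx semantics, lists the most connected entities:
# List the five most connected entities in the merged, summarized graph
# (assumes graph[0] is a networkx-style graph, as the access above suggests)
top_entities = sorted(graph[0].degree, key=lambda pair: pair[1], reverse=True)[:5]
for name, degree in top_entities:
    print(f"{name}: degree {degree}")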
Graph Index Build
We now run every step in code, from text chunking through community detection and community summarization.
For community detection we use the Leiden algorithm, which improves on the earlier Louvain method by guaranteeing well-connected communities.
In GraphRAG, the resulting index is a set of artifacts (entities, relationships, text units, and community reports). We ultimately store these artifacts with the save_artifacts function defined below.
# Below we define functions to save and load the final Graph Index artifacts.
import pickle
from pathlib import Path

import pandas as pd

from langchain_graphrag.indexing.artifacts import IndexerArtifacts


# Save the IndexerArtifacts object to disk.
def save_artifacts(artifacts: IndexerArtifacts, path: Path):
    artifacts.entities.to_parquet(f"{path}/entities.parquet")
    artifacts.relationships.to_parquet(f"{path}/relationships.parquet")
    artifacts.text_units.to_parquet(f"{path}/text_units.parquet")
    artifacts.communities_reports.to_parquet(f"{path}/communities_reports.parquet")

    if artifacts.merged_graph is not None:
        with path.joinpath("merged-graph.pickle").open("wb") as fp:
            pickle.dump(artifacts.merged_graph, fp)

    if artifacts.summarized_graph is not None:
        with path.joinpath("summarized-graph.pickle").open("wb") as fp:
            pickle.dump(artifacts.summarized_graph, fp)

    if artifacts.communities is not None:
        with path.joinpath("community_info.pickle").open("wb") as fp:
            pickle.dump(artifacts.communities, fp)


# Load the IndexerArtifacts object from disk.
def load_artifacts(path: Path) -> IndexerArtifacts:
    entities = pd.read_parquet(f"{path}/entities.parquet")
    relationships = pd.read_parquet(f"{path}/relationships.parquet")
    text_units = pd.read_parquet(f"{path}/text_units.parquet")
    communities_reports = pd.read_parquet(f"{path}/communities_reports.parquet")

    merged_graph = None
    summarized_graph = None
    communities = None

    merged_graph_pickled = path.joinpath("merged-graph.pickle")
    if merged_graph_pickled.exists():
        with merged_graph_pickled.open("rb") as fp:
            merged_graph = pickle.load(fp)

    summarized_graph_pickled = path.joinpath("summarized-graph.pickle")
    if summarized_graph_pickled.exists():
        with summarized_graph_pickled.open("rb") as fp:
            summarized_graph = pickle.load(fp)

    community_info_pickled = path.joinpath("community_info.pickle")
    if community_info_pickled.exists():
        with community_info_pickled.open("rb") as fp:
            communities = pickle.load(fp)

    return IndexerArtifacts(
        entities,
        relationships,
        text_units,
        communities_reports,
        merged_graph=merged_graph,
        summarized_graph=summarized_graph,
        communities=communities,
    )
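As a quick sanity check, here is a hypothetical round trip, assuming indexing has already produced an artifacts object as in the pipeline below: save, reload, and compare row counts.
# Hypothetical round-trip check for the helpers above
out_dir = Path("./")
save_artifacts(artifacts, out_dir)
reloaded = load_artifacts(out_dir)
assert len(reloaded.entities) == len(artifacts.entities)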
# The entire indexing pipeline: from chunking, graph generation, and community
# detection to generating the final artifacts (entities, relationships, communities, etc.).
# This step can take about 22 minutes.
from langchain_chroma.vectorstores import Chroma as ChromaVectorStore
from langchain_openai import OpenAIEmbeddings

from langchain_graphrag.indexing import SimpleIndexer
from langchain_graphrag.indexing.artifacts_generation import (
    CommunitiesReportsArtifactsGenerator,
    EntitiesArtifactsGenerator,
    RelationshipsArtifactsGenerator,
    TextUnitsArtifactsGenerator,
)
from langchain_graphrag.indexing.graph_clustering.leiden_community_detector import (
    HierarchicalLeidenCommunityDetector,
)
from langchain_graphrag.indexing.report_generation import (
    CommunityReportGenerator,
    CommunityReportWriter,
)

# Initialize the community detector
community_detector = HierarchicalLeidenCommunityDetector()

# Define the LLM and embedding model
ls_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create a Chroma vector store for entities.
entities_collection_name = "entity-openai-embeddings"
entities_vector_store = ChromaVectorStore(
    collection_name=entities_collection_name,
    persist_directory="./",
    embedding_function=embeddings,
)

# Generators for the different artifact types
entities_artifacts_generator = EntitiesArtifactsGenerator(
    entities_vector_store=entities_vector_store
)
relationships_artifacts_generator = RelationshipsArtifactsGenerator()

# Community report generator & writer
report_generator = CommunityReportGenerator.build_default(
    llm=ls_llm,
    chain_config={"tags": ["community-report"]},
)
report_writer = CommunityReportWriter()
communities_report_artifacts_generator = CommunitiesReportsArtifactsGenerator(
    report_generator=report_generator,
    report_writer=report_writer,
)

# For text units
text_units_artifacts_generator = TextUnitsArtifactsGenerator()

# The SimpleIndexer orchestrates all indexing steps
indexer = SimpleIndexer(
    text_unit_extractor=text_unit_extractor,
    graph_generator=graph_generator,
    community_detector=community_detector,
    entities_artifacts_generator=entities_artifacts_generator,
    relationships_artifacts_generator=relationships_artifacts_generator,
    text_units_artifacts_generator=text_units_artifacts_generator,
    communities_report_artifacts_generator=communities_report_artifacts_generator,
)

# Run the entire pipeline on the loaded docs
artifacts = indexer.run(docs)

# Save the final artifacts to disk
artifacts_dir = Path("./")
save_artifacts(artifacts, artifacts_dir)
Extracting text units ...: 100%|██████████| 6/6 [00:00<00:00, 30320.27it/s]
Extracting text units ...: 100%|██████████| 8/8 [00:00<00:00, 21690.00it/s]
Extracting text units ...: 100%|██████████| 8/8 [00:00<00:00, 26132.74it/s]
Extracting text units ...: 100%|██████████| 10/10 [00:00<00:00, 32488.80it/s]
Extracting text units ...: 100%|██████████| 7/7 [00:00<00:00, 111212.61it/s]
Extracting text units ...: 100%|██████████| 8/8 [00:00<00:00, 59178.89it/s]
Extracting text units ...: 100%|██████████| 9/9 [00:00<00:00, 182361.04it/s]
Extracting text units ...: 100%|██████████| 11/11 [00:00<00:00, 40294.62it/s]
Extracting text units ...: 100%|██████████| 8/8 [00:00<00:00, 29720.49it/s]
Extracting text units ...: 100%|██████████| 10/10 [00:00<00:00, 54899.27it/s]
Extracting text units ...: 100%|██████████| 10/10 [00:00<00:00, 140748.46it/s]
Extracting text units ...: 100%|██████████| 8/8 [00:00<00:00, 68338.97it/s]
Extracting text units ...: 100%|██████████| 9/9 [00:00<00:00, 79471.02it/s]
Extracting text units ...: 100%|██████████| 8/8 [00:00<00:00, 111476.52it/s]
Extracting text units ...: 100%|██████████| 2/2 [00:00<00:00, 62601.55it/s]
Processing documents ...: 100%|██████████| 15/15 [00:00<00:00, 706.08it/s]
Extracting entities and relationships ...: 100%|██████████| 122/122 [14:50<00:00, 7.30s/it]
Summarizing entities descriptions: 100%|██████████| 489/489 [02:08<00:00, 3.80it/s]
Summarizing relationship descriptions: 100%|██████████| 500/500 [00:26<00:00, 19.00it/s]
Generating report for level=0 commnuity_id=9: 100%|██████████| 10/10 [01:43<00:00, 10.35s/it]
Generating report for level=1 commnuity_id=30: 100%|██████████| 21/21 [03:16<00:00, 9.37s/it]
Generating report for level=2 commnuity_id=32: 100%|██████████| 2/2 [00:17<00:00, 8.95s/it]
Local Search through Knowledge Graph
We perform a local search using the knowledge graph built by GraphRAG. Local search is well suited to retrieving specific passages or details. To see the difference, compare the result with a plain gpt-4o-mini answer (a baseline sketch follows below): the GraphRAG-based answer is far more detailed and grounded in the paper's content.
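For reference, a minimal baseline sketch: the same question sent straight to gpt-4o-mini with no retrieved context (only langchain-openai is assumed), so its answer must rely on the model's pretraining alone.
# Baseline for comparison: ask gpt-4o-mini directly, with no GraphRAG context
from langchain_openai import ChatOpenAI

baseline_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)
print(baseline_llm.invoke("What community detection algorithm does GraphRAG use?").content)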
# Load the artifacts we saved earlier
artifacts = load_artifacts(artifacts_dir)

# Now we demonstrate local search on the knowledge graph.
from typing import cast

from langchain_graphrag.query.local_search import (
    LocalSearch,
    LocalSearchPromptBuilder,
    LocalSearchRetriever,
)
from langchain_graphrag.query.local_search.context_builders import (
    ContextBuilder,
)
from langchain_graphrag.query.local_search.context_selectors import (
    ContextSelector,
)
from langchain_graphrag.types.graphs.community import CommunityLevel
from langchain_graphrag.utils import TiktokenCounter

# Build a default ContextSelector that determines which entities and communities
# are most relevant to the user query.
context_selector = ContextSelector.build_default(
    entities_vector_store=entities_vector_store,
    entities_top_k=10,
    community_level=cast(CommunityLevel, 2),  # e.g. second-level community granularity
)

# The ContextBuilder merges the context from the selection step into a final prompt context.
context_builder = ContextBuilder.build_default(
    token_counter=TiktokenCounter(),
)

# LocalSearchRetriever uses the context_selector and context_builder to retrieve relevant data.
retriever = LocalSearchRetriever(
    context_selector=context_selector,
    context_builder=context_builder,
    artifacts=artifacts,
)

# LocalSearch ties everything together into a query chain.
local_search = LocalSearch(
    prompt_builder=LocalSearchPromptBuilder(),
    llm=ls_llm,
    retriever=retriever,
)

# Calling local_search() returns a runnable chain object.
search_chain = local_search()

# Let's make a query.
query = "What community detection algorithm does GraphRAG use?"
print(search_chain.invoke(query))
Graph RAG utilizes the **Leiden algorithm** for community detection within its framework. The Leiden algorithm is recognized for its efficiency in identifying communities in large networks, making it a pivotal tool in the analysis of the MultiHop-RAG dataset, which is integral to the Graph RAG approach. This algorithm enhances the understanding of relationships among different entities by effectively partitioning graphs into well-connected communities [Data: Entities (88); Relationships (149, 160)].
### Significance of the Leiden Algorithm
The Leiden algorithm improves upon earlier methods, such as the Louvain method, by ensuring that the resulting communities reflect meaningful relationships among nodes. Its ability to recover hierarchical community structures is particularly valuable for analyzing large-scale graphs, which is a common challenge in network analysis. This capability is essential for the comprehensive data analysis that Graph RAG aims to achieve [Data: Entities (88); Reports (12)].
### Application in Graph RAG
In the context of Graph RAG, the Leiden algorithm is employed to analyze the MultiHop-RAG dataset, which encompasses a wide range of news articles across various categories. This application underscores the importance of computational methods in extracting meaningful insights from complex data structures, thereby enhancing the overall quality of information retrieval and summarization processes [Data: Reports (0); Relationships (149)].
In summary, the Leiden algorithm plays a crucial role in the community detection capabilities of Graph RAG, facilitating effective data analysis and enhancing the understanding of complex relationships within large datasets.
Global Search through Knowledge Graph
We can also perform a global search using the knowledge graph built by GraphRAG. Global search is useful for answering questions that require broad, corpus-level context. Because it packs many community reports into a single prompt, it needs a model with a sufficiently large context budget; here we cap the community-report context at 16,384 tokens for gpt-4o-mini. If you use a model with a larger context window, such as gpt-4o, you can raise this cap to cover more of the graph at once.
# Demonstrate global search using the knowledge graph.
# Here, we generate key points from the entire graph and then aggregate them.
from langchain_graphrag.query.global_search import GlobalSearch
from langchain_graphrag.query.global_search.community_weight_calculator import (
    CommunityWeightCalculator,
)
from langchain_graphrag.query.global_search.key_points_aggregator import (
    KeyPointsAggregator,
    KeyPointsAggregatorPromptBuilder,
    KeyPointsContextBuilder,
)
from langchain_graphrag.query.global_search.key_points_generator import (
    CommunityReportContextBuilder,
    KeyPointsGenerator,
    KeyPointsGeneratorPromptBuilder,
)

# The CommunityReportContextBuilder builds a context from community reports based on the user query.
report_context_builder = CommunityReportContextBuilder(
    community_level=cast(CommunityLevel, 1),
    weight_calculator=CommunityWeightCalculator(),
    artifacts=artifacts,
    token_counter=TiktokenCounter(),
    max_tokens=16384,  # context cap used here for 'gpt-4o-mini'
)

# KeyPointsGenerator creates key points from the available content.
kp_generator = KeyPointsGenerator(
    llm=ls_llm,
    prompt_builder=KeyPointsGeneratorPromptBuilder(
        show_references=True, repeat_instructions=True
    ),
    context_builder=report_context_builder,
)

# KeyPointsAggregator merges (aggregates) the key points from each relevant community.
kp_aggregator = KeyPointsAggregator(
    llm=ls_llm,
    prompt_builder=KeyPointsAggregatorPromptBuilder(
        show_references=True, repeat_instructions=True
    ),
    context_builder=KeyPointsContextBuilder(token_counter=TiktokenCounter()),
    output_raw=False,
)

# GlobalSearch orchestrates the generation and aggregation of key points across the whole knowledge graph.
global_search = GlobalSearch(
    kp_generator=kp_generator,
    kp_aggregator=kp_aggregator,
    generation_chain_config={"tags": ["kp-generation"]},
    aggregation_chain_config={"tags": ["kp-aggregation"]},
)

# Perform a synchronous invoke of the global search.
response = global_search.invoke(query)
print(response)
Reached max tokens for a community report call ...
## Community Detection Algorithm in GraphRAG
GraphRAG utilizes the **Leiden algorithm** for community detection. This algorithm is specifically designed to partition graphs into well-connected communities, which enhances the understanding of relationships among different entities within the dataset. The Leiden algorithm represents a significant advancement over earlier methodologies, such as the Louvain method, by providing improved performance and accuracy in detecting communities within complex networks [Data: Reports (8, 12)].
### Implications of Using the Leiden Algorithm
The choice of the Leiden algorithm indicates a focus on achieving more reliable and meaningful community structures. This may lead to better insights into the interactions and connections among various entities represented in the data. By employing this advanced algorithm, GraphRAG may facilitate more effective analysis and interpretation of the underlying relationships, which is crucial for applications that rely on community detection.
In summary, the use of the Leiden algorithm in GraphRAG enhances its capability to analyze and interpret complex relationships, thereby providing a robust framework for community detection in the dataset.