Real-Time GraphRAG QA
Author: Jongcheol Kim
Peer Review: Heesun Moon, Taylor (Jihyun Kim)
Proofread: Juni Lee
This is part of the LangChain Open Tutorial
Overview
This tutorial builds a GraphRAG QA workflow that extracts knowledge from PDF documents and enables natural language queries through a Neo4j graph database. After a user uploads a PDF document, it is processed with OpenAI's GPT models (e.g., gpt-4o) to extract entities and relationships.
The extracted information is stored in a Neo4j graph database. Users can then interact with the graph in real-time by asking natural language questions, which are converted into Cypher queries to retrieve answers from the graph.
Features
Real-time GraphRAG: Enables real-time knowledge extraction from documents and supports interactive querying.
Modular and Configurable: Users can set up their own credentials for OpenAI and Neo4j.
Natural Language Interface: Ask questions in plain English and get answers from the graph database.
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorial is a package that provides a set of easy-to-use environment setup methods, useful functions, and utilities for tutorials. You can check out langchain-opentutorial for more details.
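The required packages can be installed as follows. This is a sketch; the exact package list is an assumption based on the libraries used later in this tutorial.

```python
# Install the packages used in this tutorial.
# The package list is an assumption based on the libraries referenced below.
%pip install -qU langchain-opentutorial langchain-openai langchain-neo4j langchain-experimental neo4j pypdf
```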
Neo4j Database Connection
This tutorial uses Neo4j and Neo4j Sandbox for graph database construction.
Neo4j Sandbox is a free, cloud-based service that lets you spin up a Neo4j instance in minutes, so you can experiment with graph databases without any local installation.
[Note] Neo4j can be set up in several different ways:
Neo4j Desktop: A desktop application for local development
Neo4j Sandbox: A free, cloud-based platform for working with graph databases
Docker: Run Neo4j in a container using the official Neo4j Docker image
Setup Neo4j Sandbox
Go to Neo4j Sandbox and click the "+ New Project" button. Select your desired dataset to create a database.

After creation, click the toggle to see example connection code provided for different programming languages. You can easily connect using the neo4j-driver library.

To connect the graph database with LangChain, you'll need the connection details from this section.

The following code connects to the Neo4j database and initializes the necessary components for our application.
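Below is a minimal sketch of that connection code. It assumes the langchain-neo4j and langchain-openai packages; the URI and password are placeholders to be replaced with the connection details from your own Sandbox instance.

```python
import os
from langchain_neo4j import Neo4jGraph
from langchain_openai import ChatOpenAI

# Connection details come from the Sandbox "Connection details" panel.
# The URI and password below are placeholders -- replace them with your own.
os.environ["NEO4J_URI"] = "bolt://<sandbox-ip>:7687"
os.environ["NEO4J_USERNAME"] = "neo4j"
os.environ["NEO4J_PASSWORD"] = "<sandbox-password>"

# Graph wrapper used for queries and data loading throughout the tutorial.
graph = Neo4jGraph(
    url=os.environ["NEO4J_URI"],
    username=os.environ["NEO4J_USERNAME"],
    password=os.environ["NEO4J_PASSWORD"],
)

# LLM used for entity extraction and question answering.
llm = ChatOpenAI(model="gpt-4o", temperature=0)
```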
PDF Processing
Here's how we process PDF documents and extract text from them.
First, we use PyPDFLoader to load the PDF file and split it into individual pages. Then, we use RecursiveCharacterTextSplitter to break these pages down into manageable chunks. We set each chunk to be 200 characters long, with a 40-character overlap between chunks to maintain smooth transitions and context.
Once all the splitting work is complete, we begin the text cleanup process. For each document chunk, we remove any newline characters and attach source information so we can track where the content originated. The cleaned chunks are then organized into a list of Document objects.
This approach helps us transform complex PDF documents into a format that's much more suitable for AI processing and analysis.
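The sketch below walks through those steps. The file name data/mouse_manual.pdf is an assumption; point it at wherever you saved the manual.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Load the PDF and split it into pages (the file name is an assumption).
loader = PyPDFLoader("data/mouse_manual.pdf")
pages = loader.load()

# Break pages into 200-character chunks with a 40-character overlap.
splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=40)
chunks = splitter.split_documents(pages)

# Clean up each chunk: drop newlines and keep the source in the metadata.
documents = [
    Document(
        page_content=chunk.page_content.replace("\n", " "),
        metadata={"source": chunk.metadata.get("source", "unknown")},
    )
    for chunk in chunks
]
```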
Graph Transformation
Here's how to transform extracted text into a graph structure using our transformation function.
First, we initialize the graph database by clearing any existing nodes and relationships. We then define a set of allowed nodes and their permitted relationships:
Allowed nodes: ["Device", "PowerSource", "OperatingSystem", "ConnectionStatus", "Software", "Action"]
Permitted relationships: ["USES_POWER", "OPERATES_ON", "HAS_STATUS", "REQUIRES", "PERFORMS"]
For demonstration purposes, these nodes and relationships were defined using the gpt-4o model.
We then create a graph transformer using LLMGraphTransformer and configure it with our defined nodes and relationships. To keep things simple, we set both node and relationship properties to false. This transformer takes our document chunks and converts them into graph documents, which are then added to our Neo4j database along with their source information.
This whole process transforms our text data into a structured graph format, making it much easier to query and analyze relationships between different entities in our documents.
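Here is a sketch of that transformation, reusing the graph and llm objects from the connection step; LLMGraphTransformer is imported from the langchain-experimental package.

```python
from langchain_experimental.graph_transformers import LLMGraphTransformer

# Clear any existing nodes and relationships before loading new data.
graph.query("MATCH (n) DETACH DELETE n")

allowed_nodes = ["Device", "PowerSource", "OperatingSystem",
                 "ConnectionStatus", "Software", "Action"]
allowed_relationships = ["USES_POWER", "OPERATES_ON", "HAS_STATUS",
                         "REQUIRES", "PERFORMS"]

# Restrict extraction to the whitelisted schema; skip node and
# relationship properties to keep the graph simple.
graph_transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=allowed_nodes,
    allowed_relationships=allowed_relationships,
    node_properties=False,
    relationship_properties=False,
)

# Convert the cleaned chunks into graph documents and load them into Neo4j,
# linking each extracted node back to its source document.
graph_documents = graph_transformer.convert_to_graph_documents(documents)
graph.add_graph_documents(graph_documents, include_source=True)
```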
QA Chain Setup
Here's how to create a powerful question-answering chain using the GraphCypherQAChain.
First, we create a custom prompt template that helps generate Cypher queries. This template is quite comprehensive - it includes all the available relationship types in our database like MENTIONS, PERFORMS, USES_POWER, HAS_STATUS, OPERATES_ON, and REQUIRES. It also provides an example query structure to ensure proper formatting and includes a placeholder where we'll insert the user's question.
Once we have our template ready, we create a GraphCypherQAChain with several important configurations:
We use our configured llm for query generation
We connect it to our graph database
We incorporate our cypher_prompt template
We enable verbose mode for detailed logging
We set return_intermediate_steps to see what's happening under the hood
We set allow_dangerous_requests to true for handling complex queries
We limit our results to the top 3 matches with top_k
This whole setup creates a powerful chain that can take natural language questions from users, convert them into proper Cypher queries, and fetch relevant answers from our graph database. It's like having a translator that converts human questions into database language and back again.
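A sketch of this setup is shown below. The prompt template is a trimmed-down illustration of the comprehensive one described above, and GraphCypherQAChain is imported from langchain-neo4j here.

```python
from langchain_core.prompts import PromptTemplate
from langchain_neo4j import GraphCypherQAChain

# Illustrative Cypher-generation prompt: lists the allowed relationship
# types, gives an example query, and leaves a slot for the user's question.
cypher_prompt = PromptTemplate(
    input_variables=["schema", "question"],
    template="""You are a Neo4j expert. Generate a Cypher query to answer the question.
Use only these relationship types: MENTIONS, PERFORMS, USES_POWER,
HAS_STATUS, OPERATES_ON, REQUIRES.

Schema:
{schema}

Example:
MATCH (d:Document)-[:MENTIONS]->(n) RETURN n LIMIT 3

Question: {question}
Cypher query:""",
)

qa_chain = GraphCypherQAChain.from_llm(
    llm=llm,
    graph=graph,
    cypher_prompt=cypher_prompt,
    verbose=True,
    return_intermediate_steps=True,
    allow_dangerous_requests=True,
    top_k=3,
)
```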
Define QA Function
Here's how this function combines a graph database query with an LLM fallback to provide answers efficiently.
First, it searches the graph database using a Cypher query. It looks for Document nodes connected via MENTIONS relationships that contain the question's keyword in their text. This is the primary way the function tries to find answers.
If the first query doesn't return a result, it splits the question into individual words and searches using each word as a keyword. This approach helps when a single keyword doesn't match exactly but parts of the question might.
The function includes several fallback mechanisms:
If no answer is found in the database, the question is passed to an LLM
The LLM processes the query and generates an answer independently
The entire process is wrapped in a try-except block for smooth error handling
Users receive friendly error messages if something goes wrong
The function follows a clear decision path:
Return the database answer if found
Use the LLM's answer if the database search fails
Ask the user to rephrase if both methods fail
This hybrid approach ensures flexibility, combining the speed of database queries with the depth of LLM-generated answers. It's perfect for handling both structured and unstructured data queries seamlessly.
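The sketch below implements that decision path. The function name graphRAG_QA is illustrative, and it assumes source chunks are stored as Document nodes with a text property (the behavior of add_graph_documents with include_source=True).

```python
def graphRAG_QA(question: str) -> str:
    """Answer a question from the graph first, falling back to the LLM."""
    lookup_query = """
        MATCH (d:Document)-[:MENTIONS]->(n)
        WHERE toLower(d.text) CONTAINS toLower($keyword)
        RETURN d.text AS answer LIMIT 3
    """
    try:
        # 1) Primary lookup: Document nodes whose text mentions the question.
        result = graph.query(lookup_query, params={"keyword": question})

        # 2) Fallback lookup: retry with each word of the question.
        if not result:
            for word in question.split():
                result = graph.query(lookup_query, params={"keyword": word})
                if result:
                    break

        if result:
            return result[0]["answer"]

        # 3) Final fallback: let the LLM answer on its own.
        return llm.invoke(question).content
    except Exception:
        return "Sorry, something went wrong. Please try rephrasing your question."
```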
Usage Example
Here's a practical example.
I used the following document for this demonstration.
Please download the document using the link below and save it to the data folder.
Document Details
Document Name: Lenovo Combined Mouse User Guide
Document Type: User Manual
File Size: 744 KB
Pages: 4
Download Link: Mi Dual Mode Wireless Mouse Silent Edition User Manual
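Once the document is in the data folder and has been processed, you can ask questions against the graph. The sample question below is illustrative, and graphRAG_QA refers to the sketch defined earlier.

```python
# Ask a natural-language question about the manual.
answer = graphRAG_QA("How do I connect the mouse via Bluetooth?")
print(answer)
```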