Academic Search System


Overview

This tutorial loads OpenAlex, an open dataset of academic publications, into Neo4j, a graph database.

We then use an LLM to generate Cypher queries (Cypher is the query language of the graph database) and use the data those queries return to answer questions. Together, these pieces form an Academic Search System.

Before we dive into the tutorial, let's understand what GraphRAG is and why we should use it!

GraphRAG is the familiar RAG approach, extended so that the retrieval path includes a knowledge graph in addition to vectors.

So what are the advantages of GraphRAG? The main ones are as follows.

  1. You can obtain more accurate and higher quality results.

    • According to Microsoft, using GraphRAG allowed them to obtain more relevant contexts, which led to better answers. It also made it easier to trace the grounds for those answers.

    • Additionally, it required 26% to 97% fewer tokens, resulting in cost savings and scalability benefits.

  2. It enhances data comprehension.

    • When looking at vectors represented by long lists of numbers, it is nearly impossible for a human to understand them conceptually or intuitively.

    [Figure: vector data. It seems impossible to understand...]

    However, graphs are highly intuitive. They make it much easier to understand the relationships between data.

    [Figure: graph data. It looks much clearer now.]

    By exploring such intuitive graphs, you can gain new insights.

  3. Management becomes easier in terms of tracking, explaining, and access control.

    • Using graphs, you can trace why certain data was selected or why errors occurred. This traceability can be used to explain the results.

    • Additionally, you can assign data permissions within the knowledge graph, enhancing security and privacy protection.

Knowing what GraphRAG is makes you want to use it even more, doesn't it? Now, let's embark on creating an Academic Search System together!


Of course, GraphRAG is not without its disadvantages.

  1. It is quite challenging to construct.

  2. It can be inefficient when dealing with unstructured data.

  3. etc ...

Therefore, one must exercise caution when applying it in a production environment.

However, in this tutorial, we will focus solely on building the Academic Search System.

Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup helpers, functions, and utilities for tutorials.

  • You can check out langchain-opentutorial for more details.

You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.
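A common way to load a .env file is the load_dotenv function from the python-dotenv package. As a dependency-free illustration of what that step does, the sketch below parses simple KEY=VALUE lines with the standard library only. The helper name load_env_file is hypothetical, and the parser deliberately ignores advanced .env syntax such as multi-line values or variable expansion.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env", override: bool = True) -> dict[str, str]:
    """Read simple KEY=VALUE lines from a .env file into os.environ.

    Hypothetical stand-in for python-dotenv's load_dotenv; it handles only
    plain assignments, comments, and optional surrounding quotes.
    """
    loaded: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded  # nothing to load; leave the environment unchanged
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # drop optional quotes
        if override or key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded
```

Calling load_env_file() with no arguments reads .env from the current working directory and returns the keys it set, which makes it easy to confirm that OPENAI_API_KEY was picked up.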

Preliminary Task: Running Neo4j Using Docker

Before we get into the main tutorial, we need to perform some pre-tasks. Specifically, we need to launch the Graph DB, Neo4j, using Docker!

Since our goal is not to study Docker, we will skip a detailed explanation of Docker and simply share the Docker Compose file used to launch the Neo4j container. Please modify it to suit your environment!

Official Site : Getting started with Neo4j in Docker

docker-compose.yml

After launching the container using the above file (or in your own way), open the local server on port 7474:

http://localhost:7474

[Figure: Neo4j Browser console]

Ta-da! You will see the Neo4j screen, and you will be ready to fully enjoy the tutorial.

Shall we dive into the main content now?

Prepare the Data

Let's prepare the data. As mentioned earlier, we will use OpenAlex, an open academic publication dataset. OpenAlex describes academic entities and how these entities are interconnected. The entity types it provides include:

  • Works

  • Authors

  • Sources

  • Institutions

  • Topics

  • Publishers

  • Funders

Among these, we will focus on the following entity types: Works, Authors, Institutions, and Topics.

Data Structure

Before we look at the structure of the data we will create, let's briefly review what nodes and relationships are in a graph database.

A graph database is composed of nodes and relationships.

  • Node: Refers to an individual entity. A node can have zero or more labels that define it.

  • Relationship: Refers to the connection between a source node and a target node. Relationships always have a direction and a type.

Both nodes and relationships can have key-value pair properties.

For more detailed information about Graph properties, please refer to the official website.
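To make the definitions above concrete, here is a minimal in-memory sketch of a property graph in Python. It is not how Neo4j stores data; the class names Node and Relationship and the sample values are illustrative assumptions that simply mirror the description: nodes carry labels and properties, and relationships carry a direction, a type, and properties.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    """An individual entity with zero or more labels and key-value properties."""
    labels: set[str] = field(default_factory=set)
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed connection from a source node to a target node."""
    source: Node
    target: Node
    type: str
    properties: dict[str, Any] = field(default_factory=dict)


# Illustrative values only; they are not taken from the OpenAlex dataset.
author = Node(labels={"Authors"}, properties={"display_name": "Example Author"})
work = Node(labels={"Works"}, properties={"display_name": "Example Paper"})
wrote = Relationship(
    source=author,
    target=work,
    type="WROTE",
    properties={"author_position": "first"},
)
```

The direction is captured by the source and target fields, so the sample reads as (Authors)-[WROTE]->(Works).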

Let us now explore the nodes and relationships of the data we will construct.

Node

  • Works: These are academic documents such as journal articles, books, datasets, and theses.

    • display_name: The title of the academic document.

    • cited_by_count: The number of times the document has been cited.

    • language: The language in which the document is written.

    • publication_year: The year the document was published.

    • type: The type of the document.

    • license: The license under which the document is published.

    • url: The URL where the document is available.

  • Authors: Information about the authors who wrote the academic documents.

    • display_name: The name of the author.

    • orcid: The author's ORCID. (ORCID is a global and unique ID for authors.)

    • works_count: The number of documents the author has worked on.

  • Topics: Subjects related to the documents.

    • display_name: The title of the topic.

    • description: A description of the topic.

  • Institutions: The institutions with which the authors were affiliated. This information is included in the Authors data.

    • display_name: The name of the institution.

    • ror: The ROR (Research Organization Registry) ID of the institution.

    • country_code: The country where the institution is located.

Relationship

  • Works <- WROTE - Authors: The relationship between an author and the documents they have written.

    • author_position: The author's position in the authorship list (e.g., first author, second author).

  • Works - ASSOCIATION -> Topics: The relationship between documents and topics.

    • score: The relevance score indicating how strongly the document is related to the topic.

  • Authors - AFFILIATED -> Institutions: The relationship between authors and the institutions with which they were affiliated.

    • years: The years during which the author was affiliated with the institution.

For more detailed information about OpenAlex Entities, please refer to the official website.

The above structure utilizes only a small portion of the data, and you could certainly develop a more logical and comprehensive structure for nodes and relationships based on your own rationale.

However, for the purposes of this tutorial, we will proceed with the structure as outlined above.
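As a rough illustration of how one OpenAlex work record maps onto this structure, the sketch below pulls the Works properties and the WROTE relationships out of a single JSON record. The helper names are hypothetical, the sample record is made up, and reading license and url from the record's primary_location field is an assumption rather than something the tutorial states.

```python
from typing import Any


def extract_work(record: dict[str, Any]) -> dict[str, Any]:
    """Keep only the Works properties used in this tutorial."""
    location = record.get("primary_location") or {}
    return {
        "id": record.get("id"),
        "display_name": record.get("display_name") or "",
        "cited_by_count": record.get("cited_by_count") or 0,
        "language": record.get("language") or "",
        "publication_year": record.get("publication_year"),
        "type": record.get("type") or "",
        # Assumption: license and url come from primary_location.
        "license": location.get("license") or "",
        "url": location.get("landing_page_url") or "",
    }


def extract_wrote_edges(record: dict[str, Any]) -> list[dict[str, Any]]:
    """Build one Authors-[WROTE]->Works edge per authorship entry."""
    edges = []
    for authorship in record.get("authorships") or []:
        author = authorship.get("author") or {}
        edges.append({
            "author_id": author.get("id"),
            "work_id": record.get("id"),
            "author_position": authorship.get("author_position"),
        })
    return edges


# Made-up record for illustration; the IDs are placeholders, not real data.
sample = {
    "id": "W-example",
    "display_name": "Example Paper",
    "publication_year": 2024,
    "authorships": [
        {"author_position": "first", "author": {"id": "A-example"}},
    ],
}
work_props = extract_work(sample)
wrote_edges = extract_wrote_edges(sample)
```

Missing fields fall back to empty strings or zero, which mirrors the coalesce handling used when the data is inserted with Cypher.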

Let's create a graph based on the JSON file we downloaded.

Now let's build a graph using Cypher. Before that, a brief word about Cypher: it is Neo4j's declarative query language, which lets users unlock the full potential of property graph databases. For more detailed information about Cypher, please refer to the official website.

As always, a single line of code is often easier to understand than ten lines of explanation. Let's use Cypher to insert JSON data.
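The tutorial's full query is not reproduced in this text, so the following is only a minimal sketch assembled from the clauses analyzed in this section. It covers Works, Authors, and the WROTE relationship; the Topics and Institutions parts are left out. The helper names, the Authors properties that are set, and the connection defaults (bolt://localhost:7687, neo4j, password) are assumptions to adapt to your own setup.

```python
def build_import_query(file_url: str) -> str:
    """Return a Cypher query that loads works from a JSON file via APOC.

    The file location is interpolated into the query text, mirroring the
    string-concatenation form discussed in this section. For untrusted
    input, a query parameter would be the safer choice.
    """
    return f"""
CALL apoc.load.json('{file_url}')
YIELD value
UNWIND value.authorships AS authorships
WITH value, authorships, authorships.author AS author
MERGE (w:Works {{id: value.id}})
SET w.display_name = coalesce(value.display_name, ''),
    w.cited_by_count = coalesce(value.cited_by_count, 0)
MERGE (a:Authors {{id: author.id}})
SET a.display_name = coalesce(author.display_name, '')
MERGE (a)-[:WROTE {{author_position: authorships.author_position}}]->(w)
"""


def run_import(
    file_url: str,
    uri: str = "bolt://localhost:7687",  # assumed Bolt endpoint
    user: str = "neo4j",                 # placeholder credentials
    password: str = "password",
) -> None:
    """Execute the import query with the official Neo4j Python driver."""
    # Imported here so the query builder works without the driver installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            session.run(build_import_query(file_url)).consume()
```

Because every node and relationship is written with MERGE, running the import twice on the same file should not create duplicates. Note that MERGE rejects null property values, so this sketch assumes author_position is always present.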

Let's analyze the Cypher declared above, line by line. We will skip clauses that repeat a form already explained.

  • CALL apoc.load.json('"+ file+ "')

    • Reads the JSON file. This step requires the APOC plugin. With the Docker Compose file provided above, it is installed automatically through NEO4J_PLUGINS=['apoc'].

  • YIELD value

    • Returns the value obtained by reading the JSON file.

  • UNWIND value.authorships as authorships

    • Expands the authorships list within the value object so that each item is processed individually, bound to the alias authorships through as.

  • WITH value, authorships, author, topics, field, domain

    • Passes the variables obtained through YIELD and UNWIND to the next part of the query, making them available for subsequent operations.

  • MERGE (w:Works {id: value.id}) SET w.display_name = coalesce(value.display_name, '') ...

    • The MERGE clause matches or creates a node with the Works label whose id property equals value.id. If a node with that id already exists, it is matched; otherwise, a new node is created.

    • The SET clause updates the display_name property of the Works node. The coalesce function substitutes an empty string ('') when value.display_name is null.

  • MERGE (a)-[:WROTE{author_position: authorships.author_position}]->(w)

    • The MERGE clause matches or creates a relationship between nodes a and w. Just as with nodes, if the same relationship already exists, it is matched; otherwise, a new relationship is created.

    • This relationship has the WROTE type and carries the author_position property, which is set to the value of authorships.author_position.
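The match-or-create behavior of MERGE and the null handling of coalesce can be mimicked in a few lines of plain Python. This is only an analogy for intuition, not how Neo4j implements either feature; merge_node and the sample records are hypothetical.

```python
def coalesce(*values):
    """Return the first argument that is not None, like Cypher's coalesce()."""
    for value in values:
        if value is not None:
            return value
    return None


def merge_node(nodes: dict, label: str, node_id: str) -> dict:
    """Match the node with this label and id, or create it if it is missing."""
    key = (label, node_id)
    if key not in nodes:
        nodes[key] = {"id": node_id}
    return nodes[key]


nodes: dict = {}
# Two records with the same id: the second has no display_name.
records = [
    {"id": "W1", "display_name": "Example Paper"},
    {"id": "W1", "display_name": None},
]
for value in records:
    w = merge_node(nodes, "Works", value["id"])               # MERGE (w:Works {id: value.id})
    w["display_name"] = coalesce(value["display_name"], "")   # SET w.display_name = coalesce(...)
```

After the loop there is still a single Works node, because the second record matched the node created by the first. Its display_name is now the empty string, since SET overwrites the property each time it runs.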

The nodes and relationships for Authors and Topics will be constructed in a similar manner, so the explanation will be omitted.

The graph with all the data will have the following structure.

[Figure: The graph we built!]

Now, let us integrate the generated graph with the LLM to build a Q&A system.

Let's make the Academic Search System

Using the default QA chain

First, let's use the default QA chain provided by LangChain. GraphCypherQAChain is a prebuilt chain that generates Cypher queries and answers questions about a graph, which makes it convenient to use.

[Figure: GraphCypherQAChain. Source: LangChain]

As the figure above shows, the chain calls the LLM once to generate a Cypher query, runs that query against the graph database, and then calls the LLM again to produce an answer to the user's question from the query results.
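The three steps described above can be sketched with plain callables. This is not LangChain's implementation of GraphCypherQAChain, only an illustration of the control flow; the function answer_question and the stub functions are hypothetical stand-ins for the two LLM calls and the database call.

```python
from typing import Any, Callable


def answer_question(
    question: str,
    generate_cypher: Callable[[str], str],
    run_query: Callable[[str], list[dict[str, Any]]],
    generate_answer: Callable[[str, list[dict[str, Any]]], str],
) -> str:
    """Run the generate-query, execute, generate-answer pipeline once."""
    cypher = generate_cypher(question)       # first LLM call: question -> Cypher
    rows = run_query(cypher)                 # graph database executes the query
    return generate_answer(question, rows)   # second LLM call: rows -> answer


# Stubs standing in for the LLM and the database, for illustration only.
def fake_generate_cypher(question: str) -> str:
    return "MATCH (w:Works) RETURN count(w) AS n"


def fake_run_query(cypher: str) -> list[dict[str, Any]]:
    return [{"n": 3}]


def fake_generate_answer(question: str, rows: list[dict[str, Any]]) -> str:
    return f"There are {rows[0]['n']} works."


result = answer_question(
    "How many works are there?",
    fake_generate_cypher,
    fake_run_query,
    fake_generate_answer,
)
```

Keeping the three steps as separate callables makes it clear where things can go wrong: a poorly generated Cypher query in the first step yields empty rows, and the second LLM call then has nothing to answer from.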

Let's implement a simple QA service using GraphCypherQAChain.

However, this chain has one issue: it inserts the terms from the user's question directly into the Cypher query.

In other words, you need to know exactly how the data is written to get the desired answer.

Therefore, instead of relying on the predefined QA chain, we will create our own custom chain using LangGraph.

Using the LangGraph chain we built

Upon receiving a query, if it pertains to a node or the relationships between nodes, we first extract related data through a semantic search over embedding vectors of specific properties. We then build a LangGraph workflow that uses the extracted data to answer the query.
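The semantic search step can be sketched as a cosine-similarity lookup over stored property embeddings. The vectors below are tiny made-up examples, not real embeddings, and the helper names are hypothetical. In practice the vectors would come from an embedding model and be stored in a vector index.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    query_vector: list[float],
    candidates: dict[str, list[float]],
    k: int = 1,
) -> list[str]:
    """Return the k candidate names whose vectors are closest to the query."""
    ranked = sorted(
        candidates.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy 3-dimensional vectors for Topics display_name values (illustrative).
topic_vectors = {
    "Natural Language Processing": [0.9, 0.1, 0.0],
    "Computer Vision": [0.1, 0.9, 0.0],
    "Graph Theory": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # stands in for the embedded user question
best = most_similar(query_vector, topic_vectors)

# The matched name, not the user's wording, is what goes into the Cypher query.
cypher = "MATCH (w:Works)-[:ASSOCIATION]->(t:Topics {display_name: $topic}) RETURN w.display_name"
params = {"topic": best[0]}
```

Passing the matched value as the $topic parameter means the Cypher query uses a name that actually exists in the graph, even if the user phrased it differently.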

By acquiring the data most similar to the user's query through semantic search and adjusting the query to fit the acquired data, we were able to obtain the desired answer.

In other words, even without knowing the precise terminology, we have built a QA System that allows us to obtain the desired answers!

This concludes the tutorial for the Academic Search System. Thank you for your hard work this time as well!
