Neo4j

Overview

This tutorial covers how to use Neo4j with LangChain.

Neo4j is a graph database that can also serve as a vector store, and it can be deployed locally or in the cloud.

To make full use of Neo4j, you need to learn Cypher, its declarative query language.

This tutorial walks you through CRUD operations with Neo4j: storing, updating, and deleting documents, and performing similarity-based retrieval.

Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup helpers, functions, and utilities for tutorials.

  • You can check out langchain-opentutorial for more details.

You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.
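
As a minimal sketch of the .env approach, the snippet below parses KEY=VALUE lines using only the standard library. In practice a package such as python-dotenv is the usual choice; the helper name `load_env_file` is hypothetical and is not part of langchain-opentutorial.

```python
import os
from pathlib import Path


def load_env_file(path: str, override: bool = False) -> dict:
    """Parse KEY=VALUE lines from a .env-style file and export them.

    Hypothetical helper for illustration only; dedicated packages handle
    more syntax (quoting rules, variable expansion, and so on).
    """
    loaded = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")
        # Keep an existing environment variable unless override is requested.
        if override or key not in os.environ:
            os.environ[key] = value
        loaded[key] = os.environ[key]
    return loaded
```

Calling `load_env_file(".env")` once at the top of the notebook makes keys such as OPENAI_API_KEY available through `os.environ`.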

Setup Neo4j

We have two options to start with: cloud or local deployment.

In this tutorial, we will use the cloud service called Aura, provided by Neo4j.

We will also describe how to deploy Neo4j using Docker.

  • Getting started with Aura

    You can create a new Neo4j Aura account on the Neo4j official website.

    Visit the website and click "Get Started Free" at the top right.

    Once you have signed in, click the Create instance button. Your username and password are then displayed.

    To save your credentials, click Download and continue to download a .txt file that contains the connection details for your Neo4j Aura instance.

  • Getting started with Docker

    The following describes how to run Neo4j using Docker.

    To run a Neo4j container, use the docker run command with the official neo4j image.

    See the Neo4j Docker installation reference for more details.
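
For the Docker option, the sketch below assembles a typical `docker run` invocation and prints it as a shell command. Ports 7474 (HTTP) and 7687 (Bolt) are Neo4j's defaults and `NEO4J_AUTH` is the environment variable the official image reads. The password, the image tag, and the helper name `neo4j_docker_command` are placeholders chosen for illustration.

```python
import shlex


def neo4j_docker_command(password: str, image: str = "neo4j:latest") -> list:
    """Build the argv for a local Neo4j container (illustrative values)."""
    return [
        "docker", "run",
        "--publish=7474:7474",  # Neo4j Browser over HTTP
        "--publish=7687:7687",  # Bolt protocol used by drivers
        "--env", f"NEO4J_AUTH=neo4j/{password}",  # initial username/password
        image,
    ]


command = neo4j_docker_command(password="your_password")
# Print the command so it can be copied into a terminal.
print(shlex.join(command))
```

The list form can also be passed to `subprocess.run(command)` if you prefer to start the container from Python.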

[NOTE]

  • Neo4j also supports native deployment on macOS, Windows and Linux. Visit the Neo4j Official Installation guide reference for more details.

  • The Neo4j community edition only supports one database.

What is Neo4j?

Neo4j is a native graph database, which means it represents data as nodes and edges.

  • Nodes

    • label: a tag that represents a node's role in a domain.

    • property: key-value pairs, e.g. name: John.

  • Edges

    • Represent a relationship between two nodes.

    • Directional: each edge has a start node and an end node.

    • property: like nodes, edges can have properties.

  • NoSQL

    • Neo4j does not require a predefined schema, which allows flexible data modeling.

  • Cypher

    • Neo4j uses Cypher, a declarative query language, to interact with the database.

    • Cypher expressions resemble how humans think about relationships.
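
To make the node, edge, and property concepts concrete, here is a small sketch that models them as Python dataclasses and renders one directed edge as a parameterized Cypher `CREATE` statement. The class names, the sample labels, and the `CARES_FOR` relationship type are invented for illustration and do not come from the tutorial's dataset.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    start: Node
    end: Node
    rel_type: str
    properties: dict = field(default_factory=dict)


def _props(alias: str, properties: dict, params: dict) -> str:
    """Render a property map that refers to query parameters, not literals."""
    if not properties:
        return ""
    parts = []
    for key, value in properties.items():
        param_name = f"{alias}_{key}"
        params[param_name] = value
        parts.append(f"{key}: ${param_name}")
    return " {" + ", ".join(parts) + "}"


def edge_to_cypher(edge: Edge) -> tuple:
    """Return (query, params) that create both nodes and the edge between them."""
    params = {}
    start = f"(a:{edge.start.label}{_props('a', edge.start.properties, params)})"
    end = f"(b:{edge.end.label}{_props('b', edge.end.properties, params)})"
    rel = f"[r:{edge.rel_type}{_props('r', edge.properties, params)}]"
    # The arrow encodes direction: start node -> end node.
    return f"CREATE {start}-{rel}->{end}", params


john = Node("Person", {"name": "John"})
rose = Node("Flower", {"name": "Rose"})
query, params = edge_to_cypher(Edge(john, rose, "CARES_FOR", {"since": 1943}))
print(query)
print(params)
```

The pattern `(a)-[r]->(b)` reads much like a sentence, which is what makes Cypher feel close to how relationships are described in plain language.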

Prepare Data

This section guides you through the data preparation process.

This section includes the following components:

  • Introduce Data

  • Preprocessing Data

Introduce Data

In this tutorial, we will use the fairy tale 📗 The Little Prince as our data.

This material complies with the Apache 2.0 license.

The data is provided in text (.txt) format, converted from the original PDF.

You can view the data at the link below.

Preprocessing Data

In this tutorial section, we will preprocess the text data from The Little Prince and convert it into a list of LangChain Document objects with metadata.

Each document chunk will include a title field in the metadata, extracted from the first line of each section.
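
The preprocessing step can be sketched as follows. This version assumes that sections are separated by blank lines and that the first line of each section is its title; the actual file layout may differ. The `Document` dataclass is a stand-in with the same two fields as LangChain's `Document`, used so the sketch runs without extra packages.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in with the same fields as LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_into_documents(text: str) -> list:
    """Split text into sections and use each section's first line as title.

    Assumption: sections are separated by one or more blank lines.
    """
    documents = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        title, body = lines[0], " ".join(lines[1:])
        # Fall back to the title when a section has no body text.
        documents.append(
            Document(page_content=body or title, metadata={"title": title})
        )
    return documents


sample = "Chapter 1\nFirst section body.\n\nChapter 2\nSecond section body."
docs = split_into_documents(sample)
print([doc.metadata["title"] for doc in docs])
```

With the real package installed, the stand-in class can be swapped for `from langchain_core.documents import Document` without changing the function.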

Setting up Neo4j

This part walks you through the initial setup of Neo4j.

This section includes the following components:

  • Load Embedding Model

  • Load Neo4j Client

  • Create Index

Load Embedding Model

In the Load Embedding Model section, you'll learn how to load an embedding model.

This tutorial uses an OpenAI API key to load the model.

💡 If you prefer to use another embedding model, see the instructions below.

Load Neo4j Client

In the Load Neo4j Client section, we cover how to load the database client object using the Python SDK for Neo4j.
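
A minimal sketch of loading the client with the official `neo4j` Python driver is shown below. The environment variable names (`NEO4J_URI`, `NEO4J_USERNAME`, `NEO4J_PASSWORD`) are an assumption based on the credentials file Aura offers for download, and the helper names are hypothetical; adjust them to your setup.

```python
import os


def neo4j_settings_from_env() -> dict:
    """Collect connection settings from environment variables.

    The variable names are assumptions; rename them if your
    credentials are stored under different keys.
    """
    settings = {
        "uri": os.environ.get("NEO4J_URI", "bolt://localhost:7687"),
        "username": os.environ.get("NEO4J_USERNAME", "neo4j"),
        "password": os.environ.get("NEO4J_PASSWORD", ""),
    }
    if not settings["password"]:
        raise ValueError("NEO4J_PASSWORD is not set")
    return settings


def connect(settings: dict):
    """Open a driver and fail fast if the server is unreachable."""
    # Imported lazily so this module loads even without the neo4j package.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(
        settings["uri"], auth=(settings["username"], settings["password"])
    )
    driver.verify_connectivity()
    return driver
```

A typical call is `client = connect(neo4j_settings_from_env())`; remember to call `client.close()` when you are done.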

Create Index

Once you have successfully connected to Neo4j Aura, some basic indexes already exist.

In this tutorial, however, we will create a new index with the Neo4jIndexManager class.
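
The internals of Neo4jIndexManager are not shown here, so the following is only a sketch of the kind of Cypher statement that creates a vector index. It uses the `CREATE VECTOR INDEX` syntax available in recent Neo4j 5 releases. The index name, node label, property name, and the dimension of 1536 are illustrative assumptions; the dimension must match the embedding model you actually load.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_vector_index_cypher(index_name, label, embedding_property="embedding",
                              dimensions=1536, similarity="cosine") -> str:
    """Return a CREATE VECTOR INDEX statement (hypothetical helper)."""
    for name in (index_name, label, embedding_property):
        # Identifiers cannot be passed as query parameters, so validate them.
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if similarity not in ("cosine", "euclidean"):
        raise ValueError("similarity must be 'cosine' or 'euclidean'")
    return (
        f"CREATE VECTOR INDEX {index_name} IF NOT EXISTS "
        f"FOR (n:{label}) ON (n.{embedding_property}) "
        "OPTIONS {indexConfig: {"
        f"`vector.dimensions`: {int(dimensions)}, "
        f"`vector.similarity_function`: '{similarity}'"
        "}}"
    )


statement = build_vector_index_cypher("tutorial_index", "Document")
print(statement)
```

The resulting string can be executed through a driver session, for example with `session.run(statement)`.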

Document Manager

To support the Langchain-Opentutorial, we implemented a custom set of CRUD functionalities for VectorDBs.

The following operations are included:

  • upsert : Update existing documents or insert if they don’t exist

  • upsert_parallel : Perform upserts in parallel for large-scale data

  • similarity_search : Search for similar documents based on embeddings

  • delete : Remove documents based on filter conditions

Each of these features is implemented as class methods specific to each VectorDB.

In this tutorial, you can easily utilize these methods to interact with your VectorDB.

We plan to continuously expand the functionality by adding more common operations in the future.

Create Instance

First, we create an instance of the Neo4j helper class to use its CRUD functionalities.

This class is initialized with the Neo4j Python SDK client instance, the index name, and the embedding model instance, all of which were defined in the previous sections.

Now you can use the following CRUD operations with the crud_manager instance.

This instance allows you to easily manage documents in your Neo4j database.

Upsert Document

Update existing documents or insert if they don’t exist

✅ Args

  • texts : Iterable[str] – List of text contents to be inserted/updated.

  • metadatas : Optional[List[Dict]] – List of metadata dictionaries for each text (optional).

  • ids : Optional[List[str]] – Custom IDs for the documents. If not provided, IDs will be auto-generated.

  • **kwargs : Extra arguments for the underlying vector store.

🔄 Return

  • None
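
The tutorial's own upsert implementation is not reproduced here. As a sketch of the idea, the function below prepares one row per text and a Cypher `MERGE` statement, which updates a node if its id already exists and creates it otherwise. The `embed` callable, the `Document` label, and the property names are assumptions made for illustration.

```python
import uuid


def build_upsert(texts, embed, metadatas=None, ids=None, label="Document"):
    """Return (query, params) for an id-based upsert (hypothetical helper).

    embed is any callable that maps a text to a list of floats.
    """
    texts = list(texts)
    metadatas = metadatas or [{} for _ in texts]
    # Auto-generate IDs when the caller does not supply them.
    ids = ids or [str(uuid.uuid4()) for _ in texts]
    if not (len(texts) == len(metadatas) == len(ids)):
        raise ValueError("texts, metadatas, and ids must have the same length")
    rows = [
        {"id": doc_id, "text": text, "embedding": embed(text), "metadata": metadata}
        for doc_id, text, metadata in zip(ids, texts, metadatas)
    ]
    query = (
        "UNWIND $rows AS row "
        f"MERGE (d:{label} {{id: row.id}}) "
        "SET d.text = row.text, d.embedding = row.embedding, d += row.metadata"
    )
    return query, {"rows": rows}
```

The pair can be sent in a single round trip with `session.run(query, params)`, so every document in the call is written by one statement.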

Upsert Parallel Document

Perform upserts in parallel for large-scale data

✅ Args

  • texts : Iterable[str] – List of text contents to be inserted/updated.

  • metadatas : Optional[List[Dict]] – List of metadata dictionaries for each text (optional).

  • ids : Optional[List[str]] – Custom IDs for the documents. If not provided, IDs will be auto-generated.

  • batch_size : int – Number of documents per batch (default: 32).

  • workers : int – Number of parallel workers (default: 10).

  • **kwargs : Extra arguments for the underlying vector store.

🔄 Return

  • None
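
The batching and worker parameters above can be illustrated with the standard library's thread pool. In this sketch, `upsert_batch` is a hypothetical callable that writes one batch to the database; the defaults mirror the documented `batch_size` of 32 and `workers` of 10. Unlike the documented method, the sketch returns the number of batches so the behavior is easy to inspect.

```python
from concurrent.futures import ThreadPoolExecutor


def batched(items, batch_size):
    """Split items into consecutive lists of at most batch_size elements."""
    items = list(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def upsert_parallel(texts, upsert_batch, batch_size=32, workers=10):
    """Run upsert_batch on each batch in a thread pool.

    upsert_batch is a hypothetical callable that writes one batch.
    """
    batches = batched(texts, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for completion and re-raises the first worker exception.
        list(pool.map(upsert_batch, batches))
    return len(batches)
```

Threads suit this workload because each batch spends most of its time waiting on network calls to the embedding API and the database.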

Similarity Search

Search for similar documents based on embeddings.

This method uses cosine similarity.

✅ Args

  • query : str – The text query for similarity search.

  • k : int – Number of top results to return (default: 10).

  • **kwargs : Additional search options (e.g., filters).

🔄 Return

  • results : List[Document] – A list of LangChain Document objects ranked by similarity.
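
Cosine similarity between a query vector and a document vector is their dot product divided by the product of their lengths. The brute-force sketch below ranks a handful of candidate vectors that way; in the database the ranking is served by the vector index rather than a Python loop, and the function names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return dot(a, b) / (|a| * |b|), or 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, candidates, k=10):
    """Rank (doc_id, vector) pairs by similarity to query_vector."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in candidates
    ]
    # Highest similarity first, then keep the k best.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


candidates = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
results = top_k([1.0, 0.0], candidates, k=2)
print(results)
```

Here "a" points in the same direction as the query and ranks first, "c" is at 45 degrees and ranks second, and the orthogonal "b" is dropped by `k=2`.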

Delete Document

Remove documents based on filter conditions

✅ Args

  • ids : Optional[List[str]] – List of document IDs to delete. If None, deletion is based on filter.

  • filters : Optional[Dict] – Dictionary specifying filter conditions (e.g., metadata match).

  • **kwargs : Any additional parameters.

🔄 Return

  • Boolean
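
As a sketch of how `ids` and `filters` could translate into a query, the function below builds a parameterized Cypher `DETACH DELETE` statement. The `Document` label, the `id` property, and the choice to combine both conditions with `AND` when both are given are assumptions for illustration, not the tutorial's confirmed behavior.

```python
def build_delete(ids=None, filters=None, label="Document"):
    """Return (query, params) that delete matching nodes (hypothetical helper)."""
    if not ids and not filters:
        # Guard against an unconditional MATCH that would delete every node.
        raise ValueError("provide ids or filters; refusing to delete everything")
    conditions, params = [], {}
    if ids:
        conditions.append("d.id IN $ids")
        params["ids"] = list(ids)
    for index, (key, value) in enumerate((filters or {}).items()):
        # Property names cannot be parameters, so validate them first.
        if not key.isidentifier():
            raise ValueError(f"unsafe property name: {key!r}")
        conditions.append(f"d.{key} = $filter_{index}")
        params[f"filter_{index}"] = value
    query = f"MATCH (d:{label}) WHERE {' AND '.join(conditions)} DETACH DELETE d"
    return query, params


query, params = build_delete(ids=["1"], filters={"title": "Chapter 1"})
print(query)
print(params)
```

`DETACH DELETE` removes a node together with its relationships, which avoids the error Neo4j raises when deleting a node that still has edges attached.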
