We built the Neo4jDB class on top of the official Neo4j Python driver. LangChain also provides a Neo4j vector store class, but it lacks some methods, such as delete. See neo4j_interface.py in the utils directory.
You can visit the Neo4j Docker installation reference for more detailed information.
[NOTE]
Neo4j also supports native deployment on macOS, Windows, and Linux. Visit the Neo4j official installation guide reference for more detail.
The Neo4j Community Edition supports only one database.
Credentials
Once you have successfully created your Aura account, you will receive your NEO4J_URI, NEO4J_USERNAME, and NEO4J_PASSWORD.
Add them to your environment variables or to your .env file.
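For example, a minimal sketch of the .env approach, assuming the python-dotenv package is installed (the values below are placeholders):
# Example .env contents (placeholders, not real credentials):
# NEO4J_URI=neo4j+s://<your-instance-id>.databases.neo4j.io
# NEO4J_USERNAME=neo4j
# NEO4J_PASSWORD=<your-password>

from dotenv import load_dotenv

# load variables from .env into the process environment
load_dotenv()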
import os
import time
from utils.neo4j_interface import Neo4jDB
# set uri, username, password
uri = os.getenv("NEO4J_URI")
username = os.getenv("NEO4J_USERNAME")
password = os.getenv("NEO4J_PASSWORD")
client = Neo4jDB(uri=uri, username=username, password=password)
Connected to Neo4j database
Connection info
URI=neo4j+s://3ed1167e.databases.neo4j.io
username=neo4j
Neo4j version is above 5.23
Once we have established a connection to the Aura Neo4j database, we can check the connection info using the get_api_key method.
# get connection info
client.get_api_key()
Initialization
If you are successfully connected to Neo4j Aura, some basic indexes are already created.
But in this tutorial we will create a new index and add items (nodes) to it.
To do this, we will first look at how to manage indexes:
List indexes
Create a new index
Delete an index
List Indexes
Before creating a new index, let's check the indexes already in the Neo4j database.
# get name list of indexes
names = client.list_indexes()
print(names)
['index_343aff4e', 'index_f7700477']
Create Index
Now we will create a new index.
This can be done by calling the create_index method, which returns an object connected to the newly created index.
If an index with the same name already exists, the method will print a notification.
When we create a new index, we must provide an embedding object (or the vector dimension) and the metric to use for similarity search.
In this tutorial we will pass OpenAIEmbeddings when we create the new index.
[ NOTE ]
If you pass the vector dimension instead of an embedding object, it must match the output dimension of your chosen embedding model.
An embedding object must have embed_query and embed_documents methods.
metric sets the distance method for similarity search. Neo4j supports cosine and euclidean.
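For illustration, here is a minimal sketch of a custom object that satisfies this interface; the class name and the zero-filled vectors are made up for the example.
from typing import List

class DummyEmbeddings:
    # toy embedding object exposing embed_query and embed_documents
    def __init__(self, dim: int = 1536):
        self.dim = dim

    def embed_query(self, text: str) -> List[float]:
        # return one fixed-size vector for a single query string
        return [0.0] * self.dim

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # return one vector per input document
        return [self.embed_query(t) for t in texts]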
# Initialize OpenAIEmbeddings
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
# set index_name and node_label
index_name = "tutorial_index"
node_label = "tutorial_node"
# create a new index
index = client.create_index(
embedding=embeddings, index_name=index_name, node_label=node_label
)
if isinstance(index, Neo4jDB):
    print("Index creation was successful")
# check name list of indexes
names = client.list_indexes()
print(names)
Created index information
Index name: tutorial_index
Node label: tutorial_node
Similarity metric: COSINE
Embedding dimension: 1536
Embedding node property: embedding
Text node property: text
Index creation was successful
['index_343aff4e', 'index_f7700477', 'tutorial_index']
Delete Index
We can delete a specific index by calling the delete_index method.
We will delete the tutorial_index we created above and then create it again for later use.
# delete index
client.delete_index("tutorial_index")
# print name list of indexes
names = client.list_indexes()
if "tutorial_index" not in names:
print(f"Index deleted succesfully ")
print(names)
# recreate the tutorial_index
index = client.create_index(
embedding=embeddings, index_name="tutorial_index", node_label="tutorial_node"
)
Index deleted successfully
['index_343aff4e', 'index_f7700477']
Created index information
Index name: tutorial_index
Node label: tutorial_node
Similarity metric: COSINE
Embedding dimension: 1536
Embedding node property: embedding
Text node property: text
Select Embedding Model
We can also change the embedding model.
In this subsection we use the text-embedding-3-large model to create a new index with it.
import getpass
import os
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")
from langchain_openai import OpenAIEmbeddings
embeddings_large = OpenAIEmbeddings(model="text-embedding-3-large")
# create new index
index2 = client.create_index(
embedding=embeddings_large,
index_name="tutorial_index_2",
node_label="tutorial_node_2",
)
Created index information
Index name: tutorial_index_2
Node label: tutorial_node_2
Similarity metric: COSINE
Embedding dimension: 3072
Embedding node property: embedding
Text node property: text
Data Preprocessing
Below is the preprocessing process for general documents.
Extract metadata from the documents.
Filter documents by minimum length.
Determine whether to use basename or not. The default is False.
basename denotes the last component of the filepath.
For example, document.pdf is the basename of the filepath ./data/document.pdf.
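As a quick illustration of what basename means here, using only the standard library:
import os

filepath = "./data/document.pdf"
print(os.path.basename(filepath))  # -> document.pdf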
# This is a long document we can split up.
data_path = "./data/the_little_prince.txt"
with open(data_path, encoding="utf8") as f:
    raw_text = f.read()
from langchain_text_splitters import RecursiveCharacterTextSplitter
# define text splitter
text_splitter = RecursiveCharacterTextSplitter(
# Set a really small chunk size, just to show.
chunk_size=100,
chunk_overlap=20,
length_function=len,
is_separator_regex=False,
)
# split raw text by splitter.
split_docs = text_splitter.create_documents([raw_text])
# print one of documents to check its structure
print(split_docs[0])
page_content='The Little Prince
Written By Antoine de Saiot-Exupery (1900〜1944)'
Now we preprocess the split documents to extract author, page, and source metadata, and fit the data into the format we will store in Neo4j.
# preprocess raw documents
processed_docs = client.preprocess_documents(
split_docs=split_docs,
metadata_keys=["source", "page", "author"],
min_length=5,
use_basename=True,
source=data_path,
author="Saiot-Exupery",
)
# print one of the preprocessed documents to check its structure
print(processed_docs[0])
page_content='The Little Prince
Written By Antoine de Saiot-Exupery (1900〜1944)' metadata={'source': 'the_little_prince.txt', 'page': 1, 'author': 'Saiot-Exupery'}
Manage vector store
Once you have created your vector store, you can interact with it by adding and deleting items.
You can also scroll data from the store with a filter or with a Cypher query.
Add items to vector store
We can add items to our vector store by using the upsert_documents or upsert_documents_parallel method.
If you pass ids along with documents, those ids will be used; if you do not pass ids, they will be created from page_content using the MD5 hash function.
Basically, the upsert_documents and upsert_documents_parallel methods perform an upsert, not an insert, based on the id of each item.
So if you provided an id and want to update the data later, you must provide the same id that you used at the first upsertion.
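As a rough sketch of that default id behavior (an illustration of the idea, not the library's exact implementation):
from hashlib import md5

def content_id(page_content: str) -> str:
    # derive a deterministic id from the text content
    return md5(page_content.encode("utf-8")).hexdigest()

# the same content always maps to the same id, so re-upserting it updates the node
print(content_id("The Little Prince"))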
We will upsert data into tutorial_index, using the upsert_documents method for the first half and upsert_documents_parallel for the second half.
from uuid import uuid4
# make ids for each document
uuids = [str(uuid4()) for _ in range(len(processed_docs))]
# upsert documents
total_number = len(processed_docs)
upsert_result = index.upsert_documents(
processed_docs[: total_number // 2], ids=uuids[: total_number // 2]
)
# upsert documents parallel
upsert_parallel_result = index.upsert_documents_parallel(
processed_docs[total_number // 2 :],
batch_size=32,
max_workers=8,
ids=uuids[total_number // 2 :],
)
result = upsert_result + upsert_parallel_result
# check number of ids upserted
print(len(result))
# check manual ids are the same as output ids
print("Manual Ids == Output Ids:", sorted(result) == sorted(uuids))
Delete items from vector store
We can delete nodes by filter or by ids with the delete_node method.
For example, we will delete the first page (page 1) of The Little Prince and then scroll the store to confirm it is gone.
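The cell below is a sketch of that step; it assumes delete_node accepts a filters argument symmetric to scroll_nodes (only the ids= form appears later in this tutorial).
# delete page 1 by filter (assumed signature, mirroring scroll_nodes)
result = index.delete_node(filters={"author": "Saiot-Exupery", "page": 1})
print(result)

# scroll to confirm page 1 is gone; the first remaining item should be page 2
remaining = index.scroll_nodes(filters={"author": "Saiot-Exupery"})
print(remaining[0])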
{'id': '8f9ed6b2-4fc5-4c23-a32b-d53acc72a68a',
'metadata': {'author': 'Saiot-Exupery',
'text': '[ Antoine de Saiot-Exupery ]',
'source': 'the_little_prince.txt',
'page': 2}}
Now delete 5 items using ids.
# delete item by ids
ids = uuids[1:6]
# call delete_node method
result = index.delete_node(ids=ids)
print(result)
True
# scroll vector index
result = index.scroll_nodes(limit=None)
print("The number of nodes in vector: {}".format(len(result)))
The number of nodes in vector: 1353
Scroll items from vector store
You can scroll items (nodes) in the store by calling the scroll_nodes method with filters or ids.
If you scroll by filter and pass multiple keys and values, they are treated as MUST conditions, meaning only nodes that match all of the conditions are returned.
# define scroll filter
filters = {"author": "Saiot-Exupery", "page": 10}
# get nodes
result = index.scroll_nodes(filters=filters)
print(result)
Scroll nodes by filter
[{'id': '8fcae3d1-8d41-4010-9458-6324a87c6cb4', 'metadata': {'author': 'Saiot-Exupery', 'text': 'learned to fly a plane. Five years later, he would leave the military in order to begin flying air', 'source': 'the_little_prince.txt', 'page': 10}}]
# get nodes by ids
result = index.scroll_nodes(ids=uuids[11])
print(result)
Scroll nodes by ids
[{'id': '9f4790f0-6f1b-428c-87c7-dbc3b909852a', 'metadata': {'author': 'Saiot-Exupery', 'text': 'For Saint-Exupéry, it was a grand adventure - one with dangers lurking at every corner. Flying his', 'source': 'the_little_prince.txt', 'page': 12}}]
(Advanced) Scroll items with query
The provided scroll_nodes method only supports AND conditions across multiple (key, value) pairs.
But if you use Cypher, more complicated conditions can be used to scroll items.
# create cypher query
query = "MATCH (n) WHERE n.page IN [10,11,12] AND n.author='Saiot-Exupery' RETURN n.page, n.author, n.text"
# scroll items with query
result = index.scroll_nodes(query=query)
for item in result:
    print(item)
Scroll nodes by query
{'n.page': 10, 'n.author': 'Saiot-Exupery', 'n.text': 'learned to fly a plane. Five years later, he would leave the military in order to begin flying air'}
{'n.page': 11, 'n.author': 'Saiot-Exupery', 'n.text': 'to begin flying air mail between remote settlements in the Sahara desert.'}
{'n.page': 12, 'n.author': 'Saiot-Exupery', 'n.text': 'For Saint-Exupéry, it was a grand adventure - one with dangers lurking at every corner. Flying his'}
Similarity search
Since Neo4j supports vector indexes, you can also perform similarity search.
The similarity is calculated with the metric you set when you created the index you search on.
In this tutorial we will search items on tutorial_index, which uses the cosine metric.
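As a small, purely illustrative reminder of what the cosine metric measures (Neo4j computes this internally):
import math

def cosine_similarity(a, b):
    # cosine of the angle between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [0.7, 0.7]))  # ~0.707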
To search, we call the search method.
You can pass raw text (to the query parameter) or the embedded vector of the text (to the embeded_query parameter) when calling search.
# do search. top_k is the number of documents in the result
res_with_text = index.search(query="Does the little prince have a friend?", top_k=5)
# print out top 2 results
print("RESULT BY RAW QUERY")
for i in range(2):
    print(res_with_text[i])
# embed query
embeded_query = embeddings.embed_query("Does the little prince have a friend?")
# do search with embeded vector value
res_with_embed = index.search(embeded_query=embeded_query, top_k=5)
# print out top 2 results
print()
print("RESULT BY EMBEDED QUERY")
for i in range(2):
    print(res_with_embed[i])
RESULT BY RAW QUERY
{'text': '"My friend the fox--" the little prince said to me.', 'metadata': {'id': '70d75baa-3bed-4751-b0cf-98157e190756', 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt', 'page': 1087, 'embedding': None}, 'score': 0.947}
{'text': 'And the little prince asked himself:', 'metadata': {'id': '9e779e02-1d2b-4252-a8f4-78bae7866af5', 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt', 'page': 492, 'embedding': None}, 'score': 0.946}
RESULT BY EMBEDED QUERY
{'text': '"My friend the fox--" the little prince said to me.', 'metadata': {'id': '70d75baa-3bed-4751-b0cf-98157e190756', 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt', 'page': 1087, 'embedding': None}, 'score': 0.947}
{'text': 'And the little prince asked himself:', 'metadata': {'id': '9e779e02-1d2b-4252-a8f4-78bae7866af5', 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt', 'page': 492, 'embedding': None}, 'score': 0.946}
That's it!
You now know the basics of how to use Neo4j.
If you want to do more advanced tasks, please refer to the official Neo4j API documentation and the official Neo4j Python SDK documentation.