PGVector is an open-source extension for PostgreSQL that allows you to store and search vector data alongside your regular database information.
This notebook shows how to use functionality related to PGVector, implementing the LangChain vectorstore abstraction with PostgreSQL as the backend and the pgvector extension.
from dotenv import load_dotenv
load_dotenv(override=True)
True
What is PGVector?
PGVector is a PostgreSQL extension that enables vector similarity search directly within your PostgreSQL database, making it ideal for AI applications, semantic search, and recommendation systems.
This is particularly valuable for teams that already use PostgreSQL and want to add vector search capabilities without managing separate infrastructure or learning new query languages.
Features:
Native PostgreSQL integration with standard SQL queries
Multiple similarity search methods, including L2 distance, inner product, and cosine distance
Several indexing options including HNSW and IVFFlat
Index support for vectors with up to 2,000 dimensions
ACID compliance inherited from PostgreSQL
Advantages:
Free and open-source
Easy integration with existing PostgreSQL databases
Full SQL functionality and transactional support
No additional infrastructure needed
Supports hybrid searches combining vector and traditional SQL queries
Disadvantages:
Performance limitations with very large datasets (billions of vectors)
Limited to single-node deployment
Memory-intensive for large vector dimensions
Requires manual optimization for best performance
Less specialized features compared to dedicated vector databases
Set up PGVector
If you are using Windows and have installed PostgreSQL for Windows, you need to install the pgvector extension separately. The following guide may help: Install pgvector on Windows.
If the pgvector container is running successfully, you can use pgVectorIndexManager from pgvector_interface in the utils directory to handle collections.
To initialize pgVectorIndexManager, you can pass a full connection string or pass each parameter separately.
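As a sketch, a full connection string typically follows the standard PostgreSQL connection URI format. The user, password, host, port, and database name below are placeholders, and the keyword argument name passed to pgVectorIndexManager is an assumption, not a verified signature:

```python
# Hypothetical connection parameters -- replace with your own.
user, password = "langchain", "langchain"
host, port, db_name = "localhost", 5432, "langchain"

# Standard PostgreSQL connection URI (psycopg driver).
conn_str = f"postgresql+psycopg://{user}:{password}@{host}:{port}/{db_name}"
print(conn_str)

# With a running pgvector container, the manager could then be initialized,
# e.g. (argument name assumed):
# from utils.pgvector_interface import pgVectorIndexManager
# index_manager = pgVectorIndexManager(connection=conn_str)
```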
When you initialize pgVectorIndexManager, it will automatically create two tables: langchain_pg_collection and langchain_pg_embedding.
langchain_pg_collection
Stores names of the collections.
Distinguishes collections by uuid and name.
langchain_pg_embedding
Stores actual data.
So, when you create a new collection and insert data into it, the data will be stored in the langchain_pg_embedding table.
As you can see below, the uuid column in the langchain_pg_collection table matches the collection_id column in the langchain_pg_embedding table.
Create collection
Now we can create a collection with index_manager.
To create a collection, pass an embedding model and a collection_name when calling the create_index method.
In this tutorial we will use OpenAI's text-embedding-3-large model.
If creation succeeds, the method returns a pgVectorDocumentManager instance that can handle the actual data.
In this tutorial we will create a collection named langchain_opentutorial.
import getpass
import os
if not os.environ.get("OPENAI_API_KEY"):
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
# create new collection
col_manager = index_manager.create_index(
collection_name="langchain_opentutorial", embedding=embeddings
)
List collections
As we have created a new collection, we will call the list_indexes method to check whether the collection was created.
As mentioned, creating a new collection with the create_index method automatically returns a pgVectorDocumentManager instance.
But if you want to re-use an already created collection, call the get_index method with the name of the collection and the embedding model used to create it.
# Get collection
col_manager_tmp2 = index_manager.get_index(
embedding=embeddings, collection_name="langchain_opentutorial"
)
Manage vector store
Once you have created your vector store, you can interact with it by adding and deleting items.
Filtering
pgVector supports the following filtering operators.
| Operator | Meaning/Category |
|----------|------------------|
| $eq | Equality (==) |
| $ne | Inequality (!=) |
| $lt | Less than (<) |
| $lte | Less than or equal (<=) |
| $gt | Greater than (>) |
| $gte | Greater than or equal (>=) |
| $in | Special cased (in) |
| $nin | Special cased (not in) |
| $between | Special cased (between) |
| $like | Text (like) |
| $ilike | Text (case-insensitive like) |
| $and | Logical (and) |
| $or | Logical (or) |
Filters can be used with the scroll, delete, and search methods.
To apply a filter, create a dictionary and pass it to the filter parameter, like the following:
{"page": {"$between": [10,20]}}
Connect to index
To add, delete, or search items, we need to initialize an object connected to the index we operate on.
We will connect to langchain_opentutorial . Recall that we used the basic OpenAIEmbeddings as the embedding function, and thus we need to pass it when we initialize the document manager object.
We can also get a pgVectorDocumentManager object when creating an index with pgVectorIndexManager or via the pgVectorIndexManager.get_index method, but this time we instantiate pgVectorDocumentManager directly.
from utils.pgvector_interface import pgVectorDocumentManager
# Get document manager
col_manager = pgVectorDocumentManager(
embedding=embeddings,
connection_info=conn_str,
collection_name="langchain_opentutorial",
)
Data Preprocessing
Below is the preprocessing process for general documents.
Extract metadata from the documents.
Filter documents by minimum length.
Determine whether to use basename or not. Default is False.
basename denotes the last component of the filepath.
For example, document.pdf is the basename of the filepath ./data/document.pdf .
# This is a long document we can split up.
data_path = "./data/the_little_prince.txt"
with open(data_path, encoding="utf8") as f:
raw_text = f.read()
from langchain_text_splitters import RecursiveCharacterTextSplitter
from uuid import uuid4
# define text splitter
text_splitter = RecursiveCharacterTextSplitter(
# Set a really small chunk size, just to show.
chunk_size=100,
chunk_overlap=20,
length_function=len,
is_separator_regex=False,
)
# split raw text by splitter.
split_docs = text_splitter.create_documents([raw_text])
# print one of the documents to check its structure
print(split_docs[0])
page_content='The Little Prince
Written By Antoine de Saiot-Exupery (1900〜1944)'
# define document preprocessor
def preprocess_documents(
split_docs, metadata_keys, min_length, use_basename=False, **kwargs
):
metadata = kwargs
if use_basename:
assert metadata.get("source", None) is not None, "source must be provided"
metadata["source"] = metadata["source"].split("/")[-1]
result_docs = []
for idx, doc in enumerate(split_docs):
if len(doc.page_content) < min_length:
continue
for k in metadata_keys:
doc.metadata.update({k: metadata.get(k, "")})
doc.metadata.update({"page": idx + 1, "id": str(uuid4())})
result_docs.append(doc)
return result_docs
# preprocess raw documents
processed_docs = preprocess_documents(
split_docs=split_docs,
metadata_keys=["source", "page", "author"],
min_length=5,
use_basename=True,
source=data_path,
author="Saiot-Exupery",
)
# print one of the preprocessed documents to check its structure
print(processed_docs[0])
page_content='The Little Prince
Written By Antoine de Saiot-Exupery (1900〜1944)' metadata={'source': 'the_little_prince.txt', 'page': 1, 'author': 'Saiot-Exupery', 'id': 'cc23e228-2540-4e5c-8eb3-be6df7a3bf77'}
Add items to vector store
We can add items to our vector store by using the upsert or upsert_parallel method.
If you pass ids along with documents, those ids will be used; if you do not pass ids, they will be generated from page_content using the md5 hash function.
Note that the upsert and upsert_parallel methods perform an upsert, not an insert, based on the id of the item.
So if you provided an id and want to update the data, you must provide the same id you used at the first upsertion.
We will upsert data to the collection langchain_opentutorial , using the upsert method for the first half and upsert_parallel for the second half.
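As a sketch of the fallback id generation described above, a deterministic id can be derived from the page content with md5 (the plain hex-digest form below is an assumption; the exact scheme used by the interface may differ):

```python
import hashlib

def content_id(page_content: str) -> str:
    # Deterministic id derived from the content itself:
    # the same text always maps to the same id.
    return hashlib.md5(page_content.encode("utf-8")).hexdigest()

print(content_id("The Little Prince"))
```

Because the id is a pure function of the text, re-upserting an unchanged document overwrites the same row instead of creating a duplicate.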
# Gather uuids, texts, metadatas
uuids = [doc.metadata["id"] for doc in processed_docs]
texts = [doc.page_content for doc in processed_docs]
metadatas = [doc.metadata for doc in processed_docs]
# Get total number of documents
total_number = len(processed_docs)
print("Number of documents:", total_number)
CPU times: user 1.79 s, sys: 82.9 ms, total: 1.88 s
Wall time: 4.96 s
result = upsert_result + upsert_parallel_result
# check number of ids upserted
print(len(result))
# check manual ids are the same as output ids
print("Manual Ids == Output Ids:", sorted(result) == sorted(uuids))
1359
Manual Ids == Output Ids: True
[ NOTE ]
As we have only one table, langchain_pg_embedding , to store data, there is only one column, cmetadata , to store metadata for each document.
The cmetadata column is of jsonb type, so if you want to update the metadata, you must provide not only the new key-value pairs you want to update but also all the metadata already stored.
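Since cmetadata is written back as a whole, an update should merge the stored metadata with the new values first. A minimal sketch of that merge pattern (old_meta stands in for metadata previously fetched, e.g. via scroll):

```python
# Metadata previously stored for a document (e.g. fetched with scroll).
old_meta = {"source": "the_little_prince.txt", "page": 1, "author": "Saiot-Exupery"}

# Only the key we want to change...
patch = {"author": "Antoine de Saint-Exupery"}

# ...but the full merged dict must be written back,
# otherwise the keys missing from patch would be lost.
new_meta = {**old_meta, **patch}
print(new_meta)
```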
Scroll items from vector store
As we have added some items to our first vector store, named langchain_opentutorial , we can scroll items from the vector store.
This can be done by calling scroll method.
When we scroll items from the vector store, we can pass ids or a filter to get the items we want, or just call scroll to get k (default 10) items.
We can get the embedded vector values of each item by setting include_embedding to True.
# Do scroll without ids or filter
scroll_result = col_manager.scroll()
# print the number of items scrolled and first item that returned.
print(f"Number of items scrolled: {len(scroll_result)}")
print(scroll_result[0])
Number of items scrolled: 10
{'content': 'The Little Prince\nWritten By Antoine de Saiot-Exupery (1900〜1944)', 'metadata': {'id': 'cc23e228-2540-4e5c-8eb3-be6df7a3bf77', 'page': 1, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None}
# Do scroll with filter
scroll_result = col_manager.scroll(filter={"page": {"$in": [1, 2, 3]}})
# print the number of items scrolled and all items that returned.
print(f"Number of items scrolled: {len(scroll_result)}")
for r in scroll_result:
print(r)
Number of items scrolled: 3
{'content': 'The Little Prince\nWritten By Antoine de Saiot-Exupery (1900〜1944)', 'metadata': {'id': 'cc23e228-2540-4e5c-8eb3-be6df7a3bf77', 'page': 1, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None}
{'content': '[ Antoine de Saiot-Exupery ]', 'metadata': {'id': 'd4bf8981-2af4-4288-8aaf-6586381973c4', 'page': 2, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None}
{'content': 'Over the past century, the thrill of flying has inspired some to perform remarkable feats of', 'metadata': {'id': '31dc52cf-530b-449c-a3db-ec64d9e1a10c', 'page': 3, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None}
# Do scroll with ids
scroll_result = col_manager.scroll(ids=uuids[:3])
# print the number of items scrolled and all items that returned.
print(f"Number of items scrolled: {len(scroll_result)}")
for r in scroll_result:
print(r)
Number of items scrolled: 3
{'content': 'The Little Prince\nWritten By Antoine de Saiot-Exupery (1900〜1944)', 'metadata': {'id': 'cc23e228-2540-4e5c-8eb3-be6df7a3bf77', 'page': 1, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None}
{'content': '[ Antoine de Saiot-Exupery ]', 'metadata': {'id': 'd4bf8981-2af4-4288-8aaf-6586381973c4', 'page': 2, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None}
{'content': 'Over the past century, the thrill of flying has inspired some to perform remarkable feats of', 'metadata': {'id': '31dc52cf-530b-449c-a3db-ec64d9e1a10c', 'page': 3, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None}
Delete items from vector store
We can delete items by filter or ids with the delete method.
For example, we will delete the first page (page 1) of The Little Prince and then try to scroll it.
# delete an item
col_manager.delete(filter={"page": {"$eq": 1}})
# check if it remains in DB.
print(col_manager.scroll(filter={"page": {"$eq": 1}}))
Delete done successfully
[]
Now we delete 5 items using ids.
# delete item by ids
ids = uuids[1:6]
# call delete method
col_manager.delete(ids=ids)
# check if it remains in DB.
print(col_manager.scroll(ids=ids))
Delete done successfully
[]
Similarity search
As a vector store, pgVector supports similarity search with various distance metrics: l2 , inner (max inner product), and cosine .
By default, the distance strategy is set to cosine.
Similarity search is done by calling the search method.
You can set the number of retrieved documents by passing k (default 4).
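As a sketch of the default metric, cosine similarity ranks documents by the angle between their embedding vectors. This pure-Python illustration shows the formula itself, not the library's internal implementation:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product of the two vectors divided by the product of their norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0
```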
results = col_manager.search(query="Does the little prince have a friend?", k=5)
for doc in results:
print(doc)
{'content': '"My friend the fox--" the little prince said to me.', 'metadata': {'id': 'b02aaaa0-9352-403a-8924-cfff4973b926', 'page': 1087, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None, 'score': 0.631413271508214}
{'content': '"No," said the little prince. "I am looking for friends. What does that mean-- ‘tame‘?"', 'metadata': {'id': '48adae15-36ba-4384-8762-0ef3f0ac33a3', 'page': 958, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None, 'score': 0.6050397117589812}
{'content': 'the little prince returns to his planet', 'metadata': {'id': '4ed37f54-5619-4fc9-912b-4a37fb5a5625', 'page': 1202, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None, 'score': 0.5846221199406966}
{'content': 'midst of the Sahara where he meets a tiny prince from another world traveling the universe in order', 'metadata': {'id': '28b44d4b-cf4e-4cb9-983b-7fb3ec735609', 'page': 25, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None, 'score': 0.5682375512406654}
{'content': '[ Chapter 2 ]\n- the narrator crashes in the desert and makes the acquaintance of the little prince', 'metadata': {'id': '2a4e0184-bc2c-4558-8eaa-63a1a13da3a0', 'page': 85, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None, 'score': 0.555493427632688}
Similarity search with filters
You can also do similarity search with a filter, as we did with scroll and delete.
# search with filter
result_with_filter = col_manager.search(
"Does the little prince have a friend?",
filter={"page": {"$between": [100, 110]}},
k=5,
)
for doc in result_with_filter:
print(doc)
{'content': 'inhabited region. And yet my little man seemed neither to be straying uncertainly among the sands,', 'metadata': {'id': '1be69712-f0f4-4728-b6f2-d4cf12cddfdb', 'page': 107, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None, 'score': 0.23158187113240447}
{'content': 'Nothing about him gave any suggestion of a child lost in the middle of the desert, a thousand miles', 'metadata': {'id': 'df4ece8c-dcb6-400e-9d8e-0eb5820a5c4e', 'page': 109, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None, 'score': 0.18018012822748797}
{'content': 'among the sands, nor to be fainting from fatigue or hunger or thirst or fear. Nothing about him', 'metadata': {'id': '71b4297c-3b76-43cb-be6a-afca5f59388d', 'page': 108, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None, 'score': 0.17715921622781305}
{'content': 'less charming than its model.', 'metadata': {'id': '507267bc-7076-42f7-ad7c-ed1f835663f2', 'page': 100, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None, 'score': 0.16131896837723747}
{'content': 'a thousand miles from any human habitation. When at last I was able to speak, I said to him:', 'metadata': {'id': '524af6ff-1370-4c20-ad94-1b37e45fe0c5', 'page': 110, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None, 'score': 0.15769872390077566}