Chroma


Overview

This tutorial covers how to use Chroma with LangChain.

Chroma is an open-source vector database optimized for semantic search and RAG applications. It offers fast similarity search, metadata filtering, and supports both in-memory and persistent storage. With built-in or custom embedding functions and a simple Python API, it's easy to integrate into ML pipelines.

This tutorial walks you through CRUD operations with Chroma: storing, updating, and deleting documents, and performing similarity-based retrieval.



Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials.

  • You can check out langchain-opentutorial for more details.

You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.
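As a sketch, a .env file can be loaded with python-dotenv's load_dotenv, or with a few lines of standard-library code like the following (the key value here is a dummy placeholder, not a real key):

```python
import os
import tempfile

def load_env_file(path):
    """Minimal stdlib .env loader: KEY=VALUE per line, '#' comments allowed.
    (python-dotenv is the usual tool; this sketch avoids the dependency.)"""
    loaded = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            loaded[key.strip()] = value.strip().strip("'\"")
    os.environ.update(loaded)
    return loaded

# Demo with a throwaway file and a dummy value
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("OPENAI_API_KEY=dummy-value-for-demo\n")
    path = f.name
load_env_file(path)
print(os.environ["OPENAI_API_KEY"])  # dummy-value-for-demo
```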

What is Chroma?


Chroma is an open-source embedding database built for enabling semantic search in AI applications. It is commonly used in Retrieval-Augmented Generation (RAG) pipelines to manage and search document embeddings efficiently.

Unlike traditional databases or pure vector stores, Chroma combines vector similarity with structured metadata filtering.

This allows developers to build hybrid search systems that consider both the meaning of the text and metadata constraints.

Key Features

  • Easy-to-use API : Simplifies vector management and querying through a clean Python interface.

  • Persistent storage : Supports both in-memory and on-disk storage for scalable deployment.

  • Metadata filtering : Enables precise search using custom fields stored alongside vectors.

  • Built-in similarity search : Provides fast approximate nearest-neighbor (ANN) retrieval using cosine distance.

  • Local-first and open-source : No cloud lock-in; can run entirely on local or edge environments.
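To illustrate the hybrid idea — structured metadata filtering first, vector similarity ranking second — here is a minimal pure-Python sketch (the records, vectors, and field names are made up for the example):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Each record pairs a vector with metadata, mirroring how Chroma stores them
records = [
    {"id": "d1", "vec": [1.0, 0.0], "meta": {"chapter": 1}},
    {"id": "d2", "vec": [0.9, 0.1], "meta": {"chapter": 2}},
    {"id": "d3", "vec": [0.0, 1.0], "meta": {"chapter": 1}},
]

def hybrid_search(query_vec, where, k=2):
    # 1) keep only records whose metadata matches the filter
    pool = [r for r in records
            if all(r["meta"].get(f) == v for f, v in where.items())]
    # 2) rank the survivors by vector similarity
    pool.sort(key=lambda r: cosine(query_vec, r["vec"]), reverse=True)
    return [r["id"] for r in pool[:k]]

print(hybrid_search([1.0, 0.0], {"chapter": 1}))  # ['d1', 'd3']
```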

Prepare Data

This section guides you through the data preparation process.

This section includes the following components:

  • Data Introduction

  • Preprocess Data

Data Introduction

In this tutorial, we will use the fairy tale 📗 The Little Prince in PDF format as our data.

This material complies with the Apache 2.0 license.

The data is used in text (.txt) format, converted from the original PDF.

You can view the data at the link below.

Preprocess Data

In this tutorial section, we will preprocess the text data from The Little Prince and convert it into a list of LangChain Document objects with metadata.

Each document chunk will include a title field in the metadata, extracted from the first line of each section.
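A minimal sketch of that splitting step, assuming sections are separated by blank lines and each begins with a title line (plain dicts stand in for LangChain Document objects here):

```python
# Made-up excerpt standing in for the converted .txt data
raw_text = """Chapter 1
Once when I was six years old I saw a magnificent picture.

Chapter 2
I lived my life alone, without anyone I could really talk to."""

documents = []
for section in raw_text.split("\n\n"):
    lines = section.strip().splitlines()
    if not lines:
        continue
    # First line of each section becomes the title metadata field
    title, body = lines[0], "\n".join(lines[1:])
    documents.append({"page_content": body, "metadata": {"title": title}})

print(documents[0]["metadata"])  # {'title': 'Chapter 1'}
```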

Setting up Chroma

This part walks you through the initial setup of Chroma.

This section includes the following components:

  • Load Embedding Model

  • Load Chroma Client

Load Embedding Model

In this section, you'll learn how to load an embedding model.

This tutorial uses OpenAI's API key to load the model.

💡 If you prefer to use another embedding model, see the instructions below.
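For offline experimentation, any object exposing the embed_documents / embed_query interface that LangChain embedding models share will do. The class below is a deterministic toy stand-in, not a real semantic model:

```python
import hashlib
import math

class HashEmbeddings:
    """Deterministic toy exposing the embed_documents / embed_query
    interface shared by LangChain embedding models. NOT a semantic
    model: vectors come from a hash, so similarity is meaningless."""

    def __init__(self, dim: int = 8):
        self.dim = dim

    def _embed(self, text: str):
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        vec = [digest[i % len(digest)] / 255.0 for i in range(self.dim)]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]  # unit length, handy for cosine

    def embed_documents(self, texts):
        return [self._embed(t) for t in texts]

    def embed_query(self, text):
        return self._embed(text)

print(len(HashEmbeddings().embed_query("hello")))  # 8
```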

Load Chroma Client

In this section, we'll show you how to load the database client object using the Python SDK for Chroma.

Document Manager

For the LangChain-OpenTutorial, we have implemented a custom set of CRUD functionalities for VectorDBs.

The following operations are included:

  • upsert : Update existing documents or insert if they don’t exist

  • upsert_parallel : Perform upserts in parallel for large-scale data

  • similarity_search : Search for similar documents based on embeddings

  • delete : Remove documents based on filter conditions

Each of these features is implemented as class methods specific to each VectorDB.

In this tutorial, you'll learn how to use these methods to interact with your VectorDB.

We plan to continuously expand the functionality by adding more common operations in the future.

Create Instance

First, create an instance of the Chroma helper class to use its CRUD functionalities.

This class is initialized with the Chroma Python SDK client instance and the embedding model instance, both of which were defined in the previous section.

Now you can use the following CRUD operations with the crud_manager instance.

This instance allows you to easily manage documents in your Chroma collection.
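Since the crud_manager class itself lives in the tutorial package, here is a toy in-memory stand-in that mirrors the interface (upsert / search / delete); the embedding function is a made-up word-count toy, not a real model:

```python
class InMemoryDocumentManager:
    """Toy stand-in for the tutorial's CRUD manager
    (interface only: upsert / search / delete; not the real class)."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # maps text -> list of floats
        self.store = {}           # id -> (vector, text, metadata)

    def upsert(self, texts, metadatas=None, ids=None):
        texts = list(texts)
        ids = ids or [str(i) for i in range(len(texts))]
        metadatas = metadatas or [{} for _ in texts]
        # Existing IDs are overwritten; new IDs are inserted
        for i, text, meta in zip(ids, texts, metadatas):
            self.store[i] = (self.embed_fn(text), text, meta)

    def search(self, query, k=10):
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = sum(x * x for x in a) ** 0.5
            nb = sum(y * y for y in b) ** 0.5
            return dot / (na * nb)
        q = self.embed_fn(query)
        ranked = sorted(self.store.values(),
                        key=lambda v: cos(q, v[0]), reverse=True)
        return [text for _, text, _ in ranked[:k]]

    def delete(self, ids=None, filters=None):
        if ids is not None:
            for i in ids:
                self.store.pop(i, None)
        elif filters is not None:
            doomed = [i for i, (_, _, m) in self.store.items()
                      if all(m.get(f) == v for f, v in filters.items())]
            for i in doomed:
                del self.store[i]

# Toy word-count embedding -- always non-zero, so cosine is well-defined
embed = lambda t: [t.count("prince") + 0.1, t.count("rose") + 0.1]

mgr = InMemoryDocumentManager(embed)
mgr.upsert(
    texts=["the prince", "a rose garden"],
    metadatas=[{"ch": 1}, {"ch": 2}],
    ids=["a", "b"],
)
print(mgr.search("prince", k=1))  # ['the prince']
mgr.delete(filters={"ch": 2})     # removes document "b"
```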

Upsert Document

Update existing documents or insert if they don’t exist

✅ Args

  • texts : Iterable[str] – List of text contents to be inserted/updated.

  • metadatas : Optional[List[Dict]] – List of metadata dictionaries for each text (optional).

  • ids : Optional[List[str]] – Custom IDs for the documents. If not provided, IDs will be auto-generated.

  • **kwargs : Extra arguments for the underlying vector store.

🔄 Return

  • None

Upsert Parallel

Perform upsert in parallel for large-scale data

✅ Args

  • texts : Iterable[str] – List of text contents to be inserted/updated.

  • metadatas : Optional[List[Dict]] – List of metadata dictionaries for each text (optional).

  • ids : Optional[List[str]] – Custom IDs for the documents. If not provided, IDs will be auto-generated.

  • batch_size : int – Number of documents per batch (default: 32).

  • workers : int – Number of parallel workers (default: 10).

  • **kwargs : Extra arguments for the underlying vector store.

🔄 Return

  • None

Similarity Search

Search for similar documents based on embeddings.

This method uses cosine similarity.

✅ Args

  • query : str – The text query for similarity search.

  • k : int – Number of top results to return (default: 10).

  • **kwargs : Additional search options (e.g., filters).

🔄 Return

  • results : List[Document] – A list of LangChain Document objects ranked by similarity.

as_retriever

The as_retriever() method creates a LangChain-compatible retriever wrapper.

This function allows a DocumentManager class to return a retriever object by wrapping the internal search() method, while staying lightweight and independent from full LangChain VectorStore dependencies.

The retriever obtained through this function is compatible with existing LangChain retrievers and can be used in LangChain Pipelines (e.g., RetrievalQA, ConversationalRetrievalChain, Tool, etc.)

✅ Args

  • search_fn : Callable - The function used to retrieve relevant documents. Typically this is self.search from a DocumentManager instance.

  • search_kwargs : Optional[Dict] - A dictionary of keyword arguments passed to search_fn, such as k for top-K results or metadata filters.

🔄 Return

  • LightCustomRetriever : BaseRetriever - A lightweight LangChain-compatible retriever that internally uses the given search_fn and search_kwargs.
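The wrapper pattern can be sketched without any LangChain dependency as a closure over search_fn and search_kwargs (the corpus and search function below are toys; the real as_retriever returns a BaseRetriever subclass):

```python
def as_retriever(search_fn, search_kwargs=None):
    """Bind fixed keyword arguments to a search function, mirroring the
    as_retriever(search_fn, search_kwargs) signature described above."""
    search_kwargs = search_kwargs or {}
    def invoke(query):
        return search_fn(query, **search_kwargs)
    return invoke

# Toy corpus and substring search standing in for DocumentManager.search
corpus = {"d1": "the little prince", "d2": "the rose"}

def search(query, k=10):
    hits = [text for text in corpus.values() if query in text]
    return hits[:k]

retriever = as_retriever(search, {"k": 1})
print(retriever("prince"))  # ['the little prince']
```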

Delete Document

Delete documents based on filter conditions

✅ Args

  • ids : Optional[List[str]] – List of document IDs to delete. If None, deletion is based on filter.

  • filters : Optional[Dict] – Dictionary specifying filter conditions (e.g., metadata match).

  • **kwargs : Any additional parameters.

🔄 Return

  • None
