Elasticsearch


Overview

This tutorial covers how to use Elasticsearch with LangChain.

It walks you through CRUD operations with Elasticsearch: storing, updating, and deleting documents, as well as performing similarity-based retrieval.



Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials.

  • You can check out langchain-opentutorial for more details.

You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.
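Loading keys from a .env file is usually done with the python-dotenv package's load_dotenv(); the following is a minimal standard-library stand-in that shows what that step does:

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Load KEY=VALUE pairs from a .env file into os.environ.

    Existing environment variables are not overwritten.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and comments
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip('"'))
```

In practice, prefer `from dotenv import load_dotenv; load_dotenv()` — this sketch only illustrates the mechanism.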

Setup Elasticsearch

To use Elasticsearch vector search, you must first install the langchain-elasticsearch package.

🚀 Setting Up Elasticsearch with Elastic Cloud (Colab Compatible)

  • Elastic Cloud allows you to manage Elasticsearch seamlessly in the cloud, eliminating the need for local installations.

  • It integrates well with Google Colab, enabling efficient experimentation and prototyping.

📚 What is Elastic Cloud?

  • Elastic Cloud is a managed Elasticsearch service provided by Elastic.

  • Supports custom cluster configurations and auto-scaling.

  • Deployable on AWS, GCP, and Azure.

  • Compatible with Google Colab, allowing simplified cloud-based workflows.

📌 Getting Started with Elastic Cloud

  1. Sign up for Elastic Cloud's Free Trial.

  2. Create an Elasticsearch cluster.

  3. Retrieve your Elasticsearch URL and Elasticsearch API Key from the Elastic Cloud Console.

  4. Add the following to your .env file:
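The exact variable names depend on how the client-loading code later in the tutorial reads them; the entries below are illustrative placeholders, not required names:

```
# Hypothetical variable names -- match whatever your client code reads
ES_URL="https://your-deployment.es.region.cloud.es.io:443"
ES_API_KEY="your-elasticsearch-api-key"
OPENAI_API_KEY="your-openai-api-key"
```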

What is Elasticsearch?

Elasticsearch is an open-source, distributed search and analytics engine designed to store, search, and analyze both structured and unstructured data in real-time.

Key Features

  • Real-Time Search: Instantly searchable data upon ingestion

  • Large-Scale Data Processing: Efficient handling of vast datasets

  • Scalability: Flexible scaling through clustering and distributed architecture

  • Versatile Search Support: Keyword search, semantic search, and multimodal search

Use Cases

  • Log Analytics: Real-time monitoring of system and application logs

  • Monitoring: Server and network health tracking

  • Product Recommendations: Behavior-based recommendation systems

  • Natural Language Processing (NLP): Semantic text searches

  • Multimodal Search: Text-to-image and image-to-image searches

Vector Database Functionality in Elasticsearch

  • Elasticsearch supports vector data storage and similarity search via Dense Vector Fields. As a vector database, it excels in applications like NLP, image search, and recommendation systems.

Core Vector Database Features

  • Dense Vector Field: Store and query high-dimensional vectors

  • KNN (k-Nearest Neighbors) Search: Find vectors most similar to the input

  • Semantic Search: Perform meaning-based searches beyond keyword matching

  • Multimodal Search: Combine text and image data for advanced search capabilities
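The kNN search feature listed above is expressed through Elasticsearch's search API as a request body like the following; the field name `embedding` and the vector values are illustrative, and a real query vector would come from your embedding model:

```python
# Shape of an Elasticsearch kNN search request body (Elasticsearch 8.x style).
# The field name "embedding" and the query vector are illustrative.
knn_body = {
    "knn": {
        "field": "embedding",
        "query_vector": [0.12, -0.45, 0.33],  # embedding of the query text
        "k": 10,                  # number of nearest neighbors to return
        "num_candidates": 100,    # candidates considered per shard before ranking
    },
    "_source": ["text", "metadata"],  # fields to return with each hit
}
```

With a connected client, this body would be passed to a search call against your index.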

Vector Search Use Cases

  • Semantic Search: Understand user intent and deliver precise results

  • Text-to-Image Search: Retrieve relevant images from textual descriptions

  • Image-to-Image Search: Find visually similar images in a dataset

Elasticsearch goes beyond traditional text search engines, offering robust vector database capabilities essential for NLP and multimodal search applications.


Prepare Data

This section guides you through the data preparation process.

This section includes the following components:

  • Data Introduction

  • Preprocess Data

Data Introduction

In this tutorial, we will use the fairy tale 📗 The Little Prince in PDF format as our data.

This material is made available under the Apache 2.0 license.

The data is used in a text (.txt) format converted from the original PDF.

You can view the data at the link below.

Preprocess Data

In this tutorial section, we will preprocess the text data from The Little Prince and convert it into a list of LangChain Document objects with metadata.

Each document chunk will include a title field in the metadata, extracted from the first line of each section.
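The splitting step described above can be sketched as follows. This is a simplified stand-in that returns plain dictionaries rather than LangChain Document objects, and it assumes sections are separated by blank lines; the tutorial's actual splitting logic may differ:

```python
def preprocess_sections(text: str) -> list[dict]:
    """Split raw text into section chunks, using each section's
    first line as the `title` metadata field."""
    docs = []
    # Assumption: sections are separated by two consecutive newlines
    for section in text.split("\n\n"):
        section = section.strip()
        if not section:
            continue
        title = section.splitlines()[0].strip()
        docs.append({"page_content": section, "metadata": {"title": title}})
    return docs
```

In the tutorial itself, each resulting chunk would be wrapped in a `langchain_core.documents.Document` with the same `metadata`.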

Setting up Elasticsearch

This part walks you through the initial setup of Elasticsearch.

This section includes the following components:

  • Load Embedding Model

  • Load Elasticsearch Client

Load Embedding Model

In this section, you'll learn how to load an embedding model.

This tutorial uses OpenAI's embedding model, which requires an OpenAI API key.

💡 If you prefer to use another embedding model, see the instructions below.

Load Elasticsearch Client

In this section, we'll show you how to load the database client object using the Python SDK for Elasticsearch.

Create Index

If you are successfully connected to Elasticsearch, some built-in indexes already exist.

In this tutorial, however, we will create a new index with the ElasticsearchIndexManager class.
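Under the hood, a vector-capable index is defined by a mapping with a `dense_vector` field. The field names, index name, and dimension below are illustrative (1536 matches OpenAI's text-embedding-3-small model); the ElasticsearchIndexManager class may use different names:

```python
# Illustrative mapping for a vector search index.
index_mapping = {
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "metadata": {"properties": {"title": {"type": "keyword"}}},
            "vector": {
                "type": "dense_vector",
                "dims": 1536,            # must match the embedding model's output size
                "index": True,           # enable kNN search on this field
                "similarity": "cosine",  # distance metric used for ranking
            },
        }
    }
}
# With a connected client, the index could then be created via something like:
# es_client.indices.create(index="tutorial_index", **index_mapping)
```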

Delete Index

If you want to remove an existing index from Elasticsearch, you can use the ElasticsearchIndexManager class to delete it easily.

This is useful when you want to reset your data or clean up unused indexes during development or testing.

To proceed with the tutorial, let’s create the index once again.

Document Manager

For the LangChain-OpenTutorial, we have implemented a custom set of CRUD functionalities for VectorDBs.

The following operations are included:

  • upsert : Update existing documents or insert if they don’t exist

  • upsert_parallel : Perform upserts in parallel for large-scale data

  • similarity_search : Search for similar documents based on embeddings

  • delete : Remove documents based on filter conditions

Each of these features is implemented as class methods specific to each VectorDB.

In this tutorial, you'll learn how to use these methods to interact with your VectorDB.

We plan to continuously expand the functionality by adding more common operations in the future.

Create Instance

First, create an instance of the Elasticsearch helper class to use its CRUD functionalities.

This class is initialized with the Elasticsearch Python SDK client instance and the embedding model instance, both of which were defined in the previous section.

Now you can use the following CRUD operations with the crud_manager instance.

This instance allows you to easily manage documents in your Elasticsearch index.

Upsert Document

Update existing documents or insert if they don’t exist

✅ Args

  • texts : Iterable[str] – List of text contents to be inserted/updated.

  • metadatas : Optional[List[Dict]] – List of metadata dictionaries for each text (optional).

  • ids : Optional[List[str]] – Custom IDs for the documents. If not provided, IDs will be auto-generated.

  • **kwargs : Extra arguments for the underlying vector store.

🔄 Return

  • None
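The update-or-insert semantics can be illustrated with a toy in-memory store; this is not the tutorial's crud_manager implementation, just a sketch of the same contract:

```python
import uuid

class InMemoryStore:
    """Toy stand-in illustrating upsert semantics (not the tutorial's class)."""

    def __init__(self):
        self.docs = {}

    def upsert(self, texts, metadatas=None, ids=None, **kwargs):
        # Auto-generate IDs when none are provided, mirroring the real signature
        ids = ids or [str(uuid.uuid4()) for _ in texts]
        metadatas = metadatas or [{} for _ in texts]
        for doc_id, text, meta in zip(ids, texts, metadatas):
            # Overwrites the document if the ID exists, inserts otherwise
            self.docs[doc_id] = {"text": text, "metadata": meta}
```

Calling `upsert` twice with the same ID leaves a single, updated document rather than a duplicate.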

Upsert Parallel

Perform upserts in parallel for large-scale data

✅ Args

  • texts : Iterable[str] – List of text contents to be inserted/updated.

  • metadatas : Optional[List[Dict]] – List of metadata dictionaries for each text (optional).

  • ids : Optional[List[str]] – Custom IDs for the documents. If not provided, IDs will be auto-generated.

  • batch_size : int – Number of documents per batch (default: 32).

  • workers : int – Number of parallel workers (default: 10).

  • **kwargs : Extra arguments for the underlying vector store.

🔄 Return

  • None
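The batching-plus-workers idea behind `upsert_parallel` can be sketched with the standard library; `upsert_batch` stands in for whatever per-batch call the real implementation makes against the vector store:

```python
from concurrent.futures import ThreadPoolExecutor

def upsert_parallel(texts, upsert_batch, batch_size=32, workers=10):
    """Split `texts` into batches and run `upsert_batch` on each batch concurrently."""
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and surfaces any worker exceptions
        list(pool.map(upsert_batch, batches))
```

With the defaults above, 100 documents would be split into four batches (32, 32, 32, and 4 documents) processed across the worker threads.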

Similarity Search

Search for similar documents based on embeddings.

This method uses cosine similarity.

✅ Args

  • query : str – The text query for similarity search.

  • k : int – Number of top results to return (default: 10).

  • **kwargs : Additional search options (e.g., filters).

🔄 Return

  • results : List[Document] – A list of LangChain Document objects ranked by similarity.
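The cosine similarity used for ranking can be shown in pure Python on toy two-dimensional vectors (real embedding vectors have hundreds or thousands of dimensions, but the math is identical):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Rank toy document vectors against a query vector, highest similarity first
query = [1.0, 0.0]
doc_vectors = {"doc-a": [0.9, 0.1], "doc-b": [0.0, 1.0], "doc-c": [1.0, 0.0]}
ranked = sorted(doc_vectors,
                key=lambda d: cosine_similarity(query, doc_vectors[d]),
                reverse=True)
```

Here `doc-c` points in exactly the query's direction, so it ranks first, while the orthogonal `doc-b` ranks last.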

Delete Document

Delete documents based on filter conditions

✅ Args

  • ids : Optional[List[str]] – List of document IDs to delete. If None, deletion is based on filter.

  • filters : Optional[Dict] – Dictionary specifying filter conditions (e.g., metadata match).

  • **kwargs : Any additional parameters.

🔄 Return

  • None
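The two deletion modes (by explicit IDs, or by metadata filter when no IDs are given) can be sketched over a plain dictionary store; this is an illustrative stand-in, and the real class's behavior when neither argument is given may differ:

```python
def delete_docs(docs: dict, ids=None, filters=None):
    """Remove entries by explicit IDs, or by metadata filter when `ids` is None."""
    if ids is not None:
        targets = [i for i in ids if i in docs]
    elif filters:
        targets = [
            i for i, d in docs.items()
            if all(d["metadata"].get(k) == v for k, v in filters.items())
        ]
    else:
        # Assumption for this sketch: no ids and no filters deletes everything
        targets = list(docs)
    for i in targets:
        del docs[i]
```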
