HuggingFace Embeddings


  • Author: liniar

  • Design: liniar

  • Peer Review: byoon, Sun Hyoung Lee

  • Proofread: Youngjun cho

  • This is a part of LangChain Open Tutorial

Overview

  • Hugging Face offers a wide range of embedding models for free, enabling various embedding tasks with ease.

  • In this tutorial, we’ll use langchain_huggingface to build a simple text embedding-based search system.

  • The following models will be used for text embedding:

    • 1️⃣ multilingual-e5-large-instruct: A multilingual instruction-based embedding model.

    • 2️⃣ multilingual-e5-large: A powerful multilingual embedding model.

    • 3️⃣ bge-m3: Optimized for large-scale text processing.

Table of Contents

  • Overview

  • Environment Setup

  • Data Preparation for Embedding-Based Search Tutorial

  • Which Text Embedding Model Should You Use?

  • Similarity Calculation

  • HuggingFaceEndpointEmbeddings Overview

  • HuggingFaceEmbeddings Overview

  • FlagEmbedding Usage Guide

References

  • LangChain: Embedding Models

  • LangChain: Text Embedding

  • HuggingFace MTEB Leaderboard

  • MTEB GitHub

  • Hugging Face Model Hub

  • intfloat/multilingual-e5-large-instruct

  • intfloat/multilingual-e5-large

  • BAAI/bge-m3

  • FlagEmbedding


Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides a set of easy-to-use environment setup, useful functions, and utilities for tutorials.

  • You can check out the langchain-opentutorial package for more details.

🛠️ The following configurations will be set up

  • Jupyter Notebook Output Settings

    • Display standard error ( stderr ) messages directly instead of capturing them.

  • Install Required Packages

    • Ensure all necessary dependencies are installed.

  • API Key Setup

    • Configure the API key for authentication.

  • PyTorch Device Selection Setup

    • Automatically select the optimal computing device (CPU, CUDA, or MPS).

      • {"device": "mps"} : Perform embedding calculations using MPS instead of GPU. (For Mac users)

      • {"device": "cuda"} : Perform embedding calculations using GPU. (For Linux and Windows users, requires CUDA installation)

      • {"device": "cpu"} : Perform embedding calculations using CPU. (Available for all users)

  • Embedding Model Local Storage Path

    • Define a local path for storing embedding models.

%%capture --no-stderr
%pip install langchain-opentutorial
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain_huggingface",
        "torch",
        "numpy",
        "scikit-learn",
    ],
    verbose=False,
    upgrade=False,
)
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "HuggingFace Embeddings",  # Please set it the same as the title
        "HUGGINGFACEHUB_API_TOKEN": "",
    }
)
Environment variables have been set successfully.

You can alternatively set OPENAI_API_KEY in a .env file and load it.

[Note]

  • This is not necessary if you've already set OPENAI_API_KEY in previous steps.

from dotenv import load_dotenv

load_dotenv(override=True)
True
# Automatically select the appropriate device
import torch
import platform


def get_device():
    if platform.system() == "Darwin":  # macOS specific
        if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
            print("✅ Using MPS (Metal Performance Shaders) on macOS")
            return "mps"
    if torch.cuda.is_available():
        print("✅ Using CUDA (NVIDIA GPU)")
        return "cuda"
    else:
        print("✅ Using CPU")
        return "cpu"


# Set the device
device = get_device()
print("🖥️ Current device in use:", device)
✅ Using MPS (Metal Performance Shaders) on macOS
    🖥️ Current device in use: mps
# Embedding Model Local Storage Path
import os
import warnings

# Ignore warnings
warnings.filterwarnings("ignore")

# Set the download path to ./cache/
os.environ["HF_HOME"] = "./cache/"

Data Preparation for Embedding-Based Search Tutorial

To perform embedding-based search, we prepare both a Query and Documents.

  1. Query

  • Write a key question that will serve as the basis for the search.

# Query
q = "Please tell me more about LangChain."
  2. Documents

  • Prepare multiple documents (texts) that will serve as the target for the search.

  • Each document will be embedded to enable semantic search capabilities.

# Documents for Text Embedding
docs = [
    "Hi, nice to meet you.",
    "LangChain simplifies the process of building applications with large language models.",
    "The LangChain English tutorial is structured based on LangChain's official documentation, cookbook, and various practical examples to help users utilize LangChain more easily and effectively.",
    "LangChain simplifies the process of building applications with large-scale language models.",
    "Retrieval-Augmented Generation (RAG) is an effective technique for improving AI responses.",
]

Which Text Embedding Model Should You Use?

  • Leverage the MTEB leaderboard and free embedding models to confidently select and utilize the best-performing text embedding models for your projects! 🚀


🚀 What is MTEB (Massive Text Embedding Benchmark)?

  • MTEB is a benchmark designed to systematically and objectively evaluate the performance of text embedding models.

    • Purpose: To fairly compare the performance of embedding models.

    • Evaluation Tasks: Includes tasks like Classification, Retrieval, Clustering, and Semantic Similarity.

    • Supported Models: A wide range of text embedding models available on Hugging Face.

    • Results: Displayed as scores, with top-performing models ranked on the leaderboard.
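
If you want to go beyond browsing the leaderboard and score a model on an individual MTEB task yourself, the mteb package can run benchmark tasks against any sentence-transformers model. The snippet below is a minimal sketch, not part of the original tutorial: the task name, output folder, and model choice are illustrative assumptions.

# Minimal sketch (assumption, not from this tutorial): run a single MTEB task locally.
# Requires: %pip install -qU mteb sentence-transformers
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")  # any embedding model
evaluation = MTEB(tasks=["Banking77Classification"])  # illustrative task choice
results = evaluation.run(model, output_folder="mteb_results")  # scores are written as JSON files
print(results)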


🛠️ Models Used in This Tutorial

  • 1️⃣ multilingual-e5-large-instruct : Offers strong multilingual support with consistent results.

  • 2️⃣ multilingual-e5-large : A powerful multilingual embedding model.

  • 3️⃣ bge-m3 : Optimized for large-scale text processing, excelling in retrieval and semantic similarity tasks.

Similarity Calculation

Similarity Calculation Using Vector Dot Product

  • Similarity is determined using the dot product of vectors.

  • Similarity Calculation Formula:

$\text{similarities} = \mathbf{query} \cdot \mathbf{documents}^T$


📐 Mathematical Significance of the Vector Dot Product

Definition of Vector Dot Product

The dot product of two vectors, $\mathbf{a}$ and $\mathbf{b}$, is mathematically defined as:

$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i$


Relationship with Cosine Similarity

The dot product also relates to cosine similarity and follows this property:

$\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos \theta$

Where:

  • $\|\mathbf{a}\|$ and $\|\mathbf{b}\|$ represent the magnitudes (norms, specifically Euclidean norms) of vectors $\mathbf{a}$ and $\mathbf{b}$.

  • $\theta$ is the angle between the two vectors.

  • $\cos \theta$ represents the cosine similarity between the two vectors.


🔍 Interpretation of Vector Dot Product in Similarity

When the dot product value is large (a large positive value):

  • The magnitudes ($\|\mathbf{a}\|$ and $\|\mathbf{b}\|$) of the two vectors are large.

  • The angle ($\theta$) between the two vectors is small ($\cos \theta$ approaches 1).

This indicates that the two vectors point in a similar direction and are more semantically similar, especially when their magnitudes are also large.


📏 Calculation of Vector Magnitude (Norm)

Definition of Euclidean Norm

For a vector $\mathbf{a} = [a_1, a_2, \ldots, a_n]$, the Euclidean norm $\|\mathbf{a}\|$ is calculated as:

$\|\mathbf{a}\| = \sqrt{a_1^2 + a_2^2 + \cdots + a_n^2}$

This magnitude represents the length or size of the vector in multi-dimensional space.


Understanding these mathematical foundations helps ensure precise similarity calculations, enabling better performance in tasks like semantic search, retrieval systems, and recommendation engines. 🚀
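
To make these formulas concrete, here is a small NumPy sketch using made-up 3-dimensional vectors (not real embeddings) that computes the dot product, the Euclidean norms, and the resulting cosine similarity. Note that when embedding vectors are L2-normalized (as with the normalize_embeddings=True option used later in this tutorial), the dot product and the cosine similarity coincide.

import numpy as np

# Made-up 3-dimensional vectors, for illustration only (not real embeddings)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 4.0])

dot = np.dot(a, b)                # 1*2 + 2*1 + 3*4 = 16.0
norm_a = np.linalg.norm(a)        # sqrt(1 + 4 + 9)  ≈ 3.742
norm_b = np.linalg.norm(b)        # sqrt(4 + 1 + 16) ≈ 4.583
cosine = dot / (norm_a * norm_b)  # ≈ 0.933

print(f"dot product      : {dot}")
print(f"cosine similarity: {cosine:.3f}")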


Similarity calculation between embedded_query and embedded_document

  • embed_documents : For embedding multiple texts (documents)

  • embed_query : For embedding a single text (query)

Below, we implement a helper function that searches for the most relevant documents using text embeddings.

  • Let's use search_similar_documents(q, docs, hf_embeddings) to find the most relevant documents.

import numpy as np


def search_similar_documents(q, docs, hf_embeddings):
    """
    Search for the most relevant documents based on a query using text embeddings.

    Args:
        q (str): The query string for which relevant documents are to be found.
        docs (list of str): A list of document strings to compare against the query.
        hf_embeddings: An embedding model object with `embed_query` and `embed_documents` methods.

    Returns:
        tuple:
            - embedded_query (list of float): The embedding vector of the query.
            - embedded_documents (list of list of float): The embedding vectors of the documents.

    Workflow:
        1. Embed the query string into a numerical vector using `embed_query`.
        2. Embed each document into numerical vectors using `embed_documents`.
        3. Calculate similarity scores between the query and documents using the dot product.
        4. Sort the documents based on their similarity scores in descending order.
        5. Print the query and display the sorted documents by their relevance.
        6. Return the query and document embeddings for further analysis if needed.
    """
    # Embed the query and documents using the embedding model
    embedded_query = hf_embeddings.embed_query(q)
    embedded_documents = hf_embeddings.embed_documents(docs)

    # Calculate similarity scores using dot product
    similarity_scores = np.array(embedded_query) @ np.array(embedded_documents).T

    # Sort documents by similarity scores in descending order
    sorted_idx = similarity_scores.argsort()[::-1]

    # Display the results
    print(f"[Query] {q}\n" + "=" * 40)
    for i, idx in enumerate(sorted_idx):
        print(f"[{i}] {docs[idx]}")
        print()

    # Return embeddings for potential further processing or analysis
    return embedded_query, embedded_documents

HuggingFaceEndpointEmbeddings Overview

HuggingFaceEndpointEmbeddings is a feature in the LangChain library that leverages Hugging Face’s Inference API endpoint to generate text embeddings seamlessly.


📚 Key Concepts

  1. Hugging Face Inference API

    • Access pre-trained embedding models via Hugging Face’s API.

    • No need to download models locally; embeddings are generated directly through the API.

  2. LangChain Integration

    • Easily integrate embedding results into LangChain workflows using its standardized interface.

  3. Use Cases

    • Text-query and document similarity calculation

    • Search and recommendation systems

    • Natural Language Understanding (NLU) applications


⚙️ Key Parameters

  • model : The Hugging Face model ID (e.g., BAAI/bge-m3 )

  • task : The task to perform (usually "feature-extraction" )

  • huggingfacehub_api_token : Your Hugging Face API token

  • model_kwargs : Additional model configuration parameters


💡 Advantages

  • No Local Model Download: Instant access via API.

  • Scalability: Supports a wide range of pre-trained Hugging Face models.

  • Seamless Integration: Effortlessly integrates embeddings into LangChain workflows.


⚠️ Caveats

  • API Support: Not all models support API inference.

  • Speed & Cost: Free APIs may have slower response times and usage limitations.


With HuggingFaceEndpointEmbeddings, you can easily integrate Hugging Face’s powerful embedding models into your LangChain workflows for efficient and scalable NLP solutions. 🚀


Let’s use the intfloat/multilingual-e5-large-instruct model via the API to search for the most relevant documents using text embeddings.

from langchain_huggingface.embeddings import HuggingFaceEndpointEmbeddings

model_name = "intfloat/multilingual-e5-large-instruct"

hf_endpoint_embeddings = HuggingFaceEndpointEmbeddings(
    model=model_name,
    task="feature-extraction",
    huggingfacehub_api_token=os.environ["HUGGINGFACEHUB_API_TOKEN"],
)

Search for the most relevant documents based on a query using text embeddings.

%%time
# Embed the query and documents using the embedding model
embedded_query = hf_endpoint_embeddings.embed_query(q)
embedded_documents = hf_endpoint_embeddings.embed_documents(docs)
CPU times: user 7.18 ms, sys: 2.32 ms, total: 9.5 ms
    Wall time: 1.21 s
# Calculate similarity scores using dot product
similarity_scores = np.array(embedded_query) @ np.array(embedded_documents).T

# Sort documents by similarity scores in descending order
sorted_idx = similarity_scores.argsort()[::-1]
# Display the results
print(f"[Query] {q}\n" + "=" * 40)
for i, idx in enumerate(sorted_idx):
    print(f"[{i}] {docs[idx]}")
    print()
[Query] Please tell me more about LangChain.
    ========================================
    [0] LangChain simplifies the process of building applications with large language models.
    
    [1] LangChain simplifies the process of building applications with large-scale language models.
    
    [2] The LangChain English tutorial is structured based on LangChain's official documentation, cookbook, and various practical examples to help users utilize LangChain more easily and effectively.
    
    [3] Retrieval-Augmented Generation (RAG) is an effective technique for improving AI responses.
    
    [4] Hi, nice to meet you.
    
print("[HuggingFace Endpoint Embedding]")
print(f"Model: \t\t{model_name}")
print(f"Document Dimension: \t{len(embedded_documents[0])}")
print(f"Query Dimension: \t{len(embedded_query)}")
[HuggingFace Endpoint Embedding]
    Model: 		intfloat/multilingual-e5-large-instruct
    Document Dimension: 	1024
    Query Dimension: 	1024

We can verify that the dimensions of embedded_documents and embedded_query are consistent.

You can also perform searches using the search_similar_documents method we implemented earlier. From now on, let's use this method for our searches.

%%time
embedded_query, embedded_documents = search_similar_documents(q, docs, hf_endpoint_embeddings)
[Query] Please tell me more about LangChain.
    ========================================
    [0] LangChain simplifies the process of building applications with large language models.
    
    [1] LangChain simplifies the process of building applications with large-scale language models.
    
    [2] The LangChain English tutorial is structured based on LangChain's official documentation, cookbook, and various practical examples to help users utilize LangChain more easily and effectively.
    
    [3] Retrieval-Augmented Generation (RAG) is an effective technique for improving AI responses.
    
    [4] Hi, nice to meet you.
    
    CPU times: user 7.25 ms, sys: 3.26 ms, total: 10.5 ms
    Wall time: 418 ms

HuggingFaceEmbeddings Overview

  • HuggingFaceEmbeddings is a feature in the LangChain library that enables the conversion of text data into vectors using Hugging Face embedding models.

  • This class downloads and operates Hugging Face models locally for efficient processing.


📚 Key Concepts

  1. Hugging Face Pre-trained Models

    • Leverages pre-trained embedding models provided by Hugging Face.

    • Downloads models locally for direct embedding operations.

  2. LangChain Integration

    • Seamlessly integrates with LangChain workflows using its standardized interface.

  3. Use Cases

    • Text-query and document similarity calculation

    • Search and recommendation systems

    • Natural Language Understanding (NLU) applications


⚙️ Key Parameters

  • model_name : The Hugging Face model ID (e.g., sentence-transformers/all-MiniLM-L6-v2 )

  • model_kwargs : Additional model configuration parameters (e.g., GPU/CPU device settings)

  • encode_kwargs : Extra settings for embedding generation


💡 Advantages

  • Local Embedding Operations: Perform embeddings locally without requiring an internet connection (once the model files have been downloaded).

  • High Performance: Utilize GPU settings for faster embedding generation.

  • Model Variety: Supports a wide range of Hugging Face models.


⚠️ Caveats

  • Local Storage Requirement: Pre-trained models must be downloaded locally.

  • Environment Configuration: Performance may vary depending on GPU/CPU device settings.


With HuggingFaceEmbeddings, you can efficiently leverage Hugging Face's powerful embedding models in a local environment, enabling flexible and scalable NLP solutions. 🚀


Let's download the embedding model locally, perform embeddings, and search for the most relevant documents.

intfloat/multilingual-e5-large-instruct

from langchain_huggingface.embeddings import HuggingFaceEmbeddings

model_name = "intfloat/multilingual-e5-large-instruct"

hf_embeddings_e5_instruct = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs={"device": device},  # mps, cuda, cpu
    encode_kwargs={"normalize_embeddings": True},
)
(First run only: the model files, about 1.1 GB for this model, are downloaded to the local cache directory.)
%%time
embedded_query, embedded_documents = search_similar_documents(q, docs, hf_embeddings_e5_instruct)
[Query] Please tell me more about LangChain.
    ========================================
    [0] LangChain simplifies the process of building applications with large language models.
    
    [1] LangChain simplifies the process of building applications with large-scale language models.
    
    [2] The LangChain English tutorial is structured based on LangChain's official documentation, cookbook, and various practical examples to help users utilize LangChain more easily and effectively.
    
    [3] Retrieval-Augmented Generation (RAG) is an effective technique for improving AI responses.
    
    [4] Hi, nice to meet you.
    
    CPU times: user 326 ms, sys: 120 ms, total: 446 ms
    Wall time: 547 ms
print(f"Model: \t\t{model_name}")
print(f"Document Dimension: \t{len(embedded_documents[0])}")
print(f"Query Dimension: \t{len(embedded_query)}")
Model: 		intfloat/multilingual-e5-large-instruct
    Document Dimension: 	1024
    Query Dimension: 	1024

intfloat/multilingual-e5-large



from langchain_huggingface.embeddings import HuggingFaceEmbeddings

model_name = "intfloat/multilingual-e5-large"

hf_embeddings_e5_large = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs={"device": device},  # mps, cuda, cpu
    encode_kwargs={"normalize_embeddings": True},
)
%%time
embedded_query, embedded_documents = search_similar_documents(q, docs, hf_embeddings_e5_large)
[Query] Please tell me more about LangChain.
    ========================================
    [0] LangChain simplifies the process of building applications with large-scale language models.
    
    [1] LangChain simplifies the process of building applications with large language models.
    
    [2] The LangChain English tutorial is structured based on LangChain's official documentation, cookbook, and various practical examples to help users utilize LangChain more easily and effectively.
    
    [3] Retrieval-Augmented Generation (RAG) is an effective technique for improving AI responses.
    
    [4] Hi, nice to meet you.
    
    CPU times: user 84.1 ms, sys: 511 ms, total: 595 ms
    Wall time: 827 ms
print(f"Model: \t\t{model_name}")
print(f"Document Dimension: \t{len(embedded_documents[0])}")
print(f"Query Dimension: \t{len(embedded_query)}")
Model: 		intfloat/multilingual-e5-large
    Document Dimension: 	1024
    Query Dimension: 	1024

BAAI/bge-m3

from langchain_huggingface import HuggingFaceEmbeddings

model_name = "BAAI/bge-m3"
model_kwargs = {"device": device}  # mps, cuda, cpu
encode_kwargs = {"normalize_embeddings": True}

hf_embeddings_bge_m3 = HuggingFaceEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)
%%time
embedded_query, embedded_documents = search_similar_documents(q, docs, hf_embeddings_bge_m3)
[Query] Please tell me more about LangChain.
    ========================================
    [0] LangChain simplifies the process of building applications with large language models.
    
    [1] LangChain simplifies the process of building applications with large-scale language models.
    
    [2] The LangChain English tutorial is structured based on LangChain's official documentation, cookbook, and various practical examples to help users utilize LangChain more easily and effectively.
    
    [3] Hi, nice to meet you.
    
    [4] Retrieval-Augmented Generation (RAG) is an effective technique for improving AI responses.
    
    CPU times: user 81.1 ms, sys: 1.29 s, total: 1.37 s
    Wall time: 1.5 s
print(f"Model: \t\t{model_name}")
print(f"Document Dimension: \t{len(embedded_documents[0])}")
print(f"Query Dimension: \t{len(embedded_query)}")
Model: 		BAAI/bge-m3
    Document Dimension: 	1024
    Query Dimension: 	1024

FlagEmbedding Usage Guide

  • FlagEmbedding is an advanced embedding framework developed by BAAI (Beijing Academy of Artificial Intelligence).

  • It supports various embedding approaches and is primarily used with the BGE (BAAI General Embedding) model.

  • FlagEmbedding excels in tasks such as semantic search, natural language processing (NLP), and recommendation systems.


📚 Core Concepts of FlagEmbedding

1️⃣ Dense Embedding

  • Definition: Represents the overall meaning of a text as a single high-density vector.

  • Advantages: Effectively captures semantic similarity.

  • Use Cases: Semantic search, document similarity computation.

2️⃣ Lexical Embedding

  • Definition: Breaks text into word-level components, emphasizing word matching.

  • Advantages: Ensures precise matching of specific words or phrases.

  • Use Cases: Keyword-based search, exact word matching.

3️⃣ Multi-Vector Embedding

  • Definition: Splits a document into multiple vectors for representation.

  • Advantages: Allows more granular representation of lengthy texts or diverse topics.

  • Use Cases: Complex document structure analysis, detailed topic matching.


FlagEmbedding offers a flexible and powerful toolkit for leveraging embeddings across a wide range of NLP tasks and semantic search applications. 🚀


The following code is used to control tokenizer parallelism in Hugging Face's transformers library:

  • TOKENIZERS_PARALLELISM = "true" → Optimized for speed, suitable for large-scale data processing.

  • TOKENIZERS_PARALLELISM = "false" → Ensures stability, prevents conflicts and race conditions.

import os

os.environ["TOKENIZERS_PARALLELISM"] = "true"  # "false"
# install FlagEmbedding
%pip install -qU FlagEmbedding

⚙️ Key Parameters

BGEM3FlagModel

  • model_name : The Hugging Face model ID (e.g., BAAI/bge-m3 ).

  • use_fp16 : When set to True, reduces memory usage and improves encoding speed.

bge_embeddings.encode

  • batch_size : Defines the number of documents to process at once.

  • max_length : Sets the maximum token length for encoding documents.

    • Increase for longer documents to ensure full content encoding.

    • Excessively large values may degrade performance.

  • return_dense : When set to True, returns Dense Vectors only.

  • return_sparse : When set to True, returns Sparse Vectors.

  • return_colbert_vecs : When set to True, returns ColBERT-style vectors.


1️⃣ Dense Vector Embedding Example

  • Definition: Represents the overall meaning of a text as a single high-density vector.

  • Advantages: Effectively captures semantic similarity.

  • Use Cases: Semantic search, document similarity computation.

from FlagEmbedding import BGEM3FlagModel

model_name = "BAAI/bge-m3"

bge_embeddings = BGEM3FlagModel(
    model_name,
    use_fp16=True,  # Enabling fp16 improves encoding speed with minimal precision trade-off.
)

# Encode documents with specified parameters
embedded_documents_dense_vecs = bge_embeddings.encode(
    sentences=docs,
    batch_size=12,
    max_length=8192,  # Reduce this value if your documents are shorter to speed up encoding.
)["dense_vecs"]

# Query Encoding
embedded_query_dense_vecs = bge_embeddings.encode(
    sentences=[q],
    batch_size=12,
    max_length=8192,  # Reduce this value if your documents are shorter to speed up encoding.
)["dense_vecs"]
embedded_documents_dense_vecs
array([[-0.0271  ,  0.003561, -0.0506  , ...,  0.00911 , -0.04565 ,
             0.02028 ],
           [-0.02242 , -0.01398 , -0.00946 , ...,  0.01851 ,  0.01907 ,
            -0.01917 ],
           [ 0.01386 , -0.02118 ,  0.01807 , ..., -0.01463 ,  0.04373 ,
            -0.011856],
           [-0.02365 , -0.008675, -0.000806, ...,  0.01537 ,  0.01438 ,
            -0.02342 ],
           [-0.01289 , -0.007313, -0.0121  , ..., -0.00561 ,  0.03787 ,
             0.006016]], dtype=float16)
embedded_query_dense_vecs
array([[-0.02156 , -0.01993 , -0.01706 , ..., -0.01994 ,  0.0318  ,
            -0.003395]], dtype=float16)
# docs embedding dimension
embedded_documents_dense_vecs.shape
(5, 1024)
# query embedding dimension
embedded_query_dense_vecs.shape
(1, 1024)
# Calculating Similarity Between Documents and Query
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(
    embedded_query_dense_vecs, embedded_documents_dense_vecs
)
most_similar_idx = similarities.argmax()

# Display the Most Similar Document
print(f"Question: {q}")
print(f"Most similar document: {docs[most_similar_idx]}")
Question: Please tell me more about LangChain.
    Most similar document: LangChain simplifies the process of building applications with large language models.
from FlagEmbedding import BGEM3FlagModel

model_name = "BAAI/bge-m3"

bge_embeddings = BGEM3FlagModel(
    model_name,
    use_fp16=True,  # Enabling fp16 improves encoding speed with minimal precision trade-off.
)

# Encode documents with specified parameters
embedded_documents_dense_vecs_default = bge_embeddings.encode(
    sentences=docs, return_dense=True
)["dense_vecs"]

# Query Encoding
embedded_query_dense_vecs_default = bge_embeddings.encode(
    sentences=[q], return_dense=True
)["dense_vecs"]
# Calculating Similarity Between Documents and Query
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(
    embedded_query_dense_vecs_default, embedded_documents_dense_vecs_default
)
most_similar_idx = similarities.argmax()

# Display the Most Similar Document
print(f"Question: {q}")
print(f"Most similar document: {docs[most_similar_idx]}")
Question: Please tell me more about LangChain.
    Most similar document: LangChain simplifies the process of building applications with large language models.

2️⃣ Sparse (Lexical) Vector Embedding Example

Sparse Embedding (Lexical Weight)

  • Sparse embedding is an embedding method that utilizes high-dimensional vectors where most values are zero.

  • The approach using lexical weight generates embeddings by considering the importance of each word.

How It Works

  1. Calculate the lexical weight for each word. Techniques like TF-IDF or BM25 can be used.

  2. For each word in a document or query, assign a value to the corresponding dimension of the sparse vector based on its lexical weight.

  3. As a result, documents and queries are represented as high-dimensional vectors where most values are zero.

Advantages

  • Directly reflects the importance of words.

  • Enables precise matching of specific words or phrases.

  • Faster computation compared to dense embeddings.

from FlagEmbedding import BGEM3FlagModel

model_name = "BAAI/bge-m3"

bge_embeddings = BGEM3FlagModel(
    model_name,
    use_fp16=True,  # Enabling fp16 improves encoding speed with minimal precision trade-off.
)

# Encode documents with specified parameters
embedded_documents_sparse_vecs = bge_embeddings.encode(
    sentences=docs, return_sparse=True
)

# Query Encoding
embedded_query_sparse_vecs = bge_embeddings.encode(sentences=[q], return_sparse=True)
lexical_scores_0 = bge_embeddings.compute_lexical_matching_score(
    embedded_query_sparse_vecs["lexical_weights"][0],
    embedded_documents_sparse_vecs["lexical_weights"][0],
)
lexical_scores_1 = bge_embeddings.compute_lexical_matching_score(
    embedded_query_sparse_vecs["lexical_weights"][0],
    embedded_documents_sparse_vecs["lexical_weights"][1],
)
lexical_scores_2 = bge_embeddings.compute_lexical_matching_score(
    embedded_query_sparse_vecs["lexical_weights"][0],
    embedded_documents_sparse_vecs["lexical_weights"][2],
)
lexical_scores_3 = bge_embeddings.compute_lexical_matching_score(
    embedded_query_sparse_vecs["lexical_weights"][0],
    embedded_documents_sparse_vecs["lexical_weights"][3],
)
lexical_scores_4 = bge_embeddings.compute_lexical_matching_score(
    embedded_query_sparse_vecs["lexical_weights"][0],
    embedded_documents_sparse_vecs["lexical_weights"][4],
)
print(f"question: {q}")
print("====================")
for i, doc in enumerate(docs):
    print(doc, f": {eval(f'lexical_scores_{i}')}")
question: Please tell me more about LangChain.
    ====================
    Hi, nice to meet you. : 0.0118865966796875
    LangChain simplifies the process of building applications with large language models. : 0.2313995361328125
    The LangChain English tutorial is structured based on LangChain's official documentation, cookbook, and various practical examples to help users utilize LangChain more easily and effectively. : 0.18797683715820312
    LangChain simplifies the process of building applications with large-scale language models. : 0.2268962860107422
    Retrieval-Augmented Generation (RAG) is an effective technique for improving AI responses. : 0.002368927001953125

3️⃣ Multi-Vector (ColBERT) Embedding Example

  • ColBERT (Contextualized Late Interaction over BERT) is an efficient approach for document retrieval.

  • This method uses a multi-vector strategy to represent both documents and queries with multiple vectors.

How It Works

  1. Generate a separate vector for each token in a document, resulting in multiple vectors per document.

  2. Similarly, generate a separate vector for each token in a query.

  3. During retrieval, calculate the similarity between each query token vector and all document token vectors.

  4. Aggregate these similarity scores to produce a final retrieval score.

Advantages

  • Enables fine-grained token-level matching.

  • Captures contextual embeddings effectively.

  • Performs efficiently even with long documents.

from FlagEmbedding import BGEM3FlagModel

model_name = "BAAI/bge-m3"

bge_embeddings = BGEM3FlagModel(
    model_name,
    use_fp16=True,  # Enabling fp16 improves encoding speed with minimal precision trade-off.
)

# Encode documents with specified parameters
embedded_documents_colbert_vecs = bge_embeddings.encode(
    sentences=docs, return_colbert_vecs=True
)

# Query Encoding
embedded_query_colbert_vecs = bge_embeddings.encode(
    sentences=[q], return_colbert_vecs=True
)
colbert_scores_0 = bge_embeddings.colbert_score(
    embedded_query_colbert_vecs["colbert_vecs"][0],
    embedded_documents_colbert_vecs["colbert_vecs"][0],
)
colbert_scores_1 = bge_embeddings.colbert_score(
    embedded_query_colbert_vecs["colbert_vecs"][0],
    embedded_documents_colbert_vecs["colbert_vecs"][1],
)
colbert_scores_2 = bge_embeddings.colbert_score(
    embedded_query_colbert_vecs["colbert_vecs"][0],
    embedded_documents_colbert_vecs["colbert_vecs"][2],
)
colbert_scores_3 = bge_embeddings.colbert_score(
    embedded_query_colbert_vecs["colbert_vecs"][0],
    embedded_documents_colbert_vecs["colbert_vecs"][3],
)
colbert_scores_4 = bge_embeddings.colbert_score(
    embedded_query_colbert_vecs["colbert_vecs"][0],
    embedded_documents_colbert_vecs["colbert_vecs"][4],
)
print(f"question: {q}")
print("====================")
for i, doc in enumerate(docs):
    print(doc, f": {eval(f'colbert_scores_{i}')}")
question: Please tell me more about LangChain.
    ====================
    Hi, nice to meet you. : 0.509117841720581
    LangChain simplifies the process of building applications with large language models. : 0.7039894461631775
    The LangChain English tutorial is structured based on LangChain's official documentation, cookbook, and various practical examples to help users utilize LangChain more easily and effectively. : 0.6632840037345886
    LangChain simplifies the process of building applications with large-scale language models. : 0.7057777643203735
    Retrieval-Augmented Generation (RAG) is an effective technique for improving AI responses. : 0.38082367181777954

💡 Advantages of FlagEmbedding

  • Diverse Embedding Options: Supports the Dense, Lexical, and Multi-Vector approaches.

  • High-Performance Models: Utilizes powerful pre-trained models like BGE.

  • Flexibility: Choose the optimal embedding method based on your use case.

  • Scalability: Capable of performing embeddings on large-scale datasets.


⚠️ Considerations

  • Model Size: Some models may require significant storage capacity.

  • Resource Requirements: GPU usage is recommended for large-scale vector computations.

  • Configuration Needs: Optimal performance may require parameter tuning.


📊 FlagEmbedding Vector Comparison
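
The comparison table that originally followed this heading did not survive the page export. As a stand-in, here is a sketch that reuses the bge_embeddings model and the q / docs variables from above to compute the dense, lexical, and ColBERT scores side by side for each document. It assumes encode accepts all three return flags in a single call (each flag is used individually above); the variable names query_out and docs_out are illustrative, and the exact numbers may differ slightly from the outputs shown earlier.

import numpy as np

# Sketch only: encode the query and documents once, requesting all three vector types.
query_out = bge_embeddings.encode(
    sentences=[q], return_dense=True, return_sparse=True, return_colbert_vecs=True
)
docs_out = bge_embeddings.encode(
    sentences=docs, return_dense=True, return_sparse=True, return_colbert_vecs=True
)

print(f"question: {q}")
print("=" * 20)
for i, doc in enumerate(docs):
    # Dense: dot product of the dense vectors
    dense = float(
        np.array(query_out["dense_vecs"][0]) @ np.array(docs_out["dense_vecs"][i])
    )
    # Lexical: sparse word-weight matching score
    lexical = bge_embeddings.compute_lexical_matching_score(
        query_out["lexical_weights"][0], docs_out["lexical_weights"][i]
    )
    # ColBERT: token-level late-interaction score
    colbert = float(
        bge_embeddings.colbert_score(
            query_out["colbert_vecs"][0], docs_out["colbert_vecs"][i]
        )
    )
    print(f"dense={dense:.3f} | lexical={lexical:.3f} | colbert={colbert:.3f} : {doc}")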
