HuggingFace Embeddings


Overview

  • Hugging Face offers a wide range of free embedding models, making it easy to perform various embedding tasks.

  • In this tutorial, we’ll use langchain_huggingface to build a simple text embedding-based search system.

  • The following models will be used for text embedding:

    • 1️⃣ multilingual-e5-large-instruct: A multilingual instruction-based embedding model.

    • 2️⃣ multilingual-e5-large: A powerful multilingual embedding model.

    • 3️⃣ bge-m3: Optimized for large-scale text processing.


Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials.

  • You can check out the langchain-opentutorial repository for more details.
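A minimal install sketch (run in a notebook cell; package names are taken from this tutorial series, so adjust to your environment):

```python
# Install the tutorial helper package and the Hugging Face integration.
%pip install -qU langchain-opentutorial langchain-huggingface
```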


🛠️ The following configurations will be set up

  • Jupyter Notebook Output Settings

    • Display standard error ( stderr ) messages directly instead of capturing them.

  • Install Required Packages

    • Ensure all necessary dependencies are installed.

  • API Key Setup

    • Configure the API key for authentication.

  • PyTorch Device Selection Setup

    • Automatically select the optimal computing device (CPU, CUDA, or MPS); see the sketch after this list.

      • {"device": "mps"} : Perform embedding computations on Apple's Metal Performance Shaders (MPS), the GPU backend on Apple Silicon. (For Mac users)

      • {"device": "cuda"} : Perform embedding computations on an NVIDIA GPU. (For Linux and Windows users; requires CUDA installation)

      • {"device": "cpu"} : Perform embedding computations on the CPU. (Available for all users)

  • Embedding Model Local Storage Path

    • Define a local path for storing embedding models.
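A minimal sketch of the device auto-selection described above, using plain PyTorch (the device variable is reused in later code sketches):

```python
import torch

# Pick the best available device: NVIDIA GPU (CUDA), Apple GPU (MPS), or CPU fallback.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

print(f"Using device: {device}")
```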

You can alternatively set HUGGINGFACEHUB_API_TOKEN in a .env file and load it.

[Note]

  • This is not necessary if you've already set HUGGINGFACEHUB_API_TOKEN in previous steps.
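A minimal sketch using the python-dotenv package, assuming a .env file that contains HUGGINGFACEHUB_API_TOKEN:

```python
from dotenv import load_dotenv

# Load environment variables (e.g., HUGGINGFACEHUB_API_TOKEN) from a .env file.
load_dotenv()
```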

Data Preparation for Embedding-Based Search Tutorial

To perform embedding-based search, we prepare both a Query and Documents (a sample sketch follows the list below).

  1. Query

  • Write a key question that will serve as the basis for the search.

  2. Documents

  • Prepare multiple documents (texts) that will serve as the target of the search.

  • Each document will be embedded to enable semantic search capabilities.
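A minimal sketch of such data (the query and document texts here are illustrative, not the tutorial's exact examples):

```python
# Hypothetical query and documents for the search examples below.
q = "Please tell me more about LangChain."

docs = [
    "Hi, nice to meet you.",
    "LangChain simplifies the process of building applications with large language models.",
    "The LangChain English tutorial is structured based on LangChain's official documentation.",
    "LangChain unifies the modules needed to build LLM applications.",
    "Retrieval-Augmented Generation (RAG) combines retrieval with generation.",
]
```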

Which Text Embedding Model Should You Use?

  • Leverage the MTEB leaderboard and free embedding models to confidently select and utilize the best-performing text embedding models for your projects! 🚀


🚀 What is MTEB (Massive Text Embedding Benchmark)?

  • MTEB is a benchmark designed to systematically and objectively evaluate the performance of text embedding models.

    • Purpose: To fairly compare the performance of embedding models.

    • Evaluation Tasks: Includes tasks like Classification, Retrieval, Clustering, and Semantic Similarity.

    • Supported Models: A wide range of text embedding models available on Hugging Face.

    • Results: Displayed as scores, with top-performing models ranked on the leaderboard.

🔗 MTEB Leaderboard (Hugging Face)


🛠️ Models Used in This Tutorial

| Embedding Model | Description |
| --- | --- |
| 1️⃣ multilingual-e5-large-instruct | Offers strong multilingual support with consistent results. |
| 2️⃣ multilingual-e5-large | A powerful multilingual embedding model. |
| 3️⃣ bge-m3 | Optimized for large-scale text processing, excelling in retrieval and semantic similarity tasks. |

1️⃣ multilingual-e5-large-instruct

2️⃣ multilingual-e5-large

3️⃣ bge-m3
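Each of the three subsections above loads one model and runs the same search. A minimal sketch of the shared pattern, assuming the device variable from the setup step (the first and third model IDs appear later in this tutorial; intfloat/multilingual-e5-large is the public Hugging Face ID inferred from the model name):

```python
from langchain_huggingface import HuggingFaceEmbeddings

# Public Hugging Face model IDs for the three tutorial models.
MODEL_IDS = [
    "intfloat/multilingual-e5-large-instruct",
    "intfloat/multilingual-e5-large",
    "BAAI/bge-m3",
]

# Each model is loaded the same way; swap in the ID you want to test.
hf_embeddings = HuggingFaceEmbeddings(
    model_name=MODEL_IDS[0],
    model_kwargs={"device": device},
)
```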

Similarity Calculation

Similarity Calculation Using Vector Dot Product

  • Similarity is determined using the dot product of vectors.

  • Similarity Calculation Formula:

$$\text{similarities} = \mathbf{query} \cdot \mathbf{documents}^T$$


📐 Mathematical Significance of the Vector Dot Product

Definition of Vector Dot Product

The dot product of two vectors, $\mathbf{a}$ and $\mathbf{b}$, is mathematically defined as:

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i$$


Relationship with Cosine Similarity

The dot product also relates to cosine similarity and follows this property:

$$\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos \theta$$

Where:

  • $\|\mathbf{a}\|$ and $\|\mathbf{b}\|$ represent the magnitudes (norms, specifically Euclidean norms) of vectors $\mathbf{a}$ and $\mathbf{b}$.

  • $\theta$ is the angle between the two vectors.

  • $\cos \theta$ represents the cosine similarity between the two vectors.


🔍 Interpretation of Vector Dot Product in Similarity

The dot product takes a large positive value when:

  • The magnitudes ($\|\mathbf{a}\|$ and $\|\mathbf{b}\|$) of the two vectors are large, and/or

  • The angle ($\theta$) between the two vectors is small ($\cos \theta$ approaches 1).

A small angle means the two vectors point in a similar direction, which indicates greater semantic similarity; when embeddings are normalized to unit length, the dot product equals the cosine similarity exactly.


📏 Calculation of Vector Magnitude (Norm)

Definition of Euclidean Norm

For a vector $\mathbf{a} = [a_1, a_2, \ldots, a_n]$, the Euclidean norm $\|\mathbf{a}\|$ is calculated as:

$$\|\mathbf{a}\| = \sqrt{a_1^2 + a_2^2 + \cdots + a_n^2}$$

This magnitude represents the length or size of the vector in multi-dimensional space.


Understanding these mathematical foundations helps ensure precise similarity calculations, enabling better performance in tasks like semantic search, retrieval systems, and recommendation engines. 🚀
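As a quick numeric check of the formulas above, using plain NumPy:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # b = 2a, so the angle between them is 0

dot = a @ b                          # sum of a_i * b_i = 28.0
norm_a = np.linalg.norm(a)           # sqrt(1 + 4 + 9)   ≈ 3.742
norm_b = np.linalg.norm(b)           # sqrt(4 + 16 + 36) ≈ 7.483
cos_theta = dot / (norm_a * norm_b)  # 1.0: identical direction

print(dot, norm_a, norm_b, cos_theta)
```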


Similarity calculation between embedded_query and embedded_documents

  • embed_documents : For embedding multiple texts (documents)

  • embed_query : For embedding a single text (query)

We've implemented a method to search for the most relevant documents using text embeddings, as sketched below.

  • Let's use search_similar_documents(q, docs, hf_embeddings) to find the most relevant documents.
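A minimal sketch of what this helper might look like, built on the standard LangChain embed_query / embed_documents interface and the dot-product formula above (the output format is illustrative):

```python
import numpy as np

def search_similar_documents(q, docs, hf_embeddings):
    """Embed the query and documents, then rank documents by dot-product similarity."""
    embedded_query = np.array(hf_embeddings.embed_query(q))
    embedded_documents = np.array(hf_embeddings.embed_documents(docs))

    # similarities = query · documents^T
    similarities = embedded_query @ embedded_documents.T
    sorted_idx = similarities.argsort()[::-1]  # indices, most similar first

    print(f"[Query] {q}\n" + "=" * 40)
    for rank, idx in enumerate(sorted_idx):
        print(f"[{rank}] similarity: {similarities[idx]:.3f} | {docs[idx]}")
    return sorted_idx
```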

HuggingFaceEndpointEmbeddings Overview

HuggingFaceEndpointEmbeddings is a feature in the LangChain library that leverages Hugging Face’s Inference API endpoint to generate text embeddings seamlessly.


📚 Key Concepts

  1. Hugging Face Inference API

    • Access pre-trained embedding models via Hugging Face’s API.

    • No need to download models locally; embeddings are generated directly through the API.

  2. LangChain Integration

    • Easily integrate embedding results into LangChain workflows using its standardized interface.

  3. Use Cases

    • Text-query and document similarity calculation

    • Search and recommendation systems

    • Natural Language Understanding (NLU) applications


⚙️ Key Parameters

  • model : The Hugging Face model ID (e.g., BAAI/bge-m3 )

  • task : The task to perform (usually "feature-extraction" )

  • huggingfacehub_api_token : Your Hugging Face API token (read from the HUGGINGFACEHUB_API_TOKEN environment variable if not passed explicitly)

  • model_kwargs : Additional model configuration parameters


💡 Advantages

  • No Local Model Download: Instant access via API.

  • Scalability: Supports a wide range of pre-trained Hugging Face models.

  • Seamless Integration: Effortlessly integrates embeddings into LangChain workflows.


⚠️ Caveats

  • API Support: Not all models support API inference.

  • Speed & Cost: Free APIs may have slower response times and usage limitations.


With HuggingFaceEndpointEmbeddings, you can easily integrate Hugging Face’s powerful embedding models into your LangChain workflows for efficient and scalable NLP solutions. 🚀


Let’s use the intfloat/multilingual-e5-large-instruct model via the API to search for the most relevant documents using text embeddings.
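A minimal sketch of this step, assuming your Hugging Face token is available as the HUGGINGFACEHUB_API_TOKEN environment variable (q and docs come from the data-preparation step):

```python
import os
from langchain_huggingface import HuggingFaceEndpointEmbeddings

hf_endpoint_embeddings = HuggingFaceEndpointEmbeddings(
    model="intfloat/multilingual-e5-large-instruct",
    task="feature-extraction",  # the standard task for text embeddings
    huggingfacehub_api_token=os.getenv("HUGGINGFACEHUB_API_TOKEN"),
)

embedded_documents = hf_endpoint_embeddings.embed_documents(docs)
embedded_query = hf_endpoint_embeddings.embed_query(q)
```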

Search for the most relevant documents based on a query using text embeddings.

We can verify that the dimensions of embedded_documents and embedded_query are consistent.
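For example (a sketch; the multilingual-e5-large family produces 1024-dimensional vectors):

```python
print(len(embedded_documents), len(embedded_documents[0]))  # number of docs, 1024
print(len(embedded_query))                                  # 1024
```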

You can also perform searches using the search_similar_documents method we implemented earlier. From now on, let's use this method for our searches.
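For instance:

```python
search_similar_documents(q, docs, hf_endpoint_embeddings)
```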

HuggingFaceEmbeddings Overview

  • HuggingFaceEmbeddings is a feature in the LangChain library that enables the conversion of text data into vectors using Hugging Face embedding models.

  • This class downloads and operates Hugging Face models locally for efficient processing.


📚 Key Concepts

  1. Hugging Face Pre-trained Models

    • Leverages pre-trained embedding models provided by Hugging Face.

    • Downloads models locally for direct embedding operations.

  2. LangChain Integration

    • Seamlessly integrates with LangChain workflows using its standardized interface.

  3. Use Cases

    • Text-query and document similarity calculation

    • Search and recommendation systems

    • Natural Language Understanding (NLU) applications


⚙️ Key Parameters

  • model_name : The Hugging Face model ID (e.g., sentence-transformers/all-MiniLM-L6-v2 )

  • model_kwargs : Additional model configuration parameters (e.g., GPU/CPU device settings)

  • encode_kwargs : Extra settings for embedding generation


💡 Advantages

  • Local Embedding Operations: Perform embeddings locally without an internet connection (once the model has been downloaded).

  • High Performance: Utilize GPU settings for faster embedding generation.

  • Model Variety: Supports a wide range of Hugging Face models.


⚠️ Caveats

  • Local Storage Requirement: Pre-trained models must be downloaded locally.

  • Environment Configuration: Performance may vary depending on GPU/CPU device settings.


With HuggingFaceEmbeddings, you can efficiently leverage Hugging Face's powerful embedding models in a local environment, enabling flexible and scalable NLP solutions. 🚀


Let's download the embedding model locally, perform embeddings, and search for the most relevant documents.

intfloat/multilingual-e5-large-instruct
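A minimal sketch for this step, reusing the device variable from the setup step (the cache_folder path is illustrative; normalize_embeddings=True produces unit-norm vectors, so the dot product equals the cosine similarity):

```python
from langchain_huggingface import HuggingFaceEmbeddings

hf_embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/multilingual-e5-large-instruct",
    model_kwargs={"device": device},               # "cuda", "mps", or "cpu"
    encode_kwargs={"normalize_embeddings": True},  # unit-norm vectors
    cache_folder="./embedding_models",             # illustrative local storage path
)

search_similar_documents(q, docs, hf_embeddings)
```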
