HuggingFace Embeddings
Author: liniar
Design: liniar
Peer Review: byoon, Sun Hyoung Lee
Proofread: Youngjun cho
This is part of the LangChain Open Tutorial.
Overview
Hugging Face offers a wide range of embedding models for free, enabling various embedding tasks with ease. In this tutorial, we'll use langchain_huggingface to build a simple text embedding-based search system. The following models will be used for text embedding:
1️⃣ multilingual-e5-large-instruct: A multilingual instruction-based embedding model.
2️⃣ multilingual-e5-large: A powerful multilingual embedding model.
3️⃣ bge-m3: Optimized for large-scale text processing.

Table of Contents
References
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorial is a package that provides a set of easy-to-use environment setup tools, useful functions, and utilities for tutorials. You can check out langchain-opentutorial for more details.
🛠️ The following configurations will be set up
Jupyter Notebook Output Settings
Display standard error (stderr) messages directly instead of capturing them.
Install Required Packages
Ensure all necessary dependencies are installed.
API Key Setup
Configure the API key for authentication.
PyTorch Device Selection Setup
Automatically select the optimal computing device (CPU, CUDA, or MPS).
{"device": "mps"}: Perform embedding calculations on the GPU of Apple Silicon Macs via Metal Performance Shaders. (For Mac users)
{"device": "cuda"}: Perform embedding calculations on an NVIDIA GPU. (For Linux and Windows users; requires CUDA installation)
{"device": "cpu"}: Perform embedding calculations on the CPU. (Available for all users)
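The device choice above can be automated with a small helper. This is a minimal sketch, assuming PyTorch is installed (the function name is ours, not from the original notebook):

```python
import torch

def select_device() -> str:
    """Pick the best available PyTorch device: CUDA, then MPS, then CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

# Pass the result to the embedding model via model_kwargs.
model_kwargs = {"device": select_device()}
print(model_kwargs)
```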
Embedding Model Local Storage Path
Define a local path for storing embedding models.
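One way to pin the download location is to point Hugging Face's cache at a local folder before any model is loaded. The folder name below is just an example; HF_HOME is the environment variable Hugging Face libraries consult for their cache root:

```python
import os

# Hypothetical local folder for storing downloaded embedding models.
LOCAL_MODEL_PATH = "./cache/embedding"
os.makedirs(LOCAL_MODEL_PATH, exist_ok=True)

# Hugging Face libraries honor HF_HOME as their cache location.
os.environ["HF_HOME"] = LOCAL_MODEL_PATH
```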
You can alternatively set OPENAI_API_KEY in a .env file and load it.
[Note]
This is not necessary if you've already set OPENAI_API_KEY in previous steps.
Data Preparation for Embedding-Based Search Tutorial
To perform embedding-based search, we prepare both a Query and Documents.
Query
Write a key question that will serve as the basis for the search.
Documents
Prepare multiple documents (texts) that will serve as the target for the search.
Each document will be embedded to enable semantic search capabilities.
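A minimal example of such a query and document set (the texts here are illustrative placeholders, not the original notebook's data):

```python
# A single search query and a small corpus of candidate documents.
q = "Please tell me more about LangChain."

docs = [
    "Hi, nice to meet you.",
    "LangChain simplifies the process of building applications with large language models.",
    "The LangChain tutorial is structured based on LangChain's official documentation.",
    "LangChain supports a variety of embedding models and vector stores.",
    "Retrieval-Augmented Generation (RAG) combines retrieval with text generation.",
]
```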
Which Text Embedding Model Should You Use?
Leverage the MTEB leaderboard and free embedding models to confidently select and utilize the best-performing text embedding models for your projects! 🚀
🚀 What is MTEB (Massive Text Embedding Benchmark)?
MTEB is a benchmark designed to systematically and objectively evaluate the performance of text embedding models.
Purpose: To fairly compare the performance of embedding models.
Evaluation Tasks: Includes tasks like Classification, Retrieval, Clustering, and Semantic Similarity.
Supported Models: A wide range of text embedding models available on Hugging Face.
Results: Displayed as scores, with top-performing models ranked on the leaderboard.
🔗 MTEB Leaderboard (Hugging Face)
🛠️ Models Used in This Tutorial
1️⃣ multilingual-e5-large-instruct: Offers strong multilingual support with consistent, instruction-tuned results.
2️⃣ multilingual-e5-large: A powerful multilingual embedding model.
3️⃣ bge-m3: Optimized for large-scale text processing, excelling in retrieval and semantic similarity tasks.
1️⃣ multilingual-e5-large-instruct
2️⃣ multilingual-e5-large
3️⃣ bge-m3
Similarity Calculation
Similarity Calculation Using Vector Dot Product
Similarity is determined using the dot product of vectors.
Similarity Calculation Formula:

$\text{similarity}(\mathbf{q}, \mathbf{d}) = \mathbf{q} \cdot \mathbf{d}$
📐 Mathematical Significance of the Vector Dot Product
Definition of Vector Dot Product
The dot product of two vectors, $\mathbf{a}$ and $\mathbf{b}$, is mathematically defined as:

$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i$
Relationship with Cosine Similarity
The dot product also relates to cosine similarity and follows this property:

$\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}| \, |\mathbf{b}| \cos \theta$
Where:
$|\mathbf{a}|$ and $|\mathbf{b}|$ represent the magnitudes (norms, specifically Euclidean norms) of vectors $\mathbf{a}$ and $\mathbf{b}$.
$\theta$ is the angle between the two vectors.
$\cos \theta$ represents the cosine similarity between the two vectors.
🔍 Interpretation of Vector Dot Product in Similarity
When the dot product value is large (a large positive value):
The magnitudes ($|\mathbf{a}|$ and $|\mathbf{b}|$) of the two vectors are large.
The angle ($\theta$) between the two vectors is small ( $\cos \theta$ approaches 1 ).
This indicates that the two vectors point in a similar direction and are more semantically similar, especially when their magnitudes are also large.
📏 Calculation of Vector Magnitude (Norm)
Definition of Euclidean Norm
For a vector $\mathbf{a} = [a_1, a_2, \ldots, a_n]$, the Euclidean norm $|\mathbf{a}|$ is calculated as:

$|\mathbf{a}| = \sqrt{a_1^2 + a_2^2 + \cdots + a_n^2}$
This magnitude represents the length or size of the vector in multi-dimensional space.
Understanding these mathematical foundations helps ensure precise similarity calculations, enabling better performance in tasks like semantic search, retrieval systems, and recommendation engines. 🚀
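These formulas map directly onto a few lines of NumPy. The vectors below are toy values chosen so the results are easy to check by hand:

```python
import numpy as np

a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 4.0, 4.0])

dot = float(np.dot(a, b))          # sum of elementwise products: 2 + 8 + 8 = 18
norm_a = float(np.linalg.norm(a))  # Euclidean norm |a| = sqrt(1 + 4 + 4) = 3
norm_b = float(np.linalg.norm(b))  # Euclidean norm |b| = sqrt(4 + 16 + 16) = 6
cosine = dot / (norm_a * norm_b)   # cos(theta) = 18 / 18 = 1.0 (parallel vectors)

print(dot, norm_a, norm_b, cosine)
```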
Similarity calculation between embedded_query and embedded_documents

embed_documents: For embedding multiple texts (documents)
embed_query: For embedding a single text (query)

We've implemented a method to search for the most relevant documents using text embeddings. Let's use search_similar_documents(q, docs, hf_embeddings) to find the most relevant documents.
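One possible implementation of search_similar_documents, assuming the embeddings object exposes LangChain's embed_query/embed_documents interface. The toy embedder below stands in for a real model purely so the function can be demonstrated:

```python
import numpy as np

def search_similar_documents(q, docs, embeddings):
    """Rank docs by dot-product similarity to the query embedding."""
    embedded_query = np.array(embeddings.embed_query(q))
    embedded_documents = np.array(embeddings.embed_documents(docs))
    scores = embedded_documents @ embedded_query  # one dot product per document
    ranked = np.argsort(scores)[::-1]             # highest score first
    return [(docs[i], float(scores[i])) for i in ranked]

class _ToyEmbeddings:
    """Stand-in embedder: counts two keywords. Not a real model."""
    def embed_query(self, text):
        return [text.count("langchain"), text.count("rag")]
    def embed_documents(self, texts):
        return [self.embed_query(t) for t in texts]

results = search_similar_documents(
    "langchain", ["about rag", "about langchain"], _ToyEmbeddings()
)
print(results[0][0])  # the document mentioning "langchain" ranks first
```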
HuggingFaceEndpointEmbeddings Overview
HuggingFaceEndpointEmbeddings is a feature in the LangChain library that leverages Hugging Face’s Inference API endpoint to generate text embeddings seamlessly.
📚 Key Concepts
Hugging Face Inference API
Access pre-trained embedding models via Hugging Face’s API.
No need to download models locally; embeddings are generated directly through the API.
LangChain Integration
Easily integrate embedding results into LangChain workflows using its standardized interface.
Use Cases
Text-query and document similarity calculation
Search and recommendation systems
Natural Language Understanding (NLU) applications
⚙️ Key Parameters
model: The Hugging Face model ID (e.g., BAAI/bge-m3)
task: The task to perform (usually "feature-extraction")
api_key: Your Hugging Face API token
model_kwargs: Additional model configuration parameters
💡 Advantages
No Local Model Download: Instant access via API.
Scalability: Supports a wide range of pre-trained Hugging Face models.
Seamless Integration: Effortlessly integrates embeddings into LangChain workflows.
⚠️ Caveats
API Support: Not all models support API inference.
Speed & Cost: Free APIs may have slower response times and usage limitations.
With HuggingFaceEndpointEmbeddings, you can easily integrate Hugging Face’s powerful embedding models into your LangChain workflows for efficient and scalable NLP solutions. 🚀
Let’s use the intfloat/multilingual-e5-large-instruct model via the API to search for the most relevant documents using text embeddings.
Search for the most relevant documents based on a query using text embeddings.
We can verify that the dimensions of embedded_documents and embedded_query are consistent.
You can also perform searches using the search_similar_documents method we implemented earlier.
From now on, let's use this method for our searches.
HuggingFaceEmbeddings Overview
HuggingFaceEmbeddings is a feature in the LangChain library that enables the conversion of text data into vectors using Hugging Face embedding models.
This class downloads and operates Hugging Face models locally for efficient processing.
📚 Key Concepts
Hugging Face Pre-trained Models
Leverages pre-trained embedding models provided by Hugging Face.
Downloads models locally for direct embedding operations.
LangChain Integration
Seamlessly integrates with LangChain workflows using its standardized interface.
Use Cases
Text-query and document similarity calculation
Search and recommendation systems
Natural Language Understanding (NLU) applications
⚙️ Key Parameters
model_name: The Hugging Face model ID (e.g., sentence-transformers/all-MiniLM-L6-v2)
model_kwargs: Additional model configuration parameters (e.g., GPU/CPU device settings)
encode_kwargs: Extra settings for embedding generation
💡 Advantages
Local Embedding Operations: Perform embeddings locally without requiring an internet connection.
High Performance: Utilize GPU settings for faster embedding generation.
Model Variety: Supports a wide range of Hugging Face models.
⚠️ Caveats
Local Storage Requirement: Pre-trained models must be downloaded locally.
Environment Configuration: Performance may vary depending on GPU/CPU device settings.
With HuggingFaceEmbeddings, you can efficiently leverage Hugging Face's powerful embedding models in a local environment, enabling flexible and scalable NLP solutions. 🚀
Let's download the embedding model locally, perform embeddings, and search for the most relevant documents.
intfloat/multilingual-e5-large-instruct