Multimodal Embeddings With Langchain
Author: Gwangwon Jung
Peer Review: Teddy Lee, DoWoung Kong
Proofread: Youngjun Cho
This is a part of LangChain Open Tutorial
Overview
This tutorial covers how to perform Text Embedding and Image Embedding using a Multimodal Embedding Model with LangChain.
A Multimodal Embedding Model is a model that can vectorize images as well as text.
In this tutorial, we will build a simple Image Similarity Search example using a Multimodal Embedding Model and LangChain.

Table of Contents
References
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorial is a package that provides a set of easy-to-use environment setup tools, useful functions, and utilities for tutorials. You can check out langchain-opentutorial for more details.
Multimodal Embedding
Multimodal embedding is the process of creating a vector that represents an image’s features and context, making it compatible with text search in the same vector space.

What is Image Similarity Search?
Image Similarity Search is a technique that allows you to find images in a database that are similar to a given query (either an image or text describing the image) using vector-based representations.
The process involves converting images or text into embedding vectors that capture their visual or semantic features.
These vectors are then compared using similarity metrics, such as Cosine Similarity or Euclidean Distance, to find the most similar images in the database based on their vector representations.
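As a minimal sketch of how cosine similarity compares two embedding vectors, here is a NumPy example (the helper function name is ours, not part of any library):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of the vector norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vectors pointing in the same direction score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0])))  # → 1.0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0])))  # → 0.0
```

Because cosine similarity measures the angle between vectors rather than their magnitude, it works well for embeddings whose scale carries no meaning.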
Setting Image Data
In this tutorial, example images are provided. These images are copyright-free, cover a variety of subjects (e.g., dog, cat, female, male, ...), and were created using SDXL.
The images are located at ./data/for_embed_images.zip.
Create a list containing the image paths.
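A sketch of extracting the archive and collecting the image paths; the zip path comes from the tutorial, while the extraction directory name is an assumption you can change:

```python
import zipfile
from pathlib import Path

# The zip path follows the tutorial's layout; the extraction directory is our choice.
zip_path = Path("./data/for_embed_images.zip")
extract_dir = Path("./data/for_embed_images")

# Extract the archive once.
if zip_path.exists() and not extract_dir.exists():
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(extract_dir)

# Collect every image file path into a sorted list.
image_paths = sorted(
    str(p)
    for p in extract_dir.glob("**/*")
    if p.suffix.lower() in {".jpg", ".jpeg", ".png"}
)
print(f"Found {len(image_paths)} images")
```

Sorting the paths keeps the list order stable across runs, which makes similarity results reproducible.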
Model Load and Embedding Images
In this tutorial, we use OpenCLIP, an open-source implementation of OpenAI's CLIP.
OpenCLIP can be used with LangChain to easily embed both Text and Image data.
You can load the OpenCLIP embedding model using the Python libraries open_clip_torch and langchain-experimental.
Image Similarity Search with Text
Image Similarity Search with Text finds the images in the dataset that are most related to a given text query.
We will use cosine similarity to calculate similarity, since it is the metric most commonly used in image similarity search.
Steps
1. Embed the text query.
2. Calculate the similarity between the Text Query Embedding Vector and each Image Embedding Vector.
3. Get the most similar images.
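The ranking step above can be sketched as follows. The helper function is ours; the commented-out lines show how the real OpenCLIP embeddings would feed into it (names assumed from the earlier steps), while tiny synthetic vectors stand in here so the logic is easy to verify:

```python
import numpy as np

def rank_by_cosine_similarity(query_vec: np.ndarray, embeddings: np.ndarray) -> np.ndarray:
    """Return row indices of `embeddings` sorted from most to least similar to the query."""
    sims = embeddings @ query_vec / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec)
    )
    return np.argsort(sims)[::-1]

# With OpenCLIP (names assumed from the earlier steps), the real calls would be:
#   query_vec = np.array(clip_embd.embed_query("a dog playing in the park"))
#   embeddings = np.array(clip_embd.embed_image(image_paths))
query_vec = np.array([1.0, 0.0])
embeddings = np.array([[0.0, 1.0],   # orthogonal        → similarity 0
                       [2.0, 0.0],   # same direction    → similarity 1
                       [1.0, 1.0]])  # 45-degree angle   → similarity ≈ 0.707
print(rank_by_cosine_similarity(query_vec, embeddings))  # → [1 2 0]
```

The indices returned can be mapped back through `image_paths` to display the best-matching images.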

Image Similarity Search with Image
Image Similarity Search with Image finds the images in the dataset that are most related to a given image query.
We will use cosine similarity to calculate similarity, since it is the metric most commonly used in image similarity search.
Steps
1. Embed the image query.
2. Calculate the similarity between the Image Query Embedding Vector and each Image Embedding Vector.
3. Get the most similar images.
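The steps above can be sketched with a top-k variant of the same ranking idea. The helper and the query path in the comment are illustrative, not part of any library; synthetic vectors again stand in for real embeddings:

```python
import numpy as np

def top_k_similar(query_vec: np.ndarray, embeddings: np.ndarray, k: int = 3) -> list:
    """Return (index, score) pairs for the k embeddings most similar to the query."""
    sims = embeddings @ query_vec / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec)
    )
    top = np.argsort(sims)[::-1][:k]
    return [(int(i), float(sims[i])) for i in top]

# With OpenCLIP, the query image is embedded exactly like the dataset images:
#   query_vec = np.array(clip_embd.embed_image(["./query.png"])[0])  # path is illustrative
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
print(top_k_similar(np.array([0.0, 1.0]), embeddings, k=2))
```

The only difference from the text-query case is the embedding call: the same vector space and the same cosine-similarity ranking serve both modalities.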

