OpenAI Embeddings


Overview

This tutorial explores the use of OpenAI text embedding models within the LangChain framework.

It showcases how to generate embeddings for text queries and documents, reduce their dimensionality using PCA, and visualize them in 2D for better interpretability.

By analyzing relationships between the query and documents through cosine similarity, it provides insights into how embeddings can enhance workflows, including text analysis and data visualization.

Table of Contents

  • Overview

  • Environment Setup

  • Load model and set dimension

  • Define query and documents

  • Similarity Calculation (Cosine Similarity)

  • Embeddings Visualization (PCA)


Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, along with useful functions and utilities for tutorials.

  • You can check out langchain-opentutorial for more details.

[Note] If you are using a .env file, proceed as follows.
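A minimal sketch of that setup, assuming the python-dotenv package is installed and your .env file defines OPENAI_API_KEY (both assumptions, not shown in the original tutorial):

```python
# A minimal sketch, assuming `python-dotenv` is installed
# (e.g., pip install python-dotenv langchain-openai) and that a .env file
# containing OPENAI_API_KEY=<your key> exists in the working directory.
from dotenv import load_dotenv

load_dotenv(override=True)  # loads OPENAI_API_KEY into the environment
```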

Load model and set dimension

This section describes the embedding models and dimension settings supported by OpenAI.

Why Adjust Embedding Dimensions?

  • Optimize Resources: Shortened embeddings use less memory and compute.

  • Flexible Usage: Models like text-embedding-3-large allow size reduction via the dimensions parameter.

  • Key Insight: Even at 256 dimensions, performance can surpass larger models like text-embedding-ada-002.

The table below summarizes the embedding models supported by OpenAI:

| Model | ~ Pages per Dollar | Performance on MTEB Eval | Max Input (tokens) | Available Dimensions |
|---|---|---|---|---|
| text-embedding-3-small | 62,500 | 62.3% | 8191 | 512, 1536 |
| text-embedding-3-large | 9,615 | 64.6% | 8191 | 256, 1024, 3072 |
| text-embedding-ada-002 | 12,500 | 61.0% | 8191 | 1536 |

"Initialize and utilize OpenAI embedding models using langchain_openai package."

[Note] If dimension reduction is necessary, set the dimensions parameter as shown below.
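For example, to request 1024-dimensional vectors from text-embedding-3-large (a sketch; the target dimension is illustrative):

```python
# Request shorter vectors via the `dimensions` parameter
# (supported by the text-embedding-3-* models).
from langchain_openai import OpenAIEmbeddings

embeddings_1024 = OpenAIEmbeddings(model="text-embedding-3-large", dimensions=1024)
```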

Define query and documents

Now we embed the query and documents using the configured embedding model.
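A sketch with a hypothetical query and document set (the actual texts used in the tutorial may differ; `embeddings` is the model configured above):

```python
# Hypothetical query and documents for illustration.
query = "What is LangChain?"
documents = [
    "LangChain is a framework for building LLM-powered applications.",
    "OpenAI provides embedding models such as text-embedding-3-large.",
    "PCA projects high-dimensional vectors onto a lower-dimensional space.",
    "Paris is the capital of France.",
    "Soccer is played by two teams of eleven players.",
    "The Pacific is the largest ocean on Earth.",
]

# Embed the query and the documents with the model configured above.
query_vector = embeddings.embed_query(query)
document_vectors = embeddings.embed_documents(documents)
```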

Similarity Calculation (Cosine Similarity)

This code calculates the similarity between the query and each document using cosine similarity, then reports the most similar documents (top 3) and the least similar ones (bottom 3).
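A sketch of the ranking, using NumPy for the cosine similarity (variable names continue from the previous step):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score every document against the query, then sort from most to least similar.
scores = [cosine_similarity(query_vector, vec) for vec in document_vectors]
ranked = sorted(zip(scores, documents), reverse=True)

print("Top 3 most similar:")
for score, doc in ranked[:3]:
    print(f"  {score:.4f}  {doc}")

print("Bottom 3 least similar:")
for score, doc in ranked[-3:]:
    print(f"  {score:.4f}  {doc}")
```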

Embeddings Visualization (PCA)

Reduce the dimensionality of the embeddings for visualization purposes. This code uses principal component analysis (PCA) to reduce high-dimensional embedding vectors to two dimensions. The resulting 2D points are displayed in a scatterplot, with each point labeled with its corresponding document.
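A sketch using scikit-learn and matplotlib (both assumed to be installed; document_vectors and documents come from the earlier steps):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Project the embedding vectors from their native dimensionality down to 2D.
points_2d = PCA(n_components=2).fit_transform(np.asarray(document_vectors))

# Scatterplot with each point labeled by a short snippet of its document.
plt.figure(figsize=(8, 6))
plt.scatter(points_2d[:, 0], points_2d[:, 1])
for (x, y), doc in zip(points_2d, documents):
    plt.annotate(doc[:24] + "...", (x, y), fontsize=8)
plt.title("Document embeddings projected to 2D with PCA")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```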

Why Dimension Reduction?

High-dimensional embedding vectors are challenging to interpret and analyze directly. By reducing them to 2D, we can:

  • Visually explore relationships between embeddings (e.g., clustering, grouping).

  • Identify patterns or anomalies in the data that may not be obvious in high dimensions.

  • Improve interpretability , making the data more accessible for human analysis and decision-making.

[Figure: 2D PCA scatterplot of the document embeddings]
