Multi Modal RAG

Overview

Many documents contain a mix of different content types, including text and images.

However, in most RAG applications, the information contained in images is lost.

With the advent of multimodal LLMs like GPT-4V and GPT-4o, it is worth considering how to utilize images in RAG:

Option 1:

  • Use multimodal embeddings (such as CLIP) to embed images and text.

  • Retrieve both via similarity search.

  • Pass the original images and text fragments to a multimodal LLM to synthesize answers.

Option 2:

  • Generate text summaries from images using multimodal LLMs (e.g. GPT-4V, GPT-4o, LLaVA, FUYU-8b).

  • Embed and search for text.

  • Pass text fragments to LLMs to synthesize answers.

Option 3:

  • Generate text summaries from images using multimodal LLMs (e.g. GPT-4V, GPT-4o, LLaVA, FUYU-8b).

  • Embed and retrieve the image summary with a reference to the original image.

  • Pass the original image and text fragment to a multimodal LLM to synthesize answers.

(Figure: graphic-01.png)

Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, useful functions, and utilities for these tutorials.

  • You can check out langchain-opentutorial for more details.

Package

To use unstructured, poppler (Installation Guide) and tesseract (Installation Guide) must be installed on your system.

[Note] Option 2 is suitable when multimodal LLMs cannot be used for answer synthesis (e.g., due to cost or other limitations).

Data Loading

Before processing PDFs, it's essential to distinguish between text and images for accurate extraction.

Splitting PDF Text and Images

Using partition_pdf provided by Unstructured, you can extract text and images.

To extract images, use the following:

extract_images_in_pdf=True

If you want to process only text:

extract_images_in_pdf=False
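Below is a minimal sketch of calling partition_pdf; the file path and chunking parameters are illustrative, and argument names can vary slightly between unstructured versions.

```python
from unstructured.partition.pdf import partition_pdf

# Extract text, tables, and images from a PDF.
raw_pdf_elements = partition_pdf(
    filename="data/sample.pdf",         # hypothetical path to your PDF
    extract_images_in_pdf=True,         # set to False to process text only
    infer_table_structure=True,         # keep tables as structured elements
    chunking_strategy="by_title",       # group text into title-delimited chunks
    max_characters=4000,                # roughly 4k-character chunks
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path="./figures",  # where extracted images are written
)
```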

Multi-Vector Search Engine

Using the MultiVectorRetriever, you can index summaries of images (and/or text and tables) while retrieving the original images (along with the original text or tables).

Text and Table Summarization

To generate summaries for tables and optionally text, we will use GPT-4-turbo.

If you are working with large chunk sizes (e.g., 4k token chunks as set above), text summarization is recommended.

The summaries are used for retrieving the original tables and/or original text chunks.
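As a reference, here is a minimal sketch of such a summarization chain; the prompt wording is illustrative, and the texts and tables variables are assumed to hold the strings extracted in the previous step.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Prompt asking for a retrieval-optimized summary of a text or table chunk.
prompt = ChatPromptTemplate.from_template(
    "You are an assistant tasked with summarizing tables and text for retrieval. "
    "Give a concise summary of the following content:\n\n{element}"
)

summarize_chain = (
    {"element": lambda x: x}
    | prompt
    | ChatOpenAI(model="gpt-4-turbo", temperature=0)
    | StrOutputParser()
)

# texts / tables are lists of strings produced by the extraction step above.
text_summaries = summarize_chain.batch(texts, {"max_concurrency": 5})
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})
```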

Image Summarization

We will use GPT-4o to generate summaries for images.

  • The images are passed as base64-encoded data.
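A minimal sketch of this step is shown below; the helper names and the prompt wording are illustrative.

```python
import base64

from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI


def encode_image(image_path: str) -> str:
    """Read an image file and return it as a base64-encoded string."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def summarize_image(img_base64: str) -> str:
    """Ask GPT-4o for a retrieval-friendly description of one image."""
    chat = ChatOpenAI(model="gpt-4o", max_tokens=1024)
    msg = chat.invoke(
        [
            HumanMessage(
                content=[
                    {"type": "text", "text": "Describe this image in detail for retrieval."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                    },
                ]
            )
        ]
    )
    return msg.content
```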

Adding to the Vector Store

To add the original documents and their summaries to the MultiVectorRetriever:

  • Store the original text, tables, and images in the docstore.

  • Save text summaries, table summaries, and image summaries in the vectorstore for efficient semantic search.

Creating a Multi-Vector Search Engine for Indexing and Retrieving Text, Tables, and Images

  • Initialize the storage layer using InMemoryStore.

  • Create a MultiVectorRetriever to index summarized data but configure it to return the original text or images.

  • Include the process of adding summaries and original data for each data type (text, tables, images) to the vectorstore and docstore:

    • Generate a unique doc_id for each document.

    • Add the summarized data to the vectorstore and store the original data along with the doc_id in the docstore.

  • Check conditions to ensure that only non-empty summaries are added for each data type.

  • Use the Chroma vector store to index the summaries, generating embeddings with OpenAIEmbeddings.

  • The resulting multi-vector search engine indexes summaries for each data type and returns the original data during searches; a sketch of this setup follows below.
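Below is a minimal sketch of this setup; the variable names for the extracted content and its summaries (texts, tables, images_base64, and the corresponding summary lists) are assumptions carried over from the earlier steps.

```python
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# The vectorstore indexes the summaries; the docstore keeps the original content.
vectorstore = Chroma(collection_name="mm_rag", embedding_function=OpenAIEmbeddings())
store = InMemoryStore()
id_key = "doc_id"

retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)


def add_documents(summaries, originals):
    """Index summaries in the vectorstore and keep originals in the docstore."""
    if not summaries:  # only add non-empty summaries
        return
    doc_ids = [str(uuid.uuid4()) for _ in originals]
    summary_docs = [
        Document(page_content=summary, metadata={id_key: doc_ids[i]})
        for i, summary in enumerate(summaries)
    ]
    retriever.vectorstore.add_documents(summary_docs)
    retriever.docstore.mset(list(zip(doc_ids, originals)))


# Summaries and originals for each data type come from the previous steps.
add_documents(text_summaries, texts)
add_documents(table_summaries, tables)
add_documents(image_summaries, images_base64)
```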

RAG

Effectively retrieving relevant documents is a crucial step in enhancing response accuracy.

Building the Retriever

The retrieved documents must be assigned to the correct sections of the GPT-4o prompt template.

The following describes how to process Base64-encoded images and text and use them to construct a multimodal question-answering (QA) chain:

  • Verify if a Base64-encoded string is an image. Supported image formats include JPG, PNG, GIF, and WEBP.

  • Resize the Base64-encoded image to the given dimensions.

  • Separate Base64-encoded images and text from a document set.

  • Use the separated images and text to construct messages that will serve as inputs to the multimodal QA chain. This process involves creating messages that include image URLs and text information.

  • Construct the multimodal QA chain. This chain generates responses to questions based on the provided image and text information. The model used is ChatOpenAI, specifically the gpt-4o model.

Together, these steps implement a multimodal QA system that combines image and text data to answer questions: Base64 encoding and decoding for images, image resizing, and the merging of image and text context into a single prompt. A condensed sketch of these helpers and the chain follows.
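In the sketch below, the helper names, default image size, and prompt wording are illustrative, and the retriever is the MultiVectorRetriever built earlier.

```python
import base64
import io

from PIL import Image
from langchain_core.messages import HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import ChatOpenAI


def looks_like_base64_image(data: str) -> bool:
    """Heuristic: does the decoded payload start with a JPG, PNG, GIF, or WEBP signature?"""
    signatures = (b"\xff\xd8\xff", b"\x89PNG", b"GIF8", b"RIFF")
    try:
        header = base64.b64decode(data)[:8]
    except Exception:
        return False
    return any(header.startswith(sig) for sig in signatures)


def resize_base64_image(data: str, size=(1300, 600)) -> str:
    """Resize a base64-encoded image and re-encode it."""
    img = Image.open(io.BytesIO(base64.b64decode(data)))
    fmt = img.format or "JPEG"
    img = img.resize(size, Image.LANCZOS)
    buf = io.BytesIO()
    img.save(buf, format=fmt)
    return base64.b64encode(buf.getvalue()).decode("utf-8")


def split_image_text_types(docs):
    """Separate retrieved documents into base64 images and plain text."""
    images, texts = [], []
    for doc in docs:
        content = doc.page_content if hasattr(doc, "page_content") else doc
        if looks_like_base64_image(content):
            images.append(resize_base64_image(content))
        else:
            texts.append(content)
    return {"images": images, "texts": texts}


def build_prompt(inputs):
    """Combine the question, retrieved text, and retrieved images into one multimodal message."""
    content = [
        {
            "type": "text",
            "text": (
                "Answer the question using the provided context.\n"
                f"Question: {inputs['question']}\n"
                "Context:\n" + "\n".join(inputs["context"]["texts"])
            ),
        }
    ]
    for img in inputs["context"]["images"]:
        content.append(
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img}"}}
        )
    return [HumanMessage(content=content)]


# retriever is the MultiVectorRetriever built in the previous section.
chain = (
    {
        "context": retriever | RunnableLambda(split_image_text_types),
        "question": RunnablePassthrough(),
    }
    | RunnableLambda(build_prompt)
    | ChatOpenAI(model="gpt-4o", max_tokens=1024)
    | StrOutputParser()
)
```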

Verification

When we search for images related to a question, we receive relevant images in return.
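A minimal sketch of this check, reusing the retriever and the looks_like_base64_image helper defined above; the query is a hypothetical example and the inline display assumes a notebook environment.

```python
from IPython.display import HTML, display

query = "Please describe the chart about the revenue trend."  # hypothetical question
docs = retriever.invoke(query)

# Render the first retrieved base64 image inline.
for doc in docs:
    content = doc.page_content if hasattr(doc, "page_content") else doc
    if looks_like_base64_image(content):
        display(HTML(f'<img src="data:image/jpeg;base64,{content}"/>'))
        break
```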

Let’s revisit the images we stored to understand why this works.

Here is the corresponding summary, which we embedded for similarity search.

It is quite reasonable that this image was retrieved, given the similarity between our query and the stored image summary.

Running RAG

Now, let's run RAG and test its ability to synthesize answers to our questions.
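A minimal sketch of invoking the chain built above; the question is a hypothetical example.

```python
# Run the full multimodal RAG chain end to end.
answer = chain.invoke("What does the chart about the revenue trend show?")
print(answer)
```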

Considerations

Search

  • Retrieval is based on similarity to both the image summaries and the text chunks.

  • It can fail if competing text chunks rank higher than the relevant image summaries, so retrieval results should be reviewed carefully.

Image Size

  • The quality of answer synthesis appears to be sensitive to image size, as stated in the guidelines.
