MultiVectorRetriever


Overview

MultiVectorRetriever lets documents be stored and managed with multiple vectors each, enabling efficient querying across a variety of contexts and significantly improving both the accuracy and the efficiency of information retrieval.

Table of Contents

  • Overview

  • Environment Setup

  • Methods to Create Multiple Vectors per Document

  • Creating Smaller Chunks

  • Storing Summary Embeddings

  • Utilizing Hypothetical Questions


Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, useful functions, and utilities for these tutorials.

  • You can check out langchain-opentutorial for more details.

Alternatively, environment variables can also be set using a .env file.

[Note]

  • This is not necessary if you've already set the environment variables in the previous step.
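
If you take the .env approach, a minimal sketch using python-dotenv might look like this:

```python
from dotenv import load_dotenv

# Load environment variables (e.g., OPENAI_API_KEY) from a .env file.
load_dotenv(override=True)
```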

Methods to Create Multiple Vectors per Document

There are several approaches to creating multiple vectors for a given document. Some of them include:

  1. Creating Smaller Chunks: Split the document into smaller chunks and embed them. This method enables a more granular focus on specific parts of the document. It can be implemented using the ParentDocumentRetriever, making it easier to explore detailed information.

  2. Storing Summary Embeddings: Create a summary for each document and embed it along with the original document. Summary embeddings are particularly useful for quickly grasping the core content of a document. By focusing only on the summary instead of analyzing the entire document, efficiency can be significantly improved.

  3. Utilizing Hypothetical Questions: Create relevant hypothetical questions for each document and embed them along with the original document. This approach is helpful when deeper exploration of specific topics or content is needed. Hypothetical questions enable a broader perspective on the document's content, facilitating a more comprehensive understanding.

  4. Manual Addition: Users can manually add specific questions or queries that should be considered during document retrieval. This method provides users with more control over the search process, allowing for customized searches tailored to their specific needs.

Let's first preprocess the document data by loading it from a text file and splitting the loaded documents into chunks of a specified size.

The split documents can later be used for tasks such as vectorization and retrieval.

The original documents loaded from the data are stored in the docs variable.
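
A sketch of this preprocessing step is shown below; the file path and chunk sizes are illustrative placeholders, not values from the original tutorial:

```python
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the original documents from a text file (placeholder path).
loader = TextLoader("./data/sample-text.txt")
docs = loader.load()

# Split the loaded documents into chunks of a specified size.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=50)
split_docs = text_splitter.split_documents(docs)
```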

Creating Smaller Chunks

When searching through large volumes of information, embedding data into smaller chunks can be highly beneficial.

With MultiVectorRetriever, documents can be stored and managed as multiple vectors.

  • The original documents are stored in the docstore.

  • The embedded documents are stored in the vectorstore.

This allows for splitting documents into smaller units, enabling more accurate searches. Additionally, the contents of the original document can be accessed when needed.
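
A sketch of the retriever setup, assuming Chroma as the vector store (as used later in this tutorial) and an in-memory docstore; the collection name is arbitrary:

```python
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Vector store for the embedded chunks.
vectorstore = Chroma(
    collection_name="small_chunks", embedding_function=OpenAIEmbeddings()
)

# Storage layer for the original documents, keyed by doc_id.
store = InMemoryStore()
id_key = "doc_id"

retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)
```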

Here we define a parent_text_splitter for splitting into larger chunks and a child_text_splitter for splitting into smaller chunks.
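
For example (the chunk sizes are illustrative and should be tuned for your data):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Larger chunks preserve more context; smaller chunks embed more precisely.
parent_text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
```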

Create parent documents, which are larger chunks.

Verify the doc_id assigned to parent_docs.
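
A sketch, assuming one UUID is generated per original document and shared by every chunk derived from it:

```python
import uuid

# One ID per original document; every chunk derived from it shares this ID.
doc_ids = [str(uuid.uuid4()) for _ in docs]

parent_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    # Split the original document into larger parent chunks.
    for chunk in parent_text_splitter.split_documents([doc]):
        chunk.metadata[id_key] = _id
        parent_docs.append(chunk)

# Verify the doc_id assigned to the parent chunks.
print([d.metadata[id_key] for d in parent_docs[:3]])
```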

Create child documents, which are relatively smaller chunks.

Verify the doc_id assigned to child_docs.
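
The child chunks are created the same way, reusing the same doc_ids:

```python
child_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    # Split the same original document into smaller child chunks.
    for chunk in child_text_splitter.split_documents([doc]):
        chunk.metadata[id_key] = _id
        child_docs.append(chunk)

print([d.metadata[id_key] for d in child_docs[:3]])
```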

Check the number of chunks for each split document.
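
For example:

```python
print(f"Number of parent chunks: {len(parent_docs)}")
print(f"Number of child chunks: {len(child_docs)}")
```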

Add the newly created child document set to the vector store; the parent chunk set is added the same way, so that searches can match both granularities.

Then, map the generated UUIDs to the original documents and add them to the docstore.

  • Use the mset method to store document IDs and their content as key-value pairs in the document store.
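
A sketch of this step; both chunk sets are embedded so that the similarity search below can match child and parent chunks alike:

```python
# Embed both chunk sets so similarity search covers both granularities.
retriever.vectorstore.add_documents(parent_docs)
retriever.vectorstore.add_documents(child_docs)

# Store (doc_id, original document) pairs in the docstore.
retriever.docstore.mset(list(zip(doc_ids, docs)))
```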

Perform similarity search, and display the most similar document chunks.

Use the retriever.vectorstore.similarity_search method to search within the child and parent document chunks.

The first document chunk with the highest similarity will be displayed.
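
For example (the query is a placeholder):

```python
# Search the embedded chunks directly; returns the most similar chunks.
similar_chunks = retriever.vectorstore.similarity_search("What is an embedding?", k=2)
print(similar_chunks[0].page_content)
```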

Execute a query using the retriever.invoke method.

The retriever.invoke method performs a search across the full content of the original documents.
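
For example:

```python
# Returns the full original documents mapped to the best-matching chunks.
relevant_docs = retriever.invoke("What is an embedding?")
print(relevant_docs[0].page_content[:300])
```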

The default search type performed by the retriever in the vector database is similarity search.

LangChain's VectorStore also supports searching using Maximal Marginal Relevance.

If you want to use this method instead, you can configure the search_type property as follows.

  • Set the search_type property of the retriever object to SearchType.mmr.

    • This specifies that the MMR (Maximal Marginal Relevance) algorithm should be used during the search.
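
For example:

```python
from langchain.retrievers.multi_vector import SearchType

# Switch the retriever from similarity search to Maximal Marginal Relevance.
retriever.search_type = SearchType.mmr

print(retriever.invoke("What is an embedding?")[0].page_content[:300])
```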

Storing Summary Embeddings

Summaries can often distill the contents of a chunk more accurately, which can lead to better search results.

This section describes how to generate summaries and how to embed them.

Summarize the documents in the docs list in batch using the chain.batch method.

  • Here, we set the max_concurrency parameter to 10 to allow up to 10 documents to be processed simultaneously.

Print the summary to see the results.
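
A sketch of the summarization chain (here named summary_chain); the model name and prompt wording are assumptions:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

summary_chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(model="gpt-4o-mini", temperature=0)
    | StrOutputParser()
)

# Summarize all documents, processing up to 10 concurrently.
summaries = summary_chain.batch(docs, {"max_concurrency": 10})

print(summaries[0])
```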

Initialize the Chroma vector store to index the summaries. Use OpenAIEmbeddings as the embedding function.

  • Use "doc_id" as the key representing the document ID.
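
A sketch of the setup (the collection name is arbitrary):

```python
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Vector store that indexes the summaries.
summary_vectorstore = Chroma(
    collection_name="summaries", embedding_function=OpenAIEmbeddings()
)

store = InMemoryStore()
id_key = "doc_id"  # key representing the document ID

retriever = MultiVectorRetriever(
    vectorstore=summary_vectorstore,
    docstore=store,
    id_key=id_key,
)
```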

Save the summarized documents along with their metadata (here, the doc_id linking each summary to its original document).

The number of summaries matches the number of original documents.

  • Add summary_docs to the vector store with retriever.vectorstore.add_documents(summary_docs).

  • Map doc_ids and docs with retriever.docstore.mset(list(zip(doc_ids, docs))) to store them in the document store.
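
Putting those two steps together:

```python
import uuid

from langchain_core.documents import Document

doc_ids = [str(uuid.uuid4()) for _ in docs]

# One summary Document per original, tagged with the original's ID.
summary_docs = [
    Document(page_content=summary, metadata={id_key: doc_ids[i]})
    for i, summary in enumerate(summaries)
]

retriever.vectorstore.add_documents(summary_docs)  # embed the summaries
retriever.docstore.mset(list(zip(doc_ids, docs)))  # map IDs to originals
```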

Perform a similarity search using the similarity_search method of the vectorstore object.
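
For example (placeholder query):

```python
result_docs = summary_vectorstore.similarity_search("What is an embedding?")
print(result_docs[0].page_content)  # the matching summary
```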

Use the invoke method of the retriever object to retrieve documents related to the query.
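
For example:

```python
retrieved_docs = retriever.invoke("What is an embedding?")
print(retrieved_docs[0].page_content[:300])  # the full original document
```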

Utilizing Hypothetical Questions

An LLM can also be used to generate a list of hypothetical questions about a particular document.

These generated questions can be embedded to further explore and understand the content of the document.

Generating hypothetical questions can help you identify key topics and concepts in your document, and can encourage readers to ask more questions about the content of your document.

Below is an example of creating hypothetical questions via function calling.

Use ChatPromptTemplate to define a prompt template that generates three hypothetical questions based on the given document.

  • Set functions and function_call so the model calls the hypothetical question generation function.

  • Use JsonKeyOutputFunctionsParser to parse the generated hypothetical questions and extract the values corresponding to the questions key.
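
A sketch of such a chain; the function schema, prompt wording, and model name are assumptions:

```python
from langchain_core.output_parsers.openai_functions import JsonKeyOutputFunctionsParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

functions = [
    {
        "name": "hypothetical_questions",
        "description": "Generate hypothetical questions",
        "parameters": {
            "type": "object",
            "properties": {
                "questions": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["questions"],
        },
    }
]

hypothetical_query_chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template(
        "Generate exactly 3 hypothetical questions that the document below could be used to answer:\n\n{doc}"
    )
    | ChatOpenAI(model="gpt-4o-mini").bind(
        functions=functions, function_call={"name": "hypothetical_questions"}
    )
    | JsonKeyOutputFunctionsParser(key_name="questions")
)
```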

Invoke the chain on a document and check the output.

  • The output contains the three hypothetical questions you created.
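
For example:

```python
# Generate hypothetical questions for the first split document.
questions = hypothetical_query_chain.invoke(split_docs[0])
print(questions)  # a list of three question strings
```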

Use the chain.batch method to process multiple requests for split_docs data at the same time.
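
For example:

```python
# Generate questions for all split documents, up to 10 at a time.
hypothetical_questions = hypothetical_query_chain.batch(
    split_docs, {"max_concurrency": 10}
)
```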

Below is the process for storing the hypothetical questions you created in the vector store, the same way we did before.

Add metadata (document IDs) to the question_docs list.

Add the hypothetical_questions to the vector store, and add the original documents to the docstore.
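
A sketch of this step, reusing the components defined earlier (the collection name is arbitrary):

```python
import uuid

from langchain_core.documents import Document

# Vector store that indexes the hypothetical questions.
question_vectorstore = Chroma(
    collection_name="hypo_questions", embedding_function=OpenAIEmbeddings()
)
store = InMemoryStore()

retriever = MultiVectorRetriever(
    vectorstore=question_vectorstore,
    docstore=store,
    id_key=id_key,
)

doc_ids = [str(uuid.uuid4()) for _ in split_docs]

# One Document per generated question, tagged with its source document's ID.
question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend(
        Document(page_content=q, metadata={id_key: doc_ids[i]})
        for q in question_list
    )

retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, split_docs)))
```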

Perform a similarity search using the similarity_search method of the vectorstore object.

Below are the results of the similarity search.

Here, we've only added the hypothetical questions we created, so it returns the documents with the highest similarity among the stored hypothetical questions.
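
For example (placeholder query):

```python
result_docs = question_vectorstore.similarity_search("What is an embedding?")
for doc in result_docs:
    print(doc.page_content)  # the stored hypothetical questions
```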

Use the invoke method of the retriever object to retrieve documents related to the query.
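
For example:

```python
retrieved_docs = retriever.invoke("What is an embedding?")
print(retrieved_docs[0].page_content[:300])  # the original source document
```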
