MultiVectorRetriever


Overview

MultiVectorRetriever lets documents be stored and managed with multiple vectors each, enabling efficient querying across a variety of contexts and significantly improving both the accuracy and the efficiency of information retrieval.

Table of Contents

  • Overview

  • Environment Setup

  • Methods to Create Multiple Vectors per Document

  • Creating Smaller Chunks

  • Storing Summary Embeddings

  • Utilizing Hypothetical Questions


Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, useful functions, and utilities for these tutorials.

  • You can check out langchain-opentutorial for more details.

Alternatively, environment variables can also be set using a .env file.

[Note]

  • This is not necessary if you've already set the environment variables in the previous step.
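
If you take the .env approach, a minimal sketch using python-dotenv might look like this:

```python
from dotenv import load_dotenv

# Load environment variables (e.g., OPENAI_API_KEY) from a .env file.
load_dotenv(override=True)
```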

Methods to Create Multiple Vectors per Document

There are several approaches to creating multiple vectors for a given document. Some of them include:

  1. Creating Smaller Chunks: Split the document into smaller chunks and embed them. This method enables a more granular focus on specific parts of the document. It can be implemented using the ParentDocumentRetriever, making it easier to explore detailed information.

  2. Storing Summary Embeddings: Create a summary for each document and embed it along with the original document. Summary embeddings are particularly useful for quickly grasping the core content of a document. By focusing only on the summary instead of analyzing the entire document, efficiency can be significantly improved.

  3. Utilizing Hypothetical Questions: Create relevant hypothetical questions for each document and embed them along with the original document. This approach is helpful when deeper exploration of specific topics or content is needed. Hypothetical questions enable a broader perspective on the document's content, facilitating a more comprehensive understanding.

  4. Manual Addition: Users can manually add specific questions or queries that should be considered during document retrieval. This method provides users with more control over the search process, allowing for customized searches tailored to their specific needs.

Let's first preprocess the document data by loading it from a text file and splitting the loaded documents into chunks of a specified size.

The split documents can later be used for tasks such as vectorization and retrieval.

The original documents loaded from the data are stored in the docs variable.
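
A sketch of this preprocessing step is shown below; the file path and chunk sizes are illustrative placeholders, not values from the original tutorial:

```python
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the original documents from a text file (placeholder path).
loader = TextLoader("./data/sample-text.txt")
docs = loader.load()

# Split the loaded documents into chunks of a specified size.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=50)
split_docs = text_splitter.split_documents(docs)
```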

Creating Smaller Chunks

When searching through large volumes of information, embedding data into smaller chunks can be highly beneficial.

With MultiVectorRetriever, documents can be stored and managed as multiple vectors.

  • The original documents are stored in the docstore.

  • The embedded documents are stored in the vectorstore.

This allows for splitting documents into smaller units, enabling more accurate searches. Additionally, the contents of the original document can be accessed when needed.
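
A sketch of the retriever setup, assuming Chroma as the vector store (as used later in this tutorial) and an in-memory docstore; the collection name is arbitrary:

```python
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Vector store for the embedded chunks.
vectorstore = Chroma(
    collection_name="small_chunks", embedding_function=OpenAIEmbeddings()
)

# Storage layer for the original documents, keyed by doc_id.
store = InMemoryStore()
id_key = "doc_id"

retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)
```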

Here we define a parent_text_splitter for splitting into larger chunks and a child_text_splitter for splitting into smaller chunks.
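
For example (the chunk sizes are illustrative and should be tuned for your data):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Larger chunks preserve more context; smaller chunks embed more precisely.
parent_text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
```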

Create parent documents, which are larger chunks.

Verify the doc_id assigned to parent_docs.
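
A sketch, assuming one UUID is generated per original document and shared by every chunk derived from it:

```python
import uuid

# One ID per original document; every chunk derived from it shares this ID.
doc_ids = [str(uuid.uuid4()) for _ in docs]

parent_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    # Split the original document into larger parent chunks.
    for chunk in parent_text_splitter.split_documents([doc]):
        chunk.metadata[id_key] = _id
        parent_docs.append(chunk)

# Verify the doc_id assigned to the parent chunks.
print([d.metadata[id_key] for d in parent_docs[:3]])
```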

Create child documents, which are relatively smaller chunks.

Verify the doc_id assigned to child_docs.
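
The child chunks are created the same way, reusing the same doc_ids:

```python
child_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    # Split the same original document into smaller child chunks.
    for chunk in child_text_splitter.split_documents([doc]):
        chunk.metadata[id_key] = _id
        child_docs.append(chunk)

print([d.metadata[id_key] for d in child_docs[:3]])
```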

Check the number of chunks for each split document.
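
For example:

```python
print(f"Number of parent chunks: {len(parent_docs)}")
print(f"Number of child chunks: {len(child_docs)}")
```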

Add the newly created child document set to the vector store; the parent chunk set is added the same way, so that searches can match both granularities.

Then, map the generated UUIDs to the original documents and add them to the docstore.

  • Use the mset method to store document IDs and their content as key-value pairs in the document store.
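
A sketch of this step; both chunk sets are embedded so that the similarity search below can match child and parent chunks alike:

```python
# Embed both chunk sets so similarity search covers both granularities.
retriever.vectorstore.add_documents(parent_docs)
retriever.vectorstore.add_documents(child_docs)

# Store (doc_id, original document) pairs in the docstore.
retriever.docstore.mset(list(zip(doc_ids, docs)))
```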

Perform similarity search, and display the most similar document chunks.

Use the retriever.vectorstore.similarity_search method to search within the child and parent document chunks.

The first document chunk with the highest similarity will be displayed.
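
For example (the query is a placeholder):

```python
# Search the embedded chunks directly; returns the most similar chunks.
similar_chunks = retriever.vectorstore.similarity_search("What is an embedding?", k=2)
print(similar_chunks[0].page_content)
```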

Execute a query using the retriever.invoke method.

The retriever.invoke method performs a search across the full content of the original documents.
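
For example:

```python
# Returns the full original documents mapped to the best-matching chunks.
relevant_docs = retriever.invoke("What is an embedding?")
print(relevant_docs[0].page_content[:300])
```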

The default search type performed by the retriever in the vector database is similarity search.

LangChain's VectorStore also supports searching using Maximal Marginal Relevance.

If you want to use this method instead, you can configure the search_type property as follows.

  • Set the search_type property of the retriever object to SearchType.mmr.

    • This specifies that the MMR (Maximal Marginal Relevance) algorithm should be used during the search.
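
For example:

```python
from langchain.retrievers.multi_vector import SearchType

# Switch the retriever from similarity search to Maximal Marginal Relevance.
retriever.search_type = SearchType.mmr

print(retriever.invoke("What is an embedding?")[0].page_content[:300])
```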

Storing Summary Embeddings

Summaries can often distill the contents of a chunk more accurately, which can lead to better search results.

This section describes how to generate summaries and how to embed them.

Summarize the documents in the docs list in batch using the chain.batch method.

  • Here, we set the max_concurrency parameter to 10 to allow up to 10 documents to be processed simultaneously.

Print the summary to see the results.
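
A sketch of the summarization chain (here named summary_chain); the model name and prompt wording are assumptions:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

summary_chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(model="gpt-4o-mini", temperature=0)
    | StrOutputParser()
)

# Summarize all documents, processing up to 10 concurrently.
summaries = summary_chain.batch(docs, {"max_concurrency": 10})

print(summaries[0])
```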

Initialize the Chroma vector store to index the summaries. Use OpenAIEmbeddings as the embedding function.

  • Use "doc_id" as the key representing the document ID.
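
A sketch of the setup (the collection name is arbitrary):

```python
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Vector store that indexes the summaries.
summary_vectorstore = Chroma(
    collection_name="summaries", embedding_function=OpenAIEmbeddings()
)

store = InMemoryStore()
id_key = "doc_id"  # key representing the document ID

retriever = MultiVectorRetriever(
    vectorstore=summary_vectorstore,
    docstore=store,
    id_key=id_key,
)
```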

Save the summarized documents along with their metadata (here, the doc_id linking each summary to its original document).

The number of summaries matches the number of original documents.

  • Add summary_docs to the vector store with retriever.vectorstore.add_documents(summary_docs).

  • Map doc_ids and docs with retriever.docstore.mset(list(zip(doc_ids, docs))) to store them in the document store.
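
Putting those two steps together:

```python
import uuid

from langchain_core.documents import Document

doc_ids = [str(uuid.uuid4()) for _ in docs]

# One summary Document per original, tagged with the original's ID.
summary_docs = [
    Document(page_content=summary, metadata={id_key: doc_ids[i]})
    for i, summary in enumerate(summaries)
]

retriever.vectorstore.add_documents(summary_docs)  # embed the summaries
retriever.docstore.mset(list(zip(doc_ids, docs)))  # map IDs to originals
```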

Perform a similarity search using the similarity_search method of the vectorstore object.
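
For example (placeholder query):

```python
result_docs = summary_vectorstore.similarity_search("What is an embedding?")
print(result_docs[0].page_content)  # the matching summary
```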

Use the invoke method of the retriever object to retrieve documents related to the query.
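
For example:

```python
retrieved_docs = retriever.invoke("What is an embedding?")
print(retrieved_docs[0].page_content[:300])  # the full original document
```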

Utilizing Hypothetical Questions

An LLM can also be used to generate a list of hypothetical questions about a particular document.

These generated questions can be embedded to further explore and understand the content of the document.

Generating hypothetical questions can help you identify key topics and concepts in your document, and can encourage readers to ask more questions about the content of your document.

Below is an example of creating hypothetical questions via function calling.

Use ChatPromptTemplate to define a prompt template that generates three hypothetical questions based on the given document.

  • Set functions and function_call so the model calls the hypothetical question generation function.

  • Use JsonKeyOutputFunctionsParser to parse the generated hypothetical questions and extract the values corresponding to the questions key.
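
A sketch of such a chain; the function schema, prompt wording, and model name are assumptions:

```python
from langchain_core.output_parsers.openai_functions import JsonKeyOutputFunctionsParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

functions = [
    {
        "name": "hypothetical_questions",
        "description": "Generate hypothetical questions",
        "parameters": {
            "type": "object",
            "properties": {
                "questions": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["questions"],
        },
    }
]

hypothetical_query_chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template(
        "Generate exactly 3 hypothetical questions that the document below could be used to answer:\n\n{doc}"
    )
    | ChatOpenAI(model="gpt-4o-mini").bind(
        functions=functions, function_call={"name": "hypothetical_questions"}
    )
    | JsonKeyOutputFunctionsParser(key_name="questions")
)
```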

Invoke the chain on a document and check the output.

  • The output contains the three hypothetical questions you created.
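
For example:

```python
# Generate hypothetical questions for the first split document.
questions = hypothetical_query_chain.invoke(split_docs[0])
print(questions)  # a list of three question strings
```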

Use the chain.batch method to process multiple requests for split_docs data at the same time.
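
For example:

```python
# Generate questions for all split documents, up to 10 at a time.
hypothetical_questions = hypothetical_query_chain.batch(
    split_docs, {"max_concurrency": 10}
)
```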

Below is the process for storing the hypothetical questions you created in the vector store, the same way we did before.

Add metadata (document IDs) to the question_docs list.

Add the hypothetical_questions to the vector store, and add the original documents to the docstore.
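
A sketch of this step, reusing the components defined earlier (the collection name is arbitrary):

```python
import uuid

from langchain_core.documents import Document

# Vector store that indexes the hypothetical questions.
question_vectorstore = Chroma(
    collection_name="hypo_questions", embedding_function=OpenAIEmbeddings()
)
store = InMemoryStore()

retriever = MultiVectorRetriever(
    vectorstore=question_vectorstore,
    docstore=store,
    id_key=id_key,
)

doc_ids = [str(uuid.uuid4()) for _ in split_docs]

# One Document per generated question, tagged with its source document's ID.
question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend(
        Document(page_content=q, metadata={id_key: doc_ids[i]})
        for q in question_list
    )

retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, split_docs)))
```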

Perform a similarity search using the similarity_search method of the vectorstore object.

Below are the results of the similarity search.

Here, we've only added the hypothetical questions we created, so it returns the documents with the highest similarity among the stored hypothetical questions.
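
For example (placeholder query):

```python
result_docs = question_vectorstore.similarity_search("What is an embedding?")
for doc in result_docs:
    print(doc.page_content)  # the stored hypothetical questions
```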

Use the invoke method of the retriever object to retrieve documents related to the query.
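
For example:

```python
retrieved_docs = retriever.invoke("What is an embedding?")
print(retrieved_docs[0].page_content[:300])  # the original source document
```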
