Parent Document Retriever


Overview

This tutorial focuses on the ParentDocumentRetriever implementation, a tool designed to balance precise search over small chunks against the need to return enough surrounding context.

When splitting documents for search, two competing needs arise:

  1. Small Chunks: Needed for accurate meaning representation in embeddings

  2. Context Preservation: Required for maintaining document coherence

How It Works

ParentDocumentRetriever manages this balance by:

  1. Splitting documents into small searchable chunks

  2. Maintaining connections to parent documents via IDs

  3. Returning the parent documents of the matching chunks at query time (in this tutorial, the source files are loaded through TextLoader objects)
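The idea behind these steps can be sketched in a few lines of plain Python. This is only a conceptual illustration, not LangChain's implementation: the `split_text` helper, the sample texts, the `doc_id` key name, and the keyword match (standing in for an embedding search) are all assumptions made for the example.

```python
import uuid
from typing import List, Optional


def split_text(text: str, chunk_size: int) -> List[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Hypothetical sample data; each parent document gets a generated ID.
parents = {
    str(uuid.uuid4()): "Embeddings map text to vectors. Similar texts land close together.",
    str(uuid.uuid4()): "A docstore keeps full documents. Lookups use the parent ID.",
}

# Index step: every small chunk remembers which parent it came from.
child_index = [
    {"text": chunk, "doc_id": parent_id}
    for parent_id, text in parents.items()
    for chunk in split_text(text, chunk_size=32)
]


def retrieve_parent(keyword: str) -> Optional[str]:
    """Find the first chunk containing keyword and return its full parent."""
    for child in child_index:
        if keyword in child["text"]:
            return parents[child["doc_id"]]
    return None


result = retrieve_parent("docstore")
print(result)
```

The match happens on a 32-character chunk, but the caller receives the whole parent text, which is the trade-off the retriever is built around.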

Benefits

  1. Efficient Search: Quick identification of relevant content

  2. Context Awareness: Access to broader document context when needed

  3. Flexible Structure: Works with both complete documents and larger chunks as parent documents


Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials.

  • You can check out the langchain-opentutorial package for more details.

You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.
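As a rough illustration of what loading keys from a .env file involves, here is a minimal standard-library sketch. In practice the python-dotenv package's `load_dotenv()` is the usual choice. The `load_env_file` helper and the `PDR_DEMO_API_KEY` name are hypothetical and exist only for this example.

```python
import os
import tempfile
from pathlib import Path
from typing import Dict


def load_env_file(path: str = ".env") -> Dict[str, str]:
    """Read KEY=VALUE lines from a file and copy them into os.environ.

    Variables that are already set in the environment are left untouched.
    """
    loaded: Dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


# Demo with a temporary file so no real .env file is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / ".env"
    demo_path.write_text("# demo\nPDR_DEMO_API_KEY='placeholder-value'\n", encoding="utf-8")
    values = load_env_file(str(demo_path))

print(values)
```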

[Note] This is not necessary if you've already set the required API keys in previous steps.

First, let's load the documents that we'll use as data.

Full Document Retrieval

In this mode, the retriever returns complete documents, so we specify only the child_splitter and leave parent_splitter unset.

Later, we'll also specify the parent_splitter to compare the results.

Documents are added using the retriever.add_documents(docs, ids=None) function:

  • If ids is None, they will be automatically generated.

  • Setting add_to_docstore=False skips writing the documents to the docstore, which avoids re-adding documents that are already stored. In that case ids must be provided so the new chunks can be linked to the existing entries.
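The rules for ids and add_to_docstore can be mimicked with a small stand-in class. This is a sketch under stated assumptions, not the library's code: `ToyParentRetriever`, its fixed-size chunking, and the sample strings are hypothetical. Only the argument names mirror the function described above.

```python
import uuid
from typing import Dict, List, Optional


class ToyParentRetriever:
    """Pure-Python stand-in that mimics the add_documents rules described above."""

    def __init__(self, chunk_size: int = 40) -> None:
        self.chunk_size = chunk_size
        self.docstore: Dict[str, str] = {}   # parent ID -> full text
        self.child_chunks: List[dict] = []   # small chunks tagged with a parent ID

    def add_documents(
        self,
        docs: List[str],
        ids: Optional[List[str]] = None,
        add_to_docstore: bool = True,
    ) -> None:
        if ids is None:
            # Without ids there is nothing to link chunks to, so storing is mandatory.
            if not add_to_docstore:
                raise ValueError("ids are required when add_to_docstore is False")
            ids = [str(uuid.uuid4()) for _ in docs]
        elif len(ids) != len(docs):
            raise ValueError("ids and docs must have the same length")

        for doc_id, text in zip(ids, docs):
            for start in range(0, len(text), self.chunk_size):
                chunk = text[start:start + self.chunk_size]
                self.child_chunks.append({"text": chunk, "doc_id": doc_id})
            if add_to_docstore:
                self.docstore[doc_id] = text


retriever = ToyParentRetriever()
retriever.add_documents(["first document " * 5, "second document " * 5])
keys = list(retriever.docstore.keys())
print(len(keys))
```

Calling `add_documents` again with the same ids and `add_to_docstore=False` would index fresh chunks without growing the docstore.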

Listing the keys in the store should return two keys because we added two documents.

  • Convert the keys returned by the store object's yield_keys() method into a list.

Let's try calling the vector store search function.

Since we are storing small chunks, we should see small chunks returned in the search results.

Perform similarity search using the similarity_search method of the vectorstore object.
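To see why a search over the vector store returns small chunks, consider the toy ranking below. It deliberately swaps embeddings for a bag-of-words cosine similarity so that it runs with the standard library alone. The function here only borrows the name `similarity_search`; it is a hypothetical helper, and the sample chunks are made up.

```python
import math
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Count lowercase whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_search(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the query (toy scoring, not embeddings)."""
    query_vec = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, bag_of_words(c)), reverse=True)
    return ranked[:k]


# Hypothetical small chunks, as they would sit in the vector store.
chunks = [
    "word embeddings map tokens to vectors",
    "the docstore holds full parent documents",
    "small chunks give precise embeddings",
]

top = similarity_search("what are word embeddings", chunks, k=1)
print(top)
```

The result is one of the stored chunks itself, not the document it came from. Mapping it back to a parent is the retriever's job.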

Now let's query the retriever itself. Because it returns the documents that contain the matching small chunks, the results are considerably larger than the chunks returned by the vector store.

Use the invoke() method of the retriever object to retrieve documents related to the query.
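Conceptually, the retriever turns ranked chunk hits into parent documents by collecting each hit's parent ID once, in rank order, and looking those IDs up in the docstore. The sketch below illustrates that step in plain Python. The `parents_for_hits` helper, the `doc_id` metadata key, and the sample data are assumptions for illustration.

```python
from typing import Dict, List


def parents_for_hits(
    hits: List[dict], docstore: Dict[str, str], id_key: str = "doc_id"
) -> List[str]:
    """Map ranked child-chunk hits to unique parent documents, keeping rank order."""
    ordered_ids: List[str] = []
    for hit in hits:
        parent_id = hit.get(id_key)
        # Several chunks may share a parent; keep each parent only once.
        if parent_id is not None and parent_id not in ordered_ids:
            ordered_ids.append(parent_id)
    # Skip IDs that are missing from the docstore instead of failing.
    return [docstore[pid] for pid in ordered_ids if pid in docstore]


# Hypothetical docstore and search hits (best match first).
docstore = {"A": "full text of document A", "B": "full text of document B"}
hits = [
    {"text": "chunk 3 of B", "doc_id": "B"},
    {"text": "chunk 1 of B", "doc_id": "B"},
    {"text": "chunk 7 of A", "doc_id": "A"},
]

documents = parents_for_hits(hits, docstore)
print(documents)
```

Three chunk hits collapse into two parent documents, so the number of returned documents can be smaller than the number of matching chunks.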

Adjusting Larger Chunk Sizes

As the previous results show, an entire document may be too large to return as is.

In this case, we first split the raw document into larger chunks, and then split those into smaller chunks.

We then index the small chunks but return the larger chunks at retrieval time (rather than the entire document).

  • Use RecursiveCharacterTextSplitter to create parent and child documents.

    • Parent documents have chunk_size set to 1000.

    • Child documents have chunk_size set to 200, creating smaller sizes than the parent documents.
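The effect of the two sizes can be checked with a simplified splitter. The sketch below uses plain fixed-size slicing rather than RecursiveCharacterTextSplitter, which also respects separators, so real chunk counts will differ. The `split_text` helper and the 2,500-character sample text are hypothetical.

```python
from typing import Dict, List


def split_text(text: str, chunk_size: int) -> List[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Hypothetical raw document of 2,500 characters.
raw_document = "word " * 500

# First level: larger parent chunks (chunk_size=1000).
parent_chunks = split_text(raw_document, chunk_size=1000)

# Second level: each parent is split again into small child chunks (chunk_size=200).
children_by_parent: Dict[int, List[str]] = {
    index: split_text(parent, chunk_size=200)
    for index, parent in enumerate(parent_chunks)
}

total_children = sum(len(children) for children in children_by_parent.values())
print(len(parent_chunks), total_children)
```

Every child chunk stays inside exactly one parent chunk, which is what lets a match on a 200-character piece be answered with its 1000-character parent.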

This is the code to initialize ParentDocumentRetriever:

  • The vectorstore parameter specifies the vector store that stores document vectors.

  • The docstore parameter specifies the document store that stores document data.

  • The child_splitter parameter specifies the document splitter used to split child documents.

  • The parent_splitter parameter specifies the document splitter used to split parent documents.

ParentDocumentRetriever handles hierarchical document structures, separately splitting and storing parent and child documents. This allows effective use of both parent and child documents during retrieval.

Add docs to the retriever object. This adds new documents to the set of documents that retriever can search through.

Listing the store's keys now shows many more entries than before. Each key corresponds to one of the larger parent chunks rather than to a whole document.

Now let's use the invoke() method of the retriever object to search for documents.
