RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
Author: Yoonji Oh
Peer Review:
Proofread: Yun Eun
This is part of the LangChain Open Tutorial
Overview
Welcome to the RAPTOR Tutorial!
In this tutorial, we’ll explore RAPTOR, which stands for Recursive Abstractive Processing for Tree-Organized Retrieval. This innovative technique organizes data into a tree-like structure through summarization, making it easier to efficiently locate the information you need. The search process begins at the root and navigates down to more detailed nodes, ultimately delivering the most relevant answer to your query.
This tutorial is inspired by the paper "RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval". A special thanks to the authors for their groundbreaking work that inspired this deep dive. Let’s get started!
References
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. Authors: Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, Christopher D. Manning. Published in: arXiv preprint arXiv:2401.18059 (2024)
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
`langchain-opentutorial` is a package that provides a set of easy-to-use environment setup tools, useful functions, and utilities for tutorials. You can check out `langchain-opentutorial` for more details.
You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.
[Note] This is not necessary if you've already set the required API keys in previous steps.
Installation
To proceed with this tutorial, you'll need to install the following packages: langchain, umap-learn, scikit-learn, langchain_community, tiktoken, langchain-openai, langchainhub, chromadb, langchain-anthropic and matplotlib. You can install them all at once using the command below:
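```bash
pip install langchain umap-learn scikit-learn langchain_community tiktoken langchain-openai langchainhub chromadb langchain-anthropic matplotlib
```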
Package List
langchain (Github) - LangChain is a framework for developing applications powered by large language models (LLMs).
umap-learn (Github) - Uniform Manifold Approximation and Projection (UMAP) is a dimensionality reduction technique that can be used for visualization, similar to t-SNE, but also for general non-linear dimensionality reduction. UMAP depends on scikit-learn and its dependencies like numpy and scipy. UMAP also requires numba for performance reasons.
scikit-learn (Github) - scikit-learn is a Python module for machine learning built on top of SciPy and distributed under the 3-Clause BSD license. It is highly recommended if you want to use the `umap-learn` library.
langchain_community - An extension package provided by the LangChain community, including tools and resources created by the community.
tiktoken (Github) - tiktoken is a fast BPE tokenizer designed for use with OpenAI's models. Language models do not interpret text as we do; instead, they process a sequence of numbers (tokens). Byte pair encoding (BPE) is a method for converting text into tokens.
langchain-openai (Github) - A package that integrates LangChain with OpenAI APIs to leverage OpenAI's language models within LangChain workflows.
langchainhub - LangChain's "Hub" enables the sharing and reuse of various LangChain components (e.g., chains, agents).
langchain-anthropic (Docs) - A package to integrate LangChain with Anthropic APIs, allowing access to language models like Claude through LangChain.
matplotlib (Github) - matplotlib is a popular library for creating static, animated, and interactive visualizations in Python.
What is RAPTOR?
The RAPTOR paper, RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval, introduces an intriguing approach to indexing and retrieving documents.
Leaves represent the initial set of documents.
The leaves are embedded and clustered.
Then, the clusters are summarized at a higher (more abstract) level to capture information across similar documents. This process is performed recursively, forming a "tree" that progresses from the original documents (leaves) to more abstract summaries.
This approach can be applied at various scales. Leaves can be:
Text chunks within a single document (as shown in the paper)
Entire documents (as shown below). With longer-context LLMs, the approach extends naturally to whole documents.
Documents
Let’s apply this concept to LangChain's LCEL documentation.
In this case, each document is a unique web page from the LCEL documentation.
The context sizes range from fewer than 2,000 tokens to over 10,000 tokens.
The process extracts text data from web documents and calculates the token count of each text, which is then visualized in a histogram; a code sketch follows the list below.
Token Counting:
Use the `tiktoken` library to calculate the number of tokens in a string based on a given encoding name.
Recursive URL Loading:
Use the `RecursiveUrlLoader` class to recursively load web documents from specified URLs. Extract text from HTML documents using BeautifulSoup during this process.
Aggregating Text Data:
Load documents from multiple URLs and combine all text data into a single list.
Token Calculation:
Call the `num_tokens_from_string` function for each document text to calculate the number of tokens, then store the results in a list.
Histogram Visualization:
Use `matplotlib` to visualize the distribution of token counts in a histogram. The histogram plots token counts on the x-axis and the frequency of documents with that token count on the y-axis.
A histogram helps to understand the distribution of data, especially for visually identifying the length distribution of the text data.
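A minimal sketch of these steps follows; the LCEL documentation URL and the `max_depth` value are illustrative assumptions, not fixed by this tutorial:

```python
import matplotlib.pyplot as plt
import tiktoken
from bs4 import BeautifulSoup
from langchain_community.document_loaders import RecursiveUrlLoader


def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Return the number of tokens in a string for the given encoding."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(string))


# Recursively load the LCEL documentation pages, extracting plain text
# from the HTML with BeautifulSoup.
url = "https://python.langchain.com/docs/expression_language/"
loader = RecursiveUrlLoader(
    url=url, max_depth=20, extractor=lambda x: BeautifulSoup(x, "html.parser").text
)
docs = loader.load()

# Aggregate all page texts into a single list and count tokens per document.
docs_texts = [d.page_content for d in docs]
counts = [num_tokens_from_string(t, "cl100k_base") for t in docs_texts]

# Histogram: token counts on the x-axis, number of documents on the y-axis.
plt.hist(counts, bins=30)
plt.xlabel("Token count")
plt.ylabel("Number of documents")
plt.title("Token counts of LCEL documentation pages")
plt.show()
```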

Explaining the Process of Sorting, Concatenating Document Text, and Calculating Token Count
The documents (`docs`) are sorted based on the value of the "source" key in their metadata. The sorted list of documents is then reversed.
The text content of the reversed documents is concatenated using a specific delimiter (`"\n\n\n --- \n\n\n"`).
The number of tokens in the concatenated text is calculated using the `num_tokens_from_string` function and printed. The `cl100k_base` encoding is used for tokenization.
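A sketch of these steps, reusing `docs` and `num_tokens_from_string` from the previous sketch:

```python
# Sort by the "source" metadata key, reverse the order, then concatenate
# the page contents with the delimiter described above.
d_sorted = sorted(docs, key=lambda d: d.metadata["source"])
d_reversed = list(reversed(d_sorted))
concatenated_content = "\n\n\n --- \n\n\n".join(
    doc.page_content for doc in d_reversed
)
print(
    "Num tokens in all context:",
    num_tokens_from_string(concatenated_content, "cl100k_base"),
)
```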
Explaining the Process of Splitting Text Using RecursiveCharacterTextSplitter
Set the `chunk_size_tok` variable to specify the size of each text chunk as 2000 tokens.
Initialize the text splitter using the `from_tiktoken_encoder` method of `RecursiveCharacterTextSplitter`. Set the `chunk_size` to 2000 and the `chunk_overlap` to 0 to ensure there is no overlap between chunks.
Call the `split_text` method of the initialized text splitter to split the concatenated text stored in the `concatenated_content` variable. The split results are stored in the `texts_split` variable.
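A sketch of the splitter setup; the import path may vary slightly across LangChain versions:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size_tok = 2000  # chunk size measured in tokens

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=chunk_size_tok, chunk_overlap=0
)
texts_split = text_splitter.split_text(concatenated_content)
```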
Models
Various models can be tested, including the new Claude3 series.
Don’t forget to set the relevant API keys: `OPENAI_API_KEY` for OpenAI and `ANTHROPIC_API_KEY` for Anthropic, if using their services.
Implement a chatbot model using ChatOpenAI or ChatAnthropic with OpenAIEmbeddings.
Instantiate `OpenAIEmbeddings` to initialize OpenAI's embedding functionality.
Use `ChatOpenAI` or `ChatAnthropic` to initialize the chatbot model, setting the temperature to 0.
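A minimal sketch; the model names below are illustrative choices, not mandated by the tutorial:

```python
from langchain_anthropic import ChatAnthropic
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Embedding model used throughout the tutorial.
embd = OpenAIEmbeddings()

# Pick one chat model; the model names are illustrative.
model = ChatOpenAI(temperature=0, model="gpt-4o-mini")
# model = ChatAnthropic(temperature=0, model="claude-3-opus-20240229")
```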
The following code illustrates how to set up Cache Embedding using LangChain. It prevents redundant embedding calculations for identical inputs by storing and reusing cached values.
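A sketch of cache-backed embeddings, assuming a local file store (the `./cache/` path is illustrative):

```python
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore

# Wrap the embedding model so identical inputs hit the cache instead of
# calling the API again; "./cache/" is an illustrative directory.
store = LocalFileStore("./cache/")
cached_embd = CacheBackedEmbeddings.from_bytes_store(
    embd, store, namespace=embd.model
)
```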
The following code demonstrates how to initialize ChatOpenAI and ChatAnthropic models using LangChain and utilize streaming functionality to output results token by token.
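A sketch of token-by-token streaming with either model:

```python
# Stream the response and print it token by token.
for chunk in model.stream("Briefly explain what RAPTOR is."):
    print(chunk.content, end="", flush=True)
```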
Tree Construction
The clustering approach used in tree construction includes several intriguing ideas.
GMM (Gaussian Mixture Model)
Models the distribution of data points across various clusters.
Determines the optimal number of clusters by evaluating the model's Bayesian Information Criterion (BIC).
UMAP (Uniform Manifold Approximation and Projection)
Supports the clustering step by reducing the dimensionality of high-dimensional data.
Emphasizes natural groupings based on the similarity of data points.
Local and Global Clustering
Used to analyze data at different scales.
Effectively captures both fine-grained patterns and broader trends within the data.
Thresholding
Applied to determine cluster membership in the context of GMM.
Based on probability distributions, so a single data point can be assigned to one or more clusters.
The code for GMM and thresholding is credited to Sarthi et al., the authors of the RAPTOR paper; full credit is given to them.
The global_cluster_embeddings function uses UMAP for global dimensionality reduction of embeddings.
Reduces the dimensionality of the input embeddings to the specified dimension (`dim`) using UMAP.
`n_neighbors` specifies the number of neighbors to consider for each point and defaults to the square root of the number of embeddings if not provided.
`metric` specifies the distance metric to be used by UMAP.
The result is a numpy array of embeddings reduced to the specified dimensions.
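A sketch following the description above (adapted from the code credited to Sarthi et al.):

```python
from typing import Optional

import numpy as np
import umap


def global_cluster_embeddings(
    embeddings: np.ndarray,
    dim: int,
    n_neighbors: Optional[int] = None,
    metric: str = "cosine",
) -> np.ndarray:
    """Globally reduce embeddings to `dim` dimensions with UMAP."""
    if n_neighbors is None:
        # Default to roughly the square root of the number of embeddings.
        n_neighbors = int((len(embeddings) - 1) ** 0.5)
    return umap.UMAP(
        n_neighbors=n_neighbors, n_components=dim, metric=metric
    ).fit_transform(embeddings)
```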
The function local_cluster_embeddings is implemented to perform local dimensionality reduction on embedding data.
The input embeddings (`embeddings`) are reduced to the specified dimension (`dim`) using UMAP.
During the dimensionality reduction process, the number of neighbors to consider for each point (`num_neighbors`) and the distance metric (`metric`) are used as parameters.
Finally, the reduced embeddings are returned as a numpy array.
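A sketch of the local variant; the default `num_neighbors` of 10 is an illustrative small neighborhood:

```python
def local_cluster_embeddings(
    embeddings: np.ndarray, dim: int, num_neighbors: int = 10, metric: str = "cosine"
) -> np.ndarray:
    """Locally reduce embeddings to `dim` dimensions with a small neighborhood."""
    return umap.UMAP(
        n_neighbors=num_neighbors, n_components=dim, metric=metric
    ).fit_transform(embeddings)
```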
The get_optimal_clusters function is used to determine the optimal number of clusters based on the given embedding data. This is achieved by calculating the Bayesian Information Criterion (BIC) using the Gaussian Mixture Model (GMM).
The input embeddings are provided as a numpy array.
The maximum number of clusters (`max_clusters`) specifies the upper limit for the number of clusters to consider. The default value is 50.
A fixed `random_state` is used to ensure reproducibility.
The function iterates through multiple cluster numbers for the input embeddings and calculates the BIC value for each.
The number of clusters with the lowest BIC value is determined to be the optimal number of clusters and is returned.
This function is particularly useful for automatically finding the number of clusters that best describe the data in a clustering problem.
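A sketch of the BIC search; the seed value is arbitrary, any fixed `random_state` works:

```python
from sklearn.mixture import GaussianMixture


def get_optimal_clusters(
    embeddings: np.ndarray, max_clusters: int = 50, random_state: int = 1234
) -> int:
    """Pick the cluster count with the lowest BIC under a GMM."""
    max_clusters = min(max_clusters, len(embeddings))
    n_clusters = np.arange(1, max_clusters)
    bics = []
    for n in n_clusters:
        gm = GaussianMixture(n_components=n, random_state=random_state)
        gm.fit(embeddings)
        bics.append(gm.bic(embeddings))
    # The cluster count with the minimum BIC best describes the data.
    return int(n_clusters[np.argmin(bics)])
```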
The GMM_cluster function performs clustering on embeddings using a Gaussian Mixture Model (GMM). This process is based on a probability threshold.
The input embeddings are provided as a numpy array.
The `threshold` parameter specifies the probability threshold for assigning embeddings to specific clusters.
`random_state` is used to ensure reproducibility of the results.
The `get_optimal_clusters` function is called to determine the optimal number of clusters.
Based on the determined number of clusters, a Gaussian Mixture Model is initialized and trained on the input embeddings.
Cluster assignment probabilities are calculated for each embedding, and embeddings are assigned to clusters if their probability exceeds the given threshold.
The function returns a tuple containing the cluster labels for the embeddings and the determined number of clusters.
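A sketch of the thresholded GMM clustering described above:

```python
def GMM_cluster(embeddings: np.ndarray, threshold: float, random_state: int = 0):
    """Cluster embeddings with a GMM; a point joins every cluster whose
    membership probability exceeds `threshold`."""
    n_clusters = get_optimal_clusters(embeddings)
    gm = GaussianMixture(n_components=n_clusters, random_state=random_state)
    gm.fit(embeddings)
    probs = gm.predict_proba(embeddings)
    # One array of cluster IDs per embedding (possibly more than one ID).
    labels = [np.where(prob > threshold)[0] for prob in probs]
    return labels, n_clusters
```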
The perform_clustering function performs dimensionality reduction, global clustering using a Gaussian Mixture Model (GMM), and local clustering within each global cluster to return clustering results.
Dimensionality Reduction: The input embeddings are reduced to a specified dimension (`dim`) using UMAP. This step prepares the embeddings for further clustering.
Global Clustering: Global clustering is performed on the reduced embeddings using a Gaussian Mixture Model (GMM). Cluster assignments are determined based on a specified probability threshold (`threshold`).
Local Clustering: Additional local clustering is performed within each global cluster, applying dimensionality reduction and GMM clustering to the embeddings that belong to that cluster.
Final Output: The function assigns global and local cluster IDs to all embeddings and returns a list of cluster IDs, one entry per embedding, in input order.
This function combines global and local clustering approaches to handle high-dimensional data, providing more granular clustering results. It is particularly effective for analyzing complex data structures and uncovering hierarchical patterns within the data.
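A condensed sketch of the global-then-local pipeline, building on the helpers above:

```python
def perform_clustering(
    embeddings: np.ndarray, dim: int, threshold: float
) -> list:
    """Global GMM clustering after UMAP reduction, then local clustering
    inside each global cluster; returns one array of cluster IDs per embedding."""
    if len(embeddings) <= dim + 1:
        # Too little data for meaningful clustering: one cluster for everything.
        return [np.array([0]) for _ in range(len(embeddings))]

    # Global step: reduce dimensionality, then cluster.
    reduced_global = global_cluster_embeddings(embeddings, dim)
    global_clusters, n_global = GMM_cluster(reduced_global, threshold)

    all_local_clusters = [np.array([]) for _ in range(len(embeddings))]
    total_clusters = 0

    # Local step: cluster again within each global cluster.
    for i in range(n_global):
        mask = np.array([i in gc for gc in global_clusters])
        cluster_embeddings = embeddings[mask]
        if len(cluster_embeddings) == 0:
            continue
        if len(cluster_embeddings) <= dim + 1:
            local_clusters = [np.array([0])] * len(cluster_embeddings)
            n_local = 1
        else:
            reduced_local = local_cluster_embeddings(cluster_embeddings, dim)
            local_clusters, n_local = GMM_cluster(reduced_local, threshold)

        # Offset local IDs so they stay unique across global clusters.
        for j, idx in enumerate(np.where(mask)[0]):
            all_local_clusters[idx] = np.append(
                all_local_clusters[idx], local_clusters[j] + total_clusters
            )
        total_clusters += n_local

    return all_local_clusters
```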
Implement the function embed to generate embeddings for a list of text documents.
Takes a list of text documents (`texts`) as input.
Uses the `embed_documents` method of the `embd` object to generate embeddings for the text documents.
Converts the generated embeddings into a `numpy.ndarray` and returns it.
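A sketch, reusing the `embd` embedding model defined earlier:

```python
def embed(texts: list) -> np.ndarray:
    """Embed a list of texts with the `embd` model defined earlier."""
    text_embeddings = embd.embed_documents(texts)
    return np.array(text_embeddings)
```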
The embed_cluster_texts function embeds and clusters a list of texts, returning a pandas.DataFrame containing the original texts, their embeddings, and assigned cluster labels.
Generates embeddings for the given list of texts.
Performs clustering on the generated embeddings using the predefined `perform_clustering` function.
Initializes a `pandas.DataFrame` to store the results.
Stores the original texts, embedding lists, and cluster labels in the DataFrame.
This function combines the embedding generation and clustering of text data into a single step, facilitating the structural analysis and grouping of text data.
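A sketch of this step; the `dim` and `threshold` values passed to `perform_clustering` are illustrative:

```python
import pandas as pd


def embed_cluster_texts(texts: list) -> pd.DataFrame:
    """Embed texts, cluster them, and collect everything in one DataFrame."""
    text_embeddings_np = embed(texts)
    # `dim` and `threshold` values here are illustrative.
    cluster_labels = perform_clustering(text_embeddings_np, dim=10, threshold=0.1)
    df = pd.DataFrame()
    df["text"] = texts
    df["embd"] = list(text_embeddings_np)
    df["cluster"] = cluster_labels
    return df
```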
The fmt_txt function formats text documents from a pandas DataFrame into a single string.
The input parameter is a DataFrame that must contain a `'text'` column with the text documents to be formatted.
All text documents are concatenated into a single string using a specific delimiter (`"--- --- \n --- ---"`).
The function returns a single string containing the concatenated text documents.
The process involves embedding text data, clustering it, and generating summaries for each cluster.
Embedding and Clustering: The given list of texts is embedded and clustered by similarity. The result is stored in the `df_clusters` DataFrame, which contains the original texts, embeddings, and cluster assignment information.
Expanding the DataFrame: To simplify cluster assignment handling, the DataFrame entries are expanded so that each row contains a single text, embedding, and cluster assignment.
Formatting and Summarizing: Unique cluster identifiers are extracted from the expanded DataFrame. Texts for each cluster are formatted, and summaries are generated. These summaries are stored in the `df_summary` DataFrame, which includes the summary for each cluster, a specified level of detail, and the cluster identifier.
Return Value: The function returns a tuple containing two DataFrames:
The first DataFrame includes the original texts, embeddings, and cluster assignments.
The second DataFrame contains the summaries for each cluster, their detail level, and cluster identifiers.
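A sketch of this step; `chain` is a hypothetical `prompt | model | StrOutputParser()` summarization chain that takes a `context` variable:

```python
def embed_cluster_summarize_texts(texts: list, level: int):
    """Embed, cluster, and summarize texts; returns (df_clusters, df_summary)."""
    df_clusters = embed_cluster_texts(texts)

    # Expand so each row holds exactly one (text, embedding, cluster) triple.
    expanded_list = []
    for _, row in df_clusters.iterrows():
        for cluster in row["cluster"]:
            expanded_list.append(
                {"text": row["text"], "embd": row["embd"], "cluster": cluster}
            )
    expanded_df = pd.DataFrame(expanded_list)
    all_clusters = expanded_df["cluster"].unique()

    # Summarize each cluster with the (hypothetical) summarization chain.
    summaries = []
    for i in all_clusters:
        df_cluster = expanded_df[expanded_df["cluster"] == i]
        formatted_txt = fmt_txt(df_cluster)
        summaries.append(chain.invoke({"context": formatted_txt}))

    df_summary = pd.DataFrame(
        {
            "summaries": summaries,
            "level": [level] * len(summaries),
            "cluster": list(all_clusters),
        }
    )
    return df_clusters, df_summary
```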
This function implements the process of recursively embedding, clustering, and summarizing text data.
The given list of texts is embedded, clustered, and summarized, with results stored at each step.
The function executes up to the specified maximum recursion level or until the number of unique clusters becomes 1, whichever comes first.
At each recursion step, the clustering and summarization results for the current level are returned as DataFrames and stored in a results dictionary.
If the current level is less than the maximum recursion level and the number of unique clusters is greater than 1, the summary results from the current level are used as the input texts for the next level, and the function is called recursively.
Finally, the function returns a dictionary containing the cluster DataFrames and summary DataFrames for each recursion level.
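A sketch of the recursion described above:

```python
def recursive_embed_cluster_summarize(
    texts: list, level: int = 1, n_levels: int = 3
) -> dict:
    """Recursively embed, cluster, and summarize until `n_levels` is reached
    or only one cluster remains."""
    results = {}
    df_clusters, df_summary = embed_cluster_summarize_texts(texts, level)
    results[level] = (df_clusters, df_summary)

    unique_clusters = df_summary["cluster"].nunique()
    if level < n_levels and unique_clusters > 1:
        # This level's summaries become the next level's input texts.
        new_texts = df_summary["summaries"].tolist()
        results.update(
            recursive_embed_cluster_summarize(new_texts, level + 1, n_levels)
        )
    return results
```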
In the paper, collapsed tree retrieval is reported to achieve the best performance.
This involves flattening the tree structure into a single layer, followed by applying k-Nearest Neighbor (kNN) retrieval across all nodes simultaneously.
Below is a simplified explanation of this process.
The process of building a vectorized and searchable Chroma vector store using text data is described as follows:
The text data stored in `leaf_texts` is initially copied to the `all_texts` variable.
The result data (`results`) is iterated through, extracting the summarized texts at each level and appending them to `all_texts`.
The `summaries` column from the DataFrame at each level is converted into a list and extracted.
The extracted summaries are added to `all_texts`.
Using all the text data (`all_texts`), a Chroma vector store is constructed.
The `Chroma.from_texts` function is called to vectorize the text data and create the vector store.
To make the generated vector store searchable, the `.as_retriever()` method is used to initialize a retriever.
Through this process, text data, including summaries from various levels, is vectorized and used to build a searchable Chroma vector store.
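A sketch of the collapsed-tree index, assuming `leaf_texts` holds the split leaf documents and `results` is the output of `recursive_embed_cluster_summarize`:

```python
from langchain_community.vectorstores import Chroma

# Start from the leaf texts and append the summaries from every level.
all_texts = leaf_texts.copy()
for level in sorted(results.keys()):
    summaries = results[level][1]["summaries"].tolist()
    all_texts.extend(summaries)

# Leaves and summaries go into one flat, searchable vector store.
vectorstore = Chroma.from_texts(texts=all_texts, embedding=embd)
retriever = vectorstore.as_retriever()
```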
The code below saves the database locally.
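A sketch, assuming a FAISS index is used for the local copy (as the note below suggests); the `faiss_index` folder name is illustrative:

```python
from langchain_community.vectorstores import FAISS

# Build a FAISS index over the same texts and save it to disk.
db = FAISS.from_texts(all_texts, embd)
db.save_local("faiss_index")
```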
[NOTE]
The following error may occur when using `FAISS.load_local`:
Why the Error Occurs
The FAISS.load_local method uses pickle files for deserialization, which can pose a security risk. Pickle files may execute malicious code if tampered with, so deserialization is disabled by default unless explicitly enabled.
How to Fix the Error
If you trust the source of the pickle file, you can safely enable deserialization by setting `allow_dangerous_deserialization=True`:
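```python
from langchain_community.vectorstores import FAISS

# Only load index files you created yourself or fully trust.
db = FAISS.load_local(
    "faiss_index", embd, allow_dangerous_deserialization=True
)
```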
[Warning]
Only enable `allow_dangerous_deserialization=True` if:
The pickle file was created by you.
You are certain that the file has not been tampered with by others.
Do not enable this for files from untrusted or unknown sources.
Implement the process of defining a Retrieval Augmented Generation (RAG) chain and handling a specific code example request.
Use `hub.pull` to fetch the RAG prompt.
Define the `format_docs` function for document formatting. This function concatenates the page content of documents and returns it.
Construct the RAG chain. This chain retrieves context from the retriever, formats it using the `format_docs` function, and processes the question.
Use `RunnablePassthrough()` to pass the question directly through.
The chain parses the final output into a string using the prompt, model, and `StrOutputParser()`.
Use the `rag_chain.invoke` method to process the question: "How to define a RAG chain? Give me a specific code example."
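A sketch of the chain described above, assuming the `retriever` and `model` defined earlier:

```python
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Fetch a standard RAG prompt from the LangChain Hub.
prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs):
    # Concatenate the page content of the retrieved documents.
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

rag_chain.invoke("How to define a RAG chain? Give me a specific code example.")
```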
The link below provides the result of the code execution using the LangChain framework: