Summarization
Author: Erika Park
Peer Review:
Proofread: Juni Lee
This is part of the LangChain Open Tutorial
Overview
Key Summarization Techniques
In this tutorial, we will explore various document summarization techniques, discussing their approaches and applications.
Stuff: Summarizing the entire document at once by feeding it directly into the LLM's context window. This is the simplest and most straightforward method.
Map-Reduce: Splitting a document into multiple chunks, summarizing each chunk individually (map), and then merging the summaries into a final summary (reduce).
Map-Refine: Splitting a document into chunks, summarizing each one, and then progressively refining the summary by referencing previous summaries.
Chain of Density: Repeatedly summarizing a document while filling in missing entities, progressively improving the summary quality.
Clustering-Map-Refine: Dividing a document into N clusters, summarizing a central document from each cluster, and then refining the cluster summaries for a comprehensive result.
Core Principles of Document Summarization
A central question when building a summarizer is: How should the document be presented to the LLM's context window? The primary approaches include:
Stuff (Full Input): Placing the entire document into the context window at once. Simple but limited when handling long documents.
Map-Reduce (Chunk and Merge): Splitting the document into multiple chunks, summarizing each chunk, and then merging the results into a final summary. Useful for handling large datasets.
Refine (Sequential Improvement): Processing the document sequentially and refining the summary by merging previous summaries with new content, making it effective for detailed summarization needs.
By the end of this tutorial, you will understand how to use these techniques effectively and choose the right method for your specific summarization scenarios.
Document Used for Practice
Artificial Intelligence Index Report 2024 (Stanford University)
Authors: Erik Brynjolfsson, John Etchemendy, Juan Carlos Niebles, Vanessa Parli, Yoav Shoham, Russell Wald (Stanford University); Katrina Ligett (Hebrew University); Terah Lyons (JPMorgan Chase & Co.); James Manyika (Google, University of Oxford); Juan Carlos Niebles (Salesforce); Yoav Shoham (AI21 Labs)
File Name: "Artificial Intelligence Index Report.pdf"
Please copy the downloaded file to the data folder for practice.
Table of Contents
References
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorial is a package that provides a set of easy-to-use environment setup, useful functions, and utilities for tutorials. You can check out langchain-opentutorial for more details.
You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.
[Note] This is not necessary if you've already set the required API keys in previous steps.
Stuff
The stuff documents chain (where "stuff" means "to fill" or "for filling") is the simplest type of document chain. It takes a list of documents, inserts them all into the prompt, and then sends that prompt to the LLM.
In other words, the context input directly receives Document objects. When using a retriever to search a vector_store, it returns a List[Document]. This chain automatically converts the documents into a format suitable for the LLM without requiring manual conversion to strings.
This chain is suitable for applications where the documents are small and only a few are passed in each call.
The following prompt is designed to generate concise and effective summaries by guiding the language model with clear instructions.
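A minimal sketch of such a prompt and stuff chain, assuming langchain-openai is installed and OPENAI_API_KEY is set; the prompt wording and model name here are illustrative, not the tutorial's exact values.

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain.chains.combine_documents import create_stuff_documents_chain

# Prompt that receives the documents through the `context` variable.
prompt = ChatPromptTemplate.from_template(
    "Please summarize the following text concisely, keeping the key points:\n\n{context}"
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# create_stuff_documents_chain inserts all documents into `context` at once.
stuff_chain = create_stuff_documents_chain(llm, prompt)

# `docs` is a List[Document], e.g. from a loader or retriever:
# summary = stuff_chain.invoke({"context": docs})
```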
The following function can be used for streaming token output, useful for callback handling when working with LLMs.
A callback is a mechanism that allows specific actions to be executed each time a new token is generated by the LLM. This can be useful for streaming token outputs in real-time.
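As a sketch, a simple handler that prints each token as it arrives (the class name is our own):

```python
from langchain_core.callbacks import BaseCallbackHandler

class StreamingCallback(BaseCallbackHandler):
    """Print each new token to stdout as the LLM generates it."""

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        print(token, end="", flush=True)

# Attach the handler when constructing the model:
# llm = ChatOpenAI(model="gpt-4o-mini", streaming=True, callbacks=[StreamingCallback()])
```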
Map-Reduce
Map-reduce summarization is an effective technique for condensing lengthy documents. This method involves two primary stages:
Map Stage: The document is divided into smaller chunks, each of which is summarized independently.
Reduce Stage: The individual summaries are then combined to form a cohesive final summary.
This approach is particularly advantageous when dealing with extensive documents, as it allows for parallel processing of chunks during the map stage, thereby enhancing efficiency. Additionally, it helps circumvent the token limitations inherent in language models by ensuring that each chunk fits within the model's context window.

In this section, we will use the "Artificial Intelligence Index Report.pdf" to carry out the Map phase and the Reduce phase.
Load the downloaded data
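For example, using PyMuPDFLoader (the loader choice is ours; any PDF loader works, and pymupdf must be installed):

```python
from langchain_community.document_loaders import PyMuPDFLoader

# Load the report; each page becomes one Document.
loader = PyMuPDFLoader("data/Artificial Intelligence Index Report.pdf")
docs = loader.load()
print(f"Number of pages: {len(docs)}")
```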
Map Stage
During the map stage, each chunk is typically processed by generating a summary.
While the standard approach involves summarizing the content of each chunk, an alternative method is to extract key information instead. Since the reduce stage ultimately combines the outputs into a final summary, both approaches can be effective with minimal impact on the final result.
The choice between summarization and key information extraction during the map stage can be adjusted based on the specific goals and requirements of the task.
Generate summaries for each document using batch() processing
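A sketch of the map step; the prompt wording is illustrative, and batch() runs the per-chunk calls concurrently:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Illustrative map prompt: extract the key information from one chunk.
map_prompt = ChatPromptTemplate.from_template(
    "Extract the key points of the following text as bullet points:\n\n{documents}"
)

map_chain = map_prompt | ChatOpenAI(model="gpt-4o-mini", temperature=0) | StrOutputParser()

# Summarize all chunks in parallel.
doc_summaries = map_chain.batch([{"documents": doc.page_content} for doc in docs])
```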
Reduce Stage
The Reduce Chain takes the results generated during the map stage and combines and refines them into a final cohesive summary.
Here's an example of how to create a Reduce Chain using LangChain:
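A minimal version of such a chain (the prompt wording and model are our placeholders):

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Illustrative reduce prompt: merge the per-chunk summaries into one.
reduce_prompt = ChatPromptTemplate.from_template(
    "Combine the following partial summaries into a single coherent summary:\n\n{doc_summaries}"
)

reduce_chain = reduce_prompt | ChatOpenAI(model="gpt-4o-mini", temperature=0) | StrOutputParser()

final_summary = reduce_chain.invoke({"doc_summaries": "\n\n".join(doc_summaries)})
```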
Full Implementation of the Map-Reduce Chain
The following code combines both the map and reduce stages for summarizing documents using LangChain.
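One way to wire the two stages together, reusing the map_chain and reduce_chain defined above:

```python
def map_reduce_chain(docs):
    """Summarize each chunk (map), then merge the summaries (reduce)."""
    # Map: summarize every chunk concurrently.
    summaries = map_chain.batch([{"documents": d.page_content} for d in docs])
    # Reduce: merge the partial summaries into one final summary.
    return reduce_chain.invoke({"doc_summaries": "\n\n".join(summaries)})

summary = map_reduce_chain(docs)
print(summary)
```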
Map-Refine
The Map-Refine method is another approach for document summarization, similar to Map-Reduce but with some key differences in how summaries are processed and combined.
Map Stage:
The document is divided into multiple smaller chunks.
Each chunk is independently summarized.
Refine Stage:
The generated summaries are processed sequentially.
In each iteration, the previous summary is combined with the next chunk's information to update and refine the summary.
Iterative Process:
The refine stage continues iteratively until all chunks have been processed.
Each iteration enhances the summary by incorporating more information while retaining previously captured details.
Final Summary:
Once all chunks have been processed, the final summary is obtained after the last refinement step.
Key Advantages:
Maintains Document Order: This method preserves the original order of the document, making it particularly useful for content where sequence matters.
Contextual Refinement: Each step progressively improves the summary, making it ideal for content where a gradual build-up of context is necessary.
Limitations:
Sequential Processing: The refine stage requires sequential steps, making parallelization difficult.
Time-Consuming: Due to its non-parallel nature, it can be slower compared to Map-Reduce, especially for large datasets.
Use Cases:
Processing technical manuals or research papers where context builds across sections.
Summarizing meeting transcripts where events unfold in chronological order.

Map Stage
During the map stage, a summary is generated for each individual chunk of the document. This step involves processing the content of each chunk separately, ensuring that key information from every section is captured independently before moving on to the next stage.
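Reusing the map_chain from the Map-Reduce section on a few sample pages (the slice below is illustrative) produces output like the list that follows:

```python
# Map stage: extract key information from each sample chunk.
summaries = map_chain.batch([{"documents": d.page_content} for d in docs[:3]])
print(summaries)
```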
[ "- Global private investment in AI has decreased for the second year, but investment in generative AI has significantly increased. - There has been a record number of mentions of AI in Fortune 500 earnings calls and legislative proceedings, indicating heightened interest and awareness. - U.S. regulators enacted more AI-related regulations in 2023 than in previous years, reflecting growing concerns about AI's potential risks, such as deepfakes and election interference. - Public awareness of AI has risen, but this has also led to increased nervousness among the population regarding its implications.", "- AI has outperformed humans in specific tasks like image classification and language understanding, but struggles with complex tasks such as advanced mathematics and planning. - The majority of significant AI research is driven by industry, with 51 notable machine learning models produced by industry compared to only 15 from academia in 2023. - The costs associated with training state-of-the-art AI models have surged, with examples like OpenAI's GPT-4 costing approximately $78 million and Google's Gemini Ultra costing around $191 million. - The U.S. remains the leading source of top AI models, producing 61 notable models in 2023, significantly ahead of the EU and China. - There is a critical lack of standardized evaluations for responsible AI, complicating the comparison of risks and limitations across different models. - Investment in generative AI has dramatically increased, reaching $25.2 billion in 2023, despite a general decline in overall AI private investment. - Studies indicate that AI enhances worker productivity and quality of work, although improper use of AI can negatively impact performance.", - "AI is significantly advancing scientific discovery, with notable applications launched in 2023, such as AlphaDev and GNoME. - The number of AI-related regulations in the U.S. has surged, increasing from one in 2016 to 25 in 2023, with a 56.3% growth in the past year alone. - Global awareness of AI's potential impact is rising, with 66% of people believing it will dramatically affect their lives in the next few years, and 52% expressing nervousness towards AI products and services."]
Refine Stage
During the refine stage, the chunks generated in the previous map stage are processed sequentially, progressively improving the final summary with each iteration. The summary is updated by combining the information from the previous summary with the next chunk, ensuring a more comprehensive and contextually accurate final result.
The following code demonstrates how to create a map_refine_chain that combines both the map and refine stages into a single, streamlined process for document summarization.
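A sketch of such a chain, reusing the map_chain from earlier; the refine prompt wording is our own:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Illustrative refine prompt: fold new information into the running summary.
refine_prompt = ChatPromptTemplate.from_template(
    "Here is the summary so far:\n{previous_summary}\n\n"
    "Refine it with this additional information:\n{current_summary}"
)

refine_chain = refine_prompt | ChatOpenAI(model="gpt-4o-mini", temperature=0) | StrOutputParser()

def map_refine_chain(docs):
    """Map each chunk to a summary, then refine the summaries sequentially."""
    # Map stage: summarize every chunk.
    summaries = map_chain.batch([{"documents": d.page_content} for d in docs])
    # Refine stage: fold each summary into the running summary, in order.
    summary = summaries[0]
    for current in summaries[1:]:
        summary = refine_chain.invoke(
            {"previous_summary": summary, "current_summary": current}
        )
    return summary
```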
Note About the Output:
Clarity: The output progressively becomes clearer and more cohesive.
Sequential Summarization: The results build on previous iterations.
Check the output below to see how the summaries evolve through each refinement step.
Chain of Density (CoD)
Paper: From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting
The Chain of Density (CoD) prompt is a technique developed to improve summary generation using GPT-4.
This method begins by generating an initial summary with minimal entities and then progressively incorporates missing key entities without increasing the summary's length. Studies have shown that summaries generated using CoD are more abstract, better at information fusion, and achieve a density similar to human-written summaries compared to standard prompts.
Progressive Improvement:
CoD generates a simple summary with few entities initially.
It then gradually enhances the summary by adding important entities step by step.
During this process, the summary length remains constant while the information density increases, resulting in a summary that is both information-rich and easy to read.
Balancing Information Density and Readability:
The CoD technique adjusts the information density of summaries, striking an optimal balance between informativeness and readability.
Research indicates that readers prefer CoD summaries over standard GPT-4 summaries, as they are denser without being overwhelmingly packed with information, closely matching the density of human-written summaries.
Enhanced Abstraction and Information Fusion:
Summaries generated using CoD tend to be more abstract and excel in information fusion.
They also reduce the "lead bias," where summaries focus too heavily on the beginning of the original text.
This contributes to better overall summary quality and readability.
The Chain of Density approach offers a structured and effective way to improve summary generation, making it particularly useful for tasks requiring concise yet information-rich outputs.
Input Parameter
content_category: The type of content being summarized (e.g., article, video transcript, blog post, research paper). Default: Article
content: The content to be summarized.
entity_range: The range of entities to be selected from the content and included in the summary. Default: 1-3
max_words: The maximum number of words included in the summary per iteration. Default: 80
iterations: The number of entity densification rounds. The total number of summaries generated will be iterations + 1. For an 80-word summary, 3 rounds are ideal. For longer summaries, 4-5 rounds may be suitable, and adjusting the entity_range (e.g., 1-4) can further optimize the results. Default: 3
The code below creates a summarization chain using the Chain of Density (CoD) prompt, designed to progressively enhance the summary by increasing entity density while keeping the summary length constant.
First Chain: Displays intermediate results after each iteration.
Second Chain: Extracts only the final summary after all iterations.
The following code demonstrates how to create a Chain of Density (CoD) pipeline that iteratively refines a document summary by progressively adding key entities and improving the summary detail through multiple iterations.
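A condensed sketch of both chains; the prompt below abridges the paper's full CoD prompt, and the model name is our placeholder:

```python
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Condensed CoD prompt; the paper's full prompt is longer and more precise.
cod_prompt = ChatPromptTemplate.from_template(
    "Article: {content}\n\n"
    "You will generate increasingly concise, entity-dense summaries of the "
    "{content_category} above. Repeat the following two steps {iterations} times.\n"
    "Step 1. Identify {entity_range} informative entities from the article "
    "that are missing from the previously generated summary.\n"
    "Step 2. Write a new, denser summary of identical length (at most "
    "{max_words} words) that covers every entity from the previous summary "
    "plus the missing entities.\n"
    "Answer in JSON: a list of dictionaries whose keys are "
    '"missing_entities" and "denser_summary".'
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# First chain: returns the full list, one entry per densification round.
cod_chain = cod_prompt | llm | JsonOutputParser()

# Second chain: keeps only the final, densest summary.
cod_final_chain = cod_chain | (lambda rounds: rounds[-1]["denser_summary"])
```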
Review the data to be summarized.
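For instance, take a single page of the loaded report as the content (the page index here is arbitrary):

```python
# Use one page of the report as the content for the CoD chain.
content = docs[6].page_content
print(content[:300])
```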
Partial JSON Streaming with Overwriting Chunks
The code below demonstrates how to perform partial JSON streaming where each streamed chunk is a list of JSON dictionaries with additional entities added progressively.
To overwrite the previous chunk with each update, instead of simply concatenating outputs, a carriage return character (\r) is used.
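A sketch of the streaming loop, reusing cod_chain and content from above; JsonOutputParser yields the partially parsed JSON list as it grows:

```python
import sys

# Each streamed chunk is the partially parsed JSON list so far.
for partial in cod_chain.stream(
    {
        "content": content,
        "content_category": "Article",
        "entity_range": "1-3",
        "max_words": 80,
        "iterations": 3,
    }
):
    if partial and isinstance(partial[-1], dict):
        summary = partial[-1].get("denser_summary", "")
        # \r moves the cursor back to the line start, so each update
        # overwrites the previous partial summary instead of appending.
        sys.stdout.write("\r" + summary)
        sys.stdout.flush()
print()
```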
Clustering-Map-Refine
gkamradt, the author of the original tutorial this section is adapted from, proposed an innovative approach for summarizing lengthy documents that balances efficiency and cost without compromising quality.
Background:
Map-Reduce and Map-Refine methods can be time-consuming and expensive when processing long documents.
To address this, the proposed solution involves clustering the document into several clusters (N clusters) and identifying the document closest to the cluster's centroid as the representative document for that cluster.
Only these representative documents are then summarized using either the Map-Reduce or Map-Refine method.
Advantages:
Cost Efficiency: Reduces the number of documents processed directly by the LLM.
Effective Results: The approach retains quality while optimizing performance.
The code in this tutorial is a modified version of the original by gkamradt, tailored to optimize both cost and summarization quality.
Running the code below merges multiple documents into a single text.
The purpose of merging is to avoid separating the content by individual pages.
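A minimal way to do this, assuming docs still holds the pages loaded earlier:

```python
# Join all page contents into one string, removing per-page boundaries.
combined_text = "\n\n".join(doc.page_content for doc in docs)
print(f"Total characters: {len(combined_text):,}")
```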
The combined text contains approximately 781,000 characters.
Splitting a Single Text into Multiple Documents Using RecursiveCharacterTextSplitter
The RecursiveCharacterTextSplitter is used to divide a single text into multiple smaller documents while preserving logical breaks, such as sentences or paragraphs.
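For example (the chunk size and overlap below are our choices, not prescribed values):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split on paragraph and sentence boundaries first, falling back to characters.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=5000,
    chunk_overlap=100,
)
split_docs = text_splitter.create_documents([combined_text])
print(f"Number of chunks: {len(split_docs)}")
```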
Clustering and Summarizing Documents - Using Upstage Embeddings and K-means Clustering
This section demonstrates how to perform document clustering and summarization using Upstage Embeddings, K-means clustering, and a Map-Refine Chain.
The process involves embedding documents, clustering them, selecting representative documents, and finally generating a refined summary using LangChain.
Summary of the Steps:
Embedding: Documents were embedded using the Upstage Embeddings model.
Clustering: K-means clustering was performed on the document vectors.
Visualization: The clusters were visualized using t-SNE.
Document Selection: The most representative document from each cluster was selected.
Final Summary: A refined summary was generated using the Map-Refine Chain.
1. Import Required Libraries and Prepare Embeddings
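A sketch of this step, assuming the langchain-upstage package is installed and UPSTAGE_API_KEY is set; the embedding model name is an assumption, so check the current Upstage model list:

```python
import numpy as np
from langchain_upstage import UpstageEmbeddings

# Embed every chunk; each row of `vectors` represents one document.
embeddings = UpstageEmbeddings(model="embedding-passage")
vectors = np.array(
    embeddings.embed_documents([doc.page_content for doc in split_docs])
)
```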
2. Perform K-Means Clustering
K-means clustering is applied to group the documents into a specified number of clusters.
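For example, with scikit-learn (the cluster count is illustrative and should be tuned to the document):

```python
from sklearn.cluster import KMeans

num_clusters = 10  # illustrative; tune to the document's size

# Cluster the embedding vectors; random_state makes the run reproducible.
kmeans = KMeans(n_clusters=num_clusters, random_state=42).fit(vectors)
print(kmeans.labels_)
```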
KMeans is used to cluster the document vectors. The random_state parameter ensures reproducibility.
3. Visualize Clusters Using t-SNE
The clusters are visualized in a 2D space using t-SNE for dimensionality reduction.
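A sketch of the plot (the styling choices are ours):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project the high-dimensional vectors down to 2D for plotting.
tsne = TSNE(n_components=2, random_state=42)
reduced = tsne.fit_transform(vectors)

plt.scatter(reduced[:, 0], reduced[:, 1], c=kmeans.labels_, cmap="tab10")
plt.title("Document clusters (t-SNE projection)")
plt.show()
```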

Explanation:
TSNE reduces the dimensionality of the document vectors for visualization purposes. Each point represents a document, colored by its cluster label.
4. Select Representative Documents from Each Cluster
The document closest to the cluster centroid is selected as the representative document for that cluster.
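A minimal implementation of this selection step:

```python
import numpy as np

closest_indices = []
for center in kmeans.cluster_centers_:
    # Index of the document whose embedding is nearest to this centroid.
    distances = np.linalg.norm(vectors - center, axis=1)
    closest_indices.append(int(np.argmin(distances)))

# Sort so the representatives keep the document's original order.
selected_indices = sorted(closest_indices)
```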
Explanation:
For each cluster, the document closest to the cluster center is identified.
The indices are sorted to ensure sequential summarization later.
5. Convert Selected Documents to LangChain Document Format
The selected documents are converted to LangChain's Document format for compatibility with the summarization chain.
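For example:

```python
from langchain_core.documents import Document

# Wrap each representative chunk in a Document for the summarization chain.
selected_docs = [
    Document(page_content=split_docs[i].page_content) for i in selected_indices
]
```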
6. Generate the Final Summary Using the Map-Refine Chain
The Map-Refine Chain is used to generate a refined summary from the selected representative documents.
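Assuming the map_refine_chain defined in the Map-Refine section is still in scope, the final step is:

```python
# Generate the final summary from the representative documents only.
final_summary = map_refine_chain(selected_docs)
print(final_summary)
```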