MarkdownHeaderTextSplitter

Author: HeeWung Song(Dan)
Peer Review: BokyungisaGod, Chaeyoon Kim
Proofread: Chaeyoon Kim
This is a part of LangChain Open Tutorial

Overview

This tutorial shows how to split Markdown documents with LangChain's MarkdownHeaderTextSplitter. The splitter divides a document into sections based on its Markdown headers, so the document's structure is preserved for downstream processing.

Context and structure matter for text embedding. Cutting text at arbitrary points can separate content from the heading that gives it meaning, while chunks that follow the document's structure tend to yield more coherent vector representations. This matters most for large documents, where preserved context can improve the accuracy of subsequent analysis and search.

The MarkdownHeaderTextSplitter splits a document at the header levels you specify and returns the content under each header as a separate chunk, with the enclosing headers recorded in the chunk's metadata. This keeps processing efficient while maintaining the document's structural coherence.
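To make the idea concrete, the following is a minimal, standard-library-only sketch of header-based splitting. It is an illustration under simplifying assumptions, not LangChain's implementation: split_by_headers is a hypothetical helper, chunks are plain dicts rather than Document objects, and fenced code blocks and other Markdown edge cases are ignored.

```python
import re


def split_by_headers(text, headers_to_split_on):
    """Group lines under their nearest configured headers (simplified sketch)."""
    # Map each header marker (e.g. "##") to its metadata key (e.g. "Header 2")
    names = dict(headers_to_split_on)
    chunks = []
    current_meta = {}   # headers currently in effect, keyed by metadata name
    current_lines = []  # body lines collected since the last header

    def flush():
        # Emit the collected body (if any) with a copy of the active headers
        body = "\n".join(current_lines).strip()
        if body:
            chunks.append({"page_content": body, "metadata": dict(current_meta)})
        current_lines.clear()

    for line in text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line.strip())
        if match and match.group(1) in names:
            flush()
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level
            for marker, name in names.items():
                if len(marker) >= level:
                    current_meta.pop(name, None)
            current_meta[names[match.group(1)]] = match.group(2).strip()
        else:
            # Headers at levels that are not configured stay in the body
            current_lines.append(line)
    flush()
    return chunks


sample = "# Title\n\n## A\n\nfirst\n\n### A-1\n\nsecond\n\n## B\n\nthird"
chunks = split_by_headers(
    sample, [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
)
for chunk in chunks:
    print(chunk)
```

Each chunk carries the headers that enclose it, so the text under "### A-1" still records that it belongs to "Title" and "A", while the chunk under "## B" no longer carries the "A-1" header.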

Environment Setup

Setting up your environment is the first step. See the Environment Setup guide for more details.

[Note]

langchain-opentutorial is a package that provides easy-to-use environment setup helpers, functions, and utilities for these tutorials.
See the langchain-opentutorial package for more details.

%%capture --no-stderr
%pip install langchain-opentutorial

# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain_community",
        "langchain_text_splitters",
        "langchain_openai",
    ]
)

# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "MarkdownHeaderTextSplitter",
    }
)

Alternatively, you can set and load OPENAI_API_KEY from a .env file.

[Note] This is only necessary if you haven't already set OPENAI_API_KEY in previous steps.

from dotenv import load_dotenv

load_dotenv()

True

Basic Usage of MarkdownHeaderTextSplitter

The MarkdownHeaderTextSplitter splits Markdown-formatted text based on headers. Here's how to use it:

The splitter divides the text at standard Markdown headers (#, ##, ###, etc.).
Store the Markdown text you want to split in a variable called markdown_document.
Define a list called headers_to_split_on. Each entry is a tuple of the header marker to split on and the metadata key under which that header's text is stored.
Create a markdown_splitter object from the MarkdownHeaderTextSplitter class, passing it the headers_to_split_on list.
Call the split_text method on markdown_splitter with markdown_document to split the text.

from langchain_text_splitters import MarkdownHeaderTextSplitter

# Define a markdown document as a string
markdown_document = "# Title\n\n## 1. SubTitle\n\nHi this is Jim\n\nHi this is Joe\n\n### 1-1. Sub-SubTitle \n\nHi this is Lance \n\n## 2. Baz\n\nHi this is Molly"
print(markdown_document)

# Title
    
    ## 1. SubTitle
    
    Hi this is Jim
    
    Hi this is Joe
    
    ### 1-1. Sub-SubTitle 
    
    Hi this is Lance 
    
    ## 2. Baz
    
    Hi this is Molly

# Define the header levels to split on and the metadata key for each
headers_to_split_on = [
    ("#", "Header 1"),  # Level-1 headers ('#') are stored under 'Header 1'
    ("##", "Header 2"),  # Level-2 headers ('##') are stored under 'Header 2'
    ("###", "Header 3"),  # Level-3 headers ('###') are stored under 'Header 3'
]

# Create a MarkdownHeaderTextSplitter that splits on the configured header levels
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
# Split markdown_document by headers; the result is a list of Document objects
md_header_splits = markdown_splitter.split_text(markdown_document)
# Print each chunk's content followed by its header metadata
for doc in md_header_splits:
    print(doc.page_content)
    print(doc.metadata, end="\n=====================\n")

Hi this is Jim  
    Hi this is Joe
    {'Header 1': 'Title', 'Header 2': '1. SubTitle'}
    =====================
    Hi this is Lance
    {'Header 1': 'Title', 'Header 2': '1. SubTitle', 'Header 3': '1-1. Sub-SubTitle'}
    =====================
    Hi this is Molly
    {'Header 1': 'Title', 'Header 2': '2. Baz'}
    =====================
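Because the headers are stripped from page_content by default, the header path lives only in the metadata. One possible way to put that context back into the text before embedding is to prefix each chunk with its header path. The sketch below assumes the chunks shown above, rewritten as plain dicts so it runs without LangChain; with_breadcrumb and the " > " separator are hypothetical choices, not part of the library.

```python
# Chunks copied from the output above, written as plain dicts for illustration
chunks = [
    {
        "page_content": "Hi this is Jim  \nHi this is Joe",
        "metadata": {"Header 1": "Title", "Header 2": "1. SubTitle"},
    },
    {
        "page_content": "Hi this is Lance",
        "metadata": {
            "Header 1": "Title",
            "Header 2": "1. SubTitle",
            "Header 3": "1-1. Sub-SubTitle",
        },
    },
    {
        "page_content": "Hi this is Molly",
        "metadata": {"Header 1": "Title", "Header 2": "2. Baz"},
    },
]


def with_breadcrumb(chunk, keys=("Header 1", "Header 2", "Header 3")):
    """Prefix the chunk text with its header path (hypothetical helper)."""
    # Keep only the header levels that are present, from outermost to innermost
    path = " > ".join(
        chunk["metadata"][key] for key in keys if key in chunk["metadata"]
    )
    return f"{path}\n{chunk['page_content']}"


texts = [with_breadcrumb(chunk) for chunk in chunks]
print(texts[1])
```

The second chunk now starts with the line "Title > 1. SubTitle > 1-1. Sub-SubTitle", followed by its original content.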

Header Retention in Split Output

By default, the MarkdownHeaderTextSplitter removes the header lines from the content of the output chunks; the header text remains available in each chunk's metadata.

To keep the headers in the chunk content as well, set the strip_headers parameter to False.

Example:

markdown_splitter = MarkdownHeaderTextSplitter(
    # Specify headers to split on
    headers_to_split_on=headers_to_split_on,
    # Set to keep headers in the output
    strip_headers=False,
)
# Split the markdown document based on headers
md_header_splits = markdown_splitter.split_text(markdown_document)
# Print each chunk's content (headers included) followed by its metadata
for doc in md_header_splits:
    print(doc.page_content)
    print(doc.metadata, end="\n=====================\n")

# Title  
    ## 1. SubTitle  
    Hi this is Jim  
    Hi this is Joe
    {'Header 1': 'Title', 'Header 2': '1. SubTitle'}
    =====================
    ### 1-1. Sub-SubTitle  
    Hi this is Lance
    {'Header 1': 'Title', 'Header 2': '1. SubTitle', 'Header 3': '1-1. Sub-SubTitle'}
    =====================
    ## 2. Baz  
    Hi this is Molly
    {'Header 1': 'Title', 'Header 2': '2. Baz'}
    =====================

Combining with Other Text Splitters

After splitting by Markdown headers, you can further process the content within each Markdown group using any desired text splitter.

In this example, we'll use the RecursiveCharacterTextSplitter to demonstrate how to effectively combine different splitting methods.

from langchain_text_splitters import RecursiveCharacterTextSplitter

markdown_document = "# Intro \n\n## History \n\nMarkdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \n\nMarkdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \n\n## Rise and divergence \n\nAs Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \n\nadditional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \n\n#### Standardization \n\nFrom 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \n\n# Implementations \n\nImplementations of Markdown are available for over a dozen programming languages."
print(markdown_document)

# Intro 
    
    ## History 
    
    Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] 
    
    Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. 
    
    ## Rise and divergence 
    
    As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for 
    
    additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. 
    
    #### Standardization 
    
    From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. 
    
    # Implementations 
    
    Implementations of Markdown are available for over a dozen programming languages.

First, use MarkdownHeaderTextSplitter to split the Markdown document based on its headers.

headers_to_split_on = [
    ("#", "Header 1"),  # Specify the header level and its name to split on
    # ("##", "Header 2"),
]

# Split the markdown document based on header levels
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on, strip_headers=False
)
md_header_splits = markdown_splitter.split_text(markdown_document)
# Print each chunk's content followed by its metadata
for doc in md_header_splits:
    print(doc.page_content)
    print(doc.metadata, end="\n=====================\n")

# Intro  
    ## History  
    Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]  
    Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.  
    ## Rise and divergence  
    As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for  
    additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.  
    #### Standardization  
    From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.
    {'Header 1': 'Intro'}
    =====================
    # Implementations  
    Implementations of Markdown are available for over a dozen programming languages.
    {'Header 1': 'Implementations'}
    =====================

Now, we'll further split the output of the MarkdownHeaderTextSplitter using the RecursiveCharacterTextSplitter.

chunk_size = 200  # Maximum size of each chunk, in characters
chunk_overlap = 20  # Maximum number of characters shared between adjacent chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split the document into chunks by characters
splits = text_splitter.split_documents(md_header_splits)
# Print each chunk's content followed by the metadata inherited from its header split
for doc in splits:
    print(doc.page_content)
    print(doc.metadata, end="\n=====================\n")

# Intro  
    ## History
    {'Header 1': 'Intro'}
    =====================
    Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its
    {'Header 1': 'Intro'}
    =====================
    readers in its source code form.[9]
    {'Header 1': 'Intro'}
    =====================
    Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.  
    ## Rise and divergence
    {'Header 1': 'Intro'}
    =====================
    As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for
    {'Header 1': 'Intro'}
    =====================
    additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.  
    #### Standardization
    {'Header 1': 'Intro'}
    =====================
    From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.
    {'Header 1': 'Intro'}
    =====================
    # Implementations  
    Implementations of Markdown are available for over a dozen programming languages.
    {'Header 1': 'Implementations'}
    =====================
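To make chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a deliberate simplification, not how RecursiveCharacterTextSplitter works internally: that splitter prefers to break at separators such as paragraph breaks, newlines, and spaces, which is why the chunks above end at word boundaries and the repeated text ("readers in its") is shorter than 20 characters. sliding_chunks is a hypothetical helper.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


pieces = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(pieces)  # ['abcd', 'defg', 'ghij']
```

Each window repeats the last character of the previous one, so text that straddles a boundary appears in both chunks. A larger overlap preserves more context across boundaries at the cost of more duplicated text.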
