This tutorial introduces how to effectively split Markdown documents using LangChain's MarkdownHeaderTextSplitter. This tool divides documents into meaningful sections based on Markdown headers, preserving the document's structure for systematic content processing.
Context and structure are crucial for effective text embedding. Simply dividing text isn't enough: each chunk should keep its semantic connection to the surrounding document so that its vector representation reflects that context. This matters most for large documents, where preserved context can improve the accuracy of subsequent analysis and search operations.
The MarkdownHeaderTextSplitter splits documents according to specified header sets, managing the content under each header group as separate chunks. This enables efficient content processing while maintaining the document's structural coherence.
Alternatively, you can set and load OPENAI_API_KEY from a .env file.
[Note] This is only necessary if you haven't already set OPENAI_API_KEY in previous steps.
from dotenv import load_dotenv
load_dotenv()
True
Basic Usage of MarkdownHeaderTextSplitter
The MarkdownHeaderTextSplitter splits Markdown-formatted text based on headers. Here's how to use it:
First, the splitter divides the text based on standard Markdown headers (#, ##, ###, etc.).
Store the Markdown you want to split in a variable called markdown_document.
You'll need a list called headers_to_split_on. This list uses tuples to define the header levels you want to split on and what you want to call them.
Now, create a markdown_splitter object using the MarkdownHeaderTextSplitter class, and give it that headers_to_split_on list.
To actually split the text, call the split_text method on your markdown_splitter object, passing in your markdown_document.
from langchain_text_splitters import MarkdownHeaderTextSplitter
# Define a markdown document as a string
markdown_document = "# Title\n\n## 1. SubTitle\n\nHi this is Jim\n\nHi this is Joe\n\n### 1-1. Sub-SubTitle \n\nHi this is Lance \n\n## 2. Baz\n\nHi this is Molly"
print(markdown_document)
# Title
## 1. SubTitle
Hi this is Jim
Hi this is Joe
### 1-1. Sub-SubTitle
Hi this is Lance
## 2. Baz
Hi this is Molly
headers_to_split_on = [
    # Define header levels and their names for document splitting
    ("#", "Header 1"),  # Header level 1 is marked with '#' and named 'Header 1'
    ("##", "Header 2"),  # Header level 2 is marked with '##' and named 'Header 2'
    ("###", "Header 3"),  # Header level 3 is marked with '###' and named 'Header 3'
]
# Create a MarkdownHeaderTextSplitter object to split text based on markdown headers
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
# Split markdown_document by headers and store in md_header_splits
md_header_splits = markdown_splitter.split_text(markdown_document)
# Print the split results
for header in md_header_splits:
    print(f"{header.page_content}")
    print(f"{header.metadata}", end="\n=====================\n")
Hi this is Jim
Hi this is Joe
{'Header 1': 'Title', 'Header 2': '1. SubTitle'}
=====================
Hi this is Lance
{'Header 1': 'Title', 'Header 2': '1. SubTitle', 'Header 3': '1-1. Sub-SubTitle'}
=====================
Hi this is Molly
{'Header 1': 'Title', 'Header 2': '2. Baz'}
=====================
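To show where the metadata in the output above comes from, here is a minimal pure-Python sketch of header-based splitting. It is an illustration under simplifying assumptions, not LangChain's actual implementation: the function name split_by_headers is hypothetical, it ignores edge cases such as headers inside fenced code blocks, and it returns plain (content, metadata) tuples rather than Document objects.

```python
import re


def split_by_headers(text, headers_to_split_on):
    """Toy header splitter (hypothetical helper, not LangChain's code).

    Returns a list of (content, metadata) tuples.
    """
    # Map each marker (e.g. "##") to its metadata key (e.g. "Header 2").
    names = dict(headers_to_split_on)
    chunks = []
    active = {}  # header level (int) -> title of the header currently in scope
    buffer = []  # body lines collected since the last tracked header

    def flush():
        # Emit the buffered body (if any) together with the headers in scope.
        body = "\n".join(line for line in buffer if line)
        if body:
            metadata = {
                names["#" * level]: title for level, title in sorted(active.items())
            }
            chunks.append((body, metadata))
        buffer.clear()

    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match and match.group(1) in names:
            flush()
            level = len(match.group(1))
            # A new header closes every header at the same or a deeper level.
            for closed in [lv for lv in active if lv >= level]:
                del active[closed]
            active[level] = match.group(2).strip()
        else:
            # Untracked headers (e.g. "####") are treated as ordinary body text.
            buffer.append(line)
    flush()
    return chunks


markdown_document = "# Title\n\n## 1. SubTitle\n\nHi this is Jim\n\nHi this is Joe\n\n### 1-1. Sub-SubTitle \n\nHi this is Lance \n\n## 2. Baz\n\nHi this is Molly"
headers_to_split_on = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]

chunks = split_by_headers(markdown_document, headers_to_split_on)
for content, metadata in chunks:
    print(content)
    print(metadata, end="\n=====================\n")
```

The key idea is the active dictionary: when a header appears, every header at the same or a deeper level goes out of scope, which is why the chunk under "2. Baz" no longer carries a Header 3 entry.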
Header Retention in Split Output
By default, the MarkdownHeaderTextSplitter removes headers from the output chunks.
However, you can configure the splitter to retain these headers by setting the strip_headers parameter to False.
Example:
markdown_splitter = MarkdownHeaderTextSplitter(
    # Specify headers to split on
    headers_to_split_on=headers_to_split_on,
    # Set to keep headers in the output
    strip_headers=False,
)
# Split markdown document based on headers
md_header_splits = markdown_splitter.split_text(markdown_document)
# Print the split results
for header in md_header_splits:
    print(f"{header.page_content}")
    print(f"{header.metadata}", end="\n=====================\n")
# Title
## 1. SubTitle
Hi this is Jim
Hi this is Joe
{'Header 1': 'Title', 'Header 2': '1. SubTitle'}
=====================
### 1-1. Sub-SubTitle
Hi this is Lance
{'Header 1': 'Title', 'Header 2': '1. SubTitle', 'Header 3': '1-1. Sub-SubTitle'}
=====================
## 2. Baz
Hi this is Molly
{'Header 1': 'Title', 'Header 2': '2. Baz'}
=====================
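With strip_headers=False, a chunk contains only the header lines that directly precede its body, so the "Hi this is Lance" chunk above does not show "Title" or "1. SubTitle" in its text. If the full header path should travel with the text (for example before embedding), one option is to build it from the metadata instead. The sketch below is one possible approach, not part of LangChain: with_header_path is a hypothetical helper, and SimpleNamespace objects stand in for the Document objects returned by split_text so the example runs on its own.

```python
from types import SimpleNamespace


def with_header_path(doc, header_keys=("Header 1", "Header 2", "Header 3"), sep=" > "):
    """Return the chunk text prefixed with its header path built from metadata.

    Hypothetical helper: works on any object that exposes
    .page_content (str) and .metadata (dict).
    """
    path = sep.join(doc.metadata[key] for key in header_keys if key in doc.metadata)
    return f"{path}\n{doc.page_content}" if path else doc.page_content


# Stand-ins modeled on the chunks produced with the default strip_headers=True
docs = [
    SimpleNamespace(
        page_content="Hi this is Jim\nHi this is Joe",
        metadata={"Header 1": "Title", "Header 2": "1. SubTitle"},
    ),
    SimpleNamespace(
        page_content="Hi this is Lance",
        metadata={
            "Header 1": "Title",
            "Header 2": "1. SubTitle",
            "Header 3": "1-1. Sub-SubTitle",
        },
    ),
]

for doc in docs:
    print(with_header_path(doc))
    print("---")
```

Because the helper only reads page_content and metadata, it should also accept the real Document objects in md_header_splits.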
Combining with Other Text Splitters
After splitting by Markdown headers, you can further process the content within each Markdown group using any desired text splitter.
In this example, we'll use the RecursiveCharacterTextSplitter to demonstrate how to effectively combine different splitting methods.
from langchain_text_splitters import RecursiveCharacterTextSplitter
markdown_document = "# Intro \n\n## History \n\nMarkdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \n\nMarkdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \n\n## Rise and divergence \n\nAs Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \n\nadditional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \n\n#### Standardization \n\nFrom 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \n\n# Implementations \n\nImplementations of Markdown are available for over a dozen programming languages."
print(markdown_document)
# Intro
## History
Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]
Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.
## Rise and divergence
As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for
additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.
#### Standardization
From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.
# Implementations
Implementations of Markdown are available for over a dozen programming languages.
First, use MarkdownHeaderTextSplitter to split the Markdown document based on its headers.
headers_to_split_on = [
    ("#", "Header 1"),  # Specify the header level and its name to split on
    # ("##", "Header 2"),
]
# Split the markdown document based on header levels
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on, strip_headers=False
)
md_header_splits = markdown_splitter.split_text(markdown_document)
# Print the split results
for header in md_header_splits:
    print(f"{header.page_content}")
    print(f"{header.metadata}", end="\n=====================\n")
# Intro
## History
Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]
Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.
## Rise and divergence
As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for
additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.
#### Standardization
From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.
{'Header 1': 'Intro'}
=====================
# Implementations
Implementations of Markdown are available for over a dozen programming languages.
{'Header 1': 'Implementations'}
=====================
Now, we'll further split the output of the MarkdownHeaderTextSplitter using the RecursiveCharacterTextSplitter.
chunk_size = 200  # Specify the size of each split chunk
chunk_overlap = 20  # Specify the number of overlapping characters between chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
# Split the document into chunks by characters
splits = text_splitter.split_documents(md_header_splits)
# Print the split results
for header in splits:
    print(f"{header.page_content}")
    print(f"{header.metadata}", end="\n=====================\n")
# Intro
## History
{'Header 1': 'Intro'}
=====================
Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its
{'Header 1': 'Intro'}
=====================
readers in its source code form.[9]
{'Header 1': 'Intro'}
=====================
Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.
## Rise and divergence
{'Header 1': 'Intro'}
=====================
As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for
{'Header 1': 'Intro'}
=====================
additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.
#### Standardization
{'Header 1': 'Intro'}
=====================
From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.
{'Header 1': 'Intro'}
=====================
# Implementations
Implementations of Markdown are available for over a dozen programming languages.
{'Header 1': 'Implementations'}
=====================
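Every sub-chunk in the output above keeps the Header 1 metadata of the section it came from, so the chunks can still be regrouped by section after the second split. The following is a small sketch of that idea, assuming only that each chunk exposes page_content and metadata; group_by_header is a hypothetical helper, and the SimpleNamespace objects are stand-ins modeled on three of the printed chunks (with whitespace simplified), not the real Document objects.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_header(chunks, key="Header 1"):
    """Group chunk texts by one metadata key (hypothetical helper)."""
    groups = defaultdict(list)
    for chunk in chunks:
        # Chunks without the key are collected under a placeholder label.
        groups[chunk.metadata.get(key, "(no header)")].append(chunk.page_content)
    return dict(groups)


# Stand-ins modeled on the output of the two-stage split
splits = [
    SimpleNamespace(
        page_content="readers in its source code form.[9]",
        metadata={"Header 1": "Intro"},
    ),
    SimpleNamespace(
        page_content="As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for",
        metadata={"Header 1": "Intro"},
    ),
    SimpleNamespace(
        page_content="# Implementations\nImplementations of Markdown are available for over a dozen programming languages.",
        metadata={"Header 1": "Implementations"},
    ),
]

groups = group_by_header(splits)
for section, texts in groups.items():
    longest = max(len(text) for text in texts)
    print(f"{section}: {len(texts)} chunk(s), longest is {longest} characters")
```

A check like this is also a quick way to confirm that no chunk exceeds the chunk_size passed to the character splitter.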