MarkdownHeaderTextSplitter
Author: HeeWung Song(Dan)
Peer Review : BokyungisaGod, Chaeyoon Kim
Proofread : Chaeyoon Kim
This is a part of LangChain Open Tutorial
Overview
This tutorial introduces how to effectively split Markdown documents using LangChain's MarkdownHeaderTextSplitter. This tool divides documents into meaningful sections based on Markdown headers, preserving the document's structure for systematic content processing.
Context and structure matter for effective text embedding. Splitting text at arbitrary points can separate related sentences, whereas keeping each chunk tied to its section preserves the semantic connections that make vector representations more useful. This is especially relevant for large documents, where preserved context can improve the accuracy of subsequent analysis and search operations.
The MarkdownHeaderTextSplitter splits documents according to specified header sets, managing the content under each header group as separate chunks. This enables efficient content processing while maintaining the document's structural coherence.
Environment Setup
Setting up your environment is the first step. See the Environment Setup guide for more details.
[Note] The langchain-opentutorial package provides easy-to-use environment setup guidance, useful functions, and utilities for tutorials. Check out langchain-opentutorial for more details.
Alternatively, you can set and load OPENAI_API_KEY from a .env file.
[Note] This is only necessary if you haven't already set OPENAI_API_KEY in previous steps.
Basic Usage of MarkdownHeaderTextSplitter
The MarkdownHeaderTextSplitter splits Markdown-formatted text based on headers. Here's how to use it:
First, the splitter divides the text based on standard Markdown headers (#, ##, ###, etc.).
Store the Markdown you want to split in a variable called markdown_document.
You'll need a list called headers_to_split_on. This list uses tuples to define the header levels you want to split on and what you want to call them.
Now, create a markdown_splitter object using the MarkdownHeaderTextSplitter class, and give it that headers_to_split_on list.
To actually split the text, call the split_text method on your markdown_splitter object, passing in your markdown_document.
Header Retention in Split Output
By default, the MarkdownHeaderTextSplitter removes headers from the output chunks.
However, you can configure the splitter to retain these headers by setting the strip_headers parameter to False.
Example:
Combining with Other Text Splitters
After splitting by Markdown headers, you can further process the content within each Markdown group using any desired text splitter.
In this example, we'll use the RecursiveCharacterTextSplitter to demonstrate how to effectively combine different splitting methods.
First, use MarkdownHeaderTextSplitter to split the Markdown document based on its headers.
Now, we'll further split the output of the MarkdownHeaderTextSplitter using the RecursiveCharacterTextSplitter.