MarkdownHeaderTextSplitter


Overview

This tutorial shows how to split Markdown documents with LangChain's MarkdownHeaderTextSplitter. The splitter divides a document into sections at its Markdown headers, so each chunk keeps its place in the document's structure.

Context and structure matter for text embedding. Splitting text at arbitrary boundaries can separate a passage from the heading that explains it, so the resulting vectors capture less of its meaning. Keeping each chunk tied to its section is especially useful for large documents, where preserved context can improve the accuracy of downstream search and analysis.

The MarkdownHeaderTextSplitter splits a document on the header levels you specify and returns the content under each header as a separate chunk. The headers a chunk falls under are recorded alongside it, so the document's structure stays available during later processing.
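To make the idea of header-based grouping concrete, here is a minimal standard-library sketch. It is not LangChain's implementation: the function name `split_by_headers` and the output shape are hypothetical, and it ignores edge cases such as header-like lines inside fenced code blocks.

```python
import re


def split_by_headers(markdown: str, levels=(1, 2)):
    """Group body text under the most recent headers at the tracked levels."""
    chunks = []
    active = {}  # header level -> header text currently in effect
    body = []    # lines collected since the last tracked header

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"content": text, "metadata": dict(active)})
        body.clear()

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match and len(match.group(1)) in levels:
            flush()
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level.
            for stale in [k for k in active if k >= level]:
                del active[stale]
            active[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return chunks


sample = "# Guide\n\nIntro text.\n\n## Setup\n\nInstall it.\n\n## Usage\n\nRun it."
chunks = split_by_headers(sample)
for chunk in chunks:
    print(chunk["metadata"], "->", chunk["content"])
```

Each chunk carries the headers it sits under, which is the property the rest of this tutorial relies on.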

Environment Setup

Setting up your environment is the first step. See the Environment Setup guide for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup helpers, functions, and utilities for these tutorials.

  • Check out the langchain-opentutorial for more details.

Alternatively, you can set and load OPENAI_API_KEY from a .env file.

[Note] This is only necessary if you haven't already set OPENAI_API_KEY in previous steps.

Basic Usage of MarkdownHeaderTextSplitter

The MarkdownHeaderTextSplitter splits Markdown-formatted text based on headers. Here's how to use it:

  • The splitter looks for standard Markdown headers (#, ##, ###, and so on) and starts a new chunk at each header level you ask it to track.

  • Store the Markdown text you want to split in a variable named markdown_document.

  • Define a list named headers_to_split_on. Each entry is a tuple of the header marker to split on (for example "##") and the metadata key under which that header's text is stored (for example "Header 2").

  • Create a markdown_splitter object from the MarkdownHeaderTextSplitter class, passing it the headers_to_split_on list.

  • Call the split_text method on markdown_splitter with markdown_document to get the chunks. Each chunk is a Document whose metadata records the headers it falls under.

Header Retention in Split Output

By default, the MarkdownHeaderTextSplitter removes header lines from the content of the output chunks; the header text is still recorded in each chunk's metadata.

To keep the header lines in the chunk content as well, set the strip_headers parameter to False.

Example:

Combining with Other Text Splitters

After splitting by Markdown headers, you can further split the content within each header group using any other text splitter. This is useful when a single section is still too long for your target chunk size.

In this example, we'll use the RecursiveCharacterTextSplitter to demonstrate how to effectively combine different splitting methods.

First, use MarkdownHeaderTextSplitter to split the Markdown document based on its headers.

Now, we'll further split the output of the MarkdownHeaderTextSplitter using the RecursiveCharacterTextSplitter.
