HTMLHeaderTextSplitter

Overview

HTMLHeaderTextSplitter is a "structure-aware" chunker that splits text at the element level and attaches metadata for every header relevant to a given chunk. It is conceptually similar to MarkdownHeaderTextSplitter.

It can return chunks element by element or combine elements that share the same metadata, with two goals:

  • (a) keep related text grouped (approximately) by meaning, and

  • (b) preserve the context-rich information encoded in the document structure.

Environment Setup

Setting up your environment is the first step. See the Environment Setup guide for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup helpers, functions, and utilities for the tutorials.

  • Check out the langchain-opentutorial for more details.

Using HTML Strings

  • List the header tags to split on in headers_to_split_on as (tag, name) tuples; the name becomes the metadata key for that header level.

  • Create an HTMLHeaderTextSplitter object and pass the list of headers to split on to the headers_to_split_on parameter.
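The two steps above might look like the following minimal sketch. It assumes the langchain-text-splitters package and its HTML parsing dependency are installed. The sample `html_string`, the helper name `split_html_by_headers`, and the metadata names such as "Header 1" are illustrative choices, not taken from the original tutorial.

```python
# Header tags to split on, paired with the metadata key used for each level.
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

# Illustrative sample document (an assumption, not from the tutorial).
html_string = """
<!DOCTYPE html>
<html>
<body>
  <h1>Foo</h1>
  <p>Some intro text about Foo.</p>
  <h2>Bar main section</h2>
  <p>Some intro text about Bar.</p>
  <h3>Bar subsection 1</h3>
  <p>Some text about the first subtopic of Bar.</p>
  <h2>Baz</h2>
  <p>Some text about Baz.</p>
</body>
</html>
"""


def split_html_by_headers(html, headers):
    """Split an HTML string into header-tagged Document chunks."""
    # Imported here so the configuration above can be read without the package.
    from langchain_text_splitters import HTMLHeaderTextSplitter

    splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers)
    return splitter.split_text(html)
```

Calling `split_html_by_headers(html_string, headers_to_split_on)` returns a list of Document objects. Each one carries the chunk text in `page_content` and a `metadata` dict keyed by the names given in the tuples.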

Connecting with Other Splitters and Loading HTML from a Web URL

In this example, we load HTML content from a web URL and then process it by connecting it with other splitters in a pipeline.
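A pipeline of this kind could be sketched as follows, assuming langchain-text-splitters is installed and the machine has network access. The helper name `split_url`, its parameters, and the `chunk_size` and `chunk_overlap` defaults are hypothetical values chosen for illustration; the URL is left as an argument because the original page's URL is not reproduced here.

```python
def split_url(url, headers, chunk_size=500, chunk_overlap=30):
    """Fetch a page, split it by headers, then split again by size."""
    # Imported lazily; both classes come from langchain-text-splitters.
    from langchain_text_splitters import (
        HTMLHeaderTextSplitter,
        RecursiveCharacterTextSplitter,
    )

    # Stage 1: structure-aware split, one Document per header section.
    header_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers)
    header_docs = header_splitter.split_text_from_url(url)

    # Stage 2: enforce a size limit on each header-level Document.
    size_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    return size_splitter.split_documents(header_docs)
```

Because the second stage works on Document objects, each size-limited chunk keeps the header metadata produced by the first stage, so it still records which section it came from.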

Limitations

HTMLHeaderTextSplitter attempts to handle structural differences between HTML documents, but it may sometimes miss specific headers.

For example, this algorithm assumes that headers are always nodes "above" the related text, i.e., in previous sibling nodes, ancestor nodes, and combinations thereof.

In the following news article (as of the time of writing), the top headline is tagged "h1", but it sits in a different subtree from the text elements we would expect it to govern.

As a result, the "h1" text is missing from the chunk metadata, while the "h2" text still appears where applicable.
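The effect can be reproduced with a small standard-library sketch. This is not the library's implementation; it only mimics the stated assumption that a header must be a previous sibling of the text or of one of its ancestors. The names `HEADERS` and `headers_above` and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET

# Header tags of interest and the metadata key for each (illustrative names).
HEADERS = {"h1": "Header 1", "h2": "Header 2"}


def headers_above(root, node):
    """Return headers that are previous siblings of `node` or of its ancestors."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    found = {}
    current = node
    while current in parent_of:
        parent = parent_of[current]
        level = {}
        for sibling in parent:
            if sibling is current:
                break  # only siblings that come before `current` count
            if sibling.tag in HEADERS:
                # Within one level, the nearest preceding header wins.
                level[HEADERS[sibling.tag]] = "".join(sibling.itertext()).strip()
        for key, text in level.items():
            found.setdefault(key, text)  # closer levels take priority
        current = parent
    return found


# Hypothetical page: the h1 is nested inside a wrapper, in its own subtree.
DOC = """
<html><body>
  <div><header><h1>Top headline</h1></header></div>
  <main>
    <h2>Section</h2>
    <p>Body text.</p>
  </main>
</body></html>
"""

root = ET.fromstring(DOC.strip())
paragraph = root.find(".//p")
print(headers_above(root, paragraph))  # {'Header 2': 'Section'}
```

The "h2" is a direct previous sibling of the paragraph, so it is found. The "h1" is buried inside a wrapper element that is itself a previous sibling, so a lookup restricted to siblings and ancestors never reaches it.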
