HTMLHeaderTextSplitter
Author: ChangJun Lee
Peer Review: YooKyung Jeon, Wooseok Jeong
This is a part of LangChain Open Tutorial
Overview
HTMLHeaderTextSplitter is a "structure-aware" chunker, conceptually similar to the MarkdownHeaderTextSplitter. It splits text at the HTML element level and attaches to each chunk metadata for every header that is relevant to that chunk.
It can return chunks element by element, or combine elements that share the same metadata, with the goals of
(a) keeping semantically related text grouped together (approximately) and
(b) preserving the context-rich information encoded in the document structure.
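To make the idea concrete, here is a minimal standard-library sketch of header-aware chunking. It is not the library's implementation: the HeaderChunker class and its attributes are hypothetical names invented for illustration, and it assumes each header contains a single run of plain text. It only shows how header metadata can be tracked and how consecutive elements with identical metadata can be merged.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Illustrative only: tracks the active headers and groups text under them."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.names = dict(headers_to_split_on)  # tag -> metadata key
        self.levels = list(self.names)          # list order defines the hierarchy
        self.active = {}                        # metadata of the headers currently in effect
        self.chunks = []                        # list of (text, metadata) pairs
        self._in_header = None

    def handle_starttag(self, tag, attrs):
        if tag in self.names:
            self._in_header = tag

    def handle_endtag(self, tag):
        if tag == self._in_header:
            self._in_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_header:
            # A new header replaces its own level and clears every deeper level.
            start = self.levels.index(self._in_header)
            for tag in self.levels[start:]:
                self.active.pop(self.names[tag], None)
            self.active[self.names[self._in_header]] = text
            return
        metadata = dict(self.active)
        if self.chunks and self.chunks[-1][1] == metadata:
            # Same metadata as the previous element: combine into one chunk.
            previous_text, _ = self.chunks[-1]
            self.chunks[-1] = (previous_text + " " + text, metadata)
        else:
            self.chunks.append((text, metadata))


html = (
    "<h1>Foo</h1><p>Intro.</p>"
    "<h2>Bar</h2><p>First.</p><p>Second.</p>"
    "<h2>Baz</h2><p>Last.</p>"
)
chunker = HeaderChunker([("h1", "Header 1"), ("h2", "Header 2")])
chunker.feed(html)
chunks = chunker.chunks

for text, metadata in chunks:
    print(metadata, "->", text)
```

The two paragraphs under "Bar" end up in one chunk because they carry the same header metadata, while the paragraph under "Baz" starts a new chunk.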
Environment Setup
Setting up your environment is the first step. See the Environment Setup guide for more details.
[Note]
The langchain-opentutorial package provides easy-to-use environment setup guidance, useful functions, and utilities for tutorials. Check out langchain-opentutorial for more details.
Using HTML Strings
Specify the header tags to split on, together with the metadata name to use for each, as a list of tuples in headers_to_split_on. Then create an HTMLHeaderTextSplitter object and pass that list to its headers_to_split_on parameter.
Connecting with Other Splitters and Loading HTML from a Web URL
In this example, we load HTML content from a web URL and then process it by connecting it with other splitters in a pipeline.
Limitations
HTMLHeaderTextSplitter attempts to handle structural differences between HTML documents, but it may sometimes miss specific headers.
For example, this algorithm assumes that headers are always nodes "above" the related text, i.e., in previous sibling nodes, ancestor nodes, and combinations thereof.
In the following news article (as of the time of writing), the text of the top headline is tagged as "h1", but it is in a separate subtree from the text element we expect.
Therefore, the text related to the "h1" element does not appear in the chunk metadata, but the text related to "h2" does, if applicable.
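The effect of this assumption can be reproduced with a small standard-library sketch. It is a simplified illustration of the "headers above the text" rule as stated here, not the library's code: headers_above is a hypothetical helper, and the two tiny documents are made up for the demonstration.

```python
import xml.etree.ElementTree as ET

HEADER_TAGS = {"h1", "h2", "h3"}


def headers_above(root, node):
    """Collect headers among the previous siblings of node and of each ancestor."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    found = {}
    current = node
    while current in parent_of:
        parent = parent_of[current]
        level = {}
        for sibling in parent:
            if sibling is current:
                break  # only siblings that come before the current node count
            if sibling.tag in HEADER_TAGS:
                # Later siblings are closer to the text, so they overwrite earlier ones.
                level[sibling.tag] = "".join(sibling.itertext()).strip()
        for tag, text in level.items():
            found.setdefault(tag, text)  # headers found at closer levels win
        current = parent
    return found


# The h1 is a previous sibling of an ancestor of the paragraph: it is found.
nested = ET.fromstring(
    "<body><h1>Top headline</h1>"
    "<div><article><h2>Section</h2><p>Body text</p></article></div>"
    "</body>"
)

# The h1 sits inside a separate subtree: it is neither a previous sibling
# nor an ancestor of anything above the paragraph, so it is missed.
separate = ET.fromstring(
    "<body><div><header><h1>Top headline</h1></header></div>"
    "<div><article><h2>Section</h2><p>Body text</p></article></div>"
    "</body>"
)

print(headers_above(nested, nested.find(".//p")))
print(headers_above(separate, separate.find(".//p")))
```

In the second document only the "h2" header is associated with the paragraph, which mirrors the missing "h1" metadata described above.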