02. RecursiveCharacterTextSplitter


Overview

This tutorial explains how to use the RecursiveCharacterTextSplitter, the recommended splitter for generic text in LangChain.

The RecursiveCharacterTextSplitter takes an ordered list of separator characters and tries to split the text on them one at a time. Any piece that is still larger than the chunk size is split again with the next separator in the list, until the pieces are small enough.

By default, the separator list is ["\n\n", "\n", " ", ""], which means it recursively splits in the following order: paragraph -> line -> word -> character. This prioritizes keeping paragraphs, then lines (which often correspond to sentences), then words together as much as possible, as these are considered the most semantically related units.

Here's a summary of how it works:

  1. Splitting is done by a list of characters (["\n\n", "\n", " ", ""]).

  2. Chunk size is measured by the number of characters.
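
The recursive idea summarized above can be illustrated with a small pure-Python sketch. This is not LangChain's implementation: the function name recursive_split and the chunk size of 40 are hypothetical, and chunk overlap is left out entirely. It only shows how coarser separators are tried first and finer ones are used for pieces that are still too long.

```python
# Default separator order: paragraph, line, word, single character.
DEFAULT_SEPARATORS = ("\n\n", "\n", " ", "")


def recursive_split(text, separators=DEFAULT_SEPARATORS, chunk_size=40):
    """Simplified illustration of recursive splitting (no chunk overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No finer separator left: return the oversized piece unchanged.
        return [text]

    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut the text into fixed-size runs of characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for part in text.split(sep):
        if not part:
            continue
        # Try to keep neighbouring parts together while they still fit.
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too long: retry with the next, finer separator.
            chunks.extend(recursive_split(part, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph, short.\n\n"
    "Second paragraph is longer and must be split on spaces."
)
for chunk in recursive_split(sample, chunk_size=40):
    print(len(chunk), repr(chunk))
```

In this sketch the first paragraph fits within the limit and stays whole, while the second paragraph exceeds it and is broken at word boundaries instead.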


Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, helper functions, and utilities for the tutorials.

  • You can check out the langchain-opentutorial for more details.

Example Usage of RecursiveCharacterTextSplitter

This example demonstrates how to use the RecursiveCharacterTextSplitter to split text into smaller chunks.

  1. Open the text file appendix-keywords.txt, read its contents, and store the text in a variable named file.

  2. Display some of the content read from the file.

  3. Create a RecursiveCharacterTextSplitter with the following parameters:

  • chunk_size = 250 (limits each chunk to at most 250 characters)

  • chunk_overlap = 50 (allows up to 50 characters of overlap between adjacent chunks)

  • length_function = len (uses the built-in len() function to measure length)

  • is_separator_regex = False (treats the separators as literal strings rather than regular expressions)

  4. Use the text_splitter to split the text stored in the file variable into a list of Document objects, and store the list in a variable called texts.

  5. Print the first and second documents using print(texts[0]) and print(texts[1]).

Alternatively, you can use text_splitter.split_text() to split the file text; it returns a list of plain strings instead of Document objects.
