This tutorial explains how to use the RecursiveCharacterTextSplitter, the recommended way to split text in LangChain.
The RecursiveCharacterTextSplitter works by taking a list of characters and attempting to split the text into smaller pieces based on that list. It continues splitting until the pieces are sufficiently small.
By default, the character list is ["\n\n", "\n", " ", ""], which means it recursively splits in the following order: paragraph -> sentence -> word. This prioritizes keeping paragraphs, then sentences, then words together as much as possible, as these are considered the most semantically related units.
Here's a summary of how it works:
Splitting is done by a list of characters (["\n\n", "\n", " ", ""]).
Chunk size is measured by the number of characters.
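To make the recursive strategy concrete, here is a deliberately simplified re-implementation of the idea (not LangChain's actual code; the real splitter also merges adjacent small pieces back together up to chunk_size, which this sketch omits):

```python
def recursive_split(text, separators=("\n\n", "\n", " ", ""), chunk_size=40):
    """Simplified sketch of recursive splitting: try the coarsest
    separator first, and re-split any piece that is still too large
    using the next, finer separator."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep) if sep else list(text)
    chunks = []
    for piece in pieces:
        if len(piece) <= chunk_size:
            chunks.append(piece)  # small enough: keep the unit intact
        else:
            chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks

text = (
    "First paragraph about splitting.\n\n"
    "Second paragraph, which is noticeably longer than the limit."
)
print(recursive_split(text))
```

Note how the first paragraph survives as a single chunk because it fits within chunk_size, while the second paragraph falls through to finer separators.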
Environment Setup
[Note]
langchain-opentutorial is a package that provides easy-to-use environment setup, along with useful functions and utilities for these tutorials.
This example demonstrates how to use the RecursiveCharacterTextSplitter to split text into smaller chunks.
Open the text file appendix-keywords.txt, read its contents, and store the text in a variable named file.
# Open the appendix-keywords.txt file to create a file object named f.
with open("./data/appendix-keywords.txt") as f:
file = f.read() # Reads the contents of the file and stores them in the file variable.
Display some of the content read from the file.
# Output the top 500 characters read from the file.
print(file[:500])
Semantic Search
Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization
Embedding
Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to unders
Now, create a RecursiveCharacterTextSplitter with the following parameters:
chunk_size = 250 (limits each chunk to 250 characters)
chunk_overlap = 50 (allows 50 characters of overlap between chunks)
length_function = len (uses the built-in len function to measure chunk length)
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
# Set the chunk size to very small. These settings are for illustrative purposes only.
chunk_size=250,
# Sets the number of overlapping characters between chunks.
chunk_overlap=50,
# Specifies a function to calculate the length of the string.
length_function=len,
# Sets whether to use regular expressions as delimiters.
is_separator_regex=False,
)
Use the text_splitter to split the text stored in the file variable into a list of Document objects. This list will be stored in a variable called texts.
Print the first and second documents using print(texts[0]) and print(texts[1]).
# Split the file text into documents using text_splitter.
texts = text_splitter.create_documents([file])
print(texts[0])  # Output the first of the split documents.
print("===" * 20)
print(texts[1])  # Output the second of the split documents.
page_content='Semantic Search'
============================================================
page_content='Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.'
Alternatively, you can use the text_splitter.split_text() method to split the file text directly into strings.
# Splits the text and returns the first two elements of the split text.
text_splitter.split_text(file)[:2]
['Semantic Search',
'Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\nExample: Vectors of word embeddings can be stored in a database for quick access.']
Set up the environment. You may refer to the langchain-opentutorial package documentation for more details.