This tutorial explains how to use the RecursiveCharacterTextSplitter, the recommended way to split text in LangChain.
The RecursiveCharacterTextSplitter works by taking a list of characters and attempting to split the text into smaller pieces based on that list. It continues splitting until the pieces are sufficiently small.
By default, the character list is ["\n\n", "\n", " ", ""], which means it recursively splits in the following order: paragraph -> sentence -> word. This prioritizes keeping paragraphs, then sentences, then words together as much as possible, as these are considered the most semantically related units.
Here's a summary of how it works:
Splitting is done by a list of characters (["\n\n", "\n", " ", ""]).
Chunk size is measured by the number of characters.
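To make the recursive strategy concrete, here is a deliberately simplified re-implementation of the idea (not LangChain's actual code; the real splitter also merges adjacent small pieces back together up to chunk_size, which this sketch omits):

```python
def recursive_split(text, separators=("\n\n", "\n", " ", ""), chunk_size=40):
    """Simplified sketch of recursive splitting: try the coarsest
    separator first, and re-split any piece that is still too large
    using the next, finer separator."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep) if sep else list(text)
    chunks = []
    for piece in pieces:
        if len(piece) <= chunk_size:
            chunks.append(piece)  # small enough: keep the unit intact
        else:
            chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks

text = (
    "First paragraph about splitting.\n\n"
    "Second paragraph, which is noticeably longer than the limit."
)
print(recursive_split(text))
```

Note how the first paragraph survives as a single chunk because it fits within chunk_size, while the second paragraph falls through to finer separators.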
Environment Setup
[Note]
langchain-opentutorial is a package that provides easy-to-use environment setup, along with useful functions and utilities for these tutorials.
This example demonstrates how to use the RecursiveCharacterTextSplitter to split text into smaller chunks.
Open the text file appendix-keywords.txt, read its contents, and store the text in a variable named file.
# Open the appendix-keywords.txt file to create a file object named f.
with open("./data/appendix-keywords.txt") as f:
file = f.read() # Reads the contents of the file and stores them in the file variable.
Display some of the content read from the file.
# Output the top 500 characters read from the file.
print(file[:500])
Semantic Search
Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization
Embedding
Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to unders
Now, create a RecursiveCharacterTextSplitter with the following parameters:
chunk_size = 250 (limits each chunk to 250 characters)
chunk_overlap = 50 (allows 50 characters of overlap between chunks)
length_function = len (uses the built-in len function to measure chunk length)
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
# Set the chunk size to very small. These settings are for illustrative purposes only.
chunk_size=250,
# Sets the number of overlapping characters between chunks.
chunk_overlap=50,
# Specifies a function to calculate the length of the string.
length_function=len,
# Sets whether to use regular expressions as delimiters.
is_separator_regex=False,
)
Use the text_splitter to split the text stored in the file variable into a list of Document objects. This list will be stored in a variable called texts.
Print the first and second documents using print(texts[0]) and print(texts[1]).
# Split the file text into documents using text_splitter.
texts = text_splitter.create_documents([file])
print(texts[0])  # Output the first of the split documents.
print("===" * 20)
print(texts[1])  # Output the second of the split documents.
page_content='Semantic Search'
============================================================
page_content='Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.'
Alternatively, you can use the text_splitter.split_text() method to split the file text directly into strings.
# Splits the text and returns the first two elements of the split text.
text_splitter.split_text(file)[:2]
['Semantic Search',
'Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\nExample: Vectors of word embeddings can be stored in a database for quick access.']
Set up the environment. You may refer to the langchain-opentutorial package documentation for more details.