02. RecursiveCharacterTextSplitter
Author: fastjw
Design: fastjw
Peer Review: Wonyoung Lee, sohyunwriter
Proofread: Chaeyoon Kim
This is a part of LangChain Open Tutorial
Overview
This tutorial explains how to use the RecursiveCharacterTextSplitter, the recommended way to split text in LangChain.
The RecursiveCharacterTextSplitter works by taking a list of separators and attempting to split the text into smaller pieces based on that list. It continues splitting until the pieces are sufficiently small.
By default, the separator list is ["\n\n", "\n", " ", ""], which means it recursively splits in the following order: paragraph -> line -> word -> character. This prioritizes keeping paragraphs, then lines, then words together as much as possible, as these are considered the most semantically related units.
Here's a summary of how it works:
Splitting is done by a list of separators (["\n\n", "\n", " ", ""]).
Chunk size is measured by the number of characters.
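The core idea above can be sketched in plain Python. This is a simplified illustration of the recursive strategy only (the real splitter also merges small pieces back together up to chunk_size and applies chunk_overlap): try each separator in order, and recurse into any piece that is still too large.

```python
def recursive_split(text, separators, chunk_size):
    """Simplified sketch of recursive splitting: try separators in order,
    recursing into any piece still larger than chunk_size."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    # An empty separator means "split into individual characters".
    pieces = text.split(sep) if sep else list(text)
    chunks = []
    for piece in pieces:
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks

text = "Para one sentence.\nAnother line.\n\nPara two here."
print(recursive_split(text, ["\n\n", "\n", " ", ""], 20))
# -> ['Para one sentence.', 'Another line.', 'Para two here.']
```

Note how the first paragraph is too large for the 20-character limit, so the splitter falls back to the next separator ("\n") and splits it by line.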
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorial is a package that provides a set of easy-to-use environment setup, useful functions, and utilities for tutorials. You can check out langchain-opentutorial for more details.
%%capture --no-stderr
!pip install langchain-opentutorial
# Install required packages
from langchain_opentutorial import package
package.install(
[
"langchain_text_splitters",
],
verbose=False,
upgrade=False,
)
# Set environment variables
from langchain_opentutorial import set_env
set_env(
{
"OPENAI_API_KEY": "",
"LANGCHAIN_API_KEY": "",
"LANGCHAIN_TRACING_V2": "true",
"LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
"LANGCHAIN_PROJECT": "RecursiveCharacterTextSplitter",
}
)
Environment variables have been set successfully.
from dotenv import load_dotenv
load_dotenv()
False
Example Usage of RecursiveCharacterTextSplitter
This example demonstrates how to use the RecursiveCharacterTextSplitter
to split text into smaller chunks.
Open the text file appendix-keywords.txt, read its contents, and store the text in a variable named file.
# Open the appendix-keywords.txt file to create a file object named f.
with open("./data/appendix-keywords.txt") as f:
file = f.read() # Reads the contents of the file and stores them in the file variable.
Display some of the content read from the file.
# Output the top 500 characters read from the file.
print(file[:500])
Semantic Search
Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization
Embedding
Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to unders
Now, create a RecursiveCharacterTextSplitter with the following parameters:
chunk_size = 250 (limits each chunk to 250 characters)
chunk_overlap = 50 (allows 50 characters of overlap between adjacent chunks)
length_function = len (uses the built-in len() function to measure chunk length)
is_separator_regex = False (treats the separators as literal strings rather than regular expressions)
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
# Set the chunk size to very small. These settings are for illustrative purposes only.
chunk_size=250,
# Sets the number of overlapping characters between chunks.
chunk_overlap=50,
# Specifies a function to calculate the length of the string.
length_function=len,
# Sets whether to use regular expressions as delimiters.
is_separator_regex=False,
)
Use the text_splitter to split the text stored in the file variable into a list of Document objects. This list will be stored in a variable called texts.
Print the first and second documents using print(texts[0]) and print(texts[1]).
# Split the file text into documents using text_splitter.
texts = text_splitter.create_documents([file])
print(texts[0])  # Output the first document of the split documents.
print("===" * 20)
print(texts[1])  # Output the second document of the split documents.
page_content='Semantic Search'
============================================================
page_content='Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.'
Alternatively, you can use the text_splitter.split_text() method to split the file text directly into a list of strings instead of Document objects.
# Splits the text and returns the first two elements of the split text.
text_splitter.split_text(file)[:2]
['Semantic Search',
'Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\nExample: Vectors of word embeddings can be stored in a database for quick access.']