This tutorial shows how to use the RecursiveCharacterTextSplitter, the recommended splitter for generic text.
It takes a list of separator characters as a parameter.
It tries each separator in the given order, splitting the text into progressively smaller pieces until each piece is small enough.
By default, the separator list is ["\n\n", "\n", " ", ""].
The effect is to split recursively in the order paragraph -> sentence -> word.
Paragraphs (then sentences, then words) are treated as the most semantically related units of text, so the splitter keeps them together for as long as possible.
How the text is split: by a list of characters (["\n\n", "\n", " ", ""]).
The chunk size is measured by the number of characters.
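To make the recursive fallback concrete, here is a toy pure-Python sketch of the idea. This is not LangChain's actual implementation: in particular, the real splitter also merges small pieces back together up to chunk_size (and applies overlap), which this sketch omits.

```python
# Toy sketch of recursive splitting: try each separator in order;
# if a piece is still too long, recurse with the next separator.
# The empty-string separator "" (falsy) falls back to per-character splitting.
def recursive_split(text, separators, chunk_size):
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in (text.split(sep) if sep else list(text)):
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks

sample = "First paragraph.\n\nSecond paragraph that is quite a bit longer."
print(recursive_split(sample, ["\n\n", "\n", " ", ""], 30))
# -> ['First paragraph.', 'Second', 'paragraph', 'that', 'is', 'quite', 'a', 'bit', 'longer.']
```

Note how the first paragraph survives intact because it fits within the limit, while the longer paragraph falls through "\n\n" and "\n" down to word-level splits.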
Read a file for the RecursiveCharacterTextSplitter exercise:
open the appendix-keywords.txt file, read its contents, and save the text to the file variable.
# Open the appendix-keywords.txt file to create a file object named f.
with open("./data/appendix-keywords.txt", encoding="utf-8") as f:
    file = f.read()  # Read the contents of the file and store them in the file variable.
Print part of the contents that were read from the file.
# Output the top 500 characters read from the file.
print(file[:500])
Semantic Search
Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization
Embedding
Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to unders
Example of using RecursiveCharacterTextSplitter to split text into small chunks.
Set chunk_size to 250 to limit the size of each chunk.
Set chunk_overlap to 50 to allow 50 characters of overlap between neighbouring chunks.
Use the len function as length_function to calculate the length of the text.
Set is_separator_regex to False so the separators are treated as literal strings rather than regular expressions.
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    # Set the chunk size very small. These settings are for illustration only.
    chunk_size=250,
    # Number of overlapping characters between chunks.
    chunk_overlap=50,
    # Function used to measure the length of the text.
    length_function=len,
    # Whether to treat the separators as regular expressions.
    is_separator_regex=False,
)
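To see what chunk_overlap does, here is a hypothetical sliding-window sketch. The real splitter combines overlap with the recursive separator logic above, so this is only an approximation of the overlap behaviour.

```python
# Toy illustration of chunk_overlap (not the library's implementation):
# slide a window of chunk_size characters, stepping by
# chunk_size - chunk_overlap, so consecutive chunks share characters.
def windowed_chunks(text, chunk_size, chunk_overlap):
    step = chunk_size - chunk_overlap
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]

print(windowed_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# -> ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of its predecessor, which is what prevents a sentence from being cut off without context at a chunk boundary.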
Use text_splitter to split the file text into documents.
The split documents are stored in the texts list.
Print the first and second of the split documents via print(texts[0]) and print(texts[1]).
# Split the file text into documents using text_splitter.
texts = text_splitter.create_documents([file])
print(texts[0])  # Print the first of the split documents.
print("===" * 20)
print(texts[1])  # Print the second of the split documents.
page_content='Semantic Search'
============================================================
page_content='Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.'
Use the text_splitter.split_text() function to split the file text.
# Split the text and take the first two elements of the result.
text_splitter.split_text(file)[:2]
['Semantic Search',
'Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\nExample: Vectors of word embeddings can be stored in a database for quick access.']