Character Text Splitter
Author: hellohotkey
Peer Review : fastjw, heewung song
Proofread : JaeJun Shim
This is a part of LangChain Open Tutorial
Overview
Text splitting is a crucial step in document processing with LangChain.
The CharacterTextSplitter offers efficient text chunking with several key benefits:
- Token Limits: Overcomes LLM context window size restrictions
- Search Optimization: Enables more precise chunk-level retrieval
- Memory Efficiency: Processes large documents effectively
- Context Preservation: Maintains textual coherence through chunk_overlap
This tutorial explores the practical implementation of text splitting through core methods like split_text() and create_documents(), including advanced features such as metadata handling.
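The mechanics of overlapping chunks can be illustrated with a short, pure-Python sketch (a conceptual model only, not LangChain's actual implementation):

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Each new chunk starts where the previous one ended, minus the overlap,
    # so consecutive chunks share `chunk_overlap` characters.
    stride = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), stride)]

print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Note how each chunk repeats the last two characters of the previous one; this is what preserves context across chunk boundaries.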
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorial is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials. You can check out langchain-opentutorial for more details.
%%capture --no-stderr
%pip install langchain-opentutorial
# Install required packages
from langchain_opentutorial import package
package.install(
[
"langchain_text_splitters",
],
verbose=False,
upgrade=False,
)
# Set environment variables
from langchain_opentutorial import set_env
set_env(
{
"OPENAI_API_KEY": "",
"LANGCHAIN_API_KEY": "",
"LANGCHAIN_TRACING_V2": "true",
"LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
"LANGCHAIN_PROJECT": "Character-Text-Splitter",  # Set this to match the tutorial title
}
)
Environment variables have been set successfully.
You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.
[Note] This is not necessary if you've already set the required API keys in previous steps.
# Load API keys from .env file
from dotenv import load_dotenv
load_dotenv(override=True)
False
CharacterTextSplitter Example
Read and store contents from the keywords file.
Open the ./data/appendix-keywords.txt file, read its contents, and store them in the file variable.
with open("./data/appendix-keywords.txt", encoding="utf-8") as f:
file = f.read()
Print the first 500 characters of the file contents.
print(file[:500])
Semantic Search
Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization
Embedding
Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to unders
Create a CharacterTextSplitter with the following parameters:
- separator: String to split text on (e.g., newlines, spaces, custom delimiters)
- chunk_size: Maximum size of chunks to return
- chunk_overlap: Overlap in characters between chunks
- length_function: Function that measures the length of given chunks
- is_separator_regex: Boolean indicating whether the separator should be treated as a regex pattern
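To see what is_separator_regex changes: when True, the separator string is interpreted as a regular expression rather than a literal string. A plain-Python illustration of the difference using the standard re module (independent of LangChain):

```python
import re

text = "alpha  beta\tgamma"

# Literal split: only exact occurrences of the separator string match,
# so the double space yields an empty piece and the tab is untouched.
literal = text.split(" ")        # ['alpha', '', 'beta\tgamma']

# Regex split: r"\s+" matches any run of whitespace characters.
regex = re.split(r"\s+", text)   # ['alpha', 'beta', 'gamma']
```

With is_separator_regex=False (as in this tutorial), the separator behaves like the literal split above.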
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
separator=" ", # Splits whenever a space is encountered in text
chunk_size=250, # Each chunk contains maximum 250 characters
chunk_overlap=50, # Two consecutive chunks share 50 characters
length_function=len, # Counts total characters in each chunk
is_separator_regex=False # Uses space as literal separator, not as regex
)
Create document objects from chunks and display the first one
chunks = text_splitter.create_documents([file])
print(chunks[0])
page_content='Semantic Search
Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick'
Demonstrate metadata handling during document creation:
- create_documents accepts both text data and metadata lists
- Each chunk inherits metadata from its source document
# Define metadata for each document
metadatas = [
{"document": 1},
{"document": 2},
]
# Create documents with metadata
documents = text_splitter.create_documents(
[file, file], # List of texts to split
metadatas=metadatas, # Corresponding metadata
)
print(documents[0]) # Display first document with metadata
page_content='Semantic Search
Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick' metadata={'document': 1}
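The metadata propagation shown above can be modeled with a simplified sketch (a hypothetical helper, not LangChain internals): each input text is split independently, and every chunk produced from a text inherits that text's metadata dict.

```python
def create_docs_with_metadata(texts, metadatas, chunk_size=10):
    """Simplified model of create_documents: split each text and
    attach its source metadata to every resulting chunk."""
    docs = []
    for text, meta in zip(texts, metadatas):
        for i in range(0, len(text), chunk_size):
            docs.append({"page_content": text[i:i + chunk_size], "metadata": meta})
    return docs

docs = create_docs_with_metadata(["0123456789abcde"], [{"document": 1}])
# Both chunks carry the metadata of their source text
```

This is why, in the output above, every chunk of the first file is tagged {'document': 1}.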
Split text using the split_text() method.
text_splitter.split_text(file)[0] returns the first chunk of the split text.
# Split the file text and return the first chunk
text_splitter.split_text(file)[0]
'Semantic Search\n\nDefinition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\nExample: Vectors of word embeddings can be stored in a database for quick'