Character Text Splitter

Overview

Text splitting is a crucial step in document processing with LangChain.

The CharacterTextSplitter offers efficient text chunking that provides several key benefits:

  • Token Limits: Overcomes LLM context window size restrictions

  • Search Optimization: Enables more precise chunk-level retrieval

  • Memory Efficiency: Processes large documents effectively

  • Context Preservation: Maintains textual coherence through chunk_overlap

This tutorial explores practical implementation of text splitting through core methods like split_text() and create_documents(), including advanced features such as metadata handling.
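To build intuition for chunking with overlap, here is a minimal pure-Python sketch. This is only an illustration of the idea, not CharacterTextSplitter's actual algorithm (which splits on a separator and then merges pieces); the `simple_chunk` helper and its parameter values are hypothetical.

```python
# Simplified sketch of fixed-size chunking with overlap.
# Not the real CharacterTextSplitter implementation; parameter
# names mirror its API for familiarity.
def simple_chunk(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    chunks = []
    start = 0
    step = chunk_size - chunk_overlap  # advance by size minus overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

chunks = simple_chunk("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Note how each chunk repeats the last two characters of the previous one; this repeated region is what lets retrieval keep context across chunk boundaries.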

Table of Contents

  • Overview

  • Environment Setup

  • CharacterTextSplitter Example


Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, along with useful functions and utilities for these tutorials.

  • Check out langchain-opentutorial for more details.

%%capture --no-stderr
%pip install langchain-opentutorial
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain_text_splitters",
    ],
    verbose=False,
    upgrade=False,
)
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Adaptive-RAG",  # Set this to match the tutorial title
    }
)
Environment variables have been set successfully.

You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)
False

CharacterTextSplitter Example

Read and store contents from keywords file

  • Open the ./data/appendix-keywords.txt file and read its contents.

  • Store the contents in the file variable.

with open("./data/appendix-keywords.txt", encoding="utf-8") as f:
   file = f.read()

Print the first 500 characters of the file contents.

print(file[:500])
Semantic Search
    
    Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
    Example: Vectors of word embeddings can be stored in a database for quick access.
    Related keywords: embedding, database, vectorization, vectorization
    
    Embedding
    
    Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to unders

Create a CharacterTextSplitter with the following parameters:

Parameters

  • separator: String to split text on (e.g., newlines, spaces, custom delimiters)

  • chunk_size: Maximum size of chunks to return

  • chunk_overlap: Overlap in characters between chunks

  • length_function: Function that measures the length of given chunks

  • is_separator_regex: Boolean indicating whether separator should be treated as a regex pattern
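The is_separator_regex flag controls whether the separator is matched literally (escaped) or interpreted as a regex pattern. The difference can be sketched with Python's re module; the sample text and patterns below are illustrative, not taken from the tutorial data:

```python
import re

# Hypothetical text with runs of blank lines of varying length
text = "alpha\n\nbeta\n\n\ngamma"

# is_separator_regex=False: the separator is escaped and matched literally,
# so "\n\n\n" only partially matches and leaves a stray newline behind
literal = re.split(re.escape("\n\n"), text)

# is_separator_regex=True: the separator is used as a pattern,
# so "\n{2,}" consumes any run of two or more newlines
pattern = re.split(r"\n{2,}", text)

print(literal)  # ['alpha', 'beta', '\ngamma']
print(pattern)  # ['alpha', 'beta', 'gamma']
```

A regex separator is useful when documents are inconsistent about how many newlines separate paragraphs.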

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
   separator=" ",           # Split the text on spaces
   chunk_size=250,          # Each chunk holds at most 250 characters
   chunk_overlap=50,        # Consecutive chunks share up to 50 characters
   length_function=len,     # Measure chunk length by character count
   is_separator_regex=False # Treat the separator as a literal string, not a regex
)

Create document objects from chunks and display the first one

chunks = text_splitter.create_documents([file])
print(chunks[0])
page_content='Semantic Search
    
    Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
    Example: Vectors of word embeddings can be stored in a database for quick'

Demonstrate metadata handling during document creation:

  • create_documents accepts both text data and metadata lists

  • Each chunk inherits metadata from its source document

# Define metadata for each document
metadatas = [
   {"document": 1},
   {"document": 2},
]

# Create documents with metadata
documents = text_splitter.create_documents(
   [file, file],  # List of texts to split
   metadatas=metadatas,  # Corresponding metadata
)

print(documents[0])  # Display first document with metadata
page_content='Semantic Search
    
    Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
    Example: Vectors of word embeddings can be stored in a database for quick' metadata={'document': 1}
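The metadata propagation shown above can be sketched with a simplified stand-in. The Document class and create_documents function below are illustrative stand-ins, not LangChain's own implementations, and the trivial whitespace splitter replaces the real chunking logic:

```python
from dataclasses import dataclass, field

# Simplified stand-in for LangChain's Document (illustrative only)
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

def create_documents(texts, metadatas, split):
    """Pair each text with its metadata; every chunk copies its source's metadata."""
    docs = []
    for text, meta in zip(texts, metadatas):
        for chunk in split(text):
            docs.append(Document(chunk, dict(meta)))
    return docs

docs = create_documents(
    ["aaa bbb", "ccc ddd"],
    [{"document": 1}, {"document": 2}],
    split=str.split,  # trivial whitespace splitter for the sketch
)
print([d.metadata for d in docs])
# [{'document': 1}, {'document': 1}, {'document': 2}, {'document': 2}]
```

This mirrors what happens in the real call: both copies of the file are split into chunks, and each chunk carries the metadata of the source text it came from.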

Split text using the split_text() method.

  • text_splitter.split_text(file)[0] returns the first chunk of the split text

# Split the file text and return the first chunk
text_splitter.split_text(file)[0]
'Semantic Search\n\nDefinition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\nExample: Vectors of word embeddings can be stored in a database for quick'
