Character Text Splitter
Author: hellohotkey
Peer Review : fastjw, heewung song
Proofread : JaeJun Shim
This is a part of LangChain Open Tutorial
Overview
Text splitting is a crucial step in document processing with LangChain.
The CharacterTextSplitter offers efficient text chunking that provides several key benefits:
Token Limits: Overcomes LLM context window size restrictions
Search Optimization: Enables more precise chunk-level retrieval
Memory Efficiency: Processes large documents effectively
Context Preservation: Maintains textual coherence through chunk_overlap
This tutorial explores practical implementation of text splitting through core methods like split_text() and create_documents(), including advanced features such as metadata handling.
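To make the chunk_overlap idea above concrete, here is a minimal standard-library sketch of overlapping fixed-width windows. It is a simplification for illustration only and not how CharacterTextSplitter works internally (the real splitter first splits on a separator and then merges the pieces). The function name sliding_chunks is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters.

    Illustration only: a plain sliding window, not LangChain's algorithm.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


windows = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(windows)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each window repeats the last two characters of the previous one, which is the property that lets a sentence cut at a chunk boundary still appear whole in at least one chunk.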
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorial is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials. You can check out langchain-opentutorial for more details.
You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.
[Note] This is not necessary if you've already set the required API keys in previous steps.
CharacterTextSplitter Example
Read and store contents from keywords file
Open the ./data/appendix-keywords.txt file and read its contents.
Store the read contents in the file variable.
Print the first 500 characters of the file contents.
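The two steps above can be sketched as follows. The path and the file variable come from the tutorial text; the helper name read_text_file and the UTF-8 encoding are assumptions made for this sketch. The existence check is only there so the snippet does not fail when the data file is missing.

```python
from pathlib import Path


def read_text_file(path, encoding="utf-8"):
    """Return the full contents of a text file as a single string."""
    return Path(path).read_text(encoding=encoding)


# Path taken from the tutorial; adjust it to where your data actually lives.
keywords_path = Path("./data/appendix-keywords.txt")

if keywords_path.exists():
    file = read_text_file(keywords_path)
    # Slicing is safe even when the file is shorter than 500 characters.
    print(file[:500])
else:
    print(f"{keywords_path} not found; skipping preview")
```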
Create CharacterTextSplitter with parameters:
Parameters
separator: String to split text on (e.g., newlines, spaces, custom delimiters)
chunk_size: Maximum size of chunks to return
chunk_overlap: Overlap in characters between chunks
length_function: Function that measures the length of given chunks
is_separator_regex: Boolean indicating whether separator should be treated as a regex pattern
Create document objects from chunks and display the first one
Demonstrate metadata handling during document creation:
create_documents accepts both text data and metadata lists
Each chunk inherits metadata from its source document
Split text using the split_text() method.
text_splitter.split_text(file)[0] returns the first chunk of the split text.