Character Text Splitter


Overview

Text splitting is a crucial step in document processing with LangChain.

The CharacterTextSplitter offers efficient text chunking that provides several key benefits:

  • Token Limits: Keeps each chunk small enough to fit within an LLM's context window

  • Search Optimization: Enables more precise chunk-level retrieval

  • Memory Efficiency: Processes large documents effectively

  • Context Preservation: Carries context across chunk boundaries by repeating chunk_overlap characters between adjacent chunks

This tutorial walks through text splitting with the core methods split_text() and create_documents(), including metadata handling.
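To make the idea of chunk_overlap concrete, here is a minimal standard-library sketch of fixed-width chunking with overlap. It is a simplified illustration only, not how CharacterTextSplitter is implemented (the real splitter first splits on a separator and then merges pieces). The function name sliding_chunks is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-width windows; each window repeats the
    last `chunk_overlap` characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk begins with the three characters that ended the previous chunk, so text near a boundary appears in both neighbors.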

Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup helpers, functions, and utilities for tutorials.

  • You can check out langchain-opentutorial for more details.

You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

CharacterTextSplitter Example

Read and store the contents of the keywords file:

  • Open the ./data/appendix-keywords.txt file and read its contents.

  • Store the contents in the file variable.

Print the first 500 characters of the file contents.
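The read-and-preview step might look like the sketch below. The helper name read_text_file is hypothetical, and UTF-8 encoding is an assumption about the data file. The existence check only keeps the snippet from failing when the file is absent.

```python
from pathlib import Path


def read_text_file(path: str) -> str:
    """Return the whole file as one string (UTF-8 is assumed here)."""
    return Path(path).read_text(encoding="utf-8")


keywords_path = Path("./data/appendix-keywords.txt")
if keywords_path.exists():
    file = read_text_file(str(keywords_path))
    # Preview only the first 500 characters.
    print(file[:500])
else:
    print(f"{keywords_path} not found; adjust the path to your data file.")
```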

Create a CharacterTextSplitter with the following parameters:

Parameters

  • separator: String to split text on (e.g., newlines, spaces, custom delimiters)

  • chunk_size: Maximum size of each chunk, as measured by length_function

  • chunk_overlap: Amount of overlap between adjacent chunks, measured the same way

  • length_function: Function used to measure chunk length (for example, len to count characters)

  • is_separator_regex: Boolean indicating whether separator should be treated as a regex pattern

Create Document objects from the chunks and display the first one.

Demonstrate metadata handling during document creation:

  • create_documents accepts both text data and metadata lists

  • Each chunk inherits metadata from its source document

Split text using the split_text() method.

  • text_splitter.split_text(file)[0] returns the first chunk of the split text
