TokenTextSplitter
Author: Ilgyun Jeong
Peer Review: JoonHo Kim, Sunyoung Park (architectyou)
This is a part of LangChain Open Tutorial
Overview
Language models operate within token limits, making it crucial to manage text within these constraints.
TokenTextSplitter serves as an effective tool for segmenting text into manageable chunks based on token count, ensuring compliance with these limitations.
Table of Contents
References
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorial is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. You can check out the langchain-opentutorial package for more details.
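As a minimal installation sketch (the exact package list is an assumption based on the splitters covered in this tutorial), you can install the dependencies in a notebook cell:

```python
# A notebook cell; the package list below is an assumed superset of what this tutorial uses.
%pip install -qU langchain-opentutorial langchain-text-splitters tiktoken spacy sentence-transformers nltk konlpy transformers
```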
You can alternatively set OPENAI_API_KEY in .env file and load it.
[Note] This is not necessary if you've already set OPENAI_API_KEY in previous steps.
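For example, if the key is stored in a local .env file, a common way to load it is with python-dotenv (the file layout here is an assumption about your project):

```python
from dotenv import load_dotenv

# Reads OPENAI_API_KEY (and any other variables) from a local .env file
load_dotenv(override=True)
```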
Basic Usage of tiktoken
tiktoken is a fast BPE tokenizer created by OpenAI.
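As a quick illustration (the sample string and encoding name are arbitrary), you can encode text and count its tokens directly with tiktoken:

```python
import tiktoken

# Load a BPE encoding and count the tokens in a sample string
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("TokenTextSplitter splits text by token count.")
print(len(tokens), tokens[:10])
```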
Open the file ./data/appendix-keywords.txt and read its contents.
Store the read content in the file variable.
Print a portion of the content read from the file.
Use the CharacterTextSplitter to split the text.
Initialize the text splitter using the from_tiktoken_encoder method, which is based on the Tiktoken encoder.
Print the number of divided chunks.
Print the first element of the texts list.
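A sketch of these steps is shown below; the chunk_size and chunk_overlap values are illustrative, not prescribed by the tutorial.

```python
from langchain_text_splitters import CharacterTextSplitter

# Read the sample file and keep its contents in the `file` variable
with open("./data/appendix-keywords.txt", encoding="utf-8") as f:
    file = f.read()

print(file[:350])  # print a portion of the content

# Initialize the splitter from the Tiktoken encoder
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=300,
    chunk_overlap=50,
)

texts = text_splitter.split_text(file)
print(len(texts))  # number of divided chunks
print(texts[0])    # first chunk
```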
Reference
When using CharacterTextSplitter.from_tiktoken_encoder, the text is split only by the CharacterTextSplitter, and the Tiktoken tokenizer is used solely to measure and merge the resulting chunks. (This means a chunk may still exceed the chunk size as measured by the Tiktoken tokenizer.)
When using RecursiveCharacterTextSplitter.from_tiktoken_encoder, each chunk is guaranteed not to exceed the chunk size allowed by the language model; any chunk that exceeds this size is recursively split. You can also load the Tiktoken splitter directly, which guarantees that each split is smaller than the chunk size.
Basic Usage of TokenTextSplitter
Use the TokenTextSplitter class to split the text into token-based chunks.
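A minimal sketch, assuming the sample text is already stored in the file variable; the chunk sizes are illustrative.

```python
from langchain_text_splitters import TokenTextSplitter

# Split on token boundaries measured with tiktoken
text_splitter = TokenTextSplitter(
    chunk_size=300,
    chunk_overlap=0,
)

texts = text_splitter.split_text(file)
print(texts[0])  # inspect the first token-based chunk
```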
Basic Usage of spaCy
spaCy is an open-source software library for advanced natural language processing, written in the Python and Cython programming languages.
Another alternative to NLTK is using the spaCy tokenizer.
How the text is divided: The text is split using the spaCy tokenizer.
How the chunk size is measured: It is measured by the number of characters.
Download the en_core_web_sm model.
Open the appendix-keywords.txt file and read its contents.
Verify by printing a portion of the content.
Create a text splitter using the SpacyTextSplitter class.
Use the split_text method of the text_splitter object to split the file text.
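The sketch below follows these steps; the chunk_size and chunk_overlap values are illustrative.

```python
import warnings
from langchain_text_splitters import SpacyTextSplitter

# Requires the spaCy pipeline once: python -m spacy download en_core_web_sm
warnings.filterwarnings("ignore")  # optionally silence spaCy warnings

with open("./data/appendix-keywords.txt", encoding="utf-8") as f:
    file = f.read()

print(file[:350])  # verify a portion of the content

text_splitter = SpacyTextSplitter(
    chunk_size=200,
    chunk_overlap=50,
)

texts = text_splitter.split_text(file)
print(texts[0])
```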
Basic Usage of SentenceTransformers
SentenceTransformersTokenTextSplitter is a text splitter specialized for sentence-transformer models.
Its default behavior is to split text into chunks that fit within the token window of the sentence-transformer model being used.
Check the sample text.
The following code counts the number of tokens in the text stored in the file variable, excluding the start and stop tokens, and prints the result.
Use the splitter.split_text() function to split the text stored in the text_to_split variable into chunks.
Split the text into chunks.
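A minimal sketch of these steps, assuming the sample text is in the file variable; by default this splitter sizes chunks for the sentence-transformers/all-MiniLM-L6-v2 model.

```python
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)

# Count tokens in the text, excluding the start and stop tokens
count_start_and_stop_tokens = 2
text_token_count = splitter.count_tokens(text=file) - count_start_and_stop_tokens
print(text_token_count)

# Split the text into chunks that fit the model's token window
text_to_split = file
text_chunks = splitter.split_text(text=text_to_split)
print(len(text_chunks))
print(text_chunks[0])
```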
Basic Usage of NLTK
The Natural Language Toolkit (NLTK) is a library and a collection of programs for English natural language processing (NLP), written in the Python programming language.
Instead of simply splitting on "\n\n", the text can be split using NLTK tokenizers.
Text splitting method: The text is split using the NLTK tokenizer.
Chunk size measurement: The size is measured by the number of characters.
NLTK (Natural Language Toolkit) is a Python library for natural language processing. It supports various NLP tasks such as text preprocessing, tokenization, morphological analysis, and part-of-speech tagging.
Before using NLTK, you need to run nltk.download('punkt_tab').
The reason for running nltk.download('punkt_tab') is to allow the NLTK (Natural Language Toolkit) library to download the necessary data files required for tokenizing text.
Specifically, punkt_tab is a tokenization model capable of splitting text into words or sentences for multiple languages, including English.
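```python
import nltk

# Download the tokenizer data used by the NLTK-based splitter
nltk.download("punkt_tab")
```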
Verify the sample text.
Create a text splitter using the NLTKTextSplitter class. Set the chunk_size parameter to 1000 to split the text into chunks of up to 1000 characters.
Use the split_text method of the text_splitter object to split the file text.
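A minimal sketch of these steps, assuming the sample text is already in the file variable.

```python
from langchain_text_splitters import NLTKTextSplitter

# Split with the NLTK tokenizer, merging into chunks of up to 1000 characters
text_splitter = NLTKTextSplitter(chunk_size=1000)

texts = text_splitter.split_text(file)
print(texts[0])
```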
Basic Usage of KoNLPy
KoNLPy (Korean NLP in Python) is a Python package for Korean Natural Language Processing (NLP).
Tokenization
Tokenization is the process of dividing text into smaller, more manageable units called tokens.
These tokens often represent meaningful elements such as words, phrases, symbols, or other components crucial for further processing and analysis.
In languages like English, tokenization typically involves separating words based on spaces and punctuation.
The effectiveness of tokenization largely depends on the tokenizer's understanding of the language structure, ensuring the generation of meaningful tokens.
Tokenizers designed for English lack the ability to comprehend the unique semantic structure of other languages, such as Korean, and therefore cannot be effectively used for Korean text processing.
Korean Tokenization Using KoNLPy’s Kkma Analyzer
For Korean text, KoNLPy includes a morphological analyzer called Kkma (Korean Knowledge Morpheme Analyzer).
Kkma provides detailed morphological analysis for Korean text. It breaks sentences into words and further decomposes words into their morphemes while identifying the part of speech for each token. It can also split text blocks into individual sentences, which is particularly useful for processing lengthy texts.
Considerations When Using Kkma
Kkma is known for its detailed analysis. However, this precision can affect processing speed. Therefore, Kkma is best suited for applications that prioritize analytical depth over rapid text processing.
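As a small illustration (the sample sentence is arbitrary, and konlpy with a Java runtime must be installed), Kkma can split Korean text into sentences and morphemes:

```python
from konlpy.tag import Kkma

kkma = Kkma()
sample = "안녕하세요. KoNLPy의 Kkma 분석기를 사용한 간단한 예제입니다."

print(kkma.sentences(sample))  # split into individual sentences
print(kkma.morphs(sample))     # decompose into morphemes
print(kkma.pos(sample)[:5])    # part-of-speech tags for the first tokens
```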
KoNLPy is a Python package for Korean Natural Language Processing, offering features such as morphological analysis, part-of-speech tagging, and syntactic parsing.
Verify the sample text.
This is an example of splitting Korean text using KonlpyTextSplitter.
Use the text_splitter to split the file content into sentences.
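A minimal sketch; the Korean sample file path below is a placeholder, so substitute whatever Korean text you want to split.

```python
from langchain_text_splitters import KonlpyTextSplitter

# The path below is hypothetical; point it at a Korean sample document
with open("./data/appendix-keywords-korean.txt", encoding="utf-8") as f:
    korean_file = f.read()

text_splitter = KonlpyTextSplitter()

texts = text_splitter.split_text(korean_file)  # split into sentence-based chunks
print(texts[0])
```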
Basic Usage of Hugging Face tokenizer
Hugging Face provides various tokenizers.
This code demonstrates calculating the token length of a text using one of Hugging Face's tokenizers, GPT2TokenizerFast.
How the text is divided: The text is split at the character level.
How the chunk size is measured: It is based on the number of tokens calculated by the Hugging Face tokenizer.
A tokenizer object is created using the GPT2TokenizerFast class.
The from_pretrained method is called to load the pre-trained "gpt2" tokenizer model.
The from_huggingface_tokenizer method is used to initialize a text splitter with the Hugging Face tokenizer (tokenizer).
Check the split result of the first element.
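A minimal sketch of these steps, assuming the sample text is already in the file variable; the chunk sizes are illustrative.

```python
from transformers import GPT2TokenizerFast
from langchain_text_splitters import CharacterTextSplitter

# Load the pre-trained "gpt2" tokenizer
hf_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Split at the character level, measuring chunk size in GPT-2 tokens
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    hf_tokenizer,
    chunk_size=300,
    chunk_overlap=50,
)

texts = text_splitter.split_text(file)
print(texts[0])  # check the split result of the first element
```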