SemanticChunker
Author: Wonyoung Lee
Peer Review: Wooseok Jeong, sohyunwriter
This is a part of LangChain Open Tutorial
This tutorial dives into a Text Splitter that uses semantic similarity to split text.
LangChain's SemanticChunker is a powerful tool that takes document chunking to a whole new level. Unlike traditional methods that split text at fixed intervals, the SemanticChunker analyzes the meaning of the content to create more logical divisions.
This approach relies on OpenAI's embedding model, calculating how similar different pieces of text are by converting them into numerical representations. The tool offers various splitting options to suit your needs. You can choose from methods based on percentiles, standard deviation, or interquartile range.
What sets the SemanticChunker apart is its ability to preserve context by identifying natural breaks. This ultimately leads to better performance when working with large language models.
Since the SemanticChunker understands the actual content, it generates chunks that are more useful and maintain the flow and context of the original document.
The method first breaks the text down into individual sentences. It then groups neighboring sentences together (e.g., 3 sentences at a time) and finally merges groups that are similar in the embedding space.
Set up the environment. You may refer to Environment Setup for more details.
[Note] langchain-opentutorial is a package that provides a set of easy-to-use environment setup methods, useful functions, and utilities for tutorials. You can check out langchain-opentutorial for more details.
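If you prefer to install the dependencies manually, a minimal sketch might look like this (the exact package list is an assumption based on the imports used later in this tutorial):

```python
# Install the packages this tutorial relies on (the package list is illustrative)
%pip install -qU langchain-opentutorial langchain_experimental langchain_openai python-dotenv
```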
Alternatively, you can set and load OPENAI_API_KEY from a .env file.
[Note] This is only necessary if you haven't already set OPENAI_API_KEY in previous steps.
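A minimal sketch using the python-dotenv package, assuming a .env file containing OPENAI_API_KEY exists in the working directory:

```python
# Load environment variables (including OPENAI_API_KEY) from a .env file
from dotenv import load_dotenv

load_dotenv(override=True)
```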
Load the sample text and output its content.
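For example, assuming the sample text lives at ./data/sample-text.txt (the path is illustrative; adjust it to your own file):

```python
# Read the sample text into a single string named `file`,
# which the rest of this tutorial refers to
with open("./data/sample-text.txt", encoding="utf-8") as f:
    file = f.read()

print(file[:350])  # preview the beginning of the document
```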
The SemanticChunker is an experimental LangChain feature that splits text into semantically similar chunks. This approach allows for more effective processing and analysis of text data.
Use the SemanticChunker to divide the text into semantically related chunks.
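The splitter lives in LangChain's experimental package and is initialized with an embedding model. A minimal sketch using OpenAI embeddings:

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Initialize the SemanticChunker with an embedding model.
# By default it uses percentile-based breakpoints.
text_splitter = SemanticChunker(OpenAIEmbeddings())
```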
Use the text_splitter with your loaded file (file) to split the text into smaller, more manageable unit documents. This process is often referred to as chunking. After splitting, you can examine the resulting chunks to see how the text has been divided.
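For example:

```python
# Split the raw text into semantically coherent chunks
chunks = text_splitter.split_text(file)

print(chunks[0])  # inspect the first chunk
```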
The create_documents() function takes a list of texts ([file]) and converts the resulting chunks into proper document objects (docs).
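For example:

```python
# create_documents() takes a list of texts and returns Document objects
docs = text_splitter.create_documents([file])

print(docs[0].page_content)  # the text of the first document
```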
This chunking process works by identifying natural breaks between sentences. Here's how it decides where to split the text:
1. It splits the text into sentences and converts each sentence (together with its neighboring sentences) into an embedding.
2. It calculates the difference between these embeddings for each pair of adjacent sentences.
3. When the difference between two sentences exceeds a certain threshold (breakpoint), the text_splitter identifies this as a natural break and splits the text at that point.
Check out Greg Kamradt's video for more details.
This method sorts all embedding differences between sentences. Then, it splits the text at a specific percentile (e.g. 70th percentile).
Examine the resulting document list (docs). Use len(docs) to get the number of chunks created.
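A sketch of the percentile-based configuration, using the 70th percentile from the example above (the threshold amount is a tunable parameter):

```python
# Split at the 70th percentile of the sentence-embedding differences
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=70,
)

docs = text_splitter.create_documents([file])
print(len(docs))  # number of chunks created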
This method sets a threshold based on a specified number of standard deviations (breakpoint_threshold_amount). To use standard deviation for your breakpoints, set the breakpoint_threshold_type parameter to "standard_deviation" when initializing the text_splitter.
After splitting, check the docs list and print its length (len(docs)) to see how many chunks were created.
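A sketch using standard deviation as the breakpoint; the threshold amount of 3 below is illustrative:

```python
# Split where the embedding difference exceeds the mean
# by 3 standard deviations (the amount is illustrative)
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=3,
)

docs = text_splitter.create_documents([file])
print(len(docs))  # number of chunks created
```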
This method uses the interquartile range (IQR) of the embedding differences to determine breakpoints for splitting the text.
Set the breakpoint_threshold_type parameter to "interquartile" when initializing the text_splitter to use the IQR for splitting.
Finally, print the length of the docs list (len(docs)) to view the number of chunks created.
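A sketch of the IQR-based configuration; the scaling factor of 1.5 below is illustrative:

```python
# Split where the embedding difference falls outside
# the interquartile range scaled by 1.5 (the factor is illustrative)
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="interquartile",
    breakpoint_threshold_amount=1.5,
)

docs = text_splitter.create_documents([file])
print(len(docs))  # number of chunks created
```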