
SemanticChunker


  • Author: Wonyoung Lee

  • Peer Review: Wooseok Jeong, sohyunwriter

  • Proofread: Chaeyoon Kim

  • This is a part of LangChain Open Tutorial

Overview

This tutorial dives into a Text Splitter that uses semantic similarity to split text.

LangChain's SemanticChunker is a powerful tool that takes document chunking to a whole new level. Unlike traditional methods that split text at fixed intervals, the SemanticChunker analyzes the meaning of the content to create more logical divisions.

This approach relies on OpenAI's embedding model, which calculates how similar different pieces of text are by converting them into numerical representations. The tool offers various splitting options to suit your needs: you can choose from methods based on percentiles, standard deviation, or interquartile range.

What sets the SemanticChunker apart is its ability to preserve context by identifying natural breaks. This ultimately leads to better performance when working with large language models.

Since the SemanticChunker understands the actual content, it generates chunks that are more useful and maintain the flow and context of the original document.

See Greg Kamradt's notebook and video for more details.

The method first breaks the text down into individual sentences. It then groups each sentence with its neighbors (e.g., 3 sentences at a time) and finally merges groups that are semantically similar in the embedding space.
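
For intuition, the sketch below (not part of the original tutorial code) embeds a few sentences and prints the cosine distance between each adjacent pair; large distances mark candidate breakpoints. The sample sentences are made up for illustration, and OPENAI_API_KEY is assumed to be set.

# Illustrative sketch: cosine distance between adjacent sentence embeddings.
import numpy as np
from langchain_openai.embeddings import OpenAIEmbeddings

sentences = [
    "Cats are small domesticated mammals.",
    "They are popular pets around the world.",
    "The stock market closed higher today.",
]

# Embed every sentence at once; embed_documents returns one vector per input.
vectors = np.array(OpenAIEmbeddings().embed_documents(sentences))

for i in range(len(sentences) - 1):
    a, b = vectors[i], vectors[i + 1]
    # Cosine distance = 1 - cosine similarity.
    distance = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    print(f"Distance between sentence {i} and sentence {i + 1}: {distance:.4f}")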

Table of Contents

  • Overview
  • Environment Setup
  • Creating a SemanticChunker
  • Text Splitting
  • Breakpoints
    • Percentile-Based Splitting
    • Standard Deviation Splitting
    • Interquartile Range Splitting

References

  • Greg Kamradt's notebook
  • Greg Kamradt's video
Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials.

  • You can check out the langchain-opentutorial for more details.

%%capture --no-stderr
%pip install langchain-opentutorial
# Install required packages
from langchain_opentutorial import package


package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain-anthropic",
        "langchain_community",
        "langchain_text_splitters",
        "langchain_openai",
        "langchain_experimental",
    ],
    verbose=False,
    upgrade=False,
)
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "SemanticChunker",  # title
    }
)

Alternatively, you can set and load OPENAI_API_KEY from a .env file.

[Note] This is only necessary if you haven't already set OPENAI_API_KEY in previous steps.

# Configuration File for Managing API Keys as Environment Variables
from dotenv import load_dotenv

# Load API Key Information
load_dotenv(override=True)
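
For reference, a minimal .env file for this tutorial might look like the following (placeholder values shown; replace them with your own keys):

# .env
OPENAI_API_KEY=your-openai-api-key
LANGCHAIN_API_KEY=your-langsmith-api-key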

Load the sample text and output its content.

# Open the data/appendix-keywords.txt file to create a file object called f.
with open("./data/appendix-keywords.txt", encoding="utf-8") as f:
    file = f.read()  # Read the contents of the file and save it in the file variable.

# Print part of the content read from the file.
print(file[:350])

Creating a SemanticChunker

The SemanticChunker is an experimental LangChain feature that splits text into semantically similar chunks.

This approach allows for more effective processing and analysis of text data.

Use the SemanticChunker to divide the text into semantically related chunks.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# Initialize a semantic chunk splitter using OpenAI embeddings.
text_splitter = SemanticChunker(OpenAIEmbeddings())

Text Splitting

Use the text_splitter with your loaded file (file) to split the text into smaller, more manageable unit documents. This process is often referred to as chunking.

chunks = text_splitter.split_text(file)

After splitting, you can examine the resulting chunks to see how the text has been divided.

# Print the first chunk among the divided chunks.
print(chunks[0])

The create_documents() method converts raw text (passed as the single-element list [file]) into proper Document objects (docs).

# Split using text_splitter
docs = text_splitter.create_documents([file])
print(
    docs[0].page_content
)  # Print the content of the first document among the divided documents.
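
Note the difference in return types: split_text returns plain strings, while create_documents wraps each chunk in a Document object. A quick check (assuming both cells above were run):

# split_text returns strings; create_documents returns Document objects.
print(type(chunks[0]).__name__)  # str
print(type(docs[0]).__name__)    # Document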

Breakpoints

This chunking process works by identifying natural breaks between sentences.

Here's how it decides where to split the text:

  1. It computes an embedding for each sentence (or small group of sentences).

  2. It calculates the difference between the embeddings of each pair of adjacent sentences.

  3. When the difference between two sentences exceeds a certain threshold (breakpoint), the text_splitter identifies this as a natural break and splits the text at that point (see the sketch after this list).
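
The sketch below illustrates this decision rule with plain numpy on made-up distance values; the exact formulas inside langchain_experimental are an implementation detail and may differ.

# Illustrative sketch of the breakpoint rule (simplified, assumed behavior).
import numpy as np

# Pretend these are cosine distances between adjacent sentences.
distances = np.array([0.05, 0.07, 0.31, 0.06, 0.28, 0.04])

# Percentile rule: split wherever a distance exceeds the chosen percentile.
threshold = np.percentile(distances, 70)  # cf. breakpoint_threshold_amount=70
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(threshold, breakpoints)  # a new chunk starts after each breakpoint index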

Percentile-Based Splitting

This method computes all embedding differences between adjacent sentences, sorts them, and splits the text wherever a difference exceeds the chosen percentile (e.g., the 70th percentile).

text_splitter = SemanticChunker(
    # Initialize the semantic chunker using OpenAI's embedding model
    OpenAIEmbeddings(),
    # Set the split breakpoint type to percentile
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=70,
)

Examine the resulting document list (docs).

docs = text_splitter.create_documents([file])
for i, doc in enumerate(docs[:5]):
    print(f"[Chunk {i}]", end="\n\n")
    print(doc.page_content)  # Print the content of each of the first five chunks.
    print("===" * 20)

Use len(docs) to get the number of chunks created.

print(len(docs))  # Print the length of docs.

Standard Deviation Splitting

This method sets a threshold based on a specified number of standard deviations (breakpoint_threshold_amount).

To use standard deviation for your breakpoints, set the breakpoint_threshold_type parameter to "standard_deviation" when initializing the text_splitter.

text_splitter = SemanticChunker(
    # Initialize the semantic chunker using OpenAI's embedding model.
    OpenAIEmbeddings(),
    # Use standard deviation as the splitting criterion.
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=1.25,
)
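
For intuition, the standard-deviation threshold can be pictured as the mean of the distances plus breakpoint_threshold_amount standard deviations; this mirrors common implementations, but treat the exact formula as an assumption rather than a documented contract.

# Assumed thresholding logic for "standard_deviation" (simplified sketch).
import numpy as np

distances = np.array([0.05, 0.07, 0.31, 0.06, 0.28, 0.04])
threshold = distances.mean() + 1.25 * distances.std()  # amount = 1.25
print(threshold)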

After splitting, check the docs list and print its length (len(docs)) to see how many chunks were created.

# Split using text_splitter.
docs = text_splitter.create_documents([file])
for i, doc in enumerate(docs[:5]):
    print(f"[Chunk {i}]", end="\n\n")
    print(doc.page_content)  # Print the content of each of the first five chunks.
    print("===" * 20)
print(len(docs))  # Print the number of chunks.

Interquartile Range Splitting

This method uses the interquartile range (IQR) of the embedding differences to set the threshold that determines where the text is split.

Set the breakpoint_threshold_type parameter to "interquartile" when initializing the text_splitter to use the IQR for splitting.

text_splitter = SemanticChunker(
    # Initialize the semantic chunk splitter using OpenAI's embedding model.
    OpenAIEmbeddings(),
    # Set the breakpoint threshold type to interquartile range.
    breakpoint_threshold_type="interquartile",
    breakpoint_threshold_amount=0.5,
)
# Split using text_splitter.
docs = text_splitter.create_documents([file])

# Print the results.
for i, doc in enumerate(docs[:5]):
    print(f"[Chunk {i}]", end="\n\n")
    print(
        doc.page_content
    )  # Print the content of the first document among the split documents.
    print("===" * 20)

Finally, print the length of the docs list (len(docs)) to view the number of chunks created.

print(len(docs))  # Print the length of docs.
