
RecursiveJsonSplitter



  • Author: HeeWung Song(Dan)

  • Peer Review : BokyungisaGod, Chaeyoon Kim

  • Proofread : Chaeyoon Kim

  • This is a part of LangChain Open Tutorial

Overview

This JSON splitter generates smaller JSON chunks by performing a depth-first traversal of JSON data.

The splitter aims to keep nested JSON objects intact as much as possible. However, to ensure chunk sizes remain within the min_chunk_size and max_chunk_size, it will split objects if needed. Note that very large string values (those not containing nested JSON) are not subject to splitting.

If precise control over chunk size is required, you can use a recursive text splitter on the chunks this splitter creates.
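
As a minimal sketch of that post-processing step (the data payload below is a made-up example, and RecursiveCharacterTextSplitter stands in for "a recursive text splitter"):

from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    RecursiveJsonSplitter,
)

# A made-up JSON payload with one very long string value,
# which the JSON splitter alone will not break up.
data = {"doc": {"title": "example", "body": "lorem ipsum " * 100}}

json_splitter = RecursiveJsonSplitter(max_chunk_size=300)
json_chunks = json_splitter.split_text(json_data=data)

# Re-split any chunk that is still too large, this time by character count.
char_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=0)
final_chunks = [
    piece
    for chunk in json_chunks
    for piece in (char_splitter.split_text(chunk) if len(chunk) > 300 else [chunk])
]
print([len(c) for c in final_chunks])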

Splitting Criteria

  1. Text splitting method: Based on JSON values

  2. Chunk size: Determined by character count

Table of Contents

  • Overview

  • Environment Setup

  • Basic JSON Splitting

  • Handling JSON Structure

References

  • LangChain RecursiveJsonSplitter

  • LangChain: How to split JSON data

Environment Setup

Setting up your environment is the first step. See the Environment Setup guide for more details.

[Note]

  • The langchain-opentutorial is a package of easy-to-use environment setup guidance, useful functions, and utilities for tutorials.

  • Check out the langchain-opentutorial for more details.

%%capture --no-stderr
%pip install langchain-opentutorial
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain_community",
        "langchain_text_splitters",
        "langchain_openai",
    ]
)
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "RecursiveJsonSplitter",
    }
)

Alternatively, you can set and load OPENAI_API_KEY from a .env file.

[Note] This is only necessary if you haven't already set OPENAI_API_KEY in previous steps.

from dotenv import load_dotenv

load_dotenv()
True

Basic JSON Splitting

Let's explore the basic methods of splitting JSON data using the RecursiveJsonSplitter.

  • JSON data preparation

  • RecursiveJsonSplitter configuration

  • Three splitting methods (split_json, create_documents, and split_text)

  • Chunk size verification

import requests

# Load the JSON data.
json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()
json_data

Here is an example of splitting JSON data with the RecursiveJsonSplitter.

from langchain_text_splitters import RecursiveJsonSplitter

# Create a RecursiveJsonSplitter object that splits JSON data into chunks with a maximum size of 300
splitter = RecursiveJsonSplitter(max_chunk_size=300)

Use the splitter.split_json() method to recursively split JSON data.

# Recursively split JSON data. Use this when you need to access or manipulate small JSON fragments.
json_chunks = splitter.split_json(json_data=json_data)
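
Unlike split_text(), which returns serialized strings, split_json() returns the fragments as Python dicts, so you can inspect or manipulate them directly. A quick check (assuming the json_chunks from the cell above):

# Each fragment is a plain dict, not a string.
print(type(json_chunks[0]))
print(len(json_chunks))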

The following code demonstrates two more ways to split JSON data with a splitter object (here, our RecursiveJsonSplitter instance): splitter.create_documents() converts the JSON data into Document objects, while splitter.split_text() returns the chunks as a list of strings.

# Create documents based on JSON data.
docs = splitter.create_documents(texts=[json_data])

# Create string chunks based on JSON data.
texts = splitter.split_text(json_data=json_data)

# Print the first string.
print(docs[0].page_content)

print("===" * 20)

# Print the split string chunks.
print(texts[0])
{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session."}}}}
    ============================================================
    {"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session."}}}}

Handling JSON Structure

Let's explore how the RecursiveJsonSplitter handles different JSON structures and its limitations.

  • Verification of list object size

  • Parsing JSON structures

  • Using the convert_lists parameter for list transformation

By examining texts[2] (one of the larger chunks), we can confirm it contains a list object.

  • The chunk at index 2 exceeds the size limit (300) because it contains a list.

  • By default, the RecursiveJsonSplitter does not split list objects.

# Let's check the size of the chunks
print([len(text) for text in texts][:10])

# When examining one of the larger chunks, we can see that it contains a list object
print(texts[2])
[232, 197, 469, 210, 213, 237, 271, 191, 232, 215]
    {"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "include_stats", "in": "query", "required": false, "schema": {"type": "boolean", "default": false, "title": "Include Stats"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}

You can parse the chunk at index 2 using the json module.

import json

# Parse the chunk back into a Python object.
# Note: this reassigns json_data, replacing the full API spec loaded earlier.
json_data = json.loads(texts[2])
json_data["paths"]
{'/api/v1/sessions/{session_id}': {'get': {'parameters': [{'name': 'session_id',
         'in': 'path',
         'required': True,
         'schema': {'type': 'string', 'format': 'uuid', 'title': 'Session Id'}},
        {'name': 'include_stats',
         'in': 'query',
         'required': False,
         'schema': {'type': 'boolean',
          'default': False,
          'title': 'Include Stats'}},
        {'name': 'accept',
         'in': 'header',
         'required': False,
         'schema': {'anyOf': [{'type': 'string'}, {'type': 'null'}],
          'title': 'Accept'}}]}}}

Setting the convert_lists parameter to True converts JSON lists into dictionaries of index:item key:value pairs. Note that json_data at this point is the single chunk parsed above, not the full API spec, which is why the resulting chunks below are so small.

# The following preprocesses JSON and converts lists into dictionaries with index:item as key:value pairs
texts = splitter.split_text(json_data=json_data, convert_lists=True)
# The list has been converted to a dictionary, and we'll check the result.
print(texts[2])
{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": {"2": {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": {"0": {"type": "string"}, "1": {"type": "null"}}, "title": "Accept"}}}}}}}

You can access specific documents within the docs list using their index.

# Check the document at index 2.
print(docs[2])
page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "include_stats", "in": "query", "required": false, "schema": {"type": "boolean", "default": false, "title": "Include Stats"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}'
