RecursiveJsonSplitter

Author: HeeWung Song(Dan)
Peer Review : BokyungisaGod, Chaeyoon Kim
Proofread : Chaeyoon Kim
This is a part of LangChain Open Tutorial

Overview

This JSON splitter generates smaller JSON chunks by performing a depth-first traversal of JSON data.

The splitter tries to keep nested JSON objects intact, but it splits them when needed to keep chunk sizes between min_chunk_size and max_chunk_size. Note that a very large string value (one that does not contain nested JSON) is never split, so a chunk holding such a value can exceed max_chunk_size.

If precise control over chunk size is required, you can use a recursive text splitter on the chunks this splitter creates.

Splitting Criteria

Text splitting method: Based on JSON values
Chunk size: Determined by character count
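
The traversal described in the overview can be illustrated with a minimal sketch. This is a simplified illustration in plain Python, not the library's implementation: the names split_json_sketch and set_nested are hypothetical, the sample data is made up, and min_chunk_size is ignored. The sketch measures size as the character count of the serialized JSON, keeps a nested object whole when it fits, and descends into it only when it does not.

```python
import json
from typing import Any


def set_nested(target: dict, path: list[str], value: Any) -> None:
    """Write value into target at the given key path, creating parent dicts."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json_sketch(data: dict, max_chunk_size: int = 100) -> list[dict]:
    """Depth-first split of a nested dict into chunks of bounded serialized size."""
    chunks: list[dict] = [{}]

    def size(obj: Any) -> int:
        # Chunk size is the character count of the serialized JSON.
        return len(json.dumps(obj))

    def walk(node: dict, path: list[str]) -> None:
        for key, value in node.items():
            full_path = path + [key]
            # Wrap the value in its full key path to measure what it would add.
            wrapped: dict = {}
            set_nested(wrapped, full_path, value)
            if size(chunks[-1]) + size(wrapped) <= max_chunk_size:
                # The whole subtree fits: keep it intact in the current chunk.
                set_nested(chunks[-1], full_path, value)
            elif isinstance(value, dict) and value:
                # Too large, but still an object: descend and split its children.
                walk(value, full_path)
            else:
                # A leaf (string, number, list, ...) is never split: start a new chunk.
                if chunks[-1]:
                    chunks.append({})
                set_nested(chunks[-1], full_path, value)

    walk(data, [])
    return [chunk for chunk in chunks if chunk]


sample = {
    "info": {"title": "Demo", "version": "1.0"},
    "paths": {
        "/a": {"get": {"summary": "Read A"}},
        "/b": {"get": {"summary": "Read B"}},
    },
}

chunks = split_json_sketch(sample, max_chunk_size=60)
for chunk in chunks:
    print(len(json.dumps(chunk)), chunk)
```

As described in the overview, a leaf value that is larger than the limit is kept whole in this sketch, so the chunk holding it can exceed max_chunk_size.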

Environment Setup

Setting up your environment is the first step. See the Environment Setup guide for more details.

[Note]

langchain-opentutorial is a package that provides easy-to-use environment setup helpers, functions, and utilities for these tutorials.
See the langchain-opentutorial package for more details.

%%capture --no-stderr
%pip install langchain-opentutorial

# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain_community",
        "langchain_text_splitters",
        "langchain_openai",
    ]
)

# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "RecursiveJsonSplitter",
    }
)

Alternatively, you can set and load OPENAI_API_KEY from a .env file.

[Note] This is only necessary if you haven't already set OPENAI_API_KEY in previous steps.

from dotenv import load_dotenv

load_dotenv()

True

Basic JSON Splitting

Let's explore the basic methods of splitting JSON data using the RecursiveJsonSplitter.

JSON data preparation
RecursiveJsonSplitter configuration
Three splitting methods (split_json, create_documents, and split_text)
Chunk size verification

import requests

# Load the JSON data.
json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()

json_data

Here is an example of splitting JSON data with the RecursiveJsonSplitter.

from langchain_text_splitters import RecursiveJsonSplitter

# Create a RecursiveJsonSplitter object that splits JSON data into chunks with a maximum size of 300
splitter = RecursiveJsonSplitter(max_chunk_size=300)

Use the splitter.split_json() method to recursively split JSON data.

# Recursively split JSON data. Use this when you need to access or manipulate small JSON fragments.
json_chunks = splitter.split_json(json_data=json_data)

A splitter object such as a RecursiveJsonSplitter instance offers two more ways to split JSON data: splitter.create_documents() converts the JSON data into Document objects, and splitter.split_text() returns the chunks as a list of strings.

# Create documents based on JSON data.
docs = splitter.create_documents(texts=[json_data])

# Create string chunks based on JSON data.
texts = splitter.split_text(json_data=json_data)

# Print the content of the first document.
print(docs[0].page_content)

print("===" * 20)

# Print the first string chunk.
print(texts[0])

{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session."}}}}
    ============================================================
    {"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session."}}}}
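
The identical output above suggests that the three methods differ mainly in their return type. The sketch below illustrates that relationship in plain Python, under the assumption that the string form is the JSON serialization of each dict chunk and the document form wraps that string. SimpleDocument is a hypothetical stand-in for LangChain's Document class, and the sample chunks are made up for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document."""

    page_content: str
    metadata: dict = field(default_factory=dict)


# Dict chunks: the kind of value split_json() returns.
dict_chunks = [{"openapi": "3.1.0"}, {"info": {"title": "LangSmith"}}]

# String chunks: each dict serialized to JSON, as split_text() returns.
text_chunks = [json.dumps(chunk) for chunk in dict_chunks]

# Document chunks: each string wrapped in a document object, as create_documents() returns.
doc_chunks = [SimpleDocument(page_content=text) for text in text_chunks]

print(text_chunks[0])
print(doc_chunks[0].page_content)
```

Use the dict form when you need to read or modify fields, and the string or document form when passing chunks to embedding or retrieval steps.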

Handling JSON Structure

Let's explore how the RecursiveJsonSplitter handles different JSON structures and its limitations.

Verification of list object size
Parsing JSON structures
Using the convert_lists parameter for list transformation

The size check below shows that the chunk at index 2 (the third chunk) is 469 characters long, which exceeds the 300-character limit.

Examining texts[2] confirms that it contains a list object.
By default, the RecursiveJsonSplitter does not split list objects.

# Let's check the size of the chunks
print([len(text) for text in texts][:10])

# When examining one of the larger chunks, we can see that it contains a list object
print(texts[2])

[232, 197, 469, 210, 213, 237, 271, 191, 232, 215]
    {"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "include_stats", "in": "query", "required": false, "schema": {"type": "boolean", "default": false, "title": "Include Stats"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}
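
To locate oversized chunks without scanning the size list by eye, a small helper can flag every chunk above the limit and report whether it contains a list. This is a sketch in plain Python: report_oversize, contains_list, and the sample strings are hypothetical, and with real data you would pass the texts list from above instead.

```python
import json
from typing import Any


def contains_list(node: Any) -> bool:
    """Return True if a list appears anywhere inside the parsed JSON value."""
    if isinstance(node, list):
        return True
    if isinstance(node, dict):
        return any(contains_list(value) for value in node.values())
    return False


def report_oversize(texts: list[str], max_chunk_size: int) -> list[tuple[int, int, bool]]:
    """Return (index, length, has_list) for every chunk longer than the limit."""
    report = []
    for index, text in enumerate(texts):
        if len(text) > max_chunk_size:
            report.append((index, len(text), contains_list(json.loads(text))))
    return report


# Made-up sample chunks: one small object and one object holding a list.
sample_texts = [
    json.dumps({"a": {"b": "short"}}),
    json.dumps({"a": {"items": [{"name": "x"}, {"name": "y"}, {"name": "z"}]}}),
]

for index, length, has_list in report_oversize(sample_texts, max_chunk_size=40):
    print(f"chunk {index}: {length} characters, contains list: {has_list}")
```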

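You can parse the chunk at index 2 using the json module. Note that the code below reassigns json_data, so from this point on it refers to the parsed chunk rather than the full OpenAPI document.

import json

# Parse the third chunk back into a Python dict (this overwrites json_data).
json_data = json.loads(texts[2])
json_data["paths"]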

{'/api/v1/sessions/{session_id}': {'get': {'parameters': [{'name': 'session_id',
         'in': 'path',
         'required': True,
         'schema': {'type': 'string', 'format': 'uuid', 'title': 'Session Id'}},
        {'name': 'include_stats',
         'in': 'query',
         'required': False,
         'schema': {'type': 'boolean',
          'default': False,
          'title': 'Include Stats'}},
        {'name': 'accept',
         'in': 'header',
         'required': False,
         'schema': {'anyOf': [{'type': 'string'}, {'type': 'null'}],
          'title': 'Accept'}}]}}}
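
Each chunk keeps the full key path from the root (here paths, then the endpoint, then get), so parsed chunks can be merged back into a single nested structure. The following is a minimal sketch, assuming chunks produced with the default settings (lists left intact). deep_merge and merge_chunks are hypothetical helpers, not part of LangChain, and the sample strings are made up.

```python
import json
from typing import Any


def deep_merge(target: dict, source: dict) -> dict:
    """Recursively merge source into target and return target."""
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            # Both sides hold an object under this key: merge their children.
            deep_merge(target[key], value)
        else:
            # New key, or a non-object value: take the value from source.
            target[key] = value
    return target


def merge_chunks(texts: list[str]) -> dict:
    """Parse each JSON string chunk and merge them all into one dict."""
    merged: dict[str, Any] = {}
    for text in texts:
        deep_merge(merged, json.loads(text))
    return merged


# Made-up chunks that share the same root path.
sample_chunks = [
    '{"paths": {"/a": {"get": {"summary": "Read A"}}}}',
    '{"paths": {"/a": {"get": {"tags": ["x"]}}}}',
    '{"info": {"title": "Demo"}}',
]

merged = merge_chunks(sample_chunks)
print(json.dumps(merged, indent=2))
```

If the chunks were produced with convert_lists=True, the merged result would hold index-keyed dictionaries where the original data had lists.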

Setting the convert_lists parameter to True transforms JSON lists into key:value pairs (formatted as index:item), which lets the splitter break them up like any other object.

# Preprocess the JSON by converting lists into dictionaries with index:item as key:value pairs, then split it.
# json_data here is the chunk parsed above, so the result contains the pieces of that chunk.
texts = splitter.split_text(json_data=json_data, convert_lists=True)

# The list is now a dictionary keyed by index; print the third piece.
print(texts[2])

{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": {"2": {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": {"0": {"type": "string"}, "1": {"type": "null"}}, "title": "Accept"}}}}}}}
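
The index:item conversion can be sketched in a few lines of plain Python. The function name lists_to_dicts is hypothetical, and this is an illustration of the idea rather than the library's code: every list becomes a dictionary whose keys are the item positions as strings, which matches the "0" and "1" keys in the output above.

```python
from typing import Any


def lists_to_dicts(node: Any) -> Any:
    """Recursively replace every list with a dict keyed by item index."""
    if isinstance(node, dict):
        return {key: lists_to_dicts(value) for key, value in node.items()}
    if isinstance(node, list):
        return {str(index): lists_to_dicts(item) for index, item in enumerate(node)}
    # Strings, numbers, booleans, and None are returned unchanged.
    return node


schema = {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}
converted = lists_to_dicts(schema)
print(converted)
```

Once lists are dictionaries, a splitter that only descends into objects can divide them item by item, at the cost of the chunks no longer containing JSON arrays.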

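You can access specific documents within the docs list using their index. Because docs was created before convert_lists was applied, the document at index 2 still contains the original list.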

# Check the document at index 2.
print(docs[2])

page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "include_stats", "in": "query", "required": false, "schema": {"type": "boolean", "default": false, "title": "Include Stats"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}'
