RecursiveJsonSplitter
Author: HeeWung Song(Dan)
Peer Review : BokyungisaGod, Chaeyoon Kim
Proofread : Chaeyoon Kim
This is a part of LangChain Open Tutorial
Overview
This JSON splitter generates smaller JSON chunks by performing a depth-first traversal of JSON data.
The splitter aims to keep nested JSON objects intact as much as possible. However, to ensure chunk sizes remain within the min_chunk_size and max_chunk_size, it will split objects if needed. Note that very large string values (those not containing nested JSON) are not subject to splitting.
If precise control over chunk size is required, you can use a recursive text splitter on the chunks this splitter creates.
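The traversal described above can be pictured with a simplified sketch that uses only the standard library. It is not the library's implementation: split_json_sketch, json_size, and set_nested are hypothetical names, min_chunk_size is ignored, the sample data is made up, and size is assumed to be the character count of the serialized JSON.

```python
import json


def json_size(data) -> int:
    """Measure a value as the character count of its JSON serialization."""
    return len(json.dumps(data))


def set_nested(target: dict, path: list, value) -> None:
    """Store value under the given key path, creating parent dicts as needed."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json_sketch(data: dict, max_chunk_size: int = 300) -> list:
    """Depth-first split of a nested dict into smaller dict chunks."""
    chunks = [{}]

    def walk(node, path):
        if isinstance(node, dict) and node:
            for key, value in node.items():
                new_path = path + [key]
                remaining = max_chunk_size - json_size(chunks[-1])
                if json_size({key: value}) < remaining:
                    # The whole subtree fits: keep it intact in the current chunk.
                    set_nested(chunks[-1], new_path, value)
                else:
                    # Start a fresh chunk, then descend to split the subtree.
                    if chunks[-1]:
                        chunks.append({})
                    walk(value, new_path)
        else:
            # Leaf values (strings, numbers, lists) are stored whole, never split.
            set_nested(chunks[-1], path, node)

    if data:
        walk(data, [])
    return [chunk for chunk in chunks if chunk]


# Made-up sample data for illustration.
data = {
    "info": {"title": "Demo", "version": "0.1.0"},
    "paths": {
        "/a": {"get": {"summary": "Read A", "description": "x" * 40}},
        "/b": {"get": {"summary": "Read B", "description": "y" * 40}},
    },
    "note": "z" * 200,
}
for chunk in split_json_sketch(data, max_chunk_size=100):
    print(len(json.dumps(chunk)), list(chunk))
```

In this sketch the size check looks only at the item being added, not at the parent keys that wrap it, so a chunk can come out somewhat larger than the limit. The long string under "note" also stays whole even though it is larger than the limit on its own.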
Splitting Criteria
Text splitting method: Based on JSON values
Chunk size: Determined by character count
Environment Setup
Setting up your environment is the first step. See the Environment Setup guide for more details.
[Note] langchain-opentutorial is a package that provides easy-to-use environment setup guidance, useful functions, and utilities for tutorials. Check out langchain-opentutorial for more details.
%%capture --no-stderr
%pip install langchain-opentutorial
# Install required packages
from langchain_opentutorial import package
package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain_community",
        "langchain_text_splitters",
        "langchain_openai",
    ]
)
# Set environment variables
from langchain_opentutorial import set_env
set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "RecursiveJsonSplitter",
    }
)
Alternatively, you can set and load OPENAI_API_KEY from a .env file.
[Note] This is only necessary if you haven't already set OPENAI_API_KEY in previous steps.
from dotenv import load_dotenv
load_dotenv()
True
Basic JSON Splitting
Let's explore the basic methods of splitting JSON data using the RecursiveJsonSplitter.
JSON data preparation
RecursiveJsonSplitter configuration
Three splitting methods (split_json, create_documents, and split_text)
Chunk size verification
import requests
# Load the JSON data.
json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()
json_data
Here is an example of splitting JSON data with the RecursiveJsonSplitter.
from langchain_text_splitters import RecursiveJsonSplitter
# Create a RecursiveJsonSplitter object that splits JSON data into chunks with a maximum size of 300
splitter = RecursiveJsonSplitter(max_chunk_size=300)
Use the splitter.split_json() method to recursively split JSON data.
# Recursively split JSON data. Use this when you need to access or manipulate small JSON fragments.
json_chunks = splitter.split_json(json_data=json_data)
The following code demonstrates two more ways to split JSON data with the same splitter: use the splitter.create_documents() method to convert JSON data into Document objects, and use the splitter.split_text() method to split JSON data into a list of strings.
# Create documents based on JSON data.
docs = splitter.create_documents(texts=[json_data])
# Create string chunks based on JSON data.
texts = splitter.split_text(json_data=json_data)
# Print the content of the first document.
print(docs[0].page_content)
print("===" * 20)
# Print the first string chunk.
print(texts[0])
{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session."}}}}
============================================================
{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session."}}}}
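The identical output for docs[0].page_content and texts[0] suggests a simple mental model for how the three return types relate: string chunks look like the JSON serialization of the dict chunks, and each document wraps one such string. The sketch below illustrates that model with the standard library only. It is an assumption about behavior rather than the library's code; SketchDocument is a hypothetical stand-in for a Document, and dict_chunks is hand-written sample data.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a Document: text plus optional metadata."""

    page_content: str
    metadata: dict = field(default_factory=dict)


# Hand-written dict chunks, shaped like the result of splitting JSON into dicts.
dict_chunks = [
    {"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}},
    {"paths": {"/api/v1/sessions/{session_id}": {"get": {"summary": "Read Tracer Session"}}}},
]

# String chunks: each dict chunk serialized to a JSON string.
text_chunks = [json.dumps(chunk) for chunk in dict_chunks]

# Document chunks: each string wrapped as page_content.
doc_chunks = [SketchDocument(page_content=text) for text in text_chunks]

print(text_chunks[0])
print(doc_chunks[0].page_content == text_chunks[0])
```

Under this model, dict chunks suit programmatic access to small JSON fragments, while strings and documents suit pipelines that expect text.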
Handling JSON Structure
Let's explore how the RecursiveJsonSplitter handles different JSON structures and its limitations.
Verification of list object size
Parsing JSON structures
Using the convert_lists parameter for list transformation
Examining texts[2], one of the larger chunks, confirms that it contains a list object.
The chunk at index 2 (469 characters) exceeds the size limit of 300 because it contains a list.
The RecursiveJsonSplitter is designed not to split list objects.
# Let's check the size of the chunks
print([len(text) for text in texts][:10])
# When examining one of the larger chunks, we can see that it contains a list object
print(texts[2])
[232, 197, 469, 210, 213, 237, 271, 191, 232, 215]
{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "include_stats", "in": "query", "required": false, "schema": {"type": "boolean", "default": false, "title": "Include Stats"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}
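To find such chunks systematically, one option is to flag every chunk over the limit and check whether it contains a list. The following is a minimal sketch using only the standard library; contains_list and oversized_chunks are hypothetical helper names, and sample_texts is made-up data standing in for texts.

```python
import json


def contains_list(node) -> bool:
    """Return True if a list appears anywhere inside the parsed JSON value."""
    if isinstance(node, list):
        return True
    if isinstance(node, dict):
        return any(contains_list(value) for value in node.values())
    return False


def oversized_chunks(texts, max_chunk_size=300):
    """Return (index, length, has_list) for every chunk longer than max_chunk_size."""
    report = []
    for index, text in enumerate(texts):
        if len(text) > max_chunk_size:
            report.append((index, len(text), contains_list(json.loads(text))))
    return report


# Made-up sample chunks: the second one holds a long list and exceeds the limit.
sample_texts = [
    json.dumps({"info": {"title": "LangSmith"}}),
    json.dumps({"parameters": [{"name": f"param_{i}", "required": False} for i in range(10)]}),
]
print(oversized_chunks(sample_texts, max_chunk_size=300))
```

Passing the real texts list to oversized_chunks in place of sample_texts would report which chunks exceed the limit and whether a list is the likely cause.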
You can parse the chunk at index 2 using the json module.
import json
# Parse the chunk string back into a Python dictionary. Note that this reassigns json_data.
json_data = json.loads(texts[2])
json_data["paths"]
{'/api/v1/sessions/{session_id}': {'get': {'parameters': [{'name': 'session_id',
'in': 'path',
'required': True,
'schema': {'type': 'string', 'format': 'uuid', 'title': 'Session Id'}},
{'name': 'include_stats',
'in': 'query',
'required': False,
'schema': {'type': 'boolean',
'default': False,
'title': 'Include Stats'}},
{'name': 'accept',
'in': 'header',
'required': False,
'schema': {'anyOf': [{'type': 'string'}, {'type': 'null'}],
'title': 'Accept'}}]}}}
Setting the convert_lists parameter to True transforms JSON lists into key:value pairs (formatted as index:item).
# json_data now holds only the chunk parsed earlier with json.loads, so this call splits that chunk alone.
# convert_lists=True preprocesses the JSON and converts lists into dictionaries with index:item as key:value pairs.
texts = splitter.split_text(json_data=json_data, convert_lists=True)
# The lists are now dictionaries, so the chunk can be split further. Check the result.
print(texts[2])
{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": {"2": {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": {"0": {"type": "string"}, "1": {"type": "null"}}, "title": "Accept"}}}}}}}
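The transformation itself is easy to picture. Below is a minimal sketch of an index:item conversion written with the standard library only; lists_to_dicts is a hypothetical helper, not the splitter's internal function, and it assumes string indices as keys, as seen in the output above.

```python
import json


def lists_to_dicts(node):
    """Recursively replace every list with a dict keyed by each item's index (as a string)."""
    if isinstance(node, list):
        return {str(index): lists_to_dicts(item) for index, item in enumerate(node)}
    if isinstance(node, dict):
        return {key: lists_to_dicts(value) for key, value in node.items()}
    return node


schema = {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}
print(json.dumps(lists_to_dicts(schema)))
# {"anyOf": {"0": {"type": "string"}, "1": {"type": "null"}}, "title": "Accept"}
```

Once lists are represented as dictionaries, a splitter that only descends into dictionaries can break them apart, which is consistent with the much smaller chunk shown above.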
You can access specific documents within the docs list using their index.
# Check the document at index 2.
print(docs[2])
page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "include_stats", "in": "query", "required": false, "schema": {"type": "boolean", "default": false, "title": "Include Stats"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}'