This JSON splitter generates smaller JSON chunks by performing a depth-first traversal of JSON data.
The splitter aims to keep nested JSON objects intact as much as possible. However, to ensure chunk sizes remain within the min_chunk_size and max_chunk_size, it will split objects if needed. Note that very large string values (those not containing nested JSON) are not subject to splitting.
If precise control over chunk size is required, you can use a recursive text splitter on the chunks this splitter creates.
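The traversal described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's actual implementation: the names `json_size`, `set_nested`, and `split_json_sketch` are hypothetical, and size is assumed to be the length of the serialized JSON string.

```python
import json
from typing import Any


def json_size(data: Any) -> int:
    """Size measure (assumption): length of the serialized JSON string."""
    return len(json.dumps(data))


def set_nested(target: dict, path: list[str], value: Any) -> None:
    """Store value in target under the nested keys listed in path."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json_sketch(data: dict, max_chunk_size: int, min_chunk_size: int) -> list[dict]:
    """Depth-first split of a dict into smaller dicts (hypothetical helper)."""
    chunks: list[dict] = [{}]

    def visit(node: Any, path: list[str]) -> None:
        if not isinstance(node, dict):
            # Leaves (strings, numbers, lists) are stored whole, never split.
            set_nested(chunks[-1], path, node)
            return
        for key, value in node.items():
            new_path = path + [key]
            current = json_size(chunks[-1])
            remaining = max_chunk_size - current
            if json_size({key: value}) < remaining:
                # The whole subtree fits, so keep the nested object intact.
                set_nested(chunks[-1], new_path, value)
            else:
                # Start a new chunk only if the current one is big enough,
                # then descend and place the subtree piece by piece.
                if current >= min_chunk_size:
                    chunks.append({})
                visit(value, new_path)

    visit(data, [])
    return [chunk for chunk in chunks if chunk]


sample = {"a": {"b": "x" * 10, "c": "y" * 10}, "d": [1, 2, 3], "e": "z" * 100}
for chunk in split_json_sketch(sample, max_chunk_size=40, min_chunk_size=20):
    print(json_size(chunk), chunk)
```

In this sketch the long string under `"e"` ends up in a chunk larger than `max_chunk_size`, which mirrors the note above that large string values are not split.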
You can also set and load OPENAI_API_KEY from a .env file.
[Note] This is only necessary if you haven't already set OPENAI_API_KEY in previous steps.
from dotenv import load_dotenv
load_dotenv()
True
Basic JSON Splitting
Let's explore the basic methods of splitting JSON data using the RecursiveJsonSplitter.
JSON data preparation
RecursiveJsonSplitter configuration
Three splitting methods (split_json, create_documents, and split_text)
Chunk size verification
import requests
# Load the JSON data.
json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()
json_data
Here is an example of splitting JSON data with the RecursiveJsonSplitter.
from langchain_text_splitters import RecursiveJsonSplitter
# Create a RecursiveJsonSplitter object that splits JSON data into chunks with a maximum size of 300
splitter = RecursiveJsonSplitter(max_chunk_size=300)
Use the splitter.split_json() method to recursively split JSON data.
# Recursively split JSON data. Use this when you need to access or manipulate small JSON fragments.
json_chunks = splitter.split_json(json_data=json_data)
The following code demonstrates two more ways to split JSON data with the same splitter. Use the splitter.create_documents() method to convert JSON data into Document objects, and use the splitter.split_text() method to split JSON data into a list of strings.
# Create documents based on JSON data.
docs = splitter.create_documents(texts=[json_data])
# Create string chunks based on JSON data.
texts = splitter.split_text(json_data=json_data)
# Print the content of the first Document.
print(docs[0].page_content)
print("===" * 20)
# Print the first string chunk.
print(texts[0])
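To make the relationship between the three output forms concrete, here is a minimal sketch using only the standard library. It assumes that each string chunk is the JSON serialization of a dict chunk and that a Document wraps such a string; `SimpleDocument` is a hypothetical stand-in for the real Document class, and the toy data is invented for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: text plus optional metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Toy dict chunks, shaped like what split_json() would return.
dict_chunks = [{"name": "example"}, {"settings": {"enabled": True}}]

# String form (as from split_text()): each dict serialized to JSON.
text_chunks = [json.dumps(chunk) for chunk in dict_chunks]

# Document form (as from create_documents()): each string wrapped in an object.
doc_chunks = [SimpleDocument(page_content=text) for text in text_chunks]

print(text_chunks[0])
print(doc_chunks[0].page_content)
```

Dict chunks are convenient when you need to access keys directly, while the string and Document forms suit pipelines that expect text.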
Let's explore how the RecursiveJsonSplitter handles different JSON structures and its limitations.
Verification of list object size
Parsing JSON structures
Using the convert_lists parameter for list transformation
By examining texts[2] (one of the larger chunks), we can confirm it contains a list object.
This chunk exceeds the size limit (300) because it contains a list.
The RecursiveJsonSplitter is designed not to split list objects.
# Let's check the size of the chunks
print([len(text) for text in texts][:10])
# When examining one of the larger chunks, we can see that it contains a list object
print(texts[2])
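Rather than inspecting chunks by eye, the same check can be automated. The sketch below is a standalone illustration with hypothetical helper names (`find_oversized`, `contains_list`) and invented sample strings; with real output you would pass `texts` and your own `max_chunk_size` instead.

```python
import json
from typing import Any


def find_oversized(chunks: list[str], max_chunk_size: int) -> list[tuple[int, int]]:
    """Return (index, length) for every chunk longer than max_chunk_size."""
    return [(i, len(c)) for i, c in enumerate(chunks) if len(c) > max_chunk_size]


def contains_list(node: Any) -> bool:
    """Report whether a parsed JSON value contains a list at any depth."""
    if isinstance(node, list):
        return True
    if isinstance(node, dict):
        return any(contains_list(value) for value in node.values())
    return False


# Invented sample chunks: one small, one oversized because of a list.
sample_texts = [
    json.dumps({"a": "b"}),
    json.dumps({"tags": list(range(30))}),
]

for index, length in find_oversized(sample_texts, max_chunk_size=50):
    has_list = contains_list(json.loads(sample_texts[index]))
    print(f"chunk {index}: {length} characters, contains list: {has_list}")
```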
Setting the convert_lists parameter to True transforms JSON lists into key:value pairs (formatted as index:item).
# The following preprocesses JSON and converts lists into dictionaries with index:item as key:value pairs
texts = splitter.split_text(json_data=json_data, convert_lists=True)
# The list has been converted to a dictionary; print the chunk to check the result.
print(texts[2])
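The index:item transformation can be sketched as a small recursive function. This is an illustration of the idea, not the splitter's own code: `lists_to_dicts` is a hypothetical name, and the use of string indices as keys is an assumption.

```python
from typing import Any


def lists_to_dicts(node: Any) -> Any:
    """Recursively replace every list with a dict keyed by item index."""
    if isinstance(node, dict):
        return {key: lists_to_dicts(value) for key, value in node.items()}
    if isinstance(node, list):
        # Assumption: indices become string keys ("0", "1", ...).
        return {str(i): lists_to_dicts(item) for i, item in enumerate(node)}
    return node


example = {"tags": ["a", {"b": [1]}]}
print(lists_to_dicts(example))
```

Once lists are dictionaries, a dict-based splitter can descend into them and distribute their items across chunks, which is why the converted chunks can stay within the size limit.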