Here's how to split Python code into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.PYTHON for the language parameter. It tells the splitter you're working with Python code.
Then, set chunk_size to 50. This limits the size of each resulting chunk to a maximum of 50 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

hello_world()
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)

# Create `Document` objects. They are returned as a list.
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs
# This section iterates through the list of documents created by the RecursiveCharacterTextSplitter
# and prints each document's content followed by a separator line for readability.
for doc in python_docs:
    print(doc.page_content, end="\n==================\n")
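If you're curious why the chunks break where they do, RecursiveCharacterTextSplitter exposes the separator preset it uses for each language through get_separators_for_language(). A minimal sketch (not part of the original example) that prints the Python preset:

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Show the separators the splitter tries, in order, for Python code:
# class and function boundaries first, then blank lines, lines, and words.
print(RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON))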
Here's how to split JavaScript code into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.JS for the language parameter. It tells the splitter you're working with JavaScript code.
Then, set chunk_size to 60. This limits the size of each resulting chunk to a maximum of 60 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
JS_CODE = """
function helloWorld() {
  console.log("Hello, World!");
}

helloWorld();
"""

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=60, chunk_overlap=0
)

# Create `Document` objects. They are returned as a list.
js_docs = js_splitter.create_documents([JS_CODE])
js_docs
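The Language enum used above covers far more than Python and JavaScript. A quick sketch (not in the original notebook) that lists every preset your installed version supports:

from langchain_text_splitters import Language

# Print the names of all languages with a built-in separator preset,
# e.g. 'python', 'js', 'markdown', 'latex', 'html', 'sol', 'csharp', ...
print([lang.value for lang in Language])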
Here's how to split Markdown text into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.MARKDOWN for the language parameter. It tells the splitter you're working with Markdown text.
Then, set chunk_size to 60. This limits the size of each resulting chunk to a maximum of 60 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
markdown_text = """
# 🦜️🔗 LangChain

⚡ Building applications with LLMs through composability ⚡

## What is LangChain?

# Hopefully this code block isn't split

LangChain is a framework for...

As an open-source project in a rapidly developing field, we are extremely open to contributions.
"""

md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN, chunk_size=60, chunk_overlap=0
)

md_docs = md_splitter.create_documents([markdown_text])
md_docs
[Document(metadata={}, page_content='# 🦜️🔗 LangChain'),
Document(metadata={}, page_content='⚡ Building applications with LLMs through composability ⚡'),
Document(metadata={}, page_content='## What is LangChain?'),
Document(metadata={}, page_content="# Hopefully this code block isn't split"),
Document(metadata={}, page_content='LangChain is a framework for...'),
Document(metadata={}, page_content='As an open-source project in a rapidly developing field, we'),
Document(metadata={}, page_content='are extremely open to contributions.')]
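If you only need the chunk strings rather than Document objects, the splitter also provides a split_text() method. A small variation on the example above (not in the original), reusing md_splitter and markdown_text:

# split_text() returns a plain list of strings instead of Document objects.
md_chunks = md_splitter.split_text(markdown_text)
for chunk in md_chunks:
    print(repr(chunk))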
LaTeX
LaTeX is a markup language for document creation, widely used for representing mathematical symbols and formulas.
Here's how to split LaTeX text into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.LATEX for the language parameter. It tells the splitter you're working with LaTeX text.
Then, set chunk_size to 60. This limits the size of each resulting chunk to a maximum of 60 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
latex_text = """
\documentclass{article}

\begin{document}

\maketitle

\section{Introduction}
Large language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.

\subsection{History of LLMs}
The earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.

\subsection{Applications of LLMs}
LLMs have many applications in industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.

\end{document}
"""

latex_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.LATEX, chunk_size=60, chunk_overlap=0
)

latex_docs = latex_splitter.create_documents([latex_text])
latex_docs
[Document(metadata={}, page_content='\\documentclass{article}\n\n\x08egin{document}\n\n\\maketitle'),
Document(metadata={}, page_content='\\section{Introduction}\nLarge language models (LLMs) are a'),
Document(metadata={}, page_content='type of machine learning model that can be trained on vast'),
Document(metadata={}, page_content='amounts of text data to generate human-like language. In'),
Document(metadata={}, page_content='recent years, LLMs have made significant advances in a'),
Document(metadata={}, page_content='variety of natural language processing tasks, including'),
Document(metadata={}, page_content='language translation, text generation, and sentiment'),
Document(metadata={}, page_content='analysis.'),
Document(metadata={}, page_content='\\subsection{History of LLMs}\nThe earliest LLMs were'),
Document(metadata={}, page_content='developed in the 1980s and 1990s, but they were limited by'),
Document(metadata={}, page_content='the amount of data that could be processed and the'),
Document(metadata={}, page_content='computational power available at the time. In the past'),
Document(metadata={}, page_content='decade, however, advances in hardware and software have'),
Document(metadata={}, page_content='made it possible to train LLMs on massive datasets, leading'),
Document(metadata={}, page_content='to significant improvements in performance.'),
Document(metadata={}, page_content='\\subsection{Applications of LLMs}\nLLMs have many'),
Document(metadata={}, page_content='applications in industry, including chatbots, content'),
Document(metadata={}, page_content='creation, and virtual assistants. They can also be used in'),
Document(metadata={}, page_content='academia for research in linguistics, psychology, and'),
Document(metadata={}, page_content='computational linguistics.\n\n\\end{document}')]
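Notice the \x08egin{document} in the first chunk: because latex_text is a regular (non-raw) string literal, Python interprets the \b in \begin as a backspace character before the splitter ever sees it. Declaring the source as a raw string avoids this; a minimal sketch (not in the original), reusing latex_splitter:

# A raw string (r"""...""") keeps backslash commands such as \begin intact.
latex_text_raw = r"""
\documentclass{article}

\begin{document}

\maketitle

\section{Introduction}
Large language models (LLMs) are a type of machine learning model.

\end{document}
"""

latex_docs_raw = latex_splitter.create_documents([latex_text_raw])
# The chunks now contain '\\begin{document}' instead of the backspace
# character shown as '\x08egin{document}' above.
latex_docs_raw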
HTML
Here's how to split HTML text into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.HTML for the language parameter. It tells the splitter you're working with HTML.
Then, set chunk_size to 60. This limits the size of each resulting chunk to a maximum of 60 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
html_text = """
<!DOCTYPE html>
<html>
    <head>
        <title>🦜️🔗 LangChain</title>
        <style>
            body {
                font-family: Arial, sans-serif;
            }
            h1 {
                color: darkblue;
            }
        </style>
    </head>
    <body>
        <div>
            <h1>🦜️🔗 LangChain</h1>
            <p>⚡ Building applications with LLMs through composability ⚡</p>
        </div>
        <div>
            As an open-source project in a rapidly developing field, we are extremely open to contributions.
        </div>
    </body>
</html>
"""

html_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.HTML, chunk_size=60, chunk_overlap=0
)

html_docs = html_splitter.create_documents([html_text])
html_docs
[Document(metadata={}, page_content='<!DOCTYPE html>\n<html>'),
Document(metadata={}, page_content='<head>\n        <title>🦜️🔗 LangChain</title>'),
Document(metadata={}, page_content='<style>\n            body {\n                font-family: Aria'),
Document(metadata={}, page_content='l, sans-serif;\n            }\n            h1 {'),
Document(metadata={}, page_content='color: darkblue;\n            }\n        </style>\n    </head'),
Document(metadata={}, page_content='>'),
Document(metadata={}, page_content='<body>'),
Document(metadata={}, page_content='<div>\n            <h1>🦜️🔗 LangChain</h1>'),
Document(metadata={}, page_content='<p>⚡ Building applications with LLMs through composability ⚡'),
Document(metadata={}, page_content='</p>\n        </div>'),
Document(metadata={}, page_content='<div>\n            As an open-source project in a rapidly dev'),
Document(metadata={}, page_content='eloping field, we are extremely open to contributions.'),
Document(metadata={}, page_content='</div>\n    </body>\n</html>')]
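create_documents() can also attach metadata to the chunks it produces. A small sketch (not in the original) that passes a hypothetical source name alongside html_text; the metadatas list is parallel to the list of input texts:

# Each dict in `metadatas` is copied onto every chunk produced from the
# corresponding input text. The source name here is made up for illustration.
html_docs_with_meta = html_splitter.create_documents(
    [html_text], metadatas=[{"source": "langchain_home.html"}]
)
html_docs_with_meta[0].metadata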
Solidity
Here's how to split Solidity code (stored as a string in the SOL_CODE variable) into smaller chunks by creating a RecursiveCharacterTextSplitter instance called sol_splitter.
First, specify Language.SOL for the language parameter. It tells the splitter you're working with Solidity code.
Then, set chunk_size to 128. This limits the size of each resulting chunk to a maximum of 128 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
The sol_splitter.create_documents() method splits the Solidity code (SOL_CODE) into chunks and stores them in the sol_docs variable.
Print or display the output (sol_docs) to verify the split.
SOL_CODE = """
pragma solidity ^0.8.20;

contract HelloWorld {
    function add(uint a, uint b) pure public returns(uint) {
        return a + b;
    }
}
"""

sol_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.SOL, chunk_size=128, chunk_overlap=0
)

sol_docs = sol_splitter.create_documents([SOL_CODE])
sol_docs
[Document(metadata={}, page_content='pragma solidity ^0.8.20;'),
Document(metadata={}, page_content='contract HelloWorld {\n    function add(uint a, uint b) pure public returns(uint) {\n        return a + b;\n    }\n}')]
C#
Here's how to split C# code into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.CSHARP for the language parameter. It tells the splitter you're working with C# code.
Then, set chunk_size to 128. This limits the size of each resulting chunk to a maximum of 128 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
C_CODE = """
using System;
class Program
{
    static void Main()
    {
        Console.WriteLine("Enter a number (1-5):");
        int input = Convert.ToInt32(Console.ReadLine());
        for (int i = 1; i <= input; i++)
        {
            if (i % 2 == 0)
            {
                Console.WriteLine($"{i} is even.");
            }
            else
            {
                Console.WriteLine($"{i} is odd.");
            }
        }
        Console.WriteLine("Goodbye!");
    }
}
"""

c_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.CSHARP, chunk_size=128, chunk_overlap=0
)

c_docs = c_splitter.create_documents([C_CODE])
c_docs
[Document(metadata={}, page_content='using System;'),
Document(metadata={}, page_content='class Program\n{\n    static void Main()\n    {\n        Console.WriteLine("Enter a number (1-5):");'),
Document(metadata={}, page_content='int input = Convert.ToInt32(Console.ReadLine());\n        for (int i = 1; i <= input; i++)\n        {'),
Document(metadata={}, page_content='if (i % 2 == 0)\n            {\n                Console.WriteLine($"{i} is even.");\n            }\n            else'),
Document(metadata={}, page_content='{\n                Console.WriteLine($"{i} is odd.");\n            }\n        }\n        Console.WriteLine("Goodbye!");'),
Document(metadata={}, page_content='}\n}')]
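For contrast, here's what a non-zero overlap does (not part of the original example): with chunk_overlap=32 the splitter may repeat up to 32 characters from the end of one chunk at the start of the next, which can help preserve context across chunk boundaries. The exact overlap still depends on where the C# separators fall.

# Same C# code, but allowing up to 32 characters of overlap between chunks.
c_splitter_overlap = RecursiveCharacterTextSplitter.from_language(
    language=Language.CSHARP, chunk_size=128, chunk_overlap=32
)
for doc in c_splitter_overlap.create_documents([C_CODE]):
    print(doc.page_content, end="\n==================\n")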
PHP
Here's how to split PHP code into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.PHP for the language parameter. It tells the splitter you're working with PHP code.
Then, set chunk_size to 50. This limits the size of each resulting chunk to a maximum of 50 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
PHP_CODE = """<?php
namespace foo;
class Hello {
    public function __construct() { }
}
function hello() {
    echo "Hello World!";
}
interface Human {
    public function breath();
}
trait Foo { }
enum Color
{
    case Red;
    case Blue;
}
"""

php_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PHP, chunk_size=50, chunk_overlap=0
)

php_docs = php_splitter.create_documents([PHP_CODE])
php_docs
Kotlin
Here's how to split Kotlin code into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.KOTLIN for the language parameter. It tells the splitter you're working with Kotlin code.
Then, set chunk_size to 100. This limits the size of each resulting chunk to a maximum of 100 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
KOTLIN_CODE = """
fun main() {
    val directoryPath = System.getProperty("user.dir")
    val files = File(directoryPath).listFiles()?.filter { !it.isDirectory }?.sortedBy { it.lastModified() } ?: emptyArray()

    files.forEach { file ->
        println("Name: ${file.name} | Last Write Time: ${file.lastModified()}")
    }
}
"""

kotlin_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.KOTLIN, chunk_size=100, chunk_overlap=0
)

kotlin_docs = kotlin_splitter.create_documents([KOTLIN_CODE])
kotlin_docs
[Document(metadata={}, page_content='fun main() {\n    val directoryPath = System.getProperty("user.dir")'),
Document(metadata={}, page_content='val files = File(directoryPath).listFiles()?.filter { !it.isDirectory }?.sortedBy {'),
Document(metadata={}, page_content='it.lastModified() } ?: emptyArray()'),
Document(metadata={}, page_content='files.forEach { file ->'),
Document(metadata={}, page_content='println("Name: ${file.name} | Last Write Time: ${file.lastModified()}")\n    }\n}')]
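Kotlin is the last preset shown here, but the same pattern applies to any Language value. For a language without a built-in preset, you can fall back to the plain RecursiveCharacterTextSplitter constructor and supply your own separators. A minimal sketch (not in the original; the separator list below is illustrative, not an official preset):

# Generic splitter with hand-picked separators: paragraphs, lines, words, characters.
custom_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=100,
    chunk_overlap=0,
)

custom_docs = custom_splitter.create_documents([KOTLIN_CODE])
custom_docs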