Split Code with LangChain
Author: Jongcheol Kim
Peer Review: kofsitho87, teddylee777
Proofread: Chaeyoon Kim
This is part of the LangChain Open Tutorial
Overview
RecursiveCharacterTextSplitter includes pre-built separator lists optimized for splitting text in different programming languages.
The CodeTextSplitter provides even more specialized functionality for splitting code.
To use it, import the Language enum (enumeration) and specify the desired programming language.
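As a quick preview, the typical pattern looks like the minimal sketch below: pick a Language value, build a splitter with RecursiveCharacterTextSplitter.from_language, and split a code string. The sample_code string here is only an illustrative placeholder; the sections below walk through real examples for each language.
# Minimal sketch of the pattern used throughout this tutorial.
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

sample_code = "def greet():\n    print('hi')\n\ngreet()\n"  # illustrative snippet only

splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,  # use the Python-specific separators
    chunk_size=50,             # maximum characters per chunk
    chunk_overlap=0,           # no shared characters between chunks
)
print(splitter.split_text(sample_code))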
Table of Contents
References
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorial is a package that provides a set of easy-to-use environment setup, useful functions, and utilities for tutorials. You can check out the langchain-opentutorial package for more details.
%%capture --no-stderr
%pip install langchain-opentutorial
# Install required packages
from langchain_opentutorial import package
package.install(
[
"langchain_text_splitters",
],
verbose=False,
upgrade=False,
)
# Set environment variables
from langchain_opentutorial import set_env
set_env(
{
"OPENAI_API_KEY": "",
"LANGCHAIN_API_KEY": "",
"LANGCHAIN_TRACING_V2": "true",
"LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
"LANGCHAIN_PROJECT": "Code-Splitter",
}
)
Environment variables have been set successfully.
from dotenv import load_dotenv
load_dotenv()
True
Code Splitter Examples
Here is an example of splitting text using the RecursiveCharacterTextSplitter.
Import the Language and RecursiveCharacterTextSplitter classes from the langchain_text_splitters module.
RecursiveCharacterTextSplitter is a text splitter that recursively splits text at the character level.
from langchain_text_splitters import (
Language,
RecursiveCharacterTextSplitter,
)
Supported languages are stored in the langchain_text_splitters.Language enum.
API Reference: Language | RecursiveCharacterTextSplitter
See below for the full list of supported languages.
# Get the full list of supported languages.
[e.value for e in Language]
['cpp',
'go',
'java',
'kotlin',
'js',
'ts',
'php',
'proto',
'python',
'rst',
'ruby',
'rust',
'scala',
'swift',
'markdown',
'latex',
'html',
'sol',
'csharp',
'cobol',
'c',
'lua',
'perl',
'haskell',
'elixir',
'powershell']
You can use the get_separators_for_language method of the RecursiveCharacterTextSplitter class to see the separators used for a given language.
For example, passing Language.PYTHON retrieves the separators used for Python:
# You can check the separators used for the given language.
RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)
['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']
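You can call the same method for any other supported language to compare the separator lists; here is a small sketch (outputs omitted here):
# Inspect the language-specific separator lists for a few more languages.
for lang in (Language.JS, Language.MARKDOWN, Language.HTML):
    print(lang.value, RecursiveCharacterTextSplitter.get_separators_for_language(lang))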
Python
Here's how to split Python code into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.PYTHON for the language parameter. It tells the splitter you're working with Python code.
Then, set chunk_size to 50. This limits the size of each resulting chunk to a maximum of 50 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

hello_world()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
# Create `Document` objects; the splitter returns them as a list.
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs
[Document(metadata={}, page_content='def hello_world():\n print("Hello, World!")'),
Document(metadata={}, page_content='hello_world()')]
# This section iterates through the list of documents created by the RecursiveCharacterTextSplitter
# and prints each document's content followed by a separator line for readability.
for doc in python_docs:
print(doc.page_content, end="\n==================\n")
def hello_world():
print("Hello, World!")
==================
hello_world()
==================
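If you only need plain string chunks rather than Document objects, the same splitter also exposes a split_text method; a minimal sketch reusing python_splitter from above:
# `split_text` returns the chunks as plain strings instead of `Document` objects.
for chunk in python_splitter.split_text(PYTHON_CODE):
    print(repr(chunk))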
JavaScript
Here's how to split JavaScript code into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.JS for the language parameter. It tells the splitter you're working with JavaScript code.
Then, set chunk_size to 60. This limits the size of each resulting chunk to a maximum of 60 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
JS_CODE = """
function helloWorld() {
  console.log("Hello, World!");
}

helloWorld();
"""
js_splitter = RecursiveCharacterTextSplitter.from_language(
language=Language.JS, chunk_size=60, chunk_overlap=0
)
# Create `Document` objects; the splitter returns them as a list.
js_docs = js_splitter.create_documents([JS_CODE])
js_docs
[Document(metadata={}, page_content='function helloWorld() {\n console.log("Hello, World!");\n}'),
Document(metadata={}, page_content='helloWorld();')]
TypeScript
Here's how to split TypeScript code into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.TS for the language parameter. It tells the splitter you're working with TypeScript code.
Then, set chunk_size to 60. This limits the size of each resulting chunk to a maximum of 60 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
TS_CODE = """
function helloWorld(): void {
  console.log("Hello, World!");
}

helloWorld();
"""
ts_splitter = RecursiveCharacterTextSplitter.from_language(
language=Language.TS, chunk_size=60, chunk_overlap=0
)
ts_docs = ts_splitter.create_documents([TS_CODE])
ts_docs
[Document(metadata={}, page_content='function helloWorld(): void {'),
Document(metadata={}, page_content='console.log("Hello, World!");\n}'),
Document(metadata={}, page_content='helloWorld();')]
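To see what a non-zero chunk_overlap does, you can re-split the same TypeScript code with chunk_overlap=20. This is just an illustrative sketch; whether adjacent chunks actually share text depends on where the separators fall.
# Re-split TS_CODE with a 20-character overlap; adjacent chunks may repeat
# some characters at their boundaries.
ts_splitter_overlap = RecursiveCharacterTextSplitter.from_language(
    language=Language.TS, chunk_size=60, chunk_overlap=20
)
for doc in ts_splitter_overlap.create_documents([TS_CODE]):
    print(repr(doc.page_content))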
Markdown
Here's how to split Markdown text into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.MARKDOWN for the language parameter. It tells the splitter you're working with Markdown text.
Then, set chunk_size to 60. This limits the size of each resulting chunk to a maximum of 60 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
markdown_text = """
# 🦜️🔗 LangChain
⚡ Building applications with LLMs through composability ⚡
## What is LangChain?
# Hopefully this code block isn't split
LangChain is a framework for...
As an open-source project in a rapidly developing field, we are extremely open to contributions.
"""
md_splitter = RecursiveCharacterTextSplitter.from_language(
language=Language.MARKDOWN,
chunk_size=60,
chunk_overlap=0,
)
md_docs = md_splitter.create_documents([markdown_text])
md_docs
[Document(metadata={}, page_content='# 🦜️🔗 LangChain'),
Document(metadata={}, page_content='⚡ Building applications with LLMs through composability ⚡'),
Document(metadata={}, page_content='## What is LangChain?'),
Document(metadata={}, page_content="# Hopefully this code block isn't split"),
Document(metadata={}, page_content='LangChain is a framework for...'),
Document(metadata={}, page_content='As an open-source project in a rapidly developing field, we'),
Document(metadata={}, page_content='are extremely open to contributions.')]
LaTeX
LaTeX is a markup language for document creation, widely used for representing mathematical symbols and formulas.
Here's how to split LaTeX text into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.LATEX for the language parameter. It tells the splitter you're working with LaTeX text.
Then, set chunk_size to 60. This limits the size of each resulting chunk to a maximum of 60 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
latex_text = """
\documentclass{article}
\begin{document}
\maketitle
\section{Introduction}
Large language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.
\subsection{History of LLMs}
The earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.
\subsection{Applications of LLMs}
LLMs have many applications in industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.
\end{document}
"""
latex_splitter = RecursiveCharacterTextSplitter.from_language(
language=Language.LATEX,
chunk_size=60,
chunk_overlap=0,
)
latex_docs = latex_splitter.create_documents([latex_text])
latex_docs
[Document(metadata={}, page_content='\\documentclass{article}\n\n\x08egin{document}\n\n\\maketitle'),
Document(metadata={}, page_content='\\section{Introduction}\nLarge language models (LLMs) are a'),
Document(metadata={}, page_content='type of machine learning model that can be trained on vast'),
Document(metadata={}, page_content='amounts of text data to generate human-like language. In'),
Document(metadata={}, page_content='recent years, LLMs have made significant advances in a'),
Document(metadata={}, page_content='variety of natural language processing tasks, including'),
Document(metadata={}, page_content='language translation, text generation, and sentiment'),
Document(metadata={}, page_content='analysis.'),
Document(metadata={}, page_content='\\subsection{History of LLMs}\nThe earliest LLMs were'),
Document(metadata={}, page_content='developed in the 1980s and 1990s, but they were limited by'),
Document(metadata={}, page_content='the amount of data that could be processed and the'),
Document(metadata={}, page_content='computational power available at the time. In the past'),
Document(metadata={}, page_content='decade, however, advances in hardware and software have'),
Document(metadata={}, page_content='made it possible to train LLMs on massive datasets, leading'),
Document(metadata={}, page_content='to significant improvements in performance.'),
Document(metadata={}, page_content='\\subsection{Applications of LLMs}\nLLMs have many'),
Document(metadata={}, page_content='applications in industry, including chatbots, content'),
Document(metadata={}, page_content='creation, and virtual assistants. They can also be used in'),
Document(metadata={}, page_content='academia for research in linguistics, psychology, and'),
Document(metadata={}, page_content='computational linguistics.\n\n\\end{document}')]
HTML
Here's how to split HTML text into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.HTML for the language parameter. It tells the splitter you're working with HTML.
Then, set chunk_size to 60. This limits the size of each resulting chunk to a maximum of 60 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
html_text = """
<!DOCTYPE html>
<html>
    <head>
        <title>🦜️🔗 LangChain</title>
        <style>
            body {
                font-family: Arial, sans-serif;
            }
            h1 {
                color: darkblue;
            }
        </style>
    </head>
    <body>
        <div>
            <h1>🦜️🔗 LangChain</h1>
            <p>⚡ Building applications with LLMs through composability ⚡</p>
        </div>
        <div>
            As an open-source project in a rapidly developing field, we are extremely open to contributions.
        </div>
    </body>
</html>
"""
html_splitter = RecursiveCharacterTextSplitter.from_language(
language=Language.HTML, chunk_size=60, chunk_overlap=0
)
html_docs = html_splitter.create_documents([html_text])
html_docs
[Document(metadata={}, page_content='<!DOCTYPE html>\n<html>'),
 Document(metadata={}, page_content='<head>\n <title>🦜️🔗 LangChain</title>'),
 Document(metadata={}, page_content='<style>\n body {\n font-family: Aria'),
 Document(metadata={}, page_content='l, sans-serif;\n }\n h1 {'),
 Document(metadata={}, page_content='color: darkblue;\n }\n </style>\n </head'),
 Document(metadata={}, page_content='>'),
 Document(metadata={}, page_content='<body>'),
 Document(metadata={}, page_content='<div>\n <h1>🦜️🔗 LangChain</h1>'),
 Document(metadata={}, page_content='<p>⚡ Building applications with LLMs through composability ⚡'),
 Document(metadata={}, page_content='</p>\n </div>'),
 Document(metadata={}, page_content='<div>\n As an open-source project in a rapidly dev'),
 Document(metadata={}, page_content='eloping field, we are extremely open to contributions.'),
 Document(metadata={}, page_content='</div>\n </body>\n</html>')]
Solidity
Here's how to split Solidity code (stored as a string in the SOL_CODE variable) into smaller chunks by creating a RecursiveCharacterTextSplitter instance called sol_splitter to handle the splitting.
First, specify Language.SOL for the language parameter. It tells the splitter you're working with Solidity code.
Then, set chunk_size to 128. This limits the size of each resulting chunk to a maximum of 128 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
The sol_splitter.create_documents() method splits the Solidity code (SOL_CODE) into chunks and stores them in the sol_docs variable.
Print or display the output (sol_docs) to verify the split.
SOL_CODE = """
pragma solidity ^0.8.20;
contract HelloWorld {
    function add(uint a, uint b) pure public returns(uint) {
        return a + b;
    }
}
"""

sol_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.SOL, chunk_size=128, chunk_overlap=0
)
sol_docs = sol_splitter.create_documents([SOL_CODE])
sol_docs
[Document(metadata={}, page_content='pragma solidity ^0.8.20;'),
 Document(metadata={}, page_content='contract HelloWorld { \n function add(uint a, uint b) pure public returns(uint) {\n return a + b;\n }\n}')]
C#
Here's how to split C# code into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.CSHARP for the language parameter. It tells the splitter you're working with C# code.
Then, set chunk_size to 128. This limits the size of each resulting chunk to a maximum of 128 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
C_CODE = """
using System;
class Program
{
    static void Main()
    {
        Console.WriteLine("Enter a number (1-5):");
        int input = Convert.ToInt32(Console.ReadLine());
        for (int i = 1; i <= input; i++)
        {
            if (i % 2 == 0)
            {
                Console.WriteLine($"{i} is even.");
            }
            else
            {
                Console.WriteLine($"{i} is odd.");
            }
        }
        Console.WriteLine("Goodbye!");
    }
}
"""

c_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.CSHARP, chunk_size=128, chunk_overlap=0
)
c_docs = c_splitter.create_documents([C_CODE])
c_docs
[Document(metadata={}, page_content='using System;'),
 Document(metadata={}, page_content='class Program\n{\n static void Main()\n {\n Console.WriteLine("Enter a number (1-5):");'),
 Document(metadata={}, page_content='int input = Convert.ToInt32(Console.ReadLine());\n for (int i = 1; i <= input; i++)\n {'),
 Document(metadata={}, page_content='if (i % 2 == 0)\n {\n Console.WriteLine($"{i} is even.");\n }\n else'),
 Document(metadata={}, page_content='{\n Console.WriteLine($"{i} is odd.");\n }\n }\n Console.WriteLine("Goodbye!");'),
 Document(metadata={}, page_content='}\n}')]
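create_documents also accepts an optional metadatas argument, so each input string can be tagged with metadata (for example, a file name) that is copied onto every resulting chunk. Here is a minimal sketch reusing c_splitter; the "Program.cs" value is a hypothetical placeholder.
# Attach metadata to the input text; it is propagated to each resulting chunk.
c_docs_with_meta = c_splitter.create_documents(
    [C_CODE],
    metadatas=[{"source": "Program.cs"}],  # hypothetical file name for illustration
)
print(c_docs_with_meta[0].metadata)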
PHP
Here's how to split PHP code into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.PHP for the language parameter. It tells the splitter you're working with PHP code.
Then, set chunk_size to 50. This limits the size of each resulting chunk to a maximum of 50 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
PHP_CODE = """<?php
namespace foo;
class Hello {
    public function __construct() { }
}
function hello() {
    echo "Hello World!";
}
interface Human {
    public function breath();
}
trait Foo { }
enum Color
{
    case Red;
    case Blue;
}"""

php_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PHP, chunk_size=50, chunk_overlap=0
)
php_docs = php_splitter.create_documents([PHP_CODE])
php_docs
[Document(metadata={}, page_content='<?php\nnamespace foo;'),
 Document(metadata={}, page_content='class Hello {'),
 Document(metadata={}, page_content='public function __construct() { }\n}'),
 Document(metadata={}, page_content='function hello() {\n echo "Hello World!";\n}'),
 Document(metadata={}, page_content='interface Human {\n public function breath();\n}'),
 Document(metadata={}, page_content='trait Foo { }'),
 Document(metadata={}, page_content='enum Color\n{\n case Red;\n case Blue;\n}')]
Kotlin
Here's how to split Kotlin code into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.KOTLIN for the language parameter. It tells the splitter you're working with Kotlin code.
Then, set chunk_size to 100. This limits the size of each resulting chunk to a maximum of 100 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
KOTLIN_CODE = """
fun main() {
    val directoryPath = System.getProperty("user.dir")
    val files = File(directoryPath).listFiles()?.filter { !it.isDirectory }?.sortedBy { it.lastModified() } ?: emptyArray()

    files.forEach { file ->
        println("Name: ${file.name} | Last Write Time: ${file.lastModified()}")
    }
}
"""

kotlin_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.KOTLIN, chunk_size=100, chunk_overlap=0
)
kotlin_docs = kotlin_splitter.create_documents([KOTLIN_CODE])
kotlin_docs
[Document(metadata={}, page_content='fun main() {\n val directoryPath = System.getProperty("user.dir")'),
 Document(metadata={}, page_content='val files = File(directoryPath).listFiles()?.filter { !it.isDirectory }?.sortedBy {'),
 Document(metadata={}, page_content='it.lastModified() } ?: emptyArray()'),
 Document(metadata={}, page_content='files.forEach { file ->'),
 Document(metadata={}, page_content='println("Name: ${file.name} | Last Write Time: ${file.lastModified()}")\n }\n}')]
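In practice, you would usually read source files from disk and split their contents the same way. Here is a hedged sketch; the example.py path is a hypothetical placeholder.
from pathlib import Path

# Hypothetical example: split a Python source file read from disk.
source_path = Path("example.py")  # placeholder path for illustration
if source_path.exists():
    file_splitter = RecursiveCharacterTextSplitter.from_language(
        language=Language.PYTHON, chunk_size=200, chunk_overlap=0
    )
    file_docs = file_splitter.create_documents(
        [source_path.read_text(encoding="utf-8")],
        metadatas=[{"source": str(source_path)}],
    )
    print(f"{len(file_docs)} chunks from {source_path}")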