Split code with LangChain
Author: Jongcheol Kim
Peer Review: kofsitho87, teddylee777
Proofread: Chaeyoon Kim
This is a part of LangChain Open Tutorial
Overview
RecursiveCharacterTextSplitter includes pre-built separator lists optimized for splitting text in different programming languages.
The CodeTextSplitter provides even more specialized functionality for splitting code.
To use it, import the Language enum (enumeration) and specify the desired programming language.
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorial is a package that provides easy-to-use environment setup helpers, useful functions, and utilities for the tutorials. See langchain-opentutorial for more details.
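As a minimal sketch of a setup check (not the tutorial's own setup code), the snippet below tests whether the module used in the rest of this page can be imported. The pip package name in the message, langchain-text-splitters, is an assumption about how the module is distributed.

```python
import importlib.util

# Import name of the module that provides the splitters used in this tutorial.
REQUIRED_MODULES = ["langchain_text_splitters"]

# find_spec returns None when a top-level module cannot be located.
missing = [name for name in REQUIRED_MODULES if importlib.util.find_spec(name) is None]

if missing:
    # Assumed install command; adjust to your environment or package manager.
    print("Missing modules:", missing, "(try: pip install langchain-text-splitters)")
else:
    print("All required modules are available.")
```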
Code Splitter Examples
Here is an example of splitting text using the RecursiveCharacterTextSplitter.
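Import the Language and RecursiveCharacterTextSplitter classes from the langchain_text_splitters module.
RecursiveCharacterTextSplitter is a text splitter that splits text recursively: it tries each separator in an ordered list and falls back to the next one whenever a piece is still larger than the chunk size.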
Supported languages are stored in the langchain_text_splitters.Language enum.
API Reference: Language | RecursiveCharacterTextSplitter
See below for the full list of supported languages.
You can use the get_separators_for_language method of the RecursiveCharacterTextSplitter class to see the separators used for a given language.
For example, passing Language.PYTHON retrieves the separators used for Python:
Python
Here's how to split Python code into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.PYTHON for the language parameter. It tells the splitter you're working with Python code.
Then, set chunk_size to 50. This limits the size of each resulting chunk to a maximum of 50 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
JavaScript
Here's how to split JavaScript code into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.JS for the language parameter. It tells the splitter you're working with JavaScript code.
Then, set chunk_size to 60. This limits the size of each resulting chunk to a maximum of 60 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
TypeScript
Here's how to split TypeScript code into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.TS for the language parameter. It tells the splitter you're working with TypeScript code.
Then, set chunk_size to 60. This limits the size of each resulting chunk to a maximum of 60 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
Markdown
Here's how to split Markdown text into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.MARKDOWN for the language parameter. It tells the splitter you're working with Markdown text.
Then, set chunk_size to 60. This limits the size of each resulting chunk to a maximum of 60 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
LaTeX
LaTeX is a markup language for document creation, widely used for representing mathematical symbols and formulas.
Here's how to split LaTeX text into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.LATEX for the language parameter. It tells the splitter you're working with LaTeX text.
Then, set chunk_size to 60. This limits the size of each resulting chunk to a maximum of 60 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
HTML
Here's how to split HTML text into smaller chunks using the RecursiveCharacterTextSplitter.
First, specify Language.HTML for the language parameter. It tells the splitter you're working with HTML.
Then, set chunk_size to 60. This limits the size of each resulting chunk to a maximum of 60 characters.
Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.