Split code with LangChain


Overview

RecursiveCharacterTextSplitter includes pre-built separator lists optimized for splitting text in different programming languages.

The CodeTextSplitter builds on these lists to split source code along language-specific boundaries, such as class and function definitions.

To use it, import the Language enum (enumeration) and specify the desired programming language.


Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials.

  • You can check out the langchain-opentutorial for more details.

Code Splitter Examples

Here is an example of splitting text using the RecursiveCharacterTextSplitter.

  • Import the Language and RecursiveCharacterTextSplitter classes from the langchain_text_splitters module.

  • RecursiveCharacterTextSplitter is a text splitter that works through a list of separators in order, moving to the next separator only for pieces that are still larger than the chunk size.

Supported languages are stored in the langchain_text_splitters.Language enum.

API Reference: Language | RecursiveCharacterTextSplitter

See below for the full list of supported languages.

You can use the get_separators_for_language method of the RecursiveCharacterTextSplitter class to see the separators used for a given language.

  • For example, passing Language.PYTHON retrieves the separators used for Python:

Python

Here's how to split Python code into smaller chunks using the RecursiveCharacterTextSplitter.

  • First, specify Language.PYTHON for the language parameter. It tells the splitter you're working with Python code.

  • Then, set chunk_size to 50. This limits the size of each resulting chunk to a maximum of 50 characters.

  • Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.

JavaScript

Here's how to split JavaScript code into smaller chunks using the RecursiveCharacterTextSplitter.

  • First, specify Language.JS for the language parameter. It tells the splitter you're working with JavaScript code.

  • Then, set chunk_size to 60. This limits the size of each resulting chunk to a maximum of 60 characters.

  • Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.

TypeScript

Here's how to split TypeScript code into smaller chunks using the RecursiveCharacterTextSplitter.

  • First, specify Language.TS for the language parameter. It tells the splitter you're working with TypeScript code.

  • Then, set chunk_size to 60. This limits the size of each resulting chunk to a maximum of 60 characters.

  • Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.

Markdown

Here's how to split Markdown text into smaller chunks using the RecursiveCharacterTextSplitter.

  • First, specify Language.MARKDOWN for the language parameter. It tells the splitter you're working with Markdown text.

  • Then, set chunk_size to 60. This limits the size of each resulting chunk to a maximum of 60 characters.

  • Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.

LaTeX

LaTeX is a markup language for document creation, widely used for representing mathematical symbols and formulas.

Here's how to split LaTeX text into smaller chunks using the RecursiveCharacterTextSplitter.

  • First, specify Language.LATEX for the language parameter. It tells the splitter you're working with LaTeX text.

  • Then, set chunk_size to 60. This limits the size of each resulting chunk to a maximum of 60 characters.

  • Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.

HTML

Here's how to split HTML text into smaller chunks using the RecursiveCharacterTextSplitter.

  • First, specify Language.HTML for the language parameter. It tells the splitter you're working with HTML.

  • Then, set chunk_size to 60. This limits the size of each resulting chunk to a maximum of 60 characters.

  • Finally, set chunk_overlap to 0. It prevents any of the chunks from overlapping.
