Search Optimization: Enables more precise chunk-level retrieval
Memory Efficiency: Processes large documents in smaller, manageable chunks
Context Preservation: Maintains textual coherence through chunk_overlap
This tutorial walks through the practical implementation of text splitting with the core methods split_text() and create_documents(), and also covers metadata handling.
# Set environment variables
from langchain_opentutorial import set_env
set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Adaptive-RAG",  # Set this to match the tutorial title
    }
)
Environment variables have been set successfully.
You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.
[Note] This is not necessary if you've already set the required API keys in previous steps.
# Load API keys from .env file
from dotenv import load_dotenv
load_dotenv(override=True)
False
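The False printed above is the return value of load_dotenv(), which is True only when at least one variable was loaded, so False typically means no .env file was found or it was empty. A minimal sketch for confirming that the required keys are actually present follows; missing_keys is a hypothetical helper written for this illustration, not part of python-dotenv or LangChain.

```python
import os


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Check the key this tutorial relies on.
missing = missing_keys(["OPENAI_API_KEY"])
if missing:
    print("Missing keys:", ", ".join(missing))
else:
    print("All required keys are set.")
```

Passing a plain dictionary as env makes the check easy to try without touching the real environment.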
CharacterTextSplitter Example
Read and store contents from keywords file
Open the ./data/appendix-keywords.txt file and read its contents.
Store the contents in the file variable.
with open("./data/appendix-keywords.txt", encoding="utf-8") as f:
    file = f.read()
Print the first 500 characters of the file contents.
print(file[:500])
Semantic Search
Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization
Embedding
Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to unders
Create CharacterTextSplitter with parameters:
Parameters
separator: String to split text on (e.g., newlines, spaces, custom delimiters)
chunk_size: Maximum size of chunks to return
chunk_overlap: Overlap in characters between chunks
length_function: Function that measures the length of given chunks
is_separator_regex: Boolean indicating whether separator should be treated as a regex pattern
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator=" ",  # Split wherever a space occurs in the text
    chunk_size=250,  # Each chunk contains at most 250 characters
    chunk_overlap=50,  # Consecutive chunks share up to 50 characters
    length_function=len,  # Measure chunk length in characters
    is_separator_regex=False,  # Treat the separator as a literal string, not a regex
)
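To make the interaction between separator, chunk_size, and chunk_overlap concrete, here is a simplified pure-Python sketch of greedy splitting with overlap. It is an illustration only, not LangChain's implementation; split_with_overlap is a hypothetical function, and the small chunk_size and chunk_overlap values are chosen just to keep the example readable.

```python
def split_with_overlap(text, separator=" ", chunk_size=20, chunk_overlap=5):
    """Greedy splitter sketch (hypothetical; not LangChain's actual code)."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def length(parts):
        # Length of the parts once re-joined with the separator.
        return len(separator.join(parts))

    for piece in pieces:
        # Emit the current chunk if adding this piece would exceed chunk_size.
        if current and length(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Keep only a tail of at most chunk_overlap characters that
            # still leaves room for the incoming piece.
            while current and (
                length(current) > chunk_overlap
                or length(current + [piece]) > chunk_size
            ):
                current = current[1:]
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


demo = split_with_overlap("one two three four five six seven eight nine ten")
print(demo)
# ['one two three four', 'four five six seven', 'seven eight nine ten']
```

Each chunk stays within chunk_size, and the word repeated at the start of the next chunk is the overlap. Because splitting happens only at the separator, the shared portion can be shorter than chunk_overlap.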
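Create document objects from the chunks with create_documents() and display the first one.
# Split the text into Document objects and print the first chunk
# (chunks is an illustrative variable name)
chunks = text_splitter.create_documents([file])
print(chunks[0])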
page_content='Semantic Search
Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick'
Demonstrate metadata handling during document creation:
create_documents accepts both text data and metadata lists
Each chunk inherits metadata from its source document
# Define metadata for each document
metadatas = [
    {"document": 1},
    {"document": 2},
]
# Create documents with metadata
documents = text_splitter.create_documents(
    [file, file],  # List of texts to split
    metadatas=metadatas,  # Corresponding metadata
)
print(documents[0]) # Display first document with metadata
page_content='Semantic Search
Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick' metadata={'document': 1}
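The way each chunk inherits metadata from its source text can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's code: Doc is a hypothetical stand-in for LangChain's Document class, and create_docs and split_on_blank_lines are hypothetical helpers that split on blank lines instead of using a real text splitter.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a Document with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_on_blank_lines(text):
    # Hypothetical splitter used only for this illustration.
    return [part for part in text.split("\n\n") if part]


def create_docs(texts, metadatas=None, split=split_on_blank_lines):
    """Pair every chunk with a copy of its source text's metadata."""
    metadatas = metadatas or [{} for _ in texts]
    if len(metadatas) != len(texts):
        raise ValueError("texts and metadatas must have the same length")
    docs = []
    for text, meta in zip(texts, metadatas):
        for chunk in split(text):
            # Copy so that editing one chunk's metadata does not affect others.
            docs.append(Doc(chunk, copy.deepcopy(meta)))
    return docs


docs = create_docs(["a\n\nb", "c"], metadatas=[{"document": 1}, {"document": 2}])
print([(d.page_content, d.metadata) for d in docs])
# [('a', {'document': 1}), ('b', {'document': 1}), ('c', {'document': 2})]
```

The metadata list is matched to the text list by position, which is why both lists need the same length.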
Split text using the split_text() method.
split_text() returns a list of plain strings rather than Document objects, so text_splitter.split_text(file)[0] is the first chunk of the split text.
# Split the file text and return the first chunk
text_splitter.split_text(file)[0]
'Semantic Search\n\nDefinition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\nExample: Vectors of word embeddings can be stored in a database for quick'