This tutorial explores the use of kiwipiepy for Korean morphological analysis and demonstrates its integration within the LangChain framework. It covers Korean text tokenization and compares how different retrievers perform across various setups.
Since this tutorial covers Korean morphological analysis, the output primarily contains Korean text, reflecting the language structure being analyzed. For international users, we provide English translations alongside Korean examples.
from kiwipiepy import Kiwi
kiwi = Kiwi()  # Kiwi morphological analyzer (may already be initialized earlier in the tutorial)

# Define a tokenization function that returns morpheme surface forms
def kiwi_tokenize(text):
    return [token.form for token in kiwi.tokenize(text)]
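As a quick check, the tokenizer can be called directly on a Korean sentence. The sentence and the morpheme split shown below are illustrative only; the exact output may vary slightly across kiwipiepy versions.
# Illustrative example: "안녕하세요" ("hello") is split into stem and endings
print(kiwi_tokenize("안녕하세요, 오늘 날씨가 좋네요"))
# e.g. ['안녕', '하', '세요', ',', '오늘', '날씨', '가', '좋', '네요'] (exact output may vary)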
Comparing Search Results Using Different Retrievers
In this section, we compare how different retrieval methods rank documents when given the same query. We are using:
BM25: A traditional ranking function based on term frequency (TF) and inverse document frequency (IDF); a simplified scoring sketch follows this list.
Kiwi BM25: BM25 with an added benefit of kiwipiepy tokenization, enabling more accurate splitting of Korean words into morphemes (especially important for Korean queries).
FAISS: A vector-based retriever using embeddings (in this case, OpenAIEmbeddings). It captures semantic similarity, so it's less reliant on exact keyword matches and more on meaning.
Ensemble: A combination of BM25 (or Kiwi BM25) and FAISS, weighted to leverage both the lexical matching strengths of BM25 and the semantic understanding of FAISS.
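To make the BM25 idea concrete, here is a minimal scoring sketch over pre-tokenized documents. This is a standard simplified form of BM25, not the exact implementation behind LangChain's BM25Retriever (which relies on the rank_bm25 package); k1 and b are the usual BM25 hyperparameters.
import math

# Simplified BM25 score of one document for a query (illustrative only)
def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    avgdl = sum(len(d) for d in corpus) / len(corpus)  # average document length
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)                # term frequency in this document
        df = sum(1 for d in corpus if term in d)  # number of documents containing the term
        idf = math.log((len(corpus) - df + 0.5) / (df + 0.5) + 1)  # smoothed inverse document frequency
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score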
Key Points of Comparison
Exact Keyword Matching vs. Semantic Matching
BM25 and Kiwi BM25 excel at finding documents that share exact terms or closely related morphological variants.
FAISS retrieves documents that may not have exact lexical overlap but are semantically similar (e.g., synonyms or paraphrases).
Impact of Korean Morphological Analysis
Korean often merges stems and endings into single words (e.g., "안녕하세요" ("hello") → "안녕 + 하 + 세요"). Kiwi BM25 handles this by splitting the query and documents more precisely.
This can yield more relevant results when dealing with conjugated verbs, particles, or compound nouns (see the tokenization comparison sketch after this list).
Ensemble Approaches
By combining lexical (BM25) and semantic (FAISS) retrievers, we can produce a more balanced set of results.
The weighting (e.g., 70:30 or 30:70) can be tuned to emphasize one aspect over the other.
Using MMR (Maximal Marginal Relevance) ensures diversity in the retrieved results, reducing redundancy.
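Before wiring up the retrievers, the sketch below (referenced above) contrasts a plain whitespace split with the Kiwi morpheme split. The sentence is hypothetical and the exact morphemes may vary by kiwipiepy version, but particles and endings such as "은" are typically separated from their stems.
# Hypothetical sentence: "금융보험은 장기적인 자산 관리를 목적으로 한다"
# ("Financial insurance aims at long-term asset management")
sample = "금융보험은 장기적인 자산 관리를 목적으로 한다"
print(sample.split())        # whitespace split keeps particles attached, e.g. '금융보험은'
print(kiwi_tokenize(sample)) # morpheme split separates stems and particles, e.g. '금융', '보험', '은'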
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
# Initialize BM25 retriever using raw documents
bm25 = BM25Retriever.from_documents(docs)
# Initialize BM25 retriever with a custom preprocessing function (e.g., Kiwi tokenizer)
kiwi_bm25 = BM25Retriever.from_documents(docs, preprocess_func=kiwi_tokenize)
# Initialize FAISS retriever with OpenAI embeddings
faiss = FAISS.from_documents(docs, OpenAIEmbeddings()).as_retriever()
# Create an ensemble retriever combining BM25 and FAISS with a 70:30 weighting
bm25_faiss_73 = EnsembleRetriever(
    retrievers=[bm25, faiss],  # List of retrieval models to combine
    weights=[0.7, 0.3],  # Weighting for BM25 (70%) and FAISS (30%) results
    search_type="mmr",  # Use MMR (Maximal Marginal Relevance) to diversify search results
)
# Create an ensemble retriever combining BM25 and FAISS with a 30:70 weighting
bm25_faiss_37 = EnsembleRetriever(
    retrievers=[bm25, faiss],  # List of retrieval models to combine
    weights=[0.3, 0.7],  # Weighting for BM25 (30%) and FAISS (70%) results
    search_type="mmr",  # Use MMR (Maximal Marginal Relevance) to diversify search results
)
# Create an ensemble retriever combining Kiwi BM25 and FAISS with a 70:30 weighting
kiwibm25_faiss_73 = EnsembleRetriever(
    retrievers=[kiwi_bm25, faiss],  # List of retrieval models to combine
    weights=[0.7, 0.3],  # Weighting for Kiwi BM25 (70%) and FAISS (30%) results
    search_type="mmr",  # Use MMR (Maximal Marginal Relevance) to diversify search results
)
# Create an ensemble retriever combining Kiwi BM25 and FAISS with a 30:70 weighting
kiwibm25_faiss_37 = EnsembleRetriever(
    retrievers=[kiwi_bm25, faiss],  # List of retrieval models to combine
    weights=[0.3, 0.7],  # Weighting for Kiwi BM25 (30%) and FAISS (70%) results
    search_type="mmr",  # Use MMR (Maximal Marginal Relevance) to diversify search results
)
# Dictionary to store all retrievers for easy access
retrievers = {
    "bm25": bm25,  # Standard BM25 retriever
    "kiwi_bm25": kiwi_bm25,  # BM25 retriever with Kiwi tokenizer
    "faiss": faiss,  # FAISS retriever with OpenAI embeddings
    "bm25_faiss_73": bm25_faiss_73,  # Ensemble retriever (BM25: 70%, FAISS: 30%)
    "bm25_faiss_37": bm25_faiss_37,  # Ensemble retriever (BM25: 30%, FAISS: 70%)
    "kiwi_bm25_faiss_73": kiwibm25_faiss_73,  # Ensemble retriever (Kiwi BM25: 70%, FAISS: 30%)
    "kiwi_bm25_faiss_37": kiwibm25_faiss_37,  # Ensemble retriever (Kiwi BM25: 30%, FAISS: 70%)
}
# Define a function to print search results from multiple retrievers
def print_search_results(retrievers, query):
    """
    Prints the top search result from each retriever for a given query.

    Args:
        retrievers (dict): A dictionary of retriever instances.
        query (str): The search query.
    """
    print(f"Query: {query}")
    for name, retriever in retrievers.items():
        # Retrieve and print the top search result for each retriever
        print(f"{name}\t: {retriever.invoke(query)[0].page_content}")
    print("===" * 20)
Displaying Search Results
Let's display the search results for a variety of queries and see how the different retrievers perform.
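For example, a single call prints the top hit from every retriever for one query. The query below is hypothetical; substitute queries that match your own document set.
# Hypothetical query ("금융보험이란 무엇인가요?" ≈ "What is financial insurance?")
print_search_results(retrievers, "금융보험이란 무엇인가요?")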