Kiwi BM25 Retriever


Overview

This tutorial explores kiwipiepy for Korean morphological analysis and demonstrates its integration with the LangChain framework. It covers Korean text tokenization and compares several retrievers under different configurations.

Since this tutorial covers Korean morphological analysis, the output primarily contains Korean text, reflecting the language structure being analyzed. For international users, we provide English translations alongside Korean examples.


Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup helpers, along with useful functions and utilities for the tutorials.

  • You can check out langchain-opentutorial for more details.

[Note] If you are using a .env file, proceed as follows.
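
A minimal sketch of the .env step, assuming the python-dotenv package is installed and that the file defines OPENAI_API_KEY (an assumption, inferred from the use of OpenAIEmbeddings later in this tutorial). The helper name load_env_file is hypothetical.

```python
import os


def load_env_file(path=".env"):
    """Load KEY=VALUE pairs from a .env file into os.environ.

    Returns False when python-dotenv is missing or nothing was loaded.
    """
    try:
        from dotenv import load_dotenv  # third-party: pip install python-dotenv
    except ImportError:
        return False
    # override=True lets values in the file replace already-set variables.
    return load_dotenv(path, override=True)


load_env_file()
# Assumption: the embeddings used later read the key from this variable.
print("OPENAI_API_KEY set:", "OPENAI_API_KEY" in os.environ)
```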

Korean Tokenization

Korean words are morphologically rich. A single word is often split into multiple morphemes (root, affix, suffix, etc.).

For instance, “안녕하세요” (“Hello”) is tokenized into:

  • Token(form='안녕', tag='NNG')

  • Token(form='하', tag='XSA')

  • Token(form='세요', tag='EF')

We utilize kiwipiepy, which is a Python module for Kiwi, an open-source Korean morphological analyzer, to tokenize Korean text.

With this, we can easily perform tokenization.
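
The tokenization step could look roughly like the sketch below. It assumes kiwipiepy is installed; the helper names get_kiwi and tokenize_with_tags are hypothetical and only wrap the library's Kiwi().tokenize call. The import is deferred so the analyzer is loaded only when first needed.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_kiwi():
    """Create the Kiwi analyzer once and reuse it on later calls."""
    from kiwipiepy import Kiwi  # third-party: pip install kiwipiepy
    return Kiwi()


def tokenize_with_tags(text):
    """Return (surface form, part-of-speech tag) pairs, one per morpheme."""
    return [(token.form, token.tag) for token in get_kiwi().tokenize(text)]


# Usage ("안녕하세요" means "Hello"):
# for form, tag in tokenize_with_tags("안녕하세요"):
#     print(form, tag)
```

Each token also carries its position in the input, which is useful when the original spans need to be recovered.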

Testing with Various Sentences

To test different retrieval methods, we define a list of documents composed of similar yet distinguishable contents.

Comparing Search Results Using Different Retrievers

In this section, we compare how different retrieval methods rank documents when given the same query. We are using:

  • BM25: A traditional ranking function based on term frequency (TF) and inverse document frequency (IDF).

  • Kiwi BM25: BM25 with an added benefit of kiwipiepy tokenization, enabling more accurate splitting of Korean words into morphemes (especially important for Korean queries).

  • FAISS: A vector-based retriever using embeddings (in this case, OpenAIEmbeddings). It captures semantic similarity, so it relies less on exact keyword matches and more on meaning.

  • Ensemble: A combination of BM25 (or Kiwi BM25) and FAISS, weighted to leverage both the lexical matching strengths of BM25 and the semantic understanding of FAISS.
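
As a rough sketch of how the first two retrievers in this list might be built, the code below passes a Kiwi-based tokenizer to LangChain's BM25Retriever through its preprocess_func argument. It assumes kiwipiepy, langchain-community, and rank_bm25 are installed. The helper names kiwi_tokenize and build_bm25_retrievers are hypothetical, and the exact setup used in this tutorial may differ.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_kiwi():
    """Create the Kiwi analyzer once and reuse it on later calls."""
    from kiwipiepy import Kiwi  # third-party: pip install kiwipiepy
    return Kiwi()


def kiwi_tokenize(text):
    """Split text into morpheme surface forms for BM25 indexing and querying."""
    return [token.form for token in get_kiwi().tokenize(text)]


def build_bm25_retrievers(texts):
    """Return (plain BM25, Kiwi BM25) retrievers over the same texts."""
    # Third-party import, deferred so this module loads without LangChain.
    from langchain_community.retrievers import BM25Retriever

    plain = BM25Retriever.from_texts(texts)  # default preprocessing
    kiwi = BM25Retriever.from_texts(texts, preprocess_func=kiwi_tokenize)
    return plain, kiwi


# Usage, given a list of strings `texts` and a query string `query`:
# plain_bm25, kiwi_bm25 = build_bm25_retrievers(texts)
# print(kiwi_bm25.invoke(query))
```

The same preprocess_func is applied to both documents and queries, so the two sides are always tokenized consistently.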

Key Points of Comparison

Exact Keyword Matching vs. Semantic Matching

  • BM25 and Kiwi BM25 excel at finding documents that share exact terms with the query. With Kiwi tokenization, closely related morphological variants of a term can match as well.

  • FAISS retrieves documents that may not have exact lexical overlap but are semantically similar (e.g., synonyms or paraphrases).
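
To make the lexical side concrete, here is a small pure-Python sketch of BM25 scoring over pre-tokenized documents. It uses a common Lucene-style IDF variant; the function name bm25_scores and the defaults k1=1.5 and b=0.75 are illustrative assumptions, and the implementation behind LangChain's retriever may compute slightly different values.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the tokenized query."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in docs_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # no lexical overlap, no contribution
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            # Length normalization: long documents need more hits to score high.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [["apple", "banana"], ["apple", "apple", "cherry"], ["durian"]]
print(bm25_scores(["apple"], docs))
```

A document that shares no token with the query always scores zero, which is why the choice of tokenizer matters so much for BM25.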

Impact of Korean Morphological Analysis

  • Korean often merges stems and endings into single words (“안녕하세요” → “안녕 + 하 + 세요”). Kiwi BM25 handles this by splitting the query and documents more precisely.

  • This can yield more relevant results when dealing with conjugated verbs, particles, or compound nouns.
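
The effect can be illustrated without any analyzer installed. In the sketch below, toy_strip_particles is a deliberately crude, hypothetical stand-in for Kiwi: it only drops one trailing particle from each word, whereas a real morphological analyzer does far more. The sentence and query are made-up examples.

```python
PARTICLES = ("은", "는", "이", "가", "을", "를")  # a few common Korean particles


def whitespace_tokens(text):
    """Split on whitespace only, leaving particles attached to words."""
    return text.split()


def toy_strip_particles(text):
    """Crude stand-in for morphological analysis: drop one trailing particle."""
    tokens = []
    for word in text.split():
        if len(word) > 1 and word.endswith(PARTICLES):
            word = word[:-1]
        tokens.append(word)
    return tokens


def shared_terms(query, document, tokenize):
    """Return the set of terms the query and document have in common."""
    return set(tokenize(query)) & set(tokenize(document))


document = "보험은 위험을 관리한다"  # "Insurance manages risk"
query = "보험 위험"  # "insurance risk"

# Whitespace tokens keep the particles attached, so nothing matches.
print(shared_terms(query, document, whitespace_tokens))
# After stripping particles, both query terms are found in the document.
print(shared_terms(query, document, toy_strip_particles))
```

Note that this toy rule would also mangle words that merely end in one of these syllables, which is exactly the kind of mistake a proper analyzer such as Kiwi is meant to avoid.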

Ensemble Approaches

  • By combining lexical (BM25) and semantic (FAISS) retrievers, we can produce a more balanced set of results.

  • The weighting (e.g., 70:30 or 30:70) can be tuned to emphasize one aspect over the other.

  • Using MMR (Maximal Marginal Relevance) promotes diversity in the retrieved results by penalizing candidates that closely resemble documents already selected, which reduces redundancy.
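
The two ideas above can be sketched in plain Python. The first function fuses ranked lists with weighted Reciprocal Rank Fusion, and the second selects results with MMR. The function names, the constant c=60, and the toy vectors are illustrative assumptions; LangChain's ensemble and MMR implementations may differ in detail.

```python
import math
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document ids; a higher weight means more influence."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Pick k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))

    def mmr_score(i):
        relevance = cosine(query_vec, doc_vecs[i])
        redundancy = max(
            (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


bm25_ranking = ["a", "b", "c"]
vector_ranking = ["c", "b", "a"]
print(weighted_rrf([bm25_ranking, vector_ranking], [0.7, 0.3]))  # ['a', 'b', 'c']
print(weighted_rrf([bm25_ranking, vector_ranking], [0.3, 0.7]))  # ['c', 'b', 'a']

# Documents 0 and 1 are near-duplicates; document 2 is less similar to both.
vectors = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]
print(mmr([1.0, 1.0], vectors, k=2, lambda_mult=0.5))  # [0, 2]
print(mmr([1.0, 1.0], vectors, k=2, lambda_mult=1.0))  # [0, 1]
```

With lambda_mult=1.0 MMR reduces to plain relevance ranking, so the near-duplicate is returned; lowering it swaps in the more distinct document.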

Displaying Search Results

Let's display the search results for a variety of queries, and see how different retrievers perform.
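
One way to lay out such a comparison is sketched below. The helpers collect_top_results and print_results are hypothetical, and the two lambda "retrievers" are placeholder stubs rather than real search results; with LangChain retrievers, each callable could be the retriever's invoke method.

```python
def collect_top_results(query, retrievers, top_k=1):
    """Run the query through each retriever and keep its first top_k hits."""
    return {name: retrieve(query)[:top_k] for name, retrieve in retrievers.items()}


def print_results(query, retrievers, top_k=1):
    """Print one line per retriever so the rankings are easy to compare."""
    print(f"Query: {query}")
    for name, hits in collect_top_results(query, retrievers, top_k).items():
        print(f"  {name:<10} {hits}")


# Placeholder stubs standing in for real retrievers.
stub_retrievers = {
    "bm25": lambda q: [f"bm25 hit 1 for {q}", f"bm25 hit 2 for {q}"],
    "faiss": lambda q: [f"faiss hit 1 for {q}", f"faiss hit 2 for {q}"],
}
print_results("sample query", stub_retrievers)
```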

Conclusion

By running the code and observing the top documents returned for each query, you can see how each retriever type has its strengths:

  • BM25 / Kiwi BM25: Well suited to precise keyword matching; the Kiwi variant additionally handles Korean morphological nuances.

  • FAISS: Finds semantically related documents even if the wording differs.

  • Ensemble: Combines both approaches, often achieving better overall coverage across a wide range of queries.
