Kiwi BM25 Retriever

  • Author: JeongGi Park

  • Peer Review:

  • Proofread: Juni Lee

  • This is a part of LangChain Open Tutorial

Overview

This tutorial uses kiwipiepy for Korean morphological analysis and shows how to integrate it with LangChain. It covers Korean text tokenization and compares several retrievers (BM25, Kiwi-tokenized BM25, FAISS, and weighted ensembles of these) on the same queries.

Since this tutorial covers Korean morphological analysis, the output primarily contains Korean text, reflecting the language structure being analyzed. For international users, we provide English translations alongside Korean examples.

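Table of Contents

  • Overview

  • Environment Setup

  • Korean Tokenization

  • Testing with Various Sentences

  • Comparing Search Results Using Different Retrievers

  • Conclusion

References

  • kiwipiepy

  • FAISS

  • OpenAIEmbeddings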

Environment Setup

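Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, helper functions, and utilities for the tutorials.

  • You can check out langchain-opentutorial for more details.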

%%capture --no-stderr
%pip install langchain-opentutorial
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain-openai",
        "langchain",
        "python-dotenv",
        "langchain-core",
        "kiwipiepy",
        "rank_bm25",
        "langchain-community",
        "faiss-cpu",
    ],
    verbose=False,
    upgrade=False,
)
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Kiwi-BM25-Retriever",
    },
)
Environment variables have been set successfully.

[Note] If you are using a .env file, proceed as follows.

from dotenv import load_dotenv

load_dotenv(override=True)
True
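
Because the FAISS retriever built later calls the OpenAI embeddings API, it can help to confirm that the required keys are actually set before continuing. The following is a minimal sketch, not part of langchain-opentutorial: the helper name `missing_vars` and the example values are hypothetical, and the check runs against a plain dictionary so it does not touch your real environment.

```python
import os


def missing_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Hypothetical values, used only to demonstrate the check.
example_env = {"OPENAI_API_KEY": "sk-example", "LANGCHAIN_API_KEY": ""}
print(missing_vars(["OPENAI_API_KEY", "LANGCHAIN_API_KEY"], env=example_env))
```

Calling `missing_vars(["OPENAI_API_KEY"])` without the `env` argument checks the real process environment instead.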

Korean Tokenization

Korean is morphologically rich: a single space-delimited word often consists of several morphemes (stems, affixes, endings, and particles).

For instance, “안녕하세요” is tokenized into:

  • Token(form='안녕', tag='NNG')

  • Token(form='하', tag='XSA')

  • Token(form='세요', tag='EF')

We use kiwipiepy, the Python module for Kiwi, an open-source Korean morphological analyzer, to tokenize Korean text.

from kiwipiepy import Kiwi

kiwi = Kiwi()

With this, we can easily perform tokenization.

kiwi.tokenize("안녕하세요? 형태소 분석기 키위입니다")
# Translation: Hi, this is Kiwi, a morphological analyzer.
[Token(form='안녕', tag='NNG', start=0, len=2),
     Token(form='하', tag='XSA', start=2, len=1),
     Token(form='세요', tag='EF', start=3, len=2),
     Token(form='?', tag='SF', start=5, len=1),
     Token(form='형태소', tag='NNG', start=7, len=3),
     Token(form='분석기', tag='NNG', start=11, len=3),
     Token(form='키위', tag='NNG', start=15, len=2),
     Token(form='이', tag='VCP', start=17, len=1),
     Token(form='ᆸ니다', tag='EF', start=17, len=3)]
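
Each token carries a part-of-speech tag (here NNG is a general noun, XSA an adjective-deriving suffix, EF a sentence-final ending, SF sentence-final punctuation, and VCP the copula), so tokens can be filtered by tag before indexing. The sketch below is an optional variation, not what this tutorial does later (the tutorial keeps every morpheme). `Tok` is a hypothetical stand-in for kiwipiepy's Token so the example runs without the library, and the choice of tags to keep is an assumption.

```python
from typing import NamedTuple


class Tok(NamedTuple):
    """Stand-in for kiwipiepy's Token, keeping only the two fields used here."""
    form: str
    tag: str


# Tokens copied from the kiwi.tokenize() output shown above.
tokens = [
    Tok("안녕", "NNG"), Tok("하", "XSA"), Tok("세요", "EF"), Tok("?", "SF"),
    Tok("형태소", "NNG"), Tok("분석기", "NNG"), Tok("키위", "NNG"),
    Tok("이", "VCP"), Tok("ᆸ니다", "EF"),
]

# Assumed filter: keep general nouns (NNG) and proper nouns (NNP) only.
CONTENT_TAGS = {"NNG", "NNP"}


def content_forms(toks):
    """Return the surface forms of tokens whose tag is in CONTENT_TAGS."""
    return [t.form for t in toks if t.tag in CONTENT_TAGS]


print(content_forms(tokens))
```

With real kiwipiepy tokens the same list comprehension applies, since they expose `form` and `tag` attributes as shown in the output above.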

Testing with Various Sentences

To test different retrieval methods, we define a list of documents with similar but distinguishable content.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Sample documents for retriever testing
docs = [
    Document(
        page_content="금융보험은 장기적인 자산 관리와 위험 대비를 목적으로 고안된 금융 상품입니다."
        # Translation: Financial insurance is a financial product designed for long-term asset management and risk coverage.
    ),
    Document(
        page_content="금융저축보험은 규칙적인 저축을 통해 목돈을 마련할 수 있으며, 생명보험 기능도 겸비하고 있습니다."
        # Translation: Financial savings insurance allows individuals to accumulate a lump sum through regular savings, and also offers life insurance benefits.
    ),
    Document(
        page_content="저축금융보험은 저축과 금융을 통해 목돈 마련에 도움을 주는 보험입니다. 또한, 사망 보장 기능도 제공합니다."
        # Translation: Savings financial insurance helps individuals gather a lump sum through savings and finance, and also provides death benefit coverage.
    ),
    Document(
        page_content="금융저축산물보험은 장기적인 저축 목적과 더불어, 축산물 제공 기능을 갖추고 있는 특별 금융 상품입니다."
        # Translation: Financial savings livestock insurance is a special financial product designed for long-term savings, which also includes provisions for livestock products.
    ),
    Document(
        page_content="금융단폭격보험은 저축은 커녕 위험 대비에 초점을 맞춘 상품입니다. 높은 위험을 감수하고자 하는 고객에게 적합합니다."
        # Translation: Financial 'carpet bombing' insurance focuses on risk coverage rather than savings. It is suitable for customers willing to take on high risk.
    ),
    Document(
        page_content="금보험은 저축성과를 극대화합니다. 특히 노후 대비 저축에 유리하게 구성되어 있습니다."
        # Translation: Gold insurance maximizes returns on savings. It is especially advantageous for retirement savings.
    ),
    Document(
        page_content="금융보씨 험한말 좀 하지마시고, 저축이나 좀 하시던가요. 뭐가 그리 급하신지 모르겠네요."
        # Translation: Hey, Mr. 'Financial Bo,' please refrain from harsh words and consider saving money. I'm not sure why you're in such a hurry.
    ),
]
# Print tokenized documents
for doc in docs:
    print(" ".join([token.form for token in kiwi.tokenize(doc.page_content)]))
금융 보험 은 장기 적 이 ᆫ 자산 관리 와 위험 대비 를 목적 으로 고안 되 ᆫ 금융 상품 이 ᆸ니다 .
    금융 저축 보험 은 규칙 적 이 ᆫ 저축 을 통하 어 목돈 을 마련 하 ᆯ 수 있 으며 , 생명 보험 기능 도 겸비 하 고 있 습니다 .
    저축 금융 보험 은 저축 과 금융 을 통하 어 목돈 마련 에 도움 을 주 는 보험 이 ᆸ니다 . 또한 , 사망 보장 기능 도 제공 하 ᆸ니다 .
    금융 저 축산물 보험 은 장기 적 이 ᆫ 저축 목적 과 더불 어 , 축산물 제공 기능 을 갖추 고 있 는 특별 금융 상품 이 ᆸ니다 .
    금융 단 폭격 보험 은 저축 은 커녕 위험 대비 에 초점 을 맞추 ᆫ 상품 이 ᆸ니다 . 높 은 위험 을 감수 하 고자 하 는 고객 에게 적합 하 ᆸ니다 .
    금 보험 은 저축 성과 를 극대 화 하 ᆸ니다 . 특히 노후 대비 저축 에 유리 하 게 구성 되 어 있 습니다 .
    금융 보 씨 험하 ᆫ 말 좀 하 지 말 시 고 , 저축 이나 좀 하 시 던가요 . 뭐 가 그리 급하 시 ᆫ지 모르 겠 네요 .
# Define a tokenization function
def kiwi_tokenize(text):
    return [token.form for token in kiwi.tokenize(text)]
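
To see why the tokenizer matters for keyword retrieval, the following sketch compares which query terms can match the first sample document under whitespace splitting and under morpheme splitting. It is illustrative only: the morpheme lists are written out by hand (the document tokens are copied from the printed Kiwi output above, and the query split into 금융 and 보험 is an assumption consistent with that output), so the example runs without kiwipiepy.

```python
def overlap(query_tokens, doc_tokens):
    """Return the query tokens that also occur among the document tokens."""
    doc_set = set(doc_tokens)
    return [t for t in query_tokens if t in doc_set]


doc = "금융보험은 장기적인 자산 관리와 위험 대비를 목적으로 고안된 금융 상품입니다."
query = "금융보험"

# Whitespace tokenization: the particle stays attached ("금융보험은"),
# so the query token "금융보험" has no exact match in the document.
ws_overlap = overlap(query.split(), doc.split())

# Morpheme tokens copied from the Kiwi output printed above for this document.
doc_morphemes = (
    "금융 보험 은 장기 적 이 ᆫ 자산 관리 와 위험 대비 를 목적 으로 "
    "고안 되 ᆫ 금융 상품 이 ᆸ니다 ."
).split()
# Assumed morpheme split of the query.
query_morphemes = ["금융", "보험"]
kiwi_overlap = overlap(query_morphemes, doc_morphemes)

print(ws_overlap)
print(kiwi_overlap)
```

The first list is empty and the second contains both query morphemes, which is the difference a lexical ranking function such as BM25 depends on.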

Comparing Search Results Using Different Retrievers

In this section, we compare how different retrieval methods rank documents for the same query. We use:

  • BM25: A classic lexical ranking function based on term frequency (TF) and inverse document frequency (IDF), with document-length normalization. By default, LangChain's BM25Retriever tokenizes text by splitting on whitespace.

  • Kiwi BM25: BM25 with kiwipiepy tokenization, which splits Korean words into morphemes so that a query term can match a document term even when particles or endings are attached.

  • FAISS: A vector-based retriever using embeddings (in this case, OpenAIEmbeddings). It captures semantic similarity, so it relies less on exact keyword matches and more on meaning.

  • Ensemble: A weighted combination of BM25 (or Kiwi BM25) and FAISS that draws on both the lexical matching of BM25 and the semantic matching of FAISS.
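
To make the BM25 description concrete, here is a minimal textbook sketch of the scoring function in plain Python. It is not the implementation used by BM25Retriever or the rank_bm25 package (their IDF smoothing and default parameters may differ); the function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, and the three-document toy corpus are assumptions for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with a textbook BM25."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: the number of documents containing each term.
    df = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Rare terms get a higher IDF weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical, already-tokenized toy corpus.
corpus = [
    ["금융", "보험", "은", "금융", "상품"],
    ["금", "보험", "은", "저축"],
    ["저축", "이나", "좀", "하"],
]
scores = bm25_scores(["금융", "보험"], corpus)
print(scores)
```

The first document matches both query terms and scores highest, the second matches only 보험, and the third matches nothing and scores zero. A document that shares no token with the query can never be ranked by relevance, which is why the tokenizer matters so much for Korean.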

Key points of Comparison

Exact Keyword Matching vs. Semantic Matching

  • BM25 (and Kiwi BM25) excel in finding documents that share exact terms or closely related morphological variants.

  • FAISS retrieves documents that may not have exact lexical overlap but are semantically similar (e.g., synonyms or paraphrases).
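
Semantic matching boils down to comparing embedding vectors. The sketch below ranks three documents by cosine similarity to a query vector. All vectors and labels are made up for illustration (real embeddings from OpenAIEmbeddings have far more dimensions), and a FAISS index may use a different distance measure depending on how it is configured.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "embeddings".
doc_vectors = {
    "insurance product": [0.9, 0.1, 0.0],
    "savings plan": [0.6, 0.7, 0.1],
    "casual remark": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]

ranked = sorted(
    doc_vectors,
    key=lambda name: cosine(query_vector, doc_vectors[name]),
    reverse=True,
)
print(ranked)
```

No token overlap is involved at any point: the ranking depends only on the direction of the vectors, which is why this kind of retriever can match paraphrases but can also miss an exact rare keyword.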

Impact of Korean morphological analysis

  • Korean often merges stems and endings into single words (“안녕하세요” → “안녕 + 하 + 세요”). Kiwi BM25 handles this by splitting the query and documents more precisely.

  • This can yield more relevant results when dealing with conjugated verbs, particles, or compound nouns.

Ensemble Approaches

  • By combining lexical (BM25) and semantic (FAISS) retrievers, we can produce a more balanced set of results.

  • The weighting (e.g., 70:30 or 30:70) can be tuned to emphasize one aspect over the other.

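  • MMR (Maximal Marginal Relevance) can reduce redundancy in the retrieved results, but it is a search type of the vector store retriever (for example, as_retriever(search_type="mmr") on FAISS), not an option of EnsembleRetriever.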
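
The effect of the weights can be sketched with weighted Reciprocal Rank Fusion, which, to my understanding, is how EnsembleRetriever merges the ranked lists of its member retrievers. The code below is a simplified stand-alone version, not LangChain's implementation: the function name `weighted_rrf`, the constant `c=60`, and the two example rankings are assumptions for illustration.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists of document ids with weighted Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more the higher it is ranked by a retriever.
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical rankings from a lexical and a semantic retriever.
lexical = ["doc3", "doc1", "doc2"]
semantic = ["doc1", "doc2", "doc3"]

lexical_heavy = weighted_rrf([lexical, semantic], weights=[0.7, 0.3])
semantic_heavy = weighted_rrf([lexical, semantic], weights=[0.3, 0.7])
print(lexical_heavy)
print(semantic_heavy)
```

With a 70:30 weighting the lexical retriever's favorite (doc3) comes first. With 30:70 the semantic retriever's favorite (doc1) comes first. Only ranks enter the fusion, not the raw BM25 or similarity scores.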

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Initialize BM25 retriever on the raw documents (default whitespace tokenization)
bm25 = BM25Retriever.from_documents(docs)

# Initialize BM25 retriever with the Kiwi tokenizer as its preprocessing function
kiwi_bm25 = BM25Retriever.from_documents(docs, preprocess_func=kiwi_tokenize)

# Initialize FAISS retriever with OpenAI embeddings
faiss = FAISS.from_documents(docs, OpenAIEmbeddings()).as_retriever()

# EnsembleRetriever fuses the ranked lists of its retrievers using the given weights.
# It has no search_type option; MMR, if wanted, is set on the vector store
# retriever, e.g. as_retriever(search_type="mmr").

# Create an ensemble retriever combining BM25 and FAISS with a 70:30 weighting
bm25_faiss_73 = EnsembleRetriever(
    retrievers=[bm25, faiss],  # List of retrievers to combine
    weights=[0.7, 0.3],        # Weighting for BM25 (70%) and FAISS (30%) results
)

# Create an ensemble retriever combining BM25 and FAISS with a 30:70 weighting
bm25_faiss_37 = EnsembleRetriever(
    retrievers=[bm25, faiss],  # List of retrievers to combine
    weights=[0.3, 0.7],        # Weighting for BM25 (30%) and FAISS (70%) results
)

# Create an ensemble retriever combining Kiwi BM25 and FAISS with a 70:30 weighting
kiwibm25_faiss_73 = EnsembleRetriever(
    retrievers=[kiwi_bm25, faiss],  # List of retrievers to combine
    weights=[0.7, 0.3],             # Weighting for Kiwi BM25 (70%) and FAISS (30%) results
)

# Create an ensemble retriever combining Kiwi BM25 and FAISS with a 30:70 weighting
kiwibm25_faiss_37 = EnsembleRetriever(
    retrievers=[kiwi_bm25, faiss],  # List of retrievers to combine
    weights=[0.3, 0.7],             # Weighting for Kiwi BM25 (30%) and FAISS (70%) results
)

# Dictionary to store all retrievers for easy access
retrievers = {
    "bm25": bm25,  # Standard BM25 retriever
    "kiwi_bm25": kiwi_bm25,  # BM25 retriever with Kiwi tokenizer
    "faiss": faiss,  # FAISS retriever with OpenAI embeddings
    "bm25_faiss_73": bm25_faiss_73,  # Ensemble retriever (BM25:70%, FAISS:30%)
    "bm25_faiss_37": bm25_faiss_37,  # Ensemble retriever (BM25:30%, FAISS:70%)
    "kiwi_bm25_faiss_73": kiwibm25_faiss_73,  # Ensemble retriever (Kiwi BM25:70%, FAISS:30%)
    "kiwi_bm25_faiss_37": kiwibm25_faiss_37,  # Ensemble retriever (Kiwi BM25:30%, FAISS:70%)
}
# Define a function to print search results from multiple retrievers
def print_search_results(retrievers, query):
    """
    Prints the top search result from each retriever for a given query.
    
    Args:
        retrievers (dict): A dictionary of retriever instances.
        query (str): The search query.
    """
    print(f"Query: {query}")
    for name, retriever in retrievers.items():
        # Retrieve and print the top search result for each retriever
        print(f"{name}\t: {retriever.invoke(query)[0].page_content}")
    print("===" * 20)
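
The function above prints only the first hit per retriever. To compare more than the top result, a variant can collect the first k hits instead. The sketch below is one possible shape for such a helper: `top_k_contents` and `StubRetriever` are hypothetical names, and the stub only imitates the `.invoke(query)` call and `page_content` attribute used above so that the example runs without any API key.

```python
from types import SimpleNamespace


def top_k_contents(retrievers, query, k=3):
    """Map each retriever name to the page_content of its first k results."""
    return {
        name: [doc.page_content for doc in retriever.invoke(query)[:k]]
        for name, retriever in retrievers.items()
    }


class StubRetriever:
    """Stand-in exposing the same .invoke(query) call as the retrievers above."""

    def __init__(self, contents):
        self._docs = [SimpleNamespace(page_content=c) for c in contents]

    def invoke(self, query):
        # A real retriever would rank by the query; the stub returns a fixed order.
        return self._docs


stubs = {
    "lexical": StubRetriever(["A", "B", "C", "D"]),
    "semantic": StubRetriever(["B", "A"]),
}
results = top_k_contents(stubs, "any query", k=2)
print(results)
```

Passing the `retrievers` dictionary defined earlier in place of `stubs` would give the same structure for the real retrievers.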

Displaying Search Results

Let's display the search results for a variety of queries, and see how different retrievers perform.

print_search_results(retrievers, "금융보험")
Query: 금융보험
    bm25	: 금융보씨 험한말 좀 하지마시고, 저축이나 좀 하시던가요. 뭐가 그리 급하신지 모르겠네요.
    kiwi_bm25	: 저축금융보험은 저축과 금융을 통해 목돈 마련에 도움을 주는 보험입니다. 또한, 사망 보장 기능도 제공합니다.
    faiss	: 금융보험은 장기적인 자산 관리와 위험 대비를 목적으로 고안된 금융 상품입니다.
    bm25_faiss_73	: 금융단폭격보험은 저축은 커녕 위험 대비에 초점을 맞춘 상품입니다. 높은 위험을 감수하고자 하는 고객에게 적합합니다.
    bm25_faiss_37	: 금융단폭격보험은 저축은 커녕 위험 대비에 초점을 맞춘 상품입니다. 높은 위험을 감수하고자 하는 고객에게 적합합니다.
    kiwi_bm25_faiss_73	: 금융보험은 장기적인 자산 관리와 위험 대비를 목적으로 고안된 금융 상품입니다.
    kiwi_bm25_faiss_37	: 금융보험은 장기적인 자산 관리와 위험 대비를 목적으로 고안된 금융 상품입니다.
    ============================================================
print_search_results(retrievers, "금융 보험")
Query: 금융 보험
    bm25	: 금융보험은 장기적인 자산 관리와 위험 대비를 목적으로 고안된 금융 상품입니다.
    kiwi_bm25	: 저축금융보험은 저축과 금융을 통해 목돈 마련에 도움을 주는 보험입니다. 또한, 사망 보장 기능도 제공합니다.
    faiss	: 금융보험은 장기적인 자산 관리와 위험 대비를 목적으로 고안된 금융 상품입니다.
    bm25_faiss_73	: 금융보험은 장기적인 자산 관리와 위험 대비를 목적으로 고안된 금융 상품입니다.
    bm25_faiss_37	: 금융보험은 장기적인 자산 관리와 위험 대비를 목적으로 고안된 금융 상품입니다.
    kiwi_bm25_faiss_73	: 금융보험은 장기적인 자산 관리와 위험 대비를 목적으로 고안된 금융 상품입니다.
    kiwi_bm25_faiss_37	: 금융보험은 장기적인 자산 관리와 위험 대비를 목적으로 고안된 금융 상품입니다.
    ============================================================
print_search_results(retrievers, "금융저축보험")
Query: 금융저축보험
    bm25	: 금융보씨 험한말 좀 하지마시고, 저축이나 좀 하시던가요. 뭐가 그리 급하신지 모르겠네요.
    kiwi_bm25	: 저축금융보험은 저축과 금융을 통해 목돈 마련에 도움을 주는 보험입니다. 또한, 사망 보장 기능도 제공합니다.
    faiss	: 금융저축보험은 규칙적인 저축을 통해 목돈을 마련할 수 있으며, 생명보험 기능도 겸비하고 있습니다.
    bm25_faiss_73	: 금융단폭격보험은 저축은 커녕 위험 대비에 초점을 맞춘 상품입니다. 높은 위험을 감수하고자 하는 고객에게 적합합니다.
    bm25_faiss_37	: 금융저축산물보험은 장기적인 저축 목적과 더불어, 축산물 제공 기능을 갖추고 있는 특별 금융 상품입니다.
    kiwi_bm25_faiss_73	: 저축금융보험은 저축과 금융을 통해 목돈 마련에 도움을 주는 보험입니다. 또한, 사망 보장 기능도 제공합니다.
    kiwi_bm25_faiss_37	: 금융저축보험은 규칙적인 저축을 통해 목돈을 마련할 수 있으며, 생명보험 기능도 겸비하고 있습니다.
    ============================================================
print_search_results(retrievers, "축산물 보험")
Query: 축산물 보험
    bm25	: 금융저축산물보험은 장기적인 저축 목적과 더불어, 축산물 제공 기능을 갖추고 있는 특별 금융 상품입니다.
    kiwi_bm25	: 금융저축산물보험은 장기적인 저축 목적과 더불어, 축산물 제공 기능을 갖추고 있는 특별 금융 상품입니다.
    faiss	: 금융저축산물보험은 장기적인 저축 목적과 더불어, 축산물 제공 기능을 갖추고 있는 특별 금융 상품입니다.
    bm25_faiss_73	: 금융저축산물보험은 장기적인 저축 목적과 더불어, 축산물 제공 기능을 갖추고 있는 특별 금융 상품입니다.
    bm25_faiss_37	: 금융저축산물보험은 장기적인 저축 목적과 더불어, 축산물 제공 기능을 갖추고 있는 특별 금융 상품입니다.
    kiwi_bm25_faiss_73	: 금융저축산물보험은 장기적인 저축 목적과 더불어, 축산물 제공 기능을 갖추고 있는 특별 금융 상품입니다.
    kiwi_bm25_faiss_37	: 금융저축산물보험은 장기적인 저축 목적과 더불어, 축산물 제공 기능을 갖추고 있는 특별 금융 상품입니다.
    ============================================================
print_search_results(retrievers, "저축금융보험")
Query: 저축금융보험
    bm25	: 금융보씨 험한말 좀 하지마시고, 저축이나 좀 하시던가요. 뭐가 그리 급하신지 모르겠네요.
    kiwi_bm25	: 저축금융보험은 저축과 금융을 통해 목돈 마련에 도움을 주는 보험입니다. 또한, 사망 보장 기능도 제공합니다.
    faiss	: 저축금융보험은 저축과 금융을 통해 목돈 마련에 도움을 주는 보험입니다. 또한, 사망 보장 기능도 제공합니다.
    bm25_faiss_73	: 금융단폭격보험은 저축은 커녕 위험 대비에 초점을 맞춘 상품입니다. 높은 위험을 감수하고자 하는 고객에게 적합합니다.
    bm25_faiss_37	: 금융단폭격보험은 저축은 커녕 위험 대비에 초점을 맞춘 상품입니다. 높은 위험을 감수하고자 하는 고객에게 적합합니다.
    kiwi_bm25_faiss_73	: 저축금융보험은 저축과 금융을 통해 목돈 마련에 도움을 주는 보험입니다. 또한, 사망 보장 기능도 제공합니다.
    kiwi_bm25_faiss_37	: 저축금융보험은 저축과 금융을 통해 목돈 마련에 도움을 주는 보험입니다. 또한, 사망 보장 기능도 제공합니다.
    ============================================================
print_search_results(retrievers, "금융보씨 개인정보 조회")
Query: 금융보씨 개인정보 조회
    bm25	: 금융보씨 험한말 좀 하지마시고, 저축이나 좀 하시던가요. 뭐가 그리 급하신지 모르겠네요.
    kiwi_bm25	: 금융보씨 험한말 좀 하지마시고, 저축이나 좀 하시던가요. 뭐가 그리 급하신지 모르겠네요.
    faiss	: 금융보험은 장기적인 자산 관리와 위험 대비를 목적으로 고안된 금융 상품입니다.
    bm25_faiss_73	: 금융보씨 험한말 좀 하지마시고, 저축이나 좀 하시던가요. 뭐가 그리 급하신지 모르겠네요.
    bm25_faiss_37	: 금융단폭격보험은 저축은 커녕 위험 대비에 초점을 맞춘 상품입니다. 높은 위험을 감수하고자 하는 고객에게 적합합니다.
    kiwi_bm25_faiss_73	: 금융보씨 험한말 좀 하지마시고, 저축이나 좀 하시던가요. 뭐가 그리 급하신지 모르겠네요.
    kiwi_bm25_faiss_37	: 금융보험은 장기적인 자산 관리와 위험 대비를 목적으로 고안된 금융 상품입니다.
    ============================================================

Conclusion

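Comparing the top document returned for each query shows where each retriever type is strong:

  • BM25 / Kiwi BM25: Strong at exact keyword matching, but only as good as the tokenizer. With whitespace tokenization, plain BM25 returned the unrelated “금융보씨” sentence for the unspaced queries 금융보험, 금융저축보험, and 저축금융보험, whereas Kiwi BM25 returned an insurance document for each of them.

  • FAISS: Finds semantically related documents even if the wording differs. In the runs above it returned the document named after the query for 금융보험, 금융저축보험, and 저축금융보험, but it missed the “금융보씨” sentence for the last query, which both BM25 retrievers matched lexically.

  • Ensemble: Combines both signals. For 금융보험 and 저축금융보험, both Kiwi BM25 + FAISS ensembles returned the document named after the query, while both plain BM25 + FAISS ensembles returned the 금융단폭격보험 document. The weights matter as well: for 금융저축보험, the 70:30 and 30:70 Kiwi ensembles returned different documents.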
