Ensemble Retriever with Convex Combination (CC)
Author: Harheem Kim
Peer Review:
Proofread: JaeJun Shim
This is part of the LangChain Open Tutorial
Overview
This tutorial focuses on implementing and comparing different ensemble retrieval methods in LangChain. While LangChain's built-in EnsembleRetriever uses the Reciprocal Rank Fusion (RRF) method, we'll explore an additional approach by implementing the Convex Combination (CC) method.
The tutorial guides you through creating custom implementations of both the RRF and CC methods, allowing for a direct performance comparison between these ensemble techniques.
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorial is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials. You can check out langchain-opentutorial for more details.
Alternatively, you can set OPENAI_API_KEY in a .env file and load it.
[Note] This is not necessary if you've already set OPENAI_API_KEY in previous steps.
Process Document
This section prepares the PDF document for storage in a vector store.
We use PDFPlumberLoader to load the PDF file and leverage RecursiveCharacterTextSplitter to break down the document into smaller, manageable chunks.
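The chunk size is set to 200 characters with no overlap, which keeps each chunk small and avoids duplicated text between chunks. Because the recursive splitter prefers to break at paragraph and line boundaries before falling back to words and characters, chunks tend to stay coherent.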
Initialize Retrievers
This section initializes retrievers to implement two different search approaches. We create embeddings using OpenAI's text-embedding-3-small model and set up FAISS vector search based on these embeddings.
Additionally, we configure a BM25 retriever for keyword-based search, with both retrievers set to return the top 5 most relevant results.
Implement Ensemble Retrievers
This section introduces a custom retriever implementing two ensemble search methods, designed to compare performance against LangChain's built-in EnsembleRetriever.
We implement both Reciprocal Rank Fusion (RRF), which combines results based on document rankings, and Convex Combination (CC), which combines normalized relevance scores using weights.
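In formula form, the two methods are commonly written as follows. The notation is introduced here for illustration and is not taken from the tutorial: r_i(d) is the rank of document d in the results of retriever i (starting at 1), s_i(d) is its raw score, w_i is the weight of retriever i, and c is a smoothing constant (LangChain's EnsembleRetriever defaults to c = 60). Min-max scaling is assumed as the normalization for CC; other normalizations are possible.

```latex
\mathrm{RRF}(d) = \sum_{i=1}^{n} \frac{w_i}{c + r_i(d)}
\qquad
\mathrm{CC}(d) = \sum_{i=1}^{n} w_i \, \tilde{s}_i(d),
\quad
\tilde{s}_i(d) = \frac{s_i(d) - \min_{d'} s_i(d')}{\max_{d'} s_i(d') - \min_{d'} s_i(d')}
```

A retriever that does not return d contributes nothing to either sum. With two retrievers and weights (\alpha, 1 - \alpha), the CC score is a convex combination of the two normalized scores, which is where the name comes from.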
Both methods integrate results from FAISS and BM25 retrievers to provide more accurate and diverse search results, allowing users to select the most suitable ensemble approach for their needs.
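The two fusion rules described in this section can be sketched in plain Python, independent of LangChain. This is a minimal illustration rather than the tutorial's retriever class: the function names (rrf_fuse, cc_fuse, min_max), the document IDs, and the scores are all made up, scores are assumed to be higher-is-better, and min-max scaling is assumed as the CC normalization.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, weights, c=60):
    """Weighted Reciprocal Rank Fusion over lists of doc ids (best first)."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the position matters; raw scores are ignored.
            scores[doc_id] += weight / (rank + c)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def min_max(scores):
    """Scale a {doc_id: score} mapping to [0, 1]; constant scores map to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def cc_fuse(scored_lists, weights):
    """Convex Combination: weighted sum of min-max normalized scores."""
    scores = defaultdict(float)
    for scored, weight in zip(scored_lists, weights):
        for doc_id, normalized in min_max(scored).items():
            # Documents missing from a retriever simply add nothing here.
            scores[doc_id] += weight * normalized
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up scores standing in for a dense (vector) and a sparse (BM25) retriever.
dense = {"a": 0.90, "b": 0.85, "c": 0.30}
sparse = {"b": 7.0, "d": 6.5, "a": 1.0}

# RRF needs only the orderings, so sort each result by score first.
dense_ranking = sorted(dense, key=dense.get, reverse=True)
sparse_ranking = sorted(sparse, key=sparse.get, reverse=True)

rrf_result = rrf_fuse([dense_ranking, sparse_ranking], weights=[0.5, 0.5])
cc_result = cc_fuse([dense, sparse], weights=[0.5, 0.5])
```

Note the design difference: RRF discards score magnitudes and uses only ranks, so it needs no normalization, while CC keeps score gaps and therefore depends on how the raw scores of each retriever are scaled.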
Compare and Test
This section presents a test function for comparing ensemble retrieval results.
The RRF method, which follows LangChain's default implementation, produces results identical to those of the built-in EnsembleRetriever (labeled Original), while the CC method, which relies on normalized scores and weights, can rank documents differently.
By testing with real queries and comparing these approaches, we can identify which ensemble method better suits our project requirements.
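A comparison helper of the kind described in this section might look like the sketch below. It is not the tutorial's test function: compare_rankings and overlap_at_k are hypothetical names, a retriever is modeled as any callable that maps a query to a ranked list of document IDs, and the dummy retrievers return hard-coded placeholder lists rather than real search results.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in type: a retriever maps a query to ranked document ids.
Retriever = Callable[[str], List[str]]


def compare_rankings(
    query: str, retrievers: Dict[str, Retriever], k: int = 5
) -> Dict[str, List[str]]:
    """Run every retriever on the same query and print the top-k side by side."""
    results: Dict[str, List[str]] = {}
    for name, retrieve in retrievers.items():
        results[name] = list(retrieve(query))[:k]
        print(f"[{name}]")
        for position, doc_id in enumerate(results[name], start=1):
            print(f"  {position}. {doc_id}")
    return results


def overlap_at_k(first: List[str], second: List[str]) -> float:
    """Share of documents two rankings have in common, ignoring order."""
    if not first and not second:
        return 1.0
    return len(set(first) & set(second)) / max(len(first), len(second))


# Dummy retrievers with hard-coded output, used only to exercise the helper.
dummy_retrievers: Dict[str, Retriever] = {
    "Original": lambda q: ["doc1", "doc2", "doc3"],
    "RRF": lambda q: ["doc1", "doc2", "doc3"],
    "CC": lambda q: ["doc2", "doc1", "doc4"],
}

results = compare_rankings("example query", dummy_retrievers, k=3)
same_as_original = results["RRF"] == results["Original"]
cc_overlap = overlap_at_k(results["RRF"], results["CC"])
```

With real retrievers, each callable would wrap a retriever's query method and return an identifier such as the chunk text. Checking list equality confirms whether a custom RRF implementation reproduces the built-in ordering, and the overlap value gives a rough sense of how far CC diverges.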