The Cross Encoder Reranker is a technique designed to enhance the performance of Retrieval-Augmented Generation (RAG) systems.
This guide explains how to implement a reranker using Hugging Face's Cross Encoders to refine the ranking of retrieved documents, promoting those most relevant to a query.
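The core idea: unlike a bi-encoder, which embeds the query and each document separately, a cross encoder reads each (query, document) pair jointly and outputs a single relevance score, which is slower but typically more accurate. The sketch below illustrates this scoring step directly with the sentence-transformers CrossEncoder class; the model name ms-marco-MiniLM-L-6-v2 is just one common choice, not a requirement of this guide.
from sentence_transformers import CrossEncoder

# A cross encoder scores the query and a candidate document together,
# producing one relevance score per (query, document) pair.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Can you tell me about Word2Vec?"
candidates = [
    "Word2Vec maps words to a vector space based on context.",
    "SQL is a programming language for managing data in databases.",
]

# Higher score = more relevant; sorting by score is the reranking step.
scores = model.predict([(query, doc) for doc in candidates])
for doc, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.4f}  {doc}")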
Environment Setup
[Note] langchain-opentutorial is a package that provides easy-to-use environment setup tools, along with useful functions and utilities for these tutorials.
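If the package is not installed yet, it can be installed first (assuming a Jupyter environment; from a shell, use pip directly):
%pip install -qU langchain-opentutorial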
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
    }
)
Alternatively, you can set OPENAI_API_KEY in a .env file and load it.
[Note] This is not necessary if you've already set OPENAI_API_KEY in previous steps.
# Configuration file to manage API keys as environment variables
from dotenv import load_dotenv
# Load API key information
load_dotenv(override=True)
Key Features of the Cross Encoder Reranker
Refined Search Quality: Improves the relevance and quality of the retrieved documents.
RAG System Boost: Enhances the performance of Retrieval-Augmented Generation (RAG) systems by refining input relevance.
Seamless Integration: Easily adaptable to various workflows and compatible with multiple frameworks.
Model Versatility: Offers flexibility with a wide range of pre-trained models for tailored use cases.
Document Count Settings for Reranker
Reranking is generally performed on the top 5–10 documents retrieved during the initial search.
The ideal number of documents for reranking should be determined through experimentation and evaluation, as it depends on the dataset characteristics and computational resources available.
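The cost of reranking grows linearly with the number of candidates, since the cross encoder runs one forward pass per (query, document) pair. A small, hypothetical timing sweep can help pick a candidate count for your hardware; the model choice and candidate texts below are arbitrary stand-ins for first-stage retrieval results.
import time
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "Can you tell me about Word2Vec?"

# Dummy candidate pool standing in for first-stage retrieval results.
candidates = [f"Candidate document number {i}" for i in range(50)]

for k in (5, 10, 20):
    start = time.perf_counter()
    model.predict([(query, doc) for doc in candidates[:k]])
    print(f"k={k:2d}: {time.perf_counter() - start:.3f}s to rerank")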
Trade-offs When Using a Reranker
Accuracy vs. Processing Time: Striking a balance between achieving higher accuracy and minimizing processing time (the comparison sketch after this list makes the cost concrete).
Performance Improvement vs. Computational Cost: Weighing the benefits of improved performance against the additional computational resources required.
Search Speed vs. Relevance Accuracy: Managing the trade-off between faster retrieval and maintaining high relevance in results.
System Requirements: Ensuring the system meets the necessary hardware and software requirements to support reranking.
Dataset Characteristics: Considering the scale, diversity, and specific attributes of the dataset to optimize reranker performance.
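The speed-versus-accuracy trade-off comes from the architectures themselves: a bi-encoder embeds documents once and scores them with fast vector math, while a cross encoder must run a full forward pass per pair. A rough latency sketch, assuming sentence-transformers is installed; the model names and document count are arbitrary illustrative choices, and absolute numbers will vary by hardware.
import time
from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "Can you tell me about Word2Vec?"
docs = [f"Candidate document number {i}" for i in range(100)]

# Bi-encoder: embed query and documents separately, score with cosine similarity.
bi = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
start = time.perf_counter()
q_emb = bi.encode(query, convert_to_tensor=True)
d_emb = bi.encode(docs, convert_to_tensor=True)
bi_scores = util.cos_sim(q_emb, d_emb)
print(f"bi-encoder:    {time.perf_counter() - start:.3f}s for {len(docs)} docs")

# Cross encoder: one joint forward pass per (query, document) pair.
cross = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
start = time.perf_counter()
cross_scores = cross.predict([(query, d) for d in docs])
print(f"cross encoder: {time.perf_counter() - start:.3f}s for {len(docs)} docs")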
Implementing a Cross Encoder Reranker: A Simple Example
# Helper function to format and print document content
def pretty_print_docs(docs):
    # Print each document in the list with a separator between them
    print(
        f"\n{'-' * 100}\n".join(  # Separator line for better readability
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]  # Format: document number + content
        )
    )
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load documents
documents = TextLoader("./data/appendix-keywords.txt").load()

# Configure the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

# Split documents into chunks
texts = text_splitter.split_documents(documents)

# Set up the embedding model
embeddings_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/msmarco-distilbert-dot-v5",
    model_kwargs={"tokenizer_kwargs": {"clean_up_tokenization_spaces": False}},
)

# Create a FAISS index from the documents and set up the retriever
retriever = FAISS.from_documents(texts, embeddings_model).as_retriever(
    search_kwargs={"k": 10}
)
# Define the query
query = "Can you tell me about Word2Vec?"
# Execute the query and retrieve results
docs = retriever.invoke(query)
# Display the retrieved documents
pretty_print_docs(docs)
Document 1:
Word2Vec
Definition: Word2Vec is a technique in NLP that maps words to a vector space, representing their semantic relationships based on context.
Example: In a Word2Vec model, "king" and "queen" are represented by vectors located close to each other.
Related Keywords: Natural Language Processing (NLP), Embedding, Semantic Similarity
----------------------------------------------------------------------------------------------------
Document 2:
Token
Definition: A token refers to a smaller unit of text obtained by splitting a larger piece of text. It can be a word, phrase, or sentence.
Example: The sentence "I go to school" can be tokenized into "I," "go," "to," and "school."
Related Keywords: Tokenization, Natural Language Processing (NLP), Syntax Analysis
----------------------------------------------------------------------------------------------------
Document 3:
Example: A customer information table in a relational database is an example of structured data.
Related Keywords: Database, Data Analysis, Data Modeling
----------------------------------------------------------------------------------------------------
Document 4:
Schema
Definition: A schema defines the structure of a database or file, detailing how data is organized and stored.
Example: A relational database schema specifies column names, data types, and key constraints.
Related Keywords: Database, Data Modeling, Data Management
----------------------------------------------------------------------------------------------------
Document 5:
Keyword Search
Definition: Keyword search involves finding information based on user-inputted keywords, commonly used in search engines and database systems.
Example: Searching
When a user searches for "coffee shops in Seoul," the system returns a list of relevant coffee shops.
Related Keywords: Search Engine, Data Search, Information Retrieval
----------------------------------------------------------------------------------------------------
Document 6:
TF-IDF (Term Frequency-Inverse Document Frequency)
Definition: TF-IDF is a statistical measure used to evaluate the importance of a word within a document by considering its frequency and rarity across a corpus.
Example: Words with high TF-IDF values are often unique and critical for understanding the document.
Related Keywords: Natural Language Processing (NLP), Information Retrieval, Data Mining
----------------------------------------------------------------------------------------------------
Document 7:
SQL
Definition: SQL (Structured Query Language) is a programming language for managing data in databases.
It allows you to perform various operations such as querying, updating, inserting, and deleting data.
Example: SELECT * FROM users WHERE age > 18; retrieves information about users aged above 18.
Related Keywords: Database, Query, Data Management
----------------------------------------------------------------------------------------------------
Document 8:
Open Source
Definition: Open source software allows its source code to be freely used, modified, and distributed, fostering collaboration and innovation.
Example: The Linux operating system is a well-known open source project.
Related Keywords: Software Development, Community, Technical Collaboration
Structured Data
Definition: Structured data is organized according to a specific format or schema, making it easy to search and analyze.
----------------------------------------------------------------------------------------------------
Document 9:
Semantic Search
Definition: Semantic search is a search technique that understands the meaning of a user's query beyond simple keyword matching, returning results that are contextually relevant.
Example: If a user searches for "planets in the solar system," the system provides information about planets like Jupiter and Mars.
Related Keywords: Natural Language Processing (NLP), Search Algorithms, Data Mining
----------------------------------------------------------------------------------------------------
Document 10:
GPT (Generative Pretrained Transformer)
Definition: GPT is a generative language model pre-trained on vast datasets, capable of performing various text-based tasks. It generates natural and coherent text based on input.
Example: A chatbot generating detailed answers to user queries is powered by GPT models.
Related Keywords: Natural Language Processing (NLP), Text Generation, Deep Learning
Now, let's wrap the base retriever in a ContextualCompressionRetriever. The CrossEncoderReranker uses a HuggingFaceCrossEncoder to re-rank the retrieved results.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Initialize the cross encoder model
model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-v2-m3")

# Keep only the top 3 documents after reranking
compressor = CrossEncoderReranker(model=model, top_n=3)

# Initialize the contextual compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
# Retrieve compressed documents
compressed_docs = compression_retriever.invoke("Can you tell me about Word2Vec?")
# Display the documents
pretty_print_docs(compressed_docs)
Document 1:
Word2Vec
Definition: Word2Vec is a technique in NLP that maps words to a vector space, representing their semantic relationships based on context.
Example: In a Word2Vec model, "king" and "queen" are represented by vectors located close to each other.
Related Keywords: Natural Language Processing (NLP), Embedding, Semantic Similarity
----------------------------------------------------------------------------------------------------
Document 2:
Open Source
Definition: Open source software allows its source code to be freely used, modified, and distributed, fostering collaboration and innovation.
Example: The Linux operating system is a well-known open source project.
Related Keywords: Software Development, Community, Technical Collaboration
Structured Data
Definition: Structured data is organized according to a specific format or schema, making it easy to search and analyze.
----------------------------------------------------------------------------------------------------
Document 3:
TF-IDF (Term Frequency-Inverse Document Frequency)
Definition: TF-IDF is a statistical measure used to evaluate the importance of a word within a document by considering its frequency and rarity across a corpus.
Example: Words with high TF-IDF values are often unique and critical for understanding the document.
Related Keywords: Natural Language Processing (NLP), Information Retrieval, Data Mining
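To close the loop on the RAG use case this guide opened with, the reranked retriever can be dropped into a simple LCEL chain. A minimal sketch, assuming langchain-openai is installed and OPENAI_API_KEY is set; gpt-4o-mini is an assumed model name, swap in whatever you use.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

# Prompt that grounds the answer in the reranked context
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Join the reranked documents into a single context string
    return "\n\n".join(d.page_content for d in docs)

chain = (
    {"context": compression_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o-mini")  # assumed model name
    | StrOutputParser()
)

print(chain.invoke("Can you tell me about Word2Vec?"))
Because the reranker trims the candidates to the three most relevant chunks, the prompt stays small while keeping the context the model actually needs.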