Cross Encoder Reranker

Author: JeongHo Shin
Peer Review:
Proofread: JaeJun Shim
This is a part of LangChain Open Tutorial

Overview

The Cross Encoder Reranker is a technique designed to enhance the performance of Retrieval-Augmented Generation (RAG) systems. This guide explains how to implement a reranker using Hugging Face's Cross Encoders to refine the ranking of retrieved documents, promoting those most relevant to a query.

References

Hugging Face Cross Encoders

Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

langchain-opentutorial is a package that provides easy-to-use environment setup tools, helper functions, and utilities for tutorials.
You can check out the langchain-opentutorial for more details.

%%capture --no-stderr
%pip install langchain-opentutorial

# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
    }
)

You can alternatively set OPENAI_API_KEY in a .env file and load it.

[Note] This is not necessary if you've already set OPENAI_API_KEY in previous steps.

# Configuration file to manage API keys as environment variables
from dotenv import load_dotenv

# Load API key information
load_dotenv(override=True)

Key Features and Mechanism

Purpose

Reorder the retrieved documents so that those most relevant to the query rank first.

Structure

Accepts both the query and document as a single input pair, enabling joint processing.

Mechanism

Single Input Pair : Processes the query and document as a combined input to output a relevance score directly.
Self-Attention Mechanism : Uses self-attention to jointly analyze the query and document, effectively capturing their semantic relationship.
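To make the "single input pair" idea concrete, here is a minimal sketch of how a BERT-style cross encoder might lay out its input. The function name build_pair_input and the whitespace tokenization are illustrative assumptions; real tokenizers use subword vocabularies, and the special tokens differ between model families.

```python
def build_pair_input(query: str, document: str) -> tuple[list[str], list[int]]:
    """Toy BERT-style pair encoding (hypothetical helper, whitespace tokens only)."""
    query_tokens = query.lower().split()
    doc_tokens = document.lower().split()

    # Query and document are concatenated into ONE sequence, so self-attention
    # can relate every query token to every document token.
    tokens = ["[CLS]"] + query_tokens + ["[SEP]"] + doc_tokens + ["[SEP]"]

    # Segment ids mark which part each token belongs to:
    # 0 for [CLS] + query + first [SEP], 1 for document + final [SEP].
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair_input(
    "what is word2vec", "Word2Vec maps words to vectors"
)
print(tokens)
print(segment_ids)
```

Because the model sees both texts in the same sequence, the relevance score depends on their interaction rather than on two independently computed vectors.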

Advantages

Higher Accuracy : Provides more precise relevance scores than comparing independently computed embeddings.
Deep Contextual Analysis : Explores semantic nuances between query and document.

Limitations

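High Computational Costs : Each query-document pair requires its own pass through the model, so scores cannot be precomputed and processing can be time-intensive.
Scalability Issues : Not suitable for scoring large document collections directly; it is usually applied only to a small set of candidates from a faster first-stage retriever.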

Practical Applications

A Bi-Encoder quickly retrieves candidate documents by computing lightweight similarity scores.
A Cross Encoder refines these results by deeply analyzing the semantic relationship between the query and the retrieved documents.
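The two-stage pattern described above can be sketched with toy stand-ins. Everything here is an assumption for illustration: embed is a bag-of-words substitute for a Bi-Encoder, toy_cross_score is a term-overlap substitute for a Cross Encoder, and the corpus is made up. Only the control flow (cheap retrieval of k candidates, then pairwise rescoring) mirrors the real setup.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in for a bi-encoder: each text is embedded on its own."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def toy_cross_score(query: str, document: str) -> float:
    """Stand-in for a cross encoder: scores the (query, document) pair jointly.
    Here: the fraction of query terms that also occur in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(document.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def retrieve_then_rerank(query, corpus, k=3, top_n=1):
    # Stage 1: document vectors could be precomputed once and reused.
    doc_vectors = [embed(doc) for doc in corpus]
    q_vec = embed(query)
    candidates = sorted(
        range(len(corpus)),
        key=lambda i: cosine(q_vec, doc_vectors[i]),
        reverse=True,
    )[:k]
    # Stage 2: only the k candidates are rescored as (query, document) pairs.
    reranked = sorted(
        candidates, key=lambda i: toy_cross_score(query, corpus[i]), reverse=True
    )
    return [corpus[i] for i in reranked[:top_n]]


corpus = [
    "word2vec maps words to vectors",
    "sql is a language for querying databases",
    "tf-idf weighs words by frequency and rarity",
    "open source software can be freely modified",
]
print(retrieve_then_rerank("how does word2vec represent words", corpus))
```

The expensive pairwise scorer never touches the full corpus; it only sees the k candidates that survive the first stage.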

Implementation

Use Hugging Face cross encoders or BAAI/bge-reranker models.
Easily integrate with frameworks like LangChain through the CrossEncoderReranker component.

Key Advantages of Reranker

Precise Similarity Scoring : Delivers highly accurate measurements of relevance between the query and documents.
Semantic Depth : Analyzes deeper semantic relationships, uncovering nuances in query-document interactions.
Refined Search Quality : Improves the relevance and quality of the retrieved documents.
RAG System Boost : Enhances the performance of Retrieval-Augmented Generation (RAG) systems by refining input relevance.
Seamless Integration : Easily adaptable to various workflows and compatible with multiple frameworks.
Model Versatility : Offers flexibility with a wide range of pre-trained models for tailored use cases.

Document Count Settings for Reranker

Reranking is generally performed on the top 5–10 documents retrieved during the initial search.
The ideal number of documents for reranking should be determined through experimentation and evaluation, as it depends on the dataset characteristics and computational resources available.
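As a rough illustration of why the candidate count matters, the sketch below counts how many query-document pairs must be scored for different values of k. CountingScorer is a hypothetical stand-in for a real cross encoder, and the documents are synthetic; with a real model, each counted call corresponds to one pair passed through the network.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Score every (query, candidate) pair and keep the top_n best."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


class CountingScorer:
    """Hypothetical scorer that records how many pairs it was asked to score."""

    def __init__(self):
        self.calls = 0

    def __call__(self, query, document):
        self.calls += 1
        # Toy relevance: number of terms shared by query and document.
        return len(set(query.split()) & set(document.split()))


# Synthetic candidate pool (made-up data for illustration only).
docs = [f"document {i} about topic {i % 4}" for i in range(50)]

for k in (5, 10, 20, 50):
    scorer = CountingScorer()
    rerank("topic 3", docs[:k], scorer, top_n=3)
    print(f"k={k:>2}: {scorer.calls} pairs scored")
```

The number of scored pairs grows linearly with k, so a sensible experiment is to increase k until answer quality stops improving and compare that against the added latency.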

Trade-offs When Using a Reranker

Accuracy vs Processing Time : Striking a balance between achieving higher accuracy and minimizing processing time.
Performance Improvement vs Computational Cost : Weighing the benefits of improved performance against the additional computational resources required.
Search Speed vs Relevance Accuracy : Managing the trade-off between faster retrieval and maintaining high relevance in results.
System Requirements : Ensuring the system meets the necessary hardware and software requirements to support reranking.
Dataset Characteristics : Considering the scale, diversity, and specific attributes of the dataset to optimize reranker performance.

Implementing a Cross Encoder Reranker with a Simple Example

# Helper function to format and print document content
def pretty_print_docs(docs):
    # Print each document in the list with a separator between them
    print(
        f"\n{'-' * 100}\n".join(  # Separator line for better readability
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]  # Format: Document number + content
        )
    )

from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load documents
documents = TextLoader("./data/appendix-keywords.txt").load()

# Configure text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

# Split documents into chunks
texts = text_splitter.split_documents(documents)

# Set up the embedding model
embeddings_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/msmarco-distilbert-dot-v5",
    model_kwargs={"tokenizer_kwargs": {"clean_up_tokenization_spaces": False}},
)

# Create FAISS index from documents and set up retriever
retriever = FAISS.from_documents(texts, embeddings_model).as_retriever(
    search_kwargs={"k": 10}
)

# Define the query
query = "Can you tell me about Word2Vec?"

# Execute the query and retrieve results
docs = retriever.invoke(query)

# Display the retrieved documents
pretty_print_docs(docs)

Document 1:
    
    Word2Vec
    Definition: Word2Vec is a technique in NLP that maps words to a vector space, representing their semantic relationships based on context.
    Example: In a Word2Vec model, "king" and "queen" are represented by vectors located close to each other.
    Related Keywords: Natural Language Processing (NLP), Embedding, Semantic Similarity
    ----------------------------------------------------------------------------------------------------
    Document 2:
    
    Token
    Definition: A token refers to a smaller unit of text obtained by splitting a larger piece of text. It can be a word, phrase, or sentence.
    Example: The sentence "I go to school" can be tokenized into "I," "go," "to," and "school."
    Related Keywords: Tokenization, Natural Language Processing (NLP), Syntax Analysis
    ----------------------------------------------------------------------------------------------------
    Document 3:
    
    Example: A customer information table in a relational database is an example of structured data.
    Related Keywords: Database, Data Analysis, Data Modeling
    ----------------------------------------------------------------------------------------------------
    Document 4:
    
    Schema
    Definition: A schema defines the structure of a database or file, detailing how data is organized and stored.
    Example: A relational database schema specifies column names, data types, and key constraints.
    Related Keywords: Database, Data Modeling, Data Management
    ----------------------------------------------------------------------------------------------------
    Document 5:
    
    Keyword Search
    Definition: Keyword search involves finding information based on user-inputted keywords, commonly used in search engines and database systems.
    Example: Searching 
    When a user searches for "coffee shops in Seoul," the system returns a list of relevant coffee shops.
    Related Keywords: Search Engine, Data Search, Information Retrieval
    ----------------------------------------------------------------------------------------------------
    Document 6:
    
    TF-IDF (Term Frequency-Inverse Document Frequency)
    Definition: TF-IDF is a statistical measure used to evaluate the importance of a word within a document by considering its frequency and rarity across a corpus.
    Example: Words with high TF-IDF values are often unique and critical for understanding the document.
    Related Keywords: Natural Language Processing (NLP), Information Retrieval, Data Mining
    ----------------------------------------------------------------------------------------------------
    Document 7:
    
    SQL
    Definition: SQL (Structured Query Language) is a programming language for managing data in databases. 
    It allows you to perform various operations such as querying, updating, inserting, and deleting data.
    Example: SELECT * FROM users WHERE age > 18; retrieves information about users aged above 18.
    Related Keywords: Database, Query, Data Management
    ----------------------------------------------------------------------------------------------------
    Document 8:
    
    Open Source
    Definition: Open source software allows its source code to be freely used, modified, and distributed, fostering collaboration and innovation.
    Example: The Linux operating system is a well-known open source project.
    Related Keywords: Software Development, Community, Technical Collaboration
    Structured Data
    Definition: Structured data is organized according to a specific format or schema, making it easy to search and analyze.
    ----------------------------------------------------------------------------------------------------
    Document 9:
    
    Semantic Search
    Definition: Semantic search is a search technique that understands the meaning of a user's query beyond simple keyword matching, returning results that are contextually relevant.
    Example: If a user searches for "planets in the solar system," the system provides information about planets like Jupiter and Mars.
    Related Keywords: Natural Language Processing (NLP), Search Algorithms, Data Mining
    ----------------------------------------------------------------------------------------------------
    Document 10:
    
    GPT (Generative Pretrained Transformer)
    Definition: GPT is a generative language model pre-trained on vast datasets, capable of performing various text-based tasks. It generates natural and coherent text based on input.
    Example: A chatbot generating detailed answers to user queries is powered by GPT models.
    Related Keywords: Natural Language Processing (NLP), Text Generation, Deep Learning

Now, let's wrap the base retriever (retriever) with a ContextualCompressionRetriever. The CrossEncoderReranker uses HuggingFaceCrossEncoder to re-rank the retrieved results and keep only the top-scoring documents.

Multilingual BGE reranker: bge-reranker-v2-m3

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Initialize the model
model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-v2-m3")

# Select the top 3 documents
compressor = CrossEncoderReranker(model=model, top_n=3)

# Initialize the contextual compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

# Retrieve compressed documents
compressed_docs = compression_retriever.invoke("Can you tell me about Word2Vec?")

# Display the documents
pretty_print_docs(compressed_docs)

Document 1:
    
    Word2Vec
    Definition: Word2Vec is a technique in NLP that maps words to a vector space, representing their semantic relationships based on context.
    Example: In a Word2Vec model, "king" and "queen" are represented by vectors located close to each other.
    Related Keywords: Natural Language Processing (NLP), Embedding, Semantic Similarity
    ----------------------------------------------------------------------------------------------------
    Document 2:
    
    Open Source
    Definition: Open source software allows its source code to be freely used, modified, and distributed, fostering collaboration and innovation.
    Example: The Linux operating system is a well-known open source project.
    Related Keywords: Software Development, Community, Technical Collaboration
    Structured Data
    Definition: Structured data is organized according to a specific format or schema, making it easy to search and analyze.
    ----------------------------------------------------------------------------------------------------
    Document 3:
    
    TF-IDF (Term Frequency-Inverse Document Frequency)
    Definition: TF-IDF is a statistical measure used to evaluate the importance of a word within a document by considering its frequency and rarity across a corpus.
    Example: Words with high TF-IDF values are often unique and critical for understanding the document.
    Related Keywords: Natural Language Processing (NLP), Information Retrieval, Data Mining
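The retriever returns only the reordered documents, not their scores. To see why a document landed where it did, it can help to score the pairs directly. The sketch below uses stand-ins so that it runs without downloading a model: Doc mimics a document with a page_content attribute, and StubCrossEncoder is a hypothetical scorer with a score(text_pairs) method. It assumes the real model object exposes a score method of the same shape, in which case it could be passed in place of the stub.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal stand-in for a document; only page_content is used."""

    page_content: str


class StubCrossEncoder:
    """Hypothetical scorer exposing score(text_pairs) -> list of floats."""

    def score(self, text_pairs):
        # Toy relevance: number of lowercase terms shared by query and text.
        return [
            float(len(set(q.lower().split()) & set(t.lower().split())))
            for q, t in text_pairs
        ]


def rank_with_scores(model, query, docs):
    """Return (score, doc) pairs sorted from most to least relevant."""
    scores = model.score([(query, d.page_content) for d in docs])
    return sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)


docs = [
    Doc("SQL is a language for databases"),
    Doc("Word2Vec maps words to vectors"),
    Doc("Open source software"),
]
for score, doc in rank_with_scores(StubCrossEncoder(), "tell me about word2vec", docs):
    print(f"{score:.2f}  {doc.page_content}")
```

Printing scores next to documents makes it easier to judge whether the lower-ranked results are genuinely relevant or simply the best of a weak candidate set, and whether a score threshold would serve better than a fixed top_n.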
