This tutorial shows you how to evaluate the quality of your LLM output using RAGAS.
Before starting, let's first review the metrics used in this tutorial: Context Recall, Context Precision, Answer Relevancy, and Faithfulness.
Context Recall
It estimates "how well the retrieved context covers the claims in the ground-truth answer".
It is calculated using the question, ground truth, and retrieved context. The value is between 0 and 1, and higher values indicate better performance. To estimate $\text{Context Recall}$ from the ground-truth answer, each claim in the ground-truth answer is analyzed to see if it can be attributed to the retrieved context. In the ideal scenario, every claim in the ground-truth answer is attributable to the retrieved context.
$$\text{Context Recall} = \frac{|\text{GT claims that can be attributed to context}|}{|\text{Number of claims in GT}|}$$
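As a quick toy illustration of the formula (not RAGAS's internal implementation), suppose the ground-truth answer contains four claims and three of them can be attributed to the retrieved context:
# Toy illustration of the Context Recall formula
attributable_claims = 3  # GT claims supported by the retrieved context
total_gt_claims = 4      # all claims in the ground-truth answer
context_recall = attributable_claims / total_gt_claims
print(context_recall)  # 0.75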
Context Precision
It estimates "whether ground-truth related items in contexts are ranked at the top".
Ideally, all relevant chunks should appear in the top ranks. This metric is calculated using question, ground_truth, and contexts, with values ranging from 0 to 1. Higher scores indicate better precision.
The formula for $\text{Context Precision@K}$ is as follows:
$$\text{Context Precision@K} = \frac{\sum_{k=1}^{K} (\text{Precision@k} \times v_k)}{\text{Total number of relevant items in the top } K \text{ results}}$$
Here, $\text{Precision@k}$ is calculated as follows:
$$\text{Precision@k} = \frac{\text{true positives@k}}{\text{true positives@k} + \text{false positives@k}}$$
$K$ is the total number of chunks in contexts, and $v_k \in \{0, 1\}$ is the relevance indicator at rank $k$.
This metric evaluates the quality of the retrieved context in information retrieval systems by measuring how well relevant information is placed in the top ranks.
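The sketch below works through the formula on toy relevance indicators (illustrative values only; in RAGAS, $v_k$ is judged by an LLM):
# Toy illustration of Context Precision@K with K=3 retrieved chunks
v = [1, 0, 1]  # v_k: relevance indicator at each rank (1 = relevant)
K = len(v)
total_relevant = sum(v)
# Precision@k = (# relevant in top k) / k, counted only at relevant ranks
score = sum((sum(v[: k + 1]) / (k + 1)) * v[k] for k in range(K)) / total_relevant
print(round(score, 3))  # (1/1 + 2/3) / 2 = 0.833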
Answer Relevancy (Response Relevancy)
It is a metric that evaluates "how well the generated answer matches the given prompt".
The main features and calculation methods of this metric are as follows:
Purpose: Evaluate the relevance of the generated answer.
Score interpretation: Lower scores indicate incomplete answers or answers containing redundant information, while higher scores indicate better relevance.
Elements used in calculation: question, context, answer
The calculation method for $\text{Answer Relevancy}$ is defined as the average cosine similarity between the original question and synthetic questions that are generated back from the answer:
$$\text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos(E_{g_i}, E_o)$$
Here:
$E_{g_i}$ : the embedding of the generated question $i$
$E_o$ : the embedding of the original question
$N$ : the number of generated questions (default value is 3)
Note:
The actual score usually falls between 0 and 1, but because of the characteristics of cosine similarity, it can mathematically range from -1 to 1.
This metric is useful for evaluating the performance of question-answering systems, particularly for measuring how well the generated answer reflects the original question's intent.
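Here is a minimal sketch of the scoring step with toy embedding vectors (in RAGAS, the synthetic questions are generated by an LLM and embedded with your embedding model):
import numpy as np

# Toy embeddings: one original question, N=3 generated questions
E_o = np.array([0.9, 0.1, 0.3])
E_g = [
    np.array([0.8, 0.2, 0.3]),
    np.array([0.7, 0.1, 0.4]),
    np.array([0.9, 0.0, 0.2]),
]

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Average cosine similarity between each generated question and the original
answer_relevancy_score = float(np.mean([cos_sim(e, E_o) for e in E_g]))
print(answer_relevancy_score)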
Faithfulness
It is a metric that evaluates "the factual consistency of the generated answer compared to the given context".
The main features and calculation methods of this metric are as follows:
Purpose: Evaluate the factual consistency of the generated answer compared to the given context.
Calculation elements: Use the generated answer and the retrieved context.
Score range: Adjusted between 0 and 1, with higher values indicating better performance.
The calculation method for the $\text{Faithfulness score}$ is as follows:
$$\text{Faithfulness score} = \frac{|\text{Number of claims in the generated answer that can be inferred from the given context}|}{|\text{Total number of claims in the generated answer}|}$$
Calculation process:
Identify claims in the generated answer.
Verify each claim against the given context to check if it can be inferred from the context.
Use the above formula to calculate the score.
Example:
Question: "When and where was Einstein born?"
Context: "Albert Einstein (born March 14, 1879) is a German-born theoretical physicist, widely considered one of the most influential scientists of all time."
High faithfulness answer: "Einstein was born in Germany on March 14, 1879."
Low faithfulness answer: "Einstein was born in Germany on March 20, 1879."
This metric is useful for evaluating the performance of question-answering systems, particularly for measuring how well the generated answer reflects the given context.
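Applying the formula to the low-faithfulness answer above (a toy breakdown; RAGAS extracts and verifies claims with an LLM), the answer makes two claims and only one is supported by the context:
# Toy illustration of the Faithfulness formula for the low-faithfulness answer
supported_claims = 1  # "born in Germany" can be inferred from the context
total_claims = 2      # the date "March 20, 1879" contradicts the context
faithfulness_score = supported_claims / total_claims
print(faithfulness_score)  # 0.5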
Below is a sample context chunk from the agents whitepaper used in this tutorial; the code that follows builds a simple RAG chain over this document.
['Agents\nWhat is an agent?\nIn its most fundamental form, a Generative AI agent can be defined as an application that\nattempts to achieve a goal by observing the world and acting upon it using the tools that it\nhas at its disposal. Agents are autonomous and can act independently of human intervention,\nespecially when provided with proper goals or objectives they are meant to achieve. Agents\ncan also be proactive in their approach to reaching their goals. Even in the absence of\nexplicit instruction sets from a human, an agent can reason about what it should do next to\nachieve its ultimate goal. While the notion of agents in AI is quite general and powerful, this\nwhitepaper focuses on the specific types of agents that Generative AI models are capable of\nbuilding at the time of publication.\nIn order to understand the inner workings of an agent, let’s first introduce the foundational\ncomponents that drive the agent’s behavior, actions, and decision making. The combination\nof these components can be described as a cognitive architecture, and there are many\nsuch architectures that can be achieved by the mixing and matching of these components.\nFocusing on the core functionalities, there are three essential components in an agent’s\ncognitive architecture as shown in Figure 1.\nSeptember 2024 5\n']
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
# Step 1: Load Documents
loader = PyMuPDFLoader("data/Newwhitepaper_Agents2.pdf")
docs = loader.load()
# Step 2: Split Documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
split_documents = text_splitter.split_documents(docs)
# Step 3: Create Embeddings
embeddings = OpenAIEmbeddings()
# Step 4: Create DB and Save
vectorstore = FAISS.from_documents(documents=split_documents, embedding=embeddings)
# Step 5: Create Retriever
retriever = vectorstore.as_retriever()
# Step 6: Create Prompt
prompt = PromptTemplate.from_template(
"""You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
#Context:
{context}
#Question:
{question}
#Answer:"""
)
# Step 7: Create LLM
llm = ChatOpenAI(model_name="gpt-4o", temperature=0)
# Step 8: Create Chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
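Before running a full batch, you can sanity-check the chain with a single question (the question below is just an illustration):
# Quick single-question sanity check
print(chain.invoke("What is an agent?"))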
Create a batch dataset by collecting the questions into batch_dataset.
A batch dataset is useful when you want to process a large number of questions at once.
batch_dataset = [question for question in test_dataset["user_input"]]
batch_dataset[:3]
['What is the role of generative AI in the context of agents?',
"What are the essential components of an agent's cognitive architecture as of September 2024?",
'What are the key considerations for selecting a model for an agent in the context of advancements expected by September 2024?']
Call batch() to get answers for the batch dataset (batch_dataset).
answer = chain.batch(batch_dataset)
answer[:3]
['The role of generative AI in the context of agents is to extend the capabilities of language models by leveraging tools to access real-time information, suggest real-world actions, and autonomously plan and execute complex tasks. Generative AI models can be trained to use external tools to access specific information or perform actions, such as making API calls to send emails or complete transactions. These agents are autonomous, capable of acting independently of human intervention, and can proactively reason about what actions to take to achieve their goals. The orchestration layer, a cognitive architecture, structures the reasoning, planning, and decision-making processes of these agents.',
"The essential components of an agent's cognitive architecture as of September 2024 include the core functionalities that drive the agent’s behavior, actions, and decision-making. These components can be described as a cognitive architecture, which involves the orchestration layer that structures reasoning, planning, decision-making, and guides the agent's actions.",
"The key considerations for selecting a model for an agent, in the context of advancements expected by September 2024, include the model's ability to reason and act on various tasks, its capability to select the right tools, and how well those tools have been defined. Additionally, the model should be able to interact with external data and services through tools, which can significantly extend its capabilities beyond what the foundational model alone can achieve. This involves using tools like web API methods (GET, POST, PATCH, DELETE) to access and process real-world information, thereby supporting more specialized systems like retrieval augmented generation (RAG). The iterative approach to building complex agent architectures, focusing on experimentation and refinement, is also crucial to finding solutions for specific business cases and organizational needs."]
Store the answers generated by the LLM in the answer column.
# Overwrite or add 'answer' column
if "answer" in test_dataset.column_names:
test_dataset = test_dataset.remove_columns(["answer"]).add_column("answer", answer)
else:
test_dataset = test_dataset.add_column("answer", answer)
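A quick check confirms that the column was added:
# Verify that the 'answer' column now exists
print(test_dataset.column_names)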
Evaluate the answers
Using ragas.evaluate(), we can evaluate the answers.
from datasets import Dataset

from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)
# Format dataset structure
formatted_dataset = []
for item in test_dataset:
    formatted_item = {
        "question": item["user_input"],
        "answer": item["answer"],
        "reference": item["reference"],  # ground-truth answer from the test set
        "contexts": item["reference_contexts"],
        # The test set's reference contexts are reused as retrieved contexts here
        "retrieved_contexts": item["reference_contexts"],
    }
    formatted_dataset.append(formatted_item)
# Convert to RAGAS dataset
ragas_dataset = Dataset.from_list(formatted_dataset)
result = evaluate(
    dataset=ragas_dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)
result
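To inspect per-question scores, convert the result to a pandas DataFrame using the to_pandas() helper on the result object:
# Per-question metric scores as a DataFrame
result_df = result.to_pandas()
result_df.head()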