Evaluation using RAGAS


Overview

This tutorial shows how to evaluate the quality of your LLM output using RAGAS.

Before starting, let's review the four metrics used in this tutorial: Context Recall, Context Precision, Answer Relevancy, and Faithfulness.

Context Recall

It estimates how well the retrieved context covers the ground-truth answer. It is calculated from the question, the ground truth, and the retrieved context. The value ranges from 0 to 1, with higher values indicating better performance. To estimate $\text{Context Recall}$, each claim in the ground-truth answer is analyzed to see whether it can be attributed to the retrieved context. In the ideal scenario, every claim in the ground-truth answer is attributable to the retrieved context.

$$\text{Context Recall} = \frac{|\text{GT claims that can be attributed to context}|}{|\text{Number of claims in GT}|}$$
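As a toy illustration of this ratio (the per-claim attribution verdicts, which RAGAS obtains with an LLM judge, are hard-coded booleans here):

```python
def context_recall(claim_attributable: list[bool]) -> float:
    """claim_attributable[i] is True if ground-truth claim i can be
    attributed to the retrieved context."""
    if not claim_attributable:
        return 0.0
    return sum(claim_attributable) / len(claim_attributable)

# 3 of 4 ground-truth claims are supported by the retrieved context:
print(context_recall([True, True, False, True]))  # → 0.75
```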

Context Precision

It estimates whether the ground-truth-relevant items in the contexts are ranked at the top.

Ideally, all relevant chunks should appear in the top ranks. This metric is calculated using question, ground_truth, and contexts, with values ranging from 0 to 1. Higher scores indicate better precision.

The formula for $\text{Context Precision@K}$ is as follows:

$$\text{Context Precision@K} = \frac{\sum_{k=1}^{K} (\text{Precision@k} \times v_k)}{\text{Total number of relevant items in the top } K \text{ results}}$$

Here, $\text{Precision@k}$ is calculated as follows:

$$\text{Precision@k} = \frac{\text{true positives@k}}{\text{true positives@k} + \text{false positives@k}}$$

$K$ is the total number of chunks in contexts, and $v_k \in \{0, 1\}$ is the relevance indicator at rank $k$.

This metric is used to evaluate the quality of the retrieved context in information retrieval systems. It measures how well relevant information is placed in the top ranks, allowing for performance assessment.
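A small sketch of the formula above, assuming the per-rank relevance indicators $v_k$ are already known (RAGAS derives them with an LLM):

```python
def context_precision_at_k(relevance: list[int]) -> float:
    """Compute Context Precision@K from per-rank relevance indicators.
    relevance[k-1] is v_k in {0, 1}: 1 if the chunk at rank k is relevant."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    weighted = 0.0
    hits = 0  # true positives seen so far (true positives@k)
    for k, v_k in enumerate(relevance, start=1):
        hits += v_k
        precision_at_k = hits / k  # TP@k / (TP@k + FP@k)
        weighted += precision_at_k * v_k
    return weighted / total_relevant

# Relevant chunks at ranks 1 and 3 out of K = 3:
print(round(context_precision_at_k([1, 0, 1]), 4))  # → 0.8333
```

Note how a relevant chunk pushed down the ranking lowers the score even when it is still retrieved, which is exactly the "ranked at the top" property the metric rewards.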

Answer Relevancy (Response Relevancy)

It is a metric that evaluates how well the generated answer addresses the given prompt.

The main features and calculation methods of this metric are as follows:

  1. Purpose: Evaluate the relevance of the generated answer.

  2. Score interpretation: Lower scores indicate incomplete or redundant information in the answer, while higher scores indicate better relevancy.

  3. Elements used in calculation: question, context, answer

The calculation of $\text{Answer Relevancy}$ works in reverse: an LLM generates $N$ synthetic questions from the answer, and the metric is defined as the average cosine similarity between the embeddings of these generated questions and the embedding of the original question.

$$\text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos(E_{g_i}, E_o) = \frac{1}{N} \sum_{i=1}^{N} \frac{E_{g_i} \cdot E_o}{\|E_{g_i}\| \|E_o\|}$$

Here:

  • $E_{g_i}$ : the embedding of the generated question $i$

  • $E_o$ : the embedding of the original question

  • $N$ : the number of generated questions (default value is 3)

Note:

  • In practice the score usually falls between 0 and 1, but because it is a cosine similarity it can mathematically range from -1 to 1.

This metric is useful for evaluating the performance of question-answering systems, particularly for measuring how well the generated answer reflects the original question's intent.
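The averaging step can be sketched with toy 2-D embeddings (real embeddings come from an embedding model; these vectors are made up for illustration):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def answer_relevancy(generated_embeddings: list[list[float]],
                     original_embedding: list[float]) -> float:
    """Mean cosine similarity between each generated question's
    embedding E_{g_i} and the original question's embedding E_o."""
    sims = [cosine(e, original_embedding) for e in generated_embeddings]
    return sum(sims) / len(sims)

# N = 3 generated questions: identical, orthogonal, and partially
# aligned with the original question's embedding.
score = answer_relevancy([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 0.0])
print(round(score, 3))  # → 0.569
```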

Faithfulness

It is a metric that evaluates the factual consistency of the generated answer against the given context.

The main features and calculation methods of this metric are as follows:

  1. Purpose: Evaluate the factual consistency of the generated answer compared to the given context.

  2. Calculation elements: Use the generated answer and the retrieved context.

  3. Score range: Adjusted between 0 and 1, with higher values indicating better performance.

The calculation method for $\text{Faithfulness score}$ is as follows:

$$\text{Faithfulness score} = \frac{|\text{Claims in the generated answer that can be inferred from the given context}|}{|\text{Total number of claims in the generated answer}|}$$

Calculation process:

  1. Identify claims in the generated answer.

  2. Verify each claim against the given context to check if it can be inferred from the context.

  3. Use the above formula to calculate the score.

Example:

  • Question: "When and where was Einstein born?"

  • Context: "Albert Einstein (born March 14, 1879) is a German-born theoretical physicist, widely considered one of the most influential scientists of all time."

  • High faithfulness answer: "Einstein was born in Germany on March 14, 1879."

  • Low faithfulness answer: "Einstein was born in Germany on March 20, 1879."

This metric is useful for evaluating the performance of question-answering systems, particularly for measuring how well the generated answer reflects the given context.
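The Einstein example can be sketched as follows (the per-claim verdicts, which RAGAS obtains from an LLM judge, are hard-coded booleans here):

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """claim_supported[i] is True if claim i of the generated answer
    can be inferred from the given context."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)

# Two claims per answer: "born in Germany" and the birth date.
print(faithfulness_score([True, True]))   # high-faithfulness answer → 1.0
print(faithfulness_score([True, False]))  # wrong date (March 20) → 0.5
```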


Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, functions, and utilities for tutorials.

  • You can check out langchain-opentutorial for more details.

You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.
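If you prefer not to add a dependency, a minimal stdlib-only `.env` loader can look like the sketch below (python-dotenv's `load_dotenv()` does the same job more robustly):

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: reads KEY=VALUE lines, skipping blanks and
    '#' comments; existing environment variables are not overwritten."""
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())

load_env()  # picks up OPENAI_API_KEY and friends if a .env file is present
```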

Load saved RAGAS dataset

Load the RAGAS dataset that you saved in the previous step.
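A sketch of the loading step, assuming the previous step saved the synthesized test set as a CSV file (the file name is an assumption; adjust it to your own path). For illustration, an in-memory sample with the same columns is parsed here:

```python
import io
import pandas as pd

# In practice: df = pd.read_csv("ragas_synthetic_dataset.csv")  # assumed file name
sample = io.StringIO(
    "user_input,reference_contexts,reference,synthesizer_name\n"
    '"What is an agent?","[...]","An agent is ...","single_hop_specifc_query_synthesizer"\n'
)
df = pd.read_csv(sample)
print(df.columns.tolist())
# → ['user_input', 'reference_contexts', 'reference', 'synthesizer_name']
```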

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What is the role of generative AI in the conte... | ["Agents\nThis combination of reasoning,\nlogi... | Generative AI models can be trained to use too... | single_hop_specifc_query_synthesizer |
| 1 | What are the essential components of an agent'... | ['Agents\nWhat is an agent?\nIn its most funda... | The essential components in an agent's cogniti... | single_hop_specifc_query_synthesizer |
| 2 | What are the key considerations for selecting ... | ['Agents\nFigure 1. General agent architecture... | When selecting a model for an agent, it is cru... | single_hop_specifc_query_synthesizer |
| 3 | How does retrieval augmented generation enhanc... | ['Agents\nThe tools\nFoundational models, desp... | Retrieval augmented generation (RAG) significa... | single_hop_specifc_query_synthesizer |
| 4 | In the context of AI agents, how does the CoT ... | ['Agents\nAgents vs. models\nTo gain a clearer... | The CoT framework enhances reasoning capabilit... | single_hop_specifc_query_synthesizer |

Create a batch dataset by assigning the questions to batch_dataset. A batch dataset is useful when you want to process a large number of questions at once.

Call batch() to get answers for the batch dataset (batch_dataset).

Store the answers generated by the LLM in the answer column.
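The batch step can be sketched as follows. `rag_chain_stub` is a hypothetical stand-in for the RAG chain built earlier, and the helper mirrors the order-preserving behavior of LangChain's `Runnable.batch()`:

```python
from concurrent.futures import ThreadPoolExecutor

def rag_chain_stub(question: str) -> str:
    """Stand-in for the real chain's invoke(); an assumption for illustration."""
    return f"answer to: {question}"

def batch(inputs: list[str], max_workers: int = 4) -> list[str]:
    """Process many questions concurrently, returning answers in the
    same order as the inputs (as Runnable.batch() does)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(rag_chain_stub, inputs))

answers = batch(["What is an agent?", "What is RAG?"])
print(answers)
```

With a real chain you would call `rag_chain.batch(batch_dataset)` directly and assign the result to the answer column.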

Evaluate the answers

Using ragas.evaluate(), we can evaluate the answers.
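A sketch of the evaluation call. The `evaluate()` function and the metric objects follow the ragas API; the `eval_dataset` argument (the answered dataset from the batch step) is an assumption here, so the call is wrapped in a function rather than executed:

```python
def run_evaluation(eval_dataset):
    """Score a RAGAS evaluation dataset on the four metrics above.
    eval_dataset is assumed to be the answered dataset from the batch step."""
    from ragas import evaluate
    from ragas.metrics import (
        answer_relevancy,
        context_precision,
        context_recall,
        faithfulness,
    )

    result = evaluate(
        dataset=eval_dataset,
        metrics=[context_precision, faithfulness, answer_relevancy, context_recall],
    )
    return result.to_pandas()  # one row of metric scores per sample
```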

| | user_input | retrieved_contexts | response | reference | context_precision | faithfulness | answer_relevancy | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | What is the role of generative AI in the conte... | [Agents\nThis combination of reasoning,\nlogic... | The role of generative AI in the context of ag... | The role of generative AI in the context of ag... | 1.0 | 0.470588 | 1.000000 | 0.75 |
| 1 | What are the essential components of an agent'... | [Agents\nWhat is an agent?\nIn its most fundam... | The essential components of an agent's cogniti... | The essential components of an agent's cogniti... | 1.0 | 0.400000 | 1.000000 | 0.50 |
| 2 | What are the key considerations for selecting ... | [Agents\nFigure 1. General agent architecture ... | The key considerations for selecting a model f... | The key considerations for selecting a model f... | 1.0 | 0.333333 | 1.000000 | 0.50 |
| 3 | How does retrieval augmented generation enhanc... | [Agents\nThe tools\nFoundational models, despi... | Retrieval Augmented Generation (RAG) enhances ... | Retrieval Augmented Generation (RAG) enhances ... | 1.0 | 0.500000 | 0.919411 | 1.00 |
| 4 | In the context of AI agents, how does the CoT ... | [Agents\nAgents vs. models\nTo gain a clearer ... | The Chain-of-Thought (CoT) framework enhances ... | The Chain-of-Thought (CoT) framework enhances ... | 1.0 | 0.076923 | 0.944423 | 0.00 |

| | context_precision | faithfulness | answer_relevancy | context_recall |
|---|---|---|---|---|
| 0 | 1.0 | 0.470588 | 1.000000 | 0.750000 |
| 1 | 1.0 | 0.400000 | 1.000000 | 0.500000 |
| 2 | 1.0 | 0.333333 | 1.000000 | 0.500000 |
| 3 | 1.0 | 0.500000 | 0.919411 | 1.000000 |
| 4 | 1.0 | 0.076923 | 0.944423 | 0.000000 |
| 5 | 1.0 | 0.923077 | 0.938009 | 1.000000 |
| 6 | 1.0 | 1.000000 | 0.984205 | 0.750000 |
| 7 | 1.0 | 0.920000 | 0.971321 | 1.000000 |
| 8 | 1.0 | 0.352941 | 0.963824 | 0.666667 |
| 9 | 1.0 | 0.916667 | 0.972590 | 1.000000 |
