Evaluation using RAGAS
Author: Sungchul Kim
Peer Review: Yoonji, Sunyoung Park
This is part of the LangChain Open Tutorial.
Overview
This tutorial will show you how to evaluate the quality of your LLM output using RAGAS.
Before starting this tutorial, let's first review the metrics used in it: Context Recall, Context Precision, Answer Relevancy, and Faithfulness.
Context Recall
It estimates "how well the retrieved context matches the LLM-generated answer" . It is calculated using question, ground truth, and retrieved context. The value is between 0 and 1, and higher values indicate better performance. To estimate $\text{Context Recall}$ from the ground truth answer, each claim in the ground truth answer is analyzed to see if it can be attributed to the retrieved context. In the ideal scenario, all claims in the ground truth answer should be able to be attributed to the retrieved context.
$$\text{Context Recall} = \frac{|\text{GT claims that can be attributed to context}|}{|\text{Number of claims in GT}|}$$
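As a quick illustration, here is a minimal sketch of the formula with hypothetical, hand-counted claim numbers (not the ragas internals, which extract and verify claims with an LLM):

```python
# Context Recall: fraction of ground-truth claims attributable to the context.
def context_recall(supported_gt_claims: int, total_gt_claims: int) -> float:
    return supported_gt_claims / total_gt_claims

# e.g., 3 of 4 ground-truth claims are supported by the retrieved context:
print(context_recall(3, 4))  # 0.75
```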
Context Precision
It estimates "whether ground-truth related items in contexts are ranked at the top" .
Ideally, all relevant chunks should appear in the top ranks. This metric is calculated using question, ground_truth, and contexts, with values ranging from 0 to 1. Higher scores indicate better precision.
The formula for $\text{Context Precision@K}$ is as follows:
$$\text{Context Precision@K} = \frac{\sum_{k=1}^{K} \left(\text{Precision@k} \times v_k\right)}{\text{Total number of relevant items in the top } K \text{ results}}$$
Here, $\text{Precision@k}$ is calculated as follows:
$$\text{Precision@k} = \frac{\text{true positives@k}}{\text{true positives@k} + \text{false positives@k}}$$
$K$ is the total number of chunks in contexts, and $v_k \in \{0, 1\}$ is the relevance indicator at rank $k$.
This metric evaluates the quality of the retrieved context in information retrieval systems by measuring how well relevant information is placed in the top ranks.
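The ranking sensitivity is easiest to see in code. Below is a minimal sketch of the formula (not the ragas implementation), where relevance is a hypothetical list of $v_k$ indicators:

```python
# Context Precision@K from per-rank relevance indicators v_k (1 = relevant).
def context_precision_at_k(relevance: list[int]) -> float:
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score, true_positives = 0.0, 0
    for k, v_k in enumerate(relevance, start=1):
        true_positives += v_k
        precision_at_k = true_positives / k  # TP@k / (TP@k + FP@k)
        score += precision_at_k * v_k
    return score / total_relevant

# Relevant chunks at ranks 1 and 3 (K = 3):
print(context_precision_at_k([1, 0, 1]))  # (1.0 + 2/3) / 2 ≈ 0.833
# The same chunks ranked 2 and 3 score lower, because ranking matters:
print(context_precision_at_k([0, 1, 1]))  # (0.5 + 2/3) / 2 ≈ 0.583
```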
Answer Relevancy (Response Relevancy)
It is a metric that evaluates "how well the generated answer matches the given prompt" .
The main features and calculation methods of this metric are as follows:
Purpose: Evaluate the relevance of the generated answer.
Score interpretation: Lower scores indicate incomplete or redundant information in the answer, while higher scores indicate better relevance.
Elements used in calculation: question, context, answer
The calculation method for $\text{Answer Relevancy}$ is defined as the average cosine similarity between the embedding of the original question and the embeddings of synthetic questions generated from the answer.
$$\text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos(E_{g_i}, E_o) = \frac{1}{N} \sum_{i=1}^{N} \frac{E_{g_i} \cdot E_o}{\lVert E_{g_i} \rVert \, \lVert E_o \rVert}$$
Here:
$E_{g_i}$ : the embedding of the generated question $i$
$E_o$ : the embedding of the original question
$N$ : the number of generated questions (default value is 3)
Note:
In practice the score mostly falls between 0 and 1, but mathematically it can range from -1 to 1 due to the nature of cosine similarity.
This metric is useful for evaluating the performance of question-answering systems, particularly for measuring how well the generated answer reflects the original question's intent.
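Here is a minimal sketch of the calculation, with toy embeddings standing in for a real embedding model (not the ragas implementation):

```python
import numpy as np

# Average cosine similarity between the original-question embedding E_o
# and the embeddings E_g of questions generated from the answer.
def answer_relevancy(e_o: np.ndarray, e_g: list[np.ndarray]) -> float:
    sims = [
        float(np.dot(e, e_o) / (np.linalg.norm(e) * np.linalg.norm(e_o)))
        for e in e_g
    ]
    return sum(sims) / len(sims)

e_o = np.array([1.0, 0.0])                          # original question
e_g = [np.array([1.0, 0.0]), np.array([0.6, 0.8])]  # generated questions
print(answer_relevancy(e_o, e_g))  # (1.0 + 0.6) / 2 = 0.8
```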
Faithfulness
It is a metric that evaluates "the factual consistency of the generated answer compared to the given context" .
The main features and calculation methods of this metric are as follows:
Purpose: Evaluate the factual consistency of the generated answer compared to the given context.
Calculation elements: Use the generated answer and the retrieved context.
Score range: Adjusted between 0 and 1, with higher values indicating better performance.
The calculation method for $\text{Faithfulness score}$ is as follows:
$$\text{Faithfulness score} = \frac{|\text{Number of claims in the generated answer that can be inferred from the given context}|}{|\text{Total number of claims in the generated answer}|}$$
Calculation process:
Identify claims in the generated answer.
Verify each claim against the given context to check if it can be inferred from the context.
Use the above formula to calculate the score.
Example:
Question: "When and where was Einstein born?"
Context: "Albert Einstein (born March 14, 1879) is a German-born theoretical physicist, widely considered one of the most influential scientists of all time."
High faithfulness answer: "Einstein was born in Germany on March 14, 1879."
Low faithfulness answer: "Einstein was born in Germany on March 20, 1879."
This metric is useful for evaluating the performance of question-answering systems, particularly for measuring how well the generated answer reflects the given context.
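Applying the formula to the Einstein example above, with hand-counted claims (a minimal sketch; ragas itself extracts and verifies the claims with an LLM):

```python
# Faithfulness: fraction of answer claims inferable from the given context.
def faithfulness(supported_claims: int, total_claims: int) -> float:
    return supported_claims / total_claims

# The low-faithfulness answer makes 2 claims ("born in Germany",
# "born on March 20, 1879"); only the first is supported by the context.
print(faithfulness(1, 2))  # 0.5

# Both claims of the high-faithfulness answer are supported.
print(faithfulness(2, 2))  # 1.0
```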
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorial is a package that provides a set of easy-to-use environment setup helpers, useful functions, and utilities for tutorials. You can check out langchain-opentutorial for more details.
You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.
[Note] This is not necessary if you've already set the required API keys in previous steps.
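For example, with python-dotenv (a minimal sketch; install the package first if needed):

```python
# Load API keys (e.g. OPENAI_API_KEY) from a .env file in the project root.
# Requires: pip install python-dotenv
from dotenv import load_dotenv

load_dotenv(override=True)
```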
Load saved RAGAS dataset
Load the RAGAS dataset that you saved in the previous step.
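A minimal sketch, assuming the dataset was saved as a CSV file (the path data/ragas_synthetic_dataset.csv is an assumption; adjust it to your setup):

```python
import pandas as pd

# Load the synthetic test dataset generated with RAGAS in the previous step.
df = pd.read_csv("data/ragas_synthetic_dataset.csv")  # hypothetical path
df.head()
```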
|   | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What is the role of generative AI in the conte... | ["Agents\nThis combination of reasoning,\nlogi... | Generative AI models can be trained to use too... | single_hop_specifc_query_synthesizer |
| 1 | What are the essential components of an agent'... | ['Agents\nWhat is an agent?\nIn its most funda... | The essential components in an agent's cogniti... | single_hop_specifc_query_synthesizer |
| 2 | What are the key considerations for selecting ... | ['Agents\nFigure 1. General agent architecture... | When selecting a model for an agent, it is cru... | single_hop_specifc_query_synthesizer |
| 3 | How does retrieval augmented generation enhanc... | ['Agents\nThe tools\nFoundational models, desp... | Retrieval augmented generation (RAG) significa... | single_hop_specifc_query_synthesizer |
| 4 | In the context of AI agents, how does the CoT ... | ['Agents\nAgents vs. models\nTo gain a clearer... | The CoT framework enhances reasoning capabilit... | single_hop_specifc_query_synthesizer |
Create a batch dataset by assigning the questions to batch_dataset, as sketched below. A batch dataset is useful when you want to process a large number of questions at once.
Reference for batch(): Link
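A minimal sketch, assuming df is the DataFrame loaded above and that your chain expects a "question" input key (adjust the key to your chain's input schema):

```python
# Collect every question into a list of inputs for chain.batch().
batch_dataset = [{"question": question} for question in df["user_input"]]
batch_dataset[:3]  # inspect the first few entries
```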
Call batch() to get answers for the batch dataset (batch_dataset), and store the answers generated by the LLM in the answer column, as sketched below.
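A minimal sketch, assuming chain is the RAG chain built in a previous step (batch() is part of the LangChain Runnable interface):

```python
# Generate answers for every question in one call and store them in the
# "answer" column of the DataFrame.
answers = chain.batch(batch_dataset)
df["answer"] = answers
```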
Evaluate the answers
Using ragas.evaluate(), we can evaluate the answers.
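A minimal sketch, assuming the DataFrame columns built above. The column names question/contexts/answer/ground_truth follow the legacy ragas schema and are an assumption; newer ragas versions use user_input/retrieved_contexts/response/reference instead.

```python
import ast

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# "contexts" must be a list of strings per row; when loaded from CSV the
# column holds stringified lists, so parse them back with ast.literal_eval.
contexts = [ast.literal_eval(c) for c in df["reference_contexts"]]

eval_dataset = Dataset.from_dict(
    {
        "question": df["user_input"].tolist(),
        "contexts": contexts,
        "answer": df["answer"].tolist(),
        "ground_truth": df["reference"].tolist(),
    }
)

result = evaluate(
    dataset=eval_dataset,
    metrics=[context_precision, faithfulness, answer_relevancy, context_recall],
)

result.to_pandas()
```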
|   | question | contexts | answer | ground_truth | context_precision | faithfulness | answer_relevancy | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | What is the role of generative AI in the conte... | [Agents\nThis combination of reasoning,\nlogic... | The role of generative AI in the context of ag... | The role of generative AI in the context of ag... | 1.0 | 0.470588 | 1.000000 | 0.75 |
| 1 | What are the essential components of an agent'... | [Agents\nWhat is an agent?\nIn its most fundam... | The essential components of an agent's cogniti... | The essential components of an agent's cogniti... | 1.0 | 0.400000 | 1.000000 | 0.50 |
| 2 | What are the key considerations for selecting ... | [Agents\nFigure 1. General agent architecture ... | The key considerations for selecting a model f... | The key considerations for selecting a model f... | 1.0 | 0.333333 | 1.000000 | 0.50 |
| 3 | How does retrieval augmented generation enhanc... | [Agents\nThe tools\nFoundational models, despi... | Retrieval Augmented Generation (RAG) enhances ... | Retrieval Augmented Generation (RAG) enhances ... | 1.0 | 0.500000 | 0.919411 | 1.00 |
| 4 | In the context of AI agents, how does the CoT ... | [Agents\nAgents vs. models\nTo gain a clearer ... | The Chain-of-Thought (CoT) framework enhances ... | The Chain-of-Thought (CoT) framework enhances ... | 1.0 | 0.076923 | 0.944423 | 0.00 |
Per-question metric scores for the full dataset:

|   | context_precision | faithfulness | answer_relevancy | context_recall |
|---|---|---|---|---|
| 0 | 1.0 | 0.470588 | 1.000000 | 0.750000 |
| 1 | 1.0 | 0.400000 | 1.000000 | 0.500000 |
| 2 | 1.0 | 0.333333 | 1.000000 | 0.500000 |
| 3 | 1.0 | 0.500000 | 0.919411 | 1.000000 |
| 4 | 1.0 | 0.076923 | 0.944423 | 0.000000 |
| 5 | 1.0 | 0.923077 | 0.938009 | 1.000000 |
| 6 | 1.0 | 1.000000 | 0.984205 | 0.750000 |
| 7 | 1.0 | 0.920000 | 0.971321 | 1.000000 |
| 8 | 1.0 | 0.352941 | 0.963824 | 0.666667 |
| 9 | 1.0 | 0.916667 | 0.972590 | 1.000000 |