This tutorial shows you how to evaluate the quality of your LLM output using RAGAS.
Before starting, let's first review the metrics used in this tutorial: Context Recall, Context Precision, Answer Relevancy, and Faithfulness.
Context Recall
It estimates "how well the retrieved context covers the claims in the ground-truth answer".
It is calculated using the question, ground truth, and retrieved context. The value is between 0 and 1, and higher values indicate better performance. To estimate $\text{Context Recall}$ from the ground-truth answer, each claim in the ground-truth answer is analyzed to see if it can be attributed to the retrieved context. In the ideal scenario, every claim in the ground-truth answer is attributable to the retrieved context.
$$\text{Context Recall} = \frac{|\text{GT claims that can be attributed to context}|}{|\text{Number of claims in GT}|}$$
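As a quick toy illustration of the formula (not RAGAS's internal implementation), suppose the ground-truth answer contains four claims and three of them can be attributed to the retrieved context:
# Toy illustration of the Context Recall formula
attributable_claims = 3  # GT claims supported by the retrieved context
total_gt_claims = 4      # all claims in the ground-truth answer
context_recall = attributable_claims / total_gt_claims
print(context_recall)  # 0.75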
Context Precision
It estimates "whether ground-truth related items in contexts are ranked at the top".
Ideally, all relevant chunks should appear in the top ranks. This metric is calculated using question, ground_truth, and contexts, with values ranging from 0 to 1. Higher scores indicate better precision.
The formula for $\text{Context Precision@K}$ is as follows:
$$\text{Context Precision@K} = \frac{\sum_{k=1}^{K} (\text{Precision@k} \times v_k)}{\text{Total number of relevant items in the top } K \text{ results}}$$
Here, $\text{Precision@k}$ is calculated as follows:
$$\text{Precision@k} = \frac{\text{true positives@k}}{\text{true positives@k} + \text{false positives@k}}$$
$K$ is the total number of chunks in contexts, and $v_k \in \{0, 1\}$ is the relevance indicator at rank $k$.
This metric evaluates the quality of the retrieved context in information retrieval systems by measuring how well relevant information is placed in the top ranks.
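The sketch below works through the formula on toy relevance indicators (illustrative values only; in RAGAS, $v_k$ is judged by an LLM):
# Toy illustration of Context Precision@K with K=3 retrieved chunks
v = [1, 0, 1]  # v_k: relevance indicator at each rank (1 = relevant)
K = len(v)
total_relevant = sum(v)
# Precision@k = (# relevant in top k) / k, counted only at relevant ranks
score = sum((sum(v[: k + 1]) / (k + 1)) * v[k] for k in range(K)) / total_relevant
print(round(score, 3))  # (1/1 + 2/3) / 2 = 0.833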
Answer Relevancy (Response Relevancy)
It is a metric that evaluates "how well the generated answer matches the given prompt".
The main features and calculation methods of this metric are as follows:
Purpose: Evaluate the relevance of the generated answer.
Score interpretation: Lower scores indicate incomplete answers or answers containing redundant information, while higher scores indicate better relevance.
Elements used in calculation: question, context, answer
The calculation method for $\text{Answer Relevancy}$ is defined as the average cosine similarity between the original question and synthetic questions that are generated back from the answer:
$$\text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos(E_{g_i}, E_o)$$
Here:
$E_{g_i}$ : the embedding of the generated question $i$
$E_o$ : the embedding of the original question
$N$ : the number of generated questions (default value is 3)
Note:
The actual score usually falls between 0 and 1, but because of the characteristics of cosine similarity, it can mathematically range from -1 to 1.
This metric is useful for evaluating the performance of question-answering systems, particularly for measuring how well the generated answer reflects the original question's intent.
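Here is a minimal sketch of the scoring step with toy embedding vectors (in RAGAS, the synthetic questions are generated by an LLM and embedded with your embedding model):
import numpy as np

# Toy embeddings: one original question, N=3 generated questions
E_o = np.array([0.9, 0.1, 0.3])
E_g = [
    np.array([0.8, 0.2, 0.3]),
    np.array([0.7, 0.1, 0.4]),
    np.array([0.9, 0.0, 0.2]),
]

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Average cosine similarity between each generated question and the original
answer_relevancy_score = float(np.mean([cos_sim(e, E_o) for e in E_g]))
print(answer_relevancy_score)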
Faithfulness
It is a metric that evaluates "the factual consistency of the generated answer compared to the given context".
The main features and calculation methods of this metric are as follows:
Purpose: Evaluate the factual consistency of the generated answer compared to the given context.
Calculation elements: Use the generated answer and the retrieved context.
Score range: Adjusted between 0 and 1, with higher values indicating better performance.
The calculation method for the $\text{Faithfulness score}$ is as follows:
$$\text{Faithfulness score} = \frac{|\text{Number of claims in the generated answer that can be inferred from the given context}|}{|\text{Total number of claims in the generated answer}|}$$
Calculation process:
Identify claims in the generated answer.
Verify each claim against the given context to check if it can be inferred from the context.
Use the above formula to calculate the score.
Example:
Question: "When and where was Einstein born?"
Context: "Albert Einstein (born March 14, 1879) is a German-born theoretical physicist, widely considered one of the most influential scientists of all time."
High faithfulness answer: "Einstein was born in Germany on March 14, 1879."
Low faithfulness answer: "Einstein was born in Germany on March 20, 1879."
This metric is useful for evaluating the performance of question-answering systems, particularly for measuring how well the generated answer reflects the given context.
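Applying the formula to the low-faithfulness answer above (a toy breakdown; RAGAS extracts and verifies claims with an LLM), the answer makes two claims and only one is supported by the context:
# Toy illustration of the Faithfulness formula for the low-faithfulness answer
supported_claims = 1  # "born in Germany" can be inferred from the context
total_claims = 2      # the date "March 20, 1879" contradicts the context
faithfulness_score = supported_claims / total_claims
print(faithfulness_score)  # 0.5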
Below is a sample context chunk from the agents whitepaper used in this tutorial; the code that follows builds a simple RAG chain over this document.
['Agents\nWhat is an agent?\nIn its most fundamental form, a Generative AI agent can be defined as an application that\nattempts to achieve a goal by observing the world and acting upon it using the tools that it\nhas at its disposal. Agents are autonomous and can act independently of human intervention,\nespecially when provided with proper goals or objectives they are meant to achieve. Agents\ncan also be proactive in their approach to reaching their goals. Even in the absence of\nexplicit instruction sets from a human, an agent can reason about what it should do next to\nachieve its ultimate goal. While the notion of agents in AI is quite general and powerful, this\nwhitepaper focuses on the specific types of agents that Generative AI models are capable of\nbuilding at the time of publication.\nIn order to understand the inner workings of an agent, let’s first introduce the foundational\ncomponents that drive the agent’s behavior, actions, and decision making. The combination\nof these components can be described as a cognitive architecture, and there are many\nsuch architectures that can be achieved by the mixing and matching of these components.\nFocusing on the core functionalities, there are three essential components in an agent’s\ncognitive architecture as shown in Figure 1.\nSeptember 2024 5\n']
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
# Step 1: Load Documents
loader = PyMuPDFLoader("data/Newwhitepaper_Agents2.pdf")
docs = loader.load()
# Step 2: Split Documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
split_documents = text_splitter.split_documents(docs)
# Step 3: Create Embeddings
embeddings = OpenAIEmbeddings()
# Step 4: Create DB and Save
vectorstore = FAISS.from_documents(documents=split_documents, embedding=embeddings)
# Step 5: Create Retriever
retriever = vectorstore.as_retriever()
# Step 6: Create Prompt
prompt = PromptTemplate.from_template(
"""You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
#Context:
{context}
#Question:
{question}
#Answer:"""
)
# Step 7: Create LLM
llm = ChatOpenAI(model_name="gpt-4o", temperature=0)
# Step 8: Create Chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
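Before running a full batch, you can sanity-check the chain with a single question (the question below is just an illustration):
# Quick single-question sanity check
print(chain.invoke("What is an agent?"))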
Create a batch dataset by collecting the questions into batch_dataset.
A batch dataset is useful when you want to process a large number of questions at once.
batch_dataset = [question for question in test_dataset["user_input"]]
batch_dataset[:3]
['What is the role of generative AI in the context of agents?',
"What are the essential components of an agent's cognitive architecture as of September 2024?",
'What are the key considerations for selecting a model for an agent in the context of advancements expected by September 2024?']
Call batch() to get answers for the batch dataset (batch_dataset).
answer = chain.batch(batch_dataset)
answer[:3]
['The role of generative AI in the context of agents is to extend the capabilities of language models by leveraging tools to access real-time information, suggest real-world actions, and autonomously plan and execute complex tasks. Generative AI models can be trained to use external tools to access specific information or perform actions, such as making API calls to send emails or complete transactions. These agents are autonomous, capable of acting independently of human intervention, and can proactively reason about what actions to take to achieve their goals. The orchestration layer, a cognitive architecture, structures the reasoning, planning, and decision-making processes of these agents.',
"The essential components of an agent's cognitive architecture as of September 2024 include the core functionalities that drive the agent’s behavior, actions, and decision-making. These components can be described as a cognitive architecture, which involves the orchestration layer that structures reasoning, planning, decision-making, and guides the agent's actions.",
"The key considerations for selecting a model for an agent, in the context of advancements expected by September 2024, include the model's ability to reason and act on various tasks, its capability to select the right tools, and how well those tools have been defined. Additionally, the model should be able to interact with external data and services through tools, which can significantly extend its capabilities beyond what the foundational model alone can achieve. This involves using tools like web API methods (GET, POST, PATCH, DELETE) to access and process real-world information, thereby supporting more specialized systems like retrieval augmented generation (RAG). The iterative approach to building complex agent architectures, focusing on experimentation and refinement, is also crucial to finding solutions for specific business cases and organizational needs."]
Store the answers generated by the LLM in the answer column.
# Overwrite or add 'answer' column
if "answer" in test_dataset.column_names:
test_dataset = test_dataset.remove_columns(["answer"]).add_column("answer", answer)
else:
test_dataset = test_dataset.add_column("answer", answer)
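A quick check confirms that the column was added:
# Verify that the 'answer' column now exists
print(test_dataset.column_names)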
Evaluate the answers
Using ragas.evaluate(), we can evaluate the answers.
from datasets import Dataset

from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)
# Format dataset structure
formatted_dataset = []
for item in test_dataset:
    formatted_item = {
        "question": item["user_input"],
        "answer": item["answer"],
        "reference": item["reference"],  # ground-truth answer from the test set
        "contexts": item["reference_contexts"],
        # The test set's reference contexts are reused as retrieved contexts here
        "retrieved_contexts": item["reference_contexts"],
    }
    formatted_dataset.append(formatted_item)
# Convert to RAGAS dataset
ragas_dataset = Dataset.from_list(formatted_dataset)
result = evaluate(
    dataset=ragas_dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)
result
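To inspect per-question scores, convert the result to a pandas DataFrame using the to_pandas() helper on the result object:
# Per-question metric scores as a DataFrame
result_df = result.to_pandas()
result_df.head()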