
Evaluation using RAGAS


  • Author: Sungchul Kim

  • Peer Review: Yoonji, Sunyoung Park

  • This is a part of LangChain Open Tutorial

Overview

This tutorial shows you how to evaluate the quality of your LLM output using RAGAS.

Before starting, let's review the metrics used in this tutorial: Context Recall, Context Precision, Answer Relevancy, and Faithfulness.

Context Recall

It estimates "how well the retrieved context matches the LLM-generated answer" . It is calculated using question, ground truth, and retrieved context. The value is between 0 and 1, and higher values indicate better performance. To estimate $\text{Context Recall}$ from the ground truth answer, each claim in the ground truth answer is analyzed to see if it can be attributed to the retrieved context. In the ideal scenario, all claims in the ground truth answer should be able to be attributed to the retrieved context.

$$\text{Context Recall} = \frac{|\text{GT claims that can be attributed to context}|}{|\text{Number of claims in GT}|}$$
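
For example, if the ground-truth answer contains four claims and three of them can be attributed to the retrieved context, the score is:

$$\text{Context Recall} = \frac{3}{4} = 0.75$$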

Context Precision

It estimates "whether ground-truth related items in contexts are ranked at the top" .

Ideally, all relevant chunks should appear in the top ranks. This metric is calculated using question, ground_truth, and contexts, with values ranging from 0 to 1. Higher scores indicate better precision.

The formula for $\text{Context Precision@K}$ is as follows:

$$\text{Context Precision@K} = \frac{\sum_{k=1}^{K} (\text{Precision@k} \times v_k)}{\text{Total number of relevant items in the top } K \text{ results}}$$

Here, $\text{Precision@k}$ is calculated as follows:

$$\text{Precision@k} = \frac{\text{true positives@k}}{\text{true positives@k} + \text{false positives@k}}$$

$K$ is the total number of chunks in contexts, and $v_k \in \{0, 1\}$ is the relevance indicator at rank $k$.
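
As a quick worked example, suppose $K = 3$ and the relevance indicators are $v_1 = 1$, $v_2 = 0$, $v_3 = 1$ (two relevant chunks in total). Then $\text{Precision@1} = 1/1$, $\text{Precision@2} = 1/2$, $\text{Precision@3} = 2/3$, and:

$$\text{Context Precision@3} = \frac{(1 \times 1) + (\tfrac{1}{2} \times 0) + (\tfrac{2}{3} \times 1)}{2} \approx 0.83$$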

This metric is used to evaluate the quality of the retrieved context in information retrieval systems. It measures how well relevant information is placed in the top ranks, allowing for performance assessment.

Answer Relevancy (Response Relevancy)

It is a metric that evaluates "how well the generated answer matches the given prompt".

The main features and calculation methods of this metric are as follows:

  1. Purpose: Evaluate the relevance of the generated answer.

  2. Score interpretation: Lower scores indicate incomplete or duplicate information in the answer, while higher scores indicate better relevance.

  3. Elements used in calculation: question, context, answer

The calculation method for $\text{Answer Relevancy}$ is defined as the average cosine similarity between the original question and the generated synthetic questions.

$$\text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos(E_{g_i}, E_o) = \frac{1}{N} \sum_{i=1}^{N} \frac{E_{g_i} \cdot E_o}{\|E_{g_i}\| \, \|E_o\|}$$

Here:

  • $E_{g_i}$ : the embedding of the generated question $i$

  • $E_o$ : the embedding of the original question

  • $N$ : the number of generated questions (default value is 3)

Note:

  • The actual score is mostly between 0 and 1, but mathematically it can be between -1 and 1 due to the characteristics of cosine similarity.

This metric is useful for evaluating the performance of question-answering systems, particularly for measuring how well the generated answer reflects the original question's intent.
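
To make the formula concrete, here is a minimal sketch of the idea behind Answer Relevancy, assuming the same OpenAIEmbeddings model used later in this tutorial. This is not RAGAS's internal implementation, and the synthetic questions (which RAGAS would generate from the answer with an LLM) are hard-coded here purely for illustration.

import numpy as np
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Original user question
original_question = "When and where was Einstein born?"

# Questions an LLM might generate back from the answer (hard-coded for illustration)
generated_questions = [
    "Where was Albert Einstein born?",
    "On what date was Einstein born?",
    "When and in which country was Einstein born?",
]

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Average cosine similarity between the original question and each generated question
e_o = embeddings.embed_query(original_question)
scores = [cosine_similarity(embeddings.embed_query(q), e_o) for q in generated_questions]
print(sum(scores) / len(scores))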

Faithfulness

It is a metric that evaluates "the factual consistency of the generated answer compared to the given context".

The main features and calculation methods of this metric are as follows:

  1. Purpose: Evaluate the factual consistency of the generated answer compared to the given context.

  2. Calculation elements: Use the generated answer and the retrieved context.

  3. Score range: Adjusted between 0 and 1, with higher values indicating better performance.

The calculation method for $\text{Faithfulness score}$ is as follows:

$$\text{Faithfulness score} = \frac{|\text{Number of claims in the generated answer that can be inferred from the given context}|}{|\text{Total number of claims in the generated answer}|}$$

Calculation process:

  1. Identify claims in the generated answer.

  2. Verify each claim against the given context to check if it can be inferred from the context.

  3. Use the above formula to calculate the score.

Example:

  • Question: "When and where was Einstein born?"

  • Context: "Albert Einstein (born March 14, 1879) is a German-born theoretical physicist, widely considered one of the most influential scientists of all time."

  • High faithfulness answer: "Einstein was born in Germany on March 14, 1879."

  • Low faithfulness answer: "Einstein was born in Germany on March 20, 1879."
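
Applying the formula to this example: both answers contain two claims (the place of birth and the date of birth). In the high-faithfulness answer both claims can be inferred from the context, while in the low-faithfulness answer the date cannot:

$$\text{Faithfulness}_{\text{high}} = \frac{2}{2} = 1.0, \qquad \text{Faithfulness}_{\text{low}} = \frac{1}{2} = 0.5$$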

This metric is useful for evaluating the performance of question-answering systems, particularly for measuring how well the generated answer reflects the given context.

Table of Contents

  • Overview
  • Environment Setup
  • Load saved RAGAS dataset
  • Evaluate the answers

References

  • RAGAS Documentation
  • RAGAS Metrics
  • RAGAS Metrics - Context Recall
  • RAGAS Metrics - Context Precision
  • RAGAS Metrics - Answer Relevancy (Response Relevancy)
  • RAGAS Metrics - Faithfulness

Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides a set of easy-to-use environment setup, useful functions, and utilities for tutorials.

  • You can check out the langchain-opentutorial for more details.

%%capture --no-stderr
%pip install langchain-opentutorial
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain_core",
        "langchain_community",
        "langchain_text_splitters",
        "langchain_openai",
        "ragas",
        "pymupdf",
        "faiss-cpu",
    ],
    verbose=False,
    upgrade=False,
)
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Evaluation-using-RAGAS",
    }
)
Environment variables have been set successfully.

You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)
True

Load saved RAGAS dataset

Load the RAGAS dataset that you saved in the previous step.

import pandas as pd

df = pd.read_csv("data/ragas_synthetic_dataset.csv")
df.head()
| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What is the role of generative AI in the conte... | ["Agents\nThis combination of reasoning,\nlogi... | Generative AI models can be trained to use too... | single_hop_specifc_query_synthesizer |
| 1 | What are the essential components of an agent'... | ['Agents\nWhat is an agent?\nIn its most funda... | The essential components in an agent's cogniti... | single_hop_specifc_query_synthesizer |
| 2 | What are the key considerations for selecting ... | ['Agents\nFigure 1. General agent architecture... | When selecting a model for an agent, it is cru... | single_hop_specifc_query_synthesizer |
| 3 | How does retrieval augmented generation enhanc... | ['Agents\nThe tools\nFoundational models, desp... | Retrieval augmented generation (RAG) significa... | single_hop_specifc_query_synthesizer |
| 4 | In the context of AI agents, how does the CoT ... | ['Agents\nAgents vs. models\nTo gain a clearer... | The CoT framework enhances reasoning capabilit... | single_hop_specifc_query_synthesizer |

from datasets import Dataset

test_dataset = Dataset.from_pandas(df)
test_dataset
Dataset({
        features: ['user_input', 'reference_contexts', 'reference', 'synthesizer_name'],
        num_rows: 10
    })
import ast

# Convert contexts column from string to list
def convert_to_list(example):
    contexts = ast.literal_eval(example["reference_contexts"])
    return {"reference_contexts": contexts}

test_dataset = test_dataset.map(convert_to_list)
print(test_dataset)
Map: 100%|██████████| 10/10 [00:00<00:00, 721.48 examples/s]
Dataset({
    features: ['user_input', 'reference_contexts', 'reference', 'synthesizer_name'],
    num_rows: 10
})


test_dataset[1]["reference_contexts"]
['Agents\nWhat is an agent?\nIn its most fundamental form, a Generative AI agent can be defined as an application that\nattempts to achieve a goal by observing the world and acting upon it using the tools that it\nhas at its disposal. Agents are autonomous and can act independently of human intervention,\nespecially when provided with proper goals or objectives they are meant to achieve. Agents\ncan also be proactive in their approach to reaching their goals. Even in the absence of\nexplicit instruction sets from a human, an agent can reason about what it should do next to\nachieve its ultimate goal. While the notion of agents in AI is quite general and powerful, this\nwhitepaper focuses on the specific types of agents that Generative AI models are capable of\nbuilding at the time of publication.\nIn order to understand the inner workings of an agent, let’s first introduce the foundational\ncomponents that drive the agent’s behavior, actions, and decision making. The combination\nof these components can be described as a cognitive architecture, and there are many\nsuch architectures that can be achieved by the mixing and matching of these components.\nFocusing on the core functionalities, there are three essential components in an agent’s\ncognitive architecture as shown in Figure 1.\nSeptember 2024 5\n']
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Step 1: Load Documents
loader = PyMuPDFLoader("data/Newwhitepaper_Agents2.pdf")
docs = loader.load()

# Step 2: Split Documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
split_documents = text_splitter.split_documents(docs)

# Step 3: Create Embeddings
embeddings = OpenAIEmbeddings()

# Step 4: Create DB and Save
vectorstore = FAISS.from_documents(documents=split_documents, embedding=embeddings)

# Step 5: Create Retriever
retriever = vectorstore.as_retriever()

# Step 6: Create Prompt
prompt = PromptTemplate.from_template(
    """You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 

#Context: 
{context}

#Question:
{question}

#Answer:"""
)

# Step 7: Create LLM
llm = ChatOpenAI(model_name="gpt-4o", temperature=0)

# Step 8: Create Chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
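
Before running the whole batch, you can sanity-check the chain on a single question (an illustrative question, not part of the dataset):

# Quick sanity check of the RAG chain with a single question
print(chain.invoke("What is an agent?"))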

Create a batch dataset by collecting the questions into batch_dataset. A batch dataset is useful when you want to process a large number of questions at once.

batch_dataset = [question for question in test_dataset["user_input"]]
batch_dataset[:3]
['What is the role of generative AI in the context of agents?',
     "What are the essential components of an agent's cognitive architecture as of September 2024?",
     'What are the key considerations for selecting a model for an agent in the context of advancements expected by September 2024?']

Call batch() to get answers for the batch dataset (batch_dataset).

answer = chain.batch(batch_dataset)
answer[:3]
['The role of generative AI in the context of agents is to extend the capabilities of language models by leveraging tools to access real-time information, suggest real-world actions, and autonomously plan and execute complex tasks. Generative AI models can be trained to use external tools to access specific information or perform actions, such as making API calls to send emails or complete transactions. These agents are autonomous, capable of acting independently of human intervention, and can proactively reason about what actions to take to achieve their goals. The orchestration layer, a cognitive architecture, structures the reasoning, planning, and decision-making processes of these agents.',
     "The essential components of an agent's cognitive architecture as of September 2024 include the core functionalities that drive the agent’s behavior, actions, and decision-making. These components can be described as a cognitive architecture, which involves the orchestration layer that structures reasoning, planning, decision-making, and guides the agent's actions.",
     "The key considerations for selecting a model for an agent, in the context of advancements expected by September 2024, include the model's ability to reason and act on various tasks, its capability to select the right tools, and how well those tools have been defined. Additionally, the model should be able to interact with external data and services through tools, which can significantly extend its capabilities beyond what the foundational model alone can achieve. This involves using tools like web API methods (GET, POST, PATCH, DELETE) to access and process real-world information, thereby supporting more specialized systems like retrieval augmented generation (RAG). The iterative approach to building complex agent architectures, focusing on experimentation and refinement, is also crucial to finding solutions for specific business cases and organizational needs."]

Store the answers generated by the LLM in the answer column.

# Overwrite or add 'answer' column
if "answer" in test_dataset.column_names:
    test_dataset = test_dataset.remove_columns(["answer"]).add_column("answer", answer)
else:
    test_dataset = test_dataset.add_column("answer", answer)
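
You can quickly confirm that the new column is present (an illustrative check):

# The dataset should now include the 'answer' column
print(test_dataset.column_names)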

Evaluate the answers

Using ragas.evaluate(), we can evaluate the answers.

from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)

# Format dataset structure
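# Build records with the field names that ragas.evaluate() will read below.
# Note: this tutorial reuses the generated answer as 'reference' and the
# reference contexts as 'retrieved_contexts'.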
formatted_dataset = []
for item in test_dataset:
    formatted_item = {
        "question": item["user_input"],
        "answer": item["answer"],
        "reference": item["answer"],
        "contexts": item["reference_contexts"],
        "retrieved_contexts": item["reference_contexts"],
    }
    formatted_dataset.append(formatted_item)

# Convert to RAGAS dataset
ragas_dataset = Dataset.from_list(formatted_dataset)

result = evaluate(
    dataset=ragas_dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)

result
Evaluating: 100%|██████████| 40/40 [00:46<00:00,  1.16s/it]
{'context_precision': 1.0000, 'faithfulness': 0.5894, 'answer_relevancy': 0.9694, 'context_recall': 0.7167}
result_df = result.to_pandas()
result_df.head()
| | user_input | retrieved_contexts | response | reference | context_precision | faithfulness | answer_relevancy | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | What is the role of generative AI in the conte... | [Agents\nThis combination of reasoning,\nlogic... | The role of generative AI in the context of ag... | The role of generative AI in the context of ag... | 1.0 | 0.470588 | 1.000000 | 0.75 |
| 1 | What are the essential components of an agent'... | [Agents\nWhat is an agent?\nIn its most fundam... | The essential components of an agent's cogniti... | The essential components of an agent's cogniti... | 1.0 | 0.400000 | 1.000000 | 0.50 |
| 2 | What are the key considerations for selecting ... | [Agents\nFigure 1. General agent architecture ... | The key considerations for selecting a model f... | The key considerations for selecting a model f... | 1.0 | 0.333333 | 1.000000 | 0.50 |
| 3 | How does retrieval augmented generation enhanc... | [Agents\nThe tools\nFoundational models, despi... | Retrieval Augmented Generation (RAG) enhances ... | Retrieval Augmented Generation (RAG) enhances ... | 1.0 | 0.500000 | 0.919411 | 1.00 |
| 4 | In the context of AI agents, how does the CoT ... | [Agents\nAgents vs. models\nTo gain a clearer ... | The Chain-of-Thought (CoT) framework enhances ... | The Chain-of-Thought (CoT) framework enhances ... | 1.0 | 0.076923 | 0.944423 | 0.00 |

result_df.to_csv("data/ragas_evaluation_result.csv", index=False)
result_df.loc[:, "context_precision":"context_recall"]
| | context_precision | faithfulness | answer_relevancy | context_recall |
|---|---|---|---|---|
| 0 | 1.0 | 0.470588 | 1.000000 | 0.750000 |
| 1 | 1.0 | 0.400000 | 1.000000 | 0.500000 |
| 2 | 1.0 | 0.333333 | 1.000000 | 0.500000 |
| 3 | 1.0 | 0.500000 | 0.919411 | 1.000000 |
| 4 | 1.0 | 0.076923 | 0.944423 | 0.000000 |
| 5 | 1.0 | 0.923077 | 0.938009 | 1.000000 |
| 6 | 1.0 | 1.000000 | 0.984205 | 0.750000 |
| 7 | 1.0 | 0.920000 | 0.971321 | 1.000000 |
| 8 | 1.0 | 0.352941 | 0.963824 | 0.666667 |
| 9 | 1.0 | 0.916667 | 0.972590 | 1.000000 |
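
If you want a quick per-metric summary across the samples, you can average the numeric columns (a small illustrative step, not part of the saved results):

# Mean score for each metric across all evaluated samples
print(result_df.loc[:, "context_precision":"context_recall"].mean())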
