Generate synthetic test dataset (with RAGAS)


Overview

Welcome Back!

Hi everyone! Welcome to our first lecture in the evaluation section. We're going to try something special today! While we've been building RAG systems, we haven't really talked about how to test if they're working well. To properly evaluate a RAG system, we need good test data—and that's exactly what we'll be creating in this tutorial! We'll learn how to build datasets that will help us measure our RAG pipeline's performance.

Today, what we are going to learn...

In this session, we'll focus on using RAGAS to create evaluation datasets for RAG systems. Our main tasks will include:

  • Preprocessing documents for evaluation.

  • Defining evaluation objects.

  • Defining Knowledge Graphs, creating Nodes, and establishing relationships between nodes.

  • Understanding the concept of Extractors.

  • Configuring data distributions to generate various types of test questions.

We'll explore these concepts through hands-on practice, giving you a practical foundation for building evaluation datasets.

Why this matters...

The goal is to craft datasets that objectively assess the performance of your RAG system. A well-designed test can highlight how your system handles diverse questions and scenarios, revealing both strengths and areas needing improvement.

By the end of this tutorial, you'll have the skills to build robust datasets for comprehensive evaluation.

Without further ado, let's get started!


Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, helper functions, and utilities for these tutorials.

  • You can check out the langchain-opentutorial package for more details.
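A typical setup cell is sketched below. It assumes the package.install and set_env helpers exposed by langchain-opentutorial, and the package list is only an example; adjust it to what your environment actually needs.

```python
# Sketch of environment setup, assuming the package.install / set_env
# helpers from langchain-opentutorial are available.
# %pip install -qU langchain-opentutorial
from langchain_opentutorial import package, set_env

# Install the libraries used in this tutorial (example list).
package.install(
    [
        "langchain_community",
        "langchain_openai",
        "ragas",
        "pdfplumber",
    ],
    verbose=False,
    upgrade=False,
)

# Register the required API keys (fill in your own values).
set_env(
    {
        "OPENAI_API_KEY": "",
    }
)
```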

You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.
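For example, a minimal sketch using python-dotenv, assuming a .env file containing OPENAI_API_KEY sits in the working directory:

```python
from dotenv import load_dotenv

# Reads variables such as OPENAI_API_KEY from a local .env file
# and exposes them through os.environ.
load_dotenv(override=True)
```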

Looking Back at What We've Learned

In this session, let's review what we've learned so far.

We Have Learned About RAG

LLMs are a powerful technology, but they are limited in how well they can reflect real-time information because of the constraints of their training data.

For example, suppose NASA discovered a new planet yesterday, making the total number of planets in the solar system nine. What would happen if we asked an LLM about the number of planets in the solar system? Because an LLM responds based on the data it was trained on, it would still say there are eight planets. We call this phenomenon hallucination, and to resolve it we would have to wait for a new, retrained model version.

RAG emerged to overcome these limitations. Instead of immediately responding to user questions, the RAG pipeline first searches for the latest information from external knowledge repositories and then generates responses based on this information. This enables the system to provide answers that reflect the most up-to-date information.

Is Our RAG Design Effective?

You have learned various techniques for implementing RAG. Some of you may have already built your own RAG systems and applied them to your work.

However, we need to ask an important question: Is our RAG system truly a 'good' RAG? How can we judge the quality of RAG?

Simply saying "this RAG doesn't perform well" is not enough. We need to be able to measure and verify RAG's performance through objective evaluation metrics.

Why Use Synthetic Test Dataset?

Evaluating the performance of RAG systems is a crucial process. However, manually creating hundreds of question-answer pairs requires enormous time and effort.

Moreover, manually written questions often remain at a simple and superficial level, making it difficult to thoroughly evaluate the performance of RAG systems.

By utilizing synthetic data to solve these problems, we can reduce developer time spent on building test datasets by up to 90%. Additionally, it enables more thorough performance evaluation by automatically generating test cases of various difficulty levels and types.

Installation

To proceed with this tutorial, you need to install the RAGAS and pdfplumber packages. Run the command below to install them; right afterwards, we'll explore the concept of RAGAS and learn about Python's RAGAS package in detail.
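A minimal install cell might look like the following (package names as published on PyPI; pin versions if you need reproducibility):

```python
# In a notebook cell; use `pip install ragas pdfplumber` in a plain shell.
%pip install -qU ragas pdfplumber
```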

What is RAGAS?

RAGAS (Retrieval Augmented Generation Assessment Suite) is a comprehensive evaluation framework designed to assess the performance of RAG systems. It helps developers and researchers measure how well their RAG implementations are working through various metrics and evaluation methods.

Let's revisit the example we saw earlier.

Let's say NASA discovered a new planet yesterday, making the total number of planets in our solar system nine. To evaluate the performance of a RAG system, let's ask the test question "How many planets are in our solar system?" RAGAS evaluates the system's response using these key metrics:

  1. Answer Relevancy: Checks if the answer directly addresses the question about the number of planets

  2. Context Relevancy: Checks if the system retrieved the recent NASA announcement instead of old astronomy textbooks

  3. Faithfulness: Checks if the answer about nine planets is based on the NASA announcement and not on outdated data

  4. Context Precision: Checks if the system used the NASA announcement efficiently without including unnecessary space information

For example, if the RAG system responds with outdated information saying there are eight planets, RAGAS will give it a low context relevancy score. Or if it makes claims about the new planet that aren't in the NASA announcement, it will receive a low faithfulness score.

RAGAS in Python

You can easily use RAGAS with Python libraries.

Ragas is a library that provides tools to supercharge the evaluation of Large Language Model (LLM) applications. It is designed to help you evaluate your LLM applications with ease and confidence.
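As a rough illustration only, an evaluation call typically looks like the sketch below. The metric names, expected column names, and required LLM configuration vary across RAGAS versions, and the single record here is invented purely to mirror the planet example above.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness, context_precision

# Hypothetical single-row dataset for the planet example.
eval_dataset = Dataset.from_dict(
    {
        "question": ["How many planets are in our solar system?"],
        "answer": ["There are nine planets, following NASA's latest announcement."],
        "contexts": [["NASA announced the discovery of a ninth planet yesterday."]],
        "ground_truth": ["Nine, according to the latest NASA announcement."],
    }
)

# Score the dataset with a few of the metrics discussed above.
result = evaluate(
    eval_dataset,
    metrics=[answer_relevancy, faithfulness, context_precision],
)
print(result)
```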

Document Processing

Let's prepare our documents through preprocessing before building the dataset!

Document

While the official RAGAS package website demonstrates its tutorials using markdown, in this tutorial we'll be working with PDF files. Please use the files located in the data folder.

Document Preprocessing

We will use PDFPlumberLoader to load PDF files and process document pages starting from index 3 through the final index.

The output documents from PDFPlumberLoader include detailed metadata about the PDF and its pages, returning one document per page.
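A minimal loading sketch is shown below; the path data/sample.pdf is a placeholder, so substitute the PDF provided in the tutorial's data folder.

```python
from langchain_community.document_loaders import PDFPlumberLoader

# Placeholder path; replace with the PDF shipped in the tutorial's data folder.
loader = PDFPlumberLoader("data/sample.pdf")
docs = loader.load()

# Keep pages from index 3 through the end (skipping cover and front-matter pages).
docs = docs[3:]

print(f"Number of page-level documents: {len(docs)}")
```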

Each document object includes a metadata dictionary that can be used to store additional information about the document, which can be accessed through metadata.

Please check whether the metadata dictionary contains a key called filename.

This key will be used in the test dataset generation process: the filename attribute in metadata is used to identify chunks belonging to the same document.
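For example, you can inspect the metadata and, if the key is missing, derive it from the source path (the source key is the standard location where LangChain PDF loaders record the file path):

```python
# Inspect the metadata of the first page.
print(docs[0].metadata)

# Ensure every document carries a "filename" key, which RAGAS uses to
# group chunks that come from the same source document.
for doc in docs:
    doc.metadata.setdefault("filename", doc.metadata.get("source"))
```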

Dataset Generation

We'll create datasets using ChatOpenAI. Before writing the code, let's define the roles of our objects:

  • Dataset Generator: generator_llm

  • Document Embeddings: embeddings

First, let's initialize the DocumentStore. We'll configure it to use custom LLM and embeddings.
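The sketch below shows one way to do this, assuming the RAGAS 0.2-style API, in which the document store role is played by a KnowledgeGraph object and LangChain models are wrapped with LangchainLLMWrapper / LangchainEmbeddingsWrapper. The model name gpt-4o-mini is only an example.

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset.graph import KnowledgeGraph, Node, NodeType

# Wrap LangChain objects so RAGAS can drive them internally.
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# Build a knowledge graph with one DOCUMENT node per loaded page.
kg = KnowledgeGraph()
for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={
                "page_content": doc.page_content,
                "document_metadata": doc.metadata,
            },
        )
    )
```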

Self Check

Let's check the total number of nodes in the knowledge graph.

Run this code to verify if knowledge graph nodes have been created. If no nodes were created, there may be issues with executing subsequent code.
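A minimal check, assuming the knowledge graph object is named kg as in the sketch above:

```python
# If this prints 0, the node-creation step above did not run correctly.
print(f"Number of nodes in the knowledge graph: {len(kg.nodes)}")
```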

Now we will establish relationships between nodes in the knowledge graph.

Extractor

The extracted information is used to establish relationships between the nodes. Before generating those relationships, we will first examine three main extractors:

  1. KeyphrasesExtractor

  2. SummaryExtractor

  3. HeadlinesExtractor

First, I will import all the necessary modules.
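A sketch of the imports, assuming the RAGAS 0.2 module layout (exact import paths may differ slightly across versions):

```python
from ragas.testset.transforms.extractors.llm_based import (
    HeadlinesExtractor,
    KeyphrasesExtractor,
    SummaryExtractor,
)
```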

1. KeyphrasesExtractor
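A hedged usage sketch, assuming the extractor exposes an async extract(node) method that returns a (property_name, value) pair and reusing the generator_llm wrapper defined earlier:

```python
keyphrase_extractor = KeyphrasesExtractor(llm=generator_llm)

# Extract keyphrases for each node and store them back as a node property.
# (Top-level await works in a notebook; in a script, wrap this in asyncio.run.)
for node in kg.nodes:
    prop_name, keyphrases = await keyphrase_extractor.extract(node)
    node.properties[prop_name] = keyphrases

print(kg.nodes[0].properties.get("keyphrases"))
```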


[Note] Refactoring for Performance Improvement!

In the bonus section of this tutorial, we optimized the code to significantly reduce execution time: from roughly 45-60 seconds down to just 3-8 seconds! If you're familiar with parallel processing and asynchronous processing, you can combine these techniques to further enhance performance. We used the asyncio module for asynchronous processing and the multiprocessing module for parallel processing. (Tested on an M1 CPU with 4 performance cores and 4 efficiency cores.)

Check out the details in the Bonus Refactoring Section!
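As a rough illustration of the asynchronous idea (not the exact bonus-section code), the per-node extraction calls can be issued concurrently with asyncio.gather:

```python
import asyncio

async def extract_all(extractor, nodes):
    # Launch one extraction coroutine per node and wait for all of them at once.
    results = await asyncio.gather(*(extractor.extract(node) for node in nodes))
    for node, (prop_name, value) in zip(nodes, results):
        node.properties[prop_name] = value

# In a notebook:      await extract_all(keyphrase_extractor, kg.nodes)
# In a plain script:  asyncio.run(extract_all(keyphrase_extractor, kg.nodes))
```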


2. SummaryExtractor
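The same pattern applies here; this sketch assumes the async extract(node) method and the generator_llm wrapper from before:

```python
summary_extractor = SummaryExtractor(llm=generator_llm)

# Generate a summary for each node and store it as a node property.
for node in kg.nodes:
    prop_name, summary = await summary_extractor.extract(node)
    node.properties[prop_name] = summary

print(kg.nodes[0].properties.get("summary"))
```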


[Note] Refactoring for Performance Improvement!

You can refactor the SummaryExtractor in the same way as the KeyphrasesExtractor.

Check out the details in the Bonus Refactoring Section!


3. HeadlinesExtractor
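Again following the same pattern, and under the same assumptions about the extractor API:

```python
headline_extractor = HeadlinesExtractor(llm=generator_llm)

# Extract headline-like section titles for each node.
for node in kg.nodes:
    prop_name, headlines = await headline_extractor.extract(node)
    node.properties[prop_name] = headlines

print(kg.nodes[0].properties.get("headlines"))
```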


[Note] Refactoring for Performance Improvement!

You can refactor the HeadlinesExtractor in the same way as the KeyphrasesExtractor.

Check out the details in the Bonus Refactoring Section!


Relationship builder

We will define relationships using the information extracted earlier. In the case of technical documents, relationships can be established between nodes based on the entities present in them.

When two nodes contain the same entities, a relationship is established between them based on entity similarity. Let's take a look at this.

Relationships can be formed using the builder.
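The sketch below is only an assumption-laden illustration of the RAGAS 0.2 transforms API: the JaccardSimilarityBuilder class, its property_name / new_property_name / threshold fields, the async transform(kg) method, and the pre-existing "entities" node property (populated by an entity-extraction step not shown above) may all differ in your installed version.

```python
# Hedged sketch of relationship building based on entity overlap.
from ragas.testset.transforms.relationship_builders.traditional import (
    JaccardSimilarityBuilder,
)

# Assumes each node already has an "entities" property; nodes whose entity
# sets overlap strongly enough get connected.
rel_builder = JaccardSimilarityBuilder(
    property_name="entities",                     # node property to compare (assumption)
    new_property_name="entity_jaccard_similarity",
    threshold=0.5,                                # minimum overlap to create a relationship (assumption)
)

# Relationship builders return the edges they create; attach them to the graph.
relationships = await rel_builder.transform(kg)
kg.relationships.extend(relationships)

print(f"Relationships created: {len(kg.relationships)}")
```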
