Summary Evaluators


Overview

This tutorial shows how to build a RAG system with LangChain tools and evaluate it. It defines functions for RAG performance testing and uses summary evaluators to assess relevance. Using GPT-4o-mini and an Ollama model, you can evaluate how relevant the questions and the generated answers are to the retrieved context.

Environment Setup

Setting up your environment is the first step. See the Environment Setup guide for more details.

[Note] langchain-opentutorial is a package that provides easy-to-use environment setup guidance, along with useful functions and utilities for the tutorials. Check out langchain-opentutorial for more details.

You can set API keys in a .env file or set them manually.

[Note] If you’re not using the .env file, no worries! Just enter the keys directly in the cell below, and you’re good to go.
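As a rough illustration of the manual route, the sketch below sets keys through `os.environ` without overwriting values that are already defined (for example, ones loaded from a .env file). The helper name `set_env_if_missing` is hypothetical, and the variable names and placeholder values are assumptions about which keys this tutorial needs; substitute your own.

```python
import os


def set_env_if_missing(values: dict) -> list:
    """Set each variable only if it is not already defined.

    Returns the names that were newly set, so existing values
    (e.g. loaded from a .env file) are never overwritten.
    """
    newly_set = []
    for name, value in values.items():
        if name not in os.environ:
            os.environ[name] = value
            newly_set.append(name)
    return newly_set


# Placeholder values for illustration only (assumed key names); replace with your own.
set_env_if_missing(
    {
        "OPENAI_API_KEY": "your-openai-api-key",
        "LANGCHAIN_API_KEY": "your-langsmith-api-key",
        "LANGCHAIN_TRACING_V2": "true",
    }
)
```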

Defining a Function for RAG Performance Testing

We’ll create a RAG system for testing purposes.

We’ll create functions that answer questions using a GPT-4o-mini model and an Ollama model.
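A question-answering function of this kind might look like the sketch below. It is not the tutorial's actual implementation: `make_ask_question`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the retriever and model are replaced by offline stand-ins so the example runs without API access. The point is the shape of the output: one dictionary holding the question, the retrieved context, and the answer, which the graders can consume later.

```python
from typing import Callable, Dict, List


def make_ask_question(
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, str], str],
) -> Callable[[Dict[str, str]], Dict[str, str]]:
    """Wrap a retriever and a generator into one evaluation target function."""

    def ask_question(inputs: Dict[str, str]) -> Dict[str, str]:
        question = inputs["question"]
        # Join the retrieved passages into a single context string.
        context = "\n".join(retrieve(question))
        answer = generate(question, context)
        return {"question": question, "context": context, "answer": answer}

    return ask_question


# Offline stand-ins (assumptions); a real setup would call a retriever and an LLM.
def fake_retrieve(question: str) -> List[str]:
    return ["Paris is the capital of France.", "France is in Europe."]


def fake_generate(question: str, context: str) -> str:
    return "Paris"


ask = make_ask_question(fake_retrieve, fake_generate)
result = ask({"question": "What is the capital of France?"})
```

Building the function through a factory makes it easy to create one variant per model (for instance, one backed by GPT-4o-mini and one backed by an Ollama model) while keeping the output format identical.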

The OpenAIRelevanceGrader evaluates the relevance of the question, context, and answer.

  • target="retrieval-question": Evaluates the relevance of the question to the context.

  • target="retrieval-answer": Evaluates the relevance of the answer to the context.

We first need to define OpenAIRelevanceGrader.

Then, set up the retrieval-question grader and the retrieval-answer grader.

Invoke the graders.
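The three steps above (define, configure, invoke) can be sketched roughly as follows. This is a simplified stand-in, not the real OpenAIRelevanceGrader: the class `RelevanceGraderSketch`, the prompt wording, the `input`/`context` keys, and the yes/no output are all assumptions made for illustration, and `fake_llm` replaces the model call so the code runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical prompts, one per target (the wording is an assumption).
PROMPTS = {
    "retrieval-question": (
        "Is the retrieved context relevant to the question? Answer yes or no.\n"
        "Question: {input}\nContext: {context}"
    ),
    "retrieval-answer": (
        "Is the answer relevant to the retrieved context? Answer yes or no.\n"
        "Answer: {input}\nContext: {context}"
    ),
}


@dataclass
class RelevanceGraderSketch:
    llm: Callable[[str], str]  # any callable mapping a prompt to a reply
    target: str = "retrieval-question"

    def __post_init__(self) -> None:
        if self.target not in PROMPTS:
            raise ValueError(f"Unsupported target: {self.target}")

    def invoke(self, inputs: Dict[str, str]) -> str:
        prompt = PROMPTS[self.target].format(**inputs)
        reply = self.llm(prompt).strip().lower()
        # Normalize the reply to a strict "yes" / "no" label.
        return "yes" if reply.startswith("yes") else "no"


# Offline stand-in for a model call (assumption).
def fake_llm(prompt: str) -> str:
    return "Yes" if "Paris" in prompt else "No"


rq_grader = RelevanceGraderSketch(fake_llm, target="retrieval-question")
ra_grader = RelevanceGraderSketch(fake_llm, target="retrieval-answer")

rq_score = rq_grader.invoke(
    {"input": "What is the capital of France?", "context": "Paris is the capital of France."}
)
ra_score = ra_grader.invoke(
    {"input": "Berlin", "context": "Tokyo is the capital of Japan."}
)
```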

Summary Evaluator for Relevance Assessment

Certain metrics can only be defined at the experiment level rather than for individual runs of an experiment.

For example, you may need to compute the evaluation score of a classifier across all runs derived from a dataset.

Evaluators that compute such experiment-level metrics are called summary evaluators (summary_evaluators).

These evaluators take lists of Runs and Examples instead of single instances.
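To make the list-based signature concrete, here is a minimal sketch of a summary evaluator that averages question relevance and answer relevance over all runs. `RunStub` and `ExampleStub` are hypothetical stand-ins exposing only the `outputs` and `inputs` attributes used here, and `is_relevant` is a toy word-overlap check standing in for an LLM grader; the `context` and `answer` output keys are also assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RunStub:
    outputs: Dict[str, str] = field(default_factory=dict)


@dataclass
class ExampleStub:
    inputs: Dict[str, str] = field(default_factory=dict)


def is_relevant(text: str, context: str) -> bool:
    """Toy check: the two texts share a word longer than three characters."""
    words = {w.strip(".,?!").lower() for w in text.split()}
    context_words = {w.strip(".,?!").lower() for w in context.split()}
    return any(len(w) > 3 for w in words & context_words)


def relevance_score_summary_evaluator(
    runs: List[RunStub], examples: List[ExampleStub]
) -> dict:
    """Aggregate relevance over the whole experiment, not per run."""
    if not runs:
        return {"key": "relevance_score", "score": 0.0}
    rq_hits = 0  # runs whose question is relevant to the context
    ra_hits = 0  # runs whose answer is relevant to the context
    for run, example in zip(runs, examples):
        context = run.outputs["context"]
        rq_hits += is_relevant(example.inputs["question"], context)
        ra_hits += is_relevant(run.outputs["answer"], context)
    score = (rq_hits / len(runs) + ra_hits / len(runs)) / 2
    return {"key": "relevance_score", "score": score}


runs = [
    RunStub({"context": "Paris is the capital of France.", "answer": "Paris"}),
    RunStub({"context": "Tokyo is the capital of Japan.", "answer": "Berlin"}),
]
examples = [
    ExampleStub({"question": "What is the capital of France?"}),
    ExampleStub({"question": "Where is Mount Everest?"}),
]
summary = relevance_score_summary_evaluator(runs, examples)
```

In LangSmith, a function with this shape is typically passed to `evaluate` through its `summary_evaluators` argument, with the stubs replaced by the real run and example objects.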

Now, let's evaluate.

Check the result.

[Note] The summary score is not reported for individual examples in the dataset; review it at the experiment level.
