Pairwise Evaluation


Overview

You can also evaluate LLM responses by comparing them with each other. Pairwise preference scoring tends to produce more reliable feedback than scoring each response in isolation.

This comparative evaluation method is commonly encountered on platforms like Chatbot Arena or LLM leaderboards.

Table of Contents

Overview
Environment Setup
Pairwise Evaluation

Environment Setup

Setting up your environment is the first step. See the Environment Setup guide for more details.

[Note]

The langchain-opentutorial package provides easy-to-use environment setup, useful functions, and utilities for these tutorials. Check out langchain-opentutorial for more details.

You can set API keys in a .env file or set them manually.

[Note] If you’re not using the .env file, no worries! Just enter the keys directly in the cell below, and you’re good to go.
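For example, the following sketch loads the keys from a .env file when one exists and falls back to prompting for them. It assumes the python-dotenv package is installed; the key names are the ones commonly required by OpenAI and LangSmith.

```python
# A minimal sketch, assuming `python-dotenv` is installed: load API keys from
# a .env file if present, otherwise prompt for them interactively.
import os
from getpass import getpass

from dotenv import load_dotenv

load_dotenv(override=True)

for key in ("OPENAI_API_KEY", "LANGCHAIN_API_KEY"):
    if not os.environ.get(key):
        os.environ[key] = getpass(f"Enter {key}: ")
```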

Pairwise Evaluation

Prepare two answers generated by the LLM in advance.
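One way to do this is to ask two different models the same question. The sketch below assumes langchain-openai is installed with a valid API key; the model names and the question are illustrative.

```python
# A hedged sketch of preparing two candidate answers for the same question,
# here by asking two different (illustrative) models.
from langchain_openai import ChatOpenAI

question = "What is pairwise evaluation and why is it useful?"

model_a = ChatOpenAI(model="gpt-4o-mini", temperature=0)
model_b = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

answer_a = model_a.invoke(question).content
answer_b = model_b.invoke(question).content
```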

Provide these two answers to the LLM and have it select the better answer using the following prompt:

"You are an LLM judge. Compare the following two answers to a question and determine which one is better. Better answer is the one that is more detailed and informative. If the answer is not related to the question, it is not a good answer."

Configure the evaluator so that the judged result can be uploaded to LangSmith, where it can be reviewed.

Let's define the evaluator function.
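The sketch below shows one way such a judge could look. It assumes each dataset example stores its input under a question key and each run stores its output under an answer key; those key names, the model choice, and the pairwise_preference feedback key are illustrative assumptions, not a fixed API.

```python
# A hedged sketch of a pairwise evaluator for LangSmith's comparative
# evaluation. Adjust the "question" and "answer" keys to match your own chains.
from langchain_openai import ChatOpenAI
from langsmith.schemas import Example, Run

JUDGE_PROMPT = (
    "You are an LLM judge. Compare the following two answers to a question "
    "and determine which one is better. The better answer is the one that is "
    "more detailed and informative. If an answer is not related to the "
    "question, it is not a good answer.\n\n"
    "Question: {question}\n"
    "Answer A: {answer_a}\n"
    "Answer B: {answer_b}\n\n"
    "Respond with exactly one letter: A or B."
)


def evaluate_pairwise(runs: list[Run], example: Example) -> dict:
    """Ask an LLM judge which of the two runs answered the question better."""
    judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # illustrative model
    prompt = JUDGE_PROMPT.format(
        question=example.inputs["question"],
        answer_a=runs[0].outputs["answer"],
        answer_b=runs[1].outputs["answer"],
    )
    verdict = judge.invoke(prompt).content.strip().upper()

    # The preferred run gets a score of 1, the other 0.
    score_a = 1 if verdict.startswith("A") else 0
    scores = {runs[0].id: score_a, runs[1].id: 1 - score_a}
    return {"key": "pairwise_preference", "scores": scores}
```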

Now you can generate a dataset from these example runs.

Only the inputs need to be saved.
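A minimal sketch, with a placeholder dataset name and placeholder questions; in practice you would take the questions from your own example runs.

```python
# A minimal sketch of creating a LangSmith dataset that stores only the inputs.
from langsmith import Client

client = Client()
dataset = client.create_dataset(dataset_name="pairwise-evaluation-sample")

client.create_examples(
    inputs=[
        {"question": "What is LangChain?"},
        {"question": "What is LangSmith used for?"},
    ],
    dataset_id=dataset.id,
)
```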

Conduct a comparative evaluation.
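A sketch of how this could look with LangSmith's evaluate_comparative, assuming you have already run two experiments against the dataset (for example, one per model). The experiment names below are placeholders.

```python
# A hedged sketch of running the comparative evaluation in LangSmith.
from langsmith.evaluation import evaluate_comparative

evaluate_comparative(
    # Replace with the names (or IDs) of your two existing experiments.
    ["my-experiment-gpt-4o-mini", "my-experiment-gpt-3.5-turbo"],
    evaluators=[evaluate_pairwise],
)
```

The pairwise results are uploaded to LangSmith, where you can review which answer the judge preferred for each example.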
