Pairwise Evaluation

Overview

You can also evaluate LLM responses by comparing them with each other. Pairwise preference scoring lets you generate more reliable feedback than scoring each response in isolation.

This comparative evaluation approach is commonly used on platforms such as Chatbot Arena and on LLM leaderboards.

Table of Contents

Overview
Environment Setup
Pairwise Evaluation


Environment Setup

Setting up your environment is the first step. See the Environment Setup guide for more details.

[Note]

The langchain-opentutorial is a package that provides easy-to-use environment setup, useful functions, and utilities for these tutorials. Check out the langchain-opentutorial for more details.

%%capture --no-stderr
%pip install langchain-opentutorial
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain_openai",
    ],
    verbose=False,
    upgrade=False,
)

You can set API keys in a .env file or set them manually.

[Note] If you’re not using the .env file, no worries! Just enter the keys directly in the cell below, and you’re good to go.

from dotenv import load_dotenv
from langchain_opentutorial import set_env

# Attempt to load environment variables from a .env file; if unsuccessful, set them manually.
if not load_dotenv():
    set_env(
        {
            "OPENAI_API_KEY": "",
            "LANGCHAIN_API_KEY": "",
            "LANGCHAIN_TRACING_V2": "true",
            "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
            "LANGCHAIN_PROJECT": "12-LangSmith-Pairwise-Evaluation",
        }
    )

Pairwise Evaluation

Prepare two answers generated by the LLM in advance.

Provide both answers to an LLM judge and have it select the better one using the following prompt:

"You are an LLM judge. Compare the following two answers to a question and determine which one is better. The better answer is the one that is more detailed and informative. If an answer is not related to the question, it is not a good answer."

Configure the evaluation so that the judged result is uploaded to LangSmith, where it can be verified.

Let's define the evaluator function.

Now, you can generate a dataset from these example executions. Only the inputs need to be saved.
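
Before the evaluator itself, here is a minimal, hypothetical sketch of how the two experiments being compared might have been produced. The dataset name RAG_EVAL_DATASET and the two target functions are assumptions for illustration only; any two experiments run over the same dataset will work.

from langsmith.evaluation import evaluate
from langchain_openai import ChatOpenAI


# Hypothetical target functions: each wraps one of the models being compared.
def ask_gpt4o_mini(inputs: dict) -> dict:
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    return {"answer": llm.invoke(inputs["question"]).content}


def ask_gpt35(inputs: dict) -> dict:
    llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
    return {"answer": llm.invoke(inputs["question"]).content}


# Run both targets against the same dataset (the dataset name is an assumption).
# Each call creates an experiment named "MODEL_COMPARE_EVAL-<suffix>"; those
# experiment names are what evaluate_comparative() receives at the end of this section.
for target in (ask_gpt4o_mini, ask_gpt35):
    evaluate(
        target,
        data="RAG_EVAL_DATASET",
        experiment_prefix="MODEL_COMPARE_EVAL",
    )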

from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser


def evaluate_pairwise(runs: list, example) -> dict:
    """
    A simple evaluator for pairwise answers to score based on engagement
    """

    # Scores keyed by run id, filled in below based on the judge's preference
    scores = {}

    # The two candidate answers and the question for this example
    answer_a = runs[0].outputs["answer"]
    answer_b = runs[1].outputs["answer"]
    question = example.inputs["question"]

    # Judge LLM (temperature 0 for deterministic grading)
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    # Structured prompt
    grade_prompt = PromptTemplate.from_template(
        """
        You are an LLM judge. Compare the following two answers to a question and determine which one is better.
        The better answer is the one that is more detailed and informative.
        If an answer is not related to the question, it is not a good answer.

        # Question:
        {question}

        # Answer A:
        {answer_a}

        # Answer B:
        {answer_b}

        Output should be either `A` or `B`. Pick the answer that is better.

        # Preference:
        """
    )
    answer_grader = grade_prompt | llm | StrOutputParser()

    # Obtain the judge's preference (`A` or `B`)
    score = answer_grader.invoke(
        {
            "question": question,
            "answer_a": answer_a,
            "answer_b": answer_b,
        }
    ).strip()

    # Map the judge's preference to per-run scores
    if score == "A":  # Preference for Assistant A
        scores[runs[0].id] = 1
        scores[runs[1].id] = 0
    elif score == "B":  # Preference for Assistant B
        scores[runs[0].id] = 0
        scores[runs[1].id] = 1
    else:  # No clear preference
        scores[runs[0].id] = 0
        scores[runs[1].id] = 0

    return {"key": "ranked_preference", "scores": scores}

Conduct a comparative evaluation.

from langsmith.evaluation import evaluate_comparative

# Replace with an array of experiment names or IDs
evaluate_comparative(
    ["MODEL_COMPARE_EVAL-05b6496b", "MODEL_COMPARE_EVAL-c264adb7"],
    # Array of evaluators
    evaluators=[evaluate_pairwise],
)
View the pairwise evaluation results at:
    https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=33fa8084-b82f-45ee-a3dd-c374caad16e0%2Cf784a8c4-88ab-4a35-89a7-3aba5367f182&comparativeExperiment=f9b31d2e-299a-45bc-a61c-0c2622dbceac
    
    
<langsmith.evaluation._runner.ComparativeExperimentResults at 0x105fc5bd0>
