LLM-as-a-Judge is a method for evaluating and improving large language models in which an LLM grades the outputs of another model, much as a human evaluator would. LLMs are hard to evaluate with conventional metrics because they can do far more than select a correct answer.
To evaluate such open-ended capabilities, using a second LLM as an evaluator - that is, LLM-as-a-Judge - has proven effective, even if still imperfect.
Typically, the judge is a larger and more capable model than the one used in the application being evaluated. In this tutorial, we explore the Off-the-shelf Evaluators provided by LangSmith. Off-the-shelf Evaluators are pre-defined, prompt-based LLM evaluators. They are easy to use, but for more advanced behavior you need to define custom evaluators. Basically, evaluation is performed by passing the following three pieces of information to the LLM evaluator:
prediction : Answer generated by the LLM
reference : Answer defined in the dataset
input : Question defined in the dataset
# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)
True
Define functions for RAG performance testing
Create RAG system to use for testing
from myrag import PDFRAG
from langchain_openai import ChatOpenAI

# Create PDFRAG object
rag = PDFRAG(
    "data/Newwhitepaper_Agents2.pdf",
    ChatOpenAI(model="gpt-4o-mini", temperature=0),
)

# Create Retriever
retriever = rag.create_retriever()

# Create Chain
chain = rag.create_chain(retriever)

# Generate answer for a question
chain.invoke("List up the name of the authors")
'The authors are Julia Wiesinger, Patrick Marlow, and Vladimir Vuskovic.'
Create a function named ask_question. It takes a dictionary called inputs and returns a dictionary containing the answer.
# Create function to answer a question
def ask_question(inputs: dict):
    return {"answer": chain.invoke(inputs["question"])}
# Example user question
llm_answer = ask_question(
    {"question": "List up the name of the authors"}
)
llm_answer
{'answer': 'The authors are Julia Wiesinger, Patrick Marlow, and Vladimir Vuskovic.'}
Define a function that prints the evaluator's prompt.
# Function to print the evaluator prompt
def print_evaluator_prompt(evaluator):
    return evaluator.evaluator.prompt.pretty_print()
Question-Answer Evaluator
This is the most basic evaluator: it assesses the answer (Answer) against the question (Query).
The user input is mapped to input, the LLM-generated response to prediction, and the correct answer to reference.
In the prompt variables, however, they are named query, result, and answer:
query : User input
result : LLM-generated response
answer : Correct answer
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Create Question-Answer Evaluator
qa_evaluator = LangChainStringEvaluator("qa")

# Print the prompt
print_evaluator_prompt(qa_evaluator)
You are a teacher grading a quiz.
You are given a question, the student's answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.
Example Format:
QUESTION: question here
STUDENT ANSWER: student's answer here
TRUE ANSWER: true answer here
GRADE: CORRECT or INCORRECT here
Grade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin!
QUESTION: {query}
STUDENT ANSWER: {result}
TRUE ANSWER: {answer}
GRADE:
# Set the dataset name
dataset_name = "RAG_EVAL_DATASET"

# Execute evaluation
experiment_results = evaluate(
    ask_question,
    data=dataset_name,
    evaluators=[qa_evaluator],
    experiment_prefix="RAG_EVAL",
    # Specify experiment metadata
    metadata={
        "variant": "Evaluation with QA Evaluator",
    },
)
View the evaluation results for experiment: 'RAG_EVAL-899d197c' at:
https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=66642322-32fe-4d30-9275-cca885c98205
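The evaluate call returns an ExperimentResults object, and the scores can be browsed in the LangSmith UI at the link above. To inspect them locally instead, the results can also be converted to a DataFrame - a minimal sketch, assuming pandas is installed and the SDK's to_pandas() helper is available:

# Inspect the evaluation results locally as a DataFrame
# (assumes ExperimentResults.to_pandas() is available and pandas is installed)
df = experiment_results.to_pandas()
df.head()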
Context-based Answer Evaluator
LangChainStringEvaluator("context_qa"): Instructs the LLM chain to use reference "context" for determining accuracy.
LangChainStringEvaluator("cot_qa"): "cot_qa" is similar to the "context_qa" evaluator, but differs in that it instructs the LLM to use 'chain-of-thought' reasoning before making a final judgment.
[Note] First, define a function that returns the context: context_answer_rag_answer. Then create the LangChainStringEvaluator, mapping the return values of that function to the evaluator's inputs through prepare_data.
[Details]
run: Results generated by LLM (context, answer, input)
example: Data defined in the dataset (question and answer)
The LangChainStringEvaluator needs the following three pieces of information to perform evaluation:
prediction: Answer generated by LLM
reference: Answer defined in the dataset
input: Question defined in the dataset
However, since LangChainStringEvaluator("context_qa") uses reference as the context rather than the ground-truth answer, the mapping is defined differently. (Note) Below, we define a function that returns the context, answer, and question so it can be used with the context_qa evaluator.
# Define Context-based RAG response function
def context_answer_rag_answer(inputs: dict):
    context = retriever.invoke(inputs["question"])
    return {
        "context": "\n".join([doc.page_content for doc in context]),
        "answer": chain.invoke(inputs["question"]),
        "query": inputs["question"],
    }
# Execute the function
context_answer_rag_answer(
    {"question": "List up the name of the authors"}
)
{'context': 'Agents\nAuthors: Julia Wiesinger, Patrick Marlow \nand Vladimir Vuskovic\nAgents\n2\nSeptember 2024\nAcknowledgements\nReviewers and Contributors\nEvan Huang\nEmily Xue\nOlcan Sercinoglu\nSebastian Riedel\nSatinder Baveja\nAntonio Gulli\nAnant Nawalgaria\nCurators and Editors\nAntonio Gulli\nAnant Nawalgaria\nGrace Mollison \nTechnical Writer\nJoey Haymaker\nDesigner\nMichael Lanning\n38\nSummary\x08\n40\nEndnotes\x08\n42\nTable of contents\nAgents\n22\nSeptember 2024\nUnset\nfunction_call {\n name: "display_cities"\n args: {\n "cities": ["Crested Butte", "Whistler", "Zermatt"],\n "preferences": "skiing"\n }\n}\nSnippet 5. Sample Function Call payload for displaying a list of cities and user preferences',
'answer': 'The authors are Julia Wiesinger, Patrick Marlow, and Vladimir Vuskovic.',
'query': 'List up the name of the authors'}
# Create a cot_qa Evaluator
cot_qa_evaluator = LangChainStringEvaluator(
    "cot_qa",
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],  # Answer generated by LLM
        "reference": run.outputs["context"],  # Context
        "input": example.inputs["question"],  # Question defined in the dataset
    },
)

# Create a context_qa Evaluator
context_qa_evaluator = LangChainStringEvaluator(
    "context_qa",
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],  # Answer generated by LLM
        "reference": run.outputs["context"],  # Context
        "input": example.inputs["question"],  # Question defined in the dataset
    },
)

# Print evaluator prompts
print("COT_QA Evaluator Prompt")
print_evaluator_prompt(cot_qa_evaluator)
print("Context_QA Evaluator Prompt")
print_evaluator_prompt(context_qa_evaluator)
COT_QA Evaluator Prompt
You are a teacher grading a quiz.
You are given a question, the context the question is about, and the student's answer. You are asked to score the student's answer as either CORRECT or INCORRECT, based on the context.
Write out in a step by step manner your reasoning to be sure that your conclusion is correct. Avoid simply stating the correct answer at the outset.
Example Format:
QUESTION: question here
CONTEXT: context the question is about here
STUDENT ANSWER: student's answer here
EXPLANATION: step by step reasoning here
GRADE: CORRECT or INCORRECT here
Grade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin!
QUESTION: {query}
CONTEXT: {context}
STUDENT ANSWER: {result}
EXPLANATION:
Context_QA Evaluator Prompt
You are a teacher grading a quiz.
You are given a question, the context the question is about, and the student's answer. You are asked to score the student's answer as either CORRECT or INCORRECT, based on the context.
Example Format:
QUESTION: question here
CONTEXT: context the question is about here
STUDENT ANSWER: student's answer here
GRADE: CORRECT or INCORRECT here
Grade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin!
QUESTION: {query}
CONTEXT: {context}
STUDENT ANSWER: {result}
GRADE:
Execute the evaluation and check the returned results.
# Set the dataset name
dataset_name = "RAG_EVAL_DATASET"

# Execute evaluation
evaluate(
    context_answer_rag_answer,
    data=dataset_name,
    evaluators=[cot_qa_evaluator, context_qa_evaluator],
    experiment_prefix="RAG_EVAL",
    metadata={
        "variant": "Evaluation with COT_QA & Context_QA Evaluator",
    },
)
View the evaluation results for experiment: 'RAG_EVAL-8087cb7d' at:
https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=80980c43-0edd-4483-a4a7-18eb2bf81d3b
| | inputs.question | outputs.context | outputs.answer | outputs.query | error | reference.answer | feedback.COT Contextual Accuracy | feedback.Contextual Accuracy | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What are the three targeted learnings to enhan... | Agents\n33\nSeptember 2024\nEnhancing model pe... | The three targeted learnings to enhance model ... | What are the three targeted learnings to enhan... | None | The three targeted learning approaches to enha... | 0 | 0 | 2.606171 | 0e661de4-636b-425d-8f6e-0a52b8070576 | a3c6714d-8f28-4a82-93b4-d4f260af54ae |
| 1 | What are the key functions of an agent's orche... | implementation of the agent orchestration laye... | The key functions of an agent's orchestration ... | What are the key functions of an agent's orche... | None | The key functions of an agent's orchestration ... | 1 | 1 | 4.474181 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | 180daa5e-4279-47ac-9150-d19ab5eb94cb |
| 2 | List up the name of the authors | Agents\nAuthors: Julia Wiesinger, Patrick Marl... | The authors are Julia Wiesinger, Patrick Marlo... | List up the name of the authors | None | The authors are Julia Wiesinger, Patrick Marlo... | 1 | 1 | 1.298198 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | d8eed689-7e42-4897-936b-b3628ee5632c |
| 3 | What is Tree-of-thoughts? | weaknesses depending on the specific applicati... | Tree-of-thoughts (ToT) is a prompt engineering... | What is Tree-of-thoughts? | None | Tree-of-thoughts (ToT) is a prompt engineering... | 1 | 1 | 2.477597 | be18ec98-ab18-4f30-9205-e75f1cb70844 | ef6126c1-cba0-4cfd-9725-628dc5a861e4 |
| 4 | What is the framework used for reasoning and p... | reasoning frameworks (CoT, ReAct, etc.) to \nf... | The frameworks used for reasoning and planning... | What is the framework used for reasoning and p... | None | The frameworks used for reasoning and planning... | 1 | 1 | 2.092742 | eb4b29a7-511c-4f78-a08f-2d5afeb84320 | ed7c7dba-8102-49e7-ad0b-f118f79d7e6f |
| 5 | How do agents differ from standalone language ... | 1.\t Agents extend the capabilities of languag... | Agents differ from standalone language models ... | How do agents differ from standalone language ... | None | Agents can use tools to access real-time data ... | 1 | 1 | 6.000040 | f4a5a0cf-2d2e-4e15-838a-bc8296eb708b | 9f190340-10c5-4af2-be25-c35bf5b7f29a |
Because these evaluators grade against the retrieved context, an answer can be graded CORRECT even when it does not match the ground truth, as long as it is faithful to the given context.
Criteria
When reference labels (correct answers) are unavailable or difficult to obtain, you can use the "Criteria" or "Score" evaluators to assess runs against a custom set of criteria.
This is useful when you want to monitor high-level semantic aspects of the model's responses.
LangChainStringEvaluator("criteria", config = {"criteria": "one of the criteria below"})
| Criteria | Description |
|---|---|
| conciseness | Evaluates if the answer is concise and simple |
| relevance | Evaluates if the answer is relevant to the question |
| correctness | Evaluates if the answer is correct |
| coherence | Evaluates if the answer is coherent |
| harmfulness | Evaluates if the answer is harmful or dangerous |
| maliciousness | Evaluates if the answer is malicious or aggravating |
| helpfulness | Evaluates if the answer is helpful |
| controversiality | Evaluates if the answer is controversial |
| misogyny | Evaluates if the answer is misogynistic |
| criminality | Evaluates if the answer promotes criminal behavior |
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Set Evaluators
criteria_evaluator = [
    LangChainStringEvaluator("criteria", config={"criteria": "conciseness"}),
    LangChainStringEvaluator("criteria", config={"criteria": "misogyny"}),
    LangChainStringEvaluator("criteria", config={"criteria": "criminality"}),
]

# Set the dataset name
dataset_name = "RAG_EVAL_DATASET"

# Execute evaluation
experiment_results = evaluate(
    ask_question,
    data=dataset_name,
    evaluators=criteria_evaluator,
    experiment_prefix="CRITERIA-EVAL",
    # Specify experiment metadata
    metadata={
        "variant": "Evaluation with Criteria Evaluator",
    },
)
View the evaluation results for experiment: 'CRITERIA-EVAL-c7ebf8e3' at:
https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=0a18e29b-60fe-4427-a51f-c52299a18898
Using an evaluator when correct answers exist (labeled_criteria)
When a correct answer exists, you can evaluate by comparing the LLM-generated answer against it. As in the example below, pass the correct answer as reference and the LLM-generated answer as prediction; this mapping is defined through prepare_data. The LLM used to perform the evaluation is set through llm in the config.
from langsmith.evaluation import LangChainStringEvaluator
from langchain_openai import ChatOpenAI

# Create labeled_criteria Evaluator
labeled_criteria_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm": ChatOpenAI(temperature=0.0, model="gpt-4o-mini"),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": example.outputs["answer"],  # Correct answer
        "input": example.inputs["question"],
    },
)

# Print evaluator prompt
print_evaluator_prompt(labeled_criteria_evaluator)
You are assessing a submitted answer on a given task or input based on a set of criteria. Here is the data:
[BEGIN DATA]
***
[Input]: {input}
***
[Submission]: {output}
***
[Criteria]: helpfulness: Is this submission helpful to the user, taking into account the correct reference answer?
***
[Reference]: {reference}
***
[END DATA]
Does the submission meet the Criteria? First, write out in a step by step manner your reasoning about each criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer of whether the submission meets all criteria. At the end, repeat just the letter again by itself on a new line.
Here is an example that evaluates relevance. This time, we pass the retrieved context as the reference through prepare_data; the resulting relevance_evaluator is used alongside labeled_criteria_evaluator in the evaluation below.
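A minimal sketch of this evaluator, assuming the built-in relevance criterion and the same prepare_data pattern as the labeled_criteria example above:

# Create relevance Evaluator (sketch: built-in "relevance" criterion, context mapped to reference)
relevance_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": "relevance",
        "llm": ChatOpenAI(temperature=0.0, model="gpt-4o-mini"),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": run.outputs["context"],  # Pass the retrieved context as the reference
        "input": example.inputs["question"],
    },
)

# Print evaluator prompt
print_evaluator_prompt(relevance_evaluator)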
You are assessing a submitted answer on a given task or input based on a set of criteria. Here is the data:
[BEGIN DATA]
***
[Input]: {input}
***
[Submission]: {output}
***
[Criteria]: relevance: Is the submission referring to a real quote from the text?
***
[Reference]: {reference}
***
[END DATA]
Does the submission meet the Criteria? First, write out in a step by step manner your reasoning about each criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer of whether the submission meets all criteria. At the end, repeat just the letter again by itself on a new line.
Execute the evaluation and check the returned results.
from langsmith.evaluation import evaluate

# Set the dataset name
dataset_name = "RAG_EVAL_DATASET"

# Execute evaluation
experiment_results = evaluate(
    context_answer_rag_answer,
    data=dataset_name,
    evaluators=[labeled_criteria_evaluator, relevance_evaluator],
    experiment_prefix="LABELED-EVAL",
    # Specify experiment metadata
    metadata={
        "variant": "Evaluation with Labeled_criteria Evaluator",
    },
)
View the evaluation results for experiment: 'LABELED-EVAL-80ee678c' at:
https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=36dc1710-9c3a-46ce-b1ab-a209bb8b700d
Score Evaluator (labeled_score_string)
Here is an example of creating an evaluator that returns scores. Scores can be normalized through normalize_by; the converted scores fall between 0 and 1. The accuracy criterion below is user-defined, and you can write whatever criterion prompt fits your needs.
from langsmith.evaluation import LangChainStringEvaluator

# Create labeled score evaluator
labeled_score_evaluator = LangChainStringEvaluator(
    "labeled_score_string",
    config={
        "criteria": {
            "accuracy": "How accurate is this prediction compared to the reference on a scale of 1-10?"
        },
        "normalize_by": 10,
        "llm": ChatOpenAI(temperature=0.0, model="gpt-4o-mini"),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    },
)

print_evaluator_prompt(labeled_score_evaluator)
================================ System Message ================================
You are a helpful assistant.
================================= Human Message =================================
[Instruction]
Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. {criteria}
[Ground truth]
{reference}
Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".
[Question]
{input}
[The Start of Assistant's Answer]
{prediction}
[The End of Assistant's Answer]
Execute the evaluation and check the returned results.
from langsmith.evaluation import evaluate

# Execute evaluation
experiment_results = evaluate(
    ask_question,
    data=dataset_name,
    evaluators=[labeled_score_evaluator],
    experiment_prefix="LABELED-SCORE-EVAL",
    # Specify experiment metadata
    metadata={
        "variant": "Evaluation with Labeled_score Evaluator",
    },
)
View the evaluation results for experiment: 'LABELED-SCORE-EVAL-ca73be6c' at:
https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=a846a6af-3409-4907-9d90-849e0532533f