LLM-as-Judge


Overview

LLM-as-a-Judge is a method for evaluating and improving large language models in which an LLM scores the outputs of other models, much as a human evaluator would. LLMs are hard to evaluate with conventional metrics because they can do far more than select a correct answer.

To evaluate such open-ended capabilities, using a second LLM as the evaluator, that is, LLM-as-a-Judge, is an effective (if still imperfect) approach.

Typically, the judge is a larger, more capable model than the one used in the application being evaluated. In this tutorial, we will explore the Off-the-shelf Evaluators provided by LangSmith. Off-the-shelf Evaluators are pre-defined, prompt-based LLM evaluators. They are easy to use, but you will need to define custom evaluators for more advanced behavior. In general, evaluation is performed by passing the following three pieces of information to the LLM evaluator:

  • input: Question defined in the dataset

  • prediction: Answer generated by LLM

  • reference: Answer defined in the dataset

Table of Contents

  • Overview

  • Environment Setup

  • Define functions for RAG performance testing

  • Question-Answer Evaluator

  • Context-based Answer Evaluator

  • Criteria

  • Use of Evaluator when correct answers exist (labeled_criteria)

  • Custom function Evaluator


Environment Setup

Setting up your environment is the first step. See the Environment Setup guide for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, along with useful functions and utilities for these tutorials.

  • You can check out langchain-opentutorial for more details.

You can set API keys in a .env file or set them manually.

[Note] If you’re not using the .env file, no worries! Just enter the keys directly in the cell below, and you’re good to go.
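Below is a minimal setup sketch assuming OpenAI and LangSmith keys; the key names and project name are assumptions, so adjust them to your own providers.

```python
# Minimal environment setup sketch (key names and project name are assumptions).
import os
from dotenv import load_dotenv

if not load_dotenv():
    # If no .env file is found, set the keys manually with your own values.
    os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
    os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_PROJECT"] = "LLM-as-Judge"
```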

Define functions for RAG performance testing

Create a RAG system to use for testing.

Create a function called ask_question. It takes a dictionary inputs as input and returns a dictionary with an answer key as output.
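Here is a minimal sketch of such a RAG system and the ask_question wrapper. The source PDF path, splitter settings, vector store, and model name are assumptions for illustration; only the input/output contract (a dict in, a dict with answer out) comes from the description above.

```python
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Build a small RAG chain over the document behind the evaluation dataset.
docs = PyMuPDFLoader("data/agents-whitepaper.pdf").load()  # path is an assumption
splits = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)
retriever = FAISS.from_documents(splits, OpenAIEmbeddings()).as_retriever()

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context.\n\nContext:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Join retrieved documents into a single context string.
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o-mini", temperature=0)
    | StrOutputParser()
)

def ask_question(inputs: dict) -> dict:
    # Takes the dataset `inputs` dict and returns a dict with an "answer" key.
    return {"answer": rag_chain.invoke(inputs["question"])}
```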

Define a function that prints the evaluator's prompt, so you can inspect how the judge LLM is instructed.
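A small helper along those lines is sketched below. The attribute path into the wrapped LangChain evaluator is an assumption and may differ between langsmith versions.

```python
from langsmith.evaluation import LangChainStringEvaluator

def print_evaluator_prompt(evaluator: LangChainStringEvaluator) -> None:
    # LangChainStringEvaluator wraps a LangChain eval chain; print its prompt.
    # (The attribute path is an assumption and may vary by version.)
    print(evaluator.evaluator.prompt)
```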

Question-Answer Evaluator

This is the most basic evaluator, which assesses questions (Query) and answers (Answer).

User input is defined as input, LLM-generated response as prediction, and the correct answer as reference.

However, in the prompt variables they are defined as query, result, and answer.

  • query : User input

  • result : LLM-generated response

  • answer : Correct answer
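The sketch below creates the off-the-shelf qa evaluator and runs it against a dataset; the dataset name RAG_EVAL_DATASET and the experiment prefix are assumptions used for illustration.

```python
from langsmith.evaluation import LangChainStringEvaluator, evaluate

# Off-the-shelf QA evaluator: grades the result against the answer for each query.
qa_evaluator = LangChainStringEvaluator("qa")
print_evaluator_prompt(qa_evaluator)  # inspect the prompt it will send to the judge

evaluate(
    ask_question,                      # target function defined earlier
    data="RAG_EVAL_DATASET",           # dataset name is an assumption
    evaluators=[qa_evaluator],
    experiment_prefix="RAG_EVAL",
    metadata={"variant": "off-the-shelf QA evaluator"},
)
```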

[Image: rag-eval evaluation results in LangSmith]

Context-based Answer Evaluator

  • LangChainStringEvaluator("context_qa"): Instructs the LLM chain to use reference "context" for determining accuracy.

  • LangChainStringEvaluator("cot_qa"): "cot_qa" is similar to the "context_qa" evaluator, but differs in that it instructs the LLM to use 'chain-of-thought' reasoning before making a final judgment.

[Note]

First, you need to define a function that returns the context: context_answer_rag_answer. Then, create a LangChainStringEvaluator. During creation, map the return values of the previously defined function to the evaluator's inputs through prepare_data.

[Details]

  • run : Results generated by the LLM (context, answer, query)

  • example : Data defined in the dataset (question and answer)

The LangChainStringEvaluator needs the following three pieces of information to perform evaluation:

  • prediction : Answer generated by LLM

  • reference : Answer defined in the dataset

  • input : Question defined in the dataset

However, since LangChainStringEvaluator("context_qa") uses reference as the context, it is mapped differently. (Note) Below, we define a function that returns context, answer, and query so that the context_qa evaluator can be used.
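A sketch of both pieces follows: the context-returning target function and the two context-based evaluators, with prepare_data mapping the run and example fields described above. The retriever and rag_chain from the earlier sketch are assumed.

```python
from langsmith.evaluation import LangChainStringEvaluator

def context_answer_rag_answer(inputs: dict) -> dict:
    # Return the retrieved context alongside the generated answer and the query.
    context_docs = retriever.invoke(inputs["question"])
    return {
        "context": "\n".join(doc.page_content for doc in context_docs),
        "answer": rag_chain.invoke(inputs["question"]),
        "query": inputs["question"],
    }

# Map run/example fields to the keys the evaluators expect via prepare_data.
cot_qa_evaluator = LangChainStringEvaluator(
    "cot_qa",
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],    # answer generated by the LLM
        "reference": run.outputs["context"],    # retrieved context used as reference
        "input": example.inputs["question"],    # question defined in the dataset
    },
)

context_qa_evaluator = LangChainStringEvaluator(
    "context_qa",
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": run.outputs["context"],
        "input": example.inputs["question"],
    },
)
```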

Run the evaluation and check the returned results.
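A sketch of running both evaluators and converting the results to a DataFrame (dataset name and experiment prefix are again assumptions):

```python
from langsmith.evaluation import evaluate

result = evaluate(
    context_answer_rag_answer,
    data="RAG_EVAL_DATASET",
    evaluators=[cot_qa_evaluator, context_qa_evaluator],
    experiment_prefix="RAG_EVAL",
)

# Inspect the per-example results as a DataFrame.
result.to_pandas()
```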

| | inputs.question | outputs.context | outputs.answer | outputs.query | error | reference.answer | feedback.COT Contextual Accuracy | feedback.Contextual Accuracy | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What are the three targeted learnings to enhan... | Agents\n33\nSeptember 2024\nEnhancing model pe... | The three targeted learnings to enhance model ... | What are the three targeted learnings to enhan... | None | The three targeted learning approaches to enha... | 0 | 0 | 2.606171 | 0e661de4-636b-425d-8f6e-0a52b8070576 | a3c6714d-8f28-4a82-93b4-d4f260af54ae |
| 1 | What are the key functions of an agent's orche... | implementation of the agent orchestration laye... | The key functions of an agent's orchestration ... | What are the key functions of an agent's orche... | None | The key functions of an agent's orchestration ... | 1 | 1 | 4.474181 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | 180daa5e-4279-47ac-9150-d19ab5eb94cb |
| 2 | List up the name of the authors | Agents\nAuthors: Julia Wiesinger, Patrick Marl... | The authors are Julia Wiesinger, Patrick Marlo... | List up the name of the authors | None | The authors are Julia Wiesinger, Patrick Marlo... | 1 | 1 | 1.298198 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | d8eed689-7e42-4897-936b-b3628ee5632c |
| 3 | What is Tree-of-thoughts? | weaknesses depending on the specific applicati... | Tree-of-thoughts (ToT) is a prompt engineering... | What is Tree-of-thoughts? | None | Tree-of-thoughts (ToT) is a prompt engineering... | 1 | 1 | 2.477597 | be18ec98-ab18-4f30-9205-e75f1cb70844 | ef6126c1-cba0-4cfd-9725-628dc5a861e4 |
| 4 | What is the framework used for reasoning and p... | reasoning frameworks (CoT, ReAct, etc.) to \nf... | The frameworks used for reasoning and planning... | What is the framework used for reasoning and p... | None | The frameworks used for reasoning and planning... | 1 | 1 | 2.092742 | eb4b29a7-511c-4f78-a08f-2d5afeb84320 | ed7c7dba-8102-49e7-ad0b-f118f79d7e6f |
| 5 | How do agents differ from standalone language ... | 1.\t Agents extend the capabilities of languag... | Agents differ from standalone language models ... | How do agents differ from standalone language ... | None | Agents can use tools to access real-time data ... | 1 | 1 | 6.000040 | f4a5a0cf-2d2e-4e15-838a-bc8296eb708b | 9f190340-10c5-4af2-be25-c35bf5b7f29a |

[Image: context-based-eval evaluation results in LangSmith]

Even if the generated answer doesn't match the ground truth, it will be graded CORRECT as long as it is accurate with respect to the given context.

Criteria

When reference labels (correct answers) are unavailable or difficult to obtain, you can use the "Criteria" or "Score" evaluators to assess runs against a custom set of criteria.

This is useful when you want to monitor high-level semantic aspects of the model's responses.

| Criteria | Description |
|---|---|
| conciseness | Evaluates if the answer is concise and simple |
| relevance | Evaluates if the answer is relevant to the question |
| correctness | Evaluates if the answer is correct |
| coherence | Evaluates if the answer is coherent |
| harmfulness | Evaluates if the answer is harmful or dangerous |
| maliciousness | Evaluates if the answer is malicious or aggravating |
| helpfulness | Evaluates if the answer is helpful |
| controversiality | Evaluates if the answer is controversial |
| misogyny | Evaluates if the answer is misogynistic |
| criminality | Evaluates if the answer promotes criminal behavior |
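A sketch of running a few reference-free criteria evaluators (the selected criteria, dataset name, and experiment prefix are assumptions):

```python
from langsmith.evaluation import LangChainStringEvaluator, evaluate

# Each criteria evaluator returns a binary score (0 or 1) with short reasoning.
criteria_evaluators = [
    LangChainStringEvaluator("criteria", config={"criteria": "conciseness"}),
    LangChainStringEvaluator("criteria", config={"criteria": "relevance"}),
    LangChainStringEvaluator("criteria", config={"criteria": "coherence"}),
]

evaluate(
    ask_question,
    data="RAG_EVAL_DATASET",
    evaluators=criteria_evaluators,
    experiment_prefix="CRITERIA_EVAL",
)
```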

[Image: criteria-eval evaluation results in LangSmith]

Use of Evaluator when correct answers exist (labeled_criteria)

When correct answers exist, it's possible to evaluate by comparing the LLM-generated answer with the correct answer. As shown in the example below, pass the correct answer to reference and the LLM-generated answer to prediction. Such settings are defined through prepare_data. Additionally, the LLM used for answer evaluation is defined through llm in the config.

Here's an example of evaluating relevance. This time, we pass the context as the reference through prepare_data.
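The sketch below shows both variants: a labeled_criteria evaluator that uses the ground-truth answer as reference, and a relevance evaluator that uses the retrieved context as reference. The judge model and the criterion wording are assumptions.

```python
from langchain_openai import ChatOpenAI
from langsmith.evaluation import LangChainStringEvaluator

# Helpfulness, judged against the reference answer from the dataset.
labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": "Is this submission helpful to the user, "
            "taking into account the reference answer?"
        },
        "llm": ChatOpenAI(model="gpt-4o-mini", temperature=0),  # judge model (assumption)
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": example.outputs["answer"],   # correct answer from the dataset
        "input": example.inputs["question"],
    },
)

# Relevance, judged against the retrieved context instead of the ground truth.
relevance_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": "relevance",
        "llm": ChatOpenAI(model="gpt-4o-mini", temperature=0),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": run.outputs["context"],      # context passed as reference
        "input": example.inputs["question"],
    },
)
```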

Run the evaluation and check the returned results.
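For example (dataset name and experiment prefix are assumptions):

```python
from langsmith.evaluation import evaluate

evaluate(
    context_answer_rag_answer,   # target that also returns the context
    data="RAG_EVAL_DATASET",
    evaluators=[labeled_helpfulness_evaluator, relevance_evaluator],
    experiment_prefix="LABELED_CRITERIA_EVAL",
)
```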

[Image: labeled-eval evaluation results in LangSmith]

Custom function Evaluator

Here's an example of creating an evaluator that returns scores. You can normalize the scores through normalize_by; the returned scores are then normalized to values between 0 and 1. The accuracy criterion below is user-defined; you can define and use prompts appropriate for your needs.
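A sketch of such a score-returning evaluator, using the labeled_score_string evaluator type with normalize_by=10 so that 1-10 scores are mapped to 0-1; the criterion wording, judge model, and field mapping are assumptions.

```python
from langchain_openai import ChatOpenAI
from langsmith.evaluation import LangChainStringEvaluator

labeled_score_evaluator = LangChainStringEvaluator(
    "labeled_score_string",
    config={
        "criteria": {
            "accuracy": "How accurate is this prediction compared to the reference answer?"
        },
        "normalize_by": 10,  # raw 1-10 scores are normalized to the 0-1 range
        "llm": ChatOpenAI(model="gpt-4o-mini", temperature=0),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    },
)
```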

Run the evaluation and check the returned results.
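For example:

```python
from langsmith.evaluation import evaluate

evaluate(
    ask_question,
    data="RAG_EVAL_DATASET",             # dataset name is an assumption
    evaluators=[labeled_score_evaluator],
    experiment_prefix="LABELED_SCORE_EVAL",
)
```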

[Image: labeled-score-eval evaluation results in LangSmith]
