LLM-as-a-Judge is a method for evaluating and improving large language models in which an LLM grades the outputs of another model, much as a human evaluator would. LLMs are hard to evaluate with conventional metrics because they can do far more than select a correct answer.
To evaluate such open-ended capabilities, using a second LLM as an evaluator - that is, LLM-as-a-Judge - has proven effective, even if still imperfect.
Typically, the judge is a larger and more capable model than the one used in the application being evaluated. In this tutorial, we explore the Off-the-shelf Evaluators provided by LangSmith. Off-the-shelf Evaluators are pre-defined, prompt-based LLM evaluators. They are easy to use, but for more advanced behavior you need to define custom evaluators. Basically, evaluation is performed by passing the following three pieces of information to the LLM evaluator:
prediction : Answer generated by the LLM
reference : Answer defined in the dataset
input : Question defined in the dataset
# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)
True
Define functions for RAG performance testing
Create RAG system to use for testing
from myrag import PDFRAG
from langchain_openai import ChatOpenAI

# Create PDFRAG object
rag = PDFRAG(
    "data/Newwhitepaper_Agents2.pdf",
    ChatOpenAI(model="gpt-4o-mini", temperature=0),
)

# Create Retriever
retriever = rag.create_retriever()

# Create Chain
chain = rag.create_chain(retriever)

# Generate answer for a question
chain.invoke("List up the name of the authors")
'The authors are Julia Wiesinger, Patrick Marlow, and Vladimir Vuskovic.'
Create a function named ask_question. It takes a dictionary called inputs and returns a dictionary containing the answer.
# Create function to answer a question
def ask_question(inputs: dict):
    return {"answer": chain.invoke(inputs["question"])}
# Example user question
llm_answer = ask_question(
    {"question": "List up the name of the authors"}
)
llm_answer
{'answer': 'The authors are Julia Wiesinger, Patrick Marlow, and Vladimir Vuskovic.'}
Define a function that prints the evaluator's prompt.
# Function to print the evaluator prompt
def print_evaluator_prompt(evaluator):
    return evaluator.evaluator.prompt.pretty_print()
Question-Answer Evaluator
This is the most basic evaluator: it assesses the answer (Answer) against the question (Query).
The user input is mapped to input, the LLM-generated response to prediction, and the correct answer to reference.
In the prompt variables, however, they are named query, result, and answer:
query : User input
result : LLM-generated response
answer : Correct answer
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Create Question-Answer Evaluator
qa_evaluator = LangChainStringEvaluator("qa")

# Print the prompt
print_evaluator_prompt(qa_evaluator)
You are a teacher grading a quiz.
You are given a question, the student's answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.
Example Format:
QUESTION: question here
STUDENT ANSWER: student's answer here
TRUE ANSWER: true answer here
GRADE: CORRECT or INCORRECT here
Grade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin!
QUESTION: {query}
STUDENT ANSWER: {result}
TRUE ANSWER: {answer}
GRADE:
# Set the dataset name
dataset_name = "RAG_EVAL_DATASET"

# Execute evaluation
experiment_results = evaluate(
    ask_question,
    data=dataset_name,
    evaluators=[qa_evaluator],
    experiment_prefix="RAG_EVAL",
    # Specify experiment metadata
    metadata={
        "variant": "Evaluation with QA Evaluator",
    },
)
View the evaluation results for experiment: 'RAG_EVAL-899d197c' at:
https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=66642322-32fe-4d30-9275-cca885c98205
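The evaluate call returns an ExperimentResults object, and the scores can be browsed in the LangSmith UI at the link above. To inspect them locally instead, the results can also be converted to a DataFrame - a minimal sketch, assuming pandas is installed and the SDK's to_pandas() helper is available:

# Inspect the evaluation results locally as a DataFrame
# (assumes ExperimentResults.to_pandas() is available and pandas is installed)
df = experiment_results.to_pandas()
df.head()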
Context-based Answer Evaluator
LangChainStringEvaluator("context_qa"): Instructs the LLM chain to use reference "context" for determining accuracy.
LangChainStringEvaluator("cot_qa"): "cot_qa" is similar to the "context_qa" evaluator, but differs in that it instructs the LLM to use 'chain-of-thought' reasoning before making a final judgment.
[Note] First, define a function that returns the context: context_answer_rag_answer. Then create the LangChainStringEvaluator, mapping the return values of that function to the evaluator's inputs through prepare_data.
[Details]
run: Results generated by LLM (context, answer, input)
example: Data defined in the dataset (question and answer)
The LangChainStringEvaluator needs the following three pieces of information to perform evaluation:
prediction: Answer generated by LLM
reference: Answer defined in the dataset
input: Question defined in the dataset
However, since LangChainStringEvaluator("context_qa") uses reference as the context rather than the ground-truth answer, the mapping is defined differently. (Note) Below, we define a function that returns the context, answer, and question so it can be used with the context_qa evaluator.
# Define Context-based RAG response function
def context_answer_rag_answer(inputs: dict):
    context = retriever.invoke(inputs["question"])
    return {
        "context": "\n".join([doc.page_content for doc in context]),
        "answer": chain.invoke(inputs["question"]),
        "query": inputs["question"],
    }
# Execute the function
context_answer_rag_answer(
    {"question": "List up the name of the authors"}
)
{'context': 'Agents\nAuthors: Julia Wiesinger, Patrick Marlow \nand Vladimir Vuskovic\nAgents\n2\nSeptember 2024\nAcknowledgements\nReviewers and Contributors\nEvan Huang\nEmily Xue\nOlcan Sercinoglu\nSebastian Riedel\nSatinder Baveja\nAntonio Gulli\nAnant Nawalgaria\nCurators and Editors\nAntonio Gulli\nAnant Nawalgaria\nGrace Mollison \nTechnical Writer\nJoey Haymaker\nDesigner\nMichael Lanning\n38\nSummary\x08\n40\nEndnotes\x08\n42\nTable of contents\nAgents\n22\nSeptember 2024\nUnset\nfunction_call {\n name: "display_cities"\n args: {\n "cities": ["Crested Butte", "Whistler", "Zermatt"],\n "preferences": "skiing"\n }\n}\nSnippet 5. Sample Function Call payload for displaying a list of cities and user preferences',
'answer': 'The authors are Julia Wiesinger, Patrick Marlow, and Vladimir Vuskovic.',
'query': 'List up the name of the authors'}
# Create a cot_qa Evaluator
cot_qa_evaluator = LangChainStringEvaluator(
    "cot_qa",
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],  # Answer generated by LLM
        "reference": run.outputs["context"],  # Context
        "input": example.inputs["question"],  # Question defined in the dataset
    },
)

# Create a context_qa Evaluator
context_qa_evaluator = LangChainStringEvaluator(
    "context_qa",
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],  # Answer generated by LLM
        "reference": run.outputs["context"],  # Context
        "input": example.inputs["question"],  # Question defined in the dataset
    },
)

# Print evaluator prompts
print("COT_QA Evaluator Prompt")
print_evaluator_prompt(cot_qa_evaluator)
print("Context_QA Evaluator Prompt")
print_evaluator_prompt(context_qa_evaluator)
COT_QA Evaluator Prompt
You are a teacher grading a quiz.
You are given a question, the context the question is about, and the student's answer. You are asked to score the student's answer as either CORRECT or INCORRECT, based on the context.
Write out in a step by step manner your reasoning to be sure that your conclusion is correct. Avoid simply stating the correct answer at the outset.
Example Format:
QUESTION: question here
CONTEXT: context the question is about here
STUDENT ANSWER: student's answer here
EXPLANATION: step by step reasoning here
GRADE: CORRECT or INCORRECT here
Grade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin!
QUESTION: {query}
CONTEXT: {context}
STUDENT ANSWER: {result}
EXPLANATION:
Context_QA Evaluator Prompt
You are a teacher grading a quiz.
You are given a question, the context the question is about, and the student's answer. You are asked to score the student's answer as either CORRECT or INCORRECT, based on the context.
Example Format:
QUESTION: question here
CONTEXT: context the question is about here
STUDENT ANSWER: student's answer here
GRADE: CORRECT or INCORRECT here
Grade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin!
QUESTION: {query}
CONTEXT: {context}
STUDENT ANSWER: {result}
GRADE:
Execute the evaluation and check the returned results.
# Set the dataset name
dataset_name = "RAG_EVAL_DATASET"

# Execute evaluation
evaluate(
    context_answer_rag_answer,
    data=dataset_name,
    evaluators=[cot_qa_evaluator, context_qa_evaluator],
    experiment_prefix="RAG_EVAL",
    metadata={
        "variant": "Evaluation with COT_QA & Context_QA Evaluator",
    },
)
View the evaluation results for experiment: 'RAG_EVAL-8087cb7d' at:
https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=80980c43-0edd-4483-a4a7-18eb2bf81d3b
| | inputs.question | outputs.context | outputs.answer | outputs.query | error | reference.answer | feedback.COT Contextual Accuracy | feedback.Contextual Accuracy | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What are the three targeted learnings to enhan... | Agents\n33\nSeptember 2024\nEnhancing model pe... | The three targeted learnings to enhance model ... | What are the three targeted learnings to enhan... | None | The three targeted learning approaches to enha... | 0 | 0 | 2.606171 | 0e661de4-636b-425d-8f6e-0a52b8070576 | a3c6714d-8f28-4a82-93b4-d4f260af54ae |
| 1 | What are the key functions of an agent's orche... | implementation of the agent orchestration laye... | The key functions of an agent's orchestration ... | What are the key functions of an agent's orche... | None | The key functions of an agent's orchestration ... | 1 | 1 | 4.474181 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | 180daa5e-4279-47ac-9150-d19ab5eb94cb |
| 2 | List up the name of the authors | Agents\nAuthors: Julia Wiesinger, Patrick Marl... | The authors are Julia Wiesinger, Patrick Marlo... | List up the name of the authors | None | The authors are Julia Wiesinger, Patrick Marlo... | 1 | 1 | 1.298198 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | d8eed689-7e42-4897-936b-b3628ee5632c |
| 3 | What is Tree-of-thoughts? | weaknesses depending on the specific applicati... | Tree-of-thoughts (ToT) is a prompt engineering... | What is Tree-of-thoughts? | None | Tree-of-thoughts (ToT) is a prompt engineering... | 1 | 1 | 2.477597 | be18ec98-ab18-4f30-9205-e75f1cb70844 | ef6126c1-cba0-4cfd-9725-628dc5a861e4 |
| 4 | What is the framework used for reasoning and p... | reasoning frameworks (CoT, ReAct, etc.) to \nf... | The frameworks used for reasoning and planning... | What is the framework used for reasoning and p... | None | The frameworks used for reasoning and planning... | 1 | 1 | 2.092742 | eb4b29a7-511c-4f78-a08f-2d5afeb84320 | ed7c7dba-8102-49e7-ad0b-f118f79d7e6f |
| 5 | How do agents differ from standalone language ... | 1.\t Agents extend the capabilities of languag... | Agents differ from standalone language models ... | How do agents differ from standalone language ... | None | Agents can use tools to access real-time data ... | 1 | 1 | 6.000040 | f4a5a0cf-2d2e-4e15-838a-bc8296eb708b | 9f190340-10c5-4af2-be25-c35bf5b7f29a |
Because these evaluators grade against the retrieved context, an answer can be graded CORRECT even when it does not match the ground truth, as long as it is faithful to the given context.
Criteria
When reference labels (correct answers) are unavailable or difficult to obtain, you can use the "Criteria" or "Score" evaluators to assess runs against a custom set of criteria.
This is useful when you want to monitor high-level semantic aspects of the model's responses.
LangChainStringEvaluator("criteria", config = {"criteria": "one of the criteria below"})
| Criteria | Description |
|---|---|
| conciseness | Evaluates if the answer is concise and simple |
| relevance | Evaluates if the answer is relevant to the question |
| correctness | Evaluates if the answer is correct |
| coherence | Evaluates if the answer is coherent |
| harmfulness | Evaluates if the answer is harmful or dangerous |
| maliciousness | Evaluates if the answer is malicious or aggravating |
| helpfulness | Evaluates if the answer is helpful |
| controversiality | Evaluates if the answer is controversial |
| misogyny | Evaluates if the answer is misogynistic |
| criminality | Evaluates if the answer promotes criminal behavior |
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Set Evaluators
criteria_evaluator = [
    LangChainStringEvaluator("criteria", config={"criteria": "conciseness"}),
    LangChainStringEvaluator("criteria", config={"criteria": "misogyny"}),
    LangChainStringEvaluator("criteria", config={"criteria": "criminality"}),
]

# Set the dataset name
dataset_name = "RAG_EVAL_DATASET"

# Execute evaluation
experiment_results = evaluate(
    ask_question,
    data=dataset_name,
    evaluators=criteria_evaluator,
    experiment_prefix="CRITERIA-EVAL",
    # Specify experiment metadata
    metadata={
        "variant": "Evaluation with Criteria Evaluator",
    },
)
View the evaluation results for experiment: 'CRITERIA-EVAL-c7ebf8e3' at:
https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=0a18e29b-60fe-4427-a51f-c52299a18898
Using an evaluator when correct answers exist (labeled_criteria)
When a correct answer exists, you can evaluate by comparing the LLM-generated answer against it. As in the example below, pass the correct answer as reference and the LLM-generated answer as prediction; this mapping is defined through prepare_data. The LLM used to perform the evaluation is set through llm in the config.
from langsmith.evaluation import LangChainStringEvaluator
from langchain_openai import ChatOpenAI

# Create labeled_criteria Evaluator
labeled_criteria_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm": ChatOpenAI(temperature=0.0, model="gpt-4o-mini"),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": example.outputs["answer"],  # Correct answer
        "input": example.inputs["question"],
    },
)

# Print evaluator prompt
print_evaluator_prompt(labeled_criteria_evaluator)
You are assessing a submitted answer on a given task or input based on a set of criteria. Here is the data:
[BEGIN DATA]
***
[Input]: {input}
***
[Submission]: {output}
***
[Criteria]: helpfulness: Is this submission helpful to the user, taking into account the correct reference answer?
***
[Reference]: {reference}
***
[END DATA]
Does the submission meet the Criteria? First, write out in a step by step manner your reasoning about each criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer of whether the submission meets all criteria. At the end, repeat just the letter again by itself on a new line.
Here is an example that evaluates relevance. This time, we pass the retrieved context as the reference through prepare_data; the resulting relevance_evaluator is used alongside labeled_criteria_evaluator in the evaluation below.
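A minimal sketch of this evaluator, assuming the built-in relevance criterion and the same prepare_data pattern as the labeled_criteria example above:

# Create relevance Evaluator (sketch: built-in "relevance" criterion, context mapped to reference)
relevance_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": "relevance",
        "llm": ChatOpenAI(temperature=0.0, model="gpt-4o-mini"),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": run.outputs["context"],  # Pass the retrieved context as the reference
        "input": example.inputs["question"],
    },
)

# Print evaluator prompt
print_evaluator_prompt(relevance_evaluator)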
You are assessing a submitted answer on a given task or input based on a set of criteria. Here is the data:
[BEGIN DATA]
***
[Input]: {input}
***
[Submission]: {output}
***
[Criteria]: relevance: Is the submission referring to a real quote from the text?
***
[Reference]: {reference}
***
[END DATA]
Does the submission meet the Criteria? First, write out in a step by step manner your reasoning about each criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer of whether the submission meets all criteria. At the end, repeat just the letter again by itself on a new line.
Execute the evaluation and check the returned results.
from langsmith.evaluation import evaluate

# Set the dataset name
dataset_name = "RAG_EVAL_DATASET"

# Execute evaluation
experiment_results = evaluate(
    context_answer_rag_answer,
    data=dataset_name,
    evaluators=[labeled_criteria_evaluator, relevance_evaluator],
    experiment_prefix="LABELED-EVAL",
    # Specify experiment metadata
    metadata={
        "variant": "Evaluation with Labeled_criteria Evaluator",
    },
)
View the evaluation results for experiment: 'LABELED-EVAL-80ee678c' at:
https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=36dc1710-9c3a-46ce-b1ab-a209bb8b700d
Score Evaluator (labeled_score_string)
Here is an example of creating an evaluator that returns scores. Scores can be normalized through normalize_by; the converted scores fall between 0 and 1. The accuracy criterion below is user-defined, and you can write whatever criterion prompt fits your needs.
from langsmith.evaluation import LangChainStringEvaluator

# Create labeled score evaluator
labeled_score_evaluator = LangChainStringEvaluator(
    "labeled_score_string",
    config={
        "criteria": {
            "accuracy": "How accurate is this prediction compared to the reference on a scale of 1-10?"
        },
        "normalize_by": 10,
        "llm": ChatOpenAI(temperature=0.0, model="gpt-4o-mini"),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    },
)

print_evaluator_prompt(labeled_score_evaluator)
================================ System Message ================================
You are a helpful assistant.
================================= Human Message =================================
[Instruction]
Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. {criteria}
[Ground truth]
{reference}
Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".
[Question]
{input}
[The Start of Assistant's Answer]
{prediction}
[The End of Assistant's Answer]
Execute the evaluation and check the returned results.
from langsmith.evaluation import evaluate

# Execute evaluation
experiment_results = evaluate(
    ask_question,
    data=dataset_name,
    evaluators=[labeled_score_evaluator],
    experiment_prefix="LABELED-SCORE-EVAL",
    # Specify experiment metadata
    metadata={
        "variant": "Evaluation with Labeled_score Evaluator",
    },
)
View the evaluation results for experiment: 'LABELED-SCORE-EVAL-ca73be6c' at:
https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=a846a6af-3409-4907-9d90-849e0532533f