LangSmith Custom LLM Evaluation
Author: HeeWung Song(Dan)
Design:
Peer Review:
This is a part of LangChain Open Tutorial
Overview
LangSmith Custom LLM Evaluation is a customizable evaluation framework in LangChain that enables users to assess LLM application outputs based on their specific requirements.
Custom Evaluation Logic:
Define your own evaluation criteria
Create specific scoring mechanisms
Easy Integration:
Works with LangChain's RAG systems
Compatible with LangSmith for evaluation tracking
Evaluation Methods:
Simple metric-based evaluation
Advanced LLM-based assessment
Table of Contents
Overview
Environment Setup
RAG System Setup
Basic Custom Evaluator
Custom LLM-as-Judge
References
Environment Setup
Setting up your environment is the first step. See the Environment Setup guide for more details.
[Note]
The langchain-opentutorial is a package of easy-to-use environment setup guidance, useful functions, and utilities for tutorials. Check out the langchain-opentutorial for more details.
Alternatively, you can set and load OPENAI_API_KEY from a .env file.
[Note] This is only necessary if you haven't already set OPENAI_API_KEY in previous steps.
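If you go the .env route, a minimal sketch looks like the following. It assumes a .env file in the project root containing OPENAI_API_KEY (and optionally your LangSmith credentials for tracking); the file name and key names are the usual conventions, not something specific to this tutorial.

```python
# Minimal sketch: load API keys from a .env file (assumed to exist in the project root).
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env and populates os.environ

# Fail early if the key is missing rather than deep inside a chain call.
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
```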
RAG System Setup
We will build a basic RAG (Retrieval-Augmented Generation) system to test Custom Evaluators. This implementation creates a question-answering system based on PDF documents, which will serve as our foundation for evaluation purposes.
This RAG system will be used to evaluate answer quality and accuracy through Custom Evaluators in later sections.
RAG System Preparation
Document Processing
load_documents(): Loads PDF documents using PyMuPDFLoader
split_documents(): Splits documents into appropriate sizes using RecursiveCharacterTextSplitter
Vector Store Creation
create_vectorstore(): Creates a vector DB using OpenAIEmbeddings and FAISS
create_retriever(): Generates a retriever based on the vector store
QA Chain Configuration
create_chain(): Creates a chain that answers questions based on the retrieved context, including a prompt template for question-answering tasks (see the sketch below)
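The helpers above are only described in prose in this tutorial; the sketch below shows one way they might look. The PDF path, chunk sizes, prompt wording, and model name (gpt-4o-mini) are illustrative assumptions, not part of the original setup.

```python
# A minimal sketch of the RAG helpers described above.
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


def load_documents(file_path: str):
    """Load a PDF document with PyMuPDFLoader."""
    return PyMuPDFLoader(file_path).load()


def split_documents(docs):
    """Split documents into chunks with RecursiveCharacterTextSplitter."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
    return splitter.split_documents(docs)


def create_vectorstore(split_docs):
    """Create a FAISS vector store from the split documents."""
    return FAISS.from_documents(split_docs, OpenAIEmbeddings())


def create_retriever(vectorstore):
    """Create a retriever based on the vector store."""
    return vectorstore.as_retriever()


def format_docs(docs):
    """Join retrieved documents into a single context string."""
    return "\n\n".join(doc.page_content for doc in docs)


def create_chain(retriever):
    """Create a chain that answers questions based on the retrieved context."""
    prompt = ChatPromptTemplate.from_template(
        "Answer the question using only the context below.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    )
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # assumed model
    return (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
```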
We'll create a function called ask_question that takes a dictionary inputs as a parameter and returns a dictionary with an answer key. This function will serve as our question-answering interface.
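Assuming the chain built by create_chain() is available as chain, ask_question is just a thin dict-in / dict-out wrapper:

```python
def ask_question(inputs: dict) -> dict:
    """Take {"question": ...} and return {"answer": ...} from the RAG chain."""
    return {"answer": chain.invoke(inputs["question"])}
```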
Basic Custom Evaluator
Let's explore the fundamental concepts of creating Custom Evaluators. Custom Evaluators are evaluation tools in LangChain's LangSmith evaluation system that users can define according to their specific requirements. LangSmith provides a comprehensive platform for monitoring, evaluating, and improving LLM applications.
Understanding Evaluator Arguments
Custom Evaluator functions can use the following arguments:
run (Run): The complete Run object generated by the application
example (Example): Dataset example containing inputs, outputs, and metadata
inputs (dict): Input dictionary for a single example from the dataset
outputs (dict): Output dictionary generated by the application for the given inputs
reference_outputs (dict): Reference output dictionary associated with the example
In most cases, inputs, outputs, and reference_outputs are sufficient. The run and example objects are only needed when additional metadata is required.
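LangSmith matches these parameters by name, so an evaluator only declares the arguments it needs. As an illustration (the metric name and 500-character criterion below are made up for this example, not part of the tutorial):

```python
# Illustrative evaluator: flags answers longer than 500 characters as not concise.
def concise_answer_evaluator(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    return {
        "key": "concise_answer",
        "score": int(len(outputs.get("answer", "")) <= 500),
    }
```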
Understanding Output Types
Custom Evaluators can return results in the following formats:
Dictionary Format (Recommended)
Basic Types (Python)
int, float, bool: Continuous numerical metrics
str: Categorical metrics
Multiple Metrics
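The sketches below illustrate these return shapes. The metric names are illustrative; the multi-metric shape (a dict with a "results" list) follows LangSmith's convention for returning several metrics from a single evaluator.

```python
# Dictionary format (recommended): one metric with a key and a score.
def exact_match_evaluator(outputs: dict, reference_outputs: dict) -> dict:
    return {
        "key": "exact_match",
        "score": outputs.get("answer") == reference_outputs.get("answer"),
    }


# Multiple metrics: return several results at once under a "results" key.
def length_metrics_evaluator(outputs: dict) -> dict:
    answer = outputs.get("answer", "")
    return {
        "results": [
            {"key": "answer_chars", "score": len(answer)},
            {"key": "answer_words", "score": len(answer.split())},
        ]
    }
```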
Random Score Evaluator Example
Now, let's create a simple Custom Evaluator example. This evaluator will return a random score between 1 and 10, regardless of the answer content.
Random Score Evaluator Implementation
Takes Run and Example objects as input parameters
Returns a dictionary in the format: {"key": "random_score", "score": score}
Here's the basic implementation of a random score evaluator:
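The sketch below defines the evaluator and runs it with LangSmith's evaluate function. The dataset name and experiment prefix are assumptions for illustration; substitute your own LangSmith dataset.

```python
import random

from langsmith.evaluation import evaluate
from langsmith.schemas import Example, Run


def random_score_evaluator(run: Run, example: Example) -> dict:
    # Ignore the actual answer and return a random score between 1 and 10.
    score = random.randint(1, 10)
    return {"key": "random_score", "score": score}


# Assumed dataset name; replace with the name of your own LangSmith dataset.
dataset_name = "RAG_EVAL_DATASET"

experiment_results = evaluate(
    ask_question,                      # the target function defined above
    data=dataset_name,
    evaluators=[random_score_evaluator],
    experiment_prefix="CUSTOM-EVAL",   # assumed prefix, purely illustrative
    metadata={"variant": "Random Score Evaluator"},
)
```

Because the score is random, the results below are mainly a sanity check that the evaluation pipeline runs end to end.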
| # | Question (input) | Answer (output) | Error | Reference Answer | random_score | Execution Time (s) | Example ID | Run ID |
|---|---|---|---|---|---|---|---|---|
| 0 | What are the three targeted learnings to enhan... | The three targeted learnings to enhance model ... | None | The three targeted learning approaches to enha... | 4 | 3.112384 | 0e661de4-636b-425d-8f6e-0a52b8070576 | ae36f6a7-86a2-4f0a-89d2-8be9671ca3cb |
| 1 | What are the key functions of an agent's orche... | The key functions of an agent's orchestration ... | None | The key functions of an agent's orchestration ... | 6 | 4.077394 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | 6c65f286-a103-4a60-b906-555fd405ea7e |
| 2 | List up the name of the authors | The authors are Julia Wiesinger, Patrick Marlo... | None | The authors are Julia Wiesinger, Patrick Marlo... | 7 | 1.172011 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | 429dad1e-f68c-4f67-ae36-cc2171c4c6a0 |
| 3 | What is Tree-of-thoughts? | Tree-of-thoughts (ToT) is a prompt engineering... | None | Tree-of-thoughts (ToT) is a prompt engineering... | 5 | 1.374912 | be18ec98-ab18-4f30-9205-e75f1cb70844 | be337bef-90b0-4b6a-b9ab-941562ab4b44 |
| 4 | What is the framework used for reasoning and p... | The frameworks used for reasoning and planning... | None | The frameworks used for reasoning and planning... | 7 | 1.821961 | eb4b29a7-511c-4f78-a08f-2d5afeb84320 | 9cff92b1-04e7-49f5-ab2a-85763468e6cb |
| 5 | How do agents differ from standalone language ... | Agents differ from standalone language models ... | None | Agents can use tools to access real-time data ... | 1 | 2.135424 | f4a5a0cf-2d2e-4e15-838a-bc8296eb708b | 3fbe6fa6-88bf-46de-bdfa-0f39eac18c78 |

Custom LLM-as-Judge
Now, we'll create an LLM chain to use as an evaluator.
First, let's define a function that returns context, answer, and question:
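A minimal sketch, assuming the retriever and chain from the RAG setup above are in scope:

```python
def context_answer_rag_answer(inputs: dict) -> dict:
    """Return the retrieved context, the generated answer, and the original question."""
    question = inputs["question"]
    docs = retriever.invoke(question)  # retrieved documents for this question
    return {
        "context": "\n\n".join(doc.page_content for doc in docs),
        "answer": chain.invoke(question),
        "question": question,
    }
```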
Next, we'll build the LLM-based evaluator that will score these outputs. We'll use the teddynote/context-answer-evaluator prompt template from LangChain Hub, which provides a structured evaluation framework for RAG systems.
The evaluator uses the following criteria:
Accuracy (0-10): How well the answer aligns with the context
Comprehensiveness (0-10): How complete and detailed the answer is
Context Precision (0-10): How effectively the context information is used
The final score is normalized to a 0-1 scale using the formula: Final Score = (Accuracy + Comprehensiveness + Context Precision) / 30
This evaluation framework helps us quantitatively assess the quality of our RAG system's responses.
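A sketch of how such an evaluator chain could be assembled is shown below. The prompt is pulled from LangChain Hub as described above; the model name is an assumption, and the chain assumes the prompt instructs the model to output only the normalized score.

```python
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# Pull the evaluation prompt and wire it into a chain that emits the score as text.
evaluator_prompt = hub.pull("teddynote/context-answer-evaluator")
custom_llm_evaluator = (
    evaluator_prompt
    | ChatOpenAI(model="gpt-4o-mini", temperature=0)  # assumed model
    | StrOutputParser()
)
```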
Let's evaluate our system using the previously created context_answer_rag_answer function. We'll pass the generated answer and context to our custom_llm_evaluator for assessment.
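For example, as a quick smoke test (using one of the dataset questions, and assuming the hub prompt takes context, answer, and question variables):

```python
# Generate an answer with the RAG chain, then score it with the evaluator chain.
output = context_answer_rag_answer({"question": "What is Tree-of-thoughts?"})
print(
    custom_llm_evaluator.invoke(
        {
            "context": output["context"],
            "answer": output["answer"],
            "question": output["question"],
        }
    )
)
```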
Let's define our custom_evaluator function.
run.outputs: Gets the answer, context, and question generated by the RAG chain
example.outputs: Gets the reference answer from our dataset
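A sketch of the function, assuming the evaluator prompt returns a bare numeric score that can be parsed with float() (the metric key "custom_score" is illustrative):

```python
from langsmith.schemas import Example, Run


def custom_evaluator(run: Run, example: Example) -> dict:
    # Outputs produced by context_answer_rag_answer for this run.
    llm_answer = run.outputs.get("answer", "")
    context = run.outputs.get("context", "")
    question = run.outputs.get("question", "")
    # Reference answer from the dataset (available via example.outputs if the prompt needs it).
    # reference_answer = example.outputs.get("answer", "")

    score = custom_llm_evaluator.invoke(
        {"context": context, "answer": llm_answer, "question": question}
    )
    return {"key": "custom_score", "score": float(score)}
```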
Finally, let's run the evaluation using LangSmith's evaluate function, assessing the RAG system's performance across our test dataset with our custom evaluator.
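As before, the dataset name and experiment prefix below are assumptions for illustration:

```python
from langsmith.evaluation import evaluate

experiment_results = evaluate(
    context_answer_rag_answer,            # target: returns answer, context, and question
    data="RAG_EVAL_DATASET",              # assumed dataset name
    evaluators=[custom_evaluator],
    experiment_prefix="CUSTOM-LLM-EVAL",  # assumed prefix, purely illustrative
    metadata={"variant": "Custom LLM-as-Judge Evaluator"},
)
```

The results of this run are summarized below.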
| # | Question (input) | Context (output) | Answer (output) | Question (output) | Error | Reference Answer | Score (0-1) | Execution Time (s) | Example ID | Run ID |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What are the three targeted learnings to enhan... | Agents\n33\nSeptember 2024\nEnhancing model pe... | The three targeted learnings to enhance model ... | What are the three targeted learnings to enhan... | None | The three targeted learning approaches to enha... | 0.87 | 3.603254 | 0e661de4-636b-425d-8f6e-0a52b8070576 | 85ddbfcb-8c49-4551-890a-f137d7b413b8 |
| 1 | What are the key functions of an agent's orche... | implementation of the agent orchestration laye... | The key functions of an agent's orchestration ... | What are the key functions of an agent's orche... | None | The key functions of an agent's orchestration ... | 0.93 | 4.028933 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | 0b423bb6-c722-41af-ae6e-c193ebc3ff8a |
| 2 | List up the name of the authors | Agents\nAuthors: Julia Wiesinger, Patrick Marl... | The authors are Julia Wiesinger, Patrick Marlo... | List up the name of the authors | None | The authors are Julia Wiesinger, Patrick Marlo... | 0.87 | 1.885114 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | 54e0987b-502f-48a7-877f-4b3d56bd82cf |
| 3 | What is Tree-of-thoughts? | weaknesses depending on the specific applicati... | Tree-of-thoughts (ToT) is a prompt engineering... | What is Tree-of-thoughts? | None | Tree-of-thoughts (ToT) is a prompt engineering... | 0.87 | 1.732563 | be18ec98-ab18-4f30-9205-e75f1cb70844 | f0b02411-b377-4eaa-821a-2108b8b4836f |
| 4 | What is the framework used for reasoning and p... | reasoning frameworks (CoT, ReAct, etc.) to \nf... | The frameworks used for reasoning and planning... | What is the framework used for reasoning and p... | None | The frameworks used for reasoning and planning... | 0.83 | 2.651672 | eb4b29a7-511c-4f78-a08f-2d5afeb84320 | 38d34eb6-1ec5-44ea-a7d0-c7c98d46b0bc |
| 5 | How do agents differ from standalone language ... | 1.\t Agents extend the capabilities of languag... | Agents differ from standalone language models ... | How do agents differ from standalone language ... | None | Agents can use tools to access real-time data ... | 0.93 | 2.519094 | f4a5a0cf-2d2e-4e15-838a-bc8296eb708b | 49b26b38-e499-4c71-bdcb-eccfa44a1beb |
