LangSmith Online Evaluation
Author: JeongGi Park
Design:
Peer Review:
This is a part of LangChain Open Tutorial
Overview
This notebook provides tools to evaluate and track the performance of language models using LangSmith's online evaluation capabilities.
By setting up chains and applying custom run configurations, users can evaluate model outputs for hallucination and context recall, ensuring robust performance in various scenarios.
Table of Contents
References
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorial is a package that provides a set of easy-to-use environment setup helpers, functions, and utilities for tutorials. You can check out langchain-opentutorial for more details.
%%capture --no-stderr
%pip install langchain-opentutorial

# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain-openai",
        "langchain_community",
        "pymupdf",
        "faiss-cpu",
    ],
    verbose=False,
    upgrade=False,
)
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "LangSmith-Online-Evaluation",
    },
)
Environment variables have been set successfully.
[Note] If you are using a .env file, proceed as follows.
from dotenv import load_dotenv
load_dotenv(override=True)
True
Build a Pipeline for Online Evaluations
The following Python script defines a PDFRAG class and related functionality to set up a RAG (Retrieval-Augmented Generation) pipeline for online evaluation of language models.
Explanation of PDFRAG
The PDFRAG class is a modular framework for:
Document Loading: Ingesting a PDF document.
Document Splitting: Dividing the content into manageable chunks for processing.
Vectorstore Creation: Converting chunks into vector representations using embeddings.
Retriever Setup: Enabling retrieval of the most relevant chunks for a given query.
Chain Construction: Creating a QA (Question-Answering) chain with prompt templates.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAIEmbeddings
from langchain_core.runnables import RunnablePassthrough


class PDFRAG:
    def __init__(self, file_path: str, llm):
        self.file_path = file_path
        self.llm = llm

    def load_documents(self):
        # Load Documents
        loader = PyMuPDFLoader(self.file_path)
        docs = loader.load()
        return docs

    def split_documents(self, docs):
        # Split Documents
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
        split_documents = text_splitter.split_documents(docs)
        return split_documents

    def create_vectorstore(self, split_documents):
        # Embedding
        embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        # Create DB
        vectorstore = FAISS.from_documents(
            documents=split_documents, embedding=embeddings
        )
        return vectorstore

    def create_retriever(self):
        vectorstore = self.create_vectorstore(
            self.split_documents(self.load_documents())
        )
        # Retriever
        retriever = vectorstore.as_retriever()
        return retriever

    def create_chain(self, retriever):
        # Create Prompt
        prompt = PromptTemplate.from_template(
            """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
#Context:
{context}
#Question:
{question}
#Answer:"""
        )
        # Chain
        chain = (
            {
                "context": retriever,
                "question": RunnablePassthrough(),
            }
            | prompt
            | self.llm
            | StrOutputParser()
        )
        return chain
Set Up the RAG System with PDFRAG
The following code demonstrates how to instantiate and use the PDFRAG class to set up a RAG (Retrieval-Augmented Generation) pipeline with a specific PDF document and a GPT-based model.
from langchain_openai import ChatOpenAI

# Create a PDFRAG object
rag = PDFRAG(
    "data/Newwhitepaper_Agents2.pdf",
    ChatOpenAI(model="gpt-4o-mini", temperature=0),
)

# Create a retriever
retriever = rag.create_retriever()

# Create a chain
rag_chain = rag.create_chain(retriever)
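As an optional sanity check (not part of the original pipeline), you can invoke the chain directly before building the evaluation runnable; the question below is the same one used throughout this tutorial.
# Optional: confirm the RAG chain answers a question end-to-end
answer = rag_chain.invoke("How do agents differ from standalone language models?")
print(answer)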
Create a Parallel Evaluation Runnable
The following code demonstrates how to create a RunnableParallel object to evaluate multiple aspects of the RAG (Retrieval-Augmented Generation) pipeline concurrently.
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

# Create a RunnableParallel object.
evaluation_runnable = RunnableParallel(
    {
        "context": retriever,
        "answer": rag_chain,
        "question": RunnablePassthrough(),
    }
)

_ = evaluation_runnable.invoke("How do agents differ from standalone language models?")
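The parallel runnable returns a dictionary with context, answer, and question keys; these are the fields the online evaluator references as output.context and output.answer in the walkthrough below. A minimal sketch for inspecting that structure (the inspection code is illustrative, not required):
# Inspect the structure produced by the parallel runnable
result = evaluation_runnable.invoke(
    "How do agents differ from standalone language models?"
)
print(result.keys())  # dict_keys(['context', 'answer', 'question'])
print(result["answer"][:200])  # beginning of the generated answer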
Make an Online LLM-as-Judge
This guide explains how to set up and configure an online LLM evaluator using LangSmith. It walks you through creating evaluation rules, configuring API keys and prompts, and targeting specific outputs with tags for precise assessments.
1. Click Add Rule
Click “Add Rule” to create a new evaluation rule in your LangSmith project.

2. Create Evaluator
Open the Evaluator creation page to define how your outputs will be judged.

3. Set Secrets & API Keys
Provide the necessary API keys and environment secrets for your LLM provider.


4. Set Provider, Model, Prompt
Choose the LLM provider, select a model, and write the prompt you want to use (a rough example prompt is sketched at the end of this walkthrough).

5. Select Hallucination
Pick the “Hallucination” criteria to evaluate factual accuracy in responses.

6. Set facts for output.context
Enter the factual information in “output.context” so the evaluator can reference it.

7. Set answer for output.answer
Specify the expected answer in “output.answer” for comparison.

8. Check Preview for Data
Review your evaluation data in the Preview tab to confirm correctness.

Caution
You must view the preview and then turn preview mode off again before proceeding to the next step. You must also fill in the "Name" field to continue.


9. Save and Continue
Save your evaluator and click “Continue” to finalize the configuration.
10. Create a "Tag"
Create a tag so you can selectively run evaluations on particular outputs.

Instead of evaluating every run, you can set a "Tag" so that only runs with that tag are evaluated.
11. Set the "Tag" you want
Choose the tag you wish to use for targeted evaluations.

12. Run evaluations only for specific tags (hallucination)
Trigger the evaluation process exclusively for outputs labeled with your chosen tag.

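For reference, here is a rough sketch of the kind of evaluator prompt you might write in step 4. The wording and the mustache-style variable syntax are assumptions for illustration only; the variable names facts and answer are placeholders corresponding to the output.context and output.answer mappings configured in steps 6 and 7, and the exact format depends on the LangSmith prompt editor.
You are grading whether an answer is supported by the given facts.
Facts: {{facts}}
Answer: {{answer}}
If the answer contains claims not supported by the facts, respond "hallucinated"; otherwise respond "grounded".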
Run Evaluations
The following code demonstrates how to perform evaluations on the RAG (Retrieval-Augmented Generation) pipeline, including hallucination detection, context recall assessment, and combined evaluations.
from langchain_core.runnables import RunnableConfig

# Set tags
hallucination_config = RunnableConfig(tags=["hallucination_eval"])
context_recall_config = RunnableConfig(tags=["context_recall_eval"])
all_eval_config = RunnableConfig(tags=["hallucination_eval", "context_recall_eval"])

# Run the chain
_ = evaluation_runnable.invoke("How do agents differ from standalone language models?")

# Request a Hallucination evaluation
_ = evaluation_runnable.invoke(
    "How do agents differ from standalone language models?",
    config=hallucination_config,
)

# Request a Context Recall assessment
_ = evaluation_runnable.invoke(
    "How do agents differ from standalone language models?",
    config=context_recall_config,
)

# Request all evaluations
_ = evaluation_runnable.invoke(
    "How do agents differ from standalone language models?", config=all_eval_config
)