LangSmith Online Evaluation
Author: JeongGi Park
Design:
Peer Review:
This is part of the LangChain Open Tutorial.
Overview
This notebook provides tools to evaluate and track the performance of language models using LangSmith's online evaluation capabilities.
By setting up chains and using custom configurations, users can assess model outputs, including hallucination detection and context recall, ensuring robust performance in various scenarios.
Table of Contents
Overview
Environment Setup
Build a Pipeline for Online Evaluations
Create an Online LLM-as-Judge
Run Evaluations
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorial is a package that provides a set of easy-to-use environment setup tools, useful functions, and utilities for tutorials. You can check out langchain-opentutorial for more details.
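In practice, setup usually amounts to installing the package and exporting the required keys. The environment variable names below reflect a typical LangSmith + OpenAI setup and are assumptions, not values taken from the original notebook:

```python
# Install the tutorial helper package (run once per environment).
# %pip install -qU langchain-opentutorial

import os

# Assumed keys for a typical LangSmith + OpenAI setup; replace with your own values.
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "LangSmith-Online-Evaluation"  # hypothetical project name
```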
[Note] If you are using a .env file, proceed as follows.
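A minimal sketch of that approach, assuming a .env file in the project root that contains the same keys:

```python
from dotenv import load_dotenv

# Reads OPENAI_API_KEY, LANGCHAIN_API_KEY, etc. from a local .env file.
load_dotenv(override=True)
```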
Build a Pipeline for Online Evaluations
The provided Python script defines a PDFRAG class and related functionality to set up a RAG (Retrieval-Augmented Generation) pipeline for the online evaluation of language models.
Explanation of PDFRAG
The PDFRAG class is a modular framework covering the following steps (a minimal sketch follows the list below):
Document Loading: Ingesting a PDF document.
Document Splitting: Dividing the content into manageable chunks for processing.
Vectorstore Creation: Converting chunks into vector representations using embeddings.
Retriever Setup: Enabling retrieval of the most relevant chunks for a given query.
Chain Construction: Creating a QA (Question-Answering) chain with prompt templates.
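The original notebook defines this class in a code cell; as a stand-in, here is a minimal sketch of what such a class could look like. The use of PyMuPDFLoader, RecursiveCharacterTextSplitter, FAISS, and OpenAIEmbeddings, as well as the chunk sizes and prompt text, are illustrative assumptions rather than the exact implementation.

```python
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


class PDFRAG:
    """Minimal RAG pipeline built around a single PDF document."""

    def __init__(self, file_path: str, llm):
        self.file_path = file_path
        self.llm = llm

    def load_documents(self):
        # Document Loading: ingest the PDF.
        return PyMuPDFLoader(self.file_path).load()

    def split_documents(self, docs):
        # Document Splitting: divide the content into manageable chunks.
        splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
        return splitter.split_documents(docs)

    def create_retriever(self):
        # Vectorstore Creation + Retriever Setup: embed chunks and expose retrieval.
        chunks = self.split_documents(self.load_documents())
        vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())
        return vectorstore.as_retriever()

    def create_chain(self, retriever):
        # Chain Construction: QA chain with a prompt template.
        prompt = PromptTemplate.from_template(
            "Answer the question using only the following context.\n\n"
            "Context:\n{context}\n\nQuestion:\n{question}"
        )
        return (
            {"context": retriever, "question": RunnablePassthrough()}
            | prompt
            | self.llm
            | StrOutputParser()
        )
```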
Set Up the RAG System with PDFRAG
The following code demonstrates how to instantiate and use the PDFRAG class to set up a RAG (Retrieval-Augmented Generation) pipeline using a specific PDF document and a GPT-based model.
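As a concrete illustration (the PDF path and model name below are placeholders, not the ones used in the original notebook):

```python
from langchain_openai import ChatOpenAI

# Hypothetical document path and model; substitute your own.
rag = PDFRAG(
    file_path="data/sample.pdf",
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
)

retriever = rag.create_retriever()
chain = rag.create_chain(retriever)
```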
Create a Parallel Evaluation Runnable
The following code demonstrates how to create a RunnableParallel object to evaluate multiple aspects of the RAG (Retrieval-Augmented Generation) pipeline concurrently.
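Following the sketch above, one way to wire this is to return the retrieved context, the generated answer, and the original question side by side, so that the online evaluator can reference output.context and output.answer. The format_docs helper below is an assumed convenience function, not part of the library:

```python
from langchain_core.runnables import RunnableParallel, RunnablePassthrough


def format_docs(docs):
    # Concatenate retrieved chunks into a single string for output.context.
    return "\n\n".join(doc.page_content for doc in docs)


# Each branch runs concurrently on the same input question.
evaluation_runnable = RunnableParallel(
    {
        "context": retriever | format_docs,  # retrieved facts  -> output.context
        "answer": chain,                     # generated answer -> output.answer
        "question": RunnablePassthrough(),   # original query   -> output.question
    }
)
```

Invoking evaluation_runnable with a question string then yields a dictionary with the question, context, and answer keys that the online evaluator reads.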
Create an Online LLM-as-Judge
This guide explains how to set up and configure an online LLM evaluator using LangSmith. It walks you through creating evaluation rules, configuring API keys and prompts, and targeting specific outputs with tags for precise assessments.
1. Click Add Rule
Click “Add Rule” to create a new evaluation rule in your LangSmith project.

2. Create Evaluator
Open the Evaluator creation page to define how your outputs will be judged.

3. Set Secrets & API Keys
Provide the necessary API keys and environment secrets for your LLM provider.


4. Set Provider, Model, Prompt
Choose the LLM provider, select a model, and write the prompt you want to use.

5. Select Hallucination
Pick the “Hallucination” criterion to evaluate the factual accuracy of responses.

6. Set facts for output.context
Enter the factual information in “output.context” so the evaluator can reference it.

7. Set answer for output.answer
Specify the expected answer in “output.answer” for comparison.

8. Check Preview for Data
Review your evaluation data in the Preview tab to confirm correctness.

Caution
You must view the preview and then turn preview mode off again before proceeding to the next step. You also have to fill in the “Name” field to continue.


9. Save and Continue
Save your evaluator and click “Continue” to finalize the configuration.

10. Create a "Tag"
Create a tag so you can selectively run evaluations on particular outputs.

Instead of evaluating every run, you can set a "Tag" so that evaluations are applied only to runs carrying that tag.
11. Set the "Tag" you want
Choose the tag you wish to use for targeted evaluations.

12. Run evaluations only for specific tags (hallucination)
Trigger the evaluation process exclusively for outputs labeled with your chosen tag.

Run Evaluations
The following code demonstrates how to perform evaluations on the RAG (Retrieval-Augmented Generation) pipeline, including hallucination detection, context recall assessment, and combined evaluations.
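A sketch of what those runs could look like is shown below; the question strings and the hallucination / context_recall tag names are assumptions and must match the tags you configured for your online evaluators above.

```python
# Run only the hallucination evaluator by tagging this run.
evaluation_runnable.invoke(
    "What is the document about?",  # hypothetical question
    config={"tags": ["hallucination"]},
)

# Run only the context-recall evaluator.
evaluation_runnable.invoke(
    "Summarize the key findings of the document.",  # hypothetical question
    config={"tags": ["context_recall"]},
)

# Combined evaluation: attach both tags to a single run.
evaluation_runnable.invoke(
    "Who are the authors of the document?",  # hypothetical question
    config={"tags": ["hallucination", "context_recall"]},
)
```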