LangSmith Online Evaluation

Overview

This notebook provides tools to evaluate and track the performance of language models using LangSmith's online evaluation capabilities.

By setting up chains and tagging runs with custom configurations, users can evaluate model outputs against criteria such as hallucination and context recall, ensuring robust performance in various scenarios.

Table of Contents

  • Overview
  • Environment Setup
  • Build a Pipeline for Online Evaluations
  • Set Up the RAG System with PDFRAG
  • Create a Parallel Evaluation Runnable
  • Make Online LLM-as-judge
  • Run Evaluations


Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials.

  • You can check out the langchain-opentutorial for more details.

%%capture --no-stderr
%pip install langchain-opentutorial
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain-openai",
        "langchain_community",
        "pymupdf",
        "faiss-cpu"
    ],
    verbose=False,
    upgrade=False,
)
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "LangSmith-Online-Evaluation",
    },
)
Environment variables have been set successfully.

[Note] If you are using a .env file, proceed as follows.

from dotenv import load_dotenv

load_dotenv(override=True)
True

Build a Pipeline for Online Evaluations

The following Python code defines a PDFRAG class and related functionality to set up a RAG (Retrieval-Augmented Generation) pipeline for online evaluation of language models.

Explanation of PDFRAG

The PDFRAG class is a modular framework for:

  1. Document Loading: Ingesting a PDF document.

  2. Document Splitting: Dividing the content into manageable chunks for processing.

  3. Vectorstore Creation: Converting chunks into vector representations using embeddings.

  4. Retriever Setup: Enabling retrieval of the most relevant chunks for a given query.

  5. Chain Construction: Creating a QA (Question-Answering) chain with prompt templates.

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAIEmbeddings
from langchain_core.runnables import RunnablePassthrough


class PDFRAG:
    def __init__(self, file_path: str, llm):
        self.file_path = file_path
        self.llm = llm

    def load_documents(self):
        # Load Documents
        loader = PyMuPDFLoader(self.file_path)
        docs = loader.load()
        return docs

    def split_documents(self, docs):
        # Split Documents
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
        split_documents = text_splitter.split_documents(docs)
        return split_documents

    def create_vectorstore(self, split_documents):
        # Embedding
        embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

        # Create DB
        vectorstore = FAISS.from_documents(
            documents=split_documents, embedding=embeddings
        )
        return vectorstore

    def create_retriever(self):
        vectorstore = self.create_vectorstore(
            self.split_documents(self.load_documents())
        )
        # Retriever
        retriever = vectorstore.as_retriever()
        return retriever

    def create_chain(self, retriever):
        # Create Prompt
        prompt = PromptTemplate.from_template(
            """You are an assistant for question-answering tasks. 
        Use the following pieces of retrieved context to answer the question. 
        If you don't know the answer, just say that you don't know. 

        #Context: 
        {context}

        #Question:
        {question}

        #Answer:"""
        )

        # Chain
        chain = (
            {
                "context": retriever,
                "question": RunnablePassthrough(),
            }
            | prompt
            | self.llm
            | StrOutputParser()
        )
        return chain

Set Up the RAG System with PDFRAG

The following code demonstrates how to instantiate and use the PDFRAG class to set up a RAG (Retrieval-Augmented Generation) pipeline using a specific PDF document and a GPT-based model.

from langchain_openai import ChatOpenAI

# Create a PDFRAG object
rag = PDFRAG(
    "data/Newwhitepaper_Agents2.pdf",
    ChatOpenAI(model="gpt-4o-mini", temperature=0),
)

# Create a retriever
retriever = rag.create_retriever()

# Create a chain
rag_chain = rag.create_chain(retriever)
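
To confirm the chain works before wiring it into an evaluation, you can invoke it directly. This is a minimal sanity check; the question below is only an illustrative example about the loaded PDF.

# Ask the chain a question about the loaded PDF (illustrative example)
answer = rag_chain.invoke("How do agents differ from standalone language models?")
print(answer)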

Create a Parallel Evaluation Runnable

The following code demonstrates how to create a RunnableParallel object to evaluate multiple aspects of the RAG (Retrieval-Augmented Generation) pipeline concurrently.

from langchain_core.runnables import RunnableParallel, RunnablePassthrough

# Create a RunnableParallel object.
evaluation_runnable = RunnableParallel(
    {
        "context": retriever,
        "answer": rag_chain,
        "question": RunnablePassthrough(),
    }
)
_ = evaluation_runnable.invoke("How do agents differ from standalone language models?")
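
The parallel runnable returns a dictionary with one entry per branch (context, answer, and question). The snippet below is only an illustration of how to inspect that structure instead of discarding the result.

# Inspect the structure returned by the parallel runnable (illustrative)
result = evaluation_runnable.invoke("How do agents differ from standalone language models?")
print(result.keys())           # dict_keys(['context', 'answer', 'question'])
print(result["answer"][:200])  # beginning of the generated answer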

Make Online LLM-as-judge

This guide explains how to set up and configure an online LLM evaluator using LangSmith. It walks you through creating evaluation rules, configuring API keys and prompts, and targeting specific outputs with tags for precise assessments.

1. Click Add Rule

Click “Add Rule” to create a new evaluation rule in your LangSmith project.

make-online-LLM-as-judge

2. Create Evaluator

Open the Evaluator creation page to define how your outputs will be judged.

create-evaluator

3. Set Secrets & API Keys

Provide the necessary API keys and environment secrets for your LLM provider.

set-secrets-API-keys
set-API-Keys

4. Set Provider, Model, Prompt

Choose the LLM provider, select a model, and write the prompt you want to use.

set-provider-model-prompt

5. Select Hallucination

Pick the “Hallucination” criterion to evaluate the factual accuracy of responses.

select-hallucination

6. Set facts for output.context

Enter the factual information in “output.context” so the evaluator can reference it.

set-facts

7. Set answer for output.answer

Map the generated answer to “output.answer” so the evaluator can compare it against the facts.

set-answer

8. Check Preview for Data

Review your evaluation data in the Preview tab to confirm correctness.

check-preview

Caution

You must view the preview and then turn preview mode off again before proceeding to the next step. You also have to fill in the "Name" field to continue.

continue
fill-evaluator

9. Save and Continue

Save your evaluator and click “Continue” to finalize the configuration.

10. Make a "Tag"

Create a tag so you can selectively run evaluations on particular outputs.

make-tag

Instead of evaluating every run, you can set a "Tag" so that only runs carrying specific tags are evaluated, as shown in the Run Evaluations section below.

11. Set the "Tag" you want

Choose the tag you wish to use for targeted evaluations.

set-tag

12. Run evaluations only for specific tags (hallucination)

Trigger the evaluation process exclusively for outputs labeled with your chosen tag.

run-eval

Run Evaluations

The following code demonstrates how to perform evaluations on the RAG (Retrieval-Augmented Generation) pipeline, including hallucination detection, context recall assessment, and combined evaluations.

from langchain_core.runnables import RunnableConfig

# Set a tag for each evaluation scenario
hallucination_config = RunnableConfig(tags=["hallucination_eval"])
context_recall_config = RunnableConfig(tags=["context_recall_eval"])
all_eval_config = RunnableConfig(tags=["hallucination_eval", "context_recall_eval"])
# Run the chain without any evaluation tags
_ = evaluation_runnable.invoke("How do agents differ from standalone language models?")
# Request a Hallucination evaluation
_ = evaluation_runnable.invoke(
    "How do agents differ from standalone language models?",
    config=hallucination_config,
)
# Request a Context Recall assessment
_ = evaluation_runnable.invoke(
    "How do agents differ from standalone language models?",
    config=context_recall_config,
)
# All evaluation requests
_ = evaluation_runnable.invoke(
    "How do agents differ from standalone language models?", config=all_eval_config
)
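
If you do not want to pass a config on every call, a Runnable's with_config method can bind the tags once up front. The wrapper variable name below is just an example.

# Bind the tag once so every invocation of this wrapper carries it
# (hallucination_chain is an arbitrary name chosen for this example)
hallucination_chain = evaluation_runnable.with_config(tags=["hallucination_eval"])
_ = hallucination_chain.invoke("How do agents differ from standalone language models?")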
