
LangSmith Repeat Evaluation


  • Author: Hwayoung Cha

  • Design:

  • Peer Review:

  • This is a part of LangChain Open Tutorial

Overview

Repeat evaluation is a method for measuring a model's performance more accurately by performing multiple evaluations on the same dataset.

You can add repetition to an experiment. This notebook demonstrates how to use LangSmith for repeatable evaluations of language models: it covers setting up evaluation workflows, running repeated evaluations with different models, and analyzing the results to check consistency. The focus is on leveraging LangSmith's tools for reproducible and scalable model evaluation.

Repeating the evaluation multiple times is useful in the following cases:

  • For larger evaluation sets

  • For chains that can generate variable responses

  • For evaluations that can produce variable scores (e.g., llm-as-judge)

You can learn how to run an evaluation from the LangSmith guide How to run an evaluation.

Table of Contents

  • Overview
  • Environment Setup
  • Performing Repetitive Evaluations with num_repetitions
  • Define a function for RAG performance testing
  • Repetitive evaluation of RAG using GPT models
  • Repetitive evaluation of RAG using Ollama

References

  • How to run an evaluation (LangSmith documentation)
  • How to evaluate with repetitions (LangSmith documentation)

Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials.
  • You can check out the langchain-opentutorial package for more details.

%%capture --no-stderr
%pip install langchain-opentutorial
    [notice] A new release of pip is available: 24.3.1 -> 25.0.1
    [notice] To update, run: python.exe -m pip install --upgrade pip
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_openai",
        "langchain_core",
        "langchain_community",
        "langchain_ollama",
        "faiss-cpu",
        "pymupdf",
    ],
    verbose=False,
    upgrade=False,
)
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGSMITH_TRACING_V2": "true",
        "LANGSMITH_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_PROJECT": "Repeat-Evaluations"
    }
)
Environment variables have been set successfully.

You can alternatively set OPENAI_API_KEY in a .env file and load it.

[Note] This is not necessary if you've already set OPENAI_API_KEY in previous steps.

# Configuration file to manage API keys as environment variables
from dotenv import load_dotenv

# Load API key information
load_dotenv(override=True)
True

Performing Repetitive Evaluations with num_repetitions

LangSmith offers a simple way to perform repetitive evaluations using the num_repetitions parameter in the evaluate function. This parameter specifies how many times each example in your dataset should be evaluated.

When you set num_repetitions=N, LangSmith will:

  1. Run each example in your dataset N times.

  2. Aggregate the results to provide a more accurate measure of your model's performance.

For example, if your dataset has 10 examples and you set num_repetitions=5, each example will be evaluated 5 times, resulting in a total of 50 runs.
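
As a minimal sketch of the call shape (with a hypothetical target function and dataset name; the full RAG examples below follow the same pattern):

from langsmith.evaluation import evaluate

# Hypothetical target: any callable that maps dataset inputs to outputs
def my_target_function(inputs: dict) -> dict:
    return {"answer": "..."}  # placeholder

# Each example is evaluated num_repetitions times:
# 10 examples with num_repetitions=5 -> 50 total runs.
results = evaluate(
    my_target_function,
    data="my-dataset",  # hypothetical: name of an existing LangSmith dataset
    num_repetitions=5,
)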

Define a function for RAG performance testing

Create a RAG system to use for performance testing.

from myrag import PDFRAG


# Create a function to generate responses to questions.
def ask_question_with_llm(llm):
    # Create a PDFRAG object
    rag = PDFRAG(
        "data/Newwhitepaper_Agents2.pdf",
        llm,
    )

    # Create a retriever
    retriever = rag.create_retriever()

    # Create a chain
    rag_chain = rag.create_chain(retriever)

    def _ask_question(inputs: dict):
        # Context retrieval for the question
        context = retriever.invoke(inputs["question"])
        # Combine the retrieved documents into a single string.
        context = "\n".join([doc.page_content for doc in context])
        # Return a dictionary containing the question, context, and answer.
        return {
            "question": inputs["question"],
            "context": context,
            "answer": rag_chain.invoke(inputs["question"]),
        }

    return _ask_question
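
The PDFRAG class above comes from the local myrag helper module, which is not shown in this notebook. As a rough, hypothetical sketch of what such a helper might look like (assuming a PyMuPDF loader, a FAISS vector store, and OpenAI embeddings, consistent with the packages installed earlier; the actual implementation may differ):

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


class PDFRAG:
    def __init__(self, file_path: str, llm):
        self.file_path = file_path
        self.llm = llm

    def create_retriever(self):
        # Load the PDF and split it into overlapping chunks
        docs = PyMuPDFLoader(self.file_path).load()
        splits = RecursiveCharacterTextSplitter(
            chunk_size=1000, chunk_overlap=50
        ).split_documents(docs)
        # Index the chunks in a FAISS vector store and expose it as a retriever
        vectorstore = FAISS.from_documents(splits, OpenAIEmbeddings())
        return vectorstore.as_retriever()

    def create_chain(self, retriever):
        # Simple answer-from-context prompt
        prompt = PromptTemplate.from_template(
            "Answer the question based only on the context.\n\n"
            "Context:\n{context}\n\nQuestion:\n{question}"
        )

        def _format_docs(docs):
            # Join the retrieved documents into a single context string
            return "\n".join(doc.page_content for doc in docs)

        # The retriever fills {context}; the raw question passes through
        return (
            {
                "context": retriever | RunnableLambda(_format_docs),
                "question": RunnablePassthrough(),
            }
            | prompt
            | self.llm
            | StrOutputParser()
        )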
In this tutorial, we use the llama3.2 model for the repetitive evaluations. Make sure to install Ollama on your local machine and run ollama pull llama3.2 to download the model before proceeding.

!ollama pull llama3.2
pulling manifest 
    pulling dde5aa3fc5ff... 100% ▕████████████████▏ 2.0 GB                         
    pulling 966de95ca8a6... 100% ▕████████████████▏ 1.4 KB                         
    pulling fcc5a6bec9da... 100% ▕████████████████▏ 7.7 KB                         
    pulling a70ff7e570d9... 100% ▕████████████████▏ 6.0 KB                         
    pulling 56bb8bd477a5... 100% ▕████████████████▏   96 B                         
    pulling 34bb5ab01051... 100% ▕████████████████▏  561 B                         
    verifying sha256 digest 
    writing manifest 
    success 

Below is an example of loading and invoking the model:

from langchain_ollama import ChatOllama

# Load the Ollama model
ollama = ChatOllama(model="llama3.2")

# Call the Ollama model
ollama.invoke("hello") 
AIMessage(content='Hello! How can I assist you today?', additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2025-02-17T06:53:39.1001407Z', 'done': True, 'done_reason': 'stop', 'total_duration': 640983000, 'load_duration': 31027500, 'prompt_eval_count': 26, 'prompt_eval_duration': 288000000, 'eval_count': 10, 'eval_duration': 319000000, 'message': Message(role='assistant', content='', images=None, tool_calls=None)}, id='run-e563e830-e561-4333-a402-ef1227d68222-0', usage_metadata={'input_tokens': 26, 'output_tokens': 10, 'total_tokens': 36})
from langchain_openai import ChatOpenAI

# Create a RAG chain backed by the GPT model
gpt_chain = ask_question_with_llm(ChatOpenAI(model="gpt-4o-mini", temperature=1.0))

# Create a RAG chain backed by the Ollama model
ollama_chain = ask_question_with_llm(ChatOllama(model="llama3.2"))

Repetitive evaluation of RAG using GPT models

This section demonstrates the process of conducting repetitive evaluations of a RAG system using GPT models. It focuses on setting up and executing repeated tests to assess the consistency and performance of the RAG system across various scenarios, helping to identify potential areas for improvement and ensure reliable outputs.

from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Create a QA evaluator
cot_qa_evaluator = LangChainStringEvaluator(
    "cot_qa",
    config={"llm": ChatOpenAI(model="gpt-4o-mini", temperature=0)},
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": run.outputs["context"],
        "input": example.inputs["question"],
    },
)

dataset_name = "RAG_EVAL_DATASET"

# Run the evaluation, repeating each example 3 times
evaluate(
    gpt_chain,
    data=dataset_name,
    evaluators=[cot_qa_evaluator],
    experiment_prefix="REPEAT_EVAL",
    # Specify the experiment metadata.
    metadata={
        "variant": "Perform repeat evaluation. GPT-4o-mini model (cot_qa)",
    },
    num_repetitions=3,
)
View the evaluation results for experiment: 'REPEAT_EVAL-9906ae0d' at:
    https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=4ca1ec21-cda0-4b78-abda-f3ad3b42edc5
    
    
|   | inputs.question | outputs.question | outputs.context | outputs.answer | error | reference.answer | feedback.COT Contextual Accuracy | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What are the three targeted learnings to enhan... | What are the three targeted learnings to enhan... | Agents\n33\nSeptember 2024\nEnhancing model pe... | The three targeted learnings to enhance model ... | None | The three targeted learning approaches to enha... | 1 | 13.151277 | 0e661de4-636b-425d-8f6e-0a52b8070576 | 510240bb-4c28-4440-a769-929be7edb98f |
| 1 | What are the key functions of an agent's orche... | What are the key functions of an agent's orche... | implementation of the agent orchestration laye... | The key functions of an agent's orchestration ... | None | The key functions of an agent's orchestration ... | 1 | 4.226702 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | 60c42896-89fe-4a57-b8e3-e5cdacabae30 |
| 2 | List up the name of the authors | List up the name of the authors | Agents\nAuthors: Julia Wiesinger, Patrick Marl... | The authors of the document are Julia Wiesinge... | None | The authors are Julia Wiesinger, Patrick Marlo... | 1 | 2.524669 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | d9a3335b-06d6-46a0-bcb1-3a84d3d56c66 |
| 3 | What is Tree-of-thoughts? | What is Tree-of-thoughts? | weaknesses depending on the specific applicati... | Tree-of-thoughts (ToT) is a prompt engineering... | None | Tree-of-thoughts (ToT) is a prompt engineering... | 1 | 2.944406 | be18ec98-ab18-4f30-9205-e75f1cb70844 | 0d8cc590-0518-4098-b006-b0613d5e7cb8 |
| 4 | What is the framework used for reasoning and p... | What is the framework used for reasoning and p... | reasoning frameworks (CoT, ReAct, etc.) to \nf... | The framework used for reasoning and planning ... | None | The frameworks used for reasoning and planning... | 1 | 2.452457 | eb4b29a7-511c-4f78-a08f-2d5afeb84320 | 155ef405-4754-441f-a178-177922122d63 |
| 5 | How do agents differ from standalone language ... | How do agents differ from standalone language ... | 1.\t Agents extend the capabilities of languag... | Agents differ from standalone language models ... | None | Agents can use tools to access real-time data ... | 1 | 2.868793 | f4a5a0cf-2d2e-4e15-838a-bc8296eb708b | e0d61836-a440-463d-82c0-c32053b6337b |
| 6 | What are the three targeted learnings to enhan... | What are the three targeted learnings to enhan... | Agents\n33\nSeptember 2024\nEnhancing model pe... | The three targeted learnings to enhance model ... | None | The three targeted learning approaches to enha... | 1 | 3.615821 | 0e661de4-636b-425d-8f6e-0a52b8070576 | 65fb7cdf-4545-4330-b4b4-055fdfe710cb |
| 7 | What are the key functions of an agent's orche... | What are the key functions of an agent's orche... | implementation of the agent orchestration laye... | The key functions of an agent's orchestration ... | None | The key functions of an agent's orchestration ... | 1 | 2.201849 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | 9d587a12-e035-45d6-9a8b-64c58ae4dd67 |
| 8 | List up the name of the authors | List up the name of the authors | Agents\nAuthors: Julia Wiesinger, Patrick Marl... | The authors listed are Julia Wiesinger, Patric... | None | The authors are Julia Wiesinger, Patrick Marlo... | 1 | 1.720297 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | eaff2aba-0e70-4a7c-b47f-912ac6318016 |
| 9 | What is Tree-of-thoughts? | What is Tree-of-thoughts? | weaknesses depending on the specific applicati... | Tree-of-thoughts (ToT) is a prompt engineering... | None | Tree-of-thoughts (ToT) is a prompt engineering... | 1 | 2.107871 | be18ec98-ab18-4f30-9205-e75f1cb70844 | 7029baaf-2e66-4d71-98c5-443577b5c430 |
| 10 | What is the framework used for reasoning and p... | What is the framework used for reasoning and p... | reasoning frameworks (CoT, ReAct, etc.) to \nf... | The frameworks used for reasoning and planning... | None | The frameworks used for reasoning and planning... | 1 | 2.265368 | eb4b29a7-511c-4f78-a08f-2d5afeb84320 | 04b223a3-5ae5-4180-a0c0-db818a9e28af |
| 11 | How do agents differ from standalone language ... | How do agents differ from standalone language ... | 1.\t Agents extend the capabilities of languag... | Agents differ from standalone language models ... | None | Agents can use tools to access real-time data ... | 1 | 2.088294 | f4a5a0cf-2d2e-4e15-838a-bc8296eb708b | 676c6265-8cc1-41ac-828c-e294ac3f4a10 |
| 12 | What are the three targeted learnings to enhan... | What are the three targeted learnings to enhan... | Agents\n33\nSeptember 2024\nEnhancing model pe... | The three targeted learning approaches mention... | None | The three targeted learning approaches to enha... | 1 | 3.550540 | 0e661de4-636b-425d-8f6e-0a52b8070576 | 1b92081d-ca19-4679-906e-187dea30a5dc |
| 13 | What are the key functions of an agent's orche... | What are the key functions of an agent's orche... | implementation of the agent orchestration laye... | The key functions of an agent's orchestration ... | None | The key functions of an agent's orchestration ... | 1 | 4.070889 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | 07b70cac-203f-4d39-998d-befef6bc0bd8 |
| 14 | List up the name of the authors | List up the name of the authors | Agents\nAuthors: Julia Wiesinger, Patrick Marl... | The authors are Julia Wiesinger, Patrick Marlo... | None | The authors are Julia Wiesinger, Patrick Marlo... | 1 | 1.588084 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | 0f6ccf7a-f79f-4fdb-ab00-4831930e6e98 |
| 15 | What is Tree-of-thoughts? | What is Tree-of-thoughts? | weaknesses depending on the specific applicati... | Tree-of-thoughts (ToT) is a prompt engineering... | None | Tree-of-thoughts (ToT) is a prompt engineering... | 1 | 2.138192 | be18ec98-ab18-4f30-9205-e75f1cb70844 | bd0f5f68-215e-4756-b87b-0aef5e4f01ab |
| 16 | What is the framework used for reasoning and p... | What is the framework used for reasoning and p... | reasoning frameworks (CoT, ReAct, etc.) to \nf... | The frameworks used for reasoning and planning... | None | The frameworks used for reasoning and planning... | 1 | 2.071085 | eb4b29a7-511c-4f78-a08f-2d5afeb84320 | 826d6013-987c-4095-80dd-612591271c2f |
| 17 | How do agents differ from standalone language ... | How do agents differ from standalone language ... | 1.\t Agents extend the capabilities of languag... | Agents differ from standalone language models ... | None | Agents can use tools to access real-time data ... | 1 | 2.863684 | f4a5a0cf-2d2e-4e15-838a-bc8296eb708b | 5b172bbf-abe0-4a71-8a32-d2f05e4039bb |
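
The table above is simply the pandas rendering of the experiment, so you can aggregate the repeated runs yourself. A short sketch, assuming the evaluate(...) call above is bound to a variable (e.g., results = evaluate(gpt_chain, ...)) and using the langsmith SDK's to_pandas() helper:

# Convert the experiment results to a DataFrame
df = results.to_pandas()

# Average the judge's score per example across the 3 repetitions
summary = df.groupby("example_id")["feedback.COT Contextual Accuracy"].mean()
print(summary)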

Repetitive evaluation of RAG using Ollama

This part performs the same repetitive evaluation of the RAG system using Ollama. It illustrates how to set up and run multiple tests with a locally hosted llama3.2 model, used both as the model under test and as the judge, allowing a comprehensive evaluation of the RAG system's performance with a local model.

# Create a QA evaluator that uses the local llama3.2 model as the judge
cot_qa_evaluator = LangChainStringEvaluator(
    "cot_qa",
    config={"llm": ChatOllama(model="llama3.2", temperature=0)},
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": run.outputs["context"],
        "input": example.inputs["question"],
    },
)

dataset_name = "RAG_EVAL_DATASET"

# Run the evaluation, repeating each example 3 times
evaluate(
    ollama_chain,
    data=dataset_name,
    evaluators=[cot_qa_evaluator],
    experiment_prefix="REPEAT_EVAL",
    # Specify the experiment metadata.
    metadata={
        "variant": "Perform repeat evaluation. Ollama(llama3.2) (cot_qa)",
    },
    num_repetitions=3,
)
View the evaluation results for experiment: 'REPEAT_EVAL-8279cd53' at:
    https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=cee9221e-93d8-40fd-9585-519466fa7f99
    
    
|   | inputs.question | outputs.question | outputs.context | outputs.answer | error | reference.answer | feedback.COT Contextual Accuracy | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What are the three targeted learnings to enhan... | What are the three targeted learnings to enhan... | Agents\n33\nSeptember 2024\nEnhancing model pe... | In-context learning, Fine-tuning based learning. | None | The three targeted learning approaches to enha... | 0.0 | 2.527441 | 0e661de4-636b-425d-8f6e-0a52b8070576 | 96233779-b37d-484f-85a8-22a7320ff72b |
| 1 | What are the key functions of an agent's orche... | What are the key functions of an agent's orche... | implementation of the agent orchestration laye... | Based on the retrieved context, it appears tha... | None | The key functions of an agent's orchestration ... | 0.0 | 7.891397 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | 5f761c37-3bf0-4b64-91bf-0b1167165184 |
| 2 | List up the name of the authors | List up the name of the authors | Agents\nAuthors: Julia Wiesinger, Patrick Marl... | The names of the authors are:\n\n1. Julia Wies... | None | The authors are Julia Wiesinger, Patrick Marlo... | 1.0 | 3.461620 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | 5e56e10f-9220-4107-b0da-cfd206e4cd27 |
| 3 | What is Tree-of-thoughts? | What is Tree-of-thoughts? | weaknesses depending on the specific applicati... | Tree-of-thoughts is a prompt engineering frame... | None | Tree-of-thoughts (ToT) is a prompt engineering... | 1.0 | 3.017406 | be18ec98-ab18-4f30-9205-e75f1cb70844 | 4f0f23af-2cf3-4de2-923f-d8cbbd184a47 |
| 4 | What is the framework used for reasoning and p... | What is the framework used for reasoning and p... | reasoning frameworks (CoT, ReAct, etc.) to \nf... | Based on the provided context, it appears that... | None | The frameworks used for reasoning and planning... | 0.0 | 8.636841 | eb4b29a7-511c-4f78-a08f-2d5afeb84320 | f729da06-0b0e-42ff-88f1-64676e19d1b0 |
| 5 | How do agents differ from standalone language ... | How do agents differ from standalone language ... | 1.\t Agents extend the capabilities of languag... | According to the context, agents differ from s... | None | Agents can use tools to access real-time data ... | 1.0 | 6.293883 | f4a5a0cf-2d2e-4e15-838a-bc8296eb708b | 045cdaba-4dc0-46ad-a955-4d00944bfabd |
| 6 | What are the three targeted learnings to enhan... | What are the three targeted learnings to enhan... | Agents\n33\nSeptember 2024\nEnhancing model pe... | The two methods mentioned for enhancing model ... | None | The three targeted learning approaches to enha... | 0.0 | 3.524431 | 0e661de4-636b-425d-8f6e-0a52b8070576 | e1f26ba7-cd91-4e4f-8684-4af4262b8c17 |
| 7 | What are the key functions of an agent's orche... | What are the key functions of an agent's orche... | implementation of the agent orchestration laye... | Based on the retrieved context, the key functi... | None | The key functions of an agent's orchestration ... | NaN | 5.473330 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | 10df33b1-8936-454f-9c13-9baedb8d557a |
| 8 | List up the name of the authors | List up the name of the authors | Agents\nAuthors: Julia Wiesinger, Patrick Marl... | The names of the authors are:\n\n1. Julia Wies... | None | The authors are Julia Wiesinger, Patrick Marlo... | 1.0 | 2.525374 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | 77e497f6-3f3e-400d-a385-72063096f879 |
| 9 | What is Tree-of-thoughts? | What is Tree-of-thoughts? | weaknesses depending on the specific applicati... | Tree-of-thoughts (ToT) is a prompt engineering... | None | Tree-of-thoughts (ToT) is a prompt engineering... | 1.0 | 2.907534 | be18ec98-ab18-4f30-9205-e75f1cb70844 | a6b767b3-b831-4cbb-a62f-2e351a948a01 |
| 10 | What is the framework used for reasoning and p... | What is the framework used for reasoning and p... | reasoning frameworks (CoT, ReAct, etc.) to \nf... | Based on the retrieved context, it appears tha... | None | The frameworks used for reasoning and planning... | 0.0 | 6.760531 | eb4b29a7-511c-4f78-a08f-2d5afeb84320 | c00fd2ce-4108-45e8-8b0d-0e2419e883f3 |
| 11 | How do agents differ from standalone language ... | How do agents differ from standalone language ... | 1.\t Agents extend the capabilities of languag... | Based on the provided context, it appears that... | None | Agents can use tools to access real-time data ... | 1.0 | 6.969271 | f4a5a0cf-2d2e-4e15-838a-bc8296eb708b | 239706b7-f82c-49dd-a4ba-15d845d40f3e |
| 12 | What are the three targeted learnings to enhan... | What are the three targeted learnings to enhan... | Agents\n33\nSeptember 2024\nEnhancing model pe... | In-context learning and Fine-tuning based lear... | None | The three targeted learning approaches to enha... | 0.0 | 2.515873 | 0e661de4-636b-425d-8f6e-0a52b8070576 | bad8da17-774d-43e4-b0f1-9436f4a6f516 |
| 13 | What are the key functions of an agent's orche... | What are the key functions of an agent's orche... | implementation of the agent orchestration laye... | The key functions of an agent's orchestration ... | None | The key functions of an agent's orchestration ... | 0.0 | 6.819861 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | a08170c2-8953-450f-9e49-1b431f87f506 |
| 14 | List up the name of the authors | List up the name of the authors | Agents\nAuthors: Julia Wiesinger, Patrick Marl... | The names of the authors are:\n\n1. Julia Wies... | None | The authors are Julia Wiesinger, Patrick Marlo... | 1.0 | 2.512632 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | e7b1221e-23fe-4715-8315-daa7375dd73f |
| 15 | What is Tree-of-thoughts? | What is Tree-of-thoughts? | weaknesses depending on the specific applicati... | Tree-of-Thoughts (ToT) is a prompt engineering... | None | Tree-of-thoughts (ToT) is a prompt engineering... | 1.0 | 3.005581 | be18ec98-ab18-4f30-9205-e75f1cb70844 | 9c043533-e24e-498d-a27c-02b5499fd27e |
| 16 | What is the framework used for reasoning and p... | What is the framework used for reasoning and p... | reasoning frameworks (CoT, ReAct, etc.) to \nf... | Based on the provided context, it seems that t... | None | The frameworks used for reasoning and planning... | 0.0 | 4.558945 | eb4b29a7-511c-4f78-a08f-2d5afeb84320 | 8875837e-fca5-4bf8-bf94-2fc733ae7387 |
| 17 | How do agents differ from standalone language ... | How do agents differ from standalone language ... | 1.\t Agents extend the capabilities of languag... | According to the retrieved context, agents dif... | None | Agents can use tools to access real-time data ... | 0.0 | 5.888388 | f4a5a0cf-2d2e-4e15-838a-bc8296eb708b | 1889177c-ea36-488d-9327-26147f4e83ee |
