
LangSmith Repeat Evaluation


  • Author: Hwayoung Cha

  • Design:

  • Peer Review:

  • This is a part of LangChain Open Tutorial

Overview

Repeat evaluation is a method for measuring a model's performance more accurately by performing multiple evaluations on the same dataset.

You can add repetition to an experiment. This notebook demonstrates how to use LangSmith for repeatable evaluations of language models: it covers setting up evaluation workflows, running repeated evaluations with different models, and analyzing the results to check consistency. The focus is on leveraging LangSmith's tools for reproducible and scalable model evaluation.

Repeating the evaluation multiple times is useful in the following cases:

  • For larger evaluation sets

  • For chains that can generate variable responses

  • For evaluations that can produce variable scores (e.g., llm-as-judge)

You can learn how to run an evaluation from the LangSmith guide How to run an evaluation.

Table of Contents

  • Overview
  • Environment Setup
  • Performing Repetitive Evaluations with num_repetitions
  • Define a function for RAG performance testing
  • Repetitive evaluation of RAG using GPT models
  • Repetitive evaluation of RAG using Ollama

References

  • How to run an evaluation (LangSmith documentation)
  • How to evaluate with repetitions (LangSmith documentation)

Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials.
  • You can check out the langchain-opentutorial package for more details.

%%capture --no-stderr
%pip install langchain-opentutorial
    [notice] A new release of pip is available: 24.3.1 -> 25.0.1
    [notice] To update, run: python.exe -m pip install --upgrade pip
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_openai",
        "langchain_core",
        "langchain_community",
        "langchain_ollama",
        "faiss-cpu",
        "pymupdf",
    ],
    verbose=False,
    upgrade=False,
)
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGSMITH_TRACING_V2": "true",
        "LANGSMITH_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_PROJECT": "Repeat-Evaluations"
    }
)
Environment variables have been set successfully.

You can alternatively set OPENAI_API_KEY in a .env file and load it.

[Note] This is not necessary if you've already set OPENAI_API_KEY in previous steps.

# Configuration file to manage API keys as environment variables
from dotenv import load_dotenv

# Load API key information
load_dotenv(override=True)
True

Performing Repetitive Evaluations with num_repetitions

LangSmith offers a simple way to perform repetitive evaluations using the num_repetitions parameter in the evaluate function. This parameter specifies how many times each example in your dataset should be evaluated.

When you set num_repetitions=N, LangSmith will:

  1. Run each example in your dataset N times.

  2. Aggregate the results to provide a more accurate measure of your model's performance.

For example, if your dataset has 10 examples and you set num_repetitions=5, each example will be evaluated 5 times, resulting in a total of 50 runs.
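
As a minimal sketch of the call shape (with a hypothetical target function and dataset name; the full RAG examples below follow the same pattern):

from langsmith.evaluation import evaluate

# Hypothetical target: any callable that maps dataset inputs to outputs
def my_target_function(inputs: dict) -> dict:
    return {"answer": "..."}  # placeholder

# Each example is evaluated num_repetitions times:
# 10 examples with num_repetitions=5 -> 50 total runs.
results = evaluate(
    my_target_function,
    data="my-dataset",  # hypothetical: name of an existing LangSmith dataset
    num_repetitions=5,
)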

Define a function for RAG performance testing

Create a RAG system to use for performance testing.

from myrag import PDFRAG


# Create a function to generate responses to questions.
def ask_question_with_llm(llm):
    # Create a PDFRAG object
    rag = PDFRAG(
        "data/Newwhitepaper_Agents2.pdf",
        llm,
    )

    # Create a retriever
    retriever = rag.create_retriever()

    # Create a chain
    rag_chain = rag.create_chain(retriever)

    def _ask_question(inputs: dict):
        # Context retrieval for the question
        context = retriever.invoke(inputs["question"])
        # Combine the retrieved documents into a single string.
        context = "\n".join([doc.page_content for doc in context])
        # Return a dictionary containing the question, context, and answer.
        return {
            "question": inputs["question"],
            "context": context,
            "answer": rag_chain.invoke(inputs["question"]),
        }

    return _ask_question
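
The PDFRAG class above comes from the local myrag helper module, which is not shown in this notebook. As a rough, hypothetical sketch of what such a helper might look like (assuming a PyMuPDF loader, a FAISS vector store, and OpenAI embeddings, consistent with the packages installed earlier; the actual implementation may differ):

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


class PDFRAG:
    def __init__(self, file_path: str, llm):
        self.file_path = file_path
        self.llm = llm

    def create_retriever(self):
        # Load the PDF and split it into overlapping chunks
        docs = PyMuPDFLoader(self.file_path).load()
        splits = RecursiveCharacterTextSplitter(
            chunk_size=1000, chunk_overlap=50
        ).split_documents(docs)
        # Index the chunks in a FAISS vector store and expose it as a retriever
        vectorstore = FAISS.from_documents(splits, OpenAIEmbeddings())
        return vectorstore.as_retriever()

    def create_chain(self, retriever):
        # Simple answer-from-context prompt
        prompt = PromptTemplate.from_template(
            "Answer the question based only on the context.\n\n"
            "Context:\n{context}\n\nQuestion:\n{question}"
        )

        def _format_docs(docs):
            # Join the retrieved documents into a single context string
            return "\n".join(doc.page_content for doc in docs)

        # The retriever fills {context}; the raw question passes through
        return (
            {
                "context": retriever | RunnableLambda(_format_docs),
                "question": RunnablePassthrough(),
            }
            | prompt
            | self.llm
            | StrOutputParser()
        )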
In this tutorial, we use the llama3.2 model for the repetitive evaluations. Make sure to install Ollama on your local machine and run ollama pull llama3.2 to download the model before proceeding.

!ollama pull llama3.2
pulling manifest 
    pulling dde5aa3fc5ff... 100% ▕████████████████▏ 2.0 GB                         
    pulling 966de95ca8a6... 100% ▕████████████████▏ 1.4 KB                         
    pulling fcc5a6bec9da... 100% ▕████████████████▏ 7.7 KB                         
    pulling a70ff7e570d9... 100% ▕████████████████▏ 6.0 KB                         
    pulling 56bb8bd477a5... 100% ▕████████████████▏   96 B                         
    pulling 34bb5ab01051... 100% ▕████████████████▏  561 B                         
    verifying sha256 digest 
    writing manifest 
    success 

Below is an example of loading and invoking the model:

from langchain_ollama import ChatOllama

# Load the Ollama model
ollama = ChatOllama(model="llama3.2")

# Call the Ollama model
ollama.invoke("hello") 
AIMessage(content='Hello! How can I assist you today?', additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2025-02-17T06:53:39.1001407Z', 'done': True, 'done_reason': 'stop', 'total_duration': 640983000, 'load_duration': 31027500, 'prompt_eval_count': 26, 'prompt_eval_duration': 288000000, 'eval_count': 10, 'eval_duration': 319000000, 'message': Message(role='assistant', content='', images=None, tool_calls=None)}, id='run-e563e830-e561-4333-a402-ef1227d68222-0', usage_metadata={'input_tokens': 26, 'output_tokens': 10, 'total_tokens': 36})
from langchain_openai import ChatOpenAI

# Create a RAG chain backed by the GPT model
gpt_chain = ask_question_with_llm(ChatOpenAI(model="gpt-4o-mini", temperature=1.0))

# Create a RAG chain backed by the Ollama model
ollama_chain = ask_question_with_llm(ChatOllama(model="llama3.2"))

Repetitive evaluation of RAG using GPT models

This section demonstrates the process of conducting repetitive evaluations of a RAG system using GPT models. It focuses on setting up and executing repeated tests to assess the consistency and performance of the RAG system across various scenarios, helping to identify potential areas for improvement and ensure reliable outputs.

from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Create a QA evaluator
cot_qa_evaluator = LangChainStringEvaluator(
    "cot_qa",
    config={"llm": ChatOpenAI(model="gpt-4o-mini", temperature=0)},
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": run.outputs["context"],
        "input": example.inputs["question"],
    },
)

dataset_name = "RAG_EVAL_DATASET"

# Run the evaluation, repeating each example 3 times
evaluate(
    gpt_chain,
    data=dataset_name,
    evaluators=[cot_qa_evaluator],
    experiment_prefix="REPEAT_EVAL",
    # Specify the experiment metadata.
    metadata={
        "variant": "Perform repeat evaluation. GPT-4o-mini model (cot_qa)",
    },
    num_repetitions=3,
)
View the evaluation results for experiment: 'REPEAT_EVAL-9906ae0d' at:
    https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=4ca1ec21-cda0-4b78-abda-f3ad3b42edc5
    
    
|   | inputs.question | outputs.question | outputs.context | outputs.answer | error | reference.answer | feedback.COT Contextual Accuracy | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What are the three targeted learnings to enhan... | What are the three targeted learnings to enhan... | Agents\n33\nSeptember 2024\nEnhancing model pe... | The three targeted learnings to enhance model ... | None | The three targeted learning approaches to enha... | 1 | 13.151277 | 0e661de4-636b-425d-8f6e-0a52b8070576 | 510240bb-4c28-4440-a769-929be7edb98f |
| 1 | What are the key functions of an agent's orche... | What are the key functions of an agent's orche... | implementation of the agent orchestration laye... | The key functions of an agent's orchestration ... | None | The key functions of an agent's orchestration ... | 1 | 4.226702 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | 60c42896-89fe-4a57-b8e3-e5cdacabae30 |
| 2 | List up the name of the authors | List up the name of the authors | Agents\nAuthors: Julia Wiesinger, Patrick Marl... | The authors of the document are Julia Wiesinge... | None | The authors are Julia Wiesinger, Patrick Marlo... | 1 | 2.524669 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | d9a3335b-06d6-46a0-bcb1-3a84d3d56c66 |
| 3 | What is Tree-of-thoughts? | What is Tree-of-thoughts? | weaknesses depending on the specific applicati... | Tree-of-thoughts (ToT) is a prompt engineering... | None | Tree-of-thoughts (ToT) is a prompt engineering... | 1 | 2.944406 | be18ec98-ab18-4f30-9205-e75f1cb70844 | 0d8cc590-0518-4098-b006-b0613d5e7cb8 |
| 4 | What is the framework used for reasoning and p... | What is the framework used for reasoning and p... | reasoning frameworks (CoT, ReAct, etc.) to \nf... | The framework used for reasoning and planning ... | None | The frameworks used for reasoning and planning... | 1 | 2.452457 | eb4b29a7-511c-4f78-a08f-2d5afeb84320 | 155ef405-4754-441f-a178-177922122d63 |
| 5 | How do agents differ from standalone language ... | How do agents differ from standalone language ... | 1.\t Agents extend the capabilities of languag... | Agents differ from standalone language models ... | None | Agents can use tools to access real-time data ... | 1 | 2.868793 | f4a5a0cf-2d2e-4e15-838a-bc8296eb708b | e0d61836-a440-463d-82c0-c32053b6337b |
| 6 | What are the three targeted learnings to enhan... | What are the three targeted learnings to enhan... | Agents\n33\nSeptember 2024\nEnhancing model pe... | The three targeted learnings to enhance model ... | None | The three targeted learning approaches to enha... | 1 | 3.615821 | 0e661de4-636b-425d-8f6e-0a52b8070576 | 65fb7cdf-4545-4330-b4b4-055fdfe710cb |
| 7 | What are the key functions of an agent's orche... | What are the key functions of an agent's orche... | implementation of the agent orchestration laye... | The key functions of an agent's orchestration ... | None | The key functions of an agent's orchestration ... | 1 | 2.201849 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | 9d587a12-e035-45d6-9a8b-64c58ae4dd67 |
| 8 | List up the name of the authors | List up the name of the authors | Agents\nAuthors: Julia Wiesinger, Patrick Marl... | The authors listed are Julia Wiesinger, Patric... | None | The authors are Julia Wiesinger, Patrick Marlo... | 1 | 1.720297 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | eaff2aba-0e70-4a7c-b47f-912ac6318016 |
| 9 | What is Tree-of-thoughts? | What is Tree-of-thoughts? | weaknesses depending on the specific applicati... | Tree-of-thoughts (ToT) is a prompt engineering... | None | Tree-of-thoughts (ToT) is a prompt engineering... | 1 | 2.107871 | be18ec98-ab18-4f30-9205-e75f1cb70844 | 7029baaf-2e66-4d71-98c5-443577b5c430 |
| 10 | What is the framework used for reasoning and p... | What is the framework used for reasoning and p... | reasoning frameworks (CoT, ReAct, etc.) to \nf... | The frameworks used for reasoning and planning... | None | The frameworks used for reasoning and planning... | 1 | 2.265368 | eb4b29a7-511c-4f78-a08f-2d5afeb84320 | 04b223a3-5ae5-4180-a0c0-db818a9e28af |
| 11 | How do agents differ from standalone language ... | How do agents differ from standalone language ... | 1.\t Agents extend the capabilities of languag... | Agents differ from standalone language models ... | None | Agents can use tools to access real-time data ... | 1 | 2.088294 | f4a5a0cf-2d2e-4e15-838a-bc8296eb708b | 676c6265-8cc1-41ac-828c-e294ac3f4a10 |
| 12 | What are the three targeted learnings to enhan... | What are the three targeted learnings to enhan... | Agents\n33\nSeptember 2024\nEnhancing model pe... | The three targeted learning approaches mention... | None | The three targeted learning approaches to enha... | 1 | 3.550540 | 0e661de4-636b-425d-8f6e-0a52b8070576 | 1b92081d-ca19-4679-906e-187dea30a5dc |
| 13 | What are the key functions of an agent's orche... | What are the key functions of an agent's orche... | implementation of the agent orchestration laye... | The key functions of an agent's orchestration ... | None | The key functions of an agent's orchestration ... | 1 | 4.070889 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | 07b70cac-203f-4d39-998d-befef6bc0bd8 |
| 14 | List up the name of the authors | List up the name of the authors | Agents\nAuthors: Julia Wiesinger, Patrick Marl... | The authors are Julia Wiesinger, Patrick Marlo... | None | The authors are Julia Wiesinger, Patrick Marlo... | 1 | 1.588084 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | 0f6ccf7a-f79f-4fdb-ab00-4831930e6e98 |
| 15 | What is Tree-of-thoughts? | What is Tree-of-thoughts? | weaknesses depending on the specific applicati... | Tree-of-thoughts (ToT) is a prompt engineering... | None | Tree-of-thoughts (ToT) is a prompt engineering... | 1 | 2.138192 | be18ec98-ab18-4f30-9205-e75f1cb70844 | bd0f5f68-215e-4756-b87b-0aef5e4f01ab |
| 16 | What is the framework used for reasoning and p... | What is the framework used for reasoning and p... | reasoning frameworks (CoT, ReAct, etc.) to \nf... | The frameworks used for reasoning and planning... | None | The frameworks used for reasoning and planning... | 1 | 2.071085 | eb4b29a7-511c-4f78-a08f-2d5afeb84320 | 826d6013-987c-4095-80dd-612591271c2f |
| 17 | How do agents differ from standalone language ... | How do agents differ from standalone language ... | 1.\t Agents extend the capabilities of languag... | Agents differ from standalone language models ... | None | Agents can use tools to access real-time data ... | 1 | 2.863684 | f4a5a0cf-2d2e-4e15-838a-bc8296eb708b | 5b172bbf-abe0-4a71-8a32-d2f05e4039bb |
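
The table above is simply the pandas rendering of the experiment, so you can aggregate the repeated runs yourself. A short sketch, assuming the evaluate(...) call above is bound to a variable (e.g., results = evaluate(gpt_chain, ...)) and using the langsmith SDK's to_pandas() helper:

# Convert the experiment results to a DataFrame
df = results.to_pandas()

# Average the judge's score per example across the 3 repetitions
summary = df.groupby("example_id")["feedback.COT Contextual Accuracy"].mean()
print(summary)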

Repetitive evaluation of RAG using Ollama

This part performs the same repetitive evaluation of the RAG system using Ollama. It illustrates how to set up and run multiple tests with a locally hosted llama3.2 model, used both as the model under test and as the judge, allowing a comprehensive evaluation of the RAG system's performance with a local model.

# Create a QA evaluator that uses the local llama3.2 model as the judge
cot_qa_evaluator = LangChainStringEvaluator(
    "cot_qa",
    config={"llm": ChatOllama(model="llama3.2", temperature=0)},
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": run.outputs["context"],
        "input": example.inputs["question"],
    },
)

dataset_name = "RAG_EVAL_DATASET"

# Run the evaluation, repeating each example 3 times
evaluate(
    ollama_chain,
    data=dataset_name,
    evaluators=[cot_qa_evaluator],
    experiment_prefix="REPEAT_EVAL",
    # Specify the experiment metadata.
    metadata={
        "variant": "Perform repeat evaluation. Ollama(llama3.2) (cot_qa)",
    },
    num_repetitions=3,
)
View the evaluation results for experiment: 'REPEAT_EVAL-8279cd53' at:
    https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=cee9221e-93d8-40fd-9585-519466fa7f99
    
    
|   | inputs.question | outputs.question | outputs.context | outputs.answer | error | reference.answer | feedback.COT Contextual Accuracy | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What are the three targeted learnings to enhan... | What are the three targeted learnings to enhan... | Agents\n33\nSeptember 2024\nEnhancing model pe... | In-context learning, Fine-tuning based learning. | None | The three targeted learning approaches to enha... | 0.0 | 2.527441 | 0e661de4-636b-425d-8f6e-0a52b8070576 | 96233779-b37d-484f-85a8-22a7320ff72b |
| 1 | What are the key functions of an agent's orche... | What are the key functions of an agent's orche... | implementation of the agent orchestration laye... | Based on the retrieved context, it appears tha... | None | The key functions of an agent's orchestration ... | 0.0 | 7.891397 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | 5f761c37-3bf0-4b64-91bf-0b1167165184 |
| 2 | List up the name of the authors | List up the name of the authors | Agents\nAuthors: Julia Wiesinger, Patrick Marl... | The names of the authors are:\n\n1. Julia Wies... | None | The authors are Julia Wiesinger, Patrick Marlo... | 1.0 | 3.461620 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | 5e56e10f-9220-4107-b0da-cfd206e4cd27 |
| 3 | What is Tree-of-thoughts? | What is Tree-of-thoughts? | weaknesses depending on the specific applicati... | Tree-of-thoughts is a prompt engineering frame... | None | Tree-of-thoughts (ToT) is a prompt engineering... | 1.0 | 3.017406 | be18ec98-ab18-4f30-9205-e75f1cb70844 | 4f0f23af-2cf3-4de2-923f-d8cbbd184a47 |
| 4 | What is the framework used for reasoning and p... | What is the framework used for reasoning and p... | reasoning frameworks (CoT, ReAct, etc.) to \nf... | Based on the provided context, it appears that... | None | The frameworks used for reasoning and planning... | 0.0 | 8.636841 | eb4b29a7-511c-4f78-a08f-2d5afeb84320 | f729da06-0b0e-42ff-88f1-64676e19d1b0 |
| 5 | How do agents differ from standalone language ... | How do agents differ from standalone language ... | 1.\t Agents extend the capabilities of languag... | According to the context, agents differ from s... | None | Agents can use tools to access real-time data ... | 1.0 | 6.293883 | f4a5a0cf-2d2e-4e15-838a-bc8296eb708b | 045cdaba-4dc0-46ad-a955-4d00944bfabd |
| 6 | What are the three targeted learnings to enhan... | What are the three targeted learnings to enhan... | Agents\n33\nSeptember 2024\nEnhancing model pe... | The two methods mentioned for enhancing model ... | None | The three targeted learning approaches to enha... | 0.0 | 3.524431 | 0e661de4-636b-425d-8f6e-0a52b8070576 | e1f26ba7-cd91-4e4f-8684-4af4262b8c17 |
| 7 | What are the key functions of an agent's orche... | What are the key functions of an agent's orche... | implementation of the agent orchestration laye... | Based on the retrieved context, the key functi... | None | The key functions of an agent's orchestration ... | NaN | 5.473330 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | 10df33b1-8936-454f-9c13-9baedb8d557a |
| 8 | List up the name of the authors | List up the name of the authors | Agents\nAuthors: Julia Wiesinger, Patrick Marl... | The names of the authors are:\n\n1. Julia Wies... | None | The authors are Julia Wiesinger, Patrick Marlo... | 1.0 | 2.525374 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | 77e497f6-3f3e-400d-a385-72063096f879 |
| 9 | What is Tree-of-thoughts? | What is Tree-of-thoughts? | weaknesses depending on the specific applicati... | Tree-of-thoughts (ToT) is a prompt engineering... | None | Tree-of-thoughts (ToT) is a prompt engineering... | 1.0 | 2.907534 | be18ec98-ab18-4f30-9205-e75f1cb70844 | a6b767b3-b831-4cbb-a62f-2e351a948a01 |
| 10 | What is the framework used for reasoning and p... | What is the framework used for reasoning and p... | reasoning frameworks (CoT, ReAct, etc.) to \nf... | Based on the retrieved context, it appears tha... | None | The frameworks used for reasoning and planning... | 0.0 | 6.760531 | eb4b29a7-511c-4f78-a08f-2d5afeb84320 | c00fd2ce-4108-45e8-8b0d-0e2419e883f3 |
| 11 | How do agents differ from standalone language ... | How do agents differ from standalone language ... | 1.\t Agents extend the capabilities of languag... | Based on the provided context, it appears that... | None | Agents can use tools to access real-time data ... | 1.0 | 6.969271 | f4a5a0cf-2d2e-4e15-838a-bc8296eb708b | 239706b7-f82c-49dd-a4ba-15d845d40f3e |
| 12 | What are the three targeted learnings to enhan... | What are the three targeted learnings to enhan... | Agents\n33\nSeptember 2024\nEnhancing model pe... | In-context learning and Fine-tuning based lear... | None | The three targeted learning approaches to enha... | 0.0 | 2.515873 | 0e661de4-636b-425d-8f6e-0a52b8070576 | bad8da17-774d-43e4-b0f1-9436f4a6f516 |
| 13 | What are the key functions of an agent's orche... | What are the key functions of an agent's orche... | implementation of the agent orchestration laye... | The key functions of an agent's orchestration ... | None | The key functions of an agent's orchestration ... | 0.0 | 6.819861 | 3561c6fe-6ed4-4182-989a-270dcd635f32 | a08170c2-8953-450f-9e49-1b431f87f506 |
| 14 | List up the name of the authors | List up the name of the authors | Agents\nAuthors: Julia Wiesinger, Patrick Marl... | The names of the authors are:\n\n1. Julia Wies... | None | The authors are Julia Wiesinger, Patrick Marlo... | 1.0 | 2.512632 | b03e98d1-44ad-4142-8dfa-7b0a31a57096 | e7b1221e-23fe-4715-8315-daa7375dd73f |
| 15 | What is Tree-of-thoughts? | What is Tree-of-thoughts? | weaknesses depending on the specific applicati... | Tree-of-Thoughts (ToT) is a prompt engineering... | None | Tree-of-thoughts (ToT) is a prompt engineering... | 1.0 | 3.005581 | be18ec98-ab18-4f30-9205-e75f1cb70844 | 9c043533-e24e-498d-a27c-02b5499fd27e |
| 16 | What is the framework used for reasoning and p... | What is the framework used for reasoning and p... | reasoning frameworks (CoT, ReAct, etc.) to \nf... | Based on the provided context, it seems that t... | None | The frameworks used for reasoning and planning... | 0.0 | 4.558945 | eb4b29a7-511c-4f78-a08f-2d5afeb84320 | 8875837e-fca5-4bf8-bf94-2fc733ae7387 |
| 17 | How do agents differ from standalone language ... | How do agents differ from standalone language ... | 1.\t Agents extend the capabilities of languag... | According to the retrieved context, agents dif... | None | Agents can use tools to access real-time data ... | 0.0 | 5.888388 | f4a5a0cf-2d2e-4e15-838a-bc8296eb708b | 1889177c-ea36-488d-9327-26147f4e83ee |
