LangSmith-Dataset
Author: Minji
Design:
Peer Review:
This is part of the LangChain Open Tutorial
Overview
This notebook demonstrates how to create a dataset for evaluating Retrieval-Augmented Generation (RAG) models using LangSmith. It covers setting up environment variables, creating datasets of questions and answers, and uploading examples to LangSmith for testing. It also shows how to use HuggingFace datasets and how to update a dataset with new examples.
Table of Contents
Overview
Environment Setup
Creating a LangSmith Dataset
Creating Examples for LangSmith Dataset
Creating a Dataset for LangSmith Testing
Environment Setup
Setting up your environment is the first step. See the Environment Setup guide for more details.
[Note]
The langchain-opentutorial is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials. Check out langchain-opentutorial for more details.
%%capture --no-stderr
%pip install langchain-opentutorial

# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
    ],
    verbose=False,
    upgrade=False,
)
You can set API keys in a .env file or set them manually.

[Note] If you're not using the .env file, no worries! Just enter the keys directly in the cell below, and you're good to go.
from dotenv import load_dotenv
from langchain_opentutorial import set_env

# Attempt to load environment variables from a .env file; if unsuccessful, set them manually.
if not load_dotenv():
    set_env(
        {
            "OPENAI_API_KEY": "",
            "LANGCHAIN_API_KEY": "",
            "LANGCHAIN_TRACING_V2": "true",
            "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
            "LANGCHAIN_PROJECT": "04-LangSmith-Dataset",  # set the project name same as the title
            "HUGGINGFACEHUB_API_TOKEN": "",
        }
    )
You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.
# Configuration file to manage the API KEY as an environment variable
from dotenv import load_dotenv
# Load API KEY information
load_dotenv(override=True)
True
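As a quick sanity check, you can confirm that the required keys are visible in the current session. This is a minimal sketch; adjust the list of keys to whatever your setup actually uses.

import os

# Keys this tutorial relies on; adjust the list as needed
required_keys = ["OPENAI_API_KEY", "LANGCHAIN_API_KEY"]

for key in required_keys:
    # Report presence only; never print the secret value itself
    print(f"{key}: {'set' if os.environ.get(key) else 'MISSING'}")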
Creating a LangSmith Dataset
Let's learn how to build a custom RAG evaluation dataset.

To construct a dataset, you need to understand three main cases:

Case: Evaluating whether the retrieval is relevant to the question
Question - Retrieval

Case: Evaluating whether the answer is relevant to the question
Question - Answer

Case: Checking whether the answer is based on the retrieved documents (hallucination check)
Retrieval - Answer

Thus, you typically need Question, Retrieval, and Answer information. However, it is practically challenging to construct ground truth for Retrieval.

If ground truth for Retrieval exists, you can save and use it all in your dataset. Otherwise, you can create and use a dataset with only Question and Answer.
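For instance, a single example that includes retrieval ground truth could be represented as follows. This is a minimal sketch; the field names question, retrieval, and answer are illustrative, not a fixed LangSmith schema.

# One evaluation example with retrieval ground truth (illustrative field names)
example = {
    "question": "What is the name of the generative AI created by Samsung Electronics?",
    "retrieval": [
        # Hypothetical ground-truth context snippet
        "Samsung Electronics unveiled its generative AI model, Samsung Gauss.",
    ],
    "answer": "The name of the generative AI created by Samsung Electronics is Samsung Gauss.",
}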
Creating Examples for LangSmith Dataset
Use inputs and outputs to create a dataset. The dataset consists of questions and answers.
import pandas as pd

# List of questions
inputs = [
    "What is the name of the generative AI created by Samsung Electronics?",
    "On what date did U.S. President Biden issue an executive order ensuring safe and trustworthy AI development and usage?",
    "Please briefly describe Cohere's data provenance explorer.",
]

# List of corresponding answers
outputs = [
    "The name of the generative AI created by Samsung Electronics is Samsung Gauss.",
    "On October 30, 2023, U.S. President Biden issued an executive order.",
    "Cohere's data provenance explorer is a platform that tracks the sources and licensing status of datasets used for training AI models, ensuring transparency. It collaborates with 12 organizations and provides source information for over 2,000 datasets, helping developers understand data composition and lineage.",
]

# Create question-answer pairs
qa_pairs = [{"question": q, "answer": a} for q, a in zip(inputs, outputs)]

# Convert to a DataFrame
df = pd.DataFrame(qa_pairs)

# Display the DataFrame
df.head()
                                            question                                             answer
0  What is the name of the generative AI created ...  The name of the generative AI created by Samsu...
1  On what date did U.S. President Biden issue an...  On October 30, 2023, U.S. President Biden issu...
2  Please briefly describe Cohere's data provenan...  Cohere's data provenance explorer is a platfor...
Alternatively, you can use the Synthetic Dataset generated in a previous tutorial.
The code below shows an example of using an uploaded HuggingFace Dataset.
%pip install -qU datasets
Note: you may need to restart the kernel to use updated packages.
After installing the package, you may need to restart the kernel for the changes to take effect. This is because newly installed packages might not be recognized immediately in the current session.
In Google Colab, you must run %pip install each time you start a new session, even if you installed the package before. Colab environments are temporary, so installed packages are lost when the session restarts.
from datasets import load_dataset
import os

# Set dataset name (change to your desired name)
huggingface_id = ""  # Your Hugging Face username (ID)
dataset_name = f"{huggingface_id}/rag-synthetic-dataset"

# Download the dataset from the Hugging Face Hub using the repo_id
dataset = load_dataset(
    dataset_name,
    token=os.environ["HUGGINGFACEHUB_API_TOKEN"],
)

# View the dataset by split
huggingface_df = dataset["test_v1"].to_pandas()
huggingface_df.head()
                                          user_input                                 reference_contexts                                          reference                      synthesizer_name
0                                     Wht is an API?  ["Agents\nThis combination of reasoning,\nlogi...  An API can be used by a model to make various ...  single_hop_specifc_query_synthesizer
1  What are the three essential components in an ...  ['Agents\nWhat is an agent?\nIn its most funda...  The three essential components in an agent's c...  single_hop_specifc_query_synthesizer
2  What Chain-of-Thought do in agent model, how i...  ['Agents\nFigure 1. General agent architecture...  Chain-of-Thought is a reasoning and logic fram...  single_hop_specifc_query_synthesizer
3                Waht is the DELETE method used for?  ['Agents\nThe tools\nFoundational models, desp...  The DELETE method is a common web API method t...  single_hop_specifc_query_synthesizer
4  How do foundational components contribute to t...  ['<1-hop>\n\nAgents\ncombining specialized age...  Foundational components contribute to the cogn...                      NewMultiHopQuery
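If you want to upload this synthetic dataset to LangSmith as well, you can map its columns onto the question/answer schema used in this tutorial. A sketch, assuming the RAGAS-style column names user_input and reference shown above:

import pandas as pd

# Map the synthetic dataset columns to the question/answer schema
# (assumption: user_input holds the question, reference holds the ground-truth answer)
synthetic_df = pd.DataFrame(
    {
        "question": huggingface_df["user_input"],
        "answer": huggingface_df["reference"],
    }
)
synthetic_df.head()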
Creating a Dataset for LangSmith Testing
Create a new dataset under Datasets & Testing.

You can also create a dataset directly from a CSV file in the LangSmith UI. For more details, refer to the LangSmith documentation.
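Programmatically, the SDK also offers Client.upload_csv for the same purpose. A minimal sketch, assuming a hypothetical local file rag_eval.csv with question and answer columns:

from langsmith import Client

client = Client()

# Create a dataset directly from a local CSV file
# (rag_eval.csv is a hypothetical file with "question" and "answer" columns)
csv_dataset = client.upload_csv(
    csv_file="rag_eval.csv",
    input_keys=["question"],
    output_keys=["answer"],
    name="RAG_EVAL_DATASET_FROM_CSV",
    description="RAG evaluation dataset uploaded from a CSV file",
)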
from langsmith import Client

client = Client()
dataset_name = "RAG_EVAL_DATASET"

# Function to create a dataset (reuses an existing dataset with the same name, if any)
def create_dataset(client, dataset_name, description=None):
    for dataset in client.list_datasets():
        if dataset.name == dataset_name:
            return dataset

    dataset = client.create_dataset(
        dataset_name=dataset_name,
        description=description,
    )
    return dataset

# Create the dataset
dataset = create_dataset(client, dataset_name)

# Add examples to the created dataset
client.create_examples(
    inputs=[{"question": q} for q in df["question"].tolist()],
    outputs=[{"answer": a} for a in df["answer"].tolist()],
    dataset_id=dataset.id,
)
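You can verify the upload by listing the examples back from the dataset, a quick check using the SDK's list_examples:

# Fetch the examples back to confirm the upload succeeded
examples = list(client.list_examples(dataset_id=dataset.id))
print(f"{len(examples)} examples in dataset '{dataset_name}'")

for example in examples:
    print(example.inputs["question"])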
You can add examples to the dataset later.
# New list of questions
new_questions = [
    "What is the name of the generative AI created by Samsung Electronics?",
    "Is it true that Google invested $2 billion in Teddynote?",
]

# New list of corresponding answers
new_answers = [
    "The name of the generative AI created by Samsung Electronics is Teddynote.",
    "This is not true. Google agreed to invest up to $2 billion in Anthropic, starting with $500 million and planning to invest an additional $1.5 billion in the future.",
]

# Add the new examples, then verify the updated dataset in the UI
client.create_examples(
    inputs=[{"question": q} for q in new_questions],
    outputs=[{"answer": a} for a in new_answers],
    dataset_id=dataset.id,
)
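If you later need to correct an existing example rather than add a new one, the SDK also exposes update_example. A sketch; the example id comes from list_examples, and the replacement answer here is a placeholder:

# Pick an example to correct (the first one, just for illustration)
target = next(client.list_examples(dataset_id=dataset.id))

# Overwrite its outputs in place
client.update_example(
    example_id=target.id,
    inputs={"question": target.inputs["question"]},
    outputs={"answer": "An updated ground-truth answer goes here."},
)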
Congratulations! The dataset is now ready.