HF-Upload

Overview

This tutorial loads a local CSV file, converts it to the Hugging Face Dataset format, and uploads it to the Hugging Face Hub as a private dataset. Hosting the dataset on the Hub makes it easy to share and access through the Hugging Face infrastructure.

Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, along with useful functions and utilities for the tutorials.

  • You can check out langchain-opentutorial for more details.

API Key Configuration

To upload a dataset to the Hugging Face Hub, you need a Hugging Face write token.

Once you have the token, set it as the value of the variable HUGGINGFACEHUB_API_TOKEN .
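If you prefer to set the token programmatically rather than through a .env file, a minimal sketch looks like this (the token value below is a placeholder, not a real token):

```python
import os

# Placeholder token for illustration -- replace with your real write token.
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "hf_xxxxxxxxxxxxxxxx"

# Downstream cells read the token back from the environment.
token = os.getenv("HUGGINGFACEHUB_API_TOKEN")
print(token.startswith("hf_"))
```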

%%capture --no-stderr
%pip install langchain-opentutorial
# Install required packages
from langchain_opentutorial import package

package.install(
    ["datasets"],
    verbose=False,
    upgrade=False,
)

You can set API keys in a .env file or set them manually.

[Note] If you’re not using the .env file, no worries! Just enter the keys directly in the cell below, and you’re good to go.

from dotenv import load_dotenv
from langchain_opentutorial import set_env

# Attempt to load environment variables from a .env file; if unsuccessful, set them manually.
if not load_dotenv(override=True):
    set_env(
        {
            "LANGCHAIN_API_KEY": "",
            "LANGCHAIN_TRACING_V2": "true",
            "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
            "LANGCHAIN_PROJECT": "", # set the project name same as the title
            "HUGGINGFACEHUB_API_TOKEN": "",
        }
    )

Upload Generated Dataset

Import the pandas library and load the generated CSV file.

import pandas as pd

df = pd.read_csv("data/ragas_synthetic_dataset.csv")
df.head()
|   | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | Wht is an API? | ["Agents\nThis combination of reasoning,\nlogi... | An API can be used by a model to make various ... | single_hop_specifc_query_synthesizer |
| 1 | What are the three essential components in an ... | ['Agents\nWhat is an agent?\nIn its most funda... | The three essential components in an agent's c... | single_hop_specifc_query_synthesizer |
| 2 | What Chain-of-Thought do in agent model, how i... | ['Agents\nFigure 1. General agent architecture... | Chain-of-Thought is a reasoning and logic fram... | single_hop_specifc_query_synthesizer |
| 3 | Waht is the DELETE method used for? | ['Agents\nThe tools\nFoundational models, desp... | The DELETE method is a common web API method t... | single_hop_specifc_query_synthesizer |
| 4 | How do foundational components contribute to t... | ['<1-hop>\n\nAgents\ncombining specialized age... | Foundational components contribute to the cogn... | NewMultiHopQuery |
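If you do not have the CSV file on disk, you can build a DataFrame with the same four columns by hand to follow along. The row below is an illustrative stand-in, not the actual synthetic data:

```python
import pandas as pd

# Illustrative row mirroring the schema of ragas_synthetic_dataset.csv
df = pd.DataFrame(
    {
        "user_input": ["What is an API?"],
        "reference_contexts": [["Agents ..."]],
        "reference": ["An API can be used by a model to ..."],
        "synthesizer_name": ["single_hop_specifc_query_synthesizer"],
    }
)

print(df.shape)
print(list(df.columns))
```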

Upload to HuggingFace Dataset

Convert the pandas DataFrame (df) to a Hugging Face Dataset and upload it.

from datasets import Dataset

# Convert pandas DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(df)

# Check the dataset
print(dataset)
Dataset({
    features: ['user_input', 'reference_contexts', 'reference', 'synthesizer_name'],
    num_rows: 10
})
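Dataset.from_pandas preserves the DataFrame's columns and rows; indexing the resulting Dataset by integer yields one dict per row. A pandas-only sketch of that row-oriented view (with made-up toy data) looks like this:

```python
import pandas as pd

df = pd.DataFrame({"user_input": ["Q1", "Q2"], "reference": ["A1", "A2"]})

# A Hugging Face Dataset exposes each row as a dict of column -> value,
# which pandas can emulate with a records-oriented conversion:
records = df.to_dict(orient="records")
print(records[0])   # {'user_input': 'Q1', 'reference': 'A1'}
print(len(records)) # 2, matching num_rows in the Dataset summary
```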
from datasets import Dataset
import os

# Convert pandas DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(df)

# Set dataset name (change to your desired name)
hf_username = "icarus1026"  # Your Hugging Face Username(ID)
dataset_name = f"{hf_username}/rag-synthetic-dataset"

# Upload dataset
dataset.push_to_hub(
    dataset_name,
    private=True,  # Set private=False for a public dataset
    split="test_v1",  # Enter dataset split name
    token=os.getenv("HUGGINGFACEHUB_API_TOKEN"),
)
Pushing dataset shards to the dataset hub:   0%|          | 0/1
No files have been modified since last commit. Skipping to prevent empty commit.
[Note] The Dataset Viewer may take some time to display.
