HF-Upload

Overview

This tutorial loads a local CSV file, converts it to the Hugging Face Dataset format, and uploads it to the Hugging Face Hub as a private dataset. Hosting the dataset on the Hub makes it easy to share and access through the Hugging Face infrastructure.

Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides easy-to-use environment setup, along with useful functions and utilities for the tutorials.

  • You can check out langchain-opentutorial for more details.

API Key Configuration

To upload a dataset to the Hugging Face Hub, you need a Hugging Face write token.

Once you have the token, set it as the value of the variable HUGGINGFACEHUB_API_TOKEN .
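If you prefer to set the token programmatically rather than through a .env file, a minimal sketch looks like this (the token value below is a placeholder, not a real token):

```python
import os

# Placeholder token for illustration -- replace with your real write token.
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "hf_xxxxxxxxxxxxxxxx"

# Downstream cells read the token back from the environment.
token = os.getenv("HUGGINGFACEHUB_API_TOKEN")
print(token.startswith("hf_"))
```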

%%capture --no-stderr
%pip install langchain-opentutorial
# Install required packages
from langchain_opentutorial import package

package.install(
    ["datasets"],
    verbose=False,
    upgrade=False,
)

You can set API keys in a .env file or set them manually.

[Note] If you’re not using the .env file, no worries! Just enter the keys directly in the cell below, and you’re good to go.

from dotenv import load_dotenv
from langchain_opentutorial import set_env

# Attempt to load environment variables from a .env file; if unsuccessful, set them manually.
if not load_dotenv(override=True):
    set_env(
        {
            "LANGCHAIN_API_KEY": "",
            "LANGCHAIN_TRACING_V2": "true",
            "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
            "LANGCHAIN_PROJECT": "", # set the project name same as the title
            "HUGGINGFACEHUB_API_TOKEN": "",
        }
    )

Upload Generated Dataset

Import the pandas library and load the generated CSV file.

import pandas as pd

df = pd.read_csv("data/ragas_synthetic_dataset.csv")
df.head()
|   | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | Wht is an API? | ["Agents\nThis combination of reasoning,\nlogi... | An API can be used by a model to make various ... | single_hop_specifc_query_synthesizer |
| 1 | What are the three essential components in an ... | ['Agents\nWhat is an agent?\nIn its most funda... | The three essential components in an agent's c... | single_hop_specifc_query_synthesizer |
| 2 | What Chain-of-Thought do in agent model, how i... | ['Agents\nFigure 1. General agent architecture... | Chain-of-Thought is a reasoning and logic fram... | single_hop_specifc_query_synthesizer |
| 3 | Waht is the DELETE method used for? | ['Agents\nThe tools\nFoundational models, desp... | The DELETE method is a common web API method t... | single_hop_specifc_query_synthesizer |
| 4 | How do foundational components contribute to t... | ['<1-hop>\n\nAgents\ncombining specialized age... | Foundational components contribute to the cogn... | NewMultiHopQuery |
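If you do not have the CSV file on disk, you can build a DataFrame with the same four columns by hand to follow along. The row below is an illustrative stand-in, not the actual synthetic data:

```python
import pandas as pd

# Illustrative row mirroring the schema of ragas_synthetic_dataset.csv
df = pd.DataFrame(
    {
        "user_input": ["What is an API?"],
        "reference_contexts": [["Agents ..."]],
        "reference": ["An API can be used by a model to ..."],
        "synthesizer_name": ["single_hop_specifc_query_synthesizer"],
    }
)

print(df.shape)
print(list(df.columns))
```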

Upload to HuggingFace Dataset

Convert the pandas DataFrame (df) to a Hugging Face Dataset and upload it.

from datasets import Dataset

# Convert pandas DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(df)

# Check the dataset
print(dataset)
Dataset({
    features: ['user_input', 'reference_contexts', 'reference', 'synthesizer_name'],
    num_rows: 10
})
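Dataset.from_pandas preserves the DataFrame's columns and rows; indexing the resulting Dataset by integer yields one dict per row. A pandas-only sketch of that row-oriented view (with made-up toy data) looks like this:

```python
import pandas as pd

df = pd.DataFrame({"user_input": ["Q1", "Q2"], "reference": ["A1", "A2"]})

# A Hugging Face Dataset exposes each row as a dict of column -> value,
# which pandas can emulate with a records-oriented conversion:
records = df.to_dict(orient="records")
print(records[0])   # {'user_input': 'Q1', 'reference': 'A1'}
print(len(records)) # 2, matching num_rows in the Dataset summary
```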
from datasets import Dataset
import os

# Convert pandas DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(df)

# Set dataset name (change to your desired name)
hf_username = "icarus1026"  # Your Hugging Face Username(ID)
dataset_name = f"{hf_username}/rag-synthetic-dataset"

# Upload dataset
dataset.push_to_hub(
    dataset_name,
    private=True,  # Set private=False for a public dataset
    split="test_v1",  # Enter dataset split name
    token=os.getenv("HUGGINGFACEHUB_API_TOKEN"),
)
Pushing dataset shards to the dataset hub:   0%|          | 0/1
No files have been modified since last commit. Skipping to prevent empty commit.
[Note] The Dataset Viewer may take some time to display.
