HF-Upload
Author: Sun Hyoung Lee
Design:
Peer Review:
Proofread:
This is a part of the LangChain Open Tutorial.
Overview
The process involves loading a local CSV file, converting it to the HuggingFace Dataset format, and uploading it to the Hugging Face Hub as a private dataset. This makes the dataset easy to share and access through the HuggingFace infrastructure.
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note] langchain-opentutorial is a package that provides a set of easy-to-use environment setup, useful functions, and utilities for tutorials. You can check out langchain-opentutorial for more details.
API Key Configuration
To use HuggingFace Dataset, you need to obtain a HuggingFace write token.
Once you have your API key, set it as the value for the variable HUGGINGFACEHUB_API_TOKEN.
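Before running the upload cells, it can help to fail fast if the token is missing. A minimal sketch, assuming the token is stored in the environment variable used throughout this tutorial (the helper name `require_hf_token` is hypothetical, not part of any library):

```python
import os

# Hypothetical helper: fail fast if the Hugging Face write token is missing.
def require_hf_token(env_var="HUGGINGFACEHUB_API_TOKEN"):
    """Return the Hugging Face write token from the environment, or raise."""
    token = os.getenv(env_var)
    if not token:
        raise EnvironmentError(
            f"{env_var} is not set; create a write token in your Hugging Face account settings."
        )
    return token

# Usage:
# token = require_hf_token()
```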
%%capture --no-stderr
%pip install langchain-opentutorial
# Install required packages
from langchain_opentutorial import package
package.install(
    ["datasets"],
    verbose=False,
    upgrade=False,
)
You can set API keys in a .env file or set them manually.
[Note] If you’re not using the .env file, no worries! Just enter the keys directly in the cell below, and you’re good to go.
from dotenv import load_dotenv
from langchain_opentutorial import set_env
# Attempt to load environment variables from a .env file; if unsuccessful, set them manually.
if not load_dotenv(override=True):
    set_env(
        {
            "LANGCHAIN_API_KEY": "",
            "LANGCHAIN_TRACING_V2": "true",
            "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
            "LANGCHAIN_PROJECT": "",  # set the project name same as the title
            "HUGGINGFACEHUB_API_TOKEN": "",
        }
    )
Upload Generated Dataset
Import the pandas library and load the local CSV file for upload.
import pandas as pd
df = pd.read_csv("data/ragas_synthetic_dataset.csv")
df.head()
| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | Wht is an API? | ["Agents\nThis combination of reasoning,\nlogi... | An API can be used by a model to make various ... | single_hop_specifc_query_synthesizer |
| 1 | What are the three essential components in an ... | ['Agents\nWhat is an agent?\nIn its most funda... | The three essential components in an agent's c... | single_hop_specifc_query_synthesizer |
| 2 | What Chain-of-Thought do in agent model, how i... | ['Agents\nFigure 1. General agent architecture... | Chain-of-Thought is a reasoning and logic fram... | single_hop_specifc_query_synthesizer |
| 3 | Waht is the DELETE method used for? | ['Agents\nThe tools\nFoundational models, desp... | The DELETE method is a common web API method t... | single_hop_specifc_query_synthesizer |
| 4 | How do foundational components contribute to t... | ['<1-hop>\n\nAgents\ncombining specialized age... | Foundational components contribute to the cogn... | NewMultiHopQuery |
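Before converting, it can help to verify the column layout and check for missing values, since `push_to_hub` will happily upload an incomplete table. A minimal sketch on a toy DataFrame (the rows below are hypothetical stand-ins; the same checks apply to the real `df` loaded above):

```python
import pandas as pd

# Toy DataFrame mirroring the columns of ragas_synthetic_dataset.csv
# (hypothetical rows, used only to illustrate the checks).
sample_df = pd.DataFrame(
    {
        "user_input": ["Wht is an API?"],
        "reference_contexts": ['["Agents ..."]'],
        "reference": ["An API can be used by a model ..."],
        "synthesizer_name": ["single_hop_specifc_query_synthesizer"],
    }
)

# Verify the schema and check for missing values before converting and uploading.
expected_cols = ["user_input", "reference_contexts", "reference", "synthesizer_name"]
assert list(sample_df.columns) == expected_cols, "Unexpected column layout"
assert not sample_df.isnull().values.any(), "Missing values found; clean them before pushing"
```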
Upload to HuggingFace Dataset
Convert a Pandas DataFrame (df) to a Hugging Face Dataset and proceed with the upload.
from datasets import Dataset
# Convert pandas DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(df)
# Check the dataset
print(dataset)
Dataset({
    features: ['user_input', 'reference_contexts', 'reference', 'synthesizer_name'],
    num_rows: 10
})
from datasets import Dataset
import os
# Convert pandas DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(df)
# Set dataset name (change to your desired name)
hf_username = "icarus1026" # Your Hugging Face Username(ID)
dataset_name = f"{hf_username}/rag-synthetic-dataset"
# Upload dataset
dataset.push_to_hub(
    dataset_name,
    private=True,  # Set private=False for a public dataset
    split="test_v1",  # Enter the dataset split name
    token=os.getenv("HUGGINGFACEHUB_API_TOKEN"),
)
No files have been modified since last commit. Skipping to prevent empty commit.
[Note] The Dataset Viewer may take some time to display.
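Once the push completes, you can verify the upload by loading the split back from the Hub with `load_dataset`. A minimal sketch, assuming the repo id and split used above (the helper name is hypothetical; running it requires network access and a valid token):

```python
import os

# Hypothetical helper: load the pushed split back from the Hub to verify the upload.
def load_uploaded_dataset(hf_username, token=None):
    from datasets import load_dataset  # imported lazily; requires the `datasets` package

    repo_id = f"{hf_username}/rag-synthetic-dataset"
    # `split` must match the name passed to push_to_hub above.
    return load_dataset(repo_id, split="test_v1", token=token)

# Usage (uncomment to run against the Hub):
# ds = load_uploaded_dataset("icarus1026", os.getenv("HUGGINGFACEHUB_API_TOKEN"))
# print(ds.num_rows)
```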