Huggingface Endpoints
Author: Sooyoung
Peer Review : effort-type, Ivy Bae
Proofread : frimer
This is a part of LangChain Open Tutorial
Overview
This tutorial covers the endpoints provided by Hugging Face. There are two types of endpoints available: Serverless and Dedicated. It is a basic tutorial that begins with obtaining a Hugging Face token in order to use these endpoints.
You can learn the following topics:
How to obtain a Hugging Face token
How to use Serverless Endpoints
How to use Dedicated Endpoints
Table of Contents
References
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorialis a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials.You can checkout the
langchain-opentutorialfor more details.
About Huggingface Endpoints
What is Hugging Face Hub?
It is a platform that hosts over 120,000 models, more than 20,000 datasets, and 50,000 demo apps (Spaces). All resources are open-source and publicly accessible, allowing anyone to view and collaborate.
What are Endpoints?
Endpoints act like doors that let your application easily connect to models. They help you add various machine learning features to your app effortlessly.
Text Generation Inference
It uses a special server built specifically for fast text generation. This server is built using Rust, Python, and gRPC, offering excellent performance.
Obtaining a Huggingface Token
After signing up on Hugging Face, you can obtain a token from the following URL.
Reference Model List
Below is a link to the Hugging Face LLM leaderboard, Model List, LogicKor. For more information, you can check the link.
LogicKor Leaderboard LogicKor Leaderboard's link is for the leaderboard of Korean models. As the model performance increased, it has been archived due to meaningless scores as of October 17, 2024. However, you can find the best-performing Korean models.
Using Hugging Face Endpoints
To use Hugging Face Endpoints, install the huggingface_hub package in Python.
We previously installed huggingface_hub through langchain-opentutorial. However, if you need to install it separately, you can do so by running the pip install huggingface_hub command.
To use the Hugging Face endpoint, you need an API token key. If you don't have a huggingface token follwing this here.
If you have already set the token in HUGGINGFACEHUB_API_TOKEN, the API token is automatically recognized.
OR
You can use from huggingface_hub import login.
Serverless Endpoints
The Inference API is free to use but comes with usage limitations. For production-level inference solutions, consider using the Inference Endpoints service.
Inference Endpoints enable you to deploy any machine learning model seamlessly on dedicated, fully managed infrastructure. You can tailor the deployment to align with your model, latency, throughput, and compliance requirements by selecting the cloud provider, region, compute instance, auto-scaling range, and security level.
Below is an example of how to access the Inference API.
First of all, create a simple prompt using PromptTemplate
[Note]
In this example, the model used is
microsoft/Phi-3-mini-4k-instructIf you want change aother model, assign the HuggingFace model's repository ID to the variable
repo_id.link : https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
The response is below :
Dedicated Endpoints
Using free serverless APIs allows you to quickly implement and iterate your solutions. However, because the load is shared with other requests, there can be rate limits for high-volume use cases.
For enterprise workloads, it is recommended to use Inference Endpoints - Dedicated. This gives you access to a fully managed infrastructure that offers greater flexibility and speed.
These resources also include ongoing support, guaranteed uptime, and options like AutoScaling.
Set the Inference Endpoint URL to the
hf_endpoint_urlvariable.
[Note]
This address is not a Dedicated Endpoint but rather a public endpoint provided by Hugging Face. Because Dedicated Endpoints are a paid service, a public endpoint was used for this example.
For more details, please refer to this link.



The following example shows the code implemented using a chain.
Last updated