Hugging Face Inference Endpoints

May 30, 2024

This guide will walk through how to use an ML model from Hugging Face without installing libraries with complex dependencies (e.g. PyTorch, Keras, TensorFlow), buying expensive GPUs or installing CUDA drivers, or downloading massive serialized models.

Hugging Face is a popular website that allows you to share and access pre-trained models, and it has many useful features for evaluating and integrating those models into your own projects. Hugging Face offers serverless inference endpoints. Inference endpoints are becoming an industry-standard way of providing a model's prediction and classification capabilities to others without passing your environment's dependencies along to them as well.

Working with APIs in Python is a useful skill for many data and software professionals. A basic understanding of HTTP methods, the libraries that work with them, and the associated data formats covers a surprising amount of how the web works.

Let's go!

Find A Model

Searching huggingface.co, I found a model that I am interested in using:

The "Deploy" feature is also really cool as it provides an "Inference API (serverless)" option. Many ML models require frameworks and unique environments to run so using API's for portability is now a commonplace solution. Hugging Face plays nicely with this strategy and even takes it further by making push-button deployments. It's not available for all models but let's try it!

Setup Your Project

Configure Git Repo

I've set up my own GitHub repo for the code in this guide. I like starting projects in GitHub when I can because I like the files GitHub generates (.gitignore, README.md, etc.).

Then I clone it locally:

chris@Mac-mini Code % git clone git@github.com:Shumakriss/py-apis.git

Now, the project is available for your favorite editor or IDE like VS Code.

Configure Python Environment

I am going to do my best to stick to Python built-ins in this project but I'm going to create a virtual environment just in case I need it. 

I used pyenv to install Python 3.12.3 previously. For this project, I'll create a virtualenv from that version and activate it:

chris@Mac-mini Code % cd py-apis/

chris@Mac-mini py-apis % /Users/chris/.pyenv/versions/3.12.3/bin/python -m venv py-apis-venv

chris@Mac-mini py-apis % source ./py-apis-venv/bin/activate

(py-apis-venv) chris@Mac-mini py-apis % ls

README.md py-apis-venv

I also need to add my custom virtualenv folder name to .gitignore:

(py-apis-venv) chris@Mac-mini py-apis % echo 'py-apis-venv' >> .gitignore

Then update git:

git add .gitignore

git commit -m "Adding venv to .gitignore"

git push

Write A Test Script

I'm going to create a file for our Python code.

(py-apis-venv) chris@Mac-mini py-apis % touch test.py

I'm going to copy some boilerplate code from the Deploy option in Hugging Face that we talked about earlier.

Hugging Face Sample Code

On the site, find your model, then click Deploy, then "Inference API (serverless)". This presents us with some sample code. 

It's not complete because we need a real authorization header. For now, just copy it to your editor:
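In case the dialog has changed by the time you read this, the snippet it generated for me looked roughly like the following (the hf_xxx value is just a placeholder, not a real token):

import requests

API_URL = "https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8B"
headers = {"Authorization": "Bearer hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "Can you please let us know more details about your ",
})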

Installing Dependencies And Authorizing Requests

The code is a pretty straightforward use of the requests library. While very common, the library is not a Python built-in. Trying to run this code will give you an error:

(py-apis-venv) chris@Mac-mini py-apis % python test.py 

Traceback (most recent call last):

  File "/Users/chris/Code/py-apis/test.py", line 1, in <module>

    import requests

ModuleNotFoundError: No module named 'requests'

If you're using a virtualenv like me, you can just use pip with that environment activated:

(py-apis-venv) chris@Mac-mini py-apis % pip install requests

Collecting requests

  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)

Collecting charset-normalizer<4,>=2 (from requests)

  Using cached charset_normalizer-3.3.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (33 kB)

Collecting idna<4,>=2.5 (from requests)

  Using cached idna-3.7-py3-none-any.whl.metadata (9.9 kB)

Collecting urllib3<3,>=1.21.1 (from requests)

  Using cached urllib3-2.2.1-py3-none-any.whl.metadata (6.4 kB)

Collecting certifi>=2017.4.17 (from requests)

  Using cached certifi-2024.2.2-py3-none-any.whl.metadata (2.2 kB)

Downloading requests-2.32.3-py3-none-any.whl (64 kB)

   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64.9/64.9 kB 3.2 MB/s eta 0:00:00

Using cached certifi-2024.2.2-py3-none-any.whl (163 kB)

Using cached charset_normalizer-3.3.2-cp312-cp312-macosx_11_0_arm64.whl (119 kB)

Using cached idna-3.7-py3-none-any.whl (66 kB)

Using cached urllib3-2.2.1-py3-none-any.whl (121 kB)

Installing collected packages: urllib3, idna, charset-normalizer, certifi, requests

Successfully installed certifi-2024.2.2 charset-normalizer-3.3.2 idna-3.7 requests-2.32.3 urllib3-2.2.1

The code doesn't provide any output so let's fix that:

import requests

API_URL = "https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8B"
headers = {"Authorization": "Bearer hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "Can you please let us know more details about your ",
})

print(output)

Running again shows us what we already knew about the authorization token:

(py-apis-venv) chris@Mac-mini py-apis % python test.py

{'error': 'Authorization header is correct, but the token seems invalid'}

Setting Up Tokens

We'll need to set up an access token in our profile, much like GitHub, and give it some permissions.

Keeping Our Token Safe

In order to make sure that we don't check this value in by accident, we can create a file called .hf_token in our project that contains the token value and tell git to ignore it:

touch .hf_token

echo '.hf_token' >> .gitignore

Then we can load it in Python dynamically:

import requests

API_URL = "https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8B"

# Read the token from the git-ignored file; strip() removes any trailing
# newline so the Authorization header stays valid.
with open('.hf_token', 'r') as file:
    token = file.read().strip()

headers = {"Authorization": f"Bearer {token}"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "Can you please let us know more details about your ",
})

print(output)

Be sure to confirm now that your .hf_token file will not be added to git:

(py-apis-venv) chris@Mac-mini py-apis % git status

On branch main

Your branch is up to date with 'origin/main'.

Changes not staged for commit:

  (use "git add <file>..." to update what will be committed)

  (use "git restore <file>..." to discard changes in working directory)

modified:   .gitignore

Untracked files:

  (use "git add <file>..." to include in what will be committed)

test.py

no changes added to commit (use "git add" and/or "git commit -a")

Since it doesn't show up in git status, you should be in the clear. If you see .hf_token in your status, don't commit and definitely don't push until you fix the issue.
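If you want an extra sanity check on the token itself before going back to the inference call, a quick request to the Hub's "whoami" endpoint should return your account details for a valid token. This is my own addition, and the URL is an assumption about the Hub API rather than something from the sample code:

import requests

# Assumed Hub endpoint: returns account info when the token is valid,
# and an error status when it isn't.
with open('.hf_token', 'r') as file:
    token = file.read().strip()

response = requests.get(
    "https://huggingface.co/api/whoami-v2",
    headers={"Authorization": f"Bearer {token}"},
)
print(response.status_code)  # expect 200 for a valid token
print(response.json())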

Back To Testing The API

Now our output confirms that we have authenticated but that our model doesn't support serverless inference:

(py-apis-venv) chris@Mac-mini py-apis % python test.py                

{'error': 'The model meta-llama/Meta-Llama-3-8B is too large to be loaded automatically (16GB > 10GB). Please use Spaces (https://huggingface.co/spaces) or Inference Endpoints (https://huggingface.co/inference-endpoints).'}

Digging around on Hugging Face suggests that I might need to use a paid hosting solution and I would rather not (or more importantly, I don't want to ask you to do that!).

So let's switch models!

Finding New Models

This time, I'm going to filter on text classification models so we don't need to do any image processing. I'll also sort them by "Most downloads":

Switching To BERT

Let's use the BERT model at the top of the list. The sample code is pretty similar to the Llama model's. Once we adjust it to handle our token safely and print the results, here's what it looks like:

import requests

API_URL = "https://api-inference.huggingface.co/models/distilbert/distilbert-base-uncased-finetuned-sst-2-english"

# Load the token the same way as before, stripping any trailing newline.
with open('.hf_token', 'r') as file:
    token = file.read().strip()

headers = {"Authorization": f"Bearer {token}"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "I like you. I love you",
})

print(output)

The main differences are the text in the payload and the API_URL. Running this code confirms that the model is usable via serverless inference and provides cool results:

(py-apis-venv) chris@Mac-mini py-apis % python test.py

[[{'label': 'POSITIVE', 'score': 0.9998738765716553}, {'label': 'NEGATIVE', 'score': 0.0001261125726159662}]]

Switching the text to "I don't like you. I hate you." gives us appropriately opposite results:

(py-apis-venv) chris@Mac-mini py-apis % python test.py

[[{'label': 'NEGATIVE', 'score': 0.999196469783783}, {'label': 'POSITIVE', 'score': 0.0008035015198402107}]]
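As a side note, the response is a list with one entry per input, and each entry is a list of label/score dicts. If you just want the top label, a small snippet like this (my own addition, appended to the end of test.py) pulls it out:

# 'output' is the parsed JSON from query(): a list containing one list of
# label/score dicts per input string.
scores = output[0]
top = max(scores, key=lambda item: item["score"])
print(f"Top label: {top['label']} ({top['score']:.4f})")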

API Fun Facts

HTTP Methods

In this example, we used the Python requests library to integrate with an API. APIs use HTTP under the hood to communicate. Within the HTTP specification, there are "verbs" that place loose constraints on behavior. Common examples are "GET" and "POST", which broadly correspond to read-only requests and requests that send (and possibly retrieve) data, respectively.

With the requests library, you choose the verb like this:

response = requests.post(API_URL, headers=headers, json=payload)

Headers are sometimes optional. GET requests generally don't carry a payload, while POST requests, like our inference calls above, usually do.
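To round that out with a GET example, the sketch below queries what I believe is the Hugging Face Hub's public model metadata endpoint; the URL pattern and response fields are my assumptions, not something from the inference sample code:

import requests

# Assumed Hub endpoint: fetch public metadata about a model with a simple GET.
# No payload and, for public models, no Authorization header should be needed.
model_id = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
response = requests.get(f"https://huggingface.co/api/models/{model_id}")
info = response.json()
print(info.get("pipeline_tag"))  # e.g. "text-classification"
print(info.get("downloads"))     # rough download count, if present

Notice there's no json= argument at all: the request is described entirely by its URL, which is typical of GET.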

Did you find this useful?