Hugging Face
Deploying models to Hugging Face Inference Endpoints.
Hugging Face Inference Endpoints provides a secure production solution to easily deploy `transformers`, `sentence-transformers`, and `diffusers` models on dedicated, autoscaling infrastructure managed by Hugging Face, so you can serve models without dealing with containers and GPUs. An Inference Endpoint is built from a model on the Hub.
When to use it?
You should use the Hugging Face Model Deployer if:

- you want to deploy Transformers, Sentence-Transformers, or Diffusion models on dedicated and secure infrastructure.
- you prefer a fully managed production solution for inference without the need to handle containers and GPUs.
- your goal is to turn your models into production-ready APIs with minimal infrastructure or MLOps involvement.
- cost-effectiveness is crucial, and you want to pay only for the raw compute resources you use.
- enterprise security is a priority, and you need to deploy models into secure offline endpoints accessible only via a direct connection to your Virtual Private Cloud (VPC).
If you are looking for an easier way to deploy your models locally, you can use the MLflow Model Deployer flavor.
How to deploy it?
The Hugging Face Model Deployer flavor is provided by the Hugging Face ZenML integration, so you need to install it on your local machine to be able to deploy your models. You can do this by running the following command:
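```shell
# Install the Hugging Face ZenML integration, which provides the model deployer flavor
zenml integration install huggingface -y
```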
To register the Hugging Face model deployer with ZenML, you need to run the following command:
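```shell
zenml model-deployer register <MODEL_DEPLOYER_NAME> --flavor=huggingface \
    --token=<YOUR_HF_TOKEN> --namespace=<YOUR_HF_NAMESPACE>
```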
Here,

- the `token` parameter is the Hugging Face authentication token. It can be managed through your Hugging Face account settings.
- the `namespace` parameter is used for listing and creating the inference endpoints. It can be a username, an organization name, or `*`, depending on where the inference endpoint should be created.
We can now use the model deployer in our stack.
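For example, you can register and activate a stack that includes the model deployer; the other component names below are placeholders for components you have already registered:

```shell
zenml stack register <STACK_NAME> -a <ARTIFACT_STORE> -o <ORCHESTRATOR> \
    -d <MODEL_DEPLOYER_NAME> --set
```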
How to use it
There are two mechanisms for using the Hugging Face model deployer integration:
- Using the pre-built `huggingface_model_deployer_step` to deploy a Hugging Face model.
- Running batch inference on a deployed Hugging Face model using the `HuggingFaceDeploymentService`.
If you'd like to see this in action, check out this example of a deployment pipeline and an inference pipeline.
Deploying a model
The pre-built `huggingface_model_deployer_step` exposes a `HuggingFaceServiceConfig` that you can use in your pipeline. Here is an example snippet:
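The following is a minimal sketch of such a deployment pipeline; the repository, task, and instance settings are illustrative placeholders that you should adapt to your own model:

```python
from zenml import pipeline
from zenml.integrations.huggingface.services import HuggingFaceServiceConfig
from zenml.integrations.huggingface.steps import huggingface_model_deployer_step


@pipeline(enable_cache=False)
def huggingface_deployment_pipeline(
    model_name: str = "zenml-hf-model", timeout: int = 1200
):
    # Describe the Inference Endpoint to create; see the attribute list
    # below for all available options. Values here are placeholders.
    service_config = HuggingFaceServiceConfig(
        model_name=model_name,
        repository="<YOUR_HF_NAMESPACE>/<MODEL_ID>",
        framework="pytorch",
        accelerator="cpu",
        instance_size="large",
        instance_type="c6i",
        region="us-east-1",
        vendor="aws",
        task="text-classification",
        namespace="<YOUR_HF_NAMESPACE>",
        token="<YOUR_HF_TOKEN>",
        endpoint_type="public",
    )
    # Provision the endpoint and wait until it is ready to serve requests.
    huggingface_model_deployer_step(
        service_config=service_config,
        timeout=timeout,
    )
```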
Within the `HuggingFaceServiceConfig` you can configure:

- `model_name`: the name of the model in ZenML.
- `endpoint_name`: the name of the inference endpoint. We add a `zenml-` prefix and the first 8 characters of the service UUID as a suffix to the endpoint name.
- `repository`: the repository name in the user's namespace (`{username}/{model_id}`) or in the organization namespace (`{organization}/{model_id}`) from the Hugging Face Hub.
- `framework`: the machine learning framework used for the model (e.g. `"custom"`, `"pytorch"`).
- `accelerator`: the hardware accelerator to be used for inference (e.g. `"cpu"`, `"gpu"`).
- `instance_size`: the size of the instance to be used for hosting the model (e.g. `"large"`, `"xxlarge"`).
- `instance_type`: Inference Endpoints offers a selection of curated CPU and GPU instances (e.g. `"c6i"`, `"g5.12xlarge"`).
- `region`: the cloud region in which the Inference Endpoint will be created (e.g. `"us-east-1"` or `"eu-west-1"` for `vendor = aws`, and `"eastus"` for the Microsoft Azure vendor).
- `vendor`: the cloud provider or vendor where the Inference Endpoint will be hosted (e.g. `"aws"`).
- `token`: the Hugging Face authentication token. It can be managed through your Hugging Face account settings. The same token can be passed while registering the Hugging Face model deployer.
- `account_id`: (Optional) the account ID used to link a VPC to a private Inference Endpoint (if applicable).
- `min_replica`: (Optional) the minimum number of replicas (instances) to keep running for the Inference Endpoint. Defaults to `0`.
- `max_replica`: (Optional) the maximum number of replicas (instances) to scale to for the Inference Endpoint. Defaults to `1`.
- `revision`: (Optional) the specific model revision to deploy on the Inference Endpoint for the Hugging Face repository.
- `task`: select a supported machine learning task (e.g. `"text-classification"`, `"text-generation"`).
- `custom_image`: (Optional) a custom Docker image to use for the Inference Endpoint.
- `namespace`: the namespace where the Inference Endpoint will be created. The same namespace can be passed while registering the Hugging Face model deployer.
- `endpoint_type`: (Optional) the type of the Inference Endpoint, which can be `"protected"`, `"public"` (default), or `"private"`.
For more information and a full list of configurable attributes of the Hugging Face Model Deployer, check out the SDK Docs and Hugging Face endpoint code.
Running inference on a provisioned inference endpoint
The following code example shows how to run inference against a provisioned inference endpoint:
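The sketch below assumes a deployment pipeline like the one above has already run; the step name, pipeline names, and sample input are placeholders:

```python
from typing import Annotated

from zenml import pipeline, step
from zenml.integrations.huggingface.model_deployers import HuggingFaceModelDeployer
from zenml.integrations.huggingface.services import HuggingFaceDeploymentService


@step(enable_cache=False)
def prediction_service_loader(
    pipeline_name: str,
    pipeline_step_name: str,
    model_name: str = "default",
) -> HuggingFaceDeploymentService:
    """Get the prediction service started by the deployment pipeline."""
    model_deployer = HuggingFaceModelDeployer.get_active_model_deployer()

    # Fetch existing services with the same pipeline name, step name and model name
    existing_services = model_deployer.find_model_server(
        pipeline_name=pipeline_name,
        pipeline_step_name=pipeline_step_name,
        model_name=model_name,
    )
    if not existing_services:
        raise RuntimeError(
            f"No Hugging Face inference endpoint deployed by step "
            f"'{pipeline_step_name}' in pipeline '{pipeline_name}' with model "
            f"'{model_name}' is currently running."
        )
    return existing_services[0]


@step
def predictor(
    service: HuggingFaceDeploymentService,
    data: str,
) -> Annotated[str, "predictions"]:
    """Run an inference request against the prediction service."""
    return service.predict(data)


@pipeline
def huggingface_inference_pipeline(
    pipeline_name: str,
    pipeline_step_name: str = "huggingface_model_deployer_step",
):
    # Placeholder input; replace with real data for your model's task.
    inference_data = "A sample input for the deployed model."
    service = prediction_service_loader(
        pipeline_name=pipeline_name,
        pipeline_step_name=pipeline_step_name,
    )
    predictor(service, inference_data)
```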
For more information and a full list of configurable attributes of the Hugging Face Model Deployer, check out the SDK Docs.