vLLM

Deploying your LLM locally with vLLM.

vLLM is a fast and easy-to-use library for LLM inference and serving.

When to use it?

You should use the vLLM Model Deployer when you want to:

  • Deploy large language models with state-of-the-art serving throughput behind an OpenAI-compatible API server

  • Benefit from continuous batching of incoming requests

  • Use quantization: GPTQ, AWQ, INT4, INT8, and FP8

  • Take advantage of features such as PagedAttention, speculative decoding, and chunked prefill

How do you deploy it?

The vLLM Model Deployer flavor is provided by the vLLM ZenML integration, so you need to install it on your local machine before you can deploy your models. You can do this by running the following command:

zenml integration install vllm -y

To register the vLLM model deployer with ZenML, run the following command:

zenml model-deployer register vllm_deployer --flavor=vllm

The ZenML integration will provision a local vLLM deployment server as a daemon process that keeps running in the background to serve the latest deployed model.

How do you use it?

If you'd like to see this in action, check out this example of a deployment pipeline.

Deploy an LLM

The vllm_model_deployer_step exposes a VLLMDeploymentService that you can use in your pipeline. Here is an example snippet:

from zenml import pipeline
from typing import Annotated
from steps.vllm_deployer import vllm_model_deployer_step
from zenml.integrations.vllm.services.vllm_deployment import VLLMDeploymentService


@pipeline()
def deploy_vllm_pipeline(
    model: str,
    timeout: int = 1200,
) -> Annotated[VLLMDeploymentService, "GPT2"]:
    # Deploy the given Hugging Face model with vLLM and return the running
    # deployment service as a pipeline output artifact named "GPT2".
    service = vllm_model_deployer_step(
        model=model,
        timeout=timeout,
    )
    return service

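Running this pipeline is just a matter of calling it like a regular Python function. A minimal sketch, assuming the "gpt2" Hugging Face model used in the GPT-2 example referenced below:

if __name__ == "__main__":
    # "gpt2" is the Hugging Face model id used in the linked GPT-2 example;
    # swap in any model supported by vLLM.
    deploy_vllm_pipeline(model="gpt2", timeout=1200)
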
Configuration

Within the VLLMDeploymentService you can configure the following options (a configuration sketch follows the list):

  • model: Name or path of the Hugging Face model to use.

  • tokenizer: Name or path of the Hugging Face tokenizer to use. If unspecified, the model name or path will be used.

  • served_model_name: The model name(s) used in the API. If not specified, the model name will be the same as the model argument.

  • trust_remote_code: Trust remote code from Hugging Face.

  • tokenizer_mode: The tokenizer mode. Allowed choices: ['auto', 'slow', 'mistral'].

  • dtype: Data type for model weights and activations. Allowed choices: ['auto', 'half', 'float16', 'bfloat16', 'float', 'float32'].

  • revision: The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, the default version will be used.

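The snippet below sketches how these options could be plumbed through a deployment pipeline. Treating them as keyword arguments of vllm_model_deployer_step is an assumption made for illustration; check the step's actual signature in your installed ZenML vLLM integration before relying on it.

from typing import Annotated

from zenml import pipeline
from steps.vllm_deployer import vllm_model_deployer_step
from zenml.integrations.vllm.services.vllm_deployment import VLLMDeploymentService


@pipeline()
def deploy_vllm_configured_pipeline(
    model: str = "gpt2",
    timeout: int = 1200,
) -> Annotated[VLLMDeploymentService, "configured_llm"]:
    # Hypothetical sketch: forwarding VLLMDeploymentService options through the
    # deployer step. Verify that your version of the step accepts these kwargs.
    service = vllm_model_deployer_step(
        model=model,
        tokenizer=model,               # defaults to the model name if omitted
        served_model_name="my-gpt2",   # the name clients use against the API
        tokenizer_mode="auto",
        dtype="auto",
        trust_remote_code=False,
        timeout=timeout,
    )
    return service
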
Here is an example of running a GPT-2 model using vLLM.

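Once the server is up, you can sanity-check the deployment by querying the OpenAI-compatible API it exposes. The URL below is an assumption (vLLM defaults to port 8000); use the URL actually reported for your deployed service.

import requests

# Assumed local endpoint; replace with the URL your deployment reports.
VLLM_URL = "http://localhost:8000"

response = requests.post(
    f"{VLLM_URL}/v1/completions",
    json={
        "model": "gpt2",          # or your served_model_name, if you set one
        "prompt": "ZenML is",
        "max_tokens": 32,
    },
    timeout=30,
)
print(response.json()["choices"][0]["text"])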