vLLM
Deploying your LLM locally with vLLM.
vLLM is a fast and easy-to-use library for LLM inference and serving.
You should use the vLLM Model Deployer if you want to:

- Deploy large language models with state-of-the-art serving throughput behind an OpenAI-compatible API server
- Continuously batch incoming requests
- Use quantization: GPTQ, AWQ, INT4, INT8, and FP8
- Benefit from features such as PagedAttention, speculative decoding, and chunked prefill
The vLLM Model Deployer flavor is provided by the vLLM ZenML integration, so you need to install it on your local machine to be able to deploy your models. You can do this by running the following command:
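The original code block did not survive extraction; assuming ZenML's standard `zenml integration install` convention, the command looks like:

```shell
# Install the vLLM integration into your local ZenML environment
zenml integration install vllm -y
```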
To register the vLLM model deployer with ZenML you need to run the following command:
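A sketch of the registration step, using ZenML's standard model-deployer CLI; the deployer name `vllm_deployer` is a placeholder you can choose freely:

```shell
# Register a model deployer with the vLLM flavor
zenml model-deployer register vllm_deployer --flavor=vllm

# Add it to your active stack
zenml stack update -d vllm_deployer
```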
The ZenML integration will provision a local vLLM deployment server as a daemon process that will continue to run in the background to serve the latest vLLM model.
If you'd like to see this in action, check out this working example.
Within the `VLLMDeploymentService` you can configure:

- `model`: Name or path of the Hugging Face model to use.
- `tokenizer`: Name or path of the Hugging Face tokenizer to use. If unspecified, the model name or path will be used.
- `served_model_name`: The model name(s) used in the API. If not specified, the model name will be the same as the `model` argument.
- `trust_remote_code`: Trust remote code from Hugging Face.
- `tokenizer_mode`: The tokenizer mode. Allowed choices: `auto`, `slow`, `mistral`.
- `dtype`: Data type for model weights and activations. Allowed choices: `auto`, `half`, `float16`, `bfloat16`, `float`, `float32`.
- `revision`: The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, the default version will be used.
The vLLM Model Deployer exposes a `VLLMDeploymentService` that you can use in your pipeline. Here is an example snippet:
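The snippet below is a minimal sketch, assuming the integration exposes a `vllm_model_deployer_step` under `zenml.integrations.vllm.steps` (adjust the import path to match your ZenML version):

```python
from zenml import pipeline

# Assumed import path for the deployer step shipped with the vLLM integration.
from zenml.integrations.vllm.steps import vllm_model_deployer_step


@pipeline()
def deploy_vllm_pipeline(model: str = "gpt2"):
    # Provisions (or updates) the local vLLM daemon serving `model`
    # and returns the corresponding VLLMDeploymentService.
    service = vllm_model_deployer_step(model=model)
    return service
```

Running this pipeline starts the background daemon described above; the returned service object holds the prediction URL of the OpenAI-compatible server.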
Here is an example of running a GPT-2 model using vLLM.