Core Concepts

Precise definitions for ZenML Pro resource pools, subject policies, and resource requests.

Pools

A resource pool is a named shared bucket. For each resource key (for example gpu), you set how many units exist in the pool. Policies on that pool further split that capacity among orchestrators and step operators.

Steps and pools use the same keys and integer amounts. Typical keys are:

| Key | Meaning |
| --- | --- |
| gpu | GPU count (requested by steps through ResourceSettings.gpu_count) |
| mcpu | Milli-CPU (requested by steps through ResourceSettings.cpu_count * 1000, rounded up) |
| memory_mb | Memory in megabytes (requested by steps through ResourceSettings.memory) |
| step_run | One concurrent step run (added automatically by the server for each step) |

Custom keys (for example tpu) can be set with pool_resources on the step.
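The key mapping above can be sketched in plain Python. This is only an illustration under stated assumptions, not ZenML's server code: `build_request` is a hypothetical helper, and memory is passed directly in megabytes because the conversion from the `ResourceSettings.memory` value is not described on this page.

```python
import math
from typing import Dict, Optional


def build_request(
    gpu_count: Optional[int] = None,
    cpu_count: Optional[float] = None,
    memory_mb: Optional[int] = None,
) -> Dict[str, int]:
    """Hypothetical helper: turn step-level settings into pool keys."""
    # Every step gets one implicit concurrent step_run slot.
    request: Dict[str, int] = {"step_run": 1}
    if gpu_count:
        request["gpu"] = gpu_count
    if cpu_count:
        # CPU is tracked in milli-CPU and rounded up to a whole unit.
        request["mcpu"] = math.ceil(cpu_count * 1000)
    if memory_mb:
        # Assumption: the caller has already converted memory to megabytes.
        request["memory_mb"] = memory_mb
    return request


print(build_request(gpu_count=1, cpu_count=0.5, memory_mb=16000))
```

All amounts end up as integers, which matches the statement that steps and pools share the same keys and integer amounts.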

CLI: pools

# Create a pool (capacity as JSON or YAML)
zenml resource-pool create training-gpus \
  --capacity '{"gpu": 8, "step_run": 32}' \
  --description "Shared training GPUs for the workspace"

# List pools with occupied vs total capacity
zenml resource-pool list

# Inspect one pool (name, ID prefix, or full ID)
zenml resource-pool describe training-gpus

# Shrink or grow capacity (0 removes a key from the pool)
zenml resource-pool update training-gpus --capacity '{"gpu": 4}'

# Remove a pool (-y / --yes skips confirmation)
zenml resource-pool delete training-gpus --yes

Policies

A policy connects one stack component to one pool. The component is the orchestrator or step operator that acts as the resource requester for a step. Each policy has three settings per resource key:

  • Reserved: How much of the pool you label as belonging to this component for accounting. Usage up to that amount counts as in share; anything above it (while the pool still has free units) is borrowed idle capacity. Reserved is not a separate pile of hardware: it is the share used to decide who is “in their rights” versus who is on spare capacity. Across all policies on the same pool, reserved totals per key cannot exceed the pool capacity. Reserved must also be ≤ that policy’s limit for the same key.

  • Limit: The hard ceiling on how much this component may hold from the pool at once for that key. Grants never go above the limit, even if the rest of the pool is idle. For preemptible workloads, the band between reserved and limit is where borrowing can happen (subject to free pool capacity). Non-preemptible work does not use that band: each requested amount per key must be ≤ reserved, and a higher limit does not raise that ceiling (the limit still caps preemptible burst and total use).

  • Priority: A number; higher means that component’s requests are preferred in the queue. When the reconciler must preempt a run, it considers lower-priority preemptible runs first as victims (see below).
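The reserved and limit rules can be written down as a few checks. The sketch below is a simplified model, not the ZenML Pro reconciler: the `Policy` class and all three functions are hypothetical names, queueing and preemption are ignored, and a key missing from a policy is assumed to mean zero for that policy.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Policy:
    """Illustrative stand-in for one component's policy on a pool."""
    component: str
    reserved: Dict[str, int]
    limit: Dict[str, int]
    priority: int = 0


def validate_policies(capacity: Dict[str, int], policies: List[Policy]) -> List[str]:
    """Return violations of the reserved/limit rules for keys the pool defines."""
    errors = []
    for p in policies:
        for key, reserved in p.reserved.items():
            # Reserved must be <= the same policy's limit for that key.
            if reserved > p.limit.get(key, 0):
                errors.append(f"{p.component}: reserved {key} exceeds limit")
    for key, total in capacity.items():
        # Reserved totals across all policies cannot exceed pool capacity.
        if sum(p.reserved.get(key, 0) for p in policies) > total:
            errors.append(f"pool: reserved {key} exceeds capacity")
    return errors


def may_grant(policy: Policy, key: str, held: int, amount: int,
              pool_free: int, preemptible: bool) -> bool:
    """Check one key of a request, ignoring queueing and preemption."""
    if amount > pool_free:
        return False  # not enough idle units right now
    if held + amount > policy.limit.get(key, 0):
        return False  # the limit caps total use in every case
    if not preemptible:
        # Non-preemptible work cannot use the band above reserved.
        return amount <= policy.reserved.get(key, 0)
    return True


def split_usage(policy: Policy, key: str, held: int) -> Tuple[int, int]:
    """Split current holdings into (in share, borrowed)."""
    in_share = min(held, policy.reserved.get(key, 0))
    return in_share, held - in_share


trainer = Policy("trainer", reserved={"gpu": 4}, limit={"gpu": 6}, priority=10)
batch = Policy("batch", reserved={"gpu": 2}, limit={"gpu": 8}, priority=1)
print(validate_policies({"gpu": 8}, [trainer, batch]))
print(split_usage(trainer, "gpu", held=6))
```

In this example the trainer holding 6 GPUs is using 4 in share and 2 borrowed, so those 2 are the units that only preemptible work could occupy.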

CLI: policies

Resource requests

For eligible runs, the server builds a resource request from the step’s ResourceSettings, records whether the step is preemptible, and tracks status: queued, allocated, rejected, preempted, or cancelled.

Only dynamic pipelines participate in resource queuing and allocation waiting: the server creates resource requests and the client blocks until allocation when the snapshot is dynamic. Static pipelines do not use this path today.

What users set in ResourceSettings becomes a server-side resource request (for example gpu_count → gpu, cpu_count → mcpu, memory → memory_mb, plus an implicit concurrent step_run slot). The pool must define capacity for each key requested by the step, except for three built-in types: if the pool has no row for mcpu, memory_mb, or the implicit step_run key, ZenML Pro treats that dimension as effectively unbounded at the pool layer, so missing rows there do not by themselves cause rejection. For every other key (including gpu and custom keys from pool_resources), a missing pool row means zero capacity: a positive request is rejected and the step run fails to start.
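The missing-row rule can be illustrated with a small check. This is a sketch of the rule as described, not the server implementation: `rejected_keys` and `UNBOUNDED_IF_MISSING` are hypothetical names, and the function only covers missing rows, not policy limits or current occupancy.

```python
from typing import Dict, List

# Built-in keys treated as unbounded when the pool has no row for them.
UNBOUNDED_IF_MISSING = {"mcpu", "memory_mb", "step_run"}


def rejected_keys(request: Dict[str, int], pool_capacity: Dict[str, int]) -> List[str]:
    """Return requested keys that fail because the pool has no row for them."""
    rejected = []
    for key, amount in request.items():
        if amount <= 0 or key in pool_capacity:
            continue  # nothing requested, or the pool defines this key
        if key in UNBOUNDED_IF_MISSING:
            continue  # no pool-level bound on this dimension
        rejected.append(key)  # missing row means zero capacity
    return rejected


pool = {"gpu": 8}
print(rejected_keys({"gpu": 1, "mcpu": 2000, "step_run": 1}, pool))
print(rejected_keys({"tpu": 1, "step_run": 1}, pool))
```

With a pool that only defines gpu, the first request passes this check because mcpu and step_run are unbounded when absent, while the second is rejected on tpu.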

Step decorators: ResourceSettings

Declare demand on the step; the server turns it into the resource request when the pipeline is dynamic and pooling applies to the stack.

Typical GPU / CPU / memory (preemptible by default):

Non-preemptible (must stay within policy reserved per key):

Custom pool keys (must exist on the pool and policy when non-preemptible):

Typed fields override the same keys if both appear in pool_resources. See step configuration in the OSS docs for the full ResourceSettings model.
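The override rule amounts to a dictionary merge in which typed values win. A minimal sketch, assuming both sides have already been reduced to pool keys; `merge_request` is a hypothetical helper, not part of the ZenML API:

```python
from typing import Dict


def merge_request(typed: Dict[str, int], pool_resources: Dict[str, int]) -> Dict[str, int]:
    """Combine custom pool keys with typed fields; typed values win on conflict."""
    merged = dict(pool_resources)  # start from the custom keys
    merged.update(typed)  # typed fields override the same keys
    return merged


# gpu appears on both sides, so the typed value (2) replaces the custom one (1).
print(merge_request(typed={"gpu": 2}, pool_resources={"gpu": 1, "tpu": 4}))
```

Keys that appear only in pool_resources, such as tpu here, pass through unchanged.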

CLI: resource requests

Requests are created when dynamic steps run; you inspect or clean them up from the CLI (IDs come from list output or the dashboard).
