Examples

Step-by-step ZenML Pro resource pool examples: pool JSON, policy JSON, ResourceSettings, and outcomes for new users.

Read this page like a short course: each section is one self-contained scenario. You will always see three things—the pool (shared capacity), the policy (how one orchestrator or step operator may use that pool), and the step (ResourceSettings). Then we spell out what the server does.

Assumptions unless stated otherwise:

  • Steps are preemptible by default; set preemptible=False to opt out.

  • When we say “no other work is running,” exactly one step run is in flight, so you can focus on a single decision.

  • Every key in a policy’s reserved and limit must exist on the pool’s capacity. You cannot meter a resource in policy that the pool does not define.

  • If one orchestrator or step operator has policies attached to several pools, the step still receives at most one allocation from one pool. The whole request must be eligible on that pool; resources are not split across pools for a single step.

For definitions of reserved, limit, and priority, see Core concepts. For preemption ordering, see How preemption works.

Primer: from ResourceSettings to the resource request

Say a step declares:

from zenml.config import ResourceSettings

ResourceSettings(
    gpu_count=2,
    cpu_count=4,
    memory="16GiB",
    pool_resources={"tensorrt_sessions": 1},
    preemptible=True,
)

ZenML turns that into one resource request. Roughly:

Source          Request key   How the amount is derived               Example value
--------------  ------------  --------------------------------------  -------------------
gpu_count       gpu           Same as gpu_count                       2
cpu_count       mcpu          ceil(cpu_count * 1000)                  4000
memory          memory_mb     Converted to megabytes                  17180 (for "16GiB")
pool_resources  (your names)  Copied as-is, merged with typed fields  1
Server          step_run      Always 1 per step                       1

The pool must define capacity for bounded keys such as gpu and tensorrt_sessions. If the pool has no row for mcpu, memory_mb, or step_run, that dimension is unbounded at the pool layer (see the examples below). For everything else, a missing pool row means zero capacity. If you want a policy to set reserved / limit on a key, that key must appear on the pool first—policy keys are always a subset of pool keys.
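
The mapping in the table above can be sketched in plain Python. This is an illustrative model, not ZenML's implementation: build_request and memory_to_mb are hypothetical names, and the memory rule (decimal megabytes, rounded up) is an assumption inferred from the "16GiB" to 17180 example.

```python
import math
import re

# Byte multipliers for a few common memory suffixes (illustrative subset).
_UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9,
          "KiB": 2**10, "MiB": 2**20, "GiB": 2**30}


def memory_to_mb(memory: str) -> int:
    """Convert a string such as "16GiB" to megabytes, rounding up (assumed rule)."""
    match = re.fullmatch(r"(\d+)([A-Za-z]+)", memory)
    if match is None or match.group(2) not in _UNITS:
        raise ValueError(f"unsupported memory value: {memory!r}")
    total_bytes = int(match.group(1)) * _UNITS[match.group(2)]
    return -(-total_bytes // 10**6)  # integer ceiling division


def build_request(gpu_count=None, cpu_count=None, memory=None, pool_resources=None):
    """Hypothetical helper: flatten step settings into one request dict."""
    request = dict(pool_resources or {})  # custom keys are copied as-is
    if gpu_count is not None:
        request["gpu"] = gpu_count
    if cpu_count is not None:
        request["mcpu"] = math.ceil(cpu_count * 1000)
    if memory is not None:
        request["memory_mb"] = memory_to_mb(memory)
    request["step_run"] = 1  # added by the server, always 1 per step
    return request


request = build_request(
    gpu_count=2, cpu_count=4, memory="16GiB",
    pool_resources={"tensorrt_sessions": 1},
)
print(request)
```

The resulting dict has one entry per request key, which is the shape the later examples compare against pool capacity and policy rows.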


Warm-up: one pool, one policy, only GPUs

Preemptible step borrows past reserved

Story: The team has four GPUs “labeled” for them, but the pool is empty. They ask for six GPUs and allow preemption. They may borrow two idle GPUs.

Pool

Policy (orchestrator team-ml-orch attached to this pool)

Step

Outcome: Allocated immediately (no queue). Four GPUs count against the policy reserved share; two are borrowed from free pool capacity. Borrowing uses the band between reserved and limit and requires that the pool still has free units.


Non-preemptible step stays inside reserved

Story: Same pool and policy. Production wants two GPUs and opts out of preemption. Two is within the four-GPU reservation.

Pool

Policy

Step

Outcome: Allocated (assuming no other contention). Non-preemptible work must satisfy requested ≤ reserved per key; 2 ≤ 4 passes.


Non-preemptible step beyond reserved

Story: Same pool and policy. Production asks for six GPUs but refuses preemption. Non-preemptible work cannot use the “borrow” band above reserved.

Pool

Policy

Step

Outcome: Rejected immediately (dynamic run fails fast). Six exceeds reserved (4) for gpu; non-preemptible requests cannot borrow up to limit.
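
The three warm-up outcomes follow from one per-key rule, sketched below. The function name policy_decision is hypothetical, reserved=4 comes from the stories above, and limit=8 is an assumed value chosen only so that a six-GPU burst is possible. "Eligible" here means the policy allows the request; a grant still depends on free pool capacity.

```python
def policy_decision(requested: int, reserved: int, limit: int, preemptible: bool) -> str:
    """Classify one key of a request against one policy row (illustrative)."""
    if requested > limit:
        return "rejected: over limit"
    if not preemptible and requested > reserved:
        return "rejected: non-preemptible over reserved"
    borrowed = max(0, requested - reserved)
    return f"eligible: {requested - borrowed} from reserved, {borrowed} borrowed"


# reserved=4 comes from the stories above; limit=8 is an assumed value.
RESERVED, LIMIT = 4, 8

print(policy_decision(6, RESERVED, LIMIT, preemptible=True))
print(policy_decision(2, RESERVED, LIMIT, preemptible=False))
print(policy_decision(6, RESERVED, LIMIT, preemptible=False))
```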


CPU, memory, and step slots: unbounded vs metered

GPU-only pool—CPU and memory not quota’d

Story: You only modeled GPUs on the pool and policy. The step still sends mcpu and memory_mb on the request, but those keys are unbounded at the pool layer when omitted, and this policy omits them too—so they do not block non-preemptible work.

Pool

Policy

Step

Outcome: Allocated if nothing else is wrong. Only gpu is gated here; mcpu / memory_mb / step_run are not limited by pool or policy in this pattern. CPU and memory remain informational unless you add rows later.
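
A minimal sketch of the pool-layer lookup described here, assuming the rule stated in the primer: a missing row means unbounded for mcpu, memory_mb, and step_run, and zero for any other key. The helper pool_capacity and the capacity of 8 GPUs are illustrative assumptions.

```python
import math

# Keys the text describes as unbounded when the pool has no row for them.
UNBOUNDED_WHEN_MISSING = {"mcpu", "memory_mb", "step_run"}


def pool_capacity(pool: dict, key: str) -> float:
    """Capacity the pool layer applies to a key (illustrative model)."""
    if key in pool:
        return pool[key]
    return math.inf if key in UNBOUNDED_WHEN_MISSING else 0


gpu_only_pool = {"gpu": 8}  # assumed capacity, for illustration only
request = {"gpu": 1, "mcpu": 4000, "memory_mb": 17180, "step_run": 1}

# Keys whose requested amount exceeds what the pool layer allows.
blocking = [key for key, amount in request.items()
            if amount > pool_capacity(gpu_only_pool, key)]
print(blocking)
print(pool_capacity(gpu_only_pool, "tensorrt_sessions"))
```

No key blocks the request on the GPU-only pool, while a custom key such as tensorrt_sessions resolves to zero capacity because the pool does not list it.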


Non-preemptible CPU inside policy reserved

Story: You cap milli-CPU on the pool, then split it with reserved / limit on the policy. Non-preemptible CPU demand must fit reserved per key.

Pool

Policy

Step

Outcome: Allocated. cpu_count=2 → mcpu 2000 ≤ reserved 4000, and gpu is valid.


Non-preemptible CPU over policy reserved

Pool

Policy

Step

Outcome: Rejected. cpu_count=8 → mcpu 8000 > reserved 4000; non-preemptible work cannot borrow toward limit on mcpu.


Preemptible CPU burst with policy mcpu rows

Pool and Policy: same as Non-preemptible CPU inside policy reserved (pool includes gpu and mcpu; policy sets both keys).

Step

Outcome: May allocate using headroom up to limit on mcpu (and pool free capacity), analogous to GPU borrowing.


Preemptible when pool lists mcpu but policy omits it

Story: The pool caps total milli-CPU. With no mcpu on the policy, reserved defaults to 0 and limit falls back to the pool total.

Pool

Policy (only gpu)

Step

Outcome: Allocated immediately, or queued and then allocated once capacity frees up. mcpu 4000 ≤ effective limit 8000 (pool total).


Non-preemptible when pool lists mcpu but policy omits it

Pool and Policy: same as the previous example.

Step

Outcome: Rejected. Any positive mcpu with reserved 0 fails for non-preemptible work. Fix: add mcpu to the policy with enough reserved, or remove mcpu from the pool if you wanted fully unbounded CPU at the pool layer.
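
The last two examples hinge on the defaults applied when the pool lists a key but the policy omits it. The sketch below models that fallback; effective_bounds and fits_policy are hypothetical helpers, the dict shapes are not ZenML's JSON schema, and the gpu numbers are assumed. Only the mcpu pool total of 8000 comes from the text.

```python
def effective_bounds(pool: dict, policy: dict, key: str) -> tuple:
    """Return (reserved, limit) for a key the pool defines (illustrative)."""
    row = policy.get(key, {})
    reserved = row.get("reserved", 0)    # missing policy row: reserved is 0
    limit = row.get("limit", pool[key])  # missing policy row: limit is the pool total
    return reserved, limit


def fits_policy(requested: int, reserved: int, limit: int, preemptible: bool) -> bool:
    """Preemptible work may reach limit; non-preemptible work stops at reserved."""
    ceiling = limit if preemptible else reserved
    return requested <= ceiling


pool = {"gpu": 8, "mcpu": 8000}                # gpu capacity is assumed
policy = {"gpu": {"reserved": 4, "limit": 8}}  # assumed gpu row; no mcpu row

reserved, limit = effective_bounds(pool, policy, "mcpu")
print(reserved, limit)
print(fits_policy(4000, reserved, limit, preemptible=True))
print(fits_policy(4000, reserved, limit, preemptible=False))
```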


Capping concurrent steps with step_run

Story: You want both GPUs and a ceiling on how many steps from this orchestrator run at once. Each step always requests one step_run.

Pool

Policy

Step

Outcome: The server grants only when both gpu and step_run have enough free units. If GPUs are free but all step_run slots are taken, the request waits in the queue.
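
The "both keys must have room" rule can be sketched as a simple all-keys check. This assumes every requested key is defined on the pool, as in this example; can_grant_now and the free-unit numbers are hypothetical.

```python
def can_grant_now(request: dict, free: dict) -> bool:
    """True only if every requested key has enough free units (illustrative)."""
    return all(free.get(key, 0) >= amount for key, amount in request.items())


request = {"gpu": 1, "step_run": 1}

# GPUs and step slots are both available, so the request can be granted.
print(can_grant_now(request, {"gpu": 3, "step_run": 2}))
# GPUs are free but every step slot is taken, so the request waits.
print(can_grant_now(request, {"gpu": 3, "step_run": 0}))
```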


Custom keys from pool_resources

Custom key fully configured

Story: You track a scarce license or device class with pool_resources.

Pool

Policy

Step

Outcome: Allocated when 1 ≤ reserved for both gpu and tensorrt_sessions. Unbounded defaults do not apply to custom keys—the pool must list them.


Custom key on pool but missing from policy

Story: Same pool capacity; policy only defines gpu.

Pool

Policy

Step

Outcome: Rejected. Missing policy row → reserved 0 for tensorrt_sessions; non-preemptible cannot ask for a positive amount. Fix: add tensorrt_sessions to the policy, or mark the step preemptible if borrowing is acceptable.


Hard rejections (no queue)

Request exceeds pool capacity

Pool

Policy

Step

Outcome: Rejected immediately. Ten exceeds the pool total for gpu; the request does not join a queue.


Request exceeds policy limit (pool could fit)

Pool

Policy

Step

Outcome: Rejected. Six exceeds this component’s limit (4) for gpu, even if eight GPUs exist in the pool.
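
A rough way to picture the difference between a hard rejection and a wait, for a single key of a preemptible request. The function admit is hypothetical; pool_total=8 and limit=4 follow the second example, while the pool total used for the ten-GPU case and all free values are assumptions.

```python
def admit(requested: int, pool_total: int, limit: int, free: int) -> str:
    """Hard-reject, grant, or queue a preemptible request for one key (illustrative)."""
    if requested > pool_total:
        return "rejected: exceeds pool capacity"
    if requested > limit:
        return "rejected: exceeds policy limit"
    return "allocated" if requested <= free else "queued"


# pool_total=8 and limit=4 follow the second example; free values are assumed.
print(admit(10, pool_total=8, limit=4, free=8))
print(admit(6, pool_total=8, limit=4, free=8))
print(admit(4, pool_total=8, limit=4, free=2))
```

Only the last request can ever be satisfied, so it is the only one that joins the queue.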


Contention: queues and priorities

Two teams, same priority, not enough GPUs

Story: Red and Blue orchestrators share one pool. Policies use the same priority. Many preemptible steps each want 2 GPUs; the pool cannot satisfy everyone at once.

Pool

Policies

Step (typical for either team)

Outcome: Requests wait in the pool queue until GPUs free up. Among requests with the same policy priority, ordering tends to favor older waiters (FIFO-style). When both are waiting, the allocator also prefers a request that still fits entirely in its unused reserved slice over one that must borrow. A team with reservation headroom is therefore not stuck behind another team that is already bursting, provided the next grant can be served from that reserved slice. No preemption occurs until a higher-priority waiter or reclaim logic forces it.
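
One possible reading of this ordering, as a sort key. The text does not state exactly how the reserved-fit preference and waiter age interact, so the tie-break order below (priority, then reserved-fit before borrowers, then age) is an assumption, and Waiter, service_order, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Waiter:
    name: str
    priority: int         # policy priority; higher is served first
    enqueued_at: int      # smaller means older
    requested: int        # GPUs requested
    reserved_unused: int  # unused reserved GPUs on this waiter's policy

    @property
    def must_borrow(self) -> bool:
        return self.requested > self.reserved_unused


def service_order(waiters):
    """Assumed order: priority, then reserved-fit before borrowers, then age."""
    return sorted(waiters, key=lambda w: (-w.priority, w.must_borrow, w.enqueued_at))


queue = [
    Waiter("red-1", priority=50, enqueued_at=1, requested=2, reserved_unused=0),
    Waiter("blue-1", priority=50, enqueued_at=2, requested=2, reserved_unused=2),
    Waiter("red-2", priority=50, enqueued_at=3, requested=2, reserved_unused=0),
]
order = [w.name for w in service_order(queue)]
print(order)
```

Here Blue's request is served first even though it is younger, because it fits in Blue's unused reservation while Red is already bursting.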


Higher priority wins; lower may be preempted

Story: Sandbox bursts with preemptible work. Production has higher policy priority and needs GPUs when the pool is full.

Pool

Policies

Sandbox (already holding six GPUs, preemptible)

Prod (new, preemptible, needs four)

Outcome: If four GPUs cannot be granted without reclaiming space, the reconciler may preempt Sandbox’s preemptible runs (lower policy priority) so Prod can proceed. See How preemption works for victim ordering.
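
The sketch below only illustrates which runs could be reclaimed and how many GPUs are missing; it does not model victim ordering. The helper preemption_plan and the pool capacity of 8 GPUs are assumptions; the priorities (Sandbox 10, Prod 100) are the ones this page uses for these two policies.

```python
def preemption_plan(need: int, free: int, requester_priority: int, holders: list):
    """Return (shortfall, candidate holders) for one key (illustrative)."""
    shortfall = max(0, need - free)
    if shortfall == 0:
        return 0, []  # enough free capacity; nothing has to be reclaimed
    candidates = [
        h["name"] for h in holders
        if h["preemptible"] and h["priority"] < requester_priority
    ]
    return shortfall, candidates


# A pool capacity of 8 GPUs is assumed, leaving 2 free while Sandbox holds 6.
holders = [{"name": "sandbox-run", "priority": 10, "preemptible": True, "gpu": 6}]
shortfall, candidates = preemption_plan(
    need=4, free=8 - 6, requester_priority=100, holders=holders
)
print(shortfall, candidates)
```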


Production non-preemptible waits on reserved only

Story: Prod uses preemptible=False and asks only for what is reserved. If another non-preemptible job on the same stack component already holds the reserved GPUs, this step does not borrow from Sandbox’s burst.

Pool

Policies (same as previous example: Sandbox priority 10, Prod priority 100)

Prod step

Outcome: Waits in the queue if Prod’s reserved gpu (2) is already used by other non-preemptible work on prod-orch. It will not take Sandbox’s borrowed GPUs. Ways out: raise reserved for Prod, wait for the other job to finish, or use preemptible Prod work if policy allows.


Multiple pools and multi-key requests

Two pools on one orchestrator—primary pool wins

Story: You attach two policies to the same orchestrator pointing at different pools. The step still produces one resource request, enqueued in every eligible pool; only one pool may win.

Pools

Policies

Step

Outcome: The server tries higher policy priority first—eu-west before eu-north. Whichever pool grants first owns the allocation; the other queue entry is dropped as stale. Use this for primary/fallback or regional capacity, not for splitting one step across unrelated quotas.
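
A minimal sketch of "highest priority that can grant wins". The function choose_pool, the can_grant callback, and the priority numbers are hypothetical; only the ordering (eu-west before eu-north) comes from the text.

```python
def choose_pool(policies, can_grant):
    """Try policies from highest to lowest priority; first grant wins (illustrative)."""
    for policy in sorted(policies, key=lambda p: p["priority"], reverse=True):
        if can_grant(policy["pool"]):
            return policy["pool"]
    return None  # nothing can grant yet; the request keeps waiting


# Priorities are assumed values; only their order (eu-west first) comes from the text.
policies = [
    {"pool": "eu-north", "priority": 50},
    {"pool": "eu-west", "priority": 100},
]

print(choose_pool(policies, can_grant=lambda pool: True))
print(choose_pool(policies, can_grant=lambda pool: pool == "eu-north"))
```

When both pools can grant, the primary pool owns the allocation; the fallback is used only when the primary cannot grant.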


One step must satisfy every key in each pool

Story: Eligibility is checked per pool against all keys on the request. If a pool lacks a key the step needs, that pool treats it as zero capacity—the request is not eligible there.

Pool A (GPUs only)

Pool B (full bundle)

Policy (example: only Pool B is attached, or imagine Pool A attached alone)

Step

Outcome: A pool with only gpu cannot satisfy tensorrt_sessions—that dimension is zero there, so the request does not enqueue on that pool.

Lesson: Model every scarce bounded dimension you care about on one pool (or ensure every candidate pool defines the same key set for those keys). The next section shows how mcpu and memory_mb differ: omitting them on one pool keeps that path eligible even when another pool meters them strictly.


Two pools: higher-priority path meters CPU/RAM; GPU-only path still wins

Story: One orchestrator has two policies (same pattern as Two pools on one orchestrator—primary pool wins). Pool B’s capacity and policy include mcpu and memory_mb, with reserved amounts sized for small non-preemptible jobs. Pool A only defines gpu; it does not list mcpu or memory_mb, so those dimensions are unbounded at the pool layer and its policy does not reserve them. A non-preemptible step asks for one GPU but more CPU and RAM than Pool B’s policy allows. The higher-priority policy (Pool B) cannot grant that request; the lower-priority policy (Pool A) can, because the request’s CPU and memory demand is not quota’d on that path. The allocation is owned by Pool A.

Pool A (GPUs only—no mcpu or memory_mb on the pool)

Pool B (GPUs plus metered CPU and memory)

Policies (same component, different priorities—B is preferred when both can grant)

Step

Outcome: The request maps to roughly mcpu 8000 and memory_mb 34360 for 32GiB (the same conversion that turns "16GiB" into 17180). For non-preemptible work, each key must be ≤ policy reserved on the path you use. Pool B’s policy reserves only mcpu 4000 and memory_mb 8192, so that path cannot satisfy the step. Pool A’s policy has no mcpu or memory_mb rows; with those keys absent from the pool, they are not treated as zero capacity, so the step remains eligible there on gpu alone. The reconciler allocates from Pool A and drops the competing queue row for Pool B.

If you want large non-preemptible jobs to stay on the metered pool, raise reserved (and capacity) on Pool B for mcpu and memory_mb, or reduce demand in ResourceSettings. Otherwise the GPU-only policy acts as an escape hatch for heavy CPU/RAM asks.
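
This scenario can be worked through as a per-pool eligibility check. The sketch is illustrative: eligible_non_preemptible is a hypothetical helper, all pool capacities and gpu reservations are assumed, and the memory conversion (decimal megabytes, rounded up) is inferred from the primer. Only Pool B's reserved mcpu 4000 and memory_mb 8192 come from the text.

```python
import math

# Keys the text describes as unbounded when the pool has no row for them.
UNBOUNDED_WHEN_MISSING = {"mcpu", "memory_mb", "step_run"}


def eligible_non_preemptible(request: dict, pool: dict, reserved: dict) -> bool:
    """Check every requested key against one pool and its policy (illustrative)."""
    for key, amount in request.items():
        if key not in pool:
            if key in UNBOUNDED_WHEN_MISSING:
                continue  # not metered on this path
            return False  # a missing bounded key means zero capacity
        if amount > pool[key] or amount > reserved.get(key, 0):
            return False
    return True


# 32GiB is 32 * 2**30 bytes, rounded up to decimal megabytes (assumed rule).
request = {"gpu": 1, "mcpu": 8000,
           "memory_mb": math.ceil(32 * 2**30 / 10**6), "step_run": 1}

# Capacities and gpu reservations are assumed; B's mcpu/memory_mb reserved are from the text.
pool_a, reserved_a = {"gpu": 4}, {"gpu": 2}
pool_b = {"gpu": 4, "mcpu": 16000, "memory_mb": 65536}
reserved_b = {"gpu": 2, "mcpu": 4000, "memory_mb": 8192}

print(request["memory_mb"])
print(eligible_non_preemptible(request, pool_b, reserved_b))
print(eligible_non_preemptible(request, pool_a, reserved_a))
```

Pool B fails on mcpu before memory is even checked, while Pool A passes because only gpu is metered on that path.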

