Checkpoints
Durable work units with persistence and concurrency support.
A checkpoint is a unit of work inside a flow whose output is automatically persisted. It's also the contract between the runner and the execution target: the runner owns durable control flow (order, retry, replay, resume, wait), the execution target (inline, isolated container, sandbox, external tool) does the work, and the checkpoint is what they agree on.
That separation is why a checkpoint failure is never just a crash — it's persisted context the runner, agent loop, or a human can retry, replay, or feed back into the flow. See How It Works for the full model.
Checkpoints are replay boundaries
Every checkpoint is a boundary the runner remembers. On the first run, checkpoint outputs are computed and stored. On replay, completed checkpoints return their persisted outputs — execution only re-enters the first incomplete one.

You can also override a cached checkpoint's output during replay — useful when you want to correct a single step's result and let the rest of the flow continue. See Replay and overrides.
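The replay behaviour described above can be pictured with a toy model. This is only a sketch: `ToyRunner`, `report_flow`, and the checkpoint names are hypothetical stand-ins written for illustration, not Kitaru's API, and the "store" is a plain in-memory dict rather than durable storage.

```python
from typing import Any, Callable


class ToyRunner:
    """Toy stand-in for a runner: remembers checkpoint outputs by name."""

    def __init__(self) -> None:
        self.store: dict[str, Any] = {}  # persisted outputs, keyed by checkpoint name
        self.calls: list[str] = []       # checkpoints whose body was actually entered

    def run(self, name: str, fn: Callable[[], Any]) -> Any:
        # Completed checkpoint: return the persisted output without re-executing.
        if name in self.store:
            return self.store[name]
        self.calls.append(name)
        result = fn()
        self.store[name] = result  # persisted only after the work succeeds
        return result


def report_flow(runner: ToyRunner, fail_at_summarize: bool) -> str:
    text = runner.run("fetch", lambda: "raw text")

    def summarize() -> str:
        if fail_at_summarize:
            raise RuntimeError("model unavailable")
        return text.upper()

    summary = runner.run("summarize", summarize)
    return runner.run("publish", lambda: f"published: {summary}")


runner = ToyRunner()
try:
    report_flow(runner, fail_at_summarize=True)
except RuntimeError:
    pass  # first run fails at "summarize"; "fetch" is already persisted

# Replay: "fetch" returns its stored output, execution re-enters at "summarize".
replayed = report_flow(runner, fail_at_summarize=False)

# Override: seed a corrected output for one step and let the rest continue.
override_runner = ToyRunner()
override_runner.store = {"fetch": "raw text", "summarize": "corrected summary"}
overridden = report_flow(override_runner, fail_at_summarize=True)
```

In this model, the overridden run never enters the failing `summarize` body, because the runner treats the seeded value as a completed checkpoint.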
Defining a checkpoint
Decorate a work function with @checkpoint to turn it into a checkpoint.
Checkpoints are reusable — define them once and call them from any flow.
Composing checkpoints in a flow
Call checkpoints from inside a @flow to build your workflow.
Checkpoints execute sequentially by default. The return value of one checkpoint can be passed directly as input to the next — standard Python data flow.
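A minimal sketch of the shape this takes is shown here. The `checkpoint` and `flow` decorators are local stand-ins defined in the snippet so that it runs without Kitaru installed; in a real project they would come from the Kitaru package, whose import path is not shown in this section. The function names are hypothetical.

```python
import functools
from typing import Any

persisted: dict[str, Any] = {}  # stand-in for the runner's durable store


def checkpoint(fn):
    """Local stand-in: run the function and record its output by name."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        persisted[fn.__name__] = result
        return result
    return wrapper


def flow(fn):
    """Local stand-in: here a flow is just the function that calls checkpoints."""
    return fn


@checkpoint
def fetch_data(source: str) -> str:
    return f"data from {source}"


@checkpoint
def summarize(text: str) -> str:
    return text.upper()


@flow
def research_flow(source: str) -> str:
    data = fetch_data(source)  # runs first
    return summarize(data)     # receives the previous checkpoint's output


result = research_flow("notes.txt")
```

Both checkpoints are defined once at module level, so any other flow in the same module could call them too.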
Concurrent execution
For independent work that can run in parallel, use .submit().
.submit() returns a future-like runtime object. Call .result() on it to get the checkpoint's return value. You can submit multiple checkpoints and collect their results later; this fan-out / fan-in pattern is the primary way to run work concurrently in Kitaru.
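The submit-then-collect pattern mirrors the futures model in Python's standard library. The sketch below uses `concurrent.futures` as an analogy only; it does not call Kitaru, and `analyze` is a hypothetical unit of work.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze(topic: str) -> str:
    """Hypothetical independent unit of work."""
    return f"analysis of {topic}"


topics = ["pricing", "latency", "churn"]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Fan-out: submit everything up front; each call returns a future immediately.
    futures = [pool.submit(analyze, topic) for topic in topics]
    # Fan-in: block on each future, in submission order, to collect the values.
    results = [future.result() for future in futures]
```

Collecting in submission order keeps the results aligned with the inputs, regardless of which task finishes first.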
Additional concurrent helpers
Kitaru also provides .map() and .product() for batch concurrent execution.
These are convenience wrappers over concurrent submission. See the API reference for detailed signatures.
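The exact semantics of these helpers are defined in the API reference, not here. As a rough mental model, and assuming that a map-style helper pairs inputs positionally while a product-style helper covers every combination, the standard library equivalents look like this. `score` and its inputs are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def score(model: str, prompt: str) -> str:
    """Hypothetical unit of work taking two inputs."""
    return f"{model}:{prompt}"


models = ["small", "large"]
prompts = ["a", "b"]

with ThreadPoolExecutor() as pool:
    # Map-style: one call per position, i.e. (small, a) and (large, b).
    mapped = list(pool.map(score, models, prompts))

    # Product-style: one call per combination of the two input lists.
    combos = list(product(models, prompts))
    futures = [pool.submit(score, model, prompt) for model, prompt in combos]
    crossed = [future.result() for future in futures]
```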
Decorator options
Option | Default | Description
retries | 0 | Automatic retries on checkpoint failure
cache | True | Reuse the persisted output from a previous run when inputs and code match. Set False to disable on this checkpoint (overrides the flow-level default).
type | None | A label for UI visualization (e.g. "llm_call", "tool_call")
runtime | None | Execution runtime: "inline" or "isolated" (see below)
Like flow options, retries must be non-negative.
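One way to picture "inputs and code match" for the cache option is a key derived from both the function's code and its arguments. The sketch below is an illustration of that idea only; how Kitaru actually decides a cache hit is not described in this section, and `cache_key` is a hypothetical helper.

```python
import hashlib
import json


def cache_key(fn, *args, **kwargs) -> str:
    """Hash a function's compiled code together with its call arguments."""
    code = fn.__code__
    payload = json.dumps(
        {
            "bytecode": code.co_code.hex(),
            "consts": repr(code.co_consts),
            "args": args,
            "kwargs": kwargs,
        },
        sort_keys=True,
        default=repr,  # fall back to repr() for values JSON cannot encode
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def double(x):
    return x * 2


def triple(x):
    return x * 3


same_call = cache_key(double, 2) == cache_key(double, 2)          # reuse is safe
different_input = cache_key(double, 2) != cache_key(double, 3)    # input changed
different_code = cache_key(double, 2) != cache_key(triple, 2)     # code changed
```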
Isolated runtime
By default, checkpoints run inline — in the same process/pod as the runner. This is the right default for most orchestration. For checkpoints that run untrusted code, need a different image or resources, or must be strongly isolated from the rest of the run, set runtime="isolated" and the runner will place the checkpoint on a separate container/job on the configured stack (Kubernetes, Vertex AI, SageMaker, AzureML). Locally it falls back to inline so dev loops stay fast.
This applies to every execution of the checkpoint, whether called directly or submitted concurrently with .submit().
runtime controls where a checkpoint runs (same process vs. separate container). .submit() controls when — it enables concurrency. The two are independent: you can use .submit() without isolation, or isolation without .submit().
If the active orchestrator does not support isolated steps, the runtime is downgraded to inline and a warning is emitted. Local stacks always run inline.
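The fallback rule can be summarised as a small decision function. This is a sketch of the behaviour described above, not Kitaru's implementation; `resolve_runtime` and its `orchestrator_supports_isolated` flag are hypothetical names.

```python
import warnings


def resolve_runtime(requested, orchestrator_supports_isolated: bool) -> str:
    """Return the runtime a checkpoint would actually use in this toy model."""
    if requested not in (None, "inline", "isolated"):
        raise ValueError(f"unknown runtime: {requested!r}")
    if requested == "isolated" and not orchestrator_supports_isolated:
        # Downgrade rather than fail, but tell the user about it.
        warnings.warn("isolated runtime is not supported here; running inline")
        return "inline"
    return requested or "inline"  # None means the inline default


default_choice = resolve_runtime(None, orchestrator_supports_isolated=True)
remote_choice = resolve_runtime("isolated", orchestrator_supports_isolated=True)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    local_choice = resolve_runtime("isolated", orchestrator_supports_isolated=False)
```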
When retries are enabled, Kitaru records each failed attempt before the final checkpoint outcome. You can inspect this history through KitaruClient().executions.get(exec_id).checkpoints[*].attempts.
Error handling and retries
When a checkpoint raises an unhandled exception, the flow stops immediately and the execution is marked as failed. No subsequent checkpoints run.
Automatic retries
The retries parameter on @checkpoint tells Kitaru to re-run the checkpoint automatically before giving up.
Each failed attempt is recorded, so you can inspect the full retry history through the execution's checkpoint attempts. If the checkpoint still fails after all retries, the flow fails.
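The retry behaviour can be modelled with a small decorator. The `checkpoint` below is a local stand-in, not Kitaru's decorator, and it assumes that retries=2 means one initial try plus up to two re-runs. The `attempts` attribute is a hypothetical way of exposing the recorded failures.

```python
import functools


def checkpoint(retries: int = 0):
    """Local stand-in for a checkpoint decorator that supports retries."""
    if retries < 0:
        raise ValueError("retries must be non-negative")

    def decorate(fn):
        attempts = []  # one entry per failed attempt

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # One initial try plus up to `retries` automatic re-runs.
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    attempts.append(f"attempt {attempt + 1}: {exc}")
                    if attempt == retries:
                        raise  # retries exhausted: the flow would fail here

        wrapper.attempts = attempts
        return wrapper

    return decorate


calls = {"n": 0}


@checkpoint(retries=2)
def flaky_call() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("timeout")
    return "ok"


value = flaky_call()  # fails twice, succeeds on the third execution
```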
For retrying the entire flow (not just a single checkpoint), see the retries option on flows.
Resuming after failure
When a flow fails, you don't need to re-run everything from scratch. Use replay to re-execute from the point of failure — checkpoints that already succeeded return their recorded results, and execution picks up at the first incomplete checkpoint.
Return values
Checkpoint return values must be serializable — Kitaru persists them so they can be reused in future executions. Prefer:
Built-in Python types (str, int, float, bool, list, dict)
Pydantic models
JSON-compatible data structures
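A quick way to check whether a value is JSON-compatible before returning it from a checkpoint is to try encoding it. This sketch uses only the standard library; `is_json_compatible` is a hypothetical helper, and Kitaru's own serialization may accept more types than this check does.

```python
import json


def is_json_compatible(value) -> bool:
    """Return True if the value can be encoded with the standard json module."""
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError):
        return False


# Plain dicts, lists, strings, numbers, and booleans encode cleanly.
ok = is_json_compatible({"title": "report", "scores": [0.9, 0.7], "final": True})

# Arbitrary objects (open handles, clients, sessions) do not.
not_ok = is_json_compatible({"handle": object()})
```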
Rules to know
Kitaru enforces several guardrails in the current release:
Checkpoints only work inside a flow. Calling a checkpoint outside a @flow raises KitaruContextError.
No nested checkpoints. Calling one checkpoint from inside another is not supported and raises KitaruContextError.
.submit() requires a running flow. Concurrent submission is only available during flow execution, not during flow compilation.
.map() and .product() follow the same rules as .submit(): they require a running flow context.
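The first two rules can be modelled with context tracking. Everything below is a local stand-in written for illustration: the `flow` and `checkpoint` decorators and the `KitaruContextError` class are defined in the snippet and only borrow the names used above; they are not Kitaru's implementation.

```python
import functools
from contextvars import ContextVar


class KitaruContextError(RuntimeError):
    """Local stand-in named after the error described above."""


_in_flow = ContextVar("in_flow", default=False)
_in_checkpoint = ContextVar("in_checkpoint", default=False)


def flow(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        token = _in_flow.set(True)
        try:
            return fn(*args, **kwargs)
        finally:
            _in_flow.reset(token)
    return wrapper


def checkpoint(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if not _in_flow.get():
            raise KitaruContextError("checkpoint called outside a flow")
        if _in_checkpoint.get():
            raise KitaruContextError("nested checkpoints are not supported")
        token = _in_checkpoint.set(True)
        try:
            return fn(*args, **kwargs)
        finally:
            _in_checkpoint.reset(token)
    return wrapper


@checkpoint
def inner() -> int:
    return 1


@checkpoint
def outer() -> int:
    return inner() + 1  # nested call: rejected by the guard


@flow
def good_flow() -> int:
    return inner()


@flow
def bad_flow() -> int:
    return outer()


def error_message(fn):
    """Call fn and return the guard's error message, or None if it succeeds."""
    try:
        fn()
    except KitaruContextError as exc:
        return str(exc)
    return None


outside_error = error_message(inner)     # called with no flow running
nested_error = error_message(bad_flow)   # checkpoint calling a checkpoint
good_result = good_flow()
```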
Next steps
Add structured metadata to your checkpoints with Logging and Metadata
Understand how results and errors surface in Flows
See the full API in the Checkpoint Reference