Structuring an MLOps project
Now that we've learned about managing artifacts and models, we can shift our attention again to the thing that brings them together: Pipelines. Together, these three concepts inform how we structure our projects.
In order to see the recommended repository structure of a ZenML MLOps project, read the best practices section.
An MLOps project can often be broken down into many different pipelines. For example:
A feature engineering pipeline that prepares raw data into a format ready for training.
A training pipeline that takes input data from the feature engineering pipeline and trains a model on it.
An inference pipeline that runs batch predictions on the trained model, often reusing pre-processing steps from the training pipeline.
A deployment pipeline that deploys a trained model to a production endpoint.
The lines between these pipelines can often get blurry: Some use cases call for these pipelines to be merged into one big pipeline. Others go further and break the pipeline down into even smaller chunks. Ultimately, the decision of how to structure your pipelines depends on the use case and requirements of the project.
No matter how you design these pipelines, one thing stays consistent: you will often need to transfer or share information (in particular artifacts, models, and metadata) between pipelines. Here are some common patterns that you can use to help facilitate such an exchange:
Client
Let's say we have a feature engineering pipeline and a training pipeline. The feature engineering pipeline is like a factory, pumping out many different datasets. Only a few of these datasets should be selected to be sent to the training pipeline to train an actual model.
In this scenario, the ZenML Client can be used to facilitate such an exchange:
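A minimal sketch of what this could look like. The artifact names (iris_training_dataset, iris_testing_dataset), the version string, and the step functions (load_data, prepare_data, model_trainer, model_evaluator) are illustrative placeholders, not fixed names:

```python
from zenml import pipeline
from zenml.client import Client


@pipeline
def feature_engineering_pipeline():
    dataset = load_data()  # hypothetical step
    # Assume the outputs of this step end up registered in the artifact
    # store as "iris_training_dataset" and "iris_testing_dataset"
    train_data, test_data = prepare_data(dataset)  # hypothetical step


@pipeline
def training_pipeline():
    client = Client()
    # Fetch the latest version of the training dataset by name
    train_data = client.get_artifact_version("iris_training_dataset")
    # Pin a specific version of the test dataset
    test_data = client.get_artifact_version(
        "iris_testing_dataset", version="raw_2023"
    )
    # The artifact references can be passed straight into downstream steps
    model = model_trainer(train_data)  # hypothetical step
    model_evaluator(model, test_data)  # hypothetical step
```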
Please note that in the above example, the train_data and test_data artifacts are not materialized in memory in the @pipeline function; the train_data and test_data objects are simply references to where this data is stored in the artifact store. Therefore, you cannot apply any logic that depends on the contents of this data at compilation time (i.e., in the @pipeline function).
Model
While passing around artifacts with IDs or names is very useful, it is often desirable to have the ZenML Model be the point of reference instead.
For example, let's say we have a training pipeline called train_and_promote and an inference pipeline called do_predictions. The training pipeline produces many different model artifacts, all of which are collected within a ZenML Model. Each time the train_and_promote pipeline runs, it creates a new iris_classifier. However, it only promotes the model to production if a certain accuracy threshold is met. The promotion can also be done manually with human intervention, or it can be automated by setting a particular threshold.
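As a rough sketch, automated promotion could live inside a step of the training pipeline along these lines. The step name, accuracy input, and threshold value are illustrative assumptions:

```python
from zenml import step, get_step_context
from zenml.enums import ModelStages


@step
def promote_model(accuracy: float, threshold: float = 0.9) -> None:
    # The Model configured on the pipeline is available via the step context
    model = get_step_context().model

    # Only promote this model version to production if it clears the threshold
    if accuracy >= threshold:
        model.set_stage(ModelStages.PRODUCTION, force=True)
```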
On the other side, the do_predictions pipeline simply picks up the latest promoted model and runs batch inference on it. It need not know the IDs or names of any of the artifacts produced by the training pipeline's many runs. This way, the two pipelines can run independently while still relying on each other's outputs.
In code, this is very simple. Once the pipelines are configured to use a particular model, we can use get_step_context to fetch the configured model within a step directly. Assuming there is a predict step in the do_predictions pipeline, we can fetch the production model like so:
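A sketch of what such a step might look like, assuming the training pipeline saved a model artifact named trained_model within the Model; the artifact name and the pandas data types are illustrative:

```python
import pandas as pd
from zenml import step, get_step_context


@step
def predict(data: pd.DataFrame) -> pd.Series:
    # The Model configured on the do_predictions pipeline is available
    # through the step context at runtime
    model = get_step_context().model

    # Fetch the trained model artifact collected in that Model and
    # materialize it into memory
    trained_model = model.get_model_artifact("trained_model").load()

    return pd.Series(trained_model.predict(data))
```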
This approach has the downside that, if the step is cached, it could return a stale model and lead to unexpected results. You could simply disable the cache in the above step or in the corresponding pipeline. Alternatively, you can resolve the artifact at the pipeline level:
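One way this could look, reusing the illustrative iris_classifier Model and trained_model artifact from above; the load_inference_data step is a hypothetical placeholder:

```python
import pandas as pd
from sklearn.base import ClassifierMixin
from zenml import pipeline, step, get_pipeline_context, Model
from zenml.enums import ModelStages


@step
def predict(model: ClassifierMixin, data: pd.DataFrame) -> pd.Series:
    # The model artifact is resolved and passed in by the pipeline,
    # so caching behaves as expected for this step
    return pd.Series(model.predict(data))


@pipeline(
    model=Model(name="iris_classifier", version=ModelStages.PRODUCTION),
)
def do_predictions():
    # Resolve the trained model artifact from the pipeline context
    model = get_pipeline_context().model
    inference_data = load_inference_data()  # hypothetical step
    predict(
        model=model.get_model_artifact("trained_model"),
        data=inference_data,
    )
```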
Ultimately, both approaches are fine. You should decide which one to use based on your own preferences.