By default, both will point to a subfolder of your local
.zenml directory, which is created when you run
zenml init. It’ll contain both the Metadata Store (default: SQLite) and the Artifact Store (default: a local folder in the ZenML repo).
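For example, running the following from the root of your project creates that folder with the default stores in place:

# creates the .zenml folder holding the default stores
zenml init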
The Metadata Store can simply be configured to use any MySQL server (>= 5.6):
zenml config metadata set mysql --host="127.0.0.1" --port="3306" --username="USER" --password="PASSWD" --database="DATABASE"
The Artifact Store can be a local filesystem path or a bucket path (current support for GCP and AWS).
zenml config artifacts set "gs://your-bucket/sub/dir"
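For AWS, an S3 bucket path should presumably work the same way (the path here is purely illustrative):

zenml config artifacts set "s3://your-bucket/sub/dir"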
In Python, the
MetadataStore classes can be used to override the default stores set above.
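A hypothetical sketch of such an override is shown below; the MySQLMetadataStore class, its import path, and the metadata_store argument to run() are assumptions and may not match your ZenML version:

# Hypothetical sketch -- class name, import path and the metadata_store
# keyword argument are assumptions, not the confirmed ZenML API.
from zenml.metadata import MySQLMetadataStore

metadata_store = MySQLMetadataStore(
    host='127.0.0.1',
    port=3306,
    username='USER',
    password='PASSWD',
    database='DATABASE',
)

# Given a pipeline such as the TrainingPipeline created further below,
# pass the store explicitly instead of relying on the repo-wide default.
training_pipeline.run(metadata_store=metadata_store)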
ZenML uses Google’s ML Metadata under the hood to automatically track all metadata produced by ZenML pipelines. ML Metadata standardizes metadata tracking and makes it easy to keep track of iterative experimentation as it happens. This not only helps in post-training workflows to compare results as experiments progress but also has the added advantage of leveraging caching of pipeline steps.
All parameters of every ZenML step are persisted in the Metadata Store and also in the declarative pipeline configs. In the config, they can be seen quite easily in the
steps key. Here is a sample stemming from this Python step:
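A sketch of such a step, assuming a hypothetical FeedForwardTrainer class (the class name is an assumption; the keyword arguments correspond one-to-one to the args in the config below):

# Sketch only -- the FeedForwardTrainer name is an assumption; the keyword
# arguments mirror the args key of the config that follows.
training_pipeline.add_trainer(FeedForwardTrainer(
    batch_size=1,
    dropout_chance=0.2,
    epochs=1,
    hidden_activation='relu',
    hidden_layers=None,
    last_activation='sigmoid',
    loss='mse',
    lr=0.001,
    metrics=None,
    output_units=1,
))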
That translates to the following config:
steps:
  ...
  trainer:
    args:
      batch_size: 1
      dropout_chance: 0.2
      epochs: 1
      hidden_activation: relu
      hidden_layers: null
      last_activation: sigmoid
      loss: mse
      lr: 0.001
      metrics: null
      output_units: 1
    source: ...
The args key represents all the parameters captured and persisted.
For most use-cases, ZenML exposes native interfaces to fetch these parameters after a pipeline has been run successfully. E.g. the
repo.compare_training_runs() method compares all pipelines in a Repository and extensively uses the ML Metadata store to spin up a visualization comparing training pipeline results.
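For example, using the Repository class that also appears further below:

from zenml.repo import Repository

# Fetch a reference to the current repo and compare all training runs
repo = Repository()
repo.compare_training_runs()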
However, if users would like direct access to the store, they can easily use the ML Metadata Python library to quickly access their parameters. To understand more about how Execution Parameters and ML Metadata work, please refer to the TFX docs.
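A minimal sketch of such direct access, assuming the default SQLite Metadata Store (the database path below is hypothetical and depends on where zenml init placed it):

from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Point ML Metadata at the SQLite database ZenML maintains; the exact
# path is an assumption and depends on your repo layout.
config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = '.zenml/local_store/metadata.db'
config.sqlite.connection_mode = metadata_store_pb2.SqliteMetadataSourceConfig.READONLY
store = metadata_store.MetadataStore(config)

# Every executed step shows up as an execution carrying its parameters
for execution in store.get_executions():
    print(execution.properties, execution.custom_properties)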
As all steps are persisted in the same pattern shown above, it is very simple to track any metadata you desire. Whenever creating a step, simply add the parameters you want to track as kwargs (key-worded parameters) in your Step's __init__() method (i.e. the constructor) and then pass the exact same parameters up to the superclass via the super().__init__() call:
class MyTrainer(BaseTrainerStep):
    def __init__(self,
                 arg_1: int = 8,
                 arg_2: float = 0.001,
                 arg_3: str = 'mse',
                 **kwargs):
        # Pass the exact same parameters as the signature down with the **kwargs dict
        super().__init__(arg_1=arg_1,
                         arg_2=arg_2,
                         arg_3=arg_3,
                         **kwargs)
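Used in a pipeline, such a step is then configured like any other (the argument values here are purely illustrative):

# Illustrative values -- they end up in the Metadata Store and in the
# declarative config exactly like the trainer sample above.
training_pipeline.add_trainer(MyTrainer(arg_1=16, arg_2=0.01, arg_3='mse'))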
As long as the above pattern is followed, any primitive type can be tracked centrally with ZenML. As a note, while all steps derive from BaseStep, more specific base classes are available for particular step types. For example, a preprocessing step should inherit from BasePreprocesserStep, and a trainer step from BaseTrainerStep, or from a base class one level further down such as TorchBaseTrainerStep. To see all the step interfaces, check out the steps module of ZenML.
Caching is an important mechanism for ZenML, which not only speeds up ML development but also allows for the re-usability of interim ML artifacts inevitably produced in the experimentation stage of training an ML model.
Whenever a pipeline is run in the same repository, ZenML tracks all Steps executed in that repository. The outputs of these steps are stored in the Metadata and Artifact Stores as they are computed. Whenever another pipeline is run afterward with the same Step configurations as a previously run pipeline, ZenML simply uses the previously computed output to warm-start the pipeline, rather than recomputing it.
This not only makes each subsequent run potentially much faster but also saves on computing cost. With ZenML, it is possible to preprocess millions of datapoints. Imagine having to re-compute this each time an experiment is run, even for a small change to a hyper-parameter further down the pipeline.
This is usually solved by creating snapshots of the preprocessed data, which unfortunately end up stored in arbitrary places: in the worst case in local folders, and in the best case in some form of cloud storage, but with a manually defined layout. These snapshots have the following disadvantages:
Data might get lost or corrupted
Data might not be easy to share across teams or environments
Data can be manipulated unexpectedly and silently, without anyone knowing.
With ZenML, all of this is taken care of in the background. Immutable snapshots of interim artifacts are stored in the Artifact Store and referenced in the Metadata Store for quick lookup. Therefore, as long as these stores remain intact, data cannot be lost, corrupted, or manipulated unexpectedly. Also, setting up a collaborative environment with ZenML ensures that this data is accessible to everyone and is consumed natively by all ZenML steps with ease.
Create and run the first pipeline:
training_pipeline = TrainingPipeline(name='Pipeline A')  # create the actual pipeline
training_pipeline.run()
Then get the pipeline:
from zenml.repo import Repository

# Get a reference in code to the current repo
repo = Repository()
pipeline_a = repo.get_pipeline_by_name('Pipeline A')
Create a new pipeline and change one step:
pipeline_b = pipeline_a.copy('Pipeline B')  # pipeline_a itself is immutable
pipeline_b.add_trainer(...)  # change trainer step
pipeline_b.run()
In the above example, if there is a shared Metadata and Artifact Store, all steps preceding the
TrainerStep in the pipeline will be cached and reused in Pipeline B. For large datasets, this will yield enormous benefits in terms of cost and time.