In ZenML, a pipeline refers to a sequence of steps: independent entities that each take a certain set of inputs and create the corresponding outputs as artifacts. These output artifacts can in turn be fed into other steps as inputs, and that is how the order of execution is decided.
Each artifact produced along the way is stored in an artifact store, and the corresponding execution is tracked by a metadata store associated with the pipeline. These artifacts can be fetched directly or through helper methods. For instance, in a training pipeline, a helper method such as view_schema() can be used to easily view the artifacts from interim steps.
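To make the artifact-driven ordering concrete, here is a toy sketch in plain Python (not ZenML code; the step names and logic are purely illustrative) of how one step's output artifact becomes the next step's input, fixing the execution order:

```python
def split(dataset):
    # Pretend split step: first 80% becomes train, the rest eval.
    cut = int(len(dataset) * 0.8)
    return {"train": dataset[:cut], "eval": dataset[cut:]}

def transform(splits):
    # Pretend preprocessing step: scale every value.
    return {name: [x * 2 for x in data] for name, data in splits.items()}

def train(splits):
    # Pretend training step: the "model" is just the mean of the train split.
    train_data = splits["train"]
    return sum(train_data) / len(train_data)

# Each step consumes the previous step's output artifact, so the order
# split -> transform -> train is implied purely by the data flow.
splits = split(list(range(10)))
model = train(transform(splits))
```

In a real ZenML pipeline the artifacts would additionally be persisted in the artifact store rather than passed around as in-memory values.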
Finally, ZenML natively separates configuration from code by design. Every step in a pipeline has its parameters tracked and stored in a declarative config file in the selected pipelines directory. Pulling a pipeline and running it in another environment therefore reproduces not only the code, but also the configuration.
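As a rough illustration of this idea (plain Python, not ZenML's actual config format; the key names are hypothetical), keeping step parameters in a separate, serializable config means another environment can restore exactly the same settings:

```python
import json

# Illustrative only: parameters live in a plain mapping that is
# serialized separately from the step code.
config = {
    "split": {"train_fraction": 0.8},
    "trainer": {"epochs": 10, "learning_rate": 0.001},
}

def describe_trainer(params):
    # A stand-in for a step that reads its parameters from the config
    # instead of hard-coding them.
    return f"train for {params['epochs']} epochs at lr={params['learning_rate']}"

serialized = json.dumps(config)    # what would be written to the config file
restored = json.loads(serialized)  # what another environment would read back
```

Because the step code only ever reads parameters from the config, running the same code against the restored config yields an identical setup.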
All of the ideas above come together to form the foundation of the BasePipeline in ZenML. As the name suggests, it serves as a base class to create, execute and track pipeline runs, which represent a higher-order abstraction for standard ML tasks.
TrainingPipeline is a specialized pipeline built on top of the BasePipeline. It is used to run a training experiment and deploy the resulting model, and it covers a fixed set of steps found in most machine learning workflows:
Split: responsible for splitting your dataset into smaller datasets such as train and eval
Sequence (Optional): responsible for extracting sequences from time-series data
Transform: responsible for the preprocessing of your data
Train: responsible for the model creation and training process
Evaluate: responsible for the evaluation of your results
Deploy: responsible for the model deployment
Implementation-wise, it provides the methods required to add each of the aforementioned steps to the pipeline. It also features a set of helper functions which make it easier to interact with the output artifacts once the execution of the instance is completed. For instance, after the pipeline has run, you can use view_statistics to take a deeper look into the statistics of your dataset and its splits, or download_model to retrieve the trained model to a specified location.
class TrainingPipeline(BasePipeline):
    # Step functions
    def add_split(self, split_step: BaseSplit):
        ...

    def add_sequencer(self, sequencer_step: BaseSequencerStep):
        ...

    def add_preprocesser(self, preprocessor_step: BasePreprocesserStep):
        ...

    def add_trainer(self, trainer_step: BaseTrainerStep):
        ...

    def add_evaluator(self, evaluator_step: BaseEvaluatorStep):
        ...

    def add_deployment(self, deployment_step: BaseDeployerStep):
        ...

    # Helper functions
    def view_statistics(self, magic: bool = False, port: int = 0):
        ...

    def view_schema(self):
        ...

    def evaluate(self, magic: bool = False, port: int = 0):
        ...

    def download_model(self, out_path: Text = None, overwrite: bool = False):
        ...

    def view_anomalies(self, split_name='eval'):
        ...
The code snippet below shows how quickly you can wrap up a
TrainingPipeline and get it up and running. All you have to do is:
Create an instance of a TrainingPipeline
Add a datasource to your instance
Add the desired steps along with their configuration
Simply run it
Most importantly, even when executing a simple example such as this, you retain the full extent of the advantages that ZenML brings to the table, such as reproducibility, scalability, and collaboration.
from zenml.pipelines import TrainingPipeline

training_pipeline = TrainingPipeline(name='MyFirstPipeline')
training_pipeline.add_datasource(ds)
training_pipeline.add_split(...)
training_pipeline.add_preprocesser(...)
training_pipeline.add_trainer(...)
training_pipeline.add_evaluator(...)
training_pipeline.run()