Organize ML code into ZenML¶
There is a good chance that, as a data scientist, you already have code lying around that you would rather not rewrite just to start using ZenML. The following is a step-by-step guide on how to refactor proof-of-concept (PoC) model training code from a Jupyter notebook into a production-ready ZenML pipeline.
Why should I do this?¶
Putting machine learning models in production is hard. Going from a quick PoC script to a deployed model that stays healthy is usually a long and arduous journey for any ML team. By expressing your ML code as ZenML pipelines, that journey becomes significantly shorter and easier.
A familiar story¶
As a data scientist, the following (pseudo-)code might seem familiar:
import pandas as pd  # ...plus all the other libraries you need

# CELL 1: Read data
df = pd.read_csv("/path/to/file.csv")
df.describe()

# INSERT HERE: 100 more cells, deleted and updated, to explore the data

# CELL 2: Split
train, eval = split_data(df)

# INSERT HERE: figure out if the split worked

# CELL 3: Preprocess
# nice, oh let's normalize
preprocess(train, eval)

# Exploring the preprocessed data, same drill as before

# CELL 4: Train
model = create_model()
model.fit(train, eval)

# If you're lucky, you only need to look at standard metrics like accuracy here. Otherwise:

# CELL 5: Evaluate
evaluate_model(model, eval)

# INSERT HERE: do this a thousand times

# CELL 6: Export (i.e. pickle it)
export_model(model)
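Even before any framework enters the picture, a useful first pass is to pull each cell into a named function with explicit inputs and outputs. The sketch below is purely illustrative (the function names and bodies are placeholder assumptions, not part of any ZenML API), but it exposes the structure hidden in the notebook: load, split, preprocess, train.

import pandas as pd
from sklearn.model_selection import train_test_split

def load_data(path: str) -> pd.DataFrame:
    # The single place where raw data enters the program.
    return pd.read_csv(path)

def split_data(df: pd.DataFrame):
    # A reproducible 70/30 split instead of an ad-hoc notebook cell.
    return train_test_split(df, test_size=0.3, random_state=42)

def preprocess(train: pd.DataFrame, eval: pd.DataFrame):
    # Normalization and friends live here, applied identically to both splits.
    ...

def train_model(train, eval):
    # Model definition and fitting; returns the trained model.
    ...

Each of these functions maps one-to-one onto a ZenML pipeline step, which is exactly the refactoring performed below.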
Step 1: Separate the split¶
Each notebook cell above maps onto a first-class ZenML concept: reading the data becomes a Datasource, CELL 2 a Split step, CELL 3 a Preprocesser, CELL 4 a Trainer, and CELL 5 an Evaluator. Starting with the split, the whole flow can be expressed as a TrainingPipeline:
from random import randint
from zenml.core.datasources.csv_datasource import CSVDatasource
from zenml.core.pipelines.training_pipeline import TrainingPipeline
from zenml.core.steps.evaluator.tfma_evaluator import TFMAEvaluator
from zenml.core.steps.preprocesser.standard_preprocesser.standard_preprocesser import StandardPreprocesser
from zenml.core.steps.split.random_split import RandomSplit
from zenml.core.steps.trainer.tensorflow_trainers.tf_ff_trainer import FeedForwardTrainer
training_pipeline = TrainingPipeline(
    name=f'Experiment {randint(0, 10000)}',
    enable_cache=True
)

# Add a datasource. This will automatically track and version it.
ds = CSVDatasource(name=f'My CSV Datasource {randint(0, 100000)}',
                   path='gs://zenml_quickstart/diabetes.csv')
training_pipeline.add_datasource(ds)

# Add a split
training_pipeline.add_split(RandomSplit(
    split_map={'eval': 0.3, 'train': 0.7}))

# Add a preprocessing unit
training_pipeline.add_preprocesser(
    StandardPreprocesser(
        features=['times_pregnant', 'pgc', 'dbp', 'tst', 'insulin', 'bmi',
                  'pedigree', 'age'],
        labels=['has_diabetes'],
        overwrite={'has_diabetes': {
            'transform': [{'method': 'no_transform', 'parameters': {}}]}}
    ))

# Add a trainer
training_pipeline.add_trainer(FeedForwardTrainer(
    loss='binary_crossentropy',
    last_activation='sigmoid',
    output_units=1,
    metrics=['accuracy'],
    epochs=3))

# Add an evaluator
training_pipeline.add_evaluator(
    TFMAEvaluator(slices=[['has_diabetes']],
                  metrics={'has_diabetes': ['binary_crossentropy',
                                            'binary_accuracy']}))

# Run the pipeline locally
training_pipeline.run()
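Once the run finishes, every artifact of the pipeline is tracked and versioned, so results can be inspected from the same object. The following is a minimal sketch assuming the post-run helpers from the legacy (0.x) TrainingPipeline quickstart, view_statistics() and evaluate(), are available in your installation; check the API reference for your version.

# Compare statistics of the train and eval splits (assumed legacy helper).
training_pipeline.view_statistics()

# Open an evaluation view of the TFMA results (assumed legacy helper).
training_pipeline.evaluate()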
What to do next?¶
Now would be a great time to see what else ZenML has to offer through its standard, powerful abstractions like Pipelines, Steps, Datasources, and Backends.
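As a taste of what the Pipelines and Steps abstractions buy you: because the pipeline above was created with enable_cache=True, iterating on a single step is cheap. The sketch below changes only the trainer, reusing only the classes already imported above (the experiment name is illustrative, and cache reuse across runs is assumed to work as in the legacy API).

# Same pipeline, but only the trainer changes: train for more epochs.
training_pipeline_2 = TrainingPipeline(
    name=f'Experiment {randint(0, 10000)}',
    enable_cache=True
)
training_pipeline_2.add_datasource(ds)
training_pipeline_2.add_split(RandomSplit(
    split_map={'eval': 0.3, 'train': 0.7}))
training_pipeline_2.add_preprocesser(
    StandardPreprocesser(
        features=['times_pregnant', 'pgc', 'dbp', 'tst', 'insulin', 'bmi',
                  'pedigree', 'age'],
        labels=['has_diabetes'],
        overwrite={'has_diabetes': {
            'transform': [{'method': 'no_transform', 'parameters': {}}]}}
    ))
# The cached steps above are not recomputed; only this trainer runs again.
training_pipeline_2.add_trainer(FeedForwardTrainer(
    loss='binary_crossentropy',
    last_activation='sigmoid',
    output_units=1,
    metrics=['accuracy'],
    epochs=10))
training_pipeline_2.run()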