Chapter 2
Add some normalization
If you want to see the code for this chapter of the guide, head over to the GitHub repository.

Normalize the data.

Before writing any trainers, we can normalize our data to get better results. To do this, let's add another step and make the pipeline a bit more complex.

Create steps

We can think of this as a normalizer step that takes data from the importer and normalizes it:
```python
import numpy as np

from zenml.steps import step, Output  # exact import path depends on your ZenML version


# Add another step
@step
def normalize_mnist(
    X_train: np.ndarray, X_test: np.ndarray
) -> Output(X_train_normed=np.ndarray, X_test_normed=np.ndarray):
    """Normalize the values for all the images so they are between 0 and 1."""
    X_train_normed = X_train / 255.0
    X_test_normed = X_test / 255.0
    return X_train_normed, X_test_normed
```
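The arithmetic inside the step is plain NumPy: dividing uint8 pixel values by 255.0 upcasts the array to float and maps every value into [0, 1]. A standalone sketch with synthetic data (not the real MNIST arrays):

```python
import numpy as np

# Synthetic stand-in for a batch of MNIST images: uint8 values in [0, 255]
X = np.random.randint(0, 256, size=(4, 28, 28), dtype=np.uint8)

# The same operation the step performs
X_normed = X / 255.0

print(X_normed.dtype)  # float64: true division of an integer array upcasts
print(X_normed.shape)  # (4, 28, 28): the shape is unchanged
print(float(X_normed.min()) >= 0.0, float(X_normed.max()) <= 1.0)
```

Note that the division changes the dtype as well as the values, which is why the step declares its outputs as new artifacts rather than mutating the inputs.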
And now our pipeline looks like this:
```python
from zenml.pipelines import pipeline


@pipeline
def load_and_normalize_pipeline(
    importer,
    normalizer,
):
    """Pipeline now has two steps we need to connect together."""
    X_train, y_train, X_test, y_test = importer()
    normalizer(X_train=X_train, X_test=X_test)
```
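To make the dataflow the pipeline describes concrete, here is a plain-Python sketch with no ZenML machinery at all; the function bodies are stand-ins for the real steps, with zero arrays in place of the actual MNIST data:

```python
import numpy as np

def importer():
    """Stand-in for importer_mnist: returns train/test images and labels."""
    X_train = np.zeros((60000, 28, 28), dtype=np.uint8)
    y_train = np.zeros(60000, dtype=np.uint8)
    X_test = np.zeros((10000, 28, 28), dtype=np.uint8)
    y_test = np.zeros(10000, dtype=np.uint8)
    return X_train, y_train, X_test, y_test

def normalizer(X_train, X_test):
    """Stand-in for normalize_mnist: scales pixel values into [0, 1]."""
    return X_train / 255.0, X_test / 255.0

# The same wiring as load_and_normalize_pipeline: the importer's image
# outputs feed the normalizer; the labels pass through untouched.
X_train, y_train, X_test, y_test = importer()
X_train_normed, X_test_normed = normalizer(X_train=X_train, X_test=X_test)
print(X_train_normed.shape, X_test_normed.shape)  # (60000, 28, 28) (10000, 28, 28)
```

The pipeline function itself only declares this wiring; the orchestrator decides when each step actually executes.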

Run

You can run this as follows:
```shell
python chapter_2.py
```
The output will look something like this (filtered to highlight the most important logs):
```
Creating pipeline: load_and_normalize_pipeline
Cache enabled for pipeline `load_and_normalize_pipeline`
Using orchestrator `local_orchestrator` for pipeline `load_and_normalize_pipeline`. Running pipeline..
Step `importer_mnist` has started.
Step `importer_mnist` has finished in 1.751s.
Step `normalize_mnist` has started.
Step `normalize_mnist` has finished in 1.848s.
```

Inspect

You can add the following code to fetch the pipeline:
```python
from zenml.core.repo import Repository

repo = Repository()
p = repo.get_pipeline(pipeline_name="load_and_normalize_pipeline")
runs = p.runs
print(f"Pipeline `load_and_normalize_pipeline` has {len(runs)} run(s)")
run = runs[-1]
print(f"The run you just made has {len(run.steps)} steps.")
step = run.get_step('normalizer')
print(f"The `normalizer` step has {len(step.outputs)} output artifacts.")
for k, o in step.outputs.items():
    arr = o.read()
    print(f"Output '{k}' is an array with shape: {arr.shape}")
```
You will get the following output:
```
Pipeline `load_and_normalize_pipeline` has 1 run(s)
The run you just made has 2 steps.
The `normalizer` step has 2 output artifacts.
Output 'X_train_normed' is an array with shape: (60000, 28, 28)
Output 'X_test_normed' is an array with shape: (10000, 28, 28)
```
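Beyond the shapes, you could also confirm that each artifact you read back really is normalized; the check itself is one line of NumPy. Here `arr` is a small stand-in for an array returned by `o.read()`:

```python
import numpy as np

# Stand-in for an array read back from an output artifact via o.read()
arr = np.array([[0, 128, 255]], dtype=np.uint8) / 255.0

# A properly normalized artifact lies entirely within [0, 1]
assert 0.0 <= arr.min() and arr.max() <= 1.0
print(f"min={arr.min()}, max={arr.max()}")  # min=0.0, max=1.0
```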
This confirms again that the data is stored properly! Now we are ready to create some trainers.