Managing machine learning datasets

Model datasets using simple abstractions.

As machine learning projects grow in complexity, you often need to work with various data sources and manage intricate data flows. This chapter explores how to use custom Dataset classes and Materializers in ZenML to handle these challenges efficiently. For strategies on scaling your data processing for larger datasets, refer to scaling strategies for big data.

Introduction to Custom Dataset Classes

In this tutorial you will learn how to model complex and heterogeneous data sources in ZenML by

  1. Defining a Dataset base class;

  2. Implementing concrete subclasses for CSV files and BigQuery tables;

  3. Writing Materializers so ZenML can persist and reload those objects; and

  4. Wiring everything together inside a pipeline.

Custom Dataset classes in ZenML provide a way to encapsulate data loading, processing, and saving logic for different data sources. They're particularly useful when:

  1. Working with multiple data sources (e.g., CSV files, databases, cloud storage)

  2. Dealing with complex data structures that require special handling

  3. Implementing custom data processing or transformation logic

Implementing Dataset Classes for Different Data Sources

Let's create a base Dataset class and implement it for CSV and BigQuery data sources:

from abc import ABC, abstractmethod
import pandas as pd
from google.cloud import bigquery
from typing import Optional

class Dataset(ABC):
    @abstractmethod
    def read_data(self) -> pd.DataFrame:
        pass

class CSVDataset(Dataset):
    def __init__(self, data_path: str, df: Optional[pd.DataFrame] = None):
        self.data_path = data_path
        self.df = df

    def read_data(self) -> pd.DataFrame:
        if self.df is None:
            self.df = pd.read_csv(self.data_path)
        return self.df

class BigQueryDataset(Dataset):
    def __init__(
        self,
        table_id: str,
        df: Optional[pd.DataFrame] = None,
        project: Optional[str] = None,
    ):
        self.table_id = table_id
        self.project = project
        self.df = df
        self.client = bigquery.Client(project=self.project)

    def read_data(self) -> pd.DataFrame:
        query = f"SELECT * FROM `{self.table_id}`"
        self.df = self.client.query(query).to_dataframe()
        return self.df

    def write_data(self) -> None:
        job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
        job = self.client.load_table_from_dataframe(self.df, self.table_id, job_config=job_config)
        job.result()

Creating Custom Materializers

Materializers in ZenML handle the serialization and deserialization of artifacts. Custom Materializers are essential for working with custom Dataset classes:

Managing Complexity in Pipelines with Multiple Data Sources

When working with multiple data sources, it's crucial to design flexible pipelines that can handle different scenarios. Here's an example of how to structure a pipeline that works with both CSV and BigQuery datasets:

Best Practices for Designing Flexible and Maintainable Pipelines

When working with custom Dataset classes in ZenML pipelines, it's crucial to design your pipelines to accommodate various data sources and processing requirements.

Here are some best practices to ensure your pipelines remain flexible and maintainable:

  1. Use a common base class: The Dataset base class allows for consistent handling of different data sources within your pipeline steps. This abstraction enables you to swap out data sources without changing the overall pipeline structure.

  1. Create specialized steps to load the right dataset: Implement separate steps to load different datasets, while keeping underlying steps standardized.

  1. Implement flexible pipelines: Design your pipelines to adapt to different data sources or processing requirements. You can use configuration parameters or conditional logic to determine which steps to execute.

  1. Modular step design: Focus on creating steps that perform specific tasks (e.g., data loading, transformation, analysis) that can work with different dataset types. This promotes code reuse and ease of maintenance.

By following these practices, you can create ZenML pipelines that efficiently handle complex data flows and multiple data sources while remaining adaptable to changing requirements. This approach allows you to leverage the power of custom Dataset classes throughout your machine learning workflows, ensuring consistency and flexibility as your projects evolve.

Next steps

Last updated

Was this helpful?