Handling big data
Learn about how to manage big data with ZenML.
As your datasets grow, a single‑machine pandas workflow eventually hits its limits. This tutorial walks you through progressively scaling a ZenML pipeline:
Optimizing in‑memory processing for small‑to‑medium data.
Moving to chunked / out‑of‑core techniques when the data no longer fits comfortably in RAM.
Offloading heavy aggregations to a cloud data warehouse like BigQuery.
Plugging in distributed compute engines (Spark, Ray, Dask…) for truly massive workloads.
Pick the section that matches your current bottleneck or read sequentially to see how the techniques build on one another.
Understanding Dataset Size Thresholds
Before diving into specific strategies, it's important to understand the general thresholds where different approaches become necessary:
Small datasets (up to a few GB): These can typically be handled in-memory with standard pandas operations.
Medium datasets (up to tens of GB): Require chunking or out-of-core processing techniques.
Large datasets (hundreds of GB or more): Necessitate distributed processing frameworks.
Optimize in‑memory workflows (up to a few GB)
For datasets that can still fit in memory but are becoming unwieldy, consider these optimizations:
Use efficient data formats: Switch from CSV to more efficient formats like Parquet:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


class ParquetDataset(Dataset):
    def __init__(self, data_path: str):
        self.data_path = data_path

    def read_data(self) -> pd.DataFrame:
        return pq.read_table(self.data_path).to_pandas()

    def write_data(self, df: pd.DataFrame):
        table = pa.Table.from_pandas(df)
        pq.write_table(table, self.data_path)
```

Implement basic data sampling: Add sampling methods to your Dataset classes:
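A minimal sketch of what such a sampling method could look like, using a self-contained `CSVDataset` stand-in for the `Dataset` base class from the custom dataset classes guide (the class name and `sample_data` signature are illustrative, not part of ZenML's API):

```python
import pandas as pd


class CSVDataset:
    """Minimal stand-in for a custom Dataset class."""

    def __init__(self, data_path: str):
        self.data_path = data_path

    def read_data(self) -> pd.DataFrame:
        return pd.read_csv(self.data_path)

    def sample_data(self, fraction: float = 0.1, seed: int = 42) -> pd.DataFrame:
        # Load once, then keep a reproducible random subset of rows so that
        # experimentation and debugging run on a small slice of the data.
        return self.read_data().sample(frac=fraction, random_state=seed)
```

Developing against a 10% sample keeps iteration fast; switch back to `read_data()` for the full run.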
Optimize pandas operations: Use efficient pandas and numpy operations to minimize memory usage:
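One common memory win is downcasting dtypes. A sketch of a helper that shrinks integer and float columns and converts low-cardinality string columns to categoricals (the 50% cardinality threshold is an arbitrary heuristic, not a pandas default):

```python
import pandas as pd


def optimize_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Shrink a DataFrame's memory footprint with cheaper dtypes."""
    df = df.copy()
    for col in df.select_dtypes(include=["int64"]).columns:
        # Downcast integers to the smallest type that fits the values.
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes(include=["float64"]).columns:
        df[col] = pd.to_numeric(df[col], downcast="float")
    for col in df.select_dtypes(include=["object"]).columns:
        # Low-cardinality strings are far smaller stored as categoricals.
        if df[col].nunique() / len(df) < 0.5:
            df[col] = df[col].astype("category")
    return df
```

On wide frames with repeated string values, this alone can cut memory usage several-fold.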
Go out‑of‑core (tens of GB)
When your data no longer fits comfortably in memory, consider these strategies:
Chunk large CSV files
Implement chunking in your Dataset classes to process large files in manageable pieces:
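A sketch of the idea, using pandas' built-in `chunksize` streaming (the class name and `chunk_size` default are illustrative):

```python
from typing import Iterator

import pandas as pd


class ChunkedCSVDataset:
    """Reads a large CSV in fixed-size chunks instead of all at once."""

    def __init__(self, data_path: str, chunk_size: int = 100_000):
        self.data_path = data_path
        self.chunk_size = chunk_size

    def read_chunks(self) -> Iterator[pd.DataFrame]:
        # pandas streams the file, so only one chunk lives in memory at a time.
        yield from pd.read_csv(self.data_path, chunksize=self.chunk_size)


def total_rows(dataset: ChunkedCSVDataset) -> int:
    # Example aggregation that never materializes the full frame.
    return sum(len(chunk) for chunk in dataset.read_chunks())
```

Any aggregation that can be expressed as a fold over chunks (counts, sums, running statistics) fits this pattern.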
Push heavy SQL to your data warehouse
You can push heavy aggregations down to a data warehouse like Google BigQuery and take advantage of its distributed query engine:
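A sketch of a step that runs the aggregation inside BigQuery and pulls back only the small result. It assumes the `google-cloud-bigquery` package is installed, credentials are configured, and a table with `category` and `amount` columns exists (the table schema and `table_id` are placeholders):

```python
import pandas as pd
from google.cloud import bigquery
from zenml import step


@step
def aggregate_in_bigquery(table_id: str) -> pd.DataFrame:
    """Push a heavy GROUP BY down to BigQuery; fetch only the aggregate."""
    client = bigquery.Client()
    query = f"""
        SELECT category, COUNT(*) AS n, AVG(amount) AS avg_amount
        FROM `{table_id}`
        GROUP BY category
    """
    # BigQuery scans the hundreds of GB; only the tiny summary comes back.
    return client.query(query).to_dataframe()
```

The pipeline step stays cheap because it never holds the raw data, only the query result.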
Distribute the workload (hundreds of GB+)
When dealing with very large datasets, you may need to leverage distributed computing frameworks like Apache Spark or Ray. ZenML doesn't have built-in integrations for these frameworks, but you can use them directly within your pipeline steps. Here's how you can incorporate Spark and Ray into your ZenML pipelines:
Plug in Apache Spark
To use Spark within a ZenML pipeline, you simply need to initialize and use Spark within your step function:
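A sketch of such a step, assuming PySpark is installed and the data lives at a Parquet path with a `category` column (both are placeholders):

```python
import pandas as pd
from pyspark.sql import SparkSession
from zenml import step


@step
def spark_aggregation(input_path: str) -> pd.DataFrame:
    """Run a distributed aggregation with Spark inside a ZenML step."""
    spark = SparkSession.builder.appName("zenml-spark-step").getOrCreate()
    try:
        df = spark.read.parquet(input_path)
        result = df.groupBy("category").count()
        # Collect only the (small) aggregate back to the driver.
        return result.toPandas()
    finally:
        # Always release cluster resources, even if the step fails.
        spark.stop()
```

Returning a small pandas DataFrame keeps the step's output compatible with ZenML's standard materializers.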
Note that you'll need to have Spark installed in your environment and ensure that the necessary Spark dependencies are available when running your pipeline.
Plug in Ray
Similarly, to use Ray within a ZenML pipeline, you can initialize and use Ray directly within your step:
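A sketch of the same pattern with Ray, assuming the `ray` package is installed (the doubling transformation and partition count are placeholders for real preprocessing):

```python
import numpy as np
import ray
from zenml import step


@step
def ray_parallel_processing(data: np.ndarray) -> np.ndarray:
    """Fan out per-partition work across a Ray cluster from a ZenML step."""
    ray.init(ignore_reinit_error=True)
    try:
        @ray.remote
        def process_partition(partition: np.ndarray) -> np.ndarray:
            # Placeholder transformation; replace with real preprocessing.
            return partition * 2

        partitions = np.array_split(data, 8)
        futures = [process_partition.remote(p) for p in partitions]
        return np.concatenate(ray.get(futures))
    finally:
        ray.shutdown()
```

Calling `ray.init()` with no address runs locally; pointing it at a cluster address distributes the same code across machines.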
As with Spark, you'll need to have Ray installed in your environment and ensure that the necessary Ray dependencies are available when running your pipeline.
Plug in Dask
Dask is a flexible library for parallel computing in Python. It can be integrated into ZenML pipelines to handle large datasets and parallelize computations. Here's how you can use Dask within a ZenML pipeline:
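A sketch of such an integration. The materializer below persists Dask DataFrames as partitioned Parquet so they can be passed between steps; it assumes the `dask[dataframe]` package is installed and, for simplicity, a filesystem-accessible artifact store URI:

```python
import os

import dask.dataframe as dd
import pandas as pd
from zenml import pipeline, step
from zenml.enums import ArtifactType
from zenml.materializers.base_materializer import BaseMaterializer


class DaskDataFrameMaterializer(BaseMaterializer):
    """Stores Dask DataFrames as partitioned Parquet in the artifact store."""

    ASSOCIATED_TYPES = (dd.DataFrame,)
    ASSOCIATED_ARTIFACT_TYPE = ArtifactType.DATA

    def load(self, data_type: type) -> dd.DataFrame:
        return dd.read_parquet(os.path.join(self.uri, "data"))

    def save(self, df: dd.DataFrame) -> None:
        df.to_parquet(os.path.join(self.uri, "data"))


@step(output_materializers=DaskDataFrameMaterializer)
def create_dask_dataframe() -> dd.DataFrame:
    # Placeholder data; in practice use dd.read_parquet / dd.read_csv.
    pdf = pd.DataFrame({"category": ["a", "b"] * 500, "amount": range(1000)})
    return dd.from_pandas(pdf, npartitions=8)


@step
def process_dask_dataframe(df: dd.DataFrame) -> pd.DataFrame:
    # The groupby stays lazy and parallel until .compute() materializes it.
    return df.groupby("category")["amount"].mean().compute().reset_index()


@pipeline
def dask_pipeline():
    process_dask_dataframe(create_dask_dataframe())
```

Keeping the heavy frame lazy between steps means only the final `.compute()` result is held in memory at once.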
In this example, we've created a custom DaskDataFrameMaterializer to handle Dask DataFrames. The pipeline creates a Dask DataFrame, processes it using Dask's distributed computing capabilities, and then computes the final result.
Speed up single‑node code with Numba
Numba is a just-in-time compiler for Python that can significantly speed up numerical Python code. Here's how you can integrate Numba into a ZenML pipeline:
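A sketch of the pattern, assuming the `numba` package is installed (the normalization function is an illustrative stand-in for your own numerical kernel):

```python
import numpy as np
from numba import njit
from zenml import step


@njit
def fast_normalize(x: np.ndarray) -> np.ndarray:
    """Mean/std normalization in explicit loops that Numba compiles."""
    mean = 0.0
    for v in x:
        mean += v
    mean /= x.shape[0]
    var = 0.0
    for v in x:
        var += (v - mean) ** 2
    std = (var / x.shape[0]) ** 0.5
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = (x[i] - mean) / std
    return out


@step
def numba_step() -> np.ndarray:
    data = np.random.rand(10_000_000)
    # The first call pays a one-time compilation cost; subsequent calls
    # run the compiled machine code.
    return fast_normalize(data)
```

Numba shines on tight numerical loops that are awkward to vectorize; for operations NumPy already vectorizes well, plain NumPy is usually enough.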
The pipeline creates a Numba-accelerated function, applies it to a large NumPy array, and returns the result.
Important Considerations
Environment Setup: Ensure that your execution environment (local or remote) has the necessary frameworks (Spark, Ray, or Dask) installed.
Resource Management: When using these frameworks within ZenML steps, be mindful of resource allocation. The frameworks will manage their own resources, which needs to be coordinated with ZenML's orchestration.
Error Handling: Implement proper error handling and cleanup, especially for shutting down Spark sessions or the Ray runtime.
Data I/O: Consider how data will be passed into and out of the distributed processing step. You might need to use intermediate storage (like cloud storage) for large datasets.
Scaling: While these frameworks allow for distributed processing, you'll need to ensure your infrastructure can support the scale of computation you're attempting.
By incorporating Spark, Ray, or Dask directly into your ZenML steps, you can leverage the power of distributed computing for processing very large datasets while still benefiting from ZenML's pipeline management and versioning capabilities.
Version data externally with LakeFS (50 GB+)
Sometimes the bottleneck isn't compute — it's the artifact store. Serializing a 100 GB DataFrame through ZenML's artifact store on every run is slow and expensive, even if the data hasn't changed. For large, slowly-changing datasets, a better pattern is to keep the heavy data in a dedicated data versioning layer and pass only lightweight references through the pipeline.
LakeFS is a good fit here. It provides Git-like branching, commits, and rollbacks on top of your existing object store (S3, GCS, Azure Blob). ZenML handles pipeline orchestration and lineage; LakeFS handles the data.
The core idea: define a small Pydantic model that points to data in LakeFS, and pass that between steps instead of the data itself:
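A minimal sketch of such a reference model (the field names and the `s3_uri` helper are illustrative, not a lakeFS or ZenML API):

```python
from pydantic import BaseModel


class LakeFSDatasetRef(BaseModel):
    """Lightweight pointer to a dataset version stored in lakeFS."""

    repository: str
    ref: str   # a branch name or, preferably, an immutable commit SHA
    path: str  # object key prefix inside the repository

    @property
    def s3_uri(self) -> str:
        # lakeFS's S3 gateway addresses objects as
        # s3://<repository>/<ref>/<path>.
        return f"s3://{self.repository}/{self.ref}/{self.path}"
```

Because the model is tiny and JSON-serializable, ZenML versions and tracks it like any other artifact at negligible cost.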
Each step reads/writes data directly to LakeFS via its S3-compatible API. ZenML only sees the ~200-byte reference object — the artifact store never touches the heavy data:
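A sketch of what such a step could look like, assuming a `LakeFSDatasetRef` Pydantic model as described above; the endpoint URL, credentials, transformation, and the `commit_lakefs_branch` helper are all hypothetical placeholders:

```python
import pandas as pd
from zenml import step

# Placeholder lakeFS S3-gateway configuration.
LAKEFS_STORAGE_OPTIONS = {
    "key": "<LAKEFS_ACCESS_KEY>",
    "secret": "<LAKEFS_SECRET_KEY>",
    "client_kwargs": {"endpoint_url": "https://lakefs.example.com"},
}


@step
def clean_data(raw: LakeFSDatasetRef, out_branch: str) -> LakeFSDatasetRef:
    # Stream the heavy data straight from lakeFS, not the artifact store.
    df = pd.read_parquet(raw.s3_uri, storage_options=LAKEFS_STORAGE_OPTIONS)
    df = df.dropna()  # placeholder transformation

    out = LakeFSDatasetRef(
        repository=raw.repository, ref=out_branch, path="clean/data.parquet"
    )
    df.to_parquet(out.s3_uri, storage_options=LAKEFS_STORAGE_OPTIONS)

    # Commit the branch and pin the reference to the resulting immutable
    # SHA, so downstream steps always read this exact snapshot.
    commit_sha = commit_lakefs_branch(raw.repository, out_branch)  # hypothetical helper
    return out.model_copy(update={"ref": commit_sha})
```

The step's input and output are both tiny reference objects; the 100 GB DataFrame only ever moves between the step's process and lakeFS.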
The key trick is returning the commit SHA rather than the branch name. This makes downstream steps deterministic — they always read the exact same data snapshot, no matter when they run.
This pattern works well when:
Your datasets are large enough that serializing through the artifact store is a bottleneck (50 GB+).
You want Git-like data versioning (branches, diffs, rollbacks) alongside your pipeline versioning.
Multiple pipelines or teams share the same datasets and need isolation.
For a complete working example with LakeFS, synthetic data generation, validation, and training, see the LakeFS data versioning example.
Choosing the Right Scaling Strategy
When selecting a scaling strategy, consider:
Dataset size: Start with simpler strategies for smaller datasets and move to more complex solutions as your data grows.
Processing complexity: Simple aggregations might be handled by BigQuery, while complex ML preprocessing might require Spark or Ray.
Infrastructure and resources: Ensure you have the necessary compute resources for distributed processing.
Update frequency: Consider how often your data changes and how frequently you need to reprocess it.
Team expertise: Choose technologies that your team is comfortable with or can quickly learn.
Remember, it's often best to start simple and scale up as needed. ZenML's flexible architecture allows you to evolve your data processing strategies as your project grows.
By implementing these scaling strategies, you can extend your ZenML pipelines to handle datasets of any size, ensuring that your machine learning workflows remain efficient and manageable as your projects scale. For more information on creating custom Dataset classes and managing complex data flows, refer back to custom dataset classes.