Module core.steps.split.random_split

Implementation of a random split of the input data set.

Functions

RandomSplitPartitionFn(element: Any, num_partitions: int, split_map: Dict[str, float]) ‑> int : Function for a random split of the data; to be used in a beam.Partition. This function implements a simple random split algorithm by drawing integers from a categorical distribution defined by the values in split_map.

Args:
    element: Data point, in format tf.train.Example.
    num_partitions: Number of splits, unused here.
    split_map: Dict mapping {split_name: fraction of data in split}; the values must sum to 1.

Returns:
    An integer n, where 0 ≤ n ≤ num_partitions - 1.
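As an illustration of the sampling described above, here is a minimal sketch that uses only the Python standard library. It is not the module's actual implementation: the name `sketch_partition_fn` is hypothetical, and the use of `random.choices` is an assumption made for this example.

```python
import random
from typing import Any, Dict


def sketch_partition_fn(element: Any,
                        num_partitions: int,
                        split_map: Dict[str, float]) -> int:
    """Hypothetical stand-in for a random-split partition function.

    Draws a partition index from the categorical distribution whose
    probabilities are the values of split_map (in dict order).
    """
    probabilities = list(split_map.values())
    # random.choices returns a list of k samples; we draw a single index.
    # num_partitions is not needed because the split_map already fixes
    # the number of splits.
    return random.choices(range(len(probabilities)),
                          weights=probabilities, k=1)[0]


index = sketch_partition_fn(
    element=None,
    num_partitions=3,
    split_map={"train": 0.334, "eval": 0.333, "test": 0.333},
)
```

The element itself does not influence the draw, so every data point is assigned independently of its content.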

lint_split_map(split_map: Dict[str, float]) : Small utility to lint the split_map

Classes

RandomSplit(split_map: Dict[str, float], statistics=None, schema=None) : Random split. Use this to randomly split data based on a cumulative distribution function defined by a split_map dict.

Random split constructor.

Randomly split the data based on a cumulative distribution function
defined by split_map.

Example usage:

# Split data randomly, but evenly into train, eval and test

>>> split = RandomSplit(
... split_map = {"train": 0.334,
...              "eval": 0.333,
...              "test": 0.333})

Here, each data split gets assigned about one third of the probability
mass. The split is carried out by sampling from the categorical
distribution defined by the values p_i in the split map, i.e.

P(index = i) = p_i, i = 1,...,n ;

where n is the number of splits defined in the split map. Hence, the
values in the split map must sum up to 1. For more information, see
https://en.wikipedia.org/wiki/Categorical_distribution.
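The relation between the split map and a cumulative distribution function can be sketched as follows. This is an illustrative example under stated assumptions, not the class's internal code: the helpers `cumulative_bounds` and `draw_index` are hypothetical names introduced here. The sketch checks that the values sum to 1, accumulates them into upper bounds, and maps a uniform draw `u` in [0, 1) to a split index.

```python
import bisect
import itertools
import math
from typing import Dict, List


def cumulative_bounds(split_map: Dict[str, float]) -> List[float]:
    """Return the cumulative sums of the split probabilities."""
    probabilities = list(split_map.values())
    if not math.isclose(sum(probabilities), 1.0, abs_tol=1e-9):
        raise ValueError("split_map values must sum to 1.")
    return list(itertools.accumulate(probabilities))


def draw_index(bounds: List[float], u: float) -> int:
    """Map a uniform draw u in [0, 1) to a zero-based split index."""
    # min() guards against rounding that leaves the last bound
    # marginally below 1.
    return min(bisect.bisect_right(bounds, u), len(bounds) - 1)


bounds = cumulative_bounds({"train": 0.5, "eval": 0.25, "test": 0.25})
train_index = draw_index(bounds, 0.1)  # u falls in the first interval
test_index = draw_index(bounds, 0.9)   # u falls in the last interval
```

Note that the indices here are zero-based, matching the return value of a partition function, whereas the formula above numbers the splits from 1 to n.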


Args:
    statistics: Parsed statistics from a preceding StatisticsGen.
    schema: Parsed schema from a preceding SchemaGen.
    split_map: A dict { split_name: fraction of data in split }; the values must sum to 1.

### Ancestors (in MRO)

* zenml.core.steps.split.base_split_step.BaseSplit
* zenml.core.steps.base_step.BaseStep

### Methods

`get_split_names(self) ‑> List[str]`
:   Returns the names of the splits associated with this split step.
    
    Returns:
        A list of strings, which are the split names.

`partition_fn(self)`
:   Returns the partition function associated with the current split type,
    along with keyword arguments used in the signature of the partition
    function.
    
    To be eligible for use in a Split Step, the partition_fn has to adhere
    to the following design contract:
    
    1. The signature is of the following type:
    
        >>> def partition_fn(element, n, **kwargs) -> int: ...
    
        where n is the number of splits;
    2. The partition_fn only returns non-negative integers i less than n, i.e. ::
    
            0 ≤ i ≤ n - 1.
    
    Returns:
        A tuple (partition_fn, kwargs) of the partition function and its
        additional keyword arguments (see above).
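The design contract above can be illustrated with a small, self-contained sketch. Everything in it is hypothetical and introduced only for this example: the function `crc_partition_fn`, its `salt` keyword argument, and the deterministic CRC-based assignment, which deliberately differs from the random split documented here. The point is only to show a function with the required signature, a return value in the range 0 to n - 1, and a `(partition_fn, kwargs)` tuple being unpacked by a caller.

```python
import zlib
from typing import Any, Callable, Dict, Tuple


def crc_partition_fn(element: Any, n: int, **kwargs: Any) -> int:
    """Hypothetical partition function that satisfies the contract.

    Assigns an element to one of n partitions via a CRC32 checksum of
    its repr, so the same element always lands in the same partition.
    """
    salt = kwargs.get("salt", "")  # hypothetical keyword argument
    digest = zlib.crc32((salt + repr(element)).encode("utf-8"))
    # The modulo guarantees 0 <= result <= n - 1.
    return digest % n


def sketch_partition_fn_getter() -> Tuple[Callable[..., int], Dict[str, Any]]:
    """Mimics the return shape described above: (partition_fn, kwargs)."""
    return crc_partition_fn, {"salt": "demo"}


fn, fn_kwargs = sketch_partition_fn_getter()
partition_index = fn({"feature": 1.0}, 3, **fn_kwargs)
```

A caller only needs the tuple: it passes each element and the number of splits positionally and forwards the keyword arguments unchanged.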