Module core.steps.split.categorical_domain_split_step

Implementation of the categorical domain split.

Functions

CategoricalPartitionFn(element: Any, num_partitions: int, categorical_column: str, split_map: Dict[str, List[Union[str, int]]], unknown_category_policy: str) ‑> int : Function for a categorical split on data, to be used in a beam.Partition.

Args:
    element: Data point, given as a tf.train.Example.
    num_partitions: Number of splits, unused here.
    categorical_column: Name of the categorical column in the data on
     which to perform the split.
    split_map: Dict {split_name: [category_list]} mapping the categorical
     values in categorical_column to their respective splits.
    unknown_category_policy: Text, identifier on how to handle categorical
     values not present in the split_map.

Returns:
    An integer n, where 0 ≤ n ≤ num_partitions - 1.
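
The mapping described above can be sketched in plain Python. This is a minimal illustration and not the module's implementation: it assumes elements are plain dicts rather than tf.train.Example protos, the name `categorical_partition_sketch` is hypothetical, and modeling "skip" as one extra trailing partition is an assumption made only to keep the return value an integer.

```python
from typing import Any, Dict, List, Union

SplitMap = Dict[str, List[Union[str, int]]]


def categorical_partition_sketch(
    element: Dict[str, Any],
    num_partitions: int,  # unused, kept to mirror the documented signature
    categorical_column: str,
    split_map: SplitMap,
    unknown_category_policy: str,
) -> int:
    """Return the index of the split that `element` falls into."""
    value = element[categorical_column]
    split_names = list(split_map)  # dicts preserve insertion order

    # Known category: index of the first split whose list contains it.
    for index, name in enumerate(split_names):
        if value in split_map[name]:
            return index

    # Unknown category, policy names an existing split: route it there.
    if unknown_category_policy in split_map:
        return split_names.index(unknown_category_policy)

    # Assumption: any other policy (e.g. "skip") maps to an extra trailing
    # partition that the caller is expected to discard.
    return len(split_names)
```

With `split_map = {"train": ["red", "blue"], "eval": ["green", "yellow"]}`, a "red" element maps to index 0 and a "green" one to index 1. Under this sketch's assumption, a "purple" element maps to index 2 when the policy is "skip".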

lint_split_map(split_map: Dict[str, List[Union[str, int]]]) : Small utility to lint the split_map

Classes

CategoricalDomainSplit(categorical_column: str, split_map: Dict[str, List[Union[str, int]]], unknown_category_policy: str = 'skip', statistics=None, schema=None) : Categorical domain split. Use this to split data based on values in a single categorical column. A categorical column is defined here as a column with finitely many values of type integer or string.

Categorical domain split constructor.

Use this class to split your data based on values in
a single categorical column. A categorical column is defined here as a
column with finitely many values of type `integer` or `string`.

Example usage:

# Split on a categorical attribute called "color".
# Red and blue datapoints go into the train set,
# green and yellow ones go into the eval set. Other colors,
# e.g. "purple", are discarded due to the "skip" flag.

>>> split = CategoricalDomainSplit(
...     categorical_column="color",
...     split_map={"train": ["red", "blue"],
...                "eval": ["green", "yellow"]},
...     unknown_category_policy="skip")

Supply the ``unknown_category_policy`` flag to set the unknown
category handling policy. There are two main options:

Setting ``unknown_category_policy`` to any key in the split map
indicates that data points with unknown categories (values not listed
anywhere in the split map) should be put into that particular split.
For example, supplying ``unknown_category_policy="train"`` sends all
unknown categories to the training dataset, while
``unknown_category_policy="eval"`` sends all unknown categories to
the evaluation dataset.

Setting ``unknown_category_policy="skip"`` indicates that data points
with unknown categorical values (i.e., values not present in any of the
categorical value lists inside the split map) should be taken out of
the data set.
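
The effect of the two policies can be illustrated with a small standalone sketch. The helpers `resolve_split` and `split_records` are hypothetical names and not part of this module. Records are assumed to be plain dicts, and raising a `ValueError` for an unrecognized policy is an assumption of this sketch, not documented behavior.

```python
from typing import Any, Dict, List, Optional, Union

SplitMap = Dict[str, List[Union[str, int]]]


def resolve_split(value: Any, split_map: SplitMap, policy: str) -> Optional[str]:
    """Return the split name for `value`, or None if it should be dropped."""
    for name, categories in split_map.items():
        if value in categories:
            return name
    if policy in split_map:
        return policy  # unknown values go into the named split
    if policy == "skip":
        return None  # unknown values are taken out of the data set
    # Assumption: anything else is treated as a configuration error.
    raise ValueError(f"Unrecognized unknown_category_policy: {policy!r}")


def split_records(
    records: List[Dict[str, Any]], column: str, split_map: SplitMap, policy: str
) -> Dict[str, List[Dict[str, Any]]]:
    """Group records by split name, dropping those that resolve to None."""
    result: Dict[str, List[Dict[str, Any]]] = {name: [] for name in split_map}
    for record in records:
        name = resolve_split(record[column], split_map, policy)
        if name is not None:
            result[name].append(record)
    return result


records = [{"color": "red"}, {"color": "green"}, {"color": "purple"}]
split_map = {"train": ["red", "blue"], "eval": ["green", "yellow"]}
skipped = split_records(records, "color", split_map, "skip")
to_train = split_records(records, "color", split_map, "train")
```

Here `skipped` holds one record per split because the "purple" record is dropped, whereas `to_train` places the "purple" record in the train split alongside the "red" one.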

Args:
    statistics: Parsed statistics artifact from a preceding
     StatisticsGen.
    schema: Parsed schema artifact from a preceding SchemaGen.
    categorical_column: Name of the categorical column in the data on
     which to split.
    split_map: A dict { split_name: [categorical_values] } mapping
     categorical values to their respective splits.
    unknown_category_policy: String, indicates how to handle categories
     in the data that are not present in the split map.

### Ancestors (in MRO)

* zenml.core.steps.split.base_split_step.BaseSplit
* zenml.core.steps.base_step.BaseStep

### Methods

`get_split_names(self) ‑> List[str]`
:   Returns the names of the splits associated with this split step.
    
    Returns:
        A list of strings, which are the split names.

`partition_fn(self)`
:   Returns the partition function associated with the current split type,
    along with keyword arguments used in the signature of the partition
    function.
    
    To be eligible for use in a split step, the partition_fn has to adhere
    to the following design contract:
    
    1. The signature is of the following type:
    
        >>> def partition_fn(element, n, **kwargs) -> int
    
        where n is the number of splits;
    2. The partition_fn only returns non-negative integers i less than n, i.e. ::
    
            0 ≤ i ≤ n - 1.
    
    Returns:
        A tuple (partition_fn, kwargs) of the partition function and its
         additional keyword arguments (see above).
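
As a rough illustration of the design contract, the sketch below defines a function with the required signature and a small driver that consumes a `(partition_fn, kwargs)` tuple and enforces the index bound. Both `crc_partition_fn` and `apply_partition` are hypothetical names, as is the `key` keyword argument. The driver stands in for what a Beam partition step would do and is not taken from this library.

```python
import zlib
from typing import Any, Callable, Dict, List, Tuple


def crc_partition_fn(element: Dict[str, Any], n: int, **kwargs: Any) -> int:
    """Assign an element to one of n splits via a stable checksum of a field."""
    key = kwargs.get("key", "id")  # hypothetical keyword argument
    digest = zlib.crc32(str(element[key]).encode("utf-8"))
    return digest % n  # always within 0..n-1


def apply_partition(
    elements: List[Dict[str, Any]],
    n: int,
    fn_and_kwargs: Tuple[Callable[..., int], Dict[str, Any]],
) -> List[List[Dict[str, Any]]]:
    """Distribute elements into n buckets, checking the contract on the way."""
    partition_fn, kwargs = fn_and_kwargs
    buckets: List[List[Dict[str, Any]]] = [[] for _ in range(n)]
    for element in elements:
        index = partition_fn(element, n, **kwargs)
        if not 0 <= index <= n - 1:
            raise ValueError(f"partition index {index} outside 0..{n - 1}")
        buckets[index].append(element)
    return buckets
```

Calling `apply_partition(elements, 3, (crc_partition_fn, {"key": "id"}))` returns three lists that together contain every input element exactly once.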