Module core.steps.split.categorical_ratio_split_step

Implementation of the ratio-based categorical split.

Functions

lint_split_map(split_map: Dict[str, float]) : Small utility to lint the split_map

Classes

CategoricalRatioSplit(categorical_column: str, categories: List[Union[str, int]], split_ratio: Dict[str, float], unknown_category_policy: str = 'skip', statistics=None, schema=None) : Categorical ratio split. Use this to split data based on a list of values of interest in a single categorical column. A categorical column is defined here as a column with finitely many values of type integer or string. In contrast to the categorical domain split, categorical values are assigned to the splits according to the ratios given in `split_ratio`.

Categorical ratio split constructor.

Use this class to split your data based on values in
a single categorical column. A categorical column is defined here as a
column with finitely many values of type `integer` or `string`.

Example usage:

# Split on a categorical attribute called "color", with a defined list
# of categories of interest.
#
# Half of the categories go entirely into the train set,
# the other half into the eval set. Other colors, e.g. "purple",
# are discarded due to the "skip" flag.

>>> split = CategoricalRatioSplit(
...     categorical_column="color",
...     categories=["red", "green", "blue", "yellow"],
...     split_ratio={"train": 0.5, "eval": 0.5},
...     unknown_category_policy="skip")
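
The way a ratio can translate into whole categories per split can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name `assign_categories` and the cumulative-rounding scheme are assumptions made for this example.

```python
from typing import Dict, List, Union

Category = Union[str, int]


def assign_categories(categories: List[Category],
                      split_ratio: Dict[str, float]) -> Dict[str, List[Category]]:
    """Hypothetical helper: slice the category list by cumulative ratio."""
    total = sum(split_ratio.values())  # normalise so ratios need not sum to 1
    n = len(categories)
    result: Dict[str, List[Category]] = {}
    cumulative = 0.0
    for name, ratio in split_ratio.items():
        start = round(n * cumulative)
        cumulative += ratio / total
        end = round(n * cumulative)
        # Each category lands in exactly one split.
        result[name] = categories[start:end]
    return result


split_map = assign_categories(
    ["red", "green", "blue", "yellow"],
    {"train": 0.5, "eval": 0.5},
)
# split_map == {"train": ["red", "green"], "eval": ["blue", "yellow"]}
```

With four categories and a 0.5/0.5 ratio, each split receives two whole categories; every row whose "color" is "red" or "green" would then go to the train set.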

Supply the unknown_category_policy flag to set the unknown category
handling policy. There are two main options:

Setting unknown_category_policy to any split name (a key of
split_ratio) indicates that data points with unknown categories should
be put into that particular split. For example, supplying
``unknown_category_policy="train"`` indicates that all unknown
categories should go into the training dataset, while
``unknown_category_policy="eval"`` indicates that all unknown
categories should go into the evaluation dataset.

Setting ``unknown_category_policy="skip"`` indicates that data points
with unknown categorical values (i.e., values not present in the
categorical value list) should be taken out of the data set.
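
The two policy options described above can be sketched as a small lookup. This is a hedged illustration only: `resolve_split` and the shape of `split_map` (split name to list of categories) are assumptions for this example, not part of the documented API.

```python
from typing import Dict, List, Optional, Union

Category = Union[str, int]


def resolve_split(value: Category,
                  split_map: Dict[str, List[Category]],
                  unknown_category_policy: str = "skip") -> Optional[str]:
    """Hypothetical helper: return the split name for a value, or None to drop it."""
    for name, categories in split_map.items():
        if value in categories:
            return name
    # The value is not in the supplied category list.
    if unknown_category_policy == "skip":
        return None  # the data point is taken out of the data set
    if unknown_category_policy in split_map:
        return unknown_category_policy  # route unknowns into the named split
    raise ValueError(
        f"unknown_category_policy must be 'skip' or one of {list(split_map)}"
    )


split_map = {"train": ["red", "green"], "eval": ["blue", "yellow"]}
known = resolve_split("red", split_map)                 # "train"
skipped = resolve_split("purple", split_map, "skip")    # None
routed = resolve_split("purple", split_map, "eval")     # "eval"
```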

Args:
    statistics: Parsed statistics from a preceding StatisticsGen.
    schema: Parsed schema from a preceding SchemaGen.
    categorical_column: Name of the categorical column used for
     splitting.
    categories: List of categorical values found in the categorical
     column on which to split.
    split_ratio: A dict mapping { split_name: fraction of categories
                            in split }, e.g. {"train": 0.5, "eval": 0.5}.
    unknown_category_policy: String, indicates how to handle categories
     in the data that are not present in the supplied category list.

### Ancestors (in MRO)

* zenml.core.steps.split.base_split_step.BaseSplit
* zenml.core.steps.base_step.BaseStep

### Methods

`get_split_names(self) ‑> List[str]`
:   Returns the names of the splits associated with this split step.
    
    Returns:
        A list of strings, which are the split names.

`partition_fn(self)`
:   Returns the partition function associated with the current split type,
    along with keyword arguments used in the signature of the partition
    function.
    
    To be eligible for use in a split step, the partition_fn has to adhere
    to the following design contract:
    
    1. The signature is of the following type:
    
        >>> def partition_fn(element, n, **kwargs) -> int: ...
    
        where n is the number of splits.
    2. The partition_fn only returns non-negative integers i less than n, i.e. ::
    
            0 ≤ i ≤ n - 1.
    
    Returns:
        A tuple (partition_fn, kwargs) of the partition function and its
         additional keyword arguments (see above).
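
A function that satisfies the design contract above might look like the following sketch. It is not the partition function this class returns: the name `categorical_partition_fn`, the keyword arguments `split_map`, `categorical_column`, and `fallback_index`, and the assumption that each element is a plain dict are all hypothetical and chosen for illustration.

```python
from typing import Any, Dict, List


def categorical_partition_fn(element: Dict[str, Any], n: int, **kwargs) -> int:
    """Map one element to a split index i with 0 <= i <= n - 1."""
    # Hypothetical keyword arguments, supplied alongside the function.
    split_map: Dict[str, List[Any]] = kwargs["split_map"]
    column: str = kwargs["categorical_column"]
    fallback_index: int = kwargs.get("fallback_index", 0)

    value = element[column]
    index = fallback_index  # used when the value matches no listed category
    for i, categories in enumerate(split_map.values()):
        if value in categories:
            index = i
            break

    # Enforce the second point of the contract.
    if not 0 <= index <= n - 1:
        raise ValueError(f"partition index {index} is outside [0, {n - 1}]")
    return index


partition_kwargs = {
    "split_map": {"train": ["red", "green"], "eval": ["blue", "yellow"]},
    "categorical_column": "color",
    "fallback_index": 0,
}
elements = [{"color": "red"}, {"color": "yellow"}, {"color": "purple"}]
n_splits = len(partition_kwargs["split_map"])
indices = [categorical_partition_fn(e, n_splits, **partition_kwargs)
           for e in elements]
# indices == [0, 1, 0]
```

Returning the pair `(categorical_partition_fn, partition_kwargs)` would mirror the tuple shape described in the Returns section.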