Module core.steps.split.categorical_ratio_split_step
Implementation of the ratio-based categorical split.
Functions
lint_split_map(split_map: Dict[str, float])
: Small utility to lint the split_map
Classes
CategoricalRatioSplit(categorical_column: str, categories: List[Union[str, int]], split_ratio: Dict[str, float], unknown_category_policy: str = 'skip', statistics=None, schema=None)
: Categorical ratio split. Use this to split data based on a list of values
of interest in a single categorical column. A categorical column is
defined here as a column with finitely many values of type `integer` or
`string`. In contrast to the categorical domain split, here categorical
values are assigned to different splits by the corresponding percentages,
defined inside a split ratio object.
Categorical ratio split constructor.
Use this class to split your data based on values in
a single categorical column. A categorical column is defined here as a
column with finitely many values of type `integer` or `string`.
Example usage:
# Split on a categorical attribute called "color", with a defined list
# of categories of interest.
# Half of the categories go entirely into the train set,
# the other half into the eval set. Other colors, e.g. "purple",
# are discarded due to the "skip" flag.
>>> split = CategoricalRatioSplit(
...     categorical_column="color",
...     categories=["red", "green", "blue", "yellow"],
...     split_ratio={"train": 0.5, "eval": 0.5},
...     unknown_category_policy="skip")
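To make the meaning of the ratios in this example concrete, the following is a minimal sketch of one way a category list could be divided among splits. The helper `assign_categories` is hypothetical and not part of ZenML; the actual step may order or round the categories differently.

```python
from typing import Dict, List, Union

Category = Union[str, int]


def assign_categories(
    categories: List[Category], split_ratio: Dict[str, float]
) -> Dict[str, List[Category]]:
    """Hypothetical helper: hand out whole categories to splits by ratio."""
    total = sum(split_ratio.values())
    assignment: Dict[str, List[Category]] = {}
    start = 0
    cumulative = 0.0
    for name, ratio in split_ratio.items():
        cumulative += ratio / total
        # Round the cumulative boundary so that every category
        # lands in exactly one split.
        end = round(cumulative * len(categories))
        assignment[name] = categories[start:end]
        start = end
    return assignment


example = assign_categories(
    ["red", "green", "blue", "yellow"], {"train": 0.5, "eval": 0.5}
)
```

Under these assumptions, each split receives two of the four colors, and every row whose color belongs to a split's categories goes to that split.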
Supply the unknown_category_policy flag to set the unknown category
handling policy. There are two main options:
* Setting unknown_category_policy to any key in the split map indicates
  that all unknown categories should be put into that particular split.
  For example, supplying ``unknown_category_policy="train"`` indicates
  that all unknown categories should go into the training dataset, while
  ``unknown_category_policy="eval"`` indicates that all unknown
  categories should go into the evaluation dataset.
* Setting ``unknown_category_policy="skip"`` indicates that data points
  with unknown categorical values (i.e., values not present in the
  categorical value list) should be taken out of the data set.
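The two policies described above can be sketched as a small lookup. This is an illustration only: `resolve_split` and the `category_to_split` mapping are hypothetical names, and raising an error for an unrecognised policy is an assumption rather than documented behaviour.

```python
from typing import Dict, List, Optional


def resolve_split(
    value: str,
    category_to_split: Dict[str, str],
    split_names: List[str],
    unknown_category_policy: str = "skip",
) -> Optional[str]:
    """Hypothetical helper: return the split for a value, or None to drop it."""
    if value in category_to_split:
        # Known category: use the split it was assigned to.
        return category_to_split[value]
    if unknown_category_policy == "skip":
        # Unknown category with "skip": the data point is discarded.
        return None
    if unknown_category_policy in split_names:
        # Unknown category with a split name: route it to that split.
        return unknown_category_policy
    # Assumption: any other policy value is treated as a configuration error.
    raise ValueError(f"Unknown policy: {unknown_category_policy!r}")


mapping = {"red": "train", "green": "train", "blue": "eval", "yellow": "eval"}
names = ["train", "eval"]
skipped = resolve_split("purple", mapping, names, "skip")
routed = resolve_split("purple", mapping, names, "train")
```

Here `skipped` is `None` (the "purple" row is dropped), while `routed` is `"train"`.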
Args:
statistics: Parsed statistics from a preceding StatisticsGen.
schema: Parsed schema from a preceding SchemaGen.
categorical_column: Name of the categorical column used for
splitting.
categories: List of categorical values found in the categorical
column on which to split.
split_ratio: A dict mapping { split_name: share of categories
in split }, e.g. {"train": 0.5, "eval": 0.5}.
unknown_category_policy: String, indicates how to handle categories
in the data that are not present in the supplied category list.
### Ancestors (in MRO)
* zenml.core.steps.split.base_split_step.BaseSplit
* zenml.core.steps.base_step.BaseStep
### Methods
`get_split_names(self) ‑> List[str]`
: Returns the names of the splits associated with this split step.
Returns:
A list of strings, which are the split names.
`partition_fn(self)`
: Returns the partition function associated with the current split type,
along with keyword arguments used in the signature of the partition
function.
To be eligible for use in a Split Step, the partition_fn has to adhere
to the following design contract:
1. The signature is of the following type:
>>> def partition_fn(element, n, **kwargs) -> int: ...
where n is the number of splits;
2. The partition_fn only returns non-negative integers i less than n, i.e.
0 ≤ i ≤ n - 1.
Returns:
A tuple (partition_fn, kwargs) of the partition function and its
additional keyword arguments (see above).
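A minimal sketch of a function that satisfies this contract is shown below. It assumes, for illustration only, that each element is a plain dict and that the keyword arguments `categorical_column` and `category_to_index` are supplied by the caller; both names are hypothetical and do not reflect the actual ZenML implementation.

```python
from typing import Any, Dict


def example_partition_fn(element: Dict[str, Any], n: int, **kwargs: Any) -> int:
    """Hypothetical partition function following the (element, n, **kwargs) contract."""
    column = kwargs["categorical_column"]
    category_to_index = kwargs["category_to_index"]
    index = category_to_index[element[column]]
    # Enforce the second rule of the contract: 0 <= index <= n - 1.
    if not 0 <= index <= n - 1:
        raise ValueError(f"Partition index {index} is outside [0, {n - 1}]")
    return index


# The step would return the function together with its keyword arguments.
fn_kwargs = {
    "categorical_column": "color",
    "category_to_index": {"red": 0, "green": 0, "blue": 1, "yellow": 1},
}
partition = example_partition_fn({"color": "blue"}, 2, **fn_kwargs)
```

With two splits, the element with color "blue" is sent to partition 1; an index of 2 or more would raise a `ValueError` instead of silently violating the contract.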