I use a JSON-style `feat_configs` list to manage the overall feature transformation and DNN model construction. The format looks like this:
```python
[
    {'name': 'a', 'dtype': 'numerical', 'norm': 'std'},  # 'norm' in ['std', '[0,1]']
    {'name': 'a', 'dtype': 'numerical', 'bins': 10, 'emb_dim': 8},  # discretization
    {'name': 'b', 'dtype': 'category', 'emb_dim': 8, 'hash_buckets': 100},  # category feature with hash_buckets
    {'name': 'c', 'dtype': 'category', 'islist': True, 'emb_dim': 8, 'maxlen': 256},  # sequence feature; if maxlen is set, the sequence is truncated to that length
]
```
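
For orientation, here is a minimal sketch of how such a config list might be consumed. The import path and the `fit_transform`/`transform` method names are assumptions about the `FeatureTransformer` API, not its confirmed interface:

```python
import polars as pl

from feature_transformer import FeatureTransformer  # hypothetical import path; adjust to this project

feat_configs = [
    {'name': 'a', 'dtype': 'numerical', 'norm': 'std'},
    {'name': 'b', 'dtype': 'category', 'emb_dim': 8, 'hash_buckets': 100},
]

df = pl.DataFrame({'a': [1.0, 2.0, 3.0], 'b': ['x', 'y', 'x']})

transformer = FeatureTransformer(feat_configs)  # assumed constructor signature
train_df = transformer.fit_transform(df)        # assumed: fit stats/vocabs, then transform
# Later, on new data:
# test_df = transformer.transform(new_df)
```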
- name: The name of the feature.
- dtype: The data type of the input data. Options are `numerical` or `category`.
- cross: Cross other category columns to construct a new feature. The type is `list[str]`.
- expr: Pre-process the column using any Polars expression. The type is a string containing a Polars `Expr` (see the sketch below).
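
For example, `cross` and `expr` entries might look like the following. The crossed-feature name and the exact string form of the Polars expression are illustrative assumptions:

```python
feat_configs = [
    # new category feature built by crossing two existing category columns
    {'name': 'b_x_c', 'dtype': 'category', 'cross': ['b', 'c'], 'emb_dim': 8},
    # pre-process column 'a' with a Polars expression given as a string
    {'name': 'a_log', 'dtype': 'numerical', 'norm': 'std',
     'expr': "pl.col('a').log1p()"},
]
```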
Some other keys are auto-generated by `FeatureTransformer` and don't need to be set manually:
- type: The feature type for the DNN model. Options are `dense` or `sparse`. In general, numerical data without discretization is treated as `dense`, while category data or numerical data with discretization is treated as `sparse`.
- num_embedding: The number of embeddings in the lookup table.
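
To illustrate, a category entry might look roughly like this after fitting. The exact values, and any keys beyond `type` and `num_embedding`, are assumptions:

```python
# What the user writes:
before = {'name': 'b', 'dtype': 'category', 'emb_dim': 8, 'hash_buckets': 100}

# Roughly what FeatureTransformer fills in after fitting:
after = {
    'name': 'b', 'dtype': 'category', 'emb_dim': 8, 'hash_buckets': 100,
    'type': 'sparse',       # category data -> sparse
    'num_embedding': 100,   # size of the embedding lookup table
}
```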
The remaining keys are configured per feature:
- fillna (Optional): Default value for missing numerical features. Options are:
  - `'mean'`: Replace with the mean value.
  - `'min'`: Replace with the minimum value.
  - `'max'`: Replace with the maximum value.
  - A float number: Replace with the specified value.
- norm (Optional): Normalization method for numerical features. Options are:
  - `'std'`: Standard normalization.
  - `'[0,1]'`: Min-max normalization.
- bins (Optional): Number of bins for discretization. Only one of `norm` and `bins` can be set.
- emb_dim: Embedding dimension for the feature.
- hash_buckets (Optional):
  - If an integer, the number of hash buckets for the feature.
  - If `False` or not set, `FeatureTransformer` will auto-generate a dynamic vocab for the feature. A dynamic vocab means that when fitting on new data, the vocab is updated automatically.
- seed: Seed used to generate the hash integer. To reduce the impact of hash collisions, you can input the same feature as two columns with different names and different `seed` values, then concatenate them when building the `nn.Module` (see the example at the end of this list).
- min_freq (Optional): Minimum frequency for category features, only effective when fitting.
- case_sensitive (Optional): Boolean indicating whether category features are case-sensitive. If `False`, category features are converted to lowercase.
- outliers (Optional): Outliers for category features. Options are:
  - List: Replace with `'category_oov'`.
  - Dict: Replace with the specified value.
- oov (Optional): Default value for out-of-vocabulary category features. Default is `'other'`.
- fillna (Optional): Default value for missing category features. If not set, missing values are not processed and are treated as a new category. If you want them to be treated as out-of-vocabulary, set this key explicitly.
- islist (Optional): Boolean indicating whether the feature is a list (sequence) feature. If `True`, the following keys can also be used to pad the features to the same length (though I'd suggest padding the sequences with a `collate_fn` in PyTorch, as sketched below):
  - padding_value: Default value for padding list features. Default is `None`, which means no padding.
  - maxlen: Maximum length for sequence features. If set, the sequence is truncated to the maximum length.
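
Putting several of these keys together, a fuller config might look like the sketch below (referenced from the `seed` key above). The column names, seed values, and padding value are illustrative, and how the two differently seeded columns map back to the same source column depends on `FeatureTransformer`:

```python
feat_configs = [
    # numerical feature: fill missing values with the mean, then discretize into 10 bins
    {'name': 'price', 'dtype': 'numerical', 'fillna': 'mean', 'bins': 10, 'emb_dim': 8},
    # hash the same source column twice with different seeds; concatenate the two
    # embeddings in the nn.Module to soften the effect of hash collisions
    {'name': 'item_id_h1', 'dtype': 'category', 'hash_buckets': 1000, 'seed': 1, 'emb_dim': 8},
    {'name': 'item_id_h2', 'dtype': 'category', 'hash_buckets': 1000, 'seed': 2, 'emb_dim': 8},
    # category feature with vocab controls
    {'name': 'city', 'dtype': 'category', 'emb_dim': 8,
     'min_freq': 5, 'case_sensitive': False, 'oov': 'other'},
    # sequence (list) feature: truncate to 256 items and pad with 0
    {'name': 'clicked_items', 'dtype': 'category', 'islist': True,
     'emb_dim': 8, 'maxlen': 256, 'padding_value': 0},
]
```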
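
If you prefer to pad in the DataLoader as suggested above, a minimal `collate_fn` sketch could look like this. The field name `clicked_items` and the batch layout (a list of dicts holding variable-length sequences) are assumptions about your dataset:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_fn(batch):
    """Pad the variable-length 'clicked_items' field to a common length per batch."""
    # batch: list of samples, each a dict with a 1-D sequence of item ids
    seqs = [torch.as_tensor(sample['clicked_items']) for sample in batch]
    padded = pad_sequence(seqs, batch_first=True, padding_value=0)
    lengths = torch.tensor([len(s) for s in seqs])
    return {'clicked_items': padded, 'lengths': lengths}

# loader = torch.utils.data.DataLoader(dataset, batch_size=32, collate_fn=collate_fn)
```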