Feature Configurations (feat_configs)

I use a JSON-style feat_configs list to manage the overall feature transform and DNN model construction. The format looks like:

[
    {'name': 'a', 'dtype': 'numerical', 'norm': 'std'},                     # 'norm' in ['std', '[0,1]']
    {'name': 'a', 'dtype': 'numerical', 'bins': 10, 'emb_dim': 8},          # alternative: discretize 'a' into 10 bins
    {'name': 'b', 'dtype': 'category', 'emb_dim': 8, 'hash_buckets': 100},  # category feature with hash buckets
    {'name': 'c', 'dtype': 'category', 'islist': True, 'emb_dim': 8, 'maxlen': 256},  # sequence feature; if maxlen is set, the sequence is truncated to that length
]

Common Keys

  • name (Required): The name of the feature.
  • dtype (Required): The data type of the input data. Options are numerical or category.
  • cross (Optional): Cross with other category columns to construct a new feature. The type is list[str].
  • expr (Optional): Pre-process the column using any polars expression. The type is a str containing a polars Expr.
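To make the cross and expr keys concrete, here is a hypothetical config fragment (the column names and expression string are made up for illustration; the expr string would be evaluated against polars at transform time):

```python
# Hypothetical configs illustrating 'cross' and 'expr':
feat_configs = [
    # Cross two category columns 'a' and 'b' into a new sparse feature.
    {'name': 'a_x_b', 'dtype': 'category', 'cross': ['a', 'b'], 'emb_dim': 8},
    # Pre-process a column with a polars expression given as a string,
    # then apply standard normalization.
    {'name': 'price', 'dtype': 'numerical',
     'expr': "pl.col('price').log1p()", 'norm': 'std'},
]
```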

Some other keys are auto-generated by FeatureTransformer and do not need to be set manually:

  • type: The feature type for the DNN model. Options are dense or sparse. In general, numerical data without discretization is treated as dense, while category data or discretized numerical data is treated as sparse.
  • num_embedding: The number of embeddings in the lookup table.
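The dense/sparse rule above can be sketched as a small helper (an assumption about the behavior, not FeatureTransformer's actual implementation):

```python
def derive_type(cfg: dict) -> str:
    """Sketch: numerical data without discretization -> 'dense';
    category data, or numerical data with 'bins', -> 'sparse'."""
    if cfg['dtype'] == 'numerical' and 'bins' not in cfg:
        return 'dense'
    return 'sparse'
```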

Numerical Features

  • fillna (Optional): Default value for missing numerical features. Options are:
    • 'mean': Replace with the mean value.
    • 'min': Replace with the minimum value.
    • 'max': Replace with the maximum value.
    • Float number: Replace with the specified float number.
  • norm (Optional): Normalization method for numerical features. Options are:
    • 'std': Standard normalization.
    • '[0,1]': Min-max normalization.
  • bins (Optional): Number of bins for discretization. Only one of norm and bins can be set.
    • emb_dim: Embedding dimension for the discretized feature.
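The three numerical options can be illustrated on a toy column (a minimal sketch; equal-width binning is one possible scheme here, the library may bin differently, e.g. by quantiles):

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0]  # toy numerical column

# 'std': standard normalization, (x - mean) / std
mean, std = statistics.mean(values), statistics.pstdev(values)
std_norm = [(x - mean) / std for x in values]

# '[0,1]': min-max normalization, (x - min) / (max - min)
lo, hi = min(values), max(values)
minmax = [(x - lo) / (hi - lo) for x in values]

# 'bins': discretization into bucket indices (equal-width sketch),
# which turns the dense feature into a sparse one with an embedding.
bins = 2
width = (hi - lo) / bins
bucket = [min(int((x - lo) / width), bins - 1) for x in values]
```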

Category Features

  • emb_dim: Embedding dimension for the feature.
  • hash_buckets (Optional):
    • If an integer, the number of hash buckets for the feature.
    • If False or not set, FeatureTransformer will automatically build a dynamic vocab for the feature; a dynamic vocab is updated automatically when fitting on new data.
    • seed: Seed used to generate the hash integers. To reduce the impact of hash collisions, you can input the same column twice under different names with different seed values, then concatenate the two embeddings when building the nn.Module.
  • min_freq (Optional): Minimum frequency for category features, only effective when fitting.
  • case_sensitive (Optional): Boolean indicating whether category features are case-sensitive. If False, category values are converted to lowercase.
  • outliers (Optional): Outlier values for category features. Options are:
    • List: listed values are replaced with 'category_oov'.
    • Dict: keys are replaced with the specified values.
  • oov (Optional): Default value for out-of-vocabulary category features. Default is 'other'.
  • fillna (Optional): Default value for missing category features. If not set, missing values are left untouched and treated as a new category. If you want missing values treated as out-of-vocabulary, set it explicitly here.
  • islist (Optional): Boolean indicating if the feature is a list (sequence) feature. If True, the following keys control padding the sequences to the same length (though I'd suggest padding via a collate_fn in PyTorch):
    • padding_value: Value used for padding list features. Default is None, meaning no padding.
    • maxlen: Maximum length for sequence features. If set, sequences are truncated to this length.
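The seeded hash-bucket idea above can be sketched as follows (the hash function and the `hash_bucket` helper are illustrative assumptions, not the library's actual implementation):

```python
import hashlib

def hash_bucket(value: str, hash_buckets: int, seed: int = 0) -> int:
    """Illustrative seeded hashing of a category value into a bucket index.
    md5 over a seed-prefixed string is an assumption; any stable seeded
    hash would serve the same purpose."""
    digest = hashlib.md5(f"{seed}:{value}".encode()).hexdigest()
    return int(digest, 16) % hash_buckets

# Two seeded copies of the same column: values that collide under one seed
# are unlikely to also collide under the other, so concatenating both
# embeddings in the nn.Module dampens the effect of collisions.
b1 = hash_bucket("user_123", 100, seed=1)
b2 = hash_bucket("user_123", 100, seed=2)
```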