Static boundary / alerting for categorical features #139

jeaninejuliettes · 2021-09-27T12:40:29Z

Hi,

First of all, let me say I'm really impressed with both the histogrammar and the popmon packages! Nice job :)
I do have a few questions on the alerting/calculation of boundaries:

Is it possible to use a fixed value as a boundary? For instance, for a variable age I would like the minimum to be 0 or 18.
How does the calculation/alerting work for categorical features? Will you get a warning when the ratio of the values changes, when a new value emerges, or when a value disappears?

Thanks in advance!
Jeanine

Answered by mbaak

mbaak · 2021-09-27T17:09:24Z

mbaak
Sep 27, 2021
Maintainer

Hello,

Thanks for the kind words.

Yes, you can set fixed traffic light boundaries. Simply set the option monitoring_rules when generating the report. For examples, see:
https://github.com/ing-bank/popmon/blob/master/popmon/pipeline/report.py#L81
For any categorical feature, we compare the distribution of a new batch of data with the reference histogram. If the distributions are significantly different, an alert is raised. We don't explicitly check for empty bins, since these can also occur in low-statistics data batches, but if an empty bin causes a significant difference between the two histograms, it will certainly show up in the test statistics, for example in the metric "max_prob_diff", which is the maximum bin difference between two normalized histograms. (This is identical to what is done for numerical features.)
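To make the "max_prob_diff" idea concrete, here is a minimal sketch in plain Python. It is not popmon's implementation: the function name, the dict-based histograms, and the use of the absolute difference are assumptions made for illustration only.

```python
def max_prob_diff(reference, batch):
    """Largest absolute per-label difference between two normalized histograms.

    Illustrative helper only (not popmon's code). Both arguments map a
    category label to its count.
    """
    ref_total = sum(reference.values())
    new_total = sum(batch.values())
    if ref_total == 0 or new_total == 0:
        raise ValueError("both histograms need at least one entry")

    # A label missing from one histogram counts as an empty bin there.
    labels = set(reference) | set(batch)
    return max(
        abs(reference.get(label, 0) / ref_total - batch.get(label, 0) / new_total)
        for label in labels
    )


reference = {"A": 50, "B": 30, "C": 20}
batch = {"A": 20, "B": 30}  # label "C" has disappeared in the new batch
diff = max_prob_diff(reference, batch)
```

In this example the empty bin for "C" is not checked explicitly, but it changes the normalized fractions of all labels, so it still shows up in the resulting value.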

In addition, for any categorical feature we check whether a new batch of data contains new labels, i.e. labels not seen in the reference histogram. If so, the metric "unknown_labels" is set to 1. See:
https://github.com/ing-bank/popmon/blob/master/popmon/analysis/comparison/hist_comparer.py#L109
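The new-label check can be sketched roughly as follows. This is an illustrative helper with assumed names and dict-based histograms, not the code behind the link above; it only mirrors the described behaviour of producing a flag of 1 when the batch contains a label the reference has never seen.

```python
def unknown_labels(reference, batch):
    """Return (flag, new_labels) for a batch compared with a reference.

    Illustrative only: flag is 1 if the batch has at least one label with a
    non-zero count that does not occur in the reference, otherwise 0.
    """
    ref_labels = {label for label, count in reference.items() if count > 0}
    batch_labels = {label for label, count in batch.items() if count > 0}
    new_labels = sorted(batch_labels - ref_labels)
    return int(bool(new_labels)), new_labels


reference = {"cat": 40, "dog": 60}
flag_new, labels_new = unknown_labels(reference, {"cat": 5, "dog": 4, "bird": 1})
flag_same, labels_same = unknown_labels(reference, {"cat": 10})
```

Note that a disappearing label ("dog" in the second call) does not set this flag; as described above, that case is caught through the distribution comparison instead.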

So yes, you should get a warning when the ratio of the values changes or a value disappears, provided the change is significant enough. And by default a warning is raised when a new label is found compared with the reference data.

Hope this helps!

2 replies

jeaninejuliettes Sep 28, 2021
Author

Hi @mbaak,

Thanks for your quick reply! This definitely helps :) I'm still struggling a bit with how to set a fixed monitoring_rule. I've read the code you referred to, but I'm still not sure how this should be done. Could you give an example of a fixed rule for the feature Age, for instance that Age cannot fall below 18 or rise above 100?

Thanks!

jeaninejuliettes Sep 28, 2021
Author

And maybe just one other question I'm curious about: in your own daily work, how do you decide on the monitoring rules for a feature, and when/if you want to deviate from the default? For numerical features I'm currently plotting a histogram together with the default boundaries, which helps me decide whether these boundaries make sense. But are there any better options as far as you're concerned? Especially for the categorical features this seems quite a challenge :)

mbaak · 2021-09-29T10:37:31Z

mbaak
Sep 29, 2021
Maintainer

Hi @jeaninejuliettes,

The monitoring rules are set on metrics derived from the Age distribution for each slice of data.
So, just to be clear, they are not set on individual data points, but on the histogram of data points.

For each histogram we keep track of the maximum and minimum value though. You could do this:

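import popmon  # importing popmon adds pm_stability_report to pandas DataFrames

# Each rule lists the bounds as [red high, yellow high, yellow low, red low].
monitoring_rules = {"Age:min": [120, 120, 0, 0], "Age:max": [120, 120, 0, 0]}
df.pm_stability_report(monitoring_rules=monitoring_rules)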

So both min and max are expected to be between 0 and 120. If not, a red alert is raised.
Here the red and yellow traffic light boundaries are the same. Alternatively, if you also want yellow alerts, you could use:
"Age:min": [120, 100, 18, 0]

Then a yellow alert is raised when a minimum age is found between 0 and 18 or between 100 and 120.
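As a rough illustration of how such a four-value rule maps a metric to a colour, here is a small standalone sketch. It is not popmon's code: the function name is made up, and treating values exactly on a boundary as being inside the band is an assumption.

```python
def traffic_light(value, bounds):
    """Classify a metric value as "green", "yellow" or "red".

    Illustrative only. bounds follows the order used in the examples above:
    [red high, yellow high, yellow low, red low]. Values exactly on a
    boundary are treated as inside the band (an assumption).
    """
    red_high, yellow_high, yellow_low, red_low = bounds
    if value > red_high or value < red_low:
        return "red"
    if value > yellow_high or value < yellow_low:
        return "yellow"
    return "green"


strict = [120, 120, 0, 0]        # red and yellow boundaries coincide
with_yellow = [120, 100, 18, 0]  # adds a yellow band on both sides

results = [
    traffic_light(50, strict),        # inside 0..120
    traffic_light(130, strict),       # above the red high boundary
    traffic_light(10, with_yellow),   # between 0 and 18
    traffic_light(110, with_yellow),  # between 100 and 120
    traffic_light(-1, with_yellow),   # below the red low boundary
]
```

With the first rule a value can only be green or red, because the yellow band has zero width; the second rule leaves room for a yellow warning before the red limits are reached.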

1 reply

jeaninejuliettes Sep 29, 2021
Author

@mbaak, that makes sense, thanks!

mbaak · 2021-09-29T10:43:23Z

mbaak
Sep 29, 2021
Maintainer

Then, on the question of how to set monitoring rules in practice.

The short answer is that how to set them is quite use-case specific.
In my experience, when you monitor distributions over a longer period of time, you find out which distributions are really important and which thresholds you want to use on them.

But the default values in popmon are a good start. They will catch most, if not all, forms of dataset change/shift. In most cases, reporting less than the default settings do is the first step towards producing a use-case-specific popmon report.

0 replies
