Static boundary / alerting for categorical features #139
-
Hi, first of all let me say I'm really impressed with both the histogrammar and the popmon package! Nice job :)
Thanks in advance!
Replies: 3 comments 3 replies
-
Hello, Thanks for the kind words.
In addition, for any categorical feature we check whether a new batch of data contains new labels, i.e. labels not seen in the reference histogram. If so, the metric "unknown_labels" is set to 1.
So yes, you should get a warning when the ratio of the values changes or a value disappears, provided that change is significant enough. And by default a warning is raised when a new label is found compared with the reference data. Hope this helps!
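To make the new-label check concrete, here is a minimal sketch in plain Python. It is not popmon's implementation: the function name `unknown_labels_flag` and the sample data are made up for illustration, and it only mimics the idea of setting a flag to 1 when a batch contains labels absent from the reference.

```python
def unknown_labels_flag(reference_labels, batch_labels):
    """Return (flag, new_labels).

    flag is 1 if the batch contains at least one label that does not
    occur in the reference data, otherwise 0. (Hypothetical helper,
    only illustrating the idea behind an "unknown_labels" metric.)
    """
    new_labels = sorted(set(batch_labels) - set(reference_labels))
    return (1 if new_labels else 0), new_labels


# Made-up example data
reference = ["red", "green", "blue", "green", "red"]
batch_same = ["green", "red", "red"]      # "blue" is missing, but nothing is new
batch_new = ["green", "purple", "red"]    # "purple" was never seen before

print(unknown_labels_flag(reference, batch_same))  # (0, [])
print(unknown_labels_flag(reference, batch_new))   # (1, ['purple'])
```

Note that a label disappearing from the batch ("blue" above) does not trip this flag. As described in this thread, that case is only caught when it changes the distribution significantly.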
-
The monitoring rules are set on metrics derived from the Age distribution for each slice of data. For each histogram we do keep track of the maximum and minimum value, though. You could do this: monitoring_rules = {"Age:min": [120, 120, 0, 0], "Age:max": [120, 120, 0, 0]} The four numbers are the boundaries [red high, yellow high, yellow low, red low], so both min and max are expected to be between 0 and 120. If not, a red alert is raised. Because the yellow and red boundaries coincide here, no yellow alerts occur. If you instead set "Age:min": [120, 100, 18, 0], then a yellow alert is raised when a minimum age is found between 0-18 or 100-120.
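As a rough illustration of how such four-number boundaries map to traffic lights, here is a small self-contained sketch. It assumes the ordering [red high, yellow high, yellow low, red low] from popmon's documentation, and it assumes that values exactly on a boundary are not flagged. The function `traffic_light` is hypothetical and is not part of popmon. The second set of bounds, [120, 100, 18, 0], is simply one choice that produces the yellow bands 0-18 and 100-120 mentioned above.

```python
def traffic_light(value, bounds):
    """Classify a metric value as 'green', 'yellow' or 'red'.

    bounds is assumed to be [red_high, yellow_high, yellow_low, red_low].
    Values exactly on a boundary are treated as not yet alerting
    (an assumption made for this sketch).
    """
    red_high, yellow_high, yellow_low, red_low = bounds
    if value > red_high or value < red_low:
        return "red"
    if value > yellow_high or value < yellow_low:
        return "yellow"
    return "green"


# Rules keyed by metric name, in the same shape as in the reply above
monitoring_rules = {
    "Age:min": [120, 100, 18, 0],   # yellow band: 0-18 and 100-120
    "Age:max": [120, 120, 0, 0],    # yellow == red: only red alerts possible
}

print(traffic_light(10, monitoring_rules["Age:min"]))   # yellow
print(traffic_light(50, monitoring_rules["Age:min"]))   # green
print(traffic_light(-1, monitoring_rules["Age:min"]))   # red
print(traffic_light(130, monitoring_rules["Age:max"]))  # red
```

In popmon itself you would not evaluate the rules by hand like this. You would pass the dictionary through the monitoring_rules option when generating the report.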
-
Then there is the question of how to set monitoring rules in practice. The short answer is that this is quite use-case specific, but the default values in popmon are a good start: they will catch most, if not all, forms of dataset change or shift. In most cases, reporting less than the default settings do is the first step towards producing a use-case specific popmon report.
Hello,
Thanks for the kind words.
Yes, you can set fixed traffic light boundaries. Simply set the option monitoring_rules when generating the report. For examples, see:
https://github.com/ing-bank/popmon/blob/master/popmon/pipeline/report.py#L81
For any categorical feature we compare the distribution of a new batch of data with the reference histogram. If the distributions are significantly different an alert will be raised. We don't explicitly check for empty bins, since these can also happen in low statistics data batches, but if an empty bin causes a significant difference between the two histograms, it will certainly show up in the test statistics. For example in the metric "max_pr…