Evaluate metrics over time #68

Open
gverbock opened this issue Feb 10, 2021 · 10 comments
Labels
enhancement: New feature or request
investigation needed: One to do research related to this issue, and share findings

Comments

@gverbock
Contributor

Problem Description
It would be great if Probatus reported the metric (including its volatility) over time, so that possible drops in model performance can be spotted easily. The time aggregation level (day, month, quarter) would be chosen by the user.

Desired Outcome
The output would be a dataframe containing the following columns: dates, metric1, metric2. The output could then be used for a plot like the following:
[image: example plot of the metrics over time]
The ability to evaluate on out-of-time data would also be required.

Solution Outline
Maybe incorporate it in the metric_volatility class: pass a series with the dates to aggregate on, and use groupby on it before computing the metrics.
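A minimal sketch of the groupby idea, assuming the model scores are already available. The helper name `metric_over_time` and its parameters are hypothetical and not part of probatus; AUC stands in for any metric.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def metric_over_time(y_true, y_score, dates, freq="M"):
    """Return one AUC value per time period (hypothetical helper)."""
    df = pd.DataFrame(
        {"y_true": np.asarray(y_true), "y_score": np.asarray(y_score)}
    )
    # Map each date to its period, e.g. 2020-03 for monthly aggregation.
    df["period"] = pd.DatetimeIndex(dates).to_period(freq)

    rows = []
    for period, group in df.groupby("period"):
        # AUC is undefined when a period contains only one class.
        if group["y_true"].nunique() < 2:
            auc = np.nan
        else:
            auc = roc_auc_score(group["y_true"], group["y_score"])
        rows.append({"dates": period, "auc": auc})
    return pd.DataFrame(rows)
```

The `freq` argument takes pandas period aliases such as "D", "M" or "Q", which matches the day, month and quarter levels mentioned above.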

@gverbock gverbock added the enhancement New feature or request label Feb 10, 2021
@operte
Contributor

operte commented Feb 15, 2021

I'm not sure, but is this something that can be done with popmon?

@timvink
Contributor

timvink commented Feb 15, 2021

Have you thought about what a potential API would look like (pseudo code)?

@Matgrb
Contributor

Matgrb commented Feb 16, 2021

I think this could be done by extending BaseVolatilityEstimator and implementing something similar to TrainTestVolatility with one crucial difference:

When you split the data into train and test, you take the time column into account:

  • Stratify the split on the time column, so that train and test both contain samples from the entire time span. Repeating this split multiple times lets you plot the volatility of the out-of-sample metric over time.
  • Split the data into multiple time-based folds. Each time, test on one fold and train on the remaining folds. To get time-based volatility, you can apply bootstrapping to the train and test folds. This tells you how different and how volatile a given time-based fold is when predicted by a model trained on the other folds.
  • Split the data into multiple time-based folds, then apply the schema shown in the image below. To get the volatility, you can again apply bootstrapping to the train and test folds. This tells you how volatile the model is on OOT splits, and how much training data you need for a stable OOT result.

[image: schema of the time-based train/test folds]

The first option seems the easiest to implement, since it reuses most of the existing code. The remaining two would be more difficult. I suppose the first one would be a good starting point.
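The first option could be sketched roughly as follows. This assumes a quarterly time bucket and synthetic data; the variable names are illustrative only, and the split uses scikit-learn's `train_test_split` with its `stratify` argument.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data, only for illustration.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({"f1": rng.normal(size=n), "f2": rng.normal(size=n)})
y = (X["f1"] + rng.normal(scale=0.5, size=n) > 0).astype(int)
dates = pd.Series(pd.date_range("2020-01-01", periods=n, freq="D"))

# Time bucket used for stratification (quarterly here).
periods = dates.dt.to_period("Q").astype(str)

# Each bucket contributes about 30% of its rows to the test set,
# so train and test both cover the entire time span.
X_train, X_test, y_train, y_test, p_train, p_test = train_test_split(
    X, y, periods, test_size=0.3, stratify=periods, random_state=0
)
```

Repeating the split with different `random_state` values and scoring the test rows per bucket would give the spread of the out-of-sample metric over time.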

In all cases, you would need to consider a new plotting method, that would be similar for all time-based metric volatility. You could also make another base class BaseTimeBasedVolatilityEstimator, which overwrites the plot method of BaseVolatilityEstimator.

Regarding the use of popmon, we could try to use it for plotting. However, I think plotting is a minor part of the feature, and we could get a more efficient implementation if we do it ourselves.

@gverbock
Contributor Author

gverbock commented Feb 17, 2021

My thoughts were to start simple:

Having something like

class PerformanceOverTimeEstimator:
    def __init__(self, model, X, y, scorer_list, dates, frequency): ...

    def boosting_process(self, ...):
        X_proba = model.predict_proba(X)

        results = []
        for boost in range(1000):
            # resample so that every time unit stays represented
            X_boost, y_boost = time_stratified_sampling(X, y, dates, frequency)
            # one score per scorer and per time unit
            scores = compute_scores_over_time(X_boost, y_boost, dates, frequency, scorer_list)
            results.append(scores)

    def plot_results(self): ...

    def results_as_table(self): ...

I had in mind to take a fitted model as an argument, so that hyperparameter optimization is done outside the class.

@Matgrb Matgrb reopened this Feb 17, 2021
@Matgrb
Contributor

Matgrb commented Feb 17, 2021

Possible improvements:

  • X_proba could be computed with cross-validation, using cross_val_predict, to ensure there is no leakage.
  • Let's try to stick to the probatus API: init (clf, metrics, ...), fit(X, y, ...), compute(metrics, ...), plot().
  • The clf provided by the user can be a model, or a model wrapped in GridSearchCV that performs hyperparameter optimization at each training. So you don't have to worry too much about the hyperparameter optimization.
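A short sketch of the first and third points together, on synthetic data. The estimator and parameter grid are arbitrary choices for illustration; `cross_val_predict` and `GridSearchCV` are the scikit-learn functions named above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_predict

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# A model wrapped in GridSearchCV: hyperparameters are re-tuned
# every time the wrapper is fitted on a training fold.
clf = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=3,
)

# Out-of-fold probability of the positive class for every row in X:
# each row is scored by a model that did not see it during training.
y_proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
```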

What would time_stratified_sampling and compute_scores_over_time do? And what would the frequency parameter do?

What would be the use case for this code? Could you provide an example of what this analysis tells you about the model/data?

@gverbock
Contributor Author

gverbock commented Feb 17, 2021

Good points Mateusz.

  • Frequency would provide the level of aggregation over time. Monthly, quarterly, ...
  • time_stratified_sampling would ensure the bootstrap is homogeneously distributed across time.
  • compute_scores_over_time would compute the score metrics for each unit of time (month, quarter).
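One possible reading of these two helpers, as a sketch under assumptions: the sampling helper returns row positions rather than resampled X and y, the scoring helper works on precomputed probabilities, and `scorer_list` is taken to be a list of (name, function) pairs. None of this is an existing probatus API.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def time_stratified_sampling(dates, frequency, rng):
    """Bootstrap row positions separately within each time unit."""
    periods = pd.DatetimeIndex(dates).to_period(frequency)
    positions = np.arange(len(periods))
    sampled = []
    for period in periods.unique():
        in_period = positions[periods == period]
        # Draw with replacement, keeping the size of each time unit unchanged.
        sampled.append(rng.choice(in_period, size=len(in_period), replace=True))
    return np.concatenate(sampled)


def compute_scores_over_time(y_true, y_proba, dates, frequency, scorer_list):
    """Apply every (name, function) scorer to each time unit."""
    df = pd.DataFrame({"y": np.asarray(y_true), "p": np.asarray(y_proba)})
    df["dates"] = pd.DatetimeIndex(dates).to_period(frequency)
    rows = []
    for period, group in df.groupby("dates"):
        row = {"dates": period}
        for name, scorer in scorer_list:
            # Metrics such as AUC need both classes in the time unit.
            if group["y"].nunique() < 2:
                row[name] = np.nan
            else:
                row[name] = scorer(group["y"], group["p"])
        rows.append(row)
    return pd.DataFrame(rows)
```

Sampling within each time unit keeps the number of rows per month or quarter fixed across bootstrap iterations, which is one way to make the bootstrap homogeneous over time.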

The benefit of the new code is that the user sees (for example) the AUC over time and can easily spot performance degradation, for example in specific months (say Covid, summer holidays, ...).

This helps you assess the impact of unexpected changes (for example Covid, a crisis, bad publicity), and understanding the reason may also reveal weaknesses of the model. For example, the model starts to deteriorate once mortgage production increases. You would then try to mitigate this by adding features related to mortgages.

@Matgrb
Contributor

Matgrb commented Feb 17, 2021

So to summarize:

  1. Compute probabilities for the entire X using Cross-Validation
  2. Split data into time buckets
  3. For each window of data, randomly sample examples multiple times (bootstrapping), and measure the metric e.g. AUC multiple times.
  4. Compute a plot and report about volatility of the metric in each time bucket

Is that correct?
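The four steps could be sketched end to end as below, on synthetic data. The estimator, the quarterly buckets, the 200 bootstrap draws and the column names are all assumptions made for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=600, n_features=5, random_state=0)
dates = pd.date_range("2020-01-01", periods=len(y), freq="D")

# 1. Out-of-fold probabilities for the entire X.
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)[:, 1]

# 2. Split the rows into time buckets (quarters here).
df = pd.DataFrame({"y": y, "proba": proba})
df["bucket"] = dates.to_period("Q")

# 3. Bootstrap each bucket and recompute the metric on every resample.
rng = np.random.default_rng(0)
rows = []
for bucket, group in df.groupby("bucket"):
    aucs = []
    for _ in range(200):
        sample = group.iloc[rng.integers(0, len(group), size=len(group))]
        if sample["y"].nunique() == 2:  # AUC needs both classes
            aucs.append(roc_auc_score(sample["y"], sample["proba"]))
    rows.append(
        {"bucket": bucket, "auc_mean": np.mean(aucs), "auc_std": np.std(aucs)}
    )

# 4. Report the volatility of the metric per bucket.
report = pd.DataFrame(rows)
```

The `report` dataframe has one row per bucket, so it can be plotted directly as a mean line with an error band.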

I like the approach for its simplicity. It tells you which periods of time to be cautious about, and points to possible data drift. It is similar to another issue, #72, but it focuses on how the performance of the target prediction changes over time.

The limitation I see is that when you compute the probabilities for X using CV, the model is trained on the data from the entire time span. Imagine you have a sample in the middle of the dataset, the model has seen samples before and after that.
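For comparison, a forward-chaining split avoids this look-ahead. The sketch below uses scikit-learn's `TimeSeriesSplit`, which is a different technique from the cross-validated probabilities in step 1 and is shown only to illustrate the limitation; it assumes the rows are sorted by date.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Rows are assumed to be sorted by date, oldest first.
n_rows = 12
X_dummy = np.arange(n_rows).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
splits = list(splitter.split(X_dummy))

for train_idx, test_idx in splits:
    # Every training row precedes every test row,
    # so the model never sees data from after the test period.
    print("train:", train_idx, "test:", test_idx)
```

The trade-off is that the earliest rows are never part of a test fold, so they get no out-of-sample prediction.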

Let's also ask the others what they think. @timvink @anilkumarpanda @operte

@gverbock
Contributor Author

You understood it correctly.

I am not sure the limitation you raise on the cross-section would have a large impact.

@Matgrb
Contributor

Matgrb commented Feb 17, 2021

Indeed, probably low impact 👍

However, I would reach out to a couple of users and see if they would find it useful for their projects.

@Matgrb Matgrb added the investigation needed One to do research related to this issue, and share findings label Feb 26, 2021
@ReinierKoops

Is this still a feature that we want to work on @gverbock @anilkumarpanda ?


5 participants