Add adversarial validation to sample_similarity #4
Comments
I just watched a presentation by @nanne-aben on covariate shift that details a different approach:
The benefit of that approach is that you do not have to subsample your training data, so no information is lost.
Definition of done would be the helper method for sample weights + a tutorial on "dealing with covariate shift" in the probatus docs. Thoughts? |
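This is definitely something that would be nice to have. A couple of thoughts:
1. How do we see this feature being further used? In some way we would use quite a lot of information from the test, even if we don't use labels. Wouldn't this cause a bias when we measure OOT Test score?
2. Implementing this would require some work on how we handle data.
Now we do a train/test split within the resemblance model (here train and test are created from the combined X1 and X2; unfortunately, in your example they are also called train and test). In order to calculate the sample weights, we would need to compute predictions on all samples of X1, which would require the use of cross-validation.
That is why this would require either making a completely separate feature that is similar to SHAPImportanceResemblance but implements the CV correctly, or reworking the entire sample_similarity module to use CV instead of a train/test split. I would vote for the first option, because in the Resemblance model you don't really need the CV, since it is a simple test and it is not about squeezing the most out of the performance of the model. |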
Happy to hear you liked the approach! Maybe it helps if I give some more
background on how we use this exactly.
When we apply the model, we basically
1. Take a sample of our X_train and of our X_test. Let's call these
X_train_adv and X_test_adv. We make sure these are of the same size btw, so
our adversarial prediction problem is balanced.
2. Train our adversarial model on the concatenated [X_train_adv,
X_test_adv]
3. Determine the sample weights w for the remaining part of X_train
(which we'll call X_train_dash)
4. Train our model with X_train_dash, y_train_dash and w.
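The four steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, assuming scikit-learn; the thread does not say how the weights are derived from the adversarial model, so the odds ratio p / (1 - p) used here (with clipping and normalisation) is an assumption, as are the model choices and sample sizes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic covariate shift: the test features are shifted relative to train.
X_train = rng.normal(loc=0.0, scale=1.0, size=(2000, 2))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=2000) > 0).astype(int)
X_test = rng.normal(loc=1.0, scale=1.0, size=(1000, 2))

# Step 1: equally sized samples, so the adversarial problem is balanced.
n_adv = 500
adv_idx = rng.choice(len(X_train), size=n_adv, replace=False)
dash_mask = np.ones(len(X_train), dtype=bool)
dash_mask[adv_idx] = False
X_train_adv = X_train[adv_idx]
X_test_adv = X_test[rng.choice(len(X_test), size=n_adv, replace=False)]
X_train_dash, y_train_dash = X_train[dash_mask], y_train[dash_mask]

# Step 2: adversarial model on the concatenation (0 = train, 1 = test).
X_adv = np.vstack([X_train_adv, X_test_adv])
is_test = np.concatenate([np.zeros(n_adv, dtype=int), np.ones(n_adv, dtype=int)])
adv_model = RandomForestClassifier(
    n_estimators=100, min_samples_leaf=20, random_state=0
).fit(X_adv, is_test)

# Step 3: weights for the remaining training rows (assumed formula: odds ratio).
p_test = np.clip(adv_model.predict_proba(X_train_dash)[:, 1], 0.01, 0.99)
w = p_test / (1.0 - p_test)
w = w / w.mean()  # normalise so the average weight is 1

# Step 4: train the actual model with the sample weights.
model = LogisticRegression().fit(X_train_dash, y_train_dash, sample_weight=w)
```

Because X_train_dash is never shown to the adversarial model, the weights are out-of-sample predictions, which is what avoids the need to cross-validate the adversarial model in this variant.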
When we want to make an estimate of our test set performance (without
knowing y_test), we:
1. Train our adversarial (or resemblance) model on the concatenated
[X_train_adv, X_test_adv]
2. Determine the sample weights w for X_train_dash
3. Do cross-validation with X_train_dash, y_train_dash and w.
Here, w feeds into both the model fit (e.g. fit this point as if it's
10x as important) and the determination of the performance (e.g.
misclassifying this point would count 10x as heavily).
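A sketch of what "w feeds into both" could look like, written as an explicit loop so it is visible where the weights enter. It assumes scikit-learn; `weighted_cv_accuracy` is a hypothetical helper name, and the weights here are placeholders rather than the output of a real adversarial model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold


def weighted_cv_accuracy(X, y, w, n_splits=5, seed=0):
    """Cross-validated accuracy where w is used for fitting and for scoring."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for fit_idx, val_idx in cv.split(X, y):
        model = LogisticRegression()
        # Weights influence the fit ...
        model.fit(X[fit_idx], y[fit_idx], sample_weight=w[fit_idx])
        pred = model.predict(X[val_idx])
        # ... and the evaluation: a heavily weighted error costs more.
        scores.append(accuracy_score(y[val_idx], pred, sample_weight=w[val_idx]))
    return float(np.mean(scores))


rng = np.random.default_rng(0)
X_train_dash = rng.normal(size=(600, 2))
y_train_dash = (X_train_dash[:, 0] > 0).astype(int)
# Placeholder weights for illustration only.
w = np.exp(X_train_dash[:, 1])
w = w / w.mean()

score = weighted_cv_accuracy(X_train_dash, y_train_dash, w)
```

The resulting score is an estimate of performance on data distributed like the test set, under the assumption that the weights are a reasonable approximation of the density ratio between test and train.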
Finally, we have some artificial data experiments where we do know y_test,
which we use to validate whether the approach works. In these settings we
keep a part of y_test separate (never to be seen by the adversarial model)
to test performance, just to be sure.
I guess this would address your concerns about bias. I'm not sure whether
you really have to be that careful to mitigate these biases in all
situations (since you don't reuse information on y anywhere), but we have a
big enough sample size to set some data aside, so we figured we'd stay on
the safe side.
Happy to think along with you, let me know if I can help!
…On Fri, 27 Nov 2020 at 14:01, Mateusz Garbacz ***@***.***> wrote:
This is definitely something that would be nice to have. A couple of
thoughts:
1.
How do we see this feature being further used? In some way we would
use quite a lot of information from the test, even if we don't use labels.
Wouldn't this cause a bias when we measure OOT Test score?
2.
Implementing this would require some work on how we handle data.
Now we do train/test split within resemblance model (here train and test
is created from combined X1 and X2, unfortunately in your example it is
also train and test). In order to calculate the sample weights, we would
need to compute the predictions on all samples of X1, which would require
use of cross-validation.
That is why, making this would either require making a completely separate
feature that is similar to SHAPImportanceResemblance, but implementing the
CV correctly, or would require rework of the entire sample_similarity
module, to use CV instead of train/test split. I would be voting for the
first option, because in Resemblance model you don't really need the CV,
since it is a simple test and it is not about squeezing the most out of the
performance of the model.
|
Now that I think about it, we actually do have some artificial data
experiments where we were less strict in not using the same data twice. And
it still performed really well on unseen test data. So I guess how strict
you want to be depends on the user and the use-case. Happy to share some of
those experiments btw, if you're interested.
|
I like it, especially the second option that you have presented with the use of CV; it is more data efficient. Another tweak that can be done there is using a model with class_weight='balanced'.
Could you share the experiments? I am interested in how this works in practice.
Regarding the bias, this is tricky. Imagine having an OOT Test set which covers the entire Covid-19 pandemic. At that time, the dataset changed dramatically compared to the pre-pandemic Train. If you use the data distribution during the pandemic to make the pre-pandemic training dataset better suited, this will cause a strong leakage of information from test to train. The model will definitely be better suited for the future, assuming that the situation doesn't change much post-pandemic, but the estimated performance is less realistic, because in this case the model "knew" about the upcoming data shift, even though in production it would not. This is of course an extreme example, but I wanted to illustrate where this could go wrong. In the end it is the user's choice whether this bias is an issue for a given problem.
A couple of usage scenarios I can think of that would decrease the possible impact of such bias:
- Set the last month of Train data as the validation set. In this case, the older Train data can be weighted to better represent the most recent period, and no bias would be introduced by using information from the test set.
- Split the Test set into two parts and use one part to do adversarial validation. Then the performance on the first and second part of the test set can be compared to indicate whether any bias has been introduced (in case the performance between Test1 and Test2 differs).
|
Sure, happy to share. What's the best way of sharing it?
W.r.t. the Covid data, let me answer using an example. For ease of
notation, let's time-split the dataset into t1, t2, t3, etc. Let's say we
have x_t1, y_t1 and x_t2 available (and hence y_t2 is unavailable). Using
adversarial modelling, we could then create a model to predict y_t2. In the
example above, you ask whether we would be able to predict x_t3. If x_t3 is
similar to x_t2, you should be fine. If it's not, you'll need to retrain
your adversarial model.
…On Mon, Nov 30, 2020 at 2:23 PM Mateusz Garbacz ***@***.***> wrote:
I like it, especially the second option that you have presented with the
use of CV, it is more data efficient. Another tweak that can be done there
is using a model with class_weight='balanced'.
Could you share the experiments? I am interested how this works in
practice.
Regarding the bias, this is tricky. Imagine having an OOT Test set which
covers the entire Covid-19 pandemic. At that time, the dataset changed
dramatically compared to the pre-pandemic Train. If you use the data
distribution during the pandemic to make the pre-pandemic training dataset
better suited, this will cause a strong leakage of information from test to
train. The model will definitely be better suited for the future, assuming
that the situation doesn't change much post-pandemic, but the estimated
performance is less realistic, because in this case the model "knew" about
the upcoming data shift, even though in production it would not. This is of
course an extreme example, but I wanted to illustrate where this could go
wrong. In the end it is the user's choice whether this bias is an issue
for a given problem.
Couple of use scenarios I can think of that would decrease possible impact
of such bias:
- Set last month of Train data as validation set. In this case, the
older Train data can be weighted to better represent most recent times, and
no bias would be introduced by using information from the test set.
- Split Test set into two parts, use one part to do adversarial
validation. Then the performance on the first and second part of the test
set can be illustrated to indicate whether there is any bias introduced (in
case the performance between Test1 and Test2 differs)
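The second scenario (split the Test set, use only one half for the adversarial step, then compare both halves) might look roughly like this. It is a sketch on synthetic data, assuming scikit-learn; the odds-ratio weights, the model choices and the data generator `make_data` are all assumptions for illustration. Out-of-fold predictions via `cross_val_predict` are used so that each Train row is scored by an adversarial model that did not see it, and class_weight='balanced' compensates for Train and Test1 having different sizes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict, train_test_split

rng = np.random.default_rng(0)


def make_data(n, shift):
    """Hypothetical generator: same labelling rule, shifted features."""
    X = rng.normal(loc=shift, size=(n, 2))
    y = (X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=n) > 1).astype(int)
    return X, y


X_train, y_train = make_data(3000, 0.0)
X_test, y_test = make_data(2000, 1.0)

# Test1 is used for adversarial weighting, Test2 is held back entirely.
X_test1, X_test2, y_test1, y_test2 = train_test_split(
    X_test, y_test, test_size=0.5, random_state=0
)

# Adversarial model: Train (0) versus Test1 (1), with out-of-fold predictions.
X_adv = np.vstack([X_train, X_test1])
is_test = np.concatenate(
    [np.zeros(len(X_train), dtype=int), np.ones(len(X_test1), dtype=int)]
)
adv = RandomForestClassifier(
    n_estimators=100, min_samples_leaf=20, class_weight="balanced", random_state=0
)
p = cross_val_predict(adv, X_adv, is_test, cv=5, method="predict_proba")[:, 1]

# Weights for the Train rows only (assumed formula: clipped odds ratio).
p_train = np.clip(p[: len(X_train)], 0.01, 0.99)
w = p_train / (1.0 - p_train)

model = LogisticRegression().fit(X_train, y_train, sample_weight=w)

auc1 = roc_auc_score(y_test1, model.predict_proba(X_test1)[:, 1])
auc2 = roc_auc_score(y_test2, model.predict_proba(X_test2)[:, 1])
gap = abs(auc1 - auc2)  # a large gap would hint at bias from using Test1
```

A small gap between the two halves does not prove the absence of leakage, but a clearly better score on Test1 than on Test2 would be a warning sign.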
|
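Interesting discussion.
Framed slightly differently, you could use adversarial/resemblance modelling to calibrate your model as a last (retraining) step, in order to improve performance in production where there is a (known) distribution shift like covid-19.
To do that without leakage, you need to get X_train_adversarial by splitting your out-of-time test set into two, or take previously unused out-of-time data for which you don't have labels yet. Then you train a resemblance model on X_train_adversarial, use that model to set instance weights for your original model, and retrain it one more time using those. You can then measure the performance difference between your original model and your calibrated model using the same out-of-time test dataset.
Back to probatus. I think there is an opportunity to build some tooling & documentation for this in a new probatus.calibration module. Some pseudo code:
# We have X_train, y_train, X_test, y_test, X_adversarial
from lightgbm import LGBMClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from probatus.sample_similarity import SHAPImportanceResemblance
# Normal model
model = GradientBoostingClassifier().fit(X_train, y_train)
# Resemblance model
clf = RandomForestClassifier()
rm = SHAPImportanceResemblance(clf)
shap_resemblance_model = rm.fit_compute(X_train, X_adversarial)
# Model calibration
resemblance_model = shap_resemblance_model.model  # new method (proposed, does not exist yet)
probs = resemblance_model.predict_proba(X_train)[:, 1]  # probability of the second class
weights = calculate_weight(probs)  # new function (proposed, does not exist yet)
calibrated_model = LGBMClassifier().fit(X_train, y_train, sample_weight=weights)
# Compare performance
# get AUC from model.predict_proba(X_test) and y_test
# get AUC from calibrated_model.predict_proba(X_test) and y_test
The new parts are in the model calibration section. I think we can simplify that process a bit more, maybe something like:
ac = probatus.calibration.AdversialCalibrator()
ac.fit_compute(model, resemblance_model, X_train, y_train, X_train, X_test, X_adversarial)  # returns pd.DataFrame comparing calibrated model with non-calibrated model
Thoughts? |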
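The `calculate_weight` function in the pseudo code is only proposed, and the thread does not define it. One possible reading, shown here purely as an assumption, is the clipped odds ratio p / (1 - p), normalised to a mean of 1. The `clip` parameter is hypothetical and guards against division by zero and extreme weights.

```python
import numpy as np


def calculate_weight(probs, clip=0.01):
    """Turn P(row resembles the shifted data) into importance weights.

    Assumed formula: odds p / (1 - p), clipped and normalised to mean 1.
    """
    p = np.clip(np.asarray(probs, dtype=float), clip, 1.0 - clip)
    odds = p / (1.0 - p)
    return odds / odds.mean()


weights = calculate_weight([0.2, 0.5, 0.8])
```

Because the weights are normalised by their mean, any constant factor, such as a correction for unequal class sizes in the resemblance model's training data, cancels out. Rows that look more like the shifted data simply receive proportionally more weight.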
Sorry for the late reply, I was already on Christmas leave :)
Framed slightly differently, you could use adversarial/resemblance
modelling to *calibrate* your model as a last (retraining) step, in order
to improve performance in production where there is a (known) distribution
shift like covid-19.
To do that without leakage, you need to get X_train_adversarial by splitting
your out-of-time test set into two, or take previously unused out-of-time
data for which you don't have labels yet. Then you train a resemblance
model on X_train_adversarial, use that model to set instance weights
for your original model, and retrain it one more time using those. You can
then measure the performance difference between your original model and
your calibrated model using the same out-of-time test dataset.
Yes, that's it exactly!
…On Tue, Dec 22, 2020 at 2:08 PM Tim Vink ***@***.***> wrote:
Interesting discussion.
Framed slightly differently, you could use adversarial/resemblance
modelling to *calibrate* your model as a last (retraining) step, in order
to improve performance in production where there is a (known) distribution
shift like covid-19.
To do that without leakage, you need to get X_train_adversarial by splitting
your out-of-time test set into two, or take previously unused out-of-time
data for which you don't have labels yet. Then you train a resemblance
model on X_train_adversarial, use that model to set instance weights
for your original model, and retrain it one more time using those. You can
then measure the performance difference between your original model and
your calibrated model using the same out-of-time test dataset.
Back to probatus. I think there is an opportunity to build some tooling &
documentation for this in a new probatus.calibration module. Some pseudo
code:
# We have X_train, y_train, X_test, y_test, X_adversarial
from lightgbm import LGBMClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from probatus.sample_similarity import SHAPImportanceResemblance
# Normal model
model = GradientBoostingClassifier().fit(X_train, y_train)
# Resemblance model
clf = RandomForestClassifier()
rm = SHAPImportanceResemblance(clf)
shap_resemblance_model = rm.fit_compute(X_train, X_adversarial)
# Model calibration
resemblance_model = shap_resemblance_model.model  # new method (proposed, does not exist yet)
probs = resemblance_model.predict_proba(X_train)[:, 1]  # probability of the second class
weights = calculate_weight(probs)  # new function (proposed, does not exist yet)
calibrated_model = LGBMClassifier().fit(X_train, y_train, sample_weight=weights)
# Compare performance
# get AUC from model.predict_proba(X_test) and y_test
# get AUC from calibrated_model.predict_proba(X_test) and y_test
The new parts are in the model calibration section. I think we can
simplify that process a bit more, maybe something like:
ac = probatus.calibration.AdversialCalibrator()
ac.fit_compute(model, resemblance_model, X_train, y_train, X_train, X_test, X_adversarial)  # returns pd.DataFrame comparing calibrated model with non-calibrated model
Thoughts?
|
Once we find out that the distributions of the training and test data are different, we can use adversarial validation to identify a subset of the training data that is similar to the test data.
This subset can then be used as the validation set. Thus we will have a good idea of how our model will perform on the test set, which belongs to a different distribution than our training set.
The pseudo-code to include adversarial validation can be as follows:
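A minimal sketch of that idea, assuming scikit-learn and synthetic data. The 20% validation fraction, the random forest settings and all variable names are assumptions for illustration, not part of probatus.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, size=(2000, 3))
X_test = rng.normal(loc=0.7, size=(1000, 3))  # shifted distribution

# Label each row by origin: 0 = train, 1 = test.
X_all = np.vstack([X_train, X_test])
is_test = np.concatenate(
    [np.zeros(len(X_train), dtype=int), np.ones(len(X_test), dtype=int)]
)

# Out-of-fold probability that a row comes from the test set.
clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=20, random_state=0)
p_test = cross_val_predict(clf, X_all, is_test, cv=5, method="predict_proba")[:, 1]

# An AUC well above 0.5 means train and test can be told apart.
auc = roc_auc_score(is_test, p_test)

# Use the training rows that look most like test data as the validation set.
p_train = p_test[: len(X_train)]
n_val = int(0.2 * len(X_train))
val_idx = np.argsort(p_train)[-n_val:]
fit_idx = np.setdiff1d(np.arange(len(X_train)), val_idx)
X_val, X_fit = X_train[val_idx], X_train[fit_idx]
```

If the AUC is close to 0.5 the two samples are hard to distinguish, and an ordinary random validation split is probably good enough.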
More information can be found here and here