Imputing categorical data by predictive mean matching #576

stefvanbuuren · 2023-08-07T22:23:38Z

Predictive mean matching (PMM) is the default method of mice for imputing numerical variables, but it has long been possible to impute factors. This PR introduces better support to work with categorical variables in PMM.

The former system worked as follows: If we specify PMM for an unordered factor, then the similarity among potential donors is expressed on the linear predictor, and we take the observed category of a random draw among the five closest donor cases. As the linear predictor summarizes the available predictive information, matching should produce reasonable imputations. This method is fast and robust against empty cell and fitting problems. The downside is that it depends on category order. In particular, in mice.impute.pmm() we have the shortcut:

ynum <- y
 if (is.factor(y)) {
   ynum <- as.integer(y)
 }

The order of integers in ynum may have no sensible interpretation for an unordered factor. The problem is less likely to surface for ordered factors, though there is still the assumption that the categories are equidistant.

The new system quantifies ynum and could yield better results because of higher $R^2$. The PR follows a similar strategy as Frank Harrell's function Hmisc::aregImpute(). The method calculates the canonical correlation between y (as dummy matrix) and a linear combination of imputation model predictors x. Similar methods are known as MORALS (Gifi, 1980) or ACE (Breiman and Friedman, 1985). The algorithm then replaces each category of y by a single number taken from the first canonical variate. After this step, the imputation model is fitted, and the predicted values from that model are extracted to function as the similarity measure for the matching step.

The method works for both ordered and unordered factors. No special precautions are taken to ensure monotonicity between the category numbers and the quantifications, so the method should be able to preserve quadratic and other non-monotone relations of the predicted metric. It may be beneficial to remove very sparsely filled categories, for which there is a new trim argument.

Potential advantages are:

Simpler and faster than fitting a generalised linear model, e.g., logistic regression or the proportional odds model;
Should be insensitive to the order of categories;
No need to solve problems with perfect prediction;
Should inherit the good statistical properties of predictive mean matching.

Note that we still lack solid evidence for these claims.

Here are some examples for the new functionality.

library(mice, warn.conflicts = FALSE)

# Factor quantification
xname <- c("age", "hgt", "wgt")
br <- boys[c(1:10, 101:110, 501:510, 601:620, 701:710), ]
r <- stats::complete.cases(br[, xname])
x <- br[r, xname]
y <- factor(br[r, "tv"])
ry <- !is.na(y)
table(y)
#> y
#>  6  8 10 12 13 15 16 20 25 
#>  1  2  1  1  1  4  1  4  7

# Impute factor by optimizing canonical correlation y, x
mice.impute.pmm(y, ry, x)
#>  [1] 15 25 25 8  20 8  25 8  25 25 20 8  25 8  25 15 20 15 15 25 20 8  16 25 15
#> [26] 25 20 25 20 8  20 25 20 20 15 8  20 15
#> Levels: 6 8 10 12 13 15 16 20 25

# Only categories with at least 2 cases can be donor
mice.impute.pmm(y, ry, x, trim = 2L)
#>  [1] 8  20 8  15 15 25 20 8  8  20 8  8  25 15 15 20 15 8  25 8  20 20 25 25 25
#> [26] 25 15 25 20 20 8  25 8  25 15 25 25 25
#> Levels: 6 8 10 12 13 15 16 20 25

# In addition, eliminate categories 8 and 20
mice.impute.pmm(y, ry, x, trim = 2L, exclude = c(8, 20))
#>  [1] 15 15 25 25 15 25 25 25 15 15 25 25 25 25 25 15 25 25 25 25 25 25 25 25 15
#> [26] 25 15 25 15 25 25 25 15 25 25 25 25 25
#> Levels: 6 8 10 12 13 15 16 20 25

# To get old behavior: as.integer(y))
mice.impute.pmm(y, ry, x, quantify = FALSE)
#>  [1] 8  15 12 8  8  12 8  15 8  12 12 10 6  12 12 6  15 8  6  6  15 15 6  10 15
#> [26] 15 10 25 20 15 10 25 8  16 25 25 25 25
#> Levels: 6 8 10 12 13 15 16 20 25

^{Created on 2023-08-07 with reprex v2.0.2}

…ta before estimating the matching model

stefvanbuuren · 2023-08-08T09:25:48Z

Related #196, #268

stefvanbuuren · 2023-08-08T09:38:56Z

Some points to myself to consider:

stats::cancor() executes before mice:::estimice(), so we may need to make it more robust and flag loggedEvents in case of multi-collinear data;
stats::cancor() and mice:::estimice() do partly similar calculations, so we may potentially speed up by re-using parts of cancor();
Perhaps it is enough to quantify only at initialisation, and fix the quantifications during iteration;
It is not yet clear whether we need additional procedures to incorporate the uncertainty of quantifications;
Would similar strategies works for mice::quickpred() and mice:::remove_lindep()?

hanneoberman · 2023-08-08T10:21:15Z

Clear potential for improved support for categorical variables. Earlier comment below about recreating old behaviour may be disregarded!

However, I'm wondering when exactly the added value would show up. When reproducing the example, I do not see different results with the new implementation versus the 'legacy code' (specified with quantify = FALSE):

# recreating reprex data
library(mice, warn.conflicts = FALSE)
xname <- c("age", "hgt", "wgt")
br <- boys[c(1:10, 101:110, 501:510, 601:620, 701:710), ]
r <- stats::complete.cases(br[, xname])
x <- br[r, xname]
y <- factor(br[r, "tv"])

# imputing with new and old behaviour
dat <- cbind(y, x)
imp_can <- mice(dat, method = "pmm", printFlag = FALSE, seed = 123)
imp_old <- mice(dat, method = "pmm", printFlag = FALSE, seed = 123, quantify = FALSE)
all.equal(imp_can$imp$y, imp_old$imp$y)
#> [1] TRUE

^{Created on 2023-08-08 with reprex v2.0.2}

How high should $R^2$ be before this new implementation provides more efficient imputations? Or did I not specify the old behavior in the right way?

Update: reprex did not use the right package version.

# recreating reprex data
library(mice, warn.conflicts = FALSE)
xname <- c("age", "hgt", "wgt")
br <- boys[c(1:10, 101:110, 501:510, 601:620, 701:710), ]
r <- stats::complete.cases(br[, xname])
x <- br[r, xname]
y <- factor(br[r, "tv"])

# imputing with new and old behaviour
dat <- cbind(y, x)
imp_can <- mice(dat, method = "pmm", printFlag = FALSE, seed = 123)
imp_old <- mice(dat, method = "pmm", printFlag = FALSE, seed = 123, quantify = FALSE)
complete(imp_can)$y
#>  [1] 25 20 25 25 20 20 20 25 20 25 15 25 20 8  8  20 25 25 25 8  8  8  15 8  10
#> [26] 16 15 20 25 12 13 15 20 25 15 15 25 20 10 6  25 20 25 8  20 25 20 25 25 25
#> [51] 25 16 16 16 13 20 13 15 25 25
#> Levels: 6 8 10 12 13 15 16 20 25
complete(imp_old)$y
#>  [1] 10 15 6  15 6  6  15 12 6  8  10 6  15 8  8  8  12 8  10 6  8  8  15 12 10
#> [26] 15 15 20 15 12 13 15 20 25 15 15 25 20 25 6  25 20 25 25 15 25 20 15 25 25
#> [51] 25 16 16 25 16 16 25 15 25 25
#> Levels: 6 8 10 12 13 15 16 20 25

^{Created on 2023-08-08 with reprex v2.0.2}

stefvanbuuren · 2023-08-08T14:10:06Z

@hanneoberman Thanks.

How high should be before this new implementation provides more efficient imputations?

Hard to answer in general. We would expect larger differences for variables whose integer category ordering is wrong is some way. For example, physical strength and age has a curve-linear relation. If age is coded as young-middle-old, then imputing age | strength or strength | age using as.integer() has an attenuated slope. The optimal scaling step may quantify them with a different order , e.g. young-old-middle, and hence increase $R^2$. Also, there are many variables that have no inherent order, like color, religion, and postal code that we may want to impute. Scaling these may also result in a better predictive model.

stefvanbuuren · 2023-08-08T14:33:54Z

stats::cancor() executes before mice:::estimice(), so we may need to make it more robust and flag loggedEvents in case of multi-collinear data;

Hmm, we definitely need to increase the robustness of stats::cancor() before merging.
https://stackoverflow.com/questions/5850763/canonical-correlation-analysis-in-r

stefvanbuuren · 2023-08-08T21:31:14Z

I wrote and executed various tests aimed at testing and breaking stats::cancor(), but it held up very well. Some findings:

Adding one or many duplicates in the x variables had no impact on the transform;
Adding many junk variables to x changes the transform, but does not crash cancor();
Setting defaultMethod all to "pmm" and running all tests did not reveal problems related to cancor();
Setting method = "pmm" and running all tests did not reveal problems related to cancor();

All in all, I believe that cancor() handles crappy data quite well. Lack of robustness does not seem to be an issue that should uphold the merge in the mice master branch.

stefvanbuuren added 3 commits August 7, 2023 13:49

Tidy exclude lines, set exclude = NULL as default

d6273a7

Add new functionality to mice.impute.pmm() to quantify categorical da…

ee435bb

…ta before estimating the matching model

Add some example of factor imputation by pmm

4858d05

stefvanbuuren requested a review from hanneoberman August 7, 2023 22:23

stefvanbuuren added the enhancement label Aug 7, 2023

hanneoberman approved these changes Aug 8, 2023

View reviewed changes

stefvanbuuren added 2 commits August 8, 2023 23:09

Add tests intended to break stats::cancor()

d10fd2b

Write NEWS entry

a8aba42

stefvanbuuren merged commit d67fdb0 into master Aug 8, 2023
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Imputing categorical data by predictive mean matching #576

Imputing categorical data by predictive mean matching #576

stefvanbuuren commented Aug 7, 2023

stefvanbuuren commented Aug 8, 2023

stefvanbuuren commented Aug 8, 2023

hanneoberman commented Aug 8, 2023 •

edited

Loading

stefvanbuuren commented Aug 8, 2023

stefvanbuuren commented Aug 8, 2023

stefvanbuuren commented Aug 8, 2023

Imputing categorical data by predictive mean matching #576

Imputing categorical data by predictive mean matching #576

Conversation

stefvanbuuren commented Aug 7, 2023

stefvanbuuren commented Aug 8, 2023

stefvanbuuren commented Aug 8, 2023

hanneoberman commented Aug 8, 2023 • edited Loading

stefvanbuuren commented Aug 8, 2023

stefvanbuuren commented Aug 8, 2023

stefvanbuuren commented Aug 8, 2023

hanneoberman commented Aug 8, 2023 •

edited

Loading