Imputing categorical data by predictive mean matching #576
Conversation
…ta before estimating the matching model
Some points to myself to consider:
Clear potential for improved support for categorical variables. Earlier comment below about recreating old behaviour may be disregarded!
# recreating reprex data
library(mice, warn.conflicts = FALSE)
xname <- c("age", "hgt", "wgt")
br <- boys[c(1:10, 101:110, 501:510, 601:620, 701:710), ]
r <- stats::complete.cases(br[, xname])
x <- br[r, xname]
y <- factor(br[r, "tv"])
# imputing with new and old behaviour
dat <- cbind(y, x)
imp_can <- mice(dat, method = "pmm", printFlag = FALSE, seed = 123)
imp_old <- mice(dat, method = "pmm", printFlag = FALSE, seed = 123, quantify = FALSE)
all.equal(imp_can$imp$y, imp_old$imp$y)
#> [1] TRUE
Created on 2023-08-08 with reprex v2.0.2
Update: the reprex above did not use the right package version.
# recreating reprex data
library(mice, warn.conflicts = FALSE)
xname <- c("age", "hgt", "wgt")
br <- boys[c(1:10, 101:110, 501:510, 601:620, 701:710), ]
r <- stats::complete.cases(br[, xname])
x <- br[r, xname]
y <- factor(br[r, "tv"])
# imputing with new and old behaviour
dat <- cbind(y, x)
imp_can <- mice(dat, method = "pmm", printFlag = FALSE, seed = 123)
imp_old <- mice(dat, method = "pmm", printFlag = FALSE, seed = 123, quantify = FALSE)
complete(imp_can)$y
#> [1] 25 20 25 25 20 20 20 25 20 25 15 25 20 8 8 20 25 25 25 8 8 8 15 8 10
#> [26] 16 15 20 25 12 13 15 20 25 15 15 25 20 10 6 25 20 25 8 20 25 20 25 25 25
#> [51] 25 16 16 16 13 20 13 15 25 25
#> Levels: 6 8 10 12 13 15 16 20 25
complete(imp_old)$y
#> [1] 10 15 6 15 6 6 15 12 6 8 10 6 15 8 8 8 12 8 10 6 8 8 15 12 10
#> [26] 15 15 20 15 12 13 15 20 25 15 15 25 20 25 6 25 20 25 25 15 25 20 15 25 25
#> [51] 25 16 16 25 16 16 25 15 25 25
#> Levels: 6 8 10 12 13 15 16 20 25
Created on 2023-08-08 with reprex v2.0.2
@hanneoberman Thanks.
Hard to answer in general. We would expect larger differences for variables whose integer category ordering is wrong in some way. For example, physical strength and age have a curvilinear relation. If age is coded as young-middle-old, then imputing age | strength or strength | age using …
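To make the curvilinear point concrete, here is a small simulated sketch (hypothetical numbers, invented only for illustration, not data from this thread):
# hypothetical illustration: strength peaks in midlife, so the integer coding
# young < middle < old is not monotonically related to strength
set.seed(1)
age <- factor(sample(c("young", "middle", "old"), 300, replace = TRUE),
              levels = c("young", "middle", "old"))
strength <- ifelse(age == "middle", 60, 45) + rnorm(300, sd = 5)
tapply(strength, age, mean)  # means are non-monotone in the integer codes 1, 2, 3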
Hmm, we definitely need to increase the robustness of …
I wrote and executed various tests aimed at testing and breaking …
All in all, I believe that …
Predictive mean matching (PMM) is the default method of mice for imputing numerical variables, but it has long been possible to impute factors. This PR introduces better support for working with categorical variables in PMM.

The former system worked as follows: if we specify PMM for an unordered factor, then the similarity among potential donors is expressed on the linear predictor, and we take the observed category of a random draw among the five closest donor cases. As the linear predictor summarizes the available predictive information, matching should produce reasonable imputations. This method is fast and robust against empty cells and fitting problems. The downside is that it depends on the category order. In particular, in mice.impute.pmm() we have the shortcut:
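(A minimal sketch of the kind of shortcut meant here, assuming the factor target is simply recoded to its underlying integer codes before the matching model is estimated; the exact code in mice.impute.pmm() may differ.)
# sketch only, not necessarily the exact code in mice.impute.pmm():
# y is the factor target of the imputation model; recode it to integer codes for matching
ynum <- y
if (is.factor(y)) {
  ynum <- as.integer(y)
}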
The order of integers in ynum may have no sensible interpretation for an unordered factor. The problem is less likely to surface for ordered factors, though there is still the assumption that the categories are equidistant.

The new system quantifies ynum and could yield better results because of higher $R^2$. The PR follows a similar strategy as Frank Harrell's function Hmisc::aregImpute(). The method calculates the canonical correlation between y (as a dummy matrix) and a linear combination of the imputation model predictors x. Similar methods are known as MORALS (Gifi, 1980) or ACE (Breiman and Friedman, 1985). The algorithm then replaces each category of y by a single number taken from the first canonical variate. After this step, the imputation model is fitted, and the predicted values from that model are extracted to function as the similarity measure for the matching step.

The method works for both ordered and unordered factors. No special precautions are taken to ensure monotonicity between the category numbers and the quantifications, so the method should be able to preserve quadratic and other non-monotone relations of the predicted metric. It may be beneficial to remove very sparsely filled categories, for which there is a new trim argument.

Potential advantages are:
Note that we still lack solid evidence for these claims.
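As a rough illustration of the quantification step described above, the sketch below scores the categories of y on the first canonical variate using stats::cancor(); the helper name quantify_sketch and the details of dummy coding and centering are assumptions for illustration, not the code in this PR.
# minimal sketch: quantify the categories of factor y by their scores on the
# first canonical variate between the dummy matrix of y and numeric predictors x
quantify_sketch <- function(y, x) {
  ydum <- stats::model.matrix(~ y)[, -1, drop = FALSE]  # dummy coding, reference level dropped
  cca <- stats::cancor(ydum, as.matrix(x))
  # per-observation score on the first canonical variate of the y side
  # (sign and scale of the canonical variate are arbitrary)
  scores <- drop(sweep(ydum, 2, cca$xcenter) %*% cca$xcoef[, 1])
  # scores are constant within a category, so collapse to one value per level
  tapply(scores, y, mean)
}
Conceptually, scores like these would replace the plain integer codes before the imputation model is fitted.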
Here are some examples for the new functionality.
Created on 2023-08-07 with reprex v2.0.2
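For instance, along the lines of the reprexes earlier in this thread, one might compare the new quantification-based behaviour with the old integer-coding behaviour (quantify = FALSE); the exact imputations will depend on the installed development version and the seed.
library(mice, warn.conflicts = FALSE)
xname <- c("age", "hgt", "wgt")
br <- boys[c(1:10, 101:110, 501:510, 601:620, 701:710), ]
r <- stats::complete.cases(br[, xname])
x <- br[r, xname]
y <- factor(br[r, "tv"])
dat <- cbind(y, x)
# new behaviour: categories quantified via the first canonical variate
imp_new <- mice(dat, method = "pmm", printFlag = FALSE, seed = 123)
# old behaviour: categories recoded to integers
imp_old <- mice(dat, method = "pmm", printFlag = FALSE, seed = 123, quantify = FALSE)
table(complete(imp_new)$y)
table(complete(imp_old)$y)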