Predictive Mean Matching (PMM) is a semi-parametric imputation approach. It is similar to the regression method except that, for each missing value, it fills in a value drawn randomly from among the observed values of donor cases whose regression-predicted values are closest to the regression-predicted value for the missing case under the simulated regression model (Heitjan and Little 1991; Schenker and Taylor 1996).
The PMM method ensures that imputed values are plausible; it might be more appropriate than the regression method (which assumes a joint multivariate normal distribution) if the normality assumption is violated (Horton and Lipsitz 2001, p. 246).
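To make the donor-matching idea concrete, here is a minimal sketch in base R. It illustrates the principle only and is not mice's actual implementation (among other refinements, mice draws the regression coefficients from their posterior distribution before predicting); the function name pmm_sketch and the default of k = 5 donors are ours.

## minimal sketch of PMM donor matching (illustration only)
pmm_sketch <- function(y, x, k = 5) {
    obs <- !is.na(y)
    fit <- lm(y ~ x, subset = obs)                     ## regression on the observed cases
    pred <- predict(fit, newdata = data.frame(x = x))  ## predicted values for all cases
    sapply(which(!obs), function(i) {
        ## find the k observed cases whose predictions are closest to case i's prediction
        donors <- order(abs(pred[obs] - pred[i]))[1:k]
        ## impute by drawing one actually observed y value from those donors
        sample(y[obs][donors], 1)
    })
}

Because every imputed value is an actually observed value from a donor, the imputations can never fall outside the range of the observed data.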
This page uses the following packages. Make sure that you can load them before trying to run the examples on this page. If you do not have a package installed, run install.packages("packagename"). If you see that a package is out of date, run update.packages().
library(mice)
library(VIM)
library(lattice)
library(ggplot2)
Version info: Code for this page was tested in R version 3.0.1 (2013-05-16)
On: 2013-11-08
With: ggplot2 0.9.3.1; VIM 4.0.0; colorspace 1.2-4; mice 2.18; nnet 7.3-7; MASS 7.3-29; lattice 0.20-23; knitr 1.5
Please note: The purpose of this page is to show how to use various data analysis commands associated with imputation using PMM. It does not cover all aspects of the research process which researchers are expected to do. In particular, it does not cover data cleaning and checking, verification of assumptions, model diagnostics, and potential follow-up analyses. The page is based on a 2011 paper by Stef van Buuren and Karin Groothuis-Oudshoorn in the Journal of Statistical Software.
## Let us use the famous anscombe data and set a few to NA
anscombe <- within(anscombe, {
    y1[1:3] <- NA
    y4[3:5] <- NA
})

## view
anscombe
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8    NA 9.14  7.46  6.58
## 2   8  8  8  8    NA 8.14  6.77  5.76
## 3  13 13 13  8    NA 8.74 12.74    NA
## 4   9  9  9  8  8.81 8.77  7.11    NA
## 5  11 11 11  8  8.33 9.26  7.81    NA
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89
Missing Data Patterns
The mice (Multiple Imputation by Chained Equations) package not only performs imputations but also includes several functions for identifying the missing data pattern(s) present in a particular dataset.
## missing data patterns
md.pattern(anscombe)
##   x1 x2 x3 x4 y2 y3 y1 y4
## 6  1  1  1  1  1  1  1  1 0
## 2  1  1  1  1  1  1  0  1 1
## 2  1  1  1  1  1  1  1  0 1
## 1  1  1  1  1  1  1  0  0 2
##    0  0  0  0  0  0  3  3 6
The grid above represents the 4 missing data patterns present in our modified anscombe data. The first row represents the 6 observations that have complete information for all 8 variables, shown by the value 1 in each variable's column. The second row represents the 2 observations that are missing information only on the variable y1.
## Number of observations per pattern for all pairs of variables
p <- md.pairs(anscombe)
p
## $rr
##    x1 x2 x3 x4 y1 y2 y3 y4
## x1 11 11 11 11  8 11 11  8
## x2 11 11 11 11  8 11 11  8
## x3 11 11 11 11  8 11 11  8
## x4 11 11 11 11  8 11 11  8
## y1  8  8  8  8  8  8  8  6
## y2 11 11 11 11  8 11 11  8
## y3 11 11 11 11  8 11 11  8
## y4  8  8  8  8  6  8  8  8
## 
## $rm
##    x1 x2 x3 x4 y1 y2 y3 y4
## x1  0  0  0  0  3  0  0  3
## x2  0  0  0  0  3  0  0  3
## x3  0  0  0  0  3  0  0  3
## x4  0  0  0  0  3  0  0  3
## y1  0  0  0  0  0  0  0  2
## y2  0  0  0  0  3  0  0  3
## y3  0  0  0  0  3  0  0  3
## y4  0  0  0  0  2  0  0  0
## 
## $mr
##    x1 x2 x3 x4 y1 y2 y3 y4
## x1  0  0  0  0  0  0  0  0
## x2  0  0  0  0  0  0  0  0
## x3  0  0  0  0  0  0  0  0
## x4  0  0  0  0  0  0  0  0
## y1  3  3  3  3  0  3  3  2
## y2  0  0  0  0  0  0  0  0
## y3  0  0  0  0  0  0  0  0
## y4  3  3  3  3  2  3  3  0
## 
## $mm
##    x1 x2 x3 x4 y1 y2 y3 y4
## x1  0  0  0  0  0  0  0  0
## x2  0  0  0  0  0  0  0  0
## x3  0  0  0  0  0  0  0  0
## x4  0  0  0  0  0  0  0  0
## y1  0  0  0  0  3  0  0  1
## y2  0  0  0  0  0  0  0  0
## y3  0  0  0  0  0  0  0  0
## y4  0  0  0  0  1  0  0  3
The matrix rr gives the number of observations where both variables in a pair are observed. The matrix mm is the exact opposite: the number of observations where both variables are missing. The matrix rm gives the number of observations where the first (row) variable is observed and the second (column) variable is missing, and mr is just the reverse of rm.
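Each of these matrices can be indexed directly to pull out a single count. For example, the $rm entry for row y1 and column y4 recovers the 2 observations with y1 observed but y4 missing:

## observations where y1 is observed but y4 is missing
p$rm["y1", "y4"]

## [1] 2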
Missing Data Visualization
The VIM package in R can be used to visualize missing data using several types of plots. We will demonstrate a few of the VIM package's functions.
The marginplot function below will plot both the complete and incomplete observations for the variables specified. Here we will plot the available values for y1 and y4.
## Margin plot of y1 and y4
marginplot(anscombe[c(5, 8)], col = c("blue", "red", "orange"))
The plot above allows you to examine the pattern and distribution of complete and incomplete observations. The blue dots represent individuals that have observed values for both y1 and y4. The blue boxes located on the left and bottom margins are box plots of the non-missing values for each variable.
The red dots represent individuals that are missing y1 but observed for y4 (left margin) or missing y4 but observed for y1 (bottom margin). The red boxes located on the left and bottom margins are box plots of the marginal distributions of these observed values.
Under the Missing Completely at Random (MCAR) assumption the red and blue box plots should be identical.
The pbox function below will plot the marginal distribution of a variable within levels or categories of another variable. Here we obtain a plot of the distribution of the variable x2 by the missingness of y1 and y4. The argument pos = 2 indicates that x2 is in the second column of the data file.
## distributions of missing variable by another specified variable
pbox(anscombe, pos = 2)
In this case, as in the margin plot, the box plots are blue for observed values and red for missing values. This plot is useful for examining the Missing at Random (MAR) assumption: that missingness depends on other observed variable(s) but not on the values of the missing variable(s) themselves. Evidence supporting MAR over MCAR would be a distribution of x2 among observations missing y1 or y4 that is much higher or much lower than its distribution among the non-missing observations. In the graph above, the box plots mostly overlap, once again providing support for the MCAR assumption.
Using MICE (Multiple Imputation by Chained Equations)
The minimum information mice needs is the name of the data frame with missing values you would like to impute. The mice function will detect which variables in the data set have missing information. The default imputation method in the mice package is PMM, and the default number of imputations is 5. If you would like to change the default number of imputations, you can supply a second argument, m, as we demonstrate below.
## by default it does 5 imputations for all missing values
imp1 <- mice(anscombe, m = 5)
## 
##  iter imp variable
##   1   1  y1  y4
##   1   2  y1  y4
##   1   3  y1  y4
##   1   4  y1  y4
##   1   5  y1  y4
##   2   1  y1  y4
##   2   2  y1  y4
##   2   3  y1  y4
##   2   4  y1  y4
##   2   5  y1  y4
##   3   1  y1  y4
##   3   2  y1  y4
##   3   3  y1  y4
##   3   4  y1  y4
##   3   5  y1  y4
##   4   1  y1  y4
##   4   2  y1  y4
##   4   3  y1  y4
##   4   4  y1  y4
##   4   5  y1  y4
##   5   1  y1  y4
##   5   2  y1  y4
##   5   3  y1  y4
##   5   4  y1  y4
##   5   5  y1  y4
## Inspect the multiply imputed data set
imp1
## Multiply imputed data set
## Call:
## mice(data = anscombe, m = 5)
## Number of multiple imputations: 5
## Missing cells per column:
## x1 x2 x3 x4 y1 y2 y3 y4 
##  0  0  0  0  3  0  0  3 
## Imputation methods:
##    x1    x2    x3    x4    y1    y2    y3    y4 
##    ""    ""    ""    "" "pmm"    ""    "" "pmm" 
## VisitSequence:
## y1 y4 
##  5  8 
## PredictorMatrix:
##    x1 x2 x3 x4 y1 y2 y3 y4
## x1  0  0  0  0  0  0  0  0
## x2  0  0  0  0  0  0  0  0
## x3  0  0  0  0  0  0  0  0
## x4  0  0  0  0  0  0  0  0
## y1  1  0  0  1  0  1  1  1
## y2  0  0  0  0  0  0  0  0
## y3  0  0  0  0  0  0  0  0
## y4  1  0  0  1  1  1  1  0
## Random generator seed value: NA
The output states that, as we requested, 5 imputed datasets were created. Our two variables with missing values were imputed using "pmm". The predictor matrix tells us which variables in the dataset were used to produce predicted values for matching. For example, variables x1, x4, and y2-y4 were used to create predicted values for y1. We did not specify a seed value, so R chose one randomly; however, if you want to be able to reproduce your imputation, you can set a seed for the random number generator.
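For example, mice accepts a seed argument (the value 1234 below is arbitrary, and printFlag = FALSE simply suppresses the iteration log):

## rerun the imputation with a fixed seed so the results are reproducible
imp_seeded <- mice(anscombe, m = 5, seed = 1234, printFlag = FALSE)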
Imputation Diagnostic Checks
As we mentioned earlier, one of the benefits of performing imputation using PMM is that the imputed values are plausible. We can check the imputed values stored for each of the 5 imputed datasets in imp1.
imp1$imp$y1
##       1     2     3    4    5
## 1  8.81  7.24  8.81 8.81 7.24
## 2  8.33  8.33  7.24 7.24 8.33
## 3 10.84 10.84 10.84 9.96 9.96
The first three observations were missing information for y1. Above we can see which values were imputed for those observations in each of our 5 imputed datasets.
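The same check can be run for y4, the other imputed variable (output not shown):

## imputed values for y4 in each of the 5 imputed datasets
imp1$imp$y4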
Now, before we can use our imputed datasets, we need to combine them with our original observed data. We can do this using a function in the mice package called complete. We will make our data long by stacking or appending our five imputed datasets, and we will use the inc = TRUE argument to specify that we also want to append our observed original data.
Note: This “long” dataset is now in a format that can also be used for analysis in other statistical packages including SAS and Stata.
imp_tot2 <- complete(imp1, "long", inc = TRUE)
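The stacked data frame should contain 66 rows (the 11 original observations followed by the 5 imputed copies) plus .imp and .id indicator columns. You can verify this, or pull out a single completed dataset instead of the long stack (the object name imp_first below is ours):

## check the dimensions of the stacked data (11 rows x 6 copies expected)
dim(imp_tot2)

## extract just the first completed dataset rather than the long stack
imp_first <- complete(imp1, 1)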
We can inspect the distributions of the original and imputed data using the stripplot function that is part of the lattice package.
## labels observed data in blue and imputed data in red for y1
col <- rep(c("blue", "red")[1 + as.numeric(is.na(imp1$data$y1))], 6)

## plots data for y1 by imputation
stripplot(y1 ~ .imp, data = imp_tot2, jit = TRUE, col = col, xlab = "imputation Number")
As you can see above, the blue dots represent the observed data values for y1 from the original anscombe file. The red dots represent the imputed values in each of our five imputed datasets.
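The mice package also provides lattice-based diagnostic methods that can be called directly on the mids object; for example, densityplot compares the densities of the observed and imputed values:

## density of observed vs. imputed values for each imputed variable
densityplot(imp1)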
Regression with imputed datasets
Now we are ready to use our multiply imputed datasets in an analysis. We will demonstrate how to do this by running a linear regression model with y1 as the outcome and y4 and x1 as predictors.
## linear regression for each imputed data set - 5 regressions are run
fitm <- with(imp1, lm(y1 ~ y4 + x1))
summary(fitm)
## 
## 
##  summary of imputation 1 :
## 
## Call:
## lm(formula = y1 ~ y4 + x1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.839 -0.353  0.126  0.506  0.817 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.9529     1.6427    3.62  0.00675 ** 
## y4           -0.3650     0.1516   -2.41  0.04264 *  
## x1            0.5133     0.0919    5.58  0.00052 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.86 on 8 degrees of freedom
## Multiple R-squared: 0.885,  Adjusted R-squared: 0.856 
## F-statistic: 30.8 on 2 and 8 DF,  p-value: 0.000174
## 
## 
##  summary of imputation 2 :
## 
## Call:
## lm(formula = y1 ~ y4 + x1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.683 -0.561  0.402  0.702  0.992 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    5.415      1.847    2.93   0.0189 * 
## y4            -0.321      0.174   -1.85   0.1019   
## x1             0.518      0.106    4.88   0.0012 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.02 on 8 degrees of freedom
## Multiple R-squared: 0.839,  Adjusted R-squared: 0.798 
## F-statistic: 20.8 on 2 and 8 DF,  p-value: 0.000677
## 
## 
##  summary of imputation 3 :
## 
## Call:
## lm(formula = y1 ~ y4 + x1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.691 -0.541  0.268  0.571  0.903 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.0681     1.6223    3.12  0.01414 *  
## y4           -0.2995     0.1519   -1.97  0.08417 .  
## x1            0.5445     0.0929    5.86  0.00038 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.88 on 8 degrees of freedom
## Multiple R-squared: 0.881,  Adjusted R-squared: 0.851 
## F-statistic: 29.5 on 2 and 8 DF,  p-value: 0.000204
## 
## 
##  summary of imputation 4 :
## 
## Call:
## lm(formula = y1 ~ y4 + x1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6816 -0.4355  0.0969  0.4771  1.0254 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.3909     1.5415    3.50  0.00811 ** 
## y4           -0.3150     0.1455   -2.17  0.06226 .  
## x1            0.5146     0.0877    5.87  0.00038 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.842 on 8 degrees of freedom
## Multiple R-squared: 0.88,  Adjusted R-squared: 0.85 
## F-statistic: 29.4 on 2 and 8 DF,  p-value: 0.000205
## 
## 
##  summary of imputation 5 :
## 
## Call:
## lm(formula = y1 ~ y4 + x1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6262 -0.4712  0.0512  0.6208  1.3275 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    5.996      1.861    3.22   0.0122 * 
## y4            -0.343      0.169   -2.03   0.0769 . 
## x1             0.452      0.108    4.18   0.0031 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.967 on 8 degrees of freedom
## Multiple R-squared: 0.84,  Adjusted R-squared: 0.8 
## F-statistic: 20.9 on 2 and 8 DF,  p-value: 0.000661
R will estimate our regression model separately for each imputed dataset, 1 through 5. We then need to summarize or pool those estimates to get one overall set of parameter estimates.
## pool coefficients and standard errors across all 5 regression models
pool(fitm)
## Call: pool(object = fitm)
## 
## Pooled coefficients:
## (Intercept)          y4          x1 
##       5.564      -0.329       0.509 
## 
## Fraction of information about the coefficients missing due to nonresponse:
## (Intercept)          y4          x1 
##       0.267       0.238       0.331
## output parameter estimates
summary(pool(fitm))
##                est    se     t   df Pr(>|t|)  lo 95  hi 95 nmis   fmi lambda
## (Intercept)  5.564 1.763  3.16 6.11  0.01917  1.270 9.8587   NA 0.267 0.0614
## y4          -0.329 0.161 -2.04 6.34  0.08503 -0.718 0.0607    3 0.238 0.0303
## x1           0.509 0.105  4.86 5.58  0.00344  0.248 0.7692    0 0.331 0.1272
Now we have one set of parameter estimates for our linear regression model, pooled across our five imputed datasets.
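For comparison, you can fit the same model to the original data with listwise deletion; lm drops incomplete rows by default, so this complete-case fit uses only the 6 observations with y1, y4, and x1 all observed (output not shown):

## complete-case (listwise deletion) fit for comparison with the pooled estimates
summary(lm(y1 ~ y4 + x1, data = anscombe))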
Additional Resources
You may also want to check out our FAQ page on how R handles missing data.