Office of Advanced Research Computing (OARC), Statistical Methods and Data Analytics
Missing data arise in most applied statistical analyses.
The methods that help us understand and handle the statistical analysis of data with missing values are called missing data analysis.
Many statistical methods cannot be applied directly to data with missing values.
There are three main approaches to handling missing data:
Omission: ignoring (dropping) cases that have missing values from the analysis
Imputation: filling in the missing values
Analysis: using statistical methods unaffected by missing data; techniques such as full information maximum likelihood and EM fall in this category
Depending on the type of data and model that you are using, one of the methods may better serve your needs.
In this workshop we will focus on Multiple Imputation to handle missing data.
Other methods may not be possible or sufficient. For example, we may not have enough data, or any data left at all, if cases with missing values are dropped.
Implementing full information maximum likelihood is not always easy or available.
Multiple imputation is reliable and available in most statistical packages.
With some multiple imputation methods we are able to impute the missing values with possible and realistic values.
There are currently 165 packages on CRAN that more or less deal with missing data.
For a list of R packages related to missing data analysis, see the CRAN Task View on Missing Data.
For example, if we want to estimate the average weight loss for people participating in a diet experiment, and men are more likely to refuse to report their weight loss, we may get a misleading result if the average weight loss differs between men and women.
If we drop participants with incomplete answers to a survey, we lose information and the power of our analysis will be lower.
If we impute the missing values with a fixed value, or with values that do not incorporate the uncertainty in the population, the result will not reflect the variation we would have seen in the data had there been no missingness. This may make the confidence interval for the estimate too narrow or too wide, and might change the inference about the quantity of interest.
In R, a missing value is represented by NA. When we read data into R from a text or CSV file, a blank field is converted to NA by default. However, most read functions have an option to assign NA to a specific value in the data. For example, if 999 means a missing value in the original file, you can set 999 to NA when reading the data into R. Let's now look at missing data in R.
#Please make sure the packages that we are going to use are installed.
#install.packages("mice", dependencies = TRUE)
#install.packages("VIM")
#install.packages("dplyr")
#install.packages("tidyverse")
#install.packages("lattice")
#in R a missing value is represented by NA
#In this example `a` is not missing! it is null
a <- c()
a
## NULL
In the last example, a is not missing; it is NULL.
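The code chunk that produced the next output was not preserved in this extract; an assignment like the following (an assumption) yields a true missing value:
#a single NA is a missing value
b <- NA
b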
## [1] NA
In the next example we create a vector with the 4th element missing.
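The chunk itself is missing from this extract; since later code computes mean(x), the vector was presumably created as:
#create a vector with the 4th element missing
x <- c(1, 2, 3, NA, 5)
x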
## [1] 1 2 3 NA 5
The generic function is.na indicates which elements are missing. The generic function is.na<- sets elements to NA.
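The original chunks here were lost; calls like the following (an assumption) reproduce the outputs below, in order:
#which elements are missing?
is.na(x)
#position of the missing element
which(is.na(x))
#number of missing elements
sum(is.na(x))
#is any element missing?
anyNA(x)
#set the 2nd element to NA with is.na<-
is.na(x) <- 2
x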
## [1] FALSE FALSE FALSE TRUE FALSE
## [1] 4
## [1] 1
## [1] TRUE
## [1] 1 NA 3 NA 5
#let's try applying some computational functions on x
#by default, most functions in R do not ignore NAs
mean(x)
## [1] NA
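The remaining outputs are consistent with calls like (an assumption):
#sum is also NA
sum(x)
#ignore the NAs explicitly
mean(x, na.rm = TRUE)
sum(x, na.rm = TRUE)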
## [1] NA
## [1] 3
## [1] 9
To draw random values from certain distributions, R has built-in functions that start with the letter r. For example, rnorm draws a random value (or vector) from a normal distribution, and runif draws a random number between 0 and 1 (by default) with a uniform probability density.
In the following example we will generate a dataset with missing values. The function set.seed sets the random seed in R, so every time we draw a sample we will get the same result.
#setting seed
set.seed(123)
#We draw a sample of size 25 from a standard normal distribution and store them in y1
y1 <- rnorm(25)
#We draw a sample of size 25 from a normal distribution with mean 1 and standard deviations 5
y2 <- rnorm(n = 25, mean = 1, sd = 5)
#Let's draw 25 random values between 0 and 1 with a uniform density
z1 <- runif(n = 25, min = 0, max = 1)
#create missingness in y1 where z1 is less than 0.5
y1 <- replace(y1, z1 < 0.5, NA)
Exercise:
Draw a standard normal random vector of size 25 and, for each value greater than 1 or less than -1, replace the corresponding element of y2 with a missing value.
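One possible solution (a sketch; the name z2 is a choice made here, not from the original):
#draw a standard normal vector and use it to create missingness in y2
z2 <- rnorm(25)
y2 <- replace(y2, z2 > 1 | z2 < -1, NA)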
We can put y1 and y2 together as a data frame, which is the usual multivariate data format in R.
#data frame is the usual way a multivariate data is stored in R
dat <- data.frame(y1 = y1, y2 = y2)
#this is how a missing data look like in R
head(dat)
## y1 y2
## 1 -0.56047565 NA
## 2 NA NA
## 3 NA 1.766866
## 4 0.07050839 NA
## 5 NA 7.269075
## 6 1.71506499 3.132321
In missing data analysis, the missing indicator is defined as a matrix of the same size as the data, with 0 for missing and 1 for observed values, showing which values in the data are missing. In R we can create it using the is.na function. The exclamation mark ! changes TRUE to FALSE and vice versa.
#R is the missing indicator; we can keep it as TRUE/FALSE
R <- !is.na(dat)
# or we can make R as 0 and 1, 0 indicates missing and 1 indicates observed
R <- apply(R, 2, as.numeric)
#name the columns of R
colnames(R) <- c("R1","R2")
#display the data and the missing indicator together
dat <- cbind(dat, R)
head(dat)
## y1 y2 R1 R2
## 1 -0.56047565 NA 1 0
## 2 NA NA 0 0
## 3 NA 1.766866 0 1
## 4 0.07050839 NA 1 0
## 5 NA 7.269075 0 1
## 6 1.71506499 3.132321 1 1
The missing data mechanism is the process that generated the missing values.
Missing data mechanisms generally fall into one of three main categories.
Missing completely at random (MCAR)
Missing at random (MAR)
Missing not at random (MNAR)
A variable is missing completely at random (MCAR) if neither the variables in the dataset nor the unobserved value of the variable itself predict whether a value will be missing.
This is the strongest assumption on missingness. When data are missing completely at random, analyzing only the complete cases gives unbiased results.
A variable is missing at random (MAR) if other observed variables in the dataset, but not the unobserved value of the variable itself, predict whether a value will be missing. For example, in surveys, men may be more likely to decline to answer some questions than women.
Multiple imputation and other modern methods such as direct maximum likelihood generally assume that the data are MAR.
A variable is missing not at random (MNAR) if the unobserved value of the variable itself predicts whether the value will be missing. For example, individuals with very high incomes are more likely to decline to answer questions about their income than individuals with more moderate incomes.
This is the most liberal (least restrictive) assumption about the missingness mechanism. Statistical models for MNAR processes are beyond the scope of this seminar.
If all variables in a dataset are MCAR, the dataset is MCAR. If all variables are MCAR or MAR, the dataset is MAR. If at least one variable is MNAR, the dataset is called MNAR.
In the figure below, \(X\) represents variables that are completely observed, \(Y\) represents variables that are partly missing, \(Z\) represents the unobserved component of the cause of missingness unrelated to \(X\) and \(Y\), and \(R\) represents the missingness (missing indicator).
For more information on missing data mechanisms please see:
Allison, 2002
Enders, 2010
Little & Rubin, 2002
Rubin, 1976
Schafer & Graham, 2002
Example:
We use mvrnorm(N, mu = mu, Sigma = Sigma) from package MASS to generate a bivariate normal sample of size 100 with means zero, variances 1, and covariance 0.5. Then we create a missing indicator \(R_2\) if the first variable is less than -0.5 and a missing indicator \(R_1\) if the second variable is greater than 0.5. Next, we apply the missing indicators R1 and R2 to the first and second variables of the bivariate sample, respectively.
#bivariate normal sample
Y <- MASS::mvrnorm(100, mu = c(0, 0), Sigma = .5 * diag(2) + .5)
R2 <- Y[,1] < -0.5
R1 <- Y[,2] > 0.5
#apply missingness R1 and R2 to the data
dat.MAR <- data.frame(Y1 = replace(Y[,1], R1==TRUE, NA),
Y2 = replace(Y[,2], R2==TRUE, NA))
head(dat.MAR)
## Y1 Y2
## 1 NA NA
## 2 0.4699910 -0.3403352
## 3 NA NA
## 4 -0.7778968 NA
## 5 -0.9119195 NA
## 6 NA NA
The best way to treat missing data is not to have them (Orchard and Woodbury 1972).
Listwise deletion (complete case analysis) involves deleting cases that have missing values on any variable of interest. It is a common technique because it is easy to implement and works with any type of analysis.
Mean imputation, also called mean substitution, involves replacing the missing values of each variable with the variable's mean estimated from the available cases. While this is a simple and easily implemented method for dealing with missing values, it has some unfortunate consequences. The most important problem with mean imputation is that it results in an artificial reduction in variability, because you are imputing values at the center of the variable's distribution. This also has the unintended consequence of changing the magnitude of correlations between the imputed variable and other variables.
In R we can loop over all columns of the data and impute each column with its mean. We will see later that mice can do this directly.
#first copy the data with missing values into a new data frame to impute
dat_mean_im <- dat
for(i in 1:ncol(dat_mean_im)) {
dat_mean_im[ , i][is.na(dat_mean_im[ , i])] <- mean(dat_mean_im[ , i], na.rm = TRUE)
}
#getting mean and standard deviation of data with missing
data.frame(mean = apply(dat, 2,mean, na.rm = TRUE), sd = apply(dat, 2,sd, na.rm = TRUE))
## mean sd
## y1 -0.05294653 1.0705524
## y2 1.94918108 4.2747637
## R1 0.48000000 0.5099020
## R2 0.84000000 0.3741657
# mean and standard deviation of mean imputed data
data.frame(mean = apply(dat_mean_im, 2,mean), sd = apply(dat_mean_im, 2,sd))
## mean sd
## y1 -0.05294653 0.7247674
## y2 1.94918108 3.9023075
## R1 0.48000000 0.5099020
## R2 0.84000000 0.3741657
Does mean imputation give a better point estimate of the standard deviation?
A slightly more sophisticated type of imputation is regression (conditional mean) imputation, which replaces missing values with predicted scores from a regression equation. The strength of this approach is that it uses complete information to impute values. The drawback is that all predicted values fall directly on the regression line, once again decreasing variability, just not as much as with unconditional mean imputation. Moreover, statistical models cannot distinguish between observed and imputed values and therefore do not incorporate into the model the error or uncertainty associated with the imputed values. Further discussion and an example of deterministic imputation can be found in Craig Enders' book "Applied Missing Data Analysis" (2010). The image below is from that book:
In recognition of the problems with regression imputation and the reduced variability associated with this approach, researchers developed a technique to incorporate or "add back" the lost variability. A residual term, randomly drawn from a normal distribution with mean zero and variance equal to the residual variance from the regression model, is added to the predicted scores from the regression imputation, thus restoring some of the lost variability. This method is superior to the previous methods, as it produces unbiased coefficient estimates under MAR. However, the standard errors produced during regression estimation, while less biased than with the single imputation approach, will still be attenuated.
For continuous and multivariate normal data, FIML is an extension of the normal theory ML estimator, which accommodates missing data as a by-product of using raw data as inputs. In ML, the likelihood function is usually computed assuming all observations have complete responses and observations with missing values are discarded.
FIML accommodates missing responses by using the non-missing information to calculate the likelihood instead of discarding missing cases (Enders, 2010).
We will discuss this method in more detail next.
Multiple imputation is essentially an iterative form of stochastic imputation.
Instead of filling in a single value, the distribution of the observed data is used to estimate multiple values that reflect the uncertainty around the true value.
Multiple Imputation has three basic phases:
Imputation or Fill-in Phase: The missing data are filled in with estimated values and a complete data set is created. This process of fill-in is repeated \(m\) times.
Analysis Phase: Each of the m complete data sets is then analyzed using a statistical method of interest (e.g. linear regression).
Pooling Phase: The parameter estimates (e.g. coefficients and standard errors) obtained from each analyzed data set are then combined for inference.
Two main algorithms for the imputation phase are expectation-maximization with bootstrapping (EMB) and fully conditional specification (FCS).
In EMB, multiple bootstrapped samples are first generated from the incomplete data. In the expectation-maximization (EM) algorithm, each bootstrapped sample goes through the expectation (E) stage, calculating the expected likelihood of the model using assumed model parameters. In the maximization (M) stage, updated model parameters are estimated such that they maximize the expected likelihood generated in the E stage. One cycle consists of an E and an M stage. The updated model parameters are utilized in the E stage of the next cycle to estimate the likelihood. Cycles are repeated until convergence, when the updated model parameters no longer differ appreciably from the previous cycle. Missing cases are imputed by values drawn from the distribution described by the updated model parameters. Each bootstrapped sample is imputed and used in the analysis phase.
FCS is another approach to multiple imputation which imputes missing data on a variable-by-variable basis.
In the mice package in R, FCS is implemented using the
multiple imputation by chained equations (MICE) algorithm, a Markov
chain Monte Carlo (MCMC) method developed by van Buuren (2007). Missing
values are initially imputed by randomly selected observed values. In
the first step, model parameter estimates \(\hat\theta\) that characterize the missing
data are drawn from a posterior distribution derived from other
variables, non-missing data of that variable, and the missingness. In
the second step, imputations are drawn from another posterior
distribution, derived from the same information and the parameter
estimates from the first step. Missing values on the variable are
imputed in the second step. The two steps above are iterated k times,
ending at the kth variable. One cycle is complete when all variables
have gone through these two steps. Cycles are repeated until the imputed
values are stable and convergence is reached. The final imputed dataset
is saved as one imputation to be used as input in the analysis
phase.
The EMB algorithm runs much faster than FCS, as convergence is more easily assessed and the bootstrapped samples are independent of one another. However, EMB assumes that the data follow a multivariate normal distribution, and the imputed values are continuous.
In stochastic imputation we assume a linear relationship between the observed variables and the variable with missing values, and we randomly draw a sample from the residuals and add it to the predicted mean of the missing variable \(Y\).
In normal multiple imputation we go one step further and assume the coefficients of the linear model are not fixed but are random variables.
In statistics, when we assume parameters are random variables, we can estimate their so-called posterior distribution using Bayesian methods.
In normal imputation we impute the missing values by randomly drawing from the posterior distribution of the model parameters and then using those randomly drawn parameters to draw the imputed values.
When we have complete data, estimating the parameters of the model is straightforward using the maximum likelihood method. The likelihood function is:
\[L(Y; \theta) = \text{Prob}(Y | \theta) \]
The EM procedure is as follows: in the E step, given the current parameter estimate \(\theta^{(t)}\), we compute the expected complete-data log-likelihood, averaging over the distribution of the missing values \(Y_{mis}\); in the M step, we update \(\theta^{(t+1)}\) by maximizing this expected log-likelihood. The two steps are repeated until the estimates converge.
Markov chain Monte Carlo (MCMC) is a modified procedure relative to EM. In MCMC we draw the missing values \(Y_{mis}\) from their posterior distribution and treat them as if they had been observed. We repeat this procedure until we reach a stationary point.
Stationarity means that after \(k\) iterations, \(\theta^{(t+k)}\), the simulated value after iteration \(t + k\), is independent of \(\theta^{(t)}\); we then say the chain is stationary with lag \(k\).
The mice package
The R package mice imputes incomplete multivariate data by chained equations; it was developed by Stef van Buuren and Karin Groothuis-Oudshoorn (2011).
Over the last decade, mice has become an important piece of imputation software, offering a very flexible environment for dealing with incomplete data.
The versatility of imputation methods in mice gives the flexibility to impute different types of data.
For this workshop we use a dataset named hsb2_mar, in which values are missing at random (MAR).
These are not real data; the dataset was created just to illustrate the statistical methods in this workshop.
The data contain test scores and demographic and school information for 200 high school students.
The analysis we are interested in is a linear regression of the read score on the independent variables write and math, adding the predictors female and prog at the end.
Let's read in the data and prepare it for the analysis.
#hsb2_mar is missing at random
hsb2_mar <- read.csv("https://stats.idre.ucla.edu/wp-content/uploads/2021/03/hsb2_mar.csv")
summary(hsb2_mar)
## id race ses schtyp read
## Min. : 1.00 Min. :1.00 Min. :1.000 Min. :1.00 Min. :28.00
## 1st Qu.: 50.75 1st Qu.:3.00 1st Qu.:2.000 1st Qu.:1.00 1st Qu.:44.00
## Median :100.50 Median :4.00 Median :2.000 Median :1.00 Median :50.00
## Mean :100.50 Mean :3.43 Mean :2.055 Mean :1.16 Mean :52.29
## 3rd Qu.:150.25 3rd Qu.:4.00 3rd Qu.:3.000 3rd Qu.:1.00 3rd Qu.:60.00
## Max. :200.00 Max. :4.00 Max. :3.000 Max. :2.00 Max. :76.00
## NA's :9
## write math science socst
## Min. :31.00 Min. :33.0 Min. :26.00 Min. :26.00
## 1st Qu.:46.00 1st Qu.:45.0 1st Qu.:44.00 1st Qu.:46.00
## Median :54.00 Median :52.0 Median :53.00 Median :52.00
## Mean :52.95 Mean :52.9 Mean :51.31 Mean :52.41
## 3rd Qu.:60.00 3rd Qu.:60.0 3rd Qu.:58.00 3rd Qu.:61.00
## Max. :67.00 Max. :75.0 Max. :74.00 Max. :71.00
## NA's :17 NA's :15 NA's :16
## prog female
## Length:200 Length:200
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
#since the variables prog and female are categorical, we convert them to factors
hsb2_mar$prog <- factor(hsb2_mar$prog)
#we can change the reference group of a factor
hsb2_mar$prog <- relevel(hsb2_mar$prog, ref = "vocational")
hsb2_mar$female <- factor(hsb2_mar$female)
hsb2_mar$female <- relevel(hsb2_mar$female, ref = "male")
#now we can see NA appears in the summary
summary(hsb2_mar)
## id race ses schtyp read
## Min. : 1.00 Min. :1.00 Min. :1.000 Min. :1.00 Min. :28.00
## 1st Qu.: 50.75 1st Qu.:3.00 1st Qu.:2.000 1st Qu.:1.00 1st Qu.:44.00
## Median :100.50 Median :4.00 Median :2.000 Median :1.00 Median :50.00
## Mean :100.50 Mean :3.43 Mean :2.055 Mean :1.16 Mean :52.29
## 3rd Qu.:150.25 3rd Qu.:4.00 3rd Qu.:3.000 3rd Qu.:1.00 3rd Qu.:60.00
## Max. :200.00 Max. :4.00 Max. :3.000 Max. :2.00 Max. :76.00
## NA's :9
## write math science socst prog
## Min. :31.00 Min. :33.0 Min. :26.00 Min. :26.00 vocational:46
## 1st Qu.:46.00 1st Qu.:45.0 1st Qu.:44.00 1st Qu.:46.00 academic :95
## Median :54.00 Median :52.0 Median :53.00 Median :52.00 general :41
## Mean :52.95 Mean :52.9 Mean :51.31 Mean :52.41 NA's :18
## 3rd Qu.:60.00 3rd Qu.:60.0 3rd Qu.:58.00 3rd Qu.:61.00
## Max. :67.00 Max. :75.0 Max. :74.00 Max. :71.00
## NA's :17 NA's :15 NA's :16
## female
## male : 81
## female:101
## NA's : 18
##
##
##
##
By default, if we run the lm function in R, it uses only complete cases. In other words, the default behavior of lm is listwise deletion.
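The code chunk for this model was not preserved; based on the Call line in the output and the later reference to m_complete, it was presumably:
#complete case (listwise deletion) analysis
m_complete <- lm(read ~ write + math, data = hsb2_mar)
summary(m_complete)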
##
## Call:
## lm(formula = read ~ write + math, data = hsb2_mar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9779 -4.6945 -0.2417 5.1221 21.1444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.79737 3.84339 1.769 0.0789 .
## write 0.34444 0.08139 4.232 3.92e-05 ***
## math 0.51945 0.08363 6.211 4.46e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.445 on 158 degrees of freedom
## (39 observations deleted due to missingness)
## Multiple R-squared: 0.4782, Adjusted R-squared: 0.4716
## F-statistic: 72.41 on 2 and 158 DF, p-value: < 2.2e-16
How many cases have been used in the above model?
#We can see it from the summary
#158 df + 3 = 161
#or
#complete.cases gives a logical vector: TRUE if a row is complete
sum(complete.cases(hsb2_mar[c("read", "write", "math")]))
## [1] 161
We now use mean value imputation to fill in the means of read, write, and math; then, using the mean-imputed data, we run the regression model of read on write and math.
#mean value imputation
data_mean_im <- hsb2_mar[c("read", "write", "math")]
for(i in 1:ncol(data_mean_im)) {
data_mean_im[ , i][is.na(data_mean_im[ , i])] <- mean(data_mean_im[ , i], na.rm = TRUE)
}
#regression on imputed data by mean value
m_mean_im <- lm(read ~ write + math, data = data_mean_im)
summary(m_mean_im)
##
## Call:
## lm(formula = read ~ write + math, data = data_mean_im)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9882 -5.0091 -0.3489 5.1041 20.9949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.92439 3.61542 2.192 0.0296 *
## write 0.33851 0.07174 4.718 4.50e-06 ***
## math 0.49982 0.07057 7.083 2.44e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.488 on 197 degrees of freedom
## Multiple R-squared: 0.4423, Adjusted R-squared: 0.4367
## F-statistic: 78.13 on 2 and 197 DF, p-value: < 2.2e-16
#comparing two models side by side
sjPlot::tab_model(m_complete, m_mean_im, show.p = FALSE, show.se = TRUE, digits = 4, dv.labels = c("complete case", "mean Imputed"))
|  | complete case |  |  | mean Imputed |  |  |
|---|---|---|---|---|---|---|
| Predictors | Estimates | std. Error | CI | Estimates | std. Error | CI |
| (Intercept) | 6.7974 | 3.8434 | -0.7937 – 14.3884 | 7.9244 | 3.6154 | 0.7945 – 15.0543 |
| write | 0.3444 | 0.0814 | 0.1837 – 0.5052 | 0.3385 | 0.0717 | 0.1970 – 0.4800 |
| math | 0.5195 | 0.0836 | 0.3543 – 0.6846 | 0.4998 | 0.0706 | 0.3607 – 0.6390 |
| Observations | 161 |  |  | 200 |  |  |
| R2 / R2 adjusted | 0.478 / 0.472 |  |  | 0.442 / 0.437 |  |  |
The lavaan package
lavaan is an R package for latent variable analysis. You can use lavaan to estimate a large variety of multivariate statistical models, including path analysis, confirmatory factor analysis, structural equation modeling, and growth curve models.
For missing data, we can use the argument missing = "FIML" (or "ML") to ask lavaan to estimate the model by full information maximum likelihood.
#regression model with lavaan setting missing to FIML
library(lavaan)
lm.fiml <- sem('read ~ write + math', data = hsb2_mar,
missing = "FIML", fixed.x = F)
summary(lm.fiml)
## lavaan 0.6.13 ended normally after 26 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 9
##
## Number of observations 200
## Number of missing patterns 5
##
## Model Test User Model:
##
## Test statistic 0.000
## Degrees of freedom 0
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Observed
## Observed information based on Hessian
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## read ~
## write 0.327 0.076 4.286 0.000
## math 0.514 0.077 6.688 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## write ~~
## math 54.484 7.588 7.180 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .read 7.772 3.477 2.236 0.025
## write 53.071 0.676 78.553 0.000
## math 52.795 0.672 78.550 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .read 53.452 5.593 9.557 0.000
## write 86.703 9.078 9.551 0.000
## math 86.870 8.928 9.730 0.000
Investigating missing data with mice and VIM
The mice package has many useful functions for investigating missing data. For example, we can study the missing data pattern with the function md.pattern.
library(mice)
library(lattice)
#again we set seed at first
set.seed(123)
#Check the missingness pattern for the hsb2_mar dataset
md.pattern(hsb2_mar, rotate.names = TRUE)
## id race ses schtyp socst read math science write prog female
## 117 1 1 1 1 1 1 1 1 1 1 1 0
## 15 1 1 1 1 1 1 1 1 1 1 0 1
## 14 1 1 1 1 1 1 1 1 1 0 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 2
## 12 1 1 1 1 1 1 1 1 0 1 1 1
## 1 1 1 1 1 1 1 1 1 0 1 0 2
## 1 1 1 1 1 1 1 1 1 0 0 1 2
## 13 1 1 1 1 1 1 1 0 1 1 1 1
## 1 1 1 1 1 1 1 1 0 1 0 1 2
## 1 1 1 1 1 1 1 1 0 0 1 1 2
## 11 1 1 1 1 1 1 0 1 1 1 1 1
## 1 1 1 1 1 1 1 0 1 1 1 0 2
## 1 1 1 1 1 1 1 0 1 1 0 1 2
## 2 1 1 1 1 1 1 0 1 0 1 1 2
## 8 1 1 1 1 1 0 1 1 1 1 1 1
## 1 1 1 1 1 1 0 1 0 1 1 1 2
## 0 0 0 0 0 9 15 16 17 18 18 93
Looking at the missing data pattern is always useful (though it may be difficult for datasets with many variables). It can give you an indication of how much information is missing and how the missingness is distributed.
There are many other useful packages and missing data visualizations. For example, package VIM has some interesting plots.
library(VIM)
#aggregation plot of missing values; note labels must come from the actual data frame
aggr_plot <- aggr(hsb2_mar, col=c('blue','red'), numbers=TRUE,
                  sortVars=TRUE, labels=names(hsb2_mar), cex.axis=.7, gap=3,
                  ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## prog 0.090
## female 0.090
## write 0.085
## science 0.080
## math 0.075
## read 0.045
## id 0.000
## race 0.000
## ses 0.000
## schtyp 0.000
## socst 0.000
marginplot from package VIM is a very nice visualization for observing the relationship between missing data and the other variables in the data.
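The original plotting chunk is not shown here; a call like the following (the choice of variable pair is an assumption) displays the observed values of two variables along with the distribution of each where the other is missing:
#margin plot of read vs. write; missing values are shown in the plot margins
marginplot(hsb2_mar[, c("read", "write")])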
We did mean imputation manually before. To mean impute using mice, we simply set method to "mean". The method option in mice tells which method should be used for imputation. The result is an object of class mids.
#select variables "read", "math", and "write" in a new data
hsb2_mar_1 <- hsb2_mar[c("read", "math","write")]
#mean impute in mice
imp <- mice(hsb2_mar_1, method = "mean", m = 1, maxit = 1)
##
## iter imp variable
## 1 1 read math write
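The output below presumably comes from checking the class of the returned object:
class(imp)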
## [1] "mids"
As you can see, the algorithm ran for 1 iteration (maxit = 1) and presented us with only 1 imputation (m = 1) for each missing datum.
We can extract the imputed data using the function complete(imp).
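The first output below is consistent with (an assumption):
#first 10 rows of the completed, mean-imputed data
head(complete(imp), 10)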
## read math write
## 1 57 54 59.00000
## 2 47 61 62.00000
## 3 60 58 54.00000
## 4 54 57 63.00000
## 5 68 53 59.00000
## 6 55 61 49.00000
## 7 55 66 52.95082
## 8 52 55 54.00000
## 9 52 57 67.00000
## 10 36 42 57.00000
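The second output is presumably the column means of the completed data:
colMeans(complete(imp))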
## read math write
## 52.28796 52.89730 52.95082
Now we can use the imputed data for the analysis in mice. The returned object is of class mira.
## # A tibble: 3 × 6
## term estimate std.error statistic p.value nobs
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 (Intercept) 7.92 3.62 2.19 2.96e- 2 200
## 2 math 0.500 0.0706 7.08 2.44e-11 200
## 3 write 0.339 0.0717 4.72 4.50e- 6 200
## [1] "mira" "matrix"
The argument method = "norm.predict" first fits a regression model based on the observed values of the other variables, and then imputes the missing values with the predicted values.
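A sketch of the presumable calls (the original chunk was not preserved):
#deterministic regression imputation
imp <- mice(hsb2_mar_1, method = "norm.predict", m = 1, maxit = 1)
fit <- with(imp, lm(read ~ math + write))
summary(fit)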
##
## iter imp variable
## 1 1 read math write
## # A tibble: 3 × 6
## term estimate std.error statistic p.value nobs
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 (Intercept) 6.36 3.23 1.97 5.03e- 2 200
## 2 math 0.534 0.0717 7.45 2.92e-12 200
## 3 write 0.334 0.0722 4.63 6.74e- 6 200
It is clear that something has changed. In fact, we extrapolated the regression model fitted to the observed data to the missing data; in other words, the relation gets stronger.
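The next output presumably comes from stochastic regression imputation (a sketch; without a seed the numbers vary from run to run):
#stochastic regression imputation
imp <- mice(hsb2_mar_1, method = "norm.nob", m = 1, maxit = 1)
fit <- with(imp, lm(read ~ math + write))
summary(fit)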
##
## iter imp variable
## 1 1 read math write
## # A tibble: 3 × 6
## term estimate std.error statistic p.value nobs
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 (Intercept) 8.15 3.29 2.47 1.42e- 2 200
## 2 math 0.504 0.0742 6.80 1.22e-10 200
## 3 write 0.325 0.0720 4.52 1.08e- 5 200
This code imputes the missing values in the data set by the stochastic regression imputation method.
We add a random error to allow for our imputations to be off the regression line.
The function does not incorporate the variability of the regression weights, so it is not ‘proper’ in the sense of Rubin (1987). For small samples, the variability of the imputed data will be underestimated.
Re-run the stochastic imputation model with seed 123 and verify if your results are the same as the ones below:
##stochastic regression imputation with the seed set to 123 gives the same result every time
imp <- mice(hsb2_mar_1, method = "norm.nob", m = 1, maxit = 1, seed = 123)
##
## iter imp variable
## 1 1 read math write
## # A tibble: 3 × 6
## term estimate std.error statistic p.value nobs
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 (Intercept) 8.18 3.26 2.50 1.31e- 2 200
## 2 math 0.544 0.0717 7.59 1.27e-12 200
## 3 write 0.292 0.0730 4.01 8.72e- 5 200
The imputation procedure uses random sampling, and therefore, the results will be (perhaps slightly) different if we repeat the imputations. In order to get exactly the same result, you can use the seed argument.
We now impute the missing data in the dataset multiple times using the default options of mice.
For each variable there is a method of imputation. Since all variables in this data are continuous, by default mice uses the imputation method pmm, or predictive mean matching (see the printed mids object below).
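The chunk was not preserved; a call like the following (the seed is an assumption) produces the iteration log and the printed mids object below:
#multiple imputation with mice defaults: m = 5 imputations, maxit = 5 iterations
imp <- mice(hsb2_mar_1, seed = 123)
#print the mids object
imp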
##
## iter imp variable
## 1 1 read math write
## 1 2 read math write
## 1 3 read math write
## 1 4 read math write
## 1 5 read math write
## 2 1 read math write
## 2 2 read math write
## 2 3 read math write
## 2 4 read math write
## 2 5 read math write
## 3 1 read math write
## 3 2 read math write
## 3 3 read math write
## 3 4 read math write
## 3 5 read math write
## 4 1 read math write
## 4 2 read math write
## 4 3 read math write
## 4 4 read math write
## 4 5 read math write
## 5 1 read math write
## 5 2 read math write
## 5 3 read math write
## 5 4 read math write
## 5 5 read math write
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## read math write
## "pmm" "pmm" "pmm"
## PredictorMatrix:
## read math write
## read 0 1 1
## math 1 0 1
## write 1 1 0
As you can see, the algorithm ran for 5 iterations (the default) and presented us with 5 imputations for each missing datum. We can omit printing of the iteration cycle when we run mice by adding print=FALSE.
The mids class
To obtain an overview of the information stored in the object imp of class mids, use the attributes() function.
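For example:
attributes(imp)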
## $names
## [1] "data" "imp" "m" "where"
## [5] "blocks" "call" "nmis" "method"
## [9] "predictorMatrix" "visitSequence" "formulas" "post"
## [13] "blots" "ignore" "seed" "iteration"
## [17] "lastSeedValue" "chainMean" "chainVar" "loggedEvents"
## [21] "version" "date"
##
## $class
## [1] "mids"
## $read
## 1 2 3 4 5
## 19 55 55 57 47 52
## 40 50 60 48 45 47
## 51 52 39 45 50 47
## 74 47 52 63 39 42
## 88 45 44 43 47 50
## 92 42 42 47 28 44
## 97 47 39 35 47 36
## 153 63 50 63 65 65
## 182 55 63 52 65 71
##
## $math
## 1 2 3 4 5
## 18 63 57 58 58 41
## 37 49 40 49 46 40
## 45 55 51 39 50 44
## 53 63 64 58 58 50
## 73 39 40 46 43 40
## 90 53 41 52 50 57
## 114 57 46 47 50 49
## 118 50 61 41 72 48
## 123 65 75 57 57 68
## 124 43 37 52 45 42
## 132 42 44 33 57 33
## 136 64 48 53 75 71
## 149 68 71 75 69 56
## 164 61 40 46 43 51
## 178 43 40 49 49 42
##
## $write
## 1 2 3 4 5
## 7 62 61 43 62 59
## 28 57 54 62 59 59
## 37 54 40 46 54 49
## 48 46 47 54 46 47
## 59 57 40 44 49 40
## 64 59 59 59 49 59
## 81 43 62 62 57 59
## 104 65 67 63 65 62
## 108 37 40 49 41 47
## 110 63 54 49 65 61
## 112 46 44 54 57 57
## 114 41 54 31 46 42
## 147 49 49 57 49 54
## 161 60 62 65 59 65
## 171 59 67 59 63 65
## 191 65 62 63 65 65
## 198 63 63 62 65 65
The collection of the m imputed data sets can be exported by function complete() in long, broad and repeated formats.
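For example, a sketch:
#stack all m imputed datasets in long format
imp_long <- complete(imp, action = "long")
head(imp_long)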
The number of imputed datasets can be specified with the m argument. For example, to create just three imputed datasets, specify a call like the sketch below.
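A sketch of such a call (not preserved in this extract):
#three imputations instead of the default five
imp <- mice(hsb2_mar_1, m = 3, print = FALSE)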
The predictor matrix is a square matrix that specifies the variables that are used to impute each incomplete variable.
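The output below presumably comes from inspecting the predictor matrix of the mids object:
imp$pred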
## read math write
## read 0 1 1
## math 1 0 1
## write 1 1 0
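Before modifying the matrix, it is presumably copied into pred (the next chunk uses that name):
#copy the predictor matrix so we can modify it
pred <- imp$pred
pred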
## read math write
## read 0 1 1
## math 1 0 1
## write 1 1 0
pred[ ,"write"] <- 0
#Use your new predictor matrix in mice() as follows
imp <- mice(hsb2_mar_1, pred=pred, print=F)
There is a special function called quickpred() for a quick selection procedure of predictors, which can be handy for datasets containing many variables. See ?quickpred for more info. Selecting predictors according to their relations with the data, with a minimum correlation of \(\rho = .60\), can be done by:
#set the minimum rho
ini <- mice(hsb2_mar_1, pred=quickpred(hsb2_mar_1, mincor=.6), print=F)
ini$pred
## read math write
## read 0 1 0
## math 1 0 1
## write 0 1 0
The mice() function implements an iterative Markov Chain Monte Carlo type of algorithm. Let us have a look at the trace lines generated by the algorithm to study convergence:
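The plotting chunk is not shown; presumably:
#trace lines of the chain means and standard deviations per variable
plot(imp)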
The plot shows the mean (left) and standard deviation (right) of the imputed values only. In general, we would like the streams to be free of any trends at the final iterations.
For each column, the algorithm requires a specification of the imputation method. To see which method was used by default:
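Presumably:
imp$meth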
## read math write
## "pmm" "pmm" "pmm"
The default methods for different variable types are: pmm (predictive mean matching) for numeric variables, logreg (logistic regression) for binary factors, polyreg (polytomous logistic regression) for unordered factors with more than two levels, and polr (proportional odds) for ordered factors.
An up-to-date overview of the methods in mice can be found by
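The chunk is not shown; the warning and listing below match:
methods(mice)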
## Warning in .S3methods(generic.function, class, envir): function 'mice' appears
## not to be S3 generic; found functions that look like S3 methods
## [1] mice.impute.2l.bin mice.impute.2l.lmer
## [3] mice.impute.2l.norm mice.impute.2l.pan
## [5] mice.impute.2lonly.mean mice.impute.2lonly.norm
## [7] mice.impute.2lonly.pmm mice.impute.cart
## [9] mice.impute.jomoImpute mice.impute.lasso.logreg
## [11] mice.impute.lasso.norm mice.impute.lasso.select.logreg
## [13] mice.impute.lasso.select.norm mice.impute.lda
## [15] mice.impute.logreg mice.impute.logreg.boot
## [17] mice.impute.mean mice.impute.midastouch
## [19] mice.impute.mnar.logreg mice.impute.mnar.norm
## [21] mice.impute.mpmm mice.impute.norm
## [23] mice.impute.norm.boot mice.impute.norm.nob
## [25] mice.impute.norm.predict mice.impute.panImpute
## [27] mice.impute.passive mice.impute.pmm
## [29] mice.impute.polr mice.impute.polyreg
## [31] mice.impute.quadratic mice.impute.rf
## [33] mice.impute.ri mice.impute.sample
## [35] mice.mids mice.theme
## see '?methods' for accessing help and source code
Let us change the imputation method for write to Bayesian normal linear regression imputation:
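The chunks here were not preserved; the first output below presumably shows the current methods:
#current imputation methods
meth <- imp$meth
meth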
## read math write
## "pmm" "pmm" "pmm"
## read math write
## "pmm" "pmm" "norm"
Extending the number of imputations:
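One way to do this (a sketch assuming the ibind() helper in mice, which combines two mids objects imputation-wise):
#draw 5 additional imputations and combine them with the existing ones
imp_extra <- mice(hsb2_mar_1, method = meth, m = 5, print = FALSE, seed = 456)
imp10 <- ibind(imp, imp_extra)
#now 10 imputations in total
imp10$m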
For further diagnostic checking, use the function stripplot() (a sketch follows the next paragraph).
Generally, one would prefer for the imputed data to be plausible values, i.e. values that could have been observed if they had not been missing. In order to form an idea about plausibility, one may check the imputations and compare them against the observed values. If we are willing to assume that the data are missing completely at random (MCAR), then the imputations should have the same distribution as the observed data. In general, distributions may be different because the missing data are MAR (or even MNAR). However, very large discrepancies need to be screened.
Remember that write was imputed by Bayesian linear regression, so the imputed values may differ from the observed values.
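A sketch of such a diagnostic plot (the exact call in the original is unknown):
#imputed versus observed values of write, per imputation
stripplot(imp, write ~ .imp, pch = 20, cex = 1.2)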
After imputing the missing data, we run the analytic model of interest within each of the imputed datasets. We then need to pool the results of the \(M\) analyses.
The basic procedure for pooling the results from the imputed datasets is described by Rubin (1987, Chap. 3).
In a statistical analysis we want to estimate \(Q\), a quantity of interest in the population.
\(Q\) can be a population mean, a population variance, a population regression coefficient, etc.
Suppose that, had we no missing data, \(\hat{Q}\) would be the point estimate of \(Q\) and \(\hat{U}\) the estimated variance of \(\hat{Q}\).
For each imputed dataset \(j\), let \(\hat{Q}^{(j)}\) and \(\hat{U}^{(j)}\) be the point and variance estimates.
The overall point estimate of \(Q\) will be:
\[\bar{Q} = \frac{1}{M}\sum_{j=1}^{M}\hat{Q}^{(j)}\]
The total variance of \(\bar{Q}\) is estimated as \[T = \bar{U} + \left(1 + \frac{1}{M}\right)B\] where,
\[\bar{U} = \frac{1}{M}\sum_{j = 1}^{M}\hat{U}^{(j)}\]
and,
\[B = \frac{1}{M-1}\sum_{j = 1}^{M}(\hat{Q}^{(j)} - \bar{Q})^2,\]
are the within- and between-imputation variances, respectively.
The additional sampling variance, \(B/M\), is literally the between-imputation variance divided by \(M\). This value represents the sampling error associated with the overall or average coefficient estimates, and it serves as a correction factor for using a finite number of imputations.
The relative increase in variance due to nonresponse is
\[ \hat{r} = \frac{B + \frac{1}{M}B}{\bar{U}}\]
The proportion of the total variance attributable to the missing data is \[ \hat{\gamma} = \frac{B + \frac{1}{M}B}{T} = \frac{\hat{r}}{\hat{r} + 1}. \] In some literature this is referred to as an approximation to the rate of missing information.
The degrees of freedom of the reference \(t\) distribution are \[\nu_M = (M - 1) \hat{\gamma} ^ {-2}.\]
Barnard and Rubin (1999) described a modified procedure: \[ \tilde{\nu}_M = \left(\frac{1}{\nu_M} + \frac{1}{\hat\nu}\right)^{-1}, \quad \text{where} \quad \hat{\nu} = \frac{\nu_{com}(\nu_{com} + 1)(1 - \hat{\gamma})}{\nu_{com} + 3}, \] and \(\nu_{com}\) denotes the complete-data degrees of freedom.
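Back in mice, the chunks for the analysis phase were not preserved; the outputs below are consistent with calls like (an assumption):
#fit the regression within each imputed dataset; fit is a mira object
fit <- with(imp, lm(read ~ write + math))
class(fit)
names(fit)
#results from the first imputed dataset
summary(fit$analyses[[1]])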
## [1] "mira" "matrix"
## [1] "analyses" "call" "call1" "nmis"
##
## Call:
## lm(formula = read ~ write + math)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.0184 -4.9976 -0.4221 4.6915 21.2835
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.81268 3.28902 2.375 0.0185 *
## write 0.31321 0.07276 4.305 2.63e-05 ***
## math 0.53019 0.07261 7.302 6.83e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.327 on 197 degrees of freedom
## Multiple R-squared: 0.4924, Adjusted R-squared: 0.4873
## F-statistic: 95.55 on 2 and 197 DF, p-value: < 2.2e-16
Now we can combine the results with the pool function; a sketch of the presumable calls follows.
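pool.fit <- pool(fit)
#pooled estimates and standard errors
summary(pool.fit)
#the mipo object itself
pool.fit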
## term estimate std.error statistic df p.value
## 1 (Intercept) 7.4375413 3.32790584 2.234901 185.2389 2.661868e-02
## 2 write 0.3181009 0.07380952 4.309755 157.4321 2.869523e-05
## 3 math 0.5286378 0.07323118 7.218753 161.1994 1.954372e-11
## Class: mipo m = 5
## term m estimate ubar b t dfcom df
## 1 (Intercept) 5 7.4375413 10.808563425 0.2219948567 11.074957253 197 185.2389
## 2 write 5 0.3181009 0.005121394 0.0002720429 0.005447845 197 157.4321
## 3 math 5 0.5286378 0.005064415 0.0002486593 0.005362806 197 161.1994
## riv lambda fmi
## 1 0.02464655 0.02405371 0.03442294
## 2 0.06374271 0.05992305 0.07164236
## 3 0.05891917 0.05564085 0.06714344
Use mice to multiply impute hsb2_mar using the variables read, write, math, science, socst, prog, and female.
Then run the regression model with read as the dependent variable and write, math, prog, and female as predictors.
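One possible solution (a sketch; the object names and seed are choices made here):
vars <- c("read", "write", "math", "science", "socst", "prog", "female")
imp_ex <- mice(hsb2_mar[vars], seed = 123, print = FALSE)
fit_ex <- with(imp_ex, lm(read ~ write + math + prog + female))
summary(pool(fit_ex))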
References:
Flexible Imputation of Missing Data, Second Edition, by Stef van Buuren
https://www.gerkovink.com/miceVignettes/
Amelia II: A Program for Missing Data