1.4 Missing data in R

In R missing data is represented by NA. When we read the data in R from a text or csv file a blank space will transform to NA by default. However, most of read functions have option to assign NA to specific value in the data. For example, if 999 means missing value in the original file, when you read the data in R, you can set 999 to NA.

Let’s now look at missing data in R.

#Please make sure the packages that we are going to use are installed.
#install.packages("mice", dependencies = TRUE)
#install.packages("VIM")
#install.packages("dplyr")
#install.packages("tidyverse")
#install.packages("lattice")

#in R a missing value is represented by NA

#In this example `a` is not missing! it is null
a <- c()
a

## NULL

#> a
#NULL

In the last example a is not missing! it is null

#In this one means `a` is a missing value
a <- NA
a

## [1] NA

In the next example we create a vector with 4th element missing

# x is a vector with 4th element missing
x <- c(1, 2, 3, NA, 5)
x

## [1]  1  2  3 NA  5

The generic function is.na indicates which elements are missing.
The generic function is.na<- sets elements to NA.

#is.na() detect whether a value is `NA`.
is.na(x)

## [1] FALSE FALSE FALSE  TRUE FALSE

#to get the location of NAs
which(is.na(x))

## [1] 4

#To get total number of NA in an array 
sum(is.na(x))

## [1] 1

#anyNA check whether there exist any missing
anyNA(x)

## [1] TRUE

is.na(x) <- 2
x

## [1]  1 NA  3 NA  5

#lets try apply some computational function on x
#by default most of the function in R do not ignore the NA
mean(x)

## [1] NA

sum(x)

## [1] NA

#?mean

mean(x, na.rm = TRUE)

## [1] 3

sum(x, na.rm = TRUE)

## [1] 9

1.5 Drawing a random sample in R

To draw a random variable from certain distributions, R has some built in functions that start with letter r. For example rnorm draw a random value (or vector) from a normal distribution or runif draws a random number between 0 and 1 (by default) with a uniform probability density.

In the following example we will generate a dataset with missing value.

Function set.seed is setting random seed in R, So every time we draw a sample we will get the same result.

#setting seed
set.seed(123)
#We draw a sample of size 25 from a standard normal distribution and store them in y1
y1 <- rnorm(25)

#We draw a sample of size 25 from a normal distribution with mean 1 and standard deviations 5
y2 <- rnorm(n = 25, mean = 1, sd = 5)
#Let's draw 25 random value from 0 and 1 with equally likely chances.
z1 <- runif(n = 25, min = 0, max = 1)
# we can create missing in y1 if z is less than 0.3
y1 <- replace(y1, z1 < 0.5, NA)

Exercise:

Draw a standard normal random vector of size 25 and for each value greater than 1 and less than -1 replace y2 with a missing value

We can put together y1 and y2 as a data frame which is a usual data format in R

#data frame is the usual way a multivariate data is stored in R
dat <- data.frame(y1 = y1, y2 = y2)
#this is how a missing data look like in R
head(dat)

##            y1       y2
## 1 -0.56047565       NA
## 2          NA       NA
## 3          NA 1.766866
## 4  0.07050839       NA
## 5          NA 7.269075
## 6  1.71506499 3.132321

1.6 Missing indicator and Missing Mechanism

In missing data analysis, missing indicator is defined as a matrix with the same size of the data with values 0 as missing and 1 as observed to show which values in the data are missing.
In R we can create this using is.na function.
The exclamation mark ! changes TRUE to FALSE and vice-versa!

#R is missing indicator and we can keep it as TRUE and FALSE

R <- !is.na(dat)
# or we can make R as 0 and 1, 0 indicates missing and 1 indicates observed
R <- apply(R, 2, as.numeric)
#name the columns of R
colnames(R) <- c("R1","R2")
#desplay data and missing indicator together  
dat <- cbind(dat, R)
head(dat)

##            y1       y2 R1 R2
## 1 -0.56047565       NA  1  0
## 2          NA       NA  0  0
## 3          NA 1.766866  0  1
## 4  0.07050839       NA  1  0
## 5          NA 7.269075  0  1
## 6  1.71506499 3.132321  1  1

The missing data mechanism is the process that generated the missing values.
Missing data mechanisms generally fall into one of three main categories.
- Missing completely at random (MCAR)
- Missing at random (MAR)
- Missing not at random (MNAR)
A variable is missing completely at random, if neither the variables in the dataset nor the unobserved value of the variable itself predict whether a value will be missing.

This is the strangest assumption on missingness. When data are missing completely at random, analyzing only on the complete cases is unbiased.

A variable is said to be missing at random if other variables (but not the variable itself) in the dataset can be used to predict the missingness. This is more liberal assumption than MCAR.

For example, in surveys, men may be more likely to decline to answer some questions than women.

Multiple imputation and other modern methods such as direct maximum likelihood generally assumes that the data are MAR.

A Variable said to be missing not at random if the value of the unobserved variable itself predicts missingness.

For example, Individuals with very high incomes are more likely to decline to answer questions about their income than individuals with more moderate incomes.

This is most liberal assumption for missing mechanism. Statistical models for the MNAR processes are beyond the scope of this seminar.

If all variables in a dataset are MCAR then the dataset is MCAR. If all variables in a dataset are MCAR or MAR then the dataset is MAR. If at least one variable in a dataset is MNAR then the dataset is called MNAR.

In the figure below \(X\) represent variables that are completely observed. \(Y\) represent variables that are partly missing. \(Z\) represent unobserved component of the cause of missingness unrelated to \(X\) and \(Y\). \(R\) represent the missingness or missing indicator.

For more information on missing data mechanisms please see:

Allison, 2002

Enders, 2010

Little & Rubin, 2002

Rubin, 1976

Schafer & Graham, 2002

Example:

We are using mvrnorm(N, mu = mu, Sigma = Sigma) from Package MASS, Generate a bivariate normal with sample size 100 with mean zero and variance 1 and covariance 0.5. Then we create a missing indicator \(R_1\) if the first variable is less than -0.5 and a missing indicator \(R_2\) if the second variable is bigger than 0.5. Next, we implement the missing indicator R1 and R2 to first and second variable of the bivariate sample.

#bivariate normal sample
Y <- MASS::mvrnorm(100, mu = c(0, 0), Sigma = .5 * diag(2) + .5)
R2 <- Y[,1] < -0.5
R1 <- Y[,2] > 0.5
#apply missingness R1 and R2 to the data
dat.MAR <- data.frame(Y1 = replace(Y[,1], R1==TRUE, NA),
                      Y2 = replace(Y[,2], R2==TRUE, NA))
head(dat.MAR)

##           Y1         Y2
## 1         NA         NA
## 2  0.4699910 -0.3403352
## 3         NA         NA
## 4 -0.7778968         NA
## 5 -0.9119195         NA
## 6         NA         NA

1.7 Common techniques for dealing with missing data:

The best way to treat missing data is not to have them (Orchard and Woodbury 1972).

Complete case analysis (listwise deletion)

This methods involves deleting cases in a particular dataset that has missing values on any variable of interest. It is a common technique because it is easy to implement and works with any type of analysis.

Mean Imputation

This methods involves replacing the missing values for each individual variable with it’s overall estimated mean from the available cases. While this is a simple and easily implemented method for dealing with missing values it has some unfortunate consequences. The most important problem with mean imputation, also called mean substitution, is that it will result in an artificial reduction in variability due to the fact you are imputing values at the center of the variable’s distribution. This also has the unintended consequence of changing the magnitude of correlations between the imputed variable and other variables.

In R we can use a loop over all columns of data to impute column by column their mean. We will see using mice we can do this directly.

#first store the missing in new data to impute 
dat_mean_im <- dat
for(i in 1:ncol(dat_mean_im)) {
  dat_mean_im[ , i][is.na(dat_mean_im[ , i])] <- mean(dat_mean_im[ , i], na.rm = TRUE)
}

#getting mean and standard deviation of data with missing 

data.frame(mean = apply(dat, 2,mean, na.rm = TRUE), sd = apply(dat, 2,sd,  na.rm = TRUE))

##           mean        sd
## y1 -0.05294653 1.0705524
## y2  1.94918108 4.2747637
## R1  0.48000000 0.5099020
## R2  0.84000000 0.3741657

# mean and standard deviation of mean imputed data
data.frame(mean = apply(dat_mean_im, 2,mean), sd = apply(dat_mean_im, 2,sd))

##           mean        sd
## y1 -0.05294653 0.7247674
## y2  1.94918108 3.9023075
## R1  0.48000000 0.5099020
## R2  0.84000000 0.3741657

Does mean imputation gives a better point estimate for standard deviation?

Single or Deterministic Imputation

A slightly more sophisticated type of imputation is a regression/conditional mean imputation, which replaces missing values with predicted scores from a regression equation. The strength of this approach is that it uses complete information to impute values. The drawback here is that all your predicted values will fall directly on the regression line once again decreasing variability, just not as much as with unconditional mean imputation. Moreover, statistical models cannot distinguish between observed and imputed values and therefore do not incorporate into the model the error or uncertainly associated with that imputed value. Further discussion and an example of deterministic imputation can be found in Craig Enders book “Applied Missing Data Analysis” (2010). The image below is from that book:

Stochastic Imputation

In recognition of the problems with regression imputation and the reduced variability associated with this approach, researchers developed a technique to incorporate or “add back” lost variability. A residual term, that is randomly drawn from a normal distribution with mean zero and variance equal to the residual variance from the regression model, is added to the predicted scores from the regression imputation thus restoring some of the lost variability. This method is superior to the previous methods as it will produce unbiased coefficient estimates under MAR. However, the standard errors produced during regression estimation while less biased then the single imputation approach, will still be attenuated.

Full information maximum likelihood (FIML)

For continuous and multivariate normal data, FIML is an extension of the normal theory ML estimator, which accommodates missing data as a by-product of using raw data as inputs. In ML, the likelihood function is usually computed assuming all observations have complete responses and observations with missing values are discarded.

FIML accommodates missing responses by using the non-missing information to calculate the likelihood instead of discarding missing cases (Enders, 2010).

Multiple Imputation

We will discuss this method in more detail next

1.8 Multiple Imputation

Multiple imputation is essentially an iterative form of stochastic imputation.
Instead of filling in a single value, the distribution of the observed data is used to estimate multiple values that reflect the uncertainty around the true value.

Multiple Imputation has three basic phases:

Imputation or Fill-in Phase: The missing data are filled in with estimated values and a complete data set is created. This process of fill-in is repeated \(m\) times.
Analysis Phase: Each of the m complete data sets is then analyzed using a statistical method of interest (e.g. linear regression).
Pooling Phase: The parameter estimates (e.g. coefficients and standard errors) obtained from each analyzed data set are then combined for inference.

1.9 some thechnical note on imputation methods (Optional)

Two main algorithm for Imputation phase are Expectation-maximization with bootstrapping (EMB) and Fully conditional specification (FCS).

Expectation-maximization with bootstrapping (EMB)

In EMB, Firstly, multiple bootstrapped samples are generated from incomplete data. In the Expectation-Maximization (EM) algorithm, each bootstrapped sample goes through the Expectation (E) stage, calculating the expected likelihood of the model using assumed model parameters. In the Maximization (M) stage, updated model parameters are estimated such that they maximize the expected likelihood generated in the E stage. One cycle consists of an E and M stage. The updated model parameters are utilized in the E stage of the next cycle to estimate the likelihood. Cycles are repeated until convergence, where updated model parameters do not differ significantly from the previous cycle. Missing cases are imputed by values drawn from the distribution described by the updated model parameters. Each bootstrapped sample is imputed and used in the analysis phase.

Fully conditional specification (FCS)

FCS is another approach to multiple imputation which imputes missing data on a variable-by-variable basis.

In mice package in R, FCS is implemented using the multiple imputation by chained equations (MICE) algorithm, a Markov chain Monte Carlo (MCMC) method developed by van Buuren (2007). Missing values are initially imputed by randomly selected observed values. In the first step, model parameter estimates \(\hat\theta\) that characterize the missing data are drawn from a posterior distribution derived from other variables, non-missing data of that variable, and the missingness. In the second step, imputations are drawn from another posterior distribution, derived from the same information and the parameter estimates from the first step. Missing values on the variable are imputed in the second step. The two steps above are iterated k times, ending at the kth variable. One cycle is complete when all variables have gone through these two steps. Cycles are repeated until the imputed values are stable and convergence is reached. The final imputed dataset is saved as one imputation to be used as input in the analysis phase.

The EMB algorithm runs much faster compared to FCS as convergence is more easily assessed and each bootstrapped sample is independent of each other. However, EMB assumes that the data form a multivariate-normal distribution, and imputed values are continuous.

1.10 A note on Expectation Maximization (EM) (Optional)

In stochastic imputation we assume a linear relationship between observed variables and missing variable and we randomly draw a sample from residuals and add it to the mean predicted value of missing variable \(Y\).
In normal multiple imputation we go one step further and assume the coefficient of linear model is not fixed and are random variables.
In statistics when we assume parameters are random variable we can estimate their so called posterior distribution using Bayesian method.
In normal imputation we impute the missing values by randomly draw sample from posterior distribution of parameters of the model and then we use those randomly drawn parameters to draw the imputed value.
When we have complete data then estimation of parameter of model is straight forward by using maximum likelihood method. Likelihood function is:

\[L(Y; \theta) = \text{Prob}(Y | \theta) \]

In missing data we do not observe all data so the likelihood function is not available but we can use Expectation Maximization algorithm.

The EM procedure is as follow:

MCMC norm

Markov chain Monte Carlo is a modified procedure to EM. In MCMC we draw the missing \(Y_{mis}\) from its posterior distribution and treat it as if we have observed it. We repeat this procedure until we reach a stationary point.

The stationary point means that after \(k\) iterations, \(\theta^{(t+k)}\), the simulated value after iteration t + k, is independent of \(\theta^{(t)}\). We will then say \(k\) is stationary point.

2 Multiple Imputation in R

2.1 The `mice` package

The R package mice imputes incomplete multivariate data by chained equations developed by Stef van Buuren and Karin Groothuis-Oudshoorn (2011).
Over the last decade, mice has become an important piece of imputation software, offering a very flexible environment for dealing with incomplete data.
versatility of imputation methods in mice gives flexibility of imputing different types of data.

2011 Multivariate Imputation by Chained Equations in R JSS

2.2 hsb2_mar data

For this workshop we use a dataset named hsb2_mar which is a MAR missing data hsb2_mar
This is not a real data and it is created just to illustrate statistical methods in this workshop.
The data contains test scores and demographic and school information for 200 high school students
The analysis that we are interested is linear regression of read score by independent variables write, and math and with adding predictors female, and prog at the end.

Let’s read the data and prepare the data for the analysis.

#hsb2_mar is missing at random 
hsb2_mar <- read.csv("https://stats.idre.ucla.edu/wp-content/uploads/2021/03/hsb2_mar.csv")
summary(hsb2_mar)

##        id              race           ses            schtyp          read      
##  Min.   :  1.00   Min.   :1.00   Min.   :1.000   Min.   :1.00   Min.   :28.00  
##  1st Qu.: 50.75   1st Qu.:3.00   1st Qu.:2.000   1st Qu.:1.00   1st Qu.:44.00  
##  Median :100.50   Median :4.00   Median :2.000   Median :1.00   Median :50.00  
##  Mean   :100.50   Mean   :3.43   Mean   :2.055   Mean   :1.16   Mean   :52.29  
##  3rd Qu.:150.25   3rd Qu.:4.00   3rd Qu.:3.000   3rd Qu.:1.00   3rd Qu.:60.00  
##  Max.   :200.00   Max.   :4.00   Max.   :3.000   Max.   :2.00   Max.   :76.00  
##                                                                 NA's   :9      
##      write            math         science          socst      
##  Min.   :31.00   Min.   :33.0   Min.   :26.00   Min.   :26.00  
##  1st Qu.:46.00   1st Qu.:45.0   1st Qu.:44.00   1st Qu.:46.00  
##  Median :54.00   Median :52.0   Median :53.00   Median :52.00  
##  Mean   :52.95   Mean   :52.9   Mean   :51.31   Mean   :52.41  
##  3rd Qu.:60.00   3rd Qu.:60.0   3rd Qu.:58.00   3rd Qu.:61.00  
##  Max.   :67.00   Max.   :75.0   Max.   :74.00   Max.   :71.00  
##  NA's   :17      NA's   :15     NA's   :16                     
##      prog              female         
##  Length:200         Length:200        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
##

#since variables prog and female are categorical we make them as factor.
hsb2_mar$prog <- factor(hsb2_mar$prog)
#we can change the reference group of a factor
hsb2_mar$prog <- relevel(hsb2_mar$prog, ref = "vocational")
hsb2_mar$female <- factor(hsb2_mar$female)
hsb2_mar$female <-  relevel(hsb2_mar$female, ref = "male")
#now we can see NA apears in the summary
summary(hsb2_mar)

##        id              race           ses            schtyp          read      
##  Min.   :  1.00   Min.   :1.00   Min.   :1.000   Min.   :1.00   Min.   :28.00  
##  1st Qu.: 50.75   1st Qu.:3.00   1st Qu.:2.000   1st Qu.:1.00   1st Qu.:44.00  
##  Median :100.50   Median :4.00   Median :2.000   Median :1.00   Median :50.00  
##  Mean   :100.50   Mean   :3.43   Mean   :2.055   Mean   :1.16   Mean   :52.29  
##  3rd Qu.:150.25   3rd Qu.:4.00   3rd Qu.:3.000   3rd Qu.:1.00   3rd Qu.:60.00  
##  Max.   :200.00   Max.   :4.00   Max.   :3.000   Max.   :2.00   Max.   :76.00  
##                                                                 NA's   :9      
##      write            math         science          socst               prog   
##  Min.   :31.00   Min.   :33.0   Min.   :26.00   Min.   :26.00   vocational:46  
##  1st Qu.:46.00   1st Qu.:45.0   1st Qu.:44.00   1st Qu.:46.00   academic  :95  
##  Median :54.00   Median :52.0   Median :53.00   Median :52.00   general   :41  
##  Mean   :52.95   Mean   :52.9   Mean   :51.31   Mean   :52.41   NA's      :18  
##  3rd Qu.:60.00   3rd Qu.:60.0   3rd Qu.:58.00   3rd Qu.:61.00                  
##  Max.   :67.00   Max.   :75.0   Max.   :74.00   Max.   :71.00                  
##  NA's   :17      NA's   :15     NA's   :16                                     
##     female   
##  male  : 81  
##  female:101  
##  NA's  : 18  
##              
##              
##              
##

By default in R if we run lm function it use only complete cases. Basically default for lm is listwise deletion.

m_complete <- lm(read ~ write + math, data = hsb2_mar)
summary(m_complete)

## 
## Call:
## lm(formula = read ~ write + math, data = hsb2_mar)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9779  -4.6945  -0.2417   5.1221  21.1444 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.79737    3.84339   1.769   0.0789 .  
## write        0.34444    0.08139   4.232 3.92e-05 ***
## math         0.51945    0.08363   6.211 4.46e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.445 on 158 degrees of freedom
##   (39 observations deleted due to missingness)
## Multiple R-squared:  0.4782, Adjusted R-squared:  0.4716 
## F-statistic: 72.41 on 2 and 158 DF,  p-value: < 2.2e-16

#sjPlot::tab_model(m_complete, show.se = TRUE, digits = 4)

How many cases has been used in the above model?

#We can see it from the summary
#158 df + 3 = 161
#or
#complete.cases gives logical vector TRUE/FALSE is a row is complete
sum(complete.cases(hsb2_mar[c("read", "write", "math")]))

## [1] 161

2.3 Mean value imputation

We now use Mean value imputation and compute mean of write, read, and math, then using mean imputed data, we run the regression model of read vs.write, and math.

#mean value imputation 
data_mean_im <- hsb2_mar[c("read", "write", "math")]
for(i in 1:ncol(data_mean_im)) {
  data_mean_im[ , i][is.na(data_mean_im[ , i])] <- mean(data_mean_im[ , i], na.rm = TRUE)
}

#regression on imputed data by mean value

m_mean_im <- lm(read ~ write + math, data = data_mean_im)
summary(m_mean_im)

## 
## Call:
## lm(formula = read ~ write + math, data = data_mean_im)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9882  -5.0091  -0.3489   5.1041  20.9949 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.92439    3.61542   2.192   0.0296 *  
## write        0.33851    0.07174   4.718 4.50e-06 ***
## math         0.49982    0.07057   7.083 2.44e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.488 on 197 degrees of freedom
## Multiple R-squared:  0.4423, Adjusted R-squared:  0.4367 
## F-statistic: 78.13 on 2 and 197 DF,  p-value: < 2.2e-16

#comparing two models side by side

sjPlot::tab_model(m_complete, m_mean_im, show.p = FALSE, show.se = TRUE, digits = 4, dv.labels = c("complete case", "mean Imputed"))

	complete case			mean Imputed
Predictors	Estimates	std. Error	CI	Estimates	std. Error	CI
(Intercept)	6.7974	3.8434	-0.7937 – 14.3884	7.9244	3.6154	0.7945 – 15.0543
write	0.3444	0.0814	0.1837 – 0.5052	0.3385	0.0717	0.1970 – 0.4800
math	0.5195	0.0836	0.3543 – 0.6846	0.4998	0.0706	0.3607 – 0.6390
Observations	161			200
R² / R² adjusted	0.478 / 0.472			0.442 / 0.437

2.4 FIML using `lavaan` package

lavaan is a R package for latent variable analysis. You can use lavaan to estimate a large variety of multivariate statistical models, including path analysis, confirmatory factor analysis, structural equation modeling and growth curve models.

For missing data we can use the missing = “FIML” or “ML” argument to ask lavaan to estimate the ‘full information maximum likelihood’.

#regression model with lavaan setting missing to FIML
library(lavaan)
lm.fiml <- sem('read ~ write + math', data = hsb2_mar, 
              missing = "FIML", fixed.x = F)
summary(lm.fiml)

## lavaan 0.6.13 ended normally after 26 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         9
## 
##   Number of observations                           200
##   Number of missing patterns                         5
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Observed
##   Observed information based on                Hessian
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   read ~                                              
##     write             0.327    0.076    4.286    0.000
##     math              0.514    0.077    6.688    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   write ~~                                            
##     math             54.484    7.588    7.180    0.000
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .read              7.772    3.477    2.236    0.025
##     write            53.071    0.676   78.553    0.000
##     math             52.795    0.672   78.550    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .read             53.452    5.593    9.557    0.000
##     write            86.703    9.078    9.551    0.000
##     math             86.870    8.928    9.730    0.000

2.5 Some missing data visualization using `mice` and `VIM`

Package mice has many nice functions to investigate missing data. For example we can study missing data pattern by function md.pattern

library(mice)
library(lattice)
#again we set seed at first
set.seed(123)

#Check the missingness pattern for the hsb2_mar dataset

md.pattern(hsb2_mar, rotate.names = TRUE)

##     id race ses schtyp socst read math science write prog female   
## 117  1    1   1      1     1    1    1       1     1    1      1  0
## 15   1    1   1      1     1    1    1       1     1    1      0  1
## 14   1    1   1      1     1    1    1       1     1    0      1  1
## 1    1    1   1      1     1    1    1       1     1    0      0  2
## 12   1    1   1      1     1    1    1       1     0    1      1  1
## 1    1    1   1      1     1    1    1       1     0    1      0  2
## 1    1    1   1      1     1    1    1       1     0    0      1  2
## 13   1    1   1      1     1    1    1       0     1    1      1  1
## 1    1    1   1      1     1    1    1       0     1    0      1  2
## 1    1    1   1      1     1    1    1       0     0    1      1  2
## 11   1    1   1      1     1    1    0       1     1    1      1  1
## 1    1    1   1      1     1    1    0       1     1    1      0  2
## 1    1    1   1      1     1    1    0       1     1    0      1  2
## 2    1    1   1      1     1    1    0       1     0    1      1  2
## 8    1    1   1      1     1    0    1       1     1    1      1  1
## 1    1    1   1      1     1    0    1       0     1    1      1  2
##      0    0   0      0     0    9   15      16    17   18     18 93

Looking at the missing data pattern is always useful (but may be difficult for datasets with many variables). It can give you an indication on how much information is missing and how the missingness is distributed.

There are many other useful packages and missing data visualizations. For example package VIM has some interesting plots.

library(VIM)
aggr_plot <- aggr(hsb2_mar, col=c('blue','red'), numbers=TRUE, 
                  sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3,
                  ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##  Variable Count
##      prog 0.090
##    female 0.090
##     write 0.085
##   science 0.080
##      math 0.075
##      read 0.045
##        id 0.000
##      race 0.000
##       ses 0.000
##    schtyp 0.000
##     socst 0.000

marginplot from package VIM is a very nice visualization to observe relationship between missing data and other variables in the data.

VIM::marginplot(hsb2_mar[c("read","math")])

2.6 Mean imputation in mice

We did mean imputation manually before. To mean impute using mice we simply set method to “mean”. The option method in mice tell what method should be used for imputation.

The result is an object of class mids.

#select variables "read", "math", and "write" in a new data
hsb2_mar_1 <- hsb2_mar[c("read", "math","write")]
#mean impute in mice
imp <- mice(hsb2_mar_1, method = "mean", m = 1, maxit = 1)

## 
##  iter imp variable
##   1   1  read  math  write

#the class of multiple imputed data is mids
class(imp)

## [1] "mids"

As you can see, the algorithm ran for 1 iteration (maxit = 1) and presented us with only 1 imputation (m = 1) for each missing datum.

we can extract imputed data using function complete(imp).

comp_data <- complete(imp)
head(comp_data, 10)

##    read math    write
## 1    57   54 59.00000
## 2    47   61 62.00000
## 3    60   58 54.00000
## 4    54   57 63.00000
## 5    68   53 59.00000
## 6    55   61 49.00000
## 7    55   66 52.95082
## 8    52   55 54.00000
## 9    52   57 67.00000
## 10   36   42 57.00000

#mean on each column(variables) that used for imputation
colMeans(hsb2_mar_1, na.rm = TRUE)

##     read     math    write 
## 52.28796 52.89730 52.95082

Now we can use imputed data for the analysis in mice. The return object is in class mira.

fit <- with(imp, lm(read ~ math + write))
summary(fit)

## # A tibble: 3 × 6
##   term        estimate std.error statistic  p.value  nobs
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl> <int>
## 1 (Intercept)    7.92     3.62        2.19 2.96e- 2   200
## 2 math           0.500    0.0706      7.08 2.44e-11   200
## 3 write          0.339    0.0717      4.72 4.50e- 6   200

#the class of fit is mira
class(fit)

## [1] "mira"   "matrix"

2.7 Impute the missing data with regression imputation

The argument method = “norm.predict” first fits a regression model for each observed value, based on the corresponding values in other variables and then imputes the missing values with the predicted values.

#regression imputation
imp <- mice(hsb2_mar_1, method = "norm.predict", m = 1, maxit = 1)

## 
##  iter imp variable
##   1   1  read  math  write

#runing the mode for imputed data
fit <- with(imp, lm(read ~ math + write))
summary(fit)

## # A tibble: 3 × 6
##   term        estimate std.error statistic  p.value  nobs
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl> <int>
## 1 (Intercept)    6.36     3.23        1.97 5.03e- 2   200
## 2 math           0.534    0.0717      7.45 2.92e-12   200
## 3 write          0.334    0.0722      4.63 6.74e- 6   200

It is clear that something has changed. In fact, we extrapolated the regression model for the observed data to missing data. In other words; the relation gets stronger.

2.8 Impute the missing data with stochastic regression imputation

#stochastic regression imputation
imp <- mice(hsb2_mar_1, method = "norm.nob", m = 1, maxit = 1)

## 
##  iter imp variable
##   1   1  read  math  write

#runing the mode for imputed data
fit <- with(imp, lm(read ~ math + write))
summary(fit)

## # A tibble: 3 × 6
##   term        estimate std.error statistic  p.value  nobs
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl> <int>
## 1 (Intercept)    8.15     3.29        2.47 1.42e- 2   200
## 2 math           0.504    0.0742      6.80 1.22e-10   200
## 3 write          0.325    0.0720      4.52 1.08e- 5   200

This code imputes the missing values in the data set by the stochastic regression imputation method.

We add a random error to allow for our imputations to be off the regression line.

The function does not incorporate the variability of the regression weights, so it is not ‘proper’ in the sense of Rubin (1987). For small samples, the variability of the imputed data will be underestimated.

Re-run the stochastic imputation model with seed 123 and verify if your results are the same as the ones below:

##stochastic regression imputation with setting seed to 123 give the same result every time
imp <- mice(hsb2_mar_1, method = "norm.nob", m = 1, maxit = 1, seed = 123)

## 
##  iter imp variable
##   1   1  read  math  write

fit <- with(imp, lm(read ~ math + write))
summary(fit)

## # A tibble: 3 × 6
##   term        estimate std.error statistic  p.value  nobs
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl> <int>
## 1 (Intercept)    8.18     3.26        2.50 1.31e- 2   200
## 2 math           0.544    0.0717      7.59 1.27e-12   200
## 3 write          0.292    0.0730      4.01 8.72e- 5   200

The imputation procedure uses random sampling, and therefore, the results will be (perhaps slightly) different if we repeat the imputations. In order to get exactly the same result, you can use the seed argument.

2.9 Multiple imputation using default option of mice function

We now impute the missing data in the dataset multiple times by using default option of mice.

We see for each variable there is a method of imputation. Since all variables used in the data are continuous, by default mice use imputation method pmm or Predictive Mean Matching.

#using mice default
imp <- mice(hsb2_mar_1)

## 
##  iter imp variable
##   1   1  read  math  write
##   1   2  read  math  write
##   1   3  read  math  write
##   1   4  read  math  write
##   1   5  read  math  write
##   2   1  read  math  write
##   2   2  read  math  write
##   2   3  read  math  write
##   2   4  read  math  write
##   2   5  read  math  write
##   3   1  read  math  write
##   3   2  read  math  write
##   3   3  read  math  write
##   3   4  read  math  write
##   3   5  read  math  write
##   4   1  read  math  write
##   4   2  read  math  write
##   4   3  read  math  write
##   4   4  read  math  write
##   4   5  read  math  write
##   5   1  read  math  write
##   5   2  read  math  write
##   5   3  read  math  write
##   5   4  read  math  write
##   5   5  read  math  write

imp

## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##  read  math write 
## "pmm" "pmm" "pmm" 
## PredictorMatrix:
##       read math write
## read     0    1     1
## math     1    0     1
## write    1    1     0

As you can see, the algorithm ran for 5 iterations (the default) and presented us with 5 imputations for each missing datum. We can omit printing of the iteration cycle when we run mice by adding print=FALSE.

2.10 Object `mids`

To obtain an overview of the information stored in the object imp of class mids, use the attributes() function.

#information stored in mids class
attributes(imp)

## $names
##  [1] "data"            "imp"             "m"               "where"          
##  [5] "blocks"          "call"            "nmis"            "method"         
##  [9] "predictorMatrix" "visitSequence"   "formulas"        "post"           
## [13] "blots"           "ignore"          "seed"            "iteration"      
## [17] "lastSeedValue"   "chainMean"       "chainVar"        "loggedEvents"   
## [21] "version"         "date"           
## 
## $class
## [1] "mids"

#for example we can extract imputed values
imp$imp

## $read
##      1  2  3  4  5
## 19  55 55 57 47 52
## 40  50 60 48 45 47
## 51  52 39 45 50 47
## 74  47 52 63 39 42
## 88  45 44 43 47 50
## 92  42 42 47 28 44
## 97  47 39 35 47 36
## 153 63 50 63 65 65
## 182 55 63 52 65 71
## 
## $math
##      1  2  3  4  5
## 18  63 57 58 58 41
## 37  49 40 49 46 40
## 45  55 51 39 50 44
## 53  63 64 58 58 50
## 73  39 40 46 43 40
## 90  53 41 52 50 57
## 114 57 46 47 50 49
## 118 50 61 41 72 48
## 123 65 75 57 57 68
## 124 43 37 52 45 42
## 132 42 44 33 57 33
## 136 64 48 53 75 71
## 149 68 71 75 69 56
## 164 61 40 46 43 51
## 178 43 40 49 49 42
## 
## $write
##      1  2  3  4  5
## 7   62 61 43 62 59
## 28  57 54 62 59 59
## 37  54 40 46 54 49
## 48  46 47 54 46 47
## 59  57 40 44 49 40
## 64  59 59 59 49 59
## 81  43 62 62 57 59
## 104 65 67 63 65 62
## 108 37 40 49 41 47
## 110 63 54 49 65 61
## 112 46 44 54 57 57
## 114 41 54 31 46 42
## 147 49 49 57 49 54
## 161 60 62 65 59 65
## 171 59 67 59 63 65
## 191 65 62 63 65 65
## 198 63 63 62 65 65

#extract completed data from third imputed values
c3 <- complete(imp, 3)

The collection of the m imputed data sets can be exported by function complete() in long, broad and repeated formats.

#long
c.long <- complete(imp, "long")
#wide
c.broad <- complete(imp, "broad")

# analysis for 5 imputed data:
fit <- with(imp, lm(read ~ math + write))
#summary(fit)

The number of imputed data sets can be specified by the m argument. For example, to create just three imputed data sets, specify

#three imputations
imp <- mice(hsb2_mar_1, m = 3, print=F)

2.11 Change the predictor matrix

The predictor matrix is a square matrix that specifies the variables that are used to impute each incomplete variable.

#predicted matrix
imp$pred

##       read math write
## read     0    1     1
## math     1    0     1
## write    1    1     0

ini <- mice(hsb2_mar_1, maxit=0, print=F)
pred <- ini$pred
pred

##       read math write
## read     0    1     1
## math     1    0     1
## write    1    1     0

pred[ ,"write"] <- 0
#Use your new predictor matrix in mice() as follows
imp <- mice(hsb2_mar_1, pred=pred, print=F)

There is a special function called quickpred() for a quick selection procedure of predictors, which can be handy for datasets containing many variables. See ?quickpred for more info. Selecting predictors according to data relations with a minimum correlation of \(\rho=.60\). can be done by

#set the minimum rho
ini <- mice(hsb2_mar_1, pred=quickpred(hsb2_mar_1, mincor=.6), print=F)
ini$pred

##       read math write
## read     0    1     0
## math     1    0     1
## write    0    1     0

2.12 Inspect the convergence of the algorithm

The mice() function implements an iterative Markov Chain Monte Carlo type of algorithm. Let us have a look at the trace lines generated by the algorithm to study convergence:

imp <- mice(hsb2_mar_1, m = 5, maxit = 10,print=F)
#plots the trace line
plot(imp)

The plot shows the mean (left) and standard deviation (right) of the imputed values only. In general, we would like the streams to be free of any trends at the final iterations.

2.13 change the imputation method

For each column, the algorithm requires a specification of the imputation method. To see which method was used by default:

#imputation methods
imp$meth

##  read  math write 
## "pmm" "pmm" "pmm"

List of default methods used for different variable type is:

An up-to-date overview of the methods in mice can be found by

methods(mice)

## Warning in .S3methods(generic.function, class, envir): function 'mice' appears
## not to be S3 generic; found functions that look like S3 methods

##  [1] mice.impute.2l.bin              mice.impute.2l.lmer            
##  [3] mice.impute.2l.norm             mice.impute.2l.pan             
##  [5] mice.impute.2lonly.mean         mice.impute.2lonly.norm        
##  [7] mice.impute.2lonly.pmm          mice.impute.cart               
##  [9] mice.impute.jomoImpute          mice.impute.lasso.logreg       
## [11] mice.impute.lasso.norm          mice.impute.lasso.select.logreg
## [13] mice.impute.lasso.select.norm   mice.impute.lda                
## [15] mice.impute.logreg              mice.impute.logreg.boot        
## [17] mice.impute.mean                mice.impute.midastouch         
## [19] mice.impute.mnar.logreg         mice.impute.mnar.norm          
## [21] mice.impute.mpmm                mice.impute.norm               
## [23] mice.impute.norm.boot           mice.impute.norm.nob           
## [25] mice.impute.norm.predict        mice.impute.panImpute          
## [27] mice.impute.passive             mice.impute.pmm                
## [29] mice.impute.polr                mice.impute.polyreg            
## [31] mice.impute.quadratic           mice.impute.rf                 
## [33] mice.impute.ri                  mice.impute.sample             
## [35] mice.mids                       mice.theme                     
## see '?methods' for accessing help and source code

Let us change the imputation method for write to Bayesian normal linear regression imputation

#initial methods
ini <- mice(hsb2_mar_1, maxit = 0)
meth <- ini$meth
meth

##  read  math write 
## "pmm" "pmm" "pmm"

meth["write"] <- "norm"
meth

##   read   math  write 
##  "pmm"  "pmm" "norm"

#now we can use this new methods
imp <- mice(hsb2_mar_1, meth = meth, print=F)
plot(imp)

Extend number of imputation:

#add additional 45 imputations
imp50 <- mice.mids(imp, maxit=45, print=F)
plot(imp50)

Further diagnostic checking. Use function stripplot()

Generally, one would prefer for the imputed data to be plausible values, i.e. values that could have been observed if they had not been missing. In order to form an idea about plausibility, one may check the imputations and compare them against the observed values. If we are willing to assume that the data are missing completely at random (MCAR), then the imputations should have the same distribution as the observed data. In general, distributions may be different because the missing data are MAR (or even MNAR). However, very large discrepancies need to be screened.

#strip plots
stripplot(imp, math~.imp, pch=20, cex=2)

stripplot(imp)

Remember that write was imputed by Bayesian linear regression and imputed values may therefore be different than observed values.

2.14 Analysis/Pooling Phase

After imputing the missing data we runs the analytic model of interest within each of the imputed datasets. We then need to pool the result of M analysis.
The basic procedure of pooling the result of imputed data described by Rubin (1987, Chap.3)
In a statistical analysis we want to estimate \(Q\), a quantity of interest in the population.
\(Q\) can be population mean, population variance, or population regression coefficient, or etc.
Suppose if we had no missing data \(\hat{Q}\) were point estimated for quantity \(Q\) and \(\hat{U}\) is the estimated variance of \(\hat{Q}\).
For each imputed data \(j\), let \(\hat{Q}^{(j)}\) and \(\hat{U}^{(j)}\) be the point and variance estimate.
The overall point estimate of \(Q\) will be:

\[\bar{Q} = \frac{1}{M}\sum_{j=1}^{M}\hat{Q}^{(j)}\]

The overall variance estimate will be:

\[T = \bar{U} + B(1 + 1/M)\] Where,

\[\bar{U} = \frac{1}{M}\sum_{j = 1}^{M}\hat{U}^{(j)}\]

and,

\[B = \frac{1}{M-1}\sum_{j = 1}^{M}(\hat{Q}^{(j)} - \bar{Q})^2,\]

are the within- and between-imputation variances, respectively.

The additional sampling variance is literally the variance between divided by m. This value represents the sampling error associated with the overall or average coefficient estimates. It is used as a correction factor for using a specific number of imputations.
The relative increase in variance due to nonresponse is

\[ \hat{r} = \frac{B + \frac{1}{M}B}{\bar{U}}\]

Proportion of the variation attributable to the missing data is

\[ \hat{\gamma} = \frac{B + \frac{1}{M}B}{T} = \frac{\hat{r}}{\hat{r} + 1} \] In some literature this refers to the approximation of rate of missing information.

The adjusted degrees of freedom for test for Q is

\[\nu_M = (M - 1) \hat{\gamma} ^ {-2}\]

Modified procedure described by Barnard and Rubin (1999).

\[ \tilde{\nu}_M = \left(\frac{1}{\nu_M} + \frac{1}{\hat\nu}\right)^{-1}\] where \[ \hat{\nu} = \frac{\nu_{com}(\nu_{com} + 1)(1 - \hat{\gamma})}{\nu_{com} + 3} .\]

2.15 The last step is to fit the imputed data and pool the result:

fit <- with(imp, lm(read ~ write + math))
#fit
class(fit)

## [1] "mira"   "matrix"

#ls() extract information in fit
ls(fit)

## [1] "analyses" "call"     "call1"    "nmis"

#result of analysis on second imputed data 
summary(fit$analyses[[2]])

## 
## Call:
## lm(formula = read ~ write + math)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.0184  -4.9976  -0.4221   4.6915  21.2835 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.81268    3.28902   2.375   0.0185 *  
## write        0.31321    0.07276   4.305 2.63e-05 ***
## math         0.53019    0.07261   7.302 6.83e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.327 on 197 degrees of freedom
## Multiple R-squared:  0.4924, Adjusted R-squared:  0.4873 
## F-statistic: 95.55 on 2 and 197 DF,  p-value: < 2.2e-16

Now we can combine the result by pool

pool.fit <- pool(fit)
summary(pool.fit)

##          term  estimate  std.error statistic       df      p.value
## 1 (Intercept) 7.4375413 3.32790584  2.234901 185.2389 2.661868e-02
## 2       write 0.3181009 0.07380952  4.309755 157.4321 2.869523e-05
## 3        math 0.5286378 0.07323118  7.218753 161.1994 1.954372e-11

pool.fit

## Class: mipo    m = 5 
##          term m  estimate         ubar            b            t dfcom       df
## 1 (Intercept) 5 7.4375413 10.808563425 0.2219948567 11.074957253   197 185.2389
## 2       write 5 0.3181009  0.005121394 0.0002720429  0.005447845   197 157.4321
## 3        math 5 0.5286378  0.005064415 0.0002486593  0.005362806   197 161.1994
##          riv     lambda        fmi
## 1 0.02464655 0.02405371 0.03442294
## 2 0.06374271 0.05992305 0.07164236
## 3 0.05891917 0.05564085 0.06714344

Missing Data in R

1 Missing Data

1.1 Introduction

1.2 Why Multiple Imputation

1.3 Goals of statistical analysis with missing data

1.4 Missing data in R

1.5 Drawing a random sample in R

1.6 Missing indicator and Missing Mechanism

1.7 Common techniques for dealing with missing data:

1.8 Multiple Imputation

1.9 some thechnical note on imputation methods (Optional)

1.10 A note on Expectation Maximization (EM) (Optional)

2 Multiple Imputation in R

2.1 The `mice` package

2.2 hsb2_mar data

2.3 Mean value imputation

2.4 FIML using `lavaan` package

2.5 Some missing data visualization using `mice` and `VIM`

2.6 Mean imputation in mice

2.7 Impute the missing data with regression imputation

2.8 Impute the missing data with stochastic regression imputation

2.9 Multiple imputation using default option of mice function

2.10 Object `mids`

2.11 Change the predictor matrix

2.12 Inspect the convergence of the algorithm

2.13 change the imputation method

2.14 Analysis/Pooling Phase

2.15 The last step is to fit the imputed data and pool the result:

2.16 Exercise

Thank you!

Missing Data in R

1 Missing Data

1.1 Introduction

1.2 Why Multiple Imputation

1.3 Goals of statistical analysis with missing data

1.4 Missing data in R

1.5 Drawing a random sample in R

1.6 Missing indicator and Missing Mechanism

1.7 Common techniques for dealing with missing data:

1.8 Multiple Imputation

1.9 some thechnical note on imputation methods (Optional)

1.10 A note on Expectation Maximization (EM) (Optional)

2 Multiple Imputation in R

2.1 The mice package

2.2 hsb2_mar data

2.3 Mean value imputation

2.4 FIML using lavaan package

2.5 Some missing data visualization using mice and VIM

2.6 Mean imputation in mice

2.7 Impute the missing data with regression imputation

2.8 Impute the missing data with stochastic regression imputation

2.9 Multiple imputation using default option of mice function

2.10 Object mids

2.11 Change the predictor matrix

2.12 Inspect the convergence of the algorithm

2.13 change the imputation method

2.14 Analysis/Pooling Phase

2.15 The last step is to fit the imputed data and pool the result:

2.16 Exercise

Thank you!

2.1 The `mice` package

2.4 FIML using `lavaan` package

2.5 Some missing data visualization using `mice` and `VIM`

2.10 Object `mids`