Statistical Methods and Data Analytics

Outline

Part 1: A Review of Modeling Count Data and Generalized Linear Models (GLMs)

  • Count Data
  • Poisson and Negative Binomial Distributions
  • Generalized Linear Models (GLMs)
  • Dispersion

Part 2: Zero-Inflated and Hurdle Models

  • Zero-Inflated Poisson
  • Zero-Inflated Negative Binomial
  • Poisson Hurdle Model
  • Negative Binomial Hurdle Model
  • Prediction of Zero-Inflated and Hurdle Models

Packages used in the workshop

# Loading required packages for the workshop
library(tidyverse)  # For data manipulation and visualization
library(AER)        # Applied Econometrics with R
library(pscl)       # Political Science Computational Laboratory
library(MASS)       # Modern Applied Statistics with S
library(sjPlot)     # For visualizing regression models
library(lmtest)     # For hypothesis testing in linear models
library(emmeans)    # For estimated marginal means
library(performance) # For model performance evaluation
library(sandwich)   #Robust Covariance Matrix Estimators

Count Data

Count data refers to a type of data where the outcome variable represents the number of occurrences or events within a specific observation time period or unit.

Examples of count data:

  1. Number of customer arrivals at a store per day.
  2. Number of defects in a manufacturing process per 1000 units.
  3. Number of people having a particular disease per 100,000 people.
  4. Number of accidents on a highway per day.
  5. Number of patients who died at a hospital within a week.

Distribution of Count Data

  • Count data is discrete, meaning it takes on only non-negative integer values (0, 1, 2, …).
  • It often follows a skewed distribution, with many low counts and a few high counts.

Generating a sample of size 100 from \(y \sim \text{Poisson}(\lambda = 2)\)

# Set the random seed
set.seed(100)

# Sample size
n <- 100

# Draw Poisson random variable
count.p <- rpois(n = n, lambda = 2)

# Frequency table of counts
table(count.p)
## count.p
##  0  1  2  3  4  5  6 
##  7 31 31 17  9  4  1
# Sample mean and sample variance
mean(count.p)
## [1] 2.06
var(count.p)
## [1] 1.652929
# Create a barplot of the frequency table
barplot(table(count.p), col = "lightblue", 
        xlab = "Count", ylab = "Frequency")

Generating a sample of size 100 from \(y \sim NB(\mu = 2, \theta = 1)\)

# Drawing negative binomial random variables
count.nb <- rnbinom(n = n, size = 1, mu = 2)

# Frequency table of counts
table(count.nb)
## count.nb
##  0  1  2  3  4  5  6  7 18 
## 28 30 15 12  7  5  1  1  1
# Sample mean and sample variance
mean(count.nb)
## [1] 1.8
var(count.nb)
## [1] 5.252525
# Create a barplot of the frequency table
barplot(table(count.nb), col = "lightblue", 
        xlab = "Count", ylab = "Frequency")

Data Used in the Workshop

  • Cross-sectional data originated from the US National Medical Expenditure Survey (NMES).
  • Data are loaded using the R package AER (Applied Econometrics with R).
  • The data consist of a sub-sample of individuals aged 66 and over, all covered by Medicare.
  • The objective is to model the number of visits predicted by certain covariates in the dataset.
# Load the data in R from AER
library(AER)
data("NMES1988")

# Select variables used in the workshop
dat <- NMES1988[, c(1, 6:8, 13, 15, 18)]

# Display the first few rows of the dataset
head(dat)
##   visits hospital  health chronic gender school insurance
## 1      5        1 average       2   male      6       yes
## 2      1        0 average       2 female     10       yes
## 3     13        3    poor       4 female     10        no
## 4     16        1    poor       2   male      3       yes
## 5      3        0 average       2 female      6       yes
## 6     17        0    poor       5 female      7        no
# Show the structure of the dataset
str(dat)
## 'data.frame':    4406 obs. of  7 variables:
##  $ visits   : int  5 1 13 16 3 17 9 3 1 0 ...
##  $ hospital : int  1 0 3 1 0 0 0 0 0 0 ...
##  $ health   : Factor w/ 3 levels "poor","average",..: 2 2 1 1 2 1 2 2 2 2 ...
##   ..- attr(*, "contrasts")= num [1:3, 1:2] 1 0 0 0 0 1
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:3] "poor" "average" "excellent"
##   .. .. ..$ : chr [1:2] "poor" "excellent"
##  $ chronic  : int  2 2 4 2 2 5 0 0 0 0 ...
##  $ gender   : Factor w/ 2 levels "female","male": 2 1 1 2 1 1 1 1 1 1 ...
##  $ school   : int  6 10 10 3 6 7 8 8 8 8 ...
##  $ insurance: Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 2 2 2 2 ...

Examine Number of visits

  • Plot a histogram of the observed count frequencies

  • Summary statistics of the count

# Assign the counts to a variable
counts <- dat$visits

# Display summary statistics of the counts
summary(counts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   4.000   5.774   8.000  89.000
# Estimate mu from the sample mean
mu <- mean(counts)
mu
## [1] 5.774399
# Estimate variance from the sample variance
s2 <- var(counts)
s2
## [1] 45.68712
# Create a bar plot of the frequency table
plot(table(counts), col = "blue", ylab = "Frequency")

Regression Models with Count Outcome

  • In count models, the outcome represents counts or frequencies of events.
  • A frequent approach for modeling count data is Poisson Regression.
  • In Poisson Regression, the distribution of the response given predictors is Poisson with:

\[ \text{mean response} = \mu = E(y_i | x_1, x_2, \dots, x_p) \]

Here is a plot of Poisson regression on simulated data:

Overdispersion

  • A constraint of the Poisson distribution is that the variance is equivalent to the mean, known as equidispersion.
  • Overdispersion occurs when the variance of the observed counts exceeds the mean, indicating that the variability in the data is greater than what would be expected based on the Poisson distribution.

Causes of Overdispersion:

  • The problem of omitted covariates.
  • Clustering, similar to omitted covariates where the clustering is omitted.
  • Misspecified model assumptions, such as using a Poisson model instead of Negative Binomial.
  • Excess Zeros, such as some patients in the visits data never going to the doctor at all.

Addressing Overdispersion:

  • Robust standard error.
  • Negative Binomial (NB) Regression.
  • Zero-Inflated and Hurdle Models.

Generalized Linear Models (GLMs)

  • Generalized Linear Models (GLMs) are extended class of the linear model.

  • They are used when the response variable is not continuous or not normally distributed, such as binary, count, or non-negative continuous data.

  • In GLMs, there are three key components:

Random Component: The probability distribution of the response variable.

Linear predictor: The relationship between the predictor variables and the expected value of the response variable.

Link Function: That transform the expected outcome to the linear predictor. Commonly used link functions include the identity, logit, log, and inverse link functions.

Linear predictor:

In mathematical notation linear predictor can be express as:

\[\eta_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} \] Where,

  • \(\eta_i\) represents the linear predictor for the \(i\)th observation
  • \(x_{ij}\) represents the value of the \(j\)th predictor variable for the \(i\)th observation
  • \(\beta_j\) represents the regression coefficient corresponding to the \(j\)th predictor variable
  • \(p\) is the number of predictor variables.

We can summarize the above equation in the vector notation,

\[\eta_i = \mathbf{x}_i^T\mathbf{\beta}\]

Link Function:

Probability distribution of the response variable

Poisson distribution

The Probability Mass Function of a Poisson random variable \(Y\) with \(\mu > 0\) is

\[Pr(Y = y; \mu) = \dfrac{\exp(-\mu)\mu^y}{y!}, \text{ for } y = 0, 1, 2, \dots .\] - The population mean is: \(E(Y) = \mu\)

  • The population variance is: \(E[(Y - \mu)^2] = \mu\)

  • Sometimes the parameter (mean) of the Poisson distribution is noted as \(\lambda\).

Negative binomial (NB) distribution

The distribution of negative binomial, when it is specified in terms of the mean, \(\mu\), and size, \(\theta\), is:

\[Pr(Y = y; \mu, \theta) = \frac{\Gamma(y + \theta)}{ \Gamma(\theta)y!} \left(\frac{\mu^y \theta ^ \theta}{(\mu + \theta)^{y +\theta}}\right), \text{ for } y = 0, 1, 2, \dots\]

Where \(\mu\) , \(\theta\), and \(\Gamma(.)\) are the mean of the count distribution, the size parameter, and the gamma function.

  • The population mean is: \(E(Y) = \mu\)

  • The population variance is: \(E[(Y - \mu)^2] = \mu(1 + \mu / \theta)\)

  • In count regression, we are modeling \(E(y_i | \mathbf{x}_i) = \mu_i\)

Maximum Likelihood Estimate for GLMs

  • Maximum Likelihood Estimation (MLE) is used to estimate the parameters of GLMs.

  • The likelihood function, measures the probability of observing the data given the model parameters.

  • Maximum Likelihood Estimate of parameters achieves by maximizing the log of likelihood function.

  • By combining the three key components of GLM, we can evaluate the likelihood function for the GLM models.

  • In R, maximum likelihood estimation for GLMs is typically performed using iterative optimization algorithms such as Fisher scoring or Newton-Raphson.

Poisson Regression model for number of visits

  • We first start with the poisson regression of number of visits predicted by hospital, health, chronic, gender, school, and insurance.

  • In R, for Poisson model we use glm function.

## Create a formula
formula <- visits ~ hospital + health + chronic + gender + school + insurance
## Fit Poisson model using glm function from stats ( which loads with R)
m.pois <- glm(formula = formula,
                    family  = poisson(link = "log"),
                    data    = dat)
summary(m.pois)
## 
## Call:
## glm(formula = formula, family = poisson(link = "log"), data = dat)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.028874   0.023785  43.258   <2e-16 ***
## hospital         0.164797   0.005997  27.478   <2e-16 ***
## healthpoor       0.248307   0.017845  13.915   <2e-16 ***
## healthexcellent -0.361993   0.030304 -11.945   <2e-16 ***
## chronic          0.146639   0.004580  32.020   <2e-16 ***
## gendermale      -0.112320   0.012945  -8.677   <2e-16 ***
## school           0.026143   0.001843  14.182   <2e-16 ***
## insuranceyes     0.201687   0.016860  11.963   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 26943  on 4405  degrees of freedom
## Residual deviance: 23168  on 4398  degrees of freedom
## AIC: 35959
## 
## Number of Fisher Scoring iterations: 5

If the deviance is not close to the degrees of freedom, then there is a possibility of overdispersion.

We will check this later!

Extracting the results and interpreting the coefficient

  • Since for the expected count we used log-link, exponentiation of coefficients (using exp() function) converts them to multiplicative effects on the expected count.

  • Exponentiated coefficients are often expressed as Rate Ratio (RR) or Relative Risk or Risk Ratio.

  • For example, the RR for ‘healthpoor’ is 1.2818534. Since ‘average health’ serves as the reference level, we can interpret this as follows: the expected count or rate of visits for individuals with poor health is 28% higher than the expected rate for those with average health, holding other covariates constant. Alternatively, we expect the rate of visits for individuals with poor health to be 1.28 times higher than the expected rate for those with average health.

##using coeff function
coefficient <- coefficients(m.pois)
CI <- confint(m.pois)
colnames(CI) <- c("L", "U")
#Calculate Rate Ratios
RR <- exp(coefficient)
CI_RR <- exp(confint(m.pois))
colnames(CI_RR) <- c("L_RR", "U_RR")

#put them together:
print(round(cbind(coefficient, CI, RR, CI_RR),4))
##                 coefficient       L       U     RR   L_RR   U_RR
## (Intercept)          1.0289  0.9821  1.0754 2.7979 2.6702 2.9311
## hospital             0.1648  0.1530  0.1765 1.1792 1.1653 1.1930
## healthpoor           0.2483  0.2132  0.2832 1.2819 1.2377 1.3274
## healthexcellent     -0.3620 -0.4219 -0.3031 0.6963 0.6558 0.7385
## chronic              0.1466  0.1376  0.1556 1.1579 1.1476 1.1684
## gendermale          -0.1123 -0.1377 -0.0870 0.8938 0.8713 0.9167
## school               0.0261  0.0225  0.0298 1.0265 1.0228 1.0302
## insuranceyes         0.2017  0.1687  0.2348 1.2235 1.1838 1.2647
  • Alternatively, you can use tab_model from package sjPlot (Data Visualization for Statistics in Social Science).
#load sjPlot, install it if you need to
library(sjPlot)
tab_model(m.pois, show.dev = TRUE)
  visits
Predictors Incidence Rate Ratios CI p
(Intercept) 2.80 2.67 – 2.93 <0.001
hospital 1.18 1.17 – 1.19 <0.001
health [poor] 1.28 1.24 – 1.33 <0.001
health [excellent] 0.70 0.66 – 0.74 <0.001
chronic 1.16 1.15 – 1.17 <0.001
gender [male] 0.89 0.87 – 0.92 <0.001
school 1.03 1.02 – 1.03 <0.001
insurance [yes] 1.22 1.18 – 1.26 <0.001
Observations 4406
R2 Nagelkerke 0.577
Deviance 23167.806

For testing overdispersion we can use check_overdispersion from package performance or dispersiontest from package AER.

# Using dispersiontest from package AER
dispersiontest(m.pois)
## 
##  Overdispersion test
## 
## data:  m.pois
## z = 11.509, p-value < 2.2e-16
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   6.706192
# Using check_overdispersion from package performance
library(performance)
check_overdispersion(m.pois)
## # Overdispersion test
## 
##        dispersion ratio =     6.706
##   Pearson's Chi-Squared = 29493.587
##                 p-value =   < 0.001
## Overdispersion detected.
  • The result suggest that the true dispersion is likely greater than 1. Meaning we might have overdispersion.

  • This might results with biased estimation of standard errors.

  • As a remedy we can use robust standard error.

# Compute sandwich covariance matrix from package lmtest 
library(lmtest)
library(sandwich)
coeftest(m.pois, vcov = sandwich)
## 
## z test of coefficients:
## 
##                  Estimate Std. Error z value  Pr(>|z|)    
## (Intercept)      1.028874   0.064530 15.9442 < 2.2e-16 ***
## hospital         0.164797   0.021945  7.5095 5.935e-14 ***
## healthpoor       0.248307   0.054022  4.5964 4.298e-06 ***
## healthexcellent -0.361993   0.077449 -4.6740 2.954e-06 ***
## chronic          0.146639   0.012908 11.3605 < 2.2e-16 ***
## gendermale      -0.112320   0.035343 -3.1780  0.001483 ** 
## school           0.026143   0.005084  5.1422 2.715e-07 ***
## insuranceyes     0.201687   0.043128  4.6765 2.919e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Negative binomial Model for number of visits

  • For negative binomial model we use glm.nb function from package MASS (Modern Applied Statistics with S).

  • The NB model is typically effective in handling overdispersion caused by unobserved heterogeneity.

## Fit NB model using glm.nb function from MASS
library(MASS)
m.nb <- glm.nb(formula, data = dat, link = log)
summary(m.nb)
## 
## Call:
## glm.nb(formula = formula, data = dat, link = log, init.theta = 1.206603534)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      0.929257   0.054591  17.022  < 2e-16 ***
## hospital         0.217772   0.020176  10.793  < 2e-16 ***
## healthpoor       0.305013   0.048511   6.288 3.23e-10 ***
## healthexcellent -0.341807   0.060924  -5.610 2.02e-08 ***
## chronic          0.174916   0.012092  14.466  < 2e-16 ***
## gendermale      -0.126488   0.031216  -4.052 5.08e-05 ***
## school           0.026815   0.004394   6.103 1.04e-09 ***
## insuranceyes     0.224402   0.039464   5.686 1.30e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.2066) family taken to be 1)
## 
##     Null deviance: 5743.7  on 4405  degrees of freedom
## Residual deviance: 5044.5  on 4398  degrees of freedom
## AIC: 24359
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.2066 
##           Std. Err.:  0.0336 
## 
##  2 x log-likelihood:  -24341.1070
# Compare AIC between NB and Poisson models
c(AIC_NB = m.nb$aic, AIC_Pois = m.pois$aic)
##   AIC_NB AIC_Pois 
## 24359.11 35959.23
# Check for overdispersion in NB model
check_overdispersion(m.nb)
## # Overdispersion test
## 
##        dispersion ratio =    1.262
##   Pearson's Chi-Squared = 5548.711
##                 p-value =  < 0.001
## Overdispersion detected.
  • As we can see there is an extra parameter \(\theta\) that is estimated from this model compare to Poisson Model.

  • The estimated parameters change slightly but the standard errors are clearly greater than Poisson model.

  • AIC for NB model is 24359 while AIC for Poisson model was 35959.

  • NB model also use log-link by default. Here, we can also use exponentiation of coefficients and interpret them as rate ratios similar to Poisson model.

The excess of zero counts

  • In the count models, there are cases that we might have more zeros in the count than we expected.

  • We can check the excess of zero counts by predicting expected number of zeros and compare it with observed frequency of zero.

  • In R we can calculate the ratio of expected number of zeros by observed number of zeros either directly from the model fitted values or by check_zeroinflation from the package performance.

# Calculate observed and expected number of zero counts
obs.zero <- sum(dat$visits == 0)
Ezero.pois <- round(sum(dpois(0,lambda = fitted(m.pois))))
Ezero.nb <- round(sum(dnbinom(0,mu = fitted(m.nb), size = m.nb$theta)))

# Calculate ratios of observed to expected zeros
obs.zero / Ezero.pois 
## [1] 14.53191
obs.zero / Ezero.nb 
## [1] 1.123355
# Check for zero inflation in Poisson and NB models
check_zeroinflation(m.pois)
## # Check for zero-inflation
## 
##    Observed zeros: 683
##   Predicted zeros: 47
##             Ratio: 0.07
## Model is underfitting zeros (probable zero-inflation).
check_zeroinflation(m.nb)
## # Check for zero-inflation
## 
##    Observed zeros: 683
##   Predicted zeros: 608
##             Ratio: 0.89
## Model is underfitting zeros (probable zero-inflation).

When the number of observed zeros is substantially greater than the number of expected zeros, we may have overdispersion that has not been addressed yet.

Zero-Inflated and Hurdle Models

  • Zero-inflated and hurdle models are used when we need to deal with count data that has excess zeroes.

  • These models can be understood as a mixture of two subset of populations. In one subset, we have a usual count model that may or may not generate zero, and the other subset only produce zero count.

  • The probability that an observation is coming from the subset that produce only zero (often noted as \(\pi\)) can be modeled as a logistic regression model.

Zero inflated models

  • Zero-inflated models is used when theory suggests that the excess zeros are generated by a separate process from the count values.

  • In Zero-inflated model, the count component can also produce zero and the distribution of count part can be Poisson or negative binomial.

Hurdle models

  • In Hurdle models we are modeling excess zeroes separately from the rest of the data.

  • They suggest that there is a hurdle (threshold) that separates the zero counts from the positive counts. Different distributions are used to model these two components.

  • The zero counts is typically a binary response model and the the positive counts is a truncated-at-zero count mode.

Probability distribution of Zero-Inflated Models

Zero-inflated Poisson

  • probability mass function (PMF) of a zero-inflated Poisson (ZIP) distribution is:

\[P(Y = y) = \begin{cases} \pi + (1-\pi) e^{-\mu}, & \text{if } y = 0 \\ (1-\pi) \frac{\mu^y}{y!} e^{-\mu}, & \text{if } y > 0 \\ \end{cases}\]

  • \(\pi\) is the probability of excess zero. That is probability of an observation always produce 0.
  • \(P(Y=y)\) is the probability of observing \(y\) events. \(\mu\) is the rate parameter of Poisson distribution.

The mean and variance of the ZIP model are given by

\[E(Y) = (1 - \pi)\mu \]

\[Var(Y) = \mu(1-\pi)(1+\pi\mu)\] If \(\pi\) equals zero, the zero-inflated Poisson reduces to the standard Poisson model. As \(\pi\) increases, the variance also increases, leading to larger overdispersion.

Zero-inflated negative binomial

  • probability mass function (PMF) of a zero-inflated negative binomial (ZNB) distribution is:

\[P(Y = y) = \begin{cases} \pi + (1-\pi) \frac{\theta ^ \theta}{(\mu + \theta)^{\theta}}, & \text{if } y = 0 \\ (1-\pi) \frac{\Gamma(y + \theta)}{ \Gamma(\theta)y!} \left(\frac{\mu^y \theta ^ \theta}{(\mu + \theta)^{y +\theta}}\right) & \text{if } y > 0 \\ \end{cases}\]

The mean is same as ZIP:

\[E(Y) = (1 - \pi)\mu \] and the variance of the ZNB model is given by

\[Var(Y) = \mu(1-\pi)\left[1 + \mu\left(\pi+ \frac{1}{\theta}\right)\right] \]

Zero-Inflated Model for the number of visits

To model the number of visits, because we observed there are excess zeroes, we may suspect that there are some variables that can help to detect whether an individual is going to have an office visit.

  • insurance: For example, people without private health insurance may refuse to have an office visits because of copay.

  • gender: Men might not want to go to office visit compare to women.

  • chronic: Individuals with zero of chronic might not go to office visit at all.

  • school: Individuals with lower level of education might not go to office visit for routine check up.

We use those variable to model whether individuals tend to have an office visit or not.

Zero-Inflated Model for the number of visits

  • In R we use function zeroinfl from package pscl (Political Science Computational Laboratory) to estimate zero-inflated models.

  • Function zeroinfl has two part for formula, the first part is modeling count and the second part is modeling zero component, separated by |.

  • The option dist identifies the type of count distribution.

  • The see other option and explanations of formula for zeroifl for example, how to set offset for an exposure variable please see the R help documentation by running ?zeroinl.

# Fit zero-inflated Poisson and NB models
library(pscl)
m.zeroinf.pois <- zeroinfl(
  visits ~ hospital + health + chronic + gender + school + insurance | #the count component
                             chronic + insurance + school + gender, #the zero component
                           data = dat, dist = "poisson")
m.zeroinf.nb <- zeroinfl(visits ~ hospital + health + chronic + gender + school + insurance | 
                           chronic + insurance + school + gender,
                           data = dat, dist = "negbin")

summary(m.zeroinf.pois)
## 
## Call:
## zeroinfl(formula = visits ~ hospital + health + chronic + gender + school + 
##     insurance | chronic + insurance + school + gender, data = dat, dist = "poisson")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1455 -1.1616 -0.4734  0.5485 25.6156 
## 
## Count model coefficients (poisson with log link):
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.405287   0.024187  58.102  < 2e-16 ***
## hospital         0.159073   0.006059  26.252  < 2e-16 ***
## healthpoor       0.253476   0.017707  14.315  < 2e-16 ***
## healthexcellent -0.308014   0.031305  -9.839  < 2e-16 ***
## chronic          0.101857   0.004722  21.570  < 2e-16 ***
## gendermale      -0.062439   0.013061  -4.781 1.75e-06 ***
## school           0.019188   0.001874  10.238  < 2e-16 ***
## insuranceyes     0.080539   0.017153   4.695 2.66e-06 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.09716    0.13981  -0.695    0.487    
## chronic      -0.56813    0.04370 -13.000  < 2e-16 ***
## insuranceyes -0.75044    0.10194  -7.361 1.82e-13 ***
## school       -0.05470    0.01216  -4.498 6.87e-06 ***
## gendermale    0.40457    0.08904   4.544 5.53e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 20 
## Log-likelihood: -1.614e+04 on 13 Df
summary(m.zeroinf.nb)
## 
## Call:
## zeroinfl(formula = visits ~ hospital + health + chronic + gender + school + 
##     insurance | chronic + insurance + school + gender, data = dat, dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1980 -0.7081 -0.2728  0.3265 18.0135 
## 
## Count model coefficients (negbin with log link):
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.197683   0.056576  21.169  < 2e-16 ***
## hospital         0.211311   0.020326  10.396  < 2e-16 ***
## healthpoor       0.285186   0.045147   6.317 2.67e-10 ***
## healthexcellent -0.322441   0.060206  -5.356 8.53e-08 ***
## chronic          0.128995   0.011933  10.810  < 2e-16 ***
## gendermale      -0.084300   0.031027  -2.717  0.00659 ** 
## school           0.021488   0.004365   4.922 8.55e-07 ***
## insuranceyes     0.117544   0.041631   2.823  0.00475 ** 
## Log(theta)       0.395647   0.035368  11.187  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.07043    0.26763  -0.263  0.79242    
## chronic      -1.24725    0.17852  -6.987 2.82e-12 ***
## insuranceyes -1.24289    0.22153  -5.611 2.02e-08 ***
## school       -0.08394    0.02646  -3.172  0.00151 ** 
## gendermale    0.55588    0.19808   2.806  0.00501 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 1.4853 
## Number of iterations in BFGS optimization: 27 
## Log-likelihood: -1.209e+04 on 14 Df

Extracting the coefficients and Prediction in Zero-inflated models

Because both models have the same class we are going to examine only zero-inflated negative binomial.

  • The result of zeroinfl function is an object of class zeroinfl.

  • Functions summary and print are similar to standard lm and glm return summary statistics of coefficients of both part of the model.

  • Function coef extract coefficients of model (both components by default or defined by model = c("count","zero")).

  • Function vcov extract covariance matrix of coefficients. Can be defined by model = c("count","zero").

  • Other useful function in regression packages such as linearHypothesis is package car or lrtest also support object zeroinfl models.

#Show Class of model object
class(m.zeroinf.pois)
## [1] "zeroinfl"
class(m.zeroinf.nb)
## [1] "zeroinfl"
# Extract coefficients for count and zero components 
coef(m.zeroinf.nb, model = "count")
##     (Intercept)        hospital      healthpoor healthexcellent         chronic 
##      1.19768349      0.21131051      0.28518552     -0.32244098      0.12899539 
##      gendermale          school    insuranceyes 
##     -0.08430000      0.02148765      0.11754390
coef(m.zeroinf.nb, model = "zero")
##  (Intercept)      chronic insuranceyes       school   gendermale 
##  -0.07043143  -1.24725098  -1.24288644  -0.08394011   0.55588270
# Compute variance of count component
diag(vcov(m.zeroinf.nb, model = "count"))
##     (Intercept)        hospital      healthpoor healthexcellent         chronic 
##    3.200882e-03    4.131493e-04    2.038284e-03    3.624793e-03    1.423912e-04 
##      gendermale          school    insuranceyes 
##    9.626465e-04    1.905524e-05    1.733126e-03
  • Function predict is predicting new data or observed data. It computes predicted mean, predicted probability for values of count, and expected or mean of the count and zero component.

To understand what the predict() produce I am creating a data frame of prediction and I compute them using model coefficients.

# Predict response, count, zero, and probability of zero
predict.data <- data.frame(response = predict(m.zeroinf.nb, type = "response" ),
           count = predict(m.zeroinf.nb, type = "count"),
           zero = predict(m.zeroinf.nb, type = "zero" ),
           prob0 = predict(m.zeroinf.nb, type = "prob")[, 1])

# Calculate predictions directly by computing linear predictions
predict.data <- predict.data %>% 
  mutate(xb.count = model.matrix(m.zeroinf.nb,"count") %*% coef(m.zeroinf.nb, "count"),
         xb.zero  = model.matrix(m.zeroinf.nb, "zero") %*% coef(m.zeroinf.nb, "zero"),
         mu = exp(xb.count),
         pi = 1 / (1 + exp(-xb.zero)),
         p0 = pi + (1 - pi)*dnbinom(0, mu = mu, size = m.zeroinf.nb$theta))
round(head(predict.data),5)
##   response    count    zero   prob0 xb.count  xb.zero       mu      pi      p0
## 1  6.08628  6.22862 0.02285 0.10743  1.82915 -3.75558  6.22862 0.02285 0.10743
## 2  5.92128  5.97805 0.00950 0.09954  1.78809 -4.64722  5.97805 0.00950 0.09954
## 3 17.20018 17.24735 0.00274 0.02584  2.84766 -5.89884 17.24735 0.00274 0.02584
## 4  7.54011  7.76695 0.02921 0.09335  2.04988 -3.50376  7.76695 0.02921 0.09335
## 5  5.41309  5.48570 0.01324 0.11251  1.70214 -4.31146  5.48570 0.01324 0.11251
## 6  9.74980  9.75968 0.00101 0.05041  2.27826 -6.89427  9.75968 0.00101 0.05041

Interpreting the zero inflated model

1 - Examine the Logistic Component:

  • The logistic component predicts the excess zeros (i.e., the probability of observing a zero versus count).

  • We interpret the coefficients of the logistic component similar to logistic regression. They represent the log-odds of the excess zeros.

  • For example, the coefficient fo chronic is -1.24725 and it means the log-odds that an indevidual will never tend to have an office visit (or they do not have routine check up) decreses by -1.24725 on avarage for every extra number of chronic conditions, holding other predictors in the zero part constant.

  • Similar to logistic regression we can exponentiate the coefficient and interpret it as multipicative effect in the odds metric.

  • In R we can directly compute the exponentiation of the coefficient from summary or coef. Alternatively, we can use tab_model from package sjPlot.

  Zero-Inflated Poisson Zero-Inflated_NB
Predictors Incidence Rate Ratios CI p Incidence Rate Ratios CI p
(Intercept) 4.08 3.89 – 4.27 <0.001 3.31 2.96 – 3.70 <0.001
hospital 1.17 1.16 – 1.19 <0.001 1.24 1.19 – 1.29 <0.001
health [poor] 1.29 1.24 – 1.33 <0.001 1.33 1.22 – 1.45 <0.001
health [excellent] 0.73 0.69 – 0.78 <0.001 0.72 0.64 – 0.82 <0.001
chronic 1.11 1.10 – 1.12 <0.001 1.14 1.11 – 1.16 <0.001
gender [male] 0.94 0.92 – 0.96 <0.001 0.92 0.86 – 0.98 0.007
school 1.02 1.02 – 1.02 <0.001 1.02 1.01 – 1.03 <0.001
insurance [yes] 1.08 1.05 – 1.12 <0.001 1.12 1.04 – 1.22 0.005
Zero-Inflated Model
(Intercept) 0.91 0.69 – 1.19 0.487 0.93 0.55 – 1.57 0.792
chronic 0.57 0.52 – 0.62 <0.001 0.29 0.20 – 0.41 <0.001
insurance [yes] 0.47 0.39 – 0.58 <0.001 0.29 0.19 – 0.45 <0.001
school 0.95 0.92 – 0.97 <0.001 0.92 0.87 – 0.97 0.002
gender [male] 1.50 1.26 – 1.78 <0.001 1.74 1.18 – 2.57 0.005
Observations 4406 4406
R2 / R2 adjusted 0.627 / 0.626 0.886 / 0.885
  • The odds that an individual will never tend to have an office visit (or they do not have routine check up) for every extra number of chronic conditions decreases by \(100 \times (1 - 0.2872938) \approx 71%\) , holding other predictors in the zero part constant.

  • When we are interpreting the odds, it is a good idea to calculate probability of the outcome for an average individual.

# Predicted probability of the excess zeros for an average individual
mean(dat$chronic)
## [1] 1.541988
(predicted0 <- predict(
  m.zeroinf.nb, 
  newdata = data.frame(
    hospital = mean(dat$hospital), 
    health = "average",
    chronic = mean(dat$chronic), 
    gender = "female",
    school = mean(dat$school), 
    insurance = "no"
  ),
  type = "zero"
))
##          1 
## 0.05429772
# Calculate the odds
(odds <- predicted0 / (1 - predicted0))
##          1 
## 0.05741524
# Calculate odds changes
odds * 0.2872938
##          1 
## 0.01649504

2 - Examine the Count Component:

  • The count component models the count data conditional on not being an excess zero.

  • We interpret the coefficients of the count component as we would in a standard count model (e.g., Poisson or negative binomial regression).

  • The interpretation is limited to cases which have a chance of observing the event. For example people who usually have an office visit for the routine checkups but may or may not missed it in 1997.

  • For example, if the coefficient for a predictor is positive, it indicates that the expected count increases with increasing values of that predictor, given the individual is a person who does not refuse to have an office visit.

Plotting the zero-inflated count

  • It is often useful to plot the result of regression to visualize the effect of predictors.

  • Now that we understand the predict function for zero-inflated models, we can use it to get predictions for new values and plot them.

  • We can also use package emmeans to prob interactions for the model and plot them.

  • Two key arguments for functions in package emmeans are:

mode = c("response", "count", "zero", "prob0") and lin.pred = c(FALSE, TRUE).

# Create new data for predictions
new.data <- data.frame(
  hospital = mean(dat$hospital),
  health = "poor",
  chronic = seq(0, 8, 0.1),
  gender = "female",
  school = mean(dat$school),
  insurance = "no"
)

# Predicted response
new.data$response <- predict(m.zeroinf.nb, newdata = new.data, type ="response")

# Predicted count only in count component
new.data$count <- predict(m.zeroinf.nb, newdata = new.data, type = "count")

# Predicted probability of excess zero
new.data$pi <- predict(m.zeroinf.nb, newdata = new.data, type = "zero")

# Linear predictor
new.data$linear.count <-  log(new.data$count)

# Plot predicted response against chronic variable
ggplot(new.data, aes(x = chronic, y = response)) +
  geom_line(color = "blue", size = 1)

# Plot predicted count against chronic variable
ggplot(new.data, aes(x = chronic, y = count)) +
  geom_line(color = "lightblue", size = 1)

# Plot both predicted response and count together
new.data %>% 
  pivot_longer(cols = c(count, response),
               names_to = "type", values_to = "visits") %>% 
  ggplot(aes(x = chronic, y = visits, color = type)) +
  geom_line(size = 1)

# Plot linear predictor
ggplot(new.data, aes(x = chronic, y = linear.count)) +
  geom_line(color = "black", size = 1.5) +
  ylab("linear predictor count")

# Plot probability of excess zero
ggplot(new.data, aes(x = chronic, y = pi)) +
  geom_line(color = "red", size = 1.5) +
  ylab("probability of excess zero")

# Use emmeans 
library(emmeans)
emmip(m.zeroinf.nb, ~ chronic,
      at = list(
        hospital = mean(dat$hospital),
        health = "poor",
        chronic = 0:8,
        gender = "female",
        school = mean(dat$school),
        insurance = "no"
      ), 
      lin.pred = FALSE, mode ="count", CIs = TRUE) +
  ylab("count")

Probability distribution of Hurdle Models

Suppose the probability of zero is

\[Pr(Y = 0) = \pi,\] and suppose the PMF for the count model is \[Pr(Y= y) = f(y), \text{ for } y = 0, 1, 2, \dots\]

Thus, PMF for truncated count will be:

\[Pr(Y = y | y > 0) = \dfrac{ f(y)I_{(y > 0)}(y)} {1 - f(0)} ,\]

where \(I_A(.)\) is the indicator function (i.e. it is equal to 1 when \(y \in A\) and 0 when \(y \notin A\)).

Thus, the PMF for hurdle model will be:

\[P(Y = y) = \begin{cases} \pi & \text{if } y = 0 \\ \dfrac{(1 -\pi) \times f(y)} {1 - f(0)} & \text{if } y > 0 \\ \end{cases}\]

The mean of the hurdle model are given by

\[E(Y) = \dfrac{(1 -\pi) \times \mu} {1 - f(0)}\]

where \(\mu\) is the mean of the count distribution.

In R we model the \(\pi_h = 1 - \pi\) by logistic regression.

  • In R we can use function hurdle from package pscl to fit hurdle models.

  • It is very important to note that in the R package pscl, the hurdle function models the probability of positive counts using logistic regression, unlike the zeroinfl function which logistic regression models the probability of zero.

Hurdle Model for the number of visits

  • In hurdle models it is suggested to include all predictor from count component in the zero component.

  • For the number of visits we are going to use hospital, health, chronic, gender, school, and insurance as predictors for both part of the model.

  • If we use the same predictors in both part of the model, we do not need to define zero part in the model formula.

# Fit hurdle models
m.hurdle.pois <- hurdle(
  visits ~ hospital + health + chronic + gender + school + insurance, 
  data = dat, dist = "poisson")
m.hurdle.nb <- hurdle(
  visits ~ hospital + health + chronic + gender + school + insurance,
  data = dat, dist = "negbin")

summary(m.hurdle.pois)
## 
## Call:
## hurdle(formula = visits ~ hospital + health + chronic + gender + school + 
##     insurance, data = dat, dist = "poisson")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4144 -1.1565 -0.4770  0.5432 25.0228 
## 
## Count model coefficients (truncated poisson with log link):
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.406459   0.024180  58.167  < 2e-16 ***
## hospital         0.158967   0.006061  26.228  < 2e-16 ***
## healthpoor       0.253521   0.017708  14.317  < 2e-16 ***
## healthexcellent -0.303677   0.031150  -9.749  < 2e-16 ***
## chronic          0.101720   0.004719  21.557  < 2e-16 ***
## gendermale      -0.062247   0.013055  -4.768 1.86e-06 ***
## school           0.019078   0.001872  10.194  < 2e-16 ***
## insuranceyes     0.080879   0.017139   4.719 2.37e-06 ***
## Zero hurdle model coefficients (binomial with logit link):
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      0.043147   0.139852   0.309 0.757688    
## hospital         0.312449   0.091437   3.417 0.000633 ***
## healthpoor      -0.008716   0.161024  -0.054 0.956833    
## healthexcellent -0.289570   0.142682  -2.029 0.042409 *  
## chronic          0.535213   0.045378  11.794  < 2e-16 ***
## gendermale      -0.415658   0.087608  -4.745 2.09e-06 ***
## school           0.058541   0.011989   4.883 1.05e-06 ***
## insuranceyes     0.747120   0.100880   7.406 1.30e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 14 
## Log-likelihood: -1.613e+04 on 16 Df
summary(m.hurdle.nb)
## 
## Call:
## hurdle(formula = visits ~ hospital + health + chronic + gender + school + 
##     insurance, data = dat, dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1718 -0.7080 -0.2737  0.3196 18.0092 
## 
## Count model coefficients (truncated negbin with log link):
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.197699   0.058973  20.309  < 2e-16 ***
## hospital         0.211898   0.021396   9.904  < 2e-16 ***
## healthpoor       0.315958   0.048056   6.575 4.87e-11 ***
## healthexcellent -0.331861   0.066093  -5.021 5.14e-07 ***
## chronic          0.126421   0.012452  10.152  < 2e-16 ***
## gendermale      -0.068317   0.032416  -2.108   0.0351 *  
## school           0.020693   0.004535   4.563 5.04e-06 ***
## insuranceyes     0.100172   0.042619   2.350   0.0188 *  
## Log(theta)       0.333255   0.042754   7.795 6.46e-15 ***
## Zero hurdle model coefficients (binomial with logit link):
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      0.043147   0.139852   0.309 0.757688    
## hospital         0.312449   0.091437   3.417 0.000633 ***
## healthpoor      -0.008716   0.161024  -0.054 0.956833    
## healthexcellent -0.289570   0.142682  -2.029 0.042409 *  
## chronic          0.535213   0.045378  11.794  < 2e-16 ***
## gendermale      -0.415658   0.087608  -4.745 2.09e-06 ***
## school           0.058541   0.011989   4.883 1.05e-06 ***
## insuranceyes     0.747120   0.100880   7.406 1.30e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta: count = 1.3955
## Number of iterations in BFGS optimization: 16 
## Log-likelihood: -1.209e+04 on 17 Df

Zero component of Hurdle Models

  • We can see that the coefficients for zero components are exactly the same in Hurdle Poisson and Hurdle Negative Binomial. This is because the zero part is estimated completely separate from the Truncated count.

  • If we set all positive counts equal to 1 and run a logistic regression we will get the same result for zero part.

#Setting all positive count equal to one
dat <- dat %>% mutate(visits.1 = ifelse(visits >= 1, 1, 0))
#Run a logistic regression for visit1
m0 <- glm(visits.1 ~ hospital + health + chronic + gender + school + insurance, 
                           data = dat, family = binomial)
summary(m0)
## 
## Call:
## glm(formula = visits.1 ~ hospital + health + chronic + gender + 
##     school + insurance, family = binomial, data = dat)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      0.043147   0.139851   0.309 0.757687    
## hospital         0.312449   0.091428   3.417 0.000632 ***
## healthpoor      -0.008716   0.161020  -0.054 0.956833    
## healthexcellent -0.289570   0.142681  -2.029 0.042408 *  
## chronic          0.535213   0.045377  11.795  < 2e-16 ***
## gendermale      -0.415658   0.087607  -4.745 2.09e-06 ***
## school           0.058541   0.011989   4.883 1.05e-06 ***
## insuranceyes     0.747120   0.100878   7.406 1.30e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3800.7  on 4405  degrees of freedom
## Residual deviance: 3439.7  on 4398  degrees of freedom
## AIC: 3455.7
## 
## Number of Fisher Scoring iterations: 5

Extracting the coefficients and Prediction in Hurdle Models

  • The result of hurdle function is an object of class hurdle.

  • The behavior of the hurdle object for typical functions in regression models is similar to that of the zeroinfl object.

  • Function hurdletest can be used to test of the null hypothesis that no zero hurdle is required in hurdle regression models for count data. For this test both count and zero distributions should set to be the same.

# Show class of model objects
class(m.hurdle.pois)
## [1] "hurdle"
class(m.hurdle.nb)
## [1] "hurdle"
# Extract coefficients for count and zero components
coef(m.hurdle.nb, model = "count")
##     (Intercept)        hospital      healthpoor healthexcellent         chronic 
##      1.19769892      0.21189820      0.31595757     -0.33186113      0.12642059 
##      gendermale          school    insuranceyes 
##     -0.06831702      0.02069321      0.10017164
coef(m.hurdle.nb, model = "zero")
##     (Intercept)        hospital      healthpoor healthexcellent         chronic 
##     0.043146760     0.312448582    -0.008715843    -0.289570220     0.535212638 
##      gendermale          school    insuranceyes 
##    -0.415658035     0.058541235     0.747119813
# Compute variance of count component
diag(vcov(m.hurdle.nb, model = "count"))
##     (Intercept)        hospital      healthpoor healthexcellent         chronic 
##    3.477873e-03    4.577914e-04    2.309346e-03    4.368292e-03    1.550599e-04 
##      gendermale          school    insuranceyes 
##    1.050772e-03    2.056468e-05    1.816343e-03
# Test no zero hurdle
m.hurdle.nb.nb <- hurdle(
  visits ~ hospital + health + chronic + gender + school + insurance,
  data = dat, dist = "negbin", zero.dist = "negbin")
hurdletest(m.hurdle.nb.nb)
## Wald test for hurdle models
## 
## Restrictions:
## count_((Intercept) - zero_(Intercept) = 0
## count_hospital - zero_hospital = 0
## count_healthpoor - zero_healthpoor = 0
## count_healthexcellent - zero_healthexcellent = 0
## count_chronic - zero_chronic = 0
## count_gendermale - zero_gendermale = 0
## count_school - zero_school = 0
## count_insuranceyes - zero_insuranceyes = 0
## 
## Model 1: restricted model
## Model 2: visits ~ hospital + health + chronic + gender + school + insurance
## 
##   Res.Df Df  Chisq Pr(>Chisq)    
## 1   4396                         
## 2   4388  8 50.834  2.825e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Function predict in hurdle model can get arguments type = c("response", "prob", "count", "zero").

It is predicting new data or observed data.

"response" computes the predicted mean of outcome.

"prob" computes predicted probability for different values of count. For count = 0, it is \(\pi = 1 - \pi_h\) or predicted probability of 0 (from zero model which is the same as probability of zero for overall model).

"count" is for computing expected or mean of the count.

"zero" computes different than the object zeroinfl. It computes \(\frac{1 - \pi}{1 - f(0)}\), the ratio of probability of zero to 1 minus probability of getting 0 from the count distribution.

To understand what the predict() produce I am creating a data frame of prediction and I compute them using model coefficients.

##    response     count      zero      prob0 xb.count  xb.zero        mu
## 1  6.069804  6.162313 0.9849880 0.10825122 1.818452 2.108730  6.162313
## 2  5.938648  5.798758 1.0241240 0.07972391 1.757644 2.446104  5.798758
## 3 17.535174 17.495882 1.0022458 0.02417322 2.861966 3.698040 17.495882
## 4  7.456947  7.943310 0.9387706 0.12737279 2.072330 1.924390  7.943310
## 5  5.413351  5.338106 1.0140957 0.09868345 1.674871 2.211939  5.338106
## 6 10.005083  9.880982 1.0125596 0.04227832 2.290612 3.120283  9.880982
##           f0        pi         p0     zero2 response3 <- mu * pi/(1 - f0)
## 1 0.09466022 0.8917488 0.10825122 0.9849880                      6.069804
## 2 0.10140173 0.9202761 0.07972391 1.0241240                      5.938648
## 3 0.02635981 0.9758268 0.02417322 1.0022458                     17.535174
## 4 0.07045750 0.8726272 0.12737279 0.9387706                      7.456947
## 5 0.11121151 0.9013165 0.09868345 1.0140957                      5.413351
## 6 0.05415774 0.9577217 0.04227832 1.0125596                     10.005083

Interpreting the hurdle model

  • The binary part of the model helps identify factors that influence the presence or absence of the event of interest.

  • The coefficients of the zero part of hurdle models are interpreted in terms of log-odds or by exponentiation of them, representing the odds ratios of observing a positive count.

  • For example, in the doctor visits data, they are interpreted in terms of the odds ratios of an individual having at least one office visit.

  • The count part of the model estimates the effects of predictor variables on the count outcome, conditional on non-zero counts.

  • For example, in doctor visits, the exponentiated coefficients of counts represent the rate ratios of doctor visits given an individual has had at least one doctor visit.

For plotting a hurdle model, we can use the predict function and also the emmeans package. It’s important to note that the mode = "prob0" argument is used for the probability of 0, not mode = "zero".

# Use sjPlot to show model summaries
tab_model(m.hurdle.pois, m.hurdle.nb, m0, 
          dv.labels =c("Hurdle Poisson", "Hurdle NB", "logistic model"))
  Hurdle Poisson Hurdle NB logistic model
Predictors Incidence Rate Ratios CI p Incidence Rate Ratios CI p Odds Ratios CI p
(Intercept) 4.08 3.89 – 4.28 <0.001 3.31 2.95 – 3.72 <0.001
hospital 1.17 1.16 – 1.19 <0.001 1.24 1.19 – 1.29 <0.001
health [poor] 1.29 1.24 – 1.33 <0.001 1.37 1.25 – 1.51 <0.001
health [excellent] 0.74 0.69 – 0.78 <0.001 0.72 0.63 – 0.82 <0.001
chronic 1.11 1.10 – 1.12 <0.001 1.13 1.11 – 1.16 <0.001
gender [male] 0.94 0.92 – 0.96 <0.001 0.93 0.88 – 1.00 0.035
school 1.02 1.02 – 1.02 <0.001 1.02 1.01 – 1.03 <0.001
insurance [yes] 1.08 1.05 – 1.12 <0.001 1.11 1.02 – 1.20 0.019
(Intercept) 1.04 0.79 – 1.37 0.758
hospital 1.37 1.15 – 1.65 0.001
health [poor] 0.99 0.73 – 1.37 0.957
health [excellent] 0.75 0.57 – 0.99 0.042
chronic 1.71 1.56 – 1.87 <0.001
gender [male] 0.66 0.56 – 0.78 <0.001
school 1.06 1.04 – 1.09 <0.001
insurance [yes] 2.11 1.73 – 2.57 <0.001
Zero-Inflated Model
(Intercept) 1.04 0.79 – 1.37 0.758 1.04 0.79 – 1.37 0.758
hospital 1.37 1.14 – 1.64 0.001 1.37 1.14 – 1.64 0.001
health [poor] 0.99 0.72 – 1.36 0.957 0.99 0.72 – 1.36 0.957
health [excellent] 0.75 0.57 – 0.99 0.042 0.75 0.57 – 0.99 0.042
chronic 1.71 1.56 – 1.87 <0.001 1.71 1.56 – 1.87 <0.001
gender [male] 0.66 0.56 – 0.78 <0.001 0.66 0.56 – 0.78 <0.001
school 1.06 1.04 – 1.09 <0.001 1.06 1.04 – 1.09 <0.001
insurance [yes] 2.11 1.73 – 2.57 <0.001 2.11 1.73 – 2.57 <0.001
Observations 4406 4406 4406
R2 / R2 adjusted 0.635 / 0.635 0.890 / 0.890 0.089
# Plot probability of zero against chronic variable
emmip(m.hurdle.nb, ~ chronic,
       at = list(
         hospital = mean(dat$hospital),
         health = "poor",
         chronic = 0:8,
         gender = "female",
         school = mean(dat$school),
         insurance = "no"
       ), 
       lin.pred = FALSE, mode ="prob0", CIs = TRUE) +
  ylab("Probability of zero")

Comparing models

  • We can compare the dispersion between different models.

  • We can also compare predicted number of zeros in different models with observed number of zeros.

# Compute dispersion statistics for different models

Dispersion <- c(sum(residuals(m.pois, type = "pearson")^2) / m.pois$df.residual,
sum(residuals(m.nb, type = "pearson")^2) / m.nb$df.residual,
sum(residuals(m.zeroinf.pois, type = "pearson")^2) / m.zeroinf.pois$df.residual,
sum(residuals(m.zeroinf.nb, type = "pearson")^2) / m.zeroinf.nb$df.residual,
sum(residuals(m.hurdle.pois, type = "pearson")^2) / m.hurdle.pois$df.residual,
sum(residuals(m.hurdle.nb, type = "pearson")^2) / m.hurdle.nb$df.residual)

#Predicted number of zeros

Observed.zero = sum(dat$visits == 0)

Expected.zero <- round(c("Pois" = sum(dpois(0, fitted(m.pois))),
  "NB" = sum(dnbinom(0, mu = fitted(m.nb), size = m.nb$theta)),
  "Zero-inflated Poisson" = sum(predict(m.zeroinf.pois, type = "prob")[,1]),
  "Zero-inflated NB" = sum(predict(m.zeroinf.nb, type = "prob")[,1]),
  "Hurdle Poisson" = sum(predict(m.hurdle.pois, type = "prob")[,1]),
  "Hurdle NB" = sum(predict(m.hurdle.nb, type = "prob")[,1])))

Ratio <-  Observed.zero / Expected.zero 

cbind(Expected.zero, Ratio, Dispersion)
##                       Expected.zero      Ratio Dispersion
## Pois                             47 14.5319149   6.706136
## NB                              608  1.1233553   1.261644
## Zero-inflated Poisson           682  1.0014663   3.737124
## Zero-inflated NB                707  0.9660537   1.282264
## Hurdle Poisson                  683  1.0000000   3.799354
## Hurdle NB                       683  1.0000000   1.259898

hurdle Negative binomial has a lowest dispersion also NB has a low dispersion. If we want to handle excess zero and also need to handle over dispersion we might want to use robust standard error for zero-inflated negative binomial or Hurdle negative binomial.

  • There are other packages in R to model zero-Inflated and hurdle models such as glmmTMB, brms.

Exercise: Analyzing Fish Catch Data

In this exercise, we have data on 250 groups that visited a park. Each group was surveyed regarding the number of fish they caught (count), the number of children in the group (child), the total number of people in the group (persons), and whether they brought a camper to the park (camper).

Analyze this dataset using three different models:

  • Negative Binomial Model:

Predict the number of fish caught using the predictors child, persons, and camper.

  • Zero-Inflated Negative Binomial Model:

    For the count part, use child and camper as predictors. For the zero part, use the predictor variable persons.

  • Hurdle Negative Binomial Model:

    Include all predictors for both the count and zero models.

Instructions:

  • Load the dataset using the provided code snippet below.
  • Implement each of the three models as described above.
  • Analyze and interpret the results.
# Read the data from idre website
ex.data <- read.csv("https://stats.idre.ucla.edu/stat/data/fish.csv")

# Check the structure of the data
str(ex.data)
## 'data.frame':    250 obs. of  8 variables:
##  $ nofish  : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ livebait: int  0 1 1 1 1 1 1 1 0 1 ...
##  $ camper  : int  0 1 0 1 0 1 0 0 1 1 ...
##  $ persons : int  1 1 1 2 1 4 3 4 3 1 ...
##  $ child   : int  0 0 0 1 0 2 1 3 2 0 ...
##  $ xb      : num  -0.896 -0.558 -0.402 -0.956 0.437 ...
##  $ zg      : num  3.05 1.746 0.28 -0.602 0.528 ...
##  $ count   : int  0 0 0 0 1 0 0 0 0 1 ...
# Make the variable 'camper' a factor since it is binary
ex.data$camper <- factor(ex.data$camper, labels = c("no", "yes"))

# Fit the negative binomial model
ex.m.nb <- glm.nb(count ~ child + camper + persons, data = ex.data)

# Fit the zero-inflated negative binomial model
ex.m.zeroinf.nb <- zeroinfl(count ~ child + camper | persons, 
                            dist = "negbin", data = ex.data)

# Fit the hurdle negative binomial model
ex.m.hurdle.nb <- hurdle(count ~ child + camper + persons | 
                         child + camper + persons, 
                         dist = "negbin", data = ex.data)

# Summarize the results of each model
summary(ex.m.nb)
## 
## Call:
## glm.nb(formula = count ~ child + camper + persons, data = ex.data, 
##     init.theta = 0.4635287626, link = log)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.6250     0.3304  -4.918 8.74e-07 ***
## child        -1.7805     0.1850  -9.623  < 2e-16 ***
## camperyes     0.6211     0.2348   2.645  0.00816 ** 
## persons       1.0608     0.1144   9.273  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.4635) family taken to be 1)
## 
##     Null deviance: 394.25  on 249  degrees of freedom
## Residual deviance: 210.65  on 246  degrees of freedom
## AIC: 820.44
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.4635 
##           Std. Err.:  0.0712 
## 
##  2 x log-likelihood:  -810.4440
summary(ex.m.zeroinf.nb)
## 
## Call:
## zeroinfl(formula = count ~ child + camper | persons, data = ex.data, 
##     dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5861 -0.4617 -0.3886 -0.1974 18.0135 
## 
## Count model coefficients (negbin with log link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.3710     0.2561   5.353 8.64e-08 ***
## child        -1.5153     0.1956  -7.747 9.41e-15 ***
## camperyes     0.8791     0.2693   3.265   0.0011 ** 
## Log(theta)   -0.9854     0.1760  -5.600 2.14e-08 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   1.6031     0.8365   1.916   0.0553 .
## persons      -1.6666     0.6793  -2.453   0.0142 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 0.3733 
## Number of iterations in BFGS optimization: 22 
## Log-likelihood: -432.9 on 6 Df
summary(ex.m.hurdle.nb)
## 
## Call:
## hurdle(formula = count ~ child + camper + persons | child + camper + 
##     persons, data = ex.data, dist = "negbin")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72774 -0.51588 -0.25938 -0.03287 14.70973 
## 
## Count model coefficients (truncated negbin with log link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.6215     0.5960  -2.720 0.006521 ** 
## child        -1.0945     0.3198  -3.422 0.000621 ***
## camperyes     0.3746     0.3360   1.115 0.264934    
## persons       1.0029     0.1551   6.465 1.01e-10 ***
## Log(theta)   -1.0530     0.4974  -2.117 0.034266 *  
## Zero hurdle model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.3087     0.4612  -5.005 5.57e-07 ***
## child        -2.1380     0.3107  -6.882 5.90e-12 ***
## camperyes     1.0179     0.3246   3.136  0.00171 ** 
## persons       1.1104     0.1911   5.811 6.19e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta: count = 0.3489
## Number of iterations in BFGS optimization: 13 
## Log-likelihood: -395.2 on 9 Df
# Use tab_model to display the models' summaries
tab_model(ex.m.zeroinf.nb, ex.m.hurdle.nb, ex.m.nb)
  count count count
Predictors Incidence Rate Ratios CI p Incidence Rate Ratios CI p Incidence Rate Ratios CI p
(Intercept) 3.94 2.38 – 6.51 <0.001 0.20 0.06 – 0.64 0.007
child 0.22 0.15 – 0.32 <0.001 0.33 0.18 – 0.63 0.001
camper [yes] 2.41 1.42 – 4.08 0.001 1.45 0.75 – 2.81 0.265
persons 2.73 2.01 – 3.69 <0.001
(Intercept) 0.20 0.10 – 0.38 <0.001
child 0.17 0.11 – 0.24 <0.001
camper [yes] 1.86 1.17 – 2.95 0.008
persons 2.89 2.31 – 3.66 <0.001
Zero-Inflated Model
(Intercept) 4.97 0.96 – 25.60 0.055 0.10 0.04 – 0.25 <0.001
persons 0.19 0.05 – 0.72 0.014 3.04 2.09 – 4.41 <0.001
child 0.12 0.06 – 0.22 <0.001
camper [yes] 2.77 1.46 – 5.23 0.002
Observations 250 250 250
R2 / R2 adjusted 0.801 / 0.798 0.940 / 0.940 0.656

References

  • Zeileis, A., Kleiber, C., & Jackman, S. (2008). Regression Models for Count Data in R. Journal of Statistical Software, 27(8), 1-25. https://doi.org/10.18637/jss.v027.i08

  • Cameron AC, Trivedi PK (1998). Regression Analysis of Count Data. Cambridge University Press, Cambridge.

  • Deb P, Trivedi PK (1997). Demand for Medical Care by the Elderly: A Finite Mixture Approach.” Journal of Applied Econometrics, 12, 313-336.