NOTE: Zero-inflated negative binomial regression using **proc countreg** is only available in SAS version 9.2 or higher.

This page shows an example of zero-inflated
negative binomial regression analysis with
footnotes explaining the output in SAS. We have data on 250 groups that went
to a park for a weekend,
fish.sas7bdat. Each group was questioned about how
many fish they caught (**count**), how many children were in the group (**child**),
how many people were in the group (**persons**), and whether or not they
brought a camper to the park (**camper**). We explore the relationship of
**count** with **child**,
**camper** and
**persons**.

As assumed for a negative binomial model, our response variable is a count
variable and the variance of the response variable is greater than the mean of
the response variable. In a dataset in which
the response variable is a count, the number of zeroes may seem excessive. With
the example dataset in mind, consider the processes that could lead to a
response variable value of zero. A group may have spent the entire weekend
fishing, but failed to catch a fish. Another group may have not done any
fishing over the weekend and, not surprisingly, caught zero fish. The
first group *could* have caught one or more fish, but did not. The
second group was certain to catch zero fish. The second group will be referred
to from this point forward as a “certain zero”. Thus, the number of zeroes may
be inflated and the number of groups catching zero fish cannot be explained in
the same manner as the groups that caught more than zero fish.

A standard negative binomial model would not distinguish between the two processes causing an excessive number of zeroes, but a zero-inflated model allows for and accommodates this complication. When analyzing a dataset with an excessive number of outcome zeros and two possible processes that arrive at a zero outcome, a zero-inflated model should be considered. We can look at a histogram of the response variable to try to gauge if the number of zeros is excessive. (If two processes generated the zeroes in the response variable but there is not an excessive number of zeroes, a zero-inflated model may or may not be used.)

    /* Read the example dataset into the WORK library */
    data fish;
      set "D:\data\fish";
    run;

    /* Summary statistics for the response and the two count predictors */
    proc means data = fish mean std min max var;
      var count child persons;
    run;

    The MEANS Procedure

    Variable            Mean         Std Dev         Minimum         Maximum        Variance
    ----------------------------------------------------------------------------------------
    count          3.2960000      11.6350281               0     149.0000000     135.3738795
    child          0.6840000       0.8503153               0       3.0000000       0.7230361
    persons        2.5280000       1.1127303       1.0000000       4.0000000       1.2381687
    ----------------------------------------------------------------------------------------

    /* Histogram of the response to gauge the number of zeroes */
    proc univariate data = fish noprint;
      histogram count / midpoints = 0 to 50 by 1 vscale = count;
    run;

The zero-inflated negative binomial regression generates two separate models and then
combines them. First, a logit model is generated for the “certain zero” cases
described above, predicting whether or not a group would be a certain zero.
Then, a negative binomial model is generated to predict the counts for those groups that
are not certain zeros. Finally, the two models are combined. In SAS,
**proc countreg** can easily run a zero-inflated negative
binomial regression.
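How the two models combine can be sketched in a few lines of Python (used here only as a calculator, not as part of the SAS workflow). The sketch assumes the quadratic-variance (NB2) form of the negative binomial, with mean `mu` and dispersion `alpha`; the function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math

def nb2_pmf(y, mu, alpha):
    """Negative binomial (NB2) probability of count y, mean mu, dispersion alpha."""
    r = 1.0 / alpha  # "size" parameter of the negative binomial
    log_p = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
             + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))
    return math.exp(log_p)

def zinb_pmf(y, mu, alpha, pi_zero):
    """Mixture: a certain zero with probability pi_zero, otherwise an NB2 count."""
    p = (1.0 - pi_zero) * nb2_pmf(y, mu, alpha)
    return pi_zero + p if y == 0 else p

# Hypothetical values for illustration only (not the fitted estimates).
mu, alpha, pi_zero = 4.0, 2.5, 0.3

print("P(count = 0), negative binomial only:", nb2_pmf(0, mu, alpha))
print("P(count = 0), zero-inflated:         ", zinb_pmf(0, mu, alpha, pi_zero))
print("total probability:", sum(zinb_pmf(y, mu, alpha, pi_zero) for y in range(5000)))
```

A zero can come from either component, so the zero-inflated probability of a zero is always at least as large as the one implied by the count model alone.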

When running a
zero-inflated negative binomial model in **proc countreg**, you must specify both models: first the
count model on the **model** statement, then the model predicting the certain zeros
on the **zeromodel** statement. In this example, we
are predicting **count** with **child** and **camper** and
predicting the certain zeros with **persons**. The code for this model
and its output can be seen below.

    proc countreg data = fish method = qn;
      /* Count model: negative binomial part */
      model count = child camper / dist= zinegbin;
      /* Inflation model: predicts the certain zeros */
      zeromodel count ~ persons;
    run;

    The COUNTREG Procedure

                  Model Fit Summary

    Dependent Variable                        count
    Number of Observations                      250
    Data Set                              WORK.FISH
    Model                                      ZINB
    ZI Link Function                       Logistic
    Log Likelihood                       -432.89091
    Maximum Absolute Gradient            7.13505E-8
    Number of Iterations                         28
    Optimization Method           Dual Quasi-Newton
    AIC                                   877.78181
    SBC                                   898.91058

    Algorithm converged.

                            Parameter Estimates

                                      Standard                 Approx
    Parameter         DF    Estimate     Error    t Value    Pr > |t|
    Intercept          1    1.371048  0.256114       5.35      <.0001
    child              1   -1.515255  0.195591      -7.75      <.0001
    camper             1    0.879051  0.269274       3.26      0.0011
    Inf_Intercept      1    1.603106  0.836493       1.92      0.0553
    Inf_persons        1   -1.666566  0.679265      -2.45      0.0141
    _Alpha             1    2.678759  0.471328       5.68      <.0001

## Model Fit

                  Model Fit Summary

    Dependent Variable                        count
    Number of Observations                      250
    Data Set                              WORK.FISH
    Model                                      ZINB
    ZI Link Function^{a}                   Logistic
    Log Likelihood^{b}                   -432.89091
    Maximum Absolute Gradient            7.13505E-8
    Number of Iterations^{c}                     28
    Optimization Method           Dual Quasi-Newton
    AIC^{d}                               877.78181
    SBC^{e}                               898.91058

a. **ZI Link Function** –
This indicates the type of model that will be used to predict the excessive
zeroes. In this example, a logistic model will be used. This is the
default for zero-inflated negative binomial models in **proc countreg**.

b. **Log Likelihood** – This is the log likelihood of the fitted full
model. It is calculated from the probability of the data given the model’s
parameter estimates. The parameter estimates that are reported in the
output are those
that maximize the log likelihood of the model.

c. **Number of Iterations** – The fitting of this model is an
iterative procedure. With each iteration, the parameters are
updated and the log-likelihood is calculated. This process continues until
the log-likelihood is no longer improved and has been maximized. The
number of iterations required by SAS to reach this point is reported here.

d. **AIC** – This is the Akaike Information Criterion. It is calculated as
**AIC** = -2 Log Likelihood + 2(*s*), where *s* is the total number of
estimated parameters in the model (here, the two intercepts, the three
predictor coefficients, and **_Alpha**). In this example, *s* = 6 and we can see **AIC**
= -2*(-432.89091) + 2*6 = 877.78182, which matches the reported 877.78181 up to
rounding of the log likelihood. **AIC** is used for the comparison of models fit to
the same data, including nonnested models. It penalizes for the number of parameters
in the model. Ultimately, the model with the smallest **AIC** is
considered the best.

e. **SBC** – This is Schwarz’s Bayesian information criterion. Like
the **AIC**, it is based on the log likelihood and penalizes for the number of
parameters in the model. The smallest **SBC** is most desirable.
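Both criteria can be checked by hand from the Model Fit Summary. The short Python sketch below (used only as a calculator) takes the reported log likelihood, counts the six rows of the Parameter Estimates table as parameters, and assumes the standard Schwarz formula, -2 Log Likelihood + *s* ln(*n*), for **SBC**; the variable names are illustrative.

```python
import math

log_lik = -432.89091   # Log Likelihood from the Model Fit Summary
n_params = 6           # rows of the Parameter Estimates table, including _Alpha
n_obs = 250            # Number of Observations

aic = -2 * log_lik + 2 * n_params
sbc = -2 * log_lik + n_params * math.log(n_obs)  # assumed standard Schwarz formula

print(f"AIC = {aic:.4f}")
print(f"SBC = {sbc:.4f}")
```

Under these assumptions both values agree with the reported 877.78181 and 898.91058 up to rounding of the log likelihood.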

## Parameter Estimates

                                   Parameter Estimates

                                           Standard                      Approx
    Parameter^{f}     DF   Estimate^{g}   Error^{h}   t Value^{i}   Pr > |t|^{j}
    Intercept          1       1.371048    0.256114          5.35         <.0001
    child              1      -1.515255    0.195591         -7.75         <.0001
    camper             1       0.879051    0.269274          3.26         0.0011
    Inf_Intercept      1       1.603106    0.836493          1.92         0.0553
    Inf_persons        1      -1.666566    0.679265         -2.45         0.0141
    _Alpha^{k}         1       2.678759    0.471328          5.68         <.0001

f. **Parameter** –
These refer to the independent variables in the model as well as intercepts
(a.k.a. constants) and, in the case of negative binomial regression, a
dispersion parameter. The parameter names that begin **Inf_** are the
parameters from the inflation model. The dispersion parameter is **_Alpha**.

g. **Estimate** –
These are the regression coefficients. The coefficients for **Intercept**,
**child**, and **camper** are interpreted as you would interpret coefficients
from a standard negative binomial model: the expected number of fish caught
changes by a factor of exp(**Estimate**) for each unit increase in the corresponding
predictor.

**Predicting Number of Fish Caught for the Non-“Certain Zero” Groups**

**child** – If a group were to increase its **child** count by one,
the expected number of fish caught would be multiplied by a factor of exp(-1.515255)
= 0.2197521 (a decrease of about 78%) while holding all other variables in the model
constant. Thus, the more children in a group, the fewer caught fish are
predicted.

**camper** – The expected number of fish caught in a weekend for a
group with a camper is exp(0.879051) = 2.408613 times the expected number of
fish caught in a weekend for a group without a camper while holding all other
variables in the model constant. Thus, if a group with a camper and a group
without a camper are not certain zeros and have identical numbers of children,
the expected number of fish caught for the group with a camper is 2.408613 times
the expected number of fish caught by the group without a camper.

**Intercept** – If all of the predictor variables in the model are
evaluated at zero, the predicted number of fish caught would be calculated as
exp(**Intercept**) = exp(1.371048). For groups without a camper or children
(the variables **camper** and **child** evaluated at zero), the predicted
number of fish caught would be 3.939477.
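The factors quoted above are simply the exponentiated coefficients of the count model. A small Python check (used as a calculator; the dictionary and function names are illustrative, not part of the SAS output) reproduces them and shows how the coefficients combine into an expected count for a group that is not a certain zero:

```python
import math

# Count-model estimates copied from the Parameter Estimates table
coef = {"Intercept": 1.371048, "child": -1.515255, "camper": 0.879051}

# Multiplicative change in the expected count per unit increase in each predictor
rate_ratios = {name: math.exp(b) for name, b in coef.items() if name != "Intercept"}

def expected_count(child, camper):
    """Expected fish caught for a group that is not a certain zero."""
    linear_predictor = (coef["Intercept"]
                        + coef["child"] * child
                        + coef["camper"] * camper)
    return math.exp(linear_predictor)

for name, ratio in rate_ratios.items():
    print(f"{name}: expected count multiplied by {ratio:.4f}")

print(f"no children, no camper:   {expected_count(0, 0):.4f}")
print(f"one child, with a camper: {expected_count(1, 1):.4f}")
```

Because the model is linear on the log scale, the effects multiply: the expected count for one child with a camper is the baseline times both factors.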

**Predicting “Certain Zero” Groups**

**Inf_persons** – If a group were to increase its **persons**
value by one, the odds that it would be a “Certain Zero” would be multiplied by
a factor of exp(-1.666566) = 0.1888946. In other words, the more people in a
group, the less likely the group is a certain zero.

**Inf_Intercept** – If all of the predictor variables in the
inflation model are
evaluated at zero, the odds of being a “Certain Zero” are exp(1.603106) =
4.96844. This means that the predicted odds for a group with zero persons are
4.96844 (though remember that evaluating **persons** at zero is out of the
range of plausible values–every group must have at least one person).
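Since zero persons is outside the data, it can be more informative to evaluate the inflation model at the observed group sizes of one to four. The Python sketch below (a calculator-style illustration with hypothetical function names) converts the logistic coefficients into odds and probabilities of being a certain zero:

```python
import math

# Inflation-model estimates copied from the Parameter Estimates table
inf_intercept, inf_persons = 1.603106, -1.666566

def certain_zero_odds(persons):
    """Odds of being a certain zero for a group of the given size."""
    return math.exp(inf_intercept + inf_persons * persons)

def certain_zero_probability(persons):
    """Probability of being a certain zero (logistic link)."""
    odds = certain_zero_odds(persons)
    return odds / (1.0 + odds)

print(f"odds multiplier per extra person: {math.exp(inf_persons):.4f}")
for persons in range(1, 5):
    print(f"persons = {persons}: "
          f"odds = {certain_zero_odds(persons):.4f}, "
          f"probability = {certain_zero_probability(persons):.4f}")
```

Each additional person multiplies the odds by the same factor, so the probability of being a certain zero falls quickly as group size grows.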

h. **Standard Error** –
These are the standard errors of the individual regression coefficients for the
two models. They are used in the calculation of the **t** test
statistic, superscript i.

i. **t Value** –
This is the test statistic used to test against a null hypothesis that the **Estimate**
is equal to zero. The **t Value** is the ratio of the **Estimate** to the
**Standard Error** of the given predictor. The value follows a t-distribution
with (# of observations – # of parameters) = (250 – 6) = 244 degrees of freedom.

j. **Approx Pr > |t|** –
This is the approximate probability that the **t** test statistic (or a more
extreme test statistic) would be observed under the null hypothesis. The
null hypothesis is that the coefficient is zero, given that the rest of the predictors are in the
model. For a given alpha level, **Pr > |t|** determines whether or not the null
hypothesis can be rejected. If **Pr > |t|** is less than alpha, then the null
hypothesis can be rejected and the parameter estimate is considered significant
at that alpha level.
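The **t Value** column can be reproduced by dividing each estimate by its standard error. The Python sketch below does this and also computes a two-sided p-value from a standard normal reference distribution. The normal distribution is an assumption made here because the Python standard library has no t-distribution function; with 244 degrees of freedom the two distributions give very similar values. Function names are illustrative.

```python
import math

# (Estimate, Standard Error) pairs copied from the Parameter Estimates table
estimates = {
    "Intercept":     (1.371048, 0.256114),
    "child":         (-1.515255, 0.195591),
    "camper":        (0.879051, 0.269274),
    "Inf_Intercept": (1.603106, 0.836493),
    "Inf_persons":   (-1.666566, 0.679265),
    "_Alpha":        (2.678759, 0.471328),
}

def t_value(estimate, std_err):
    """Test statistic: the estimate divided by its standard error."""
    return estimate / std_err

def two_sided_p_normal(t):
    """Two-sided p-value under a standard normal reference (an approximation)."""
    return math.erfc(abs(t) / math.sqrt(2.0))

for name, (est, se) in estimates.items():
    t = t_value(est, se)
    print(f"{name:14s} t = {t:6.2f}   approx p = {two_sided_p_normal(t):.4f}")
```

The ratios round to the reported t values, and the approximate p-values land close to the **Approx Pr > |t|** column.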

**Predicting Number of Fish Caught for the Non-“Certain Zero” Groups**

**child** – The **t** test statistic for the predictor **child**
is (-1.515255/0.195591) = -7.75 with an associated p-value of <.0001. If we
set our alpha level to 0.05, we would reject the null hypothesis and conclude
that the regression coefficient for **child** has been found to be
statistically different from zero given the other variables are in the model.

**camper** – The **t** test statistic for the predictor **camper**
is (0.879051/0.269274) = 3.26 with an associated p-value of 0.0011. If we
set our alpha level to 0.05, we would reject the null hypothesis and conclude
that the regression coefficient for **camper** has been found to be
statistically different from zero given the other variables are in the model.

**Intercept** – The **t** test statistic for the intercept,
**Intercept**, is (1.371048/0.256114) = 5.35 with an associated p-value of
<.0001. If we set our alpha level at 0.05, we would reject the null hypothesis
and conclude that **Intercept** has been found to be statistically different
from zero given the other variables are in the model and evaluated at zero.

**Predicting “Certain Zero” Groups**

**Inf_persons** – The **t** test statistic for the predictor **Inf_persons**
is (-1.666566/0.679265) = -2.45 with an associated p-value of 0.0141. If we
again set our alpha level to 0.05, we would reject the null hypothesis and
conclude that the regression coefficient for **Inf_persons** has been
found to be statistically different from zero given the other variables are in
the model.

**Inf_Intercept** – The **t** test statistic for the intercept, **Inf_Intercept**,
is (1.603106/0.836493) = 1.92 with an associated p-value of 0.0553. With an
alpha level of 0.05, we would fail to reject the null hypothesis and conclude
that **Inf_Intercept** has not been found to be statistically different from
zero given the other variables are in the model.

k. **_Alpha** – This is the dispersion parameter of the count model. If
the dispersion parameter is zero, the negative binomial distribution reduces to
the Poisson distribution, and a Poisson count model would be appropriate. Based on the t value
of 5.68 and the associated p-value of <.0001, we can reject the null
hypothesis that **_Alpha** is equal to zero. Thus, a Poisson model
would not be appropriate and we are justified in using a negative binomial
model.