When modeling a count variable, there are several models to choose from. Stata supports Poisson, negative binomial, zero-inflated Poisson, zero-inflated negative binomial, zero-truncated Poisson, and zero-truncated negative binomial models. Among these, you can likely narrow your choices down to one or two based on how the data were collected and the distribution of your outcome variable.

Some important questions to ask yourself before you start building a count model are:

- Does my outcome variable contain zeroes?
- If yes, what does a zero mean?
- If no, why not?
- How is my outcome variable distributed?
- What is the variance of my outcome variable?
- How does the variance compare to the mean?

The answers to the first set of questions can indicate whether your data are zero-truncated or zero-inflated.

**If your count variable does not contain zeroes…**
ask yourself why. Given what you know about the variable, would you have
expected to see counts of zero? In some instances, a count of zero will
never appear in the dataset. For example, if your count variable is the number
of visits to a given coffee shop in a week and your data are collected among people
*in* the coffee shop,
then all the people for whom you collected data have a count of at least one. A
person who never visits the coffee shop will not be included in the dataset. This
is an example of *zero-truncated data* and would most appropriately be
modeled using a zero-truncated model. However, just because a variable
does not contain zeroes does not mean the variable is zero-truncated.
Truncation occurs because of the way in which the data are collected and cannot
be assumed just from looking at the data.
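The coffee-shop scenario can be made concrete with a small numerical sketch. The snippet below is in Python purely for illustration (the arithmetic is the same in any language), and the rate `lam = 2.5` is a made-up value. It shows how a zero-truncated Poisson distribution removes the probability mass at zero and rescales the remaining probabilities:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Ordinary Poisson probability P(Y = k)."""
    return exp(-lam) * lam**k / factorial(k)

def zero_truncated_pmf(k, lam):
    """Zero-truncated Poisson: P(Y = k | Y > 0).

    The Poisson mass at zero, exp(-lam), is removed and the
    remaining probabilities are rescaled so they sum to one.
    """
    if k == 0:
        return 0.0
    return poisson_pmf(k, lam) / (1.0 - exp(-lam))

lam = 2.5  # hypothetical mean number of weekly coffee-shop visits
probs = [zero_truncated_pmf(k, lam) for k in range(30)]
print(round(probs[0], 4))    # a count of zero is impossible: 0.0
print(round(sum(probs), 4))  # the probabilities still sum to 1: 1.0
```

This is what a zero-truncated model assumes: the zeroes are not merely rare, they cannot appear in the data at all.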

**If your count variable does contain zeroes…**
again,
ask yourself why.
What does a count of zero mean, given what you know about the variable? In some
instances, some records in a dataset could ONLY have counts of zero. For
example, if you are collecting data on the number of fish caught by a person in
a given weekend, a person who did not go fishing could only have a count of
zero. A person who did go fishing could also have a count of zero, but these two
zeroes would have very different meanings. This is an example of
*zero-inflated data* and may be most appropriately modeled with a
zero-inflated model.
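The fishing example corresponds to a mixture distribution: with some probability pi an observation is a "certain zero" (the person never fishes), and otherwise the count follows an ordinary Poisson process. A minimal numerical sketch (Python for illustration only; `pi = 0.3` and `lam = 2.0` are made-up values):

```python
from math import exp, factorial

def zero_inflated_poisson_pmf(k, lam, pi):
    """Zero-inflated Poisson: a mixture of "certain zeroes"
    (probability pi) and an ordinary Poisson process.

    P(Y = 0) = pi + (1 - pi) * exp(-lam)
    P(Y = k) = (1 - pi) * exp(-lam) * lam**k / k!   for k > 0
    """
    poisson = exp(-lam) * lam**k / factorial(k)
    if k == 0:
        return pi + (1 - pi) * poisson
    return (1 - pi) * poisson

# Hypothetical values: 30% of people never fish (pi = 0.3);
# the rest catch a Poisson(2) number of fish.
lam, pi = 2.0, 0.3
print(round(zero_inflated_poisson_pmf(0, lam, pi), 3))  # → 0.395
```

Note that the probability of a zero (about 0.395 here) is inflated well above the ordinary Poisson value exp(-2) ≈ 0.135, because it collects both the structural zeroes and the sampling zeroes.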

The answers to the second set of questions can indicate whether a Poisson or negative binomial distribution is more appropriate for your data. The Poisson distribution is characterized by equal mean and variance. If the sample mean and sample variance of your outcome variable are in the same neighborhood, a Poisson model may be appropriate. If the variance exceeds the mean by a great deal, a negative binomial model may be appropriate. Keep in mind that zero-truncation and zero-inflation can impact these statistics and there is not a magical cut-off determining when one model is better than the other. These are simply aspects of your data worth considering.
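Before fitting anything, it is worth computing these two statistics directly. A quick sketch (Python for illustration, with made-up counts):

```python
from statistics import mean, variance  # variance() is the sample variance (n - 1 denominator)

# Made-up outcome counts; substitute your own outcome variable here.
counts = [0, 0, 1, 1, 2, 2, 3, 4, 6, 11]

m = mean(counts)     # 3.0
v = variance(counts)
print(m, v, v / m)   # the variance is nearly four times the mean
```

A ratio well above 1, as here, suggests overdispersion, which points toward a negative binomial model; a ratio near 1 is consistent with a Poisson model. As noted above, this is a rough screen, not a formal test.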

## Deciding on a final model

You may find that, after making all of the above considerations and deciding
which predictors should be included in your model, you are still torn. You have
two count models that both seem substantively reasonable, but that are clearly
mathematically different. At this point, the **countfit** command in Stata
(written by Long and Freese) can be a helpful tool. You can download it by
typing **search countfit** (see *How can I use the search command to search for
programs and get additional help?* for more information about using
**search**) and following the appropriate links.

**Countfit** runs user-specified count models (Poisson, negative binomial,
zero-inflated Poisson, and zero-inflated negative binomial) with
user-specified variables and compares the model residuals. If you do not
indicate which models you wish to compare (**prm**, **nbreg**, **zip**,
or **zinb**), **countfit** will default to
running and comparing all four. One of the quirks of **countfit** is that it
displays results for a given predictor next to the predictor's
*label* rather than its name, so
an unlabeled predictor will lead to both empty space in the output and possible
confusion. This can be prevented by labeling ALL predictors before running
**countfit**.

In this example, we will be looking at academic information on
316 students. The response variable is days absent during the school year (**daysabs**).
We have narrowed our model choices down to two negative binomial models, one
with zero-inflation and one without. We believe that **daysabs** can best
be predicted by math standardized test scores (**mathnce**),
language standardized test scores (**langnce**), and gender (**female**).
We suspect the certain zeroes can be predicted by bilingual status (**biling**)
and language score (**langnce**). We will run the **countfit** command
indicating these predictors and the two models we wish to compare,
then discuss the output.

    use https://stats.idre.ucla.edu/stat/stata/notes/lahigh, clear
    generate female = (gender == 1)
    label variable female `"female"'
    countfit daysabs mathnce langnce female, inf(biling langnce) nbreg zinb

    ----------------------------------------------------------
                Variable |     NBRM        ZINB
    ---------------------+------------------------------------
    daysabs              |
           ctbs math nce |    0.998       0.998
                         |    -0.33       -0.34
           ctbs lang nce |    0.986       0.987
                         |    -2.57       -2.38
                  female |    1.539       1.502
                         |     3.09        2.94
                Constant |    9.825       9.795
                         |    10.89       10.94
    ---------------------+------------------------------------
    lnalpha              |
                Constant |    1.288       1.191
                         |     2.65        1.62
    ---------------------+------------------------------------
    inflate              |
        bilingual status |               10.132
                         |                 1.53
           ctbs lang nce |                1.094
                         |                 1.92
                Constant |                0.000
                         |                -2.16
    ---------------------+------------------------------------
    Statistics           |
                   alpha |    1.288
                       N |  316.000     316.000
                      ll | -880.873    -879.131
                     bic | 1790.525    1804.307
                     aic | 1771.746    1774.262
    ----------------------------------------------------------
                  legend: b/t

    Comparison of Mean Observed and Predicted Count

              Maximum       At     Mean
    Model    Difference   Value   |Diff|
    ---------------------------------------------
    NBRM        0.011        1     0.004
    ZINB        0.018        1     0.005

    NBRM: Predicted and actual probabilities
    Count   Actual  Predicted  |Diff|   Pearson
    ------------------------------------------------
       0     0.196      0.201   0.004     0.031
       1     0.146      0.135   0.011     0.278
       2     0.098      0.104   0.006     0.093
       3     0.082      0.083   0.001     0.004
       4     0.070      0.068   0.001     0.008
       5     0.063      0.057   0.006     0.230
       6     0.047      0.048   0.000     0.001
       7     0.038      0.040   0.002     0.046
       8     0.028      0.034   0.006     0.318
       9     0.035      0.029   0.005     0.319
    ------------------------------------------------
     Sum     0.804      0.799   0.044     1.327

    ZINB: Predicted and actual probabilities
    Count   Actual  Predicted  |Diff|   Pearson
    ------------------------------------------------
       0     0.196      0.207   0.011     0.176
       1     0.146      0.127   0.018     0.850
       2     0.098      0.101   0.003     0.022
       3     0.082      0.082   0.000     0.000
       4     0.070      0.068   0.001     0.007
       5     0.063      0.057   0.006     0.194
       6     0.047      0.048   0.001     0.006
       7     0.038      0.041   0.003     0.077
       8     0.028      0.035   0.007     0.393
       9     0.035      0.030   0.005     0.240
    ------------------------------------------------
     Sum     0.804      0.798   0.055     1.965

    Tests and Fit Statistics
    -------------------------------------------------------------------------
    NBRM            BIC=  -28.290  AIC=    5.607  Prefer  Over  Evidence
    -------------------------------------------------------------------------
     vs ZINB        BIC=  -14.507  dif=  -13.783   NBRM   ZINB  Very strong
                    AIC=    5.615  dif=   -0.008   NBRM   ZINB
                   Vuong=   0.858  prob=   0.195   ZINB   NBRM  p=0.195
    -------------------------------------------------------------------------

## Making sense of the output

First, let’s discuss the graph. **Countfit** produces a graph that plots the residuals from the tested models. Small residuals are indicative of good-fitting
models, so the models with lines closest to zero should be considered for our
data. This may be useful in eliminating a model, but be careful
about deciding on one model over all the others based solely on this graph.
In the graph above, we see that our two models perform very similarly for counts
greater than two, and that they both differ most from the actual values and each other at the
zero and one counts. At the zero and one counts, the negative binomial
model appears slightly better than the zero-inflated negative binomial model.

## Model parameters and fit

The first table in the output summarizes the parameter estimates from each of the tested models. For each model, we see the exponentiated coefficients and their t-statistics in the first block of the table. Then, for any negative binomial models, we see the estimated dispersion parameter. Next, for any zero-inflated models, we see the estimates from the logistic model predicting the certain zeroes. In the last block of the table, a set of fit statistics is provided for each model, including the log-likelihood, BIC, and AIC.

    ----------------------------------------------------------
                Variable |     NBRM        ZINB
    ---------------------+------------------------------------
    daysabs              |
           ctbs math nce |    0.998       0.998
                         |    -0.33       -0.34
           ctbs lang nce |    0.986       0.987
                         |    -2.57       -2.38
                  female |    1.539       1.502
                         |     3.09        2.94
                Constant |    9.825       9.795
                         |    10.89       10.94
    ---------------------+------------------------------------
    lnalpha              |
                Constant |    1.288       1.191
                         |     2.65        1.62
    ---------------------+------------------------------------
    inflate              |
        bilingual status |               10.132
                         |                 1.53
           ctbs lang nce |                1.094
                         |                 1.92
                Constant |                0.000
                         |                -2.16
    ---------------------+------------------------------------
    Statistics           |
                   alpha |    1.288
                       N |  316.000     316.000
                      ll | -880.873    -879.131
                     bic | 1790.525    1804.307
                     aic | 1771.746    1774.262
    ----------------------------------------------------------
                  legend: b/t

From the last block, we can see that the two models are extremely close. The parameter estimates are nearly identical. We can continue to look at the rest of the output.

## Residuals by count

Next, we see a table with one line per model showing the maximum and mean differences in observed versus predicted counts.

    Comparison of Mean Observed and Predicted Count

              Maximum       At     Mean
    Model    Difference   Value   |Diff|
    ---------------------------------------------
    NBRM        0.011        1     0.004
    ZINB        0.018        1     0.005

This confirms what we observed in the graph: both models performed worst when predicting a count of 1. Between these two, we see that the negative binomial did better at this prediction and, overall, had a lower mean absolute difference between predicted and observed values. At this point, the negative binomial model is looking more appropriate than the zero-inflated negative binomial model. Next, we have one table for each of the models containing count-by-count information.

    NBRM: Predicted and actual probabilities
    Count   Actual  Predicted  |Diff|   Pearson
    ------------------------------------------------
       0     0.196      0.201   0.004     0.031
       1     0.146      0.135   0.011     0.278
       2     0.098      0.104   0.006     0.093
       3     0.082      0.083   0.001     0.004
       4     0.070      0.068   0.001     0.008
       5     0.063      0.057   0.006     0.230
       6     0.047      0.048   0.000     0.001
       7     0.038      0.040   0.002     0.046
       8     0.028      0.034   0.006     0.318
       9     0.035      0.029   0.005     0.319
    ------------------------------------------------
     Sum     0.804      0.799   0.044     1.327

    ZINB: Predicted and actual probabilities
    Count   Actual  Predicted  |Diff|   Pearson
    ------------------------------------------------
       0     0.196      0.207   0.011     0.176
       1     0.146      0.127   0.018     0.850
       2     0.098      0.101   0.003     0.022
       3     0.082      0.082   0.000     0.000
       4     0.070      0.068   0.001     0.007
       5     0.063      0.057   0.006     0.194
       6     0.047      0.048   0.001     0.006
       7     0.038      0.041   0.003     0.077
       8     0.028      0.035   0.007     0.393
       9     0.035      0.030   0.005     0.240
    ------------------------------------------------
     Sum     0.804      0.798   0.055     1.965

In these two tables, we are able to see, for counts 0-9, the actual
proportion of data records with the given count and the predicted proportion
from each model. The absolute
difference is included, as is the given count's contribution to a Pearson
chi-square statistic comparing the actual distribution of the data to the
distribution proposed by the model. For a given row, the Pearson statistic can
be calculated as N(|Diff|²)/Predicted, where N is the number of
observations in the dataset. Looking at the sum of the Pearson column
gives us a sense of how close the predicted proportions are to the actual
proportions. By this comparison, the negative binomial again appears better
than the zero-inflated negative binomial.
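To make the Pearson column concrete, here is the arithmetic for one row, sketched in Python (the language choice is incidental; the formula is the point). The row values are taken from the NBRM table for a count of 1:

```python
# Pearson contribution for one row of the NBRM table:
# count = 1, Actual = 0.146, Predicted = 0.135, N = 316.
n = 316
actual, predicted = 0.146, 0.135

diff = abs(actual - predicted)       # |Diff| = 0.011
pearson = n * diff ** 2 / predicted  # N * |Diff|^2 / Predicted

print(round(pearson, 3))  # ≈ 0.283, close to the 0.278 in the table
```

The small discrepancy arises because the table reports Actual, Predicted, and |Diff| rounded to three decimal places.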

## Summary comparisons

    Tests and Fit Statistics
    -------------------------------------------------------------------------
    NBRM            BIC=  -28.290  AIC=    5.607  Prefer  Over  Evidence
    -------------------------------------------------------------------------
     vs ZINB        BIC=  -14.507  dif=  -13.783   NBRM   ZINB  Very strong
                    AIC=    5.615  dif=   -0.008   NBRM   ZINB
                   Vuong=   0.858  prob=   0.195   ZINB   NBRM  p=0.195
    -------------------------------------------------------------------------

In this table, the tested models are compared head-to-head using the tests appropriate to each comparison. Each line can be boiled down to its last three columns, which indicate which model is preferred by the given comparison and the strength of the evidence supporting this preference. When we compare our two models using the BIC and AIC, the negative binomial is preferred over the zero-inflated negative binomial. The Vuong test prefers the zero-inflated negative binomial over the negative binomial, but not at a statistically significant level. Thus, these fit statistics support what we have seen in the model residuals.
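The ll, bic, and aic rows of the statistics block are tied together by the standard definitions AIC = −2·ll + 2k and BIC = −2·ll + k·ln(N), where k is the number of estimated parameters. As a quick check in Python (a sketch; the parameter counts k = 5 and k = 8 are our inference from the models, not something countfit reports):

```python
from math import log

def aic(ll, k):
    """Akaike information criterion: -2*ll + 2*k."""
    return -2 * ll + 2 * k

def bic(ll, k, n):
    """Bayesian information criterion: -2*ll + k*log(n)."""
    return -2 * ll + k * log(n)

n = 316  # observations

# NBRM: 4 count-model coefficients (3 predictors + constant) plus lnalpha -> k = 5
print(round(aic(-880.873, 5), 3), round(bic(-880.873, 5, n), 3))
# ZINB: those 5 plus 3 inflation parameters (biling, langnce, constant) -> k = 8
print(round(aic(-879.131, 8), 3), round(bic(-879.131, 8, n), 3))
```

These reproduce the aic and bic rows of the statistics block (up to rounding of the reported log-likelihoods), and they show why BIC penalizes the ZINB's extra three parameters much more heavily than AIC does.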

## Full model output

There may be times when you wish to see the full model output for each of the
tested models. This can be especially helpful in choosing between two models
because it allows you to see the significance levels of the parameters, which
are not included in the default **countfit** output. For instance, in this
example, we might be interested in the significance levels of our
zero-inflation predictors. To do
this, you can add the **noisily** option to the **countfit** command:

    countfit daysabs mathnce langnce female, inf(biling langnce) nbreg zinb noisily

This will print the full output from each of the tested models, followed by the
summary output provided without the **noisily** option.

## Replace option

When you run **countfit**, variables are generated
and added to your dataset. If you run the same **countfit** command a
second time, you will encounter error messages telling you that those variables
are already defined. Thus, if you are running **countfit** multiple times,
you can add the **replace** option to your command to replace the variables
generated by prior runs with the variables generated by the current run.

## Final thoughts

**Countfit** offers
several convenient ways to compare models that are not otherwise easy to
compare: side-by-side fit statistics and parameter estimates, a graphical
representation of the model residuals, and a table of the residuals by count.
All of these pieces of information are worth taking
into consideration when choosing a model. For your purposes, some of these
measures may be more relevant than others. Think carefully about what you
are aiming to do with the model and which measures should be prioritized over
others. While **countfit** is very convenient and provides a great deal of
information, sorting through the information it provides must be done carefully.