**Version info:** Code for this page was tested in Mplus version 6.12.

Zero-truncated negative binomial regression is used to model count data for which the value zero cannot occur and for which there is evidence of overdispersion.

**Please Note:** The purpose of this page is to show how to use various data analysis commands.
It does not cover all aspects of the research process that researchers are expected to carry out. In
particular, it does not cover data cleaning and verification, verification of assumptions, model
diagnostics and potential follow-up analyses.

## Examples of zero-truncated negative binomial

Example 1.

A study of the length of hospital stay, in days, as a function of age, kind of health insurance and whether or not the patient died while in the hospital. Length of hospital stay is recorded as at least one day.

Example 2.

A study of the number of journal articles published by tenured faculty as a function of discipline (fine arts, science, social science, humanities, medical, etc.). To get tenure, faculty must publish, i.e., there are no tenured faculty with zero publications.

Example 3.

A study by the county traffic court on the number of tickets received by teenagers as predicted by school performance, amount of driver training and gender. Only individuals who have received at least one citation are in the traffic court files.

## Description of the data

Let’s pursue Example 1 from above.

We have a **hypothetical** data file available here with 1,493 observations.
The variable describing length of hospital visit is **stay**. The variable **age** gives the age group from 1 to 9, which will be treated as
interval in this example.
The variables **hmo** and **died** are binary indicator variables for HMO
insured patients and patients who died while in hospital, respectively.

Let’s look at the data.

```
Data:
  File is C:ztnb.dat;
Variable:
  Names are stay age hmo died;
  Missing are all (-9999);
Analysis:
  Type = basic;
Plot:
  Type = plot1;
```

```
ESTIMATED SAMPLE STATISTICS

     Means
           STAY          AGE           HMO           DIED
           ________      ________      ________      ________
 1          9.729         5.234         0.160         0.343

     Covariances
           STAY          AGE           HMO           DIED
           ________      ________      ________      ________
 STAY      66.100
 AGE       -0.615         2.785
 HMO       -0.169        -0.006         0.134
 DIED      -0.447         0.121         0.000         0.225

     Correlations
           STAY          AGE           HMO           DIED
           ________      ________      ________      ________
 STAY       1.000
 AGE       -0.045         1.000
 HMO       -0.057        -0.010         1.000
 DIED      -0.116         0.152         0.000         1.000
```
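As a rough screen for overdispersion, we can compare the sample variance of **stay** with its sample mean, both taken from the output above. The short Python sketch below is only an illustration, not part of the Mplus analysis. It ignores the predictors and the fact that zero truncation itself shifts the moments, so the ratio is a hint rather than a formal test.

```python
# Sample moments of stay, copied from the ESTIMATED SAMPLE STATISTICS output.
mean_stay = 9.729
var_stay = 66.100

# Under a Poisson model the variance equals the mean, so a ratio
# far above 1 points toward overdispersion.
dispersion_ratio = var_stay / mean_stay
print(f"variance / mean = {dispersion_ratio:.2f}")
```

The variance is several times the mean, which is consistent with choosing a negative binomial rather than a Poisson distribution for these data.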

## Analysis methods you might consider

Before we show how you can analyze these data with a zero-truncated negative binomial analysis, let’s consider some other methods that you might use.

- Zero-truncated Negative Binomial Regression – The focus of this web page.
- Zero-truncated Poisson Regression – Useful if there is no overdispersion in the zero-truncated variable. Currently, zero-truncated Poisson models are not possible in Mplus.
- Negative Binomial Regression – Ordinary negative binomial regression will have difficulty with zero-truncated data. It will try to predict zero counts even though there are no zero values.
- Poisson Regression – The same concerns as for negative binomial regression: ordinary Poisson regression will have difficulty with zero-truncated data. It will try to predict zero counts even though there are no zero values.
- OLS Regression – You could try to analyze these data using OLS regression. However, count data are highly non-normal and are not well estimated by OLS regression.
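To see why the ordinary count models in the list above struggle with truncated data, it helps to look at how much probability an ordinary negative binomial distribution places on zero, and how the zero-truncated version renormalizes the remaining probabilities. The Python sketch below assumes the common NB2 parameterization (variance = mu + alpha * mu^2). The values of `mu` and `alpha` are hypothetical and are not estimates from this page.

```python
import math

def nb_pmf(y, mu, alpha):
    """NB2 probability of count y, with mean mu and variance mu + alpha * mu**2."""
    r = 1.0 / alpha
    log_p = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
             + r * math.log(r / (r + mu))
             + y * math.log(mu / (r + mu)))
    return math.exp(log_p)

def ztnb_pmf(y, mu, alpha):
    """Zero-truncated NB2: condition on y >= 1 by dividing by 1 - P(Y = 0)."""
    if y < 1:
        return 0.0
    return nb_pmf(y, mu, alpha) / (1.0 - nb_pmf(0, mu, alpha))

mu, alpha = 5.0, 0.5  # hypothetical values, for illustration only

# Probability mass that an ordinary negative binomial puts on zero.
p_zero = nb_pmf(0, mu, alpha)

# The truncated probabilities sum to 1 over y = 1, 2, ...
total = sum(ztnb_pmf(y, mu, alpha) for y in range(1, 2000))

# Mean of the truncated distribution, which exceeds mu.
truncated_mean = mu / (1.0 - p_zero)

print(f"P(Y = 0) under ordinary NB: {p_zero:.4f}")
print(f"sum of truncated probabilities: {total:.6f}")
print(f"truncated mean: {truncated_mean:.4f} (untruncated mean {mu})")
```

An ordinary negative binomial model fit to data with no zeros has to account for the missing zero mass somehow, which distorts its estimates. The truncated model removes that mass explicitly.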

## Zero-truncated negative binomial regression

In the syntax below, we have indicated that **stay** is a count
variable by using the **count** statement. The **(nbt)** option
indicates two things: that we are modeling our count variable with a
negative binomial distribution, and that we are specifying a zero-truncated model.
Without the **t** in **(nbt)** we would be estimating a negative binomial model
without zero-truncation. We do not need a **usevariables** statement
because we are using all of the variables in the data set in the current model.
We have omitted the **missing** statement because we have no missing data in
this data set. The default estimation method is MLR: maximum likelihood
parameter estimates with standard errors and a chi-square test statistic that
are robust to non-normality and non-independence of observations. The MLR standard errors
are computed using a sandwich estimator. This is what we generally call robust
standard errors. Because the syntax below has no **analysis** statement, it uses
this default and reports robust standard errors. To get the "regular" standard
errors, we specify **estimator = ml** on the **analysis** statement, as we do
in the next example.
Our regression equation is specified in the **model** statement: we are predicting
length of **stay** using **age**, **hmo** status and whether the
patient **died**.

```
Data:
  File is C:ztnb.dat ;
Variable:
  Names = stay age hmo died;
  Count = stay(nbt);
Model:
  stay on age hmo died;
```

```
MODEL FIT INFORMATION

Number of Free Parameters                        5

Loglikelihood

          H0 Value                       -4755.280
          H0 Scaling Correction Factor       1.156
            for MLR

Information Criteria

          Akaike (AIC)                    9520.559
          Bayesian (BIC)                  9547.102
          Sample-Size Adjusted BIC        9531.218
            (n* = (n + 2) / 24)

MODEL RESULTS

                                                    Two-Tailed
                    Estimate       S.E.  Est./S.E.    P-Value

 STAY       ON
    AGE               -0.016      0.013     -1.194      0.233
    HMO               -0.147      0.057     -2.571      0.010
    DIED              -0.218      0.053     -4.142      0.000

 Intercepts
    STAY               2.408      0.075     32.039      0.000

 Dispersion
    STAY               0.566      0.037     15.316      0.000
```

In the MODEL FIT INFORMATION portion of the output, you will find the log
likelihood for the final model as well as a number of fit statistics. In the
MODEL RESULTS section of the output you will find the negative binomial
regression coefficients (estimates) for each of the variables, their standard errors,
the ratio of each estimate to its standard error, and a two-tailed p-value. The ratio can be used as a
z test, where absolute values greater than about 2 (more precisely, 1.96) are considered
statistically significant at the 0.05 level. We see that **hmo** and **died**, but not **age**, are
significant predictors of **stay**. Thus, for example, for patients who
use HMO services compared to those who do not, the expected log count of days stayed is
about 0.147 lower, holding the other predictors constant.
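The z tests described above, and a more readable version of the coefficients, can be reproduced by hand. The Python sketch below is an illustration only: it copies the estimates and standard errors from the MODEL RESULTS table, recomputes the ratio and a two-tailed p-value from the standard normal distribution, and exponentiates each coefficient. Because the inputs are rounded to three decimals, the recomputed ratios differ slightly from the Est./S.E. column. The exponentiated coefficient is read here as the multiplicative change in the expected count of the underlying (untruncated) negative binomial distribution, which is an interpretive assumption added for this example.

```python
import math

# Estimates and standard errors copied from the MODEL RESULTS table above.
results = {
    "age":  (-0.016, 0.013),
    "hmo":  (-0.147, 0.057),
    "died": (-0.218, 0.053),
}

def two_tailed_p(z):
    """Two-tailed p-value under a standard normal reference distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

summary = {}
for name, (est, se) in results.items():
    z = est / se
    summary[name] = {
        "z": z,
        "p": two_tailed_p(z),
        "ratio": math.exp(est),  # multiplicative effect on the expected count
    }
    print(f"{name:5s} z = {z:7.3f}  p = {summary[name]['p']:.3f}  "
          f"exp(b) = {summary[name]['ratio']:.3f}")
```

For **hmo**, the exponentiated coefficient is about 0.86, so under this reading HMO patients have an expected count roughly 14 percent lower than non-HMO patients, with the other predictors held constant.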

Now let’s rerun the model with the **analysis: estimator = ml** statement in order to obtain "regular" (non-robust) standard errors.

```
Data:
  File is C:ztnb.dat ;
Variable:
  Names = stay age hmo died;
  Missing = all (-9999) ;
  Count = stay(nbt);
Model:
  stay on age hmo died;
Analysis:
  estimator = ml;
```

```
MODEL RESULTS

                                                    Two-Tailed
                    Estimate       S.E.  Est./S.E.    P-Value

 STAY       ON
    AGE               -0.016      0.013     -1.197      0.231
    HMO               -0.147      0.059     -2.483      0.013
    DIED              -0.218      0.046     -4.718      0.000

 Intercepts
    STAY               2.408      0.072     33.457      0.000

 Dispersion
    STAY               0.566      0.031     18.132      0.000
```

Robust standard errors tend to be larger than "regular" standard errors,
though not always, as we see for the variable **hmo**, whose robust standard error (0.057)
is slightly smaller than the regular one (0.059). The results
changed very little when using regular standard errors.
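The two sets of standard errors can be compared directly. The Python sketch below is an illustration only. It copies the S.E. columns from the two MODEL RESULTS tables, labelling them by the syntax that produced them: the run without an **analysis** statement (whose output reports a scaling correction factor for MLR) and the run with **estimator = ml**.

```python
# Standard errors copied from the two MODEL RESULTS tables above.
se_default = {"age": 0.013, "hmo": 0.057, "died": 0.053,
              "intercept": 0.075, "dispersion": 0.037}  # no analysis statement
se_ml = {"age": 0.013, "hmo": 0.059, "died": 0.046,
         "intercept": 0.072, "dispersion": 0.031}       # estimator = ml

# Ratio above 1: the default-run standard error is the larger of the two.
ratios = {name: se_default[name] / se_ml[name] for name in se_default}

for name, ratio in ratios.items():
    print(f"{name:10s} default / ml = {ratio:.3f}")
```

At three decimals the two runs agree for **age**, differ in one direction for **hmo**, and differ in the other direction for **died**, the intercept and the dispersion parameter. The estimates themselves are identical in both runs.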

## Things to consider

- Count data often use an exposure variable to indicate the number of times the event
could have happened. You can incorporate exposure into your model by using the
**exposure()** option.
- It is not recommended that zero-truncated negative binomial models be applied to small samples. What constitutes a small sample does not seem to be clearly defined in the literature.
- Pseudo-R-squared values differ from OLS R-squareds. Please see FAQ: What are pseudo R-squareds? for a discussion of this issue.
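To make the first point above concrete, the Python sketch below shows what an exposure adjustment does in a log-link count model in general: the log of the exposure is added to the linear predictor with its coefficient fixed at 1, so the expected count scales in proportion to exposure. The function name and the numbers are hypothetical and only illustrate the arithmetic. They do not describe how Mplus implements exposure.

```python
import math

def expected_count(linear_predictor, exposure):
    """Expected count when log(exposure) enters with a coefficient fixed at 1."""
    return math.exp(linear_predictor + math.log(exposure))

# Hypothetical linear predictor and two exposure levels.
xb = 0.5
count_short = expected_count(xb, exposure=10.0)
count_long = expected_count(xb, exposure=20.0)

print(f"exposure 10: {count_short:.3f}")
print(f"exposure 20: {count_long:.3f}")
```

Doubling the exposure doubles the expected count, so the regression coefficients describe rates per unit of exposure rather than raw counts.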

## References

- Long, J. Scott (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.
- Hilbe, J. M. (2007). Negative binomial regression. Cambridge, UK: Cambridge University Press.