Regression Models with Count Data

About This Presentation

It is a broad survey of count regression models
It is designed to demonstrate the range of analyses available for count regression models
It is not an in-depth statistical presentation
It is not a how-to manual that will train you in count data analysisWhy Use Count Regression Models

Count data is common in many disciplines
Count models can be used for rate data in many instances by using exposure
Count data often analyzed incorrectly with OLS regression

Regression Models with Count Data Outline
Poisson Regression
Negative Binomial Regression
Zero-Inflated Count Models
- Zero-inflated Poisson
- Zero-inflated Negative Binomial
Zero-Truncated Count Models
- Zero-truncated Poisson
- Zero-truncated Negative Binomial
Hurdle Models
Random-effects Count Models

Poisson Distribution

The Poisson probability distribution.
λ is the mean or expected value of a Poisson distribution
λ is also the variance of a Poisson distribution
Poisson is a one parameter λ (lambda)

Likelihood function for the Poisson model.

Some History of the Poisson Distribution

The Poisson distribution was first published by Siméon-Denis Poisson in 1838.

There is a famous example from a 1898 book by Ladislaus von Bortkiewicz showing that the number of soldiers killed by mule-kicks each year in the Prussian cavalry followed a Poisson distribution. The title of the book was The Law of Small Numbers.

Poisson Examples

As a comparison, here is a normal distribution with the same mean and variance as the Poisson distribution above.

Negative Binomial Distribution One formulation of the negative binomial distribution can be used to model count data with over-dispersion.
The negative binomial distribution has two parameters: λ and α
λ is the mean or expected value of the distribution
α is the over dispersion parameter
When α = 0 the negative binomial distribution is the same as a Poisson distribution

Likelihood function for the negative binomial model.

Negative Binomial Examples

Maximum Likelihood Estimation

Count models are estimated using maximum likelihood

Log likelihood:
Computationally, the log of the likelihood function is easier to work with.
Analysis proceeds iteratively until the log likelihood converges.
Larger (in the closer to zero sense) log likelihoods are better.BIC:
Schwartz’ Bayesian information criterion = -2*ln(L)+k*ln(n).
Given any two estimated models, the model with the lower value of BIC is the one to be preferred.
Can be used to compare different models, even models that are non-nested.
Interpretation of absolute difference: 0-2 Weak, 2-6 Positive, 6-10 Strong, and >10 Very Strong.Sample size:
The small sample behavior of ML estimators for count models is largely unknown.
It is risky to use ML with samples smaller than 100.

Samples over 500 seem adequate.

Soccer Goals Example

Number of goals scored for each game in a season (mean= 2.16 variance= 3.00)

Image goalshist

Number goals displayed as a connected line graph

Image goals1

Observed number of goals overlayed with a Poisson distribution

Image goals2

Observed number of goals overlayed with a negative binomial distribution

Image goals3

Exposure

Count models need some sort of mechanism to deal with the fact that counts can be made over different observation periods. For example, the number of accidents are recorded for 50 different intersections. However, the number of vehicles that pass through the intersections can vary greatly. Fifteen accidents for 30,000 vehicles is very different from 15 accidents for 1,500 vehicles. Count models account for these differences by including the log of the exposure variable in model with coefficient constrained to be one.

The use of exposure is superior in many instances to analyzing rates as response variables because it makes use of the correct probability distributions. It should be noted that exposure is used to adjust counts on the response variable and that it is possible to various kinds of rates, indexes or per capita measures as predictors.

Deaths

The response variable is the number of deaths recorded at each of five different age-group and two smoker categories. The difference in the number of patient years will be accounted for with an exposure variable pyears. Below, note that rows 1 and 10 have almost identical numbers of deaths but have very different values for patient years.

   agecat     smokes     deaths     pyears
        1          1         32     52,407
        2          1        104     43,248
        3          1        206     28,612
        4          1        186     12,663
        5          1        102      5,317
        1          0          2     18,790
        2          0         12     10,673
        3          0         28      5,710
        4          0         28      2,585
        5          0         31      1,462
  
    variable |         N      mean  variance       min       max
-------------+--------------------------------------------------
      deaths |        10      73.1  5390.767         2       206
      pyears |        10   18146.7  3.16e+08      1462     52407
----------------------------------------------------------------

The predictor variables are four age-group dummy variables and a dummy variable to indicate smokers.

These data can be analyzed with either a Poisson regression model or a negative binomial regression model.

     ll    df    BIC     model
 -33.60015  6   81.0158  poisson 
 -33.60014  7   83.3184  negative binomial

There is not much difference between the two models based on the log-likelihood and the BIC but the Poisson model has a slightly better BIC.

Poisson regression                                Number of obs   =         10
                                                  LR chi2(5)      =     922.93
                                                  Prob > chi2     =     0.0000
Log likelihood = -33.600153                       Pseudo R2       =     0.9321

------------------------------------------------------------------------------
      deaths |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokes |   .3545356   .1073741     3.30   0.001     .1440862     .564985
          a2 |   1.484007   .1951034     7.61   0.000     1.101611    1.866403
          a3 |   2.627505   .1837273    14.30   0.000     2.267406    2.987604
          a4 |   3.350493   .1847992    18.13   0.000     2.988293    3.712693
          a5 |   3.700096   .1922195    19.25   0.000     3.323353     4.07684
       _cons |  -7.919326   .1917618   -41.30   0.000    -8.295172   -7.543479
      pyears | (exposure)
------------------------------------------------------------------------------

Zero-inflated Models

Zero-inflated models attempt to account for excess zeros, i.e., there is thought to be two kinds of zeros, “true zeros” and excess zeros. Zero-inflated models estimate two equations, one for the count model and one for the excess zero’s.

Fishing Example

Somehow, it seems very appropriate to discuss fish and Poisson in the same example.

Response variable: count — the number of fish caught by visitors to a state park

Predictors: child — number of children in group camper — camping one or more nights during stay (0/1) persons — number of people in group

Child is also used as the predictor of the inflated-zeros

In our example “true zeros” are obtained from people who fish but do not catch anything. The excess zeros come from those who do not fish at all. Separate equations are used to predict each kind of zero.

Image fishhist

We will look at two count models and two zero-inflated models and an OLS regression for comparison.

    ll    df     BIC    model
 -749.3503 4  1520.754  ols
 -645.2568 4  1312.567  poisson 
 -391.0271 5   809.621  negative binomial
 -561.5176 6  1156.116  zero-inflated poisson
 -384.8586 7   808.311  zero-inflated negative binomial

The results for the zero-inflated negative binomial are given below. Note this model is only marginally better than the ordinary negative binomial model and the Vuong test is not significant but the data were clearly generated by a process that leads to inflated zeros.

Zero-inflated negative binomial regression        Number of obs   =        248
                                                  Nonzero obs     =        106
                                                  Zero obs        =        142

Inflation model = logit                           LR chi2(3)      =      69.88
Log likelihood  = -384.8586                       Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
count        |
       child |  -1.017093   .2734393    -3.72   0.000    -1.553025   -.4811623
      camper |   .4953317   .2320177     2.13   0.033     .0405854     .950078
     persons |   .9412857   .1142745     8.24   0.000     .7173118     1.16526
       _cons |  -1.440209   .3237543    -4.45   0.000    -2.074756   -.8056624
-------------+----------------------------------------------------------------
inflate      |
       child |   2.857134   .7912044     3.61   0.000     1.306402    4.407866
       _cons |  -4.155515   1.358571    -3.06   0.002    -6.818266   -1.492765
-------------+----------------------------------------------------------------
    /lnalpha |   .5080559   .1945664     2.61   0.009     .1267128     .889399
-------------+----------------------------------------------------------------
       alpha |   1.662057   .3233804                      1.135091    2.433667
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0: chibar2(01) =   353.32 Pr>=chibar2 =  0.0000
Vuong test of zinb vs. standard negative binomial: z =     1.29  Pr>z = 0.0981

Here are the results from the several of the other models that we can look at the differences in the output.

OLS regression

      Source |       SS       df       MS              Number of obs =     248
-------------+------------------------------           F(  3,   244) =   31.67
       Model |  2381.37932     3  793.793107           Prob > F      =  0.0000
    Residual |  6116.21745   244   25.066465           R-squared     =  0.2802
-------------+------------------------------           Adj R-squared =  0.2714
       Total |  8497.59677   247  34.4032258           Root MSE      =  5.0066

------------------------------------------------------------------------------
       count |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       child |   -3.66316   .4517096    -8.11   0.000    -4.552908   -2.773412
      camper |   1.803715   .6462559     2.79   0.006     .5307624    3.076667
     persons |   2.958617   .3471834     8.52   0.000     2.274758    3.642476
       _cons |  -3.513368    .911917    -3.85   0.000    -5.309602   -1.717134
------------------------------------------------------------------------------

Poisson regression

Poisson regression                                Number of obs   =        248
                                                  LR chi2(3)      =     825.76
                                                  Prob > chi2     =     0.0000
Log likelihood = -645.25676                       Pseudo R2       =     0.3902

------------------------------------------------------------------------------
       count |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       child |  -1.435991   .0817147   -17.57   0.000    -1.596149   -1.275834
      camper |   .6581591   .0928267     7.09   0.000     .4762221     .840096
     persons |   .8921543   .0412052    21.65   0.000     .8113936     .972915
       _cons |  -1.406989   .1512168    -9.30   0.000    -1.703369    -1.11061
------------------------------------------------------------------------------

Negative binomial regression

Negative binomial regression                      Number of obs   =        248
                                                  LR chi2(3)      =      94.97
Dispersion     = mean                             Prob > chi2     =     0.0000
Log likelihood = -391.02712                       Pseudo R2       =     0.1083

------------------------------------------------------------------------------
       count |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       child |  -1.663023   .1952636    -8.52   0.000    -2.045732   -1.280313
      camper |   .5581623   .2323451     2.40   0.016     .1027743     1.01355
     persons |   .9721464   .1223069     7.95   0.000     .7324293    1.211864
       _cons |  -1.462463    .334303    -4.37   0.000    -2.117685   -.8072411
-------------+----------------------------------------------------------------
    /lnalpha |   .7456885   .1585742                      .4348888    1.056488
-------------+----------------------------------------------------------------
       alpha |   2.107892   .3342573                      1.544791    2.876253
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0:  chibar2(01) =  508.46 Prob>=chibar2 = 0.000

Image fishnbvargr

Days Absent From School

Days absent from school during one school year (mean = 5.81 variance = 55.49)

Predictors of the count variable are female, school and reading.

Several variables were tried as predictors of excess zeros for zero-inflated models. The large number of zeros in the data might seem to suggest some type of zero-inflated model. However, the zero-inflated models were not used for both empirical reasons and because there did not seem to be a reasonable way explain what kind of a process would generate excess zeros in days absent.

Image lahist

Histogram of days absent by school.

Image lahist2

We will look at five different models, two count models, two zero-inflated count model and an ols regression thrown in for good measure.

    ll    df     BIC    model
-1060.365  4  2143.75   ols regression  
-1435.846  3  2894.72   poisson  
 -867.240  4  1763.26   negative binomial 
-1278.182  5  2590.90   zero-inflated poisson       
 -867.200  6  1774.69   zero-inflated negative binomial

Both the negative binomial and the zero-inflated negative binomial are very close in log likelihoods and BIC’s, there’s a slight edge to the straight negative binomial due to having fewer degrees of freedom.

Negative binomial regression                      Number of obs   =        316
                                                  LR chi2(2)      =      44.02
Dispersion     = mean                             Prob > chi2     =     0.0000
Log likelihood = -869.23382                       Pseudo R2       =     0.0247

------------------------------------------------------------------------------
     daysabs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   .3366034    .132156     2.55   0.011     .0775823    .5956244
      school |  -.8628592   .1323017    -6.52   0.000    -1.122166   -.6035526
       _cons |   2.776001   .2142417    12.96   0.000     2.356094    3.195907
-------------+----------------------------------------------------------------
    /lnalpha |   .1568369   .0986426                      -.036499    .3501728
-------------+----------------------------------------------------------------
       alpha |   1.169805   .1153925                      .9641591    1.419313
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0:  chibar2(01) = 1152.21 Prob>=chibar2 = 0.000

Image lanbvargr

Zero-truncated Models

There are a number of interesting situations in which the count variable cannot take on the value zero.

Take for example the response variable length of hospital stay (mean = 5.37 variance = 49.24). The predictor variables are age, hmo and died (died before discharge). Note that there are no zero counts in the data.

Image ztphist

Image ztpnbvargr

Note that both Poisson and negative binomial predict a probability for zero length of hospital stay. The negative binomial provides a closer fit to the observed than does the Poisson.

    ll     df      BIC   model
-6908.799   4  13846.83  zero-truncated poisson  
-4755.28    5   9547.10  zero-truncated negative binomial

The results for the zero-truncated negative binomial are given below.

Zero-truncated negative binomial regression       Number of obs   =       1493
                                                  LR chi2(3)      =      31.14
Dispersion     = mean                             Prob > chi2     =     0.0000
Log likelihood = -4755.2796                       Pseudo R2       =     0.0033

------------------------------------------------------------------------------
        stay |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0156929    .013107    -1.20   0.231    -.0413822    .0099964
         hmo |  -.1470576   .0592161    -2.48   0.013     -.263119   -.0309962
        died |  -.2177714   .0461605    -4.72   0.000    -.3082442   -.1272985
       _cons |   2.408328    .071982    33.46   0.000     2.267245     2.54941
-------------+----------------------------------------------------------------
    /lnalpha |  -.5686389   .0551506                     -.6767321   -.4605457
-------------+----------------------------------------------------------------
       alpha |   .5662957   .0312316                      .5082753    .6309393
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0:  chibar2(01) = 4307.04 Prob>=chibar2 = 0.000

Hurdle Models

A hurdle model is a modified count model in which there are two processes, one generating the zeros and one generating the positive values. The two models are not constrained to be the same. The concept underlying the hurdle model is that a binomial probability model governs the binary outcome of whether a count variable has a zero or a positive value. If the value is positive, the “hurdle is crossed,” and the conditional distribution of the positive values is governed by a zero-truncated count model.

We will illustrate this model using artificially generated data with response variable y and predictors x1 and x2.

         y |      Freq. 
------------+---------------
          0 |        315       
          1 |        123       
          2 |        138       
          3 |        117       
          4 |         88       
          5 |         60       
          6 |         35       
          7 |         23       
          8 |         24       
          9 |         19       
         10 |         16       
         11 |         11       
         12 |         10       
         13 |          5       
         14 |          1       
         15 |          4       
         16 |          1       
         19 |          2       
         21 |          2       
         22 |          2       
         26 |          1       
         32 |          2       
         34 |          1

Image hurdle

Since the count part of the model was generated using a random Poisson generator, we will look at three different Poisson based models.

    ll     df        BIC   model
-2133.079   3     4286.881 poisson
-1748.863   6     3539.173 zero-inflated poisson
-1730.438   6     3502.322 poisson-logit hurdle

The Poisson-logit hurdle model is clearly the best choice here. The results for this model are given below

Poisson-Logit Hurdle Regression                   Number of obs   =       1000
                                                  Wald chi2(2)    =     187.13
Log likelihood = -1730.4377                       Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
logit        |
          x1 |   1.494141   .1092238    13.68   0.000     1.280067    1.708216
          x2 |   .0637035    .086461     0.74   0.461     -.105757     .233164
       _cons |   .9951839   .0855003    11.64   0.000     .8276064    1.162761
-------------+----------------------------------------------------------------
poisson      |
          x1 |   .1732047   .0226329     7.65   0.000      .128845    .2175644
          x2 |   .7216143   .0198055    36.44   0.000     .6827963    .7604324
       _cons |   1.009172   .0284237    35.50   0.000     .9534629    1.064882
------------------------------------------------------------------------------

Days Absent Clustered in Schools

We will take another look at days absent (mean = 5.37 variance = 49.24), this time obtained from 12 schools with about 50 students per school. The predictors are gender and reading. Any count model will need to account for the lack of independence within schools.

Image abshist

We have a number of options for analyzing these data. We can use standard poisson with cluster robust standard errors to account for within school association. Or we can use one of the random-effects models for poisson or negative binomial. These are random intercept only models.

    ll    df     BIC    model
-2938.653  3  5896.653  poisson with cluster robust standard errors
-1718.321  4  3462.437  negative binomial with cluster robust standard errors
-2718.767  4  5463.33   random-effects poisson w/ gaussian random effects
-2718.864  4  5463.523  random-effects poisson w/ gamma random effects
-1696.797  5  3425.838  random-effects negative binomial w/ beta random effects

The random-effects negative binomial has the best log likelihood and BIC. The results of the model are given below.

Random-effects negative binomial regression     Number of obs      =       632
Group variable (i): sid                         Number of groups   =        12

Random effects u_i ~ Beta                       Obs per group: min =        51
                                                               avg =      52.7
                                                               max =        53

                                                Wald chi2(2)       =     36.14
Log likelihood  = -1696.7966                    Prob > chi2        =    0.0000

------------------------------------------------------------------------------
     daysabs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   .3587537   .0788313     4.55   0.000     .2042472    .5132602
     reading |  -.0102389   .0022909    -4.47   0.000    -.0147289   -.0057489
       _cons |   .1944149    .131354     1.48   0.139    -.0630343     .451864
-------------+----------------------------------------------------------------
       /ln_r |   2.260673   .4859689                      1.308192    3.213155
       /ln_s |   3.922436   .5105199                      2.921835    4.923037
-------------+----------------------------------------------------------------
           r |   9.589543    4.66022                      3.699478    24.85738
           s |   50.52336   25.79318                      18.57535    137.4193
------------------------------------------------------------------------------
Likelihood-ratio test vs. pooled: chibar2(01) =    32.25 Prob>=chibar2 = 0.000

Image absnbvargr

Summary of Count Regression Models

Poisson regression model
Negative binomial regression model
Zero-inflated poisson model
Zero-inflated negative binomial model
Zero-truncated poisson model
Zero-truncated negative binomial model
Hurdle models
Random-effects count models

Some Stata Specific Information

Here are the Stata commands for selected models shown above:

poisson regression with exposure, deaths
(also see DAE page):

. poisson deaths smokes a2 a3 a4 a5, exposure(pyears)

zero-inflated negative binomial, fish
(also see DAE page):

. zinb count child camper persons, inflate(child) vuong zip

negative binomial regression, days absent
(also see DAE page):
 
. nbreg daysabs female school

Zero-truncated negative binomial regression, length of hosptial stay
(also see DAE page):

. ztnb stay age hmo died

Poisson-Logit Hurdle Regression, artifical data

. hplogit y x1 x2

negative binomial regression with cluster robust standard errors, days absent:

. nbreg daysabs female reading, cluster(sid)

random effects negative binomial, days absent:

. xtnbreg daysabs female reading, i(sid)

References Here are some places to read more about regression models with count data.

Agresti, A. (2001) Categorical Data Analysis (2nd ed). New York: Wiley.

Agresti, A. (1996) An Introduction to Categorical Data Analysis. New York: Wiley.

Long, S. J. (1997) Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: SAGE Publications, Inc.

Long, S. J. & Freese, J. (2001) Regression Models for Categorical Dependent Variables using Stata. College Station, TX: Stata Press.