Logistic Regression in Stata

OARC Statistical Methods and Data Analytics

Purpose

This workshop introduces the logistic regression model and discusses how to run the model and associated methods in Stata.

In this seminar, we will cover:

  • the binomial distribution
  • the logistic regression model
  • running a logistic regression model in Stata
  • interpretation of estimated parameters
  • using model predictions to understand the model
  • assessment of model fit and assumptions

The Binomial Distribution

Binary variables

Binary variables have 2 levels, typically 0 (false/failure) and 1 (true/success).

Below are 8 observations of binary variable \(y\):

\[y=(0,1,1,1,1,0,1,1)\]

The mean of a 0/1 variable is the proportion of ones:

\[\bar{y}=\frac{\sum_i y_i}{n}=.75\]

The mean is also an estimate of the probability of ones in the population, \(p=Pr(y=1)\):

\[\bar{y}=\hat{p}=.75\]

The \(\hat{}\) symbol denotes a sample estimate of a population parameter.
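As a quick check, this proportion can be reproduced in Stata by entering the 8 observations and summarizing them (a minimal sketch; the variable name y is just illustrative):

/* enter the 8 observations of y and compute the sample proportion */
clear
input y
0
1
1
1
1
0
1
1
end
summarize y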

Binomial distribution

With \(n\) observations of a binary variable, we can observe between 0 and \(n\) ones (successes).

Over repeated samples of size \(n\), the number of ones we observe in each sample is determined by the probability of success, \(p\).

For samples of size \(n\) with probability of success \(p\), the binomial distribution describes the probability of observing each number of successes, from 0 to \(n\).

The binomial distribution has 2 parameters:

  • \(n\), the sample size or number of trials (typically known so not estimated)
  • \(p\), the probability of success

A histogram of the number of successes (ones) in 1000 samples of size \(n=8\) from a population with \(p=.25\) approximates the binomial distribution for \(n=8\), \(p=.25\):

[Figure: Histogram of the number of successes in samples of n=8 with probability of success p=.25 (y-axis: proportion of 1000 samples)]
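The histogram above can be reproduced by simulation. Below is a minimal sketch (the seed and the variable name successes are arbitrary) that draws 1,000 binomial counts with \(n=8\) and \(p=.25\) using Stata's rbinomial() function:

/* simulate 1000 samples of size n=8 with p=.25 and plot the number of successes */
clear
set obs 1000
set seed 1234
generate successes = rbinomial(8, .25)
histogram successes, discrete fraction xtitle("Number of successes")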

When \(p\) is low, most samples contain more zeroes than ones.

When \(p\) is higher, say \(p=.75\), most samples contain more ones than zeroes.

[Figure: Histogram of the number of successes in samples of n=8 with probability of success p=.75 (y-axis: proportion of 1000 samples)]

Let \(y^*\) be the number of successes in a sample of size \(n\) of binary variable \(y\).

The exact probability of observing \(k\) successes in samples of size \(n\) where the probability of success is \(p\) is given by:

\[Pr(y^*=k|n,p) = {n\choose k}p^k(1-p)^{(n-k)}\]
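For example, the probability of observing \(k=6\) successes when \(n=8\) and \(p=.75\) can be computed directly from the formula or with Stata's binomialp() function (a quick sketch):

/* Pr(y* = 6 | n=8, p=.75): directly from the formula, then with binomialp() */
display comb(8,6)*(.75)^6*(.25)^2
display binomialp(8, 6, .75)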

Mean and variance of binomial distribution

For samples of binomially-distributed \(y\) of size \(n\) with probability of success \(p\), the mean (expected) number of successes is \(np\):

\[E(y|n,p) = np\]

For samples with \(n=8\) and probability \(p=.75\), the mean is \(6\):

\[E(y|n=8, p=.75) = (8)(.75) = 6\]

The variance of \(y\) is:

\[Var(y) = np(1-p)\]

Both mean and variance of \(y\) vary with \(p\), so the binomial distribution is heteroskedastic.

  • \(Var(y)\) is maximized when \(p=.5\): half zeroes and half ones.
  • \(Var(y)\) is minimized when \(p=0\) or \(p=1\), all zeroes or all ones, respectively.

[Figure: Variance of Y as a function of p (y-axis: Var(Y|n=8,p), x-axis: p)]
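A sketch of checking the variance formula and drawing the curve above in Stata (the axis titles are just labels):

/* Var(y | n=8, p=.75) = np(1-p) */
display 8*.75*.25
/* plot Var(y | n=8, p) as a function of p */
twoway function y = 8*x*(1-x), range(0 1) xtitle("p") ytitle("Var(y | n=8, p)")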

Representation of binary data

In most data sets, each row will be a single observation of a binary variable (y below):

y x
1 4
1 4
0 5
1 5
0 7

Above, each row is a single observation of a binomial variable y, so \(n=1\) and \(y\) will be either 0 or 1.

When \(n=1\), the mean of binomially-distributed \(y\) is \(p\) itself:

\[E(y|n=1,p) = np = (1)p = p\]

Regression models mean outcomes, so we are modeling \(p\) with \(n=1\).

Stata’s logit command, discussed extensively later, expects data with \(n=1\) per row.

Aggregated data

Regression assumes that observations with the same predictor values (x above) come from the same population with a common \(p=Pr(y=1)\), so we could aggregate the data like so:

n y x
2 2 4
2 1 5
1 0 7

Now \(y\) is the count of ones and \(n\) is the sample size.

In either style of data, \(n\) is known so only \(p\) will be modeled.

The 2 datasets above will produce the same logistic regression estimates, but aggregated data must be analyzed with Stata's blogit command rather than logit.
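Using the variable names from the two small tables above, the commands would look roughly like this (a sketch; the first line assumes the disaggregated table is in memory, the second the aggregated one):

/* one row per observation (n=1 per row) */
logit y x
/* aggregated rows: y = count of ones, n = number of trials in that row */
blogit y n x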

The binomial distribution with \(n=1\) is also known as the Bernoulli distribution.

Analyzing binomially distributed variables in linear regression

Regression models estimate the relationship between the mean outcome and a set of predictors.

As seen above, for binary outcomes the mean is \(p\) itself.

We could model the relationships between predictors and \(p\) in a linear regression, but some assumptions will be violated:

  • homoskedasticity is assumed, but binomial variables are heteroskedastic
  • normally-distributed errors are assumed, which cannot be true with binary outcomes
  • outcome is assumed unbounded, but probabilities are bounded between 0 and 1
    • transforming the outcome can extend its bounds, but the other problems persist

Logistic regression addresses each issue above.

The Logistic Regression Model

Odds

Regression models linear relationships between the mean outcome and the predictors, which can be problematic for bounded outcomes:

[Figure: Linear relationship between predictor X and Pr(y=1) results in out-of-bounds predictions (y-axis: model-predicted Pr(y=1), x-axis: hypothetical continuous predictor x)]

Transforming the outcome can extend its range.

In logistic regression, two transformations on the mean outcome (probability) change its bounds from \([0,1]\) to \((-\infty, \infty)\).

We first transform probabilities to odds:

\[odds(p) = \frac{p}{1-p}\]

Odds lie within \([0,\infty)\) and have a positive, monotonic relationship with probability \(p\).

[Figure: Odds as a function of p (y-axis: odds, x-axis: p)]

Odds are interpreted as the expected number of successes per failure.

For example, if \(p=.75\), then \(odds(p)=\frac{.75}{.25}=3\), which means we expect 3 successes for every failure.
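A quick check of this example in Stata:

/* odds when p = .75 */
display .75/(1-.75)
/* converting odds back to a probability: p = odds/(1+odds) */
display 3/(1+3)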

The next transformation extends the left bound to \(-\infty\).

Logits

If variable \(x\) is defined over \((0,\infty)\), then \(log(x)\) is defined over \((-\infty, \infty)\).

The log of the odds, or logit, is thus defined over \((-\infty, \infty)\).

\[logit(p) = log(odds(p)) = log(\frac{p}{1-p})\]

The logit has an S-shaped (sigmoidal) relationship with the probability:

[Figure: Logit as a function of p (y-axis: logit, x-axis: p)]
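For the running example (\(p=.75\)), the logit can be computed by hand or with Stata's built-in logit() function (a quick sketch):

/* log odds (logit) of p = .75 */
display ln(.75/(1-.75))
display logit(.75)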

The logit function is often referred to as a link function, which transforms the mean outcome to allow a linear relationship with the predictors.

The Logistic Regression Model

The logistic regression model asserts a linear relationship between the logit (log-odds) of the outcome variable and the predictors \(x_1\) through \(x_k\) for observation \(i\):

\[logit(p_i) = log(\frac{p_i}{1-p_i}) = \beta_0 + \beta_1x_{i1} + ... + \beta_kx_{ik}\]

The linear predictor, \(\beta_0 + \beta_1x_{i1}+...+\beta_kx_{ik}\), expresses the expected outcome in logits (log odds).

The inverse logit function transforms a logit back into a probability:

\[logit^{-1}(x) = \frac{exp(x)}{1+exp(x)}\]

\[logit^{-1}(logit(p_i)) = p_i\]

Thus, the inverse logit of the linear predictor expresses the expected outcome as a probability:

\[p_i = \frac{exp(\beta_0 + \beta_1x_{i1}+...+\beta_kx_{ik})}{1+exp(\beta_0 + \beta_1x_{i1}+...+\beta_kx_{ik})}\]

In the above expression, we see that \(p_i\) has a non-linear relationship with the predictors.

  • logit of the outcome has a linear relationship with the predictors
  • logit has an S-shaped relationship with probability
  • thus, probability has an S-shaped relationship with the (continuous) predictors

Because of this nonlinear relationship, the amount of change in \(p\) due to a one-unit increase in \(x\) depends on the value of \(x\).
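A small numeric illustration of this point, using Stata's invlogit() function with hypothetical coefficients \(\beta_0=-2\) and \(\beta_1=.5\) (these values are made up purely for illustration):

/* change in Pr(y=1) for a one-unit increase in x, starting at x=0 vs starting at x=6 */
display invlogit(-2 + .5*1) - invlogit(-2 + .5*0)
display invlogit(-2 + .5*7) - invlogit(-2 + .5*6)

The two displayed differences are not equal, even though \(x\) increases by one unit in both cases.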

Interpreting logistic regression coefficients

Logistic regression coefficients can be interpreted in at least 2 ways.

The first interpretation expresses additive changes in the logits, or log odds of the outcome.

Imagine a simple logistic regression model with a single predictor \(x\):

\[logit(p_i) = log(\frac{p_i}{1-p_i}) = \beta_0 + \beta_1x_i\]

Setting \(x=0\), the expected outcome is \(\beta_0\):

\[\begin{align} log(\frac{p_{x=0}}{1-p_{x=0}}) &= \beta_0 + \beta_1(0) \\ &= \beta_0 \end{align}\]

If we set \(x=1\), the expected outcome is \(\beta_0 + \beta_1\):

\[\begin{align} log(\frac{p_{x=1}}{1-p_{x=1}}) &= \beta_0 + \beta_1(1) \\ &= \beta_0 + \beta_1 \end{align}\]

Taking the difference between these expectations gives us an interpretation of \(\beta_1\):

\[\begin{align} log(\frac{p_{x=1}}{1-p_{x=1}}) - log(\frac{p_{x=0}}{1-p_{x=0}}) &= (\beta_0 + \beta_1) - \beta_0 \\ &= \beta_1 \end{align}\]

Thus, \(\beta_1\) is the expected change in the logit (log-odds) of the outcome for a one-unit increase in the predictor \(x\).

Interpreting coefficients as change in logits is simple because the relationships are linear, but most audiences will struggle with understanding magnitudes in logits.

Odds ratios

A more common way to interpret logistic regression coefficients is to exponentiate them to express odds ratios.

Recall that the exponential function is the inverse of the (natural) logarithm function:

\[exp(log(x)) = x\]

Also recall that the difference of two logs is the log of the ratio:

\[log(a) - log(b) = log(\frac{a}{b})\]

Logistic regression coefficients denote the change in the log of the odds of the outcome for a unit increase in the predictor, which we can express as the log of the ratio of the odds:

\[\begin{align} \beta_1 &= log(\frac{p_{x=1}}{1-p_{x=1}}) - log(\frac{p_{x=0}}{1-p_{x=0}}) \\ &= log(odds_{x=1}) - log(odds_{x=0}) \\ &= log(\frac{odds_{x=1}}{odds_{x=0}}) \\ \end{align}\]

Now, if we exponentiate both sides, we see how the exponentiated coefficient is an odds ratio:

\[\begin{align} exp(\beta_1) &= exp(log(\frac{odds_{x=1}}{odds_{x=0}})) \\ &= \frac{odds_{x=1}}{odds_{x=0}} \\ &= \frac{\frac{p_{x=1}}{1-p_{x=1}}}{\frac{p_{x=0}}{1-p_{x=0}}} \\ \end{align}\]

Odds ratios express the expected multiplicative change in the odds of the outcome for a unit increase in the predictor.

Logistic Regression in Stata

Workshop data set

We use a data set of birth weights of 189 infants to demonstrate logistic regression in Stata, where infants weighing less than 2500g are classified as low birth weight (lbw).

Outcome

  • low: 1=low birth weight, 0=normal weight

Predictors

  • age: mother’s age in years (14-45)
  • ptl: count of previous premature labors (0-3)
  • ht: mother’s history of hypertension, 1=hypertensive, 0=non-hypertensive
  • smoke: mother’s history of smoking, 1=smoker, 0=non-smoker
  • race: 1=white, 2=black, 3=other

Logistic regression will estimate linear relationships of each predictor above with the log odds of lbw.

Loading and exploring the low birth weight data set in Stata

The low birth weight data set comes with Stata and can be loaded with the command:

/* load low birth weight data */
webuse lbw, clear

We will explore the variables with histograms:

/* histograms of predictors */
hist low, discrete freq

[Figure: Histogram of low (Birthweight<2500g)]

hist age, freq

[Figure: Histogram of age (Age of mother)]

hist ptl, freq discrete

[Figure: Histogram of ptl (Premature labor history, count)]

hist ht, freq discrete

[Figure: Histogram of ht (Has history of hypertension)]

hist smoke, freq discrete

[Figure: Histogram of smoke (Smoked during pregnancy)]

hist race, freq discrete

[Figure: Histogram of race]

The logit command

Stata’s logit command performs logistic regression using the following syntax:

logit depvar indepvarlist

where depvar is the dependent variable (outcome) and indepvarlist is a list of one or more independent variables (predictors).

  • depvar is generally coded 0/1, and Stata will model the probability that depvar=1
  • Precede categorical predictors with i. in indepvarlist to request that Stata enter dummy variables representing each category but the first (by default) into the regression
    • Stata refers to variables preceded by i. as factors
  • Add the or option after a comma (,) following indepvarlist to produce odds ratios; otherwise, coefficients are reported in the logit metric by default

Stata’s logistic command estimates the same model as logit but can only output odds ratios.

Logistic regression of lbw data

We will estimate the relationships of mother’s age, previous premature labors, history of hypertension, smoking, and race with the log odds of having a low birth weight (lbw) baby.

  • linear slopes (in logits) are estimated for age and ptl
  • i. precedes each of ht, smoke and race to request that dummy variables for each category but the first be entered into the model
/* logistic regression with low as outcome,
   age and ptl as continuous predictors,
   ht, smoke and race as categorical predictors */
logit low age ptl i.ht i.smoke i.race
Iteration 0:  Log likelihood =   -117.336  
Iteration 1:  Log likelihood =   -105.286  
Iteration 2:  Log likelihood = -104.90511  
Iteration 3:  Log likelihood = -104.90441  
Iteration 4:  Log likelihood = -104.90441  

Logistic regression                                     Number of obs =    189
                                                        LR chi2(6)    =  24.86
                                                        Prob > chi2   = 0.0004
Log likelihood = -104.90441                             Pseudo R2     = 0.1059

------------------------------------------------------------------------------
         low | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0469041   .0353279    -1.33   0.184    -.1161454    .0223372
         ptl |   .7501393   .3298036     2.27   0.023     .1037362    1.396542
        1.ht |   1.227762   .6246189     1.97   0.049     .0035312    2.451992
             |
       smoke |
     Smoker  |   .9728681   .3863386     2.52   0.012     .2156584    1.730078
             |
        race |
      Black  |    .973844   .5013314     1.94   0.052    -.0087474    1.956435
      Other  |   1.021987   .4191365     2.44   0.015     .2004942    1.843479
             |
       _cons |  -.9237957    .891843    -1.04   0.300    -2.671776    .8241845
------------------------------------------------------------------------------

Starting towards the top of the output (after the likelihood iterations), we see:

  • 189 observations were used
  • The likelihood ratio \(\chi^2\) statistic equals 24.86, which assesses the null hypothesis that all model coefficients (except the intercept _cons) are zero
    • tests whether the current model improves fit over a model with no predictors (see the sketch after this list)
    • p-value=0.0004 suggests that at least one coefficient is not zero
  • The \(Pseudo R^2\) statistic is a goodness-of-fit measure similar to \(R^2\) for linear regression, but it is not interpreted as the proportion of outcome variance explained
  • A table of regression coefficients and their associated standard errors, test statistics (z) and p-values, and 95% confidence intervals
    • z and p-values assess null hypothesis that coefficient is 0
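As a check, the likelihood ratio statistic and its p-value can be reproduced from the log likelihoods printed above (a sketch using the values from this output):

/* LR chi2 = 2*(log likelihood of fitted model - log likelihood of null model) */
display 2*(-104.90441 - (-117.336))
/* p-value for the LR test with 6 degrees of freedom */
display chi2tail(6, 24.86)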

Interpreting logistic regression coefficients

logit low age ptl i.ht i.smoke i.race
Iteration 0:  Log likelihood =   -117.336  
Iteration 1:  Log likelihood =   -105.286  
Iteration 2:  Log likelihood = -104.90511  
Iteration 3:  Log likelihood = -104.90441  
Iteration 4:  Log likelihood = -104.90441  

Logistic regression                                     Number of obs =    189
                                                        LR chi2(6)    =  24.86
                                                        Prob > chi2   = 0.0004
Log likelihood = -104.90441                             Pseudo R2     = 0.1059

------------------------------------------------------------------------------
         low | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0469041   .0353279    -1.33   0.184    -.1161454    .0223372
         ptl |   .7501393   .3298036     2.27   0.023     .1037362    1.396542
        1.ht |   1.227762   .6246189     1.97   0.049     .0035312    2.451992
             |
       smoke |
     Smoker  |   .9728681   .3863386     2.52   0.012     .2156584    1.730078
             |
        race |
      Black  |    .973844   .5013314     1.94   0.052    -.0087474    1.956435
      Other  |   1.021987   .4191365     2.44   0.015     .2004942    1.843479
             |
       _cons |  -.9237957    .891843    -1.04   0.300    -2.671776    .8241845
------------------------------------------------------------------------------

The estimated coefficients above are interpreted as change in logits (log odds of being lbw).

Continuous predictors

  • for each additional year of age, the log odds of lbw decrease by .047
  • for each additional premature labor, the log odds of lbw increase by .75

Categorical predictors

  • mothers with a history of hypertension are expected to have an increase of 1.23 in the log odds of lbw compared to mothers without a history
  • mothers who smoked during pregnancy are expected to have an increase of .973 in the log odds of lbw compared to non-smokers
  • black mothers are expected to have an increase of .974 in the log odds of lbw compared to white mothers
  • other race mothers are expected to have an increase of 1.02 in the log odds of lbw compared to white mothers

Interpreting odds ratios

Odds ratios are interpreted in a couple of ways (let \(OR\) be some odds ratio):

  • \(OR\) is the factor or multiplicative change in the odds of the outcome
  • \((OR-1)\times100\%\) is the percent change in the odds of the outcome

An odds ratio above one denotes an increase in the odds, below one denotes a decrease in the odds, and equal to one denotes no change in the odds.

We repeat our logistic regression of low, now with odds ratios by adding the or option:

/* rerunning model to express odds ratios */
logit low age ptl i.ht i.smoke i.race, or
Iteration 0:  Log likelihood =   -117.336  
Iteration 1:  Log likelihood =   -105.286  
Iteration 2:  Log likelihood = -104.90511  
Iteration 3:  Log likelihood = -104.90441  
Iteration 4:  Log likelihood = -104.90441  

Logistic regression                                     Number of obs =    189
                                                        LR chi2(6)    =  24.86
                                                        Prob > chi2   = 0.0004
Log likelihood = -104.90441                             Pseudo R2     = 0.1059

------------------------------------------------------------------------------
         low | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .9541789   .0337091    -1.33   0.184     .8903458    1.022589
         ptl |   2.117295   .6982914     2.27   0.023     1.109308    4.041203
        1.ht |   3.413581   2.132187     1.97   0.049     1.003537    11.61146
             |
       smoke |
     Smoker  |   2.645521   1.022067     2.52   0.012     1.240679    5.641093
             |
        race |
      Black  |   2.648104   1.327578     1.94   0.052     .9912907    7.074066
      Other  |   2.778709   1.164658     2.44   0.015     1.222007    6.318482
             |
       _cons |   .3970093   .3540699    -1.04   0.300     .0691294    2.280021
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

Note: test (z) statistics and p-values are the same whether regression coefficients or odds ratios are reported
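The odds ratios are just the exponentiated coefficients from the earlier table; for example, exponentiating the smoke coefficient reproduces its odds ratio (a quick sketch):

/* exp(coefficient for smoke) equals the odds ratio for smoke */
display exp(.9728681)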

Continuous predictors

  • For each additional year of age, the odds of lbw decrease by 4.6% (\((.954-1)\times100\%=-4.6\%\))
  • For each additional premature labor, the odds of lbw increase by a factor of 2.12

Categorical predictors

  • mothers with a history of hypertension are expected to have 3.41 times the odds of lbw compared to mothers without a history
  • mothers who smoked during pregnancy are expected to have 2.65 times the odds of lbw compared to non-smokers
  • black mothers are expected to have a 165% (\((2.65-1)\times100\%=165\%\)) increase in the odds of lbw compared to white mothers
  • other race mothers are expected to have a 178% (\((2.78-1)\times100\%=178\%\)) increase in the odds of lbw compared to white mothers

Exponentiated interaction coefficients

Interactions in regression models allow the effect of one predictor to vary with another predictor.

For example, we might believe that the effect of age is different for smoking vs nonsmoking mothers.

Interaction terms are generated by taking the product of the component variables.

Stata provides the # and ## operators for interactions, where the former requests only the interaction term and the latter requests both the lower-order terms and the interaction.

Below, we add the interaction of smoke and age to the model:

/* adding interaction of age and smoke */
logit low c.age##i.smoke ptl i.ht i.race, or
Iteration 0:  Log likelihood =   -117.336  
Iteration 1:  Log likelihood = -105.18462  
Iteration 2:  Log likelihood = -104.70994  
Iteration 3:  Log likelihood =  -104.7077  
Iteration 4:  Log likelihood =  -104.7077  

Logistic regression                                     Number of obs =    189
                                                        LR chi2(7)    =  25.26
                                                        Prob > chi2   = 0.0007
Log likelihood = -104.7077                              Pseudo R2     = 0.1076

------------------------------------------------------------------------------
         low | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .9336296   .0471014    -1.36   0.173       .84573    1.030665
             |
       smoke |
     Smoker  |   .9512806   1.593868    -0.03   0.976     .0356552    25.38017
             |
 smoke#c.age |
     Smoker  |   1.045964   .0751283     0.63   0.532     .9086104    1.204082
             |
         ptl |    2.10686    .694485     2.26   0.024     1.104215    4.019921
        1.ht |   3.469858   2.170839     1.99   0.047     1.018067    11.82625
             |
        race |
      Black  |    2.48018   1.269646     1.77   0.076     .9093663    6.764375
      Other  |   2.704271   1.141077     2.36   0.018     1.182722    6.183265
             |
       _cons |    .660881   .8026954    -0.34   0.733     .0611319    7.144616
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

In the logit metric, interaction coefficients are interpreted as additive changes in a predictor’s effects due to a one-unit increase in the interacting predictor.

Exponentiated interaction coefficients are not odds ratios, but instead are the ratio of one odds ratio vs another.

The exponentiated interaction coefficient above, 1.046, can be interpreted as:

  • the ratio comparing the odds ratio of age for smokers to nonsmokers
    • the odds ratio of age for smokers is 1.046 times larger (or 4.6% larger) than the odds ratio of age for nonsmokers (which is .934, the OR for age in the table above; see the sketch below)
  • the ratio comparing the odds ratio of smoke between two mothers whose ages differ by one year (older mother in numerator)
    • a mother who is one year older than another mother will have an odds ratio for smoking that is 4.6% larger (holding other predictors constant)
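A sketch of recovering the odds ratio of age for smokers from the table above, either by multiplying the two exponentiated estimates or with lincom (assuming the interaction model above was the last model fit):

/* OR of age for smokers = (OR of age for nonsmokers) x (exponentiated interaction) */
display .9336296*1.045964
/* the same quantity, with a standard error, via lincom */
lincom age + 1.smoke#c.age, or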

Exercise 1

To practice logistic regression in Stata, we will use the nhanes2 dataset, which comes from the National Health and Nutrition Examination Survey and can be loaded with:

webuse nhanes2, clear

The model variables are:

  • highbp: high blood pressure, 1=high blood pressure, 0=normal
  • region: region of US, 1=NE, 2=MW, 3=S, 4=W
  • rural: rural residence, 0=urban, 1= rural
  • weight: weight in kg
  • age: age in years
  1. Perform a logistic regression of outcome highbp with categorical predictors (factors) region and rural, and linear relationships for continuous predictors weight and age, with coefficients expressed in logits (log odds). Interpret the coefficients.
  2. Rerun the model with coefficients expressed as odds ratios and interpret the odds ratios.

Model Predictions

Predicted outcomes

Understanding the magnitude of predictor effects as changes in logits or as odds ratios can be difficult.

Expressing model effects in probabilities can make magnitudes clearer.

In this section we show how Stata’s predict and margins commands predict model-based probabilities of the outcome and how to use these to interpret model results.

Because of the nonlinear relationship between the predictors and the probability of the outcome:

  • a one-unit change in a predictor will not always lead to the same change in probability of the outcome
  • the change in probability due to a change in a predictor depends on the value of the predictor itself as well as values of the other predictors

At times, it can also be useful to predict outcomes in logits, for example, to assess assumptions of linear relationships.

predict after logistic regression

Stata’s predict command is primarily used to make model-based predictions for the current data.

The basic syntax is:

predict newvar

where newvar is the name of a new variable that will hold the predictions.

By default, after logit, predict will estimate predicted probabilities, but many other types of quantities can be predicted with various options (see logit postestimation).

After running predict, newvar will be added to the data set with the predictions.

/* predp will be predicted probabilities */
predict predp
(option pr assumed; Pr(low))

Here are the model variables and predicted probabilities from the model for the first 10 observations:

/* display model variables and predictions for first 10 obs */
list low age ptl ht smoke race predp in 1/10
     | low   age   ptl   ht       smoke    race      predp |
     |-----------------------------------------------------|
  1. |   0    19     0    0   Nonsmoker   Black   .3012971 |
  2. |   0    33     0    0   Nonsmoker   Other   .1900565 |
  3. |   0    20     0    0      Smoker   White   .2913144 |
  4. |   0    21     0    0      Smoker   White   .2817266 |
  5. |   0    18     0    0      Smoker   White    .311053 |
     |-----------------------------------------------------|
  6. |   0    21     0    0   Nonsmoker   Other   .2917718 |
  7. |   0    22     0    0   Nonsmoker   White   .1239348 |
  8. |   0    17     0    0   Nonsmoker   Other   .3319944 |
  9. |   0    29     0    0      Smoker   White   .2122952 |
 10. |   0    26     0    0      Smoker   White   .2367767 |
     +-----------------------------------------------------+

A model with high predictive power will predict probabilities near zero for observations where low=0, and near one for observations where low=1.

Let’s examine the distribution of predicted probabilities by observed outcome low:

/* compare distribution of predp between observations with low=1 vs low=0 */
graph box predp, over(low) b1title("observed low") ytitle("predicted Pr(low=1)")

[Figure: Box plots of predicted Pr(low=1) by observed low]

The distributions are not very separated, indicating that the model has low predictive power.
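A numeric version of the same comparison (a sketch):

/* summary statistics of predicted probabilities by observed outcome */
tabstat predp, by(low) statistics(mean sd min max)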

margins and marginal estimates

Stata’s margins command is one of its most powerful and versatile commands, which estimates many kinds of model predictions, including:

  • predictive margins: predicted outcomes at fixed values of some predictors and averaging over the distribution of other predictors
    • also known as marginal means
    • represents the average predicted outcome in the population or some subpopulation
  • adjusted predictions: predicted outcomes at fixed values of all predictors
    • also known as conditional margins
    • represent the predicted outcome for a specific type of person (or unit)
  • marginal effects: effects of predictors, holding some predictors fixed and averaging over the distributions of others
    • typically estimated as a difference on the original scale of the outcome (difference in probabilities for logistic regression)
    • are equal to partial derivatives for continuous predictors

Because of its versatility, margins has an extensive syntax that can be difficult to master (the manual entry for margins is 78 pages long!).

Nevertheless, margins functionality is one of the primary reasons to use Stata.

margins can only be used after a model has been estimated.

Predictive margins

Predictive margins are predicted outcomes averaged over the distribution of some (or all) predictors and at fixed levels of the others.

Generally, when specifying margins, whichever predictors are not fixed at specific levels will be averaged over when predicting the outcome.

Thus, simply specifying margins after our logistic regression of lbw will result in a predicted probability of lbw, averaging over the distributions of all predictors in the model:

/* probability of lbw, averaging over all predictors */
margins
Predictive margins                                         Number of obs = 189
Model VCE: OIM

Expression: Pr(low), predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       _cons |   .3121693   .0314715     9.92   0.000     .2504864    .3738523
------------------------------------------------------------------------------

Note: the z-statistics and p-values test the null hypothesis that each predictive margin equals zero and are generally ignored (we are usually not interested in testing whether a predicted probability differs from zero).

The above predictive margin:

  • is interpreted as the estimated average probability of lbw in the population (averaging over, or regardless of, the predictors)
  • equals the observed probability of the outcome
  • also equals the mean of our model-predicted probabilities for each observation, predp
/* the means of low and predp equal the predictive margin above */
summ low predp
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         low |        189    .3121693    .4646093          0          1
       predp |        189    .3121693    .1663442   .0458923   .7944289

Predictive margins for subpopulations

We can also estimate predictive margins for specific subpopulations.

The syntax:

margins factorvar

estimates predictive margins for each level of factorvar (predictor preceded by i.), averaging over the distribution of the other predictors.

Thus, the following margins estimates predicted probabilities of lbw for smoking and non-smoking mothers, averaging over the distribution of all other predictors:

/* average probabilities of lbw for smokers and nonsmokers */
margins smoke
Predictive margins                                         Number of obs = 189
Model VCE: OIM

Expression: Pr(low), predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       smoke |
  Nonsmoker  |   .2400186   .0387952     6.19   0.000     .1639814    .3160559
     Smoker  |   .4283874   .0580954     7.37   0.000     .3145226    .5422522
------------------------------------------------------------------------------

The predictive margin for Nonsmoker above can be estimated manually with the following procedure:

  1. Convert everyone in the sample to smoke=0
  2. Leave all other predictors at their observed values
  3. Predict probabilities for everyone based on the model estimates
  4. The mean of these predicted probabilities is the predictive margin

By leaving the other predictors at their observed values when predicting and then averaging the predictions, the resulting mean is averaged over the distribution of those predictors.

The predictive margin for Smoker is obtained using the same procedure except everyone is converted to smoke=1 in the first step.
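A sketch of this manual calculation in Stata (p_smoke0 is an illustrative variable name; it assumes the logistic regression of low above is the active model):

/* reproduce the Nonsmoker predictive margin by hand */
preserve
replace smoke = 0        /* step 1: set everyone to nonsmoker */
predict p_smoke0, pr     /* steps 2-3: predict with other predictors at observed values */
summarize p_smoke0       /* step 4: the mean is the predictive margin */
restore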

Use # to obtain predictive margins at each combination of the factors’ levels:

/* predictive margins for each combination of smoke and race */
margins smoke#race
Predictive margins                                         Number of obs = 189
Model VCE: OIM

Expression: Pr(low), predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
  smoke#race |
  Nonsmoker #|
      White  |   .1551203   .0467924     3.32   0.001     .0634088    .2468317
  Nonsmoker #|
      Black  |    .314714   .0899992     3.50   0.000     .1383188    .4911092
  Nonsmoker #|
      Other  |   .3245289    .056837     5.71   0.000     .2131304    .4359273
     Smoker #|
      White  |   .3145167   .0568889     5.53   0.000     .2030165     .426017
     Smoker #|
      Black  |   .5340904   .1121103     4.76   0.000     .3143583    .7538226
     Smoker #|
      Other  |   .5454064   .0971575     5.61   0.000     .3549812    .7358315
------------------------------------------------------------------------------

The predictive margins above are the predicted probabilities of lbw for each combination of smoking status and race, averaging over age, premature labors, and history of hypertension.

Fixing predictor levels in margins with at()

We can fix the levels of any predictor in margins, including continuous predictors, with the at() option:

  • specify at() after a ,
  • within at()
    • specify the names of predictors whose levels are to be fixed
    • after each predictor’s name, specify = and then the value at which to fix the predictor
    • to fix the predictor at multiple values, after =, put the values inside () using one of the following notations
      • a list of numbers
      • a/b requests every integer from a to b
      • a(c)b requests all values from a to b in increments of c

Remember that any predictor not specified in margins will be averaged over when predicting outcomes.

First, we estimate the predictive margin for mothers who are age 20:

/* average probability of lbw for 20-year old mothers */
margins, at(age=20)
Predictive margins                                         Number of obs = 189
Model VCE: OIM

Expression: Pr(low), predict()
At: age = 20

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       _cons |   .3365013   .0372488     9.03   0.000     .2634951    .4095075
------------------------------------------------------------------------------

Next are predictive margins for smoking and non-smoking mothers who are aged 20 or 30:

/* average probabilities of lbw for smoking and non-smoking moms age 20 or 30 */
margins smoke, at(age=(20 30))
Predictive margins                                         Number of obs = 189
Model VCE: OIM

Expression: Pr(low), predict()
1._at: age = 20
2._at: age = 30

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
   _at#smoke |
1#Nonsmoker  |   .2612321   .0444322     5.88   0.000     .1741465    .3483176
   1#Smoker  |   .4590947   .0625156     7.34   0.000     .3365665     .581623
2#Nonsmoker  |    .187449   .0493688     3.80   0.000     .0906879      .28421
   2#Smoker  |   .3577227   .0789483     4.53   0.000     .2029869    .5124585
------------------------------------------------------------------------------

And here are average predicted probabilities for mothers aged 20, 25, 30, 35, or 40 who have had 0, 1, or 2 premature labors:

/* average probabilities of lbw for mothers aged 20 to 40 (in increments of 5) and 
   either 0 to 2 premature labors */
margins, at(age=(20(5)40) ptl=(0/2))
Predictive margins                                         Number of obs = 189
Model VCE: OIM

Expression: Pr(low), predict()
1._at:  age = 20
        ptl =  0
2._at:  age = 20
        ptl =  1
3._at:  age = 20
        ptl =  2
4._at:  age = 25
        ptl =  0
5._at:  age = 25
        ptl =  1
6._at:  age = 25
        ptl =  2
7._at:  age = 30
        ptl =  0
8._at:  age = 30
        ptl =  1
9._at:  age = 30
        ptl =  2
10._at: age = 35
        ptl =  0
11._at: age = 35
        ptl =  1
12._at: age = 35
        ptl =  2
13._at: age = 40
        ptl =  0
14._at: age = 40
        ptl =  1
15._at: age = 40
        ptl =  2

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         _at |
          1  |    .303371   .0387435     7.83   0.000      .227435    .3793069
          2  |   .4632617   .0746991     6.20   0.000     .3168543    .6096692
          3  |   .6311228    .135043     4.67   0.000     .3664434    .8958023
          4  |   .2598657   .0371378     7.00   0.000     .1870771    .3326544
          5  |    .410806   .0679355     6.05   0.000     .2776549     .543957
          6  |   .5799304   .1349761     4.30   0.000     .3153822    .8444786
          7  |   .2204237    .052976     4.16   0.000     .1165927    .3242547
          8  |   .3602354    .079823     4.51   0.000     .2037853    .5166855
          9  |    .527131   .1434264     3.68   0.000     .2460204    .8082417
         10  |   .1852631   .0689992     2.69   0.007     .0500271    .3204991
         11  |   .3124608   .0995509     3.14   0.002     .1173446    .5075769
         12  |    .473814   .1592928     2.97   0.003     .1616059    .7860222
         13  |   .1544039   .0806383     1.91   0.056    -.0036444    .3124521
         14  |   .2681937   .1185453     2.26   0.024     .0358491    .5005383
         15  |   .4211062   .1789255     2.35   0.019     .0704186    .7717937
------------------------------------------------------------------------------

Above we see that Stata uses a shorthand _at variable to label the estimates with a key above the margins table.

Graphing margins estimates with marginsplot

Another compelling reason to use margins is marginsplot, which graphs the estimates of the last margins command.

A plot of predicted outcomes across a range of predictor values can be helpful to interpret the predictor’s effects.

marginsplot will plot the predicted outcome at the predictor values specified in the previous margins and will generally connect them with a line.

Let’s examine a plot of the predictive margins for smoking and non-smoking mothers:

/* estimate marginal means for smoking and non-smoking mothers */
margins smoke
/* plot marginal means */
marginsplot

[Figure: Predictive margins of smoke with 95% CIs (y-axis: Pr(low), x-axis: Smoked during pregnancy)]

If more than one predictor is specified in margins, marginsplot will place one predictor on the x-axis and will plot separate lines by the other predictors.

/* estimate marginal means for smoking and non-smoking mothers at ages 20, 30 and 40 */
margins smoke, at(age=(20 30 40))
/* plot marginal means */
marginsplot

[Figure: Predictive margins of smoke with 95% CIs (y-axis: Pr(low), x-axis: Age of mother, separate lines for Nonsmoker and Smoker)]

Use the x() (shorthand for xdimension()) option to specify which variable appears on the x-axis:

/* estimate marginal means for smoking and non-smoking mothers at ages 20, 30 and 40 */
margins smoke, at(age=(20 30 40))
/* plot marginal means */
marginsplot, x(smoke)

[Figure: Predictive margins of smoke with 95% CIs (y-axis: Pr(low), x-axis: Smoked during pregnancy, separate lines for age=20, 30, 40)]

If many predictors are specified in margins, the subsequent marginsplot may be cluttered with too many lines:

/* margins with many predictors fixed at many levels */
margins smoke#race, at(age=(20 30 40) ptl=(0 1))
/* too many lines in marginsplot */
marginsplot

[Figure: Predictive margins of smoke#race with 95% CIs (y-axis: Pr(low), x-axis: Age of mother, separate lines for each combination of ptl, smoke, and race)]

Specify some of the predictors in a by() option to plot separate graphs by those variables:

/* separate graphs by race and ptl */
marginsplot, by(race ptl)

[Figure: Predictive margins of smoke#race with 95% CIs, separate panels by race and ptl (y-axis: Pr(low), x-axis: Age of mother, lines for Nonsmoker and Smoker)]

Adjusted predictions

If all model predictors are specified in margins, Stata calls the resulting estimated outcomes adjusted predictions.

Adjusted predictions are predictions for a specific type of person rather than some average.

/* adjusted predictions of probability of lbw for smokers and non-smokers of each race
   who are age 30 with no previous premature labors and a history of hypertension */
margins smoke#race, at(age=30 ptl=0 ht=1)
Adjusted predictions                                       Number of obs = 189
Model VCE: OIM

Expression: Pr(low), predict()
At: age = 30
    ptl =  0
    ht  =  1

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
  smoke#race |
  Nonsmoker #|
      White  |    .249149   .1345422     1.85   0.064    -.0145488    .5128469
  Nonsmoker #|
      Black  |    .467717   .1883502     2.48   0.013     .0985574    .8368766
  Nonsmoker #|
      Other  |   .4797188   .1743215     2.75   0.006      .138055    .8213826
     Smoker #|
      White  |    .467474   .1718665     2.72   0.007     .1306219    .8043262
     Smoker #|
      Black  |   .6992137   .1651199     4.23   0.000     .3755847    1.022843
     Smoker #|
      Other  |   .7092406    .159545     4.45   0.000     .3965383    1.021943
------------------------------------------------------------------------------

Notice how Stata labels the estimates adjusted predictions.

Adjusted predictions are often estimated at the means of the predictors, so Stata provides an atmeans option, which fixes at its mean any predictor that has not already been fixed at another value.

/* adjusted predictions of probability of lbw for smokers and non-smokers of each race
   who have a history of hypertension and are at the mean age and number of premature labors */
margins smoke#race, at(ht=1) atmeans
Adjusted predictions                                       Number of obs = 189
Model VCE: OIM

Expression: Pr(low), predict()
At: age     =  23.2381 (mean)
    ptl     = .1957672 (mean)
    ht      =        1
    0.smoke = .6084656 (mean)
    1.smoke = .3915344 (mean)
    1.race  = .5079365 (mean)
    2.race  = .1375661 (mean)
    3.race  = .3544974 (mean)

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
  smoke#race |
  Nonsmoker #|
      White  |   .3454404    .153753     2.25   0.025       .04409    .6467909
  Nonsmoker #|
      Black  |   .5829029   .1710685     3.41   0.001     .2476148     .918191
  Nonsmoker #|
      Other  |   .5945589   .1554467     3.82   0.000     .2898889    .8992289
     Smoker #|
      White  |   .5826656   .1562253     3.73   0.000     .2764697    .8888615
     Smoker #|
      Black  |   .7871062   .1220021     6.45   0.000     .5479866    1.026226
     Smoker #|
      Other  |    .795062    .117221     6.78   0.000      .565313    1.024811
------------------------------------------------------------------------------

Note: Except in linear regression models, an outcome predicted at the mean of the predictors will generally not equal the predicted outcome averaged over the distribution of the predictors

Marginal effects

In linear regression, a regression coefficient can be directly interpreted as the change in the outcome in its original metric for a one-unit increase in the predictor.

In models where the predictors are linearly related to the transformed outcome via a link function, as in logistic regression, such a coefficient interpretation is not possible, because:

  • the magnitude of the predictor’s effect is not constant over the range of the predictor
  • the magnitude also depends on the values of the other predictors

Marginal effects express predictor effects in the original metric of the outcome, with the other predictors fixed at specific values or averaged over.

In logistic regression, a marginal effect expresses a change in probabilities.

margins estimates marginal effects with the dydx() option.

Marginal effects of categorical predictors

Marginal effects of categorical predictors are straightforward to interpret: a difference in probabilities (risk difference), at fixed values of other predictors or averaging over them.

For example, the following estimates the difference in probabilities of lbw between mothers with and without a history of hypertension, averaging over the distribution of all other predictors:

/* marginal effect of history of hypertension */
margins, dydx(ht)
Average marginal effects                                   Number of obs = 189
Model VCE: OIM

Expression: Pr(low), predict()
dy/dx wrt:  1.ht

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        1.ht |   .2579502   .1344364     1.92   0.055    -.0055403    .5214407
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

On average, mothers with a history of hypertension are expected to have a 0.258 higher probability of lbw than mothers with no history.

For most audiences, the above risk difference is easier to understand than odds ratios or changes in logits.

The above risk difference is equal to the difference between the predictive margins for ht:

/* marginal effect of ht is difference between these predictive margins */
margins ht
Predictive margins                                         Number of obs = 189
Model VCE: OIM

Expression: Pr(low), predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
          ht |
          0  |   .2941321   .0324756     9.06   0.000     .2304811    .3577831
          1  |   .5520823   .1304096     4.23   0.000     .2964842    .8076804
------------------------------------------------------------------------------
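A quick check that the difference between these two margins matches the marginal effect reported above:

/* difference between the ht=1 and ht=0 predictive margins */
display .5520823 - .2941321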

Instead of averaging over race, we can estimate the marginal effect of hypertension separately by race:

/* marginal effect of ht by race, averaging over other predictors */
margins race, dydx(ht)
Average marginal effects                                   Number of obs = 189
Model VCE: OIM

Expression: Pr(low), predict()
dy/dx wrt:  1.ht

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
0.ht         |  (base outcome)
-------------+----------------------------------------------------------------
1.ht         |
        race |
      White  |   .2334744   .1364055     1.71   0.087    -.0338755    .5008243
      Black  |    .270582   .1302459     2.08   0.038     .0153047    .5258592
      Other  |   .2702543   .1287184     2.10   0.036     .0179708    .5225377
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

Above we see that the magnitude of the marginal effect of hypertension depends on the value of race.

Note: In models where the outcome has a nonlinear relationship with the predictors, the fact that the marginal effect of one predictor varies across levels of another predictor does not necessarily imply that an interaction term was entered in the model. See VanderWeele and Knol (2014) for a thorough explanation.

Marginal effects of continuous predictors

Marginal effects of continuous predictors are more difficult to interpret.

The marginal effect of a continuous predictor is the partial derivative of the outcome with respect to that predictor, at fixed values of other predictors or averaging over them.

  • ideally, the marginal effect would be interpreted as the average slope of the outcome vs the predictor
    • however, with curved relationships, the slope is continuously changing with the predictor, so the derivative is often interpreted as the instantaneous rate of change
  • if the relationship between the outcome and predictor is approximately linear, the marginal effect approximates the change in the outcome for a one-unit change in the predictor
    • if the relationship is very curved, this interpretation will be off

In the figure below that displays the curved relationship between predictor \(x\) and \(Pr(y=1)\), the relationship is close to linear at \(x=0\), but not so linear at \(x=2\).

The estimated marginal effect of \(x\) will approximate the change in \(Pr(y=1)\) for a one-unit increase in \(x\) if most observations have \(x\) values near 0, but not if most observations have values near 2.

[Figure: S-shaped relationship between hypothetical continuous predictor x and Pr(y=1)]

Below, we estimate the marginal effect of age, averaging over the other predictors:

/* marginal effect of age is a derivative */
margins, dydx(age)
Average marginal effects                                   Number of obs = 189
Model VCE: OIM

Expression: Pr(low), predict()
dy/dx wrt:  age

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0087802   .0065292    -1.34   0.179    -.0215772    .0040167
------------------------------------------------------------------------------
  • this marginal effect of age can be safely interpreted as the average instantaneous rate of change of the outcome with respect to age
  • if lbw and age have a mostly linear relationship for mothers of birthing age, then it can be interpreted as approximately the average change in the probability of lbw per additional year

For the age range of mothers in this data set, 14-45, the relationship between age and probability of lbw is close to linear:

/* predictive margins for ages 14-45 */
margins, at(age=(14/45))
/* plot margins without confidence intervals */
marginsplot, noci

[Figure: Predictive margins (y-axis: Pr(low), x-axis: Age of mother)]

Thus, the marginal effect of age is approximately the expected change in the probability of lbw for each additional year of age.

Odds ratios vs marginal effects

We have now seen that the predictor effects in logistic regression models can be expressed as odds ratios or marginal effects, so which to report?

Odds ratios

Good

  • familiar to most audiences
  • magnitude does not depend on other predictors (unless involved in an interaction), so has a consistent interpretation

Bad

  • can be large even when the absolute change in probabilities is small (if \(p_1=0.01\) and \(p_2=0.0001\), then \(OR\approx 100\), but the difference is only \(0.01-0.0001=.0099\))
  • odds are generally not as well understood as probabilities, though this varies by discipline
  • often misinterpreted as risk ratios (\(p_1/p_0\)), which odds ratios approximate only when \(p\) is very small
  • comparing odds ratios between groups may not be valid if the amount of unexplained variation in the groups differs (Allison, 1999)

Marginal effects

Good

  • expresses effect in probabilities, which are generally well understood
  • can be interpreted as estimated average effects in the population

Bad

  • can be hard to calculate (without Stata)
  • many audiences are not familiar with the term “marginal effect”
  • magnitude depends on other predictors
  • will generally not have the exact interpretation of the change in the outcome for a one-unit increase in the predictor, though sometimes will approximate it well

If space allows, reporting both types of effect estimates might be preferable.

Generally, it is advisable to report some sort of baseline probability of the outcome so that the audience has an idea of the magnitudes of effects, especially if only odds ratios are reported.

The journal Health Services Research has encouraged the reporting of marginal effects over odds ratios (Norton et al., 2024), so consult the target journal’s instructions for authors for guidance.

Exercise 2

Use the NHANES data and the previously run logistic regression to:

  1. Estimate the following:
    • the average probability of high blood pressure in each region of the US
    • the average probability of high blood pressure for rural and urban residents in each region of the US
  2. Run margins, at(region=4 rural=0 weight=60 age=30) and interpret the estimate
  3. Estimate the average marginal effect of age
    • Estimate the predictive margins for every year of age from 20 to 70 and graph the predictive margins vs age
    • Can the marginal effect of age be interpreted as approximately the increase in the probability of highbp for each additional year of age?

Assessing model fit and model assumptions

Absolute fit

Stata provides two \(\chi^2\)-based tests of the fit of the data to the model with the command estat gof.

Both assess the null hypothesis that the model-predicted frequencies of ones match the observed frequencies of ones.

Rejection of the null hypothesis suggests bad fit.

Pearson’s goodness of fit test

Specifying only estat gof requests Pearson’s \(\chi^2\) goodness-of-fit test:

/* Pearson goodness-of-fit test */
estat gof
Goodness-of-fit test after logistic model
Variable: low

      Number of observations =    189
Number of covariate patterns =    108
           Pearson chi2(101) = 104.23
                 Prob > chi2 = 0.3929

Pearson’s \(\chi^2\) goodness-of-fit test procedure:

  • split data into \(k\) groups based on having the same values on the predictors (covariates)
    • Stata tells us there are \(k=108\) unique patterns of predictor values
  • Calculate the expected number of events for each group, which is the sum of the model-predicted probabilities in that group
    • e.g. if a group has 2 observations each with predicted probability \(p_i = 0.5\), the expected number of events is \(0.5 + 0.5 = 1\)
  • the Pearson \(\chi^2\) statistic is calculated:

\[\chi^2 = \sum_k\frac{(O_k - E_k)^2}{E_k}\]

where \(k\) indexes a group defined by predictor patterns, and \(O_k\) and \(E_k\) are the observed and model-expected number of ones in that group, respectively.

If model predictions do not match the observed data well, \(\chi^2\) will be large and the p-value will be small.

Pearson’s statistic has an approximate \(\chi^2\) distribution when the number of observations per group is large enough (>=5).

With 108 groups and 189 total observations, most groups will contain only one observation, so Pearson’s statistic may not be \(\chi^2\)-distributed.
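
Although estat gof does this work for us, the grouping logic can be made concrete with a short by-hand sketch (the names p_hat, pattern, O, E, m, and term are introduced here, and we assume the fitted logit model is still the active estimation):

/* by-hand sketch of the Pearson GOF calculation over covariate patterns */
predict double p_hat, pr                      /* model-predicted probability */
egen pattern = group(age ptl ht smoke race)   /* covariate-pattern identifier */
preserve
collapse (sum) O=low E=p_hat (count) m=low, by(pattern)
/* each pattern contributes terms for both its ones and its zeroes,
   which is equivalent to dividing (O - E)^2 by E*(1 - E/m) */
generate double term = (O - E)^2 / (E*(1 - E/m))
quietly summarize term
display "Pearson chi2 = " r(sum)              /* should be close to estat gof's value */
restore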

Hosmer and Lemeshow goodness of fit test

Hosmer and Lemeshow (1980) developed a similar \(\chi^2\) statistic based on fewer groups with more observations per group.

  • groups are defined by quantiles of the model-predicted probabilities.
  • the standard version of the statistic divides the data into groups based on deciles
    • observations with predicted probabilities in the lowest decile form the first group, the next decile the second group, …
  • the number of groups can be reduced for smaller data sets
  • the Hosmer and Lemeshow statistic is then calculated the same way as Pearson’s statistic using observed and expected number of ones

Adding the group() option with the number of groups to estat gof requests the Hosmer and Lemeshow goodness of fit test:

/* Hosmer and Lemeshow gof test with groups based on predicted probability deciles */
estat gof, group(10)
note: obs collapsed on 10 quantiles of estimated probabilities.

Goodness-of-fit test after logistic model
Variable: low

 Number of observations =    189
       Number of groups =     10
Hosmer–Lemeshow chi2(8) =   6.92
            Prob > chi2 = 0.5456

Both goodness of fit tests suggest that the model has adequate fit to the data.

However, Allison (2014) has pointed out some problems with the Hosmer and Lemeshow test:

  • the p-value can change dramatically with the number of groups
  • adding terms that are statistically significant may lead to an increase in the test statistic (indicating worse fit)
  • adding nonsignificant terms can lead to a decrease in the statistic

Nevertheless, Hosmer and Lemeshow’s test is commonly reported.

Allison (2014) discusses some newer goodness of fit tests with potentially better properties, but they would need to be calculated manually.

Relative fit

Often, we are interested in comparing different models’ fit to the data.

Information Criteria

Akaike’s Information Criterion (AIC) and Bayesian Information Criterion (BIC) assess a model’s fit to the data but penalize for model complexity (number of parameters estimated).

Use estat ic after running a regression model to request both AIC and BIC:

/* AIC and BIC */
estat ic
Akaike's information criterion and Bayesian information criterion

-----------------------------------------------------------------------------
       Model |          N   ll(null)  ll(model)      df        AIC        BIC
-------------+---------------------------------------------------------------
           . |        189   -117.336  -104.9044       7   223.8088    246.501
-----------------------------------------------------------------------------
Note: BIC uses N = number of observations. See [R] IC note.

Lower values on both criteria indicate better fit to the data, accounting for complexity.
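
As a quick check of the definitions (AIC \(= -2\,ll + 2k\) and BIC \(= -2\,ll + k\ln N\), where \(k\) is the number of estimated parameters, the df column above), the reported values can be reproduced from the log likelihood:

/* reproduce the reported AIC and BIC from the model log likelihood (k = 7, N = 189) */
display "AIC = " -2*(-104.9044) + 2*7
display "BIC = " -2*(-104.9044) + ln(189)*7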

We compare our model to a model without continuous predictors age and ptl, and save this reduced model’s estimates with estimates store, which we use below.

/* model without continuous predictors age and ptl*/
logit low i.ht i.smoke i.race, or
/* store model results as m_reduced */
estimates store m_reduced
/* AIC and BIC */
estat ic
Iteration 0:  Log likelihood =   -117.336  
Iteration 1:  Log likelihood = -108.34857  
Iteration 2:  Log likelihood = -108.17535  
Iteration 3:  Log likelihood = -108.17506  
Iteration 4:  Log likelihood = -108.17506  

Logistic regression                                     Number of obs =    189
                                                        LR chi2(4)    =  18.32
                                                        Prob > chi2   = 0.0011
Log likelihood = -108.17506                             Pseudo R2     = 0.0781

------------------------------------------------------------------------------
         low | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        1.ht |   3.252896    2.02843     1.89   0.059     .9582541     11.0423
             |
       smoke |
     Smoker  |   3.094613   1.155269     3.03   0.002     1.488809    6.432408
             |
        race |
      Black  |   2.803324   1.393428     2.07   0.038     1.058211    7.426326
      Other  |    3.06541   1.240317     2.77   0.006     1.387004    6.774843
             |
       _cons |   .1449379   .0525803    -5.32   0.000     .0711844    .2951068
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.



Akaike's information criterion and Bayesian information criterion

-----------------------------------------------------------------------------
       Model |          N   ll(null)  ll(model)      df        AIC        BIC
-------------+---------------------------------------------------------------
   m_reduced |        189   -117.336  -108.1751       5   226.3501   242.5588
-----------------------------------------------------------------------------
Note: BIC uses N = number of observations. See [R] IC note.

The model without age and ptl has a higher AIC but a lower BIC than the full model, suggesting that the full model fits the current data better but may be a bit overfit.

Likelihood ratio test

Likelihood ratio tests compare the fit of nested models and test the null hypothesis that the reduced model does not fit the data significantly worse.

We can compare two models saved with estimates store via a likelihood ratio test using the lrtest command followed by the names of the two saved models.

Below we compare the full and reduced model via likelihood ratio test.

/* refit full model */
logit low age ptl i.ht i.smoke i.race, or
/* store model results */
estimate store m_full
/* likelihood ratio test comparing full and reduced models */
lrtest m_full m_reduced
Iteration 0:  Log likelihood =   -117.336  
Iteration 1:  Log likelihood =   -105.286  
Iteration 2:  Log likelihood = -104.90511  
Iteration 3:  Log likelihood = -104.90441  
Iteration 4:  Log likelihood = -104.90441  

Logistic regression                                     Number of obs =    189
                                                        LR chi2(6)    =  24.86
                                                        Prob > chi2   = 0.0004
Log likelihood = -104.90441                             Pseudo R2     = 0.1059

------------------------------------------------------------------------------
         low | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .9541789   .0337091    -1.33   0.184     .8903458    1.022589
         ptl |   2.117295   .6982914     2.27   0.023     1.109308    4.041203
        1.ht |   3.413581   2.132187     1.97   0.049     1.003537    11.61146
             |
       smoke |
     Smoker  |   2.645521   1.022067     2.52   0.012     1.240679    5.641093
             |
        race |
      Black  |   2.648104   1.327578     1.94   0.052     .9912907    7.074066
      Other  |   2.778709   1.164658     2.44   0.015     1.222007    6.318482
             |
       _cons |   .3970093   .3540699    -1.04   0.300     .0691294    2.280021
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.



Likelihood-ratio test
Assumption: m_reduced nested within m_full

 LR chi2(2) =   6.54
Prob > chi2 = 0.0380

The low p-value leads us to reject the null hypothesis and conclude that the reduced model fits the data significantly worse.
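
The statistic equals twice the difference in the two models' maximized log likelihoods, with degrees of freedom equal to the difference in the number of estimated parameters (7 - 5 = 2); a quick check from the reported log likelihoods:

/* LR chi2 = 2*(ll_full - ll_reduced), df = difference in estimated parameters */
display "LR chi2(2) = " 2*(-104.90441 - (-108.17506))
display "p-value    = " chi2tail(2, 2*(-104.90441 - (-108.17506)))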

Predictive power: Area under ROC Curves

Once we have estimated predicted probabilities, we can classify observations as 0 or 1 based on some cutoff, for example 0.5.

Predictive power measures assess the agreement between the observed zeroes and ones and the model-classified zeroes and ones.

Sensitivity and Specificity

Below, we create a variable that classifies observations with predicted probabilities above 0.5 as one and below 0.5 as zero, and then check its agreement with the observed outcome:

/* classify observations based on cutoff 0.5 */
gen pred_low = 0
replace pred_low = 1 if predp > 0.5
/* check agreement of observed and predicted outcomes, add row percentages */
tab low pred_low, row  
(29 real changes made)


+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

Birthweigh |       pred_low
   t<2500g |         0          1 |     Total
-----------+----------------------+----------
         0 |       117         13 |       130 
           |     90.00      10.00 |    100.00 
-----------+----------------------+----------
         1 |        43         16 |        59 
           |     72.88      27.12 |    100.00 
-----------+----------------------+----------
     Total |       160         29 |       189 
           |     84.66      15.34 |    100.00 

In the table above:

  • observed low are the rows and the predicted classifications are the columns
  • correct classifications are represented in the top-left and bottom-right cells
  • classification errors are represented in the top-right and bottom-left cells

Two quantities we can calculate from the table above are sensitivity and specificity.

  • sensitivity is \(Pr(pred=1|obs=1)\), the probability of correctly classifying a one, or the true positive rate
    • the percent (divided by 100) in the bottom right cell is the sensitivity
  • specificity is \(Pr(pred=0|obs=0)\), the probability of correctly classifying a zero, or the true negative rate
    • the percent (divided by 100) in the top left cell is the specificity
  • the false negative rate and false positive rate are represented by the percents (divided by 100) in the top-right and bottom-left cells, respectively.

The model has good specificity at .9 but bad sensitivity at .27, meaning that it tends to classify observed zeroes correctly but tends to misclassify observed ones.

If correctly identifying ones is more important than identifying zeroes, we might want to lower our cutoff to be more permissive in classifying ones.

Below we change the cutoff probability to 0.25 and recalculate the observed vs predicted table:

/* classify observations based on cutoff 0.25 */
gen pred_low_25 = 0
replace pred_low_25 = 1 if predp > 0.25
/* check agreement of observed and predicted outcomes with new cutoff */
tab low pred_low_25, row 
(121 real changes made)


+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

Birthweigh |      pred_low_25
   t<2500g |         0          1 |     Total
-----------+----------------------+----------
         0 |        57         73 |       130 
           |     43.85      56.15 |    100.00 
-----------+----------------------+----------
         1 |        11         48 |        59 
           |     18.64      81.36 |    100.00 
-----------+----------------------+----------
     Total |        68        121 |       189 
           |     35.98      64.02 |    100.00 

Above, we see the trade-off between sensitivity and specificity.

Lowering the cutoff increases sensitivity to .81 but decreases specificity to .44.

The best cutoff choice depends on the relative costs of false positives and false negatives.
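
Rather than tabulating by hand, Stata's estat classification command (available after logit and logistic) reports the classification table, sensitivity, and specificity directly; the cutoff() option sets the threshold:

/* classification table, sensitivity, and specificity at a chosen cutoff */
estat classification, cutoff(0.25)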

Receiver Operating Characteristic (ROC) curves

The most widely reported measure of logistic regression predictive power is the area under the curve (AUC) of a receiver operating characteristic (ROC) curve.

A graph of the ROC curve plots sensitivity against 1-specificity, or the true positive rate vs the false positive rate, across a range of different cutoffs.

The area under the ROC curve is an overall measure of its classification accuracy across different cutoffs.

Below is a figure displaying ROC curves for three different models run on the same data, but with different numbers of predictors:

[Figure: Comparison of ROC curves between 3 competing models (1, 5, and 7 predictors); sensitivity vs 1-specificity]

In the graph above:

  • sensitivity (true positive rate) is the y-axis
  • 1-specificity (false positive rate) is the x-axis
  • each curve represents a different logistic regression model
    • each point represents a different cutoff, determined by that model’s predicted probabilities
    • points towards the bottom left correspond to cutoffs near one, and points towards the upper right to cutoffs near zero
  • the closer the curve hugs the y-axis and the top axis of the graph, the better the classifier overall
    • the AUC grows as the graph approaches these axes up to its maximum of 1
    • the green curve’s model is the best classifier here, but with 2 additional predictors it performs only a little better than the red curve’s model, so may be overfit
  • the diagonal reference line represents random guessing
    • here AUC=.5

Stata’s lroc command plots the ROC curve for a logit model and calculates the area under the curve.

/* plot ROC curve and estimate AUC */
lroc
Logistic model for low

Number of observations =      189
Area under ROC curve   =   0.7077


[Figure: ROC curve for the fitted model; sensitivity vs 1 - specificity; area under ROC curve = 0.7077]

An AUC of .708 suggests our model is better than random guessing, but will still make a substantial number of classification errors.
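
To compare AUCs across competing models, as in the earlier figure, one option is roccomp, which takes the outcome and a set of predicted-probability variables (a sketch; the predictor sets and the names p_small and p_full are illustrative, not the models behind the figure):

/* sketch: compare ROC curves of two competing models fit to the same data */
quietly logit low i.smoke
predict p_small, pr
quietly logit low age ptl i.ht i.smoke i.race
predict p_full, pr
roccomp low p_small p_full, graph summary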

The AUC equals another commonly reported measure of predictive power, the concordance or \(c\) statistic, which has a useful interpretation:

  • consider all pairs of observations in which one member has an observed 1 on the outcome and the other has an observed 0
  • the concordance is the proportion of these pairs in which the observation with the observed 1 has the higher predicted probability (a quick check with roctab is sketched below)
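
We can verify this equality on our data with roctab, which computes the nonparametric AUC directly from the observed outcome and the predicted probabilities saved earlier as predp:

/* nonparametric AUC (the concordance) from the outcome and predicted probabilities */
roctab low predp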

Logistic regression model assumptions

Validity of the logistic regression model estimates and subsequent inference depends on key assumptions:

  • independence of observations
  • linearity in the logits

Independence of observations

To assess the independence assumption, consider the study design:

  • random sampling should result in independence
  • clustered sampling or repeated measurements on the same units often result in nonindependence

If nonindependence is suspected, the following methods can be used to account for the nonindependence:

  • random effects in multilevel/mixed models (melogit and xtlogit, re)
  • fixed effects (xtlogit, fe or clogit)
  • generalized estimating equations (xtgee, family(binomial))

Linearity in the logits: Pregibon’s link test

Logistic regression assumes that the predictors have a linear relationship with the logit of the outcome.

Practically, this means we have modeled the correct functional form of the relationship.

Violation of linearity suggests that additional non-linear terms should be added (e.g. polynomial terms, splines, interactions).

One method to assess linearity is to use Pregibon’s link test (Pregibon, 1980):

  • formally, the test assesses whether the transformation of the outcome via the link function, here the logit, results in linear relationships with the predictors
    • violation would suggest changing the link (to probit, for example)
  • in practice, this test is often used to assess whether nonlinear terms should be added to the model
  • the test is conducted by refitting the outcome on the model’s linear predictor (the predicted logit) and its square
  • significance of the squared term suggests a violation of linearity
/* Pregibon's link test to assess linearity */
linktest
Iteration 0:  Log likelihood =   -117.336  
Iteration 1:  Log likelihood = -105.30024  
Iteration 2:  Log likelihood = -104.60155  
Iteration 3:  Log likelihood =  -104.5903  
Iteration 4:  Log likelihood =  -104.5903  

Logistic regression                                     Number of obs =    189
                                                        LR chi2(2)    =  25.49
                                                        Prob > chi2   = 0.0000
Log likelihood = -104.5903                              Pseudo R2     = 0.1086

------------------------------------------------------------------------------
         low | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        _hat |   .7966206   .3323518     2.40   0.017     .1452231    1.448018
      _hatsq |  -.1584433   .2026237    -0.78   0.434    -.5555784    .2386918
       _cons |   .0237742   .2273172     0.10   0.917    -.4217593    .4693077
------------------------------------------------------------------------------

In the output above, _hatsq is the squared linear predictor, and its large p-value of .434 suggests no gross violation of linearity.
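
Equivalently, the test can be rebuilt by hand from the model's linear predictor (a sketch; xb_hat and xb_hat_sq are names introduced here, and the model of interest must be the active estimation when predict is called, so we refit it first):

/* linktest by hand: regress the outcome on the linear predictor and its square */
quietly logit low age ptl i.ht i.smoke i.race   /* make the model of interest active */
predict xb_hat, xb                              /* linear predictor (predicted logit) */
generate xb_hat_sq = xb_hat^2
logit low xb_hat xb_hat_sq                      /* coefficients mirror _hat and _hatsq */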

Linearity in the logits: graphical assessment

We can also use a graphical method to assess linearity for a continuous predictor:

  • create a categorical version of the predictor by grouping the data into quantiles of the predictor
  • replace the continuous version of the predictor with the categorical version in the model
  • plot the predicted logits across age groups
    • remember the predictors have a linear relationship with the logit, not the probability, of the outcome
    • margins can predict logits (with option predict(xb)) and marginsplot can plot them
  • a straight line suggests linearity
/* create a categorical age variable based on age quintiles */
egen age_cat = cut(age), group(5)
/* enter categorical age in logit model as factor */
logit low i.age_cat ptl i.ht i.smoke i.race, or
/* predicted logits across age categories, predict(xb) requests logits */
margins age_cat, predict(xb)
/* graph predicted logits across age categories */
marginsplot

[Figure: Predictive margins of age_cat with 95% CIs; linear prediction (log odds) vs age_cat]

The above graph suggests some non-linearity, but the form does not seem biologically plausible so may be due to sampling variability.
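
If the curvature were believed to be real, one option would be to add a nonlinear term for age to the original model and compare fit, for example with a quadratic term (a sketch; m_agesq is a name introduced here):

/* sketch: allow curvature in age with a quadratic term and compare to the stored full model */
logit low c.age##c.age ptl i.ht i.smoke i.race, or
estimates store m_agesq
lrtest m_full m_agesq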

Influential observations

It is important to assess the sensitivity of the model’s estimates to the inclusion or exclusion of a few influential observations.

A large change in model estimates after excluding some observations casts doubt on the robustness of the results.

Stata’s predict can calculate 3 different influence statistics after logistic regression.

Each influence statistic is calculated for each observation and measures how much some model statistic changes after omitting that observation, and all others that share the same predictor values, from the regression.

Use the following predict options to request the influence statistics:

  • dbeta: measures change in the regression coefficients
  • dx2: measures change in Pearson’s \(\chi^2\) statistic
  • ddeviance: measures change in deviance, which quantifies the difference in the fit of the current model vs a model that perfectly predicts the outcome

Below, we assess the model’s sensitivity to influential observations with the following procedure:

  • calculate the delta-beta (dbeta) influence statistic with predict
  • sort the data by the influence statistic to identify the most influential observations
  • rerun the logistic regression omitting the most influential observations
/* calculate delta-beta influence statistic and save as variable dbeta */
predict dbeta, dbeta
/* sort data by dbeta, descending */
gsort -dbeta
/* list 10 observations with largest dbeta */
li dbeta in 1/10
     |    dbeta |
     |----------|
  1. | 1.007646 |
  2. | 1.007646 |
  3. | .6772961 |
  4. |  .577143 |
  5. |  .577143 |
     |----------|
  6. |  .577143 |
  7. |  .577143 |
  8. |  .577143 |
  9. |  .577143 |
 10. | .4954283 |
     +----------+

Above we see 2 observations have dbeta values above 1, followed by a substantial drop in dbeta for the next observations.

We will rerun the logit model without these 2 observations:

/* rerun model without 2 observations with dbeta > 1 */
logit low age ptl i.ht i.smoke i.race if dbeta < 1, or
Iteration 0:  Log likelihood = -116.58273  
Iteration 1:  Log likelihood = -103.08216  
Iteration 2:  Log likelihood = -102.68944  
Iteration 3:  Log likelihood = -102.68783  
Iteration 4:  Log likelihood = -102.68783  

Logistic regression                                     Number of obs =    187
                                                        LR chi2(6)    =  27.79
                                                        Prob > chi2   = 0.0001
Log likelihood = -102.68783                             Pseudo R2     = 0.1192

------------------------------------------------------------------------------
         low | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .9466164   .0340673    -1.52   0.127     .8821462    1.015798
         ptl |   2.058488   .6825838     2.18   0.029     1.074718    3.942778
        1.ht |   6.088608   4.463934     2.46   0.014     1.446937    25.62043
             |
       smoke |
     Smoker  |   2.870167   1.120249     2.70   0.007       1.3356     6.16791
             |
        race |
      Black  |   2.365727   1.203828     1.69   0.091     .8726108      6.4137
      Other  |   2.624106   1.106879     2.29   0.022     1.147981    5.998299
             |
       _cons |   .4795685   .4324742    -0.81   0.415     .0818914    2.808427
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

The estimates and inferences have changed slightly (more so for the ht coefficient), but overall the model’s estimates are not too sensitive to the inclusion of these 2 observations.
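
Analogous checks can be run with the other influence statistics; for example, plotting the change in Pearson's \(\chi^2\) against the predicted probabilities saved earlier as predp highlights poorly fit covariate patterns (a sketch; dx2_stat is a name introduced here, and the full model should be the active estimation when predict is called):

/* sketch: inspect the change-in-Pearson-chi2 influence statistic graphically */
quietly logit low age ptl i.ht i.smoke i.race   /* refit the full model on all observations */
predict dx2_stat, dx2
scatter dx2_stat predp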

Overdispersion in the logistic regression model

Unlike the linear regression model, the logistic regression model has no separate error term \(\epsilon_i\) whose variance must be estimated.

The variance of binomially-distributed variables is determined by the probability parameter \(p\) (\(n\) is fixed in the data).

If there is unmodeled heterogeneity in \(p\), the variance of the data may be larger than expected by the model, known as overdispersion.

For example:

  • imagine we have data for 100 patients from 2 hospitals (50 each) and we are modeling the probability that a patient is discharged on the same day they are admitted
  • we believe the 2 hospitals have the same probability of same-day-discharge, \(p=.5\)
  • unknown to us, one hospital actually has \(p=.95\) and the other has \(p=.05\), so their average is \(p=.5\)
  • if we repeatedly collect data on 50 patients from each hospital year after year, assuming the probabilities stay the same, one hospital will show counts of same-day discharges near 50 and the other will show counts near 0
  • values near 50 and 0 are extremely rare for a binomial distribution with \(n=50\) and \(p=.5\), where we expect numbers closer to 25
    • the variance of the observed data will be much larger than expected if both hospitals actually had \(p=.5\)
  • thus, the unmodeled heterogeneity in \(p\) causes the overdispersion

Methods to address overdispersion in logistic regression model:

  • Random effects to model the heterogeneity (see the sketch after this list)
  • Generalized estimating equations to account for correlated observations (which results in heterogeneity)
  • Robust standard errors or adjusted standard errors in quasibinomial models
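
For the hospital example above, a random intercept for hospital would model the heterogeneity in \(p\) directly; a minimal hypothetical sketch (sameday, treated, and hospital_id are illustrative variable names, not variables in our data):

/* hypothetical sketch: a random intercept for hospital captures between-hospital heterogeneity in p */
melogit sameday i.treated || hospital_id:, or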

Exercise 3

Assess the linearity assumption of the logistic regression model run on the NHANES data:

  1. Evaluate the linearity assumption with linktest
  2. Use the graphical method to get an idea of the form of the relationship between the logit of high blood pressure and age

References

Allison, P. D. (1999). Comparing Logit and Probit Coefficients Across Groups. Sociological Methods & Research, 28(2), 186-208. https://doi.org/10.1177/0049124199028002003

Allison, P. D. (2014, March). Measures of fit for logistic regression. In Proceedings of the SAS global forum 2014 conference (pp. 1-13). SAS Institute Inc. Cary, NC, USA.

Hosmer, D. W., & Lemeshow, S. (1980). A goodness-of-fit test for the multiple logistic regression model. Communications in Statistics.

Hosmer Jr, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression. John Wiley & Sons.

Norton, E. C., Dowd, B. E., Garrido, M. M., & Maciejewski, M. L. (2024). Requiem for odds ratios. Health services research, 59(4), e14337.

Pregibon, D. (1980). Goodness of link tests for generalized linear models. Journal of the Royal Statistical Society Series C: Applied Statistics, 29(1), 15-24.

VanderWeele, T. J., & Knol, M. J. (2014). A tutorial on interaction. Epidemiologic methods, 3(1), 33-72.

Williams, R. (2017). Using Stata’s margins command to estimate and interpret adjusted predictions and marginal effects. The Stata Journal, 308-331.