Logistic Regression in SAS

UCLA IDRE Statistical Consulting

Objective

This seminar describes how to conduct a logistic regression using proc logistic in SAS. We try to simulate the typical workflow of a logistic regression analysis, using a single example dataset to show the process from beginning to end.

In this seminar, we will cover:

The Logistic Regression Model

Binary variables

Binary variables have 2 levels. We typically use the numbers 0 (FALSE/FAILURE) and 1 (TRUE/SUCCESS) to represent the two levels.

Conveniently, when represented as 0/1, the mean of a binary variable is the probability of observing a 1: $$y = (0, 1, 1, 1, 1, 0, 1, 1)$$ $$\bar{y} = .75 = Prob(y=1)$$

The above data for \(y\) could represent a sample of 8 people, where 0=Healthy and 1=Sick.

Now imagine drawing 1000 random samples of 8 people from the same population, where \(Prob(Y=1)=.75\). The number of sick people in each sample will not always be 6, due to randomness in the selection. However, the average number of sick people across those 1000 samples should be close to 6.
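As a quick check (not part of the seminar itself), we could simulate this repeated sampling with a short data step; the dataset name and seed below are arbitrary.

* simulate 1000 samples of 8 people from a population where Prob(Y=1)=.75;
data binom_sim;
call streaminit(27);  * arbitrary seed;
do sample = 1 to 1000;
   sick = rand('BINOMIAL', 0.75, 8);  * number of sick (Y=1) out of 8;
   output;
end;
run;

* the mean count of sick should be close to 6;
proc means data=binom_sim mean;
var sick;
run;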

Binomial distribution

The number of sick people in each of our samples of 8 people can be described as a variable with a binomial distribution.

The binomial distribution models the number of successes in \(n\) trials with probability \(p\). For a binomially-distributed variable with \(n=8\) and \(p=.75\), we can expect something like the following distribution of successes:

binomial distribution p=.75

Now imagine we administered a drug to a subgroup of these people, and for this group the probability of being sick is \(Prob(Y=1)=.25\). Their expected distribution of the number of sick per sample of 8 people would look like this:

binomial distribution p=.25

The mean of the binomial distribution, or the expected number of observations where \(Y=1\), is simply the number of trials times the probability of the outcome:

$$\mu_Y = np$$

The variance of the binomial distribution is a function of the same 2 parameters, \(n\) and \(p\), so it is not independent of the mean (as it is in the normal distribution):

$$\sigma_Y^2 = np(1-p)$$

The variance of a binomially distributed variable is largest at \(p=.5\) and equal to zero at the extremes, \(p=1\) and \(p=0\), where everyone has the same outcome value.
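For example, with \(n=8\) and \(p=.75\) as above, the expected count is \(\mu_Y = 8 \times .75 = 6\) sick people per sample, and the variance is \(\sigma_Y^2 = 8 \times .75 \times .25 = 1.5\).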

Regression of binary outcomes

With regression models, we attempt to model the relationship between the mean of an outcome (dependent variable) and a set of predictors (independent variables). Probabilities are means of binary variables, so we can model them as outcomes in regression. We could attempt to model the relationship between the probability of being sick and using the drug with a linear regression:

$$y_i=\alpha + \beta x_i + \epsilon_i$$ $$Prob(Y_i=1) = \alpha + \beta drug_i + \epsilon_i$$

However, linear regression of binary outcomes is problematic for several reasons:

Instead, we would like to use a regression model that assumes that the outcome is binomially distributed and restricts the values of the outcome to be within [0,1]. Logistic regression will help with this, but it models odds directly, rather than probabilities.

Odds

Odds are a transformation of a probability, and express the likelihood of an event in a different way. Given an event has probability \(p\) of occurring, the odds of the event is:

$$odds = \frac{p}{(1-p)}$$

So, the \(odds\) of an event is the ratio of the probability of the event occurring to the probability of it not occurring. If \(p=.5\), then \(odds=\frac{.5}{.5}=\frac{1}{1}\), which expresses that the event has 1-to-1 odds, or equal odds of occurring versus not occurring. If \(p=.75\), then \(odds=\frac{.75}{.25}=\frac{3}{1}\), and the event has 3-to-1 odds, or the event should occur 3 times for each 1 time that it does not occur.

Odds have a monotonic relationship with probabilities, so when the odds increase, the underlying probability has also increased. The following graph plots odds for probabilities in the interval \([.1, .9]\). We see the monotonicity as well as the range of values that odds can take:

odds for probabilities .1 to .9

The graph suggests that \(odds=0\) when \(p=0\) and that the odds grow without bound as \(p\) approaches 1, so odds can take on any value in \([0, \infty)\).

Logits

Conveniently, if a variable \(x\) takes values in \((0, \infty)\), then its logarithm, \(log(x)\), takes values in \((-\infty, \infty)\). Thus, since \(odds\) is defined over \([0, \infty)\), \(log(odds)\) is defined over \((-\infty, \infty)\), approaching \(-\infty\) as \(p\) approaches 0.

logits for odds from 0.1 to 10

The logarithm of the odds for probability \(p\) is also known as the \(logit\) of \(p\):

$$logit(p) = log\left(\frac{p}{1-p}\right)$$

Because the logit can take on any real value, using it as the dependent variable in a regression model eliminates the problem of producing predictions from the model outside of the range of the dependent variable. This was a big problem when using the probability \(p\) directly as the outcome.
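As a small illustration (not part of the seminar), a data step can tabulate the odds and logit for a few probabilities; the dataset name is arbitrary.

* compute odds and logits for a few probabilities;
data odds_logit;
do p = 0.1, 0.25, 0.5, 0.75, 0.9;
   odds = p / (1 - p);
   logit = log(odds);  * natural log;
   output;
end;
run;

proc print data=odds_logit;
run;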

The logistic regression model

The logistic regression model asserts a linear relationship between the logit (log-odds) of the probability of the outcome, \(p_i = Prob(Y_i=1)\), and the \(k\) explanatory predictors (independent variables) for each observation \(i=1,...,n\):

$$log\left(\frac{p_i}{1-p_i}\right) = \alpha + \beta_{1}x_{i1} + \beta_{2}x_{i2} + ... + \beta_{k}x_{ik}$$

First, notice there is no error term \(\epsilon_i\) in the model. Unlike in linear regression, no residual variance is estimated for the model. Remember that for a binomially distributed variable, the variance is determined by the probability (mean) of the outcome, \(\sigma_Y^2 = np(1-p)\), so does not need to be independently estimated.

Second, notice that while the explanatory predictors have a linear relationship with the logit, they do not have a linear relationship with the probability of the outcome. Instead, continuous variables have an S-shaped relationship with the probability of the outcome. Once we have estimated our logistic regression coefficients, the \(\beta\)s, we can use the following alternate form of the model to get the predicted probability:

$$p_i = \frac{exp(\alpha + \beta_{1}x_{i1} + \beta_{2}x_{i2} + ... + \beta_{k}x_{ik})}{1 + exp(\alpha + \beta_{1}x_{i1} + \beta_{2}x_{i2} + ... + \beta_{k}x_{ik})}$$

For example, imagine we have estimated the following logistic regression model where the logit of some outcome is predicted by age:

$$log\left(\frac{p_i}{1-p_i}\right) = -5 + .1age_{i}$$

To get predicted probabilities, we can use this form:

$$p_i = \frac{exp(-5 + .1age_{i})}{1 + exp(-5 + .1age_{i})}$$

A graph plotting predicted probabilities against ages ranging from 0 to 100 shows the expected S-shape:

predicted probabilities age 0 to 100

This non-linear relationship implies that 1-unit increases in age do not always translate to a constant increase in the probabilities. Indeed, moving from age 50 to 51 causes a larger increase in probability (more vertical distance) than moving from age 99 to 100. Importantly, though, the change in odds (expressed as a ratio) is constant over the entire range of age.
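We can verify this with a short data step using the hypothetical intercept and slope above:

* predicted probabilities from logit(p) = -5 + .1*age at selected ages;
data age_probs;
do age = 50, 51, 99, 100;
   p = exp(-5 + 0.1*age) / (1 + exp(-5 + 0.1*age));
   output;
end;
run;

proc print data=age_probs;
run;

The jump from age 50 to 51 (about .500 to .525) is much larger than the jump from age 99 to 100 (about .9926 to .9933), even though both are 1-unit increases.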

Practically, this also means that one cannot know the difference in probabilities due to a unit change in a given predictor without knowing the values of the other predictors in the model.

Interpreting logistic regression coefficients - logit metric

Logistic regression coefficients can be interpreted in at least 2 ways. First we'll look at the interpretation in the logit metric. Imagine we have a very simple logistic regression model, with a single predictor, x:

$$log\left(\frac{p_i}{1-p_i}\right) = \alpha + \beta_{1}x$$

If we set \(x=0\), this reduces to:

$$log\left(\frac{p_{x=0}}{1-p_{x=0}}\right) = \alpha$$

So the intercept, \(\alpha\), is interpreted as the logit (log odds) of the outcome when \(x=0\). This generalizes to multiple predictors to be the logit of the outcome when all predictors are equal to 0, \(x_1 = x_2 = ... = x_k = 0\).

Now let's estimate the logit when \(x=1\):

\begin{align} log\left(\frac{p_{x=1}}{1-p_{x=1}}\right) &= \alpha + \beta_1(1) \\ log\left(\frac{p_{x=1}}{1-p_{x=1}}\right) &= \alpha + \beta_1 \end{align}

Now we take the difference between the logit for \(x=1\) and for \(x=0\):

\begin{align} log\left(\frac{p_{x=1}}{1-p_{x=1}}\right) - log\left(\frac{p_{x=0}}{1-p_{x=0}}\right) &= (\alpha + \beta_1) - (\alpha) \\ log\left(\frac{p_{x=1}}{1-p_{x=1}}\right) - log\left(\frac{p_{x=0}}{1-p_{x=0}}\right) &= \beta_1 \end{align}

In the logit metric, the \(\beta\)s express the change in logits, or log odds, of the outcome for a 1-unit increase in the predictor (similar to linear regression). Positive coefficients signify increases in the probability with increases in the predictor, though not in a linear fashion.

Interpreting exponentiated logistic regression coefficients - odds metric

For the following section, we will use \(odds_p\) in place of \(\frac{p}{1-p}\) to make the formulas more readable.

Recall the inverse function for the logarithm is the exponentiation function:

$$exp(log(x)) = x$$

Thus, we can get ourselves out of the logit metric through exponentiation. In the last slide, we saw that the intercept \(\alpha\) is the estimated logit of the outcome when the predictors are equal to zero:

$$log(odds_{p_{x=0}}) = \alpha$$

Exponentiating both sides yields:

$$odds_{p_{x=0}} = exp(\alpha)$$

The exponentiated intercept is then interpreted as the odds of the outcome when the predictors are equal to zero.


Now let's interpret an exponentiated \(\beta\). First, recall the expression for \(\beta\) in the logit metric:

$$log(odds_{p_{x=1}}) - log(odds_{p_{x=0}}) = \beta_1$$

Using the following logarithm identity:

$$log A - log B = log(\frac{A}{B})$$

We can convert the expression for the \(\beta\) coefficient to this:

$$log\left(\frac{odds_{p_{x=1}}}{odds_{p_{x=0}}}\right) = \beta_1$$

Exponentiating both sides yields:

\begin{align} \frac{odds_{p_{x=1}}}{odds_{p_{x=0}}} &= exp(\beta_1) \\ \\ \frac{\frac{p_{x=1}}{1-p_{x=1}}}{\frac{p_{x=0}}{1-p_{x=0}}} &= exp(\beta_1) \end{align}

Exponentiated \(\beta\) coefficients express the change in the odds due to a change in the predictor as a ratio of odds, and are thus referred to as odds ratios.

The effects of predictors expressed as log-odds or as odds ratios are independent of the values of the other predictors in the model (unless they are interacted).

Seminar Dataset

The Titanic dataset

The dataset can be downloaded from here.

To demonstrate logistic regression modeling with PROC LOGISTIC, we will be using a dataset describing survival of passengers on the Titanic. The dataset was obtained from the Vanderbilt University Biostatistics Department datasets page. Sources for the data include Encyclopedia Titanica and Titanic: Triumph and Tragedy, by Eaton & Haas (1994), Patrick Stephens Ltd.

The original dataset contains 1,309 passengers and 14 variables. However, to avoid dealing with missing data issues in this seminar, we have restricted the dataset to the 1,043 passengers with no missing values on the following 7 variables:

We always recommend familiarizing yourself with model variables before running the model, which we do now.
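Summaries like those in the next few subsections can be produced with standard descriptive procedures; a sketch of the kind of code involved (assuming the data are in work.titanic):

* descriptive statistics for the outcome and continuous predictors;
proc means data=titanic n mean std min max;
var survived age fare;
run;

* frequency tables for the discrete variables;
proc freq data=titanic;
tables survived female pclass parch sibsp;
run;

* histograms for the continuous predictors;
proc sgplot data=titanic;
histogram age;
run;

proc sgplot data=titanic;
histogram fare;
run;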

Outcome: survived

The outcome survived is a binary 0/1 variable that indicates whether a passenger survived the sinking of the Titanic. The mean of survived, or \(Prob(survived=1)\), is about 0.41. So, less than half of the passengers survived.

Analysis Variable : survived survived
N Mean Std Dev Minimum Maximum
1043 0.4074784 0.4916009 0 1.0000000

survived frequencies

Discrete predictors: female, pclass, parch, and sibsp

The variable female is a binary 0/1 variable signifying male/female. The majority of passengers were male:

sex frequencies

The variable pclass describes the passenger's travel "class" and is a three-level, ordered categorical variable, where 1, 2, and 3 signify first, second, and third class, respectively. Almost half of the passengers were travelling third class:

pclass frequencies

The variables parch and sibsp are count variables, counting the number of parents/children and siblings/spouses aboard. Most passengers traveled alone.

parch frequencies
sibsp frequencies

Continuous predictors: age and fare

The variable age ranges from 0.1667 years to 80.

age histogram

The variable fare ranges from 0 to 512.33 pounds.

fare histogram

Forming hypotheses about what predicts surviving the Titanic sinking

Based on our intuition and vague recollections of the surely historically accurate Titanic movie, we might hypothesize that:

Correlations among variables

To preview whether our hypotheses might be confirmed, and whether predictors are correlated and might thus explain the same variance in survival, we inspect a correlation matrix of all of our variables. Correlations only account for linear relationships, so a non-significant correlation does not imply that two variables are unassociated. Note: We do not recommend forming hypotheses about survival after looking at this matrix.
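The matrix below is standard proc corr output; a minimal sketch of the call (again assuming the data are in work.titanic):

* Pearson correlations among the outcome and all predictors;
proc corr data=titanic;
var survived female pclass parch sibsp age fare;
run;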

Pearson Correlation Coefficients, N = 1043
Prob > |r| under H0: Rho=0 (p-values shown beneath each correlation)

          survived   female     pclass     parch      sibsp      age        fare
survived   1.00000    0.53633   -0.31774    0.11544   -0.01140   -0.05742    0.24786
                       <.0001     <.0001     0.0002     0.7130     0.0638     <.0001
female     0.53633    1.00000   -0.14103    0.22253    0.09646   -0.06601    0.18640
            <.0001                <.0001     <.0001     0.0018     0.0330     <.0001
pclass    -0.31774   -0.14103    1.00000    0.01634    0.04633   -0.40908   -0.56456
            <.0001     <.0001                0.5981     0.1348     <.0001     <.0001
parch      0.11544    0.22253    0.01634    1.00000    0.37396   -0.14931    0.21765
            0.0002     <.0001     0.5981                <.0001     <.0001     <.0001
sibsp     -0.01140    0.09646    0.04633    0.37396    1.00000   -0.24235    0.14213
            0.7130     0.0018     0.1348     <.0001                <.0001     <.0001
age       -0.05742   -0.06601   -0.40908   -0.14931   -0.24235    1.00000    0.17721
            0.0638     0.0330     <.0001     <.0001     <.0001                <.0001
fare       0.24786    0.18640   -0.56456    0.21765    0.14213    0.17721    1.00000
            <.0001     <.0001     <.0001     <.0001     <.0001     <.0001

We see that:

Fitting the Logistic Regression Model with proc logistic

Basic proc logistic syntax

The dataset is rather large at 1,043 observations, with a not terribly skewed split on the binary outcome survived (425 survived, 618 did not). Therefore, we feel we have a sufficient sample size, and enough power, to include all predictors in our initial model.

proc logistic data=titanic descending;
class pclass female;
model survived = pclass age sibsp parch fare female / expb;
run;

First model output, part 1

We will examine the output of our first logistic regression model in detail:

The SAS System

The LOGISTIC Procedure

Model Information
Data Set WORK.TITANIC  
Response Variable survived survived
Number of Response Levels 2  
Model binary logit  
Optimization Technique Fisher's scoring  

The Model Information table asserts that we are running a binary logistic regression on the outcome survived in the dataset work.titanic, and that the Fisher scoring (iteratively reweighted least squares) algorithm is used to find the maximum likelihood solution for the parameter estimates.


Number of Observations Read 1043
Number of Observations Used 1043

This table tells us that all 1,043 observations were used in the model.


Response Profile
Ordered
Value
survived Total
Frequency
1 1 425
2 0 618

The Response Profile table shows the unconditional frequencies of the outcome survived.


Probability modeled is survived=1.

This statement asserts that we are modeling the probability of survived=1 (the result of using the option descending on the proc logistic statement).

Class Level Information
Class Value Design Variables
pclass 1 1 0
  2 0 1
  3 -1 -1
female 0 1  
  1 -1  

The Class Level Information table shows us that "effect" parameterization has been used for our categorical predictors. With effect parameterization:

We will discuss using an alternative class variable parameterization, reference parameterization (dummy coding), later.


Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

The Model Convergence Status table tells us that the (default) model convergence criterion has been met with the current set of model parameters. This means that the algorithm searching for the best solution cannot find a set of parameters that improves the fit of the model more than .00000001 (measured by the objective/loss function) compared to the current set of parameters.


Model Fit Statistics
Criterion Intercept Only Intercept and
Covariates
AIC 1411.985 985.058
SC 1416.935 1024.657
-2 Log L 1409.985 969.058

The Model Fit Statistics table displays two relative fit statistics, AIC and SC(BIC), and -2 times the log likelihood of the model. Generally, you'll want the values in the Intercept and Covariates column, as these reflect the full model just fit. Lower values on AIC and SC signify better fit (both penalize adding more parameters to the model).

Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 440.9268 7 <.0001
Score 391.2102 7 <.0001
Wald 277.4250 7 <.0001

The Testing Global Null Hypothesis: BETA=0 table contains the results of three tests of the global null hypothesis that all coefficients (except for the intercept) are equal to 0, often thought of as a test of the utility of the whole model:

$$\beta_1 = \beta_2 = ... = \beta_k = 0$$

These tests are asymptotically equivalent (equivalent in very large samples), so they will usually agree. In cases where they don't, many prefer the likelihood ratio test (Hosmer et al., 2013).

Type 3 Analysis of Effects
Effect DF Wald
Chi-Square
Pr > ChiSq
pclass 2 70.7619 <.0001
age 1 34.9919 <.0001
sibsp 1 11.4336 0.0007
parch 1 0.3366 0.5618
fare 1 0.3886 0.5330
female 1 214.8255 <.0001

The Type 3 Analysis of Effects table conducts Wald tests of the coefficients against 0 for each effect in the model. For the effect of a categorical variable with more than 2 levels, this will be a joint test of several coefficients. For example, the Wald test for pclass jointly tests its 2 coefficients against zero and is significant, suggesting systematic differences in survival among passenger classes.

The interpretation of these Type 3 tests depends on the parameterization used in the class statement and on whether there are interactions in the model. With "effect" parameterization and no interaction, these Wald tests are tests of equality of means across levels of the variable (tests of main effects). See here for more information.

First model output, part 2: model coefficients

Analysis of Maximum Likelihood Estimates
Parameter   DF Estimate Standard
Error
Wald
Chi-Square
Pr > ChiSq Exp(Est)
Intercept   1 1.3424 0.2515 28.4870 <.0001 3.828
pclass 1 1 1.1788 0.1644 51.3910 <.0001 3.251
pclass 2 1 -0.1048 0.1257 0.6952 0.4044 0.901
age   1 -0.0393 0.00665 34.9919 <.0001 0.961
sibsp   1 -0.3577 0.1058 11.4336 0.0007 0.699
parch   1 0.0597 0.1029 0.3366 0.5618 1.062
fare   1 0.00121 0.00194 0.3886 0.5330 1.001
female 0 1 -1.2727 0.0868 214.8255 <.0001 0.280

The Analysis of Maximum Likelihood Estimates table lists the model coefficients, their standard errors, and tests of their significance.

Odds Ratio Estimates
Effect Point Estimate 95% Wald
Confidence Limits
pclass 1 vs 3 9.515 5.584 16.212
pclass 2 vs 3 2.636 1.779 3.905
age 0.961 0.949 0.974
sibsp 0.699 0.568 0.860
parch 1.062 0.868 1.299
fare 1.001 0.997 1.005
female 0 vs 1 0.078 0.056 0.110

The Odds Ratio Estimates table displays the exponentiated coefficients for predictors in the model and their confidence intervals (formed by exponentiating the confidence limits on the logit scale). An odds ratio of 1 signifies no change in the odds, so confidence limits that contain 1 represent effects without strong evidence in the data. We will interpret the point estimates of the odds ratios:

First model output, part 3, Association statistics

Association of Predicted Probabilities and
Observed Responses
Percent Concordant 84.5 Somers' D 0.692
Percent Discordant 15.3 Gamma 0.694
Percent Tied 0.2 Tau-a 0.335
Pairs 262650 c 0.846

The final table in our first model output, the succinctly titled Association of Predicted Probabilities and Observed Responses table, lists several measures of the ability of the current model to correctly predict the survival status of the observations. These statistics are useful for assessing the predictive accuracy of the model. We will discuss them in more detail later when we discuss assessing model fit.

Changing to reference parameterization

Unfortunately, many people ignore the Class Level Information and Odds Ratio Estimates tables, do not realize that "effect" parameterization is the default, and thus misinterpret the coefficients of class variables. Many people assume the use of dummy coding for categorical variables, known as "reference" parameterization in SAS. Our experience is that most researchers prefer reference parameterization.

Specify param=ref after the / on the class statement to change all variables to reference parameterization. Alternatively, specify (param=ref) directly after a variable's name to change only that variable's parameterization.

proc logistic data=titanic descending;
class pclass female / param=ref;
model survived = pclass age sibsp parch fare female;
run;

We only show some output:

Class Level Information
Class Value Design Variables
pclass 1 1 0
  2 0 1
  3 0 0
female 0 1  
  1 0  

We see that the design variables have changed to dummy coding. The default omitted/reference group in SAS is the last group. We could change the reference group for pclass by specifying (ref='1') or (ref='2') after pclass on the class statement.
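For example, to make first class the reference group instead (a sketch; output not shown):

proc logistic data=titanic descending;
class pclass(ref='1') female / param=ref;
model survived = pclass age sibsp parch fare female;
run;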

Type 3 Analysis of Effects
Effect DF Wald
Chi-Square
Pr > ChiSq
pclass 2 70.7619 <.0001
age 1 34.9919 <.0001
sibsp 1 11.4336 0.0007
parch 1 0.3366 0.5618
fare 1 0.3886 0.5330
female 1 214.8255 <.0001

The Type 3 Analysis of Effects table is exactly the same as the table for the model using "effect" parameterization, indicating that switching between these parameterizations does not change the overall fit of the model -- only the interpretation of the coefficients.

Analysis of Maximum Likelihood Estimates
Parameter   DF Estimate Standard
Error
Wald
Chi-Square
Pr > ChiSq Exp(Est)
Intercept   1 1.5411 0.2468 39.0025 <.0001 4.670
pclass 1 1 2.2528 0.2719 68.6480 <.0001 9.515
pclass 2 1 0.9692 0.2005 23.3621 <.0001 2.636
age   1 -0.0393 0.00665 34.9919 <.0001 0.961
sibsp   1 -0.3577 0.1058 11.4336 0.0007 0.699
parch   1 0.0597 0.1029 0.3366 0.5618 1.062
fare   1 0.00121 0.00194 0.3886 0.5330 1.001
female 0 1 -2.5454 0.1737 214.8255 <.0001 0.078

Here, the model coefficients and associated statistics have changed for pclass and female. Now the pclass coefficients are interpreted as differences from the reference third class. The female coefficient is the difference for males (female=0) relative to females (female=1), the reference group.

Because the model fit is the same as the effect parameterization model, odds ratios comparing groups are the same, as is the Odds Ratio Estimates table. Again, 1st class passengers have 9.5 times the odds of survival of 3rd class passengers.

Assessing Model Fit

Checking logistic regression model assumptions

The validity of the results of any regression model depends on the plausibility of the model assumptions. The logistic regression model makes no distributional assumptions regarding the outcome (it just needs to be binary), unlike linear regression, which assumes normally-distributed residuals. We thus need to verify only the following logistic regression model assumptions:

There is no test for independence in the data (though we suspect that the sibsp effect may be due to correlated outcomes among families), and we have no additional predictors to add in our dataset, so we will only be assessing the first assumption.

We will use global tests of model fit and graphical methods to assess possible poor model fit, which might be remedied by the inclusion of non-linear effects and interactions into the model.

Global tests of model fit

Tests of model fit are designed to measure how well model expected values match observed values. In the case of logistic regression models, these measures quantify how well the predicted probability of the outcome matches the observed probability across groups of subjects.

Omission of non-linear effects and interactions can lead to poor estimates of predicted probabilities, which these tests may detect.

One commonly used test statistic is the Deviance \(\chi^2\) (sometimes just called Deviance \(D\)), which measures how far the current model is from a perfectly fitting saturated model. A saturated model produces predicted probabilities that perfectly match the observed probabilities by estimating a separate parameter for each distinct covariate pattern \(j\) (see below).

Another measure is the Pearson \(\chi^2\), which measures lack of fit in a slightly different way. Large (and significant) \(\chi^2\) values for both signify poor fit. The two statistics have the following formulas (adapted from Allison, 1999):

\begin{align} Deviance \, \chi^2 &= 2\sum_{j=1}^J O_j \, log\left(\frac{O_j}{E_j}\right) \\ Pearson \, \chi^2 &= \sum_{j=1}^J \frac{(O_j-E_j)^2}{E_j} \end{align}

Despite the warning about using these statistics when \(J \approx n\), we will show how to request these \(\chi^2\) tests in proc logistic. On the model statement, specify the two options aggregate and scale=none to request these two statistics. The aggregate option creates the \(J\) groups with unique covariate patterns.

proc logistic data=titanic descending;
class pclass female;
model survived = pclass age sibsp parch fare female / expb aggregate scale=none;
run;
Deviance and Pearson Goodness-of-Fit Statistics
Criterion Value DF Value/DF Pr > ChiSq
Deviance 900.0562 938 0.9595 0.8086
Pearson 977.7152 938 1.0423 0.1789

Number of unique profiles: 946

The output tells us that \(J=946\), the number of unique covariate patterns in the dataset. That is only slightly smaller than \(n=1,043\), the total number of observations, so \(J \approx n\), and we shouldn't interpret these tests.

Hosmer and Lemeshow goodness of fit test

Hosmer and Lemeshow (1980) developed a goodness-of-fit statistic designed to be more appropriate when \(J \approx n\), that is, when the number of unique covariate patterns is close to the number of observations. The statistic is calculated in the following way:

  1. Sort the observations by their estimated predicted probabilities based on the model.
  2. Divide the observations into 10 or so equally-sized groups, creating groups with similar predicted probabilities.
  3. Then use the formula for the Pearson \(\chi^2\) statistic, where \(E_j\), the expected number of 1s (successes) in each group is calculated by summing the predicted probabilities in that group, and \(O_j\) is the observed number of 1s in the group.
  4. Under the null hypothesis, the distribution of the Hosmer and Lemeshow statistic, called \(\hat{C}\), is well approximated by the \(\chi^2\) distribution with \(g-2\) degrees of freedom, where \(g\) is the number of groups created in step 2.

To request the Hosmer and Lemeshow goodness-of-fit statistic, simply add the option lackfit to the model statement:

proc logistic data=titanic descending;
class female pclass / param=ref;
model survived = pclass age sibsp parch fare female  / expb lackfit;
run;
Partition for the Hosmer and Lemeshow Test
Group Total survived = 1 survived = 0
Observed Expected Observed Expected
1 104 8 6.64 96 97.36
2 104 18 10.42 86 93.58
3 104 19 13.36 85 90.64
4 104 12 16.91 92 87.09
5 104 23 26.19 81 77.81
6 104 36 39.40 68 64.60
7 104 43 57.27 61 46.73
8 104 71 71.38 33 32.62
9 104 91 84.53 13 19.47
10 107 104 98.91 3 8.09

This table shows us that proc logistic split the dataset into 10 groups, and then shows the observed and expected number of survived=1 and survived=0 in each group. There are small discrepancies between observed and expected throughout the table, and a few large discrepancies (group 7).


Hosmer and Lemeshow Goodness-of-Fit
Test
Chi-Square DF Pr > ChiSq
25.8827 8 0.0011

Above we see the Hosmer and Lemeshow \(\hat{C}\) statistic, labeled Chi-Square, with a significant p-value at \(p=0.0011\), suggesting poor model fit. Perhaps we should consider interactions and non-linear effects.

Not everyone agrees on the usefulness of the Hosmer and Lemeshow goodness-of-fit test, despite its wide usage. Paul Allison notes that the test can give conflicting results depending on the number of groups chosen, and does not always improve when significant effects are added to the model. See here for more details.

Identifying poorly fit observations

Now that we know that the overall model does not fit well, we should try to identify poorly fit observations to see how we might improve the model.

We have previously discussed how the Deviance is an overall measure of the goodness-of-fit, with larger values indicating poorer fit. Each of the \(J\) covariate patterns contributes to the Deviance. Poorly fitting patterns, indicated by disagreement between the observed and expected probabilities of the outcome, contribute more to the Deviance.

We can estimate the contribution of each covariate pattern to the Deviance by calculating the change in the Deviance if we were to rerun the model without that covariate pattern.

This change in the Deviance due to deleting a covariate pattern group, or \(\Delta D\) (Hosmer et al., 2013), can be estimated for each observation using the difdev= option on the output statement in proc logistic. In the code below:

proc logistic data=titanic descending;
class female pclass / param=ref;
model survived = pclass age sibsp parch fare female  / expb lackfit;
output out=diag difdev=deltad predicted=predp;
run;

Plotting \(\Delta D\) vs predicted probabilities

Hosmer et al. (2013) and Allison (1999) recommend plotting \(\Delta D\) (y-axis) against predicted probabilities to find poorly fitting covariate patterns. \(\Delta D\) values greater than 4 are recommended as a rough cutoff for poor fit.

We use proc sgplot to create the plot using the diag dataset that we just output from the previous model. We put the predicted probabilities predp on the x-axis and the \(\Delta D\) on the y-axis.

proc sgplot data=diag;
scatter x=predp y=deltad;
run;
predicted probs versus delta D

Inspecting the observations with the largest \(\Delta D\)

Now, we'll sort the observations by descending \(\Delta D\) and view the top 20 observations to see if we notice anything systematic. Note the use of the descending keyword on the by statement in proc sort to sort in descending order. We then use obs=20 in proc print to print the first 20 observations.

proc sort data=diag;
by descending deltad;
run;

proc print data=diag(obs=20);
run;
Obs pclass survived age sibsp parch fare female _LEVEL_ predp deltad
1 1 0 2.0 1 2 151.550 1 1 0.97492 7.48884
2 3 1 29.0 3 1 22.025 0 1 0.04180 6.44020
3 3 1 36.5 1 0 17.400 0 1 0.05857 5.71069
4 1 0 25.0 1 2 151.550 1 1 0.94020 5.69607
5 3 1 45.0 0 0 8.050 0 1 0.05923 5.69140
6 3 1 44.0 0 0 7.925 0 1 0.06146 5.61593
7 3 1 39.0 0 0 7.925 0 1 0.07383 5.24130
8 2 1 62.0 0 0 10.500 0 1 0.07857 5.15314
9 3 1 39.0 0 1 13.417 0 1 0.07850 5.12371
10 3 1 3.0 4 2 31.388 0 1 0.08348 5.09518
11 1 0 36.0 0 0 31.679 1 1 0.91802 5.04934
12 3 1 25.0 1 0 7.775 0 1 0.08816 4.88155
13 3 1 32.0 0 0 7.579 0 1 0.09498 4.73019
14 3 1 32.0 0 0 7.775 0 1 0.09500 4.72977
15 3 1 32.0 0 0 7.854 0 1 0.09501 4.72960
16 3 1 32.0 0 0 7.925 0 1 0.09502 4.72944
17 3 1 32.0 0 0 8.050 0 1 0.09503 4.72917
18 3 1 31.0 0 0 7.925 0 1 0.09846 4.65764
19 3 1 32.0 0 0 56.496 0 1 0.10019 4.63270
20 3 1 32.0 0 0 56.496 0 1 0.10019 4.63270

Using proc sgpanel to remake the graph of \(\Delta D\) vs predicted probabilities, now colored by pclass and paneled by female, reinforces our idea that the model is not fitting well for third class males (blue circles, left panel) and first class females (green circles, right panel).

proc sgpanel data=diag;
panelby female;
scatter x=predp y=deltad / group=pclass;
run;
predicted probs versus delta D by class and female

Exploring nonlinear effects

Failing to allow for nonlinear effects in the model can also lead to poor fit. Remember that entering a variable "as is" into a regression model assumes a linear relationship with the outcome, or with the log-odds of the outcome for logistic regression. While this will be correct for 0/1 class effects, it may or may not be correct for continuous effects.

We would be surprised if the effect of age on the log-odds of survival were entirely linear. Hosmer et al. (2013) suggest a simple method to explore whether a continuous variable has a nonlinear relationship with the log-odds of the outcome.

  1. Split the continuous variable into several intervals to form a categorical version of the variable. They suggest splitting into 4 groups based on quartiles, but note that other grouping strategies can work.
  2. Refit the logistic regression model with the categorical variable and all relevant covariates.
  3. Create a line plot with the coefficient value on the y-axis, and the midpoint of each interval along the x-axis.

We will slightly adapt their method for the seminar to keep the coding succinct. Instead of quartiles, we are going to use 5 intervals with (nearly) equal spans of age, because the data are large enough to support enough observations in each interval. The advantage of using intervals that span the same range is that we don't have to plot the midpoints of the intervals -- we can plot them as ordinal 1-2-3-4-5 values because the intervals are equally spaced along the number line.

In SAS, we can use formats to split a continuous variable into intervals. This proc format code creates the format we will use for age:

proc format;
value agecat 0-<15 = '[0,15)'
             15-<30 = '[15,30)'
             30-<45 = '[30,45)'
             45-<60 = '[45,60)'
             60-high = '60+';
run;

We can now run the logistic regression model treating age as categorical with the format we just created. To do this, we need to apply the format with a format statement, and specify age on the class statement.

We are going to interact (categorical) age with female because we suspect the nonlinear effect of age may not be the same for both genders.

In addition, we will use the lsmeans statement to create the plot of the coefficients vs age categories. The lsmeans statement estimates least squares means, or predicted population margins, which are the expected values (means) of the outcome in a population where the levels of all class variables are balanced and averaged over the effects of the continuous covariates. We can estimate these means at each level of categorical age and across female. A plot of these means is analogous to the plot of coefficients suggested by Hosmer et al. (2013).

In order to use the lsmeans statement, we must change the parameterization of the class to param=glm. With "glm" parameterization, the parameter estimates in the Analysis of Maximum Likelihood Estimates are interpreted in the same way as those using reference parameterization (differences from reference group), but the type 3 tests from the Type 3 Analysis of Effects are interpreted in the same way as using effect parameterization.

proc logistic data=titanic descending;
class female(ref=first) pclass age / param=glm;
model survived = pclass age|female sibsp parch fare / expb;
lsmeans age*female /  plots=meanplot(join sliceby=female);
format age agecat.;
run;

age cat coefficients by gender

Modeling Interactions and Nonlinear Effects

Adding interactions to the model

Interactions allow a predictor's effect to vary with levels of another predictor.

Our diagnostics suggested poor fit due to lack of an interaction of female and pclass and possibly female and age. Omitting these interactions constrains the effects of pclass and age to be the same across genders. Instead, we might suspect that the "women and children first" protocol might have affected how passenger class and age were related to survival for the two genders. Or, passenger class and age might have affected which women and children were chosen to save.

In SAS, to add an interaction between 2 variables and the component main effects to a model, specify the two variables separated by |. Here we add interactions of female with both pclass and age.

proc logistic data=titanic descending;
class female pclass / param=ref;
model survived = female|pclass female|age sibsp parch fare / expb;
run;

Relevant output:

Model Fit Statistics
Criterion Intercept Only Intercept and
Covariates
AIC 1411.985 936.198
SC 1416.935 990.646
-2 Log L 1409.985 914.198

Criterion   No interactions   Interactions
AIC         985.058           936.198
SC          1024.657          990.646

The model fit statistics AIC and SC have dropped by more than 30 points each from the previous additive (no interactions) model, suggesting that the interactions are improving the fit.

Type 3 Analysis of Effects
Effect DF Wald
Chi-Square
Pr > ChiSq
female 1 4.1817 0.0409
pclass 2 65.7214 <.0001
female*pclass 2 30.7923 <.0001
age 1 5.6899 0.0171
age*female 1 3.5980 0.0579
sibsp 1 11.5648 0.0007
parch 1 0.7197 0.3962
fare 1 0.0136 0.9073

The Type 3 tests show strong evidence of a pclass by female interaction, but weaker evidence of an age by female interaction.

Analysis of Maximum Likelihood Estimates
Parameter     DF Estimate Standard
Error
Wald
Chi-Square
Pr > ChiSq Exp(Est)
Intercept     1 0.6928 0.3227 4.6084 0.0318 1.999
female 0   1 -0.8032 0.3928 4.1817 0.0409 0.448
pclass 1   1 3.8387 0.5866 42.8184 <.0001 46.466
pclass 2   1 2.3803 0.3766 39.9540 <.0001 10.809
female*pclass 0 1 1 -2.0225 0.6130 10.8872 0.0010 0.132
female*pclass 0 2 1 -2.4146 0.4663 26.8084 <.0001 0.089
age     1 -0.0277 0.0116 5.6899 0.0171 0.973
age*female 0   1 -0.0278 0.0146 3.5980 0.0579 0.973
sibsp     1 -0.3560 0.1047 11.5648 0.0007 0.700
parch     1 0.0918 0.1083 0.7197 0.3962 1.096
fare     1 -0.00025 0.00216 0.0136 0.9073 1.000

Odds Ratio Estimates
Effect Point Estimate 95% Wald
Confidence Limits
sibsp 0.700 0.571 0.860
parch 1.096 0.887 1.355
fare 1.000 0.996 1.004

Odds ratios for effects involved in interactions will not be displayed in the Odds Ratio Estimates table, so we will not see odds ratios for pclass, age, or female here. However, we can use the oddsratio statement to help us estimate ORs for effects involved in interactions.

Use the oddsratio statement to estimate and graph odds ratios

The oddsratio statement can be used to estimate odds ratios for any effect in the model. If an effect is involved in an interaction, SAS will estimate the odds ratio for that effect across all levels of the interacting (moderating) variable.

Let's estimate the odds ratios for pclass across genders:

proc logistic data=titanic descending;
class female pclass / param=ref;
model survived = female|pclass female|age sibsp parch fare / expb;
oddsratio pclass;
run;

Note that even though we only specify pclass on the oddsratio statement, proc logistic is aware that pclass is interacted with female in the model, and will thus estimate its effects across levels of female. Here are the output and graph produced:

Odds Ratio Estimates and Wald Confidence Intervals
Odds Ratio Estimate 95% Confidence Limits
pclass 1 vs 2 at female=0 6.362 3.238 12.500
pclass 1 vs 3 at female=0 6.149 3.353 11.276
pclass 2 vs 3 at female=0 0.966 0.562 1.662
pclass 1 vs 2 at female=1 4.299 1.302 14.190
pclass 1 vs 3 at female=1 46.466 14.716 146.719
pclass 2 vs 3 at female=1 10.809 5.167 22.611

OR pclass across genders

The plot above shows the estimates of the ORs with their confidence intervals. Effects whose intervals cross the reference line at x=1 are not significant. Unfortunately, this particular plot is dominated by the confidence interval of the pclass effect for class1-vs-class3, making the other intervals harder to see.

Nevertheless, we can see that effects of passenger class are quite different for the two genders.

The exponentiated interaction coefficient for female*pclass=1 was .132, meaning the OR for pclass=1 vs pclass=3 for males should be .132 times as large as the OR for females. The pclass=1 vs pclass=3 OR for females is 46.47. So, 46.47*.1323=6.149, the OR for pclass=1 vs pclass=3 for males.

Controlling which variables appear on the odds ratio plot

Each oddsratio statement only accepts one variable to process, i.e. to estimate and plot its odds ratio(s), but we can specify multiple oddsratio statements to place multiple variable effects on one odds ratio plot.

Here we place all variables but pclass on one odds ratio plot:

proc logistic data=titanic descending;
class female pclass / param=ref;
model survived = female|pclass female|age sibsp parch fare / expb;
oddsratio female;
oddsratio age;
oddsratio sibsp;
oddsratio parch;
oddsratio fare;
run;
OR for all variables but pclass

Modifying the odds ratio plot

For the odds ratios of variables involved in interactions, the at option on the oddsratio statement allows us to control the levels of the interacting variable at which to estimate the odds ratio of the other variable.

Below we use the at option to estimate the odds ratios for female at pclass=3 and across an array of age values. Remember to put values for class variables inside quotes (e.g., pclass='3').

Options for the odds ratio plot to alter its appearance are specified by placing the desired options inside plots=oddsratio() on the proc logistic statement.

Below we convert the axis to a logarithmic scale with base \(e\), the base of the natural log, for a more accurate depiction of the magnitude of the ORs using option specification logbase=e. We also change the type of the plot to horizontalstat, which adds the estimate of each OR and its confidence interval to the graph.

proc logistic data=titanic descending plots(only)=(oddsratio(logbase=e type=horizontalstat)) ;
class female pclass / param=ref;
model survived = female|pclass female|age sibsp parch fare / expb;
oddsratio female / at(pclass='3' age=1 5 18 30 50 70);
run;
OR for all variables log axis base e with stats

Now the x-axis units are logarithmically scaled along successive powers of \(e\). For convenience, the values \((e^{-4}, e^{-3}, e^{-2}, e^{-1}, e^{0})\) are equal to \((.0183, .0498, .135, .368, 1)\).

We can see the possible effect that age has on the female ORs, but we should be cautious about making claims because the evidence for this interaction is not terribly strong.

Exploring non-linear effects with restricted cubic splines

Age spans a wide range in this dataset, from about 2 months to 80 years. It's quite possible it does not have a linear effect on the log-odds of survival across its entire range.

Regression splines are transformations of variables that allow flexible modeling of complex functional forms. Spline variables are formed by specifying a set of breakpoints (known as knots), and then allowing the function to change its curvature rapidly between knots, while (usually) constraining the pieces to join at the knots. Restricted cubic splines are a popular choice: they are cubic functions between the knots and linear functions in the tails, which allows for many possible complex forms.

SAS makes it quite easy to add cubic spline transformations of a variable to the model. The effect statement allows the user to construct effects like regression splines from model variables. In the effect statement below:

proc logistic data=titanic descending;
effect agesp = spline(age / basis=tpf(noint) naturalcubic knotmethod=percentiles(5) details);
class female pclass / param=ref;
model survived = female|pclass female|agesp sibsp parch fare  / expb;
run;

Notice that we interacted the new spline variables with female to allow the age spline functions to vary by gender.

Knots for Spline Effect
agesp
Knot Number age
1 18.00000
2 23.00000
3 28.00000
4 34.00000
5 45.00000

This table shows us the knot locations.

Type 3 Analysis of Effects
Effect DF Wald
Chi-Square
Pr > ChiSq
female 1 1.2680 0.2602
pclass 2 62.2160 <.0001
female*pclass 2 33.8166 <.0001
agesp 4 7.4472 0.1141
agesp*female 4 15.3006 0.0041
sibsp 1 18.3201 <.0001
parch 1 0.0716 0.7891
fare 1 0.1157 0.7338

The Type 3 test is not significant for agesp alone, which represents the spline effect for females (the reference group). This might suggest that age does not explain much variance in survival for females, something our previous exploration of nonlinear effects of age suggested. On the other hand, the agesp*female interaction Type 3 test is significant, suggesting that the age spline function is quite different for males.

Analysis of Maximum Likelihood Estimates
Parameter     DF Estimate Standard
Error
Wald
Chi-Square
Pr > ChiSq Exp(Est)
Intercept     1 1.0394 0.5157 4.0619 0.0439 2.827
female 0   1 0.7087 0.6294 1.2680 0.2602 2.031
pclass 1   1 3.7285 0.6063 37.8133 <.0001 41.617
pclass 2   1 2.3909 0.3798 39.6191 <.0001 10.923
female*pclass 0 1 1 -2.1504 0.6326 11.5572 0.0007 0.116
female*pclass 0 2 1 -2.6053 0.4799 29.4706 <.0001 0.074
agesp 1   1 -0.0355 0.0299 1.4136 0.2345 0.965
agesp 2   1 -0.0147 0.0269 0.2982 0.5850 0.985
agesp 3   1 0.0499 0.0735 0.4614 0.4970 1.051
agesp 4   1 -0.0527 0.0671 0.6159 0.4326 0.949
agesp*female 1 0 1 -0.1417 0.0383 13.6707 0.0002 0.868
agesp*female 2 0 1 0.0926 0.0339 7.4685 0.0063 1.097
agesp*female 3 0 1 -0.2313 0.0915 6.3900 0.0115 0.794
agesp*female 4 0 1 0.1857 0.0825 5.0740 0.0243 1.204
sibsp     1 -0.4842 0.1131 18.3201 <.0001 0.616
parch     1 -0.0319 0.1194 0.0716 0.7891 0.969
fare     1 0.000741 0.00218 0.1157 0.7338 1.001

We present the parameter estimates for the splines here, but they are rather hard to interpret. Instead, spline functions are more easily interpreted if graphed. Fortunately, SAS makes it easy to graph these functions.

Use the effectplot statement to visualize spline effects

A graph of the spline functions will greatly facilitate interpretation. The effectplot statement displays the results of a fitted model by plotting the predicted outcome (y-axis) against an effect (x-axis). We will use the slicefit type of effectplot, which displays fitted values along the range of a continuous variable (age), split or "sliced" by a class variable (female). Note that we specify age on the x-axis, not the spline variable agesp.

proc logistic data=titanic descending;
effect agesp = spline(age / basis=tpf(noint) naturalcubic knotmethod=percentiles(5) details);
class female pclass / param=ref;
model survived = female|pclass female|agesp sibsp parch fare  / expb;
effectplot slicefit (x=age sliceby=female);
run;
age splines by female
Criterion Linear age Splines age
AIC 936.198 923.98
SC 990.646 1008.128

The table above compares the fit statistics for the model with linear age versus the model with spline functions for age (age interacted with female in both). Although the AIC is lower for the splines model, the SC is higher by more than that difference. The SC penalizes additional parameters more harshly, and the spline functions introduce quite a few new parameters. Nevertheless, the estimated functions look rather non-linear, so arguments can be made for either model. We proceed in the seminar with a linear effect of age to keep the code and output compact.

Reassessing global model fit

We return to our model with a linear effect of age and with both age and pclass interacted with female. We can again test for goodness-of-fit with Hosmer and Lemeshow's test, and use diagnostics again if necessary to identify poorly fit observations. We should see some improvement in the Hosmer and Lemeshow statistic since we added meaningful interactions to our model (AIC and SC improved over the additive model).

proc logistic data=titanic descending;
class female pclass / param=ref;
model survived = female|pclass female|age sibsp parch fare  / expb lackfit;
run;
Partition for the Hosmer and Lemeshow Test
Group Total survived = 1 survived = 0
Observed Expected Observed Expected
1 105 9 6.66 96 98.34
2 104 11 11.98 93 92.02
3 104 14 15.86 90 88.14
4 105 20 19.89 85 85.11
5 105 20 24.05 85 80.95
6 104 39 34.20 65 69.80
7 104 49 48.54 55 55.46
8 104 67 67.73 37 36.27
9 104 96 95.29 8 8.71
10 104 100 100.80 4 3.20

We see fewer large discrepancies in the Observed vs Expected columns compared to the model without interactions.

Hosmer and Lemeshow Goodness-of-Fit
Test
Chi-Square DF Pr > ChiSq
3.4171 8 0.9055

Now the test statistic is smaller (previously 25.89) and is not significant, suggesting good fit.

Influence Diagnostics

Observations whose outcome values do not match the expected pattern predicted by the model (have large residuals) and who have extreme predictor values (high leverage) can have disproportionate influence on the model coefficients and fit. We recommend assessing how influential observations are affecting the model estimates before concluding that model fit is satisfactory.

We will use the \(DFBETAS\) and \(C\) diagnostic measures to help us identify influential observations. The \(DFBETAS\) statistic is calculated for each covariate pattern and measures the change in all regression coefficients if the model were to be fit without that covariate pattern. \(DFBETAS\), for observation \(j\), is a vector of standardized differences for each coefficient \(i\):

$$DFBETAS_{ij} = \frac{\Delta_i\hat{\beta}_j}{\hat{\sigma}_i}$$

\(\boldsymbol{\Delta\hat{\beta}_j}\) is the vector of differences in parameter estimates when covariate pattern \(j\) is deleted, so the formula above refers to its \(i\)th element, corresponding to a particular coefficient. \(\hat{\sigma}_i\) is the standard error of the \(i\)th coefficient.

The \(C\) diagnostic is also calculated for each covariate pattern, so will be the same for observations with the same covariate values. It measures how much the overall model fit changes with deletion of that pattern. It is analogous to the linear regression influence measure Cook's Distance. For observation \(j\), \(C\) is calculated as:

$$C_j = \frac{\chi_j^2 \, h_j}{(1-h_j)^2}$$

\(\chi_j^2\) is the squared Pearson residual for covariate pattern \(j\), a measure of poor fit, while \(h_j\) is the corresponding diagonal element of the hat matrix, a measure of leverage.

For both statistics, larger values indicate larger influence on the model. Both are easy to inspect in SAS.

Identifying influential observations

proc logistic data=titanic descending plots(label)=(influence(unpack) dfbetas(unpack));
class female pclass / param=ref;
model survived = female|pclass female|age sibsp parch fare  / expb lackfit;
run;

Many plots are produced. We only show a few.

pearson residuals plot leverage plot


The above 2 plots are produced by influence, and are plots of Pearson residuals on the left and plots of leverage on the right. Remember that influential observations typically have both large residuals and high leverage. We see that observation 3 has the largest residual (in absolute terms) and that observations 46 and 161 have the highest leverages.

C plot

Above is the plot of \(C\). Since \(C\) is calculated using the Pearson residual and leverage, it is not surprising to see observations 3, 46, and 161 have the highest \(C\) values. Regarding cutoffs for deeming an observation "influential", Hosmer et al. (2013) do not recommend any cutoffs, but instead recommend identifying points that fall far away from the other observations, which describes observations 3, 46, and 161 on the plot of \(C\).

We next inspect some of the \(DFBETAS\) plots. Again, we are looking for observations that are far away from the rest.

dfbetas pclass1 plot dfbetas age plot


In the above two plots, we can see that observation 3 is having a large influence on the coefficients for pclass1 and age.

dfbetas fare plot

In the \(DFBETAS\) plot for fare, we see the large influence of observations 46 and 161.

Inspecting influential observations

Now that we have identified observations 3, 46, and 161 as potentially unduly influential, we should examine them to evaluate whether we should include them in the model.

data influence;
set titanic;
if  _N_ in (3, 46, 161) then output;
run;

proc print data=influence;
run;
Obs pclass survived age sibsp parch fare female
1 1 0 2 1 2 151.550 1
2 1 1 36 0 1 512.329 0
3 1 1 35 0 0 512.329 0

The rows going down correspond to observations 3, 46, and 161. Observation 3 is a young, female, first-class passenger, so would be expected to survive by our model. However, she did not, so has a large residual. Observations 46 and 161 are two male passengers who paid the very highest fare at 512.33 pounds, giving them large leverage values. They both in fact survived, so the fact that they are influential suggests that the model expected them not to survive.

We could now proceed to run a model without these observations. However, refitting without them does not change any coefficient enough to alter our inferences (coefficient estimates from the 2 models lie within each other's confidence intervals). In general, we should be cautious about removing observations, particularly to move estimates toward confirming hypotheses. So, we will not show the refitted model.

Evaluating Predictive Power

Association statistics

Several association statistics can be used to quantify the predictive power of a fitted logistic regression model, or how well correlated the predicted probabilities of the responses are with the observed responses themselves (Harrell, 2015). These are somewhat different from the global \(\chi^2\) statistics we used before to test goodness-of-fit, and the two kinds of measures can agree or disagree about the "quality" of a model.

These statistics are output by proc logistic in the Association of Predicted Probabilities and Observed Responses table. Here is that table from the latest model, whose code we repeat here:

proc logistic data=titanic descending;
class female pclass / param=ref;
model survived = female|pclass female|age sibsp parch fare  / expb lackfit;
run;
Association of Predicted Probabilities and
Observed Responses
Percent Concordant 85.7 Somers' D 0.716
Percent Discordant 14.1 Gamma 0.718
Percent Tied 0.3 Tau-a 0.346
Pairs 262650 c 0.858

We first discuss the statistic \(c\), also known as the concordance index. For this statistic, we form all possible pairs of observations where one has survived=0 and the other has survived=1. If the predicted probability of survival for the observation with survived=0 is lower than the predicted probability of survival for the observation with survived=1, then that pair is concordant. Pairs where the observation with survived=0 has the higher predicted probability of survival are discordant. The \(c\) index is the proportion of all survived=0/survived=1 pairs that are concordant.

The \(c\) index is popular because it is also equal to the area under the receiver operating characteristic curve. \(c=1\) means perfect prediction, while \(c=.5\) means random prediction. Values of \(c\) greater than .8 may indicate a model that predicts responses well (Harrell 2015; Hosmer et al., 2013).

\(Somers' \, D\) is closely related to the \(c\) index and is widely used. It is the proportion of pairs that are concordant minus the proportion that are discordant:

$$Somers' \, D=c-(1-c)$$

\(Somers' \, D\) ranges from -1 to 1, where \(Somers' \, D=1\) means perfect prediction, \(Somers' \, D=-1\) means perfectly wrong prediction, and \(Somers' \, D=0\) means random prediction.
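As a quick check of this relationship, for our model \(c=0.858\), so \(Somers' \, D = 0.858 - (1 - 0.858) = 0.716\), matching the values reported in the table above.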

We will not discuss \(Gamma\) and \(Tau-a\) further, except to say that they are similar to \(Somers' \, D\): \(Gamma\) handles pairs of observations with tied predicted probabilities differently, and \(Tau-a\) includes all pairs of observations with the same observed responses in its calculation.

Overall, the model appears to discriminate pretty well.

We present a comparison of \(c\) and \(Somers' \, D\) for the additive model without interactions and the model with interactions (of female with pclass and age). The interactions model has slightly improved values.

Association Statistic   No interactions   Interactions
c                       0.846             0.858
Somers' D               0.692             0.716

Sensitivity and specificity

Sensitivity and specificity are two measures of the ability of the model to detect true positives and true negatives. By "detect true positives", we mean how often the model predicts outcome=1 when the observed outcome=1.

Remember that a logistic regression model does not predict an individual's outcome to be either 0 or 1, only the probability that outcome=1. We could try to assign 0/1 values based on the predicted probability by, for instance, choosing a cutoff probability above which observations will be assigned outcome=1. For example, we could choose probability=0.5 to be the cutoff, and any observation with predicted probability above 0.5 will be assigned outcome=1 as their predicted outcome. It is likely that some of those assigned predicted=1 actually have observed=1 (a true positive), while others have observed=0 (false positive). Those with probability<0.5 will be assigned predicted=0. This will match the observed outcome for some (true negative), and will not for others (false negative).

              Observed=0        Observed=1
Predicted=0   true negative     false negative
Predicted=1   false positive    true positive

Now imagine lowering the cutoff probability to 0, so that everyone is assigned predicted=1. This maximizes the true positive rate, since we match everyone who has observed=1, but it also maximizes the false positive rate, since everyone with observed=0 now has predicted=1. Raising the cutoff to probability=1 lowers the false positive rate to 0, because no observations have predicted=1, but maximizes the false negative rate, because everyone with observed=1 now has predicted=0. Researchers therefore often search for a cutoff probability that balances sensitivity and specificity, for example by maximizing their sum.

Sensitivity is the proportion of all observations with observed=1 that are assigned predicted=1. In the formula below, true positives is the number of observations with both predicted=1 and observed=1, while false negatives is the number with predicted=0 and observed=1.

$$Sensitivity = \frac{true \, positives}{true \, positives + false \, negatives}$$

Lowering the probability cutoff yields more predicted=1 and fewer predicted=0, which raises true positives and lowers false negatives, and thus increases sensitivity.

Specificity is the proportion of all observations with observed=0 that are assigned predicted=0. Below, true negatives is the number of observations with both predicted=0 and observed=0, and false positives is the number with predicted=1 and observed=0.

$$Specificity = \frac{true \, negatives}{true \, negatives + false \, positives}$$

Increasing the cutoff probability yields more predicted outcome=0, and will generally increase specificity, because true negatives will increase as false positives decrease.
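As a sketch of how these quantities can be computed by hand at a chosen cutoff, the code below classifies each observation at probability=0.5 and tabulates predicted against observed outcomes. It reuses the pred dataset with the predicted probability phat created in the earlier sketch; those names are ours.

data classified;
set pred;
if phat ne . then predicted = (phat >= 0.5);  /* 1 if above the cutoff, else 0 */
run;

proc freq data=classified;
tables predicted*survived / norow nocol nopercent;  /* 2x2 table of counts */
run;

/* sensitivity = proportion of observed=1 assigned predicted=1;
   specificity = proportion of observed=0 assigned predicted=0 */
proc sql;
select mean(predicted = 1) as sensitivity from classified where survived = 1;
select mean(predicted = 0) as specificity from classified where survived = 0;
quit;

Note that proc logistic itself can produce a classification table across a range of cutoffs with the ctable and pprob= options on the model statement.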

Receiver operating characteristic curves

Receiver operating characteristic (ROC) curves display the trade-off between sensitivity and specificity across predicted probability cutoffs. The y-axis measures sensitivity and the x-axis measures (1-specificity), so points toward the top maximize sensitivity while points toward the left maximize specificity.

A point at the very top left of the graph represents a predicted probability cutoff where sensitivity and specificity are both maximized, meaning every observation has been correctly classified: all observations with observed=0 have predicted probabilities below the cutoff, and all observations with observed=1 have predicted probabilities above the cutoff. That is an unlikely situation.

Producing an ROC curve for a fitted regression model is quite simple in proc logistic: specify plots=roc in the proc logistic statement to request the plot. We add the (only) option so that only the requested ROC plot is produced, suppressing the default plots (such as the influence plots).

proc logistic data=titanic descending plots(only)=roc;
class female pclass / param=ref;
model survived = female|pclass female|age sibsp parch fare  / expb lackfit;
run;
roc full model

Each point on the curve represents a predicted probability cutoff.

The area under the ROC curve (AUC), which is equivalent to the concordance index \(c\), is displayed and quantifies how well the model discriminates. The AUC increases as points on the curve move toward the top left of the graph, pulling the curve away from the diagonal reference line (which signifies AUC=.5, or no discrimination). We can now choose a probability cutoff that has the desired trade-off between sensitivity and specificity.
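One way to search for such a cutoff is to save the coordinates of the ROC curve with the outroc= option on the model statement and rank them by Youden's J (sensitivity + specificity - 1). This is only a sketch of that idea; the dataset name rocdata and the column youden_j are names we chose.

proc logistic data=titanic descending;
class female pclass / param=ref;
model survived = female|pclass female|age sibsp parch fare / outroc=rocdata;
run;

/* rocdata has one row per cutoff, with _PROB_ (cutoff), _SENSIT_ (sensitivity),
   and _1MSPEC_ (1 - specificity); list the five cutoffs with the largest Youden's J */
proc sql outobs=5;
select _PROB_ as cutoff,
       _SENSIT_ as sensitivity,
       1 - _1MSPEC_ as specificity,
       _SENSIT_ - _1MSPEC_ as youden_j
from rocdata
order by youden_j desc;
quit;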

Comparing ROC curves

proc logistic makes comparing ROC curves between models, both graphically and statistically, quite easy using the roc statement. In the roc statement, one can specify a model that uses a subset of the covariates in the fitted model. This "submodel" will be fit, its ROC curve plotted against the ROC curve of the full model, and its AUC compared to the AUC of the full model. We need only specify the covariates we want to appear in the submodel. The roccontrast statement can then produce a \(\chi^2\) test of the difference in AUC between the two curves.

In the code below, we use the roc statement to create a submodel that omits the interactions female*pclass and female*age from the model, and we label this submodel 'no interactions'. The roc statement will estimate the ROC curve for both the full model and the submodel, plot both, and estimate the AUC for both. The roccontrast statement then requests a test of the difference between the AUC values of the full model and the submodel by specifying the label of the submodel.

proc logistic data=titanic descending;
class female pclass / param=ref;
model survived = female|pclass female|age sibsp parch fare  / expb lackfit;
roc 'no interactions' female pclass age sibsp parch fare;
roccontrast 'no interactions'; 
run;
roc curves for the full model and the 'no interactions' submodel
ROC Association Statistics
                   Mann-Whitney                                    Somers' D
ROC Model          Area     Std Error   95% Wald Conf Limits       (Gini)     Gamma    Tau-a
Model              0.8583   0.0122      0.8345   0.8822            0.7167     0.7167   0.3464
no interactions    0.8458   0.0129      0.8207   0.8710            0.6917     0.6918   0.3343

The plot and table above are produced by the roc statement. We can see that the model with no interactions has a slightly smaller AUC than the full model.

ROC Contrast Test Results
Contrast DF Chi-Square Pr > ChiSq
no interactions 1 6.7568 0.0093

Above is the \(\chi^2\) test requested by roccontrast. It suggests that the AUC values are different, which may seem surprising since each AUC estimate falls well within the confidence interval of the other. However, the two curves are estimated on the same observations, so the AUC estimates are highly correlated, and the paired contrast test can detect differences that a comparison of overlapping confidence intervals would not suggest.

Estimating and Reporting Predicted Probabilities

Deciding on the final model

We may be somewhat satisfied with the last model we have fit, as we have:

Now we can proceed to decide how to report the results to an audience. We have already produced some output that would be useful for conveying the model results:

One drawback of each of the above outputs is that they are not in the probability metric, which is most readily understood by the majority of readers. Many readers have a hard time conceptualizing the magnitude of effects if they are expressed as differences in log odds or even as odds ratios.

Predicted probabilities

Although the logistic regression model directly models and predicts the log odds of the outcome, those predictions are easily transformed to probabilities of the outcome using the inverse logit transformation:

$$p_i = \frac{exp(\alpha + \beta_{1}x_{i1} + \beta_{2}x_{i2} + ... + \beta_{k}x_{ik})}{1 + exp(\alpha + \beta_{1}x_{i1} + \beta_{2}x_{i2} + ... + \beta_{k}x_{ik})}$$
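For example, the predicted log odds of \(-0.097\) that appears for first-class males in the lsmeans output below back-transforms to a probability of about .476, matching the Mean column there:

$$p = \frac{exp(-0.097)}{1 + exp(-0.097)} = \frac{0.908}{1.908} \approx 0.476$$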

Several statements available in proc logistic can produce estimated predicted probabilities based on the fitted model, including:

Each of the four statements above has an ilink option that causes probabilities, rather than log odds, to be predicted.

We will explore using lsmeans and effectplot, as the coding for those statements is simple.

Predicting probabilities with lsmeans

We used the lsmeans statement earlier to explore nonlinear effects of age as it produces mean estimates across levels of model covariates, including interactions. So, for example, the statement lsmeans A*B would produce mean estimates for each crossing of the levels of variables A and B. By default, when estimating these means, the effects of other categorical predictors are balanced, while the effects of continuous covariates are averaged over the population. This essentially "adjusts" for the effects of the additional covariates.

The following lsmeans specification will estimate a separate predicted probability for each crossing of female and pclass, while averaging over the effects of the remaining covariates (age, sibsp, parch, and fare), since they are all continuous. The ilink option requests the back-transformation to probabilities. Note that the class statement below uses param=glm rather than param=ref, because the lsmeans statement requires GLM-style parameterization of the class variables.

proc logistic data=titanic descending;
class female pclass / param=glm;
model survived = female|pclass female|age sibsp parch fare  / expb lackfit;
lsmeans female*pclass /  ilink;
run;
female*pclass Least Squares Means
female  pclass   Estimate   Standard Error   z Value   Pr > |z|   Mean     Standard Error of Mean
0       1        -0.09700   0.2172            -0.45    0.6551     0.4758   0.05416
0       2        -1.9474    0.2421            -8.04    <.0001     0.1248   0.02645
0       3        -1.9132    0.1703           -11.23    <.0001     0.1286   0.01909
1       1         3.5567    0.5084             7.00    <.0001     0.9723   0.01371
1       2         2.0984    0.3240             6.48    <.0001     0.8907   0.03154
1       3        -0.2820    0.2011            -1.40    0.1608     0.4300   0.04928

The predicted probability estimates can be found in the Mean column. It is now quite easy to see the effect of passenger class on the probability of survival for each gender. For males, we see a sharp drop in the probability going from first to second class, but only a slight drop going down to third class. For females, we see a small drop from first to second class, but then a large drop to third class. A table or graph of these probabilities can help the reader understand the female by pclass interaction. These can be roughly interpreted as population-averaged predicted probabilities for males and females across passenger classes.

Instead of averaging over the covariate effects in the population, we can estimate these means at fixed values of the other covariates using the at option of the lsmeans statement. Below, we fix the covariate values at which the predicted probabilities are estimated to age=30, sibsp=0, parch=1, and fare=100.

proc logistic data=titanic descending;
class female pclass / param=glm;
model survived = female|pclass female|age sibsp parch fare  / expb lackfit;
lsmeans female*pclass / at(age sibsp parch fare)=(30 0 1 100) ilink;
run;
female*pclass Least Squares Means
female  pclass  age    sibsp  parch  fare     Estimate   Standard Error   z Value   Pr > |z|   Mean     Standard Error of Mean
0       1       30.00  0.00   1.00   100.00    0.1093    0.2218            0.49     0.6221     0.5273   0.05529
0       2       30.00  0.00   1.00   100.00   -1.7411    0.3015           -5.77     <.0001     0.1492   0.03827
0       3       30.00  0.00   1.00   100.00   -1.7069    0.2616           -6.52     <.0001     0.1536   0.03401
1       1       30.00  0.00   1.00   100.00    3.7683    0.4983            7.56     <.0001     0.9774   0.01099
1       2       30.00  0.00   1.00   100.00    2.3099    0.3700            6.24     <.0001     0.9097   0.03039
1       3       30.00  0.00   1.00   100.00   -0.07045   0.2733           -0.26     0.7966     0.4824   0.06825

We can now see the fixed values of the covariates in the table. The values in the Mean column are interpreted as the predicted probabilities for a particular kind of person with those covariate values, rather than as population averages, since we fixed every covariate.

Using effectplot to plot predicted probabilities vs combinations of predictors

Conveniently, the effectplot statement provides easy graphing of predictors versus predicted outcomes, including on the probability scale, in proc logistic. SAS offers several types of effect plots, which suit different combinations of categorical and continuous predictors. We will use the effectplot type interaction, which is good for plotting the effects of multiple categorical predictors, and the type slicefit, which is good for plotting the effect of a continuous predictor interacted with a categorical predictor. Below we use two effectplot statements:

proc logistic data=titanic descending;
class female pclass / param=glm;
model survived = female|pclass female|age sibsp parch fare  / expb lackfit;
effectplot interaction (x=pclass sliceby=female) / clm connect noobs;
effectplot slicefit (x=age sliceby=female) / clm;
run;
effects of female and pclass

The above interaction plot makes both the passenger class and the gender effects clear.

effects of age and female

The above plot shows the effect of age on survival and how the age effect differs between genders. With the confidence intervals, we can see that females have clearly better probabilities of survival except at the very youngest ages.

Final Words

Extending the logistic regression model

The following extensions of the logistic regression model cannot be covered in this seminar, but here is some direction for interested researchers:

References

Allison, PD. 1999. Logistic Regression Using the SAS System. SAS Institute Inc: Cary, North Carolina.

Harrell, FE. 2015. Regression Modeling Strategies, 2nd Edition. Springer: New York.

Hosmer, DW, Lemeshow, S, and Sturdivant, RX. 2013. Applied Logistic Regression, 3rd Edition. Wiley: Hoboken.