Occasionally when running a logistic regression we run into the problem of so-called complete separation or quasi-complete separation. On this page, we will discuss what complete or quasi-complete separation means and how to deal with the problem when it occurs.
Notice that the made-up example data sets used on this page are extremely small. They are for the purpose of illustration only.
What is complete separation?
A complete separation in a logistic regression, sometimes also referred to as perfect prediction, happens when the outcome variable separates a predictor variable completely. Below is an example data set, where Y is the outcome variable, and X1 and X2 are predictor variables.
Y  X1  X2
0   1   3
0   2   2
0   3  -1
0   3  -1
1   5   2
1   6   4
1  10   1
1  11   0
We can see that observations with Y = 0 all have values of X1 <= 3, and observations with Y = 1 all have values of X1 > 3. In other words, Y separates X1 perfectly. The other way to see it is that X1 predicts Y perfectly, since X1 <= 3 corresponds to Y = 0 and X1 > 3 corresponds to Y = 1. If we were to dichotomize X1 into a binary variable using the cut point of 3, what we would get is exactly Y. That is, we have found a perfect predictor X1 for the outcome variable Y. In terms of predicted probabilities, we have Prob(Y = 1 | X1 <= 3) = 0 and Prob(Y = 1 | X1 > 3) = 1, without the need for estimating a model.
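We can verify this directly. Below is a minimal R check (our own sketch, using the data above with variable names of our choosing) showing that dichotomizing X1 at the cut point of 3 reproduces Y exactly:

y  <- c(0, 0, 0, 0, 1, 1, 1, 1)
x1 <- c(1, 2, 3, 3, 5, 6, 10, 11)

table(y, x1 > 3)              # the off-diagonal cells are empty: Y separates X1 completely
all(y == as.numeric(x1 > 3))  # TRUE: the dichotomized X1 is just Y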
Complete separation or perfect prediction can happen for somewhat different reasons. Here are two common scenarios.
- Another version of the outcome variable is being used as a predictor
variable. For example, we might have dichotomized a continuous variable X to
a binary variable Y. We then wanted to study the relationship between Y and
some predictor variables. If we included X as a predictor variable, we would
run into the problem of complete separation of X by Y as explained earlier.
- On rare occasions, it might happen simply because the data set is rather small and the distribution is somewhat extreme. For example, it could be the case that if we were to collect more data, we would have observations with Y = 1 and X1 <= 3, and hence Y would not separate X1 completely.
What happens when we try to fit a logistic regression model of Y on X1 and X2 using our small sample data shown above? Well, the maximum likelihood estimate of the parameter for X1 does not exist. In particular, with this example, the larger the coefficient for X1, the larger the likelihood. In other words, the coefficient for X1 should be as large as it can be, which would be infinity! Below is what each of SAS, SPSS, Stata, and R does with our sample data and model. We present these results here in the hope that some level of understanding of the behavior of logistic regression within our familiar software package might help us identify the problem more efficiently.
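To see why no finite maximum exists, here is a small R sketch of our own (not part of any package's output). It ignores X2 for simplicity and fixes the intercept at -3.5b, so the decision boundary stays between the two groups while the slope b grows:

y  <- c(0, 0, 0, 0, 1, 1, 1, 1)
x1 <- c(1, 2, 3, 3, 5, 6, 10, 11)

# binomial log-likelihood along the separating direction
loglik <- function(b) {
  eta <- b * (x1 - 3.5)            # linear predictor with intercept -3.5 * b
  sum(y * eta - log1p(exp(eta)))
}
sapply(c(1, 2, 5, 10, 20), loglik)
# the values increase monotonically toward 0, so there is no finite maximizer

Every step in this direction improves the fit, which is exactly why iterative estimation drives the X1 coefficient toward infinity.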
SAS
data t;
  input Y X1 X2;
cards;
0  1  3
0  2  2
0  3 -1
0  3 -1
1  5  2
1  6  4
1 10  1
1 11  0
;
run;
proc logistic data = t descending;
  model y = x1 x2;
run;

(some output omitted)

Model Convergence Status

Complete separation of data points detected.

WARNING: The maximum likelihood estimate does not exist.
WARNING: The LOGISTIC procedure continues in spite of the above warning.
         Results shown are based on the last maximum likelihood iteration.
         Validity of the model fit is questionable.

Model Fit Statistics

                   Intercept      Intercept and
Criterion               Only         Covariates
AIC                   13.090              6.005
SC                    13.170              6.244
-2 Log L              11.090              0.005

WARNING: The validity of the model fit is questionable.

Testing Global Null Hypothesis: BETA=0

Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio        11.0850     2        0.0039
Score                    6.8932     2        0.0319
Wald                     0.1302     2        0.9370

Analysis of Maximum Likelihood Estimates

                             Standard        Wald
Parameter    DF    Estimate     Error    Chi-Square    Pr > ChiSq
Intercept     1    -20.7083   73.7757        0.0788        0.7789
X1            1      4.4921   12.7425        0.1243        0.7244
X2            1      2.3960   27.9875        0.0073        0.9318

Odds Ratio Estimates

             Point         95% Wald
Effect    Estimate    Confidence Limits
X1          89.311    <0.001    >999.999
X2          10.980    <0.001    >999.999

Association of Predicted Probabilities and Observed Responses

Percent Concordant    100.0    Somers' D    1.000
Percent Discordant      0.0    Gamma        1.000
Percent Tied            0.0    Tau-a        0.571
Pairs                    16    c            1.000
We can see that the first relevant message is that SAS detected complete separation of the data points. It then gives further warning messages indicating that the maximum likelihood estimate does not exist, and continues to finish the computation. Notice also that SAS does not tell us which variable or variables are being separated completely by the outcome variable.
SPSS
data list list /Y X1 X2.
begin data.
0 1 3
0 2 2
0 3 -1
0 3 -1
1 5 2
1 6 4
1 10 1
1 11 0
end data.
logistic regression variable Y
  /method = enter X1 X2.

Logistic Regression

(some output omitted)

Warnings

The parameter covariance matrix cannot be computed. Remaining statistics will be omitted.

Case Processing Summary

Unweighted Cases(a)                       N    Percent
Selected Cases    Included in Analysis    8    100.0
                  Missing Cases           0    .0
                  Total                   8    100.0
Unselected Cases                          0    .0
Total                                     8    100.0

a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding

Original Value    Internal Value
.00               0
1.00              1

Block 0: Beginning Block

Classification Table(a,b)

                                    Predicted
                                    Y              Percentage
Observed                            .00    1.00    Correct
Step 0   Y                   .00    0      4       .0
                             1.00   0      4       100.0
         Overall Percentage                        50.0

a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation

                    B       S.E.    Wald    df    Sig.     Exp(B)
Step 0   Constant   .000    .707    .000    1     1.000    1.000

Variables not in the Equation

                                 Score    df    Sig.
Step 0   Variables         X1    5.576    1     .018
                           X2    .681     1     .409
         Overall Statistics      6.893    2     .032

Block 1: Method = Enter

Omnibus Tests of Model Coefficients

                  Chi-square    df    Sig.
Step 1   Step     11.090        2     .004
         Block    11.090        2     .004
         Model    11.090        2     .004

Model Summary

Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       .000(a)              .750                    1.000

a. Estimation terminated at iteration number 20 because a perfect fit is detected. This solution is not unique.
We see that SPSS detects the perfect fit and immediately stops the rest of the computation. It does not provide any parameter estimates. Moreover, even though it detects the perfect fit, it does not give us any information on which set of variables produces it.
Stata
clear
input Y X1 X2
0 1 3
0 2 2
0 3 -1
0 3 -1
1 5 2
1 6 4
1 10 1
1 11 0
end
logit Y X1 X2

outcome = X1 > 3 predicts data perfectly
r(2000);
We see that Stata detects the perfect prediction by X1 and stops computation immediately.
R
y  <- c(0, 0, 0, 0, 1, 1, 1, 1)
x1 <- c(1, 2, 3, 3, 5, 6, 10, 11)
x2 <- c(3, 2, -1, -1, 2, 4, 1, 0)
m1 <- glm(y ~ x1 + x2, family = binomial)

Warning message:
In glm.fit(x = X, y = Y, weights = weights, start = start, etastart = etastart,  :
  fitted probabilities numerically 0 or 1 occurred

summary(m1)

Call:
glm(formula = y ~ x1 + x2, family = binomial)

Deviance Residuals:
         1           2           3           4           5           6           7           8
-2.107e-08  -1.404e-05  -2.522e-06  -2.522e-06   1.564e-05   2.107e-08   2.107e-08   2.107e-08

Coefficients:
              Estimate Std. Error   z value Pr(>|z|)
(Intercept)    -66.098 183471.722 -3.60e-04        1
x1              15.288  27362.843     0.001        1
x2               6.241  81543.720  7.65e-05        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.1090e+01  on 7  degrees of freedom
Residual deviance: 4.5454e-10  on 5  degrees of freedom
AIC: 6

Number of Fisher Scoring iterations: 24
The only warning message R gives is right after fitting the logistic model. It says that "fitted probabilities numerically 0 or 1 occurred", which can be interpreted as perfect prediction or quasi-complete separation. In addition, the standard errors for the parameter estimates are implausibly large, which usually indicates a convergence problem or some degree of separation in the data.
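Continuing from the model above, a simple follow-up in R is to look at the fitted probabilities and to cross-tabulate the outcome against each predictor; this is our own diagnostic sketch, not output R produces automatically:

round(fitted(m1), 6)  # every fitted probability is numerically 0 or 1
table(y, x1)          # y = 0 only for x1 <= 3 and y = 1 only for x1 > 3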
What is quasi-complete separation and what can be done about it?
Quasi-complete separation in logistic regression happens when the outcome variable separates a predictor variable or a combination of predictor variables almost completely. Here is an example.
Y  X1  X2
0   1   3
0   2   0
0   3  -1
0   3   4
1   3   1
1   4   0
1   5   2
1   6   7
1  10   3
1  11   4
Notice that the outcome variable Y separates the predictor variable X1 pretty well, except for values of X1 equal to 3. In other words, X1 predicts Y perfectly when X1 < 3 (Y = 0) or X1 > 3 (Y = 1), leaving only X1 = 3 as a case with uncertainty. In terms of predicted probabilities, we would have Prob(Y = 1 | X1 < 3) = 0 and Prob(Y = 1 | X1 > 3) = 1, with nothing to estimate except Prob(Y = 1 | X1 = 3).
What happens when we try to fit a logistic regression model of Y on X1 and X2 using the data above? It turns out that the maximum likelihood estimate for the X1 parameter again does not exist: the larger the coefficient for X1, the larger the likelihood, so there is no finite maximum, at least in the mathematical sense. In practice, a value of about 15 or larger makes little difference, since such values all correspond to a predicted probability of essentially 1 (see the check below). Statistical software packages differ in how they deal with quasi-complete separation. Below is what each of SAS, SPSS, Stata, and R does with our sample data and model. We present these results here in the hope that some level of understanding of the behavior of logistic regression within our familiar software package might help us identify the problem more efficiently.
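Here is a quick check of the claim about coefficients of 15 or more, using R's inverse-logit function plogis:

plogis(c(5, 10, 15, 20))
# 0.9933071 0.9999546 0.9999997 1.0000000

Beyond a linear predictor of roughly 15, the predicted probability is 1 for all practical purposes.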
SAS
data t2;
  input Y X1 X2;
cards;
0  1  3
0  2  0
0  3 -1
0  3  4
1  3  1
1  4  0
1  5  2
1  6  7
1 10  3
1 11  4
;
run;
proc logistic data = t2 descending;
  model y = x1 x2;
run;

Model Information

Data Set                      WORK.T2
Response Variable             Y
Number of Response Levels     2
Model                         binary logit
Optimization Technique        Fisher's scoring

Number of Observations Read    10
Number of Observations Used    10

Response Profile

Ordered              Total
  Value    Y     Frequency
      1    1             6
      2    0             4

Probability modeled is Y=1.

Model Convergence Status

Quasi-complete separation of data points detected.

WARNING: The maximum likelihood estimate may not exist.
WARNING: The LOGISTIC procedure continues in spite of the above warning.
         Results shown are based on the last maximum likelihood iteration.
         Validity of the model fit is questionable.

Model Fit Statistics

                   Intercept      Intercept and
Criterion               Only         Covariates
AIC                   15.460              9.784
SC                    15.763             10.691
-2 Log L              13.460              3.784

WARNING: The validity of the model fit is questionable.

Testing Global Null Hypothesis: BETA=0

Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio         9.6767     2        0.0079
Score                    4.3528     2        0.1134
Wald                     0.1464     2        0.9294

Analysis of Maximum Likelihood Estimates

                             Standard        Wald
Parameter    DF    Estimate     Error    Chi-Square    Pr > ChiSq
Intercept     1    -21.4542   64.5674        0.1104        0.7397
X1            1      6.9705   21.5019        0.1051        0.7458
X2            1     -0.1206    0.6096        0.0392        0.8431

Odds Ratio Estimates

             Point         95% Wald
Effect    Estimate    Confidence Limits
X1        >999.999    <0.001    >999.999
X2           0.886     0.268       2.927

Association of Predicted Probabilities and Observed Responses

Percent Concordant    95.8    Somers' D    0.917
Percent Discordant     4.2    Gamma        0.917
Percent Tied           0.0    Tau-a        0.489
Pairs                   24    c            0.958
We see that SAS uses all 10 observations, and it gives warnings at various points. It informs us that it has detected quasi-complete separation of the data points. The parameter estimate for X1 does not mean much at all, and neither does the parameter estimate for the intercept. The coefficient for X2, however, actually is the correct maximum likelihood estimate, and it can be used for inference about X2, assuming that the intended model is based on both X1 and X2.
Stata
clear
input y x1 x2
0 1 3
0 2 0
0 3 -1
0 3 4
1 3 1
1 4 0
1 5 2
1 6 7
1 10 3
1 11 4
end
logit y x1 x2

note: outcome = x1 > 3 predicts data perfectly except for x1 == 3 subsample:
      x1 dropped and 7 obs not used

Iteration 0:   log likelihood = -1.9095425
Iteration 1:   log likelihood = -1.8896311
Iteration 2:   log likelihood = -1.8895913
Iteration 3:   log likelihood = -1.8895913

Logistic regression                               Number of obs   =          3
                                                  LR chi2(1)      =       0.04
                                                  Prob > chi2     =     0.8417
Log likelihood = -1.8895913                       Pseudo R2       =     0.0104

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |  (omitted)
          x2 |  -.1206257   .6098361    -0.20   0.843    -1.315883    1.074631
       _cons |  -.5427435   1.421095    -0.38   0.703    -3.328038    2.242551
------------------------------------------------------------------------------
Stata detects the quasi-complete separation and tells us which predictor variable is part of the issue: x1 predicts the data perfectly except when x1 = 3. Stata therefore drops all the cases in which x1 predicts the outcome variable perfectly, keeping only the three observations with x1 = 3. Since x1 is constant (= 3) in this subsample, it is dropped from the analysis. The parameter estimate for x2 is actually correct and can be used for inference about x2, assuming that the intended model is based on both x1 and x2.
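We can mimic what Stata did with a quick R check of our own: refit the model using only the three observations with x1 == 3, where x1 is constant and therefore drops out. The x2 coefficient should reproduce Stata's -0.1206 shown above:

y  <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
x1 <- c(1, 2, 3, 3, 3, 4, 5, 6, 10, 11)
x2 <- c(3, 0, -1, 4, 1, 0, 2, 7, 3, 4)
d  <- data.frame(y, x1, x2)

# logit of y on x2 in the x1 == 3 subsample only
m.sub <- glm(y ~ x2, family = binomial, data = d, subset = (x1 == 3))
coef(m.sub)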
SPSS
data list list /y x1 x2.
begin data.
0 1 3
0 2 0
0 3 -1
0 3 4
1 3 1
1 4 0
1 5 2
1 6 7
1 10 3
1 11 4
end data.
logistic regression variable y
  /method = enter x1 x2.

(Some output omitted)

Block 1: Method = Enter

Omnibus Tests of Model Coefficients

                  Chi-square    df    Sig.
Step 1   Step     9.681         2     .008
         Block    9.681         2     .008
         Model    9.681         2     .008

Model Summary

Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       3.779(a)             .620                    .838

a. Estimation terminated at iteration number 20 because maximum iterations has been reached. Final solution cannot be found.

Classification Table(a)

                                    Predicted
                                    y              Percentage
Observed                            .00    1.00    Correct
Step 1   y                   .00    4      0       100.0
                             1.00   1      5       83.3
         Overall Percentage                        90.0

a. The cut value is .500

Variables in the Equation

                       B          S.E.         Wald    df    Sig.    Exp(B)
Step 1(a)   x1         17.923     5140.147     .000    1     .997    6.082E7
            x2         -.121      .610         .039    1     .843    .886
            Constant   -54.313    15420.442    .000    1     .997    .000

a. Variable(s) entered on step 1: x1, x2.
SPSS iterated up to the default maximum number of iterations, could not reach a solution, and therefore stopped the iterative process. It did not tell us anything about quasi-complete separation, so it is up to us to figure out why the computation did not converge. One obvious piece of evidence is the magnitude of the parameter estimate for x1: it is very large, and its standard error is even larger. Based on this evidence, we should look at the bivariate relationship between the outcome variable y and x1. The parameter estimate for x2 is actually correct and can be used for inference about x2, assuming that the intended model is based on both x1 and x2.
R
y  <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
x1 <- c(1, 2, 3, 3, 3, 4, 5, 6, 10, 11)
x2 <- c(3, 0, -1, 4, 1, 0, 2, 7, 3, 4)
m1 <- glm(y ~ x1 + x2, family = binomial)

Warning message:
In glm.fit(x = X, y = Y, weights = weights, start = start, etastart = etastart,  :
  fitted probabilities numerically 0 or 1 occurred

summary(m1)

Call:
glm(formula = y ~ x1 + x2, family = binomial)

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-1.004e+00  -5.538e-05   2.107e-08   2.107e-08   1.469e+00

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -58.0761 17511.9030  -0.003    0.997
x1            19.1778  5837.3009   0.003    0.997
x2            -0.1206     0.6098  -0.198    0.843

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13.4602  on 9  degrees of freedom
Residual deviance:  3.7792  on 7  degrees of freedom
AIC: 9.7792

Number of Fisher Scoring iterations: 21
The only warning we get from R is right after the glm command about predicted probabilities being 0 or 1. From the parameter estimates we can see that the coefficient for x1 is very large and its standard error is even larger, an indication that the model might have some issues with x1. At this point, we should investigate the bivariate relationship between the outcome variable and x1 closely. On the other hand, the parameter estimate for x2 is actually the correct estimate based on the model and can be used for inference about x2 assuming that the intended model is based on both x1 and x2.
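A simple way to investigate that bivariate relationship, continuing from the R session above (our own diagnostic, not automatic output):

table(y, x1)
#    x1
# y   1 2 3 4 5 6 10 11
#   0 1 1 2 0 0 0  0  0
#   1 0 0 1 1 1 1  1  1

The two outcome groups overlap only at x1 == 3, which is exactly the quasi-complete separation described above.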
What are the techniques for dealing with quasi-complete separation?
There are a few options for dealing with quasi-complete separation. We will briefly discuss some of them here. Let's say that the predictor variable X is being separated quasi-completely by the outcome variable. Our discussion will be focused on what to do with X.
- The easiest strategy is to do nothing. This is because the maximum likelihood estimates for the other predictor variables are still valid, as we have seen in the previous sections. The drawback is that we do not get any reasonable estimate for the variable that predicts the outcome variable so well.
- Another simple strategy is to not include X in the model. This is not a recommended strategy, however, since it leads to biased estimates of the other variables in the model.
- If X is a categorical variable, we might be able to collapse some of its categories, if it makes sense to do so.
- Exact logistic regression is a good strategy when the data set is small and the model is not very large.
- A Bayesian method can be used when we have additional prior information on the parameter for X (see the sketch after this list).
- Firth logistic regression uses a penalized likelihood estimation method, which yields finite estimates even under separation (see the sketch after this list).
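As an illustration of the last two options, here is a brief R sketch applied to the quasi-complete-separation data, assuming the add-on packages logistf (Firth's method) and arm (bayesglm, with its default weakly informative prior) are installed; neither is part of base R, and these are only two of several available implementations:

y  <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
x1 <- c(1, 2, 3, 3, 3, 4, 5, 6, 10, 11)
x2 <- c(3, 0, -1, 4, 1, 0, 2, 7, 3, 4)
d  <- data.frame(y, x1, x2)

# Firth's penalized-likelihood logistic regression: finite estimate for x1
library(logistf)
mf <- logistf(y ~ x1 + x2, data = d)
summary(mf)

# approximate Bayesian fit with the package's default weakly informative prior
library(arm)
mb <- bayesglm(y ~ x1 + x2, family = binomial, data = d)
summary(mb)

Both fits produce finite, interpretable coefficients for x1, at the cost of introducing a penalty or prior that needs to be justified.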