Occasionally when running a logistic regression we run into the problem of so-called complete separation or quasi-complete separation. On this page, we will discuss what complete or quasi-complete separation means and how to deal with the problem when it occurs.
Notice that the made-up example data sets used on this page are extremely small. They are for the purpose of illustration only.
What is complete separation?
A complete separation in a logistic regression, sometimes also referred to as perfect prediction, happens when the outcome variable separates a predictor variable completely. Below is an example data set, where Y is the outcome variable, and X1 and X2 are predictor variables.
Y  X1  X2
0   1   3
0   2   2
0   3  -1
0   3  -1
1   5   2
1   6   4
1  10   1
1  11   0
We can see that observations with Y = 0 all have values of X1 <= 3, and observations with Y = 1 all have values of X1 > 3. In other words, Y separates X1 perfectly. The other way to see it is that X1 predicts Y perfectly, since X1 <= 3 corresponds to Y = 0 and X1 > 3 corresponds to Y = 1. If we were to dichotomize X1 into a binary variable using the cut point of 3, what we would get is exactly Y. That is, we have found a perfect predictor X1 for the outcome variable Y. In terms of predicted probabilities, we have Prob(Y = 1 | X1 <= 3) = 0 and Prob(Y = 1 | X1 > 3) = 1, without the need for estimating a model.
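We can verify this directly. Below is a minimal R check (our own sketch, using the data above with variable names of our choosing) showing that dichotomizing X1 at the cut point of 3 reproduces Y exactly:

y  <- c(0, 0, 0, 0, 1, 1, 1, 1)
x1 <- c(1, 2, 3, 3, 5, 6, 10, 11)

table(y, x1 > 3)              # the off-diagonal cells are empty: Y separates X1 completely
all(y == as.numeric(x1 > 3))  # TRUE: the dichotomized X1 is just Y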
Complete separation or perfect prediction can happen for somewhat different reasons. Here are two common scenarios.
- Another version of the outcome variable is being used as a predictor
variable. For example, we might have dichotomized a continuous variable X to
a binary variable Y. We then wanted to study the relationship between Y and
some predictor variables. If we included X as a predictor variable, we would
run into the problem of complete separation of X by Y as explained earlier.
- On rare occasions, it might happen simply because the data set is rather small and the distribution is somewhat extreme. For example, it could be the case that if we were to collect more data, we would have observations with Y = 1 and X1 <= 3, and hence Y would not separate X1 completely.
What happens when we try to fit a logistic regression model of Y on X1 and X2 using our small sample data shown above? Well, the maximum likelihood estimate of the parameter for X1 does not exist. In particular, with this example, the larger the coefficient for X1, the larger the likelihood. In other words, the coefficient for X1 should be as large as it can be, which would be infinity! Below is what each of SAS, SPSS, Stata, and R does with our sample data and model. We present these results here in the hope that some level of understanding of the behavior of logistic regression within our familiar software package might help us identify the problem more efficiently.
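To see why no finite maximum exists, here is a small R sketch of our own (not part of any package's output). It ignores X2 for simplicity and fixes the intercept at -3.5b, so the decision boundary stays between the two groups while the slope b grows:

y  <- c(0, 0, 0, 0, 1, 1, 1, 1)
x1 <- c(1, 2, 3, 3, 5, 6, 10, 11)

# binomial log-likelihood along the separating direction
loglik <- function(b) {
  eta <- b * (x1 - 3.5)            # linear predictor with intercept -3.5 * b
  sum(y * eta - log1p(exp(eta)))
}
sapply(c(1, 2, 5, 10, 20), loglik)
# the values increase monotonically toward 0, so there is no finite maximizer

Every step in this direction improves the fit, which is exactly why iterative estimation drives the X1 coefficient toward infinity.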
SAS
data t;
  input Y X1 X2;
cards;
0  1  3
0  2  2
0  3 -1
0  3 -1
1  5  2
1  6  4
1 10  1
1 11  0
;
run;
proc logistic data = t descending;
  model y = x1 x2;
run;

(some output omitted)

Model Convergence Status

Complete separation of data points detected.

WARNING: The maximum likelihood estimate does not exist.
WARNING: The LOGISTIC procedure continues in spite of the above warning.
         Results shown are based on the last maximum likelihood iteration.
         Validity of the model fit is questionable.

Model Fit Statistics

                   Intercept      Intercept and
Criterion               Only         Covariates
AIC                   13.090              6.005
SC                    13.170              6.244
-2 Log L              11.090              0.005

WARNING: The validity of the model fit is questionable.

Testing Global Null Hypothesis: BETA=0

Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio        11.0850     2        0.0039
Score                    6.8932     2        0.0319
Wald                     0.1302     2        0.9370

Analysis of Maximum Likelihood Estimates

                             Standard        Wald
Parameter    DF    Estimate     Error    Chi-Square    Pr > ChiSq
Intercept     1    -20.7083   73.7757        0.0788        0.7789
X1            1      4.4921   12.7425        0.1243        0.7244
X2            1      2.3960   27.9875        0.0073        0.9318

Odds Ratio Estimates

             Point         95% Wald
Effect    Estimate    Confidence Limits
X1          89.311    <0.001    >999.999
X2          10.980    <0.001    >999.999

Association of Predicted Probabilities and Observed Responses

Percent Concordant    100.0    Somers' D    1.000
Percent Discordant      0.0    Gamma        1.000
Percent Tied            0.0    Tau-a        0.571
Pairs                    16    c            1.000
We can see that the first relevant message is that SAS detected complete separation of the data points. It then gives further warning messages indicating that the maximum likelihood estimate does not exist, and continues to finish the computation. Notice also that SAS does not tell us which variable or variables are being separated completely by the outcome variable.
SPSS
data list list /Y X1 X2.
begin data.
0 1 3
0 2 2
0 3 -1
0 3 -1
1 5 2
1 6 4
1 10 1
1 11 0
end data.
logistic regression variable Y
  /method = enter X1 X2.

Logistic Regression

(some output omitted)

Warnings

The parameter covariance matrix cannot be computed. Remaining statistics will be omitted.

Case Processing Summary

Unweighted Cases(a)                       N    Percent
Selected Cases    Included in Analysis    8    100.0
                  Missing Cases           0    .0
                  Total                   8    100.0
Unselected Cases                          0    .0
Total                                     8    100.0

a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding

Original Value    Internal Value
.00               0
1.00              1

Block 0: Beginning Block

Classification Table(a,b)

                                    Predicted
                                    Y              Percentage
Observed                            .00    1.00    Correct
Step 0   Y                   .00    0      4       .0
                             1.00   0      4       100.0
         Overall Percentage                        50.0

a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation

                    B       S.E.    Wald    df    Sig.     Exp(B)
Step 0   Constant   .000    .707    .000    1     1.000    1.000

Variables not in the Equation

                                 Score    df    Sig.
Step 0   Variables         X1    5.576    1     .018
                           X2    .681     1     .409
         Overall Statistics      6.893    2     .032

Block 1: Method = Enter

Omnibus Tests of Model Coefficients

                  Chi-square    df    Sig.
Step 1   Step     11.090        2     .004
         Block    11.090        2     .004
         Model    11.090        2     .004

Model Summary

Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       .000(a)              .750                    1.000

a. Estimation terminated at iteration number 20 because a perfect fit is detected. This solution is not unique.
We see that SPSS detects the perfect fit and immediately stops the rest of the computation. It does not provide any parameter estimates. Moreover, even though it detects the perfect fit, it does not give us any information on which set of variables produces it.
Stata
clear
input Y X1 X2
0 1 3
0 2 2
0 3 -1
0 3 -1
1 5 2
1 6 4
1 10 1
1 11 0
end
logit Y X1 X2

outcome = X1 > 3 predicts data perfectly
r(2000);
We see that Stata detects the perfect prediction by X1 and stops computation immediately.
R
y  <- c(0, 0, 0, 0, 1, 1, 1, 1)
x1 <- c(1, 2, 3, 3, 5, 6, 10, 11)
x2 <- c(3, 2, -1, -1, 2, 4, 1, 0)
m1 <- glm(y ~ x1 + x2, family = binomial)

Warning message:
In glm.fit(x = X, y = Y, weights = weights, start = start, etastart = etastart,  :
  fitted probabilities numerically 0 or 1 occurred

summary(m1)

Call:
glm(formula = y ~ x1 + x2, family = binomial)

Deviance Residuals:
         1           2           3           4           5           6           7           8
-2.107e-08  -1.404e-05  -2.522e-06  -2.522e-06   1.564e-05   2.107e-08   2.107e-08   2.107e-08

Coefficients:
              Estimate Std. Error   z value Pr(>|z|)
(Intercept)    -66.098 183471.722 -3.60e-04        1
x1              15.288  27362.843     0.001        1
x2               6.241  81543.720  7.65e-05        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.1090e+01  on 7  degrees of freedom
Residual deviance: 4.5454e-10  on 5  degrees of freedom
AIC: 6

Number of Fisher Scoring iterations: 24
The only warning message R gives is right after fitting the logistic model. It says that "fitted probabilities numerically 0 or 1 occurred", which can be interpreted as perfect prediction or quasi-complete separation. In addition, the standard errors for the parameter estimates are implausibly large, which usually indicates a convergence problem or some degree of separation in the data.
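Continuing from the model above, a simple follow-up in R is to look at the fitted probabilities and to cross-tabulate the outcome against each predictor; this is our own diagnostic sketch, not output R produces automatically:

round(fitted(m1), 6)  # every fitted probability is numerically 0 or 1
table(y, x1)          # y = 0 only for x1 <= 3 and y = 1 only for x1 > 3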
What is quasi-complete separation and what can be done about it?
Quasi-complete separation in logistic regression happens when the outcome variable separates a predictor variable or a combination of predictor variables almost completely. Here is an example.
Y  X1  X2
0   1   3
0   2   0
0   3  -1
0   3   4
1   3   1
1   4   0
1   5   2
1   6   7
1  10   3
1  11   4
Notice that the outcome variable Y separates the predictor variable X1 pretty well, except for values of X1 equal to 3. In other words, X1 predicts Y perfectly when X1 < 3 (Y = 0) or X1 > 3 (Y = 1), leaving only X1 = 3 as a case with uncertainty. In terms of predicted probabilities, we would have Prob(Y = 1 | X1 < 3) = 0 and Prob(Y = 1 | X1 > 3) = 1, with nothing to estimate except Prob(Y = 1 | X1 = 3).
What happens when we try to fit a logistic regression model of Y on X1 and X2 using the data above? It turns out that the maximum likelihood estimate for the X1 parameter again does not exist: the larger the coefficient for X1, the larger the likelihood, so there is no finite maximum, at least in the mathematical sense. In practice, a value of about 15 or larger makes little difference, since such values all correspond to a predicted probability of essentially 1 (see the check below). Statistical software packages differ in how they deal with quasi-complete separation. Below is what each of SAS, SPSS, Stata, and R does with our sample data and model. We present these results here in the hope that some level of understanding of the behavior of logistic regression within our familiar software package might help us identify the problem more efficiently.
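Here is a quick check of the claim about coefficients of 15 or more, using R's inverse-logit function plogis:

plogis(c(5, 10, 15, 20))
# 0.9933071 0.9999546 0.9999997 1.0000000

Beyond a linear predictor of roughly 15, the predicted probability is 1 for all practical purposes.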
SAS
data t2;
  input Y X1 X2;
cards;
0  1  3
0  2  0
0  3 -1
0  3  4
1  3  1
1  4  0
1  5  2
1  6  7
1 10  3
1 11  4
;
run;
proc logistic data = t2 descending;
  model y = x1 x2;
run;

Model Information

Data Set                      WORK.T2
Response Variable             Y
Number of Response Levels     2
Model                         binary logit
Optimization Technique        Fisher's scoring

Number of Observations Read    10
Number of Observations Used    10

Response Profile

Ordered              Total
  Value    Y     Frequency
      1    1             6
      2    0             4

Probability modeled is Y=1.

Model Convergence Status

Quasi-complete separation of data points detected.

WARNING: The maximum likelihood estimate may not exist.
WARNING: The LOGISTIC procedure continues in spite of the above warning.
         Results shown are based on the last maximum likelihood iteration.
         Validity of the model fit is questionable.

Model Fit Statistics

                   Intercept      Intercept and
Criterion               Only         Covariates
AIC                   15.460              9.784
SC                    15.763             10.691
-2 Log L              13.460              3.784

WARNING: The validity of the model fit is questionable.

Testing Global Null Hypothesis: BETA=0

Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio         9.6767     2        0.0079
Score                    4.3528     2        0.1134
Wald                     0.1464     2        0.9294

Analysis of Maximum Likelihood Estimates

                             Standard        Wald
Parameter    DF    Estimate     Error    Chi-Square    Pr > ChiSq
Intercept     1    -21.4542   64.5674        0.1104        0.7397
X1            1      6.9705   21.5019        0.1051        0.7458
X2            1     -0.1206    0.6096        0.0392        0.8431

Odds Ratio Estimates

             Point         95% Wald
Effect    Estimate    Confidence Limits
X1        >999.999    <0.001    >999.999
X2           0.886     0.268       2.927

Association of Predicted Probabilities and Observed Responses

Percent Concordant    95.8    Somers' D    0.917
Percent Discordant     4.2    Gamma        0.917
Percent Tied           0.0    Tau-a        0.489
Pairs                   24    c            0.958
We see that SAS uses all 10 observations, and it gives warnings at various points. It informs us that it has detected quasi-complete separation of the data points. The parameter estimate for X1 does not mean much at all, and neither does the parameter estimate for the intercept. The coefficient for X2, however, actually is the correct maximum likelihood estimate, and it can be used for inference about X2, assuming that the intended model is based on both X1 and X2.
Stata
clear
input y x1 x2
0 1 3
0 2 0
0 3 -1
0 3 4
1 3 1
1 4 0
1 5 2
1 6 7
1 10 3
1 11 4
end
logit y x1 x2

note: outcome = x1 > 3 predicts data perfectly except for x1 == 3 subsample:
      x1 dropped and 7 obs not used

Iteration 0:   log likelihood = -1.9095425
Iteration 1:   log likelihood = -1.8896311
Iteration 2:   log likelihood = -1.8895913
Iteration 3:   log likelihood = -1.8895913

Logistic regression                               Number of obs   =          3
                                                  LR chi2(1)      =       0.04
                                                  Prob > chi2     =     0.8417
Log likelihood = -1.8895913                       Pseudo R2       =     0.0104

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |  (omitted)
          x2 |  -.1206257   .6098361    -0.20   0.843    -1.315883    1.074631
       _cons |  -.5427435   1.421095    -0.38   0.703    -3.328038    2.242551
------------------------------------------------------------------------------
Stata detects the quasi-complete separation and tells us which predictor variable is part of the issue: x1 predicts the data perfectly except when x1 = 3. Stata therefore drops all the cases in which x1 predicts the outcome variable perfectly, keeping only the three observations with x1 = 3. Since x1 is constant (= 3) in this subsample, it is dropped from the analysis. The parameter estimate for x2 is actually correct and can be used for inference about x2, assuming that the intended model is based on both x1 and x2.
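We can mimic what Stata did with a quick R check of our own: refit the model using only the three observations with x1 == 3, where x1 is constant and therefore drops out. The x2 coefficient should reproduce Stata's -0.1206 shown above:

y  <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
x1 <- c(1, 2, 3, 3, 3, 4, 5, 6, 10, 11)
x2 <- c(3, 0, -1, 4, 1, 0, 2, 7, 3, 4)
d  <- data.frame(y, x1, x2)

# logit of y on x2 in the x1 == 3 subsample only
m.sub <- glm(y ~ x2, family = binomial, data = d, subset = (x1 == 3))
coef(m.sub)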
SPSS
data list list /y x1 x2.
begin data.
0 1 3
0 2 0
0 3 -1
0 3 4
1 3 1
1 4 0
1 5 2
1 6 7
1 10 3
1 11 4
end data.
logistic regression variable y
  /method = enter x1 x2.

(Some output omitted)

Block 1: Method = Enter

Omnibus Tests of Model Coefficients

                  Chi-square    df    Sig.
Step 1   Step     9.681         2     .008
         Block    9.681         2     .008
         Model    9.681         2     .008

Model Summary

Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       3.779(a)             .620                    .838

a. Estimation terminated at iteration number 20 because maximum iterations has been reached. Final solution cannot be found.

Classification Table(a)

                                    Predicted
                                    y              Percentage
Observed                            .00    1.00    Correct
Step 1   y                   .00    4      0       100.0
                             1.00   1      5       83.3
         Overall Percentage                        90.0

a. The cut value is .500

Variables in the Equation

                       B          S.E.         Wald    df    Sig.    Exp(B)
Step 1(a)   x1         17.923     5140.147     .000    1     .997    6.082E7
            x2         -.121      .610         .039    1     .843    .886
            Constant   -54.313    15420.442    .000    1     .997    .000

a. Variable(s) entered on step 1: x1, x2.
SPSS iterated up to the default maximum number of iterations, could not reach a solution, and therefore stopped the iterative process. It did not tell us anything about quasi-complete separation, so it is up to us to figure out why the computation did not converge. One obvious piece of evidence is the magnitude of the parameter estimate for x1: it is very large, and its standard error is even larger. Based on this evidence, we should look at the bivariate relationship between the outcome variable y and x1. The parameter estimate for x2 is actually correct and can be used for inference about x2, assuming that the intended model is based on both x1 and x2.
R
y  <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
x1 <- c(1, 2, 3, 3, 3, 4, 5, 6, 10, 11)
x2 <- c(3, 0, -1, 4, 1, 0, 2, 7, 3, 4)
m1 <- glm(y ~ x1 + x2, family = binomial)

Warning message:
In glm.fit(x = X, y = Y, weights = weights, start = start, etastart = etastart,  :
  fitted probabilities numerically 0 or 1 occurred

summary(m1)

Call:
glm(formula = y ~ x1 + x2, family = binomial)

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-1.004e+00  -5.538e-05   2.107e-08   2.107e-08   1.469e+00

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -58.0761 17511.9030  -0.003    0.997
x1            19.1778  5837.3009   0.003    0.997
x2            -0.1206     0.6098  -0.198    0.843

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13.4602  on 9  degrees of freedom
Residual deviance:  3.7792  on 7  degrees of freedom
AIC: 9.7792

Number of Fisher Scoring iterations: 21
The only warning we get from R is right after the glm command about predicted probabilities being 0 or 1. From the parameter estimates we can see that the coefficient for x1 is very large and its standard error is even larger, an indication that the model might have some issues with x1. At this point, we should investigate the bivariate relationship between the outcome variable and x1 closely. On the other hand, the parameter estimate for x2 is actually the correct estimate based on the model and can be used for inference about x2 assuming that the intended model is based on both x1 and x2.
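A simple way to investigate that bivariate relationship, continuing from the R session above (our own diagnostic, not automatic output):

table(y, x1)
#    x1
# y   1 2 3 4 5 6 10 11
#   0 1 1 2 0 0 0  0  0
#   1 0 0 1 1 1 1  1  1

The two outcome groups overlap only at x1 == 3, which is exactly the quasi-complete separation described above.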
What are the techniques for dealing with quasi-complete separation?
There are a few options for dealing with quasi-complete separation. We will briefly discuss some of them here. Let's say that the predictor variable X is being separated quasi-completely by the outcome variable. Our discussion will be focused on what to do with X.
- The easiest strategy is to do nothing. This is because the maximum likelihood estimates for the other predictor variables are still valid, as we have seen in the previous sections. The drawback is that we do not get any reasonable estimate for the variable that predicts the outcome variable so well.
- Another simple strategy is to not include X in the model. This is not a recommended strategy, however, since it leads to biased estimates of the other variables in the model.
- If X is a categorical variable, we might be able to collapse some of its categories, if it makes sense to do so.
- Exact logistic regression is a good strategy when the data set is small and the model is not very large.
- A Bayesian method can be used when we have additional prior information on the parameter for X (see the sketch after this list).
- Firth logistic regression uses a penalized likelihood estimation method, which yields finite estimates even under separation (see the sketch after this list).
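As an illustration of the last two options, here is a brief R sketch applied to the quasi-complete-separation data, assuming the add-on packages logistf (Firth's method) and arm (bayesglm, with its default weakly informative prior) are installed; neither is part of base R, and these are only two of several available implementations:

y  <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
x1 <- c(1, 2, 3, 3, 3, 4, 5, 6, 10, 11)
x2 <- c(3, 0, -1, 4, 1, 0, 2, 7, 3, 4)
d  <- data.frame(y, x1, x2)

# Firth's penalized-likelihood logistic regression: finite estimate for x1
library(logistf)
mf <- logistf(y ~ x1 + x2, data = d)
summary(mf)

# approximate Bayesian fit with the package's default weakly informative prior
library(arm)
mb <- bayesglm(y ~ x1 + x2, family = binomial, data = d)
summary(mb)

Both fits produce finite, interpretable coefficients for x1, at the cost of introducing a penalty or prior that needs to be justified.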