This page shows an example of truncated regression analysis in SAS with footnotes explaining the output. A truncated regression model predicts an outcome variable that is observed only over a restricted portion of its distribution. For example, if we wish to predict the age of licensed motorists from driving habits, our outcome variable is truncated at 16 (the legal driving age in the U.S.). While the population of ages extends below 16, our sample of the population does not.

It is important to note the difference between truncated and censored data. In the case of censored data, there are limitations to the measurement scale that prevent us from knowing the true value of the dependent variable despite having some measurement of it. Consider the speedometer in a car. The speedometer may measure speeds up to 120 miles per hour, but all speeds equal to or greater than 120 mph will be read as 120 mph. Thus, if the speedometer reads 120 mph, the car could be traveling 120 mph or any greater speed; we have no way of knowing. Censoring reflects a limit of the measurement scale of the outcome variable, while truncation reflects a limit on which values of the outcome variable appear in the sample of interest.
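The practical difference can be seen in a small simulation (sketched here in Python rather than SAS; the numbers and variable names are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Censoring: every car is observed, but the speedometer tops out at 120 mph,
# so any true speed of 120 or more is recorded as exactly 120.
true_speeds = rng.normal(90, 25, size=1000)    # hypothetical true speeds
recorded_speeds = np.minimum(true_speeds, 120)

# Truncation: drivers under 16 cannot be licensed, so they never appear
# in the sample at all.
ages = rng.normal(40, 18, size=1000)           # hypothetical population ages
licensed_ages = ages[ages >= 16]

# Censoring preserves the sample size; truncation shrinks it.
print(len(recorded_speeds), len(licensed_ages))
```

Every censored observation is still in the data (just capped at the bound), while truncated observations are missing from the data entirely.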
In this example, we will look at data from a study of students in a special GATE (gifted and talented education) program, https://stats.idre.ucla.edu/wp-content/uploads/2016/02/truncated.sas7bdat. We wish to model achievement (achiv) as a function of gender, language skills and math skills (female, langscore and mathscore in the dataset). A major concern is that students require a minimum achievement score of 40 to enter the special program. Thus, the sample is truncated at an achievement score of 40.
First, we will examine the data. We are interested in checking the range of values of our outcome variable, so we will include a histogram of achiv. For our other variables, we simply want a general sense of the values. For this, we can look at the summary statistics from proc means and a frequency of the categorical variable female.
data truncated;
  set "D:\data\truncated";
run;

proc means data = truncated;
run;
The MEANS Procedure

Variable          N            Mean         Std Dev         Minimum         Maximum
-----------------------------------------------------------------------------------
ID              178     103.6235955      57.0895709       3.0000000     200.0000000
ACHIV           178      54.2359551       8.9632299      41.0000000      76.0000000
FEMALE          178       0.5505618       0.4988401               0       1.0000000
LANGSCORE       178       5.4011236       0.8944896       3.0999999       6.6999998
MATHSCORE       178       5.3028090       0.9483515       3.0999999       7.4000001
-----------------------------------------------------------------------------------
proc univariate data = truncated;
  var achiv;
  histogram achiv;
run;
proc freq data = truncated;
  table female;
run;
The FREQ Procedure

                                        Cumulative    Cumulative
FEMALE    Frequency      Percent         Frequency       Percent
-----------------------------------------------------------------
     0          80         44.94                80         44.94
     1          98         55.06               178        100.00
Now, we can generate a truncated regression model in SAS using proc qlim. We first indicate the outcome and predictors in the model statement. We then indicate in the endogenous statement that our outcome variable, achiv, is truncated with a lower bound of 40. If our data also had an upper bound, we would include it in this line as well.
proc qlim data = truncated;
  model achiv = female langscore mathscore;
  endogenous achiv ~ truncated(lb=40);
run;
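Under the hood, proc qlim maximizes the log likelihood of a normal model truncated from below: each observation contributes the log of a normal density divided by the probability of exceeding the lower bound. The sketch below illustrates the same estimation idea on simulated data (in Python rather than SAS; the data, true coefficients, and variable names are made up for illustration and are not from the GATE dataset):

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(42)

# Simulated data: a latent normal outcome, kept only when it exceeds lb.
n, lb = 5000, 40.0
x = rng.uniform(3, 7, size=n)
y_latent = 10.0 + 8.0 * x + rng.normal(0.0, 7.0, size=n)
keep = y_latent > lb
x, y = x[keep], y_latent[keep]

def neg_loglik(params):
    """Negative log likelihood of a normal model truncated from below at lb."""
    b0, b1, log_sigma = params
    sigma = np.exp(log_sigma)          # optimize log(sigma) to keep sigma > 0
    mu = b0 + b1 * x
    # log f(y | y > lb) = log phi((y - mu)/sigma) - log sigma - log P(Y > lb)
    ll = (stats.norm.logpdf(y, loc=mu, scale=sigma)
          - stats.norm.logsf(lb, loc=mu, scale=sigma))
    return -np.sum(ll)

# Start from OLS estimates (biased under truncation, but a fine start value).
b1_0, b0_0 = np.polyfit(x, y, 1)
resid = y - (b0_0 + b1_0 * x)
res = optimize.minimize(neg_loglik, x0=[b0_0, b1_0, np.log(resid.std())],
                        method="Nelder-Mead")
b0_hat, b1_hat, sigma_hat = res.x[0], res.x[1], np.exp(res.x[2])
print(b0_hat, b1_hat, sigma_hat)
```

With enough data, the maximum likelihood estimates recover the latent slope and error standard deviation, whereas OLS fit directly to the truncated sample would understate the slope.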
The QLIM Procedure

Summary Statistics of Continuous Responses

                         Standard                   Lower    Upper    N Obs Lower    N Obs Upper
Variable       Mean      Error        Type          Bound    Bound    Bound          Bound
achiv      54.23596      8.963230     Truncated        40

Model Fit Summary

Number of Endogenous Variables           1
Endogenous Variable                  achiv
Number of Observations                 178
Log Likelihood                  -574.53056
Maximum Absolute Gradient       2.72145E-6
Number of Iterations                    12
AIC                                   1159
Schwarz Criterion                     1175

Algorithm converged.

Parameter Estimates

                              Standard                 Approx
Parameter      Estimate       Error       t Value      Pr > |t|
Intercept     -0.293996     6.204858        -0.05        0.9622
FEMALE        -2.290930     1.490333        -1.54        0.1242
LANGSCORE      5.064697     1.037769         4.88        <.0001
MATHSCORE      5.004053     0.955571         5.24        <.0001
_Sigma         7.739052     0.547644        14.13        <.0001
Truncated Regression Output
The QLIM Procedure

Summary Statistics of Continuous Responses

                            Standard                    Lower      Upper      N Obs Lower    N Obs Upper
Variable(a)    Mean(b)      Error(c)      Type(d)       Bound(e)   Bound(f)   Bound(g)       Bound(h)
achiv         54.23596      8.963230      Truncated        40

Model Fit Summary

Number of Endogenous Variables           1
Endogenous Variable                  achiv
Number of Observations                 178
Log Likelihood(i)               -574.53056
Maximum Absolute Gradient(j)    2.72145E-6
Number of Iterations(k)                 12
AIC(l)                                1159
Schwarz Criterion(m)                  1175

Algorithm converged.

Parameter Estimates

                                 Standard                    Approx
Parameter      Estimate(n)       Error(o)      t Value(p)    Pr > |t|(q)
Intercept       -0.293996       6.204858          -0.05         0.9622
FEMALE          -2.290930       1.490333          -1.54         0.1242
LANGSCORE        5.064697       1.037769           4.88         <.0001
MATHSCORE        5.004053       0.955571           5.24         <.0001
_Sigma(r)        7.739052       0.547644          14.13         <.0001
a. Variable – This is the outcome variable predicted in the regression. In this example, achiv is the truncated outcome variable.
b. Mean – This is the mean of the outcome variable. In this example, the mean of achiv is 54.23596.
c. Standard Error – Despite the label, this is the standard deviation of the outcome variable: it is equal to 8.9632299, the standard deviation we saw in the proc means output earlier.
d. Type – This describes the type of endogenous variable being modeled. Proc qlim allows for both truncated and censored outcome variables. In this example, our outcome is truncated.
e. Lower Bound – This indicates the lower limit specified for the outcome variable. In this example, the lower limit is 40.
f. Upper Bound – This indicates the upper limit specified for the outcome variable. In this example, we did not specify an upper limit.
g. N Obs Lower Bound – This indicates how many observations in the model had outcome variable values below the lower limit specified on the endogenous statement. In this example, it is the number of observations where achiv < 40. The minimum value of achiv listed in the data summary was 41, so there were zero observations truncated from below.
h. N Obs Upper Bound – This indicates how many observations in the model had outcome variable values above the upper limit indicated on the endogenous statement. In this example, we did not specify an upper limit, so there were zero observations truncated from above.
i. Log Likelihood – This is the log likelihood of the fitted model. It is used in the Likelihood Ratio Chi-Square test of whether all predictors’ regression coefficients in the model are simultaneously zero.
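The likelihood ratio test itself is not printed in this output; it compares the fitted log likelihood to that of an intercept-only model. The arithmetic can be sketched as follows (in Python; the null-model log likelihood of -615.0 below is hypothetical, not a value from this output):

```python
from scipy import stats

ll_full = -574.53056   # log likelihood of the fitted model (from the output)
ll_null = -615.0       # hypothetical log likelihood of an intercept-only model

# LR chi-square statistic: twice the gain in log likelihood, with degrees of
# freedom equal to the number of predictors set to zero under the null
# (female, langscore, mathscore -> df = 3).
lr = 2 * (ll_full - ll_null)
p_value = stats.chi2.sf(lr, df=3)
print(lr, p_value)
```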
j. Maximum Absolute Gradient – This is the absolute value of the gradient seen in the last iteration. The default convergence criterion used by proc qlim is an absolute gradient of 0.00001. Thus, when the absolute gradient falls below 0.00001, the model has converged. This value is the first absolute gradient less than 0.00001. If you wish to see additional output regarding the iteration history, add the itprint option to the proc qlim statement.
k. Number of Iterations – This is the number of iterations required by SAS for the model to converge. Truncated regression uses maximum likelihood estimation, which is an iterative procedure. The first iteration is the “null” or “empty” model; that is, a model with no predictors. At the next iteration, the specified predictors are included in the model. In this example, the predictors are female, langscore and mathscore. At each iteration, the log likelihood increases because the goal is to maximize the log likelihood. When the difference between successive iterations is very small, the model is said to have “converged” and the iterating stops. For more information on this process, see Regression Models for Categorical and Limited Dependent Variables by J. Scott Long (page 52-61).
l. AIC – This is the Akaike Information Criterion. It is a measure of model fit that is calculated as AIC = -2 Log L + 2p, where p is the number of parameters estimated in the model. In this example, p=5; three predictors, one intercept, and _Sigma (see superscript r). AIC is used for the comparison of models from different samples or non-nested models. Ultimately, the model with the smallest AIC is considered the best.
m. Schwarz Criterion – This is the Schwarz Criterion (SC). It is defined as SC = -2 Log L + p*log(Σ fi), where fi is the frequency of the ith observation and p was defined previously. In this example, every observation has frequency 1, so Σ fi = 178, the number of observations. Like AIC, SC penalizes the number of parameters in the model, and the smallest SC is most desirable.
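Both criteria can be reproduced from the log likelihood reported in the Model Fit Summary (a quick check in Python):

```python
import math

log_l = -574.53056   # Log Likelihood from the Model Fit Summary
p = 5                # intercept + three predictors + _Sigma
n = 178              # observations, each with frequency 1, so sum(f_i) = 178

aic = -2 * log_l + 2 * p
sc = -2 * log_l + p * math.log(n)
print(round(aic), round(sc))   # matches the rounded values in the output
```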
n. Estimate – These are the estimated regression coefficients. They are interpreted in the same manner as OLS regression coefficients: for a one unit increase in the predictor variable, the expected value of the outcome variable changes by the regression coefficient, given the other predictor variables in the model are held constant.
Intercept – Sometimes called the constant, this is the regression estimate when all predictor variables in the model are evaluated at zero. For a male student (the variable female evaluated at zero) with langscore and mathscore of zero, the predicted achievement score is -0.293996. Note that evaluating langscore and mathscore at zero is out of the range of plausible test scores.
female – The expected achievement score for a female student is 2.290930 units lower than the expected achievement score for a male student while holding all other variables in the model constant. In other words, if two students, one female and one male, had identical language and math scores, the predicted achievement score of the male would be 2.290930 units higher than the predicted achievement score of the female student.
langscore – This is the estimated coefficient for a one unit increase in langscore, given the other variables in the model are held constant. If a student were to increase her langscore by one point, her predicted achievement score would increase by 5.064697 units. Thus, students with higher language scores will have higher predicted achievement scores than students with lower language scores, holding the other variables constant.

mathscore – This is the estimated coefficient for a one unit increase in mathscore, given the other variables in the model are held constant. If a student were to increase her mathscore by one point, her predicted achievement score would increase by 5.004053 units. Thus, students with higher math scores will have higher predicted achievement scores than students with lower math scores, holding the other variables constant.
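Putting the estimates together, a predicted value on the latent (untruncated) scale is the usual linear combination of the coefficients. For example, for a hypothetical female student with langscore = 5 and mathscore = 5 (values chosen for illustration):

```python
# Coefficients from the Parameter Estimates table
intercept = -0.293996
b_female  = -2.290930
b_lang    =  5.064697
b_math    =  5.004053

# Hypothetical student: female = 1, langscore = 5, mathscore = 5
xb = intercept + b_female * 1 + b_lang * 5 + b_math * 5
print(round(xb, 3))
```

Note that xb is the mean of the latent achievement distribution; the expected value of achiv conditional on being observed (achiv > 40) would be somewhat higher.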
o. Standard Error – These are the standard errors of the individual regression coefficients. They are used in the calculation of the t test statistic, superscript p.
p. t Value – The test statistic t is the ratio of the parameter Estimate to its Standard Error. The t value is used to test against a two-sided alternative hypothesis that the Estimate is not equal to zero.

q. Approx Pr > |t| – This is the probability of observing the t test statistic (or a more extreme test statistic) under the null hypothesis that a particular predictor's regression coefficient is zero, given that the rest of the predictors are in the model. For a given alpha level, Pr > |t| determines whether or not the null hypothesis can be rejected. If Pr > |t| is less than alpha, then the null hypothesis can be rejected and the parameter estimate is considered statistically significant at that alpha level.
Intercept – The t test statistic for Intercept is (-0.293996/6.204858) = -0.05 with an associated p-value of 0.9622. If we set our alpha level at 0.05, we would fail to reject the null hypothesis and conclude that Intercept has not been found to be statistically different from zero, given that female, langscore and mathscore are in the model and evaluated at zero.
female – The t test statistic for the predictor female is (-2.290930/1.490333) = -1.54 with an associated p-value of 0.1242. If we set our alpha level to 0.05, we would fail to reject the null hypothesis and conclude that the regression coefficient for female has not been found to be statistically different from zero given langscore and mathscore are in the model.
langscore – The t test statistic for the predictor langscore is (5.064697/1.037769) = 4.88 with an associated p-value of <0.0001. If we set our alpha level to 0.05, we would reject the null hypothesis and conclude that the regression coefficient for langscore has been found to be statistically different from zero given female and mathscore are in the model.

mathscore – The t test statistic for the predictor mathscore is (5.004053/0.955571) = 5.24 with an associated p-value of <0.0001. If we set our alpha level to 0.05, we would reject the null hypothesis and conclude that the regression coefficient for mathscore has been found to be statistically different from zero given female and langscore are in the model.
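Each t value above can be reproduced directly from the Estimate and Standard Error columns, and the approximate p-values agree to within rounding (a check in Python; a standard normal reference distribution is assumed here):

```python
from scipy import stats

# (Estimate, Standard Error) pairs from the Parameter Estimates table
params = {
    "Intercept": (-0.293996, 6.204858),
    "FEMALE":    (-2.290930, 1.490333),
    "LANGSCORE": ( 5.064697, 1.037769),
    "MATHSCORE": ( 5.004053, 0.955571),
}

t_values = {}
p_values = {}
for name, (est, se) in params.items():
    t_values[name] = est / se
    # two-sided p-value from the assumed standard normal reference
    p_values[name] = 2 * stats.norm.sf(abs(t_values[name]))
    print(name, round(t_values[name], 2), p_values[name])
```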
r. _Sigma – This is the estimated standard error of the regression. In this example, the value, 7.739052, is comparable to the root mean squared error that would be obtained in an OLS regression. If we ran an OLS regression with the same outcome and predictors, our RMSE would be 6.8549. This is indicative of how much the outcome varies from the predicted value. _Sigma approximates this quantity for truncated regression.