To illustrate the models explained in this FAQ we will be using the recur data set from Applied Survival Analysis by Hosmer and Lemeshow.
There are at least four different models that one could use to model repeat events in a survival analysis. The choice will depend on the data to be analyzed and the research question to be answered. For a more in depth discussion of the models please refer to section 9.2 of Applied Survival Analysis by Hosmer and Lemeshow.
Model | Subject | Time interval | Events | Stratum | Subject | Time interval | Events | Stratum |
Counting Process | 1 | (0, 6] (6, 9] (9, 56] (56, 88] |
1 1 1 1 |
1 1 1 1 |
2 | (0, 42] (42, 87] (87, 91] |
1 1 0 |
1 1 1 |
Conditional Model A | 1 | (0, 6] (6, 9] (9, 56] (56, 88] |
1 1 1 1 |
1 2 3 4 |
2 | (0, 42] (42, 87] (87, 91] |
1 1 0 |
1 2 3 |
Conditional Model B | 1 | (0, 6] (0, 3] (0, 47] (0, 32] |
1 1 1 1 |
1 2 3 4 |
2 | (0, 42] (0, 45] (0, 3] |
1 1 0 |
1 2 3 |
Marginal | 1 | (0, 6] (0, 9] (0, 56] (0, 88] |
1 1 1 1 |
1 2 3 4 |
2 | (0, 42] (0, 87] (0, 91] (0, 91] |
1 1 0 0 |
1 2 3 4 |
The counting process model
The first model that we will discuss is the counting process model in which each event is assumed to be independent and a subject contributes to the risk set for an event as long as the subject is under observation at the time the event occurs. The data for each subject with multiple events could be described as data for multiple subjects where each has delayed entry and is followed until the next event. This model, thus, ignores the order of the events leaving each subject to be at risk for any event as long as they are still under observation at the time of the event. This implies that a subject could be at risk for a subsequent event without having experienced the prior events.
In the recur data set we see that the first subject will be at risk for for any event occurring between 0 and 88 months and subject two for any events occurring between 0 and 91 months.
We do want to account for the fact that we are modeling multiple events for each subject and thus there might be correlation between the observations within each subject. In order to fit a model which accounts for correlated observations within subjects we will be using two options in proc phreg which are not documented in the online help pages because the options were not introduced until SAS version 8.1 (the online documentation is only current for version 8.0). The covm option requests the model-based estimate for the covariance matrix. The covs option requests a robust sandwich estimate for the covariance matrix which results in a robust standard error for the parameter estimates. A modified score test is also computed for testing the global null hypothesis. The aggregate keyword in the covs option requests a summing up of the score residuals for each distinct id pattern. Beware that this keyword has NO EFFECT if used without the id statement!!!
proc phreg data=recur covs(aggregate) covm; model (time0 time1)*censor(0) = age treat; id id; run; <output omitted> Summary of the Number of Event and Censored Values Percent Total Event Censored Censored 1296 939 357 27.55 Model Fit Statistics Without With Criterion Covariates Covariates -2 LOG L 10399.844 10370.853 AIC 10399.844 10374.853 SBC 10399.844 10384.543 Testing Global Null Hypothesis: BETA=0 Test Chi-Square DF Pr > ChiSq Likelihood Ratio 28.9903 2 <.0001 Score 29.2116 2 <.0001 Modified Score 44.6322 2 <.0001 Wald (COVM) 29.1437 2 <.0001 Wald (COVS) 53.9374 2 <.0001 Analysis of Maximum Likelihood Estimates with Model-Based Variance Estimate Parameter Standard Hazard Variable DF Estimate Error Chi-Square Pr > ChiSq Ratio Variable Label age 1 0.04379 0.01087 16.2161 <.0001 1.045 Age treat 1 0.24081 0.06582 13.3875 0.0003 1.272 Treatment Assignment Analysis of Maximum Likelihood Estimates with Sandwich Variance Estimate Parameter Standard StdErr Hazard Variable DF Estimate Error Ratio Chi-Square Pr > ChiSq Ratio Variable Label age 1 0.04379 0.00746 0.686 34.4713 <.0001 1.045 Age treat 1 0.24081 0.04755 0.722 25.6491 <.0001 1.272 Treatment Assignment
In the output for the global null hypothesis test there are three additional tests. The Modified score test and the Wald (COVS) test are both based on the robust sandwich estimate and their chi square statistic is much larger than the other tests. The Wald (COVM) test is based on the model-based covariance matrix. All three are very different from the other tests which ignore the correlation of observations within subjects and the chi square statistic is the same size as the other tests. Fortunately, they are all significant at the 0.05 level.
In the parameter estimate table there is a decrease in the size of the standard error and, therefore, an increase in significance of the parameters when using the sandwich variance estimate as compared to the model-based variance parameter estimates.
The hazard ratio estimate for age indicate that for every one increase in age the rate of muscle soreness increases by 4.5%; the hazard ratio for treat indicates that the subjects in the control group experience soreness at a rate which is 27.2% higher than the treatment group. Treat is coded in a rather unusual manner in that 0=treatment group and 1=control group.
The conditional model A
The second model we will discuss is a conditional model. It is conditional because it assumes that it is not possible to be at risk for a subsequent event without having experienced the previous event (i.e. you can not be at risk for event 2 without having experienced event 1). We use a strata variable to indicate the event number. In this model the time interval of a subsequent event starts at the end of the time interval for the previous event. This model is very useful for modeling the full time course of the recurrent event process.
In the recur data set the time intervals are set up exactly the same as in the counting process model with each time interval starting at the time of the previous event. The big difference between this model and the counting process model is that we are using the stratum variable to keep track of the event number; thus, ensuring that it is not possible to be at risk for subsequent events without having experienced the previous events.
proc phreg data=recur covs(aggregate) covm; model (time0 time1)*censor(0) = age treat; strata event; id id; run; <output omitted> Summary of the Number of Event and Censored Values Percent Stratum event Total Event Censored Censored 1 1 400 386 14 3.50 2 2 386 324 62 16.06 3 3 324 186 138 42.59 4 4 186 43 143 76.88 ------------------------------------------------------------------- Total 1296 939 357 27.55 Model Fit Statistics Without With Criterion Covariates Covariates -2 LOG L 8402.728 8333.665 AIC 8402.728 8337.665 SBC 8402.728 8347.354 Testing Global Null Hypothesis: BETA=0 Test Chi-Square DF Pr > ChiSq Likelihood Ratio 69.0630 2 <.0001 Score 69.8026 2 <.0001 Modified Score 74.5691 2 <.0001 Wald (COVM) 69.1747 2 <.0001 Wald (COVS) 67.5684 2 <.0001 Analysis of Maximum Likelihood Estimates with Model-Based Variance Estimate Parameter Standard Hazard Variable DF Estimate Error Chi-Square Pr > ChiSq Ratio Variable Label age 1 0.05764 0.01096 27.6476 <.0001 1.059 Age treat 1 0.47299 0.07009 45.5346 <.0001 1.605 Treatment Assignment Analysis of Maximum Likelihood Estimates with Sandwich Variance Estimate Parameter Standard StdErr Hazard Variable DF Estimate Error Ratio Chi-Square Pr > ChiSq Ratio Variable Label age 1 0.05764 0.01025 0.935 31.6011 <.0001 1.059 Age treat 1 0.47299 0.07090 1.011 44.5093 <.0001 1.605 Treatment Assignment
In this model there are also three extra global null hypothesis test but for this model the chi-square statistics based on the robust sandwich estimate are not much larger than the chi-square statistics of the other tests.
The standard errors using the Sandwich variance estimate is approximately the same as using the model-base variance with the standard error of the parameter estimate for age being slightly smaller and the standard error for treat being slightly larger.
The hazard ratio estimate for age indicate that for every one increase in age the rate of muscle soreness increases by 5.9%; the hazard ratio for treat indicates that the subjects in the control group experience soreness at a rate which is 60.5% higher than the treatment group. The hazard estimate for treat in is much larger in this conditional model than in the previous count process model.
The conditional model B
The third model we will discuss is also a conditional model. This model only differs from the previous model in the way the time intervals are structured. In this model each time interval starts at zero and ends at the length of time until the next event. The result is that the risk sets for each of these conditional models are completely different and the questions that these analysis answer are also very different. This model is very useful for modeling the time between each of the recurring event rather than the full time course of the recurrent event process.
In the recur data set the first subject experiences four time intervals which each start at time zero but end at the length of time until the next event. Just as in conditional model A we use the stratum variable to keep track of the event number; thus, ensuring that it is not possible to be at risk for subsequent events without having experienced the previous events.
data recur; set recur; zero_time = 0; interval_time = time1 - time0; run; proc phreg data=recur covs(aggregate) covm; model (zero_time interval_time)*censor(0) = age treat; strata event; id id; run; <output omitted> Model Fit Statistics Without With Criterion Covariates Covariates -2 LOG L 9236.976 9167.237 AIC 9236.976 9171.237 SBC 9236.976 9180.927 Testing Global Null Hypothesis: BETA=0 Test Chi-Square DF Pr > ChiSq Likelihood Ratio 69.7390 2 <.0001 Score 70.7576 2 <.0001 Modified Score 61.8527 2 <.0001 Wald (COVM) 70.0571 2 <.0001 Wald (COVS) 73.0643 2 <.0001 Analysis of Maximum Likelihood Estimates with Model-Based Variance Estimate Parameter Standard Hazard Variable DF Estimate Error Chi-Square Pr > ChiSq Ratio Variable Label age 1 0.05646 0.01081 27.2645 <.0001 1.058 Age treat 1 0.45835 0.06846 44.8300 <.0001 1.581 Treatment Assignment Analysis of Maximum Likelihood Estimates with Sandwich Variance Estimate Parameter Standard StdErr Hazard Variable DF Estimate Error Ratio Chi-Square Pr > ChiSq Ratio Variable Label age 1 0.05646 0.01003 0.928 31.6876 <.0001 1.058 Age treat 1 0.45835 0.06682 0.976 47.0555 <.0001 1.581 Treatment Assignment
Just as in the previous conditional model the chi squared statistics based on the robust sandwich estimate in the global null hypothesis tests are approximately the same as the chi squared statistics of the other tests.
The standard errors using the Sandwich variance estimate is slightly smaller than using the model-base variance.
The hazard ratio estimate for age indicate that for every one increase in age the rate of muscle soreness increases by 5.8%; the hazard ratio for treat indicates that the subjects in the control group experience soreness at a rate which is 58.1% higher than the treatment group. Both estimates are very close to the estimates obtained in the previous conditional model.
The marginal model
In the marginal model each event is considered as a separate process. The time for each event starts at the beginning of follow up time for each subject. Furthermore, each subject is considered to be at risk for all events, regardless of how many events each subject actually experienced. Thus, all subjects in the study contribute follow up times to all possible recurrent events. Thus, the marginal model considers each event separately and models all the available data for the specific event.
In the recur data the first subject had four events and each time interval starts at zero. The second subject only had three events and thus the fourth time interval is simply a duplicate of the third time interval and each time interval starts at zero. Since the maximum number of events was four each subject in the recur study must have four time intervals to be modeled regardless of whether or not they actually experienced four events or not.
proc sort data=recur out=sort_recur; by id event; data recur_marginal; set sort_recur; by id; output; if last.id = 1 and event < 4 then do; interval = event; do i=1 to (4-interval); zero_time=0; event = event+1; censor = 0; output; end; end; run; proc print data=recur_marginal (obs=12); var id zero_time time0 time1 censor; run; zero_ Obs ID time time0 time1 censor 1 1 0 0 6 1 2 1 0 6 9 1 3 1 0 9 56 1 4 1 0 56 88 1 5 2 0 0 42 1 6 2 0 42 87 1 7 2 0 87 91 0 8 2 0 87 91 0 9 3 0 0 15 1 10 3 0 15 17 1 11 3 0 17 36 1 12 3 0 36 112 0 proc phreg data=recur_marginal covs(aggregate) covm; model (zero_time time1)*censor(0) = age treat; strata event; id id; run; <output omitted> Summary of the Number of Event and Censored Values Percent Stratum event Total Event Censored Censored 1 1 400 386 14 3.50 2 2 400 324 76 19.00 3 3 400 186 214 53.50 4 4 400 43 357 89.25 ------------------------------------------------------------------- Total 1600 939 661 41.31 Model Fit Statistics Without With Criterion Covariates Covariates -2 LOG L 9332.530 9181.979 AIC 9332.530 9185.979 SBC 9332.530 9195.669 Testing Global Null Hypothesis: BETA=0 Test Chi-Square DF Pr > ChiSq Likelihood Ratio 150.5513 2 <.0001 Score 155.7786 2 <.0001 Modified Score 70.8828 2 <.0001 Wald (COVM) 151.5532 2 <.0001 Wald (COVS) 75.3953 2 <.0001 Analysis of Maximum Likelihood Estimates with Model-Based Variance Estimate Parameter Standard Hazard Variable DF Estimate Error Chi-Square Pr > ChiSq Ratio Variable Label age 1 0.07282 0.01076 45.8319 <.0001 1.076 Age treat 1 0.74477 0.06942 115.0882 <.0001 2.106 Treatment Assignment Analysis of Maximum Likelihood Estimates with Sandwich Variance Estimate Parameter Standard StdErr Hazard Variable DF Estimate Error Ratio Chi-Square Pr > ChiSq Ratio Variable Label age 1 0.07282 0.01441 1.340 25.5289 <.0001 1.076 Age treat 1 0.74477 0.09858 1.420 57.0822 <.0001 2.106 Treatment Assignment
The Modified score test and the Wald (COVS) test are both based on the robust sandwich estimate and their
chi square statistic is much smaller than the other tests. The Wald (COVM) test, based on the model-based covariance
matrix, has a test statistic which is approximately the same size as the remaining tests which ignore the correlation of
observations within subjects. The test statistics of the Modified score test and the Wald (COVS) test are
almost half as large as the other test statistics though all are significant at the 0.05 level.
The standard errors using the Sandwich variance estimate is larger than using the
model-base variance. However, this does not affect the significance level and both predictors are
significant at the 0.05 level when using either estimate.
The hazard ratio estimate for
age indicate that for every one increase in age the
rate of muscle soreness increases by 7.6%; the hazard ratio for treat indicates that
the subjects in the control group experience soreness at a rate which is 110.6%
higher than the treatment group which is much larger than both the estimates obtained in the
conditional models.
Fitting a recurring events model using proc phreg is not much more difficult than fitting a single event model. The main difficulty lies in deciding which of the at least four different types of models is the most appropriate for the data at hand and which types of research questions are to be answered.