Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence by Judith D. Singer and John B. Willett
Chapter 12: Extending the Discrete-Time Hazard Model
The examples on this page were written in R version 2.4.1.
Table 12.2, page 413. To recreate this table, we will run each model separately and then combine the model statistics.
tenure <- read.table("https://stats.idre.ucla.edu/stat/examples/alda/tenure_pp.csv", sep = ",", header = T)
attach(tenure)

PERIOD2 <- PERIOD^2
PERIOD3 <- PERIOD^3
PERIOD4 <- PERIOD^4
PERIOD5 <- PERIOD^5
PERIOD6 <- PERIOD^6

general <- glm(EVENT ~ D1 + D2 + D3 + D4 + D5 + D6 + D7 + D8 + D9, family = "binomial")
order0 <- glm(EVENT ~ 1, family = "binomial")
order1 <- glm(EVENT ~ PERIOD, family = "binomial")
order2 <- glm(EVENT ~ PERIOD + PERIOD2, family = "binomial")
order3 <- glm(EVENT ~ PERIOD + PERIOD2 + PERIOD3, family = "binomial")
order4 <- glm(EVENT ~ PERIOD + PERIOD2 + PERIOD3 + PERIOD4, family = "binomial")
order5 <- glm(EVENT ~ PERIOD + PERIOD2 + PERIOD3 + PERIOD4 + PERIOD5, family = "binomial")

dev <- c(order0$deviance, order1$deviance, order2$deviance, order3$deviance,
         order4$deviance, order5$deviance, general$deviance)
# deviance improvement over the previous (lower-order) polynomial model; the first entry
# (no previous model) and the last entry (the general model) are set to 0
dev.diff.p <- c(0, dev[1:5] - dev[2:6], 0)
# deviance difference between each model and the general model
dev.diff.gen <- dev - dev[7]
aic <- c(order0$aic, order1$aic, order2$aic, order3$aic, order4$aic, order5$aic, general$aic)

table12.2 <- cbind(dev, dev.diff.p, dev.diff.gen, aic)
table12.2

           dev   dev.diff.p dev.diff.gen       aic
[1,] 1037.5652   0.00000000   206.361395 1039.5652
[2,]  867.4619 170.10332478    36.258070  871.4619
[3,]  836.3041  31.15779577     5.100274  842.3041
[4,]  833.1725   3.13158698     1.968687  841.1725
[5,]  832.7427   0.42981046     1.538877  842.7427
[6,]  832.7322   0.01051811     1.528358  844.7322
[7,]  831.2038   0.00000000     0.000000  849.2038
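If you want p-values for these deviance differences, they can be obtained from the chi-square distribution with pchisq(). A minimal sketch, assuming the model objects created above are still in the workspace: each polynomial model adds one parameter to the previous one, so those comparisons use 1 degree of freedom, while each comparison with the general model uses the difference in residual degrees of freedom.

# p-values for the improvement of each polynomial model over the previous one (1 df each)
pchisq(dev[1:5] - dev[2:6], df = 1, lower.tail = FALSE)

# p-values for comparing each model with the general model; the degrees of freedom
# are the differences in residual degrees of freedom
df.gen <- c(order0$df.residual, order1$df.residual, order2$df.residual,
            order3$df.residual, order4$df.residual, order5$df.residual) - general$df.residual
pchisq(dev[1:6] - dev[7], df = df.gen, lower.tail = FALSE)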
Figure 12.1, page 414. First, we will rerun the general model without an intercept, so that each coefficient is the fitted logit(hazard) for one of the nine time periods; this makes the coefficients easier to use later.
general <- glm(EVENT ~ D1 + D2 + D3 + D4 + D5 + D6 + D7 + D8 + D9 - 1, family = "binomial")
In the loop below, we will use our model estimates to generate a fitted value for each of the nine time points from each of the models. These are the values that will be plotted in the first of the three plots. We also calculate the survival and hazard rates from the quadratic and general models to be plotted in the second and third graphs.
fits <- c()
survivor.quad <- 1
survivor.gen <- 1
for (i in 1:9){
  constant <- order0$coef[1]
  linear <- order1$coef[1] + order1$coef[2]*i
  quadratic <- order2$coef[1] + order2$coef[2]*i + order2$coef[3]*i^2
  cubic <- order3$coef[1] + order3$coef[2]*i + order3$coef[3]*i^2 + order3$coef[4]*i^3
  hazard.quad <- 1/(1 + exp(-quadratic))
  survivor.quad <- (1 - hazard.quad)*survivor.quad
  generalval <- general$coef[i]
  hazard.gen <- 1/(1 + exp(-generalval))
  survivor.gen <- (1 - hazard.gen)*survivor.gen
  z <- c(i, constant, linear, quadratic, cubic, generalval,
         hazard.quad, survivor.quad, hazard.gen, survivor.gen)
  fits <- rbind(fits, z)
}
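As a cross-check, the same fitted values can be obtained with predict() instead of hand-coding the linear predictors. This is only a sketch; the data frame newtimes is hypothetical and simply holds the nine time points and the polynomial terms the models expect.

# hypothetical data frame of the nine time points and their powers
newtimes <- data.frame(PERIOD = 1:9, PERIOD2 = (1:9)^2, PERIOD3 = (1:9)^3)

# fitted logit(hazard) from the quadratic and cubic models ("link" returns the linear predictor)
predict(order2, newdata = newtimes, type = "link")
predict(order3, newdata = newtimes, type = "link")

# fitted hazard probabilities from the quadratic model
predict(order2, newdata = newtimes, type = "response")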
Each of the three graphs is created by generating the plot area while plotting one line, and then overlaying any additional lines on the same plot.
par(mfrow = c(1, 1))
plot(fits[,1], fits[,2], type = "l", lty = 1, xlim = c(0, 9), ylim = c(-6, 0),
     xlab = "Years after hire", ylab = "Fitted logit(hazard)")
points(fits[,1], fits[,3], type = "l", lty = 2)
points(fits[,1], fits[,4], type = "l", lty = 3)
points(fits[,1], fits[,5], type = "l", lty = 4)
points(fits[,1], fits[,6], type = "l", lty = 5)
legend("bottomright", c("constant", "linear", "quadratic", "cubic", "general"), lty = c(1, 2, 3, 4, 5))

par(mfrow = c(1, 2))
plot(fits[,1], fits[,7], type = "l", lty = 1, xlim = c(0, 9), ylim = c(0, .4),
     xlab = "Years after hire", ylab = "Fitted hazard")
points(fits[,1], fits[,9], type = "l", lty = 2)
legend("bottomright", c("quadratic", "general"), lty = c(1, 2))
plot(fits[,1], fits[,8], type = "l", lty = 1, xlim = c(0, 9), ylim = c(0, 1),
     xlab = "Years after hire", ylab = "Fitted survival")
points(fits[,1], fits[,10], type = "l", lty = 2)
legend("bottomleft", c("quadratic", "general"), lty = c(1, 2))
Figure 12.3, page 423. After reading in the data, we will subset it by parenting transition (PT) values. Then, we will run separate models on these datasets using both the logit and complementary log-log (cloglog) links, for four models in total. Finally, we will plot the parameter estimates associated with each of the time points from each of the four models.
firstsex <- read.table("https://stats.idre.ucla.edu/stat/examples/alda/firstsex_pp.csv", sep = ",", header = T)
firstsex0 <- subset(firstsex, pt==0)
firstsex1 <- subset(firstsex, pt==1)

fs0.logit <- glm(event ~ d7 + d8 + d9 + d10 + d11 + d12 - 1, family = binomial(link = "logit"), data = firstsex0)
fs1.logit <- glm(event ~ d7 + d8 + d9 + d10 + d11 + d12 - 1, family = binomial(link = "logit"), data = firstsex1)
fs0.loglog <- glm(event ~ d7 + d8 + d9 + d10 + d11 + d12 - 1, family = binomial(link = "cloglog"), data = firstsex0)
fs1.loglog <- glm(event ~ d7 + d8 + d9 + d10 + d11 + d12 - 1, family = binomial(link = "cloglog"), data = firstsex1)
Having run these four models, we create a single dataset that contains the coefficients from each model and a column of time points. This is the dataset we will use to generate the plot. First, we create a plot with one of the lines. Then, we overlay the other three lines.
fig12.3 <- cbind(time = c(7, 8, 9, 10, 11, 12),
                 fs0.logit = fs0.logit$coef, fs0.loglog = fs0.loglog$coef,
                 fs1.logit = fs1.logit$coef, fs1.loglog = fs1.loglog$coef)

par(mfrow = c(1, 1))
plot(fig12.3[,1], fig12.3[,2], type = "l", ylab = "Transformed hazard probability",
     xlab = "Grade", ylim = c(-4, 0))
points(fig12.3[,1], fig12.3[,3], type = "l", lty = 2)
points(fig12.3[,1], fig12.3[,4], type = "l", lty = 3)
points(fig12.3[,1], fig12.3[,5], type = "l", lty = 4)
legend(7, 0, c("pt=0, logit", "pt=0, loglog", "pt=1, logit", "pt=1, loglog"), lty = c(1, 2, 3, 4))
Table 12.3, page 424. For this table, we will run two models with different link functions and then find the fitted baseline hazard using the corresponding model coefficients.
fs.loglog <- glm(event ~ d7 + d8 + d9 + d10 + d11 + d12 + pt - 1, family = binomial(link = "cloglog"), data = firstsex)
fs.logit <- glm(event ~ d7 + d8 + d9 + d10 + d11 + d12 + pt - 1, family = binomial(link = "logit"), data = firstsex)

est.logit <- fs.logit$coef
est.loglog <- fs.loglog$coef
fbh.logit <- 1/(1 + exp(-est.logit))
fbh.loglog <- 1 - exp(-exp(est.loglog))

tab12.3 <- cbind(est.loglog, est.logit, fbh.loglog, fbh.logit)
tab12.3

     est.loglog  est.logit fbh.loglog  fbh.logit
d7   -2.9732973 -2.9943266 0.04984906 0.04768284
d8   -3.6592441 -3.7001239 0.02542322 0.02412410
d9   -2.3156320 -2.2811235 0.09398896 0.09269841
d10  -1.9001387 -1.8225985 0.13890279 0.13912236
d11  -1.7621385 -1.6542272 0.15774911 0.16053845
d12  -1.3426380 -1.1790570 0.22983911 0.23522180
pt    0.7854048  0.8736184 0.88845136 0.70549806

fs.loglog$deviance
[1] 634.5013
fs.logit$deviance
[1] 634.6618
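The two back-transformations above are simply the inverse link functions of the two models, so an alternative to typing the formulas is to pull linkinv from each fitted model's family object. A small sketch, assuming fs.logit and fs.loglog as fit above:

# inverse logit link: 1/(1 + exp(-x))
fbh.logit2 <- fs.logit$family$linkinv(est.logit)
# inverse complementary log-log link: 1 - exp(-exp(x))
fbh.loglog2 <- fs.loglog$family$linkinv(est.loglog)

# these should match the fbh.loglog and fbh.logit columns of tab12.3
cbind(fbh.loglog2, fbh.logit2)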
Figure 12.4, page 432. To generate these graphs, we will first do the calculations required for the scatter plots. Then we will run the model required to add the curves.
depression<-read.table("https://stats.idre.ucla.edu/stat/examples/alda/depression_pp.csv", sep=",", header=T)
For each age between 4 and 39, we are interested in the proportion of subjects that experienced the event (in this example, an initial depressive episode) within the group of subjects with divorced parents (PD = 1) and the group of subjects without divorced parents (PD = 0). To calculate this, we can loop through these combinations of age and PD values, subsetting our data and calculating the proportion within each subset. For the second graph, we will take the logit of this proportion.
percents <- c()
for (i in 4:39){
  for (j in 0:1){
    x <- subset(depression, PERIOD==i & PD==j)
    y <- table(x$EVENT)
    if (dim(y)==2){
      z <- c(i, j, y[[2]], y[[1]], (y[[2]]/(y[[1]] + y[[2]])))}
    if (dim(y)==1){
      z <- c(i, j, 0, y[[1]], NA)}
    percents <- rbind(percents, z)
  }
}
percents <- cbind(percents, log(percents[,5]/(1 - percents[,5])))
colnames(percents) <- c("age", "parent", "event", "nonevent", "percent", "logpercent")
percents.nm <- as.data.frame(na.omit(percents))
percent.pd0 <- subset(percents.nm, parent == 0)
percent.pd1 <- subset(percents.nm, parent == 1)
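The same sample proportions can also be computed without an explicit loop, for example with aggregate(). The sketch below uses the depression data read in above; note that age/PD cells in which no one experienced the event yield a sample logit of -Inf here rather than the NA assigned in the loop, so the result is close to, but not identical to, percents.

# the mean of the 0/1 EVENT indicator within each age-by-PD cell is the sample hazard
props <- aggregate(EVENT ~ PERIOD + PD, data = depression, FUN = mean)
props$logit <- log(props$EVENT/(1 - props$EVENT))
head(props)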
Next, we will fit a logistic model to the data. Then, for each age between 4 and 39 and for each PD group, we are interested in the model-predicted proportion of subjects that experienced the event. To calculate this, we can again loop through these combinations of age and PD values and use the model coefficients to generate predicted proportions. We can also calculate the hazard rate for the second plot.
dmodel <- glm(EVENT ~ ONE + age_18 + age_18sq + age_18cub + PD - 1,
              family = binomial(link = "logit"), data = depression)

modelfit <- c()
for (i in 0:1){
  for (j in 4:39){
    fitx <- dmodel$coef[1] + dmodel$coef[2]*(j-18) + dmodel$coef[3]*(j-18)^2 +
            dmodel$coef[4]*(j-18)^3 + dmodel$coef[5]*i
    fithazard <- 1/(1 + exp(-fitx))
    modelfit <- rbind(modelfit, c(i, j, fitx, fithazard))
  }
}
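Equivalently, the fitted values could be generated with predict() on a grid of age and PD values rather than reconstructing the linear predictor by hand. This sketch assumes that age_18, age_18sq, and age_18cub are (age - 18) and its square and cube, as they are in the person-period file; the grid data frame itself is hypothetical.

# hypothetical grid of ages 4-39 crossed with PD = 0/1, with the terms the model expects
grid <- expand.grid(age = 4:39, PD = 0:1)
grid <- transform(grid, ONE = 1, age_18 = age - 18,
                  age_18sq = (age - 18)^2, age_18cub = (age - 18)^3)

grid$fitx <- predict(dmodel, newdata = grid, type = "link")          # fitted logit(hazard)
grid$fithazard <- predict(dmodel, newdata = grid, type = "response") # fitted hazard
head(grid)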
Lastly, we subset our data and generate the plots, starting each with one scatter plot, then layering on the other scatter points and the two fitted curves.
modelfit.0 <- subset(data.frame(modelfit), modelfit[,1]==0)
modelfit.1 <- subset(data.frame(modelfit), modelfit[,1]==1)

par(mfrow = c(1, 2))
plot(percent.pd0$age, percent.pd0$percent, pch = 19, ylim = c(0, .06),
     xlab = "Age", ylab = "Proportion experiencing event")
points(percent.pd1$age, percent.pd1$percent, pch = 22)
points(modelfit.0[,2], modelfit.0[,4], type = 'l', lty = 1)
points(modelfit.1[,2], modelfit.1[,4], type = 'l', lty = 2)
legend(5, 0.06, c("PD = 0", "PD = 1"), lty = c(1, 2))

plot(percent.pd0$age, percent.pd0$logpercent, pch = 19, ylim = c(-8, -2),
     xlab = "Age", ylab = "Logit(proportion experiencing event)")
points(percent.pd1$age, percent.pd1$logpercent, pch = 22)
points(modelfit.0[,2], modelfit.0[,3], type = 'l', lty = 1)
points(modelfit.1[,2], modelfit.1[,3], type = 'l', lty = 2)
legend(5, -2, c("PD = 0", "PD = 1"), lty = c(1, 2))
Figure 12.5, page 437. To generate these graphs, we will first fit a logistic regression model that predicts the event using three age terms, parental divorce, and sex.
depression <- read.table("https://stats.idre.ucla.edu/stat/examples/alda/depression_pp.csv", sep = ",", header = T)
dmodel <- glm(EVENT ~ ONE + age_18 + age_18sq + age_18cub + PD + FEMALE - 1,
              family = binomial(link = "logit"), data = depression)
Next, we can use the model coefficients to generate predicted values for all of the time/PD/sex combinations we are interested in. We will do this in a loop. Within each PD/sex combination, we are also interested in the "survival" rate at each time point: the proportion of subjects that have not yet experienced the event at that time point. Both the hazard and the survival rate will be calculated in the loop using the predicted values from the model. We will then subset the data by PD/sex combination, resulting in four datasets with one observation for each time point.
modelfit <- c()
for (i in 0:1){
  for (k in 0:1){
    survivor <- 1
    for (j in 4:39){
      fitx <- dmodel$coef[1] + dmodel$coef[2]*(j-18) + dmodel$coef[3]*(j-18)^2 +
              dmodel$coef[4]*(j-18)^3 + dmodel$coef[5]*i + dmodel$coef[6]*k
      hazard <- 1/(1 + exp(-fitx))
      survivor <- (1 - hazard)*survivor
      modelfit <- rbind(modelfit, c(i, j, k, fitx, hazard, survivor))
    }
  }
}
colnames(modelfit) <- c("pd", "age", "female", "fit", "hazard", "survival")
modelfit.0male <- subset(data.frame(modelfit), modelfit[,1]==0 & modelfit[,3]==0)
modelfit.0female <- subset(data.frame(modelfit), modelfit[,1]==0 & modelfit[,3]==1)
modelfit.1male <- subset(data.frame(modelfit), modelfit[,1]==1 & modelfit[,3]==0)
modelfit.1female <- subset(data.frame(modelfit), modelfit[,1]==1 & modelfit[,3]==1)
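The running survival calculation inside the loop is just a cumulative product of (1 - hazard) within each PD-by-sex group. As an aside, the same column can be reproduced after the fact with ave(); this sketch relies on the rows of modelfit being ordered by age within each group, which the loop above guarantees.

mf <- as.data.frame(modelfit)
# cumulative product of (1 - hazard) within each pd/female combination, taken in age order
mf$survival2 <- ave(1 - mf$hazard, mf$pd, mf$female, FUN = cumprod)
# survival2 should reproduce the survival column built up in the loop
all.equal(mf$survival, mf$survival2)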
To plot the four curves in each plot, we start with one plot, and then overlay the other three lines in the same plot area.
par(mfrow = c(1, 2))
plot(modelfit.0male$age, modelfit.0male$hazard, type = 'l', lty = 1, ylim = c(0, .04),
     xlab = "Age", ylab = "Fitted hazard")
points(modelfit.0female$age, modelfit.0female$hazard, type = 'l', lty = 2)
points(modelfit.1male$age, modelfit.1male$hazard, type = 'l', lty = 3)
points(modelfit.1female$age, modelfit.1female$hazard, type = 'l', lty = 4)
legend(5, 0.04, c("PD = 0, Male", "PD = 0, Female", "PD = 1, Male", "PD = 1, Female"), lty = c(1, 2, 3, 4))

plot(modelfit.0male$age, modelfit.0male$survival, type = 'l', lty = 1, ylim = c(0, 1),
     xlab = "Age", ylab = "Fitted survival")
points(modelfit.0female$age, modelfit.0female$survival, type = 'l', lty = 2)
points(modelfit.1male$age, modelfit.1male$survival, type = 'l', lty = 3)
points(modelfit.1female$age, modelfit.1female$survival, type = 'l', lty = 4)
legend(0, 0.2, c("PD = 0, Male", "PD = 0, Female", "PD = 1, Male", "PD = 1, Female"), lty = c(1, 2, 3, 4))
Figure 12.6, page 445. This set of plots is similar to previous examples: we will first calculate, at each time point, the proportion of subjects within each subgroup who experienced the event, and then create plots of the proportion (or a transformation of it) against time, with different lines representing different subject characteristics.
fa<-read.table("https://stats.idre.ucla.edu/stat/examples/alda/firstarrest_pp.csv", sep=",", header=T)
We loop through all of the combinations of time, race, and abused status in our dataset, subsetting on each and calculating the proportion of subjects experiencing the event within each subset. We then take the logit of this proportion. After completing the loop, we subset the percent and logit data into four datasets, one for each race/abuse combination.
percents <- c()
for (i in 8:18){
  for (j in 0:1){
    for (k in 0:1){
      x <- subset(fa, PERIOD==i & BLACK==j & ABUSED==k)
      if (sum(x$EVENT) > 0){
        y <- mean(x$EVENT)}
      if (sum(x$EVENT)==0){
        y <- NA}
      logity <- log(y/(1-y))
      z <- c(i, j, k, y, logity)
      percents <- rbind(percents, z)
    }
  }
}
colnames(percents) <- c("age", "black", "abused", "pct", "hazard")
percents.nm <- as.data.frame(na.omit(percents))
percents.w.a <- subset(percents.nm, black==0 & abused==1)
percents.b.a <- subset(percents.nm, black==1 & abused==1)
percents.w.na <- subset(percents.nm, black==0 & abused==0)
percents.b.na <- subset(percents.nm, black==1 & abused==0)
Next, we create two plots: one for the white subjects and one for the black subjects. In each, we first plot a line for one of the abuse categories and then overlay a line for the other.
par(mfrow = c(1, 2))
plot(percents.w.a$age, percents.w.a$hazard, type = "l", lty = 1, ylim = c(-7, -2), xlim = c(8, 18),
     main = "White", xlab = "Age", ylab = "Sample logit(hazard)")
points(percents.w.na$age, percents.w.na$hazard, type = "l", lty = 2)
legend(8, -2, c("Abused", "Not Abused"), lty = c(1, 2))

plot(percents.b.a$age, percents.b.a$hazard, type = "l", lty = 1, ylim = c(-7, -2), xlim = c(8, 18),
     main = "Black", xlab = "Age", ylab = "Sample logit(hazard)")
points(percents.b.na$age, percents.b.na$hazard, type = "l", lty = 2)
legend(8, -2, c("Abused", "Not Abused"), lty = c(1, 2))
The first two graphs were created using raw proportions. The third graph is based on a logistic model fitted to the data, using time, race, abuse status, and an abuse-race interaction term as predictors. We will run this model and then use the coefficients to predict hazard rates at each time point for each race/abuse combination. These predicted values will be generated using a loop, as we have done in previous examples.
famodel <- glm(EVENT ~ D8 + D9 + D10 + D11 + D12 + D13 + D14 + D15 + D16 + D17 + D18 +
               ABUSED + BLACK + ABLACK - 1, family = binomial(link = "logit"), data = fa)

modelfit <- c()
for (j in 0:1){
  for (k in 0:1){
    survivor <- 1
    for (i in 8:18){
      logitfit <- famodel$coef[i-7] + famodel$coef[12]*j + famodel$coef[13]*k + famodel$coef[14]*j*k
      hazard <- 1/(1 + exp(-logitfit))
      survivor <- (1 - hazard)*survivor
      z <- c(i, j, k, j*k, logitfit, hazard, survivor)
      modelfit <- rbind(modelfit, z)
    }
  }
}
After completing the loop, we will subset our data as we did for the first two plots. Then, we create one graph by plotting the line for one of the four race/abuse combinations and then overlaying the other three lines.
colnames(modelfit) <- c("age", "abused", "black", "ablack", "logitfit", "hazard", "survival")
modelfit <- as.data.frame(modelfit)
modelfit.w.a <- subset(modelfit, black==0 & abused==1)
modelfit.b.a <- subset(modelfit, black==1 & abused==1)
modelfit.w.na <- subset(modelfit, black==0 & abused==0)
modelfit.b.na <- subset(modelfit, black==1 & abused==0)

par(mfrow = c(1, 1))
plot(modelfit.w.a$age, modelfit.w.a$logitfit, type = "l", lty = 1, ylim = c(-8, -2), xlim = c(8, 18),
     xlab = "Age", ylab = "Fitted logit(hazard)")
points(modelfit.w.na$age, modelfit.w.na$logitfit, type = "l", lty = 2)
points(modelfit.b.a$age, modelfit.b.a$logitfit, type = "l", lty = 3)
points(modelfit.b.na$age, modelfit.b.na$logitfit, type = "l", lty = 4)
legend(8, -2, c("White Abused", "White Not Abused", "Black Abused", "Black Not Abused"), lty = c(1, 2, 3, 4))
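Because the model is fit on the logit scale, exponentiating the substantive coefficients gives odds ratios for the hazard, which can help when interpreting the interaction plotted above. This is a supplementary sketch, not part of the book's figure:

# odds ratios for the hazard associated with abuse, race, and their interaction
exp(famodel$coef[c("ABUSED", "BLACK", "ABLACK")])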
Table 12.4, page 449. The parameter estimates and fit statistics in this table can be found in the output from each of the three models below.
depression <- read.table("https://stats.idre.ucla.edu/stat/examples/alda/depression_pp.csv", sep = ",", header = T)

modelA <- glm(EVENT ~ ONE + age_18 + age_18sq + age_18cub + PD + FEMALE + NSIBS - 1,
              family = binomial(link = "logit"), data = depression)
summary(modelA)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
ONE       -4.359e+00  1.216e-01 -35.845  < 2e-16 ***
age_18     6.106e-02  1.166e-02   5.236 1.64e-07 ***
age_18sq  -7.308e-03  1.224e-03  -5.972 2.34e-09 ***
age_18cub  1.816e-04  7.897e-05   2.300 0.021447 *
PD         3.726e-01  1.624e-01   2.295 0.021754 *
FEMALE     5.587e-01  1.095e-01   5.103 3.34e-07 ***
NSIBS     -8.141e-02  2.227e-02  -3.655 0.000257 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 51288.7  on 36997  degrees of freedom
Residual deviance:  4124.3  on 36990  degrees of freedom
AIC: 4138.3

modelB <- glm(EVENT ~ ONE + age_18 + age_18sq + age_18cub + PD + FEMALE + SIBS12 + SIBS34 +
              SIBS56 + SIBS78 + SIBS9PLUS - 1, family = binomial(link = "logit"), data = depression)
summary(modelB)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
ONE        -4.500e+00  2.067e-01 -21.773  < 2e-16 ***
age_18      6.145e-02  1.166e-02   5.269 1.37e-07 ***
age_18sq   -7.289e-03  1.224e-03  -5.958 2.56e-09 ***
age_18cub   1.813e-04  7.897e-05   2.296   0.0217 *
PD          3.727e-01  1.625e-01   2.294   0.0218 *
FEMALE      5.596e-01  1.095e-01   5.109 3.24e-07 ***
SIBS12      2.085e-02  1.976e-01   0.106   0.9160
SIBS34      1.076e-02  2.100e-01   0.051   0.9591
SIBS56     -4.942e-01  2.545e-01  -1.942   0.0522 .
SIBS78     -7.754e-01  3.437e-01  -2.256   0.0241 *
SIBS9PLUS  -6.585e-01  3.440e-01  -1.914   0.0556 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 51289  on 36997  degrees of freedom
Residual deviance:  4118  on 36986  degrees of freedom
AIC: 4140

modelC <- glm(EVENT ~ ONE + age_18 + age_18sq + age_18cub + PD + FEMALE + BIGFAMILY - 1,
              family = binomial(link = "logit"), data = depression)
summary(modelC)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
ONE        -4.483e+00  1.087e-01 -41.237  < 2e-16 ***
age_18      6.141e-02  1.166e-02   5.265 1.40e-07 ***
age_18sq   -7.291e-03  1.224e-03  -5.959 2.54e-09 ***
age_18cub   1.815e-04  7.897e-05   2.298   0.0215 *
PD          3.710e-01  1.623e-01   2.286   0.0222 *
FEMALE      5.581e-01  1.095e-01   5.098 3.44e-07 ***
BIGFAMILY  -6.108e-01  1.446e-01  -4.225 2.39e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 51288.7  on 36997  degrees of freedom
Residual deviance:  4118.8  on 36990  degrees of freedom
AIC: 4132.8
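The three sibling specifications can also be compared directly from the fitted objects. A brief sketch: the first lines collect the residual deviances and AICs side by side; the last line treats Model C as nested within Model B, which holds only if BIGFAMILY is a collapsed version of the sibling-category dummies (an assumption here), and refers the deviance difference to a chi-square distribution.

# residual deviance and AIC for the three sibling specifications
data.frame(model = c("A", "B", "C"),
           deviance = c(modelA$deviance, modelB$deviance, modelC$deviance),
           AIC = c(modelA$aic, modelB$aic, modelC$aic))

# likelihood-ratio test of Model C against Model B (assuming Model C is nested in Model B)
pchisq(modelC$deviance - modelB$deviance,
       df = modelC$df.residual - modelB$df.residual, lower.tail = FALSE)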
Figure 12.7, page 458. This example compares the hazard rates of male and female subjects across time, first using sample (unfitted) values and then using fitted values from three different models. Note that it would be possible to generate all of these fitted and unfitted values in a single loop after running the three models. We have opted to keep these steps separate for easier reading.
math<-read.table("https://stats.idre.ucla.edu/stat/examples/alda/mathdropout_pp.csv", sep=",", header=T)
First, for each combination of time and sex, we use a loop to generate the proportion of subjects experiencing the event and calculate the logit of the proportion. Then we subset this proportion and logit data by sex.
percents <- c()
for (i in 1:5){
  for (j in 0:1){
    x <- subset(math, PERIOD==i & FEMALE==j)
    if (sum(x$EVENT) > 0){
      y <- mean(x$EVENT)}
    if (sum(x$EVENT)==0){
      y <- NA}
    logity <- log(y/(1-y))
    z <- c(i, j, y, logity)
    percents <- rbind(percents, z)
  }
}
colnames(percents) <- c("term", "female", "pct", "logit")
percents <- as.data.frame(na.omit(percents))
percents.m <- subset(percents, female==0)
percents.f <- subset(percents, female==1)
Next, we run the first of three models, Model A. Then, using a loop, we generate predicted proportion values for each combination of time and sex with the coefficients from Model A and calculate the logit of the fitted proportion. Then we subset this proportion and logit data by sex.
modelA <- glm(EVENT ~ HS11 + HS12 + COLL1 + COLL2 + COLL3 + FEMALE - 1,
              family = binomial(link = "logit"), data = math)

modelfitA <- c()
for (i in 1:5){
  for (j in 0:1){
    logitfit <- modelA$coef[i] + modelA$coef[6]*j
    hazard <- 1/(1 + exp(-logitfit))
    z <- c(i, j, logitfit, hazard)
    modelfitA <- rbind(modelfitA, z)
  }
}
colnames(modelfitA) <- c("term", "female", "logitfit", "hazard")
modelfitA <- as.data.frame(modelfitA)
modelfitA.m <- subset(modelfitA, female==0)
modelfitA.f <- subset(modelfitA, female==1)
Next, we run the second of three models, Model B, and go through the same steps as for Model A.
modelB <- glm(EVENT ~ HS11 + HS12 + COLL1 + COLL2 + COLL3 + FHS11 + FHS12 + FCOLL1 + FCOLL2 + FCOLL3 - 1, family=binomial(link = "logit"), data = math)
modelfitB <- c()
for (i in 1:5){
  for (j in 0:1){
    logitfit <- modelB$coef[i] + modelB$coef[i+5]*j
    hazard <- 1/(1 + exp(-logitfit))
    z <- c(i, j, logitfit, hazard)
    modelfitB <- rbind(modelfitB, z)
  }
}
colnames(modelfitB) <- c("term", "female", "logitfit", "hazard")
modelfitB <- as.data.frame(modelfitB)
modelfitB.m <- subset(modelfitB, female==0)
modelfitB.f <- subset(modelfitB, female==1)
Next, we run the last of three models, Model C, and go through the same steps as for Model A and Model B.
modelC <- glm(EVENT ~ HS11 + HS12 + COLL1 + COLL2 + COLL3 + FEMALE + FLTIME - 1,
              family = binomial(link = "logit"), data = math)

modelfitC <- c()
for (i in 1:5){
  for (j in 0:1){
    logitfit <- modelC$coef[i] + modelC$coef[6]*j + modelC$coef[7]*i*j
    hazard <- 1/(1 + exp(-logitfit))
    z <- c(i, j, logitfit, hazard)
    modelfitC <- rbind(modelfitC, z)
  }
}
colnames(modelfitC) <- c("term", "female", "logitfit", "hazard")
modelfitC <- as.data.frame(modelfitC)
modelfitC.m <- subset(modelfitC, female==0)
modelfitC.f <- subset(modelfitC, female==1)
Lastly, we create four plots: one for the unfitted hazard and one for each of the three fitted hazards. For each graph, we start by plotting one line for males and then overlay a line for females.
par(mfrow = c(2, 2))
plot(percents.m$term, percents.m$logit, type = "l", lty = 1, ylim = c(-2.5, 0),
     main = "Within-group sample hazard functions", xlab = "Term", ylab = "Sample logit(hazard)")
points(percents.f$term, percents.f$logit, type = "l", lty = 2)
legend("bottomright", c("Male", "Female"), lty = c(1, 2))

plot(modelfitA.m$term, modelfitA.m$logitfit, type = "l", lty = 1, ylim = c(-2.5, 0),
     main = "Model A", xlab = "Term", ylab = "Fitted logit(hazard)")
points(modelfitA.f$term, modelfitA.f$logitfit, type = "l", lty = 2)
legend("bottomright", c("Male", "Female"), lty = c(1, 2))

plot(modelfitB.m$term, modelfitB.m$logitfit, type = "l", lty = 1, ylim = c(-2.5, 0),
     main = "Model B", xlab = "Term", ylab = "Fitted logit(hazard)")
points(modelfitB.f$term, modelfitB.f$logitfit, type = "l", lty = 2)
legend("bottomright", c("Male", "Female"), lty = c(1, 2))

plot(modelfitC.m$term, modelfitC.m$logitfit, type = "l", lty = 1, ylim = c(-2.5, 0),
     main = "Model C", xlab = "Term", ylab = "Fitted logit(hazard)")
points(modelfitC.f$term, modelfitC.f$logitfit, type = "l", lty = 2)
legend("bottomright", c("Male", "Female"), lty = c(1, 2))
Table 12.5, page 459. The parameter estimates and fit statistics in this table can be found in the output from each of the three models below.
modelA <- glm(EVENT ~ HS11 + HS12 + COLL1 + COLL2 + COLL3 + FEMALE - 1,
              family = binomial(link = "logit"), data = math)
summary(modelA)

Coefficients:
       Estimate Std. Error z value Pr(>|z|)
HS11   -2.13080    0.05674 -37.555  < 2e-16 ***
HS12   -0.94248    0.04789 -19.681  < 2e-16 ***
COLL1  -1.44948    0.06345 -22.845  < 2e-16 ***
COLL2  -0.61765    0.07571  -8.158 3.42e-16 ***
COLL3  -0.77160    0.14280  -5.403 6.54e-08 ***
FEMALE  0.37865    0.05015   7.551 4.33e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13250.2  on 9558  degrees of freedom
Residual deviance:  9804.3  on 9552  degrees of freedom
AIC: 9816.3
modelB <- glm(EVENT ~ HS11 + HS12 + COLL1 + COLL2 + COLL3 + FHS11 + FHS12 + FCOLL1 + FCOLL2 + FCOLL3 - 1,
              family = binomial(link = "logit"), data = math)
summary(modelB)

Coefficients:
       Estimate Std. Error z value Pr(>|z|)
HS11   -2.00767    0.07148 -28.087  < 2e-16 ***
HS12   -0.96425    0.05854 -16.470  < 2e-16 ***
COLL1  -1.48240    0.08471 -17.499  < 2e-16 ***
COLL2  -0.71001    0.10073  -7.048 1.81e-12 ***
COLL3  -0.86904    0.19077  -4.555 5.23e-06 ***
FHS11   0.15680    0.09777   1.604 0.108791
FHS12   0.41866    0.07924   5.284 1.27e-07 ***
FCOLL1  0.44068    0.11580   3.805 0.000142 ***
FCOLL2  0.57075    0.14455   3.948 7.86e-05 ***
FCOLL3  0.60077    0.28573   2.103 0.035502 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13250.2  on 9558  degrees of freedom
Residual deviance:  9796.3  on 9548  degrees of freedom
AIC: 9816.3
modelC <- glm(EVENT ~ HS11 + HS12 + COLL1 + COLL2 + COLL3 + FEMALE + FLTIME - 1,
              family = binomial(link = "logit"), data = math)
summary(modelC)
Coefficients:
       Estimate Std. Error z value Pr(>|z|)
HS11   -2.04592    0.06464 -31.649  < 2e-16 ***
HS12   -0.92551    0.04821 -19.197  < 2e-16 ***
COLL1  -1.49655    0.06649 -22.507  < 2e-16 ***
COLL2  -0.71781    0.08605  -8.341  < 2e-16 ***
COLL3  -0.91655    0.15570  -5.887 3.94e-09 ***
FEMALE  0.22749    0.07744   2.937  0.00331 **
FLTIME  0.11977    0.04698   2.549  0.01080 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13250.2  on 9558  degrees of freedom
Residual deviance:  9797.8  on 9551  degrees of freedom
AIC: 9811.8
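Judging from the variable names, Model B allows the female effect to differ in every term and Model C allows it to change linearly with time, so Model A is nested within both. If that reading is correct, the added flexibility can be tested with likelihood-ratio tests; anova() on two glm fits simply differences their deviances and residual degrees of freedom.

# does the completely general female-by-time interaction (Model B) improve on Model A?
anova(modelA, modelB, test = "Chisq")

# does letting the female effect change linearly with time (Model C) improve on Model A?
anova(modelA, modelC, test = "Chisq")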
Table 12.6, page 465. This table shows some summary information for a set of eight subjects. To generate this table in R, we wrote a function that takes an ID number as an argument and returns a vector containing the information seen in a given row of this table.
firstsex<-read.table("https://stats.idre.ucla.edu/stat/examples/alda/firstsex_pp.csv", sep=",", header=T)
First, we fit a logistic model to the data predicting the event using time, PT, and PAS. From this model, we then generated a deviance residual for each observation.
model12.6 <- glm(event ~ d7 + d8 + d9 + d10 + d11 + d12 + pt + pas - 1,
                 family = binomial(link = "logit"), data = firstsex)
firstsex.r <- cbind(firstsex, dev.res = residuals(model12.6, type="deviance"))
Next, we defined a function that limits the dataset to a given subject and then extracts that subject's values for PT and PAS, the last grade observed (the greatest time point value), and censoring status. The function then creates a vector containing this information, the deviance residuals from all of the time points for which the subject has data, and the subject's sum of squared deviance residuals (SS deviance).
get12.6 <- function(id.num){
  x <- subset(firstsex.r, id==id.num)
  pt <- max(x$pt)
  pas <- max(x$pas)
  grade <- max(x$period)
  censor <- abs(max(x$event) - 1)
  gr <- x$dev.res
  gr.l <- length(gr)
  if (gr.l < 6){
    gr <- c(gr, c(rep(NA, (6 - gr.l))))}
  ss.dev <- sum(na.omit(gr*gr))
  z <- c(id.num, pt, pas, grade, censor, gr, ss.dev)
  return(z)
}
The table was created by calling the above function with the eight IDs that appear in the table and stacking the eight vectors together.
tab12.6 <- rbind(get12.6(22), get12.6(112), get12.6(166), get12.6(89),
                 get12.6(102), get12.6(87), get12.6(67), get12.6(212))
colnames(tab12.6) <- c("id", "pt", "pas", "grade", "censor", "gr7", "gr8", "gr9", "gr10", "gr11", "gr12", "ss.dev")
tab12.6

      id pt        pas grade censor        gr7        gr8        gr9       gr10       gr11       gr12   ss.dev
[1,]  22  1 -0.6496482    12      0 -0.4117311 -0.2944395 -0.5840395 -0.7176492 -0.7747741  1.4144633 3.713321
[2,] 112  1 -0.6609311    12      1 -0.4110716 -0.2939581 -0.5831423 -0.7165919 -0.7736556 -0.9562858 2.621976
[3,] 166  1  2.7814130    11      0 -0.6614533 -0.4806816 -0.9108230 -1.0903113  1.1913978         NA 4.106381
[4,]  89  0 -0.0751572    11      0 -0.3248377 -0.2313901 -0.4644566 -0.5751775  1.8623690         NA 4.174028
[5,] 102  1  0.6049317     8      0 -0.4913390  2.3694820         NA         NA         NA         NA 5.855859
[6,]  87  1  2.6779010     7      0  1.8176297         NA         NA         NA         NA         NA 3.303778
[7,]  67  1  2.2746520    12      0 -0.6180257 -0.4476515 -0.8559141 -1.0294308 -1.1007321  1.0429686 4.674059
[8,] 212  0 -0.9617912    12      1 -0.2857033 -0.2032141 -0.4097550 -0.5090181 -0.5524318 -0.6958426 1.339299
Figure 12.8, page 467. These two plots are generated using the deviance residuals from a model. The first plot depicts the residuals from all of the observations, while the second depicts the sum of squared residuals for each ID. We first need to fit the model and generate the residuals.
firstsex <- read.table("https://stats.idre.ucla.edu/stat/examples/alda/firstsex_pp.csv", sep = ",", header = T)
model12.6 <- glm(event ~ d7 + d8 + d9 + d10 + d11 + d12 + pt + pas - 1,
                 family = binomial(link = "logit"), data = firstsex)
firstsex.r <- cbind(firstsex, dev.res = residuals(model12.6, type = "deviance"))
Next, we define a function that accepts an ID as an argument and calculates the sum of the squared deviance residuals for the observations associated with that ID.
get12.9 <- function(id.num){
  x <- subset(firstsex.r, id==id.num)
  gr <- x$dev.res
  ss.dev <- sum(gr*gr)
  return(ss.dev)
}
Having defined the above function, we loop through the values of ID, calling the above function to calculate the sum of squared residuals for each ID. This gives us a dataset containing one observation per ID.
ss.devs <- c()
for (i in 1:216){
  z <- c(i, get12.9(i))
  if (z[2]!=0){
    ss.devs <- rbind(ss.devs, z)}
}
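The same per-subject sums of squared deviance residuals can be obtained in one line with tapply(), which also avoids looping over ID values that never appear in the data. A minimal sketch, assuming firstsex.r as constructed above:

# sum the squared deviance residuals within each value of id
ss.by.id <- tapply(firstsex.r$dev.res^2, firstsex.r$id, sum)
ss.devs2 <- data.frame(id = as.numeric(names(ss.by.id)), ss.dev = as.numeric(ss.by.id))
head(ss.devs2)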
Finally, we create two plots: one of all of the deviance residuals against ID and one of the sum of squared residuals against ID.
par(mfrow = c(2, 1))
plot(firstsex.r$id, firstsex.r$dev.res, ylab = "Deviance residual", xlab = "ID")
plot(ss.devs[,1], ss.devs[,2], ylab = "SS Deviance residual", xlab = "ID")