library(survival)
library(survminer) # for customizable graphs of survival function
library(broom) # for tidy output
library(ggplot2) # for graphing (actually loaded by survminer)
Learn just enough about survival analysis to use the survival package.

Packages used in this workshop:

- survival: Kaplan-Meier estimator, Cox proportional hazards models and extensions (e.g. frailty)
- survminer: customizable plots of survival functions (requires ggplot2)
- broom: storing output tables as data frames

Survival analysis models survival time, the length of time to an event.
Other names for survival time are failure time or time to event.
Example events:
The survival time clock starts when a subject becomes at risk for the event, at which point they enter the risk set.
Let \(T\) represent survival time.
The survival function, \(S(t)\), expresses the probability that a subject survives beyond time \(t\).
\[S(t)=Pr(T>t)\]
For the survival curve above,
\(S(100)=.577\), the probability that a subject survives beyond 100 days is 0.577.
\(S(200)=.122\), the probability that a subject survives beyond 200 days is 0.122.
Typically, we also assume:
\(S(0)=1\), all subjects survive the very first moment
\(S(\infty)=0\), all subjects fail after infinite time
Median survival time: the time \(t\) at which 50% of the population is expected to be still surviving, i.e. \(S(t)=0.5\).
Some methods, e.g. the Cox model, focus on the hazard function, \(h(t)\), which is inversely related to \(S(t)\).
\(h(t)\) is the rate of events, or hazard, at time \(t\), given that the subject has survived until time \(t\).
As the hazard increases, more events are expected per unit time, so survival decreases.
Below we see three examples of hazard functions.
In the green hazard function above, \(h(200)=0.0204\) events per day.
The cumulative hazard function, \(H(t)\), expresses how much hazard has accumulated over time up to time \(t\).
\[H(t)=\int_0^t h(u)du\]
The probability that a subject fails increases as the hazard accumulates.
Because the hazard function \(h(t)\) is never negative, the cumulative hazard \(H(t)\) never decreases with time.
The survival function is inversely related to the cumulative hazard function:
\[S(t)=exp(-H(t))\]
Therefore, by modeling either the survival function or the hazard function, we can infer the other.
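As a quick numeric check of the relationship above, assume a constant hazard of 0.01 events per day (an illustrative value, not from the text):

```r
# constant hazard: h(t) = 0.01 events/day
h <- 0.01
t <- 100
H <- h * t        # cumulative hazard H(100) = 0.01 * 100 = 1
S <- exp(-H)      # S(100) = exp(-1) ~ 0.368
round(S, 3)
```

About 37% of subjects are expected to survive beyond day 100 under this hazard.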
Often the exact time the event occurs is unknown or censored.
Right censoring means that a subject’s actual survival time is greater than their observed time.
Left-censoring means that a subject’s actual survival time is less than their observed time.
Interval censoring means that a subject’s survival time is only known to lie between 2 time points.
We will only discuss methods that handle right-censoring in this workshop.
Many standard methods, such as linear regression, are not equipped to deal with censored outcomes.
*Image adapted from Kleinbaum and Klein, Survival Analysis: A Self-Learning Text, Third Edition, Springer, 2012.
Most survival analysis methods assume non-informative censoring: a subject’s censoring time should not be related to the unobserved survival time
Below, we see 2 diagrams where subjects have the same true survival times (red squares), but censoring times (black circles) are informative on the left and non-informative on the right.
Failing to account for informative censoring may result in biased estimates of survival.
Below are estimated survival functions from the data diagrammed on the previous slide with either noninformative (purple) or informative censoring (green).
Examples of possible informative censoring and resulting bias if not addressed:
A data record for subject in a survival analysis typically includes:
The aml dataset

We’ll start with the aml dataset in the survival package.
These data come from a study looking at time to death for patients with acute myelogenous leukemia, comparing “maintained” chemotherapy treatment to “nonmaintained”.
Variables:

- time: survival or censoring time
- status: 0=censored, 1=death
- x: chemotherapy “maintained” or “nonmaintained”

time | status | x |
---|---|---|
9 | 1 | Maintained |
13 | 1 | Maintained |
13 | 0 | Maintained |
18 | 1 | Maintained |
23 | 1 | Maintained |
28 | 0 | Maintained |
31 | 1 | Maintained |
34 | 1 | Maintained |
45 | 0 | Maintained |
48 | 1 | Maintained |
161 | 0 | Maintained |
5 | 1 | Nonmaintained |
5 | 1 | Nonmaintained |
8 | 1 | Nonmaintained |
8 | 1 | Nonmaintained |
12 | 1 | Nonmaintained |
16 | 0 | Nonmaintained |
23 | 1 | Nonmaintained |
27 | 1 | Nonmaintained |
30 | 1 | Nonmaintained |
33 | 1 | Nonmaintained |
43 | 1 | Nonmaintained |
45 | 1 | Nonmaintained |
Surv() function for survival outcomes

Use Surv() to specify the survival outcome variables:

Surv(time, status)

- time: survival/censoring time variable
- status: status variable; to code censored/event, use 0/1, 1/2, or FALSE/TRUE (the event is 1, 2, or TRUE, respectively)

Observations are assumed to be right-censored unless otherwise specified with the type argument.
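For example, with the aml data (a minimal sketch; Surv() prints a + after right-censored times):

```r
library(survival)

# survival outcome for the aml data; censored times print with a trailing "+"
head(Surv(aml$time, aml$status))
```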
Surv() specification for start-stop format

Some survival analyses require time to be recorded in 2 variables that mark the beginning and end of time intervals. We need this format to model, for example, time-varying covariates.

In this format, some or all subjects may have multiple rows of data.

This format is sometimes called start-stop format.
The jasa1
data set has this setup, where start
and stop
are the time variables, and event
is the status variable:
id start stop event transplant age year surgery
1 1 0 49 1 0 -17.155373 0.1232033 0
2 2 0 5 1 0 3.835729 0.2546201 0
102 3 0 15 1 1 6.297057 0.2655715 0
3 4 0 35 0 0 -7.737166 0.4900753 0
103 4 35 38 1 1 -7.737166 0.4900753 0
4 5 0 17 1 0 -27.214237 0.6078029 0
To specify the outcome for data in start-stop format, use:

Surv(time, time2, event)

- time and time2: beginning and end of time intervals
- event: the status at the end of the interval

The Kaplan-Meier (KM) estimator is a non-parametric method to estimate the survival function, \(S(t)\).
\[\hat{S}(t)=\prod_{t_i<t}\left(1-\frac{events_i}{num.at.risk_i}\right)\]
\(\left(1-\frac{events_i}{num.at.risk_i}\right)\) is the proportion of those still at risk that survive time point \(t_i\).
\(\hat{S}(t)\) is the product of the proportion that survive each time point \(t_i\) up to the current time point \(t\).
\(\hat{S}(t)\) does not change if someone drops due to censoring, although the number at risk will drop.
survfit()

survfit() estimates \(S(t)\) via the KM method.

The first argument to survfit() is a formula:

- Surv() with outcome variables on the left side of ~
- 1 (whole sample) or a grouping variable (stratified) on the right

Printing the survfit object gives a summary:
- n: total number at risk
- events: total number of events that occurred
- median: median survival time (when 50% of those at risk are expected to still be alive)
- 0.95LCL, 0.95UCL: 95% confidence limits for median survival

Using summary() on the survfit object produces a table of the KM estimates of the survival function \(S(t)\) at each failure time.
On each row, at least one subject fails.
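The summary table below can be reproduced with, for example (KM is our name for the fitted object):

```r
library(survival)

# KM estimate of S(t) for the whole aml sample
KM <- survfit(Surv(time, status) ~ 1, data = aml)
summary(KM)
```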
- time: event times in the data set
- n.risk, n.event, n.censor: number at risk, number of events, number censored
- estimate: \(\hat{S}(t)\) survival estimate
- std.error, conf.high, conf.low: standard error and 95% confidence interval (CI) for \(S(t)\)

Call: survfit(formula = Surv(time, status) ~ 1, data = aml)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
5 23 2 0.9130 0.0588 0.8049 1.000
8 21 2 0.8261 0.0790 0.6848 0.996
9 19 1 0.7826 0.0860 0.6310 0.971
12 18 1 0.7391 0.0916 0.5798 0.942
13 17 1 0.6957 0.0959 0.5309 0.912
18 14 1 0.6460 0.1011 0.4753 0.878
23 13 2 0.5466 0.1073 0.3721 0.803
27 11 1 0.4969 0.1084 0.3240 0.762
30 9 1 0.4417 0.1095 0.2717 0.718
31 8 1 0.3865 0.1089 0.2225 0.671
33 7 1 0.3313 0.1064 0.1765 0.622
34 6 1 0.2761 0.1020 0.1338 0.569
43 5 1 0.2208 0.0954 0.0947 0.515
45 4 1 0.1656 0.0860 0.0598 0.458
48 2 1 0.0828 0.0727 0.0148 0.462
Let’s confirm the first 2 survival estimates:
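Doing the arithmetic by hand (a sketch using the counts from the table above): at \(t=5\), 2 of 23 at risk fail; at \(t=8\), 2 of the remaining 21 fail.

```r
# S(5) = 1 - 2/23
S5 <- 1 - 2/23
round(S5, 4)              # 0.913

# S(8) = S(5) * (1 - 2/21)
S8 <- S5 * (1 - 2/21)
round(S8, 4)              # 0.8261
```

These match the survival column of the summary() output.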
Above, we can see that censored observations do not affect \(\hat{S}(t)\) when they leave the risk set:
Using tidy()
from the broom
package on the survfit
object adds times where censored observations leave the risk set to the KM table.
# A tibble: 18 × 8
time n.risk n.event n.censor estimate std.error conf.high conf.low
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5 23 2 0 0.913 0.0643 1 0.805
2 8 21 2 0 0.826 0.0957 0.996 0.685
3 9 19 1 0 0.783 0.110 0.971 0.631
4 12 18 1 0 0.739 0.124 0.942 0.580
5 13 17 1 1 0.696 0.138 0.912 0.531
6 16 15 0 1 0.696 0.138 0.912 0.531
7 18 14 1 0 0.646 0.157 0.878 0.475
8 23 13 2 0 0.547 0.196 0.803 0.372
9 27 11 1 0 0.497 0.218 0.762 0.324
10 28 10 0 1 0.497 0.218 0.762 0.324
11 30 9 1 0 0.442 0.248 0.718 0.272
12 31 8 1 0 0.386 0.282 0.671 0.223
13 33 7 1 0 0.331 0.321 0.622 0.177
14 34 6 1 0 0.276 0.369 0.569 0.134
15 43 5 1 0 0.221 0.432 0.515 0.0947
16 45 4 1 1 0.166 0.519 0.458 0.0598
17 48 2 1 0 0.0828 0.877 0.462 0.0148
18 161 1 0 1 0.0828 0.877 0.462 0.0148
For example, above we see that a subject with a censored survival time left the risk set at \(t=16\), but \(\hat{S}(t)\) does not decrease.
Use plot
to graph the survfit
KM survival function and its 95% CI.
The KM \(\hat{S}(t)\) is a step function that changes only when an event occurs.
The true underlying survival curve \(S(t)\) is likely smooth, but smoothness of the KM curve is limited by the number of event times in the data.
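A minimal sketch of the plotting call (assuming the whole-sample fit is stored as KM):

```r
library(survival)

KM <- survfit(Surv(time, status) ~ 1, data = aml)
# step-function KM curve; the 95% CI is drawn by default for a single curve
plot(KM, xlab = "days", ylab = "survival probability")
```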
To estimate separate KM survival functions for different groups, specify one or more grouping variables after the ~ in survfit().
# stratify by x variable
KM.x <- survfit(Surv(time, status) ~ x, data=aml)
# median survival by strata
KM.x
Call: survfit(formula = Surv(time, status) ~ x, data = aml)
n events median 0.95LCL 0.95UCL
x=Maintained 11 7 31 18 NA
x=Nonmaintained 12 11 23 8 NA
We now get separate tables of KM estimates when using summary():
Call: survfit(formula = Surv(time, status) ~ x, data = aml)
x=Maintained
time n.risk n.event survival std.err lower 95% CI upper 95% CI
9 11 1 0.909 0.0867 0.7541 1.000
13 10 1 0.818 0.1163 0.6192 1.000
18 8 1 0.716 0.1397 0.4884 1.000
23 7 1 0.614 0.1526 0.3769 0.999
31 5 1 0.491 0.1642 0.2549 0.946
34 4 1 0.368 0.1627 0.1549 0.875
48 2 1 0.184 0.1535 0.0359 0.944
x=Nonmaintained
time n.risk n.event survival std.err lower 95% CI upper 95% CI
5 12 2 0.8333 0.1076 0.6470 1.000
8 10 2 0.6667 0.1361 0.4468 0.995
12 8 1 0.5833 0.1423 0.3616 0.941
23 6 1 0.4861 0.1481 0.2675 0.883
27 5 1 0.3889 0.1470 0.1854 0.816
30 4 1 0.2917 0.1387 0.1148 0.741
33 3 1 0.1944 0.1219 0.0569 0.664
43 2 1 0.0972 0.0919 0.0153 0.620
45 1 1 0.0000 NaN NA NA
plot()
produces separate KM survival curves when supplied a stratified survfit
object.
However, confidence intervals are no longer produced by default so we request them with conf.int=TRUE
.\(\dagger\)
We also use col
to specify two colors.
# stratified KM curves with 95% CI, 2 colors
plot(KM.x, ylab="survival probability", xlab="months",
conf.int=TRUE, col=c("red", "blue"))
\(\dagger\) plot()
is a generic base R function and calls plot.survfit()
when supplied a survfit
object. The option conf.int=
is an option of plot.survfit()
.
survminer

plot.survfit() has limited options to customize a plot of survival functions.

ggsurvplot() from the survminer package leverages the graphical power of the ggplot2 package to allow extensive customization of survival function plots.
ggsurvplot() by default adds + symbols to denote censored observations.

ggsurvplot() makes adding a risk table (number at risk at selected timepoints) very easy.

You can pass most arguments to various functions in ggplot2 through ggsurvplot().
ggsurvplot(KM.x, conf.int=T,
risk.table=T,
palette="Accent", # argument to scale_color_brewer()
size=2, # argument to geom_line()
ggtheme = theme_minimal()) # changing ggtheme
You can also use traditional ggplot2 syntax by extracting the ggplot object as ggsurvplot()$plot.
g <- ggsurvplot(KM.x, conf.int=T,
risk.table=T)$plot # this is the ggplot object
g + scale_fill_grey() + scale_color_grey()
See our ggplot2 seminar to learn how to use the ggplot2
package.
survdiff()
A common hypothesis of interest:
\(H_0\): survival curves across 2 or more groups are equivalent
\(H_A\): survival curves across 2 or more groups are not equivalent
A popular method to evaluate this hypothesis is the log-rank statistic, which is \(\chi^2\) distributed with \(g-1\) degrees of freedom under the null (\(g\) is the number of groups).
The function survdiff()
performs the log-rank test.
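For the aml data, the log-rank test can be run as follows, producing the output below:

```r
library(survival)

# log-rank test comparing survival curves across chemotherapy groups
survdiff(Surv(time, status) ~ x, data = aml)
```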
Call:
survdiff(formula = Surv(time, status) ~ x, data = aml)
N Observed Expected (O-E)^2/E (O-E)^2/V
x=Maintained 11 7 10.69 1.27 3.4
x=Nonmaintained 12 11 7.31 1.86 3.4
Chisq= 3.4 on 1 degrees of freedom, p= 0.07
We see weak evidence that the survival curves may be different.
For some problems, early differences in survival curves might be more important than later differences, such as for death after surgery.
Early differences can be emphasized by applying weights that decrease as time advances.
One popular weighting method is to weight the survival times by the survival probabilities \(\hat{S}(t)\) themselves, which decrease with time.
survdiff()
allows weights in calculation of the \(\chi^2\) statistic with the rho=
argument. The weights are defined as \(\hat{S}(t)^\rho\), where \(0 \le \rho \le 1\).
- rho=1: weights are the survival estimates, \(\hat{S}(t)^1=\hat{S}(t)\), equivalent to the Gehan-Wilcoxon test
- rho=0: equal weights, \(\hat{S}(t)^0=1\), the traditional log-rank test and the default
- 0<rho<1: values closer to 1 weight earlier time points more

# rho=1 specifies Peto & Peto modification of Gehan-Wilcoxon,
# more weight put on earlier time points
survdiff(Surv(time, status) ~ x, data=aml, rho=1)
Call:
survdiff(formula = Surv(time, status) ~ x, data = aml, rho = 1)
N Observed Expected (O-E)^2/E (O-E)^2/V
x=Maintained 11 3.85 6.14 0.859 2.78
x=Nonmaintained 12 7.18 4.88 1.081 2.78
Chisq= 2.8 on 1 degrees of freedom, p= 0.1
The veteran data set describes survival times for veterans with lung cancer.

Variables:

- time: survival time
- status: status, 0=censored, 1=dead
- trt: treatment, 1=standard, 2=test

The Cox model:
For simplicity, we begin with a Cox model with a single time-constant predictor, \(X_1\):
\[h(t|X_1=x_1) = h_0(t)exp(b_1x_1)\]
For example, imagine \(X_1=1\) for treatment and \(X_1=0\) for control.
The hazard at time \(t\) for treatment:
\[\begin{aligned}h(t|X_1=1) &= h_0(t)exp(b_1*1)\\ &= h_0(t)exp(b_1) \end{aligned}\]
and for control:
\[\begin{aligned}h(t|X_1=0) &= h_0(t)exp(b_1*0)\\ &= h_0(t)exp(0) \\ &= h_0(t) \end{aligned}\]
The hazard ratio (HR), \(exp(b_1)\), compares the hazards for treatment and control:
\[\begin{aligned}HR&=\frac{h(t|X_1=1)}{h(t|X_1=0)} \\ &= \frac{h_0(t)exp(b_1)}{h_0(t)} \\ &= exp(b_1) \end{aligned}\]
In general, \(exp(b_1)\) expresses the hazard ratio for a 1-unit increase in the predictor.
\(b_1\) itself is the log-hazard ratio.
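As a quick illustration with made-up numbers:

```r
# a log hazard ratio of b1 = -0.5 corresponds to
b1 <- -0.5
HR <- exp(b1)
round(HR, 3)   # 0.607: the hazard for a 1-unit increase is ~39% lower
```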
The standard Cox model assumes proportional hazards, which means that covariate effects (expressed as hazard ratios) are constant over time.
Notice that the expression for the hazard ratio for the treatment effect does not contain time, so is constant:
\[HR=exp(b_1)\]
Note: Under proportional hazards, a subject’s hazard can change over time, but the hazard ratio comparing 2 subjects with different covariate values cannot.
Visually, this is represented by “parallel” survival curves that should not cross:
Proportional hazards is easier to evaluate if we plot \(-log(-log(S(t)))\), sometimes called a log minus log survival plot. Parallel curves suggest proportional hazards holds.
Non-proportional hazards means that a predictor’s effect changes over time.
Hazard and survival functions will often cross:
As will graphs of \(-log(-log(S(t)))\):
Failing to account for non-constant hazard ratios threatens validity of Cox model estimates.
The Cox model can be extended in various ways to accommodate non-proportional hazards.
The Cox model is popular because it does not require specification of the baseline hazard function, \(h_0(t)\), the hazard function for a subject with zero on all covariates.
Essentially, this means we can use the Cox model without knowing the shape of the underlying hazard functions.
Because no parameters describing a survival time distribution (e.g. a mean) are estimated, while covariate effects are modeled with parameters, the Cox model is semiparametric.
Because \(h_0(t)\) is left unspecified, the Cox model only estimates covariate effects on the hazard function, not the hazard function itself.
Cox models have no constant/intercept term, which prevents estimation of the hazard function (without additional methods).
The Cox model is easily extended to accommodate multiple predictors, each of whose effects is assumed to be proportional over time.
\[h(t|X_1, X_2, ...X_p) = h_0(t)exp(b_1X_1 + b_2X_2 + ... + b_pX_p)\] Each coefficient \(b_i\) can be exponentiated to calculate a hazard ratio.
The lung data set in the survival package

We will use the lung data set from the survival package for Cox modeling.
The data describe survival of patients with advanced lung cancer.
Variables:

- time: survival time in days
- status: 1=censored, 2=dead
- age: age in years (assessed at beginning)
- sex: 1=male, 2=female\(^\dagger\)
- wt.loss: weight loss (pounds) in last 6 months

\(^\dagger\)Normally binary variables should be coded 0/1 so that the intercept is interpretable. However, with no intercept in the Cox model, 1/2 coding is equivalent.
Use coxph(formula, data=) to fit a Cox model.

The formula is a typical R regression formula:

Surv(time, event) ~ x1 + x2…

where x1 + x2… is a list of one or more predictor variables separated by +.
# fit cox model and save results
lung.cox <- coxph(Surv(time, status) ~ age + sex + wt.loss, data=lung)
# summary of results
summary(lung.cox)
Call:
coxph(formula = Surv(time, status) ~ age + sex + wt.loss, data = lung)
n= 214, number of events= 152
(14 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
age 0.0200882 1.0202913 0.0096644 2.079 0.0377 *
sex -0.5210319 0.5939074 0.1743541 -2.988 0.0028 **
wt.loss 0.0007596 1.0007599 0.0061934 0.123 0.9024
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.0203 0.9801 1.0011 1.0398
sex 0.5939 1.6838 0.4220 0.8359
wt.loss 1.0008 0.9992 0.9887 1.0130
Concordance= 0.612 (se = 0.027 )
Likelihood ratio test= 14.67 on 3 df, p=0.002
Wald test = 13.98 on 3 df, p=0.003
Score (logrank) test = 14.24 on 3 df, p=0.003
Output interpretation:

- coef: log hazard ratio coefficients
- exp(coef): hazard ratios (exponentiated coefficients)
- se(coef): standard errors of log hazard ratios
- Pr(>|z|): p-value for the test of log hazard ratio = 0, or equivalently, hazard ratio = 1
- lower .95, upper .95: 95% confidence interval for the hazard ratio
- Concordance: proportion of pairs that are concordant, a goodness-of-fit measure (values near 1 indicate better fit)
- Likelihood ratio, Wald, and Score tests: 3 tests of the null hypothesis that all coefficients equal zero

coxph() coefficients and CIs using tidy()
We can tidy()
the coxph()
results and store them in a data frame.
# save summarized results as data.frame
# exponentiate=T returns hazard ratios
lung.cox.tab <- tidy(lung.cox, exponentiate=T, conf.int=T)
# display table
lung.cox.tab
# A tibble: 3 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 age 1.02 0.00966 2.08 0.0377 1.00 1.04
2 sex 0.594 0.174 -2.99 0.00280 0.422 0.836
3 wt.loss 1.00 0.00619 0.123 0.902 0.989 1.01
We then pass the data frame of model results to ggplot2
to plot the hazard ratios and their confidence intervals.
# plot of hazard ratios and 95% CIs
ggplot(lung.cox.tab,
aes(y=term, x=estimate, xmin=conf.low, xmax=conf.high)) +
geom_pointrange() + # plots center point (x) and range (xmin, xmax)
geom_vline(xintercept=1, color="red") + # vertical line at HR=1
labs(x="hazard ratio", title="Hazard ratios and 95% CIs") +
theme_classic()
We often want to estimate the survival functions for a subject with a particular set of covariate values.
The semiparametric Cox model does not estimate the survival function (or hazard function) directly.
However, we can use non-parametric methods (similar to KM) to estimate the baseline survival function, \(S_0(t)\), the survival function for a subject with zero on all covariates.
\[{S}_0(t)={S}(t|X_1=0,X_2=0,...,X_p=0)\]
The Cox model coefficients then allow estimation of the survival function for a subject with any set of covariate values:
\[\hat{S}(t|X_1=x_1,X_2=x_2,...,X_p=x_p)=\hat{S}_0(t)^{exp(b_1x_1+b_2x_2+...+b_px_p)}\]
Because \(\hat{S}_0(t)\) is estimated non-parametrically, survival functions will again be step functions that change values only at observed event times.
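A small numeric sketch of the formula above (illustrative values, not from any fitted model):

```r
# suppose the baseline survival at some time t is 0.8,
# and a subject has x1 = 1 with coefficient b1 = -0.5
S0_t <- 0.8
b1   <- -0.5
S_t  <- S0_t ^ exp(b1 * 1)
round(S_t, 3)   # 0.873: higher than baseline, since exp(b1) < 1 lowers the hazard
```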
coxph() with survfit()

survfit(), when supplied a coxph model object, predicts survival functions.
If no covariate values are supplied to survfit()
, the survival function is estimated for a subject with covariates at their means.
tidy() (or summary()) produces a table of the survival function estimated by survfit():
# predict survival function for subject with means on all covariates
surv.at.means <- survfit(lung.cox)
#table of survival function
tidy(surv.at.means)
# A tibble: 179 × 8
time n.risk n.event n.censor estimate std.error conf.high conf.low
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5 214 1 0 0.996 0.00443 1 0.987
2 11 213 2 0 0.987 0.00772 1 0.972
3 12 211 1 0 0.982 0.00894 1.00 0.965
4 13 210 2 0 0.973 0.0110 0.995 0.953
5 15 208 1 0 0.969 0.0120 0.992 0.946
6 26 207 1 0 0.964 0.0128 0.989 0.940
7 30 206 1 0 0.960 0.0137 0.986 0.935
8 31 205 1 0 0.955 0.0145 0.983 0.929
9 53 204 2 0 0.946 0.0160 0.976 0.917
10 54 202 1 0 0.942 0.0167 0.973 0.912
# ℹ 169 more rows
plot()
plots the predicted survival curve and CI estimated by survfit()
.
Specify a data frame of covariate values in the newdata=
argument of survfit()
to predict a survival function for each row of the data frame:
First we create the new data frame of 2 rows of covariate values.
# create new data for plotting: 1 row for each sex
# and mean age and wt.loss for both rows
plotdata <- data.frame(age=mean(lung$age),
sex=1:2,
wt.loss=mean(lung$wt.loss, na.rm=TRUE))
# look at new data
plotdata
age sex wt.loss
1 62.44737 1 9.831776
2 62.44737 2 9.831776
Then we supply the new data to survfit
to predict separate survival functions for males and females, both at mean age and mean weight loss.
In the table of predicted survival functions produced by tidy(), the column suffixes .1 and .2 differentiate estimates between the 2 sexes.

estimate.1 contains the survival function estimates for sex=1, males, while estimate.2 is for females.

# get survival function estimates for each sex
surv.by.sex <- survfit(lung.cox, newdata=plotdata) # one function for each sex
# columns with .1 are for males, .2 for females
tidy(surv.by.sex)
# A tibble: 179 × 12
time n.risk n.event n.censor estimate.1 estimate.2 std.error.1 std.error.2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5 214 1 0 0.995 0.997 0.00546 0.00327
2 11 213 2 0 0.984 0.990 0.00953 0.00577
3 12 211 1 0 0.978 0.987 0.0111 0.00674
4 13 210 2 0 0.967 0.980 0.0137 0.00844
5 15 208 1 0 0.962 0.977 0.0148 0.00921
6 26 207 1 0 0.956 0.974 0.0159 0.00995
7 30 206 1 0 0.951 0.971 0.0170 0.0107
8 31 205 1 0 0.945 0.967 0.0180 0.0113
9 53 204 2 0 0.934 0.960 0.0199 0.0127
10 54 202 1 0 0.929 0.957 0.0208 0.0133
# ℹ 169 more rows
# ℹ 4 more variables: conf.high.1 <dbl>, conf.high.2 <dbl>, conf.low.1 <dbl>,
# conf.low.2 <dbl>
Note: The CIs for the survival functions are estimated but not printed due to space.
Remember when using plot()
to graph multiple survival functions, CIs must be specifically requested with conf.int=T
.
# plot survival estimates
plot(surv.by.sex, xlab="days", ylab="survival probability",
conf.int=TRUE, col=c("blue", "red"))
Use ggsurvplot()
from survminer
for more customized plots.
Note: if a dataset was specified as newdata=
in survfit()
to generate survival function estimates, that dataset must be specified as data=
in ggsurvplot()
as well.
# data= is the same data used in survfit()
# censor=F removes censoring symbols
ggsurvplot(surv.by.sex, data=plotdata, censor=F,
legend.labs=c("male", "female"))
We see that females have overall better survival, reflected by \(\widehat{HR}_{sex}=.594\).
The plausibility of a model’s assumptions affects our confidence in the estimates.
Several methods have been developed to assess the PH assumption of the Cox model (including the log minus log plots discussed earlier).
Here we discuss 2 tools developed by Grambsch and Therneau (1994) based on Schoenfeld residuals\(^1\) that are available in the survival package.
cox.zph()
performs a \(\chi^2\) test based on Schoenfeld residuals to test the hypothesis:
\(H_0\): covariate effect is constant (proportional) over time
\(H_A\): covariate effect changes over time
The null hypothesis of PH is tested for each covariate individually and jointly as well.
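The test is run on the fitted coxph object; for the lung model fit earlier as lung.cox:

```r
library(survival)

lung.cox <- coxph(Surv(time, status) ~ age + sex + wt.loss, data = lung)
# chi-square test of proportional hazards for each covariate and globally
cox.zph(lung.cox)
```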
chisq df p
age 0.5077 1 0.48
sex 2.5489 1 0.11
wt.loss 0.0144 1 0.90
GLOBAL 3.0051 3 0.39
There is no strong evidence that proportional hazards is violated for any covariate, though there is perhaps some weak evidence of a violation for sex.
Another tool used to assess PH is a plot of a smoothed curve over the Schoenfeld residuals.
To create this plot, plot()
the object returned by cox.zph()
.
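For example, continuing with the lung.cox model fit earlier:

```r
library(survival)

lung.cox <- coxph(Surv(time, status) ~ age + sex + wt.loss, data = lung)
zph <- cox.zph(lung.cox)
# one panel per covariate: smoothed Beta(t) estimate across time
plot(zph)
```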
These plots depict an estimate of the coefficient (labelled “Beta(t)” in the plot) on the y-axis across time on the x-axis. PH is indicated by a flat smoothed curve.
Again we see some evidence of non-proportional hazards for sex, as the effect of sex seems to increase with time.
The effect of sex seems to be strongest at the beginning of follow-up, but then trends toward zero as time passes.
\(^1\)Schoenfeld residuals are defined for each covariate and for each subject at the subject’s event time. The residual measures the difference between that subject’s covariate values and the average covariate values of those still at risk. For example, a positive residual for age means that a subject who failed is older than those still at risk. If the residuals for age increase on average as time passes, older patients are failing at higher rates than the Cox model expects as time passes, suggesting that the effect of age is increasing with time.
Several strategies have been proposed to account for violation of PH. We discuss two here:
If the change in the coefficient is not large enough to be clinically meaningful, it can perhaps be ignored as well.
Imagine we are concerned that possible PH violation by sex in our Cox models may bias all of the model estimates.
We can address this concern with a sensitivity analysis to show how model inferences change depending on whether we ignore or address the PH violation by sex.
In the Cox model assuming PH for sex, we find evidence that the hazard increases with age, while the effect of wt.loss is practically null or undetectable.
Call:
coxph(formula = Surv(time, status) ~ age + sex + wt.loss, data = lung)
n= 214, number of events= 152
(14 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
age 0.0200882 1.0202913 0.0096644 2.079 0.0377 *
sex -0.5210319 0.5939074 0.1743541 -2.988 0.0028 **
wt.loss 0.0007596 1.0007599 0.0061934 0.123 0.9024
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.0203 0.9801 1.0011 1.0398
sex 0.5939 1.6838 0.4220 0.8359
wt.loss 1.0008 0.9992 0.9887 1.0130
Concordance= 0.612 (se = 0.027 )
Likelihood ratio test= 14.67 on 3 df, p=0.002
Wald test = 13.98 on 3 df, p=0.003
Score (logrank) test = 14.24 on 3 df, p=0.003
One approach to addressing a PH-violating variable is to estimate the model stratified by that variable.
Drawback: no coefficient will be estimated for the stratification variable.
In the stratified Cox model:
Put the stratification variable inside strata()
within the coxph()
model formula:
lung.strat.sex <- coxph(Surv(time, status) ~ age + wt.loss + strata(sex), data=lung)
summary(lung.strat.sex)
Call:
coxph(formula = Surv(time, status) ~ age + wt.loss + strata(sex),
data = lung)
n= 214, number of events= 152
(14 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
age 0.0192190 1.0194049 0.0096226 1.997 0.0458 *
wt.loss 0.0001412 1.0001412 0.0062509 0.023 0.9820
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.019 0.9810 1.000 1.039
wt.loss 1.000 0.9999 0.988 1.012
Concordance= 0.561 (se = 0.027 )
Likelihood ratio test= 4.09 on 2 df, p=0.1
Wald test = 3.99 on 2 df, p=0.1
Score (logrank) test = 4 on 2 df, p=0.1
The coefficients for age and wt.loss, averaged across males and females, are similar to those from the model where we assumed PH for sex, so our overall inferences are unchanged.
However, if the effect of sex is of primary interest, we should not stratify by sex.
The Cox model can be extended to allow the coefficient representing a covariate effect to change over time, by interacting that covariate with some function of time.
However, unlike in other regression models, we cannot create this interaction term by simply multiplying the covariate by the time variable (unless the data are in a special structure, see survSplit()
).
Instead, we strongly recommend the use of the time-transform function, tt()
, to avoid easily-made mistakes.
To allow a covariate's effect to change over time:

- wrap the covariate in tt() in the model formula
- define the time-transform function with the tt= argument within coxph(), e.g. tt = function(x,t,...)

The time-transform function takes the covariate x and time t. For example:

- x*t will allow the coefficient to change with time linearly
- x*log(t) allows the coefficient to change with the log (magnitudes) of time
- x*(t>100) allows the coefficient to take on 2 different values, one value when \(t \le 100\) and another value when \(t>100\)
Below we specify an interaction of sex with time itself, so that the effect of sex is allowed to change linearly with time.
# notice sex and tt(sex) in model formula
lung.sex.by.time <- coxph(Surv(time, status) ~ age + wt.loss + sex + tt(sex), # sex and tt(sex) in formula
data=lung,
tt=function(x,t,...) x*t) # linear change in effect of sex
summary(lung.sex.by.time)
Call:
coxph(formula = Surv(time, status) ~ age + wt.loss + sex + tt(sex),
data = lung, tt = function(x, t, ...) x * t)
n= 214, number of events= 152
(14 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
age 0.0194343 1.0196244 0.0096522 2.013 0.0441 *
wt.loss 0.0001260 1.0001261 0.0062502 0.020 0.9839
sex -0.9417444 0.3899470 0.3224791 -2.920 0.0035 **
tt(sex) 0.0013778 1.0013787 0.0008581 1.606 0.1084
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.0196 0.9808 1.0005 1.0391
wt.loss 1.0001 0.9999 0.9879 1.0125
sex 0.3899 2.5645 0.2073 0.7337
tt(sex) 1.0014 0.9986 0.9997 1.0031
Concordance= 0.613 (se = 0.027 )
Likelihood ratio test= 17.23 on 4 df, p=0.002
Wald test = 15.86 on 4 df, p=0.003
Score (logrank) test = 16.44 on 4 df, p=0.002
The coef
estimated for sex
is the log-hazard-ratio at day \(t=0\), \(\hat{b}_{sex}=-.942\), corresponding to a hazard ratio of \(\hat{HR}_{sex}=exp(\hat{b}_{sex})=.39\).
The coef
for tt(sex)
is the change in the log-hazard-ratio for each additional day that passes.
Again, our inferences regarding age and wt.loss are mostly unchanged.
These sex coefficient estimates match the graph of the smoothed Schoenfeld residuals for sex. At the beginning of follow-up, the coefficient is close to -1, and it increases gradually over time.
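Using the reported coefficients, we can sketch the implied hazard ratio for sex as a function of time (values taken from the output above):

```r
b_sex <- -0.9417444   # log hazard ratio for sex at t = 0
b_tt  <-  0.0013778   # change in the log hazard ratio per day
hr_sex <- function(t) exp(b_sex + b_tt * t)

round(hr_sex(0), 2)     # 0.39 at day 0
round(hr_sex(500), 2)   # closer to 1 after 500 days
```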
The Cox PH model accommodates time-varying covariates, but the data should be structured in start-stop time format:
The survival package provides a function tmerge() to help get your data into this format. See vignette("timedep") for guidance on its usage.
The jasa1
dataset, which looks at survival for patients on a waiting list for heart transplant, is already set up in this format:
In this data set, transplant
is a time-varying covariate:
id start stop event transplant age year surgery
1 1 0 49 1 0 -17.155373 0.1232033 0
2 2 0 5 1 0 3.835729 0.2546201 0
102 3 0 15 1 1 6.297057 0.2655715 0
3 4 0 35 0 0 -7.737166 0.4900753 0
103 4 35 38 1 1 -7.737166 0.4900753 0
4 5 0 17 1 0 -27.214237 0.6078029 0
For the patient with id=4: transplant=0 from day 0 to day 35, then transplant=1 from day 35 until death at day 38 (event=1 on the second row).
The data are ready for coxph()
.
Use the Surv(time, time2, event)
specification.
jasa1.cox <- coxph(Surv(start, stop, event) ~ transplant + age + surgery, data=jasa1)
summary(jasa1.cox)
Call:
coxph(formula = Surv(start, stop, event) ~ transplant + age +
surgery, data = jasa1)
n= 170, number of events= 75
coef exp(coef) se(coef) z Pr(>|z|)
transplant 0.01405 1.01415 0.30822 0.046 0.9636
age 0.03055 1.03103 0.01389 2.199 0.0279 *
surgery -0.77326 0.46150 0.35966 -2.150 0.0316 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
transplant 1.0142 0.9860 0.5543 1.8555
age 1.0310 0.9699 1.0033 1.0595
surgery 0.4615 2.1668 0.2280 0.9339
Concordance= 0.599 (se = 0.036 )
Likelihood ratio test= 10.72 on 3 df, p=0.01
Wald test = 9.68 on 3 df, p=0.02
Score (logrank) test = 10 on 3 df, p=0.02
We can follow with the same procedures as before, checking PH assumptions and plotting predicted survival curves.
chisq df p
transplant 0.118 1 0.73
age 0.897 1 0.34
surgery 0.097 1 0.76
GLOBAL 1.363 3 0.71
No strong evidence that PH is violated.
# plot predicted survival by transplant group at mean age and surgery=0
plotdata <- data.frame(transplant=0:1, age=-2.48, surgery=0)
surv.by.transplant <- survfit(jasa1.cox, newdata=plotdata)
ggsurvplot(surv.by.transplant, data=plotdata, # remember to supply data to ggsurvplot() for predicted survival after coxph()
legend.labs=c("0", "1"), legend.title="transplant") # give proper labels and title to legend
Now we will fit a Cox proportional hazards model to the veteran data set.

Variables:

- time: survival time in weeks
- status: status, 0=censored, 1=dead
- trt: treatment, 1=standard, 2=test
- age: age

Then allow the effect of trt to change over time (even though the original model was not very useful).

The coxph() function has some additional flexibility, including:
- several additional features described in vignette("survival")
- frailty models: see ?frailty for more, but in general, use the coxme package
- robust and cluster options in ?coxph
- weights argument in ?coxph
The survival package has the survreg() function for parametric regression models.

Additional useful survival analysis packages:

- coxme package: frailty random effects models, also written by Therneau
- icenReg package: for interval-censored data (note: data set up is different from survival)

References for the survival package
The survival package has some of the best vignettes (tutorials) of any R package.

Key vignettes for getting started (use vignette(package="survival") to see a full list):

- vignette("survival") for a general introduction to the usage and capabilities of the survival package, as well as some theoretical survival analysis background
- vignette("timedep") for guidance on modeling time-varying covariates and time-varying coefficients

References for survival analysis
Grambsch, P. & Therneau, T. (1994). Proportional hazards tests and diagnostics based on weighted residuals. Biometrika, 81, 515-526.

Therneau, T. M. & Grambsch, P. M. (2000). Modeling Survival Data: Extending the Cox Model. New York: Springer.

Kleinbaum, D. G. & Klein, M. (2012). Survival Analysis: A Self-Learning Text (3rd ed.). New York: Springer.