This workshop aims to provide just enough background in survival
analysis to be able to use the survival
package in R
to:
The survival
package:
survival
packagesurvminer
We use the survminer
for its
ggsurvplot()
function, used to create highly customizable
plots of survival functions
We use the broom
package for its tidy()
function, which cleans up output tables and stores them as data
frames.
If you are following in RStudio, go ahead and load the workshop
packages now with library()
.
coxph()
Survival analysis models how much time elapses before an event occurs.
The outcome variable, the length of time to an event, is often referred to as either survival time, failure time, or time to event.
Example events include:
Events are often referred to as failures.
Almost anything can be framed as the event of interest, so survival analysis has broad applications across many fields.
We often say that the subject is at risk and a member of the risk set before the event occurs or the subject’s time is censored.
One of the goals of survival analysis is to estimate the probability that a subject survives without experiencing the event past some time \(t\).
We can infer these probabilities from observing how long different subjects remain at risk before failing, i.e., observing their survival times.
Let \(T\) be a random variable representing a subject’s true survival time.
Sometimes, we cannot observe a subject’s true survival time \(T\) during the course of a study, known as censoring. In general, we say we observe subjects’ follow-up time, which for some will be the true survival time \(T\) and for others will be the censoring time.
The survival function, \(S(t)\) expresses the probability that a subject’s true survival time \(T\) will exceed time \(t\), i.e., that the subject will survive beyond time \(t\).
\[S(t)=Pr(T>t)\]
For the survival curve above,
\(S(100)=.577\), the probability that a subject survives beyond 100 days is 0.577.
\(S(200)=.122\), the probability that a subject survives beyond 200 days is 0.122.
Typically, we also assume:
\(S(0)=1\), all subjects survive the very first moment
\(S(\infty)=0\), all subjects fail after infinite time
Median survival time is defined as the time \(t\) at which 50% of the population is expected to be still surviving:
Some survival methods, such as the Kaplan-Meier estimator, focus on estimating the survival function \(S(t)\) directly.
Other methods, such as the Cox model, focus on the hazard function (also known as the hazard rate), \(h(t)\), which is inversely related to the \(S(t)\).
The hazard function at time \(t\), \(h(t)\), is defined as the instantaneous rate of events at time \(t\), given that the subject has survived until time \(t\). We say instantaneous because \(h(t)\) may be changing moment to moment, continuously over time.
For example, in the green curve below, \(h(200)=\frac{\text{.0204 events}}{day}\), while \(h(200.1)=\frac{\text{.02041 events}}{day}\).
With an increase in the hazard function, more events are expected per unit time, and survival will be expected to decrease.
Below we see three examples of hazard functions, 2 of which are changing continuously with time.
Note: \(h(t)\ge0\), so the hazard function can never be negative
The cumulative hazard function, \(H(t)\), expresses how much hazard a subject has accumulated over time up to time \(t\).
\[H(t)=\int_0^t h(u)du\] The probability that a subject will fail over time increases as the hazard accumulates.
Because the hazard function \(h(t)\) is never negative, the cumulative hazard \(H(t)\) can never decrease with time.
The survival function is inversely related to the cumulative hazard function, where we see that as a subject’s cumulative hazard grows, the survival probability decreases.
\[S(t)=exp(-H(t))\]
Therefore, by modeling either the survival function or the hazard function, we can infer the other.
Many times the exact time when the event is unknown or censored.
Right censoring means that a subject’s actual survival time is greater than their observed time
Left-censoring means that a subjects actual survival time is less than their observed time. One common example is when the event is defined as disease infection: positive tests for the infection may be delayed by days or even years.
Interval censoring means that a subjects is survival time is unknown, but known to lie between 2 observed time points.
We will only discuss methods that handle right-censoring in this workshop.
Many standard methods, such as linear regression, are not equipped to deal with censored outcomes.
*Image adapted from Kleinbaum and Klein, Survival Analysis: A Self-Learning Text, Third Edition, Springer, 2012.
Most survival analysis methods, including all those discussed here, assume non-informative censoring.
Failing to account for informative censoring may result in biased estimates of survival. Below are plots of the Kaplan-Meier survival function estimates of the above data:
Examples of possible informative censoring and resulting bias if not addressed:
The simplest data structure for a typical survival analysis is:
aml
datasetWe’ll start with the aml
dataset in the
survival
package.
These data come from a study looking at time to death for patients with acute myelogenous leukemia, comparing “maintained” chemotherapy treatment to “nonmaintained”.
Variables:
time
survival or censoring timestatus
0=censored, 1=deathx
chemotherapy “maintained” or “nonmaintained”time | status | x |
---|---|---|
9 | 1 | Maintained |
13 | 1 | Maintained |
13 | 0 | Maintained |
18 | 1 | Maintained |
23 | 1 | Maintained |
28 | 0 | Maintained |
31 | 1 | Maintained |
34 | 1 | Maintained |
45 | 0 | Maintained |
48 | 1 | Maintained |
161 | 0 | Maintained |
5 | 1 | Nonmaintained |
5 | 1 | Nonmaintained |
8 | 1 | Nonmaintained |
8 | 1 | Nonmaintained |
12 | 1 | Nonmaintained |
16 | 0 | Nonmaintained |
23 | 1 | Nonmaintained |
27 | 1 | Nonmaintained |
30 | 1 | Nonmaintained |
33 | 1 | Nonmaintained |
43 | 1 | Nonmaintained |
45 | 1 | Nonmaintained |
Surv()
function for survival outcomesUse Surv()
to specify the survival outcome
variables.
Allows for many different time and event status configurations.
For data with a single time variable indicating time to event or
censoring, the Surv
specification will be:
Surv(time, event)
time
survival/censoring time variableevent
status variable. To code for
censored/event use:
Censoring is assumed to be right-censored unless otherwise
specified with the type
argument.
Surv()
specification for start-stop formatSome survival analyses require time to be recorded in 2 variables that mark the beginning and end of time intervals. We need this format to model:
In this format, some or all subjects may have multiple rows of
data.
This format is sometimes called start-stop format.
The jasa1
data set has this setup, where
start
and stop
are the time variables, and
event
is the status variable:
## id start stop event transplant age year surgery
## 1 1 0 49 1 0 -17.155373 0.1232033 0
## 2 2 0 5 1 0 3.835729 0.2546201 0
## 102 3 0 15 1 1 6.297057 0.2655715 0
## 3 4 0 35 0 0 -7.737166 0.4900753 0
## 103 4 35 38 1 1 -7.737166 0.4900753 0
## 4 5 0 17 1 0 -27.214237 0.6078029 0
To specify the outcome for data in stop-start format, use:
Surv(time, time2, event)
time
and time2
beginning and end of time intervalsevent
is the status at the end of the
interval.The Kaplan-Meier (KM) estimator is a very popular non-parametric method to estimate the survival function, \(S(t)\). Non-parametric means that we are not assuming any particular distribution for the survival times.
It has an intuitive formula:
\[\hat{S}(t)=\prod_{t_i<t}\left(1-\frac{events_i}{num.at.risk_i}\right)\] The expression \(\left(1-\frac{events_i}{num.at.risk_i}\right)\) is simply the proportion of those at risk that survive time point \(t_i\). The KM estimator is the product of the proportion that survive each time point \(t_i\) up to the current time point \(t\).
Survival estimates do not change if someone drops due to censoring, although the number at risk will drop.
survfit()
We can obtain the KM estimate of \(S(t)\) using survfit()
.
The first argument is a model formula with a Surv()
outcome specification on the left side of ~
.
To estimate the survival function for the entire data set, we specify
1
after ~
.
# ~ 1 indicates KM survival function estimate for whole sample
KM <- survfit(Surv(time, status) ~ 1, data=aml)
Printing the survfit
object gives a summary:
n
: total number at riskevents
: total number of events that occurredmedian
: survival time \(t\) at which \(S(t)=.5\), or the time at which 50% of
those at risk are expected to still be alive0.95LCL
, 0.95UCL
: 95% confidence limits
for median survival## Call: survfit(formula = Surv(time, status) ~ 1, data = aml)
##
## n events median 0.95LCL 0.95UCL
## [1,] 23 18 27 18 45
## Call: survfit(formula = Surv(time, status) ~ 1, data = aml)
##
## n events median 0.95LCL 0.95UCL
## [1,] 23 18 27 18 45
The tidy()
function from the broom
package
works with many of the output objects created by the
survival
package to create tables stored as tibbles (modern
data.frames)
Using tidy()
on the survfit()
object
produces a table of the KM estimate of the survival function \(S(t)\)
time
event/censoring times in the data setn.risk
, n.event
, n.censor
number at risk, number of events, number censoredestimate
\(\hat{S}(t)\) survival estimate
std.error
, conf.high
,
conf.low
standard error for \(\hat{S}(t)\), and 95% confidence interval
for \(S(t)\)# save KM survival function as tibble (modern data.frame)
KM.tab <- tidy(KM)
KM.tab # same as print(KM.tab)
## # A tibble: 18 × 8
## time n.risk n.event n.censor estimate std.error conf.high conf.low
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 23 2 0 0.913 0.0643 1 0.805
## 2 8 21 2 0 0.826 0.0957 0.996 0.685
## 3 9 19 1 0 0.783 0.110 0.971 0.631
## 4 12 18 1 0 0.739 0.124 0.942 0.580
## 5 13 17 1 1 0.696 0.138 0.912 0.531
## 6 16 15 0 1 0.696 0.138 0.912 0.531
## 7 18 14 1 0 0.646 0.157 0.878 0.475
## 8 23 13 2 0 0.547 0.196 0.803 0.372
## 9 27 11 1 0 0.497 0.218 0.762 0.324
## 10 28 10 0 1 0.497 0.218 0.762 0.324
## 11 30 9 1 0 0.442 0.248 0.718 0.272
## 12 31 8 1 0 0.386 0.282 0.671 0.223
## 13 33 7 1 0 0.331 0.321 0.622 0.177
## 14 34 6 1 0 0.276 0.369 0.569 0.134
## 15 43 5 1 0 0.221 0.432 0.515 0.0947
## 16 45 4 1 1 0.166 0.519 0.458 0.0598
## 17 48 2 1 0 0.0828 0.877 0.462 0.0148
## 18 161 1 0 1 0.0828 0.877 0.462 0.0148
We can plot
the KM survival function and confidence
intervals for the entire aml
sample.
Notice how the KM-estimated \(\hat{S}(t)\) is a step function, where \(\hat{S}(t)\) only changes at timepoints when an event occurs.
The true underlying survival curve \(S(t)\) may be smooth, but the smoothness of the KM curve is limited by the number of event times observed in the data.
To estimate separate KM survival functions for different strata,
specify one or more strata variables after the ~
in
survfit()
# stratify by x variable
KM.x <- survfit(Surv(time, status) ~ x, data=aml)
# median survival by strata
KM.x
## Call: survfit(formula = Surv(time, status) ~ x, data = aml)
##
## n events median 0.95LCL 0.95UCL
## x=Maintained 11 7 31 18 NA
## x=Nonmaintained 12 11 23 8 NA
Notice the addition of the strata
column in the
tidy()
output:
## # A tibble: 20 × 9
## time n.risk n.event n.censor estimate std.error conf.high conf.low strata
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 9 11 1 0 0.909 0.0953 1 0.754 x=Mainta…
## 2 13 10 1 1 0.818 0.142 1 0.619 x=Mainta…
## 3 18 8 1 0 0.716 0.195 1 0.488 x=Mainta…
## 4 23 7 1 0 0.614 0.249 0.999 0.377 x=Mainta…
## 5 28 6 0 1 0.614 0.249 0.999 0.377 x=Mainta…
## 6 31 5 1 0 0.491 0.334 0.946 0.255 x=Mainta…
## 7 34 4 1 0 0.368 0.442 0.875 0.155 x=Mainta…
## 8 45 3 0 1 0.368 0.442 0.875 0.155 x=Mainta…
## 9 48 2 1 0 0.184 0.834 0.944 0.0359 x=Mainta…
## 10 161 1 0 1 0.184 0.834 0.944 0.0359 x=Mainta…
## 11 5 12 2 0 0.833 0.129 1 0.647 x=Nonmai…
## 12 8 10 2 0 0.667 0.204 0.995 0.447 x=Nonmai…
## 13 12 8 1 0 0.583 0.244 0.941 0.362 x=Nonmai…
## 14 16 7 0 1 0.583 0.244 0.941 0.362 x=Nonmai…
## 15 23 6 1 0 0.486 0.305 0.883 0.268 x=Nonmai…
## 16 27 5 1 0 0.389 0.378 0.816 0.185 x=Nonmai…
## 17 30 4 1 0 0.292 0.476 0.741 0.115 x=Nonmai…
## 18 33 3 1 0 0.194 0.627 0.664 0.0569 x=Nonmai…
## 19 43 2 1 0 0.0972 0.945 0.620 0.0153 x=Nonmai…
## 20 45 1 1 0 0 Inf NA NA x=Nonmai…
plot()
will only produce confidence intervals for \(S(t)\) by default if one curve is
plotted.\(\dagger\)
For multiple curves, we must request confidence intervals with
conf.int=T
.
We also specify two colors to make the graph more readable.
# stratified KM curves with 95% CI, 2 colors
plot(KM.x, ylab="survival probability", xlab="months",
conf.int=T, col=c("red", "blue"))
\(\dagger\)
plot()
is a generic function and calls
plot.survfit()
when supplied a survfit
object.
The option conf.int=
is an option of
plot.survfit()
.
survminer
plot.survfit()
uses base R graphics and has limited
options to customize a plot of survival functions.
The survminer
package leverages the graphical power of
the ggplot2
package and adds many of its own features to
create highly customizable plots of survival functions.
Using ggsurvplot()
from survminer
on a
survfit
object produces a plot of the KM estimated survival
functions.
Notice that ggsurvplot()
automatically adds
+
symbols to denote censored observations.
ggsurvplot()
makes adding a risk table very easy.
You can pass most arguments to various functions in
ggplot2
through ggsurvplot()
.
ggsurvplot(KM.x, conf.int=T,
risk.table=T,
palette="Accent", # argument to scale_color_brewer()
size=2, # argument to geom_line()
ggtheme = theme_minimal()) # changing ggtheme
You can also use traditional ggplot2
syntax by
extracting the ggplot
object as
ggsurvplot()$plot
.
g <- ggsurvplot(KM.x, conf.int=T,
risk.table=T)$plot # this is the ggplot object
g + scale_fill_grey() + scale_color_grey()
See our ggplot2
seminar to learn how to use the ggplot2
package.
survdiff()
We often want to test the hypothesis:
\(H_0\): survival curves across 2 or
more groups are equivalent
\(H_A\): survival curves across 2 or
more groups are not equivalent
The log-rank statistic is one popular method to evaluate this hypothesis.
Under the null, the log-rank statistic is \(\chi^2\) distributed with \(g-1\) degrees of freedom.
The function survdiff()
performs the log-rank test by
default.
## Call:
## survdiff(formula = Surv(time, status) ~ x, data = aml)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## x=Maintained 11 7 10.69 1.27 3.4
## x=Nonmaintained 12 11 7.31 1.86 3.4
##
## Chisq= 3.4 on 1 degrees of freedom, p= 0.07
We see some evidence, not strong, that the survival curves may be
different.
survdiff()
allows weights in calculation of the \(\chi^2\) statistic with the
rho=
argument. The weights are defined as \(\hat{S}(t)^\rho\), where \(0 \le \rho \le 1\).
rho=0
equal weights, \(\hat{S}(t)^0=1\), which is the log-rank
test and the defaultrho=1
the weights equal the survival estimate itself,
\(\hat{S}(t)^1=\hat{S}(t)\), equivalent
to the Gehan-Wilcoxon test
rho=
values between 0 and 1 are valid, with values
closer to 1 putting more weight on earlier time points# rho=1 specifies Peto & Peto modification of Gehan-Wilcoxon,
# more weight put on earlier time points
survdiff(Surv(time, status) ~ x, data=aml, rho=1)
## Call:
## survdiff(formula = Surv(time, status) ~ x, data = aml, rho = 1)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## x=Maintained 11 3.85 6.14 0.859 2.78
## x=Nonmaintained 12 7.18 4.88 1.081 2.78
##
## Chisq= 2.8 on 1 degrees of freedom, p= 0.1
The veteran
data set describes survival times for
veterans with lung cancer.
Variables:
time
: survival timestatus
: status, 0=censored, 1=deadtrt
: treatment, 1=standard, 2=testInstead of estimating the survival function \(S(t)\) directly, the Cox proportional hazards model estimates changes to the hazard function, \(h(t)\).
The Cox model can estimate the effects of multiple predictors\(^\dagger\) on the hazard function.
By far the most popular method for survival analysis
\(^\dagger\)We use the words predictor and covariate interchangeably throughout this workshop.
For simplicity, we begin with a Cox model with a single time-constant predictor, \(X_1\):
\[h(t|X_1=x_1) = h_0(t)exp(b_1x_1)\]
For example, imagine that \(X_1\) is a treatment variable, with values \(X_1=1\) for treatment and \(X_1=0\) for control.
The hazard at time \(t\) for treatment:
\[\begin{aligned}h(t|X_1=1) &= h_0(t)exp(b_1*1)\\ &= h_0(t)exp(b_1) \end{aligned}\]
and for control:
\[\begin{aligned}h(t|X_1=0) &= h_0(t)exp(b_1*0)\\ &= h_0(t)exp(0) \\ &= h_0(t) \end{aligned}\]
We can compare the hazards for treatment and control at time \(t\) as a hazard ratio (HR):
\[\begin{aligned}HR&=\frac{h(t|X_1=1)}{h(t|X_1=0)} \\ &= \frac{h_0(t)exp(b_1)}{h_0(t)} \\ &= exp(b_1) \end{aligned}\] Thus, \(exp(b_1)\) is the hazard ratio comparing the hazard for treatment to controls.
In general, \(exp(b_1)\) expresses the hazard ratio for a 1-unit increase in the associated covariate.
\(b_1\) itself is the log-hazard ratio.
The standard Cox model assumes proportional hazards, which means that the effects of covariates are constant over time.
For example, in our simple Cox model for a treatment effect, proportional hazards means that the effect of treatment does not change over time.
Notice that the expression for the hazard ratio for the treatment effect does not contain time, so it will be the same value no matter the time:
\[HR=exp(b_1)\]
Note: With proportional hazards, a subject’s hazard function (which we don’t need to know for the Cox model) can change over time. But the hazard ratio comparing that subject’s hazard to another subject’s hazard cannot change over time, provided their covariate values do not change.
Visually, this is represented by “parallel” survival curves that should not cross:
The parallelism is easier to evaluate if we plot \(-log(-log(S(t)))\), the negative log of the negative log of the survival function. If the hazards are proportional, the vertical distance between the curves is constant across time.
Violation of proportional hazards suggests that a predictor’s effect changes over time.
Hazard and survival functions will often cross:
As will graphs of \(-log(-log(S(t)))\), where we see that the vertical distance between curves changes and reverses direction over time:
Failing to account for non-constant hazard ratios threatens validity of Cox model estimates.
The Cox model can be extended in various ways to accommodate non-proportional hazards.
One reason why the Cox model is so popular is that it does not require specification of the baseline hazard function, \(h_0(t)\), the hazard function for a subject with zero on all covariates.
Essentially, this means we can use the Cox model without assuming a particular form of the hazard function or assuming a distribution of survival times.
We thus then do not need to estimate parameters to characterize a hazard function or survival distribution, but only the regression parameters that quantify the covariate effects. The Cox model is thus called semiparametric.
No constant/intercept in Cox models.
Because \(h_0(t)\) is left unspecified, the Cox model can not directly estimate either the hazard function or the survival function, but is used to estimate covariate effects on the hazard functions.
The Cox is easily extended to accommodate multiple predictors, each of whose effects is assumed to be proportional over time.
\[h(t|X_1, X_2, ...X_p) = h_0(t)exp(b_1X_1 + b_2X_2 + ... + b_pX_p)\] Each coefficient \(b_i\) can be exponentiated to calculate a hazard ratio.
survival
packageWe will be using the lung
data set form the
survival
package for Cox modeling.
The data describe survival of patients with advanced lung cancer.
Variables:
time
: survival time in daysstatus
: 1=censored, 2=dead\(^\dagger\)age
: age in years (assessed at beginning)sex
: 1=male, 2=female\(^\ddagger\)wt.loss
: weight loss (pounds) in last 6 months\(^\dagger\)Remember that the
Surv()
function accepts a status variable with 1=censored
and 2=event
\(^\ddagger\)We would normally recommend that binary variables be coded 0/1 so that the intercept is interpretable; however, there is no intercept in the Cox model, so 1/2 coding is equivalent.
Use coxph(formula,data=)
THe formula resembles a typical R regression formula:
Surv(time, event) ~ x1 + x2…
where x1 + x2…
is a list of one or more
predictor variables separated by +
.
# fit cox model and save results
lung.cox <- coxph(Surv(time, status) ~ age + sex + wt.loss, data=lung)
# summary of results
summary(lung.cox)
## Call:
## coxph(formula = Surv(time, status) ~ age + sex + wt.loss, data = lung)
##
## n= 214, number of events= 152
## (14 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.0200882 1.0202913 0.0096644 2.079 0.0377 *
## sex -0.5210319 0.5939074 0.1743541 -2.988 0.0028 **
## wt.loss 0.0007596 1.0007599 0.0061934 0.123 0.9024
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.0203 0.9801 1.0011 1.0398
## sex 0.5939 1.6838 0.4220 0.8359
## wt.loss 1.0008 0.9992 0.9887 1.0130
##
## Concordance= 0.612 (se = 0.027 )
## Likelihood ratio test= 14.67 on 3 df, p=0.002
## Wald test = 13.98 on 3 df, p=0.003
## Score (logrank) test = 14.24 on 3 df, p=0.003
Notice:
coef
: log hazard ratio coefficients
exp(coef)
: hazard ratios (exponentiated coefficients)
se(coef)
: standard error of log hazard ratiosPr(>|z|)
: p-value for test of log hazard ratio = 0,
or equivalently, hazard ratio = 1lower .95
, upper .95
: 95% confidence
interval for hazard ratio
Concordance
: proportion of pairs that are concordant, a
goodness-of-fit measureLikelihood ratio
, Wald
, and
Score
tests: 3 tests of the null hypothesis that all
coefficients equal zerocoxph()
resultsWe can tidy()
the coxph()
results and store
them in a tibble data.frame.
# save summarized results as data.frame
# exponentiate=T returns hazard ratios
lung.cox.tab <- tidy(lung.cox, exponentiate=T, conf.int=T)
# display table
lung.cox.tab
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age 1.02 0.00966 2.08 0.0377 1.00 1.04
## 2 sex 0.594 0.174 -2.99 0.00280 0.422 0.836
## 3 wt.loss 1.00 0.00619 0.123 0.902 0.989 1.01
Storing the Cox model results as a data.frame makes it easy to use
ggplot2
to create plots of the hazard ratios and confidence
intervals.
# plot of hazard ratios and 95% CIs
ggplot(lung.cox.tab,
aes(y=term, x=estimate, xmin=conf.low, xmax=conf.high)) +
geom_pointrange() + # plots center point (x) and range (xmin, xmax)
geom_vline(xintercept=1, color="red") + # vertical line at HR=1
labs(x="hazard ratio", title="Hazard ratios and 95% CIs") +
theme_classic()
We often want to estimate and compare survival functions for subjects with different sets of covariates.
Because the Cox model does not estimate survival directly, we first use non-parameteric methods (similar to KM-estimator) to estimate the baseline survival function, \(S_0(t)\), the survival function for a subject with zero on all covariates.
\[{S}_0(t)={S}(t|X_1=0,X_2=0,...,X_p=0)\]
After we estimate the baseline survival function, \(\hat{S}_0(t)\), we can then estimate the survival function for a subject with non-zero covariate values using the regression coefficients estimated from the Cox model and this relation:
\[\hat{S}(t|X_1=x_1,X_2=x_2,...,X_p=x_p)=\hat{S}_0(t)^{exp(b_1x_1+b_2x_2+...+b_px_p)}\]
Because \(\hat{S}_0(t)\) is
estimated non-parametrically, survival functions estimated after
coxph()
will again be step functions that change values
only at event times observed in the data.
coxph()
with
survfit()
The survfit()
function, when supplied a
coxph
model object, performs all of the calculations to
predict survival.
If no covariate values are supplied to survfit()
, then a
survival function will be estimated for a subject with mean values on
all model covariates.
We can then use tidy()
to produce a table of the
survival function estimated by survfit()
:
# predict survival function for subject with means on all covariates
surv.at.means <- survfit(lung.cox)
#table of survival function
tidy(surv.at.means)
## # A tibble: 179 × 8
## time n.risk n.event n.censor estimate std.error conf.high conf.low
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 214 1 0 0.996 0.00443 1 0.987
## 2 11 213 2 0 0.987 0.00772 1 0.972
## 3 12 211 1 0 0.982 0.00894 1.00 0.965
## 4 13 210 2 0 0.973 0.0110 0.995 0.953
## 5 15 208 1 0 0.969 0.0120 0.992 0.946
## 6 26 207 1 0 0.964 0.0128 0.989 0.940
## 7 30 206 1 0 0.960 0.0137 0.986 0.935
## 8 31 205 1 0 0.955 0.0145 0.983 0.929
## 9 53 204 2 0 0.946 0.0160 0.976 0.917
## 10 54 202 1 0 0.942 0.0167 0.973 0.912
## # ℹ 169 more rows
We can also use plot()
on the survfit()
object to plot a predicted survival curve.
Predicting survival for a subject at the mean of all covariates may not make sense, particularly if one or more of the covariates are factors (categorical).
Instead, it is recommended to always supply a data.frame of covariate
values at which to predict the survival function to the
newdata=
option of survfit()
:
First we create the new data set.
# create new data for plotting: 1 row for each sex
# and mean age and wt.loss for both rows
plotdata <- data.frame(age=mean(lung$age),
sex=1:2,
wt.loss=mean(lung$wt.loss, na.rm=T))
# look at new data
plotdata
## age sex wt.loss
## 1 62.44737 1 9.831776
## 2 62.44737 2 9.831776
Then we supply the new data to survfit
, along with the
model object created by coxph()
.
We tidy()
the survfit
object to produce a
table of predicted survival functions. Note the column suffixes
.1
and .2
that differentiate the survival,
standard error, and confidence interval estimates between the 2
sexes.
estimate.1
is the survival function
estimates for sex=1
, males, while estimate.2
is for females.# get survival function estimates for each sex
surv.by.sex <- survfit(lung.cox, newdata=plotdata) # one function for each sex
# tidy results
tidy(surv.by.sex)
## # A tibble: 179 × 12
## time n.risk n.event n.censor estimate.1 estimate.2 std.error.1 std.error.2
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 214 1 0 0.995 0.997 0.00546 0.00327
## 2 11 213 2 0 0.984 0.990 0.00953 0.00577
## 3 12 211 1 0 0.978 0.987 0.0111 0.00674
## 4 13 210 2 0 0.967 0.980 0.0137 0.00844
## 5 15 208 1 0 0.962 0.977 0.0148 0.00921
## 6 26 207 1 0 0.956 0.974 0.0159 0.00995
## 7 30 206 1 0 0.951 0.971 0.0170 0.0107
## 8 31 205 1 0 0.945 0.967 0.0180 0.0113
## 9 53 204 2 0 0.934 0.960 0.0199 0.0127
## 10 54 202 1 0 0.929 0.957 0.0208 0.0133
## # ℹ 169 more rows
## # ℹ 4 more variables: conf.high.1 <dbl>, conf.high.2 <dbl>, conf.low.1 <dbl>,
## # conf.low.2 <dbl>
We can also plot()
the survfit
object to
graph the predicted survival functions. We must again request confidence
intervals because we are plotting more than one curve.
# plot survival estimates
plot(surv.by.sex, xlab="days", ylab="survival probability",
conf.int=T, col=c("blue", "red"))
For more control over plots of predicted survival functions, we can
use ggsurvplot()
from survminer
again.
If a dataset was specified as newdata=
in
survfit()
to generate survival function estimates, then
that dataset must be specified as data=
in
ggsurvplot()
as well.
# data= is the same data used in survfit()
# censor=F removes censoring symbols
ggsurvplot(surv.by.sex, data=plotdata, censor=F,
legend.labs=c("male", "female"))
The hazard ratio comparing males to females was \(\hat{HR}_{sex}=.594\), meaning that females have about 60% the hazard rate that males do. Females thus have better overall survival than males, depicted in the graph above.
As with any statistical model, the plausibility of the model assumptions affects our confidence in the results.
Several methods have been developed to assess the proportional hazards assumption of the Cox model. Here we discuss 2 tools developed by Grambsch and Therneau (1994).
A chi-square test based on Schoenfeld residuals is available with
cox.zph()
to test the hypothesis:
\(H_0\): covariate effect is
constant (proportional) over time
\(H_A\): covariate effect changes over
time
The null hypothesis of proportional hazards is tested for each covariate individually and jointly as well.
## chisq df p
## age 0.5077 1 0.48
## sex 2.5489 1 0.11
## wt.loss 0.0144 1 0.90
## GLOBAL 3.0051 3 0.39
No strong evidence of violation of proportional hazards for any covariate, though some suggestion that sex may violate.
Another tool used to assess proportional hazards is a plot of a
smoothed curve over the Schoenfeld residuals.
To create this plot, plot()
the object returned by
cox.zph()
.
These plots depict an estimate of the coefficient (labelled “Beta(t)” in the plot) (y-axis) across time (x-axis). Proportional hazards is indicated by flat line.
Again we see some evidence of non-proportional hazards for sex, as the effect of sex seems to increase with time.
The effect of sex seems to be strongest at the beginning of follow-up, but then trends toward zero as time passes.
Many strategies have been proposed to account for violation of PH. We discuss two here:
If the change in the coefficient is not large enough to be
clinically meaningfully, it can perhaps be ignored as well.
Imagine we are interested in the effect of weight loss
(wt.loss
) on survival, but are also concerned that possible
PH violation by sex in our Cox models may bias estimates of the wt.loss
effect.
We can perform a sensitivity analysis to show how our inferences regarding the wt.loss effect change depending on whether we address the PH violation by sex or not.
In the standard Cox model assuming PH for sex, there was not much evidence that wt.loss was strongly predictive of survival, and we cannot even be confident about the direction of the effect.
## Call:
## coxph(formula = Surv(time, status) ~ age + sex + wt.loss, data = lung)
##
## n= 214, number of events= 152
## (14 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.0200882 1.0202913 0.0096644 2.079 0.0377 *
## sex -0.5210319 0.5939074 0.1743541 -2.988 0.0028 **
## wt.loss 0.0007596 1.0007599 0.0061934 0.123 0.9024
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.0203 0.9801 1.0011 1.0398
## sex 0.5939 1.6838 0.4220 0.8359
## wt.loss 1.0008 0.9992 0.9887 1.0130
##
## Concordance= 0.612 (se = 0.027 )
## Likelihood ratio test= 14.67 on 3 df, p=0.002
## Wald test = 13.98 on 3 df, p=0.003
## Score (logrank) test = 14.24 on 3 df, p=0.003
Stratification provides a general approach to control for the effects of a variable, even if it violates PH.
Drawback: we cannot quantify the effect of the stratification variable on survival (i.e., no coefficient will be estimated).
In the stratified Cox model:
Put the stratification variable inside strata()
within the coxph()
model formula:
lung.strat.sex <- coxph(Surv(time, status) ~ age + wt.loss + strata(sex), data=lung)
summary(lung.strat.sex)
## Call:
## coxph(formula = Surv(time, status) ~ age + wt.loss + strata(sex),
## data = lung)
##
## n= 214, number of events= 152
## (14 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.0192190 1.0194049 0.0096226 1.997 0.0458 *
## wt.loss 0.0001412 1.0001412 0.0062509 0.023 0.9820
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.019 0.9810 1.000 1.039
## wt.loss 1.000 0.9999 0.988 1.012
##
## Concordance= 0.561 (se = 0.027 )
## Likelihood ratio test= 4.09 on 2 df, p=0.1
## Wald test = 3.99 on 2 df, p=0.1
## Score (logrank) test = 4 on 2 df, p=0.1
We see no coefficients for sex, the stratification variable. Although the estimates have changed a bit, our inferences regarding wt.loss (and age) are similar to those from the model where we assume PH for sex.
The Cox model can be extended to allow the effects of a covariate (coefficients) to change over time, by interacting that covariate with some function of time.
However, unlike other regression models, we cannot create this
interaction term by simply multiplying the covariate by the time
variable (unless the data are in a special structure, see
survSplit()
).
Instead, we strongly recommend the use of the time-transform
function,tt()
, to avoid these easily-made mistakes.
tt()
in the model formulatt()
within coxph()
tt = function(x,t,...)
x
and time t
. For example:
x*t
will allow coefficient to change with time
linearlyx*log(t)
allows coefficient to change with log
(magnitudes) of timex*(t>100)
allows coefficient to take on 2 different
values, one value when \(t<=100\)
and another value \(t>100\).
Below we specify an interaction of sex with time itself, so that
the effect of sex is allowed to change linearly with time.
# notice sex and tt(sex) in model formula
lung.sex.by.time <- coxph(Surv(time, status) ~ age + wt.loss + sex + tt(sex), # sex and tt(sex) in formula
data=lung,
tt=function(x,t,...) x*t) # linear change in effect of sex
summary(lung.sex.by.time)
## Call:
## coxph(formula = Surv(time, status) ~ age + wt.loss + sex + tt(sex),
## data = lung, tt = function(x, t, ...) x * t)
##
## n= 214, number of events= 152
## (14 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.0194343 1.0196244 0.0096522 2.013 0.0441 *
## wt.loss 0.0001260 1.0001261 0.0062502 0.020 0.9839
## sex -0.9417444 0.3899470 0.3224791 -2.920 0.0035 **
## tt(sex) 0.0013778 1.0013787 0.0008581 1.606 0.1084
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.0196 0.9808 1.0005 1.0391
## wt.loss 1.0001 0.9999 0.9879 1.0125
## sex 0.3899 2.5645 0.2073 0.7337
## tt(sex) 1.0014 0.9986 0.9997 1.0031
##
## Concordance= 0.613 (se = 0.027 )
## Likelihood ratio test= 17.23 on 4 df, p=0.002
## Wald test = 15.86 on 4 df, p=0.003
## Score (logrank) test = 16.44 on 4 df, p=0.002
The coef
estimated for sex
is the
log-hazard-ratio at day \(t=0\), \(\hat{b}_{sex}=-.942\), corresponding to a
hazard ratio of \(\hat{HR}_{sex}=exp(\hat{b}_{sex})=.39\).
The coef
for tt(sex)
is the change in the
log-hazard-ratio for each additional day that passes.
These estimates match the graph of the smoothed Schoenfeld residuals for sex. At the beginning of follow-up, the coefficient is close to -1, and it increases gradually over time.
Again, our inferences regarding wt.loss are mostly unchanged.
The Cox PH model easily accommodates time-varying covariates, but the data should be structured in start-stop time format:
The survival package provides a function tmerge()
to help get your data in this format. See vignette(timedep)
for guidance on its usage.
The jasa1
dataset, which looks at survival for patients
on a waiting list for heart transplant, is already set up in this
format:
In this data set, transplant
is a time-varying
covariate:
## id start stop event transplant age year surgery
## 1 1 0 49 1 0 -17.155373 0.1232033 0
## 2 2 0 5 1 0 3.835729 0.2546201 0
## 102 3 0 15 1 1 6.297057 0.2655715 0
## 3 4 0 35 0 0 -7.737166 0.4900753 0
## 103 4 35 38 1 1 -7.737166 0.4900753 0
## 4 5 0 17 1 0 -27.214237 0.6078029 0
The data are ready for coxph()
.
Use the Surv(time, time2, event)
specification.
jasa1.cox <- coxph(Surv(start, stop, event) ~ transplant + age + surgery, data=jasa1)
summary(jasa1.cox)
## Call:
## coxph(formula = Surv(start, stop, event) ~ transplant + age +
## surgery, data = jasa1)
##
## n= 170, number of events= 75
##
## coef exp(coef) se(coef) z Pr(>|z|)
## transplant 0.01405 1.01415 0.30822 0.046 0.9636
## age 0.03055 1.03103 0.01389 2.199 0.0279 *
## surgery -0.77326 0.46150 0.35966 -2.150 0.0316 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## transplant 1.0142 0.9860 0.5543 1.8555
## age 1.0310 0.9699 1.0033 1.0595
## surgery 0.4615 2.1668 0.2280 0.9339
##
## Concordance= 0.599 (se = 0.036 )
## Likelihood ratio test= 10.72 on 3 df, p=0.01
## Wald test = 9.68 on 3 df, p=0.02
## Score (logrank) test = 10 on 3 df, p=0.02
We can follow with the same procedures as before, checking PH assumptions and plotting predicted survival curves.
## chisq df p
## transplant 0.118 1 0.73
## age 0.897 1 0.34
## surgery 0.097 1 0.76
## GLOBAL 1.363 3 0.71
# plot predicted survival by transplant group at mean age and surgery=0
plotdata <- data.frame(transplant=0:1, age=-2.48, surgery=0)
surv.by.transplant <- survfit(jasa1.cox, newdata=plotdata)
ggsurvplot(surv.by.transplant, data=plotdata) # remember to supply data to ggsurvplot() for predicted survival after coxph()
Now we will fit a Cox proprotional hazards model to the
veteran
data set.
time
: survival time in weeksstatus
: status, 0=censored, 1=deadtrt
: treatment, 1=standard, 2=testage
: agetrt
to change
over time (even though the original model was not very useful).The coxph()
function has some additional flexibility,
including:
vignette("survival")
)vignette("survival")
)?frailty
for more, but in
general, use the coxme
packagerobust
and cluster
options in
?coxph
weights
argument in
?coxph
The survival
package has the survreg()
function for parametric regression models.
Additional useful survival analysis
packages
coxme
package, frailty random effects models, also
written by TherneauReferences for the survival
package
The survival
package has some of the best vignettes
(tutorials) of any R package.
Key vignettes for getting started (use
vignette(package="survival")
to see a full list):
vignette("survival")
for a general introduction of the
usage and capabilities of the survival
package, as well as
some theoretical survival analysis backgroundvignette("timedep")
for guidance modeling time-varying
covariates and time-varying coefficientsReferences for survival analysis
Grambsch, P. & Therneau, T. (1994), Proportional hazards tests and diagnostics based on weighted residuals. Biometrika, 81, 515-26.
Therneau, T. M. & Grambsch, P. M. (2000). Modeling Survival Data: Extending the Cox Model, Third Edition. New York: Springer.
Kleinbaum, D. G., & Klein, M. (2012). Survival analysis: a self-learning text (Vol. 3). New York: Springer.