Introduction to Survival Analysis in R

UCLA Office of Advanced Research Computing, Statistical Methods and Data Analytics

Purpose

Learn just enough about survival analysis to use the survival package to:

  • estimate survival functions
  • test for differences in survival
  • fit a Cox proportional hazards model

Workshop packages

  • survival: Kaplan-Meier estimator, Cox proportional hazards models and extensions (e.g. frailty)
  • survminer customizable plots of survival functions (requires ggplot2)
  • broom: storing output tables as data frames.
library(survival)
library(survminer) # for customizable graphs of survival function
library(broom) # for tidy output 
library(ggplot2) # for graphing (actually loaded by survminer)

Outline

  1. Review of survival and hazard functions
  2. Data setup
  3. Kaplan-Meier estimator
  4. Cox proportional hazards model
  5. Assessing the proportional hazards assumption
  6. Time-varying covariates

A very quick review of survival analysis (POLL)

What is survival analysis?

Survival analysis models survival time, the length of time to an event.

Other names for survival time are failure time or time to event.

Example events:

  • death from disease
  • divorce
  • malfunctioning of a machine
  • first job

The survival time clock starts when a subject becomes at risk for the event, at which point they enter the risk set.

Survival function

Let \(T\) represent survival time.

The survival function, \(S(t)\) expresses the probability that a subject survives beyond time \(T\).

\[S(t)=Pr(T>t)\]

For the survival curve above,

\(S(100)=.577\), the probability that a subject survives beyond 100 days is 0.577.

\(S(200)=.122\), the probability that a subject survives beyond 200 days is 0.122.


Typically, we also assume:

\(S(0)=1\), all subjects survive the very first moment

\(S(\infty)=0\), all subjects fail after infinite time

Median survival time

Median survival time: time \(t\) at which 50% of population expected to be still surviving:

Hazard function

Some methods, e.g. Cox model, focus on the hazard function, \(h(t)\), which is inversely related to the \(S(t)\).

\(h(t)\), is the rate of events, or hazard at time \(t\), given that the subject has survived until time \(t\).

As the hazard increases, more events are expected per unit time, survival will decrease.

Below we see three examples of hazard functions.

In the green hazard function above, \(h(200)=\frac{\text{.0204 events}}{day}\).

Cumulative hazard

The cumulative hazard function, \(H(t)\), expresses how much hazard has accumulated over time up to time \(t\).

\[H(t)=\int_0^t h(u)du\]

The probability that a subject fails increases as the hazard accumulates.

Because the hazard function \(h(t)\) is never negative, the cumulative hazard \(H(t)\) never decreases with time.

Relationship between the hazard and survival functions

The survival function is inversely related to the cumulative hazard function:

\[S(t)=exp(-H(t))\]

Therefore, by modeling either the survival function or the hazard function, we can infer the other.

Censored survival times

Right Censoring

Often the exact time the event occurs is unknown or censored.

Right censoring means that a subject’s actual survival time is greater than their observed time

  • study ends before event occurs
  • subject is lost to follow-up

Left and Interval Censoring

Left-censoring means that a subject’s actual survival time is less than their observed time.

  • positive tests for the infection (the event) may be considerably delayed

Interval censoring means that a subject’s survival time is only known to lie between 2 time points.

We will only discuss methods that handle right-censoring in this workshop.

Many standard methods, such as linear regression, are not equipped to deal with censored outcomes.

*Image adapted from Kleinbaum and Klein, Survival Analysis: A Self-Learning Text, Third Edition, Springer, 2012.

Assumption of non-informative censoring

Most survival analysis methods assume non-informative censoring: a subject’s censoring time should not be related to the unobserved survival time

Below, we see 2 graphs diagrams where subjects have the same true survival times (red squares), but censoring times (black circles) are informative on the left and non-informative on the right.


Bias from informative censoring

Failing to account for informative censoring may result in biased estimates of survival.

Below are estimated survival functions from the data diagrammed on the previous slide with either noninformative (purple) or informative censoring (green).

Examples of possible informative censoring and resulting bias if not addressed:

  • oldest, most vulnerable, subjects drop out of study of time to death after surgery, biasing survival estimates upward
  • Travel-loving subjects, who delay marriage to travel, drop out of study of time to first marriage, biasing survival estimates downward

Data set up for survival analysis

Data for survival analysis

A data record for subject in a survival analysis typically includes:

  • survival time variable measuring time to event (or censoring time, time of last observation)
  • a status variable coding whether the subject experienced the event or not (censored)
  • covariates

The aml dataset

We’ll start with the aml dataset in the survival package.

These data come from a study looking at time to death for patients with acute myelogenous leukemia, comparing “maintained” chemotherapy treatment to “nonmaintained”.

Variables:

  • time  survival or censoring time
  • status  0=censored, 1=death
  • x  chemotherapy “maintained” or “nonmaintained”
time status x
9 1 Maintained
13 1 Maintained
13 0 Maintained
18 1 Maintained
23 1 Maintained
28 0 Maintained
31 1 Maintained
34 1 Maintained
45 0 Maintained
48 1 Maintained
161 0 Maintained
5 1 Nonmaintained
5 1 Nonmaintained
8 1 Nonmaintained
8 1 Nonmaintained
12 1 Nonmaintained
16 0 Nonmaintained
23 1 Nonmaintained
27 1 Nonmaintained
30 1 Nonmaintained
33 1 Nonmaintained
43 1 Nonmaintained
45 1 Nonmaintained

The Surv() function for survival outcomes

Use Surv() to specify the survival outcome variables.

Surv(time, status)

  • time  survival/censoring time variable
  • status  status variable. To code for censored/event use:
    • 0/1
    • 1/2
    • FALSE/TRUE


Censoring is assumed to be right-censored unless otherwise specified with the type argument.

Surv() specification for start-stop format

Some survival analyses require time to be recorded in 2 variables that mark the beginning and end of time intervals. We need this format to model:

  • time-varying covariates
  • interval censoring
  • recurrent events data


In this format, some or all subjects may have multiple rows of data.

This format is sometimes called start-stop format.

The jasa1 data set has this setup, where start and stop are the time variables, and event is the status variable:

head(jasa1)
    id start stop event transplant        age      year surgery
1    1     0   49     1          0 -17.155373 0.1232033       0
2    2     0    5     1          0   3.835729 0.2546201       0
102  3     0   15     1          1   6.297057 0.2655715       0
3    4     0   35     0          0  -7.737166 0.4900753       0
103  4    35   38     1          1  -7.737166 0.4900753       0
4    5     0   17     1          0 -27.214237 0.6078029       0

To specify the outcome for data in stop-start format, use:

Surv(time, time2, event)

  • time and time2  beginning and end of time intervals
  • event  is the status at the end of the interval.

Kaplan-Meier estimator of the survival function

Formula for Kaplan-Meier estimator

The Kaplan-Meier (KM) estimator is a non-parametric method to estimate the survival function, \(S(t)\).

\[\hat{S}(t)=\prod_{t_i<t}\left(1-\frac{events_i}{num.at.risk_i}\right)\]

\(\left(1-\frac{events_i}{num.at.risk_i}\right)\) is the proportion of those still at risk that survive time point \(t_i\).

\(\hat{S}(t)\) is the product of the proportion that survive each time point \(t_i\) up to the current time point \(t\).

\(\hat{S}(t)\) does not change if someone drops due to censoring, although the number at risk will drop.

Kaplan-Meier estimation with survfit()

survfit() estimates \(S(t)\) via the KM method.

The first argument to survfit() is a formula:

  • Surv() with outcome variables on the left side of ~
  • either a 1 (whole sample) or a grouping variable (stratified) on the right.
# ~ 1 requests KM survival function for whole sample
KM <- survfit(Surv(time, status) ~ 1, data=aml)


Printing the survfit object gives a summary:

  • n: total number at risk
  • events: total number of events that occurred
  • median: median survival time (when 50% of those at risk are expected to still be alive)
  • 0.95LCL, 0.95UCL: 95% confidence limits for median survival
# printing KM object gives a summary
KM
Call: survfit(formula = Surv(time, status) ~ 1, data = aml)

      n events median 0.95LCL 0.95UCL
[1,] 23     18     27      18      45

Table of KM survival function

Using summary() on the survfit object produces a table of the KM estimates of the survival function \(S(t)\) at each failure time.

On each row, at least one subject fails.

  • time  event times in the data set
  • n.risk, n.event, n.censor  number at risk, number of events, number censored
  • estimate  \(\hat{S}(t)\) survival estimate
  • std.error, conf.high, conf.low  standard error and 95% confidence interval (CI) for \(S(t)\)
# save KM survival function as tibble (modern data.frame)
summary(KM)
Call: survfit(formula = Surv(time, status) ~ 1, data = aml)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    5     23       2   0.9130  0.0588       0.8049        1.000
    8     21       2   0.8261  0.0790       0.6848        0.996
    9     19       1   0.7826  0.0860       0.6310        0.971
   12     18       1   0.7391  0.0916       0.5798        0.942
   13     17       1   0.6957  0.0959       0.5309        0.912
   18     14       1   0.6460  0.1011       0.4753        0.878
   23     13       2   0.5466  0.1073       0.3721        0.803
   27     11       1   0.4969  0.1084       0.3240        0.762
   30      9       1   0.4417  0.1095       0.2717        0.718
   31      8       1   0.3865  0.1089       0.2225        0.671
   33      7       1   0.3313  0.1064       0.1765        0.622
   34      6       1   0.2761  0.1020       0.1338        0.569
   43      5       1   0.2208  0.0954       0.0947        0.515
   45      4       1   0.1656  0.0860       0.0598        0.458
   48      2       1   0.0828  0.0727       0.0148        0.462

Let’s confirm the first 2 survival estimates:

  • \(\hat{S}(5) = \left(1-\frac{2}{23}\right) = .913\)
  • \(\hat{S}(8) = \left(1-\frac{2}{23}\right)\left(1-\frac{2}{21}\right) = .826\)

Above, we can see that censored observations do not affect \(\hat{S}(t)\) when they leave the risk set:

  • At \(t=13\), 1 person of 17 at risk fails.
  • At the next event time, \(t=18\), we see that there are 14 people at risk, instead of 16.
  • Between \(t=13\) and \(t=18\) 2 people left the risk set due to censoring, but \(\hat{S}(t)\) does not decrease when they leave

Using tidy() from the broom package on the survfit object adds times where censored observations leave the risk set to the KM table.

# show KM table with censoring times
tidy(KM)
# A tibble: 18 × 8
    time n.risk n.event n.censor estimate std.error conf.high conf.low
   <dbl>  <dbl>   <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl>
 1     5     23       2        0   0.913     0.0643     1       0.805 
 2     8     21       2        0   0.826     0.0957     0.996   0.685 
 3     9     19       1        0   0.783     0.110      0.971   0.631 
 4    12     18       1        0   0.739     0.124      0.942   0.580 
 5    13     17       1        1   0.696     0.138      0.912   0.531 
 6    16     15       0        1   0.696     0.138      0.912   0.531 
 7    18     14       1        0   0.646     0.157      0.878   0.475 
 8    23     13       2        0   0.547     0.196      0.803   0.372 
 9    27     11       1        0   0.497     0.218      0.762   0.324 
10    28     10       0        1   0.497     0.218      0.762   0.324 
11    30      9       1        0   0.442     0.248      0.718   0.272 
12    31      8       1        0   0.386     0.282      0.671   0.223 
13    33      7       1        0   0.331     0.321      0.622   0.177 
14    34      6       1        0   0.276     0.369      0.569   0.134 
15    43      5       1        0   0.221     0.432      0.515   0.0947
16    45      4       1        1   0.166     0.519      0.458   0.0598
17    48      2       1        0   0.0828    0.877      0.462   0.0148
18   161      1       0        1   0.0828    0.877      0.462   0.0148

For example, above we see that a subject with a censored survival time left the risk set at \(t=16\), but \(\hat{S}(t)\) does not decrease.

Graphing the survival function

Use plot to graph the survfit KM survival function and its 95% CI.

The KM \(\hat{S}(t)\) is a step function that changes only when an event occurs.

The true underlying survival curve \(S(t)\) is likely smooth, but smoothness of the KM curve is limited by the number of event times in the data.

plot(KM, ylab="survival probability", xlab="months")

Comparing survival curves

Stratified Kaplan-Meier estimates

To estimate separate KM survival functions for different groups, specify one or more grouping variables after the ~ in survfit()

# stratify by x variable
KM.x <- survfit(Surv(time, status) ~ x, data=aml)

# median survival by strata
KM.x
Call: survfit(formula = Surv(time, status) ~ x, data = aml)

                 n events median 0.95LCL 0.95UCL
x=Maintained    11      7     31      18      NA
x=Nonmaintained 12     11     23       8      NA


We now get separate tables of KM estimates when using summary():

# KM estimated survival functions by strata
summary(KM.x)
Call: survfit(formula = Surv(time, status) ~ x, data = aml)

                x=Maintained 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    9     11       1    0.909  0.0867       0.7541        1.000
   13     10       1    0.818  0.1163       0.6192        1.000
   18      8       1    0.716  0.1397       0.4884        1.000
   23      7       1    0.614  0.1526       0.3769        0.999
   31      5       1    0.491  0.1642       0.2549        0.946
   34      4       1    0.368  0.1627       0.1549        0.875
   48      2       1    0.184  0.1535       0.0359        0.944

                x=Nonmaintained 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    5     12       2   0.8333  0.1076       0.6470        1.000
    8     10       2   0.6667  0.1361       0.4468        0.995
   12      8       1   0.5833  0.1423       0.3616        0.941
   23      6       1   0.4861  0.1481       0.2675        0.883
   27      5       1   0.3889  0.1470       0.1854        0.816
   30      4       1   0.2917  0.1387       0.1148        0.741
   33      3       1   0.1944  0.1219       0.0569        0.664
   43      2       1   0.0972  0.0919       0.0153        0.620
   45      1       1   0.0000     NaN           NA           NA

Graphing stratified KM estimates of survival

plot() produces separate KM survival curves when supplied a stratified survfit object.

However, confidence intervals are no longer produced by default so we request them with conf.int=TRUE.\(\dagger\)

We also use col to specify two colors.

# stratified KM curves with 95% CI, 2 colors
plot(KM.x, ylab="survival probability", xlab="months",
     conf.int=TRUE, col=c("red", "blue")) 

\(\dagger\) plot() is a generic base R function and calls plot.survfit() when supplied a survfit object. The option conf.int= is an option of plot.survfit().

Customized survival plots with survminer

plot.survfit() has limited options to customize a plot of survival functions.

ggsurvplot() from the survminer package leverages the graphical power of the ggplot2 package to allow extensive customization of survival function plots.

ggsurvplot() by default adds + symbols to denote censored observations.

ggsurvplot(KM.x, conf.int=T)

ggsurvplot() makes adding a risk table (number at risk at selected timepoints) very easy.

ggsurvplot(KM.x, conf.int=T,
           risk.table=T)

You can pass most arguments to various functions in ggplot2 through ggsurvplot().

ggsurvplot(KM.x, conf.int=T, 
           risk.table=T,
           palette="Accent", # argument to scale_color_brewer()
           size=2, # argument to geom_line()
           ggtheme = theme_minimal()) # changing ggtheme

You can also use traditional ggplot2 syntax by extracting the ggplot object as ggsurvplot()$plot.

g <- ggsurvplot(KM.x, conf.int=T,
           risk.table=T)$plot  # this is the ggplot object
g + scale_fill_grey() + scale_color_grey()

See our ggplot2 seminar to learn how to use the ggplot2 package.

Testing equivalence of survival functions with survdiff()

A common hypothesis of interest:

\(H_0\): survival curves across 2 or more groups are equivalent
\(H_A\): survival curves across 2 or more groups are not equivalent

A popular method to evaluate this hypothesis is the log-rank statistic, which is \(\chi^2\) distributed with \(g-1\) degrees of freedom under the null (\(g\) is the number of groups).

The function survdiff() performs the log-rank test.

# log rank test, default is rho=0
survdiff(Surv(time, status) ~ x, data=aml)
Call:
survdiff(formula = Surv(time, status) ~ x, data = aml)

                 N Observed Expected (O-E)^2/E (O-E)^2/V
x=Maintained    11        7    10.69      1.27       3.4
x=Nonmaintained 12       11     7.31      1.86       3.4

 Chisq= 3.4  on 1 degrees of freedom, p= 0.07 

We see weak evidence that the survival curves may be different.

Weighted log-rank tests

For some problems, early differences in survival curves might be more important than later differences, such as for death after surgery.

Early differences can be emphasized by applying weights that decrease as time advances.

One popular weighting method is to weight the survival times by the survival probabilities \(\hat{S}(t)\) themselves, which decrease with time.

survdiff() allows weights in calculation of the \(\chi^2\) statistic with the rho= argument. The weights are defined as \(\hat{S}(t)^\rho\), where \(0 \le \rho \le 1\).

  • rho=1  weights are the survival estimates, \(\hat{S}(t)^1=\hat{S}(t)\), equivalent to the Gehan-Wilcoxon test
  • rho=0  equal weights, \(\hat{S}(t)^0=1\), the traditional log-rank test and the default
  • 0<rho<1  values closer to 1 weight earlier time points more
# rho=1 specifies Peto & Peto modification of Gehan-Wilcoxon,
#   more weight put on earlier time points
survdiff(Surv(time, status) ~ x, data=aml, rho=1)
Call:
survdiff(formula = Surv(time, status) ~ x, data = aml, rho = 1)

                 N Observed Expected (O-E)^2/E (O-E)^2/V
x=Maintained    11     3.85     6.14     0.859      2.78
x=Nonmaintained 12     7.18     4.88     1.081      2.78

 Chisq= 2.8  on 1 degrees of freedom, p= 0.1 

Exercise 1

The veteran data set describes survival times for veterans with lung cancer.

Variables:

  • time: survival time
  • status: status, 0=censored, 1=dead
  • trt: treatment, 1=standard, 2=test


  1. Create a graph and table of the Kaplan-Meier estimated survival function for the entire data set. What is the median survival time?
  2. Create a graph of KM estimated survival functions stratified by treatment, adding 95% confidence intervals and coloring the 2 functions. Does survival appear different for the 2 treatment groups?
  3. Use the log-rank test to provide more evidence for your assessment of survival of the 2 groups.

The Cox proportional hazards model

Why the Cox model is so widely used

The Cox model:

  • is by far the most popular survival analysis method
  • estimates changes to the hazard function, \(h(t)\), rather than the survival function \(S(t)\)
  • accommodates multiple predictors (covariates)
  • does not assume any distribution for survival times
  • naturally accommodates right-censoring and time-varying covariates
  • can be extended in many ways:
    • time-varying coefficients
    • random effect frailties for recurrent events or clustering
    • competing risks modeling

The Cox proportional hazards model

For simplicity, we begin with a Cox model with a single time-constant predictor, \(X_1\):

\[h(t|X_1=x_1) = h_0(t)exp(b_1x_1)\]

  • \(h(t|X_1=x_1)\): the hazard at time \(t\) for a subject with predictor \(X_1\) equal to the value \(x_1\)
  • \(h_0(t)\): the baseline hazard at time \(t\), the hazard for a subject with all predictors equal to zero
  • \(exp(b_1x_1)\): the hazard ratio comparing the hazard for a subject with \(X_1=x_1\) to a subject with \(X_1=0\)

Hazard ratio

For example, imagine \(X_1=1\) for treatment and \(X_1=0\) for control.

The hazard at time \(t\) for treatment:

\[\begin{aligned}h(t|X_1=1) &= h_0(t)exp(b_1*1)\\ &= h_0(t)exp(b_1) \end{aligned}\]

and for control:

\[\begin{aligned}h(t|X_1=0) &= h_0(t)exp(b_1*0)\\ &= h_0(t)exp(0) \\ &= h_0(t) \end{aligned}\]

The hazard ratio (HR), \(exp(b_1)\), compares the hazards for treatment and control:

\[\begin{aligned}HR&=\frac{h(t|X_1=1)}{h(t|X_1=0)} \\ &= \frac{h_0(t)exp(b_1)}{h_0(t)} \\ &= exp(b_1) \end{aligned}\]

  • \(HR=.25\) means treatment has \(\frac{1}{4}\) (25%) the hazard of control, or a 75% decrease. Treatment would lead to better survival with a lower hazard.
  • \(HR=2\) here means that treatment has twice the hazard of control, or a 100% increase, and thus worse survival.

In general, \(exp(b_1)\) expresses the hazard ratio for a 1-unit increase in the predictor.

\(b_1\) itself is the log-hazard ratio.

Proportional hazards

The standard Cox model assumes proportional hazards, which means that covariate effects (expressed as hazard ratios) are constant over time.

Notice that the expression for the hazard ratio for the treatment effect does not contain time, so is constant:

\[HR=exp(b_1)\]

Note: Under proportional hazards, a subject’s hazard can change over time, but the hazard ratio comparing 2 subjects with different covariate values cannot.

Visually, this is represented by “parallel” survival curves that should not cross:

Proportional hazards is easier to evaluate if we plot \(-log(-log(S(t)))\), sometimes called a log minus log survival plot. Parallel curves suggest proportional hazards holds.

Non-proportional hazards means that a predictor’s effect changes over time.

Hazard and survival functions will often cross:

As will graphs of \(-log(-log(S(t)))\):

Failing to account for non-constant hazard ratios threatens validity of Cox model estimates.

The Cox model can be extended in various ways to accommodate non-proportional hazards.

Baseline hazard function \(h_0(t)\)

The Cox model is popular because it does not require specification of the baseline hazard function, \(h_0(t)\), the hazard function for a subject with zero on all covariates.

Essentially, this means we can use the Cox model without knowing the shape of the underlying hazard functions.

Because no parameters related to a distribution (e.g. mean) are estimated, the Cox model is semiparametric.

Because \(h_0(t)\) is left unspecified, the Cox model only estimates covariate effects on the hazard function, not the hazard function itself.

No constant/intercept in Cox models, which prevents estimation of the hazard function (without additional methods).

Cox model with multiple predictors

The Cox is easily extended to accommodate multiple predictors, each of whose effects is assumed to be proportional over time.

\[h(t|X_1, X_2, ...X_p) = h_0(t)exp(b_1X_1 + b_2X_2 + ... + b_pX_p)\] Each coefficient \(b_i\) can be exponentiated to calculate a hazard ratio.

Fitting the Cox model with the survival package

New data set for Cox modeling

We will use the lung data set from the survival package for Cox modeling.

The data describe survival of patients with advanced lung cancer.

Variables:

  • time: survival time in days
  • status: 1=censored, 2=dead
  • age: age in years (assessed at beginning)
  • sex: 1=male, 2=female\(^\dagger\)
  • wt.loss: weight loss (pounds) in last 6 months

\(^\dagger\)Normally binary variables should be coded 0/1 so that the intercept is interpretable. However, with no intercept in the Cox model 1/2 coding is equivalent.

Fitting a Cox model

Use coxph(formula,data=)

The formula is a typical R regression formula:

Surv(time, event) ~ x1 + x2…

where x1 + x2… is a list of one or more predictor variables separated by +.

# fit cox model and save results
lung.cox <- coxph(Surv(time, status) ~ age + sex + wt.loss, data=lung)
# summary of results
summary(lung.cox)
Call:
coxph(formula = Surv(time, status) ~ age + sex + wt.loss, data = lung)

  n= 214, number of events= 152 
   (14 observations deleted due to missingness)

              coef  exp(coef)   se(coef)      z Pr(>|z|)   
age      0.0200882  1.0202913  0.0096644  2.079   0.0377 * 
sex     -0.5210319  0.5939074  0.1743541 -2.988   0.0028 **
wt.loss  0.0007596  1.0007599  0.0061934  0.123   0.9024   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age        1.0203     0.9801    1.0011    1.0398
sex        0.5939     1.6838    0.4220    0.8359
wt.loss    1.0008     0.9992    0.9887    1.0130

Concordance= 0.612  (se = 0.027 )
Likelihood ratio test= 14.67  on 3 df,   p=0.002
Wald test            = 13.98  on 3 df,   p=0.003
Score (logrank) test = 14.24  on 3 df,   p=0.003

Output interpretation:

  • sample size=214, 152 events, 14 observations dropped due to missingness
  • coef: log hazard ratio coefficients
    • a positive coefficient means the hazard increases (and survival decreases) as the predictor increases
  • exp(coef): hazard ratios (exponentiated coefficients)
    • \(\widehat{HR}_{age}=1.0203\), for each additional year of age at baseline, the hazard increases by 2.03%, or by a factor of 1.0203
    • \(\widehat{HR}_{sex}=.5939\), females have 60% the hazard of males, or a 40% decrease
    • \(\widehat{HR}_{wt.loss}=1.008\), for each additional pound of weight loss, the hazard increases by 0.8%
  • se(coef): standard error of log hazard ratios
  • Pr(>|z|): p-value for test of log hazard ratio = 0, or equivalently, hazard ratio = 1
  • lower .95, upper .95: 95% confidence interval for hazard ratio
  • Concordance: proportion of pairs that are concordant, a goodness-of-fit measure (values near 1 indicate better fit)
  • Likelihood ratio, Wald, and Score tests: 3 tests of the null hypothesis that all coefficients equal zero

Plotting coxph() coefficients and CIs using tidy()

We can tidy() the coxph() results and store them in a data frame.

# save summarized results as data.frame
#  exponentiate=T returns hazard ratios
lung.cox.tab <- tidy(lung.cox, exponentiate=T, conf.int=T) 

# display table   
lung.cox.tab
# A tibble: 3 × 7
  term    estimate std.error statistic p.value conf.low conf.high
  <chr>      <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 age        1.02    0.00966     2.08  0.0377     1.00      1.04 
2 sex        0.594   0.174      -2.99  0.00280    0.422     0.836
3 wt.loss    1.00    0.00619     0.123 0.902      0.989     1.01 

We then pass the data frame of model results to ggplot2 to plot the hazard ratios and their confidence intervals.

# plot of hazard ratios and 95% CIs
ggplot(lung.cox.tab, 
       aes(y=term, x=estimate, xmin=conf.low, xmax=conf.high)) + 
  geom_pointrange() +  # plots center point (x) and range (xmin, xmax)
  geom_vline(xintercept=1, color="red") + # vertical line at HR=1
  labs(x="hazard ratio", title="Hazard ratios and 95% CIs") +
  theme_classic()

Predicting survival with Cox estimates

We often want to estimate the survival functions for a subject with a particular set of covariate values.

The semiparametric Cox model does not estimate the survival function (or hazard function) directly.

However, we can use non-parameteric methods (similar to KM) to estimate the baseline survival function, \(S_0(t)\), the survival function for a subject with zero on all covariates.

\[{S}_0(t)={S}(t|X_1=0,X_2=0,...,X_p=0)\]

The Cox model coefficients then allow estimation of the survival function for a subject with any set of covariate values:

\[\hat{S}(t|X_1=x_1,X_2=x_2,...,X_p=x_p)=\hat{S}_0(t)^{exp(b_1x_1+b_2x_2+...+b_px_p)}\]

Because \(\hat{S}_0(t)\) is estimated non-parametrically, survival functions will again be step functions that change values only at observed event times.

Predicting survival after coxph() with survfit()

survfit(), when supplied a coxph model object, predicts survival functions.

If no covariate values are supplied to survfit(), the survival function is estimated for a subject with covariates at their means.

tidy() (or summary()) produces a table of the survival function estimated by survfit():

# predict survival function for subject with means on all covariates
surv.at.means <- survfit(lung.cox) 

#table of survival function
tidy(surv.at.means)
# A tibble: 179 × 8
    time n.risk n.event n.censor estimate std.error conf.high conf.low
   <dbl>  <dbl>   <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl>
 1     5    214       1        0    0.996   0.00443     1        0.987
 2    11    213       2        0    0.987   0.00772     1        0.972
 3    12    211       1        0    0.982   0.00894     1.00     0.965
 4    13    210       2        0    0.973   0.0110      0.995    0.953
 5    15    208       1        0    0.969   0.0120      0.992    0.946
 6    26    207       1        0    0.964   0.0128      0.989    0.940
 7    30    206       1        0    0.960   0.0137      0.986    0.935
 8    31    205       1        0    0.955   0.0145      0.983    0.929
 9    53    204       2        0    0.946   0.0160      0.976    0.917
10    54    202       1        0    0.942   0.0167      0.973    0.912
# ℹ 169 more rows

Plotting survival curves

plot() plots the predicted survival curve and CI estimated by survfit().

# plot of predicted survival for subject at means of covariates
plot(surv.at.means, xlab="days", ylab="survival probability")

Predicting survival at specific covariate values

Specify a data frame of covariate values in the newdata= argument of survfit() to predict a survival function for each row of the data frame:

First we create the new data frame of 2 rows of covariate values.

# create new data for plotting: 1 row for each sex
#  and mean age and wt.loss for both rows
plotdata <- data.frame(age=mean(lung$age),
                       sex=1:2,
                       wt.loss=mean(lung$wt.loss, na.rm=TRUE))

# look at new data
plotdata
       age sex  wt.loss
1 62.44737   1 9.831776
2 62.44737   2 9.831776

Then we supply the new data to survfit to predict separate survival functions for males and females, both at mean age and mean weight loss.

In the table of predicted survival functions produced by tidy(), the column suffixes .1 and .2 differentiate estimates between the 2 sexes.

  • e.g., estimate.1 is the survival function estimates for sex=1, males, while estimate.2 is for females.
# get survival function estimates for each sex
surv.by.sex <- survfit(lung.cox, newdata=plotdata) # one function for each sex

# columns with .1 are for males, .1 for females
tidy(surv.by.sex)
# A tibble: 179 × 12
    time n.risk n.event n.censor estimate.1 estimate.2 std.error.1 std.error.2
   <dbl>  <dbl>   <dbl>    <dbl>      <dbl>      <dbl>       <dbl>       <dbl>
 1     5    214       1        0      0.995      0.997     0.00546     0.00327
 2    11    213       2        0      0.984      0.990     0.00953     0.00577
 3    12    211       1        0      0.978      0.987     0.0111      0.00674
 4    13    210       2        0      0.967      0.980     0.0137      0.00844
 5    15    208       1        0      0.962      0.977     0.0148      0.00921
 6    26    207       1        0      0.956      0.974     0.0159      0.00995
 7    30    206       1        0      0.951      0.971     0.0170      0.0107 
 8    31    205       1        0      0.945      0.967     0.0180      0.0113 
 9    53    204       2        0      0.934      0.960     0.0199      0.0127 
10    54    202       1        0      0.929      0.957     0.0208      0.0133 
# ℹ 169 more rows
# ℹ 4 more variables: conf.high.1 <dbl>, conf.high.2 <dbl>, conf.low.1 <dbl>,
#   conf.low.2 <dbl>

Note: The CIs for the survival functions are estimated but not printed due to space.

Plotting multiple predicted survival functions

Remember when using plot() to graph multiple survival functions, CIs must be specifically requested with conf.int=T.

# plot survival estimates
plot(surv.by.sex, xlab="days", ylab="survival probability",
     conf.int=TRUE, col=c("blue", "red"))

Use ggsurvplot() from survminer for more customized plots.

Note: if a dataset was specified as newdata= in survfit() to generate survival function estimates, that dataset must be specified as data= in ggsurvplot() as well.

# data= is the same data used in survfit()
#  censor=F removes censoring symbols
ggsurvplot(surv.by.sex, data=plotdata, censor=F,
           legend.labs=c("male", "female")) 

We see that females have overall better survival, reflected by \(\widehat{HR}_{sex}=.594\).

Assessing the proportional hazards (PH) assumption

The plausibility of a model’s assumptions affects our confidence in the estimates.

Several methods have been developed to assess the PH assumption of the Cox model (including the log minus log plots discussed earlier).

Here we discuss 2 tools developed by Grambsch and Therneau (1994) based on Schoenfeld residuals1 that are available in the survival package.

cox.zph() performs a \(\chi^2\) test based on Schoenfeld residuals to test the hypothesis:

\(H_0\): covariate effect is constant (proportional) over time
\(H_A\): covariate effect changes over time

The null hypothesis of PH is tested for each covariate individually and jointly as well.

cox.zph(lung.cox)
         chisq df    p
age     0.5077  1 0.48
sex     2.5489  1 0.11
wt.loss 0.0144  1 0.90
GLOBAL  3.0051  3 0.39

No strong evidence of violation of proportional hazards for any covariate, but some weak evidence for sex perhaps.


Another tool used to assess PH is a plot of a smoothed curve over the Schoenfeld residuals.

To create this plot, plot() the object returned by cox.zph().

These plots depict an estimate of the coefficient (labelled “Beta(t)” in the plot) (y-axis) across time (x-axis). PH is indicated by flat smoothed curve

plot(cox.zph(lung.cox))

Again we see some evidence of non-proportional hazards for sex, as the effect of sex seems to increase with time.

The effect of sex seems to be strongest at the beginning of follow-up, but then trends toward zero as time passes.

1Schoenfeld residuals are defined for each covariate and for each subject at the subject’s event time. The residual measures the difference between that subject’s covariate values and the average covariate values of those still at risk. For example, a positive residual for age means that a subject who failed is older than those still risk. If the residuals for age increase on average as time passes, that means that older patients are failing at higher rates than expected by the Cox model as time passes, suggesting that the effect of age is increasing with time.

Dealing with proportional hazards violations

Several strategies have been proposed to account for violation of PH. We discuss two here:

  • stratify by the non-PH variable
  • add an interaction of the non-PH variable with time to the model


If the change in the coefficient is not large enough to be clinically meaningfully, it can perhaps be ignored as well.

Imagine we are concerned that possible PH violation by sex in our Cox models may bias all of the model estimates.

We can address this concern with a sensitivity analysis to show how model inferences change depending on whether we ignore or address the PH violation by sex.

In the Cox model assuming PH for sex, we find evidence that the hazard increases with age but the effect of wt.loss is practically null or undeterminable.

# reprinting original model results
summary(lung.cox)
Call:
coxph(formula = Surv(time, status) ~ age + sex + wt.loss, data = lung)

  n= 214, number of events= 152 
   (14 observations deleted due to missingness)

              coef  exp(coef)   se(coef)      z Pr(>|z|)   
age      0.0200882  1.0202913  0.0096644  2.079   0.0377 * 
sex     -0.5210319  0.5939074  0.1743541 -2.988   0.0028 **
wt.loss  0.0007596  1.0007599  0.0061934  0.123   0.9024   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age        1.0203     0.9801    1.0011    1.0398
sex        0.5939     1.6838    0.4220    0.8359
wt.loss    1.0008     0.9992    0.9887    1.0130

Concordance= 0.612  (se = 0.027 )
Likelihood ratio test= 14.67  on 3 df,   p=0.002
Wald test            = 13.98  on 3 df,   p=0.003
Score (logrank) test = 14.24  on 3 df,   p=0.003

Stratified Cox model

One approach to addressing a PH-violating variable is to estimate the model stratified by that variable.

Drawback: no coefficient will be estimated for the stratification variable.


In the stratified Cox model:

  • the model is estimated separately by stratum
  • the baseline hazard function, \(h_0(t)\), is allowed to be different across strata
    • this can accommodate the non-proportional effects of the stratification variable
  • parameter estimates are averaged across strata to generate one final set of estimates


Put the stratification variable inside strata() within the coxph() model formula:

lung.strat.sex <- coxph(Surv(time, status) ~ age + wt.loss + strata(sex), data=lung)
summary(lung.strat.sex)
Call:
coxph(formula = Surv(time, status) ~ age + wt.loss + strata(sex), 
    data = lung)

  n= 214, number of events= 152 
   (14 observations deleted due to missingness)

             coef exp(coef)  se(coef)     z Pr(>|z|)  
age     0.0192190 1.0194049 0.0096226 1.997   0.0458 *
wt.loss 0.0001412 1.0001412 0.0062509 0.023   0.9820  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age         1.019     0.9810     1.000     1.039
wt.loss     1.000     0.9999     0.988     1.012

Concordance= 0.561  (se = 0.027 )
Likelihood ratio test= 4.09  on 2 df,   p=0.1
Wald test            = 3.99  on 2 df,   p=0.1
Score (logrank) test = 4  on 2 df,   p=0.1

The coefficients for age and wt.loss, averaged across males and females, are similar to those from the model where we assumed PH for sex, so our overall inferences are unchanged.

However, if the effect of sex is of primary interest, we should not stratify by sex.

Modeling time-varying coefficients

The Cox model can be extended to allow the coefficient representing a covariate effect to change over time, by interacting that covariate with some function of time.

However, unlike in other regression models, we cannot create this interaction term by simply multiplying the covariate by the time variable (unless the data are in a special structure, see survSplit()).

Instead, we strongly recommend the use of the time-transform function, tt(), to avoid easily-made mistakes.

  • Specify the nonPH covariate both by itself and inside of tt() in the model formula
  • Define the function tt() within coxph()
    • always start with tt = function(x,t,...)
    • then define the relationship between the covariate x and time t. For example:
      • x*t will allow coefficient to change with time linearly
      • x*log(t) allows coefficient to change with log (magnitudes) of time
      • x*(t>100) allows coefficient to take on 2 different values, one value when \(t<=100\) and another value \(t>100\).


Below we specify an interaction of sex with time itself, so that the effect of sex is allowed to change linearly with time.

# notice sex and tt(sex) in model formula
lung.sex.by.time <- coxph(Surv(time, status) ~ age + wt.loss + sex + tt(sex), # sex and tt(sex) in formula
                          data=lung,
                          tt=function(x,t,...) x*t) # linear change in effect of sex
summary(lung.sex.by.time)
Call:
coxph(formula = Surv(time, status) ~ age + wt.loss + sex + tt(sex), 
    data = lung, tt = function(x, t, ...) x * t)

  n= 214, number of events= 152 
   (14 observations deleted due to missingness)

              coef  exp(coef)   se(coef)      z Pr(>|z|)   
age      0.0194343  1.0196244  0.0096522  2.013   0.0441 * 
wt.loss  0.0001260  1.0001261  0.0062502  0.020   0.9839   
sex     -0.9417444  0.3899470  0.3224791 -2.920   0.0035 **
tt(sex)  0.0013778  1.0013787  0.0008581  1.606   0.1084   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age        1.0196     0.9808    1.0005    1.0391
wt.loss    1.0001     0.9999    0.9879    1.0125
sex        0.3899     2.5645    0.2073    0.7337
tt(sex)    1.0014     0.9986    0.9997    1.0031

Concordance= 0.613  (se = 0.027 )
Likelihood ratio test= 17.23  on 4 df,   p=0.002
Wald test            = 15.86  on 4 df,   p=0.003
Score (logrank) test = 16.44  on 4 df,   p=0.002

The coef estimated for sex is the log-hazard-ratio at day \(t=0\), \(\hat{b}_{sex}=-.942\), corresponding to a hazard ratio of \(\hat{HR}_{sex}=exp(\hat{b}_{sex})=.39\).

The coef for tt(sex) is the change in the log-hazard-ratio for each additional day that passes.

Again, our inferences regarding age and wt.loss are mostly unchanged.

These sex coefficient estimates match the graph of the smoothed Schoenfeld residuals for sex. At the beginning of follow-up, the coefficient is close to -1, and it increases gradually over time.

plot(cox.zph(lung.cox), var="sex")

Time-varying covariates: data set up

The Cox PH model accommodates time-varying covariates, but the data should be structured in start-stop time format:

  • each subject can have multiple rows of data
  • 2 time variables record the beginning and end of time intervals
  • status variable records the event status at the end of each interval
    • for intervals where the subject remains in the risk set, the status should be censored
    • for single event data, if the event occurs, it will be in the subject’s last interval
  • with no time gaps, the start time of intervals is the stop time of the previous interval (except for the first interval)
  • the time when a covariate changes value should be recorded as the beginning of a new interval


The survival package provides a function tmerge() to help get your data in this format. See vignette(timedep) for guidance on its usage.

The jasa1 dataset, which looks at survival for patients on a waiting list for heart transplant, is already set up in this format:

In this data set, transplant is a time-varying covariate:

head(jasa1)
    id start stop event transplant        age      year surgery
1    1     0   49     1          0 -17.155373 0.1232033       0
2    2     0    5     1          0   3.835729 0.2546201       0
102  3     0   15     1          1   6.297057 0.2655715       0
3    4     0   35     0          0  -7.737166 0.4900753       0
103  4    35   38     1          1  -7.737166 0.4900753       0
4    5     0   17     1          0 -27.214237 0.6078029       0

For patient with id=4:

  • on the first row, we see that patient has not yet had transplant up to day 35, and the event has not yet occurred
  • on the second row, we see that the patient received a transplant on day 35 and then died on day 38
    • notice that day 35 is the end of the first interval and the start of the second interval

Cox model with time-varying covariates

The data are ready for coxph().

Use the Surv(time, time2, event) specification.

jasa1.cox <- coxph(Surv(start, stop, event) ~ transplant + age + surgery, data=jasa1)
summary(jasa1.cox)
Call:
coxph(formula = Surv(start, stop, event) ~ transplant + age + 
    surgery, data = jasa1)

  n= 170, number of events= 75 

               coef exp(coef) se(coef)      z Pr(>|z|)  
transplant  0.01405   1.01415  0.30822  0.046   0.9636  
age         0.03055   1.03103  0.01389  2.199   0.0279 *
surgery    -0.77326   0.46150  0.35966 -2.150   0.0316 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

           exp(coef) exp(-coef) lower .95 upper .95
transplant    1.0142     0.9860    0.5543    1.8555
age           1.0310     0.9699    1.0033    1.0595
surgery       0.4615     2.1668    0.2280    0.9339

Concordance= 0.599  (se = 0.036 )
Likelihood ratio test= 10.72  on 3 df,   p=0.01
Wald test            = 9.68  on 3 df,   p=0.02
Score (logrank) test = 10  on 3 df,   p=0.02

We can follow with the same procedures as before, checking PH assumptions and plotting predicted survival curves.

# check PH assumptions
cox.zph(jasa1.cox)
           chisq df    p
transplant 0.118  1 0.73
age        0.897  1 0.34
surgery    0.097  1 0.76
GLOBAL     1.363  3 0.71

No strong evidence that PH is violated.

# plot predicted survival by transplant group at mean age and surgery=0
plotdata <- data.frame(transplant=0:1, age=-2.48, surgery=0)
surv.by.transplant <- survfit(jasa1.cox, newdata=plotdata)
ggsurvplot(surv.by.transplant, data=plotdata,  # remember to supply data to ggsurvplot() for predicted survival after coxph()
           legend.labs=c("0", "1"), legend.title="transplant") # give proper labels and title to legend 

Exercise 2

Now we will fit a Cox proprotional hazards model to the veteran data set.

  • time: survival time in weeks
  • status: status, 0=censored, 1=dead
  • trt: treatment, 1=standard, 2=test
  • age: age


  1. Use a Cox proportional hazards model to estimate hazard ratios for the effects of treatment and age. Interpret the estimated hazard ratios.
  2. Assess the proportional hazards assumption for both covariates using a chi-square test and graphs.
  3. Fit a Cox model that allows the effect of trt to change over time (even though the original model was not very useful).

More about survival analysis

More R stuff

The coxph() function has some additional flexibility, including:

  • recurrent events modeling (see vignette("survival"))
  • competing risks modeling (see vignette("survival"))
  • frailty random effects, see ?frailty for more, but in general, use the coxme package
  • robust and cluster robust variance estimation, see the robust and cluster options in ?coxph
  • weights that can be treated as replication/frequency or sampling weights, see the weights argument in ?coxph


The survival package has the survreg() function for parametric regression models.


Additional useful survival analysis packages

  • coxme package, frailty random effects models, also written by Therneau
  • icenreg package, for interval censored data (note: data set up is different from survival)

References

References for the survival package

The survival package has some of the best vignettes (tutorials) of any R package.

Key vignettes for getting started (use vignette(package="survival") to see a full list):

  • vignette("survival") for a general introduction of the usage and capabilities of the survival package, as well as some theoretical survival analysis background
  • vignette("timedep") for guidance modeling time-varying covariates and time-varying coefficients


References for survival analysis

Grambsch, P. & Therneau, T. (1994), Proportional hazards tests and diagnostics based on weighted residuals. Biometrika, 81, 515-26.

Therneau, T. M. & Grambsch, P. M. (2000). Modeling Survival Data: Extending the Cox Model, Third Edition. New York: Springer.

Kleinbaum, D. G., & Klein, M. (2012). Survival analysis: a self-learning text (Vol. 3). New York: Springer.