class: center, middle, inverse, title-slide .title[ # Advanced statistics in R ] .subtitle[ ## Survival analysis ] .author[ ### ] .date[ ###
contact@appliedepi.org
] --- <style type="text/css"> .remark-slide table{ border: none } .remark-slide-table { } tr:first-child { border-top: none; } tr:last-child { border-bottom: none; } .center2 { margin: 0; position: absolute; top: 50%; left: 50%; } </style> # Survival analysis - what is it? Survival analysis (also known as time-to-event or duration analysis) refers to a group of statistical techniques to analyze the *time* to an *event*. --- # Examples Time from diagnosis to death. --- # Examples Time from diagnosis to death. The time from entering a hospital to discharge. --- # Examples Time from diagnosis to death. The time from entering a hospital to discharge. How long from an exposure to developing disease. --- # Censoring is an important part of survival analysis Censoring is when we have _some_ information about a subject, but we do not know everything about the time and event. This may be due to reasons such as: --- # Censoring is an important part of survival analysis Censoring is when we have _some_ information about a subject, but we do not know everything about the time and event. This may be due to reasons such as: * No event is recorded before the study ends --- # Censoring is an important part of survival analysis Censoring is when we have _some_ information about a subject, but we do not know everything about the time and event. This may be due to reasons such as: * No event is recorded before the study ends * An individual is lost to follow up --- # Censoring is an important part of survival analysis Censoring is when we have _some_ information about a subject, but we do not know everything about the time and event. This may be due to reasons such as: * No event is recorded before the study ends * An individual is lost to follow up * A person withdraws from the study --- # Left, right and interval censoring The type of censoring is related to when the individual was observed, and when the event happened. --- # Left, right and interval censoring The type of censoring is related to when the individual was observed, and when the event happened. **Right censoring - When the event did not occur during the study, but was observed to occur afterwards.** --- # Left, right and interval censoring The type of censoring is related to when the individual was observed, and when the event happened. Right censoring - When the event did not occur during the study, but was observed to occur afterwards. **Left censoring - When an event or exposure occurred before the study.** --- # Left, right and interval censoring The type of censoring is related to when the individual was observed, and when the event happened. Right censoring - When the event did not occur during the study, but was observed to occur afterwards. Left censoring - When an event or exposure occurred before the study. **Interval censoring - When an individual is unobserved for a period of time in the study.** --- <img src="../../images/survival/blank_censored.png" width="100%" /> --- <img src="../../images/survival/right_censored.png" width="100%" /> --- <img src="../../images/survival/left_censored.png" width="100%" /> --- <img src="../../images/survival/interval_censored.png" width="100%" /> --- <img src="../../images/survival/not_censored.png" width="100%" /> --- # The **survival package** We will be using the **survival** package to carry out our analysis. The first step is to use the function `Surv()` with the arguments `time` and `event`. Where: --- # The **survival package** We will be using the **survival** package to carry out our analysis. The first step is to use the function `Surv()` with the arguments `time` and `event`. Where: `time = ` refers to *when* an event occurred, or the end of the study, and --- # The **survival package** We will be using the **survival** package to carry out our analysis. The first step is to use the function `Surv()` with the arguments `time` and `event`. Where: `time = ` refers to *when* an event occurred, or the end of the study, and `event = ` a binary indication (0 or 1) of whether or not the event of interest occurred. --- # Creating the surv object First we have our dataset, `survival_basic`, which contains three columns. - `event`: Whether an individual died. - `time_to_event`: The number of days between onset and death. - `gender`: The gender of the individual. ``` ## # A tibble: 6 × 3 ## event time_to_event gender ## <dbl> <drtn> <chr> ## 1 0 16 days female ## 2 1 4 days female ## 3 1 16 days female ## 4 1 6 days female ## 5 0 7 days male ## 6 0 15 days male ``` --- # Using $ not %>% The **survival** package is not pipe (`%>%`) friendly, so rather than piping in the dataset we need to specify which columns we want to use with `$`. ```r head(survival_basic$event) ``` ``` ## [1] 0 1 1 1 0 0 ``` --- # Creating the Surv object ```r surv_obj <- Surv(time = survival_basic$time_to_event, event = survival_basic$event) class(surv_obj) ``` ``` ## [1] "Surv" ``` ```r head(surv_obj) ``` ``` ## [1] 16+ 4 16 6 7+ 15+ ``` --- # How to use our Surv object Once we have created our `Surv` object, `surv_obj`, we can use the function `survfit()` from the **survival** package to fit our first analysis. A *Kaplan Meier (KM)* estimate of the survival curve. --- # KM is our first step in survival analysis Kaplan Meier is a non-parametric statistic that estimates the survival function from time and events. This produces both a statistical estimate of the survival probability by time point for the data, and can visualize this as a curve. --- # Running `survfit()` This takes our `Surv` object, `surv_obj` and uses a tilde, `~ 1`, to specify that we are fitting to the events previously specified with a value of 1. ```r surv_fit <- survfit(surv_obj ~ 1) ``` --- # Understanding the output of `survfit()` `time` - The time `n.event` - The number of individuals at risk (who did not develop the event and were not censored) `n.event` - The number of events `survival` - The probability of **not** developing the event `std.err` - The standard error `95% CI` - The lower and upper 95% confidence intervals --- # Understanding the output of `survfit()` ```r summary(surv_fit) ``` ``` ## Call: survfit(formula = surv_obj ~ 1) ## ## time n.risk n.event survival std.err lower 95% CI upper 95% CI ## -10 335 2 0.994 0.00421 0.9858 1.000 ## -8 333 1 0.991 0.00515 0.9810 1.000 ## -7 332 1 0.988 0.00593 0.9765 1.000 ## -5 329 2 0.982 0.00726 0.9679 0.996 ## -4 326 1 0.979 0.00784 0.9638 0.995 ## -1 324 1 0.976 0.00838 0.9597 0.993 ## 2 323 2 0.970 0.00935 0.9518 0.988 ## 3 321 11 0.937 0.01336 0.9109 0.963 ## 4 309 22 0.870 0.01849 0.8346 0.907 ## 5 282 22 0.802 0.02199 0.7602 0.846 ## 6 257 23 0.730 0.02460 0.6837 0.780 ## 7 230 13 0.689 0.02573 0.6405 0.741 ## 8 209 13 0.646 0.02674 0.5959 0.701 ## 9 193 9 0.616 0.02731 0.5648 0.672 ## 10 176 9 0.585 0.02786 0.5325 0.642 ## 11 163 10 0.549 0.02837 0.4959 0.607 ## 12 150 9 0.516 0.02871 0.4625 0.575 ## 13 128 3 0.504 0.02887 0.4502 0.564 ## 14 117 2 0.495 0.02901 0.4414 0.555 ## 15 106 4 0.476 0.02938 0.4222 0.538 ## 16 95 11 0.421 0.03033 0.3658 0.485 ## 17 77 4 0.399 0.03066 0.3436 0.464 ## 18 66 3 0.381 0.03101 0.3250 0.447 ## 20 56 2 0.368 0.03136 0.3110 0.435 ## 21 50 2 0.353 0.03178 0.2958 0.421 ## 22 43 1 0.345 0.03208 0.2872 0.414 ## 23 39 1 0.336 0.03246 0.2779 0.406 ## 28 25 1 0.322 0.03382 0.2625 0.396 ## 29 22 2 0.293 0.03655 0.2296 0.374 ## 30 18 1 0.277 0.03798 0.2116 0.362 ## 36 10 1 0.249 0.04310 0.1775 0.350 ## 38 7 2 0.178 0.05251 0.0998 0.317 ## 51 1 1 0.000 NaN NA NA ``` --- # Preparing the output for plotting To plot the output we can use the function `tidy()` from the **broom** package to reformat the data for easy plotting with `ggplot()`. ```r surv_plot_data <- tidy(surv_fit) head(surv_plot_data) ``` ``` ## # A tibble: 6 × 8 ## time n.risk n.event n.censor estimate std.error conf.high conf.low ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 -10 335 2 0 0.994 0.00423 1 0.986 ## 2 -8 333 1 0 0.991 0.00519 1 0.981 ## 3 -7 332 1 1 0.988 0.00601 1.00 0.976 ## 4 -6 330 0 1 0.988 0.00601 1.00 0.976 ## 5 -5 329 2 1 0.982 0.00739 0.996 0.968 ## 6 -4 326 1 0 0.979 0.00801 0.995 0.964 ``` --- # Plotting the output .pull-left[ ```r ggplot(data = surv_plot_data, aes(x = time, y = estimate, ymin = conf.low, ymax = conf.high)) + geom_line() + geom_ribbon(alpha = 0.25) + labs(x = "Time", y = "Survival prob") + theme_bw() ``` ] .pull-right[ <img src="survival_files/figure-html/plot-first-out-1.png" width="504" /> ] --- #Comparing survival plots If we want to compare outcomes, say by an outcome like `gender` we can specify this when we use `survfit()`. In this we have to also specify the dataset used to create our `Survival` object (`survival_obj`). ```r survfit_gender <- survfit(surv_obj ~ gender, data = survival_basic) surv_plot_gender <- tidy(survfit_gender) ``` --- #Survival plot by gender .pull-left[ ```r ggplot(data = surv_plot_gender, aes(x = time, y = estimate, ymin = conf.low, ymax = conf.high, color = strata, fill = strata)) + geom_line() + geom_ribbon(alpha = 0.25) + labs(x = "Time", y = "Survival prob") + theme_bw() ``` ] .pull-right[ <img src="survival_files/figure-html/plot-second-out-1.png" width="504" /> ] --- # Assessing statistical significance Using the function `survdiff()` we can calculate a log-rank test, giving us a chi-square statistic and a p-value to understand if there is a statistical difference between the gender. ```r logrank_gender <- survdiff(formula = surv_obj ~ gender, data = survival_basic) logrank_gender ``` ``` ## Call: ## survdiff(formula = surv_obj ~ gender, data = survival_basic) ## ## N Observed Expected (O-E)^2/E (O-E)^2/V ## gender=female 172 100 101.7 0.0291 0.0666 ## gender=male 163 92 90.3 0.0328 0.0666 ## ## Chisq= 0.1 on 1 degrees of freedom, p= 0.8 ``` --- # Cox proportial hazards regression Cox proportional hazards regression is one of the most useful regression techniques for survival analysis. This gives us the hazard ratio (comparable to an odds ratio) that an individual has survived to a time point. To do this we use the function `coxph()` from the **survival** package. We can take into account the effect of independent variables on the outcome, much like a standard regression. --- # Creating a new dataframe and running coxph() ```r #Create a new dataset with additional variables linelist_update <- linelist %>% mutate(time_to_event = date_outcome - date_onset, event = ifelse(outcome == "Death", 1, 0)) %>% select(event, time_to_event, gender, age, bleeding, blood_ct, temp) %>% drop_na() ``` --- # Creating a new dataframe and running coxph() ```r #Create a new dataset with additional variables linelist_update <- linelist %>% mutate(time_to_event = date_outcome - date_onset, event = ifelse(outcome == "Death", 1, 0)) %>% select(event, time_to_event, gender, age, bleeding, blood_ct, temp) %>% drop_na() #Create a survival object surv_obj_cox <- Surv(time = linelist_update$time_to_event, event = linelist_update$event) ``` --- # Creating a new dataframe and running coxph() ```r #Create a new dataset with additional variables linelist_update <- linelist %>% mutate(time_to_event = date_outcome - date_onset, event = ifelse(outcome == "Death", 1, 0)) %>% select(event, time_to_event, gender, age, bleeding, blood_ct, temp) %>% drop_na() #Create a survival object surv_obj_cox <- Surv(time = linelist_update$time_to_event, event = linelist_update$event) #Run the regression cox_regression <- coxph(surv_obj_cox ~ gender + age + temp + blood_ct + bleeding, data = linelist_update) ``` --- #Displaying the output As with other regressions we have looked at, we can use the function `tbl_regression` from **gtsummary** to produce publication ready outputs. ```r cox_regression %>% tbl_regression(exponentiate = TRUE) %>% bold_p() ``` --- #Displaying the output
Characteristic
HR
1
95% CI
1
p-value
gender
female
—
—
male
0.98
0.73, 1.32
0.9
age
1.00
0.99, 1.01
0.9
temp
0.97
0.82, 1.13
0.7
blood_ct
1.03
0.95, 1.11
0.5
bleeding
No
—
—
Yes
0.45
0.34, 0.61
<0.001
1
HR = Hazard Ratio, CI = Confidence Interval
--- # Ready to try it out? Any questions? **Resources** Course website (initial setup and slides access): [https://appliedepi.github.io/intro_course/](https://appliedepi.github.io/intro_course/) [Epi R Handbook](epirhandbook.com/) Applied Epi Community A great resource for asking questions and help! https://community.appliedepi.org/