class: center, middle, inverse, title-slide .title[ # Advanced statistics in R ] .subtitle[ ## Regression - variable selection and multivariable regression ] .author[ ### ] .date[ ###
contact@appliedepi.org
]

---

<style type="text/css">
.remark-slide table{ border: none }
.remark-slide-table { }
tr:first-child { border-top: none; }
tr:last-child { border-bottom: none; }
.tiny .remark-code { /*Change made here*/ font-size: 50% !important; }
.teenytiny .remark-code { /*Change made here*/ font-size: 25% !important; }
.giant .remark-code { /*Change made here*/ font-size: 150% !important; }
.smaller .remark-code { /*Change made here*/ font-size: 75% !important; }
.huge .remark-code { /*Change made here*/ font-size: 250% !important; }
</style>

# Multivariable regression

Now that you are experts at conducting univariate regression in R, it is time to move on to multivariable regression!

---

# Multivariable regression - when univariate is just not enough

Multivariable regression allows us to look at the relationship between our dependent variable and multiple independent variables.

Why multivariable?

* It considers the relationship of each exposure _in context_ with the outcome
* This allows us to understand which exposures have a *true association* with the outcome, rather than one produced by *confounding* or *effect modification*

---

# Effect modification and confounding

Effect modification is when the **magnitude** of the effect of an exposure on an outcome differs by another exposure.

Confounding is when another exposure is related to both the exposure and the outcome, but **it does not lie on the causative pathway**.

Can we think of any examples of effect modification and confounding?

---

# Effect modification

Effect modification is all about stratification. Which of these are examples of effect modification?

---

# Effect modification

Effect modification is all about stratification. Which of these are examples of effect modification?

- When a drug works on females but does not work on males

---

# Effect modification

Effect modification is all about stratification. Which of these are examples of effect modification?
- When a drug works on females but does not work on males
- Those who are exposed to asbestos **and** smoke have higher risks of lung cancer

---

# Effect modification

Effect modification is all about stratification. Which of these are examples of effect modification?

- When a drug works on females but does not work on males
- Those who are exposed to asbestos **and** smoke have higher risks of lung cancer
- Alcohol consumption is associated with lung cancer; those who drink alcohol are more likely to smoke

---

# Confounding

A confounder is a variable that influences both the dependent variable and the independent variable, causing a spurious association.

<img src="../../images/regression/confounder_wikipedia.png" width="125%" />

https://upload.wikimedia.org/wikipedia/commons/5/53/Comparison_confounder_mediator.svg

---

# Confounding

So ice cream is causing sharks to bite humans?

<img src="../../images/regression/icecream_sharks.png" width="100%" />

---

# Effect modification and confounding

Multivariable regression estimates the association between the exposures and the outcome by holding all other exposures constant. This allows us to account for *confounding* **and** *effect modification*.

---

# What do we put into our model?

* Do we put in **all** the data?
* Do we include **only** what we _know_ is likely to be useful?
* Do we carry out some sort of stepwise approach?

---

# What do we put into our model?

* Do we put in **all** the data?
* Do we include **only** what we _know_ is likely to be useful?
* Do we carry out some sort of stepwise approach?
<font size="46">We should avoid all of these!</font>

---

# Do's and don'ts

*The do's*

Expert knowledge

- Think about the research question you are trying to answer
- Statistics should follow theory, not the other way round

---

# Do's and don'ts

*The do's*

Expert knowledge

- Think about the research question you are trying to answer
- Statistics should follow theory, not the other way round
- If you, or the literature, think something is important, add it!
- A non-significant result for such a variable is still a useful result

---

# Do's and don'ts

*The don'ts*

Stepwise variable selection

- This is no longer recommended for several reasons
- It can remove useful variables and retain useless ones due to coincidence rather than real-world meaning

---

# Do's and don'ts

*The don'ts*

Stepwise variable selection

- This is no longer recommended for several reasons
- It can remove useful variables and retain useless ones due to coincidence rather than real-world meaning

Screening using a univariate regression

- Non-significant variables can still affect parameter estimates
- This can reject important variables when there is confounding or effect modification

---

# Overfitting

This is where the model "memorises" a pattern rather than "learns" it.

This can give us a model that performs very well on the dataset that it has been given, but very poorly on a new dataset.

---

# Overfitting

<img src="../../images/regression/overfitting.png" width="100%" />

https://www.educative.io/api/edpresso/shot/6668977167138816/image/5033807687188480

---

# Overfitting

Even if we are not applying our model to a new dataset, we want to know that the conclusions our model is coming to are valid.

To do this we want to **train** our model on one subset of data, and **test** our model on a different subset of data.

If we find the same associations in both datasets we can have more confidence that we have found a real association.
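---

# Overfitting

The train/test idea above can be sketched in base R. This is a minimal illustration on simulated data - the dataset and variable names here are made up for the sketch, not taken from the course `linelist`:

.smaller[

```r
# Sketch of a train/test split in base R (simulated data, not the course linelist)
set.seed(42)

n <- 500
sim_data <- data.frame(age  = rnorm(n, mean = 40, sd = 12),
                       temp = rnorm(n, mean = 38.5, sd = 0.5))

# Simulate a binary outcome that truly depends on age only
sim_data$died <- rbinom(n, size = 1, prob = plogis(-3 + 0.06 * sim_data$age))

# Put 70% of the rows in the training set and keep the rest for testing
train_rows <- sample(seq_len(n), size = 0.7 * n)
train <- sim_data[train_rows, ]
test  <- sim_data[-train_rows, ]

# Train the model on one subset of the data...
fit <- glm(died ~ age + temp, data = train, family = binomial)

# ...and test it on the subset the model has never seen
train_pred <- predict(fit, type = "response")
test_pred  <- predict(fit, newdata = test, type = "response")

# Similar classification accuracy in both subsets suggests a real association
mean((train_pred > 0.5) == train$died)
mean((test_pred  > 0.5) == test$died)
```

]

If the two accuracy figures differ markedly, the model has probably memorised noise in the training subset.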
---

# Lasso variable selection

If you are presented with a large number of variables, and there is some uncertainty in what is important, you can use a technique called Lasso (Least Absolute Shrinkage and Selection Operator) regression.

Lasso regression works by penalising the size of the coefficients, shrinking the least useful ones towards zero, to create a model which has the **least** difference in predictions between the **in-sample** and the **out-of-sample** datasets.

---

# Lasso variable selection

Lasso regression works by penalising the size of the coefficients, shrinking the least useful ones towards zero, to create a model which has the **least** difference in predictions between the **in-sample** and the **out-of-sample** datasets.

---

# A simple model is a good model!

[Occam's razor (the law of parsimony)](https://en.wikipedia.org/wiki/Occam%27s_razor)

- Do not make something more complicated than is necessary.

A simple model is a good model

- As long as it can still adequately answer our research questions.

---

# Lasso variable selection

1) First we set up the cross-validation for the lasso regression using the function `trainControl()` from the **caret** package.

.smaller[

``` r
fitControl <- trainControl(
  method = "repeatedcv", # We want to use repeated cross validation
  number = 5,            # Divide the dataset into 5 sections to train and test
  repeats = 10)          # Repeat this with 10 "cuts" of data
```

]

???

Talk through each of method/number/repeats

method - This stands for repeated cross validation, the process of taking out a "chunk" of the dataset and trying to predict it using the rest of the dataset. This is done to make sure the model is not overfitting

number - Split the data into 5 sections (20% removed each time) - the smaller the number the larger the "chunks" in the dataset.
This makes the fitting harder to do, but can help increase the validity of predictions

repeats - Repeat 10 times with different "cuts" of the data to ensure the result we get is a true relationship and not just due to a "lucky" or "unlucky" split of the data

---

# Cross validation

.smaller[

``` r
fitControl <- trainControl(
  method = "repeatedcv", # We want to use repeated cross validation
  number = 5,            # Divide the dataset into 5 sections to train and test
  repeats = 10)          # Repeat this with 10 "cuts" of data
```

]

<img src="../../images/regression/KfoldCV.gif" width="100%" />

???

Talk through this gif in the context of the last slide

---

# Run the lasso regression

1) First we subset our data to the potential independent variables of interest and the dependent variable.

.smaller[

``` r
data_use <- linelist %>%
  select(gender,  # These are the exposures
         age,     # we want to investigate
         hospital:bmi,
         blood_ct,
         outcome_death)
```

]

---

# Run the lasso regression

1) First we subset our data to the potential independent variables of interest and the dependent variable.

2) Then we remove `NA` values with `drop_na()`.

.smaller[

``` r
data_use <- linelist %>%
  select(gender,  # These are the exposures
         age,     # we want to investigate
         hospital:bmi,
         blood_ct,
         outcome_death) %>%
  drop_na()       # We can only use complete rows
```

]

---

# Run the lasso regression

1) First we subset our data to the potential independent variables of interest and the dependent variable.

2) Then we remove `NA` values with `drop_na()`.
3) Then we use `train()` from **caret** to run the lasso regression (in caret we specify `method = "glmnet"`) according to the conditions we specified with `trainControl()`.

.smaller[

``` r
data_use <- linelist %>%
  select(gender,  # These are the exposures
         age,     # we want to investigate
         hospital:bmi,
         blood_ct,
         outcome_death) %>%
  drop_na()       # We can only use complete rows

lasso_fit <- train(form = outcome_death ~ .,
                   data = data_use,
                   method = "glmnet",      # This is the caret name for lasso
                   trControl = fitControl) # This is the training method we specified
```

]

---

# Lasso results

Then we use the function `predictors()` from **caret** to extract the variables that the lasso model retained.

``` r
lasso_predictors_raw <- predictors(lasso_fit)

lasso_predictors_raw
```

```
## [1] "hospitalMilitary Hospital" "hospitalOther"
## [3] "hospitalPort Hospital"     "wt_kg"
## [5] "coughyes"
```

???

Talk through these predictors and why some have "hospital" in front of them (factor and then the level that is important)

---

# Lasso results

We can then process these to get them ready for use using `str_extract()` from the **stringr** package. This keeps only the part of each result that matches a column name in our dataset, stripping out the factor levels.

.smaller[

``` r
lasso_predictors_clean <- str_extract(string = lasso_predictors_raw,
                                      pattern = paste(colnames(data_use),
                                                      collapse = "|"))

lasso_predictors_clean
```

```
## [1] "hospital" "hospital" "hospital" "wt_kg"    "cough"
```

]

---

# Expert knowledge (biological plausibility) is important

Are we done now, did lasso solve our model building?

---

# Expert knowledge (biological plausibility) is important

Are we done now, did lasso solve our model building?

<font size="46">No!</font>

If there is a variable that you think should be included (based on biological plausibility or a previously established association), include it!

---

# Carrying out multivariable regression

Now we are ready to carry out our multivariable regression! This is done in a slightly different way to our univariate regression.
We will be using **base** R's `glm()` function and then tidying up the outputs using the function `tbl_regression()` from **gtsummary**.

---

# Carrying out multivariable regression

``` r
multivariable_regression_table <- data_use %>%
  select(lasso_predictors_clean, outcome_death, temp, age) %>%
```

???

Talk through the stages, what formula means, what family means, remind them what exponentiate = TRUE means

---

# Carrying out multivariable regression

``` r
multivariable_regression_table <- data_use %>%
  select(lasso_predictors_clean, outcome_death, temp, age) %>%
  glm(formula = outcome_death ~ ., family = binomial)
```

???

Talk through the stages, what formula means, what family means, remind them what exponentiate = TRUE means

---

# Carrying out multivariable regression

``` r
multivariable_regression_table <- data_use %>%
  select(lasso_predictors_clean, outcome_death, temp, age) %>%
  glm(formula = outcome_death ~ ., family = binomial) %>%
  tbl_regression(exponentiate = TRUE)
```

???

Talk through the stages, what formula means, what family means, remind them what exponentiate = TRUE means

---

# Carrying out multivariable regression

``` r
multivariable_regression_table <- data_use %>%
  select(lasso_predictors_clean, outcome_death, temp, age) %>%
  glm(formula = outcome_death ~ ., family = binomial) %>%
  tbl_regression(exponentiate = TRUE)

#Create a table of counts
cross_tab <- data_use %>%
  select(lasso_predictors_clean, outcome_death, temp, age) %>%
  tbl_summary(by = outcome_death)
```

???
Talk through the stages, what formula means, what family means, remind them what exponentiate = TRUE means

---

# Carrying out multivariable regression

``` r
multivariable_regression_table <- data_use %>%
  select(lasso_predictors_clean, outcome_death, temp, age) %>%
  glm(formula = outcome_death ~ ., family = binomial) %>%
  tbl_regression(exponentiate = TRUE)

#Create a table of counts
cross_tab <- data_use %>%
  select(lasso_predictors_clean, outcome_death, temp, age) %>%
  tbl_summary(by = outcome_death)

#Combine the regression table and the count table
tbl_merge(list(cross_tab, multivariable_regression_table),
          tab_spanner = c("Counts", "Multivariable regression"))
```

???

Talk through the stages, what formula means, what family means, remind them what exponentiate = TRUE means

---

# Carrying out multivariable regression
| Characteristic | Counts: outcome_death = 0 (N = 82)¹ | Counts: outcome_death = 1 (N = 135)¹ | OR² | 95% CI² | p-value |
|---|---|---|---|---|---|
| hospital | | | | | |
| Central Hospital | 8 (9.8%) | 12 (8.9%) | — | — | |
| Military Hospital | 22 (27%) | 24 (18%) | 0.54 | 0.17, 1.61 | 0.3 |
| Other | 22 (27%) | 22 (16%) | 0.52 | 0.17, 1.55 | 0.2 |
| Port Hospital | 24 (29%) | 69 (51%) | 1.53 | 0.53, 4.24 | 0.4 |
| SMMH | 6 (7.3%) | 8 (5.9%) | 0.85 | 0.20, 3.67 | 0.8 |
| wt_kg | 61 (48, 70) | 59 (50, 65) | 0.96 | 0.93, 1.00 | 0.058 |
| cough | 62 (76%) | 119 (88%) | | | |
| no | | | — | — | |
| yes | | | 2.24 | 1.04, 4.89 | 0.040 |
| temp | 38.90 (38.40, 39.30) | 38.90 (38.40, 39.20) | 0.89 | 0.64, 1.23 | 0.5 |
| age | 19 (11, 27) | 17 (10, 25) | 1.02 | 0.98, 1.07 | 0.3 |

¹ n (%); Median (Q1, Q3)

² OR = Odds Ratio, CI = Confidence Interval
---

# Displaying results

You can also use `tbl_merge()` to compare and contrast your univariate and multivariable regression results.

.tiny[
| Characteristic | N | Univariate OR¹ | Univariate 95% CI¹ | Univariate p-value | Multivariable OR¹ | Multivariable 95% CI¹ | Multivariable p-value |
|---|---|---|---|---|---|---|---|
| hospital | 217 | | | | | | |
| Central Hospital | | — | — | | — | — | |
| Military Hospital | | 0.73 | 0.24, 2.09 | 0.6 | 0.54 | 0.17, 1.61 | 0.3 |
| Other | | 0.67 | 0.22, 1.93 | 0.5 | 0.52 | 0.17, 1.55 | 0.2 |
| Port Hospital | | 1.92 | 0.68, 5.22 | 0.2 | 1.53 | 0.53, 4.24 | 0.4 |
| SMMH | | 0.89 | 0.22, 3.63 | 0.9 | 0.85 | 0.20, 3.67 | 0.8 |
| wt_kg | 217 | 0.98 | 0.96, 1.00 | 0.12 | 0.96 | 0.93, 1.00 | 0.058 |
| cough | 217 | | | | | | |
| no | | — | — | | — | — | |
| yes | | 2.40 | 1.16, 5.02 | 0.018 | 2.24 | 1.04, 4.89 | 0.040 |
| temp | 217 | 0.92 | 0.67, 1.25 | 0.6 | 0.89 | 0.64, 1.23 | 0.5 |
| age | 217 | 0.99 | 0.97, 1.01 | 0.4 | 1.02 | 0.98, 1.07 | 0.3 |

¹ OR = Odds Ratio, CI = Confidence Interval
]

---

# Displaying results

.pull-left[

And you can create forest plots using `ggplot()`

]

.pull-right[

<img src="multivariable_regression_files/figure-html/unnamed-chunk-22-1.png" width="600px" height="400px" />

]

---

# It's now time for you to try it out yourselves!

Any questions?

**Resources**

- Course website (initial setup and slides access): [https://courses.appliedepi.org/statsr/](https://courses.appliedepi.org/statsr/)
- [Epi R Handbook](https://epirhandbook.com/en/) 50 chapters of best-practice code examples available online and offline
- [Applied Epi Community](https://community.appliedepi.org/) A great resource for asking questions and getting help!
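---

# Displaying results

A minimal sketch of such a forest plot with **ggplot2**. The model, data (built-in `mtcars`), and variable names here are hypothetical stand-ins, not the course's `linelist` model:

.smaller[

```r
library(ggplot2)

# Hypothetical model - in the course this would be the multivariable glm() fit
fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# Odds ratios with Wald 95% confidence intervals, intercept dropped
se  <- sqrt(diag(vcov(fit)))
est <- data.frame(term  = names(coef(fit))[-1],
                  or    = exp(coef(fit))[-1],
                  lower = exp(coef(fit) - 1.96 * se)[-1],
                  upper = exp(coef(fit) + 1.96 * se)[-1])

# Forest plot: one row per term, point = OR, horizontal line = 95% CI
forest <- ggplot(est, aes(x = or, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = lower, xmax = upper), height = 0.2) +
  geom_vline(xintercept = 1, linetype = "dashed") + # OR = 1 reference line
  scale_x_log10() +
  labs(x = "Odds ratio (95% CI, log scale)", y = NULL)

forest
```

]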