class: center, middle, inverse, title-slide .title[ # Advanced statistics in R ] .subtitle[ ## Regression - variable selection and multivariable regression ] .author[ ### ] .date[ ###
contact@appliedepi.org
]

---

<style type="text/css">
.remark-slide table{ border: none }
.remark-slide-table { }
tr:first-child { border-top: none; }
tr:last-child { border-bottom: none; }
.tiny .remark-code { /*Change made here*/ font-size: 50% !important; }
.teenytiny .remark-code { /*Change made here*/ font-size: 25% !important; }
.giant .remark-code { /*Change made here*/ font-size: 150% !important; }
.smaller .remark-code { /*Change made here*/ font-size: 75% !important; }
.huge .remark-code { /*Change made here*/ font-size: 250% !important; }
</style>

# Multivariable regression

Now that you are experts at conducting univariate regression in R, it is time to move on to multivariable regression!

---

# Multivariable regression - when univariate is just not enough

Multivariable regression allows us to look at the relationship between our dependent variable and multiple independent variables.

Why multivariable?

* It considers the relationship of each exposure _in context_ with the outcome
* This allows us to understand which exposures have a *true association* with the outcome, rather than one produced by *confounding* or *effect modification*

---

# Effect modification and confounding

Effect modification is when the **magnitude** of the effect of an exposure on an outcome differs by another exposure.

Confounding is when another exposure is related to both the exposure and the outcome, but **it does not lie on the causative pathway**.

Can we think of any examples of effect modification and confounding?

---

# Effect modification

Effect modification is all about stratification. Which of these are examples of effect modification?

---

# Effect modification

Effect modification is all about stratification. Which of these are examples of effect modification?

- When a drug works on females but does not work on males

---

# Effect modification

Effect modification is all about stratification. Which of these are examples of effect modification?
- When a drug works on females but does not work on males
- Those who are exposed to asbestos **and** smoke have higher risks of lung cancer

---

# Effect modification

Effect modification is all about stratification. Which of these are examples of effect modification?

- When a drug works on females but does not work on males
- Those who are exposed to asbestos **and** smoke have higher risks of lung cancer
- Alcohol consumption is associated with lung cancer; those who drink alcohol are more likely to smoke

---

# Confounding

A confounder is a variable that influences both the dependent variable and the independent variable, causing a spurious association.

<img src="../../images/regression/confounder_wikipedia.png" width="125%" />

https://upload.wikimedia.org/wikipedia/commons/5/53/Comparison_confounder_mediator.svg

---

# Confounding

So ice cream is causing sharks to bite humans?

<img src="../../images/regression/icecream_sharks.png" width="100%" />

---

# Effect modification and confounding

Multivariable regression estimates the association between the exposures and the outcome by holding all other exposures constant. This allows us to account for *confounding* **and** *effect modification*.

---

# What do we put into our model?

* Do we put in **all** the data?
* Do we include **only** what we _know_ is likely to be useful?
* Do we carry out some sort of stepwise approach?

---

# What do we put into our model?

* Do we put in **all** the data?
* Do we include **only** what we _know_ is likely to be useful?
* Do we carry out some sort of stepwise approach?
<font size="46">We should avoid all of these!</font>

---

# Do's and don'ts

*The do's*

Expert knowledge

- Think about the research question you are trying to answer
- Statistics should follow theory, not the other way round

---

# Do's and don'ts

*The do's*

Expert knowledge

- Think about the research question you are trying to answer
- Statistics should follow theory, not the other way round
- If you, or the literature, think something is important, add it!
- A non-significant result for such a variable is still a useful result

---

# Do's and don'ts

*The don'ts*

Stepwise variable selection

- This is no longer recommended for several reasons
- It can remove useful variables and retain useless ones due to coincidence rather than real-world meaning

---

# Do's and don'ts

*The don'ts*

Stepwise variable selection

- This is no longer recommended for several reasons
- It can remove useful variables and retain useless ones due to coincidence rather than real-world meaning

Screening using a univariate regression

- Non-significant variables can still affect parameter estimates
- This can reject important variables when there is confounding or effect modification

---

# Overfitting

This is where the model "memorises" a pattern rather than "learns" it.

This can give us a model that performs very well on the dataset that it has been given, but very poorly on a new dataset.

---

# Overfitting

<img src="../../images/regression/overfitting.png" width="100%" />

https://www.educative.io/api/edpresso/shot/6668977167138816/image/5033807687188480

---

# Overfitting

Even if we are not applying our model to a new dataset, we want to know that the conclusions our model is coming to are valid.

To do this we want to **train** our model on one subset of data, and **test** our model on a different subset of data.

If we find the same associations in both datasets we can have more confidence that we have found a real association.
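---

# Overfitting

The train/test idea above can be sketched in base R. This is a minimal illustration on simulated data - the dataset and variable names here are made up for the sketch, not taken from the course `linelist`:

.smaller[

```r
# Sketch of a train/test split in base R (simulated data, not the course linelist)
set.seed(42)

n <- 500
sim_data <- data.frame(age  = rnorm(n, mean = 40, sd = 12),
                       temp = rnorm(n, mean = 38.5, sd = 0.5))

# Simulate a binary outcome that truly depends on age only
sim_data$died <- rbinom(n, size = 1, prob = plogis(-3 + 0.06 * sim_data$age))

# Put 70% of the rows in the training set and keep the rest for testing
train_rows <- sample(seq_len(n), size = 0.7 * n)
train <- sim_data[train_rows, ]
test  <- sim_data[-train_rows, ]

# Train the model on one subset of the data...
fit <- glm(died ~ age + temp, data = train, family = binomial)

# ...and test it on the subset the model has never seen
train_pred <- predict(fit, type = "response")
test_pred  <- predict(fit, newdata = test, type = "response")

# Similar classification accuracy in both subsets suggests a real association
mean((train_pred > 0.5) == train$died)
mean((test_pred  > 0.5) == test$died)
```

]

If the two accuracy figures differ markedly, the model has probably memorised noise in the training subset.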
---

# Lasso variable selection

If you are presented with a large number of variables, and there is some uncertainty in what is important, you can use a technique called Lasso (Least Absolute Shrinkage and Selection Operator) regression.

Lasso regression works by penalising the size of the coefficients, shrinking the least useful ones towards zero, to create a model which has the **least** difference in predictions between the **in-sample** and the **out-of-sample** datasets.

---

# Lasso variable selection

Lasso regression works by penalising the size of the coefficients, shrinking the least useful ones towards zero, to create a model which has the **least** difference in predictions between the **in-sample** and the **out-of-sample** datasets.

---

# A simple model is a good model!

[Occam's razor (the law of parsimony)](https://en.wikipedia.org/wiki/Occam%27s_razor)

- Do not make something more complicated than is necessary.

A simple model is a good model

- As long as it can still adequately answer our research questions.

---

# Lasso variable selection

1) First we set up the cross-validation for the lasso regression using the function `trainControl()` from the **caret** package.

.smaller[

``` r
fitControl <- trainControl(
  method = "repeatedcv", # We want to use repeated cross validation
  number = 5,            # Divide the dataset into 5 sections to train and test
  repeats = 10)          # Repeat this with 10 "cuts" of data
```

]

???

Talk through each of method/number/repeats

method - This stands for repeated cross validation, the process of taking out a "chunk" of the dataset and trying to predict it using the rest of the dataset. This is done to make sure the model is not overfitting

number - Split the data into 5 sections (20% removed each time) - the smaller the number the larger the "chunks" in the dataset.
This makes the fitting harder to do, but can help increase the validity of predictions

repeats - Repeat 10 times with different "cuts" of the data to ensure the result we get is a true relationship and not just due to a "lucky" or "unlucky" split of the data

---

# Cross validation

.smaller[

``` r
fitControl <- trainControl(
  method = "repeatedcv", # We want to use repeated cross validation
  number = 5,            # Divide the dataset into 5 sections to train and test
  repeats = 10)          # Repeat this with 10 "cuts" of data
```

]

<img src="../../images/regression/KfoldCV.gif" width="100%" />

???

Talk through this gif in the context of the last slide

---

# Run the lasso regression

1) First we subset our data to the potential independent variables of interest and the dependent variable.

.smaller[

``` r
data_use <- linelist %>%
  select(gender,  # These are the exposures
         age,     # we want to investigate
         hospital:bmi,
         blood_ct,
         outcome_death)
```

]

---

# Run the lasso regression

1) First we subset our data to the potential independent variables of interest and the dependent variable.

2) Then we remove `NA` values with `drop_na()`.

.smaller[

``` r
data_use <- linelist %>%
  select(gender,  # These are the exposures
         age,     # we want to investigate
         hospital:bmi,
         blood_ct,
         outcome_death) %>%
  drop_na()       # We can only use complete rows
```

]

---

# Run the lasso regression

1) First we subset our data to the potential independent variables of interest and the dependent variable.

2) Then we remove `NA` values with `drop_na()`.
3) Then we use `train()` from **caret** to run the lasso regression (in caret we specify `method = "glmnet"`) according to the conditions we specified with `trainControl()`.

.smaller[

``` r
data_use <- linelist %>%
  select(gender,  # These are the exposures
         age,     # we want to investigate
         hospital:bmi,
         blood_ct,
         outcome_death) %>%
  drop_na()       # We can only use complete rows

lasso_fit <- train(form = outcome_death ~ .,
                   data = data_use,
                   method = "glmnet",      # This is the caret name for lasso
                   trControl = fitControl) # This is the training method we specified
```

]

---

# Lasso results

Then we use the function `predictors()` from **caret** to extract the variables that the lasso model retained.

``` r
lasso_predictors_raw <- predictors(lasso_fit)

lasso_predictors_raw
```

```
## [1] "hospitalMilitary Hospital" "hospitalOther"
## [3] "hospitalPort Hospital"     "wt_kg"
## [5] "coughyes"
```

???

Talk through these predictors and why some have "hospital" in front of them (factor and then the level that is important)

---

# Lasso results

We can then process these to get them ready for use using `str_extract()` from the **stringr** package. This keeps only the part of each result that matches a column name in our dataset, stripping out the factor levels.

.smaller[

``` r
lasso_predictors_clean <- str_extract(string = lasso_predictors_raw,
                                      pattern = paste(colnames(data_use),
                                                      collapse = "|"))

lasso_predictors_clean
```

```
## [1] "hospital" "hospital" "hospital" "wt_kg"    "cough"
```

]

---

# Expert knowledge (biological plausibility) is important

Are we done now, did lasso solve our model building?

---

# Expert knowledge (biological plausibility) is important

Are we done now, did lasso solve our model building?

<font size="46">No!</font>

If there is a variable that you think should be included (based on biological plausibility or a previously established association), include it!

---

# Carrying out multivariable regression

Now we are ready to carry out our multivariable regression! This is done in a slightly different way to our univariate regression.
We will be using **base** R's `glm()` function and then tidying up the outputs using the function `tbl_regression()` from **gtsummary**.

---

# Carrying out multivariable regression

``` r
multivariable_regression_table <- data_use %>%
  select(lasso_predictors_clean, outcome_death, temp, age) %>%
```

???

Talk through the stages, what formula means, what family means, remind them what exponentiate = TRUE means

---

# Carrying out multivariable regression

``` r
multivariable_regression_table <- data_use %>%
  select(lasso_predictors_clean, outcome_death, temp, age) %>%
  glm(formula = outcome_death ~ ., family = binomial)
```

???

Talk through the stages, what formula means, what family means, remind them what exponentiate = TRUE means

---

# Carrying out multivariable regression

``` r
multivariable_regression_table <- data_use %>%
  select(lasso_predictors_clean, outcome_death, temp, age) %>%
  glm(formula = outcome_death ~ ., family = binomial) %>%
  tbl_regression(exponentiate = TRUE)
```

???

Talk through the stages, what formula means, what family means, remind them what exponentiate = TRUE means

---

# Carrying out multivariable regression

``` r
multivariable_regression_table <- data_use %>%
  select(lasso_predictors_clean, outcome_death, temp, age) %>%
  glm(formula = outcome_death ~ ., family = binomial) %>%
  tbl_regression(exponentiate = TRUE)

#Create a table of counts
cross_tab <- data_use %>%
  select(lasso_predictors_clean, outcome_death, temp, age) %>%
  tbl_summary(by = outcome_death)
```

???
Talk through the stages, what formula means, what family means, remind them what exponentiate = TRUE means

---

# Carrying out multivariable regression

``` r
multivariable_regression_table <- data_use %>%
  select(lasso_predictors_clean, outcome_death, temp, age) %>%
  glm(formula = outcome_death ~ ., family = binomial) %>%
  tbl_regression(exponentiate = TRUE)

#Create a table of counts
cross_tab <- data_use %>%
  select(lasso_predictors_clean, outcome_death, temp, age) %>%
  tbl_summary(by = outcome_death)

#Combine the regression table and the count table
tbl_merge(list(cross_tab, multivariable_regression_table),
          tab_spanner = c("Counts", "Multivariable regression"))
```

???

Talk through the stages, what formula means, what family means, remind them what exponentiate = TRUE means

---

# Carrying out multivariable regression
| Characteristic | Counts: outcome_death = 0 (N = 82)¹ | Counts: outcome_death = 1 (N = 135)¹ | OR² | 95% CI² | p-value |
|---|---|---|---|---|---|
| hospital | | | | | |
| Central Hospital | 8 (9.8%) | 12 (8.9%) | — | — | |
| Military Hospital | 22 (27%) | 24 (18%) | 0.54 | 0.17, 1.61 | 0.3 |
| Other | 22 (27%) | 22 (16%) | 0.52 | 0.17, 1.55 | 0.2 |
| Port Hospital | 24 (29%) | 69 (51%) | 1.53 | 0.53, 4.24 | 0.4 |
| SMMH | 6 (7.3%) | 8 (5.9%) | 0.85 | 0.20, 3.67 | 0.8 |
| wt_kg | 61 (48, 70) | 59 (50, 65) | 0.96 | 0.93, 1.00 | 0.058 |
| cough | 62 (76%) | 119 (88%) | | | |
| no | | | — | — | |
| yes | | | 2.24 | 1.04, 4.89 | 0.040 |
| temp | 38.90 (38.40, 39.30) | 38.90 (38.40, 39.20) | 0.89 | 0.64, 1.23 | 0.5 |
| age | 19 (11, 27) | 17 (10, 25) | 1.02 | 0.98, 1.07 | 0.3 |

¹ n (%); Median (Q1, Q3)

² OR = Odds Ratio, CI = Confidence Interval
---

# Displaying results

You can also use `tbl_merge()` to compare and contrast your univariate and multivariable regression results.

.tiny[
| Characteristic | N | Univariate OR¹ | Univariate 95% CI¹ | Univariate p-value | Multivariable OR¹ | Multivariable 95% CI¹ | Multivariable p-value |
|---|---|---|---|---|---|---|---|
| hospital | 217 | | | | | | |
| Central Hospital | | — | — | | — | — | |
| Military Hospital | | 0.73 | 0.24, 2.09 | 0.6 | 0.54 | 0.17, 1.61 | 0.3 |
| Other | | 0.67 | 0.22, 1.93 | 0.5 | 0.52 | 0.17, 1.55 | 0.2 |
| Port Hospital | | 1.92 | 0.68, 5.22 | 0.2 | 1.53 | 0.53, 4.24 | 0.4 |
| SMMH | | 0.89 | 0.22, 3.63 | 0.9 | 0.85 | 0.20, 3.67 | 0.8 |
| wt_kg | 217 | 0.98 | 0.96, 1.00 | 0.12 | 0.96 | 0.93, 1.00 | 0.058 |
| cough | 217 | | | | | | |
| no | | — | — | | — | — | |
| yes | | 2.40 | 1.16, 5.02 | 0.018 | 2.24 | 1.04, 4.89 | 0.040 |
| temp | 217 | 0.92 | 0.67, 1.25 | 0.6 | 0.89 | 0.64, 1.23 | 0.5 |
| age | 217 | 0.99 | 0.97, 1.01 | 0.4 | 1.02 | 0.98, 1.07 | 0.3 |

¹ OR = Odds Ratio, CI = Confidence Interval
]

---

# Displaying results

.pull-left[

And you can create forest plots using `ggplot()`

]

.pull-right[

<img src="multivariable_regression_files/figure-html/unnamed-chunk-22-1.png" width="600px" height="400px" />

]

---

# It's now time for you to try it out yourselves!

Any questions?

**Resources**

- Course website (initial setup and slides access): [https://courses.appliedepi.org/statsr/](https://courses.appliedepi.org/statsr/)
- [Epi R Handbook](https://epirhandbook.com/en/) 50 chapters of best-practice code examples available online and offline
- [Applied Epi Community](https://community.appliedepi.org/) A great resource for asking questions and getting help!
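---

# Displaying results

A minimal sketch of such a forest plot with **ggplot2**. The model, data (built-in `mtcars`), and variable names here are hypothetical stand-ins, not the course's `linelist` model:

.smaller[

```r
library(ggplot2)

# Hypothetical model - in the course this would be the multivariable glm() fit
fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# Odds ratios with Wald 95% confidence intervals, intercept dropped
se  <- sqrt(diag(vcov(fit)))
est <- data.frame(term  = names(coef(fit))[-1],
                  or    = exp(coef(fit))[-1],
                  lower = exp(coef(fit) - 1.96 * se)[-1],
                  upper = exp(coef(fit) + 1.96 * se)[-1])

# Forest plot: one row per term, point = OR, horizontal line = 95% CI
forest <- ggplot(est, aes(x = or, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = lower, xmax = upper), height = 0.2) +
  geom_vline(xintercept = 1, linetype = "dashed") + # OR = 1 reference line
  scale_x_log10() +
  labs(x = "Odds ratio (95% CI, log scale)", y = NULL)

forest
```

]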