Introduction to R for Applied Epidemiology

class: center, middle, inverse, title-slide

<style>
.medium-large-table table {
  font-size: 10px;     
}

.medium-large-table2 table {
  font-size: 11px;     
}

.small-code .remark-code{
  font-size: 40%
}
</style>

# Introduction to R for <br> Applied Epidemiology

### The Ebola case study and data cleaning

contact@appliedepi.org

---
# Objectives & schedule

* Create a new RStudio project for the Ebola case study  
* Import data from a project subfolder using `import()` and `here()`  
* Gain familiarity with {dplyr} data cleaning functions  
* Begin writing a cleaning command using the `%>%` pipe operator

<br>

|Time       |Topic                          |
|:----------|:------------------------------|
|10 minutes |Set up of the Ebola case study |
|20 minutes |Functions for data cleaning    |
|10 minutes |Demo of data cleaning          |
|2 hours    |Exercise                       |
|20 minutes |Debrief                        |

*Take breaks as you wish during the exercise*

???
Note stretch breaks throughout.

---
# Review

- **RStudio projects** - a home for data and scripts for a particular analysis

- Running commands in **an R script**, with comments

- Creating **objects** with the assignment operator **`<-`**

- Using **functions** like `max()`, `min()`, and `paste()`

- Importing a dataset with **`import()`**

- Reviewing a dataset with `skim()` and `summary()`

- Columns have **classes** that can be checked with `class()`
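
As a quick refresher, the review points above might look like this in a script. This is a minimal sketch using only base R; the object names (`ages`, `label`) and the values are made up for illustration.

``` r
# create an object with the assignment operator
ages <- c(23, 1, 16, 10)

# use functions on the object
max(ages)                                            # largest value: 23
min(ages)                                            # smallest value: 1
label <- paste("Oldest case:", max(ages), "years")   # combine text and numbers
label                                                # "Oldest case: 23 years"

# check the class of each object
class(ages)                                          # "numeric"
class(label)                                         # "character"
```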

---
class: inverse, center, middle

# The Ebola case study

Modules 2-9 will use data from a simulated Ebola outbreak in Sierra Leone.

---
# A new RStudio project

.pull-left[

The exercise will guide you to create a new **RStudio project** in the "intro_course/**ebola**/" folder.

]

.pull-right[

📂 intro_course
* 📁 module1  
* 📂 covid  
* **📂 ebola**  
  * **ebola.Rproj**
  * 📁 data  
  * 📁 outputs  
  * 📂 scripts

]

---
# A new R Script

.pull-left[

You will write a new R script named "ebola_analysis.R" to hold your commands.

The script will be saved in the subfolder "ebola/**scripts**/"

<img src="../../images/data_cleaning/ebola_setup.png" width="100%" height="200%" />
]

.pull-right[

📁 intro_course
* 📁 module1  
* 📂 covid  
* 📁 **ebola**  
  * **ebola.Rproj**
  * 📁 data  
  * 📂 outputs  
  * 📂 **scripts**  
      * **ebola_analysis.R**

]

---
# Load packages

What will be your first command in the new R script? What function will it use?

Use **`pacman::p_load()`** to **load the packages** needed for the analysis

``` r
pacman::p_load(
     rio,          # for importing data
     here,         # for relative file paths
     skimr,        # for reviewing the data
     janitor,      # for cleaning data
     epikit,       # for creating age categories
     tidyverse     # for data management and visualization
)
```

---
# Import data from a subfolder

The Ebola linelist is saved in the new project's "**data**/**raw**/" subfolder:

📁 **ebola**  
  * ebola.Rproj
  * 📂 **data**  
    * 📂 clean  
    * 📂 **raw**  
      * **surveillance_linelist_20141201.csv**  
  * 📁 scripts  
  * 📂 outputs

`import()` expects a *file path* - the data's location or "address" on your computer.

Will this command work to import the Ebola linelist?

``` r
import("surveillance_linelist_20141201.csv")
```

**No**, you need to specify which *subfolder* of the project the data is saved in.

---
# File paths

**Avoid** the fragile "absolute" file path *(only works on one computer)*

``` r
import("C:/Users/Me/Docs/intro_course/ebola/data/raw/surveillance_linelist_20141201.csv")
```

**In an RStudio project** the path can start from the project root folder

``` r
import("data/raw/surveillance_linelist_20141201.csv") # works on almost any computer
```

**Use `here()` to create the file path** without slashes

`here("data", "raw", "surveillance_linelist_20141201.csv")`

**The final step** is to place the `here()` file path command *within* `import()`

``` r
surv_raw <- import(here("data", "raw", "surveillance_linelist_20141201.csv"))
```

*The `<-` operator saves the dataset as an object with the name `surv_raw`.*
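
To see what `import()` actually receives, it can help to print the file path on its own. The sketch below uses base R's `file.path()` as a stand-in for `here()`: it joins the same pieces into a relative path, whereas `here()` additionally prepends the project root folder. The `file.exists()` check is an extra suggestion, not part of the course exercise.

``` r
# join folder and file names without typing slashes (base R stand-in for here())
rel_path <- file.path("data", "raw", "surveillance_linelist_20141201.csv")
rel_path
# "data/raw/surveillance_linelist_20141201.csv"

# check whether a file exists at that location, relative to the working directory
# (FALSE would explain an import() error)
file.exists(rel_path)
```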

???
We teach them here() because it removes the need to handle slashes, and it really helps when you get to automated reports.

---

class: medium-large-table

# The data

*Interactive table of the first 25 rows of `surv_raw`. Columns: row_num, case_id, onset date, sex, age, age unit, hospital, wt (kg), ht (cm), fever, chills, cough, aches, vomit, temp, bmi, adm3_name_res, admin3pcod, adm3_name_det, lab_confirmed, date of report, lat, lon, epilink.*

*These rows include messy values such as a negative weight (-11 kg), an age recorded in months, and blank hospital names.*

.footnote[Only 25 rows are shown here]

???
Table shows just the first 25 rows, to load faster.

---
class: inverse, center, middle

# Live demonstration

## New RStudio project and R script

---
class: inverse, center, middle

## Cleaning data in R

---
# Clean data, messy data

Now your data are imported. What is typically involved in "cleaning" a dataset?

.pull-left[
<img src="../../images/data_cleaning/tidy_broom.png" width="75%" />
]

.pull-right[

- Prepare for analysis and visualization

- Standardize column names

- Subset rows and columns

- Align spellings

- Create categorical and calculated variables

- Join with other data

- Remove duplicates... 
 
]

.footnote[]

???
Ask the participants what steps they take to clean datasets

---

# The {dplyr} package

.pull-left[
<img src="../../images/data_cleaning/dplyr_hex.png" width="75%" />
]

.pull-right[

* The easiest and most versatile package for data cleaning

* It is installed as part of the {tidyverse}, a collection of R packages

* The {tidyverse} has transformed R in the last 10 years

]

.footnote["dplyr" is shorthand for "data plier" - a plier is the handheld tool pictured above]

???
Tidyverse has made R coding much more user-friendly, intuitive, and accessible to beginner coders

---

# Practice dataset

Let's use a mini **`surv_raw`** dataset to practice some core R functions.

|case_id | age|sex |lab_confirmed |onset date | wt (kg)|
|:-------|---:|:---|:-------------|:----------|-------:|
|694928  |  23|m   |FALSE         |11/9/2014  |      70|
|86340d  |   0|f   |TRUE          |10/30/2014 |      18|
|92d002  |  16|m   |TRUE          |8/16/2014  |      59|
|544bd1  |  10|f   |TRUE          |8/29/2014  |      39|
|544bd1  |  10|f   |TRUE          |8/29/2014  |      39|
|544bd1  |  10|f   |FALSE         |8/29/2014  |      39|
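
If you would like to follow along, the mini dataset shown above can be typed in by hand. This is a sketch using base R's `data.frame()`; in the course the data come from `import()`, so treat this only as a convenience for experimenting. Setting `check.names = FALSE` keeps the messy column names (with spaces and parentheses) exactly as shown.

``` r
# hand-typed copy of the practice dataset (for experimenting only)
surv_raw <- data.frame(
  case_id       = c("694928", "86340d", "92d002", "544bd1", "544bd1", "544bd1"),
  age           = c(23, 0, 16, 10, 10, 10),
  sex           = c("m", "f", "m", "f", "f", "f"),
  lab_confirmed = c(FALSE, TRUE, TRUE, TRUE, TRUE, FALSE),
  `onset date`  = c("11/9/2014", "10/30/2014", "8/16/2014",
                    "8/29/2014", "8/29/2014", "8/29/2014"),
  `wt (kg)`     = c(70, 18, 59, 39, 39, 39),
  check.names   = FALSE   # do not convert the names to "onset.date", "wt..kg."
)

names(surv_raw)   # column names, including the two with spaces
nrow(surv_raw)    # 6 rows
```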

---

# Functions for today

Function       | Utility                               
---------------|---------------------------------------
`filter()`|subset **rows**
`select()`|subset **columns**
`clean_names()`|standardize column names  
`rename()`|rename columns manually 
`mutate()`|create and transform columns 
`mdy()`, `dmy()`, `ymd()` |tell R how to understand dates

---
class: medium-large-table2

# `filter()` rows

.pull-left[

``` r
filter(surv_raw)
```

1st argument: a data frame

]

.pull-right[

]

---
class: medium-large-table2

# `filter()` rows

.pull-left[

``` r
filter(surv_raw, age < 18)
```

2nd+ arguments: logical tests for rows to be *kept*

]

.pull-right[

|case_id | age|sex |lab_confirmed |onset date | wt (kg)|
|:-------|---:|:---|:-------------|:----------|-------:|
|86340d  |   0|f   |TRUE          |10/30/2014 |      18|
|92d002  |  16|m   |TRUE          |8/16/2014  |      59|
|544bd1  |  10|f   |TRUE          |8/29/2014  |      39|
|544bd1  |  10|f   |TRUE          |8/29/2014  |      39|
|544bd1  |  10|f   |FALSE         |8/29/2014  |      39|

]

---
class: medium-large-table2

# `filter()` rows

.pull-left[

``` r
filter(surv_raw, age < 18, sex == "f")
```

2nd+ arguments: logical tests for rows to be *kept*

]

.pull-right[

|case_id | age|sex |lab_confirmed |onset date | wt (kg)|
|:-------|---:|:---|:-------------|:----------|-------:|
|86340d  |   0|f   |TRUE          |10/30/2014 |      18|
|544bd1  |  10|f   |TRUE          |8/29/2014  |      39|
|544bd1  |  10|f   |TRUE          |8/29/2014  |      39|
|544bd1  |  10|f   |FALSE         |8/29/2014  |      39|

]

.footnote[Note the use of double equals `==` to test equality]

---
class: medium-large-table2

# `filter()` rows

.pull-left[

``` r
filter(surv_raw, 
  age < 18 & 
  (sex == "f" | lab_confirmed == TRUE)
)
```

*Newlines and indents do not impact code*

The logic can get complex using:
* `&` (AND) 
* `|` (OR)
* Parentheses

]

.pull-right[

]
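
One way to understand a complex `filter()` condition is to evaluate the logical test on its own. The sketch below uses plain vectors holding the `age`, `sex`, and `lab_confirmed` values of the six practice rows; `filter()` keeps the rows where the result is `TRUE`.

``` r
# values from the six rows of the practice dataset
age           <- c(23, 0, 16, 10, 10, 10)
sex           <- c("m", "f", "m", "f", "f", "f")
lab_confirmed <- c(FALSE, TRUE, TRUE, TRUE, TRUE, FALSE)

# the same test used inside filter(), evaluated once per row
keep <- age < 18 & (sex == "f" | lab_confirmed == TRUE)
keep
# FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

sum(keep)   # 5 rows would be kept
```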

---
class: medium-large-table2

# `select()` columns

.pull-left[

``` r
select(surv_raw, ___) 
```

`select()` also expects a data frame as the first argument

]

.pull-right[

]

---
class: medium-large-table2

# `select()` columns

.pull-left[

``` r
select(surv_raw, case_id, age)
```

You can provide `select()` with column names to *keep*

]

.pull-right[

|case_id | age|
|:-------|---:|
|694928  |  23|
|86340d  |   0|
|92d002  |  16|
|544bd1  |  10|
|544bd1  |  10|
|544bd1  |  10|

]

---
class: medium-large-table2

# `select()` columns

.pull-left[

``` r
select(surv_raw, case_id, age, sex)
```

You can provide `select()` with column names to *keep*

]

.pull-right[

|case_id | age|sex |
|:-------|---:|:---|
|694928  |  23|m   |
|86340d  |   0|f   |
|92d002  |  16|m   |
|544bd1  |  10|f   |
|544bd1  |  10|f   |
|544bd1  |  10|f   |

]

---
class: medium-large-table2

# `select()` columns

.pull-left[

``` r
select(surv_raw, -case_id, -lab_confirmed)
```

Or you can designate which columns to *remove* with `-`

]

.pull-right[

| age|sex |onset date | wt (kg)|
|---:|:---|:----------|-------:|
|  23|m   |11/9/2014  |      70|
|   0|f   |10/30/2014 |      18|
|  16|m   |8/16/2014  |      59|
|  10|f   |8/29/2014  |      39|
|  10|f   |8/29/2014  |      39|
|  10|f   |8/29/2014  |      39|

]

---

# `filter()` *and* `select()`?

Yes! Use the **%>%** "pipe" operator to "pass" data from one function to the next.

.pull-left[

It is like saying the words **"and then"**.

A typical cleaning command contains a *sequence* of linked steps

* Rename columns  
* Filter rows  
* Select columns  
* Deduplicate  
* Clean values...

]

.pull-right[

]

---
class: medium-large-table2
# Piping data

Previously, the 1st argument was the data frame

`filter(`**surv_raw**`, age < 18)`

Using pipes, this is now written as:

**surv_raw** `%>% filter(age < 18)`

You can pipe the data through *multiple* functions  
`surv_raw` **%>%** `filter(age < 18)` **%>%** `select(case_id, age, sex)`

|case_id | age|sex |
|:-------|---:|:---|
|86340d  |   0|f   |
|92d002  |  16|m   |
|544bd1  |  10|f   |
|544bd1  |  10|f   |
|544bd1  |  10|f   |
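
The piped command above gives the same result as nesting the functions inside one another; the pipe simply makes the order of steps easier to read. A minimal sketch, assuming {dplyr} is installed and using a hand-typed copy of three columns from the practice data:

``` r
library(dplyr)

# hand-typed copy of the practice data (three columns only)
surv_raw <- data.frame(
  case_id = c("694928", "86340d", "92d002", "544bd1", "544bd1", "544bd1"),
  age     = c(23, 0, 16, 10, 10, 10),
  sex     = c("m", "f", "m", "f", "f", "f")
)

# nested: read from the inside out
nested <- select(filter(surv_raw, age < 18), case_id, age, sex)

# piped: read from top to bottom ("and then")
piped <- surv_raw %>% 
  filter(age < 18) %>% 
  select(case_id, age, sex)

identical(nested, piped)   # TRUE
```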

---
# Vertical coding style

A *vertical* style with indents does not impact the code, but makes it more readable!

``` r
surv_raw
```

---
# Vertical coding style

A *vertical* style with indents does not impact the code, but makes it more readable!

``` r
surv_raw %>% 
  select(case_id, age, sex, lab_confirmed)                  # select columns
```

|case_id | age|sex |lab_confirmed |
|:-------|---:|:---|:-------------|
|694928  |  23|m   |FALSE         |
|86340d  |   0|f   |TRUE          |
|92d002  |  16|m   |TRUE          |
|544bd1  |  10|f   |TRUE          |
|544bd1  |  10|f   |TRUE          |
|544bd1  |  10|f   |FALSE         |

---
# Vertical coding style

The **`%>%`** pipe passes the dataset to the next step

``` r
surv_raw %>% 
  select(case_id, age, sex, lab_confirmed) %>%              # select columns
  distinct()                                                # de-duplicate
```

---
# Vertical coding style

The **`%>%`** pipe passes the dataset to the next step

``` r
surv_raw %>% 
  select(case_id, age, sex, lab_confirmed) %>%              # select columns
  distinct() %>%                                            # de-duplicate
  filter(age < 18, lab_confirmed == TRUE)                   # confirmed cases in children
```

|case_id | age|sex |lab_confirmed |
|:-------|---:|:---|:-------------|
|86340d  |   0|f   |TRUE          |
|92d002  |  16|m   |TRUE          |
|544bd1  |  10|f   |TRUE          |

---
# Vertical coding style

The **`%>%`** pipe passes the dataset to the next step

``` r
surv_raw %>% 
  select(case_id, age, sex, lab_confirmed) %>%              # select columns
  distinct() %>%                                            # de-duplicate
  filter(age < 18, lab_confirmed == TRUE) %>%               # confirmed cases in children
  mutate(infant = ifelse(age < 1, "infant", "not infant"))  # create a column      
```

|case_id | age|sex |lab_confirmed |infant     |
|:-------|---:|:---|:-------------|:----------|
|86340d  |   0|f   |TRUE          |infant     |
|92d002  |  16|m   |TRUE          |not infant |
|544bd1  |  10|f   |TRUE          |not infant |

---
# Vertical coding style

Is there a pipe operator at the end of this workflow?

``` r
surv_raw %>% 
  select(case_id, age, sex, lab_confirmed) %>%              # select columns
  distinct() %>%                                            # de-duplicate
  filter(age < 18, lab_confirmed == TRUE) %>%               # confirmed cases in children
* mutate(infant = ifelse(age < 1, "infant", "not infant"))
```

The pipes connect all these functions into one, linked command.  
How would you run this command in RStudio?

---
# Clean the column names

We can observe changes to column names by printing them with `names()`

``` r
# print current column names
names(surv_raw)  
```

```
## [1] "case_id"       "age"           "sex"          
## [4] "lab_confirmed" "onset date"    "wt (kg)"
```

---
# Clean the column names

Equivalently, `surv_raw` can be passed to `names()` using a **pipe**:

``` r
surv_raw %>%  # begin with raw data
  names()     # print current column names                           
```

```
## [1] "case_id"       "age"           "sex"          
## [4] "lab_confirmed" "onset date"    "wt (kg)"
```

Apply `clean_names()` to `surv_raw` by inserting it into the pipe sequence.  
This standardizes column names (lowercase, no spaces or special characters).

``` r
surv_raw %>%            # begin with raw data
* clean_names() %>%     # standardize column names
  names()               # print current column names
```

```
## [1] "case_id"       "age"           "sex"          
## [4] "lab_confirmed" "onset_date"    "wt_kg"
```

*See changes to the final two columns*

---
# Clean the column names

Equivalently, `surv_raw` can be passed to `names()` using a **pipe**:

``` r
surv_raw %>%  # begin with raw data
  names()     # print current column names                           
```

```
## [1] "case_id"       "age"           "sex"          
## [4] "lab_confirmed" "onset date"    "wt (kg)"
```

Then, pipe the **cleaned** column names to `rename()` for manual edits.  
Note that `rename()` references the **cleaned** column names (`onset_date`).

``` r
surv_raw %>%                        # begin with raw data
  clean_names() %>%                 # standardize column names 
* rename(                           # manual edits
*     age_years  = age,             # NEW = OLD
*     date_onset = onset_date) %>%
  names()                           # print current column names
```

```
## [1] "case_id"       "age_years"     "sex"          
## [4] "lab_confirmed" "date_onset"    "wt_kg"
```

---

# Printing vs. saving

Click the tabs to see the difference.

.panelset[
.panel[.panel-name[Printing]

The previous changes to `surv_raw` were **not** saved.

We only *printed with modifications*.

``` r
# modify, then print column names
*surv_raw %>%                        # start with raw data
  clean_names() %>%                 # standardize column names 
  rename(                           # manual edits 
      age_years  = age,             # NEW = OLD    
      date_onset = onset_date) %>%                        
* names()                           # print current column names
```

**`surv_raw`** still has the *original column names*!

``` r
names(surv_raw) 
```

```
## [1] "case_id"       "age"           "sex"          
## [4] "lab_confirmed" "onset date"    "wt (kg)"
```

]

.panel[.panel-name[Saving]

Use the assignment operator **`<-`** to save the changes to a new **`surv_clean`** data frame.

No output is printed, but the new object will appear in the RStudio Environment.

``` r
# create new data frame
*surv_clean <- surv_raw %>%
  clean_names() %>%          
  rename(                           
      age_years  = age,                 
      date_onset = onset_date)
```

**`surv_clean`** has the *cleaned column names*!

``` r
names(surv_clean) 
```

```
## [1] "case_id"       "age_years"     "sex"          
## [4] "lab_confirmed" "date_onset"    "wt_kg"
```

]
]

---
class: medium-large-table2

# `mutate()` to *create* columns  
 
The syntax is:

``` r
DATASET %>% 
  mutate(NEW_COLUMN_NAME = A_FUNCTION(arguments))
```

.pull-left[

``` r
surv_raw %>% 
  mutate(age_group = ifelse(
    test = age >= 18,
    yes = "adult",  
    no = "minor")) 
```

`ifelse()` tests each row and writes a value to the new `age_group` column:

* "adult" if the test is TRUE  
* "minor" if the test is FALSE

]

.pull-right[

|case_id | age|sex |lab_confirmed |onset date | wt (kg)|age_group |
|:-------|---:|:---|:-------------|:----------|-------:|:---------|
|694928  |  23|m   |FALSE         |11/9/2014  |      70|adult     |
|86340d  |   0|f   |TRUE          |10/30/2014 |      18|minor     |
|92d002  |  16|m   |TRUE          |8/16/2014  |      59|minor     |
|544bd1  |  10|f   |TRUE          |8/29/2014  |      39|minor     |
|544bd1  |  10|f   |TRUE          |8/29/2014  |      39|minor     |
|544bd1  |  10|f   |FALSE         |8/29/2014  |      39|minor     |

]
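
`ifelse()` also works on a plain vector, which is a convenient way to test the logic before placing it inside `mutate()`. A small sketch with practice ages plus one missing value (the `NA` is added here for illustration): a missing age yields a missing category rather than "minor".

``` r
# ages from the practice data, plus one missing value for illustration
age <- c(23, 0, 16, 10, NA)

age_group <- ifelse(test = age >= 18, yes = "adult", no = "minor")
age_group
# "adult" "minor" "minor" "minor" NA
```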

---
class: medium-large-table2

# `mutate()` to *edit* columns  
 
The syntax is similar:

``` r
DATASET %>% 
  mutate(SAME_COLUMN_NAME = A_FUNCTION(arguments))
```

.pull-left[

``` r
surv_raw %>% 
  mutate(sex = recode(sex,  
    "m" = "male",           
    "f" = "female"))        
```

Column `sex` is overwritten.

`recode()` starts with original `sex` column and applies changes:

* "m" to "male"  
* "f" to "female"

]

.pull-right[

|case_id | age|sex    |lab_confirmed |onset date | wt (kg)|
|:-------|---:|:------|:-------------|:----------|-------:|
|694928  |  23|male   |FALSE         |11/9/2014  |      70|
|86340d  |   0|female |TRUE          |10/30/2014 |      18|
|92d002  |  16|male   |TRUE          |8/16/2014  |      59|
|544bd1  |  10|female |TRUE          |8/29/2014  |      39|
|544bd1  |  10|female |TRUE          |8/29/2014  |      39|
|544bd1  |  10|female |FALSE         |8/29/2014  |      39|

]
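
Values that are not listed in `recode()` are left unchanged, so it is worth checking the result for unexpected spellings. A sketch on a plain character vector, assuming {dplyr} is installed; the value "unknown" is invented for illustration.

``` r
library(dplyr)

# example values; "unknown" is not part of the practice data
sex <- c("m", "f", "f", "unknown")

sex_clean <- recode(sex, "m" = "male", "f" = "female")
sex_clean
# "male" "female" "female" "unknown"
```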

---
# `mutate()` with dates

The `class()` of date columns should be "Date", not "character".

To change the class, you must *tell* R how to understand the raw dates.

.pull-left[

Dates come in many formats:

Is "03/09/2024" the 9th of March, or the 3rd of September?

]

.pull-right[

]
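
The ambiguity can be seen by parsing the same text with two different {lubridate} functions. A brief sketch, assuming {lubridate} is installed:

``` r
library(lubridate)

raw_date <- "03/09/2024"

as_dmy <- dmy(raw_date)   # read as day-month-year
as_mdy <- mdy(raw_date)   # read as month-day-year

as_dmy   # "2024-09-03" (3rd of September)
as_mdy   # "2024-03-09" (9th of March)
```

Both results are valid dates, so R cannot warn you about a wrong choice here; only knowledge of how the data were recorded tells you which function is correct.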

.footnote[More details in this [Epi R Handbook chapter](https://epirhandbook.com/en/new_pages/dates.html)]

---
# Convert to date class

Within `mutate()`, use the {lubridate} function that matches the *raw date format*. Only the order of year, month, and day matters; the separator (such as `-` or `/`) does not.

* `ymd()` if raw values are YYYY-MM-DD  
* `dmy()` if raw values are DD-MM-YYYY  
* `mdy()` if raw values are MM-DD-YYYY

.pull-left[

``` r
surv_clean %>% 
  select(case_id, date_onset) %>% 
  tibble()
```

```
## # A tibble: 6 × 2
##   case_id date_onset
##   <chr>   <chr>     
## 1 694928  11/9/2014 
## 2 86340d  10/30/2014
## 3 92d002  8/16/2014 
## 4 544bd1  8/29/2014 
## 5 544bd1  8/29/2014 
## 6 544bd1  8/29/2014
```

]

.pull-right[

``` r
surv_clean %>% 
* mutate(date_onset = mdy(date_onset)) %>%
  select(case_id, date_onset) %>% 
  tibble()
```

```
## # A tibble: 6 × 2
##   case_id date_onset
##   <chr>   <date>    
## 1 694928  2014-11-09
## 2 86340d  2014-10-30
## 3 92d002  2014-08-16
## 4 544bd1  2014-08-29
## 5 544bd1  2014-08-29
## 6 544bd1  2014-08-29
```

]

.footnote[The `tibble()` display shows the class of each column above its values.]

---
class: inverse, center, middle

## Exercise!

Go to the course website  
Open the exercise for Module 2, and log in  
Follow the instructions to create a new RStudio project and begin coding  
Let an instructor know if you are unsure what to do