class: center, middle, inverse, title-slide <style type="text/css"> .remark-slide table{ border: none } .remark-slide-table { } tr:first-child { border-top: none; } tr:last-child { border-bottom: none; } </style> <style type="text/css"> /* THIS IS A CSS CHUNK - THIS IS A COMMENT */ /* Size of font in code echo. E.g. 10px or 50% */ .remark-code { font-size: 70%; } /* Size of font in text */ .medium-text { font-size: 75%; } /* Size of font in tables */ .small-table table { font-size: 6px; } .medium-table table { font-size: 8px; } .medium-large-table table { font-size: 10px; } .medium-large-table2 table { font-size: 11px; } .small-code .remark-code{ font-size: 40% } </style> # Introduction to R for </br> Applied Epidemiology ### The Ebola case study and data cleaning contact@appliedepi.org --- # Objectives & schedule * Create a new RStudio project for the Ebola case study * Import data from a project subfolder using `import()` and `here()` * Gain familiarity with {dplyr} data cleaning functions * Begin writing a cleaning command using the `%>%` pipe operator </br> <div class="tabwid"><style>.cl-b1bef610{}.cl-b1bb03fc{font-family:'Helvetica';font-size:11pt;font-weight:normal;font-style:normal;text-decoration:none;color:rgba(0, 0, 0, 1.00);background-color:transparent;}.cl-b1bca554{margin:0;text-align:left;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);padding-bottom:5pt;padding-top:5pt;padding-left:5pt;padding-right:5pt;line-height: 1;background-color:transparent;}.cl-b1bcb71a{width:1.033in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 1.5pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-b1bcb72e{width:2.358in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 1.5pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-b1bcb738{width:1.033in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-b1bcb742{width:2.358in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-b1bcb743{width:1.033in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-b1bcb756{width:2.358in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-b1bcb760{width:1.033in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-b1bcb76a{width:2.358in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-b1bcb774{width:1.033in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-b1bcb77e{width:2.358in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-b1bcb77f{width:1.033in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(255, 255, 255, 0.00);border-top: 0 solid rgba(255, 255, 255, 0.00);border-left: 0 solid rgba(255, 255, 255, 0.00);border-right: 0 solid rgba(255, 255, 255, 0.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-b1bcb788{width:2.358in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(255, 255, 255, 0.00);border-top: 0 solid rgba(255, 255, 255, 0.00);border-left: 0 solid rgba(255, 255, 255, 0.00);border-right: 0 solid rgba(255, 255, 255, 0.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}</style><table data-quarto-disable-processing='true' class='cl-b1bef610'><thead><tr style="overflow-wrap:break-word;"><th class="cl-b1bcb71a"><p class="cl-b1bca554"><span class="cl-b1bb03fc">Time</span></p></th><th class="cl-b1bcb72e"><p class="cl-b1bca554"><span class="cl-b1bb03fc">Topic</span></p></th></tr></thead><tbody><tr style="overflow-wrap:break-word;"><td class="cl-b1bcb738"><p class="cl-b1bca554"><span class="cl-b1bb03fc">10 minutes</span></p></td><td class="cl-b1bcb742"><p class="cl-b1bca554"><span class="cl-b1bb03fc">Set up of the Ebola case study</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-b1bcb743"><p class="cl-b1bca554"><span class="cl-b1bb03fc">20 minutes</span></p></td><td class="cl-b1bcb756"><p class="cl-b1bca554"><span class="cl-b1bb03fc">Functions for data cleaning</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-b1bcb743"><p class="cl-b1bca554"><span class="cl-b1bb03fc">10 minutes</span></p></td><td class="cl-b1bcb756"><p class="cl-b1bca554"><span class="cl-b1bb03fc">Demo of data cleaning</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-b1bcb760"><p class="cl-b1bca554"><span class="cl-b1bb03fc">2 hours</span></p></td><td class="cl-b1bcb76a"><p class="cl-b1bca554"><span class="cl-b1bb03fc">Exercise</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-b1bcb774"><p class="cl-b1bca554"><span class="cl-b1bb03fc">20 minutes</span></p></td><td class="cl-b1bcb77e"><p class="cl-b1bca554"><span class="cl-b1bb03fc">Debrief</span></p></td></tr></tbody><tfoot><tr style="overflow-wrap:break-word;"><td colspan="2"class="cl-b1bcb77f"><p class="cl-b1bca554"><span class="cl-b1bb03fc">Take breaks as you wish during the exercise</span></p></td></tr></tfoot></table></div> ??? Note stretch breaks throughout. --- # Review - **RStudio projects** - a home for data and scripts for a particular analysis -- - Running commands in **an R script**, with comments -- - Creating **objects** with the assignment operator **`<-`** -- - Using **functions** like `max()`, `min()`, and `paste()` -- - Importing a dataset with **`import()`** -- - Reviewing a dataset with `skim()` and `summary()` -- - Columns have **classes** that can be checked with `class()` --- class: inverse, center, middle # The Ebola case study Modules 2-9 will use data from a simulated Ebola outbreak in Sierra Leone. --- # A new RStudio project .pull-left[ The exercise will guide you to create a new **RStudio project** in the "intro_course/**ebola**/" folder. <img src="../../images/functions_packages/project_briefcase.png" width="85%" /> ] .pull-right[ 📂 intro_course * 📁 module1 * 📂 covid * **📂 ebola** * **ebola.Rproj** * 📁 data * 📁 outputs * 📂 scripts ] --- # A new R Script .pull-left[ You will write a new R script named "ebola_analysis.R" to hold your commands. The script will be saved in the subfolder "ebola/**scripts**/" <img src="../../images/data_cleaning/ebola_setup.png" width="100%" height="200%" /> ] .pull-right[ 📁 intro_course * 📁 module1 * 📂 covid * 📁 **ebola** * **ebola.Rproj** * 📁 data * 📂 outputs * 📂 **scripts** * **ebola_analysis.R** ] --- # Load packages What will be your first command in the new R script? What function will it use? -- Use **`pacman::p_load()`** to **load the packages** needed for the analysis ``` r pacman::p_load( rio, # for importing data here, # for relative file paths skimr, # for reviewing the data janitor, # for cleaning data epikit, # for creating age categories tidyverse # for data management and visualization ) ``` --- # Import data from a subfolder The ebola linelist is saved in the new project's "**data**/**raw**/" subfolder: 📁 **ebola** * ebola.Rproj * 📂 **data** * 📂 clean * 📂 **raw** * **surveillance_linelist_20141201.csv** * 📁 scripts * 📂 outputs `import()` expects a *file path* - the data's location or "address" on your computer. Will this command work to import the Ebola linelist? ``` r import("surveillance_linelist_20141201.csv") ``` -- **No**, you need to specify which *subfolder* of the project the data is saved in. --- # File paths **Avoid** the fragile "absolute" file path *(only works on one computer)* ``` r import("C:/Users/Me/Docs/intro_course/ebola/data/raw/surveillance_linelist_20141201.csv") ``` -- **In an RStudio project** the path can start from the project root folder ``` r import("data/raw/surveillance_linelist_20141201.csv") # works on almost any computer ``` -- **Use `here()` to create the file path** without slashes `here("data", "raw", "surveillance_linelist_raw.csv")` -- **The final step** is to place the `here()` file path command *within* `import()` ``` r surv_raw <- import(here("data", "raw", "surveillance_linelist_20141201.csv")) ``` -- *The `<-` operator saves the dataset as an object with the name `surv_raw`.* ??? We teach them here() because it removes the need to handle slashes, and it really helps when you get to automated reports. --- class: medium-large-table # The data
.footnote[Only 25 rows are shown here] ??? Table shows just the first 25 rows, to load faster. --- class: inverse, center, middle # Live demonstration ## New RStudio project and R script <img src="../../images/functions_packages/piano_man.jpg" width="50%" /> --- class: inverse, center, middle ## Cleaning data in R <img src="../../images/data_cleaning/data_cleaning.png" width="75%" /> --- # Clean data, messy data Now your data are imported. What is typically involved in "cleaning" a dataset? .pull-left[ <img src="../../images/data_cleaning/tidy_broom.png" width="75%" /> ] -- .pull-right[ - Prepare for analysis and visualization - Standardize column names - Subset rows and columns - Align spellings - Create categorical and calculated variables - Join with other data - Remove duplicates... ] .footnote[] ??? Ask the participants what steps they take to clean datasets --- # The {dplyr} package .pull-left[ <img src="../../images/data_cleaning/dplyr_hex.png" width="75%" /> ] .pull-right[ * The easiest and most versatile package for data cleaning * This package is installed in a universe of {tidyverse} R packages * The {tidyverse} has transformed R in the last 10 years ] .footnote["dplyr" is shorthand for "data plier" - a plier is the handheld tool pictured above] ??? Tidyverse has made R coding much more user-friendly, intuitive, and accessible to beginner coders --- # Practice dataset Let's use a mini **`surv_raw`** dataset to practice some core R functions. |case_id | age|sex |lab_confirmed |onset date | wt (kg)| |:-------|---:|:---|:-------------|:----------|-------:| |694928 | 23|m |FALSE |11/9/2014 | 70| |86340d | 0|f |TRUE |10/30/2014 | 18| |92d002 | 16|m |TRUE |8/16/2014 | 59| |544bd1 | 10|f |TRUE |8/29/2014 | 39| |544bd1 | 10|f |TRUE |8/29/2014 | 39| |544bd1 | 10|f |FALSE |8/29/2014 | 39| --- # Functions for today Function | Utility ---------------|--------------------------------------- `filter()`|subset **rows** `select()`|subset **columns** `clean_names()`|standardise column names `rename()`|rename columns manually `mutate()`|create and transform columns `mdy()`, `dmy()`, `ymd()` |tell R how to understand dates --- class: medium-large-table2 # `filter()` rows .pull-left[ ``` r filter(surv_raw) ``` 1st argument: a data frame ] .pull-right[ |case_id | age|sex |lab_confirmed |onset date | wt (kg)| |:-------|---:|:---|:-------------|:----------|-------:| |694928 | 23|m |FALSE |11/9/2014 | 70| |86340d | 0|f |TRUE |10/30/2014 | 18| |92d002 | 16|m |TRUE |8/16/2014 | 59| |544bd1 | 10|f |TRUE |8/29/2014 | 39| |544bd1 | 10|f |TRUE |8/29/2014 | 39| |544bd1 | 10|f |FALSE |8/29/2014 | 39| ] --- class: medium-large-table2 # `filter()` rows .pull-left[ ``` r filter(surv_raw, age < 18) ``` 2nd+ arguments: logical tests for rows to be *kept* ] .pull-right[ |case_id | age|sex |lab_confirmed |onset date | wt (kg)| |:-------|---:|:---|:-------------|:----------|-------:| |86340d | 0|f |TRUE |10/30/2014 | 18| |92d002 | 16|m |TRUE |8/16/2014 | 59| |544bd1 | 10|f |TRUE |8/29/2014 | 39| |544bd1 | 10|f |TRUE |8/29/2014 | 39| |544bd1 | 10|f |FALSE |8/29/2014 | 39| ] --- class: medium-large-table2 # `filter()` rows .pull-left[ ``` r filter(surv_raw, age < 18, sex == "f") ``` 2nd+ arguments: logical tests for rows to be *kept* ] .pull-right[ |case_id | age|sex |lab_confirmed |onset date | wt (kg)| |:-------|---:|:---|:-------------|:----------|-------:| |86340d | 0|f |TRUE |10/30/2014 | 18| |544bd1 | 10|f |TRUE |8/29/2014 | 39| |544bd1 | 10|f |TRUE |8/29/2014 | 39| |544bd1 | 10|f |FALSE |8/29/2014 | 39| ] .footnote[Note use of double equals `==` to test equivalence] --- class: medium-large-table2 # `filter()` rows .pull-left[ ``` r filter(surv_raw, age < 18 & (sex == "f" | lab_confirmed == TRUE) ) ``` *Newlines and indents do not impact code* The logic can get complex using: * `&` (AND) * `|` (OR) * Parentheses ] .pull-right[ |case_id | age|sex |lab_confirmed |onset date | wt (kg)| |:-------|---:|:---|:-------------|:----------|-------:| |86340d | 0|f |TRUE |10/30/2014 | 18| |92d002 | 16|m |TRUE |8/16/2014 | 59| |544bd1 | 10|f |TRUE |8/29/2014 | 39| |544bd1 | 10|f |TRUE |8/29/2014 | 39| |544bd1 | 10|f |FALSE |8/29/2014 | 39| ] --- class: medium-large-table2 # `select()` columns .pull-left[ ``` r select(surv_raw, ___) ``` `select()` also expects a data frame as the first argument ] .pull-right[ |case_id | age|sex |lab_confirmed |onset date | wt (kg)| |:-------|---:|:---|:-------------|:----------|-------:| |694928 | 23|m |FALSE |11/9/2014 | 70| |86340d | 0|f |TRUE |10/30/2014 | 18| |92d002 | 16|m |TRUE |8/16/2014 | 59| |544bd1 | 10|f |TRUE |8/29/2014 | 39| |544bd1 | 10|f |TRUE |8/29/2014 | 39| |544bd1 | 10|f |FALSE |8/29/2014 | 39| ] --- class: medium-large-table2 # `select()` columns .pull-left[ ``` r select(surv_raw, case_id, age) ``` You can provide `select()` with column names to *keep* ] .pull-right[ |case_id | age| |:-------|---:| |694928 | 23| |86340d | 0| |92d002 | 16| |544bd1 | 10| |544bd1 | 10| |544bd1 | 10| ] --- class: medium-large-table2 # `select()` columns .pull-left[ ``` r select(surv_raw, case_id, age, sex) ``` You can provide `select()` with column names to *keep* ] .pull-right[ |case_id | age|sex | |:-------|---:|:---| |694928 | 23|m | |86340d | 0|f | |92d002 | 16|m | |544bd1 | 10|f | |544bd1 | 10|f | |544bd1 | 10|f | ] --- class: medium-large-table2 # `select()` columns .pull-left[ ``` r select(surv_raw, -case_id, -lab_confirmed) ``` Or you can designate which columns to *remove* with - ] .pull-right[ | age|sex |onset date | wt (kg)| |---:|:---|:----------|-------:| | 23|m |11/9/2014 | 70| | 0|f |10/30/2014 | 18| | 16|m |8/16/2014 | 59| | 10|f |8/29/2014 | 39| | 10|f |8/29/2014 | 39| | 10|f |8/29/2014 | 39| ] --- # `filter()` *and* `select()`? Yes! Use the **%>%** "pipe" operator to "pass" data from one function to the next. .pull-left[ It is like saying the words **"and then"**. A typical cleaning command contains a *sequence* of linked steps * Rename columns * Filter rows * Select columns * Deduplicate * Clean values... ] .pull-right[ <img src="../../images/data_cleaning/cleaning.png" width="75%" /> ] --- class: medium-large-table2 # Piping data Previously, the 1st argument was the data frame `filter(`**surv_raw**`, age < 18)` -- Using pipes, this is now written as: **surv_raw** `%>% filter(age < 18)` -- You can pipe the data through *multiple* functions `surv_raw` |case_id | age|sex |lab_confirmed |onset date | wt (kg)| |:-------|---:|:---|:-------------|:----------|-------:| |694928 | 23|m |FALSE |11/9/2014 | 70| |86340d | 0|f |TRUE |10/30/2014 | 18| |92d002 | 16|m |TRUE |8/16/2014 | 59| |544bd1 | 10|f |TRUE |8/29/2014 | 39| |544bd1 | 10|f |TRUE |8/29/2014 | 39| |544bd1 | 10|f |FALSE |8/29/2014 | 39| --- class: medium-large-table2 # Piping data Previously, the 1st argument was the data frame `filter(`**surv_raw**`, age < 18)` Using pipes, this is now written as: **surv_raw** `%>% filter(age < 18)` You can pipe the data through *multiple* functions `surv_raw` **%>%** `filter(age < 18)` |case_id | age|sex |lab_confirmed |onset date | wt (kg)| |:-------|---:|:---|:-------------|:----------|-------:| |86340d | 0|f |TRUE |10/30/2014 | 18| |92d002 | 16|m |TRUE |8/16/2014 | 59| |544bd1 | 10|f |TRUE |8/29/2014 | 39| |544bd1 | 10|f |TRUE |8/29/2014 | 39| |544bd1 | 10|f |FALSE |8/29/2014 | 39| --- class: medium-large-table2 # Piping data Previously, the 1st argument was the data frame `filter(`**surv_raw**`, age < 18)` Using pipes, this is now written as: **surv_raw** `%>% filter(age < 18)` You can pipe the data through *multiple* functions `surv_raw` **%>%** `filter(age < 18)` **%>%** `select(case_id, age, sex)` |case_id | age|sex | |:-------|---:|:---| |86340d | 0|f | |92d002 | 16|m | |544bd1 | 10|f | |544bd1 | 10|f | |544bd1 | 10|f | --- # Vertical coding style A *vertical* style with indents does not impact the code, but makes it more readable! ``` r surv_raw ``` |case_id | age|sex |lab_confirmed |onset date | wt (kg)| |:-------|---:|:---|:-------------|:----------|-------:| |694928 | 23|m |FALSE |11/9/2014 | 70| |86340d | 0|f |TRUE |10/30/2014 | 18| |92d002 | 16|m |TRUE |8/16/2014 | 59| |544bd1 | 10|f |TRUE |8/29/2014 | 39| |544bd1 | 10|f |TRUE |8/29/2014 | 39| |544bd1 | 10|f |FALSE |8/29/2014 | 39| --- # Vertical coding style A *vertical* style with indents does not impact the code, but makes it more readable! ``` r surv_raw %>% select(case_id, age, sex, lab_confirmed) # select columns ``` |case_id | age|sex |lab_confirmed | |:-------|---:|:---|:-------------| |694928 | 23|m |FALSE | |86340d | 0|f |TRUE | |92d002 | 16|m |TRUE | |544bd1 | 10|f |TRUE | |544bd1 | 10|f |TRUE | |544bd1 | 10|f |FALSE | --- # Vertical coding style The **`%>%`** pipe passes the dataset to the next step ``` r surv_raw %>% select(case_id, age, sex, lab_confirmed) %>% # select columns distinct() # de-duplicate ``` |case_id | age|sex |lab_confirmed | |:-------|---:|:---|:-------------| |694928 | 23|m |FALSE | |86340d | 0|f |TRUE | |92d002 | 16|m |TRUE | |544bd1 | 10|f |TRUE | |544bd1 | 10|f |FALSE | --- # Vertical coding style The **`%>%`** pipe passes the dataset to the next step ``` r surv_raw %>% select(case_id, age, sex, lab_confirmed) %>% # select columns distinct() %>% # de-duplicate filter(age < 18, lab_confirmed == TRUE) # only children cases ``` |case_id | age|sex |lab_confirmed | |:-------|---:|:---|:-------------| |86340d | 0|f |TRUE | |92d002 | 16|m |TRUE | |544bd1 | 10|f |TRUE | --- # Vertical coding style The **`%>%`** pipe passes the dataset to the next step ``` r surv_raw %>% select(case_id, age, sex, lab_confirmed) %>% # select columns distinct() %>% # de-duplicate filter(age < 18, lab_confirmed == TRUE) %>% # only children cases mutate(infant = ifelse(age < 1, "infant", "not infant")) # create a column ``` |case_id | age|sex |lab_confirmed |infant | |:-------|---:|:---|:-------------|:----------| |86340d | 0|f |TRUE |infant | |92d002 | 16|m |TRUE |not infant | |544bd1 | 10|f |TRUE |not infant | --- # Vertical coding style Is there a pipe operator at the end of this workflow? ``` r surv_raw %>% select(case_id, age, sex, lab_confirmed) %>% # select columns distinct() %>% # de-duplicate filter(age < 18, lab_confirmed == TRUE) %>% # only children cases * mutate(infant = ifelse(age < 1, "infant", "not infant")) ``` |case_id | age|sex |lab_confirmed |infant | |:-------|---:|:---|:-------------|:----------| |86340d | 0|f |TRUE |infant | |92d002 | 16|m |TRUE |not infant | |544bd1 | 10|f |TRUE |not infant | -- The pipes connect all these functions into one, linked command. How would you run this command in RStudio? --- # Clean the column names We can observe changes to column names by printing them with `names()` ``` r # print current column names names(surv_raw) ``` ``` ## [1] "case_id" "age" "sex" ## [4] "lab_confirmed" "onset date" "wt (kg)" ``` --- # Clean the column names Equivalently, `surv_raw` can be passed to `names()` using a **pipe**: ``` r surv_raw %>% # begin with raw data names() # print current column names ``` ``` ## [1] "case_id" "age" "sex" ## [4] "lab_confirmed" "onset date" "wt (kg)" ``` -- Apply `clean_names()` to `surv_raw` by inserting it into the pipe sequence. This standardizes column names (lowercase, no spaces or special characters). ``` r surv_raw %>% # begin with raw data * clean_names() %>% # standardize column names names() # print current column names ``` ``` ## [1] "case_id" "age" "sex" ## [4] "lab_confirmed" "onset_date" "wt_kg" ``` *See changes to the final two columns* --- # Clean the column names Equivalently, `surv_raw` can be passed to `names()` using a **pipe**: ``` r surv_raw %>% # begin with raw data names() # print current column names ``` ``` ## [1] "case_id" "age" "sex" ## [4] "lab_confirmed" "onset date" "wt (kg)" ``` Then, pipe the **cleaned** column names to `rename()` for manual edits. Note that `rename()` references the **cleaned** column names (`onset_date`). ``` r surv_raw %>% # begin with raw data clean_names() %>% # standardize column names * rename( # manual edits * age_years = age, # NEW = OLD * date_onset = onset_date) %>% names() # print current column names ``` ``` ## [1] "case_id" "age_years" "sex" ## [4] "lab_confirmed" "date_onset" "wt_kg" ``` --- # Printing vs. saving Click the tabs to see the difference. .panelset[ .panel[.panel-name[Printing] The previous changes to `surv_raw` were **not** saved. We only *printed with modifications*. ``` r # modify, then print column names *surv_raw %>% # start with raw data clean_names() %>% # standardize column names rename( # manual edits age_years = age, # NEW = OLD date_onset = onset_date) %>% * names() # print current column names ``` **`surv_raw`** still has the *original column names*! ``` r names(surv_raw) ``` ``` ## [1] "case_id" "age" "sex" ## [4] "lab_confirmed" "onset date" "wt (kg)" ``` ] .panel[.panel-name[Saving] Use the **`<-`** to save the changes to a new **`surv_clean`** data frame. No output is printed, but the new object will appear in the RStudio Environment. ``` r # create new data frame *surv_clean <- surv_raw %>% clean_names() %>% rename( age_years = age, date_onset = onset_date) ``` **`surv_clean`** has the *cleaned column names*! ``` r names(surv_clean) ``` ``` ## [1] "case_id" "age_years" "sex" ## [4] "lab_confirmed" "date_onset" "wt_kg" ``` ] ] --- class: medium-large-table2 # `mutate()` to *create* columns The syntax is: ``` r DATASET %>% mutate(NEW_COLUMN_NAME = A_FUNCTION(arguments)) ``` .pull-left[ ``` r surv_raw %>% mutate(age_group = ifelse( test = age >= 18, yes = "adult", no = "minor")) ``` `ifelse()` logically tests each row and writes in the new `age_group` column: * "adult" if the test is TRUE * "minor" if the test is FALSE ] .pull-right[ |case_id | age|sex |lab_confirmed |onset date | wt (kg)|age_group | |:-------|---:|:---|:-------------|:----------|-------:|:---------| |694928 | 23|m |FALSE |11/9/2014 | 70|adult | |86340d | 0|f |TRUE |10/30/2014 | 18|minor | |92d002 | 16|m |TRUE |8/16/2014 | 59|minor | |544bd1 | 10|f |TRUE |8/29/2014 | 39|minor | |544bd1 | 10|f |TRUE |8/29/2014 | 39|minor | |544bd1 | 10|f |FALSE |8/29/2014 | 39|minor | ] --- class: medium-large-table2 # `mutate()` to *edit* columns The syntax is similar: ``` r DATASET %>% mutate(SAME_COLUMN_NAME = A_FUNCTION(arguments)) ``` .pull-left[ ``` r surv_raw %>% mutate(sex = recode(sex, "m" = "male", "f" = "female")) ``` Column `sex` is overwritten. `recode()` starts with original `sex` column and applies changes: * "m" to "male" * "f" to "female" ] .pull-right[ |case_id | age|sex |lab_confirmed |onset date | wt (kg)| |:-------|---:|:------|:-------------|:----------|-------:| |694928 | 23|male |FALSE |11/9/2014 | 70| |86340d | 0|female |TRUE |10/30/2014 | 18| |92d002 | 16|male |TRUE |8/16/2014 | 59| |544bd1 | 10|female |TRUE |8/29/2014 | 39| |544bd1 | 10|female |TRUE |8/29/2014 | 39| |544bd1 | 10|female |FALSE |8/29/2014 | 39| ] --- # `mutate()` with dates The `class()` of date columns should be "date", not "character". To change the class, you must *tell* R how to understand the raw dates. .pull-left[ Dates come in many formats: Is "03/09/2024" the 9th of March, or the 3rd of September? ] .pull-right[ <img src="../../images/data_cleaning/Dates_500x500.png" width="100%" /> ] .footnote[More details in this [Epi R Handbook chapter](https://epirhandbook.com/en/new_pages/dates.html)] --- # Convert to date class Within `mutate()`, use the {lubridate} function that aligns with the *raw date format*. * `ymd()` if raw values are YYYY-MM-DD * `dmy()` if raw values are DD-MM-YYYY * `mdy()` if raw values are MM-DD-YYYY -- .pull-left[ ``` r surv_clean %>% select(case_id, date_onset) %>% tibble() ``` ``` ## # A tibble: 6 × 2 ## case_id date_onset ## <chr> <chr> ## 1 694928 11/9/2014 ## 2 86340d 10/30/2014 ## 3 92d002 8/16/2014 ## 4 544bd1 8/29/2014 ## 5 544bd1 8/29/2014 ## 6 544bd1 8/29/2014 ``` ] -- .pull-right[ ``` r surv_clean %>% * mutate(date_onset = mdy(date_onset)) %>% select(case_id, date_onset) %>% tibble() ``` ``` ## # A tibble: 6 × 2 ## case_id date_onset ## <chr> <date> ## 1 694928 2014-11-09 ## 2 86340d 2014-10-30 ## 3 92d002 2014-08-16 ## 4 544bd1 2014-08-29 ## 5 544bd1 2014-08-29 ## 6 544bd1 2014-08-29 ``` ] .footnote[The `tibble()` display shows the class of each column above its values.] --- class: inverse, center, middle ## Exercise! Go to the course website Open the exercise for Module 2, and login Follow the instructions to create a new RStudio project and begin coding Let an instructor know if you are unsure what to do <img src="../../images/breakout/COVID dominoes.png" width="100%" />