Emory COVID-19 Response Collaborative

# Emory COVID-19 Response Collaborative
## Data visualization with ggplot2: an overview and demonstration
### <a href="https:://epirhandbook.com">EpiRhandbook Team</a> <br><br> <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:white;" xmlns="http://www.w3.org/2000/svg"> <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg> <a href="https://twitter.com/epiRhandbook">epiRhandbook</a> <br> <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:white;" xmlns="http://www.w3.org/2000/svg"> <path d="M440 6.5L24 246.4c-34.4 19.9-31.1 70.8 5.7 85.9L144 379.6V464c0 46.4 59.2 65.5 86.6 28.6l43.8-59.1 111.9 46.2c5.9 2.4 12.1 3.6 18.3 3.6 8.2 0 16.3-2.1 23.6-6.2 12.8-7.2 21.6-20 23.9-34.5l59.4-387.2c6.1-40.1-36.9-68.8-71.5-48.9zM192 464v-64.6l36.6 15.1L192 464zm212.6-28.7l-153.8-63.5L391 169.5c10.7-15.5-9.5-33.5-23.7-21.2L155.8 332.6 48 288 464 48l-59.4 387.3z"></path></svg> <a href="mailto:epirhandbook@gmail.com">epirhandbook@gmail.com</a>
### August 2021

---

/* Size of font in code echo. E.g. 10px or 50% */
.remark-code {
  font-size: 70%;
}

/* Size of font in text */

.medium-text {
  font-size: 75%;     
}

/* Size of font in tables */

.small-table table {
  font-size: 6px;     
}

.medium-table table {
  font-size: 8px;     
}

.medium-large-table table {
  font-size: 10px;     
}

</style>

# We are Applied Epi

![](https://github.com/appliedepi/epiRhandbook_eng/raw/master/images/Epi%20R%20Handbook%20Banner%20Beige%201500x500.png)

* We are a grassroots collaborative of applied epidemiologists  
* Our (free) **Epi R Handbook** has been used by 60,000 people worldwide
* Now we offer customized training to health departments and NGOs

???
- What makes us different is that we focus on the challenges of applied epi, not academic epi. We emphasize the skills used every day by ground-level epidemiologists.

---
# Introducing ourselves

* **Neale Batra**  
 - Founder of the Epi R Handbook.  
 - Previously @Philly, @SantaClaraCounty, and @USAID/PEPFAR. Internationally with @MSF and @WHO.  
* **Alex Spina**  
 - Alum of EPIET (European EIS). Med student and epi consultant for @MSF and @WHO.  
* **Mathilde Mousset**  
 - R programmer / Data manager at Epicentre (@MSF). PhD in evolutionary ecology.  
* **Henry Laurenson-Schafer**  
 - Epi/data scientist with WHO Geneva on COVID response. PhD in molecular biology/bioinformatics.  
* **Wen Lin** 
 - Epi @SantaClaraCounty, **Resident SAS expert**.

???
We have Wen available for answering any questions on translating SAS to R

---
# Today's demonstration

Today is primarily *explanation* and *demonstration*, following from the previous [demonstration on data management and workflow](https://appliedepi.github.io/emory_training/presentation/slides_workflow.html).

We assume you have reviewed the Handbook pages on [R Basics](https://epirhandbook.com/r-basics.html) and [ggplot basics](https://epirhandbook.com/ggplot-basics.html).

![](https://github.com/appliedepi/emory_training/blob/master/presentation/images/EDGE.png?raw=true)

???
- Today's session is primarily a *demonstration* with question & answer session.  
- We assume that you have read at least some of the relevant pages in the Epi R Handbook prior to this session.  
- We also offer training in which we guide you through case studies with hands-on coaching.

---
# Objectives

Our objectives today are to:

1) De-mystify `ggplot()` syntax so you can apply it to your own tasks  
2) Demonstrate core `geom_()` commands  
3) Introduce helpful "wrapper" packages like **incidence2** and **apyramid**  
4) Touch upon advanced tips with packages **ggExtra**, **ggrepel**, **gghighlight**, **scales**, **plotly**, and **esquisse**  
5) Introduce mapping (GIS) with **sf**

???
- 1 - core syntax and 2 - exposure
- Ready yourself because today will be a whirlwind.
- There will be a LOT of content and we will move quickly.
- Type questions in the chat, or hold them to the end.  
- We will do understanding checks throughout

---
# Resources for you

Today's work is within an R project, which can be [downloaded as a zipped file](https://minhaskamal.github.io/DownGit/#/home?url=https://github.com/appliedepi/emory_training/tree/master/case_study).

Today's code is available to you in `ggplot_demo.Rmd` (also online [here](https://github.com/appliedepi/emory_training/blob/master/case_study/ggplot_demo.Rmd)).

R project file (`case_study.Rproj`)  
  * `data/` folder  
    * `covid_example_data/` folder 
        * `linelist_cleaned.rds` 
        * `covid_example_data.xlsx`  
        * `covid_shapefile/` folder  
          - `FultonCountryZipCode.shp`...
    * `map_tiles/` folder 
        * base map images for Fulton County...
  * `ggplot_demo.Rmd` (R markdown script)

???
- Do NOT download and try it now. Too distracting.  
- Today will move fast, but just like last time the REAL opportunity for you is to review at your own pace.  
- We've spent a lot of time developing materials for you.  
- SHOW SCRIPT AND OUTPUT IN FOLDER

---
# Load packages and import data

Load packages (**ggplot2** is within [**tidyverse**](https://www.tidyverse.org/)) and import the cleaned (fake) dataset from [last session](https://github.com/appliedepi/emory_training/blob/master/case_study/weekly_report.Rmd).

```r
## install and load necessary packages 
pacman::p_load(
  rio,        # importing data  
  here,       # relative file pathways  
  skimr,      # review data
  DT,         # visualize data frame
  janitor,    # data cleaning and tables
  epikit,     # age categories
  lubridate,  # working with dates
  incidence2, # epidemic curves
  ggrepel,    # smart labels
  ggExtra,    # extras
  esquisse,   # point-and-click for simple ggplots
  apyramid,   # age pyramids
  scales,     # formatting of scales
  plotly,     # interactive plots
  cowplot,    # combine plots
* tidyverse   # ggplot2 & data management
)
```

```r
linelist <- import(here::here("data", "covid_example_data", "linelist_cleaned.rds"))
```

???
One tip is to load tidyverse last to avoid masking of package names  
The `import()` function can import almost any kind of file. An `.rds` file retains column classes from an R data frame.

---

# Review data

Below are the first 25 rows of the `linelist` data frame:

<div id="htmlwidget-1790b12f893b3171b65d" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-1790b12f893b3171b65d">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25"],["3a85e6992a5ac52f","c6b5281d5fc50b96","53495ad0dca4e22c","2948a265da0d081b","a5524aadd1ca0458","db14eeabe531fab7","8bd07af27a9095e2","b0fb67854aaeb475","bf9099af39be433a","c01011bc9755eb3a","27b97d577fb9a30d","932f146c3860e756","44655181e230b95c","a951cb110aacd31f","7288a58b47539174","dfbc141429716762","033e8eb949e30135","a254eb953370eb3d","2ebf4b64459ff07c","8285906d7277879a","75ed6eb8e8a5373a","bee0aecf4384dbc8","147251528ba0e86c","3517a48c39570377","ddfe45263e67714b"],["2020-03-22","2020-02-01","2020-02-10","2020-03-20","2020-02-26","2020-02-11","2020-03-01","2020-03-01","2020-03-18","2020-03-12","2020-02-26","2020-04-10","2020-04-06","2020-03-20","2020-03-05","2020-03-05","2020-02-24","2020-04-04","2020-03-28","2020-04-07","2020-04-06","2020-02-26","2020-04-12","2020-04-01","2020-02-28"],["2004-11-08","1964-06-07","1944-04-06","1964-06-25","1964-12-21","1956-06-21","1974-05-25","1960-02-17","1985-06-09","1978-11-26","1946-12-21","1993-10-27","1941-06-04","1954-03-19","1958-10-22","1929-10-22","1952-01-11","1994-01-01","1995-12-25","1987-09-14","1981-12-12","1975-07-11","1948-01-11","1974-05-30","1977-01-04"],[16,57,77,57,56,65,47,61,36,42,74,27,80,67,62,91,69,27,25,33,39,46,73,47,44],["Male","Male","Female","Female","Male","Male","Male","Female","Male","Female","Male","Female","Female","Male","Female","Male","Male","Male","Male","Female","Female","Female","Male","Male","Female"],["WHITE","WHITE","BLACK","BLACK","WHITE","BLACK","BLACK","BLACK","ASIAN","BLACK","BLACK","BLACK","BLACK","BLACK","BLACK","BLACK","WHITE","BLACK","WHITE","BLACK","BLACK","BLACK","BLACK","WHITE","BLACK"],["NON-HISPANIC/LATINO","NON-HISPANIC/LATINO","NON-HISPANIC/LATINO","NON-HISPANIC/LATINO","NOT SPECIFIED","NON-HISPANIC/LATINO","NON-HISPANIC/LATINO","NON-HISPANIC/LATINO","NON-HISPANIC/LATINO","NON-HISPANIC/LATINO","NON-HISPANIC/LATINO","NON-HISPANIC/LATINO","NON-HISPANIC/LATINO","NON-HISPANIC/LATINO","NON-HISPANIC/LATINO","NON-HISPANIC/LATINO","NON-HISPANIC/LATINO","NON-HISPANIC/LATINO","NON-HISPANIC/LATINO","NON-HISPANIC/LATINO","NON-HISPANIC/LATINO","NON-HISPANIC/LATINO","NON-HISPANIC/LATINO","NON-HISPANIC/LATINO","NON-HISPANIC/LATINO"],["30308","30308","30315","30213","30004","30314","30313","30350","30324","30305","30315","30308","30312","30213","30311","30311","30327","30349","30009","30337","30344","30315","30349","30327","30318"],["Fields","Fields","Fields","Fields","Fields","Fields","Fields","Fields","Fields","Fields","Fields","Fields","Fields","Fields","Fields","Fields","Fields","Fields","Fields","Fields","Fields","Fields","Fields","Fields","Fields"],["PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)","PIEDMONT (12-2)"],["Georgia","Georgia","Georgia","Georgia","Georgia","Georgia","Georgia","Georgia","Georgia","Georgia","Georgia","Georgia","Georgia","Georgia","Georgia","Georgia","Georgia","Georgia","Georgia","Georgia","Georgia","Georgia","Georgia","Georgia","Georgia"],["Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes"],["2020-03-20","2020-01-28","2020-02-10","2021-05-19","2020-02-20","2020-01-17","2020-02-21","2020-02-25","2020-03-17","2020-11-21","2020-02-15","2020-04-02","2020-03-22","2020-03-10","2020-03-02","2020-02-29","2020-02-19",null,"2020-03-25","2020-03-30","2020-03-29","2020-02-17","2020-03-30","2020-03-28","2020-02-26"],["Yes","No","Yes","No","Yes","Yes","No","Yes","No","Yes","Unknown","Unk","No","Yes","Yes","Yes","Yes","Yes","No","No","No","Yes","Yes","No","No"],["Yes","No","Unknown","Yes","Yes","Yes","No","Yes","Yes","No","Unknown","Yes","No","Yes","Unk","Yes","Yes","No","No","Yes","No","Yes","No","Yes","Unknown"],["No","Yes","Yes","Yes","Yes","No","Unk","No","No","Yes","Unknown","No","No","Yes","Unk","No","Unknown","Yes","No","Yes","No","No","Yes","Yes","Unknown"],["Unknown","Unknown","Unknown","Unknown","Unknown","Unknown","Unknown","Unknown","Unknown","Yes","Unknown","Unknown","Unknown","Unknown","Unknown","Unknown","Unknown","Unknown","No","Unknown","Unknown","Unknown","Unknown","Unknown","Unknown"],["Yes","No","Yes","Yes","No","Unk","Yes","Yes","Yes","Yes","Unknown","No","No","No","Unk","No","Unknown","No","Yes","No","No","Yes","No","Unknown","Unknown"],["Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Unk","Yes","Yes","Yes","No","Yes","Yes","Yes","Yes","Unknown","Yes"],["Yes","No","Unknown","Yes","No","Unk","Yes","No","Yes","No","Unknown","No","No","Yes","Unk","No","Unknown","No","No","No","No","No","No","Unknown","Unknown"],["No","No","No","No","No","Yes","No","No","No","No","No","No","Yes","Yes","No","No","Unknown","No","No","Yes","No","No","No","No","No"],[null,null,null,null,null,"2020-02-21",null,null,null,null,null,null,"2020-04-16",null,null,null,null,null,null,"2020-04-10",null,null,null,null,null],["Yes","No","Unknown","No","No","No","No","No","No","Yes","No","No","Yes","No","Unk","No","Unknown","No","No","No","No","No","No","No","No"],["No","No","Yes","Unknown","Yes","Yes","Yes","No","No","Yes","Unknown","No","Yes","Yes","Yes","Yes","No","Yes","No","Yes","No","Yes","Yes","No","Yes"],[null,null,"2020-02-08",null,"2020-02-26","2020-01-27","2020-03-01",null,null,"2020-11-28",null,null,"2020-04-06","2020-03-19","2020-03-03","2020-03-03",null,"2020-04-03",null,"2020-04-06",null,"2020-02-22","2020-04-15",null,"2020-02-27"],[null,null,null,null,null,"2020-02-21",null,null,null,"2020-12-02",null,null,"2020-04-16",null,null,null,null,null,null,"2020-04-10",null,"2020-03-10","2020-04-18",null,"2020-02-27"],["No","No","No","No","Unknown","Yes","No","Unknown","No","No","Unknown","No","Yes","No","No","No","Unknown","No","No","No","No","No","No","No","No"],["No","No","No","No","Unknown","Yes","No","Unknown","No","No","Unknown","No","Yes","No","No","No","Unknown","No","No","No","No","No","No","No","No"],[null,null,null,null,null,"2020-02-21",null,null,null,null,null,null,"2020-04-16",null,null,null,null,null,null,null,null,null,null,null,null],["Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes"],["Confirmed","Confirmed","Confirmed","Confirmed","Confirmed","Confirmed","Confirmed","Confirmed","Confirmed","Confirmed","Confirmed","Confirmed","Confirmed","Confirmed","Confirmed","Confirmed","Confirmed","Confirmed","Confirmed","Confirmed","Confirmed","Confirmed","Confirmed","Confirmed","Confirmed"],["2020-03-22","2020-02-01","2020-02-10","2021-01-17","2020-02-25","2020-02-20","2020-03-02","2020-02-29","2020-03-17","2020-11-21","2020-03-02","2020-04-10","2020-04-15","2020-03-20","2020-03-07","2020-03-05","2020-02-21","2020-12-16","2021-01-28","2020-04-08","2020-04-02","2020-02-22","2020-04-11","2020-03-29","2020-02-27"],[33.77664546,33.78051014,33.73023331,33.55506787,34.10608705,33.75937142,33.76269487,33.99152006,33.81818887,0.000509015,33.7247196,33.76662936,33.73726168,33.58325606,33.73072166,33.74443579,33.84942894,33.61755356,34.09972633,33.65329287,33.69052246,33.71013231,33.64674578,33.84126361,33.80477485],[-84.38568523,-84.38947474,-84.38425189,-84.62885167,-84.27453508,-84.42582354,-84.40045162,-84.35447278,-84.37365369,-0.000975479,-84.38623582,-84.38936047,-84.38274508,-84.63672804,-84.44208134,-84.47630174,-84.41986975,-84.46304234,-84.31308119,-84.45800906,-84.46524239,-84.38555839,-84.57468022,-84.42349808,-84.45733019],["2020-03-18","2020-01-29","2020-02-05","2020-03-18","2020-02-26","2020-02-05","2020-02-26","2020-02-26","2020-03-18","2020-03-11","2020-02-26","2020-04-08","2020-04-01","2020-03-18","2020-03-04","2020-03-04","2020-02-19","2020-04-01","2020-03-25","2020-04-01","2020-04-01","2020-02-26","2020-04-08","2020-04-01","2020-02-26"],[null,null,null,null,6,10,9,null,null,7,null,null,15,9,1,3,null,null,null,7,null,5,16,null,1],[null,null,null,null,null,"2020-02-21",null,null,null,"2020-12-02",null,null,"2020-04-16",null,null,null,null,null,null,"2020-04-10",null,"2020-03-10","2020-04-18",null,"2020-02-27"],[null,null,null,null,null,25,null,null,null,4,null,null,10,null,null,null,null,null,null,4,null,17,3,null,0],["10-19","50-59","70+","50-59","50-59","60-69","40-49","60-69","30-39","40-49","70+","20-29","70+","60-69","60-69","70+","60-69","20-29","20-29","30-39","30-39","40-49","70+","40-49","40-49"],["White, NH","White, NH","Black, NH","Black, NH","White, NH","Black, NH","Black, NH","Black, NH","Asian, NH","Black, NH","Black, NH","Black, NH","Black, NH","Black, NH","Black, NH","Black, NH","White, NH","Black, NH","White, NH","Black, NH","Black, NH","Black, NH","Black, NH","White, NH","Black, NH"]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>pid<\/th>\n      <th>date_report<\/th>\n      <th>date_dob<\/th>\n      <th>age<\/th>\n      <th>gender<\/th>\n      <th>race<\/th>\n      <th>eth<\/th>\n      <th>zip<\/th>\n      <th>county<\/th>\n      <th>district<\/th>\n      <th>state<\/th>\n      <th>contact_id<\/th>\n      <th>date_onset<\/th>\n      <th>sym_fever<\/th>\n      <th>sym_subjfever<\/th>\n      <th>sym_myalgia<\/th>\n      <th>sym_losstastesmell<\/th>\n      <th>sym_sorethroat<\/th>\n      <th>sym_cough<\/th>\n      <th>sym_headache<\/th>\n      <th>sym_resolved<\/th>\n      <th>date_recovery<\/th>\n      <th>contact_hh<\/th>\n      <th>hospitalized<\/th>\n      <th>date_hospitalized<\/th>\n      <th>date_discharge<\/th>\n      <th>died<\/th>\n      <th>died_covid<\/th>\n      <th>date_died<\/th>\n      <th>confirmed_case<\/th>\n      <th>covid_dx<\/th>\n      <th>date_positive<\/th>\n      <th>lat<\/th>\n      <th>lon<\/th>\n      <th>epiweek<\/th>\n      <th>days_onset_hosp<\/th>\n      <th>date_outcome<\/th>\n      <th>days_hosp<\/th>\n      <th>age_group<\/th>\n      <th>eth_race<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"scrollY":300,"scrollX":600,"pageLength":25,"fontSize":"25%","dom":"ti","ordering":false,"rownames":false,"options":{"pageLength":5,"scrollX":true},"class":"white-space: nowrap","columnDefs":[{"className":"dt-right","targets":[4,33,34,36,38]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>

---
# Visualization options in R

Today we focus on **ggplot2** because it:

* is good for fast data exploration of multi-dimensional data  
* produces very **high quality** final outputs  
* has well-structured grammar => **high consistency**  
* is accompanied by many packages that expand functionality

See the [R graph gallery](https://www.r-graph-gallery.com/ggplot2-package.html) for inspiration.

.footnote[Other plotting options include [**base** R](https://towardsdatascience.com/base-plotting-in-r-eb365da06b22), [**lattice**](https://www.statmethods.net/advgraphs/trellis.html), and [**plotly**](https://plotly.com/r/).]

---
# gg-what??

- The **ggplot2** *package* is the most popular data visualization tool in R

- Its `ggplot()` *function* is at the core of the package

- This whole approach is colloquially known as “ggplotting”

- Resulting figures are sometimes affectionately called “ggplots”

**ggplot2** benefits from a wide variety of supplementary R packages that extends its functionalities, such as **gganimate**, **ggthemr**, **ggdendro**, **gghighlight**, **ggforce**...

???
- "gg" represents the “grammar of graphics” used to construct the figures

---
# Syntax overview

Build a plot object by “adding” commands on top of one another that specify plot layers and design elements

The order of layers will usually look like this:

1) **"Open" the plot** with the `ggplot()` command and specify the dataset

2) **"Map" data columns** to "aesthetic" features of the plot such as axes, color, size, shape, fill, transparency

3) **Add (`+`) “geom” layers** that visualize data geometrically as shapes

4) **Modify "scales"**, such as a color scale or y-axis breaks

5) **Add "theme" plot design elements** such as axis labels, title, caption, fonts, text sizes, background themes, or axes rotation

These layers are "added" sequentially with `+` symbols.  
**ggplot2** commands can be quite long!

???
Remember that although the commands may be long, it is infinitely easier to edit and recycle than in Excel

---
# Open the plot

`ggplot()` creates an empty canvas. Assign the data frame to use.

```r
ggplot(data = linelist)                   
```

Alternatively, use the `%>%` pipe operator to "pipe" a data frame *into* `ggplot()`

```r
linelist %>%
  ggplot()
```

]

<img src="slides_ggplot_files/figure-html/unnamed-chunk-10-1.png" width="504" />
]

???
This is only a blank canvas, we have not defined what should be in the x and y axes. 
If several data frames are needed, they can be added in their own geoms.
Piping is useful to make one-time changes to a dataset prior to plotting.

---
# Mappings with `aes()`

```r
ggplot(
  data = linelist,
* mapping = aes(
*  x = age,
*  y = days_hosp))
```

"Aesthetics" are features *whose display could vary for each data point* (position, color, shape...)

`mapping = aes()` assigns plot "aesthetics" to columns in the data. Inputs must be placed within `aes()`.

Two basic aesthetic mappings are axes, via `x=` and `y=`.

]

<img src="slides_ggplot_files/figure-html/unnamed-chunk-12-1.png" width="504" />
]

---
# Add geometry

```r
ggplot(
  data = linelist,
  mapping = aes(
    x = age,
    y = days_hosp)) +
*geom_point()
```

Data are visualized using "geom" commands, such as `geom_point()`.

These commands are "added" with a `+` to the `ggplot()` command.

]

]

---
# Geometries

Some classical “geoms” include:

Geometry        |Geom                
----------------|--------------------
Histograms      |`geom_histogram()`
Points          |`geom_point()`

]

]

---
# Geometries

Some classical “geoms” include:

Geometry        |Geom                
----------------|--------------------
Histograms      |`geom_histogram()`  
Points          |`geom_point()`  
Lines           |`geom_line()`  
Bar plots       |`geom_bar()` or `geom_col()`

]

]

---
# Geometries

Some classical “geoms” include:

Geometry        |Geom                
----------------|--------------------
Histograms      |`geom_histogram()`
Points          |`geom_point()`  
Lines           |`geom_line()`  
Bar plots       |`geom_bar()` or `geom_col()`  
Boxplots        |`geom_boxplot()`  
Violin plots    |`geom_violin()`

]

]

---
# Adding geoms

```r
ggplot(
  data = linelist,
  mapping = aes(
   x = age,
   y = days_hosp)) +
*geom_point()
```

With axes now mapped, `geom_point()` displays the data as points.

]

]

---
# Adding geoms

```r
ggplot(
  data = linelist,
  mapping = aes(
   x = age,
   y = days_hosp)) +
geom_point() +
*geom_smooth()
```

We can add additional geoms to the current plot with `+`.

*Geoms appear in the order they are written*: the smoothed line appears over the points.

]

<img src="slides_ggplot_files/figure-html/unnamed-chunk-21-1.png" width="504" />
]

.footnote[geom_smooth() gives smoothed conditional means, helping to show trends in presence of "over-plotting" (see [documentation](https://ggplot2.tidyverse.org/reference/geom_smooth.html))]

???
- Explain why you might use one or the other

---
# A quick note on indentations

Indentations, spaces, and newlines do not impact code execution, and can be varied to improve readability.

```r
ggplot(data = linelist, mapping = aes(x = age, y = days_hosp))+geom_point()
```

is the same as:

```r
ggplot(data = linelist,
       mapping = aes(x = age, y = days_hosp)) +
geom_point()
```

is the same as:

```r
ggplot(
  data = linelist,        # use case linelist
  mapping = aes(          # make aesthetic mappings for all geoms
    x = age,              # assign x-axis to age column
    y = days_hosp)) +     # assign y-axis to duration of hospitalization
geom_point()              # display data as points
```

???
- Explain why you might use one or the other 
long style can enable informative comments/annotations
- short style very dense (harder to read for some). Shorter scripts, but so what? The number of lines of your code is not an informative metric.
- very long lines => needs to scroll horizontally for people with smaller monitors (not nice)
- long-ish style makes it easier to see which argument belongs to each function
- spaces around "=" or "+" => make it easier to parse to many people
- other?

---

# Other aesthetics

Aside from axes, other common "aesthetics" include:

Argument        |Controls                
----------------|-----------------------
`shape`      |Display of point as dot, star, triangle, square...
`fill`       |The *interior* color (e.g of bar or boxplot)  
`color`      |The *exterior* or bar, boxplot - OR point color  
`size`       |Line thickness, point size...
`alpha`      |Transparency: 0 (invisible) to 1 (opaque) 
`width`      |Width of "bar plot" bars
`linetype`   |Either solid, dashed, dotted, etc.  
`binwidth`   |Width of histogram bins

???

Note that “aesthetic” in ggplot has a specific meaning that you might associate with the word “aesthetics” in common English. In ggplot those details are called “themes” and are adjusted within a theme() command

Each geom accepts certain aesthetics, like `binwidth=` for `geom_histogram()`

---

# Aesthetics assignments

Aesthetics can be assigned to either:

* **Static values**: `color = "purple"`  
  - Assigned *outside* `aes()`  
  - Same display for all data

* **A data column**: `aes(color = died)`  
  - Assigned *inside* `aes()`
  - Displays data as "groups"

]

Some examples:

]

---

# Aesthetics assignments

Aesthetics can be assigned to either:

* **Static values**: `fill = "purple"`  
  - Assigned *outside* `aes()`  
  - Same display for all data  
  
* **A data column**: `aes(fill = died)`  
  - Assigned *inside* `aes()`
  - Displays data as "groups"

]

More examples:

]

---
# Static aesthetics

```r
ggplot(
  data = linelist,
  mapping = aes(
    x = age,
    y = days_hosp)) + 
*geom_point(color = "seagreen")
```

An aesthetic is *static* if it applies the same display to all data points in the geom or plot. Static aesthetics are defined *outside* `aes()` to a *number or character value*.

Other examples:  
`size = 3`  
`alpha = 0.5`  
`width = 1.2`

]

<img src="slides_ggplot_files/figure-html/unnamed-chunk-29-1.png" width="504" />
]

---
# Dynamic aesthetics

```r
ggplot(
  data = linelist,
  mapping = aes(
    x = age,
    y = days_hosp,
*   color = died)) +
geom_point()
```

*Dynamic* aesthetics are mapped to a column name, and defined *inside* `aes()`. This creates "groups" in the plot and generates a legend. The display varies for each data point.

Above, `color=` is mapped to column `died`, and is inherited by `geom_point()`.

]

<img src="slides_ggplot_files/figure-html/unnamed-chunk-31-1.png" width="504" />
]

???

---
# Static and dynamic

```r
ggplot(
  data = linelist,
  mapping = aes(
    x = age,
    y = days_hosp,
*   color = died)) +  
geom_point(
* size = 7, 
* alpha = 0.7) 
```

Above, `size = 7` and `alpha = 0.7` are static aesthetics defined outside any `aes()` and apply to `geom_point()`.

`color=` is within an `aes()`, and maps to values in column `died`.

]

<img src="slides_ggplot_files/figure-html/unnamed-chunk-33-1.png" width="504" />
]

.footnote[Read more about ggplot aesthetics [here](https://ggplot2.tidyverse.org/articles/ggplot2-specs.html)]

???
As there is only one geom, all aesthetics can be written in `ggplot()`, or in `geom_point()`

---
# Aesthetic mapping placement

1) Dynamic mappings in the initial `ggplot()` call will apply to subsequent geoms, *unless otherwise indicated*  
2) Static aesthetics (e.g. `color = "blue"`) are not inherited by subsequent geoms  
3) Mappings written within one geom apply only to that geom

```r
ggplot(                
  data = linelist,      
  mapping = aes(        
    x = age,
    y = days_hosp)) +
# points colored by outcome
geom_point(
  mapping = aes(color = died),
  size = 1) +
# smoothed means by outcome
geom_smooth(
  mapping = aes(color = died),
  size = 3) +
# smoothed mean of entire dataset
geom_smooth(color = "black")
```

]

<img src="slides_ggplot_files/figure-html/unnamed-chunk-35-1.png" width="504" />
]

???
- Three rules to remember as you decide where to place aesthetic mappings
- This is rather complex example, but serves to demonstrate the above guidelines.

---
# A common error

If your data are "grouped", yet you plot indiscriminately without the grouping assigned to a dynamic aesthetic, you get multiple y values per x value.

Below, `linelist` is aggregated into counts per unique week-outcome.

```r
outcome_week_data <- linelist %>% 
* count(died,
*       week_onset = floor_date(
*         x = date_onset,
*         unit = "week")) %>% 
  select(week_onset, everything()) %>% 
  arrange(week_onset) %>% 
  drop_na(week_onset)
```
]

|week_onset |died    |  n|
|:----------|:-------|--:|
|2019-12-29 |Yes     |  1|
|2019-12-29 |No      |  5|
|2020-01-05 |No      |  7|
|2020-01-05 |Unknown |  1|
|2020-01-12 |Yes     |  1|
|2020-01-12 |No      |  5|
|2020-01-12 |Unknown |  2|
|2020-01-19 |No      |  4|

]

---
# A common error

```r
ggplot(
* data = outcome_week_data,
  mapping = aes(
    x = week_onset,       
    y = n)) +
geom_line()
```

Above, the column `died` is not provided to any grouping aesthetic.

`ggplot()` does not know how to display the multiple values for each week.

]

]

---

# A common error - resolved

```r
ggplot(
  data = outcome_week_data,
  mapping = aes(
    x = week_onset,       
    y = n,
*   color = died))+
geom_line()
```

Use aesthetic mappings like `fill=` or `color=` to correctly visualize the "groups".

Above, `color=` is mapped to the `died` column.

]

<img src="slides_ggplot_files/figure-html/unnamed-chunk-41-1.png" width="504" />
]

---
# Facets

```r
ggplot(
  data = linelist,
  mapping = aes(x = date_onset)) +
geom_histogram() +
*facet_wrap(~eth_race)
```

Another way of *displaying groups* is via faceting.

`facet_wrap()` produces one facet per unique value.

Place a "~" before the column name.

]

]

???
Also called "small multiples"

---

# Facets

```r
ggplot(
  data = linelist,
  mapping = aes(x = date_onset)) +
geom_histogram() +
*facet_wrap(~eth_race,
            scales = "free_y")
```

"Free" auto-scaled axes with `scales=`. Options:  
- "free_y"  
- "free_x"   
- "free" (both x and y)

]

]

]

---
# Facets - by two variables

```r
ggplot(
  data = linelist,
  mapping = aes(x = date_onset)) +
geom_histogram() +
*facet_wrap(eth_race ~ gender) 
```

The "~" signifies "by". You can place columns on either side.  
  
With `facet_wrap()`, levels are combined into facet titles, appearing alphabetically/by factor level.

]

]

---
# Facets - grid layout

Use `facet_grid()`. Labels appear on top and side.

```r
ggplot(
  data = linelist,
  mapping = aes(x = date_onset)) +
geom_histogram() +
*facet_grid(eth_race ~ gender) 
```

]

]

---
# Facets - drop levels

Use `filter()` or `drop_na()` in advance, to remove undesired levels.

```r
linelist %>% 
*  drop_na(gender, eth_race) %>% 
*  filter(gender != "Unknown") %>% 
  
  # begin plot
  ggplot(
    mapping = aes(x = date_onset)) +
  geom_histogram() +
  facet_grid(eth_race ~ gender) 
```

]

]

---
# Facets + `gghighlight()`

Add `gghighlight()` from **gghighlight** to show a full "shadow" behind each facet. For color add the `aes(fill=)`.

```r
ggplot(
  data = linelist,
  mapping = aes(
    x = date_onset,
*   fill = eth_race)) +
geom_histogram() +
facet_wrap(~ eth_race) +
*gghighlight::gghighlight()
```

]

]

---
# gghighlight

Add `gghighlight()` to other plots, and specify specific values (or criteria) to highlight

```r
linelist %>% 
  # get daily counts by age group
  group_by(age_group, date_report) %>%
  count() %>% 
  
  # plot
  ggplot(
    mapping = aes(
      x = date_report,
      y = n,
      color = age_group)) +
  geom_line() +
* gghighlight::gghighlight(
    age_group %in% c("40-49", "60-69)"))+
  theme(legend.position = "none")
```

]

]

---
# Scales - overview

Scale commands replace defaults of *how* the aesthetic mappings manifest, such as:  
* *Which* colors or shapes to display  
* The min/max of point sizes  
* The min/max and frequency of axes breaks

As a generic formula, these commands are written as: **`scale_AESTHETIC_METHOD()`**.

1) `scale_` : this prefix never changes  
2) AESTHETIC: `_fill_` , `_color_` , `_x_` , `_y_` , etc.  
3) METHOD: `_continuous()`, `_discrete()`, `_manual()`, `_date()`, etc.

---
# Scales examples

Some examples of scale commands:

You want to adjust  |Scale command
--------------------|-------------------
continuous y-axis   |`scale_y_continuous()`
date x-axis         |`scale_x_date()`  
categorical x-axis  |`scale_x_discrete()`  
fill, continuous    |`scale_fill_continuous()`
fill, continuous    |`scale_fill_gradient()`  
color, manual assign|`scale_color_manual()`

---
# Scales - default

```r
ggplot(
  data = linelist,
  mapping = aes(
    x = eth_race,
    fill = died)) +
geom_bar()
```

Above, the fill of a bar plot uses the **default colors and axis breaks**:

]

]

---
# Scales - adjusted fill

```r
ggplot(
  data = linelist,
  mapping = aes(
    x = eth_race,
    fill = died)) +
geom_bar() +
*scale_fill_manual(        
*  values = c(
*   "Yes"     = "violetred", 
*   "No"      = "aquamarine",
*   "Unknown" = "grey"))
```

Within `scale_fill_manual()` we provide assignments within a vector `c()`.

]

<img src="slides_ggplot_files/figure-html/unnamed-chunk-59-1.png" width="504" />
]

???
Discuss the na.value= arguments in most scale commands, and the difference between having NA values in the data and having an explicit missing value such as "Unknown".

---
# Scales - adjusted y-axis

```r
ggplot(
  data = linelist,
  mapping = aes(
    x = eth_race,
    fill = died)) +
geom_bar() +
scale_fill_manual(        
  values = c(
    "Yes"     = "violetred", 
    "No"      = "aquamarine",
    "Unknown" = "grey")) +
*scale_y_continuous(
*  breaks = seq(from = 0,
*               to = 35000,
*               by = 5000))
```

In `scale_y_continuous()` we adjust the y-axis breaks using `seq()` to define a numeric sequence.

]

]

---
# Scales - Start axes at 0

```r
ggplot(
  data = linelist,
  mapping = aes(
    x = eth_race,
    fill = died)) +
geom_bar() +
scale_fill_manual(        
  values = c("Yes"    = "violetred", 
             "No"     = "aquamarine",
            "Unknown" = "grey")) +
scale_y_continuous(
  breaks = seq(from = 0,
               to = 35000,
               by = 5000),
* expand = c(0, 0)) +
*scale_x_discrete(
*  expand = c(0, 0))
```

In any `scale_x_` or `scale_y_` command, use `expand = c(0,0)` to remove excess space around the plot features.

]

]

---
# Scales - date axis labels

```r
ggplot(
  data = linelist,
  mapping = aes(x = date_onset)) +
geom_histogram()
```

The default scale for date axes will vary by the range of your data.

]

]

---
# Scales - date label breaks

```r
ggplot(
  data = linelist,
  mapping = aes(x = date_onset)) +
geom_histogram() +
*scale_x_date(
*  date_breaks = "3 months")
```

Adjust axis labels with `scale_x_date()`.

Use `date_breaks=` values like "1 week", "2 weeks", or "3 months".

Note: these are the *axis* label breaks, not histogram bin breaks!

.footnote[
Note: Default "weeks" start on Mondays. See [epiRhandbook epicurves](https://epirhandbook.com/epidemic-curves.html) page for alternatives.
]

]

]

???
For tips on geom_histogram() bins, see Epi R Handbook epicurves page

---
# Scales - date axis labels

```r
ggplot(
  data = linelist,
  mapping = aes(x = date_onset)) +
geom_histogram() +
scale_x_date(
  date_breaks = "3 months",
* date_labels = "%d %b\n%Y")
```

Specify date label format to `date_labels=` using ["strptime" syntax](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/strptime)

`"%d %b %Y"` for DD MMM YYYY.

See Epi R Handbook [Epicurves](https://epirhandbook.com/epidemic-curves.html) and [Strings](https://epirhandbook.com/characters-and-strings.html) pages for more tips

]

]

???
/n is a newline

---
# Scales - date axis labels

```r
ggplot(
  data = linelist,
  mapping = aes(x = date_onset)) +
geom_histogram() +
scale_x_date(
  date_breaks = "3 months",
* labels = scales::label_date_short() )
```

Alternatively, simplify by assigning `labels=` to `label_date_short()` from [**the scales package**](https://scales.r-lib.org/)

The year is not repeated on each label anymore.

]

]

---

# Scales - display percents

Easily display proportions as percents with `percent()` from **scales** within `scale_y_continuous()`.

First few rows of a CFR dataset:

|month      | cases| deaths|       CFR|
|:----------|-----:|------:|---------:|
|2020-01-01 |     1|      0| 0.0000000|
|2020-02-01 |    54|      3| 0.0555556|
|2020-03-01 |   857|     73| 0.0851809|
|2020-04-01 |  1832|    152| 0.0829694|

```r
ggplot(
  data = CFR_data,
  mapping = aes(
    x = month,
*   y = CFR) +
geom_line(size = 2, color = "brown") +
*scale_y_continuous(labels = percent)
```

]

<img src="slides_ggplot_files/figure-html/unnamed-chunk-75-1.png" width="504" />
]

???
Remember to load the scales package

---
# Plot labels

```r
ggplot(data = linelist) +
geom_point(
  mapping = aes(
    x = age,
    y = days_hosp,
    color = died),    
  alpha = 0.3) +
*labs(
*  title = "Duration of admission",
*  subtitle = "Fulton County, GA",
*  x = "Age (years)",
*  y = "Duration (days)",
*  caption = "Fictional COVID-19 data",
*  color = "Deceased"
)
```

Use `labs()` as above.

Note: to edit legend title, use the aesthetic that created the legend (e.g. `color=`).

]

<img src="slides_ggplot_files/figure-html/unnamed-chunk-77-1.png" width="504" />
]

---
# Plot label tips

See the [Epi R Handbook page](https://epirhandbook.com/characters-and-strings.html#dynamic-strings) on adjusting characters with the **stringr** package

**New lines**

```r
"Top line\nNextline"                        # \n creates new line
str_wrap("This is a really long title", 30) # wrap at 30 characters
```

**Dynamic labels** - Embed code in `str_glue()` that updates with the data

```r
str_glue("Data as of {Sys.Date()}")
```

```
## Data as of 2021-08-20
```

```r
str_glue("{fmt_count(linelist, is.na(date_onset))} cases missing onset and not shown")
```

```
## 37323 (45.7%) cases missing onset and not shown
```

???
Explain that in str_glue, anything within curly brackets it will run as R code.

---
# Themes

Themes are non-data design features (background, text size/color, etc).

[These "complete themes"](https://ggplot2.tidyverse.org/reference/ggtheme.html) are easy to add.

```r
# Try one of these...
theme_bw()+
theme_classic()+
theme_dark()+
theme_gray()+
theme_minimal()+
theme_light()+
theme_void()+
```

Try the argument `base_size = 16` to quickly increase text size.

]

]

---
# Themes

Make micro-adjustments with `theme()`. See the handbook for tips.

```r
ggplot(
  data = linelist,
  mapping = aes(
    x = age,
    y = days_hosp,
    color = died),    
    alpha = 0.3) +
geom_point() +
labs(
  title = "Duration of admission",
  x = "Age (years)",
  y = "Duration (days)",
  color = "Deceased")+
theme_minimal(base_size = 16) +
*theme(
* legend.position = "bottom",
* plot.title = element_text(
*   color = "red",
*   face = "bold"),
* axis.title.y = element_text(angle = 90))
```

]

The syntax takes practice - see [this list](https://ggplot2.tidyverse.org/reference/theme.html) of feature-specific arguments.

<img src="slides_ggplot_files/figure-html/unnamed-chunk-84-1.png" width="504" />
]

???
Talk about these theme() arguments and how they consist of two parts, just like `mapping = aes()`. 
Explain that nobody has these all memorized, but the common ones are easy to remember once you use them enough.  
Remember to add them AFTER any complete themes.

---
# Bar plots

`geom_col()` and `geom_bar()` are used to make general bar plots (non-epicurves).

Use `geom_bar()` if bar height should reflect the **number of rows** in the data
(e.g. a case linelist).

```r
ggplot(
  data = linelist,                  # begin with linelist
  mapping = aes(x = eth_race)) +    # No y= argument
*geom_bar()                           
```

Use `geom_col()` there is a numeric column containing the desired **bar height**
(e.g. aggregated count data).

```r
ggplot(
  data = linelist_agg,                  # begin with aggregated count data   
  mapping = aes(x = eth_race, y = n)) + # bar height is value in column "n"                                     
*geom_col()   
```

---

# Bar plot - count rows

With `geom_bar()`, the height reflects the number of rows per x-axis group. This works well for linelist data.

|date_onset |eth_race  |died    |
|:----------|:---------|:-------|
|2020-03-20 |White, NH |No      |
|2020-01-28 |White, NH |No      |
|2020-02-10 |Black, NH |No      |
|2021-05-19 |Black, NH |No      |
|2020-02-20 |White, NH |Unknown |
|2020-01-17 |Black, NH |Yes     |

```r
ggplot(
  data = linelist,
  mapping = aes(
    x = eth_race)) +             
geom_bar() +
theme(axis.text.x= element_text(angle=30))  
```

]

<img src="slides_ggplot_files/figure-html/unnamed-chunk-89-1.png" width="504" />
]

---

# Bar plot - count grouped rows

To achieve "stacked" bars with `geom_bar()`, assign the grouping column to `fill=`, within `aes()`.

```r
ggplot(
  data = linelist,
  mapping = aes(
      x = eth_race,   
*     fill = died)) +             
geom_bar() + 
theme(axis.text.x= element_text(angle=30))
```

]

<img src="slides_ggplot_files/figure-html/unnamed-chunk-92-1.png" width="504" />
]

---

# Bar plot - aggregated counts

In contrast, `geom_col()` uses a column of counts, such as column `n` in this `linelist_eth` dataset:

|eth_race            |     n|
|:-------------------|-----:|
|Asian, NH           |  3022|
|Black, NH           | 34395|
|White, NH           | 27303|
|Hispanic, all races |  8600|
|Other, NH           |  2912|
|Unknown             |  5525|

Column `n` provides the bar height.

```r
ggplot(linelist_eth) +             
*  geom_col(           
    mapping = aes(   
      x = eth_race,    
*     y = n))+
theme(axis.text.x= element_text(angle= 30))         
```

]

]

---

# Bar plot - stacked counts

To have "stacked" bars using `geom_col()`, each plotting group must have its own rows in the data. Use "long"-style data like below:

|eth_race  |died    |     n|
|:---------|:-------|-----:|
|Asian, NH |Yes     |    30|
|Asian, NH |No      |  1802|
|Asian, NH |Unknown |  1190|
|Black, NH |Yes     |  1037|
|Black, NH |No      | 19204|
|Black, NH |Unknown | 14154|

```r
ggplot(linelist_eth_died) +               
geom_col(                 
  mapping = aes(
    x = eth_race,          
*   y = n,                
*   fill = died)) +
theme(axis.text.x= element_text(angle=30))
```

]

]

---

# Bar plot - a common error

If your data look like below (counts) and your plot looks like that (bars of same height), ensure you are using `geom_col()` and not `geom_bar()`!

]

This plot shows that there is one row per ethnicity/race!

]

---
# Bar plot - flip axes

```r
ggplot(linelist) +        
geom_bar(              
  mapping = aes(
    x = eth_race,            
    fill = died)) +
theme(legend.position = "top") +  
*coord_flip()           
```

It is simple to flip the axes on any ggplot by adding `coord_flip()`

This is useful in bar plot to improve readability.

]

<img src="slides_ggplot_files/figure-html/unnamed-chunk-102-1.png" width="504" />
]

---
# Bar plot - adjust order

```r
ggplot(linelist) +        
geom_bar(                
  mapping = aes(
*   x = fct_infreq(eth_race),   
*   fill = fct_infreq(died))) + 
theme(legend.position = "top") +  
coord_flip()
```

Use the **forcats** package to convert to class *factor* and adjust "level" order.

Above, `fct_infreq()` orders x-axis position and stacks (fill) by frequency.

]

<img src="slides_ggplot_files/figure-html/unnamed-chunk-104-1.png" width="504" />
]

.footnote[See the Epi R Handbook page on [factors](https://epirhandbook.com/factors.html#within-a-plot).]

???
You will need to clean up the axes and legend titles now.

---
# Bar plot - reverse order

```r
ggplot(linelist)  +              
geom_bar(                
  mapping = aes(
*   x = fct_rev(fct_infreq(eth_race)),   
*   fill = fct_rev(fct_infreq(died)))) +      
theme(legend.position = "top") + 
coord_flip()              
```

To *reverse* order, use `fct_rev()`, which can be wrapped around other functions.

]

]

---

# Bar plot - adjust width

```r
ggplot(
  data = linelist_eth_died,
  mapping = aes(
    x = eth_race,
    y = n,
    fill = died)) +             
*geom_col(width = 0.5) +
theme(axis.text.x = element_text(angle = 30))
```

Adjust bar width with `width=`.

]

]

???
Be wary of adjusting width for date bars (e.g. month) - use `geom_histogram()` instead with bin breaks.

---

# Bar plot - adjacent

```r
ggplot(
  data = linelist_eth_died,
  mapping = aes(
    x = eth_race,
    y = n,
    fill = fct_infreq(died),
    label = n)) +             
*geom_col(position = "dodge")+
theme(axis.text.x = element_text(angle = 30))
```

To make the bars adjacent, specify `position = "dodge"` in `geom_col()`.

]

]

---

# Bar plot - display counts

To display text, use `geom_col()` with `geom_text()`. Assign `aes(label=)` to the height values.

The `position=` arg in `geom_text()` can specify display at mid-bar.

```r
ggplot(
  data = linelist_eth_died,
  mapping = aes(
    x = eth_race,
    y = n,
    fill = died,
*   label = n)) +             
geom_col() +
*geom_text(
* size = 3,
* position = position_stack(vjust = 0.5)) +
theme(axis.text.x = element_text(angle=30))
```

]

]

---

# Point-and-click ggplot2

Now that you've learned the basic syntax... the [**esquisse**](https://cran.r-project.org/web/packages/esquisse/vignettes/get-started.html]) package allows point-and-click creation of (simple) ggplots! (esquisse means "sketch" in French)

![](https://github.com/appliedepi/emory_training/blob/master/presentation/images/esquisse1.gif?raw=true)

---
# Make an epidemic curve

Two approaches that we suggest:

1) Use the **incidence2** package  
  * Fast, simple, and modifiable with ggplot additions  
  
2) Use ggplot's `geom_histogram()`  
  * Most customizeability  
  * Most complex code

Today's demonstration will focus on **incidence2**.

.footnote[See the Epi R handbook's [Epicurves page](https://epirhandbook.com/epidemic-curves.html) for detailed examples of both methods]

---
# Epicurve - incidence object

**Create** an incidence object

```r
weekly <- incidence(
  x = linelist,            
  date_index = date_onset, 
  interval = "week")          
```

**Plot** the incidence object

```r
plot(weekly)
```
]

]

---
# Epicurve - incidence object

Adjust the interval

```r
bimonthly <- incidence(
  x = linelist,            
  date_index = date_onset, 
* interval = "2 months")      
```

**Plot** the incidence object

```r
plot(bimonthly)
```

Try "Sunday weeks" or "MMWR week" for Sunday weeks.

]

]

---
# Epicurve - groups

To show groups, specify them to `groups=` in the `incidence()` command:

```r
weekly <- incidence(
  x = linelist,              
  date_index = date_onset,  
  interval = "weeks",       
* groups = eth_race)      
```

AND to `fill=` in the `plot()` command:

```r
plot(weekly,
* fill = eth_race)
```

If grouping by multiple columns, nest them in both places within `c()`:

```r
groups = c(eth_race, gender)
```

]

]

---
# Epicurve - ggplot2 additions

```r
plot(weekly,
     fill = eth_race) +
scale_y_continuous(
  expand = c(0,0),
  breaks = seq(0,2000,250)) + 
*scale_y_continuous(
* expand = c(0,0),
* breaks = seq(0,2000,250)) + 
*theme_minimal(base_size = 16) + 
*theme(legend.position = "top") +
*labs(fill = "Race and\nEthnicity")
```

Add other **ggplot2** commands with `+`.

]

]

---
# Epicurve - date axis

```r
plot(weekly,
  fill = eth_race,
*  date_format = "%a %d %b %Y\n (Week %W)", 
*  angle = 30)+                           
scale_y_continuous(
  expand = c(0,0),
  breaks = seq(0,2000,250)) + 
theme_minimal(base_size = 16) + 
theme(legend.position = "top") +
labs(fill = "Race and\nEthnicity")
```

Do not adjust **incidence2** epicurves with `scale_x_date()`.

Instead, use custom **incidence2** arguments in `plot()`.

]

]

---
# Epicurve - show cases

For small outbreaks, the style `show_cases = TRUE` may be helpful:

```r
small_outbreak <- linelist %>% 
  filter(
    zip == 30024,
    date_report >= as.Date("2020-12-01")) %>% 
  incidence(
    date_index = date_onset,               
    interval = "Sunday weeks",             
    groups = eth_race)
```

```r
plot(small_outbreak,
  fill = eth_race,
* show_cases = TRUE)+
theme(legend.position = "bottom")+
labs(
  title = "ZIP 30024 by race")
```

]

]

???
Dataset is piped in from above

---
# Epicurve - color palettes

Use `scale_fill_viridis_d()` for palettes that are color-blind friendly:

```r
plot(weekly, fill = eth_race)+
*scale_fill_viridis_d(
    option = "inferno",           
    name = "Race and\nEthnicity",  
    na.value = "grey")          
```

[Viridis](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html) (try with `option = "plasma"` or "inferno"), and [colorbrewer](https://www.r-graph-gallery.com/38-rcolorbrewers-palettes.html) palette functions can be added to any ggplot as well.

]

]

---

# Epicurve - aggregated counts

You can also use **incidence2** on data that are aggregated counts...

```r
# For demo: aggregate linelist
linelist_day_counts <- linelist %>% 
*  count(
*    day = floor_date(date_report, "day"),
*    died) %>% 
  drop_na(day)
```

]

|day        |died    |  n|
|:----------|:-------|--:|
|2021-06-19 |No      | 11|
|2021-06-19 |Unknown | 11|
|2021-06-20 |No      | 10|
|2021-06-20 |Unknown |  6|
|2021-06-21 |No      | 12|
|2021-06-21 |Unknown | 11|
|2021-06-22 |No      |  9|
|2021-06-22 |Unknown |  6|
|2021-06-23 |No      |  7|
|2021-06-23 |Unknown |  7|

]

---

# Epicurve - aggregated counts

```r
died_curve <- incidence(  
  linelist_day_counts,  
  date_index = day,   
* count = n,            
  interval = "week",    
  groups = died        
  )
```

```r
# plot stacked by outcome
plot(died_curve,    
     fill = died)   
```

]

]

---
# Demographic pyramids

We suggest two ways to make age pyramids:

1) Use the **apyramid** package - simple, easy

2) Use **ggplot2** - more customizeable, but opportunity for error

Today we will demonstrate **apyramid**.

.footnote[See the Epi R Handbook [page on demographic pyramids](https://epirhandbook.com/demographic-pyramids-and-likert-scales.html) and the [package vignette](https://cran.r-project.org/web/packages/apyramid/vignettes/intro.html) for more information.]

---
# Demographic pyramids

```r
age_pyramid(
  data = linelist,
  age_group = "age_group",
  split_by = "gender")
```

The function `age_pyramid()` from **apyramid** offers an easy interface.

]

]

---
# Demographic pyramids

```r
age_pyramid(
  data = linelist,
  age_group = "age_group",
  split_by = "gender",
* proportional = TRUE,
* show_midpoint = FALSE,
* pal = c("darkgreen", "brown"))
```

Further modifications:

* Percents instead of raw counts on the x-axis  
* No mid-point designation  
* Specified colors

]

]

---
# Demographic pyramids

```r
age_pyramid(
  data = linelist,
  age_group = "age_group",
  split_by = "gender",
* stack_by = "hospitalized"
 proportional = TRUE,
 show_midpoint = FALSE,
 pal = c("darkgreen", "brown",
         "yellow", "orange"))
```

You can stack the bars by another variable with `stack_by=`

]

]

---
# Demographic pyramids

```r
age_pyramid(
  data = linelist,
  age_group = "age_group",
  split_by = "gender",
  proportional = TRUE,
  show_midpoint = FALSE) +
*theme_minimal(base_size = 10) +
*labs(
*  title = "Age and Gender",
*  subtitle = "Fulton County, GA",
*  x = "Percent of total",
*  y = "Age group",
*  fill = "Gender",
*  caption = "Caption here")
```

You can add further ggplot commands.

]

]

---
# Dynamic labels example

Let's use dynamic labels that will update with the data.

**Simple caption**

```r
str_glue("n = {nrow(linelist)}")
```

```
## n = 81757
```

**Complex caption**

For complicated `str_glue()` scenarios, define the dynamic components outside the quotation marks.

`fmt_count()` from **epikit** is useful to count and display rows.

```r
str_glue("{missing} missing age or gender not shown.",
        missing = fmt_count(linelist, is.na(gender) | is.na(age_group))
        )
```

```
## 405 (0.5%) missing age or gender not shown.
```

---
# Dynamic labels example

**Subtitle**

Max/Min dates can be wrapped in `format()` from **base** to adjust the display. See the epiRhandbook section on [strptime syntax](https://epirhandbook.com/working-with-dates.html?q=strptime#format).

```r
str_glue("Fulton County, reported {min_date} - {max_date}",
    min_date = format(min(linelist$date_report, na.rm=T), "%B %d %Y"),
    max_date = format(max(linelist$date_report, na.rm=T), "%B %d %Y")
    )
```

```
## Fulton County, reported January 03 2020 - June 23 2021
```

---
class: remark-code

# Dynamic captions

Now applying those dynamic captions:

```r
age_pyramid(
  data = linelist,
  age_group = "age_group",
  split_by = "gender",
  proportional = TRUE,
  show_midpoint = FALSE) +
theme_minimal(base_size = 10) +
labs(
  title = "Age and Gender",
* subtitle = str_glue(
    "Fulton County, reported {min_date} - {max_date}",
  
    min_date = format(
      min(linelist$date_report, na.rm=T),
      "%B %d %Y"),
  
    max_date = format(
      max(linelist$date_report, na.rm=T),
      "%B %d %Y")
  ),
  x = "Percent of total",
  y = "Age group",
  fill = "Gender",
* caption = str_glue(
    "{missing} missing age or gender not shown.",
    missing = fmt_count(
      linelist,
      is.na(gender) | is.na(age_group))
    )
  )
```

]

]

---

# Labeling

.pull-left[
 
To demonstrate labeling points, we create a new dataset that summarizes CFR and median age by race/ethnicity

```r
race_CFR_age <- linelist %>% 
  group_by(eth_race) %>% 
  summarise(
    cases = n(),
    deaths = sum(died_covid == "Yes",
                 na.rm=T),
    CFR = deaths/cases,
    med_age = median(age, na.rm=T)  )
```

]

|eth_race            | cases| deaths|       CFR| med_age|
|:-------------------|-----:|------:|---------:|-------:|
|Asian, NH           |  3022|     23| 0.0076109|      34|
|Black, NH           | 34395|    833| 0.0242186|      39|
|White, NH           | 27303|    417| 0.0152730|      36|
|Hispanic, all races |  8600|     55| 0.0063953|      34|
|Other, NH           |  2912|      6| 0.0020604|      32|
]

---

# Labeling

.pull-left[
 
We can plot these data as points, and label them with `geom_text()`... but it does not look very good.

```r
ggplot(
  data = race_CFR_age,
  mapping = aes(
    x = med_age,
    y = CFR,
    size = cases,
*   label = eth_race)) +
geom_point() +
*geom_text()
```

]

]

---

# Labeling

```r
library(ggrepel)

ggplot(
  data = race_CFR_age,
  mapping = aes(
    x = med_age,
    y = CFR,
    size = cases,
*   label = eth_race)) +     
geom_point() +
*geom_label_repel(         
*   size = 5,               
*   min.segment.length = 0)
```

]

]

---

# Labeling

```r
library(ggrepel)

ggplot(
  data = race_CFR_age,
  mapping = aes(
    x = med_age,
    y = CFR,
    size = cases,
*   label = str_glue(
*     "{eth_race}\n{cases} cases"))) +
geom_point() +
geom_label_repel(
    size = 5,
    min.segment.length = 0)
```

]

]

---

# Labeling

.pull-left[
 
Use `comma()` from the **scales** package to make numbers display with comma separators every three digits.

```r
ggplot(
  data = race_CFR_age,
  mapping = aes(
    x = med_age,
    y = CFR,
    size = cases,
*   label = str_glue(
*     "{eth_race}\n{comma(cases)} cases"))) +
geom_point() +
geom_label_repel(
    size = 4,
    min.segment.length = 0)
```

See the many other useful functions in [**scales**](https://scales.r-lib.org/).

]

]

---

# Marginal distributions

Add "marginal" histograms to scatterplots with `ggMarginal()` from the **ggExtra** package.

```r
scatterplot <- ggplot(
  data = linelist,
  mapping = aes(
    x = age,
    y = days_hosp)) + 
geom_point()
```

```r
ggExtra::ggMarginal(
  scatterplot,                   
  type = "histogram",            
  fill = "lightblue",            
  xparams = list(binwidth = 10), 
  yparams = list(binwidth = 5))  
```
]

]

---
# Combining plots - plot 1

The **cowplot** package facilitates combining multiple plots, while aligning elements such as axes. For date axes, ensure the same limits in both plots.

Define the first plot.

```r
plot1 <- plot2 <- linelist %>% 
  ggplot(
    mapping = aes(
      x = date_report),
    binwidth = 7)+
  geom_histogram()+
  theme_minimal()+
  scale_y_continuous(expand = c(0,0))+
* scale_x_date(
*   limits = c(
*     as.Date("2020-03-01"),
*     max(linelist$date_report, na.rm=T)),
    date_breaks = "months",
    labels = scales::label_date_short(),
    expand = c(0,0))
```
]

]

---

# Combining plots - plot 2

```r
plot2 <- linelist %>% 
group_by(week = floor_date(date_report, "week")) %>% 
summarise(ci = list(mean_cl_normal(age) %>% rename(mean=y, lwr=ymin, upr=ymax))) %>% 
unnest() %>%  
  
ggplot(
  mapping = aes(
    x = week,
    y = mean,
    ymin = lwr,
    ymax = upr))+
geom_ribbon(alpha = 0.5, fill = "green", color = "green")+
geom_line(size = 2, color = "darkgreen")+
*scale_x_date(
*  limits = c(
*    as.Date("2020-03-01"),
*    max(linelist$date_report, na.rm=T)),
  date_breaks = "months",
  labels = scales::label_date_short(),
  expand = c(0,0))+
coord_cartesian(ylim = c(30, 60))+
theme_minimal()+
labs(
  y = "Weekly mean age (95%CI)",
  x = "Month")
```

]

]

---

# Combining plots - result

```r
plot_grid(plot1, plot2, rel_heights = c(1, 1), ncol = 1, align = "hv")
```

---
# Dual-axis - plot 1

Define the first plot:

```r
plot1 <- ggplot(
  data = linelist,
  mapping = aes(
    x = date_report)) + 
geom_histogram(color = "grey",
               alpha = 0.5) +
*theme_cowplot() +
*scale_x_date(
* limits = c(
*   as.Date("2020-03-01"),
*   max(linelist$date_report, na.rm = TRUE)),  
  date_breaks = "months",
  labels = scales::label_date_short(),
  expand = c(0, 0),
  name = "") +
scale_y_continuous(
  expand = c(0,0),
  name   = "Weekly reported case incidence")
```
]

]

---

# Dual-axis - plot 2

```r
plot2 <- linelist %>% 
  group_by(epiweek) %>% 
  summarise(CFR = sum(died == "Yes", na.rm=T) / n() ) %>% 
ggplot(
  mapping = aes(
    x = epiweek,
    y = CFR))+
    geom_line(size = 2, color = "orange")+
* theme_cowplot()+
  scale_y_continuous(
*   position = "right",
    limits = c(0, 0.2),
    expand = c(0,0),
    name = "Weekly CFR")+
* scale_x_date(
*   limits = c(
*     as.Date("2020-03-01"),
*     max(linelist$date_report, na.rm=T)),  
    date_breaks = "months",
    labels = scales::label_date_short(),
    expand = c(0,0),
    name = "Epiweek")+
  theme(
    axis.text.y = element_text(color = "orange", face = "bold"),
    axis.title.y = element_text(color = "orange", face = "bold"))
```

]

]

---

# Dual-axis - combined

```r
aligned  <- align_plots(plot1, plot2, align = "hv")
combined <- ggdraw(aligned[[1]]) + draw_plot(aligned[[2]])
combined
```

---

# Interactive plots

Use `ggplotly()` from the **plotly** package to easily make most ggplots interactive.

```r
# Provide the name of a defined plot into ggplotly()  
plotly::ggplotly(bar_plot)
```

<div id="htmlwidget-bc63206a64899d154108" style="width:648px;height:432px;" class="plotly html-widget"></div>
<script type="application/json" data-for="htmlwidget-bc63206a64899d154108">{"x":{"data":[{"orientation":"v","width":[0.9,0.9,0.9,0.9,0.9],"base":[2992,33358,26747,8532,2902],"x":[1,2,3,4,5],"y":[30,1037,556,68,10],"text":["count:    30<br />eth_race: Asian, NH<br />died: Yes","count:  1037<br />eth_race: Black, NH<br />died: Yes","count:   556<br />eth_race: White, NH<br />died: Yes","count:    68<br />eth_race: Hispanic, all races<br />died: Yes","count:    10<br />eth_race: Other, NH<br />died: Yes"],"type":"bar","marker":{"autocolorscale":false,"color":"rgba(248,118,109,1)","line":{"width":1.88976377952756,"color":"transparent"}},"name":"Yes","legendgroup":"Yes","showlegend":true,"xaxis":"x","yaxis":"y","hoverinfo":"text","frame":null},{"orientation":"v","width":[0.9,0.9,0.9,0.9,0.9,0.9],"base":[1190,14154,11877,3303,1771,4513],"x":[1,2,3,4,5,6],"y":[1802,19204,14870,5229,1131,1012],"text":["count:  1802<br />eth_race: Asian, NH<br />died: No","count: 19204<br />eth_race: Black, NH<br />died: No","count: 14870<br />eth_race: White, NH<br />died: No","count:  5229<br />eth_race: Hispanic, all races<br />died: No","count:  1131<br />eth_race: Other, NH<br />died: No","count:  1012<br />eth_race: Unknown<br />died: No"],"type":"bar","marker":{"autocolorscale":false,"color":"rgba(0,186,56,1)","line":{"width":1.88976377952756,"color":"transparent"}},"name":"No","legendgroup":"No","showlegend":true,"xaxis":"x","yaxis":"y","hoverinfo":"text","frame":null},{"orientation":"v","width":[0.9,0.9,0.9,0.9,0.9,0.9],"base":[0,0,0,0,0,0],"x":[1,2,3,4,5,6],"y":[1190,14154,11877,3303,1771,4513],"text":["count:  1190<br />eth_race: Asian, NH<br />died: Unknown","count: 14154<br />eth_race: Black, NH<br />died: Unknown","count: 11877<br />eth_race: White, NH<br />died: Unknown","count:  3303<br />eth_race: Hispanic, all races<br />died: Unknown","count:  1771<br />eth_race: Other, NH<br />died: Unknown","count:  4513<br />eth_race: Unknown<br />died: Unknown"],"type":"bar","marker":{"autocolorscale":false,"color":"rgba(97,156,255,1)","line":{"width":1.88976377952756,"color":"transparent"}},"name":"Unknown","legendgroup":"Unknown","showlegend":true,"xaxis":"x","yaxis":"y","hoverinfo":"text","frame":null}],"layout":{"margin":{"t":26.958904109589,"r":7.30593607305936,"b":40.9132420091324,"l":54.7945205479452},"plot_bgcolor":"rgba(235,235,235,1)","paper_bgcolor":"rgba(255,255,255,1)","font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187},"xaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[0.4,6.6],"tickmode":"array","ticktext":["Asian, NH","Black, NH","White, NH","Hispanic, all races","Other, NH","Unknown"],"tickvals":[1,2,3,4,5,6],"categoryorder":"array","categoryarray":["Asian, NH","Black, NH","White, NH","Hispanic, all races","Other, NH","Unknown"],"nticks":null,"ticks":"outside","tickcolor":"rgba(51,51,51,1)","ticklen":3.65296803652968,"tickwidth":0.66417600664176,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":11.689497716895},"tickangle":-30,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(255,255,255,1)","gridwidth":0.66417600664176,"zeroline":false,"anchor":"y","title":{"text":"eth_race","font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187}},"hoverformat":".2f"},"yaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[-1719.75,36114.75],"tickmode":"array","ticktext":["0","10000","20000","30000"],"tickvals":[0,10000,20000,30000],"categoryorder":"array","categoryarray":["0","10000","20000","30000"],"nticks":null,"ticks":"outside","tickcolor":"rgba(51,51,51,1)","ticklen":3.65296803652968,"tickwidth":0.66417600664176,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":11.689497716895},"tickangle":-0,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(255,255,255,1)","gridwidth":0.66417600664176,"zeroline":false,"anchor":"x","title":{"text":"count","font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187}},"hoverformat":".2f"},"shapes":[{"type":"rect","fillcolor":null,"line":{"color":null,"width":0,"linetype":[]},"yref":"paper","xref":"paper","x0":0,"x1":1,"y0":0,"y1":1}],"showlegend":true,"legend":{"bgcolor":"rgba(255,255,255,1)","bordercolor":"transparent","borderwidth":1.88976377952756,"font":{"color":"rgba(0,0,0,1)","family":"","size":11.689497716895},"y":0.97594050743657},"annotations":[{"text":"died","x":1.02,"y":1,"showarrow":false,"ax":0,"ay":0,"font":{"color":"rgba(0,0,0,1)","family":"","size":14.6118721461187},"xref":"paper","yref":"paper","textangle":-0,"xanchor":"left","yanchor":"bottom","legendTitle":true}],"hovermode":"closest","barmode":"relative"},"config":{"doubleClick":"reset","showSendToCloud":false},"source":"A","attrs":{"1a41f391baa":{"x":{},"fill":{},"type":"bar"}},"cur_data":"1a41f391baa","visdat":{"1a41f391baa":["function (y) ","x"]},"highlight":{"on":"plotly_click","persistent":false,"dynamic":false,"selectize":false,"opacityDim":0.2,"selected":{"opacity":1},"debounce":0},"shinyEvents":["plotly_hover","plotly_click","plotly_selected","plotly_relayout","plotly_brushed","plotly_brushing","plotly_clickannotation","plotly_doubleclick","plotly_deselect","plotly_afterplot","plotly_sunburstclick"],"base_url":"https://plot.ly"},"evals":[],"jsHooks":[]}</script>

---

# Data structure

**ggplot2** works best when the dataset is in "long" format, where each variable has its own column. We recommend that you become familiar with `pivot_longer()` from **dplyr** (see the [Epi R Handbook pivoting page](https://epirhandbook.com/pivoting-data.html)).

![](https://github.com/appliedepi/emory_training/blob/master/presentation/images/pivoting.png?raw=true)

---

# Helpful charts

For color & fill use names or hex code

![](https://github.com/appliedepi/emory_training/blob/master/presentation/images/ggplot_colors.png?raw=true)
]

For shapes, try these numeric codes:

![](https://github.com/appliedepi/emory_training/blob/master/presentation/images/ggplot_shapes.png?raw=true)
]

---
class: inverse, center, middle

# Geographic Information Systems (GIS) 
### An introduction with R

---
# Shapefiles 
.pull-left[
- Multiple layers make file 
- Made up of: 
  - Points, lines, polygons
- handled with [{sf}](https://r-spatial.github.io/sf/) package
  - simple features:
   make things much more simple!
  - Can use tidyverse syntax as if dataframes
- Other formats exist (e.g. GeoJson)
]

<img src="https://github.com/appliedepi/emory_training/blob/master/presentation/images/shapefiles.png?raw=true" width="80%" />
]

???
Until a few years ago geospatial work in R was very painful - but with SF it has
now become as easy as dataframes in tidyverse!

---
# Import shapefile

```r
# import shapefile
shapefile <- read_sf(
  # define path to file
  here::here("case_study", "data", "covid_example_data", "covid_shapefile", 
             "FultonCountyZipCodes.shp")
)
```

```
## Simple feature collection with 48 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 2087952 ymin: 1274336 xmax: 2317492 ymax: 1522856
*## Projected CRS: NAD83 / Georgia West (ftUS)
## # A tibble: 48 x 5
##    ZipCode SHAPE_Leng  SHAPE_Area Population                            geometry
##    <chr>        <dbl>       <dbl>      <dbl>     <MULTIPOLYGON [US_survey_foot]>
##  1 30304        3578.     874668.          0 (((2228175 1327979, 2227998 132786~
##  2 30098        6187.    2071539.        118 (((2297621 1477003, 2296953 147697~
##  3 30291      123056.  271391984.      20835 (((2193467 1298536, 2192364 129806~
##  4 30268      294091. 1729387442.       7729 (((2121821 1318505, 2122024 131756~
##  5 30305       87773.  187624264.      29220 (((2233900 1400489, 2233914 139992~
##  6 30213      375869. 1801881966.      39840 (((2151675 1332989, 2151274 133220~
##  7 30092       17665.    2769444.          0 (((2278083 1456502, 2278075 145649~
##  8 30024       33262.   30183784.        299 (((2300742 1476746, 2300594 147679~
##  9 30097      124207.  336153806.      26791 (((2295375 1480030, 2295462 148000~
## 10 30022      169357.  673419811.      69893 (((2286054 1475364, 2286025 147513~
## # ... with 38 more rows
```

???
We can see here that the CRS is specific to Georgia (so doesn't play well with GPS points)

---
# Coordinate reference systems

---
# Changing the CRS

```r
# change shapefile coordinate reference system to WGS84
shapefile <- st_transform(shapefile, crs = 4326)
```

```
## Simple feature collection with 48 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.85116 ymin: 33.5025 xmax: -84.09764 ymax: 34.1862
*## Geodetic CRS:  WGS 84
## # A tibble: 48 x 5
##    ZipCode SHAPE_Leng  SHAPE_Area Population                            geometry
##  * <chr>        <dbl>       <dbl>      <dbl>                  <MULTIPOLYGON [°]>
##  1 30304        3578.     874668.          0 (((-84.39147 33.65054, -84.39205 3~
##  2 30098        6187.    2071539.        118 (((-84.16324 34.0603, -84.16545 34~
##  3 30291      123056.  271391984.      20835 (((-84.50521 33.56937, -84.50882 3~
##  4 30268      294091. 1729387442.       7729 (((-84.74079 33.62338, -84.7401 33~
##  5 30305       87773.  187624264.      29220 (((-84.37313 33.84985, -84.37308 3~
##  6 30213      375869. 1801881966.      39840 (((-84.64293 33.6636, -84.64424 33~
##  7 30092       17665.    2769444.          0 (((-84.22771 34.00394, -84.22774 3~
##  8 30024       33262.   30183784.        299 (((-84.15294 34.05959, -84.15342 3~
##  9 30097      124207.  336153806.      26791 (((-84.17066 34.06862, -84.17037 3~
## 10 30022      169357.  673419811.      69893 (((-84.20143 34.05579, -84.20153 3~
## # ... with 38 more rows
```

???
We can use sf::st_transform to re-project a spatial object 
We can use sf::st_crs to check (or set) the crs for a spatial object

---
# Take a look!

```r
# open up a ggplot
ggplot() + 
  # add the shapefile on top
  geom_sf(data = shapefile, 
          # no fill
          fill = NA,
          # black borders
          colour = "black") 
```
]

.pull-right[
<img src="slides_ggplot_files/figure-html/plot_shp_show-1.png" width="504" />
]
---
# Set CRS for point coordinates

```r
# define period of interest
recent_period   <- seq(as.Date("2021-06-30") - 13, as.Date("2021-06-30"), by = 1)

# convert linelist data frame into sf object (with georeference points)
linelist_sf <- linelist %>%
  # remove rows missing gps coordinates
  drop_na(lat, lon) %>%
  filter(
    # drop those with wrong GPS points
    lat >= 33 & lat <= 35,
    lon >= -85 & lon <= -84,
    # drop those outside the last 2 weeks 
    date_report %in% recent_period
  ) %>%
  # create an sf object
* st_as_sf(
    # define the coordinates based on lat/long variables
*   coords = c("lon", "lat"),
    # set the coordinate reference system to WGS84
*   crs = 4326,
    # do not change string variables to factors 
*   stringsAsFactors = FALSE
  )
```

---
# Set CRS for point coordinates

```
## Simple feature collection with 126 features and 4 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -84.66011 ymin: 33.53836 xmax: -84.20438 ymax: 34.16152
## Geodetic CRS:  WGS 84
## First 10 features:
##                 pid date_report age gender                   geometry
## 1  0de96c1a7bfc991b  2021-06-18   6 Female POINT (-84.38023 33.95349)
## 2  fb6aac67ab0dbfe3  2021-06-18  23 Female POINT (-84.36471 33.74956)
## 3  adc636708adbd668  2021-06-17  89 Female POINT (-84.36682 33.74679)
## 4  4987c88010a7b125  2021-06-20  65 Female POINT (-84.36018 34.02925)
## 5  9738fc3209ac189f  2021-06-21  14   Male POINT (-84.55165 33.60726)
## 6  8f03d5ffc0354218  2021-06-18  57 Female  POINT (-84.5458 33.58735)
## 7  c46e8649039e2fa9  2021-06-17  27 Female POINT (-84.34898 33.81149)
## 8  ac79c1b0d6993a5c  2021-06-17  47   Male POINT (-84.37456 33.86814)
## 9  e305f709c166eaf5  2021-06-17   7 Female POINT (-84.49827 33.56734)
## 10 b834f3e6759d0915  2021-06-18   1   Male POINT (-84.37045 33.67755)
```

---
# Basemaps

- A screenshot of the map area 
- Called tiles 
  - Smaller images pieced together 
- Added for context (bottom layer of *ggplot*)
- A lot of different tile servers 
- A lot of different R packages 
  - We demonstrate [{ggspatial}](https://paleolimbot.github.io/ggspatial/)
    - Able to save tiles for offline use 
    - Also have scalebars

---
# Bounding boxes

.center[
<img src="https://github.com/appliedepi/emory_training/blob/master/presentation/images/bounding_box.png?raw=true" width="70%" height="60%" />
]

---
# Download & plot basemap

```r
# get the bounding box for the shapefile 
bounding_box <- shapefile %>% 
  st_bbox()

# plot a base map including scale bar 
basemap <- ggplot() +
  # change the bounding box to an sf object
  # this defines the area to download map tiles for
  geom_sf(data = st_as_sfc(bounding_box)) +
  # download map tiles and add to the plot
* annotation_map_tile(
    # define what map tiles to use
    type =  "cartolight",
    # define folder to store tile images 
    cachedir = here::here("data", "map_tiles"),
    # define if should download tiles each time
    forcedownload = FALSE,
    # hide messages about download status and zoom
    progress = "none" ) 
```
]

---
# Points

```r
# plot the basemap 
points_map <- basemap + 
  # add the shapefile on top with no fill and black borders
  geom_sf(data = shapefile, 
          fill = NA, 
          colour = "black") +
  # plot points from linelist coloured by ethnicity
  # (order the factor so that the most frequent group is plotted first)
  geom_sf(data = linelist_sf, 
          mapping = aes(colour = fct_infreq(eth_race))) + 
  # choose colour combination 
  scale_colour_brewer(palette = "Set1") +
  # change the legend label
  labs(colour = "Ethnicity") + 
  theme(legend.position = "bottom")
```
]

---
# Choropleths - prepare data

```r
# get counts of points within polygons
shapefile <- shapefile %>%
  # add a column to the shapefile with counts
  mutate(
    # see which points are in which zip code polygon
*   cases = st_intersects(., linelist_sf) %>%
            # count how many are in each
*           lengths()
    )

# calculate incidence in shapefiledata
shapefile <- shapefile %>% 
  mutate(
    # divide cases by population 
    incidence = round(cases / Population * 100000, digits = 1), 
    # clean up calculations 
    incidence = case_when(
           # fix the outliers: set infinity to NA and cases less than 10 to NA
           cases < 5              ~ NA_real_,
           is.infinite(incidence) ~ NA_real_,  ## nb. infinite due to zero denominator
           TRUE                   ~ incidence)
    )
```

---
# Choropleths - prepare data

```r
# define breaks (for plotting groups)
breakers <- shapefile %>% 
  # change shapefile to a tibble (otherwise geometry pulled with)
  as_tibble() %>% 
  # only keep zips with more than ten cases
  filter(cases > 5) %>% 
  # pull the incidence column
  select(incidence) %>% 
  # define grouping cut-offs based on quartiles of observations 
  quantile(probs = seq(0, 1, 0.25), na.rm = TRUE)

# create a factor variable for incidence
shapefile <- shapefile %>% 
* mutate(
    # define groups using the cut function       
    incidence_cats = cut(incidence,                                       
                         # cut-offs as defined above (including zero)
                         breaks = c(0, breakers), 
                         # add labels by using the cut-offs
                         labels = str_glue("<={breakers}"))
    )
```

---
# Choropleths - plot data

```r
# plot basemap 
choropleth <- basemap + 
  # add in shapefile (with black borders) colour by incidence categories
  geom_sf(data = shapefile, 
          aes(fill = incidence_cats),
          colour = "black") + 
  # define colour scheme 
  scale_fill_brewer(palette = "YlGn", 
                    # edit legend labels to the categories defined
                    # rename the missing group to be fewer than 10 cases
                    labels = c(
                      levels(shapefile$incidence_cats),
                      "Fewer than 5 cases")) + 
  # add in text labels for each zip code 
  geom_sf_text(data = shapefile, 
               aes(label = ZipCode), 
               check_overlap = TRUE) + 
  # change the legend title 
  labs(fill = "Confirmed cases per \n 100,000 by Zip code")
```
]

---
# Bivariate choropleths 
- Placeholder for demo > see case study [weekly_report.Rmd](https://github.com/appliedepi/emory_training/blob/master/case_study/weekly_report.Rmd#L1894) for details

---
# Heatmaps

- Three dimensional distribution over each point and then interpolated
  
![](https://github.com/appliedepi/emory_training/blob/master/presentation/images/heatmaps.png?raw=true)

---
# Heatmaps - distance
- How many x-units in one kilometre?

.pull-left[
![](https://github.com/appliedepi/emory_training/blob/master/presentation/images/axis_distance.png?raw=true)
]

`$$x_{units} = \frac{x_{max}-x_{min}}{x_{metres}}$$`
]

---
# Heatmaps - calculating

```r
# define x and y minimum/maximum coordinates based on bounding box of base map
xmin <- st_point(c(bounding_box$xmin, bounding_box$ymin))
xmax <- st_point(c(bounding_box$xmax, bounding_box$ymin))
ymin <- xmin
ymax <- st_point(c(bounding_box$xmin, bounding_box$ymax))

# create sf points (with coordinate reference system WGS84) for the x and y axis
xaxis <- st_sfc(xmin, xmax, crs = 4326)
yaxis <- st_sfc(ymin, ymax, crs = 4326)
                 
# calculate the distance in metres on the axes of your based on longitude and latitude
# i.e. how many metres on your x axis and how many on your y
*xdist <- st_distance(xaxis)[2] %>% units::set_units("miles")
*ydist <- st_distance(yaxis)[2] %>% units::set_units("miles")

# calculate how many lat/long units there are on each axis 
*xunits <- st_distance(xmin, xmax)
*yunits <- st_distance(ymin, ymax)

# divide the difference in latitude or longitude by the corresponding distance
# returns the number of units per distance of interest (i.e. coord units per mile)
*xfact <- as.numeric(xunits / xdist)
*yfact <- as.numeric(yunits / ydist)
```

---
# Heatmaps - plotting

```r
heatmap <- basemap + 
  # add in shapefile not filled with black border
  geom_sf(data = shapefile, fill = NA, colour = "black") +
  # add in a density layer 
  stat_density_2d(
        # using the sf formatted linelist
        data = linelist_sf,
        aes(
          # extract the lat lon from geometry (list column)
          x = purrr::map_dbl(geometry, ~.[1]),
          y = purrr::map_dbl(geometry, ~.[2]), 
          # fill based on the calculated density
          fill = after_stat(level)
          ),
        # define the shape to use for smoothing
        geom = "polygon",
        # define whether to show legend (removed as uninformative)
        show.legend = FALSE,
        # bandwith (distance squared) 
        h = c(xfact, yfact)
        ) + 
  # define fill palette (viridis continuous) 
  scale_fill_viridis_c(option = "C")
```
]

.pull-right[
<img src="slides_ggplot_files/figure-html/heatmaps_plotting_show-1.png" width="504" />
]

---
class: center, middle

# Thank you