+ - 0:00:00
Notes for current slide
Notes for next slide
  • What makes us different is that we focus on the challenges of applied epi, not academic epi. We emphasize the skills used every day by ground-level epidemiologists.

Emory COVID-19 Response Collaborative

Data visualization with ggplot2: an overview and demonstration

EpiRhandbook Team

epiRhandbook
epirhandbook@gmail.com

August 2021

1 / 119

We are Applied Epi

  • We are a grassroots collaborative of applied epidemiologists
  • Our (free) Epi R Handbook has been used by 60,000 people worldwide
  • Now we offer customized training to health departments and NGOs

Bookmark www.epiRhandbook.com!

2 / 119
  • What makes us different is that we focus on the challenges of applied epi, not academic epi. We emphasize the skills used every day by ground-level epidemiologists.

Introducing ourselves

  • Neale Batra
    • Founder of the Epi R Handbook.
    • Previously @Philly, @SantaClaraCounty, and @USAID/PEPFAR. Internationally with @MSF and @WHO.
  • Alex Spina
    • Alum of EPIET (European EIS). Med student and epi consultant for @MSF and @WHO.
  • Mathilde Mousset
    • R programmer / Data manager at Epicentre (@MSF). PhD in evolutionary ecology.
  • Henry Laurenson-Schafer
    • Epi/data scientist with WHO Geneva on COVID response. PhD in molecular biology/bioinformatics.
  • Wen Lin
    • Epi @SantaClaraCounty, Resident SAS expert.
3 / 119

We have Wen available for answering any questions on translating SAS to R

Today's demonstration

Today is primarily explanation and demonstration, following from the previous demonstration on data management and workflow.

We assume you have reviewed the Handbook pages on R Basics and ggplot basics.

4 / 119
  • Today's session is primarily a demonstration with question & answer session.
  • We assume that you have read at least some of the relevant pages in the Epi R Handbook prior to this session.
  • We also offer training in which we guide you through case studies with hands-on coaching.

Objectives

Our objectives today are to:

1) De-mystify ggplot() syntax so you can apply it to your own tasks
2) Demonstrate core geom_() commands
3) Introduce helpful "wrapper" packages like incidence2 and apyramid
4) Touch upon advanced tips with packages ggExtra, ggrepel, gghighlight, scales, plotly, and esquisse
5) Introduce mapping (GIS) with sf

The Epi R Handbook contains many more examples.

5 / 119
  • 1 - core syntax and 2 - exposure
  • Ready yourself because today will be a whirlwind.
  • There will be a LOT of content and we will move quickly.
  • Type questions in the chat, or hold them to the end.
  • We will do understanding checks throughout

Resources for you

Today's work is within an R project, which can be downloaded as a zipped file.

Today's code is available to you in ggplot_demo.Rmd (also online here).

R project file (case_study.Rproj)

  • data/ folder
    • covid_example_data/ folder
      • linelist_cleaned.rds
      • covid_example_data.xlsx
      • covid_shapefile/ folder
        • FultonCountryZipCode.shp...
    • map_tiles/ folder
      • base map images for Fulton County...
  • ggplot_demo.Rmd (R markdown script)
6 / 119
  • Do NOT download and try it now. Too distracting.
  • Today will move fast, but just like last time the REAL opportunity for you is to review at your own pace.
  • We've spent a lot of time developing materials for you.
  • SHOW SCRIPT AND OUTPUT IN FOLDER

Load packages and import data

Load packages (ggplot2 is within tidyverse) and import the cleaned (fake) dataset from last session.

## install and load necessary packages
pacman::p_load(
rio, # importing data
here, # relative file pathways
skimr, # review data
DT, # visualize data frame
janitor, # data cleaning and tables
epikit, # age categories
lubridate, # working with dates
incidence2, # epidemic curves
ggrepel, # smart labels
ggExtra, # extras
esquisse, # point-and-click for simple ggplots
apyramid, # age pyramids
scales, # formatting of scales
plotly, # interactive plots
cowplot, # combine plots
tidyverse # ggplot2 & data management
)
linelist <- import(here::here("data", "covid_example_data", "linelist_cleaned.rds"))
7 / 119

One tip is to load tidyverse last to avoid masking of package names
The import() function can import almost any kind of file. An .rds file retains column classes from an R data frame.

Review data

Below are the first 25 rows of the linelist data frame:

Tip: Use skim() from skimr package to review all columns with summary statistics

8 / 119

Visualization options in R

Today we focus on ggplot2 because it:

  • is good for fast data exploration of multi-dimensional data
  • produces very high quality final outputs
  • has well-structured grammar => high consistency
  • is accompanied by many packages that expand functionality

See the R graph gallery for inspiration.

Other plotting options include base R, lattice, and plotly.

9 / 119

gg-what??

10 / 119

gg-what??

  • The ggplot2 package is the most popular data visualization tool in R
10 / 119

gg-what??

  • The ggplot2 package is the most popular data visualization tool in R

  • Its ggplot() function is at the core of the package

10 / 119

gg-what??

  • The ggplot2 package is the most popular data visualization tool in R

  • Its ggplot() function is at the core of the package

  • This whole approach is colloquially known as “ggplotting”

10 / 119

gg-what??

  • The ggplot2 package is the most popular data visualization tool in R

  • Its ggplot() function is at the core of the package

  • This whole approach is colloquially known as “ggplotting”

  • Resulting figures are sometimes affectionately called “ggplots”

10 / 119

gg-what??

  • The ggplot2 package is the most popular data visualization tool in R

  • Its ggplot() function is at the core of the package

  • This whole approach is colloquially known as “ggplotting”

  • Resulting figures are sometimes affectionately called “ggplots”

ggplot2 benefits from a wide variety of supplementary R packages that extends its functionalities, such as gganimate, ggthemr, ggdendro, gghighlight, ggforce...

Bonus question: What does the "gg” in these names represent?

10 / 119
  • "gg" represents the “grammar of graphics” used to construct the figures

Syntax overview

Build a plot object by “adding” commands on top of one another that specify plot layers and design elements

11 / 119

Syntax overview

Build a plot object by “adding” commands on top of one another that specify plot layers and design elements

The order of layers will usually look like this:

1) "Open" the plot with the ggplot() command and specify the dataset

11 / 119

Syntax overview

Build a plot object by “adding” commands on top of one another that specify plot layers and design elements

The order of layers will usually look like this:

1) "Open" the plot with the ggplot() command and specify the dataset

2) "Map" data columns to "aesthetic" features of the plot such as axes, color, size, shape, fill, transparency

11 / 119

Syntax overview

Build a plot object by “adding” commands on top of one another that specify plot layers and design elements

The order of layers will usually look like this:

1) "Open" the plot with the ggplot() command and specify the dataset

2) "Map" data columns to "aesthetic" features of the plot such as axes, color, size, shape, fill, transparency

3) Add (+) “geom” layers that visualize data geometrically as shapes

11 / 119

Syntax overview

Build a plot object by “adding” commands on top of one another that specify plot layers and design elements

The order of layers will usually look like this:

1) "Open" the plot with the ggplot() command and specify the dataset

2) "Map" data columns to "aesthetic" features of the plot such as axes, color, size, shape, fill, transparency

3) Add (+) “geom” layers that visualize data geometrically as shapes

4) Modify "scales", such as a color scale or y-axis breaks

11 / 119

Syntax overview

Build a plot object by “adding” commands on top of one another that specify plot layers and design elements

The order of layers will usually look like this:

1) "Open" the plot with the ggplot() command and specify the dataset

2) "Map" data columns to "aesthetic" features of the plot such as axes, color, size, shape, fill, transparency

3) Add (+) “geom” layers that visualize data geometrically as shapes

4) Modify "scales", such as a color scale or y-axis breaks

5) Add "theme" plot design elements such as axis labels, title, caption, fonts, text sizes, background themes, or axes rotation

11 / 119

Syntax overview

Build a plot object by “adding” commands on top of one another that specify plot layers and design elements

The order of layers will usually look like this:

1) "Open" the plot with the ggplot() command and specify the dataset

2) "Map" data columns to "aesthetic" features of the plot such as axes, color, size, shape, fill, transparency

3) Add (+) “geom” layers that visualize data geometrically as shapes

4) Modify "scales", such as a color scale or y-axis breaks

5) Add "theme" plot design elements such as axis labels, title, caption, fonts, text sizes, background themes, or axes rotation

These layers are "added" sequentially with + symbols.
ggplot2 commands can be quite long!

11 / 119

Remember that although the commands may be long, it is infinitely easier to edit and recycle than in Excel

Open the plot

ggplot() creates an empty canvas. Assign the data frame to use.

ggplot(data = linelist)

Alternatively, use the %>% pipe operator to "pipe" a data frame into ggplot()

linelist %>%
ggplot()

12 / 119

This is only a blank canvas, we have not defined what should be in the x and y axes. If several data frames are needed, they can be added in their own geoms. Piping is useful to make one-time changes to a dataset prior to plotting.

Mappings with aes()

ggplot(
data = linelist,
mapping = aes(
x = age,
y = days_hosp))

"Aesthetics" are features whose display could vary for each data point (position, color, shape...)

mapping = aes() assigns plot "aesthetics" to columns in the data. Inputs must be placed within aes().

Two basic aesthetic mappings are axes, via x= and y=.

13 / 119

Add geometry

ggplot(
data = linelist,
mapping = aes(
x = age,
y = days_hosp)) +
geom_point()

Data are visualized using "geom" commands, such as geom_point().

These commands are "added" with a + to the ggplot() command.

14 / 119

Geometries

Some classical “geoms” include:

Geometry Geom
Histograms geom_histogram()
Points geom_point()

Full list here

15 / 119

Geometries

Some classical “geoms” include:

Geometry Geom
Histograms geom_histogram()
Points geom_point()
Lines geom_line()
Bar plots geom_bar() or geom_col()

Full list here

16 / 119

Geometries

Some classical “geoms” include:

Geometry Geom
Histograms geom_histogram()
Points geom_point()
Lines geom_line()
Bar plots geom_bar() or geom_col()
Boxplots geom_boxplot()
Violin plots geom_violin()

Full list here

17 / 119

Adding geoms

ggplot(
data = linelist,
mapping = aes(
x = age,
y = days_hosp)) +
geom_point()

With axes now mapped, geom_point() displays the data as points.

18 / 119

Adding geoms

ggplot(
data = linelist,
mapping = aes(
x = age,
y = days_hosp)) +
geom_point() +
geom_smooth()

We can add additional geoms to the current plot with +.

Geoms appear in the order they are written: the smoothed line appears over the points.

geom_smooth() gives smoothed conditional means, helping to show trends in presence of "over-plotting" (see documentation)

19 / 119
  • Explain why you might use one or the other

A quick note on indentations

Indentations, spaces, and newlines do not impact code execution, and can be varied to improve readability.

ggplot(data = linelist, mapping = aes(x = age, y = days_hosp))+geom_point()

is the same as:

ggplot(data = linelist,
mapping = aes(x = age, y = days_hosp)) +
geom_point()

is the same as:

ggplot(
data = linelist, # use case linelist
mapping = aes( # make aesthetic mappings for all geoms
x = age, # assign x-axis to age column
y = days_hosp)) + # assign y-axis to duration of hospitalization
geom_point() # display data as points
20 / 119
  • Explain why you might use one or the other long style can enable informative comments/annotations
  • short style very dense (harder to read for some). Shorter scripts, but so what? The number of lines of your code is not an informative metric.
  • very long lines => needs to scroll horizontally for people with smaller monitors (not nice)
  • long-ish style makes it easier to see which argument belongs to each function
  • spaces around "=" or "+" => make it easier to parse to many people
  • other?

Other aesthetics

Aside from axes, other common "aesthetics" include:

Argument Controls
shape Display of point as dot, star, triangle, square...
fill The interior color (e.g of bar or boxplot)
color The exterior or bar, boxplot - OR point color
size Line thickness, point size...
alpha Transparency: 0 (invisible) to 1 (opaque)
width Width of "bar plot" bars
linetype Either solid, dashed, dotted, etc.
binwidth Width of histogram bins
21 / 119

Note that “aesthetic” in ggplot has a specific meaning that you might associate with the word “aesthetics” in common English. In ggplot those details are called “themes” and are adjusted within a theme() command

Each geom accepts certain aesthetics, like binwidth= for geom_histogram()

Aesthetics assignments

Aesthetics can be assigned to either:

  • Static values: color = "purple"

    • Assigned outside aes()
    • Same display for all data
  • A data column: aes(color = died)

    • Assigned inside aes()
    • Displays data as "groups"

Some examples:

22 / 119

Aesthetics assignments

Aesthetics can be assigned to either:

  • Static values: fill = "purple"

    • Assigned outside aes()
    • Same display for all data
  • A data column: aes(fill = died)

    • Assigned inside aes()
    • Displays data as "groups"

More examples:

23 / 119

Static aesthetics

ggplot(
data = linelist,
mapping = aes(
x = age,
y = days_hosp)) +
geom_point(color = "seagreen")

An aesthetic is static if it applies the same display to all data points in the geom or plot. Static aesthetics are defined outside aes() to a number or character value.

Other examples:
size = 3
alpha = 0.5
width = 1.2

24 / 119

Dynamic aesthetics

ggplot(
data = linelist,
mapping = aes(
x = age,
y = days_hosp,
color = died)) +
geom_point()

Dynamic aesthetics are mapped to a column name, and defined inside aes(). This creates "groups" in the plot and generates a legend. The display varies for each data point.

Above, color= is mapped to column died, and is inherited by geom_point().

25 / 119

Static and dynamic

ggplot(
data = linelist,
mapping = aes(
x = age,
y = days_hosp,
color = died)) +
geom_point(
size = 7,
alpha = 0.7)

Above, size = 7 and alpha = 0.7 are static aesthetics defined outside any aes() and apply to geom_point().

color= is within an aes(), and maps to values in column died.

Read more about ggplot aesthetics here

26 / 119

As there is only one geom, all aesthetics can be written in ggplot(), or in geom_point()

Aesthetic mapping placement

1) Dynamic mappings in the initial ggplot() call will apply to subsequent geoms, unless otherwise indicated
2) Static aesthetics (e.g. color = "blue") are not inherited by subsequent geoms
3) Mappings written within one geom apply only to that geom

ggplot(
data = linelist,
mapping = aes(
x = age,
y = days_hosp)) +
# points colored by outcome
geom_point(
mapping = aes(color = died),
size = 1) +
# smoothed means by outcome
geom_smooth(
mapping = aes(color = died),
size = 3) +
# smoothed mean of entire dataset
geom_smooth(color = "black")

27 / 119
  • Three rules to remember as you decide where to place aesthetic mappings
  • This is rather complex example, but serves to demonstrate the above guidelines.

A common error

If your data are "grouped", yet you plot indiscriminately without the grouping assigned to a dynamic aesthetic, you get multiple y values per x value.

Below, linelist is aggregated into counts per unique week-outcome.

outcome_week_data <- linelist %>%
count(died,
week_onset = floor_date(
x = date_onset,
unit = "week")) %>%
select(week_onset, everything()) %>%
arrange(week_onset) %>%
drop_na(week_onset)

First few rows:

week_onset died n
2019-12-29 Yes 1
2019-12-29 No 5
2020-01-05 No 7
2020-01-05 Unknown 1
2020-01-12 Yes 1
2020-01-12 No 5
2020-01-12 Unknown 2
2020-01-19 No 4
28 / 119

A common error

ggplot(
data = outcome_week_data,
mapping = aes(
x = week_onset,
y = n)) +
geom_line()

Above, the column died is not provided to any grouping aesthetic.

ggplot() does not know how to display the multiple values for each week.

29 / 119

A common error - resolved

ggplot(
data = outcome_week_data,
mapping = aes(
x = week_onset,
y = n,
color = died))+
geom_line()

Use aesthetic mappings like fill= or color= to correctly visualize the "groups".

Above, color= is mapped to the died column.

30 / 119

Facets

ggplot(
data = linelist,
mapping = aes(x = date_onset)) +
geom_histogram() +
facet_wrap(~eth_race)

Another way of displaying groups is via faceting.

facet_wrap() produces one facet per unique value.

Place a "~" before the column name.

31 / 119

Also called "small multiples"

Facets

ggplot(
data = linelist,
mapping = aes(x = date_onset)) +
geom_histogram() +
facet_wrap(~eth_race,
scales = "free_y")

"Free" auto-scaled axes with scales=. Options:

  • "free_y"
  • "free_x"
  • "free" (both x and y)

Alert your audience if you use free axes! Also, try ncol= and nrow=

32 / 119

Facets - by two variables

ggplot(
data = linelist,
mapping = aes(x = date_onset)) +
geom_histogram() +
facet_wrap(eth_race ~ gender)

The "~" signifies "by". You can place columns on either side.

With facet_wrap(), levels are combined into facet titles, appearing alphabetically/by factor level.

33 / 119

Facets - grid layout

Use facet_grid(). Labels appear on top and side.

ggplot(
data = linelist,
mapping = aes(x = date_onset)) +
geom_histogram() +
facet_grid(eth_race ~ gender)

34 / 119

Facets - drop levels

Use filter() or drop_na() in advance, to remove undesired levels.

linelist %>%
drop_na(gender, eth_race) %>%
filter(gender != "Unknown") %>%
# begin plot
ggplot(
mapping = aes(x = date_onset)) +
geom_histogram() +
facet_grid(eth_race ~ gender)

Above, we pipe the adjusted data frame into ggplot(), for ease

35 / 119

Facets + gghighlight()

Add gghighlight() from gghighlight to show a full "shadow" behind each facet. For color add the aes(fill=).

ggplot(
data = linelist,
mapping = aes(
x = date_onset,
fill = eth_race)) +
geom_histogram() +
facet_wrap(~ eth_race) +
gghighlight::gghighlight()

36 / 119

gghighlight

Add gghighlight() to other plots, and specify specific values (or criteria) to highlight

linelist %>%
# get daily counts by age group
group_by(age_group, date_report) %>%
count() %>%
# plot
ggplot(
mapping = aes(
x = date_report,
y = n,
color = age_group)) +
geom_line() +
gghighlight::gghighlight(
age_group %in% c("40-49", "60-69)"))+
theme(legend.position = "none")

37 / 119

Scales - overview

Scale commands replace defaults of how the aesthetic mappings manifest, such as:

  • Which colors or shapes to display
  • The min/max of point sizes
  • The min/max and frequency of axes breaks

As a generic formula, these commands are written as: scale_AESTHETIC_METHOD().

1) scale_ : this prefix never changes
2) AESTHETIC: _fill_ , _color_ , _x_ , _y_ , etc.
3) METHOD: _continuous(), _discrete(), _manual(), _date(), etc.

38 / 119

Scales examples

Some examples of scale commands:

You want to adjust Scale command
continuous y-axis scale_y_continuous()
date x-axis scale_x_date()
categorical x-axis scale_x_discrete()
fill, continuous scale_fill_continuous()
fill, continuous scale_fill_gradient()
color, manual assign scale_color_manual()
39 / 119

Scales - default

ggplot(
data = linelist,
mapping = aes(
x = eth_race,
fill = died)) +
geom_bar()

Above, the fill of a bar plot uses the default colors and axis breaks:

40 / 119

Scales - adjusted fill

ggplot(
data = linelist,
mapping = aes(
x = eth_race,
fill = died)) +
geom_bar() +
scale_fill_manual(
values = c(
"Yes" = "violetred",
"No" = "aquamarine",
"Unknown" = "grey"))

Within scale_fill_manual() we provide assignments within a vector c().

Use na.value = "grey" for missing values

41 / 119

Discuss the na.value= arguments in most scale commands, and the difference between having NA values in the data and having an explicit missing value such as "Unknown".

Scales - adjusted y-axis

ggplot(
data = linelist,
mapping = aes(
x = eth_race,
fill = died)) +
geom_bar() +
scale_fill_manual(
values = c(
"Yes" = "violetred",
"No" = "aquamarine",
"Unknown" = "grey")) +
scale_y_continuous(
breaks = seq(from = 0,
to = 35000,
by = 5000))

In scale_y_continuous() we adjust the y-axis breaks using seq() to define a numeric sequence.

42 / 119

Scales - Start axes at 0

ggplot(
data = linelist,
mapping = aes(
x = eth_race,
fill = died)) +
geom_bar() +
scale_fill_manual(
values = c("Yes" = "violetred",
"No" = "aquamarine",
"Unknown" = "grey")) +
scale_y_continuous(
breaks = seq(from = 0,
to = 35000,
by = 5000),
expand = c(0, 0)) +
scale_x_discrete(
expand = c(0, 0))

In any scale_x_ or scale_y_ command, use expand = c(0,0) to remove excess space around the plot features.

43 / 119

Scales - date axis labels

ggplot(
data = linelist,
mapping = aes(x = date_onset)) +
geom_histogram()

The default scale for date axes will vary by the range of your data.

44 / 119

Scales - date label breaks

ggplot(
data = linelist,
mapping = aes(x = date_onset)) +
geom_histogram() +
scale_x_date(
date_breaks = "3 months")

Adjust axis labels with scale_x_date().

Use date_breaks= values like "1 week", "2 weeks", or "3 months".

Note: these are the axis label breaks, not histogram bin breaks!

Note: Default "weeks" start on Mondays. See epiRhandbook epicurves page for alternatives.

45 / 119

For tips on geom_histogram() bins, see Epi R Handbook epicurves page

Scales - date axis labels

ggplot(
data = linelist,
mapping = aes(x = date_onset)) +
geom_histogram() +
scale_x_date(
date_breaks = "3 months",
date_labels = "%d %b\n%Y")

Specify date label format to date_labels= using "strptime" syntax

"%d %b %Y" for DD MMM YYYY.

See Epi R Handbook Epicurves and Strings pages for more tips

46 / 119

/n is a newline

Scales - date axis labels

ggplot(
data = linelist,
mapping = aes(x = date_onset)) +
geom_histogram() +
scale_x_date(
date_breaks = "3 months",
labels = scales::label_date_short() )

Alternatively, simplify by assigning labels= to label_date_short() from the scales package

The year is not repeated on each label anymore.

Note the use above of labels=, not date_labels=.

47 / 119

Scales - display percents

Easily display proportions as percents with percent() from scales within scale_y_continuous().

First few rows of a CFR dataset:

month cases deaths CFR
2020-01-01 1 0 0.0000000
2020-02-01 54 3 0.0555556
2020-03-01 857 73 0.0851809
2020-04-01 1832 152 0.0829694
ggplot(
data = CFR_data,
mapping = aes(
x = month,
y = CFR) +
geom_line(size = 2, color = "brown") +
scale_y_continuous(labels = percent)

48 / 119

Remember to load the scales package

Plot labels

ggplot(data = linelist) +
geom_point(
mapping = aes(
x = age,
y = days_hosp,
color = died),
alpha = 0.3) +
labs(
title = "Duration of admission",
subtitle = "Fulton County, GA",
x = "Age (years)",
y = "Duration (days)",
caption = "Fictional COVID-19 data",
color = "Deceased"
)

Use labs() as above.

Note: to edit legend title, use the aesthetic that created the legend (e.g. color=).

49 / 119

Plot label tips

See the Epi R Handbook page on adjusting characters with the stringr package

New lines

"Top line\nNextline" # \n creates new line
str_wrap("This is a really long title", 30) # wrap at 30 characters
50 / 119

Plot label tips

See the Epi R Handbook page on adjusting characters with the stringr package

New lines

"Top line\nNextline" # \n creates new line
str_wrap("This is a really long title", 30) # wrap at 30 characters

Dynamic labels - Embed code in str_glue() that updates with the data

str_glue("Data as of {Sys.Date()}")
## Data as of 2021-08-20
str_glue("{fmt_count(linelist, is.na(date_onset))} cases missing onset and not shown")
## 37323 (45.7%) cases missing onset and not shown
50 / 119

Explain that in str_glue, anything within curly brackets it will run as R code.

Themes

Themes are non-data design features (background, text size/color, etc).

These "complete themes" are easy to add.

# Try one of these...
theme_bw()+
theme_classic()+
theme_dark()+
theme_gray()+
theme_minimal()+
theme_light()+
theme_void()+

Try the argument base_size = 16 to quickly increase text size.

51 / 119

Themes

Make micro-adjustments with theme(). See the handbook for tips.

ggplot(
data = linelist,
mapping = aes(
x = age,
y = days_hosp,
color = died),
alpha = 0.3) +
geom_point() +
labs(
title = "Duration of admission",
x = "Age (years)",
y = "Duration (days)",
color = "Deceased")+
theme_minimal(base_size = 16) +
theme(
legend.position = "bottom",
plot.title = element_text(
color = "red",
face = "bold"),
axis.title.y = element_text(angle = 90))

The syntax takes practice - see this list of feature-specific arguments.

52 / 119

Talk about these theme() arguments and how they consist of two parts, just like mapping = aes(). Explain that nobody has these all memorized, but the common ones are easy to remember once you use them enough.
Remember to add them AFTER any complete themes.

Bar plots

geom_col() and geom_bar() are used to make general bar plots (non-epicurves).

53 / 119

Bar plots

geom_col() and geom_bar() are used to make general bar plots (non-epicurves).

Use geom_bar() if bar height should reflect the number of rows in the data (e.g. a case linelist).

ggplot(
data = linelist, # begin with linelist
mapping = aes(x = eth_race)) + # No y= argument
geom_bar()
53 / 119

Bar plots

geom_col() and geom_bar() are used to make general bar plots (non-epicurves).

Use geom_bar() if bar height should reflect the number of rows in the data (e.g. a case linelist).

ggplot(
data = linelist, # begin with linelist
mapping = aes(x = eth_race)) + # No y= argument
geom_bar()

Use geom_col() there is a numeric column containing the desired bar height (e.g. aggregated count data).

ggplot(
data = linelist_agg, # begin with aggregated count data
mapping = aes(x = eth_race, y = n)) + # bar height is value in column "n"
geom_col()
53 / 119

Bar plot - count rows

With geom_bar(), the height reflects the number of rows per x-axis group. This works well for linelist data.

date_onset eth_race died
2020-03-20 White, NH No
2020-01-28 White, NH No
2020-02-10 Black, NH No
2021-05-19 Black, NH No
2020-02-20 White, NH Unknown
2020-01-17 Black, NH Yes
ggplot(
data = linelist,
mapping = aes(
x = eth_race)) +
geom_bar() +
theme(axis.text.x= element_text(angle=30))

54 / 119

Bar plot - count grouped rows

To achieve "stacked" bars with geom_bar(), assign the grouping column to fill=, within aes().

date_onset eth_race died
2020-03-20 White, NH No
2020-01-28 White, NH No
2020-02-10 Black, NH No
2021-05-19 Black, NH No
2020-02-20 White, NH Unknown
2020-01-17 Black, NH Yes
ggplot(
data = linelist,
mapping = aes(
x = eth_race,
fill = died)) +
geom_bar() +
theme(axis.text.x= element_text(angle=30))

55 / 119

Bar plot - aggregated counts

In contrast, geom_col() uses a column of counts, such as column n in this linelist_eth dataset:

eth_race n
Asian, NH 3022
Black, NH 34395
White, NH 27303
Hispanic, all races 8600
Other, NH 2912
Unknown 5525

Column n provides the bar height.

ggplot(linelist_eth) +
geom_col(
mapping = aes(
x = eth_race,
y = n))+
theme(axis.text.x= element_text(angle= 30))

56 / 119

Bar plot - stacked counts

To have "stacked" bars using geom_col(), each plotting group must have its own rows in the data. Use "long"-style data like below:

eth_race died n
Asian, NH Yes 30
Asian, NH No 1802
Asian, NH Unknown 1190
Black, NH Yes 1037
Black, NH No 19204
Black, NH Unknown 14154
ggplot(linelist_eth_died) +
geom_col(
mapping = aes(
x = eth_race,
y = n,
fill = died)) +
theme(axis.text.x= element_text(angle=30))

See the pivoting page in the Epi R Handbook

57 / 119

Bar plot - a common error

If your data look like below (counts) and your plot looks like that (bars of same height), ensure you are using geom_col() and not geom_bar()!

eth_race n
Asian, NH 3022
Black, NH 34395
White, NH 27303
Hispanic, all races 8600
Other, NH 2912
Unknown 5525

This plot shows that there is one row per ethnicity/race!

58 / 119

Bar plot - flip axes

ggplot(linelist) +
geom_bar(
mapping = aes(
x = eth_race,
fill = died)) +
theme(legend.position = "top") +
coord_flip()

It is simple to flip the axes on any ggplot by adding coord_flip()

This is useful in bar plot to improve readability.

59 / 119

Bar plot - adjust order

ggplot(linelist) +
geom_bar(
mapping = aes(
x = fct_infreq(eth_race),
fill = fct_infreq(died))) +
theme(legend.position = "top") +
coord_flip()

Use the forcats package to convert to class factor and adjust "level" order.

Above, fct_infreq() orders x-axis position and stacks (fill) by frequency.

See the Epi R Handbook page on factors.

60 / 119

You will need to clean up the axes and legend titles now.

Bar plot - reverse order

ggplot(linelist) +
geom_bar(
mapping = aes(
x = fct_rev(fct_infreq(eth_race)),
fill = fct_rev(fct_infreq(died)))) +
theme(legend.position = "top") +
coord_flip()

To reverse order, use fct_rev(), which can be wrapped around other functions.

61 / 119

Bar plot - adjust width

ggplot(
data = linelist_eth_died,
mapping = aes(
x = eth_race,
y = n,
fill = died)) +
geom_col(width = 0.5) +
theme(axis.text.x = element_text(angle = 30))

Adjust bar width with width=.

62 / 119

Be wary of adjusting width for date bars (e.g. month) - use geom_histogram() instead with bin breaks.

Bar plot - adjacent

ggplot(
data = linelist_eth_died,
mapping = aes(
x = eth_race,
y = n,
fill = fct_infreq(died),
label = n)) +
geom_col(position = "dodge")+
theme(axis.text.x = element_text(angle = 30))

To make the bars adjacent, specify position = "dodge" in geom_col().

Adjust the order with via a forcats function.

63 / 119

Bar plot - display counts

To display text, use geom_col() with geom_text(). Assign aes(label=) to the height values.

The position= arg in geom_text() can specify display at mid-bar.

ggplot(
data = linelist_eth_died,
mapping = aes(
x = eth_race,
y = n,
fill = died,
label = n)) +
geom_col() +
geom_text(
size = 3,
position = position_stack(vjust = 0.5)) +
theme(axis.text.x = element_text(angle=30))

64 / 119

Point-and-click ggplot2

Now that you've learned the basic syntax... the esquisse package allows point-and-click creation of (simple) ggplots! (esquisse means "sketch" in French)

65 / 119

Make an epidemic curve

Two approaches that we suggest:

1) Use the incidence2 package

  • Fast, simple, and modifiable with ggplot additions

2) Use ggplot's geom_histogram()

  • Most customizeability
  • Most complex code
66 / 119

Make an epidemic curve

Two approaches that we suggest:

1) Use the incidence2 package

  • Fast, simple, and modifiable with ggplot additions

2) Use ggplot's geom_histogram()

  • Most customizeability
  • Most complex code

Today's demonstration will focus on incidence2.

See the Epi R handbook's Epicurves page for detailed examples of both methods

66 / 119

Epicurve - incidence object

Create an incidence object

weekly <- incidence(
x = linelist,
date_index = date_onset,
interval = "week")

Plot the incidence object

plot(weekly)

67 / 119

Epicurve - incidence object

Adjust the interval

bimonthly <- incidence(
x = linelist,
date_index = date_onset,
interval = "2 months")

Plot the incidence object

plot(bimonthly)

Try "Sunday weeks" or "MMWR week" for Sunday weeks.

68 / 119

Epicurve - groups

To show groups, specify them to groups= in the incidence() command:

weekly <- incidence(
x = linelist,
date_index = date_onset,
interval = "weeks",
groups = eth_race)

AND to fill= in the plot() command:

plot(weekly,
fill = eth_race)

If grouping by multiple columns, nest them in both places within c():

groups = c(eth_race, gender)

69 / 119

Epicurve - ggplot2 additions

plot(weekly,
fill = eth_race) +
scale_y_continuous(
expand = c(0,0),
breaks = seq(0,2000,250)) +
scale_y_continuous(
expand = c(0,0),
breaks = seq(0,2000,250)) +
theme_minimal(base_size = 16) +
theme(legend.position = "top") +
labs(fill = "Race and\nEthnicity")

Add other ggplot2 commands with +.

70 / 119

Epicurve - date axis

plot(weekly,
fill = eth_race,
date_format = "%a %d %b %Y\n (Week %W)",
angle = 30)+
scale_y_continuous(
expand = c(0,0),
breaks = seq(0,2000,250)) +
theme_minimal(base_size = 16) +
theme(legend.position = "top") +
labs(fill = "Race and\nEthnicity")

Do not adjust incidence2 epicurves with scale_x_date().

Instead, use custom incidence2 arguments in plot().

71 / 119

Epicurve - show cases

For small outbreaks, the style show_cases = TRUE may be helpful:

small_outbreak <- linelist %>%
filter(
zip == 30024,
date_report >= as.Date("2020-12-01")) %>%
incidence(
date_index = date_onset,
interval = "Sunday weeks",
groups = eth_race)
plot(small_outbreak,
fill = eth_race,
show_cases = TRUE)+
theme(legend.position = "bottom")+
labs(
title = "ZIP 30024 by race")

72 / 119

Dataset is piped in from above

Epicurve - color palettes

Use scale_fill_viridis_d() for palettes that are color-blind friendly:

plot(weekly, fill = eth_race)+
scale_fill_viridis_d(
option = "inferno",
name = "Race and\nEthnicity",
na.value = "grey")

Viridis (try with option = "plasma" or "inferno"), and colorbrewer palette functions can be added to any ggplot as well.

The _d() indicates a scale for discrete values

73 / 119

Epicurve - aggregated counts

You can also use incidence2 on data that are aggregated counts...

# For demo: aggregate linelist
linelist_day_counts <- linelist %>%
count(
day = floor_date(date_report, "day"),
died) %>%
drop_na(day)

A few rows of daily counts:

day died n
2021-06-19 No 11
2021-06-19 Unknown 11
2021-06-20 No 10
2021-06-20 Unknown 6
2021-06-21 No 12
2021-06-21 Unknown 11
2021-06-22 No 9
2021-06-22 Unknown 6
2021-06-23 No 7
2021-06-23 Unknown 7
74 / 119

Epicurve - aggregated counts

died_curve <- incidence(
linelist_day_counts,
date_index = day,
count = n,
interval = "week",
groups = died
)
# plot stacked by outcome
plot(died_curve,
fill = died)

75 / 119

Demographic pyramids

We suggest two ways to make age pyramids:

1) Use the apyramid package - simple, easy

2) Use ggplot2 - more customizeable, but opportunity for error

Today we will demonstrate apyramid.

See the Epi R Handbook page on demographic pyramids and the package vignette for more information.

76 / 119

Demographic pyramids

age_pyramid(
data = linelist,
age_group = "age_group",
split_by = "gender")

The function age_pyramid() from apyramid offers an easy interface.

We have cleaned gender to only two values

77 / 119

Demographic pyramids

age_pyramid(
data = linelist,
age_group = "age_group",
split_by = "gender",
proportional = TRUE,
show_midpoint = FALSE,
pal = c("darkgreen", "brown"))

Further modifications:

  • Percents instead of raw counts on the x-axis
  • No mid-point designation
  • Specified colors

78 / 119

Demographic pyramids

age_pyramid(
data = linelist,
age_group = "age_group",
split_by = "gender",
stack_by = "hospitalized"
proportional = TRUE,
show_midpoint = FALSE,
pal = c("darkgreen", "brown",
"yellow", "orange"))

You can stack the bars by another variable with stack_by=

79 / 119

Demographic pyramids

age_pyramid(
data = linelist,
age_group = "age_group",
split_by = "gender",
proportional = TRUE,
show_midpoint = FALSE) +
theme_minimal(base_size = 10) +
labs(
title = "Age and Gender",
subtitle = "Fulton County, GA",
x = "Percent of total",
y = "Age group",
fill = "Gender",
caption = "Caption here")

You can add further ggplot commands.

80 / 119

Dynamic labels example

Let's use dynamic labels that will update with the data.

Simple caption

str_glue("n = {nrow(linelist)}")
## n = 81757
81 / 119

Dynamic labels example

Let's use dynamic labels that will update with the data.

Simple caption

str_glue("n = {nrow(linelist)}")
## n = 81757

Complex caption

For complicated str_glue() scenarios, define the dynamic components outside the quotation marks.

fmt_count() from epikit is useful to count and display rows.

str_glue("{missing} missing age or gender not shown.",
missing = fmt_count(linelist, is.na(gender) | is.na(age_group))
)
## 405 (0.5%) missing age or gender not shown.
81 / 119

Dynamic labels example

Subtitle

Max/Min dates can be wrapped in format() from base to adjust the display. See the epiRhandbook section on strptime syntax.

str_glue("Fulton County, reported {min_date} - {max_date}",
min_date = format(min(linelist$date_report, na.rm=T), "%B %d %Y"),
max_date = format(max(linelist$date_report, na.rm=T), "%B %d %Y")
)
## Fulton County, reported January 03 2020 - June 23 2021
82 / 119

Dynamic captions

Now applying those dynamic captions:

age_pyramid(
data = linelist,
age_group = "age_group",
split_by = "gender",
proportional = TRUE,
show_midpoint = FALSE) +
theme_minimal(base_size = 10) +
labs(
title = "Age and Gender",
subtitle = str_glue(
"Fulton County, reported {min_date} - {max_date}",
min_date = format(
min(linelist$date_report, na.rm=T),
"%B %d %Y"),
max_date = format(
max(linelist$date_report, na.rm=T),
"%B %d %Y")
),
x = "Percent of total",
y = "Age group",
fill = "Gender",
caption = str_glue(
"{missing} missing age or gender not shown.",
missing = fmt_count(
linelist,
is.na(gender) | is.na(age_group))
)
)

83 / 119

Labeling

To demonstrate labeling points, we create a new dataset that summarizes CFR and median age by race/ethnicity

race_CFR_age <- linelist %>%
group_by(eth_race) %>%
summarise(
cases = n(),
deaths = sum(died_covid == "Yes",
na.rm=T),
CFR = deaths/cases,
med_age = median(age, na.rm=T) )
eth_race cases deaths CFR med_age
Asian, NH 3022 23 0.0076109 34
Black, NH 34395 833 0.0242186 39
White, NH 27303 417 0.0152730 36
Hispanic, all races 8600 55 0.0063953 34
Other, NH 2912 6 0.0020604 32
84 / 119

Labeling

We can plot these data as points, and label them with geom_text()... but it does not look very good.

ggplot(
data = race_CFR_age,
mapping = aes(
x = med_age,
y = CFR,
size = cases,
label = eth_race)) +
geom_point() +
geom_text()

85 / 119

Labeling

Labels look much improved by using geom_label_repel() from the ggrepel package:

library(ggrepel)
ggplot(
data = race_CFR_age,
mapping = aes(
x = med_age,
y = CFR,
size = cases,
label = eth_race)) +
geom_point() +
geom_label_repel(
size = 5,
min.segment.length = 0)

86 / 119

Labeling

labels= can be assigned complex values with str_glue():

library(ggrepel)
ggplot(
data = race_CFR_age,
mapping = aes(
x = med_age,
y = CFR,
size = cases,
label = str_glue(
"{eth_race}\n{cases} cases"))) +
geom_point() +
geom_label_repel(
size = 5,
min.segment.length = 0)

87 / 119

Labeling

Use comma() from the scales package to make numbers display with comma separators every three digits.

ggplot(
data = race_CFR_age,
mapping = aes(
x = med_age,
y = CFR,
size = cases,
label = str_glue(
"{eth_race}\n{comma(cases)} cases"))) +
geom_point() +
geom_label_repel(
size = 4,
min.segment.length = 0)

See the many other useful functions in scales.

88 / 119

Marginal distributions

Add "marginal" histograms to scatterplots with ggMarginal() from the ggExtra package.

scatterplot <- ggplot(
data = linelist,
mapping = aes(
x = age,
y = days_hosp)) +
geom_point()
ggExtra::ggMarginal(
scatterplot,
type = "histogram",
fill = "lightblue",
xparams = list(binwidth = 10),
yparams = list(binwidth = 5))

89 / 119

Combining plots - plot 1

The cowplot package facilitates combining multiple plots, while aligning elements such as axes. For date axes, ensure the same limits in both plots.

Define the first plot.

plot1 <- plot2 <- linelist %>%
ggplot(
mapping = aes(
x = date_report),
binwidth = 7)+
geom_histogram()+
theme_minimal()+
scale_y_continuous(expand = c(0,0))+
scale_x_date(
limits = c(
as.Date("2020-03-01"),
max(linelist$date_report, na.rm=T)),
date_breaks = "months",
labels = scales::label_date_short(),
expand = c(0,0))

90 / 119

Combining plots - plot 2

plot2 <- linelist %>%
group_by(week = floor_date(date_report, "week")) %>%
summarise(ci = list(mean_cl_normal(age) %>% rename(mean=y, lwr=ymin, upr=ymax))) %>%
unnest() %>%
ggplot(
mapping = aes(
x = week,
y = mean,
ymin = lwr,
ymax = upr))+
geom_ribbon(alpha = 0.5, fill = "green", color = "green")+
geom_line(size = 2, color = "darkgreen")+
scale_x_date(
limits = c(
as.Date("2020-03-01"),
max(linelist$date_report, na.rm=T)),
date_breaks = "months",
labels = scales::label_date_short(),
expand = c(0,0))+
coord_cartesian(ylim = c(30, 60))+
theme_minimal()+
labs(
y = "Weekly mean age (95%CI)",
x = "Month")

91 / 119

Combining plots - result

plot_grid(plot1, plot2, rel_heights = c(1, 1), ncol = 1, align = "hv")

92 / 119

Dual-axis - plot 1

You can also use cowplot to overlap two plots and create a "dual-axis" plot.

Define the first plot:

plot1 <- ggplot(
data = linelist,
mapping = aes(
x = date_report)) +
geom_histogram(color = "grey",
alpha = 0.5) +
theme_cowplot() +
scale_x_date(
limits = c(
as.Date("2020-03-01"),
max(linelist$date_report, na.rm = TRUE)),
date_breaks = "months",
labels = scales::label_date_short(),
expand = c(0, 0),
name = "") +
scale_y_continuous(
expand = c(0,0),
name = "Weekly reported case incidence")

93 / 119

Dual-axis - plot 2

plot2 <- linelist %>%
group_by(epiweek) %>%
summarise(CFR = sum(died == "Yes", na.rm=T) / n() ) %>%
ggplot(
mapping = aes(
x = epiweek,
y = CFR))+
geom_line(size = 2, color = "orange")+
theme_cowplot()+
scale_y_continuous(
position = "right",
limits = c(0, 0.2),
expand = c(0,0),
name = "Weekly CFR")+
scale_x_date(
limits = c(
as.Date("2020-03-01"),
max(linelist$date_report, na.rm=T)),
date_breaks = "months",
labels = scales::label_date_short(),
expand = c(0,0),
name = "Epiweek")+
theme(
axis.text.y = element_text(color = "orange", face = "bold"),
axis.title.y = element_text(color = "orange", face = "bold"))

94 / 119

Dual-axis - combined

aligned <- align_plots(plot1, plot2, align = "hv")
combined <- ggdraw(aligned[[1]]) + draw_plot(aligned[[2]])
combined

95 / 119

Interactive plots

Use ggplotly() from the plotly package to easily make most ggplots interactive.

# Provide the name of a defined plot into ggplotly()
plotly::ggplotly(bar_plot)
Asian, NHBlack, NHWhite, NHHispanic, all racesOther, NHUnknown0100002000030000
YesNoUnknowneth_racecountdied
96 / 119

Data structure

ggplot2 works best when the dataset is in "long" format, where each variable has its own column. We recommend that you become familiar with pivot_longer() from dplyr (see the Epi R Handbook pivoting page).

97 / 119

Helpful charts

For color & fill use names or hex code

For shapes, try these numeric codes:

98 / 119

Geographic Information Systems (GIS)

An introduction with R

99 / 119

Shapefiles

  • Multiple layers make file
  • Made up of:
    • Points, lines, polygons
  • handled with {sf} package
    • simple features: make things much more simple!
    • Can use tidyverse syntax as if dataframes
  • Other formats exist (e.g. GeoJson)

100 / 119

Until a few years ago geospatial work in R was very painful - but with SF it has now become as easy as dataframes in tidyverse!

Import shapefile

# import shapefile
shapefile <- read_sf(
# define path to file
here::here("case_study", "data", "covid_example_data", "covid_shapefile",
"FultonCountyZipCodes.shp")
)
## Simple feature collection with 48 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 2087952 ymin: 1274336 xmax: 2317492 ymax: 1522856
## Projected CRS: NAD83 / Georgia West (ftUS)
## # A tibble: 48 x 5
## ZipCode SHAPE_Leng SHAPE_Area Population geometry
## <chr> <dbl> <dbl> <dbl> <MULTIPOLYGON [US_survey_foot]>
## 1 30304 3578. 874668. 0 (((2228175 1327979, 2227998 132786~
## 2 30098 6187. 2071539. 118 (((2297621 1477003, 2296953 147697~
## 3 30291 123056. 271391984. 20835 (((2193467 1298536, 2192364 129806~
## 4 30268 294091. 1729387442. 7729 (((2121821 1318505, 2122024 131756~
## 5 30305 87773. 187624264. 29220 (((2233900 1400489, 2233914 139992~
## 6 30213 375869. 1801881966. 39840 (((2151675 1332989, 2151274 133220~
## 7 30092 17665. 2769444. 0 (((2278083 1456502, 2278075 145649~
## 8 30024 33262. 30183784. 299 (((2300742 1476746, 2300594 147679~
## 9 30097 124207. 336153806. 26791 (((2295375 1480030, 2295462 148000~
## 10 30022 169357. 673419811. 69893 (((2286054 1475364, 2286025 147513~
## # ... with 38 more rows
101 / 119

We can see here that the CRS is specific to Georgia (so doesn't play well with GPS points)

Coordinate reference systems

Source: Mike Bostock

102 / 119

Changing the CRS

# change shapefile coordinate reference system to WGS84
shapefile <- st_transform(shapefile, crs = 4326)
## Simple feature collection with 48 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -84.85116 ymin: 33.5025 xmax: -84.09764 ymax: 34.1862
## Geodetic CRS: WGS 84
## # A tibble: 48 x 5
## ZipCode SHAPE_Leng SHAPE_Area Population geometry
## * <chr> <dbl> <dbl> <dbl> <MULTIPOLYGON [°]>
## 1 30304 3578. 874668. 0 (((-84.39147 33.65054, -84.39205 3~
## 2 30098 6187. 2071539. 118 (((-84.16324 34.0603, -84.16545 34~
## 3 30291 123056. 271391984. 20835 (((-84.50521 33.56937, -84.50882 3~
## 4 30268 294091. 1729387442. 7729 (((-84.74079 33.62338, -84.7401 33~
## 5 30305 87773. 187624264. 29220 (((-84.37313 33.84985, -84.37308 3~
## 6 30213 375869. 1801881966. 39840 (((-84.64293 33.6636, -84.64424 33~
## 7 30092 17665. 2769444. 0 (((-84.22771 34.00394, -84.22774 3~
## 8 30024 33262. 30183784. 299 (((-84.15294 34.05959, -84.15342 3~
## 9 30097 124207. 336153806. 26791 (((-84.17066 34.06862, -84.17037 3~
## 10 30022 169357. 673419811. 69893 (((-84.20143 34.05579, -84.20153 3~
## # ... with 38 more rows
103 / 119

We can use sf::st_transform to re-project a spatial object We can use sf::st_crs to check (or set) the crs for a spatial object

Take a look!

# open up a ggplot
ggplot() +
# add the shapefile on top
geom_sf(data = shapefile,
# no fill
fill = NA,
# black borders
colour = "black")

104 / 119

Set CRS for point coordinates

# define period of interest
recent_period <- seq(as.Date("2021-06-30") - 13, as.Date("2021-06-30"), by = 1)
# convert linelist data frame into sf object (with georeference points)
linelist_sf <- linelist %>%
# remove rows missing gps coordinates
drop_na(lat, lon) %>%
filter(
# drop those with wrong GPS points
lat >= 33 & lat <= 35,
lon >= -85 & lon <= -84,
# drop those outside the last 2 weeks
date_report %in% recent_period
) %>%
# create an sf object
st_as_sf(
# define the coordinates based on lat/long variables
coords = c("lon", "lat"),
# set the coordinate reference system to WGS84
crs = 4326,
# do not change string variables to factors
stringsAsFactors = FALSE
)
105 / 119

Set CRS for point coordinates

## Simple feature collection with 126 features and 4 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -84.66011 ymin: 33.53836 xmax: -84.20438 ymax: 34.16152
## Geodetic CRS: WGS 84
## First 10 features:
## pid date_report age gender geometry
## 1 0de96c1a7bfc991b 2021-06-18 6 Female POINT (-84.38023 33.95349)
## 2 fb6aac67ab0dbfe3 2021-06-18 23 Female POINT (-84.36471 33.74956)
## 3 adc636708adbd668 2021-06-17 89 Female POINT (-84.36682 33.74679)
## 4 4987c88010a7b125 2021-06-20 65 Female POINT (-84.36018 34.02925)
## 5 9738fc3209ac189f 2021-06-21 14 Male POINT (-84.55165 33.60726)
## 6 8f03d5ffc0354218 2021-06-18 57 Female POINT (-84.5458 33.58735)
## 7 c46e8649039e2fa9 2021-06-17 27 Female POINT (-84.34898 33.81149)
## 8 ac79c1b0d6993a5c 2021-06-17 47 Male POINT (-84.37456 33.86814)
## 9 e305f709c166eaf5 2021-06-17 7 Female POINT (-84.49827 33.56734)
## 10 b834f3e6759d0915 2021-06-18 1 Male POINT (-84.37045 33.67755)
106 / 119

Basemaps

  • A screenshot of the map area
  • Called tiles
    • Smaller images pieced together
  • Added for context (bottom layer of ggplot)
  • A lot of different tile servers
  • A lot of different R packages
    • We demonstrate {ggspatial}
    • Able to save tiles for offline use
    • Also have scalebars
107 / 119

Bounding boxes

108 / 119

Download & plot basemap

# get the bounding box for the shapefile
bounding_box <- shapefile %>%
st_bbox()
# plot a base map including scale bar
basemap <- ggplot() +
# change the bounding box to an sf object
# this defines the area to download map tiles for
geom_sf(data = st_as_sfc(bounding_box)) +
# download map tiles and add to the plot
annotation_map_tile(
# define what map tiles to use
type = "cartolight",
# define folder to store tile images
cachedir = here::here("data", "map_tiles"),
# define if should download tiles each time
forcedownload = FALSE,
# hide messages about download status and zoom
progress = "none" )

109 / 119

Points

# plot the basemap
points_map <- basemap +
# add the shapefile on top with no fill and black borders
geom_sf(data = shapefile,
fill = NA,
colour = "black") +
# plot points from linelist coloured by ethnicity
# (order the factor so that the most frequent group is plotted first)
geom_sf(data = linelist_sf,
mapping = aes(colour = fct_infreq(eth_race))) +
# choose colour combination
scale_colour_brewer(palette = "Set1") +
# change the legend label
labs(colour = "Ethnicity") +
theme(legend.position = "bottom")

110 / 119

Choropleths - prepare data

# get counts of points within polygons
shapefile <- shapefile %>%
# add a column to the shapefile with counts
mutate(
# see which points are in which zip code polygon
cases = st_intersects(., linelist_sf) %>%
# count how many are in each
lengths()
)
# calculate incidence in shapefiledata
shapefile <- shapefile %>%
mutate(
# divide cases by population
incidence = round(cases / Population * 100000, digits = 1),
# clean up calculations
incidence = case_when(
# fix the outliers: set infinity to NA and cases less than 10 to NA
cases < 5 ~ NA_real_,
is.infinite(incidence) ~ NA_real_, ## nb. infinite due to zero denominator
TRUE ~ incidence)
)
111 / 119

Choropleths - prepare data

# define breaks (for plotting groups)
breakers <- shapefile %>%
# change shapefile to a tibble (otherwise geometry pulled with)
as_tibble() %>%
# only keep zips with more than ten cases
filter(cases > 5) %>%
# pull the incidence column
select(incidence) %>%
# define grouping cut-offs based on quartiles of observations
quantile(probs = seq(0, 1, 0.25), na.rm = TRUE)
# create a factor variable for incidence
shapefile <- shapefile %>%
mutate(
# define groups using the cut function
incidence_cats = cut(incidence,
# cut-offs as defined above (including zero)
breaks = c(0, breakers),
# add labels by using the cut-offs
labels = str_glue("<={breakers}"))
)
112 / 119

Choropleths - plot data

# plot basemap
choropleth <- basemap +
# add in shapefile (with black borders) colour by incidence categories
geom_sf(data = shapefile,
aes(fill = incidence_cats),
colour = "black") +
# define colour scheme
scale_fill_brewer(palette = "YlGn",
# edit legend labels to the categories defined
# rename the missing group to be fewer than 10 cases
labels = c(
levels(shapefile$incidence_cats),
"Fewer than 5 cases")) +
# add in text labels for each zip code
geom_sf_text(data = shapefile,
aes(label = ZipCode),
check_overlap = TRUE) +
# change the legend title
labs(fill = "Confirmed cases per \n 100,000 by Zip code")

113 / 119

Bivariate choropleths

114 / 119

Heatmaps

  • Three dimensional distribution over each point and then interpolated

115 / 119

Heatmaps - distance

  • How many x-units in one kilometre?

xunits=xmaxxminxmetres

116 / 119

Heatmaps - calculating

# define x and y minimum/maximum coordinates based on bounding box of base map
xmin <- st_point(c(bounding_box$xmin, bounding_box$ymin))
xmax <- st_point(c(bounding_box$xmax, bounding_box$ymin))
ymin <- xmin
ymax <- st_point(c(bounding_box$xmin, bounding_box$ymax))
# create sf points (with coordinate reference system WGS84) for the x and y axis
xaxis <- st_sfc(xmin, xmax, crs = 4326)
yaxis <- st_sfc(ymin, ymax, crs = 4326)
# calculate the distance in metres on the axes of your based on longitude and latitude
# i.e. how many metres on your x axis and how many on your y
xdist <- st_distance(xaxis)[2] %>% units::set_units("miles")
ydist <- st_distance(yaxis)[2] %>% units::set_units("miles")
# calculate how many lat/long units there are on each axis
xunits <- st_distance(xmin, xmax)
yunits <- st_distance(ymin, ymax)
# divide the difference in latitude or longitude by the corresponding distance
# returns the number of units per distance of interest (i.e. coord units per mile)
xfact <- as.numeric(xunits / xdist)
yfact <- as.numeric(yunits / ydist)
117 / 119

Heatmaps - plotting

heatmap <- basemap +
# add in shapefile not filled with black border
geom_sf(data = shapefile, fill = NA, colour = "black") +
# add in a density layer
stat_density_2d(
# using the sf formatted linelist
data = linelist_sf,
aes(
# extract the lat lon from geometry (list column)
x = purrr::map_dbl(geometry, ~.[1]),
y = purrr::map_dbl(geometry, ~.[2]),
# fill based on the calculated density
fill = after_stat(level)
),
# define the shape to use for smoothing
geom = "polygon",
# define whether to show legend (removed as uninformative)
show.legend = FALSE,
# bandwith (distance squared)
h = c(xfact, yfact)
) +
# define fill palette (viridis continuous)
scale_fill_viridis_c(option = "C")

118 / 119

Thank you

Slides created via the {xaringan} R package

119 / 119

We are Applied Epi

  • We are a grassroots collaborative of applied epidemiologists
  • Our (free) Epi R Handbook has been used by 60,000 people worldwide
  • Now we offer customized training to health departments and NGOs

Bookmark www.epiRhandbook.com!

2 / 119
  • What makes us different is that we focus on the challenges of applied epi, not academic epi. We emphasize the skills used every day by ground-level epidemiologists.
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow