Bookmark www.epiRhandbook.com!
We have Wen available for answering any questions on translating SAS to R
Today is primarily explanation and demonstration, following from the previous demonstration on data management and workflow.
We assume you have reviewed the Handbook pages on R Basics and ggplot basics.
Our objectives today are to:
1) De-mystify ggplot()
syntax so you can apply it to your own tasks
2) Demonstrate core geom_()
commands
3) Introduce helpful "wrapper" packages like incidence2 and apyramid
4) Touch upon advanced tips with packages ggExtra, ggrepel, gghighlight, scales, plotly, and esquisse
5) Introduce mapping (GIS) with sf
The Epi R Handbook contains many more examples.
Today's work is within an R project, which can be downloaded as a zipped file.
Today's code is available to you in ggplot_demo.Rmd
(also online here).
R project file (case_study.Rproj
)
data/
folder covid_example_data/
folder linelist_cleaned.rds
covid_example_data.xlsx
covid_shapefile/
folder FultonCountryZipCode.shp
...map_tiles/
folder ggplot_demo.Rmd
(R markdown script) Load packages (ggplot2 is within tidyverse) and import the cleaned (fake) dataset from last session.
## install and load necessary packages pacman::p_load( rio, # importing data here, # relative file pathways skimr, # review data DT, # visualize data frame janitor, # data cleaning and tables epikit, # age categories lubridate, # working with dates incidence2, # epidemic curves ggrepel, # smart labels ggExtra, # extras esquisse, # point-and-click for simple ggplots apyramid, # age pyramids scales, # formatting of scales plotly, # interactive plots cowplot, # combine plots tidyverse # ggplot2 & data management)
linelist <- import(here::here("data", "covid_example_data", "linelist_cleaned.rds"))
One tip is to load tidyverse last to avoid masking of package names
The import()
function can import almost any kind of file. An .rds
file retains column classes from an R data frame.
Below are the first 25 rows of the linelist
data frame:
pid | date_report | date_dob | age | gender | race | eth | zip | county | district | state | contact_id | date_onset | sym_fever | sym_subjfever | sym_myalgia | sym_losstastesmell | sym_sorethroat | sym_cough | sym_headache | sym_resolved | date_recovery | contact_hh | hospitalized | date_hospitalized | date_discharge | died | died_covid | date_died | confirmed_case | covid_dx | date_positive | lat | lon | epiweek | days_onset_hosp | date_outcome | days_hosp | age_group | eth_race |
---|
pid | date_report | date_dob | age | gender | race | eth | zip | county | district | state | contact_id | date_onset | sym_fever | sym_subjfever | sym_myalgia | sym_losstastesmell | sym_sorethroat | sym_cough | sym_headache | sym_resolved | date_recovery | contact_hh | hospitalized | date_hospitalized | date_discharge | died | died_covid | date_died | confirmed_case | covid_dx | date_positive | lat | lon | epiweek | days_onset_hosp | date_outcome | days_hosp | age_group | eth_race | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3a85e6992a5ac52f | 2020-03-22 | 2004-11-08 | 16 | Male | WHITE | NON-HISPANIC/LATINO | 30308 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-03-20 | Yes | Yes | No | Unknown | Yes | Yes | Yes | No | Yes | No | No | No | Yes | Confirmed | 2020-03-22 | 33.77664546 | -84.38568523 | 2020-03-18 | 10-19 | White, NH | |||||||
2 | c6b5281d5fc50b96 | 2020-02-01 | 1964-06-07 | 57 | Male | WHITE | NON-HISPANIC/LATINO | 30308 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-01-28 | No | No | Yes | Unknown | No | Yes | No | No | No | No | No | No | Yes | Confirmed | 2020-02-01 | 33.78051014 | -84.38947474 | 2020-01-29 | 50-59 | White, NH | |||||||
3 | 53495ad0dca4e22c | 2020-02-10 | 1944-04-06 | 77 | Female | BLACK | NON-HISPANIC/LATINO | 30315 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-02-10 | Yes | Unknown | Yes | Unknown | Yes | Yes | Unknown | No | Unknown | Yes | 2020-02-08 | No | No | Yes | Confirmed | 2020-02-10 | 33.73023331 | -84.38425189 | 2020-02-05 | 70+ | Black, NH | ||||||
4 | 2948a265da0d081b | 2020-03-20 | 1964-06-25 | 57 | Female | BLACK | NON-HISPANIC/LATINO | 30213 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2021-05-19 | No | Yes | Yes | Unknown | Yes | Yes | Yes | No | No | Unknown | No | No | Yes | Confirmed | 2021-01-17 | 33.55506787 | -84.62885167 | 2020-03-18 | 50-59 | Black, NH | |||||||
5 | a5524aadd1ca0458 | 2020-02-26 | 1964-12-21 | 56 | Male | WHITE | NOT SPECIFIED | 30004 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-02-20 | Yes | Yes | Yes | Unknown | No | Yes | No | No | No | Yes | 2020-02-26 | Unknown | Unknown | Yes | Confirmed | 2020-02-25 | 34.10608705 | -84.27453508 | 2020-02-26 | 6 | 50-59 | White, NH | |||||
6 | db14eeabe531fab7 | 2020-02-11 | 1956-06-21 | 65 | Male | BLACK | NON-HISPANIC/LATINO | 30314 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-01-17 | Yes | Yes | No | Unknown | Unk | Yes | Unk | Yes | 2020-02-21 | No | Yes | 2020-01-27 | 2020-02-21 | Yes | Yes | 2020-02-21 | Yes | Confirmed | 2020-02-20 | 33.75937142 | -84.42582354 | 2020-02-05 | 10 | 2020-02-21 | 25 | 60-69 | Black, NH |
7 | 8bd07af27a9095e2 | 2020-03-01 | 1974-05-25 | 47 | Male | BLACK | NON-HISPANIC/LATINO | 30313 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-02-21 | No | No | Unk | Unknown | Yes | Yes | Yes | No | No | Yes | 2020-03-01 | No | No | Yes | Confirmed | 2020-03-02 | 33.76269487 | -84.40045162 | 2020-02-26 | 9 | 40-49 | Black, NH | |||||
8 | b0fb67854aaeb475 | 2020-03-01 | 1960-02-17 | 61 | Female | BLACK | NON-HISPANIC/LATINO | 30350 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-02-25 | Yes | Yes | No | Unknown | Yes | Yes | No | No | No | No | Unknown | Unknown | Yes | Confirmed | 2020-02-29 | 33.99152006 | -84.35447278 | 2020-02-26 | 60-69 | Black, NH | |||||||
9 | bf9099af39be433a | 2020-03-18 | 1985-06-09 | 36 | Male | ASIAN | NON-HISPANIC/LATINO | 30324 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-03-17 | No | Yes | No | Unknown | Yes | Yes | Yes | No | No | No | No | No | Yes | Confirmed | 2020-03-17 | 33.81818887 | -84.37365369 | 2020-03-18 | 30-39 | Asian, NH | |||||||
10 | c01011bc9755eb3a | 2020-03-12 | 1978-11-26 | 42 | Female | BLACK | NON-HISPANIC/LATINO | 30305 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-11-21 | Yes | No | Yes | Yes | Yes | Yes | No | No | Yes | Yes | 2020-11-28 | 2020-12-02 | No | No | Yes | Confirmed | 2020-11-21 | 0.000509015 | -0.000975479 | 2020-03-11 | 7 | 2020-12-02 | 4 | 40-49 | Black, NH | ||
11 | 27b97d577fb9a30d | 2020-02-26 | 1946-12-21 | 74 | Male | BLACK | NON-HISPANIC/LATINO | 30315 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-02-15 | Unknown | Unknown | Unknown | Unknown | Unknown | Yes | Unknown | No | No | Unknown | Unknown | Unknown | Yes | Confirmed | 2020-03-02 | 33.7247196 | -84.38623582 | 2020-02-26 | 70+ | Black, NH | |||||||
12 | 932f146c3860e756 | 2020-04-10 | 1993-10-27 | 27 | Female | BLACK | NON-HISPANIC/LATINO | 30308 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-04-02 | Unk | Yes | No | Unknown | No | Yes | No | No | No | No | No | No | Yes | Confirmed | 2020-04-10 | 33.76662936 | -84.38936047 | 2020-04-08 | 20-29 | Black, NH | |||||||
13 | 44655181e230b95c | 2020-04-06 | 1941-06-04 | 80 | Female | BLACK | NON-HISPANIC/LATINO | 30312 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-03-22 | No | No | No | Unknown | No | Yes | No | Yes | 2020-04-16 | Yes | Yes | 2020-04-06 | 2020-04-16 | Yes | Yes | 2020-04-16 | Yes | Confirmed | 2020-04-15 | 33.73726168 | -84.38274508 | 2020-04-01 | 15 | 2020-04-16 | 10 | 70+ | Black, NH |
14 | a951cb110aacd31f | 2020-03-20 | 1954-03-19 | 67 | Male | BLACK | NON-HISPANIC/LATINO | 30213 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-03-10 | Yes | Yes | Yes | Unknown | No | Yes | Yes | Yes | No | Yes | 2020-03-19 | No | No | Yes | Confirmed | 2020-03-20 | 33.58325606 | -84.63672804 | 2020-03-18 | 9 | 60-69 | Black, NH | |||||
15 | 7288a58b47539174 | 2020-03-05 | 1958-10-22 | 62 | Female | BLACK | NON-HISPANIC/LATINO | 30311 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-03-02 | Yes | Unk | Unk | Unknown | Unk | Unk | Unk | No | Unk | Yes | 2020-03-03 | No | No | Yes | Confirmed | 2020-03-07 | 33.73072166 | -84.44208134 | 2020-03-04 | 1 | 60-69 | Black, NH | |||||
16 | dfbc141429716762 | 2020-03-05 | 1929-10-22 | 91 | Male | BLACK | NON-HISPANIC/LATINO | 30311 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-02-29 | Yes | Yes | No | Unknown | No | Yes | No | No | No | Yes | 2020-03-03 | No | No | Yes | Confirmed | 2020-03-05 | 33.74443579 | -84.47630174 | 2020-03-04 | 3 | 70+ | Black, NH | |||||
17 | 033e8eb949e30135 | 2020-02-24 | 1952-01-11 | 69 | Male | WHITE | NON-HISPANIC/LATINO | 30327 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-02-19 | Yes | Yes | Unknown | Unknown | Unknown | Yes | Unknown | Unknown | Unknown | No | Unknown | Unknown | Yes | Confirmed | 2020-02-21 | 33.84942894 | -84.41986975 | 2020-02-19 | 60-69 | White, NH | |||||||
18 | a254eb953370eb3d | 2020-04-04 | 1994-01-01 | 27 | Male | BLACK | NON-HISPANIC/LATINO | 30349 | Fields | PIEDMONT (12-2) | Georgia | Yes | Yes | No | Yes | Unknown | No | Yes | No | No | No | Yes | 2020-04-03 | No | No | Yes | Confirmed | 2020-12-16 | 33.61755356 | -84.46304234 | 2020-04-01 | 20-29 | Black, NH | |||||||
19 | 2ebf4b64459ff07c | 2020-03-28 | 1995-12-25 | 25 | Male | WHITE | NON-HISPANIC/LATINO | 30009 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-03-25 | No | No | No | No | Yes | No | No | No | No | No | No | No | Yes | Confirmed | 2021-01-28 | 34.09972633 | -84.31308119 | 2020-03-25 | 20-29 | White, NH | |||||||
20 | 8285906d7277879a | 2020-04-07 | 1987-09-14 | 33 | Female | BLACK | NON-HISPANIC/LATINO | 30337 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-03-30 | No | Yes | Yes | Unknown | No | Yes | No | Yes | 2020-04-10 | No | Yes | 2020-04-06 | 2020-04-10 | No | No | Yes | Confirmed | 2020-04-08 | 33.65329287 | -84.45800906 | 2020-04-01 | 7 | 2020-04-10 | 4 | 30-39 | Black, NH | |
21 | 75ed6eb8e8a5373a | 2020-04-06 | 1981-12-12 | 39 | Female | BLACK | NON-HISPANIC/LATINO | 30344 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-03-29 | No | No | No | Unknown | No | Yes | No | No | No | No | No | No | Yes | Confirmed | 2020-04-02 | 33.69052246 | -84.46524239 | 2020-04-01 | 30-39 | Black, NH | |||||||
22 | bee0aecf4384dbc8 | 2020-02-26 | 1975-07-11 | 46 | Female | BLACK | NON-HISPANIC/LATINO | 30315 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-02-17 | Yes | Yes | No | Unknown | Yes | Yes | No | No | No | Yes | 2020-02-22 | 2020-03-10 | No | No | Yes | Confirmed | 2020-02-22 | 33.71013231 | -84.38555839 | 2020-02-26 | 5 | 2020-03-10 | 17 | 40-49 | Black, NH | ||
23 | 147251528ba0e86c | 2020-04-12 | 1948-01-11 | 73 | Male | BLACK | NON-HISPANIC/LATINO | 30349 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-03-30 | Yes | No | Yes | Unknown | No | Yes | No | No | No | Yes | 2020-04-15 | 2020-04-18 | No | No | Yes | Confirmed | 2020-04-11 | 33.64674578 | -84.57468022 | 2020-04-08 | 16 | 2020-04-18 | 3 | 70+ | Black, NH | ||
24 | 3517a48c39570377 | 2020-04-01 | 1974-05-30 | 47 | Male | WHITE | NON-HISPANIC/LATINO | 30327 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-03-28 | No | Yes | Yes | Unknown | Unknown | Unknown | Unknown | No | No | No | No | No | Yes | Confirmed | 2020-03-29 | 33.84126361 | -84.42349808 | 2020-04-01 | 40-49 | White, NH | |||||||
25 | ddfe45263e67714b | 2020-02-28 | 1977-01-04 | 44 | Female | BLACK | NON-HISPANIC/LATINO | 30318 | Fields | PIEDMONT (12-2) | Georgia | Yes | 2020-02-26 | No | Unknown | Unknown | Unknown | Unknown | Yes | Unknown | No | No | Yes | 2020-02-27 | 2020-02-27 | No | No | Yes | Confirmed | 2020-02-27 | 33.80477485 | -84.45733019 | 2020-02-26 | 1 | 2020-02-27 | 0 | 40-49 | Black, NH |
Tip: Use skim()
from skimr package to review all columns with summary statistics
Today we focus on ggplot2 because it:
See the R graph gallery for inspiration.
Other plotting options include base R, lattice, and plotly.
The ggplot2 package is the most popular data visualization tool in R
Its ggplot()
function is at the core of the package
The ggplot2 package is the most popular data visualization tool in R
Its ggplot()
function is at the core of the package
This whole approach is colloquially known as “ggplotting”
The ggplot2 package is the most popular data visualization tool in R
Its ggplot()
function is at the core of the package
This whole approach is colloquially known as “ggplotting”
Resulting figures are sometimes affectionately called “ggplots”
The ggplot2 package is the most popular data visualization tool in R
Its ggplot()
function is at the core of the package
This whole approach is colloquially known as “ggplotting”
Resulting figures are sometimes affectionately called “ggplots”
ggplot2 benefits from a wide variety of supplementary R packages that extends its functionalities, such as gganimate, ggthemr, ggdendro, gghighlight, ggforce...
Bonus question: What does the "gg” in these names represent?
Build a plot object by “adding” commands on top of one another that specify plot layers and design elements
Build a plot object by “adding” commands on top of one another that specify plot layers and design elements
The order of layers will usually look like this:
1) "Open" the plot with the ggplot()
command and specify the dataset
Build a plot object by “adding” commands on top of one another that specify plot layers and design elements
The order of layers will usually look like this:
1) "Open" the plot with the ggplot()
command and specify the dataset
2) "Map" data columns to "aesthetic" features of the plot such as axes, color, size, shape, fill, transparency
Build a plot object by “adding” commands on top of one another that specify plot layers and design elements
The order of layers will usually look like this:
1) "Open" the plot with the ggplot()
command and specify the dataset
2) "Map" data columns to "aesthetic" features of the plot such as axes, color, size, shape, fill, transparency
3) Add (+
) “geom” layers that visualize data geometrically as shapes
Build a plot object by “adding” commands on top of one another that specify plot layers and design elements
The order of layers will usually look like this:
1) "Open" the plot with the ggplot()
command and specify the dataset
2) "Map" data columns to "aesthetic" features of the plot such as axes, color, size, shape, fill, transparency
3) Add (+
) “geom” layers that visualize data geometrically as shapes
4) Modify "scales", such as a color scale or y-axis breaks
Build a plot object by “adding” commands on top of one another that specify plot layers and design elements
The order of layers will usually look like this:
1) "Open" the plot with the ggplot()
command and specify the dataset
2) "Map" data columns to "aesthetic" features of the plot such as axes, color, size, shape, fill, transparency
3) Add (+
) “geom” layers that visualize data geometrically as shapes
4) Modify "scales", such as a color scale or y-axis breaks
5) Add "theme" plot design elements such as axis labels, title, caption, fonts, text sizes, background themes, or axes rotation
Build a plot object by “adding” commands on top of one another that specify plot layers and design elements
The order of layers will usually look like this:
1) "Open" the plot with the ggplot()
command and specify the dataset
2) "Map" data columns to "aesthetic" features of the plot such as axes, color, size, shape, fill, transparency
3) Add (+
) “geom” layers that visualize data geometrically as shapes
4) Modify "scales", such as a color scale or y-axis breaks
5) Add "theme" plot design elements such as axis labels, title, caption, fonts, text sizes, background themes, or axes rotation
These layers are "added" sequentially with +
symbols.
ggplot2 commands can be quite long!
Remember that although the commands may be long, it is infinitely easier to edit and recycle than in Excel
ggplot()
creates an empty canvas. Assign the data frame to use.
ggplot(data = linelist)
Alternatively, use the %>%
pipe operator to "pipe" a data frame into ggplot()
linelist %>% ggplot()
This is only a blank canvas, we have not defined what should be in the x and y axes. If several data frames are needed, they can be added in their own geoms. Piping is useful to make one-time changes to a dataset prior to plotting.
aes()
ggplot( data = linelist, mapping = aes( x = age, y = days_hosp))
"Aesthetics" are features whose display could vary for each data point (position, color, shape...)
mapping = aes()
assigns plot "aesthetics" to columns in the data. Inputs must be placed within aes()
.
Two basic aesthetic mappings are axes, via x=
and y=
.
ggplot( data = linelist, mapping = aes( x = age, y = days_hosp)) +geom_point()
Data are visualized using "geom" commands, such as geom_point()
.
These commands are "added" with a +
to the ggplot()
command.
Some classical “geoms” include:
Geometry | Geom |
---|---|
Histograms | geom_histogram() |
Points | geom_point() |
Full list here
Some classical “geoms” include:
Geometry | Geom |
---|---|
Histograms | geom_histogram() |
Points | geom_point() |
Lines | geom_line() |
Bar plots | geom_bar() or geom_col() |
Full list here
Some classical “geoms” include:
Geometry | Geom |
---|---|
Histograms | geom_histogram() |
Points | geom_point() |
Lines | geom_line() |
Bar plots | geom_bar() or geom_col() |
Boxplots | geom_boxplot() |
Violin plots | geom_violin() |
Full list here
ggplot( data = linelist, mapping = aes( x = age, y = days_hosp)) +geom_point()
With axes now mapped, geom_point()
displays the data as points.
ggplot( data = linelist, mapping = aes( x = age, y = days_hosp)) +geom_point() +geom_smooth()
We can add additional geoms to the current plot with +
.
Geoms appear in the order they are written: the smoothed line appears over the points.
geom_smooth() gives smoothed conditional means, helping to show trends in presence of "over-plotting" (see documentation)
Indentations, spaces, and newlines do not impact code execution, and can be varied to improve readability.
ggplot(data = linelist, mapping = aes(x = age, y = days_hosp))+geom_point()
is the same as:
ggplot(data = linelist, mapping = aes(x = age, y = days_hosp)) +geom_point()
is the same as:
ggplot( data = linelist, # use case linelist mapping = aes( # make aesthetic mappings for all geoms x = age, # assign x-axis to age column y = days_hosp)) + # assign y-axis to duration of hospitalizationgeom_point() # display data as points
Aside from axes, other common "aesthetics" include:
Argument | Controls |
---|---|
shape |
Display of point as dot, star, triangle, square... |
fill |
The interior color (e.g of bar or boxplot) |
color |
The exterior or bar, boxplot - OR point color |
size |
Line thickness, point size... |
alpha |
Transparency: 0 (invisible) to 1 (opaque) |
width |
Width of "bar plot" bars |
linetype |
Either solid, dashed, dotted, etc. |
binwidth |
Width of histogram bins |
Note that “aesthetic” in ggplot has a specific meaning that you might associate with the word “aesthetics” in common English. In ggplot those details are called “themes” and are adjusted within a theme() command
Each geom accepts certain aesthetics, like binwidth=
for geom_histogram()
Aesthetics can be assigned to either:
Static values: color = "purple"
aes()
A data column: aes(color = died)
aes()
Some examples:
Aesthetics can be assigned to either:
Static values: fill = "purple"
aes()
A data column: aes(fill = died)
aes()
More examples:
ggplot( data = linelist, mapping = aes( x = age, y = days_hosp)) + geom_point(color = "seagreen")
An aesthetic is static if it applies the same display to all data points in the geom or plot. Static aesthetics are defined outside aes()
to a number or character value.
Other examples:size = 3
alpha = 0.5
width = 1.2
ggplot( data = linelist, mapping = aes( x = age, y = days_hosp, color = died)) +geom_point()
Dynamic aesthetics are mapped to a column name, and defined inside aes()
. This creates "groups" in the plot and generates a legend. The display varies for each data point.
Above, color=
is mapped to column died
, and is inherited by geom_point()
.
ggplot( data = linelist, mapping = aes( x = age, y = days_hosp, color = died)) + geom_point( size = 7, alpha = 0.7)
Above, size = 7
and alpha = 0.7
are static aesthetics defined outside any aes()
and apply to geom_point()
.
color=
is within an aes()
, and maps to values in column died
.
Read more about ggplot aesthetics here
As there is only one geom, all aesthetics can be written in ggplot()
, or in geom_point()
1) Dynamic mappings in the initial ggplot()
call will apply to subsequent geoms, unless otherwise indicated
2) Static aesthetics (e.g. color = "blue"
) are not inherited by subsequent geoms
3) Mappings written within one geom apply only to that geom
ggplot( data = linelist, mapping = aes( x = age, y = days_hosp)) +# points colored by outcomegeom_point( mapping = aes(color = died), size = 1) +# smoothed means by outcomegeom_smooth( mapping = aes(color = died), size = 3) +# smoothed mean of entire datasetgeom_smooth(color = "black")
If your data are "grouped", yet you plot indiscriminately without the grouping assigned to a dynamic aesthetic, you get multiple y values per x value.
Below, linelist
is aggregated into counts per unique week-outcome.
outcome_week_data <- linelist %>% count(died, week_onset = floor_date( x = date_onset, unit = "week")) %>% select(week_onset, everything()) %>% arrange(week_onset) %>% drop_na(week_onset)
First few rows:
week_onset | died | n |
---|---|---|
2019-12-29 | Yes | 1 |
2019-12-29 | No | 5 |
2020-01-05 | No | 7 |
2020-01-05 | Unknown | 1 |
2020-01-12 | Yes | 1 |
2020-01-12 | No | 5 |
2020-01-12 | Unknown | 2 |
2020-01-19 | No | 4 |
ggplot( data = outcome_week_data, mapping = aes( x = week_onset, y = n)) +geom_line()
Above, the column died
is not provided to any grouping aesthetic.
ggplot()
does not know how to display the multiple values for each week.
ggplot( data = outcome_week_data, mapping = aes( x = week_onset, y = n, color = died))+geom_line()
Use aesthetic mappings like fill=
or color=
to correctly visualize the "groups".
Above, color=
is mapped to the died
column.
ggplot( data = linelist, mapping = aes(x = date_onset)) +geom_histogram() +facet_wrap(~eth_race)
Another way of displaying groups is via faceting.
facet_wrap()
produces one facet per unique value.
Place a "~" before the column name.
Also called "small multiples"
ggplot( data = linelist, mapping = aes(x = date_onset)) +geom_histogram() +facet_wrap(~eth_race, scales = "free_y")
"Free" auto-scaled axes with scales=
. Options:
Alert your audience if you use free axes!
Also, try ncol=
and nrow=
ggplot( data = linelist, mapping = aes(x = date_onset)) +geom_histogram() +facet_wrap(eth_race ~ gender)
The "~" signifies "by". You can place columns on either side.
With facet_wrap()
, levels are combined into facet titles, appearing alphabetically/by factor level.
Use facet_grid()
. Labels appear on top and side.
ggplot( data = linelist, mapping = aes(x = date_onset)) +geom_histogram() +facet_grid(eth_race ~ gender)
Use filter()
or drop_na()
in advance, to remove undesired levels.
linelist %>% drop_na(gender, eth_race) %>% filter(gender != "Unknown") %>% # begin plot ggplot( mapping = aes(x = date_onset)) + geom_histogram() + facet_grid(eth_race ~ gender)
Above, we pipe the adjusted data frame into ggplot()
, for ease
gghighlight()
Add gghighlight()
from gghighlight to show a full "shadow" behind each facet. For color add the aes(fill=)
.
ggplot( data = linelist, mapping = aes( x = date_onset, fill = eth_race)) +geom_histogram() +facet_wrap(~ eth_race) +gghighlight::gghighlight()
Add gghighlight()
to other plots, and specify specific values (or criteria) to highlight
linelist %>% # get daily counts by age group group_by(age_group, date_report) %>% count() %>% # plot ggplot( mapping = aes( x = date_report, y = n, color = age_group)) + geom_line() + gghighlight::gghighlight( age_group %in% c("40-49", "60-69)"))+ theme(legend.position = "none")
Scale commands replace defaults of how the aesthetic mappings manifest, such as:
As a generic formula, these commands are written as: scale_AESTHETIC_METHOD()
.
1) scale_
: this prefix never changes
2) AESTHETIC: _fill_
, _color_
, _x_
, _y_
, etc.
3) METHOD: _continuous()
, _discrete()
, _manual()
, _date()
, etc.
Some examples of scale commands:
You want to adjust | Scale command |
---|---|
continuous y-axis | scale_y_continuous() |
date x-axis | scale_x_date() |
categorical x-axis | scale_x_discrete() |
fill, continuous | scale_fill_continuous() |
fill, continuous | scale_fill_gradient() |
color, manual assign | scale_color_manual() |
ggplot( data = linelist, mapping = aes( x = eth_race, fill = died)) +geom_bar()
Above, the fill of a bar plot uses the default colors and axis breaks:
ggplot( data = linelist, mapping = aes( x = eth_race, fill = died)) +geom_bar() +scale_fill_manual( values = c( "Yes" = "violetred", "No" = "aquamarine", "Unknown" = "grey"))
Within scale_fill_manual()
we provide assignments within a vector c()
.
Use na.value = "grey" for missing values
Discuss the na.value= arguments in most scale commands, and the difference between having NA values in the data and having an explicit missing value such as "Unknown".
ggplot( data = linelist, mapping = aes( x = eth_race, fill = died)) +geom_bar() +scale_fill_manual( values = c( "Yes" = "violetred", "No" = "aquamarine", "Unknown" = "grey")) +scale_y_continuous( breaks = seq(from = 0, to = 35000, by = 5000))
In scale_y_continuous()
we adjust the y-axis breaks using seq()
to define a numeric sequence.
ggplot( data = linelist, mapping = aes( x = eth_race, fill = died)) +geom_bar() +scale_fill_manual( values = c("Yes" = "violetred", "No" = "aquamarine", "Unknown" = "grey")) +scale_y_continuous( breaks = seq(from = 0, to = 35000, by = 5000), expand = c(0, 0)) +scale_x_discrete( expand = c(0, 0))
In any scale_x_
or scale_y_
command, use expand = c(0,0)
to remove excess space around the plot features.
ggplot( data = linelist, mapping = aes(x = date_onset)) +geom_histogram()
The default scale for date axes will vary by the range of your data.
ggplot( data = linelist, mapping = aes(x = date_onset)) +geom_histogram() +scale_x_date( date_breaks = "3 months")
Adjust axis labels with scale_x_date()
.
Use date_breaks=
values like "1 week", "2 weeks", or "3 months".
Note: these are the axis label breaks, not histogram bin breaks!
Note: Default "weeks" start on Mondays. See epiRhandbook epicurves page for alternatives.
For tips on geom_histogram() bins, see Epi R Handbook epicurves page
ggplot( data = linelist, mapping = aes(x = date_onset)) +geom_histogram() +scale_x_date( date_breaks = "3 months", date_labels = "%d %b\n%Y")
Specify date label format to date_labels=
using "strptime" syntax
"%d %b %Y"
for DD MMM YYYY.
See Epi R Handbook Epicurves and Strings pages for more tips
/n is a newline
ggplot( data = linelist, mapping = aes(x = date_onset)) +geom_histogram() +scale_x_date( date_breaks = "3 months", labels = scales::label_date_short() )
Alternatively, simplify by assigning labels=
to label_date_short()
from the scales package
The year is not repeated on each label anymore.
Note the use above of labels=
, not date_labels=
.
Easily display proportions as percents with percent()
from scales within scale_y_continuous()
.
First few rows of a CFR dataset:
month | cases | deaths | CFR |
---|---|---|---|
2020-01-01 | 1 | 0 | 0.0000000 |
2020-02-01 | 54 | 3 | 0.0555556 |
2020-03-01 | 857 | 73 | 0.0851809 |
2020-04-01 | 1832 | 152 | 0.0829694 |
ggplot( data = CFR_data, mapping = aes( x = month, y = CFR) +geom_line(size = 2, color = "brown") +scale_y_continuous(labels = percent)
Remember to load the scales package
ggplot(data = linelist) +geom_point( mapping = aes( x = age, y = days_hosp, color = died), alpha = 0.3) +labs( title = "Duration of admission", subtitle = "Fulton County, GA", x = "Age (years)", y = "Duration (days)", caption = "Fictional COVID-19 data", color = "Deceased")
Use labs()
as above.
Note: to edit legend title, use the aesthetic that created the legend (e.g. color=
).
See the Epi R Handbook page on adjusting characters with the stringr package
New lines
"Top line\nNextline" # \n creates new linestr_wrap("This is a really long title", 30) # wrap at 30 characters
See the Epi R Handbook page on adjusting characters with the stringr package
New lines
"Top line\nNextline" # \n creates new linestr_wrap("This is a really long title", 30) # wrap at 30 characters
Dynamic labels - Embed code in str_glue()
that updates with the data
str_glue("Data as of {Sys.Date()}")
## Data as of 2021-08-20
str_glue("{fmt_count(linelist, is.na(date_onset))} cases missing onset and not shown")
## 37323 (45.7%) cases missing onset and not shown
Explain that in str_glue, anything within curly brackets it will run as R code.
Themes are non-data design features (background, text size/color, etc).
These "complete themes" are easy to add.
# Try one of these...theme_bw()+theme_classic()+theme_dark()+theme_gray()+theme_minimal()+theme_light()+theme_void()+
Try the argument base_size = 16
to quickly increase text size.
Make micro-adjustments with theme()
. See the handbook for tips.
ggplot( data = linelist, mapping = aes( x = age, y = days_hosp, color = died), alpha = 0.3) +geom_point() +labs( title = "Duration of admission", x = "Age (years)", y = "Duration (days)", color = "Deceased")+theme_minimal(base_size = 16) +theme( legend.position = "bottom", plot.title = element_text( color = "red", face = "bold"), axis.title.y = element_text(angle = 90))
The syntax takes practice - see this list of feature-specific arguments.
Talk about these theme() arguments and how they consist of two parts, just like mapping = aes()
.
Explain that nobody has these all memorized, but the common ones are easy to remember once you use them enough.
Remember to add them AFTER any complete themes.
geom_col()
and geom_bar()
are used to make general bar plots (non-epicurves).
geom_col()
and geom_bar()
are used to make general bar plots (non-epicurves).
Use geom_bar()
if bar height should reflect the number of rows in the data
(e.g. a case linelist).
ggplot( data = linelist, # begin with linelist mapping = aes(x = eth_race)) + # No y= argumentgeom_bar()
geom_col()
and geom_bar()
are used to make general bar plots (non-epicurves).
Use geom_bar()
if bar height should reflect the number of rows in the data
(e.g. a case linelist).
ggplot( data = linelist, # begin with linelist mapping = aes(x = eth_race)) + # No y= argumentgeom_bar()
Use geom_col()
there is a numeric column containing the desired bar height
(e.g. aggregated count data).
ggplot( data = linelist_agg, # begin with aggregated count data mapping = aes(x = eth_race, y = n)) + # bar height is value in column "n" geom_col()
With geom_bar()
, the height reflects the number of rows per x-axis group. This works well for linelist data.
date_onset | eth_race | died |
---|---|---|
2020-03-20 | White, NH | No |
2020-01-28 | White, NH | No |
2020-02-10 | Black, NH | No |
2021-05-19 | Black, NH | No |
2020-02-20 | White, NH | Unknown |
2020-01-17 | Black, NH | Yes |
ggplot( data = linelist, mapping = aes( x = eth_race)) + geom_bar() +theme(axis.text.x= element_text(angle=30))
To achieve "stacked" bars with geom_bar()
, assign the grouping column to fill=
, within aes()
.
date_onset | eth_race | died |
---|---|---|
2020-03-20 | White, NH | No |
2020-01-28 | White, NH | No |
2020-02-10 | Black, NH | No |
2021-05-19 | Black, NH | No |
2020-02-20 | White, NH | Unknown |
2020-01-17 | Black, NH | Yes |
ggplot( data = linelist, mapping = aes( x = eth_race, fill = died)) + geom_bar() + theme(axis.text.x= element_text(angle=30))
In contrast, geom_col()
uses a column of counts, such as column n
in this linelist_eth
dataset:
eth_race | n |
---|---|
Asian, NH | 3022 |
Black, NH | 34395 |
White, NH | 27303 |
Hispanic, all races | 8600 |
Other, NH | 2912 |
Unknown | 5525 |
Column n
provides the bar height.
ggplot(linelist_eth) + geom_col( mapping = aes( x = eth_race, y = n))+theme(axis.text.x= element_text(angle= 30))
To have "stacked" bars using geom_col()
, each plotting group must have its own rows in the data. Use "long"-style data like below:
eth_race | died | n |
---|---|---|
Asian, NH | Yes | 30 |
Asian, NH | No | 1802 |
Asian, NH | Unknown | 1190 |
Black, NH | Yes | 1037 |
Black, NH | No | 19204 |
Black, NH | Unknown | 14154 |
ggplot(linelist_eth_died) + geom_col( mapping = aes( x = eth_race, y = n, fill = died)) +theme(axis.text.x= element_text(angle=30))
See the pivoting page in the Epi R Handbook
If your data look like below (counts) and your plot looks like that (bars of same height), ensure you are using geom_col()
and not geom_bar()
!
eth_race | n |
---|---|
Asian, NH | 3022 |
Black, NH | 34395 |
White, NH | 27303 |
Hispanic, all races | 8600 |
Other, NH | 2912 |
Unknown | 5525 |
This plot shows that there is one row per ethnicity/race!
ggplot(linelist) + geom_bar( mapping = aes( x = eth_race, fill = died)) +theme(legend.position = "top") + coord_flip()
It is simple to flip the axes on any ggplot by adding coord_flip()
This is useful in bar plot to improve readability.
ggplot(linelist) + geom_bar( mapping = aes( x = fct_infreq(eth_race), fill = fct_infreq(died))) + theme(legend.position = "top") + coord_flip()
Use the forcats package to convert to class factor and adjust "level" order.
Above, fct_infreq()
orders x-axis position and stacks (fill) by frequency.
See the Epi R Handbook page on factors.
You will need to clean up the axes and legend titles now.
ggplot(linelist) + geom_bar( mapping = aes( x = fct_rev(fct_infreq(eth_race)), fill = fct_rev(fct_infreq(died)))) + theme(legend.position = "top") + coord_flip()
To reverse order, use fct_rev()
, which can be wrapped around other functions.
ggplot( data = linelist_eth_died, mapping = aes( x = eth_race, y = n, fill = died)) + geom_col(width = 0.5) +theme(axis.text.x = element_text(angle = 30))
Adjust bar width with width=
.
Be wary of adjusting width for date bars (e.g. month) - use geom_histogram()
instead with bin breaks.
ggplot( data = linelist_eth_died, mapping = aes( x = eth_race, y = n, fill = fct_infreq(died), label = n)) + geom_col(position = "dodge")+theme(axis.text.x = element_text(angle = 30))
To make the bars adjacent, specify position = "dodge"
in geom_col()
.
Adjust the order with via a forcats function.
To display text, use geom_col()
with geom_text()
. Assign aes(label=)
to the height values.
The position=
arg in geom_text()
can specify display at mid-bar.
ggplot( data = linelist_eth_died, mapping = aes( x = eth_race, y = n, fill = died, label = n)) + geom_col() +geom_text( size = 3, position = position_stack(vjust = 0.5)) +theme(axis.text.x = element_text(angle=30))
Now that you've learned the basic syntax... the esquisse package allows point-and-click creation of (simple) ggplots! (esquisse means "sketch" in French)
Two approaches that we suggest:
1) Use the incidence2 package
2) Use ggplot's geom_histogram()
Two approaches that we suggest:
1) Use the incidence2 package
2) Use ggplot's geom_histogram()
Today's demonstration will focus on incidence2.
See the Epi R handbook's Epicurves page for detailed examples of both methods
Create an incidence object
weekly <- incidence( x = linelist, date_index = date_onset, interval = "week")
Plot the incidence object
plot(weekly)
Adjust the interval
bimonthly <- incidence( x = linelist, date_index = date_onset, interval = "2 months")
Plot the incidence object
plot(bimonthly)
Try "Sunday weeks" or "MMWR week" for Sunday weeks.
To show groups, specify them to groups=
in the incidence()
command:
weekly <- incidence( x = linelist, date_index = date_onset, interval = "weeks", groups = eth_race)
AND to fill=
in the plot()
command:
plot(weekly, fill = eth_race)
If grouping by multiple columns, nest them in both places within c()
:
groups = c(eth_race, gender)
plot(weekly, fill = eth_race) +scale_y_continuous( expand = c(0,0), breaks = seq(0,2000,250)) + scale_y_continuous( expand = c(0,0), breaks = seq(0,2000,250)) + theme_minimal(base_size = 16) + theme(legend.position = "top") +labs(fill = "Race and\nEthnicity")
Add other ggplot2 commands with +
.
plot(weekly, fill = eth_race, date_format = "%a %d %b %Y\n (Week %W)", angle = 30)+ scale_y_continuous( expand = c(0,0), breaks = seq(0,2000,250)) + theme_minimal(base_size = 16) + theme(legend.position = "top") +labs(fill = "Race and\nEthnicity")
Do not adjust incidence2 epicurves with scale_x_date()
.
Instead, use custom incidence2 arguments in plot()
.
For small outbreaks, the style show_cases = TRUE
may be helpful:
small_outbreak <- linelist %>% filter( zip == 30024, date_report >= as.Date("2020-12-01")) %>% incidence( date_index = date_onset, interval = "Sunday weeks", groups = eth_race)
plot(small_outbreak, fill = eth_race, show_cases = TRUE)+theme(legend.position = "bottom")+labs( title = "ZIP 30024 by race")
Dataset is piped in from above
Use scale_fill_viridis_d()
for palettes that are color-blind friendly:
plot(weekly, fill = eth_race)+scale_fill_viridis_d( option = "inferno", name = "Race and\nEthnicity", na.value = "grey")
Viridis (try with option = "plasma"
or "inferno"), and colorbrewer palette functions can be added to any ggplot as well.
The _d()
indicates a scale for discrete values
You can also use incidence2 on data that are aggregated counts...
# For demo: aggregate linelistlinelist_day_counts <- linelist %>% count( day = floor_date(date_report, "day"), died) %>% drop_na(day)
A few rows of daily counts:
day | died | n |
---|---|---|
2021-06-19 | No | 11 |
2021-06-19 | Unknown | 11 |
2021-06-20 | No | 10 |
2021-06-20 | Unknown | 6 |
2021-06-21 | No | 12 |
2021-06-21 | Unknown | 11 |
2021-06-22 | No | 9 |
2021-06-22 | Unknown | 6 |
2021-06-23 | No | 7 |
2021-06-23 | Unknown | 7 |
died_curve <- incidence( linelist_day_counts, date_index = day, count = n, interval = "week", groups = died )
# plot stacked by outcomeplot(died_curve, fill = died)
We suggest two ways to make age pyramids:
1) Use the apyramid package - simple, easy
2) Use ggplot2 - more customizeable, but opportunity for error
Today we will demonstrate apyramid.
See the Epi R Handbook page on demographic pyramids and the package vignette for more information.
age_pyramid( data = linelist, age_group = "age_group", split_by = "gender")
The function age_pyramid()
from apyramid offers an easy interface.
We have cleaned gender
to only two values
age_pyramid( data = linelist, age_group = "age_group", split_by = "gender", proportional = TRUE, show_midpoint = FALSE, pal = c("darkgreen", "brown"))
Further modifications:
age_pyramid( data = linelist, age_group = "age_group", split_by = "gender", stack_by = "hospitalized" proportional = TRUE, show_midpoint = FALSE, pal = c("darkgreen", "brown", "yellow", "orange"))
You can stack the bars by another variable with stack_by=
age_pyramid( data = linelist, age_group = "age_group", split_by = "gender", proportional = TRUE, show_midpoint = FALSE) +theme_minimal(base_size = 10) +labs( title = "Age and Gender", subtitle = "Fulton County, GA", x = "Percent of total", y = "Age group", fill = "Gender", caption = "Caption here")
You can add further ggplot commands.
Let's use dynamic labels that will update with the data.
Simple caption
str_glue("n = {nrow(linelist)}")
## n = 81757
Let's use dynamic labels that will update with the data.
Simple caption
str_glue("n = {nrow(linelist)}")
## n = 81757
Complex caption
For complicated str_glue()
scenarios, define the dynamic components outside the quotation marks.
fmt_count()
from epikit is useful to count and display rows.
str_glue("{missing} missing age or gender not shown.", missing = fmt_count(linelist, is.na(gender) | is.na(age_group)) )
## 405 (0.5%) missing age or gender not shown.
Subtitle
Max/Min dates can be wrapped in format()
from base to adjust the display. See the epiRhandbook section on strptime syntax.
str_glue("Fulton County, reported {min_date} - {max_date}", min_date = format(min(linelist$date_report, na.rm=T), "%B %d %Y"), max_date = format(max(linelist$date_report, na.rm=T), "%B %d %Y") )
## Fulton County, reported January 03 2020 - June 23 2021
Now applying those dynamic captions:
age_pyramid( data = linelist, age_group = "age_group", split_by = "gender", proportional = TRUE, show_midpoint = FALSE) +theme_minimal(base_size = 10) +labs( title = "Age and Gender", subtitle = str_glue( "Fulton County, reported {min_date} - {max_date}", min_date = format( min(linelist$date_report, na.rm=T), "%B %d %Y"), max_date = format( max(linelist$date_report, na.rm=T), "%B %d %Y") ), x = "Percent of total", y = "Age group", fill = "Gender", caption = str_glue( "{missing} missing age or gender not shown.", missing = fmt_count( linelist, is.na(gender) | is.na(age_group)) ) )
To demonstrate labeling points, we create a new dataset that summarizes CFR and median age by race/ethnicity
race_CFR_age <- linelist %>% group_by(eth_race) %>% summarise( cases = n(), deaths = sum(died_covid == "Yes", na.rm=T), CFR = deaths/cases, med_age = median(age, na.rm=T) )
eth_race | cases | deaths | CFR | med_age |
---|---|---|---|---|
Asian, NH | 3022 | 23 | 0.0076109 | 34 |
Black, NH | 34395 | 833 | 0.0242186 | 39 |
White, NH | 27303 | 417 | 0.0152730 | 36 |
Hispanic, all races | 8600 | 55 | 0.0063953 | 34 |
Other, NH | 2912 | 6 | 0.0020604 | 32 |
We can plot these data as points, and label them with geom_text()
... but it does not look very good.
ggplot( data = race_CFR_age, mapping = aes( x = med_age, y = CFR, size = cases, label = eth_race)) +geom_point() +geom_text()
Labels look much improved by using geom_label_repel()
from the ggrepel package:
library(ggrepel)ggplot( data = race_CFR_age, mapping = aes( x = med_age, y = CFR, size = cases, label = eth_race)) + geom_point() +geom_label_repel( size = 5, min.segment.length = 0)
labels=
can be assigned complex values with str_glue()
:
library(ggrepel)ggplot( data = race_CFR_age, mapping = aes( x = med_age, y = CFR, size = cases, label = str_glue( "{eth_race}\n{cases} cases"))) +geom_point() +geom_label_repel( size = 5, min.segment.length = 0)
Use comma()
from the scales package to make numbers display with comma separators every three digits.
ggplot( data = race_CFR_age, mapping = aes( x = med_age, y = CFR, size = cases, label = str_glue( "{eth_race}\n{comma(cases)} cases"))) +geom_point() +geom_label_repel( size = 4, min.segment.length = 0)
See the many other useful functions in scales.
Add "marginal" histograms to scatterplots with ggMarginal()
from the ggExtra package.
scatterplot <- ggplot( data = linelist, mapping = aes( x = age, y = days_hosp)) + geom_point()
ggExtra::ggMarginal( scatterplot, type = "histogram", fill = "lightblue", xparams = list(binwidth = 10), yparams = list(binwidth = 5))
The cowplot package facilitates combining multiple plots, while aligning elements such as axes. For date axes, ensure the same limits in both plots.
Define the first plot.
plot1 <- plot2 <- linelist %>% ggplot( mapping = aes( x = date_report), binwidth = 7)+ geom_histogram()+ theme_minimal()+ scale_y_continuous(expand = c(0,0))+ scale_x_date( limits = c( as.Date("2020-03-01"), max(linelist$date_report, na.rm=T)), date_breaks = "months", labels = scales::label_date_short(), expand = c(0,0))
plot2 <- linelist %>% group_by(week = floor_date(date_report, "week")) %>% summarise(ci = list(mean_cl_normal(age) %>% rename(mean=y, lwr=ymin, upr=ymax))) %>% unnest() %>% ggplot( mapping = aes( x = week, y = mean, ymin = lwr, ymax = upr))+geom_ribbon(alpha = 0.5, fill = "green", color = "green")+geom_line(size = 2, color = "darkgreen")+scale_x_date( limits = c( as.Date("2020-03-01"), max(linelist$date_report, na.rm=T)), date_breaks = "months", labels = scales::label_date_short(), expand = c(0,0))+coord_cartesian(ylim = c(30, 60))+theme_minimal()+labs( y = "Weekly mean age (95%CI)", x = "Month")
plot_grid(plot1, plot2, rel_heights = c(1, 1), ncol = 1, align = "hv")
You can also use cowplot to overlap two plots and create a "dual-axis" plot.
Define the first plot:
plot1 <- ggplot( data = linelist, mapping = aes( x = date_report)) + geom_histogram(color = "grey", alpha = 0.5) +theme_cowplot() +scale_x_date( limits = c( as.Date("2020-03-01"), max(linelist$date_report, na.rm = TRUE)), date_breaks = "months", labels = scales::label_date_short(), expand = c(0, 0), name = "") +scale_y_continuous( expand = c(0,0), name = "Weekly reported case incidence")
plot2 <- linelist %>% group_by(epiweek) %>% summarise(CFR = sum(died == "Yes", na.rm=T) / n() ) %>% ggplot( mapping = aes( x = epiweek, y = CFR))+ geom_line(size = 2, color = "orange")+ theme_cowplot()+ scale_y_continuous( position = "right", limits = c(0, 0.2), expand = c(0,0), name = "Weekly CFR")+ scale_x_date( limits = c( as.Date("2020-03-01"), max(linelist$date_report, na.rm=T)), date_breaks = "months", labels = scales::label_date_short(), expand = c(0,0), name = "Epiweek")+ theme( axis.text.y = element_text(color = "orange", face = "bold"), axis.title.y = element_text(color = "orange", face = "bold"))
aligned <- align_plots(plot1, plot2, align = "hv")combined <- ggdraw(aligned[[1]]) + draw_plot(aligned[[2]])combined
ggplot2 works best when the dataset is in "long" format, where each variable has its own column. We recommend that you become familiar with pivot_longer()
from dplyr (see the Epi R Handbook pivoting page).
For color & fill use names or hex code
For shapes, try these numeric codes:
Until a few years ago geospatial work in R was very painful - but with SF it has now become as easy as dataframes in tidyverse!
# import shapefileshapefile <- read_sf( # define path to file here::here("case_study", "data", "covid_example_data", "covid_shapefile", "FultonCountyZipCodes.shp"))
## Simple feature collection with 48 features and 4 fields## Geometry type: MULTIPOLYGON## Dimension: XY## Bounding box: xmin: 2087952 ymin: 1274336 xmax: 2317492 ymax: 1522856## Projected CRS: NAD83 / Georgia West (ftUS)## # A tibble: 48 x 5## ZipCode SHAPE_Leng SHAPE_Area Population geometry## <chr> <dbl> <dbl> <dbl> <MULTIPOLYGON [US_survey_foot]>## 1 30304 3578. 874668. 0 (((2228175 1327979, 2227998 132786~## 2 30098 6187. 2071539. 118 (((2297621 1477003, 2296953 147697~## 3 30291 123056. 271391984. 20835 (((2193467 1298536, 2192364 129806~## 4 30268 294091. 1729387442. 7729 (((2121821 1318505, 2122024 131756~## 5 30305 87773. 187624264. 29220 (((2233900 1400489, 2233914 139992~## 6 30213 375869. 1801881966. 39840 (((2151675 1332989, 2151274 133220~## 7 30092 17665. 2769444. 0 (((2278083 1456502, 2278075 145649~## 8 30024 33262. 30183784. 299 (((2300742 1476746, 2300594 147679~## 9 30097 124207. 336153806. 26791 (((2295375 1480030, 2295462 148000~## 10 30022 169357. 673419811. 69893 (((2286054 1475364, 2286025 147513~## # ... with 38 more rows
We can see here that the CRS is specific to Georgia (so doesn't play well with GPS points)
# change shapefile coordinate reference system to WGS84shapefile <- st_transform(shapefile, crs = 4326)
## Simple feature collection with 48 features and 4 fields## Geometry type: MULTIPOLYGON## Dimension: XY## Bounding box: xmin: -84.85116 ymin: 33.5025 xmax: -84.09764 ymax: 34.1862## Geodetic CRS: WGS 84## # A tibble: 48 x 5## ZipCode SHAPE_Leng SHAPE_Area Population geometry## * <chr> <dbl> <dbl> <dbl> <MULTIPOLYGON [°]>## 1 30304 3578. 874668. 0 (((-84.39147 33.65054, -84.39205 3~## 2 30098 6187. 2071539. 118 (((-84.16324 34.0603, -84.16545 34~## 3 30291 123056. 271391984. 20835 (((-84.50521 33.56937, -84.50882 3~## 4 30268 294091. 1729387442. 7729 (((-84.74079 33.62338, -84.7401 33~## 5 30305 87773. 187624264. 29220 (((-84.37313 33.84985, -84.37308 3~## 6 30213 375869. 1801881966. 39840 (((-84.64293 33.6636, -84.64424 33~## 7 30092 17665. 2769444. 0 (((-84.22771 34.00394, -84.22774 3~## 8 30024 33262. 30183784. 299 (((-84.15294 34.05959, -84.15342 3~## 9 30097 124207. 336153806. 26791 (((-84.17066 34.06862, -84.17037 3~## 10 30022 169357. 673419811. 69893 (((-84.20143 34.05579, -84.20153 3~## # ... with 38 more rows
We can use sf::st_transform to re-project a spatial object We can use sf::st_crs to check (or set) the crs for a spatial object
# open up a ggplotggplot() + # add the shapefile on top geom_sf(data = shapefile, # no fill fill = NA, # black borders colour = "black")
# define period of interestrecent_period <- seq(as.Date("2021-06-30") - 13, as.Date("2021-06-30"), by = 1)# convert linelist data frame into sf object (with georeference points)linelist_sf <- linelist %>% # remove rows missing gps coordinates drop_na(lat, lon) %>% filter( # drop those with wrong GPS points lat >= 33 & lat <= 35, lon >= -85 & lon <= -84, # drop those outside the last 2 weeks date_report %in% recent_period ) %>% # create an sf object st_as_sf( # define the coordinates based on lat/long variables coords = c("lon", "lat"), # set the coordinate reference system to WGS84 crs = 4326, # do not change string variables to factors stringsAsFactors = FALSE )
## Simple feature collection with 126 features and 4 fields## Geometry type: POINT## Dimension: XY## Bounding box: xmin: -84.66011 ymin: 33.53836 xmax: -84.20438 ymax: 34.16152## Geodetic CRS: WGS 84## First 10 features:## pid date_report age gender geometry## 1 0de96c1a7bfc991b 2021-06-18 6 Female POINT (-84.38023 33.95349)## 2 fb6aac67ab0dbfe3 2021-06-18 23 Female POINT (-84.36471 33.74956)## 3 adc636708adbd668 2021-06-17 89 Female POINT (-84.36682 33.74679)## 4 4987c88010a7b125 2021-06-20 65 Female POINT (-84.36018 34.02925)## 5 9738fc3209ac189f 2021-06-21 14 Male POINT (-84.55165 33.60726)## 6 8f03d5ffc0354218 2021-06-18 57 Female POINT (-84.5458 33.58735)## 7 c46e8649039e2fa9 2021-06-17 27 Female POINT (-84.34898 33.81149)## 8 ac79c1b0d6993a5c 2021-06-17 47 Male POINT (-84.37456 33.86814)## 9 e305f709c166eaf5 2021-06-17 7 Female POINT (-84.49827 33.56734)## 10 b834f3e6759d0915 2021-06-18 1 Male POINT (-84.37045 33.67755)
# get the bounding box for the shapefile bounding_box <- shapefile %>% st_bbox()# plot a base map including scale bar basemap <- ggplot() + # change the bounding box to an sf object # this defines the area to download map tiles for geom_sf(data = st_as_sfc(bounding_box)) + # download map tiles and add to the plot annotation_map_tile( # define what map tiles to use type = "cartolight", # define folder to store tile images cachedir = here::here("data", "map_tiles"), # define if should download tiles each time forcedownload = FALSE, # hide messages about download status and zoom progress = "none" )
# plot the basemap points_map <- basemap + # add the shapefile on top with no fill and black borders geom_sf(data = shapefile, fill = NA, colour = "black") + # plot points from linelist coloured by ethnicity # (order the factor so that the most frequent group is plotted first) geom_sf(data = linelist_sf, mapping = aes(colour = fct_infreq(eth_race))) + # choose colour combination scale_colour_brewer(palette = "Set1") + # change the legend label labs(colour = "Ethnicity") + theme(legend.position = "bottom")
# get counts of points within polygonsshapefile <- shapefile %>% # add a column to the shapefile with counts mutate( # see which points are in which zip code polygon cases = st_intersects(., linelist_sf) %>% # count how many are in each lengths() )# calculate incidence in shapefiledatashapefile <- shapefile %>% mutate( # divide cases by population incidence = round(cases / Population * 100000, digits = 1), # clean up calculations incidence = case_when( # fix the outliers: set infinity to NA and cases less than 10 to NA cases < 5 ~ NA_real_, is.infinite(incidence) ~ NA_real_, ## nb. infinite due to zero denominator TRUE ~ incidence) )
# define breaks (for plotting groups)breakers <- shapefile %>% # change shapefile to a tibble (otherwise geometry pulled with) as_tibble() %>% # only keep zips with more than ten cases filter(cases > 5) %>% # pull the incidence column select(incidence) %>% # define grouping cut-offs based on quartiles of observations quantile(probs = seq(0, 1, 0.25), na.rm = TRUE) # create a factor variable for incidenceshapefile <- shapefile %>% mutate( # define groups using the cut function incidence_cats = cut(incidence, # cut-offs as defined above (including zero) breaks = c(0, breakers), # add labels by using the cut-offs labels = str_glue("<={breakers}")) )
# plot basemap choropleth <- basemap + # add in shapefile (with black borders) colour by incidence categories geom_sf(data = shapefile, aes(fill = incidence_cats), colour = "black") + # define colour scheme scale_fill_brewer(palette = "YlGn", # edit legend labels to the categories defined # rename the missing group to be fewer than 10 cases labels = c( levels(shapefile$incidence_cats), "Fewer than 5 cases")) + # add in text labels for each zip code geom_sf_text(data = shapefile, aes(label = ZipCode), check_overlap = TRUE) + # change the legend title labs(fill = "Confirmed cases per \n 100,000 by Zip code")
xunits=xmax−xminxmetres
# define x and y minimum/maximum coordinates based on bounding box of base mapxmin <- st_point(c(bounding_box$xmin, bounding_box$ymin))xmax <- st_point(c(bounding_box$xmax, bounding_box$ymin))ymin <- xminymax <- st_point(c(bounding_box$xmin, bounding_box$ymax))# create sf points (with coordinate reference system WGS84) for the x and y axisxaxis <- st_sfc(xmin, xmax, crs = 4326)yaxis <- st_sfc(ymin, ymax, crs = 4326)# calculate the distance in metres on the axes of your based on longitude and latitude# i.e. how many metres on your x axis and how many on your yxdist <- st_distance(xaxis)[2] %>% units::set_units("miles")ydist <- st_distance(yaxis)[2] %>% units::set_units("miles")# calculate how many lat/long units there are on each axis xunits <- st_distance(xmin, xmax)yunits <- st_distance(ymin, ymax)# divide the difference in latitude or longitude by the corresponding distance# returns the number of units per distance of interest (i.e. coord units per mile)xfact <- as.numeric(xunits / xdist)yfact <- as.numeric(yunits / ydist)
heatmap <- basemap + # add in shapefile not filled with black border geom_sf(data = shapefile, fill = NA, colour = "black") + # add in a density layer stat_density_2d( # using the sf formatted linelist data = linelist_sf, aes( # extract the lat lon from geometry (list column) x = purrr::map_dbl(geometry, ~.[1]), y = purrr::map_dbl(geometry, ~.[2]), # fill based on the calculated density fill = after_stat(level) ), # define the shape to use for smoothing geom = "polygon", # define whether to show legend (removed as uninformative) show.legend = FALSE, # bandwith (distance squared) h = c(xfact, yfact) ) + # define fill palette (viridis continuous) scale_fill_viridis_c(option = "C")
Bookmark www.epiRhandbook.com!
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |