class: center, middle, inverse, title-slide # Emory COVID-19 Response Collaborative ## Data visualization with ggplot2: an overview and demonstration ###
EpiRhandbook Team
epiRhandbook
epirhandbook@gmail.com
### August 2021 --- <style type="text/css"> /* THIS IS A CSS CHUNK - THIS IS A COMMENT */ /* Size of font in code echo. E.g. 10px or 50% */ .remark-code { font-size: 70%; } /* Size of font in text */ .medium-text { font-size: 75%; } /* Size of font in tables */ .small-table table { font-size: 6px; } .medium-table table { font-size: 8px; } .medium-large-table table { font-size: 10px; } </style> # We are Applied Epi ![](https://github.com/appliedepi/epiRhandbook_eng/raw/master/images/Epi%20R%20Handbook%20Banner%20Beige%201500x500.png) * We are a grassroots collaborative of applied epidemiologists * Our (free) **Epi R Handbook** has been used by 60,000 people worldwide * Now we offer customized training to health departments and NGOs .footnote[Bookmark [www.epiRhandbook.com](www.epirhandbook.com)!] ??? - What makes us different is that we focus on the challenges of applied epi, not academic epi. We emphasize the skills used every day by ground-level epidemiologists. --- # Introducing ourselves * **Neale Batra** - Founder of the Epi R Handbook. - Previously @Philly, @SantaClaraCounty, and @USAID/PEPFAR. Internationally with @MSF and @WHO. * **Alex Spina** - Alum of EPIET (European EIS). Med student and epi consultant for @MSF and @WHO. * **Mathilde Mousset** - R programmer / Data manager at Epicentre (@MSF). PhD in evolutionary ecology. * **Henry Laurenson-Schafer** - Epi/data scientist with WHO Geneva on COVID response. PhD in molecular biology/bioinformatics. * **Wen Lin** - Epi @SantaClaraCounty, **Resident SAS expert**. ??? We have Wen available for answering any questions on translating SAS to R --- # Today's demonstration Today is primarily *explanation* and *demonstration*, following from the previous [demonstration on data management and workflow](https://appliedepi.github.io/emory_training/presentation/slides_workflow.html). We assume you have reviewed the Handbook pages on [R Basics](https://epirhandbook.com/r-basics.html) and [ggplot basics](https://epirhandbook.com/ggplot-basics.html). ![](https://github.com/appliedepi/emory_training/blob/master/presentation/images/EDGE.png?raw=true) ??? - Today's session is primarily a *demonstration* with question & answer session. - We assume that you have read at least some of the relevant pages in the Epi R Handbook prior to this session. - We also offer training in which we guide you through case studies with hands-on coaching. --- # Objectives Our objectives today are to: 1) De-mystify `ggplot()` syntax so you can apply it to your own tasks 2) Demonstrate core `geom_()` commands 3) Introduce helpful "wrapper" packages like **incidence2** and **apyramid** 4) Touch upon advanced tips with packages **ggExtra**, **ggrepel**, **gghighlight**, **scales**, **plotly**, and **esquisse** 5) Introduce mapping (GIS) with **sf** .footnote[The [Epi R Handbook](https://epirhandbook.com/index.html) contains many more examples.] ??? - 1 - core syntax and 2 - exposure - Ready yourself because today will be a whirlwind. - There will be a LOT of content and we will move quickly. - Type questions in the chat, or hold them to the end. - We will do understanding checks throughout --- # Resources for you Today's work is within an R project, which can be [downloaded as a zipped file](https://minhaskamal.github.io/DownGit/#/home?url=https://github.com/appliedepi/emory_training/tree/master/case_study). Today's code is available to you in `ggplot_demo.Rmd` (also online [here](https://github.com/appliedepi/emory_training/blob/master/case_study/ggplot_demo.Rmd)). R project file (`case_study.Rproj`) * `data/` folder * `covid_example_data/` folder * `linelist_cleaned.rds` * `covid_example_data.xlsx` * `covid_shapefile/` folder - `FultonCountryZipCode.shp`... * `map_tiles/` folder * base map images for Fulton County... * `ggplot_demo.Rmd` (R markdown script) ??? - Do NOT download and try it now. Too distracting. - Today will move fast, but just like last time the REAL opportunity for you is to review at your own pace. - We've spent a lot of time developing materials for you. - SHOW SCRIPT AND OUTPUT IN FOLDER --- # Load packages and import data Load packages (**ggplot2** is within [**tidyverse**](https://www.tidyverse.org/)) and import the cleaned (fake) dataset from [last session](https://github.com/appliedepi/emory_training/blob/master/case_study/weekly_report.Rmd). ```r ## install and load necessary packages pacman::p_load( rio, # importing data here, # relative file pathways skimr, # review data DT, # visualize data frame janitor, # data cleaning and tables epikit, # age categories lubridate, # working with dates incidence2, # epidemic curves ggrepel, # smart labels ggExtra, # extras esquisse, # point-and-click for simple ggplots apyramid, # age pyramids scales, # formatting of scales plotly, # interactive plots cowplot, # combine plots * tidyverse # ggplot2 & data management ) ``` ```r linelist <- import(here::here("data", "covid_example_data", "linelist_cleaned.rds")) ``` ??? One tip is to load tidyverse last to avoid masking of package names The `import()` function can import almost any kind of file. An `.rds` file retains column classes from an R data frame. --- class: medium-large-table # Review data Below are the first 25 rows of the `linelist` data frame:
.footnote[Tip: Use `skim()` from **skimr** package to review all columns with summary statistics] --- # Visualization options in R Today we focus on **ggplot2** because it: * is good for fast data exploration of multi-dimensional data * produces very **high quality** final outputs * has well-structured grammar => **high consistency** * is accompanied by many packages that expand functionality See the [R graph gallery](https://www.r-graph-gallery.com/ggplot2-package.html) for inspiration. .footnote[Other plotting options include [**base** R](https://towardsdatascience.com/base-plotting-in-r-eb365da06b22), [**lattice**](https://www.statmethods.net/advgraphs/trellis.html), and [**plotly**](https://plotly.com/r/).] --- # gg-what?? -- - The **ggplot2** *package* is the most popular data visualization tool in R -- - Its `ggplot()` *function* is at the core of the package -- - This whole approach is colloquially known as “ggplotting” -- - Resulting figures are sometimes affectionately called “ggplots” -- **ggplot2** benefits from a wide variety of supplementary R packages that extends its functionalities, such as **gganimate**, **ggthemr**, **ggdendro**, **gghighlight**, **ggforce**... .footnote[ *Bonus question:* What does the "gg” in these names represent? ] ??? - "gg" represents the “grammar of graphics” used to construct the figures --- # Syntax overview Build a plot object by “adding” commands on top of one another that specify plot layers and design elements -- The order of layers will usually look like this: 1) **"Open" the plot** with the `ggplot()` command and specify the dataset -- 2) **"Map" data columns** to "aesthetic" features of the plot such as axes, color, size, shape, fill, transparency -- 3) **Add (`+`) “geom” layers** that visualize data geometrically as shapes -- 4) **Modify "scales"**, such as a color scale or y-axis breaks -- 5) **Add "theme" plot design elements** such as axis labels, title, caption, fonts, text sizes, background themes, or axes rotation -- These layers are "added" sequentially with `+` symbols. **ggplot2** commands can be quite long! ??? Remember that although the commands may be long, it is infinitely easier to edit and recycle than in Excel --- # Open the plot .pull-left[ `ggplot()` creates an empty canvas. Assign the data frame to use. ```r ggplot(data = linelist) ``` Alternatively, use the `%>%` pipe operator to "pipe" a data frame *into* `ggplot()` ```r linelist %>% ggplot() ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-10-1.png" width="504" /> ] ??? This is only a blank canvas, we have not defined what should be in the x and y axes. If several data frames are needed, they can be added in their own geoms. Piping is useful to make one-time changes to a dataset prior to plotting. --- # Mappings with `aes()` .pull-left[ ```r ggplot( data = linelist, * mapping = aes( * x = age, * y = days_hosp)) ``` "Aesthetics" are features *whose display could vary for each data point* (position, color, shape...) `mapping = aes()` assigns plot "aesthetics" to columns in the data. Inputs must be placed within `aes()`. Two basic aesthetic mappings are axes, via `x=` and `y=`. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-12-1.png" width="504" /> ] --- # Add geometry .pull-left[ ```r ggplot( data = linelist, mapping = aes( x = age, y = days_hosp)) + *geom_point() ``` Data are visualized using "geom" commands, such as `geom_point()`. These commands are "added" with a `+` to the `ggplot()` command. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-14-1.png" width="504" /> ] --- # Geometries .pull-left[ Some classical “geoms” include: Geometry |Geom ----------------|-------------------- Histograms |`geom_histogram()` Points |`geom_point()` .footnote[Full list [here](https://ggplot2.tidyverse.org/reference/)] ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-15-1.png" width="504" /> ] --- # Geometries .pull-left[ Some classical “geoms” include: Geometry |Geom ----------------|-------------------- Histograms |`geom_histogram()` Points |`geom_point()` Lines |`geom_line()` Bar plots |`geom_bar()` or `geom_col()` .footnote[Full list [here](https://ggplot2.tidyverse.org/reference/)] ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-16-1.png" width="504" /> ] --- # Geometries .pull-left[ Some classical “geoms” include: Geometry |Geom ----------------|-------------------- Histograms |`geom_histogram()` Points |`geom_point()` Lines |`geom_line()` Bar plots |`geom_bar()` or `geom_col()` Boxplots |`geom_boxplot()` Violin plots |`geom_violin()` .footnote[Full list [here](https://ggplot2.tidyverse.org/reference/)] ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-17-1.png" width="504" /> ] --- # Adding geoms .pull-left[ ```r ggplot( data = linelist, mapping = aes( x = age, y = days_hosp)) + *geom_point() ``` With axes now mapped, `geom_point()` displays the data as points. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-19-1.png" width="504" /> ] --- # Adding geoms .pull-left[ ```r ggplot( data = linelist, mapping = aes( x = age, y = days_hosp)) + geom_point() + *geom_smooth() ``` We can add additional geoms to the current plot with `+`. *Geoms appear in the order they are written*: the smoothed line appears over the points. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-21-1.png" width="504" /> ] .footnote[geom_smooth() gives smoothed conditional means, helping to show trends in presence of "over-plotting" (see [documentation](https://ggplot2.tidyverse.org/reference/geom_smooth.html))] ??? - Explain why you might use one or the other --- # A quick note on indentations Indentations, spaces, and newlines do not impact code execution, and can be varied to improve readability. ```r ggplot(data = linelist, mapping = aes(x = age, y = days_hosp))+geom_point() ``` is the same as: ```r ggplot(data = linelist, mapping = aes(x = age, y = days_hosp)) + geom_point() ``` is the same as: ```r ggplot( data = linelist, # use case linelist mapping = aes( # make aesthetic mappings for all geoms x = age, # assign x-axis to age column y = days_hosp)) + # assign y-axis to duration of hospitalization geom_point() # display data as points ``` ??? - Explain why you might use one or the other long style can enable informative comments/annotations - short style very dense (harder to read for some). Shorter scripts, but so what? The number of lines of your code is not an informative metric. - very long lines => needs to scroll horizontally for people with smaller monitors (not nice) - long-ish style makes it easier to see which argument belongs to each function - spaces around "=" or "+" => make it easier to parse to many people - other? --- class: large-table # Other aesthetics Aside from axes, other common "aesthetics" include: Argument |Controls ----------------|----------------------- `shape` |Display of point as dot, star, triangle, square... `fill` |The *interior* color (e.g of bar or boxplot) `color` |The *exterior* or bar, boxplot - OR point color `size` |Line thickness, point size... `alpha` |Transparency: 0 (invisible) to 1 (opaque) `width` |Width of "bar plot" bars `linetype` |Either solid, dashed, dotted, etc. `binwidth` |Width of histogram bins ??? Note that “aesthetic” in ggplot has a specific meaning that you might associate with the word “aesthetics” in common English. In ggplot those details are called “themes” and are adjusted within a theme() command Each geom accepts certain aesthetics, like `binwidth=` for `geom_histogram()` --- class: medium-text # Aesthetics assignments .pull-left[ Aesthetics can be assigned to either: * **Static values**: `color = "purple"` - Assigned *outside* `aes()` - Same display for all data * **A data column**: `aes(color = died)` - Assigned *inside* `aes()` - Displays data as "groups" ] .pull-right[ Some examples: <img src="slides_ggplot_files/figure-html/unnamed-chunk-25-1.png" width="504" /> ] --- class: medium-text # Aesthetics assignments .pull-left[ Aesthetics can be assigned to either: * **Static values**: `fill = "purple"` - Assigned *outside* `aes()` - Same display for all data * **A data column**: `aes(fill = died)` - Assigned *inside* `aes()` - Displays data as "groups" ] .pull-right[ More examples: <img src="slides_ggplot_files/figure-html/unnamed-chunk-26-1.png" width="504" /> ] --- # Static aesthetics .pull-left[ ```r ggplot( data = linelist, mapping = aes( x = age, y = days_hosp)) + *geom_point(color = "seagreen") ``` An aesthetic is *static* if it applies the same display to all data points in the geom or plot. Static aesthetics are defined *outside* `aes()` to a *number or character value*. Other examples: `size = 3` `alpha = 0.5` `width = 1.2` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-29-1.png" width="504" /> ] --- # Dynamic aesthetics .pull-left[ ```r ggplot( data = linelist, mapping = aes( x = age, y = days_hosp, * color = died)) + geom_point() ``` *Dynamic* aesthetics are mapped to a column name, and defined *inside* `aes()`. This creates "groups" in the plot and generates a legend. The display varies for each data point. Above, `color=` is mapped to column `died`, and is inherited by `geom_point()`. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-31-1.png" width="504" /> ] ??? --- # Static and dynamic .pull-left[ ```r ggplot( data = linelist, mapping = aes( x = age, y = days_hosp, * color = died)) + geom_point( * size = 7, * alpha = 0.7) ``` Above, `size = 7` and `alpha = 0.7` are static aesthetics defined outside any `aes()` and apply to `geom_point()`. `color=` is within an `aes()`, and maps to values in column `died`. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-33-1.png" width="504" /> ] .footnote[Read more about ggplot aesthetics [here](https://ggplot2.tidyverse.org/articles/ggplot2-specs.html)] ??? As there is only one geom, all aesthetics can be written in `ggplot()`, or in `geom_point()` --- # Aesthetic mapping placement 1) Dynamic mappings in the initial `ggplot()` call will apply to subsequent geoms, *unless otherwise indicated* 2) Static aesthetics (e.g. `color = "blue"`) are not inherited by subsequent geoms 3) Mappings written within one geom apply only to that geom .pull-left[ ```r ggplot( data = linelist, mapping = aes( x = age, y = days_hosp)) + # points colored by outcome geom_point( mapping = aes(color = died), size = 1) + # smoothed means by outcome geom_smooth( mapping = aes(color = died), size = 3) + # smoothed mean of entire dataset geom_smooth(color = "black") ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-35-1.png" width="504" /> ] ??? - Three rules to remember as you decide where to place aesthetic mappings - This is rather complex example, but serves to demonstrate the above guidelines. --- # A common error .pull-left[ If your data are "grouped", yet you plot indiscriminately without the grouping assigned to a dynamic aesthetic, you get multiple y values per x value. Below, `linelist` is aggregated into counts per unique week-outcome. ```r outcome_week_data <- linelist %>% * count(died, * week_onset = floor_date( * x = date_onset, * unit = "week")) %>% select(week_onset, everything()) %>% arrange(week_onset) %>% drop_na(week_onset) ``` ] .pull-right[ First few rows: |week_onset |died | n| |:----------|:-------|--:| |2019-12-29 |Yes | 1| |2019-12-29 |No | 5| |2020-01-05 |No | 7| |2020-01-05 |Unknown | 1| |2020-01-12 |Yes | 1| |2020-01-12 |No | 5| |2020-01-12 |Unknown | 2| |2020-01-19 |No | 4| ] --- # A common error .pull-left[ ```r ggplot( * data = outcome_week_data, mapping = aes( x = week_onset, y = n)) + geom_line() ``` Above, the column `died` is not provided to any grouping aesthetic. `ggplot()` does not know how to display the multiple values for each week. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-39-1.png" width="504" /> ] --- # A common error - resolved .pull-left[ ```r ggplot( data = outcome_week_data, mapping = aes( x = week_onset, y = n, * color = died))+ geom_line() ``` Use aesthetic mappings like `fill=` or `color=` to correctly visualize the "groups". Above, `color=` is mapped to the `died` column. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-41-1.png" width="504" /> ] --- # Facets .pull-left[ ```r ggplot( data = linelist, mapping = aes(x = date_onset)) + geom_histogram() + *facet_wrap(~eth_race) ``` Another way of *displaying groups* is via faceting. `facet_wrap()` produces one facet per unique value. Place a "~" before the column name. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-43-1.png" width="504" /> ] ??? Also called "small multiples" --- # Facets .pull-left[ ```r ggplot( data = linelist, mapping = aes(x = date_onset)) + geom_histogram() + *facet_wrap(~eth_race, scales = "free_y") ``` "Free" auto-scaled axes with `scales=`. Options: - "free_y" - "free_x" - "free" (both x and y) ] .footnote[ Alert your audience if you use free axes! Also, try `ncol=` and `nrow=` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-45-1.png" width="504" /> ] --- # Facets - by two variables .pull-left[ ```r ggplot( data = linelist, mapping = aes(x = date_onset)) + geom_histogram() + *facet_wrap(eth_race ~ gender) ``` The "~" signifies "by". You can place columns on either side. With `facet_wrap()`, levels are combined into facet titles, appearing alphabetically/by factor level. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-47-1.png" width="504" /> ] --- # Facets - grid layout .pull-left[ Use `facet_grid()`. Labels appear on top and side. ```r ggplot( data = linelist, mapping = aes(x = date_onset)) + geom_histogram() + *facet_grid(eth_race ~ gender) ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-49-1.png" width="504" /> ] --- # Facets - drop levels .pull-left[ Use `filter()` or `drop_na()` in advance, to remove undesired levels. ```r linelist %>% * drop_na(gender, eth_race) %>% * filter(gender != "Unknown") %>% # begin plot ggplot( mapping = aes(x = date_onset)) + geom_histogram() + facet_grid(eth_race ~ gender) ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-51-1.png" width="504" /> ] .footnote[Above, we pipe the adjusted data frame into `ggplot()`, for ease] --- # Facets + `gghighlight()` .pull-left[ Add `gghighlight()` from **gghighlight** to show a full "shadow" behind each facet. For color add the `aes(fill=)`. ```r ggplot( data = linelist, mapping = aes( x = date_onset, * fill = eth_race)) + geom_histogram() + facet_wrap(~ eth_race) + *gghighlight::gghighlight() ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-53-1.png" width="504" /> ] --- # gghighlight .pull-left[ Add `gghighlight()` to other plots, and specify specific values (or criteria) to highlight ```r linelist %>% # get daily counts by age group group_by(age_group, date_report) %>% count() %>% # plot ggplot( mapping = aes( x = date_report, y = n, color = age_group)) + geom_line() + * gghighlight::gghighlight( age_group %in% c("40-49", "60-69)"))+ theme(legend.position = "none") ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-55-1.png" width="504" /> ] --- # Scales - overview Scale commands replace defaults of *how* the aesthetic mappings manifest, such as: * *Which* colors or shapes to display * The min/max of point sizes * The min/max and frequency of axes breaks As a generic formula, these commands are written as: **`scale_AESTHETIC_METHOD()`**. 1) `scale_` : this prefix never changes 2) AESTHETIC: `_fill_` , `_color_` , `_x_` , `_y_` , etc. 3) METHOD: `_continuous()`, `_discrete()`, `_manual()`, `_date()`, etc. --- # Scales examples Some examples of scale commands: You want to adjust |Scale command --------------------|------------------- continuous y-axis |`scale_y_continuous()` date x-axis |`scale_x_date()` categorical x-axis |`scale_x_discrete()` fill, continuous |`scale_fill_continuous()` fill, continuous |`scale_fill_gradient()` color, manual assign|`scale_color_manual()` --- # Scales - default .pull-left[ ```r ggplot( data = linelist, mapping = aes( x = eth_race, fill = died)) + geom_bar() ``` Above, the fill of a bar plot uses the **default colors and axis breaks**: ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-57-1.png" width="504" /> ] --- # Scales - adjusted fill .pull-left[ ```r ggplot( data = linelist, mapping = aes( x = eth_race, fill = died)) + geom_bar() + *scale_fill_manual( * values = c( * "Yes" = "violetred", * "No" = "aquamarine", * "Unknown" = "grey")) ``` Within `scale_fill_manual()` we provide assignments within a vector `c()`. .footnote[Use *na.value = "grey"* for missing values] ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-59-1.png" width="504" /> ] ??? Discuss the na.value= arguments in most scale commands, and the difference between having NA values in the data and having an explicit missing value such as "Unknown". --- # Scales - adjusted y-axis .pull-left[ ```r ggplot( data = linelist, mapping = aes( x = eth_race, fill = died)) + geom_bar() + scale_fill_manual( values = c( "Yes" = "violetred", "No" = "aquamarine", "Unknown" = "grey")) + *scale_y_continuous( * breaks = seq(from = 0, * to = 35000, * by = 5000)) ``` In `scale_y_continuous()` we adjust the y-axis breaks using `seq()` to define a numeric sequence. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-61-1.png" width="504" /> ] --- # Scales - Start axes at 0 .pull-left[ ```r ggplot( data = linelist, mapping = aes( x = eth_race, fill = died)) + geom_bar() + scale_fill_manual( values = c("Yes" = "violetred", "No" = "aquamarine", "Unknown" = "grey")) + scale_y_continuous( breaks = seq(from = 0, to = 35000, by = 5000), * expand = c(0, 0)) + *scale_x_discrete( * expand = c(0, 0)) ``` In any `scale_x_` or `scale_y_` command, use `expand = c(0,0)` to remove excess space around the plot features. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-63-1.png" width="504" /> ] --- # Scales - date axis labels .pull-left[ ```r ggplot( data = linelist, mapping = aes(x = date_onset)) + geom_histogram() ``` The default scale for date axes will vary by the range of your data. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-65-1.png" width="504" /> ] --- # Scales - date label breaks .pull-left[ ```r ggplot( data = linelist, mapping = aes(x = date_onset)) + geom_histogram() + *scale_x_date( * date_breaks = "3 months") ``` Adjust axis labels with `scale_x_date()`. Use `date_breaks=` values like "1 week", "2 weeks", or "3 months". Note: these are the *axis* label breaks, not histogram bin breaks! .footnote[ Note: Default "weeks" start on Mondays. See [epiRhandbook epicurves](https://epirhandbook.com/epidemic-curves.html) page for alternatives. ] ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-67-1.png" width="504" /> ] ??? For tips on geom_histogram() bins, see Epi R Handbook epicurves page --- # Scales - date axis labels .pull-left[ ```r ggplot( data = linelist, mapping = aes(x = date_onset)) + geom_histogram() + scale_x_date( date_breaks = "3 months", * date_labels = "%d %b\n%Y") ``` Specify date label format to `date_labels=` using ["strptime" syntax](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/strptime) `"%d %b %Y"` for DD MMM YYYY. See Epi R Handbook [Epicurves](https://epirhandbook.com/epidemic-curves.html) and [Strings](https://epirhandbook.com/characters-and-strings.html) pages for more tips ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-69-1.png" width="504" /> ] ??? /n is a newline --- # Scales - date axis labels .pull-left[ ```r ggplot( data = linelist, mapping = aes(x = date_onset)) + geom_histogram() + scale_x_date( date_breaks = "3 months", * labels = scales::label_date_short() ) ``` Alternatively, simplify by assigning `labels=` to `label_date_short()` from [**the scales package**](https://scales.r-lib.org/) The year is not repeated on each label anymore. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-71-1.png" width="504" /> ] .footnote[Note the use above of `labels=`, not `date_labels=`.] --- class: medium-large-table # Scales - display percents .pull-left[ Easily display proportions as percents with `percent()` from **scales** within `scale_y_continuous()`. First few rows of a CFR dataset: |month | cases| deaths| CFR| |:----------|-----:|------:|---------:| |2020-01-01 | 1| 0| 0.0000000| |2020-02-01 | 54| 3| 0.0555556| |2020-03-01 | 857| 73| 0.0851809| |2020-04-01 | 1832| 152| 0.0829694| ```r ggplot( data = CFR_data, mapping = aes( x = month, * y = CFR) + geom_line(size = 2, color = "brown") + *scale_y_continuous(labels = percent) ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-75-1.png" width="504" /> ] ??? Remember to load the scales package --- # Plot labels .pull-left[ ```r ggplot(data = linelist) + geom_point( mapping = aes( x = age, y = days_hosp, color = died), alpha = 0.3) + *labs( * title = "Duration of admission", * subtitle = "Fulton County, GA", * x = "Age (years)", * y = "Duration (days)", * caption = "Fictional COVID-19 data", * color = "Deceased" ) ``` Use `labs()` as above. Note: to edit legend title, use the aesthetic that created the legend (e.g. `color=`). ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-77-1.png" width="504" /> ] --- # Plot label tips See the [Epi R Handbook page](https://epirhandbook.com/characters-and-strings.html#dynamic-strings) on adjusting characters with the **stringr** package **New lines** ```r "Top line\nNextline" # \n creates new line str_wrap("This is a really long title", 30) # wrap at 30 characters ``` -- **Dynamic labels** - Embed code in `str_glue()` that updates with the data ```r str_glue("Data as of {Sys.Date()}") ``` ``` ## Data as of 2021-08-20 ``` ```r str_glue("{fmt_count(linelist, is.na(date_onset))} cases missing onset and not shown") ``` ``` ## 37323 (45.7%) cases missing onset and not shown ``` ??? Explain that in str_glue, anything within curly brackets it will run as R code. --- # Themes .pull-left[ Themes are non-data design features (background, text size/color, etc). [These "complete themes"](https://ggplot2.tidyverse.org/reference/ggtheme.html) are easy to add. ```r # Try one of these... theme_bw()+ theme_classic()+ theme_dark()+ theme_gray()+ theme_minimal()+ theme_light()+ theme_void()+ ``` Try the argument `base_size = 16` to quickly increase text size. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-82-1.png" width="504" /> ] --- # Themes .pull-left[ Make micro-adjustments with `theme()`. See the handbook for tips. ```r ggplot( data = linelist, mapping = aes( x = age, y = days_hosp, color = died), alpha = 0.3) + geom_point() + labs( title = "Duration of admission", x = "Age (years)", y = "Duration (days)", color = "Deceased")+ theme_minimal(base_size = 16) + *theme( * legend.position = "bottom", * plot.title = element_text( * color = "red", * face = "bold"), * axis.title.y = element_text(angle = 90)) ``` ] .pull-right[ The syntax takes practice - see [this list](https://ggplot2.tidyverse.org/reference/theme.html) of feature-specific arguments. <img src="slides_ggplot_files/figure-html/unnamed-chunk-84-1.png" width="504" /> ] ??? Talk about these theme() arguments and how they consist of two parts, just like `mapping = aes()`. Explain that nobody has these all memorized, but the common ones are easy to remember once you use them enough. Remember to add them AFTER any complete themes. --- # Bar plots `geom_col()` and `geom_bar()` are used to make general bar plots (non-epicurves). -- Use `geom_bar()` if bar height should reflect the **number of rows** in the data (e.g. a case linelist). ```r ggplot( data = linelist, # begin with linelist mapping = aes(x = eth_race)) + # No y= argument *geom_bar() ``` -- Use `geom_col()` there is a numeric column containing the desired **bar height** (e.g. aggregated count data). ```r ggplot( data = linelist_agg, # begin with aggregated count data mapping = aes(x = eth_race, y = n)) + # bar height is value in column "n" *geom_col() ``` --- class: medium-table # Bar plot - count rows .pull-left[ With `geom_bar()`, the height reflects the number of rows per x-axis group. This works well for linelist data. |date_onset |eth_race |died | |:----------|:---------|:-------| |2020-03-20 |White, NH |No | |2020-01-28 |White, NH |No | |2020-02-10 |Black, NH |No | |2021-05-19 |Black, NH |No | |2020-02-20 |White, NH |Unknown | |2020-01-17 |Black, NH |Yes | ```r ggplot( data = linelist, mapping = aes( x = eth_race)) + geom_bar() + theme(axis.text.x= element_text(angle=30)) ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-89-1.png" width="504" /> ] --- class: medium-table # Bar plot - count grouped rows .pull-left[ To achieve "stacked" bars with `geom_bar()`, assign the grouping column to `fill=`, within `aes()`. |date_onset |eth_race |died | |:----------|:---------|:-------| |2020-03-20 |White, NH |No | |2020-01-28 |White, NH |No | |2020-02-10 |Black, NH |No | |2021-05-19 |Black, NH |No | |2020-02-20 |White, NH |Unknown | |2020-01-17 |Black, NH |Yes | ```r ggplot( data = linelist, mapping = aes( x = eth_race, * fill = died)) + geom_bar() + theme(axis.text.x= element_text(angle=30)) ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-92-1.png" width="504" /> ] --- class: medium-table # Bar plot - aggregated counts .pull-left[ In contrast, `geom_col()` uses a column of counts, such as column `n` in this `linelist_eth` dataset: |eth_race | n| |:-------------------|-----:| |Asian, NH | 3022| |Black, NH | 34395| |White, NH | 27303| |Hispanic, all races | 8600| |Other, NH | 2912| |Unknown | 5525| Column `n` provides the bar height. ```r ggplot(linelist_eth) + * geom_col( mapping = aes( x = eth_race, * y = n))+ theme(axis.text.x= element_text(angle= 30)) ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-95-1.png" width="504" /> ] --- class: medium-table # Bar plot - stacked counts .pull-left[ To have "stacked" bars using `geom_col()`, each plotting group must have its own rows in the data. Use "long"-style data like below: |eth_race |died | n| |:---------|:-------|-----:| |Asian, NH |Yes | 30| |Asian, NH |No | 1802| |Asian, NH |Unknown | 1190| |Black, NH |Yes | 1037| |Black, NH |No | 19204| |Black, NH |Unknown | 14154| ```r ggplot(linelist_eth_died) + geom_col( mapping = aes( x = eth_race, * y = n, * fill = died)) + theme(axis.text.x= element_text(angle=30)) ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-98-1.png" width="504" /> .footnote[See the pivoting page in the Epi R Handbook] ] --- # Bar plot - a common error .pull-left[ If your data look like below (counts) and your plot looks like that (bars of same height), ensure you are using `geom_col()` and not `geom_bar()`! |eth_race | n| |:-------------------|-----:| |Asian, NH | 3022| |Black, NH | 34395| |White, NH | 27303| |Hispanic, all races | 8600| |Other, NH | 2912| |Unknown | 5525| ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-100-1.png" width="504" /> This plot shows that there is one row per ethnicity/race! ] --- # Bar plot - flip axes .pull-left[ ```r ggplot(linelist) + geom_bar( mapping = aes( x = eth_race, fill = died)) + theme(legend.position = "top") + *coord_flip() ``` It is simple to flip the axes on any ggplot by adding `coord_flip()` This is useful in bar plot to improve readability. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-102-1.png" width="504" /> ] --- # Bar plot - adjust order .pull-left[ ```r ggplot(linelist) + geom_bar( mapping = aes( * x = fct_infreq(eth_race), * fill = fct_infreq(died))) + theme(legend.position = "top") + coord_flip() ``` Use the **forcats** package to convert to class *factor* and adjust "level" order. Above, `fct_infreq()` orders x-axis position and stacks (fill) by frequency. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-104-1.png" width="504" /> ] .footnote[See the Epi R Handbook page on [factors](https://epirhandbook.com/factors.html#within-a-plot).] ??? You will need to clean up the axes and legend titles now. --- # Bar plot - reverse order .pull-left[ ```r ggplot(linelist) + geom_bar( mapping = aes( * x = fct_rev(fct_infreq(eth_race)), * fill = fct_rev(fct_infreq(died)))) + theme(legend.position = "top") + coord_flip() ``` To *reverse* order, use `fct_rev()`, which can be wrapped around other functions. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-106-1.png" width="504" /> ] --- # Bar plot - adjust width .pull-left[ ```r ggplot( data = linelist_eth_died, mapping = aes( x = eth_race, y = n, fill = died)) + *geom_col(width = 0.5) + theme(axis.text.x = element_text(angle = 30)) ``` Adjust bar width with `width=`. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-108-1.png" width="504" /> ] ??? Be wary of adjusting width for date bars (e.g. month) - use `geom_histogram()` instead with bin breaks. --- # Bar plot - adjacent .pull-left[ ```r ggplot( data = linelist_eth_died, mapping = aes( x = eth_race, y = n, fill = fct_infreq(died), label = n)) + *geom_col(position = "dodge")+ theme(axis.text.x = element_text(angle = 30)) ``` To make the bars adjacent, specify `position = "dodge"` in `geom_col()`. .footnote[Adjust the order with via a **forcats** function.] ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-110-1.png" width="504" /> ] --- # Bar plot - display counts .pull-left[ To display text, use `geom_col()` with `geom_text()`. Assign `aes(label=)` to the height values. The `position=` arg in `geom_text()` can specify display at mid-bar. ```r ggplot( data = linelist_eth_died, mapping = aes( x = eth_race, y = n, fill = died, * label = n)) + geom_col() + *geom_text( * size = 3, * position = position_stack(vjust = 0.5)) + theme(axis.text.x = element_text(angle=30)) ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-112-1.png" width="504" /> ] --- # Point-and-click ggplot2 Now that you've learned the basic syntax... the [**esquisse**](https://cran.r-project.org/web/packages/esquisse/vignettes/get-started.html]) package allows point-and-click creation of (simple) ggplots! (esquisse means "sketch" in French) ![](https://github.com/appliedepi/emory_training/blob/master/presentation/images/esquisse1.gif?raw=true)<!-- --> --- # Make an epidemic curve Two approaches that we suggest: 1) Use the **incidence2** package * Fast, simple, and modifiable with ggplot additions 2) Use ggplot's `geom_histogram()` * Most customizeability * Most complex code -- Today's demonstration will focus on **incidence2**. .footnote[See the Epi R handbook's [Epicurves page](https://epirhandbook.com/epidemic-curves.html) for detailed examples of both methods] --- # Epicurve - incidence object .pull-left[ **Create** an incidence object ```r weekly <- incidence( x = linelist, date_index = date_onset, interval = "week") ``` **Plot** the incidence object ```r plot(weekly) ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-116-1.png" width="504" /> ] --- # Epicurve - incidence object .pull-left[ Adjust the interval ```r bimonthly <- incidence( x = linelist, date_index = date_onset, * interval = "2 months") ``` **Plot** the incidence object ```r plot(bimonthly) ``` Try "Sunday weeks" or "MMWR week" for Sunday weeks. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-119-1.png" width="504" /> ] --- # Epicurve - groups .pull-left[ To show groups, specify them to `groups=` in the `incidence()` command: ```r weekly <- incidence( x = linelist, date_index = date_onset, interval = "weeks", * groups = eth_race) ``` AND to `fill=` in the `plot()` command: ```r plot(weekly, * fill = eth_race) ``` If grouping by multiple columns, nest them in both places within `c()`: ```r groups = c(eth_race, gender) ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-123-1.png" width="504" /> ] --- # Epicurve - ggplot2 additions .pull-left[ ```r plot(weekly, fill = eth_race) + scale_y_continuous( expand = c(0,0), breaks = seq(0,2000,250)) + *scale_y_continuous( * expand = c(0,0), * breaks = seq(0,2000,250)) + *theme_minimal(base_size = 16) + *theme(legend.position = "top") + *labs(fill = "Race and\nEthnicity") ``` Add other **ggplot2** commands with `+`. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-125-1.png" width="504" /> ] --- # Epicurve - date axis .pull-left[ ```r plot(weekly, fill = eth_race, * date_format = "%a %d %b %Y\n (Week %W)", * angle = 30)+ scale_y_continuous( expand = c(0,0), breaks = seq(0,2000,250)) + theme_minimal(base_size = 16) + theme(legend.position = "top") + labs(fill = "Race and\nEthnicity") ``` Do not adjust **incidence2** epicurves with `scale_x_date()`. Instead, use custom **incidence2** arguments in `plot()`. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-127-1.png" width="504" /> ] --- # Epicurve - show cases .pull-left[ For small outbreaks, the style `show_cases = TRUE` may be helpful: ```r small_outbreak <- linelist %>% filter( zip == 30024, date_report >= as.Date("2020-12-01")) %>% incidence( date_index = date_onset, interval = "Sunday weeks", groups = eth_race) ``` ```r plot(small_outbreak, fill = eth_race, * show_cases = TRUE)+ theme(legend.position = "bottom")+ labs( title = "ZIP 30024 by race") ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-131-1.png" width="504" /> ] ??? Dataset is piped in from above --- # Epicurve - color palettes .pull-left[ Use `scale_fill_viridis_d()` for palettes that are color-blind friendly: ```r plot(weekly, fill = eth_race)+ *scale_fill_viridis_d( option = "inferno", name = "Race and\nEthnicity", na.value = "grey") ``` [Viridis](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html) (try with `option = "plasma"` or "inferno"), and [colorbrewer](https://www.r-graph-gallery.com/38-rcolorbrewers-palettes.html) palette functions can be added to any ggplot as well. .footnote[The `_d()` indicates a scale for discrete values] ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-133-1.png" width="504" /> ] --- # Epicurve - aggregated counts .pull-left[ You can also use **incidence2** on data that are aggregated counts... ```r # For demo: aggregate linelist linelist_day_counts <- linelist %>% * count( * day = floor_date(date_report, "day"), * died) %>% drop_na(day) ``` ] .pull-right[ A few rows of daily counts: |day |died | n| |:----------|:-------|--:| |2021-06-19 |No | 11| |2021-06-19 |Unknown | 11| |2021-06-20 |No | 10| |2021-06-20 |Unknown | 6| |2021-06-21 |No | 12| |2021-06-21 |Unknown | 11| |2021-06-22 |No | 9| |2021-06-22 |Unknown | 6| |2021-06-23 |No | 7| |2021-06-23 |Unknown | 7| ] --- # Epicurve - aggregated counts .pull-left[ ```r died_curve <- incidence( linelist_day_counts, date_index = day, * count = n, interval = "week", groups = died ) ``` ```r # plot stacked by outcome plot(died_curve, fill = died) ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-140-1.png" width="504" /> ] --- # Demographic pyramids We suggest two ways to make age pyramids: 1) Use the **apyramid** package - simple, easy 2) Use **ggplot2** - more customizeable, but opportunity for error Today we will demonstrate **apyramid**. .footnote[See the Epi R Handbook [page on demographic pyramids](https://epirhandbook.com/demographic-pyramids-and-likert-scales.html) and the [package vignette](https://cran.r-project.org/web/packages/apyramid/vignettes/intro.html) for more information.] --- # Demographic pyramids .pull-left[ ```r age_pyramid( data = linelist, age_group = "age_group", split_by = "gender") ``` The function `age_pyramid()` from **apyramid** offers an easy interface. .footnote[We have cleaned `gender` to only two values] ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-143-1.png" width="504" /> ] --- # Demographic pyramids .pull-left[ ```r age_pyramid( data = linelist, age_group = "age_group", split_by = "gender", * proportional = TRUE, * show_midpoint = FALSE, * pal = c("darkgreen", "brown")) ``` Further modifications: * Percents instead of raw counts on the x-axis * No mid-point designation * Specified colors ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-145-1.png" width="504" /> ] --- # Demographic pyramids .pull-left[ ```r age_pyramid( data = linelist, age_group = "age_group", split_by = "gender", * stack_by = "hospitalized" proportional = TRUE, show_midpoint = FALSE, pal = c("darkgreen", "brown", "yellow", "orange")) ``` You can stack the bars by another variable with `stack_by=` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-147-1.png" width="504" /> ] --- # Demographic pyramids .pull-left[ ```r age_pyramid( data = linelist, age_group = "age_group", split_by = "gender", proportional = TRUE, show_midpoint = FALSE) + *theme_minimal(base_size = 10) + *labs( * title = "Age and Gender", * subtitle = "Fulton County, GA", * x = "Percent of total", * y = "Age group", * fill = "Gender", * caption = "Caption here") ``` You can add further ggplot commands. ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-149-1.png" width="504" /> ] --- # Dynamic labels example Let's use dynamic labels that will update with the data. **Simple caption** ```r str_glue("n = {nrow(linelist)}") ``` ``` ## n = 81757 ``` -- **Complex caption** For complicated `str_glue()` scenarios, define the dynamic components outside the quotation marks. `fmt_count()` from **epikit** is useful to count and display rows. ```r str_glue("{missing} missing age or gender not shown.", missing = fmt_count(linelist, is.na(gender) | is.na(age_group)) ) ``` ``` ## 405 (0.5%) missing age or gender not shown. ``` --- # Dynamic labels example **Subtitle** Max/Min dates can be wrapped in `format()` from **base** to adjust the display. See the epiRhandbook section on [strptime syntax](https://epirhandbook.com/working-with-dates.html?q=strptime#format). ```r str_glue("Fulton County, reported {min_date} - {max_date}", min_date = format(min(linelist$date_report, na.rm=T), "%B %d %Y"), max_date = format(max(linelist$date_report, na.rm=T), "%B %d %Y") ) ``` ``` ## Fulton County, reported January 03 2020 - June 23 2021 ``` --- class: remark-code # Dynamic captions Now applying those dynamic captions: .pull-left[ ```r age_pyramid( data = linelist, age_group = "age_group", split_by = "gender", proportional = TRUE, show_midpoint = FALSE) + theme_minimal(base_size = 10) + labs( title = "Age and Gender", * subtitle = str_glue( "Fulton County, reported {min_date} - {max_date}", min_date = format( min(linelist$date_report, na.rm=T), "%B %d %Y"), max_date = format( max(linelist$date_report, na.rm=T), "%B %d %Y") ), x = "Percent of total", y = "Age group", fill = "Gender", * caption = str_glue( "{missing} missing age or gender not shown.", missing = fmt_count( linelist, is.na(gender) | is.na(age_group)) ) ) ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-155-1.png" width="504" /> ] --- class: medium-table # Labeling .pull-left[ To demonstrate labeling points, we create a new dataset that summarizes CFR and median age by race/ethnicity ```r race_CFR_age <- linelist %>% group_by(eth_race) %>% summarise( cases = n(), deaths = sum(died_covid == "Yes", na.rm=T), CFR = deaths/cases, med_age = median(age, na.rm=T) ) ``` ] .pull-right[ |eth_race | cases| deaths| CFR| med_age| |:-------------------|-----:|------:|---------:|-------:| |Asian, NH | 3022| 23| 0.0076109| 34| |Black, NH | 34395| 833| 0.0242186| 39| |White, NH | 27303| 417| 0.0152730| 36| |Hispanic, all races | 8600| 55| 0.0063953| 34| |Other, NH | 2912| 6| 0.0020604| 32| ] --- # Labeling .pull-left[ We can plot these data as points, and label them with `geom_text()`... but it does not look very good. ```r ggplot( data = race_CFR_age, mapping = aes( x = med_age, y = CFR, size = cases, * label = eth_race)) + geom_point() + *geom_text() ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-159-1.png" width="504" /> ] --- # Labeling .pull-left[ Labels look much improved by using `geom_label_repel()` from the **ggrepel** package: ```r library(ggrepel) ggplot( data = race_CFR_age, mapping = aes( x = med_age, y = CFR, size = cases, * label = eth_race)) + geom_point() + *geom_label_repel( * size = 5, * min.segment.length = 0) ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-161-1.png" width="504" /> ] --- # Labeling .pull-left[ `labels=` can be assigned complex values with `str_glue()`: ```r library(ggrepel) ggplot( data = race_CFR_age, mapping = aes( x = med_age, y = CFR, size = cases, * label = str_glue( * "{eth_race}\n{cases} cases"))) + geom_point() + geom_label_repel( size = 5, min.segment.length = 0) ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-163-1.png" width="504" /> ] --- # Labeling .pull-left[ Use `comma()` from the **scales** package to make numbers display with comma separators every three digits. ```r ggplot( data = race_CFR_age, mapping = aes( x = med_age, y = CFR, size = cases, * label = str_glue( * "{eth_race}\n{comma(cases)} cases"))) + geom_point() + geom_label_repel( size = 4, min.segment.length = 0) ``` See the many other useful functions in [**scales**](https://scales.r-lib.org/). ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-165-1.png" width="504" /> ] --- # Marginal distributions .pull-left[ Add "marginal" histograms to scatterplots with `ggMarginal()` from the **ggExtra** package. ```r scatterplot <- ggplot( data = linelist, mapping = aes( x = age, y = days_hosp)) + geom_point() ``` ```r ggExtra::ggMarginal( scatterplot, type = "histogram", fill = "lightblue", xparams = list(binwidth = 10), yparams = list(binwidth = 5)) ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-168-1.png" width="504" /> ] --- # Combining plots - plot 1 The **cowplot** package facilitates combining multiple plots, while aligning elements such as axes. For date axes, ensure the same limits in both plots. .pull-left[ Define the first plot. ```r plot1 <- plot2 <- linelist %>% ggplot( mapping = aes( x = date_report), binwidth = 7)+ geom_histogram()+ theme_minimal()+ scale_y_continuous(expand = c(0,0))+ * scale_x_date( * limits = c( * as.Date("2020-03-01"), * max(linelist$date_report, na.rm=T)), date_breaks = "months", labels = scales::label_date_short(), expand = c(0,0)) ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-171-1.png" width="504" /> ] --- # Combining plots - plot 2 .pull-left[ ```r plot2 <- linelist %>% group_by(week = floor_date(date_report, "week")) %>% summarise(ci = list(mean_cl_normal(age) %>% rename(mean=y, lwr=ymin, upr=ymax))) %>% unnest() %>% ggplot( mapping = aes( x = week, y = mean, ymin = lwr, ymax = upr))+ geom_ribbon(alpha = 0.5, fill = "green", color = "green")+ geom_line(size = 2, color = "darkgreen")+ *scale_x_date( * limits = c( * as.Date("2020-03-01"), * max(linelist$date_report, na.rm=T)), date_breaks = "months", labels = scales::label_date_short(), expand = c(0,0))+ coord_cartesian(ylim = c(30, 60))+ theme_minimal()+ labs( y = "Weekly mean age (95%CI)", x = "Month") ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-173-1.png" width="504" /> ] --- # Combining plots - result ```r plot_grid(plot1, plot2, rel_heights = c(1, 1), ncol = 1, align = "hv") ``` <img src="slides_ggplot_files/figure-html/unnamed-chunk-174-1.png" width="648" /> --- # Dual-axis - plot 1 .pull-left[ You can also use **cowplot** to overlap two plots and create a "dual-axis" plot. Define the first plot: ```r plot1 <- ggplot( data = linelist, mapping = aes( x = date_report)) + geom_histogram(color = "grey", alpha = 0.5) + *theme_cowplot() + *scale_x_date( * limits = c( * as.Date("2020-03-01"), * max(linelist$date_report, na.rm = TRUE)), date_breaks = "months", labels = scales::label_date_short(), expand = c(0, 0), name = "") + scale_y_continuous( expand = c(0,0), name = "Weekly reported case incidence") ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-176-1.png" width="504" /> ] --- # Dual-axis - plot 2 .pull-left[ ```r plot2 <- linelist %>% group_by(epiweek) %>% summarise(CFR = sum(died == "Yes", na.rm=T) / n() ) %>% ggplot( mapping = aes( x = epiweek, y = CFR))+ geom_line(size = 2, color = "orange")+ * theme_cowplot()+ scale_y_continuous( * position = "right", limits = c(0, 0.2), expand = c(0,0), name = "Weekly CFR")+ * scale_x_date( * limits = c( * as.Date("2020-03-01"), * max(linelist$date_report, na.rm=T)), date_breaks = "months", labels = scales::label_date_short(), expand = c(0,0), name = "Epiweek")+ theme( axis.text.y = element_text(color = "orange", face = "bold"), axis.title.y = element_text(color = "orange", face = "bold")) ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/unnamed-chunk-178-1.png" width="504" /> ] --- # Dual-axis - combined ```r aligned <- align_plots(plot1, plot2, align = "hv") combined <- ggdraw(aligned[[1]]) + draw_plot(aligned[[2]]) combined ``` <img src="slides_ggplot_files/figure-html/unnamed-chunk-179-1.png" width="648" /> --- # Interactive plots Use `ggplotly()` from the **plotly** package to easily make most ggplots interactive. ```r # Provide the name of a defined plot into ggplotly() plotly::ggplotly(bar_plot) ```
--- # Data structure **ggplot2** works best when the dataset is in "long" format, where each variable has its own column. We recommend that you become familiar with `pivot_longer()` from **dplyr** (see the [Epi R Handbook pivoting page](https://epirhandbook.com/pivoting-data.html)). ![](https://github.com/appliedepi/emory_training/blob/master/presentation/images/pivoting.png?raw=true)<!-- --> --- # Helpful charts .pull-left[ For color & fill use names or hex code ![](https://github.com/appliedepi/emory_training/blob/master/presentation/images/ggplot_colors.png?raw=true)<!-- --> ] .pull-right[ For shapes, try these numeric codes: ![](https://github.com/appliedepi/emory_training/blob/master/presentation/images/ggplot_shapes.png?raw=true)<!-- --> ] --- class: inverse, center, middle # Geographic Information Systems (GIS) ### An introduction with R --- # Shapefiles .pull-left[ - Multiple layers make file - Made up of: - Points, lines, polygons - handled with [{sf}](https://r-spatial.github.io/sf/) package - simple features: make things much more simple! - Can use tidyverse syntax as if dataframes - Other formats exist (e.g. GeoJson) ] .pull-right[ <img src="https://github.com/appliedepi/emory_training/blob/master/presentation/images/shapefiles.png?raw=true" width="80%" /> ] .footnote[ See the [EpiRHandbook section on GIS basics](https://epirhandbook.com/gis-basics.html) ] ??? Until a few years ago geospatial work in R was very painful - but with SF it has now become as easy as dataframes in tidyverse! --- # Import shapefile ```r # import shapefile shapefile <- read_sf( # define path to file here::here("case_study", "data", "covid_example_data", "covid_shapefile", "FultonCountyZipCodes.shp") ) ``` ``` ## Simple feature collection with 48 features and 4 fields ## Geometry type: MULTIPOLYGON ## Dimension: XY ## Bounding box: xmin: 2087952 ymin: 1274336 xmax: 2317492 ymax: 1522856 *## Projected CRS: NAD83 / Georgia West (ftUS) ## # A tibble: 48 x 5 ## ZipCode SHAPE_Leng SHAPE_Area Population geometry ## <chr> <dbl> <dbl> <dbl> <MULTIPOLYGON [US_survey_foot]> ## 1 30304 3578. 874668. 0 (((2228175 1327979, 2227998 132786~ ## 2 30098 6187. 2071539. 118 (((2297621 1477003, 2296953 147697~ ## 3 30291 123056. 271391984. 20835 (((2193467 1298536, 2192364 129806~ ## 4 30268 294091. 1729387442. 7729 (((2121821 1318505, 2122024 131756~ ## 5 30305 87773. 187624264. 29220 (((2233900 1400489, 2233914 139992~ ## 6 30213 375869. 1801881966. 39840 (((2151675 1332989, 2151274 133220~ ## 7 30092 17665. 2769444. 0 (((2278083 1456502, 2278075 145649~ ## 8 30024 33262. 30183784. 299 (((2300742 1476746, 2300594 147679~ ## 9 30097 124207. 336153806. 26791 (((2295375 1480030, 2295462 148000~ ## 10 30022 169357. 673419811. 69893 (((2286054 1475364, 2286025 147513~ ## # ... with 38 more rows ``` ??? We can see here that the CRS is specific to Georgia (so doesn't play well with GPS points) --- # Coordinate reference systems <iframe width="100%" height="676" frameborder="0" src="https://observablehq.com/embed/@d3/projection-transitions?cells=viewof+context"></iframe> .footnote[ Source: [Mike Bostock]("https://twitter.com/mbostock/status/1292887409474994176") ] --- # Changing the CRS ```r # change shapefile coordinate reference system to WGS84 shapefile <- st_transform(shapefile, crs = 4326) ``` ``` ## Simple feature collection with 48 features and 4 fields ## Geometry type: MULTIPOLYGON ## Dimension: XY ## Bounding box: xmin: -84.85116 ymin: 33.5025 xmax: -84.09764 ymax: 34.1862 *## Geodetic CRS: WGS 84 ## # A tibble: 48 x 5 ## ZipCode SHAPE_Leng SHAPE_Area Population geometry ## * <chr> <dbl> <dbl> <dbl> <MULTIPOLYGON [°]> ## 1 30304 3578. 874668. 0 (((-84.39147 33.65054, -84.39205 3~ ## 2 30098 6187. 2071539. 118 (((-84.16324 34.0603, -84.16545 34~ ## 3 30291 123056. 271391984. 20835 (((-84.50521 33.56937, -84.50882 3~ ## 4 30268 294091. 1729387442. 7729 (((-84.74079 33.62338, -84.7401 33~ ## 5 30305 87773. 187624264. 29220 (((-84.37313 33.84985, -84.37308 3~ ## 6 30213 375869. 1801881966. 39840 (((-84.64293 33.6636, -84.64424 33~ ## 7 30092 17665. 2769444. 0 (((-84.22771 34.00394, -84.22774 3~ ## 8 30024 33262. 30183784. 299 (((-84.15294 34.05959, -84.15342 3~ ## 9 30097 124207. 336153806. 26791 (((-84.17066 34.06862, -84.17037 3~ ## 10 30022 169357. 673419811. 69893 (((-84.20143 34.05579, -84.20153 3~ ## # ... with 38 more rows ``` ??? We can use sf::st_transform to re-project a spatial object We can use sf::st_crs to check (or set) the crs for a spatial object --- # Take a look! .pull-left[ ```r # open up a ggplot ggplot() + # add the shapefile on top geom_sf(data = shapefile, # no fill fill = NA, # black borders colour = "black") ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/plot_shp_show-1.png" width="504" /> ] --- # Set CRS for point coordinates ```r # define period of interest recent_period <- seq(as.Date("2021-06-30") - 13, as.Date("2021-06-30"), by = 1) # convert linelist data frame into sf object (with georeference points) linelist_sf <- linelist %>% # remove rows missing gps coordinates drop_na(lat, lon) %>% filter( # drop those with wrong GPS points lat >= 33 & lat <= 35, lon >= -85 & lon <= -84, # drop those outside the last 2 weeks date_report %in% recent_period ) %>% # create an sf object * st_as_sf( # define the coordinates based on lat/long variables * coords = c("lon", "lat"), # set the coordinate reference system to WGS84 * crs = 4326, # do not change string variables to factors * stringsAsFactors = FALSE ) ``` --- # Set CRS for point coordinates ``` ## Simple feature collection with 126 features and 4 fields ## Geometry type: POINT ## Dimension: XY ## Bounding box: xmin: -84.66011 ymin: 33.53836 xmax: -84.20438 ymax: 34.16152 ## Geodetic CRS: WGS 84 ## First 10 features: ## pid date_report age gender geometry ## 1 0de96c1a7bfc991b 2021-06-18 6 Female POINT (-84.38023 33.95349) ## 2 fb6aac67ab0dbfe3 2021-06-18 23 Female POINT (-84.36471 33.74956) ## 3 adc636708adbd668 2021-06-17 89 Female POINT (-84.36682 33.74679) ## 4 4987c88010a7b125 2021-06-20 65 Female POINT (-84.36018 34.02925) ## 5 9738fc3209ac189f 2021-06-21 14 Male POINT (-84.55165 33.60726) ## 6 8f03d5ffc0354218 2021-06-18 57 Female POINT (-84.5458 33.58735) ## 7 c46e8649039e2fa9 2021-06-17 27 Female POINT (-84.34898 33.81149) ## 8 ac79c1b0d6993a5c 2021-06-17 47 Male POINT (-84.37456 33.86814) ## 9 e305f709c166eaf5 2021-06-17 7 Female POINT (-84.49827 33.56734) ## 10 b834f3e6759d0915 2021-06-18 1 Male POINT (-84.37045 33.67755) ``` --- # Basemaps - A screenshot of the map area - Called tiles - Smaller images pieced together - Added for context (bottom layer of *ggplot*) - A lot of different tile servers - A lot of different R packages - We demonstrate [{ggspatial}](https://paleolimbot.github.io/ggspatial/) - Able to save tiles for offline use - Also have scalebars --- # Bounding boxes .center[ <img src="https://github.com/appliedepi/emory_training/blob/master/presentation/images/bounding_box.png?raw=true" width="70%" height="60%" /> ] --- # Download & plot basemap .pull-left[ ```r # get the bounding box for the shapefile bounding_box <- shapefile %>% st_bbox() # plot a base map including scale bar basemap <- ggplot() + # change the bounding box to an sf object # this defines the area to download map tiles for geom_sf(data = st_as_sfc(bounding_box)) + # download map tiles and add to the plot * annotation_map_tile( # define what map tiles to use type = "cartolight", # define folder to store tile images cachedir = here::here("data", "map_tiles"), # define if should download tiles each time forcedownload = FALSE, # hide messages about download status and zoom progress = "none" ) ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/basemap_show-1.png" width="504" /> ] --- # Points .pull-left[ ```r # plot the basemap points_map <- basemap + # add the shapefile on top with no fill and black borders geom_sf(data = shapefile, fill = NA, colour = "black") + # plot points from linelist coloured by ethnicity # (order the factor so that the most frequent group is plotted first) geom_sf(data = linelist_sf, mapping = aes(colour = fct_infreq(eth_race))) + # choose colour combination scale_colour_brewer(palette = "Set1") + # change the legend label labs(colour = "Ethnicity") + theme(legend.position = "bottom") ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/points_map_show-1.png" width="504" /> ] --- # Choropleths - prepare data ```r # get counts of points within polygons shapefile <- shapefile %>% # add a column to the shapefile with counts mutate( # see which points are in which zip code polygon * cases = st_intersects(., linelist_sf) %>% # count how many are in each * lengths() ) # calculate incidence in shapefiledata shapefile <- shapefile %>% mutate( # divide cases by population incidence = round(cases / Population * 100000, digits = 1), # clean up calculations incidence = case_when( # fix the outliers: set infinity to NA and cases less than 10 to NA cases < 5 ~ NA_real_, is.infinite(incidence) ~ NA_real_, ## nb. infinite due to zero denominator TRUE ~ incidence) ) ``` --- # Choropleths - prepare data ```r # define breaks (for plotting groups) breakers <- shapefile %>% # change shapefile to a tibble (otherwise geometry pulled with) as_tibble() %>% # only keep zips with more than ten cases filter(cases > 5) %>% # pull the incidence column select(incidence) %>% # define grouping cut-offs based on quartiles of observations quantile(probs = seq(0, 1, 0.25), na.rm = TRUE) # create a factor variable for incidence shapefile <- shapefile %>% * mutate( # define groups using the cut function incidence_cats = cut(incidence, # cut-offs as defined above (including zero) breaks = c(0, breakers), # add labels by using the cut-offs labels = str_glue("<={breakers}")) ) ``` --- # Choropleths - plot data .pull-left[ ```r # plot basemap choropleth <- basemap + # add in shapefile (with black borders) colour by incidence categories geom_sf(data = shapefile, aes(fill = incidence_cats), colour = "black") + # define colour scheme scale_fill_brewer(palette = "YlGn", # edit legend labels to the categories defined # rename the missing group to be fewer than 10 cases labels = c( levels(shapefile$incidence_cats), "Fewer than 5 cases")) + # add in text labels for each zip code geom_sf_text(data = shapefile, aes(label = ZipCode), check_overlap = TRUE) + # change the legend title labs(fill = "Confirmed cases per \n 100,000 by Zip code") ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/choropleths_plot_show-1.png" width="504" /> ] --- # Bivariate choropleths - Placeholder for demo > see case study [weekly_report.Rmd](https://github.com/appliedepi/emory_training/blob/master/case_study/weekly_report.Rmd#L1894) for details .center[ <img src="slides_ggplot_files/figure-html/bivarmap_demo-1.png" width="504" /> ] --- # Heatmaps - Three dimensional distribution over each point and then interpolated ![](https://github.com/appliedepi/emory_training/blob/master/presentation/images/heatmaps.png?raw=true) --- # Heatmaps - distance - How many x-units in one kilometre? .pull-left[ ![](https://github.com/appliedepi/emory_training/blob/master/presentation/images/axis_distance.png?raw=true) ] .pull-right[ `$$x_{units} = \frac{x_{max}-x_{min}}{x_{metres}}$$` ] --- # Heatmaps - calculating ```r # define x and y minimum/maximum coordinates based on bounding box of base map xmin <- st_point(c(bounding_box$xmin, bounding_box$ymin)) xmax <- st_point(c(bounding_box$xmax, bounding_box$ymin)) ymin <- xmin ymax <- st_point(c(bounding_box$xmin, bounding_box$ymax)) # create sf points (with coordinate reference system WGS84) for the x and y axis xaxis <- st_sfc(xmin, xmax, crs = 4326) yaxis <- st_sfc(ymin, ymax, crs = 4326) # calculate the distance in metres on the axes of your based on longitude and latitude # i.e. how many metres on your x axis and how many on your y *xdist <- st_distance(xaxis)[2] %>% units::set_units("miles") *ydist <- st_distance(yaxis)[2] %>% units::set_units("miles") # calculate how many lat/long units there are on each axis *xunits <- st_distance(xmin, xmax) *yunits <- st_distance(ymin, ymax) # divide the difference in latitude or longitude by the corresponding distance # returns the number of units per distance of interest (i.e. coord units per mile) *xfact <- as.numeric(xunits / xdist) *yfact <- as.numeric(yunits / ydist) ``` --- # Heatmaps - plotting .pull-left[ ```r heatmap <- basemap + # add in shapefile not filled with black border geom_sf(data = shapefile, fill = NA, colour = "black") + # add in a density layer stat_density_2d( # using the sf formatted linelist data = linelist_sf, aes( # extract the lat lon from geometry (list column) x = purrr::map_dbl(geometry, ~.[1]), y = purrr::map_dbl(geometry, ~.[2]), # fill based on the calculated density fill = after_stat(level) ), # define the shape to use for smoothing geom = "polygon", # define whether to show legend (removed as uninformative) show.legend = FALSE, # bandwith (distance squared) h = c(xfact, yfact) ) + # define fill palette (viridis continuous) scale_fill_viridis_c(option = "C") ``` ] .pull-right[ <img src="slides_ggplot_files/figure-html/heatmaps_plotting_show-1.png" width="504" /> ] --- class: center, middle # Thank you .footnote[ Slides created via the [{xaringan}](https://github.com/yihui/xaringan) R package ]