class: center, middle, inverse, title-slide

.title[
# Advanced statistics in R
]
.subtitle[
## Exploring the data and basic hypothesis testing
]
.author[
###
]
.date[
###
contact@appliedepi.org
]

---

<style type="text/css">
.remark-slide table{
  border: none
}
.remark-slide-table {
}
tr:first-child {
  border-top: none;
}
tr:last-child {
  border-bottom: none;
}
.center2 {
  margin: 0;
  position: absolute;
  top: 50%;
  left: 50%;
}
</style>

# Thank you for joining us

Brief introductions from the instructors

**Thank you for your service** *to your community in these busy times for public health. We are glad that you are taking the time to learn R with us.*

???

Do a brief round of introductions

---

# Why use R for statistical analysis?

## Efficient

- Once you have written the code, there is no need to point, click and redo the analysis when the data are updated: you can reuse your old code!

Which means that R is...

<img src="../../images/welcome/efficient_image.PNG" width="50%" />

???

Mention that R is a marketable skill as well as technically useful for epidemic response

---

# Why use R for statistical analysis?

## Reproducible

We can re-run the same code and get the same outcome, or use new data and see how findings have changed.

Because it is written as a script it is reproducible, and...

<img src="../../images/welcome/clone_image.jpg" width="50%" />

---

# Why use R for statistical analysis?

## Shareable

You can share scripts and other people can produce the same findings and results as you.

This makes it easy to produce automated reports and share the work between people.

<img src="../../images/welcome/shareable_image.PNG" width="50%" />

---

# Why use R for statistical analysis?

.pull-left[

## Easy (once you learn the syntax!)

We know that R can be scary (it was for us too when we first learned it!), but once you learn the basics, and how to apply them to statistics within R, you can very quickly produce in-depth analyses of your data in an efficient, effective and reproducible way!
]

.pull-right[

<img src="../../images/welcome/easy_image.jpg" width="100%" />

]

---

# Learning curve

R has become easier to learn in the last 5 years

.pull-left[

- Friendlier user interface (RStudio)
- Simpler syntax ("tidyverse")
- Free resources & interactive tutorials available
  - **Epidemiologist R Handbook**
  - **R 4 Data Science**
  - **R epidemiology case studies**

]

.pull-right[

<img src="course_intro_files/figure-html/difficulty_plot-1.png" width="504" />

]

???

- There is a bit of a learning curve - as with any software.
- But we are here to get you over that initial hump so you can keep developing.

---

# User community

<img src="../../images/welcome/user_community.png" width="125%" />

???

- **The internet is your friend.**
- Once you get to grips with the basics - people answer literally any question on **stackoverflow!**
- **Epi R Handbook**

---

# What statistical analysis can we carry out in R?

.pull-left[

All of them!

R is a statistical programming language with a wealth of inbuilt functions and external packages to carry out any analysis you can imagine.

The difficult part is knowing which packages to use and where!

- That's what this course is here to help with

]

.pull-right[

<img src="../../images/welcome/statistics_image.jpg" width="100%" />

]

---

# This course is about translating your statistics knowledge, not teaching statistics

Teaching statistics and R at the same time would require far longer than we have for a short course.

We will be focusing on taking your existing knowledge and showing you how to apply this to R.

---

# What we will be covering

- Exploring your data and basic hypothesis testing.
- Univariate, stratified and multivariable regression (with a focus on logistic regression).
- How to make publication-ready tables and plots to communicate your regression results.
- R Markdown: combining your new knowledge with R Markdown to generate a document detailing your statistical analysis.

---

# Learning outcomes

By the end of this course, you should be able to:

- Generate histograms of your data and explore the correlation between variables
- Carry out the appropriate statistical tests needed to analyse your data and produce publication quality tables
- Undertake univariate and stratified regression, interpret the results and produce figures
- Understand how to carry out variable selection and run multivariable regression analysis, with and without interaction terms and random effects

---

# The data analysis pipeline

<br>
<br>
<br>
<br>

.center[
<img src="../../images/regression/data_analysis_pipeline_1.PNG" width="125%" />
]

---

# The data analysis pipeline

<br>
<br>
<br>

.center[
<img src="../../images/regression/data_analysis_pipeline_2.PNG" width="125%" />
]

---

# Importance of exploring the data first

.pull-left[

We cannot overstate the importance of understanding your data *before* carrying out more advanced statistical analysis.

Without understanding the data, its distributions and relations between variables, we may use the wrong statistical tool, make the wrong assumptions, and derive the wrong conclusions.

Look out for lurking danger!

]

.pull-right[

<img src="../../images/welcome/crocodile_swamp.PNG" width="100%" />

]

---

<img src="course_intro_files/figure-html/unnamed-chunk-13-1.png" width="75%" />

---

<img src="course_intro_files/figure-html/unnamed-chunk-14-1.png" width="75%" />

---

<img src="course_intro_files/figure-html/unnamed-chunk-15-1.png" width="75%" />

---

<img src="course_intro_files/figure-html/unnamed-chunk-16-1.png" width="75%" />

---

# Basic hypothesis testing

What do we mean by basic hypothesis testing?

- We use this term to indicate the use of a test that explores the relationship between an outcome and an exposure and returns an indication (a p-value) of whether or not there is a relationship.
- This includes t-tests, Shapiro-Wilk tests, Wilcoxon rank sum tests, Kruskal-Wallis, Chi-squared, etc.

These are often a useful first step in statistically evaluating the data and set the stage for further investigating relationships between outcomes and variables.

---

# gtsummary

There are many different ways of carrying out statistical tests in R, and a variety of pros and cons for deciding between packages (and base R).

However, in this course we will be focussing on the package **gtsummary** as it allows us to carry out these tests with a few simple commands, and produces publication-ready tables with ease.

---

# How do we carry out and incorporate statistical tests in R?

Once we know what we want to do, we can carry out these operations with the help of the handy pipe operator `%>%`!

It is simply a case of adding in the functions (here `tbl_summary()` and `add_p()`) and you're ready to go

``` r
linelist %>% 
```

---

# How do we carry out and incorporate statistical tests in R?

Once we know what we want to do, we can carry out these operations with the help of the handy pipe operator `%>%`!

It is simply a case of adding in the functions (here `tbl_summary()` and `add_p()`) and you're ready to go

``` r
linelist %>% 
  select(outcome, vomit, cough) %>% 
```

---

# How do we carry out and incorporate statistical tests in R?

Once we know what we want to do, we can carry out these operations with the help of the handy pipe operator `%>%`!

It is simply a case of adding in the functions (here `tbl_summary()` and `add_p()`) and you're ready to go

``` r
linelist %>% 
  select(outcome, vomit, cough) %>% 
  tbl_summary(by = outcome) %>% 
```

---

# How do we carry out and incorporate statistical tests in R?

Once we know what we want to do, we can carry out these operations with the help of the handy pipe operator `%>%`!

It is simply a case of adding in the functions (here `tbl_summary()` and `add_p()`) and you're ready to go

``` r
linelist %>% 
  select(outcome, vomit, cough) %>% 
  tbl_summary(by = outcome) %>% 
  add_p()
```

---

# gtsummary

For these simple statistical tests, we can use the functions `tbl_summary()` and `add_p()` to customise the data we input, the statistics we want output, and the test we want carried out, all in a few short lines.

Here we select a few columns and explore their relationship with outcome

``` r
linelist %>% 
  select(outcome, vomit, cough) %>% 
  tbl_summary(by = outcome) %>% 
  add_p()
```
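
| Characteristic | Death, N = 242<sup>1</sup> | Recover, N = 186<sup>1</sup> | p-value<sup>2</sup> |
|---|---|---|---|
| vomit | 126 (55%) | 94 (53%) | 0.8 |
| Unknown | 11 | 9 | |
| cough | 205 (89%) | 143 (81%) | 0.025 |
| Unknown | 11 | 9 | |

<sup>1</sup> n (%)

<sup>2</sup> Pearson’s Chi-squared test
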
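
The p-value for `cough` can be checked by hand. This is only a sketch: the counts below are back-calculated from the table above, assuming the `Unknown` rows are dropped before testing, and it assumes the footnoted test is Pearson's Chi-squared test without continuity correction (`correct = FALSE` in base R's `chisq.test()`). The object names `cough_tab` and `res` are ours, not part of the course code.

```r
# 2x2 table rebuilt from the slide, known values only (N minus Unknown):
#   Death:   242 - 11 = 231 known, 205 with cough
#   Recover: 186 -  9 = 177 known, 143 with cough
cough_tab <- matrix(
  c(205, 231 - 205,   # Death column: cough yes, cough no
    143, 177 - 143),  # Recover column: cough yes, cough no
  nrow = 2,
  dimnames = list(cough = c("yes", "no"), outcome = c("Death", "Recover"))
)
cough_tab

# Pearson's Chi-squared test without Yates' continuity correction
# (assumed to be what the table footnote refers to)
res <- chisq.test(cough_tab, correct = FALSE)
res$p.value  # should be close to the 0.025 reported in the table
```

Running the test directly like this is a quick way to confirm which test, and which rows, a summary table is actually using.
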
Remember we can use `?tbl_summary` to explore the full range of available inputs.

---

# gtsummary

The default test depends on the type of variable: for categorical variables such as `vomit` and `cough` it is the Chi-squared test. If we want to carry out another test, we simply update the `add_p()` section.

``` r
linelist %>% 
  select(outcome, wt_kg) %>% 
  tbl_summary(by = outcome) %>% 
  add_p(wt_kg ~ "t.test")
```
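
| Characteristic | Death, N = 242<sup>1</sup> | Recover, N = 186<sup>1</sup> | p-value<sup>2</sup> |
|---|---|---|---|
| wt_kg | 55 (42, 63) | 55 (41, 65) | 0.7 |
| Unknown | 3 | 0 | |

<sup>1</sup> Median (Q1, Q3)

<sup>2</sup> Welch Two Sample t-test
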
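
The `"t.test"` option corresponds to base R's `t.test()`, which runs Welch's unequal-variance version by default, hence the footnote. Below is a minimal sketch on simulated weights: the data frame `toy` and its values are invented for illustration and are not the course `linelist`. Note that the table above still displays the median (Q1, Q3), while the t-test itself compares means.

```r
# Simulated stand-in data (hypothetical values, for illustration only)
set.seed(42)
toy <- data.frame(
  outcome = rep(c("Death", "Recover"), each = 50),
  wt_kg   = c(rnorm(50, mean = 52, sd = 17), rnorm(50, mean = 53, sd = 18))
)

# Welch two-sample t-test: var.equal = FALSE is the default in t.test()
welch <- t.test(wt_kg ~ outcome, data = toy)
welch$method   # the label that appears in the table footnote
welch$p.value

# Rank-based alternative if the weights are clearly not normally distributed
rank_test <- wilcox.test(wt_kg ~ outcome, data = toy)
rank_test$p.value
```
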
For a full list of tests please see `?add_p.tbl_summary`

---

# Adding in descriptive statistics

We can then further update the table by changing the descriptive statistics it displays. We do this with the `statistic` argument of `tbl_summary()`.

``` r
linelist %>% 
  select(outcome, wt_kg) %>% 
  tbl_summary(
    statistic = wt_kg ~ "{mean} ({sd})",
    by = outcome
  ) %>% 
  add_p(wt_kg ~ "t.test")
```
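
| Characteristic | Death, N = 242<sup>1</sup> | Recover, N = 186<sup>1</sup> | p-value<sup>2</sup> |
|---|---|---|---|
| wt_kg | 52 (17) | 53 (18) | 0.7 |
| Unknown | 3 | 0 | |

<sup>1</sup> Mean (SD)

<sup>2</sup> Welch Two Sample t-test
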
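
To try the same pattern without the course data, here is a sketch using `trial`, an example dataset that ships with **gtsummary**, with `trt` standing in for `outcome` and `age` for `wt_kg`. It assumes the dplyr and gtsummary packages are installed; depending on the gtsummary version, `add_p()` may also ask for additional helper packages before it can run the test.

```r
library(dplyr)
library(gtsummary)

# `trial` is gtsummary's built-in example dataset (not the course linelist)
tbl <- trial %>%
  select(trt, age) %>%
  tbl_summary(
    by = trt,                          # one column per treatment group
    statistic = age ~ "{mean} ({sd})"  # show mean (SD) instead of the default
  ) %>%
  add_p(test = age ~ "t.test")         # request a t-test for age

tbl
```

Naming the `test =` argument explicitly does the same thing as passing the formula positionally, and makes it easier to add further arguments later.
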
---

# Ready to try it out? Any questions?

**Resources**

- Course website (initial setup and slides access): [https://courses.appliedepi.org/statsr/](https://courses.appliedepi.org/statsr/)
- [Epi R Handbook](https://epirhandbook.com/en/): 50 chapters of best-practice code examples available online and offline
- [Applied Epi Community](https://community.appliedepi.org/): a great resource for asking questions and getting help!