Adding data
adding-data.Rmd
Guide to contributing new datasets
PACKAGE FOLDER STRUCTURE
Overview
The diagram and descriptions below are a simplified explanation of the R packages data chapter. For a fuller understanding, read that chapter.
Folders you will edit
-
inst/extdata: Newly added datasets go in here in
their original filetype, e.g. .xlsx, .csv, but also .rds files. These
files will be downloaded with the
save_data()
function.
-
data-raw: Contains R scripts used for processing
and internalizing datasets (making them more efficient for storage),
i.e. for turning newly added datasets in
inst/extdata
to .rda files - R: Contains R scripts in Roxygen2 format which define the functions and datasets. When run this creates the R documentation files (.Rd files) which are saved in the man folder. Multiple datasets can be defined in one script, which is sensible as some are grouped together (e.g. part of the same outbreak)
- vignettes: Contains detailed instructions on how to use particular functions. They complement the function-level documentation in the man folder by giving broader explanation.
Folders you will not edit
These contain files outputted by running the code in the folders above.
- data: Contains internalised and processed R datasets (.rda files). These are created when internalizing the data (running code in data-raw folder), which makes them more efficient for file storage.
- man: Contains the R documentation files for package functions and datasets (created when running the code in the R folder). Each .Rd file has information about a specific function or dataset.
- tests: Contains test files for developer use only.
ADDING DATA
This describes the process for adding a file to the repo. Note that any original datasets need to be added to the inst/exta folder, even if already in .rda format.
If you are adding a dataset from an existing R package, you can skip to step 3 below.
-
Name your file appropriately
A. You can name it whatever you want, but stick to basic naming conventions.
B. Ensure that there is not already file in tableoftables.xlsx named the same.
C. Avoid generic names like:
linelist_cleaned.xlsx
orsurvey_data.xlsx
.D. Use consistent and descriptive names without spaces (e.g.,
AJS_AmTiman
,sitrep_mortality_survey
).E. Name files from the same group (e.g. from the same outbreak or for the same case study) with the same prefix. E.g:
examplename_data
andexamplename_population
for a case linelist and corresponding denominator table respectively. -
Place your file in the correct folder
A. Add into
inst/extdata
folder. This is what will be downloaded when running the save_data() function. If a shapefile, zip first. -
Build the package with the added data
A. Press Ctrl + Shift + B to build the package with the newly added data. This will mean the data is recognized for the next step.
-
Reproducibly edit dataset and internalize (see
data-raw/AJS_AmTiman.R
for example)A. In your console run
usethis::use_data_raw("<name of your file without extension>")
.B. This creates an R script in the
data-raw
folder. It will already contain a comment at the top saying “code to preparedata goes here” and a usethis::use_data()
function to internalise the dataset (i.e. to produce the rda file)C. Edit the R script to correctly read in the file using
system.file()
, as below. Necessary edits to the dataset should also go here. Then run this script to internalise the data, and close.D. NOTE that if you want to process the raw data to create a dataset that should also be accessible to users (e.g. via the
get_data()
orsave_data()
functions), make sure that the processed data also gets saved into inst/extdata as version 2. Make sure you internalise (with use_data) both versions of the data. Add the dataset to the
tablesoftables.xlsx
as described below.-
Add documentation for each dataset added
A. Create a new R script in the
R
folder, with one file for all datasets in a group. The file name should be the group name generated from tableoftables (column P), with suffix ‘_doc’. Note the easiest is to copy and edit an existing R script in the folder.B. Ensure to clearly document the source and license for the dataset.
C. Ensure to put the correct name of the dataset at the bottom of each dataset description (under @docType). E.g.
examplename_linelist
.D. Run devtools::document() to create the actual R documentation. This will be an .Rd file in the
man
folder, with file name corresponding with the dataset name (e.g.examplename_linelist.Rd
) -
Add the datasets to
_pkgdown.yml
A. Subtitle: Describe the group of linelists. State the year in brackets and language at the end
B. Contents: List the datasets. Again, the names here correspond to the name of the dataset without extension, e.g “examplename_linelist”.
C. Make sure to have correct indentation and use of dashes. See prior examples in the file.
ADDING DATASET METADATA (adding to
tablesoftables.xlsx
)
Below is a table explaining how to fill in each variable in the
dataset metadata Excel sheet (tablesoftables.xlsx
). This
guide helps ensure consistency and completeness when adding new datasets
to your collection.
name: The filename of the dataset as it appears in the
inst/extdata
directory, without the file extension. This should be unique within the dataset group, and ideally also within the tableoftables (i.e. avoid generic names like:linelist_cleaned.xlsx
orsurvey_data.xlsx
). Use consistent and descriptive names without spaces (e.g.,AJS_AmTiman
,mortality_survey
).type: The category or type of the dataset (e.g.,
linelist
,population
,shape
,survey
,dictionary
).extension: The file extension (e.g.,
xlsx
,zip
).type_version: Used to identify the original dataset and its associated child data. Increment when format or variables change. If there are multiple linelists in one group, this would increment with the type.
data_version: Used to identify the original dataset and its associated child data. Increment when format or variables change. Ensure you document changes in the appropriate ‘data-raw’ file.
language: Language code using ISO 639-1 codes (e.g.,
en
,fr
).country: Country code using ISO 3166-1 alpha-3 codes (e.g.,
tcd
).scale: Geographic scale (e.g.,
subnational
,national
, orinternational
).subject: Main subject of the dataset (e.g.,
acute jaundice syndrome
).context: Context of the data (e.g.,
outbreak
,survey
).fictional: Is the dataset fictional (
yes
) or real (no
)?year: Year the data was collected (e.g.,
2016
). This is the earliest year in the dataset (even if fictional).description: Brief description of the dataset. Ideally, copy from roxygen documentation.
usage: Intended usage (e.g.,
{sitrep} walkthroughs
,training
).license: License for dataset (e.g.,
gpl3
,mit
).group_identifier: DO NOT EDIT - Created by concatinating function in excel. High-level identifier combining
subject
,context
,country
, andyear
(e.g.,acute_jaundice_syndrome_outbreak_tcd_2016
).unique_identifier: DO NOT EDIT - Combines
group_identifier
,type
,type_version
,data_version
,context
, andyear
to create a unique identifier (e.g.acute_jaundice_syndrome_outbreak_tcd_2016_linelist_1
).
For example, when adding an Ebola dataset, you would enter the
information as shown below. The original dataset (whether it’s from
{outbreaks} or another source) would be considered
type_version
1. If it’s the only linelist in its group, it
remains type_version
1. If a completely different linelist
is added (not just an edited version), increment the
type_version
accordingly.
For any changes to the data (such as cleaning or changing nums of
rows or columns), increment the data_version
(e.g.,
data_version
2), but the type_version
remains
the same to indicate that it’s a derivative (or “child”) of the
original. Each child dataset gets its own entry.
If a dataset is translated into a different language, create a new
entry for the translated version while keeping the
data_version
and type_version
the same, but
editing the language
column accordingly. This ensures you
can trace back the parent-child relationship between datasets.
Variable | Example Entry |
---|---|
name | ebola_linelist_cleaned |
type | linelist |
extension | xlsx |
type_version | 1 |
data_version | 1 |
language | en |
country | lbr |
scale | national |
subject | ebola |
context | outbreak |
fictional | yes |
year | 2014 |
description | Linelist data from the Ebola virus |
disease outbreak in Liberia in | |
2014. | |
usage |
introexercises , etc. |
license | gpl3 |
group_identifier | ebola_outbreak_lbr_2014 |
unique_identifier | ebola_outbreak_lbr_2014_linelist_1_1_outbreak_2014 |