Adding data
adding-data.Rmd
Guide to contributing new datasets
Package folder structure
Below is a simplified explanation of the data chapter of R Packages. For a fuller understanding, read that chapter.
The following package folders are important:
- `data`: R datasets go in the `data` folder.
- `inst/extdata`: Non-R datasets go in the `inst/extdata` folder.
- internal data: When you build a package, the `.rda` datasets (from the `data` folder) can become "internal" (more efficient for file storage). These are accessed by calling `package::dataset` (e.g. `appliedepidata::AJS_AmTiman`). They can also be imported directly from GitHub using a link to the file in the `data` folder (e.g. with {rio}), or with the `appliedepidata::get_data` or `appliedepidata::save_data` functions.
- `data-raw`: Contains R scripts used for creating the exported or internal data (e.g. if you have edited a dataset or used {usethis} to internalise the dataset).
- sysdata: Not relevant for the current package setup. In some setups you are supposed to put tableoftables in sysdata (i.e. just for package usage). However, for our current setup, leave it in `extdata`.
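As a rough illustration of the folder layout above, the sketch below reads an R dataset with `package::dataset` and locates the non-R files with `system.file()`. It assumes {appliedepidata} may not be installed, so the dataset access is guarded; the variable names (`has_pkg`, `ajs`, `extdata_dir`, `extdata_files`) are illustrative only.

```r
# Guard: only touch the package if it is installed locally.
has_pkg <- requireNamespace("appliedepidata", quietly = TRUE)

if (has_pkg) {
  # R dataset from the data folder, accessed as package::dataset.
  ajs <- appliedepidata::AJS_AmTiman
  str(ajs, max.level = 1)
}

# inst/extdata in the source tree becomes extdata in the installed package.
# system.file() returns "" when the package or folder cannot be found.
extdata_dir <- system.file("extdata", package = "appliedepidata")

# Names of the non-R files shipped with the package (empty if not installed).
extdata_files <- if (nzchar(extdata_dir)) list.files(extdata_dir) else character(0)
extdata_files
```

Note that `system.file()` looks in the installed package, so a newly added file only shows up after the package has been reinstalled or loaded with a development tool.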
Adding a file
This describes the process for adding a file to the repo. Note that the processes for adding a non-R file (any file that is not `.rda`) and an R file (any file already in `.rda` format) are slightly different. If you are adding a dataset from an existing R package, you can skip to step 3 below.
1. Name your file appropriately
   - You can name it whatever you want, but stick to basic naming conventions.
   - Ensure that there is not already a file with the same name in `tableoftables.xlsx`.
   - Avoid generic names like `linelist_cleaned.xlsx` or `survey_data.xlsx`.
   - Use consistent and descriptive names without spaces (e.g., `AJS_AmTiman`, `sitrep_mortality_survey`).
2. Place your file in the correct folder
   - A non-R file (e.g. `xlsx`, `shp`, `zip`) goes in the `inst/extdata` folder.
     - If adding a shapefile, zip it first.
   - An R file (e.g. `rda`, `rds`) goes in the `data` folder.
3. Reproducibly edit the dataset and internalise it (see `data-raw/AJS_AmTiman.R` for an example)
   - In your console, run `usethis::use_data_raw("<name of your file without extension>")`.
   - This creates an R script in the `data-raw` folder.
   - Read in the file by defining the path with `system.file`.
     - If you are editing a file already in the package (e.g. shortening the Ebola linelist for a course), make sure you read in the original dataset here. Document this properly with {roxygen} and in the metadata as described below.
   - Make any edits necessary to your dataset in a reproducible way.
   - Save and internalise the dataset with `usethis::use_data()`.
4. Add documentation for each dataset added
   - This is done in an R script in the `R` folder.
   - Name the script something that will allow reviewers to find it (e.g. `AJS_chad`) and suffix it with `_doc` so that it can be differentiated from functions.
   - Place all the documentation for datasets in that group within the same script.
   - Clearly document the source and license for the dataset.
   - Add an explanation for each variable. If you have a data dictionary, you can use `appliedepidata::create_desc()` to help with this.
     - You could also create a data dictionary for use with this function; see the data dictionary walk-through.
5. Add the datasets to `_pkgdown.yml`
   - Group relevant datasets under the same subtitle (suffix with the language).
   - The names here correspond to the name in quotations at the end of your documentation script from step 4 above, as well as the name of the file (without file extension).
6. Add the dataset to `tablesoftables.xlsx` as described below.
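The edit-and-internalise step above can be sketched as a `data-raw` script along the following lines. This is a minimal sketch, not the package's own script (see `data-raw/AJS_AmTiman.R` for that): the file name `ebola_linelist.xlsx`, the object name `ebola_linelist_short`, and the helper `shorten_linelist()` are all hypothetical, and the script assumes {rio} and {usethis} are installed and that it is run from the package root.

```r
# Hypothetical data-raw/ebola_linelist_short.R, as created by
# usethis::use_data_raw("ebola_linelist_short")

# Reproducible edit (hypothetical): keep only the first `n` rows.
shorten_linelist <- function(df, n = 100) {
  head(df, n)
}

# Path to the original file in the installed package. system.file()
# returns "" when the package or file cannot be found.
raw_path <- system.file(
  "extdata", "ebola_linelist.xlsx",  # hypothetical file name
  package = "appliedepidata"
)

# Only proceed when the file and the helper packages are available.
can_run <- nzchar(raw_path) &&
  requireNamespace("rio", quietly = TRUE) &&
  requireNamespace("usethis", quietly = TRUE)

if (can_run) {
  # Read the original dataset, then apply the edit.
  ebola_linelist_short <- shorten_linelist(rio::import(raw_path))

  # Saves data/ebola_linelist_short.rda; must be run inside the package project.
  usethis::use_data(ebola_linelist_short, overwrite = TRUE)
}
```

Keeping the edit in a small function makes it easy to rerun the script whenever the original file changes, which is the point of storing it in `data-raw`.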
Defining dataset metadata (adding to `tablesoftables.xlsx`)
Below is an explanation of how to fill in each variable in the dataset metadata Excel sheet (`tablesoftables.xlsx`). This guide helps ensure consistency and completeness when adding new datasets to your collection.
- `name`: The filename of the dataset as it appears in the `inst/extdata` directory, without the file extension. This should be unique within the dataset group, and ideally also within the tableoftables (i.e. avoid generic names like `linelist_cleaned.xlsx` or `survey_data.xlsx`). Use consistent and descriptive names without spaces (e.g., `AJS_AmTiman`, `mortality_survey`).
- `type`: The category or type of the dataset (e.g., `linelist`, `population`, `shape`, `survey`, `dictionary`).
- `extension`: The file extension (e.g., `xlsx`, `zip`).
- `type_version`: Used to identify the original dataset within its group. Increment when a completely different dataset of the same type is added (e.g. a second, unrelated linelist in one group), not when an existing dataset is edited.
- `data_version`: Used to identify the original dataset and its associated child data. Increment when the format, variables, or rows change. Ensure you document changes in the appropriate `data-raw` file.
- `language`: Language code using ISO 639-1 codes (e.g., `en`, `fr`).
- `country`: Country code using ISO 3166-1 alpha-3 codes (e.g., `tcd`).
- `scale`: Geographic scale (e.g., `subnational`, `national`, or `international`).
- `subject`: Main subject of the dataset (e.g., `acute jaundice syndrome`).
- `context`: Context of the data (e.g., `outbreak`, `survey`).
- `fictional`: Is the dataset fictional (`yes`) or real (`no`)?
- `year`: Year the data was collected (e.g., `2016`). This is the earliest year in the dataset.
- `description`: Brief description of the dataset. Ideally, copy from the roxygen documentation.
- `usage`: Intended usage (e.g., `{sitrep} walkthroughs`, `training`).
- `license`: License for the dataset (e.g., `gpl3`, `mit`).
- `group_identifier`: DO NOT EDIT - Created by a concatenating function in Excel. High-level identifier combining `subject`, `context`, `country`, and `year` (e.g., `acute_jaundice_syndrome_outbreak_tcd_2016`).
- `unique_identifier`: DO NOT EDIT - Combines `group_identifier`, `type`, `type_version`, `data_version`, `context`, and `year` to create a unique identifier (e.g. `acute_jaundice_syndrome_outbreak_tcd_2016_linelist_1`).
For example, when adding an Ebola dataset, you would enter the information as shown below. The original dataset (whether it's from {outbreaks} or another source) would be considered `type_version` 1. If it's the only linelist in its group, it remains `type_version` 1. If a completely different linelist is added (not just an edited version), increment the `type_version` accordingly.
For any changes to the data (such as cleaning or changing the number of rows or columns), increment the `data_version` (e.g., `data_version` 2), but keep the `type_version` the same to indicate that it's a derivative (or "child") of the original. Each child dataset gets its own entry.
If a dataset is translated into a different language, create a new entry for the translated version while keeping the `data_version` and `type_version` the same, but editing the `language` column accordingly. This ensures you can trace back the parent-child relationship between datasets.
Variable | Example Entry |
---|---|
name | ebola_linelist_cleaned |
type | linelist |
extension | xlsx |
type_version | 1 |
data_version | 1 |
language | en |
country | lbr |
scale | national |
subject | ebola |
context | outbreak |
fictional | yes |
year | 2014 |
description | Linelist data from the Ebola virus disease outbreak in Liberia in 2014. |
usage | introexercises, etc. |
license | gpl3 |
group_identifier | ebola_outbreak_lbr_2014 |
unique_identifier | ebola_outbreak_lbr_2014_linelist_1_1_outbreak_2014 |
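To make the two identifier columns easier to check by eye, the sketch below rebuilds them in R from the other columns, following the order given in the variable descriptions. It is only an illustration: the real values come from the Excel formula in the workbook, which is not shown here, and the function names `make_group_identifier()` and `make_unique_identifier()` are hypothetical. Replacing spaces with underscores is an assumption based on the `acute_jaundice_syndrome` example.

```r
# Hypothetical helper: subject, context, country and year joined by "_".
make_group_identifier <- function(subject, context, country, year) {
  parts <- c(subject, context, country, year)
  # Assumption: spaces in multi-word values become underscores.
  gsub(" ", "_", paste(parts, collapse = "_"), fixed = TRUE)
}

# Hypothetical helper: group identifier followed by type, versions,
# context and year, in the order given in the variable descriptions.
make_unique_identifier <- function(group_identifier, type, type_version,
                                   data_version, context, year) {
  paste(group_identifier, type, type_version, data_version, context, year,
        sep = "_")
}

# Values taken from the example table above.
group_id <- make_group_identifier("ebola", "outbreak", "lbr", 2014)
unique_id <- make_unique_identifier(group_id, "linelist", 1, 1, "outbreak", 2014)

group_id   # "ebola_outbreak_lbr_2014"
unique_id  # "ebola_outbreak_lbr_2014_linelist_1_1_outbreak_2014"
```

A check like this can be handy when reviewing a new row, since a mismatch suggests that one of the source columns was mistyped or that the formula cell was overwritten.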