Guide to contributing new datasets

Package folder structure

Below is a simplified explanation of the data chapter of the R Packages book. For a fuller understanding, read that chapter.

  • The following package folders are important:
    • data: R datasets (.rda) go in the data folder.
    • inst/extdata: Non-R datasets go in the inst/extdata folder.
    • internal data: When you build a package, the .rda datasets (from the data folder) can become “internal” (more efficient for file storage). These are accessed by calling package::dataset (e.g. appliedepidata::AJS_AmTiman). They can also be imported directly from GitHub using a link to the file in the data folder (e.g. with rio::import()), or with the appliedepidata::get_data or appliedepidata::save_data functions.
    • data-raw: Contains the R scripts used to create the exported or internal data (e.g. if you have edited a dataset or used {usethis} to internalise the dataset).
    • sysdata: Not relevant for the current package setup. In some setups tableoftables would go in sysdata (i.e. for use by the package only). In our current setup, leave it in extdata.
.
├── appliedepidata.Rproj
├── _pkgdown.yml
├── data
│   └── newdata.rda
├── data-raw
│   └── newdata.R
├── inst
│   └── extdata
│       ├── tableoftables.xlsx
│       └── newdata.xlsx
├── R
│   └── newdata_doc.R
└── man
    └── newdata.Rd
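
As a quick sanity check before contributing, the layout above can be verified from R. The sketch below is illustrative only: missing_folders() is a hypothetical helper (it is not part of {appliedepidata}), and the list of expected folders is simply taken from the bullets above.

```r
# Hypothetical helper: report which of the expected package folders are
# missing under a given package root.
missing_folders <- function(pkg_root = ".") {
  expected <- c("data", "data-raw", file.path("inst", "extdata"), "R", "man")
  present <- dir.exists(file.path(pkg_root, expected))
  expected[!present]
}

# Try it on a throwaway folder tree that only has data and inst/extdata.
demo_root <- file.path(tempdir(), "demo_pkg")
dir.create(file.path(demo_root, "inst", "extdata"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_root, "data"), showWarnings = FALSE)
missing_folders(demo_root)
```

Run from the root of a clone of the repo, missing_folders() with no arguments should return an empty character vector if all the folders are in place.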

Adding a file

This describes the process for adding a file to the repo. Note that the processes for adding a non-R file (any file that is not .rda) and an R file (any file already in .rda format) are slightly different. If you are adding a dataset from an existing R package, you can skip to step 3 below.

  1. Name your file appropriately
     1. You can name it whatever you want, but stick to basic naming conventions.
     2. Ensure that there is not already a file with the same name in tableoftables.xlsx.
     3. Avoid generic names like linelist_cleaned.xlsx or survey_data.xlsx.
     4. Use consistent and descriptive names without spaces (e.g., AJS_AmTiman, sitrep_mortality_survey).
  2. Place your file in the correct folder
     1. A non-R file (e.g. xlsx, shp, zip) goes in the inst/extdata folder.
        i. If adding a shapefile, zip it first.
     2. An R file (e.g. rda, rds) goes in the data folder.
  3. Reproducibly edit the dataset and internalise it (see data-raw/AJS_AmTiman.R for an example)
     1. In your console, run usethis::use_data_raw("<name of your file without extension>").
     2. This creates an R script in the data-raw folder.
     3. Read in the file by defining the path with system.file().
        i. If you are editing a file already in the package (e.g. shortening the Ebola linelist for a course), make sure you read in the original dataset here. Document this properly with {roxygen2} and in the metadata as described below.
     4. Make any edits necessary to your dataset in a reproducible way.
     5. Save and internalise the dataset with usethis::use_data().
  4. Add documentation for each dataset added
     1. This is done in an R script in the R folder.
     2. Name the script something that will allow reviewers to find it (e.g. AJS_chad) and add the suffix _doc so that it can be distinguished from function scripts.
     3. Place all the documentation for datasets in that group within the same script.
     4. Clearly document the source and license for the dataset.
     5. Add an explanation for each variable. If you have a data dictionary, you can use appliedepidata::create_desc() to help with this.
        i. You could also create a data dictionary for use with this function; see the data dictionary walk-through.
  5. Add the datasets to _pkgdown.yml
     1. Group relevant datasets under the same subtitle (suffixed with the language).
     2. The names here correspond to the name in quotation marks at the end of your documentation script from step 4 above, as well as the name of the file (without file extension).
  6. Add the dataset to tableoftables.xlsx as described below.
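
The "reproducibly edit and internalise" step might look roughly like the script below. This is a sketch under stated assumptions, not the contents of data-raw/AJS_AmTiman.R: the dataset name newdata, its columns, and the edits are placeholders, and a small stand-in data frame replaces the real import so that the example runs on its own.

```r
# Sketch of data-raw/newdata.R ("newdata" is a placeholder dataset name).

# In the real script the source file would be located inside the package, e.g.
#   path <- system.file("extdata", "newdata.xlsx", package = "appliedepidata")
#   raw  <- rio::import(path)
# A stand-in data frame with invented columns is used here instead.
raw <- data.frame(
  case_id = c("A1", "A2", "A3", "A4"),
  age     = c("34", "5", NA, "61"),
  onset   = c("2016-06-01", "2016-06-03", "2016-06-03", "2016-06-10"),
  stringsAsFactors = FALSE
)

# Reproducible edits: fix column types and drop rows without an age.
newdata <- raw
newdata$age   <- as.integer(newdata$age)
newdata$onset <- as.Date(newdata$onset)
newdata <- newdata[!is.na(newdata$age), ]

# Save to data/newdata.rda. Run from the package root with {usethis} installed:
#   usethis::use_data(newdata, overwrite = TRUE)
```

Keeping every edit in the script, rather than editing the file by hand, is what lets reviewers trace a child dataset back to its original.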

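For the documentation step, a dataset is usually documented with a {roxygen2} comment block that ends in the dataset name in quotation marks. The block below is a generic template with placeholder text and invented variable names (newdata, case_id, age, onset); it is not copied from an existing script in the package.

```r
# Sketch of R/newdata_doc.R ("newdata" and its variables are placeholders).

#' Placeholder title for the dataset
#'
#' One or two sentences describing what the dataset contains.
#'
#' @format A data frame with one row per case:
#' \describe{
#'   \item{case_id}{Unique case identifier}
#'   \item{age}{Age in years}
#'   \item{onset}{Date of symptom onset}
#' }
#' @source Where the data came from, and the license it is shared under.
"newdata"
```

The quoted name on the last line is the one that the _pkgdown.yml entry needs to match.
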
Defining dataset metadata (adding to tableoftables.xlsx)

The list below explains how to fill in each variable in the dataset metadata Excel sheet (tableoftables.xlsx). Following it helps ensure consistency and completeness when adding new datasets to the collection.

  • name: The filename of the dataset as it appears in the inst/extdata directory, without the file extension. This should be unique within the dataset group, and ideally also within tableoftables.xlsx, so avoid generic names such as linelist_cleaned.xlsx or survey_data.xlsx. Use consistent and descriptive names without spaces (e.g., AJS_AmTiman, mortality_survey).

  • type: The category or type of the dataset (e.g., linelist, population, shape, survey, dictionary).

  • extension: The file extension (e.g., xlsx, zip).

  • type_version: Identifies each distinct original dataset of a given type within a group. Increment when a completely different dataset of the same type is added to the group (e.g. a second linelist), not when an existing dataset is edited.

  • data_version: Identifies edited versions ("children") of an original dataset. Increment when the format or variables change (e.g. after cleaning), keeping type_version the same. Ensure you document the changes in the appropriate data-raw script.

  • language: Language code using ISO 639-1 codes (e.g., en, fr).

  • country: Country code using ISO 3166-1 alpha-3 codes (e.g., tcd).

  • scale: Geographic scale (e.g., subnational, national, or international).

  • subject: Main subject of the dataset (e.g., acute jaundice syndrome).

  • context: Context of the data (e.g., outbreak, survey).

  • fictional: Is the dataset fictional (yes) or real (no)?

  • year: Year the data was collected (e.g., 2016). This is the earliest year in the dataset.

  • description: Brief description of the dataset. Ideally, copy from roxygen documentation.

  • usage: Intended usage (e.g., {sitrep} walkthroughs, training).

  • license: License for dataset (e.g., gpl3, mit).

  • group_identifier: DO NOT EDIT - Created by a concatenating function in Excel. High-level identifier combining subject, context, country, and year (e.g., acute_jaundice_syndrome_outbreak_tcd_2016).

  • unique_identifier: DO NOT EDIT - Combines group_identifier, type, type_version, data_version, context, and year to create a unique identifier (e.g., acute_jaundice_syndrome_outbreak_tcd_2016_linelist_1_1_outbreak_2016).
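
Some of the conventions in this list can be checked automatically before an entry is added. The sketch below is a hypothetical helper, not a function in {appliedepidata}. It assumes codes are written in lowercase (as in the examples) and that the three scale values listed are the only ones allowed. It checks only the shape of the language and country codes, not whether they are valid ISO codes.

```r
# Hypothetical helper: returns a character vector describing problems found
# in one metadata entry (a named list); empty if none are found.
check_metadata <- function(entry) {
  problems <- character(0)
  if (grepl("\\s", entry$name)) {
    problems <- c(problems, "name contains spaces")
  }
  if (!grepl("^[a-z]{2}$", entry$language)) {
    problems <- c(problems, "language is not a two-letter lowercase code")
  }
  if (!grepl("^[a-z]{3}$", entry$country)) {
    problems <- c(problems, "country is not a three-letter lowercase code")
  }
  if (!entry$scale %in% c("subnational", "national", "international")) {
    problems <- c(problems, "scale is not subnational, national or international")
  }
  if (!entry$fictional %in% c("yes", "no")) {
    problems <- c(problems, "fictional is not yes or no")
  }
  problems
}

# An entry that follows the conventions yields no problems.
entry <- list(name = "ebola_linelist_cleaned", language = "en",
              country = "lbr", scale = "national", fictional = "yes")
check_metadata(entry)
```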

For example, when adding an Ebola dataset, you would enter the information as shown below. The original dataset (whether it’s from {outbreaks} or another source) would be considered type_version 1. If it’s the only linelist in its group, it remains type_version 1. If a completely different linelist is added (not just an edited version), increment the type_version accordingly.

For any changes to the data (such as cleaning or changing the number of rows or columns), increment the data_version (e.g., data_version 2), but keep the type_version the same to indicate that it’s a derivative (or “child”) of the original. Each child dataset gets its own entry.

If a dataset is translated into a different language, create a new entry for the translated version while keeping the data_version and type_version the same, but editing the language column accordingly. This ensures you can trace back the parent-child relationship between datasets.
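
The versioning rules in the last three paragraphs can be summarised as a small function. This is a hypothetical sketch of the rules as described, not code from the package. It assumes that a completely new dataset starts again at data_version 1, and that the type_version passed in is the highest one already used for that type in the group.

```r
# Hypothetical helper: version numbers for a new metadata entry, given the
# parent entry's versions and the kind of change being made.
next_version <- function(type_version, data_version,
                         change = c("edit", "translation", "new_dataset")) {
  change <- match.arg(change)
  switch(change,
    # Edited child: same type_version, next data_version.
    edit = list(type_version = type_version,
                data_version = data_version + 1),
    # Translation: both versions unchanged; only the language column differs.
    translation = list(type_version = type_version,
                       data_version = data_version),
    # Completely different dataset of the same type: next type_version.
    new_dataset = list(type_version = type_version + 1,
                       data_version = 1)
  )
}

next_version(1, 1, "edit")
```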

Variable            Example Entry
name                ebola_linelist_cleaned
type                linelist
extension           xlsx
type_version        1
data_version        1
language            en
country             lbr
scale               national
subject             ebola
context             outbreak
fictional           yes
year                2014
description         Linelist data from the Ebola virus disease outbreak in Liberia in 2014.
usage               introexercises, etc.
license             gpl3
group_identifier    ebola_outbreak_lbr_2014
unique_identifier   ebola_outbreak_lbr_2014_linelist_1_1_outbreak_2014
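
The two identifier columns are generated in the spreadsheet and should not be typed by hand, but it can help to see how they are put together. The function below is a hypothetical R sketch of the concatenation as described in this guide. It is not the Excel formula itself, and replacing spaces with underscores is an assumption based on the acute_jaundice_syndrome example.

```r
# Hypothetical helper mirroring the concatenation described for the
# identifier columns; the real values come from formulas in the sheet.
make_identifiers <- function(subject, context, country, year,
                             type, type_version, data_version) {
  # Lowercase and replace spaces, so multi-word subjects become
  # e.g. acute_jaundice_syndrome.
  clean <- function(x) gsub("\\s+", "_", tolower(trimws(x)))
  group_identifier <- paste(clean(subject), clean(context), clean(country),
                            year, sep = "_")
  unique_identifier <- paste(group_identifier, clean(type), type_version,
                             data_version, clean(context), year, sep = "_")
  list(group_identifier = group_identifier,
       unique_identifier = unique_identifier)
}

# Reproduce the identifiers for the Ebola example entry.
ids <- make_identifiers("ebola", "outbreak", "lbr", 2014, "linelist", 1, 1)
ids$group_identifier
ids$unique_identifier
```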