Best practices
This section includes best practices to promote consistent, efficient, and accurate data analysis with R.
1.1 Welcome to the Tidyverse
The Tidyverse consists of a group of R packages that work in harmony by sharing a common data format and syntax. Using packages from the Tidyverse for your analysis will make your scripts easier to read, both for others and for your future self.
Recommended packages
Loading data
readr
Load data from text files: tab, comma, and other delimited files.
readxl
Load data from Excel.
RODBC
Load data from Access and Oracle databases such as DELTA.
RMySQL, RPostgreSQL, and RSQLite
Connect to MySQL, PostgreSQL, and SQLite databases.
rvest
Read and scrape content from web pages.
pdftools
Read PDF documents.
googledrive
Read and write files to your Google Drive.
foreign
Load data from Minitab and Systat.
R.matlab
Load data from Matlab.
haven
Load data from SAS, SPSS, and Stata.
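As a quick sketch of loading delimited data with readr (the file name, site labels, and values below are made up for illustration):

```r
library(readr)

# Write a small example CSV to a temporary file, then read it back
csv_file <- tempfile(fileext = ".csv")
write_csv(data.frame(site = c("A", "B"), conc = c(0.5, 0.7)), csv_file)

air_data <- read_csv(csv_file)
```

In practice you would point read_csv() at your own file; readr also provides read_tsv() and read_delim() for tab- and otherwise-delimited files.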
Manipulate data
dplyr
Essential shortcuts to subset, summarize, rearrange, and join data sets.
tidyr
Reshape tables and unpack multiple values stored in a single cell.
stringr
Tools to edit and clean text and character strings.
lubridate
Tools to format dates and perform calculations based on time.
forcats
Set, reorder, and change the levels of factors (R's categorical variables).
magrittr
Simplify your scripts by chaining multiple commands together with the pipe %>%.
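A minimal sketch of the pipe in action, assuming magrittr is installed (the concentration values are made up):

```r
library(magrittr)

concs <- c(1.2, 3.4, 2.2, NA)  # made-up concentrations

# Without the pipe this would read inside-out: round(mean(na.omit(concs)), 1)
mean_conc <- concs %>%
  na.omit() %>%
  mean() %>%
  round(1)

mean_conc  # 2.3
```

Each %>% passes the result of the previous step as the first argument of the next function, so the script reads top to bottom in the order the steps happen.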
Charts and visuals
ggplot2
Essential package for plots and charts.
ggsave()
A ggplot2 function to export charts in various formats and sizes.
xkcd, hrbrthemes, ggpomological, and ggthemes
Chart themes for ggplot2.
viridis, wesanderson, and ghibli
Color palettes.
rmarkdown
Write summary reports and save as PDF, Word document, presentation, or website.
shiny
Create interactive data exploration tools for the web.
DT
Create data tables for the web.
Maps
leaflet
Display spatial data and make interactive maps.
sf
Simple features: a spatial format that uses data frames to perform spatial analysis.
tigris
Download geography boundaries: States, Counties, Census tracts, Block groups.
tidycensus
Download Census and American Community Survey data.
Files and folders
here
Simplify file paths in your scripts and functions.
fs
File system functions for copying, moving, and deleting files.
Automation
cronR
Schedule scripts to run and refresh data at a set time.
Create R packages
devtools
Essential tools for creating your own R package.
roxygen2
A quick way to document your R packages.
usethis
Shortcuts for package development and GitHub integration.
goodpractice
Automated feedback on your R package.
Find new packages
Install the RStudio addin CRANsearcher to browse all published packages.
Visit rOpenSci to view a selection of peer reviewed science packages.
1.2 Keep R updated
Get the installr package
Update your R version
- When a popup suggests closing RStudio and running the update from RGui, click NO.
- When asked whether to copy and update all of your packages, click YES. This will save you buckets of time.
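In code, the update steps above look like this (run interactively; updateR() walks you through the prompts described above):

```r
# Run interactively in R on Windows; do not script this into automated jobs
install.packages("installr")
library(installr)

updateR()  # opens the update prompts described above
```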
To learn more about installr see https://github.com/talgalili/installr/#readme.
1.3 Script organization
The following tips will help make your R scripts easier to read and help others use your code in their own projects.
- Add a description to the top of your script.
- Load all packages at the top of the script.
- Assign important file paths and parameters near the top of the script.
- Avoid changing working directories with setwd().
- Save files to a relative path within the project, such as "results/results.csv", rather than to a fixed location, such as "X:/EAO/Air Data/Project1/results.csv".
- Load usernames and passwords from a file, such as credentials.csv, rather than typing them into your script.
An example R script
# name_of_script.R
# Purpose: This script does amazing things.
# Assumptions: The monitoring data for the sites is from different years,
# but we compare them anyways.
# Warning! Don't use this script for Kryptonite. It will break everything.
# Load packages
library(dplyr)
library(readr)
# Set parameters
year <- 2017
data_file <- "monitors 1 and 2.csv"
# Load data
air_data <- read_csv(data_file)
# My functions
calc_stat <- function(data) {
  step1(data) %>% step2() %>% step3()
}
# My analysis
results <- air_data %>% summarize(stat = calc_stat(concentration))
# Save results to local folder
write_csv(results, "results/summary_results.csv")
1.4 Divide and conquer
For complex projects, many small components are often better than one big super script. Try to create small functions to perform each task within your analysis. If a function will be used across multiple projects, it is helpful to save it to its own script.
These functions will become reusable building blocks that both future you and others can apply to their own projects.
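As a small sketch of this idea (the function name calc_mean_conc is made up), a function saved to its own script file can be loaded into any project with source():

```r
# Save a small reusable function to its own script file (a temp file here;
# in a real project this would live in the project's R/ folder)
fun_file <- tempfile(fileext = ".R")
writeLines("calc_mean_conc <- function(x) mean(x, na.rm = TRUE)", fun_file)

# Load the function into the current session
source(fun_file)

calc_mean_conc(c(1, 2, NA, 3))  # 2
```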
Using an R Markdown document is an additional way to split your project into manageable steps. R Markdown makes it easier to add a short description of each step to your analysis. In fact, you are reading the output of an R Markdown document right now.
An example folder structure for a project is shown below:
New_Project\
- data
- monitors 1 and 2.csv
- R
- load_data.R
- calc_ucl95.R
- lil_boot_function.R
- summary_report.Rmd
- results
- summary_results.csv
- summary_report.pdf
1.5 Codebooks, metadata, and data dictionaries
A codebook provides essential technical details and references for the variables in a dataset. A codebook ensures that someone unfamiliar with your data will have the information necessary to complete their analysis, and helps them estimate the uncertainty of their results. An R package named dataMaid is available to assist with creating documentation for your data sets.
An example codebook generated with dataMaid
library(dataMaid)
library(readr)
data <- read_csv('https://raw.githubusercontent.com/MPCA-air/air-methods/master/airtoxics_data_2009_2013.csv')
names(data) <- c("aqs_id", "poc", "param_code", "date", "conc", "null_code", "md_limit", "pollutant", "year", "cas")
attr(data$aqs_id, "labels") <- "Monitor ID"
attr(data$aqs_id, "shortDescription") <- "EPA assigned monitor ID for the national Air Quality System (AQS)."
makeCodebook(data, vol = 1, reportTitle = "Codebook for air toxics monitoring data", replace = TRUE)
1.6 Data formatting
Performing similar analysis on multiple pollutants at multiple sites over multiple years becomes much easier when data is in a consistent format. MPCA prefers tidy data where each result has its own row in a table. More information on tidy data can be found here: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html.
For monitoring data, each row should contain a separate column for:
- monitoring site ID: preferably the AQS ID or another unique identifier
- parameter code: in some cases other pollutant identifiers may be used, such as the CAS number or analyte name, but the identifier should be unique to each analyte in the data
- POC: for collocated monitors; if there are no collocated monitors in your data, include a POC column with all values set to 1
- sample date: the recommended format is yyyymmdd or yyyy-mm-dd; other formats can be converted using R's lubridate package
- sample start time: the format is hh:mm:ss
- sample duration: a code for the duration or its numeric value
- result: the observed value
- null data code
Example Tidy data
Additional columns may include qualifier codes and method detection limits. Fields directly associated with one or more of the columns above can be removed and stored in their own tables whenever possible. For example, the site address and site name are both fields that are directly associated with the monitor’s AQS ID. If your data file already has the AQS ID column, you can perform all analysis based on AQS ID and then join the site information after the analysis is completed. EPA’s AQS data is stored in a recommended format for air monitoring data, as it meets all of the requirements for tidy data.
The R functions shown in this guide often group and sort based on the columns identified above. Including these columns in your data will allow you to apply these functions in your own scripts as-is. Keep in mind that you may need to rearrange your data depending on its original format. For example, you may have multiple result columns (e.g. a separate column for each site’s results) and you will want to reorganize the data so that all of the results are in a single column.
The gather() function in R's tidyr package works well for this purpose. The mutate() function in R's dplyr package can add and manipulate columns. The functions rename() and names() can update column names to make data compatible with the examples in this guide.
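A minimal sketch of reshaping wide data, assuming tidyr and dplyr are installed (the site columns and values are made up):

```r
library(tidyr)
library(dplyr)

# Wide data: one result column per site
wide <- data.frame(
  date   = c("2017-01-01", "2017-01-04"),
  site_1 = c(0.5, 0.7),
  site_2 = c(0.4, 0.9)
)

# Gather the site columns into a single 'site' column and a single
# 'conc' column, then rename 'date' to 'sample_date'
tidy_data <- wide %>%
  gather(key = "site", value = "conc", site_1, site_2) %>%
  rename(sample_date = date)
```

The result has one row per site per date, so the same grouping and summarizing functions work regardless of how many sites are in the data.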
1.7 Pollutant names
Pollutants often go by different names depending on the context. To prevent confusion, it helps to include a unique CAS number or parameter code in your analysis results and summary data. If you are working with monitoring data that is missing a unique identifier for each pollutant, an R package named aircas is available to help join CAS numbers to pollutant names.