Summary statistics
The summary methods described in this chapter include:
8.1 Normality
8.2 Bootstrapping
8.3 Below the detection limit
8.4 Upper confidence limits (UCLs)
8.5 Annual summaries for incomplete data
8.6 Comparison of data to inhalation health benchmarks
8.1 Test for normality
Data should be tested for normality prior to choosing the type of summary statistic to report. In general, environmental data are non-normal: they contain many low values and a few high values. This causes the mean of the data to be higher than the median (right skew). Many normality tests exist and they vary in sensitivity. A good article on normality tests can be found at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693611/.
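For example, the mean of right-skewed data sits above the median. A quick sketch using simulated lognormal values (not the monitoring data themselves):

```r
# Simulated right-skewed concentrations: many low values, a few high ones
set.seed(1)
conc <- rlnorm(1000, meanlog = 0, sdlog = 1)

# The few high values pull the mean above the median
mean(conc) > median(conc)
```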
The normality test in the example script below is recommended for sample sizes between 5 and 5,000 and allows missing values. A Shapiro-Wilk test may also be performed for the typical sample size of a year of air toxics results. The script below includes a commented-out line to switch to a Shapiro-Wilk test.
Click below to view code for a normality test.
Normality test
library(tidyverse)
library(stats)
library(nortest)  # provides sf.test()

data <- read_csv('https://raw.githubusercontent.com/MPCA-air/air-methods/master/airtoxics_data_2009_2013.csv')

colnames(data) <- c("aqs_id", "poc", "param_code", "date", "conc",
                    "null_code", "md_limit", "pollutant", "year", "cas")

# Empty table to collect the test results
data_normresults <- data.frame(site_name = character(),
                               analyte   = character(),
                               year      = character(),
                               w_value   = numeric(),
                               p_value   = numeric(),
                               stringsAsFactors = FALSE)

sites    <- unique(data$aqs_id)
analytes <- unique(data$pollutant)
years    <- unique(data$year)

for (i in sites[1:3]) {
  for (j in analytes) {
    for (k in years) {

      data_sub <- filter(data, aqs_id == i, pollutant == j, year == k)

      # Require more than 5 unique values before testing
      if (length(unique(data_sub$conc)) > 5) {

        norm_test <- sf.test(data_sub$conc)           # Shapiro-Francia
        #norm_test <- shapiro.test(data_sub$conc)     # Shapiro-Wilk alternative

        print(i)
        print(norm_test)

        values <- data.frame(site_name = i,
                             analyte   = j,
                             year      = k,
                             w_value   = round(norm_test$statistic, digits = 2),
                             p_value   = round(norm_test$p.value, digits = 2))

        data_normresults <- rbind(values, data_normresults)
      }
    }
  }
}
We also suggest visualizing the data to determine whether normality is a reasonable assumption. The script below is an example of visual normality checks and includes an explanation of their interpretation. In a Q-Q plot, normal data form a nearly straight line against the theoretical distribution quantiles, with little deviation from the line; this check is visual and somewhat subjective. Normal data also produce bell-shaped histograms.
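A minimal base-R sketch of these visual checks, using simulated data in place of the monitoring results:

```r
# Simulated right-skewed data, similar in shape to typical air toxics results
set.seed(1)
conc <- rlnorm(100)

# Q-Q plot: skewed data bend away from the reference line at the high end
qqnorm(conc)
qqline(conc)

# Histogram: right-skewed data pile up at low values instead of forming a bell
hist(conc)

# Log-transforming often brings environmental data much closer to normal
qqnorm(log(conc))
qqline(log(conc))
```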
Here’s an example tool to inspect your data by year, site and pollutant: https://mpca-pahs.shinyapps.io/NormViz/
Click to view the code for the shiny tool above.
Shiny tool
When to use an arithmetic or geometric mean:
- Use an arithmetic mean when your data are normal and fewer than 5% of values are below the detection limit per year/analyte/site.
- For non-automated scripts: use a geometric mean if your data are log-normally distributed and fewer than 5% of values are below the detection limit per year/analyte/site.
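A geometric mean can be computed as the exponential of the mean of the logged values. A quick sketch with hypothetical concentrations:

```r
# Hypothetical concentrations with one high value
conc <- c(0.2, 0.3, 0.5, 0.8, 12)

arith_mean <- mean(conc)            # pulled up by the high value
geo_mean   <- exp(mean(log(conc)))  # closer to the bulk of log-normal data

arith_mean  # 2.76
geo_mean    # about 0.78
```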
8.2 Bootstrapping
Bootstrapping provides methods for calculating summary statistics without making assumptions about the distribution from which the data are sampled.
Click to view the bootstrapping code.
Bootstrapping
Our data are organized by monitoring site and date. Here's a sample:
AQS_ID | Date | Conc |
---|---|---|
270535501 | 2009-07-30 | 0.00148 |
270535501 | 2009-09-30 | 0.00064 |
270535501 | 2009-11-30 | 0.34256 |
270535501 | 2009-12-30 | 0.00064 |
270535502 | 2009-03-30 | 0.26219 |
270535502 | 2009-07-30 | 0.01113 |
270535502 | 2009-09-30 | 0.00044 |
270535502 | 2009-11-30 | 0.00127 |
270535502 | 2009-12-30 | 0.00113 |
Bootstrap function
We currently use the EnvStats package to generate air toxics summary values. It has built-in functions to account for non-detect data and allows for different distributions. However, if you're not dealing with non-detects you can use a simple loop to boot things yourself.
Before you start, set the random number generator's seed to ensure you'll be able to reproduce your results. I'll use 27 below.
The general idea is to take a random sample with replacement from the data set, generate the statistic that you’re interested in, record it, then rinse and repeat. Below is the code for how to resample a single site.
# Set the seed for reproducibility
set.seed(27)

# Filter data to a single site
df_site1 <- filter(df, AQS_ID == AQS_ID[1])

# Pull a random sample
# `replace = TRUE` allows the same value to be pulled multiple times
# By default, `sample()` returns as many values as it is given, so the
# resample has the same number of observations as the original data
random_df <- sample(df_site1$Conc, replace = TRUE)

# Generate the summary statistic
quantile(random_df, 0.1)
## 10%
## 0.00064
To repeat this 3,000 times we can wrap these steps into a resample function, and then use sapply to collect the results.
# Create a resample function
resample_Conc <- function(data = df_site1$Conc, Conc_pct = 0.10) {
  random_df <- sample(data, replace = TRUE)
  quantile(random_df, Conc_pct, na.rm = TRUE)[[1]]
}
# Repeat using `sapply`
repeats <- 3000

booted_10pct <- sapply(1:repeats,
                       FUN = function(x) resample_Conc(df_site1$Conc))
# The 50th percentile or median Conc
median(booted_10pct, na.rm=T)
## [1] 0.00064
# Return the 95th percentile of the booted concentrations
quantile(booted_10pct, 0.95, na.rm = T)[[1]]
## [1] 0.00148
# Upper and lower confidence interval around the median
quantile(booted_10pct, c(0.025, 0.5, 0.975), na.rm = T)
## 2.5% 50% 97.5%
## 0.000640 0.000640 0.103216
Automate
To finish, put these steps into a boot function and run it on each site using group_by.
# Create a boot function
boot_low_Conc <- function(data = df$Conc,
                          Conc_pct = 0.10,
                          conf_int = 0.95,
                          repeats  = 3000) {

  alpha <- (1 - conf_int) / 2

  booted_10pct <- sapply(1:repeats,
                         FUN = function(x) resample_Conc(data, Conc_pct))

  # Upper and lower confidence interval around the booted median
  list(quantile(booted_10pct, c(alpha, 0.5, 1 - alpha), na.rm = TRUE))
}
# Use `group_by` to send each site's data to the boot function
conc_summary <- group_by(df, AQS_ID) %>%
    mutate(boot_results = boot_low_Conc(Conc, Conc_pct = 0.10, conf_int = 0.95)) %>%
    summarize(Conc_10pct      = quantile(Conc, 0.10, na.rm = TRUE)[[1]],
              Low_CL95_conc   = unlist(boot_results[1])[[1]],
              Boot_conc       = unlist(boot_results[1])[[2]],
              Upper_CL95_conc = unlist(boot_results[1])[[3]])
Results
The booted confidence limits:
AQS_ID | Conc_10pct | Low_CL95_conc | Boot_conc | Upper_CL95_conc |
---|---|---|---|---|
270535501 | 0.000640 | 0.00064 | 0.000640 | 0.103216 |
270535502 | 0.000716 | 0.00044 | 0.000716 | 0.005214 |
8.3 Below the detection limit
The annual mean or upper confidence limit is reported as below the detection limit when either of the following conditions is true:
- More than 80% of the values are censored (below the detection limit).
- This is based on Cox 2006 and MPCA simulations [cite work and results](www.google.com).
- The annual mean or 95% upper confidence limit is below the method detection limit.
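The two conditions above can be sketched in base R (the column names and values here are hypothetical):

```r
# Hypothetical annual summary values for two pollutants
annual <- data.frame(
  pollutant    = c("Benzene", "Arsenic PM10"),
  pct_censored = c(0.10, 0.95),   # fraction of values below the MDL
  mean_conc    = c(0.85, 0.0004),
  mdl          = c(0.05, 0.0010)
)

# Report as below the detection limit if either condition holds
annual$below_mdl <- annual$pct_censored > 0.80 | annual$mean_conc < annual$mdl
```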
8.4 Upper confidence limits (UCLs)
If data are not normal, use bootstrapping to generate means for each site. The 95% upper confidence limit is the 95th percentile of the means generated by bootstrap sampling.
The script below uses bootstrap sampling to generate 1000 means for each site, pollutant, and year, and then calculates the 95% UCL by taking the 95th percentile of those means. Values below the MDL are estimated using lognormal MLE estimation for each bootstrap sample. The 95% UCL is not calculated for any site, pollutant, and year that fails to meet completeness requirements and/or minimum detection requirements.
Click to view code for calculating UCLs.
Upper Confidence Limits
# DATA prep:
## For this function the data must have the names:
## "AQSID", "POC", "Parameter", "Date", "Result", "MDL", "Pollutant"
library(dplyr)
library(readr)
library(EnvStats)
library(lubridate)

seed <- 2017

data <- read_csv('https://raw.githubusercontent.com/MPCA-air/air-methods/master/airtoxics_data_2009_2013.csv')

names(data)[1:10] <- c("AQSID", "POC", "Parameter", "Date", "Result",
                       "Null_Data_Code", "MDL", "Pollutant", "Year", "CAS")

data <- data %>%
        filter(AQSID == 270370020) %>%
        mutate(`Monitoring Site` = AQSID,
               PollutantGroup = Pollutant,
               Censored = Result < MDL,
               Result = ifelse(Censored, MDL, Result))
UCL_95 <- function(data, Boot_Repeats = 1000) {

  library(EnvStats)
  set.seed(2017)

  annual_AT_means <- function(air_toxics) {

    air_toxics <- mutate(air_toxics,
                         Year    = year(ymd(Date)),
                         Quarter = quarter(ymd(Date)))

    # A quarter is complete when at least 75% of results are non-missing;
    # a year is complete when all of its quarters are complete
    sample_complete <- air_toxics %>%
      group_by(AQSID, Pollutant, Year, Quarter, MDL) %>%
      summarise(Complete = (sum(!is.na(Result)) / length(Result)) >= 0.75) %>%
      mutate(Complete = ifelse(is.na(Complete), FALSE, Complete)) %>%
      group_by(AQSID, Pollutant, Year, MDL) %>%
      summarise(Complete = all(Complete))

    # Require no more than 80% censored values
    enough_detects <- air_toxics %>%
      group_by(AQSID, Pollutant, Year, MDL) %>%
      summarise(Detected = mean(Censored, na.rm = TRUE) <= 0.8)

    # Annual means: lognormal MLE imputation when any values are censored
    site_means <- air_toxics %>%
      group_by(AQSID, Pollutant, Year, MDL) %>%
      summarise(Mean = ifelse(length(unique(Result[!is.na(Result) & !Censored])) < 2, NA,
                       ifelse(any(Censored, na.rm = TRUE),
                              elnormAltCensored(Result, Censored, method = "impute.w.mle", ci = FALSE)$parameters[[1]],
                              mean(Result, na.rm = TRUE))))

    site_means <- left_join(site_means, sample_complete, by = c("AQSID", "Pollutant", "Year", "MDL")) %>%
      left_join(enough_detects, by = c("AQSID", "Pollutant", "Year", "MDL")) %>%
      mutate(Mean = ifelse(Complete & Detected, Mean, NA),
             ID   = paste(AQSID, Pollutant, Year))

    return(site_means)
  }

  MLE_est <- function(data) {

    results  <- data$Result
    censored <- data$Censored
    n        <- sum(!is.na(results))

    if (length(unique(results[!is.na(results) & !censored])) < 2) {
      MLE_means <- NA
    } else {
      # Resample, keeping at least 2 unique uncensored values so the MLE fit is possible
      random.rows <- sample(which(!is.na(censored) & (!censored) & !duplicated(results)), 2, replace = FALSE)
      random.rows <- c(random.rows, sample(which(!is.na(censored)), n - 2, replace = TRUE))

      MLE_means <- ifelse(sum(censored[random.rows], na.rm = TRUE) == 0,
                          mean(results[random.rows]),
                          elnormAltCensored(results[random.rows], censored[random.rows], method = "impute.w.mle", ci = FALSE)$parameters[[1]])
    }

    return(MLE_means)
  }

  data <- mutate(data, ID = paste(AQSID, Pollutant, Year))

  # Bootstrap: one mean per site-pollutant-year ID per repeat
  Bootstrap_means <- replicate(Boot_Repeats, by(data, data$ID, MLE_est))

  # 95th percentile of the booted means
  CL <- apply(Bootstrap_means, 1, function(x) sort(x)[ceiling(0.95 * Boot_Repeats)])
  CL <- data.frame(ID = names(CL), UCL95 = unname(CL))

  annual_summary <- left_join(annual_AT_means(data), CL, by = "ID") %>%
    mutate(UCL95 = ifelse(Complete & Detected, UCL95, NA)) %>%
    select(-ID)

  return(annual_summary)
}

set.seed(seed)

annual_summary <- UCL_95(data, 100)
8.5 Annual summaries for incomplete data
There may be situations when a monitoring site does not meet the 75% completeness requirement in each quarter of a year, but annual summary statistics are still necessary. In this case, calculate annual summary statistics as usual, but add a flag noting that the site did not meet completeness requirements. The minimum threshold of valid samples required for reporting annual summary statistics may be based on the professional judgement of data analysis staff, but must meet statistical test assumptions.
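The quarterly completeness check and flag can be sketched in base R (the sampling log below is hypothetical; each row is one scheduled sample):

```r
# Hypothetical year of scheduled samples; NA = invalid or missing result
samples <- data.frame(
  quarter = c(1, 1, 2, 2, 3, 3, 4, 4),
  result  = c(0.4, 0.5, NA, NA, 0.3, 0.6, 0.2, NA)
)

# Fraction of valid samples in each quarter
valid_frac <- tapply(!is.na(samples$result), samples$quarter, mean)

# Flag the year if any quarter falls below 75% completeness
incomplete_flag <- any(valid_frac < 0.75)
```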
8.6 Comparison of data to inhalation health benchmarks
Chronic Inhalation Health Benchmarks
The annual 95% upper confidence limit is compared to chronic inhalation health benchmarks, which are chosen based on the MPCA/MDH hierarchy of inhalation health benchmark information sources. These risk results are summarized differently for annual measured data reports (the Air Toxics Data Explorer) than when the comparison is used in a cumulative analysis (cumulative AERA) to inform air permitting or environmental review.
Acute Inhalation Health Benchmarks
The 2nd highest annual value, multiplied by ten, is compared to acute inhalation health benchmarks. This adjustment of the 24-hour integrated value approximates an hourly maximum, since acute inhalation health benchmarks are developed to be protective of high short-term exposures (about an hour).
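The acute screen described above can be sketched as follows (the concentrations and benchmark are hypothetical):

```r
# Hypothetical 24-hour concentrations for one site, pollutant, and year
conc <- c(0.12, 0.45, 0.30, 0.98, 0.76)

# Second-highest value, multiplied by ten to approximate an hourly maximum
second_high <- sort(conc, decreasing = TRUE)[2]
acute_value <- second_high * 10

# Compare to a hypothetical acute inhalation health benchmark
acute_benchmark <- 9
exceeds_acute   <- acute_value > acute_benchmark
```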
Summary of risk estimates for cumulative Air Emission Risk Analysis
Cumulative AERAs use the mean cumulative risk estimate from the most recent three years of monitored data. For cumulative risks, we apply additivity and sum risk estimates across pollutants. Since not all pollutants are measured at every monitoring site, multiple sites are used to generate a cumulative risk estimate for various population densities.
Means are calculated for urban, mid-density and rural sites. Facilities within urban areas may also be provided with risk estimates for the closest or most representative monitor with all pollutants measured.
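The additivity step can be sketched in base R (the risk estimates below are hypothetical):

```r
# Hypothetical pollutant risk estimates for one site over three years
risks <- data.frame(
  year      = rep(2011:2013, each = 2),
  pollutant = rep(c("Benzene", "Formaldehyde"), times = 3),
  risk      = c(3e-6, 8e-6, 4e-6, 7e-6, 2e-6, 9e-6)
)

# Additivity: sum risk across pollutants within each year
annual_cum <- tapply(risks$risk, risks$year, sum)

# Cumulative risk estimate: mean over the three years
mean_cum_risk <- mean(annual_cum)
```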
This is described further in the cumulative AERA section of the AERA guide.