Completeness checks


Checking for data completeness before generating summaries ensures that results will be comparable from one year to the next. Without completeness checks, changes in the seasonal coverage from year to year may create the illusion of increasing or decreasing trends.

Completeness checks include tests for:

  • Annual completeness: Number of samples required per year.
  • Seasonal completeness: Number of samples required per season.

7.1 Completeness checks

Description

This section describes completeness guidelines and methods for performing completeness checks.

Air toxics reporting guidelines

Annual results are considered complete if the following conditions are met:

  • Seasonal completeness: 75% or more of expected samples collected in each season, the colder months (January - April, November, December) and the warmer months (May - October)
  • Annual completeness: Both seasons are complete


Criteria pollutant reporting guidelines

Completeness rules for criteria pollutant design values are defined in the Appendices of 40 CFR 50.

In general, these rules apply:

  • Annual completeness: 75% or more of samples collected per year
  • Calendar quarter completeness: 75% or more of samples collected in each calendar quarter


Recommended steps for air toxics

  1. Based on the monitoring schedule, record the total expected samples for each year and season.
    • Air toxics monitors follow a fixed sampling schedule provided by EPA’s Air Toxics Calendar.
  2. Count the number of valid samples for each year and season.
  3. Divide the number of valid samples by the number of expected samples.
  4. Mark annual results as incomplete if one or both seasons do not fulfill the completeness checks.
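The four steps can be sketched on a toy example using the cold and warm seasons defined above. This is a minimal illustration in base R, not the full workflow: the missing July - September samples are hypothetical, the names `season_of`, `expected`, and `valid` are made up for this sketch, and the 1990-01-09 anchor date is borrowed from the sample_calendar() function shown later in this section.

```r
# Expected sampling dates for 2012 on a 6-day schedule
# (anchor date 1990-01-09 is taken from sample_calendar() in this section)
schedule <- seq(as.Date("1990-01-09"), as.Date("2012-12-31"), by = "6 days")
expected <- schedule[format(schedule, "%Y") == "2012"]

# Assign each date to the cold or warm season (May - October is warm)
season_of <- function(dates) {
  ifelse(as.integer(format(dates, "%m")) %in% 5:10, "warm", "cold")
}

# Hypothetical valid samples: every July - September sample is missing
valid <- expected[!(as.integer(format(expected, "%m")) %in% 7:9)]

# Steps 1-3: expected count, valid count, and their ratio for each season
expected_n <- table(factor(season_of(expected), levels = c("cold", "warm")))
valid_n    <- table(factor(season_of(valid),    levels = c("cold", "warm")))

completeness <- data.frame(season   = names(expected_n),
                           expected = as.integer(expected_n),
                           valid    = as.integer(valid_n))
completeness$pct      <- completeness$valid / completeness$expected
completeness$complete <- completeness$pct >= 0.75

# Step 4: the year is complete only if both seasons are complete
annual_complete <- all(completeness$complete)

print(completeness)
print(annual_complete)
```

In this toy case the cold season passes and the warm season fails, so the year is marked incomplete. Checking each season separately is what catches a gap concentrated in a few months.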


Example R script


Packages

library(tidyverse)
library(lubridate)

Our example data is organized by monitoring site and date.

data <- read_csv('https://raw.githubusercontent.com/MPCA-air/air-methods/master/airtoxics_data_2009_2013.csv')

Figure 7.1: Sample data table.


Step 1: Find the expected number of samples.

Monitors in EPA’s Air Quality System are required to follow the Air Toxics Monitoring Calendar. The sampling schedule for air toxics is generally one sample every 6 days. Depending on the sampling start date and whether the year is a leap year, the expected number of samples for the year will be either 60 or 61. If you are uncertain about the sampling schedule for your data, consult the lab to confirm the expected number of samples.
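As a quick check of the 60 to 61 range, the sketch below counts 6-day sampling dates per calendar year using base R only. The 1990-01-09 anchor date is the one used by the sample_calendar() function in this section; the 2009 to 2013 range is chosen only for illustration.

```r
# Build a 6-day schedule from the anchor date and keep 2009 through 2013
schedule <- seq(from = as.Date("1990-01-09"),
                to   = as.Date("2013-12-31"),
                by   = "6 days")
schedule <- schedule[schedule >= as.Date("2009-01-01")]

# Count the expected sampling dates in each calendar year
samples_per_year <- table(format(schedule, "%Y"))

print(samples_per_year)
```

Every yearly count should be 60 or 61, because a 365-day year holds between 60 and 61 six-day intervals and a 366-day year holds exactly 61.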

Entering the start and end date into the sample_calendar() function below will create a list of the expected sampling dates based on EPA’s air toxics monitoring schedule.

# Create a sampling calendar based on EPA's air toxics monitoring schedule
sample_calendar <- function(start         = "2012-01-01", 
                            end           = "2016-12-31", 
                            day_interval  = 6,
                            type          = "air_toxics") {
  
  library(lubridate)
  
  # Convert 'start' and 'end' to class date
  start <- ymd(start)
  end   <- ymd(end)
  
  # Set official start date to selected EPA calendar
  if (type == "air_toxics") {
       epa_start <- ymd("1990-01-09")
  } else {
       epa_start <- start
  }
  
  # Create full table of sampling dates
  calendar <- seq(from = epa_start, 
                  to   = end, 
                  by   = paste(day_interval, "days"))
  
  
  # Subset to user's date range
  calendar <- calendar[calendar >= start & calendar <= end]
  
  # Print total sampling days and date range
  cat(length(calendar), " sampling days from ", as.character(min(calendar)), "to", as.character(max(calendar)))
  
  return(calendar)
  
}

Find the date range of the data and create the expected sampling schedule with the function above.

# Find the date range of your data
date_range <- range(data$Date)

# Create expected sample calendar
epa_schedule <- tibble(Date = sample_calendar(start = format(date_range[1], "%Y-01-01"), # Extend range to first day of the year
                                                  end   = format(date_range[2], "%Y-12-31"), # Extend range to last day of the year
                                                  day_interval = 6))
## 304  sampling days from  2009-01-05 to 2013-12-28

# Add year and calendar quarter columns
epa_schedule <- epa_schedule %>% mutate(Year        = year(Date),
                                        cal_quarter = quarter(Date))


# Count the expected number of samples per quarter and year.
epa_schedule <- epa_schedule %>% 
                group_by(Year, cal_quarter) %>%
                summarize(expected_quarter_samples = length(unique(Date))) %>%
                group_by(Year) %>%
                mutate(expected_annual_samples = sum(expected_quarter_samples))

Figure 7.2: Expected number of samples.


Step 2: Count number of valid samples.

# Assign each date to a calendar quarter
data <- data %>% mutate(cal_quarter = quarter(Date))

# Count the number of valid sampling dates (non-missing Concentration) for each quarter and year.
data <- data %>% 
          group_by(AQSID, Param_Code, CAS, Year, cal_quarter) %>%
          mutate(valid_quarter_samples = length(unique(Date[!is.na(Concentration)]))) %>%
          group_by(AQSID, Param_Code, CAS, Year) %>%
          mutate(valid_annual_samples = length(unique(Date[!is.na(Concentration)]))) 


Step 3: Divide the number of valid samples by the number of expected samples.

# Join expected sample table to data by quarter and year columns
data <- left_join(data, epa_schedule, by = c("Year", "cal_quarter"))


# Divide valid samples by expected samples
data <- data %>% 
          group_by(AQSID, CAS, Year, cal_quarter) %>%
          mutate(pct_quarter_samples = valid_quarter_samples / expected_quarter_samples) %>%
          group_by(AQSID, CAS, Year) %>%
          mutate(pct_annual_samples = valid_annual_samples / expected_annual_samples) 


Step 4: Mark results as incomplete if they fail one of the completeness checks.

# Set complete to FALSE if either completeness check fails
data <- data %>% 
        rowwise() %>%
        mutate(complete = sum(c(pct_quarter_samples >= 0.75, 
                                pct_annual_samples  >= 0.75)) > 1) %>%
        ungroup()


Figure 7.3: Final table with added complete column.
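The row-by-row test in Step 4 can also be written as a single vectorized comparison, which avoids rowwise() on large tables. The sketch below compares the two forms on hypothetical percentages in base R; the vector names are made up for illustration, and the two forms are only assumed to agree when neither percentage is NA.

```r
# Hypothetical completeness fractions for four site-quarter rows
pct_quarter <- c(0.80, 0.70, 0.95, 0.75)
pct_annual  <- c(0.78, 0.78, 0.60, 0.75)

# Row-by-row form, mirroring the sum(...) > 1 test used in Step 4
complete_rowwise <- mapply(
  function(q, a) sum(c(q >= 0.75, a >= 0.75)) > 1,
  pct_quarter, pct_annual
)

# Vectorized form: both checks must pass
complete_vectorized <- pct_quarter >= 0.75 & pct_annual >= 0.75

print(complete_rowwise)
print(complete_vectorized)
```

Only the first and last rows pass both checks. A row passes only when the quarter and the year each reach the 75% threshold.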


Step 5 (Optional): By collapsing the complete status to 1 row per site-pollutant-quarter combination, you can save the table and attach a site’s completeness status to a future data analysis as needed.

# Collapse table to 1 row per unique site-pollutant-quarter
data <- data %>% 
        group_by(AQSID, Pollutant, Year, cal_quarter) %>%  # Check completeness for every quarter
        summarize(complete              = complete[1], 
                  valid_annual_samples  = valid_annual_samples[1],
                  valid_quarter_samples = valid_quarter_samples[1])

# Collapse table to 1 row per unique site-pollutant-year
data <- data %>%          
        group_by(AQSID, Pollutant, Year) %>%  
        summarize(complete            = sum(complete, na.rm = TRUE) == 4,
                  annual_samples      = valid_annual_samples[1],
                  min_quarter_samples = min(valid_quarter_samples, na.rm = TRUE))

Figure 7.4: Collapsed table showing completeness status.
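One design choice in the annual collapse is the `== 4` test: a year counts as complete only when all four quarters are present and complete. The base R sketch below, on hypothetical site names and data, shows how this differs from simply asking whether all available quarters are complete.

```r
# Hypothetical quarter-level table: site_B has no rows at all for quarter 4
quarters <- data.frame(
  AQSID       = c(rep("site_A", 4), rep("site_B", 3)),
  Year        = 2012,
  cal_quarter = c(1:4, 1:3),
  complete    = TRUE
)

# Require four complete quarters, as in the annual collapse step
annual <- aggregate(complete ~ AQSID + Year, data = quarters,
                    FUN = function(x) sum(x, na.rm = TRUE) == 4)

# Alternative that only checks the quarters that have rows
annual_all <- aggregate(complete ~ AQSID + Year, data = quarters, FUN = all)

print(annual)
print(annual_all)
```

With the `== 4` test, site_B is marked incomplete because a whole quarter is missing. The all() version would mark it complete, since every quarter that appears in the table passed.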


References

Appendices of 40 CFR 50

