Completeness checks
Checking for data completeness before generating summaries ensures that results will be comparable from one year to the next. Without completeness checks, changes in the seasonal coverage from year to year may create the illusion of increasing or decreasing trends.
Completeness checks include tests for:
- Annual completeness: the number of samples required per year.
- Seasonal completeness: the number of samples required per season.
7.1 Completeness checks
Description
This section describes completeness guidelines and methods for performing completeness checks.
Air toxics reporting guidelines
Annual results are considered complete if the following conditions are met:
- Seasonal completeness: 75% or more of the expected samples were collected in each season, the colder months (January - April, November, December) and the warmer months (May - October).
- Annual completeness: both seasons are complete.
Criteria pollutant reporting guidelines
Completeness rules for criteria pollutant design values are defined in the Appendices of 40 CFR 50.
In general, these rules apply:
- Annual completeness: 75% or more of samples collected per year.
- Calendar quarter completeness: 75% or more of samples collected in each calendar quarter.
Recommended steps for air toxics
- Based on the monitoring schedule, record the total expected samples for each year and season.
- Air toxics monitors follow a fixed sampling schedule provided by EPA’s Air Toxics Calendar.
- Count the number of valid samples for each year and season.
- Divide the number of valid samples by the number of expected samples.
- Mark annual results as incomplete if one or both seasons do not fulfill the completeness checks.
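The steps above can be sketched in R as follows. This is a minimal illustration rather than an official implementation: the `season_of()` helper, the 2012 start date, and the made-up concentrations are all assumptions chosen so that the colder season falls short of 75%.

```r
library(dplyr)

# Hypothetical helper: warmer months are May - October, all others are colder months
season_of <- function(date) {
  month_num <- as.integer(format(date, "%m"))
  ifelse(month_num %in% 5:10, "warm", "cold")
}

# Assumed schedule: one sample every 6 days, starting on a made-up date
expected <- tibble(Date = seq(as.Date("2012-01-05"), as.Date("2012-12-31"), by = "6 days"))

# Made-up results: samples from January and February are missing (NA)
samples <- tibble(Date = expected$Date,
                  Concentration = ifelse(format(expected$Date, "%m") %in% c("01", "02"), NA, 1.5))

# Step 1: expected samples per season
expected_counts <- expected %>%
  mutate(Season = season_of(Date)) %>%
  count(Season, name = "expected_samples")

# Step 2: valid samples per season
valid_counts <- samples %>%
  filter(!is.na(Concentration)) %>%
  mutate(Season = season_of(Date)) %>%
  count(Season, name = "valid_samples")

# Steps 3 and 4: divide valid by expected and apply the 75% rule
season_check <- left_join(expected_counts, valid_counts, by = "Season") %>%
  mutate(valid_samples   = coalesce(valid_samples, 0L),
         pct_samples     = valid_samples / expected_samples,
         season_complete = pct_samples >= 0.75)

# The year is complete only if both seasons are complete
annual_complete <- all(season_check$season_complete)
```

In this made-up year the warmer season is fully sampled, but the colder season has only 20 of 30 expected samples, so the annual result would be marked incomplete.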
Example R script
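Packages
# Load the packages used in this example
library(tidyverse)  # read_csv(), tibble(), mutate(), group_by(), summarize(), %>%
library(lubridate)  # ymd(), year(), quarter()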
Our example data is organized by monitoring site and date.
data <- read_csv('https://raw.githubusercontent.com/MPCA-air/air-methods/master/airtoxics_data_2009_2013.csv')
Step 1: Find the expected number of samples.
Monitors in EPA’s Air Quality System are required to follow the Air Toxics Monitoring Calendar. The sampling schedule for air toxics is generally one sample every 6 days. Depending on the date of the first scheduled sample and whether it is a leap year, the expected number of samples for the year will be either 60 or 61. If you are uncertain about the sampling schedule for your data, consult the lab to confirm the expected number of samples.
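The 60-or-61 range follows from simple date arithmetic, sketched below in base R. The `count_samples()` helper and the first-sample dates are hypothetical and chosen only for illustration; real sampling dates come from EPA's calendar.

```r
# Hypothetical helper: count the dates on a 1-in-6-day schedule
# from the first sample of the year through December 31
count_samples <- function(first_sample, year) {
  first_sample <- as.Date(first_sample)
  year_end <- as.Date(paste0(year, "-12-31"))
  length(seq(first_sample, year_end, by = "6 days"))
}

# Non-leap year: the count depends on when the first sample falls
early_start <- count_samples("2013-01-01", 2013)
late_start  <- count_samples("2013-01-06", 2013)

# Leap year: the extra day makes room for a 61st sample even with a late start
leap_late_start <- count_samples("2012-01-06", 2012)
```

Here `early_start` is 61, `late_start` is 60, and `leap_late_start` is 61.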
Entering the start and end date into the sample_calendar() function below will create a list of the expected sampling dates based on EPA’s air toxics monitoring schedule.
# Create a sampling calendar based on EPA's air toxics monitoring schedule
sample_calendar <- function(start        = "2012-01-01",
                            end          = "2016-12-31",
                            day_interval = 6,
                            type         = "air_toxics") {

  library(lubridate)

  # Convert 'start' and 'end' to class date
  start <- ymd(start)
  end   <- ymd(end)

  # Set official start date to selected EPA calendar
  if (type == "air_toxics") {
    epa_start <- ymd("1990-01-09")
  } else {
    epa_start <- start
  }

  # Create full table of sampling dates
  calendar <- seq(from = epa_start,
                  to   = end,
                  by   = paste(day_interval, "days"))

  # Subset to user's date range
  calendar <- calendar[calendar >= start & calendar <= end]

  # Print total sampling days and date range
  cat(length(calendar), "sampling days from", as.character(min(calendar)),
      "to", as.character(max(calendar)), "\n")

  return(calendar)
}
Find the date range of the data and create the expected sampling schedule with the function above.
# Find the date range of your data
date_range <- range(data$Date)
# Create expected sample calendar, extending the range to full calendar years
epa_schedule <- tibble(Date = sample_calendar(start = format(date_range[1], "%Y-01-01"),  # First day of the first year
                                              end   = format(date_range[2], "%Y-12-31"),  # Last day of the last year
                                              day_interval = 6))
## 304 sampling days from 2009-01-05 to 2013-12-28
# Add year and calendar quarter columns
epa_schedule <- epa_schedule %>%
  mutate(Year        = year(Date),
         cal_quarter = quarter(Date))
# Count the expected number of samples per quarter and year.
epa_schedule <- epa_schedule %>%
  group_by(Year, cal_quarter) %>%
  summarize(expected_quarter_samples = length(unique(Date))) %>%
  group_by(Year) %>%
  mutate(expected_annual_samples = sum(expected_quarter_samples))
Expected number of samples.
Step 2: Count number of valid samples.
# Assign each date to a calendar quarter
data <- data %>% mutate(cal_quarter = quarter(Date))
# Count the number of sampling dates with a valid (non-missing) result for each quarter and year.
data <- data %>%
  group_by(AQSID, Param_Code, CAS, Year, cal_quarter) %>%
  mutate(valid_quarter_samples = length(unique(Date[!is.na(Concentration)]))) %>%
  group_by(AQSID, Param_Code, CAS, Year) %>%
  mutate(valid_annual_samples = length(unique(Date[!is.na(Concentration)])))
Step 3: Divide the number of valid samples by the number of expected samples.
# Join expected sample table to data by quarter and year columns
data <- left_join(data, epa_schedule, by = c("Year", "cal_quarter"))
# Divide valid samples by expected samples
data <- data %>%
  group_by(AQSID, CAS, Year, cal_quarter) %>%
  mutate(pct_quarter_samples = valid_quarter_samples / expected_quarter_samples) %>%
  group_by(AQSID, CAS, Year) %>%
  mutate(pct_annual_samples = valid_annual_samples / expected_annual_samples)
Step 4: Mark results as incomplete if they fail either of the completeness checks.
# Set 'complete' to TRUE only if both the quarterly and annual checks pass
data <- data %>%
  ungroup() %>%
  mutate(complete = pct_quarter_samples >= 0.75 &
                    pct_annual_samples >= 0.75)
The result of these checks is stored in the complete column.
Step 6 (Optional): By collapsing the complete status to 1 row per site-pollutant-quarter combination, you can save the table and attach a site’s completeness status to a future data analysis as needed.
# Collapse table to 1 row per unique site-pollutant-quarter
data <- data %>%
  group_by(AQSID, Pollutant, Year, cal_quarter) %>%  # Check completeness for every quarter
  summarize(complete              = complete[1],
            valid_annual_samples  = valid_annual_samples[1],
            valid_quarter_samples = valid_quarter_samples[1])
# Collapse table to 1 row per unique site-pollutant-year
# A year is complete only if all 4 of its quarters are complete
data <- data %>%
  group_by(AQSID, Pollutant, Year) %>%
  summarize(complete            = sum(complete, na.rm = TRUE) == 4,
            annual_samples      = valid_annual_samples[1],
            min_quarter_samples = min(valid_quarter_samples, na.rm = TRUE))
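Attaching the saved completeness status to a later analysis could look like the sketch below. Both tables are hypothetical stand-ins with made-up site IDs and values: `completeness` plays the role of the collapsed site-pollutant-year table, and `annual_means` represents some future summary you want to screen.

```r
library(dplyr)

# Hypothetical saved completeness table (1 row per site-pollutant-year)
completeness <- tibble(AQSID     = c("site_A", "site_A"),
                       Pollutant = c("Benzene", "Benzene"),
                       Year      = c(2012, 2013),
                       complete  = c(TRUE, FALSE))

# Hypothetical summary table from a later analysis (made-up values)
annual_means <- tibble(AQSID       = c("site_A", "site_A"),
                       Pollutant   = c("Benzene", "Benzene"),
                       Year        = c(2012, 2013),
                       annual_mean = c(0.52, 0.61))

# Attach the completeness status to each annual summary
flagged <- left_join(annual_means, completeness,
                     by = c("AQSID", "Pollutant", "Year"))

# Keep only the years that passed the completeness checks
reportable <- filter(flagged, complete)
```

Joining on the site, pollutant, and year columns keeps the completeness flag separate from the raw data, so the same saved table can be reused across analyses.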