Data validation

Environmental monitoring data can be subject to error. One way to catch these errors is to run routines that find data inconsistent with expectations or data that do not comply with data rules. The first of these routines should always be a visual inspection of the data in a variety of ways, which helps focus the validation steps that follow. Validation should be completed on a schedule that reflects the duration and frequency of the data and the resources available.

Some important data validation checks include:

  • 4.1 Leaks, instrument drift, and sample contamination
  • 4.2 Outliers and extreme values
  • 4.3 Sequential repeats and sticky numbers
  • 4.4 Insufficient unique values


There may be other data validation steps you want to carry out depending on the pollutant of interest.


Why are data validation steps important, above and beyond the typical laboratory calibration and qualifying procedures that always occur?

A decreasing signal or repeated sequential measurements may mean that the data are not interpretable. An exceptionally high value could indicate laboratory contamination, or, if the high value is accurate, a potential human health impact. Data validation steps are critical for these reasons.

Data that fail validation tests may imply that the laboratory equipment or monitoring system is in error. A failed validation test may also imply that there are extreme values at a site because of a significant air pollution source. In either case, a data validation routine is important to distinguish which of these is the case and which program to contact for further information.


Basic validation tests

  1. Visualize the data to help focus data validation steps.
  2. Identify potential instrument drift: Find a significant change in the variation of the data over the same time frame from one year to the next. This, or a sudden shift in the magnitude of the signal, could imply that there is drift, or that concentration results differ only because of a new calibration or instrument change.
  3. Identify extreme values: Find extreme values that may be in error, from contamination, or from a loss in signal.
  4. Identify sequential identical values (sticking): Find 3 or more sequential identical values within the same site and pollutant.
  5. Identify data sets with too few unique values.
  6. Report any failed test or significant finding to the laboratory if instrument failure is suspected. If there are several extreme values, communicate this to monitoring staff to find out if anything significant happened in the field. If there is no suggestion that the values could be from instrument or sampling issues, then a pollution source specialist should be consulted.



The following sections show examples of data validation steps.

Load example data

library(tidyverse)
library(stringr)
library(RcppRoll)
library(lubridate)

# Read five years of air toxics monitoring results.
data <- read_csv('https://raw.githubusercontent.com/MPCA-air/air-methods/master/airtoxics_data_2009_2013.csv')

# Assign shorter, consistent column names.
names(data) <- c("aqs_id", "poc", "param_code", "date", "conc", "null_code", "md_limit", "pollutant", "year", "cas")


Sample monitoring data


First, it’s always a good idea to plot the data. If there is only a small number of numeric data points, the statistical tests that follow may not be advisable.

Open the web tool below to chart the measured pollutants at each site. A script to recreate the web tool for other monitoring data is included as well.

Shiny Data tool
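
If you prefer to explore in R directly, below is a minimal plotting sketch using ggplot2 and the example data loaded above. The pollutant name is only an illustration; substitute any pollutant measured at your sites.

library(tidyverse)

# Faceted time series for one pollutant; "Benzene" is an illustrative
# choice -- replace it with a pollutant present in your data.
data %>%
  mutate(conc = as.numeric(conc),
         date = as.Date(date)) %>%      # adjust if dates are not yyyy-mm-dd
  filter(pollutant == "Benzene", !is.na(conc), abs(conc) < 999) %>%
  ggplot(aes(date, conc)) +
  geom_point(alpha = 0.4, size = 0.8) +
  facet_wrap(~aqs_id) +
  labs(x = NULL, y = "Concentration", title = "Benzene by monitoring site")

Faceting by site makes gaps in coverage and site-to-site differences easy to spot before running the statistical tests.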

4.1 Instrument drift or leaks in a system

The instrument drift test returns a difference in variance of carbon tetrachloride between the same calendar quarter in two consecutive years. This test works for VOC air toxics measurements only. Carbon tetrachloride was chosen because it is a banned substance no longer in use, it has very few below-detection-limit values, and it has no direct sources. It has seasonally variable concentrations, so the same calendar quarters were compared to avoid detecting seasonal differences. The statistical test used was Levene's test for homogeneity of variance. This test could be applied to other surrogate compounds where there are multiple analytes, or to single-analyte measurements (e.g., PM2.5). Another way to test for instrument drift is to test for significant differences between equal numbers of measurements before and after a calibration or tuning; a sketch follows the example script below. (See the site comparison section of this book for these tests.)


Sample R script

Click below to view an example of checking for instrument drift.

library(car)  # leveneTest()
library(DT)   # datatable() for displaying results

# dt_options holds display options for datatable(); it is assumed to be
# defined elsewhere in this book. A minimal placeholder is used here.
dt_options <- list(dom = "t")

data$date    <- as.Date(data$date)   # in case dates were read as text
data$conc    <- as.numeric(data$conc)
data$aqs_id  <- as.character(data$aqs_id)
sites        <- unique(data$aqs_id)
years        <- unique(data$year)
data$quarter <- quarter(data$date)
quarters     <- unique(data$quarter)
pocs         <- unique(data$poc)

# Convert concentrations to numeric and set missing value codes to NA.
clean_values <- function(data) {
  data$conc <- as.numeric(as.character(data$conc))
  data$conc[abs(data$conc) >= 999] <- NA
  return(data)
}

data <- clean_values(data)

# Return one cell [row, col] of the leveneTest() results table, or NA
# when there are not two groups to compare.
levene_function <- function(conc, quarter_year, row, col) {
  if (length(unique(quarter_year)) < 2) {
    return(NA)
  }
  data <- data.frame(conc = conc, quarter_year = quarter_year)
  return(leveneTest(conc ~ as.factor(quarter_year), data = data)[row, col])
}

# Compare each pair of consecutive years, quarter by quarter.
leak_table_unfiltered <- data.frame()

for (i in sort(years)[-length(years)]) {
  j <- i + 1
  data_carbontet <- filter(data, pollutant == "Carbon Tetrachloride",
                           !is.na(conc), year %in% c(i, j))
  data_carbontet$quarter_year <- paste0(data_carbontet$quarter, "_", data_carbontet$year)
  data_carbontet <- data_carbontet %>%
    group_by(aqs_id, poc, quarter) %>%
    summarise(fvalue_levene   = levene_function(conc, quarter_year, 1, 2),
              pval_levene     = levene_function(conc, quarter_year, 1, 3),
              deg_free_levene = levene_function(conc, quarter_year, 2, 1),
              Year_1          = min(year),
              Year_2          = max(year),
              Mean_Year_1     = mean(conc[year == min(year)], na.rm = TRUE),
              Mean_Year_2     = mean(conc[year == max(year)], na.rm = TRUE)) %>%
    ungroup()
  leak_table_unfiltered <- rbind(leak_table_unfiltered, data_carbontet)
}
leak_table <- filter(leak_table_unfiltered, pval_levene<0.01, abs(Year_1-Year_2)==1)
leak_table$Warning_Type <- "Decrease_in_Measurements"

datatable(head(leak_table, 10), options = dt_options) %>% formatSignif(c("fvalue_levene", "pval_levene", "Mean_Year_1", "Mean_Year_2"), digits = 2)
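
The before-and-after calibration comparison mentioned above can be sketched as follows. The calibration date here is a hypothetical placeholder, and the two-sample t-test is just one choice of test; see the site comparison section of this book for others.

# Hypothetical calibration date -- replace with a real date from the
# instrument logs.
calibration_date <- as.Date("2012-06-01")

one_site <- data %>%
  filter(pollutant == "Carbon Tetrachloride", !is.na(conc),
         aqs_id == sites[1]) %>%                  # one site at a time
  mutate(period = ifelse(date < calibration_date, "before", "after"))

# Use an equal number of measurements on each side of the calibration:
# the n closest before and the n closest after.
n <- min(sum(one_site$period == "before"), sum(one_site$period == "after"), 30)

before <- one_site %>% filter(period == "before") %>% arrange(desc(date)) %>% head(n)
after  <- one_site %>% filter(period == "after")  %>% arrange(date)       %>% head(n)

# Two-sample t-test for a shift in mean concentration.
t.test(before$conc, after$conc)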

4.2 Extreme values and outliers

We recommend testing air toxics for extreme values by comparing each measured concentration to 3 x the 75th percentile of the data set by year, site, and pollutant. For now, collocated measurements (POCs) are not tested separately. For data with higher temporal variability, such as black carbon and ultrafine particulates, a criterion of 3 x the 95th percentile of the data set by year, site, and pollutant should be used. There are a variety of outlier tests. The recommended criteria are sensitive enough to find outliers each year but not so sensitive that values reflecting real site conditions are flagged. For example, most air pollution measurements are much higher during fireworks, such as occur on July 4th and December 31st. These data are not in error; they are accurate representations of the air concentrations during those times.


Sample R script

Click the button to view an example.

# Test for exceptionally high values [above 3 x the 75th percentile].
# This can be modified to the 95th percentile for continuous or other
# short-duration data.
high_data <- data %>%
  group_by(year, aqs_id, pollutant) %>%
  mutate(AR_Mean          = mean(conc, na.rm = TRUE),
         Percentile_75    = quantile(conc, 0.75, na.rm = TRUE),
         Percentile_75_X3 = Percentile_75 * 3) %>%
  ungroup()

high_data <- filter(high_data, conc > Percentile_75_X3, !is.na(conc),
                    AR_Mean > md_limit, Percentile_75 > 0)


high_data$Warning_Type <- "Exceptionally_High_Value_Test"
high_data <- unique(high_data)

datatable(head(high_data, 10), options = dt_options) 
high_data <- high_data[, c("date", "pollutant", "aqs_id", "Warning_Type")]
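
For continuous or other short-duration data, such as black carbon, the 95th percentile criterion mentioned above is a small modification of the same test; a sketch follows.

# Variant for continuous or short-duration data: flag values above
# 3 x the 95th percentile by year, site, and pollutant.
high_data_95 <- data %>%
  group_by(year, aqs_id, pollutant) %>%
  mutate(Percentile_95    = quantile(conc, 0.95, na.rm = TRUE),
         Percentile_95_X3 = Percentile_95 * 3) %>%
  ungroup() %>%
  filter(conc > Percentile_95_X3, !is.na(conc), Percentile_95 > 0)

high_data_95$Warning_Type <- "Exceptionally_High_Value_Test"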

4.3 Sequential repeats and “sticky” numbers

Three or more sequential identical values may be the result of a machine error.

# Search for three or more repeating identical values.
repeat_data <- group_by(data, poc, year, aqs_id, pollutant) %>% 
  arrange(year, aqs_id, poc, pollutant, date) %>% 
  mutate(Previous_Day   = lag(conc, 1), 
         Two_Days_Prior = lag(conc, 2)) %>% ungroup()

repeat_data <- mutate(repeat_data,
                      Repeat_Test = round(conc, digits = 4) == round(Previous_Day, digits = 4) &
                                    round(Previous_Day, digits = 4) == round(Two_Days_Prior, digits = 4))

repeat_data$Warning_Type <- "Repeat_Test"

repeat_data <- filter(repeat_data, conc > 0, Repeat_Test)

datatable(head(repeat_data, 10), options = dt_options)
repeat_data <- repeat_data[, c("date", "pollutant", "aqs_id", "Warning_Type")]


Table: Dates and pollutants to check, including the potential issue.
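
The lag() approach above flags the third and later values of a run. An alternative sketch using run-length encoding flags every member of a run and makes the minimum run length an explicit parameter.

# Alternative: run-length encoding flags every value in a run of
# identical measurements of at least min_run length.
flag_runs <- function(x, min_run = 3) {
  runs <- rle(round(x, digits = 4))
  rep(runs$lengths >= min_run & !is.na(runs$values), runs$lengths)
}

repeat_data2 <- data %>%
  group_by(poc, year, aqs_id, pollutant) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(Repeat_Test = flag_runs(conc)) %>%
  ungroup() %>%
  filter(conc > 0, Repeat_Test)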

4.4 Unique detected values

In some cases the detected observations at a site may all be an identical value. Identical observations scattered throughout monitoring results at a single monitor can indicate a machine error. In addition, calculating summary statistics often requires a minimum number of unique values.


Sample R script

Click below to view an example of counting unique detected values.

Packages

library(tidyverse)

Our example data is organized by monitoring site and date.

data <- read_csv('https://raw.githubusercontent.com/MPCA-air/air-methods/master/airtoxics_data_2009_2013.csv')

Figure 4.1: Sample data table.


Count the number of unique detected values for each year.

data <- data %>% 
        group_by(AQSID, POC, Param_Code, CAS, Year) %>%
        mutate(detected_obs  = ifelse(Concentration < Dlimit, NA, Concentration),  # set non-detects to NA
               unique_values = n_distinct(detected_obs, na.rm = TRUE))


Flag sites with fewer than 3 unique values.

data <- data %>% 
          group_by(AQSID, POC, Param_Code, CAS, Year) %>%
          summarize(unique_values = unique_values[1],
                    unique_flag   = unique_values < 3) %>%
          ungroup()


Table: Final table with the added unique_flag column.
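
To pull out only the flagged site-years for follow-up, filter on the new column:

# Site-years with fewer than 3 unique detected values.
flagged_sites <- filter(data, unique_flag)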