1 Abstract

This week we will use the R language and statistical environment to analyze air quality data from Pocatello, Idaho, obtained directly from the Environmental Protection Agency (EPA) Air Quality System (AQS) database. Our aim is to determine if any patterns exist in air quality with respect to year and time of year. Before lab, read this handout and visit the websites listed in the handout.

2 Criteria Pollutants

The US government sets standards for six air pollutants called criteria pollutants (see box below). The standards for pollutants are based on requirements for the protection of human health, under the Clean Air Act. The Clean Air Act defines two types of ambient air quality standards.

Enforcement of criteria pollutant standards is the job of US Environmental Protection Agency (EPA). EPA information concerning criteria pollutants can be found here.

Criteria Pollutants

Pollutant Health concern
Sulfur Dioxide (SO2) Respiratory tract problems, permanent harm to lung tissue.
Carbon Monoxide (CO) Ability of blood to carry oxygen impaired, cardiovascular, nervous, and pulmonary systems affected.
Nitrogen Oxide (NOx) Respiratory illness and lung damage.
Ozone (O3) Respiratory tract problems (difficult breathing, reduced lung function), asthma, eye irritation, nasal congestion, reduced resistance to infection, possible premature aging of lung tissue.
Lead Retardation and brain damage, especially in children.
Particulates Eye & throat irritation, bronchitis, lung damage, and impaired visibility.


Particulate Matter

In today’s lab we will focus on one class of criteria pollutants, particulate matter. The size of particle pollutants is directly linked to its potential for causing health problems. The EPA is mainly concerned about particles that are 10 micrometers = 10 microns in diameter or smaller because these particles can easily pass through the throat and nose and enter the lungs. Once inhaled, these particles can affect the heart and lungs and cause serious health problems. Particulate pollution can be made more severe by the presence of atmospheric inversion layers that prevent the dispersion of relatively cool, polluted air trapped near the ground. Inversions are particularly problematic for urban areas adjacent to mountains during the winter (Fig 2.1). The EPA groups particle pollution into two categories PM10 and PM2.5.


Inversion effects in Salt Lake City

Figure 2.1: Inversion effects in Salt Lake City

PM10

Particles between 0 and 10 microns in diameter are called PM10 particles. These “inhalable coarse particles” often have high concentrations near roadways and dusty industries. Wood stoves may also cause high levels of PM10 pollution. Long-term exposure to high levels of PM10 can increase the likelihood of developing chronic respiratory conditions such as bronchitis. Current EPA standards require that daily PM10 concentrations of 150 μg/m3 not be exceeded more than once per year, on average, over 3 years.

PM2.5

In July 1997 the EPA established a new standard for air quality, based on particulate matter that is less than 2.5 microns in diameter (PM2.5). This new standard was created in response to health studies indicating that these smaller particles can cause serious health problems. PM2.5 particles can be directly emitted from sources such as forest fires, or they can form when gases emitted from power plants, industries and automobiles react in the air. Current EPA standards have primary, secondary, and primary and secondary specifications for PM2.5 particles:

  • Primary: An annual daily mean PM2.5 particle concentration of 9.0 μg/m3, averaged over 3 years, cannot be exceeded.
  • Secondary: An annual daily mean PM2.5 particle concentration of 15 μg/m3, averaged over 3 years, cannot be exceeded.
  • Primary and Secondary: The 98th percentile of daily PM2.5 measures cannot exceed 35 μg/m3, when considered over three years.

Pocatello-Chubbuck Non-attainment and State Implementation Plan

The EPA monitors criteria pollutants, often in cooperation with state agencies like the Idaho Department of Environmental Quality (IDEQ). When the health-based standards are not met by a site, that site is considered to be in non-attainment. Given a non-attainment designation, the EPA requires that the state develop a State Implementation Plan (SIP), to show how air quality will be improved. The Pocatello-Chubbuck area was designated as non-attainment because of particularly high PM10 levels that were recorded in the early 1990’s.

The steering committee that developed Pocatello’s SIP identified smoke from wood stoves and fireplaces as one of the causes of high PM10 in the Pocatello-Chubbuck area. In an attempt to reduce PM10 associated with wood smoke, the Pocatello and Chubbuck city councils passed local air quality ordinances that restrict the materials that can be burned, limit the use of wood stoves and fireplaces, and require that new buildings have a heating source other than a wood stove or fireplace. These regulations prohibit burning when PM10 exceeds 120 μg/m3, and called for a voluntary ban when PM10 exceeds 80 μg/m3.

The IDEQ has been monitoring PM10 and PM2.5 particles, in the Pocatello, since 1990, at the intersection of Garrett and Gould Roads.

3 R

R is a popular open source programming language and computational environment. It can be downloaded for free here. All ISU computer labs are currently equipped with R.

Open R on your workstation. To start with, we can use R as a calculator. Type 2 + 2 and click Enter.

2 + 2
[1] 4

The output term [1] means, “this is the first requested element.” In this case there is just one requested element, \(4\), the solution to \(2 + 2\). If the output elements cannot be held on a single console line, then R would begin the second line of output with the element number comprising the first element of the new line. For instance, the command rnorm(20) will take 20 pseudo-random samples from a standard normal distribution. We have:

rnorm(20)
 [1] -0.03106545 -0.53677292  1.66554225  0.32432316 -0.37972590  2.05901542
 [7]  0.23694828  0.18449073  0.54924468 -0.28540694 -0.20888245  0.39782478
[13] -1.17042364 -0.86490787  0.22263080 -0.13161842 -0.35473139  0.30858531
[19] -0.57922130  0.54571807

Like Python, which we learned about earlier this semester, R is an Object Oriented Programming (OOP) language. Type:

obj1 = 2 + 2 

The result, 4, can be accessed by typing the object name: obj1 (or whatever I named the object).

obj1
[1] 4

Here I create a numerical array of the numbers, 1, 5, and 7, using the function c, which means “combine”.

obj2 = c(1,5,7)

I can use R in various ways. For instance, here I add obj1 to every element of obj2, and take the natural log of the sum.

log(obj1 + obj2)
[1] 1.609438 2.197225 2.397895

Notes to ourselves, and others, can be made at the R command line using the # symbol.

# Natural log of the sum of obj1 and obj2:
log(obj1 + obj2)
[1] 1.609438 2.197225 2.397895

4 Using R to Obtain and Analyze EPA Air Quality Data

Today we will primarily be using R to obtain and analyze Pocatello PM10 and PM2.5 data from the massive US EPA Air Quality System (AQS) database and API1. Details concerning the AQS can be found here.

Download the GG10daily and GG25daily binary datasets which contain PM10 and PM2.5 data from 12/31/1980 to 12/31/2025 from IDEQ air sensors installed at the corner of Garret and Gould streets in Pocatello. The data were acquired, via R, from the EPA-AQS database. Set the R working directory directory to be the location of the downloaded files. Use forward slashes instead of backslashes to designate nested directories and files.

setwd("C:/Users/My_directory") # for instance...
GG10daily <- readRDS("GG10daily") # or use G10daily <- readRDS(file.choose())
GG25daily <- readRDS("GG25daily") # or use G25daily <- readRDS(file.choose())

Questions

Your job is to produce a summary report on trends in air quality in Pocatello, Idaho. Assume that this report will be used to determine the eligibility of the City of Pocatello for Federal funds. If air quality standards are not met, the availability of Federal funds for things such as highway projects will be reduced. Run all code in Section 4 to obtain data necessary to answer the questions below.

  1. To get warmed up, in R, create a numerical object with your name and perform some simple math with it. Provide snapshots of your work (and all your work in succeeding questions).
  2. Obtain some metadata for the data you have downloaded.
    1. Record the number of days that we have records for PM10 and PM2.5, using nrow(GG10daily) and nrow(GG25daily), respectively.
    2. Record the first and last datapoint we have for PM10 and PM2.5, using min(GG10daily$date) and max(GG10daily$date), and min(GG25daily$date) and max(GG25daily$date), respectively.
  3. Consider patterns of average annual PM10 and PM2.5 concentrations.
    1. Using the code below, obtain plots for average annual PM10 and PM2.5 concentrations as a function of year. Create appropriate entries for xlab and ylab. Paste the figures into your report.
    2. Summarize and interpret the figures from (a).
year10 <- format(GG10daily$date, "%Y") # PM 10 years from data
year25 <- format(GG25daily$date, "%Y") # PM 2.5 years from data

# yearly averages of daily averages
yearly.avg10 <- tapply(GG10daily$avg, year10, mean) # annual mean for PM 10 
yearly.avg25 <- tapply(GG25daily$avg, year25, mean) # annual mean for PM 2.5 

#--------------------------- plot for PM 10 ------------------------------#
plot(unique(year10), yearly.avg10, type = "l", xlab = "", ylab = "")

#--------------------------- plot for PM 2.5 -----------------------------#
plot(unique(year25), yearly.avg25, type = "l", xlab = "", ylab = "")

  1. Next, we will consider non-attainment criteria for PM10. Recall that non-attainment occurs if daily PM10 concentrations of 150 μg/m3 are exceeded more than once per year, on average, over 3 years. Use the code below to: 1) Identify (with ones) days with PM10 concentrations \(>\) 150 μg/m3 and identify with (zeroes) days with PM10 concentrations \(\leq\) 150 μg/m3, 2) Obtain annual counts of days with PM10 concentrations \(>\) 150 μg/m3, 3) Obtain three year moving averages of annual counts of days with PM10 concentrations \(>\) 150 μg/m3, and 4) Create a plot of the results.
    1. What do the R objects gtr.than.150, yearly.count and moving_avg contain in the code below? To actually see the contents of these object you can type their names in the console and hit Enter.
    2. Create appropriate entries for xlab and ylab for the plot resulting from step 4 below. Paste the figure into your report.
    3. Summarize the figure. What are the patterns of non-attainment? What do you think recent non-attainment outcomes (if any) are due to?
    4. Do we have a complete dataset for 2025? Could this affect non-attainment results for 2025? Why?
#---------------------- Step 1 (ID days with high PM 10 conc.) ----------------------------#
gtr.than.150 <- ifelse(GG10daily$avg > 150, 1, 0) # ID days with > 150 ug/mm^3?
#----------------- Step 2 (Get annual counts of high PM 10 days) --------------------------#
yearly.count <- tapply(gtr.than.150, year10, sum) # count days with > 150 ug/mm^3, by year
#-------------- Step 3 (Get 3 yr moving averages, requires zoo package) -------------------#
install.packages("zoo")
library(zoo)
# k = 3, align = "right": years 1,2,3 used to calculate moving avg for yr 3, and so on...  
moving_avg <- rollmean(yearly.count, k = 3, fill = NA, align = "right")
#------------------------------ Step 4 (create plot) --------------------------------------#
plot(x = 1990:2025, y = moving_avg, type = "l", xlab = "", ylab = "")
abline(h = 1, lty = 2, col = 2)

  1. Next, we will consider non-attainment patterns for PM2.5. Recall that the primary standard for PM2.5 particles was that an annual daily mean PM2.5 particle concentration of 9.0 μg/m3, averaged over 3 years, could not be exceeded, and that the secondary standard was that an annual daily mean PM2.5 particle concentration of 15 μg/m3, averaged over 3 years, could not be exceeded. To address this question we will use the object yearly.avg25 form Q 2. The object contains annual average PM2.5 concentrations. In the code below a three year moving average of concentrations is taken and a plot is generated.
    1. From the code below, obtain the plot. Create appropriate entries for xlab and ylab. Paste the figure into your report.
    2. Summarize the figure. What are the patterns of non-attainment (if any).
    3. IDEQ measures for PM2.5 stopped in 2006 at the Garret and Gould site. Does this seem reasonable?
    4. Why do you think the moving average line missing for the years 1998 and 1999? Note that this also occurs for the previous moving average plot we made for PM10, although it is harder to see.
moving_avg <- rollmean(yearly.avg25, k = 3, fill = NA, align = "right") 
plot(1998:2006, moving_avg, type = "l", xlab = "", ylab = "", ylim = c(6,15))
abline(h = 9, lty = 2, col = 2)
abline(h = 15, lty = 2, col = 1)
legend("topright", lty = 2, legend = c("Primary", "Secondary"), col = c(2,1))

  1. Perform time series analyses for the monthly PM10 and PM10 data. Use the code below. Your TA will help you interpret the plots2.
    1. Paste resulting figures and code output (monthly means of PM10 and PM10 concentrations) into your report.
    2. Take several sentences to summarize your monthly time series results and your tabulation of monthly means. Based on your findings, what are some likely seasonal causes of poor air quality in Pocatello, Idaho, and what recommendations would you make to improve air quality?
##############################################################################
#------------------------- time series analysis -----------------------------#
##################################$###########################################

#-------------------------- PM 10 -------------------------#

month10 <- format(GG10daily$date, "%m") # PM 10 months from data
month_yr10 <- paste(year10, month10, sep = "-") # PM 10 year AND months
monthavg10 <- tapply(GG10daily$avg, month_yr10, mean) # PM 10 averages for year AND month combinations
pmts10 <- ts(monthavg10, start = c(1990, 11),frequency = 12) 
plot(stl(pmts10, s.window = "periodic")) 

mos <- c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
tapply(GG10daily$avg, factor(month10, labels = mos), mean) # PM 10 monthly summary
     Jan      Feb      Mar      Apr      May      Jun      Jul      Aug 
16.57457 17.54545 18.27233 18.50579 20.32230 22.57314 27.79355 34.09264 
     Sep      Oct      Nov      Dec 
26.58451 20.47404 16.35895 16.26393 
#-------------------------- PM 2.5 -------------------------#

month25 <- format(GG25daily$date, "%m") # PM 2.5 months from data
month_yr25 <- paste(year25, month25, sep = "-") # PM 2.5 year AND months
monthavg25 <- tapply(GG25daily$avg, month_yr25, mean) # PM 2.5 averages for year AND month combinations
pmts25 <- ts(monthavg25, start = c(1998, 11),frequency = 12) 
plot(stl(pmts25, s.window = "periodic")) 

tapply(GG25daily$avg, factor(month25, labels = mos), mean) # PM 2.5 monthly summary
      Jan       Feb       Mar       Apr       May       Jun       Jul       Aug 
15.865625 15.123333  7.419658  5.107692  5.047967  5.442149  6.800000  8.262766 
      Sep       Oct       Nov       Dec 
 6.519540  5.656044  8.400000 17.557282 

Footnotes


  1. API stands for Application Programming Interface. An API provides glue code that allows one software framework (e.g., R) to work directly with a foreign software system.↩︎

  2. The stl function provides a seasonal decomposition of a time series using a smoother algorithm. The function produces four related plots with names: data, seasonal, trend, and remainder. The seasonal component is found by direct application of a loess smoothing algorithm to monthly means. To find the trend, seasonal values are differenced, and those differences are smoothed. The remainder component is comprised of the residuals from the seasonal plus trend fit. Gray bars to the right of plots allow comparison of changes from the four perspectives. The height of the box in plot 1, compared to height of the box in plot 2, describes a comparable unit change in plot 1, compared to plot 2.↩︎