Using the new R package, FedData, to access federal open datasets (including interactive graphics)

The FedData package (created by R. Kyle Bocinsky) is a great new R package that provides easy access to some important federal datasets. The package is well-designed and provides functions to download climate, elevation, hydrography and other data for your area of interest. The following five sources of data are currently available for download with future ones in the works:

  • The National Hydrography Dataset (USGS)
  • The National Elevation Dataset digital elevation models (1 and 1/3 arc-second; USGS)
  • The Soil Survey Geographic (SSURGO) database from the National Cooperative Soil Survey (NCSS)
  • The Global Historical Climatology Network (GHCN), coordinated by National Climatic Data Center at NOAA
  • The International Tree Ring Data Bank (ITRDB), coordinated by National Climatic Data Center at NOAA

For this post we'll focus on the climate (GHCN) and elevation (NED) datasets and we’ll provide code for doing the download, processing and creation of both static and interactive graphics.

Blog Post Contents

1. Load the libraries

For the data download you only really need the FedData package but we will be manipulating and mapping the data and will load some of our favorite data manipulation and spatial libraries.

library(FedData)    # for downloading federal data
library(rgdal)      # for reading shapefiles
library(ggmap)      # for mapping and more
library(raster)     # for defining extents and raster processing
library(dplyr)      # for data manipulation
library(leaflet)    # for mapping
library(rgeos)      # for using well known text (readWKT)
library(tidyr)      # for reformatting data.frames
library(xts)        # for working with time series
library(dygraphs)   # for the interactive time series plot
library(sp)         # for working with spatial objects
library(tigris)     # for downloading geography

2. Define your map extent

There are a series of “get” functions in FedData that allow you to get data (get_ned, get_ghcn_daily) and each one requires that you define the extent – your area of interest. You can use standard R spatial objects created by, for example, reading in shapefiles or you can take advantage of functions from the FedData package to help you define the extent. Below we outline several different approaches for defining your extent and you can pick the one that works best for you. For this particular post we will focus on our home area, Ithaca, NY.

a) Define extent: Use an existing shapefile

If you already have a shapefile with the extent you're interested in you can read it in and use this for the extent. In a previous post we describe how to read a shapefile into R using the rgdal library and the basic syntax is as follows:

# Get our Ithaca bounding box
library(rgdal)
extentA <- readOGR(dsn=".", layer="ithaca", verbose = FALSE)
sp::plot(extentA, col="blue") # not so exciting!

plot1-1

b) Define extent: Use a bounding box and FedData's polygon_from_extent function

If you don't already have a shapefile to use, you can define the extent based on coordinates of a bounding box. The FedData package comes with a handy function that takes, as input, an extent object created using the raster package. The extent can be created using just four coordinates that define your bounding box.

To define your extent you need an xmin, xmax, ymin and ymax coordinate and you can define these manually, but we will take advantage of ggmap to get these coordinates.

Start by running get_map to get a ggmap object.

# The function qmap will get and create a map but we need get_map for coords
library(ggmap)
ithaca<-get_map("Ithaca, NY", zoom=11) 
ggmap(ithaca) #plot the map

plot2-1

Using our ggmap object pull out the bounding box coordinates using the attr function. Note that you can also access the bbox slot directly with ithaca@bbox (but it's in a different format).

bb<-attr(ithaca, "bb")
bb
##     ll.lat    ll.lon   ur.lat    ur.lon
## 1 42.28135 -76.72126 42.60564 -76.28181

Now feed your coordinates into the extent and polygon_from_extent functions. Note that a double colon (::) specifies which package the function comes from. If you load the raster package this is not necessary except “extent” is such a common word that you might worry that other packages have a function named “extent” in which case you can be specific.

#xmin, xmax, ymin, ymax
extentB<-polygon_from_extent(raster::extent(bb$ll.lon, bb$ur.lon,
     bb$ll.lat, bb$ur.lat), 
     proj4string="+proj=longlat +ellps=GRS80 +datum=NAD83 +no_defs")
class(extentB)
## [1] "SpatialPolygons"
## attr(,"package")
## [1] "sp"

To confirm that you have this right:

# Perfect!
ggmap(ithaca) + geom_polygon(aes(x=long, y=lat), size=3, color='purple',
  data=extentB, fill=NA)

plot3-1

c) Define extent: Use the tigris package to extract boundaries

The tigris package offers a lot of great functions for extracting boundaries. If you want more detail on tigris we have a detailed post on using thetigris library. As an example, if you wanted your extent to be a particular county (in this case Ithaca is in Tompkins County in NYS) you could use the following code:

library(tigris)
extentC<-counties(state="NY") # county extent, not a bounding box
extentC<-extentC[extentC$NAME == "Tompkins",]
sp::plot(extentC, col="firebrick", border=NA)

plot4-1

d) Define extent: Extra credit, define a bounding box manually

You can also use what's known as Well Known Text, a geographic markup language, to manually define a polygon and use this with the function readWKT from the rgeos package to create your spatial object.

As an aside, there is a known bug in ggplot2 where it can mangle polygons that end up partly outside the extent of a map. As a result, I'm using carefully rounded coordinates so that the map will show the box, but this is not necessary if you're not using ggmap to create the map.

extentD<-readWKT("POLYGON(( -76.71 42.29,  -76.71 42.6,
    -76.29 42.6,  -76.29 42.29,  -76.71 42.29))")
proj4string(extentD)<-"+proj=longlat +ellps=GRS80 +datum=NAD83 +no_defs"
ggmap(ithaca) + geom_polygon(aes(x=long, y=lat), size=3, color='purple',
  data=extentD, fill=NA)

plot5-1

3. Download elevation and climate data

You can use any of the defined extents above, I will use the one created using the FedData function polygon_from_extent (extentB).

Now that we have our extent defined let's download some data. Keep in mind that based on your extent this step might take a few minutes. Also you may want to set a working directory, otherwise the data will be downloaded to your default directory.

setwd("d:/junk")


# Download NED elevation data. This will return a raster
# of elevation data cropped to our extent area. Res = "1" 
# refers to 1 arc-second NED. This is the default. To 
# download 1/3 arc-second NED use res = "13". I'm adding 
# force.redo = F to both download functions. This will 
# check to see if the data already exists in which case 
# it will not re-download it.
ned_ithaca<-get_ned(template=extentB, label="ned_ithaca", 
    res="1", force.redo = F)


# Download GHCN daily climate data. We're interested in 
# precipitation (PRCP) but many other climate-related
# elements exist (i.e. max and min temp, snowfall, etc).
# See the `FedData` documentation for more details. Also 
# of note is that a character vector of station IDs can 
# be used in place of our extent object
ghcn_ithaca<-get_ghcn_daily(template=extentB, label="ghcn_ithaca",
    elements="prcp", force.redo = F)

Using the R plot function from the sp package to take a quick look at the data.

# Not pretty but otherwise so far so good!
sp::plot(ned_ithaca)
sp::plot(ghcn_ithaca$spatial, add=TRUE)

plot6-1

4. Explore the GHCN climate data

GHCN spatial component

As you may have noticed in the above plot, we access the “spatial” component of the GHCN data for adding points to the map. The other component of the GHCN download is the “tabular” piece. This is where the actual climate data is stored. Here we'll use the spatial component one more time and map the points on our Ithaca map.

# Plot the data using ggmap and ggplot2. Note 
# that we convert our spatial coordinates to a 
# data.frame for plotting with ggmap.
ghcn_pts<-data.frame(ID = ghcn_ithaca$spatial$ID,
    lat = ghcn_ithaca$spatial@coords[,2],
    long = ghcn_ithaca$spatial@coords[,1])
ghcn_pts<-mutate(ghcn_pts, ID = as.character(as.factor(ID)))


head(ghcn_pts)
##            ID     lat     long
## 1 US1NYSY0001 42.3907 -76.7176
## 2 US1NYTM0002 42.5666 -76.5700
## 3 US1NYTM0003 42.4317 -76.4886
## 4 US1NYTM0004 42.5913 -76.3720
## 5 US1NYTM0005 42.5253 -76.3218
## 6 US1NYTM0006 42.3520 -76.3184


# Plot the points again. Looking a little better!
ggmap(ithaca) + geom_point(aes(x=long, y=lat),
  size=3, color='red', data=ghcn_pts)

plot7-1

GHCN tabular component (includes interactive graphics)

Now let's have a look at the actual precipitation values. The GHCN output returns a large list where each list element is an individual station and each station contains a data.frame of precipitation values. If we had downloaded multiple climate elements there would be multiple tables associated with each station.

ghcn_dat<-ghcn_ithaca$tabular


# Pull out the PRCP (precipitation) tables. If there were other
# climate elements we could access these in a 
# similar way (i.e. ghcn_dat, "[[", "TMIN")
prcp<-lapply(ghcn_dat, "[[", "PRCP")


# Dissolve the list into a single data.frame and add a station id
# from the row names (then replace row names with a number sequence)
prcp<-do.call("rbind", prcp)
prcp$station<-substring(row.names(prcp),1,11)
row.names(prcp)<-1:nrow(prcp)


head(prcp)
##   YEAR MONTH D1  D2 D3 D4  D5  D6  D7  D8  D9 D10 D11 D12 D13 D14 D15 D16
## 1 2007    11 NA   0  0  0   0  48   5   0   0  20   0   0  46   0 157 170
## 2 2007    12  3  97 99 46   3  13   0   0   0  28   0 173   3  48   3 114
## 3 2008     1 15  33 15  0   0  36   0   0   3   0  33  28   0   8  20  25
## 4 2008     2 25 208  0  0  61 267 137  10   5   3   8   0  53  53   0  28
## 5 2008     3 30  10  0 43 302   8   0 157 185   0   0   3   8  NA  NA  NA
## 6 2008     4 46   0  0 30  46   0   0   0   0   0  28  20  NA  NA  NA  NA
##   D17 D18 D19 D20 D21 D22 D23 D24 D25 D26 D27 D28 D29 D30 D31     station
## 1   8   0  13  33   3   8  18   0   0  64 353   5   0   3  NA US1NYSY0001
## 2  51   0   0   3   3   0   0 216   0   0   0  15  20   0  74 US1NYSY0001
## 3   0   5   0  NA  NA   0   0   0   0   0   5   0   5  20   0 US1NYSY0001
## 4   0   8  20   0   0  18   8   3   0   0 147   3   0  NA  NA US1NYSY0001
## 5  NA   0  23 155  18   0   0   0   0  NA   0  NA  NA   0   3 US1NYSY0001
## 6  NA  NA  NA  NA  NA   0   0  74   0   0 300   5  89   0  NA US1NYSY0001

Doing monthly and yearly annual averages is cumbersome when each day is it's own column so let's convert this from “wide” format to “long” format using the function gather from the tidyr package.

# I'm going to convert to "long" format for easier calculations.
# The gather function is from the tidyr package
prcp<-gather(prcp, day, precip, -station, -YEAR, -MONTH)


# Add the date
prcp<-mutate(prcp,
    MONTH = stringr::str_pad(MONTH, 2, side="left", pad="0"),
    day = stringr::str_pad(gsub("D", "", day), 2,
        side="left", pad="0"),
    date = as.Date(paste(YEAR, MONTH, day, sep="-")))


# Exclude non NA values and limit to current years
prcp<-dplyr::filter(prcp, !is.na(date), !is.na(precip),
    date > as.Date("2008-01-01"),
    date < as.Date("2016-01-01"))
head(prcp)
##   YEAR MONTH     station day precip       date
## 1 2008    02 US1NYSY0001  01     25 2008-02-01
## 2 2008    03 US1NYSY0001  01     30 2008-03-01
## 3 2008    04 US1NYSY0001  01     46 2008-04-01
## 4 2008    05 US1NYSY0001  01      0 2008-05-01
## 5 2008    06 US1NYSY0001  01    157 2008-06-01
## 6 2008    07 US1NYSY0001  01      5 2008-07-01

There are some stations with very little data, let's take a look at daily counts and then filter to only stations with plenty of data

# Do a count by station using dplyr functions
cnt<-group_by(prcp, station) %>% 
    summarize(cnt = n()) %>% arrange(cnt) %>% 
    mutate(station = factor(station, levels = station))


ggplot(cnt, aes(station, cnt)) + 
    geom_bar(stat="identity", fill="cadetblue") + 
    coord_flip() +
    geom_hline(yintercept=2000) +
    ggtitle("Count of days with data at climate stations")


# semi_join does a regular inner_join but only keeps the resulting
# columns from the left table. Greater than or equal to 2000 days 
# looks like a good cutoff
prcp<-semi_join(prcp, dplyr::filter(cnt, cnt>=2000), by = "station")

plot8-1

Do some final formatting of the daily data so that we can use it in an interactive graphic using htmlWidgets.

# If you'd prefer to look at daily data instead of monthly comment this out
prcp<-group_by(prcp, YEAR, MONTH, station) %>% 
    summarize(precip  = mean(precip, na.rm=T)) %>% 
    mutate(date = as.Date(paste(YEAR, MONTH, "15", sep="-")))


# We will make it so that each column is a station
prcp.wide<-spread(prcp[,c("date", "station", "precip")], station, precip)
head(prcp.wide)
## Source: local data frame [6 x 12]
## 
##         date US1NYSY0001 US1NYTM0002 US1NYTM0004 US1NYTM0005 US1NYTM0015
##       (date)       (dbl)       (dbl)       (dbl)       (dbl)       (dbl)
## 1 2008-01-15    8.428571    0.000000    11.70000    11.13333          NA
## 2 2008-02-15   36.724138   25.000000    38.20690    48.21739          NA
## 3 2008-03-15   39.375000          NA    44.12903    51.41935     6.50000
## 4 2008-04-15   30.380952   26.200000    30.00000    24.56667    26.66667
## 5 2008-05-15   15.419355    8.518519    14.03704    19.60000    20.75862
## 6 2008-06-15   42.551724   23.066667    43.36667    38.63333    28.20000
## Variables not shown: US1NYTM0017 (dbl), US1NYTM0018 (dbl), US1NYTM0023
##   (dbl), US1NYTM0027 (dbl), USC00303050 (dbl), USC00304174 (dbl)


# Now apply a 30-day moving average (both sides) to smooth out the variation
prcp.wide[,2:ncol(prcp.wide)]<-lapply(prcp.wide[,2:ncol(prcp.wide)],
    function(x) stats::filter(x, rep(1/30, 30, sides=2)))


# Delete rows with no data and convert to a data.frame
howmanyNA<-rowSums(is.na(prcp.wide))
prcp.wide<-prcp.wide[howmanyNA != (ncol(prcp.wide)-1),]
prcp.wide<-data.frame(prcp.wide)

In the final step we will convert the object to a time series object and look at the daily data (with the 30-day moving average) using the great htmlWidget dygraphs.

prcp.xts<-xts(prcp.wide[,-1], order.by = prcp.wide[,1])
dygraph(prcp.xts) %>% dyRangeSelector()

Be sure to play around with this interactive graphic:

Summarize the data for each station. Here we'll calculate annual averages for 2015 and present them in a map.

NOTE: for simplicity we are not applying filters based on completeness which would be appropriate in a scientific setting. Please do not apply these methods blindly when doing your own analysis.

# Calculate annual averages and create a final data.frame of values
prcp.fin<-dplyr::filter(prcp, date > as.Date("2015-01-01")) %>% 
    group_by(station) %>% 
    summarize(prcpAvg = round(mean(precip, na.rm=T),2)) %>% 
    rename(ID = station)

prcp.fin<-left_join(prcp.fin, ghcn_pts, by= "ID")

Create Leaflet map of GHCN average annual data

leaflet() %>% addProviderTiles("CartoDB.Positron") %>%
    addCircleMarkers(data = prcp.fin, lng = ~long, lat = ~lat,
        radius = ~prcpAvg/3, stroke=F,
    popup = paste("<strong>Station ID</strong><br>", prcp.fin$ID,
        "<br><br><strong>Average Annual<br>Precipitation</strong><br>",
        prcp.fin$prcpAvg, " inches"), color = "#006B8C", fillOpacity = 0.7)

5. Explore the NED 1-arc second elevation data

In a previous blog post we discuss working with rasters using the raster library. That post is helpful if you're looking to crop, re-classify or re-sample your data. In addition we have a post on rendering a raster for use in Google Maps which highlights multiple methods of exporting and overlaying a raster in Google Maps. Given these posts we're only going to lightly explore our elevation raster. The goal is to extract elevation values at the station locations and lastly create a final map.

# Take a quick look at the elevation distribution (these are meters)
neddat<-data.frame(rasterToPoints(ned_ithaca))
names(neddat)[3]<-"elev"

ggplot(data=neddat, aes(elev)) + 
  geom_histogram(binwidth = 20, fill="sienna1", col="sienna3")

plot9-1

# Create SpatialPointsDataFrame from GHCN points
pts<-data.frame(prcp.fin)
coordinates(pts)<-~long+lat
proj4string(pts)<-"+proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs"


# Project the points and elevation raster to NY State Plane
ithprj<-"+proj=tmerc +lat_0=40 +lon_0=-76.58333333333333 +k=0.9999375 +x_0=250000 +y_0=0 +ellps=GRS80 +datum=NAD83 +to_meter=0.3048006096012192 +no_defs" 

ptsprj<-spTransform(pts, CRS(ithprj))
nedprj<-projectRaster(ned_ithaca, crs=ithprj)


# Extract elevation values at point locations
nedpts<-raster::extract(nedprj, ptsprj)


# Add elevation values to the points data and convert to feet
prcp.fin<-cbind(prcp.fin, nedpts)
prcp.fin<-mutate(prcp.fin, nedpts = round(nedpts/0.3048, 2))


head(prcp.fin)
##            ID prcpAvg     lat     long  nedpts
## 1 US1NYSY0001   26.58 42.3907 -76.7176 1384.90
## 2 US1NYTM0002   27.39 42.5666 -76.5700  849.02
## 3 US1NYTM0004   29.81 42.5913 -76.3720 1050.19
## 4 US1NYTM0005   30.10 42.5253 -76.3218 1063.56
## 5 US1NYTM0015   24.42 42.3223 -76.5899 1581.19
## 6 US1NYTM0017   30.39 42.4972 -76.5803 1021.68

6. Create the final map with elevation and climate data (includes interactive graphics)

Here we'll pull together all the pieces and create a relatively simple Leaflet map with our elevation raster and station locations. NOTE: depending on the size of your raster you may not be able to add the image to your Leaflet Map.

# Elevation colors were heavily borrowed from the
# Kansas Geological Survey's Elevation Map of Kansas:
# http://www.kgs.ku.edu/General/elevatMap.html
cols<-c("#06407F", "#317A9D", "#4ABEBB", "#40AE89", "#467B5D",
    "#3C6D4D", "#1A572E", "#034C00", "#045D03", "#6C975F", "#6B823A",
    "#88A237", "#C5D16B", "#DDE580", "#FFF6AE", "#FBCB81", "#F0B16A",
    "#F2B16D", "#D18338", "#B16F33", "#825337", "#66422A", "#4F2C0C")


pal<-colorNumeric(cols, values(nedprj),
  na.color = "transparent")


leaflet() %>% addProviderTiles("CartoDB.Positron") %>%
  addRasterImage(nedprj, colors = pal, opacity = 0.6) %>%
  addCircleMarkers(data = prcp.fin, lng = ~long,
      lat = ~lat, radius = ~prcpAvg/3, stroke=F,
      popup = paste("<strong>Station ID</strong><br>", prcp.fin$ID,
      "<br><br><strong>Average Annual<br>Precipitation</strong><br>",
      prcp.fin$prcpAvg, " inches<br><br><strong>Elevation</strong><br>",
          prcp.fin$nedpts, " feet"), color = "#006B8C", fillOpacity = 0.7)

7. Conclusion

For those who regularly rely on major federal data sources like the National Elevation Dataset and the Global Historical Climatology Network the FedData package is an extremely useful resource that can help simplify your workflow and improve the reproducibility of your analyses. Rather than visit a federal website, search for a data source, download a ZIP file, unzip, read into R and clip, you can take advantage of tools in FedData to streamline the work and make it easier to make changes. Thanks to R. Kyle Bocinsky for the package and the GitHub page mentions that the package is a product SKOPE (Synthesized Knowledge of Past Environments) and the Village Ecodynamics Project so be sure to check out those organizations.

2 responses

  1. Great post!
    Just as a thought, with mapview::popupGraph(…, type = “html”) you could include the dygraphs time series as a popup for the stations.
    Cheers, Tim

Leave a Reply to zev@zevross.com Cancel reply

Your email address will not be published. Required fields are marked *