The FedData
package (created by R. Kyle Bocinsky) is a great new R package that provides easy access to some important federal datasets. The package is well-designed and provides functions to download climate, elevation, hydrography and other data for your area of interest. The following five sources of data are currently available for download with future ones in the works:
- The National Hydrography Dataset (USGS)
- The National Elevation Dataset digital elevation models (1 and 1/3 arc-second; USGS)
- The Soil Survey Geographic (SSURGO) database from the National Cooperative Soil Survey (NCSS)
- The Global Historical Climatology Network (GHCN), coordinated by National Climatic Data Center at NOAA
- The International Tree Ring Data Bank (ITRDB), coordinated by National Climatic Data Center at NOAA
For this post we'll focus on the climate (GHCN) and elevation (NED) datasets and we’ll provide code for doing the download, processing and creation of both static and interactive graphics.
Blog Post Contents
1. Load the libraries
For the data download you only really need the FedData
package but we will be manipulating and mapping the data and will load some of our favorite data manipulation and spatial libraries.
library(FedData) # for downloading federal data
library(rgdal) # for reading shapefiles
library(ggmap) # for mapping and more
library(raster) # for defining extents and raster processing
library(dplyr) # for data manipulation
library(leaflet) # for mapping
library(rgeos) # for using well known text (readWKT)
library(tidyr) # for reformatting data.frames
library(xts) # for working with time series
library(dygraphs) # for the interactive time series plot
library(sp) # for working with spatial objects
library(tigris) # for downloading geography
2. Define your map extent
There are a series of “get” functions in FedData
that allow you to get data (get_ned
, get_ghcn_daily
) and each one requires that you define the extent – your area of interest. You can use standard R spatial objects created by, for example, reading in shapefiles or you can take advantage of functions from the FedData
package to help you define the extent. Below we outline several different approaches for defining your extent and you can pick the one that works best for you. For this particular post we will focus on our home area, Ithaca, NY.
a) Define extent: Use an existing shapefile
If you already have a shapefile with the extent you're interested in you can read it in and use this for the extent. In a previous post we describe how to read a shapefile into R using the rgdal
library and the basic syntax is as follows:
# Get our Ithaca bounding box
library(rgdal)
extentA <- readOGR(dsn=".", layer="ithaca", verbose = FALSE)
sp::plot(extentA, col="blue") # not so exciting!
b) Define extent: Use a bounding box and FedData's polygon_from_extent
function
If you don't already have a shapefile to use, you can define the extent based on coordinates of a bounding box. The FedData
package comes with a handy function that takes, as input, an extent object created using the raster
package. The extent can be created using just four coordinates that define your bounding box.
To define your extent you need an xmin, xmax, ymin and ymax coordinate and you can define these manually, but we will take advantage of ggmap
to get these coordinates.
Start by running get_map
to get a ggmap object.
# The function qmap will get and create a map but we need get_map for coords
library(ggmap)
ithaca<-get_map("Ithaca, NY", zoom=11)
ggmap(ithaca) #plot the map
Using our ggmap
object pull out the bounding box coordinates using the attr
function. Note that you can also access the bbox
slot directly with ithaca@bbox
(but it's in a different format).
bb<-attr(ithaca, "bb")
bb
## ll.lat ll.lon ur.lat ur.lon
## 1 42.28135 -76.72126 42.60564 -76.28181
Now feed your coordinates into the extent
and polygon_from_extent
functions. Note that a double colon (::) specifies which package the function comes from. If you load the raster
package this is not necessary except “extent” is such a common word that you might worry that other packages have a function named “extent” in which case you can be specific.
#xmin, xmax, ymin, ymax
extentB<-polygon_from_extent(raster::extent(bb$ll.lon, bb$ur.lon,
bb$ll.lat, bb$ur.lat),
proj4string="+proj=longlat +ellps=GRS80 +datum=NAD83 +no_defs")
class(extentB)
## [1] "SpatialPolygons"
## attr(,"package")
## [1] "sp"
To confirm that you have this right:
# Perfect!
ggmap(ithaca) + geom_polygon(aes(x=long, y=lat), size=3, color='purple',
data=extentB, fill=NA)
c) Define extent: Use the tigris
package to extract boundaries
The tigris
package offers a lot of great functions for extracting boundaries. If you want more detail on tigris
we have a detailed post on using thetigris
library. As an example, if you wanted your extent to be a particular county (in this case Ithaca is in Tompkins County in NYS) you could use the following code:
library(tigris)
extentC<-counties(state="NY") # county extent, not a bounding box
extentC<-extentC[extentC$NAME == "Tompkins",]
sp::plot(extentC, col="firebrick", border=NA)
d) Define extent: Extra credit, define a bounding box manually
You can also use what's known as Well Known Text, a geographic markup language, to manually define a polygon and use this with the function readWKT
from the rgeos
package to create your spatial object.
As an aside, there is a known bug in ggplot2
where it can mangle polygons that end up partly outside the extent of a map. As a result, I'm using carefully rounded coordinates so that the map will show the box, but this is not necessary if you're not using ggmap
to create the map.
extentD<-readWKT("POLYGON(( -76.71 42.29, -76.71 42.6,
-76.29 42.6, -76.29 42.29, -76.71 42.29))")
proj4string(extentD)<-"+proj=longlat +ellps=GRS80 +datum=NAD83 +no_defs"
ggmap(ithaca) + geom_polygon(aes(x=long, y=lat), size=3, color='purple',
data=extentD, fill=NA)
3. Download elevation and climate data
You can use any of the defined extents above, I will use the one created using the FedData
function polygon_from_extent
(extentB
).
Now that we have our extent defined let's download some data. Keep in mind that based on your extent this step might take a few minutes. Also you may want to set a working directory, otherwise the data will be downloaded to your default directory.
setwd("d:/junk")
# Download NED elevation data. This will return a raster
# of elevation data cropped to our extent area. Res = "1"
# refers to 1 arc-second NED. This is the default. To
# download 1/3 arc-second NED use res = "13". I'm adding
# force.redo = F to both download functions. This will
# check to see if the data already exists in which case
# it will not re-download it.
ned_ithaca<-get_ned(template=extentB, label="ned_ithaca",
res="1", force.redo = F)
# Download GHCN daily climate data. We're interested in
# precipitation (PRCP) but many other climate-related
# elements exist (i.e. max and min temp, snowfall, etc).
# See the `FedData` documentation for more details. Also
# of note is that a character vector of station IDs can
# be used in place of our extent object
ghcn_ithaca<-get_ghcn_daily(template=extentB, label="ghcn_ithaca",
elements="prcp", force.redo = F)
Using the R plot
function from the sp
package to take a quick look at the data.
# Not pretty but otherwise so far so good!
sp::plot(ned_ithaca)
sp::plot(ghcn_ithaca$spatial, add=TRUE)
4. Explore the GHCN climate data
GHCN spatial component
As you may have noticed in the above plot, we access the “spatial” component of the GHCN data for adding points to the map. The other component of the GHCN download is the “tabular” piece. This is where the actual climate data is stored. Here we'll use the spatial component one more time and map the points on our Ithaca map.
# Plot the data using ggmap and ggplot2. Note
# that we convert our spatial coordinates to a
# data.frame for plotting with ggmap.
ghcn_pts<-data.frame(ID = ghcn_ithaca$spatial$ID,
lat = ghcn_ithaca$spatial@coords[,2],
long = ghcn_ithaca$spatial@coords[,1])
ghcn_pts<-mutate(ghcn_pts, ID = as.character(as.factor(ID)))
head(ghcn_pts)
## ID lat long
## 1 US1NYSY0001 42.3907 -76.7176
## 2 US1NYTM0002 42.5666 -76.5700
## 3 US1NYTM0003 42.4317 -76.4886
## 4 US1NYTM0004 42.5913 -76.3720
## 5 US1NYTM0005 42.5253 -76.3218
## 6 US1NYTM0006 42.3520 -76.3184
# Plot the points again. Looking a little better!
ggmap(ithaca) + geom_point(aes(x=long, y=lat),
size=3, color='red', data=ghcn_pts)
GHCN tabular component (includes interactive graphics)
Now let's have a look at the actual precipitation values. The GHCN output returns a large list where each list element is an individual station and each station contains a data.frame of precipitation values. If we had downloaded multiple climate elements there would be multiple tables associated with each station.
ghcn_dat<-ghcn_ithaca$tabular
# Pull out the PRCP (precipitation) tables. If there were other
# climate elements we could access these in a
# similar way (i.e. ghcn_dat, "[[", "TMIN")
prcp<-lapply(ghcn_dat, "[[", "PRCP")
# Dissolve the list into a single data.frame and add a station id
# from the row names (then replace row names with a number sequence)
prcp<-do.call("rbind", prcp)
prcp$station<-substring(row.names(prcp),1,11)
row.names(prcp)<-1:nrow(prcp)
head(prcp)
## YEAR MONTH D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 D14 D15 D16
## 1 2007 11 NA 0 0 0 0 48 5 0 0 20 0 0 46 0 157 170
## 2 2007 12 3 97 99 46 3 13 0 0 0 28 0 173 3 48 3 114
## 3 2008 1 15 33 15 0 0 36 0 0 3 0 33 28 0 8 20 25
## 4 2008 2 25 208 0 0 61 267 137 10 5 3 8 0 53 53 0 28
## 5 2008 3 30 10 0 43 302 8 0 157 185 0 0 3 8 NA NA NA
## 6 2008 4 46 0 0 30 46 0 0 0 0 0 28 20 NA NA NA NA
## D17 D18 D19 D20 D21 D22 D23 D24 D25 D26 D27 D28 D29 D30 D31 station
## 1 8 0 13 33 3 8 18 0 0 64 353 5 0 3 NA US1NYSY0001
## 2 51 0 0 3 3 0 0 216 0 0 0 15 20 0 74 US1NYSY0001
## 3 0 5 0 NA NA 0 0 0 0 0 5 0 5 20 0 US1NYSY0001
## 4 0 8 20 0 0 18 8 3 0 0 147 3 0 NA NA US1NYSY0001
## 5 NA 0 23 155 18 0 0 0 0 NA 0 NA NA 0 3 US1NYSY0001
## 6 NA NA NA NA NA 0 0 74 0 0 300 5 89 0 NA US1NYSY0001
Doing monthly and yearly annual averages is cumbersome when each day is it's own column so let's convert this from “wide” format to “long” format using the function gather
from the tidyr
package.
# I'm going to convert to "long" format for easier calculations.
# The gather function is from the tidyr package
prcp<-gather(prcp, day, precip, -station, -YEAR, -MONTH)
# Add the date
prcp<-mutate(prcp,
MONTH = stringr::str_pad(MONTH, 2, side="left", pad="0"),
day = stringr::str_pad(gsub("D", "", day), 2,
side="left", pad="0"),
date = as.Date(paste(YEAR, MONTH, day, sep="-")))
# Exclude non NA values and limit to current years
prcp<-dplyr::filter(prcp, !is.na(date), !is.na(precip),
date > as.Date("2008-01-01"),
date < as.Date("2016-01-01"))
head(prcp)
## YEAR MONTH station day precip date
## 1 2008 02 US1NYSY0001 01 25 2008-02-01
## 2 2008 03 US1NYSY0001 01 30 2008-03-01
## 3 2008 04 US1NYSY0001 01 46 2008-04-01
## 4 2008 05 US1NYSY0001 01 0 2008-05-01
## 5 2008 06 US1NYSY0001 01 157 2008-06-01
## 6 2008 07 US1NYSY0001 01 5 2008-07-01
There are some stations with very little data, let's take a look at daily counts and then filter to only stations with plenty of data
# Do a count by station using dplyr functions
cnt<-group_by(prcp, station) %>%
summarize(cnt = n()) %>% arrange(cnt) %>%
mutate(station = factor(station, levels = station))
ggplot(cnt, aes(station, cnt)) +
geom_bar(stat="identity", fill="cadetblue") +
coord_flip() +
geom_hline(yintercept=2000) +
ggtitle("Count of days with data at climate stations")
# semi_join does a regular inner_join but only keeps the resulting
# columns from the left table. Greater than or equal to 2000 days
# looks like a good cutoff
prcp<-semi_join(prcp, dplyr::filter(cnt, cnt>=2000), by = "station")
Do some final formatting of the daily data so that we can use it in an interactive graphic using htmlWidgets.
# If you'd prefer to look at daily data instead of monthly comment this out
prcp<-group_by(prcp, YEAR, MONTH, station) %>%
summarize(precip = mean(precip, na.rm=T)) %>%
mutate(date = as.Date(paste(YEAR, MONTH, "15", sep="-")))
# We will make it so that each column is a station
prcp.wide<-spread(prcp[,c("date", "station", "precip")], station, precip)
head(prcp.wide)
## Source: local data frame [6 x 12]
##
## date US1NYSY0001 US1NYTM0002 US1NYTM0004 US1NYTM0005 US1NYTM0015
## (date) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 2008-01-15 8.428571 0.000000 11.70000 11.13333 NA
## 2 2008-02-15 36.724138 25.000000 38.20690 48.21739 NA
## 3 2008-03-15 39.375000 NA 44.12903 51.41935 6.50000
## 4 2008-04-15 30.380952 26.200000 30.00000 24.56667 26.66667
## 5 2008-05-15 15.419355 8.518519 14.03704 19.60000 20.75862
## 6 2008-06-15 42.551724 23.066667 43.36667 38.63333 28.20000
## Variables not shown: US1NYTM0017 (dbl), US1NYTM0018 (dbl), US1NYTM0023
## (dbl), US1NYTM0027 (dbl), USC00303050 (dbl), USC00304174 (dbl)
# Now apply a 30-day moving average (both sides) to smooth out the variation
prcp.wide[,2:ncol(prcp.wide)]<-lapply(prcp.wide[,2:ncol(prcp.wide)],
function(x) stats::filter(x, rep(1/30, 30, sides=2)))
# Delete rows with no data and convert to a data.frame
howmanyNA<-rowSums(is.na(prcp.wide))
prcp.wide<-prcp.wide[howmanyNA != (ncol(prcp.wide)-1),]
prcp.wide<-data.frame(prcp.wide)
In the final step we will convert the object to a time series object and look at the daily data (with the 30-day moving average) using the great htmlWidget dygraphs.
prcp.xts<-xts(prcp.wide[,-1], order.by = prcp.wide[,1])
dygraph(prcp.xts) %>% dyRangeSelector()
Be sure to play around with this interactive graphic:
Summarize the data for each station. Here we'll calculate annual averages for 2015 and present them in a map.
NOTE: for simplicity we are not applying filters based on completeness which would be appropriate in a scientific setting. Please do not apply these methods blindly when doing your own analysis.
# Calculate annual averages and create a final data.frame of values
prcp.fin<-dplyr::filter(prcp, date > as.Date("2015-01-01")) %>%
group_by(station) %>%
summarize(prcpAvg = round(mean(precip, na.rm=T),2)) %>%
rename(ID = station)
prcp.fin<-left_join(prcp.fin, ghcn_pts, by= "ID")
Create Leaflet map of GHCN average annual data
leaflet() %>% addProviderTiles("CartoDB.Positron") %>%
addCircleMarkers(data = prcp.fin, lng = ~long, lat = ~lat,
radius = ~prcpAvg/3, stroke=F,
popup = paste("<strong>Station ID</strong><br>", prcp.fin$ID,
"<br><br><strong>Average Annual<br>Precipitation</strong><br>",
prcp.fin$prcpAvg, " inches"), color = "#006B8C", fillOpacity = 0.7)
5. Explore the NED 1-arc second elevation data
In a previous blog post we discuss working with rasters using the raster
library. That post is helpful if you're looking to crop, re-classify or re-sample your data. In addition we have a post on rendering a raster for use in Google Maps which highlights multiple methods of exporting and overlaying a raster in Google Maps. Given these posts we're only going to lightly explore our elevation raster. The goal is to extract elevation values at the station locations and lastly create a final map.
# Take a quick look at the elevation distribution (these are meters)
neddat<-data.frame(rasterToPoints(ned_ithaca))
names(neddat)[3]<-"elev"
ggplot(data=neddat, aes(elev)) +
geom_histogram(binwidth = 20, fill="sienna1", col="sienna3")
# Create SpatialPointsDataFrame from GHCN points
pts<-data.frame(prcp.fin)
coordinates(pts)<-~long+lat
proj4string(pts)<-"+proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs"
# Project the points and elevation raster to NY State Plane
ithprj<-"+proj=tmerc +lat_0=40 +lon_0=-76.58333333333333 +k=0.9999375 +x_0=250000 +y_0=0 +ellps=GRS80 +datum=NAD83 +to_meter=0.3048006096012192 +no_defs"
ptsprj<-spTransform(pts, CRS(ithprj))
nedprj<-projectRaster(ned_ithaca, crs=ithprj)
# Extract elevation values at point locations
nedpts<-raster::extract(nedprj, ptsprj)
# Add elevation values to the points data and convert to feet
prcp.fin<-cbind(prcp.fin, nedpts)
prcp.fin<-mutate(prcp.fin, nedpts = round(nedpts/0.3048, 2))
head(prcp.fin)
## ID prcpAvg lat long nedpts
## 1 US1NYSY0001 26.58 42.3907 -76.7176 1384.90
## 2 US1NYTM0002 27.39 42.5666 -76.5700 849.02
## 3 US1NYTM0004 29.81 42.5913 -76.3720 1050.19
## 4 US1NYTM0005 30.10 42.5253 -76.3218 1063.56
## 5 US1NYTM0015 24.42 42.3223 -76.5899 1581.19
## 6 US1NYTM0017 30.39 42.4972 -76.5803 1021.68
6. Create the final map with elevation and climate data (includes interactive graphics)
Here we'll pull together all the pieces and create a relatively simple Leaflet map with our elevation raster and station locations. NOTE: depending on the size of your raster you may not be able to add the image to your Leaflet Map.
# Elevation colors were heavily borrowed from the
# Kansas Geological Survey's Elevation Map of Kansas:
# http://www.kgs.ku.edu/General/elevatMap.html
cols<-c("#06407F", "#317A9D", "#4ABEBB", "#40AE89", "#467B5D",
"#3C6D4D", "#1A572E", "#034C00", "#045D03", "#6C975F", "#6B823A",
"#88A237", "#C5D16B", "#DDE580", "#FFF6AE", "#FBCB81", "#F0B16A",
"#F2B16D", "#D18338", "#B16F33", "#825337", "#66422A", "#4F2C0C")
pal<-colorNumeric(cols, values(nedprj),
na.color = "transparent")
leaflet() %>% addProviderTiles("CartoDB.Positron") %>%
addRasterImage(nedprj, colors = pal, opacity = 0.6) %>%
addCircleMarkers(data = prcp.fin, lng = ~long,
lat = ~lat, radius = ~prcpAvg/3, stroke=F,
popup = paste("<strong>Station ID</strong><br>", prcp.fin$ID,
"<br><br><strong>Average Annual<br>Precipitation</strong><br>",
prcp.fin$prcpAvg, " inches<br><br><strong>Elevation</strong><br>",
prcp.fin$nedpts, " feet"), color = "#006B8C", fillOpacity = 0.7)
7. Conclusion
For those who regularly rely on major federal data sources like the National Elevation Dataset and the Global Historical Climatology Network the FedData package is an extremely useful resource that can help simplify your workflow and improve the reproducibility of your analyses. Rather than visit a federal website, search for a data source, download a ZIP file, unzip, read into R and clip, you can take advantage of tools in FedData to streamline the work and make it easier to make changes. Thanks to R. Kyle Bocinsky for the package and the GitHub page mentions that the package is a product SKOPE (Synthesized Knowledge of Past Environments) and the Village Ecodynamics Project so be sure to check out those organizations.
Great post!
Just as a thought, with mapview::popupGraph(…, type = “html”) you could include the dygraphs time series as a popup for the stations.
Cheers, Tim
That’s a good idea, I didn’t know about popupGraph!