Manipulating and mapping US Census data in R using the acs, tigris and leaflet packages

The US Census provides an incredible wealth of data but it’s not always easy to work with it. In the past, working with the tabular and spatial census data generally meant downloading a table from FactFinder and a shapefile from the boundary files site and joining the two, perhaps in a GIS system. These files could also be handled in R but getting the data, reading it into R and, in particular, merging tabular and spatial data can be a chore. Working with slots and all the different classes of data and functions can be challenging.

A recent interesting post on stackoverflow by Claire Salloum prompted me to revisit this issue in R and I've definitely found some valuable new packages for capturing and manipulating Census data in R.

Edited with an update related to ggplot2 version 2.0 on May 6, 2016

Census data the hard way

The tabular data described in the stackoverflow post can be downloaded from this FactFinder link. Note that I clicked on download and then chose “Data and annotations in a separate file” and included the descriptive data element names. I also posted a version with just the variables used in this post. The spatial data for tracts comes from this link. We are using data from the US Census' American Community Survey.

Let's start by loading our libraries

library(rgdal)    # for readOGR and others
library(sp)       # for spatial objects
library(leaflet)  # for interactive maps (NOT leafletR here)
library(dplyr)    # for working with data frames
library(ggplot2)  # for plotting

Now we're ready to read in the shapefile that we downloaded and unzipped:

tract <- readOGR(dsn=".", layer = "cb_2014_36_tract_500k")
## OGR data source with driver: ESRI Shapefile 
## Source: "x:/junk/claire/leaflet_plot", layer: "cb_2014_36_tract_500k"
## with 4906 features
## It has 9 fields

# convert the GEOID to a character
tract@data$GEOID<-as.character(tract@data$GEOID)

And now the tabular data we downloaded from FactFinder. We will use dplyr functions to manipulate the data and clean it up. If these functions are unfamiliar to you, take a look at a previous blog post on dplyr.

data <- read.csv("x:/junk/claire/leaflet_plot/ACS_13_5YR_B19001.csv", stringsAsFactors = FALSE)

data <- select(data, GEO.id2, GEO.display.label, HD01_VD01, HD01_VD17) %>% 
  slice(-1) %>% # census has this extra descriptive record
  rename(id=GEO.id2, geography=GEO.display.label, total=HD01_VD01,
         over_200=HD01_VD17)

data <- mutate(data, id=as.character(id),
               geography=as.character(geography),
               total = as.numeric(total),
               over_200 = as.numeric(over_200),
               percent = (over_200/total)*100)

Plotting census data with ggplot2

In order to plot with ggplot2 we need to combine our spatial and tabular data AND the result needs to be a data frame. Luckily, ggplot2 has a nice function, fortify, to help us convert the polygons to a data frame and we can merge the ACS data to this data frame:

# convert polygons to data.frame
# Note that with ggplo2 version 2.0 you will need to 
# Add maptools see http://bit.ly/1rybv3O
library(maptools)
ggtract<-fortify(tract, region = "GEOID") 
# join tabular data
ggtract<-left_join(ggtract, data, by=c("id")) 

# here we limit to the NYC counties
ggtract <- ggtract[grep("Kings|Bronx|New York County|Queens|Richmond", ggtract$geography),]

ggplot() +
  geom_polygon(data = ggtract , aes(x=long, y=lat, group = group, fill=percent), color="grey50") +
   scale_fill_gradientn(colours = c("red", "white", "cadetblue"),
                       values = c(1,0.5, .3, .2, .1, 0))+
  coord_map(xlim = c(-74.26, -73.71), ylim = c(40.49,40.92))

plot of chunk unnamed-chunk-5

This works. Not particularly pretty, but functional. We could play around with the color scale to come up with something better or we could invest our time in creating something more useful, an interactive map. Let’s give this a try.

By the way, if you’d like more discussion on mapping with ggplot2, you could look at this blog post.

Interactive plotting with the leaflet package

In a previous blog post I showed how to use the leafletR package but I've now moved to using the leaflet package from RStudio instead primarily because it does not require that the user convert a spatial object to a geoJSON object first. Robin Lovelace has a nice summary of leaflet here. There is also a nice post by Marco Sciaini here.

The leaflet package requires the data input to be a spatial object. In our example, we started with a SpatialPolygonsDataFrame (which we created by reading using readORG) and then we converted this to a vanilla data frame using the fortify function. Given our example there are two approaches. You can convert your data.frame back to a spatial data frame and map with leaflet or you can make use of your existing spatial data frame (called tract above). I'll show the code for both versions (though, spoiler alert, using the existing data frame is much easier).

Option 1: Convert the data.frame back to a SpatialPolygonsDataFrame

Here you need to go through a lot of hoops. There may be an easier approach I have not considered but my experience is that you need to convert each tract to a Polygon, then to a Polygons object, then to a SpatialPolygons object and finally to a SpatialPolygonsDataFrame. No, this is not not fun!

Here is the code. Note that I start by creating a function that will extract the records for a particular tract and then convert this tract data to a Polygon and then Polygons object:

#function for creating a Polygons object for 
# input tractname
polyFunc<-function(groupname, dat){
  poly<-filter(dat, id==groupname) %>% 
    select(long, lat)
  return(Polygons(list(Polygon(poly)), groupname))
}


tracts <- distinct(ggtract, id, percent)
tractname <- tracts$id
polygons<-lapply(tractname, function(x) polyFunc(x, dat=ggtract)) 
sp.polygon<-SpatialPolygons(polygons)
df.polygon<-SpatialPolygonsDataFrame(sp.polygon, 
                                     data=data.frame(row.names=tractname, tracts))
df.polygon <- df.polygon[order(df.polygon$percent),]

Now that we have a SpatialPolygonsDataFrame we can create the interactive plot with the leaflet package:

popup <- paste0("GEOID: ", df.polygon$id, "<br>", "Percent of Households above $200k: ", round(df.polygon$percent,2))
pal <- colorNumeric(
  palette = "YlGnBu",
  domain = df.polygon$percent
)

map1<-leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addPolygons(data = df.polygon, 
              fillColor = ~pal(percent), 
              color = "#b2aeae", # you need to use hex colors
              fillOpacity = 0.7, 
              weight = 0.3, 
              smoothFactor = 0.2,
              popup = popup) %>%
  addLegend(pal = pal, 
            values = df.polygon$percent, 
            position = "bottomright", 
            title = "Percent of Households<br>above $200k",
            labFormat = labelFormat(suffix = "%")) 
map1

Options 2: Make use of the existing SpatialPolygonsDataFrame

Rather than go through the hoops of converting the SpatialPolygonsDataFrame to a data.frame and back let's make use of the original SpatialPolygonsDataFrame that we read in with readOGR.

class(tract)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"

# create a new version
df.polygon2<-tract #tract is the 

# create a rec-field to make sure that we have the order correct
# this probably is unnecessary but it helps to be careful
df.polygon2@data$rec<-1:nrow(df.polygon2@data)
tmp <- left_join(df.polygon2@data, data, by=c("GEOID"="id")) %>% 
  arrange(rec)

# replace the original data with the new merged data
df.polygon2@data<-tmp
# limit to NYC
df.polygon2 <- df.polygon2[grep("Kings|Bronx|New York County|Queens|Richmond", df.polygon2$geography),]
#df.polygon2 <- df.polygon2[order(df.polygon2$percent),]

Create the interactive plot with the leaflet package:

popup <- paste0("GEOID: ", df.polygon2$id, "<br>", "Percent of Households above $200k: ", round(df.polygon2$percent,2))
pal <- colorNumeric(
  palette = "YlGnBu",
  domain = df.polygon2$percent
)

map2<-leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addPolygons(data = df.polygon2, 
              fillColor = ~pal(percent), 
              color = "#b2aeae", # you need to use hex colors
              fillOpacity = 0.7, 
              weight = 1, 
              smoothFactor = 0.2,
              popup = popup) %>%
  addLegend(pal = pal, 
            values = df.polygon2$percent, 
            position = "bottomright", 
            title = "Percent of Households<br>above $200k",
            labFormat = labelFormat(suffix = "%")) 
map2

That was definitely a simpler way to map using downloaded census data. But what if we did not have to manually download data and spatial data? Enter the packages acs and tigris.

Census data the easy(er) way

There are two great packages that make downloading tabular and spatial data from the Census websites unnecessary for many cases. You can instead use the acs package by Ezra Glenn to download tabular data and the tigris package by Kyle Walker and Bob Rudis. These packages make the process quite a bit smoother and self-contained.

The code from this example is extracted in a large part from Kyle Walker's helpful page on tigris.

1) Set up the packages

library(tigris)
library(acs)
library(stringr) # to pad fips codes

2) Get the spatial data (tigris)

# note that you can use county names in the tigris package but 
# not in the acs.fetch function from the acs package so I'm using
# fips numbers here.

# grab the spatial data (tigris)
counties <- c(5, 47, 61, 81, 85)
tracts <- tracts(state = 'NY', county = c(5, 47, 61, 81, 85), cb=TRUE)

3) Get the tabular data (acs)

In order to do this you will need an API key from the US Census. Go to this site to request one (takes a minute or two) and then use the api.key.install function in the `acs` package to use the key.

api.key.install(key="YOUR API KEY")
# create a geographic set to grab tabular data (acs)
geo<-geo.make(state=c("NY"),
              county=c(5, 47, 61, 81, 85), tract="*")

# !!!! important note -- the package has not been updated to 2013
# data so I'm using the five year span that ends in 2012

income<-acs.fetch(endyear = 2012, span = 5, geography = geo,
                table.number = "B19001", col.names = "pretty")

# use of col.names = "pretty" above gives the full column definitions
# if you want Census variable IDs use col.names="auto". Here are the
# variables we want with pretty and auto results.
#"Household Income: Total:" ("B19001_001")
#"Household Income: $200,000 or more" ("B19001_017")


# the resulting "income" object is not a data.frame it's a list
# to see what's available

names(attributes(income))
##  [1] "endyear"        "span"           "acs.units"      "currency.year" 
##  [5] "modified"       "geography"      "acs.colnames"   "estimate"      
##  [9] "standard.error" "class"
attr(income, "acs.colnames")
##  [1] "Household Income: Total:"              
##  [2] "Household Income: Less than $10,000"   
##  [3] "Household Income: $10,000 to $14,999"  
##  [4] "Household Income: $15,000 to $19,999"  
##  [5] "Household Income: $20,000 to $24,999"  
##  [6] "Household Income: $25,000 to $29,999"  
##  [7] "Household Income: $30,000 to $34,999"  
##  [8] "Household Income: $35,000 to $39,999"  
##  [9] "Household Income: $40,000 to $44,999"  
## [10] "Household Income: $45,000 to $49,999"  
## [11] "Household Income: $50,000 to $59,999"  
## [12] "Household Income: $60,000 to $74,999"  
## [13] "Household Income: $75,000 to $99,999"  
## [14] "Household Income: $100,000 to $124,999"
## [15] "Household Income: $125,000 to $149,999"
## [16] "Household Income: $150,000 to $199,999"
## [17] "Household Income: $200,000 or more"

# convert to a data.frame for merging
income_df <- data.frame(paste0(str_pad(income@geography$state, 2, "left", pad="0"), 
                             str_pad(income@geography$county, 3, "left", pad="0"), 
                             str_pad(income@geography$tract, 6, "left", pad="0")), 
                        income@estimate[,c("Household Income: Total:",
"Household Income: $200,000 or more")], 
                        stringsAsFactors = FALSE)

income_df <- select(income_df, 1:3)
rownames(income_df)<-1:nrow(income_df)
names(income_df)<-c("GEOID", "total", "over_200")
income_df$percent <- 100*(income_df$over_200/income_df$total)

4) Do the merge (tigris)

The package tigris has a nice little merge function to do the sometimes difficult merge between the spatial and tabular data.

income_merged<- geo_join(tracts, income_df, "GEOID", "GEOID")
# there are some tracts with no land that we should exclude
income_merged <- income_merged[income_merged$ALAND>0,]

5) Make your map (leaflet)

popup <- paste0("GEOID: ", income_merged$GEOID, "<br>", "Percent of Households above $200k: ", round(income_merged$percent,2))
pal <- colorNumeric(
  palette = "YlGnBu",
  domain = income_merged$percent
)

map3<-leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addPolygons(data = income_merged, 
              fillColor = ~pal(percent), 
              color = "#b2aeae", # you need to use hex colors
              fillOpacity = 0.7, 
              weight = 1, 
              smoothFactor = 0.2,
              popup = popup) %>%
  addLegend(pal = pal, 
            values = income_merged$percent, 
            position = "bottomright", 
            title = "Percent of Households<br>above $200k",
            labFormat = labelFormat(suffix = "%")) 
map3

Saving your maps

There is a nice function in the library htmlWidgets for saving your maps:

library(htmlWidgets)
saveWidget(map1, file="map1.html", selfcontained=FALSE)
saveWidget(map2, file="map2.html", selfcontained=FALSE)
saveWidget(map3, file="map3.html", selfcontained=FALSE)

If you get an error pandoc document conversion failed your data may be too big to save and you may need to use a less detailed version of the spatial data. Thanks to this post for pointing this out.

One issue with the leaflet map

An astute observer will notice that the stroke lines (outlines) are not being rendered in these maps. In RStudio the lines appear (see image below). I experimented with opacity, width, color and even within the Chrome developer tools. I tried to save from the RStudio plot window and I tried saveWidget. If you have ideas please let me know.

Rarely is the solution to an issue like this so simple. Thanks to Bob Rudis for pointing out that I should be using the hex codes for colors rather than using named colors so that they can be recognized in a web setting. I’ve made the changes above and it worked perfectly.

Capture

40 responses

  1. Very informative article. Thanks

    a) Do any of the Provider Tiles actually show county names as an overlay?
    b) I’m not familiar with US tracts. Are they just a code or do they also have neighborhood equivalent names

    I think there may be a typo
    income_merged$id should be income_merged$GEOID

    cheers
    pssguy

  2. Any chance you can post a direct link to the census data on your server? I’ve tried to download from census.gov 3 times, and each time I end up with a 4 kb, 2 row file!

    Anyway thanks a lot for this well-done post.

    • I updated the post with three things: a better link to the data at Census, a link to a ZIP with the data and an update about the API key needed to get ACS data. If you’re willing to test out the data links that would be helpful.

  3. Great tutorial! I’ve been looking for something like this for awhile. EXTREMELY glad to have found a resource.

    The code worked for me… now to dive into understanding it all and mapping to my ACS data pulls

  4. Very nice post! thanks a lot!
    Is there an easy way to produce maps for all of the USA instead of just a state? (Is there a “state code” that gives all the states, and if yes, is there a file with all the counties?)
    Or do we need to loop over states?

    Thanks again!

    • Take a look at the documentation for tigris. If I understand your question correctly this is straightforward counties(c("RI", "DE")).

  5. Thanks for the post. One (probably stupid) question, when you download and unzip the tract data, which do you use? I have several extensions for the “cb_2014_36_tract_500k/cb_2014_36_tract_500k” file.

    Thanks!

  6. Wonderful tutorial. I’ve been playing with census data regarding American Indian land and populations. I didn’t know about the acs package. The link you provide to handling other spatial files is great–I’ll be playing with this all week.

  7. Zev-

    The tigris merge is producing an error:
    income_merged<- geo_join(tracts, income_df, "GEOID", "GEOID")

    Error in data.frame(spatial_data@data, data_frame[match(spatial_data@data[[by_sp]], :
    trying to get slot "data" from an object of a basic class ("function") with no slots

    Did the syntax with tigris change with a newer version?

    • Hi, is it possible you missed a step? We just re-ran the code with the newest ACS and it worked. Both Hollie and I tested it out she was using acs 2.0, tigris 0.2.2 and R 3.2.5 and I’m using acs 1.2, tigris 0.2.2.9000, r 3.2.3. Let me know if you still have trouble.

      • I reinstalled some of the packages and now it works. Not sure what the original problem was but regardless your code is indeed 100%. Thank you! I learned a lot from it.

  8. I’m interested in mapping at least two states, all tracts, blockgroups, or blocks…is this possible? I’ve tried many approaches…none seem to work.

    Thanks,

    • This is very possible. Probably your question is a job for stackoverflow. Show what you’ve tried and how it fails and provide a link.

  9. Zev,

    This is a great post, and I am completely new to this. I tried your “Census data the easy(er) way” section, and I received the following error message.

    Error in eval(expr, envir, enclos) : could not find function “select”

    Any idea how I can correct this error message? Thank you!

  10. Hello, I try to reproduce the map in the section “Census data the easy(er) way”. When I try to fetch data with acs.fetch() function, I get the following error:

    Error in matrix(unlist(value, recursive = FALSE, use.names = FALSE), nrow = nr, :
    ‘data’ must be of a vector type, was ‘NULL’

    Could you help me to solve this problem?

    T.E.G.

    • We were not able to re-create your error, the code is running fine for us. Perhaps post to stackoverflow with more detail than you provided here and give us the link. We are using acs v 2.0.

      • We could not reproduce the issue. We ran the code again and it worked fine. Can you post details to stackoverflow including your version of the packages? (Use sessionInfo()) and we can take a look?

  11. Sorry if I’m missing something, but where do I find the table number?

    Can anyone point me toward information on how to find the variables used in this example? If, say, I want to know how many children under 5 are in each tract in Oregon?

  12. Zev,

    Thank you so much for this tutorial. It will play a critical role in preparation for my job interview.

  13. Hi Zev, I have a question. I follow your code which goes well until I run “ggtract <- fortify(tract, region = "GEOID")". It keep giving me "isTRUE(gpclibPermitStatus()) is not TRUE". I tried many methods, but still cannot solve this problem. What is the aim of this step, is it trying to convert mapping data to the data frame? Thanks in advance!

    • This is an old post and I wouldn’t use fortify any more. The brand new version of ggplot2 (I think it comes out in a week or two) has a new method geom_sf() which you can use without fortify.

Leave a Reply

Your email address will not be published. Required fields are marked *