The US Census provides an incredible wealth of data but it’s not always easy to work with it. In the past, working with the tabular and spatial census data generally meant downloading a table from FactFinder and a shapefile from the boundary files site and joining the two, perhaps in a GIS system. These files could also be handled in R but getting the data, reading it into R and, in particular, merging tabular and spatial data can be a chore. Working with slots and all the different classes of data and functions can be challenging.
A recent interesting post on stackoverflow by Claire Salloum prompted me to revisit this issue in R and I've definitely found some valuable new packages for capturing and manipulating Census data in R.
Edited with an update related to ggplot2 version 2.0 on May 6, 2016
Census data the hard way
The tabular data described in the stackoverflow post can be downloaded from this FactFinder link. Note that I clicked on download and then chose “Data and annotations in a separate file” and included the descriptive data element names. I also posted a version with just the variables used in this post. The spatial data for tracts comes from this link. We are using data from the US Census' American Community Survey.
Let's start by loading our libraries
library(rgdal) # for readOGR and others
library(sp) # for spatial objects
library(leaflet) # for interactive maps (NOT leafletR here)
library(dplyr) # for working with data frames
library(ggplot2) # for plotting
Now we're ready to read in the shapefile that we downloaded and unzipped:
tract <- readOGR(dsn=".", layer = "cb_2014_36_tract_500k")
## OGR data source with driver: ESRI Shapefile
## Source: "x:/junk/claire/leaflet_plot", layer: "cb_2014_36_tract_500k"
## with 4906 features
## It has 9 fields
# convert the GEOID to a character
tract@data$GEOID<-as.character(tract@data$GEOID)
And now the tabular data we downloaded from FactFinder. We will use dplyr
functions to manipulate the data and clean it up. If these functions are unfamiliar to you, take a look at a previous blog post on dplyr
.
data <- read.csv("x:/junk/claire/leaflet_plot/ACS_13_5YR_B19001.csv", stringsAsFactors = FALSE)
data <- select(data, GEO.id2, GEO.display.label, HD01_VD01, HD01_VD17) %>%
slice(-1) %>% # census has this extra descriptive record
rename(id=GEO.id2, geography=GEO.display.label, total=HD01_VD01,
over_200=HD01_VD17)
data <- mutate(data, id=as.character(id),
geography=as.character(geography),
total = as.numeric(total),
over_200 = as.numeric(over_200),
percent = (over_200/total)*100)
Plotting census data with ggplot2
In order to plot with ggplot2
we need to combine our spatial and tabular data AND the result needs to be a data frame. Luckily, ggplot2
has a nice function, fortify
, to help us convert the polygons to a data frame and we can merge the ACS data to this data frame:
# convert polygons to data.frame
# Note that with ggplo2 version 2.0 you will need to
# Add maptools see http://bit.ly/1rybv3O
library(maptools)
ggtract<-fortify(tract, region = "GEOID")
# join tabular data
ggtract<-left_join(ggtract, data, by=c("id"))
# here we limit to the NYC counties
ggtract <- ggtract[grep("Kings|Bronx|New York County|Queens|Richmond", ggtract$geography),]
ggplot() +
geom_polygon(data = ggtract , aes(x=long, y=lat, group = group, fill=percent), color="grey50") +
scale_fill_gradientn(colours = c("red", "white", "cadetblue"),
values = c(1,0.5, .3, .2, .1, 0))+
coord_map(xlim = c(-74.26, -73.71), ylim = c(40.49,40.92))
This works. Not particularly pretty, but functional. We could play around with the color scale to come up with something better or we could invest our time in creating something more useful, an interactive map. Let’s give this a try.
By the way, if you’d like more discussion on mapping with ggplot2
, you could look at this blog post.
Interactive plotting with the leaflet
package
In a previous blog post I showed how to use the leafletR
package but I've now moved to using the leaflet
package from RStudio instead primarily because it does not require that the user convert a spatial object to a geoJSON object first. Robin Lovelace has a nice summary of leaflet
here. There is also a nice post by Marco Sciaini here.
The leaflet
package requires the data input to be a spatial object. In our example, we started with a SpatialPolygonsDataFrame (which we created by reading using readORG
) and then we converted this to a vanilla data frame using the fortify
function. Given our example there are two approaches. You can convert your data.frame
back to a spatial data frame and map with leaflet
or you can make use of your existing spatial data frame (called tract
above). I'll show the code for both versions (though, spoiler alert, using the existing data frame is much easier).
Option 1: Convert the data.frame
back to a SpatialPolygonsDataFrame
Here you need to go through a lot of hoops. There may be an easier approach I have not considered but my experience is that you need to convert each tract to a Polygon
, then to a Polygons
object, then to a SpatialPolygons
object and finally to a SpatialPolygonsDataFrame
. No, this is not not fun!
Here is the code. Note that I start by creating a function that will extract the records for a particular tract and then convert this tract data to a Polygon
and then Polygons
object:
#function for creating a Polygons object for
# input tractname
polyFunc<-function(groupname, dat){
poly<-filter(dat, id==groupname) %>%
select(long, lat)
return(Polygons(list(Polygon(poly)), groupname))
}
tracts <- distinct(ggtract, id, percent)
tractname <- tracts$id
polygons<-lapply(tractname, function(x) polyFunc(x, dat=ggtract))
sp.polygon<-SpatialPolygons(polygons)
df.polygon<-SpatialPolygonsDataFrame(sp.polygon,
data=data.frame(row.names=tractname, tracts))
df.polygon <- df.polygon[order(df.polygon$percent),]
Now that we have a SpatialPolygonsDataFrame
we can create the interactive plot with the leaflet
package:
popup <- paste0("GEOID: ", df.polygon$id, "<br>", "Percent of Households above $200k: ", round(df.polygon$percent,2))
pal <- colorNumeric(
palette = "YlGnBu",
domain = df.polygon$percent
)
map1<-leaflet() %>%
addProviderTiles("CartoDB.Positron") %>%
addPolygons(data = df.polygon,
fillColor = ~pal(percent),
color = "#b2aeae", # you need to use hex colors
fillOpacity = 0.7,
weight = 0.3,
smoothFactor = 0.2,
popup = popup) %>%
addLegend(pal = pal,
values = df.polygon$percent,
position = "bottomright",
title = "Percent of Households<br>above $200k",
labFormat = labelFormat(suffix = "%"))
map1
Options 2: Make use of the existing SpatialPolygonsDataFrame
Rather than go through the hoops of converting the SpatialPolygonsDataFrame
to a data.frame
and back let's make use of the original SpatialPolygonsDataFrame
that we read in with readOGR
.
class(tract)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
# create a new version
df.polygon2<-tract #tract is the
# create a rec-field to make sure that we have the order correct
# this probably is unnecessary but it helps to be careful
df.polygon2@data$rec<-1:nrow(df.polygon2@data)
tmp <- left_join(df.polygon2@data, data, by=c("GEOID"="id")) %>%
arrange(rec)
# replace the original data with the new merged data
df.polygon2@data<-tmp
# limit to NYC
df.polygon2 <- df.polygon2[grep("Kings|Bronx|New York County|Queens|Richmond", df.polygon2$geography),]
#df.polygon2 <- df.polygon2[order(df.polygon2$percent),]
Create the interactive plot with the leaflet
package:
popup <- paste0("GEOID: ", df.polygon2$id, "<br>", "Percent of Households above $200k: ", round(df.polygon2$percent,2))
pal <- colorNumeric(
palette = "YlGnBu",
domain = df.polygon2$percent
)
map2<-leaflet() %>%
addProviderTiles("CartoDB.Positron") %>%
addPolygons(data = df.polygon2,
fillColor = ~pal(percent),
color = "#b2aeae", # you need to use hex colors
fillOpacity = 0.7,
weight = 1,
smoothFactor = 0.2,
popup = popup) %>%
addLegend(pal = pal,
values = df.polygon2$percent,
position = "bottomright",
title = "Percent of Households<br>above $200k",
labFormat = labelFormat(suffix = "%"))
map2
That was definitely a simpler way to map using downloaded census data. But what if we did not have to manually download data and spatial data? Enter the packages acs
and tigris
.
Census data the easy(er) way
There are two great packages that make downloading tabular and spatial data from the Census websites unnecessary for many cases. You can instead use the acs
package by Ezra Glenn to download tabular data and the tigris
package by Kyle Walker and Bob Rudis. These packages make the process quite a bit smoother and self-contained.
The code from this example is extracted in a large part from Kyle Walker's helpful page on tigris
.
1) Set up the packages
library(tigris)
library(acs)
library(stringr) # to pad fips codes
2) Get the spatial data (tigris
)
# note that you can use county names in the tigris package but
# not in the acs.fetch function from the acs package so I'm using
# fips numbers here.
# grab the spatial data (tigris)
counties <- c(5, 47, 61, 81, 85)
tracts <- tracts(state = 'NY', county = c(5, 47, 61, 81, 85), cb=TRUE)
3) Get the tabular data (acs
)
In order to do this you will need an API key from the US Census. Go to this site to request one (takes a minute or two) and then use the api.key.install
function in the `acs` package to use the key.
api.key.install(key="YOUR API KEY")
# create a geographic set to grab tabular data (acs)
geo<-geo.make(state=c("NY"),
county=c(5, 47, 61, 81, 85), tract="*")
# !!!! important note -- the package has not been updated to 2013
# data so I'm using the five year span that ends in 2012
income<-acs.fetch(endyear = 2012, span = 5, geography = geo,
table.number = "B19001", col.names = "pretty")
# use of col.names = "pretty" above gives the full column definitions
# if you want Census variable IDs use col.names="auto". Here are the
# variables we want with pretty and auto results.
#"Household Income: Total:" ("B19001_001")
#"Household Income: $200,000 or more" ("B19001_017")
# the resulting "income" object is not a data.frame it's a list
# to see what's available
names(attributes(income))
## [1] "endyear" "span" "acs.units" "currency.year"
## [5] "modified" "geography" "acs.colnames" "estimate"
## [9] "standard.error" "class"
attr(income, "acs.colnames")
## [1] "Household Income: Total:"
## [2] "Household Income: Less than $10,000"
## [3] "Household Income: $10,000 to $14,999"
## [4] "Household Income: $15,000 to $19,999"
## [5] "Household Income: $20,000 to $24,999"
## [6] "Household Income: $25,000 to $29,999"
## [7] "Household Income: $30,000 to $34,999"
## [8] "Household Income: $35,000 to $39,999"
## [9] "Household Income: $40,000 to $44,999"
## [10] "Household Income: $45,000 to $49,999"
## [11] "Household Income: $50,000 to $59,999"
## [12] "Household Income: $60,000 to $74,999"
## [13] "Household Income: $75,000 to $99,999"
## [14] "Household Income: $100,000 to $124,999"
## [15] "Household Income: $125,000 to $149,999"
## [16] "Household Income: $150,000 to $199,999"
## [17] "Household Income: $200,000 or more"
# convert to a data.frame for merging
income_df <- data.frame(paste0(str_pad(income@geography$state, 2, "left", pad="0"),
str_pad(income@geography$county, 3, "left", pad="0"),
str_pad(income@geography$tract, 6, "left", pad="0")),
income@estimate[,c("Household Income: Total:",
"Household Income: $200,000 or more")],
stringsAsFactors = FALSE)
income_df <- select(income_df, 1:3)
rownames(income_df)<-1:nrow(income_df)
names(income_df)<-c("GEOID", "total", "over_200")
income_df$percent <- 100*(income_df$over_200/income_df$total)
4) Do the merge (tigris
)
The package tigris
has a nice little merge function to do the sometimes difficult merge between the spatial and tabular data.
income_merged<- geo_join(tracts, income_df, "GEOID", "GEOID")
# there are some tracts with no land that we should exclude
income_merged <- income_merged[income_merged$ALAND>0,]
5) Make your map (leaflet
)
popup <- paste0("GEOID: ", income_merged$GEOID, "<br>", "Percent of Households above $200k: ", round(income_merged$percent,2))
pal <- colorNumeric(
palette = "YlGnBu",
domain = income_merged$percent
)
map3<-leaflet() %>%
addProviderTiles("CartoDB.Positron") %>%
addPolygons(data = income_merged,
fillColor = ~pal(percent),
color = "#b2aeae", # you need to use hex colors
fillOpacity = 0.7,
weight = 1,
smoothFactor = 0.2,
popup = popup) %>%
addLegend(pal = pal,
values = income_merged$percent,
position = "bottomright",
title = "Percent of Households<br>above $200k",
labFormat = labelFormat(suffix = "%"))
map3
Saving your maps
There is a nice function in the library htmlWidgets
for saving your maps:
library(htmlWidgets)
saveWidget(map1, file="map1.html", selfcontained=FALSE)
saveWidget(map2, file="map2.html", selfcontained=FALSE)
saveWidget(map3, file="map3.html", selfcontained=FALSE)
If you get an error pandoc document conversion failed
your data may be too big to save and you may need to use a less detailed version of the spatial data. Thanks to this post for pointing this out.
One issue with the leaflet
map
An astute observer will notice that the stroke lines (outlines) are not being rendered in these maps. In RStudio the lines appear (see image below). I experimented with opacity, width, color and even within the Chrome developer tools. I tried to save from the RStudio plot window and I tried saveWidget. If you have ideas please let me know.
Rarely is the solution to an issue like this so simple. Thanks to Bob Rudis for pointing out that I should be using the hex codes for colors rather than using named colors so that they can be recognized in a web setting. I’ve made the changes above and it worked perfectly.
Great post. Thank you for sharing this. Brian
Very nice. May I use it to map kriging prediction values on ggmap?
Yes, you can put anything with coordinates on a ggmap (this post, though, does not use ggmap)
Great tutorial! I’m so happy I stumbled across your blog. Have you considered adding your site to r-bloggers.com?
Thanks. It’s a good idea to list at r-bloggers, I just haven’t looked into it 🙂
Very informative article. Thanks
a) Do any of the Provider Tiles actually show county names as an overlay?
b) I’m not familiar with US tracts. Are they just a code or do they also have neighborhood equivalent names
I think there may be a typo
income_merged$id should be income_merged$GEOID
cheers
pssguy
a) Yes some provider tiles have counties but I don’t know, off-hand, which ones. Here is a good list:
http://leaflet-extras.github.io/leaflet-providers/preview/
b) Tracts are just a code and associated geography, they do not have neighborhood equivalent names.
Yes, absolutely right about the typo, I’ve made the fix and appreciate you reporting this!
Any chance you can post a direct link to the census data on your server? I’ve tried to download from census.gov 3 times, and each time I end up with a 4 kb, 2 row file!
Anyway thanks a lot for this well-done post.
I updated the post with three things: a better link to the data at Census, a link to a ZIP with the data and an update about the API key needed to get ACS data. If you’re willing to test out the data links that would be helpful.
Great tutorial! I’ve been looking for something like this for awhile. EXTREMELY glad to have found a resource.
The code worked for me… now to dive into understanding it all and mapping to my ACS data pulls
Very nice post! thanks a lot!
Is there an easy way to produce maps for all of the USA instead of just a state? (Is there a “state code” that gives all the states, and if yes, is there a file with all the counties?)
Or do we need to loop over states?
Thanks again!
Take a look at the documentation for
tigris
. If I understand your question correctly this is straightforwardcounties(c("RI", "DE"))
.Thanks for the post. One (probably stupid) question, when you download and unzip the tract data, which do you use? I have several extensions for the “cb_2014_36_tract_500k/cb_2014_36_tract_500k” file.
Thanks!
Take a look at this https://3.85.142.75/blog/2016/01/13/tips-for-reading-spatial-files-into-r-with-rgdal/
Wonderful tutorial. I’ve been playing with census data regarding American Indian land and populations. I didn’t know about the acs package. The link you provide to handling other spatial files is great–I’ll be playing with this all week.
Great post. Also worth looking at is CRAN package mapview, which might simplify some of these things (in particular mapping multiple attributes in a leaflet window).
Very interesting! Thanks for sharing Edzer, it looks like an interesting map package.
Zev-
The tigris merge is producing an error:
income_merged<- geo_join(tracts, income_df, "GEOID", "GEOID")
Error in data.frame(spatial_data@data, data_frame[match(spatial_data@data[[by_sp]], :
trying to get slot "data" from an object of a basic class ("function") with no slots
Did the syntax with tigris change with a newer version?
Hi, is it possible you missed a step? We just re-ran the code with the newest ACS and it worked. Both Hollie and I tested it out she was using acs 2.0, tigris 0.2.2 and R 3.2.5 and I’m using acs 1.2, tigris 0.2.2.9000, r 3.2.3. Let me know if you still have trouble.
I reinstalled some of the packages and now it works. Not sure what the original problem was but regardless your code is indeed 100%. Thank you! I learned a lot from it.
Yes, I wonder what the issue was! Glad it’s working and thanks for the update.
I’m interested in mapping at least two states, all tracts, blockgroups, or blocks…is this possible? I’ve tried many approaches…none seem to work.
Thanks,
This is very possible. Probably your question is a job for stackoverflow. Show what you’ve tried and how it fails and provide a link.
Zev,
This is a great post, and I am completely new to this. I tried your “Census data the easy(er) way” section, and I received the following error message.
Error in eval(expr, envir, enclos) : could not find function “select”
Any idea how I can correct this error message? Thank you!
Did you load dplyr — this is the package with select.
Hello, I try to reproduce the map in the section “Census data the easy(er) way”. When I try to fetch data with acs.fetch() function, I get the following error:
Error in matrix(unlist(value, recursive = FALSE, use.names = FALSE), nrow = nr, :
‘data’ must be of a vector type, was ‘NULL’
Could you help me to solve this problem?
T.E.G.
We were not able to re-create your error, the code is running fine for us. Perhaps post to stackoverflow with more detail than you provided here and give us the link. We are using acs v 2.0.
Hi – I’m having the same problem, did you ever resolve it?
Olivia
We could not reproduce the issue. We ran the code again and it worked fine. Can you post details to stackoverflow including your version of the packages? (Use sessionInfo()) and we can take a look?
Sorry if I’m missing something, but where do I find the table number?
Can anyone point me toward information on how to find the variables used in this example? If, say, I want to know how many children under 5 are in each tract in Oregon?
I struggled through this and figured out how table numbers and GEOIDs work. For those also new to census data, I’ve put together some resources that helped me figure things out:
Basic intro to conventions in census data:
https://source.opennews.org/articles/how-use-census-bureau-american-community-survey/
List of table codes:
https://censusreporter.org/topics/table-codes/
GEOID’s:
https://www.census.gov/geo/reference/geoidentifiers.html
(Expand “GEOID Structure for Geographic Areas”)
And if you want to go into smaller geographic areas:
http://eglenn.scripts.mit.edu/citystate/2013/03/acs-r-a-worked-example-using-blockgroup-level-data/
FAQS:
Q: What is the income_df creation with str_pad doing?
A: Constructing GEOIDs. These will need to match the GEOID from the “tracts” object in geo_join(). The tracts object as written in this example is only valid if you are going the level of tracts. If you go up or down levels, see other functions (block_groups(), blocks(), etc). Basically look at the number of numerals in your GEOID from acs.get(), compared to the GEOIDs link above, and make sure you have the same numeral length.
Zev,
Thank you so much for this tutorial. It will play a critical role in preparation for my job interview.
Good luck!
It’s been almost year since I’ve written that comment. I am happy at a new job 🙂 Thank you!
As you can see it takes me a long time to get to these 🙂 Congratulations on the new job.
Hi Zev, I have a question. I follow your code which goes well until I run “ggtract <- fortify(tract, region = "GEOID")". It keep giving me "isTRUE(gpclibPermitStatus()) is not TRUE". I tried many methods, but still cannot solve this problem. What is the aim of this step, is it trying to convert mapping data to the data frame? Thanks in advance!
This is an old post and I wouldn’t use fortify any more. The brand new version of ggplot2 (I think it comes out in a week or two) has a new method geom_sf() which you can use without fortify.
I’m interested in mapping at zcta level. Just want to check if I can use the approach
I know this is a quite old post, but I found it extremely useful in 2020!
Great information. Lucky me I found your website by accident (stumbleupon).
I have saved it for later!