I used to spend considerably more time begging and, sometimes, badgering government agencies for data. Thanks to the push for more open and transparent data, I'm more and more likely to find the data I need posted in a nice, manageable format on a government website. This was true on a recent project where I needed the locations of food markets in New York State. A quick web search turned up a lovely dataset on the New York State Open Data Portal posted by the Department of Agriculture and Markets. Thank you, Open Data!
If you visit this site you can browse and download the data in a variety of different formats. The interface looks like this:
If you're using R or other data analysis software, the most convenient format to work with is often comma-separated values. With this particular dataset, though, there are two reasons why CSV is not the best option. First, if you look closely at the data you will see that the location field is not a traditional set of columns, and when you download in CSV you'll find some of that data missing. Second, CSV does not allow for easy inclusion of important metadata. JSON, on the other hand, can easily accommodate the detailed location data and integrate the metadata directly in the file. But working with JSON can be challenging, so I've put together this post to help guide others through the process.
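To make the contrast concrete, here is a minimal sketch of the kind of nested location data that survives a JSON round trip but would be flattened or lost in CSV. The record is made up (it is not the actual portal schema), and I use jsonlite here, though RJSONIO works similarly:

```r
library(jsonlite)

# A toy record shaped like the portal data: flat fields plus a nested
# location node (hypothetical example, not the real schema)
txt <- '{"name": "PLAZA 23 TRUCK STOP",
         "location": {"address": "240 CHURCH ST #242",
                      "city": "ALBANY",
                      "latitude": 42.6354,
                      "longitude": -73.7554}}'

rec <- fromJSON(txt, simplifyVector = FALSE)

# The nested node survives intact, something a flat CSV cannot represent
rec$name                  # "PLAZA 23 TRUCK STOP"
rec$location$latitude     # 42.6354
```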
1) Grab the data
I'll warn you that I sometimes found the Open Data Portal to be a little slow. You can view and download the data from this link, but if you're having timeout issues I've also posted a ZIP file with the JSON as well as the shapefile I use below for mapping (warning: it's 5 MB).
To read the data I'm using the R package RJSONIO, and I'm reading directly from the Open Data Portal. If this takes too long for you, be sure to download the ZIP and save it to your local machine.
library(RJSONIO)
# from the website
foodMarketsRaw<-fromJSON("https://data.ny.gov/api/views/9a8c-vfzj/rows.json?accessType=DOWNLOAD")
# if you downloaded
#foodMarketsRaw<-fromJSON("retail_food_markets.json")
Note that Scott Chamberlain asked why I used the package RJSONIO rather than jsonlite. My answer was "no good reason". I followed up with a speed test on this file using the fromJSON function from each package: RJSONIO was 3x faster in this one case. At the suggestion of Jeroen Ooms, creator of the jsonlite package, I added the simplifyVector=FALSE argument to jsonlite::fromJSON and found that this allows jsonlite to read the data 2x faster than RJSONIO (though for a perfect time comparison I would need to change RJSONIO's simplification settings as well).
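If you'd like to run a rough comparison yourself, a sketch along these lines works. The tiny in-memory document is hypothetical (swap in the portal URL or your downloaded file for a realistic test), and timings will vary with your machine and the file's size and structure:

```r
library(RJSONIO)
library(jsonlite)

# A tiny in-memory JSON array standing in for the real file (hypothetical)
txt <- '[{"id": 1, "name": "market1"}, {"id": 2, "name": "market2"}]'

# Repeat the parse so the timing is measurable; results vary by machine
system.time(for (i in 1:1000) RJSONIO::fromJSON(txt))
system.time(for (i in 1:1000) jsonlite::fromJSON(txt, simplifyVector = FALSE))

# With simplifyVector=FALSE, both packages return a list with one
# element per record
length(RJSONIO::fromJSON(txt))                           # 2
length(jsonlite::fromJSON(txt, simplifyVector = FALSE))  # 2
```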
For more information on related packages, Gaston Sanchez has a really nice presentation on the different options for reading JSON data in R.
2) Extract the data from the JSON file
If you take a look at the file in a browser or a text editor, you'll see that the first big chunk of lines is devoted to the metadata (the source of the file, etc.). It is not until line 1229 that you see the data node, which is all we need initially. If you type str(foodMarketsRaw) you'll notice that the data has been read as an R list, so we will use double brackets to extract the data node:
# extract the data node
foodMarkets<-foodMarketsRaw[['data']]
The data (in the JSON file) looks something like this:
"data" : [ [ 1, "D943D633-8DB5-4740-B5F4-06AAB6FC52E2", 1, 1412261567, "707400", 1412262737, "708543", "{\n}", "Albany", "704121", "Store", "JAC ", "DHARAM INC ", "PLAZA 23 TRUCK STOP ", "240", "CHURCH ST #242 ", " ", " ", "ALBANY ", "NY", "12202", "3600", [ "{\"address\":\"240 CHURCH ST #242\",\"city\":\"ALBANY\",\"state\":\"NY\",\"zip\":\"12202\"}", "42.63540995700049", "-73.75540780899968", null, false ] ]
, [ 2, "7A069C98-90B2-4B11-A4A7-3C4761D14DDD", 2, 1412261567, "707400", 1412262426, "708543", "{\n}", "Albany", "711851", "Store", "JAC ", "PRICE CHOPPER OPERATING CO INC ", "PRICE CHOPPER #245 ", "515", "19TH ST ", " ", " ", "WATERVLIET ", "NY", "12189", "0", [ "{\"address\":\"515 19TH ST\",\"city\":\"WATERVLIET\",\"state\":\"NY\",\"zip\":\"12189\"}", "42.73110653700047", "-73.70496327399962", null, false ] ]
, [ 3, "68DBE61E-ACF4-4803-BD90-87380FDAA4F9", 3, 1412261567, "707400", 1412262278, "708543", "{\n}", "Albany", "704964", "Store", "JAC ", "PEACOCK INC ", "PEACOCK ", "795", "BROADWAY ", " ", " ", "ALBANY ", "NY", "12207", "2000", [ "{\"address\":\"795 BROADWAY\",\"city\":\"ALBANY\",\"state\":\"NY\",\"zip\":\"12207\"}", "42.657937794000475", "-73.74779704599968", null, false ] ]
, [ 4, "9DA5A827-A0AD-4782-8CE2-5D6867B53293", 4, 1412261567, "707400", 1412262565, "708543", "{\n}", "Albany", "704995", "Store", "JAC ", "ALBANY PRIDE SEAFOOD LLC ", "FINYOUR FISHMONGER ", "2050", "WESTERN AVE ", " ", " ", "GUILDERLAND ", "NY", "12084", "2100", [ "{\"address\":\"2050 WESTERN AVE\",\"city\":\"GUILDERLAND\",\"state\":\"NY\",\"zip\":\"12084\"}", "42.69787242500047", "-73.88923733299964", null, false ] ]]
3) Orient yourself to the data
Working with JSON in R can be a bit disorienting because you end up with lists within lists within lists, so let's break it down a bit. In this dataset we have an outer list where each list item is an individual food market (you can see this in the sample data above). So foodMarkets[[1]] would give you all the data for the first food market in the list. If you type length(foodMarkets[[1]]) you'll see that the first food market comes with 23 pieces of information. For example, if you explore the sample above you'll see that the 14th element is the food market name:
foodMarkets[[1]][[14]] ## name of food market 1
## [1] "PLAZA 23 TRUCK STOP "
foodMarkets[[2]][[14]] ## name of food market 2
## [1] "PRICE CHOPPER #245 "
4a) Assemble the data frame: an example of extracting a variable
I'm going to extract the data variable-by-variable and then assemble the variables into a data frame. Since we have a list, a logical place to start is the lapply function, which operates on the list piece-by-piece, or, even better, sapply, which does the same thing but returns an array instead of a list.
So, for example, if you want to extract the food market names from the list you can use this code which essentially cycles through each food market and extracts the name:
fmNames<-sapply(foodMarkets, function(x) x[[14]])
head(fmNames)
## [1] "PLAZA 23 TRUCK STOP " "PRICE CHOPPER #245 "
## [3] "PEACOCK " "FINYOUR FISHMONGER "
## [5] "R&A GROCERY STORE " "ANTHONYS CHOC DIP FRUIT"
We could copy and paste this line of code 23 times, but that is cumbersome and prone to error. Let's do it programmatically.
4b) Assemble the data frame: extracting all the variables (except the geography)
There are a ton of ways to extract all the variables without hard coding, and I can pretty much guarantee that this is not the best approach (please do write to me with alternatives!). Originally I tried two nested sapply statements but ran into trouble when certain data values were missing. So instead I wrote two functions: one that returns the data if it exists and an NA otherwise (returnData), and one that does the sapply (grabInfo). Note that I'm only applying this to the first 22 variables, not the geographic information; that comes next.
library(gdata) # for the trim function

grabInfo <- function(var){
  print(paste("Variable", var, sep=" "))
  sapply(foodMarkets, function(x) returnData(x, var))
}

returnData <- function(x, var){
  if(!is.null(x[[var]])){
    return(trim(x[[var]]))
  }else{
    return(NA)
  }
}
# do the extraction and assembly
fmDataDF<-data.frame(sapply(1:22, grabInfo), stringsAsFactors=FALSE)
4c) Assemble the data frame: extracting the geographic information
There is one additional level of complication with the geographic information stored in element 23 of each market. This “variable” is, itself, a list.
foodMarkets[[1]][[23]] #geographic info for food market 1
## [[1]]
## [1] "{\"address\":\"240 CHURCH ST #242\",\"city\":\"ALBANY\",\"state\":\"NY\",\"zip\":\"12202\"}"
##
## [[2]]
## [1] "42.63540995700049"
##
## [[3]]
## [1] "-73.75540780899968"
##
## [[4]]
## NULL
##
## [[5]]
## [1] FALSE
The first piece is a JSON-formatted version of the address, followed by latitude and longitude and then two administrative variables. We will use the same approach as before, but hard-code the variable number and extract the data one level deeper.
grabGeoInfo <- function(val){
  l <- length(foodMarkets[[1]][[val]])
  lapply(1:l, function(y)
    sapply(foodMarkets, function(x){
      if(!is.null(x[[val]][[y]])){
        return(x[[val]][[y]])
      }else{
        return(NA)
      }
    })
  )
}
fmDataGeo<-grabGeoInfo(23)
fmDataGeo<-data.frame(do.call("cbind", fmDataGeo), stringsAsFactors=FALSE)
fmDataDF<-cbind(fmDataDF, fmDataGeo)
5) Add the names
The column names are in the metadata. If you review the metadata you can see that the columns are under meta:view. The column detail, then, can be extracted with:
columns<-foodMarketsRaw[['meta']][['view']][['columns']]
If you look at any of the entries (try columns[[14]]) you'll see that there is a lot more than just column names, so once again we'll use sapply. We again have a complication related to geo: the column names for the geo field are not under meta:view:columns but rather meta:view:columns:subColumnTypes. I'll extract with hard coding here (which is clearer), but I'll also give a function that can do it for you regardless of whether the variable is geo or not:
# names the hard-coding way
fmNames1<-sapply(1:22, function(x) columns[[x]]$name)
fmNames2<-columns[[23]]$subColumnTypes
fmNames<-c(fmNames1, fmNames2)
Here is the function approach instead which will extract names for a geo or non-geo field:
getNames <- function(x){
  if(is.null(columns[[x]]$subColumnTypes)){
    return(columns[[x]]$name)
  }else{
    return(columns[[x]]$subColumnTypes)
  }
}
fmNames<-unlist(sapply(1:length(columns), getNames))
Now we add the names to the dataset and take a look:
names(fmDataDF)<-fmNames
head(fmDataDF)
## sid id position created_at
## 1 1 D943D633-8DB5-4740-B5F4-06AAB6FC52E2 1 1412261567
## 2 2 7A069C98-90B2-4B11-A4A7-3C4761D14DDD 2 1412261567
## 3 3 68DBE61E-ACF4-4803-BD90-87380FDAA4F9 3 1412261567
## 4 4 9DA5A827-A0AD-4782-8CE2-5D6867B53293 4 1412261567
## 5 5 5D624E4B-E4A8-4F3D-BF88-190663DF8FCF 5 1412261567
## 6 6 86BC29A4-55DA-43C5-BDED-6EF2C92C3C8B 6 1412261567
## created_meta updated_at updated_meta meta County License Number
## 1 707400 1412262737 708543 {\n} Albany 704121
## 2 707400 1412262426 708543 {\n} Albany 711851
## 3 707400 1412262278 708543 {\n} Albany 704964
## 4 707400 1412262565 708543 {\n} Albany 704995
## 5 707400 1412262177 708543 {\n} Albany 713107
## 6 707400 1412262499 708543 {\n} Albany 711674
## Operation Type Estab Type Entity Name
## 1 Store JAC DHARAM INC
## 2 Store JAC PRICE CHOPPER OPERATING CO INC
## 3 Store JAC PEACOCK INC
## 4 Store JAC ALBANY PRIDE SEAFOOD LLC
## 5 Store JAC POPAL BASHIR
## 6 Store JAC FRUIT BASKETS & GIFTS INC
## DBA Name Street Number Street Name Address Line 2
## 1 PLAZA 23 TRUCK STOP 240 CHURCH ST #242
## 2 PRICE CHOPPER #245 515 19TH ST
## 3 PEACOCK 795 BROADWAY
## 4 FINYOUR FISHMONGER 2050 WESTERN AVE
## 5 R&A GROCERY STORE 265 CENTRAL AVE
## 6 ANTHONYS CHOC DIP FRUIT 1693 CENTRAL AVE
## Address Line 3 City State Zip Code Square Footage
## 1 ALBANY NY 12202 3600
## 2 WATERVLIET NY 12189 0
## 3 ALBANY NY 12207 2000
## 4 GUILDERLAND NY 12084 2100
## 5 ALBANY NY 12206 1500
## 6 ALBANY NY 12205 0
## human_address
## 1 {"address":"240 CHURCH ST #242","city":"ALBANY","state":"NY","zip":"12202"}
## 2 {"address":"515 19TH ST","city":"WATERVLIET","state":"NY","zip":"12189"}
## 3 {"address":"795 BROADWAY","city":"ALBANY","state":"NY","zip":"12207"}
## 4 {"address":"2050 WESTERN AVE","city":"GUILDERLAND","state":"NY","zip":"12084"}
## 5 {"address":"265 CENTRAL AVE","city":"ALBANY","state":"NY","zip":"12206"}
## 6 {"address":"1693 CENTRAL AVE","city":"ALBANY","state":"NY","zip":"12205"}
## latitude longitude machine_address needs_recoding
## 1 42.63540995700049 -73.75540780899968 <NA> FALSE
## 2 42.73110653700047 -73.70496327399962 <NA> FALSE
## 3 42.657937794000475 -73.74779704599968 <NA> FALSE
## 4 42.69787242500047 -73.88923733299964 <NA> FALSE
## 5 42.663401914000474 -73.77088070699966 <NA> FALSE
## 6 42.72312237700049 -73.83911998799965 <NA> FALSE
And format the lat/long for future use:
fmDataDF$latitude<-as.numeric(fmDataDF$latitude)
fmDataDF$longitude<-as.numeric(fmDataDF$longitude)
6) Make a quick map with the R package ggplot2
I'm going to use a simple shapefile of New York State that is included in the zip you can download. I’ll go through the mapping quickly but if you want more detail, much of the code comes from a previous post on mapping with ggplot2.
library(rgdal)
library(ggplot2)
state<-readOGR("X:/administrative/social_media/blogposts/json_from_R", layer="nys")
## OGR data source with driver: ESRI Shapefile
## Source: "X:/administrative/social_media/blogposts/json_from_R", layer: "nys"
## with 1 features and 53 fields
## Feature type: wkbMultiPolygon with 2 dimensions
You can't simply add the points, because the state boundary has a projected coordinate system while the points are unprojected latitude and longitude. You can get the projection of the state boundaries with proj4string(state). So we need to project the points before we can add the food markets (again, see the previous post; I know this code is tricky):
fmDataDF<-fmDataDF[!is.na(fmDataDF$latitude) & !is.na(fmDataDF$longitude),]
fmDataDF[["Square Footage"]]<-as.numeric(fmDataDF[["Square Footage"]])
write.csv(fmDataDF[,c("Estab Type", "DBA Name", "latitude", "longitude", "Square Footage")],
"X:/administrative/social_media/blogposts/json_from_R/foodmarkets.csv", row.names=FALSE)
coordinates(fmDataDF)<-~longitude+latitude
proj4string(fmDataDF)<-CRS("+proj=longlat +datum=NAD83") #set the coordinate system
fmDataDF<-spTransform(fmDataDF, CRS(proj4string(state)))
geodata<-data.frame(coordinates(fmDataDF))
names(geodata)<-c("x", "y")
Now we're ready to combine them:
ggplot() +
  geom_polygon(data=state, aes(x=long, y=lat, group=group), fill="grey40",
               colour="grey90", alpha=1) +
  labs(x="", y="", title="Food Markets NYS") + # labels
  theme(axis.ticks.y=element_blank(), axis.text.y=element_blank(), # get rid of y ticks/text
        axis.ticks.x=element_blank(), axis.text.x=element_blank(), # get rid of x ticks/text
        plot.title=element_text(lineheight=.8, face="bold", vjust=1)) + # make title bold and add space
  geom_point(aes(x=x, y=y), data=geodata, alpha=1, size=3, color="grey20") + # to get outline
  geom_point(aes(x=x, y=y), data=geodata, alpha=1, size=2, color="darkblue") +
  coord_equal(ratio=1) # square plot to avoid the distortion
7) An interactive map with CartoDB
This data is better suited to an interactive map, so I manually added the data and created the map below:
Thanks for posting this! I adapted the code to a project I’m working on with Code for Atlanta, to make the GA Dept of Public Health inspection data more accessible. Here’s the link: https://github.com/codeforatlanta/inspectr/blob/master/getjsondata_socrata.R
I got the following error message as I was going through the example:
> state<-readOGR("X:/administrative/social_media/blogposts/json_from_R", layer="nys")
Error in ogrInfo(dsn = dsn, layer = layer, encoding = encoding, use_iconv = use_iconv, :
Cannot open file
Did you change the path? You'll need to change the path to where the data is on your computer: `state<-readOGR("D:/junk/foodmarket", layer="nys")`. Be careful NOT to add a trailing slash (NOT D:/junk/foodmarket/).
I am getting this error; can anyone help me resolve it?
foodMarkets[[1]][[14]] ## name of food market 1
Error in foodMarkets[[1]][[14]] : subscript out of bounds
I am also having a problem with this function:
fmNames<-apply(foodMarkets, function(x) x[[14]])
I just re-ran the code and the lines you have issues with work fine (I did not download the data; I used fromJSON and gave it the URL). Note that your second snippet uses apply where my code uses sapply. Please re-check your code, make sure it matches mine exactly, and if you still have issues post a question on Stack Overflow and I'll take a look.
I need to build out an application connectivity map of which hosts talk to which hosts and on what ports. I have a JSON file with Source, Destination, and Port, but have no clue where to start. Any suggestions?
If by "connectivity map" you mean a plot/map of connections, you would read the JSON file into R and use, perhaps, ggplot2. See geom_segment.
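For instance, a minimal sketch with made-up host and port data (all names here are hypothetical) might look like this:

```r
library(ggplot2)

# Made-up connection data standing in for the JSON fields
# (source host, destination host, port)
conns <- data.frame(
  source = c("web01", "web01", "app01"),
  dest   = c("app01", "db01",  "db01"),
  port   = c(8080, 5432, 5432)
)

# Give each host an arbitrary plotting position
hosts <- unique(c(conns$source, conns$dest))
pos   <- data.frame(host = hosts, x = seq_along(hosts),
                    y = seq_along(hosts) %% 2)

# Attach coordinates for the start and end of each connection
conns <- merge(conns, setNames(pos, c("source", "x0", "y0")), by = "source")
conns <- merge(conns, setNames(pos, c("dest",   "x1", "y1")), by = "dest")

# One segment per connection, colored by port
p <- ggplot() +
  geom_segment(data = conns,
               aes(x = x0, y = y0, xend = x1, yend = y1,
                   color = factor(port))) +
  geom_text(data = pos, aes(x = x, y = y, label = host), vjust = -1)
p
```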
Thank you for the great example.
Because of the following errors I couldn’t rerun it.
> fmDataDF<-spTransform(fmDataDF, CRS(proj4string(state)))
non finite transformation detected:
longitude latitude
Error in spTransform(xSP, CRSobj, …) : failure in points
Additional: Warning:
In spTransform(xSP, CRSobj, …) : 83 projected point(s) not finite
Any advice is welcome
Can you double check that you ran all the code in the post as is? I just ran through the code again and it worked fine. You should take a look at the latitude and longitude fields in fmDataDF before you hit that step and see if perhaps something went wrong and you missed a step in the data prep.
This is a very instructive post. Thank you.
I noticed that when I processed the data (on 2016-12-09), there was a small but significant (for your code) difference between the files when accessed from “https://data.ny.gov/api/views/9a8c-vfzj/rows.json?accessType=DOWNLOAD” as opposed to your download from “https://3.85.142.75/blog/wp-content/uploads/2015/02/foodMarkets.zip”.
In the former the variable is “Establishment Type” versus “Estab Type” in the zip download file.
I haven’t had a chance to look at this yet, but will. Thanks for letting me know