Using R to download and parse JSON: an example using data from an open data portal

I used to spend considerably more time begging and, sometimes, badgering government agencies for data. Thanks to the push for more open and transparent data I’m more and more likely to find data I need posted in a nice manageable format on a government website. This was true on a recent project where I needed the locations of food markets in New York State. A quick web search turned up a lovely dataset on the New York State Open Data Portal posted by the Department of Agriculture and Markets. Thank you Open Data!

If you visit this site you can browse and download the data in a variety of different formats. The interface looks like this:

openData

If you’re using R or other data analysis software, often the most convenient format to work with is comma separated values. With this particular data, though, you’ll find that there are two reasons why CSV is not the best option. For one, if you look closely at the data you will see that the location field is not a traditional set of columns and when you download in CSV you’ll find some of that data missing. Secondly, CSV does not allow for easy inclusion of important metadata. JSON, on the other hand, can easily accommodate the detailed location data and integrate the metadata directly in the file. But working with JSON can be challenging so I’ve put together this post to help guide others through the process.

1) Grab the data

I’ll warn you that I sometimes found the Open Data Portal to be a little slow. You can view and download the data from this link — but if you’re having timeout issues, I’ve also posted a ZIP file with the JSON as well as a shapefile that I use below for mapping (warning, it’s 5mb).

In order to read the data I’m using the R package RJSONIO and I’m reading directly from the Open Data Portal — if this takes too long for you be sure to download the ZIP and save to your local machine.

library(RJSONIO)
# from the website
foodMarketsRaw<-fromJSON("https://data.ny.gov/api/views/9a8c-vfzj/rows.json?accessType=DOWNLOAD")

# if you downloaded
#foodMarketsRaw<-fromJSON("retail_food_markets.json")

Note that I was asked by Scott Chamberlain why I used the package RJSONIO rather than rjsonlite. My answer was “no good reason”. I followed up with a speed test on this file using the fromJSON function from each package. RJSONIO was 3x faster in this one case. At the suggestion of Jeroen Ooms, creator of the jsonlite package, I added the simplifyVector=FALSE argument to the jsonlite::fromJSON and found this allows jsonlite to read the data 2x faster than RJSONIO (though for a perfect time comparison I would need to change RJSONIO simplification settings also).

For more information on related packages, Gaston Sanchez has a really nice presentation on the different options for reading JSON data in R.

2) Extract the data from the JSON file

If you take a look at the file in the browser or in a text editor you'll see that the first big chunk of lines is devoted to the metadata – the source of the file etc. It is not until line 1229 that you see the data node which is all we need initially. If you type str(foodMarketsRaw) you’ll notice that the data has been read as an R list. So we will use double brackets to extract the data node:

# extract the data node
foodMarkets<-foodMarketsRaw[['data']]

The data (in the JSON file) looks something like this:

"data" : [ [ 1, "D943D633-8DB5-4740-B5F4-06AAB6FC52E2", 1, 1412261567, "707400", 1412262737, "708543", "{\n}", "Albany", "704121", "Store", "JAC   ", "DHARAM INC                       ", "PLAZA 23 TRUCK STOP    ", "240", "CHURCH ST #242               ", "             ", "               ", "ALBANY            ", "NY", "12202", "3600", [ "{\"address\":\"240 CHURCH ST #242\",\"city\":\"ALBANY\",\"state\":\"NY\",\"zip\":\"12202\"}", "42.63540995700049", "-73.75540780899968", null, false ] ]
, [ 2, "7A069C98-90B2-4B11-A4A7-3C4761D14DDD", 2, 1412261567, "707400", 1412262426, "708543", "{\n}", "Albany", "711851", "Store", "JAC   ", "PRICE CHOPPER OPERATING CO INC   ", "PRICE CHOPPER #245     ", "515", "19TH ST                  ", "             ", "               ", "WATERVLIET        ", "NY", "12189", "0", [ "{\"address\":\"515 19TH ST\",\"city\":\"WATERVLIET\",\"state\":\"NY\",\"zip\":\"12189\"}", "42.73110653700047", "-73.70496327399962", null, false ] ]
, [ 3, "68DBE61E-ACF4-4803-BD90-87380FDAA4F9", 3, 1412261567, "707400", 1412262278, "708543", "{\n}", "Albany", "704964", "Store", "JAC   ", "PEACOCK INC                      ", "PEACOCK                ", "795", "BROADWAY                     ", "             ", "               ", "ALBANY            ", "NY", "12207", "2000", [ "{\"address\":\"795 BROADWAY\",\"city\":\"ALBANY\",\"state\":\"NY\",\"zip\":\"12207\"}", "42.657937794000475", "-73.74779704599968", null, false ] ]
, [ 4, "9DA5A827-A0AD-4782-8CE2-5D6867B53293", 4, 1412261567, "707400", 1412262565, "708543", "{\n}", "Albany", "704995", "Store", "JAC   ", "ALBANY PRIDE SEAFOOD LLC         ", "FINYOUR FISHMONGER     ", "2050", "WESTERN AVE                 ", "             ", "               ", "GUILDERLAND       ", "NY", "12084", "2100", [ "{\"address\":\"2050 WESTERN AVE\",\"city\":\"GUILDERLAND\",\"state\":\"NY\",\"zip\":\"12084\"}", "42.69787242500047", "-73.88923733299964", null, false ] ]]

3) Orient yourself to the data

Working with JSON in R can be a bit disorienting because you end up with lists within lists within lists so let's break it down a bit. In this dataset we have an outer list where each list item is an individual food market (you can see this from the sample data above. So foodMarkets[[1]] would give you all the data for the first food market in the list. If you type length(foodMarkets[[1]]) you'll see that the first food market comes with 23 pieces of information. For example, if you explore the sample above you'll see that the 14th element is the food market name:

foodMarkets[[1]][[14]] ## name of food market 1
## [1] "PLAZA 23 TRUCK STOP    "
foodMarkets[[2]][[14]] ## name of food market 2
## [1] "PRICE CHOPPER #245     "

4a) Assemble the data frame: an example of extracting a variable

I’m going to extract the data variable-by-variable and then assemble them into a data frame. Since we have a list, a logical place to start is the lapply function which will operate on the list piece-by-piece or, even better sapply which does the same thing but will return an array instead of a list.

So, for example, if you want to extract the food market names from the list you can use this code which essentially cycles through each food market and extracts the name:

fmNames<-sapply(foodMarkets, function(x) x[[14]])
head(fmNames)
## [1] "PLAZA 23 TRUCK STOP    " "PRICE CHOPPER #245     "
## [3] "PEACOCK                " "FINYOUR FISHMONGER     "
## [5] "R&A GROCERY STORE      " "ANTHONYS CHOC DIP FRUIT"

We could copy and paste this line of code 23 times but this is cumbersome and prone to error — let’s do it programatically.

4b) Assemble the data frame: extracting all the variables (except the geography)

There are a ton of ways to extract all the variables without hard coding. I can pretty much guarantee that this is not the best approach (and please do write to me with alternatives!). Originally I tested using two, nested sapply statements but ran into trouble when certain data values were missing. So, instead, I wrote two functions. I have a function that returns the data if it exists and an NA otherwise (this is returnData) and then I have a function that does the sapply. (Note that I'm only applying this to the first 22 variables, not the geographic infomation – this is next).

library(gdata) # for the trim function
grabInfo<-function(var){
  print(paste("Variable", var, sep=" "))  
  sapply(foodMarkets, function(x) returnData(x, var)) 
}

returnData<-function(x, var){
  if(!is.null( x[[var]])){
    return( trim(x[[var]]))
  }else{
    return(NA)
  }
}

# do the extraction and assembly
fmDataDF<-data.frame(sapply(1:22, grabInfo), stringsAsFactors=FALSE)

4c) Assemble the data frame: extracting the geographic information

There is one additional level of complication with the geographic information stored in element 23 of each market. This “variable” is, itself, a list.

foodMarkets[[1]][[23]] #geographic info for food market 1
## [[1]]
## [1] "{\"address\":\"240 CHURCH ST #242\",\"city\":\"ALBANY\",\"state\":\"NY\",\"zip\":\"12202\"}"
## 
## [[2]]
## [1] "42.63540995700049"
## 
## [[3]]
## [1] "-73.75540780899968"
## 
## [[4]]
## NULL
## 
## [[5]]
## [1] FALSE

The first piece is a JSON format of the address and then lat and long and then two administrative variables. We will use the same approach as before but hard code the variable number and extract the data one level deeper.

grabGeoInfo<-function(val){

    l<- length(foodMarkets[[1]][[val]])
    tmp<-lapply(1:l, function(y) 

      sapply(foodMarkets, function(x){

        if(!is.null(x[[val]][[y]])){
          return(x[[val]][[y]])
        }else{
          return(NA)
        }

        })     
      )
}


fmDataGeo<-grabGeoInfo(23)
fmDataGeo<-data.frame(do.call("cbind", fmDataGeo), stringsAsFactors=FALSE)

fmDataDF<-cbind(fmDataDF, fmDataGeo)

5) Add the names

The column names are in the metadata. If you review the metadata you can see that the columns are under meta:view. The column detail, then, can be extracted with:

columns<-foodMarketsRaw[['meta']][['view']][['columns']]

If you look at any of the entries (try columns[[14]]) you'll see that there is a lot more than just column names. So, once again, we'll use sapply. We again have a complication related to geo where the column names are not under meta:view:columns but rather meta:view:columns:subColumnTypes so I'll extract with hard coding here (which is clearer) but I'll also give the function that can do it for you regardless of whether the variable is geo or not:

# names the hard-coding way
fmNames1<-sapply(1:22, function(x) columns[[x]]$name)
fmNames2<-columns[[23]]$subColumnTypes

fmNames<-c(fmNames1, fmNames2)

Here is the function approach instead which will extract names for a geo or non-geo field:

getNames<-function(x){
  if(is.null(columns[[x]]$subColumnTypes)){
    return(columns[[x]]$name)
  }else{
    return(columns[[x]]$subColumnTypes)
  }
}

fmNames<-unlist(sapply(1:length(columns), getNames))

Now we add the names to the dataset and take a look:

names(fmDataDF)<-fmNames
head(fmDataDF)
##   sid                                   id position created_at
## 1   1 D943D633-8DB5-4740-B5F4-06AAB6FC52E2        1 1412261567
## 2   2 7A069C98-90B2-4B11-A4A7-3C4761D14DDD        2 1412261567
## 3   3 68DBE61E-ACF4-4803-BD90-87380FDAA4F9        3 1412261567
## 4   4 9DA5A827-A0AD-4782-8CE2-5D6867B53293        4 1412261567
## 5   5 5D624E4B-E4A8-4F3D-BF88-190663DF8FCF        5 1412261567
## 6   6 86BC29A4-55DA-43C5-BDED-6EF2C92C3C8B        6 1412261567
##   created_meta updated_at updated_meta meta County License Number
## 1       707400 1412262737       708543 {\n} Albany         704121
## 2       707400 1412262426       708543 {\n} Albany         711851
## 3       707400 1412262278       708543 {\n} Albany         704964
## 4       707400 1412262565       708543 {\n} Albany         704995
## 5       707400 1412262177       708543 {\n} Albany         713107
## 6       707400 1412262499       708543 {\n} Albany         711674
##   Operation Type Estab Type                    Entity Name
## 1          Store        JAC                     DHARAM INC
## 2          Store        JAC PRICE CHOPPER OPERATING CO INC
## 3          Store        JAC                    PEACOCK INC
## 4          Store        JAC       ALBANY PRIDE SEAFOOD LLC
## 5          Store        JAC                   POPAL BASHIR
## 6          Store        JAC      FRUIT BASKETS & GIFTS INC
##                  DBA Name Street Number    Street Name Address Line 2
## 1     PLAZA 23 TRUCK STOP           240 CHURCH ST #242               
## 2      PRICE CHOPPER #245           515        19TH ST               
## 3                 PEACOCK           795       BROADWAY               
## 4      FINYOUR FISHMONGER          2050    WESTERN AVE               
## 5       R&A GROCERY STORE           265    CENTRAL AVE               
## 6 ANTHONYS CHOC DIP FRUIT          1693    CENTRAL AVE               
##   Address Line 3        City State Zip Code Square Footage
## 1                     ALBANY    NY    12202           3600
## 2                 WATERVLIET    NY    12189              0
## 3                     ALBANY    NY    12207           2000
## 4                GUILDERLAND    NY    12084           2100
## 5                     ALBANY    NY    12206           1500
## 6                     ALBANY    NY    12205              0
##                                                                    human_address
## 1    {"address":"240 CHURCH ST #242","city":"ALBANY","state":"NY","zip":"12202"}
## 2       {"address":"515 19TH ST","city":"WATERVLIET","state":"NY","zip":"12189"}
## 3          {"address":"795 BROADWAY","city":"ALBANY","state":"NY","zip":"12207"}
## 4 {"address":"2050 WESTERN AVE","city":"GUILDERLAND","state":"NY","zip":"12084"}
## 5       {"address":"265 CENTRAL AVE","city":"ALBANY","state":"NY","zip":"12206"}
## 6      {"address":"1693 CENTRAL AVE","city":"ALBANY","state":"NY","zip":"12205"}
##             latitude          longitude machine_address needs_recoding
## 1  42.63540995700049 -73.75540780899968            <NA>          FALSE
## 2  42.73110653700047 -73.70496327399962            <NA>          FALSE
## 3 42.657937794000475 -73.74779704599968            <NA>          FALSE
## 4  42.69787242500047 -73.88923733299964            <NA>          FALSE
## 5 42.663401914000474 -73.77088070699966            <NA>          FALSE
## 6  42.72312237700049 -73.83911998799965            <NA>          FALSE

And format the lat/long for future use:

fmDataDF$latitude<-as.numeric(fmDataDF$latitude)
fmDataDF$longitude<-as.numeric(fmDataDF$longitude)

6) Make a quick map with the R package ggplot2

I'm going to use a simple shapefile of New York State that is included in the zip you can download. I’ll go through the mapping quickly but if you want more detail, much of the code comes from a previous post on mapping with ggplot2.

library(rgdal)
library(ggplot2)
state<-readOGR("X:/administrative/social_media/blogposts/json_from_R", layer="nys")
## OGR data source with driver: ESRI Shapefile 
## Source: "X:/administrative/social_media/blogposts/json_from_R", layer: "nys"
## with 1 features and 53 fields
## Feature type: wkbMultiPolygon with 2 dimensions

You can't simply add the points because the state boundary has a projected coordinate system while the points are unprojected latitude and longitude. You can get the projection of the state boundaries with proj4string(state). So we need to project before we can add food markets (again, see the previous post, I know this code is tricky):

fmDataDF<-fmDataDF[!is.na(fmDataDF$latitude) & !is.na(fmDataDF$longitude),]
fmDataDF[["Square Footage"]]<-as.numeric(fmDataDF[["Square Footage"]])
write.csv(fmDataDF[,c("Estab Type", "DBA Name", "latitude", "longitude", "Square Footage")],
"X:/administrative/social_media/blogposts/json_from_R/foodmarkets.csv", row.names=FALSE)
coordinates(fmDataDF)<-~longitude+latitude
proj4string(fmDataDF)<-CRS("+proj=longlat +datum=NAD83") #set the coordinate system
fmDataDF<-spTransform(fmDataDF, CRS(proj4string(state)))
geodata<-data.frame(coordinates(fmDataDF))
names(geodata)<-c("x", "y")

Now we're ready to combine them:


ggplot() +  
    geom_polygon(data=state, aes(x=long, y=lat, group=group), fill="grey40", 
        colour="grey90", alpha=1)+
    labs(x="", y="", title="Food Markets NYS")+ #labels
    theme(axis.ticks.y = element_blank(),axis.text.y = element_blank(), # get rid of x ticks/text
          axis.ticks.x = element_blank(),axis.text.x = element_blank(), # get rid of y ticks/text
          plot.title = element_text(lineheight=.8, face="bold", vjust=1))+ # make title bold and add space
    geom_point(aes(x=x, y=y), data=geodata, alpha=1, size=3, color="grey20")+# to get outline
    geom_point(aes(x=x, y=y), data=geodata, alpha=1, size=2, color="darkblue")+
    coord_equal(ratio=1) # square plot to avoid the distortion

plot of chunk unnamed-chunk-16

7) An interactive map with CartoDB

This data is better suited to an interactive map, so I manually added the data and created the map below:

11 responses

  1. I got the following error message as I was going through the example –

    > state<-readOGR("X:/administrative/social_media/blogposts/json_from_R", layer="nys")
    Error in ogrInfo(dsn = dsn, layer = layer, encoding = encoding, use_iconv = use_iconv, :
    Cannot open file

    • Did you change the path? You’ll need to change the path to where it is on your computer:

      `state<-readOGR("D:/junk/foodmarket", layer="nys")` Be careful NOT to add a finishing slash (NOT D:/junk/foodmarket/)

  2. I am getting this error can any help me to resolve this problem

    foodMarkets[[1]][[14]] ## name of food market 1
    Error in foodMarkets[[1]][[14]] : subscript out of bounds

    even i am facing problem with this function
    fmNames<-apply(foodMarkets, function(x) x[[14]])

    • I just re-ran the code and the lines you have issues with seem to work fine (I did not download the data, I used fromJSON and gave it the URL). Please re-look at your code, make sure it matches mine exactly and if you still have issues submit an issue on stackoverflow and I’ll take a look.

  3. I need to build out an application connectivity map of which host talk to which host and on what ports.

    I have the JSON file with Source, Destination, and Port, but have no clue as to where to start. Any suggestions?

    • If by “connectivity map” you mean plot/map of connections you would read the JSON file into R and use, perhaps, ggplot2. See geom_segment.

  4. Thank you for the great example.
    Because of the following errors I couldn’t rerun it.
    > fmDataDF<-spTransform(fmDataDF, CRS(proj4string(state)))
    non finite transformation detected:
    longitude latitude
    Error in spTransform(xSP, CRSobj, …) : failure in points
    Additional: Warning:
    In spTransform(xSP, CRSobj, …) : 83 projected point(s) not finite

    Any advice is welcome

    • Can you double check that you ran all the code in the post as is? I just ran through the code again and it worked fine. You should take a look at the latitude and longitude fields in fmDataDF before you hit that step and see if perhaps something went wrong and you missed a step in the data prep.

  5. This is a very instructive post. Thank you.

    I noticed that when I processed the data (on 2016-12-09), there was a small but significant (for your code) difference between the files when accessed from “https://data.ny.gov/api/views/9a8c-vfzj/rows.json?accessType=DOWNLOAD” as opposed to your download from “http://zevross.com/blog/wp-content/uploads/2015/02/foodMarkets.zip”.

    In the former the variable is “Establishment Type” versus “Estab Type” in the zip download file.

Leave a Reply

Your email address will not be published. Required fields are marked *