Although many data science-related projects can be completed with a single software tool, we often find that decisions about what tool to use for a project involve weighing a combination of factors: which tool would be “best” for the job, which tools we're most familiar with and whether we already have scripts we can reuse. As an example, below we've outlined our workflow on a recent project in which we used R, Python, PostgreSQL/PostGIS (SQL) and GDAL.
Note that rather than bog down this post with the full code (including things like looping through dozens of text fields), we've limited ourselves to key code snippets.
The project: assemble a database of air pollution data and compute metro averages
The project required us to assemble a large database of air pollution data from the US Environmental Protection Agency. The database needed to include data on annual and daily concentrations of ozone and other pollutants for 1999-2012. The raw air pollution data all comes from the same source and you can download it yourself from here. Ultimately the daily data comprises more than 70 million records and 30 columns – totaling approximately 2,100,000,000 data cells.
In addition to the air pollution data itself, the task required two additional data components – a table with detailed descriptions of the monitoring locations and a geographic file of US metropolitan areas. The site table gives us descriptions of the type of location (land use type, urban/suburban etc) along with coordinates, and the geographic file of US metro areas allows us to assign metro area codes to the sites so that we can compute averages by metro area.
The Workflow
1. Create a spatially-enabled PostgreSQL database
In our office we've moved from MySQL to PostgreSQL, primarily because of the latter's spatial functionality through PostGIS. For this project the spatial component is small, but we need to be able to assign a metro code to the monitoring sites: metropolitan areas are polygons and the monitoring locations are represented by points. Once you've installed both PostgreSQL and PostGIS, creating a spatially-enabled database simply involves using a spatial template. For example, if you're using pgAdmin III you would create a new database and, under the “Definition” tab, choose the template you likely installed when you installed PostGIS. In my case, this template is called template_postgis_21. By using this template you include the spatial functionality that comes with PostGIS. We are using PostgreSQL 9.3 and PostGIS 2.1.
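If you prefer to script this step rather than click through pgAdmin, the database can also be created from R through RPostgreSQL. This is just a minimal sketch assuming a local server, the template name above and placeholder credentials:
# a minimal sketch: create the spatially-enabled database from R
# (host/user/password are placeholders -- adjust for your own server)
library(RPostgreSQL)
con <- dbConnect(PostgreSQL(), host = "localhost", user = "xx", password = "xx", dbname = "postgres", port = "5432")
# CREATE DATABASE must run outside a transaction, which is the default here
dbSendQuery(con, "CREATE DATABASE epa_aqs TEMPLATE template_postgis_21")
dbDisconnect(con)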
2. Download and unzip the air pollution data
The air pollution data is housed at the US EPA and provided in year- and pollutant-specific tables for both annual and daily data. The website tables look like this:
The data is in a nice, easy-to-use CSV format. In our case we needed several pollutants and several aggregation levels (annual and daily), so we wrote scripts to loop through the pollutants, identify the appropriate URLs for the zip files and then download them. The basic download and unzip uses the following R code:
# ----- data URL
url <- "http://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/download_files.html"
# use the XML package to pull out the tables of URLs
library(XML)
dat1 <- readHTMLTable(url, stringsAsFactors = FALSE)[[1]] # Annual summary data
# collapse each table row into a single string so we can search for file names
rows <- apply(dat1, 1, paste, collapse = " ")
# find the zip file names using regular expressions
m <- regexpr("Annual_\\w{3}_\\d{4}\\.zip", rows, perl = TRUE)
res <- regmatches(rows, m)
zipurls <- paste("http://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/", res, sep = "")
# download and unzip (an example using 1 -- our code loops through all)
download.file(zipurls[1], "temp")
unzip("temp", exdir = "d:/junk", junkpaths = TRUE)
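The full loop over all of the zip files is essentially just those two lines repeated; a minimal sketch, using the same placeholder paths as above, looks like this:
# a minimal sketch of the download loop -- error handling omitted
for(zipurl in zipurls){
    download.file(zipurl, "temp", mode = "wb") # mode="wb" keeps the zip intact on Windows
    unzip("temp", exdir = "d:/junk", junkpaths = TRUE)
}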
3. Put the data in a PostgreSQL database
Bringing the data into PostgreSQL can be done many ways. The most efficient approach would be to use psql at the command line with code like the following:
psql -h localhost -U user -d db -c "COPY datatable(field1, field2) FROM 'datatable.txt' DELIMITER ',' CSV HEADER"
But we already had R scripts for writing to PostgreSQL from another project. This is definitely not the most efficient approach since the data gets read into R first and then exported to PostgreSQL; if we were to do this again we would probably just use psql. Nevertheless, R is often our tool of choice and the code worked smoothly. Here is what the code looks like:
# here is a snippet of code showing how the data was read into R and
# written to PostgreSQL
library(RPostgreSQL)
tempdata <- read.csv("Path to CSV", as.is = TRUE)
con <- dbConnect(PostgreSQL(), host = "localhost", user = "xx", password = "xx", dbname = "epa_aqs", port = "5432")
dbWriteTable(con, "airpollutiondata", tempdata) # first table
# loop through all subsequent tables
dbWriteTable(con, "airpollutiondata", tempdata, append = TRUE)
# add index -- computations are 100x faster with this index on aqs_site_id
dbSendQuery(con, 'CREATE INDEX daily_siteid_indx ON airpollutiondata(aqs_site_id)')
We used this code to loop through the pollutants and the years, creating one table of annual data (1.5 million records) and one of daily data (71 million records).
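The loop itself is straightforward; a stripped-down sketch, using the placeholder folder of unzipped CSV files from step 2, looks something like this:
# a stripped-down sketch of the import loop -- "d:/junk" is the
# placeholder output folder used earlier
csvfiles <- list.files("d:/junk", pattern = "\\.csv$", full.names = TRUE)
for(i in seq_along(csvfiles)){
    tempdata <- read.csv(csvfiles[i], as.is = TRUE)
    # create the table on the first pass, append on every pass after that
    dbWriteTable(con, "airpollutiondata", tempdata, append = (i > 1))
}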
4. Get the detailed site data - stored in an unusual format
As I mentioned, the air pollution data is in a lovely, easy-to-use format, but it is missing some attributes that we need. So we also want the detailed site descriptions, which are available from EPA here.
The sites data, unfortunately, is more complicated because it is stored in a multi-table text file. This means that several tables are stored in a single text file and distinguished, in this example, by the first few characters of each line. So, in the example below, every line is in the same text file, but the first two letters identify which table the line belongs to. For example, MD is the “Monitor Agency Role” table, MA is the “Basic Monitor Information” table and so on.
# the records starting with MD make up a different table from
# those that start with an ME, MK etc
MD|I|01|001|0002|42401|1|CERTIFYING|0013|19800101|19820801
MD|I|01|001|0002|42401|1|PQAO|0013|19800101|19820801
MD|I|01|001|0002|42401|1|REPORTING|0013|19800101|19820801
ME|I|01|001|0002|42401|1|UNKNOWN||0000|||
MK|I|01|001|0002|42401|1|101|1|007|014|||
MK|I|01|001|0002|42401|1|102|1|007|020|||
MM|I|01|001|0002|42401|1|014|19800909|19820331
MM|I|01|001|0002|42401|1|020|19820401|
MA|I|01|001|0002|44201|1|03|MOBILE|NEIGHBORHOOD|||3||||||||||||||||
MB|I|01|001|0002|44201|1|19800101|19821031
MC|I|01|001|0002|44201|1|UNKNOWN|19800101|19801231
For me, this is a job for Python; I find that Python is quicker and more adept at working with line-level data from text files. You can certainly do this in R using readLines and other functions. In fact, I recently came across a request for help that I sent to the R-Help list in 2006 with code for doing exactly this (pasted below). But I like working with Python for this kind of task, and here is the code snippet:
import re
import psycopg2  # module for connecting to PostgreSQL
conn = psycopg2.connect("host=192.168.1.3 dbname=epa_aqs user=XXX password=XXX port=5432")
cur = conn.cursor()
# create the table
cur.execute("DROP TABLE IF EXISTS sites")
cur.execute("CREATE TABLE sites" + """
(
trantype CHAR(2),
action CHAR(1),
... OTHER FIELDS...
)""")
filename = "RD_SITEMON_4_2014-0.txt"  # file to read
f = open(filename, 'r')  # open and read
lines = f.readlines()
f.close()
justAA = []  # Lines that start with AA are basic site info
justMA = []  # Lines that start with MA are basic monitor info
justMF = []  # Lines that start with MF are monitor sampling schedule
# loop through lines in the file and sort them into the temporary lists
for line in lines:
    firsttwo = line[0:2]
    if firsttwo == 'AA': justAA.append(line)
    if firsttwo == 'MA': justMA.append(line)
    if firsttwo == 'MF': justMF.append(line)
# put the data in the PostgreSQL database line by line
# (this loop could have been added to the loop above)
for line in justAA:
    query = """INSERT INTO sites (trantype, action, ...OTHER FIELDS...) VALUES( """ + "%s, "*44 + "%s)"
    data = line.strip('\n').split('|')  # this is a pipe-delimited file
    data = [x if x != '' else None for x in data]
    cur.execute(query, data)
conn.commit()  # make the inserts permanent
conn.close()
As I mentioned above, if you were determined to do this in R you could use the code below to extract the lines that begin with, say, “MD”. You would then connect to PostgreSQL using code like that shown above.
# you can test this code out by copying the table above
L <- readLines("clipboard")
L <- grep("^MD", L, value = TRUE)
L <- read.table(textConnection(L), sep = "|")
Note: this code comes from Gabor Grothendieck in answer to an R-Help question I submitted in 2006!
5. Add a geometry column to the sites table using the latitude and longitude
We can't do any spatial queries until we define the geometry column. To do this we use the following SQL to create a geometry column called geom:
ALTER TABLE sites ADD COLUMN geom geometry(POINT,4269); --4269 is NAD 83
UPDATE sites SET geom = ST_SetSRID(ST_MakePoint(longitude,latitude),4269);
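Since the rest of the workflow is driven from R, these two statements (plus a quick sanity check) can also be sent through the existing RPostgreSQL connection; this is just a sketch using the connection from step 3:
# run the geometry SQL from R through the existing connection (con)
dbSendQuery(con, "ALTER TABLE sites ADD COLUMN geom geometry(POINT,4269)")
dbSendQuery(con, "UPDATE sites SET geom = ST_SetSRID(ST_MakePoint(longitude, latitude), 4269)")
# sanity check -- how many sites are missing a geometry?
dbGetQuery(con, "SELECT count(*) FROM sites WHERE geom IS NULL")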
6. Get the data on metro areas
We downloaded data on Metropolitan Statistical Areas (MSA) and Primary Metropolitan Statistical Areas (PMSA) from here and imported it into PostgreSQL. The Census actually no longer uses MSAs and PMSAs, having moved to Core Based Statistical Areas (CBSA) and Metropolitan Divisions (METDIV), but many past health studies used MSA and PMSA boundaries and, for consistency, current studies often still use these boundary definitions. The code we used for the import relies on GDAL's ogr2ogr utility, which automatically creates a geometry column (which we name geom):
ogr2ogr -overwrite -lco GEOMETRY_NAME=geom -a_srs "EPSG:4269" -f "PostgreSQL" PG:"host=localhost user=postgres dbname=xx password=xx port=5432" -nlt MULTIPOLYGON shapefilepath -nln msa
Here is a quick look at the data in QGIS (this is southern New York State and northern Pennsylvania).
7. Transfer attributes from the metro polygons to the site points
Now that we have a geometry column in the sites table and the MSA/PMSA polygons in the database, we can use PostGIS functionality to assign MSA/PMSA codes to the monitoring locations:
-- create new columns to house the MSA name and two ID-related fields
ALTER table tmp_sites ADD COLUMN msaname varchar, ADD COLUMN msacmsa char(10), ADD COLUMN ma99_99_id INTEGER;
-- set the new columns equal to the matching column from the MSA
-- table if they intersect
UPDATE tmp_sites SET msaname = msa.name,
msacmsa = msa.msacmsa,
ma99_99_id = msa.ma99_99_id
FROM geo_ma99_99 msa
WHERE ST_Intersects(msa.geom, tmp_sites.geom);
8. Compute metro averages using dplyr
We are major converts to dplyr, the new library by Hadley Wickham. You can see our explanation of why at this post.
Our code for computing MSA averages is much, much more involved than what we're showing below. We average all monitors within a site, we apply completeness criteria, we check that units are consistent, and more. The basic idea, though, is that we loop through MSAs, grab all the monitoring data for each MSA, do some cleaning and then average all sites by MSA.
Extremely stripped-down code:
# 1. Loop through MSAs
# 2. Query the site table to get the list of monitors in the MSA
# 3. Assemble a query like the one below to get the pollutant data
# 4. In our case we queried the database and brought the data into R
#    (with dplyr you can analyze the data in the database itself, but we chose not to)
# 5. Use dplyr's group_by and summarize functions to generate the averages
# the code looks something like this
library(dplyr)
query <- "SELECT Year, Pollutant, msaname, Arithmetic_Mean FROM dat_daily_summary WHERE aqs_site_id IN ('06-061-0002','06-017-0001')"
rs <- dbSendQuery(con, query)
sitedata <- fetch(rs, n = -1)
areaGroup <- group_by(sitedata, Pollutant, Year, msaname)
areaSum <- summarize(areaGroup, count = n(), Arithmetic_Mean = mean(Arithmetic_Mean))
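To give a flavor of the loop described in the comments, here is a heavily simplified sketch; the site lookup against tmp_sites is an assumption, and the cleaning, unit checks and completeness criteria are omitted:
# heavily simplified sketch of the MSA loop -- cleaning and completeness
# checks are omitted and the tmp_sites lookup is an assumption
msalist <- dbGetQuery(con, "SELECT DISTINCT msaname FROM tmp_sites WHERE msaname IS NOT NULL")$msaname
msaresults <- list()
for(msa in msalist){
    ids <- dbGetQuery(con, paste0("SELECT DISTINCT aqs_site_id FROM tmp_sites WHERE msaname = '", msa, "'"))$aqs_site_id
    query <- paste0("SELECT Year, Pollutant, msaname, Arithmetic_Mean FROM dat_daily_summary WHERE aqs_site_id IN ('",
                    paste(ids, collapse = "','"), "')")
    sitedata <- dbGetQuery(con, query)
    # the real code cleans the data and applies completeness criteria here
    areaGroup <- group_by(sitedata, Pollutant, Year, msaname)
    msaresults[[msa]] <- summarize(areaGroup, count = n(), Arithmetic_Mean = mean(Arithmetic_Mean))
}
msa_averages <- do.call(rbind, msaresults)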
Conclusion
In the example above we show the workflow for a recent project that involved a combination of R, Python, PostgreSQL/PostGIS and GDAL to download data, add it to a database and conduct analysis. In several places we could have used R or Python to do the same tasks, and our decision about which tools to use involved tradeoffs between the staff time needed to create new scripts and computation time. For us, computation time is often nearly irrelevant – if you can simply set a script running when you leave the office and it's complete in the morning (and, importantly, you only need to run it once!) then there is very little incentive (aside from the sport of it) to spend quality time improving efficiency. There are clear places where we could improve the workflow above from a computational-efficiency perspective, but the scripts run smoothly and produce the final product we need.