For a recent project we have been working with a relatively large database of all historical air pollution data for New York City, going all the way back to the 1950s. Counting both daily and hourly measurements, the database includes about 10 million records and 20 variables. Working with a database of this size using traditional R functions can be cumbersome and time-consuming, so we took advantage of a nice package from Matthew Dowle called data.table. The functions in data.table can speed up computations substantially; in some cases we’ve found improvements of 100x. Below we show the speed savings from using this package for importing data, deleting fields and aggregating on a big air pollution database.
Compare Import Speed
Each year of data is a separate comma-separated text file, so we need to cycle through the files, import each one and bind them together. Instead of the more traditional combination of read.csv and rbind, we can use data.table’s fread and rbindlist to improve import and bind speeds. Below we show the code for each.
# traditional approach using read.csv and rbind
ptm <- proc.time() # set the timer
yrs <- 1957:2012 # each year is a separate CSV file
myDF <- NULL
for (i in yrs) {
  tmppath <- paste(rawpath, i, ".csv", sep = "")  # rawpath is a directory on our computer
  tmpdat <- read.csv(tmppath, as.is = TRUE)
  myDF <- rbind(myDF, tmpdat)
}
(proc.time() - ptm)[3] # this is elapsed time
## elapsed
## 472.7
# data.table approach using fread and rbindlist
library(data.table)
ptm <- proc.time()
yrs <- 1957:2012
myDT <- NULL
for (i in yrs) {
  tmppath <- paste(rawpath, i, ".csv", sep = "")  # rawpath is a directory on our computer
  tmpdat <- fread(tmppath)  # fread is data.table's fast file reader
  myDT <- rbindlist(list(myDT, tmpdat))
}
(proc.time() - ptm)[3]
## elapsed
## 143.6
# how many records are we working with
dim(myDT)
## [1] 6795788 25
The combination of fread and rbindlist from the data.table package improves speeds 3.3 times over the more traditional combination of read.csv and rbind. Now we have a traditional data.frame (myDF) and a data.table (myDT) that we can use to compare speeds on a couple of additional computations.
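As an aside, the loop above can be collapsed into a single call by reading all of the yearly files with lapply and stacking the resulting list of tables with rbindlist. This is a minimal sketch assuming the same rawpath directory and range of years as above; we have not timed it against the loop.
# read every yearly file with fread, then stack the resulting list of tables at once
library(data.table)
yrs <- 1957:2012
files <- paste0(rawpath, yrs, ".csv")  # rawpath is a directory, as in the loops above
myDT <- rbindlist(lapply(files, fread))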
Compare Speed Deleting Fields
You can see from above that there are 25 fields in the data. Many of these are not useful for the kinds of analysis we’re doing, so we can compare the speed of deleting fields. As an example we will delete the latitude and longitude fields using both the traditional and the data.table approach.
fieldsToDelete <- c("latitude", "longitude")
# delete fields the traditional way
ptm <- proc.time()
myDF <- myDF[, !names(myDF) %in% fieldsToDelete]
(proc.time() - ptm)[3]
## elapsed
## 2.12
# delete fields using data.table
ptm <- proc.time()
set(myDT, j = which(colnames(myDT) %in% fieldsToDelete), value = NULL)
(proc.time() - ptm)[3]
## elapsed
## 0
Even on a relatively straightforward operation like removing columns, data.table offers substantial time savings: the set call above finishes essentially instantly, while the traditional subsetting approach takes a couple of seconds.
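For reference, the same deletion can also be written with data.table’s := operator, which drops the columns by reference without copying the table. This is just an alternative form of the call above; it assumes myDT still contains the latitude and longitude columns.
# drop columns by reference using := (no copy of the table is made)
myDT[, c("latitude", "longitude") := NULL]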
Compare Speed in Aggregation
Let’s look at a slightly more complex computation. In the example below we aggregate all hourly data into daily averages, first with the commonly used aggregate function and then with data.table syntax. Note that to simplify the code we have removed all columns except those used in this computation. We have also simplified the computations by, for example, dropping tests for completeness (you wouldn’t want a daily average based on a single hour of data), among other things.
# compute average by monitor/parameter within site the traditional way
ptm <- proc.time()
dailyDF <- aggregate(myDF$val, list(myDF$date, myDF$state, myDF$county, myDF$site,
    myDF$param, myDF$monitor), mean, na.rm = TRUE)
(proc.time() - ptm)[3]
## elapsed
## 33.68
# compute average by monitor/parameter within site using data.table
ptm <- proc.time()
dailyDT <- myDT[, list(dayavg = mean(val, na.rm = TRUE)), by = "date,state,county,site,param,monitor"]
(proc.time() - ptm)[3]
## elapsed
## 2.14
dailyDT <- dailyDT[order(date, state, county, site, param, monitor)]
head(dailyDT)
## date state county site param monitor dayavg
## 1: 1957-01-07 36 061 0001 11101 1 163
## 2: 1957-01-20 36 061 0001 11101 1 325
## 3: 1957-02-02 36 061 0001 11101 1 230
## 4: 1957-02-14 36 061 0001 11101 1 187
## 5: 1957-03-11 36 061 0001 11101 1 323
## 6: 1957-03-15 36 061 0001 11101 1 183
You can see a fairly dramatic improvement when comparing the traditional approach with the data.table approach: using data.table, speeds are improved 15.74 times.
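As noted earlier, a real analysis would also test for completeness before accepting a daily average. As a rough sketch, assuming an 18-hour threshold (an illustration only, not a regulatory rule), the data.table version could count the non-missing hourly values alongside the mean and filter on that count:
# daily averages, keeping only days with at least 18 non-missing hourly values
dailyDT <- myDT[, list(dayavg = mean(val, na.rm = TRUE),
    nhours = sum(!is.na(val))), by = "date,state,county,site,param,monitor"]
dailyDT <- dailyDT[nhours >= 18]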
Final Point: Adding a Key Can Improve Speeds Even More
One of the powerful features of data.table is the ability to add a key, similar to the keys used in SQL databases. With a key in place on a table, computations that involve the key can be sped up significantly. Here we do the same computation as above, except that we first add a key to the table.
# here we add a key to improve speeds (the key is essentially a
# concatenation of date, state, county, site, param and monitor)
setkey(myDT, date, state, county, site, param, monitor)
ptm <- proc.time()
dailyDT <- myDT[, list(dayavg = mean(val, na.rm = TRUE)), by = key(myDT)]
(proc.time() - ptm)[3]
## elapsed
## 0.95
dailyDT <- dailyDT[order(date, state, county, site, param, monitor)]
head(dailyDT)
## date state county site param monitor dayavg
## 1: 1957-01-07 36 061 0001 11101 1 163
## 2: 1957-01-20 36 061 0001 11101 1 325
## 3: 1957-02-02 36 061 0001 11101 1 230
## 4: 1957-02-14 36 061 0001 11101 1 187
## 5: 1957-03-11 36 061 0001 11101 1 323
## 6: 1957-03-15 36 061 0001 11101 1 183
Adding the key speeds up the computation to 35.5 times faster than the traditional approach and 2.3 times faster than the data.table approach without a key.
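A key also makes subsetting on the keyed columns very fast, because data.table can use a binary search instead of scanning every row. As a small sketch, assuming the key set above is still in place and that the date field is stored as a character string in the format shown in the output:
# keyed subset: binary search on the first key column (date)
jan7 <- myDT[J("1957-01-07")]
head(jan7)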