Overlap joins in R: a speed comparison with packages sqldf and data.table

In a perfect data world the tables you need to join have common IDs. In this setting, in R, you might use the merge function from the base package or the speedy and useful join functions from the dplyr package. But sometimes your join is a little more complicated.

Say, for example, that you have height categories. Anyone under 5.5 feet tall would be short, between 5.5 and 6 feet is average and above 6 is tall. Then you have a dataset of study participant heights. You might have data that looks like the following:


categories <- data.frame(cat = c("short", "average", "tall"), min=c(0,5.5,6), max=c(5.5,6,10))
study.data<-data.frame(name=c("Ronaldo", "Wambach", "Messi", "Klingenberg"),
                       height=c(6.08,  5.58, 5.92, 5.27))

categories
##       cat min  max
## 1   short 0.0  5.5
## 2 average 5.5  6.0
## 3    tall 6.0 10.0
study.data
##          name height
## 1     Ronaldo   6.08
## 2     Wambach   5.58
## 3       Messi   5.92
## 4 Klingenberg   5.27

With this tiny amount of data you could assign categories by hand or you could do this in R using a variety of different techniques. But what if you had 5000 categories and thousands of participants? This would be a bigger challenge.

As an example of where you might be faced with such an issue consider dates and times. On a current project we were collecting continuous time data at a high temporal resolution – every second. We needed to be able to aggregate the data at a user-chosen time interval – say 15 minutes, 20 minutes or hourly.

In the code below, we show three different approaches to solving the problem and you'll see that one of them is the fastest by far (hint, it uses the package data.table). As a caveat, it is entirely possible that there is a faster solution, perhaps using plyr. If you come up with something faster and/or more elegant let me know and I will post it here.

Data set up

I'm going to create our datasets – one that represents the intervals we want to assign and a second that represents a “real” world dataset – perhaps a GPS unit collecting data every second.

Create the “real” world dataset

I'll create a dataset representing GPS measurements at 1 second intervals. This rider went on three 1-hour rides, I spread them out over a couple of months.

# function to create a table for each rider
riderDF<-function(start, end, id){

  times  <- seq(as.POSIXct(start), as.POSIXct(end), by="1 sec")
  outdat <- data.frame(time = times,
                       rider = id,
                       timeid = seq(1, length(times)))
  outdat
}


# you could also do this will lapply and do.call if you had a lot
# of riders

df1<-riderDF("2015-05-05 08:00:00", "2015-05-05 09:00:00", 1)
df2<-riderDF("2015-06-05 17:30:00", "2015-06-05 18:30:00", 2)
df3<-riderDF("2015-07-08 09:00:00", "2015-07-08 10:00:00", 3)


dat<-rbind(df1, df2, df3)

dim(dat)
## [1] 10803     3
head(dat)
##                  time rider timeid
## 1 2015-05-05 08:00:00     1      1
## 2 2015-05-05 08:00:01     1      2
## 3 2015-05-05 08:00:02     1      3
## 4 2015-05-05 08:00:03     1      4
## 5 2015-05-05 08:00:04     1      5
## 6 2015-05-05 08:00:05     1      6

Create the dataset of 15 minute intervals

I want a dataset with a row for each interval with a beginning and end. I want to make sure that my reference dataset with intervals spans my entire biking dataset plus a little padding to make sure I've covered all times.

In this particular example we would not have to include all intervals in these months, we could limit to these three 1-hour rides, but in our project we actually have many riders and we’d want the intervals to cover all riders and rides

# create the 15 minute intervals making sure to span all the dates

# round down by day
beg <- round(min(dat$time), "days")

# add a day (in seconds) and then round down
end <- round(max(dat$time)+86400, "days")

# the interval I want
interval<-"15 min"

seq15min<-seq(beg, end, by=interval)

# this is the reference dataset with the intervals
ref<-data.frame(interval=1:(length(seq15min)-1), 
                               begin=seq15min[-length(seq15min)], 
                               end=seq15min[-1])

dim(ref)
## [1] 6240    3
head(ref)
##   interval               begin                 end
## 1        1 2015-05-05 00:00:00 2015-05-05 00:15:00
## 2        2 2015-05-05 00:15:00 2015-05-05 00:30:00
## 3        3 2015-05-05 00:30:00 2015-05-05 00:45:00
## 4        4 2015-05-05 00:45:00 2015-05-05 01:00:00
## 5        5 2015-05-05 01:00:00 2015-05-05 01:15:00
## 6        6 2015-05-05 01:15:00 2015-05-05 01:30:00

Using base R

My first attempt was nothing fancy. Create a function that returns the interval for each value.


findinterval<-function(val, reftable){

  res<-reftable[val>=reftable$begin & val<reftable$end,]
  if(nrow(res)!=0) {
    return(data.frame(time=val, res))
  }else{
    return()
  }
}


ptm<-proc.time() #start the timer
res.base<-lapply(dat$time, findinterval, reftable = ref) # run function on all times
res.base<-do.call("rbind", res.base) #assemble into data.frame
basetime<-proc.time()-ptm #end timer


basetime
##    user  system elapsed 
##   12.38    0.03   12.44
head(res.base)
##                    time interval               begin                 end
## 33  2015-05-05 08:00:00       33 2015-05-05 08:00:00 2015-05-05 08:15:00
## 331 2015-05-05 08:00:01       33 2015-05-05 08:00:00 2015-05-05 08:15:00
## 332 2015-05-05 08:00:02       33 2015-05-05 08:00:00 2015-05-05 08:15:00
## 333 2015-05-05 08:00:03       33 2015-05-05 08:00:00 2015-05-05 08:15:00
## 334 2015-05-05 08:00:04       33 2015-05-05 08:00:00 2015-05-05 08:15:00
## 335 2015-05-05 08:00:05       33 2015-05-05 08:00:00 2015-05-05 08:15:00

This worked fine on a small dataset, but 12 seconds is a long time and I found that as the data got bigger the time got sluggish and ultimately unpleasant.

Using the package sqldf

I looked for alternatives and several stackoverflow threads suggested using the package sqldf.See, for example, this thread. Overlapping joins are not uncommon in SQL so this seemed like a natural fit. The code was elegant, lovely, pared down but…. you'll see it does not do well time-wise.

library(sqldf)
ptm<-proc.time() #start the timer

res.sqldf<-sqldf("select * from dat left join ref on dat.time between ref.begin and ref.end")

sqldftime<-proc.time()-ptm #end the timer

sqldftime
##    user  system elapsed 
##   12.05    0.00   12.04
head(res.sqldf)
##                  time rider timeid interval               begin                 end
## 1 2015-05-05 08:00:00     1      1       32 2015-05-05 07:45:00 2015-05-05 08:00:00
## 2 2015-05-05 08:00:00     1      1       33 2015-05-05 08:00:00 2015-05-05 08:15:00
## 3 2015-05-05 08:00:01     1      2       33 2015-05-05 08:00:00 2015-05-05 08:15:00
## 4 2015-05-05 08:00:02     1      3       33 2015-05-05 08:00:00 2015-05-05 08:15:00
## 5 2015-05-05 08:00:03     1      4       33 2015-05-05 08:00:00 2015-05-05 08:15:00
## 6 2015-05-05 08:00:04     1      5       33 2015-05-05 08:00:00 2015-05-05 08:15:00

Shucks, this works well but the time is still entirely too long at 12 seconds.

Using data.table

The package data.table is incredibly fast. In fact our very first post on this blog gushed about the speed. But, frankly, I kept forgetting the syntax so it fell out of use (generally in favor of dplyr) because most of my code does not need to be lightning fast. In this case, though, data.table has the perfect function called foverlaps. This function is designed to do overlap joins similar to what we need in this case.

Again, I thank stackoverflow for code. In this case the code comes from Arun Srinivasan, a co-developer of data.table.

Note that the foverlaps function actually looks at overlaps in two sets of intervals so it expects an interval in your first table and your second table. We don't have an interval in our second table but we can create one by simply creating a new, dummy column with the same information from our time variable. We then compute the overlapping join and remove the dummy variable.

library(data.table)

# for fairness I'm including all the required steps within the timer
ptm<-proc.time() #start the timer

setDT(dat) # make a data.table
setDT(ref) # make a data.table

# the foverlaps function requires a to and from
dat<-dat[, dummy := time] 
head(dat)
##                   time rider timeid               dummy
## 1: 2015-05-05 08:00:00     1      1 2015-05-05 08:00:00
## 2: 2015-05-05 08:00:01     1      2 2015-05-05 08:00:01
## 3: 2015-05-05 08:00:02     1      3 2015-05-05 08:00:02
## 4: 2015-05-05 08:00:03     1      4 2015-05-05 08:00:03
## 5: 2015-05-05 08:00:04     1      5 2015-05-05 08:00:04
## 6: 2015-05-05 08:00:05     1      6 2015-05-05 08:00:05


dat <- dat[order(time)]  # sorting by time so I can choose first match
setkey(ref, begin, end)  # setting keys tells data.table what to join on
setkey(dat, time, dummy) # setting keys tells data.table what to join on

# do the join. The nomatch = 0 says not to return records with no match
# the mult="first" says to give me the first match and the last bit
# removes the dummy column
res.datatable <- foverlaps(dat, ref, nomatch=0L, mult="first")[, dummy := NULL]


# NOTE: if you don't add the mult= argument you can get an example like this
# where you get two matches. I want this example to pick out the start time
# so I'm using mult="first"

#    interval               begin                 end                time ID
# 1:        1 2015-01-01 00:00:00 2015-01-01 00:15:00 2015-01-01 00:15:00  4
# 2:        2 2015-01-01 00:15:00 2015-01-01 00:30:00 2015-01-01 00:15:00  4


datatabletime<-proc.time()-ptm #end timer

datatabletime
##    user  system elapsed 
##    0.03    0.00    0.03
head(res.datatable)
##    interval               begin                 end                time rider timeid
## 1:       32 2015-05-05 07:45:00 2015-05-05 08:00:00 2015-05-05 08:00:00     1      1
## 2:       33 2015-05-05 08:00:00 2015-05-05 08:15:00 2015-05-05 08:00:01     1      2
## 3:       33 2015-05-05 08:00:00 2015-05-05 08:15:00 2015-05-05 08:00:02     1      3
## 4:       33 2015-05-05 08:00:00 2015-05-05 08:15:00 2015-05-05 08:00:03     1      4
## 5:       33 2015-05-05 08:00:00 2015-05-05 08:15:00 2015-05-05 08:00:04     1      5
## 6:       33 2015-05-05 08:00:00 2015-05-05 08:15:00 2015-05-05 08:00:05     1      6

The code is not lean. The code is tough to master. But the code is fast! Less than a second. When I experimented with more rides and more time intervals I was finding speed improvements of more than 1000x.

Summary

Data table seems to blow away the other options. In the example above data table is 414.67 faster than my code using base R and 401.33 faster than the approach using the sqldf pacakge. In other contexts as well I've seen data.table outperform the competition. As I mentioned above, though, the syntax is very hard to remember so much of my code now relies on dplyr. For pure performance, though, data.table does amazingly well.

Posted in R

4 responses

  1. Hi, great and clean Article.

    I tried to use your code with my own Map data in CartoDB, changing the user, table and CARTOCSS but something goes wrong. It doesn’t work and the console log said: Uncaught TypeError: Cannot read property ‘torque’ of undefined.

    Demo: http://lacrafteria.es/aguas/

    It looks like the API doesn’t recognize my user and/or table. Any idea or suggestion?

    Thanks.

  2. Looks like you did not use an index in the sqldf version. A fairer comparison would be to compare sqldf (with an index) to the data.table version. Although it would still probably be faster to use data table.

  3. Wow, data.table is really blazing fast. Thanks for this article, I need to perform similar operations myself and found this really useful.

Leave a Reply to zev@zevross.com Cancel reply

Your email address will not be published. Required fields are marked *