Four reasons why you should check out the R package dplyr

The R package dplyr, written by Hadley Wickham is only a few months old but has already become an important part of our data analysis/manipulation workflow, replacing functions that we have used for years. There are several reasons why dplyr is such a valuable tool but most important from my perspective are the following:

  1. Speed. You’ll see below that dplyr is much, much faster than other, more traditional, functions.
  2. Direct connection to and analysis within external databases permitting simpler handling of large data
  3. Function chaining that allows us to avoid cluttering our workspace with interim objects
  4. Syntax simplicity and ease of use. The code is easy to write and to follow.

To illustrate these enhancements and convince you that dplyr is worth trying out, I’m going to illustrate with an example of population data from US Census blocks. In total the table I will be using, which is stored in a PostgreSQL database, has more than 11 million records. Each record has a population value and Census blocks are wholly contained in US counties so we will illustrate the nice features of dplyr with the simple example of grouping blocks by county and summing population. The size of this database and the fact that it’s housed in an external database will help to show dplyr functionality.

Census blocks are relatively small geographic units and, in city settings, tend to encapsulate actual city blocks. Here is what they look like in lower Manhattan:

nyc

1. Speed – dplyr functions are much faster than other functions

To demonstrate speed differences I will first I will conduct the calculations using our pre-dplyr approach – by connecting to the database with the package RPostgreSQL, then pulling the entire table into R in one big chunk and conducting the analysis with the aggregate function in the stats library.

Note that from purely a speed perspective, others have shown that the package data.table can outperform dplyr in some settings (see, for example, Tal Galili’s post here) but the performance boost is minimal and probably not meaningful is most real-world setting.

To make things a little more interesting let’s get the mean, max and count of blocks within the counties (in addition to sum). By the way, the unique ID for a county is the combination of state ID (statefp10) and county ID (countyfp10) so we need to aggregate by both variables.

How long do the calculations take using aggregate?

I want to keep the elapsed time involved in importing the data separate from the elapsed time for the calculations so I can compare more easily so I’ll set two timers.

library(RPostgreSQL)

drv<- dbDriver("PostgreSQL") # set up connection to database
con<-dbConnect(drv, user="postgres", password="spatial", dbname="census") # set up connection to database
q<-"SELECT statefp10, countyfp10, pop10 FROM geo_tabblock2010_all_pophu" # our query

start<-proc.time() #timer for data import
fulltable<-fetch(dbSendQuery(con, q), n=-1)# fetch the data en masse
dim(fulltable)# dimensions of the full block table

## [1] 11078297        3

getdata.timer<-proc.time()-start
getdata.timer[3] # elapsed time for data import

## elapsed 
##   55.44

Reading in the full table took 55.44 seconds. Now we’re ready to compare the actual calculations using the in-memory table fulltable.


start<-proc.time() #timer for aggregate

# aggregate the traditional way
use.aggregate<-aggregate(pop10~statefp10+countyfp10,data=fulltable,
                   FUN=function(x) c(sum=sum(x), mean=mean(x), max=max(x), cnt=length(x))) #summarize

use.aggregate<-use.aggregate[order(use.aggregate$statefp10, use.aggregate$countyfp10),] # order
aggregate.timer<-proc.time()-start # stop timer

dim(fulltable)

## [1] 11078297        3

head(use.aggregate)

##     statefp10 countyfp10 pop10.sum pop10.mean pop10.max pop10.cnt
## 1          01        001  54571.00      28.92   1322.00   1887.00
## 50         01        003 182265.00      21.12   1117.00   8628.00
## 99         01        005  27457.00      15.09   1626.00   1820.00
## 149        01        007  22915.00      12.90   2093.00   1777.00
## 196        01        009  57322.00      20.84    635.00   2750.00
## 244        01        011  10914.00      13.31   1523.00    820.00

dim(use.aggregate)

## [1] 3143    3

aggregate.timer[3] # ELAPSED TIME USING aggregate (seconds)

## elapsed 
##    23.4

We can see that the query using aggregate took 23.4 seconds. Now let’s see how dplyr handles the data. We will start by using dplyr on the full table in R memory, the same table used by aggregate.

How long do the calculations take using dplyr?

library(dplyr)
start<-proc.time() # start dplyr timer
grp<-group_by(fulltable, statefp10, countyfp10) # set up the grouping
use.dplyr<-summarise(grp, pop10.tot=sum(pop10), 
      pop10.m=mean(pop10), pop10.max=max(pop10),  pop10.cnt=n()) #set up aggregation by groups
use.dplyr<-arrange(use.dplyr, statefp10, countyfp10) # order the data
use.dplyr<-collect(use.dplyr) # grab the results
dplyr.timer<-proc.time()-start

dim(use.dplyr)

## [1] 3143    6

head(use.dplyr)

## Source: local data frame [6 x 6]
## Groups: statefp10
## 
##   statefp10 countyfp10 pop10.tot pop10.m pop10.max pop10.cnt
## 1        01        001     54571   28.92      1322      1887
## 2        01        003    182265   21.12      1117      8628
## 3        01        005     27457   15.09      1626      1820
## 4        01        007     22915   12.90      2093      1777
## 5        01        009     57322   20.84       635      2750
## 6        01        011     10914   13.31      1523       820

dplyr.timer[3]

## elapsed 
##    0.91

Based on the timer we see that dplyr is 25.71 times faster, a significant time saving. This is due in part to the fact that ‘key pieces’ of dplyr are written in Rcpp, a package written to accelerate computations by by integrating R with C++.

2. Direct connection to external databases

As you saw above, the original table we’re working with is in a PostgreSQL database and what we did above is read the entire table into memory. For cleanliness and memory-reasons, we, in this office, keep our largest databases – generally those in the 10s of millions of records – in external databases and we can’t (or would prefer not to) bring whole tables into R. R has some nice packages (RMySQL, RPostgreSQL) we use regularly to directly access these databases but dplyr has built this direct access in. So instead of bringing the large table into R let’s leave the table where it is. Now we can show the approach for the same tallies using dplyr on the database.

Note that, as described in the vignette on dplyr and databases here dplyr ‘tries to be as lazy as possible’ by not pulling data back to R unless the user asks for it (which we are doing with the collect function and delays doing work until the last minute so that it can send a single request to the database).

Run the same procedure but keep the table in the external database

One thing to keep in mind here is that we can avoid the 55.44 seconds above that it took to bring the table into memory since the table is being left in the database.

con<-src_postgres(dbname="census", host="localhost", user="postgres", password="spatial")

start<-proc.time() # start dplyr DB timer
tbl<-tbl(con, sql(q)) #set up table
use.dplyrDB1.grp<-group_by(tbl, statefp10, countyfp10) # set up the grouping
use.dplyrDB1.sum<-summarise(use.dplyrDB1.grp, pop10.tot=sum(pop10), 
            pop10.m=mean(pop10), pop10.max=max(pop10), pop10.min=min(pop10), pop10.cnt=n())
use.dplyrDB1.arr<-arrange(use.dplyrDB1.sum, statefp10, countyfp10) # do ordering
use.dplyrDB1<-collect(use.dplyrDB1.arr) # grab the final result
dplyrDB1.timer<-proc.time()-start # ELAPSED TIME USING dplyr on database (seconds)

dim(use.dplyrDB1)

## [1] 3143    6

head(use.dplyrDB1)

## Source: local data frame [6 x 6]
## Groups: statefp10

##   statefp10 countyfp10 pop10.tot  pop10.m pop10.max pop10.cnt
## 1        01        001     54571 28.91945      1322      1887
## 2        01        003    182265 21.12483      1117      8628
## 3        01        005     27457 15.08626      1626      1820
## 4        01        007     22915 12.89533      2093      1777
## 5        01        009     57322 20.84436       635      2750
## 6        01        011     10914 13.30976      1523       820

dplyrDB1.timer[3]

## elapsed 
##   208.42

In total, the database calculations took 208.42 seconds. Frankly I was a little surprised that the calculations using the external SQL database were significantly slower than the previous calculations (in fact, the in-memory calculations were 229 times faster). But Hadley reminded me that in-memory calculations will be faster than even a well indexed database on disk.

3. Chain functions to reduce clutter and coding

An additional aspect of dplyr that I love is the ability to chain functions together. Although technically there is some chain functionality in base R (try, for example, sum(c(3,3))%.%sum(c(12,11))) dplyr takes chaining to a new level. Below you’ll see that the code in Section 2 above is re-written using the %.% operator which allows us to chain together the various dplyr commands. The results are the same, but the amount of coding is reduced and we get rid of the intermittent objects.

Clean up the code using chaining

library(dplyr)

start<-proc.time() # start dplyr DB timer (chain)

use.dplyrDB2<-tbl(con, sql(q))%.%
  group_by(statefp10, countyfp10)%.%
  summarise(pop10.tot=sum(pop10), pop10.m=mean(pop10), pop10.max=max(pop10), pop10.min=min(pop10), pop10.cnt=n())%.%
  arrange(statefp10, countyfp10)%.%
  collect()

dplyrDB2.timer<-proc.time()-start # ELAPSED TIME USING dplyr on database (chained) (seconds)

dim(use.dplyrDB2)

## [1] 3143    6

head(use.dplyrDB2)

## Source: local data frame [6 x 6]
## Groups: statefp10

##   statefp10 countyfp10 pop10.tot  pop10.m pop10.max pop10.cnt
## 1        01        001     54571 28.91945      1322      1887
## 2        01        003    182265 21.12483      1117      8628
## 3        01        005     27457 15.08626      1626      1820
## 4        01        007     22915 12.89533      2093      1777
## 5        01        009     57322 20.84436       635      2750
## 6        01        011     10914 13.30976      1523       820

dplyrDB2.timer[3]

## elapsed 
##   209.45

The chaining took 209.45 seconds, the same as the result without chaining (208.42 seconds) but with a lot less coding. (Note that when I first published this tutorial I reported that the chaining approach was slower and Hadley pointed out that this should not be the case. I re-ran the calculations in the same way as I had done previously and got the same results. I then ran the non-chained and then the chained code in separate R instances at separate times and found, as Hadley suggested, the nearly identical results reported here now. Given that the original calculations were done sequentially, perhaps the original difference in speed was attributed to reduced memory available to subsequent calculations (suggestions are welcome).)

Let’s look at the result using another library written by Hadley Wickham, ggplot2.

library(ggplot2)
ggplot(use.dplyrDB2, aes(x=pop10.tot))+geom_histogram(color='white', fill='cadetblue')+
  xlab("County Population (2010) Computed by Summing Blocks")

plot of chunk unnamed-chunk-10

4. Syntax simplicity and ease of use

Finally, I’ve found that, especially when compared to package data.table, the syntax is much easier to follow and memorize. I still find myself referring to cheat sheets for data.table while the transition to dplyr has been smoother. In addition to the relative simplicity, there are a few nice flourishes to the code that have simplified coding. Here are a couple of small examples.

If you want to create, for example, two new columns, the function mutate allows, say, the second column to refer to the first not-yet-created column. Hard to describe easier to show:

Easy creation of new columns


library(datasets)
data(cars)
head(cars)

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

newcars<-mutate(cars, 
       col1 = speed*dist,
       col2 = col1*10,
       col3 = col1+col2)
head(newcars)

##   speed dist col1 col2 col3
## 1     4    2    8   80   88
## 2     4   10   40  400  440
## 3     7    4   28  280  308
## 4     7   22  154 1540 1694
## 5     8   16  128 1280 1408
## 6     9   10   90  900  990

I also find that selecting columns by name is significantly easier with dplyr. Here are examples using the same table we just created:

Easy column selection

# select two columns
head(select(newcars, speed, col2))

##   speed col2
## 1     4   80
## 2     4  400
## 3     7  280
## 4     7 1540
## 5     8 1280
## 6     9  900
# all but 1 column
head(select(newcars, -col2))
##   speed dist col1 col3
## 1     4    2    8   88
## 2     4   10   40  440
## 3     7    4   28  308
## 4     7   22  154 1694
## 5     8   16  128 1408
## 6     9   10   90  990
# sequence of columns
head(select(newcars, speed, col1:col3))
##   speed col1 col2 col3
## 1     4    8   80   88
## 2     4   40  400  440
## 3     7   28  280  308
## 4     7  154 1540 1694
## 5     8  128 1280 1408
## 6     9   90  900  990

Conclusions

The dplyr package is an excellent addition to our workflow. It speeds up, cleans up and simplifies data analysis and manipulation. If you would like to learn more about dplyr please be sure to check out the well done introduction to dplyr on the RStudio blog at this site. Definitely don’t forget to follow the links at that bottom to useful vignettes including one on working with databases.

2 responses

  1. “Frankly I was a little surprised that the calculations using the external SQL database were significantly slower than the previous calculations (in fact, the in-memory calculations were 229 times faster). But Hadley reminded me that in-memory calculations will be slower than even a well indexed database on disk.”

    Should that be *faster*?

Leave a Reply to zev@zevross.com Cancel reply

Your email address will not be published. Required fields are marked *