R’s purrr
package is an amazing set of tools for applying a function to the items in a vector or list. Although the default for this type of operation in purrr
is to run each step sequentially the furrr
package makes it incredibly easy to improve performance by running the steps concurrently (in parallel).
purrr
functions run sequentially
In this simple initial example the map_dbl
function from purrr
takes the values 1, 2, 3 and 4 and squares them. The computations are done sequentially, one at a time.
library(purrr)
map_dbl(1:4, function(x){
x^2
})
## [1] 1 4 9 16
But since the individual computations do not depend at all on a previous computation each of these computations could be done at the same time, taking advantage of modern computers’ mutli-core, multi-threaded architecture.
The furrr
package can parallelize computations easily
Enter the furrr
package. With less than a line of code you can convert the sequential process above to a parallel process.
The furrr
package is a marriage of the purrr
package and the future
package which is designed, in part, to provide a simple way to evaluate R expressions asynchronously (as opposed to sequentially).
Thanks for these innovations go to the package creators Davis Vaughan and Matt Dancho (for furrr
), Henrik Bengtsson (for future
) and Lionel Henry and Hadley Wickham (for purrr
).
Running purrr
functions in parallel is easy with furrr
Here is the same code as before, traditional purrr
running sequentially.
# Sequential
map_dbl(1:4, function(x){
x^2
})
## [1] 1 4 9 16
To convert this same set of computations to run in parallel you simply (1) load the furrr
package, (2) tell R how to set up the parallelization and (3) add future_
in front of the function name.
library(furrr) future::plan(multiprocess) # Parallel future_map_dbl(1:4, function(x){ x^2 })
## [1] 1 4 9 16
You can find more detail on plan()
by looking at the help (?future::plan()
) but plan(multiprocess)
will work to parallelize in most instances on both Mac and Windows.
How do you know it ran in parallel?
Time the computation
If you add a timer you’ll see that future_map_dbl()
is significantly faster than map_dbl()
. Note that just squaring a number is fast so I added a 1 second sleep to make the difference clearer. I’m timing it with the tictoc
package and you can see that the parallel computation is 4x faster.
Note, though, as you’ll see below, in a real-world setting the speed improvements are not always as impressive. It takes time to instantiate a new R process and to copy over the objects needed for the computation. Speed improvements will be greatest in situations where the computations are significantly more time consuming than copying data.
library(tictoc)
# Run sequentially
tic() # start timer
map_dbl(1:4, function(x){
Sys.sleep(1)
x^2
})
toc() #end timer
## [1] 1 4 9 16
## 4.022 sec elapsed
# Run in parallel
tic() # start timer
future_map_dbl(1:4, function(x){
Sys.sleep(1)
x^2
})
toc() # end timer
## [1] 1 4 9 16
## 1.124 sec elapsed
Look at activity monitor or task manager
For definitive proof that future_map_dbl()
runs in multiple R sessions you can look at the Activity Monitor on a Mac or the Task Manager in Windows (on Windows the processes may be listed under the RStudio process). When running map_dbl()
you will see just one R session listed, meaning just one R instance is running through the computations, each in turn. While running future_map_dbl()
you’ll see multiple instances of R, in my case I’m seeing four new instances (five total).
Here we only see one instance of rsession, when using map_dbl()
Here we see five instances of rsession when using future_map_dbl()
Most purrr
functions have a parallel equivalent in furrr
Nearly all purrr
functions have been inplemented including map
, map2
, pmap
, imap
, walk
as well as more specific versions like map_dbl
, map_int
and so on.
All of these will run in parallel
future_map_dbl(1:4, ~.^2)
## [1] 1 4 9 16
l <- list(a = 1:3, b = 4:6, c = 11:13)
future_pmap_dbl(l, function(c, b, a) a / (b + c))
## [1] 0.06666667 0.11764706 0.15789474
future_imap_chr(sample(10), ~ paste0(.y, ": ", .x))
## [1] "1: 1" "2: 5" "3: 7" "4: 4" "5: 2" "6: 3" "7: 9" "8: 10"
## [9] "9: 6" "10: 8"
Adding a progress indicator is as easy as typing .progress = TRUE
future_map_dbl(1:4, function(x){
Sys.sleep(1)
x^2
}, .progress = TRUE)
## Progress: ────────────────────────────── 100%
## [1] 1 4 9 16
A somewhat more real-world example
While this example is still a little contrived, it provides a closer-to-real-world setting for why this might be useful.
In our example we will take a look at state-level and annual trends in births involving twins, triplets etc over a 10-year period 1995-2004.
Extract the data from Google’s BigQuery
Google’s BigQuery service has a ton of fun open data to work with and, not surprisingly, there is an R package to interact with it. For details you can review the readme. In short, you create an account and a project and then enter your project ID in the query in R. The "!!YOUR PROJECT ID!!"
in the code should be replaced by the project ID you get from Google.
If you want to skip this but follow along you can download the resulting datafile here.
library(bigrquery)
sql <- "SELECT state, year, month, plurality FROM `publicdata.samples.natality` WHERE year > 1994 and year < 2005"
tb <- bq_project_query("!!YOUR PROJECT ID!!", sql)
births <- bq_table_download(tb, max_results = Inf)
You can see that the dataset is reasonably large – about 40 million records representing 40 million births.
library(dplyr)
glimpse(births)
## Observations: 39,928,601
## Variables: 4
## $ state <chr> "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK"...
## $ year <int> 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ plurality <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
The field plurality
refers to the number of babys, here are the counts.
group_by(births, year, plurality) %>%
summarise(counts = n()) %>%
tidyr::spread(year, counts)
## # A tibble: 5 x 11
## plurality `1995` `1996` `1997` `1998` `1999` `2000` `2001` `2002` `2003`
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 1 3.80e6 3.79e6 3.77e6 3.83e6 3.84e6 3.94e6 3.90e6 3.89e6 3.96e6
## 2 2 9.68e4 1.01e5 1.04e5 1.11e5 1.14e5 1.19e5 1.21e5 1.25e5 1.29e5
## 3 3 4.57e3 5.30e3 6.16e3 6.92e3 6.75e3 6.74e3 6.90e3 6.91e3 7.12e3
## 4 4 3.65e2 5.64e2 5.10e2 6.27e2 5.12e2 5.06e2 5.05e2 4.34e2 4.68e2
## 5 5 5.70e1 8.10e1 7.90e1 7.90e1 6.70e1 7.70e1 8.50e1 6.90e1 8.50e1
## # ... with 1 more variable: `2004` <int>
Split the data by year
Here is the contrived portion. You could do the upcoming calculations on the full dataset without breaking it up by year and you could even do these calculations in your SQL call to BigQuery. But what we will do instead is split up our data set into a list based on year – each element in the list is a data frame for one year’s worth of data.
births_split <- split(births, births$year)
length(births_split) # one element per year
## [1] 10
# look at the data for one year
head(births_split[[1]])
## # A tibble: 6 x 4
## state year month plurality
## <chr> <int> <int> <int>
## 1 AK 1995 1 1
## 2 AK 1995 1 1
## 3 AK 1995 1 1
## 4 AK 1995 1 1
## 5 AK 1995 1 1
## 6 AK 1995 1 1
Compute the proportion of multiple births and compare purrr
and furrr
We will compute a table like the one below – we need a count of births involving multiple babies and a total count of births. The number we’re interested in computing is multiple_births_proportion
which is the proportion of all births that are multiple births.
## # A tibble: 612 x 6
## state year month multiple_births_co… total_births multiple_births_pro…
## <chr> <int> <int> <int> <int> <dbl>
## 1 AK 1995 1 16 795 0.02
## 2 AK 1995 2 23 736 0.031
## 3 AK 1995 3 31 895 0.035
## 4 AK 1995 4 20 840 0.024
## 5 AK 1995 5 26 901 0.029
## 6 AK 1995 6 24 908 0.026
## 7 AK 1995 7 20 852 0.023
## 8 AK 1995 8 23 899 0.026
## 9 AK 1995 9 23 893 0.026
## 10 AK 1995 10 16 829 0.019
## # ... with 602 more rows
We will use map_dfr
(and the furrr
companion future_map_dfr
) to cycle through the data frames in the list (remember each data frame is one year of births). The _dfr
tells the functions to return a single data frame – similar to rbind()
or bind_rows()
.
We could use the tictoc
package to time again but there are variations in computer processing time so a single run might not give a good portrait. Instead we will time with the microbenchmark
’s microbenchmark
function which allows us to run multiple iterations of the same code and provides an average of the time the computations take.
Run sequentially (the default)
time_serial <- microbenchmark::microbenchmark({
birth_results_serial <- purrr::map_dfr(births_split, function(.data){
.data %>%
mutate(morethan1 = plurality > 1) %>%
group_by(state, year, month, morethan1) %>%
summarize(multiple_births_count = n()) %>%
mutate(total_births = sum(multiple_births_count),
multiple_births_proportion =
round(multiple_births_count/total_births, 3)) %>%
arrange(state, year, month, morethan1) %>%
filter(morethan1)
})
}, times = 10) # times = 10 tells it run the code 10 times
summary(time_serial)$median # in seconds here
## [1] 2.442766
Run in parallel (using furrr
)
plan(multiprocess)
time_parallel <- microbenchmark::microbenchmark({
birth_results_parallel <- furrr::future_map_dfr(births_split, function(.data){
.data %>%
mutate(morethan1 = plurality > 1) %>%
group_by(state, year, month, morethan1) %>%
summarize(multiple_births_count = n()) %>%
mutate(total_births = sum(multiple_births_count),
multiple_births_proportion =
round(multiple_births_count/total_births, 3)) %>%
ungroup() %>%
arrange(state, year, month, morethan1) %>%
filter(morethan1)
})
}, times = 10)
summary(time_parallel)$median # in seconds here
## [1] 2.057291
Confirm the two resulting tables are the same
identical(birth_results_serial, birth_results_parallel)
## [1] TRUE
Confirm table looks correct
head(birth_results_parallel)
## # A tibble: 6 x 7
## state year month morethan1 multiple_births… total_births
## <chr> <int> <int> <lgl> <int> <int>
## 1 AK 1995 1 TRUE 16 795
## 2 AK 1995 2 TRUE 23 736
## 3 AK 1995 3 TRUE 31 895
## 4 AK 1995 4 TRUE 20 840
## 5 AK 1995 5 TRUE 26 901
## 6 AK 1995 6 TRUE 24 908
## # ... with 1 more variable: multiple_births_proportion <dbl>
Time difference
In this particular example, the time saved is less impressive, most likely due to the fact that the computation time for this example is small relative to the data size. It takes time for R and your computer to open new R sessions and to copy the data over. But we still see a time savings of approximately 19%.
# Time serial takes 1.187 (19%) more time to run
summary(time_serial)$median/summary(time_parallel)$median
## [1] 1.18737
A couple of notes about Windows machines
We found that there were two issues for Windows users in our office. First, one person got an error related to the data exceeding the maximum allowed size. To fix this issue she used:
options(future.globals.maxSize= 891289600)
We also noticed on Windows machines the dplyr
code ran slower (this was not true on Macs). We suspect this was due to slower speed setting up the new threads and copying the data given the the small bit of code at the beginning of this post ran quicker in parallel and that the future_map_dfr
code ran more slowly even if we commented out all of the actual processing code (i.e., left the section within future_map_dfr
empty).
Please report to us your experience if you’re working on Windows.
What does the multiple births data look like?
In case you’re curious about the patterns in multiple births, I’ve provided a few visualizations of the resulting data.
birth_results <- birth_results_parallel
Proportion of multiple births overall
There is significant spread in the proportion of births resulting in multiple babies with an average of 0.03.
library(ggplot2)
birth_results %>%
ggplot(aes(multiple_births_proportion))+
geom_histogram(color = "white") +
geom_vline(xintercept = median(birth_results$multiple_births_proportion), color = "forestgreen", size = 1.5) +
ggtitle("Proportion of births with more than one baby (by state, year and month)")
More detail: proportion of multiple births by state, year and season
For reference we will compute (and add to the plots) the state-level medians (across all months and years). We will also do a rough assignment of month to season.
The forcats
package is useful for working with categorical data, in this case we’re reordering the state factor levels by average proportion of multiple births (avg
) with fct_reorder()
so that the states will be organized in the plots and then we’re collapsing months into season with fct_collapse()
.
birth_results <- birth_results %>%
group_by(state) %>%
mutate(avg = median(multiple_births_proportion),
month = as.character(month)) %>%
ungroup() %>%
mutate(state = forcats::fct_reorder(state, avg)) %>%
mutate(season = forcats::fct_collapse(month,
winter = c("1", "2", "12"),
spring = c("3", "4","5"),
summer = c("6", "7", "8"),
fall = c("9", "10", "11")
))
There is a clear difference in states. There is also a hint of a yearly difference (see how the weight of the points is to the left of the black medians in 1995 and to the right in 2004) but I can’t see a seasonal difference.
birth_results %>%
ggplot(aes(state, multiple_births_proportion)) +
geom_point(aes(color = season)) +
coord_flip()+
geom_point(data = distinct(birth_results, state, avg), aes(state, avg)) +
facet_wrap(~year) +
theme(axis.text.y = element_text(size = 6))
Click on this image to see a bigger version
Significant state-level differences
We can see that Massachusetts, Rhode Island, New Jersey and Connecticut have nearly double the rate of multiple births than Wyoming, New Mexico and Hawaii.
state_medians <- birth_results %>%
group_by(state) %>%
summarize(avg = median(multiple_births_proportion)) %>%
arrange(avg)
head(state_medians)
## # A tibble: 6 x 2
## state avg
## <fct> <dbl>
## 1 WY 0.02
## 2 NM 0.0225
## 3 HI 0.023
## 4 ID 0.0255
## 5 AK 0.026
## 6 OK 0.026
tail(state_medians)
## # A tibble: 6 x 2
## state avg
## <fct> <dbl>
## 1 DE 0.036
## 2 DC 0.038
## 3 CT 0.039
## 4 NJ 0.039
## 5 RI 0.0395
## 6 MA 0.043
Closer inspection of yearly and seasonal differences
We can tease out yearly and seasonal differences with box plots organized in a couple of different ways. There seems to be a pretty clear yearly difference but I’m still not seeing a consistent seasonal difference (though in another setting I’d probably take a closer look).
birth_results %>%
ggplot(aes(season, multiple_births_proportion))+
geom_boxplot(aes(color = factor(year)))
birth_results %>%
ggplot(aes(factor(year), multiple_births_proportion))+
geom_boxplot(aes(color = factor(season)))
Finally we can take a look at the yearly trend. We definitely see a clear (probably linear) upward trend in the proportion of multiple births. This is consistent with findings elsewhere may be due to an increasing average age at which women have children or increasing use of fertility drugs.
birth_results %>%
ggplot(aes(year, multiple_births_proportion))+
geom_jitter(width = 0.2, alpha = 0.1) +
stat_smooth(method = "lm")
Conclusion
With less than a line of code you can convert your sequential purrr
code into code that runs in parallel using the furrr
package. Running the code can have significant performance benefits though the benefits you see will depend on the amount of computation and the amount of data you’re working with. Nearly all purrr
functions have a corresponding furrr
equivalent so pretty much anything you run with purrr
can be parallelized.