Using the merge() function in R on big tables can be time-consuming. Luckily, the join functions in the new dplyr package are much faster. The package offers four different joins:
- inner_join (similar to merge with all.x=F and all.y=F)
- left_join (similar to merge with all.x=T and all.y=F)
- semi_join (no real equivalent in merge() unless y only includes the join fields)
- anti_join (no equivalent in merge(); this is all of x without a match in y)
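To make the differences concrete, here is a minimal sketch using two toy data frames (the names x and y and their contents are my own invention, not from the example below):

```r
library(dplyr)

x <- data.frame(id = c(1, 2, 3), val_x = c("a", "b", "c"))
y <- data.frame(id = c(2, 3, 4), val_y = c("d", "e", "f"))

inner_join(x, y, by = "id")  # only ids in both tables: 2 and 3
left_join(x, y, by = "id")   # all rows of x; val_y is NA where id has no match
semi_join(x, y, by = "id")   # rows of x with a match in y, keeping only x's columns
anti_join(x, y, by = "id")   # rows of x with no match in y: id 1
```

Note that semi_join and anti_join are filters: they never add columns from y, which is why merge() has no direct equivalent.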
I can’t find a great discussion of the advantages of the dplyr join functions, but I did find a help response from Hadley Wickham, dplyr's creator, here that briefly lists the advantages:
- rows are kept in their existing order
- much faster
- tells you what keys you’re merging by (if you don’t supply them)
- they also work with database tables
Of course, the advantage that matters most is speed. In this small example, joining a table of more than 6 million records with a 6-row lookup table, inner_join() is about 43 times faster than merge()!
library(dplyr)

nrow(dat)
# 6131037
nrow(dat2)
# 6

ptm <- proc.time()
x <- merge(dat, dat2, by = c("Parameter_Name", "Pollutant_Standard", "Sample_Duration"))
proc.time()[3] - ptm[3]
# elapsed
# 65.56

ptm2 <- proc.time()
y <- inner_join(dat, dat2, by = c("Parameter_Name", "Pollutant_Standard", "Sample_Duration"))
proc.time()[3] - ptm2[3]
# elapsed
# 1.53

# 65.56/1.53 = Wow! 42.8x faster using inner_join