If you browse through our technical blog posts you’ll see quite a few devoted to the data analysis functionality in the R packge
dplyr. This is due to the fact that we are constantly finding fun new functions to play with. We wanted to devote this small post to an unexpectedly useful function called
anti_join() from the
For most data analysis tasks you may have two tables you want to join based on a common ID. This is straightforward in any data analysis package. But occasionally, especially in quality assurance types of settings, we find ourselves wanting to identify the records from one table that did NOT match the other table. For example,
anti_join came in handy for us in a setting where we were trying to re-create an old table from the source data. We then wanted to be able to identify the records from the original table that did not exist in our updated table. This is where
anti_join comes in, especially when you’re dealing with a multi-column ID.
We’ll start with a relatively simple example.
table1<-data.frame(id=c(1:5), animal=c("cat", "dog", "parakeet", "lion", "duck")) table1 ## id animal ## 1 1 cat ## 2 2 dog ## 3 3 parakeet ## 4 4 lion ## 5 5 duck table2<-table1[c(1,3,5),] table2 ## id animal ## 1 1 cat ## 3 3 parakeet ## 5 5 duck
To identify the rows that exist in table1 but not in table2 you could use any number of strategies:
# strategy 1 table1[!table1$id%in%table2$id,] ## id animal ## 2 2 dog ## 4 4 lion # strategy 2 table1[is.na(match(table1$id,table2$id)),] ## id animal ## 2 2 dog ## 4 4 lion # strategy 3. anti join library(dplyr) anti_join(table1, table2, by="id") ## id animal ## 1 2 dog ## 2 4 lion
You might ask why
anti_join is an advance given the other easy solutions we’re showing above. We find it most useful when our common ID is a combination of multiple columns. So let’s use another example where we have a multi-column common ID:
table1<-data.frame(state=rep(1:3, each=2), county=rep(c("A", "B"), 3), vals=rnorm(6)) table1 ## state county vals ## 1 1 A 1.6891 ## 2 1 B -1.7812 ## 3 2 A 0.3795 ## 4 2 B 2.5580 ## 5 3 A 0.4315 ## 6 3 B -1.9571 table2<-table1[c(1,3,4),] table2 ## state county vals ## 1 1 A 1.6891 ## 3 2 A 0.3795 ## 4 2 B 2.5580
With a two-column unique ID using
match() is more challenging. You could create a single ID by concatenating the state/county fields but this adds a messy extra step. Instead anti_join() is your savior:
# which records occur in table1 but not in table2 anti_join(table1, table2, by=c("state", "county")) ## state county vals ## 1 3 B -1.9571 ## 2 3 A 0.4315 ## 3 1 B -1.7812
Thanks for the great simple example Zev, helped me on a Friday afternoon when it was nice to have something go right for once.. Hope you are great!
Nice to hear from you Ted, hope things are well!
This solution is quite helpful when dealing with joins in R,
This is a great post and helped me figure out the last little conundrum in my code. Thanks!
Just what I was looking for, thanks!!
Hi all, I am little confused with the setdiff( ) and anti_join( ). Here in the example they gave the same output. Do they always give same output? I have a hunch that they are for different purpose.
> setdiff(table1, table2, by =”id”)
1 2 dog
2 4 lion
> anti_join(table1, table2, by = “id”)
1 2 dog
2 4 lion
That’s interesting, I don’t know the difference, thanks for sharing.
thank you so much, this helped me alot
Do you know if there’s anyway to keep the original row numbers from the results of the anti-join? It may be necessary to fix those rows and it’s hard to get back to them if you have joined on multiple id values.
probably the best is before the join to use dat %>% mutate(rownum = row_number()) or something.
anti_join(table1, table2, by=c(“state” = “county”))