If you browse through our technical blog posts you’ll see quite a few devoted to the data analysis functionality in the R packge dplyr
. This is due to the fact that we are constantly finding fun new functions to play with. We wanted to devote this small post to an unexpectedly useful function called anti_join
.
Using anti_join()
from the dplyr
package
For most data analysis tasks you may have two tables you want to join based on a common ID. This is straightforward in any data analysis package. But occasionally, especially in quality assurance types of settings, we find ourselves wanting to identify the records from one table that did NOT match the other table. For example, anti_join
came in handy for us in a setting where we were trying to re-create an old table from the source data. We then wanted to be able to identify the records from the original table that did not exist in our updated table. This is where anti_join
comes in, especially when you’re dealing with a multi-column ID.
We’ll start with a relatively simple example.
table1<-data.frame(id=c(1:5), animal=c("cat", "dog", "parakeet",
"lion", "duck"))
table1
## id animal
## 1 1 cat
## 2 2 dog
## 3 3 parakeet
## 4 4 lion
## 5 5 duck
table2<-table1[c(1,3,5),]
table2
## id animal
## 1 1 cat
## 3 3 parakeet
## 5 5 duck
To identify the rows that exist in table1 but not in table2 you could use any number of strategies:
# strategy 1
table1[!table1$id%in%table2$id,]
## id animal
## 2 2 dog
## 4 4 lion
# strategy 2
table1[is.na(match(table1$id,table2$id)),]
## id animal
## 2 2 dog
## 4 4 lion
# strategy 3. anti join
library(dplyr)
anti_join(table1, table2, by="id")
## id animal
## 1 2 dog
## 2 4 lion
You might ask why anti_join
is an advance given the other easy solutions we’re showing above. We find it most useful when our common ID is a combination of multiple columns. So let’s use another example where we have a multi-column common ID:
table1<-data.frame(state=rep(1:3, each=2), county=rep(c("A", "B"), 3),
vals=rnorm(6))
table1
## state county vals
## 1 1 A 1.6891
## 2 1 B -1.7812
## 3 2 A 0.3795
## 4 2 B 2.5580
## 5 3 A 0.4315
## 6 3 B -1.9571
table2<-table1[c(1,3,4),]
table2
## state county vals
## 1 1 A 1.6891
## 3 2 A 0.3795
## 4 2 B 2.5580
With a two-column unique ID using %in%
or match()
is more challenging. You could create a single ID by concatenating the state/county fields but this adds a messy extra step. Instead anti_join() is your savior:
# which records occur in table1 but not in table2
anti_join(table1, table2, by=c("state", "county"))
## state county vals
## 1 3 B -1.9571
## 2 3 A 0.4315
## 3 1 B -1.7812
Thanks for the great simple example Zev, helped me on a Friday afternoon when it was nice to have something go right for once.. Hope you are great!
Nice to hear from you Ted, hope things are well!
This solution is quite helpful when dealing with joins in R,
Thanks
This is a great post and helped me figure out the last little conundrum in my code. Thanks!
great 🙂
Just what I was looking for, thanks!!
Hi all, I am little confused with the setdiff( ) and anti_join( ). Here in the example they gave the same output. Do they always give same output? I have a hunch that they are for different purpose.
> setdiff(table1, table2, by =”id”)
id animal
1 2 dog
2 4 lion
> anti_join(table1, table2, by = “id”)
id animal
1 2 dog
2 4 lion
That’s interesting, I don’t know the difference, thanks for sharing.
thank you so much, this helped me alot
Do you know if there’s anyway to keep the original row numbers from the results of the anti-join? It may be necessary to fix those rows and it’s hard to get back to them if you have joined on multiple id values.
probably the best is before the join to use dat %>% mutate(rownum = row_number()) or something.
Very helpful!
anti_join(table1, table2, by=c(“state” = “county”))