Using the R function anti_join to find unmatched records

If you browse through our technical blog posts you’ll see quite a few devoted to the data analysis functionality in the R packge dplyr. This is due to the fact that we are constantly finding fun new functions to play with. We wanted to devote this small post to an unexpectedly useful function called anti_join.

Using `anti_join()` from the `dplyr` package

For most data analysis tasks you may have two tables you want to join based on a common ID. This is straightforward in any data analysis package. But occasionally, especially in quality assurance types of settings, we find ourselves wanting to identify the records from one table that did NOT match the other table. For example, anti_join came in handy for us in a setting where we were trying to re-create an old table from the source data. We then wanted to be able to identify the records from the original table that did not exist in our updated table. This is where anti_join comes in, especially when you’re dealing with a multi-column ID.

We’ll start with a relatively simple example.

table1<-data.frame(id=c(1:5), animal=c("cat", "dog", "parakeet", 
"lion", "duck"))
table1
##   id   animal
## 1  1      cat
## 2  2      dog
## 3  3 parakeet
## 4  4     lion
## 5  5     duck

table2<-table1[c(1,3,5),]
table2
##   id   animal
## 1  1      cat
## 3  3 parakeet
## 5  5     duck

To identify the rows that exist in table1 but not in table2 you could use any number of strategies:


# strategy 1
table1[!table1$id%in%table2$id,]
##   id animal
## 2  2    dog
## 4  4   lion

# strategy 2
table1[is.na(match(table1$id,table2$id)),]
##   id animal
## 2  2    dog
## 4  4   lion

# strategy 3. anti join
library(dplyr)
anti_join(table1, table2, by="id")
##   id animal
## 1  2    dog
## 2  4   lion

You might ask why anti_join is an advance given the other easy solutions we’re showing above. We find it most useful when our common ID is a combination of multiple columns. So let’s use another example where we have a multi-column common ID:

table1<-data.frame(state=rep(1:3, each=2), county=rep(c("A", "B"), 3),
                   vals=rnorm(6))

table1
##   state county    vals
## 1     1      A  1.6891
## 2     1      B -1.7812
## 3     2      A  0.3795
## 4     2      B  2.5580
## 5     3      A  0.4315
## 6     3      B -1.9571


table2<-table1[c(1,3,4),]
table2
##   state county   vals
## 1     1      A 1.6891
## 3     2      A 0.3795
## 4     2      B 2.5580

With a two-column unique ID using %in% or match() is more challenging. You could create a single ID by concatenating the state/county fields but this adds a messy extra step. Instead anti_join() is your savior:

# which records occur in table1 but not in table2
anti_join(table1, table2, by=c("state", "county"))
##   state county    vals
## 1     3      B -1.9571
## 2     3      A  0.4315
## 3     1      B -1.7812

13 responses

Thanks for the great simple example Zev, helped me on a Friday afternoon when it was nice to have something go right for once.. Hope you are great!

zev@3.85.142.75 says:

March 25, 2016 at 4:52 pm

Nice to hear from you Ted, hope things are well!

Reply

This solution is quite helpful when dealing with joins in R,
Thanks

This is a great post and helped me figure out the last little conundrum in my code. Thanks!

zev@3.85.142.75 says:

October 12, 2017 at 9:33 am

great 🙂

Reply

Just what I was looking for, thanks!!

Hi all, I am little confused with the setdiff( ) and anti_join( ). Here in the example they gave the same output. Do they always give same output? I have a hunch that they are for different purpose.
> setdiff(table1, table2, by =”id”)
id animal
1 2 dog
2 4 lion
> anti_join(table1, table2, by = “id”)
id animal
1 2 dog
2 4 lion

zev@3.85.142.75 says:

May 22, 2018 at 1:05 pm

That’s interesting, I don’t know the difference, thanks for sharing.

Reply

thank you so much, this helped me alot

Do you know if there’s anyway to keep the original row numbers from the results of the anti-join? It may be necessary to fix those rows and it’s hard to get back to them if you have joined on multiple id values.

zev@3.85.142.75 says:

September 20, 2018 at 4:05 pm

probably the best is before the join to use dat %>% mutate(rownum = row_number()) or something.

Reply

Very helpful!

anti_join(table1, table2, by=c(“state” = “county”))

Technical Tidbits From Spatial Analysis & Data Science

Using the R function `anti_join` to find unmatched records

Using `anti_join()` from the `dplyr` package

13 responses

Leave a Reply Cancel reply

Using anti_join() from the dplyr package

Share this:

13 responses

Leave a Reply Cancel reply

Using `anti_join()` from the `dplyr` package