Using the R function anti_join to find unmatched records

If you browse through our technical blog posts you’ll see quite a few devoted to the data analysis functionality in the R packge dplyr. This is due to the fact that we are constantly finding fun new functions to play with. We wanted to devote this small post to an unexpectedly useful function called anti_join.

Using anti_join() from the dplyr package

For most data analysis tasks you may have two tables you want to join based on a common ID. This is straightforward in any data analysis package. But occasionally, especially in quality assurance types of settings, we find ourselves wanting to identify the records from one table that did NOT match the other table. For example, anti_join came in handy for us in a setting where we were trying to re-create an old table from the source data. We then wanted to be able to identify the records from the original table that did not exist in our updated table. This is where anti_join comes in, especially when you’re dealing with a multi-column ID.

We’ll start with a relatively simple example.

table1<-data.frame(id=c(1:5), animal=c("cat", "dog", "parakeet", 
"lion", "duck"))
table1
##   id   animal
## 1  1      cat
## 2  2      dog
## 3  3 parakeet
## 4  4     lion
## 5  5     duck

table2<-table1[c(1,3,5),]
table2
##   id   animal
## 1  1      cat
## 3  3 parakeet
## 5  5     duck

To identify the rows that exist in table1 but not in table2 you could use any number of strategies:


# strategy 1
table1[!table1$id%in%table2$id,]
##   id animal
## 2  2    dog
## 4  4   lion

# strategy 2
table1[is.na(match(table1$id,table2$id)),]
##   id animal
## 2  2    dog
## 4  4   lion

# strategy 3. anti join
library(dplyr)
anti_join(table1, table2, by="id")
##   id animal
## 1  2    dog
## 2  4   lion

You might ask why anti_join is an advance given the other easy solutions we’re showing above. We find it most useful when our common ID is a combination of multiple columns. So let’s use another example where we have a multi-column common ID:

table1<-data.frame(state=rep(1:3, each=2), county=rep(c("A", "B"), 3),
                   vals=rnorm(6))

table1
##   state county    vals
## 1     1      A  1.6891
## 2     1      B -1.7812
## 3     2      A  0.3795
## 4     2      B  2.5580
## 5     3      A  0.4315
## 6     3      B -1.9571


table2<-table1[c(1,3,4),]
table2
##   state county   vals
## 1     1      A 1.6891
## 3     2      A 0.3795
## 4     2      B 2.5580

With a two-column unique ID using %in% or match() is more challenging. You could create a single ID by concatenating the state/county fields but this adds a messy extra step. Instead anti_join() is your savior:

# which records occur in table1 but not in table2
anti_join(table1, table2, by=c("state", "county"))
##   state county    vals
## 1     3      B -1.9571
## 2     3      A  0.4315
## 3     1      B -1.7812
Posted in R

13 responses

  1. Thanks for the great simple example Zev, helped me on a Friday afternoon when it was nice to have something go right for once.. Hope you are great!

  2. Hi all, I am little confused with the setdiff( ) and anti_join( ). Here in the example they gave the same output. Do they always give same output? I have a hunch that they are for different purpose.
    > setdiff(table1, table2, by =”id”)
    id animal
    1 2 dog
    2 4 lion
    > anti_join(table1, table2, by = “id”)
    id animal
    1 2 dog
    2 4 lion

  3. Do you know if there’s anyway to keep the original row numbers from the results of the anti-join? It may be necessary to fix those rows and it’s hard to get back to them if you have joined on multiple id values.

Leave a Reply

Your email address will not be published. Required fields are marked *