Unhide hidden data using jitter in the R package `ggplot2`

When you’re plotting a lot of data, overplotting can sometimes obscure important patterns. In situations like this, it can be useful at the exploratory data analysis phase to ‘jitter’ the data so that the underlying points become visible, making patterns easier to see. In addition, jittering can be a way of ‘anonymizing’ spatial data while maintaining, for example, neighborhood or county-level patterns.

Although jittering can be an extremely useful tool, keep in mind that from a data visualization perspective jittering means adding noise to your data. This extra variation can confuse your audience and lead to misinterpretation. As a result, we jitter in-house fairly regularly but only sparingly jitter data in graphics we share with clients or the public.

Jitter data – start with the `base` package

In a recent project we found jittering to be particularly useful. We had a dataset with two columns: one was the actual number of air pollutant samples taken at an air monitoring site in one month, and the other was the required number of samples for the monthly value to be considered valid (at a completeness threshold of 75%). Although we had 10,000 data points, there were only 129 unique combinations of the two values, so most points were drawn on top of one another. As a result, the plot of actual vs required was very misleading:

```r
head(sitedata)
##   count reqSamples
## 1    31         23
## 2    28         23
## 3    31         23
## 4    30         23
## 5    30         23
## 6    30         23

plot(sitedata$reqSamples, sitedata$count,
     xlab = "Required Samples for Valid Value", ylab = "Actual Samples",
     main = "Hard to believe this is\nactually 10,000 points")
```
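Before reaching for jitter, it can help to check how severe the overplotting is by counting the distinct x/y pairs. The sketch below uses a simulated stand-in for `sitedata` (the `toy` data frame and its values are made up for illustration, not the real monitoring data):

```r
# Simulated stand-in for sitedata: integer values, so many rows collide
set.seed(1)
toy <- data.frame(
  count      = sample(20:31, 10000, replace = TRUE),
  reqSamples = sample(c(21, 22, 23), 10000, replace = TRUE)
)

n_points <- nrow(toy)          # rows in the data
n_unique <- nrow(unique(toy))  # distinct (count, reqSamples) pairs
c(points = n_points, unique_pairs = n_unique)

# How many rows sit on each plotted location, most crowded first
combos <- as.data.frame(table(toy$reqSamples, toy$count))
names(combos) <- c("reqSamples", "count", "n")
head(combos[order(-combos$n), ])
```

When the number of unique pairs is tiny compared with the number of rows, an unjittered scatterplot hides almost all of the data.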

In a situation like this, my tendency is to take a quick look at the data with jittering. But I have always found R’s `jitter()` function a little messy, because you need to apply it to both your X and Y variables like this:

```r
plot(jitter(sitedata$reqSamples, factor = 1.1),
     jitter(sitedata$count, factor = 1.1),
     xlab = "Required Samples", ylab = "Actual Samples")
```

Although this is clearly not a ton of complicated code, I still wince at having to apply the jitter function twice, so when I started to use the `ggplot2` package more regularly I was excited to see a built-in jitter option that simplifies plotting.
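For a sense of how far base R's `jitter()` actually moves each point, here is a small sketch on a made-up vector. As I read the `jitter()` documentation, when `amount` is left at its default the noise is uniform with a maximum shift of `factor/5` times the smallest gap between distinct values. The vector `x` and the seed are arbitrary choices for illustration:

```r
set.seed(42)                     # fix the random number stream
x  <- c(23, 23, 23, 30, 30, 31)  # heavily tied, like the sample counts
jx <- jitter(x, factor = 1.1)

# Smallest gap between distinct values is 1, so the expected maximum shift is:
max_shift <- 1.1 / 5 * 1
range(jx - x)                    # observed shifts should fall inside +/- max_shift

# Resetting the seed reproduces exactly the same jittered values
set.seed(42)
jx_again <- jitter(x, factor = 1.1)
identical(jx, jx_again)
```

Calling `set.seed()` before plotting is a simple way to get the same jittered picture on every run.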

Package `ggplot2` to the rescue

The R package `ggplot2` has a default look that is much more attractive, and its creator, Hadley Wickham, thoughtfully gave `geom_point()` a `position` argument (used here with `position_jitter()`) that implements jittering of points more easily and elegantly. Here is the same plot as above, but much nicer:

```r
library(ggplot2)

ggplot(sitedata, aes(reqSamples, count)) +
  geom_point(position = position_jitter(width = 0.3, height = 0.3)) +
  xlab("Required Sampling Frequency for Valid Monthly Values") +
  ylab("Actual Sampling Frequency")
```

The plot above is nicer but could benefit from some additional styling. I particularly like using `ggplot2`'s `alpha` argument to add transparency. I also find the default title and axis labels to be too close to the plot itself, so I use `vjust` to adjust them.

```r
ggplot(sitedata, aes(reqSamples, count)) +
  ggtitle("Required vs Actual Sampling Frequency\n for Air Pollution Monitors") +
  geom_point(position = position_jitter(width = 0.3, height = 0.3),
             alpha = 0.1, color = "firebrick") +
  xlab("Required Sampling Frequency for Valid Monthly Values") +
  ylab("Actual Sampling Frequency") +
  theme(plot.title = element_text(lineheight = .8, face = "bold", vjust = 1),
        axis.title.x = element_text(vjust = -0.5),
        axis.title.y = element_text(vjust = 0.3))
```

Much better!
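If you want to check what the jitter amounts are doing, you can pull out the positions `ggplot2` will draw. The sketch below assumes a reasonably recent `ggplot2` (one where `position_jitter()` accepts a `seed` argument and `layer_data()` is available) and uses a made-up `toy` data frame in place of `sitedata`:

```r
library(ggplot2)

# Toy data (an assumption for illustration, not the real sitedata)
toy <- data.frame(reqSamples = c(21, 22, 23, 23, 23),
                  count      = c(25, 28, 30, 30, 31))

p <- ggplot(toy, aes(reqSamples, count)) +
  geom_point(position = position_jitter(width = 0.3, height = 0.3, seed = 123),
             alpha = 0.5)

# layer_data() returns the positions that will actually be drawn
drawn <- layer_data(p)

# A shift of at most 0.3 data units means rounding recovers the original integers
all(sort(round(drawn$x)) == sort(toy$reqSamples))
all(sort(round(drawn$y)) == sort(toy$count))

# With a seed, rebuilding the plot gives the same jitter each time
identical(layer_data(p)$x, drawn$x)
```

Keeping the jitter below half the spacing between values, as here, means each point still clearly belongs to its original grid location. `geom_jitter(width = 0.3, height = 0.3)` is a shorthand for the same `geom_point()` call.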

Jitter spatial data

There are two situations where I might also consider jittering spatial data. The first and most obvious is when I have overlapping points. In this situation a little jittering can help visualize the data though, again, be careful because the added noise can mislead those who view your map. In the second situation, you might have sensitive data like addresses that you want to jitter to protect anonymity. I’ll demonstrate this second situation using a tiny dataset I created by hand from a few store locations in New York City’s SoHo neighborhood: the Apple Store, Kidrobot (a cute pop-art store), Puck Fair (a favorite bar) and a great bookstore, McNally Jackson. For the record, I got the coordinates the old-fashioned way, by right-clicking on the new Google Maps and choosing ‘What’s Here’, which gives the coordinates.

Here we create the data and map it the traditional way:

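```r
library(ggmap)

# Note: recent versions of ggmap may need a Google Maps API key registered
# first (see ggmap's register_google()) before qmap() can geocode an address.

# Four hand-collected store locations in SoHo
tmp <- data.frame(lat = c(40.725095, 40.725116, 40.724652, 40.723371),
                  lon = c(-73.999115, -73.999775, -73.995937, -73.996085),
                  name = c("Apple Store", "Kidrobot", "Puck Fair", "McNally Jackson Books"))

qmap("Prince St & Mercer St, New York City", zoom = 16, maptype = "hybrid") +
  geom_point(aes(x = lon, y = lat, color = name), data = tmp, size = 5) +
  theme(legend.title = element_blank()) # turn off legend title
```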

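And then we jitter a tiny bit. As far as I can tell from the documentation, the width and height values passed to `position_jitter()` are in the units of the data (here, degrees); only the default, used when you omit them, is a percentage (40%) of the resolution of the data. Since a degree of latitude is approximately 110 km, a jitter of 0.002 degrees moves points by up to roughly 220 meters north-south, and somewhat less east-west at NYC’s latitude.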

```r
qmap("Prince St & Mercer St, New York City", zoom = 16, maptype = "hybrid") +
  geom_point(aes(x = lon, y = lat, color = name), data = tmp, size = 5,
             position = position_jitter(width = 0.002, height = 0.002))
```

And you can see that the points were moved randomly in both the X and Y direction obscuring (on purpose) the actual point locations.
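To get a feel for how far a jitter of 0.002 can move a point on the ground, here is a rough back-of-the-envelope conversion. It assumes the jitter amounts are in degrees (data units) and uses a spherical-Earth approximation of about 111.32 km per degree; both the constant and the reference latitude are my assumptions for illustration:

```r
# Rough conversion of a jitter amount in degrees to meters
km_per_degree <- 111.32   # common spherical-Earth approximation (assumption)
lat_soho      <- 40.7246  # roughly the middle of the four points
jitter_deg    <- 0.002

# North-south: a degree of latitude is about the same length everywhere
ns_m <- jitter_deg * km_per_degree * 1000

# East-west: a degree of longitude shrinks with cos(latitude)
ew_m <- jitter_deg * km_per_degree * cos(lat_soho * pi / 180) * 1000

round(c(north_south_m = ns_m, east_west_m = ew_m))
```

Because the east-west distance is shorter at this latitude, using the same value for both directions gives a slightly narrower spread east-west than north-south. If you need a specific level of anonymization, it is worth converting your target distance to degrees separately for each direction.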

Conclusions

Jittering data is a useful way to reveal patterns in your data that might be obscured by overplotting. I’ve found that using `position_jitter()` in R’s `ggplot2` package is cleaner and easier than using the `jitter` function in the `base` package, but either option works well. Use jittering as needed in your exploratory data analysis, but be cautious when using it in graphics for public consumption, as the additional noise may confuse your audience.

2 responses

1. Kreuvf says:

Adding jitter is a random operation, so every time you run your programme the image will come out different. To avoid this, you may initialize the pseudo-random number generator with a hard-coded value like this: `set.seed(20061001, "Mersenne-Twister")`.

For reference: I’ve taken that from http://xtof.perso.math.cnrs.fr/pdf/ReproducibleAnalysis.pdf.

2. Jennifer says:

great post~ I just learnt the package ggplot2, and found it was amazing to create gorgeous and practical graphs.