Audience data validation is a crucial part of delivering accurate behavioral profiles, and as such we put a lot of effort into understanding the sources of inaccuracy in location data. In this case, a publisher noticed that Factual’s Location Validation Stack was blacklisting a significant fraction of their incoming records. Audience engineer Tom took a quick look and noticed some strange patterns, so we decided to investigate the issue in detail.
Tom’s initial insight was that a lot of the latitudes had repeated groups of digits:
By chance, we expect to see this for one of every thousand data points, but we were seeing it in more than 5% of the unfiltered inputs. Although we were already detecting most of these points as invalid, we wanted to investigate more carefully to understand the underlying cause.
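As a minimal sketch of that check (the helper name and the six-digit formatting are my assumptions, not the production code), the repeated-triple pattern can be detected like this:

```python
# Hypothetical sketch of the repeated-triple check: flag a latitude whose
# first three fractional digits are repeated by the next three
# (e.g. 40.123123). For uniformly random digits this should fire for
# about one value in a thousand.
def has_repeated_triple(lat):
    frac = ("%.6f" % abs(lat)).split(".")[1]  # first six fractional digits
    return frac[:3] == frac[3:6]

print(has_repeated_triple(40.123123))  # True
print(has_repeated_triple(40.123456))  # False
```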
Step 1: Characterizing the pathology
I started by looking at a bunch of latitude values by hand, just to see if the pattern generalized at all. At first I didn’t see very many repeating digits; instead, they tended to be off by one:
The histogram of differences between the first three decimal places and the fourth through sixth decimal places showed how unexpected the difference distribution was. Here’s a difference distribution of uniformly random values (which is more or less what we’d expect from latitude digits):
The table stabilized into normal-looking values quickly enough; the most interesting thing is the concentration of values with deltas of 1, 0, and 2. So I decided to use a cutoff of 2 to detect the pathology:
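A sketch of that cutoff (function names and the use of an absolute delta are mine; the real pipeline runs through nfu):

```python
# Hypothetical sketch of the delta test: compare the first and second
# fractional digit triples and flag anything within the cutoff of 2.
def triple_delta(lat):
    frac = ("%.6f" % abs(lat)).split(".")[1]
    return int(frac[:3]) - int(frac[3:6])

def looks_quantized(lat, cutoff=2):
    return abs(triple_delta(lat)) <= cutoff

print(looks_quantized(40.123124))  # True: delta is -1
print(looks_quantized(40.123456))  # False: delta is -333
```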
Tom was investigating the problem because one publisher noticed that their data was being blacklisted, but we quickly discovered that the problem spanned many apps, Android devices, and geographical regions:
The difference is also clear in the correlation matrices, which can be visualized by using the stochastic shading technique I mentioned in the polygon compression post. Here’s the latitude digit correlation (X axis is the first digit triple, Y axis is the second):
The latitude matrix has some interesting stuff going on. Zooming in, it looks like the covariant digit groups have even spacing:
The X coordinates of dense cells are all spaced exactly nine apart: 297, 306, 315, 324, 333. This suggests that the error is caused by some kind of quantization (since no real-world geographic feature has this much regularity).
If the error really is a quantization artifact, we should also see correlation between the integer and fractional parts of the latitudes. Here’s the correlation matrix (X is integer, Y is first three digits of fractional):
The gaps in the covariance matrix were spaced at 0.009 degrees and we’re spanning 80 degrees of latitude, so we need to look for a peak below 80 / 0.009 ≈ 8888. The first is at 4440:
This means that the 0.009 spacing is the result of superimposition across 2-degree windows; the actual gap is a little under 0.018, which we see as broken stripes when we correlate even/odd integer parts of degrees with fractionals:
A frequency closer to 60/degree might have indicated a buggy DMS-to-decimal converter, but ours is 55.5/degree, so that’s most likely not what’s going on. It is some kind of precision-truncation error, however, since 0.0 (and not 0.009) is one of the erroneous points. This suggests that the latitudes aren’t entirely inaccurate, just imprecise.
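That spacing can be sanity-checked synthetically. Below is my own reconstruction (not the original analysis code): snap uniform latitudes to an assumed 0.018018-degree grid over the 80-degree span and confirm that the FFT fundamental lands at 80 / 0.018018 ≈ 4440:

```python
import numpy as np

rng = np.random.default_rng(0)
step = 2000.0 / 111000.0                      # ~0.018018 degrees (assumed grid)
lats = np.round(rng.uniform(-40, 40, 200000) / step) * step

# histogram the snapped latitudes over the 80-degree span, then inspect
# the spectrum; the fundamental should appear near 80 / step = 4440
hist, _ = np.histogram(lats, bins=2 ** 16, range=(-40, 40))
spectrum = np.abs(np.fft.rfft(hist - hist.mean()))
peak = int(spectrum[:8888].argmax())          # search below 8888
print(peak)
```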
Longitudes of data points with erroneous latitudes
The problem is most obvious in latitudes, but the longitude covariance matrix also had some nonuniformity. Most of it is probably due to the difference between urban and rural population density, but it’s worth looking for any obvious sources of error. Let’s look at the FFT of longitudes to see if we find any periodicity:
The first set of peaks is the harmonics at 100000, 200000, 300000, etc. These happen when digits are truncated from the decimal representation. The other set of peaks occurs at 131072 and 262144, equal to 2^17 and 2^18, respectively:
These errors are most likely caused by the machine epsilon from single-precision float encoding. They’re at about the right frequency: with a 24-bit mantissa (the high bit is implied) and six or seven integer bits, the epsilon would land 17 or 18 bits into the fractional part.
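The arithmetic checks out in a few lines (a sketch, assuming the producer rounds through single precision via `numpy.float32`):

```python
import numpy as np

lon = 100.1234567                  # any longitude in [64, 128)
lon32 = float(np.float32(lon))     # single-precision rounding

# with a 24-bit significand and a binary exponent of 6, the spacing
# between adjacent floats is 2**(6 - 23) = 2**-17
assert float(np.spacing(np.float32(lon))) == 2.0 ** -17
# so every such longitude is an exact multiple of 2**-17, which shows
# up as a spectral peak at 2**17 = 131072 per unit
assert (lon32 / 2.0 ** -17).is_integer()
```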
Before moving on, it’s worth making sure the decimal truncation follows some meaningful pattern:
$ nfu latitudes.gz -m 'length(%0 =~ s/^.*\.//r)' -oc
52989 6 ## most have six fractional digits
$ nfu longitudes.gz -m 'length(%0 =~ s/^.*\.//r)' -oc
12804 5 ## many of these are probably truncated
37496 6 ## ... but this is expected; see below
Here’s a correlation of the integer part of latitude and number of fractional digits:
Precision falls off at multiples of ten, which means implementations are limiting the total number of digits used to represent these quantities. We see more variance in the longitude precision just because the range is more evenly represented. (There are also a few minor outliers with artificially low precision – a little more than statistically expected – but they don’t make up very much of the data and we already filter them out.)
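One plausible mechanism (an assumption on my part; `'%.*g'` stands in for whatever formatting the upstream code actually uses) is fixed significant-digit formatting, which trades a fractional digit for each extra integer digit:

```python
# Sketch: formatting with a fixed number of significant digits means
# each extra integer digit costs one fractional digit, so precision
# drops at magnitudes of 10 and 100.
def fractional_digits(value, sig=9):
    text = "%.*g" % (sig, value)
    return len(text.split(".")[1]) if "." in text else 0

print(fractional_digits(5.123456789))   # 8
print(fractional_digits(51.23456789))   # 7
print(fractional_digits(151.2345678))   # 6
```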
The underlying problem
I examined a variety of parameters, looking for common traits across the apps that exhibited this behavior. Sparing you the details: all of the apps with a high error rate requested coarse network-based location permissions. I then looked at the Android source, and the culprit is a class specifically designed to quantize locations.
And 1 / 111000 = 0.000009009009. The default 2km margin produces 0.018018, which is exactly the pattern we saw.
The longitude is actually quantized too, but we didn’t observe it because its basis depends on the latitude:
The rationale for this is that if the app doesn’t have permission to get fine-grained coordinates, then all fine-grained location sources (e.g. GPS) are deliberately quantized to prevent data leakage. The locations aren’t completely wrong, just up to 2km away from the device.
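A sketch of that quantization, modeled on the behavior described above with the constants from this post (2 km of margin, roughly 111 km per degree of latitude); this is not the actual Android implementation:

```python
import math

ACCURACY_M = 2000.0              # default coarse-location margin
METERS_PER_DEGREE = 111000.0     # approximate, at the equator

def fudge(lat, lon):
    lat_step = ACCURACY_M / METERS_PER_DEGREE            # ~0.018018 degrees
    # the longitude grid widens as cos(latitude) shrinks, which is why
    # the longitude pattern was much harder to spot
    lon_step = lat_step / max(0.01, math.cos(math.radians(lat)))
    return (round(lat / lat_step) * lat_step,
            round(lon / lon_step) * lon_step)

print(fudge(40.123456, -74.005974))  # both coordinates snap to the grid
```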
Step 4: Flagging erroneous points
The duplication blacklist used by Factual’s Location Validation Stack was already identifying most of the bogus data, but it’s good to have classifiers specifically designed to detect known error modes.
The simplest solution is to flag all points coming from apps whose permissions don’t include fine-grained location access. Not all of these points were obviously erroneous because some of them came from cell towers or other triangulation mechanisms, but all of them are imprecise to a significant degree.
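In code, that flag is nearly a one-liner (the record layout and field names are hypothetical; the permission strings are Android's real ones):

```python
FINE = "android.permission.ACCESS_FINE_LOCATION"

def flag_imprecise(record):
    # flag every point from an app that can't request fine-grained fixes;
    # such points may still be roughly right (e.g. cell-tower fixes), but
    # are imprecise by up to ~2 km
    return FINE not in record.get("permissions", ())

coarse_only = {"lat": 40.126126,
               "permissions": ["android.permission.ACCESS_COARSE_LOCATION"]}
with_gps = {"lat": 40.123456, "permissions": [FINE]}
print(flag_imprecise(coarse_only), flag_imprecise(with_gps))  # True False
```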
We often use machine learning to solve data-related problems at Factual, but sometimes we run into problems like this one, where understanding the root cause provides the most value. By writing a classifier specifically for this failure mode, we identified exactly the set of erroneous points, and we can analyze them with the understanding that they’re inaccurate to a known degree.