*Note: This is a companion post to Investigating Various Pathologies of Low Quality Location Data #1 – App Permissions*

**Repeated latitude digits and where they come from**

Audience data validation is a crucial part of delivering accurate behavioral profiles, and as such we put a lot of effort into understanding the sources of inaccuracy in location data. In this case, a publisher noticed that Factual’s Location Validation Stack was blacklisting a significant fraction of their incoming records. Audience engineer Tom took a quick look and noticed some strange patterns, so we decided to investigate the issue in detail.

Tom’s initial insight was that a lot of the latitudes had repeated groups of digits:

By chance, we expect to see this for one of every thousand data points, but we were seeing it in more than 5% of the unfiltered inputs. Although we were already detecting most of these points as invalid, we wanted to investigate more carefully to understand the underlying cause.
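The check itself is easy to express; here's a quick Python sketch of the repeated-triple test (the helper name and the random sampling are illustrative), along with a simulation confirming the one-in-a-thousand baseline for uniformly random coordinates:

```python
import random
import re

def has_repeated_triple(lat: str) -> bool:
    """True when fractional digits 1-3 equal fractional digits 4-6."""
    m = re.search(r'\.(\d{3})(\d{3})', lat)
    return bool(m) and m.group(1) == m.group(2)

# Uniformly random coordinates should trip the check ~0.1% of the time.
random.seed(0)
samples = ['%.6f' % random.uniform(-90, 90) for _ in range(100_000)]
baseline = sum(map(has_repeated_triple, samples)) / len(samples)
```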

**Step 1: Characterizing the pathology**

I started by looking at a bunch of latitude values by hand, just to see if the pattern generalized at all. At first I didn’t see very many repeating digits; instead, they tended to be off by one:

Histogramming the difference between the first three decimal places and the fourth through sixth made it clear how unexpected the distribution was. Here’s the difference distribution for uniformly random values (which is more or less what we’d expect from latitude digits):

$ nfu latitudes.gz -m 'abs(int(rand(1000)) - int(rand(1000)))' -ocf10p %l

and here’s what the latitudes looked like:

$ nfu latitudes.gz -m 'abs($1 - $2) if %0 =~ /\.(\d{3})(\d{3})/' -ocf10p %l

Log-scaled:

Sorting by descending frequency of those differences:

$ nfu latitudes.gz -m 'abs($1 - $2) if %0 =~ /\.(\d{3})(\d{3})/' -ocO
2802    1
2044    0
1119    2
226     3
123     33
123     19
118     7
116     69
114     50
114     4
113     22
112     8
112     183
...

The table stabilized into normal-looking values quickly enough; the most interesting feature is the concentration at deltas of 1, 0, and 2. So I decided to use a cutoff of 2 to detect the pathology:

$ nfu latitudes.gz -m 'abs($1 - $2) if %0 =~ /\.(\d{3})(\d{3})/' -k 'length(%0) && %0 <= 2' | gzip > pathological-latitudes.gz
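The same filter is straightforward outside of nfu as well; here's a Python sketch of the delta-and-cutoff test (the function names are mine):

```python
import re

def triple_delta(lat: str):
    """Absolute difference between the first and second groups of three
    fractional digits, or None if there aren't six of them."""
    m = re.search(r'\.(\d{3})(\d{3})', lat)
    return abs(int(m.group(1)) - int(m.group(2))) if m else None

def is_pathological(lat: str, cutoff: int = 2) -> bool:
    """Flag a latitude whose digit triples differ by at most the cutoff."""
    d = triple_delta(lat)
    return d is not None and d <= cutoff
```

The cutoff of 2 keeps the rounding-induced off-by-one and off-by-two cases while excluding the long tail of coincidental near-matches.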

**Step 2: Measuring the scope of the problem**

Tom was investigating the problem because one publisher noticed that their data was being blacklisted, but we quickly discovered that the problem spanned many apps, Android devices, and geographical regions:

## facet by app to check for correlations:
$ nfu unfiltered.gz -m 'my $j = jd(%0); row $j.payload.device.geo.lat // "", $j.payload.app.name // ""' \
      -k '%0 =~ /\.(\d{3})(\d{3})/ && abs($1 - $2) <= 2' -f1gcOf10 \
  | gzip > bogus-app-frequencies.gz
$ nfu unfiltered.gz -m 'my $j = jd(%0); $j.payload.device.geo.lat ? row $j.payload.app.name // "" : ()' -gcOf10 \
  | gzip > geo-app-frequencies.gz

## which apps reliably have bogus data?
$ nfu bogus-app-frequencies.gz -i0 geo-app-frequencies.gz \
      -m 'sprintf "%d\t%f\t%s", %2, %1 / %2, %0' -Ok '%1 > 0.01'
99497   0.660653        app-1
80123   0.012543        app-3
22378   0.728349        app-9
8699    0.906656        app-17
8688    0.805824        app-18
8687    0.040635        app-19
5251    0.752999        app-23
5095    0.967026        app-24
4930    0.470588        app-25
4193    0.031243        app-27
3641    0.143093        app-29
2917    0.461090        app-30
...

So about one in every three apps has a statistically significant error rate.

We saw similar results for device types (most Apple and a few Android devices didn’t seem to have the problem):

79180   0.00473604445567062     iPhone
25861   0.00467886005954913     iPhone 6
25222   0.000237887558480691    iPad
17957   0.187837612073286       SM-G900V             ## bogus
17392   0.369077736890524       GT-I9300             ## bogus
14773   0.0260610573343261      iPhone 5s (GSM)      ## probably bogus
13946   0.00322673167933458     iPhone 4S
13249   0.247188467054117       XT1080               ## bogus
11933   0.271180759239085       SM-G900F             ## bogus
11431   0.171988452453854       GT-I9505             ## bogus
9939    0.000503068719187041    iPhone 5 (GSM+CDMA)
9757    0.171671620375115       HTC One              ## bogus
9362    0.171864986114078       SAMSUNG-SM-G900A     ## bogus
9341    0.141633658066588       SM-G900P             ## bogus
9312    0.119201030927835       SAMSUNG-SGH-I337     ## bogus
9005    0.131038312048862       SCH-I545             ## bogus
8871    0.129748619095931       HTC One_M8           ## bogus
8188    0.122984855886663       SM-N9005             ## bogus
7574    0.00752574597306575     iPhone 6+
7523    0.17320217998139        Nexus 5              ## bogus
7405    0.169074949358542       SGH-I337M            ## bogus
7387    0.00148910247732503     iPhone 5c (GSM)
7360    0.000407608695652174    iPhone 5s (GSM+CDMA)
6994    0.0057191878753217      GT-I8190
6439    0.35362633949371        XT1032               ## bogus
5916    0.06710615280595        SM-G900H             ## bogus
...

So far this seems like some sort of intermittent location bug. What’s interesting is that the coordinates in aggregate look reasonable:

$ nfu latlngs.gz -k '@_ == 2' -k '%0 =~ /\.(\d{3})(\d{3})/ && abs($1 - $2) <= 2' -f10p %d

**Step 3: Identifying possible root causes**

One obvious question is whether the longitude shows similar behavior. Interestingly, it does not; here’s the difference distribution for longitude digits:

$ nfu longitudes.gz -m 'abs($1 - $2) if %0 =~ /\.(\d{3})(\d{3})/' -k 'length %0' -ocf10p %l

The difference is also clear in the correlation matrices, which can be visualized by using the stochastic shading technique I mentioned in the polygon compression post. Here’s the latitude digit correlation (X axis is the first digit triple, Y axis is the second):

$ nfu latitudes.gz -m 'row $1, $2 if %0 =~ /\.(\d{3})(\d{3})/' -k @_ -m 'row map $_ + rand(), @_' -p %d

Here’s longitude:

For comparison, here’s the covariance of uniformly random values:

$ nfu latitudes.gz -m 'row int(rand(1000)), int(rand(1000))' -m 'row map $_ + rand(), @_' -p %d

**Latitude covariance in detail**

The latitude matrix has some interesting stuff going on. Zooming in, it looks like the covariant digit groups have even spacing:

The X coordinates of dense cells are all spaced exactly nine apart: 297, 306, 315, 324, 333. This suggests that the error is caused by some kind of quantization (since no real-world geographic feature has this much regularity).

If the error really is a quantization artifact, we should also see correlation between the integer and fractional parts of the latitudes. Here’s the correlation matrix (X is integer, Y is first three digits of fractional):

$ nfu latitudes.gz -m 'row $1, $2 if %0 =~ /^(-?\d+)\.(\d{3})/' -k @_ -m 'row map $_ + rand(), @_' -p %d

A more powerful strategy is to Fourier-transform the histogram to look for recurrent intervals (I added Octave support to nfu to make this easier):

$ nfu perl:-40_0000..40_0000 -m 'sprintf "%.4f", %0 * 1e-4' -I0 @[ latitudes.gz -k '%0 >= -40 && %0 <= 40' -k '%0 =~ /\.(\d{3})(\d{3})/ && abs($1 - $2) <= 2' -m 'sprintf "%.4f", %0' -ocm 'row %1, 1' ] -m '%1 // 0' --octave 'xs = abs(fft(xs))' -p %i

The gaps in the covariance matrix were spaced at 0.009 degrees and we’re spanning 80 degrees of latitude, so we need to look for a peak below 80 / 0.009 ≈ 8888. The first is at 4440:

This means that the 0.009 spacing is the result of superimposition across 2-degree windows; the actual gap is 80 / 4440, a little over 0.018, which we see as broken stripes when we correlate even/odd integer parts of degrees with fractionals:

$ nfu latitudes.gz -m 'row $1 % 2, $2 if %0 =~ /^(-?\d+)\.(\d{3})(\d{3})/ && abs($2 - $3) <= 2' -k @_ -m 'row map $_ + rand(), @_' -p %d

A frequency closer to 60 per degree (ours is 55.5 per degree) might have pointed to a buggy DMS-to-decimal converter, but that’s most likely unrelated here. It does look like some kind of precision quantization, however, since 0.0 (and not 0.009) is one of the erroneous points: the grid is anchored at zero. This suggests that the latitudes aren’t entirely inaccurate, just imprecise.
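For readers without nfu and Octave handy, the FFT step can be sketched with numpy. Since I can't ship the real data, this uses synthetic latitudes snapped to the suspected 2/111-degree grid, which reproduces the peak at 4440:

```python
import numpy as np

rng = np.random.default_rng(0)
step = 2.0 / 111.0  # the suspected ~0.018018-degree grid
lats = np.round(rng.uniform(-40, 40, 50_000) / step) * step

# Histogram at 1e-4-degree resolution, then FFT to expose the comb.
counts, _ = np.histogram(lats, bins=800_000, range=(-40, 40))
spectrum = np.abs(np.fft.rfft(counts))
spectrum[0] = 0  # discard the DC component

# Search below the second harmonic; the fundamental lands at
# 80 degrees / (2/111 degrees) = 4440 cycles per window.
peak = int(np.argmax(spectrum[:8000]))
```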

**Longitudes of data points with erroneous latitudes**

The problem is most obvious in latitudes, but the longitude covariance matrix also had some nonuniformity. Most of it is probably due to the difference between urban and rural population density, but it’s worth looking for any obvious sources of error. Let’s look at the FFT of longitudes to see if we find any periodicity:

$ nfu perl:-180_0000..180_0000 -m 'sprintf "%.4f", %0 * 1e-4' -I0 @[ latlngs.gz -k '%0 =~ /\.(\d{3})(\d{3})/ && abs($1 - $2) <= 2' -m 'sprintf "%.4f", %1' -ocm 'row %1, 1' ] -m '%1 // 0' --octave 'xs = abs(fft(xs))' -p %i

Here there’s no pattern at all; whatever is happening with longitudes isn’t periodic at scale. Two sources of error are evident in the FFT of the fractional parts:

$ nfu perl:0..999999 -m 'sprintf "%.6d", %0' -I0 @[ bad-longitudes.gz -m 'sprintf("%.6f", %0) =~ s/^.*\.//r' -ocm 'row %1, 1' ] -m '%1 // 0' --octave 'xs = abs(fft(xs))' -p %i

The first set of peaks consists of harmonics at 100000, 200000, 300000, etc.; these appear when digits are truncated from the decimal representation. The other set of peaks occurs at 131072 and 262144, equal to 2^17 and 2^18 respectively:

These errors are most likely caused by the machine epsilon of single-precision float encoding. They’re at about the right frequency: with a 24-bit mantissa (the high bit is implied) and between six and seven integer bits, the epsilon would land 17 or 18 bits into the fractional part.
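numpy's `spacing` makes this easy to check; the values below follow from the float32 format itself, not from our data:

```python
import numpy as np

# A float32 carries a 24-bit significand (23 stored bits plus the
# implied leading 1).  A coordinate needing 6 integer bits (magnitude
# 32..64) leaves 18 bits of fractional precision; one needing 7 bits
# (64..128) leaves 17.
ulp_lat = np.spacing(np.float32(45.0))   # latitude-sized magnitude
ulp_lng = np.spacing(np.float32(100.0))  # longitude-sized magnitude

# Quantization at 2**-17 degrees shows up in the fractional-digit FFT
# as a peak at 2**17 = 131072; quantization at 2**-18 puts one at
# 2**18 = 262144.
```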

Before moving on, it’s worth making sure the decimal truncation follows some meaningful pattern:

$ nfu latitudes.gz -m 'length(%0 =~ s/^.*\.//r)' -oc
8       1
70      2
129     3
888     4
5252    5
52989   6     ## most have six fractional digits
3596    7
133     8
10      9
$ nfu longitudes.gz -m 'length(%0 =~ s/^.*\.//r)' -oc
6       1
34      2
415     3
1627    4
12804   5     ## many of these are probably truncated
37496   6     ## ... but this is expected; see below
9357    7
1150    8
184     9
$

Here’s a correlation of the integer part of latitude and number of fractional digits:

$ nfu latitudes.gz -m 'row int(%0), length(%0 =~ s/^.*\.//r)' -m 'row map $_ + rand(), @_' -p %d

Longitude:

Precision falls off at powers of ten, which means implementations are limiting the total number of digits used to represent these quantities. We see more variance in longitude precision simply because its wider range is more evenly represented. (There are also a few minor outliers with artificially low precision – a little more than statistically expected – but they don’t make up much of the data and we already filter them out.)
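That digit budget is consistent with printf-style `%g` formatting, for example, which defaults to six significant digits in total, so each extra integer digit costs a fractional one. A quick illustration (the sample values are arbitrary):

```python
def frac_digits(x: float) -> int:
    """Count fractional digits of x under '%g' (6 significant digits)."""
    s = '%g' % x
    return len(s.split('.')[1]) if '.' in s else 0

# One fractional digit is lost each time the integer part gains a digit.
counts = [frac_digits(v) for v in (3.1415926, 33.141592, 133.14159)]
```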

**The underlying problem**

I examined a variety of parameters looking for common traits among the apps that exhibited this behavior. Sparing you all the details: every app with a high error rate requested only coarse, network-based location permissions. I then looked at the Android source, and the culprit is a class specifically designed to quantize the locations reported to such apps.

One degree of latitude is about 111km, so 1 / 111000 = 0.000009009009 degrees per meter. The default 2km accuracy margin therefore produces a grid spacing of 2000 × 0.000009009009 = 0.018018 degrees, which is exactly the pattern we saw.
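This also explains the repeated triples directly: 1/111 = 9/999, so any multiple of the grid spacing has a fractional part whose decimal digits repeat with period three, and rounding to six decimals leaves the two triples equal or off by one. A small sketch (the snapping mirrors the 2km quantization described above; the input coordinates are arbitrary examples):

```python
step = 2.0 / 111.0  # 2000m at roughly 111km per degree of latitude

def coarsen(lat: float) -> str:
    """Snap a latitude to the 2/111-degree grid, printed to 6 decimals."""
    return '%.6f' % (round(lat / step) * step)

# Every output is a multiple of 1/111 degree, so its fractional digits
# repeat with period 3, modulo rounding in the sixth decimal place.
examples = [coarsen(lat) for lat in (34.0522, 40.7128, 51.5074)]
```

Note the rounded sixth digit: it's what turns exact repeats into the off-by-one deltas that dominated the difference histogram.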

The longitude is actually quantized too, but we didn’t observe it because its basis depends on the latitude:

The rationale for this is that if the app doesn’t have permission to get fine-grained coordinates, then all fine-grained location sources (e.g. GPS) are deliberately quantized to prevent data leakage. The locations aren’t completely wrong, just up to 2km away from the device.

**Step 4: Flagging erroneous points**

The duplication blacklist used by Factual’s Location Validation Stack was already identifying most of the bogus data, but it’s good to have classifiers specifically designed to detect known error modes.

The simplest solution is to flag all points coming from apps whose permissions don’t include fine-grained location access. Not all of these points were obviously erroneous because some of them came from cell towers or other triangulation mechanisms, but all of them are imprecise to a significant degree.
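A minimal sketch of that rule (the record layout and field names here are hypothetical; the permission strings are Android's real ones):

```python
def is_coarse_only(record: dict) -> bool:
    """Flag points from apps that couldn't have had a fine-grained fix."""
    perms = set(record.get('permissions', ()))
    return 'android.permission.ACCESS_FINE_LOCATION' not in perms

# Such points may still be genuine coarse fixes (cell towers, wifi),
# but all of them are imprecise by at least the ~2km fudging margin.
```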

We often use machine learning to solve data-related problems at Factual, but sometimes we run into cases like this where understanding the root cause is what provides the value. By writing a classifier specifically for this failure mode, we identified exactly the set of erroneous points and can analyze them with the understanding that they’re inaccurate to a known degree.