## Investigating Various Pathologies of Low Quality Location Data #1 - App Permissions - Lab Notes

Note: This is a companion post to Investigating Various Pathologies of Low Quality Location Data #1 - App Permissions

### Repeated latitude digits and where they come from

Audience data validation is a crucial part of delivering accurate behavioral profiles, and as such we put a lot of effort into understanding the sources of inaccuracy in location data. In this case, a publisher noticed that Factual’s Location Validation Stack was blacklisting a significant fraction of their incoming records. Audience engineer Tom took a quick look and noticed some strange patterns, so we decided to investigate the issue in detail.

Tom’s initial insight was that a lot of the latitudes had repeated groups of digits:

43.315315
42.234234
41.963963
...

By chance, we'd expect to see this in about one of every thousand data points, but we were seeing it in more than 5% of the unfiltered inputs. Although we were already detecting most of these points as invalid, we wanted to investigate more carefully to understand the underlying cause.
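A quick simulation (a Python sketch, just to sanity-check that baseline; not part of the original pipeline) confirms the roughly one-in-a-thousand rate for uniformly random values:

```python
import random

# How often do the 4th-6th fractional digits of a uniformly random value
# repeat the 1st-3rd? The second triple matches a fixed first triple in
# 1 of 1000 cases, so the rate should come out near 0.001.
rng = random.Random(42)
trials = 100_000
hits = 0
for _ in range(trials):
    digits = f"{rng.random():.6f}".split(".")[1]  # six fractional digits
    if digits[:3] == digits[3:]:
        hits += 1
print(hits / trials)  # ~0.001, far below the >5% observed in the feed
```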

### Step 1: Characterizing the pathology

I started by looking at a bunch of latitude values by hand, just to see if the pattern generalized at all. At first I didn’t see very many repeating digits; instead, they tended to be off by one:

32.972973
24.972973
41.333332
42.720722
28.396397
47.585586
...

A histogram of the difference between the first three decimal digits and the next three made the anomaly obvious. Here's the difference distribution for uniformly random values (more or less what we'd expect from latitude digits):

`$ nfu latitudes.gz -m 'abs(int(rand(1000)) - int(rand(1000)))' -ocf10p %l`

and here’s what the latitudes looked like:

`$ nfu latitudes.gz -m 'abs($1 - $2) if %0 =~ /\.(\d{3})(\d{3})/' -ocf10p %l`

Log-scaled:

Sorting by descending frequency of those differences:

```$ nfu latitudes.gz -m 'abs($1 - $2) if %0 =~ /\.(\d{3})(\d{3})/' -ocO
2802    1
2044    0
1119    2
226     3
123     33
123     19
118     7
116     69
114     50
114     4
113     22
112     8
112     183
...```

The table stabilized into normal-looking values quickly enough; the most interesting thing is the concentration at deltas of 0, 1, and 2. So I decided to use a cutoff of 2 to detect the pathology:

```$ nfu latitudes.gz -k '%0 =~ /\.(\d{3})(\d{3})/ && abs($1 - $2) <= 2' \
| gzip > pathological-latitudes.gz```
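For reference, the same test reads like this in Python (a sketch mirroring the nfu filter above):

```python
import re

# Flag a latitude string whose two three-digit fractional groups differ by
# at most 2 -- the cutoff chosen from the delta histogram above.
def is_pathological(lat_str):
    m = re.search(r"\.(\d{3})(\d{3})", lat_str)
    return bool(m) and abs(int(m.group(1)) - int(m.group(2))) <= 2

print(is_pathological("43.315315"))  # repeated triple -> True
print(is_pathological("32.972973"))  # off-by-one triple -> True
print(is_pathological("40.123456"))  # -> False
```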

### Step 2: Measuring the scope of the problem

Tom was investigating the problem because one publisher noticed that their data was being blacklisted, but we quickly discovered that the problem spanned many apps, Android devices, and geographical regions:

```## facet by app to check for correlations
## (the row extraction was truncated in these notes; JSON field names are assumed)
$ nfu unfiltered.gz \
-m 'my $j = jd(%0); row $j->{lat}, $j->{app}' \
-k '%0 =~ /\.(\d{3})(\d{3})/ && abs($1 - $2) <= 2' \
-f1gcOf10 \
| gzip > bogus-app-frequencies.gz

$ nfu unfiltered.gz \
-m 'my $j = jd(%0); row $j->{app}' \
-gcOf10 \
| gzip > geo-app-frequencies.gz

## which apps reliably have bogus data?
$ nfu bogus-app-frequencies.gz \
-i0 geo-app-frequencies.gz \
-m 'sprintf "%d\t%f\t%s", %2, %1 / %2, %0' \
-Ok '%1 > 0.01'
99497   0.660653        app-1
80123   0.012543        app-3
22378   0.728349        app-9
8699    0.906656        app-17
8688    0.805824        app-18
8687    0.040635        app-19
5251    0.752999        app-23
5095    0.967026        app-24
4930    0.470588        app-25
4193    0.031243        app-27
3641    0.143093        app-29
2917    0.461090        app-30
...```

So about one in every three apps has a statistically significant error rate.

We saw similar results for device types (most Apple and a few Android devices didn’t seem to have the problem):

```79180   0.00473604445567062     iPhone
25861   0.00467886005954913     iPhone 6
17957   0.187837612073286       SM-G900V                ## bogus
17392   0.369077736890524       GT-I9300                ## bogus
14773   0.0260610573343261      iPhone 5s (GSM)         ## probably bogus
13946   0.00322673167933458     iPhone 4S
13249   0.247188467054117       XT1080                  ## bogus
11933   0.271180759239085       SM-G900F                ## bogus
11431   0.171988452453854       GT-I9505                ## bogus
9939    0.000503068719187041    iPhone 5 (GSM+CDMA)
9757    0.171671620375115       HTC One                 ## bogus
9362    0.171864986114078       SAMSUNG-SM-G900A        ## bogus
9341    0.141633658066588       SM-G900P                ## bogus
9312    0.119201030927835       SAMSUNG-SGH-I337        ## bogus
9005    0.131038312048862       SCH-I545                ## bogus
8871    0.129748619095931       HTC One_M8              ## bogus
8188    0.122984855886663       SM-N9005                ## bogus
7574    0.00752574597306575     iPhone 6+
7523    0.17320217998139        Nexus 5                 ## bogus
7405    0.169074949358542       SGH-I337M               ## bogus
7387    0.00148910247732503     iPhone 5c (GSM)
7360    0.000407608695652174    iPhone 5s (GSM+CDMA)
6994    0.0057191878753217      GT-I8190
6439    0.35362633949371        XT1032                  ## bogus
5916    0.06710615280595        SM-G900H                ## bogus
...```

So far this seems like some sort of intermittent location bug. What’s interesting is that the coordinates in aggregate look reasonable:

```$ nfu latlngs.gz -k '@_ == 2' \
-k '%0 =~ /\.(\d{3})(\d{3})/ && abs($1 - $2) <= 2' \
-f10p %d```

### Step 3: Identifying possible root causes

One obvious question is whether the longitude shows similar behavior. Interestingly, it does not; here’s the difference distribution for longitude digits:

```$ nfu longitudes.gz -m 'abs($1 - $2) if %0 =~ /\.(\d{3})(\d{3})/' \
-k 'length %0' -ocf10p %l```

The difference is also clear in the correlation matrices, which can be visualized by using the stochastic shading technique I mentioned in the polygon compression post. Here’s the latitude digit correlation (X axis is the first digit triple, Y axis is the second):

```$ nfu latitudes.gz -m 'row $1, $2 if %0 =~ /\.(\d{3})(\d{3})/' -k @_ \
-m 'row map $_ + rand(), @_' -p %d```

Here’s longitude:

For comparison, here’s the covariance of uniformly random values:

```$ nfu latitudes.gz -m 'row int(rand(1000)), int(rand(1000))' \
-m 'row map $_ + rand(), @_' -p %d```

#### Latitude covariance in detail

The latitude matrix has some interesting stuff going on. Zooming in, it looks like the covariant digit groups have even spacing:

The X coordinates of dense cells are all spaced exactly nine apart: 297, 306, 315, 324, 333. This suggests that the error is caused by some kind of quantization (since no real-world geographic feature has this much regularity).

If the error really is a quantization artifact, we should also see correlation between the integer and fractional parts of the latitudes. Here’s the correlation matrix (X is integer, Y is first three digits of fractional):

```$ nfu latitudes.gz -m 'row $1, $2 if %0 =~ /^(-?\d+)\.(\d{3})/' -k @_ \
-m 'row map $_ + rand(), @_' -p %d```

A more powerful strategy is to Fourier-transform the histogram to look for recurrent intervals (I added Octave support to nfu to make this easier):

```$ nfu perl:-40_0000..40_0000 -m 'sprintf "%.4f", %0 * 1e-4' \
-I0 @[ latitudes.gz -k '%0 >= -40 && %0 <= 40' \
-k '%0 =~ /\.(\d{3})(\d{3})/ && abs($1 - $2) <= 2' \
-m 'sprintf "%.4f", %0' \
-ocm 'row %1, 1' ] \
-m '%1 // 0' \
--octave 'xs = abs(fft(xs))' \
-p %i```

The gaps in the covariance matrix were spaced at 0.009 degrees and we’re spanning 80 degrees of latitude, so we need to look for a peak below 8888. The first is at 4440:

This means that the 0.009 spacing is the result of superimposition across 2-degree windows; the actual gap is a little under 0.018, which we see as broken stripes when we correlate even/odd integer parts of degrees with fractionals:

```$ nfu latitudes.gz \
-m 'row $1 % 2, $2 if %0 =~ /^(-?\d+)\.(\d{3})(\d{3})/
&& abs($2 - $3) <= 2' \
-k @_ \
-m 'row map $_ + rand(), @_' -p %d```
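The peak location squares with the spacing arithmetic (a quick check, using the ~0.018018-degree spacing measured above):

```python
# An 80-degree latitude span divided by the measured grid spacing should put
# the fundamental FFT peak near bin 4440.
span_deg = 80
spacing_deg = 0.018018
print(round(span_deg / spacing_deg))  # → 4440
```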

A frequency closer to 60/degree (ours is 55.5/degree) might have indicated a buggy DMS-to-decimal converter, but that is most likely unrelated here. It is some kind of precision-limiting error, however, since 0.0 (and not 0.009) is one of the erroneous points. This suggests that the latitudes aren't entirely inaccurate, just imprecise.

#### Longitudes of data points with erroneous latitudes

The problem is most obvious in latitudes, but the longitude covariance matrix also had some nonuniformity. Most of it is probably due to the difference between urban and rural population density, but it’s worth looking for any obvious sources of error. Let’s look at the FFT of longitudes to see if we find any periodicity:

```$ nfu perl:-180_0000..180_0000 -m 'sprintf "%.4f", %0 * 1e-4' \
-I0 @[ latlngs.gz -k '%0 =~ /\.(\d{3})(\d{3})/ && abs($1 - $2) <= 2' \
-m 'sprintf "%.4f", %1' \
-ocm 'row %1, 1' ] \
-m '%1 // 0' \
--octave 'xs = abs(fft(xs))' \
-p %i```

Here there’s no pattern at all; whatever is happening with longitudes isn’t periodic at scale. Two sources of error are evident in the FFT of the fractional parts:

```$ nfu perl:0..999999 -m 'sprintf "%.6d", %0' \
-I0 @[ bad-longitudes.gz -m 'sprintf("%.6f", %0) =~ s/^.*\.//r' \
-ocm 'row %1, 1' ] \
-m '%1 // 0' \
--octave 'xs = abs(fft(xs))' \
-p %i```

The first set of peaks is the harmonics at 100000, 200000, 300000, etc. These happen when digits are truncated from the decimal representation. The other set of peaks occurs at 131072 and 262144, equal to 2^17 and 2^18, respectively:

These errors are most likely caused by the machine epsilon of single-precision float encoding. They're at about the right frequency: with a 24-bit mantissa (the high bit is implicit) and six or seven integer bits, the epsilon lands 17 or 18 bits into the fractional part.
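We can check that claim directly (a Python sketch; the bit-twiddling below just measures the gap between adjacent float32 values):

```python
import struct

# The spacing between a single-precision float and its successor (the ulp)
# at a given magnitude, measured by bumping the raw bit pattern by one.
def float32_ulp(x):
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lo = struct.unpack("<f", struct.pack("<I", bits))[0]
    hi = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return hi - lo

print(float32_ulp(100.0) == 2.0 ** -17)  # 7 integer bits → True
print(float32_ulp(50.0) == 2.0 ** -18)   # 6 integer bits → True
```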

Before moving on, it’s worth making sure the decimal truncation follows some meaningful pattern:

```$ nfu latitudes.gz -m 'length(%0 =~ s/^.*\.//r)' -oc
8       1
70      2
129     3
888     4
5252    5
52989   6       ## most have six fractional digits
3596    7
133     8
10      9
$ nfu longitudes.gz -m 'length(%0 =~ s/^.*\.//r)' -oc
6       1
34      2
415     3
1627    4
12804   5       ## many of these are probably truncated
37496   6       ## ... but this is expected; see below
9357    7
1150    8
184     9
$```

Here’s a correlation of the integer part of latitude and number of fractional digits:

```$ nfu latitudes.gz -m 'row int(%0), length(%0 =~ s/^.*\.//r)' \
-m 'row map $_ + rand(), @_' -p %d```

Longitude:

Precision falls off at multiples of ten, which means implementations are limiting the total number of digits used to represent these quantities. We see more variance in the longitude precision just because the range is more evenly represented. (There are also a few minor outliers with artificially low precision – a little more than statistically expected – but they don’t make up very much of the data and we already filter them out.)
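As an illustration of digit-limited formatting (the exact format used by these implementations is an assumption), a fixed significant-digit format spends fewer digits on the fraction as the integer part grows:

```python
# printf-style %.8g keeps 8 significant digits total, so fractional precision
# drops as the integer part gains digits.
for x in (3.14159265358979, 43.14159265358979, 143.14159265358979):
    print("%.8g" % x)
# → 3.1415927 / 43.141593 / 143.14159
```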

### The underlying problem

I examined a variety of parameters, looking for common traits across the apps that exhibited this behavior. Sparing you the details: all of the apps with a high error rate had requested only coarse, network-based location permissions. I then looked at the Android source, and the culprit is a class specifically designed to quantize locations.

A degree of latitude is about 111,000 meters, and 1 / 111000 = 0.000009009009. The default 2km margin therefore produces a grid spacing of 0.018018 degrees, which is exactly the pattern we saw.

The longitude is actually quantized too, but we didn’t observe it because its basis depends on the latitude:
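A sketch of what that latitude-dependent basis looks like (the formula here is an assumption scaled from the 2km figure above, not the actual Android source):

```python
import math

# To keep the grid cell roughly 2 km wide on the ground, the longitude step
# must grow as 1/cos(latitude) -- so longitude quantization has no single
# global period the way latitude does.
def lon_grid_degrees(lat_deg, accuracy_m=2000.0):
    return accuracy_m / (111_000.0 * math.cos(math.radians(lat_deg)))

print(round(lon_grid_degrees(0.0), 6))   # equator: same step as latitude
print(round(lon_grid_degrees(60.0), 6))  # at 60 degrees, the step doubles
```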

The rationale for this is that if the app doesn’t have permission to get fine-grained coordinates, then all fine-grained location sources (e.g. GPS) are deliberately quantized to prevent data leakage. The locations aren’t completely wrong, just up to 2km away from the device.

### Step 4: Flagging erroneous points

The duplication blacklist used by Factual’s Location Validation Stack was already identifying most of the bogus data, but it’s good to have classifiers specifically designed to detect known error modes.

The simplest solution is to flag all points coming from apps whose permissions don’t include fine-grained location access. Not all of these points were obviously erroneous because some of them came from cell towers or other triangulation mechanisms, but all of them are imprecise to a significant degree.
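In Python terms, the flag is just this (the record schema and field names here are hypothetical):

```python
# Flag records from apps that never requested fine-grained location access;
# everything they report is quantized to the coarse grid.
def coarse_only(record):
    return "ACCESS_FINE_LOCATION" not in record.get("permissions", [])

print(coarse_only({"permissions": ["ACCESS_COARSE_LOCATION"]}))  # → True
print(coarse_only({"permissions": ["ACCESS_FINE_LOCATION"]}))    # → False
```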

We often use machine learning to solve data-related problems at Factual, but sometimes we run into things like this where understanding the root cause provides value. By writing a classifier specifically to handle this case, we ended up identifying exactly the set of erroneous points and analyzing them with the understanding that they’re inaccurate to a known degree.

Enjoy this read? Factual might be the place for you!