Note: This article has also been published in GeoMarketing.
A significant percentage of location data in the mobile ad ecosystem – anywhere from 30% to 70% – is of insufficient quality for use in location-based mobile ad targeting, measurement, or analytics. In a previous post, Validating Mobile Ad Location Data at Factual, we described the reasons for this and the variety of methods we employ to pre-process location data. In this post, the first of several exploring specific pathologies of low-quality location data, we'll take a deep dive into one specific driver: app permissions.
Overview
The Android mobile operating system offers two location permissions – "coarse" (ACCESS_COARSE_LOCATION) and "fine" (ACCESS_FINE_LOCATION). With the coarse permission, according to Google, the OS will return a location with accuracy approximately equivalent to a city block [1] (although in most cases we found the error to be closer to 2,000 meters).
The major problem is that this nuance gets lost in the mobile ad ecosystem. The data we see in mobile exchanges from apps with the coarse location permission looks, on its face, like the data from apps with the fine location permission. It comes through with up to 6 decimal places of precision, so it looks highly precise [2], and the type flag in the geo object is set to 1, "GPS/Location Services" [3] – which is technically true, but obscures the fact that Android has deliberately decreased the accuracy of these points.
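To make the mismatch concrete, here is a minimal sketch of what such an app sees (the class and method names below are ours, purely for illustration): an app holding only ACCESS_COARSE_LOCATION still receives a Location object whose coordinates print with six decimal places.

import android.content.Context;
import android.location.Location;
import android.location.LocationManager;
import android.util.Log;
import java.util.Locale;

public final class CoarseLocationDemo {
    // Logs the last known network-based fix. With only ACCESS_COARSE_LOCATION
    // granted, the OS has already snapped this point to a ~2 km grid, yet the
    // doubles still print with six decimal places of (false) precision.
    public static void logLastFix(Context context) {
        LocationManager lm =
            (LocationManager) context.getSystemService(Context.LOCATION_SERVICE);
        Location loc = lm.getLastKnownLocation(LocationManager.NETWORK_PROVIDER);
        if (loc != null) {
            Log.d("geo", String.format(Locale.US, "%.6f,%.6f",
                loc.getLatitude(), loc.getLongitude()));
        }
    }
}

The printed coordinate looks like a precise GPS fix, but by the time the app sees it, the OS has already degraded its accuracy.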
Factual employs automated methods to detect and filter out this kind of activity. The system is built on a statistical model that learns which points are over-represented relative to all points in the system. We also maintain a blacklist of apps in which a significant percentage of the traffic fails validation.
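As a rough illustration of the idea (not our production model; the method name and the fixed threshold are ours), over-represented points can be surfaced by counting exact coordinate pairs and keeping those that occur far more often than a plausible baseline:

import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public final class OverRepresentedPoints {
    // Flags exact (lat, lon) pairs that appear far more often than chance
    // would allow. threshold is the minimum count treated as suspicious;
    // a real model would derive it from local point density rather than
    // using a single global number.
    public static Map<String, Integer> suspiciousPoints(
            List<double[]> points, int threshold) {
        Map<String, Integer> counts = new HashMap<>();
        for (double[] p : points) {
            String key = String.format(Locale.US, "%.6f,%.6f", p[0], p[1]);
            counts.merge(key, 1, Integer::sum);
        }
        counts.values().removeIf(c -> c < threshold);
        return counts;
    }
}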
Dive into the Data
Note: For an in-depth, technical look into the research process, please refer to the lab notes.
A specific app publisher reported to us that our Location Validation Stack was blacklisting a significant portion of their data, so we decided to investigate. We noticed that a lot of the latitudes being passed had repeated groups of 3 digits.
Purely by chance one would expect to see this for one out of every thousand data points (0.1%), but we were seeing it in more than 5% of the unvalidated inputs. Although we were already detecting these as invalid, we wanted to investigate more carefully to understand the underlying cause.
Step 1: Characterizing the pathology
Since we were seeing repeated groups of 3 digits, I represented each latitude in the form X.YYYZZZ, where YYY are the first three decimal digits and ZZZ are the next three. For example, the latitude 34.194783 has X = 34, YYY = 194, and ZZZ = 783. I then created a histogram of the difference between YYY and ZZZ. For comparison, I also created a histogram of a difference distribution of uniformly random values (which is more or less what we’d expect if the latitudes were unmodified).
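In code, the decomposition and the delta histogram look roughly like this (a sketch of the analysis, not our pipeline code):

import java.util.List;

public final class DeltaHistogram {
    // Builds a histogram of YYY - ZZZ for each latitude, where YYY is the
    // first three decimal digits and ZZZ the next three. For unmodified
    // latitudes the deltas should be roughly uniform; quantized latitudes
    // pile up near zero.
    public static int[] deltas(List<Double> latitudes) {
        int[] histogram = new int[1999];                  // deltas range over [-999, 999]
        for (double lat : latitudes) {
            long micro = Math.round(Math.abs(lat) * 1e6); // X.YYYZZZ -> XYYYZZZ
            int yyy = (int) ((micro / 1000) % 1000);
            int zzz = (int) (micro % 1000);
            histogram[(yyy - zzz) + 999]++;
        }
        return histogram;
    }
}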
[Chart: histogram of YYY − ZZZ deltas for observed latitudes and longitudes vs. uniformly random values]
Clearly the observed latitude data is not behaving as one would expect, and we’re seeing an abnormal concentration of values with deltas of 0, 1, and 2.
One obvious question is whether the longitude values exhibit similar behavior. Interestingly, they don't, as you can also see in the chart above.
Step 2: Measuring the scope of the problem
While the investigation started because one publisher inquired as to why their data was being blacklisted, we quickly discovered that the problem spanned many apps, Android devices, and geographical regions.
We found that about one in three apps had a statistically significant error rate. When we looked at device types, we found that Apple devices don't seem to have the problem, and some Android devices were fine as well:
| Device model | Sample size | Error rate |
|---|---|---|
| iPhone | 79180 | 0.0047 |
| iPhone 6 | 25861 | 0.0047 |
| iPad | 25222 | 0.0002 |
| SM-G900V | 17957 | 0.1878 |
| GT-I9300 | 17392 | 0.3691 |
| iPhone 5S (GSM) | 14773 | 0.0261 |
| iPhone 4S | 13946 | 0.0032 |
| XT1080 | 13249 | 0.2472 |
| SM-G900F | 11933 | 0.2712 |
| GT-I9505 | 11431 | 0.1720 |
| iPhone 5 (GSM+CDMA) | 9939 | 0.0005 |
| HTC One | 9757 | 0.1717 |
| SAMSUNG-SM-G900A | 9362 | 0.1719 |
| SM-G900P | 9341 | 0.1416 |
| SAMSUNG-SGH-I337 | 9312 | 0.1192 |
| SCH-I545 | 9005 | 0.1310 |
| HTC One_M8 | 8871 | 0.1297 |
| SM-N9005 | 8188 | 0.1230 |
| iPhone 6 Plus | 7574 | 0.0075 |
| Nexus 5 | 7523 | 0.1732 |
| SGH-I337M | 7405 | 0.1691 |
| iPhone 5C (GSM) | 7387 | 0.0015 |
| iPhone 5S (GSM+CDMA) | 7360 | 0.0004 |
| GT-I8190 | 6994 | 0.0057 |
| XT1032 | 6439 | 0.3536 |
| SM-G900H | 5916 | 0.0671 |
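To call a device's error rate "statistically significant," one option is a one-sided binomial test against the 0.1% chance base rate; here is a normal-approximation sketch (hypothetical; the production test is more careful about exactly what counts as an error):

public final class RepeatRateTest {
    private static final double BASE_RATE = 0.001; // chance of YYY == ZZZ in unmodified data

    // Returns a z-score for observing `hits` repeated-digit latitudes out of
    // `n` samples, under the null hypothesis that repeats occur at BASE_RATE.
    // A z-score above ~3 is strong evidence the traffic is quantized.
    public static double zScore(long hits, long n) {
        double expected = n * BASE_RATE;
        double stddev = Math.sqrt(n * BASE_RATE * (1 - BASE_RATE));
        return (hits - expected) / stddev;
    }
}

Plugging in the GT-I9300 row, for example, gives on the order of 6,400 repeats out of 17,392 samples and a z-score over 1,500 – far beyond any plausible random fluctuation.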
Lastly, when we plot the coordinates on a map, they look reasonable and come from all around the world.
[Map: affected coordinates plotted worldwide]
Step 3: Identifying Possible Root Causes
The following is a latitude correlation graph, where the X axis is the first 3 decimal digits (YYY) and the Y axis is the next 3 (ZZZ), compared with the same plot for uniformly random values:
[Chart: YYY vs. ZZZ correlation for observed latitudes alongside uniformly random values]
There is a clear diagonal line through the center of the latitude correlation graph, illustrating the high frequency of similar 3-digit pairs. There's an interesting pattern here, though, which becomes more apparent if you zoom in: the covariant digit groups appear to be evenly spaced.
[Chart: zoomed view of the correlation graph]
Zooming in some more…
[Chart: further zoomed view showing dense cells at regular intervals]
The X coordinates of the dense cells are all spaced exactly nine apart: 297, 306, 315, 324, 333. No real-world geographic feature has this much regularity, so some kind of quantization must be at work – specifically, it looks like the latitudes are being rounded to the nearest 0.009009 (X.306306 − X.297297 = 0.009009).
Before we conclude that the latitudes are being rounded to the nearest 0.009009, we need to make sure that what we're seeing isn't the overlap of multiple signals. (To picture this, suppose you rounded a bunch of numbers to the nearest ⅜; looking just at the decimals, you'd think the interval was ⅛.) We can check this with a Fourier transform, which recovers the base frequencies and, in this case, tells us that the actual quantization is happening at intervals of 0.018018 rather than 0.009009.
[Chart: Fourier spectrum of the latitude fractional parts, peaking at the 0.018018-degree interval]
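For illustration, the check can be done with something as simple as the following (a naive O(n²) DFT over a pre-built histogram of latitude fractional parts; not our production code):

public final class QuantizationFrequency {
    // Given a histogram of latitude fractional parts binned over [0, 1),
    // returns the DFT frequency index (cycles per degree) with the largest
    // magnitude, skipping the DC term. For data snapped to a grid of
    // 0.018018 degrees we expect a peak near 1 / 0.018018 ≈ 55.5.
    public static int dominantFrequency(double[] histogram) {
        int n = histogram.length;
        int best = 1;
        double bestMag = 0;
        for (int k = 1; k < n / 2; k++) {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double angle = 2 * Math.PI * k * t / n;
                re += histogram[t] * Math.cos(angle);
                im -= histogram[t] * Math.sin(angle);
            }
            double mag = re * re + im * im; // squared magnitude is enough for ranking
            if (mag > bestMag) { bestMag = mag; best = k; }
        }
        return best;
    }
}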
We see 0.009009 because the pattern is doubled on an even/odd basis:
[Chart: quantization levels interleaved on an even/odd basis, halving the apparent interval]
The Underlying Problem
All of the apps with a high error rate requested only coarse, network-based location permissions. The Android source code includes a class called "LocationFudger", specifically designed to quantize locations to protect user privacy:
/**
 * Contains the logic to obfuscate (fudge) locations for coarse applications.
 *
 * <p>The goal is just to prevent applications with only
 * the coarse location permission from receiving a fine location.
 */
public class LocationFudger {
    /**
     * Default coarse accuracy in meters.
     */
    private static final float DEFAULT_ACCURACY_IN_METERS = 2000.0f;
    // ...
    private static final int APPROXIMATE_METERS_PER_DEGREE_AT_EQUATOR = 111000;
    // ...
}
If you divide 1 degree by 111,000 meters, you get 0.000009009009 degrees per meter. Multiplying by 2,000 meters (the default accuracy seen above) gives 0.018018 – exactly the interval we observed.
The longitude is quantized too, but we didn't observe the pattern because its grid spacing depends on the latitude:
/**
 * Requires latitude since longitudinal distances change with distance from equator.
 */
private static double metersToDegreesLongitude(double distance, double lat) {
    return distance / APPROXIMATE_METERS_PER_DEGREE_AT_EQUATOR / Math.cos(Math.toRadians(lat));
}
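Putting the two constants together shows why the latitude pattern is universal while the longitude pattern never lined up: the latitude grid is fixed at 0.018018 degrees, but the longitude grid widens with distance from the equator. A quick sketch (our own demo code, using the constants above):

public final class FudgeGrid {
    private static final double ACCURACY_M = 2000.0;
    private static final double METERS_PER_DEGREE = 111000.0;

    public static void main(String[] args) {
        double latGrid = ACCURACY_M / METERS_PER_DEGREE; // 0.018018... everywhere
        System.out.printf("latitude grid: %.6f degrees%n", latGrid);
        for (double lat : new double[] {0, 34, 60}) {
            // 0.018018 at the equator, ~0.021734 at 34 degrees, ~0.036036 at 60 degrees
            double lonGrid = ACCURACY_M / METERS_PER_DEGREE / Math.cos(Math.toRadians(lat));
            System.out.printf("longitude grid at %.0f degrees: %.6f degrees%n", lat, lonGrid);
        }
    }
}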
The rationale is that if an app doesn't have permission to receive fine-grained coordinates, then all fine-grained location sources (e.g., GPS) are deliberately quantized to prevent data leakage. The reported locations aren't completely wrong – just up to 2,000 meters away from the actual location of the device.
Conclusion
Our standard model was already identifying the bad data, but it's good to have classifiers specifically designed to detect known error modes. While one can use other techniques to flag apps with insufficient permissions (e.g., crawling the app store), we've found that data-driven models are lower-latency and more reliable, since they have fewer external dependencies. Another advantage of a purely data-driven approach is that it lets us safely process data whose source is unknown or incorrectly reported.
– Spencer Tipping, Software Engineer, and Vikas Gupta, Director of Marketing
Notes:
1. http://developer.android.com/training/location/retrieve-current.html
2. Six decimal places of latitude correspond to roughly 4 inches.
3. See section 5.16 of the IAB OpenRTB Spec v2.3: http://www.iab.net/media/file/OpenRTB-API-Specification-Version-2-3.pdf