
Investigating Various Pathologies of Low Quality Location Data #1 - App Permissions

Note: This article has also been published in GeoMarketing.

A significant percentage of location data in the mobile ad ecosystem - anywhere from 30% to 70% - is of insufficient quality for appropriate use in location-based mobile ad targeting, measurement, or analytics. In a previous post, Validating Mobile Ad Location Data at Factual, we described the different reasons for this and the variety of methods we employ to pre-process location data. In this post, the first of many exploring specific pathologies of low quality location data, we'll do a deep dive into one specific driver: app permissions.

Overview

The Android mobile operating system offers two permissions for location tracking - "coarse" (ACCESS_COARSE_LOCATION) and "fine" (ACCESS_FINE_LOCATION). With the coarse permission, according to Google, the OS will return a location with accuracy approximately equivalent to a city block[1] (although in practice we found errors of roughly 2,000 meters in most cases).
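To make the distinction concrete, here is a minimal sketch of the client side (standard Android framework APIs of that era; the class and listener body are our illustration, not code from any particular app):

import android.content.Context;
import android.location.Location;
import android.location.LocationListener;
import android.location.LocationManager;
import android.os.Bundle;

public class CoarseLocationExample {
  // The app's AndroidManifest.xml declares one of:
  //   <uses-permission android:name="android.permission.ACCESS_COARSE_LOCATION"/>
  //   <uses-permission android:name="android.permission.ACCESS_FINE_LOCATION"/>

  public static void startUpdates(Context context) {
    LocationManager lm =
        (LocationManager) context.getSystemService(Context.LOCATION_SERVICE);
    lm.requestLocationUpdates(LocationManager.NETWORK_PROVIDER, 0L, 0f,
        new LocationListener() {
          @Override public void onLocationChanged(Location location) {
            // With only ACCESS_COARSE_LOCATION, the OS has already snapped
            // this location to a coarse grid, even though the getters still
            // return full double precision.
            double lat = location.getLatitude();
            double lng = location.getLongitude();
          }
          @Override public void onStatusChanged(String p, int s, Bundle b) {}
          @Override public void onProviderEnabled(String p) {}
          @Override public void onProviderDisabled(String p) {}
        });
  }
}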

The major problem is that this nuance gets lost in the mobile ad ecosystem. The data we see in mobile exchanges from apps with coarse location permissions looks, on its face, like the data from apps with fine location permissions. It comes through with up to 6 decimal places of precision, so it looks highly precise[2], and the type flag in the geo object is set to 1, "GPS/Location Services"[3] - which is true, but obscures the fact that Android has decreased the accuracy of these points.

Factual employs automated methods to detect and filter out this kind of activity. The system is built on a statistical model that learns which points are over-represented relative to all points in the system. We also maintain a blacklist of apps whose traffic contains a significant percentage of invalid points.
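As a highly simplified illustration of the over-representation idea (a sketch only, not our production model; the flat cutoff stands in for a learned baseline density):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OverrepresentationSketch {
  /**
   * Counts exact coordinate pairs as they appear in the bid stream and
   * keeps those seen suspiciously often. A real model would compare each
   * cell against a learned baseline rather than a flat minCount cutoff.
   */
  public static Map<String, Integer> overrepresented(List<double[]> points,
                                                     int minCount) {
    Map<String, Integer> counts = new HashMap<>();
    for (double[] p : points) {
      // Key on the 6-decimal representation typical of exchange data.
      String key = String.format("%.6f,%.6f", p[0], p[1]);
      counts.merge(key, 1, Integer::sum);
    }
    counts.values().removeIf(c -> c < minCount);
    return counts;
  }
}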

Dive into the Data

Note: For an in-depth, technical look into the research process, please refer to the lab notes.

A specific app publisher reported to us that our Location Validation Stack was blacklisting a significant portion of their data, so we decided to investigate. We noticed that a lot of the latitudes being passed had repeated groups of 3 digits.

43.315315
42.234234
41.963963
...

Purely by chance one would expect to see this for one out of every thousand data points (0.1%), but we were seeing it in more than 5% of the unvalidated inputs. Although we were already detecting these as invalid, we wanted to investigate more carefully to understand the underlying cause.
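The check itself is mechanical. Here is a sketch of how one might measure the repeat rate (the helper names are ours):

import java.util.List;

public class RepeatRate {
  /** True if the first six decimal digits have the form YYYZZZ with YYY == ZZZ. */
  static boolean hasRepeatedGroups(double latitude) {
    long micro = Math.round(Math.abs(latitude) * 1_000_000) % 1_000_000; // YYYZZZ
    return micro / 1000 == micro % 1000;
  }

  /** Fraction of inputs with repeated groups; ~0.001 is expected by chance. */
  static double repeatRate(List<Double> latitudes) {
    return latitudes.stream().filter(RepeatRate::hasRepeatedGroups).count()
        / (double) latitudes.size();
  }
}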

Step 1: Characterizing the pathology

Since we were seeing repeated groups of 3 digits, I represented each latitude in the form X.YYYZZZ, where YYY are the first three decimal digits and ZZZ are the next three. For example, the latitude 34.194783 has X = 34, YYY = 194, and ZZZ = 783. I then created a histogram of the difference between YYY and ZZZ. For comparison, I also created a histogram of the same difference for uniformly random values (which is more or less what we'd expect if the latitudes were unmodified).
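A sketch of that histogram computation (feeding it uniformly random values produces the triangular baseline used for comparison):

import java.util.List;

public class DeltaHistogram {
  /** Histogram of YYY - ZZZ, shifted so that index 0 corresponds to -999. */
  static long[] deltaHistogram(List<Double> latitudes) {
    long[] hist = new long[1999]; // deltas range over [-999, 999]
    for (double lat : latitudes) {
      long micro = Math.round(Math.abs(lat) * 1_000_000) % 1_000_000; // YYYZZZ
      long yyy = micro / 1000;
      long zzz = micro % 1000;
      hist[(int) (yyy - zzz) + 999]++;
    }
    return hist;
  }
}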

Clearly the observed latitude data is not behaving as one would expect, and we’re seeing an abnormal concentration of values with deltas of 0, 1, and 2.

One obvious question is whether the longitude values exhibit similar behavior. Interestingly, they don't, as you can also see in the chart above.

Step 2: Measuring the scope of the problem

While the investigation started because one publisher inquired as to why their data was being blacklisted, we quickly discovered that the problem spanned many apps, Android devices, and geographical regions.

We found that about one in three apps had a statistically significant error rate. When we looked at device types, we found that Apple devices don't appear to be affected, and some Android devices were unaffected as well.

Device Type             Number of Data Points   Fraction of Bogus Points
iPhone                  79,180                  0.00474
iPhone 6                25,861                  0.00468
iPad                    25,222                  0.00024
SM-G900V                17,957                  0.18784
GT-I9300                17,392                  0.36908
iPhone 5S (GSM)         14,773                  0.02606
iPhone 4S               13,946                  0.00323
XT1080                  13,249                  0.24719
SM-G900F                11,933                  0.27118
GT-I9505                11,431                  0.17199
iPhone 5 (GSM+CDMA)      9,939                  0.00050
HTC One                  9,757                  0.17167
SAMSUNG-SM-G900A         9,362                  0.17186
SM-G900P                 9,341                  0.14163
SAMSUNG-SGH-I337         9,312                  0.11920
SCH-I545                 9,005                  0.13104
HTC One_M8               8,871                  0.12975
SM-N9005                 8,188                  0.12298
iPhone 6 Plus            7,574                  0.00753
Nexus 5                  7,523                  0.17320
SGH-I337M                7,405                  0.16907
iPhone 5C (GSM)          7,387                  0.00149
iPhone 5S (GSM+CDMA)     7,360                  0.00041
GT-I8190                 6,994                  0.00572
XT1032                   6,439                  0.35363
SM-G900H                 5,916                  0.06711
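The "statistically significant" cutoff can be made precise with a one-sided test against the 0.1% chance rate. Here is a sketch using the normal approximation to the binomial (illustrative only; it treats the repeat statistic in isolation rather than our full validation model):

public class RepeatSignificance {
  /**
   * One-sided z-test: is the observed rate significantly above the
   * 0.001 expected by chance? For the sample sizes in the table above,
   * a z-score of 3 or more is already a very strong signal.
   */
  static double repeatZScore(long repeats, long total) {
    double p0 = 0.001;                            // rate expected by chance
    double pHat = repeats / (double) total;       // observed rate
    double se = Math.sqrt(p0 * (1 - p0) / total); // standard error under H0
    return (pHat - p0) / se;
  }

  public static void main(String[] args) {
    // GT-I9300 from the table: 17,392 points, ~36.9% bogus
    // (treated here as the repeat rate purely for illustration).
    System.out.println(repeatZScore(6419, 17392)); // z in the thousands
  }
}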

Lastly, when plotted on a map, the coordinates look reasonable and come from all around the world.

Step 3: Identifying Possible Root Causes

The following is a latitude correlation graph, where the X axis is the first 3 decimal digits and the Y axis is the next 3 digits, alongside the same plot for uniformly random values:

There is a clear diagonal line through the center of the latitude correlation graph, illustrating the high frequency of similar 3 digit pairs. There’s an interesting pattern going on though, which becomes more apparent if you zoom in. It looks like the covariant digit groups have even spacing.

Zooming in some more…

The X coordinates of the dense cells are all spaced exactly nine apart: 297, 306, 315, 324, 333. No real-world geographic feature has this much regularity, so the error must be caused by some kind of quantization. Specifically, it suggests that the latitudes are being rounded to the nearest 0.009009 (Z.306306 - Z.297297 = 0.009009).

Before we conclude that the latitudes are being rounded to the nearest 0.009009, we need to make sure that what we're seeing isn't the overlap of multiple signals. (To picture this, suppose you rounded a bunch of numbers to the nearest ⅜; looking just at the decimals, you'd think it was ⅛.) We can check this with a Fourier transform to recover the base frequencies, which in this case tells us that the actual quantization is happening at intervals of 0.018018 rather than 0.009009.
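To make the Fourier step concrete, here is a sketch (not our production code) that scans candidate frequencies directly against the raw latitudes and reports the period of the strongest component:

import java.util.List;

public class QuantizationPeriod {
  /**
   * Scans frequencies (in cycles per degree) and returns the period of the
   * strongest spectral component. Quantization at interval q concentrates
   * energy at multiples of 1/q, so for q = 0.018018 the fundamental appears
   * near 55.5 cycles per degree. Start minFreq well above the low
   * frequencies that merely reflect where people live.
   */
  static double dominantPeriod(List<Double> latitudes, double minFreq,
                               double maxFreq, double freqStep) {
    double bestFreq = minFreq;
    double bestPower = -1;
    for (double f = minFreq; f <= maxFreq; f += freqStep) {
      double re = 0, im = 0;
      for (double lat : latitudes) {
        double phase = 2 * Math.PI * f * lat;
        re += Math.cos(phase);
        im += Math.sin(phase);
      }
      double power = re * re + im * im;
      if (power > bestPower) {
        bestPower = power;
        bestFreq = f;
      }
    }
    return 1.0 / bestFreq;
  }
}

On latitudes quantized at 0.018018, a call like dominantPeriod(lats, 10, 120, 0.05) should report a period of about 0.018018 (a peak near 55.5 cycles per degree), resolving the ambiguity left by the digit histogram.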

We see 0.009009 because the pattern is doubled on an even/odd basis:
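The doubling can be reproduced synthetically: snap uniform latitudes to a 0.018018-degree grid (assuming, as the data suggests, an absolute grid origin) and collect the YYY cells that get hit:

import java.util.Random;
import java.util.TreeSet;

public class EvenOddDoubling {
  public static void main(String[] args) {
    final double grid = 2000.0 / 111000.0; // 0.018018... degrees
    Random rng = new Random(42);
    TreeSet<Long> cells = new TreeSet<>();
    for (int i = 0; i < 100_000; i++) {
      double lat = 40 + 4 * rng.nextDouble();         // latitudes in [40, 44)
      double snapped = Math.round(lat / grid) * grid; // quantize to the grid
      long micro = Math.round(snapped * 1_000_000) % 1_000_000;
      cells.add(micro / 1000);                        // the YYY digit group
    }
    // Within any single degree band the occupied cells are 18 apart, but
    // even and odd degree bands interleave, so the union is every multiple
    // of 9: 0, 9, 18, 27, ...
    System.out.println(cells);
  }
}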

The Underlying Problem

All of the apps with a high error rate requested coarse network-based location permissions. The Android source code includes a class called “LocationFudger”, specifically designed to quantize locations for user privacy:

/**
 * Contains the logic to obfuscate (fudge) locations for coarse applications.
 *
 * <p>The goal is just to prevent applications with only
 * the coarse location permission from receiving a fine location.
 */
public class LocationFudger {

  /**
   * Default coarse accuracy in meters.
   */
  private static final float DEFAULT_ACCURACY_IN_METERS = 2000.0f;

  // ...

  private static final int APPROXIMATE_METERS_PER_DEGREE_AT_EQUATOR = 111000;

  // ...
}

If you divide 1 degree by 111,000 meters per degree, you get 0.000009009009 degrees per meter. Multiplying by 2,000 meters (the default accuracy seen above) gives 0.018018, which is exactly the interval we saw.
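A minimal sketch of the snapping arithmetic implied by those constants (our reconstruction for illustration; the actual class contains additional logic):

public class CoarseGridArithmetic {
  static final double METERS_PER_DEGREE = 111000.0;
  static final double ACCURACY_METERS = 2000.0;

  public static void main(String[] args) {
    double gridDegrees = ACCURACY_METERS / METERS_PER_DEGREE;
    System.out.println(gridDegrees); // 0.018018018018018018

    // A fine latitude snapped to the coarse grid:
    double fine = 34.194783;
    double coarse = Math.round(fine / gridDegrees) * gridDegrees;
    System.out.println(coarse); // ~34.198198, a repeated digit group
  }
}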

The longitude is actually quantized too, but we didn't observe the pattern there because its grid size depends on the latitude.

/**
 * Requires latitude since longitudinal distances change with distance from equator.
 */
private static double metersToDegreesLongitude(double distance, double lat) {
  return distance / APPROXIMATE_METERS_PER_DEGREE_AT_EQUATOR / Math.cos(Math.toRadians(lat));
}
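Plugging sample latitudes into this formula shows how the longitude grid widens away from the equator (values assume the 2,000-meter default):

public class LongitudeGrid {
  static final int APPROXIMATE_METERS_PER_DEGREE_AT_EQUATOR = 111000;

  // Same formula as the Android method quoted above.
  static double metersToDegreesLongitude(double distance, double lat) {
    return distance / APPROXIMATE_METERS_PER_DEGREE_AT_EQUATOR
        / Math.cos(Math.toRadians(lat));
  }

  public static void main(String[] args) {
    System.out.println(metersToDegreesLongitude(2000, 0));  // 0.018018 at the equator
    System.out.println(metersToDegreesLongitude(2000, 45)); // ~0.025482
    System.out.println(metersToDegreesLongitude(2000, 60)); // ~0.036036
  }
}

Because the grid size differs at every latitude, pooled longitudes don't line up on a single global lattice, which is why the digit-pair pattern appeared only on the latitude axis.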

The rationale for this is that if the app doesn’t have permission to get fine-grained coordinates, then all fine-grained location sources (e.g., GPS) are deliberately quantized to prevent data leakage. The reported locations aren’t completely wrong, just up to 2,000 meters away from the actual location of the device.

Conclusion

Our standard model was already identifying the bad data, but it’s good to have classifiers specifically designed to detect known errors. While one can use other techniques to flag apps with insufficient permissions (e.g., crawling the app store), we’ve found that these data-driven models are lower-latency and more reliable, as they have fewer external dependencies. Another advantage of a purely data-driven approach is that it allows us to safely process data whose source is unknown or incorrectly indicated.

- Spencer Tipping, Software Engineer, and Vikas Gupta, Director of Marketing

Notes:

  1. http://developer.android.com/training/location/retrieve-current.html
  2. 6 decimal places is precise down to approximately 4 inches
  3. See section 5.16 in the IAB OpenRTB Spec v2.3 http://www.iab.net/media/file/OpenRTB-API-Specification-Version-2-3.pdf
