Note: This is a companion post to Investigating Low Quality Location Data #2 - Suspicious Activity Over Greenland
Audience data validation is a crucial part of delivering accurate behavioral profiles. After seeing a suspiciously high number of mobile ads with locations over Greenland, the Arctic Circle, and the middle of the ocean, the curious engineers at Factual decided to figure out what these apps were up to.
Suspicious points are visible just by looking at the aggregated coordinates from all apps. Using nfu, we can plot latitude and longitude points from a file, with %d specifying one dot per point:
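The nfu plot itself isn't reproduced here, but the idea translates directly: bin each coordinate pair into a coarse grid and mark one cell per point. A minimal pure-Python sketch (the function name and grid size are my own, not from the post):

```python
def ascii_scatter(points, width=72, height=24):
    """Render (lat, lon) pairs as a coarse text-mode scatter plot:
    the world is binned into a width x height grid and any cell
    containing at least one point is marked with a dot."""
    grid = [[' '] * width for _ in range(height)]
    for lat, lon in points:
        x = min(int((lon + 180) * width / 360), width - 1)
        y = min(int((90 - lat) * height / 180), height - 1)  # north at top
        grid[y][x] = '.'
    return '\n'.join(''.join(row) for row in grid)

# A point over Greenland and one in the mid-Atlantic:
print(ascii_scatter([(75.0, -40.0), (0.0, -30.0)]))
```

With the full coordinate stream, the uniform rectangles described below show up as solid blocks of dots.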
There appear to be two rectangles, one superimposed on the other; each spans 90 degrees of latitude and longitude, and the points within each appear to be uniformly distributed:
Time to do some digging.
Confirming sources of bogus data
This one is kind of easy: geofencing lets us choose a rectangle that contains almost entirely invalid data. Let’s take longitudes in [-90, 0] and latitudes above 70 degrees (the Arctic Circle is everything north of about 66 degrees).
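The nfu filter isn't shown here, but the geofence itself is just a pair of range checks; in Python (function name mine):

```python
def in_bogus_rect(lat, lon):
    """True for points inside the known-bogus geofence:
    longitude in [-90, 0] and latitude above 70 degrees north."""
    return -90 <= lon <= 0 and lat > 70

print(in_bogus_rect(75.2, -40.0))    # Greenland → True
print(in_bogus_rect(34.05, -118.24)) # Los Angeles → False
```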
In the snippet below, “all-tuples.gz” contains latitude and longitude (fields 2 and 3) and app name (field 9). The output lists the apps with the most bogus data, in descending order.
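The nfu pipeline isn't reproduced here; in Python terms (sample rows invented, field layout per the description above), it amounts to a filter, a group-count, and a descending sort:

```python
from collections import Counter

def top_bogus_apps(rows):
    """rows: (lat, lon, app) tuples, mirroring fields 2, 3, and 9 of
    all-tuples.gz. Returns (app, count) pairs, most bogus data first."""
    counts = Counter(app for lat, lon, app in rows
                     if -90 <= lon <= 0 and lat > 70)
    return counts.most_common()

rows = [(75.0, -40.0, 'app-12'), (76.1, -40.2, 'app-12'),
        (71.5, -30.0, 'app-90'), (34.1, -118.2, 'app-90')]
print(top_bogus_apps(rows))  # → [('app-12', 2), ('app-90', 1)]
```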
None of the app names looked familiar (anonymized for this blog post). Let’s see how much of the overall data they represent:
OK, so they collectively make up about 10% of the invalid location data in Greenland. But how do we know for sure that we can disregard all geocoordinates coming from these particular apps? We need to make sure we didn’t also catch any apps with legitimate snow-loving travelers, so let’s look for things outside the known-invalid rectangles:
For app-38 and app-90, it appears that about two-thirds of their data points aren’t in either known-invalid rectangle. For now let’s exclude those two apps from the file we’re investigating, since they do appear to be reporting some valid user data:
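A sketch of that check in Python. The exact rectangle bounds come from the plots, which aren't reproduced here, so the bounds below are hypothetical (each spanning 90 degrees per side), as are the sample rows:

```python
from collections import defaultdict

def outside_fraction(rows, rects):
    """Per-app fraction of points falling outside every known-invalid
    rectangle. rows: (lat, lon, app); rects: (lat0, lat1, lon0, lon1)."""
    total = defaultdict(int)
    outside = defaultdict(int)
    for lat, lon, app in rows:
        total[app] += 1
        if not any(a <= lat <= b and c <= lon <= d for a, b, c, d in rects):
            outside[app] += 1
    return {app: outside[app] / total[app] for app in total}

rects = [(0, 90, -90, 0), (-45, 45, -45, 45)]  # hypothetical bounds
rows = [(75.0, -40.0, 'app-38'),    # Greenland: inside a rectangle
        (34.05, -118.24, 'app-38'), # Los Angeles: outside both
        (48.85, 2.35, 'app-38'),    # Paris: outside both
        (80.0, -50.0, 'app-431')]   # Greenland: inside
print(outside_fraction(rows, rects))  # app-38: 2/3 outside; app-431: 0.0
```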
How it was generated
For the remaining apps, the high user activity in Greenland is obviously bad data, but sometimes we can still get useful information by taking a closer look. Borrowing a couple of ideas from my last post, let’s start by looking for abnormal correlations that might tell us something about how the data was generated. In particular, here’s the correlation between the integer part of the latitude and the first two digits of its fraction (-gcf1. is an nfu idiom to deduplicate, which we want to do here because devices tend to cache locations):
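Since the nfu command isn't reproduced here, here is the equivalent of the two quantities being correlated, sketched in Python (sample values invented); the deduplication step corresponds to `set()`:

```python
def lat_parts(lat_str):
    """Split a latitude string into its integer part and the first
    two digits of its fraction -- the two correlated quantities."""
    whole, _, frac = lat_str.partition('.')
    return int(whole), frac[:2]

# Deduplicate first: devices cache locations, so repeated identical
# coordinates would otherwise dominate the tally.
lats = ['75.4821', '75.4821', '71.93', '83.07']
print(sorted(lat_parts(s) for s in set(lats)))
# → [(71, '93'), (75, '48'), (83, '07')]
```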
Some things jump out right away:
I also attached the OS to see whether that was a causal factor (and because Android apps can easily be disassembled):
I’m surprised to see apps on different operating systems all using the same strategy. Before getting into any code, let’s see if there’s an obvious reason someone would want to omit a leading zero in the fraction (e.g. to avoid numbers like 34.058). Here’s the correlation between digit position and value:
That looks about as we’d expect; the absence of zeroes in the sixth position is just because most number formatters drop trailing zeroes from the fraction.
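The digit-position/value tabulation can be sketched like this in Python (invented sample data); note that a formatter that drops trailing zeroes will never emit a zero in the final position it prints:

```python
from collections import Counter

def digit_position_counts(coords):
    """Count occurrences of each digit value at each fractional
    position (1-indexed) across a list of coordinate strings."""
    counts = Counter()
    for s in coords:
        for pos, d in enumerate(s.partition('.')[2], start=1):
            counts[(pos, int(d))] += 1
    return counts

c = digit_position_counts(['75.4821', '71.93'])
print(c[(1, 4)], c[(1, 9)], c[(2, 8)])  # → 1 1 1
```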
Separating app into the third dimension:
It’s difficult to visualize 3D data effectively, but the thing to notice here is that most of the high-data apps have exactly the same pattern.
Another thing to verify is that these aren’t all coming from the same device:
Quite the opposite: most of the apps have just one data point per device. Likewise, only six of the 9788 devices were associated with more than one app:
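Both checks are simple group-counts; a Python sketch (device IDs and rows invented):

```python
from collections import Counter, defaultdict

rows = [('device-1', 'app-12'), ('device-2', 'app-12'),
        ('device-3', 'app-431'), ('device-3', 'app-12')]

# Data points per device:
points_per_device = Counter(dev for dev, app in rows)

# Devices associated with more than one distinct app:
apps_by_device = defaultdict(set)
for dev, app in rows:
    apps_by_device[dev].add(app)
multi_app = sorted(d for d, apps in apps_by_device.items() if len(apps) > 1)

print(points_per_device['device-3'], multi_app)  # → 2 ['device-3']
```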
Back to the drawing board
I disassembled app-431 and looked for calls to Random methods like nextFloat, but didn’t end up finding anything interesting.
At this point my guess was that the apps themselves weren’t generating the bogus points; instead, some third party was fabricating app names, device IDs, locations, and even operating systems.
To gather more information I did a significant-terms analysis of the records with bad location data by splitting the unfiltered data into individual words and looking for any with an unusually high frequency. The first interesting terms, about 100 lines down, belonged to an app publisher:
Terms are anonymized for this post, but these stuck out because they were uncommon and outside of the OpenRTB schema.
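The significant-terms idea can be sketched as follows (not the production implementation; the score here is a simple frequency ratio, and the sample records are invented): compare each term's frequency among bad-location records with its frequency in the full stream, and surface terms that are disproportionately common in the bad set.

```python
from collections import Counter

def significant_terms(bad_docs, all_docs, min_count=2):
    """Score each term by how over-represented it is among bad records
    relative to the full stream, using a simple frequency ratio."""
    bad = Counter(w for doc in bad_docs for w in doc.split())
    background = Counter(w for doc in all_docs for w in doc.split())
    total_bad = sum(bad.values())
    total_all = sum(background.values())
    scored = [((n / total_bad) / (background[t] / total_all), t)
              for t, n in bad.items() if n >= min_count]
    return sorted(scored, reverse=True)

bad_docs = ['pub-x geo banner', 'pub-x geo video']
all_docs = bad_docs + ['pub-y geo banner', 'pub-z geo native']
print(significant_terms(bad_docs, all_docs))  # 'pub-x' scores highest
```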
I searched for that publisher through the unfiltered stream and looked at the geo:
Here’s the difference when we remove this publisher (original on left):
This publisher is clearly contributing a significant amount of the bad data. No zeroes are ever generated in the most significant fractional position, which could be a deliberate attempt to increase precision; just as likely, though, is that they started by generating two random integers and glued them together with string operations. PHP’s rand() function, for example, generates integers rather than floats, though that’s not much to go on.
Other bogus publishers
Does any publisher that produces bogus data also produce any valid data? If not, then we can just blacklist all of them. Here I’m splitting the data into separate geo-planes, one per publisher:
Not bad; just one publisher has non-bogus data (and we can easily detect this false positive because it has data outside the rectangles). This means that the app’s publisher is probably the predictive dimension. Let’s make sure we got everything:
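That publisher-level test (blacklist a publisher only if every point it sends falls inside a known-invalid rectangle) sketches out like this in Python, with hypothetical rectangle bounds and invented rows:

```python
from collections import defaultdict

# Hypothetical bounds for the two known-invalid rectangles.
RECTS = [(0, 90, -90, 0), (-45, 45, -45, 45)]

def bogus_publishers(rows):
    """Group (lat, lon, publisher) rows per publisher and keep only
    publishers whose every point lies inside a known-invalid rectangle."""
    planes = defaultdict(list)
    for lat, lon, pub in rows:
        planes[pub].append((lat, lon))

    def inside(point):
        lat, lon = point
        return any(a <= lat <= b and c <= lon <= d for a, b, c, d in RECTS)

    return sorted(pub for pub, pts in planes.items() if all(map(inside, pts)))

rows = [(75.0, -40.0, 'pub-a'), (80.0, -45.0, 'pub-a'),
        (75.0, -40.0, 'pub-b'), (34.05, -118.24, 'pub-b')]
print(bogus_publishers(rows))  # → ['pub-a']  (pub-b has a valid point)
```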
Looks like we did. Here’s the distribution without those bogus publishers:
Constructing the classifier
There isn’t any publisher or app metadata that reliably indicates when the data is suspect, but we have a lot of statistical leverage we can use. Starting with the base rate of bad publishers (well, for this particular mode of bad data anyway):
Just under 3.4% of publishers provide mostly or entirely bad data. Put another way, we need about 5 bits of evidence before a bad publisher becomes the most likely explanation. We gather that evidence cumulatively:
Put together, that’s 1.62 bits of evidence per positive observation, so after observing 10 data points we can conclude with 99.9% confidence that a publisher is supplying wholly inaccurate data. In fact, we can do much better because we have population models for geo areas; we can use those to significantly increase the information content of most of the rectangular area while reducing the probability of false positives.
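The arithmetic checks out directly (numbers from the text; this is ordinary log-odds bookkeeping, not the production classifier):

```python
import math

prior = 0.034                    # base rate of bad publishers
prior_odds = prior / (1 - prior)
bits_needed = math.log2(1 / prior_odds)
print(round(bits_needed, 2))     # → 4.83 (about 5 bits)

bits_per_obs = 1.62              # evidence per positive observation
n = 10                           # observed data points
posterior_odds = prior_odds * 2 ** (bits_per_obs * n)
posterior = posterior_odds / (1 + posterior_odds)
print(posterior > 0.999)         # → True
```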
- Spencer Tipping, Software Engineer