Audience data validation is a crucial part of delivering accurate behavioral profiles. After seeing a suspiciously high number of mobile ads with locations over Greenland, the Arctic Circle, and the middle of the ocean, the curious engineers at Factual decided to figure out what these apps were up to.
Suspicious points are visible just by looking at the aggregated coordinates from all
apps. Using nfu, we can plot latitude and longitude points from a file with %d specifying one dot per point:
There appear to be two rectangles, one superimposed on the other. Each spans 90
degrees of latitude and longitude, and the points inside each appear to be uniformly distributed:
Time to do some digging.
Confirming sources of bogus data
This one is kind of easy because we can choose a rectangle that contains almost
entirely invalid data by geofencing. Let’s take longitudes in [-90, 0] and latitudes above 70
degrees (the Arctic Circle is everything north of about 66 degrees).
In the snippet below, “all-tuples.gz” contains latitude and longitude (fields 2 and 3) and app name (field 9). The output lists the apps with the most bogus data points, in descending order.
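For readers who don't use nfu, the same geofence count can be sketched in Python. This is a minimal sketch rather than the command used for this post: it assumes the fields are tab-separated (the delimiter isn't stated here), and the function names are hypothetical.

```python
from collections import Counter

def in_arctic_box(lat, lon):
    """True for points in the almost-entirely-invalid rectangle:
    longitude in [-90, 0] and latitude above 70 degrees."""
    return lat > 70.0 and -90.0 <= lon <= 0.0

def count_bogus_by_app(lines):
    """Count geofenced points per app, most frequent first.

    Assumes tab-separated records with latitude in field 2,
    longitude in field 3, and app name in field 9 (1-indexed).
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip short or malformed records
        try:
            lat, lon = float(fields[1]), float(fields[2])
        except ValueError:
            continue  # skip records with non-numeric coordinates
        if in_arctic_box(lat, lon):
            counts[fields[8]] += 1
    return counts.most_common()

# Usage against the gzipped file (not executed here):
#   import gzip
#   with gzip.open("all-tuples.gz", "rt") as f:
#       for app, n in count_bogus_by_app(f):
#           print(n, app)
```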
None of the app names looked familiar (anonymized for this blog post). Let’s see how much of the overall data they represent:
Ok, so they collectively make up about 10% of the invalid location data in Greenland. But how do we know for sure that we can disregard all geocoordinates coming from these particular apps? We need to make sure we
didn’t also catch any apps with legitimate snow-loving travelers, so let’s look
for things outside the known-invalid rectangles:
For app-38 and app-90, it appears two-thirds of their data points aren’t in either known-invalid rectangle.
For now let’s exclude those from the file we’re investigating, since they do appear to be reporting some valid user data:
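The check and the exclusion can be sketched in Python as follows. This is an illustration under stated assumptions: the rectangle bounds are placeholders (the real ones would be read off the plot), the `max_outside` threshold is a hypothetical knob, and none of the function names come from nfu.

```python
from collections import defaultdict

# Placeholder bounds as (lat_min, lat_max, lon_min, lon_max).
# These values are NOT taken from the data; substitute the real rectangles.
EXAMPLE_RECTANGLES = [(0.0, 90.0, -90.0, 0.0), (10.0, 100.0, -80.0, 10.0)]

def in_any_rectangle(lat, lon, rectangles):
    """True if the point falls inside at least one rectangle."""
    return any(lat_min <= lat <= lat_max and lon_min <= lon <= lon_max
               for lat_min, lat_max, lon_min, lon_max in rectangles)

def fraction_outside(points, rectangles):
    """Map each app to the fraction of its points outside every rectangle.

    points is an iterable of (app, lat, lon) tuples.
    """
    total = defaultdict(int)
    outside = defaultdict(int)
    for app, lat, lon in points:
        total[app] += 1
        if not in_any_rectangle(lat, lon, rectangles):
            outside[app] += 1
    return {app: outside[app] / total[app] for app in total}

def drop_apps_with_valid_data(points, rectangles, max_outside):
    """Keep only points from apps whose outside-fraction is <= max_outside."""
    points = list(points)
    fractions = fraction_outside(points, rectangles)
    return [p for p in points if fractions[p[0]] <= max_outside]
```

An app with two-thirds of its points outside the rectangles would be dropped by any `max_outside` below about 0.66.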
How it was generated
For the remaining apps, the high user activity in Greenland is obviously bad data, but sometimes we can still get useful information by taking a closer look.
Borrowing a couple of ideas from my last post,
let’s start by looking for abnormal correlations that might tell us something about how it was generated. In particular, here’s the correlation between
the integer part of each latitude and the first two digits of its fractional part (-gcf1. is an nfu idiom to deduplicate, which we want to do here because devices tend to cache locations):
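As a rough Python equivalent of that split, the sketch below works on the coordinate as text so that the digits are exactly what was reported. The function names are hypothetical, deduplication is done with a plain set (which may not match the nfu idiom in every detail), and exponent notation is not handled.

```python
from collections import Counter

def int_and_frac2(coord):
    """Split a decimal string like '71.0583' into (71, '05').

    The fraction is right-padded with zeroes, so '34.5' gives (34, '50').
    Note that the sign of values such as '-0.5' is lost in the integer part.
    """
    whole, _, frac = coord.strip().partition(".")
    return int(whole), frac[:2].ljust(2, "0")

def pair_counts(coords):
    """Count (integer part, first two fractional digits) pairs
    over the distinct coordinate strings."""
    return Counter(int_and_frac2(c) for c in set(coords))
```

Plotting the counts on a grid of integer part against fractional prefix would make gaps, such as a missing leading zero, stand out.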
Some things jump out right away:
Both latitude and longitude appear to be generated the same way because the graphs look nearly identical.
Fractional parts beginning with 0 are artificially uncommon; they seem to
belong only to the rectangle extending north to 100 degrees (the “more
I also attached the OS to see whether that was a causal factor (and because
Android apps can easily be disassembled):
I’m surprised to see apps on different operating systems all using the same
strategy. Before getting into any code, let’s see if there’s an obvious reason
someone would want to omit a leading zero in the fraction (e.g. to avoid numbers like 34.058). Here’s the correlation between
digit position and value:
That looks about as we’d expect; the absence of zeroes in the sixth position is
just because most number formatters drop trailing zeroes in the fraction.
Separating app into the third dimension:
It’s difficult to visualize 3D data effectively, but the thing to notice here
is that most of the high-data apps have exactly the same pattern.
Another thing to verify is that these aren’t all coming from the same device:
Quite the opposite: most of the apps have just one data point per device.
Likewise, only six of the 9788 devices were associated with more than one app:
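The device check amounts to grouping apps by device and keeping the devices that have more than one. A minimal Python sketch, assuming the records have already been reduced to (device ID, app name) pairs; the function name is hypothetical:

```python
from collections import defaultdict

def devices_with_multiple_apps(pairs):
    """Return {device: set of apps} for devices seen with more than one app.

    pairs is an iterable of (device_id, app_name) tuples.
    """
    apps_by_device = defaultdict(set)
    for device, app in pairs:
        apps_by_device[device].add(app)
    return {d: apps for d, apps in apps_by_device.items() if len(apps) > 1}
```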
Back to the drawing board
I disassembled app-431 and looked for calls to Random methods like nextDouble, nextFloat, etc., but didn’t end up finding anything interesting.
At this point my guess was that the apps themselves weren’t generating the
bogus points; instead, some third party was fabricating app names, device IDs,
locations, and even operating systems.
To gather more information I did a significant-terms analysis of the records with bad location data by splitting the unfiltered data into individual words and looking for any with an unusually high frequency. The first interesting terms, about 100 lines down, belonged to an app publisher:
Terms are anonymized for this post, but they stuck out because they were uncommon and outside of the OpenRTB schema.
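One simple way to run this kind of analysis is to compare how often each word appears in the bad records with how often it appears overall. The Python sketch below uses a plain frequency ratio ("lift") as the score; that scoring choice, the tokenizer, and the function names are assumptions, not necessarily what was used here.

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z0-9_]+")

def term_counts(records):
    """Count the number of records in which each word appears."""
    counts = Counter()
    for record in records:
        counts.update(set(WORD.findall(record.lower())))
    return counts

def significant_terms(bad_records, all_records, min_bad=2):
    """Rank terms by lift: (share of bad records) / (share of all records).

    Assumes bad_records is a subset of all_records. Terms seen in fewer
    than min_bad bad records are ignored to cut down on noise.
    """
    bad_records, all_records = list(bad_records), list(all_records)
    bad, everything = term_counts(bad_records), term_counts(all_records)
    n_bad, n_all = len(bad_records), len(all_records)
    scored = []
    for term, b in bad.items():
        if b < min_bad:
            continue
        a = max(everything.get(term, 0), b)  # guard if the subset assumption fails
        scored.append((term, (b / n_bad) / (a / n_all)))
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored
```

Terms that appear in nearly every record score close to 1, while terms concentrated in the bad records score well above it.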
I searched for that publisher through the unfiltered stream and looked at the
Here’s the difference when we remove this publisher (original on left):
This publisher is clearly contributing a significant amount of the bad data. No zeroes are ever generated in the most-significant fractional
position, which could be a deliberate attempt to increase precision; just as
likely, though, is that they started by generating two random integers and
glued them together with string operations. PHP’s rand() function, for example, generates
integers rather than floats, though that’s not much to go on.
Other bogus publishers
Does any publisher that produces bogus data also produce any valid data? If
not, then we can just blacklist all of them. Here I’m splitting the data into
separate geo-planes, one per publisher:
Not bad; just one publisher has non-bogus data (and we can easily detect
this false positive because it has data outside the rectangles). This means
that the app’s publisher is probably the predictive dimension. Let’s make sure
we got everything:
Looks like we did. Here’s the distribution without those bogus publishers:
Constructing the classifier
There isn’t any publisher or app metadata that reliably indicates when the data
is suspect, but we have a lot of statistical leverage we can use. Starting with
the base rate of bad publishers (well, for this particular mode of bad data anyway):
Just under 3.4% of publishers provide mostly or entirely bad data. Put another way, we need about
5 bits of evidence to make a bad publisher the most likely explanation. We gather
that evidence cumulatively:
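The base-rate arithmetic can be checked with a short Python sketch. It treats evidence as log-odds in bits and assumes each observation contributes independently, which is a simplification; the function names are hypothetical.

```python
import math

def log2_odds(p):
    """Log-odds of probability p, in bits."""
    return math.log2(p / (1.0 - p))

def posterior_bad(prior, bits_per_observation, n_observations):
    """Posterior probability that a publisher is bad after accumulating
    n_observations pieces of evidence worth bits_per_observation each."""
    total_bits = log2_odds(prior) + bits_per_observation * n_observations
    odds = 2.0 ** total_bits
    return odds / (1.0 + odds)

# A 3.4% base rate starts a publisher about 4.8 bits short of even odds,
# which is where the "about 5 bits" figure comes from.
bits_needed = -log2_odds(0.034)
```

With the per-observation figure quoted in the next paragraph, `posterior_bad(0.034, 1.62, 10)` comes out above 0.999.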
Put together, that’s 1.62 bits of evidence per positive observation, so after
observing 10 such data points we can conclude with about 99.9% confidence that a publisher
is supplying wholly inaccurate data. In fact, we can actually do much better because we have population
models for geo areas. We can use that to significantly increase the information
content of most of the rectangular area while reducing the probability of