Note: For an in-depth, technical look into the research process, please refer to the lab notes companion to this post.
A significant percentage of locations reported in the mobile ad ecosystem, anywhere from 30% to 70%, is of insufficient quality for use in location-based mobile ad targeting, measurement, or analytics. In our previous post, Validating Mobile Ad Location Data at Factual, we described the different reasons for this and the ways we pre-process and clean location data. This post, the second in a series, explores a particularly blatant case of inaccurate locations reported over Greenland and beyond.
Greenland is a wonderful place, but it’s not known for being especially populated. That’s one of the reasons a large rectangle of points plotted over Greenland, the Arctic Circle, and the Atlantic Ocean (pictured below) caught our eye.
Although the latitudes and longitudes within the rectangular shape appear to be generated randomly, clear pathologies emerged that enabled us to classify this type of bad data in the future. We also learned that the bogus locations were the work of multiple publishers across many different mobile apps.
To investigate, we started by isolating the obviously invalid points from valid ones. This was pretty straightforward, since a tremendous amount of traffic is found where actual human population is extremely sparse: the Arctic Circle. We drew a rectangle that contains almost entirely invalid data by taking longitudes in [-90, 0] and latitudes above 70 degrees (the Arctic Circle is everything north of about 66 degrees). Then we looked for abnormal correlations that could tell us something about how the locations were generated.
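The rectangle described above can be expressed as a simple bounding-box test. This is a minimal sketch rather than our production filter; the function name and the sample points are made up for illustration.

```python
def in_arctic_box(lat, lon):
    """True when a point falls inside the mostly-invalid rectangle:
    latitude above 70 degrees, longitude in [-90, 0]."""
    return 70.0 < lat <= 90.0 and -90.0 <= lon <= 0.0


# Hypothetical sample points as (latitude, longitude) pairs.
points = [(77.17034, -45.2), (40.7128, -74.0060), (71.5, -10.0)]
suspect = [p for p in points if in_arctic_box(*p)]
```

Everything that lands in `suspect` is treated as a candidate invalid point for the rest of the analysis.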
Borrowing a couple of ideas from our last post, we tried mapping the correlation between the integer part and the leading two digits of the fractional part of each coordinate, pictured below.
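One way to build the data behind such a plot is to split each coordinate, as reported, into its integer part and its first two fractional digits, then count how often each pair occurs. The sketch below assumes coordinates arrive as decimal strings; the helper names are hypothetical.

```python
from collections import Counter


def int_and_frac2(coord):
    """Split a coordinate string into (integer part, first two fractional digits).

    Fractions shorter than two digits are right-padded with "0", since
    formatters usually drop trailing zeros.
    """
    whole, _, frac = coord.lstrip("+-").partition(".")
    return int(whole), frac[:2].ljust(2, "0")


def pair_counts(coords):
    """Count occurrences of each (integer, two-digit fraction) pair."""
    return Counter(int_and_frac2(c) for c in coords)


counts = pair_counts(["77.017034", "77.0199", "71.5"])
```

Working on the reported strings rather than parsed floats keeps the digit positions exactly as the app sent them.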
Based on the graphs, two things jumped out right away:
Here is another look at the correlation between the integer part and the first two digits of the fractional part, this time with both latitude and longitude in the same graph:
The absence of zeros in the sixth position is explained by most number formatters dropping trailing zeros in a fraction.
At this point we know the bogus data is easily identifiable by isolating geo-coordinates over areas with very small human populations. We also know the data has a strange, unexplained pathology where zeros are artificially uncommon in the first fractional position of latitudes and longitudes. Using these two conditions, we can narrow down the apps and publishers that are responsible for reporting the bulk of the bad location data.
Using the Arctic Circle geo-boundaries, we found a few dozen apps responsible for reporting the bulk of these invalid points. But we still want to check each app for uniform behavior. Below is the correlation between digit position and digit value for latitude and longitude, this time broken out by individual app:
Again we see the same pattern for each app: an artificially low number of 0s in the first fractional position of latitudes and longitudes.
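The per-app check boils down to asking how often the first fractional digit is 0. For evenly spread coordinates we would expect roughly one in ten. Here is a small sketch, assuming records are (app id, coordinate string) pairs; the function name and sample records are illustrative only.

```python
from collections import defaultdict


def first_frac_zero_rate(records):
    """Return {app_id: share of coordinates whose first fractional digit is 0}.

    A coordinate with no fractional part is counted as having a 0 there.
    """
    totals = defaultdict(int)
    zeros = defaultdict(int)
    for app_id, coord in records:
        frac = coord.partition(".")[2]
        first_digit = frac[:1] or "0"
        totals[app_id] += 1
        if first_digit == "0":
            zeros[app_id] += 1
    return {app: zeros[app] / totals[app] for app in totals}


rates = first_frac_zero_rate([("a", "77.017"), ("a", "71.5"), ("b", "40.0")])
```

An app whose rate sits far below 10% across many points shows the same pathology described above.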
Next we look at suspicious publishers by examining the publishers for each of the apps identified above. Once a publisher has been identified as a bad data contributor, they are a candidate for blacklisting. But first, does any publisher that produces bogus data also produce valid data? We can check by mapping the distribution of points reported by each publisher’s apps in aggregate, 3D-visualized below.
Not bad; just one publisher has data that doesn’t look to be entirely fabricated. We can detect this false positive (near the right of the graph, publisher #4) because it has points plotted outside of the rectangle, while the rest are all within the rectangle boundaries. This means the app’s publisher is probably the predictive dimension for bad location data, since every publisher but one exhibits uniform activity.
We blacklist all publishers except for #4, which does supply valid data.
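The same check can be done without a plot by measuring what share of each publisher's points falls inside the rectangle. The sketch below reuses the bounds from earlier in the post (latitude above 70, longitude in [-90, 0]); the publisher ids, the sample data, and the all-or-nothing threshold are assumptions for illustration.

```python
from collections import defaultdict


def in_arctic_box(lat, lon):
    """Latitude above 70 degrees and longitude in [-90, 0]."""
    return 70.0 < lat <= 90.0 and -90.0 <= lon <= 0.0


def share_in_box(points):
    """points: iterable of (publisher_id, lat, lon).
    Returns {publisher_id: share of that publisher's points inside the box}."""
    totals = defaultdict(int)
    inside = defaultdict(int)
    for publisher_id, lat, lon in points:
        totals[publisher_id] += 1
        if in_arctic_box(lat, lon):
            inside[publisher_id] += 1
    return {pub: inside[pub] / totals[pub] for pub in totals}


def blacklist_candidates(points, threshold=1.0):
    """Publishers whose in-box share meets the threshold (default: every point)."""
    shares = share_in_box(points)
    return sorted(pub for pub, share in shares.items() if share >= threshold)


sample = [
    ("pub1", 75.2, -40.1), ("pub1", 82.9, -12.4),   # entirely inside the box
    ("pub4", 74.0, -30.0), ("pub4", 40.7, -74.0),   # mix of in-box and valid points
]
candidates = blacklist_candidates(sample)
```

A publisher like #4, with points both inside and outside the rectangle, falls below the threshold and is left alone.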
It’s still unclear why seemingly random points would be generated in a way that avoids numbers like 77.017034. This could be a deliberate attempt to increase the precision of the locations. Just as likely, though, is that the developer started by generating two random integers and glued them together with string operations. PHP’s rand() function, for example, generates integers rather than floats. Something like randint(100) + "." + randint(1000000) would be able to produce the random latitudes and longitudes we’ve observed. It also accounts for the scarcity of zeros in the first fractional position, since randint(1000000) wouldn’t output an integer with a leading 0 (e.g., 100000 is a valid integer, but 010000 isn’t), so a draw of 17034 would be glued on as 77.17034 rather than 77.017034. It’s just a guess, but we’d bet Random methods like nextFloat are generating the basically random locations found in this type of bad data.
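The integer-gluing guess is easy to try out. The sketch below uses Python's random.randrange as a stand-in for the hypothetical randint in the text, and compares the glued strings against coordinates drawn as floats and formatted to six decimal places. It illustrates the hypothesis only; it is not a reconstruction of any publisher's actual code.

```python
import random


def glued_coordinate(rng):
    """Two random integers joined around a "." by string concatenation."""
    return str(rng.randrange(100)) + "." + str(rng.randrange(1000000))


def float_coordinate(rng):
    """A random float formatted with six decimal places, for comparison."""
    return f"{rng.uniform(0, 100):.6f}"


def first_frac_zero_share(coords):
    """Share of coordinate strings whose first fractional digit is 0."""
    zeros = sum(1 for c in coords if c.partition(".")[2][:1] == "0")
    return zeros / len(coords)


rng = random.Random(0)
glued = [glued_coordinate(rng) for _ in range(10000)]
floats = [float_coordinate(rng) for _ in range(10000)]

glued_share = first_frac_zero_share(glued)
float_share = first_frac_zero_share(floats)
```

With the glued generator, a 0 can only appear in the first fractional position when the second integer is exactly 0, so `glued_share` stays close to zero while `float_share` sits near one in ten.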
While our Location Validation Stack was already catching these bad locations over Greenland and the Arctic Circle, the investigation found a few dozen publishers that supply 100% falsified location information in mobile ads. There isn’t any publisher or app metadata that reliably indicates when the data is suspect, but we have a lot of statistical leverage we can use based on the reported locations.
Using this statistical leverage, we built a classifier to identify apps and publishers that should be blacklisted in the future. The base rate of bad publishers (well, for this particular type of bad data, anyway) is low: just 3.4% of publishers are supplying wholly invalid locations. Even so, we find that after observing 10 invalid data points from a publisher, we can conclude with 99.9% certainty that the publisher should be blacklisted. So while this type of falsified location data is a serious problem, it’s fortunately easy to identify the publishers responsible.
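To give a feel for how quickly the evidence piles up, here is a minimal Bayesian sketch. It is not our classifier: it assumes observations are independent, that a bad publisher reports an invalid point every time, and that a legitimate publisher reports an invalid point with some probability p_invalid_if_good. The value 0.3 used below is a made-up placeholder, not a figure from our data; only the 3.4% prior comes from the numbers above.

```python
def posterior_bad(prior, p_invalid_if_good, n_invalid, p_invalid_if_bad=1.0):
    """Posterior probability that a publisher is bad after observing
    n_invalid invalid points in a row, via Bayes' rule."""
    evidence_bad = prior * (p_invalid_if_bad ** n_invalid)
    evidence_good = (1.0 - prior) * (p_invalid_if_good ** n_invalid)
    return evidence_bad / (evidence_bad + evidence_good)


# Prior of 3.4% from the text; 0.3 is a hypothetical rate for good publishers.
before = posterior_bad(prior=0.034, p_invalid_if_good=0.3, n_invalid=0)
after_ten = posterior_bad(prior=0.034, p_invalid_if_good=0.3, n_invalid=10)
```

Under these assumptions each additional invalid point shrinks the odds that the publisher is legitimate by a constant factor, which is why a handful of observations is enough to overwhelm a small prior.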