Update 9/27/16: Learn about the importance of location validation and how we do it on our new page covering our Location Validation Stack.
Location data appears straightforward at first blush: two numbers – longitude and latitude – combine as a coordinate to identify an unambiguous point on the earth’s surface: X marks the spot, unequivocally.
Location data in the Mobile Ad-tech Ecosystem, however – especially that used by marketers and advertisers – has a number of distinguishing characteristics that make it more problematic. In no particular order, they are:
- Unvalidated: the great majority of location data, such as that coming through the Real-Time Bid (RTB) stream, comes from unknown sources: no inherent quality guarantees are attached to the data, and it must initially be viewed with skepticism;
- Independent: most coordinate pairs come through the pipes naked, unclothed by metadata. Although mobile devices can report precision, speed, and heading with their location readings, this welcome context is extremely rare, if not entirely absent;
- Intermittent: the majority of mobile apps register location data infrequently — think of most mobile location data as an extremely low-res digital sample of a rich, analogue behavior. Put another way, most mobile location data represents only a fraction of ongoing activity, with little context of what came before or after.
These are pretty major caveats. How then, with so many qualifiers and so much dubious, isolated, and unvalidated data, can one extract signal from the noise? The answer is a Location Validation Stack, a platform that pre-processes location data before we build critical consumer insights.
Factual has two products that require a Location Validation Stack: Geopulse Audience, which creates geographic, behavioral, demographic and retail profiles based on where people go over time; and Geopulse Proximity, which performs realtime server-side geofencing to the tune of 20k-50k queries per second per server.
Most customers run Factual Location Validation on between three and 100 servers; a single customer may validate location at a rate of five billion location queries per day, with a peak of around 500k qps. Taken together, every month Factual processes over 600 billion location data points for Geopulse Audience creation, and over 800 billion for Proximity realtime geofencing. These not-insignificant volumes, combined with the requirement for both asynchronous and real-time data validation, drove us to create a location validation solution that is both fast and intelligent.
Factual’s Location Data Cleaning Process
When a location data point comes down the pipe, we look at it closely and reject it outright if it hits any of our filter criteria. These are:
Truncated Coordinates

Coordinates with three decimal places provide no better than ~100m accuracy, and truncation (dropping decimal places from coordinate measurements) will consistently ‘pull’ devices in a single direction away from their ‘real’ location. Factual creates audiences with precise location targeting – our algorithms tie devices to specific businesses, not grid squares – so coordinates with fewer than four decimal places are insufficiently precise, and for small venues a precision of five or more decimal places is optimal. You really cannot use coordinates with fewer than four decimal places for precise real-time geofencing or to create retail-based audiences. Anyone who says differently is selling you something.
Here’s a specific example of what happens when you truncate decimal places. Let’s say that you are at Factual’s HQ in Century City, Los Angeles, and we progressively truncate your coordinate precision. With each decimal removed, your apparent location drifts to the southeast:
| Precision | Approximate accuracy | Where you appear |
| --- | --- | --- |
| 5 decimal places | ~1m | You are correctly located in Factual’s building (perhaps with other businesses on different floors). |
| 4 decimal places | ~10m | Factual’s building is large, so you are still located here. However, this precision would place you outside a smaller venue. |
| 3 decimal places | ~100m | About a football field from your current location; you are now across the street in a hotel. |
| 2 decimal places | ~1,000m | You are now over 1km away in a golf course. |
| 1 decimal place | ~10,000m | You are in another city, and have been eaten by a grue. |
Figure 001 — location drift with coordinate truncation; locations with fewer than five decimal places are almost unusable in precision targeting. (image: Google Maps)
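The drift in the table can be reproduced with a few lines of Python. The starting coordinate below is a hypothetical stand-in for a point near Factual’s Century City HQ, and the haversine helper approximates great-circle distance:

```python
import math

def haversine_m(lat1, lng1, lat2, lng2):
    """Approximate great-circle distance in metres between two points."""
    r = 6371000.0  # mean earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def truncate(value, places):
    """Drop (not round) decimal places, as sloppy pipelines often do."""
    factor = 10 ** places
    return math.trunc(value * factor) / factor

# Hypothetical high-precision reading near Century City, Los Angeles.
lat, lng = 34.05910, -118.41682

for places in range(5, 0, -1):
    t_lat, t_lng = truncate(lat, places), truncate(lng, places)
    drift = haversine_m(lat, lng, t_lat, t_lng)
    print(f"{places} decimals: ({t_lat}, {t_lng}) drifts ~{drift:,.0f} m")
```

Because `trunc` always moves a positive latitude south and a negative longitude east (toward zero), the apparent location drifts consistently to the southeast for this point, matching the table.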
Figure 002 — there’s not much really happening at Null Island (0,0) – but it’s a great trap for bad data, and you can check the weather online
Bad Coordinates

The most common offenders under this heading are coordinates found at ‘Null Island’ (0,0), but there is also a growing menagerie of points representing classes of error indicative of sloppy coding or an upstream data issue, such as matching coordinate pairs (identical latitude and longitude). ‘Null Island’ is a valid geographic point, but geo-geeks use it as a trap to ‘catch’ bogus device locations – seeing it in data streams always points to problems where location data is missing.
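A minimal sketch of this class of check might look like the following; the rule set here is illustrative, not Factual’s actual filter:

```python
def is_bad_coordinate(lat, lng):
    """Flag coordinates that carry 'bad numbers' (illustrative rules only)."""
    if (lat, lng) == (0.0, 0.0):  # Null Island: a trap for missing data
        return True
    if lat == lng:                # matching pair: a common copy/paste bug
        return True
    return False

print(is_bad_coordinate(0.0, 0.0))         # Null Island -> True
print(is_bad_coordinate(34.0591, -118.4168))  # plausible LA point -> False
```

Note that the matching-pair rule is a heuristic: identical latitude and longitude values can occur legitimately, but in aggregate they are strongly indicative of an upstream coding error.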
Out of Bounds Coordinates
Figure 003: the effect of swapping longitude and latitude: swapped, in the ocean (red); corrected, in Buenos Aires (green). (map: OpenStreetMap)
When you process billions of records, you’re going to see a lot of weird coordinates. Many are considered ‘out-of-bounds’ – most because they are outside the range of legal coordinates, but others simply fall at the extremes of the earth where very few people live; their appearance in the location data pipeline is generally due to developers inadvertently swapping latitude and longitude. For example, in figure 003 we identify and discard the coordinate -58.436597, -34.607187 (the red marker) as out-of-bounds, because it is deep in the ocean at the southern extremes of the earth. Debugging this erroneous location shows that switching the coordinate order to -34.607187, -58.436597 (the green marker) puts the point in Buenos Aires, almost certainly the legitimate location the developer intended.
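A rough sketch of an out-of-bounds check with a swap heuristic follows; the -55° ‘southern extreme’ cutoff is an illustrative stand-in for whatever land-mass or population tests a production pipeline would use:

```python
def validate_bounds(lat, lng):
    """Reject out-of-bounds points, suggesting a lat/lng swap when the
    swapped order would be plausible (illustrative thresholds only)."""
    def legal(la, ln):
        return -90.0 <= la <= 90.0 and -180.0 <= ln <= 180.0

    def plausible(la, ln):
        # Almost nobody lives south of ~55 degrees S (south of Cape Horn).
        return legal(la, ln) and la > -55.0

    if plausible(lat, lng):
        return "ok", (lat, lng)
    if plausible(lng, lat):  # would swapping the order fix it?
        return "swapped", (lng, lat)
    return "rejected", None

# The Buenos Aires example from figure 003:
print(validate_bounds(-58.436597, -34.607187))
# -> ('swapped', (-34.607187, -58.436597))
```

In a real system the swapped candidate would be a debugging hint rather than an automatic correction; silently swapping coordinates risks manufacturing plausible-looking but wrong locations.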
Blacklisted Points

This filter mechanism catches the biggest proportion of transgressions by identifying apparently high-precision points that have actually been encoded using a wifi, IP, cell tower, or centroid lookup.
Some of these points may be ‘fraudulent’, but most are just negligent coding on the developer’s part. The best bit about this feature is that it does not run off of a static list of blacklisted places, but instead evolves its logic from the 20+ billion points we see daily. This approach is built on a statistical model that identifies blacklisted points via a hypothesis testing framework that learns which points are over-represented relative to all points in the system. The model is therefore improved with every point that we see, and thus requires very little active maintenance. To date we’ve identified 650k bogus points globally using this method, and the list is growing.
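The idea can be sketched with a toy over-representation test. The background rate, z-score threshold, and simulated stream below are all illustrative assumptions, not Factual’s actual model:

```python
from collections import Counter
import math

def find_overrepresented(points, expected_share, z_threshold=6.0):
    """Flag exact coordinates whose observed count is wildly above what a
    background rate would predict, using a binomial null hypothesis with a
    normal approximation (a toy stand-in for the real model)."""
    counts = Counter(points)
    n = len(points)
    expected = n * expected_share
    sd = math.sqrt(n * expected_share * (1 - expected_share))
    flagged = []
    for point, observed in counts.items():
        z = (observed - expected) / sd
        if z > z_threshold:
            flagged.append((point, observed, round(z, 1)))
    return flagged

# Simulated stream: many near-unique points, plus one suspicious default
# centroid (the Kansas point discussed below) repeated far too often.
stream = [(34.0 + i * 1e-4, -118.0) for i in range(1000)]
stream += [(37.999004, -96.97783)] * 200
print(find_overrepresented(stream, expected_share=1e-3))
```

A production version would bucket points at a fixed precision, estimate the background rate from the stream itself, and correct for the millions of simultaneous hypotheses being tested.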
Figure 004 – a farmstead in rural Kansas, near Wichita: the most populated place in the world according to unfiltered mobile location logs (image: Google Maps)
One of the most egregious examples of these blacklisted points is the coordinate (37.999004, -96.97783), which corresponds to a lovely, innocuous-looking plot of farmland 30 miles northeast of Wichita, Kansas (figure 004). According to our unfiltered location logs, this point is the most popular location for mobile users across the globe, beating New York, London, and Seoul for top billing.
Our Blacklist does not care why a specific place is artificially popular, but this one is easy: it is the geographic center of the continental United States (which by itself speaks volumes about the quality of geodata in Mobile Location). It’s clear that publishers are tagging US locations with the intention of noting that the data point is ‘in the United States’, but instead the high-precision, low-accuracy coordinate is only adding noise to the signal.
Fortunately, because our model has identified this point as curiously over-subscribed, it is ignored when processing audiences and validating geofence inputs, and all is well.
Bad Devices, Bad Apps
Bad apples: we frequently observe device IDs that are over-represented in every data stream we monitor – usually because the device ID is poorly coded, is not passed to the bid stream correctly, or is shared between devices. We also detect devices that appear to ‘blip’ between locations, suggesting either travel above Mach 1 or a developer employing randomized locations (figure 005). Other developers semi-randomize locations in ways that can be observed and blocked. When we see evidence of malicious or detrimental coding, we usually block the whole app from our pipelines and work with our partners to address and remedy the issue.
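The Mach 1 ‘blip’ test can be sketched as follows; the ping format and the sea-level speed-of-sound threshold are assumptions for illustration:

```python
import math

MACH_1_MPS = 343.0  # speed of sound at sea level, metres per second

def haversine_m(lat1, lng1, lat2, lng2):
    """Approximate great-circle distance in metres between two points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(p1) * math.cos(p2)
         * math.sin(math.radians(lng2 - lng1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def has_supersonic_blip(pings):
    """pings: [(unix_ts, lat, lng), ...] sorted by time for one device ID.
    Flags any hop whose implied speed exceeds Mach 1 (illustrative check)."""
    for (t1, la1, ln1), (t2, la2, ln2) in zip(pings, pings[1:]):
        dt = t2 - t1
        if dt <= 0:
            continue  # duplicate or out-of-order timestamp
        if haversine_m(la1, ln1, la2, ln2) / dt > MACH_1_MPS:
            return True
    return False

# Los Angeles -> New York in ten minutes is not a plausible trip.
blippy = [(0, 34.05, -118.25), (600, 40.71, -74.01)]
print(has_supersonic_blip(blippy))  # True
```

A single supersonic hop would not condemn an app on its own; it is the pattern of such blips across a device or app’s whole stream that marks a bad apple.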
Figure 005 – real world example of a bad apple: good app (left) vs bad app (right). Most location pathologies are less pronounced (map: OpenStreetMap)
These checks are applied to all location data consumed by our Geopulse Audience and Geopulse Proximity products. We do this at speeds measured in microseconds (millionths of a second), which means that we can do more verification in less time, sort the wheat from the chaff, and provide the best possible location-based consumer insights.
If you’re interested in learning more about our mobile ad targeting capabilities, please contact us.
Tyler Bell & Tom White, Geopulse Product and Engineering Leads