In my previous post, 5 Principles for Applying Machine Learning Techniques, I reviewed the principles for putting machine learning to work. In this blog, I will present a brief guided tour of the pipeline that translates that theory into action.
At the beginning of our pipeline, we have an ocean of data coming at us in many different forms. Some of it is super high quality. Some of it is junk. Most of it is in between. Here are the broad stages of our pipeline.
- The first task is to make sense of the data (clean it)
- Put it in a standard form (canonicalize it)
- Determine if the data applies to existing entities (resolve it)
Performing those tasks involves handling a huge amount of data, leaving interesting corner cases that we attack by:
- Using our own algorithms (applying advanced techniques)
- Determining which changes are trustworthy (trust rank)
- Figuring out how to make sure our data is correct (quality control)
I’ll finish off this post describing how we have two versions of our pipeline (one quick, one more heavy duty) and add a few notes about the technology we use.
Clean and Canonicalize
The algorithmic pipeline starts with the idea of cleaning the data. Imagine a restaurant called Johnny’s BBQ Joint. People could refer to it as “Johnny’s”, “Johnny’s Joint” or using a bunch of other names. When cleaning data, we have to find out which words are meaningful and which are not. For example, “LAX Hilton” and “LAX” are clearly referring to different places, even though just one word is different.
We can determine whether words are meaningful by grouping the data in various ways. For example, suppose we’re looking at business names. If we group the words by location, we might observe that “Hilton” is more interesting than “LAX” because it has less geographic clustering.
Similarly, we can identify stopwords (such as “the”, “and”, and “of”) by looking at adjacent word correlation. The problem can be solved with a few heuristics:
- A word like “the” will be found very frequently and next to all kinds of other words.
- A word like “Starbucks” will be found fairly frequently but next to a small variety of words like “cafe” or “coffee” (which often implies a business chain).
- A word like “pizzeria” will be found frequently and next to a large variety of words like “Johnny’s” or “Tony’s”, which often indicates a business category.
Doing these kinds of analyses lets us improve our automated data-cleaning efforts.
Cleaning is really tough. We use all of the following approaches and more in our cleaning process:
- Grammar induction
- Vocabulary building
- Language modeling for names
There are all sorts of fascinating problems related to cleaning data. Of course, when you have huge amounts of data, it is easy to find all the ways people misspell a name like Britney Spears. But if we are to do a good job, we cannot just focus on where we have data. We have to apply our algorithms to sparse data. Lots of our corner cases occur where we have sparse data.
The next stage of the pipeline is to canonicalize the data, which means to put the data into a standard form. While this sounds easy enough, just think of all the ways you can write street names or all the different ways you can write out a phone number, not just in the US but all over the world. Did I mention that we are doing all of this work for more than 50 languages? That, of course, adds to the challenge.
Resolving the data is a key question early in the pipeline. Specifically, we want to figure out whether the data is for a new place or whether it’s for a place we’ve already tracked. (We assign all of the businesses and points of interest we track a unique ID number.) Of course, doing a good job cleaning and canonicalizing makes this easier, but resolving data is a very nontrivial task. One of our most useful insights here is that each attribute has two important qualities: its information content and its resolution ability.
The information content (a.k.a. entropy) of an attribute is how much it tells us about a place. For example, a state code puts a business into one of 50 buckets, a zip code puts it into one of ~45,000 buckets, and an address puts it into one of millions of buckets.
The resolution ability of an attribute describes if that attribute is helpful for identifying matches, non-matches, both, or neither. For example:
- A state code is useful for identifying non-matches but useless of determining matches (if two businesses are in different states, they are definitely different; if they are in the same state, then we don’t know whether they’re different).
- A phone number is somewhat useful for identifying matches and non-matches, but it’s not perfect because different businesses can share one phone number (e.g. doctors in an office building, stores in a chain, or stores in a mall) and one business can have multiple phone numbers.
- An address is very useful for determining non-matches (a single business won’t have two addresses), and somewhat useful for determining matches (a single address often – but not always – houses exactly one business).
Once we understand a value’s distribution and have a similarity scorer, we can then figure out how to weigh each attribute and use this information to determine whether two places are the same.
Once we have done a first pass of cleaning, canonicalizing, and resolving the data, the bulk of the data has either been determined to be useful or tossed out. What’s left are the corner cases, a much smaller amount of data that is nonetheless crucial for Factual to gain and maintain an edge in quality. One approach, of course, would be to toss all of these cases out, but that would leave too much on the table. In fact, we do toss out a lot of these cases, but we don’t do it casually.
It is also important to point out that much of the advanced work relies on the work done in the previous stage. The overall technique in many of these advanced cases is to use regularities or patterns in the data (from the easy cases) to inform the hard cases. One of the reasons we spend so much time on the corner cases is that solving them often shows us how to do a better job with the earlier algorithms. For example, we may find that a single location has many entities, which may require extra processing.
Here are some of the advanced techniques we use:
- Advanced language models to map many variations of a word into an abstract form. Stemming is a form of this.
- Domain models, which is language modeling based on domain knowledge
- Name scoring
- Address scoring
- Geographic distance metrics
- Entity-oriented distance metrics for entities that go beyond the term frequencyâ€“inverse document frequency method
- Other natural language processing techniques
- Web-scale clustering
We also do a lot of to infer facts from a variety of different types of data. For example, let’s say that a restaurant lists hours from 10:00am to 9:00pm on its web site. But then, from our social media data, we start seeing check-ins or Tweets or other comments about being at the restaurant after 9:00pm up to midnight starting after Memorial Day. Could this be an expansion of hours? Summer hours? It is the job of our algorithms to figure this out when we can for the 60 million places we track worldwide.
It is not at all uncommon for conflicts to appear after even the most advanced processing. Then the questions become:
- Which data should I trust?
- Which sources should I trust?
- How can we decide between conflicting changes?
- How do we identify malicious actors?
Over time, we build a trust rank for sources, so we know which sources are high quality and which are full of errors. We also have to take into account that, for various reasons, people may put bad data out there. One way of identifying bad data was inspired by the boosting technique developed by machine learning researchers.
Finally, even after we have done our best, we know that errors slip through our algorithms. So we have a variety of quality control methods to make sure that we catch false positives, data that looks good but isn’t.
One important principle of our systems is that we don’t assume everything can be automated. We take random samples of our data and run them by our quality control team and through teams organized with the help of Amazon’s Mechanical Turk service. This allows us to identify specific cases that slipped through our algorithms. The errors then go back to our algorithm designers who try to find improvements.
Quick and Stable Pipelines
To make sure our data is updated as quickly as possible, we have two pipelines of data. One pipeline updates data quickly and applies all of the changes for which we have high confidence after a short amount of processing. The second pipeline takes longer but applies all the algorithmic firepower we have. The changes from this pipeline are applied whenever the processing is complete. We employ many such approaches to balance tradeoffs like the one between speed and quality discussed here.
Technology We Love
We are massive users of open source. As you might expect, we use Hadoop and HBase, but we also do a lot of work making Hadoop work better by using Clojure and Cascalog, along with dcache, our URI based repository.
Cascalog (built on top of the Cascading library) is our preferred method for joining data and managing the flow of data through the pipeline. Cascalog allows us to enhance our code for joining and algorithmic processing in Clojure and also easily orchestrate complicated streams of processing depending on what we find.
The most exciting aspect of working at Factual in general and on the algorithm team in particular is that wonderful feeling of being just at the beginning of something new. We’ve made a lot of progress, but there is so much more to do and so much more to learn. If this post has provoked any questions, please send them along.