Here at Factual we apply machine learning techniques to help us build high-quality datasets out of the gnarly mass of data that we gather from everywhere we can find it. To date we have built a collection of high-quality datasets in the areas of places (local businesses and other points of interest) and products (starting with consumer packaged goods). In the long term, however, Factual is about perfecting the process of building data regardless of the area, so many of our techniques are domain agnostic. In this post, I cover 5 principles we use when putting machine learning techniques to work.
1. Don’t Ignore the Corners
The biggest mistake people make when they attempt to use machine learning on data at huge volumes is ignoring the corner cases. If you have a dataset in the billions of records, you will have 4.5-sigma events in the thousands: an event that far from the mean occurs only a few times per million observations, but a billion records turns those rarities into thousands of cases. To do a great job, you have to handle cases that are this far away from the mean.
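The arithmetic behind that claim is easy to check with nothing but the standard library. A quick sketch (the one-billion-record figure is just an illustrative assumption):

```python
import math

def tail_prob(sigma):
    """One-sided probability that a standard normal variable exceeds `sigma`."""
    return 0.5 * math.erfc(sigma / math.sqrt(2))

n_records = 1_000_000_000            # a dataset "in the billions"
expected = n_records * tail_prob(4.5)
print(f"expected 4.5-sigma records: {expected:.0f}")  # roughly 3,400
```

Even at a probability of a few per million, scale turns corner cases into thousands of records that still have to be handled.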
The key is not giving up too soon. The challenge is that the game gets harder as you move away from the mass of data to the exceptions. In the corners, simple and straightforward techniques won’t work, and you must apply increasing amounts of algorithmic firepower to tease out the meaning. A lot of people give up instead of doing this work and it impacts the quality of their data.
Think of Olympic sprinters. Millions of people can do a 100-meter dash in 20 seconds. A few hundred can do it in less than 10 seconds. The elite runners spend huge amounts of time battling for hundredths of a second.
At Factual, we run this race toward high-quality data in a pipeline with several stages, which we will describe in a later blog post. Our first approaches handle most of the data; when you are dealing with the bulk of the data, having a huge dataset really helps. Our second stage gets us much of the rest of the way (to the level of the 10-second sprinter). Then we spend a huge amount of time battling with the remaining data. In other words, getting the corners right costs far more time and effort than getting the rest of the data right. But that's where we make new discoveries, and that's why we spend the extra time so our datasets can be relied on.
2. Be Attentive to the Boundaries
Boundary cases are another area where we pay significant attention. If we decide that everything above a score of 0.85 is good and everything below is bad, then the records sitting near that boundary are really worth being careful about. This helps us tune the threshold to make sure we are right most of the time, and it also helps us figure out other approaches for catching the edge cases.
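One simple way to act on that is to route anything too close to the cutoff to human review instead of trusting the label. This is a sketch with hypothetical names and a made-up review margin; the 0.85 cutoff comes from the example above, and nothing here is Factual's actual code:

```python
THRESHOLD = 0.85   # the example cutoff from the text
MARGIN = 0.05      # assumed width of the "be careful" band around it

def triage(score, threshold=THRESHOLD, margin=MARGIN):
    """Accept or reject a scored record, but route boundary cases to review."""
    if abs(score - threshold) <= margin:
        return "review"    # too close to the cutoff to trust either label
    return "accept" if score > threshold else "reject"

print(triage(0.99))  # accept
print(triage(0.86))  # review
print(triage(0.40))  # reject
```

The records routed to review are exactly the ones worth inspecting: they tell you whether the threshold itself needs tuning and often reveal entirely new approaches for the edge cases.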
3. Spend Time on Special Cases
When we spend time on corners and boundaries, we often find categories that need special treatment. Entities like Starbucks may have many locations, all with the same name. Entities like The White House may have to be curated manually or locked. Some locations may be packed with many entities. Over time, we have enhanced our algorithms to handle these special cases, and we are always on the lookout for more.
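In code, this kind of special-case handling often starts as a routing step in front of the default pipeline. The sketch below is purely illustrative; the category sets and handler names are hypothetical, not Factual's pipeline:

```python
# Hypothetical routing of entities to special-case handlers.
LOCKED_NAMES = {"The White House"}   # curated manually or locked
CHAIN_NAMES = {"Starbucks"}          # many locations sharing one name

def route(entity):
    """Pick a handler for an entity record like {"name": ..., "address": ...}."""
    name = entity["name"]
    if name in LOCKED_NAMES:
        return "manual_curation"
    if name in CHAIN_NAMES:
        return "chain_resolver"      # disambiguate by address, not name alone
    return "default_pipeline"

print(route({"name": "Starbucks", "address": "123 Main St"}))  # chain_resolver
```

Keeping the routing explicit makes it cheap to add the next special case when the corners and boundaries surface one.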
4. Listen to the Data
Probably the biggest thing I’ve learned from my work on advanced algorithms is to listen to the data. When I go in the wrong direction, it is usually because I’m trying to make the data fit an idea I started with rather than watching and listening to the subtleties that the data is whispering through the technology.
5. Love Your Data
All of this involves building teams with smart people, giving them the best tools, using and contributing to open source, allowing room for mistakes, and, perhaps most of all, creating a culture of people who love data. If you love data too, please get in touch. We are always looking for like-minded people to help us do this work.
Stay tuned for my next post where I give a tour of our machine learning pipeline.