When writing code, you’d like to be able to test the “happy path”, error conditions, throughput, thread-safety, fault-tolerance, etc. Back in the real world, however, you find yourself delighted when the code you’re working with has even a single test. The silver lining is that code testing is becoming easier and easier, with tools like continuous integration, dependency injection, mocking frameworks, unit testing frameworks, FitNesse, automatic test generators like AgitarOne, and so on.
For data, as for code, there are plenty of things to test, including consistency, accuracy, coverage, correctness, completeness, and so on. However, not only is it hard to test data using automated tests (as opposed to manual review), it’s often hard to even know where to start!
Let’s say I gave you the following list of all of the Starbucks cafes in zip codes 90067 and 98134, and then asked you to tell me if the data is high quality. What would you do? How would you approach the problem?
1999 Ave. of the Stars, Century City, CA 90067, (310) 557-0785
2025 Ave. of the Stars, Century City, CA 90067, 310-228-1234
2049 Century Park East, Century City, CA 90067, (310) 286-3023
2401 Utah Avenue South, Seattle, WA, 98134, (206) 447-0828
4115 4th Avenue S, Seattle, 98134, (206)-381-1553
1926 1st Ave S., Seattle, WA, 98134, (206)-624-6045
134071 Seventh Ave NW, Seattle, WA, 98134, (206)-521-1242
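Even before defining quality formally, mechanical format checks surface problems in a list like this. Below is a minimal Java sketch; the canonical phone format and the two-state list are assumptions chosen purely for illustration, not a real validation spec:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// A first-pass format checker for comma-separated rows like the list above.
// The expected phone format and the tiny state whitelist are assumptions.
public class FormatCheck {
    private static final Pattern PHONE = Pattern.compile("\\(\\d{3}\\) \\d{3}-\\d{4}");
    private static final Pattern STATE = Pattern.compile("\\b(CA|WA)\\b");

    // Returns a (possibly empty) list of complaints about one row.
    public static List<String> issues(String row) {
        List<String> out = new ArrayList<>();
        String[] fields = row.split(",\\s*");
        String phone = fields[fields.length - 1].trim();
        if (!PHONE.matcher(phone).matches())
            out.add("nonstandard phone: " + phone);
        if (!STATE.matcher(row).find())
            out.add("missing state");
        return out;
    }

    public static void main(String[] args) {
        String[] rows = {
            "1999 Ave. of the Stars, Century City, CA 90067, (310) 557-0785",
            "2025 Ave. of the Stars, Century City, CA 90067, 310-228-1234",
            "4115 4th Avenue S, Seattle, 98134, (206)-381-1553",
        };
        for (String row : rows)
            for (String issue : issues(row))
                System.out.println(issue + "  <-  " + row);
    }
}
```

Even a toy checker like this flags the differently formatted phone numbers and the row with no state, which is a decent first pass; it says nothing about deeper questions like whether the addresses actually exist.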
Here are some of the things you might want to look for: inconsistent phone-number formatting (“(310) 557-0785” vs. “310-228-1234” vs. “(206)-381-1553”), a missing state (the “4115 4th Avenue S” row), an implausible six-digit street number (“134071 Seventh Ave NW”), and, harder to check, whether each row corresponds to a real, currently open Starbucks.
Not only are these attributes hard to test, but often the tests are as hard as the original problem! For example, suppose you could write an automated test that detects when dedupe is overfolding (merging records that are actually distinct) or underfolding (failing to merge true duplicates). Then you could presumably fold that same check into the deduping algorithm itself to prevent overfolding and underfolding in the first place. That leaves you back at square one for testing.
While testing data quality is a big challenge, it also pays great dividends. When you’re talking about millions or even billions of pieces of data, improving the quality by even a few percent has a huge absolute impact.
Here are a few example problems that you might encounter during the Factual interview process. If you work on one of these challenges and are happy with your results, please email your solution to us at firstname.lastname@example.org. We’d love to meet you!
A few practical notes: First, for efficiency reasons, a lot of our data-crunching code is in Java. If your answers are in Java, that would be ideal; if not, that’s okay too. Second, these problems are hard and have a tendency to suck up as much time as you allot to them. Please don’t feel like you have to go overboard and write thousands of lines of code. The goal is to play with data and have some fun, not to take a week off from your regular activities. =)
Given a file containing between 10,000 and 1,000,000 inputs (1 line = 1 input), write a program to emit outliers along with brief justifications of why they are outliers. You can assume the total amount of data will not exceed 500MB. For example:
… 9990 more 5-digit numbers …
348597 (99.99% of inputs have length 5)
abcde (99.99% of inputs consist of only digits)
It’s hard to give sample data for a problem like this without strongly biasing approaches and solutions. If you’d like some sample data, we recommend downloading a real dataset: marathon times for a race, all words in the English language, etc. If your heuristics work well, then you (probably) won’t see many outliers in the downloaded data. However, if you manually add a couple of suspicious lines, they should be flagged.
You will be given a file with up to 1,000,000 inputs (< 500MB of data). There will be one input per line, and each input will consist of the following TAB-separated elements: id, address, city, region, and postal code. You can assume that the data itself will not contain tabs, so just splitting each line by ‘\t’ will be fine.
Your goal is to generate a list of pairs of inputs that you believe refer to the same place. For example:
1 123 First Street Los Angeles California 90025
2 123 First St. West Los Angeles CA 90025
3 123 First Street Paris Texas 75462
4 123 First Street Paris France 75008
5 123 1er Street Paris France 75008
Here, inputs 1 and 2 refer to the same place despite the abbreviated “St.” and the spurious “West”, as do inputs 4 and 5 (“1er” is French for “first”); input 3 is a genuinely different location that happens to share a street address and a city name.
As with outlier detection, sample data tends to bias solutions. A good test would be to find a list of locations for some chain (Starbucks, McDonald’s, etc.). Once you have this list, copy some of its rows and modify them slightly (add a typo, change the street number, reformat the phone number, etc.). Your dedupe code should flag the lightly modified copies as duplicates of existing inputs, while correctly treating the heavily modified copies as too different from any existing row.
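To make the problem concrete, here is a deliberately naive baseline in Java: normalize each record into a token set, expand a handful of abbreviations, and pair records whose token sets overlap heavily. The abbreviation table and the 0.8 Jaccard threshold are made-up illustrations; a real solution needs far more (edit distance, phonetics, field-aware weighting):

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Naive dedupe baseline: token-set overlap on normalized records.
public class Dedupe {
    // Toy abbreviation table -- an assumption, not a complete list.
    private static final Map<String, String> ABBREV = Map.of(
        "st", "street", "ave", "avenue", "ca", "california");

    // Lowercase, split on non-alphanumerics, expand known abbreviations.
    public static Set<String> tokens(String record) {
        Set<String> out = new HashSet<>();
        for (String t : record.toLowerCase().split("[^a-z0-9]+")) {
            if (t.isEmpty()) continue;
            out.add(ABBREV.getOrDefault(t, t));
        }
        return out;
    }

    // Jaccard similarity: |intersection| / |union|.
    public static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        // id -> TAB-separated record (address, city, region, postal code).
        Map<Integer, String> rows = new LinkedHashMap<>();
        rows.put(1, "123 First Street\tLos Angeles\tCalifornia\t90025");
        rows.put(2, "123 First St.\tWest Los Angeles\tCA\t90025");
        rows.put(3, "123 First Street\tParis\tTexas\t75462");
        rows.put(4, "123 First Street\tParis\tFrance\t75008");
        rows.put(5, "123 1er Street\tParis\tFrance\t75008");
        Integer[] ids = rows.keySet().toArray(new Integer[0]);
        for (int i = 0; i < ids.length; i++)
            for (int j = i + 1; j < ids.length; j++)
                if (jaccard(tokens(rows.get(ids[i])), tokens(rows.get(ids[j]))) >= 0.8)
                    System.out.println(ids[i] + " ~ " + ids[j]);
    }
}
```

On the five-row example, this baseline pairs inputs 1 and 2 but misses 4 and 5, since nothing tells it that “1er” means “first”, which is a good illustration of why the problem is hard.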