
Battling Fake Data

Everyone from the president to Facebook claims to be battling fake news. I am engaged in a war against fake data.

In today’s environment, energy spent in this direction feels vital. “Post-truth” was Oxford Dictionaries’ 2016 word of the year. The Washington Post subsequently stated plainly, “It’s official: Truth is dead. Facts are passé.” Farhad Manjoo caught this trend early with his 2008 book “True Enough: Learning to Live in a Post-Fact Society,” describing how individuals construct personalized realities “built out of our own facts.”

But I don’t want to live in a post-fact world. At Factual, the neutral location data company, we always strive to build and offer high-quality data, factual data. Our customers rely on this data to make powerful apps, to optimize marketing spend, and to make strategic decisions. So detecting bad or fake data always has to be our first priority. Much of what we’ve learned can apply to industries that are being re-imagined for a digital, automated, data-driven world.

The methods of battling #fakedata aren’t magic. They are, rather, grueling work. But it is a satisfying endeavor, if, like me, you hold the view that shared access to accurate knowledge is the underpinning of everything from amazing mobile app experiences to a functioning democracy.

At Factual, we follow six principles that keep us on the front lines in the battle against fake data:

Regular measurement. No matter what your source of data, it is critical to have a process for measuring its quality regularly. Your data may be “first-party data” entered directly by your web or mobile users. It may be data you’ve licensed or acquired from business partners. You may have hardware that captures streams of information directly. In every case, however, there are plenty of opportunities for bad data, biased data, or, yes, fake data. By running a regular process for measuring quality, you’ll at least know where you stand. At Factual, we regularly measure many dimensions, including place existence, attribute accuracy, and entity uniqueness, so we can be assured that our location data quality is improving at a healthy rate.
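To make this concrete, here is a minimal sketch of such a measurement pass. The record fields, the hand-verified sample, and the two metrics (existence rate and uniqueness rate) are illustrative inventions, not Factual's actual pipeline:

```python
from dataclasses import dataclass

# Hypothetical place records; the `exists` flag comes from a human checker.
@dataclass(frozen=True)
class Place:
    name: str
    address: str
    exists: bool

def quality_metrics(sample):
    """Compute simple quality metrics over a hand-verified sample."""
    total = len(sample)
    existing = sum(1 for p in sample if p.exists)
    # Count distinct (name, address) pairs to estimate entity uniqueness.
    unique = len({(p.name.lower(), p.address.lower()) for p in sample})
    return {
        "existence_rate": existing / total,
        "uniqueness_rate": unique / total,
    }

sample = [
    Place("Cafe Uno", "1 Main St", True),
    Place("Cafe Uno", "1 Main St", True),    # duplicate entity
    Place("Old Diner", "9 Elm Ave", False),  # closed in reality
    Place("Book Nook", "5 Oak Rd", True),
]
m = quality_metrics(sample)
print(m)  # existence_rate 0.75, uniqueness_rate 0.75
```

Run on a schedule against a fresh sample, even a toy report like this tells you whether quality is trending up or down.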

Continuous improvement. The key is to avoid any steps backward. A good architecture will allow you to roll back any changes that resulted in quality problems. A better architecture will first measure the impact of data changes ahead of making them. With an effective data build system, you can ensure that quality only moves up and to the right, and with enough effort and time you will reach your quality goals. While there are many systems for building software, there are fewer for data. At Factual, we open-sourced Drake, a system for managing the workflows around making data.
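The "measure before shipping" idea can be sketched as a simple quality gate. This is an illustrative toy, not how Drake or Factual's build system actually works: a candidate data build is promoted only if its measured quality score does not regress, so production quality is monotonically non-decreasing.

```python
def promote(candidate_score: float, production_score: float) -> bool:
    """Promote a candidate data build only if quality does not regress."""
    return candidate_score >= production_score

history = [0.90]  # quality score of each shipped production build
for candidate in [0.92, 0.89, 0.94]:  # scores measured on candidate builds
    if promote(candidate, history[-1]):
        history.append(candidate)  # ship it
    # otherwise: never ship the regression -- rollback by omission

print(history)  # [0.9, 0.92, 0.94]: quality only moves up and to the right
```

The 0.89 build is simply never promoted, which is cheaper than rolling back after the fact.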

Independent confirmation. Bad data is all around us. There are so many sources of error: bias, manipulation, software error, bad logic, spam… someday I’ll present the complete taxonomy. But when multiple, independent sources point to the same fact the likelihood of error is reduced. It is statistically unlikely that multiple sources will possess the same bad data point, unless they are in cahoots — thus the stipulation that the sources must be independent.
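The statistical intuition is easy to state. If each source, independently, gets a given fact wrong with probability p, then the chance that k independent sources all agree on the same wrong value is at most p to the power k. The numbers below are illustrative, and the bound only holds under the independence assumption the paragraph above stipulates:

```python
# Upper bound on the chance that k independent sources share the same
# bad data point, assuming each errs with probability p (illustrative).
def agreement_error_bound(p: float, k: int) -> float:
    return p ** k

for k in (1, 2, 3):
    print(k, agreement_error_bound(0.1, k))
# the bound shrinks roughly tenfold with each additional independent source
```

Correlated sources ("in cahoots," or copied from a common upstream) break the independence assumption, which is why confirmation from a sibling of the same feed adds little.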

Transparency. Linus’s Law, which states that “given enough eyeballs, all bugs are shallow,” is an argument for why open source software can become so reliable. Similarly, when data is open and browsable by many people, bugs in that data can be found and subsequently fixed. At Factual, we’re proud of the fact that our Global Places™ data is transparent and browsable. With this open approach, we can receive real-time feedback and get things fixed more quickly.

Positive feedback loops. There are many ways to architect feedback loops that will result in better data quality. One approach we’ve focused on is our partnership model. We incentivize business partners to share data back with us. Partners provide us streams of new data leading to quality improvements, and with higher quality data we attract brand new partnerships, creating a virtuous circle.

Trust modeling. When analyzing trillions of data points from a multitude of sources, we are always looking for ways to sift the good from the bad. For example, by statistically comparing contributed data to a hand-curated gold standard, we can measure how trustworthy any given source is. With a trust model, you may determine that certain users or sources have a tendency to be biased or inaccurate. At Factual, machine-learned models are needed to operate efficiently at our vast scale. But, no matter the technical approach, the key is to get those #fakeusers and #fakesources out of there! Eliminate bad data channels and quality will improve rapidly.
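Here is a minimal sketch of the gold-standard comparison described above. The sources, claims, threshold, and scoring rule are all hypothetical; a production trust model would be machine-learned and far richer, but the core move is the same: score each source against hand-curated truth, then cut off the low-trust channels.

```python
# Hand-curated gold standard: known-correct attribute values (illustrative).
gold = {"place_1": "9am-5pm", "place_2": "closed", "place_3": "24h"}

# Claims contributed by three hypothetical sources.
contributions = {
    "source_a": {"place_1": "9am-5pm", "place_2": "closed", "place_3": "24h"},
    "source_b": {"place_1": "9am-5pm", "place_2": "open",   "place_3": "24h"},
    "source_c": {"place_1": "noon",    "place_2": "open",   "place_3": "9am"},
}

def trust(claims, gold):
    """Fraction of a source's claims that match the gold standard."""
    overlap = [k for k in claims if k in gold]
    return sum(claims[k] == gold[k] for k in overlap) / len(overlap)

scores = {src: trust(claims, gold) for src, claims in contributions.items()}
trusted = {src for src, score in scores.items() if score >= 0.5}
print(sorted(trusted))  # source_c falls below the trust threshold
```

Once a source's trust score is known, its future contributions can be down-weighted or dropped entirely, which is the "eliminate bad data channels" step.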

Simply understanding these six key principles puts you ahead of the game. But make sure that your data partners embrace them, too. Then, with some effort, we’ll all be on our way to defeating #fakedata.

This post was originally published on Medium.