Incorporating third-party data into your business is always a headache. Issues of format, price, rights, and accessibility consistently introduce inefficiency, slow things down, and generate excess heat, so it’s not inappropriate to refer to them as “frictions.”
Using geodata as an exemplar, I’m outlining these four frictions here so we can stare them in the face, note what needs fixing, and suggest how best to address them. The good news is that these issues are not insurmountable, and we seem to be moving in the right direction.
Format is the friction we’re all familiar with, but it has evolved over the last fifteen years: data vendors are increasingly keen to speak your language of choice, and formats like GeoJSON are now supported natively by GitHub. In the non-commercial world, however, many government organizations still refuse to release structured data, and often display an almost pathological aversion to downloads.
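To make the point concrete, a GeoJSON document is nothing more than structured JSON with a handful of agreed-upon keys, which is exactly why a platform like GitHub can render it directly. A minimal sketch (the coordinates and properties below are invented placeholders):

```python
import json

# A minimal GeoJSON FeatureCollection containing a single point of interest.
# Coordinates and properties are illustrative placeholders only.
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON uses [longitude, latitude] order.
                "coordinates": [151.2093, -33.8688],
            },
            "properties": {"name": "Sydney (example)"},
        }
    ],
}

geojson_text = json.dumps(feature_collection, indent=2)
print(geojson_text)
```

Any consumer that speaks JSON can parse this without a bespoke loader, which is the whole appeal of a shared format.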
Consider the Australian Census data, released as CC-BY (Creative Commons, Attribution required) but published in a way that makes it needlessly difficult to download, or Alberta's Open Data Catalogue, which chose to publish data on the Physical Condition of Government Owned and Operated Facilities without itemizing the actual facilities. Passive-aggressive data publication at its best.
For those who continue to provide data in PDF format, Caitlin Rivers offers guidelines on making data available in the most basic, structured format possible (hint: it’s CSV, which is primitive but hugely popular, in part for that very reason).
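The gap between “published” and “usable” is usually just flattening. As a sketch of what that basic, structured format looks like in practice, here is a hypothetical conversion from a nested JSON feed into flat CSV rows; the feed, field names, and values are all invented for illustration:

```python
import csv
import io
import json

# Hypothetical agency feed: nested JSON records (all field names invented).
raw = json.loads("""
[
  {"facility": "Depot A", "condition": {"score": 72, "inspected": "2013-04-01"}},
  {"facility": "Depot B", "condition": {"score": 55, "inspected": "2013-05-12"}}
]
""")

# CSV has no nesting, so each record is flattened into one row of chosen columns.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["facility", "score", "inspected"])
writer.writeheader()
for record in raw:
    writer.writerow({
        "facility": record["facility"],
        "score": record["condition"]["score"],
        "inspected": record["condition"]["inspected"],
    })

csv_text = buffer.getvalue()
print(csv_text)
```

The result opens in any spreadsheet and loads into any language's standard library, which is precisely the low bar a publisher needs to clear.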
The good news is that format friction is recognized and is being addressed. Exemplars of civic open data exist, such as the city of Palo Alto, California, and New York’s recent release of property records, building permits, and building footprints. In May this year, President Obama issued an executive order to make government data both open and machine-readable. Technologists everywhere are saying “show me the data,” and the government is, largely, listening.
Sourcing third-party data is usually more nuanced than a simple build-vs.-buy decision. For example, in 2008 Google was using Tele Atlas map data in the US. We don’t know what the annual contract fee was for this deal, but we might safely suggest that it fell somewhere between “onerous” and “punitive.” Google replaced Tele Atlas US data with its own in 2009. Jeff Bezos’ maxim that “your margin is my opportunity” prevailed, but in this case the opportunity was Google’s.
The price of data is not just about the bits; you’re paying for the collection process, cleaning, management, updates, and the representations, warranties, and indemnification that come with a contract. You’re paying for the means to get a product up and running quickly, and you’re paying because a good data vendor provides complementary tools and expertise beyond your own. When the price reflects these benefits, the model holds; when it doesn’t, developers treat overpriced data as ridiculous and quickly route around it. Data vendors are making greater efforts to articulate their full value, which leads to more holistic tool chains and, of course, lower prices.
Google’s decision to move off Tele Atlas was not based on price alone; the terms of the deal undoubtedly dictated what Google could and could not do with the data. These rights often prove to be the most significant source of friction.
License encumbrances are not confined to commercial data: OpenStreetMap (OSM) remains my preferred source for map data, but it falls under the Open Database License (ODbL). In most circumstances this is a Good Thing. However, Section 4.4 of the ODbL describes how “extraction or re-utilization of the whole or a substantial part of the contents into a new database is a Derivative Database and must comply with [the share-alike provision].” The term “substantial” is ill-defined within the document itself, so the community, to its credit, provides guidance in a separate document. Unfortunately, the guidance clarifies little: “insubstantial” is defined as “less than 100 features,” but it includes a number of caveats about use and intent which suggest that “substantial” is not an entirely quantitative metric. If taken at face value, the addition of 101 OSM entities to the Factual database of 65 million places – 0.00016% of the data, a positively homeopathic contribution – could, in principle, force the entirety of the Factual data set to adopt the share-alike provision.
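The percentage above is easy to verify with back-of-the-envelope arithmetic (the 101-feature threshold and the 65-million-place figure are both taken from the text):

```python
# 101 OSM features folded into a 65-million-place database:
osm_features = 101
factual_places = 65_000_000

share = osm_features / factual_places  # fraction of the merged data set
percentage = share * 100               # expressed as a percentage

print(f"{percentage:.5f}%")  # prints "0.00016%"
```

A contribution measured in ten-thousandths of a percent could, on a literal reading, relicense the whole, which is the absurdity being pointed at.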
I am a big fan of the share-alike approach, but this guidance is nuts. The vagueness of the language and the lack of case law make much of OSM a danger zone. The unfortunate result is that developers build walls between their data and OSM, duplicating effort instead of digesting, editing, and contributing back to a common resource. A pragmatic loss, born of unintended consequences.
This final friction concerns both the speed and the convenience of data access. A few related items fall under this heading. The specific bugbears are:
Integrating third-party data into your business can still be a bear, but the good news is that data vendors are slowly changing their tune: developer expectations are being met with more reasonable prices, increasingly standardized formats, greater interaction, and an approach to data ownership that moves away from treating data as a zero-sum game. The pace of change is set by developer expectations, so demand the download, demand the ability to contribute, look for alternatives, and keep driving toward an outcome where third-party data integration is easy, affordable, and cooperative.