Note: This was originally posted on the O’Reilly Strata Blog on 10/11/13, available here.
Incorporating third-party data into your business is always a headache. Issues of format, price, rights, and accessibility consistently introduce inefficiency, slow things down, and generate excess heat, so it’s not inappropriate to refer to them as “frictions.”
Using geodata as an exemplar, I’m outlining these four frictions here so we can stare them in the face, note what needs fixing, and suggest how best to address them. The good news is that these issues are not insurmountable, and we seem to be moving in the right direction.
Format friction is the one we’re all familiar with, but it has evolved over the last fifteen years: data vendors are increasingly keen to speak in your language of choice, and formats like GeoJSON are supported by GitHub. In the non-commercial world, however, government organizations are stubbornly refusing to make data available as structured data, and often display an almost pathological aversion to downloads.
Consider the Australian Census data, released as CC-BY (Creative Commons, Attribution required), but whose code (annotated by in-line comments) makes it intentionally difficult to download, or Alberta’s Open Data Catalogue which chose to publish data on the Physical Condition of Government Owned and Operated Facilities without itemizing the actual facilities. Passive-aggressive data publication at its best.
For those who continue to provide data in PDF format, Caitlin Rivers offers guidelines on making data available in the most basic, structured format possible (hint: it’s CSV – primitive, but hugely popular, in part for that very reason).
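To illustrate how small the gap is between a basic CSV and a web-friendly format like GeoJSON, here is a minimal Python sketch using only the standard library. The function name `csv_to_geojson`, the column names `latitude` and `longitude`, and the sample rows are all made up for illustration; a real publisher's columns will differ.

```python
import csv
import io
import json


def csv_to_geojson(csv_text, lat_field="latitude", lon_field="longitude"):
    """Convert CSV text with coordinate columns into a GeoJSON FeatureCollection dict.

    The column names are assumptions; adjust them to the file at hand.
    """
    features = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            lat = float(row.pop(lat_field))
            lon = float(row.pop(lon_field))
        except (KeyError, TypeError, ValueError):
            # Skip rows with missing or non-numeric coordinates.
            continue
        features.append({
            "type": "Feature",
            # GeoJSON positions are ordered longitude, latitude.
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            # Every remaining column is carried over as a property.
            "properties": row,
        })
    return {"type": "FeatureCollection", "features": features}


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample = (
        "name,latitude,longitude\n"
        "Example Facility A,37.44,-122.16\n"
        "Example Facility B,unknown,-122.10\n"
    )
    print(json.dumps(csv_to_geojson(sample), indent=2))
```

The second sample row is dropped because its latitude is not numeric. That is the kind of cleaning that is trivial with structured data and painful with a PDF.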
The good news is that format friction is recognized and is being addressed. Exemplars of civic open data exist, such as the city of Palo Alto, California and New York’s recent release of property records, building permits and building footprints. In May this year, President Obama released an executive order to make data both open and machine-readable. Technologists everywhere are saying “show me the data,” and the government is, largely, listening.
Sourcing third-party data is usually more nuanced than a simple Build v. Buy decision. For example, in 2008 Google was using Tele Atlas map data in the US – we don’t know what the annual contract fee was for this deal, but we might safely suggest that it fell somewhere between “onerous” and “punitive.” Google replaced Tele Atlas US data with their own in 2009. Jeff Bezos’ maxim that “your margin is my opportunity” prevailed, but in this case the opportunity was Google’s.
The price of data is not just about the bits; you’re paying for the data collection process, cleaning, management, updates, reps, warranties, and indemnification. You’re paying for the means to get a product up-and-running quickly, and you’re paying because a good data vendor will provide complementary tools and expertise outside your own. When the price justifies these benefits, the model holds true, but when it does not, developers treat overpriced data as ridiculous and quickly route around it. Data vendors are increasingly making greater efforts to articulate their entire value, which leads to more holistic tool chains and, of course, lower prices.
Google’s decision to move off Tele Atlas was not based on price alone; the terms of the deal undoubtedly dictated what Google could and could not do with the data. These rights often prove to be the most significant source of friction.
License encumbrances are not confined to commercial data: OpenStreetMap (OSM) remains my preferred source for map data, but it falls under the Open Database License (ODbL). In most circumstances this is a Good Thing. However, Section 4.4 of the ODbL describes how, “extraction or re-utilization of the whole or a substantial part of the contents into a new database is a Derivative Database and must comply with [the share-alike provision].” The term “substantial” is ill-defined within the document itself, so the community, to their credit, provides guidance in a separate document. Unfortunately, the guidance clarifies little: “insubstantial” is defined as “less than 100 features” but includes a number of caveats about use and intent which suggest that “substantial” is not entirely a quantitative metric. If taken at face value, the addition of 101 OSM entities to the Factual database of 65 million places – 0.00016% of the data, a positively homeopathic contribution – could, in principle, force the entirety of the Factual data set to adopt the Share Alike provision.
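As a quick check of the arithmetic above, the following Python sketch recomputes the share using only the two figures quoted in the text (101 OSM entities and 65 million places). The helper name `share_percent` is made up for illustration.

```python
def share_percent(added, total):
    """Return `added` as a percentage of `total`."""
    return added / total * 100


osm_entities = 101            # figure quoted in the text
factual_places = 65_000_000   # figure quoted in the text

pct = share_percent(osm_entities, factual_places)
print(f"{pct:.5f}% of the combined data")  # prints "0.00016% of the combined data"
```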
I am a big fan of the Share-Alike approach, but this guidance is nuts. The vagueness of the language and the lack of case law make so much of OSM a danger zone. The unfortunate result is that developers build walls between their data and OSM, and spend time duplicating efforts instead of digesting, editing, and contributing back to a common resource. It is a practical loss, and an unintended consequence.
The final friction, accessibility, affects the alacrity and the convenience of data access. A few related items fall under this heading. The specific bugbears are:
- Lack of Scope: can you see the entirety of the data you are licensing, or only a slice at a time? Traditional APIs – great for entity discovery in something like a Local Search use case – are worthless for any holistic understanding of the data. Wherever practical, developers desire the entirety of the data as a download. Most data providers view downloads as infeasible, impractical, or otherwise undesirable for reasons of trust or sub-licensing restrictions. However, CSVs are increasingly the market expectation, and, as a consequence, APIs have become a restriction on, rather than gateway to, the data. In response to this need, after much debate, we started offering CSVs at Factual two years ago.
- Too Much Data: conversely, the entirety of the data is often too unwieldy, or simply not relevant to one’s needs. OSM is unique among map data providers in that it allows access to the raw mapping data underlying the tiles. The specific friction here is the size of the default download: 27 GB compressed (370 GB uncompressed) of joyous XML. The size of this dataset makes it difficult to isolate thematic or geographic extracts: this is where companies like Open Cage Data (I am an adviser) reduce friction by providing custom, pre-baked extraction filters for specific subsets, and regular refreshes, of OSM data. Extracting and indexing select data components is computationally intensive and will often fall to a secondary organization that specializes purely in value-through-accessibility: simplifying the discovery, extraction, and refresh of select datasets from a larger whole.
- No Contributory Model: until recently, the majority of third-party datasets were read-only: the vendor collected data, collated it in a black-box sort of fashion, and pushed it out to customers in a one-way stream of information. This flow is changing as many vendors realize that their customers are very often best positioned to correct and supplement their data. Data producers are increasingly asked to be more responsible and responsive stewards of data, to do less vending and more management.
- No Interface: The existence of a contribution API does not guarantee accessibility; genuine usefulness, however, does: OSM has always been a user-created dataset, but the recent introduction of a new editing tool by MapBox raised the average number of people who returned to contribute by 8%. “By lowering the barrier to contributions, we believe that more people can contribute their local knowledge to the map – the crucial factor that sets OSM apart from closed-source commercial maps.” Indeed: greater accessibility, lower friction, increased engagement and, ultimately, better data.
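To make the "Too Much Data" point concrete, here is a rough Python sketch of a thematic and geographic extract: it streams an OSM XML file and keeps only the nodes that fall inside a bounding box and carry a given tag. It assumes the standard OSM XML layout (`node` elements with `lat` and `lon` attributes and nested `tag` elements with `k` and `v`). The function name `extract_nodes` and the inline sample are made up for illustration, and this is not how any particular extract provider works; at the full 370 GB, purpose-built tooling is the realistic choice.

```python
import io
import xml.etree.ElementTree as ET


def extract_nodes(source, bbox, key, value=None):
    """Yield (id, lat, lon, tags) for tagged nodes inside a bounding box.

    source is a filename or binary file object containing OSM XML.
    bbox is (min_lon, min_lat, max_lon, max_lat).
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the enclosing <osm> element
            continue
        if elem.tag not in ("node", "way", "relation"):
            continue  # child elements are handled with their parent
        if elem.tag == "node":
            lat, lon = elem.get("lat"), elem.get("lon")
            if lat is not None and lon is not None:
                lat, lon = float(lat), float(lon)
                tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
                in_box = min_lon <= lon <= max_lon and min_lat <= lat <= max_lat
                matches = key in tags and (value is None or tags[key] == value)
                if in_box and matches:
                    yield elem.get("id"), lat, lon, tags
        # Drop finished top-level elements so memory use stays flat.
        root.clear()


# Tiny made-up sample in OSM XML layout, for illustration only.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6">
  <node id="1" lat="51.5010" lon="-0.1420">
    <tag k="amenity" v="cafe"/>
  </node>
  <node id="2" lat="48.8580" lon="2.2940">
    <tag k="amenity" v="cafe"/>
  </node>
  <node id="3" lat="51.5020" lon="-0.1400"/>
</osm>"""

if __name__ == "__main__":
    box = (-0.2, 51.4, 0.0, 51.6)
    for node in extract_nodes(io.BytesIO(SAMPLE), box, "amenity", "cafe"):
        print(node)
```

Even this toy version has to read every byte of the file to find a handful of matches, which is why pre-baked, regularly refreshed extracts are worth paying for.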
Integrating third-party data into your business can still be a bear, but the good news is that data vendors are slowly changing their tune: developer expectations are being met by more reasonable prices, increasingly standardized formats, increased interaction, and an approach to data ownership that moves away from the idea of data as a zero-sum game. The pace of change is dictated by developer expectations, so demand the download, demand the ability to contribute, look for alternatives, and keep driving to an outcome where third-party data integration is easy, affordable, and cooperative.