One of the hardest attributes to pin down in our Global Places data is our Category Labels field. This is where we describe what a place is or what service it provides. It’s what tells you that LaRocco’s Pizzeria is a Pizza Restaurant, whereas Rocco’s Tavern, a mere 200 feet down the street, is both a Sports Bar and a Pizza Restaurant.
While building data is complex in and of itself, a number of issues affect categories specifically and make the task harder. Categorizing a business is often a subjective endeavour, and it can be extremely difficult to guess a place’s category from location or name-based cues alone. For example, it is difficult to know that Public School 310 is a restaurant, as opposed to a school, solely by looking at the name.
We work hard to build high-quality data so that our customers receive correct information about the places they care about. Having said that, when you provide data for over 90 million places across 50 countries, there will be some warts. Since categorization poses such a thorny problem, some of these imperfections rise to the surface from time to time. With that in mind, we decided to show some of these challenges, explain why they arise, and describe how we address them so that our data is always of the highest quality.
To understand where challenges in categorization come from, it’s important to first understand how categories are assigned to places. We build data by gathering information from multiple sources (feedback from different partner apps, trusted data contributors, data from the web, etc.). Pieces of information about the same place are called inputs, and a key part of our process is determining which value is correct when inputs from different sources conflict.
For categories, every source has its own taxonomy, or way of defining how a business or point of interest fits in the world. Likewise, we have our own taxonomy at Factual. So a preliminary, essential part of our category process is mapping all of those source taxonomies onto ours. For example, in our taxonomy, pizza restaurants fall under Social > Food and Dining > Restaurants > Pizza. Other sources can express this in a seemingly limitless number of ways, such as restaurants - pizza, pizza restaurants, pizzerias, pizza places, pizza-restaurants, etc., all of which must point to the correct node in our taxonomy. We have 466 categories in our taxonomy, other sources can have even more, and we are constantly gathering information from new sources, so this mapping must be continually maintained to stay as accurate and up to date as possible. We keep our mappings current both by adding specific taxonomy mappings one at a time and via a machine learning-based approach that can add thousands of mappings at a time (developed with one of our former interns, Sarah Krasnik).
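The mapping step can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the table `SOURCE_TO_NODE`, the helpers `normalize` and `map_label`, and the normalization rule are hypothetical and are not a description of our production pipeline. Only the pizza node and the source spellings come from the example above.

```python
import re

# Target node taken from the example in the text.
PIZZA = "Social > Food and Dining > Restaurants > Pizza"

# Hypothetical mapping table: normalized source label -> node in our taxonomy.
SOURCE_TO_NODE = {
    "restaurants pizza": PIZZA,
    "pizza restaurants": PIZZA,
    "pizzerias": PIZZA,
    "pizza places": PIZZA,
}


def normalize(label: str) -> str:
    """Lowercase and collapse punctuation and whitespace into single spaces,
    so that spelling variants such as 'pizza-restaurants' share one key."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def map_label(label: str):
    """Return the taxonomy node for a source label, or None if unmapped."""
    return SOURCE_TO_NODE.get(normalize(label))
```

With this sketch, `map_label("restaurants - pizza")` and `map_label("pizza-restaurants")` resolve to the same node, while a label that has not been mapped yet returns `None` rather than a guess.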
The next step in the process is analyzing the data gathered from all of our millions of unique sources to build our places dataset. Once we have identified the many data points that refer to a single place, we algorithmically determine the most factual representation of that place. To select which categories get assigned to our places, we systematically consider the category information from all of the inputs associated with each place to discover which category or set of categories is most likely correct. Since places can often be represented by more than one category (think bar + restaurant), we surface up to three categories per record. We also keep a small list of categories that can co-occur (one place can be labeled both a gas station and a convenience store, for example), so that we don’t end up with bizarre groupings like Pharmacy + Museum (a place can be either a pharmacy or a museum, not both).
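As a rough illustration of the selection step, the sketch below picks categories by simple vote counting and checks each addition against a co-occurrence allowlist. Vote counting is a stand-in chosen for brevity, not the algorithm we actually run, and `ALLOWED_PAIRS`, `can_co_occur`, and `select_categories` are hypothetical names. The limit of three categories and the example pairings come from the text.

```python
from collections import Counter

# Hypothetical allowlist of category pairs that may describe the same place.
ALLOWED_PAIRS = {
    frozenset({"Gas Station", "Convenience Store"}),
    frozenset({"Bar", "Restaurant"}),
}

MAX_CATEGORIES = 3  # up to three categories are surfaced per record


def can_co_occur(a: str, b: str) -> bool:
    """True if the two categories appear together on the allowlist."""
    return frozenset({a, b}) in ALLOWED_PAIRS


def select_categories(inputs):
    """Pick up to three categories, most-voted first.

    `inputs` holds one already-mapped category (or None) per source input.
    A category is added only if it may co-occur with every category
    that has already been chosen.
    """
    votes = Counter(c for c in inputs if c is not None)
    chosen = []
    for category, _count in votes.most_common():
        if len(chosen) == MAX_CATEGORIES:
            break
        if all(can_co_occur(category, kept) for kept in chosen):
            chosen.append(category)
    return chosen
```

Under these assumptions, inputs of three "Pharmacy" votes and one "Museum" vote yield only Pharmacy, because the pair is not on the allowlist, while Gas Station and Convenience Store can be surfaced together.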
Here are a few examples of particularly tricky place categorizations that we have seen in our data and how we caught and fixed them.
One of the obvious problems associated with categories is simply not having one. When some location services are unsure of what a place is, they fill in the blank with a non-informative category label such as “local business” or “establishment.” We prefer to leave the label off altogether if we cannot provide a meaningful one (it’s worth noting that this is rare; over 97% of our places in the US have a category assigned). This can happen either when we can’t get reliable information about what a place is or when the description of the place has not been mapped to our taxonomy yet. When Bethsaida Seventh Day Church first surfaced in our data, it was a case of the former problem.
| Name | Category Before | Category After |
| Bethsaida Seventh Day Church | null | Churches |
| Wxmc-1310 AM-New Jerusalem Church Lines | Media |
| Church & Dwight Co. | Manufacturing |
| Thomas Carton Church | Ophthalmologists |
When businesses without categories pop up in our data, they typically get categories assigned quickly, since our dynamic categorization system allows us to make changes at any time to rectify problems. We add new taxonomy mappings every month, which both adds new categories and corrects existing errors. On top of that, we’re constantly adding new sources to improve the quality of our places data. With new data and new category mappings, we assign meaningful categories to Factual places on an ongoing basis.
| Name | Category Before | Category After |
| Vape Star | Home Improvement | Tobacco |
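To make the ongoing re-categorization concrete, here is a small sketch of how a place with no category can pick one up once a new mapping is added. The place record, the raw label, the mapping table, and the `categorize` helper are all hypothetical and exist only for illustration. The point is that the stored inputs stay the same and only the mapping changes.

```python
def categorize(raw_labels, mapping):
    """Return the first mapped category among a place's raw source labels,
    or None when no label is mapped (no generic fallback label is used)."""
    for label in raw_labels:
        category = mapping.get(label.strip().lower())
        if category is not None:
            return category
    return None


# Hypothetical place record and mapping table, for illustration only.
place = {"name": "Example Chapel", "raw_labels": ["House of Worship"]}
mapping = {"pizzerias": "Pizza"}

before = categorize(place["raw_labels"], mapping)  # label is not mapped yet
mapping["house of worship"] = "Churches"           # a new mapping is added
after = categorize(place["raw_labels"], mapping)   # same inputs, re-resolved

print(before, "->", after)
```

Keeping the raw labels alongside each place is the design choice that matters in this sketch: it lets a corrected or newly added mapping take effect without waiting for fresh inputs.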