One of the hardest attributes to pin down in our Global Places data is our Category Labels
field. This is where we describe what a place is or what service it provides. It’s what tells you that LaRocco’s Pizzeria
is a Pizza Restaurant, whereas Rocco’s Tavern, a mere 200 feet down the
street, is both a Sports Bar and a Pizza Restaurant.
Building data is complex in and of itself, and that complexity is compounded by a number of issues that affect categories
specifically. Categorizing a business is often a subjective endeavour, and it can be extremely difficult to infer a place’s category
from location or name-based cues alone. For example, it is difficult to know that Public School 310
is a restaurant, as opposed to a school, solely by looking at the name.
We work hard to build high quality data, so our customers receive correct information about the places they care about. Having said that,
when you provide data for over 90 million places across 50 countries, there will be some warts. Since categorization poses such a thorny
problem, some of these imperfections rise to the surface from time to time. With that in mind, we decided to show some of these challenges, explain why they arise, and describe how we address them so that our data is always of the highest quality.
Where Categories Come From
In order to understand where challenges in categorization come from, it’s important to first understand how categories are assigned to
places. We build data by gathering information from multiple sources (feedback from different partner apps,
trusted data contributors,
data from the web, etc.). Pieces of information about the same place are called inputs, and a key part of our process is determining
which value is correct if inputs from different sources have conflicting values.
For categories, every source has its own taxonomy, or way of defining how a business or point of interest fits in the world. Likewise,
we have our own taxonomy at Factual. So, a preliminary, essential part of our
category process is mapping all of those source taxonomies onto ours. For example, in our taxonomy, pizza restaurants fall under Social > Food and Dining > Restaurants > Pizza. Other sources can express this in a seemingly limitless number of ways, such as
restaurants – pizza, pizza restaurants, pizzerias, pizza places, and pizza-restaurants, all of which must point to the correct node in our taxonomy. We have 466 categories in our taxonomy, and other sources can have even more. Because we are also constantly gathering
information from new sources, this mapping must be continually maintained to stay as accurate and up to date as possible. We keep our mappings current both by adding specific taxonomy mappings one at a time and via a machine learning-based approach that can add thousands of mappings at a time (developed with one of our former interns, Sarah Krasnik).
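To make the mapping step concrete, here is a minimal sketch of how source labels might be normalized and looked up. The node path is the one from the example above; the table contents, the normalization rule, and the names `SOURCE_TO_NODE`, `normalize_label`, and `map_to_taxonomy` are assumptions made for illustration, not a description of the actual system.

```python
import re

# Node path taken from the example above.
PIZZA = "Social > Food and Dining > Restaurants > Pizza"

# Hypothetical mapping table: normalized source label -> taxonomy node.
SOURCE_TO_NODE = {
    "restaurants pizza": PIZZA,
    "pizza restaurants": PIZZA,
    "pizzerias": PIZZA,
    "pizza places": PIZZA,
}


def normalize_label(label):
    """Lowercase the label and collapse dashes, punctuation, and spaces."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def map_to_taxonomy(label):
    """Return the taxonomy node for a source label, or None if unmapped."""
    return SOURCE_TO_NODE.get(normalize_label(label))
```

With this normalization, "pizza restaurants" and "pizza-restaurants" resolve to the same entry, so each spelling variant does not need its own row. A label that has not been mapped yet returns `None` rather than a guess.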
The next step in the process is analyzing the data gathered from all of our millions of unique sources to build our places dataset.
Once we have identified the many data points that refer to a single place, we then algorithmically determine the most factual
representation of that place. To select which categories get assigned to our places, we systematically consider the category information
from all of the inputs associated with each place to discover what category or set of categories is most likely correct. Since places can
often be represented by more than one category (think bar + restaurant), we surface up to three categories per record. We also keep a
small list of categories that can co-occur (one place can be labeled both a gas station and a convenience store, for example),
so that we don’t end up with bizarre groupings like Pharmacy + Museum (a place can be either a pharmacy or a museum, not both).
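The selection step can be sketched as a simple vote over the categories reported by a place's inputs, filtered through a co-occurrence allow-list. The real selection logic is not described here, so plain frequency counting stands in for it; the allow-list contents and the names `ALLOWED_PAIRS`, `can_co_occur`, and `select_categories` are hypothetical.

```python
from collections import Counter

MAX_CATEGORIES = 3  # from the text: up to three categories per record

# Hypothetical allow-list of category pairs that may share one record.
ALLOWED_PAIRS = {
    frozenset({"Gas Station", "Convenience Store"}),
    frozenset({"Bar", "Restaurant"}),
}


def can_co_occur(a, b):
    """True if the two categories are allowed on the same place."""
    return frozenset({a, b}) in ALLOWED_PAIRS


def select_categories(input_categories):
    """Pick up to MAX_CATEGORIES labels from the inputs for one place.

    Categories are considered from most to least frequently reported, and
    one is kept only if it may co-occur with every category already kept.
    """
    counts = Counter(input_categories)
    chosen = []
    for category, _ in counts.most_common():
        if len(chosen) == MAX_CATEGORIES:
            break
        if all(can_co_occur(category, kept) for kept in chosen):
            chosen.append(category)
    return chosen
```

For inputs such as `["Pharmacy", "Pharmacy", "Museum", "Pharmacy"]`, the sketch keeps only Pharmacy, because the Pharmacy + Museum pair is not on the allow-list.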
Here are a few examples of particularly tricky place categorizations that we have seen in our data and how we caught and fixed them.
Bethsaida Seventh Day Church
One of the obvious problems associated with categories is simply not having one. When some location services are unsure of what a place
is, they fill in the blank with a non-informative category label such as “local business” or “establishment.” We prefer to leave
the label off altogether if we cannot provide a meaningful one (it’s worth noting that this is rare; over 97% of our places in the US
have a category assigned). This can happen either when we can’t get reliable information about what a place is, or when the description of
the place has not been mapped to our taxonomy yet. When Bethsaida Seventh Day Church first surfaced in our data, it was a case of the former problem.
You might be thinking: “Why don’t you just look at the name? You can plainly see that it’s a church!” It turns out that trying to
assign categories strictly based on business names is an unreliable approach. While it’s true that “church” often shows up in the
names of churches, it also shows up in the names of other things, such as the radio station Wxmc-1310 Am-New Jerusalem Church Lines.
So, categorizing based on the word “church” would get that wrong. Now you might be thinking something along the lines of: “But it will
work if ‘church’ is the first or last word. You could use regular expressions, right?” However, even when you restrict the search with
these types of rules, you’ll still run into things like Church & Dwight Co., which manufactures cleaning products, or Thomas Carton Church, an ophthalmologist.
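The weakness of the "first or last word" rule is easy to demonstrate. The sketch below applies such a rule to the four names mentioned above; the pattern and the function name `looks_like_church` are illustrative assumptions, not a rule taken from any real pipeline.

```python
import re

# Naive rule: "church" must be the first or the last word of the name.
CHURCH_AT_EDGE = re.compile(r"^church\b|\bchurch$", re.IGNORECASE)


def looks_like_church(name):
    """Return True if the naive first-or-last-word rule fires."""
    return bool(CHURCH_AT_EDGE.search(name.strip()))


names = [
    "Bethsaida Seventh Day Church",             # an actual church
    "Wxmc-1310 Am-New Jerusalem Church Lines",  # a radio station
    "Church & Dwight Co.",                      # a manufacturer
    "Thomas Carton Church",                     # an ophthalmologist
]

results = {name: looks_like_church(name) for name in names}
```

The rule correctly skips the radio station, because "Church" sits in the middle of that name, but it fires on both Church & Dwight Co. and Thomas Carton Church. Tightening the pattern does not remove the false positives; it only changes which names trigger them.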
When businesses without categories pop up in our data, they typically get categories assigned quickly since we have a dynamic
categorization system in place that allows us to make changes at any time to rectify any problems. We add new taxonomy mappings every
month, which leads to both adding new categories and correcting existing errors. On top of that, we’re constantly adding new sources to
continually improve the quality of our places data. With new data and new category mappings, we assign meaningful categories to Factual
places on an ongoing basis.
Vape Star
The growing popularity of vaping has led to a rise in vape-associated businesses, e.g. retail stores and vape bars, where
people can gather to vape together. One such establishment, Vape Star, was initially mislabeled as providing home improvement services.
As mentioned above, one of the challenges in assigning categories is mapping the components of our sources’ taxonomies onto our own.
The problem for Vape Star is that one of the sources provided a category of “pipe and smoker”, which we had incorrectly mapped to our
home improvement node. This happened because, in many instances, “pipe” legitimately shows up in descriptions of home improvement
businesses. For example, businesses like Southland Pipe and
Star Pipe Products that sell supplies for pipe-specific construction are often described using just the words “pipe” or “pipes”, so you can see how “pipe and smoker” could be grouped with those other descriptions. In cases like this, we simply update that mapping in our system, and any businesses with this description get automatically assigned to the appropriate Factual category.
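A small sketch shows why a single mapping fix propagates to every affected place. It assumes that each place keeps its raw source label and that categories are derived from the mapping table, which is a simplification. The node names (in particular "Tobacco and Vape Shop"), the source labels attached to the two pipe suppliers, and the function `categorize` are all hypothetical.

```python
# Hypothetical mapping from source label to taxonomy node.
mapping = {
    "pipe": "Home Improvement",
    "pipes": "Home Improvement",
    "pipe and smoker": "Home Improvement",  # the incorrect mapping
}

# Hypothetical place records that keep the raw source label.
places = {
    "Vape Star": "pipe and smoker",
    "Southland Pipe": "pipe",
    "Star Pipe Products": "pipes",
}


def categorize(places, mapping):
    """Derive each place's category from its source label."""
    return {name: mapping.get(label) for name, label in places.items()}


before = categorize(places, mapping)

# Fix the one bad entry; the target node name is an assumption.
mapping["pipe and smoker"] = "Tobacco and Vape Shop"

after = categorize(places, mapping)
```

Only places carrying the "pipe and smoker" label change category after the update. The pipe suppliers keep their home improvement label because their own mappings were correct all along.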
Assigning the correct categories to places is deceptively tricky, but it is exactly the type of problem that our engineers love to take
on at Factual. We embrace the challenge and devote a considerable amount of time and resources to ensuring that our Global Places
data continues to have the most accurate and comprehensive category coverage possible.