We use geohashes all the time at Factual, so one day I got curious and read through the canonical Java implementation to figure out exactly how they work. It was an enlightening read, but along the way I encountered some unfortunate bits of coding horror that got me wondering about the fastest way to encode and decode geohashes.
The geohash format
A geohash is a series of bits that repeatedly bisects a search space of latitudes and longitudes. The first bit bisects the longitude, the next one the latitude, the next the longitude again, and so on. As a result, some geohashes specify square-ish (in spherical coordinates) regions, while others specify 2:1 rectangles. Because geohashes are fundamentally binary data, they’re often written in base-32. The alphabet is a little unusual in that it omits letters that might be confused with numbers (a, i, l, and o).
Like floating-point numbers, geohashes contain some error depending on how many bits they have. A geohash library often returns the range of latitude and longitude values you get when decoding, since geohash error is typically much larger than the underlying floating-point error. You can’t technically decode a geohash to a specific point, but you can usually approximate by taking the midpoint of the error box. Wikipedia has a great example that illustrates how this works.
Typical encoding strategy
Geohashing is basically storing the intermediate results of two binary searches. The basic idea looks something like this in code:
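A minimal sketch of that loop (the class and variable names here are illustrative, not from any particular library):

```java
// Bit-at-a-time geohash encoding: alternate binary searches over
// longitude and latitude, recording each comparison as one bit.
public class GeohashLoop {
  public static long encode(double lat, double lng, int bits) {
    double minLat = -90,  maxLat = 90;
    double minLng = -180, maxLng = 180;
    long result = 0;
    for (int i = 0; i < bits; i++) {
      if (i % 2 == 0) {                     // even bits bisect longitude
        double mid = (minLng + maxLng) / 2;
        if (lng >= mid) { result = result << 1 | 1; minLng = mid; }
        else            { result <<= 1;             maxLng = mid; }
      } else {                              // odd bits bisect latitude
        double mid = (minLat + maxLat) / 2;
        if (lat >= mid) { result = result << 1 | 1; minLat = mid; }
        else            { result <<= 1;             maxLat = mid; }
      }
    }
    return result;
  }
}
```

Encoding 42.605, -5.603 at 25 bits this way produces the bit pattern behind the base-32 string “ezs42”, the standard worked example.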
Decoding is similar; instead of bisecting based on midpoint comparisons, you just go backwards by setting either the min or max value to the midpoint based on the bits from the geohash.
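A corresponding decode sketch (again with illustrative names), returning the bounding box rather than a single point:

```java
// Bit-at-a-time geohash decoding: replay the bisections, narrowing
// either the min or the max toward the midpoint at each step.
public class GeohashDecodeLoop {
  // Returns {minLat, maxLat, minLng, maxLng} for a geohash of `bits` bits.
  public static double[] decode(long geohash, int bits) {
    double minLat = -90,  maxLat = 90;
    double minLng = -180, maxLng = 180;
    for (int i = 0; i < bits; i++) {
      long bit = geohash >>> (bits - 1 - i) & 1;   // consume MSB first
      if (i % 2 == 0) {
        double mid = (minLng + maxLng) / 2;
        if (bit == 1) minLng = mid; else maxLng = mid;
      } else {
        double mid = (minLat + maxLat) / 2;
        if (bit == 1) minLat = mid; else maxLat = mid;
      }
    }
    return new double[] {minLat, maxLat, minLng, maxLng};
  }
}
```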
Optimizing the algorithm
The code above is already pretty fast. We aren’t allocating any memory or making any method calls, so it’s a handful of primitive operations and conditionals per bit we want to encode. The only way to make it substantially faster is to get rid of the loop, which means finding a sublinear way to do both the bisections and the interleaving.
The latitude and longitude are in some sense bisected already since they’re specified as doubles. The only problem is that the mantissa bisects the range between two powers of two, not the limits of latitude or longitude. But we can change that easily enough with an add and a divide:
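That shift-and-scale step might look like this (the helper names are mine):

```java
// Map lat ∈ [-90, 90] and lng ∈ [-180, 180] onto [0, 1].
public class Normalize {
  public static double normLat(double lat) { return (lat + 90.0)  / 180.0; }
  public static double normLng(double lng) { return (lng + 180.0) / 360.0; }
}
```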
Now our numbers are both in the range [0, 1]. We can convert them to bits either by using Double.doubleToLongBits, or by multiplying and doing a cast:
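With the multiply-and-cast approach, each coordinate becomes a 31-bit fixed-point fraction, and each of those bits is exactly one midpoint comparison of the binary search. A sketch:

```java
// Truncate a normalized coordinate to a 31-bit fixed-point fraction.
public class ToBits {
  public static long bits(double normalized) {   // expects [0, 1)
    // The exact boundary value 1.0 would need clamping in real code.
    return (long) (normalized * 0x80000000L);    // low 31 bits used
  }
}
```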
At this point we have the two binary search results, each stored in the low 31 bits of the respective longs. (Technically 30 is all we need since most geohashes end up written in base-32.) Now we need to interleave them, also without looping.
We need a 64-bit Morton number, so there’s one more stage. Here’s the code I ended up with to gap the bits:
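This is the standard mask-and-shift bit-spreading technique; one common formulation (the post’s exact code may differ) is:

```java
public class Spread {
  // Spread the low 32 bits of x into the even bit positions of a long,
  // leaving a zero gap between consecutive source bits.
  public static long spread(long x) {
    x &= 0xffffffffL;
    x = (x | (x << 16)) & 0x0000ffff0000ffffL;
    x = (x | (x << 8))  & 0x00ff00ff00ff00ffL;
    x = (x | (x << 4))  & 0x0f0f0f0f0f0f0f0fL;
    x = (x | (x << 2))  & 0x3333333333333333L;
    x = (x | (x << 1))  & 0x5555555555555555L;
    return x;
  }
}
```

Each step halves the group size: first 16-bit halves are separated, then bytes, nibbles, bit pairs, and finally single bits.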
Then to finalize:
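One way to finish the interleave (the bit-spreading helper is repeated here so the sketch is self-contained): longitude takes the higher bit of each pair, since it was bisected first:

```java
public class Interleave {
  static long spread(long x) {     // same mask-and-shift step as above
    x &= 0xffffffffL;
    x = (x | (x << 16)) & 0x0000ffff0000ffffL;
    x = (x | (x << 8))  & 0x00ff00ff00ff00ffL;
    x = (x | (x << 4))  & 0x0f0f0f0f0f0f0f0fL;
    x = (x | (x << 2))  & 0x3333333333333333L;
    x = (x | (x << 1))  & 0x5555555555555555L;
    return x;
  }
  // Merge the two gapped values into one Morton number.
  public static long interleave(long lngBits, long latBits) {
    return spread(lngBits) << 1 | spread(latBits);
  }
}
```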
This result can then be right-shifted to get the desired precision.
Calculating error bounds (decoding)
I mentioned earlier that geohash libraries typically return bounding boxes when decoding. Normally we’d have the required information available after the decoding loop, but the loopless version returns results without actually constructing the lat/lng ranges. We could trivially store these ranges in a lookup table since there are only 64 levels of precision, but there’s a fun algorithmic solution that’s almost as fast.
The latitude and longitude errors are related: if #bits is even then longitude error = 2x latitude error, otherwise they’re equal. Given that, it suffices to calculate the latitude error first and then derive the longitude error from it. Only every other bit impacts the latitude error, so there are 32 distinct possibilities. Specifically, the latitude error is 180 * 2^-floor(#bits/2).
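A direct sketch of both error formulas (using Math.pow for clarity; helper names are mine):

```java
public class GeohashError {
  // Latitude error in degrees for an n-bit geohash: 180 * 2^-floor(n/2).
  public static double latError(int n) {
    return 180.0 * Math.pow(2, -(n / 2));        // integer division = floor
  }
  // Longitude error: twice the latitude error when n is even, equal otherwise.
  public static double lngError(int n) {
    return n % 2 == 0 ? 2 * latError(n) : latError(n);
  }
}
```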
Math.pow() is the obvious solution, but it’s boring and generally considered slow. Given that our exponent is always a 5-bit integer, we can use repeated squaring to come up with an exact answer very quickly:
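A repeated-squaring sketch for 2^-k with a 5-bit k (my formulation, not necessarily the post’s original code):

```java
public class Pow2 {
  // Square-and-multiply over the five bits of k, then one reciprocal.
  // Every quantity is an exact power of two, so there is no rounding.
  public static double pow2neg(int k) {
    double base = 2.0, result = 1.0;
    for (int i = 0; i < 5; i++) {
      if ((k >> i & 1) != 0) result *= base;
      base *= base;                    // 2, 4, 16, 256, 65536
    }
    return 1.0 / result;
  }
}
```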
This can be optimized a bit further by unrolling and saving a couple of multiplies:
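An unrolled variant along those lines, with the successive squares constant-folded so the multiplies that maintained the running base disappear:

```java
public class Pow2Unrolled {
  // Same square-and-multiply as the loop version, fully unrolled; the
  // constants are 2^1, 2^2, 2^4, 2^8, 2^16 for the five bits of k.
  public static double pow2neg(int k) {
    double r = 1.0;
    if ((k & 1)  != 0) r *= 2.0;
    if ((k & 2)  != 0) r *= 4.0;
    if ((k & 4)  != 0) r *= 16.0;
    if ((k & 8)  != 0) r *= 256.0;
    if ((k & 16) != 0) r *= 65536.0;
    return 1.0 / r;
  }
}
```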
Final notes: decoding base-32
A common strategy is to use a HashMap or similar to do reverse-lookups on characters. That’s how the reference implementation does it, for example. The problem is that Java’s generic data structures always work on boxed values, so every lookup requires allocating the boxed type, calculating its hashcode, doing a .equals() check, and then unboxing the result. Not only is this a lot of work at runtime, but the lookup structure itself is quite large due to the nontrivial per-object overhead imposed by the JVM.
Even though it seems wasteful, a primitive array is a much more efficient way to represent the reverse lookup table:
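Such a table might be built like this (the alphabet string is the standard geohash base-32 alphabet; the class shape is mine):

```java
public class Base32 {
  static final String ALPHABET = "0123456789bcdefghjkmnpqrstuvwxyz";
  // 128-entry table indexed directly by character code; -1 marks
  // characters outside the alphabet. Mostly empty, but tiny and
  // allocation-free to query.
  static final int[] DECODE = new int[128];
  static {
    java.util.Arrays.fill(DECODE, -1);
    for (int i = 0; i < ALPHABET.length(); i++)
      DECODE[ALPHABET.charAt(i)] = i;
  }
}
```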
Check out the code on GitHub
This code is available in the Factual beercode-open repository, which is a collection of libraries whose correctness is informally guaranteed by beer. And if Morton codes and repeated squaring are your thing, you might be interested in working with us.
- Spencer “keeping it primitive” Tipping, Software Engineer