Data Pipeline Performance

If data scientists had their way, we would buy petabytes of memory and never think about IO performance or serialization ever again. Such is the influence of accountants, however, that few companies have realized this otherwise brilliant strategy; instead, whether the workflow is written in Hadoop or in bash, it usually amounts to “stream data in, process a little bit, stream data out.” And this means that performance isn’t purely a matter of profiling code; it’s going to depend on stuff we don’t control like gzip and grep.

Fast Indirect Sorting in Java

Fast indirect sorting in Java
I was recently writing some performance-sensitive code in which I had a double
array of distances (one per element), and I wanted to get a list of elements
sorted by distance:

Bug Du Jour: CDH5 Upgrade

We upgraded our Hadoop cluster to YARN/CDH5 last weekend, which brought along the usual flurry of “oops, gotta fix this” commits as various services had hiccups, and in many cases refused altogether to do anything useful. Last week Tom sent me my favorite message: “I just want this to work” (seriously, it’s awesome to get these because you never know what kind of bug it is).

How Geohashes Work

We use geohashes all the time at Factual, so one day I got curious and read through the canonical Java implementation to figure out exactly how they work. It was an enlightening read, but along the way I encountered some unfortunate bits of coding horror that got me wondering about the fastest way to encode and decode geohashes.