If data scientists had their way, we would buy petabytes of memory and never think about IO performance or serialization ever again. Such is the influence of accountants, however, that few companies have realized this otherwise brilliant strategy; instead, whether the workflow is written in Hadoop or in bash, it usually amounts to “stream data in, process a little bit, stream data out.” And this means that performance isn’t purely a matter of profiling code; it also depends on things we don’t control, like gzip and grep.
Fast indirect sorting in Java
I was recently writing some performance-sensitive code in which I had a double
array of distances (one per element), and I wanted to get a list of elements
sorted by distance:
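The code sample that originally followed this sentence did not survive. As a hedged sketch of the straightforward (if allocation-heavy) starting point, one common way to get an indirect sort in Java is to sort an array of boxed indices with a comparator over the distances; the names below are illustrative assumptions, not the post's originals:

```java
import java.util.Arrays;
import java.util.Comparator;

public class IndirectSort {
    // Returns element indices ordered by ascending distance.
    // Note: boxing every index allocates an Integer per element,
    // which is exactly the kind of overhead a fast version would avoid.
    static Integer[] sortedByDistance(double[] distance) {
        Integer[] order = new Integer[distance.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> distance[i]));
        return order;
    }

    public static void main(String[] args) {
        double[] d = {3.5, 1.2, 2.8};
        // Element 1 is nearest (1.2), then 2 (2.8), then 0 (3.5).
        System.out.println(Arrays.toString(sortedByDistance(d)));  // [1, 2, 0]
    }
}
```

`Arrays.sort` has no overload that sorts a primitive `int[]` by an external key, which is why the indices must be boxed here; avoiding that boxing is what makes a faster indirect sort interesting.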
Note: For an in-depth, technical look into the research process, please refer to the lab notes companion to this post.
Note: This is a companion post to Investigating Low Quality Location Data #2 – Suspicious Activity Over Greenland
Note: This article has also been published in GeoMarketing.
Note: This is a companion post to Investigating Various Pathologies of Low Quality Location Data #1 – App Permissions
We upgraded our Hadoop cluster to YARN/CDH5 last weekend, which brought along the usual flurry of “oops, gotta fix this” commits as various services had hiccups, and in many cases refused altogether to do anything useful. Last week Tom sent me my favorite message: “I just want this to work” (seriously, it’s awesome to get these because you never know what kind of bug it is).
In this series of blog posts, Factual’s blog team asked engineers to describe what a typical day looks like.
We use geohashes all the time at Factual, so one day I got curious and read through the canonical Java implementation to figure out exactly how they work. It was an enlightening read, but along the way I encountered some unfortunate bits of coding horror that got me wondering about the fastest way to encode and decode geohashes.
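For background on what encoding involves, here is a hedged from-scratch sketch of the standard geohash algorithm (alternating binary subdivision of longitude and latitude, five bits per base32 character); it is an illustration of the technique, not the canonical Java implementation the post refers to:

```java
public class Geohash {
    // Geohash's base32 alphabet omits a, i, l, and o.
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    // Encodes a lat/lon pair to a geohash of the given length by
    // alternately bisecting the longitude and latitude ranges.
    static String encode(double lat, double lon, int length) {
        double minLat = -90, maxLat = 90, minLon = -180, maxLon = 180;
        StringBuilder hash = new StringBuilder();
        boolean evenBit = true;  // bits interleave, starting with longitude
        int bit = 0, ch = 0;
        while (hash.length() < length) {
            if (evenBit) {
                double mid = (minLon + maxLon) / 2;
                if (lon >= mid) { ch = (ch << 1) | 1; minLon = mid; }
                else            { ch = ch << 1;       maxLon = mid; }
            } else {
                double mid = (minLat + maxLat) / 2;
                if (lat >= mid) { ch = (ch << 1) | 1; minLat = mid; }
                else            { ch = ch << 1;       maxLat = mid; }
            }
            evenBit = !evenBit;
            if (++bit == 5) {  // five bits make one base32 character
                hash.append(BASE32.charAt(ch));
                bit = 0;
                ch = 0;
            }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        // The widely used reference example for this point:
        System.out.println(encode(57.64911, 10.40744, 11));  // u4pruydqqvj
    }
}
```

Decoding runs the same subdivision in reverse, narrowing the ranges bit by bit, which is part of why naive implementations are slow: every character costs a chain of branches and floating-point midpoints.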