Factual Blog / Tagged: Hadoop

Bug Du Jour: CDH5 Upgrade

We upgraded our Hadoop cluster to YARN/CDH5 last weekend, which brought along the usual flurry of “oops, gotta fix this” commits as various services had hiccups, and in many cases refused altogether to do anything useful. Last week Tom sent me my favorite message: “I just want this to work” (seriously, it’s awesome to get these because you...

How Factual Uses Persistent Storage For Its Real-Time Services

As part of Factual’s Geopulse product suite, we need to be able to absorb and process large amounts of data, and deliver back a somewhat smaller amount of data. There are many tools available for the processing stage, but far fewer for intake and delivery. Today, we’re open sourcing two libraries that...

Profiling Hadoop Jobs With Riemann

Factual processes nontrivial amounts of data. Our analyses may range over 10^11 records, reading hundreds of gigabytes to hundreds of terabytes of source data and intermediate representations. At this scale, performance optimizations can save us significant time and money. We use VisualVM, jhat, and YourKit for memory and CPU profiling, and the excellent Criterium for microbenchmarks in...
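
The excerpt only name-drops the tooling, but for readers who haven’t seen Criterium, a minimal usage sketch looks roughly like the following (the benchmarked expression is a toy stand-in, not anything from the post):

    (require '[criterium.core :refer [quick-bench]])

    ;; quick-bench runs the expression enough times to produce a
    ;; statistically meaningful estimate of its execution time
    (quick-bench (reduce + (range 10000)))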

Clojure on Hadoop: A New Hope

Factual’s U.S. Places dataset is built from tens of billions of signals. Our raw data is stored in HDFS and processed using Hadoop. We’re big fans of the core Hadoop stack; however, there is a dark side to using Hadoop. The traditional approach to building and running Hadoop jobs can be cumbersome. As our Director of...

Practical Hadoop Streaming – Dealing with Brittle Code

It is a universal truth that our own code is perfect, and that had we simply written every line of our project ourselves, every library included, none of this would be happening. Let’s say that you’ve decided you’re ready to take matters into your own hands, but for some reason, your employer just isn’t into paying you to...
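
The excerpt cuts off before any detail, but the general pattern the title points at is a streaming mapper that doesn’t die when third-party code throws. Purely as an illustration (the namespace and parse-record are hypothetical, not taken from the post), a defensive Clojure mapper might look like:

    (ns streaming.mapper
      (:require [clojure.string :as str]))

    (defn parse-record [line]
      ;; stand-in for the brittle library call
      (str/upper-case line))

    (defn -main [& _]
      (doseq [line (line-seq (java.io.BufferedReader. *in*))]
        (try
          (println (str (parse-record line) "\t" 1))
          (catch Exception _
            ;; count the bad record via Hadoop Streaming's stderr counter
            ;; protocol instead of letting one bad line kill the whole task
            (binding [*out* *err*]
              (println "reporter:counter:mapper,bad-records,1"))))))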