Note: nfu is available on GitHub.
We often use the UNIX command line for ad-hoc data crunching. Most of the time we have the good sense to use a better tool after the first 100 characters or so, but sometimes we’ll just blow past the right margin with a string of sort, uniq -c, sort -nr, cut -f1, and other “glue” commands. To make this easier, I decided to bundle a bunch of common ones up into a Perl script called nfu.
The idea behind nfu is to save as much command-line real estate as possible for simple data analysis. It's designed to wrap or replace a bunch of filter processes like sort, uniq, and in many cases awk and perl, by providing a series of composable operators that work on rows of whitespace-delimited columns of text. For example, two such operators are "sum" and "delta":
$ seq 4 | nfu -s        # or nfu --sum
1
3
6
10
$ seq 4 | nfu -d        # or nfu --delta
1
1
1
1
$
Operators compose by juxtaposition, so -ss applies the running sum twice:
$ seq 4 | nfu -ss
1
4
10
20
$
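To make the semantics concrete, here is a rough awk equivalent of the two operators. It is inferred only from the outputs shown above, it handles a single numeric column, and the function names running_sum and delta are made up for this sketch; they are not part of nfu.

```shell
# Hypothetical helpers, not part of nfu: single-column approximations
# inferred from the example outputs above.

# Running sum: each output row is the total of all rows seen so far.
running_sum() {
  awk '{ total += $1; print total }'
}

# Delta: each output row is the difference from the previous row
# (the first row is its difference from zero).
delta() {
  awk '{ print $1 - prev; prev = $1 }'
}

seq 4 | running_sum                 # 1 3 6 10, one value per line
seq 4 | delta                       # 1 1 1 1
seq 4 | running_sum | running_sum   # 1 4 10 20, like nfu -ss
```

Piping running_sum into itself mirrors what juxtaposing two -s operators does in the example.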
These operators are especially useful in combination with things like histograms, which you can construct using the “group” and “count” operators. For example, here’s how you might build a histogram of English letter pairings:
$ perl -ne 'print "$1$2\n" while s/^(.)(.)/$2/' < /usr/share/dict/words | grep '\w\w' > pairings
$ nfu -gcO < pairings | head -n5
16982 in
15871 er
13647 es
10556 ti
10400 on
$
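For comparison, the same kind of histogram can be built from standard tools alone. This is a minimal sketch; the function name histogram and the tiny sample input are made up for illustration.

```shell
# Hypothetical helper, not part of nfu: count identical lines and list
# the most frequent first, using only standard tools.
histogram() {
  sort | uniq -c | sort -rn
}

# Six made-up pairings; "in" appears three times, "er" twice, "es" once.
printf '%s\n' in er in es in er | histogram
```

Note that uniq -c pads the counts with leading spaces, so the output is aligned slightly differently from the nfu listing.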
nfu hasn’t saved much effort yet since we could have typed sort | uniq -c | sort -rn easily enough. The real wins happen when we want to do things like log-scaling, calculating cumulative totals, or plotting the data:
$ nfu -gcOl < pairings               # log-scale each number
$ nfu -gcOs < pairings               # cumulative total
$ nfu -gcOp 'with lines' < pairings  # show data with gnuplot
And more usefully, combining some of these features:
$ nfu -gcOsp 'title "pair frequency" with lines' < pairings
At this point we can see that just under 20% of the pairings account for about 80% of the occurrences. Let’s take a closer look at the long tail by dropping the first 200 points and log-scaling the rest:
$ nfu -gcOS200,0lp 'with lines' < pairings
All of the data fields are preserved, so you can easily go back and look at the original letter pairings. Here are some of the least-common lowercase ones, for example:
$ nfu -gcO < pairings | grep -v '[A-Z]' | tail -n10
1 zg
1 zd
1 xv
1 vg
1 qb
1 mk
1 jr
1 gq
1 fw
1 cb
$
And we can easily find the words corresponding to these pairings by using the second column to form a pattern for grep. We can extract this column using -f and specifying the index 1 (fields are zero-indexed):
$ egrep "$(nfu -gcOf1 < pairings | grep -v '[A-Z]' | tail -n10 | paste -sd'|')" /usr/share/dict/words
Chongqing
Fitzgerald
Gujranwala
Iqbal
Knoxville
Macbeth
Mazda
Novgorod
Potemkin
halfway
$
nfu also supports commands for generating and processing noisy sample-based data. For example, a couple of days ago I started logging my battery level using polling (-P):
$ nfu -P 1 'cat /sys/class/power/BAT/energy_now' > battery-log &
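Conceptually, polling is just a loop that runs a command and sleeps. Here is a minimal sketch of that idea in plain shell. The poll function and its COUNT argument are made up for illustration (nfu -P takes the command as a single string and keeps running), and this is not how nfu itself is implemented.

```shell
# Hypothetical helper, not part of nfu: run a command repeatedly,
# pausing between runs. Usage: poll INTERVAL COUNT COMMAND [ARGS...]
poll() {
  interval=$1
  count=$2
  shift 2
  i=0
  while [ "$i" -lt "$count" ]; do
    "$@"                 # one sample per iteration
    i=$((i + 1))
    sleep "$interval"
  done
}

# Two timestamps, one second apart; a real logger would loop until killed.
poll 1 2 date +%s
```

The bounded count only exists to keep the example finite.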
So every second, nfu runs cat and adds another data point to the log file. Here’s the unfiltered data:
$ nfu -p 'with lines' < battery-log
There isn’t much discontinuity, so let’s look at the deltas to measure charge/discharge rate:
$ nfu -dp 'with lines' < battery-log
This spike is from the computer being on standby, causing an unsampled duration of charging. We could clip the spike, but we’d be losing information. It’s easier to take a sliding average to spread it out. We also need to remove the first data point after delta-transforming, since it will be encoded as its delta from zero (and will therefore look like another spike):
$ nfu -dS1,0a1000p 'with lines' < battery-log
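The pipeline above (delta, drop the first point, sliding average) can be sketched with awk and tail. This assumes a trailing window that averages whatever values are available until the window fills; nfu's own windowing may differ. The function name sliding_average and the sample numbers are made up for illustration.

```shell
# Hypothetical helper, not part of nfu: mean of the last N values
# (fewer while the window is still filling). Usage: sliding_average N
sliding_average() {
  awk -v n="$1" '{
    slot = NR % n            # circular buffer index
    sum += $1 - buf[slot]    # add the new value, drop the one it replaces
    buf[slot] = $1
    count = (NR < n) ? NR : n
    print sum / count
  }'
}

# Deltas of a tiny made-up series, first delta dropped, window of 2.
printf '%s\n' 10 12 15 19 30 \
  | awk '{ print $1 - prev; prev = $1 }' \
  | tail -n +2 \
  | sliding_average 2
```

With these inputs the deltas after dropping the first are 2, 3, 4, 11, and the jump to 11 is spread across the window rather than showing up as a single spike.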
We can use the eval command to calculate the relative amounts of time spent on AC and battery power:
$ nfu -dS1,0e '%0 < 0 ? "BAT" : "AC"' -gc < battery-log
83464 AC
35214 BAT
$
You can also use eval to filter values by returning an empty list. Here’s how you might reduce the dataset to battery samples only, calculating the running-average (-a0) discharge rate:
$ nfu -dS1,0e '%0 < 0 ? %0 : ()' -a0p 'with lines' < battery-log
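The filter-then-average step can be approximated in a single awk rule: the pattern plays the role of returning an empty list for unwanted rows, and the action keeps a running mean. This is a sketch under that reading of -a0, and the function name discharge_running_average and the sample deltas are made up for illustration.

```shell
# Hypothetical helper, not part of nfu: keep only negative values and
# print the mean of everything kept so far.
discharge_running_average() {
  awk '$1 < 0 { sum += $1; kept++; print sum / kept }'
}

# Made-up deltas: positive rows (charging) are dropped entirely.
printf '%s\n' -4 3 -2 5 -6 | discharge_running_average
```

Only three rows come out for the five that go in, since the two positive deltas are filtered away before averaging.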
nfu isn’t a great piece of architecture or design. It’s just one of those scripts that saves a little bit of time here and there, and is (we hope) useful to have around. Grab the code and let us know what you think! Feature requests and bug reports are welcome as always.
– Spencer Tipping, Software Engineer @ Factual