Data Pipeline Performance

If data scientists had their way, we would buy petabytes of memory and never think about IO performance or serialization ever again. Such is the influence of accountants, however, that few companies have realized this otherwise brilliant strategy; instead, whether the workflow is written in Hadoop or in bash, it usually amounts to “stream data in, process a little bit, stream data out.” And this means that performance isn’t purely a matter of profiling code; it’s going to depend on stuff we don’t control like gzip and grep.

This post is about some experiments I ran to build a better intuition about pipeline performance, and which later informed the design of ni and other tools I use at Factual.

Before we get going: some UNIX measurement tools

Two common and absolutely essential utilities are pv and units.

pv is cat, but it gives you a live indicator of how fast data is moving, and your progress/ETA if you use it on a file. For example:

$ pv data.gz | gzip -dc | command...
105MiB 0:00:02 [53.7MiB/s] [=>                                ]  8% ETA 0:00:21

This functionality is so useful that I’ve written it into every pipeline constructor I use: nfu has a throughput monitor, and its successor ni takes the concept further with throughput and data pressure.

units is a command-line calculator that can do unit conversions (like the ones Google supports). For instance:

$ units -t '1TB / 5Mbps' weeks
2.6455026
$ units -t '300W * 0.17USD / (kW * hour)' USD/month
37.254704
$ units -t '1826231800238 bytes' GB
1826.2318

These tools are great because they make it easy to think about units of throughput, rather than just time – and that’s exactly the type of intuition required when you’re working on large datasets.

When time and throughput are different: fast and slow languages

A “slow language” usually means a language with a high interpreter overhead – Ruby and Perl sink to the bottom of the benchmarks game for this reason. But that doesn’t mean these languages will be slow for text-processing or similar use cases; ni, for example, is written in Perl and can process text at 800MB/s with multithreading; Ruby could easily do the same. The key is that these languages provide specialized, efficient vectorized operations that outperform obvious native implementations.

For example, I wrote three line counting programs, one in C and two in Perl:

// C version: read straight into a buffer and count newlines
#include <unistd.h>
#include <stdio.h>
int main() {
  char buf[65536];
  ssize_t n;
  long lines = 0;
  while ((n = read(0, buf, sizeof(buf))) > 0)
    for (long i = 0; i < n; ++i)
      lines += buf[i] == '\n';
  printf("%ld\n", lines);
}

Here’s wcl1.pl, the obvious version:

#!/usr/bin/env perl
my $lines = 0;
++$lines while <STDIN>;
print "$lines\n";

And here’s wcl2.pl, which doesn’t reallocate any memory:

#!/usr/bin/env perl
my $lines = 0;
my $fh    = \*STDIN;
$_ = "\0" x 65536;
while (sysread $fh, $_, 65536) {
  ++$lines while /\n/g;
}
print "$lines\n";

I then benchmarked their throughput using a high-speed data source:

# our data source: ./produce-numbers runs forever, so just take the first 4GiB
# using dd, which also conveniently provides timing/throughput info as shown
# below.
$ source-4gb() { ./produce-numbers | dd bs=32K count=131072 iflag=fullblock; }

# data source limit performance (using an explicit "cat" process is necessary
# here; otherwise the kernel optimizes the write() calls by bypassing the memory
# copy altogether)
$ source-4gb | cat > /dev/null
4294967296 bytes (4.3 GB) copied, 1.68788 s, 2.5 GB/s

# c version
$ gcc wcl.c -o wcl
$ source-4gb | ./wcl
4294967296 bytes (4.3 GB) copied, 13.2817 s, 323 MB/s
2386265

# perl version 1
$ source-4gb | perl wcl1.pl
4294967296 bytes (4.3 GB) copied, 6.03566 s, 712 MB/s
2386332

# perl version 2
$ source-4gb | perl wcl2.pl
4294967296 bytes (4.3 GB) copied, 3.8689 s, 1.1 GB/s
2384503

There’s nothing up my sleeve here: if you run the above programs, you should see Perl count lines about 3.5x as fast as C. Compiling with -O3 mostly removes the CPU bottleneck, and the C version then outperforms Perl as we’d expect:

$ gcc wcl.c -O3 -o wcl-O3
$ source-4gb | ./wcl-O3
4294967296 bytes (4.3 GB) copied, 2.98875 s, 1.4 GB/s
2386583

And, of course, the real wc -l is faster still, saturating the data source:

$ source-4gb | wc -l
4294967296 bytes (4.3 GB) copied, 1.71074 s, 2.5 GB/s
2387565
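
A note on the data source: ./produce-numbers isn’t listed in this post. A rough stand-in, assuming it just writes long lines of space-separated pseudorandom integers to stdout forever, might look like the sketch below (the real tool is presumably much faster); ./produce-newlines, used later, would be the analogous program that emits only newline bytes.

// Hypothetical stand-in for ./produce-numbers: emit long lines of
// space-separated pseudorandom integers forever.  Not the original tool;
// the line counts in the transcripts suggest lines averaging roughly 1.8KB.
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  for (;;) {
    for (int i = 0; i < 200; ++i)   // ~200 numbers => lines of roughly 2KB
      printf("%ld ", random());
    putchar('\n');
  }
}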

What’s going on here

C is a “fast language” in that it has no interpreter overhead, but the CPU itself can only move data so quickly. Let’s take a look at the assembly for wcl.c compiled at the default optimization level and at -O3:

// Slow version (below is one loop iteration):
.L4:
        movl    -65572(%rbp), %eax        // load i into %rax
        cltq
        movzbl  -65552(%rbp,%rax), %eax   // load buf[i]
        cmpb    $10, %al                  // compare to '\n'
        sete    %al                       // if equal, %al = 1
        movzbl  %al, %eax
        addq    %rax, -65568(%rbp)        // add to line count
        addl    $1, -65572(%rbp)          // ++i
.L3:
        movl    -65572(%rbp), %eax        // load i from stack
        cltq
        cmpq    -65560(%rbp), %rax        // i < n
        jl      .L4                       // next iteration

It makes sense that this would run slowly: each byte of input data becomes 12 CPU instructions. To get the data rate we observed of ~320MB/sec, the CPU’s throughput would have to be nearly 4Ginsns/sec, which is about right given that modern CPUs amortize some instructions to 1/2 or 1/3 of a clock cycle.

When we compile with -O3, the picture changes completely. gcc vectorizes the loop to process 16 bytes per iteration using SSE instructions:

.L3:
        movq    %rcx, %rdi                // vectorized loop index
        pxor    %xmm7, %xmm7
        salq    $4, %rdi                  // shift left by 4 = multiply by 16
        addq    $1, %rcx                  // ++vectorized_i
        movdqa  (%rsp,%rdi), %xmm0        // vector load into %xmm0
        cmpq    %rcx, %rdx                // check for end
        pcmpeqb .LC0(%rip), %xmm0         // vectorized compare bytes == 10
        pand    .LC1(%rip), %xmm0         // vectorized change 255 to 1

        movdqa  %xmm0, %xmm1              // BEGIN mystery byte-adding logic
        punpckhbw       %xmm7, %xmm0      // (various shuffles to collect the
        punpcklbw       %xmm7, %xmm1      //  one-bytes from %xmm0)
        movdqa  %xmm0, %xmm4              //
        movdqa  %xmm1, %xmm6              //
        punpckhwd       %xmm5, %xmm1      //
        punpcklwd       %xmm5, %xmm4      //
        punpcklwd       %xmm5, %xmm6      //
        punpckhwd       %xmm5, %xmm0      //
        movdqa  %xmm6, %xmm7              //
        punpckhdq       %xmm2, %xmm6      //
        punpckldq       %xmm2, %xmm7      //
        paddq   %xmm7, %xmm3              //
        paddq   %xmm6, %xmm3              //
        movdqa  %xmm1, %xmm6              //
        punpckhdq       %xmm2, %xmm1      //
        punpckldq       %xmm2, %xmm6      //
        paddq   %xmm6, %xmm3              //
        paddq   %xmm3, %xmm1              //
        movdqa  %xmm4, %xmm3              //
        punpckhdq       %xmm2, %xmm4      //
        punpckldq       %xmm2, %xmm3      //
        paddq   %xmm3, %xmm1              //
        movdqa  %xmm0, %xmm3              //
        paddq   %xmm4, %xmm1              //
        punpckhdq       %xmm2, %xmm0      //
        punpckldq       %xmm2, %xmm3      //
        paddq   %xmm3, %xmm1              //
        paddq   %xmm0, %xmm1              //
        movdqa  %xmm1, %xmm3              // END mystery byte-adding logic

        ja      .L3                       // next vectorized iteration

This implementation uses about 38 instructions per 16 bytes of input (roughly 2.4 instructions per byte), which is about five times as efficient as before.
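
To make the strategy easier to see, here’s roughly the same idea expressed with SSE2 intrinsics. This is a sketch of the technique rather than a transcription of what gcc emits; the function name and structure are mine:

// Sketch: count newlines 16 bytes at a time with SSE2 intrinsics.
#include <emmintrin.h>
#include <stddef.h>

long count_newlines_sse2(const unsigned char *buf, size_t n) {
  const __m128i newline = _mm_set1_epi8('\n');
  const __m128i one     = _mm_set1_epi8(1);
  const __m128i zero    = _mm_setzero_si128();
  long lines = 0;
  size_t i = 0;
  for (; i + 16 <= n; i += 16) {
    __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
    __m128i eq    = _mm_cmpeq_epi8(chunk, newline);  // 0xFF where byte == '\n'
    __m128i ones  = _mm_and_si128(eq, one);          // 0x01 where byte == '\n'
    __m128i sums  = _mm_sad_epu8(ones, zero);        // two 64-bit partial sums
    lines += _mm_cvtsi128_si64(sums) + _mm_extract_epi16(sums, 4);
  }
  for (; i < n; ++i)                                 // scalar tail
    lines += buf[i] == '\n';
  return lines;
}

gcc’s generated code takes a different route to the horizontal sum: the punpck* shuffles widen the 0/1 bytes up to 64-bit lanes and paddq accumulates them, which appears to be what the “mystery byte-adding logic” above is doing.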

Ok, but what’s going on with wc -l? The source provides a hint:

/* memchr is more efficient with longer lines.  */
while ((p = memchr (p, '\n', end - p)))
  {
    ++p;
    ++lines;
  }

Intuitively this makes sense: function call overhead dictates that the more work memchr can do internally, the faster this will go. But memchr itself is still outperforming the -O3 byte loop, by nearly 2x in the runs above, and likely by more since wc -l is saturating the data source. Here’s how it works:

while (n >= sizeof (longword))
  {
    longword longword1 = *longword_ptr ^ repeated_c;
    if ((((longword1 - repeated_one) & ~longword1)
         & (repeated_one << 7)) != 0)
      break;
    longword_ptr++;
    n -= sizeof (longword);
  }
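
That bit expression is the classic “does this word contain a zero byte?” test: XORing with repeated_c zeroes exactly the bytes equal to c, and the subtract-and-mask lights up the high bit of any byte that became zero. As a standalone illustration (hypothetical helper, not glibc’s code):

#include <stdint.h>
#include <stdbool.h>

// True if any of the 8 bytes of w equals c, checked a whole word at a time.
static bool word_has_byte(uint64_t w, unsigned char c) {
  const uint64_t repeated_one = 0x0101010101010101ULL;
  uint64_t x = w ^ (repeated_one * c);   // bytes equal to c become 0x00
  return ((x - repeated_one) & ~x & (repeated_one << 7)) != 0;
}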

Like the -O3 version, it processes many bytes at a time, a full machine word per iteration; but this version, although more efficient for long lines, is optimized for cases where most of the time is spent searching (i.e. a hit on any given byte is unlikely). All of wc -l’s advantage is data-dependent, which becomes evident if we feed it just a bunch of newlines:

$ source-4gb-n() { ./produce-newlines | dd bs=32K count=131072 iflag=fullblock; }
$ source-4gb-n | cat > /dev/null
4294967296 bytes (4.3 GB) copied, 1.69366 s, 2.5 GB/s

$ source-4gb-n | ./wcl-O3
4294967296 bytes (4.3 GB) copied, 2.96402 s, 1.4 GB/s
4294967296

$ source-4gb-n | wc -l
4294967296 bytes (4.3 GB) copied, 29.0535 s, 148 MB/s
4294967296

# perl's performance drops into oblivion because now the interpreter overhead
# dominates:
$ source-4gb-n | perl wcl2.pl
4294967296 bytes (4.3 GB) copied, 424.342 s, 10.1 MB/s
4294967296

memchr is why Perl ends up being faster than the original C version: most of the data is pushed through the very high-throughput inner loop, and only a few bytes have to go the slow route through the Perl interpreter itself. Other aspects of overhead, like copying data into $_ and updating various internal variables, keep it from competing with -O3 even for long lines.
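
For comparison, here’s what that memchr delegation looks like as a standalone line counter; a sketch along the lines of wc -l, not Perl’s internals or GNU wc’s actual source:

// Sketch: count newlines by letting memchr do the scanning, as wc -l does.
#include <string.h>
#include <unistd.h>
#include <stdio.h>

int main(void) {
  char buf[65536];
  ssize_t n;
  long lines = 0;
  while ((n = read(0, buf, sizeof buf)) > 0) {
    const char *p   = buf;
    const char *end = buf + n;
    while ((p = memchr(p, '\n', end - p))) {   // vectorized scan inside libc
      ++p;
      ++lines;
    }
  }
  printf("%ld\n", lines);
  return 0;
}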

The Takeaway

In a throughput-bound world, what’s the right tool for the job? I haven’t found a particularly good way to predict it aside from understanding how things work and testing on specific data. The complicating factors are that (1) many tools have been optimized for a specific use case (e.g. Perl and text processing), and (2) many of the fastest algorithms are data-dependent and aren’t worst-case optimal. Agner Fog’s optimization guides illustrate the subtleties of even the hardware itself; as the infrastructure increases in complexity, optimization becomes an increasingly vertical pursuit.

In part 2 I’ll go into some lower-level aspects like input encoding and garbage collection, and part 3 will be an unbelievably smashing wrap-up of some kind. Stay tuned.
