A major new feature in the recent Drake 0.1.4 release is allowing the asynchronous execution of steps. Drake can now automatically parallelize the steps in your workflow based on the specified dependencies.
If you have steps in your workflow that have high latency, or you have different steps that use different resources (e.g. one step is CPU bound, another is disk bound, another network bound), then this feature may help you significantly speed up your workflow. For example, at Factual, we use this feature to run Hadoop jobs in parallel in order to make sure that our Hadoop cluster is utilized as close to 100% as possible during our data processing workflows.
You can specify the maximum number of simultaneously executing steps by using the –jobs parameter, e.g. drake –jobs 5 to run up to 5 steps simultaneously. By default, –jobs is set to 1, which is the same as synchronous execution.
Drake has strict rules to make sure that the order of execution is consistent with what is specified in the documentation, while at the same time making sure that as many steps are simultaneously run as possible.
For example, let’s look at the following simple Drake workflow in detail:
f1 <- echo -n "$OUTPUT" > $OUTPUT f2 <- f1 cat $INPUT > $OUTPUT f3 <- f2 cat $INPUT > $OUTPUT g1 <- echo -n "$OUTPUT" > $OUTPUT g2 <- g1 cat $INPUT > $OUTPUT g3 <- g2 cat $INPUT > $OUTPUT h1 <- echo -n "$OUTPUT" > $OUTPUT
Note that there are have two “dependency chains”: the f1->f2->f3 dependency chain, and the g1->g2->g3 dependency chain. These groups of steps have dependencies such that the steps within them must be run sequentially.
Running this workflow with –jobs 1 will run the steps in the order specified by the workflow file per Drake’s ordering rules.
Running with --jobs 2 will kick off the f1 and g1 steps in parallel. One thread will work through the f1->f2->f3 dependency chain while the other thread will work through the g1->g2->g3 dependency chain. The h1 step will run when one of these threads is finally finished. Note that the dependency tree doesn’t preclude running h1 first, but Drake runs it last per its ordering rules.
Running with –jobs 3 will kick off the f1, g1, and h1 steps in parallel. One thread will work on the f1->f2->f3 dependency chain, another thread will work on the g1->g2->g3 dependency chain, and the third thread will work on the h1 step. In this case, Drake kicks off h1 right away since Drake tries to run as many steps as possible in parallel.
A huge thank you to Guillaume Carbonneau for doing the majority of the implementation for this feature!
Myron Ahn, Software Engineer, Factual