We released Drake (“Data workflow tool, like a ‘Make for data’”) two years ago. The impetus behind Drake was the classic “scratch your own itch”. We craved a text-based data workflow tool that would make it easier to build and manage our data processing tasks.
We open sourced Drake in the hope that others would find it useful and maybe even be willing to help. We were delighted when the original blog post attracted attention. A fair number of relevant bug reports and feature requests flowed in. “Are you awash in fame and fortune?” my boss, Boris, asked me. “No,” I replied, “we’re awash in GitHub tickets.” Most exciting of all, over the past two years we’ve seen substantial bug fixes and contributions to the project from outside Factual. These include S3 support and a Clojure front end (thanks, all!).
Early on we applied Drake internally with almost reckless abandon, trying it out on just about any project that looked even remotely like a data workflow. Over time, however, we developed a better feel for the strengths and weaknesses of Drake’s design choices.
Drake makes it refreshingly easy to spin up proofs of concept. We like to use Drake to express evolving data workflows that will be shared across a wide team. Drake is especially useful for tying together heterogeneous scripts and command line calls, thereby taming what would otherwise be a hodgepodge of tasks. We also like Drake for helping us manage chains of related Hadoop jobs.
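To make the “Make for data” idea concrete, here is a minimal Python sketch of the underlying pattern: a step wraps an arbitrary command line call and reruns only when its output is missing or older than its inputs. This is not Drake’s implementation or its workflow syntax; the names `is_stale` and `run_step` are hypothetical and exist only for illustration.

```python
import os
import subprocess
import sys
import tempfile


def is_stale(inputs, output):
    """Return True if output is missing or older than any of its inputs."""
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(path) > out_mtime for path in inputs)


def run_step(inputs, output, command):
    """Run command (an argv list) only if output is stale.

    Returns True if the command ran and False if the step was skipped.
    The command can be any executable, which is what lets one workflow
    tie together scripts written in different languages.
    """
    if not is_stale(inputs, output):
        return False
    subprocess.run(command, check=True)
    return True
```

Calling `run_step` twice in a row with unchanged inputs runs the command once and skips it the second time, which is the behavior that makes re-running a whole workflow cheap.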
On the other hand, we’ve found Drake is not a great fit if you just need to glue together some existing code written in a single language. That can usually be done more simply by staying within the borders of that language, doubly so if you plan to share your work with others who don’t already know and understand Drake workflows.
Also, Drake does not solve every problem in the data workflow space. A leading example is the management of long-running tasks. For tasks that take a long time and do a large amount of work, you often want features like resumability and trackability. At Factual, we have a custom-built task management service called Vineyard that addresses these and similar issues for us. We have glue that allows Vineyard and Drake to work together in various ways, but out of the box Drake doesn’t offer these kinds of features for long-running tasks.
Earlier this year Factual welcomed Clojure aficionado Alan Malloy to our engineering ranks. Alan showed interest in Drake and invested time and expertise in maintaining the codebase and responding to community requests. This was no surprise given Alan’s Clojure chops and generous willingness to help people. We invited Alan to become the primary owner of Drake’s codebase, and we were thrilled that he accepted.
We hope that Drake’s future is bright and that the project continues to evolve to better serve users. We’re encouraged by the bit of traction that the project has seen so far. I like to think of Drake as “barely famous”: Drake was given its very own chapter in the recently published “Data Science at the Command Line”, and Artem’s YouTube tutorial for Drake has been viewed over 5,000 times. Or, as Artem puts it: “Since launch, people spent cumulative 639.8 hours watching Drake tutorial on Youtube, which is not Apache Hadoop, of course, but still pretty neat. :)”.
If you’re a current user of Drake, we hope you’ll let us know what you think and tell us what’s missing. If you’ve never used Drake but have always wanted a ‘Make for data’, we hope you’ll give it a go.
And if you’ve ever filed a ticket or, even better, sent us a pull request… Thank you!
Yours, Aaron Crow, Factual engineer and Drake contributor