As part of Factual’s Geopulse product suite, we need to be able to absorb and process large amounts of data, and deliver back a somewhat smaller amount of data. There is a significant amount of technology available for the processing stage, but far less for intake and delivery. Today, we’re open sourcing two libraries that we’ve used for these purposes, s3-journal and riffle. Both of these libraries are notable for making efficient use of persistent storage by avoiding random writes, which will be discussed in more detail later in this post.
Most data structures are designed to hold arbitrary amounts of data. When we talk about their complexity in time and space, we use big O notation, which is only concerned with performance characteristics as n grows arbitrarily large. Understanding how to cast an O(n) problem as O(log n) or even O(1) is certainly valuable, and necessary for much of the work we do at Factual. And yet, most instances of data structures used in non-numerical software are very small. Most lists are tuples of a few entries, and most maps are a few keys representing different facets of related data. These may be elements in a much larger collection, but this still means that the majority of operations we perform are on small instances.
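To make the "small instances" point concrete, here is a minimal sketch of a map with only a few keys, stored as a flat tuple of pairs and searched linearly. The helper name `small_map_get` and the sample data are hypothetical, chosen only for illustration; this is not taken from any Factual library. The lookup is O(n), but n is the handful of facets in one record, not the size of the larger collection the record belongs to.

```python
from typing import Any, Hashable, Optional, Tuple

# Hypothetical representation: a "map" of a few facets as a tuple of (key, value) pairs.
Pairs = Tuple[Tuple[Hashable, Any], ...]


def small_map_get(pairs: Pairs, key: Hashable, default: Optional[Any] = None) -> Any:
    """Linear scan over the pairs; O(n) in the number of keys in this one record."""
    for k, v in pairs:
        if k == key:
            return v
    return default


# One small record; a real dataset might hold many of these in a larger collection.
point: Pairs = (("x", 1), ("y", 2), ("label", "origin"))

y_value = small_map_get(point, "y")
missing = small_map_get(point, "z", default=0)
```

Whether a scan like this actually beats a general-purpose hash map depends on the runtime and the data, so treat it as an illustration of the shape of the problem rather than a performance claim.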
Everyone’s motivations for attending a meetup are good: they want to educate and to learn, to nurture and grow the community, to meet people who share their enthusiasm. And yet, a typical meetup is at best weakly successful in all of these dimensions.