Tablesaw is like having your own personal column store: one that’s embedded in Java and easier to use than an R dataframe.
With Tablesaw, you can work with half a billion rows on a laptop. For a few dollars an hour on AWS, you can work with over 2 billion records, interactively, without Spark, Hadoop or any distributed infrastructure. Without even a relational database.
What I wanted for Tablesaw was the ease of Pandas and the performance of C. The biggest obstacle was memory. Primitives are far lighter than their equivalent objects, but they’re hard to use because many libraries auto-box them. Try sorting primitives using a comparator and you’ll see.
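To make the comparator problem concrete, here is a minimal sketch using only the JDK. It is not Tablesaw code, and the class and method names are made up for illustration. Arrays.sort accepts a Comparator only for object arrays, so sorting an int[] in a custom order (descending, here) means either boxing every value into an Integer or writing the primitive logic by hand.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration class; not part of Tablesaw.
final class PrimitiveSortSketch {

    // Boxed route: Arrays.sort takes a Comparator only for object arrays,
    // so each int is wrapped in an Integer before sorting.
    static int[] sortDescendingBoxed(int[] values) {
        Integer[] boxed = new Integer[values.length];
        for (int i = 0; i < values.length; i++) {
            boxed[i] = values[i]; // auto-boxing
        }
        Arrays.sort(boxed, Comparator.reverseOrder());
        int[] result = new int[values.length];
        for (int i = 0; i < boxed.length; i++) {
            result[i] = boxed[i]; // auto-unboxing
        }
        return result;
    }

    // Primitive route: sort a copy in ascending order, then reverse it in place.
    // No objects are allocated per element, but the ordering is hand-written.
    static int[] sortDescendingPrimitive(int[] values) {
        int[] result = values.clone();
        Arrays.sort(result);
        for (int lo = 0, hi = result.length - 1; lo < hi; lo++, hi--) {
            int tmp = result[lo];
            result[lo] = result[hi];
            result[hi] = tmp;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] data = {3, 1, 4, 1, 5, 9, 2, 6};
        System.out.println(Arrays.toString(sortDescendingBoxed(data)));
        System.out.println(Arrays.toString(sortDescendingPrimitive(data)));
    }
}
```

Both methods return the same ordering. The difference is that the boxed version allocates an object array with one reference per row, which is exactly the overhead that hurts at hundreds of millions of rows.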
Tablesaw avoids using non-primitives for data, and when that’s not possible (with Strings or dates, for example), it uses type-specific encoding schemes to minimize the footprint. Even primitives use type-specific compression: boolean columns, for example, are compressed bitmaps that use 1/8 the storage of primitive booleans, or about 1/32 the storage of Boolean objects. We can do this because the data is stored in columns, just as it is in advanced OLAP data stores like Redshift.
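The arithmetic behind the bitmap claim can be sketched with the JDK's java.util.BitSet. This is a stand-in, not Tablesaw's implementation: BitSet packs one bit per value but does not compress, and the class and method names below are hypothetical. On a typical HotSpot JVM, a boolean[] spends one byte per element and a reference to a Boolean usually takes four bytes, so one bit per row works out to roughly 1/8 and 1/32 of those, respectively.

```java
import java.util.BitSet;

// Hypothetical illustration class; not Tablesaw's actual boolean column.
final class BitmapBooleanColumnSketch {

    private final BitSet bits = new BitSet(); // one bit per row, uncompressed
    private int size = 0;                     // number of rows appended so far

    // Appends a value as the next row; only true values set a bit.
    void append(boolean value) {
        if (value) {
            bits.set(size);
        }
        size++;
    }

    // Returns the value stored at the given row.
    boolean get(int row) {
        if (row < 0 || row >= size) {
            throw new IndexOutOfBoundsException("row " + row);
        }
        return bits.get(row);
    }

    int size() {
        return size;
    }

    // Counts true values without touching rows one by one.
    int countTrue() {
        return bits.cardinality();
    }

    // Bytes needed to hold the given number of rows at one bit each, rounded up.
    static long packedBytes(long rows) {
        return (rows + 7) / 8;
    }

    public static void main(String[] args) {
        BitmapBooleanColumnSketch column = new BitmapBooleanColumnSketch();
        for (int i = 0; i < 1_000; i++) {
            column.append(i % 3 == 0);
        }
        System.out.println("rows: " + column.size());
        System.out.println("true values: " + column.countTrue());
        System.out.println("bytes at one bit per row: " + packedBytes(column.size()));
    }
}
```

Storing a column this way only works because all the values of one field sit together. In a row-oriented layout each boolean would be interleaved with the other fields of its record, leaving nothing contiguous to pack.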
Tablesaw is currently under active development, but I thought I had enough working to put the initial version on GitHub at https://github.com/lwhite1/tablesaw/. Look for more updates as the system is hardened and extended.