Introducing Tablesaw

Tablesaw is like having your own personal column store: one that’s embedded in Java and easier to use than an R dataframe.

With Tablesaw, you can work with half a billion rows on a laptop. For a few dollars an hour on AWS, you can work with over 2 billion records interactively, without Spark, Hadoop, or any other distributed infrastructure, and without even a relational database.

What I wanted for Tablesaw was the ease of Pandas and the performance of C. The biggest obstacle was memory. Primitives are far lighter than their equivalent objects, but they’re hard to use because many libraries auto-box them. Try sorting primitives using a comparator and you’ll see.
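To make the comparator problem concrete, here is a minimal sketch using only the JDK. The class and method names are made up for illustration and are not part of Tablesaw. `Arrays.sort` has no overload that accepts an `int[]` together with a `Comparator`, so the straightforward way to get a custom order is to box every value first.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration class; not part of Tablesaw.
public class PrimitiveSortSketch {

    // Custom order via a comparator: every int must first become an Integer object.
    public static int[] sortDescendingBoxed(int[] values) {
        Integer[] boxed = new Integer[values.length];
        for (int i = 0; i < values.length; i++) {
            boxed[i] = values[i];               // auto-boxing, one object per element
        }
        Arrays.sort(boxed, Comparator.reverseOrder());
        int[] result = new int[values.length];
        for (int i = 0; i < boxed.length; i++) {
            result[i] = boxed[i];               // unboxing back to primitives
        }
        return result;
    }

    // Staying primitive: sort a copy ascending, then reverse it in place.
    public static int[] sortDescendingPrimitive(int[] values) {
        int[] result = values.clone();
        Arrays.sort(result);
        for (int i = 0, j = result.length - 1; i < j; i++, j--) {
            int tmp = result[i];
            result[i] = result[j];
            result[j] = tmp;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] data = {5, 3, 9, 1};
        System.out.println(Arrays.toString(sortDescendingBoxed(data)));
        System.out.println(Arrays.toString(sortDescendingPrimitive(data)));
    }
}
```

Both methods return the same order, but the boxed version allocates an object array the size of the input, which is the kind of overhead that hurts at hundreds of millions of rows. The primitive version only works here because descending order happens to be the reverse of the natural order; an arbitrary comparator has no such shortcut in the standard library.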

Tablesaw avoids using non-primitives for data, and when that’s not possible (with Strings or dates, for example), it uses type-specific encoding schemes to minimize the footprint. Even primitives use type-specific compression: boolean columns, for example, are compressed bitmaps that use 1/8 the storage of primitive booleans, or about 1/32 the storage of Boolean objects. We can do this because the data is stored in columns, just as it is in advanced OLAP data stores like Redshift.
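As a rough illustration of the bitmap idea, the sketch below packs a boolean column into a `java.util.BitSet`, which stores one bit per value instead of one byte. This is an assumption-laden toy, not Tablesaw's implementation: the class name `BooleanColumnSketch` is hypothetical, and a plain `BitSet` is bit-packed but not compressed the way the bitmaps described above are.

```java
import java.util.BitSet;

// Hypothetical illustration class; not Tablesaw's boolean column.
public class BooleanColumnSketch {

    private final BitSet bits = new BitSet();   // one bit per row
    private int size = 0;                       // BitSet does not track trailing false values

    // Append a value at the end of the column.
    public void append(boolean value) {
        bits.set(size, value);
        size++;
    }

    // Read the value stored at the given row.
    public boolean get(int row) {
        if (row < 0 || row >= size) {
            throw new IndexOutOfBoundsException("row " + row);
        }
        return bits.get(row);
    }

    public int size() {
        return size;
    }

    // Counting true values is a single pass over the packed bits.
    public int countTrue() {
        return bits.cardinality();
    }

    public static void main(String[] args) {
        BooleanColumnSketch column = new BooleanColumnSketch();
        column.append(true);
        column.append(false);
        column.append(true);
        System.out.println(column.size() + " rows, " + column.countTrue() + " true");
    }
}
```

The separate `size` field is needed because a `BitSet` only knows the position of its highest set bit, so trailing `false` values would otherwise be lost. Storing the column as one contiguous structure is what makes this packing possible; a row-oriented layout would interleave the booleans with other fields.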

Tablesaw is still under active development, but I thought I had enough working to put the initial version on GitHub at https://github.com/lwhite1/tablesaw/. Look for more updates as the system is hardened and extended.
