Tablesaw picked as one of “13 Java Projects You Should Try”

Thanks to the team at Takipi for the shout-out. Check out the other interesting projects, too.

The Hitchhikers Guide to GitHub: 13 Java Projects You Should Try

Introducing Tablesaw

Tablesaw is the shortest path to data science in Java. It lets you transform, summarize, or filter data in one statement. If you work with data in Java, it will probably save you time and effort.
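
For example, filtering a table is a single statement. Here’s a minimal sketch using the selectWhere/column API that appears later in this post (the table and column names are hypothetical):

// assume 'labs' is a Table with a String column named "lab";
// column() is a static helper from Tablesaw's query API (static import)
Table potassium = labs.selectWhere(column("lab").isEqualTo("K"));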

Tablesaw also supports descriptive statistics, visualization, and machine learning.

With Tablesaw, you can manipulate half a billion rows on a laptop. For a few dollars an hour on AWS, you can work with over 2 billion records, interactively, without Spark, Yarn, Hadoop or any distributed infrastructure.

Tablesaw is open-sourced under the user-friendly Apache 2 license.

Get it on Maven Central:

<dependency>
    <groupId>tech.tablesaw</groupId>
    <artifactId>tablesaw-core</artifactId>
    <version>0.8.0</version>
</dependency>

Tablesaw: the road to release


Here’s what we’re looking to complete this summer for v1.0.

  • Improved sorting API, making it easier to execute complex sorts
  • Refactoring code to reduce number of import statements needed in a REPL
  • Refactoring code to make use more intuitive
  • Cleanup on the split-apply-combine logic; maybe to include more sophisticated cross-tabs
  • Checking for, and adding, missing column-level functionality
  • More testing and debugging
  • Improved documentation

And last but not least

  • Publish to Maven Central

In the department of new and cool, but not highest priority, is the introduction of simple exploratory graphics that are easy to use from the Java 9 REPL (and elsewhere). There is only so much you can do if you can’t see the data.

New: Load data from any RDBMS

[Photo: Ted Codd]

As of today, you can easily import into Tablesaw from any data source with a JDBC driver. Meaning, pretty much every relational database. Meaning, we are now fully compliant with the 1970s, and with Ted Codd, who I’ve stolen many ideas from. Now I’m repaying Ted by putting his photo in this post.

Thank you, Ted.

To use this feature, you write standard Java/JDBC client code, execute a query, and pass the returned ResultSet into a static create() method on Table.  There’s a simple example below.

So bring on your databases.

 

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

String DB_URL = "jdbc:derby:CoffeeDB;create=true";
Connection conn = DriverManager.getConnection(DB_URL);

Table customer = null;
try (Statement stmt = conn.createStatement()) {
  String sql = "SELECT * FROM Customer";
  try (ResultSet results = stmt.executeQuery(sql)) {
    // Convert the JDBC ResultSet into a Tablesaw Table named "Customer"
    customer = Table.create(results, "Customer");
  }
}

 

1 millisecond selects with 1/2 billion rows

Correction: There was a bug in the test. How embarrassing.

It really took about 20 ms to retrieve the 500 or so ‘hits’ out of the unsorted table. With some performance fixes, it’s now down to ~2 ms per request.  1 ms to go.


I started running tests tonight on the largest data set I’ve used to date. This new test searches a medical records table with 4 columns (lab name, lab value, date, and patientId) and 500,000,000 rows. The CSV file that held all this data used 35 GB of disk.

The test was performed on a MacBook Pro with one 4-core CPU and 16 GB of RAM.

The first bit of goodness was that loading the data from disk was reasonably performant.

Loaded 500,000,000 records from column storage in 174 seconds

On an un-indexed column, it took about 8 1/2 seconds to execute two queries, each returning about 5,000 records.  The code for the queries looked like this:

Table result = t.selectWhere(column("lab").isEqualTo(randomLab1));

and the timings:

lab found in 5317349 micros
lab found in 3020043 micros

Next I created an index on the patient id column and ran some queries on that. Each of these returns about 500 records. Creating the index can be done in a background thread.
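
Here’s a rough sketch of what the indexed lookup looks like conceptually. The class and method names below (IntIndex, Selection, get, where) are taken from later Tablesaw releases and are my assumptions about this early build, and somePatientId is a hypothetical value:

// build the index once (this can run on a background thread)
IntIndex patientIndex = new IntIndex(t.intColumn("patientId"));

// later: look up one patient's ~500 records via the index
Selection matches = patientIndex.get(somePatientId);
Table patientRecords = t.where(matches);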

total retrieval time 988 micros
total retrieval time 668 micros

Those were the original (and incorrect) results. After numerous tweaks, we’re now at:

total retrieval time 2085 micros
total retrieval time 1973 micros

One thing to keep in mind with these results is that when you’re measuring in low milliseconds, little things (like a minor garbage collection) can skew individual results. Which is to say ‘your mileage may vary’.

Tablesaw performance: first results

In an earlier post, I compared Outlier performance importing data from a CSV against published data with Pandas and Python. In Python it took 3,047 seconds (50 minutes) to load the 8 million rows of data.  Outlier loaded 10 million rows of the same data in 2 minutes, or “25% more data, 25 times faster”.

Tablesaw loads the larger dataset from a CSV in 79 seconds: 25% more data, 38 times faster.  Better still, that data can be saved in Tablesaw’s own format in 1 second.  Subsequent reads now take 3 seconds, or about 1,015 times faster than the original Python load.

Introducing Tablesaw

Tablesaw is like having your own, personal, column store: a column store that’s embedded in Java and easier to use than an R dataframe.

With Tablesaw, you can work with half a billion rows on a laptop. For a few dollars an hour on AWS,  you can work with over 2 billion records, interactively, without Spark, Hadoop or any distributed infrastructure. Without even a relational database.

What I wanted for Tablesaw was the ease of Pandas and the performance of C. The biggest obstacle was memory. Primitives are far lighter than their equivalent objects, but they’re hard to use because many libraries auto-box them. Try sorting primitives using a comparator and you’ll see.
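
To see why, here’s a quick illustration (plain JDK code, not Tablesaw): Arrays.sort has no overload that takes a primitive array and a Comparator, so any custom sort order forces boxing:

import java.util.Arrays;
import java.util.Comparator;

int[] values = {3, 1, 2};
// There is no Arrays.sort(int[], Comparator); sorting with a custom order
// means boxing every element into an Integer object first.
Integer[] boxed = Arrays.stream(values).boxed().toArray(Integer[]::new);
Arrays.sort(boxed, Comparator.reverseOrder());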

Tablesaw avoids using non-primitives for data, and when that’s not possible (with Strings, or dates, for example), it uses type-specific encoding schemes to minimize the footprint. Even primitives use type-specific compression: boolean columns, for example, are compressed bitmaps that use 1/8th the storage of primitive booleans, or about 1/32 the storage of Boolean objects. We can do this, because the data is stored in columns, just as it is in advanced OLAP data-stores like Redshift.
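
As a back-of-the-envelope illustration of the boolean case (the general idea only, not Tablesaw’s actual column implementation):

import java.util.BitSet;

// 1,000,000 boolean values:
//   boolean[] : 1 byte per value -> ~1 MB
//   bitmap    : 1 bit per value  -> ~125 KB (1/8th the storage)
BitSet flags = new BitSet(1_000_000);
flags.set(42);                   // mark row 42 as true
boolean present = flags.get(42); // read it back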

Tablesaw is currently under active development, but I thought I had enough working to put the initial version on GitHub at https://github.com/lwhite1/tablesaw/.  Look for more updates as the system is hardened and extended.

 

 

Classic texts for software designers

These books have had a profound effect on how I think about software interface design, whether for Web applications or low-level APIs for programmer use.  I hope you find them useful and inspirational as well.

RAM eats Big Data

My work on Tablesaw is focused on making as many data-munging jobs doable on a single machine as possible. It’s a tall order, but I’ll be getting a lot of help from hardware trends.

KDNuggets recently posted poll results that show that most analytics don’t require “Big Data” tools. The poll asked data scientists about the largest data sets they work with, and found that the largest were often not so large.

In another post based on that poll, they note that:

A majority of data scientists (56%) work in Gigabyte dataset range.

In other words, most people can do their work on a laptop.

[Chart: KDnuggets poll, largest dataset analyzed, 2013-2015]

The more interesting finding was that RAM is growing faster than data. By their estimate, RAM is growing at 50% per year, while the trend for the largest data sets is increasing at 20% per year.

If your laptop is too small, you can probably get your work done faster, easier, and cheaper by leasing a server in the cloud. This is basically the finding of “Nobody ever got fired for using Hadoop on a cluster,” a paper out of Microsoft Research that discusses the cost tradeoffs of using distributed “big data” tools like Spark and Hadoop. Their summary:

Should we be scaling by using single machines with very large memories rather than clusters? We conjecture that, in terms of hardware and programmer time, this may be a better option for the majority of data processing jobs.

A post from the FastML blog quotes another Microsoft researcher, Paul Mineiro:

Since this is my day job, I’m of course paranoid that the need for distributed learning is diminishing as individual computing nodes… become increasingly powerful.

When he wrote that, Mineiro was taking notes at a talk by Stanford prof. Jure Leskovec. Leskovec is co-author of the text Mining of Massive Datasets, so he understands large-scale data crunching. What he said was:

Bottom line: get your own 1TB RAM server
Jure Leskovec’s take on the best way to mine large datasets.

Jure said every grad student in his lab has one of these machines, and that almost every data set of interest fits in RAM.

Pretty soon, you can have one, too. Amazon has dropped hints that EC2 instances with 2 TB of RAM are coming soon. Once you have one, you can make the most of it by using a RAM optimized data manipulation tool. This is, of course, the idea behind Tablesaw.

Fast enough for now

There is much work remaining with Outlier, but I think it’s time to declare a 0.1 milestone. The API has been fairly stable, with most improvements directed at performance and memory consumption. Run time and memory use have both been reduced by up to an order of magnitude.

How fast is it?

Importing data has always been the slowest part. For a benchmark, I compared time-to-import the NYC Complaint Dataset with a well-known blog post where Python, Pandas, and SQLite were used.  Here’s the result:

In the original post, they read 6 columns from a CSV with 8,281,035 rows in just over 50 minutes.

With Outlier, we read 7 columns from a later version of that dataset that had grown to over 10 million rows.  Outlier loaded the data in just under 2 minutes. That’s 25% more data, 25 times faster.

Even better, while the original post declared

“The dataset is too large to load into a Pandas dataframe”

Outlier handled it, in memory, with no problem. No DB and no SQL required.

What has become clear, though, is that there’s much more that can be done to improve Outlier’s performance. This is a journey, and journeys need milestones. So in the next week, I’ll clean up a few things and declare the first milestone done. And then the journey continues.