1 millisecond selects with 1/2 billion rows

Correction: There was a bug in the test. How embarrassing.

It really took about 20 ms to retrieve the 500 or so ‘hits’ out of the unsorted table. With some performance fixes, it’s now down to ~2 ms per request.  1 ms to go.


I started running tests tonight on the largest data set I’ve used to date. This new test searches a medical records table with 4 columns (lab name, lab value, date, and patientId) and 500,000,000 rows. The CSV file that held all this data used 35 GB of disk.

The test was performed on a MacBook Pro with a single 4-core CPU and 16 GB of RAM.

The first bit of goodness was that loading the data from disk was reasonably performant.

Loaded 500,000,000 records from column storage in 174 seconds

On an un-indexed column, it took about 8 1/2 seconds to execute two queries, each returning about 5,000 records.  The code for the queries looked like this:

Table result = t.selectWhere(column("lab").isEqualTo(randomLab1));

and the timings:

lab found in 5317349 micros
lab found in 3020043 micros
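
Timings like these can be captured by wrapping the query call in a System.nanoTime() harness; here is a minimal sketch, reusing t and randomLab1 from the snippet above:

long start = System.nanoTime();
Table result = t.selectWhere(column("lab").isEqualTo(randomLab1));
long micros = (System.nanoTime() - start) / 1_000;   // elapsed time in microseconds
System.out.println("lab found in " + micros + " micros");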

Next I created an index on the patientId column and ran some queries against it. Each of these returned about 500 records. (Creating the index can be done on a background thread.)

total retrieval time 988 micros
total retrieval time 668 micros

Those were the original (and incorrect) results. After numerous tweaks, we’re now at:

total retrieval time 2085 micros
total retrieval time 1973 micros

One thing to keep in mind with these results is that when you’re measuring in low milliseconds, little things (like a minor garbage collection) can skew individual results. Which is to say ‘your mileage may vary’.
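
For context, here is a rough sketch of what an indexed lookup looks like. The index and selection class names below are illustrative placeholders rather than the exact API (which has shifted during development); the point is that the index maps a key directly to the matching row positions instead of scanning the whole column.

// Sketch only: IntIndex and Selection are placeholder names, not necessarily the current API.
IntColumn patientIds = t.intColumn("patientId");  // assumes patientId is stored as an int column
IntIndex index = new IntIndex(patientIds);        // built once, possibly on a background thread
Selection rows = index.get(somePatientId);        // key -> matching row positions, no column scan
Table result = t.selectWhere(rows);               // materialize the ~500 matching records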

Tablesaw performance: first results

In an earlier post, I compared Outlier's performance importing data from a CSV against published results for Pandas and Python. In Python it took 3,047 seconds (about 50 minutes) to load the 8 million rows of data. Outlier loaded 10 million rows of the same data in 2 minutes, or “25% more data, 25 times faster”.

Tablesaw loads the larger dataset from a CSV in 79 seconds: 25% more data, 38 times faster. Better still, that data can be saved in Tablesaw's own format in 1 second, and subsequent reads take just 3 seconds, or 1,015 times faster than the original Python load.
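
Those multiples are just the ratios of the published timings; a quick check of the arithmetic:

double pandasCsvSeconds    = 3047;  // ~50 minutes for 8 million rows, from the earlier comparison
double tablesawCsvSeconds  = 79;    // 10 million rows loaded from CSV
double tablesawReadSeconds = 3;     // 10 million rows re-read from Tablesaw's own format

System.out.printf("CSV load speedup:    %.1fx%n", pandasCsvSeconds / tablesawCsvSeconds);   // ~38.6x
System.out.printf("native read speedup: %.1fx%n", pandasCsvSeconds / tablesawReadSeconds);  // ~1015.7x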

Introducing Tablesaw

Tablesaw is like having your own, personal, column store: a column store that’s embedded in Java and easier to use than an R dataframe.

With Tablesaw, you can work with half a billion rows on a laptop. For a few dollars an hour on AWS,  you can work with over 2 billion records, interactively, without Spark, Hadoop or any distributed infrastructure. Without even a relational database.

What I wanted for Tablesaw was the ease of Pandas and the performance of C. The biggest obstacle was memory. Primitives are far lighter than their equivalent objects, but they’re hard to use because many libraries auto-box them. Try sorting primitives using a comparator and you’ll see.
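
To make the boxing point concrete: java.util.Arrays will sort an int[] in ascending order, but the comparator-based overloads accept only object arrays, so any custom sort order forces you to box every value (assume Arrays and Comparator are imported):

int[] primitives = {3, 1, 2};
Arrays.sort(primitives);                        // works, but ascending order only: no comparator overload for int[]

Integer[] boxed = {3, 1, 2};                    // each element is now a ~16-byte heap object instead of 4 bytes
Arrays.sort(boxed, Comparator.reverseOrder());  // a custom ordering requires the boxed array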

Tablesaw avoids using non-primitives for data, and when that's not possible (with Strings or dates, for example), it uses type-specific encoding schemes to minimize the footprint. Even primitives get type-specific compression: boolean columns, for example, are stored as compressed bitmaps that use 1/8th the storage of primitive booleans, or about 1/32nd the storage of Boolean objects. We can do this because the data is stored in columns, just as it is in advanced OLAP data stores like Redshift.
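
To put rough numbers on the boolean case (Tablesaw itself uses compressed bitmaps; the plain java.util.BitSet below just illustrates the bit-per-value idea):

long rows = 500_000_000L;
long primitiveBytes = rows;      // boolean[] : the JVM spends one byte per element -> ~500 MB
long referenceBytes = rows * 4;  // Boolean[] : a 4-byte reference per element (compressed oops) -> ~2 GB
long bitmapBytes    = rows / 8;  // bitmap    : one bit per element -> ~62 MB, i.e. 1/8th and ~1/32nd of the above

BitSet flags = new BitSet();     // a simple, uncompressed bitmap
flags.set(42);                   // mark row 42 as true
System.out.println(flags.get(42));  // true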

Tablesaw is currently under active development, but I thought I had enough working to put the initial version on Github at: https://github.com/lwhite1/tablesaw/.  Look for more updates as the system is hardened and extended.


Classic texts for software designers

These books have had a profound effect on how I think about software interface design, whether for Web applications or low-level APIs for programmer use.  I hope you find them useful and inspirational as well.

RAM eats Big Data

My work on Tablesaw is focused on making as many data-munging jobs doable on a single machine as possible. It’s a tall order, but I’ll be getting a lot of help from hardware trends.

KDnuggets recently posted poll results showing that most analytics work doesn't require “Big Data” tools. The poll asked data scientists about the largest data sets they work with, and found that the largest were often not so large.

In another post based on that poll, they note that:

A majority of data scientists (56%) work in Gigabyte dataset range.

In other words, most people can do their work on a laptop.

[KDnuggets poll chart: largest dataset analyzed, 2013 vs. 2015]

The more interesting finding was that RAM is growing faster than data. By their estimate, RAM capacity is growing at 50% per year, while the largest data sets are growing at about 20% per year.
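
If those estimates hold, the gap compounds quickly: the headroom a single machine has over a typical “largest” dataset grows by a factor of 1.5 / 1.2 = 1.25 every year. A quick sketch of how that plays out over a decade:

double ramGrowth  = 1.5;  // RAM capacity: +50% per year (KDnuggets estimate)
double dataGrowth = 1.2;  // largest data sets analyzed: +20% per year
for (int year = 1; year <= 10; year++) {
    // relative headroom of RAM over data after 'year' years: 1.25^year, ~9.3x after 10 years
    System.out.printf("after %2d years: %.1fx more headroom%n", year, Math.pow(ramGrowth / dataGrowth, year));
}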

If your laptop is too small, you can probably get your work done faster, easier, and cheaper by leasing a server in the cloud. This is basically the finding of Nobody Ever Got Fired for Using Hadoop on a Cluster, a paper out of Microsoft Research that discusses the cost tradeoffs of using distributed “big data” tools like Spark and Hadoop. Their summary:

Should we be scaling by using single machines with very large memories rather than clusters? We conjecture that, in terms of hardware and programmer time, this may be a better option for the majority of data processing jobs.

A post on the FastML blog quotes another Microsoft researcher, Paul Mineiro:

Since this is my day job, I’m of course paranoid that the need for distributed learning is diminishing as individual computing nodes… become increasingly powerful.

When he wrote that, Mineiro was taking notes at a talk by Stanford professor Jure Leskovec. Leskovec is a co-author of the textbook Mining of Massive Datasets, so he understands large-scale data crunching. What he said was:

Bottom line: get your own 1 TB RAM server
Jure Leskovec's take on the best way to mine large datasets.

Jure said every grad student in his lab has one of these machines, and that almost every data set of interest fits in RAM.

Pretty soon, you'll be able to have one, too. Amazon has dropped hints that EC2 instances with 2 TB of RAM are coming soon. Once you have one, you can make the most of it by using a RAM-optimized data manipulation tool. This is, of course, the idea behind Tablesaw.

Fast enough for now

There is much work remaining on Outlier, but I think it's time to declare a 0.1 milestone. The API has been fairly stable, with most recent improvements directed at speed and memory consumption; both run time and memory use have been reduced by up to an order of magnitude.

How fast is it?

Importing data has always been the slowest part. For a benchmark, I compared the time to import the NYC complaint dataset against a well-known blog post that used Python, Pandas, and SQLite. Here's the result:

In the original post, they read 6 columns from a CSV with 8,281,035 rows in just over 50 minutes.

With Outlier, we read 7 columns from a later version of that dataset, which had grown to over 10 million rows. Outlier loaded the data in just under 2 minutes. That's 25% more data, 25 times faster.

Even better, while the original post declared

“The dataset is too large to load into a Pandas dataframe”

Outlier handled it, in memory, with no problem. No DB and no SQL required.

What has become clear, though, is that there's much more that can be done to improve Outlier's performance. This is a journey, and journeys need milestones. So in the next week, I'll clean up a few things and declare the first milestone done. And then the journey continues.

Getting Started with Outlier

In this brief introduction, we demonstrate a few of the tools that Outlier provides for examining a new data set.

The data describes the approval ratings of US President George W. Bush over time, as measured by different polling organizations. To get started, we load the data from a CSV file, first providing a list of the column data types.

ColumnType[] types = {LOCAL_DATE, INT, STRING};

Then we load the file into a Table object. Table is similar to a Data Frame in R, Julia, or Pandas.

Table bushTable = CsvReader.read("data/BushApproval.csv", types);

Once the data is loaded, a good next step is to look at the structure of the table:

bushTable.structure().print();

The structure() method returns another table, which print() converts to a string:

Table: data/BushApproval.csv - 323 observations (rows) of 3 variables (cols)
Index Column Name Type       Unique Values First      Last 
0     date        LOCAL_DATE 288           2004-02-04 2001-02-09 
1     approval    INT        46            53         57 
2     who         STRING     6             fox        zogby

As you can see, structure provides the column index, name and type, as well as the number of unique values and the first and last value in each column.

The head(int n) method also returns a table, this one containing the first n rows of the target table. Again we convert to a String with print():

bushTable.head(10).print();

For our dataset, this produces the following output:

data/BushApproval.csv
date       approval who
2004-02-04 53       fox
2004-01-21 53       fox
2004-01-07 58       fox
2003-12-03 52       fox
2003-11-18 52       fox
2003-10-28 53       fox
2003-10-14 52       fox
2003-09-23 50       fox
2003-09-09 58       fox
2003-08-12 57       fox

Of course, if you want to print the entire table (and it’s not too big) you can simply call print() on the original table:

bushTable.print();

This produces the same output as head(), but includes every row in the table.

If you simply want the column names, you can use the columnNames() method:

bushTable.columnNames();

which produces:

[date, approval, who]

Individual columns can be retrieved by name or by their (zero-based) index. To get a column, use

table.column("date");

or

bushTable.column(0);

which both return the same column for this dataset.

You can print the values in a column using the column's print() method.

Column approval = bushTable.column("approval");
approval.print();

producing:

approval: INT
53
53
58
52
52
53
etc...

You can also summarize the data in each column, in a column-type appropriate way.

approval.summary().print();

produces, for example:

approval summary
Metric    Value              
n         323                
Mean      64.88235294117646  
Min       45.0               
1st Qu    56.0               
Median    63.0               
3rd Qu    73.0               
Max       90.0               
Range     45.0               
Sum       20957.0            
Std. Dev. 11.270465086514845 

Data: The Bush approval dataset is from http://www.stats4stem.org/r-bushapproval-data.html.

Code: The code can be found in the GettingStarted.java example on Github.

Welcome

On this site, I propose that Java can be a good language for data science and analytics.

With a tool chain that includes Hadoop, HDFS, and the like, Java is very popular for big-data analytics. However, it is not widely used on smaller datasets for quick analysis, exploration, or data massaging.

Here we focus on exactly that: using Java for datasets of ten to tens of millions of observations. In spite of the hype around big data, most datasets are of this scale, and you don't want to set up HDFS just to generate a simple histogram.

The primary tool used here is Outlier, which I developed as a Java alternative to data frames in R, Julia, or Pandas. Outlier lets you bend and twist datasets until they do what you want. It’s being integrated with advanced statistical tools for regression, classification, machine learning, and (someday) deep learning, so that you can go seamlessly from exploration to prediction.

Developing Outlier is great fun for me. I hope you have fun using it.