1 millisecond selects with 1/2 billion rows

Correction: There was a bug in the test. How embarrassing.

It really took about 20 ms to retrieve the 500 or so ‘hits’ out of the unsorted table. With some performance fixes, it’s now down to ~2 ms per request.  1 ms to go.

[Image: cheetah]

I started running tests tonight on the largest data set I’ve used to date. This new test searches a medical records table with 4 columns (lab name, lab value, date, and patientId) and 500,000,000 rows. The CSV file holding all this data is 35 GB on disk.

The test was performed on a MacBook Pro with a single 4-core CPU and 16 GB of RAM.

The first bit of goodness was that loading the data from disk was reasonably performant.

Loaded 500,000,000 records from column storage in 174 seconds
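A load-and-time sketch for that step might look like the following. The StorageManager.readTable call and the file name are placeholders of mine for whatever reads the column-storage format; the post doesn’t show that code.

// Time the load from the column-oriented storage format.
// NOTE: StorageManager.readTable and the file name are placeholder names, not necessarily the real API.
long start = System.currentTimeMillis();
Table t = StorageManager.readTable("medical_records.saw");
long seconds = (System.currentTimeMillis() - start) / 1000;
System.out.println("Loaded " + t.rowCount() + " records from column storage in " + seconds + " seconds");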

On an un-indexed column, it took about 8 1/2 seconds to execute two queries, each returning about 5,000 records.  The code for the queries looked like this:

Table result = t.selectWhere(column("lab").isEqualTo(randomLab1));

and the timings:

lab found in 5317349 micros
lab found in 3020043 micros
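A timing harness around that call is all it takes to produce numbers in this format. The wrapper below is my own sketch; randomLab1 is assumed to be a lab name drawn at random from the column.

// Full scan of the un-indexed 'lab' column, timed in microseconds.
long start = System.nanoTime();
Table result = t.selectWhere(column("lab").isEqualTo(randomLab1));
long micros = (System.nanoTime() - start) / 1000;
System.out.println("lab found in " + micros + " micros");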

Next I created an index on the patientId column and ran some queries against it. Each of these queries returns about 500 records. Creating the index can be done in a background thread.

total retrieval time 988 micros
total retrieval time 668 micros

Those were the original (and incorrect) results. After numerous tweaks, we’re now at:

total retrieval time 2085 micros
total retrieval time 1973 micros

One thing to keep in mind with these results is that when you’re measuring in low milliseconds, little things (like a minor garbage collection) can skew individual results. Which is to say ‘your mileage may vary’.
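To make the index step above concrete, here is a rough sketch of building the patientId index on a background thread and then querying it. The IntIndex type and its get method are my assumptions about the index API, not code from the post, and exception handling is omitted.

// (uses java.util.concurrent.ExecutorService, Executors, Future; exception handling omitted)
// Build the patientId index off the main thread.
// NOTE: IntIndex and index.get(...) are placeholder names for the index API.
ExecutorService pool = Executors.newSingleThreadExecutor();
Future<IntIndex> futureIndex = pool.submit(() -> new IntIndex(t.intColumn("patientId")));

// ... other work continues while the index is built ...

IntIndex index = futureIndex.get();   // blocks until the index is ready

long start = System.nanoTime();
Table result = index.get(randomPatientId);   // the ~500 rows for one patient
long micros = (System.nanoTime() - start) / 1000;
System.out.println("total retrieval time " + micros + " micros");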

Tablesaw performance: first results

In an earlier post, I compared Outlier’s performance importing data from a CSV against published results for Pandas and Python. In Python it took 3,047 seconds (about 50 minutes) to load the 8 million rows of data. Outlier loaded 10 million rows of the same data in 2 minutes, or “25% more data, 25 times faster”.

Tablesaw loads the larger dataset from a CSV in 79 seconds: 25% more data, 38 times faster. Better still, that data can be saved in Tablesaw’s own format in 1 second. Subsequent reads then take 3 seconds, about 1,015 times faster than the original Python load.
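The workflow that makes this pay off is: parse the CSV once, save in Tablesaw’s own format, and reload from that on later runs. A sketch, with placeholder method names (CsvReader.read, StorageManager.saveTable/readTable) standing in for the actual I/O calls:

// Pay the CSV parsing cost once (~79 seconds for the 10-million-row file)...
Table complaints = CsvReader.read("complaints.csv");          // placeholder reader call

// ...save it in the column-oriented Tablesaw format (~1 second)...
StorageManager.saveTable("complaints.saw", complaints);       // placeholder writer call

// ...and reload from that format on later runs (~3 seconds).
Table reloaded = StorageManager.readTable("complaints.saw");  // placeholder reader call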

Fast enough for now

There is much work remaining on Outlier, but I think it’s time to declare a 0.1 milestone. The API has been fairly stable, with most recent improvements directed at performance and memory consumption; both run time and memory use have been reduced by up to an order of magnitude.

How fast is it?

Importing data has always been the slowest part. For a benchmark, I compared time-to-import the NYC Complaint Dataset against a well-known blog post where Python, Pandas, and SQLite were used. Here’s the result:

In the original post, they read 6 columns from a CSV with 8,281,035 rows in just over 50 minutes.

With Outlier, we read 7 columns from a later version of that dataset, which had grown to over 10 million rows. Outlier loaded the data in just under 2 minutes. That’s 25% more data, 25 times faster.

Even better, while the original post declared

“The dataset is too large to load into a Pandas dataframe”

Outlier handled it, in memory, with no problem. No DB and no SQL required.

What has become clear, though, is that there’s much more that can be done to improve Outlier’s performance. This is a journey, and journeys need milestones. So in the next week, I’ll clean up a few things and declare the first milestone done. And then the journey continues.