Correction: There was a bug in the test. How embarrassing.
It really took about 20 ms to retrieve the 500 or so ‘hits’ out of the unsorted table. With some performance fixes, it’s now down to ~2 ms per request. 1 ms to go.
I started running tests tonight on the largest data set I’ve used to date. This new test searches a medical records table with 4 columns (lab name, lab value, date, and patientId) and 500,000,000 rows. The CSV file that held all this data used 35 GB of disk.
The test was performed on a MacBook Pro with one 4-core CPU and 16 GB of RAM.
The first bit of goodness was that loading the data from disk was reasonably performant.
Loaded 500,000,000 records from column storage in 174 seconds
On an un-indexed column, it took about 8.5 seconds total to execute two queries, each returning about 5,000 records. The code for the queries looked like this:
Table result = t.selectWhere(column("lab").isEqualTo(randomLab1));
and the timings:
lab found in 5317349 micros
lab found in 3020043 micros
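Why is the un-indexed query so slow? Without an index, selecting rows means scanning every value in the column. The sketch below is not Tablesaw's implementation, just a minimal stand-in using a plain array to show the linear scan that dominates those ~4-5 second timings (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class FullScan {
    // Hypothetical stand-in for an un-indexed column query: every row is
    // visited once, so the cost grows linearly with the 500,000,000 rows.
    static List<Integer> selectWhereEquals(String[] labColumn, String target) {
        List<Integer> matches = new ArrayList<>();
        for (int row = 0; row < labColumn.length; row++) {
            if (labColumn[row].equals(target)) {
                matches.add(row); // collect matching row positions
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String[] lab = {"CBC", "A1C", "CBC", "BMP"};
        System.out.println(selectWhereEquals(lab, "CBC")); // rows 0 and 2
    }
}
```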
Next I created an index on the patientId column and ran some queries against it. Each of these returns about 500 records. Creating the index can be done in a background thread.
total retrieval time 988 micros
total retrieval time 668 micros
Those were the original (and incorrect) results. After numerous tweaks, we’re now at:
total retrieval time 2085 micros
total retrieval time 1973 micros
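The three-orders-of-magnitude speedup comes from the index. Conceptually, an index maps each distinct value to the row positions where it occurs, so a lookup touches only the ~500 matching rows instead of all 500,000,000. This is a minimal sketch of that idea using a stdlib HashMap, not Tablesaw's actual index implementation (the names are hypothetical):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ColumnIndex {
    // Build the index once (e.g. in a background thread): one pass over the
    // column, recording where each value occurs.
    static Map<Integer, List<Integer>> buildIndex(int[] patientIdColumn) {
        Map<Integer, List<Integer>> index = new HashMap<>();
        for (int row = 0; row < patientIdColumn.length; row++) {
            index.computeIfAbsent(patientIdColumn[row], k -> new ArrayList<>())
                 .add(row);
        }
        return index;
    }

    public static void main(String[] args) {
        int[] patientIds = {101, 202, 101, 303};
        Map<Integer, List<Integer>> index = buildIndex(patientIds);
        // A query is now a single hash lookup rather than a full scan.
        System.out.println(index.get(101)); // rows where patientId == 101
    }
}
```

The trade-off is the up-front build cost and the extra memory for the map, which is why building it off the query path in a background thread matters.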
One thing to keep in mind with these results is that when you’re measuring in low milliseconds, little things (like a minor garbage collection) can skew individual results. Which is to say ‘your mileage may vary’.