There is much work remaining on Outlier, but I think it’s time to declare a 0.1 milestone. The API has been fairly stable, with most improvements directed at performance and memory consumption; both import time and memory use have been reduced by up to an order of magnitude.
How fast is it?
Importing data has always been the slowest part. For a benchmark, I compared time-to-import on the NYC Complaint Dataset against a well-known blog post that used Python, Pandas, and SQLite. Here’s the result:
In the original post, they read 6 columns from a CSV with 8,281,035 rows in just over 50 minutes.
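For context, the baseline approach in that style of post is to stream the CSV in fixed-size batches into SQLite rather than load it all at once. Here is a minimal stdlib-only sketch of that pattern; the tiny inline CSV, the `complaints` table, and the batch size are all hypothetical stand-ins for the real multi-million-row dataset, and the original post did this via pandas’ `chunksize` rather than the `csv` module.

```python
import csv
import io
import sqlite3

# Hypothetical miniature stand-in for the NYC complaint CSV
# (the real file has millions of rows).
csv_text = "complaint_type,borough\nNoise,QUEENS\nHeating,BRONX\nNoise,BROOKLYN\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE complaints (complaint_type TEXT, borough TEXT)")

# Stream the file and insert in fixed-size batches, so memory stays flat
# no matter how large the CSV grows -- the workaround for a dataset
# "too large to load into a Pandas dataframe".
BATCH = 2  # illustrative; a real import would use tens of thousands
reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row
batch = []
for row in reader:
    batch.append(row)
    if len(batch) == BATCH:
        conn.executemany("INSERT INTO complaints VALUES (?, ?)", batch)
        batch = []
if batch:  # flush any remaining partial batch
    conn.executemany("INSERT INTO complaints VALUES (?, ?)", batch)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM complaints").fetchone()[0]
print(count)  # 3
```

The batching keeps peak memory bounded by the batch size, which is why the baseline works at all on a file that exceeds available RAM; the cost is a long wall-clock time dominated by per-batch insert overhead.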
With Outlier, we read 7 columns from a later version of that dataset that had grown to over 10 million rows. Outlier loaded the data in just under 2 minutes. That’s 25% more data, loaded 25 times faster.
Even better, while the original post declared
“The dataset is too large to load into a Pandas dataframe”
Outlier handled it, in memory, with no problem. No DB and no SQL required.
What has become clear, though, is that there’s much more that can be done to improve Outlier’s performance. This is a journey, and journeys need milestones. So in the next week I’ll clean up a few things and declare the first milestone done. And then the journey continues.