My work on Tablesaw is focused on making as many data-munging jobs doable on a single machine as possible. It’s a tall order, but I’ll be getting a lot of help from hardware trends.
KDnuggets recently posted poll results showing that most analytics work doesn’t require “Big Data” tools. The poll asked data scientists about the largest data sets they work with, and found that the largest were often not so large.
In another post based on that poll, they note that:
A majority of data scientists (56%) work in Gigabyte dataset range.
In other words, most people can do their work on a laptop.
The more interesting finding was that RAM is growing faster than data. By their estimate, RAM is growing at 50% per year, while the largest data sets are growing at 20% per year.
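To put rough numbers on that gap, here is a minimal sketch of the compounding arithmetic. The 50% and 20% rates are the KDnuggets estimates quoted above; treating them as constant year after year is my own simplifying assumption, and the function names are made up for illustration.

```python
import math

# Yearly growth rates from the KDnuggets estimate quoted above.
# Assumption: both rates stay constant and compound annually.
RAM_GROWTH = 0.50
DATA_GROWTH = 0.20


def relative_growth(ram_growth=RAM_GROWTH, data_growth=DATA_GROWTH):
    """Factor by which RAM gains on the largest data sets each year."""
    return (1 + ram_growth) / (1 + data_growth)


def years_to_fit(data_to_ram_ratio, ram_growth=RAM_GROWTH, data_growth=DATA_GROWTH):
    """Whole years until a data set that is `data_to_ram_ratio` times
    larger than RAM today would fit in RAM (hypothetical helper)."""
    if data_to_ram_ratio <= 1:
        return 0  # it already fits
    gain = relative_growth(ram_growth, data_growth)
    return math.ceil(math.log(data_to_ram_ratio) / math.log(gain))


if __name__ == "__main__":
    print(f"RAM gains about {relative_growth() - 1:.0%} on data per year")
    for ratio in (2, 10, 100):
        print(f"data {ratio}x larger than RAM: fits in about {years_to_fit(ratio)} years")
```

Under those assumptions RAM gains about 25% on data every year, so even a data set ten times too big for memory today would fit in roughly a decade. Real hardware and data trends won't be this smooth, so treat the output as a back-of-the-envelope figure.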
If your laptop is too small, you can probably get your work done faster, easier, and cheaper by leasing a server in the cloud. This is essentially the finding of Nobody ever got fired for using Hadoop on a Cluster, a paper out of Microsoft Research that discusses the cost tradeoffs of using distributed “big data” tools like Spark and Hadoop. Their summary:
Should we be scaling by using single machines with very large memories rather than clusters? We conjecture that, in terms of hardware and programmer time, this may be a better option for the majority of data processing jobs.
A post from the FastML blog quotes another Microsoft researcher, Paul Mineiro:
Since this is my day job, I’m of course paranoid that the need for distributed learning is diminishing as individual computing nodes… become increasingly powerful.
When he wrote that, Mineiro was taking notes at a talk by Stanford professor Jure Leskovec. Leskovec is co-author of the text Mining of Massive Datasets, so he understands large-scale data crunching. Mineiro’s notes on the talk record:
Jure said every grad student in his lab has one of these machines, and that almost every data set of interest fits in RAM.