New Plot Types in Tablesaw

In a prior post, I showed how to create some native Java scatter plots and a quantile plot in Tablesaw. Since then, I’ve added a few more plot types.

When it comes to plotting, Tablesaw integrates other libraries and tries to make their use as consistent as possible. Like the earlier scatter plots, this line chart is rendered using XChart under the covers:

boston_robberies

The dramatic increase in armed robberies is shown by plotting the data against its position in the sequence. The code looks like this:

Table robberies = Table.createFromCsv("data/boston-robberies.csv");
NumericColumn x = robberies.nCol("Record");
NumericColumn y = robberies.nCol("Robberies");
Line.show("Monthly Boston Armed Robberies Jan. 1966 - Oct. 1975", x, y);

Histograms are a must-have. We use the plotting capabilities of the Smile machine learning library to create the one below.

batting_histogram

Although the plots come from different libraries, the Tablesaw API stays consistent:

Table baseball = Table.createFromCsv("data/baseball.csv");
NumericColumn x = baseball.nCol("BA");
Histogram.show("Distribution of team batting averages", x);

This is currently the only Smile plot we’re using, but there’s more to come: heatmaps, contour plots, and QQ plots are on the way. We’re also starting to integrate Smile’s machine learning capabilities, which will be a huge step forward for Tablesaw.

Bar plots are unglamorous, but very useful. Tablesaw can produce both horizontal and vertical bar plots, and also creates Pareto charts directly as a convenience. They’re all based on the JavaFX chart library, and like the other Tablesaw plots, they’re rendered in Swing windows. Here we show a Pareto chart of tornado fatalities by US state.

pareto

The code to produce this chart, including a filter to remove states with three or fewer fatalities, is shown below. The grouping is done using the summarize method, which produces tabular summaries that can be passed directly to the plotting API.

Note the use of the #sum method. Any numerical summary supported by Tablesaw (standard deviation, median, sumOfLogs, etc.) can be substituted for easy plotting.

Table table = Table.createFromCsv("data/tornadoes_1950-2014.csv");
table = table.selectWhere(column("Fatalities").isGreaterThan(3));
Pareto.show("Tornado Fatalities by State", 
    table.summarize("fatalities", sum).by("State"));

As you can see, loading from a CSV, filtering the data, grouping, summing, sorting, and plotting is all done in a few lines of code.
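Under the hood, summarize("fatalities", sum).by("State") is a group-by aggregation. As a rough sketch of the same idea in plain Java (the states and counts here are invented for illustration, not taken from the tornado data set):

```java
import java.util.Map;
import java.util.TreeMap;

public class GroupBySumSketch {
    public static void main(String[] args) {
        // Hypothetical (state, fatalities) pairs standing in for CSV rows.
        String[] states = {"TX", "AL", "TX", "MS", "AL"};
        double[] fatalities = {10, 4, 6, 8, 5};

        // Group rows by state and sum fatalities, as summarize(...).by(...) does.
        Map<String, Double> totals = new TreeMap<>();
        for (int i = 0; i < states.length; i++) {
            totals.merge(states[i], fatalities[i], Double::sum);
        }
        System.out.println(totals); // prints {AL=9.0, MS=8.0, TX=16.0}
    }
}
```

Swapping sum for another aggregate (median, standard deviation, and so on) changes only the merge step; the grouping pattern stays the same.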

Finally, we have a BoxPlot.

tornado_boxplot

For box plots, the groups are formed using Table’s splitOn() method, or simply by passing the names of the summary and grouping columns along with the table:

Table table = Table.createFromCsv("data/tornadoes_1950-2014.csv");
Box.show("Tornado Injuries by Scale", table, "injuries", "scale");

I hope you’ll find Tablesaw useful for your data analytics work.

 

Tablesaw gets Graphic

Today we introduced the first elements of what will be Tablesaw’s support for exploratory data visualization in pure Java. As Tablesaw expands its scope to integrate statistical and machine learning capabilities, this kind of visualization will be critical.

tornados

This slightly ghostly US map image was created as a simple scatter plot of the starting latitude and longitude of every US tornado between 1950 and 2014. The code below loads the data, filters out missing records, and renders the plot:

Table tornado = Table.createFromCsv("data/tornadoes_1950-2014.csv");

tornado = tornado.selectWhere(
    both(column("Start Lat").isGreaterThan(0f),
         column("Scale").isGreaterThanOrEqualTo(0)));

Scatter.show("US Tornados 1950-2014",
    tornado.numericColumn("Start Lon"),
    tornado.numericColumn("Start Lat"));

These plots provide visual feedback to the analyst while she’s working. They’re for discovery, rather than for presentation, and ease of use is stressed over beauty. Behind the scenes, the charts are created with Tim Molter’s awesome XChart library:  https://github.com/timmolter/XChart.

The following chart is taken from a baseball data set. It shows how to split a table on the values of one or more columns, producing a series for each group. In this case, we color the marks differently according to whether the team made the playoffs.

winsByYear

Here’s the code:

Table baseball = Table.createFromCsv("data/baseball.csv");
Scatter.show("Regular season wins by year",
    baseball.numericColumn("W"),
    baseball.numericColumn("Year"),
    baseball.splitOn(baseball.column("Playoffs")));
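splitOn partitions the rows into one group per distinct column value, and each group becomes its own series on the chart. A minimal plain-Java sketch of that partitioning (the Season record and the seasons data are made up for illustration):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SplitOnSketch {
    // A made-up row type standing in for a row of the baseball table.
    record Season(int year, int wins, boolean playoffs) {}

    public static void main(String[] args) {
        List<Season> seasons = List.of(
            new Season(2010, 96, true),
            new Season(2011, 71, false),
            new Season(2012, 94, true),
            new Season(2013, 66, false));

        // One group per distinct "Playoffs" value; each group becomes a series.
        Map<Boolean, List<Season>> series = seasons.stream()
            .collect(Collectors.partitioningBy(Season::playoffs));

        System.out.println(series.get(true).size() + " playoff seasons");
    }
}
```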

A chart that looks like a scatter plot and works like a histogram is a Quantile Plot. The plot below presents the distribution of public opinion poll ratings for one US president.

bush_quantiles

This chart was built using the Quantile class:

String title = "Quantiles: George W. Bush (Feb. 2001 - Feb. 2004)";
Quantile.show(title, bush.numericColumn("approval"));
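A quantile plot pairs each sorted value with its empirical quantile: the i-th smallest of n values is plotted at position i/(n-1). A small plain-Java sketch of that pairing (the approval numbers here are invented, not actual poll data):

```java
import java.util.Arrays;

public class QuantileSketch {
    public static void main(String[] args) {
        // Made-up approval ratings standing in for the poll column.
        double[] approval = {62, 55, 71, 49, 58, 66, 53};
        Arrays.sort(approval);

        int n = approval.length;
        for (int i = 0; i < n; i++) {
            double q = (double) i / (n - 1); // empirical quantile in [0, 1]
            System.out.printf("q=%.2f -> %.0f%n", q, approval[i]);
        }
    }
}
```

Plotting q on one axis and the sorted values on the other yields the scatter-like chart above.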

Further down the line, I expect to add JavaScript plot support based on D3. These plots will be focused more on presentation, especially Web-based presentation, as Tablesaw becomes a complete platform for data science.

 

New: Load data from any RDBMS

tedCodd
Ted Codd

As of today, you can easily import into Tablesaw from any data source with a JDBC driver. Meaning, pretty much every relational database. Meaning, we are now fully compliant with the 1970s, and with Ted Codd, from whom I’ve stolen many ideas. Now I’m repaying Ted by putting his photo in this post.

Thank you, Ted.

To use this feature, you write standard Java/JDBC client code, execute a query, and pass the returned ResultSet into a static create() method on Table.  There’s a simple example below.

So bring on your databases.

 

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

String DB_URL = "jdbc:derby:CoffeeDB;create=true";
Connection conn = DriverManager.getConnection(DB_URL);

Table customer = null;
try (Statement stmt = conn.createStatement()) {
  String sql = "SELECT * FROM Customer";
  try (ResultSet results = stmt.executeQuery(sql)) {
    // Build a Tablesaw table directly from the JDBC ResultSet.
    customer = Table.create(results, "Customer");
  }
}

 

RAM eats Big Data

My work on Tablesaw is focused on making as many data-munging jobs doable on a single machine as possible. It’s a tall order, but I’ll be getting a lot of help from hardware trends.

KDNuggets recently posted poll results that show that most analytics don’t require “Big Data” tools. The poll asked data scientists about the largest data sets they work with, and found that the largest were often not so large.

In another post based on that poll, they note that:

A majority of data scientists (56%) work in Gigabyte dataset range.

In other words, most people can do their work on a laptop.

poll-largest-dataset-analyzed-2013-2015

The more interesting finding was that RAM is growing faster than data. By their estimate, RAM capacity is growing at 50% per year, while the size of the largest data sets is growing at only 20% per year.
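At those rates the gap widens fast: compounding 50% against 20% grows the RAM-to-data ratio by 1.25x per year, roughly doubling it every three years. A quick back-of-the-envelope calculation:

```java
public class GrowthGap {
    public static void main(String[] args) {
        double ram = 1.0, data = 1.0; // normalized starting sizes
        for (int year = 1; year <= 10; year++) {
            ram *= 1.5;   // RAM grows ~50% per year
            data *= 1.2;  // largest data sets grow ~20% per year
        }
        // After a decade, RAM has pulled roughly 9x ahead of data.
        System.out.printf("RAM x%.1f, data x%.1f, ratio %.1f%n",
            ram, data, ram / data);
    }
}
```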

If your laptop is too small, you can probably get your work done faster, easier, and cheaper by leasing a server in the cloud. This is basically the finding of “Nobody ever got fired for using Hadoop on a Cluster,” a paper out of Microsoft Research that discusses the cost tradeoffs of using distributed “big data” tools like Spark and Hadoop. Their summary:

Should we be scaling by using single machines with very large memories rather than clusters? We conjecture that, in terms of hardware and programmer time, this may be a better option for the majority of data processing jobs.

A post from the FastML blog quotes another Microsoft researcher, Paul Mineiro:

Since this is my day job, I’m of course paranoid that the need for distributed learning is diminishing as individual computing nodes… become increasingly powerful.

When he wrote that, Mineiro was taking notes at a talk by Stanford professor Jure Leskovec. Leskovec is a co-author of the text Mining of Massive Datasets, so he understands large-scale data crunching. What he said was:

Bottom line: get your own 1TB RAM server
Jure Leskovec’s take on the best way to mine large datasets.

Jure said every grad student in his lab has one of these machines, and that almost every data set of interest fits in RAM.

Pretty soon, you’ll be able to have one, too. Amazon has dropped hints that EC2 instances with 2 TB of RAM are coming soon. Once you have one, you can make the most of it by using a RAM-optimized data manipulation tool. That is, of course, the idea behind Tablesaw.