Getting Started with Outlier

This brief introduction demonstrates a few of the tools that Outlier provides for examining a new data set.

The data records approval ratings for US President George W. Bush over time, as measured by several polling organizations. To get started, we load the data from a CSV file, first declaring the type of each column:

ColumnType[] types = {LOCAL_DATE, INT, STRING};
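
To make the role of the type list concrete, here is a minimal plain-Java sketch of what typed parsing of a single row involves. It does not use Outlier and is not how Outlier's reader is implemented. The class names TypedRowSketch and PollRow are hypothetical, and the sketch assumes that the file is comma-separated and stores dates in ISO-8601 form (yyyy-MM-dd).

```java
import java.time.LocalDate;

class TypedRowSketch {

    // Hypothetical holder for one parsed row; not part of Outlier.
    static final class PollRow {
        final LocalDate date;
        final int approval;
        final String who;

        PollRow(LocalDate date, int approval, String who) {
            this.date = date;
            this.approval = approval;
            this.who = who;
        }
    }

    // Parses one line in the same column order as the type list:
    // LOCAL_DATE, INT, STRING.
    // Assumption: fields are comma-separated and contain no quoted commas.
    static PollRow parse(String line) {
        String[] fields = line.split(",");
        LocalDate date = LocalDate.parse(fields[0].trim()); // expects yyyy-MM-dd
        int approval = Integer.parseInt(fields[1].trim());
        String who = fields[2].trim();
        return new PollRow(date, approval, who);
    }

    public static void main(String[] args) {
        PollRow row = parse("2004-02-04,53,fox");
        System.out.println(row.date + " " + row.approval + " " + row.who);
    }
}
```

Declaring the types up front means a malformed value fails at load time (here with a DateTimeParseException or NumberFormatException) rather than surfacing later during analysis.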

Then we load the file into a Table object. A Table is similar to a data frame in R, Julia, or pandas.

Table table = CsvReader.read("data/BushApproval.csv", types);

Once the data is loaded, a good next step is to look at the structure of the table:

table.structure().print();

The structure() method returns another table, which print() converts to a string:

Table: data/BushApproval.csv - 323 observations (rows) of 3 variables (cols)
Index Column Name Type       Unique Values First      Last 
0     date        LOCAL_DATE 288           2004-02-04 2001-02-09 
1     approval    INT        46            53         57 
2     who         STRING     6             fox        zogby

As you can see, structure() provides the index, name, and type of each column, as well as its number of unique values and its first and last values.
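
The per-column figures in that output are simple to reason about. The following plain-Java sketch shows one way to derive the unique-value count and the first and last values, using the first ten approval values of the data set. It is an illustration only and does not use Outlier; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

class ColumnStructureSketch {

    // Counts distinct values by collecting the column into a set.
    static <T> int uniqueCount(List<T> column) {
        return new HashSet<>(column).size();
    }

    public static void main(String[] args) {
        // The first ten approval values of the data set.
        List<Integer> approval = Arrays.asList(53, 53, 58, 52, 52, 53, 52, 50, 58, 57);

        System.out.println("Unique Values: " + uniqueCount(approval));
        System.out.println("First: " + approval.get(0));
        System.out.println("Last: " + approval.get(approval.size() - 1));
    }
}
```

Note that "first" and "last" refer to row order, not to sorted order, which is why the date column above lists a later date first: the file is stored in reverse chronological order.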

The head(int n) method also returns a table, this one containing the first n rows of the target table. Again we convert to a String with print():

table.head(10).print();

For our dataset, this produces the following output:

data/BushApproval.csv
date       approval who
2004-02-04 53       fox
2004-01-21 53       fox
2004-01-07 58       fox
2003-12-03 52       fox
2003-11-18 52       fox
2003-10-28 53       fox
2003-10-14 52       fox
2003-09-23 50       fox
2003-09-09 58       fox
2003-08-12 57       fox
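
Conceptually, head(n) is a bounded slice of the rows. The plain-Java sketch below shows the idea on an ordinary list. It is not Outlier's implementation, the class name is hypothetical, and clamping n to the number of available rows is an assumption about sensible behavior rather than a documented guarantee.

```java
import java.util.Arrays;
import java.util.List;

class HeadSketch {

    // Returns a view of the first n rows, or of all rows if fewer than n exist.
    // Assumption: a negative n is treated as zero.
    static <T> List<T> head(List<T> rows, int n) {
        int end = Math.min(Math.max(n, 0), rows.size());
        return rows.subList(0, end);
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "2004-02-04 53 fox",
                "2004-01-21 53 fox",
                "2004-01-07 58 fox");
        for (String row : head(rows, 2)) {
            System.out.println(row);
        }
    }
}
```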

Of course, if you want to print the entire table (and it’s not too big) you can simply call print() on the original table:

table.print();

This produces the same output as head(), but includes every row in the table.

If you only want the column names, use the columnNames() method:

table.columnNames();

which produces:

[date, approval, who]

Individual columns can be retrieved by name or by (zero-based) index. To get a column, use

table.column("date");

or

table.column(0);

which both return the same column for this dataset.
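
One common way to support both lookups is to keep the columns in positional order and resolve a name to its position first. The sketch below illustrates that design in plain Java. It is a hypothetical structure for explanation only and says nothing about how Outlier stores columns internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ColumnLookupSketch {
    // Names and columns are kept in parallel, in insertion order.
    private final List<String> names = new ArrayList<>();
    private final List<List<?>> columns = new ArrayList<>();

    void addColumn(String name, List<?> values) {
        names.add(name);
        columns.add(values);
    }

    // Lookup by zero-based position.
    List<?> column(int index) {
        return columns.get(index);
    }

    // Lookup by name: resolve the name to a position, then delegate.
    List<?> column(String name) {
        int index = names.indexOf(name);
        if (index < 0) {
            throw new IllegalArgumentException("No column named " + name);
        }
        return column(index);
    }

    public static void main(String[] args) {
        ColumnLookupSketch table = new ColumnLookupSketch();
        table.addColumn("date", Arrays.asList("2004-02-04", "2004-01-21"));
        table.addColumn("approval", Arrays.asList(53, 53));
        table.addColumn("who", Arrays.asList("fox", "fox"));

        // Both calls resolve to the same underlying column object.
        System.out.println(table.column("date") == table.column(0));
    }
}
```

Name-based access is usually the more robust choice in analysis code, because it keeps working if columns are reordered.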

You can print the values in a column using the column's print() method:

Column approval = table.column("approval");
approval.print();

producing:

approval: INT
53
53
58
52
52
53
etc...

You can also summarize the data in each column, in a way appropriate to the column's type:

approval.summary().print();

produces, for example:

approval summary
Metric    Value              
n         323                
Mean      64.88235294117646  
Min       45.0               
1st Qu    56.0               
Median    63.0               
3rd Qu    73.0               
Max       90.0               
Range     45.0               
Sum       20957.0            
Std. Dev. 11.270465086514845 
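
For readers who want to check such figures by hand, the following plain-Java sketch computes most of the same metrics for the first ten approval values of the data set, using only the standard library. It is not Outlier's implementation and the class name is hypothetical. Two assumptions are made: the standard deviation uses the sample formula (dividing by n - 1), and quartiles are omitted because several competing definitions exist and the source does not say which one Outlier uses.

```java
import java.util.Arrays;
import java.util.IntSummaryStatistics;

class ColumnSummarySketch {

    // Median: middle value of the sorted data, or the mean of the two middle values.
    static double median(int[] values) {
        int[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Sample standard deviation (divides by n - 1). This is an assumption;
    // a population standard deviation would divide by n instead.
    static double sampleStdDev(int[] values) {
        double mean = Arrays.stream(values).average().orElse(Double.NaN);
        double sumOfSquares = 0.0;
        for (int v : values) {
            double d = v - mean;
            sumOfSquares += d * d;
        }
        return Math.sqrt(sumOfSquares / (values.length - 1));
    }

    public static void main(String[] args) {
        // The first ten approval values of the data set.
        int[] approval = {53, 53, 58, 52, 52, 53, 52, 50, 58, 57};
        IntSummaryStatistics stats = Arrays.stream(approval).summaryStatistics();

        System.out.println("n         " + stats.getCount());
        System.out.println("Mean      " + stats.getAverage());
        System.out.println("Min       " + stats.getMin());
        System.out.println("Median    " + median(approval));
        System.out.println("Max       " + stats.getMax());
        System.out.println("Range     " + (stats.getMax() - stats.getMin()));
        System.out.println("Sum       " + stats.getSum());
        System.out.println("Std. Dev. " + sampleStdDev(approval));
    }
}
```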

Data: The Bush approval dataset is from http://www.stats4stem.org/r-bushapproval-data.html.

Code: The code can be found in the GettingStarted.java example on GitHub.