Getting Started with Outlier

This brief introduction demonstrates a few of the tools that Outlier provides for examining a new data set.

The data records the approval ratings for US President George W. Bush over time, as measured by different polling organizations. To get started, we load the data from a CSV file, first providing an array of the column types.

ColumnType[] types = {LOCAL_DATE, INT, STRING};

Then we load the file into a Table object. A Table is similar to a data frame in R, Julia, or pandas.

Table bushTable ="data/BushApproval.csv", types);
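To make the role of the column types concrete, here is a minimal plain-Java sketch, independent of Outlier, of what typed loading amounts to for a single row. It assumes each line of the file has the form date,approval,who with an ISO-formatted date; the PollRowSketch and PollRow names are hypothetical and not part of the library.

```java
import java.time.LocalDate;

class PollRowSketch {
    /** One typed row: a date, an integer rating, and the polling organization. */
    static final class PollRow {
        final LocalDate date;
        final int approval;
        final String who;

        PollRow(LocalDate date, int approval, String who) {
            this.date = date;
            this.approval = approval;
            this.who = who;
        }
    }

    /** Parses one comma-separated line, converting each field to its column type. */
    static PollRow parse(String line) {
        String[] fields = line.split(",");
        if (fields.length != 3) {
            throw new IllegalArgumentException("expected 3 fields, got " + fields.length);
        }
        LocalDate date = LocalDate.parse(fields[0].trim()); // assumes ISO format, e.g. 2004-02-04
        int approval = Integer.parseInt(fields[1].trim());
        String who = fields[2].trim();
        return new PollRow(date, approval, who);
    }

    public static void main(String[] args) {
        PollRow row = parse("2004-02-04,53,fox");
        System.out.println(row.date + " " + row.approval + " " + row.who);
    }
}
```

Declaring the types up front means a malformed value fails at load time rather than surfacing later as a bad string in an analysis.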

Once the data is loaded, a good next step is to look at the structure of the table:

System.out.println(bushTable.structure().print());

The structure() method returns another table, which print() converts to a string:

Table: data/BushApproval.csv - 323 observations (rows) of 3 variables (cols)
Index Column Name Type       Unique Values First      Last 
0     date        LOCAL_DATE 288           2004-02-04 2001-02-09 
1     approval    INT        46            53         57 
2     who         STRING     6             fox        zogby

As you can see, structure provides the column index, name and type, as well as the number of unique values and the first and last value in each column.
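The per-column figures in that output are simple to compute by hand. The following is a rough plain-Java sketch of the idea, not Outlier's implementation; the class and method names are hypothetical, and the sample values are the first ten approval ratings from this data set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

class ColumnStructureSketch {
    // First ten approval values from the Bush approval data
    static final List<Integer> SAMPLE = Arrays.asList(53, 53, 58, 52, 52, 53, 52, 50, 58, 57);

    /** Number of distinct values in the column. */
    static int uniqueCount(List<?> values) {
        return new HashSet<Object>(values).size();
    }

    /** First value in row order (the column must not be empty). */
    static <T> T first(List<T> values) {
        return values.get(0);
    }

    /** Last value in row order (the column must not be empty). */
    static <T> T last(List<T> values) {
        return values.get(values.size() - 1);
    }

    public static void main(String[] args) {
        System.out.println("unique=" + uniqueCount(SAMPLE)
                + " first=" + first(SAMPLE)
                + " last=" + last(SAMPLE));
    }
}
```

Note that first and last refer to row order, not sorted order, which is why the date column above shows a later date first.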

The head(int n) method also returns a table, this one containing the first n rows of the target table. Again we convert to a String with print():

System.out.println(bushTable.head(10).print());

For our dataset, this produces the following output:

date       approval who
2004-02-04 53       fox
2004-01-21 53       fox
2004-01-07 58       fox
2003-12-03 52       fox
2003-11-18 52       fox
2003-10-28 53       fox
2003-10-14 52       fox
2003-09-23 50       fox
2003-09-09 58       fox
2003-08-12 57       fox
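Conceptually, head is just a copy of the leading rows. Here is a small sketch of that idea over a plain Java list; the HeadSketch name is hypothetical, and clamping n to the table size is our assumption rather than documented Outlier behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HeadSketch {
    /** Returns a new list holding the first n rows, or all rows if there are fewer than n. */
    static <T> List<T> head(List<T> rows, int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        // Copy the view so later changes to the source do not affect the result
        return new ArrayList<>(rows.subList(0, Math.min(n, rows.size())));
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "2004-02-04 53 fox",
                "2004-01-21 53 fox",
                "2004-01-07 58 fox");
        for (String row : head(rows, 2)) {
            System.out.println(row);
        }
    }
}
```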

Of course, if you want to print the entire table (and it’s not too big) you can simply call print() on the original table:

System.out.println(bushTable.print());

This produces the same output as head(), but includes every row in the table.

If you simply want the column names, you can use the columnNames() method

System.out.println(bushTable.columnNames());

which produces:

[date, approval, who]

Individual columns can be retrieved by name or by (zero-based) index. To get a column, use

bushTable.column("date");
bushTable.column(0);

which both return the same column for this dataset.
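One way to picture why the two lookups agree is to keep column names and column data in parallel lists, so that a name is resolved to a position first. This is a hypothetical sketch for illustration only; Outlier's internal layout is not described here and may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ColumnLookupSketch {
    private final List<String> names = new ArrayList<>();
    private final List<List<?>> columns = new ArrayList<>();

    /** Appends a column; its index is its position in insertion order. */
    void addColumn(String name, List<?> values) {
        names.add(name);
        columns.add(values);
    }

    /** Lookup by zero-based index. */
    List<?> column(int index) {
        return columns.get(index);
    }

    /** Lookup by name: resolve the name to an index, then delegate. */
    List<?> column(String name) {
        int index = names.indexOf(name);
        if (index < 0) {
            throw new IllegalArgumentException("no column named " + name);
        }
        return column(index);
    }

    public static void main(String[] args) {
        ColumnLookupSketch table = new ColumnLookupSketch();
        table.addColumn("date", Arrays.asList("2004-02-04", "2004-01-21"));
        table.addColumn("approval", Arrays.asList(53, 53));
        table.addColumn("who", Arrays.asList("fox", "fox"));
        System.out.println(table.column("date") == table.column(0));
    }
}
```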

You can print the values in a column using the column print() method.

Column approval = bushTable.column("approval");
System.out.println(approval.print());

approval: INT

You can also summarize the data in each column, in a column-type appropriate way.


produces, for example:

approval summary
Metric    Value              
n         323                
Mean      64.88235294117646  
Min       45.0               
1st Qu    56.0               
Median    63.0               
3rd Qu    73.0               
Max       90.0               
Range     45.0               
Sum       20957.0            
Std. Dev. 11.270465086514845 
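For readers who want to check such figures independently, here is a short plain-Java sketch that computes several of the same metrics for an int column using the standard library. It is not Outlier's code: the SummarySketch name is hypothetical, the sample is the first ten approval values, and the use of the sample (n - 1) standard deviation is an assumption, since the post does not say which definition the library uses. Quartiles are omitted because their interpolation method is likewise unspecified.

```java
import java.util.Arrays;
import java.util.IntSummaryStatistics;

class SummarySketch {
    // First ten approval values from the Bush approval data
    static final int[] SAMPLE = {53, 53, 58, 52, 52, 53, 52, 50, 58, 57};

    /** Count, sum, min, max and mean in one pass. */
    static IntSummaryStatistics basic(int[] values) {
        return Arrays.stream(values).summaryStatistics();
    }

    /** Median of a non-empty array; averages the two middle values for even lengths. */
    static double median(int[] values) {
        int[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Sample standard deviation (divides by n - 1); needs at least two values. */
    static double sampleStdDev(int[] values) {
        double mean = Arrays.stream(values).average().orElse(Double.NaN);
        double sumOfSquares = 0.0;
        for (int v : values) {
            sumOfSquares += (v - mean) * (v - mean);
        }
        return Math.sqrt(sumOfSquares / (values.length - 1));
    }

    public static void main(String[] args) {
        IntSummaryStatistics s = basic(SAMPLE);
        System.out.println("n         " + s.getCount());
        System.out.println("Mean      " + s.getAverage());
        System.out.println("Min       " + s.getMin());
        System.out.println("Median    " + median(SAMPLE));
        System.out.println("Max       " + s.getMax());
        System.out.println("Range     " + (s.getMax() - s.getMin()));
        System.out.println("Sum       " + s.getSum());
        System.out.println("Std. Dev. " + sampleStdDev(SAMPLE));
    }
}
```

As a quick sanity check on the table above, the reported Range (45.0) equals Max minus Min (90.0 - 45.0), and Sum divided by n (20957 / 323) gives the reported Mean.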

Data: The cake dataset is from

Code: The code can be found in the example on GitHub.
