jmpy

JMP Style Plotting and Modeling


License: MIT
Install: pip install jmpy==0.10.4

Documentation

JMPY

Jmpy is an analysis library created to simplify common plotting and modeling tasks. Simplicity, a common plotting signature, and ease of use are preferred over flexibility.

The goal is for each function to produce a mini-report that makes the data easier to visualize.


Currently, there are three modules: plotting, modeling, and bayes.
Plotting is used to make pretty graphs, and modeling is used to analyze data with visual output.
Bayes is currently in development to support easy Bayesian analysis and plotting, relying on significant assumptions about the underlying data.

This module relies heavily on statsmodels, patsy, pymc, and pandas.


Overview
Plotting

Grid Layout
Modeling

%load_ext autoreload
%autoreload 2

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('bmh')

import jmpy.plotting as jp
import jmpy.modeling as jm

import warnings
warnings.filterwarnings('ignore')

Plotting


Create some artificial data for our analysis:

nsamples = 250
xc = np.linspace(0, 100, nsamples)
xc2 = xc**2
xd = np.random.choice([1, 3, 5, 7], nsamples)
xe = np.random.choice([10, 30, 50], nsamples)
xf = np.random.choice([.1, .4], nsamples)
xg = np.random.normal(size=nsamples)*15

X = np.column_stack((xc, xc2, xd, xe))
beta = np.array([1, .01, 17, .001])

e = np.random.normal(size=nsamples)*10
ytrue = np.dot(X, beta)
y = ytrue + e

data = {}
data['xc'] = xc
data['xc2'] = xc2
data['xd'] = xd
data['xe'] = xe
data['xf'] = xf
data['xg'] = xg
data['y'] = y
data['ytrue'] = ytrue

df = pd.DataFrame.from_dict(data)
df.head()
     xc        xc2       xd  xe  xf          xg           y       ytrue
0  0.000000  0.000000   3  50  0.4   -9.081563   69.428600   51.050000
1  0.401606  0.161288   3  50  0.1    5.957173   66.168140   51.453219
2  0.803213  0.645151   7  10  0.1   -1.048492  110.819879  119.819664
3  1.204819  1.451589   7  50  0.4  -15.881455  111.754919  120.269335
4  1.606426  2.580604   3  50  0.4  -19.796739   57.810824   52.682232

Let's start to visualize how our artificial data looks. First, plot a histogram of the results:

jp.histogram('y', df)

png

If you want to look at the data color-coded by a categorical variable, specify the "legend" parameter:

jp.histogram('y', df, legend='xd', bins=50, cumprob=True)

png

You can also create cumulative probability (cumprob) plots:

jp.cumprob('y', df, legend='xd', marker='D')

png


Let's look at the data with a boxplot to see if there is any difference between the groups defined by xd:

fig = jp.boxplot(x='xd', y='y', data=df, legend='xd', cumprob=True)
fig

png

Violin plots can also be created with the boxplot function:

fig = jp.boxplot(x='xd', y='y', data=df, legend='xd', cumprob=True, violin=True)
fig

png


You can also create a scatter plot with a fit.

jp.scatter(x='xc', y='y', data=df, legend='xd', fit='linear', marker='^')

png

We can generate the same graph using the arrays directly, without creating a pandas DataFrame. You can fit the data by specifying the fit parameter; currently, linear, quadratic, smooth, and interpolate fits are supported.

jp.scatter(xc, y=y, legend=xd, fit='linear', marker='^')

png

jp.scatter('xc2', 'y', df, legend='xd', fit='quadratic', marker='o')

png

# fitparams get passed into the fitting functions.  Smoothing uses the scipy UnivariateSpline function.
jp.scatter(x='xc2', y='y', data=df, legend='xd', fit='smooth', fitparams={'s': 1e6})

png
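The fit list above also mentions interpolate. As a minimal sketch (the exact interpolation behavior may vary between jmpy versions; output not shown):

# Assumption: fit='interpolate' draws a point-to-point interpolation through the data
# rather than a smoothed or parametric fit.
jp.scatter(x='xc', y='y', data=df, legend='xd', fit='interpolate', marker='s')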

Contour plots can be created as well:

cdf = pd.DataFrame()
x = np.random.triangular(-1, 0, 1, 400)
y = np.random.triangular(-1, 0, 1, 400)
z = np.sqrt(x**2 + 2*y**2)
cdf['x'] = x
cdf['y'] = y
cdf['z'] = z
cdf['Gx'] = np.random.choice(['X1', 'X2'], size=400)
cdf['Gy'] = np.random.choice(['Y1', 'Y2', 'Y3'], size=400)
cdf['Gz'] = np.random.choice(['Z1', 'Z2'], size=400)

jp.contour(x='x', y='y', z='z', data=cdf, cmap='YlGnBu', figsize=(7,5), marker='.')

png

Grid

JMPY can create chart grids using the jp.grid function. You can specify rows, columns, or both, and each chart in the grid is created from the data filtered for that cell.

jp.grid(cols='Gz', data=cdf, 
        chart=jp.contour, args={'x':'x', 'y':'y', 'z':'z', 'marker':'.', 'cmap':'YlGnBu', 
                                'axislabels':False, 'axisticks':False},
        figsize=(8, 4),
       )

png

jp.grid(rows='Gx', cols='Gy', data=cdf, 
        chart=jp.scatter, args={'x':'x', 'y':'y', 'marker':'D', 'fit': 'linear'},
        figsize=(12, 7),
        legend='Gz')

png

jp.grid(cols='Gx', data=cdf, 
        chart=jp.boxplot, args={'x':'Gz', 'y':'y', 'marker':'D'},
        figsize=(12, 7),
        legend='Gz')

png
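For completeness, a rows-only layout works the same way; here is a sketch reusing the scatter arguments from above (output not shown):

# One scatter chart per level of Gy, stacked vertically and colored by Gz.
jp.grid(rows='Gy', data=cdf,
        chart=jp.scatter, args={'x': 'x', 'y': 'y', 'marker': '.'},
        figsize=(6, 9),
        legend='Gz')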


Modeling

Ordinary Least Squares

Now that we have visualized some of the data, let's do some modeling. jmpy currently supports two types of linear modeling: ordinary least squares (OLS) and the robust linear model (RLM), both built on statsmodels. Let's do OLS first.

All models are specified using patsy text formulas, which are very similar to R's formula syntax.

By default, only 80% of the data is used to fit the model; the remaining 20% is plotted alongside it to validate the model results. This can be changed by specifying a different sample_rate parameter.

model = 'y ~ xc + xc2 + xg'
jm.fit(model, data=df, sample_rate=.8, model_type='ols');

png
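For example, fitting on 60% of the rows and validating on the remaining 40% is just a matter of changing sample_rate (output not shown):

# Hold out 40% of the data for the validation overlay.
jm.fit(model, data=df, sample_rate=.6, model_type='ols');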


Robust Linear Model

Now... what if we had a couple of outliers in our dataset? Let's create some:

dfo = df.copy()
p = np.random.uniform(0, 1, size=dfo.y.shape)
err = np.random.normal(size=dfo.y.shape)

dfo['yout'] = dfo.y + np.greater_equal(p, 0.9) * (dfo.y * err)

Our coefficient estimates will be skewed due to the outliers.

model = 'yout ~ xc + xc2 + C(xd)'
jm.fit(model, data=dfo, sample_rate=.8, model_type='ols');

png

Employing the robust linear model, we can minimize the influence of the outliers and get better coefficient estimates.

model = 'yout ~ xc + xc2 + C(xd)'
jm.fit(model, data=dfo, sample_rate=.8, model_type='rlm');

png

Our parameter estimates from the robust linear model are much closer to the truth than those from OLS.
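As a quick sanity check, the coefficients used to generate the data are still available in the beta array defined earlier, so you can compare them against the estimates reported in the fit summaries above:

# True generating coefficients, in the order the columns were stacked: xc, xc2, xd, xe
print(dict(zip(['xc', 'xc2', 'xd', 'xe'], beta)))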

Overfitting

Next, we will test how robust the parameter estimates are by running many iterations of the model on random subsets of the data and then looking at the distributions of the parameter estimates.

model = 'y ~ xc + xc2 + C(xd)'
jm.check_estimates(model, data=df, sample_rate=.8, model_type='ols', iterations=275);

png