jmpy

JMP Style Plotting and Modeling


License: MIT
Install: pip install jmpy==0.10.4

Documentation

JMPY

Jmpy is an analysis library created to simplify common plotting and modeling tasks. Simplicity, a common plotting signature, and ease of use are preferred over flexibility.

The goal is for each function to produce a mini-report that makes the data easier to visualize.


Currently, there are three modules: plotting, modeling, and bayes.
Plotting is used to make pretty graphs, and modeling is used to analyze data with visual output.
Bayes is currently in development to support easy Bayesian analysis and plotting, relying on significant assumptions about the underlying data.

This module relies heavily on statsmodels, patsy, pymc, and pandas.


Overview
Plotting

Grid Layout
Modeling

%load_ext autoreload
%autoreload 2

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('bmh')

import jmpy.plotting as jp
import jmpy.modeling as jm

import warnings
warnings.filterwarnings('ignore')

Plotting


Create some artificial data for our analysis:

nsamples = 250
xc = np.linspace(0, 100, nsamples)
xc2 = xc**2
xd = np.random.choice([1, 3, 5, 7], nsamples)
xe = np.random.choice([10, 30, 50], nsamples)
xf = np.random.choice([.1, .4], nsamples)
xg = np.random.normal(size=nsamples)*15

X = np.column_stack((xc, xc2, xd, xe))
beta = np.array([1, .01, 17, .001])

e = np.random.normal(size=nsamples)*10
ytrue = np.dot(X, beta)
y = ytrue + e

data = {}
data['xc'] = xc
data['xc2'] = xc2
data['xd'] = xd
data['xe'] = xe
data['xf'] = xf
data['xg'] = xg
data['y'] = y
data['ytrue'] = ytrue

df = pd.DataFrame.from_dict(data)
df.head()
     xc        xc2       xd  xe  xf          xg           y       ytrue
0  0.000000  0.000000   3  50  0.4   -9.081563   69.428600   51.050000
1  0.401606  0.161288   3  50  0.1    5.957173   66.168140   51.453219
2  0.803213  0.645151   7  10  0.1   -1.048492  110.819879  119.819664
3  1.204819  1.451589   7  50  0.4  -15.881455  111.754919  120.269335
4  1.606426  2.580604   3  50  0.4  -19.796739   57.810824   52.682232

Let's start to visualize how our artificial data looks. First, plot a histogram of the results:

jp.histogram('y', df)

png

If you want to look at the data color-coded by a categorical variable, specify the "legend" parameter:

jp.histogram('y', df, legend='xd', bins=50, cumprob=True)

png

You can also create cumulative probability (cumprob) plots:

jp.cumprob('y', df, legend='xd', marker='D')

png


Let's look at the data with a boxplot to see if there is any difference between the groups defined by xd:

fig = jp.boxplot(x='xd', y='y', data=df, legend='xd', cumprob=True)
fig

png

Violin plots can also be created with the boxplot function:

fig = jp.boxplot(x='xd', y='y', data=df, legend='xd', cumprob=True, violin=True)
fig

png


You can also create a scatter plot with a fit.

jp.scatter(x='xc', y='y', data=df, legend='xd', fit='linear', marker='^')

png

We can generate the same graph using the arrays directly, without creating a pandas DataFrame. You can fit the data by specifying the fit parameter; currently, linear, quadratic, smooth, and interpolate fits are supported.

jp.scatter(xc, y=y, legend=xd, fit='linear', marker='^')

png

jp.scatter('xc2', 'y', df, legend='xd', fit='quadratic', marker='o')

png

# fitparams get passed into the fitting functions.  Smoothing uses the scipy UnivariateSpline function.
jp.scatter(x='xc2', y='y', data=df, legend='xd', fit='smooth', fitparams={'s': 1e6})

png
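The fit list above also mentions interpolate. As a minimal sketch (the exact interpolation behavior may vary between jmpy versions; output not shown):

# Assumption: fit='interpolate' draws a point-to-point interpolation through the data
# rather than a smoothed or parametric fit.
jp.scatter(x='xc', y='y', data=df, legend='xd', fit='interpolate', marker='s')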

Contour plots can be created as well:

cdf = pd.DataFrame()
x = np.random.triangular(-1, 0, 1, 400)
y = np.random.triangular(-1, 0, 1, 400)
z = np.sqrt(x**2 + 2*y**2)
cdf['x'] = x
cdf['y'] = y
cdf['z'] = z
cdf['Gx'] = np.random.choice(['X1', 'X2'], size=400)
cdf['Gy'] = np.random.choice(['Y1', 'Y2', 'Y3'], size=400)
cdf['Gz'] = np.random.choice(['Z1', 'Z2'], size=400)

jp.contour(x='x', y='y', z='z', data=cdf, cmap='YlGnBu', figsize=(7,5), marker='.')

png

Grid

JMPY can create chart grids using the jp.grid function. You can specify rows, columns, or both, and each chart in the grid is created from the data filtered for that cell.

jp.grid(cols='Gz', data=cdf, 
        chart=jp.contour, args={'x':'x', 'y':'y', 'z':'z', 'marker':'.', 'cmap':'YlGnBu', 
                                'axislabels':False, 'axisticks':False},
        figsize=(8, 4),
       )

png

jp.grid(rows='Gx', cols='Gy', data=cdf, 
        chart=jp.scatter, args={'x':'x', 'y':'y', 'marker':'D', 'fit': 'linear'},
        figsize=(12, 7),
        legend='Gz')

png

jp.grid(cols='Gx', data=cdf, 
        chart=jp.boxplot, args={'x':'Gz', 'y':'y', 'marker':'D'},
        figsize=(12, 7),
        legend='Gz')

png
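For completeness, a rows-only layout works the same way; here is a sketch reusing the scatter arguments from above (output not shown):

# One scatter chart per level of Gy, stacked vertically and colored by Gz.
jp.grid(rows='Gy', data=cdf,
        chart=jp.scatter, args={'x': 'x', 'y': 'y', 'marker': '.'},
        figsize=(6, 9),
        legend='Gz')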


Modeling

Ordinary Least Squares

Now that we have visualized some of the data, let's do some modeling. jmpy currently supports two types of linear modeling: ordinary least squares (OLS) and the robust linear model (RLM), both built on statsmodels. Let's do OLS first.

All models are specified using patsy text formulas, which are very similar to R's formula syntax.

By default, only 80% of the data is used to fit the model; the remaining 20% is plotted alongside it to validate the model results. This can be changed by specifying a different sample_rate parameter.

model = 'y ~ xc + xc2 + xg'
jm.fit(model, data=df, sample_rate=.8, model_type='ols');

png
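For example, fitting on 60% of the rows and validating on the remaining 40% is just a matter of changing sample_rate (output not shown):

# Hold out 40% of the data for the validation overlay.
jm.fit(model, data=df, sample_rate=.6, model_type='ols');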


Robust Linear Model

Now... what if we had a couple of outliers in our dataset? Let's create some:

dfo = df.copy()
p = np.random.uniform(0, 1, size=dfo.y.shape)
err = np.random.normal(size=dfo.y.shape)

dfo['yout'] = dfo.y + np.greater_equal(p, 0.9) * (dfo.y * err)

Our coefficient estimates will be skewed due to the outliers.

model = 'yout ~ xc + xc2 + C(xd)'
jm.fit(model, data=dfo, sample_rate=.8, model_type='ols');

png

Employing the robust linear model, we can minimize the influence of the outliers and get better coefficient estimates.

model = 'yout ~ xc + xc2 + C(xd)'
jm.fit(model, data=dfo, sample_rate=.8, model_type='rlm');

png

Our parameter estimates from the robust linear model are much closer to the truth than those from OLS.
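As a quick sanity check, the coefficients used to generate the data are still available in the beta array defined earlier, so you can compare them against the estimates reported in the fit summaries above:

# True generating coefficients, in the order the columns were stacked: xc, xc2, xd, xe
print(dict(zip(['xc', 'xc2', 'xd', 'xe'], beta)))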

Overfitting

Next, we will test how robust the parameter estimates are by running many iterations of the model on random subsets of the data and then looking at the distributions of the parameter estimates.

model = 'y ~ xc + xc2 + C(xd)'
jm.check_estimates(model, data=df, sample_rate=.8, model_type='ols', iterations=275);

png