SALMON

Salmon is a package for symbolic algebra of linear regression and modeling. The goal for this package is to ease the process of model building for linear regression by separating the model (with all its interactions and variables) from the data being used to fit it.

If you would like to use Salmon, you can install it by first cloning the repository, navigating to the repository directory, and executing the following commands (inside a virtual environment if you prefer):

> python setup.py build
> python setup.py install

From there, the package should be installed into your Python environment and can be accessed by importing the package name salmon.

References

To understand how to use salmon, please refer either to our paper or the documentation below.

Using salmon can be broken into three stages:

Defining the model
Fitting the model
Using the model

As such, the documentation will be broken up into these three parts.

# Setup
import pandas as pd
from salmon import *
%matplotlib inline
data = pd.read_csv("./data/harris.csv")
data = data[data['Educ'] != 10].reset_index(drop=True) # Remove single outlier

Model Definition

A model is defined by quantitative or categorical variables that appear on their own, within interactions, within linear combinations, or within a mix of the latter two.

Variable Types

A variable in this package represents a variable as you would commonly see it in the definition of a regression model, like so: $$f(\tt{Var}) = \beta_0 + \beta_1 \tt{Var_1} + \beta_2 \tt{Var_2} + \beta_3 \tt{Var_2}^2 + \beta_4 \tt{Var_1}*{Var_2} + \beta_5 \tt{Var_1} * {Var_2}^2$$ where $\tt{Var_i}$ represents either a quantitative or categorical variable.

Salmon represents these symbolically. When defining the variables, there are three options to pick from:

# Variables known to be quantitative.
quant_var = Q("Bsal")
# Variables known to be categorical.
cat_var = C("Sex")
# Variables of unknown type. These will be interpreted when fitting as either categorical or quantitative.
interp_var = Var("Educ")

The string passed into each variable is the name of the column to extract from a pandas DataFrame when fitting on a set of data. For instance, if we defined a model with Q("Bsal"), the model would extract the Bsal column from the data passed in.

Quantitative Variables

Common transformations of quantitative variables are also supported. For example:

bsal_squared = Q("Bsal") ** 2
bsal_shifted = Q("Bsal") + 150
bsal_logged = Log(Q("Bsal"))
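
To illustrate how a transformation can be recorded symbolically and applied only once data is supplied, here is a minimal sketch in plain Python. The `Quant` class, the `log` helper, and the `evaluate` method are hypothetical names invented for this illustration; they are not salmon's classes, and salmon's internals may work differently.

```python
import math

class Quant:
    """Hypothetical stand-in for a quantitative variable:
    a column name plus a transform that is applied later."""
    def __init__(self, name, transform=lambda v: v, label=None):
        self.name = name
        self.transform = transform
        self.label = label or name

    def __pow__(self, power):
        # Record the power; nothing is computed until evaluate() is called.
        return Quant(self.name, lambda v: self.transform(v) ** power,
                     f"({self.label})^{power}")

    def __add__(self, constant):
        return Quant(self.name, lambda v: self.transform(v) + constant,
                     f"{self.label}+{constant}")

    def evaluate(self, data):
        # Look the column up by name only when data is finally provided.
        return [self.transform(v) for v in data[self.name]]

def log(var):
    return Quant(var.name, lambda v: math.log(var.transform(v)),
                 f"log({var.label})")

data = {"Bsal": [4000, 5000, 6000]}  # made-up values
print((Quant("Bsal") ** 2).evaluate(data))   # [16000000, 25000000, 36000000]
print((Quant("Bsal") + 150).evaluate(data))  # [4150, 5150, 6150]
```

The point of the design is that the expression carries only a column name and a recipe, so the same definition can be reused against any dataset that has a matching column.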

Categorical Variables

When defining categorical variables, it is possible to set ahead of time the possible levels/factors to fit with, as well as the encoding method to use. For instance, if we wanted to treat the Educ column in our example dataset as a categorical variable, and we knew the possible levels of education were 8, 10, 12, 15, or 16, then we could define our variable as follows:

educ_var_v1 = C("Educ", method = 'one-hot', levels = [8, 10, 12, 15, 16])
educ_var_v2 = C("Educ", method = 'one-hot', levels = [8, 10, 12])

The first definition sets the order in which the levels are interpreted. This would matter with encoding methods such as ordinal encoding (note: only one-hot encoding is currently supported). In our case, with the one-hot encoding method, the ordering designates the '8' level as the one dropped to avoid multicollinearity.

The second definition still designates the '8' level to be dropped; however, it also designates that any levels found in the data other than '8', '10', or '12' will be binned into an 'other' category.

By default, categorical variables will use a one-hot encoding method and will dynamically extract the possible levels of a variable upon fitting. The levels will be ordered by sorting and the smallest (according to Python's sorted function) level will be dropped.
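
The encoding rules described above can be sketched in plain Python as follows. This is an illustration of the described behavior under stated assumptions, not salmon's implementation: the `one_hot` function is hypothetical, and the `Name::level` column naming is borrowed from the fitted-coefficient table shown later in this document.

```python
def one_hot(name, values, levels=None):
    """Encode values as 0/1 indicator columns.
    The first level is treated as the baseline and gets no column."""
    if levels is None:
        # Default: take the levels from the data, ordered by sorted().
        levels = sorted(set(values))
        has_other = False
    else:
        levels = list(levels)
        has_other = any(v not in levels for v in values)

    columns = {}
    for level in levels[1:]:  # skip the baseline level
        columns[f"{name}::{level}"] = [1 if v == level else 0 for v in values]
    if has_other:
        # Values outside the declared levels are binned together.
        columns[f"{name}::other"] = [0 if v in levels else 1 for v in values]
    return columns

educ = [8, 12, 15, 16, 12, 8]  # made-up values
print(one_hot("Educ", educ))
print(one_hot("Educ", educ, levels=[8, 10, 12]))
```

In the second call, the rows with 15 and 16 fall into the `Educ::other` column, while `Educ::10` is all zeros because no row has that level.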

Combinations

Many regression applications require multiple variables within the model. This is achieved in salmon by simply adding together several variables. For instance, suppose we wanted to represent: $$\tt{Sex} + \tt{Bsal} + \tt{Bsal}^2$$ This would be achieved like so:

combo = C("Sex") + Q("Bsal") + Q("Bsal") ** 2

Should you want to define a full polynomial sequence, you can use the following command as well:

combo = C("Sex") + Q("Bsal", 4) # Expands to 'Sex + Bsal + Bsal^2 + Bsal^3 + Bsal^4'
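
As a rough picture of what a polynomial expansion produces once data is available, the sketch below builds one column per power from a single input column. The `poly_columns` function is a hypothetical helper written for this illustration, and the `(Bsal)^2` style labels are assumed from the coefficient table shown later in this document.

```python
def poly_columns(name, values, degree):
    """Return one column per power of `values`, from 1 up to `degree`."""
    columns = {name: list(values)}
    for power in range(2, degree + 1):
        columns[f"({name})^{power}"] = [v ** power for v in values]
    return columns

expanded = poly_columns("Bsal", [2, 3], 4)  # made-up values
print(list(expanded))        # ['Bsal', '(Bsal)^2', '(Bsal)^3', '(Bsal)^4']
print(expanded["(Bsal)^4"])  # [16, 81]
```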

Interactions

It is common to want to model interaction effects between variables. Salmon supports this symbolically using the * operator. Any combination of variable type is supported.

For example, let's model this interaction: $$\tt{Sex * Bsal}$$

interaction = C("Sex") * Q("Bsal")

Here is how we would model a more complicated linear combination of variables and interactions like $$\tt{Sex} + \tt{Bsal} + \tt{Bsal}^2 + \tt{Sex * Bsal} + \tt{Sex}*{Bsal}^2$$

complicated_combo = C("Sex") + Q("Bsal") + Q("Bsal")**2 + C("Sex")*Q("Bsal") + C("Sex")*Q("Bsal")**2

Salmon also supports distributing a single term across a combination. The above expression could be represented more succinctly as follows:

# Equivalent to C("Sex") + Q("Bsal") + Q("Bsal")**2 + (Q("Bsal") + Q("Bsal")**2) * C("Sex") 
complicated_combo = C("Sex") + Poly("Bsal", 2) + Poly("Bsal", 2) * C("Sex")
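
The distribution rule can be sketched with a tiny symbolic class in which `+` collects terms and `*` multiplies every term on the left by every term on the right. The `Expr` class and `term` helper are hypothetical and exist only to illustrate the algebra; salmon's own expression classes are not shown here.

```python
class Expr:
    """A sum of terms; each term is a sorted tuple of factor names."""
    def __init__(self, terms):
        self.terms = list(dict.fromkeys(terms))  # de-duplicate, keep order

    def __add__(self, other):
        return Expr(self.terms + other.terms)

    def __mul__(self, other):
        # Distribute: pair every left term with every right term.
        return Expr([tuple(sorted(a + b))
                     for a in self.terms for b in other.terms])

    def __str__(self):
        return " + ".join("*".join(t) for t in self.terms)

def term(name):
    return Expr([(name,)])

sex, bsal, bsal_sq = term("Sex"), term("Bsal"), term("Bsal^2")
poly = bsal + bsal_sq

short_form = sex + poly + poly * sex
long_form = sex + bsal + bsal_sq + bsal * sex + bsal_sq * sex
print(short_form)  # Sex + Bsal + Bsal^2 + Bsal*Sex + Bsal^2*Sex
print(short_form.terms == long_form.terms)  # True
```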

Representing the Model

Now that we understand how to form expressions of variables, we can represent our models. A LinearModel is always defined in the form:

model = LinearModel(explanatory_expression, response_expression)

The explanatory_expression may be a single term, an interaction, or a combination of terms and interactions. The response_expression may be either a single term or an interaction. Categorical variables are allowed within the response_expression so long as, after encoding, the resulting expansion is represented by only one column.

Fitting the Model

For an example, let us fit this model:

$$\widehat{Sal77}(Sex, Bsal) = \beta_0 + \beta_1 \tt{Sex} + \beta_2 \tt{Bsal} + \beta_3 \tt{Bsal}^2 + \beta_4 \tt{Sex * Bsal} + \beta_5 \tt{Sex}*{Bsal}^2$$

First we must define our model:

explanatory = C("Sex") + Poly("Bsal", 2) + C("Sex") * Poly("Bsal", 2)
response = Q("Sal77")
model = LinearModel(explanatory, response)

Note how we did not need a term for $\beta_0$ (the intercept). This is because it is not a part of our explanatory expression of variables, but rather inherent in the model definition. Had we wanted to define our model without an intercept, we would define it as LinearModel(explanatory, response, intercept = False).

Now that we have our model defined, we must fit it to the data to compute all $\beta_i$ values. We do this like so:

model.fit(data)

                        Coefficients            SE          t         p
Intercept               15244.585697  15004.866623   1.015976  0.312491
Sex::Male              -15746.353511  20175.689005  -0.780462  0.437262
Bsal                       -2.333748      5.847857  -0.399078  0.690825
(Bsal)^2                    0.000242      0.000566   0.427475  0.670102
{Bsal}{Sex::Male}           5.472859      7.291904   0.750539  0.454979
{(Bsal)^2}{Sex::Male}      -0.000423      0.000666  -0.636137  0.526377

Notice how we did not designate separate datasets for the explanatory and response expressions. The model assumes that all variables used in its definition will be found within the one DataFrame passed in as an argument. Also notice how we did not have to transform our original dataset to include the transformations and interactions. This was all done internally at runtime while fitting the model to the data.
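
For readers who want to see what computing the $\beta_i$ values involves, here is a minimal ordinary least squares sketch in plain Python that solves the normal equations. It is not salmon's fitting code: `solve` and `fit_ols` are hypothetical helpers, the data are made up, and the `intercept` flag only mirrors the option described above.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(columns, y, intercept=True):
    """Least-squares coefficients of y on the given columns."""
    rows = [list(r) for r in zip(*columns)]
    if intercept:
        rows = [[1.0] + r for r in rows]  # constant column for beta_0
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

x = [1.0, 2.0, 3.0, 4.0]
x_squared = [v ** 2 for v in x]          # transformed column built on the fly
y = [2 + 3 * v + 0.5 * v ** 2 for v in x]  # made-up, exactly quadratic
beta = fit_ols([x, x_squared], y)
print([round(b, 6) for b in beta])
```

Because the made-up response is an exact quadratic, the recovered coefficients should match 2, 3, and 0.5 up to floating-point error. Real libraries typically use a more numerically stable decomposition than the normal equations.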

Using the Model

Now that our model is fit, we can do a variety of things with it.

First off, the most common use of a model would be to make predictions with new data. This would be done like so:

model.predict(data)

	Predicted Sal77
0	10714.633449
1	12079.759519
2	11806.937185
3	11806.937185
4	11806.937185
5	12488.612620
6	13031.467692
7	11806.937185
8	11806.937185
9	12527.514782
10	12527.514782
11	11163.403112
12	11806.937185
13	11806.937185
14	9640.923816
15	9621.847633
16	9673.291727
17	9673.291727
18	9621.847633
19	9621.847633
20	9703.587918
21	9740.858177
22	9703.587918
23	9809.839940
24	9826.146597
25	9621.847633
26	10031.820474
27	9660.758907
28	9640.923816
29	9668.368680
...	...
62	12319.952051
63	11501.485049
64	11806.937185
65	11806.937185
66	11806.937185
67	11806.937185
68	10131.682604
69	10944.891645
70	12319.952051
71	11163.403112
72	11806.937185
73	11163.403112
74	11806.937185
75	9809.839940
76	9703.587918
77	9656.492266
78	10153.107740
79	9959.679880
80	9640.923816
81	9621.847633
82	9640.923816
83	9809.839940
84	9703.587918
85	9640.923816
86	9621.847633
87	9959.679880
88	9668.368680
89	9762.108581
90	9631.324124
91	9660.758907

92 rows × 1 columns

The only restriction is that the new DataFrame passed in must contain every column name used in the original model definition.
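
The column-name requirement can be sketched as follows. This is a simplified illustration, not salmon's predict method: the `predict` function is hypothetical, the coefficients are made up, and it assumes at least one explanatory column whose values are already in their final (transformed) form.

```python
def predict(coefficients, new_data):
    """Apply fitted coefficients to new rows; columns are found by name."""
    needed = [name for name in coefficients if name != "Intercept"]
    missing = [name for name in needed if name not in new_data]
    if missing:
        raise KeyError(f"new data is missing columns: {missing}")
    n_rows = len(new_data[needed[0]])  # assumes at least one column is needed
    base = coefficients.get("Intercept", 0.0)
    return [base + sum(coefficients[name] * new_data[name][i] for name in needed)
            for i in range(n_rows)]

coefficients = {"Intercept": 1.0, "x": 2.0}             # made-up values
new_data = {"x": [0.0, 1.5, 3.0], "unused": [9, 9, 9]}  # extra columns are ignored
print(predict(coefficients, new_data))  # [1.0, 4.0, 7.0]
```

Extra columns in the new data do no harm; only a missing column stops the prediction.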

Plotting

Should the model's definition fall under certain categories, plotting the original training data against the linear fit is available as well. Currently, plotting supports models whose explanatory expressions consist of only categorical variables, or of exactly one quantitative variable and zero or more categorical variables. Some example plots would look as follows:

ex_explanatory = C("Sex")
ex_response = Q("Sal77")
ex_model = LinearModel(ex_explanatory, ex_response)
ex_model.fit(data)
ex_model.plot()

ex_explanatory = C("Sex") + C("Educ") + C("Sex") * C("Educ")
ex_response = Q("Sal77")
ex_model = LinearModel(ex_explanatory, ex_response)
ex_model.fit(data)
ex_model.plot()

ex_explanatory = Q("Bsal")
ex_response = Q("Sal77")
ex_model = LinearModel(ex_explanatory, ex_response)
ex_model.fit(data)
ex_model.plot()

ex_explanatory = Poly("Bsal", 2)
ex_response = Q("Sal77")
ex_model = LinearModel(ex_explanatory, ex_response)
ex_model.fit(data)
ex_model.plot()

ex_explanatory = C("Sex") + Poly("Bsal", 2) + C("Sex") * Poly("Bsal", 2)
ex_response = Q("Sal77")
ex_model = LinearModel(ex_explanatory, ex_response)
ex_model.fit(data)
ex_model.plot()

Diagnostics

There are two main diagnostic tools that salmon offers to let you evaluate the performance of your model and ensure that no assumptions are broken: residual plots and partial regression plots. Residual plots, for example, can be accessed like so:

model.residual_plots()

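
As background for reading residual plots, the sketch below computes residuals (observed minus fitted) for a one-variable least-squares line in plain Python. The `simple_fit` helper and the data are made up for illustration and are unrelated to salmon's plotting code.

```python
def simple_fit(x, y):
    """Closed-form least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up data
b0, b1 = simple_fit(x, y)

fitted = [b0 + b1 * a for a in x]
residuals = [obs - fit for obs, fit in zip(y, fitted)]
print([round(r, 3) for r in residuals])

# With an intercept, least-squares residuals sum to (numerically) zero
# and are uncorrelated with the explanatory variable.
print(abs(sum(residuals)) < 1e-9)
print(abs(sum(r * a for r, a in zip(residuals, x))) < 1e-9)
```

A residual plot simply scatters these residuals against the fitted values or an explanatory variable; visible curvature or a funnel shape suggests that a model assumption may not hold.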