salmon-lm

A symbolic algebra based linear regression tool.


Keywords
modeling, symbolic, regression
License
GPL-3.0
Install
pip install salmon-lm==1.2

Documentation

SALMON

Salmon is a package for symbolic algebra of linear regression and modeling. The goal for this package is to ease the process of model building for linear regression by separating the model (with all its interactions and variables) from the data being used to fit it.

If you would like to use Salmon, you can install it by first cloning the repository, navigating to the repository directory, and execute the following commands (inside a virtual environment if you prefer):

> python setup.py build
> python setup.py install

From there, the package should be installed into your Python environment and can be accessed by importing the package name salmon.

References

To understand how to use salmon, please refer either to our paper or the documentation below.

Using salmon can be defined in three stages:

  1. Defining the model
  2. Fitting the model
  3. Using the model

As such, the documentation will be broken up into these three parts.

# Setup
import pandas as pd
from salmon import *
%matplotlib inline
data = pd.read_csv("./data/harris.csv")
data = data[data['Educ'] != 10].reset_index(drop=True) # Remove single outlier

Model Definition

A model is defined by a quatitative or categorical variables existing either stand alone, within interactions, within linear combinations, or a mix of the latter two.

Variable Types

A variable in this package represts the variables you would commonly see in a definition of a regression model like so: $$f(\tt{Var}) = \beta_0 + \beta_1 \tt{Var_1} + \beta_2 \tt{Var_2} + \beta_3 \tt{Var_2}^2 + \beta_4 \tt{Var_1}*{Var_2} + \beta_5 \tt{Var_1} * {Var_2}^2$$ where $\tt{Var_i}$ represents either a quantitative or categorical variable.

Salmon represents these symbolically. When defining the variables, there are three options to pick from:

# Variables known to be quantitative.
quant_var = Q("Bsal")
# Variables known to be categorical.
cat_var = C("Sex")
# Variables of unknown type. These will be interpreted when fitting as either categorical or quantitative.
interp_var = Var("Educ")

The string passed into the variables are the column names to extract from a pandas DataFrame when fitting on a set of data. So for instance, if we defined a model with Q("Bsal") then the model would extract the Bsal column to work with from the data passed in.

Quantitative Variables

Common transformations of quantitative variables are also supported. For example:

bsal_squared = Q("Bsal") ** 2
bsal_shifted = Q("Bsal") + 150
bsal_logged = Log(Q("Bsal"))

Categorical Variables

When defining categorical variables it is possible to set ahead of time the possible levels/factors to fit with, as well as the encoding method for use. For instance, if we wanted to treat the Educ column in our example dataset as a categorical variable, and we knew the possible levels of education were either 8, 10, 12, 15, or 16, then we could define our variable as follows:

educ_var_v1 = C("Educ", method = 'one-hot', levels = [8, 10, 12, 15, 16])
educ_var_v2 = C("Educ", method = 'one-hot', levels = [8, 10, 12])

The first variable defined set the order of the levels to interpreted as. This would matter with encoding methods such as ordinal encoding (note: currently not supported, only one-hot encoding is supported at this time). In our case with the one-hot encoding method used, our ordering designated the '8' level to be dropped to avoid multi-colinearity.

The second variable defined still designated the '8' level to be dropped; however, it also designates that any levels found in the data that are not either '8', '10', or '12' will be binned into an 'other' category.

By default, categorical variables will use a one-hot encoding method and will dynamically extract the possible levels of a variable upon fitting. The levels will be ordered by sorting and the smallest (according to Python's sorted function) level will be dropped.

Combinations

Many regression applications require multiple variables within the model. This is achieved in salmon by simply adding together several variables. For instance, suppose we wanted to represent: $$\tt{Sex} + \tt{Bsal} + \tt{Bsal}^2$$ This would be achieved like so:

combo = C("Sex") + Q("Bsal") + Q("Bsal") ** 2

Should you want to define an full polynomial sequence you can use the following command as well:

combo = C("Sex") + Q("Bsal", 4) # Expands to 'Sex + Bsal + Bsal^2 + Bsal^3 + Bsal^4'

Interactions

It is common to want to model interaction effects between variables. Salmon supports this symbolically using the * operator. Any combination of variable type is supported.

For example, let's model this interaction: $$\tt{Sex * Bsal}$$

interaction = C("Sex") * Q("Bsal")

Here is how we would model a more complicated linear combination of variables and interactions like $$\tt{Sex} + \tt{Bsal} + \tt{Bsal}^2 + \tt{Sex * Bsal} + \tt{Sex}*{Bsal}^2$$

complicated_combo = C("Sex") + Q("Bsal") + Q("Bsal")**2 + C("Sex")*Q("Bsal") + C("Sex")*Q("Bsal")**2

Salmon also supports distribution of singular terms into combinations. The above expression could be represented more succinctly as such:

# Equivalent to C("Sex") + Q("Bsal") + Q("Bsal")**2 + (Q("Bsal") + Q("Bsal")**2) * C("Sex") 
complicated_combo = C("Sex") + Poly("Bsal", 2) + Poly("Bsal", 2) * C("Sex")

Representing the Model

Now that we understand how to form expressions of variables, we can now represent our models. LinearModels are always defined of the form:

model = LinearModel(explanatory_expression, response_expression)

The explanatory_expression is allowed to be a single term, an interaction, or a combination of the other two. The response_expression is allowed to be either a single term or an interaction. Categorical variables are allowed within the response_expression so long after encoding the resultant expansion is represented by only one column.

Fitting the Model

For an example, let us fit this model:

$$\widehat{Sal77}(Sex, Bsal) = \beta_0 + \beta_1 \tt{Sex} + \beta_2 \tt{Bsal} + \beta_3 \tt{Bsal}^2 + \beta_4 \tt{Sex * Bsal} + \beta_5 \tt{Sex}*{Bsal}^2$$

First we must define our model:

explanatory = C("Sex") + Poly("Bsal", 2) + C("Sex") * Poly("Bsal", 2)
response = Q("Sal77")
model = LinearModel(explanatory, response)

Note how we did not need a term for the $\beta_0$ (the intercept). This is because it is not a part of our explantory expression of variables, but rather inherent in the model definition. Should we have wanted to define our model without an intercept, we would define it as LinearModel(explanatory, response, intercept = False)

Now that we have our model defined, we must fit the data to it for it to compute all $\beta_i$ values. We do this like so:

model.fit(data)
Coefficients SE t p
Intercept 15244.585697 15004.866623 1.015976 0.312491
Sex::Male -15746.353511 20175.689005 -0.780462 0.437262
Bsal -2.333748 5.847857 -0.399078 0.690825
(Bsal)^2 0.000242 0.000566 0.427475 0.670102
{Bsal}{Sex::Male} 5.472859 7.291904 0.750539 0.454979
{(Bsal)^2}{Sex::Male} -0.000423 0.000666 -0.636137 0.526377

Notice how we did not designate datasets separately for the explantory and response. The model assumed all variables used when defined will be found within the one DataFrame passed in as an argument. Also notice how we did not have to transform our original dataset to include the transformations and interactions. This was all done interally at runtime while fitting the data to the model.

Using the Model

Now that our model is fit, we can do a variety of things with it.

First off, the most common use of a model would be to make predictions with new data. This would be done like so:

model.predict(data)
Predicted Sal77
0 10714.633449
1 12079.759519
2 11806.937185
3 11806.937185
4 11806.937185
5 12488.612620
6 13031.467692
7 11806.937185
8 11806.937185
9 12527.514782
10 12527.514782
11 11163.403112
12 11806.937185
13 11806.937185
14 9640.923816
15 9621.847633
16 9673.291727
17 9673.291727
18 9621.847633
19 9621.847633
20 9703.587918
21 9740.858177
22 9703.587918
23 9809.839940
24 9826.146597
25 9621.847633
26 10031.820474
27 9660.758907
28 9640.923816
29 9668.368680
... ...
62 12319.952051
63 11501.485049
64 11806.937185
65 11806.937185
66 11806.937185
67 11806.937185
68 10131.682604
69 10944.891645
70 12319.952051
71 11163.403112
72 11806.937185
73 11163.403112
74 11806.937185
75 9809.839940
76 9703.587918
77 9656.492266
78 10153.107740
79 9959.679880
80 9640.923816
81 9621.847633
82 9640.923816
83 9809.839940
84 9703.587918
85 9640.923816
86 9621.847633
87 9959.679880
88 9668.368680
89 9762.108581
90 9631.324124
91 9660.758907

92 rows × 1 columns

The only restriction is that the new DataFrame being passed in must have enough columns with the necessary names used to define the model originally.

Plotting

Should the model's definition fall under certain categories, plotting the original training data against the linear fit is available as well. As of right now, plotting supports models with explantory expressions consisting of only categorical variables, or expressions consisting of only one quantitative variable and zero or more categorical variables. Some example plots would look as follows:

ex_explanatory = C("Sex")
ex_response = Q("Sal77")
ex_model = LinearModel(ex_explanatory, ex_response)
ex_model.fit(data)
ex_model.plot()

png

ex_explanatory = C("Sex") + C("Educ") + C("Sex") * C("Educ")
ex_response = Q("Sal77")
ex_model = LinearModel(ex_explanatory, ex_response)
ex_model.fit(data)
ex_model.plot()

png

ex_explanatory = Q("Bsal")
ex_response = Q("Sal77")
ex_model = LinearModel(ex_explanatory, ex_response)
ex_model.fit(data)
ex_model.plot()

png

ex_explanatory = Poly("Bsal", 2)
ex_response = Q("Sal77")
ex_model = LinearModel(ex_explanatory, ex_response)
ex_model.fit(data)
ex_model.plot()

png

ex_explanatory = C("Sex") + Poly("Bsal", 2) + C("Sex") * Poly("Bsal", 2)
ex_response = Q("Sal77")
ex_model = LinearModel(ex_explanatory, ex_response)
ex_model.fit(data)
ex_model.plot()

png

Diagnostics

There are two main diagnostic tools that salmon offers to allow you to evaluate the performance of your model and assure that no assumptions are broken: residual plots and partial regression plots. They can be accessed like so:

model.residual_plots()

png

png

png

png

png

png

[<matplotlib.collections.PathCollection at 0x2a232239d68>,
 <matplotlib.collections.PathCollection at 0x2a232686f60>,
 <matplotlib.collections.PathCollection at 0x2a232cc5d68>,
 <matplotlib.collections.PathCollection at 0x2a232d18e10>,
 <matplotlib.collections.PathCollection at 0x2a232d82c18>,
 <matplotlib.collections.PathCollection at 0x2a232dea780>]
model.partial_plots()

png

png

png

png

png