pdLSR: Pandas-aware least squares regression
pdLSR is a library for performing least squares regression. It attempts to seamlessly incorporate this task in a Pandas-focused workflow. Input data are expected in dataframes, and multiple regressions can be performed using functionality similar to Pandas groupby. Results are returned as grouped dataframes and include best-fit parameters, statistics, residuals, and more. The results can be easily visualized using
pdLSR currently utilizes lmfit, a flexible and powerful library for least squares minimization, which in turn makes use of scipy.optimize.leastsq. I began using lmfit because it is one of the few libraries that supports non-linear least squares regression, which is commonly used in the natural sciences. I also like the flexibility it offers for testing different modeling scenarios and the variety of assessment statistics it provides. However, I found myself writing many for loops to perform regressions on groups of data and aggregate the resulting output. Simplifying this task was my inspiration for writing pdLSR.
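To make the motivation concrete, here is a minimal sketch of the manual loop-and-aggregate pattern described above. It calls scipy.optimize.leastsq directly rather than lmfit or pdLSR, and the exponential decay model, the column names (group, x, y), and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from scipy.optimize import leastsq


def residuals(params, x, y):
    """Residuals for a hypothetical model y = a * exp(-k * x)."""
    a, k = params
    return y - a * np.exp(-k * x)


# Synthetic, noise-free data for two groups (illustrative values only).
x = np.linspace(0.0, 5.0, 20)
data = pd.concat(
    [
        pd.DataFrame({"group": "A", "x": x, "y": 2.0 * np.exp(-0.5 * x)}),
        pd.DataFrame({"group": "B", "x": x, "y": 3.0 * np.exp(-1.5 * x)}),
    ],
    ignore_index=True,
)

# The manual pattern: loop over groups, fit each one, collect the output.
rows = []
for name, grp in data.groupby("group"):
    best, ier = leastsq(
        residuals,
        x0=[1.0, 1.0],
        args=(grp["x"].to_numpy(), grp["y"].to_numpy()),
    )
    rows.append({"group": name, "a": best[0], "k": best[1]})

# Aggregate the per-group results into a single dataframe.
fit_results = pd.DataFrame(rows).set_index("group")
print(fit_results)
```

Every new model or statistic means another pass through a loop like this one, which is the bookkeeping that a grouped, dataframe-based interface is meant to remove.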
pdLSR is related to libraries such as scikit-learn that provide linear regression functions that operate on dataframes. However, these libraries don't support grouping operations on dataframes and don't aggregate output into dataframes. Support for scikit-learn is being considered for the future, and pull requests adding this functionality would be welcome.
Some additional 'niceties' associated with the input of parameters and equations have also been incorporated.
pdLSR also utilizes multithreading for the calculation of confidence intervals, as this process is time-consuming when there are more than a few groups.
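The general idea of fanning per-group interval calculations out to a pool of workers can be sketched as follows. This is not pdLSR's implementation: it uses the standard library's ThreadPoolExecutor instead of multiprocess, a straight-line model, and a simple t-based interval from the fit covariance, which may differ from the method lmfit uses. The function name slope_interval and the column names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd
from scipy import stats


def slope_interval(item, level=0.95):
    """Fit y = m*x + b to one group and return a confidence interval for m."""
    name, grp = item
    coeffs, cov = np.polyfit(
        grp["x"].to_numpy(), grp["y"].to_numpy(), deg=1, cov=True
    )
    dof = len(grp) - 2  # two fitted parameters: slope and intercept
    half_width = stats.t.ppf(0.5 + level / 2.0, dof) * np.sqrt(cov[0, 0])
    return {
        "group": name,
        "slope": coeffs[0],
        "lower": coeffs[0] - half_width,
        "upper": coeffs[0] + half_width,
    }


# Synthetic data: two groups with assumed true slopes of 1.0 and 2.0.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 30)
data = pd.concat(
    [
        pd.DataFrame({"group": "A", "x": x, "y": 1.0 * x + rng.normal(0, 0.1, x.size)}),
        pd.DataFrame({"group": "B", "x": x, "y": 2.0 * x + rng.normal(0, 0.1, x.size)}),
    ],
    ignore_index=True,
)

# Each (name, sub-dataframe) pair from groupby is handled by a pool worker.
with ThreadPoolExecutor(max_workers=2) as pool:
    rows = list(pool.map(slope_interval, data.groupby("group")))

intervals = pd.DataFrame(rows).set_index("group")
print(intervals)
```

Threads keep this sketch portable, but for CPU-bound interval calculations a process pool is usually what gives a real speedup.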
The following libraries are required for
multiprocess is a fork of Python's multiprocessing library that provides more robust multithreading. I found that this library is required for multithreading to work with
lmfit will install automatically from conda (see below).
matplotlib is required and seaborn is recommended.
pdLSR works with Python 2 and 3.
Installation and Demo
The preferred method for installing pdLSR and all of its dependencies is to use the conda or pip package managers.
- For conda (which unfortunately seems to require lowercase names for packages):
conda install -c mlgill pdlsr
- For pip:
pip install pdLSR
However, it can also be installed manually by cloning the repo into your
There is a demo notebook that can be executed locally or live from GitHub using mybinder.org. After clicking the badge at the top of this section, navigate to pdLSR --> demo --> pdLSR_demo.ipynb and everything should be set up to execute the demo in a browser. No installation required!
The functions of pdLSR are documented within the code, but currently the best single source for using pdLSR is the demo notebook. Developing stand-alone documentation is a future goal.