A data analysis and machine-learning toolset built on pandas DataFrames and scikit-learn models.
Install
pip install dflearn
Dependencies
- numpy: 1.11.0 or higher
- scipy: 0.18.0 or higher
- pandas: 0.20.0 or higher
- statsmodels: 0.6.0 or higher
- scikit-learn: 0.18.0 or higher
- nltk: 3.0.0 or higher
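
To check whether an existing environment meets the minimums listed above, a small script along the following lines may help. This is a minimal sketch and not part of dflearn: the helper names `parse`, `meets_minimum`, and `check_dependencies` are made up for illustration, and the version parsing is deliberately simple (numeric components only, so pre-release tags are ignored).

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the dependency list above.
MINIMUMS = {
    "numpy": "1.11.0",
    "scipy": "0.18.0",
    "pandas": "0.20.0",
    "statsmodels": "0.6.0",
    "scikit-learn": "0.18.0",
    "nltk": "3.0.0",
}


def parse(v):
    """Turn a string such as '1.11.0rc1' into a tuple such as (1, 11, 0)."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad short versions so that '1.11' compares equal to '1.11.0'.
    return tuple(parts + [0] * (3 - len(parts)))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return parse(installed) >= parse(minimum)


def check_dependencies(minimums=MINIMUMS):
    """Map each package to True/False, or None if it is not installed."""
    report = {}
    for name, minimum in minimums.items():
        try:
            report[name] = meets_minimum(version(name), minimum)
        except PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        print(name, "missing" if ok is None else ("ok" if ok else "too old"))
```
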
Contents
- MLtools: machine learning tools, main toolset
  - Whole dataset
    - Data summary
      - Variable type, NA/non-NA values, numeric summary statistics, most frequent values.
    - Data cleaning
      - Transformation of categorical variables into dummy variables.
      - Standardization/normalization of numeric variables, with imputation.
      - Deletion of sparse variables.
      - Deletion of collinear variables.
  - Machine learning
    - Training/validation set creation
      - Single training/validation set split.
      - Cross-validation set creation.
      - Cross-join with different models.
    - Model training
      - Scikit-learn-like regression/classification.
      - Linear equation estimation.
    - Variable analysis
      - Variable importance inference (tree models, random forest interactions).
    - Validation and error analysis
      - Inference of model effects on cross-validation loss with a linear mixed model.
- NLtools: natural language tools, development pending
  - Clean text
  - Word tokenize
- SNPtools: tools for genetic SNP data, not general-purpose
  - PLINK
    - Binary data reading and writing.
    - Data analysis pipeline (summary statistics, LD matrix, clumping, risk score prediction).
    - Bayesian C-pi inference of high-dimensional single linear association statistics.
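
The MLtools steps listed under Contents (data summary, dummy encoding, imputation with standardization, train/validation split, cross-validation, variable importance) can be sketched with plain pandas and scikit-learn. The code below does not use dflearn's own API, which is not documented in this README. The toy data, column names, and the choice of a random forest are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up): one categorical column and two numeric columns.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=n),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
df.loc[::10, "x2"] = np.nan  # inject missing values in every 10th row
y = 3.0 * df["x1"] + rng.normal(scale=0.1, size=n)

# Data summary: variable type, NA / non-NA counts, most frequent value.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_missing": df.isna().sum(),
    "n_present": df.notna().sum(),
    "most_frequent": df.mode().iloc[0],
})

# Data cleaning: categorical variables become dummy (one-hot) columns.
X = pd.get_dummies(df, columns=["color"], dtype=float)

# Imputation and standardization live inside the pipeline, so they are
# fitted on the training portion only and then applied to validation data.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    RandomForestRegressor(n_estimators=50, random_state=0),
)

# Single training/validation split.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model.fit(X_train, y_train)
valid_r2 = model.score(X_valid, y_valid)

# Cross-validation: five folds, scored by negative mean squared error.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")

# Variable importance from the fitted forest, largest first.
forest = model.named_steps["randomforestregressor"]
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(summary)
print("validation R^2:", valid_r2)
print("mean CV MSE:", -cv_scores.mean())
print(importances)
```

Keeping the imputer and scaler inside the pipeline avoids leaking validation-set statistics into training, which matters for both the single split and the cross-validation folds.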
License
MIT license