A DataFrame-based Machine Learning Toolset in Python


Keywords: machine-learning, pandas, scikit-learn, cross-validation, data-mining
License: MIT
Install: pip install dflearn==0.1.11

A data analysis and machine-learning toolset using pandas DataFrame and scikit-learn models.

Install

pip install dflearn

Dependencies

Contents

  • MLtools: machine learning tools, main toolset

    • Whole dataset
      • Data summary
        • Variable type, NA/non-NA values, numeric summary statistics, most frequent values.
      • Data cleaning
        • Transformation of categorical variables into dummy variables.
        • Standardization/normalization of numeric variables, with imputation.
        • Deletion of sparse variables.
        • Deletion of collinear variables.
    • Machine learning
      • Training/validation set creation
        • Single training/validation set split.
        • Cross-validation set creation.
        • Cross-join with different models.
      • Model training
        • Scikit-learn-like regression/classification.
        • Linear equation estimation.
      • Variable analysis
        • Variable importance inference (tree models, random forest interactions).
      • Validation and error analysis
        • Inference of model effects on cross-validation loss with a linear mixed model.
  • NLtools: natural language tools, not yet developed

    • Clean text
    • Word tokenize
  • SNPtools: tools for genetic SNP data, not general-purpose

    • PLINK
      • Binary data reading and writing
      • Data analysis pipeline (summary statistics, LD matrix, clumping, risk score prediction)
    • Bayesian C-pi inference of high-dimensional single linear association statistics.
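
The data cleaning steps listed under MLtools can be sketched with plain pandas and scikit-learn. This is only an illustration of the ideas, not dflearn's own API: the helper name clean_frame, its parameters max_na_frac and max_abs_corr, and the example column names are all assumptions made for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


def clean_frame(df, max_na_frac=0.5, max_abs_corr=0.95):
    """Hypothetical helper (not part of dflearn) mirroring the cleaning steps."""
    # Sparse variables: drop columns whose share of missing values is too high.
    df = df.loc[:, df.isna().mean() <= max_na_frac]

    # Categorical variables: expand into dummy (one-hot) columns.
    df = pd.get_dummies(df, dtype=float)

    # Numeric variables: impute with the column mean, then standardize.
    values = SimpleImputer(strategy="mean").fit_transform(df)
    values = StandardScaler().fit_transform(values)
    out = pd.DataFrame(values, columns=df.columns, index=df.index)

    # Collinear variables: for each highly correlated pair, drop the later column.
    corr = out.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [col for col in upper.columns if (upper[col] > max_abs_corr).any()]
    return out.drop(columns=drop)


# Made-up example data.
raw = pd.DataFrame({
    "height": [1.0, 2.0, np.nan, 4.0, 5.0],
    "height_cm": [100.0, 200.0, 300.0, 400.0, 500.0],
    "color": ["red", "blue", "red", "blue", "red"],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],
})
clean = clean_frame(raw)
print(clean.columns.tolist())
```

In this toy frame, mostly_missing is removed as sparse, height_cm is removed as collinear with height, and one of the two perfectly anti-correlated dummy columns is removed by the same collinearity rule. The thresholds are arbitrary choices for the example.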
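
The cross-validation items under MLtools (cross-validation set creation, crossing the sets with different models, and comparing validation loss) might look roughly like the following when written directly against scikit-learn. The synthetic data, the model choices, and the layout of the losses table are assumptions for illustration and are not taken from dflearn; "cross-join" is read here as pairing every fold with every model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression data held in a DataFrame (made up for this sketch).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(60, 3)), columns=["x1", "x2", "x3"])
y = 2.0 * X["x1"] - X["x2"] + rng.normal(scale=0.1, size=60)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=20, random_state=0),
}

# Pair every cross-validation fold with every model and record the loss.
rows = []
folds = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, valid_idx) in enumerate(folds.split(X)):
    for name, model in models.items():
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        pred = model.predict(X.iloc[valid_idx])
        mse = mean_squared_error(y.iloc[valid_idx], pred)
        rows.append({"fold": fold, "model": name, "mse": mse})

# Long-format table: one row per (fold, model) pair.
losses = pd.DataFrame(rows)
mean_loss = losses.groupby("model")["mse"].mean()
print(mean_loss)
```

A long-format table with one row per fold and model is a convenient input for a later comparison of model effects on the loss, for example with a mixed model that treats the fold as a grouping factor.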

License

MIT license