missCforest

Ensemble Conditional Trees for Missing Data Imputation


License
CNRI-Python-GPL-Compatible

Documentation

missCforest

missCforest is an Ensemble Conditional Trees algorithm for Missing Data Imputation. It performs single imputation based on the Cforest algorithm which is an ensemble of Conditional Inference Trees.

The aim of missCforest is to produce a complete dataset using an iterative prediction approach by predicting missing values after learning from the complete cases.

Installation

You can install the development version of missCforest as follow:

#install.packages("devtools")
devtools::install_github("ielbadisy/missCforest")

Examples

library(missCforest)
#> Loading required package: partykit
#> Loading required package: grid
#> Loading required package: libcoin
#> Loading required package: mvtnorm

# import the GBSG2 dataset
library(TH.data)
#> Loading required package: survival
#> Loading required package: MASS
#> 
#> Attaching package: 'TH.data'
#> The following object is masked from 'package:MASS':
#> 
#>     geyser
data("GBSG2")

# consider the cens variable as a factor
GBSG2$cens <- as.factor(GBSG2$cens)

# introduce randomly 30% of NA to variables
datNA <- missForest::prodNA(GBSG2, 0.2)
head(datNA)
#>   horTh age menostat tsize tgrade pnodes progrec estrec time cens
#> 1    no  70     Post    21     II      3      NA     66 1814    1
#> 2  <NA>  56     Post    12     II      7      61     77 2018    1
#> 3   yes  58     Post    35     II      9      52    271  712    1
#> 4   yes  NA     <NA>    NA     II      4      NA     29   NA    1
#> 5    no  NA     Post    NA     II      1      26     65  772    1
#> 6  <NA>  32      Pre    57    III     24      NA     13   NA    1

You can impute all the missing values using all the possible combinations of the imputation model formula:

impdat <- missCforest(datNA, .~., 
                      ntree = 300L,
                      minsplit = 20L,
                      minbucket = 7L,
                      alpha = 0.05,
                      cores = 4)  
head(impdat)
#>   horTh      age menostat    tsize tgrade pnodes  progrec estrec      time cens
#> 1    no 70.00000     Post 21.00000     II      3 88.68623     66 1814.0000    1
#> 2   yes 56.00000     Post 12.00000     II      7 61.00000     77 2018.0000    1
#> 3   yes 58.00000     Post 35.00000     II      9 52.00000    271  712.0000    1
#> 4   yes 53.31934     Post 27.15198     II      4 64.72325     29  970.5889    1
#> 5    no 56.31297     Post 25.92921     II      1 26.00000     65  772.0000    1
#> 6    no 32.00000      Pre 57.00000    III     24 48.91202     13  521.7364    1