qCBA

Quantitative Classification by Association Rules


License
GPL-3.0

Documentation

Quantitative CBA

Quantitative CBA (QCBA) is a postprocessing algorithm for the association rule classification algorithm CBA. It applies a number of optimization steps that improve the handling of quantitative (numerical) attributes. The desirable properties of CBA rule lists, such as one-rule classification and crisp rules, which make CBA models the most comprehensible among association rule classification algorithms, are retained. The postprocessing is conceptually fast because it is performed only on the relatively small number of rules that passed the pruning steps, and it can also be adapted to multi-rule classification algorithms. Benchmarks show a decrease of about 50% in the total size of the model, measured as the total number of conditions in all rules. Model accuracy generally remains at the same level as for CBA, with QCBA even providing a small improvement over CBA on 11 of the 22 datasets in our benchmark.

Kliegr, Tomas. "Quantitative CBA: Small and Comprehensible Association Rule Classification Models." arXiv preprint arXiv:1711.10166 (2017).

The arc package is used for generation of the CBA classifier, which is postprocessed by the QCBA R package.

Feature Tutorial

The tutorial visually demonstrates all the optimization steps in QCBA:

  • Refitting rules: Literals originally aligned to the borders of the discretized regions are refit to a finer grid.
  • Attribute pruning: Redundant attributes are removed from rules.
  • Trimming: Literals in discovered rules are trimmed so that they do not contain regions not covered by data.
  • Extension: Ranges of literals in the body of each rule are extended, escaping from the coarse hypercubes created by discretization.
  • Data coverage pruning: Some of the newly redundant rules are removed.
  • Default rule overlap pruning: Some rules that classify into the same class as the default rule at the end of the classifier can be removed.
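To make the trimming step more concrete, the following base R sketch tightens one interval literal to the range of values that actually occur inside it. It is a simplified illustration only, not the implementation used by the QCBA package: the helper `trim_interval` is a hypothetical name, and it ignores the class label and the other literals of the rule.

```r
# Simplified illustration of trimming a single literal (hypothetical helper,
# not part of the qCBA package).
trim_interval <- function(values, lower, upper) {
  # Keep only the observed values that fall inside the literal's interval
  covered <- values[values >= lower & values <= upper]
  # If no data fall inside the interval, leave it unchanged
  if (length(covered) == 0) {
    return(c(lower = lower, upper = upper))
  }
  # Shrink the interval to the observed range
  c(lower = min(covered), upper = max(covered))
}

# Literal Petal.Width=[-Inf;0.8] evaluated against the full iris data
trimmed <- trim_interval(datasets::iris$Petal.Width, -Inf, 0.8)
print(trimmed)
```

The empty region between the interval borders and the nearest observed values is cut away, so the literal no longer claims a region that no data point supports.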

The R Markdown source for this tutorial is located here. Note that while GitHub displays the source with syntax highlighting, it does not run the code or display the knitted HTML. For this reason, it is recommended to view the tutorial outside GitHub.

Prerequisites

The qCBA package depends on Java 8 and a correctly installed rJava package. On Linux, even if Java is already installed, it might be necessary to install rJava through the system package manager:

apt-get install r-cran-rjava

For instructions on how to set up rJava, please refer to the rJava documentation.
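Before installing qCBA, it can help to confirm that rJava loads and can reach a Java runtime. The snippet below is a minimal check, not an official diagnostic of either package; the variable name `rjava_available` is chosen here for illustration.

```r
# Check whether rJava can be loaded and which Java version it sees.
rjava_available <- requireNamespace("rJava", quietly = TRUE)

if (rjava_available) {
  rJava::.jinit()  # start the Java virtual machine
  java_version <- rJava::.jcall("java/lang/System", "S",
                                "getProperty", "java.version")
  message("rJava is working, Java version: ", java_version)
} else {
  message("rJava is not installed or cannot be loaded")
}
```

If the second message appears, fix the rJava installation first, since qCBA cannot work without it.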

Installation

A release version of the package is available on CRAN.

The latest version can be installed from the R environment using the devtools package.

devtools::install_github("kliegr/QCBA")

Example

Baseline CBA model

Learn a CBA classifier.

library(arc)
set.seed(111)
allData <- datasets::iris[sample(nrow(datasets::iris)),]
trainFold <- allData[1:100,]
testFold <- allData[101:nrow(datasets::iris),]
rmCBAiris <- cba(trainFold, classAtt="Species")
inspect(rmCBAiris@rules)

The model:

    lhs                                                    rhs                  support confidence lift     lhs_length
[1] {Petal.Length=[-Inf;2.6],Petal.Width=[-Inf;0.8]}    => {Species=setosa}     0.32    1.00       3.125000 2         
[2] {Petal.Length=(2.6;4.75],Petal.Width=(0.8;1.75]}    => {Species=versicolor} 0.30    1.00       2.777778 2         
[3] {Sepal.Length=(5.85; Inf],Petal.Length=(5.15; Inf]} => {Species=virginica}  0.25    1.00       3.125000 2         
[4] {Sepal.Width=[-Inf;3.05],Petal.Width=(1.75; Inf]}   => {Species=virginica}  0.18    1.00       3.125000 2         
[5] {}                                                  => {Species=versicolor} 0.36    0.36       1.000000 0 

The statistics:

library(stringr)
prediction_iris <- predict(rmCBAiris,testFold)
acc <- CBARuleModelAccuracy(prediction_iris, testFold[[rmCBAiris@classAtt]])
avgRuleLengthCBA <- sum(rmCBAiris@rules@lhs@data)/length(rmCBAiris@rules)
print(paste("Number of rules: ",length(rmCBAiris@rules),", average number of conditions per rule :",round(avgRuleLengthCBA,2), ", accuracy on test data: ",round(acc,2)))

Returns:

  Number of rules:  5 , average number of conditions per rule : 1.6 , accuracy on test data:  0.94

QCBA model

Learn a QCBA model.

library(qCBA)
rmCBA4QCBAiris <- cba(trainFold, classAtt="Species",pruning_options=list(default_rule_pruning=FALSE))
rmqCBAiris <- qcba(cbaRuleModel=rmCBA4QCBAiris,datadf=trainFold)
print(rmqCBAiris@rules)

The model:

    lhs                                                    rhs                  support confidence lift     lhs_length
[1] {Petal.Width=[-Inf;0.6]}                            => {Species=setosa}     0.32    1.00       3.125000 2         
[2] {Petal.Length=[5.2;Inf]}                            => {Species=virginica}  0.25    1.00       3.125000 2         
[3] {Sepal.Width=[-Inf;3.1],Petal.Width=[1.8;Inf]}      => {Species=virginica}  0.20    1.00       3.125000 2         
[4] {}                                                  => {Species=versicolor} 0.36    0.36       1.000000 0 

The statistics:

prediction_iris <- predict(rmqCBAiris,testFold)
acc <- CBARuleModelAccuracy(prediction_iris, testFold[[rmqCBAiris@classAtt]])
avgRuleLengthQCBA <- (sum(unlist(lapply(rmqCBAiris@rules[1],str_count,pattern=",")))+
                              # assuming the last rule has antecedent length zero - not counting its length
                              nrow(rmqCBAiris@rules)-1)/nrow(rmqCBAiris@rules)
print(paste("Number of rules: ",nrow(rmqCBAiris@rules),", average number of conditions per rule :",avgRuleLengthQCBA, ", accuracy on test data: ",round(acc,2)))

Returns:

Number of rules:  4 , average number of conditions per rule : 1 , accuracy on test data:  0.96

QCBA:

  • Improved accuracy on test data from 0.94 to 0.96.
  • Reduced the number of rules from 5 to 4.
  • Reduced the average number of conditions per rule from 1.6 to 1.
  • Unlike other ARC approaches, retains the interpretability of CBA models by performing one-rule classification.
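The condition counts can be checked by hand from the printed rules. The sketch below uses base R only and assumes the rules are available as text in the form shown in the QCBA model listing above; the vector `rule_text` is typed in manually for illustration and is not read from the model object.

```r
# Rules copied from the QCBA model output above (typed in for illustration)
rule_text <- c(
  "{Petal.Width=[-Inf;0.6]} => {Species=setosa}",
  "{Petal.Length=[5.2;Inf]} => {Species=virginica}",
  "{Sepal.Width=[-Inf;3.1],Petal.Width=[1.8;Inf]} => {Species=virginica}",
  "{} => {Species=versicolor}"
)

# Keep only the antecedent (everything before "=>")
antecedents <- sub("\\s*=>.*$", "", rule_text)

# Each condition contains exactly one "=", so counting "=" counts conditions
conditions_per_rule <- lengths(
  regmatches(antecedents, gregexpr("=", antecedents, fixed = TRUE))
)

total_conditions <- sum(conditions_per_rule)
average_conditions <- total_conditions / length(rule_text)
print(conditions_per_rule)
print(average_conditions)
```

Counting "=" rather than commas means the empty default rule is handled without a special case.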

New feature - ROC curves and AUC

library(ROCR)
library(qCBA)
twoClassIris<-datasets::iris[1:100,]
twoClassIris <- twoClassIris[sample(nrow(twoClassIris)),]
#twoClassIris$Species<-as.factor(as.character(iris$Species))
trainFold <- twoClassIris[1:75,]
testFold <- twoClassIris[76:nrow(twoClassIris),]
rmCBA <- cba(trainFold, classAtt="Species")
rmqCBA <- qcba(cbaRuleModel=rmCBA, datadf=trainFold)
print(rmqCBA@rules)
prediction <- predict(rmqCBA,testFold)
acc <- CBARuleModelAccuracy(prediction, testFold[[rmqCBA@classAtt]])
message(acc)
# confidence scores for the positive class, used as the ranking score for the ROC curve
confidences <- predict(rmqCBA,testFold,outputConfidenceScores=TRUE,positiveClass="setosa")
# it is important that the first level is different from the positiveClass specified in the line above
target<-droplevels(factor(testFold[[rmqCBA@classAtt]],ordered = TRUE,levels=c("versicolor","setosa")))

pred = ROCR::prediction(confidences, target)
roc = ROCR::performance(pred, "tpr", "fpr")
plot(roc, lwd=2, colorize=TRUE)
lines(x=c(0, 1), y=c(0, 1), col="black", lwd=1)
auc = ROCR::performance(pred, "auc")
auc = unlist(auc@y.values)
auc
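As a sanity check on the value reported by ROCR, the AUC can also be computed directly from ranks, since it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below uses base R and made-up scores; `auc_rank`, `scores` and `is_positive` are hypothetical names and the numbers are not output of the model above.

```r
# Rank-based AUC (Mann-Whitney statistic); hypothetical helper for illustration
auc_rank <- function(scores, is_positive) {
  r <- rank(scores)              # average ranks are used for ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up confidence scores and true labels (TRUE = positive class)
scores <- c(0.9, 0.4, 0.35, 0.6, 0.5)
is_positive <- c(TRUE, TRUE, FALSE, TRUE, FALSE)

auc_manual <- auc_rank(scores, is_positive)
print(auc_manual)
```

Here 5 of the 6 positive-negative pairs are ordered correctly, so the result is 5/6. Applied to `confidences` and `target == "setosa"` from the example above, the same function should agree with the ROCR value, assuming the scores are confidences for the positive class.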