cdcstream

Implementation of Ienco's algorithm CDCStream


License
GPL-3.0

Install
pip install cdcstream==0.2.2

Documentation

Change Detection in Categorical Evolving Data Streams - CDCStream

The paper is available at http://dx.doi.org/10.5445/IR/1000155196. Cite as (BibTeX):

@techreport{TratBenderOvtcharova2023_1000155196,
    author       = {Trat, Martin and Bender, Janek and Ovtcharova, Jivka},
    year         = {2023},
    title        = {Sensitivity-Based Optimization of Unsupervised Drift Detection for Categorical Data Streams},
    doi          = {10.5445/IR/1000155196},
    institution  = {{Karlsruher Institut für Technologie (KIT)}},
    issn         = {2194-1629},
    series       = {KIT Scientific Working Papers},
    keywords     = {unsupervised concept drift detection, data stream mining, productive artificial intelligence, categorical data processing},
    pagetotal    = {10},
    language     = {english},
    volume       = {208}
}

Implementation of an augmented version of Dino Ienco's algorithm CDCStream (https://doi.org/10.1145/2554850.2554864).

Installation

Requirements

  • WEKA v3.8.6 or greater: Installation, GitHub
    • Without this requirement, code execution fails.
  • Java
    • Download and install OpenJDK 11, e.g. from Red Hat (more recent versions might work as well).
    • Note that I experienced issues using Temurin (via adoptium.net).
    • Make sure that the Java folder (path including /bin at the end) is added to environment variable PATH.
    • Some problems during Python package installation can be solved by correctly setting the environment variable JAVA_HOME. Set it to the respective Java folder (e.g. /usr/lib/jvm/java-11-openjdk-amd64), NOT including /bin or further components at the end.
    • Without this requirement, attempting to install package javabridge might fail.
  • Build tools
    • Ubuntu: Based on the python-weka-wrapper3 documentation, fulfill build requirements.
      sudo apt-get install build-essential python3-dev
    • Windows: Microsoft Visual C++ 14.0 or greater. For this, download Build Tools from Microsoft and install those (installation of Core Features for C++ Build Tools, C++ 2019 Redistributable Update, Windows 10 SDK and MSVC v142 (or greater) should suffice; a subsequent restart might be necessary).
    • Without these requirements, attempting to install package javabridge might fail.
  • Python >=3.7
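
The Java-related requirements above can be checked before running pip. The following is a minimal sketch using only the Python standard library; the function name check_java_env and its messages are illustrative and not part of cdcstream.

```python
import os
import shutil


def check_java_env(environ=None, which=shutil.which):
    """Return a list of human-readable problems with the Java setup.

    Hypothetical helper (not part of cdcstream). `environ` and `which`
    are parameters only so that the check can be exercised in isolation.
    """
    environ = os.environ if environ is None else environ
    problems = []

    # The Java folder (ending in /bin) must be on PATH.
    if which("java") is None:
        problems.append("no 'java' executable found on PATH")

    # JAVA_HOME must name the Java folder itself, without /bin at the end.
    java_home = environ.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif os.path.basename(os.path.normpath(java_home)) == "bin":
        problems.append("JAVA_HOME should not end in /bin")

    return problems
```

Calling check_java_env() without arguments inspects the current environment; an empty list means that none of the checked problems were found. It does not verify the Java version or the build tools.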

Setup

  • Use pip (after installing above-stated requirements!):
    python -m pip install cdcstream
  • Poetry users: if installing python-javabridge fails, see the Development section.
  • First usage of the cdcstream package should automatically add all required WEKA packages. If this does not succeed: Manually add package DilcaDistance v1.0.2 or greater to WEKA:
    • Start WEKA GUI
    • Select Tools / Package manager and install the latest version of DilcaDistance (the dependency fastCorrBasedFS should be installed after confirming the prompt). It might be necessary to click the Toggle load button with DilcaDistance selected so that the Loaded column shows Yes.
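
As an alternative to the GUI steps above, the package might also be installed from Python. The sketch below assumes the package manager interface of python-weka-wrapper3 (the functions is_installed and install_package in the module weka.core.packages); the helper ensure_weka_package is hypothetical and not part of cdcstream.

```python
def ensure_weka_package(packages, name="DilcaDistance"):
    """Install the WEKA package `name` unless it is already installed.

    `packages` is assumed to behave like python-weka-wrapper3's
    weka.core.packages module, i.e. to offer is_installed(name) and
    install_package(name). Returns True if an installation was triggered.
    """
    if packages.is_installed(name):
        return False
    packages.install_package(name)
    return True
```

With python-weka-wrapper3, this would be called as ensure_weka_package(weka.core.packages) after starting the JVM with package support enabled. A newly installed package typically becomes usable only after the JVM has been restarted.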

Example

import numpy as np
import pandas as pd
from cdcstream.dilca_wrapper import dilca_workflow
from cdcstream import CDCStream, tools


N_BATCHES = 50
tools.manage_jvm_start()  # start a Java VM in order to integrate WEKA


# define the alert callback and instantiate the drift detector
def alert_cbck(alert_code, alert_msg):
    if not alert_msg:
        alert_msg = 'no msg'
    print(f'{alert_msg} (code {alert_code})')

c = CDCStream(
    alert_callback=alert_cbck,
    summary_extractor=dilca_workflow,
    summary_extractor_args={'nominal_cols': 'all'},
    factor_warn=2.0,
    factor_change=3.0,
    factor_std_extr_forg=0,
    cooldown_cycles=0
)

# create random data (will be interpreted as being nominal)
batches = []
for i in range(N_BATCHES):
    batches.append(
        pd.DataFrame(np.random.randint(1, 10, size=(10,5)))
    )

# employ created data as stream and feed it to drift detector
for b in batches:
    c.feed_new_batch(b)

tools.manage_jvm_stop()  # cleanup
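
In the example above, the JVM is not stopped if feeding a batch raises an exception. One way to guarantee the cleanup is a small context manager, sketched here under the assumption that tools.manage_jvm_start and tools.manage_jvm_stop can be called without arguments, as in the example. The name jvm_session is hypothetical and not part of cdcstream.

```python
from contextlib import contextmanager


@contextmanager
def jvm_session(start, stop):
    """Call start() on entry and stop() on exit, even if the body raises.

    Hypothetical helper: `start` and `stop` are any zero-argument
    callables, e.g. tools.manage_jvm_start and tools.manage_jvm_stop.
    """
    start()
    try:
        yield
    finally:
        stop()
```

With the names from the example, the stream loop would then run inside "with jvm_session(tools.manage_jvm_start, tools.manage_jvm_stop):", replacing the separate start and stop calls.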

Development

  • Python poetry
    • Installation of python-javabridge fails with poetry versions > 1.1.15 (at the time of writing, the newest poetry version is 1.3.1). This might be related to PEP 621. A workaround is to install python-javabridge via pip:
      python -m poetry run pip install python-javabridge  # from outside the virtual environment
    • Afterwards, continue the installation via poetry:
      python -m poetry install

License

Code is copyright to the FZI Research Center for Information Technology and released under the GNU General Public License v3.0. All dependencies are copyright to the respective authors and released under the respective licenses. A copy of these licenses is provided in LICENSE_LIBRARIES.

Acknowledgements

This software was developed at the FZI Research Center for Information Technology. The associated research was funded by the German Federal Ministry of Education and Research (grant number: 02K18D033) within the context of the project SEAMLESS.

To Do

  • add tests