AutoML for Time Series.
- License: MIT
- Documentation: https://sintel-dev.github.io/Draco
- Homepage: https://github.com/sintel-dev/Draco
The Draco project is a collection of end-to-end solutions for machine learning problems commonly found in time series monitoring systems. Most tasks utilize sensor data emanating from monitoring systems. We utilize the foundational innovations developed for the automation of machine learning at the Data to AI Lab at MIT.
The salient aspects of this customized project are:
- A set of ready to use, well tested pipelines for different machine learning tasks. These are vetted through testing across multiple publicly available datasets for the same task.
- An easy interface to specify the task, pipeline, and generate results and summarize them.
- A production ready, deployable pipeline.
- An easy interface to tune pipelines using the Bayesian Tuning and Bandits library.
- A community oriented infrastructure to incorporate new pipelines.
- A robust continuous integration and testing infrastructure.
- A learning database recording all past outcomes: tasks, pipelines, outcomes.
Draco has been developed and runs on Python 3.6, 3.7 and 3.8.
Also, although it is not strictly required, the usage of a virtualenv is highly recommended in order to avoid interfering with other software installed in the system where you are trying to run Draco.
Download and Install
Draco can be installed locally using pip with the following command:
pip install draco-ml
This will pull and install the latest stable release from PyPI.
If you want to install from source or contribute to the project please read the Contributing Guide.
The minimum input expected by the Draco system consists of the following two elements, which need to be passed as pandas.DataFrame objects:
A table containing the specification of the problem that we are solving, which has three columns:
turbine_id: Unique identifier of the turbine which this label corresponds to.
cutoff_time: Time associated with this target.
target: The value that we want to predict. This can either be a numerical value or a categorical label. This column can also be skipped when preparing data that will be used only to make predictions and not to fit any pipeline.
A table containing the signal data from the different sensors, with the following columns:
turbine_id: Unique identifier of the turbine which this reading comes from.
signal_id: Unique identifier of the signal which this reading comes from.
timestamp (datetime): Time where the reading took place, as a datetime.
value (float): Numeric value of this reading.
Optionally, a third table can be added containing metadata about the turbines.
The only requirement for this table is to have a turbine_id field, and it can have an arbitrary number of additional fields.
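As a minimal sketch of the in-memory format described above, the tables could be built with pandas as follows. All values are made up for illustration, and the manufacturer column is a hypothetical example of an additional metadata field.

```python
import pandas as pd

# Target times: one row per (turbine, cutoff time) with the value to predict.
target_times = pd.DataFrame({
    "turbine_id": ["T001", "T001"],
    "cutoff_time": pd.to_datetime(["2013-01-12", "2013-01-13"]),
    "target": [0, 1],
})

# Readings: one row per sensor reading, in long format.
readings = pd.DataFrame({
    "turbine_id": ["T001", "T001", "T001"],
    "signal_id": ["S01", "S02", "S01"],
    "timestamp": pd.to_datetime(["2013-01-10", "2013-01-10", "2013-01-11"]),
    "value": [323.0, 320.0, 318.0],
})

# Optional metadata table: only turbine_id is required.
turbines = pd.DataFrame({
    "turbine_id": ["T001"],
    "manufacturer": ["ExampleCo"],  # hypothetical extra field
})
```

When preparing data only for prediction, the target column of target_times can be left out, as noted above.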
Apart from the in-memory data format explained above, which is limited by the memory of the system where it is run, Draco can also load and work with data stored as a collection of CSV files, drastically increasing the amount of data it can work with. Further details about this format can be found in the project documentation site.
In this example we will load some demo data and classify it using a Draco Pipeline.
1. Load and split the demo data
The first step is to load the demo data.
For this, we will import and call the draco.demo.load_demo function without any arguments:
from draco.demo import load_demo
target_times, readings = load_demo()
The returned objects are:
target_times: A pandas.DataFrame containing the target times table in the format explained above.
  turbine_id cutoff_time  target
0       T001  2013-01-12       0
1       T001  2013-01-13       0
2       T001  2013-01-14       0
3       T001  2013-01-15       1
4       T001  2013-01-16       0
readings: A pandas.DataFrame containing the time series data in the format explained above.
  turbine_id signal_id  timestamp  value
0       T001       S01 2013-01-10  323.0
1       T001       S02 2013-01-10  320.0
2       T001       S03 2013-01-10  284.0
3       T001       S04 2013-01-10  348.0
4       T001       S05 2013-01-10  273.0
Once we have loaded the target_times, and before training any Machine Learning Pipeline, we will have to split them into 2 partitions for training and testing.
In this case, we will split them using the train_test_split function from scikit-learn, but it can be done with any other suitable tool.
from sklearn.model_selection import train_test_split
train, test = train_test_split(target_times, test_size=0.25, random_state=0)
Notice how we are only splitting the target_times data and not the readings.
This is because the pipelines will later on take care of selecting the parts of the readings table needed for the training based on the information found inside the target_times.
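To make the idea concrete, here is a rough sketch of how readings could be selected for a single row of target_times. The select_readings helper is hypothetical and is not part of Draco; it assumes that only readings of the same turbine taken at or before the cutoff time are relevant, which may differ from what the actual pipelines do.

```python
import pandas as pd

def select_readings(readings, turbine_id, cutoff_time):
    """Return the readings of one turbine recorded at or before cutoff_time.

    Hypothetical helper for illustration only; not part of Draco.
    """
    mask = (readings["turbine_id"] == turbine_id) & (readings["timestamp"] <= cutoff_time)
    return readings[mask]

# Made-up readings for two turbines.
readings = pd.DataFrame({
    "turbine_id": ["T001", "T001", "T001", "T002"],
    "signal_id": ["S01", "S01", "S01", "S01"],
    "timestamp": pd.to_datetime(["2013-01-10", "2013-01-11", "2013-01-12", "2013-01-10"]),
    "value": [323.0, 318.0, 330.0, 300.0],
})

# Keep only the T001 readings available by 2013-01-11.
selected = select_readings(readings, "T001", pd.Timestamp("2013-01-11"))
```

Because this kind of selection only needs the turbine_id and cutoff_time of each row, splitting target_times is enough to obtain separate training and testing sets.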
Additionally, if we want to calculate a goodness-of-fit score later on, we can separate the
testing target values from the
test table by popping them from it:
test_targets = test.pop('target')
2. Exploring the available Pipelines
Once we have the data ready, we need to find a suitable pipeline.
The list of available Draco Pipelines can be obtained using the draco.get_pipelines function:
from draco import get_pipelines
pipelines = get_pipelines()
The returned pipelines variable will be a list containing the names of all the pipelines available in the Draco system:
['lstm', 'lstm_with_unstack', 'double_lstm', 'double_lstm_with_unstack']
For the rest of this tutorial, we will select and use the pipeline
lstm_with_unstack as our template.
pipeline_name = 'lstm_with_unstack'
3. Fitting the Pipeline
Once we have loaded the data and selected the pipeline that we will use, we have to fit it.
For this, we will create an instance of a
DracoPipeline object passing the name
of the pipeline that we want to use:
from draco.pipeline import DracoPipeline
pipeline = DracoPipeline(pipeline_name)
And then we can directly fit it to our data by calling its fit method and passing in the training target_times and the complete readings table:
pipeline.fit(train, readings)
4. Make predictions
After fitting the pipeline, we are ready to make predictions on new data by calling the pipeline.predict method, passing the testing target_times and, again, the complete readings table:
predictions = pipeline.predict(test, readings)
5. Evaluate the goodness-of-fit
Finally, after making predictions we can evaluate how good the prediction was using any suitable metric.
from sklearn.metrics import f1_score
f1_score(test_targets, predictions)
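For readers unfamiliar with the metric, the following is a minimal pure-Python sketch of what a binary F1 score computes. The binary_f1 function and the label lists are illustrative assumptions; this is not how scikit-learn implements f1_score, and the labels are not results from the demo data.

```python
def binary_f1(y_true, y_pred, positive=1):
    """Compute the F1 score of the positive class for two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Made-up labels: one positive case is missed by the predictions.
score = binary_f1([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
```

A score close to 1 means that the positive class is predicted with both high precision and high recall.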