Active learning algorithms with application in astronomy.

active, learning, astronomy, classification
pip install mclearn==0.1.6



Multiclass Active Learning Algorithms with Application in Astronomy.

Contributors: Alasdair Tran, Cheng Soon Ong, Lee Wei Yen
License: This package is distributed under a 3-clause ("Simplified" or "New") BSD license.
Thesis: Photometric Classification with Thompson Sampling by Alasdair Tran


mclearn is a Python package that implements a selection of multiclass active learning algorithms, with a focus on astronomical data. For a quick overview of how mclearn works, have a look at the Getting Started notebook.


The dependencies are Python 3.4, numpy, pandas, matplotlib, seaborn, ephem, scipy, ipython, and scikit-learn. It's best to first install the Anaconda distribution for Python 3, then install mclearn using pip:

pip install mclearn


Throughout the experiments, we will be using the dataset from the Sloan Digital Sky Survey. Due to their size, the following datasets are not included in this repo:

│   sdss.h5
│   sdss_dr7_photometry_source.csv.gz
│   sdss_full.h5
│   sdss_subclass.h5

The above datasets (except for sdss_full.h5) can be downloaded from the NICTA filestore.


The following nine notebooks accompany Alasdair's thesis on Photometric Classification with Thompson Sampling.

  1. Dataset Preparation

    We provide instructions on how to obtain the SDSS dataset from the Sloan SkyServer. We then clean up the data and convert the CSV files to HDF5 for quicker reading. We also do some cleaning up of the raw data from the VST ATLAS survey.
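The clean-up-and-convert step can be sketched with pandas. This is not mclearn's actual pipeline; the column names and values below are illustrative stand-ins for the real SDSS schema:

```python
import io
import pandas as pd

# A tiny stand-in for the raw SDSS CSV (column names are illustrative).
raw = io.StringIO(
    "objID,ra,dec,psfMag_u,psfMag_g,class\n"
    "1,10.0,-0.5,19.2,18.1,Galaxy\n"
    "2,10.1,-0.4,,17.9,Star\n"  # missing magnitude -> dropped below
    "3,10.2,-0.3,20.0,19.5,Quasar\n"
)
df = pd.read_csv(raw)

# Basic clean-up: drop rows with missing photometry.
df = df.dropna()

# Convert to HDF5 for much faster subsequent reads (requires PyTables):
# df.to_hdf('sdss.h5', key='sdss', mode='w', format='table')
print(len(df))  # 2 rows survive the clean-up
```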

  2. Exploratory Data Analysis

    To get a feel for the data, we plot the distributions of the classes (Galaxy, Quasar, Star). We will see that the data is quite unbalanced, with three times as many galaxies as quasars. A distinction is made between photometry and spectroscopy. We also use PCA to reduce the data down to two dimensions.
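The dimensionality reduction step can be sketched with scikit-learn's PCA. Random data stands in for the SDSS colour features here; the real notebook works on the actual magnitudes and colours:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for the SDSS feature matrix (real data has ~10 columns).
X = rng.normal(size=(300, 10))

# Project down to the first two principal components for plotting.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)  # (300, 2)
```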

  3. Dust Extinction

    Dust extinction is a potential problem in photometry, so we compare three sets of reddening corrections (SFD98, SF11, and W14) to see which set is best at removing the bias. It turns out that there are no significant differences between the three extinction vectors.
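Applying a reddening correction amounts to subtracting R * E(B-V) from the observed magnitudes, band by band. A minimal sketch, with SFD98-style coefficients for the SDSS ugriz bands shown for illustration (check the survey documentation for the exact vector you need):

```python
import numpy as np

# Illustrative SFD98-style extinction coefficients for ugriz.
R = np.array([5.155, 3.793, 2.751, 2.086, 1.479])

def deredden(mags, ebv, R=R):
    """Subtract the reddening R * E(B-V) from observed magnitudes.

    mags: (n_objects, n_bands) observed magnitudes
    ebv:  (n_objects,) E(B-V) values along each line of sight
    """
    return mags - np.outer(ebv, R)

mags = np.array([[19.5, 18.4, 17.9, 17.6, 17.4]])
ebv = np.array([0.05])
print(deredden(mags, ebv))
```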

  4. Learning Curves

    To see how random sampling performs, we construct learning curves for SVMs, Logistic Regression, and Random Forest. A grid search with a 5-fold cross validation is performed to choose the best hyperparameters for the SVM and Logistic Regression. We also do a polynomial transformation of degree 2 and 3 on the features.
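The grid search with 5-fold cross validation over polynomially transformed features can be sketched with scikit-learn. The iris dataset stands in for the SDSS features, and the hyperparameter grid here is illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_iris(return_X_y=True)  # stand-in for the SDSS features

# Degree-2 polynomial transform, scaling, then logistic regression;
# grid-search the regularisation strength with 5-fold cross validation.
pipe = make_pipeline(PolynomialFeatures(degree=2),
                     StandardScaler(),
                     LogisticRegression(max_iter=1000))
grid = GridSearchCV(pipe, {'logisticregression__C': [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```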

  5. Class Proportion Estimation

    We predict the classes of the 800 million unlabelled SDSS objects using a random forest.
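Estimating class proportions this way can be sketched as follows: fit a random forest on the labelled spectroscopic sample, predict over the unlabelled pool, and tally the predictions. The data below is synthetic and only illustrates the mechanics:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Stand-ins: a labelled spectroscopic sample and a large unlabelled pool.
X_train = rng.normal(size=(500, 5))
y_train = rng.choice(['Galaxy', 'Quasar', 'Star'], size=500)
X_pool = rng.normal(size=(10_000, 5))

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Predicted class proportions over the unlabelled pool.
labels, counts = np.unique(forest.predict(X_pool), return_counts=True)
proportions = dict(zip(labels, counts / counts.sum()))
print(proportions)
```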

  6. Active Learning with SDSS

    We look at six active learning heuristics and see how well they perform in the SDSS dataset.
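One of the classic heuristics of this kind is least-confidence uncertainty sampling: repeatedly query the pool point whose most probable class has the lowest posterior probability. A minimal sketch on synthetic data (mclearn's own API may differ; this shows only the query loop):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic 3-class pool standing in for the SDSS features.
X_pool = rng.normal(size=(1000, 2))
y_pool = (X_pool[:, 0] > 0).astype(int) + (X_pool[:, 1] > 0).astype(int)

# Seed the labelled set with one example of each class.
labelled = [int(np.flatnonzero(y_pool == c)[0]) for c in range(3)]
clf = LogisticRegression(max_iter=1000)

for _ in range(20):  # 20 label queries
    clf.fit(X_pool[labelled], y_pool[labelled])
    # Least-confidence heuristic: pick the unlabelled point whose most
    # probable class has the lowest posterior probability.
    candidates = np.setdiff1d(np.arange(len(X_pool)), labelled)
    proba = clf.predict_proba(X_pool[candidates])
    labelled.append(int(candidates[np.argmin(proba.max(axis=1))]))
```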

  7. Active Learning with VST ATLAS

    We look at six active learning heuristics and see how well they perform in the VST ATLAS dataset.

  8. Thompson Sampling with SDSS

    We now examine the six active learning heuristics in a multi-armed bandit setting with Thompson sampling, using the SDSS dataset.
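In the bandit framing, each heuristic is an arm and Thompson sampling decides which heuristic chooses the next query: draw a success rate from each arm's Beta posterior, play the arm with the largest draw, and update its posterior with the observed reward. A minimal sketch with simulated Bernoulli rewards standing in for the heuristics' actual performance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each "arm" is an active learning heuristic; its unknown usefulness is
# simulated here as a fixed Bernoulli success rate.
true_rates = [0.3, 0.5, 0.7]
alpha = np.ones(3)  # Beta posterior: successes + 1
beta = np.ones(3)   # Beta posterior: failures + 1

for _ in range(2000):
    # Thompson sampling: sample a rate from each posterior, play the best.
    arm = int(np.argmax(rng.beta(alpha, beta)))
    reward = rng.random() < true_rates[arm]
    alpha[arm] += reward
    beta[arm] += 1 - reward

# The best arm (index 2) should dominate the posterior means.
print(alpha / (alpha + beta))
```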

  9. Thompson Sampling with VST ATLAS

    We repeat the same Thompson sampling experiment with the VST ATLAS dataset.