mclearn

Multiclass Active Learning Algorithms with Application in Astronomy.

Contributors:	Alasdair Tran, Cheng Soon Ong, Lee Wei Yen
License:	This package is distributed under a 3-clause ("Simplified" or "New") BSD license.
Source:	https://github.com/chengsoonong/mclass-sky
Doc:	https://mclearn.readthedocs.org/en/latest/
Thesis:	Photometric Classification with Thompson Sampling by Alasdair Tran

Introduction

mclearn is a Python package that implements selected multiclass active learning algorithms, with a focus on astronomical data. For a quick overview of how mclearn works, have a look at the Getting Started notebook.

Installation

The dependencies are Python 3.4, numpy, pandas, matplotlib, seaborn, ephem, scipy, ipython, and scikit-learn. It's best to first install the Anaconda distribution for Python 3, then install mclearn using pip:

pip install mclearn
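
Before running the notebooks, it can help to confirm that the dependencies listed above are importable. The following is a minimal sketch using only the standard library. The helper name missing_modules is hypothetical and not part of mclearn. The import names assume the usual conventions, where scikit-learn is imported as sklearn and ipython as IPython.

```python
from importlib.util import find_spec

# Import names for the dependencies listed above. Note that scikit-learn is
# imported as `sklearn` and ipython as `IPython`.
REQUIRED_MODULES = [
    "numpy", "pandas", "matplotlib", "seaborn",
    "ephem", "scipy", "IPython", "sklearn",
]


def missing_modules(names):
    """Return the module names that cannot be found in this environment.

    Hypothetical helper for illustration; it is not part of mclearn.
    """
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing:", ", ".join(missing) if missing else "none")
```

If the script reports any missing modules, install them (for example through Anaconda) before installing mclearn.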

Datasets

Throughout the experiments, we will be using the dataset from the Sloan Digital Sky Survey. Due to their size, the following datasets are not included in this repo:

projects/alasdair/data/
│   sdss.h5
│   sdss_dr7_photometry_source.csv.gz
│   sdss_full.h5
│   sdss_subclass.h5

The above datasets (except for sdss_full.h5) can be downloaded from the NICTA filestore.

Notebooks

The following nine notebooks accompany Alasdair's thesis on Photometric Classification with Thompson Sampling.

Dataset Preparation

We provide instructions on how to obtain the SDSS dataset from the Sloan SkyServer. We then clean up the data and convert the csv files to HDF5 for quicker reading. We also do some cleaning up of the raw data from the VST ATLAS survey.
Exploratory Data Analysis

To get a feel for the data, we plot the distributions of the classes (Galaxy, Quasar, Star). We will see that the data is quite unbalanced, with three times as many galaxies as quasars. A distinction is made between photometry and spectroscopy. We also use PCA to reduce the data down to two dimensions.
Dust Extinction

Dust extinction is a potential problem in photometry, so we compare three sets of reddening corrections (SFD98, SF11, and W14) to see which set is best at removing the bias. It turns out that there are no significant differences between the three extinction vectors.
Learning Curves

To see how random sampling performs, we construct learning curves for SVMs, Logistic Regression, and Random Forest. A grid search with 5-fold cross-validation is performed to choose the best hyperparameters for the SVM and Logistic Regression. We also apply polynomial transformations of degree 2 and 3 to the features.
Class Proportion Estimation

We predict the classes of the 800 million unlabelled SDSS objects using a random forest.
Active Learning with SDSS

We look at six active learning heuristics and see how well they perform on the SDSS dataset.
Active Learning with VST ATLAS

We look at six active learning heuristics and see how well they perform on the VST ATLAS dataset.
Thompson Sampling with SDSS

We now examine the six active learning heuristics under the multi-armed bandit setting with Thompson sampling, using the SDSS dataset.
Thompson Sampling with VST ATLAS

We repeat the same Thompson sampling experiment with the VST ATLAS dataset.
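
To give a feel for the bandit setting used in the last two notebooks, here is a minimal sketch of Thompson sampling in which each arm stands for an active learning heuristic. It is a simplified illustration, not the implementation in mclearn or in the thesis. It assumes a Bernoulli reward (1 if the chosen heuristic's query improved the classifier, 0 otherwise) with a Beta posterior per arm. The heuristic names and success rates are hypothetical, and the reward is simulated rather than measured by retraining a classifier.

```python
import random


def thompson_select(successes, failures, rng):
    """Sample from each arm's Beta posterior and return the best arm's index."""
    samples = [
        rng.betavariate(s + 1, f + 1)  # Beta(1, 1), i.e. a uniform prior
        for s, f in zip(successes, failures)
    ]
    return max(range(len(samples)), key=samples.__getitem__)


def run_bandit(true_rates, n_rounds, seed=0):
    """Simulate Thompson sampling over heuristics with Bernoulli rewards.

    true_rates are hypothetical probabilities that a heuristic's query
    improves the classifier. In a real experiment the reward would come
    from retraining the classifier and measuring the change in accuracy.
    Returns how many times each arm was chosen.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        arm = thompson_select(successes, failures, rng)
        # Simulated reward: success with the arm's (hidden) true rate.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    heuristics = ["heuristic_a", "heuristic_b", "heuristic_c"]  # illustrative
    pulls = run_bandit([0.1, 0.5, 0.9], n_rounds=2000)
    for name, count in zip(heuristics, pulls):
        print(name, count)
```

Because each arm is chosen in proportion to how likely it is to be the best under the current posterior, the sampler explores all heuristics early on and then concentrates on the one that has been rewarded most often.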

mclearn
Release 0.1.2