MLme

tools to work on machine learning and data science projects


Keywords
data-science, machine-learning, pandas, tools
License
MIT
Install
pip install MLme==0.0.101

Documentation

For now, there is only one module:

  • structured_data -- tools for working with structured data (such as tables in the form of a pandas DataFrame)

Structured Data Module

summarizeColumns(dataframe, filename=None)

The function analyzes each column of the DataFrame 'dataframe' for (1) data type, (2) number of unique values, and (3) percentage of missing data points. When filename is not None, the summary is also saved to a CSV file with that name.

INPUTS:
  • dataframe -- a pandas DataFrame to analyze
  • filename -- (optional) name of the file in which to save the analysis summary

OUTPUT: a DataFrame containing the summary
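
The three per-column statistics described above can be reproduced with plain pandas. The following is a minimal sketch, not MLme's actual code: the function name summarize_columns_sketch and the summary column names (dtype, n_unique, pct_missing) are assumptions made for illustration.

```python
import pandas as pd

def summarize_columns_sketch(dataframe, filename=None):
    """Hypothetical stand-in for summarizeColumns(); the real layout may differ."""
    summary = pd.DataFrame({
        "dtype": dataframe.dtypes.astype(str),         # data type of each column
        "n_unique": dataframe.nunique(),               # missing values are not counted
        "pct_missing": dataframe.isna().mean() * 100,  # share of missing entries, in percent
    })
    if filename is not None:
        summary.to_csv(filename)                       # save only when a filename is given
    return summary

df = pd.DataFrame({
    "city": pd.Series(["Oslo", "Lima", None, "Oslo"], dtype=object),
    "age": [31, 45, 27, 45],
})
summary = summarize_columns_sketch(df)
print(summary)
```

The summary has one row per column of the input, which makes it easy to spot columns that are mostly empty or that have very many distinct values.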

getLabelDict(dataframe)

The function walks through each column of the given dataframe. If a column has dtype object, it finds the column's unique values and assigns a unique numeric label to each one.

INPUT: dataframe -- a pandas DataFrame to analyze

OUTPUT: a dictionary mapping each column name to another dictionary that maps each unique value to a unique numeric label.
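
To make the nested dictionary concrete, here is a minimal sketch of the idea in plain pandas. It is not MLme's implementation: the name get_label_dict_sketch is hypothetical, and the choices to number labels from 0 in order of first appearance and to skip missing values are assumptions.

```python
import pandas as pd

def get_label_dict_sketch(dataframe):
    """Hypothetical stand-in for getLabelDict()."""
    label_dict = {}
    for col in dataframe.columns:
        if dataframe[col].dtype == object:              # only object-typed columns
            uniques = dataframe[col].dropna().unique()  # order of first appearance
            label_dict[col] = {value: label for label, value in enumerate(uniques)}
    return label_dict

df = pd.DataFrame({
    "color": pd.Series(["red", "green", "red"], dtype=object),
    "size": [1, 2, 3],
})
labels = get_label_dict_sketch(df)
print(labels)
```

The numeric column "size" does not appear in the result because its dtype is not object.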

convertObjToLabel(df, labelDict, verbose=False)

Convert all entries in the dataframe's columns whose original dtype is 'object' to the corresponding unique labels according to the given labelDict.

INPUTS:
  • df -- a pandas DataFrame to convert
  • labelDict -- a value-label dictionary returned by getLabelDict()
  • verbose -- (optional) True or False; whether to show the progress of the conversion

OUTPUT: a dataframe with all object-type values converted to the correct numeric labels
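
The conversion amounts to a per-column dictionary lookup. The sketch below shows one way to do it with pandas' Series.map; the name convert_obj_to_label_sketch is hypothetical, the verbose option is left out, and returning a copy rather than modifying df in place is an assumption.

```python
import pandas as pd

def convert_obj_to_label_sketch(df, label_dict):
    """Hypothetical stand-in for convertObjToLabel(), without the verbose option."""
    out = df.copy()                        # leave the caller's dataframe untouched
    for col, mapping in label_dict.items():
        out[col] = out[col].map(mapping)   # values absent from the mapping become NaN
    return out

df = pd.DataFrame({
    "color": pd.Series(["red", "green", "red"], dtype=object),
    "size": [1, 2, 3],
})
label_dict = {"color": {"red": 0, "green": 1}}  # shaped like the output of getLabelDict()
encoded = convert_obj_to_label_sketch(df, label_dict)
print(encoded)
```

Because unmapped values turn into NaN, a label dictionary built from one dataframe should cover every value that occurs in the dataframe being converted.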

getOneHotDict(df, labelDict)

The function builds a onehot representation for every column listed in labelDict.

INPUTS:
  • df -- a pandas DataFrame whose elements are all numeric, i.e. one that has been through convertObjToLabel()
  • labelDict -- a value-label dictionary returned by getLabelDict()

OUTPUT: a dictionary mapping each column name listed in labelDict to an index-matching dataframe in which each row is the onehot representation of the original numeric label.
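
The shape of this output can be illustrated with a short sketch. It assumes labels run from 0 to n - 1 for each column and uses made-up names for the function (get_onehot_dict_sketch) and for the onehot columns; MLme's own naming may differ.

```python
import numpy as np
import pandas as pd

def get_onehot_dict_sketch(df, label_dict):
    """Hypothetical stand-in for getOneHotDict(); assumes labels are 0..n-1."""
    onehot = {}
    for col, mapping in label_dict.items():
        n_classes = len(mapping)
        codes = df[col].to_numpy(dtype=int)
        matrix = np.eye(n_classes, dtype=int)[codes]   # row i is the onehot vector of codes[i]
        names = [f"{col}_{label}" for label in range(n_classes)]
        # Reuse df's index so the rows stay aligned with the original dataframe
        onehot[col] = pd.DataFrame(matrix, index=df.index, columns=names)
    return onehot

df = pd.DataFrame({"color": [0, 1, 0], "size": [1, 2, 3]}, index=[10, 11, 12])
label_dict = {"color": {"red": 0, "green": 1}}
ohdict = get_onehot_dict_sketch(df, label_dict)
print(ohdict["color"])
```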

NColumnOnehot(df, ohdict)

Calculate the number of columns that the label-encoded dataframe, df, would have if it were converted to a onehot encoding.

INPUTS:
  • df -- a pandas DataFrame whose elements are all numeric, i.e. one that has been through convertObjToLabel()
  • ohdict -- a dictionary returned by getOneHotDict(...)

OUTPUT: the number of columns
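
One plausible way to arrive at this count: every column listed in ohdict contributes as many columns as its onehot dataframe is wide, and every other column contributes one. The sketch below implements that arithmetic under this assumption; n_column_onehot_sketch is a hypothetical name.

```python
import pandas as pd

def n_column_onehot_sketch(df, ohdict):
    """Hypothetical stand-in for NColumnOnehot()."""
    total = 0
    for col in df.columns:
        # A onehot-encoded column expands to the width of its onehot dataframe
        total += ohdict[col].shape[1] if col in ohdict else 1
    return total

df = pd.DataFrame({"color": [0, 1, 2], "size": [1, 2, 3], "weight": [0.5, 0.7, 0.9]})
ohdict = {"color": pd.DataFrame([[1, 0, 0], [0, 1, 0], [0, 0, 1]], index=df.index)}
n_cols = n_column_onehot_sketch(df, ohdict)
print(n_cols)  # 3 onehot columns for "color" plus "size" and "weight"
```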

to_categorical(y, num_classes=None)

Converts a class vector (integers) to a binary class matrix, e.g. for use with categorical_crossentropy.

ARGUMENTS:
  • y -- class vector to be converted into a matrix (integers from 0 to num_classes - 1)
  • num_classes -- total number of classes

RETURNS: a binary matrix representation of the input. The classes axis is placed last.

COPIED FROM https://github.com/keras-team/keras/blob/master/keras/utils/np_utils.py
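
For a one-dimensional class vector the conversion reduces to a few lines of NumPy. The sketch below is a simplified illustration written for this purpose, not the Keras code referenced above; the name to_categorical_sketch is hypothetical and only 1-D input is handled.

```python
import numpy as np

def to_categorical_sketch(y, num_classes=None):
    """Simplified 1-D illustration of the class-vector-to-matrix conversion."""
    y = np.asarray(y, dtype=int).ravel()
    if num_classes is None:
        num_classes = int(y.max()) + 1              # infer from the largest label
    categorical = np.zeros((y.shape[0], num_classes), dtype="float32")
    categorical[np.arange(y.shape[0]), y] = 1.0     # set one entry per row
    return categorical

matrix = to_categorical_sketch([0, 2, 1])
print(matrix)
```

Each row contains a single 1 in the position given by the class label, so the result has shape (number of samples, num_classes).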

getNormFactors(df, cols)

Return a dictionary with column names as keys and (mean, std) tuples as values.

INPUTS:
  • df -- a pandas DataFrame whose elements are all numeric, i.e. one that has been through convertObjToLabel()
  • cols -- a list of column names in df for which the mean and std values are computed. An error is raised if any member of cols is not in df.
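
A minimal sketch of such a dictionary, built with plain pandas, is shown here. The name get_norm_factors_sketch and the use of KeyError are assumptions, and the standard deviation is the pandas default (sample standard deviation, ddof=1), which may not match what MLme computes.

```python
import pandas as pd

def get_norm_factors_sketch(df, cols):
    """Hypothetical stand-in for getNormFactors()."""
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise KeyError(f"columns not in dataframe: {missing}")
    # pandas' std() uses ddof=1 by default (an assumption about the intended statistic)
    return {c: (float(df[c].mean()), float(df[c].std())) for c in cols}

df = pd.DataFrame({"height": [1.0, 2.0, 3.0], "color": [0, 1, 0]})
factors = get_norm_factors_sketch(df, ["height"])
print(factors)
```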

convertDFtoNP(df, ohdict=None, normdict=None)

Return a numpy array of the values in the dataframe, optionally applying onehot conversion and normalization.

INPUTS:
  • df -- a pandas DataFrame whose elements are all numeric, i.e. one that has been through convertObjToLabel()
  • ohdict -- (optional) a dictionary returned by getOneHotDict(...)
  • normdict -- (optional) a dictionary returned by getNormFactors(...)

Note that the index of df and the indices of the dataframes in ohdict must match; otherwise the conversion will be wrong.
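
The sketch below shows one way the two optional steps could be combined, and why the index matters: the onehot rows are looked up by df's index before being stacked next to the other columns. It is an illustration under assumptions, not MLme's code: the name convert_df_to_np_sketch is hypothetical, onehot blocks are assumed to replace their source column in place, and normalization is assumed to mean (x - mean) / std.

```python
import numpy as np
import pandas as pd

def convert_df_to_np_sketch(df, ohdict=None, normdict=None):
    """Hypothetical stand-in for convertDFtoNP()."""
    ohdict = ohdict or {}
    normdict = normdict or {}
    blocks = []
    for col in df.columns:
        if col in ohdict:
            # Select onehot rows by df's index so rows stay aligned
            blocks.append(ohdict[col].loc[df.index].to_numpy(dtype=float))
        elif col in normdict:
            mean, std = normdict[col]
            blocks.append(((df[col] - mean) / std).to_numpy(dtype=float).reshape(-1, 1))
        else:
            blocks.append(df[col].to_numpy(dtype=float).reshape(-1, 1))
    return np.hstack(blocks)   # one row per sample, columns in df's order

df = pd.DataFrame({"color": [0, 1, 0], "height": [1.0, 2.0, 3.0]})
ohdict = {"color": pd.DataFrame([[1, 0], [0, 1], [1, 0]], index=df.index)}
normdict = {"height": (2.0, 1.0)}
array = convert_df_to_np_sketch(df, ohdict, normdict)
print(array)
```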