MLme
Tools for working on machine learning and data science projects.
For now, there is only one module:
- structured_data -- provides tools for working with structured data (such as tables in the form of a pandas DataFrame)
Structured Data Module
summarizeColumns(dataframe, filename=None)
The function analyzes each column in the DataFrame 'dataframe' for (1) data type; (2) number of unique values; (3) percentage of missing data points. If filename is not None, the summary is also saved to a CSV file with that name.
INPUTS:
- dataframe -- a pandas DataFrame to analyze
- filename -- (optional) name of the CSV file in which to save the summary
OUTPUT: a DataFrame containing the summary
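The three statistics described above can be computed with plain pandas calls. The following is a minimal sketch, not the module's actual code: the function name summarize_columns_sketch and the summary column names (dtype, n_unique, pct_missing) are assumptions made for illustration.

```python
import pandas as pd


def summarize_columns_sketch(df, filename=None):
    """Hypothetical stand-in for summarizeColumns(); layout of the result is assumed."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),         # data type of each column
        "n_unique": df.nunique(),               # unique values, missing entries not counted
        "pct_missing": df.isna().mean() * 100,  # share of missing entries, in percent
    })
    if filename is not None:
        summary.to_csv(filename)                # only written when a filename is given
    return summary


df = pd.DataFrame({
    "city": ["Oslo", "Rome", None, "Oslo"],
    "temp": [3.5, 18.0, 21.0, 4.0],
})
summary = summarize_columns_sketch(df)
print(summary)
```

Each row of the result describes one column of the input, so the summary can be filtered like any other DataFrame, for example to list columns with many missing values.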
getLabelDict(dataframe)
The function walks through each column of the given dataframe. If a column has dtype object, the function finds its unique values and assigns each one a unique numeric label.
INPUT: dataframe -- a pandas DataFrame to analyze
OUTPUT: a dictionary mapping each column name to another dictionary, which maps each unique value to a unique numeric label.
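To make the nested dictionary concrete, here is a minimal sketch of how such a mapping could be built. It is not the module's implementation: the name get_label_dict_sketch is hypothetical, and numbering labels from 0 in order of first appearance and skipping missing values are assumptions.

```python
import pandas as pd


def get_label_dict_sketch(df):
    """Hypothetical stand-in for getLabelDict(); label numbering is assumed."""
    label_dict = {}
    for col in df.columns:
        if df[col].dtype == object:              # only object-typed columns are labeled
            values = df[col].dropna().unique()   # unique values in order of first appearance
            label_dict[col] = {value: label for label, value in enumerate(values)}
    return label_dict


df = pd.DataFrame({
    "color": pd.Series(["red", "blue", "red"], dtype=object),
    "size": [1, 2, 3],
})
label_dict = get_label_dict_sketch(df)
print(label_dict)
```

The numeric column "size" does not appear in the result, because only object columns receive labels.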
convertObjToLabel(df, labelDict, verbose=False)
Converts all entries in the dataframe's columns whose original dtype is 'object' to the corresponding unique labels according to the given labelDict.
INPUTS:
- df -- a pandas DataFrame to convert
- labelDict -- a value-label dictionary returned by getLabelDict()
- verbose -- (optional) True or False; whether to show the progress of the conversion
OUTPUT: a dataframe with all object-type values converted to the corresponding numeric labels
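The conversion amounts to a per-column dictionary lookup. The sketch below shows one way to do this with pandas' Series.map; the name convert_obj_to_label_sketch is hypothetical and the code is an illustration under that assumption, not the module's own implementation (the verbose option is left out).

```python
import pandas as pd


def convert_obj_to_label_sketch(df, label_dict):
    """Hypothetical stand-in for convertObjToLabel(), without the verbose option."""
    out = df.copy()                        # leave the caller's dataframe untouched
    for col, mapping in label_dict.items():
        out[col] = out[col].map(mapping)   # values missing from the mapping become NaN
    return out


df = pd.DataFrame({
    "color": pd.Series(["red", "blue", "red"], dtype=object),
    "size": [1, 2, 3],
})
label_dict = {"color": {"red": 0, "blue": 1}}
encoded = convert_obj_to_label_sketch(df, label_dict)
print(encoded)
```

Note that with Series.map a value that is absent from the dictionary turns into NaN, so it is worth checking the result for missing entries when new data is converted with an older dictionary.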
getOneHotDict(df, labelDict)
The function returns a dictionary that maps each column name listed in labelDict to a dataframe with the same index as df, in which each row is the one-hot representation of the original numeric label.
INPUTS:
- df -- a pandas DataFrame whose elements are all numeric, i.e. one that has been through convertObjToLabel()
- labelDict -- a value-label dictionary returned by getLabelDict()
OUTPUT: a dictionary from each column name listed in labelDict to an index-matching dataframe of one-hot rows.
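As an illustration of the output shape, the sketch below builds one index-matching one-hot dataframe per labeled column. It assumes that labels run from 0 to the number of classes minus 1; the name get_onehot_dict_sketch is hypothetical and this is not the module's actual code.

```python
import numpy as np
import pandas as pd


def get_onehot_dict_sketch(df, label_dict):
    """Hypothetical stand-in for getOneHotDict(); assumes labels are 0..n_classes-1."""
    oh_dict = {}
    for col, mapping in label_dict.items():
        n_classes = len(mapping)
        # Row i of the identity matrix is the one-hot vector for label i.
        onehot = np.eye(n_classes, dtype=int)[df[col].to_numpy()]
        oh_dict[col] = pd.DataFrame(onehot, index=df.index)  # keep the original index
    return oh_dict


df = pd.DataFrame({"color": [0, 1, 0], "size": [1, 2, 3]}, index=[10, 11, 12])
label_dict = {"color": {"red": 0, "blue": 1}}
oh_dict = get_onehot_dict_sketch(df, label_dict)
print(oh_dict["color"])
```

Keeping the index of df on every one-hot dataframe is what allows the rows to be matched up again later.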
NColumnOnehot(df, ohdict)
Calculates the number of columns that the label-encoded dataframe, df, would have if it were converted to a one-hot encoding.
INPUTS:
- df -- a pandas DataFrame whose elements are all numeric, i.e. one that has been through convertObjToLabel()
- ohdict -- a dictionary returned by getOneHotDict(...)
OUTPUT: the number of columns
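One plausible way to arrive at this count is to add one column for every column that is not one-hot encoded and the width of the one-hot dataframe for every column that is. The sketch below follows that assumption; n_column_onehot_sketch is a hypothetical name and the module may count differently.

```python
import pandas as pd


def n_column_onehot_sketch(df, ohdict):
    """Hypothetical stand-in for NColumnOnehot(); the counting rule is assumed."""
    total = 0
    for col in df.columns:
        if col in ohdict:
            total += ohdict[col].shape[1]  # one column per class
        else:
            total += 1                     # column is kept as a single column
    return total


df = pd.DataFrame({"color": [0, 1, 0], "size": [1, 2, 3], "shape": [2, 0, 1]})
ohdict = {
    "color": pd.DataFrame([[1, 0], [0, 1], [1, 0]]),
    "shape": pd.DataFrame([[0, 0, 1], [1, 0, 0], [0, 1, 0]]),
}
n_cols = n_column_onehot_sketch(df, ohdict)
print(n_cols)
```

Here "color" contributes 2 columns, "shape" contributes 3 and "size" contributes 1, which gives 6 in total.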
to_categorical(y, num_classes=None)
Converts a class vector (integers) to a binary class matrix, e.g. for use with categorical_crossentropy.
INPUTS:
- y -- class vector to be converted into a matrix (integers from 0 to num_classes - 1)
- num_classes -- total number of classes
OUTPUT: a binary matrix representation of the input. The classes axis is placed last.
COPIED FROM https://github.com/keras-team/keras/blob/master/keras/utils/np_utils.py
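For readers unfamiliar with the conversion, the sketch below reproduces the idea for a one-dimensional label vector using NumPy only. It is a simplified illustration, not the Keras code referenced above, and the name to_categorical_sketch is hypothetical.

```python
import numpy as np


def to_categorical_sketch(y, num_classes=None):
    """Simplified illustration for 1-D integer labels; not the Keras implementation."""
    y = np.asarray(y, dtype=int).ravel()
    if num_classes is None:
        num_classes = int(y.max()) + 1  # infer the class count from the largest label
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes)[y]


labels = [0, 2, 1]
onehot = to_categorical_sketch(labels)
print(onehot)
```

Passing num_classes explicitly is useful when a batch does not contain every class, since the inferred width would otherwise be too small.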
getNormFactors(df, cols)
Returns a dictionary with column names as keys and (mean, std) tuples as values.
INPUTS:
- df -- a pandas DataFrame whose elements are all numeric, i.e. one that has been through convertObjToLabel()
- cols -- a list of column names in df to analyze for mean and std values. If any member of cols is not in df, an error is raised.
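The sketch below shows what such a dictionary might look like. It uses pandas' default sample standard deviation (ddof=1), which is an assumption: the module may use the population standard deviation instead. The name get_norm_factors_sketch is hypothetical.

```python
import pandas as pd


def get_norm_factors_sketch(df, cols):
    """Hypothetical stand-in for getNormFactors(); ddof=1 for std is assumed."""
    factors = {}
    for col in cols:
        series = df[col]  # raises KeyError if col is not a column of df
        factors[col] = (float(series.mean()), float(series.std()))
    return factors


df = pd.DataFrame({"size": [1.0, 2.0, 3.0], "weight": [10.0, 10.0, 40.0]})
factors = get_norm_factors_sketch(df, ["size"])
print(factors)
```

Computing the factors once on the training data and reusing them for validation and test data keeps all splits on the same scale.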
convertDFtoNP(df, ohdict=None, normdict=None)
Returns a numpy array of the values in the dataframe, optionally applying the one-hot conversion and the normalization.
INPUTS:
- df -- a pandas DataFrame whose elements are all numeric, i.e. one that has been through convertObjToLabel()
- ohdict -- (optional) a dictionary returned by getOneHotDict(...)
- normdict -- (optional) a dictionary returned by getNormFactors(...)
Note that the index of df and the indices of the dataframes in ohdict must match; otherwise rows may be paired with the wrong one-hot vectors.
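To show how the pieces could fit together, the sketch below assembles the final array column by column. Several details are assumptions rather than documented behavior: the output keeps the column order of df, a column in ohdict is replaced by its one-hot columns, a column in normdict is scaled as (x - mean) / std, and a mismatched index raises an error. The name convert_df_to_np_sketch is hypothetical.

```python
import numpy as np
import pandas as pd


def convert_df_to_np_sketch(df, ohdict=None, normdict=None):
    """Hypothetical stand-in for convertDFtoNP(); column handling is assumed."""
    ohdict = ohdict or {}
    normdict = normdict or {}
    blocks = []
    for col in df.columns:
        if col in ohdict:
            onehot = ohdict[col]
            if not onehot.index.equals(df.index):
                # Mismatched indices would pair rows with the wrong one-hot vectors.
                raise ValueError(f"index of ohdict[{col!r}] does not match df")
            blocks.append(onehot.to_numpy(dtype=float))
        elif col in normdict:
            mean, std = normdict[col]
            scaled = (df[col] - mean) / std
            blocks.append(scaled.to_numpy(dtype=float).reshape(-1, 1))
        else:
            blocks.append(df[col].to_numpy(dtype=float).reshape(-1, 1))
    return np.hstack(blocks)


df = pd.DataFrame({"color": [0, 1, 0], "size": [1.0, 2.0, 3.0]})
ohdict = {"color": pd.DataFrame([[1, 0], [0, 1], [1, 0]], index=df.index)}
normdict = {"size": (2.0, 1.0)}
arr = convert_df_to_np_sketch(df, ohdict, normdict)
print(arr)
```

In this example the "color" column expands to two one-hot columns and the "size" column is centered and scaled, so the result has three columns.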