dfcleanser
Dataframe Cleanser
Dependencies
The full dataframe Cleanser has dependencies on the following python packages.
- numpy
- scipy
- pandas-sklearn/spark
- geopy
- sqlalchemy
- json
- specific database driver(MySQL/MsSQLServer/PostGres/Oracle/SQLite)
Package Installation
It's highly recommended that you install in a virtualenv. The following commands assume you have already done
source activate my-env-name
. After doing so, install both the serverextension and the toolbar button (nbextension) with:
pip install
pip install dfcleanser
conda install
conda install dfcleanser
Jupyter Extension(s)
You can install dfcleanser extensions to the Jupyter Server and Notebook for a permanent implementation of dfcleanser:
dfcleanser Jupyter Installation
Automatic Intallation
With the pip install of dfcleanser the Jupyter hooks to run dfcleanser are automatically defined and the extensions must be enabled via the dfcleanser Jupyter Extension commands defined below.
dfcleanser Jupyter Extension Enable
From a bash or anaconda prompt enter
jupyter serverextension enable --py dfcleanser --sys-prefix
jupyter nbextension install --py dfcleanser --sys-prefix
jupyter nbextension enable --py dfcleanser --sys-prefix
dfcleanser Jupyter Extension Verification
From a bash or anaconda propmt enter
jupyter nbextension list
jupyter serverextension list
Automatic dfcleanser notebook Load/Unload
A new button will appear on your toolbar that looks like this:
Click the button once to toggle between loading and unloading the dataframe cleanser... Double click the button to reset the dataframe cleanser chapters...
Manual Installation
If you do not wish to modify the Jupyter server and subsequent notebooks you can manually install the dfcleanser modules for a temporary implementation of dfcleanser. The manual installation modifies the custom.js and custom.css jupyter files.
Manual dfcleanser Install
Once the dfcleanser package is installed via pip or conda you need to manually install the dfcleanser functionality to be available in Jupyter.
Open a Jupyter notebook and from a code cell run the following :
from dfcleanser.system.install import install_dfcleanser
install_dfcleanser()
Once the above install completes you can check if the install is good via the following run from a Jupyter code cell:
from dfcleanser.system.install import setup_dfcleanser
setup_dfcleanser()
Manual dfcleanser UnInstall
To remove the dfcleanser module from your Jupyter Server run the following:
from dfcleanser.system.install import uninstall_dfcleanser
uninstall_dfcleanser()
Manual dfcleanser Load/Unload
Once you have installed the above components successfully dfcleanser is now available to be dynamically loaded into or unloaded from a Jupyter notebook.
To manually load dfcleanser into your notebook run the following from a Jupyter code cell:
from dfcleanser.system.load import load_dfcleanser
load_dfcleanser()
The dfcleanser Jupyter cells are now set up in your current notebook for you to use independent of any other cells in the current notebook.
To unload dfcleanser from your notebook run the following from a Jupyter code cell:
from dfcleanser.system.load import unload_dfcleanser
unload_dfcleanser()
dfcleanser Data Files
The dfcleanser creates and uses two sets of files while running :
- dfcleanser Common files
- Notebook specific files
The dfcleanser Common files are files used by the provided utilities to maintain a server wide set of common functions, lists, dicts ... that are available within any dfcleanser loaded notebook.
The Notebook specific files are files that are tied specifically to a Jupyter notebook that has/had dfcleanser loaded into it. For every notebook that loads dfcleanser into it a unique cfg an script file is maintained for that notebook.
The notebook specific files are maintained internally by dfcleanser relative to the dfCleasnser package.
The user can manage these files via the System Environment->dfcleanser files button. If you copy, rename
or delete a notebook then use the Environment->dfcleanser files functions to generate proper custom
notebook files.
Datasets/DataSources
dfcleanser allows you to browse through datasets for import or export while running dfcleanser. To be able to automatically browse and identify datasets you must create a Datasets directory in the tree where the notebook is found. All datasets in this directory can be browsed and worked with automatically.
If you do not want to browse you can given the absolute path of your dataset and that will be used.