ocpexplore



Keywords: ocpexplore
License: MIT
Install: pip install ocpexplore==0.3.0

Documentation

Simple functions for tabular data exploration

Dependencies

The following packages are required:

  • pandas
  • numpy
  • matplotlib (used through matplotlib.pyplot)
  • seaborn
  • warnings (part of the Python standard library, no installation needed)

Installation

Run the following command to install the package:

pip install ocpexplore

Use the following code to import the functions. The recommended alias for the package is 'expl'.

import ocpexplore.ocpexplore as expl

Functions

value_counter(df, obj_cols_only = True, unique_limit = 25)

Purpose
Apply the pandas value_counts method to selected columns of a dataframe.

Arguments

  • df: Pandas dataframe
  • obj_cols_only: if True, apply the function only to columns with object dtype
  • unique_limit: apply the function only to columns that have fewer unique values than this limit

Output
For every relevant column, the function prints the counts and ratios of the unique values and draws a bar chart for better visibility.

Example

value_counter(df)
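
The selection logic described above can be sketched with plain pandas. This is only an illustration of the documented behaviour, not the package's own code: the helper name value_counts_sketch is hypothetical, the sample dataframe is made up, and the ratio is assumed to be computed over non-missing values.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def value_counts_sketch(df, obj_cols_only=True, unique_limit=25):
    """Hypothetical helper: counts and ratios for the selected columns."""
    results = {}
    for col in df.columns:
        series = df[col]
        # Treat object and string columns as "object" columns.
        is_text = is_object_dtype(series) or is_string_dtype(series)
        if obj_cols_only and not is_text:
            continue
        # Skip columns with too many distinct values.
        if series.nunique() >= unique_limit:
            continue
        counts = series.value_counts()
        # Assumption: ratio is taken over the non-missing values.
        results[col] = pd.DataFrame(
            {"count": counts, "ratio": counts / counts.sum()}
        )
    return results


# Made-up sample data for illustration.
df = pd.DataFrame(
    {
        "city": ["Budapest", "Vienna", "Budapest", "Prague"],
        "age": [31, 45, 27, 38],
    }
)
summary = value_counts_sketch(df)
print(summary["city"])
```

With the defaults, the numeric column 'age' is skipped because it does not have object dtype.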

 

check_ID(df)

Purpose
Check how many unique values are present in each column of the provided dataframe

Arguments

  • df: Pandas dataframe

Output
Prints the number of unique values in each column, and the ratio of the unique values to all values

Example

check_ID(df)

# or

check_ID(df[['relevant_col1', 'relevant_col2']])
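
As a rough illustration of what such a check reports, the same numbers can be computed directly with pandas. The column names and data below are made up, and whether check_ID counts missing values as a distinct value is not stated here; this sketch uses the pandas default, which ignores them.

```python
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame(
    {
        "customer_id": [101, 102, 103, 104],
        "segment": ["a", "b", "a", "a"],
    }
)

# Number of distinct values per column (missing values are ignored by default).
unique_counts = df.nunique()

# Ratio of distinct values to the number of rows.
report = pd.DataFrame(
    {"unique": unique_counts, "ratio": unique_counts / len(df)}
)
print(report)
```

A ratio of 1.0 suggests that the column could serve as a row identifier.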

 

check_NA(df)

Purpose
Check how many missing values are present in each column

Arguments

  • df: Pandas dataframe

Output
Prints the number and ratio of missing values in each column and draws a bar chart for better visibility.

Example

check_NA(df)
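
The numbers reported by such a check can be reproduced with plain pandas, as in the minimal sketch below. The sample data is made up, and the sketch shows only the counts and ratios, not the bar chart.

```python
import pandas as pd

# Made-up sample data with missing values in both columns.
df = pd.DataFrame(
    {
        "a": [1.0, None, 3.0, None],
        "b": ["x", None, "y", "z"],
    }
)

# isna() marks missing cells; summing the booleans counts them per column.
na_count = df.isna().sum()

# The mean of the boolean mask is the share of missing values per column.
na_ratio = df.isna().mean()

report = pd.DataFrame({"missing": na_count, "ratio": na_ratio})
print(report)
```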

 

plot_continuous(df)

Purpose
Visually examine the distribution of values in each numerical column in a dataframe.

Arguments

  • df: Pandas dataframe with numerical columns only.

Output
Creates a boxplot and a density plot for each column in the provided dataframe.

Example

plot_continuous(df)

# or

plot_continuous(df[['numerical_col1', 'numerical_col2']])

 

describe_continuous(df, interpolation = 'nearest')

Purpose
Examine the distribution of values in each numerical column in a dataframe with simple descriptive statistics.

Arguments

  • df: Pandas dataframe with numerical columns only.
  • interpolation: interpolation method used for quantile calculation. For more info, please see pandas.DataFrame.quantile

Output
Creates a dataframe which contains the descriptives for the columns in the input dataframe.

Example

describe_continuous(df)

# or

describe_continuous(df[['numerical_col1', 'numerical_col2']])
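
To see what the interpolation argument changes, the sketch below calls pandas.DataFrame.quantile directly on made-up data. With 'nearest' the quantile is always one of the observed values, while 'linear' may return a value between two observations.

```python
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})

# The 25% quantile sits at position 0.75 between the first two values.
q_nearest = df.quantile(0.25, interpolation="nearest")  # picks an observed value
q_linear = df.quantile(0.25, interpolation="linear")    # interpolates between values

print(q_nearest["x"], q_linear["x"])
```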

 

tail_density_table(df, interpolation = 'nearest')

Purpose
Examine the number of outliers and extreme values for each column in a dataframe.

Arguments

  • df: Pandas dataframe with numerical columns only.
  • interpolation: interpolation method used for quantile calculation. For more info, please see pandas.DataFrame.quantile

Output
Creates a dataframe which contains the number of outliers and extreme values for the columns in the input dataframe.

Example

tail_density_table(df)

# or

tail_density_table(df[['numerical_col1', 'numerical_col2']])
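
The exact definition of outliers and extreme values is not given here. The sketch below assumes the common boxplot convention: a value is an outlier if it lies more than 1.5 interquartile ranges outside the quartiles, and an extreme value if it lies more than 3 interquartile ranges outside. The helper name tail_counts_sketch and the sample data are hypothetical, and the package may use different thresholds.

```python
import pandas as pd


def tail_counts_sketch(df, interpolation="nearest"):
    """Hypothetical helper: per-column counts under the assumed IQR rule."""
    q1 = df.quantile(0.25, interpolation=interpolation)
    q3 = df.quantile(0.75, interpolation=interpolation)
    iqr = q3 - q1

    def count_outside(k):
        # Count values below q1 - k*IQR or above q3 + k*IQR in each column.
        below = df.lt(q1 - k * iqr, axis="columns")
        above = df.gt(q3 + k * iqr, axis="columns")
        return (below | above).sum()

    # Assumption: 1.5 * IQR marks outliers, 3 * IQR marks extreme values.
    # Extreme values are also counted among the outliers.
    return pd.DataFrame(
        {"outliers": count_outside(1.5), "extreme_values": count_outside(3.0)}
    )


# Made-up sample data for illustration.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 15.0, 100.0]})
table = tail_counts_sketch(df)
print(table)
```

In this sample the quartiles are 3 and 7, so 15 counts as an outlier only, while 100 counts as both an outlier and an extreme value.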

 

obs_by_date(column, date_aggregation = 'M')

Purpose
Count the observations in a date column for each period (day, month, or year).

Arguments

  • column: Pandas series in datetime format.
  • date_aggregation: Level of aggregation. Can be
    • days: 'D'
    • months: 'M'
    • years: 'Y'

Output
Returns the result of the value_counts method applied to the aggregated date series and draws a bar plot for better visibility.

Example

obs_by_date(series)
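
One way to obtain such counts with plain pandas is to convert the dates to periods and count them. This is a sketch of the idea on made-up dates, not necessarily how the package aggregates internally, and it leaves out the bar plot.

```python
import pandas as pd

# Made-up datetime series for illustration.
dates = pd.Series(
    pd.to_datetime(
        [
            "2021-01-05",
            "2021-01-20",
            "2021-02-11",
            "2021-03-02",
            "2021-03-15",
            "2021-03-30",
        ]
    )
)

# Convert each timestamp to its month ('M'), then count observations per month.
monthly = dates.dt.to_period("M").value_counts().sort_index()
print(monthly)
```

Passing 'D' or 'Y' to to_period gives daily or yearly counts in the same way.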

 

values_by_date(df, date_column, date_aggregation = 'M')

Purpose
Observe the distribution of values over time.

Arguments

  • df: Pandas dataframe with one date column and numerical columns
  • date_column: name of the date column in the input dataframe.
  • date_aggregation: Level of aggregation. Can be
    • days: 'D'
    • months: 'M'
    • years: 'Y'

Output
Returns one boxplot for every numerical variable in the input dataframe, with the aggregated date on the x axis and the values on the y axis.

Example

values_by_date(df,'date_col')
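
The numbers behind such boxplots can be inspected without plotting by grouping on the aggregated date. The sketch below uses made-up data and a hypothetical 'amount' column. It shows the per-period summary statistics (quartiles, minimum, maximum) rather than the plots themselves.

```python
import pandas as pd

# Made-up sample data: one date column and one numerical column.
df = pd.DataFrame(
    {
        "date_col": pd.to_datetime(
            ["2021-01-05", "2021-01-20", "2021-02-11", "2021-02-25"]
        ),
        "amount": [10.0, 20.0, 30.0, 50.0],
    }
)

# Aggregate the dates to months, then summarise the values within each month.
period = df["date_col"].dt.to_period("M")
by_period = df.groupby(period)["amount"].describe()
print(by_period)
```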