ocpexplore

Simple functions for tabular data exploration

Dependencies

The following packages are required:

pandas
numpy
matplotlib.pyplot
seaborn
warnings

Installation

Run the following command to install the package:

pip install ocpexplore

Please use the following code to import the functions. (The recommended shortcut for using the package is 'expl')

import ocpexplore.ocpexplore as expl

Functions

value_counter(df, obj_cols_only = True, unique_limit = 25)

Purpose
Apply the value counts method to specific columns of a dataframe

Arguments

df: Pandas dataframe
obj_cols_only: apply function to columns with object dtype only
unique_limit: apply function only to columns which have fewer unique values than the provided limit

Output
For every relevant column, the function prints the counts and ratios of unique values and also draws a barchart for better visibility.

Example

value_counter(df)
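
The column selection described above can be approximated with plain pandas. This is only a rough sketch, not the package's implementation: the helper name value_counts_sketch is hypothetical, it returns tables instead of printing them, and it draws no barchart.

```python
import pandas as pd

def value_counts_sketch(df, obj_cols_only=True, unique_limit=25):
    """Hypothetical stand-in for illustration; not ocpexplore's own code."""
    results = {}
    for col in df.columns:
        # Skip non-object columns when obj_cols_only is set
        if obj_cols_only and not pd.api.types.is_object_dtype(df[col]):
            continue
        # Keep only columns with fewer unique values than the limit
        if df[col].nunique() >= unique_limit:
            continue
        counts = df[col].value_counts()
        ratios = df[col].value_counts(normalize=True)
        results[col] = pd.DataFrame({"count": counts, "ratio": ratios})
    return results

# The object dtype is set explicitly so the example behaves the same
# regardless of how the installed pandas version infers string columns.
df = pd.DataFrame({
    "city": pd.Series(["Oslo", "Oslo", "Rome", "Oslo"], dtype=object),
    "age": [31, 45, 27, 31],
})
tables = value_counts_sketch(df)
print(tables["city"])
```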

check_ID(df)

Purpose
Check how many unique values are present in each column of the provided dataframe

Arguments

df: Pandas dataframe

Output
Prints the number of unique values in each column, and the ratio of the unique values to all values

Example

check_ID(df)

# or

check_ID(df[['relevant_col1', 'relevant_col2']])
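
For orientation, the figures described above can be reproduced with plain pandas. This is a sketch under assumptions, not the function's actual code: the column names are made up, and whether check_ID counts missing values towards the total is not stated here.

```python
import pandas as pd

# Illustrative data; the column names are invented for this example
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "segment": ["a", "b", "a", "a"],
})

unique_counts = df.nunique()            # unique values per column
unique_ratio = unique_counts / len(df)  # share of unique values among all rows
summary = pd.DataFrame({"n_unique": unique_counts, "ratio": unique_ratio})
print(summary)
```

A ratio of 1.0 suggests that the column could serve as an identifier.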

check_NA(df)

Purpose
Check how many missing values are present in each column

Arguments

df: Pandas dataframe

Output
Prints the number and ratio of missing values in each column and also a barchart for better visibility.

Example

check_NA(df)
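
The printed numbers correspond roughly to the following plain pandas calculation. It is a minimal sketch with invented data, without the barchart, and is not taken from the package source.

```python
import numpy as np
import pandas as pd

# Illustrative data with missing values in both columns
df = pd.DataFrame({
    "height": [1.7, np.nan, 1.8, np.nan],
    "name": ["a", "b", None, "d"],
})

na_summary = pd.DataFrame({
    "n_missing": df.isna().sum(),  # number of missing values per column
    "ratio": df.isna().mean(),     # share of missing values per column
})
print(na_summary)
```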

plot_continuous(df)

Purpose
Visually examine the distribution of values in each numerical column in a dataframe.

Arguments

df: Pandas dataframe with numerical columns only.

Output
Creates a boxplot and a density plot for each column in the provided dataframe.

Example

plot_continuous(df)

# or

plot_continuous(df[['numerical_col1', 'numerical_col2']])

describe_continuous(df, interpolation = 'nearest')

Purpose
Examine the distribution of values in each numerical column in a dataframe with simple descriptive statistics.

Arguments

df: Pandas dataframe with numerical columns only.
interpolation: interpolation method used for quantile calculation. For more info, please see pandas.DataFrame.quantile

Output
Creates a dataframe which contains the descriptives for the columns in the input dataframe.

Example

describe_continuous(df)

# or

describe_continuous(df[['numerical_col1', 'numerical_col2']])
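
The effect of the interpolation argument can be seen directly with pandas.DataFrame.quantile, which the argument description refers to. The data below is invented for illustration: 'nearest' returns a value that actually occurs in the column, while 'linear' may return a value between two observations.

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40, 100]})

# The 0.9 quantile falls between the observations 40 and 100
nearest = df.quantile(0.9, interpolation="nearest")  # picks an observed value
linear = df.quantile(0.9, interpolation="linear")    # interpolates between the two

print(nearest["x"], linear["x"])
```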

tail_density_table(df, interpolation = 'nearest')

Purpose
Examine the number of outliers and extreme values for each column in a dataframe.

Arguments

df: Pandas dataframe with numerical columns only.
interpolation: interpolation method used for quantile calculation. For more info, please see pandas.DataFrame.quantile

Output
Creates a dataframe which contains the number of outliers and extreme values for the columns in the input dataframe.

Example

tail_density_table(df)

# or

tail_density_table(df[['numerical_col1', 'numerical_col2']])
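
The thresholds that the function uses for outliers and extreme values are not documented here. As an illustration only, the sketch below assumes the common Tukey convention (outliers beyond 1.5 times the interquartile range, extreme values beyond 3 times). The helper name tail_counts_sketch and the thresholds are assumptions, not the package's definition.

```python
import pandas as pd

def tail_counts_sketch(df, interpolation="nearest"):
    """Hypothetical helper; the 1.5 and 3 IQR thresholds are assumptions."""
    q1 = df.quantile(0.25, interpolation=interpolation)
    q3 = df.quantile(0.75, interpolation=interpolation)
    iqr = q3 - q1
    # Count the values outside the fences, column by column
    outliers = ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).sum()
    extremes = ((df < q1 - 3 * iqr) | (df > q3 + 3 * iqr)).sum()
    return pd.DataFrame({"outliers": outliers, "extremes": extremes})

df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8, 20, 100]})
table = tail_counts_sketch(df)
print(table)
```

In this sketch the outlier count includes the extreme values.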

obs_by_date(column, date_aggregation = 'M')

Purpose
Aggregates the number of observations in a date column for a given period of time.

Arguments

column: Pandas series in datetime format.
date_aggregation: Level of aggregation. Can be
- days: 'D'
- months: 'M'
- years: 'Y'

Output
Returns the results of a value_counts method which was applied to the aggregated date series and also creates a barplot for better visibility.

Example

obs_by_date(series)
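
The aggregation described above can be sketched with plain pandas as follows. This is an approximation with invented dates, not the package's code, and it omits the barplot.

```python
import pandas as pd

# Illustrative datetime series
dates = pd.Series(pd.to_datetime([
    "2021-01-05", "2021-01-20",
    "2021-02-03",
    "2021-03-15", "2021-03-16", "2021-03-30",
]))

# Truncate every date to its month, then count the observations per month
monthly = dates.dt.to_period("M").value_counts().sort_index()
print(monthly)
```

Replacing "M" with "D" or "Y" gives the daily or yearly counts.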

values_by_date(df, date_column, date_aggregation = 'M')

Purpose
Observe the distribution of values over time.

Arguments

df: Pandas dataframe with one date column and numerical columns
date_column: name of the date column in the input dataframe.
date_aggregation: Level of aggregation. Can be
- days: 'D'
- months: 'M'
- years: 'Y'

Output
Returns boxplots with the aggregated date on the x axis and the values on the y axis for every numerical variable in the input dataframe.

Example

values_by_date(df,'date_col')
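
As a non-graphical counterpart to the boxplots, the same grouping can be sketched with plain pandas. The data and the column name amount are invented, and the snippet is not the function's implementation.

```python
import pandas as pd

df = pd.DataFrame({
    "date_col": pd.to_datetime([
        "2021-01-05", "2021-01-20",
        "2021-02-03",
        "2021-03-15", "2021-03-16", "2021-03-30",
    ]),
    "amount": [10, 30, 5, 1, 2, 9],
})

# Group the numerical column by month and summarise each group
period = df["date_col"].dt.to_period("M")
summary = df.groupby(period)["amount"].agg(["count", "median"])
print(summary)
```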
