ocpexplore
Simple functions for tabular data exploration
Dependencies
The following packages are required:
- pandas
- numpy
- matplotlib
- seaborn
- warnings (part of the Python standard library)
Installation
Run the following command to install the package:
pip install ocpexplore
Import the functions as follows (the recommended alias for the package is 'expl'):
import ocpexplore.ocpexplore as expl
Functions
value_counter(df, obj_cols_only = True, unique_limit = 25)
Purpose
Apply the pandas value_counts method to selected columns of a dataframe
Arguments
- df: Pandas dataframe
- obj_cols_only: if True, apply the function only to columns with object dtype
- unique_limit: apply the function only to columns with fewer unique values than this limit
Output
For every relevant column, the function prints the counts and ratios of unique values, and draws a bar chart for better visibility.
Example
value_counter(df)
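The printed output can be approximated with plain pandas. The sketch below (with a hypothetical toy dataframe) shows the kind of information value_counter reports; it is an assumption about the output, not the package's actual implementation:

```python
import pandas as pd

# Toy dataframe for illustration (not part of the package)
df = pd.DataFrame({"color": ["red", "blue", "red", "red", "green"]})

# For each object column under the unique limit, print counts and ratios
for col in df.select_dtypes(include="object").columns:
    if df[col].nunique() < 25:  # default unique_limit
        counts = df[col].value_counts()
        ratios = df[col].value_counts(normalize=True)
        print(pd.DataFrame({"count": counts, "ratio": ratios}))
```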
check_ID(df)
Purpose
Check how many unique values are present in each column of the provided dataframe
Arguments
- df: Pandas dataframe
Output
Prints the number of unique values in each column, and the ratio of unique values to the total number of values
Example
check_ID(df)
# or
check_ID(df[['relevant_col1', 'relevant_col2']])
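The reported figures correspond to standard pandas operations; this is a rough sketch with a toy dataframe, not the package's own code:

```python
import pandas as pd

# Toy dataframe for illustration
df = pd.DataFrame({"id": [1, 2, 3, 4], "group": ["a", "a", "b", "b"]})

# Roughly the information check_ID reports:
# unique count per column and its ratio to the number of rows
n_unique = df.nunique()
unique_ratio = n_unique / len(df)
print(pd.DataFrame({"n_unique": n_unique, "ratio": unique_ratio}))
```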
check_NA(df)
Purpose
Check how many missing values are present in each column
Arguments
- df: Pandas dataframe
Output
Prints the number and ratio of missing values in each column, and draws a bar chart for better visibility.
Example
check_NA(df)
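The missing-value counts and ratios can be reproduced with plain pandas; a minimal sketch, assuming a toy dataframe:

```python
import numpy as np
import pandas as pd

# Toy dataframe with some missing values
df = pd.DataFrame({"x": [1.0, np.nan, 3.0, np.nan], "y": [1, 2, 3, 4]})

# Roughly what check_NA reports per column
na_count = df.isna().sum()   # number of missing values
na_ratio = df.isna().mean()  # ratio of missing values
print(pd.DataFrame({"n_missing": na_count, "ratio": na_ratio}))
```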
plot_continuous(df)
Purpose
Visually examine the distribution of values in each numerical column in a dataframe.
Arguments
- df: Pandas dataframe with numerical columns only.
Output
Creates a boxplot and a density plot for each column in the provided dataframe.
Example
plot_continuous(df)
# or
plot_continuous(df[['numerical_col1', 'numerical_col2']])
describe_continuous(df, interpolation = 'nearest')
Purpose
Examine the distribution of values in each numerical column in a dataframe with simple descriptive statistics.
Arguments
- df: Pandas dataframe with numerical columns only.
- interpolation: interpolation method used for quantile calculation. For more info, please see pandas.DataFrame.quantile
Output
Creates a dataframe which contains the descriptives for the columns in the input dataframe.
Example
describe_continuous(df)
# or
describe_continuous(df[['numerical_col1', 'numerical_col2']])
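The 'nearest' interpolation passed to pandas quantile returns actual observed values rather than interpolated ones. A small sketch of this behaviour (toy data, not the package's implementation):

```python
import pandas as pd

# Toy dataframe for illustration
df = pd.DataFrame({"v": [1, 2, 3, 4, 10]})

# With interpolation='nearest', each quantile is an actual observation
q = df["v"].quantile([0.25, 0.5, 0.75], interpolation="nearest")
print(q)
```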
tail_density_table(df, interpolation = 'nearest')
Purpose
Examine the number of outliers and extreme values in each numerical column of a dataframe.
Arguments
- df: Pandas dataframe with numerical columns only.
- interpolation: interpolation method used for quantile calculation. For more info, please see pandas.DataFrame.quantile
Output
Creates a dataframe which contains the number of outliers and extreme values for the columns in the input dataframe.
Example
tail_density_table(df)
# or
tail_density_table(df[['numerical_col1', 'numerical_col2']])
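A common way to count such values is Tukey's fence rule; the cutoffs below (1.5 × IQR for outliers, 3 × IQR for extreme values) are an assumption about the package's definitions, shown here only as a sketch:

```python
import pandas as pd

# Toy dataframe with one clear outlier
df = pd.DataFrame({"v": [1, 2, 3, 4, 5, 100]})

# Assumed Tukey-style cutoffs (the package may use different ones):
# outlier: beyond 1.5 * IQR from the quartiles; extreme: beyond 3 * IQR
q1 = df["v"].quantile(0.25, interpolation="nearest")
q3 = df["v"].quantile(0.75, interpolation="nearest")
iqr = q3 - q1
outliers = ((df["v"] < q1 - 1.5 * iqr) | (df["v"] > q3 + 1.5 * iqr)).sum()
extremes = ((df["v"] < q1 - 3 * iqr) | (df["v"] > q3 + 3 * iqr)).sum()
print(outliers, extremes)
```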
obs_by_date(column, date_aggregation = 'M')
Purpose
Aggregate the number of observations in a date column over a given period of time.
Arguments
- column: Pandas series in datetime format.
- date_aggregation: Level of aggregation. Can be
- days: 'D'
- months: 'M'
- years: 'Y'
Output
Returns the result of applying the value_counts method to the aggregated date series, and draws a bar plot for better visibility.
Example
obs_by_date(series)
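The aggregation can be sketched with pandas period conversion; a minimal approximation of the returned counts, assuming a toy datetime series:

```python
import pandas as pd

# Toy datetime series for illustration
series = pd.Series(pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-10"]))

# Monthly aggregation ('M'): collapse dates to periods, then count
monthly = series.dt.to_period("M").value_counts().sort_index()
print(monthly)
```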
values_by_date(df, date_column, date_aggregation = 'M')
Purpose
Observe the distribution of values over time.
Arguments
- df: Pandas dataframe with one date column and numerical columns
- date_column: name of the date column in the input dataframe.
- date_aggregation: Level of aggregation. Can be
- days: 'D'
- months: 'M'
- years: 'Y'
Output
Returns boxplots with the aggregated date on the x axis and the values on the y axis, for every numerical variable in the input dataframe.
Example
values_by_date(df,'date_col')
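The grouping step behind those boxplots can be sketched with pandas; this toy example shows how values end up in one group per period (an assumption about the mechanics, not the package's code):

```python
import pandas as pd

# Toy dataframe with a date column and one numerical column
df = pd.DataFrame({
    "date_col": pd.to_datetime(
        ["2023-01-05", "2023-01-20", "2023-02-10", "2023-02-15"]
    ),
    "value": [1.0, 3.0, 2.0, 4.0],
})

# Collapse dates to monthly periods and collect values per period;
# values_by_date then presumably draws one boxplot per period
period = df["date_col"].dt.to_period("M")
groups = df.groupby(period)["value"].apply(list)
print(groups)
```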