panda-helper

Data profiler for Pandas


Keywords
data-profiling, data-analysis, data-profiler, data-profiling-utility, pandas, python
License
MIT
Install
pip install panda-helper==0.0.2

Documentation

PyPI - Python Version Tests Status Lint/Format Status

Panda-Helper: Quickly and easily inspect data

Panda-Helper is a simple data-profiling utility for Pandas DataFrames and Series

Assess data quality and usefulness with minimal effort

Quickly perform initial data exploration, so you can move on to more in-depth analysis


DataFrame profiles:

  • Report shape
  • Detect duplicated rows
  • Display series names and data types
  • Calculate distribution statistics on null values per row providing a view on data completeness

Sample DataFrame profile
Vehicles passing through toll stations

DataFrame-Level Info
-------------------------  ------------
DF Shape                   (1586280, 6)
Duplicated Rows             2184

Column Name                 Data Type
--------------------------  -----------
Plaza ID                    int64
Date                        object
Hour                        int64
Direction                   object
# Vehicles - ETC (E-ZPass)  int64
# Vehicles - Cash/VToll     int64

Summary of Nulls Per Row
--------------------------  -----------
count                       1.58628e+06
min                         0
1%                          0
5%                          0
25%                         0
50%                         0
75%                         0
95%                         0
99%                         0
max                         0
median                      0
mean                        0
median absolute deviation   0
standard deviation          0
skew                        0

Series profiles report the:

  • Series data type
  • Count of non-null values in the series
  • Number of unique values
  • Count of null values
  • Counts and frequency of the most and least common values
  • Distribution statistics for numeric data

Sample profile of categorical data
Direction vehicles are traveling

Direction Info
----------------  -------
Data Type         object
Count             1586280
Unique Values     2
Null Values       0

Value      Count  % of total
-------  -------  ------------
I         814100  51.32%
O         772180  48.68%

Sample profile of numeric data
Hourly vehicle counts at tolling points

# Vehicles - ETC (E-ZPass) Info
---------------------------------  -------
Data Type                          int64
Count                              1586280
Unique Values                      8987
Null Values                        0

  Value    Count  % of total
-------  -------  ------------
      0     3137  0.20%
     43     1762  0.11%
     44     1743  0.11%
     40     1712  0.11%
     42     1699  0.11%
     41     1682  0.11%
     39     1676  0.11%
     37     1673  0.11%
     48     1659  0.10%
     46     1654  0.10%
     38     1646  0.10%
     45     1641  0.10%
     36     1636  0.10%
     52     1574  0.10%
     47     1572  0.10%
     50     1571  0.10%
     51     1555  0.10%
     53     1547  0.10%
     55     1543  0.10%
     34     1534  0.10%
   8269        1  0.00%
   8438        1  0.00%
   8876        1  0.00%
   8261        1  0.00%
   8694        1  0.00%

Statistic                            Value
-------------------------  ---------------
count                          1.58628e+06
min                            0
1%                            25
5%                            68
25%                          407
50%                         1054
75%                         2071
95%                         3583
99%                         6308
max                        16854
median                      1054
mean                        1373.16
median absolute deviation    751
standard deviation          1253.1
skew                           1.69154

Installing Panda-Helper

pip install panda-helper


Using Panda-Helper

Profiling a DataFrame
Create the DataFrameProfile and then display it or save the profile.

import pandas as pd
import pandahelper.reports as ph

data = {
    "user_id": [1, 2, 3, 4, 4],
    "transaction": ["purchase", "return", "purchase", "exchange", "exchange"],
    "amount": [100.00, None, 1400.00, 85.12, 85.12],
    "survey": [None, None, None, "online", "online"],
}
df = pd.DataFrame(data)
df_profile = ph.DataFrameProfile(df)
df_profile
DataFrame-Level Info
-------------------------  ------
DF Shape                   (5, 4)
Obviously Duplicated Rows  1

Column Name    Data Type
-------------  -----------
user_id        int64
transaction    object
amount         float64
survey         object

Summary of Nulls Per Row
--------------------------  --------
count                       5
min                         0
1%                          0
5%                          0
25%                         0
50%                         1
75%                         1
95%                         1.8
99%                         1.96
max                         2
median                      1
mean                        0.8
median absolute deviation   1
standard deviation          0.83666
skew                        0.512241
df_profile.save_report("df_profile.txt")

Profiling a Series
Create the SeriesProfile and then display it or save it. That's it!

series_profile = ph.SeriesProfile(df["amount"])
series_profile
amount Info
-------------  -------
Data Type      float64
Count          4
Unique Values  3
Null Values    1

  Value    Count  % of total
-------  -------  ------------
  85.12        2  50.00%
 100           1  25.00%
1400           1  25.00%

Statistic                       Value
-------------------------  ----------
count                         4
min                          85.12
1%                           85.12
5%                           85.12
25%                          85.12
50%                          92.56
75%                         425
95%                        1205
99%                        1361
max                        1400
median                       92.56
mean                        417.56
median absolute deviation     7.44
standard deviation          654.998
skew                          1.99931
series_profile.save_report("amount_profile.txt")

Sample data obtained from: