pandas-cub

A simple data analysis library similar to pandas


License
MIT
Install
pip install pandas-cub==0.0.7

Documentation

How to use pandas_cub

The README.ipynb notebook will serve as the documentation and usage guide to pandas_cub.

Installation

pip install pandas-cub

What is pandas_cub?

pandas_cub is a simple data analysis library that emulates the functionality of the pandas library. The library is not meant for serious work. It was built as an assignment for one of Ted Petrou's Python classes. If you would like to complete the assignment on your own, visit this repository. There are about 40 steps and 100 tests that you must pass in order to rebuild the library. It is a good challenge and teaches you the fundamentals of how to build your own data analysis library.

pandas_cub functionality

pandas_cub has limited functionality but is still capable of a wide variety of data analysis tasks.

  • Subset selection with the brackets
  • Arithmetic and comparison operators (+, -, <, !=, etc...)
  • Aggregation of columns with most of the common functions (min, max, mean, median, etc...)
  • Grouping via pivot tables
  • String-only methods for columns containing strings
  • Reading in simple comma-separated value files
  • Several other methods

pandas_cub DataFrame

pandas_cub has a single main object, the DataFrame, to hold all of the data. The DataFrame is capable of holding 4 data types - booleans, integers, floats, and strings. All data is stored in NumPy arrays. panda_cub DataFrames have no index (as in pandas). The columns must be strings.

Missing value representation

Boolean and integer columns will have no missing value representation. The NumPy NaN is used for float columns and the Python None is used for string columns.

Code Examples

pandas_cub syntax is very similar to pandas, but implements much fewer methods. The below examples will cover just about all of the API.

Reading data with read_csv

pandas_cub consists of a single function, read_csv, that has a single parameter, the location of the file you would like to read in as a DataFrame. This function can only handle simple CSV's and the delimiter must be a comma. A sample employee dataset is provided in the data directory. Notice that the visual output of the DataFrame is nearly identical to that of a pandas DataFrame. The head method returns the first 5 rows by default.

import pandas_cub as pdc
df = pdc.read_csv('data/employee.csv')
df.head()
dept race gender salary
0 Houston Police Department-HPD White Male 45279
1 Houston Fire Department (HFD) White Male 63166
2 Houston Police Department-HPD Black Male 66614
3 Public Works & Engineering-PWE Asian Male 71680
4 Houston Airport System (HAS) White Male 42390

DataFrame properties

The shape property returns a tuple of the number of rows and columns

df.shape
(1535, 4)

The len function returns just the number of rows.

len(df)
1535

The dtypes property returns a DataFrame of the column names and their respective data type.

df.dtypes
Column Name Data Type
0 dept string
1 race string
2 gender string
3 salary int

The columns property returns a list of the columns.

df.columns
['dept', 'race', 'gender', 'salary']

Set new columns by assigning the columns property to a list.

df.columns = ['department', 'race', 'gender', 'salary']
df.head()
department race gender salary
0 Houston Police Department-HPD White Male 45279
1 Houston Fire Department (HFD) White Male 63166
2 Houston Police Department-HPD Black Male 66614
3 Public Works & Engineering-PWE Asian Male 71680
4 Houston Airport System (HAS) White Male 42390

The values property returns a single numpy array of all the data.

df.values
array([['Houston Police Department-HPD', 'White', 'Male', 45279],
       ['Houston Fire Department (HFD)', 'White', 'Male', 63166],
       ['Houston Police Department-HPD', 'Black', 'Male', 66614],
       ...,
       ['Houston Police Department-HPD', 'White', 'Male', 43443],
       ['Houston Police Department-HPD', 'Asian', 'Male', 55461],
       ['Houston Fire Department (HFD)', 'Hispanic', 'Male', 51194]],
      dtype=object)

Subset selection

Subset selection is handled with the brackets. To select a single column, place that column name in the brackets.

df['race'].head()
race
0 White
1 White
2 Black
3 Asian
4 White

Select multiple columns with a list of strings.

df[['race', 'salary']].head()
race salary
0 White 45279
1 White 63166
2 Black 66614
3 Asian 71680
4 White 42390

Simultaneously select rows and columns by passing the brackets the row selection followed by the column selection separated by a comma. Here we use integers for rows and strings for columns.

rows = [10, 50, 100]
cols = ['salary', 'race']
df[rows, cols]
salary race
0 77076 Black
1 81239 White
2 81239 White

You can use integers for the columns as well.

rows = [10, 50, 100]
cols = [2, 0]
df[rows, cols]
gender department
0 Male Houston Police Department-HPD
1 Male Houston Police Department-HPD
2 Male Houston Police Department-HPD

You can use a single integer and not just a list.

df[99, 3]
salary
0 66614

Or a single string for the columns

df[99, 'salary']
salary
0 66614

You can use a slice for the rows

df[20:100:10, ['race', 'gender']]
race gender
0 White Male
1 White Male
2 Hispanic Male
3 White Male
4 White Male
5 Hispanic Male
6 Hispanic Male
7 Black Female

You can also slice the columns with either integers or strings

df[20:100:10, :2]
department race
0 Houston Police Department-HPD White
1 Houston Fire Department (HFD) White
2 Houston Police Department-HPD Hispanic
3 Houston Police Department-HPD White
4 Houston Fire Department (HFD) White
5 Houston Police Department-HPD Hispanic
6 Houston Fire Department (HFD) Hispanic
7 Houston Police Department-HPD Black
df[20:100:10, 'department':'gender']
department race gender
0 Houston Police Department-HPD White Male
1 Houston Fire Department (HFD) White Male
2 Houston Police Department-HPD Hispanic Male
3 Houston Police Department-HPD White Male
4 Houston Fire Department (HFD) White Male
5 Houston Police Department-HPD Hispanic Male
6 Houston Fire Department (HFD) Hispanic Male
7 Houston Police Department-HPD Black Female

You can do boolean selection if you pass the brackets a one-column boolean DataFrame.

filt = df['salary'] > 100000
filt.head()
salary
0 False
1 False
2 False
3 False
4 False
df[filt].head()
department race gender salary
0 Public Works & Engineering-PWE White Male 107962
1 Health & Human Services Black Male 180416
2 Houston Fire Department (HFD) Hispanic Male 165216
3 Health & Human Services White Female 100791
4 Houston Airport System (HAS) White Male 120916
df[filt, ['race', 'salary']].head()
race salary
0 White 107962
1 Black 180416
2 Hispanic 165216
3 White 100791
4 White 120916

Assigning Columns

You can only assign an entire new column or overwrite an old one. You cannot assign a subset of the data. You can assign a new column with a single value like this:

df['bonus'] = 1000
df.head()
department race gender salary bonus
0 Houston Police Department-HPD White Male 45279 1000
1 Houston Fire Department (HFD) White Male 63166 1000
2 Houston Police Department-HPD Black Male 66614 1000
3 Public Works & Engineering-PWE Asian Male 71680 1000
4 Houston Airport System (HAS) White Male 42390 1000

You can assign with a numpy array the same length as a column.

import numpy as np
df['bonus'] = np.random.randint(100, 5000, len(df))
df.head()
department race gender salary bonus
0 Houston Police Department-HPD White Male 45279 3536
1 Houston Fire Department (HFD) White Male 63166 1296
2 Houston Police Department-HPD Black Male 66614 511
3 Public Works & Engineering-PWE Asian Male 71680 4267
4 Houston Airport System (HAS) White Male 42390 3766

You can assign a new column with a one column DataFrame.

df['salary'] + df['bonus']
salary
0 48815
1 64462
2 67125
3 75947
4 46156
5 110001
6 53738
7 185348
8 32575
9 57918
... ...
1525 32936
1526 49294
1527 34218
1528 82795
1529 104900
1530 46408
1531 67050
1532 47368
1533 60013
1534 52624
df['total salary'] = df['salary'] + df['bonus']
df.head()
department race gender salary bonus total salary
0 Houston Police Department-HPD White Male 45279 3536 48815
1 Houston Fire Department (HFD) White Male 63166 1296 64462
2 Houston Police Department-HPD Black Male 66614 511 67125
3 Public Works & Engineering-PWE Asian Male 71680 4267 75947
4 Houston Airport System (HAS) White Male 42390 3766 46156

Arithmetic and comparison operators

df1 = df[['salary', 'bonus']] * 5
df1.head()
salary bonus
0 226395 17680
1 315830 6480
2 333070 2555
3 358400 21335
4 211950 18830
df1 = df[['salary', 'bonus']] > 100000
df1.head()
salary bonus
0 False False
1 False False
2 False False
3 False False
4 False False
df1 = df['race'] == 'White'
df1.head()
race
0 True
1 True
2 False
3 False
4 True

Aggregation

Most of the common aggregation methods are available. They only work down the columns and not across the rows.

df.min()
department race gender salary bonus total salary
0 Health & Human Services Asian Female 24960 101 25913

Columns that the aggregation does not work are dropped.

df.mean()
salary bonus total salary
0 56278.746 2594.283 58873.029
df.argmax()
department race gender salary bonus total salary
0 3 0 0 145 1516 145
df['salary'].argmin()
salary
0 347

Check if all salaries are greater than 20000

df1 = df['salary'] > 20000
df1.all()
salary
0 True

Count the number of non-missing values

df.count()
department race gender salary bonus total salary
0 1535 1535 1535 1535 1535 1535

Get number of unique values.

df.nunique()
department race gender salary bonus total salary
0 6 5 2 548 1318 1524

Non-Aggregating Methods

These are methods that do not return a single value.

Get the unique values of each column. The unique method returns a list of DataFrames containing the unique values for each column.

dfs = df.unique()
dfs[0]
department
0 Health & Human Services
1 Houston Airport System (HAS)
2 Houston Fire Department (HFD)
3 Houston Police Department-HPD
4 Parks & Recreation
5 Public Works & Engineering-PWE
dfs[1]
race
0 Asian
1 Black
2 Hispanic
3 Native American
4 White
dfs[2]
gender
0 Female
1 Male

Rename columns with a dictionary.

df.rename({'department':'dept', 'bonus':'BONUS'}).head()
dept race gender salary BONUS total salary
0 Houston Police Department-HPD White Male 45279 3536 48815
1 Houston Fire Department (HFD) White Male 63166 1296 64462
2 Houston Police Department-HPD Black Male 66614 511 67125
3 Public Works & Engineering-PWE Asian Male 71680 4267 75947
4 Houston Airport System (HAS) White Male 42390 3766 46156

Drop columns with a string or list of strings.

df.drop('race').head()
department gender salary bonus total salary
0 Houston Police Department-HPD Male 45279 3536 48815
1 Houston Fire Department (HFD) Male 63166 1296 64462
2 Houston Police Department-HPD Male 66614 511 67125
3 Public Works & Engineering-PWE Male 71680 4267 75947
4 Houston Airport System (HAS) Male 42390 3766 46156
df.drop(['race', 'gender']).head()
department salary bonus total salary
0 Houston Police Department-HPD 45279 3536 48815
1 Houston Fire Department (HFD) 63166 1296 64462
2 Houston Police Department-HPD 66614 511 67125
3 Public Works & Engineering-PWE 71680 4267 75947
4 Houston Airport System (HAS) 42390 3766 46156

Non-aggregating methods that keep all columns

The next several methods are non-aggregating methods that return a DataFrame with the same exact shape as the original. They only work on boolean, integer and float columns and ignore string columns.

Absolute value

df.abs().head()
department race gender salary bonus total salary
0 Houston Police Department-HPD White Male 45279 3536 48815
1 Houston Fire Department (HFD) White Male 63166 1296 64462
2 Houston Police Department-HPD Black Male 66614 511 67125
3 Public Works & Engineering-PWE Asian Male 71680 4267 75947
4 Houston Airport System (HAS) White Male 42390 3766 46156

Cumulative min, max, and sum

df.cummax().head()
department race gender salary bonus total salary
0 Houston Police Department-HPD White Male 45279 3536 48815
1 Houston Fire Department (HFD) White Male 63166 3536 64462
2 Houston Police Department-HPD Black Male 66614 3536 67125
3 Public Works & Engineering-PWE Asian Male 71680 4267 75947
4 Houston Airport System (HAS) White Male 71680 4267 75947

Clip values to be within a range.

df.clip(40000, 60000).head()
department race gender salary bonus total salary
0 Houston Police Department-HPD White Male 45279 40000 48815
1 Houston Fire Department (HFD) White Male 60000 40000 60000
2 Houston Police Department-HPD Black Male 60000 40000 60000
3 Public Works & Engineering-PWE Asian Male 60000 40000 60000
4 Houston Airport System (HAS) White Male 42390 40000 46156

Round numeric columns

df.round(-3).head()
department race gender salary bonus total salary
0 Houston Police Department-HPD White Male 45000 4000 49000
1 Houston Fire Department (HFD) White Male 63000 1000 64000
2 Houston Police Department-HPD Black Male 67000 1000 67000
3 Public Works & Engineering-PWE Asian Male 72000 4000 76000
4 Houston Airport System (HAS) White Male 42000 4000 46000

Copy the DataFrame

df.copy().head()
department race gender salary bonus total salary
0 Houston Police Department-HPD White Male 45279 3536 48815
1 Houston Fire Department (HFD) White Male 63166 1296 64462
2 Houston Police Department-HPD Black Male 66614 511 67125
3 Public Works & Engineering-PWE Asian Male 71680 4267 75947
4 Houston Airport System (HAS) White Male 42390 3766 46156

Take the nth difference.

df.diff(2).head(10)
department race gender salary bonus total salary
0 Houston Police Department-HPD White Male nan nan nan
1 Houston Fire Department (HFD) White Male nan nan nan
2 Houston Police Department-HPD Black Male 21335.000 -3025.000 18310.000
3 Public Works & Engineering-PWE Asian Male 8514.000 2971.000 11485.000
4 Houston Airport System (HAS) White Male -24224.000 3255.000 -20969.000
5 Public Works & Engineering-PWE White Male 36282.000 -2228.000 34054.000
6 Houston Fire Department (HFD) Hispanic Male 10254.000 -2672.000 7582.000
7 Health & Human Services Black Male 72454.000 2893.000 75347.000
8 Public Works & Engineering-PWE Black Male -22297.000 1134.000 -21163.000
9 Health & Human Services Black Male -125147.000 -2283.000 -127430.000

Find the nth percentage change.

df.pct_change(2).head(10)
department race gender salary bonus total salary
0 Houston Police Department-HPD White Male nan nan nan
1 Houston Fire Department (HFD) White Male nan nan nan
2 Houston Police Department-HPD Black Male 0.471 -0.855 0.375
3 Public Works & Engineering-PWE Asian Male 0.135 2.292 0.178
4 Houston Airport System (HAS) White Male -0.364 6.370 -0.312
5 Public Works & Engineering-PWE White Male 0.506 -0.522 0.448
6 Houston Fire Department (HFD) Hispanic Male 0.242 -0.710 0.164
7 Health & Human Services Black Male 0.671 1.419 0.685
8 Public Works & Engineering-PWE Black Male -0.424 1.037 -0.394
9 Health & Human Services Black Male -0.694 -0.463 -0.688

Sort the DataFrame by one or more columns

df.sort_values('salary').head()
department race gender salary bonus total salary
0 Houston Police Department-HPD Black Female 24960 953 25913
1 Public Works & Engineering-PWE Hispanic Male 26104 4258 30362
2 Public Works & Engineering-PWE Black Female 26125 3247 29372
3 Houston Airport System (HAS) Hispanic Female 26125 832 26957
4 Houston Airport System (HAS) Black Female 26125 2461 28586

Sort descending

df.sort_values('salary', asc=False).head()
department race gender salary bonus total salary
0 Houston Fire Department (HFD) White Male 210588 3724 214312
1 Houston Police Department-HPD White Male 199596 848 200444
2 Houston Airport System (HAS) Black Male 186192 1778 187970
3 Health & Human Services Black Male 180416 4932 185348
4 Public Works & Engineering-PWE White Female 178331 2124 180455

Sort by multiple columns

df.sort_values(['race', 'salary']).head()
department race gender salary bonus total salary
0 Houston Airport System (HAS) Asian Female 26125 4446 30571
1 Houston Police Department-HPD Asian Male 27914 2855 30769
2 Houston Police Department-HPD Asian Male 28169 2572 30741
3 Public Works & Engineering-PWE Asian Male 28995 2874 31869
4 Public Works & Engineering-PWE Asian Male 30347 4938 35285

Randomly sample the DataFrame

df.sample(n=3)
department race gender salary bonus total salary
0 Houston Fire Department (HFD) White Male 62540 2995 65535
1 Public Works & Engineering-PWE White Male 63336 1547 64883
2 Houston Police Department-HPD White Male 52514 1150 53664

Randomly sample a fraction

df.sample(frac=.005)
department race gender salary bonus total salary
0 Houston Police Department-HPD Hispanic Female 60347 1200 61547
1 Public Works & Engineering-PWE Black Male 49109 3598 52707
2 Health & Human Services Black Female 48984 4602 53586
3 Houston Police Department-HPD White Male 55461 2813 58274
4 Houston Airport System (HAS) Black Female 29286 1877 31163
5 Houston Police Department-HPD Asian Male 66614 4480 71094
6 Houston Fire Department (HFD) White Male 28024 4475 32499

Sample with replacement

df.sample(n=10000, replace=True).head()
department race gender salary bonus total salary
0 Parks & Recreation Black Female 31075 1665 32740
1 Public Works & Engineering-PWE Hispanic Male 67038 644 67682
2 Houston Police Department-HPD Black Male 37024 1532 38556
3 Health & Human Services Black Female 57433 3106 60539
4 Public Works & Engineering-PWE Black Male 53373 924 54297

String-only methods

Use the str accessor to call methods available just to string columns. Pass the name of the string column as the first parameter for all these methods.

df.str.count('department', 'P').head()
department
0 2
1 0
2 2
3 2
4 0
df.str.lower('department').head()
department
0 houston police department-hpd
1 houston fire department (hfd)
2 houston police department-hpd
3 public works & engineering-pwe
4 houston airport system (has)
df.str.find('department', 'Houston').head()
department
0 0
1 0
2 0
3 -1
4 0

Grouping

pandas_cub provides the value_counts method for simple frequency counting of unique values and pivot_table for grouping and aggregating.

The value_counts method returns a list of DataFrames, one for each column.

dfs = df[['department', 'race', 'gender']].value_counts()
dfs[0]
department count
0 Houston Police Department-HPD 570
1 Houston Fire Department (HFD) 365
2 Public Works & Engineering-PWE 341
3 Health & Human Services 103
4 Houston Airport System (HAS) 103
5 Parks & Recreation 53
dfs[1]
race count
0 White 542
1 Black 518
2 Hispanic 381
3 Asian 87
4 Native American 7
dfs[2]
gender count
0 Male 1135
1 Female 400

If your DataFrame has one column, a DataFrame and not a list is returned. You can also return the relative frequency by setting the normalize parameter to True.

df['race'].value_counts(normalize=True)
race count
0 White 0.353
1 Black 0.337
2 Hispanic 0.248
3 Asian 0.057
4 Native American 0.005

The pivot_table method allows to group by one or two columns and aggregate values from another column. Let's find the average salary for each race and gender. All parameters must be strings.

df.pivot_table(rows='race', columns='gender', values='salary', aggfunc='mean')
race Female Male
0 Asian 58304.222 60622.957
1 Black 48133.382 51853.000
2 Hispanic 44216.960 55493.064
3 Native American 58844.333 68850.500
4 White 66415.528 63439.196

If you don't provide values or aggfunc then by default it will return frequency (a contingency table).

df.pivot_table(rows='race', columns='gender')
race Female Male
0 Asian 18 69
1 Black 207 311
2 Hispanic 100 281
3 Native American 3 4
4 White 72 470

You can group by just a single column.

df.pivot_table(rows='department', values='salary', aggfunc='mean')
department mean
0 Health & Human Services 51324.981
1 Houston Airport System (HAS) 53990.369
2 Houston Fire Department (HFD) 59960.441
3 Houston Police Department-HPD 60428.746
4 Parks & Recreation 39426.151
5 Public Works & Engineering-PWE 50207.806
df.pivot_table(columns='department', values='salary', aggfunc='mean')
Health & Human Services Houston Airport System (HAS) Houston Fire Department (HFD) Houston Police Department-HPD Parks & Recreation Public Works & Engineering-PWE
0 51324.981 53990.369 59960.441 60428.746 39426.151 50207.806