How to use pandas_cub

The README.ipynb notebook will serve as the documentation and usage guide to pandas_cub.

Installation

pip install pandas-cub

What is pandas_cub?

pandas_cub is a simple data analysis library that emulates the functionality of the pandas library. The library is not meant for serious work. It was built as an assignment for one of Ted Petrou's Python classes. If you would like to complete the assignment on your own, visit this repository. There are about 40 steps and 100 tests that you must pass in order to rebuild the library. It is a good challenge and teaches you the fundamentals of how to build your own data analysis library.

pandas_cub functionality

pandas_cub has limited functionality but is still capable of a wide variety of data analysis tasks.

Subset selection with the brackets
Arithmetic and comparison operators (+, -, <, !=, etc...)
Aggregation of columns with most of the common functions (min, max, mean, median, etc...)
Grouping via pivot tables
String-only methods for columns containing strings
Reading in simple comma-separated value files
Several other methods

pandas_cub DataFrame

pandas_cub has a single main object, the DataFrame, to hold all of the data. The DataFrame is capable of holding 4 data types - booleans, integers, floats, and strings. All data is stored in NumPy arrays. panda_cub DataFrames have no index (as in pandas). The columns must be strings.

Missing value representation

Boolean and integer columns will have no missing value representation. The NumPy NaN is used for float columns and the Python None is used for string columns.

Code Examples

pandas_cub syntax is very similar to pandas, but implements much fewer methods. The below examples will cover just about all of the API.

Reading data with `read_csv`

pandas_cub consists of a single function, read_csv, that has a single parameter, the location of the file you would like to read in as a DataFrame. This function can only handle simple CSV's and the delimiter must be a comma. A sample employee dataset is provided in the data directory. Notice that the visual output of the DataFrame is nearly identical to that of a pandas DataFrame. The head method returns the first 5 rows by default.

import pandas_cub as pdc

df = pdc.read_csv('data/employee.csv')
df.head()

	dept	race	gender	salary
0	Houston Police Department-HPD	White	Male	45279
1	Houston Fire Department (HFD)	White	Male	63166
2	Houston Police Department-HPD	Black	Male	66614
3	Public Works & Engineering-PWE	Asian	Male	71680
4	Houston Airport System (HAS)	White	Male	42390

DataFrame properties

The shape property returns a tuple of the number of rows and columns

df.shape

(1535, 4)

The len function returns just the number of rows.

len(df)

The dtypes property returns a DataFrame of the column names and their respective data type.

df.dtypes

	Column Name	Data Type
0	dept	string
1	race	string
2	gender	string
3	salary	int

The columns property returns a list of the columns.

df.columns

['dept', 'race', 'gender', 'salary']

Set new columns by assigning the columns property to a list.

df.columns = ['department', 'race', 'gender', 'salary']
df.head()

	department	race	gender	salary
0	Houston Police Department-HPD	White	Male	45279
1	Houston Fire Department (HFD)	White	Male	63166
2	Houston Police Department-HPD	Black	Male	66614
3	Public Works & Engineering-PWE	Asian	Male	71680
4	Houston Airport System (HAS)	White	Male	42390

The values property returns a single numpy array of all the data.

df.values

array([['Houston Police Department-HPD', 'White', 'Male', 45279],
       ['Houston Fire Department (HFD)', 'White', 'Male', 63166],
       ['Houston Police Department-HPD', 'Black', 'Male', 66614],
       ...,
       ['Houston Police Department-HPD', 'White', 'Male', 43443],
       ['Houston Police Department-HPD', 'Asian', 'Male', 55461],
       ['Houston Fire Department (HFD)', 'Hispanic', 'Male', 51194]],
      dtype=object)

Subset selection

Subset selection is handled with the brackets. To select a single column, place that column name in the brackets.

df['race'].head()

	race
0	White
1	White
2	Black
3	Asian
4	White

Select multiple columns with a list of strings.

df[['race', 'salary']].head()

	race	salary
0	White	45279
1	White	63166
2	Black	66614
3	Asian	71680
4	White	42390

Simultaneously select rows and columns by passing the brackets the row selection followed by the column selection separated by a comma. Here we use integers for rows and strings for columns.

rows = [10, 50, 100]
cols = ['salary', 'race']
df[rows, cols]

	salary	race
0	77076	Black
1	81239	White
2	81239	White

You can use integers for the columns as well.

rows = [10, 50, 100]
cols = [2, 0]
df[rows, cols]

	gender	department
0	Male	Houston Police Department-HPD
1	Male	Houston Police Department-HPD
2	Male	Houston Police Department-HPD

You can use a single integer and not just a list.

df[99, 3]

	salary
0	66614

Or a single string for the columns

df[99, 'salary']

	salary
0	66614

You can use a slice for the rows

df[20:100:10, ['race', 'gender']]

	race	gender
0	White	Male
1	White	Male
2	Hispanic	Male
3	White	Male
4	White	Male
5	Hispanic	Male
6	Hispanic	Male
7	Black	Female

You can also slice the columns with either integers or strings

df[20:100:10, :2]

	department	race
0	Houston Police Department-HPD	White
1	Houston Fire Department (HFD)	White
2	Houston Police Department-HPD	Hispanic
3	Houston Police Department-HPD	White
4	Houston Fire Department (HFD)	White
5	Houston Police Department-HPD	Hispanic
6	Houston Fire Department (HFD)	Hispanic
7	Houston Police Department-HPD	Black

df[20:100:10, 'department':'gender']

	department	race	gender
0	Houston Police Department-HPD	White	Male
1	Houston Fire Department (HFD)	White	Male
2	Houston Police Department-HPD	Hispanic	Male
3	Houston Police Department-HPD	White	Male
4	Houston Fire Department (HFD)	White	Male
5	Houston Police Department-HPD	Hispanic	Male
6	Houston Fire Department (HFD)	Hispanic	Male
7	Houston Police Department-HPD	Black	Female

You can do boolean selection if you pass the brackets a one-column boolean DataFrame.

filt = df['salary'] > 100000
filt.head()

	salary
0	False
1	False
2	False
3	False
4	False

df[filt].head()

	department	race	gender	salary
0	Public Works & Engineering-PWE	White	Male	107962
1	Health & Human Services	Black	Male	180416
2	Houston Fire Department (HFD)	Hispanic	Male	165216
3	Health & Human Services	White	Female	100791
4	Houston Airport System (HAS)	White	Male	120916

df[filt, ['race', 'salary']].head()

	race	salary
0	White	107962
1	Black	180416
2	Hispanic	165216
3	White	100791
4	White	120916

Assigning Columns

You can only assign an entire new column or overwrite an old one. You cannot assign a subset of the data. You can assign a new column with a single value like this:

df['bonus'] = 1000
df.head()

	department	race	gender	salary	bonus
0	Houston Police Department-HPD	White	Male	45279	1000
1	Houston Fire Department (HFD)	White	Male	63166	1000
2	Houston Police Department-HPD	Black	Male	66614	1000
3	Public Works & Engineering-PWE	Asian	Male	71680	1000
4	Houston Airport System (HAS)	White	Male	42390	1000

You can assign with a numpy array the same length as a column.

import numpy as np
df['bonus'] = np.random.randint(100, 5000, len(df))
df.head()

	department	race	gender	salary	bonus
0	Houston Police Department-HPD	White	Male	45279	3536
1	Houston Fire Department (HFD)	White	Male	63166	1296
2	Houston Police Department-HPD	Black	Male	66614	511
3	Public Works & Engineering-PWE	Asian	Male	71680	4267
4	Houston Airport System (HAS)	White	Male	42390	3766

You can assign a new column with a one column DataFrame.

df['salary'] + df['bonus']

	salary
0	48815
1	64462
2	67125
3	75947
4	46156
5	110001
6	53738
7	185348
8	32575
9	57918
...	...
1525	32936
1526	49294
1527	34218
1528	82795
1529	104900
1530	46408
1531	67050
1532	47368
1533	60013
1534	52624

df['total salary'] = df['salary'] + df['bonus']
df.head()

	department	race	gender	salary	bonus	total salary
0	Houston Police Department-HPD	White	Male	45279	3536	48815
1	Houston Fire Department (HFD)	White	Male	63166	1296	64462
2	Houston Police Department-HPD	Black	Male	66614	511	67125
3	Public Works & Engineering-PWE	Asian	Male	71680	4267	75947
4	Houston Airport System (HAS)	White	Male	42390	3766	46156

Arithmetic and comparison operators

df1 = df[['salary', 'bonus']] * 5
df1.head()

	salary	bonus
0	226395	17680
1	315830	6480
2	333070	2555
3	358400	21335
4	211950	18830

df1 = df[['salary', 'bonus']] > 100000
df1.head()

	salary	bonus
0	False	False
1	False	False
2	False	False
3	False	False
4	False	False

df1 = df['race'] == 'White'
df1.head()

	race
0	True
1	True
2	False
3	False
4	True

Aggregation

Most of the common aggregation methods are available. They only work down the columns and not across the rows.

df.min()

	department	race	gender	salary	bonus	total salary
0	Health & Human Services	Asian	Female	24960	101	25913

Columns that the aggregation does not work are dropped.

df.mean()

	salary	bonus	total salary
0	56278.746	2594.283	58873.029

df.argmax()

	department	race	gender	salary	bonus	total salary
0	3	0	0	145	1516	145

df['salary'].argmin()

	salary
0	347

Check if all salaries are greater than 20000

df1 = df['salary'] > 20000
df1.all()

	salary
0	True

Count the number of non-missing values

df.count()

	department	race	gender	salary	bonus	total salary
0	1535	1535	1535	1535	1535	1535

Get number of unique values.

df.nunique()

	department	race	gender	salary	bonus	total salary
0	6	5	2	548	1318	1524

Non-Aggregating Methods

These are methods that do not return a single value.

Get the unique values of each column. The unique method returns a list of DataFrames containing the unique values for each column.

dfs = df.unique()

dfs[0]

	department
0	Health & Human Services
1	Houston Airport System (HAS)
2	Houston Fire Department (HFD)
3	Houston Police Department-HPD
4	Parks & Recreation
5	Public Works & Engineering-PWE

dfs[1]

	race
0	Asian
1	Black
2	Hispanic
3	Native American
4	White

dfs[2]

	gender
0	Female
1	Male

Rename columns with a dictionary.

df.rename({'department':'dept', 'bonus':'BONUS'}).head()

	dept	race	gender	salary	BONUS	total salary
0	Houston Police Department-HPD	White	Male	45279	3536	48815
1	Houston Fire Department (HFD)	White	Male	63166	1296	64462
2	Houston Police Department-HPD	Black	Male	66614	511	67125
3	Public Works & Engineering-PWE	Asian	Male	71680	4267	75947
4	Houston Airport System (HAS)	White	Male	42390	3766	46156

Drop columns with a string or list of strings.

df.drop('race').head()

	department	gender	salary	bonus	total salary
0	Houston Police Department-HPD	Male	45279	3536	48815
1	Houston Fire Department (HFD)	Male	63166	1296	64462
2	Houston Police Department-HPD	Male	66614	511	67125
3	Public Works & Engineering-PWE	Male	71680	4267	75947
4	Houston Airport System (HAS)	Male	42390	3766	46156

df.drop(['race', 'gender']).head()

	department	salary	bonus	total salary
0	Houston Police Department-HPD	45279	3536	48815
1	Houston Fire Department (HFD)	63166	1296	64462
2	Houston Police Department-HPD	66614	511	67125
3	Public Works & Engineering-PWE	71680	4267	75947
4	Houston Airport System (HAS)	42390	3766	46156

Non-aggregating methods that keep all columns

The next several methods are non-aggregating methods that return a DataFrame with the same exact shape as the original. They only work on boolean, integer and float columns and ignore string columns.

Absolute value

df.abs().head()

	department	race	gender	salary	bonus	total salary
0	Houston Police Department-HPD	White	Male	45279	3536	48815
1	Houston Fire Department (HFD)	White	Male	63166	1296	64462
2	Houston Police Department-HPD	Black	Male	66614	511	67125
3	Public Works & Engineering-PWE	Asian	Male	71680	4267	75947
4	Houston Airport System (HAS)	White	Male	42390	3766	46156

Cumulative min, max, and sum

df.cummax().head()

	department	race	gender	salary	bonus	total salary
0	Houston Police Department-HPD	White	Male	45279	3536	48815
1	Houston Fire Department (HFD)	White	Male	63166	3536	64462
2	Houston Police Department-HPD	Black	Male	66614	3536	67125
3	Public Works & Engineering-PWE	Asian	Male	71680	4267	75947
4	Houston Airport System (HAS)	White	Male	71680	4267	75947

Clip values to be within a range.

df.clip(40000, 60000).head()

	department	race	gender	salary	bonus	total salary
0	Houston Police Department-HPD	White	Male	45279	40000	48815
1	Houston Fire Department (HFD)	White	Male	60000	40000	60000
2	Houston Police Department-HPD	Black	Male	60000	40000	60000
3	Public Works & Engineering-PWE	Asian	Male	60000	40000	60000
4	Houston Airport System (HAS)	White	Male	42390	40000	46156

Round numeric columns

df.round(-3).head()

	department	race	gender	salary	bonus	total salary
0	Houston Police Department-HPD	White	Male	45000	4000	49000
1	Houston Fire Department (HFD)	White	Male	63000	1000	64000
2	Houston Police Department-HPD	Black	Male	67000	1000	67000
3	Public Works & Engineering-PWE	Asian	Male	72000	4000	76000
4	Houston Airport System (HAS)	White	Male	42000	4000	46000

Copy the DataFrame

df.copy().head()

	department	race	gender	salary	bonus	total salary
0	Houston Police Department-HPD	White	Male	45279	3536	48815
1	Houston Fire Department (HFD)	White	Male	63166	1296	64462
2	Houston Police Department-HPD	Black	Male	66614	511	67125
3	Public Works & Engineering-PWE	Asian	Male	71680	4267	75947
4	Houston Airport System (HAS)	White	Male	42390	3766	46156

Take the nth difference.

df.diff(2).head(10)

	department	race	gender	salary	bonus	total salary
0	Houston Police Department-HPD	White	Male	nan	nan	nan
1	Houston Fire Department (HFD)	White	Male	nan	nan	nan
2	Houston Police Department-HPD	Black	Male	21335.000	-3025.000	18310.000
3	Public Works & Engineering-PWE	Asian	Male	8514.000	2971.000	11485.000
4	Houston Airport System (HAS)	White	Male	-24224.000	3255.000	-20969.000
5	Public Works & Engineering-PWE	White	Male	36282.000	-2228.000	34054.000
6	Houston Fire Department (HFD)	Hispanic	Male	10254.000	-2672.000	7582.000
7	Health & Human Services	Black	Male	72454.000	2893.000	75347.000
8	Public Works & Engineering-PWE	Black	Male	-22297.000	1134.000	-21163.000
9	Health & Human Services	Black	Male	-125147.000	-2283.000	-127430.000

Find the nth percentage change.

df.pct_change(2).head(10)

	department	race	gender	salary	bonus	total salary
0	Houston Police Department-HPD	White	Male	nan	nan	nan
1	Houston Fire Department (HFD)	White	Male	nan	nan	nan
2	Houston Police Department-HPD	Black	Male	0.471	-0.855	0.375
3	Public Works & Engineering-PWE	Asian	Male	0.135	2.292	0.178
4	Houston Airport System (HAS)	White	Male	-0.364	6.370	-0.312
5	Public Works & Engineering-PWE	White	Male	0.506	-0.522	0.448
6	Houston Fire Department (HFD)	Hispanic	Male	0.242	-0.710	0.164
7	Health & Human Services	Black	Male	0.671	1.419	0.685
8	Public Works & Engineering-PWE	Black	Male	-0.424	1.037	-0.394
9	Health & Human Services	Black	Male	-0.694	-0.463	-0.688

Sort the DataFrame by one or more columns

df.sort_values('salary').head()

	department	race	gender	salary	bonus	total salary
0	Houston Police Department-HPD	Black	Female	24960	953	25913
1	Public Works & Engineering-PWE	Hispanic	Male	26104	4258	30362
2	Public Works & Engineering-PWE	Black	Female	26125	3247	29372
3	Houston Airport System (HAS)	Hispanic	Female	26125	832	26957
4	Houston Airport System (HAS)	Black	Female	26125	2461	28586

Sort descending

df.sort_values('salary', asc=False).head()

	department	race	gender	salary	bonus	total salary
0	Houston Fire Department (HFD)	White	Male	210588	3724	214312
1	Houston Police Department-HPD	White	Male	199596	848	200444
2	Houston Airport System (HAS)	Black	Male	186192	1778	187970
3	Health & Human Services	Black	Male	180416	4932	185348
4	Public Works & Engineering-PWE	White	Female	178331	2124	180455

Sort by multiple columns

df.sort_values(['race', 'salary']).head()

	department	race	gender	salary	bonus	total salary
0	Houston Airport System (HAS)	Asian	Female	26125	4446	30571
1	Houston Police Department-HPD	Asian	Male	27914	2855	30769
2	Houston Police Department-HPD	Asian	Male	28169	2572	30741
3	Public Works & Engineering-PWE	Asian	Male	28995	2874	31869
4	Public Works & Engineering-PWE	Asian	Male	30347	4938	35285

Randomly sample the DataFrame

df.sample(n=3)

	department	race	gender	salary	bonus	total salary
0	Houston Fire Department (HFD)	White	Male	62540	2995	65535
1	Public Works & Engineering-PWE	White	Male	63336	1547	64883
2	Houston Police Department-HPD	White	Male	52514	1150	53664

Randomly sample a fraction

df.sample(frac=.005)

	department	race	gender	salary	bonus	total salary
0	Houston Police Department-HPD	Hispanic	Female	60347	1200	61547
1	Public Works & Engineering-PWE	Black	Male	49109	3598	52707
2	Health & Human Services	Black	Female	48984	4602	53586
3	Houston Police Department-HPD	White	Male	55461	2813	58274
4	Houston Airport System (HAS)	Black	Female	29286	1877	31163
5	Houston Police Department-HPD	Asian	Male	66614	4480	71094
6	Houston Fire Department (HFD)	White	Male	28024	4475	32499

Sample with replacement

df.sample(n=10000, replace=True).head()

	department	race	gender	salary	bonus	total salary
0	Parks & Recreation	Black	Female	31075	1665	32740
1	Public Works & Engineering-PWE	Hispanic	Male	67038	644	67682
2	Houston Police Department-HPD	Black	Male	37024	1532	38556
3	Health & Human Services	Black	Female	57433	3106	60539
4	Public Works & Engineering-PWE	Black	Male	53373	924	54297

String-only methods

Use the str accessor to call methods available just to string columns. Pass the name of the string column as the first parameter for all these methods.

df.str.count('department', 'P').head()

	department
0	2
1	0
2	2
3	2
4	0

df.str.lower('department').head()

	department
0	houston police department-hpd
1	houston fire department (hfd)
2	houston police department-hpd
3	public works & engineering-pwe
4	houston airport system (has)

df.str.find('department', 'Houston').head()

	department
0	0
1	0
2	0
3	-1
4	0

Grouping

pandas_cub provides the value_counts method for simple frequency counting of unique values and pivot_table for grouping and aggregating.

The value_counts method returns a list of DataFrames, one for each column.

dfs = df[['department', 'race', 'gender']].value_counts()

dfs[0]

	department	count
0	Houston Police Department-HPD	570
1	Houston Fire Department (HFD)	365
2	Public Works & Engineering-PWE	341
3	Health & Human Services	103
4	Houston Airport System (HAS)	103
5	Parks & Recreation	53

dfs[1]

	race	count
0	White	542
1	Black	518
2	Hispanic	381
3	Asian	87
4	Native American	7

dfs[2]

	gender	count
0	Male	1135
1	Female	400

If your DataFrame has one column, a DataFrame and not a list is returned. You can also return the relative frequency by setting the normalize parameter to True.

df['race'].value_counts(normalize=True)

	race	count
0	White	0.353
1	Black	0.337
2	Hispanic	0.248
3	Asian	0.057
4	Native American	0.005

The pivot_table method allows to group by one or two columns and aggregate values from another column. Let's find the average salary for each race and gender. All parameters must be strings.

df.pivot_table(rows='race', columns='gender', values='salary', aggfunc='mean')

	race	Female	Male
0	Asian	58304.222	60622.957
1	Black	48133.382	51853.000
2	Hispanic	44216.960	55493.064
3	Native American	58844.333	68850.500
4	White	66415.528	63439.196

If you don't provide values or aggfunc then by default it will return frequency (a contingency table).

df.pivot_table(rows='race', columns='gender')

	race	Female	Male
0	Asian	18	69
1	Black	207	311
2	Hispanic	100	281
3	Native American	3	4
4	White	72	470

You can group by just a single column.

df.pivot_table(rows='department', values='salary', aggfunc='mean')

	department	mean
0	Health & Human Services	51324.981
1	Houston Airport System (HAS)	53990.369
2	Houston Fire Department (HFD)	59960.441
3	Houston Police Department-HPD	60428.746
4	Parks & Recreation	39426.151
5	Public Works & Engineering-PWE	50207.806

df.pivot_table(columns='department', values='salary', aggfunc='mean')

	Health & Human Services	Houston Airport System (HAS)	Houston Fire Department (HFD)	Houston Police Department-HPD	Parks & Recreation	Public Works & Engineering-PWE
0	51324.981	53990.369	59960.441	60428.746	39426.151	50207.806

pandas-cub
Release 0.0.7

Release 0.0.7

0.0.7

0.0.6

0.0.5

0.0.4

0.0.3

0.0.2

0.0.1

Documentation

How to use pandas_cub

Installation

What is pandas_cub?

pandas_cub functionality

pandas_cub DataFrame

Missing value representation

Code Examples

Reading data with `read_csv`

DataFrame properties

Subset selection

Assigning Columns

Arithmetic and comparison operators

Aggregation

Non-Aggregating Methods

Non-aggregating methods that keep all columns

String-only methods

Grouping

Stats

Development practices

Releases

Contributors

pandas-cub Release 0.0.7

Release 0.0.7 Toggle Dropdown 0.0.7 0.0.6 0.0.5 0.0.4 0.0.3 0.0.2 0.0.1

Documentation

How to use pandas_cub

Installation

What is pandas_cub?

pandas_cub functionality

pandas_cub DataFrame

Missing value representation

Code Examples

Reading data with read_csv

DataFrame properties

Subset selection

Assigning Columns

Arithmetic and comparison operators

Aggregation

Non-Aggregating Methods

Non-aggregating methods that keep all columns

String-only methods

Grouping

Stats

Development practices

Releases

Contributors

pandas-cub
Release 0.0.7

Release 0.0.7

0.0.7

0.0.6

0.0.5

0.0.4

0.0.3

0.0.2

0.0.1

Reading data with `read_csv`