a-pandas-ex-fastsort

Speedup up to 40 percent when sorting Pandas index/Series


Keywords
c++, numpy, sort, pandas, reindex, cpp, dataframe, fast, index, python, series
License
MIT
Install
pip install a-pandas-ex-fastsort==0.10

Documentation

Speedup up to 40 percent when sorting Pandas index/Series

MSVC C++ x64/x86 build tools must be installed.

This module uses https://pypi.org/project/npfastsortcpp/

There you can get all instructions

Important: Only for float/int

Tested against Windows 10 / Python 3.9.13

import pandas as pd
from a_pandas_ex_fastsort import pd_add_fastsort
pd_add_fastsort()
dafra = "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv"
df5 = pd.read_csv(dafra)
# Speed gain even for small DataFrames
df = pd.concat([df5.copy() for x in range(10)], ignore_index=True)
df = df.sample(len(df))
%timeit df.d_fast_reindex() # Values must be unique
846 µs ± 37.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit df.sort_index()
933 µs ± 25.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# The bigger, the better
df = pd.concat([df5.copy() for x in range(100)], ignore_index=True)
df = df.sample(len(df))
%timeit df.d_fast_reindex() # Values must be unique
11.1 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.sort_index()
15 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df = pd.concat([df5.copy() for x in range(100)], ignore_index=True)
df = df.sample(len(df))
%timeit df.Pclass.sort_values()
2.08 ms ± 66 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.Pclass.s_fastsort_copy() # Be careful: original index will be dropped!
583 µs ± 5.85 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# Be careful: 
df.Pclass.s_fastsort_inplace()
# sorts only one Series in place, 
# values in other columns are not being sorted! 

df # starting with:
Out[19]: 
       PassengerId  Survived  Pclass  ...      Fare Cabin  Embarked
34102          245         0       3  ...    7.2250   NaN         C
28329          709         1       1  ...  151.5500   NaN         S
50018          123         0       2  ...   30.0708   NaN         C
51258          472         0       3  ...    8.6625   NaN         S
51813          136         0       2  ...   15.0458   NaN         C
            ...       ...     ...  ...       ...   ...       ...
36357          718         1       2  ...   10.5000  E101         S
78608          201         0       3  ...    9.5000   NaN         S
64989          838         0       3  ...    8.0500   NaN         S
20824          332         0       1  ...   28.5000  C124         S
21108          616         1       2  ...   65.0000   NaN         S
[89100 rows x 12 columns]
df.Pclass.s_fastsort_inplace()

df # Result - Only Pclass has been sorted
Out[21]: 
       PassengerId  Survived  Pclass  ...      Fare Cabin  Embarked
34102          245         0       1  ...    7.2250   NaN         C
28329          709         1       1  ...  151.5500   NaN         S
50018          123         0       1  ...   30.0708   NaN         C
51258          472         0       1  ...    8.6625   NaN         S
51813          136         0       1  ...   15.0458   NaN         C
            ...       ...     ...  ...       ...   ...       ...
36357          718         1       3  ...   10.5000  E101         S
78608          201         0       3  ...    9.5000   NaN         S
64989          838         0       3  ...    8.0500   NaN         S
20824          332         0       3  ...   28.5000  C124         S
21108          616         1       3  ...   65.0000   NaN         S
[89100 rows x 12 columns]