readcsvturbo

Improved method for reading the first/last/specific line from csv into a DataFrame


Keywords
python, pandas, readcsv, readfirstlinecsv, readlastlinecsv, readspecificlinecsv, pandas-dataframe, python3
Install
pip install readcsvturbo==0.0.3

Documentation

read-csv-turbo

Ever deal with multiple huge csv files and and the panads read_csv/skiprows method is slowing you down? You are not alone.

read-csv-turbo is an improved method of reading the first and last lines using unix head and tail commands to get the data you want in a dataframe as fast as possible. I may include Windows support in the future if requested.

Reading a large csv once is "fine" but often I find myself looping through many files and this process is painfully slow which is why StackOverflow suggestions didn't cut it. There may be a newer/smarter way of approaching this but this method should be as fast as you could get.

At the moment the use case of this is quite limited as it just provides a fast way to read the first, last or n row of a csv into a dataframe

Installation

pip install readcsvturbo

Usage

import pandas as pd
import readcsvturbo as rct

csv_file_path = Path("./big_csv.csv")

df_head = rct.read_csv_head(csv_file_path)
df_tail = rct.read_csv_tail(csv_file_path)
df_headtail = rct.read_csv_headtail(csv_file_path)
df_specific_line = rct.read_csv_line(csv_file_path)

# Can also specify if there are no headers in the csv file

df_head = rct.read_csv_head(csv_file_path, headers=False)

Speed Test Results

RAW PANDAS TIME: 5.47s
SKIPROWS TIME:   3.04s (1.7x faster)
TURBO TIME:      0.25s (21.8x faster)

Speed Test

Test completed with a csv file 609MB with 5,000,000 lines. Downloaded from https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/

The speedtest.py script was run to produce this. The test is reading the first and last lines of a csv file including the header.

############## RAW PANDAS ###############
# Read the whole file in using read_csv #
#########################################

                        Region Country  ...  Total Cost Total Profit
0        Australia and Oceania   Palau  ...  1260428.96    303126.25
4999999                 Europe  Greece  ...  2272551.84    546536.25

[2 rows x 14 columns]
RAW PANDAS TIME: 5.47s


################### SKIPROWS ##################
# Get the length of the file from readlines() #
# Read big_csv using skiprows                 #
###############################################

                  Region Country  ...  Total Cost Total Profit
0  Australia and Oceania   Palau  ...  1260428.96    303126.25
1                 Europe  Greece  ...  2272551.84    546536.25

[2 rows x 14 columns]
SKIPROWS TIME: 3.04s


##################### TURBO #####################
# Read first and last line using read-csv-turbo #
#################################################

                  Region Country  ...  Total Cost Total Profit
0  Australia and Oceania   Palau  ...  1260428.96    303126.25
1                 Europe  Greece  ...  2272551.84    546536.25

[2 rows x 14 columns]
TURBO TIME: 0.25s