Tool to handle some large csv files

go get



This is a tool I put together to manipulate the very large CSV data files I have to deal with at work. The files have 111 fields and often contain almost 500k records.

The data contains several subsets, so grabbing what I need by field and time makes life easier.

It is also possible to retrieve a range or list of records by line number using -r, instead of by time range using -t. Using -all outputs all records, subject to the other parameters set.

You can use the tool with Stdin and Stdout to pipe from one tool to another, or you can specify an input and/or output file.


Command switches are as follows:

Tool Usage:
  -all
        provide all records to output
  -blanks int
        Ignore records if this column is blank (default -1)
  -c string
        Which columns to export, eg 1-5 or 1,3-10 etc
  -comment string
        Specify the comment character to use (default "#")
  -delimiter string
        Specify the delimiter to use (default ",")
  -header
        include header row
  -help
        help for guidance on usage
  -i string
        Input CSV file
  -loose
        Use strict rules for length of a record
  -o string
        Output CSV file
  -r string
        Span index of records to export, eg 1-5 or 1,3-10 etc
  -specific int
        Limit search to a specific column x, default all (slow) (default -1)
  -t string
        Span of time records, eg 10:00:00-16:00:00
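The -c and -r switches take span specifications such as 1-5 or 1,3-10. One way such a specification could be expanded into a list of indices is sketched below. `parseSpan` is a hypothetical helper written for illustration; the tool's own parsing may differ, for example in how it treats descending ranges or duplicates.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSpan expands a spec such as "1,3-10" into the indices it names,
// in the order written. Hypothetical helper, not taken from the tool.
func parseSpan(spec string) ([]int, error) {
	var out []int
	for _, part := range strings.Split(spec, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			return nil, fmt.Errorf("empty element in span %q", spec)
		}
		// "3-10" splits into "3" and "10"; a bare "7" has no range part.
		lo, hi, isRange := strings.Cut(part, "-")
		start, err := strconv.Atoi(lo)
		if err != nil {
			return nil, fmt.Errorf("bad index %q: %w", lo, err)
		}
		end := start
		if isRange {
			if end, err = strconv.Atoi(hi); err != nil {
				return nil, fmt.Errorf("bad index %q: %w", hi, err)
			}
		}
		if end < start {
			return nil, fmt.Errorf("descending range %q", part)
		}
		for i := start; i <= end; i++ {
			out = append(out, i)
		}
	}
	return out, nil
}

func main() {
	cols, err := parseSpan("0,32-85,96-110")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(cols), "columns selected")
}
```

Rejecting descending ranges and empty elements is a design choice made here to surface typos early; a more lenient parser could skip or reorder them instead.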

For example, the following reads the input file and writes to the specified output file, with a header row, ignoring record lengths, and exporting columns 0, 32 to 85, and 96 to 110. It matches the time on column 0 only, skips records where column 32 is blank, and outputs only records that fall within the time span.

./csvtool -i 502_00409D8C3071_20160524.csv -t "24/05/2016 06:00:00.000 +1000-24/05/2016 18:59:59.999 +1000" -loose  -o subsecondDataTraction.csv -header -specific 0 -c 0,32-85,96-110 -blanks 32
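The time-span match in the command above could work along these lines. This is a sketch under stated assumptions, not the tool's implementation: `parseWindow` and `inWindow` are hypothetical names, the timestamp layout is inferred from the example value, and the span is split at its midpoint on the assumption that both timestamps have the same fixed width.

```go
package main

import (
	"fmt"
	"time"
)

// layout is inferred from the example "24/05/2016 06:00:00.000 +1000" (assumption).
const layout = "02/01/2006 15:04:05.000 -0700"

// parseWindow splits "FROM-TO" at its midpoint, assuming both timestamps
// are the same width, and parses each half. Hypothetical helper.
func parseWindow(span string) (from, to time.Time, err error) {
	if len(span)%2 == 0 {
		return from, to, fmt.Errorf("span %q has no central separator", span)
	}
	mid := len(span) / 2
	if span[mid] != '-' {
		return from, to, fmt.Errorf("span %q: expected '-' at position %d", span, mid)
	}
	if from, err = time.Parse(layout, span[:mid]); err != nil {
		return from, to, err
	}
	to, err = time.Parse(layout, span[mid+1:])
	return from, to, err
}

// inWindow reports whether field parses as a timestamp within [from, to].
// Unparseable fields are treated as not matching.
func inWindow(field string, from, to time.Time) bool {
	t, err := time.Parse(layout, field)
	if err != nil {
		return false
	}
	return !t.Before(from) && !t.After(to)
}

func main() {
	from, to, err := parseWindow("24/05/2016 06:00:00.000 +1000-24/05/2016 18:59:59.999 +1000")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, ts := range []string{
		"24/05/2016 05:59:59.999 +1000",
		"24/05/2016 12:30:00.000 +1000",
		"24/05/2016 19:00:00.000 +1000",
	} {
		fmt.Println(ts, inWindow(ts, from, to))
	}
}
```

Splitting at the midpoint avoids ambiguity with a negative time zone offset, where a plain search for the first "-" would land inside the first timestamp.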