nio on Nimble

The basic idea here is to take fully seriously the old chestnut that file name extensions tell users what format a file contains, and to use that to make data access efficient. Specifically, we append to binary file pathnames a terse syntax inspired by (but simpler than) various Perl/Python "pack formats" and build APIs/tools around this. The syntax is simple & easy to remember (I think) since it is derived from the C programming language family. Most programmers have these basic "CPU types" memorized. With the exception of long double (an already exceptional thing), the type code is just the first letter of each C type. Uppercase codes are unsigned; lowercase codes are signed. The full syntax is one or more:

  [<COUNT[,..]>]<c|C|s|S|i|I|l|L|f|d|g>

where

  c: signed (c)har    C: unsigned (C)har
  s: signed (s)hort   S: unsigned (S)hort
  i: signed (i)nt     I: unsigned (I)nt
  l: signed (l)ong    L: unsigned (L)ong
  f: (f)loat          d: (d)ouble          g: lon(g) double

The number of rows is inferred from the file size (but could be a length-prefix in some message buffer context). Some examples:

  hey.NS        a column-vector (Nx1 matrix) of unsigned shorts
  foo.N10f      an Nx10 matrix of floats
  bar.N2i4d     a table of int 2-vectors and double 4-vectors
  covs.N10,10f  a vector of 10 by 10 covariance matrices
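To make the naming rule concrete, here is a minimal Python sketch (not the nio API) that decodes such a suffix and infers the row count from a file size. It rests on two assumptions that the text above does not state: that the format starts after the last ".N" in the pathname (my reading of the examples), and that type sizes follow a typical 64-bit Linux C ABI. The names SIZES, parse_format, row_bytes & infer_rows are hypothetical.

```python
import math
import re

# ASSUMPTION: byte sizes for a typical 64-bit Linux (LP64) C ABI.
# The syntax itself does not fix these; other platforms may differ.
SIZES = {"c": 1, "C": 1, "s": 2, "S": 2, "i": 4, "I": 4,
         "l": 8, "L": 8, "f": 4, "d": 8, "g": 16}

# One column: optional COUNT[,COUNT..] followed by a single type code.
_COL = re.compile(r"(\d+(?:,\d+)*)?([cCsSiIlLfdg])")


def parse_format(path):
    """Return a list of (dims, type_code) pairs decoded from a pathname."""
    _, sep, spec = path.rpartition(".N")  # ASSUMPTION: format follows last ".N"
    if not sep or not re.fullmatch(f"(?:{_COL.pattern})+", spec):
        raise ValueError(f"no recognizable format suffix in {path!r}")
    cols = []
    for m in _COL.finditer(spec):
        counts = m.group(1)
        dims = tuple(int(d) for d in counts.split(",")) if counts else (1,)
        cols.append((dims, m.group(2)))
    return cols


def row_bytes(cols):
    """Bytes occupied by one row under the assumed SIZES table."""
    return sum(math.prod(dims) * SIZES[code] for dims, code in cols)


def infer_rows(file_size, cols):
    """Infer the number of rows from the file size, as the text describes."""
    rows, leftover = divmod(file_size, row_bytes(cols))
    if leftover:
        raise ValueError("file size is not a whole number of rows")
    return rows
```

For instance, parse_format("bar.N2i4d") yields two columns, an int 2-vector and a double 4-vector, which is 40 bytes per row under the assumed sizes.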

While learning the syntax is needed to use streaming/pipe style calculation, you can also stow the format inside a file of a parallel name (e.g. "foo" & ".foo"). I have found this setup to be usable, flexible, & efficient. It can perhaps cure you of your likely addiction to parsing & re-parsing ASCII numbers, which is up to hundreds of times slower than modern SIMD FP operations. (Seriously: SIMD operations run at L1 cache bandwidth, which is on the order of 100s of GB/s, while parsing at even 1 GB/s is a challenge and printing/binary->ASCII is even slower.)

Unpacking other linearized/serialized marshal formats often requires at least iterating over all data, and often has decompression work woven in. NIO tries to allow "mmap & go" when feasible. In the sense of the above 100s vs 1 comparison, this is "infinite GB/s". In a more accurate sense, start-up cost is as fixed as opening random access files can be. (This is what DB software has always done and should not surprise anyone.)
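As a rough illustration of "mmap & go" (a Python stdlib sketch, not how nio itself is implemented), the snippet below maps a single-column float file and reads it in place with no parsing step. The helper names write_column & column_stats are hypothetical, and the file is assumed to hold native-endian 4-byte floats.

```python
import mmap
import os
import tempfile
from array import array


def write_column(path, values):
    """Write values as raw native-endian 4-byte floats (an 'Nf' style file)."""
    with open(path, "wb") as fh:
        array("f", values).tofile(fh)


def column_stats(path):
    """Return (count, min, max) of a float column read straight from the map."""
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # cast() reinterprets the mapped bytes as floats without copying.
            with memoryview(mm) as raw, raw.cast("f") as col:
                return len(col), min(col), max(col)


if __name__ == "__main__":
    demo = os.path.join(tempfile.mkdtemp(), "a.Nf")
    write_column(demo, [1.5, -2.25, 4.0])
    print(column_stats(demo))
```

The only start-up work is the open and the map; the interpreter loop in min/max is of course far slower than a compiled SIMD loop, so this shows the access pattern, not the speed.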

More documentation can be had by just running nio with no arguments or nio h for a big help dump. nio is both a library usable via import nio and a cligen multi-command. So the shortest unique prefix for subcommand names (and long option names) is sufficient. The FAQ has more motivation.

One can do some things with the nio command, but the main point of the design is to be extensible by actual programmers importing nio, calling nOpen/initFileArray, etc., and writing their own libs & tools either on top or off to the side. Extended tools/logic must document themselves, but they can, e.g., share n-foo namespaces if desired. (Note that nio zip is named after the functional PL term & real world clothing zippers. It is unrelated to data compression.)

Convenience tools live in utils/. E.g., transpose can be useful in the context of schema writing (as in c2tsv < foo | head | transpose > editMe.sc).

Usage Vignette

Here is a little usage vignette using simulated data. First we will show some steps and then explain them. To start, you will need to compile demo/tabGen.nim & install it somewhere in your $PATH. Something like this may do the trick:

nimble install cligen   # may not be needed
git clone https://github.com/c-blake/nio
cd nio
n="nim c -d:danger"
$n nio && $n utils/c2tsv && $n demo/tabGen &&
  install -cm755 nio utils/c2tsv demo/tabGen ~/bin

After that, you can save this blurb to some demo.sh and run "sh demo.sh":

#!/bin/sh
t=/usr/bin/time
tabGen 1_000_000 4 > f 2> f.sc        # generate million*4 table
head -n3 < f                          # look @top
head f.sc                             # look @parsing schema
$t nio fromSV -s f.sc /dev/stdin < f  # parse the data into NIO
ls                                    # peruse some files
nio pr a.Nf b.Nf%.5f | head -n3       # print human readable
$t nio zip a.Nf b.Nf > ab.Nff         # stitch cols together
nio pr ab.Nff%.9f%.5f | head -n3      # print same vals another way
$t nio moments [a-z].N*               # compute some summary stats

and get output that looks like this (i7-6700k @4.7GHz; 8MiB L3):

a,b,c,d
2.380153181848329,-2.279945642690398,-0.6395001602969651,7.233130606792596
-0.025225344508444,2.21176551984741,0.494893265790349,0.4640431696829914
0.79user 0.04system 0:00.72elapsed 114%CPU (0avgtext+0avgdata 2512maxresident)k
0inputs+0outputs (0major+392minor)pagefaults 0swaps
a.Nf
b.Nf
c.Nf
d.Nf
f
f.sc
2.380153	-2.27995
-0.02522535	2.21177
0.1616571	3.55054
0.02user 0.00system 0:00.02elapsed 100%CPU (0avgtext+0avgdata 7920maxresident)k
0inputs+0outputs (0major+250minor)pagefaults 0swaps
2.380153179	-2.27995
-0.025225345	2.21177
0.161657065	3.55054
a.Nf:0 min: -4.772 max: 4.811
b.Nf:0 min: -9.172 max: 10.99
c.Nf:0 min: -11.89 max: 16.50
d.Nf:0 min: -16.27 max: 25.47
0.04user 0.00system 0:00.04elapsed 97%CPU (0avgtext+0avgdata 18268maxresident)k
0inputs+0outputs (0major+377minor)pagefaults 0swaps
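
To make the "stitch cols together" step (nio zip a.Nf b.Nf > ab.Nff) concrete, here is a small Python sketch of row-wise interleaving of fixed-width columns. It shows the idea only and is not nio's implementation; zip_rows is a hypothetical name.

```python
import struct


def zip_rows(columns):
    """Interleave equal-length columns row by row.

    `columns` is a list of (raw_bytes, row_size_in_bytes) pairs; the result
    holds row 0 of every column, then row 1 of every column, and so on.
    """
    counts = set()
    for raw, size in columns:
        if len(raw) % size:
            raise ValueError("column is not a whole number of rows")
        counts.add(len(raw) // size)
    if len(counts) != 1:
        raise ValueError("columns have different row counts")
    out = bytearray()
    for r in range(counts.pop()):
        for raw, size in columns:
            out += raw[r * size:(r + 1) * size]
    return bytes(out)


# Two float columns of two rows each become one two-column table.
a = struct.pack("=2f", 1.0, 2.0)
b = struct.pack("=2f", 10.0, 20.0)
ab = zip_rows([(a, 4), (b, 4)])
```

Here ab holds the floats 1.0, 10.0, 2.0, 20.0 in that order, i.e. the layout that a name like ab.Nff describes.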

Going Faster

Performance savvy readers may note from the final timing line that 40 ms for 4 million numbers is weak performance. 10 nanosec/number, or about 50 clock cycles/number, or a lowly 16 MB/40 ms = 400 MB/s, is not great for what could be vectorized min/max in a perfectly predictable pipeline. This is because I was lazy doing nio moments and just used stdlib stats.RunningStat, which is semi-optimized for accuracy, not speed.

demo/datGen shows how easy it is to just stay in binary the whole time:

#!/bin/sh
t=/usr/bin/time
$t datGen 1_000_000 4               # generates abcd.Nffff
$t nio rip -i abcd.Nffff a b c d    # rip apart into column files
$t favg [a-d].Nf                    # Does 100 loops by default!

Compiling the tiny demo/favg with -d:danger results, for me, in a run-time on that same machine of 0.045 sec for 100 passes, or 0.45 ms/pass. This is 40 ms/0.45 ms =~ 90X faster, or about 16 MB/0.45 ms = 35.5 GB/s. Memory BW on this particular 2016 Linux box tops out at ~45 GB/s (with >=3 cores pulling).
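
The arithmetic behind these figures can be checked with a few lines of Python. All inputs are the numbers quoted in this section (4 million 4-byte floats, 40 ms for nio moments, 0.045 sec per 100 favg passes, a 4.7 GHz clock); the variable names are mine.

```python
numbers = 1_000_000 * 4          # rows * columns
data_bytes = numbers * 4         # 4-byte floats -> 16 MB
clock_hz = 4.7e9                 # i7-6700k @4.7GHz, as quoted above

# nio moments: 40 ms for the whole table
moments_sec = 0.040
ns_per_number = moments_sec / numbers * 1e9
cycles_per_number = moments_sec / numbers * clock_hz
moments_mb_per_sec = data_bytes / moments_sec / 1e6

# favg: 0.045 sec for 100 passes
favg_sec = 0.045 / 100
speedup = moments_sec / favg_sec
favg_gb_per_sec = data_bytes / favg_sec / 1e9

print(f"{ns_per_number:.0f} ns/number, {cycles_per_number:.0f} cycles/number")
print(f"{moments_mb_per_sec:.0f} MB/s vs {favg_gb_per_sec:.1f} GB/s")
print(f"speed-up: {speedup:.0f}X")
```

The cycle count comes out at 47, which the text rounds to 50, and the speed-up at about 89, which the text rounds to 90.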

It is straightforward (but demo-messy) to split the loop into p parts & then grand total over thread/process subtotals to realize that last 1.3x (45/35) speed-up. More recent server/HEDT models have much higher peak parallel/peak single core BW ratios than 1.3 (more like 15+X), pushing optimizing folk to parallelism complexity simply to saturate DIMMs. In this example, since outputs are tiny subtotals, it's fine to first memory map files, then fork to engage hardware parallelism with processes via cligen/procpool. (Were outputs giant, kids, i.e. child processes, could write to NIO files and return pathnames. Once you are whole CPU/system optimizing, what idea is best quickly becomes "it depends".)
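
The split-then-grand-total structure can be sketched in Python as below. This is an illustration of the partitioning logic only, not the cligen/procpool approach: the names split_ranges, partial_sum & parallel_mean are hypothetical, and the mapper argument defaults to the serial built-in map. With a real process pool one would more likely have each worker map its own slice of the file than pass the data around.

```python
def split_ranges(n, p):
    """Split range(n) into p contiguous, nearly equal [lo, hi) parts."""
    base, extra = divmod(n, p)
    ranges, lo = [], 0
    for k in range(p):
        hi = lo + base + (1 if k < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def partial_sum(job):
    """Subtotal and element count for one [lo, hi) slice of the data."""
    data, lo, hi = job
    return sum(data[lo:hi]), hi - lo


def parallel_mean(data, p, mapper=map):
    """Average `data` by grand-totaling p subtotals.

    `mapper` is any map-like callable; a pool's map could be swapped in.
    """
    jobs = [(data, lo, hi) for lo, hi in split_ranges(len(data), p)]
    total, count = 0.0, 0
    for subtotal, n in mapper(partial_sum, jobs):
        total += subtotal
        count += n
    return total / count
```

Because each worker returns only a (subtotal, count) pair, the data flowing back to the parent stays tiny no matter how large the input is, which is the property the paragraph above relies on.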

See db-bench.md for another worked out example, perhaps easier to compare to other systems.
