tinysort
Sort collections of Python objects that are too large to fit in available memory but small enough to fit on available disk.
Overview
tinysort operates on large streams of data by breaking them up into small chunks, each of which is passed to a multiprocessing job where it is sorted and dumped into an intermediary tempfile. These tempfiles are combined with heapq.merge() and converted into a stream of sorted data. See help(tinysort) for information about limitations, optimizations, and future plans.
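The chunk, sort, dump, and merge steps described above can be sketched with only the standard library. This is an illustration of the general technique, not tinysort's actual implementation: the names external_sort(), _dump_sorted_chunk(), and _load_chunk() are hypothetical, and the sketch sorts chunks serially rather than in multiprocessing jobs.

```python
import heapq
import itertools
import os
import pickle
import tempfile


def _dump_sorted_chunk(chunk):
    """Sort one chunk in memory and pickle its items to a tempfile."""
    fd, path = tempfile.mkstemp(suffix='.pickle')
    with os.fdopen(fd, 'wb') as f:
        for item in sorted(chunk):
            pickle.dump(item, f)
    return path


def _load_chunk(path):
    """Lazily yield the pickled items stored in one tempfile."""
    with open(path, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return


def external_sort(stream, chunksize=1000):
    """Yield the items of ``stream`` in sorted order.

    Hypothetical helper: only one chunk is held in memory while sorting.
    """
    stream = iter(stream)
    paths = []
    try:
        while True:
            chunk = list(itertools.islice(stream, chunksize))
            if not chunk:
                break
            paths.append(_dump_sorted_chunk(chunk))
        # heapq.merge() reads lazily from each already-sorted tempfile
        for item in heapq.merge(*(_load_chunk(p) for p in paths)):
            yield item
    finally:
        # Clean up the intermediary tempfiles
        for p in paths:
            os.remove(p)
```

Calling list(external_sort(values, chunksize=1000)) on a shuffled list should give the same result as sorted(values), while keeping at most one chunk in memory during the sorting phase.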
API
Additional documentation can be found with help(tinysort) and help(tinysort._sort), but here are the most important functions:
tinysort.stream2stream()
Sort a stream of data in situ:
import random

import tinysort

# Generate a bunch of random values
values = list(range(1000000))
random.shuffle(values)

# Sort values in parallel with 4 cores and iterate over the results
for item in tinysort.stream2stream(values, jobs=4):
    pass
tinysort.file2file()
Sort a file into a new file:
import random

import tinysort
import tinysort.io

# We use this class for reading + writing data
# See the section on data serialization for more info
serializer = tinysort.io.Pickle()

# Generate some random values and write to a file
values = list(range(1000000))
random.shuffle(values)
with serializer.open('data.pickle', 'w') as f:
    for v in values:
        f.write(v)

# Sort from 1 file into another in parallel across 4 cores
tinysort.file2file(
    'data.pickle',
    'sorted-data.pickle',
    reader=serializer,
    writer=serializer,
    jobs=4)
Other Functions
The various other combinations of working with streams and files also exist:
- file2stream() - Read a file and sort it.
- files2stream() - Merge and sort a bunch of files into a single sorted stream.
- stream2file() - Sort a stream and write it to a file.
Data Serialization
tinysort operates by sorting its inputs in chunks, serializing to disk, and using heapq.merge() to produce a final sorted stream of data. The tinysort.io module contains several data serialization classes that operate like:
import tinysort.io

serializer = tinysort.io.Pickle()

with serializer.open('data.pickle', 'w') as f:
    f.write({'key': 'value'})

with serializer.open('data.pickle') as f:
    for item in f:
        pass
By default Pickle() is used for tempfiles, as it can handle a large variety of objects.
Other formats include DelimitedText() and NewlineJSON(). The newline JSON serializer requires the newlinejson library, which is not installed automatically as a dependency.
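To make the open(), write(), and iteration pattern concrete, here is a minimal sketch of a serializer with that shape, built on the standard pickle module. PickleSerializer and PickleFile are hypothetical names used only for illustration; the real classes in tinysort.io may be structured differently.

```python
import pickle


class PickleFile(object):
    """Hypothetical file wrapper: write() pickles one object, iteration unpickles them."""

    def __init__(self, path, mode='r'):
        # Pickle data is binary, so map 'r'/'w' to 'rb'/'wb'
        self._f = open(path, mode + 'b')

    def write(self, obj):
        pickle.dump(obj, self._f)

    def __iter__(self):
        # Keep unpickling objects until the end of the file is reached
        while True:
            try:
                yield pickle.load(self._f)
            except EOFError:
                return

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self._f.close()


class PickleSerializer(object):
    """Hypothetical serializer exposing an open() call like the one shown above."""

    def open(self, path, mode='r'):
        return PickleFile(path, mode)
```

Writing one object per write() call and reading one object per iteration step is what allows sorted tempfiles to be consumed lazily during a merge.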
Terminology and Sorter kwargs
See help(tinysort.io) for a more detailed explanation.
The serializers in tinysort.io are given to the sort functions under one of three names: reader, writer, and serializer. All take the same kind of thing, but serve different purposes. reader is used when reading from an input file, writer is used when writing to an output file, and serializer is used for writing to and reading from intermediary tempfiles.
The sort functions all take a generic **kwargs. For more information see help(tinysort._sort), but these arguments are generally chunksize, jobs, and keyword arguments for Python's sorted() and/or heapq.merge().
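The following sketch suggests why the same keyword arguments are relevant to both sorted() and heapq.merge(): a merge only produces correct output if every chunk was sorted with the same key and reverse settings. sort_in_chunks() is a hypothetical in-memory helper, not part of tinysort, and it assumes Python 3.5 or newer, where heapq.merge() accepts key and reverse.

```python
import heapq


def sort_in_chunks(values, chunksize=3, **kwargs):
    """Sort fixed-size chunks, then merge them, passing the same kwargs to both steps."""
    chunks = [
        sorted(values[i:i + chunksize], **kwargs)
        for i in range(0, len(values), chunksize)]
    return list(heapq.merge(*chunks, **kwargs))


words = ['pear', 'fig', 'banana', 'kiwi', 'apple', 'date', 'plum']

# Longest words first; the key and reverse settings reach both sorted() and merge()
by_length_desc = sort_in_chunks(words, key=len, reverse=True)
```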
heapq Module
Python 2's heapq.merge() doesn't accept a key argument, so this package contains a copy of the heapq module, in tinysort._backport_heapq, taken from somewhere between Python 3.5 and 3.6. This code is largely unchanged, except for a few small Python 2 compatibility changes, and retains its original license. A unit test that runs only on the CI server compares a snapshot of the original source code to the current source code and fails if any changes occur, which shouldn't happen too often.
Developing
$ git clone https://github.com/geowurster/tinysort.git
$ cd tinysort
$ pip install -e .\[dev\]
$ py.test tests --cov tinysort --cov-report term-missing
License
See LICENSE.txt
Changelog
See CHANGES.md