datareducer


License
MIT
Install
pip install datareducer==0.1.1

Documentation

Datareducer

Installation

pip3 install datareducer

# with optional Cython support
pip3 install datareducer[Cython]

API

from datareducer import shader

data = [[1, 1e-6], [2,2e-5], [3,4e-4]]

# set limits for 2 dimensions for the shader
data_store = shader().setLimits(0, 20, 21).setLimits(1e-12, 1e-3, 100, scale_type="log10")
# The shader can have arbitrary number of dimensions. It can be used to construct a simple histogram if only one dimension is defined, i.e. 
oneD_store = shader().setLimits(0, 20, 21)
# or multiple dimensions
multi_store = shader() \
    .setLimits(0, 20, 21) \
    .setLimits(-10, 100, 200) \
    .setLimits(1, 1e6, 7, scale_type="log10")

# map values to bins
data_store.applyOnBatches(data, yValueIndex=1)
# named parameter yValueIndex refers to the position of the y-value in each pair

# map a single pair of values to a bin
data_store.apply([4, 1e-7], yValueIndex=1)

In the example above data_store is an aggregator or also know as a sink in the context of data streams. This approach can be used on arbitrarily big dataset and each value or value-pairs can be efficiently streamed and aggregated in the shader. The shader stored the aggregates in a matrix or tensor whose size is determined by the number of bins for each dimension.

Aggregates can be extracted from the shader container:

# Get value count per bin in a 2D matrix, because two dimensions were defined above
print(data_store.getAgg("cnt"))
# Assuming the right-most value is of interest
print(data_store.getAgg("min"))
# returns the min value for each bin
# or the max value
print(data_store.getAgg("max"))
print(data_store.getAgg("sum"))
print(data_store.getAgg("sum2"))

References