OnlineStats
Online algorithms for statistics.
OnlineStats is a Julia package which provides online algorithms for statistical models. Online algorithms are well suited for streaming data or when data is too large to hold in memory. Observations are processed one at a time and all algorithms use O(1) memory.
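To make the O(1) memory claim concrete, here is a minimal sketch of an online mean in plain Julia. The type `RunningMean` and the function `update!` are hypothetical names used only for illustration; they are not part of OnlineStats and are not necessarily how the package implements `Mean`.

```julia
# Hypothetical helper, not part of OnlineStats: a running mean that keeps
# only two numbers (count and current mean), whatever the stream length.
mutable struct RunningMean
    n::Int
    μ::Float64
end
RunningMean() = RunningMean(0, 0.0)

# Process a single observation and update the state in place.
function update!(s::RunningMean, y::Real)
    s.n += 1
    s.μ += (y - s.μ) / s.n   # same as ((n - 1) * μ + y) / n
    return s
end

s = RunningMean()
for y in (2.0, 4.0, 6.0, 8.0)   # stands in for a data stream
    update!(s, y)
end
println(s.μ)   # 5.0, the mean of the four observations
```

Each observation is discarded after the update, so the state stays the same size for any number of observations.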
API and Examples
Overview
Every OnlineStat is a Type
There are two ways of creating an OnlineStat:
- Create an "empty" object and add data
- Create an object with data
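using OnlineStats
y = randn(100)
o = Mean()     # create an "empty" object...
fit!(o, y)     # ...then add data
o = Mean(y)    # or create the object with data in one step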
All OnlineStats can be updated
Note: fit!(o, y) is a cleaner way to write for yi in y; fit!(o, yi); end
y1 = randn(100)
y2 = randn(100)
o = Variance()
fit!(o, y1)
fit!(o, y2)
nobs(o) # number of observations == 200
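Repeated updates work because the object's state carries everything needed to continue from where it stopped. The sketch below illustrates this with Welford's online variance in plain Julia. This is an assumed technique chosen for illustration, not necessarily what `Variance` does internally, and `RunningVariance`, `update!`, and `samplevar` are hypothetical names.

```julia
# Hypothetical sketch (not OnlineStats source): Welford's online variance.
mutable struct RunningVariance
    n::Int        # number of observations seen
    μ::Float64    # running mean
    m2::Float64   # running sum of squared deviations from the mean
end
RunningVariance() = RunningVariance(0, 0.0, 0.0)

# Update the state with a single observation.
function update!(s::RunningVariance, y::Real)
    s.n += 1
    δ = y - s.μ
    s.μ += δ / s.n
    s.m2 += δ * (y - s.μ)
    return s
end

# Updating with a whole vector is just a loop over single observations.
function update!(s::RunningVariance, ys::AbstractVector)
    for y in ys
        update!(s, y)
    end
    return s
end

samplevar(s::RunningVariance) = s.m2 / (s.n - 1)   # unbiased sample variance

y1 = [1.0, 2.0, 3.0]
y2 = [4.0, 5.0]
s = RunningVariance()
update!(s, y1)
update!(s, y2)
println(s.n)            # 5 observations in total
println(samplevar(s))   # 2.5, the sample variance of 1, 2, 3, 4, 5
```

Feeding `y1` and then `y2` gives the same state as a single pass over the concatenated data.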
New data can be weighted differently
o = Mean(EqualWeight())
o2 = Variance(y, ExponentialWeight(.1))
o3 = QuantileMM(y, LearningRate(.6))
- EqualWeight()
  - All observations are weighted equally. Weight at update t is 1 / t.
- BoundedExponentialWeight(minstep::Real), BoundedExponentialWeight(lookback::Integer)
  - Use equal weight until weights reach minstep = 2 / (lookback + 1), then hold constant. Weight at update t is max(minstep, 1 / t).
- ExponentialWeight(λ::Real), ExponentialWeight(lookback::Integer)
  - True exponential weighting. Each update weight is constant λ = 2 / (lookback + 1).
- LearningRate(r)
  - r should be in (0.5, 1].
  - For stochastic approximation methods. Weight at update t is 1 / t^r.
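The weight formulas above can be written out as plain Julia functions. This is a sketch only: the function names are hypothetical and not OnlineStats functions. The update rule `smooth`, which blends the old value with the new observation, is an assumption about how a weight is applied, not something taken from the package source.

```julia
# Hypothetical helpers for illustration; these are not OnlineStats functions.
# Each returns the weight γ given to the t-th update.
equal_weight(t) = 1 / t
bounded_exponential_weight(t, minstep) = max(minstep, 1 / t)
exponential_weight(t, λ) = λ
learning_rate_weight(t, r) = 1 / t^r

# Assumed update rule: blend the old value with the new observation.
smooth(μ, y, γ) = μ + γ * (y - μ)

lookback = 9
λ = 2 / (lookback + 1)   # 0.2

# Print a few weights side by side: equal, bounded exponential, learning rate.
for t in (1, 2, 5, 10, 100)
    println(t, "  ", equal_weight(t), "  ",
            bounded_exponential_weight(t, λ), "  ",
            learning_rate_weight(t, 0.6))
end

# Apply a weight scheme to a running mean, one observation at a time.
function weighted_mean(y, weight)
    μ = 0.0
    for (t, yt) in enumerate(y)
        μ = smooth(μ, yt, weight(t))
    end
    return μ
end

y = collect(1.0:20.0)
println(weighted_mean(y, equal_weight))                              # ordinary sample mean
println(weighted_mean(y, t -> bounded_exponential_weight(t, λ)))     # recent values count more
```

With equal weights the result is the ordinary sample mean. With the bounded exponential weights the result sits closer to the most recent observations, because the weight stops shrinking once it reaches `minstep`.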
OnlineStats share a common interface
- value(o)
  - the associated value of an OnlineStat
- nobs(o)
  - the number of observations seen
Advanced Usage
New data can be updated in batches
Batch updates have an effect on convergence for stochastic approximation methods.
y = randn(100_000)
o = QuantileMM(tau = [.25, .75]) # Online MM algorithm for quantiles
fit!(o, y, 10) # update in batches of size 10
Weights can be overridden
Users can provide a vector of weights along with the input:
y = randn(1000)
weights = rand(1000)
o = Mean()
fit!(o, y, weights)
Or a single weight to be used for each update:
weight = rand()
o = Mean()
fit!(o, y, weight)
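As a rough illustration of what overriding the weights means for a mean, the sketch below applies user-supplied weights in plain Julia. The names `smooth` and `smooth_all` are hypothetical, and the update rule is an assumption rather than the package's implementation. In particular, whether OnlineStats applies a single weight per observation or once per call is not stated here; the sketch assumes per observation.

```julia
# Hypothetical sketch, not the OnlineStats implementation.
# Assumed update rule: μ ← μ + w * (y - μ) for an update with weight w.
smooth(μ, y, w) = μ + w * (y - μ)

# One weight per observation.
function smooth_all(μ, y::AbstractVector, w::AbstractVector)
    length(y) == length(w) || throw(DimensionMismatch("y and w must have equal length"))
    for (yi, wi) in zip(y, w)
        μ = smooth(μ, yi, wi)
    end
    return μ
end

# A single weight reused for every update (assumed to apply per observation).
smooth_all(μ, y::AbstractVector, w::Real) = smooth_all(μ, y, fill(w, length(y)))

y = [10.0, 20.0, 30.0]
println(smooth_all(0.0, y, [1.0, 0.5, 0.5]))   # 0 → 10 → 15 → 22.5
println(smooth_all(0.0, y, 1.0))               # weight 1 keeps only the latest value, 30.0
```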