Sedan implements an encoding of scalar and simple aggretate data with a static lexicographical ordering to match a defined runtime sort/collation order, both intended to be language- and runtime-neutral.
Sedan is part of the Quilt Project.
(Historical note re: the name: sortable EDN => sedan.)
A lot of systems provide features that rely on ordered data (the most common example perhaps being database indexes), but those systems' data models are rarely very powerful, at least in their default/off-the-shelf form — often limited to strings or sometimes even only bytes/blobs — and introducing any runtime to such systems to support additional datatypes is almost always impractical.
If we could encode rich data values into (byte)strings that retained those values' runtime collation order, many of the benefits of our data model can be layered on top of these systems' string/byte-only capabilities.
Mise en garde
Use may cause personal injury.
This encoding will change, at a minimum to en/decode to/from binary (see TODOs
below). When no further changes that will impact the encoding are pending, a
1.0.0 release will be cut.
That said, it's been dog-fooded pretty heavily in conjunction with other parts of Quilt, so the brave should hack away.
Sedan is available in Clojars and Maven central. Add this
:dependency to your
Or, add this to your Maven project's
<dependency> <groupId>org.quilt</groupId> <artifactId>sedan</artifactId> <version>0.0.5</version> </dependency>
Sedan requires Clojure 1.6.0+, or a recent version of ClojureScript (check the
project.clj file for the current dependency; assume it or a later rev of
ClojureScript is needed).
Data serialized with Sedan maintains the natural runtime sort order among values
of the same type as implemented in Clojure[Script].NOTE This implies a simple invariant for any
pair of values of the same type,
y, leaving aside the detail of the
magnitude of the return value of
compare (which is irrelevant to collation
; (require '[cemerick.sedan :as sedan]) (== (compare x y) (compare (sedan/encode x) (sedan/encode y)))
Sedan deviates from this intuition in only a few cases:
First, values of different types are generally not comparable (as per
cljs.core.IComparable), so Sedan
encoding stratifies values by prefixing all serializations with a tag, a
single-byte type identifier. This ensures that values with very similar
serializations (e.g. the string
"foo" and the symbol
'foo) are kept very
separate when collated.
Second, of all of Clojure's collections, only vectors are comparable. However, their collation semantics (smaller collections sort before larger ones, and same-sized collections sort based on the first pairwise difference between their elements) are readily extended to other finite, ordered collections, including seqs and lists.
Sedan's serialization does this.
Third, there are some types for which the "natural order" as defined by Clojure or the JDK is not carried forward to the encoded representation. The only such values that are so affected currently are:
- Certain numeric edge cases where enabling a distinct decodable encoding
necessitates deviating from the ordering implementated by Clojure at runtime
-0.0, which is equivalent to other zero values as far as
compareis concerned, but which has its own "partition" within numbers as encoded by Sedan). Doing otherwise would make it impossible to represent such values in Sedan-encoded data.
These are the rules of Sedan's encoding, which are implemented for use at
runtime by the
cemerick.sedan/compare function. So, revisiting the invariant
above, this expression exclusively describes Sedan's function:
(== (sedan/compare x y) (compare (sedan/encode x) (sedan/encode y)))
- Compactness: efficiency is obviously preferable when possible, but reliable effective collation semantics are the ballgame here. Sedan is quite compact for what it does (e.g. the size of Sedan-encoded numbers grows sub-linearly with with the scale+precision of the values in question, though a larger "constant factor" is involved relative to other serialization strategies that). A reasonable generalization is that data encoded with Sedan is likely to be very slightly larger than the same data encoded as EDN; the differential is effectively only due to Sedan's type tags, and will be totally dominated by the actual data, esp. string and byte values.
The scalars and aggregates supported by Sedan are largely a consequence of its Clojure/ClojureScript origins, but their encoding and the semantics of Sedan's comparison function at runtime are intended to be language- and runtime-neutral. It is hoped that Sedan encoding will eventually be implemented in other languages for other runtimes.
Supported types (listed in the same order as Sedan's stratification):
- all numbers, except
- vectors, lists, seqs (these are collapsed into a single partition, and materialized as vectors on read)
- all custom types (see next section)
It is possible to extend your own types to Sedan's protocols so they may participate in its encoding, decoding, and comparison operations, just as you might add support for your own types' tagged literals to an EDN reader (with all of the same tradeoffs and shortcomings).
The particulars of this are still under consideration from an API standpoint, so are an active TODO item. Feel free to check out the source and utilize as you wish, but at your own risk.
Items potentially affecting the encoding itself
- Encode/decode binary instead character data
- add support for additional types:
- byte arrays
- rationals, outside of the numeric partition as described below
(none planned a.t.m.)
All other items (API / runtime related)
- Encoding/decoding runtime optimizations: yes, please.
- roll Splice's custom Sedan types (references and ordered attributes) here? Tension between Sedan being a general-purpose library vs. one dedicated to supporting Quilt.
- disallow NaN in :number impls in cemerick.sedan.impl
- verify cross-runtime compat by generating N values, sorting them based on
sedan/compare, writing them to disk, and then in the complementary runtime, read them in, and verify that the collation on disk is valid according to its
sedan/compareimpl. Do this in both directions (JVM <==> JS)
- provide a way to specify how to decode sequential things (list vs. vector vs. seq [maybe to support lazy decoding?)
- provide a way to indicate numeric decode preferences (e.g. force bigdec or doubles, the former ensuring that no precision inflation occurs, etc)
- eliminate (arbitrary) limit on "arbitrary precision" integers
- Dates suck, per usual:
- Use Joda JVM-side
- Find a JS Date type that can at least support the full range of longs as millis (even if we have to use goog.math.Integer)
- consume/produce JS big numerics using a single blessed arbitrary-precision library, maybe bignumber.js
Design decision non sequiturs
Documentation of various design decisions made with rationales, hopefully:
- Despite the potential for snafus arising from necessary changes to the encoding (e.g. situations like this that lead to unfortunate things like that, rejected the notion of "versioning" each Sedan string/byte-array. Given the desire to Sedan-encode small datums, even a one-byte overhead (for each value!) seems bad. The revision of Sedan encoding should be managed operationally, with up/downgrades happening as necessary at storage/transmission boundaries.
- Not going to support sorted set and map encodings at this time. No compelling use case yet, and the testing/usage complexity is potentially daunting.
- There is no benefit to having separate partitions for lists and vectors. All sequential values are encoded to the same partition. The concrete type could be determined upon decode per user preference, but currently defaults to vector.
- Rationals will need to be encoded as an "opaque" type in a separate partition from other numbers. Impossible to define appropriate collation order within other numeric types, and an approximation of one would be worse than having none at all.
- Support -0.0, Infinity, and -Infinity via dedicated number (sub)partitions
- No support for encoding
NaN. Since it's definitionally neither greater or less than any other number, collating it in any way would be highly confusing, and perhaps very bad. Applications probably should encode a representation of
NaN, since it's a fundamentally uncomputable artifact of IEEE 754.
- The "uncomputable" assertion is somewhat undermined by the (ill-advised?)
use of signalling
NaNs in certain calcuations, but such things are probably best considered to be hacks, not valid instances of floating-point usage that should influence how Sedan should partition the space of possible encodable values.
- The "uncomputable" assertion is somewhat undermined by the (ill-advised?) use of signalling
There is obviously a great deal of prior art around language- and runtime-neutral data encoding schemes.
Sources for general inspiration for encoding values into strings (or bytes) that lexicographically collate in a way consistent with the runtime sort order of those values include:
sext, "A sortable serialization library
[that] offers a serialization format (a la
term_to_binary()) that preserves the Erlang term order."
The strategy(ies) for encoding arbitrary-precision numbers were influenced / informed by a number of prior efforts:
- "Creating ASCII-Sortable Strings from Numbers", and the XKCD forum thread linked therein: ("Generating savegame names so they appear in sorted order").
- The encoding implementation for arbitrary-precision integers
java.math.BigIntegers, specifically) in
org.apache.solr.util.BCDUtils(originally written by Yonik Seeley AFAICT). This implementation's strategy forms the basis for the treatment of numbers here, though extended to support decimals in addition to integers.
- Steven Rowe's speculation of a "thoroughly untested idea" to encode
java.math.BigDecimals for indexing in Lucene / Solr: http://grokbase.com/t/lucene/java-user/08bm2ysjxx/bigdecimal-values
- Doug Wood's draft IETF proposal for "Directory string representation for
values", and the
implementation of it by D. Kavanagh in the typica SimpleDB
com.xerox.amazonws.simpledb.DataUtils, which was later slurped into Rummage, a SimpleDB client largely written in Clojure).
Copyright © 2012-* Chas Emerick.
Distributed under the terms of the Mozilla Public License, v2.0+.