org.quilt/sedan

encoding scalar and simple aggregate data with a static lexicographical ordering matching a defined runtime collation order


License
MPL-2.0

Documentation

sedan Travis CI status

1972 Oldsmobile Delta 88

Sedan implements an encoding of scalar and simple aggretate data with a static lexicographical ordering to match a defined runtime sort/collation order, both intended to be language- and runtime-neutral.

Sedan is part of the Quilt Project.

(Historical note re: the name: sortable EDN => sedan.)

Why?

A lot of systems provide features that rely on ordered data (the most common example perhaps being database indexes), but those systems' data models are rarely very powerful, at least in their default/off-the-shelf form — often limited to strings or sometimes even only bytes/blobs — and introducing any runtime to such systems to support additional datatypes is almost always impractical.

If we could encode rich data values into (byte)strings that retained those values' runtime collation order, many of the benefits of our data model can be layered on top of these systems' string/byte-only capabilities.

Mise en garde

Use may cause personal injury.

This encoding will change, at a minimum to en/decode to/from binary (see TODOs below). When no further changes that will impact the encoding are pending, a 1.0.0 release will be cut.

That said, it's been dog-fooded pretty heavily in conjunction with other parts of Quilt, so the brave should hack away.

"Installation"

Sedan is available in Clojars and Maven central. Add this :dependency to your Leiningen project.clj:

[org.quilt/sedan "0.0.5"]

Or, add this to your Maven project's pom.xml:

<dependency>
  <groupId>org.quilt</groupId>
  <artifactId>sedan</artifactId>
  <version>0.0.5</version>
</dependency>

Sedan requires Clojure 1.6.0+, or a recent version of ClojureScript (check the project.clj file for the current dependency; assume it or a later rev of ClojureScript is needed).

"Specification"

Data serialized with Sedan maintains the natural runtime sort order among values of the same type as implemented in Clojure[Script].NOTE This implies a simple invariant for any pair of values of the same type, x and y, leaving aside the detail of the magnitude of the return value of compare (which is irrelevant to collation order, sort, etc):

; (require '[cemerick.sedan :as sedan])
(== (compare x y) (compare (sedan/encode x) (sedan/encode y)))

Sedan deviates from this intuition in only a few cases:

First, values of different types are generally not comparable (as per java.lang.Comparable or cljs.core.IComparable), so Sedan encoding stratifies values by prefixing all serializations with a tag, a single-byte type identifier. This ensures that values with very similar serializations (e.g. the string "foo" and the symbol 'foo) are kept very separate when collated.

Second, of all of Clojure's collections, only vectors are comparable. However, their collation semantics (smaller collections sort before larger ones, and same-sized collections sort based on the first pairwise difference between their elements) are readily extended to other finite, ordered collections, including seqs and lists.

Sedan's serialization does this.

Third, there are some types for which the "natural order" as defined by Clojure or the JDK is not carried forward to the encoded representation. The only such values that are so affected currently are:

  • UUIDs
  • Certain numeric edge cases where enabling a distinct decodable encoding necessitates deviating from the ordering implementated by Clojure at runtime (e.g. -0.0, which is equivalent to other zero values as far as compare is concerned, but which has its own "partition" within numbers as encoded by Sedan). Doing otherwise would make it impossible to represent such values in Sedan-encoded data.

These are the rules of Sedan's encoding, which are implemented for use at runtime by the cemerick.sedan/compare function. So, revisiting the invariant above, this expression exclusively describes Sedan's function:

(== (sedan/compare x y) (compare (sedan/encode x) (sedan/encode y)))

Non-objectives

  • Compactness: efficiency is obviously preferable when possible, but reliable effective collation semantics are the ballgame here. Sedan is quite compact for what it does (e.g. the size of Sedan-encoded numbers grows sub-linearly with with the scale+precision of the values in question, though a larger "constant factor" is involved relative to other serialization strategies that). A reasonable generalization is that data encoded with Sedan is likely to be very slightly larger than the same data encoded as EDN; the differential is effectively only due to Sedan's type tags, and will be totally dominated by the actual data, esp. string and byte[] values.

Supported types/values

The scalars and aggregates supported by Sedan are largely a consequence of its Clojure/ClojureScript origins, but their encoding and the semantics of Sedan's comparison function at runtime are intended to be language- and runtime-neutral. It is hoped that Sedan encoding will eventually be implemented in other languages for other runtimes.

Supported types (listed in the same order as Sedan's stratification):

  • nil
  • booleans
  • all numbers, except Ratios and NaN; includes:
    • shorts
    • ints
    • longs
    • floats
    • doubles
    • BigInteger
    • BigDecimal
  • strings
  • symbols
  • keywords
  • dates
  • UUIDs
  • vectors, lists, seqs (these are collapsed into a single partition, and materialized as vectors on read)
  • all custom types (see next section)

The same set of types is supported in JavaScript / ClojureScript, with the difference that big numerics are represented via opaque non-computable containers, which represent the values as strings. You can bring those strings over to your arbitrary-precision numerics library of choice in JavaScript. This is a stopgap, pending developments in ClojureScript support for arbitrary-precision numerics, and/or compelling demand for more concreate JS bignum support here.

Extending Sedan

It is possible to extend your own types to Sedan's protocols so they may participate in its encoding, decoding, and comparison operations, just as you might add support for your own types' tagged literals to an EDN reader (with all of the same tradeoffs and shortcomings).

The particulars of this are still under consideration from an API standpoint, so are an active TODO item. Feel free to check out the source and utilize as you wish, but at your own risk.

TODOs

Items potentially affecting the encoding itself

  • Encode/decode binary instead character data
    • IndexedDB supports binary values already, binary keys coming soon (already in chrome, though only when chrome://flags/#enable-experimental-web-platform-features is set)
  • add support for additional types:
    • byte arrays
    • "references"
    • URI/URLs?
    • rationals, outside of the numeric partition as described below

(Space) Optimizations

(none planned a.t.m.)

All other items (API / runtime related)

  • Encoding/decoding runtime optimizations: yes, please.
  • roll Splice's custom Sedan types (references and ordered attributes) here? Tension between Sedan being a general-purpose library vs. one dedicated to supporting Quilt.
  • disallow NaN in :number impls in cemerick.sedan.impl
  • verify cross-runtime compat by generating N values, sorting them based on sedan/compare, writing them to disk, and then in the complementary runtime, read them in, and verify that the collation on disk is valid according to its sedan/compare impl. Do this in both directions (JVM <==> JS)
  • provide a way to specify how to decode sequential things (list vs. vector vs. seq [maybe to support lazy decoding?)
  • provide a way to indicate numeric decode preferences (e.g. force bigdec or doubles, the former ensuring that no precision inflation occurs, etc)
  • eliminate (arbitrary) limit on "arbitrary precision" integers
  • Dates suck, per usual:
    • Use Joda JVM-side
    • Find a JS Date type that can at least support the full range of longs as millis (even if we have to use goog.math.Integer)
  • consume/produce JS big numerics using a single blessed arbitrary-precision library, maybe bignumber.js

Design decision non sequiturs

Documentation of various design decisions made with rationales, hopefully:

  • Despite the potential for snafus arising from necessary changes to the encoding (e.g. situations like this that lead to unfortunate things like that, rejected the notion of "versioning" each Sedan string/byte-array. Given the desire to Sedan-encode small datums, even a one-byte overhead (for each value!) seems bad. The revision of Sedan encoding should be managed operationally, with up/downgrades happening as necessary at storage/transmission boundaries.
  • Not going to support sorted set and map encodings at this time. No compelling use case yet, and the testing/usage complexity is potentially daunting.
  • There is no benefit to having separate partitions for lists and vectors. All sequential values are encoded to the same partition. The concrete type could be determined upon decode per user preference, but currently defaults to vector.
  • Rationals will need to be encoded as an "opaque" type in a separate partition from other numbers. Impossible to define appropriate collation order within other numeric types, and an approximation of one would be worse than having none at all.
  • Support -0.0, Infinity, and -Infinity via dedicated number (sub)partitions
  • No support for encoding NaN. Since it's definitionally neither greater or less than any other number, collating it in any way would be highly confusing, and perhaps very bad. Applications probably should encode a representation of NaN, since it's a fundamentally uncomputable artifact of IEEE 754.
    • The "uncomputable" assertion is somewhat undermined by the (ill-advised?) use of signalling NaNs in certain calcuations, but such things are probably best considered to be hacks, not valid instances of floating-point usage that should influence how Sedan should partition the space of possible encodable values.

Credits

There is obviously a great deal of prior art around language- and runtime-neutral data encoding schemes.

Sources for general inspiration for encoding values into strings (or bytes) that lexicographically collate in a way consistent with the runtime sort order of those values include:

  • sext, "A sortable serialization library [that] offers a serialization format (a la term_to_binary()) that preserves the Erlang term order."
  • bytewise, "Binary serialization of arbitrarily complex [JavaScript] structures that sort bytewise"

The strategy(ies) for encoding arbitrary-precision numbers were influenced / informed by a number of prior efforts:

License

Copyright © 2012-* Chas Emerick.

Distributed under the terms of the Mozilla Public License, v2.0+.