JLTool - Tools for JSON-Lines Records
by Cathal Garvey, ©2016, Released under terms of the GNU AGPLv3 or later
The JSON-Lines format is a clean alternative to comma-separated values as a means to store data records in a scaleable, flat manner, for cases where a database is too much but a flat JSON file is inefficient.
JLTool is a tool for working with JSON-Lines records; it assists in schema validation, duplicate detection, de-duplication and normalisation, and 'grepping' using objectpath queries.
ObjectPath is supported for most operations; in particular, for fetching a unique, representative ID from objects for purposes of deduplication or diffing documents. For grepping, ObjectPath can be used to query for matching rows.
Installing JLTool with python3 setup.py install
or pip install jl
will install
the jltool
command-line tool, which is the primary intended purpose. However,
for operations on files the subcommands of jltool
are all available in the jltool
import if desired. Just open it in ipython
and take a look at the docs on the
command functions for more information.
Usage Examples
Say you have a JSON-Lines file 'records.jsonl
' containing records that look like this:
{"type": "email", "value": "cathal@isgre.at", "meta": {"foo": "bar"} }
..which is similar enough to the job I needed doing, when I wrote jltool
. :)
Many of the commands use objectpath, as an optional way of selecting or uniquifying records. Check the documentation there for info.
For some commands that require a 'fingerprint' for a record in order to work (dedupe, report, diff, clean), if an objectpath selector is not given then a fingerprint will be generated by normalising objects (sorted keys) and hashing the resulting JSON.
This may be highly misleading for some kinds of data, for
example when a record may represent an updated form of another record, differing
only in timestamp. However do note that in such cases (update records), the first
matching result is kept, discarding the rest, by default. This may also not be
desired behaviour. An option to reverse this behaviour may be added in the future,
but would mean loading everything into memory. Meanwhile, pipe files backwards-linewise
using tac
(on Linux, obviously) to approximate a reversal of this behaviour.
Get a Report
The report
subcommand returns a report on the size and structure of a file,
including reporting common keys and keys that have an uncertain type/schema:
$ jltool report records.jsonl
Number of records: 13
Number of Duplicates: 0
Common keys: {'type': {'string'}, 'value': {'string'}, 'meta': {'object'}}
Filtering Reports
The grep
subcommand allows the use of objectpath queries to filter a JSONL file.
The objectpath query must evaluate to a boolean. If desired, deduplication
may be done prior to selection, by passing a -s
selector by which to deduplicate
records, but if no -s
selector is given then no deduplication is performed.
$ cat records.jsonl
{"type": "email", "value": "foo@bar.com", "meta": {}}
{"type": "twitter", "value": "onetruecathal", "meta": {"awesomeness": 9001}}
{"type": "email", "value": "baz@qux.tld", "meta": {"lol": "wut"}}
$ jltool grep '$.type is "twitter"' records.jsonl
{"type": "twitter", "value": "onetruecathal", "meta": {"awesomeness": 9001}}
Difference Between Two files
The diff
command reports records that are present in one file and not the
other. This is done without regard to order, and hashes or representative
extracted strings are stored in memory during this operation, so for very large
files this may consume a lot of RAM.
By default, this uses the hash of a normalised form of each line as a fingerprint, but this is obviously not ideal in cases where metadata, timestamps or other bits may cause two functionally identical records to appear different.
To fix this, you can use objectpath queries to extract a representative string
according to your needs, by passing an objectpath query with the -s
flag.
This is also true of many ensuing commands, not just diff
.
$ # Observe query that pulls out type and value for a unique reference..
$ jltool diff -s '$.type + ":" + $.value' records.jsonl others.jsonl
<<< 50: {"meta": {"job": "http://www.lol.org"}, "type": "email", "value": "kboo@lol.foo"}
<<< 51: {"meta": {"job": "http://www.baaa.com"}, "type": "email", "value": "adonis@rap.com"}
>>> 0: {"meta": {"job": "http://nonsense.com/"}, "type": "twitter", "value": "nonsense"}
Deduplicate
The dedupe
command reports duplicate records. This is where objectpath queries
may become relevant, because the same "record" may have different metadata
attached, and may therefore appear to be different if simply serialised as
ordered JSON, which is the default.
Note that due to the linewise way reports are made, this may issue notifications of duplicates several times as additional duplicates appear, as in the below example.
$ jltool dedupe records.jsonl
Duplicate of line 0 at lines: [13]
Duplicate of line 2 at lines: [15]
Duplicate of line 2 at lines: [15, 28]
Duplicate of line 5 at lines: [18, 31]
Duplicate of line 10 at lines: [23, 36, 49]
Duplicate of line 12 at lines: [25, 38, 51]
Found 39 duplicates.
Clean
The clean
subcommand normalises, minifies, and deduplicates jsonl files.
It should be used with similar care to other optional-query commands as, if
a query is incorrectly formed, it may result in loss of data.
$ ls -lah
drwxr-sr-x 4 cathal cathal 4.0K May 31 16:39 .
drwxrwxr-x 3 cathal cathal 4.0K May 30 19:52 ..
-rw-rw-r-- 1 cathal cathal 10K May 31 15:43 records.jsonl
$ jltool clean records.jsonl dedupe.jsonl
$ ls -lah
drwxr-sr-x 4 cathal cathal 4.0K May 31 16:39 .
drwxrwxr-x 3 cathal cathal 4.0K May 30 19:52 ..
-rw-rw-r-- 1 cathal cathal 2.4K May 31 16:42 dedupe.jsonl
-rw-rw-r-- 1 cathal cathal 10K May 31 15:43 records.jsonl