A set of utilities for stream-processing MediaWiki data.
mwstream (-h | --help)
mwstream <utility> [-h|--help]
diffs2persistence- Generates token persistence statistics using revision JSON blobs with diff information.
dump2json- Converts an XML dump to a stream of revision JSON blobs.
dump2diffs- Computes diffs directly from an XML dump.
fetch_missing_diffs- Scans diff documents looking for missing diffs and fills them in.
json2diffs- Computes and adds a "diff" field to a stream of revision JSON blobs
mend_diffs- Mends diffs that were computed in chunks and out of order.
persistence2stats- Aggregates token persistence statistics into revision statistics.
wikihadoop2json- Converts a Wikihadoop-processed stream of XML pages to JSON blobs.
json2tsv- Converts a stream of JSON blobs to tab-separated values based on a set of field names.
normalize- Normalizes old versions of RevisionDocument JSON schemas to correspond to the most recent schema version.
validate- Validates JSON against a provided schema.
truncate_text- Truncates the 'text' field of JSON blobs to a limited length in Unicode characters and adds a boolean 'truncated' field (addresses content dump vandalism issues).
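
To illustrate the kind of per-blob transformation these utilities perform, here is a minimal Python sketch of the behavior described for truncate_text. It is not the mwstreaming implementation. It assumes the stream carries one JSON blob per line, and the function name `truncate_text`, the parameter `max_chars`, and the sample data are hypothetical names chosen for this example.

```python
import json


def truncate_text(lines, max_chars):
    """Yield JSON strings whose 'text' field holds at most max_chars characters.

    Assumption: each input line is one complete JSON object (a revision blob).
    Python str slicing counts Unicode code points, not bytes.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines in the stream
        doc = json.loads(line)
        text = doc.get("text")
        if isinstance(text, str) and len(text) > max_chars:
            doc["text"] = text[:max_chars]
            doc["truncated"] = True
        else:
            doc["truncated"] = False
        yield json.dumps(doc)


# Hypothetical sample stream: the first blob exceeds the limit, the second does not.
sample = [
    '{"id": 1, "text": "abcdef"}',
    '{"id": 2, "text": "abc"}',
]
results = [json.loads(out) for out in truncate_text(sample, max_chars=4)]
for doc in results:
    print(doc)
```

Writing the transformation as a generator keeps memory use flat regardless of stream size, which fits the stream-processing design the utilities share.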
pip install mwstreaming