mwstreaming

A collection of scripts and utilities to support the stream-processing of MediaWiki data.


License
MIT
Install
pip install mwstreaming==0.5.5

Documentation

MediaWiki Streaming

A set of utilities for stream-processing MediaWiki data.

Usage

mwstream (-h | --help)

mwstream <utility> [-h|--help]

Data processing utilities

diffs2persistence
Generates token persistence statistics using revision JSON blobs with diff information.
dump2json
Converts an XML dump to a stream of revision JSON blobs
dump2diffs
Computes diffs directly from an XML dump
fetch_missing_diffs
Scans diff documents looking for missing diffs and fills them in.
json2diffs
Computes and adds a "diff" field to a stream of revision JSON blobs
mend_diffs
Mends diffs that were computed in chunks and out of order.
persistence2stats
Aggregates a token persistence statistics to revision statistics
wikihadoop2json
Converts a Wikihadoop-processed stream of XML pages to JSON blobs

General utilities

json2tsv
Converts a stream of JSON blobs to tab-separated values based a set of fieldnames.
normalize
Normalizes old versions of RevisionDocument json schemas to correspond to the most recent schema version.
validate
Validates JSON against a provided schema.
truncate_text
Truncates the 'text' field of JSON blobs to a limited length in unicode characters. (addresses content dump vandalism issues) and adds a boolean 'truncated' field.

Installation

pip install mwstreaming