rubyslippers

Simple Wikipedia plain text extractor with article link annotations (and stuff)


Keywords
ner, nlp, wikipedia
License
GPL-3.0
Install
pip install rubyslippers==0.0.1

Documentation

A Wikipedia Plain Text Extractor with Link Annotations (and stuff)

This is a port of @jodaiber's Annotated-WikiExtractor, which is built upon Wikipedia Extractor by Medialab.

Usage


$ git clone https://github.com/alvations/rubyslippers.git
$ cd rubyslippers

# This will take a while...
$ wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

$ mkdir extracted-new
$ bzip2 -dc enwiki-latest-pages-articles.xml.bz2 | python3 extract.py extracted-new/
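
Because the full dump is large, it can help to peek at a few pages before running the whole pipeline. The sketch below is not part of rubyslippers; it is a minimal illustration using Python's standard `bz2` module. The function name `iter_pages` is hypothetical, and it assumes that each `<page>` and `</page>` tag sits on its own line, as is typical of Wikimedia XML dumps.

```python
import bz2


def iter_pages(dump_path):
    """Yield the raw XML text of each <page>...</page> block in a .bz2 dump.

    Hypothetical helper, not part of rubyslippers. Assumes every <page>
    and </page> tag appears on its own line.
    """
    page_lines = []
    inside_page = False
    # bz2.open in text mode decompresses lazily, so the whole dump is
    # never held in memory at once.
    with bz2.open(dump_path, "rt", encoding="utf-8") as dump:
        for line in dump:
            if "<page>" in line:
                inside_page = True
                page_lines = []
            if inside_page:
                page_lines.append(line)
            if inside_page and "</page>" in line:
                inside_page = False
                yield "".join(page_lines)
```

For example, `next(iter_pages("enwiki-latest-pages-articles.xml.bz2"))` would return the first page block without decompressing the rest of the file.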