streamcorpus_pipeline

Tools for building streamcorpus objects, such as those used in TREC.


Licenses
MIT/NTP/MIT
Install
pip install streamcorpus_pipeline==0.3.38.dev1

Documentation

StreamCorpus Pipeline

streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.

The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that generate clean_html and clean_visible text, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and wrappers for taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
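
To make the idea of chained transforms concrete, here is a minimal, self-contained sketch in plain Python. It does not use the streamcorpus_pipeline API: the dict-based item, the function names (clean_visible, hyperlink_labels, run_pipeline), and the regular expressions are all assumptions made for illustration. The sketch assumes that the visible text is produced by blanking out tags with spaces, so character offsets stay aligned with the HTML and a label found in the HTML can be located in the visible text.

```python
import re

# Illustrative stand-ins only; none of these names come from streamcorpus_pipeline.
TAG_RE = re.compile(r"<[^>]*>")
LINK_RE = re.compile(r'<a\s[^>]*href="([^"]*)"[^>]*>(.*?)</a>',
                     re.IGNORECASE | re.DOTALL)


def clean_visible(item):
    """Replace every tag with spaces of equal length, preserving offsets."""
    item["clean_visible"] = TAG_RE.sub(
        lambda m: " " * len(m.group(0)), item["clean_html"])
    return item


def hyperlink_labels(item, domain="wikipedia.org"):
    """Record a label for each link whose href mentions the given domain."""
    item["labels"] = [
        {
            "target": m.group(1),                    # the href value
            "anchor": TAG_RE.sub("", m.group(2)),    # anchor text without tags
            "offset": m.start(2),                    # start of the anchor text
        }
        for m in LINK_RE.finditer(item["clean_html"])
        if domain in m.group(1)
    ]
    return item


def run_pipeline(item, transforms):
    """Apply each transform in order, passing the item along."""
    for transform in transforms:
        item = transform(item)
    return item


# A dict stands in for a StreamItem here (an assumption for this sketch).
item = {"clean_html":
        '<p>See <a href="http://en.wikipedia.org/wiki/TREC">TREC</a>.</p>'}
result = run_pipeline(item, [clean_visible, hyperlink_labels])
```

Because tags are replaced rather than deleted, the offset stored with each label indexes the same characters in both the HTML and the visible text.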

Read more at streamcorpus.org