NwalaTextUtils

Collection of text processing Python functions.

Installation Options

Installation after installing boilerpipe dependency

$ pip install NwalaTextUtils

OR

$ git clone https://github.com/oduwsdl/NwalaTextUtils.git
$ cd NwalaTextUtils/; pip install .; cd ..; rm -rf NwalaTextUtils;

Installation within Docker container

$ docker run -it --rm --name NwalaTextUtils -v "$PWD":/usr/src/myapp -w /usr/src/myapp python:3.7-stretch bash
$ pip install NwalaTextUtils

Function Documentation and Usage Examples

Dereference URI with `derefURI(uri, sleepSec=0, timeout=10, sizeRestrict=4000000, headers={})`:

Returns the HTML text from uri. Set sleepSec (sleep seconds) > 0 to throttle the request by sleeping before it is sent.

(int) sleepSec: Default = 0. The number of seconds to sleep before the request.
(int) timeout: Default = 10. Passed as the timeout argument of requests.get.
(int) sizeRestrict: Default = 4,000,000 (~4 MB). Maximum size of the HTML payload. If Content-Length exceeds this size, the content is discarded.
(dict) headers: Default = {}. User-supplied HTTP request headers. If headers is left at its default (empty), getCustomHeaderDict() is called to fill it with sensible defaults.
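
The sizeRestrict rule can be pictured with a small standard-library sketch. The helper name exceedsSizeRestrict is hypothetical and not part of NwalaTextUtils; it only illustrates comparing a Content-Length header against the limit, and the library's own check may differ.

```python
def exceedsSizeRestrict(responseHeaders, sizeRestrict=4000000):
    """Hypothetical helper: True if the declared Content-Length is larger than sizeRestrict."""
    # HTTP header names are case-insensitive, so normalize the keys first
    normalized = {key.lower(): value for key, value in responseHeaders.items()}
    contentLength = normalized.get('content-length')

    if contentLength is None:
        # no declared size, so there is nothing to compare against
        return False

    try:
        return int(contentLength) > sizeRestrict
    except ValueError:
        # malformed header value: treat the size as unknown
        return False


print(exceedsSizeRestrict({'Content-Length': '5000000'}))  # True
print(exceedsSizeRestrict({'Content-Length': '1200'}))     # False
```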

Remove boilerplate from HTML with `cleanHtml(html, method='boilerpy3')`:

Returns plaintext after removing boilerplate from html, using either the default, recommended boilerplate removal method (python-boilerpipe) or NLTK's regex method.

Extract HTML Page title with `getPgTitleFrmHTML(html)`:

Returns text from within HTML title tag.
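
To make "text from within the HTML title tag" concrete, here is a minimal standard-library sketch. The names TitleParser and extractTitle are hypothetical and the code is not the library's implementation of getPgTitleFrmHTML(); it only shows the kind of result to expect for a simple page.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical parser that collects the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.inTitle = False
        self.done = False
        self.titleParts = []

    def handle_starttag(self, tag, attrs):
        # only the first <title> is of interest
        if tag == 'title' and not self.done:
            self.inTitle = True

    def handle_endtag(self, tag):
        if tag == 'title' and self.inTitle:
            self.inTitle = False
            self.done = True

    def handle_data(self, data):
        if self.inTitle:
            self.titleParts.append(data)


def extractTitle(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return ''.join(parser.titleParts).strip()


sampleHtml = '<html><head><title> Ebola: New Cases </title></head><body><p>Body text</p></body></html>'
print(extractTitle(sampleHtml))
```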

Usage example of `derefURI()`, `cleanHtml()`, and `getPgTitleFrmHTML()`:

from NwalaTextUtils.textutils import derefURI
from NwalaTextUtils.textutils import cleanHtml
from NwalaTextUtils.textutils import getPgTitleFrmHTML

uri = 'https://time.com/3505982/ebola-new-cases-world-health-organization/'

html = derefURI(uri, 0)
plaintext = cleanHtml(html)
title = getPgTitleFrmHTML(html)

print('title:\n', title.strip(), '\n')
print('html prefix:\n', html[:100].strip(), '\n')
print('plaintext prefix:\n', plaintext[:100].strip(), '\n')

Dereference and Remove Boilerplate from URIs in parallel with `parallelGetTxtFrmURIs(urisLst, updateRate=10)`:

(list) urisLst: The list of URIs to dereference and remove boilerplate from.
(int) updateRate: Default = 10. Print one log status message for every updateRate URIs.

Usage example without logs:

import json
from NwalaTextUtils.textutils import parallelGetTxtFrmURIs

uris_lst = [
    'http://www.euro.who.int/en/health-topics/emergencies/pages/news/news/2015/03/united-kingdom-is-declared-free-of-ebola-virus-disease',
    'https://time.com/3505982/ebola-new-cases-world-health-organization/',
    'https://www.scientificamerican.com/article/why-ebola-survivors-struggle-with-new-symptoms/'
]

doc_lst = parallelGetTxtFrmURIs(uris_lst)
with open('doc_lst.json', 'w') as outfile:
    json.dump(doc_lst, outfile)

Sample output of parallelGetTxtFrmURIs() (one entry of the returned list):

{
	'text': 'WHO commends the United Kingdom of Great Britain and Northern...',
	'title': 'United Kingdom is declared free of Ebola virus disease',
	'uri': 'http://www.euro.who.int/en/health-topics/emergencies/pages/news/news/2015/03/united-kingdom-is-declared-free-of-ebola-virus-disease'
}
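
Since each entry carries 'text', 'title', and 'uri' keys, the result can be post-processed with ordinary list operations. The sketch below uses a hand-written doc_lst rather than a live call, and the second entry (with empty text) is an assumption made only to show filtering; the library may represent failed dereferences differently.

```python
# hand-written stand-in for the list returned by parallelGetTxtFrmURIs()
doc_lst = [
    {
        'text': 'WHO commends the United Kingdom of Great Britain and Northern...',
        'title': 'United Kingdom is declared free of Ebola virus disease',
        'uri': 'http://www.euro.who.int/en/health-topics/emergencies/pages/news/news/2015/03/united-kingdom-is-declared-free-of-ebola-virus-disease'
    },
    {
        # assumed shape of an entry whose text could not be extracted
        'text': '',
        'title': '',
        'uri': 'http://example.com/unreachable'
    }
]

# keep only documents that actually yielded text
non_empty = [doc for doc in doc_lst if doc['text'].strip()]

for doc in non_empty:
    print(doc['title'], '(' + str(len(doc['text'])) + ' characters)')
```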

Usage example with logs:

import json
import logging
from NwalaTextUtils.textutils import parallelGetTxtFrmURIs

uris_lst = [
	'http://www.euro.who.int/en/health-topics/emergencies/pages/news/news/2015/03/united-kingdom-is-declared-free-of-ebola-virus-disease',
	'https://time.com/3505982/ebola-new-cases-world-health-organization/',
	'https://www.scientificamerican.com/article/why-ebola-survivors-struggle-with-new-symptoms/',
	'https://en.wikipedia.org/wiki/Ebola_virus',
	'http://www.realclearscience.com/journal_club/2014/04/21/a_possible_cure_for_ebola_virus_infection_108610.html',
	'http://www.nbcnews.com/storyline/ebola-virus-outbreak/who-declares-nigeria-ebola-free-after-42-days-no-cases-n229536',
	'http://www.independent.co.uk/news/world/africa/ebola-virus-top-sierra-leone-doctor-shek-umar-dies-of-disease-9636406.html',
	'http://www.nbcnews.com/storyline/ebola-virus-outbreak/exclusive-first-ebola-vaccine-trial-starts-africa-n222266',
	'http://www.theglobeandmail.com/news/national/canadian-researchers-thwart-ebola-virus/article4258104/',
	'http://www.who.int/mediacentre/factsheets/fs103/en/',
	'http://www.cnn.com/2014/08/07/world/ebola-virus-q-and-a/index.html',
	'http://www.healthline.com/health/ebola-hemorrhagic-fever',
	'https://www.nytimes.com/interactive/2014/07/31/world/africa/ebola-virus-outbreak-qa.html',
	'http://www.vanityfair.com/news/2014/10/ebola-virus-epidemic-containment'
]

logging.basicConfig(format='%(asctime)s [%(levelname)s] %(name)s: %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)
logger.info("starting script")

doc_lst = parallelGetTxtFrmURIs(uris_lst, updateRate=2)

logger.info("done with script")
with open('doc_lst.json', 'w') as outfile:
    json.dump(doc_lst, outfile)

Dereference and remove Boilerplate from files in parallel with `parallelGetTxtFrmFiles(folder, rmHtml=False)`:

This function is similar to parallelGetTxtFrmURIs(), but instead of dereferencing a list of URIs and removing their boilerplate, parallelGetTxtFrmFiles() processes a folder containing HTML or plaintext files. Since rmHtml = False by default, the function simply reads and returns the plaintext files. If rmHtml = True, parallelGetTxtFrmFiles() removes boilerplate (via cleanHtml()) from the HTML files. In summary: if the folder contains HTML files, set rmHtml = True; if it contains plaintext files, set rmHtml = False.

Parallelize function with `parallelTask(jobsLst, threadCount=5)`:

Given a list of jobs and data specified by jobsLst, this function executes the jobs in parallel using threadCount threads. For example, parallelGetTxtFrmURIs() uses parallelTask() to parallelize dereferencing URIs (derefURI()). Here's a snippet from parallelGetTxtFrmURIs() with inline explanations.

docsLst = []
size = len(urisLst)

#<BLOCKS OF CODE NOT PERTINENT TO THE EXPLANATION OF parallelTask() HAVE BEEN DELETED FOR BREVITY>

#list containing function to be parallelized and arguments to be passed to function
jobsLst = []

for i in range(size):

	printMsg = ''

	if( i % 10 == 0 ):
		printMsg = 'dereferencing uri i: ' + str(i) + ' of ' + str(size)

	#keywords is a dictionary specifying arguments to be passed to derefURI()
	#the keys of keywords (uri & sleepSec) match the parameter signature of derefURI(uri, sleepSec)
	keywords = {
		'uri': urisLst[i],
		'sleepSec': 0
	}

	#jobsLst contains pool of data to be processed in parallel by func (derefURI())
	jobsLst.append({
		'func': derefURI, # function to be parallelized
		'args': keywords, # arguments to pass to function
		'misc': False,    # data to send back after processing this keywords input
		'print': printMsg # optional message to print when processing this request, set blank if print not required
	})

#Function call to start parallel processing, resLst contains data 
#returned by func (derefURI) after processing each argument, len(resLst) = len(jobsLst)
resLst = parallelTask(jobsLst)

for res in resLst:
	
	res['input'] 	# input data sent to func (derefURI())
	res['output']	# output (HTML text) returned by func (derefURI()) after processing input, None if func does not return
	res['misc']  	# data echoed back to the user

	docsLst.append({
		'text': cleanHtml( res['output'] ),
		'title': getPgTitleFrmHTML( res['output'] ),
		'uri': res['input']['args']['uri']
	})

return docsLst
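
The job and result dictionaries above can be tried without network access using a stand-in. In the sketch below, parallelTaskSketch and fakeDeref are hypothetical names: the first mimics the jobsLst/resLst contract described above with concurrent.futures, and the second replaces derefURI() with canned HTML. It is not the library's implementation, and whether parallelTask() preserves input order is an assumption here.

```python
from concurrent.futures import ThreadPoolExecutor


def parallelTaskSketch(jobsLst, threadCount=5):
    """Hypothetical stand-in for parallelTask(); not the library's code."""

    def runJob(job):
        # optional progress message, skipped when blank
        if job.get('print'):
            print(job['print'])

        return {
            'input': job,                          # the job dictionary as submitted
            'output': job['func'](**job['args']),  # return value of func
            'misc': job.get('misc')                # data echoed back unchanged
        }

    with ThreadPoolExecutor(max_workers=threadCount) as executor:
        # executor.map() yields results in the same order as jobsLst
        return list(executor.map(runJob, jobsLst))


def fakeDeref(uri, sleepSec=0):
    # stand-in for derefURI(): returns canned HTML instead of making a request
    return '<html><title>' + uri + '</title></html>'


jobsLst = []
for i, uri in enumerate(['http://example.com/a', 'http://example.com/b']):
    jobsLst.append({
        'func': fakeDeref,
        'args': {'uri': uri, 'sleepSec': 0},
        'misc': i,
        'print': ''
    })

resLst = parallelTaskSketch(jobsLst)

for res in resLst:
    print(res['misc'], res['input']['args']['uri'], len(res['output']))
```

Here 'misc' carries the job's index, which shows how echo-back data can be used to match each result to the request that produced it.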

NwalaTextUtils
Release 0.0.5