
Collection of functions for processing text

pip install NwalaTextUtils==0.0.3



Installation Options

Installation after installing boilerpipe dependency

$ pip install NwalaTextUtils

Alternatively, install from source:

$ git clone https://github.com/oduwsdl/NwalaTextUtils.git
$ cd NwalaTextUtils/; pip install .; cd ..; rm -rf NwalaTextUtils;

Installation within Docker container

$ docker run -it --rm --name NwalaTextUtils -v "$PWD":/usr/src/myapp -w /usr/src/myapp python:3.7-stretch bash
$ pip install NwalaTextUtils

Function Documentation and Usage Examples

Dereference URI with derefURI(uri, sleepSec=0, timeout=10, sizeRestrict=4000000, headers={}):

Returns the HTML text retrieved from uri. Set sleepSec (sleep seconds) > 0 to throttle requests by sleeping before each one.

  • (int) sleepSec: Default = 0. The number of seconds to sleep before the request.

  • (int) timeout: Default = 10. Argument passed to timeout of requests.get

  • (int) sizeRestrict: Default = 4,000,000 (~4 MB). Maximum size of the HTML payload. If Content-Length exceeds this size, the content is discarded.

  • (dict) headers: Default = {}. User-supplied HTTP request headers. If headers is left at its default (empty), getCustomHeaderDict() is called to fill it with sensible defaults.
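
The sizeRestrict rule above can be pictured with a small sketch. This is not the library's code: within_size_limit is a hypothetical helper, and treating a missing or malformed Content-Length as acceptable is an assumption made only for illustration.

```python
def within_size_limit(response_headers, size_restrict=4000000):
    """Return False only when Content-Length is present and exceeds size_restrict.

    Hypothetical helper for illustration; not part of NwalaTextUtils.
    """
    raw = response_headers.get('Content-Length')
    if raw is None:
        # Assumption: with no declared size there is no basis for discarding
        return True
    try:
        return int(raw) <= size_restrict
    except ValueError:
        # Assumption: a malformed header is treated like a missing one
        return True


print(within_size_limit({'Content-Length': '1024'}))
print(within_size_limit({'Content-Length': '5000000'}))
```

The comparison uses "exceeds" literally, so a payload of exactly sizeRestrict bytes is kept.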

Remove boilerplate from HTML with cleanHtml(html, method='boilerpy3'):

Returns plaintext after removing HTML boilerplate from html using either the default recommended boilerplate removal method, python-boilerpipe or NLTK's regex method.

Extract HTML Page title with getPgTitleFrmHTML(html):

Returns text from within HTML title tag.
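
To show what "text from within the HTML title tag" means in practice, here is a minimal sketch built on the standard library's html.parser. It is not how getPgTitleFrmHTML() is implemented; TitleParser and extract_title are hypothetical names, and returning an empty string when no title exists is an assumption.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is captured
        if tag == 'title' and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return ''.join(parser.parts).strip()


page = '<html><head><title> Ebola update </title></head><body><p>x</p></body></html>'
print(extract_title(page))
```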

Usage example of derefURI(), cleanHtml(), and getPgTitleFrmHTML():
from NwalaTextUtils.textutils import derefURI
from NwalaTextUtils.textutils import cleanHtml
from NwalaTextUtils.textutils import getPgTitleFrmHTML

uri = 'https://time.com/3505982/ebola-new-cases-world-health-organization/'

html = derefURI(uri, 0)
plaintext = cleanHtml(html)
title = getPgTitleFrmHTML(html)

print('title:\n', title.strip(), '\n')
print('html prefix:\n', html[:100].strip(), '\n')
print('plaintext prefix:\n', plaintext[:100].strip(), '\n')

Dereference and remove boilerplate from URIs in parallel with parallelGetTxtFrmURIs(urisLst, updateRate=10):

  • (list) urisLst: The list of URIs to dereference and remove boilerplate from.

  • (int) updateRate: Default = 10. Print one log status message for every updateRate URIs processed.
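
The effect of updateRate can be illustrated with a short sketch modeled on the i % 10 == 0 check in the parallelGetTxtFrmURIs() snippet further down. should_log is a hypothetical name, and the sketch assumes updateRate is a positive integer.

```python
def should_log(i, update_rate=10):
    """Return True for the job indices that would carry a status message.

    Hypothetical helper for illustration; assumes update_rate > 0.
    """
    return i % update_rate == 0


# With 25 URIs and the default rate, only three jobs print a message
print([i for i in range(25) if should_log(i)])  # [0, 10, 20]
```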

Usage example without logs:

import json
from NwalaTextUtils.textutils import parallelGetTxtFrmURIs

uris_lst = [
    # URIs to dereference go here
]

doc_lst = parallelGetTxtFrmURIs(uris_lst)
with open('doc_lst.json', 'w') as outfile:
    json.dump(doc_lst, outfile)

Sample output of parallelGetTxtFrmURIs():

[
	{
		'text': 'WHO commends the United Kingdom of Great Britain and Northern...',
		'title': 'United Kingdom is declared free of Ebola virus disease',
		'uri': 'http://www.euro.who.int/en/health-topics/emergencies/pages/news/news/2015/03/united-kingdom-is-declared-free-of-ebola-virus-disease'
	}
]

Usage example with logs:

import json
import logging
from NwalaTextUtils.textutils import parallelGetTxtFrmURIs

uris_lst = [
    # URIs to dereference go here
]

logging.basicConfig(format='%(asctime)s [%(levelname)s] %(name)s: %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)
logger.info("starting script")

doc_lst = parallelGetTxtFrmURIs(uris_lst, updateRate=2)

logger.info("done with script")
with open('doc_lst.json', 'w') as outfile:
    json.dump(doc_lst, outfile)

Dereference and remove boilerplate from files in parallel with parallelGetTxtFrmFiles(folder, rmHtml=False):

This function is similar to parallelGetTxtFrmURIs(), but instead of dereferencing a list of URIs and removing boilerplate from them, it processes a folder containing HTML or plaintext files. Since rmHtml = False by default, the function simply reads and returns the contents of the files. If rmHtml = True, parallelGetTxtFrmFiles() removes boilerplate (via cleanHtml()) from the HTML files. In summary: if the folder contains HTML files, set rmHtml = True; if it contains plaintext, set rmHtml = False.
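
The folder-reading behavior described above might look roughly like the sketch below. It is not the library's implementation: read_folder and strip_tags are hypothetical names, strip_tags is a crude stand-in for cleanHtml() that only drops tags, and the 'file' and 'text' keys of the returned dictionaries are assumptions, since the return format is not stated here.

```python
import os
import re
import tempfile
from concurrent.futures import ThreadPoolExecutor


def strip_tags(html):
    # Crude stand-in for cleanHtml(): removes tags, not real boilerplate
    return re.sub(r'<[^>]+>', ' ', html).strip()


def read_folder(folder, rm_html=False, thread_count=5):
    """Read every file in folder in parallel, optionally stripping HTML."""
    paths = sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    )

    def load(path):
        with open(path, encoding='utf-8') as infile:
            text = infile.read()
        return {'file': path, 'text': strip_tags(text) if rm_html else text}

    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        # pool.map returns results in the same order as paths
        return list(pool.map(load, paths))


with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, 'a.html'), 'w', encoding='utf-8') as outfile:
        outfile.write('<p>Hello</p>')
    print(read_folder(folder, rm_html=True)[0]['text'])
    print(read_folder(folder, rm_html=False)[0]['text'])
```

With rm_html=False the file contents come back untouched, which mirrors the plaintext case described above.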

Parallelize function with parallelTask(jobsLst, threadCount=5):

Given a list of jobs and data specified by jobsLst, this function executes the jobs in parallel using threadCount threads. For example, parallelGetTxtFrmURIs() uses parallelTask() to parallelize dereferencing URIs (derefURI()). Here's a snippet from parallelGetTxtFrmURIs() with inline explanations.

docsLst = []
size = len(urisLst)


#list containing function to be parallelized and arguments to be passed to function
jobsLst = []

for i in range(size):

	printMsg = ''

	if( i % 10 == 0 ):
		printMsg = 'dereferencing uri i: ' + str(i) + ' of ' + str(size)

	#keywords is a dictionary specifying arguments to be passed to derefURI()
	#the keys of keywords (uri & sleepSec) match the parameter signature of derefURI(uri, sleepSec)
	keywords = {
		'uri': urisLst[i],
		'sleepSec': 0
	}

	#jobsLst contains pool of data to be processed in parallel by func (derefURI())
	jobsLst.append({
		'func': derefURI, # function to be parallelized
		'args': keywords, # arguments to pass to function
		'misc': False,    # data to send back after processing this keywords input
		'print': printMsg # optional message to print when processing this request, set blank if print not required
	})

#Function call to start parallel processing, resLst contains data 
#returned by func (derefURI) after processing each argument, len(resLst) = len(jobsLst)
resLst = parallelTask(jobsLst)

for res in resLst:
	res['input'] 	# input data sent to func (derefURI())
	res['output']	# output (HTML text) returned by func (derefURI()) after processing input, None if func does not return
	res['misc']  	# echo-back data by user

	docsLst.append({
		'text': cleanHtml( res['output'] ),
		'title': getPgTitleFrmHTML( res['output'] ),
		'uri': res['input']['args']['uri']
	})

return docsLst
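
To make the job and result dictionaries in the snippet above concrete, here is a minimal sketch of a function with the same input and output shape. It is not the source of parallelTask(): run_jobs and shout are hypothetical names, and concurrent.futures.ThreadPoolExecutor is swapped in as an assumption about how the threads could be managed.

```python
from concurrent.futures import ThreadPoolExecutor


def run_jobs(jobs_lst, thread_count=5):
    """Run each job's func(**args) on a thread pool (illustrative only)."""

    def run_one(job):
        # Print the optional progress message when it is non-empty
        if job.get('print'):
            print(job['print'])
        return {
            'input': job,                          # the job as submitted
            'output': job['func'](**job['args']),  # whatever func returns
            'misc': job.get('misc'),               # echoed back unchanged
        }

    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        # pool.map keeps results in the same order as jobs_lst
        return list(pool.map(run_one, jobs_lst))


def shout(text, suffix='!'):
    # Stand-in for a slow function such as derefURI()
    return text.upper() + suffix


jobs = [
    {'func': shout, 'args': {'text': word}, 'misc': i, 'print': ''}
    for i, word in enumerate(['a', 'b', 'c'])
]
results = run_jobs(jobs)
print([res['output'] for res in results])
```

Because the keys of args are passed as keyword arguments, they must match the parameter names of func, just as uri and sleepSec match derefURI() in the snippet.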