NwalaTextUtils
Collection of text processing Python functions.
Installation Options
Installation after installing boilerpipe dependency
$ pip install NwalaTextUtils
OR
$ git clone https://github.com/oduwsdl/NwalaTextUtils.git
$ cd NwalaTextUtils/; pip install .; cd ..; rm -rf NwalaTextUtils;
Installation within Docker container
$ docker run -it --rm --name NwalaTextUtils -v "$PWD":/usr/src/myapp -w /usr/src/myapp python:3.7-stretch bash
$ pip install NwalaTextUtils
Function Documentation and Usage Examples
derefURI(uri, sleepSec=0, timeout=10, sizeRestrict=4000000, headers={})
:
Dereference URI with Returns HTML text from uri
. Set sleepSec
(sleep seconds) > 0 to throttle (sleep) request.
-
(int)
sleepSec
: Default = 0. The number of seconds to sleep before the request. -
(int)
timeout
: Default = 10. Argument passed to timeout of requests.get -
(int)
sizeRestrict
: Default = 4,000,000 (~4 MB). Maximum size of HTML payload. If Content-Length exceeds this size, content would be discarded. -
(dict)
headers
: Default = {}. User-supplied HTTP Request headers. If default is not specified, then getCustomHeaderDict() is called to fill this value with sensible defaults.
cleanHtml(html, method='boilerpy3')
:
Remove boilerplate from HTML with Returns plaintext after removing HTML boilerplate from html
using either the default recommended boilerplate removal method, python-boilerpipe
or NLTK's regex method.
getPgTitleFrmHTML(html)
:
Extract HTML Page title with Returns text from within HTML title tag.
derefURI()
, cleanHtml()
, and getPgTitleFrmHTML()
:
Usage example of from NwalaTextUtils.textutils import derefURI
from NwalaTextUtils.textutils import cleanHtml
from NwalaTextUtils.textutils import getPgTitleFrmHTML
uri = 'https://time.com/3505982/ebola-new-cases-world-health-organization/'
html = derefURI(uri, 0)
plaintext = cleanHtml(html)
title = getPgTitleFrmHTML(html)
print('title:\n', title.strip(), '\n')
print('html prefix:\n', html[:100].strip(), '\n')
print('plaintext prefix:\n', plaintext[:100].strip(), '\n')
parallelGetTxtFrmURIs(urisLst, updateRate=10)
:
Dereference and Remove Boilerplate from URIs in parallel with -
(list)
urisLst
: The list of URIs to dereference and remove boilerplate from. -
(int)
updateRate
: Default = 10. Print 1 message perupdateRate
log status updates.
Usage example without logs:
import json
from NwalaTextUtils.textutils import parallelGetTxtFrmURIs
uris_lst = [
'http://www.euro.who.int/en/health-topics/emergencies/pages/news/news/2015/03/united-kingdom-is-declared-free-of-ebola-virus-disease',
'https://time.com/3505982/ebola-new-cases-world-health-organization/',
'https://www.scientificamerican.com/article/why-ebola-survivors-struggle-with-new-symptoms/'
]
doc_lst = parallelGetTxtFrmURIs(uris_lst)
with open('doc_lst.json', 'w') as outfile:
json.dump(doc_lst, outfile)
Sample output of parallelGetTxtFrmURIs()
:
{
'text': 'WHO commends the United Kingdom of Great Britain and Northern...',
'title': 'United Kingdom is declared free of Ebola virus disease',
'uri': 'http://www.euro.who.int/en/health-topics/emergencies/pages/news/news/2015/03/united-kingdom-is-declared-free-of-ebola-virus-disease'
}
Usage example with logs:
import json
import logging
from NwalaTextUtils.textutils import parallelGetTxtFrmURIs
uris_lst = [
'http://www.euro.who.int/en/health-topics/emergencies/pages/news/news/2015/03/united-kingdom-is-declared-free-of-ebola-virus-disease',
'https://time.com/3505982/ebola-new-cases-world-health-organization/',
'https://www.scientificamerican.com/article/why-ebola-survivors-struggle-with-new-symptoms/',
'https://en.wikipedia.org/wiki/Ebola_virus',
'http://www.realclearscience.com/journal_club/2014/04/21/a_possible_cure_for_ebola_virus_infection_108610.html',
'http://www.nbcnews.com/storyline/ebola-virus-outbreak/who-declares-nigeria-ebola-free-after-42-days-no-cases-n229536',
'http://www.independent.co.uk/news/world/africa/ebola-virus-top-sierra-leone-doctor-shek-umar-dies-of-disease-9636406.html',
'http://www.nbcnews.com/storyline/ebola-virus-outbreak/exclusive-first-ebola-vaccine-trial-starts-africa-n222266',
'http://www.theglobeandmail.com/news/national/canadian-researchers-thwart-ebola-virus/article4258104/',
'http://www.who.int/mediacentre/factsheets/fs103/en/',
'http://www.cnn.com/2014/08/07/world/ebola-virus-q-and-a/index.html',
'http://www.healthline.com/health/ebola-hemorrhagic-fever',
'https://www.nytimes.com/interactive/2014/07/31/world/africa/ebola-virus-outbreak-qa.html',
'http://www.vanityfair.com/news/2014/10/ebola-virus-epidemic-containment'
]
logging.basicConfig(format='%(asctime)s [%(levelname)s] %(name)s: %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)
logger.info("starting script")
doc_lst = parallelGetTxtFrmURIs(uris_lst, updateRate=2)
logger.info("done with script")
with open('doc_lst.json', 'w') as outfile:
json.dump(doc_lst, outfile)
parallelGetTxtFrmFiles(folder, rmHtml=False)
:
Dereference and remove Boilerplate from files in parallel with This function is similar to parallelGetTxtFrmURIs()
, but instead of dereferencing and removing boilerplate from a list of URIs like parallelGetTxtFrmURIs()
does, parallelGetTxtFrmFiles()
processes a folder
containing HTML or plaintext files. Since rmHtml = False
by default, the function simple reads and returns plaintext files. If rmHtml = True
, parallelGetTxtFrmFiles()
removes boilerplate (via cleanHtml()
) from the HTML files. In summary, if the folder
contains HTML files, set rmHtml = True
, if folder
contains plaintext, set rmHtml = False
.
parallelTask(jobsLst, threadCount=5)
:
Parallelize function with Given a list of jobs and data specified by jobsLst
, this function executes jobs in parallel using threadCount
threads. For example parallelGetTxtFrmURIs()
used parallelTask()
to parallelize dereferencing URIs (derefURI()
). Here's a snippet from parallelGetTxtFrmURIs()
with associated inline explanation.
docsLst = []
size = len(urisLst)
#<BLOCKS OF CODE NOT PERTINENT TO THE EXPLANATION OF parallelTask() HAVE BEEN DELETED FOR BREVITY>
#list containing function to be parallelized and arguments to be passed to function
jobsLst = []
for i in range(size):
printMsg = ''
if( i % 10 == 0 ):
printMsg = 'dereferencing uri i: ' + str(i) + ' of ' + str(size)
#keywords is a dictionary specifying arguments to be passed to derefURI()
#the keys of keywords (uri & sleepSec) match the parameter signature of derefURI(uri, sleepSec)
keywords = {
'uri': urisLst[i],
'sleepSec': 0
}
#jobsLst contains pool of data to be processed in parallel by func (derefURI())
jobsLst.append({
'func': derefURI, # function to be parallelized
'args': keywords, # arguments to pass to function
'misc': False, # data to send back after processing this keywords input
'print': printMsg # optional message to print when processing this request, set blank if print not required
})
#Function call to start parallel processing, resLst contains data
#returned by func (derefURI) after processing each argument, len(resLst) = len(jobsLst)
resLst = parallelTask(jobsLst)
for res in resLst:
res['input'] # input data send to func (derefURI())
res['output'] # output (HTML text) returned by func (derefURI()) after processing input, None if func does not return
res['misc'] # echo-back data by user
docsLst.append({
'text': cleanHTML( res['output'] ),
'title': getPgTitleFrmHTML( res['output'] ),
'uri': res['input']['args']['uri']
})
return docsLst