zim-newspaper
A library for fetching newspaper articles and news from Zimbabwe's (Zim) leading news providers.
Example
from newspaperzw.Provider import Providers
from newspaperzw.news import News
# add your favourite news source so as to use its name only when getting news data
p = Providers(provider={'techzim': 'https://www.techzim.co.zw/'})
# get all preset news sources and their URLs
# (named `all_providers` to avoid shadowing the built-in `all`)
all_providers = Providers().getAll()
try:
    # get data from news site by name, default = `herald`
    api = News(provider='techzim')
    news_data = api.paper()
    # returns a dict with all news data; best viewed with `prettyprinter`
    print(news_data)
except Exception as e:
    print("There was a problem: ", e)
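The comment above suggests pretty-printing the returned dict. A minimal sketch of that idea, using the standard library's `pprint` module in place of the third-party `prettyprinter` package. The `sample` dict and its keys are hypothetical stand-ins, not the library's actual output shape:

```python
from pprint import pformat

# Hypothetical stand-in for the dict returned by `api.paper()`;
# the real keys and nesting may differ.
sample = {
    "provider": "techzim",
    "articles": [
        {"title": "Example headline", "url": "https://www.techzim.co.zw/example"},
    ],
}

# pformat returns the formatted string; width controls where lines wrap.
formatted = pformat(sample, indent=2, width=60)
print(formatted)
```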
Get summary of article
- To get a summary through the NLP function, nltk is needed as a dependency. This returns the same results as above, but with a summary attribute that contains a summary of the article.
from newspaperzw.Provider import Providers
from newspaperzw.news import News
try:
    # get data from news site by name, default = `herald`
    api = News(provider='techzim', summary=True)
    news_data = api.paper()
    # returns a dict with all news data; best viewed with `prettyprinter`
    print(news_data)
except Exception as e:
    print("There was a problem: ", e)
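Since the summary feature depends on nltk, it may help to check that nltk is installed before asking for summaries. A minimal sketch, assuming only the standard library; `has_module` and `use_summary` are hypothetical names introduced for illustration and are not part of newspaperzw:

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(name) is not None


# Only request summaries when nltk is actually installed.
use_summary = has_module("nltk")
print("summary enabled:", use_summary)
```

The resulting flag could then be passed along as `News(provider='techzim', summary=use_summary)`.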
NEW!
Added a flag to enable or disable the cache.
# __version__1.1.0
from newspaperzw.Provider import Providers
from newspaperzw.news import News
'''
Get a summary of each article through the `summary=True` flag.
Bypass the cache by disabling it through the `cache=False` flag.
If cache is True, news downloaded on previous runs will not be returned again.
'''
try:
    # get data from news site by name, default = `herald`
    api = News(provider='techzim', summary=True, cache=False)
    news_data = api.paper()
    # returns a dict with all news data; best viewed with `prettyprinter`
    print(news_data)
except Exception as e:
    print("There was a problem: ", e)
The result includes summary and keywords attributes.
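The cache behaviour described above (with the cache on, items downloaded on earlier runs are not returned again) can be illustrated with a small standalone sketch. This is not the library's implementation; `filter_unseen`, the JSON cache file, and the example URLs are all assumptions made for illustration:

```python
import json
import tempfile
from pathlib import Path


def filter_unseen(urls, cache_file, cache=True):
    """Return urls not seen on earlier runs, recording them when cache is on."""
    if not cache:
        # Cache disabled: return everything and remember nothing.
        return list(urls)
    cache_file = Path(cache_file)
    seen = set(json.loads(cache_file.read_text())) if cache_file.exists() else set()
    fresh = [u for u in urls if u not in seen]
    # Persist the union so the next run skips what was returned now.
    cache_file.write_text(json.dumps(sorted(seen | set(fresh))))
    return fresh


with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "seen.json"
    urls = ["https://example.com/a", "https://example.com/b"]
    first_run = filter_unseen(urls, cache_path)               # both are new
    second_run = filter_unseen(urls, cache_path)              # nothing new
    uncached = filter_unseen(urls, cache_path, cache=False)   # cache bypassed
    print(first_run, second_run, uncached)
```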
TODO
- The library scrapes all available data that it encounters; this needs to be narrowed down by date/month/year
- Start from today and go back to any news obtained from last year
- Improve speed
- Disable logging
- Exception handling
- Date published
- Log files need to be deleted in case they occupy significant space on disk
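The date-narrowing items in the list above could be approached along these lines. A minimal sketch, assuming each article is a dict carrying a hypothetical `published` key that holds a `datetime.date`; the library does not currently expose such a field (see the "date published" item), so `within_window` and the sample data are illustrative only:

```python
from datetime import date, timedelta


def within_window(articles, start, end):
    """Keep articles whose 'published' date falls in [start, end], inclusive."""
    return [a for a in articles if start <= a["published"] <= end]


today = date(2024, 6, 15)  # fixed date so the example is reproducible
articles = [
    {"title": "recent", "published": today - timedelta(days=3)},
    {"title": "last year", "published": today - timedelta(days=300)},
    {"title": "too old", "published": today - timedelta(days=800)},
]

# Window: from one year back up to today.
one_year_ago = today - timedelta(days=365)
kept = within_window(articles, one_year_ago, today)
print([a["title"] for a in kept])
```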