NewsHarvest

Package for collecting headlines, article content, and corresponding data from Associated Press, Reuters, and Bloomberg


License
Other
Install
pip install NewsHarvest==1.2.0

Documentation

Collect headlines and corresponding data with this code:

from NewsHarvest.NewsHarvest import *

ap_data = AssociatedPress().get_data()
reuters_data = Reuters().get_data()
bloomberg_data = Bloomberg().get_data()

The get_data() method accepts these arguments:

  • get_content=True - Whether to scrape all visible text from each headline's URL
  • sleep=True - Whether to pause for one second between headline scrapes
  • json_format=False - When set to True, the data is output in JSON format
  • include_headings=False - When set to True, the column headings are included in the output. This is especially helpful when writing the output to a CSV file
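
To illustrate what the sleep option describes, here is a minimal sketch of a throttled scraping loop. It is not the package's own implementation: scrape_all, fetch, and delay are hypothetical names introduced for this example, and a stand-in fetch function is used so that nothing touches the network.

```python
import time


def scrape_all(urls, fetch, sleep=True, delay=1.0):
    """Call fetch(url) for each URL, pausing between requests when sleep is True."""
    pages = []
    for index, url in enumerate(urls):
        if sleep and index > 0:
            # Pause between consecutive requests, but not before the first one.
            time.sleep(delay)
        pages.append(fetch(url))
    return pages


# Stand-in for a real page download (hypothetical); delay is 0 to keep the demo instant.
demo_pages = scrape_all(
    ["https://example.com/a", "https://example.com/b"],
    fetch=lambda url: "page for " + url,
    delay=0.0,
)
```

Leaving the pause enabled is the politer choice when scraping many headlines in a row; turning it off trades that for speed.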

All news sources return these data points:

  • Source
  • Headline
  • URL
  • Excerpt
  • Location (location is blank for Bloomberg)
  • Publish_time
  • Date
  • Page_content (unless get_content is set to False)
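
If each news item is a plain list whose values follow the order of the data points above (an assumption; the package may use a different layout), it can be turned into a dictionary for easier access. FIELDS and row_to_record are hypothetical names used only in this sketch.

```python
# Field names in the order listed above (assumed to match the row layout).
FIELDS = [
    "Source", "Headline", "URL", "Excerpt",
    "Location", "Publish_time", "Date", "Page_content",
]


def row_to_record(row):
    """Pair each value with its field name.

    zip stops at the shorter sequence, so a row scraped with
    get_content=False simply has no Page_content key.
    """
    return dict(zip(FIELDS, row))


# Made-up sample row without page content.
sample_row = [
    "Bloomberg", "Example headline", "https://example.com/story",
    "Short excerpt", "", "09:30", "2020-01-01",
]
record = row_to_record(sample_row)
```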

Feedback is welcome at: mdgithub@gmail.com

Associated Press

Sources within Associated Press: headlines (or home), business, US, world, sports, entertainment, health, science, politics

Example:

from NewsHarvest.NewsHarvest import *

ap_sources = ['headlines', 'business', 'us', 'world']
output = []
for source in ap_sources:
    source_data = AssociatedPress(source=source).get_data()
    for news_item in source_data:
        output.append(news_item)

write_to_csv(output, 'associatedpress.csv')
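
When several sections are combined like this, the same story may show up more than once (for example under both 'headlines' and 'us'). The sketch below drops repeats by URL before writing. It assumes each news item is a list with the URL at index 2, following the order of the data points listed earlier; dedupe_by_url is a hypothetical helper, not part of NewsHarvest.

```python
def dedupe_by_url(items, url_index=2):
    """Keep only the first item seen for each URL, preserving order."""
    seen = set()
    unique = []
    for item in items:
        url = item[url_index]
        if url not in seen:
            seen.add(url)
            unique.append(item)
    return unique


# Made-up rows standing in for combined get_data() output.
combined = [
    ["Associated Press", "Story one", "https://example.com/1"],
    ["Associated Press", "Story two", "https://example.com/2"],
    ["Associated Press", "Story one", "https://example.com/1"],
]
unique_items = dedupe_by_url(combined)
```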

Reuters

Sources within Reuters: businessNews, wealth, bankruptcyNews, bondsNews, deals, economy, globalmarketsNews, hedgefunds, hotStocksNews, mergersNews, governmentfilingsNews, summitNews, USdollarreportNews, usmarkets

Example:

from NewsHarvest.NewsHarvest import *

reuters_sources = ['businessNews', 'wealth', 'deals', 'economy']
output = []
for source in reuters_sources:
    source_data = Reuters(source=source).get_data()
    for news_item in source_data:
        output.append(news_item)

write_to_csv(output, 'reuters.csv')

Bloomberg

Sources within Bloomberg: top news, markets

Example:

from NewsHarvest.NewsHarvest import *

output = []
source_data = Bloomberg(source='top news').get_data()
for news_item in source_data:
    output.append(news_item)

write_to_csv(output, 'bloomberg.csv')

Utilities Module

This module contains the helper functions used by the NewsHarvest module.

It also contains functions for exporting and importing data:

from NewsHarvest.NewsHarvest import *

data = AssociatedPress(source='headlines').get_data()
write_to_csv(data, 'output_file.csv')
write_to_txt(data, 'output_file.txt')

data = AssociatedPress(source='headlines').get_data()
append_to_csv(data, 'file_name.csv')

data = AssociatedPress(source='headlines').get_data(json_format=True)
write_json(data, 'output_file.json')
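
For the importing side, a CSV file that was written with column headings can be read back with Python's standard csv module. This sketch does not use the package's own helpers. read_records is a hypothetical name, and the sample text stands in for a real exported file, whose exact layout is an assumption here.

```python
import csv
import io

# Stand-in for the contents of an exported CSV file with a heading row.
sample_csv = (
    "Source,Headline,URL\n"
    "Associated Press,Example headline,https://example.com/a\n"
)


def read_records(handle):
    """Read CSV rows into dictionaries keyed by the heading row."""
    return list(csv.DictReader(handle))


records = read_records(io.StringIO(sample_csv))
```

With a real file, pass an open file handle instead, for example open('file_name.csv', newline='').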