GoogleNewsScraper

Scrapes Google News article data


Keywords
crawler, googleautomator, googlenews, googlenewsscraper, googlescraper, python, scraper, scraping, selenium, web-scraping, webcrawler, webdriver, webscraper
License
MIT
Install
pip install GoogleNewsScraper==1.0.1

Documentation

google-news-scraper

Getting Started

Installation

pip install google-news-scraper

Reference

Importing

import importlib

GoogleNewsScraper = importlib.import_module('google-news-scraper').GoogleNewsScraper

Instantiating Scraper

GoogleNewsScraper(driver, automation_options, chrome_driver_arguments)

Constructor Parameters

Name Type Required
driver web driver no

Possible values:

  • 'chrome': The driver will default to use this package's chrome driver
  • A path to some driver (FireFox, for instance) stored on the user's system

Name Type Required
automation_options object yes

Possible values:

  • keywords : A series of words that will be inputted into Google News.
  • date_range : Filters how recent data should be. Can be any of the following (defaults to 'Past 24 hours'):
    • Past hours
    • Past 24 hours
    • Past week
    • Past month
    • Past year
    • Archives
  • pages : Number of pages that should be scraped (defaults to 'max').
  • pagination_pause_per_page : Waits a certain amount of seconds before a new page is scraped (defaults to 2). Time may have to be increased if Google prevents you from scraping all pages.

Name Type Required
chrome_driver_arguments list no

Ignore this parameter if you are not choosing to use the default 'chrome' driver

Possible values:

  • '--headless'
  • '--ignore-certificate-errors'
  • '--incognito'
  • '--no-sandbox'
  • '--disable-setuid-sandbox'
  • '--disable-dev-shm-usage'

Click this link to view all possible arguments: https://chromedriver.chromium.org/capabilities


Methods

This method is both public and private, though it really should only be used by the class

locate_html_element(self, driver, element, selector, wait_seconds)

Name Type Required
driver web driver yes

Possible values:

  • A web driver (Chrome, FireFox, etc)

Name Type Required
element string yes

Possible values:

  • Id selector of an HTML element
  • Class selector of an HTML element

Name Type Required
selector Module import yes

First install selenium

pip install selenium

Then import By

from selenium.webdriver.common.by import By

Possible values:

  • By.ID
  • By.CLASS_NAME
  • By.CSS_SELECTOR
  • By.LINK_TEXT
  • By.NAME
  • By.PARTIAL_LINK_TEXT
  • By.TAG_NAME
  • By.XPATH

Name Type Required
wait_seconds number no

default: 30

Description:

  • Waits a certain number of seconds in order to locate an HTML element
  • If an element exists on the page, it will be located instantaneously
  • If an element does not yet exist, (if it will appear once a request is made, for instance)
  • wait_seconds may have to be increased depending on how long it takes for an element to appear

please note: 30 seconds is plenty; this time would rarely have to be increased


GoogleNewsScraper.scrape()
  • Begins the scraping process and Returns a two-dimensional list
  • Each list represents a single page, and contains multiple objects
  • Each object representing one article

Example of what type of data a single article-object will contain:

  • 'description': The preview description of the news article
  • 'title': The title of the news article
  • 'source': The source of news article (New York Times, for instance)
  • 'image_url': The url of the preview news article image
  • 'article_link': A link to the news article
  • 'time_published_ago': A datetime string that represents the date of when the article was published