google-news-scraper

Getting Started

Installation

pip install google-news-scraper

Reference

Importing

import importlib

GoogleNewsScraper = importlib.import_module('google-news-scraper').GoogleNewsScraper

Instantiating Scraper

GoogleNewsScraper(driver, automation_options, chrome_driver_arguments)

Constructor Parameters

Name	Type	Required
driver	web driver	no

Possible values:

'chrome': The driver will default to use this package's chrome driver
A path to some driver (FireFox, for instance) stored on the user's system

Name	Type	Required
automation_options	object	yes

Possible values:

keywords : A series of words that will be inputted into Google News.
date_range : Filters how recent data should be. Can be any of the following (defaults to 'Past 24 hours'):
- Past hours
- Past 24 hours
- Past week
- Past month
- Past year
- Archives
pages : Number of pages that should be scraped (defaults to 'max').
pagination_pause_per_page : Waits a certain amount of seconds before a new page is scraped (defaults to 2). Time may have to be increased if Google prevents you from scraping all pages.

Name	Type	Required
chrome_driver_arguments	list	no

Ignore this parameter if you are not choosing to use the default 'chrome' driver

Possible values:

'--headless'
'--ignore-certificate-errors'
'--incognito'
'--no-sandbox'
'--disable-setuid-sandbox'
'--disable-dev-shm-usage'

Click this link to view all possible arguments: https://chromedriver.chromium.org/capabilities

Methods

This method is both public and private, though it really should only be used by the class

locate_html_element(self, driver, element, selector, wait_seconds)

Name	Type	Required
driver	web driver	yes

Possible values:

A web driver (Chrome, FireFox, etc)

Name	Type	Required
element	string	yes

Possible values:

Id selector of an HTML element
Class selector of an HTML element

Name	Type	Required
selector	Module import	yes

First install selenium

pip install selenium

Then import By

from selenium.webdriver.common.by import By

Possible values:

By.ID
By.CLASS_NAME
By.CSS_SELECTOR
By.LINK_TEXT
By.NAME
By.PARTIAL_LINK_TEXT
By.TAG_NAME
By.XPATH

Name	Type	Required
wait_seconds	number	no

default: 30

Description:

Waits a certain number of seconds in order to locate an HTML element
If an element exists on the page, it will be located instantaneously
If an element does not yet exist, (if it will appear once a request is made, for instance)
wait_seconds may have to be increased depending on how long it takes for an element to appear

please note: 30 seconds is plenty; this time would rarely have to be increased

GoogleNewsScraper.scrape()

Begins the scraping process and Returns a two-dimensional list
Each list represents a single page, and contains multiple objects
Each object representing one article

Example of what type of data a single article-object will contain:

'description': The preview description of the news article
'title': The title of the news article
'source': The source of news article (New York Times, for instance)
'image_url': The url of the preview news article image
'article_link': A link to the news article
'time_published_ago': A datetime string that represents the date of when the article was published

GoogleNewsScraper
Release 1.0.1

Release 1.0.1

1.0.1

1.0.0

0.0.9

0.0.8

0.0.7

0.0.6

0.0.5

0.0.4

0.0.3

0.0.2

Documentation

google-news-scraper

Getting Started

Installation

Reference

Importing

Instantiating Scraper

Methods

Stats

Development practices

Releases

Contributors

GoogleNewsScraper Release 1.0.1

Release 1.0.1 Toggle Dropdown 1.0.1 1.0.0 0.0.9 0.0.8 0.0.7 0.0.6 0.0.5 0.0.4 0.0.3 0.0.2

Documentation

google-news-scraper

Getting Started

Installation

Reference

Importing

Instantiating Scraper

Methods

Stats

Development practices

Releases

Contributors

GoogleNewsScraper
Release 1.0.1

Release 1.0.1

1.0.1

1.0.0

0.0.9

0.0.8

0.0.7

0.0.6

0.0.5

0.0.4

0.0.3

0.0.2