scrapy-domain-delay

This package lets you set a different download delay for each website you crawl with the Scrapy framework.


Install
pip install scrapy-domain-delay==0.0.4

Documentation



Install

$ pip install scrapy-domain-delay

Usage

Step 1: Extract the domain name from a full URL using the Python tldextract package.

>>> import tldextract
>>> tldextract.extract('https://www.google.com/').domain
'google'

In this example, the domain name extracted from the full URL "https://www.google.com/" is "google". This is the key you will use for that site in Step 2.

Step 2: Add the following config values to your Scrapy settings:

  1. Enable the AutoThrottle extension.

    AUTOTHROTTLE_ENABLED = True
  2. Enable the Custom Delay Throttle by adding it to EXTENSIONS in place of the built-in AutoThrottle extension, which is disabled by setting it to None.

    EXTENSIONS = {
        'scrapy.extensions.throttle.AutoThrottle': None,
        'scrapy_domain_delay.extensions.CustomDelayThrottle': 300,
    }
  3. Add one entry per domain to DOMAIN_DELAYS, mapping the domain name from Step 1 to its download delay in seconds. For example:

    # set up custom delays per domain
    DOMAIN_DELAYS = {
        'google': 1.0,
        'github': 0.5,
    }
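
To make the link between Step 1 and DOMAIN_DELAYS concrete, here is a minimal sketch of how a request URL could be matched to its configured delay. It is an illustration only, not the extension's actual code: naive_domain and lookup_delay are hypothetical helpers, the default of 2.0 seconds is an assumed fallback, and naive_domain is a standard-library stand-in for tldextract that does not handle multi-part suffixes such as "co.uk".

```python
from urllib.parse import urlsplit

# Same shape as the DOMAIN_DELAYS setting above.
DOMAIN_DELAYS = {
    'google': 1.0,
    'github': 0.5,
}


def naive_domain(url):
    """Hypothetical stand-in for tldextract.extract(url).domain.

    Takes the second-to-last label of the hostname, so it is wrong for
    multi-part suffixes such as "co.uk"; use tldextract for real lookups.
    """
    host = urlsplit(url).hostname or ''
    labels = host.split('.')
    return labels[-2] if len(labels) >= 2 else host


def lookup_delay(url, delays, default):
    """Hypothetical helper: return the delay configured for the URL's domain."""
    return delays.get(naive_domain(url), default)


# Subdomains share the key of their registered domain.
print(lookup_delay('https://www.google.com/search?q=scrapy', DOMAIN_DELAYS, 2.0))
print(lookup_delay('https://api.github.com/repos', DOMAIN_DELAYS, 2.0))
# A domain with no entry falls back to the assumed default.
print(lookup_delay('https://example.org/', DOMAIN_DELAYS, 2.0))
```

Because the key is the domain without subdomain or suffix, one entry such as 'github' covers both github.com and api.github.com in this sketch.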