Scrapy-Domain-Delay
Scrapy-Domain-Delay is a package that lets you set a different download delay for each website when crawling with the Scrapy framework.
Install
$ pip install scrapy-domain-delay
Usage
Step 1: Extract the domain name from a full URL using the Python package tldextract.
>>> import tldextract
>>> tldextract.extract('https://www.google.com/').domain
'google'
In this example, "google" is extracted as the domain name from the full URL "https://www.google.com/".
Step 2: Use the following config values in your Scrapy settings:
- Enable the AutoThrottle extension.
  AUTOTHROTTLE_ENABLED = True
- Enable the Custom Delay Throttle by adding it to EXTENSIONS.
  EXTENSIONS = {
      'scrapy.extensions.throttle.AutoThrottle': None,
      'scrapy_domain_delay.extensions.CustomDelayThrottle': 300,
  }
- Add {'domain': 'download delay (in seconds)'} entries to DOMAIN_DELAYS, for example:
  # set up custom delays per domain
  DOMAIN_DELAYS = {
      'google': 1.0,
      'github': 0.5,
  }
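To illustrate how the keys in DOMAIN_DELAYS relate to request URLs, here is a minimal sketch of a lookup helper. It is not the extension's actual implementation: extract_domain, delay_for, and the default and extractor parameters are hypothetical names introduced for illustration, and the sketch assumes the extension matches on the same domain value that tldextract returns in Step 1.

```python
from typing import Callable, Mapping

# Same example values as in the settings above.
DOMAIN_DELAYS = {
    'google': 1.0,
    'github': 0.5,
}


def extract_domain(url: str) -> str:
    """Return the tldextract domain label, e.g. 'google' for https://www.google.com/."""
    # tldextract is a third-party package; it is imported lazily so that
    # callers supplying their own extractor do not need it installed.
    import tldextract
    return tldextract.extract(url).domain


def delay_for(
    url: str,
    domain_delays: Mapping[str, float],
    default: float = 0.0,
    extractor: Callable[[str], str] = extract_domain,
) -> float:
    """Look up the delay (in seconds) configured for the URL's domain.

    Falls back to `default` (a hypothetical parameter, not a package
    setting) when the domain has no entry in `domain_delays`.
    """
    return domain_delays.get(extractor(url), default)
```

With the example values, delay_for('https://www.google.com/', DOMAIN_DELAYS) looks up the key 'google', while a URL whose domain is not listed falls back to default. Note that the keys are bare domain labels such as 'google', not hostnames such as 'www.google.com'.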