scrapy-redirect restricts authorized HTTP redirections to those coming from the website's start_urls
Why?
If the Scrapy REDIRECT_ENABLED config key is set to False and a request to the homepage of the crawled website returns a 3XX status code, the crawl stops immediately, because the redirection is not followed.
To avoid this particular problem, scrapy-redirect forces Scrapy to tolerate redirections coming from the start_urls when REDIRECT_ENABLED = False.
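The rule described above can be sketched as a small, self-contained function. This is only an illustration of the decision, not the package's actual code: the helper name should_follow_redirect, its parameters, and the exact set of redirect status codes are assumptions made for this example.

```python
from typing import Iterable

# Assumption: the 3XX codes treated as redirections in this sketch.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def should_follow_redirect(
    status: int,
    request_url: str,
    start_urls: Iterable[str],
    redirect_enabled: bool,
) -> bool:
    """Hypothetical helper: decide whether a redirect needs special handling.

    Returns True only when redirections are globally disabled, the response
    is a redirection, and the redirected request targeted one of the
    start_urls.
    """
    if redirect_enabled:
        # Scrapy already follows every redirection; nothing extra to do.
        return False
    if status not in REDIRECT_STATUSES:
        # Not a redirection at all.
        return False
    # Only redirections coming from a start URL are tolerated.
    return request_url in set(start_urls)
```

For instance, with REDIRECT_ENABLED = False, a 301 returned for a URL listed in start_urls would be followed, while the same 301 on any other page would still be blocked.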
Installation
$ pip install scrapy-redirect
Configuration
Enable scrapy-redirect in your Scrapy middlewares by adding the following key/value pair to the SPIDER_MIDDLEWARES setting (in settings.py):
SPIDER_MIDDLEWARES = {
    ...
    'scrapyredirect.HomepageRedirectMiddleware': 575,
    ...
}
Note that the middleware order value must be lower than 600 (the default value of the 'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware' middleware), as it must be executed before Scrapy blocks the redirection.
NB: if REDIRECT_ENABLED = True, scrapy-redirect does nothing.
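Putting the pieces together, a minimal settings.py for this use case might look like the sketch below. It only combines the two settings discussed in this README; a real project would have other settings and possibly other middleware entries.

```python
# settings.py (sketch): redirections are disabled globally...
REDIRECT_ENABLED = False

# ...but scrapy-redirect still tolerates those coming from start_urls.
SPIDER_MIDDLEWARES = {
    # Order value kept below 600, as required above.
    'scrapyredirect.HomepageRedirectMiddleware': 575,
}
```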
License
scrapy-redirect is published under the MIT License.