scrapy-redirect

Restrict authorized Scrapy redirections to the website start_urls


Keywords
scrapy, crawl, scraping
License
MIT
Install
pip install scrapy-redirect==0.1.0

Documentation

scrapy-redirect restricts the HTTP redirections that Scrapy is allowed to follow to those issued for the website's start_urls.

Why?

If the Scrapy REDIRECT_ENABLED setting is set to False and a request to the homepage of the crawled website returns a 3XX status code, the crawl stops immediately, because the redirection is not followed.

scrapy-redirect forces Scrapy to tolerate redirections coming from the start_urls URLs when REDIRECT_ENABLED = False, which avoids this particular problem.
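The rule described above can be illustrated with a small, standalone sketch. It is not the code of scrapy-redirect: the function name middleware_handles_redirect, the helper _normalize, and the URL comparison (scheme, host and path, ignoring a trailing slash) are assumptions made for illustration only.

```python
from urllib.parse import urlsplit


def _normalize(url):
    """Hypothetical helper: reduce a URL to (scheme, host, path).

    Assumption: scheme/host case and a trailing slash are ignored when
    comparing URLs. The real package may compare URLs differently.
    """
    parts = urlsplit(url)
    return (parts.scheme.lower(), parts.netloc.lower(), parts.path.rstrip("/"))


def middleware_handles_redirect(request_url, status, start_urls,
                                redirect_enabled=False):
    """Return True if a redirect should be tolerated under the rule above."""
    if redirect_enabled:
        # With REDIRECT_ENABLED = True, scrapy-redirect does nothing.
        return False
    if not 300 <= status < 400:
        # Only 3XX responses are redirections.
        return False
    # Tolerate the redirection only when it comes from one of the start_urls.
    allowed = {_normalize(url) for url in start_urls}
    return _normalize(request_url) in allowed
```

For example, under these assumptions a 301 response to http://example.com/ is tolerated when start_urls contains http://example.com, while a 302 response to http://example.com/page is not.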

Installation

$ pip install scrapy-redirect

Configuration

Enable scrapy-redirect in your Scrapy middlewares by adding the following key/value pair to the SPIDER_MIDDLEWARES setting (in settings.py):

SPIDER_MIDDLEWARES = {
    ...
    'scrapyredirect.HomepageRedirectMiddleware': 575,
    ...
}

Note that the middleware order value must be lower than 600 (the default value of the 'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware' middleware), because scrapy-redirect must be executed before Scrapy blocks the redirection.

NB: if REDIRECT_ENABLED = True, scrapy-redirect does nothing.
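Putting the configuration notes together, a settings.py fragment for the case scrapy-redirect targets might look like the sketch below. It only combines the values already given in this section. It assumes the project defines no other spider middlewares, so the "..." placeholders are left out.

```python
# settings.py (fragment)

# Redirections are disabled globally; this is the situation in which
# scrapy-redirect is active. With REDIRECT_ENABLED = True it does nothing.
REDIRECT_ENABLED = False

SPIDER_MIDDLEWARES = {
    # 575 is the order value suggested above; it must stay lower than 600.
    'scrapyredirect.HomepageRedirectMiddleware': 575,
}
```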

License

scrapy-redirect is published under the MIT License.