scrapy-selenium-middleware

A Scrapy downloader middleware that fetches a page's HTML source with Selenium, lets you interact with the web driver in the request context, and returns an HtmlResponse to the spider.


Keywords
scrapy, selenium, middleware, proxy, web, scraping, render, javascript, selenium-wire, headless, browser
License
MIT
Install
pip install scrapy-selenium-middleware==0.0.5

Documentation

requirements

  • This downloader middleware should be used inside an existing Scrapy project
  • Install Firefox and geckodriver on the machine running this middleware
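
A quick way to confirm both prerequisites before running a crawl is to check that the executables are on the PATH. This is a minimal sketch using only the Python standard library. The helper name `find_missing_tools` is hypothetical and not part of this package, and the sketch assumes the binaries are named `firefox` and `geckodriver` on your system.

```python
import shutil


def find_missing_tools(tools=("firefox", "geckodriver")):
    """Return the names of executables that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("Firefox and geckodriver were both found.")
```

If Firefox is installed under a different executable name on your platform, pass that name in the `tools` argument instead.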

pip

  • pip install scrapy-selenium-middleware

usage example

For a full Scrapy project demo, please go here

The middleware reads its settings from the Scrapy project settings.
In your Scrapy project's settings.py file, add the following settings:

DOWNLOADER_MIDDLEWARES = {"scrapy_selenium_middleware.SeleniumDownloader":451}
CONCURRENT_REQUESTS = 1 # multiple concurrent browsers are not supported yet
SELENIUM_IS_HEADLESS = False
SELENIUM_PROXY = "http://user:password@my-proxy-server:port" # set to None to not use a proxy
SELENIUM_USER_AGENT = "User-Agent: Mozilla/5.0 (<system-information>) <platform> (<platform-details>) <extensions>"
SELENIUM_REQUEST_RECORD_SCOPE = ["api*"] # a list of regular expressions; incoming requests whose URL matches one are recorded
SELENIUM_FIREFOX_PROFILE_SETTINGS = {}
SELENIUM_PAGE_LOAD_TIMEOUT = 120
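
Because the entries in SELENIUM_REQUEST_RECORD_SCOPE are regular expressions rather than shell-style globs, "api*" means "ap" followed by zero or more "i" characters. The sketch below illustrates this with Python's `re` module. It assumes the patterns are applied with `re.search` semantics, which is an assumption about the middleware and not something this README states. The helper `in_scope` and the example URLs are hypothetical.

```python
import re


def in_scope(url, patterns):
    """Return True if any pattern is found anywhere in the URL (re.search semantics)."""
    return any(re.search(pattern, url) for pattern in patterns)


scope = ["api*"]
print(in_scope("https://example.com/api/v1/items", scope))  # True
print(in_scope("https://example.com/apps/list", scope))  # True, "ap" alone matches
print(in_scope("https://example.com/static/logo.png", scope))  # False
```

If you only want URLs that contain an /api/ path segment, a stricter pattern such as "/api/" may be closer to the intent.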