Distributed, browser-based web scraping in Python with automatic error handling, crash recovery, and automation-detection prevention.


Keywords
asynchronous, asyncio, bot, chrome, chromium, crawler, headless-chrome, headless-chromium, pyppeteer, python, python3, spider
License
MIT
Install
pip install distbot==2.8.9

Documentation

distbot

Distributed, browser-based web scraping in Python with automatic error handling, crash recovery, and automation-detection prevention.
Based on Pyppeteer.

Installing

pip:
pip install distbot
Docker:
docker pull danielkelleher/distbot

Usage

Spiders can be run with multiple browsers and/or multiple browser tabs.
All requests are distributed across browsers and tabs in round-robin fashion.
The web browsers used by a spider do not need to be on the same machine as the scraping script.
To launch and connect to a browser on a remote machine, pass the IP address of the remote server when creating the browser:
await spider.add_browser(server='remote.ip.addr')

For running distributed spiders, see examples/distributed.py
For non-distributed use, see examples/simple.py
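
A minimal single-machine sketch is below. The Spider class, its import path, and the get/set_idle/shutdown methods and pages/launch_options parameters are assumptions inferred from the examples and the pageIdleTimeout description; check examples/simple.py for the current API.

import asyncio
from distbot import Spider  # import path is an assumption

async def main():
    spider = Spider()
    # Two browsers with two tabs each; requests round-robin over all four tabs.
    await spider.add_browser(pages=2, launch_options={"headless": True})
    await spider.add_browser(pages=2, launch_options={"headless": True})
    for url in ("https://example.com", "https://example.org"):
        page = await spider.get(url)   # check out an idle tab and navigate it
        print(url, await page.title())
        await spider.set_idle(page)    # return the tab to the idle queue
    await spider.shutdown()

asyncio.run(main())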

Launch Options

distbot supports all of Pyppeteer's launch options, plus a few others:

screenshot
Screenshot each page that is navigated.
Default: False

defaultNavigationTimeout
Default maximum navigation timeout in ms.
Default: 45000

setAdBlockingEnabled
Enable Chrome's experimental ad filter on all sites.
Default: False

blockedURLs
List of URLs and/or URL patterns to prevent from loading.
Default: []

requestAbortTypes
List of content types that should have requests intercepted and aborted.
Valid content types are: stylesheet, image, media, font, script,
texttrack, xhr, fetch, eventsource, websocket, manifest, other.
Default: []

blockRedirects
Intercept and abort redirect requests.
Default: False

evaluateOnNewDocument
List of JavaScript functions (wrapped as str) that will be invoked on every page navigation (see the sketch after this list).
Default: []

deleteCookies
Clear all cookies before each request.
Default: False

maxConsecutiveError
Maximum allowable consecutive browser errors before the browser is replaced.
Default: 4

pageIdleTimeout
Watchdog timer to detect crashes in the end user's script. If a page is not set idle within pageIdleTimeout seconds, it is automatically returned to the idle queue.

proxy
Address of proxy server to use.
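
As an example of evaluateOnNewDocument, a stealth tweak that masks navigator.webdriver can be passed as a function string. This is a sketch; the JavaScript body is illustrative and not shipped with distbot.

launch_options = {
    "evaluateOnNewDocument": [
        # Runs in the page context on every navigation, before page scripts.
        "() => { Object.defineProperty(navigator, 'webdriver', { get: () => undefined }); }"
    ]
}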

Example launch options might look like:

{
    "ignoreDefaultArgs": [
        "--disable-popup-blocking",
        "--disable-extensions"
    ],
    "headless": True,
    "ignoreHTTPSErrors": True,
    "defaultViewport": {},
    "executablePath": "/usr/bin/google-chrome-stable",
    "defaultNavigationTimeout": 60_000,
    "blockedURLs": ["google","facebook","youtube"],
    "requestAbortTypes": ["font", "stylesheet"],
    "deleteCookies": True,
    "userDataDir": "/home/user/Default",
    "screenshot: True,
    "pageIdleTimeout": 120,
    "proxy": "http://5.79.66.2:13010",
    "args": [
        "--disable-web-security",
        "--no-sandbox",
        "--start-maximized",
        "--js-flags=\"--max_old_space_size=500\"",
        "--blink-settings=imagesEnabled=false"
    ]
}
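
These options are passed when a browser is added. The launch_options parameter name is an assumption; check examples/ for the exact signature.

await spider.add_browser(launch_options=launch_options)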

Detection Prevention

distbot uses the full suite of scripts from extract-stealth-evasions to prevent sites from detecting browser automation.
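
A quick way to sanity-check the evasions is to screenshot a fingerprinting test page (same assumed Spider API as the usage sketch above; the test URL is a third-party service):

# Navigate to a public bot-detection test page and capture the results.
page = await spider.get("https://bot.sannysoft.com")
await page.screenshot({"path": "stealth-check.png", "fullPage": True})
await spider.set_idle(page)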