Distributed Scrapy spider scheduling system


Keywords
microservice, scrapy
License
Other
Install
pip install scrapydd==0.7.5

Documentation

ScrapyDD (Scrapy Distributed Daemon)


Scrapydd is a distributed running and scheduling system for Scrapy spiders, consisting of a server and client agents.

Advantages:

  • Distributed: easily add runners (agents) to scale out.
  • Project requirements are installed automatically on demand.
  • Cron-expression triggers run your spiders on schedule.
  • Webhooks loosely couple data crawling from data processing.
  • Spider status insight: the system inspects the logs to determine each run's status.
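The webhook feature above can be illustrated with a minimal receiver. This is only a sketch: the payload shape (one JSON item per POST) and the endpoint are assumptions for illustration, not scrapydd's documented delivery format.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # items delivered by the webhook land here


class WebhookHandler(BaseHTTPRequestHandler):
    """Accept POSTed items; assumes one JSON object per request
    (scrapydd's actual payload format may differ)."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        received.append(json.loads(self.rfile.read(length)))
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


# Start the receiver on an ephemeral port in a background thread.
server = HTTPServer(("127.0.0.1", 0), WebhookHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Simulate the scheduling server delivering a crawled item.
item = {"title": "example", "url": "http://example.com/"}
req = urllib.request.Request(
    "http://127.0.0.1:%d/" % port,
    data=json.dumps(item).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
server.shutdown()
```

A processing pipeline built this way never touches the crawler directly; it only consumes whatever the webhook delivers.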

Installing Scrapydd

By pip:

pip install scrapydd

You can also install scrapydd manually:

  1. Download the compressed package from GitHub releases.
  2. Decompress the package.
  3. Run python setup.py install

Run Scrapydd Server

scrapydd server

By default the server serves on 0.0.0.0:6800, exposing both the API and the web UI. Add the --daemon parameter to the command line to run it in the background.

Run Scrapydd Agent

scrapydd agent

Add the --daemon parameter to the command line to run it in the background.
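The agent needs to know where the server is. One way, used in the docker-compose example in this document, is the SCRAPYDD_SERVER environment variable; the hostname below is a placeholder, not a real deployment:

```shell
# Point the agent at a remote server (hostname is a placeholder);
# SCRAPYDD_SERVER is the same variable used in the docker-compose
# example, and --daemon backgrounds the process.
SCRAPYDD_SERVER=scrapydd.example.com scrapydd agent --daemon
```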

Docs

The docs are hosted here

Docker-Compose

version: '3'
services:
  scrapydd-server:
    image: "kevenli/scrapydd"
    ports:
      - "6800:6800"
    volumes:
      - "/scrapydd/server:/scrapydd"
      - "/var/run/docker.sock:/var/run/docker.sock"
    command: scrapydd server

  scrapydd-agent:
    image: "kevenli/scrapydd"
    volumes:
      - "/scrapydd/server:/scrapydd"
      - "/var/run/docker.sock:/var/run/docker.sock"
    links:
      - scrapydd-server
    environment:
      - SCRAPYDD_SERVER=scrapydd-server
    command: scrapydd agent
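With the file above saved as docker-compose.yml, the stack can be started with standard docker-compose usage (service names taken from the file above):

```shell
# Start the server and agent in detached mode.
docker-compose up -d

# Follow the server logs to confirm it is accepting agents.
docker-compose logs -f scrapydd-server
```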