ScrapyDD (Scrapy Distributed Daemon)
Scrapydd is a distributed system for running and scheduling Scrapy spiders, consisting of a server and a client agent.
Advantages:
- Distributed: easily add runners (agents) to scale out.
- Project requirements are installed automatically on demand.
- Cron-expression triggers run your spiders on schedule.
- Webhooks loosely couple data crawling from data processing.
- Spider status insight: the system inspects run logs to determine each spider's run status.
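To illustrate the webhook decoupling mentioned above, here is a minimal stand-alone receiver that a webhook could post scraped items to, using only the Python standard library. This is an illustrative sketch: the JSON payload shape and the endpoint are assumptions for the example, not scrapydd's actual webhook format.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class ItemWebhookHandler(BaseHTTPRequestHandler):
    """Receives scraped items POSTed by the crawler.

    The payload is assumed here to be one JSON object per request;
    adapt the parsing to whatever your webhook actually sends.
    """

    received = []  # collected items, for downstream processing

    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        body = self.rfile.read(length)
        item = json.loads(body)
        ItemWebhookHandler.received.append(item)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'ok')

    def log_message(self, fmt, *args):
        pass  # silence per-request logging


def run_receiver(host='127.0.0.1', port=8080):
    """Serve the webhook endpoint until interrupted."""
    HTTPServer((host, port), ItemWebhookHandler).serve_forever()
```

The crawling side only needs the receiver's URL, so the processing pipeline can be deployed, scaled, and restarted independently of the spiders.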
Installing Scrapydd
By pip:
pip install scrapydd
You can also install scrapydd manually:
- Download the compressed package from GitHub releases.
- Decompress the package.
- Run
python setup.py install
Run Scrapydd Server
scrapydd server
By default the server listens on 0.0.0.0:6800, serving both the API and the web UI. Add the --daemon parameter on the command line to run it in the background.
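Once the server is running, you can verify it is reachable with a quick probe using only the standard library. `is_server_up` is a hypothetical helper for this example, and the URL assumes the default bind address above.

```python
import urllib.error
import urllib.request


def is_server_up(url, timeout=3):
    """Return True if the scrapydd web UI answers an HTTP request."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # Any HTTP answer below 500 means the server process is alive.
            return 200 <= resp.getcode() < 500
    except (urllib.error.URLError, OSError):
        return False


if __name__ == '__main__':
    print(is_server_up('http://127.0.0.1:6800/'))
```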
Run Scrapydd Agent
scrapydd agent
Add the --daemon parameter on the command line to run it in the background.
Docs
The docs are hosted here.
Docker-Compose
version: '3'
services:
  scrapydd-server:
    image: "kevenli/scrapydd"
    ports:
      - "6800:6800"
    volumes:
      - "/scrapydd/server:/scrapydd"
      - "/var/run/docker.sock:/var/run/docker.sock"
    command: scrapydd server
  scrapydd-agent:
    image: "kevenli/scrapydd"
    volumes:
      - "/scrapydd/server:/scrapydd"
      - "/var/run/docker.sock:/var/run/docker.sock"
    links:
      - scrapydd-server
    environment:
      - SCRAPYDD_SERVER=scrapydd-server
    command: scrapydd agent