scrapy-db

Similar to [scrapy-redis](https://github.com/rmax/scrapy-redis), but using a database as the queue: database-backed Scrapy components.

Features

  • Distributed crawling/scraping

    You can start multiple spider instances that share a single DB queue. Best suited for broad multi-domain crawls.

  • Distributed post-processing

    Scraped items get pushed into a DB queue, meaning you can start as many post-processing processes as needed, all sharing the same items queue.

  • Scrapy plug-and-play components

    Scheduler + Duplication Filter, Base Spiders. See the configuration sketch after this list.
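
As a rough sketch of what enabling these components might look like, here is a hypothetical settings module modeled on scrapy-redis conventions. The setting names below (`SCHEDULER`, `DUPEFILTER_CLASS`, `SCHEDULER_PERSIST`, `DB_URL`) and the module paths are assumptions, not confirmed API; the example-project in this repository is the authoritative reference.

```python
# settings.py -- a minimal sketch, assuming scrapy-db mirrors the
# scrapy-redis configuration style. Names below are assumptions;
# check example-project in this repository for the real ones.

# Swap Scrapy's in-memory scheduler and dupefilter for the
# database-backed ones, so all spider instances share one queue.
SCHEDULER = "scrapy_db.scheduler.Scheduler"  # assumed module path
DUPEFILTER_CLASS = "scrapy_db.dupefilter.DBDupeFilter"  # assumed module path

# Keep the queue between runs so a crawl can be paused and resumed.
SCHEDULER_PERSIST = True  # assumed setting name

# Connection string for the database holding the queue. MySQL is
# assumed here because pymysql is a required dependency.
DB_URL = "mysql://user:password@127.0.0.1:3306/scrapy"  # assumed setting name
```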

Requirements

  • Python 3.7+
  • peewee >= 3.16.0
  • Scrapy >= 2.7.0
  • pymysql >= 1.0.3

Installation

From pip

pip install scrapy-db

From GitHub

git clone https://github.com/libra146/scrapy-db.git
cd scrapy-db
python setup.py install

From poetry

poetry add scrapy-db

If you are running distributed crawling tasks, scrapy-db is a practical component that can help you complete them more efficiently.

Usage

Clone this repository and run the example spider in example-project to try it out.
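
For orientation, a spider that pulls its start requests from the shared DB queue might look roughly like this. The class name `DBSpider`, the import path, and the `db_key` attribute are assumptions based on the scrapy-redis `RedisSpider` pattern; the spider in example-project is authoritative.

```python
# myspider.py -- a hypothetical sketch following the scrapy-redis
# RedisSpider pattern. DBSpider and db_key are assumed names, not
# confirmed API; refer to example-project for the working version.
from scrapy_db.spiders import DBSpider  # assumed import path


class MySpider(DBSpider):
    name = "myspider"
    # Key/table that the shared start-URL queue lives under (assumed).
    db_key = "myspider:start_urls"

    def parse(self, response):
        # Every running instance pops URLs from the same DB queue,
        # so starting more processes scales the crawl horizontally.
        yield {"url": response.url, "title": response.css("title::text").get()}
```

Under this pattern, you would start several identical processes with `scrapy crawl myspider` and feed start URLs into the shared queue; each instance consumes from it without duplicating work.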

❗️Notice

This repository is still under development and may be unstable.

Why this library exists

I have a huge request pool and not enough memory for Redis to hold it, so I turned to a database instead. I created this library with reference to scrapy-redis, and it works fine.