Web Scraping Library Based on Selenium


Keywords
automation, orm, raccy, scraping, scrawler, selenium, web, web-automation, web-scraping
License
Apache-2.0
Install
pip install raccy==2.0.0

Documentation

RACCY

OVERVIEW

Raccy is a multithreaded web scraping library based on Selenium. It can be used for web automation, web scraping, and data mining.

REQUIREMENTS

  • Python 3.7+
  • Works on Linux, Windows, and Mac

ARCHITECTURE OVERVIEW

  • UrlDownloaderWorker: responsible for downloading the URLs of items to be scraped and enqueuing them in ItemUrlQueue.

  • ItemUrlQueue: receives item URLs from UrlDownloaderWorker and queues them for CrawlerWorker(s).

  • CrawlerWorker: fetches each item's web page, extracts data from it, and enqueues the scraped data in DatabaseQueue.

  • DatabaseQueue: receives scraped item data from CrawlerWorker(s) and queues it for DatabaseWorker.

  • DatabaseWorker: receives scraped data from DatabaseQueue and stores it in a persistent database. (A minimal sketch of this pipeline follows below.)
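
Conceptually, the pipeline is three stages connected by two queues. The following is a minimal, illustrative sketch using only the Python standard library; it is not raccy's implementation, and the URLs and field names are placeholders:

import queue
import threading

item_url_queue = queue.Queue()   # plays the role of ItemUrlQueue
db_queue = queue.Queue()         # plays the role of DatabaseQueue


def url_downloader():
    # UrlDownloaderWorker: discover item URLs and enqueue them.
    for n in (1, 2, 3):
        item_url_queue.put('https://example.com/item/%d' % n)  # placeholder URLs
    item_url_queue.put(None)  # sentinel: no more URLs


def crawler():
    # CrawlerWorker: fetch each page, extract data, enqueue the record.
    while True:
        url = item_url_queue.get()
        if url is None:
            break
        db_queue.put({'url': url, 'data': 'scraped fields go here'})
    db_queue.put(None)  # sentinel: no more records


def database_worker():
    # DatabaseWorker: persist each scraped record.
    while True:
        record = db_queue.get()
        if record is None:
            break
        print('saving', record)


workers = [threading.Thread(target=fn)
           for fn in (url_downloader, crawler, database_worker)]
for w in workers:
    w.start()
for w in workers:
    w.join()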

INSTALL

pip install raccy

TUTORIAL

from raccy import (
    UrlDownloaderWorker, CrawlerWorker, DatabaseWorker, WorkersManager
)
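# 'ro' is the ORM module used with raccy; below it provides the database
# configuration and the model fields.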
import ro as model
from selenium import webdriver
from shutil import which

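# Point the ORM at a local SQLite database file for storing scraped quotes.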
config = model.Config()
config.DATABASE = model.SQLiteDatabase('quotes.sqlite3')


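# Model describing the table that scraped quotes are saved to.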
class Quote(model.Model):
    quote_id = model.PrimaryKeyField()
    quote = model.TextField()
    author = model.CharField(max_length=100)


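# Pushes each listing page's URL onto the url queue, following the 'Next'
# link until max_url_download URLs have been queued.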
class UrlDownloader(UrlDownloaderWorker):
    start_url = 'https://quotes.toscrape.com/page/1/'
    max_url_download = 10

    def job(self):
        url = self.driver.current_url
        self.url_queue.put(url)
        self.follow(xpath="//a[contains(text(), 'Next')]", callback=self.job)


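# Receives page URLs from the url queue, extracts quote text and author,
# and pushes each scraped record onto the database queue.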
class Crawler(CrawlerWorker):

    def parse(self, url):
        self.driver.get(url)
        quotes = self.driver.find_elements_by_xpath("//div[@class='quote']")
        for q in quotes:
            quote = q.find_element_by_xpath(".//span[@class='text']").text
            author = q.find_element_by_xpath(".//span/small").text

            data = {
                'quote': quote,
                'author': author
            }
            self.log.info(data)
            self.db_queue.put(data)


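# Receives scraped records from the database queue and persists them via the ORM.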
class Db(DatabaseWorker):

    def save(self, data):
        Quote.objects.create(**data)


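# Driver factory handed to WorkersManager: builds a headless Chrome driver.
# The chromedriver path below is Windows-style; adjust it for your platform.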
def get_driver():
    driver_path = which('.\\chromedriver.exe')
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument("--start-maximized")
    driver = webdriver.Chrome(executable_path=driver_path, options=options)
    return driver


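# Wire the workers together and run the pipeline.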
if __name__ == '__main__':
    manager = WorkersManager()
    manager.add_driver(get_driver)
    manager.start()
    print('Done scraping...........')
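
Note: the tutorial above uses the Selenium 3 locator API (find_element_by_xpath / find_elements_by_xpath) and the executable_path argument, which were removed in Selenium 4. If your environment has Selenium 4+, the equivalent Selenium calls look roughly like this (a sketch of the Selenium side only, not of raccy's API; the chromedriver path is a placeholder):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument('--headless')
# The chromedriver path is a placeholder; adjust it for your setup.
driver = webdriver.Chrome(service=Service('./chromedriver'), options=options)

driver.get('https://quotes.toscrape.com/page/1/')
for q in driver.find_elements(By.XPATH, "//div[@class='quote']"):
    print(q.find_element(By.XPATH, ".//span[@class='text']").text)
driver.quit()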

Author

  • Afriyie Daniel

Hope you enjoy using it!