scrapy-rabbitmq-link

RabbitMQ plug-in for Scrapy


Keywords
scrapy rabbitmq urls links
License
MIT
Install
pip install scrapy-rabbitmq-link==0.3.0

Documentation

A RabbitMQ scheduler for the Scrapy framework.

scrapy-rabbitmq-link is a library that lets you crawl URLs fed from a RabbitMQ queue using the Scrapy framework.

This project is a modified version of scrapy-rabbitmq, published by Royce Haynes on GitHub.

Installation

Using pip, type the following at your command-line prompt:

pip install scrapy-rabbitmq-link

Or clone the repo and, inside the scrapy-rabbitmq-link directory, type:

python setup.py install

Usage

Step 1: In your Scrapy settings, add the following configuration values:

# Enable RabbitMQ scheduler
SCHEDULER = "scrapy_rabbitmq_link.scheduler.SaaS"

# Provide AMQP connection string
RABBITMQ_CONNECTION_PARAMETERS = 'amqp://guest:guest@localhost:5672/'

# Set response status codes to requeue messages on
SCHEDULER_REQUEUE_ON_STATUS = [500]

# Middleware acks RabbitMQ message on success
DOWNLOADER_MIDDLEWARES = {
    'scrapy_rabbitmq_link.middleware.RabbitMQMiddleware': 999
}
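
Because SCHEDULER_REQUEUE_ON_STATUS is a plain list, you can list several transient-failure codes to requeue on; a sketch, assuming the scheduler treats every listed code the same way:

# Requeue on common transient server errors (assumed set of codes)
SCHEDULER_REQUEUE_ON_STATUS = [500, 502, 503, 504]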

Step 2: Add the request-building method _make_request to your spider

Example: custom_spider.py

import scrapy


class CustomSpider(scrapy.Spider):
    name = 'custom_spider'
    amqp_key = 'test_urls'

    def _make_request(self, mframe, hframe, body):
        url = body.decode()
        return scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        item = ... # parse item
        yield item

The amqp_key attribute serves as the queue name for the spider.
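
If your producers publish structured messages instead of bare URLs, _make_request is also the place to decode them and carry extra data on the request. A minimal sketch, assuming each message body is a JSON object with url and category fields (hypothetical names, not part of the library):

import json

import scrapy


class JsonSpider(scrapy.Spider):
    name = 'json_spider'
    amqp_key = 'test_urls'

    def _make_request(self, mframe, hframe, body):
        # Assumed message format: {"url": "...", "category": "..."}
        payload = json.loads(body.decode())
        return scrapy.Request(payload['url'],
                              callback=self.parse,
                              meta={'category': payload.get('category')})

    def parse(self, response):
        self.logger.info('crawled %s (category: %s)',
                         response.url, response.meta.get('category'))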

Step 3: Push URLs to RabbitMQ

Publish the list of URLs you want to scrape.

Example: push_urls_to_queue.py

#!/usr/bin/env python
import pika
import settings

connection = pika.BlockingConnection(
    pika.URLParameters(settings.RABBITMQ_CONNECTION_PARAMETERS))
channel = connection.channel()

# Set the queue name; it must match the spider's amqp_key so the
# scheduler consumes the URLs published here
queue_key = 'test_urls'

# Publish links to the queue
with open('urls.txt') as f:
    for url in f:
        url = url.strip(' \n\r')
        channel.basic_publish(exchange='',
                              routing_key=queue_key,
                              body=url,
                              properties=pika.BasicProperties(
                                  content_type='text/plain',
                                  delivery_mode=2  # persistent message
                              ))

connection.close()
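
To check that the messages actually arrived, you can ask the broker for the queue depth with a passive declare. A minimal sketch using pika's standard API, reusing the same settings module and queue name as above:

#!/usr/bin/env python
import pika
import settings

connection = pika.BlockingConnection(
    pika.URLParameters(settings.RABBITMQ_CONNECTION_PARAMETERS))
channel = connection.channel()

# passive=True only inspects the queue; it raises if the queue does not exist
frame = channel.queue_declare(queue='test_urls', passive=True)
print('messages waiting:', frame.method.message_count)

connection.close()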

Step 4: Run the spider with the Scrapy command-line client

scrapy crawl custom_spider
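
If you would rather launch the spider from Python (for example, under a process supervisor), plain Scrapy's CrawlerProcess works as usual; nothing here is specific to scrapy-rabbitmq-link:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# get_project_settings() picks up settings.py, including the RabbitMQ
# scheduler configuration from Step 1
process = CrawlerProcess(get_project_settings())
process.crawl('custom_spider')
process.start()  # blocks until the crawl finishes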

HAPPY SCRAPING !!!

Contributing and Forking

See Contributing Guidelines

Copyright & License

See LICENCE