scrapy-rss-exporter

An RSS Exporter for Scrapy


License
BSD-3-Clause
Install
pip install scrapy-rss-exporter==0.2

Generate an RSS feed using the Scrapy framework.

Installation

  • Install scrapy-rss-exporter using pip:

    pip install scrapy-rss-exporter
  • or using setuptools:

    python setup.py install

Usage

Feed Items

The most convenient way to use the exporter is to return RssItem objects from your spiders. RssItem derives from scrapy.Item, so it works with other exporters as well.

You will need to set the following keys:

from scrapy_rss_exporter.items import RssItem, Enclosure

rss_item = RssItem()
rss_item['title'] = 'Item title'
rss_item['link'] = 'Item url'
rss_item['guid'] = 'Item ID'
rss_item['description'] = 'Item Description'
rss_item['pub_date'] = None  # None: the current date is inserted
# img holds the URL of the image to attach (defined elsewhere in your spider)
rss_item['enclosure'] = [Enclosure(url=img, type='image/jpeg')]

The pub_date field should contain a date in RFC 822 format. If you set it to None, the exporter inserts the current date in the appropriate format. The enclosure field is optional and should contain a (possibly empty) list of Enclosure objects.
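If you want to set pub_date yourself, the standard library can produce a suitable string. The following is a minimal sketch, assuming the field accepts a preformatted string; rfc822_date is a hypothetical helper name, and treating naive datetimes as UTC is an assumption of this example rather than something the exporter requires.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def rfc822_date(dt):
    """Format a datetime as an RFC 822 style date string (four-digit year, GMT)."""
    if dt.tzinfo is None:
        # Assumption: naive datetimes are interpreted as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    # usegmt=True requires a UTC datetime, so convert first.
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)


published = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(rfc822_date(published))  # Mon, 01 Jan 2024 12:00:00 GMT
```

The result could then be assigned with rss_item['pub_date'] = rfc822_date(published).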

Global Exporter

To set the exporter up globally, you need to declare it in the FEED_EXPORTERS dictionary in the settings.py file:

FEED_EXPORTERS = {
  'rss': 'scrapy_rss_exporter.exporters.RssItemExporter'
}

You can then select it with FEED_FORMAT and specify the output location in FEED_URI:

FEED_FORMAT = 'rss'
FEED_URI = 's3://my-feeds/my-feed.rss'
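Scrapy 2.1 and later deprecate FEED_FORMAT and FEED_URI in favour of a single FEEDS dictionary. A sketch of what the equivalent configuration might look like, assuming the exporter behaves the same under the newer setting (this has not been verified against this package):

```python
# settings.py (Scrapy 2.1+): FEEDS maps each output URI to its options.
FEED_EXPORTERS = {
    'rss': 'scrapy_rss_exporter.exporters.RssItemExporter',
}
FEEDS = {
    's3://my-feeds/my-feed.rss': {'format': 'rss'},
}
```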

Note: If you write to a local file, Scrapy appends to any existing file, which results in invalid RSS. Make sure to delete any existing output file before running the spider. S3 storage does not have this problem because Scrapy uploads the feed using the S3 PutObject method.
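One way to handle the local-file case is to remove the old feed from the script that launches the crawl. A minimal sketch, assuming Python 3.8 or later; remove_stale_feed and the file name are hypothetical and not part of this package:

```python
from pathlib import Path


def remove_stale_feed(path):
    """Delete a previous feed file, if any, and report whether one was removed."""
    feed = Path(path)
    existed = feed.is_file()
    # missing_ok avoids an error on the very first run (Python 3.8+).
    feed.unlink(missing_ok=True)
    return existed


# Call this before starting the spider; 'my-feed.rss' is a placeholder path.
remove_stale_feed('my-feed.rss')
```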

Scrapy does not appear to offer a way to pass configuration options to an exporter. Therefore, to customize the feed title and other metadata, create a subclass and update the FEED_EXPORTERS dictionary with the new class name:

from scrapy_rss_exporter.exporters import RssItemExporter

class MyRssExporter(RssItemExporter):
    def __init__(self, *args, **kwargs):
        # Feed-level metadata passed on to the base exporter
        kwargs['title'] = 'My RSS'
        kwargs['link'] = 'https://www.mywebsite.com'
        kwargs['description'] = 'My RSS Items'
        super().__init__(*args, **kwargs)

Per Spider Exporter

You can also specify a different exporter with different settings for each spider. Use the custom_settings attribute to override the global configuration:

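import scrapy

class MySpider(scrapy.Spider):
    name = "my"
    start_urls = ['https://www.mywebsite.com']
    # These values override the project-wide settings for this spider only.
    custom_settings = {
        # Dotted path to the exporter subclass
        'FEED_EXPORTERS': {'rss': 'project.spiders.my_spider.MyRssExporter'},
        'FEED_FORMAT': 'rss',
        'FEED_URI': 's3://my-feeds/my-feed.rss',
    }

    def parse(self, response):
        # Yield RssItem objects here
        pass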