Tools for easily generating an RSS feed that contains each scraped item, using the Scrapy framework.
Install scrapy_rss using pip:
pip install scrapy_rss
or using pip for a specific interpreter, e.g.:
pip3 install scrapy_rss
or using setuptools directly:
cd path/to/root/of/scrapy_rss
python setup.py install
or using setuptools for a specific interpreter, e.g.:
cd path/to/root/of/scrapy_rss
python3 setup.py install
Add parameters to the Scrapy project settings (settings.py file) or to the custom_settings attribute of the spider:
- Add the item pipeline that exports items to the RSS feed:
ITEM_PIPELINES = {
    # ...
    'scrapy_rss.pipelines.RssExportPipeline': 900,  # or another priority
    # ...
}
- Add the required feed parameters:
  - FEED_FILE - the absolute or relative file path where the resulting RSS feed will be saved, for example feed.rss or output/feed.rss,
  - FEED_TITLE - the name of the channel (feed),
  - FEED_DESCRIPTION - a phrase or sentence that describes the channel (feed),
  - FEED_LINK - the URL of the HTML website corresponding to the channel (feed).
FEED_FILE = 'path/to/feed.rss'
FEED_TITLE = 'Some title of the channel'
FEED_LINK = 'http://example.com/rss'
FEED_DESCRIPTION = 'About channel'
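To make the role of the three descriptive settings concrete, the sketch below builds a bare RSS 2.0 skeleton with the standard library, placing the title, link, and description inside the channel element. It only illustrates the general RSS structure; it is not how scrapy_rss writes the file, and the helper name build_channel_skeleton is hypothetical.

```python
import xml.etree.ElementTree as ET

# Values copied from the settings example above.
FEED_TITLE = 'Some title of the channel'
FEED_LINK = 'http://example.com/rss'
FEED_DESCRIPTION = 'About channel'


def build_channel_skeleton(title, link, description):
    """Hypothetical helper: a minimal RSS 2.0 tree with the three
    required channel elements (not part of scrapy_rss)."""
    rss = ET.Element('rss', {'version': '2.0'})
    channel = ET.SubElement(rss, 'channel')
    ET.SubElement(channel, 'title').text = title
    ET.SubElement(channel, 'link').text = link
    ET.SubElement(channel, 'description').text = description
    return rss


skeleton = build_channel_skeleton(FEED_TITLE, FEED_LINK, FEED_DESCRIPTION)
print(ET.tostring(skeleton, encoding='unicode'))
```

The scraped items would then appear as item elements inside the same channel element.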
If you want to change other channel parameters (such as language, copyright, managing_editor,
webmaster, pubdate, last_build_date, category, generator, docs, ttl),
define your own exporter that inherits from the RssItemExporter class, for example:
from scrapy_rss.exporters import RssItemExporter

class MyRssItemExporter(RssItemExporter):
    def __init__(self, *args, **kwargs):
        kwargs['generator'] = kwargs.get('generator', 'Special generator')
        kwargs['language'] = kwargs.get('language', 'en-us')
        super(MyRssItemExporter, self).__init__(*args, **kwargs)
Then add the FEED_EXPORTER parameter to the Scrapy project settings or to the custom_settings attribute of the spider:
FEED_EXPORTER = 'myproject.exporters.MyRssItemExporter'
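For a single spider, the same parameters could be collected into one dictionary and assigned to the spider's custom_settings class attribute instead of settings.py. The sketch below shows only the dictionary; the values are the placeholders used earlier, and the exporter path assumes the hypothetical myproject.exporters module from the example above.

```python
# A sketch of per-spider settings; in a real spider this dictionary
# would be the class attribute `custom_settings`.
custom_settings = {
    'ITEM_PIPELINES': {
        'scrapy_rss.pipelines.RssExportPipeline': 900,  # or another priority
    },
    'FEED_FILE': 'path/to/feed.rss',
    'FEED_TITLE': 'Some title of the channel',
    'FEED_LINK': 'http://example.com/rss',
    'FEED_DESCRIPTION': 'About channel',
    # Optional: only needed when a custom exporter is defined.
    'FEED_EXPORTER': 'myproject.exporters.MyRssItemExporter',
}

# The four required feed parameters named in this document.
required = ('FEED_FILE', 'FEED_TITLE', 'FEED_LINK', 'FEED_DESCRIPTION')
missing = [name for name in required if name not in custom_settings]
print(missing)
```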
Declare your item directly as RssItem():
import scrapy_rss
item1 = scrapy_rss.RssItem()
Or use the predefined item class RssedItem, which has an RSS field named rss that is an instance of RssItem:
import scrapy
import scrapy_rss

class MyItem(scrapy_rss.RssedItem):
    field1 = scrapy.Field()
    field2 = scrapy.Field()
    # ...

item2 = MyItem()
Set and get item fields. The attributes of RssItem() are case-sensitive and correspond to RSS elements.
Attributes of RSS elements are case-sensitive too.
If your editor supports autocompletion, it suggests attributes for instances of RssedItem and RssItem.
Any subset of RSS elements may be set (e.g. title only). For example:
from datetime import datetime
item1.title = 'RSS item title' # set value of <title> element
title = item1.title.title # get value of <title> element
item1.description = 'description'
item1.guid = 'item identifier'
item1.guid.isPermaLink = True # set value of attribute isPermaLink of <guid> element,
                              # isPermaLink is False by default
is_permalink = item1.guid.isPermaLink # get value of attribute isPermaLink of <guid> element
guid = item1.guid.guid # get value of element <guid>
item1.category = 'single category'
category = item1.category
item1.category = ['first category', 'second category']
first_category = item1.category[0].category # get value of the element <category> with multiple values
all_categories = [cat.category for cat in item1.category]
# direct attributes setting
item1.enclosure.url = 'http://example.com/file'
item1.enclosure.length = 0
item1.enclosure.type = 'text/plain'
# or dict based attributes setting
item1.enclosure = {'url': 'http://example.com/file', 'length': 0, 'type': 'text/plain'}
item1.guid = {'guid': 'item identifier', 'isPermaLink': True}
item1.pubDate = datetime.now() # works correctly with Python datetimes
item2.rss.title = 'Item title'
item2.rss.guid = 'identifier'
item2.rss.enclosure = {'url': 'http://example.com/file', 'length': 0, 'type': 'text/plain'}
All allowed elements are listed in scrapy_rss/items.py. All allowed attributes of each element, with constraints and default values, are listed in scrapy_rss/elements.py. You can also read the RSS specification for more details.
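As background for the pubDate example, the RSS 2.0 specification expects dates in the RFC 822 style. The sketch below shows what such a string looks like for a Python datetime, using only the standard library. How scrapy_rss serializes dates internally is not shown in this document, so this is an illustration of the target format rather than of the library's implementation.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# A fixed, timezone-aware datetime so the result is reproducible.
published = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)

# usegmt=True renders the UTC offset as the literal "GMT".
rfc822_date = format_datetime(published, usegmt=True)
print(rfc822_date)  # Tue, 02 Jan 2024 03:04:05 GMT
```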
You can extend RssItem to add new XML fields, which may or may not be namespaced.
Namespaces can be specified in attribute and/or element constructors.
A namespace prefix can be specified in the attribute/element name using a double underscore as the delimiter (prefix__name) or in the attribute/element constructor using the ns_prefix argument.
A namespace URI can be specified using the ns_uri argument of the constructor.
from scrapy_rss.meta import ItemElementAttribute, ItemElement
from scrapy_rss.items import RssItem

class Element0(ItemElement):
    # attributes without special namespace
    attr0 = ItemElementAttribute(is_content=True, required=True)
    attr1 = ItemElementAttribute()

class Element1(ItemElement):
    # attribute "prefix2:attr2" with namespace xmlns:prefix2="id2"
    attr2 = ItemElementAttribute(ns_prefix="prefix2", ns_uri="id2")

    # attribute "prefix3:attr3" with namespace xmlns:prefix3="id3"
    prefix3__attr3 = ItemElementAttribute(ns_uri="id3")

    # attribute "prefix4:attr4" with namespace xmlns:prefix4="id4"
    fake_prefix__attr4 = ItemElementAttribute(ns_prefix="prefix4", ns_uri="id4")

    # attribute "attr5" with default namespace xmlns="id5"
    attr5 = ItemElementAttribute(ns_uri="id5")

class MyXMLItem(RssItem):
    # element <elem1> without namespace
    elem1 = Element0()

    # element <elem_prefix2:elem2> with namespace xmlns:elem_prefix2="id2e"
    elem2 = Element0(ns_prefix="elem_prefix2", ns_uri="id2e")

    # element <elem_prefix3:elem3> with namespace xmlns:elem_prefix3="id3e"
    elem_prefix3__elem3 = Element1(ns_uri="id3e")

    # yet another element <elem_prefix4:elem3> with namespace xmlns:elem_prefix4="id4e"
    # (does not conflict with previous one)
    fake_prefix__elem3 = Element0(ns_prefix="elem_prefix4", ns_uri="id4e")

    # element <elem5> with default namespace xmlns="id5e"
    elem5 = Element0(ns_uri="id5e")
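The naming rules stated in the comments above can be summarized in a few lines of plain Python. The function below is a hypothetical helper written for illustration only (it is not part of scrapy_rss), and it assumes the identifier is split at the first double underscore. It reproduces the names given in the comments: an explicit ns_prefix wins over a prefix embedded in the identifier, and an identifier without either yields an unprefixed name.

```python
def qualified_name(python_name, ns_prefix=None):
    """Return the XML name ("prefix:name" or "name") for an identifier.

    Hypothetical helper mirroring the commented rules; not scrapy_rss API.
    """
    # Split at the first double underscore, if there is one (assumption).
    name_prefix, delimiter, local_name = python_name.partition('__')
    if not delimiter:
        # No delimiter: the whole identifier is the local name.
        name_prefix, local_name = None, python_name
    # An explicit ns_prefix argument overrides the prefix in the identifier.
    prefix = ns_prefix if ns_prefix is not None else name_prefix
    return f'{prefix}:{local_name}' if prefix else local_name


print(qualified_name('attr2', ns_prefix='prefix2'))               # prefix2:attr2
print(qualified_name('prefix3__attr3'))                           # prefix3:attr3
print(qualified_name('fake_prefix__attr4', ns_prefix='prefix4'))  # prefix4:attr4
print(qualified_name('attr5'))                                    # attr5
```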
Elements and their attributes are accessed the same way as with simple items:
item = MyXMLItem()
item.title = 'Some title'
item.elem1.attr0 = 'Required content value'
item.elem1 = 'Another way to set content value'
item.elem1.attr1 = 'Some attribute value'
item.elem_prefix3__elem3.prefix3__attr3 = 'Yet another attribute value'
item.elem_prefix3__elem3.fake_prefix__attr4 = '' # non-None value is interpreted as assigned
item.fake_prefix__elem3.attr1 = 42
Several optional settings are allowed for namespaced items:
- FEED_NAMESPACES - list of tuples [(prefix, URI), ...] or dictionary {prefix: URI, ...} of namespaces that must be defined in the root XML element
- FEED_ITEM_CLASS or FEED_ITEM_CLS - main class of feed items (class object MyXMLItem or path to class "path.to.MyXMLItem"). Default value: RssItem. It is used to extract all possible namespaces that will be declared in the root XML element.
Feed items do NOT have to be instances of this class or its subclass.
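As a sketch of the two accepted shapes of FEED_NAMESPACES, the fragment below writes the same namespaces once as a list of tuples and once as a dictionary, reusing prefixes and URIs from the MyXMLItem example. The variable names with the _AS_LIST and _AS_DICT suffixes are only for side-by-side comparison; in a settings file a single FEED_NAMESPACES would be defined. The module path in FEED_ITEM_CLASS is a hypothetical placeholder.

```python
# Form 1: list of (prefix, URI) tuples.
FEED_NAMESPACES_AS_LIST = [
    ('elem_prefix2', 'id2e'),
    ('elem_prefix3', 'id3e'),
]

# Form 2: dictionary {prefix: URI}.
FEED_NAMESPACES_AS_DICT = {
    'elem_prefix2': 'id2e',
    'elem_prefix3': 'id3e',
}

# Path to the main item class; "myproject.items" is an assumed module name.
FEED_ITEM_CLASS = 'myproject.items.MyXMLItem'

# Both forms carry the same prefix-to-URI mapping.
print(dict(FEED_NAMESPACES_AS_LIST) == FEED_NAMESPACES_AS_DICT)  # True
```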
If these settings are not defined, or only some of the namespaces are defined,
then the other namespaces in use will be declared either in the <item> element
or in its subelements when these namespaces are not unique.
Each <item> element and its subelements always contain
only the namespace declarations of non-None
attributes (including those that are interpreted as element content).
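Where a namespace is declared does not change the meaning of the elements that use it. The sketch below parses two hand-written XML snippets with the standard library, one declaring the namespace in the root element and one declaring it in the item element, and shows that both resolve to the same namespaced tag. The XML is illustrative and is not actual scrapy_rss output; the helper namespaced_tags is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace declared once in the root element.
declared_in_root = (
    '<rss version="2.0" xmlns:elem_prefix2="id2e"><channel><item>'
    '<elem_prefix2:elem2>value</elem_prefix2:elem2>'
    '</item></channel></rss>'
)

# The same namespace declared in the <item> element instead.
declared_in_item = (
    '<rss version="2.0"><channel><item xmlns:elem_prefix2="id2e">'
    '<elem_prefix2:elem2>value</elem_prefix2:elem2>'
    '</item></channel></rss>'
)


def namespaced_tags(xml_text):
    """Hypothetical helper: list tags that ElementTree resolved to a namespace."""
    root = ET.fromstring(xml_text)
    return [element.tag for element in root.iter() if element.tag.startswith('{')]


tags_root = namespaced_tags(declared_in_root)
tags_item = namespaced_tags(declared_in_item)
print(tags_root)  # ['{id2e}elem2']
print(tags_item)  # ['{id2e}elem2']
```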
The examples directory contains several Scrapy projects that demonstrate scrapy_rss usage. They crawl this website whose source code is here.
Just go to the Scrapy project directory and run the commands:
scrapy crawl first_spider
scrapy crawl second_spider
Thereafter feed.rss and feed2.rss files will be created in the same directory.