Scrapy Warcio

A Web Archive WARC I/O module for Scrapy

Install

$ pip install scrapy-warcio

Usage

Create a project and spider:
https://docs.scrapy.org/en/latest/intro/tutorial.html

$ scrapy startproject <project>
$ cd <project>
$ scrapy genspider <spider> example.com

Copy the settings.yml distributed with scrapy_warcio and edit it with your configuration settings:

---
warc_spec: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/
max_warc_size: 10000000000  # 10GB

collection: ~ # collection name
description: ~ # collection description
operator: ~ # operator email address
robots: ~  # robots policy (obey or ignore)
user_agent: ~ # your user-agent
warc_prefix: ~ # WARC filename prefix
warc_dest: ~ # WARC files destination
...
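
Before starting a crawl, it can help to confirm that the placeholders in the template have been filled in. The following is a minimal sketch, not part of scrapy_warcio: the helper name missing_settings is hypothetical, and it assumes that every key shown as ~ above is meant to receive a value and that robots accepts only obey or ignore, as the template comment suggests.

```python
# Keys taken from the settings.yml template above.
REQUIRED_KEYS = (
    'collection', 'description', 'operator', 'robots',
    'user_agent', 'warc_prefix', 'warc_dest',
)


def missing_settings(settings):
    """Return a list of problems found in a settings mapping.

    Hypothetical helper for illustration; not part of scrapy_warcio.
    """
    # A "~" in YAML loads as None, so None means "still a placeholder".
    problems = [key for key in REQUIRED_KEYS if settings.get(key) is None]
    robots = settings.get('robots')
    if robots is not None and robots not in ('obey', 'ignore'):
        problems.append('robots (expected obey or ignore)')
    return problems


# In practice the mapping would be loaded from the YAML file (for example
# with PyYAML's yaml.safe_load); a literal dict keeps this sketch runnable.
example = {
    'collection': 'example-crawl',
    'description': None,
    'operator': 'me@example.com',
    'robots': 'maybe',
    'user_agent': 'example-bot',
    'warc_prefix': 'example',
    'warc_dest': '/tmp/warcs',
}
print(missing_settings(example))
# prints ['description', 'robots (expected obey or ignore)']
```

An empty list means every key in the template has a value.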

Export the path to your settings file:

$ export SCRAPY_WARCIO_SETTINGS=/path/to/settings.yml

Add WarcioDownloaderMiddleware (distributed as middlewares.py) to your <project>/<project>/middlewares.py:

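import scrapy_warcio


class WarcioDownloaderMiddleware:
    """Passes each response and its request to ScrapyWarcIo.write()."""

    def __init__(self):
        self.warcio = scrapy_warcio.ScrapyWarcIo()

    def process_request(self, request, spider):
        # Stamp the request with a WARC-Date value before it is downloaded.
        request.meta['WARC-Date'] = scrapy_warcio.warc_date()
        return None  # let Scrapy continue processing the request

    def process_response(self, request, response, spider):
        # Archive the exchange, then pass the response on unchanged.
        self.warcio.write(response, request)
        return response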
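
The middleware stores the result of scrapy_warcio.warc_date() under request.meta['WARC-Date']. The exact output of that function is not shown here. The WARC 1.0 specification describes WARC-Date as a UTC timestamp of the form YYYY-MM-DDThh:mm:ssZ, and a value in that form could be produced roughly as follows. The helper name utc_warc_timestamp is hypothetical and is not the library's implementation.

```python
from datetime import datetime, timezone


def utc_warc_timestamp(now=None):
    """Return a UTC timestamp in the form YYYY-MM-DDThh:mm:ssZ.

    Hypothetical helper for illustration; not part of scrapy_warcio.
    """
    now = now or datetime.now(timezone.utc)
    # Convert to UTC so that the trailing "Z" is accurate.
    return now.astimezone(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')


fixed = datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(utc_warc_timestamp(fixed))  # prints 2020-01-02T03:04:05Z
```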

Enable WarcioDownloaderMiddleware in <project>/<project>/settings.py:

DOWNLOADER_MIDDLEWARES = {
    '<project>.middlewares.WarcioDownloaderMiddleware': 543,
}

Validate your WARCs with internetarchive/warctools:

$ warcvalid WARC.warc.gz
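
As a quick sanity check before running a full validation, the sketch below uses only the Python standard library to confirm that a file is readable as gzip and that its first line is a WARC version line such as WARC/1.0. It is an illustration under those assumptions, not a substitute for warcvalid, and the function name looks_like_warc is hypothetical.

```python
import gzip


def looks_like_warc(path):
    """Return True if the gzip file at `path` starts with a WARC version line.

    Hypothetical helper for illustration; it inspects only the first line.
    """
    try:
        with gzip.open(path, 'rb') as stream:
            first_line = stream.readline()
    except (OSError, EOFError):
        # Missing file, not gzip data, or a truncated gzip stream.
        return False
    return first_line.startswith(b'WARC/')
```

For example, looks_like_warc('WARC.warc.gz') returns False if the file is missing or is not gzip-compressed WARC data.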

Upload your WARC(s) to your favorite web archive!

Help

$ pydoc scrapy_warcio

>>> help(scrapy_warcio)

TODO

Making this a Scrapy extension may make it more useful:
https://docs.scrapy.org/en/latest/topics/extensions.html

@internetarchive

scrapy-warcio
Release 0.0.8

