Verata

Yet another scraper

Why even consider?

It works from a correct config file alone; no coding is required.

How to install

pip install verata

Supported versions

It is tested on Python versions:

Usage

As a crawler (travels through the whole site):

verata --config=config-file.yml --output=<output_file> crawl

Optionally, you can set up an environment file:

verata --config=config-file.yml --env=.secret-env --output=<output_file> crawl

As a scraper (reads only one specific link):

verata --config=config-file.yml --output=<output_file> scrape --link=<actual url>

Docs

http://verata.readthedocs.io

CLI Params

--env - environment file (default: .env). It holds variables that you don't want to expose in the config (e.g. password, username).
--config - configuration file (default: config.yml)
--log_level - logging level (default: INFO)
--debug - shortcut for logging level DEBUG
--output - file in which the results are stored
--paginate - crawl pages in chunks, resting after each chunk (only makes sense if rest_interval is more than 0 seconds). Used to avoid getting banned or bringing the site down.
--rest_interval - time to rest between chunks (default: 10s). The letter at the end gives the unit: s - seconds, m - minutes, h - hours.
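
To make the --rest_interval format concrete, here is a minimal Python sketch of how a value such as 10s, 5m or 1h could be converted to seconds. The helper name parse_interval is hypothetical and is not taken from verata's source; it only illustrates the s/m/h suffix convention described above.

```python
import re

# Hypothetical helper for illustration; not verata's actual implementation.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_interval(text):
    """Convert a string such as '10s', '5m' or '1h' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported interval: {text!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


print(parse_interval("10s"))  # the documented default
```

A strict pattern is used here so that a value without a unit letter is rejected rather than silently guessed.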

Config example

---
name: Python Org scraper
description: Just scrape it for testing
site_root: https://www.python.org
start_page: /blogs
cookies:
  authToken: abc1234
  remember: true
headers:
  "User-Agent": "Mozilla/5"
pages:
  - name: Blog
    link_pattern: /blog%
    mappings:
      - name: title
        path: h3[class="event-title"]/a
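
As a rough illustration of how site_root and start_page from the config above relate, the following Python sketch joins them into one absolute URL. It assumes that the crawl starts at site_root combined with start_page; the function name start_url is hypothetical and this is not verata's own code.

```python
from urllib.parse import urljoin

# Values copied from the config example above.
config = {
    "site_root": "https://www.python.org",
    "start_page": "/blogs",
}


def start_url(cfg):
    # Assumption: the first page visited is site_root joined with start_page.
    return urljoin(cfg["site_root"], cfg["start_page"])


print(start_url(config))  # https://www.python.org/blogs
```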

It supports logging in to websites as well...

name: A super secret page
description: Only we have access
site_root: http://page.secret
start_page: /restricted_area
auth:
  url: /login
  method: POST
  params:
    user: {{ secret_user }}
    password: {{ secret_password }}

The locked web is a big part of the internet, but it is rarely accessed by scrapers. This tool makes it possible to log in to some of those sites (CAPTCHA is a bit of a pain).

The variables secret_user and secret_password are picked up from the .env file, which would look like this:

secret_user=demo
secret_password=demo

It is done this way because there are usually some variables we don't want to expose in the config or commit to any source control system.
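
The substitution described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helpers parse_env and substitute are hypothetical names, the env file is assumed to contain plain key=value lines, and the placeholders are assumed to have the {{ name }} form shown in the config. It is not how verata is confirmed to implement it.

```python
import re


def parse_env(text):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


# Matches placeholders such as {{ secret_user }}.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def substitute(template, values):
    """Replace each {{ name }} placeholder with its value from `values`."""
    return _PLACEHOLDER.sub(lambda m: values[m.group(1)], template)


env = parse_env("secret_user=demo\nsecret_password=demo\n")
print(substitute("user: {{ secret_user }}", env))  # user: demo
```

A placeholder whose name is missing from the env file raises KeyError in this sketch, which makes a forgotten secret fail early instead of being sent as an empty value.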

Version history

1.2:

introduced scraping functionality
fixed some bugs in data collection process

1.1:

args parsed from the same page are grouped into an array
elements selected by path now provide all attributes together with the element text

1.0 - basic functionality
