Data collection manager


Keywords
data, python
License
MIT
Install
pip install aswan==0.5.14

Documentation

aswan


Collect and organize data into a T1 data depot. The package is named after the Aswan Dam.

Collect and compress data from the internet for later parsing

  • quick, parallel, and customizable collection
  • compressed storage
  • quick to sync with a remote store
    • sync to continue collecting
    • sync to parse
  • immutable collection

To Set Up a Remote

Set the environment variables ASWAN_AUTH_HEX and ASWAN_AUTH_PASS as required by the zimmauth package, and set ASWAN_REMOTE to the name of the default remote.
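
Before collecting, it can help to confirm that all three variables are present. The sketch below is a minimal check, not part of aswan or zimmauth: the variable names are the ones listed above, while the helper missing_remote_settings is a hypothetical name introduced for illustration. It does not validate the values, since their format is defined by zimmauth.

```python
import os

# Variable names come from the text above; the helper itself is hypothetical.
REQUIRED_VARS = ("ASWAN_AUTH_HEX", "ASWAN_AUTH_PASS", "ASWAN_REMOTE")


def missing_remote_settings(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_remote_settings() with no argument inspects the current process environment. An empty list means all three variables are set.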

Concepts

  • objects
    • saved by collection events
  • events
    • collection
    • registration (v2: registration for parsing)
    • (v2) parsing
  • runs
    • manual run vs automated run
      • makes adding URLs manually easy but revertible
    • has unique id
    • generates events
    • linked to a specific version of the code
      • ideally commit hash + pip freeze
  • statuses
    • determined by base status + runs integrated
    • contains
      • what urls need to be collected
      • (v2) what collected objects need to be parsed
    • sqlite file, constantly trimmed
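
The relationship between runs, events, and statuses can be illustrated with a small in-memory model. This is only a sketch under assumptions, not aswan's actual classes: the names Event, Run, and Status.integrate are hypothetical. It also assumes that a registration event marks a URL as needing collection and that a collection event marks it as done.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    kind: str  # "registration" or "collection" (assumed labels)
    url: str


@dataclass
class Run:
    run_id: str
    events: list = field(default_factory=list)


@dataclass
class Status:
    pending_urls: frozenset = frozenset()
    integrated_runs: tuple = ()

    def integrate(self, run):
        """Return a new status: this one as base, plus the given run."""
        pending = set(self.pending_urls)
        for event in run.events:
            if event.kind == "registration":
                pending.add(event.url)  # URL still needs to be collected
            elif event.kind == "collection":
                pending.discard(event.url)  # URL has been collected
        return Status(frozenset(pending), self.integrated_runs + (run.run_id,))


base = Status()
run = Run(
    "run-1",
    [
        Event("registration", "https://example.com/a"),
        Event("registration", "https://example.com/b"),
        Event("collection", "https://example.com/a"),
    ],
)
status = base.integrate(run)
print(sorted(status.pending_urls))  # ['https://example.com/b']
```

Returning a new Status instead of mutating the old one mirrors the idea that a status is determined by its base status plus the runs integrated into it.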

Structure

  • objects

    • 00, 01, ...
  • runs

    • run-hash
      • context.yaml
        • commit-hash, pip-freeze, ...
      • events.zip
  • statuses

    • status-hash
      • context.yaml
        • parent-status, integrated
      • db.sqlite.zip
  • current-run

    • context.yaml
    • events
      • these are to be compressed into ../runs
    • status.sqlite
  • there is a 'TEST' status

    • whatever is based on it cannot be integrated
    • a test run can be made on it...
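
As an illustration of the layout above, the following sketch creates the empty top-level directories of a depot. The directory names are taken from the outline. The function init_depot is hypothetical and is not how aswan itself initializes a depot.

```python
from pathlib import Path

# Directory names follow the outline above; the function is hypothetical.
TOP_LEVEL_DIRS = ("objects", "runs", "statuses", "current-run")


def init_depot(root):
    """Create the empty top-level directories of a depot under root."""
    root = Path(root)
    for name in TOP_LEVEL_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    # current-run holds uncompressed events until the run finishes
    (root / "current-run" / "events").mkdir(exist_ok=True)
    return root
```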

when starting a run:

  • check if current-run is empty
    • if not, fail with
  • find latest status
    • if it has not integrated all past runs, create a new status that has integrated them
  • start collection (+ registration)
  • whether the run stops properly or breaks, all events and objects are saved to disk
  • if it stops properly, move and compress the contents of current-run
    • based on the status the run started from, and the current run id
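
The first and last steps of this procedure can be sketched with the standard library. This is a simplified illustration, not aswan's implementation: start_run, finish_run, and CurrentRunNotEmpty are hypothetical names, and the status handling (finding the latest status, integrating past runs) is left out entirely.

```python
import shutil
import zipfile
from pathlib import Path


class CurrentRunNotEmpty(RuntimeError):
    """Raised when a previous run left files in current-run (hypothetical)."""


def start_run(depot):
    """Refuse to start unless current-run is empty, then prepare it."""
    current = Path(depot) / "current-run"
    current.mkdir(parents=True, exist_ok=True)
    if any(current.iterdir()):
        raise CurrentRunNotEmpty(str(current))
    (current / "events").mkdir()
    return current


def finish_run(depot, run_id):
    """Compress current-run/events into runs/<run_id>/events.zip, then clear current-run."""
    depot = Path(depot)
    current = depot / "current-run"
    run_dir = depot / "runs" / run_id
    run_dir.mkdir(parents=True)
    with zipfile.ZipFile(run_dir / "events.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted((current / "events").iterdir()):
            zf.write(path, arcname=path.name)
    # Leave an empty current-run behind so the next run can start
    shutil.rmtree(current)
    current.mkdir()
    return run_dir
```

In this sketch a run that breaks never reaches finish_run, so its events stay on disk in current-run and the next start_run fails until they are dealt with.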

Pre v1.0 laundry list

  • parallelize push / pull

  • parsing/connection/broken session error docs

  • transferring / ignoring cookies

  • template projects

    • oddsportal
      • updating thingy, based on latest match in season
    • footy
    • rotten
    • boxoffice