Luigi tasks for building an article metadata index for

pip install siskin==0.43.0



Various tasks for heterogeneous metadata handling for Project finc. Based on luigi from Spotify.

pypi version


$ pip install -U siskin

Currently, Python 2 and 3 support is on the way and the installation might be a bit flaky.

Run taskchecksetup to see, what needs to be installed.

$ taskchecksetup
ok     7z
ok     curl
ok     filterline
ok     groupcover
ok     hurrly
ok     jq
ok     metha-sync
ok     microblob
ok     pigz
ok     solrbulk
ok     span-import
ok     unzip
ok     wget
ok     xmllint
ok     yaz-marcdump


For siskin updates a

$ pip install -U siskin

should suffices.


List tasks:

$ tasknames

Run simple task:

$ taskdo DOAJDump


$ taskdocs | less -R

Remove artefacts of a task:

$ taskrm DOAJDump


The siskin package harvests all kinds of data sources, some of which might be protected. All credentials and a few other configuration options go into siskin.ini, either below /etc/siskin/ or ~/.config/siskin/.

Luigi uses a bit of configuration as well, put it under /etc/luigi/.

Completions on task names will save you typing and time, so put under /etc/bash_completion.d or somewhere else.

#   $ tree etc
#   etc
#   ├── bash_completion.d
#   │   └──
#   ├── luigi
#   │   ├── client.cfg
#   │   └── logging.ini
#   └── siskin
#       └── siskin.ini

All configuration values can be inspected quickly with:

$ taskconfig
home = /var/siskin

listings-url =


ftp-username = abc
ftp-password = d3f

Software versioning

Since siskin works mostly on data, software versioning differs a bit, but we try to adhere to the following rules:

  • major changes: You need to recreate all your data from scratch.
  • minor changes: We added, renamed or removed at least one task. You will have to recreate a subset of the tasks to see the changes. You might need to change pipelines depending on those tasks, because they might not exist any more or have been renamed.
  • revision changes: A modification within existing tasks (e.g. bugfixes). You will have to recreate a subset of the tasks to see this changes, but no new task is introduced. No pipeline is broken, that wasn't already.

These rules apply for version 0.2.0 and later.

Schema changes

To remove all files of a certain format (due to schema changes or such) it helps, if naming is uniform:

$ tasknames | grep IntermediateSchema | xargs -I {} taskrm {}

Apart from that, all upstream tasks need to be removed manually (consult the map) as this is not automatic yet.

Task dependencies

Inspect task dependencies with:

$ taskdeps JstorIntermediateSchema
  └─ JstorIntermediateSchema(date=2018-05-25)
      └─ AMSLService(date=2018-05-25, name=outboundservices:discovery)
      └─ JstorCollectionMapping(date=2018-05-25)
      └─ JstorIntermediateSchemaGenericCollection(date=2018-05-25)

Or visually via graphviz.

$ taskdeps-dot JstorIntermediateSchema | dot -Tpng > deps.png

Evolving workflows