Tanker is a Python database library targeting analytic operations


Keywords
database, postgresql, python, sqlite
License
MIT
Install
pip install Tanker==0.8.9

Documentation

Tanker

Tanker is a Python database library targeting analytic operations but it also fits most transactional processing.

As its core it's mainly a query builder that simplify greatly join operations. It comes with a way to automatically create the database tables based on your schema definition and it can also introspect existing db and infer the needed metadata.

Currently Postgresql and Sqlite are supported and the API is made to seamlessly integrate pandas DataFrames.

Licence

Tanker is available under the ISC Licence, see LICENCE file at the root of the repository.

Main features

Schema definition and database connection

The file schema.yaml defines the database structure: table, columns (and their types) and key.

    - table: team
      columns:
        name: varchar
        country: m2o country.id
      key:
        - name
        - country
    - table: country
      columns:
        name: varchar
      key:
        - name

The code here-under create the config dictionary and use it to connect to the database and creates the tables.

    from tanker import connect, create_tables, View, yaml_load

    cfg = {
        'db_uri': 'sqlite:///test.db',
        'schema': yaml_load(open('schema.yaml')),
    }
    with connect(cfg):
        create_tables()

Tanker automatically add an id column on each table, to allow to define foreign keys. For example, in the yaml definition, country: m2o country.id means that a many-to-one relation will be created between the tables team and country. When the team table will be created this will generate the following column definition:

"country" INTEGER REFERENCES "country" (id) ON DELETE CASCADE

If not specified, sqlite:///:memory: will be used as db_uri. To use Postgresql, the uri should looks like postgresql://login:passwd@hostname/dbname (and you can choose the postgres schema to use by appending #shema_nameto the uri)

Note that every database interaction must happen inside the with connect(cfg) block.

Read & write

Tanker usage is centered around the View object, it is used to define a mapping between the relational world and Python. For example, to write and read countries, we define a view based on the country table:

    country_view = View(
        'country',  # The base table
        ['name']    # The fields we want to map
    )

So now we can write to the database:

    countries = [['Belgium'], ['France']]
    country_view.write(countries)

And read it back.

    countries_copy = country_view.read().all()

And countries_copy should be identical to countries. As .read() returns the database cursor, the .all() allows to fetch all the records. Instead of .all() one can use .df() to receive a pandas DataFrame.

Key role

As you can see in the database definition, each table comes with a key attribute. This attribute contains the list of columns that form a natural key.

This key is required by design in Tanker, its main role is to allow Tanker to know what to do with each record when View.write is called. Thanks to the key, we know if the record is already in the database (and in this case will generate an UPDATE statement) or if the record is new (and use an INSERT query).

It's especially handy when dealing for example with data coming from a website scraper or from an spreadsheet, where a technical id (like an integer or a uuid) is not always available.

To avoid to launch one query per record and suffer from network latencies, what Tanker do to speed up writes is to create a temporary table, insert all the record as one batch and then join this temporary table with the actual one to know which record to insert and which to update.

Foreign key resolution

To populate the team table we have to provide a team name and a country. We can do it like this:

    team_view = View('team, ['name', 'country'])
    team_view.write([['Red', 1]])

But it's more convenient to use the country name instead of it's id:

    teams = [
        ['Blue', 'Belgium'],
        ['Red', 'Belgium'],
        ['Blue', 'France'],
    ]
    team_view = View('team, ['name', 'country.name'])
    team_view.write(teams)

You can see that we changed country into country.name in the view, which means that the use the name column to identify the country (which is conveniently defined as the key in the table definition).

We can go further and use more than one dot and let Tanker resolve foreign key for us. Let's say we want to add a member table to our database, we append the following piece of yaml to our schema file

    - table: member
      columns:
        name: varchar
        registration_code: varchar
        team: m2o team.id
      key:
        - registration_code

And re-run the create_tables() as above. Now we can do:

rows = View('member', ['name', team.country.name]).read()

Here, two join queries will be automatically generated, one between member and team and one between team and country.

To add a member we have to link it to a team, whose key is composed of both the name and the country column (so we allow two teams with the same name in different countries):

    members = [
        ['Bob', 'Belgium', 'Blue', '001'],
        ['Alice', 'Belgium', 'Red', '002'],
        ['Trudy', 'France', 'Blue', '003'],
    ]
    member_view = View('member', ['name', 'team.country.name', 'team.name',
                                  'registration_code'],
    ])
    member_view.write(members)

Tanker will be able to identify for each member the correct team based on both country name and team name.

Filters

The read method accept a filters argument it can be a string or a list of strings. Filter strings use s-expression notation. So for example to filter a country by name you can do:

    filters = '(= name "Belgium")'
    country_view.read(filters)

or to get registration_code above a given value:

    member_view.read('(> registration_code "002")')

You can also combine those filters and use the dot notation:

filters = '(or ((> registration_code "002") (= team.country.name "Belgium")))'
member_view.read(filters).read()

The filters argument can also be a list, in this case all items are regrouped in a conjunction, equivalent to (and item1 item2 ...).

Query arguments

To facilitate the building of queries and more importantly to prevent sql injections you can use arguments. They use the syntax of Python own string format method, and will make use of the DB-API's parameter substitution (see for example the sqlite documentation):

    cond = '(= name {name})'
    rows = team_view.read(cond).args(name='Blue')

You can also pass list values, they will be automatically expanded. And you can use the dot notation to reach a given parameter in the object passed as argument:

    cond = '(or (in name {names}) (= registration_code {data.code}))'
    rows = member_view.read(cond).args(names=['Alice', 'Bob'], data=my_object)

The dot notation also supports dictionnaries, so the above example whould work with data={'code': '001'}. The query arguments can also refer to values from the configuration (which can be reach from the ctx object), like:

    ctx.cfg['default_team'] = 'Red'
    cond = '(in name {default_team})'
    rows = view.read(cond)

Finally, arguments can be a list instead of a dict and can be passed to the read method, so:

    cond = '(in name {names})'
    rows = team_view.read(cond).args(names=['Blue', 'Red'])

is equivalent to

    cond = '(in name {} {})'
    rows = team_view.read(cond).args('Blue', 'Red')

and is equivalent to

    cond = '(in name {} {})'
    rows = team_view.read(cond, args=['Blue', 'Red'])

Pandas Dataframes

Instead of passing a list of list we can use a dataframe, and use a dictionary to map dataframe columns to database columns.

    df = DataFrame({
        'Team': ['Blue', 'Red'],
        'Country': ['France', 'Belgium']
        })
    view = View('team', {
        'Team': 'name',
        'Country': 'country.name',
    })
    view.write(data)

    df_copy = view.read().df()

ACL

The ACL system of Tanker allows to systemically filter access to the rows of any table. We can define filters for reading data or writing it. To enable it, you can add it to the cfg parameter of connect, but you can also change it later:

cfg['acl-read'] = {'team': "(= team.country.name 'Belgium')"}
with connect(cfg):
	teams = View('team').read().all() # No belgian team will be returned
	cfg['acl-read'] = {}              # we reset the acl
	teams = View('team').read().all() # Returns everything again

As you can see it's a simple dict whose keys are table names and values are Tanker filters. Similarly, you can add an acl-write dictionary to cfg:

cfg['acl-read'] = {'team': "(= team.country.name 'Belgium')"}
View('team').write(teams)

Documentation TODO

  • Deletion (by data, by filter)
  • Aliases

Roadmap

Some ideas, in no particular order:

  • Add a view.insert method that bypass tmp table and write directly to the actual table
  • Support for version column (probably a write timestamp)
  • Add support for other 'ON CONFLICT' action (like incrementing a version column, or appening to an array)
  • Support for table constraints
  • Allow to execute complete queries with s-expressions (select, update, insert and delete).