tf-data-athena

An implementation of tf.data.Dataset for AWS Athena


License
MIT
Install
pip install tf-data-athena==1.0.1

Documentation

TensorFlow Data for AWS Athena

An AWS Athena library for tf.data.Dataset. If you are not familiar with tf.data, take a look at the documentation and this example.

Installation

Installation is as simple as a pip install:

pip install tf-data-athena

How to use

Usage is almost as simple as any other tf.data.Dataset implementation. You just need to create a dataset using the function create_athena_dataset.

No credentials are passed explicitly; authentication follows the AWS credential chain used by boto3.

# imports
from tf_data_athena import create_athena_dataset

# connector parameters
s3_output_location = "s3://my-bucket/my-folder/athena-outputs" # Athena output bucket folder
waiting_interval = 0.1 # Time (in seconds) to wait before asking for query state

# query
query = "select * from my_namespace.my_table"

# create dataset
dataset = create_athena_dataset(query, s3_output_location, waiting_interval=waiting_interval)

Now, dataset is an instance of tf.data.Dataset containing query results.

Parameters

The factory function create_athena_dataset has the following parameters:

  • query: The query to run in Athena
  • s3_output_location: An S3 path, writable by the current account, where the query results file will be saved
  • waiting_interval: A float giving the number of seconds to wait between checks of the query status on Athena
  • num_parallel_calls: Argument passed to tf.data.Dataset.map (see documentation) when parsing result rows
  • other named arguments: Any other named argument is passed to the tf.data.TextLineDataset constructor (see documentation)
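To make the role of waiting_interval concrete, here is a minimal sketch of a status-polling loop. It is not the library's actual code. The names wait_for_query, get_state and boto3_state_getter are hypothetical and introduced only for illustration. The Athena state names and the boto3 get_query_execution call are the standard ones.

```python
import time

# Terminal states reported by Athena's GetQueryExecution API.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_query(get_state, waiting_interval=0.1, sleep=time.sleep):
    """Poll get_state() until it returns a terminal state, then return it.

    get_state is a hypothetical zero-argument callable that returns the
    current query state as a string (for example "QUEUED" or "RUNNING").
    """
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        # Not finished yet: pause before asking Athena again.
        sleep(waiting_interval)


def boto3_state_getter(query_execution_id):
    """Build a get_state callable backed by boto3 (hypothetical helper)."""
    import boto3  # imported lazily; uses boto3's default credential chain

    client = boto3.client("athena")

    def get_state():
        response = client.get_query_execution(QueryExecutionId=query_execution_id)
        return response["QueryExecution"]["Status"]["State"]

    return get_state


if __name__ == "__main__":
    # Simulated state sequence, so the sketch runs without AWS access.
    fake_states = iter(["QUEUED", "RUNNING", "SUCCEEDED"])
    print(wait_for_query(lambda: next(fake_states), waiting_interval=0.01))
```

A smaller waiting_interval notices a finished query sooner but issues more API calls. A larger one does the opposite.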

AWS Authorization

This library uses boto3 behind the scenes, so it follows the same authentication/authorization chain. The authorized user or service needs permission to create and execute Athena queries and to create and read S3 objects in the folder defined by s3_output_location.
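As an illustration of the permissions described above, the following sketch builds a candidate IAM policy document from s3_output_location. The helper names split_s3_uri and athena_policy are hypothetical, and the action list is an assumption that is likely incomplete: real setups usually also need read access to the queried data and to the data catalog.

```python
import json
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split "s3://bucket/some/prefix" into ("bucket", "some/prefix")."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.strip("/")


def athena_policy(s3_output_location):
    """Return an illustrative (assumed, not exhaustive) IAM policy dict."""
    bucket, prefix = split_s3_uri(s3_output_location)
    # Objects under the output folder, or the whole bucket if no prefix.
    key_pattern = f"{prefix}/*" if prefix else "*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Create and execute Athena queries, and check their status.
                "Effect": "Allow",
                "Action": [
                    "athena:StartQueryExecution",
                    "athena:GetQueryExecution",
                ],
                "Resource": "*",
            },
            {
                # Create and read result objects in the output folder.
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{key_pattern}",
            },
        ],
    }


if __name__ == "__main__":
    policy = athena_policy("s3://my-bucket/my-folder/athena-outputs")
    print(json.dumps(policy, indent=2))
```

Treat the output as a starting point to review against your own account setup, not as a verified minimal policy.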