Mara Google Sheet Downloader
This package contains a google sheet downloader to be used with the mara ETL framework:
- Download a Google sheet to a database table
- Cells can be validated and formatted during download
To use the library directly, use pip:
pip install mara-google-sheet-downloader
pip install git+https://github.com/mara/mara-google-sheet-downloader.git
Here is a pipeline "gs_demo" which downloads to a table. This assumes you have a spread sheet under the URL https://docs.google.com/spreadsheets/d/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/edit. This sheet must be shared with the email address for which you configured the credentials (see below).
The spreadsheet contains a worksheet 'test' which contains the following colums:
|Berlin||01.01.2020||3||ja||2.3||added by JK on 2020-03-12|
Note that it doesn't contain an id column at the start!
from mara_pipelines.pipelines import Pipeline, Task from mara_pipelines.commands.sql import ExecuteSQL from mara_google_sheet_downloader.mara_integration import DownloadGoogleSpreadsheet pipeline = Pipeline( id='gs_demo', description='A small pipeline that demonstrates the a google sheet download') pipeline.add(Task( id='download_', description='Download a google sheet', commands=[ ExecuteSQL( sql_statement=f""" -- Creates the table where the google sheet data should end up in DROP TABLE IF EXISTS public.gs_test; CREATE TABLE public.gs_test ( id BIGINT PRIMARY KEY, city TEXT, some_date DATE, amount INTEGER, is_paid BOOLEAN, price DOUBLE PRECISION, comments TEXT ) """, echo_queries=False, ), DownloadGoogleSpreadsheet( spreadsheet_key='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', # from the URL worksheet_name='test', # worksheet # Validators: # c: counter (additional column, will start at 1 and count each row), # s: string, # d(in_fmt=%d.%m.%Y): date in YYYY.mm.dd format, # i: integer, # b(true:ja,false=nein): boolean with ja/nein interpreted as True/False, # f: float, # s: string # any cell which does not confirm to this spec will fail the import! columns_definition='csd(in_fmt=%d.%m.%Y)ib(true:ja,false=nein)fs', target_table_name='public.gs_test', # table where the data should end up target_db_alias='dwh', # alias of the DB where the data should end up skip_rows=1), # how many rows at the top should be skipped ]), )
The downloader needs OAuth2 credentials, either use a service account or a user account.
- For service accounts, see https://gspread.readthedocs.io/en/latest/oauth2.html. All required information is in the downloaded json file.
- For user account credentials, see https://developers.google.com/sheets/api/quickstart/python, Step 1.
For getting the initial refresh token, you can use
flask mara_google_sheet_downloader.generate-user-refresh-token /path/to/downloaded/credential.json
Credentials will need the scopes
Example with OAuth2 credentials for a user account:
from mara_app.monkey_patch import patch import mara_google_sheet_downloader.config patch(mara_google_sheet_downloader.config.gs_user_account_client_id)(lambda:"....client_id...") patch(mara_google_sheet_downloader.config.gs_user_account_client_secret)(lambda:"...client_secret...") patch(mara_google_sheet_downloader.config.gs_user_account_refresh_token)(lambda:"...initial_refresh_token...")
Setup access to sheets to be downloaded
All sheets which should be accessed by the downloader must be shared with the email address associated with these credentials. This email address is:
- for user account credentials: the email address of the user who created the credentials.
- for service accounts: the email address of the service account itself (e.g. "@.iam.gserviceaccount.com"). This email address is e.g. included in the json file you can download.
This package contains a small cli app which downloads a google sheet and outputs it as csv.
You can use it stand alone, see
mara-google-sheet-downloader --help for how to use it.