datacatalog-util
A Python package to manage Google Cloud Data Catalog helper commands and scripts.
Disclaimer: This is not an officially supported Google product.
Commands List
Command | Description | Documentation Link | Code Repo |
---|---|---|---|
create-tags | Load Tags from CSV file. | GO | GO |
export-tags | Export Tags to CSV file. | GO | GO |
create-tag-templates | Load Templates from CSV file. | GO | GO |
delete-tag-templates | Delete Templates from CSV file. | GO | GO |
export-tag-templates | Export Templates to CSV file. | GO | GO |
1. Environment setup
1.1. Python + virtualenv
Using virtualenv is optional, but strongly recommended unless you use Docker.
1.1.1. Install Python 3.6+
1.1.2. Create a folder
This is recommended so that all related files reside in the same place, which makes the instructions below easier to follow.
mkdir ./datacatalog-util
cd ./datacatalog-util
All paths starting with ./ in the next steps are relative to the datacatalog-util folder.
1.1.3. Create and activate an isolated Python environment
pip install --upgrade virtualenv
python3 -m virtualenv --python python3 env
source ./env/bin/activate
1.1.4. Install the package
pip install --upgrade datacatalog-util
1.2. Docker
Docker may be used as an alternative way to run the script. In this case, disregard the virtualenv setup instructions.
1.2.1. Get the source code
git clone https://github.com/mesmacosta/datacatalog-util
cd ./datacatalog-util
1.3. Auth credentials
1.3.1. Create a service account and grant it the roles below
- BigQuery Metadata Viewer
- Data Catalog Admin
- A custom role with bigquery.datasets.updateTag and bigquery.tables.updateTag permissions
1.3.2. Download a JSON key and save it as ./credentials/datacatalog-util.json
1.3.3. Set the environment variables
This step may be skipped if you're using Docker.
export GOOGLE_APPLICATION_CREDENTIALS=./credentials/datacatalog-util.json
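A quick pre-flight check can catch a missing or misplaced key file before any command runs. The sketch below is not part of datacatalog-util: the helper name `resolve_credentials_path` is hypothetical, and it only assumes that the key is a JSON file referenced by GOOGLE_APPLICATION_CREDENTIALS.

```python
import json
import os
from pathlib import Path


def resolve_credentials_path(environ=os.environ):
    """Return the key file path taken from GOOGLE_APPLICATION_CREDENTIALS.

    Hypothetical helper (not part of datacatalog-util). Raises RuntimeError
    when the variable is unset, the file is missing, or it is not valid JSON.
    """
    raw = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not raw:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")

    path = Path(raw).expanduser()
    if not path.is_file():
        raise RuntimeError(f"Key file not found: {path}")

    try:
        json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise RuntimeError(f"Key file is not valid JSON: {path}") from exc

    return path


if __name__ == "__main__":
    try:
        print(f"Using credentials at {resolve_credentials_path()}")
    except RuntimeError as error:
        print(f"Credentials check failed: {error}")
```

Running it from the same shell that will run the commands confirms that the variable is visible to child processes.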
2. Load Tags from CSV file
2.1. Create a CSV file representing the Tags to be created
Tags are composed of as many lines as required to represent all of their fields. The columns are described as follows:
Column | Description | Mandatory |
---|---|---|
linked_resource | Full name of the asset the Entry refers to. | Y |
template_name | Resource name of the Tag Template for the Tag. | Y |
column | Attach Tags to a column belonging to the Entry schema. | N |
field_id | Id of the Tag field. | Y |
field_value | Value of the Tag field. | Y |
TIPS
- See sample-input/create-tags for reference;
- Data Catalog Sample Tags (Google Sheets) may help to create/export the CSV.
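The one-line-per-field layout described above can also be generated programmatically. The following is a minimal sketch that uses only the Python standard library. The resource names, field ids, and values are hypothetical placeholders, and repeating linked_resource and template_name on every line is an assumption of this sketch rather than a documented requirement.

```python
import csv

# Column order taken from the table above.
COLUMNS = ["linked_resource", "template_name", "column", "field_id", "field_value"]

# Hypothetical resource names; replace them with your own.
TABLE = "//bigquery.googleapis.com/projects/my-project/datasets/my_dataset/tables/my_table"
TEMPLATE = "projects/my-project/locations/us-central1/tagTemplates/my_template"


def build_tag_rows(linked_resource, template_name, fields, column=""):
    """Return one CSV row per Tag field, all sharing the same Tag identity."""
    return [
        {
            "linked_resource": linked_resource,
            "template_name": template_name,
            "column": column,
            "field_id": field_id,
            "field_value": field_value,
        }
        for field_id, field_value in fields.items()
    ]


def write_tags_csv(path, rows):
    """Write the rows to path with a header line in COLUMNS order."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # One Tag on the table itself, and one on a hypothetical "email" column.
    rows = build_tag_rows(TABLE, TEMPLATE, {"owner": "data-team", "has_pii": "true"})
    rows += build_tag_rows(TABLE, TEMPLATE, {"has_pii": "true"}, column="email")
    write_tags_csv("tags.csv", rows)
```

The resulting tags.csv holds three data lines: two for the table-level Tag and one for the column-level Tag.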
2.2. Run the datacatalog-util script
- Python + virtualenv
datacatalog-util create-tags --csv-file CSV_FILE_PATH
- Docker
docker build --rm --tag datacatalog-util .
docker run --rm --tty \
--volume CREDENTIALS_FILE_FOLDER:/credentials --volume CSV_FILE_FOLDER:/data \
datacatalog-util create-tags --csv-file /data/CSV_FILE_NAME
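When the load is one step in a larger Python job, the virtualenv command can be wrapped with the standard library. This is a minimal sketch: the function names are hypothetical, and the only flag used is the --csv-file option shown above.

```python
import shutil
import subprocess


def build_create_tags_command(csv_file_path):
    """Return the argv list for the create-tags command shown above."""
    return ["datacatalog-util", "create-tags", "--csv-file", str(csv_file_path)]


def run_create_tags(csv_file_path):
    """Run create-tags and return its exit code.

    Raises FileNotFoundError if the datacatalog-util executable is not on
    PATH, which usually means the virtualenv has not been activated.
    """
    command = build_create_tags_command(csv_file_path)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError("datacatalog-util not found on PATH")
    return subprocess.run(command, check=False).returncode
```

Passing the arguments as a list avoids shell quoting problems when the CSV path contains spaces.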
3. Export Tags to CSV file
3.1. A list of CSV files, each representing one Template, will be created
A summary file with stats about each template will also be created in the same directory.
The columns for the summary file are described as follows:
Column | Description |
---|---|
template_name | Resource name of the Tag Template for the Tag. |
tags_count | Number of tags created from the template. |
tagged_entries_count | Number of entries tagged with the template. |
tagged_columns_count | Number of columns tagged with the template. |
tag_string_fields_count | Number of used String fields on tags of the template. |
tag_bool_fields_count | Number of used Bool fields on tags of the template. |
tag_double_fields_count | Number of used Double fields on tags of the template. |
tag_timestamp_fields_count | Number of used Timestamp fields on tags of the template. |
tag_enum_fields_count | Number of used Enum fields on tags of the template. |
The columns for each template file are described as follows:
Column | Description |
---|---|
relative_resource_name | Full resource name of the asset the Entry refers to. |
linked_resource | Full name of the asset the Entry refers to. |
template_name | Resource name of the Tag Template for the Tag. |
tag_name | Resource name of the Tag. |
column | Column of the Entry schema the Tag is attached to, if any. |
field_id | Id of the Tag field. |
field_type | Type of the Tag field. |
field_value | Value of the Tag field. |
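Because each exported line carries a single field, a reader usually has to regroup the lines into whole Tags. The sketch below does that with the standard csv module. It assumes that every line repeats the tag_name value, and the sample resource names and field values are hypothetical.

```python
import csv
from collections import defaultdict


def group_fields_by_tag(lines):
    """Map each tag_name to a {field_id: field_value} dict.

    `lines` is any iterable of CSV text lines, such as an open file.
    Assumes every line carries its tag_name (an assumption of this sketch).
    """
    tags = defaultdict(dict)
    for row in csv.DictReader(lines):
        tags[row["tag_name"]][row["field_id"]] = row["field_value"]
    return dict(tags)


# Hypothetical excerpt of one exported template file.
SAMPLE = [
    "relative_resource_name,linked_resource,template_name,tag_name,column,field_id,field_type,field_value",
    "entry-1,resource-1,template-1,tag-a,,owner,STRING,data-team",
    "entry-1,resource-1,template-1,tag-a,,has_pii,BOOL,true",
    "entry-1,resource-1,template-1,tag-b,email,has_pii,BOOL,true",
]

if __name__ == "__main__":
    for tag_name, fields in group_fields_by_tag(SAMPLE).items():
        print(tag_name, fields)
```

For a real export, pass a file opened with `newline=""` instead of the sample list.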
3.2. Run the datacatalog-util script
- Python + virtualenv
datacatalog-util export-tags --project-ids my-project --dir-path DIR_PATH
4. Load Templates from CSV file
4.1. Create a CSV file representing the Templates to be created
Templates are composed of as many lines as required to represent all of their fields. The columns are described as follows:
Column | Description | Mandatory |
---|---|---|
template_name | Resource name of the Tag Template for the Tag. | Y |
display_name | Display name of the Tag Template. | Y |
field_id | Id of the Tag Template field. | Y |
field_display_name | Display name of the Tag Template field. | Y |
field_type | Type of the Tag Template field. | Y |
enum_values | Values for the Enum field. | N |
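Checking the file against the Mandatory column before loading it can save a failed run. The following sketch reports empty mandatory cells. It treats every line as needing all mandatory values, which may be stricter than the tool itself, and the function name, resource name, and field_type values in the sample are hypothetical.

```python
import csv

# Columns marked "Y" in the Mandatory column of the table above.
MANDATORY = ["template_name", "display_name", "field_id", "field_display_name", "field_type"]


def find_missing_values(lines):
    """Return (line_number, column) pairs for empty mandatory cells.

    `lines` is any iterable of CSV text lines. Line numbers count the
    header as line 1, so the first data line is line 2.
    """
    problems = []
    reader = csv.DictReader(lines)
    for row in reader:
        for column in MANDATORY:
            if not (row.get(column) or "").strip():
                problems.append((reader.line_num, column))
    return problems


# Hypothetical input; the second data line lacks field_display_name.
TEMPLATE = "projects/my-project/locations/us-central1/tagTemplates/my_template"
SAMPLE = [
    "template_name,display_name,field_id,field_display_name,field_type,enum_values",
    f"{TEMPLATE},My Template,owner,Owner,STRING,",
    f"{TEMPLATE},My Template,has_pii,,BOOL,",
]

if __name__ == "__main__":
    for line_number, column in find_missing_values(SAMPLE):
        print(f"line {line_number}: missing {column}")
```

The enum_values column is left out of the check because the table marks it as optional.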
4.2. Run the datacatalog-util script - Create the Tag Templates
- Python + virtualenv
datacatalog-util create-tag-templates --csv-file CSV_FILE_PATH
4.3. Run the datacatalog-util script - Delete the Tag Templates
- Python + virtualenv
datacatalog-util delete-tag-templates --csv-file CSV_FILE_PATH
TIPS
- See sample-input/create-tag-templates for reference.
5. Export Templates to CSV file
5.1. A CSV file representing the Templates will be created
Templates are composed of as many lines as required to represent all of their fields. The columns are described as follows:
Column | Description |
---|---|
template_name | Resource name of the Tag Template for the Tag. |
display_name | Display name of the Tag Template. |
field_id | Id of the Tag Template field. |
field_display_name | Display name of the Tag Template field. |
field_type | Type of the Tag Template field. |
enum_values | Values for the Enum field. |
5.2. Run the datacatalog-util script
- Python + virtualenv
datacatalog-util export-tag-templates --project-ids my-project --file-path CSV_FILE_PATH