Easy Glue
This package helps you use AWS Glue easily.
🧐 About
You can use the following functions: deploy and run_crawler.
🏁 Getting Started
Installing
- If you want to save data in Parquet format, also install pandas and fastparquet.
pip install easy_glue
Prerequisites
1. (Required) Create Handler
Use the following code to create a handler.
import easy_glue
bucket_name = "YOUR BUCKET NAME"
# You can omit these parameters if your credentials are already configured in ~/.aws/config.
aws_access_key_id = "YOUR AWS ACCESS KEY ID"
aws_secret_access_key = "YOUR AWS SECRET ACCESS KEY"
region_name = "YOUR AWS REGION"
# You need to create this directory yourself.
jobs_base_dir = "YOUR DIRECTORY FOR JOB SCRIPTS"
handler = easy_glue.EasyGlue(
    bucket_name,
    jobs_base_dir=jobs_base_dir,
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region_name,
)
print(handler)
result:
<easy_glue.EasyGlue object at 0x016EE7F0>
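The comment in the handler code suggests the credential parameters are optional. As a minimal sketch under that assumption, the helper below collects only the credentials that are set in the standard AWS environment variables. The name credential_kwargs is hypothetical and not part of easy_glue.

```python
import os
from typing import Dict, Mapping, Optional


def credential_kwargs(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect only the credential settings that are actually set.

    Hypothetical helper, not part of easy_glue. It assumes EasyGlue accepts
    the keyword names used in the handler example and tolerates their absence.
    """
    env = os.environ if env is None else env
    # Left: EasyGlue keyword name. Right: standard AWS environment variable.
    mapping = {
        "aws_access_key_id": "AWS_ACCESS_KEY_ID",
        "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY",
        "region_name": "AWS_DEFAULT_REGION",
    }
    # Skip variables that are missing or empty.
    return {kwarg: env[var] for kwarg, var in mapping.items() if env.get(var)}
```

The result could then be passed along as easy_glue.EasyGlue(bucket_name, jobs_base_dir=jobs_base_dir, **credential_kwargs()), so nothing is hard-coded in the script.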
🎈 Usage
Please check Prerequisites before starting Usage.
🌱 deploy
Use this function to deploy a job to AWS Glue.
Tutorial
- Create a directory sample_job in YOUR_JOBS_BASE_DIR.
- Create a py file index.py in YOUR_JOBS_BASE_DIR/sample_job.
- Write Spark code in YOUR_JOBS_BASE_DIR/sample_job/index.py.
- Deploy sample_job as in the code below.
>>> print(handler.deploy("sample_job"))
Execution Result:
{'Name': 'sample_job', 'ResponseMetadata': {'RequestId': 'e436b350-7b36-47f4-b663-df52a058c2cb', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 10 Aug 2020 03:53:56 GMT', 'content-type': 'application/x-amz-json-1.1', 'content-length': '21', 'connection': 'keep-alive', 'x-amzn-requestid': 'e436b350-7b36-47f4-b663-df52a058c2cb'}, 'RetryAttempts': 0}}
- You can find the deployed job in the Glue console.
https://ap-northeast-2.console.aws.amazon.com/glue/home?2#etl:tab=jobs
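The first two tutorial steps (creating the job directory and its index.py) can also be scripted. The following is only a sketch using the standard library. The function name scaffold_job and the placeholder script content are assumptions for illustration and are not provided by easy_glue.

```python
from pathlib import Path

# Placeholder script body; replace it with your own Spark code.
PLACEHOLDER_SCRIPT = "# Spark code for this Glue job goes here.\n"


def scaffold_job(jobs_base_dir: str, job_name: str) -> Path:
    """Create <jobs_base_dir>/<job_name>/index.py if it does not exist yet."""
    job_dir = Path(jobs_base_dir) / job_name
    job_dir.mkdir(parents=True, exist_ok=True)
    script = job_dir / "index.py"
    if not script.exists():  # never overwrite an existing script
        script.write_text(PLACEHOLDER_SCRIPT, encoding="utf-8")
    return script
```

After scaffold_job(jobs_base_dir, "sample_job") and writing the Spark code into the returned file, the job can be deployed with handler.deploy("sample_job") as shown in the tutorial.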
Parameters
- (required) job_name: str
  Name of the Glue job to be deployed.
- max_capacity: int (default = 3)
  Max capacity of Glue workers.
- timeout: int (default = 7200)
  Timeout of the Glue job.
- default_arguments: dict (default = {})
  Default arguments of the Glue job. For details, refer to the link below.
  https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html
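To illustrate how the optional parameters might be combined, here is a hedged sketch that overrides each default. It assumes the parameter names listed above can be passed as keyword arguments. The wrapper name deploy_with_options and the chosen values are illustrative only, and --job-bookmark-option is one of the special parameters described in the AWS page linked above.

```python
from typing import Any, Dict


def deploy_with_options(handler: Any, job_name: str) -> Dict[str, Any]:
    """Deploy a job with explicit values instead of the documented defaults.

    Hypothetical wrapper: `handler` is expected to be an easy_glue.EasyGlue
    instance, and the keyword names are taken from the parameter list.
    """
    return handler.deploy(
        job_name,
        max_capacity=5,  # documented default is 3
        timeout=3600,    # documented default is 7200
        default_arguments={
            # AWS Glue special parameter that enables job bookmarks.
            "--job-bookmark-option": "job-bookmark-enable",
        },
    )
```

Calling deploy_with_options(handler, "sample_job") should return the same kind of dict as the plain deploy call in the tutorial.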
Returns
- Create job result: dict
🌱 run_crawler
Use this function to run a crawler.
Parameters
- (required) crawler_name: str
Returns
- Start crawler result: dict
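A short sketch of calling run_crawler and checking the outcome follows. It assumes the returned dict carries a ResponseMetadata entry with an HTTPStatusCode, like the deploy result shown earlier. This layout is not confirmed for run_crawler, and the wrapper name start_crawler_checked is hypothetical.

```python
from typing import Any, Dict


def start_crawler_checked(handler: Any, crawler_name: str) -> Dict[str, Any]:
    """Start a crawler and raise if the response does not report HTTP 200.

    Assumption: the result dict has the same ResponseMetadata layout as the
    deploy result. `handler` is expected to be an easy_glue.EasyGlue instance.
    """
    result = handler.run_crawler(crawler_name)
    status = result.get("ResponseMetadata", {}).get("HTTPStatusCode")
    if status != 200:
        raise RuntimeError(
            f"Crawler {crawler_name!r} did not start (HTTP status: {status})"
        )
    return result
```

For example, start_crawler_checked(handler, "YOUR CRAWLER NAME") returns the result dict on success, so it can be used in place of a direct handler.run_crawler call.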