ETL with a Glue Python Shell Job: Load data from S3 to Redshift

cloudonaut · November 21, 2019

Gaining valuable insights from data is a challenge. After collecting data, the next step is to extract, transform, and load (ETL) the data into an analytics platform like Amazon Redshift. Luckily, there is a platform to build ETL pipelines: AWS Glue.

/images/2019/11/pipe.jpg

In short, AWS Glue solves the following problems: a managed-infrastructure to run ETL jobs, a data catalog to organize data stored in data lakes, and crawlers to discover and categorize data.

In the following, I would like to present a simple but exemplary ETL pipeline to load data from S3 to Redshift.

Someone uploads data to S3.
An S3 event triggers a Lambda function.
The Lambda function starts a Glue job.
The Glue job executes an SQL query to load the data from S3 to Redshift.

/images/2019/11/etl-pipeline.png

AWS Glue offers two different job types:

Apache Spark
Python Shell

An Apache Spark job allows you to do complex ETL tasks on vast amounts of data. However, the learning curve is quite steep. Luckily, there is an alternative: Python Shell. A Python Shell job is a perfect fit for ETL tasks with low to medium complexity and data volume.

Therefore, I recommend a Glue job of type Python Shell to load data from S3 to Redshift without or with minimal transformation.

All you need to configure a Glue job is a Python script. The code example executes the following steps:

import modules that are bundled by AWS Glue by default.
Define some configuration parameters (e.g., the Redshift hostname RS_HOST).
Read the S3 bucket and object from the arguments (see getResolvedOptions) handed over when starting the job.
Establish a connection to Redshift: connect(...).
Increase the statement timeout (see statement_timeout) to one hour.
Execute the COPY query to tell Redshift to the object from S3.

from pgdb import connect
import os
import sys
from awsglue.utils import getResolvedOptions

# CONFIGURATION
RS_HOST = "xyz.redshift.amazonaws.com"
RS_PORT = "5439"
RS_DATABASE = "mydatabase"
RS_USER = "myadmin"
RS_PASSWORD = "XYZ"
RS_SCHEMA = "myschema"
RS_TABLE = "mytable"
RS_COLUMNS = "timestamp,value_a,value_b,value_c"
DELIMITER = "\t"
DATEFORMAT = "YYYY-MM-DD"

# ARGUMENTS
args = getResolvedOptions(sys.argv, ["s3-bucket", "s3-object"])
S3_BUCKET = args["s3_bucket"]
S3_OBJECT = args["s3_object"]

con = connect(host=RS_HOST + ':' + RS_PORT, database=RS_DATABASE, user=RS_USER, password=RS_PASSWORD)
cursor = con.cursor()

cursor.execute("set statement_timeout = 360000")

copy_query = "COPY %s.%s(%s) from 's3://%s/%s' iam_role 'arn:aws:iam::111111111111:role/LoadFromS3ToRedshiftJob' delimiter '%s' DATEFORMAT AS '%s' ROUNDEC TRUNCATECOLUMNS ESCAPE MAXERROR AS 500;" % (
  RS_SCHEMA, RS_TABLE, RS_COLUMNS, S3_BUCKET, S3_OBJECT, DELIMITER, DATEFORMAT)

cursor.execute(copy_query)
con.commit()
cursor.close()
con.close();

To trigger the ETL pipeline each time someone uploads a new object to an S3 bucket, you need to configure the following resources:

Create a Lambda function (Node.js) and use the code example from below to start the Glue job LoadFromS3ToRedshift.
Attach an IAM role to the Lambda function, which grants access to glue:StartJobRun.
Create a S3 Event Notification that invokes the Lambda function each time someone uploads an object to your S3 bucket.

The following example shows how to start a Glue job and pass the S3 bucket and object as arguments.

'use strict';
const AWS = require('aws-sdk');
const glue = new AWS.Glue({apiVersion: '2017-03-31'});

exports.handler = async (event, context) => {
  console.log(`Handling S3 event: ${JSON.stringify(event)}`);
  await glue.startJobRun({
    JobName: 'LoadFromS3ToRedshift',
    Arguments: {
      '--s3-object': decodeURIComponent(event.Records[0].s3.object.key),
      '--s3-bucket': event.Records[0].s3.bucket.name
    }
  }).promise();
};

There is only one thing left. You might want to set up monitoring for your simple ETL pipeline.

Create an SNS topic and add your e-mail address as a subscriber.
Create a CloudWatch Rule with the following event pattern and configure the SNS topic as a target.

{
  "detail-type": [
    "Glue Job State Change"
  ],
  "source": [
    "aws.glue"
  ],
  "detail": {
    "state": [
      "FAILED",
      "TIMEOUT"
    ]
  }
}

By doing so, you will receive an e-mail whenever your Glue job fails.

Summary

AWS Glue offers tools for solving ETL challenges. A Glue Python Shell job is a perfect fit for ETL tasks with low to medium complexity and data volume. For example, loading data from S3 to Redshift can be accomplished with a Glue Python Shell job immediately after someone uploads data to S3.

And by the way: the whole solution is Serverless! No need to manage any EC2 instances.

View the full article

Sign In

ETL with a Glue Python Shell Job: Load data from S3 to Redshift

Recommended Posts

cloudonaut

Summary

Link to comment

Share on other sites

Join the conversation