AWS Glue is a serverless data integration service that allows you to process and integrate data coming from different data sources at scale. AWS Glue 5.0, the latest version of AWS Glue for Apache Spark jobs, provides a performance-optimized Apache Spark 3.5 runtime for batch and stream processing. With AWS Glue 5.0, you get improved performance, enhanced security, support for the next generation of Amazon SageMaker, and more. AWS Glue 5.0 enables you to develop, run, and scale your data integration workloads and get insights faster.

AWS Glue accommodates various development preferences through multiple job creation approaches. For developers who prefer direct coding, Python or Scala development is available using the AWS Glue ETL library.

Building production-ready data platforms requires robust development processes and continuous integration and delivery (CI/CD) pipelines. To support diverse development needs, whether on local machines, Docker containers on Amazon Elastic Compute Cloud (Amazon EC2), or other environments, AWS provides an official AWS Glue Docker image through the Amazon ECR Public Gallery. The image enables developers to work efficiently in their preferred environment while using the AWS Glue ETL library.

In this post, we show how to develop and test AWS Glue 5.0 jobs locally using a Docker container. This post is an updated version of the post Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container, and uses AWS Glue 5.0.

Available Docker images

The following Docker image is available in the Amazon ECR Public Gallery:

- AWS Glue version 5.0 – public.ecr.aws/glue/aws-glue-libs:5

AWS Glue Docker images are compatible with both x86_64 and arm64.

In this post, we use public.ecr.aws/glue/aws-glue-libs:5 and run the container on a local machine (Mac, Windows, or Linux). This container image has been tested for AWS Glue 5.0 Spark jobs. The image contains the following:

- Amazon Linux 2023
- AWS Glue ETL library
- Apache Spark 3.5.2
- Open table format libraries: Apache Iceberg 1.6.1, Apache Hudi 0.15.0, and Delta Lake 3.2.1
- AWS Glue Data Catalog client
- Amazon Redshift connector for Apache Spark
- Amazon DynamoDB connector for Apache Hadoop

To set up your container, you pull the image from the ECR Public Gallery and then run the container. We demonstrate how to run your container with the following methods, depending on your requirements:

- spark-submit
- REPL shell (pyspark)
- pytest
- Visual Studio Code

Prerequisites

Before you start, make sure that Docker is installed and the Docker daemon is running. For installation instructions, see the Docker documentation for Mac, Windows, or Linux. Also make sure that you have at least 7 GB of disk space for the image on the host running Docker.

Configure AWS credentials

To enable AWS API calls from the container, set up your AWS credentials with the following steps:

1. Create an AWS named profile.
2. Open cmd on Windows or a terminal on Mac/Linux, and run the following command:

PROFILE_NAME="profile_name"

In the following sections, we use this AWS named profile.

Pull the image from the ECR Public Gallery

If you're running Docker on Windows, choose the Docker icon (right-click) and choose Switch to Linux containers before pulling the image.

Run the following command to pull the image from the ECR Public Gallery:

docker pull public.ecr.aws/glue/aws-glue-libs:5

Run the container

Now you can run a container using this image. You can choose any of the following methods based on your requirements.
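All of the following methods mount your ~/.aws directory into the container and pass the named profile through the AWS_PROFILE environment variable. If you haven't created the named profile yet, the following is a minimal sketch using the AWS CLI on the host (this assumes the AWS CLI is installed; profile_name is a placeholder you replace with your own profile name):

# Prompts for the access key, secret access key, default Region, and output format,
# and stores them under the profile_name profile in ~/.aws
aws configure --profile profile_name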
spark-submit

You can run an AWS Glue job script by running the spark-submit command on the container.

Write your job script (sample.py in the following example) and save it under the /local_path_to_workspace/src/ directory using the following commands:

$ WORKSPACE_LOCATION=/local_path_to_workspace
$ SCRIPT_FILE_NAME=sample.py
$ mkdir -p ${WORKSPACE_LOCATION}/src
$ vim ${WORKSPACE_LOCATION}/src/${SCRIPT_FILE_NAME}

These variables are used in the following docker run command. The sample code (sample.py) used in the spark-submit command is included in the appendix at the end of this post.

Run the following command to run the spark-submit command on the container and submit a new Spark application:

$ docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_spark_submit \
    public.ecr.aws/glue/aws-glue-libs:5 \
    spark-submit /home/hadoop/workspace/src/$SCRIPT_FILE_NAME

REPL shell (pyspark)

You can run a REPL (read-eval-print loop) shell for interactive development. Run the following command to run the pyspark command on the container and start the REPL shell:

$ docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_pyspark \
    public.ecr.aws/glue/aws-glue-libs:5 \
    pyspark

You will see the following output:

Python 3.11.6 (main, Jan 9 2025, 00:00:00) [GCC 11.4.1 20230605 (Red Hat 11.4.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.2-amzn-1
      /_/

Using Python version 3.11.6 (main, Jan 9 2025 00:00:00)
Spark context Web UI available at None
Spark context available as 'sc' (master = local[*], app id = local-1740643079929).
SparkSession available as 'spark'.
>>>

With this REPL shell, you can code and test interactively.

pytest

For unit testing, you can use pytest for AWS Glue Spark job scripts. Run the following commands for preparation:

$ WORKSPACE_LOCATION=/local_path_to_workspace
$ SCRIPT_FILE_NAME=sample.py
$ UNIT_TEST_FILE_NAME=test_sample.py
$ mkdir -p ${WORKSPACE_LOCATION}/tests
$ vim ${WORKSPACE_LOCATION}/tests/${UNIT_TEST_FILE_NAME}

Now let's invoke pytest using docker run:

$ docker run -i --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    --workdir /home/hadoop/workspace \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_pytest \
    public.ecr.aws/glue/aws-glue-libs:5 \
    -c "python3 -m pytest --disable-warnings"

When pytest finishes executing unit tests, your output will look something like the following:

============================= test session starts ==============================
platform linux -- Python 3.11.6, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/hadoop/workspace
plugins: integration-mark-0.2.0
collected 1 item

tests/test_sample.py .                                                   [100%]

======================== 1 passed, 1 warning in 34.28s =========================

Visual Studio Code

To set up the container with Visual Studio Code, complete the following steps:

1. Install Visual Studio Code.
2. Install Python.
3. Install Dev Containers.
4. Open the workspace folder in Visual Studio Code.
5. Press Ctrl+Shift+P (Windows/Linux) or Cmd+Shift+P (Mac).
6. Enter Preferences: Open Workspace Settings (JSON).
7. Press Enter.
8. Enter the following JSON and save it:

{
    "python.defaultInterpreterPath": "/usr/bin/python3.11",
    "python.analysis.extraPaths": [
        "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip:/usr/lib/spark/python/:/usr/lib/spark/python/lib/"
    ]
}

Now you're ready to set up the container.

1. Run the Docker container:

$ docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_pyspark \
    public.ecr.aws/glue/aws-glue-libs:5 \
    pyspark

2. Start Visual Studio Code.
3. Choose Remote Explorer in the navigation pane.
4. Choose the container public.ecr.aws/glue/aws-glue-libs:5 (right-click) and choose Attach in Current Window.
5. If the following dialog appears, choose Got it.
6. Open /home/hadoop/workspace/.
7. Create an AWS Glue PySpark script and choose Run.

You should see the successful run of the AWS Glue PySpark script.

Changes between the AWS Glue 4.0 and AWS Glue 5.0 Docker image

The following are the major changes between the AWS Glue 4.0 and AWS Glue 5.0 Docker images:

- In AWS Glue 5.0, there is a single container image for both batch and streaming jobs. This differs from AWS Glue 4.0, where there was one image for batch and another for streaming.
- In AWS Glue 5.0, the default user name of the container is hadoop. In AWS Glue 4.0, the default user name was glue_user.
- In AWS Glue 5.0, several additional libraries, including JupyterLab and Livy, have been removed from the image. You can manually install them.
- In AWS Glue 5.0, the Iceberg, Hudi, and Delta Lake libraries are all pre-loaded by default, and the environment variable DATALAKE_FORMATS is no longer needed. Through AWS Glue 4.0, the environment variable DATALAKE_FORMATS was used to specify which table format to load.

The preceding list is specific to the Docker image. To learn more about AWS Glue 5.0 updates, see Introducing AWS Glue 5.0 for Apache Spark and Migrating AWS Glue for Spark jobs to AWS Glue version 5.0.

Considerations

Keep in mind that the following features are not supported when using the AWS Glue container image to develop job scripts locally:

- Job bookmarks
- AWS Glue Parquet writer (see Using the Parquet format in AWS Glue)
- FillMissingValues transform
- FindMatches transform
- Vectorized SIMD CSV reader
- The property customJdbcDriverS3Path for loading a JDBC driver from an Amazon Simple Storage Service (Amazon S3) path
- AWS Glue Data Quality
- Sensitive data detection
- AWS Lake Formation permission-based credential vending

Conclusion

In this post, we explored how the AWS Glue 5.0 Docker images provide a flexible foundation for developing and testing AWS Glue job scripts in your preferred environment. These images, readily available in the Amazon ECR Public Gallery, streamline the development process by offering a consistent, portable environment for AWS Glue development.

To learn more about how to build an end-to-end development pipeline, see End-to-end development lifecycle for data engineers to build a data integration pipeline using AWS Glue. We encourage you to explore these capabilities and share your experiences with the AWS community.

Appendix A: AWS Glue job sample codes for testing

This appendix introduces sample scripts as AWS Glue job sample code for testing purposes. You can use any of them in the tutorial.

The following sample.py code uses the AWS Glue ETL library with an Amazon Simple Storage Service (Amazon S3) API call. The code requires Amazon S3 permissions in AWS Identity and Access Management (IAM): you need to grant the IAM-managed policy arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess or an IAM custom policy that allows you to make ListBucket and GetObject API calls for the S3 path.
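For reference, the following is a minimal sketch of such a custom policy, scoped to the awsglue-datasets bucket that the sample script reads from; adjust the resource ARNs if you point the script at your own data:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::awsglue-datasets"
        },
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::awsglue-datasets/*"
        }
    ]
}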
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions


class GluePythonSampleTest:
    def __init__(self):
        params = []
        if '--JOB_NAME' in sys.argv:
            params.append('JOB_NAME')
        args = getResolvedOptions(sys.argv, params)

        self.context = GlueContext(SparkContext.getOrCreate())
        self.job = Job(self.context)

        if 'JOB_NAME' in args:
            jobname = args['JOB_NAME']
        else:
            jobname = "test"
        self.job.init(jobname, args)

    def run(self):
        dyf = read_json(self.context, "s3://awsglue-datasets/examples/us-legislators/all/persons.json")
        dyf.printSchema()

        self.job.commit()


def read_json(glue_context, path):
    dynamicframe = glue_context.create_dynamic_frame.from_options(
        connection_type='s3',
        connection_options={
            'paths': [path],
            'recurse': True
        },
        format='json'
    )
    return dynamicframe


if __name__ == '__main__':
    GluePythonSampleTest().run()

The following test_sample.py code is a sample for a unit test of sample.py. The fixture initializes a GlueContext and yields it to the test, which reads the sample dataset through read_json and checks that it returns a non-empty DynamicFrame:

import pytest
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
import sys
from src import sample


@pytest.fixture(scope="module", autouse=True)
def glue_context():
    sys.argv.append('--JOB_NAME')
    sys.argv.append('test_count')

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    context = GlueContext(SparkContext.getOrCreate())
    job = Job(context)
    job.init(args['JOB_NAME'], args)

    yield context

    job.commit()


def test_counts(glue_context):
    dyf = sample.read_json(
        glue_context,
        "s3://awsglue-datasets/examples/us-legislators/all/persons.json")
    # The sample dataset is expected to be non-empty
    assert dyf.toDF().count() > 0

Appendix B: Adding JDBC drivers and Java libraries

To add a JDBC driver not currently available in the container, you can create a new directory under your workspace with the JAR files you need and mount the directory to /opt/spark/jars/ in the docker run command. JAR files found under /opt/spark/jars/ within the container are automatically added to the Spark classpath and will be available for use during the job run.

For example, you can use the following docker run command to add JDBC driver JAR files to a PySpark REPL shell:

$ docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    -v $WORKSPACE_LOCATION/jars/:/opt/spark/jars/ \
    --workdir /home/hadoop/workspace \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_jdbc \
    public.ecr.aws/glue/aws-glue-libs:5 \
    pyspark

As highlighted earlier, the customJdbcDriverS3Path connection option can't be used to import a custom JDBC driver from Amazon S3 in AWS Glue container images.

Appendix C: Adding Livy and JupyterLab

The AWS Glue 5.0 container image doesn't have Livy installed by default. You can create a new container image extending the AWS Glue 5.0 container image as the base. The following Dockerfile demonstrates how you can extend the Docker image to include additional components you need to enhance your development and testing experience.
To get started, create a directory on your workstation and place the Dockerfile.livy_jupyter file in the directory:

$ mkdir -p $WORKSPACE_LOCATION/jupyterlab/
$ cd $WORKSPACE_LOCATION/jupyterlab/
$ vim Dockerfile.livy_jupyter

The following code is Dockerfile.livy_jupyter:

FROM public.ecr.aws/glue/aws-glue-libs:5 AS glue-base

ENV LIVY_SERVER_JAVA_OPTS="--add-opens java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED"

# Download Livy
ADD --chown=hadoop:hadoop https://dlcdn.apache.org/incubator/livy/0.8.0-incubating/apache-livy-0.8.0-incubating_2.12-bin.zip ./

# Install and configure Livy
RUN unzip apache-livy-0.8.0-incubating_2.12-bin.zip && \
    rm apache-livy-0.8.0-incubating_2.12-bin.zip && \
    mv apache-livy-0.8.0-incubating_2.12-bin livy && \
    mkdir -p livy/logs

RUN cat <<EOF >> livy/conf/livy.conf
livy.server.host = 0.0.0.0
livy.server.port = 8998
livy.spark.master = local
livy.repl.enable-hive-context = true
livy.spark.scala-version = 2.12
EOF

RUN cat <<EOF >> livy/conf/log4j.properties
log4j.rootCategory=INFO,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.logger.org.eclipse.jetty=WARN
EOF

# Switching to root user temporarily to install dev dependency packages
USER root
RUN dnf update -y && dnf install -y krb5-devel gcc python3.11-devel
USER hadoop

# Install SparkMagic and JupyterLab
RUN export PATH=$HOME/.local/bin:$HOME/livy/bin/:$PATH && \
    printf "numpy<2\nIPython<=7.14.0\n" > /tmp/constraint.txt && \
    pip3.11 --no-cache-dir install --constraint /tmp/constraint.txt --user pytest boto==2.49.0 jupyterlab==3.6.8 IPython==7.14.0 ipykernel==5.5.6 ipywidgets==7.7.2 sparkmagic==0.21.0 jupyterlab_widgets==1.1.11 && \
    jupyter-kernelspec install --user $(pip3.11 --no-cache-dir show sparkmagic | grep Location | cut -d" " -f2)/sparkmagic/kernels/sparkkernel && \
    jupyter-kernelspec install --user $(pip3.11 --no-cache-dir show sparkmagic | grep Location | cut -d" " -f2)/sparkmagic/kernels/pysparkkernel && \
    jupyter server extension enable --user --py sparkmagic

# Create the entrypoint script
RUN cat <<EOF >> /home/hadoop/.local/bin/entrypoint.sh
#!/usr/bin/env bash
mkdir -p /home/hadoop/workspace/
livy-server start
sleep 5
jupyter lab --no-browser --ip=0.0.0.0 --allow-root --ServerApp.root_dir=/home/hadoop/workspace/ --ServerApp.token='' --ServerApp.password=''
EOF

# Setup Entrypoint script
RUN chmod +x /home/hadoop/.local/bin/entrypoint.sh

# Add default SparkMagic Config
ADD --chown=hadoop:hadoop https://raw.githubusercontent.com/jupyter-incubator/sparkmagic/refs/heads/master/sparkmagic/example_config.json .sparkmagic/config.json

# Update PATH var
ENV PATH=/home/hadoop/.local/bin:/home/hadoop/livy/bin/:$PATH

ENTRYPOINT ["/home/hadoop/.local/bin/entrypoint.sh"]

Run the docker build command to build the image:

docker build \
    -t glue_v5_livy \
    --file $WORKSPACE_LOCATION/jupyterlab/Dockerfile.livy_jupyter \
    $WORKSPACE_LOCATION/jupyterlab/

When the image build is complete, you can use the following docker run command to start the newly built image:

docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    -p 8998:8998 \
    -p 8888:8888 \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_jupyter \
    glue_v5_livy
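With the container running, JupyterLab should be reachable at http://localhost:8888 and the Livy server at http://localhost:8998, based on the ports configured in the Dockerfile and published by the preceding docker run command. As a quick smoke test of Livy, the following sketch creates a PySpark session and submits a statement through the standard Livy REST API; the session ID 0 assumes this is the first session created on a fresh container:

# Create a new PySpark session
curl -s -X POST -H "Content-Type: application/json" \
    -d '{"kind": "pyspark"}' \
    http://localhost:8998/sessions

# After the session state becomes "idle", submit a statement to session 0
curl -s -X POST -H "Content-Type: application/json" \
    -d '{"code": "1 + 1"}' \
    http://localhost:8998/sessions/0/statements

# Retrieve the statement result
curl -s http://localhost:8998/sessions/0/statements/0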
Appendix D: Adding extra Python libraries

In this section, we discuss adding extra Python libraries and installing Python packages using pip.

Local Python libraries

To add local Python libraries, place them under a directory and assign the path to $EXTRA_PYTHON_PACKAGE_LOCATION:

$ docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    -v $EXTRA_PYTHON_PACKAGE_LOCATION:/home/hadoop/workspace/extra_python_path/ \
    --workdir /home/hadoop/workspace \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_pylib \
    public.ecr.aws/glue/aws-glue-libs:5 \
    -c 'export PYTHONPATH=/home/hadoop/workspace/extra_python_path/:$PYTHONPATH; pyspark'

To validate that the path has been added to PYTHONPATH, you can check for its existence in sys.path:

Python 3.11.6 (main, Jan 9 2025, 00:00:00) [GCC 11.4.1 20230605 (Red Hat 11.4.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.2-amzn-1
      /_/

Using Python version 3.11.6 (main, Jan 9 2025 00:00:00)
Spark context Web UI available at None
Spark context available as 'sc' (master = local[*], app id = local-1740719582296).
SparkSession available as 'spark'.
>>> import sys
>>> "/home/hadoop/workspace/extra_python_path" in sys.path
True

Installing Python packages using pip

To install packages from PyPI (or any other artifact repository) using pip, you can use the following approach:

docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    --workdir /home/hadoop/workspace \
    -e AWS_PROFILE=$PROFILE_NAME \
    -e SCRIPT_FILE_NAME=$SCRIPT_FILE_NAME \
    --name glue5_pylib \
    public.ecr.aws/glue/aws-glue-libs:5 \
    -c 'pip3 install snowflake==1.0.5; spark-submit /home/hadoop/workspace/src/$SCRIPT_FILE_NAME'

About the Authors

Subramanya Vajiraya is a Sr. Cloud Engineer (ETL) at AWS Sydney, specializing in AWS Glue. He is passionate about helping customers solve issues related to their ETL workload and implementing scalable data processing and analytics pipelines on AWS. Outside of work, he enjoys going on bike rides and taking long walks with his dog Ollie.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.