Search the Community
Showing results for tags 'amazon redshift'.
-
The need to handle data and generate insights has become one of the primary considerations for companies. Corporations typically store their on-premise data in a database designed for day-to-day transactional operations. Consequently, integrating data from a database into a data warehouse is a crucial step for any organization. Amazon RDS is a database management web […]View the full article
-
Amazon Redshift is a serverless, fully managed leading data warehouse in the market, and many organizations are migrating their legacy data to Redshift for better analytics. In this blog, we will discuss the best Redshift ETL tools that you can use to load data into Redshift. 8 Best Redshift ETL Tools Let’s have a detailed […]View the full article
-
Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that you can use to set up and operate data pipelines in the cloud at scale. Apache Airflow is an open source tool used to programmatically author, schedule, and monitor sequences of processes and tasks, referred to as workflows. With Amazon MWAA, you can use Apache Airflow and Python to create workflows without having to manage the underlying infrastructure for scalability, availability, and security. By using multiple AWS accounts, organizations can effectively scale their workloads and manage their complexity as they grow. This approach provides a robust mechanism to mitigate the potential impact of disruptions or failures, making sure that critical workloads remain operational. Additionally, it enables cost optimization by aligning resources with specific use cases, making sure that expenses are well controlled. By isolating workloads with specific security requirements or compliance needs, organizations can maintain the highest levels of data privacy and security. Furthermore, the ability to organize multiple AWS accounts in a structured manner allows you to align your business processes and resources according to your unique operational, regulatory, and budgetary requirements. This approach promotes efficiency, flexibility, and scalability, enabling large enterprises to meet their evolving needs and achieve their goals. This post demonstrates how to orchestrate an end-to-end extract, transform, and load (ETL) pipeline using Amazon Simple Storage Service (Amazon S3), AWS Glue, and Amazon Redshift Serverless with Amazon MWAA. Solution overview For this post, we consider a use case where a data engineering team wants to build an ETL process and give the best experience to their end-users when they want to query the latest data after new raw files are added to Amazon S3 in the central account (Account A in the following architecture diagram). The data engineering team wants to separate the raw data into its own AWS account (Account B in the diagram) for increased security and control. They also want to perform the data processing and transformation work in their own account (Account B) to compartmentalize duties and prevent any unintended changes to the source raw data present in the central account (Account A). This approach allows the team to process the raw data extracted from Account A to Account B, which is dedicated for data handling tasks. This makes sure the raw and processed data can be maintained securely separated across multiple accounts, if required, for enhanced data governance and security. Our solution uses an end-to-end ETL pipeline orchestrated by Amazon MWAA that looks for new incremental files in an Amazon S3 location in Account A, where the raw data is present. This is done by invoking AWS Glue ETL jobs and writing to data objects in a Redshift Serverless cluster in Account B. The pipeline then starts running stored procedures and SQL commands on Redshift Serverless. As the queries finish running, an UNLOAD operation is invoked from the Redshift data warehouse to the S3 bucket in Account A. Because security is important, this post also covers how to configure an Airflow connection using AWS Secrets Manager to avoid storing database credentials within Airflow connections and variables. The following diagram illustrates the architectural overview of the components involved in the orchestration of the workflow. The workflow consists of the following components: The source and target S3 buckets are in a central account (Account A), whereas Amazon MWAA, AWS Glue, and Amazon Redshift are in a different account (Account B). Cross-account access has been set up between S3 buckets in Account A with resources in Account B to be able to load and unload data. In the second account, Amazon MWAA is hosted in one VPC and Redshift Serverless in a different VPC, which are connected through VPC peering. A Redshift Serverless workgroup is secured inside private subnets across three Availability Zones. Secrets like user name, password, DB port, and AWS Region for Redshift Serverless are stored in Secrets Manager. VPC endpoints are created for Amazon S3 and Secrets Manager to interact with other resources. Usually, data engineers create an Airflow Directed Acyclic Graph (DAG) and commit their changes to GitHub. With GitHub actions, they are deployed to an S3 bucket in Account B (for this post, we upload the files into S3 bucket directly). The S3 bucket stores Airflow-related files like DAG files, requirements.txt files, and plugins. AWS Glue ETL scripts and assets are stored in another S3 bucket. This separation helps maintain organization and avoid confusion. The Airflow DAG uses various operators, sensors, connections, tasks, and rules to run the data pipeline as needed. The Airflow logs are logged in Amazon CloudWatch, and alerts can be configured for monitoring tasks. For more information, see Monitoring dashboards and alarms on Amazon MWAA. Prerequisites Because this solution centers around using Amazon MWAA to orchestrate the ETL pipeline, you need to set up certain foundational resources across accounts beforehand. Specifically, you need to create the S3 buckets and folders, AWS Glue resources, and Redshift Serverless resources in their respective accounts prior to implementing the full workflow integration using Amazon MWAA. Deploy resources in Account A using AWS CloudFormation In Account A, launch the provided AWS CloudFormation stack to create the following resources: The source and target S3 buckets and folders. As a best practice, the input and output bucket structures are formatted with hive style partitioning as s3://<bucket>/products/YYYY/MM/DD/. A sample dataset called products.csv, which we use in this post. Upload the AWS Glue job to Amazon S3 in Account B In Account B, create an Amazon S3 location called aws-glue-assets-<account-id>-<region>/scripts (if not present). Replace the parameters for the account ID and Region in the sample_glue_job.py script and upload the AWS Glue job file to the Amazon S3 location. Deploy resources in Account B using AWS CloudFormation In Account B, launch the provided CloudFormation stack template to create the following resources: The S3 bucket airflow-<username>-bucket to store Airflow-related files with the following structure: dags – The folder for DAG files. plugins – The file for any custom or community Airflow plugins. requirements – The requirements.txt file for any Python packages. scripts – Any SQL scripts used in the DAG. data – Any datasets used in the DAG. A Redshift Serverless environment. The name of the workgroup and namespace are prefixed with sample. An AWS Glue environment, which contains the following: An AWS Glue crawler, which crawls the data from the S3 source bucket sample-inp-bucket-etl-<username> in Account A. A database called products_db in the AWS Glue Data Catalog. An ELT job called sample_glue_job. This job can read files from the products table in the Data Catalog and load data into the Redshift table products. A VPC gateway endpointto Amazon S3. An Amazon MWAA environment. For detailed steps to create an Amazon MWAA environment using the Amazon MWAA console, refer to Introducing Amazon Managed Workflows for Apache Airflow (MWAA). Create Amazon Redshift resources Create two tables and a stored procedure on an Redshift Serverless workgroup using the products.sql file. In this example, we create two tables called products and products_f. The name of the stored procedure is sp_products. Configure Airflow permissions After the Amazon MWAA environment is created successfully, the status will show as Available. Choose Open Airflow UI to view the Airflow UI. DAGs are automatically synced from the S3 bucket and visible in the UI. However, at this stage, there are no DAGs in the S3 folder. Add the customer managed policy AmazonMWAAFullConsoleAccess, which grants Airflow users permissions to access AWS Identity and Access Management (IAM) resources, and attach this policy to the Amazon MWAA role. For more information, see Accessing an Amazon MWAA environment. The policies attached to the Amazon MWAA role have full access and must only be used for testing purposes in a secure test environment. For production deployments, follow the least privilege principle. Set up the environment This section outlines the steps to configure the environment. The process involves the following high-level steps: Update any necessary providers. Set up cross-account access. Establish a VPC peering connection between the Amazon MWAA VPC and Amazon Redshift VPC. Configure Secrets Manager to integrate with Amazon MWAA. Define Airflow connections. Update the providers Follow the steps in this section if your version of Amazon MWAA is less than 2.8.1 (the latest version as of writing this post). Providers are packages that are maintained by the community and include all the core operators, hooks, and sensors for a given service. The Amazon provider is used to interact with AWS services like Amazon S3, Amazon Redshift Serverless, AWS Glue, and more. There are over 200 modules within the Amazon provider. Although the version of Airflow supported in Amazon MWAA is 2.6.3, which comes bundled with the Amazon provided package version 8.2.0, support for Amazon Redshift Serverless was not added until the Amazon provided package version 8.4.0. Because the default bundled provider version is older than when Redshift Serverless support was introduced, the provider version must be upgraded in order to use that functionality. The first step is to update the constraints file and requirements.txt file with the correct versions. Refer to Specifying newer provider packages for steps to update the Amazon provider package. Specify the requirements as follows: --constraint "/usr/local/airflow/dags/constraints-3.10-mod.txt" apache-airflow-providers-amazon==8.4.0 Update the version in the constraints file to 8.4.0 or higher. Add the constraints-3.11-updated.txt file to the /dags folder. Refer to Apache Airflow versions on Amazon Managed Workflows for Apache Airflow for correct versions of the constraints file depending on the Airflow version. Navigate to the Amazon MWAA environment and choose Edit. Under DAG code in Amazon S3, for Requirements file, choose the latest version. Choose Save. This will update the environment and new providers will be in effect. To verify the providers version, go to Providers under the Admin table. The version for the Amazon provider package should be 8.4.0, as shown in the following screenshot. If not, there was an error while loading requirements.txt. To debug any errors, go to the CloudWatch console and open the requirements_install_ip log in Log streams, where errors are listed. Refer to Enabling logs on the Amazon MWAA console for more details. Set up cross-account access You need to set up cross-account policies and roles between Account A and Account B to access the S3 buckets to load and unload data. Complete the following steps: In Account A, configure the bucket policy for bucket sample-inp-bucket-etl-<username> to grant permissions to the AWS Glue and Amazon MWAA roles in Account B for objects in bucket sample-inp-bucket-etl-<username>: { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "AWS": [ "arn:aws:iam::<account-id-of- AcctB>:role/service-role/<Glue-role>", "arn:aws:iam::<account-id-of-AcctB>:role/service-role/<MWAA-role>" ] }, "Action": [ "s3:GetObject", "s3:PutObject", "s3:PutObjectAcl", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::sample-inp-bucket-etl-<username>/*", "arn:aws:s3:::sample-inp-bucket-etl-<username>" ] } ] } Similarly, configure the bucket policy for bucket sample-opt-bucket-etl-<username> to grant permissions to Amazon MWAA roles in Account B to put objects in this bucket: { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::<account-id-of-AcctB>:role/service-role/<MWAA-role>" }, "Action": [ "s3:GetObject", "s3:PutObject", "s3:PutObjectAcl", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::sample-opt-bucket-etl-<username>/*", "arn:aws:s3:::sample-opt-bucket-etl-<username>" ] } ] } In Account A, create an IAM policy called policy_for_roleA, which allows necessary Amazon S3 actions on the output bucket: { "Version": "2012-10-17", "Statement": [ { "Sid": "VisualEditor0", "Effect": "Allow", "Action": [ "kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey" ], "Resource": [ "<KMS_KEY_ARN_Used_for_S3_encryption>" ] }, { "Sid": "VisualEditor1", "Effect": "Allow", "Action": [ "s3:PutObject", "s3:GetObject", "s3:GetBucketAcl", "s3:GetBucketCors", "s3:GetEncryptionConfiguration", "s3:GetBucketLocation", "s3:ListAllMyBuckets", "s3:ListBucket", "s3:ListBucketMultipartUploads", "s3:ListBucketVersions", "s3:ListMultipartUploadParts" ], "Resource": [ "arn:aws:s3:::sample-opt-bucket-etl-<username>", "arn:aws:s3:::sample-opt-bucket-etl-<username>/*" ] } ] } Create a new IAM role called RoleA with Account B as the trusted entity role and add this policy to the role. This allows Account B to assume RoleA to perform necessary Amazon S3 actions on the output bucket. In Account B, create an IAM policy called s3-cross-account-access with permission to access objects in the bucket sample-inp-bucket-etl-<username>, which is in Account A. Add this policy to the AWS Glue role and Amazon MWAA role: { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:PutObject", "s3:PutObjectAcl" ], "Resource": "arn:aws:s3:::sample-inp-bucket-etl-<username>/*" } ] } In Account B, create the IAM policy policy_for_roleB specifying Account A as a trusted entity. The following is the trust policy to assume RoleA in Account A: { "Version": "2012-10-17", "Statement": [ { "Sid": "CrossAccountPolicy", "Effect": "Allow", "Action": "sts:AssumeRole", "Resource": "arn:aws:iam::<account-id-of-AcctA>:role/RoleA" } ] } Create a new IAM role called RoleB with Amazon Redshift as the trusted entity type and add this policy to the role. This allows RoleB to assume RoleA in Account A and also to be assumable by Amazon Redshift. Attach RoleB to the Redshift Serverless namespace, so Amazon Redshift can write objects to the S3 output bucket in Account A. Attach the policy policy_for_roleB to the Amazon MWAA role, which allows Amazon MWAA to access the output bucket in Account A. Refer to How do I provide cross-account access to objects that are in Amazon S3 buckets? for more details on setting up cross-account access to objects in Amazon S3 from AWS Glue and Amazon MWAA. Refer to How do I COPY or UNLOAD data from Amazon Redshift to an Amazon S3 bucket in another account? for more details on setting up roles to unload data from Amazon Redshift to Amazon S3 from Amazon MWAA. Set up VPC peering between the Amazon MWAA and Amazon Redshift VPCs Because Amazon MWAA and Amazon Redshift are in two separate VPCs, you need to set up VPC peering between them. You must add a route to the route tables associated with the subnets for both services. Refer to Work with VPC peering connections for details on VPC peering. Make sure that CIDR range of the Amazon MWAA VPC is allowed in the Redshift security group and the CIDR range of the Amazon Redshift VPC is allowed in the Amazon MWAA security group, as shown in the following screenshot. If any of the preceding steps are configured incorrectly, you are likely to encounter a “Connection Timeout” error in the DAG run. Configure the Amazon MWAA connection with Secrets Manager When the Amazon MWAA pipeline is configured to use Secrets Manager, it will first look for connections and variables in an alternate backend (like Secrets Manager). If the alternate backend contains the needed value, it is returned. Otherwise, it will check the metadata database for the value and return that instead. For more details, refer to Configuring an Apache Airflow connection using an AWS Secrets Manager secret. Complete the following steps: Configure a VPC endpoint to link Amazon MWAA and Secrets Manager (com.amazonaws.us-east-1.secretsmanager). This allows Amazon MWAA to access credentials stored in Secrets Manager. To provide Amazon MWAA with permission to access Secrets Manager secret keys, add the policy called SecretsManagerReadWrite to the IAM role of the environment. To create the Secrets Manager backend as an Apache Airflow configuration option, go to the Airflow configuration options, add the following key-value pairs, and save your settings. This configures Airflow to look for connection strings and variables at the airflow/connections/* and airflow/variables/* paths: secrets.backend: airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend secrets.backend_kwargs: {"connections_prefix" : "airflow/connections", "variables_prefix" : "airflow/variables"} To generate an Airflow connection URI string, go to AWS CloudShell and enter into a Python shell. Run the following code to generate the connection URI string: import urllib.parse conn_type = 'redshift' host = 'sample-workgroup.<account-id-of-AcctB>.us-east-1.redshift-serverless.amazonaws.com' #Specify the Amazon Redshift workgroup endpoint port = '5439' login = 'admin' #Specify the username to use for authentication with Amazon Redshift password = '<password>' #Specify the password to use for authentication with Amazon Redshift role_arn = urllib.parse.quote_plus('arn:aws:iam::<account_id>:role/service-role/<MWAA-role>') database = 'dev' region = 'us-east-1' #YOUR_REGION conn_string = '{0}://{1}:{2}@{3}:{4}?role_arn={5}&database={6}®ion={7}'.format(conn_type, login, password, host, port, role_arn, database, region) print(conn_string) The connection string should be generated as follows: redshift://admin:<password>@sample-workgroup.<account_id>.us-east-1.redshift-serverless.amazonaws.com:5439?role_arn=<MWAA role ARN>&database=dev®ion=<region> Add the connection in Secrets Manager using the following command in the AWS Command Line Interface (AWS CLI). This can also be done from the Secrets Manager console. This will be added in Secrets Manager as plaintext. aws secretsmanager create-secret --name airflow/connections/secrets_redshift_connection --description "Apache Airflow to Redshift Cluster" --secret-string "redshift://admin:<password>@sample-workgroup.<account_id>.us-east-1.redshift-serverless.amazonaws.com:5439?role_arn=<MWAA role ARN>&database=dev®ion=us-east-1" --region=us-east-1 Use the connection airflow/connections/secrets_redshift_connection in the DAG. When the DAG is run, it will look for this connection and retrieve the secrets from Secrets Manager. In case of RedshiftDataOperator, pass the secret_arn as a parameter instead of connection name. You can also add secrets using the Secrets Manager console as key-value pairs. Add another secret in Secrets Manager in and save it as airflow/connections/redshift_conn_test. Create an Airflow connection through the metadata database You can also create connections in the UI. In this case, the connection details will be stored in an Airflow metadata database. If the Amazon MWAA environment is not configured to use the Secrets Manager backend, it will check the metadata database for the value and return that. You can create an Airflow connection using the UI, AWS CLI, or API. In this section, we show how to create a connection using the Airflow UI. For Connection Id, enter a name for the connection. For Connection Type, choose Amazon Redshift. For Host, enter the Redshift endpoint (without port and database) for Redshift Serverless. For Database, enter dev. For User, enter your admin user name. For Password, enter your password. For Port, use port 5439. For Extra, set the region and timeout parameters. Test the connection, then save your settings. Create and run a DAG In this section, we describe how to create a DAG using various components. After you create and run the DAG, you can verify the results by querying Redshift tables and checking the target S3 buckets. Create a DAG In Airflow, data pipelines are defined in Python code as DAGs. We create a DAG that consists of various operators, sensors, connections, tasks, and rules: The DAG starts with looking for source files in the S3 bucket sample-inp-bucket-etl-<username> under Account A for the current day using S3KeySensor. S3KeySensor is used to wait for one or multiple keys to be present in an S3 bucket. For example, our S3 bucket is partitioned as s3://bucket/products/YYYY/MM/DD/, so our sensor should check for folders with the current date. We derived the current date in the DAG and passed this to S3KeySensor, which looks for any new files in the current day folder. We also set wildcard_match as True, which enables searches on bucket_key to be interpreted as a Unix wildcard pattern. Set the mode to reschedule so that the sensor task frees the worker slot when the criteria is not met and it’s rescheduled at a later time. As a best practice, use this mode when poke_interval is more than 1 minute to prevent too much load on a scheduler. After the file is available in the S3 bucket, the AWS Glue crawler runs using GlueCrawlerOperator to crawl the S3 source bucket sample-inp-bucket-etl-<username> under Account A and updates the table metadata under the products_db database in the Data Catalog. The crawler uses the AWS Glue role and Data Catalog database that were created in the previous steps. The DAG uses GlueCrawlerSensor to wait for the crawler to complete. When the crawler job is complete, GlueJobOperator is used to run the AWS Glue job. The AWS Glue script name (along with location) and is passed to the operator along with the AWS Glue IAM role. Other parameters like GlueVersion, NumberofWorkers, and WorkerType are passed using the create_job_kwargs parameter. The DAG uses GlueJobSensor to wait for the AWS Glue job to complete. When it’s complete, the Redshift staging table products will be loaded with data from the S3 file. You can connect to Amazon Redshift from Airflow using three different operators: PythonOperator. SQLExecuteQueryOperator, which uses a PostgreSQL connection and redshift_default as the default connection. RedshiftDataOperator, which uses the Redshift Data API and aws_default as the default connection. In our DAG, we use SQLExecuteQueryOperator and RedshiftDataOperator to show how to use these operators. The Redshift stored procedures are run RedshiftDataOperator. The DAG also runs SQL commands in Amazon Redshift to delete the data from the staging table using SQLExecuteQueryOperator. Because we configured our Amazon MWAA environment to look for connections in Secrets Manager, when the DAG runs, it retrieves the Redshift connection details like user name, password, host, port, and Region from Secrets Manager. If the connection is not found in Secrets Manager, the values are retrieved from the default connections. In SQLExecuteQueryOperator, we pass the connection name that we created in Secrets Manager. It looks for airflow/connections/secrets_redshift_connection and retrieves the secrets from Secrets Manager. If Secrets Manager is not set up, the connection created manually (for example, redshift-conn-id) can be passed. In RedshiftDataOperator, we pass the secret_arn of the airflow/connections/redshift_conn_test connection created in Secrets Manager as a parameter. As final task, RedshiftToS3Operator is used to unload data from the Redshift table to an S3 bucket sample-opt-bucket-etl in Account B. airflow/connections/redshift_conn_test from Secrets Manager is used for unloading the data. TriggerRule is set to ALL_DONE, which enables the next step to run after all upstream tasks are complete. The dependency of tasks is defined using the chain() function, which allows for parallel runs of tasks if needed. In our case, we want all tasks to run in sequence. The following is the complete DAG code. The dag_id should match the DAG script name, otherwise it won’t be synced into the Airflow UI. from datetime import datetime from airflow import DAG from airflow.decorators import task from airflow.models.baseoperator import chain from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor from airflow.providers.amazon.aws.operators.glue import GlueJobOperator from airflow.providers.amazon.aws.operators.glue_crawler import GlueCrawlerOperator from airflow.providers.amazon.aws.sensors.glue import GlueJobSensor from airflow.providers.amazon.aws.sensors.glue_crawler import GlueCrawlerSensor from airflow.providers.amazon.aws.operators.redshift_data import RedshiftDataOperator from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator from airflow.providers.amazon.aws.transfers.redshift_to_s3 import RedshiftToS3Operator from airflow.utils.trigger_rule import TriggerRule dag_id = "data_pipeline" vYear = datetime.today().strftime("%Y") vMonth = datetime.today().strftime("%m") vDay = datetime.today().strftime("%d") src_bucket_name = "sample-inp-bucket-etl-<username>" tgt_bucket_name = "sample-opt-bucket-etl-<username>" s3_folder="products" #Please replace the variable with the glue_role_arn glue_role_arn_key = "arn:aws:iam::<account_id>:role/<Glue-role>" glue_crawler_name = "products" glue_db_name = "products_db" glue_job_name = "sample_glue_job" glue_script_location="s3://aws-glue-assets-<account_id>-<region>/scripts/sample_glue_job.py" workgroup_name = "sample-workgroup" redshift_table = "products_f" redshift_conn_id_name="secrets_redshift_connection" db_name = "dev" secret_arn="arn:aws:secretsmanager:us-east-1:<account_id>:secret:airflow/connections/redshift_conn_test-xxxx" poll_interval = 10 @task def get_role_name(arn: str) -> str: return arn.split("/")[-1] @task def get_s3_loc(s3_folder: str) -> str: s3_loc = s3_folder + "/year=" + vYear + "/month=" + vMonth + "/day=" + vDay + "/*.csv" return s3_loc with DAG( dag_id=dag_id, schedule="@once", start_date=datetime(2021, 1, 1), tags=["example"], catchup=False, ) as dag: role_arn = glue_role_arn_key glue_role_name = get_role_name(role_arn) s3_loc = get_s3_loc(s3_folder) # Check for new incremental files in S3 source/input bucket sensor_key = S3KeySensor( task_id="sensor_key", bucket_key=s3_loc, bucket_name=src_bucket_name, wildcard_match=True, #timeout=18*60*60, #poke_interval=120, timeout=60, poke_interval=30, mode="reschedule" ) # Run Glue crawler glue_crawler_config = { "Name": glue_crawler_name, "Role": role_arn, "DatabaseName": glue_db_name, } crawl_s3 = GlueCrawlerOperator( task_id="crawl_s3", config=glue_crawler_config, ) # GlueCrawlerOperator waits by default, setting as False to test the Sensor below. crawl_s3.wait_for_completion = False # Wait for Glue crawler to complete wait_for_crawl = GlueCrawlerSensor( task_id="wait_for_crawl", crawler_name=glue_crawler_name, ) # Run Glue Job submit_glue_job = GlueJobOperator( task_id="submit_glue_job", job_name=glue_job_name, script_location=glue_script_location, iam_role_name=glue_role_name, create_job_kwargs={"GlueVersion": "4.0", "NumberOfWorkers": 10, "WorkerType": "G.1X"}, ) # GlueJobOperator waits by default, setting as False to test the Sensor below. submit_glue_job.wait_for_completion = False # Wait for Glue Job to complete wait_for_job = GlueJobSensor( task_id="wait_for_job", job_name=glue_job_name, # Job ID extracted from previous Glue Job Operator task run_id=submit_glue_job.output, verbose=True, # prints glue job logs in airflow logs ) wait_for_job.poke_interval = 5 # Execute the Stored Procedure in Redshift Serverless using Data Operator execute_redshift_stored_proc = RedshiftDataOperator( task_id="execute_redshift_stored_proc", database=db_name, workgroup_name=workgroup_name, secret_arn=secret_arn, sql="""CALL sp_products();""", poll_interval=poll_interval, wait_for_completion=True, ) # Execute the Stored Procedure in Redshift Serverless using SQL Operator delete_from_table = SQLExecuteQueryOperator( task_id="delete_from_table", conn_id=redshift_conn_id_name, sql="DELETE FROM products;", trigger_rule=TriggerRule.ALL_DONE, ) # Unload the data from Redshift table to S3 transfer_redshift_to_s3 = RedshiftToS3Operator( task_id="transfer_redshift_to_s3", s3_bucket=tgt_bucket_name, s3_key=s3_loc, schema="PUBLIC", table=redshift_table, redshift_conn_id=redshift_conn_id_name, ) transfer_redshift_to_s3.trigger_rule = TriggerRule.ALL_DONE #Chain the tasks to be executed chain( sensor_key, crawl_s3, wait_for_crawl, submit_glue_job, wait_for_job, execute_redshift_stored_proc, delete_from_table, transfer_redshift_to_s3 ) Verify the DAG run After you create the DAG file (replace the variables in the DAG script) and upload it to the s3://sample-airflow-instance/dags folder, it will be automatically synced with the Airflow UI. All DAGs appear on the DAGs tab. Toggle the ON option to make the DAG runnable. Because our DAG is set to schedule="@once", you need to manually run the job by choosing the run icon under Actions. When the DAG is complete, the status is updated in green, as shown in the following screenshot. In the Links section, there are options to view the code, graph, grid, log, and more. Choose Graph to visualize the DAG in a graph format. As shown in the following screenshot, each color of the node denotes a specific operator, and the color of the node outline denotes a specific status. Verify the results On the Amazon Redshift console, navigate to the Query Editor v2 and select the data in the products_f table. The table should be loaded and have the same number of records as S3 files. On the Amazon S3 console, navigate to the S3 bucket s3://sample-opt-bucket-etl in Account B. The product_f files should be created under the folder structure s3://sample-opt-bucket-etl/products/YYYY/MM/DD/. Clean up Clean up the resources created as part of this post to avoid incurring ongoing charges: Delete the CloudFormation stacks and S3 bucket that you created as prerequisites. Delete the VPCs and VPC peering connections, cross-account policies and roles, and secrets in Secrets Manager. Conclusion With Amazon MWAA, you can build complex workflows using Airflow and Python without managing clusters, nodes, or any other operational overhead typically associated with deploying and scaling Airflow in production. In this post, we showed how Amazon MWAA provides an automated way to ingest, transform, analyze, and distribute data between different accounts and services within AWS. For more examples of other AWS operators, refer to the following GitHub repository; we encourage you to learn more by trying out some of these examples. About the Authors Radhika Jakkula is a Big Data Prototyping Solutions Architect at AWS. She helps customers build prototypes using AWS analytics services and purpose-built databases. She is a specialist in assessing wide range of requirements and applying relevant AWS services, big data tools, and frameworks to create a robust architecture. Sidhanth Muralidhar is a Principal Technical Account Manager at AWS. He works with large enterprise customers who run their workloads on AWS. He is passionate about working with customers and helping them architect workloads for costs, reliability, performance, and operational excellence at scale in their cloud journey. He has a keen interest in data analytics as well. View the full article
-
- etl
- etl pipelines
- (and 5 more)
-
Corporations deal with massive amounts of data these days. As the amount of data increases, handling the incoming information and generating proper insights becomes necessary. Selecting the right data management services might be baffling since many options are available. Multiple platforms provide services that can assist you in analyzing and querying your data. In this […]View the full article
-
- amazon redshift
- amazon redshift serverless
- (and 3 more)
-
Analytics as a service (AaaS) is a business model that uses the cloud to deliver analytic capabilities on a subscription basis. This model provides organizations with a cost-effective, scalable, and flexible solution for building analytics. The AaaS model accelerates data-driven decision-making through advanced analytics, enabling organizations to swiftly adapt to changing market trends and make informed strategic choices. Amazon Redshift is a cloud data warehouse service that offers real-time insights and predictive analytics capabilities for analyzing data from terabytes to petabytes. It offers features like data sharing, Amazon Redshift ML, Amazon Redshift Spectrum, and Amazon Redshift Serverless, which simplify application building and make it effortless for AaaS companies to embed rich data analytics capabilities. Amazon Redshift delivers up to 4.9 times lower cost per user and up to 7.9 times better price-performance than other cloud data warehouses. The Powered by Amazon Redshift program helps AWS Partners operating an AaaS model quickly build analytics applications using Amazon Redshift and successfully scale their business. For example, you can build visualizations on top of Amazon Redshift and embed them within applications to provide outstanding analytics experiences for end-users. In this post, we explore how AaaS providers scale their processes with Amazon Redshift to deliver insights to their customers. AaaS delivery models While serving analytics at scale, AaaS providers and customers can choose where to store the data and where to process the data. AaaS providers could choose to ingest and process all the customer data into their own account and deliver insights to the customer account. Alternatively, they could choose to directly process data in-place within the customer’s account. The choice of these delivery models depends on many factors, and each has their own benefits. Because AaaS providers service multiple customers, they could mix these models in a hybrid fashion, meeting each customer’s preference. The following diagram illustrates the two delivery models. We explore the technical details of each model in the next sections. Build AaaS on Amazon Redshift Amazon Redshift has features that allow AaaS providers the flexibility to deploy three unique delivery models: Managed model – Processing data within the Redshift data warehouse the AaaS provider manages Bring-your-own-Redshift (BYOR) model – Processing data directly within the customer’s Redshift data warehouse Hybrid model – Using a mix of both models depending on customer needs These delivery models give AaaS providers the flexibility to deliver insights to their customers no matter where the data warehouse is located. Let’s look at how each of these delivery models work in practice. Managed model In this model, the AaaS provider ingests customer data in their own account, and engages their own Redshift data warehouse for processing. Then they use one or more methods to deliver the generated insights to their customers. Amazon Redshift enables companies to securely build multi-tenant applications, ensuring data isolation, integrity, and confidentiality. It provides features like row-level security (RLS), column-level security (CLS) for fine-grained access control, role-based access control (RBAC), and assigning permissions at the database and schema level. The following diagram illustrates the managed delivery model and the various methods AaaS providers can use to deliver insights to their customers. The workflow includes the following steps: The AaaS provider pulls data from customer data sources like operational databases, files, and APIs, and ingests them into the Redshift data warehouse hosted in their account. Data processing jobs enrich the data in Amazon Redshift. This could be an application the AaaS provider has built to process data, or they could use a data processing service like Amazon EMR or AWS Glue to run Spark applications. Now the AaaS provider has multiple methods to deliver insights to their customers: Option 1 – The enriched data with insights is shared directly with the customer’s Redshift instance using the Amazon Redshift data sharing feature. End-users consume data using business intelligence (BI) tools and analytics applications. Option 2 – If AaaS providers are publishing generic insights to AWS Data Exchange to reach millions of AWS customers and monetize those insights, their customers can use AWS Data Exchange for Amazon Redshift. With this feature, customers get instant insights in their Redshift data warehouse without having to write extract, transform, and load (ETL) pipelines to ingest the data. AWS Data Exchange provides their customers a secure and compliant way to subscribe to the data with consolidated billing and subscription management. Option 3 – The AaaS provider exposes insights on a web application using the Amazon Redshift Data API. Customers access the web application directly from the internet. The gives the AaaS provider the flexibility to expose insights outside an AWS account. Option 4 – Customers connect to the AaaS provider’s Redshift instance using Amazon QuickSight or other third-party BI tools through a JDBC connection. In this model, the customer shifts the responsibility of data management and governance to the AaaS providers, with light services to consume insights. This leads to improved decision-making as customers focus on core activities and save time from tedious data management tasks. Because AaaS providers move data from the customer accounts, there could be associated data transfer costs depending on how they move the data. However, because they deliver this service at scale to multiple customers, they can offer cost-efficient services using economies of scale. BYOR model In cases where the customer hosts a Redshift data warehouse and wants to run analytics in their own data platform without moving data out, you use the BYOR model. The following diagram illustrates the BYOR model, where AaaS providers process data to add insights directly in their customer’s data warehouse so the data never leaves the customer account. The solution includes the following steps: The customer ingests all the data from various data sources into their Redshift data warehouse. The data undergoes processing: The AaaS provider uses a secure channel, AWS PrivateLink for the Redshift Data API, to push data processing logic directly in the customer’s Redshift data warehouse. They use the same channel to process data at scale with multiple customers. The diagram illustrates a second customer, but this can scale to hundreds or thousands of customers. AaaS providers can tailor data processing logic per customer by isolating scripts for each customer and deploying them according to the customer’s identity, providing a customized and efficient service. The customer’s end-users consume data from their own account using BI tools and analytics applications. The customer has control over how to expose insights to their end-users. This delivery model allows customers to manage their own data, reducing dependency on AaaS providers and cutting data transfer costs. By keeping data in their own environment, customers can reduce the risk of data breach while benefiting from insights for better decision-making. Hybrid model Customers have diverse needs influenced by factors like data security, compliance, and technical expertise. To cover a broader range of customers, AaaS providers can choose a hybrid approach that delivers both the managed model and the BYOR model depending on the customer, offering flexibility and the ability to serve multiple customers. The following diagram illustrates the AaaS provider delivering insights through the BYOR model for Customer 1 and 4, the managed model for Customer 2 and 3, and so on. Conclusion In this post, we talked about the rising demand of analytics as a service and how providers can use the capabilities of Amazon Redshift to deliver insights to their customers. We examined two primary delivery models: the managed model, where AaaS providers process data on their own accounts, and the BYOR model, where AaaS providers process and enrich data directly in their customer’s account. Each method offers unique benefits, such as cost-efficiency, enhanced control, and personalized insights. The flexibility of the AWS Cloud facilitates a hybrid model, accommodating diverse customer needs and allowing AaaS providers to scale. We also introduced the Powered by Amazon Redshift program, which supports AaaS businesses in building effective analytics applications, fostering improved user engagement and business growth. We take this opportunity to invite our ISV partners to reach out to us and learn more about the Powered by Amazon Redshift program. About the Authors Sandipan Bhaumik is a Senior Analytics Specialist Solutions Architect based in London, UK. He helps customers modernize their traditional data platforms using the modern data architecture in the cloud to perform analytics at scale. Sain Das is a Senior Product Manager on the Amazon Redshift team and leads Amazon Redshift GTM for partner programs, including the Powered by Amazon Redshift and Redshift Ready programs. View the full article
-
Huge performance-boosting opportunities await those who choose the optimal data warehouse for their business. Identifying custom data points that steer your organizations’ successful outcomes is crucial. Decision-making is optimized through sophisticated means of accessing and analyzing your company’s data. As the use of data warehouses grows exponentially, consumer choices become additionally more challenging to discern […]View the full article
-
As analytics in your company graduates from a MySQL/PostgreSQL/SQL Server, a pertinent question that you need to answer is which data warehouse is best suited for you. This blog tries to compare Redshift vs BigQuery – two very famous cloud data warehouses today. In this post, we are going to talk about the two most […]View the full article
-
More data has been created in the past two years than was ever created in human history. With the exploding volumes of data, people are now looking for data warehouse solutions, which can benefit them in terms of performance, cost, security, and durability. To have an answer to this problem, many companies released data warehousing […]View the full article
-
As businesses continue to generate massive amounts of data, the need for an efficient and scalable data warehouse becomes paramount. Amazon Redshift has always been at the forefront of providing innovative cloud-based services, and with its latest addition, Amazon Redshift Serverless, the data warehouse industry is being revolutionized. With Amazon Redshift Serverless, AWS has removed […]View the full article
-
- amazon redshift
- serverless
-
(and 1 more)
Tagged with:
-
Data migration between different platforms is critical to consider when generating multiple organizational strategies. One such migration involves transferring data from a cloud-based data warehousing service environment to a relational database management system. Amazon Redshift SQL Server Integration can give organizations an edge, allowing them to conduct enhanced analysis and reporting through custom report application […]View the full article
-
This post is co-written with Amir Souchami and Fabian Szenkier from Unity. Aura from Unity (formerly known as ironSource) is the market standard for creating rich device experiences that engage and retain customers. With a powerful set of solutions, Aura enables complete digital transformation, letting operators promote key services outside the store, directly on-device. Amazon Redshift is a recommended service for online analytical processing (OLAP) workloads such as cloud data warehouses, data marts, and other analytical data stores. You can use simple SQL to analyze structured and semi-structured data, operational databases, and data lakes to deliver the best price/performance at any scale. The Amazon Redshift data sharing feature provides instant, granular, and high-performance access without data copies and data movement across multiple Redshift data warehouses in the same or different AWS accounts and across AWS Regions. Data sharing provides live access to data so that you always see the most up-to-date and consistent information as it’s updated in the data warehouse. Amazon Redshift Serverless makes it straightforward to run and scale analytics in seconds without the need to set up and manage data warehouse clusters. Redshift Serverless automatically provisions and intelligently scales data warehouse capacity to deliver fast performance for even the most demanding and unpredictable workloads, and you pay only for what you use. You can load your data and start querying right away in the Amazon Redshift Query Editor or in your favorite business intelligence (BI) tool and continue to enjoy the best price/performance and familiar SQL features in an easy-to-use, zero administration environment. In this post, we describe Aura’s successful and swift adoption of Redshift Serverless, which allowed them to optimize their overall bidding advertisement campaigns’ time to market from 24 hours to 2 hours. We explore why Aura chose this solution and what technological challenges it helped solve. Aura’s initial data pipeline Aura is a pioneer in using Redshift RA3 clusters with data sharing for extract, transform, and load (ETL) and BI workloads. One of Aura’s operations is bidding advertisement campaigns. These campaigns are optimized by using an AI-based bid process that requires running hundreds of analytical queries per campaign. These queries are run on data that resides in an RA3 provisioned Redshift cluster. The integrated pipeline is comprised of various AWS services: Amazon Elastic Container Registry (Amazon ECR) for storing Amazon Elastic Kubernetes Service (Amazon EKS) Docker images Amazon Managed Workflows for Apache Airflow (Amazon MWAA) for pipeline orchestration Amazon DynamoDB for storing job-related configuration such as service connection strings and batch sizes Amazon Managed Streaming for Apache Kafka (Amazon MSK) for streaming last changed and added advertisement campaigns EKSPodOperator in Amazon MWAA for triggering an EKS pod task that runs the data preparation queries for each ad campaign on Aura’s main Redshift provisioned cluster Amazon Redshift provisioned for running ETL jobs, a BI layer, and analytical queries per ad campaign An Amazon Simple Storage Service (Amazon S3) bucket for storing the Redshift query results Amazon MWAA with Amazon EKS for running machine learning (ML) training on the query results using a Python-based ML algorithm The following diagram illustrates this architecture. Challenges of the initial architecture The queries for each campaign run in the following manner: First, a preparation query filters and aggregates raw data, preparing it for the subsequent operation. This is followed by the main query, which carries out the logic according to the preparation query result set. As the number of campaigns grew, Aura’s Data team was required to run hundreds of concurrent queries for each of these steps. Aura’s existing provisioned cluster was already heavily utilized with data ingestion, ETL, and BI workloads, so they were looking for cost-effective ways to isolate this workload with dedicated compute resources. The team evaluated a variety of options, including unloading data to Amazon S3 and a multi-cluster architecture using data sharing and Redshift serverless. The team gravitated towards the multi-cluster architecture with data sharing, as it requires no query rewrite, allows for dedicated compute for this specific workload, avoids the need to duplicate or move data from the main cluster, and provides high concurrency and automatic scaling. Lastly, it’s billed in a pay-for-what-you-use model, and provisioning is straightforward and quick. Proof of concept After evaluating the options, Aura’s Data team decided to conduct a proof of concept using Redshift Serverless as a consumer of their main Redshift provisioned cluster, sharing just the relevant tables for running the required queries. Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs). A single RPU provides 16 GB of memory and a serverless endpoint can range from 8 RPU to 512 RPU. Aura’s Data team started the proof of concept using a 256 RPU Redshift Serverless endpoint and gradually lowered the RPU to reduce costs while making sure the query runtime was below the required target. Eventually, the team decided to use a 128 RPU (2 TB RAM) Redshift Serverless endpoint as the base RPU, while using the Redshift Serverless auto scaling feature, which allows hundreds of concurrent queries to run by automatically upscaling the RPU as needed. Aura’s new solution with Redshift Serverless After a successful proof of concept, the production setup included adding code to switch between the provisioned Redshift cluster and the Redshift Serverless endpoint. This was done using a configurable threshold based on the number of queries waiting to be processed in a specific MSK topic consumed at the beginning of the pipeline. Small-scale campaign queries would still run on the provisioned cluster, and large-scale queries would use the Redshift Serverless endpoint. The new solution uses an Amazon MWAA pipeline that fetches configuration information from a DynamoDB table, consumes jobs that represent ad campaigns, and then runs hundreds of EKS jobs triggered using EKSPodOperator. Each job runs the two serial queries (the preparation query followed by a main query, which outputs the results to Amazon S3). This happens several hundred times concurrently using Redshift Serverless compute resources. Then the process initiates another set of EKSPodOperator operators to run the AI training code based on the data result that was saved on Amazon S3. The following diagram illustrates the solution architecture. Outcome The overall runtime of the pipeline was reduced from 24 hours to just 2 hours, a 12-times improvement. This integration of Redshift Serverless, coupled with data sharing, led to a 90% reduction in pipeline duration, negating the necessity for data duplication or query rewriting. Moreover, the introduction of a dedicated consumer as an exclusive compute resource significantly eased the load of the producer cluster, enabling running small-scale queries even faster. “Redshift Serverless and data sharing enabled us to provision and scale our data warehouse capacity to deliver fast performance, high concurrency and handle challenging ML workloads with very minimal effort.” – Amir Souchami, Aura’s Principal Technical Systems Architect. Learnings Aura’s Data team is highly focused on working in a cost-effective manner and has therefore implemented several cost controls in their Redshift Serverless endpoint: Limit the overall spend by setting a maximum RPU-hour usage limit (per day, week, month) for the workgroup. Aura configured that limit so when it is reached, Amazon Redshift will send an alert to the relevant Amazon Redshift administrator team. This feature also allows writing an entry to a system table and even turning off user queries. Use a maximum RPU configuration, which defines the upper limit of compute resources that Redshift Serverless can use at any given time. When the maximum RPU limit is set for the workgroup, Redshift Serverless scales within that limit to continue to run the workload. Implement query monitoring rules that prevent wasteful resource utilization and runaway costs caused by poorly written queries. Conclusion A data warehouse is a crucial part of any modern data-driven company, enabling you to answer complex business questions and provide insights. The evolution of Amazon Redshift allowed Aura to quickly adapt to business requirements by combining data sharing between provisioned and Redshift Serverless data warehouses. Aura’s journey with Redshift Serverless underscores the vast potential of strategic tech integration in driving efficiency and operational excellence. If Aura’s journey has sparked your interest and you are considering implementing a similar solution in your organization, here are some strategic steps to consider: Start by thoroughly understanding your organization’s data needs and how such a solution can address them. Reach out to AWS experts, who can provide you with guidance based on their own experiences. Consider engaging in seminars, workshops, or online forums that discuss these technologies. The following resources are recommended for getting started: Redshift Serverless and data sharing workshop Redshift Serverless overview An important part of this journey would be to implement a proof of concept. Such hands-on experience will provide valuable insights before moving to production. Elevate your Redshift expertise. Already enjoying the power of Amazon Redshift? Enhance your data journey with the latest features and expert guidance. Reach out to your dedicated AWS account team for personalized support, discover cutting-edge capabilities, and unlock even greater value from your data with Amazon Redshift. About the Authors Amir Souchami, Chief Architect of Aura from Unity, focusing on creating resilient and performant cloud systems and mobile apps at major scale. Fabian Szenkier is the ML and Big Data Architect at Aura by Unity, works on building modern AI/ML solutions and state of the art data engineering pipelines at scale. Liat Tzur is a Senior Technical Account Manager at Amazon Web Services. She serves as the customer’s advocate and assists her customers in achieving cloud operational excellence in alignment with their business goals. Adi Jabkowski is a Sr. Redshift Specialist in EMEA, part of the Worldwide Specialist Organization (WWSO) at AWS. Yonatan Dolan is a Principal Analytics Specialist at Amazon Web Services. He is located in Israel and helps customers harness AWS analytical services to leverage data, gain insights, and derive value. View the full article
-
Amazon Aurora MySQL zero-ETL integration with Amazon Redshift is now supported in 11 additional regions, enabling near real-time analytics and machine learning (ML) using Amazon Redshift. Based on your analytics needs, you can include or exclude specific databases and tables from an existing or a new zero-ETL integration and selectively bring data into Amazon Redshift. View the full article
-
- amazon aurora
- amazon redshift
-
(and 2 more)
Tagged with:
-
Healthcare providers have an opportunity to improve the patient experience by collecting and analyzing broader and more diverse datasets. This includes patient medical history, allergies, immunizations, family disease history, and individuals’ lifestyle data such as workout habits. Having access to those datasets and forming a 360-degree view of patients allows healthcare providers such as claim analysts to see a broader context about each patient and personalize the care they provide for every individual. This is underpinned by building a complete patient profile that enables claim analysts to identify patterns, trends, potential gaps in care, and adherence to care plans. They can then use the result of their analysis to understand a patient’s health status, treatment history, and past or upcoming doctor consultations to make more informed decisions, streamline the claim management process, and improve operational outcomes. Achieving this will also improve general public health through better and more timely interventions, identify health risks through predictive analytics, and accelerate the research and development process. AWS has invested in a zero-ETL (extract, transform, and load) future so that builders can focus more on creating value from data, instead of having to spend time preparing data for analysis. The solution proposed in this post follows a zero-ETL approach to data integration to facilitate near real-time analytics and deliver a more personalized patient experience. The solution uses AWS services such as AWS HealthLake, Amazon Redshift, Amazon Kinesis Data Streams, and AWS Lake Formation to build a 360 view of patients. These services enable you to collect and analyze data in near real time and put a comprehensive data governance framework in place that uses granular access control to secure sensitive data from unauthorized users. Zero-ETL refers to a set of features on the AWS Cloud that enable integrating different data sources with Amazon Redshift: Integration between Amazon Redshift and Amazon Simple Storage Service (Amazon S3) via Amazon Redshift Spectrum and auto-copy features Integration between Amazon Redshift and Amazon Aurora, Amazon Relational Database Service (Amazon RDS), and Amazon DynamoDB via the zero-ETL feature Integration between Amazon Redshift and streaming sources like Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK) via streaming ingestion Solution overview Organizations in the healthcare industry are currently spending a significant amount of time and money on building complex ETL pipelines for data movement and integration. This means data will be replicated across multiple data stores via bespoke and in some cases hand-written ETL jobs, resulting in data inconsistency, latency, and potential security and privacy breaches. With support for querying cross-account Apache Iceberg tables via Amazon Redshift, you can now build a more comprehensive patient-360 analysis by querying all patient data from one place. This means you can seamlessly combine information such as clinical data stored in HealthLake with data stored in operational databases such as a patient relationship management system, together with data produced from wearable devices in near real-time. Having access to all this data enables healthcare organizations to form a holistic view of patients, improve care coordination across multiple organizations, and provide highly personalized care for each individual. The following diagram depicts the high-level solution we build to achieve these outcomes. Deploy the solution You can use the following AWS CloudFormation template to deploy the solution components: This stack creates the following resources and necessary permissions to integrate the services: A Kinesis data stream. You can send data from your streaming source to this resource for ingesting the data into a Redshift data warehouse. We use on-demand capacity mode. An Amazon Aurora MySQL-Compatible Edition cluster version 8.0. This will be your online transaction processing (OLTP) data store for transactional data. To set up zero-ETL integration for ingesting transaction data to the Redshift data warehouse, see Getting started guide for near-real time operational analytics using Amazon Aurora zero-ETL integration with Amazon Redshift. The required parameter groups for source and target are already created as part of the CloudFormation stack. An Amazon Redshift Serverless workgroup and associated namespace. The CloudFormation stack also deploys a provisioned Redshift cluster. If you would like to work with Redshift Serverless, you can remove the provisioned cluster from the template and vice versa. An AWS Identity and Access Management (IAM) role with required policies and trust relationships. Network components, including VPC, subnets, route table, and associations. You can customize these resources as per your organization’s rules. AWS Solution setup AWS HealthLake AWS HealthLake enables organizations in the health industry to securely store, transform, transact, and analyze health data. It stores data in HL7 FHIR format, which is an interoperability standard designed for quick and efficient exchange of health data. When you create a HealthLake data store, a Fast Healthcare Interoperability Resources (FHIR) data repository is made available via a RESTful API endpoint. Simultaneously and as part of AWS HealthLake managed service, the nested JSON FHIR data undergoes an ETL process and is stored in Apache Iceberg open table format in Amazon S3. To create an AWS HealthLake data store, refer to Getting started with AWS HealthLake. Make sure to select the option Preload sample data when creating your data store. In real-world scenarios and when you use AWS HealthLake in production environments, you don’t need to load sample data into your AWS HealthLake data store. Instead, you can use FHIR REST API operations to manage and search resources in your AWS HealthLake data store. We use two tables from the sample data stored in HealthLake: patient and allergyintolerance. Query AWS HealthLake tables with Redshift Serverless Amazon Redshift is the data warehousing service available on the AWS Cloud that provides up to six times better price-performance than any other cloud data warehouses in the market, with a fully managed, AI-powered, massively parallel processing (MPP) data warehouse built for performance, scale, and availability. With continuous innovations added to Amazon Redshift, it is now more than just a data warehouse. It enables organizations of different sizes and in different industries to access all the data they have in their AWS environments and analyze it from one single location with a set of features under the zero-ETL umbrella. Amazon Redshift integrates with AWS HealthLake and data lakes through Redshift Spectrum and Amazon S3 auto-copy features, enabling you to query data directly from files on Amazon S3. Query AWS HealthLake data with Amazon Redshift Amazon Redshift makes it straightforward to query the data stored in S3-based data lakes with automatic mounting of an AWS Glue Data Catalog in the Redshift query editor v2. This means you no longer have to create an external schema in Amazon Redshift to use the data lake tables cataloged in the Data Catalog. To get started with this feature, see Querying the AWS Glue Data Catalog. After it is set up and you’re connected to the Redshift query editor v2, complete the following steps: Validate that your tables are visible in the query editor V2. The Data Catalog objects are listed under the awsdatacatalog database. FHIR data stored in AWS HealthLake is highly nested. To learn about how to un-nest semi-structured data with Amazon Redshift, see Tutorial: Querying nested data with Amazon Redshift Spectrum. Use the following query to un-nest the allergyintolerance and patient tables, join them together, and get patient details and their allergies: WITH patient_allergy AS ( SELECT resourcetype, c AS allery_category, a."patient"."reference", SUBSTRING(a."patient"."reference", 9, LEN(a."patient"."reference")) AS patient_id, a.recordeddate AS allergy_record_date, NVL(cd."code", 'NA') AS allergy_code, NVL(cd.display, 'NA') AS allergy_description FROM "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."allergyintolerance" a LEFT JOIN a.category c ON TRUE LEFT JOIN a.reaction r ON TRUE LEFT JOIN r.manifestation m ON TRUE LEFT JOIN m.coding cd ON TRUE ), patinet_info AS ( SELECT id, gender, g as given_name, n.family as family_name, pr as prefix FROM "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."patient" p LEFT JOIN p.name n ON TRUE LEFT JOIN n.given g ON TRUE LEFT JOIN n.prefix pr ON TRUE ) SELECT DISTINCT p.id, p.gender, p.prefix, p.given_name, p.family_name, pa.allery_category, pa.allergy_code, pa.allergy_description from patient_allergy pa JOIN patinet_info p ON pa.patient_id = p.id ORDER BY p.id, pa.allergy_code ; To eliminate the need for Amazon Redshift to un-nest data every time a query is run, you can create a materialized view to hold un-nested and flattened data. Materialized views are an effective mechanism to deal with complex and repeating queries. They contain a precomputed result set, based on a SQL query over one or more base tables. You can issue SELECT statements to query a materialized view, in the same way that you can query other tables or views in the database. Use the following SQL to create a materialized view. You use it later to build a complete view of patients: CREATE MATERIALIZED VIEW patient_allergy_info AUTO REFRESH YES AS WITH patient_allergy AS ( SELECT resourcetype, c AS allery_category, a."patient"."reference", SUBSTRING(a."patient"."reference", 9, LEN(a."patient"."reference")) AS patient_id, a.recordeddate AS allergy_record_date, NVL(cd."code", 'NA') AS allergy_code, NVL(cd.display, 'NA') AS allergy_description FROM "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."allergyintolerance" a LEFT JOIN a.category c ON TRUE LEFT JOIN a.reaction r ON TRUE LEFT JOIN r.manifestation m ON TRUE LEFT JOIN m.coding cd ON TRUE ), patinet_info AS ( SELECT id, gender, g as given_name, n.family as family_name, pr as prefix FROM "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."patient" p LEFT JOIN p.name n ON TRUE LEFT JOIN n.given g ON TRUE LEFT JOIN n.prefix pr ON TRUE ) SELECT DISTINCT p.id, p.gender, p.prefix, p.given_name, p.family_name, pa.allery_category, pa.allergy_code, pa.allergy_description from patient_allergy pa JOIN patinet_info p ON pa.patient_id = p.id ORDER BY p.id, pa.allergy_code ; You have confirmed you can query data in AWS HealthLake via Amazon Redshift. Next, you set up zero-ETL integration between Amazon Redshift and Amazon Aurora MySQL. Set up zero-ETL integration between Amazon Aurora MySQL and Redshift Serverless Applications such as front-desk software, which are used to schedule appointments and register new patients, store data in OLTP databases such as Aurora. To get data out of OLTP databases and have them ready for analytics use cases, data teams might have to spend a considerable amount of time to build, test, and deploy ETL jobs that are complex to maintain and scale. With the Amazon Redshift zero-ETL integration with Amazon Aurora MySQL, you can run analytics on the data stored in OLTP databases and combine them with the rest of the data in Amazon Redshift and AWS HealthLake in near real time. In the next steps in this section, we connect to a MySQL database and set up zero-ETL integration with Amazon Redshift. Connect to an Aurora MySQL database and set up data Connect to your Aurora MySQL database using your editor of choice using AdminUsername and AdminPassword that you entered when running the CloudFormation stack. (For simplicity, it is the same for Amazon Redshift and Aurora.) When you’re connected to your database, complete the following steps: Create a new database by running the following command: CREATE DATABASE front_desk_app_db; Create a new table. This table simulates storing patient information as they visit clinics and other healthcare centers. For simplicity and to demonstrate specific capabilities, we assume that patient IDs are the same in AWS HealthLake and the front-of-office application. In real-world scenarios, this can be a hashed version of a national health care number: CREATE TABLE patient_appointment ( patient_id varchar(250), gender varchar(1), date_of_birth date, appointment_datetime datetime, phone_number varchar(15), PRIMARY KEY (patient_id, appointment_datetime) ); Having a primary key in the table is mandatory for zero-ETL integration to work. Insert new records into the source table in the Aurora MySQL database. To demonstrate the required functionalities, make sure the patient_id of the sample records inserted into the MySQL database match the ones in AWS HealthLake. Replace [patient_id_1] and [patient_id_2] in the following query with the ones from the Redshift query you ran previously (the query that joined allergyintolerance and patient): INSERT INTO front_desk_app_db.patient_appointment (patient_id, gender, date_of_birth, appointment_datetime, phone_number) VALUES([PATIENT_ID_1], 'F', '1988-7-04', '2023-12-19 10:15:00', '0401401401'), ([PATIENT_ID_1], 'F', '1988-7-04', '2023-09-19 11:00:00', '0401401401'), ([PATIENT_ID_1], 'F', '1988-7-04', '2023-06-06 14:30:00', '0401401401'), ([PATIENT_ID_2], 'F', '1972-11-14', '2023-12-19 08:15:00', '0401401402'), ([PATIENT_ID_2], 'F', '1972-11-14', '2023-01-09 12:15:00', '0401401402'); Now that your source table is populated with sample records, you can set up zero-ETL and have data ingested into Amazon Redshift. Set up zero-ETL integration between Amazon Aurora MySQL and Amazon Redshift Complete the following steps to create your zero-ETL integration: On the Amazon RDS console, choose Databases in the navigation pane. Choose the DB identifier of your cluster (not the instance). On the Zero-ETL Integration tab, choose Create zero-ETL integration. Follow the steps to create your integration. Create a Redshift database from the integration Next, you create a target database from the integration. You can do this by running a couple of simple SQL commands on Amazon Redshift. Log in to the query editor V2 and run the following commands: Get the integration ID of the zero-ETL you set up between your source database and Amazon Redshift: SELECT * FROM svv_integration; Create a database using the integration ID: CREATE DATABASE ztl_demo FROM INTEGRATION '[INTEGRATION_ID '; Query the database and validate that a new table is created and populated with data from your source MySQL database: SELECT * FROM ztl_demo.front_desk_app_db.patient_appointment; It might take a few seconds for the first set of records to appear in Amazon Redshift. This shows that the integration is working as expected. To validate it further, you can insert a new record in your Aurora MySQL database, and it will be available in Amazon Redshift for querying in near real time within a few seconds. Set up streaming ingestion for Amazon Redshift Another aspect of zero-ETL on AWS, for real-time and streaming data, is realized through Amazon Redshift Streaming Ingestion. It provides low-latency, high-speed ingestion of streaming data from Kinesis Data Streams and Amazon MSK. It lowers the effort required to have data ready for analytics workloads, lowers the cost of running such workloads on the cloud, and decreases the operational burden of maintaining the solution. In the context of healthcare, understanding an individual’s exercise and movement patterns can help with overall health assessment and better treatment planning. In this section, you send simulated data from wearable devices to Kinesis Data Streams and integrate it with the rest of the data you already have access to from your Redshift Serverless data warehouse. For step-by-step instructions, refer to Real-time analytics with Amazon Redshift streaming ingestion. Note the following steps when you set up streaming ingestion for Amazon Redshift: Select wearables_stream and use the following template when sending data to Amazon Kinesis Data Streams via Kinesis Data Generator, to simulate data generated by wearable devices. Replace [PATIENT_ID_1] and [PATIENT_ID_2] with the patient IDs you earlier when inserting new records into your Aurora MySQL table: { "patient_id": "{{random.arrayElement(["[PATIENT_ID_1]"," [PATIENT_ID_2]"])}}", "steps_increment": "{{random.arrayElement( [0,1] )}}", "heart_rate": {{random.number( { "min":45, "max":120} )}} } Create an external schema called from_kds by running the following query and replacing [IAM_ROLE_ARN] with the ARN of the role created by the CloudFormation stack (Patient360BlogRole): CREATE EXTERNAL SCHEMA from_kds FROM KINESIS IAM_ROLE '[IAM_ROLE_ARN]'; Use the following SQL when creating a materialized view to consume data from the stream: CREATE MATERIALIZED VIEW patient_wearable_data AUTO REFRESH YES AS SELECT approximate_arrival_timestamp, JSON_PARSE(kinesis_data) as Data FROM from_kds."wearables_stream" WHERE CAN_JSON_PARSE(kinesis_data); To validate that streaming ingestion works as expected, refresh the materialized view to get the data you already sent to the data stream and query the table to make sure data has landed in Amazon Redshift: REFRESH MATERIALIZED VIEW patient_wearable_data; SELECT * FROM patient_wearable_data ORDER BY approximate_arrival_timestamp DESC; Query and analyze patient wearable data The results in the data column of the preceding query are in JSON format. Amazon Redshift makes it straightforward to work with semi-structured data in JSON format. It uses PartiQL language to offer SQL-compatible access to relational, semi-structured, and nested data. Use the following query to flatten data: SELECT data."patient_id"::varchar AS patient_id, data."steps_increment"::integer as steps_increment, data."heart_rate"::integer as heart_rate, approximate_arrival_timestamp FROM patient_wearable_data ORDER BY approximate_arrival_timestamp DESC; The result looks like the following screenshot. Now that you know how to flatten JSON data, you can analyze it further. Use the following query to get the number of minutes a patient has been physically active per day, based on their heart rate (greater than 80): WITH patient_wearble_flattened AS ( SELECT data."patient_id"::varchar AS patient_id, data."steps_increment"::integer as steps_increment, data."heart_rate"::integer as heart_rate, approximate_arrival_timestamp, DATE(approximate_arrival_timestamp) AS date_received, extract(hour from approximate_arrival_timestamp) AS hour_received, extract(minute from approximate_arrival_timestamp) AS minute_received FROM patient_wearable_data ), patient_active_minutes AS ( SELECT patient_id, date_received, hour_received, minute_received, avg(heart_rate) AS heart_rate FROM patient_wearble_flattened GROUP BY patient_id, date_received, hour_received, minute_received HAVING avg(heart_rate) > 80 ) SELECT patient_id, date_received, COUNT(heart_rate) AS active_minutes_count FROM patient_active_minutes GROUP BY patient_id, date_received ORDER BY patient_id, date_received; Create a complete patient 360 Now that you are able to query all patient data with Redshift Serverless, you can combine the three datasets you used in this post and form a comprehensive patient 360 view with the following query: WITH patient_appointment_info AS ( SELECT "patient_id", "gender", "date_of_birth", "appointment_datetime", "phone_number" FROM ztl_demo.front_desk_app_db.patient_appointment ), patient_wearble_flattened AS ( SELECT data."patient_id"::varchar AS patient_id, data."steps_increment"::integer as steps_increment, data."heart_rate"::integer as heart_rate, approximate_arrival_timestamp, DATE(approximate_arrival_timestamp) AS date_received, extract(hour from approximate_arrival_timestamp) AS hour_received, extract(minute from approximate_arrival_timestamp) AS minute_received FROM patient_wearable_data ), patient_active_minutes AS ( SELECT patient_id, date_received, hour_received, minute_received, avg(heart_rate) AS heart_rate FROM patient_wearble_flattened GROUP BY patient_id, date_received, hour_received, minute_received HAVING avg(heart_rate) > 80 ), patient_active_minutes_count AS ( SELECT patient_id, date_received, COUNT(heart_rate) AS active_minutes_count FROM patient_active_minutes GROUP BY patient_id, date_received ) SELECT pai.patient_id, pai.gender, pai.prefix, pai.given_name, pai.family_name, pai.allery_category, pai.allergy_code, pai.allergy_description, ppi.date_of_birth, ppi.appointment_datetime, ppi.phone_number, pamc.date_received, pamc.active_minutes_count FROM patient_allergy_info pai LEFT JOIN patient_active_minutes_count pamc ON pai.patient_id = pamc.patient_id LEFT JOIN patient_appointment_info ppi ON pai.patient_id = ppi.patient_id GROUP BY pai.patient_id, pai.gender, pai.prefix, pai.given_name, pai.family_name, pai.allery_category, pai.allergy_code, pai.allergy_description, ppi.date_of_birth, ppi.appointment_datetime, ppi.phone_number, pamc.date_received, pamc.active_minutes_count ORDER BY pai.patient_id, pai.gender, pai.prefix, pai.given_name, pai.family_name, pai.allery_category, pai.allergy_code, pai.allergy_description, ppi.date_of_birth DESC, ppi.appointment_datetime DESC, ppi.phone_number DESC, pamc.date_received, pamc.active_minutes_count You can use the solution and queries used here to expand the datasets used in your analysis. For example, you can include other tables from AWS HealthLake as needed. Clean up To clean up resources you created, complete the following steps: Delete the zero-ETL integration between Amazon RDS and Amazon Redshift. Delete the CloudFormation stack. Delete AWS HealthLake data store Conclusion Forming a comprehensive 360 view of patients by integrating data from various different sources offers numerous benefits for organizations operating in the healthcare industry. It enables healthcare providers to gain a holistic understanding of a patient’s medical journey, enhances clinical decision-making, and allows for more accurate diagnosis and tailored treatment plans. With zero-ETL features for data integration on AWS, it is effortless to build a view of patients securely, cost-effectively, and with minimal effort. You can then use visualization tools such as Amazon QuickSight to build dashboards or use Amazon Redshift ML to enable data analysts and database developers to train machine learning (ML) models with the data integrated through Amazon Redshift zero-ETL. The result is a set of ML models that are trained with a broader view into patients, their medical history, and their lifestyle, and therefore enable you make more accurate predictions about their upcoming health needs. About the Authors Saeed Barghi is a Sr. Analytics Specialist Solutions Architect specializing in architecting enterprise data platforms. He has extensive experience in the fields of data warehousing, data engineering, data lakes, and AI/ML. Based in Melbourne, Australia, Saeed works with public sector customers in Australia and New Zealand. Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 17 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe. View the full article
-
Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data. Tens of thousands of customers use Amazon Redshift to process large amounts of data, modernize their data analytics workloads, and provide insights for their business users. In this post, we discuss how to successfully conduct a proof of concept in Amazon Redshift by going through the main stages of the process, available tools that accelerate implementation, and common use cases. Proof of concept overview A proof of concept (POC) is a process that uses representative data to validate whether a technology or service fulfills a customer’s technical and business requirements. By testing the solution against key metrics, a POC provides insights that allow you to make an informed decision on the suitability of the technology for the intended use case. There are three major POC validation areas: Workload – Take a representative portion of an existing workload and test it on Amazon Redshift, such as an extract, transform, and load (ETL) process, reporting, or management Capability – Demonstrate how a specific Amazon Redshift feature, such as zero-ETL integration with Amazon Redshift, data sharing, or Amazon Redshift Spectrum, can simplify or enhance your overall architecture Architecture – Understand how Amazon Redshift fits into a new or existing architecture along with other AWS services and tools A POC is not: Planning and implementing a large-scale migration User-facing deployments, such as deploying a configuration for user testing and validation over extended periods (this is more of a pilot) End-to-end implementation of a use case (this is more of a prototype) Proof of concept process For a POC to be successful, it is recommended to follow and apply a well-defined and structured process. For a POC on Amazon Redshift, we recommend a three-phase process of discovery, implementation, and evaluation. Discovery phase The discovery phase is considered the most essential among the three phases and the longest. It defines through multiple sessions the scope of the POC and the list of tasks that need to be completed and later evaluated. The scope should contain inputs and data points on the current architecture as well as the target architecture. The following items need to be defined and documented to have a defined scope for the POC: Current state architecture and its challenges Business goals and the success criteria of the POC (such as cost, performance, and security) along with their associated priorities Evaluation criteria that will be used to evaluate and interpret the success criteria, such as service-level agreements (SLAs) Target architecture (the communication between the services and tools that will be used during the implementation of the POC) Dataset and the list of tables and schemas After the scope has been clearly defined, you should proceed with defining and planning the list of tasks that need to be run during the next phase in order to implement the scope. Also, depending on the technical familiarity with the latest developments in Amazon Redshift, a technical enablement session on Amazon Redshift is also highly recommended before starting the implementation phase. Optionally, a responsibility assignment matrix (RAM) is recommended, especially in large POCs. Implementation phase The implementation phase takes the output of the previous phase as input. It consists of the following steps: Set up the environment by respecting the defined POC architecture. Complete the implementation tasks such as data ingestion and performance testing. Collect data metrics and statistics on the completed tasks. Analyze the data and then optimize as necessary. Evaluation phase The evaluation phase is the POC assessment and the final step of the process. It aggregates the implementation results of the preceding phase, interprets them, and evaluates the success criteria described in the discovery phase. It is recommended to use percentiles instead of averages whenever possible for a better interpretation. Challenges In this section, we discuss the major challenges that you may encounter while planning your POC. Scope You may face challenges during the discovery phase while defining the scope of the POC, especially in complex environments. You should focus on the crucial requirements and prioritized success criteria that need to be evaluated so you avoid ending up with a small migration project instead of a POC. In terms of technical content (such as data structures, transformation jobs, and reporting queries), make sure to identify and consider as little as possible of the content that will still provide you with all the necessary information at the end of the implementation phase in order to assess the defined success criteria. Additionally, document any assumptions you are making. Time A time period should be defined for any POC project to ensure it stays focused and achieves clear results. Without an established time frame, scope creep can occur as requirements shift and unnecessary features get added. This may lead to misleading evaluations about the technology or concept being tested. The duration set for the POC depends on factors like workload complexity and resource availability. If a period such as 3 weeks has been committed to already without accounting for these considerations, the scope and planned content should be scaled to feasibly fit that fixed time period. Cost Cloud services operate on a pay-as-you-go model, and estimating costs accurately can be challenging during a POC. Overspending or underestimating resource requirements can impact budget allocations. It’s important to carefully estimate the initial sizing of the Redshift cluster, monitor resource usage closely, and consider setting service limits along with AWS Budget alerts to avoid unexpected expenditures. Technical The team running the POC has to be ready for initial technical challenges, especially during environment setup, data ingestion, and performance testing. Each data warehouse technology has its own design and architecture, which sometimes requires some initial tuning at the data structure or query level. This is an expected challenge that needs to be considered in the implementation phase timeline. Having a technical enablement session beforehand can alleviate such hurdles. Amazon Redshift POC tools and features In this section, we discuss tools that you can adapt based on the specific requirements and nature of the POC being conducted. It’s essential to choose tools that align with the scope and technologies involved. AWS Analytics Automation Toolkit The AWS Analytics Automation Toolkit enables automatic provisioning and integration of not only Amazon Redshift, but database migration services like AWS Database Migration Service (AWS DMS), AWS Schema Conversion Tool (AWS SCT), and Apache JMeter. This toolkit is essential in most POCs because it automates the provisioning of infrastructure and setup of the necessary environment. AWS SCT The AWS SCT makes heterogeneous database migrations predictable, secure, and fast by automatically converting the majority of the database code and storage objects to a format that is compatible with the target database. Any objects that can’t be automatically converted are clearly marked so that they can be manually converted to complete the migration. In the context of a POC, the AWS SCT becomes crucial by streamlining and enhancing the efficiency of the schema conversion process from one database system to another. Given the time-sensitive nature of POCs, the AWS SCT automates the conversion process, facilitating planning, and estimation of time and efforts. Additionally, the AWS SCT plays a role in identifying potential compatibility issues, data mapping challenges, or other hurdles at an early stage of the process. Furthermore, the database migration assessment report summarizes all the action items for schemas that can’t be converted automatically to your target database. Getting started with AWS SCT is a straightforward process. Also, consider following the best practices for AWS SCT. Amazon Redshift auto-copy The Amazon Redshift auto-copy (preview) feature can automate data ingestion from Amazon Simple Storage Service (Amazon S3) to Amazon Redshift with a simple SQL command. COPY statements are invoked and start loading data when Amazon Redshift auto-copy detects new files in the specified S3 prefixes. This also makes sure that end-users have the latest data available in Amazon Redshift shortly after the source files are available. You can use this feature for the purpose of data ingestion throughout the POC. To learn more about ingesting from files located in Amazon S3 using a SQL command, refer to Simplify data ingestion from Amazon S3 to Amazon Redshift using auto-copy (preview). The post also shows you how to enable auto-copy using COPY jobs, how to monitor jobs, and considerations and best practices. Redshift Auto Loader The custom Redshift Auto Loader framework automatically creates schemas and tables in the target database and continuously loads data from Amazon S3 to Amazon Redshift. You can use this during the data ingestion phase of the POC. Deploying and setting up the Redshift Auto Loader framework to transfer files from Amazon S3 to Amazon Redshift is a straightforward process. For more information, refer to Migrate from Google BigQuery to Amazon Redshift using AWS Glue and Custom Auto Loader Framework. Apache JMeter Apache JMeter is an open-source load testing application written in Java that you can use to load test web applications, backend server applications, databases, and more. In a database context, it’s an extremely valuable tool for repeating benchmark tests in a consistent manner, simulating concurrency workloads, and scalability testing on different database configurations. When implementing your POC, benchmarking Amazon Redshift is often one of the main components of evaluation and a key source of insight into the price-performance of different Amazon Redshift configurations. With Apache JMeter, you can construct high-quality benchmark tests for Amazon Redshift. Workload Replicator If you are currently using Amazon Redshift and looking to replicate your existing production workload or isolate specific workloads in a POC, you can use the Workload Replicator to run them across different configurations of Redshift clusters (ra3.xlplus, ra3.4xl,ra3.16xl, serverless) for performance evaluation and comparison. This utility has the ability to mimic COPY and UNLOAD workloads and can run the transactions and queries in the same time interval as they’re run in the production cluster. However, it’s crucial to assess the limitations of the utility and AWS Identity and Access Management (IAM) security and compliance requirements. Node Configuration Comparison utility If you’re using Amazon Redshift and have stringent SLAs for query performance in your Amazon Redshift cluster, or you want to explore different Amazon Redshift configurations based on the price-performance of your workload, you can use the Amazon Redshift Node Configuration Comparison utility. This utility helps evaluate performance of your queries using different Redshift cluster configurations in parallel and compares the end results to find the best cluster configuration that meets your need. Similarly, If you’re already using Amazon Redshift and want to migrate from your existing DC2 or DS2 instances to RA3, you can refer to our recommendations on node count and type when upgrading. Before doing that, you can use this utility in your POC to evaluate the new cluster’s performance by replaying your past workloads, which integrates with the Workload Replicator utility to evaluate performance metrics for different Amazon Redshift configurations to meet your needs. This utility functions in a fully automated manner and has similar limitations as the workload replicator. However, it requires full permissions across various services for the user running the AWS CloudFormation stack. Use cases You have the opportunity to explore various functionalities and aspects of Amazon Redshift by defining and selecting a business use case you want to validate during the POC. In this section, we discuss some specific use cases you can explore using a POC. Functionality evaluation Amazon Redshift consists of a set of functionalities and options that simplify data pipelines and effortlessly integrate with other services. You can use a POC to test and evaluate one or more of those capabilities before refactoring your data pipeline and implementing them in your ecosystem. Functionalities could be existing features or new ones such as zero-ETL integration, streaming ingestion, federated queries, or machine learning. Workload isolation You can use the data sharing feature of Amazon Redshift to achieve workload isolation across diverse analytics use cases and achieve business-critical SLAs without duplicating or moving the data. Amazon Redshift data sharing enables a producer cluster to share data objects with one or more consumer clusters, thereby eliminating data duplication. This facilitates collaboration across isolated clusters, allowing data to be shared for innovation and analytic services. Sharing can occur at various levels such as databases, schemas, tables, views, columns, and user-defined functions, offering fine-grained access control. It is recommended to use Workload Replicator for performance evaluation and comparison in a workload isolation POC. The following sample architectures explain workload isolation using data sharing. The first diagram illustrates the architecture before using data sharing. The following diagram illustrates the architecture with data sharing. Migrating to Amazon Redshift If you’re interested in migrating from your existing data warehouse platform to Amazon Redshift, you can try out Amazon Redshift by developing a POC on a selected business use case. In this type of POC, it is recommended to use the AWS Analytics Automation Toolkit for setting up the environment, auto-copy or Redshift Auto Loader for data ingestion, and AWS SCT for schema conversion. When the development is complete, you can perform performance testing using Apache JMeter, which provides data points to measure price-performance and compare results with your existing platform. The following diagram illustrates this process. Moving to Amazon Redshift Serverless You can migrate your unpredictable and variable workloads to Amazon Redshift Serverless, which enables you to scale as and when needed and pay as per usage, making your infrastructure scalable and cost-efficient. If you’re migrating your full workload from provisioned (DC2, RA3) to serverless, you can use the Node Configuration Comparison utility for performance evaluation. The following diagram illustrates this workflow. Conclusion In a competitive environment, conducting a successful proof of concept is a strategic imperative for businesses aiming to validate the feasibility and effectiveness of new solutions. Amazon Redshift provides you with better price-performance compared to other cloud-centered data warehouses, and a large list of features that help you modernize and optimize your data pipelines. For more details, see Amazon Redshift continues its price-performance leadership. With the process discussed in this post and by choosing the tools needed for your specific use case, you can accelerate the process of conducting a POC. This allows you to collect the data metrics that can help you understand the potential challenges, benefits, and implications of implementing the proposed solution on a larger scale. A POC provides essential data points that evaluate price-performance as well as feasibility, which plays a vital role in decision-making. About the Authors Ziad WALI is an Acceleration Lab Solutions Architect at Amazon Web Services. He has over 10 years of experience in databases and data warehousing, where he enjoys building reliable, scalable, and efficient solutions. Outside of work, he enjoys sports and spending time in nature. Omama Khurshid is an Acceleration Lab Solutions Architect at Amazon Web Services. She focuses on helping customers across various industries build reliable, scalable, and efficient solutions. Outside of work, she enjoys spending time with her family, watching movies, listening to music, and learning new technologies. Srikant Das is an Acceleration Lab Solutions Architect at Amazon Web Services. His expertise lies in constructing robust, scalable, and efficient solutions. Beyond the professional sphere, he finds joy in travel and shares his experiences through insightful blogging on social media platforms. View the full article
-
Welcome to March’s post announcing new training and certification updates — helping equip you and your teams with the skills to work with AWS services and solutions. This month we launched eight new digital training products on AWS Skill Builder, including four new AWS Builder Labs and a free learning plan called, Generative AI Developer Kit. We also have three new, and one updated AWS Classroom Training courses—two of which have AWS Partner versions—including Developing Generative AI Applications on AWS. A reminder: registration is now open for the new AWS Certified Data Engineer – Associate exam. You can begin preparing with curated exam prep resources, created by the experts at AWS, on AWS Skill Builder. Missed our February course update? Check it out here. New AWS Skill Builder subscription features AWS Skill Builder subscriptions are available globally, including Mainland China as of this month, and unlock enhanced AWS Certification exam prep and hands-on AWS Cloud training including 1,000+ interactive learning and lab experiences like AWS Cloud Quest, AWS Industry Quest, AWS Builder Labs, and AWS Jam challenges. Select plans offer access to AWS Digital Classroom courses to dive deep with expert instruction. Try a 7-day free trail of Individual subscription. *terms and conditions apply AWS Builder Labs Migrate On-Premises Servers to AWS Using Application Migration Service (MGN) (60 min.) is an intermediate-level lab providing you an opportunity to learn how to use AWS Application Migration Service to migrate an existing workload to AWS. Migrate On-premises Databases to AWS Using AWS Database Migration Service (DMS) (75 min.) is an intermediate-level lab providing you an opportunity to learn how to use AWS Database Migration Service to migrate an existing database to Amazon Aurora. Data Modeling for Amazon Neptune (60 min.) is an intermediate-level lab providing you an opportunity to explore the process of modeling data with Amazon Neptune to meet prescribed use cases. Analyzing CloudWatch Logs with Kinesis Data Streams and Kinesis Data Analytics(4 hr.) is an advanced-level, challenge-based lab allowing you to learn how to use Amazon CloudWatch to collect Amazon Elastic Compute Cloud (EC2) system logs and use Amazon Kinesis to analyze the collected data. AWS Certification exam preparation and updates Now available: AWS Certified Data Engineer – Associate Registration is now open for the AWS Certified Data Engineer – Associate. Showcase your knowledge and skills in core data-related AWS services, implementing data pipelines, and providing high-quality data for business insights. Gain confidence going into exam day with trusted exam prep on AWS Skill Builder, including an Official Pretest, available now in all exam languages. Free digital courses on AWS Skill Builder The following digital courses on AWS Skill Builder are free to all learners, along with 600+ free digital courses and learning plans. Digital learning plan Generative AI Developer Kit (includes labs) (16h 30 min.) is a collection of curated courses, labs, and challenges to develop the skills needed to build generative AI applications. Software developers interested in leveraging large language models without fine-tuning will benefit from this collection. You’ll receive an overview of generative AI, learn to plan a generative AI project, get started with Amazon CodeWhisperer and Amazon Bedrock, learn the foundations of prompt engineering, and discover the architecture patterns to build generative AI applications using Amazon Bedrock and Langchain. Digital courses Decarbonization with AWS Introduction (15 min.) is a fundamental-level course that teaches you about AWS Customer Carbon Footprint Tool and other resources that can be used to advance your sustainability goals. You’ll learn how businesses use the AWS Customer Carbon Footprint Tool, how it helps you reduce your carbon footprint and achieve decarbonization goals with AWS, and considerations for using the tool for a variety of optimal usage and cost savings considerations. Amazon Redshift Introduction (15 min.) is a fundamental-level course that provides an introduction to Amazon Redshift, including its common uses and benefits. AWS Mainframe Modernization – Using Replatform Tools with Amazon AppStream (60 min.) is an intermediate-level course teaching the setup and usage of Micro Focus tools from OpenText, such as Enterprise Analyzer and Enterprise Developer, with Amazon AppStream 2.0. AWS Classroom Training Designing and Implementing Storage on AWS is a three-day, intermediate-level course teaching you to select, design, implement, and optimize secure storage solutions to save on time and cost, improve performance and scale, and accelerate innovation. You’ll explore AWS storage services and solutions for storing, accessing, and protecting your data. An expert AWS instructor will help you understand where, how, and when to take advantage of different storage services. Learn how to best evaluate the appropriate AWS storage service options to meet your use case and business requirements. Build Modern Applications with AWS NoSQL Databases is a one-day, intermediate-level course to help you understand how to build applications that involve complex data characteristics and millisecond performance requirements from your databases. You’ll learn to use purpose-built databases to build typical modern applications with diverse access patterns and real-time scaling needs. AnAWS Partner version is also available. Running Containers on Amazon Elastic Kubernetes Service (Amazon EKS) is an updated, three-day, intermediate-level course from an expert AWS instructor that teaches container management and orchestration for Kubernetes using Amazon EKS. You’ll build an Amazon EKS cluster, configure the environment, deploy the cluster, and add applications to your cluster. Learn how to also manage container images using Amazon Elastic Container Registry (ECR) and automate application deployment. Developing Generative AI Applications on AWS is a two-day, advanced-level course that teaches you the basics, benefits, and associated terminology of generative AI. An expert AWS instructor will guide you through planning a generative AI project and the foundations of prompt engineering to develop generative AI applications with AWS services. By the end of the course, you’ll have the skills needed to build applications that can generate and summarize text, answer questions, and interact with users using a chatbot interface. An AWS Partner version is also available. View the full article
-
- 1
-
- courses
- certification
- (and 9 more)
-
Customer 360 (C360) provides a complete and unified view of a customer’s interactions and behavior across all touchpoints and channels. This view is used to identify patterns and trends in customer behavior, which can inform data-driven decisions to improve business outcomes. For example, you can use C360 to segment and create marketing campaigns that are more likely to resonate with specific groups of customers. In 2022, AWS commissioned a study conducted by the American Productivity and Quality Center (APQC) to quantify the Business Value of Customer 360. The following figure shows some of the metrics derived from the study. Organizations using C360 achieved 43.9% reduction in sales cycle duration, 22.8% increase in customer lifetime value, 25.3% faster time to market, and 19.1% improvement in net promoter score (NPS) rating. Without C360, businesses face missed opportunities, inaccurate reports, and disjointed customer experiences, leading to customer churn. However, building a C360 solution can be complicated. A Gartner Marketing survey found only 14% of organizations have successfully implemented a C360 solution, due to lack of consensus on what a 360-degree view means, challenges with data quality, and lack of cross-functional governance structure for customer data. In this post, we discuss how you can use purpose-built AWS services to create an end-to-end data strategy for C360 to unify and govern customer data that address these challenges. We structure it in five pillars that power C360: data collection, unification, analytics, activation, and data governance, along with a solution architecture that you can use for your implementation. The five pillars of a mature C360 When you embark on creating a C360, you work with multiple use cases, types of customer data, and users and applications that require different tools. Building a C360 on the right datasets, adding new datasets over time while maintaining the quality of data, and keeping it secure needs an end-to-end data strategy for your customer data. You also need to provide tools that make it straightforward for your teams to build products that mature your C360. We recommend building your data strategy around five pillars of C360, as shown in the following figure. This starts with basic data collection, unifying and linking data from various channels related to unique customers, and progresses towards basic to advanced analytics for decision-making, and personalized engagement through various channels. As you mature in each of these pillars, you progress towards responding to real-time customer signals. The following diagram illustrates the functional architecture that combines the building blocks of a Customer Data Platform on AWS with additional components used to design an end-to-end C360 solution. This is aligned to the five pillars we discuss in this post. Pillar 1: Data collection As you start building your customer data platform, you have to collect data from various systems and touchpoints, such as your sales systems, customer support, web and social media, and data marketplaces. Think of the data collection pillar as a combination of ingestion, storage, and processing capabilities. Data ingestion You have to build ingestion pipelines based on factors like types of data sources (on-premises data stores, files, SaaS applications, third-party data), and flow of data (unbounded streams or batch data). AWS provides different services for building data ingestion pipelines: AWS Glue is a serverless data integration service that ingests data in batches from on-premises databases and data stores in the cloud. It connects to more than 70 data sources and helps you build extract, transform, and load (ETL) pipelines without having to manage pipeline infrastructure. AWS Glue Data Quality checks for and alerts on poor data, making it straightforward to spot and fix issues before they harm your business. Amazon AppFlow ingests data from software as a service (SaaS) applications like Google Analytics, Salesforce, SAP, and Marketo, giving you the flexibility to ingest data from more than 50 SaaS applications. AWS Data Exchange makes it straightforward to find, subscribe to, and use third-party data for analytics. You can subscribe to data products that help enrich customer profiles, for example demographics data, advertising data, and financial markets data. Amazon Kinesis ingests streaming events in real time from point-of-sales systems, clickstream data from mobile apps and websites, and social media data. You could also consider using Amazon Managed Streaming for Apache Kafka (Amazon MSK) for streaming events in real time. The following diagram illustrates the different pipelines to ingest data from various source systems using AWS services. Data storage Structured, semi-structured, or unstructured batch data is stored in an object storage because these are cost-efficient and durable. Amazon Simple Storage Service (Amazon S3) is a managed storage service with archiving features that can store petabytes of data with eleven 9’s of durability. Streaming data with low latency needs is stored in Amazon Kinesis Data Streams for real-time consumption. This allows immediate analytics and actions for various downstream consumers—as seen with Riot Games’ central Riot Event Bus. Data processing Raw data is often cluttered with duplicates and irregular formats. You need to process this to make it ready for analysis. If you are consuming batch data and streaming data, consider using a framework that can handle both. A pattern such as the Kappa architecture views everything as a stream, simplifying the processing pipelines. Consider using Amazon Managed Service for Apache Flink to handle the processing work. With Managed Service for Apache Flink, you can clean and transform the streaming data and direct it to the appropriate destination based on latency requirements. You can also implement batch data processing using Amazon EMR on open source frameworks such as Apache Spark at 3.5 times better performance than the self-managed version. The architecture decision of using a batch or streaming processing system will depend on various factors; however, if you want to enable real-time analytics on your customer data, we recommend using a Kappa architecture pattern. Pillar 2: Unification To link the diverse data arriving from various touchpoints to a unique customer, you need to build an identity processing solution that identifies anonymous logins, stores useful customer information, links them to external data for better insights, and groups customers in domains of interest. Although the identity processing solution helps build the unified customer profile, we recommend considering this as part of your data processing capabilities. The following diagram illustrates the components of such a solution. The key components are as follows: Identity resolution – Identity resolution is a deduplication solution, where records are matched to identify a unique customer and prospects by linking multiple identifiers such as cookies, device identifiers, IP addresses, email IDs, and internal enterprise IDs to a known person or anonymous profile using privacy-compliant methods. This can be achieved using AWS Entity Resolution, which enables using rules and machine learning (ML) techniques to match records and resolve identities. Alternatively, you can build identity graphs using Amazon Neptune for a single unified view of your customers. Profile aggregation – When you’ve uniquely identified a customer, you can build applications in Managed Service for Apache Flink to consolidate all their metadata, from name to interaction history. Then, you transform this data into a concise format. Instead of showing every transaction detail, you can offer an aggregated spend value and a link to their Customer Relationship Management (CRM) record. For customer service interactions, provide an average CSAT score and a link to the call center system for a deeper dive into their communication history. Profile enrichment – After you resolve a customer to a single identity, enhance their profile using various data sources. Enrichment typically involves adding demographic, behavioral, and geolocation data. You can use third-party data products from AWS Marketplace delivered through AWS Data Exchange to gain insights on income, consumption patterns, credit risk scores, and many more dimensions to further refine the customer experience. Customer segmentation – After uniquely identifying and enriching a customer’s profile, you can segment them based on demographics like age, spend, income, and location using applications in Managed Service for Apache Flink. As you advance, you can incorporate AI services for more precise targeting techniques. After you have done the identity processing and segmentation, you need a storage capability to store the unique customer profile and provide search and query capabilities on top of it for downstream consumers to use the enriched customer data. The following diagram illustrates the unification pillar for a unified customer profile and single view of the customer for downstream applications. Unified customer profile Graph databases excel in modeling customer interactions and relationships, offering a comprehensive view of the customer journey. If you are dealing with billions of profiles and interactions, you can consider using Neptune, a managed graph database service on AWS. Organizations such as Zeta and Activision have successfully used Neptune to store and query billions of unique identifiers per month and millions of queries per second at millisecond response time. Single customer view Although graph databases provide in-depth insights, yet they can be complex for regular applications. It is prudent to consolidate this data into a single customer view, serving as a primary reference for downstream applications, ranging from ecommerce platforms to CRM systems. This consolidated view acts as a liaison between the data platform and customer-centric applications. For such purposes, we recommend using Amazon DynamoDB for its adaptability, scalability, and performance, resulting in an up-to-date and efficient customer database. This database will accept a lot of write queries back from the activation systems that learn new information about the customers and feed them back. Pillar 3: Analytics The analytics pillar defines capabilities that help you generate insights on top of your customer data. Your analytics strategy applies to the wider organizational needs, not just C360. You can use the same capabilities to serve financial reporting, measure operational performance, or even monetize data assets. Strategize based on how your teams explore data, run analyses, wrangle data for downstream requirements, and visualize data at different levels. Plan on how you can enable your teams to use ML to move from descriptive to prescriptive analytics. The AWS modern data architecture shows a way to build a purpose-built, secure, and scalable data platform in the cloud. Learn from this to build querying capabilities across your data lake and the data warehouse. The following diagram breaks down the analytics capability into data exploration, visualization, data warehousing, and data collaboration. Let’s find out what role each of these components play in the context of C360. Data exploration Data exploration helps unearth inconsistencies, outliers, or errors. By spotting these early on, your teams can have cleaner data integration for C360, which in turn leads to more accurate analytics and predictions. Consider the personas exploring the data, their technical skills, and the time to insight. For instance, data analysts who know to write SQL can directly query the data residing in Amazon S3 using Amazon Athena. Users interested in visual exploration can do so using AWS Glue DataBrew. Data scientists or engineers can use Amazon EMR Studio or Amazon SageMaker Studio to explore data from the notebook, and for a low-code experience, you can use Amazon SageMaker Data Wrangler. Because these services directly query S3 buckets, you can explore the data as it lands in the data lake, reducing time to insight. Visualization Turning complex datasets into intuitive visuals unravels hidden patterns in the data, and is crucial for C360 use cases. With this capability, you can design reports for different levels catering to varying needs: executive reports offering strategic overviews, management reports highlighting operational metrics, and detailed reports diving into the specifics. Such visual clarity helps your organization make informed decisions across all tiers, centralizing the customer’s perspective. The following diagram shows a sample C360 dashboard built on Amazon QuickSight. QuickSight offers scalable, serverless visualization capabilities. You can benefit from its ML integrations for automated insights like forecasting and anomaly detection or natural language querying with Amazon Q in QuickSight, direct data connectivity from various sources, and pay-per-session pricing. With QuickSight, you can embed dashboards to external websites and applications, and the SPICE engine enables rapid, interactive data visualization at scale. The following screenshot shows an example C360 dashboard built on QuickSight. Data warehouse Data warehouses are efficient in consolidating structured data from multifarious sources and serving analytics queries from a large number of concurrent users. Data warehouses can provide a unified, consistent view of a vast amount of customer data for C360 use cases. Amazon Redshift addresses this need by adeptly handling large volumes of data and diverse workloads. It provides strong consistency across datasets, allowing organizations to derive reliable, comprehensive insights about their customers, which is essential for informed decision-making. Amazon Redshift offers real-time insights and predictive analytics capabilities for analyzing data from terabytes to petabytes. With Amazon Redshift ML, you can embed ML on top of the data stored in the data warehouse with minimum development overhead. Amazon Redshift Serverless simplifies application building and makes it straightforward for companies to embed rich data analytics capabilities. Data collaboration You can securely collaborate and analyze collective datasets from your partners without sharing or copying one another’s underlying data using AWS Clean Rooms. You can bring together disparate data from across engagement channels and partner datasets to form a 360-degree view of your customers. AWS Clean Rooms can enhance C360 by enabling use cases like cross-channel marketing optimization, advanced customer segmentation, and privacy-compliant personalization. By safely merging datasets, it offers richer insights and robust data privacy, meeting business needs and regulatory standards. Pillar 4: Activation The value of data diminishes the older it gets, leading to higher opportunity costs over time. In a survey conducted by Intersystems, 75% of the organizations surveyed believe untimely data inhibited business opportunities. In another survey, 58% of organizations (out of 560 respondents of HBR Advisory council and readers) stated they saw an increase in customer retention and loyalty using real-time customer analytics. You can achieve a maturity in C360 when you build the ability to act on all the insights acquired from the previous pillars we discussed in real time. For example, at this maturity level, you can act on customer sentiment based on the context you automatically derived with an enriched customer profile and integrated channels. For this you need to implement prescriptive decision-making on how to address the customer’s sentiment. To do this at scale, you have to use AI/ML services for decision-making. The following diagram illustrates the architecture to activate insights using ML for prescriptive analytics and AI services for targeting and segmentation. Use ML for the decision-making engine With ML, you can improve the overall customer experience—you can create predictive customer behavior models, design hyper-personalized offers, and target the right customer with the right incentive. You can build them using Amazon SageMaker, which features a suite of managed services mapped to the data science lifecycle, including data wrangling, model training, model hosting, model inference, model drift detection, and feature storage. SageMaker enables you to build and operationalize your ML models, infusing them back into your applications to produce the right insight to the right person at the right time. Amazon Personalize supports contextual recommendations, through which you can improve the relevance of recommendations by generating them within a context—for instance, device type, location, or time of day. Your team can get started without any prior ML experience using APIs to build sophisticated personalization capabilities in a few clicks. For more information, see Customize your recommendations by promoting specific items using business rules with Amazon Personalize. Activate channels across marketing, advertising, direct-to-consumer, and loyalty Now that you know who your customers are and who to reach out to, you can build solutions to run targeting campaigns at scale. With Amazon Pinpoint, you can personalize and segment communications to engage customers across multiple channels. For example, you can use Amazon Pinpoint to build engaging customer experiences through various communication channels like email, SMS, push notifications, and in-app notifications. Pillar 5: Data governance Establishing the right governance that balances control and access gives users trust and confidence in data. Imagine offering promotions on products that a customer doesn’t need, or bombarding the wrong customers with notifications. Poor data quality can lead to such situations, and ultimately results in customer churn. You have to build processes that validate data quality and take corrective actions. AWS Glue Data Quality can help you build solutions that validate the quality of data at rest and in transit, based on predefined rules. To set up a cross-functional governance structure for customer data, you need a capability for governing and sharing data across your organization. With Amazon DataZone, admins and data stewards can manage and govern access to data, and consumers such as data engineers, data scientists, product managers, analysts, and other business users can discover, use, and collaborate with that data to drive insights. It streamlines data access, letting you find and use customer data, promotes team collaboration with shared data assets, and provides personalized analytics either via a web app or API on a portal. AWS Lake Formation makes sure data is accessed securely, guaranteeing the right people see the right data for the right reasons, which is crucial for effective cross-functional governance in any organization. Business metadata is stored and managed by Amazon DataZone, which is underpinned by technical metadata and schema information, which is registered in the AWS Glue Data Catalog. This technical metadata is also used both by other governance services such as Lake Formation and Amazon DataZone, and analytics services such as Amazon Redshift, Athena, and AWS Glue. Bringing it all together Using the following diagram as a reference, you can create projects and teams for building and operating different capabilities. For example, you can have a data integration team focus on the data collection pillar—you can then align functional roles, like data architects and data engineers. You can build your analytics and data science practices to focus on the analytics and activation pillars, respectively. Then you can create a specialized team for customer identity processing and for building the unified view of the customer. You can establish a data governance team with data stewards from different functions, security admins, and data governance policymakers to design and automate policies. Conclusion Building a robust C360 capability is fundamental for your organization to gain insights into your customer base. AWS Databases, Analytics, and AI/ML services can help streamline this process, providing scalability and efficiency. Following the five pillars to guide your thinking, you can build an end-to-end data strategy that defines the C360 view across the organization, makes sure data is accurate, and establishes cross-functional governance for customer data. You can categorize and prioritize the products and features you have to build within each pillar, select the right tool for the job, and build the skills you need in your teams. Visit AWS for Data Customer Stories to learn how AWS is transforming customer journeys, from the world’s largest enterprises to growing startups. About the Authors Ismail Makhlouf is a Senior Specialist Solutions Architect for Data Analytics at AWS. Ismail focuses on architecting solutions for organizations across their end-to-end data analytics estate, including batch and real-time streaming, big data, data warehousing, and data lake workloads. He primarily works with organizations in retail, ecommerce, FinTech, HealthTech, and travel to achieve their business objectives with well architected data platforms. Sandipan Bhaumik (Sandi) is a Senior Analytics Specialist Solutions Architect at AWS. He helps customers modernize their data platforms in the cloud to perform analytics securely at scale, reduce operational overhead, and optimize usage for cost-effectiveness and sustainability. View the full article
-
Amazon DataZone is used by customers to catalog, discover, analyze, share, and govern data at scale across organizational boundaries with governance and access controls. Today, Amazon DataZone has introduced several enhancements to its Amazon Redshift integration, simplifying the process of publishing and subscribing to Amazon Redshift tables and views. These updates streamline the experience for both data producers and consumers, allowing them to quickly create data warehouse environments using pre-configured credentials and connection parameters provided by their DataZone administrators. Additionally, these enhancements grant administrators greater control over who can use the resources within their AWS accounts and Amazon Redshift clusters, and for what purpose. View the full article
-
Amazon Relational Database Service (Amazon RDS) for MySQL zero-ETL integration with Amazon Redshift was announced in preview at AWS re:Invent 2023 for Amazon RDS for MySQL version 8.0.28 or higher. In this post, we provide step-by-step guidance on how to get started with near real-time operational analytics using this feature. This post is a continuation of the zero-ETL series that started with Getting started guide for near-real time operational analytics using Amazon Aurora zero-ETL integration with Amazon Redshift. Challenges Customers across industries today are looking to use data to their competitive advantage and increase revenue and customer engagement by implementing near real time analytics use cases like personalization strategies, fraud detection, inventory monitoring, and many more. There are two broad approaches to analyzing operational data for these use cases: Analyze the data in-place in the operational database (such as read replicas, federated query, and analytics accelerators) Move the data to a data store optimized for running use case-specific queries such as a data warehouse The zero-ETL integration is focused on simplifying the latter approach. The extract, transform, and load (ETL) process has been a common pattern for moving data from an operational database to an analytics data warehouse. ELT is where the extracted data is loaded as is into the target first and then transformed. ETL and ELT pipelines can be expensive to build and complex to manage. With multiple touchpoints, intermittent errors in ETL and ELT pipelines can lead to long delays, leaving data warehouse applications with stale or missing data, further leading to missed business opportunities. Alternatively, solutions that analyze data in-place may work great for accelerating queries on a single database, but such solutions aren’t able to aggregate data from multiple operational databases for customers that need to run unified analytics. Zero-ETL Unlike the traditional systems where data is siloed in one database and the user has to make a trade-off between unified analysis and performance, data engineers can now replicate data from multiple RDS for MySQL databases into a single Redshift data warehouse to derive holistic insights across many applications or partitions. Updates in transactional databases are automatically and continuously propagated to Amazon Redshift so data engineers have the most recent information in near real time. There is no infrastructure to manage and the integration can automatically scale up and down based on the data volume. At AWS, we have been making steady progress towards bringing our zero-ETL vision to life. The following sources are currently supported for zero-ETL integrations: Amazon Aurora MySQL-Compatible Edition (generally available) Amazon Aurora PostgreSQL-Compatible Edition (preview) Amazon RDS for MySQL (preview) Amazon DynamoDB (limited preview) When you create a zero-ETL integration for Amazon Redshift, you continue to pay for underlying source database and target Redshift database usage. Refer to Zero-ETL integration costs (Preview) for further details. With zero-ETL integration with Amazon Redshift, the integration replicates data from the source database into the target data warehouse. The data becomes available in Amazon Redshift within seconds, allowing you to use the analytics features of Amazon Redshift and capabilities like data sharing, workload optimization autonomics, concurrency scaling, machine learning, and many more. You can continue with your transaction processing on Amazon RDS or Amazon Aurora while simultaneously using Amazon Redshift for analytics workloads such as reporting and dashboards. The following diagram illustrates this architecture. Solution overview Let’s consider TICKIT, a fictional website where users buy and sell tickets online for sporting events, shows, and concerts. The transactional data from this website is loaded into an Amazon RDS for MySQL 8.0.28 (or higher version) database. The company’s business analysts want to generate metrics to identify ticket movement over time, success rates for sellers, and the best-selling events, venues, and seasons. They would like to get these metrics in near real time using a zero-ETL integration. The integration is set up between Amazon RDS for MySQL (source) and Amazon Redshift (destination). The transactional data from the source gets refreshed in near real time on the destination, which processes analytical queries. You can use either the serverless option or an encrypted RA3 cluster for Amazon Redshift. For this post, we use a provisioned RDS database and a Redshift provisioned data warehouse. The following diagram illustrates the high-level architecture. The following are the steps needed to set up zero-ETL integration. These steps can be done automatically by the zero-ETL wizard, but you will require a restart if the wizard changes the setting for Amazon RDS or Amazon Redshift. You could do these steps manually, if not already configured, and perform the restarts at your convenience. For the complete getting started guides, refer to Working with Amazon RDS zero-ETL integrations with Amazon Redshift (preview) and Working with zero-ETL integrations. Configure the RDS for MySQL source with a custom DB parameter group. Configure the Redshift cluster to enable case-sensitive identifiers. Configure the required permissions. Create the zero-ETL integration. Create a database from the integration in Amazon Redshift. Configure the RDS for MySQL source with a customized DB parameter group To create an RDS for MySQL database, complete the following steps: On the Amazon RDS console, create a DB parameter group called zero-etl-custom-pg. Zero-ETL integration works by using binary logs (binlogs) generated by MySQL database. To enable binlogs on Amazon RDS for MySQL, a specific set of parameters must be enabled. Set the following binlog cluster parameter settings: binlog_format = ROW binlog_row_image = FULL binlog_checksum = NONE In addition, make sure that the binlog_row_value_options parameter is not set to PARTIAL_JSON. By default, this parameter is not set. Choose Databases in the navigation pane, then choose Create database. For Engine Version, choose MySQL 8.0.28 (or higher). For Templates, select Production. For Availability and durability, select either Multi-AZ DB instance or Single DB instance (Multi-AZ DB clusters are not supported, as of this writing). For DB instance identifier, enter zero-etl-source-rms. Under Instance configuration, select Memory optimized classes and choose the instance db.r6g.large, which should be sufficient for TICKIT use case. Under Additional configuration, for DB cluster parameter group, choose the parameter group you created earlier (zero-etl-custom-pg). Choose Create database. In a couple of minutes, it should spin up an RDS for MySQL database as the source for zero-ETL integration. Configure the Redshift destination After you create your source DB cluster, you must create and configure a target data warehouse in Amazon Redshift. The data warehouse must meet the following requirements: Using an RA3 node type (ra3.16xlarge, ra3.4xlarge, or ra3.xlplus) or Amazon Redshift Serverless Encrypted (if using a provisioned cluster) For our use case, create a Redshift cluster by completing the following steps: On the Amazon Redshift console, choose Configurations and then choose Workload management. In the parameter group section, choose Create. Create a new parameter group named zero-etl-rms. Choose Edit parameters and change the value of enable_case_sensitive_identifier to True. Choose Save. You can also use the AWS Command Line Interface (AWS CLI) command update-workgroup for Redshift Serverless: aws redshift-serverless update-workgroup --workgroup-name <your-workgroup-name> --config-parameters parameterKey=enable_case_sensitive_identifier,parameterValue=true Choose Provisioned clusters dashboard. At the top of you console window, you will see a Try new Amazon Redshift features in preview banner. Choose Create preview cluster. For Preview track, chose preview_2023. For Node type, choose one of the supported node types (for this post, we use ra3.xlplus). Under Additional configurations, expand Database configurations. For Parameter groups, choose zero-etl-rms. For Encryption, select Use AWS Key Management Service. Choose Create cluster. The cluster should become Available in a few minutes. Navigate to the namespace zero-etl-target-rs-ns and choose the Resource policy tab. Choose Add authorized principals. Enter either the Amazon Resource Name (ARN) of the AWS user or role, or the AWS account ID (IAM principals) that are allowed to create integrations. An account ID is stored as an ARN with root user. In the Authorized integration sources section, choose Add authorized integration source to add the ARN of the RDS for MySQL DB instance that’s the data source for the zero-ETL integration. You can find this value by going to the Amazon RDS console and navigating to the Configuration tab of the zero-etl-source-rms DB instance. Your resource policy should resemble the following screenshot. Configure required permissions To create a zero-ETL integration, your user or role must have an attached identity-based policy with the appropriate AWS Identity and Access Management (IAM) permissions. An AWS account owner can configure required permissions for users or roles who may create zero-ETL integrations. The sample policy allows the associated principal to perform the following actions: Create zero-ETL integrations for the source RDS for MySQL DB instance. View and delete all zero-ETL integrations. Create inbound integrations into the target data warehouse. This permission is not required if the same account owns the Redshift data warehouse and this account is an authorized principal for that data warehouse. Also note that Amazon Redshift has a different ARN format for provisioned and serverless clusters: Provisioned – arn:aws:redshift:{region}:{account-id}:namespace:namespace-uuid Serverless – arn:aws:redshift-serverless:{region}:{account-id}:namespace/namespace-uuid Complete the following steps to configure the permissions: On the IAM console, choose Policies in the navigation pane. Choose Create policy. Create a new policy called rds-integrations using the following JSON (replace region and account-id with your actual values): { "Version": "2012-10-17", "Statement": [{ "Effect": "Allow", "Action": [ "rds:CreateIntegration" ], "Resource": [ "arn:aws:rds:{region}:{account-id}:db:source-instancename", "arn:aws:rds:{region}:{account-id}:integration:*" ] }, { "Effect": "Allow", "Action": [ "rds:DescribeIntegration" ], "Resource": ["*"] }, { "Effect": "Allow", "Action": [ "rds:DeleteIntegration" ], "Resource": [ "arn:aws:rds:{region}:{account-id}:integration:*" ] }, { "Effect": "Allow", "Action": [ "redshift:CreateInboundIntegration" ], "Resource": [ "arn:aws:redshift:{region}:{account-id}:cluster:namespace-uuid" ] }] } Attach the policy you created to your IAM user or role permissions. Create the zero-ETL integration To create the zero-ETL integration, complete the following steps: On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane. Choose Create zero-ETL integration. For Integration identifier, enter a name, for example zero-etl-demo. For Source database, choose Browse RDS databases and choose the source cluster zero-etl-source-rms. Choose Next. Under Target, for Amazon Redshift data warehouse, choose Browse Redshift data warehouses and choose the Redshift data warehouse (zero-etl-target-rs). Choose Next. Add tags and encryption, if applicable. Choose Next. Verify the integration name, source, target, and other settings. Choose Create zero-ETL integration. You can choose the integration to view the details and monitor its progress. It took about 30 minutes for the status to change from Creating to Active. The time will vary depending on the size of your dataset in the source. Create a database from the integration in Amazon Redshift To create your database from the zero-ETL integration, complete the following steps: On the Amazon Redshift console, choose Clusters in the navigation pane. Open the zero-etl-target-rs cluster. Choose Query data to open the query editor v2. Connect to the Redshift data warehouse by choosing Save. Obtain the integration_id from the svv_integration system table: select integration_id from svv_integration; -- copy this result, use in the next sql Use the integration_id from the previous step to create a new database from the integration: CREATE DATABASE zetl_source FROM INTEGRATION '<result from above>'; The integration is now complete, and an entire snapshot of the source will reflect as is in the destination. Ongoing changes will be synced in near real time. Analyze the near real time transactional data Now we can run analytics on TICKIT’s operational data. Populate the source TICKIT data To populate the source data, complete the following steps: Copy the CSV input data files into a local directory. The following is an example command: aws s3 cp 's3://redshift-blogs/zero-etl-integration/data/tickit' . --recursive Connect to your RDS for MySQL cluster and create a database or schema for the TICKIT data model, verify that the tables in that schema have a primary key, and initiate the load process: mysql -h <rds_db_instance_endpoint> -u admin -p password --local-infile=1 Use the following CREATE TABLE commands. Load the data from local files using the LOAD DATA command. The following is an example. Note that the input CSV file is broken into several files. This command must be run for every file if you would like to load all data. For demo purposes, a partial data load should work as well. Analyze the source TICKIT data in the destination On the Amazon Redshift console, open the query editor v2 using the database you created as part of the integration setup. Use the following code to validate the seed or CDC activity: SELECT * FROM SYS_INTEGRATION_ACTIVITY ORDER BY last_commit_timestamp DESC; You can now apply your business logic for transformations directly on the data that has been replicated to the data warehouse. You can also use performance optimization techniques like creating a Redshift materialized view that joins the replicated tables and other local tables to improve query performance for your analytical queries. Monitoring You can query the following system views and tables in Amazon Redshift to get information about your zero-ETL integrations with Amazon Redshift: SVV_INTEGRATION – Provides configuration details for your integrations SYS_INTEGRATION_ACTIVITY– Provides information about completed integration runs SVV_INTEGRATION_TABLE_STATE – Describes the table-level integration information To view the integration-related metrics published to Amazon CloudWatch, open the Amazon Redshift console. Choose Zero-ETL integrations in the navigation pane and choose the integration to display activity metrics. Available metrics on the Amazon Redshift console are integration metrics and table statistics, with table statistics providing details of each table replicated from Amazon RDS for MySQL to Amazon Redshift. Integration metrics contain table replication success and failure counts and lag details. Manual resyncs The zero-ETL integration will automatically initiate a resync if a table sync state shows as failed or resync required. But in case the auto resync fails, you can initiate a resync at table-level granularity: ALTER DATABASE zetl_source INTEGRATION REFRESH TABLES tbl1, tbl2; A table can enter a failed state for multiple reasons: The primary key was removed from the table. In such cases, you need to re-add the primary key and perform the previously mentioned ALTER command. An invalid value is encountered during replication or a new column is added to the table with an unsupported data type. In such cases, you need to remove the column with the unsupported data type and perform the previously mentioned ALTER command. An internal error, in rare cases, can cause table failure. The ALTER command should fix it. Clean up When you delete a zero-ETL integration, your transactional data isn’t deleted from the source RDS or the target Redshift databases, but Amazon RDS doesn’t send any new changes to Amazon Redshift. To delete a zero-ETL integration, complete the following steps: On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane. Select the zero-ETL integration that you want to delete and choose Delete. To confirm the deletion, choose Delete. Conclusion In this post, we showed you how to set up a zero-ETL integration from Amazon RDS for MySQL to Amazon Redshift. This minimizes the need to maintain complex data pipelines and enables near real time analytics on transactional and operational data. To learn more about Amazon RDS zero-ETL integration with Amazon Redshift, refer to Working with Amazon RDS zero-ETL integrations with Amazon Redshift (preview). About the Authors Milind Oke is a senior Redshift specialist solutions architect who has worked at Amazon Web Services for three years. He is an AWS-certified SA Associate, Security Specialty and Analytics Specialty certification holder, based out of Queens, New York. Aditya Samant is a relational database industry veteran with over 2 decades of experience working with commercial and open-source databases. He currently works at Amazon Web Services as a Principal Database Specialist Solutions Architect. In his role, he spends time working with customers designing scalable, secure and robust cloud native architectures. Aditya works closely with the service teams and collaborates on designing and delivery of the new features for Amazon’s managed databases. View the full article
-
- amazon rds
- mysql
-
(and 3 more)
Tagged with:
-
Amazon Aurora MySQL zero-ETL integration with Amazon Redshift now supports data filtering, enabling you to include or exclude specific databases and tables as part of the zero-ETL integration. Based on your analytics needs, filtering of specific databases and tables helps you selectively bring data into Amazon Redshift. In addition, you can now easily manage and automate the configuration and deployment of resources needed for an Aurora MySQL zero-ETL integration with Amazon Redshift using AWS CloudFormation. View the full article
-
- amazon aurora mysql
- amazon aurora
- (and 5 more)
-
As your organization becomes more data driven and uses data as a source of competitive advantage, you’ll want to run analytics on your data to better understand your core business drivers to grow sales, reduce costs, and optimize your business. To run analytics on your operational data, you might build a solution that is a combination of a database, a data warehouse, and an extract, transform, and load (ETL) pipeline. ETL is the process data engineers use to combine data from different sources. To reduce the effort involved in building and maintaining ETL pipelines between transactional databases and data warehouses, AWS announced Amazon Aurora zero-ETL integration with Amazon Redshift at AWS re:Invent 2022 and is now generally available (GA) for Amazon Aurora MySQL-Compatible Edition 3.05.0. AWS is now announcing data filtering on zero-ETL integrations, enabling you to bring in selective data from the database instance on zero-ETL integrations between Amazon Aurora MySQL and Amazon Redshift. This feature allows you to select individual databases and tables to be replicated to your Redshift data warehouse for analytics use cases. In this post, we provide an overview of use cases where you can use this feature, and provide step-by-step guidance on how to get started with near real time operational analytics using this feature. Data filtering use cases Data filtering allows you to choose the databases and tables to be replicated from Amazon Aurora MySQL to Amazon Redshift. You can apply multiple filters to the zero-ETL integration, allowing you to tailor the replication to your specific needs. Data filtering applies either an exclude or include filter rule, and can use regular expressions to match multiple databases and tables. In this section, we discuss some common use cases for data filtering. Improve data security by excluding tables containing PII data from replication Operational databases often contain personally identifiable information (PII). This is information that is sensitive in nature, and can include information such as mailing addresses, customer verification documentation, or credit card information. Due to strict security compliance regulations, you may not want to use PII for your analytics use cases. Data filtering allows you to filter out databases or tables containing PII data, excluding them from replication to Amazon Redshift. This improves data security and compliance with analytics workloads. Save on storage costs and manage analytics workloads by replicating tables required for specific use cases Operational databases often contain many different datasets that aren’t useful for analytics. This includes supplementary data, specific application data, and multiple copies of the same dataset for different applications. Moreover, it’s common to build different use cases on different Redshift warehouses. This architecture requires different datasets to be available in individual endpoints. Data filtering allows you to only replicate the datasets that are required for your use cases. This can save costs by eliminating the need to store data that is not being used. You can also modify existing zero-ETL integrations to apply more restrictive data replication where desired. If you add a data filter to an existing integration, Aurora will fully reevaluate the data being replicated with the new filter. This will remove the newly filtered data from the target Redshift endpoint. For more information about quotas for Aurora zero-ETL integrations with Amazon Redshift, refer to Quotas. Start with small data replication and incrementally add tables as required As more analytics use cases are developed on Amazon Redshift, you may want to add more tables to an individual zero-ETL replication. Rather than replicating all tables to Amazon Redshift to satisfy the chance that they may be used in the future, data filtering allows you to start small with a subset of tables from your Aurora database and incrementally add more tables to the filter as they’re required. After a data filter on a zero-ETL integration is updated, Aurora will fully reevaluate the entire filter as if the previous filter didn’t exist, so workloads using previously replicated tables aren’t impacted in the addition of new tables. Improve individual workload performance by load balancing replication processes For large transactional databases, you may need to load balance the replication and any downstream processing to multiple Redshift clusters to allow for reduction of compute requirements for an individual Redshift endpoint and the ability to split workloads onto multiple endpoints. By load balancing workloads across multiple Redshift endpoints, you can effectively create a data mesh architecture, where endpoints are appropriately sized for individual workloads. This can improve performance and lower overall cost. Data filtering allows you to replicate different databases and tables to separate Redshift endpoints. The following figure shows how you could use data filters on zero-ETL integrations to split different databases in Aurora to separate Redshift endpoints. Example use case Consider the TICKIT database. The TICKIT sample database contains data from a fictional company where users can buy and sell tickets for various events. The company’s business analysts want to use the data that is stored in their Aurora MySQL database to generate various metrics, and would like to perform this analysis in near real time. For this reason, the company has identified zero-ETL as a potential solution. Throughout their investigation of the datasets required, the company’s analysts noted that the users table contains personal information about their customer user information that is not useful for their analytics requirements. Therefore, they want to replicate all data except the users table and will use zero-ETL’s data filtering to do so. Setup Start by following the steps in Getting started guide for near-real time operational analytics using Amazon Aurora zero-ETL integration with Amazon Redshift to create a new Aurora MySQL database, Amazon Redshift Serverless endpoint, and zero-ETL integration. Then open the Redshift query editor v2 and run the following query to show that data from the users table has been replicated successfully: select * from aurora_zeroetl.demodb.users; Data filters Data filters are applied directly to the zero-ETL integration on Amazon Relational Database Service (Amazon RDS). You can define multiple filters for a single integration, and each filter is defined as either an Include or Exclude filter type. Data filters apply a pattern to existing and future database tables to determine which filter should be applied. Apply a data filter To apply a filter to remove the users table from the zero-ETL integration, complete the following steps: On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane. Choose the zero-ETL integration to add a filter to. The default filter is to include all databases and tables represented by an include:*.* filter. Choose Modify. Choose Add filter in the Source section. For Choose filter type, choose Exclude. For Filter expression, enter the expression demodb.users. Filter expression order matters. Filters are evaluated left to right, top to bottom, and subsequent filters will override previous filters. In this example, Aurora will evaluate that every table should be included (filter 1) and then evaluate that the demodb.users table should be excluded (filter 2). The exclusion filter therefore overrides the inclusion because it’s after the inclusion filter. Choose Continue. Review the changes, making sure that the order of the filters is correct, and choose Save changes. The integration will be added and will be in a Modifying state until the changes have been applied. This can take up to 30 minutes. To check if the changes have finished applying, choose the zero-ETL integration and check its status. When it shows as Active, the changes have been applied. Verify the change To verify the zero-ETL integration has been updated, complete the following steps: In the Redshift query editor v2, connect to your Redshift cluster. Choose (right-click) the aurora-zeroetl database you created and choose Refresh. Expand demodb and Tables. The users table is no longer available because it has been removed from the replication. All other tables are still available. If you run the same SELECT statement from earlier, you will receive an error stating the object does not exist in the database: select * from aurora_zeroetl.demodb.users; Apply a data filter using the AWS CLI The company’s business analysts now understand that more databases are being added to the Aurora MySQL database and they want to ensure only the demodb database is replicated to their Redshift cluster. To this end, they want to update the filters on the zero-ETL integration with the AWS Command Line Interface (AWS CLI). To add data filters to a zero-ETL integration using the AWS CLI, you can call the modify-integration command. In addition to the integration identifier, specify the --data-filter parameter with a comma-separated list of include and exclude filters. Complete the following steps to alter the filter on the zero-ETL integration: Open a terminal with the AWS CLI installed. Enter the following command to list all available integrations: aws rds describe-integrations Find the integration you want to update and copy the integration identifier. The integration identifier is an alphanumeric string at the end of the integration ARN. Run the following command, updating <integration identifier> with the identifier copied from the previous step: aws rds modify-integration --integration-identifier "<integration identifier>" --data-filter 'exclude: *.*, include: demodb.*, exclude: demodb.users' When Aurora is assessing this filter, it will exclude everything by default, then only include the demodb database, but exclude the demodb.users table. Data filters can implement regular expressions for the databases and table. For example, if you want to filter out any tables starting with user, you can run the following: aws rds modify-integration --integration-identifier "<integration identifier>" --data-filter 'exclude: *.*, include: demodb.*, exclude *./^user/' As with the previous filter change, the integration will be added and will be in a Modifying state until the changes have been applied. This can take up to 30 minutes. When it shows as Active, the changes have been applied. Clean up To remove the filter added to the zero-ETL integration, complete the following steps: On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane. Choose your zero-ETL integration. Choose Modify. Choose Remove next to the filters you want to remove. You can also change the Exclude filter type to Include. Alternatively, you can use the AWS CLI to run the following: aws rds modify-integration --integration-identifier "<integration identifier>" --data-filter 'include: *.*' Choose Continue. Choose Save changes. The data filter will take up to 30 minutes to apply the changes. After you remove data filters, Aurora reevaluates the remaining filters as if the removed filter had never existed. Any data that previously didn’t match the filtering criteria but now does is replicated into the target Redshift data warehouse. Conclusion In this post, we showed you how to set up data filtering on your Aurora zero-ETL integration from Amazon Aurora MySQL to Amazon Redshift. This allows you to enable near real time analytics on transactional and operational data while replicating only the data required. With data filtering, you can split workloads into separate Redshift endpoints, limit the replication of private or confidential datasets, and increase performance of workloads by only replicating required datasets. To learn more about Aurora zero-ETL integration with Amazon Redshift, see Working with Aurora zero-ETL integrations with Amazon Redshift and Working with zero-ETL integrations. About the authors Jyoti Aggarwal is a Product Management Lead for AWS zero-ETL. She leads the product and business strategy, including driving initiatives around performance, customer experience, and security. She brings along an expertise in cloud compute, data pipelines, analytics, artificial intelligence (AI), and data services including databases, data warehouses and data lakes. Sean Beath is an Analytics Solutions Architect at Amazon Web Services. He has experience in the full delivery lifecycle of data platform modernisation using AWS services, and works with customers to help drive analytics value on AWS. Gokul Soundararajan is a principal engineer at AWS and received a PhD from University of Toronto and has been working in the areas of storage, databases, and analytics. View the full article
-
- data filtering
- amazon aurora mysql
- (and 4 more)
-
AWS Secrets Manager launched a new capability that allows customers to create and rotate user credentials for Amazon Redshift Serverless. Amazon Redshift Serverless allows you to run and scale analytics without having to provision and manage data warehouse clusters. With this launch, you can now create and set up automatic rotation for your user credentials for Amazon Redshift Serverless data warehouse directly from the AWS Secrets Manager console. View the full article
-
- aws secrets manager
- amazon redshift
- (and 2 more)
-
It always pays to know more about your customers, and AWS Data Exchange makes it straightforward to use publicly available census data to enrich your customer dataset. The United States Census Bureau conducts the US census every 10 years and gathers household survey data. This data is anonymized, aggregated, and made available for public use. The smallest geographic area for which the Census Bureau collects and aggregates data are census blocks, which are formed by streets, roads, railroads, streams and other bodies of water, other visible physical and cultural features, and the legal boundaries shown on Census Bureau maps. If you know the census block in which a customer lives, you are able to make general inferences about their demographic characteristics. With these new attributes, you are able to build a segmentation model to identify distinct groups of customers that you can target with personalized messaging. This data is available to subscribe to on AWS Data Exchange—and with data sharing, you don’t need to pay to store a copy of it in your account in order to query it. In this post, we show how to use customer addresses to enrich a dataset with additional demographic details from the US Census Bureau dataset. Solution overview The solution includes the following high-level steps: Set up an Amazon Redshift Serverless endpoint and load customer data. Set up a place index in Amazon Location Service. Write an AWS Lambda user-defined function (UDF) to call Location Service from Amazon Redshift. Subscribe to census data on AWS Data Exchange. Use geospatial queries to tag addresses to census blocks. Create a new customer dataset in Amazon Redshift. Evaluate new customer data in Amazon QuickSight. The following diagram illustrates the solution architecture. Prerequisites You can use the following AWS CloudFormation template to deploy the required infrastructure. Before deployment, you need to sign up for QuickSight access through the AWS Management Console. Load generic address data to Amazon Redshift Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Redshift Serverless makes it straightforward to run analytics workloads of any size without having to manage data warehouse infrastructure. To load our address data, we first create a Redshift Serverless workgroup. Then we use Amazon Redshift Query Editor v2 to load customer data from Amazon Simple Storage Service (Amazon S3). Create a Redshift Serverless workgroup There are two primary components of the Redshift Serverless architecture: Namespace – A collection of database objects and users. Namespaces group together all of the resources you use in Redshift Serverless, such as schemas, tables, users, datashares, and snapshots. Workgroup – A collection of compute resources. Workgroups have network and security settings that you can configure using the Redshift Serverless console, the AWS Command Line Interface (AWS CLI), or the Redshift Serverless APIs. To create your namespace and workgroup, refer to Creating a data warehouse with Amazon Redshift Serverless. For this exercise, name your workgroup sandbox and your namespace adx-demo. Use Query Editor v2 to load customer data from Amazon S3 You can use Query Editor v2 to submit queries and load data to your data warehouse through a web interface. To configure Query Editor v2 for your AWS account, refer to Data load made easy and secure in Amazon Redshift using Query Editor V2. After it’s configured, complete the following steps: Use the following SQL to create the customer_data schema within the dev database in your data warehouse: CREATE SCHEMA customer_data; Use the following SQL DDL to create your target table into which you’ll load your customer address data: CREATE TABLE customer_data.customer_addresses ( address character varying(256) ENCODE lzo, unitnumber character varying(256) ENCODE lzo, municipality character varying(256) ENCODE lzo, region character varying(256) ENCODE lzo, postalcode character varying(256) ENCODE lzo, country character varying(256) ENCODE lzo, customer_id integer ENCODE az64 ) DISTSTYLE AUTO; Load the address_list.csv file to the table you just created. For instructions, refer to Data load made easy and secure in Amazon Redshift using Query Editor V2. The file has no column headers and is pipe delimited (|). For information on how to load data from either Amazon S3 or your local desktop, refer to Loading data into a database. Use Location Service to geocode and enrich address data Location Service lets you add location data and functionality to applications, which includes capabilities such as maps, points of interest, geocoding, routing, geofences, and tracking. Our data is in Amazon Redshift, so we need to access the Location Service APIs using SQL statements. Each row of data contains an address that we want to enrich and geotag using the Location Service APIs. Amazon Redshift allows developers to create UDFs using a SQL SELECT clause, Python, or Lambda. Lambda is a compute service that lets you run code without provisioning or managing servers. With Lambda UDFs, you can write custom functions with complex logic and integrate with third-party components. Scalar Lambda UDFs return one result per invocation of the function—in this case, the Lambda function runs one time for each row of data it receives. For this post, we write a Lambda function that uses the Location Service API to geotag and validate our customer addresses. Then we register this Lambda function as a UDF with our Redshift instance, allowing us to call the function from a SQL command. For instructions to create a Location Service place index and create your Lambda function and scalar UDF, refer to Access Amazon Location Service from Amazon Redshift. For this post, we use ESRI as a provider and name the place index placeindex.redshift. Test your new function with the following code, which returns the coordinates of the White House in Washington, DC: select public.f_geocode_address('1600 Pennsylvania Ave.','Washington','DC','20500','USA'); Subscribe to demographic data from AWS Data Exchange AWS Data Exchange is a data marketplace with more than 3,500 products from over 300 providers delivered—through files, APIs, or Amazon Redshift queries—directly to the data lakes, applications, analytics, and machine learning models that use it. First, we need to give our Redshift namespace permission via AWS Identity and Access Management (IAM) to access subscriptions on AWS Data Exchange. Then we can subscribe to our sample demographic data. Complete the following steps: On the IAM console, add the AWSDataExchangeSubscriberFullAccess managed policy to your Amazon Redshift commands access role you assigned when creating the namespace. On the AWS Data Exchange console, navigate to the dataset ACS – Sociodemographics (USA, Census Block Groups, 2019), provided by CARTO. Choose Continue to subscribe, then choose Subscribe. The subscription may take a few minutes to configure. When your subscription is in place, navigate back to the Redshift Serverless console. In the navigation pane, choose Datashares. On the Subscriptions tab, choose the datashare that you just subscribed to. On the datashare details page, choose Create database from datashare. Choose the namespace you created earlier and provide a name for the new database that will hold the shared objects from the dataset you subscribed to. In Query Editor v2, you should see the new database you just created and two new tables: one that holds the block group polygons and another that holds the demographic information for each block group. Join geocoded customer data to census data with geospatial queries There are two primary types of spatial data: raster and vector data. Raster data is represented as a grid of pixels and is beyond the scope of this post. Vector data is comprised of vertices, edges, and polygons. With geospatial data, vertices are represented as latitude and longitude points and edges are the connections between pairs of vertices. Think of the road connecting two intersections on a map. A polygon is a set of vertices with a series of connecting edges that form a continuous shape. A simple rectangle is a polygon, just as the state border of Ohio can be represented as a polygon. The geography_usa_blockgroup_2019 dataset that you subscribed to has 220,134 rows, each representing a single census block group and its geographic shape. Amazon Redshift supports the storage and querying of vector-based spatial data with the GEOMETRY and GEOGRAPHY data types. You can use Redshift SQL functions to perform queries such as a point in polygon operation to determine if a given latitude/longitude point falls within the boundaries of a given polygon (such as state or county boundary). In this dataset, you can observe that the geom column in geography_usa_blockgroup_2019 is of type GEOMETRY. Our goal is to determine which census block (polygon) each of our geotagged addresses falls within so we can enrich our customer records with details that we know about the census block. Complete the following steps: Build a new table with the geocoding results from our UDF: CREATE TABLE customer_data.customer_addresses_geocoded AS select address ,unitnumber ,municipality ,region ,postalcode ,country ,customer_id ,public.f_geocode_address(address||' '||unitnumber,municipality,region,postalcode,country) as geocode_result FROM customer_data.customer_addresses; Use the following code to extract the different address fields and latitude/longitude coordinates from the JSON column and create a new table with the results: CREATE TABLE customer_data.customer_addresses_points AS SELECT customer_id ,geo_address address ,unitnumber ,municipality ,region ,postalcode ,country ,longitude ,latitude ,ST_SetSRID(ST_MakePoint(Longitude, Latitude),4326) as address_point --create new geom column of type POINT, set new point SRID = 4326 FROM ( select customer_id ,address ,unitnumber ,municipality ,region ,postalcode ,country ,cast(json_extract_path_text(geocode_result, 'Label', true) as VARCHAR) as geo_address ,cast(json_extract_path_text(geocode_result, 'Longitude', true) as float) as longitude ,cast(json_extract_path_text(geocode_result, 'Latitude', true) as float) as latitude --use json function to extract fields from geocode_result from customer_data.customer_addresses_geocoded) a; This code uses the ST_POINT function to create a new column from the latitude/longitude coordinates called address_point of type GEOMETRY and subtype POINT. It uses the ST_SetSRID geospatial function to set the spatial reference identifier (SRID) of the new column to 4326. The SRID defines the spatial reference system to be used when evaluating the geometry data. It’s important when joining or comparing geospatial data that they have matching SRIDs. You can check the SRID of an existing geometry column by using the ST_SRID function. For more information on SRIDs and GEOMETRY data types, refer to Querying spatial data in Amazon Redshift. Now that your customer addresses are geocoded as latitude/longitude points in a geometry column, you can use a join to identify which census block shape your new point falls within: CREATE TABLE customer_data.customer_addresses_with_census AS select c.* ,shapes.geoid as census_group_shape ,demo.* from customer_data.customer_addresses_points c inner join "carto_census_data"."carto".geography_usa_blockgroup_2019 shapes on ST_Contains(shapes.geom, c.address_point) --join tables where the address point falls within the census block geometry inner join carto_census_data.usa_acs.demographics_sociodemographics_usa_blockgroup_2019_yearly_2019 demo on demo.geoid = shapes.geoid; The preceding code creates a new table called customer_addresses_with_census, which joins the customer addresses to the census block in which they belong as well as the demographic data associated with that census block. To do this, you used the ST_CONTAINS function, which accepts two geometry data types as an input and returns TRUE if the 2D projection of the first input geometry contains the second input geometry. In our case, we have census blocks represented as polygons and addresses represented as points. The join in the SQL statement succeeds when the point falls within the boundaries of the polygon. Visualize the new demographic data with QuickSight QuickSight is a cloud-scale business intelligence (BI) service that you can use to deliver easy-to-understand insights to the people who you work with, wherever they are. QuickSight connects to your data in the cloud and combines data from many different sources. First, let’s build some new calculated fields that will help us better understand the demographics of our customer base. We can do this in QuickSight, or we can use SQL to build the columns in a Redshift view. The following is the code for a Redshift view: CREATE VIEW customer_data.customer_features AS ( SELECT customer_id ,postalcode ,region ,municipality ,geoid as census_geoid ,longitude ,latitude ,total_pop ,median_age ,white_pop/total_pop as perc_white ,black_pop/total_pop as perc_black ,asian_pop/total_pop as perc_asian ,hispanic_pop/total_pop as perc_hispanic ,amerindian_pop/total_pop as perc_amerindian ,median_income ,income_per_capita ,median_rent ,percent_income_spent_on_rent ,unemployed_pop/coalesce(pop_in_labor_force) as perc_unemployment ,(associates_degree + bachelors_degree + masters_degree + doctorate_degree)/total_pop as perc_college_ed ,(household_language_total - household_language_english)/coalesce(household_language_total) as perc_other_than_english FROM "dev"."customer_data"."customer_addresses_with_census" t ); To get QuickSight to talk to our Redshift Serverless endpoint, complete the following steps: Manually authorize connections from QuickSight to Redshift clusters. For instructions, refer to Authorizing connections from Amazon QuickSight to Amazon Redshift clusters (stop after Step 19). Configure the VPC connection between QuickSight and the Redshift Serverless endpoint. Now you can create a new dataset in QuickSight. On the QuickSight console, choose Datasets in the navigation pane. Choose New dataset. We want to create a dataset from a new data source and use the Redshift: Manual connect option. Provide the connection information for your Redshift Serverless workgroup. You will need the endpoint for our workgroup and the user name and password that you created when you set up your workgroup. You can find your workgroup’s endpoint on the Redshift Serverless console by navigating to your workgroup configuration. The following screenshot is an example of the connection settings needed. Notice the connection type is the name of the VPC connection that you previously configured in QuickSight. When you copy the endpoint from the Redshift console, be sure to remove the database and port number from the end of the URL before entering it in the field. Save the new data source configuration. You’ll be prompted to choose the table you want to use for your dataset. Choose the new view that you created that has your new derived fields. Select Directly query your data. This will connect your visualizations directly to the data in the database rather than ingesting data into the QuickSight in-memory data store. To create a histogram of median income level, choose the blank visual on Sheet1 and then choose the histogram visual icon under Visual types. Choose median_income under Fields list and drag it to the Value field well. This builds a histogram showing the distribution of median_income for our customers based on the census block group in which they live. Conclusion In this post, we demonstrated how companies can use open census data available on AWS Data Exchange to effortlessly gain a high-level understanding of their customer base from a demographic standpoint. This basic understanding of customers based on where they live can serve as the foundation for more targeted marketing campaigns and even influence product development and service offerings. As always, AWS welcomes your feedback. Please leave your thoughts and questions in the comments section. About the Author Tony Stricker is a Principal Technologist on the Data Strategy team at AWS, where he helps senior executives adopt a data-driven mindset and align their people/process/technology in ways that foster innovation and drive towards specific, tangible business outcomes. He has a background as a data warehouse architect and data scientist and has delivered solutions in to production across multiple industries including oil and gas, financial services, public sector, and manufacturing. In his spare time, Tony likes to hang out with his dog and cat, work on home improvement projects, and restore vintage Airstream campers. View the full article
-
- geospatial
- amazon redshift
-
(and 2 more)
Tagged with:
-
Amazon Redshift now supports automatic and incremental refresh of materialized views on the data sharing consumer data warehouses when base tables used for materialzied views are shared data. With automatic and incremental refresh of materialized views, Amazon Redshift identifies the changes to the data in the base tables since the last refresh, and updates only these changes in the materialized views, executing refresh faster as compared to full refresh. View the full article
-
Forum Statistics
67.4k
Total Topics65.3k
Total Posts