Showing results for tags 'zero-etl'.

Found 5 results

  1. Amazon Aurora MySQL zero-ETL integration with Amazon Redshift is now supported in 11 additional regions, enabling near real-time analytics and machine learning (ML) using Amazon Redshift. Based on your analytics needs, you can include or exclude specific databases and tables from an existing or a new zero-ETL integration and selectively bring data into Amazon Redshift. View the full article
  2. Healthcare providers have an opportunity to improve the patient experience by collecting and analyzing broader and more diverse datasets. This includes patient medical history, allergies, immunizations, family disease history, and individuals’ lifestyle data such as workout habits. Having access to those datasets and forming a 360-degree view of patients allows healthcare providers such as claim analysts to see a broader context about each patient and personalize the care they provide for every individual. This is underpinned by building a complete patient profile that enables claim analysts to identify patterns, trends, potential gaps in care, and adherence to care plans. They can then use the results of their analysis to understand a patient’s health status, treatment history, and past or upcoming doctor consultations to make more informed decisions, streamline the claim management process, and improve operational outcomes. Achieving this will also improve general public health through better and more timely interventions, identify health risks through predictive analytics, and accelerate the research and development process.

AWS has invested in a zero-ETL (extract, transform, and load) future so that builders can focus more on creating value from data, instead of having to spend time preparing data for analysis. The solution proposed in this post follows a zero-ETL approach to data integration to facilitate near real-time analytics and deliver a more personalized patient experience. The solution uses AWS services such as AWS HealthLake, Amazon Redshift, Amazon Kinesis Data Streams, and AWS Lake Formation to build a 360-degree view of patients. These services enable you to collect and analyze data in near real time and put a comprehensive data governance framework in place that uses granular access control to secure sensitive data from unauthorized users.

Zero-ETL refers to a set of features on the AWS Cloud that enable integrating different data sources with Amazon Redshift:

• Integration between Amazon Redshift and Amazon Simple Storage Service (Amazon S3) via Amazon Redshift Spectrum and auto-copy features
• Integration between Amazon Redshift and Amazon Aurora, Amazon Relational Database Service (Amazon RDS), and Amazon DynamoDB via the zero-ETL feature
• Integration between Amazon Redshift and streaming sources like Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK) via streaming ingestion

Solution overview

Organizations in the healthcare industry are currently spending a significant amount of time and money on building complex ETL pipelines for data movement and integration. This means data is replicated across multiple data stores via bespoke and in some cases hand-written ETL jobs, resulting in data inconsistency, latency, and potential security and privacy breaches.

With support for querying cross-account Apache Iceberg tables via Amazon Redshift, you can now build a more comprehensive patient-360 analysis by querying all patient data from one place. This means you can seamlessly combine information such as clinical data stored in HealthLake with data stored in operational databases such as a patient relationship management system, together with data produced from wearable devices in near real time. Having access to all this data enables healthcare organizations to form a holistic view of patients, improve care coordination across multiple organizations, and provide highly personalized care for each individual.
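One way to make S3-resident data (such as the Iceberg tables that HealthLake manages) queryable from Amazon Redshift is the Redshift Spectrum path listed above: expose an AWS Glue Data Catalog database as an external schema. The following is a minimal, hypothetical sketch; the schema name, Glue database name, and IAM role ARN are placeholders, and the walkthrough below instead relies on the automatically mounted awsdatacatalog database, so creating an external schema manually is optional.

```sql
-- Hypothetical example: expose a Glue Data Catalog database that points at
-- files on Amazon S3 as an external schema in Amazon Redshift.
-- The schema name, database name, and IAM role ARN below are placeholders.
CREATE EXTERNAL SCHEMA IF NOT EXISTS healthlake_spectrum
FROM DATA CATALOG
DATABASE 'healthlake_glue_database'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';

-- Tables cataloged in that Glue database can then be queried in place:
-- SELECT count(*) FROM healthlake_spectrum.patient;
```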
The following diagram depicts the high-level solution we build to achieve these outcomes. Deploy the solution You can use the following AWS CloudFormation template to deploy the solution components: This stack creates the following resources and necessary permissions to integrate the services: A Kinesis data stream. You can send data from your streaming source to this resource for ingesting the data into a Redshift data warehouse. We use on-demand capacity mode. An Amazon Aurora MySQL-Compatible Edition cluster version 8.0. This will be your online transaction processing (OLTP) data store for transactional data. To set up zero-ETL integration for ingesting transaction data to the Redshift data warehouse, see Getting started guide for near-real time operational analytics using Amazon Aurora zero-ETL integration with Amazon Redshift. The required parameter groups for source and target are already created as part of the CloudFormation stack. An Amazon Redshift Serverless workgroup and associated namespace. The CloudFormation stack also deploys a provisioned Redshift cluster. If you would like to work with Redshift Serverless, you can remove the provisioned cluster from the template and vice versa. An AWS Identity and Access Management (IAM) role with required policies and trust relationships. Network components, including VPC, subnets, route table, and associations. You can customize these resources as per your organization’s rules. AWS Solution setup AWS HealthLake AWS HealthLake enables organizations in the health industry to securely store, transform, transact, and analyze health data. It stores data in HL7 FHIR format, which is an interoperability standard designed for quick and efficient exchange of health data. When you create a HealthLake data store, a Fast Healthcare Interoperability Resources (FHIR) data repository is made available via a RESTful API endpoint. Simultaneously and as part of AWS HealthLake managed service, the nested JSON FHIR data undergoes an ETL process and is stored in Apache Iceberg open table format in Amazon S3. To create an AWS HealthLake data store, refer to Getting started with AWS HealthLake. Make sure to select the option Preload sample data when creating your data store. In real-world scenarios and when you use AWS HealthLake in production environments, you don’t need to load sample data into your AWS HealthLake data store. Instead, you can use FHIR REST API operations to manage and search resources in your AWS HealthLake data store. We use two tables from the sample data stored in HealthLake: patient and allergyintolerance. Query AWS HealthLake tables with Redshift Serverless Amazon Redshift is the data warehousing service available on the AWS Cloud that provides up to six times better price-performance than any other cloud data warehouses in the market, with a fully managed, AI-powered, massively parallel processing (MPP) data warehouse built for performance, scale, and availability. With continuous innovations added to Amazon Redshift, it is now more than just a data warehouse. It enables organizations of different sizes and in different industries to access all the data they have in their AWS environments and analyze it from one single location with a set of features under the zero-ETL umbrella. Amazon Redshift integrates with AWS HealthLake and data lakes through Redshift Spectrum and Amazon S3 auto-copy features, enabling you to query data directly from files on Amazon S3. 
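Before running the queries in the next section, it can help to confirm that the HealthLake tables surfaced through the Data Catalog are visible from your Redshift Serverless workgroup. This is a small sketch that assumes the automatically mounted awsdatacatalog database described next; the exact columns exposed by SVV_ALL_TABLES may vary across Redshift versions.

```sql
-- List the HealthLake tables (for example, patient and allergyintolerance)
-- that Redshift can see through the mounted AWS Glue Data Catalog.
SELECT database_name, schema_name, table_name
FROM svv_all_tables
WHERE database_name = 'awsdatacatalog'
ORDER BY schema_name, table_name;
```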
Query AWS HealthLake data with Amazon Redshift Amazon Redshift makes it straightforward to query the data stored in S3-based data lakes with automatic mounting of an AWS Glue Data Catalog in the Redshift query editor v2. This means you no longer have to create an external schema in Amazon Redshift to use the data lake tables cataloged in the Data Catalog. To get started with this feature, see Querying the AWS Glue Data Catalog. After it is set up and you’re connected to the Redshift query editor v2, complete the following steps: Validate that your tables are visible in the query editor V2. The Data Catalog objects are listed under the awsdatacatalog database. FHIR data stored in AWS HealthLake is highly nested. To learn about how to un-nest semi-structured data with Amazon Redshift, see Tutorial: Querying nested data with Amazon Redshift Spectrum. Use the following query to un-nest the allergyintolerance and patient tables, join them together, and get patient details and their allergies: WITH patient_allergy AS ( SELECT resourcetype, c AS allery_category, a."patient"."reference", SUBSTRING(a."patient"."reference", 9, LEN(a."patient"."reference")) AS patient_id, a.recordeddate AS allergy_record_date, NVL(cd."code", 'NA') AS allergy_code, NVL(cd.display, 'NA') AS allergy_description FROM "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."allergyintolerance" a LEFT JOIN a.category c ON TRUE LEFT JOIN a.reaction r ON TRUE LEFT JOIN r.manifestation m ON TRUE LEFT JOIN m.coding cd ON TRUE ), patinet_info AS ( SELECT id, gender, g as given_name, n.family as family_name, pr as prefix FROM "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."patient" p LEFT JOIN p.name n ON TRUE LEFT JOIN n.given g ON TRUE LEFT JOIN n.prefix pr ON TRUE ) SELECT DISTINCT p.id, p.gender, p.prefix, p.given_name, p.family_name, pa.allery_category, pa.allergy_code, pa.allergy_description from patient_allergy pa JOIN patinet_info p ON pa.patient_id = p.id ORDER BY p.id, pa.allergy_code ; To eliminate the need for Amazon Redshift to un-nest data every time a query is run, you can create a materialized view to hold un-nested and flattened data. Materialized views are an effective mechanism to deal with complex and repeating queries. They contain a precomputed result set, based on a SQL query over one or more base tables. You can issue SELECT statements to query a materialized view, in the same way that you can query other tables or views in the database. Use the following SQL to create a materialized view. 
You use it later to build a complete view of patients: CREATE MATERIALIZED VIEW patient_allergy_info AUTO REFRESH YES AS WITH patient_allergy AS ( SELECT resourcetype, c AS allery_category, a."patient"."reference", SUBSTRING(a."patient"."reference", 9, LEN(a."patient"."reference")) AS patient_id, a.recordeddate AS allergy_record_date, NVL(cd."code", 'NA') AS allergy_code, NVL(cd.display, 'NA') AS allergy_description FROM "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."allergyintolerance" a LEFT JOIN a.category c ON TRUE LEFT JOIN a.reaction r ON TRUE LEFT JOIN r.manifestation m ON TRUE LEFT JOIN m.coding cd ON TRUE ), patinet_info AS ( SELECT id, gender, g as given_name, n.family as family_name, pr as prefix FROM "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."patient" p LEFT JOIN p.name n ON TRUE LEFT JOIN n.given g ON TRUE LEFT JOIN n.prefix pr ON TRUE ) SELECT DISTINCT p.id, p.gender, p.prefix, p.given_name, p.family_name, pa.allery_category, pa.allergy_code, pa.allergy_description from patient_allergy pa JOIN patinet_info p ON pa.patient_id = p.id ORDER BY p.id, pa.allergy_code ; You have confirmed you can query data in AWS HealthLake via Amazon Redshift. Next, you set up zero-ETL integration between Amazon Redshift and Amazon Aurora MySQL. Set up zero-ETL integration between Amazon Aurora MySQL and Redshift Serverless Applications such as front-desk software, which are used to schedule appointments and register new patients, store data in OLTP databases such as Aurora. To get data out of OLTP databases and have them ready for analytics use cases, data teams might have to spend a considerable amount of time to build, test, and deploy ETL jobs that are complex to maintain and scale. With the Amazon Redshift zero-ETL integration with Amazon Aurora MySQL, you can run analytics on the data stored in OLTP databases and combine them with the rest of the data in Amazon Redshift and AWS HealthLake in near real time. In the next steps in this section, we connect to a MySQL database and set up zero-ETL integration with Amazon Redshift. Connect to an Aurora MySQL database and set up data Connect to your Aurora MySQL database using your editor of choice using AdminUsername and AdminPassword that you entered when running the CloudFormation stack. (For simplicity, it is the same for Amazon Redshift and Aurora.) When you’re connected to your database, complete the following steps: Create a new database by running the following command: CREATE DATABASE front_desk_app_db; Create a new table. This table simulates storing patient information as they visit clinics and other healthcare centers. For simplicity and to demonstrate specific capabilities, we assume that patient IDs are the same in AWS HealthLake and the front-of-office application. In real-world scenarios, this can be a hashed version of a national health care number: CREATE TABLE patient_appointment ( patient_id varchar(250), gender varchar(1), date_of_birth date, appointment_datetime datetime, phone_number varchar(15), PRIMARY KEY (patient_id, appointment_datetime) ); Having a primary key in the table is mandatory for zero-ETL integration to work. Insert new records into the source table in the Aurora MySQL database. To demonstrate the required functionalities, make sure the patient_id of the sample records inserted into the MySQL database match the ones in AWS HealthLake. 
Replace [PATIENT_ID_1] and [PATIENT_ID_2] in the following query with the ones from the Redshift query you ran previously (the query that joined allergyintolerance and patient):

INSERT INTO front_desk_app_db.patient_appointment (patient_id, gender, date_of_birth, appointment_datetime, phone_number)
VALUES
([PATIENT_ID_1], 'F', '1988-7-04', '2023-12-19 10:15:00', '0401401401'),
([PATIENT_ID_1], 'F', '1988-7-04', '2023-09-19 11:00:00', '0401401401'),
([PATIENT_ID_1], 'F', '1988-7-04', '2023-06-06 14:30:00', '0401401401'),
([PATIENT_ID_2], 'F', '1972-11-14', '2023-12-19 08:15:00', '0401401402'),
([PATIENT_ID_2], 'F', '1972-11-14', '2023-01-09 12:15:00', '0401401402');

Now that your source table is populated with sample records, you can set up zero-ETL and have data ingested into Amazon Redshift.

Set up zero-ETL integration between Amazon Aurora MySQL and Amazon Redshift

Complete the following steps to create your zero-ETL integration: On the Amazon RDS console, choose Databases in the navigation pane. Choose the DB identifier of your cluster (not the instance). On the Zero-ETL Integration tab, choose Create zero-ETL integration. Follow the steps to create your integration.

Create a Redshift database from the integration

Next, you create a target database from the integration. You can do this by running a couple of simple SQL commands on Amazon Redshift. Log in to the query editor v2 and run the following commands:

Get the integration ID of the zero-ETL integration you set up between your source database and Amazon Redshift:

SELECT * FROM svv_integration;

Create a database using the integration ID:

CREATE DATABASE ztl_demo FROM INTEGRATION '[INTEGRATION_ID]';

Query the database and validate that a new table is created and populated with data from your source MySQL database:

SELECT * FROM ztl_demo.front_desk_app_db.patient_appointment;

It might take a few seconds for the first set of records to appear in Amazon Redshift. This shows that the integration is working as expected. To validate it further, you can insert a new record in your Aurora MySQL database, and it will be available for querying in Amazon Redshift in near real time, within a few seconds.

Set up streaming ingestion for Amazon Redshift

Another aspect of zero-ETL on AWS, for real-time and streaming data, is realized through Amazon Redshift Streaming Ingestion. It provides low-latency, high-speed ingestion of streaming data from Kinesis Data Streams and Amazon MSK. It lowers the effort required to have data ready for analytics workloads, lowers the cost of running such workloads on the cloud, and decreases the operational burden of maintaining the solution. In the context of healthcare, understanding an individual’s exercise and movement patterns can help with overall health assessment and better treatment planning. In this section, you send simulated data from wearable devices to Kinesis Data Streams and integrate it with the rest of the data you already have access to from your Redshift Serverless data warehouse.

For step-by-step instructions, refer to Real-time analytics with Amazon Redshift streaming ingestion. Note the following steps when you set up streaming ingestion for Amazon Redshift: Select wearables_stream and use the following template when sending data to Amazon Kinesis Data Streams via Kinesis Data Generator, to simulate data generated by wearable devices.
Replace [PATIENT_ID_1] and [PATIENT_ID_2] with the patient IDs you used earlier when inserting new records into your Aurora MySQL table:

{ "patient_id": "{{random.arrayElement(["[PATIENT_ID_1]","[PATIENT_ID_2]"])}}", "steps_increment": "{{random.arrayElement( [0,1] )}}", "heart_rate": {{random.number( { "min":45, "max":120} )}} }

Create an external schema called from_kds by running the following query and replacing [IAM_ROLE_ARN] with the ARN of the role created by the CloudFormation stack (Patient360BlogRole):

CREATE EXTERNAL SCHEMA from_kds FROM KINESIS IAM_ROLE '[IAM_ROLE_ARN]';

Use the following SQL when creating a materialized view to consume data from the stream:

CREATE MATERIALIZED VIEW patient_wearable_data AUTO REFRESH YES AS SELECT approximate_arrival_timestamp, JSON_PARSE(kinesis_data) as Data FROM from_kds."wearables_stream" WHERE CAN_JSON_PARSE(kinesis_data);

To validate that streaming ingestion works as expected, refresh the materialized view to get the data you already sent to the data stream and query the table to make sure data has landed in Amazon Redshift:

REFRESH MATERIALIZED VIEW patient_wearable_data;

SELECT * FROM patient_wearable_data ORDER BY approximate_arrival_timestamp DESC;

Query and analyze patient wearable data

The results in the data column of the preceding query are in JSON format. Amazon Redshift makes it straightforward to work with semi-structured data in JSON format. It uses the PartiQL language to offer SQL-compatible access to relational, semi-structured, and nested data. Use the following query to flatten the data:

SELECT data."patient_id"::varchar AS patient_id, data."steps_increment"::integer as steps_increment, data."heart_rate"::integer as heart_rate, approximate_arrival_timestamp FROM patient_wearable_data ORDER BY approximate_arrival_timestamp DESC;

The result looks like the following screenshot. Now that you know how to flatten JSON data, you can analyze it further, as in the example that follows.
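For instance, staying with the objects already defined in this post (the patient_wearable_data materialized view and its data column), a simple aggregation like the following sketch turns the simulated step increments into a daily step count per patient:

```sql
-- Total simulated steps per patient per day, derived from the streaming data.
SELECT
    data."patient_id"::varchar            AS patient_id,
    DATE(approximate_arrival_timestamp)   AS date_received,
    SUM(data."steps_increment"::integer)  AS total_steps
FROM patient_wearable_data
GROUP BY 1, 2
ORDER BY 1, 2;
```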
Use the following query to get the number of minutes a patient has been physically active per day, based on their heart rate (greater than 80): WITH patient_wearble_flattened AS ( SELECT data."patient_id"::varchar AS patient_id, data."steps_increment"::integer as steps_increment, data."heart_rate"::integer as heart_rate, approximate_arrival_timestamp, DATE(approximate_arrival_timestamp) AS date_received, extract(hour from approximate_arrival_timestamp) AS hour_received, extract(minute from approximate_arrival_timestamp) AS minute_received FROM patient_wearable_data ), patient_active_minutes AS ( SELECT patient_id, date_received, hour_received, minute_received, avg(heart_rate) AS heart_rate FROM patient_wearble_flattened GROUP BY patient_id, date_received, hour_received, minute_received HAVING avg(heart_rate) > 80 ) SELECT patient_id, date_received, COUNT(heart_rate) AS active_minutes_count FROM patient_active_minutes GROUP BY patient_id, date_received ORDER BY patient_id, date_received; Create a complete patient 360 Now that you are able to query all patient data with Redshift Serverless, you can combine the three datasets you used in this post and form a comprehensive patient 360 view with the following query: WITH patient_appointment_info AS ( SELECT "patient_id", "gender", "date_of_birth", "appointment_datetime", "phone_number" FROM ztl_demo.front_desk_app_db.patient_appointment ), patient_wearble_flattened AS ( SELECT data."patient_id"::varchar AS patient_id, data."steps_increment"::integer as steps_increment, data."heart_rate"::integer as heart_rate, approximate_arrival_timestamp, DATE(approximate_arrival_timestamp) AS date_received, extract(hour from approximate_arrival_timestamp) AS hour_received, extract(minute from approximate_arrival_timestamp) AS minute_received FROM patient_wearable_data ), patient_active_minutes AS ( SELECT patient_id, date_received, hour_received, minute_received, avg(heart_rate) AS heart_rate FROM patient_wearble_flattened GROUP BY patient_id, date_received, hour_received, minute_received HAVING avg(heart_rate) > 80 ), patient_active_minutes_count AS ( SELECT patient_id, date_received, COUNT(heart_rate) AS active_minutes_count FROM patient_active_minutes GROUP BY patient_id, date_received ) SELECT pai.patient_id, pai.gender, pai.prefix, pai.given_name, pai.family_name, pai.allery_category, pai.allergy_code, pai.allergy_description, ppi.date_of_birth, ppi.appointment_datetime, ppi.phone_number, pamc.date_received, pamc.active_minutes_count FROM patient_allergy_info pai LEFT JOIN patient_active_minutes_count pamc ON pai.patient_id = pamc.patient_id LEFT JOIN patient_appointment_info ppi ON pai.patient_id = ppi.patient_id GROUP BY pai.patient_id, pai.gender, pai.prefix, pai.given_name, pai.family_name, pai.allery_category, pai.allergy_code, pai.allergy_description, ppi.date_of_birth, ppi.appointment_datetime, ppi.phone_number, pamc.date_received, pamc.active_minutes_count ORDER BY pai.patient_id, pai.gender, pai.prefix, pai.given_name, pai.family_name, pai.allery_category, pai.allergy_code, pai.allergy_description, ppi.date_of_birth DESC, ppi.appointment_datetime DESC, ppi.phone_number DESC, pamc.date_received, pamc.active_minutes_count You can use the solution and queries used here to expand the datasets used in your analysis. For example, you can include other tables from AWS HealthLake as needed. Clean up To clean up resources you created, complete the following steps: Delete the zero-ETL integration between Amazon RDS and Amazon Redshift. 
Delete the CloudFormation stack. Delete the AWS HealthLake data store.

Conclusion

Forming a comprehensive 360-degree view of patients by integrating data from various sources offers numerous benefits for organizations operating in the healthcare industry. It enables healthcare providers to gain a holistic understanding of a patient’s medical journey, enhances clinical decision-making, and allows for more accurate diagnosis and tailored treatment plans. With zero-ETL features for data integration on AWS, you can build such a view of patients securely, cost-effectively, and with minimal effort. You can then use visualization tools such as Amazon QuickSight to build dashboards or use Amazon Redshift ML to enable data analysts and database developers to train machine learning (ML) models with the data integrated through Amazon Redshift zero-ETL. The result is a set of ML models that are trained on a broader view of patients, their medical history, and their lifestyle, and that therefore enable you to make more accurate predictions about their upcoming health needs.

About the Authors

Saeed Barghi is a Sr. Analytics Specialist Solutions Architect specializing in architecting enterprise data platforms. He has extensive experience in the fields of data warehousing, data engineering, data lakes, and AI/ML. Based in Melbourne, Australia, Saeed works with public sector customers in Australia and New Zealand.

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 17 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

View the full article
  3. Amazon Relational Database Service (Amazon RDS) for MySQL zero-ETL integration with Amazon Redshift was announced in preview at AWS re:Invent 2023 for Amazon RDS for MySQL version 8.0.28 or higher. In this post, we provide step-by-step guidance on how to get started with near real-time operational analytics using this feature. This post is a continuation of the zero-ETL series that started with Getting started guide for near-real time operational analytics using Amazon Aurora zero-ETL integration with Amazon Redshift. Challenges Customers across industries today are looking to use data to their competitive advantage and increase revenue and customer engagement by implementing near real time analytics use cases like personalization strategies, fraud detection, inventory monitoring, and many more. There are two broad approaches to analyzing operational data for these use cases: Analyze the data in-place in the operational database (such as read replicas, federated query, and analytics accelerators) Move the data to a data store optimized for running use case-specific queries such as a data warehouse The zero-ETL integration is focused on simplifying the latter approach. The extract, transform, and load (ETL) process has been a common pattern for moving data from an operational database to an analytics data warehouse. ELT is where the extracted data is loaded as is into the target first and then transformed. ETL and ELT pipelines can be expensive to build and complex to manage. With multiple touchpoints, intermittent errors in ETL and ELT pipelines can lead to long delays, leaving data warehouse applications with stale or missing data, further leading to missed business opportunities. Alternatively, solutions that analyze data in-place may work great for accelerating queries on a single database, but such solutions aren’t able to aggregate data from multiple operational databases for customers that need to run unified analytics. Zero-ETL Unlike the traditional systems where data is siloed in one database and the user has to make a trade-off between unified analysis and performance, data engineers can now replicate data from multiple RDS for MySQL databases into a single Redshift data warehouse to derive holistic insights across many applications or partitions. Updates in transactional databases are automatically and continuously propagated to Amazon Redshift so data engineers have the most recent information in near real time. There is no infrastructure to manage and the integration can automatically scale up and down based on the data volume. At AWS, we have been making steady progress towards bringing our zero-ETL vision to life. The following sources are currently supported for zero-ETL integrations: Amazon Aurora MySQL-Compatible Edition (generally available) Amazon Aurora PostgreSQL-Compatible Edition (preview) Amazon RDS for MySQL (preview) Amazon DynamoDB (limited preview) When you create a zero-ETL integration for Amazon Redshift, you continue to pay for underlying source database and target Redshift database usage. Refer to Zero-ETL integration costs (Preview) for further details. With zero-ETL integration with Amazon Redshift, the integration replicates data from the source database into the target data warehouse. The data becomes available in Amazon Redshift within seconds, allowing you to use the analytics features of Amazon Redshift and capabilities like data sharing, workload optimization autonomics, concurrency scaling, machine learning, and many more. 
You can continue with your transaction processing on Amazon RDS or Amazon Aurora while simultaneously using Amazon Redshift for analytics workloads such as reporting and dashboards. The following diagram illustrates this architecture. Solution overview Let’s consider TICKIT, a fictional website where users buy and sell tickets online for sporting events, shows, and concerts. The transactional data from this website is loaded into an Amazon RDS for MySQL 8.0.28 (or higher version) database. The company’s business analysts want to generate metrics to identify ticket movement over time, success rates for sellers, and the best-selling events, venues, and seasons. They would like to get these metrics in near real time using a zero-ETL integration. The integration is set up between Amazon RDS for MySQL (source) and Amazon Redshift (destination). The transactional data from the source gets refreshed in near real time on the destination, which processes analytical queries. You can use either the serverless option or an encrypted RA3 cluster for Amazon Redshift. For this post, we use a provisioned RDS database and a Redshift provisioned data warehouse. The following diagram illustrates the high-level architecture. The following are the steps needed to set up zero-ETL integration. These steps can be done automatically by the zero-ETL wizard, but you will require a restart if the wizard changes the setting for Amazon RDS or Amazon Redshift. You could do these steps manually, if not already configured, and perform the restarts at your convenience. For the complete getting started guides, refer to Working with Amazon RDS zero-ETL integrations with Amazon Redshift (preview) and Working with zero-ETL integrations. Configure the RDS for MySQL source with a custom DB parameter group. Configure the Redshift cluster to enable case-sensitive identifiers. Configure the required permissions. Create the zero-ETL integration. Create a database from the integration in Amazon Redshift. Configure the RDS for MySQL source with a customized DB parameter group To create an RDS for MySQL database, complete the following steps: On the Amazon RDS console, create a DB parameter group called zero-etl-custom-pg. Zero-ETL integration works by using binary logs (binlogs) generated by MySQL database. To enable binlogs on Amazon RDS for MySQL, a specific set of parameters must be enabled. Set the following binlog cluster parameter settings: binlog_format = ROW binlog_row_image = FULL binlog_checksum = NONE In addition, make sure that the binlog_row_value_options parameter is not set to PARTIAL_JSON. By default, this parameter is not set. Choose Databases in the navigation pane, then choose Create database. For Engine Version, choose MySQL 8.0.28 (or higher). For Templates, select Production. For Availability and durability, select either Multi-AZ DB instance or Single DB instance (Multi-AZ DB clusters are not supported, as of this writing). For DB instance identifier, enter zero-etl-source-rms. Under Instance configuration, select Memory optimized classes and choose the instance db.r6g.large, which should be sufficient for TICKIT use case. Under Additional configuration, for DB cluster parameter group, choose the parameter group you created earlier (zero-etl-custom-pg). Choose Create database. In a couple of minutes, it should spin up an RDS for MySQL database as the source for zero-ETL integration. 
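Before moving on, you can optionally confirm from the database itself that the parameter group settings are in effect. The following is a minimal check you could run against the zero-etl-source-rms instance once it is available; it only reads server variables and assumes the binlog parameters listed above.

```sql
-- Run against the RDS for MySQL source to confirm the binlog configuration
-- required for zero-ETL integration (set via the zero-etl-custom-pg group).
SHOW VARIABLES LIKE 'binlog_format';            -- expected: ROW
SHOW VARIABLES LIKE 'binlog_row_image';         -- expected: FULL
SHOW VARIABLES LIKE 'binlog_checksum';          -- expected: NONE
SHOW VARIABLES LIKE 'binlog_row_value_options'; -- expected: empty (not PARTIAL_JSON)
```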
Configure the Redshift destination

After you create your source DB cluster, you must create and configure a target data warehouse in Amazon Redshift. The data warehouse must meet the following requirements:

• Using an RA3 node type (ra3.16xlarge, ra3.4xlarge, or ra3.xlplus) or Amazon Redshift Serverless
• Encrypted (if using a provisioned cluster)

For our use case, create a Redshift cluster by completing the following steps: On the Amazon Redshift console, choose Configurations and then choose Workload management. In the parameter group section, choose Create. Create a new parameter group named zero-etl-rms. Choose Edit parameters and change the value of enable_case_sensitive_identifier to True. Choose Save.

You can also use the AWS Command Line Interface (AWS CLI) command update-workgroup for Redshift Serverless:

aws redshift-serverless update-workgroup --workgroup-name <your-workgroup-name> --config-parameters parameterKey=enable_case_sensitive_identifier,parameterValue=true

Choose Provisioned clusters dashboard. At the top of your console window, you will see a Try new Amazon Redshift features in preview banner. Choose Create preview cluster. For Preview track, choose preview_2023. For Node type, choose one of the supported node types (for this post, we use ra3.xlplus). Under Additional configurations, expand Database configurations. For Parameter groups, choose zero-etl-rms. For Encryption, select Use AWS Key Management Service. Choose Create cluster. The cluster should become Available in a few minutes.

Navigate to the namespace zero-etl-target-rs-ns and choose the Resource policy tab. Choose Add authorized principals. Enter either the Amazon Resource Name (ARN) of the AWS user or role, or the AWS account ID (IAM principals) that are allowed to create integrations. An account ID is stored as an ARN with root user. In the Authorized integration sources section, choose Add authorized integration source to add the ARN of the RDS for MySQL DB instance that’s the data source for the zero-ETL integration. You can find this value by going to the Amazon RDS console and navigating to the Configuration tab of the zero-etl-source-rms DB instance. Your resource policy should resemble the following screenshot.

Configure required permissions

To create a zero-ETL integration, your user or role must have an attached identity-based policy with the appropriate AWS Identity and Access Management (IAM) permissions. An AWS account owner can configure required permissions for users or roles who may create zero-ETL integrations. The sample policy allows the associated principal to perform the following actions: Create zero-ETL integrations for the source RDS for MySQL DB instance. View and delete all zero-ETL integrations. Create inbound integrations into the target data warehouse. This permission is not required if the same account owns the Redshift data warehouse and this account is an authorized principal for that data warehouse.

Also note that Amazon Redshift has a different ARN format for provisioned and serverless clusters:

• Provisioned – arn:aws:redshift:{region}:{account-id}:namespace:namespace-uuid
• Serverless – arn:aws:redshift-serverless:{region}:{account-id}:namespace/namespace-uuid

Complete the following steps to configure the permissions: On the IAM console, choose Policies in the navigation pane. Choose Create policy.
Create a new policy called rds-integrations using the following JSON (replace region and account-id with your actual values): { "Version": "2012-10-17", "Statement": [{ "Effect": "Allow", "Action": [ "rds:CreateIntegration" ], "Resource": [ "arn:aws:rds:{region}:{account-id}:db:source-instancename", "arn:aws:rds:{region}:{account-id}:integration:*" ] }, { "Effect": "Allow", "Action": [ "rds:DescribeIntegration" ], "Resource": ["*"] }, { "Effect": "Allow", "Action": [ "rds:DeleteIntegration" ], "Resource": [ "arn:aws:rds:{region}:{account-id}:integration:*" ] }, { "Effect": "Allow", "Action": [ "redshift:CreateInboundIntegration" ], "Resource": [ "arn:aws:redshift:{region}:{account-id}:cluster:namespace-uuid" ] }] } Attach the policy you created to your IAM user or role permissions. Create the zero-ETL integration To create the zero-ETL integration, complete the following steps: On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane. Choose Create zero-ETL integration. For Integration identifier, enter a name, for example zero-etl-demo. For Source database, choose Browse RDS databases and choose the source cluster zero-etl-source-rms. Choose Next. Under Target, for Amazon Redshift data warehouse, choose Browse Redshift data warehouses and choose the Redshift data warehouse (zero-etl-target-rs). Choose Next. Add tags and encryption, if applicable. Choose Next. Verify the integration name, source, target, and other settings. Choose Create zero-ETL integration. You can choose the integration to view the details and monitor its progress. It took about 30 minutes for the status to change from Creating to Active. The time will vary depending on the size of your dataset in the source. Create a database from the integration in Amazon Redshift To create your database from the zero-ETL integration, complete the following steps: On the Amazon Redshift console, choose Clusters in the navigation pane. Open the zero-etl-target-rs cluster. Choose Query data to open the query editor v2. Connect to the Redshift data warehouse by choosing Save. Obtain the integration_id from the svv_integration system table: select integration_id from svv_integration; -- copy this result, use in the next sql Use the integration_id from the previous step to create a new database from the integration: CREATE DATABASE zetl_source FROM INTEGRATION '<result from above>'; The integration is now complete, and an entire snapshot of the source will reflect as is in the destination. Ongoing changes will be synced in near real time. Analyze the near real time transactional data Now we can run analytics on TICKIT’s operational data. Populate the source TICKIT data To populate the source data, complete the following steps: Copy the CSV input data files into a local directory. The following is an example command: aws s3 cp 's3://redshift-blogs/zero-etl-integration/data/tickit' . --recursive Connect to your RDS for MySQL cluster and create a database or schema for the TICKIT data model, verify that the tables in that schema have a primary key, and initiate the load process: mysql -h <rds_db_instance_endpoint> -u admin -p password --local-infile=1 Use the following CREATE TABLE commands. Load the data from local files using the LOAD DATA command. The following is an example. Note that the input CSV file is broken into several files. This command must be run for every file if you would like to load all data. For demo purposes, a partial data load should work as well. 
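The exact CREATE TABLE and LOAD DATA statements are not reproduced here. As a hypothetical sketch of the pattern for a single TICKIT table, the following uses column names from the commonly published TICKIT sample schema; the database name, file name, and delimiter are placeholders. The key points are that every table needs a primary key and that each CSV file is loaded with its own LOAD DATA statement.

```sql
-- Hypothetical example for one TICKIT table; a primary key is required
-- for zero-ETL replication to Amazon Redshift.
CREATE DATABASE IF NOT EXISTS tickit_db;

CREATE TABLE IF NOT EXISTS tickit_db.venue (
    venueid    SMALLINT NOT NULL,
    venuename  VARCHAR(100),
    venuecity  VARCHAR(30),
    venuestate CHAR(2),
    venueseats INT,
    PRIMARY KEY (venueid)
);

-- Load one of the downloaded CSV files; repeat per file (placeholders below).
LOAD DATA LOCAL INFILE 'venue.csv'
INTO TABLE tickit_db.venue
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
```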
Analyze the source TICKIT data in the destination On the Amazon Redshift console, open the query editor v2 using the database you created as part of the integration setup. Use the following code to validate the seed or CDC activity: SELECT * FROM SYS_INTEGRATION_ACTIVITY ORDER BY last_commit_timestamp DESC; You can now apply your business logic for transformations directly on the data that has been replicated to the data warehouse. You can also use performance optimization techniques like creating a Redshift materialized view that joins the replicated tables and other local tables to improve query performance for your analytical queries. Monitoring You can query the following system views and tables in Amazon Redshift to get information about your zero-ETL integrations with Amazon Redshift: SVV_INTEGRATION – Provides configuration details for your integrations SYS_INTEGRATION_ACTIVITY– Provides information about completed integration runs SVV_INTEGRATION_TABLE_STATE – Describes the table-level integration information To view the integration-related metrics published to Amazon CloudWatch, open the Amazon Redshift console. Choose Zero-ETL integrations in the navigation pane and choose the integration to display activity metrics. Available metrics on the Amazon Redshift console are integration metrics and table statistics, with table statistics providing details of each table replicated from Amazon RDS for MySQL to Amazon Redshift. Integration metrics contain table replication success and failure counts and lag details. Manual resyncs The zero-ETL integration will automatically initiate a resync if a table sync state shows as failed or resync required. But in case the auto resync fails, you can initiate a resync at table-level granularity: ALTER DATABASE zetl_source INTEGRATION REFRESH TABLES tbl1, tbl2; A table can enter a failed state for multiple reasons: The primary key was removed from the table. In such cases, you need to re-add the primary key and perform the previously mentioned ALTER command. An invalid value is encountered during replication or a new column is added to the table with an unsupported data type. In such cases, you need to remove the column with the unsupported data type and perform the previously mentioned ALTER command. An internal error, in rare cases, can cause table failure. The ALTER command should fix it. Clean up When you delete a zero-ETL integration, your transactional data isn’t deleted from the source RDS or the target Redshift databases, but Amazon RDS doesn’t send any new changes to Amazon Redshift. To delete a zero-ETL integration, complete the following steps: On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane. Select the zero-ETL integration that you want to delete and choose Delete. To confirm the deletion, choose Delete. Conclusion In this post, we showed you how to set up a zero-ETL integration from Amazon RDS for MySQL to Amazon Redshift. This minimizes the need to maintain complex data pipelines and enables near real time analytics on transactional and operational data. To learn more about Amazon RDS zero-ETL integration with Amazon Redshift, refer to Working with Amazon RDS zero-ETL integrations with Amazon Redshift (preview). About the Authors Milind Oke is a senior Redshift specialist solutions architect who has worked at Amazon Web Services for three years. He is an AWS-certified SA Associate, Security Specialty and Analytics Specialty certification holder, based out of Queens, New York. 
Aditya Samant is a relational database industry veteran with over 2 decades of experience working with commercial and open-source databases. He currently works at Amazon Web Services as a Principal Database Specialist Solutions Architect. In his role, he spends time working with customers designing scalable, secure and robust cloud native architectures. Aditya works closely with the service teams and collaborates on designing and delivery of the new features for Amazon’s managed databases. View the full article
  4. Amazon Aurora MySQL zero-ETL integration with Amazon Redshift now supports data filtering, enabling you to include or exclude specific databases and tables as part of the zero-ETL integration. Based on your analytics needs, filtering of specific databases and tables helps you selectively bring data into Amazon Redshift. In addition, you can now easily manage and automate the configuration and deployment of resources needed for an Aurora MySQL zero-ETL integration with Amazon Redshift using AWS CloudFormation. View the full article
  5. As your organization becomes more data driven and uses data as a source of competitive advantage, you’ll want to run analytics on your data to better understand your core business drivers to grow sales, reduce costs, and optimize your business. To run analytics on your operational data, you might build a solution that is a combination of a database, a data warehouse, and an extract, transform, and load (ETL) pipeline. ETL is the process data engineers use to combine data from different sources. To reduce the effort involved in building and maintaining ETL pipelines between transactional databases and data warehouses, AWS announced Amazon Aurora zero-ETL integration with Amazon Redshift at AWS re:Invent 2022 and is now generally available (GA) for Amazon Aurora MySQL-Compatible Edition 3.05.0. AWS is now announcing data filtering on zero-ETL integrations, enabling you to bring in selective data from the database instance on zero-ETL integrations between Amazon Aurora MySQL and Amazon Redshift. This feature allows you to select individual databases and tables to be replicated to your Redshift data warehouse for analytics use cases. In this post, we provide an overview of use cases where you can use this feature, and provide step-by-step guidance on how to get started with near real time operational analytics using this feature. Data filtering use cases Data filtering allows you to choose the databases and tables to be replicated from Amazon Aurora MySQL to Amazon Redshift. You can apply multiple filters to the zero-ETL integration, allowing you to tailor the replication to your specific needs. Data filtering applies either an exclude or include filter rule, and can use regular expressions to match multiple databases and tables. In this section, we discuss some common use cases for data filtering. Improve data security by excluding tables containing PII data from replication Operational databases often contain personally identifiable information (PII). This is information that is sensitive in nature, and can include information such as mailing addresses, customer verification documentation, or credit card information. Due to strict security compliance regulations, you may not want to use PII for your analytics use cases. Data filtering allows you to filter out databases or tables containing PII data, excluding them from replication to Amazon Redshift. This improves data security and compliance with analytics workloads. Save on storage costs and manage analytics workloads by replicating tables required for specific use cases Operational databases often contain many different datasets that aren’t useful for analytics. This includes supplementary data, specific application data, and multiple copies of the same dataset for different applications. Moreover, it’s common to build different use cases on different Redshift warehouses. This architecture requires different datasets to be available in individual endpoints. Data filtering allows you to only replicate the datasets that are required for your use cases. This can save costs by eliminating the need to store data that is not being used. You can also modify existing zero-ETL integrations to apply more restrictive data replication where desired. If you add a data filter to an existing integration, Aurora will fully reevaluate the data being replicated with the new filter. This will remove the newly filtered data from the target Redshift endpoint. 
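One way to observe the effect of such a reevaluation is to inspect the per-table replication state from the target data warehouse. The following sketch uses the SVV_INTEGRATION and SVV_INTEGRATION_TABLE_STATE system views mentioned in the related posts in this thread; after a filter change finishes applying, tables that no longer match the filter should stop appearing as replicated (the exact columns and state values may vary).

```sql
-- Identify the integration and its ID as seen from Amazon Redshift.
SELECT * FROM svv_integration;

-- Inspect table-level replication state; newly excluded tables should
-- no longer show up as replicated once the filter change is Active.
SELECT * FROM svv_integration_table_state;
```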
For more information about quotas for Aurora zero-ETL integrations with Amazon Redshift, refer to Quotas. Start with small data replication and incrementally add tables as required As more analytics use cases are developed on Amazon Redshift, you may want to add more tables to an individual zero-ETL replication. Rather than replicating all tables to Amazon Redshift to satisfy the chance that they may be used in the future, data filtering allows you to start small with a subset of tables from your Aurora database and incrementally add more tables to the filter as they’re required. After a data filter on a zero-ETL integration is updated, Aurora will fully reevaluate the entire filter as if the previous filter didn’t exist, so workloads using previously replicated tables aren’t impacted in the addition of new tables. Improve individual workload performance by load balancing replication processes For large transactional databases, you may need to load balance the replication and any downstream processing to multiple Redshift clusters to allow for reduction of compute requirements for an individual Redshift endpoint and the ability to split workloads onto multiple endpoints. By load balancing workloads across multiple Redshift endpoints, you can effectively create a data mesh architecture, where endpoints are appropriately sized for individual workloads. This can improve performance and lower overall cost. Data filtering allows you to replicate different databases and tables to separate Redshift endpoints. The following figure shows how you could use data filters on zero-ETL integrations to split different databases in Aurora to separate Redshift endpoints. Example use case Consider the TICKIT database. The TICKIT sample database contains data from a fictional company where users can buy and sell tickets for various events. The company’s business analysts want to use the data that is stored in their Aurora MySQL database to generate various metrics, and would like to perform this analysis in near real time. For this reason, the company has identified zero-ETL as a potential solution. Throughout their investigation of the datasets required, the company’s analysts noted that the users table contains personal information about their customer user information that is not useful for their analytics requirements. Therefore, they want to replicate all data except the users table and will use zero-ETL’s data filtering to do so. Setup Start by following the steps in Getting started guide for near-real time operational analytics using Amazon Aurora zero-ETL integration with Amazon Redshift to create a new Aurora MySQL database, Amazon Redshift Serverless endpoint, and zero-ETL integration. Then open the Redshift query editor v2 and run the following query to show that data from the users table has been replicated successfully: select * from aurora_zeroetl.demodb.users; Data filters Data filters are applied directly to the zero-ETL integration on Amazon Relational Database Service (Amazon RDS). You can define multiple filters for a single integration, and each filter is defined as either an Include or Exclude filter type. Data filters apply a pattern to existing and future database tables to determine which filter should be applied. Apply a data filter To apply a filter to remove the users table from the zero-ETL integration, complete the following steps: On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane. Choose the zero-ETL integration to add a filter to. 
The default filter is to include all databases and tables, represented by an include:*.* filter. Choose Modify. Choose Add filter in the Source section. For Choose filter type, choose Exclude. For Filter expression, enter the expression demodb.users.

Filter expression order matters. Filters are evaluated left to right, top to bottom, and subsequent filters will override previous filters. In this example, Aurora will evaluate that every table should be included (filter 1) and then evaluate that the demodb.users table should be excluded (filter 2). The exclusion filter therefore overrides the inclusion because it’s after the inclusion filter.

Choose Continue. Review the changes, making sure that the order of the filters is correct, and choose Save changes. The filter will be added and the integration will be in a Modifying state until the changes have been applied. This can take up to 30 minutes. To check if the changes have finished applying, choose the zero-ETL integration and check its status. When it shows as Active, the changes have been applied.

Verify the change

To verify the zero-ETL integration has been updated, complete the following steps: In the Redshift query editor v2, connect to your Redshift cluster. Choose (right-click) the aurora-zeroetl database you created and choose Refresh. Expand demodb and Tables. The users table is no longer available because it has been removed from the replication. All other tables are still available. If you run the same SELECT statement from earlier, you will receive an error stating the object does not exist in the database:

select * from aurora_zeroetl.demodb.users;

Apply a data filter using the AWS CLI

The company’s business analysts now understand that more databases are being added to the Aurora MySQL database and they want to ensure only the demodb database is replicated to their Redshift cluster. To this end, they want to update the filters on the zero-ETL integration with the AWS Command Line Interface (AWS CLI). To add data filters to a zero-ETL integration using the AWS CLI, you can call the modify-integration command. In addition to the integration identifier, specify the --data-filter parameter with a comma-separated list of include and exclude filters.

Complete the following steps to alter the filter on the zero-ETL integration: Open a terminal with the AWS CLI installed. Enter the following command to list all available integrations:

aws rds describe-integrations

Find the integration you want to update and copy the integration identifier. The integration identifier is an alphanumeric string at the end of the integration ARN. Run the following command, updating <integration identifier> with the identifier copied from the previous step:

aws rds modify-integration --integration-identifier "<integration identifier>" --data-filter 'exclude: *.*, include: demodb.*, exclude: demodb.users'

When Aurora is assessing this filter, it will exclude everything by default, then only include the demodb database, but exclude the demodb.users table. Data filters can use regular expressions for the database and table names. For example, if you want to filter out any tables starting with user, you can run the following:

aws rds modify-integration --integration-identifier "<integration identifier>" --data-filter 'exclude: *.*, include: demodb.*, exclude: *./^user/'

As with the previous filter change, the integration will be in a Modifying state until the changes have been applied. This can take up to 30 minutes.
When it shows as Active, the changes have been applied. Clean up To remove the filter added to the zero-ETL integration, complete the following steps: On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane. Choose your zero-ETL integration. Choose Modify. Choose Remove next to the filters you want to remove. You can also change the Exclude filter type to Include. Alternatively, you can use the AWS CLI to run the following: aws rds modify-integration --integration-identifier "<integration identifier>" --data-filter 'include: *.*' Choose Continue. Choose Save changes. The data filter will take up to 30 minutes to apply the changes. After you remove data filters, Aurora reevaluates the remaining filters as if the removed filter had never existed. Any data that previously didn’t match the filtering criteria but now does is replicated into the target Redshift data warehouse. Conclusion In this post, we showed you how to set up data filtering on your Aurora zero-ETL integration from Amazon Aurora MySQL to Amazon Redshift. This allows you to enable near real time analytics on transactional and operational data while replicating only the data required. With data filtering, you can split workloads into separate Redshift endpoints, limit the replication of private or confidential datasets, and increase performance of workloads by only replicating required datasets. To learn more about Aurora zero-ETL integration with Amazon Redshift, see Working with Aurora zero-ETL integrations with Amazon Redshift and Working with zero-ETL integrations. About the authors Jyoti Aggarwal is a Product Management Lead for AWS zero-ETL. She leads the product and business strategy, including driving initiatives around performance, customer experience, and security. She brings along an expertise in cloud compute, data pipelines, analytics, artificial intelligence (AI), and data services including databases, data warehouses and data lakes. Sean Beath is an Analytics Solutions Architect at Amazon Web Services. He has experience in the full delivery lifecycle of data platform modernisation using AWS services, and works with customers to help drive analytics value on AWS. Gokul Soundararajan is a principal engineer at AWS and received a PhD from University of Toronto and has been working in the areas of storage, databases, and analytics. View the full article