Showing results for tags 'amazon emr'.

Found 12 results

1. To enable workforce users to run analytics with fine-grained data access controls, and to audit that access, you might have to create multiple AWS Identity and Access Management (IAM) roles with different data permissions and map each workforce user to one of those roles. Multiple users are often mapped to the same role when they need similar privileges, so that data access can be controlled and audited at the corporate user or group level. AWS IAM Identity Center enables centralized management of workforce user access to AWS accounts and applications, using either a local identity store or corporate directories connected via identity providers (IdPs). IAM Identity Center now supports trusted identity propagation, a streamlined experience for users who require access to data with AWS analytics services.

Amazon EMR Studio is an integrated development environment (IDE) that makes it straightforward for data scientists and data engineers to build data engineering and data science applications. With trusted identity propagation, data access management can be based on a user's corporate identity, which is propagated seamlessly as they access data with single sign-on to build analytics applications with Amazon EMR (EMR Studio and Amazon EMR on EC2). AWS Lake Formation allows data administrators to centrally govern, secure, and share data for analytics and machine learning (ML). With trusted identity propagation, data administrators can grant granular access directly to corporate users based on their identity attributes and simplify end-to-end traceability of data access across AWS services. Because access is managed based on a user's corporate identity, users don't need database-local credentials and don't need to assume an IAM role to access data.

In this post, we show how to bring your workforce identity to EMR Studio for analytics use cases, directly manage fine-grained permissions for corporate users and groups using Lake Formation, and audit their data access.

Solution overview

For our use case, we want to enable a data analyst named analyst1 to use their own enterprise credentials to query data they have been granted permissions to, and to audit that data access. We use Okta as the IdP for this demonstration. The following diagram illustrates the solution architecture.

This architecture is based on the following components:

- Okta maintains the corporate user identities, related groups, and user authentication.
- IAM Identity Center connects Okta users and centrally manages their access across AWS accounts and applications.
- Lake Formation provides fine-grained access controls on data directly to corporate users using trusted identity propagation.
- EMR Studio is an IDE for users to build and run applications. It allows users to log in directly with their corporate credentials without signing in to the AWS Management Console.
- AWS Service Catalog provides a product template to create EMR clusters.
- The EMR cluster is integrated with IAM Identity Center using a security configuration.
- AWS CloudTrail captures user data access activities.

The following are the high-level steps to implement the solution:

1. Integrate Okta with IAM Identity Center.
2. Set up Amazon EMR Studio.
3. Create an IAM Identity Center enabled security configuration for EMR clusters.
4. Create a Service Catalog product template to create the EMR clusters.
5. Use Lake Formation to grant permissions to users to access data.
6. Test the solution by accessing data with a corporate identity.
7. Audit user data access.
Prerequisites

You should have the following prerequisites:

- An AWS account with access to the following AWS services: AWS CloudFormation, CloudTrail, Amazon Elastic Compute Cloud (Amazon EC2), Amazon Simple Storage Service (Amazon S3), EMR Studio, IAM, IAM Identity Center, Lake Formation, and Service Catalog
- An Okta account (you can create a free developer account)

Integrate Okta with IAM Identity Center

For more information about configuring Okta with IAM Identity Center, refer to Configure SAML and SCIM with Okta and IAM Identity Center. For this setup, we have created two users, analyst1 and engineer1, and assigned them to the corresponding Okta application. You can validate that the integration is working by navigating to the Users page on the IAM Identity Center console, as shown in the following screenshot. Both enterprise users from Okta are provisioned in IAM Identity Center. These exact users will not be listed in your account; you can either create similar users or use existing ones.

Each provisioned user in IAM Identity Center has a unique user ID. This ID does not originate from Okta; it's created in IAM Identity Center to uniquely identify the user. With trusted identity propagation, this user ID is propagated across services and is also used for traceability purposes in CloudTrail. The following screenshot shows the IAM Identity Center user matching the provisioned Okta user analyst1.

Choose the link under AWS access portal URL and log in with the analyst1 Okta user credentials that are already assigned to this application. If you are able to log in and see the landing page, all your configurations up to this step are set correctly. You will not see any applications on this page yet.

Set up EMR Studio

In this step, we demonstrate the actions needed from the data lake administrator to set up EMR Studio enabled for trusted identity propagation and IAM Identity Center integration. This allows users to access EMR Studio directly with their enterprise credentials.

Note: All Amazon S3 buckets created after January 5, 2023 have encryption configured by default (Amazon S3 managed keys, SSE-S3), and all new objects uploaded to an S3 bucket are automatically encrypted at rest. To use a different type of encryption that meets your security needs, update the default encryption configuration for the bucket. See Protecting data with server-side encryption for further details.

1. On the Amazon EMR console, choose Studios in the navigation pane under EMR Studio.
2. Choose Create Studio.
3. For Setup options, select Custom.
4. For Studio name, enter a name (for this post, emr-studio-with-tip).
5. For S3 location for Workspace storage, select Select existing location and enter an existing S3 bucket (if you have one). Otherwise, select Create new bucket.
6. For Service role to let Studio access your AWS resources, choose View permissions details to get the trust and IAM policy information that is needed, and create a role with those specific policies in IAM. In this case, we create a new role called emr_tip_role.
7. For Service role to let Studio access your AWS resources, choose the IAM role you created.
8. For Workspace name, enter a name (for this post, studio-workspace-with-tip).
9. For Authentication, select IAM Identity Center.
10. For User role, you can create a new role or choose an existing role. For this post, we choose the role we created (emr_tip_role).
To use the same role, add the following statement to the trust policy of the service role:

{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "elasticmapreduce.amazonaws.com",
        "AWS": "arn:aws:iam::xxxxxx:role/emr_tip_role"
      },
      "Action": ["sts:AssumeRole", "sts:SetContext"]
    }
  ]
}

11. Select Enable trusted identity propagation to allow you to control and log user access across connected applications.
12. For Choose who can access your application, select All users and groups. Later, we restrict access to resources using Lake Formation. However, there is an option here to restrict access to only assigned users and groups.
13. In the Networking and security section, you can provide optional details for your VPC, subnets, and security group settings.
14. Choose Create Studio.
15. On the Studios page of the Amazon EMR console, locate your Studio enabled with IAM Identity Center and copy the link for Studio Access URL.
16. Enter the URL into a web browser and log in using Okta credentials.

You should be able to successfully sign in to the EMR Studio console.

Create an IAM Identity Center enabled security configuration for EMR clusters

EMR security configurations allow you to configure data encryption, Kerberos authentication, and Amazon S3 authorization for the EMR File System (EMRFS) on your clusters. A security configuration can be used and reused across clusters. To integrate Amazon EMR with IAM Identity Center, you first need an IAM role that authenticates with IAM Identity Center from the EMR cluster. Amazon EMR uses IAM credentials to relay the IAM Identity Center identity to downstream services such as Lake Formation, so the IAM role must also have the permissions to invoke those downstream services.

1. Create a role (for this post, called emr-idc-application) with the following trust and permission policies. The role referenced in the trust policy is the instance profile role for EMR clusters. This allows the EC2 instance profile to assume this role and act as an identity broker on behalf of the federated users.

Trust policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AssumeRole",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::xxxxxxxxxxxx:role/service-role/AmazonEMR-InstanceProfile-20240127T102444"
      },
      "Action": ["sts:AssumeRole", "sts:SetContext"]
    }
  ]
}

Permission policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "IdCPermissions",
      "Effect": "Allow",
      "Action": ["sso-oauth:*"],
      "Resource": "*"
    },
    {
      "Sid": "GlueandLakePermissions",
      "Effect": "Allow",
      "Action": ["glue:*", "lakeformation:GetDataAccess"],
      "Resource": "*"
    },
    {
      "Sid": "S3Permissions",
      "Effect": "Allow",
      "Action": ["s3:GetDataAccess", "s3:GetAccessGrantsInstanceForPrefix"],
      "Resource": "*"
    }
  ]
}

2. Next, create certificates for encrypting data in transit with Amazon EMR. For this post, we use OpenSSL to generate a self-signed X.509 certificate with a 2048-bit RSA private key. The certificate's common name (*.us-west-2.compute.internal) allows access to the issuer's EMR cluster instances in the us-west-2 Region; change this to the Region your cluster is in. For a complete guide on creating and providing a certificate, refer to Providing certificates for encrypting data in transit with Amazon EMR encryption.
$ openssl req -x509 -newkey rsa:2048 -keyout privateKey.pem -out certificateChain.pem -days 365 -nodes -subj '/CN=*.us-west-2.compute.internal'
$ cp certificateChain.pem trustedCertificates.pem
$ zip -r -X my-certs.zip certificateChain.pem privateKey.pem trustedCertificates.pem

3. Upload my-certs.zip to an S3 location that will be used to create the security configuration. The EMR service role should have access to the S3 location.
4. Create an EMR security configuration with IAM Identity Center enabled from the AWS Command Line Interface (AWS CLI) with the following code:

aws emr create-security-configuration \
  --name "IdentityCenterConfiguration-with-lf-tip" \
  --region "us-west-2" \
  --endpoint-url https://elasticmapreduce.us-west-2.amazonaws.com \
  --security-configuration '{
    "AuthenticationConfiguration": {
      "IdentityCenterConfiguration": {
        "EnableIdentityCenter": true,
        "IdentityCenterApplicationAssigmentRequired": false,
        "IdentityCenterInstanceARN": "arn:aws:sso:::instance/ssoins-7907b0d7d77e3e0d",
        "IAMRoleForEMRIdentityCenterApplicationARN": "arn:aws:iam::1xxxxxxxxx0:role/emr-idc-application"
      }
    },
    "AuthorizationConfiguration": {
      "LakeFormationConfiguration": {
        "EnableLakeFormation": true
      }
    },
    "EncryptionConfiguration": {
      "EnableInTransitEncryption": true,
      "EnableAtRestEncryption": false,
      "InTransitEncryptionConfiguration": {
        "TLSCertificateConfiguration": {
          "CertificateProviderType": "PEM",
          "S3Object": "s3://<<Bucket Name>>/emr-transit-encry-certs/my-certs.zip"
        }
      }
    }
  }'

You can view the security configuration on the Amazon EMR console.

Create a Service Catalog product template to create EMR clusters

EMR Studio with trusted identity propagation enabled can only work with clusters created from a template. Complete the following steps to create a product template in Service Catalog:

1. On the Service Catalog console, choose Portfolios under Administration in the navigation pane.
2. Choose Create portfolio.
3. Enter a name for your portfolio (for this post, EMR Clusters Template) and an optional description.
4. Choose Create.
5. On the Portfolios page, choose the portfolio you just created to view its details.
6. On the Products tab, choose Create product.
7. For Product type, select CloudFormation.
8. For Product name, enter a name (for this post, EMR-7.0.0).
9. Use the security configuration IdentityCenterConfiguration-with-lf-tip you created in the previous steps with the appropriate Amazon EMR service roles.
10. Choose Create product.

The following is an example CloudFormation template. Update the account-specific values for SecurityConfiguration, JobFlowRole, ServiceRole, LogUri, Ec2KeyName, and Ec2SubnetId. We provide a sample Amazon EMR service role and trust policy in Appendix A at the end of this post.
Parameters:
  ClusterName:
    Type: String
    Default: EMR_TIP_Cluster
  EmrRelease:
    Type: String
    Default: emr-7.0.0
    AllowedValues:
      - emr-7.0.0
  ClusterInstanceType:
    Type: String
    Default: m5.xlarge
    AllowedValues:
      - m5.xlarge
      - m5.2xlarge
Resources:
  EmrCluster:
    Type: AWS::EMR::Cluster
    Properties:
      Applications:
        - Name: Spark
        - Name: Livy
        - Name: Hadoop
        - Name: JupyterEnterpriseGateway
      SecurityConfiguration: IdentityCenterConfiguration-with-lf-tip
      EbsRootVolumeSize: '20'
      Name: !Ref ClusterName
      JobFlowRole: <Instance Profile Role>
      ServiceRole: <EMR Service Role>
      ReleaseLabel: !Ref EmrRelease
      VisibleToAllUsers: true
      LogUri:
        Fn::Sub: <S3 LOG Path>
      Instances:
        Ec2KeyName: <Key Pair Name>
        TerminationProtected: false
        Ec2SubnetId: <subnet-id>
        MasterInstanceGroup:
          InstanceCount: 1
          InstanceType: !Ref ClusterInstanceType
        CoreInstanceGroup:
          InstanceCount: 2
          InstanceType: !Ref ClusterInstanceType
          Market: ON_DEMAND
          Name: Core
Outputs:
  ClusterId:
    Value: !Ref EmrCluster
    Description: The ID of the EMR cluster

Trusted identity propagation is supported from Amazon EMR 6.15 onwards. For Amazon EMR 6.15, add the following bootstrap action to the CloudFormation template:

BootstrapActions:
  - Name: spark-config
    ScriptBootstrapAction:
      Path: s3://emr-data-access-control-<aws-region>/customer-bootstrap-actions/idc-fix/replace-puppet.sh

The portfolio now has the EMR cluster creation product added. Grant the EMR Studio role emr_tip_role access to the portfolio.

Grant Lake Formation permissions to users to access data

In this step, we enable Lake Formation integration with IAM Identity Center and grant permissions to the Identity Center user analyst1. If Lake Formation is not already enabled, refer to Getting started with Lake Formation. To use Lake Formation with Amazon EMR, create a custom role to register S3 locations; you need a new custom role with Amazon S3 access, not the default role AWSServiceRoleForLakeFormationDataAccess. Additionally, enable external data filtering in Lake Formation. For more details, refer to Enable Lake Formation with Amazon EMR.

Complete the following steps to manage access permissions in Lake Formation:

1. On the Lake Formation console, choose IAM Identity Center integration under Administration in the navigation pane. Lake Formation will automatically specify the correct IAM Identity Center instance.
2. Choose Create.

You can now view the IAM Identity Center integration details.

For this post, we have a Marketing database and a customer table on which we grant access to our enterprise user analyst1. You can use an existing database and table in your account or create new ones. For more examples, refer to Tutorials. The following screenshot shows the details of our customer table.

Complete the following steps to grant analyst1 permissions (an equivalent AWS CLI sketch follows these steps). For more information, refer to Granting table permissions using the named resource method.

1. On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane.
2. Choose Grant.
3. Select Named Data Catalog resources.
4. For Databases, choose your database (marketing).
5. For Tables, choose your table (customer).
6. For Table permissions, select Select and Describe.
7. For Data permissions, select All data access.
8. Choose Grant.
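If you prefer to script the grant, the same permissions can be issued with the AWS CLI. The following is a minimal sketch, not part of the original walkthrough; the DataLakePrincipalIdentifier format shown for an IAM Identity Center user (an identitystore user ARN with a placeholder user ID) is an assumption to verify against the Lake Formation documentation for your setup:

aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:identitystore:::user/<identity-center-user-id> \
  --resource '{"Table": {"DatabaseName": "marketing", "Name": "customer"}}' \
  --permissions "SELECT" "DESCRIBE"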
The following screenshot shows a summary of the permissions that user analyst1 has: Select access on the table and Describe permissions on the database.

Test the solution

To test the solution, we log in to EMR Studio as enterprise user analyst1, create a new Workspace, create an EMR cluster using a template, and use that cluster to perform an analysis. You could also use the Workspace that was created during the Studio setup; in this demonstration, we create a new Workspace. You need additional permissions in the EMR Studio role to create and list Workspaces, use a template, and create EMR clusters. For more details, refer to Configure EMR Studio user permissions for Amazon EC2 or Amazon EKS. Appendix B at the end of this post contains a sample policy.

When the cluster is available, we attach the cluster to the Workspace and run queries on the customer table, which the user has access to. User analyst1 is now able to run queries for business use cases using their corporate identity. To open a PySpark notebook, we choose PySpark under Notebook. When the notebook is open, we run a Spark SQL query to list the databases:

%%sql
show databases

In this case, we query the customer table in the marketing database. We should be able to access the data:

%%sql
select * from marketing.customer

Audit data access

Lake Formation API actions are logged by CloudTrail. The GetDataAccess action is logged whenever a principal or integrated AWS service requests temporary credentials to access data in a data lake location that is registered with Lake Formation. With trusted identity propagation, CloudTrail also logs the IAM Identity Center user ID of the corporate identity who requested access to the data. The following screenshot shows the details for the analyst1 user. Choose View event to view the event logs.

The following is an example of the GetDataAccess event log. We can trace that user analyst1, with Identity Center user ID c8c11390-00a1-706e-0c7a-bbcc5a1c9a7f, has accessed the customer table:

{
  "eventVersion": "1.09",
  ....
  "onBehalfOf": {
    "userId": "c8c11390-00a1-706e-0c7a-bbcc5a1c9a7f",
    "identityStoreArn": "arn:aws:identitystore::xxxxxxxxx:identitystore/d-XXXXXXXX"
  },
  "eventTime": "2024-01-28T17:56:25Z",
  "eventSource": "lakeformation.amazonaws.com",
  "eventName": "GetDataAccess",
  "awsRegion": "us-west-2",
  ....
  "requestParameters": {
    "tableArn": "arn:aws:glue:us-west-2:xxxxxxxxxx:table/marketing/customer",
    "supportedPermissionTypes": ["TABLE_PERMISSION"]
  },
  ....
}

Here is an end-to-end demonstration video of the steps to follow for enabling trusted identity propagation in your Amazon EMR analytics flow.
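As a quick alternative to browsing the console, recent GetDataAccess events can also be pulled with the AWS CLI; a minimal sketch (the Region and result count are illustrative):

aws cloudtrail lookup-events \
  --region us-west-2 \
  --lookup-attributes AttributeKey=EventName,AttributeValue=GetDataAccess \
  --max-results 10 \
  --query 'Events[].CloudTrailEvent'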
Clean up

Clean up the following resources when you're done using this solution:

1. Delete the CloudFormation stacks created in each account to delete the EMR cluster.
2. Delete the EMR Studio Workspaces and environment.
3. Delete the Service Catalog product and portfolio.
4. Delete the Okta users.
5. Revoke Lake Formation access from the users.

Conclusion

In this post, we demonstrated how to set up and use trusted identity propagation using IAM Identity Center, EMR Studio, and Lake Formation for analytics. With trusted identity propagation, a user's corporate identity is seamlessly propagated as they access data using single sign-on across AWS analytics services to build analytics applications. Data administrators can provide fine-grained data access directly to corporate users and groups and audit usage. To learn more, see Integrate Amazon EMR with AWS IAM Identity Center.

About the Authors

Pradeep Misra is a Principal Analytics Solutions Architect at AWS. He works across Amazon to architect and design modern distributed analytics and AI/ML platform solutions. He is passionate about solving customer challenges using data, analytics, and AI/ML. Outside of work, Pradeep likes exploring new places, trying new cuisines, and playing board games with his family. He also likes doing science experiments with his daughters.

Deepmala Agarwal works as an AWS Data Specialist Solutions Architect. She is passionate about helping customers build scalable, distributed, and data-driven solutions on AWS. When not at work, Deepmala likes spending time with family, walking, listening to music, watching movies, and cooking!

Abhilash Nagilla is a Senior Specialist Solutions Architect at Amazon Web Services (AWS), helping public sector customers on their cloud journey with a focus on AWS analytics services. Outside of work, Abhilash enjoys learning new technologies, watching movies, and visiting new places.

Appendix A

Sample Amazon EMR service role and trust policy.

Note: This is a sample service role. Fine-grained access control is done using Lake Formation. Modify the permissions per your enterprise guidance and to comply with your security team's requirements.

Trust policy:

{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "elasticmapreduce.amazonaws.com",
        "AWS": "arn:aws:iam::xxxxxx:role/emr_tip_role"
      },
      "Action": ["sts:AssumeRole", "sts:SetContext"]
    }
  ]
}

Permission policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ResourcesToLaunchEC2",
      "Effect": "Allow",
      "Action": ["ec2:RunInstances", "ec2:CreateFleet", "ec2:CreateLaunchTemplate", "ec2:CreateLaunchTemplateVersion"],
      "Resource": [
        "arn:aws:ec2:*:*:network-interface/*",
        "arn:aws:ec2:*::image/ami-*",
        "arn:aws:ec2:*:*:key-pair/*",
        "arn:aws:ec2:*:*:capacity-reservation/*",
        "arn:aws:ec2:*:*:placement-group/pg-*",
        "arn:aws:ec2:*:*:fleet/*",
        "arn:aws:ec2:*:*:dedicated-host/*",
        "arn:aws:resource-groups:*:*:group/*"
      ]
    },
    {
      "Sid": "TagOnCreateTaggedEMRResources",
      "Effect": "Allow",
      "Action": ["ec2:CreateTags"],
      "Resource": [
        "arn:aws:ec2:*:*:network-interface/*",
        "arn:aws:ec2:*:*:instance/*",
        "arn:aws:ec2:*:*:volume/*",
        "arn:aws:ec2:*:*:launch-template/*"
      ],
      "Condition": {
        "StringEquals": {
          "ec2:CreateAction": ["RunInstances", "CreateFleet", "CreateLaunchTemplate", "CreateNetworkInterface"]
        }
      }
    },
    {
      "Sid": "ListActionsForEC2Resources",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeAccountAttributes", "ec2:DescribeCapacityReservations", "ec2:DescribeDhcpOptions",
        "ec2:DescribeImages", "ec2:DescribeInstances", "ec2:DescribeLaunchTemplates",
        "ec2:DescribeNetworkAcls", "ec2:DescribeNetworkInterfaces", "ec2:DescribePlacementGroups",
        "ec2:DescribeRouteTables", "ec2:DescribeSecurityGroups", "ec2:DescribeSubnets",
        "ec2:DescribeVolumes", "ec2:DescribeVolumeStatus", "ec2:DescribeVpcAttribute",
        "ec2:DescribeVpcEndpoints", "ec2:DescribeVpcs"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AutoScaling",
      "Effect": "Allow",
      "Action": [
        "application-autoscaling:DeleteScalingPolicy", "application-autoscaling:DeregisterScalableTarget",
        "application-autoscaling:DescribeScalableTargets", "application-autoscaling:DescribeScalingPolicies",
        "application-autoscaling:PutScalingPolicy", "application-autoscaling:RegisterScalableTarget"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AutoScalingCloudWatch",
      "Effect": "Allow",
      "Action": ["cloudwatch:PutMetricAlarm", "cloudwatch:DeleteAlarms", "cloudwatch:DescribeAlarms"],
      "Resource": "arn:aws:cloudwatch:*:*:alarm:*_EMR_Auto_Scaling"
    },
    {
      "Sid": "PassRoleForAutoScaling",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::*:role/EMR_AutoScaling_DefaultRole",
      "Condition": {
        "StringLike": {"iam:PassedToService": "application-autoscaling.amazonaws.com*"}
      }
    },
    {
      "Sid": "PassRoleForEC2",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::xxxxxxxxxxx:role/service-role/<Instance-Profile-Role>",
      "Condition": {
        "StringLike": {"iam:PassedToService": "ec2.amazonaws.com*"}
      }
    },
    {
      "Effect": "Allow",
      "Action": ["s3:*", "s3-object-lambda:*"],
      "Resource": ["arn:aws:s3:::<bucket>/*", "arn:aws:s3:::*logs*/*"]
    },
    {
      "Effect": "Allow",
      "Resource": "*",
      "Action": [
        "ec2:AuthorizeSecurityGroupEgress", "ec2:AuthorizeSecurityGroupIngress", "ec2:CancelSpotInstanceRequests",
        "ec2:CreateFleet", "ec2:CreateLaunchTemplate", "ec2:CreateNetworkInterface", "ec2:CreateSecurityGroup",
        "ec2:CreateTags", "ec2:DeleteLaunchTemplate", "ec2:DeleteNetworkInterface", "ec2:DeleteSecurityGroup",
        "ec2:DeleteTags", "ec2:DescribeAvailabilityZones", "ec2:DescribeAccountAttributes", "ec2:DescribeDhcpOptions",
        "ec2:DescribeImages", "ec2:DescribeInstanceStatus", "ec2:DescribeInstances", "ec2:DescribeKeyPairs",
        "ec2:DescribeLaunchTemplates", "ec2:DescribeNetworkAcls", "ec2:DescribeNetworkInterfaces",
        "ec2:DescribePrefixLists", "ec2:DescribeRouteTables", "ec2:DescribeSecurityGroups",
        "ec2:DescribeSpotInstanceRequests", "ec2:DescribeSpotPriceHistory", "ec2:DescribeSubnets",
        "ec2:DescribeTags", "ec2:DescribeVpcAttribute", "ec2:DescribeVpcEndpoints", "ec2:DescribeVpcEndpointServices",
        "ec2:DescribeVpcs", "ec2:DetachNetworkInterface", "ec2:ModifyImageAttribute", "ec2:ModifyInstanceAttribute",
        "ec2:RequestSpotInstances", "ec2:RevokeSecurityGroupEgress", "ec2:RunInstances", "ec2:TerminateInstances",
        "ec2:DeleteVolume", "ec2:DescribeVolumeStatus", "ec2:DescribeVolumes", "ec2:DetachVolume",
        "iam:GetRole", "iam:GetRolePolicy", "iam:ListInstanceProfiles", "iam:ListRolePolicies",
        "cloudwatch:PutMetricAlarm", "cloudwatch:DescribeAlarms", "cloudwatch:DeleteAlarms",
        "application-autoscaling:RegisterScalableTarget", "application-autoscaling:DeregisterScalableTarget",
        "application-autoscaling:PutScalingPolicy", "application-autoscaling:DeleteScalingPolicy",
        "application-autoscaling:Describe*"
      ]
    }
  ]
}

Appendix B

Sample EMR Studio role policy.

Note: This is a sample policy. Fine-grained access control is done using Lake Formation. Modify the permissions per your enterprise guidance and to comply with your security team's requirements.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowEMRReadOnlyActions",
      "Effect": "Allow",
      "Action": ["elasticmapreduce:ListInstances", "elasticmapreduce:DescribeCluster", "elasticmapreduce:ListSteps"],
      "Resource": "*"
    },
    {
      "Sid": "AllowEC2ENIActionsWithEMRTags",
      "Effect": "Allow",
      "Action": ["ec2:CreateNetworkInterfacePermission", "ec2:DeleteNetworkInterface"],
      "Resource": ["arn:aws:ec2:*:*:network-interface/*"],
      "Condition": {"StringEquals": {"aws:ResourceTag/for-use-with-amazon-emr-managed-policies": "true"}}
    },
    {
      "Sid": "AllowEC2ENIAttributeAction",
      "Effect": "Allow",
      "Action": ["ec2:ModifyNetworkInterfaceAttribute"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:network-interface/*", "arn:aws:ec2:*:*:security-group/*"]
    },
    {
      "Sid": "AllowEC2SecurityGroupActionsWithEMRTags",
      "Effect": "Allow",
      "Action": [
        "ec2:AuthorizeSecurityGroupEgress", "ec2:AuthorizeSecurityGroupIngress",
        "ec2:RevokeSecurityGroupEgress", "ec2:RevokeSecurityGroupIngress",
        "ec2:DeleteNetworkInterfacePermission"
      ],
      "Resource": "*",
      "Condition": {"StringEquals": {"aws:ResourceTag/for-use-with-amazon-emr-managed-policies": "true"}}
    },
    {
      "Sid": "AllowDefaultEC2SecurityGroupsCreationWithEMRTags",
      "Effect": "Allow",
      "Action": ["ec2:CreateSecurityGroup"],
      "Resource": ["arn:aws:ec2:*:*:security-group/*"],
      "Condition": {"StringEquals": {"aws:RequestTag/for-use-with-amazon-emr-managed-policies": "true"}}
    },
    {
      "Sid": "AllowDefaultEC2SecurityGroupsCreationInVPCWithEMRTags",
      "Effect": "Allow",
      "Action": ["ec2:CreateSecurityGroup"],
      "Resource": ["arn:aws:ec2:*:*:vpc/*"],
      "Condition": {"StringEquals": {"aws:ResourceTag/for-use-with-amazon-emr-managed-policies": "true"}}
    },
    {
      "Sid": "AllowAddingEMRTagsDuringDefaultSecurityGroupCreation",
      "Effect": "Allow",
      "Action": ["ec2:CreateTags"],
      "Resource": "arn:aws:ec2:*:*:security-group/*",
      "Condition": {
        "StringEquals": {
          "aws:RequestTag/for-use-with-amazon-emr-managed-policies": "true",
          "ec2:CreateAction": "CreateSecurityGroup"
        }
      }
    },
    {
      "Sid": "AllowEC2ENICreationWithEMRTags",
      "Effect": "Allow",
      "Action": ["ec2:CreateNetworkInterface"],
      "Resource": ["arn:aws:ec2:*:*:network-interface/*"],
      "Condition": {"StringEquals": {"aws:RequestTag/for-use-with-amazon-emr-managed-policies": "true"}}
    },
    {
      "Sid": "AllowEC2ENICreationInSubnetAndSecurityGroupWithEMRTags",
      "Effect": "Allow",
      "Action": ["ec2:CreateNetworkInterface"],
      "Resource": ["arn:aws:ec2:*:*:subnet/*", "arn:aws:ec2:*:*:security-group/*"],
      "Condition": {"StringEquals": {"aws:ResourceTag/for-use-with-amazon-emr-managed-policies": "true"}}
    },
    {
      "Sid": "AllowAddingTagsDuringEC2ENICreation",
      "Effect": "Allow",
      "Action": ["ec2:CreateTags"],
      "Resource": "arn:aws:ec2:*:*:network-interface/*",
      "Condition": {"StringEquals": {"ec2:CreateAction": "CreateNetworkInterface"}}
    },
    {
      "Sid": "AllowEC2ReadOnlyActions",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeSecurityGroups", "ec2:DescribeNetworkInterfaces", "ec2:DescribeTags",
        "ec2:DescribeInstances", "ec2:DescribeSubnets", "ec2:DescribeVpcs"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowSecretsManagerReadOnlyActionsWithEMRTags",
      "Effect": "Allow",
      "Action": ["secretsmanager:GetSecretValue"],
      "Resource": "arn:aws:secretsmanager:*:*:secret:*",
      "Condition": {"StringEquals": {"aws:ResourceTag/for-use-with-amazon-emr-managed-policies": "true"}}
    },
    {
      "Sid": "AllowWorkspaceCollaboration",
      "Effect": "Allow",
      "Action": [
        "iam:GetUser", "iam:GetRole", "iam:ListUsers", "iam:ListRoles",
        "sso:GetManagedApplicationInstance", "sso-directory:SearchUsers"
      ],
      "Resource": "*"
    },
    {
      "Sid": "S3Access",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:GetEncryptionConfiguration", "s3:ListBucket", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::<bucket>", "arn:aws:s3:::<bucket>/*"]
    },
    {
      "Sid": "EMRStudioWorkspaceAccess",
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:CreateEditor", "elasticmapreduce:DescribeEditor", "elasticmapreduce:ListEditors",
        "elasticmapreduce:DeleteEditor", "elasticmapreduce:UpdateEditor", "elasticmapreduce:PutWorkspaceAccess",
        "elasticmapreduce:DeleteWorkspaceAccess", "elasticmapreduce:ListWorkspaceAccessIdentities",
        "elasticmapreduce:StartEditor", "elasticmapreduce:StopEditor", "elasticmapreduce:OpenEditorInConsole",
        "elasticmapreduce:AttachEditor", "elasticmapreduce:DetachEditor", "elasticmapreduce:ListInstanceGroups",
        "elasticmapreduce:ListBootstrapActions",
        "servicecatalog:SearchProducts", "servicecatalog:DescribeProduct", "servicecatalog:DescribeProductView",
        "servicecatalog:DescribeProvisioningParameters", "servicecatalog:ProvisionProduct",
        "servicecatalog:UpdateProvisionedProduct", "servicecatalog:ListProvisioningArtifacts",
        "servicecatalog:DescribeRecord", "servicecatalog:ListLaunchPaths",
        "elasticmapreduce:RunJobFlow", "elasticmapreduce:ListClusters", "elasticmapreduce:DescribeCluster",
        "codewhisperer:GenerateRecommendations",
        "athena:StartQueryExecution", "athena:StopQueryExecution", "athena:GetQueryExecution",
        "athena:GetQueryRuntimeStatistics", "athena:GetQueryResults", "athena:ListQueryExecutions",
        "athena:BatchGetQueryExecution", "athena:GetNamedQuery", "athena:ListNamedQueries",
        "athena:BatchGetNamedQuery", "athena:UpdateNamedQuery", "athena:DeleteNamedQuery",
        "athena:ListDataCatalogs", "athena:GetDataCatalog", "athena:ListDatabases", "athena:GetDatabase",
        "athena:ListTableMetadata", "athena:GetTableMetadata", "athena:ListWorkGroups", "athena:GetWorkGroup",
        "athena:CreateNamedQuery", "athena:GetPreparedStatement",
        "glue:CreateDatabase", "glue:DeleteDatabase", "glue:GetDatabase", "glue:GetDatabases",
        "glue:UpdateDatabase", "glue:CreateTable", "glue:DeleteTable", "glue:BatchDeleteTable",
        "glue:UpdateTable", "glue:GetTable", "glue:GetTables", "glue:BatchCreatePartition",
        "glue:CreatePartition", "glue:DeletePartition", "glue:BatchDeletePartition", "glue:UpdatePartition",
        "glue:GetPartition", "glue:GetPartitions", "glue:BatchGetPartition",
        "kms:ListAliases", "kms:ListKeys", "kms:DescribeKey",
        "lakeformation:GetDataAccess",
        "s3:GetBucketLocation", "s3:GetObject", "s3:ListBucket", "s3:ListBucketMultipartUploads",
        "s3:ListMultipartUploadParts", "s3:AbortMultipartUpload", "s3:PutObject",
        "s3:PutBucketPublicAccessBlock", "s3:ListAllMyBuckets",
        "elasticmapreduce:ListStudios", "elasticmapreduce:DescribeStudio",
        "cloudformation:GetTemplate", "cloudformation:CreateStack", "cloudformation:CreateStackSet",
        "cloudformation:DeleteStack", "cloudformation:GetTemplateSummary", "cloudformation:ValidateTemplate",
        "cloudformation:ListStacks", "cloudformation:ListStackSets",
        "elasticmapreduce:AddTags", "ec2:CreateNetworkInterface",
        "elasticmapreduce:GetClusterSessionCredentials", "elasticmapreduce:GetOnClusterAppUIPresignedURL",
        "cloudformation:DescribeStackResources"
      ],
      "Resource": ["*"]
    },
    {
      "Sid": "AllowPassingServiceRoleForWorkspaceCreation",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": [
        "arn:aws:iam::*:role/<Studio Role>",
        "arn:aws:iam::*:role/<EMR Service Role>",
        "arn:aws:iam::*:role/<EMR Instance Profile Role>"
      ]
    },
    {
      "Sid": "Statement1",
      "Effect": "Allow",
      "Action": ["iam:PassRole"],
      "Resource": ["arn:aws:iam::*:role/<EMR Instance Profile Role>"]
    }
  ]
}

View the full article
2. Starting with release 6.14, Amazon EMR Studio supports interactive analytics on Amazon EMR Serverless. You can now use EMR Serverless applications as the compute, in addition to Amazon EMR on EC2 clusters and Amazon EMR on EKS virtual clusters, to run JupyterLab notebooks from EMR Studio Workspaces. EMR Studio is an integrated development environment (IDE) that makes it straightforward for data scientists and data engineers to develop, visualize, and debug analytics applications written in PySpark, Python, and Scala. EMR Serverless is a serverless option for Amazon EMR that makes it straightforward to run open source big data analytics frameworks such as Apache Spark without configuring, managing, and scaling clusters or servers.

In this post, we demonstrate how to do the following:

- Create an EMR Serverless endpoint for interactive applications
- Attach the endpoint to an existing EMR Studio environment
- Create a notebook and run an interactive application
- Seamlessly diagnose interactive applications from within EMR Studio

Prerequisites

In a typical organization, an AWS account administrator sets up AWS resources such as AWS Identity and Access Management (IAM) roles, Amazon Simple Storage Service (Amazon S3) buckets, and Amazon Virtual Private Cloud (Amazon VPC) resources for internet access and access to other resources in the VPC. They assign EMR Studio administrators, who manage setting up EMR Studios and assigning users to a specific EMR Studio. Once assigned, EMR Studio developers can use EMR Studio to develop and monitor workloads.

Make sure you set up resources like your S3 bucket, VPC subnets, and EMR Studio in the same AWS Region. Complete the following steps to deploy these prerequisites:

1. Launch the following AWS CloudFormation stack.
2. Enter values for AdminPassword and DevPassword and make a note of the passwords you create.
3. Choose Next.
4. Keep the settings as default and choose Next again.
5. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
6. Choose Submit.

We have also provided instructions to deploy these resources manually, with sample IAM policies, in the GitHub repo.

Set up EMR Studio and a serverless interactive application

After the AWS account administrator completes the prerequisites, the EMR Studio administrator can log in to the AWS Management Console to create an EMR Studio, Workspace, and EMR Serverless application.

Create an EMR Studio and Workspace

The EMR Studio administrator should log in to the console using the emrs-interactive-app-admin-user user credentials. If you deployed the prerequisite resources using the provided CloudFormation template, use the password that you provided as an input parameter.

1. On the Amazon EMR console, choose EMR Serverless in the navigation pane.
2. Choose Get started.
3. Select Create and launch EMR Studio.

This creates a Studio with the default name studio_1 and a Workspace with the default name My_First_Workspace. A new browser tab will open for the Studio_1 user interface.

Create an EMR Serverless application

Complete the following steps to create an EMR Serverless application:

1. On the EMR Studio console, choose Applications in the navigation pane.
2. Create a new application.
3. For Name, enter a name (for example, my-serverless-interactive-application).
4. For Application setup options, select Use custom settings for interactive workloads.
For interactive applications, as a best practice, we recommend keeping the driver and workers pre-initialized by configuring pre-initialized capacity at the time of application creation. This effectively creates a warm pool of workers for the application and keeps the resources ready to be consumed, enabling the application to respond in seconds. For further best practices for creating EMR Serverless applications, see Define per-team resource limits for big data workloads using Amazon EMR Serverless.

5. In the Interactive endpoint section, select Enable Interactive endpoint.
6. In the Network connections section, choose the VPC, private subnets, and security group you created previously. If you deployed the CloudFormation stack provided in this post, choose emr-serverless-sg as the security group.

A VPC is needed for the workload to be able to access the internet from within the EMR Serverless application in order to download external Python packages. The VPC also allows you to access resources such as Amazon Relational Database Service (Amazon RDS) and Amazon Redshift that are in the VPC from this application. Attaching a serverless application to a VPC can lead to IP exhaustion in the subnet, so make sure there are sufficient IP addresses in your subnet.

7. Choose Create and start application. (A CLI sketch of an equivalent application follows below.)
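If you prefer scripting over the console, a comparable application can be created with the AWS CLI. This is a minimal sketch rather than the post's exact setup; the capacity sizes, subnet ID, and security group ID are illustrative placeholders:

aws emr-serverless create-application \
  --name my-serverless-interactive-application \
  --type SPARK \
  --release-label emr-6.14.0 \
  --initial-capacity '{
    "DRIVER":   {"workerCount": 1, "workerConfiguration": {"cpu": "4vCPU", "memory": "16GB"}},
    "EXECUTOR": {"workerCount": 2, "workerConfiguration": {"cpu": "4vCPU", "memory": "16GB"}}
  }' \
  --network-configuration '{"subnetIds": ["subnet-xxxx"], "securityGroupIds": ["sg-xxxx"]}'

The --initial-capacity block is what creates the warm pool of pre-initialized driver and executor workers described above.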
On the applications page, you can verify that the status of your serverless application changes to Started.

8. Select your application and choose How it works.
9. Choose View and launch workspaces.
10. Choose Configure studio.
11. For Service role, provide the EMR Studio service role you created as a prerequisite (emr-studio-service-role).
12. For Workspace storage, enter the path of the S3 bucket you created as a prerequisite (emrserverless-interactive-blog-<account-id>-<region-name>).
13. Choose Save changes.
14. Navigate to the Studios console by choosing Studios in the left navigation menu in the EMR Studio section. Note the Studio access URL from the Studios console and provide it to your developers to run their Spark applications.

Run your first Spark application

After the EMR Studio administrator has created the Studio, Workspace, and serverless application, the Studio user can use the Workspace and application to develop and monitor Spark workloads.

Launch the Workspace and attach the serverless application

Complete the following steps:

1. Using the Studio URL provided by the EMR Studio administrator, log in using the emrs-interactive-app-dev-user user credentials shared by the AWS account admin. If you deployed the prerequisite resources using the provided CloudFormation template, use the password that you provided as an input parameter.
2. On the Workspaces page, you can check the status of your Workspace. When the Workspace is launched, you will see the status change to Ready.
3. Launch the Workspace by choosing the Workspace name (My_First_Workspace). This will open a new tab; make sure your browser allows pop-ups.
4. In the Workspace, choose Compute (cluster icon) in the navigation pane.
5. For EMR Serverless application, choose your application (my-serverless-interactive-application).
6. For Interactive runtime role, choose an interactive runtime role (for this post, we use emr-serverless-runtime-role).
7. Choose Attach to attach the serverless application as the compute type for all the notebooks in this Workspace.

Run your Spark application interactively

Complete the following steps:

1. Choose the Notebook samples (three dots icon) in the navigation pane and open the Getting-started-with-emr-serverless notebook.
2. Choose Save to Workspace.

There are three choices of kernels for our notebook: Python 3, PySpark, and Spark (for Scala).

3. When prompted, choose PySpark as the kernel.
4. Choose Select.

Now you can run your Spark application. To do so, use the %%configure Sparkmagic command, which configures the session creation parameters. Interactive applications support Python virtual environments. We use a custom environment in the worker nodes by specifying a path for a different Python runtime for the executor environment using spark.executorEnv.PYSPARK_PYTHON. See the following code:

%%configure -f
{
  "conf": {
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.python": "/usr/bin/python3",
    "spark.executorEnv.PYSPARK_PYTHON": "/usr/bin/python3"
  }
}

Install external packages

Now that you have an independent virtual environment for the workers, EMR Studio notebooks allow you to install external packages from within the serverless application by using the Spark install_pypi_package function through the Spark context. Using this function makes the package available for all the EMR Serverless workers. First, install matplotlib, a Python package, from PyPI:

sc.install_pypi_package("matplotlib")

If the preceding step doesn't respond, check your VPC setup and make sure it is configured correctly for internet access. Now you can use a dataset and visualize your data.

Create visualizations

To create visualizations, we use a public dataset on NYC yellow taxis:

file_name = "s3://athena-examples-us-east-1/notebooks/yellow_tripdata_2016-01.parquet"
taxi_df = (spark.read.format("parquet").option("header", "true")
           .option("inferSchema", "true").load(file_name))

In the preceding code block, you read the Parquet file from a public bucket in Amazon S3. The file has headers, and we want Spark to infer the schema. You then use a Spark DataFrame to group and count specific columns from taxi_df:

taxi1_df = taxi_df.groupBy("VendorID", "passenger_count").count()
taxi1_df.show()

Use the %%display magic to view the result in table format:

%%display
taxi1_df

You can also quickly visualize your data with five types of charts. You can choose the display type and the chart will change accordingly. In the following screenshot, we use a bar chart to visualize our data.

Interact with EMR Serverless using Spark SQL

You can interact with tables in the AWS Glue Data Catalog using Spark SQL on EMR Serverless. In the sample notebook, we show how you can transform data using a Spark DataFrame. First, create a new temporary view called taxis. This allows you to use Spark SQL to select data from this view. Then create a taxi DataFrame for further processing:

taxi_df.createOrReplaceTempView("taxis")
sqlDF = spark.sql(
    "SELECT DOLocationID, sum(total_amount) as sum_total_amount \
     FROM taxis WHERE DOLocationID < 25 GROUP BY DOLocationID ORDER BY DOLocationID"
)
sqlDF.show(5)

In each cell in your EMR Studio notebook, you can expand Spark Job Progress to view the various stages of the job submitted to EMR Serverless while running that specific cell, along with the time taken to complete each stage. In the following example, stage 14 of the job has 12 completed tasks. In addition, if there is any failure, you can see the logs, making troubleshooting a seamless experience. We discuss this more in the next section.

Use the following code to visualize the processed DataFrame using the matplotlib package.
You use the matplotlib library to plot the dropoff location and the total amount as a bar chart:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

plt.clf()
df = sqlDF.toPandas()
plt.bar(df.DOLocationID, df.sum_total_amount)
%matplot plt

Diagnose interactive applications

You can get the session information for your Livy endpoint using the %%info Sparkmagic. This gives you links to access the Spark UI as well as the driver log right in your notebook. The following screenshot is a driver log snippet for our application, which we opened via the link in our notebook.

Similarly, you can choose the link below Spark UI to open the UI. The following screenshot shows the Executors tab, which provides access to the driver and executor logs. The following screenshot shows stage 14, which corresponds to the Spark SQL step we saw earlier, in which we calculated the location-wise sum of total taxi collections, broken down into 12 tasks. Through the Spark UI, the interactive application provides fine-grained task-level status, I/O, and shuffle details, as well as links to the corresponding logs for each task in this stage, right from your notebook, enabling a seamless troubleshooting experience.

Clean up

If you no longer want to keep the resources created in this post, complete the following cleanup steps:

1. Delete the EMR Serverless application.
2. Delete the EMR Studio and the associated Workspaces and notebooks.
3. To delete the rest of the resources, navigate to the CloudFormation console, select the stack, and choose Delete.

All of the resources will be deleted except the S3 bucket, which has its deletion policy set to retain.

Conclusion

This post showed how to run interactive PySpark workloads in EMR Studio using EMR Serverless as the compute, and how to build and monitor Spark applications in an interactive JupyterLab Workspace. In an upcoming post, we'll discuss additional capabilities of EMR Serverless interactive applications, such as:

- Working with resources such as Amazon RDS and Amazon Redshift in your VPC (for example, for JDBC/ODBC connectivity)
- Running transactional workloads using serverless endpoints

If this is your first time exploring EMR Studio, we recommend checking out the Amazon EMR workshops and referring to Create an EMR Studio.

About the Authors

Sekar Srinivasan is a Principal Specialist Solutions Architect at AWS focused on Data Analytics and AI. Sekar has over 20 years of experience working with data. He is passionate about helping customers build scalable solutions, modernize their architecture, and generate insights from their data. In his spare time, he likes to work on non-profit projects focused on underprivileged children's education.

Disha Umarwani is a Sr. Data Architect with Amazon Professional Services within Global Health Care and LifeSciences. She has worked with customers to design, architect, and implement data strategy at scale. She specializes in architecting data mesh architectures for enterprise platforms.

View the full article
3. Today, we are excited to announce that customers can now use Apache Livy to submit their Apache Spark jobs to Amazon EMR on EKS, in addition to using the StartJobRun API, the Spark Operator, spark-submit, and interactive endpoints. With this launch, customers can use a REST interface to easily submit Spark jobs or snippets of Spark code and retrieve results synchronously or asynchronously, while continuing to get all of the Amazon EMR on EKS benefits, such as the EMR-optimized Spark runtime, an SSL-secured Livy endpoint, and a programmatic setup experience.
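For orientation, the standard Apache Livy REST API for batch submission looks like the following sketch. The endpoint URL and S3 paths are hypothetical placeholders; the actual SSL-secured endpoint comes from your EMR on EKS setup:

# Submit a batch job via the Livy REST API (endpoint and paths are placeholders)
curl -X POST https://<livy-endpoint>/batches \
  -H "Content-Type: application/json" \
  -d '{
        "file": "s3://my-bucket/scripts/wordcount.py",
        "args": ["s3://my-bucket/input/"],
        "conf": {"spark.executor.instances": "2"}
      }'

# Poll the state of batch 0
curl https://<livy-endpoint>/batches/0/state

View the full article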
4. We are excited to announce that Amazon EMR on EKS has simplified the authentication and authorization user experience by integrating with Amazon EKS's improved cluster access management controls. With this launch, Amazon EMR on EKS uses EKS access management controls to automatically obtain the necessary permissions to run Amazon EMR applications on the EKS cluster.
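For context, EKS cluster access management replaces manual aws-auth ConfigMap edits with access entries managed through the EKS API. EMR on EKS now performs the equivalent registration for you, so the following sketch is purely illustrative of the underlying mechanism; the cluster and role names are placeholders:

# Grant an IAM role access to an EKS cluster via an access entry (illustrative)
aws eks create-access-entry \
  --cluster-name my-eks-cluster \
  --principal-arn arn:aws:iam::111122223333:role/my-emr-containers-role \
  --type STANDARD

# Inspect existing access entries on the cluster
aws eks list-access-entries --cluster-name my-eks-cluster

View the full article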
5. Many enterprises are migrating their on-premises data stores to the AWS Cloud. During data migration, a key requirement is to validate all the data that has been moved from source to target. This data validation is a critical step, and if not done correctly, may result in the failure of the entire project. However, developing custom solutions to determine migration accuracy by comparing the data between the source and target can often be time-consuming.

In this post, we walk through a step-by-step process to validate large datasets after migration using a configuration-based tool built on Amazon EMR and the Apache Griffin open source library. Griffin is an open source data quality solution for big data that supports both batch and streaming modes.

In today's data-driven landscape, where organizations deal with petabytes of data, automated data validation frameworks have become increasingly critical. Manual validation processes are not only time-consuming but also prone to errors, especially when dealing with vast volumes of data. Automated data validation frameworks offer a streamlined solution by efficiently comparing large datasets, identifying discrepancies, and ensuring data accuracy at scale. With such frameworks, organizations can save valuable time and resources while maintaining confidence in the integrity of their data, enabling informed decision-making and enhancing overall operational efficiency.

The following are standout features of this framework:

- Utilizes a configuration-driven framework
- Offers plug-and-play functionality for seamless integration
- Conducts count comparison to identify any disparities
- Implements robust data validation procedures
- Ensures data quality through systematic checks
- Provides access to a file containing mismatched records for in-depth analysis
- Generates comprehensive reports for insights and tracking purposes

Solution overview

This solution uses the following services:

- Amazon Simple Storage Service (Amazon S3) or Hadoop Distributed File System (HDFS) as the source and target.
- Amazon EMR to run the PySpark script. We use a Python wrapper on top of Griffin to validate data between Hadoop tables created over HDFS or Amazon S3.
- AWS Glue to catalog the technical table, which stores the results of the Griffin job.
- Amazon Athena to query the output table to verify the results.

We use tables that store the count for each source and target table, and also create files that show the difference of records between source and target. The following diagram illustrates the solution architecture.

In the depicted architecture and our typical data lake use case, our data either resides in Amazon S3 or is migrated from on premises to Amazon S3 using replication tools such as AWS DataSync or AWS Database Migration Service (AWS DMS). Although this solution is designed to seamlessly interact with both Hive Metastore and the AWS Glue Data Catalog, we use the Data Catalog as our example in this post.

This framework operates within Amazon EMR, automatically running scheduled tasks on a daily basis, per the defined frequency. It generates and publishes reports in Amazon S3, which are then accessible via Athena. A notable feature of this framework is its capability to detect count mismatches and data discrepancies, in addition to generating a file in Amazon S3 containing the full records that didn't match, facilitating further analysis.
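To give a sense of what "configuration-driven" means here, the following is a representative sketch of an Apache Griffin batch accuracy rule file. The database, table, and column names are hypothetical, and the exact keys and connector settings vary by Griffin version, so treat this as an assumption to check against the Griffin documentation rather than this post's actual configuration:

{
  "name": "accuracy_balance_sheet",
  "process.type": "batch",
  "data.sources": [
    {
      "name": "source",
      "connector": {"type": "hive", "config": {"database": "source_db", "table.name": "balance_sheet"}}
    },
    {
      "name": "target",
      "connector": {"type": "hive", "config": {"database": "target_db", "table.name": "balance_sheet"}}
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "accuracy",
        "out.dataframe.name": "accu",
        "rule": "source.account_id = target.account_id AND source.balance = target.balance"
      }
    ]
  },
  "sinks": ["CONSOLE", "HDFS"]
}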
In this example, we use three tables in an on-premises database to validate between source and target: balance_sheet, covid, and survery_financial_report.

Prerequisites

Before getting started, make sure you have the following prerequisites:

- An AWS account with access to AWS services
- A VPC with a private subnet
- An Amazon Elastic Compute Cloud (Amazon EC2) key pair
- An AWS Identity and Access Management (IAM) policy for AWS Secrets Manager permissions
- The IAM roles EMR_DefaultRole and EMR_EC2_DefaultRole available in your account
- A SQL editor to connect to the source database
- The AWS Command Line Interface (AWS CLI) set up to run AWS commands locally:
  - Create an IAM policy with access to the specific S3 buckets (actions s3:PutObject and s3:GetObject), and an IAM policy to use an AWS CloudFormation template with the desired permissions
  - Create an IAM user
  - Create an IAM role and attach the IAM S3 and CloudFormation policies
  - Configure ~/.aws/config to use the IAM role and user

Deploy the solution

To make it straightforward for you to get started, we have created a CloudFormation template that automatically configures and deploys the solution for you. Complete the following steps:

1. Create an S3 bucket in your AWS account called bdb-3070-griffin-datavalidation-blog-${AWS::AccountId}-${AWS::Region} (provide your AWS account ID and AWS Region).
2. Unzip the following file to your local system.
3. After unzipping the file to your local system, change <bucket name> to the one you created in your account (bdb-3070-griffin-datavalidation-blog-${AWS::AccountId}-${AWS::Region}) in the following files:
   - bootstrap-bdb-3070-datavalidation.sh
   - Validation_Metrics_Athena_tables.hql
   - datavalidation/totalcount/totalcount_input.txt
   - datavalidation/accuracy/accuracy_input.txt
4. Upload all the folders and files in your local folder to your S3 bucket:

aws s3 cp . s3://<bucket_name>/ --recursive

5. Run the following CloudFormation template in your account. The CloudFormation template creates a database called griffin_datavalidation_blog and an AWS Glue crawler called griffin_data_validation_blog on top of the data folder in the .zip file.
6. Choose Next.
7. Choose Next again.
8. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
9. Choose Create stack.

You can view the stack outputs on the AWS Management Console or by using the following AWS CLI command:

aws cloudformation describe-stacks --stack-name <stack-name> --region us-east-1 --query Stacks[0].Outputs

10. Run the AWS Glue crawler and verify that six tables have been created in the Data Catalog. (A CLI sketch for this step follows below.)
11. Run the following CloudFormation template in your account. This template creates an EMR cluster with a bootstrap script to copy Griffin-related JARs and artifacts. It also runs three EMR steps:
    - Create two Athena tables and two Athena views to see the validation matrix produced by the Griffin framework
    - Run count validation for all three tables to compare the source and target table
    - Run record-level and column-level validations for all three tables to compare between the source and target table
12. For SubnetID, enter your subnet ID.
13. Choose Next.
14. Choose Next again.
15. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
16. Choose Create stack.

You can view the stack outputs on the console or by using the following AWS CLI command:

aws cloudformation describe-stacks --stack-name <stack-name> --region us-east-1 --query Stacks[0].Outputs

It takes approximately 5 minutes for the deployment to complete.
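If you'd rather script step 10, a minimal sketch using the names created by the template (the crawler takes a few minutes, so poll or wait before listing tables):

# Start the crawler created by the CloudFormation template
aws glue start-crawler --name griffin_data_validation_blog

# After it finishes, confirm the tables exist in the Data Catalog
aws glue get-tables --database-name griffin_datavalidation_blog \
  --query 'TableList[].Name'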
When the stack is complete, you should see the EMRCluster resource launched and available in your account. When the EMR cluster is launched, it runs the following steps as part of the post-cluster launch:

- Bootstrap action – Installs the Griffin JAR file and directories for this framework. It also downloads sample data files to use in the next step.
- Athena_Table_Creation – Creates tables in Athena to read the result reports.
- Count_Validation – Runs the job to compare the data count between source and target data from the Data Catalog table and stores the results in an S3 bucket, which will be read via an Athena table.
- Accuracy – Runs the job to compare the data rows between the source and target data from the Data Catalog table and stores the results in an S3 bucket, which will be read via an Athena table.

When the EMR steps are complete, your table comparison is done and the results are automatically ready to view in Athena. No manual intervention is needed for validation.

Validate data with Python Griffin

When your EMR cluster is ready and all the jobs are complete, the count validation and data validation are done. The results have been stored in Amazon S3, and the Athena tables are already created on top of them. You can query the Athena tables to view the results, as shown in the following screenshot.

The following screenshot shows the count results for all tables.

The following screenshot shows the data accuracy results for all tables.

The following screenshot shows the files created for each table with mismatched records. Individual folders are generated for each table directly from the job. Every table folder contains a directory for each day the job is run. Within that specific date, a file named __missRecords contains the records that do not match.

The following screenshot shows the contents of the __missRecords file.
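To pull the same results programmatically rather than from the console, you can run a query through the Athena CLI. The table name below is a hypothetical placeholder for the validation metrics table the EMR step creates (check Validation_Metrics_Athena_tables.hql for the actual names):

aws athena start-query-execution \
  --query-string "SELECT * FROM griffin_datavalidation_blog.count_validation_results LIMIT 10" \
  --result-configuration OutputLocation=s3://<bucket_name>/athena-results/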
Ramesh Raghupathy is a Senior Data Architect with WWCO ProServe at AWS. He works with AWS customers to architect, deploy, and migrate to data warehouses and data lakes on the AWS Cloud. While not at work, Ramesh enjoys traveling, spending time with family, and yoga. View the full article
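As noted in the article above, the validation results land in Athena tables; the following is a minimal boto3 sketch of querying them, where the view name (griffin_count_validation) and the query output location are hypothetical, because the post does not list the exact Athena object names:

import time

import boto3

athena = boto3.client("athena")

# Start a query against one of the validation views created by the EMR step
# (the view name and output location below are illustrative only).
qid = athena.start_query_execution(
    QueryString="SELECT * FROM griffin_count_validation",
    QueryExecutionContext={"Database": "griffin_datavalidation_blog"},
    ResultConfiguration={"OutputLocation": "s3://<bucket_name>/athena-results/"},
)["QueryExecutionId"]

# Wait for the query to finish, then print each row.
while athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(2)

for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])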
  6. We are excited to introduce a new Amazon EMR on EC2 feature that enables automatic graceful replacement of unhealthy core nodes to ensure continued optimal cluster operations and prevent data loss. Additionally, EMR on EC2 will publish CloudWatch events to provide visibility into node health and recovery actions. These improvements are available for all Amazon EMR releases. View the full article
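For the CloudWatch events mentioned in this announcement, here is a minimal boto3 sketch of routing Amazon EMR events to an SNS topic with an EventBridge rule; the rule matches all events from the aws.emr source because the announcement does not enumerate the new detail types, and the topic ARN is a placeholder:

import json

import boto3

events = boto3.client("events")

# Match all Amazon EMR events, which include the node health and
# recovery notifications described above.
events.put_rule(
    Name="emr-node-health-events",
    EventPattern=json.dumps({"source": ["aws.emr"]}),
    State="ENABLED",
)

# Send matching events to an SNS topic (the ARN is a placeholder).
events.put_targets(
    Rule="emr-node-health-events",
    Targets=[{"Id": "sns-target", "Arn": "arn:aws:sns:us-east-1:111122223333:emr-alerts"}],
)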
  7. Account reconciliation is an important step to ensure the completeness and accuracy of financial statements. Specifically, companies must reconcile balance sheet accounts that could contain significant or material misstatements. Accountants go through each account in the general ledger of accounts and verify that the balance listed is complete and accurate. When discrepancies are found, accountants investigate and take appropriate corrective action.

As part of Amazon's FinTech organization, we offer a software platform that empowers the internal accounting teams at Amazon to conduct account reconciliations. To optimize the reconciliation process, these users require high-performance transformation with the ability to scale on demand, as well as the ability to process variable file sizes ranging from a few MBs to more than 100 GB. It's not always possible to fit data onto a single machine or process it with a single program in a reasonable time frame. The computation also has to be fast enough to provide practical services, while keeping the programming logic separate from the underlying details (data distribution, fault tolerance, and scheduling). Distributed data processing solutions meet these needs by running the same function simultaneously across groups of elements of a dataset on multiple machines or threads.

This encouraged us to reinvent our reconciliation service, powered by AWS services including Amazon EMR and the Apache Spark distributed processing framework via PySpark. This service enables users to process files over 100 GB containing up to 100 million transactions in less than 30 minutes. The reconciliation service has become a powerhouse for data processing, and now users can seamlessly perform a variety of operations, such as Pivot, JOIN (like an Excel VLOOKUP operation), arithmetic operations, and more, providing a versatile and efficient solution for reconciling vast datasets. This enhancement is a testament to the scalability and speed achieved through the adoption of distributed data processing solutions.

In this post, we explain how we integrated Amazon EMR to build a highly available and scalable system that enabled us to run a high-volume financial reconciliation process.

Architecture before migration

The following diagram illustrates our previous architecture.

Our legacy service was built with Amazon Elastic Container Service (Amazon ECS) on AWS Fargate. We processed the data sequentially using Python. However, due to its lack of parallel processing capability, we frequently had to scale the cluster up vertically to support larger datasets. For context, 5 GB of data with 50 operations took around 3 hours to process. This service was configured to scale horizontally to five ECS instances that polled messages from Amazon Simple Queue Service (Amazon SQS), which fed the transformation requests. Each instance was configured with 4 vCPUs and 30 GB of memory to allow horizontal scaling. However, adding capacity couldn't improve per-job performance, because each process ran sequentially, picking chunks of data from Amazon Simple Storage Service (Amazon S3) for processing. For example, a VLOOKUP operation where two files are to be joined required both files to be read in memory chunk by chunk to obtain the output (see the PySpark sketch after this section for how such an operation maps to a distributed join). This became an obstacle for users because they had to wait for long periods of time to process their datasets.
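To make the VLOOKUP example concrete, here is a minimal PySpark sketch of how such an operation maps to a distributed join in the redesigned service; the file paths, key column, and schema are hypothetical, not the service's actual layout:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("vlookup-as-join").getOrCreate()

# Read both input files as DataFrames (paths are illustrative).
ledger = spark.read.csv("s3://<bucket>/ledger.csv", header=True)
lookup = spark.read.csv("s3://<bucket>/lookup.csv", header=True)

# An Excel VLOOKUP on account_id becomes a left join on that key; Spark
# distributes the join across executors instead of reading both files
# chunk by chunk in a single process.
result = ledger.join(lookup, on="account_id", how="left")

result.write.mode("overwrite").parquet("s3://<bucket>/reconciled/")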
As part of our re-architecture and modernization, we wanted to achieve the following:

High availability – The data processing clusters should be highly available, providing three 9s of availability (99.9%)
Throughput – The service should handle 1,500 runs per day
Latency – It should be able to process 100 GB of data within 30 minutes
Heterogeneity – The cluster should be able to support a wide variety of workloads, with files ranging from a few MBs to hundreds of GBs
Query concurrency – The implementation demands the ability to support a minimum of 10 degrees of concurrency
Reliability of jobs and data consistency – Jobs need to run reliably and consistently to avoid breaking Service Level Agreements (SLAs)
Cost-effective and scalable – It must be scalable based on the workload, making it cost-effective
Security and compliance – Given the sensitivity of data, it must support fine-grained access control and appropriate security implementations
Monitoring – The solution must offer end-to-end monitoring of the clusters and jobs

Why Amazon EMR

Amazon EMR is the industry-leading cloud big data solution for petabyte-scale data processing, interactive analytics, and machine learning (ML) using open source frameworks such as Apache Spark, Apache Hive, and Presto. With these frameworks and related open source projects, you can process data for analytics purposes and BI workloads. Amazon EMR lets you transform and move large amounts of data in and out of other AWS data stores and databases, such as Amazon S3 and Amazon DynamoDB.

A notable advantage of Amazon EMR lies in its effective use of parallel processing with PySpark, marking a significant improvement over traditional sequential Python code. This approach streamlines the deployment and scaling of Apache Spark clusters, allowing for efficient parallelization on large datasets. The distributed computing infrastructure not only enhances performance, but also enables the processing of vast amounts of data at high speed. Equipped with libraries, PySpark facilitates Excel-like operations on DataFrames, and the higher-level abstraction of DataFrames simplifies intricate data manipulations, reducing code complexity. Combined with automatic cluster provisioning, dynamic resource allocation, and integration with other AWS services, Amazon EMR proves to be a versatile solution suitable for diverse workloads, ranging from batch processing to ML. The inherent fault tolerance in PySpark and Amazon EMR promotes robustness, even in the event of node failures, making it a scalable, cost-effective, and high-performance choice for parallel data processing on AWS.

Amazon EMR extends its capabilities beyond the basics, offering a variety of deployment options to cater to diverse needs. Whether it's Amazon EMR on EC2, Amazon EMR on EKS, Amazon EMR Serverless, or Amazon EMR on AWS Outposts, you can tailor your approach to specific requirements. For those seeking a serverless environment for Spark jobs, integrating AWS Glue is also a viable option. In addition to supporting various open source frameworks, including Spark, Amazon EMR provides flexibility in choosing deployment modes, Amazon Elastic Compute Cloud (Amazon EC2) instance types, scaling mechanisms, and numerous cost-saving optimization techniques. Amazon EMR stands as a dynamic force in the cloud, delivering robust capabilities for organizations seeking big data solutions.
Its seamless integration, powerful features, and adaptability make it an indispensable tool for navigating the complexities of data analytics and ML on AWS.

Redesigned architecture

The following diagram illustrates our redesigned architecture.

The solution operates under an API contract, where clients can submit transformation configurations, defining the set of operations alongside the S3 dataset location for processing. The request is queued through Amazon SQS, then directed to Amazon EMR via a Lambda function. This process initiates the creation of an Amazon EMR step for the Spark framework implementation on a dedicated EMR cluster. Although Amazon EMR accommodates an unlimited number of steps over a long-running cluster's lifetime, only 256 steps can be running or pending simultaneously. For optimal parallelization, the step concurrency is set at 10, allowing 10 steps to run concurrently. In case of request failures, the Amazon SQS dead-letter queue (DLQ) retains the event. Spark processes the request, translating Excel-like operations into PySpark code for an efficient query plan. Resilient DataFrames store input, output, and intermediate data in memory, optimizing processing speed, reducing disk I/O cost, enhancing workload performance, and delivering the final output to the specified Amazon S3 location.

We define our SLA in two dimensions: latency and throughput. Latency is defined as the amount of time taken to perform one job against a deterministic dataset size and the number of operations performed on the dataset. Throughput is defined as the maximum number of simultaneous jobs the service can perform without breaching the latency SLA of one job. The overall scalability SLA of the service depends on the balance of horizontal scaling of elastic compute resources and vertical scaling of individual servers.

Because we had to run 1,500 processes per day with minimal latency and high performance, we chose to integrate the Amazon EMR on EC2 deployment mode with managed scaling enabled to support processing variable file sizes. The EMR cluster configuration provides many different selections:

EMR node types – Primary, core, or task nodes
Instance purchasing options – On-Demand Instances, Reserved Instances, or Spot Instances
Configuration options – EMR instance fleet or uniform instance group
Scaling options – Auto Scaling or Amazon EMR managed scaling

Based on our variable workload, we configured an EMR instance fleet (for best practices, see Reliability). We also decided to use Amazon EMR managed scaling to scale the core and task nodes (for scaling scenarios, refer to Node allocation scenarios). Lastly, we chose memory-optimized AWS Graviton instances, which provide up to 30% lower cost and up to 15% improved performance for Spark workloads.
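Before the cluster configuration snapshot that follows, here is a minimal boto3 sketch of the SQS-to-EMR step submission described above; the Lambda handler shape, cluster ID, and script location are hypothetical stand-ins for the actual service code:

import json

import boto3

emr = boto3.client("emr")

def handler(event, context):
    """Lambda handler invoked by SQS; submits one EMR step per request."""
    for record in event["Records"]:
        request = json.loads(record["body"])  # transformation configuration
        emr.add_job_flow_steps(
            JobFlowId="j-XXXXXXXXXXXXX",  # dedicated long-running cluster (placeholder)
            Steps=[
                {
                    "Name": f"reconciliation-{request['run_id']}",
                    # CONTINUE keeps the long-running cluster alive if a step
                    # fails; failed requests are retained by the SQS DLQ.
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": [
                            "spark-submit",
                            "--deploy-mode", "cluster",
                            "s3://<bucket>/scripts/transform.py",  # illustrative script path
                            "--input", request["dataset_s3_uri"],
                        ],
                    },
                }
            ],
        )

With the cluster's step concurrency set at 10, up to 10 steps submitted this way run at the same time.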
The following code provides a snapshot of our cluster configuration:

Concurrent steps: 10

EMR Managed Scaling:
  minimumCapacityUnits: 64
  maximumCapacityUnits: 512
  maximumOnDemandCapacityUnits: 512
  maximumCoreCapacityUnits: 512

Master Instance Fleet:
  r6g.xlarge
  - 4 vCore, 30.5 GiB memory, EBS only storage
  - EBS Storage: 250 GiB
  - Maximum Spot price: 100% of On-Demand price
  - Each instance counts as 1 unit
  r6g.2xlarge
  - 8 vCore, 61 GiB memory, EBS only storage
  - EBS Storage: 250 GiB
  - Maximum Spot price: 100% of On-Demand price
  - Each instance counts as 1 unit

Core Instance Fleet:
  r6g.2xlarge
  - 8 vCore, 61 GiB memory, EBS only storage
  - EBS Storage: 100 GiB
  - Maximum Spot price: 100% of On-Demand price
  - Each instance counts as 8 units
  r6g.4xlarge
  - 16 vCore, 122 GiB memory, EBS only storage
  - EBS Storage: 100 GiB
  - Maximum Spot price: 100% of On-Demand price
  - Each instance counts as 16 units

Task Instances:
  r6g.2xlarge
  - 8 vCore, 61 GiB memory, EBS only storage
  - EBS Storage: 100 GiB
  - Maximum Spot price: 100% of On-Demand price
  - Each instance counts as 8 units
  r6g.4xlarge
  - 16 vCore, 122 GiB memory, EBS only storage
  - EBS Storage: 100 GiB
  - Maximum Spot price: 100% of On-Demand price
  - Each instance counts as 16 units

Performance

With our migration to Amazon EMR, we were able to achieve a system performance capable of handling a variety of datasets, ranging from as low as 273 B to as high as 88.5 GB, with a p99 runtime of 491 seconds (approximately 8 minutes).

The following figure illustrates the variety of file sizes processed.

The following figure shows our latency.

To compare against sequential processing, we took two datasets containing 53 million records and ran a VLOOKUP operation against each other, along with 49 other Excel-like operations. This took 26 minutes to process in the new service, compared to 5 days in the legacy service, an almost 300-fold performance improvement over the previous architecture.

Considerations

Keep in mind the following when considering this solution:

Right-sizing clusters – Although Amazon EMR is resizable, it's important to right-size the clusters. Right-sizing mitigates a slow cluster (if undersized) or higher costs (if the cluster is oversized). To anticipate these issues, you can calculate the number and type of nodes that will be needed for the workloads.
Parallel steps – Running steps in parallel allows you to run more advanced workloads, increase cluster resource utilization, and reduce the amount of time taken to complete your workload. The number of steps allowed to run at one time is configurable and can be set when a cluster is launched and any time after the cluster has started. You need to consider and optimize the CPU and memory usage per job when multiple jobs are running in a single shared cluster.
Job-based transient EMR clusters – If applicable, it is recommended to use a job-based transient EMR cluster, which delivers superior isolation by ensuring that each task operates within its dedicated environment. This approach optimizes resource utilization, helps prevent interference between jobs, and enhances overall performance and reliability. The transient nature enables efficient scaling, providing a robust and isolated solution for diverse data processing needs.
EMR Serverless – EMR Serverless is the ideal choice if you prefer not to handle the management and operation of clusters.
It allows you to effortlessly run applications using the open source frameworks available within EMR Serverless, offering a straightforward and hassle-free experience.
Amazon EMR on EKS – Amazon EMR on EKS offers distinct advantages, such as faster startup times and improved scalability that resolves compute capacity challenges, which is particularly beneficial for Graviton and Spot Instance users. The inclusion of a broader range of compute types enhances cost-efficiency, allowing tailored resource allocation. Furthermore, Multi-AZ support provides increased availability. These compelling features provide a robust solution for managing big data workloads with improved performance, cost optimization, and reliability across various computing scenarios.

Conclusion

In this post, we explained how Amazon optimized its high-volume financial reconciliation process with Amazon EMR for higher scalability and performance. If you have a monolithic application that depends on vertical scaling to process additional requests or datasets, then migrating it to a distributed processing framework such as Apache Spark and choosing a managed service such as Amazon EMR for compute may help reduce the runtime to lower your delivery SLA, and may also help reduce the total cost of ownership (TCO).

As we embrace Amazon EMR for this particular use case, we encourage you to explore further possibilities in your data innovation journey. Consider evaluating AWS Glue, along with other dynamic Amazon EMR deployment options such as EMR Serverless or Amazon EMR on EKS, to discover the best AWS service tailored to your unique use case.

About the Authors

Jeeshan Khetrapal is a Sr. Software Development Engineer at Amazon, where he develops fintech products based on cloud computing serverless architectures that are responsible for companies' IT general controls, financial reporting, and controllership for governance, risk, and compliance.

Sakti Mishra is a Principal Solutions Architect at AWS, where he helps customers modernize their data architecture and define their end-to-end data strategy, including data security, accessibility, governance, and more. He is also the author of the book Simplify Big Data Analytics with Amazon EMR. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family.

View the full article
  8. Trino is an open source distributed SQL query engine designed for interactive analytic workloads. On AWS, you can run Trino on Amazon EMR, where you have the flexibility to run your preferred version of open source Trino on Amazon Elastic Compute Cloud (Amazon EC2) instances that you manage, or on Amazon Athena for a serverless experience. When you use Trino on Amazon EMR or Athena, you get the latest open source community innovations along with proprietary, AWS-developed optimizations. Starting from Amazon EMR 6.8.0 and Athena engine version 2, AWS has been developing query plan and engine behavior optimizations that improve query performance on Trino. In this post, we compare Amazon EMR 6.15.0 with open source Trino 426 and show that TPC-DS queries ran up to 2.7 times faster on Amazon EMR 6.15.0 Trino 426 compared to open source Trino 426. Later, we explain a few of the AWS-developed performance optimizations that contribute to these results.

Benchmark setup

In our testing, we used the 3 TB dataset stored in Amazon S3 in compressed Parquet format, with metadata for databases and tables stored in the AWS Glue Data Catalog. This benchmark uses the unmodified TPC-DS data schema and table relationships. Fact tables are partitioned on the date column and contained 200 to 2,100 partitions. Table and column statistics were not present for any of the tables. We used TPC-DS queries from the open source Trino GitHub repository without modification. Benchmark queries were run sequentially on two different Amazon EMR 6.15.0 clusters: one with Amazon EMR Trino 426 and the other with open source Trino 426. Both clusters used one r5.4xlarge coordinator and 20 r5.4xlarge worker instances.

Results observed

Our benchmarks show consistently better performance with Trino on Amazon EMR 6.15.0 compared to open source Trino. The total query runtime of Trino on Amazon EMR was 2.7 times faster compared to open source. The following graph shows performance improvements measured by the total query runtime (in seconds) for the benchmark queries.

Many of the TPC-DS queries demonstrated performance gains of over five times compared to open source Trino. Some queries showed even greater gains, like query 72, which improved by 160 times. The following graph shows the top 10 TPC-DS queries with the largest improvement in runtime. For succinct representation and to avoid skewness of performance improvements in the graph, we've excluded q72.

Performance enhancements

Now that we understand the performance gains with Trino on Amazon EMR, let's delve deeper into some of the key innovations developed by AWS engineering that contribute to these improvements.

Choosing a better join order and join type is critical to better query performance because it can affect how much data is read from a particular table, how much data is transferred to the intermediate stages through the network, and how much memory is needed to build up a hash table to facilitate a join. Join order and join algorithm decisions are typically functions performed by cost-based optimizers, which use statistics to improve query plans by deciding how tables and subqueries are joined. However, table statistics are often not available, out of date, or too expensive to collect on large tables. When statistics aren't available, Amazon EMR and Athena use S3 file metadata to optimize query plans. S3 file metadata is used to infer small subqueries and tables in the query while determining the join order or join type.
For example, consider the following query:

SELECT ss_promo_sk
FROM store_sales ss, store_returns sr, call_center cc
WHERE ss.ss_cdemo_sk = sr.sr_cdemo_sk
  AND ss.ss_customer_sk = cc.cc_call_center_sk
  AND cc_sq_ft > 0

The syntactical join order is store_sales joins store_returns joins call_center. With the Amazon EMR join type and order selection optimization rules, an optimal join order is determined even if these tables don't have statistics. For the preceding query, if call_center is considered a small table after estimating its approximate size through S3 file metadata, EMR's join optimization rules will join store_sales with call_center first and convert the join to a broadcast join, speeding up the query and reducing memory consumption. Join reordering minimizes the intermediate result size, which helps to further reduce the overall query runtime.

With Amazon EMR 6.10.0 and later, the S3 file metadata-based join optimizations are turned on by default. If you are using Amazon EMR 6.8.0 or 6.9.0, you can turn on these optimizations by setting the session properties from Trino clients (see the client sketch after this article) or adding the following properties to the trino-config classification when creating your cluster. Refer to Configure applications for details on how to override the default configurations for an application.

Configuration for join type selection:

session property: rule_based_join_type_selection=true
config property: rule-based-join-type-selection=true

Configuration for join reorder:

session property: rule_based_join_reorder=true
config property: rule-based-join-reorder=true

Conclusion

With Amazon EMR 6.8.0 and later, you can run queries on Trino significantly faster than with open source Trino. As shown in this blog post, our TPC-DS benchmark showed a 2.7 times improvement in total query runtime with Trino on Amazon EMR 6.15.0. The optimizations discussed in this post, and many others, are also available when running Trino queries on Athena, where similar performance improvements are observed. To learn more, refer to Run queries 3x faster with up to 70% cost savings on the latest Amazon Athena engine.

In our mission to innovate on behalf of customers, Amazon EMR and Athena frequently release performance and reliability enhancements in their latest versions. Check the Amazon EMR and Amazon Athena release pages to learn about new features and enhancements.

About the Authors

Bhargavi Sagi is a Software Development Engineer on Amazon Athena. She joined AWS in 2020 and has been working on different areas of Amazon EMR and Athena engine V3, including engine upgrade, engine reliability, and engine performance.

Sushil Kumar Shivashankar is the Engineering Manager for the EMR Trino and Athena Query Engine team. He has been focusing on the big data analytics space since 2014.

View the full article
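As referenced in the article above, here is a minimal sketch of enabling these session properties from a client, using the open source trino Python package; the coordinator host, user, catalog, and schema are placeholders, and these properties exist only on Amazon EMR Trino (6.8.0 and 6.9.0):

import trino

# Connect with the EMR-specific rule-based optimizations enabled for
# this session (connection details are placeholders).
conn = trino.dbapi.connect(
    host="emr-primary-node.example.internal",
    port=8889,  # Trino port on Amazon EMR
    user="hadoop",
    catalog="hive",
    schema="tpcds",
    session_properties={
        "rule_based_join_type_selection": "true",
        "rule_based_join_reorder": "true",
    },
)

cur = conn.cursor()
cur.execute(
    "SELECT ss_promo_sk FROM store_sales ss, store_returns sr, call_center cc "
    "WHERE ss.ss_cdemo_sk = sr.sr_cdemo_sk "
    "AND ss.ss_customer_sk = cc.cc_call_center_sk AND cc_sq_ft > 0"
)
rows = cur.fetchall()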
  9. Amazon EMR Serverless is now in scope for FedRAMP Moderate in the US East (Ohio), US East (N. Virginia), US West (N. California), and US West (Oregon) Regions. You can now use EMR Serverless to run your Apache Spark and Hive workloads that are subject to FedRAMP Moderate compliance. View the full article
  10. Amazon EMR is excited to announce a new capability that enables users to apply AWS Lake Formation based table- and column-level permissions on an Amazon S3 data lake for write operations (INSERT INTO, INSERT OVERWRITE) with Apache Hive jobs submitted using the Amazon EMR Steps API. This feature allows data administrators to define and enforce fine-grained table- and column-level security for customers accessing data via Apache Hive running on Amazon EMR. View the full article
  11. We are excited to launch two new features that help enforce access controls with Amazon EMR on EC2 clusters (EMR clusters). These features are supported with jobs that are submitted to the cluster using the EMR Steps API. The first is Runtime Role with EMR Steps. A runtime role is an AWS Identity and Access Management (IAM) role that you associate with an EMR step, and the EMR step uses this role to access AWS resources. The second is integration with AWS Lake Formation to apply table- and column-level access controls for Apache Spark and Apache Hive jobs with EMR Steps. View the full article
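A minimal boto3 sketch of the first feature, associating a runtime role with an EMR step through the Steps API; the cluster ID, role ARN, and script path are placeholders:

import boto3

emr = boto3.client("emr")

# The step's AWS access is scoped by the runtime role rather than the
# cluster's EC2 instance profile (IDs and ARNs are placeholders).
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    ExecutionRoleArn="arn:aws:iam::111122223333:role/my-emr-runtime-role",
    Steps=[
        {
            "Name": "spark-job-with-runtime-role",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://<bucket>/jobs/etl.py"],
            },
        }
    ],
)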
  12. Amazon EMR Release 6.2 now supports improved Apache HBase performance on Amazon S3 with persistent HFile tracking, and Apache Hive ACID transactions on HDFS and Amazon S3. EMR 6.2 also contains performance improvements to the EMR Runtime for Apache Spark, as well as PrestoDB performance improvements. View the full article