Showing results for tags 'aws glue'.

Found 14 results

  1. AWS Glue Studio Notebooks provides interactive job authoring in AWS Glue, which helps simplify the process of developing data integration jobs. Studio Notebooks is now generally available in six additional AWS Regions: Middle East (UAE), Asia Pacific (Hyderabad), Asia Pacific (Melbourne), Israel (Tel Aviv), Europe (Spain), and Europe (Zurich). View the full article
  2. Last week, we announced the general availability of the integration between Amazon DataZone and AWS Lake Formation hybrid access mode. In this post, we share how this new feature helps you simplify the way you use Amazon DataZone to enable secure and governed sharing of your data in the AWS Glue Data Catalog. We also delve into how data producers can share their AWS Glue tables through Amazon DataZone without needing to register them in Lake Formation first. Overview of the Amazon DataZone integration with Lake Formation hybrid access mode Amazon DataZone is a fully managed data management service to catalog, discover, analyze, share, and govern data between data producers and consumers in your organization. With Amazon DataZone, data producers populate the business data catalog with data assets from data sources such as the AWS Glue Data Catalog and Amazon Redshift. They also enrich their assets with business context to make it straightforward for data consumers to understand. After the data is available in the catalog, data consumers such as analysts and data scientists can search and access this data by requesting subscriptions. When the request is approved, Amazon DataZone can automatically provision access to the data by managing permissions in Lake Formation or Amazon Redshift so that the data consumer can start querying the data using tools such as Amazon Athena or Amazon Redshift. To manage the access to data in the AWS Glue Data Catalog, Amazon DataZone uses Lake Formation. Previously, if you wanted to use Amazon DataZone for managing access to your data in the AWS Glue Data Catalog, you had to onboard your data to Lake Formation first. Now, the integration of Amazon DataZone and Lake Formation hybrid access mode simplifies how you can get started with your Amazon DataZone journey by removing the need to onboard your data to Lake Formation first. Lake Formation hybrid access mode allows you to start managing permissions on your AWS Glue databases and tables through Lake Formation, while continuing to maintain any existing AWS Identity and Access Management (IAM) permissions on these tables and databases. Lake Formation hybrid access mode supports two permission pathways to the same Data Catalog databases and tables: In the first pathway, Lake Formation allows you to select specific principals (opt-in principals) and grant them Lake Formation permissions to access databases and tables by opting in The second pathway allows all other principals (that are not added as opt-in principals) to access these resources through the IAM principal policies for Amazon Simple Storage Service (Amazon S3) and AWS Glue actions With the integration between Amazon DataZone and Lake Formation hybrid access mode, if you have tables in the AWS Glue Data Catalog that are managed through IAM-based policies, you can publish these tables directly to Amazon DataZone, without registering them in Lake Formation. Amazon DataZone registers the location of these tables in Lake Formation using hybrid access mode, which allows managing permissions on AWS Glue tables through Lake Formation, while continuing to maintain any existing IAM permissions. Amazon DataZone enables you to publish any type of asset in the business data catalog. For some of these assets, Amazon DataZone can automatically manage access grants. These assets are called managed assets, and include Lake Formation-managed Data Catalog tables and Amazon Redshift tables and views. 
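To make the hybrid access mode mechanics more concrete, the following is a minimal boto3 sketch of what registering an S3 location in hybrid access mode, granting a consumer role Lake Formation permissions, and opting that role in could look like if you did it yourself. The bucket, role, database, and table names are illustrative placeholders, and in the integration described in this post Amazon DataZone issues the equivalent calls on your behalf.

import boto3

lakeformation = boto3.client("lakeformation", region_name="us-east-1")

# Register the table's S3 location in hybrid access mode so Lake Formation
# permissions can coexist with existing IAM-based access (placeholder names).
lakeformation.register_resource(
    ResourceArn="arn:aws:s3:::example-tickit-bucket/tickit/",
    UseServiceLinkedRole=True,
    HybridAccessEnabled=True,
)

consumer_role = "arn:aws:iam::111122223333:role/finance-consumer-role"
table = {"Table": {"DatabaseName": "sales_db", "Name": "tickit_sales"}}

# Grant the consumer role SELECT on the table through Lake Formation.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": consumer_role},
    Resource=table,
    Permissions=["SELECT"],
)

# Opt the consumer role in so the Lake Formation grant takes effect for it,
# while all other principals keep using their existing IAM permissions.
lakeformation.create_lake_formation_opt_in(
    Principal={"DataLakePrincipalIdentifier": consumer_role},
    Resource=table,
)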
Prior to this integration, you had to complete the following steps before Amazon DataZone could treat the published Data Catalog table as a managed asset:

1. Identify the Amazon S3 location associated with the Data Catalog table.
2. Register the Amazon S3 location with Lake Formation in hybrid access mode using a role with appropriate permissions.
3. Publish the table metadata to the Amazon DataZone business data catalog.

The following diagram illustrates this workflow.

With Amazon DataZone's integration with Lake Formation hybrid access mode, you can simply publish your AWS Glue tables to Amazon DataZone without having to worry about registering the Amazon S3 location or adding an opt-in principal in Lake Formation, because these steps are delegated to Amazon DataZone. The administrator of an AWS account can enable the data location registration setting under the DefaultDataLake blueprint on the Amazon DataZone console. Now, a data owner or publisher can publish their AWS Glue table (managed through IAM permissions) to Amazon DataZone without the extra setup steps. When a data consumer subscribes to this table, Amazon DataZone registers the Amazon S3 locations of the table in hybrid access mode, adds the data consumer's IAM role as an opt-in principal, and grants access to the same IAM role by managing permissions on the table through Lake Formation. This makes sure that IAM permissions on the table can coexist with newly granted Lake Formation permissions, without disrupting any existing workflows. The following diagram illustrates this workflow.

Solution overview

To demonstrate this new capability, we use a sample customer scenario where the finance team wants to access data owned by the sales team for financial analysis and reporting. The sales team has a pipeline that creates a dataset containing valuable information about ticket sales, popular events, venues, and seasons. We call it the tickit dataset. The sales team stores this dataset in Amazon S3 and registers it in a database in the Data Catalog. Access to this table is currently managed through IAM-based permissions. However, the sales team wants to publish this table to Amazon DataZone to facilitate secure and governed data sharing with the finance team. The steps to configure this solution are as follows:

1. The Amazon DataZone administrator enables the data lake location registration setting in Amazon DataZone to automatically register the Amazon S3 location of the AWS Glue tables in Lake Formation hybrid access mode.
2. After the hybrid access mode integration is enabled in Amazon DataZone, the finance team requests a subscription to the sales data asset. The asset shows up as a managed asset, which means Amazon DataZone can manage access to this asset even if the Amazon S3 location of this asset isn't registered in Lake Formation.
3. The sales team is notified of a subscription request raised by the finance team. They review and approve the access request.
4. After the request is approved, Amazon DataZone fulfills the subscription request by managing permissions in Lake Formation. It registers the Amazon S3 location of the subscribed table in Lake Formation hybrid access mode.
5. The finance team gains access to the sales dataset required for their financial reports. They can go to their DataZone environment and start running queries using Athena against their subscribed dataset.

Prerequisites

To follow the steps in this post, you need an AWS account. If you don't have an account, you can create one.
In addition, you must have the following resources configured in your account: An S3 bucket An AWS Glue database and crawler IAM roles for different personas and services An Amazon DataZone domain and project An Amazon DataZone environment profile and environment An Amazon DataZone data source If you don’t have these resources already configured, you can create them by deploying the following AWS CloudFormation stack: Choose Launch Stack to deploy a CloudFormation template. Complete the steps to deploy the template and leave all settings as default. Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Submit. After the CloudFormation deployment is complete, you can log in to the Amazon DataZone portal and manually trigger a data source run. This pulls any new or modified metadata from the source and updates the associated assets in the inventory. This data source has been configured to automatically publish the data assets to the catalog. On the Amazon DataZone console, choose View domains. You should be logged in using the same role that is used to deploy CloudFormation and verify that you are in the same AWS Region. Find the domain blog_dz_domain, then choose Open data portal. Choose Browse all projects and choose Sales producer project. On the Data tab, choose Data sources in the navigation pane. Locate and choose the data source that you want to run. This opens the data source details page. Choose the options menu (three vertical dots) next to tickit_datasource and choose Run. The data source status changes to Running as Amazon DataZone updates the asset metadata. Enable hybrid mode integration in Amazon DataZone In this step, the Amazon DataZone administrator goes through the process of enabling the Amazon DataZone integration with Lake Formation hybrid access mode. Complete the following steps: On a separate browser tab, open the Amazon DataZone console. Verify that you are in the same Region where you deployed the CloudFormation template. Choose View domains. Choose the domain created by AWS CloudFormation, blog_dz_domain. Scroll down on the domain details page and choose the Blueprints tab. A blueprint defines what AWS tools and services can be used with the data assets published in Amazon DataZone. The DefaultDataLake blueprint is enabled as part of the CloudFormation stack deployment. This blueprint enables you to create and query AWS Glue tables using Athena. For the steps to enable this in your own deployments, refer to Enable built-in blueprints in the AWS account that owns the Amazon DataZone domain. Choose the DefaultDataLake blueprint. On the Provisioning tab, choose Edit. Select Enable Amazon DataZone to register S3 locations using AWS Lake Formation hybrid access mode. You have the option of excluding specific Amazon S3 locations if you don’t want Amazon DataZone to automatically register them to Lake Formation hybrid access mode. Choose Save changes. Request access In this step, you log in to Amazon DataZone as the finance team, search for the sales data asset, and subscribe to it. Complete the following steps: Return to your Amazon DataZone data portal browser tab. Switch to the finance consumer project by choosing the dropdown menu next to the project name and choosing Finance consumer project. From this step onwards, you take on the persona of a finance user looking to subscribe to a data asset published in the previous step. In the search bar, search for and choose the sales data asset. Choose Subscribe. 
The asset shows up as a managed asset. This means that Amazon DataZone can grant access to this data asset to the finance team's project by managing the permissions in Lake Formation. Enter a reason for the access request and choose Subscribe.

Approve access request

The sales team gets a notification that an access request from the finance team is submitted. To approve the request, complete the following steps:

1. Choose the dropdown menu next to the project name and choose Sales producer project. You now assume the persona of the sales team, who are the owners and stewards of the sales data assets.
2. Choose the notification icon at the top-right corner of the DataZone portal.
3. Choose the Subscription Request Created task.
4. Grant access to the sales data asset to the finance team and choose Approve.

Analyze the data

The finance team has now been granted access to the sales data, and this dataset has been added to their Amazon DataZone environment. They can access the environment and query the sales dataset with Athena, along with any other datasets they currently own. Complete the following steps:

1. On the dropdown menu, choose Finance consumer project.
2. On the right pane of the project overview screen, you can find a list of active environments available for use. Choose the Amazon DataZone environment finance_dz_environment.
3. In the navigation pane, under Data assets, choose Subscribed.
4. Verify that your environment now has access to the sales data. It may take a few minutes for the data asset to be automatically added to your environment.
5. Choose the new tab icon for Query data. A new tab opens with the Athena query editor.
6. For Database, choose finance_consumer_db_tickitdb-<suffix>. This database will contain your subscribed data assets.
7. Generate a preview of the sales table by choosing the options menu (three vertical dots) and choosing Preview table.

Clean up

To clean up your resources, complete the following steps:

1. Switch back to the administrator role you used to deploy the CloudFormation stack.
2. On the Amazon DataZone console, delete the projects used in this post. This will delete most project-related objects like data assets and environments.
3. On the AWS CloudFormation console, delete the stack you deployed at the beginning of this post.
4. On the Amazon S3 console, delete the S3 buckets containing the tickit dataset.
5. On the Lake Formation console, delete the Lake Formation admins registered by Amazon DataZone.
6. On the Lake Formation console, delete the tables and databases created by Amazon DataZone.

Conclusion

In this post, we discussed how the integration between Amazon DataZone and Lake Formation hybrid access mode simplifies the process to start using Amazon DataZone for end-to-end governance of your data in the AWS Glue Data Catalog. This integration helps you bypass the manual steps of onboarding to Lake Formation before you can start using Amazon DataZone. For more information on how to get started with Amazon DataZone, refer to the Getting started guide. Check out the YouTube playlist for some of the latest demos of Amazon DataZone and short descriptions of the capabilities available. For more information about Amazon DataZone, see How Amazon DataZone helps customers find value in oceans of data.

About the Authors

Utkarsh Mittal is a Senior Technical Product Manager for Amazon DataZone at AWS. He is passionate about building innovative products that simplify customers' end-to-end analytics journeys. Outside of the tech world, Utkarsh loves to play music, with drums being his latest endeavor.
Praveen Kumar is a Principal Analytics Solution Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-centered services. His areas of interest are serverless technology, modern cloud data warehouses, streaming, and generative AI applications.

Paul Villena is a Senior Analytics Solutions Architect at AWS with expertise in building modern data and analytics solutions to drive business value. He works with customers to help them harness the power of the cloud. His areas of interest are infrastructure as code, serverless technologies, and coding in Python. View the full article
  3. Amazon DataZone is used by customers to catalog, discover, analyze, share, and govern data at scale across organizational boundaries with governance and access controls. Today, Amazon DataZone launches integration with AWS Glue Data Quality and offers APIs to integrate data quality metrics from third party data quality solutions. This integration helps Amazon DataZone customers gain trust in their data and make confident business decisions. View the full article
  4. Today, we are pleased to announce that Amazon DataZone is now able to present data quality information for data assets. This information empowers end-users to make informed decisions as to whether or not to use specific assets. Many organizations already use AWS Glue Data Quality to define and enforce data quality rules on their data, validate data against predefined rules, track data quality metrics, and monitor data quality over time using artificial intelligence (AI). Other organizations monitor the quality of their data through third-party solutions. Amazon DataZone now integrates directly with AWS Glue to display data quality scores for AWS Glue Data Catalog assets. Additionally, Amazon DataZone now offers APIs for importing data quality scores from external systems. In this post, we discuss the latest features of Amazon DataZone for data quality, the integration between Amazon DataZone and AWS Glue Data Quality, and how you can import data quality scores produced by external systems into Amazon DataZone via API.

Challenges

One of the most common questions we get from customers is related to displaying data quality scores in the Amazon DataZone business data catalog to let business users have visibility into the health and reliability of the datasets. As data becomes increasingly crucial for driving business decisions, Amazon DataZone users are keenly interested in providing the highest standards of data quality. They recognize the importance of accurate, complete, and timely data in enabling informed decision-making and fostering trust in their analytics and reporting processes. Amazon DataZone data assets can be updated at varying frequencies. As data is refreshed and updated, changes can happen through upstream processes that put it at risk of not maintaining the intended quality. Data quality scores help you understand if data has maintained the expected level of quality for data consumers to use (through analysis or downstream processes). From a producer's perspective, data stewards can now set up Amazon DataZone to automatically import the data quality scores from AWS Glue Data Quality (scheduled or on demand) and include this information in the Amazon DataZone catalog to share with business users. Additionally, you can now use new Amazon DataZone APIs to import data quality scores produced by external systems into the data assets. With the latest enhancement, Amazon DataZone users can now accomplish the following:

• Access insights about data quality standards directly from the Amazon DataZone web portal
• View data quality scores on various KPIs, including data completeness, uniqueness, and accuracy
• Make sure users have a holistic view of the quality and trustworthiness of their data

In the first part of this post, we walk through the integration between AWS Glue Data Quality and Amazon DataZone. We discuss how to visualize data quality scores in Amazon DataZone, enable AWS Glue Data Quality when creating a new Amazon DataZone data source, and enable data quality for an existing data asset. In the second part of this post, we discuss how you can import data quality scores produced by external systems into Amazon DataZone via API. In this example, we use Amazon EMR Serverless in combination with the open source library Pydeequ to act as an external system for data quality.
Visualize AWS Glue Data Quality scores in Amazon DataZone You can now visualize AWS Glue Data Quality scores in data assets that have been published in the Amazon DataZone business catalog and that are searchable through the Amazon DataZone web portal. If the asset has AWS Glue Data Quality enabled, you can now quickly visualize the data quality score directly in the catalog search pane. By selecting the corresponding asset, you can understand its content through the readme, glossary terms, and technical and business metadata. Additionally, the overall quality score indicator is displayed in the Asset Details section. A data quality score serves as an overall indicator of a dataset’s quality, calculated based on the rules you define. On the Data quality tab, you can access the details of data quality overview indicators and the results of the data quality runs. The indicators shown on the Overview tab are calculated based on the results of the rulesets from the data quality runs. Each rule is assigned an attribute that contributes to the calculation of the indicator. For example, rules that have the Completeness attribute will contribute to the calculation of the corresponding indicator on the Overview tab. To filter data quality results, choose the Applicable column dropdown menu and choose your desired filter parameter. You can also visualize column-level data quality starting on the Schema tab. When data quality is enabled for the asset, the data quality results become available, providing insightful quality scores that reflect the integrity and reliability of each column within the dataset. When you choose one of the data quality result links, you’re redirected to the data quality detail page, filtered by the selected column. Data quality historical results in Amazon DataZone Data quality can change over time for many reasons: Data formats may change because of changes in the source systems As data accumulates over time, it may become outdated or inconsistent Data quality can be affected by human errors in data entry, data processing, or data manipulation In Amazon DataZone, you can now track data quality over time to confirm reliability and accuracy. By analyzing the historical report snapshot, you can identify areas for improvement, implement changes, and measure the effectiveness of those changes. Enable AWS Glue Data Quality when creating a new Amazon DataZone data source In this section, we walk through the steps to enable AWS Glue Data Quality when creating a new Amazon DataZone data source. Prerequisites To follow along, you should have a domain for Amazon DataZone, an Amazon DataZone project, and a new Amazon DataZone environment (with a DataLakeProfile). For instructions, refer to Amazon DataZone quickstart with AWS Glue data. You also need to define and run a ruleset against your data, which is a set of data quality rules in AWS Glue Data Quality. 
To set up the data quality rules and for more information on the topic, refer to the following posts: Part 1: Getting started with AWS Glue Data Quality from the AWS Glue Data Catalog Part 2: Getting started with AWS Glue Data Quality for ETL Pipelines Part 3: Set up data quality rules across multiple datasets using AWS Glue Data Quality Part 4: Set up alerts and orchestrate data quality rules with AWS Glue Data Quality Part 5: Visualize data quality score and metrics generated by AWS Glue Data Quality Part 6: Measure performance of AWS Glue Data Quality for ETL pipelines After you create the data quality rules, make sure that Amazon DataZone has the permissions to access the AWS Glue database managed through AWS Lake Formation. For instructions, see Configure Lake Formation permissions for Amazon DataZone. In our example, we have configured a ruleset against a table containing patient data within a healthcare synthetic dataset generated using Synthea. Synthea is a synthetic patient generator that creates realistic patient data and associated medical records that can be used for testing healthcare software applications. The ruleset contains 27 individual rules (one of them failing), so the overall data quality score is 96%. If you use Amazon DataZone managed policies, there is no action needed because these will get automatically updated with the needed actions. Otherwise, you need to allow Amazon DataZone to have the required permissions to list and get AWS Glue Data Quality results, as shown in the Amazon DataZone user guide. Create a data source with data quality enabled In this section, we create a data source and enable data quality. You can also update an existing data source to enable data quality. We use this data source to import metadata information related to our datasets. Amazon DataZone will also import data quality information related to the (one or more) assets contained in the data source. On the Amazon DataZone console, choose Data sources in the navigation pane. Choose Create data source. For Name, enter a name for your data source. For Data source type, select AWS Glue. For Environment, choose your environment. For Database name, enter a name for the database. For Table selection criteria, choose your criteria. Choose Next. For Data quality, select Enable data quality for this data source. If data quality is enabled, Amazon DataZone will automatically fetch data quality scores from AWS Glue at each data source run. Choose Next. Now you can run the data source. While running the data source, Amazon DataZone imports the last 100 AWS Glue Data Quality run results. This information is now visible on the asset page and will be visible to all Amazon DataZone users after publishing the asset. Enable data quality for an existing data asset In this section, we enable data quality for an existing asset. This might be useful for users that already have data sources in place and want to enable the feature afterwards. Prerequisites To follow along, you should have already run the data source and produced an AWS Glue table data asset. Additionally, you should have defined a ruleset in AWS Glue Data Quality over the target table in the Data Catalog. For this example, we ran the data quality job multiple times against the table, producing the related AWS Glue Data Quality scores, as shown in the following screenshot. 
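If you prefer to define the ruleset programmatically instead of through the console, the following is a minimal boto3 sketch of creating an AWS Glue Data Quality ruleset with DQDL; the database, table, and the handful of rules shown are illustrative placeholders and are far simpler than the 27-rule ruleset used in this example.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# A small DQDL ruleset for a hypothetical patients table (placeholder names).
ruleset = """
Rules = [
    IsComplete "birthdate",
    IsUnique "id",
    IsComplete "ssn"
]
"""

glue.create_data_quality_ruleset(
    Name="patients-ruleset",
    Description="Sample data quality rules for the synthetic patients table",
    Ruleset=ruleset,
    TargetTable={
        "DatabaseName": "healthcare_db",
        "TableName": "patients",
    },
)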
Import data quality scores into the data asset

Complete the following steps to import the existing AWS Glue Data Quality scores into the data asset in Amazon DataZone:

1. Within the Amazon DataZone project, navigate to the Inventory data pane and choose the data source. If you choose the Data quality tab, you can see that there's still no information on data quality because AWS Glue Data Quality integration is not enabled for this data asset yet.
2. On the Data quality tab, choose Enable data quality.
3. In the Data quality section, select Enable data quality for this data source.
4. Choose Save.

Now, back on the Inventory data pane, you can see a new tab: Data quality. On the Data quality tab, you can see data quality scores imported from AWS Glue Data Quality.

Ingest data quality scores from an external source using Amazon DataZone APIs

Many organizations already use systems that calculate data quality by performing tests and assertions on their datasets. Amazon DataZone now supports importing third-party originated data quality scores via API, allowing users that navigate the web portal to view this information. In this section, we simulate a third-party system pushing data quality scores into Amazon DataZone via APIs through Boto3 (Python SDK for AWS). For this example, we use the same synthetic dataset as earlier, generated with Synthea. The following diagram illustrates the solution architecture. The workflow consists of the following steps:

1. Read a dataset of patients in Amazon Simple Storage Service (Amazon S3) directly from Amazon EMR using Spark. The dataset is created as a generic S3 asset collection in Amazon DataZone.
2. In Amazon EMR, perform data validation rules against the dataset.
3. The metrics are saved in Amazon S3 to have a persistent output.
4. Use Amazon DataZone APIs through Boto3 to push custom data quality metadata.
5. End-users can see the data quality scores by navigating to the data portal.

Prerequisites

We use Amazon EMR Serverless and Pydeequ to run a fully managed Spark environment. To learn more about Pydeequ as a data testing framework, see Testing Data quality at scale with Pydeequ. To allow Amazon EMR to send data to the Amazon DataZone domain, make sure that the IAM role used by Amazon EMR has the permissions to do the following:

• Read from and write to the S3 buckets
• Call the post_time_series_data_points action for Amazon DataZone:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Action": [
                "datazone:PostTimeSeriesDataPoints"
            ],
            "Resource": [
                "<datazone_domain_arn>"
            ]
        }
    ]
}

Make sure that you added the EMR role as a project member in the Amazon DataZone project. On the Amazon DataZone console, navigate to the Project members page and choose Add members. Add the EMR role as a contributor.

Ingest and analyze PySpark code

In this section, we analyze the PySpark code that we use to perform data quality checks and send the results to Amazon DataZone. You can download the complete PySpark script. To run the script entirely, you can submit a job to EMR Serverless. The service will take care of scheduling the job and automatically allocating the resources needed, enabling you to track the job run statuses throughout the process. You can submit a job to EMR within the Amazon EMR console using EMR Studio or programmatically, using the AWS CLI or using one of the AWS SDKs. In Apache Spark, a SparkSession is the entry point for interacting with DataFrames and Spark's built-in functions.
The script starts by initializing a SparkSession:

with SparkSession.builder.appName("PatientsDataValidation") \
        .config("spark.jars.packages", pydeequ.deequ_maven_coord) \
        .config("spark.jars.excludes", pydeequ.f2j_maven_coord) \
        .getOrCreate() as spark:

We read a dataset from Amazon S3. For increased modularity, you can use the script input to refer to the S3 path:

s3inputFilepath = sys.argv[1]
s3outputLocation = sys.argv[2]

df = spark.read.format("csv") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load(s3inputFilepath)  # s3://<bucket_name>/patients/patients.csv

Next, we set up a metrics repository. This can be helpful to persist the run results in Amazon S3:

metricsRepository = FileSystemMetricsRepository(spark, s3_write_path)

Pydeequ allows you to create data quality rules using the builder pattern, a well-known software engineering design pattern, concatenating instructions to instantiate a VerificationSuite object:

key_tags = {'tag': 'patient_df'}
resultKey = ResultKey(spark, ResultKey.current_milli_time(), key_tags)
check = Check(spark, CheckLevel.Error, "Integrity checks")

checkResult = VerificationSuite(spark) \
    .onData(df) \
    .useRepository(metricsRepository) \
    .addCheck(
        check.hasSize(lambda x: x >= 1000) \
        .isComplete("birthdate") \
        .isUnique("id") \
        .isComplete("ssn") \
        .isComplete("first") \
        .isComplete("last") \
        .hasMin("healthcare_coverage", lambda x: x == 1000.0)) \
    .saveOrAppendResult(resultKey) \
    .run()

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()

The following is the output for the data validation rules:

|check           |check_level|check_status|constraint                                          |constraint_status|constraint_message                                  |
|Integrity checks|Error      |Error       |SizeConstraint(Size(None))                          |Success          |                                                    |
|Integrity checks|Error      |Error       |CompletenessConstraint(Completeness(birthdate,None))|Success          |                                                    |
|Integrity checks|Error      |Error       |UniquenessConstraint(Uniqueness(List(id),None))     |Success          |                                                    |
|Integrity checks|Error      |Error       |CompletenessConstraint(Completeness(ssn,None))      |Success          |                                                    |
|Integrity checks|Error      |Error       |CompletenessConstraint(Completeness(first,None))    |Success          |                                                    |
|Integrity checks|Error      |Error       |CompletenessConstraint(Completeness(last,None))     |Success          |                                                    |
|Integrity checks|Error      |Error       |MinimumConstraint(Minimum(healthcare_coverage,None))|Failure          |Value: 0.0 does not meet the constraint requirement!|

At this point, we want to insert these data quality values in Amazon DataZone. To do so, we use the post_time_series_data_points function in the Boto3 Amazon DataZone client. The PostTimeSeriesDataPoints DataZone API allows you to insert new time series data points for a given asset or listing, without creating a new revision. At this point, you might also want to have more information on which fields are sent as input for the API. You can use the APIs to obtain the specification for Amazon DataZone form types; in our case, it's amazon.datazone.DataQualityResultFormType.
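As a hedged illustration, retrieving that form type specification with boto3 could look like the following sketch (the domain identifier is a placeholder):

import boto3

datazone = boto3.client("datazone", region_name="us-east-1")

# Fetch the data quality form type definition from the DataZone domain.
response = datazone.get_form_type(
    domainIdentifier="dzd_exampledomainid",
    formTypeIdentifier="amazon.datazone.DataQualityResultFormType",
)

# The Smithy model describing the expected fields is returned under model.smithy.
print(response["model"]["smithy"])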
You can also use the AWS CLI to invoke the API and display the form structure:

aws datazone get-form-type --domain-identifier <your_domain_id> --form-type-identifier amazon.datazone.DataQualityResultFormType --region <domain_region> --output text --query 'model.smithy'

This output helps identify the required API parameters, including fields and value limits:

$version: "2.0"
namespace amazon.datazone

structure DataQualityResultFormType {
    @amazon.datazone#timeSeriesSummary
    @range(min: 0, max: 100)
    passingPercentage: Double
    @amazon.datazone#timeSeriesSummary
    evaluationsCount: Integer
    evaluations: EvaluationResults
}

@length(min: 0, max: 2000)
list EvaluationResults {
    member: EvaluationResult
}

@length(min: 0, max: 20)
list ApplicableFields {
    member: String
}

@length(min: 0, max: 20)
list EvaluationTypes {
    member: String
}

enum EvaluationStatus {
    PASS,
    FAIL
}

string EvaluationDetailType

map EvaluationDetails {
    key: EvaluationDetailType
    value: String
}

structure EvaluationResult {
    description: String
    types: EvaluationTypes
    applicableFields: ApplicableFields
    status: EvaluationStatus
    details: EvaluationDetails
}

To send the appropriate form data, we need to convert the Pydeequ output to match the DataQualityResultFormType contract. This can be achieved with a Python function that processes the results (a minimal sketch of such a function appears after this section). For each DataFrame row, we extract information from the constraint column. For example, take the following value:

CompletenessConstraint(Completeness(birthdate,None))

We convert it to the following:

{
    "constraint": "CompletenessConstraint",
    "statisticName": "Completeness_custom",
    "column": "birthdate"
}

Make sure to send an output that matches the KPIs that you want to track. In our case, we are appending _custom to the statistic name, resulting in the following format for KPIs:

Completeness_custom
Uniqueness_custom

In a real-world scenario, you might want to set a value that matches with your data quality framework in relation to the KPIs that you want to track in Amazon DataZone. After applying a transformation function, we have a Python object for each rule evaluation:

..., {
    'applicableFields': ["healthcare_coverage"],
    'types': ["Minimum_custom"],
    'status': 'FAIL',
    'description': 'MinimumConstraint - Minimum - Value: 0.0 does not meet the constraint requirement!'
}, ...

We also use the constraint_status column to compute the overall score: (number of successful evaluations / total number of evaluations) * 100. In our example, this results in a passing percentage of 85.71%.
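The exact mapping depends on your own naming conventions. The following is a minimal sketch, under those assumptions, of how the Pydeequ results DataFrame shown earlier could be turned into the evaluations list and the overall score; the parsing is deliberately simple and the helper name is hypothetical.

def to_datazone_evaluations(check_result_df):
    """Convert Pydeequ check results into DataZone evaluation objects and a score.

    A sketch that assumes constraint strings shaped like
    'CompletenessConstraint(Completeness(birthdate,None))'.
    """
    rows = check_result_df.collect()
    evaluations = []
    passed = 0
    for row in rows:
        parts = row["constraint"].split("(")
        constraint_name = parts[0]                                   # e.g. CompletenessConstraint
        statistic = parts[1] if len(parts) > 1 else constraint_name  # e.g. Completeness
        raw_column = parts[2].split(",")[0].strip(")") if len(parts) > 2 else ""
        column = "" if raw_column == "None" else raw_column          # e.g. birthdate
        status = "PASS" if row["constraint_status"] == "Success" else "FAIL"
        if status == "PASS":
            passed += 1
        evaluations.append({
            "description": f"{constraint_name} - {statistic} - {row['constraint_message']}".rstrip(" -"),
            "applicableFields": [column] if column else [],
            "types": [f"{statistic}_custom"],
            "status": status,
        })
    passing_percentage = round(passed / len(rows) * 100, 2)
    return evaluations, passing_percentage

evaluations, passing_percentage = to_datazone_evaluations(checkResult_df)  # 85.71 for this run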
We set this value in the passingPercentage input field along with the other information related to the evaluations in the input of the Boto3 method post_time_series_data_points:

import boto3

# Instantiate the client library to communicate with Amazon DataZone Service
datazone = boto3.client(
    service_name='datazone',
    region_name=<Region(String) example: us-east-1>
)

# Perform the API operation to push the Data Quality information to Amazon DataZone
datazone.post_time_series_data_points(
    domainIdentifier=<DataZone domain ID>,
    entityIdentifier=<DataZone asset ID>,
    entityType='ASSET',
    forms=[
        {
            "content": json.dumps({
                "evaluationsCount": <Number of evaluations (number)>,
                "evaluations": [<List of objects {
                    'description': <Description (String)>,
                    'applicableFields': [<List of columns involved (String)>],
                    'types': [<List of KPIs (String)>],
                    'status': <FAIL/PASS (string)>
                }>],
                "passingPercentage": <Score (number)>
            }),
            "formName": <Form name (String) example: PydeequRuleSet1>,
            "typeIdentifier": "amazon.datazone.DataQualityResultFormType",
            "timestamp": <Date (timestamp)>
        }
    ]
)

Boto3 invokes the Amazon DataZone APIs. In these examples, we used Boto3 and Python, but you can choose one of the AWS SDKs developed in the language you prefer. After setting the appropriate domain and asset ID and running the method, we can check on the Amazon DataZone console that the asset data quality is now visible on the asset page. We can observe that the overall score matches the API input value. We can also see that we were able to add customized KPIs on the Overview tab through custom types parameter values.

With the new Amazon DataZone APIs, you can load data quality rules from third-party systems into a specific data asset. With this capability, Amazon DataZone allows you to extend the types of indicators present in AWS Glue Data Quality (such as completeness, minimum, and uniqueness) with custom indicators.

Clean up

We recommend deleting any potentially unused resources to avoid incurring unexpected costs. For example, you can delete the Amazon DataZone domain and the EMR application you created during this process.

Conclusion

In this post, we highlighted the latest features of Amazon DataZone for data quality, empowering end-users with enhanced context and visibility into their data assets. Furthermore, we delved into the seamless integration between Amazon DataZone and AWS Glue Data Quality. You can also use the Amazon DataZone APIs to integrate with external data quality providers, enabling you to maintain a comprehensive and robust data strategy within your AWS environment. To learn more about Amazon DataZone, refer to the Amazon DataZone User Guide.

About the Authors

Andrea Filippo is a Partner Solutions Architect at AWS supporting Public Sector partners and customers in Italy. He focuses on modern data architectures and helping customers accelerate their cloud journey with serverless technologies.

Emanuele is a Solutions Architect at AWS, based in Italy, after living and working for more than 5 years in Spain. He enjoys helping large companies with the adoption of cloud technologies, and his area of expertise is mainly focused on Data Analytics and Data Management. Outside of work, he enjoys traveling and collecting action figures.

Varsha Velagapudi is a Senior Technical Product Manager with Amazon DataZone at AWS. She focuses on improving data discovery and curation required for data analytics.
She is passionate about simplifying customers’ AI/ML and analytics journey to help them succeed in their day-to-day tasks. Outside of work, she enjoys nature and outdoor activities, reading, and traveling. View the full article
  5. This post is co-written with Andries Engelbrecht and Scott Teal from Snowflake. Businesses are constantly evolving, and data leaders are challenged every day to meet new requirements. For many enterprises and large organizations, it is not feasible to have one processing engine or tool to deal with the various business requirements. They understand that a one-size-fits-all approach no longer works, and recognize the value in adopting scalable, flexible tools and open data formats to support interoperability in a modern data architecture to accelerate the delivery of new solutions. Customers are using AWS and Snowflake to develop purpose-built data architectures that provide the performance required for modern analytics and artificial intelligence (AI) use cases. Implementing these solutions requires data sharing between purpose-built data stores. This is why Snowflake and AWS are delivering enhanced support for Apache Iceberg to enable and facilitate data interoperability between data services. Apache Iceberg is an open-source table format that provides reliability, simplicity, and high performance for large datasets with transactional integrity between various processing engines. In this post, we discuss the following:

• Advantages of Iceberg tables for data lakes
• Two architectural patterns for sharing Iceberg tables between AWS and Snowflake: managing your Iceberg tables with AWS Glue Data Catalog, or managing your Iceberg tables with Snowflake
• The process of converting existing data lake tables to Iceberg tables without copying the data

Now that you have a high-level understanding of the topics, let's dive into each of them in detail.

Advantages of Apache Iceberg

Apache Iceberg is a distributed, community-driven, Apache 2.0-licensed, 100% open-source data table format that helps simplify data processing on large datasets stored in data lakes. Data engineers use Apache Iceberg because it's fast, efficient, and reliable at any scale and keeps records of how datasets change over time. Apache Iceberg offers integrations with popular data processing frameworks such as Apache Spark, Apache Flink, Apache Hive, Presto, and more. Iceberg tables maintain metadata to abstract large collections of files, providing data management features including time travel, rollback, data compaction, and full schema evolution, reducing management overhead. Originally developed at Netflix before being open sourced to the Apache Software Foundation, Apache Iceberg was a blank-slate design to solve common data lake challenges like user experience, reliability, and performance, and is now supported by a robust community of developers focused on continually improving and adding new features to the project, serving real user needs and providing them with optionality.

Transactional data lakes built on AWS and Snowflake

Snowflake provides various integrations for Iceberg tables with multiple storage options, including Amazon S3, and multiple catalog options, including AWS Glue Data Catalog and Snowflake. AWS provides integrations for various AWS services with Iceberg tables as well, including AWS Glue Data Catalog for tracking table metadata. Combining Snowflake and AWS gives you multiple options to build out a transactional data lake for analytical and other use cases such as data sharing and collaboration. By adding a metadata layer to data lakes, you get a better user experience, simplified management, and improved performance and reliability on very large datasets.
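As a concrete illustration of how the first pattern below uses the AWS Glue Data Catalog as the Iceberg catalog, here is a minimal PySpark sketch of an AWS Glue for Apache Spark job writing an Iceberg table; the catalog name, warehouse path, database, and table names are placeholders, and in a Glue job these settings are often passed as job parameters instead.

from pyspark.sql import SparkSession

# Configure an Iceberg catalog backed by the AWS Glue Data Catalog (placeholder names).
spark = (
    SparkSession.builder.appName("IcebergOnGlueExample")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://example-bucket/iceberg-warehouse/")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)

# Write a DataFrame as an Iceberg table registered in the Glue Data Catalog,
# where Snowflake and other engines can later discover and query it.
df = spark.createDataFrame([(1, "2024-01-01")], ["ticket_id", "sale_date"])
df.writeTo("glue_catalog.sales_db.ticket_sales").using("iceberg").createOrReplace()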
Manage your Iceberg table with AWS Glue You can use AWS Glue to ingest, catalog, transform, and manage the data on Amazon Simple Storage Service (Amazon S3). AWS Glue is a serverless data integration service that allows you to visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes in Iceberg format. With AWS Glue, you can discover and connect to more than 70 diverse data sources and manage your data in a centralized data catalog. Snowflake integrates with AWS Glue Data Catalog to access the Iceberg table catalog and the files on Amazon S3 for analytical queries. This greatly improves performance and compute cost in comparison to external tables on Snowflake, because the additional metadata improves pruning in query plans. You can use this same integration to take advantage of the data sharing and collaboration capabilities in Snowflake. This can be very powerful if you have data in Amazon S3 and need to enable Snowflake data sharing with other business units, partners, suppliers, or customers. The following architecture diagram provides a high-level overview of this pattern. The workflow includes the following steps: AWS Glue extracts data from applications, databases, and streaming sources. AWS Glue then transforms it and loads it into the data lake in Amazon S3 in Iceberg table format, while inserting and updating the metadata about the Iceberg table in AWS Glue Data Catalog. The AWS Glue crawler generates and updates Iceberg table metadata and stores it in AWS Glue Data Catalog for existing Iceberg tables on an S3 data lake. Snowflake integrates with AWS Glue Data Catalog to retrieve the snapshot location. In the event of a query, Snowflake uses the snapshot location from AWS Glue Data Catalog to read Iceberg table data in Amazon S3. Snowflake can query across Iceberg and Snowflake table formats. You can share data for collaboration with one or more accounts in the same Snowflake region. You can also use data in Snowflake for visualization using Amazon QuickSight, or use it for machine learning (ML) and artificial intelligence (AI) purposes with Amazon SageMaker. Manage your Iceberg table with Snowflake A second pattern also provides interoperability across AWS and Snowflake, but implements data engineering pipelines for ingestion and transformation to Snowflake. In this pattern, data is loaded to Iceberg tables by Snowflake through integrations with AWS services like AWS Glue or through other sources like Snowpipe. Snowflake then writes data directly to Amazon S3 in Iceberg format for downstream access by Snowflake and various AWS services, and Snowflake manages the Iceberg catalog that tracks snapshot locations across tables for AWS services to access. Like the previous pattern, you can use Snowflake-managed Iceberg tables with Snowflake data sharing, but you can also use S3 to share datasets in cases where one party does not have access to Snowflake. The following architecture diagram provides an overview of this pattern with Snowflake-managed Iceberg tables. This workflow consists of the following steps: In addition to loading data via the COPY command, Snowpipe, and the native Snowflake connector for AWS Glue, you can integrate data via the Snowflake Data Sharing. Snowflake writes Iceberg tables to Amazon S3 and updates metadata automatically with every transaction. Iceberg tables in Amazon S3 are queried by Snowflake for analytical and ML workloads using services like QuickSight and SageMaker. 
Apache Spark services on AWS can access snapshot locations from Snowflake via a Snowflake Iceberg Catalog SDK and directly scan the Iceberg table files in Amazon S3. Comparing solutions These two patterns highlight options available to data personas today to maximize their data interoperability between Snowflake and AWS using Apache Iceberg. But which pattern is ideal for your use case? If you’re already using AWS Glue Data Catalog and only require Snowflake for read queries, then the first pattern can integrate Snowflake with AWS Glue and Amazon S3 to query Iceberg tables. If you’re not already using AWS Glue Data Catalog and require Snowflake to perform reads and writes, then the second pattern is likely a good solution that allows for storing and accessing data from AWS. Considering that reads and writes will probably operate on a per-table basis rather than the entire data architecture, it is advisable to use a combination of both patterns. Migrate existing data lakes to a transactional data lake using Apache Iceberg You can convert existing Parquet, ORC, and Avro-based data lake tables on Amazon S3 to Iceberg format to reap the benefits of transactional integrity while improving performance and user experience. There are several Iceberg table migration options (SNAPSHOT, MIGRATE, and ADD_FILES) for migrating existing data lake tables in-place to Iceberg format, which is preferable to rewriting all of the underlying data files—a costly and time-consuming effort with large datasets. In this section, we focus on ADD_FILES, because it’s useful for custom migrations. For ADD_FILES options, you can use AWS Glue to generate Iceberg metadata and statistics for an existing data lake table and create new Iceberg tables in AWS Glue Data Catalog for future use without needing to rewrite the underlying data. For instructions on generating Iceberg metadata and statistics using AWS Glue, refer to Migrate an existing data lake to a transactional data lake using Apache Iceberg or Convert existing Amazon S3 data lake tables to Snowflake Unmanaged Iceberg tables using AWS Glue. This option requires that you pause data pipelines while converting the files to Iceberg tables, which is a straightforward process in AWS Glue because the destination just needs to be changed to an Iceberg table. Conclusion In this post, you saw the two architecture patterns for implementing Apache Iceberg in a data lake for better interoperability across AWS and Snowflake. We also provided guidance on migrating existing data lake tables to Iceberg format. Sign up for AWS Dev Day on April 10 to get hands-on not only with Apache Iceberg, but also with streaming data pipelines with Amazon Data Firehose and Snowpipe Streaming, and generative AI applications with Streamlit in Snowflake and Amazon Bedrock. About the Authors Andries Engelbrecht is a Principal Partner Solutions Architect at Snowflake and works with strategic partners. He is actively engaged with strategic partners like AWS supporting product and service integrations as well as the development of joint solutions with partners. Andries has over 20 years of experience in the field of data and analytics. Deenbandhu Prasad is a Senior Analytics Specialist at AWS, specializing in big data services. He is passionate about helping customers build modern data architectures on the AWS Cloud. He has helped customers of all sizes implement data management, data warehouse, and data lake solutions. 
Brian Dolan joined Amazon as a Military Relations Manager in 2012 after his first career as a Naval Aviator. In 2014, Brian joined Amazon Web Services, where he helped Canadian customers from startups to enterprises explore the AWS Cloud. Most recently, Brian was a member of the Non-Relational Business Development team as a Go-To-Market Specialist for Amazon DynamoDB and Amazon Keyspaces before joining the Analytics Worldwide Specialist Organization in 2022 as a Go-To-Market Specialist for AWS Glue. Nidhi Gupta is a Sr. Partner Solution Architect at AWS. She spends her days working with customers and partners, solving architectural challenges. She is passionate about data integration and orchestration, serverless and big data processing, and machine learning. Nidhi has extensive experience leading the architecture design and production release and deployments for data workloads. Scott Teal is a Product Marketing Lead at Snowflake and focuses on data lakes, storage, and governance. View the full article
  6. Data Engineering Tools in 2024 The data engineering landscape in 2024 is bustling with innovative tools and evolving trends. Here’s an updated perspective on some of the key players and how they can empower your data pipelines: Data Integration: Informatica Cloud: Still a leader for advanced data quality and governance, with enhanced cloud-native capabilities. MuleSoft Anypoint Platform: Continues to shine in building API-based integrations, now with deeper cloud support and security features. Fivetran: Expands its automated data pipeline creation with pre-built connectors and advanced transformations. Hevo Data: Remains a strong contender for ease of use and affordability, now offering serverless options for scalability. Data Warehousing: Snowflake: Maintains its edge in cloud-based warehousing, with improved performance and broader integrations for analytics. Google BigQuery: Offers even more cost-effective options for variable workloads, while deepening its integration with other Google Cloud services. Amazon Redshift: Continues to be a powerful choice for AWS environments, now with increased focus on security and data governance. Microsoft Azure Synapse Analytics: Further integrates its data warehousing, lake, and analytics capabilities, providing a unified platform for diverse data needs. Data Processing and Orchestration: Apache Spark: Remains the reigning champion for large-scale data processing, now with enhanced performance optimizations and broader ecosystem support. Apache Airflow: Maintains its popularity for workflow orchestration, with improved scalability and user-friendliness. Databricks: Expands its cloud-based platform for Spark with advanced features like AI integration and real-time streaming. AWS Glue: Simplifies data processing and ETL within the AWS ecosystem, now with serverless options for cost efficiency. Emerging Trends: GitOps: Gaining traction for managing data pipelines with version control and collaboration, ensuring consistency and traceability. AI and Machine Learning: Increasingly integrated into data engineering tools for automation, anomaly detection, and data quality improvement. Serverless Data Processing: Offering cost-effective and scalable options for event-driven and real-time data processing. Choosing the right tools: With this diverse landscape, selecting the right tools depends on your specific needs. Consider factors like: Data volume and complexity: Match tool capabilities to your data size and structure. Cloud vs. on-premises: Choose based on your infrastructure preferences and security requirements. Budget: Evaluate pricing models and potential costs associated with each tool. Integration needs: Ensure seamless compatibility with your existing data sources and BI tools. Skillset: Consider the technical expertise required for each tool and available support resources. By carefully evaluating your needs and exploring the strengths and limitations of these top contenders, you’ll be well-equipped to choose the data engineering tools that empower your organization to unlock valuable insights from your data in 2024. The post Data Engineering Tools in 2024 appeared first on DevOpsSchool.com. View the full article
  7. With all the generative AI announcements at AWS re:invent 2023, I’ve committed to dive deep into this technology and learn as much as I can. If you are too, I’m happy that among other resources available, the AWS community also has a space that I can access for generative AI tools and guides. Last week’s launches Here are some launches that got my attention during the previous week. Amazon Q data integration in AWS Glue (Preview) – Now you can use natural language to ask Amazon Q to author jobs, troubleshoot issues, and answer questions about AWS Glue and data integration. Amazon Q was launched in preview at AWS re:invent 2023, and is a generative AI–powered assistant to help you solve problems, generate content, and take action. General availability of CDK Migrate – CDK Migrate is a component of the AWS Cloud Development Kit (CDK) that enables you to migrate AWS CloudFormation templates, previously deployed CloudFormation stacks, or resources created outside of Infrastructure as Code (IaC) into a CDK application. This feature was launched alongside the CloudFormation IaC Generator to give you an end-to-end experience that enables you to create an IaC configuration based off a resource, as well as its relationships. You can expect the IaC generator to have a huge impact for a common use case we’ve seen. For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page. Other AWS news Here are some additional projects, programs, and news items that you might find interesting: Amazon API Gateway processed over 100 trillion API requests in 2023, demonstrating the growing demand for API-driven applications. API Gateway is a fully-managed API management service. Customers from all industry verticals told us they’re adopting API Gateway for multiple reasons. First, its ability to scale to meet the demands of even the most high-traffic applications. Second, its fully-managed, serverless architecture, which eliminates the need to manage any infrastructure, and frees customers to focus on their core business needs. Join the PartyRock Generative AI Hackathon by AWS. This is a challenge for you to get hands-on building generative AI-powered apps. You’ll use Amazon PartyRock, an Amazon Bedrock Playground, as a fast and fun way to learn about Prompt Engineering and Foundational Models (FMs) to build a functional app with generative AI. AWS open source news and updates – My colleague Ricardo writes this weekly open source newsletter in which he highlights new open source projects, tools, and demos from the AWS Community. Upcoming AWS events Whether you’re in the Americas, Asia Pacific & Japan, or EMEA region, there’s an upcoming AWS Innovate Online event that fits your timezone. Innovate Online events are free, online, and designed to inspire and educate you about AWS. AWS Summits are a series of free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. These events are designed to educate you about AWS products and services and help you develop the skills needed to build, deploy, and operate your infrastructure and applications. Find an AWS Summit near you and register or set a notification to know when registration opens for a Summit that interests you. AWS Community re:Invent re:Caps – Join a Community re:Cap event organized by volunteers from AWS User Groups and AWS Cloud Clubs around the world to learn about the latest announcements from AWS re:Invent. 
You can browse all upcoming in-person and virtual events. That’s all for this week. Check back next Monday for another Weekly Roundup! – Veliswa This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS! View the full article
  8. Today we’re previewing a new chat experience for AWS Glue that will let you use natural language to author and troubleshoot data integration jobs. Amazon Q data integration in AWS Glue will reduce the time and effort you need to learn, build, and run data integration jobs using AWS Glue data integration engines. You can author jobs, troubleshoot issues, and get instant answers to questions about AWS Glue and anything related to data integration. The chat experience is powered by Amazon Bedrock. You can describe your data integration workload and Amazon Q will generate a complete ETL script. You can troubleshoot your jobs by asking Amazon Q to explain errors and propose solutions. Amazon Q provides detailed guidance throughout the entire data integration workflow, helps you learn and build data integration jobs using AWS Glue, and can help you connect to common AWS sources such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, and Amazon DynamoDB. Let me show you some capabilities of Amazon Q data integration in AWS Glue.

1. Conversational Q&A capability

To start using this feature, I can select the Amazon Q icon on the right-hand side of the AWS Management Console. For example, I can ask, “What is AWS Glue,” and Amazon Q provides concise explanations along with references I can use to follow up on my questions and validate the guidance. With Amazon Q, I can elaborate on my use cases in more detail to provide context. For example, I can ask Amazon Q, “How do I create an AWS Glue job?” Next, let me ask Amazon Q, “How do I optimize memory management in my AWS Glue job?”

2. AWS Glue job creation

To use this feature, I can tell Amazon Q, “Write a Glue ETL job that reads from Redshift, drops null fields, and writes to S3 as parquet files.” I can copy code into the script editor or notebook with a simple click on the Copy button (a hand-written sketch of what such a script can look like follows this item). I can also tell Amazon Q, “Help me with a Glue job that reads my DynamoDB table, maps the fields, and writes the results to Amazon S3 in Parquet format”.

Get started with Amazon Q today

With Amazon Q, you have an artificial intelligence (AI) expert by your side to answer questions, write code faster, troubleshoot issues, optimize workloads, and even help you code new features. These capabilities simplify every phase of building applications on AWS. Amazon Q data integration in AWS Glue is available in every region where Amazon Q is supported. To learn more, see the Amazon Q pricing page.

Learn more: Amazon Q main product page, Amazon Q data integration, Amazon Q details for IT pros and developers, Get started with Amazon Q.

— Irshad

View the full article
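As an illustration of the kind of script that first prompt describes, the following is a hand-written sketch (not actual Amazon Q output) that reads a Redshift-backed table from the Glue Data Catalog, drops null fields, and writes Parquet to Amazon S3; the database, table, and bucket names are placeholders.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import DropNullFields
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table (a catalog table backed by Redshift; names are placeholders).
redshift_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="example_catalog_db",
    table_name="example_redshift_table",
    redshift_tmp_dir="s3://example-bucket/temp/",
)

# Remove null-type fields from the DynamicFrame.
cleaned_dyf = DropNullFields.apply(frame=redshift_dyf)

# Write the result to Amazon S3 as Parquet files.
glue_context.write_dynamic_frame.from_options(
    frame=cleaned_dyf,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/output/"},
    format="parquet",
)

job.commit()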
  9. AWS Glue now supports GitLab and BitBucket, alongside GitHub and AWS CodeCommit, broadening your toolset for managing data integration pipeline deployments. AWS Glue is a serverless data integration service that makes it simpler to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. View the full article
  10. AWS Lake Formation and the Glue Data Catalog now extend data cataloging, data sharing and fine-grained access control support for customers using a self-managed Apache Hive Metastore (HMS) as their data catalog. Previously, customers had to replicate their metadata into the AWS Glue Data Catalog in order use Lake Formation permissions and data sharing capabilities. Now, customers can integrate their HMS metadata within AWS, allowing them to discover data alongside native tables in the Glue data catalog, manage permissions and sharing from Lake Formation, and query data using AWS analytics services. View the full article
  11. AWS Glue crawlers now have enhanced support for Linux Foundation Delta Lake tables, increasing operational efficiency to extract meaningful insights from analytics services such as Amazon Athena, Amazon EMR, and AWS Glue. This feature enables analytics services to scan Delta Lake tables without requiring Glue crawlers to create manifest files. Newly cataloged data is now quickly made available for analysis using your preferred analytics and machine learning (ML) tools (a minimal crawler configuration sketch follows this item). View the full article
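A hedged boto3 sketch of what a crawler with a Delta Lake target might look like is shown below; the role, database, and S3 path are placeholders, and the native-table option is what avoids generating manifest files.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a crawler that catalogs a Delta Lake table as a native Delta table
# (role, database, and paths are placeholders).
glue.create_crawler(
    Name="delta-lake-crawler",
    Role="arn:aws:iam::111122223333:role/example-glue-crawler-role",
    DatabaseName="delta_db",
    Targets={
        "DeltaTargets": [
            {
                "DeltaTables": ["s3://example-bucket/delta/sales_table/"],
                "WriteManifest": False,
                "CreateNativeDeltaTable": True,
            }
        ]
    },
)

glue.start_crawler(Name="delta-lake-crawler")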
  12. AWS Glue for Apache Spark now supports three open source data lake storage frameworks: Apache Hudi, Apache Iceberg, and Linux Foundation Delta Lake. These frameworks allow you to read and write data in Amazon Simple Storage Service (Amazon S3) in a transactionally consistent manner. AWS Glue is a serverless, scalable data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources. This feature removes the need to install a separate connector and reduces the configuration steps required to use these frameworks in AWS Glue for Apache Spark jobs. View the full article
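In practice, enabling one of these frameworks in an AWS Glue for Apache Spark job comes down to a job-level parameter; the following boto3 sketch shows one way this might look for Iceberg (the role, script location, and names are placeholders, and "hudi" or "delta" can be used in the same way).

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a Glue Spark job with built-in Iceberg support enabled through the
# --datalake-formats job parameter (placeholder role, script, and names).
glue.create_job(
    Name="iceberg-etl-job",
    Role="arn:aws:iam::111122223333:role/example-glue-job-role",
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://example-bucket/scripts/iceberg_etl.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--datalake-formats": "iceberg",
    },
)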
  13. AWS Glue Studio now supports updating the AWS Glue Data Catalog during job runs. This feature makes it easy to keep your tables up to date as AWS Glue writes new data into Amazon S3, making the data immediately queryable from any analytics service compatible with the AWS Glue Data Catalog. View the full article
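Inside a job script, the catalog update is typically requested on the sink. The following is a minimal sketch of that pattern with assumed names (database, table, path, and partition key are placeholders), based on the getSink/enableUpdateCatalog approach.

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Build a small DynamicFrame for illustration (placeholder data and names).
df = spark.createDataFrame([(1, "2024-01-01")], ["order_id", "sale_date"])
output_dyf = DynamicFrame.fromDF(df, glue_context, "output_dyf")

# Write to Amazon S3 and create or update the catalog table during the job run.
sink = glue_context.getSink(
    connection_type="s3",
    path="s3://example-bucket/curated/sales/",
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["sale_date"],
)
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase="example_db", catalogTableName="sales")
sink.writeFrame(output_dyf)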
  14. Streaming extract, transform, and load (ETL) jobs in AWS Glue can now read data encoded in the Apache Avro format. Previously, streaming ETL jobs could read data in the JSON, CSV, Parquet, and XML formats. With the addition of Avro, streaming ETL jobs now support all the same formats as batch AWS Glue jobs. View the full article