Showing results for tags 'amazon datazone'.


Found 7 results

  1. Last week, we announced the general availability of the integration between Amazon DataZone and AWS Lake Formation hybrid access mode. In this post, we share how this new feature helps you simplify the way you use Amazon DataZone to enable secure and governed sharing of your data in the AWS Glue Data Catalog. We also delve into how data producers can share their AWS Glue tables through Amazon DataZone without needing to register them in Lake Formation first.

Overview of the Amazon DataZone integration with Lake Formation hybrid access mode

Amazon DataZone is a fully managed data management service to catalog, discover, analyze, share, and govern data between data producers and consumers in your organization. With Amazon DataZone, data producers populate the business data catalog with data assets from data sources such as the AWS Glue Data Catalog and Amazon Redshift. They also enrich their assets with business context to make it straightforward for data consumers to understand. After the data is available in the catalog, data consumers such as analysts and data scientists can search and access this data by requesting subscriptions. When a request is approved, Amazon DataZone can automatically provision access to the data by managing permissions in Lake Formation or Amazon Redshift, so that the data consumer can start querying the data using tools such as Amazon Athena or Amazon Redshift.

To manage access to data in the AWS Glue Data Catalog, Amazon DataZone uses Lake Formation. Previously, if you wanted to use Amazon DataZone to manage access to your data in the AWS Glue Data Catalog, you had to onboard that data to Lake Formation first. Now, the integration of Amazon DataZone and Lake Formation hybrid access mode simplifies how you can get started with your Amazon DataZone journey by removing that requirement.
Lake Formation hybrid access mode allows you to start managing permissions on your AWS Glue databases and tables through Lake Formation, while continuing to maintain any existing AWS Identity and Access Management (IAM) permissions on these tables and databases. Hybrid access mode supports two permission pathways to the same Data Catalog databases and tables:

  • In the first pathway, Lake Formation allows you to select specific principals (opt-in principals) and grant them Lake Formation permissions to access databases and tables by opting in.
  • The second pathway allows all other principals (those not added as opt-in principals) to access these resources through the IAM principal policies for Amazon Simple Storage Service (Amazon S3) and AWS Glue actions.

With the integration between Amazon DataZone and Lake Formation hybrid access mode, if you have tables in the AWS Glue Data Catalog that are managed through IAM-based policies, you can publish these tables directly to Amazon DataZone without registering them in Lake Formation. Amazon DataZone registers the location of these tables in Lake Formation using hybrid access mode, which allows managing permissions on AWS Glue tables through Lake Formation while continuing to maintain any existing IAM permissions.

Amazon DataZone enables you to publish any type of asset in the business data catalog. For some of these assets, Amazon DataZone can automatically manage access grants. These assets are called managed assets, and they include Lake Formation-managed Data Catalog tables and Amazon Redshift tables and views. Prior to this integration, you had to complete the following steps before Amazon DataZone could treat a published Data Catalog table as a managed asset:

  1. Identify the Amazon S3 location associated with the Data Catalog table.
  2. Register the Amazon S3 location with Lake Formation in hybrid access mode using a role with appropriate permissions.
  3. Publish the table metadata to the Amazon DataZone business data catalog.

The following diagram illustrates this workflow.

With Amazon DataZone’s integration with Lake Formation hybrid access mode, you can simply publish your AWS Glue tables to Amazon DataZone without having to worry about registering the Amazon S3 location or adding an opt-in principal in Lake Formation, by delegating these steps to Amazon DataZone. The administrator of an AWS account can enable the data location registration setting under the DefaultDataLake blueprint on the Amazon DataZone console. A data owner or publisher can then publish their AWS Glue table (managed through IAM permissions) to Amazon DataZone without the extra setup steps. When a data consumer subscribes to this table, Amazon DataZone registers the Amazon S3 locations of the table in hybrid access mode, adds the data consumer’s IAM role as an opt-in principal, and grants access to that role by managing permissions on the table through Lake Formation. This makes sure that IAM permissions on the table can coexist with newly granted Lake Formation permissions, without disrupting any existing workflows. The following diagram illustrates this workflow.

Solution overview

To demonstrate this new capability, we use a sample customer scenario where the finance team wants to access data owned by the sales team for financial analysis and reporting. The sales team has a pipeline that creates a dataset containing valuable information about ticket sales, popular events, venues, and seasons. We call it the tickit dataset. The sales team stores this dataset in Amazon S3 and registers it in a database in the Data Catalog. Access to this table is currently managed through IAM-based permissions. However, the sales team wants to publish this table to Amazon DataZone to facilitate secure and governed data sharing with the finance team.
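For context, the registration that Amazon DataZone performs on your behalf corresponds to a Lake Formation RegisterResource call with hybrid access enabled. The following is a minimal boto3 sketch; the bucket and role ARNs are hypothetical, and with the integration enabled you do not need to run this yourself:

```python
# import boto3  # uncomment to make the actual call (requires AWS credentials)

def hybrid_registration_params(s3_arn: str, role_arn: str) -> dict:
    """Build RegisterResource parameters for Lake Formation hybrid access mode.

    HybridAccessEnabled=True lets existing IAM-based access to the location
    keep working while Lake Formation permissions are layered on top.
    """
    return {
        "ResourceArn": s3_arn,
        "RoleArn": role_arn,
        "HybridAccessEnabled": True,
    }

registration = hybrid_registration_params(
    "arn:aws:s3:::example-tickit-bucket/tickit/",                # hypothetical location
    "arn:aws:iam::111122223333:role/LakeFormationRegistration",  # hypothetical role
)
# boto3.client("lakeformation").register_resource(**registration)
```

This is the same flag exposed as "hybrid access mode" on the Lake Formation console; Amazon DataZone issues the equivalent registration when it fulfills a subscription.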
The steps to configure this solution are as follows:

  1. The Amazon DataZone administrator enables the data lake location registration setting in Amazon DataZone to automatically register the Amazon S3 location of the AWS Glue tables in Lake Formation hybrid access mode.
  2. After the hybrid access mode integration is enabled in Amazon DataZone, the finance team requests a subscription to the sales data asset. The asset shows up as a managed asset, which means Amazon DataZone can manage access to this asset even if the Amazon S3 location of the asset isn’t registered in Lake Formation.
  3. The sales team is notified of the subscription request raised by the finance team. They review and approve the access request.
  4. After the request is approved, Amazon DataZone fulfills the subscription request by managing permissions in Lake Formation. It registers the Amazon S3 location of the subscribed table in Lake Formation hybrid mode.
  5. The finance team gains access to the sales dataset required for their financial reports. They can go to their Amazon DataZone environment and start running queries using Athena against their subscribed dataset.

Prerequisites

To follow the steps in this post, you need an AWS account. If you don’t have an account, you can create one. In addition, you must have the following resources configured in your account:

  • An S3 bucket
  • An AWS Glue database and crawler
  • IAM roles for different personas and services
  • An Amazon DataZone domain and project
  • An Amazon DataZone environment profile and environment
  • An Amazon DataZone data source

If you don’t have these resources already configured, you can create them by deploying the following AWS CloudFormation stack:

  1. Choose Launch Stack to deploy a CloudFormation template.
  2. Complete the steps to deploy the template and leave all settings as default.
  3. Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Submit.
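The console launch can also be scripted. A hedged boto3 sketch follows; the stack name and template URL are hypothetical, and CAPABILITY_NAMED_IAM is the programmatic counterpart of the "I acknowledge that AWS CloudFormation might create IAM resources" checkbox:

```python
# import boto3  # uncomment to actually create the stack

def stack_request(stack_name: str, template_url: str) -> dict:
    """Build CreateStack parameters equivalent to the console launch flow."""
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        # Programmatic equivalent of acknowledging IAM resource creation:
        "Capabilities": ["CAPABILITY_NAMED_IAM"],
    }

stack = stack_request(
    "datazone-hybrid-mode-blog",                                   # hypothetical name
    "https://example-bucket.s3.amazonaws.com/datazone-blog.yaml",  # hypothetical URL
)
# boto3.client("cloudformation").create_stack(**stack)
```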
After the CloudFormation deployment is complete, you can log in to the Amazon DataZone portal and manually trigger a data source run. This pulls any new or modified metadata from the source and updates the associated assets in the inventory. This data source has been configured to automatically publish the data assets to the catalog.

  1. On the Amazon DataZone console, choose View domains. You should be logged in using the same role that was used to deploy the CloudFormation stack, and verify that you are in the same AWS Region.
  2. Find the domain blog_dz_domain, then choose Open data portal.
  3. Choose Browse all projects and choose Sales producer project.
  4. On the Data tab, choose Data sources in the navigation pane.
  5. Locate and choose the data source that you want to run. This opens the data source details page.
  6. Choose the options menu (three vertical dots) next to tickit_datasource and choose Run.

The data source status changes to Running as Amazon DataZone updates the asset metadata.

Enable hybrid mode integration in Amazon DataZone

In this step, the Amazon DataZone administrator goes through the process of enabling the Amazon DataZone integration with Lake Formation hybrid access mode. Complete the following steps:

  1. On a separate browser tab, open the Amazon DataZone console. Verify that you are in the same Region where you deployed the CloudFormation template.
  2. Choose View domains.
  3. Choose the domain created by AWS CloudFormation, blog_dz_domain.
  4. Scroll down on the domain details page and choose the Blueprints tab.

A blueprint defines what AWS tools and services can be used with the data assets published in Amazon DataZone. The DefaultDataLake blueprint is enabled as part of the CloudFormation stack deployment. This blueprint enables you to create and query AWS Glue tables using Athena. For the steps to enable this in your own deployments, refer to Enable built-in blueprints in the AWS account that owns the Amazon DataZone domain.

  5. Choose the DefaultDataLake blueprint.
  6. On the Provisioning tab, choose Edit.
  7. Select Enable Amazon DataZone to register S3 locations using AWS Lake Formation hybrid access mode. You have the option of excluding specific Amazon S3 locations if you don’t want Amazon DataZone to automatically register them to Lake Formation hybrid access mode.
  8. Choose Save changes.

Request access

In this step, you log in to Amazon DataZone as the finance team, search for the sales data asset, and subscribe to it. Complete the following steps:

  1. Return to your Amazon DataZone data portal browser tab.
  2. Switch to the finance consumer project by choosing the dropdown menu next to the project name and choosing Finance consumer project. From this step onwards, you take on the persona of a finance user looking to subscribe to the data asset published earlier.
  3. In the search bar, search for and choose the sales data asset.
  4. Choose Subscribe. The asset shows up as a managed asset, which means that Amazon DataZone can grant access to this data asset to the finance team’s project by managing the permissions in Lake Formation.
  5. Enter a reason for the access request and choose Subscribe.

Approve access request

The sales team gets a notification that an access request from the finance team has been submitted. To approve the request, complete the following steps:

  1. Choose the dropdown menu next to the project name and choose Sales producer project. You now assume the persona of the sales team, who are the owners and stewards of the sales data assets.
  2. Choose the notification icon at the top-right corner of the DataZone portal.
  3. Choose the Subscription Request Created task.
  4. Grant access to the sales data asset to the finance team and choose Approve.

Analyze the data

The finance team has now been granted access to the sales data, and the dataset has been added to their Amazon DataZone environment. They can access the environment and query the sales dataset with Athena, along with any other datasets they currently own.
Complete the following steps:

  1. On the dropdown menu, choose Finance consumer project.
  2. On the right pane of the project overview screen, you can find a list of active environments available for use. Choose the Amazon DataZone environment finance_dz_environment.
  3. In the navigation pane, under Data assets, choose Subscribed.
  4. Verify that your environment now has access to the sales data. It may take a few minutes for the data asset to be automatically added to your environment.
  5. Choose the new tab icon for Query data. A new tab opens with the Athena query editor.
  6. For Database, choose finance_consumer_db_tickitdb-<suffix>. This database will contain your subscribed data assets.
  7. Generate a preview of the sales table by choosing the options menu (three vertical dots) and choosing Preview table.

Clean up

To clean up your resources, complete the following steps:

  1. Switch back to the administrator role you used to deploy the CloudFormation stack.
  2. On the Amazon DataZone console, delete the projects used in this post. This will delete most project-related objects, such as data assets and environments.
  3. On the AWS CloudFormation console, delete the stack you deployed at the beginning of this post.
  4. On the Amazon S3 console, delete the S3 buckets containing the tickit dataset.
  5. On the Lake Formation console, delete the Lake Formation admins registered by Amazon DataZone.
  6. On the Lake Formation console, delete the tables and databases created by Amazon DataZone.

Conclusion

In this post, we discussed how the integration between Amazon DataZone and Lake Formation hybrid access mode simplifies the process to start using Amazon DataZone for end-to-end governance of your data in the AWS Glue Data Catalog. This integration helps you bypass the manual steps of onboarding to Lake Formation before you can start using Amazon DataZone. For more information on how to get started with Amazon DataZone, refer to the Getting started guide.
Check out the YouTube playlist for some of the latest demos of Amazon DataZone and short descriptions of the capabilities available. For more information about Amazon DataZone, see How Amazon DataZone helps customers find value in oceans of data.

About the Authors

Utkarsh Mittal is a Senior Technical Product Manager for Amazon DataZone at AWS. He is passionate about building innovative products that simplify customers’ end-to-end analytics journeys. Outside of the tech world, Utkarsh loves to play music, with drums being his latest endeavor.

Praveen Kumar is a Principal Analytics Solution Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-centered services. His areas of interest are serverless technology, modern cloud data warehouses, streaming, and generative AI applications.

Paul Villena is a Senior Analytics Solutions Architect at AWS with expertise in building modern data and analytics solutions to drive business value. He works with customers to help them harness the power of the cloud. His areas of interest are infrastructure as code, serverless technologies, and coding in Python.

View the full article
  2. Amazon DataZone is used by customers to catalog, discover, analyze, share, and govern data at scale across organizational boundaries with governance and access controls. Today, Amazon DataZone launches integration with AWS Glue Data Quality and offers APIs to integrate data quality metrics from third-party data quality solutions. This integration helps Amazon DataZone customers gain trust in their data and make confident business decisions. View the full article
  3. Amazon DataZone is used by customers to catalog, discover, analyze, share, and govern data at scale across organizational boundaries with governance and access controls. Today, Amazon DataZone has introduced an integration with AWS Lake Formation hybrid mode. This integration enables customers to easily publish and share their AWS Glue tables through Amazon DataZone, without the need to register them in AWS Lake Formation first. Hybrid mode allows customers to start managing permissions on their AWS Glue tables through AWS Lake Formation, while continuing to maintain any existing IAM permissions on these tables. View the full article
  4. Today, we are pleased to announce that Amazon DataZone is now able to present data quality information for data assets. This information empowers end-users to make informed decisions as to whether or not to use specific assets. Many organizations already use AWS Glue Data Quality to define and enforce data quality rules on their data, validate data against predefined rules, track data quality metrics, and monitor data quality over time using artificial intelligence (AI). Other organizations monitor the quality of their data through third-party solutions. Amazon DataZone now integrates directly with AWS Glue to display data quality scores for AWS Glue Data Catalog assets. Additionally, Amazon DataZone now offers APIs for importing data quality scores from external systems.

In this post, we discuss the latest features of Amazon DataZone for data quality, the integration between Amazon DataZone and AWS Glue Data Quality, and how you can import data quality scores produced by external systems into Amazon DataZone via API.

Challenges

One of the most common questions we get from customers is about displaying data quality scores in the Amazon DataZone business data catalog to give business users visibility into the health and reliability of datasets. As data becomes increasingly crucial for driving business decisions, Amazon DataZone users are keenly interested in providing the highest standards of data quality. They recognize the importance of accurate, complete, and timely data in enabling informed decision-making and fostering trust in their analytics and reporting processes.

Amazon DataZone data assets can be updated at varying frequencies. As data is refreshed and updated, changes can happen through upstream processes that put it at risk of not maintaining the intended quality. Data quality scores help you understand whether data has maintained the expected level of quality for data consumers to use (through analysis or downstream processes).
From a producer’s perspective, data stewards can now set up Amazon DataZone to automatically import data quality scores from AWS Glue Data Quality (scheduled or on demand) and include this information in the Amazon DataZone catalog to share with business users. Additionally, you can now use new Amazon DataZone APIs to import data quality scores produced by external systems into the data assets. With the latest enhancement, Amazon DataZone users can now accomplish the following:

  • Access insights about data quality standards directly from the Amazon DataZone web portal
  • View data quality scores on various KPIs, including data completeness, uniqueness, and accuracy
  • Gain a holistic view of the quality and trustworthiness of their data

In the first part of this post, we walk through the integration between AWS Glue Data Quality and Amazon DataZone. We discuss how to visualize data quality scores in Amazon DataZone, enable AWS Glue Data Quality when creating a new Amazon DataZone data source, and enable data quality for an existing data asset. In the second part of this post, we discuss how you can import data quality scores produced by external systems into Amazon DataZone via API. In this example, we use Amazon EMR Serverless in combination with the open source library Pydeequ to act as an external system for data quality.

Visualize AWS Glue Data Quality scores in Amazon DataZone

You can now visualize AWS Glue Data Quality scores in data assets that have been published in the Amazon DataZone business catalog and that are searchable through the Amazon DataZone web portal. If the asset has AWS Glue Data Quality enabled, you can quickly visualize the data quality score directly in the catalog search pane. By selecting the corresponding asset, you can understand its content through the readme, glossary terms, and technical and business metadata. Additionally, the overall quality score indicator is displayed in the Asset Details section.
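The scores shown in the portal can also be read back programmatically through the DataZone time series APIs. A hedged boto3 sketch follows; the domain and asset identifiers are hypothetical, and the form name used for quality results is an assumption:

```python
# import boto3  # uncomment to call the API

def quality_history_request(domain_id: str, asset_id: str) -> dict:
    """Build ListTimeSeriesDataPoints parameters for reading back
    the quality scores attached to an asset."""
    return {
        "domainIdentifier": domain_id,
        "entityIdentifier": asset_id,
        "entityType": "ASSET",
        "formName": "DataQualityResult",  # assumed name of the quality form
    }

request = quality_history_request("dzd_abc123", "asset_456")  # hypothetical IDs
# boto3.client("datazone").list_time_series_data_points(**request)
```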
A data quality score serves as an overall indicator of a dataset’s quality, calculated based on the rules you define.

On the Data quality tab, you can access the details of data quality overview indicators and the results of the data quality runs. The indicators shown on the Overview tab are calculated based on the results of the rulesets from the data quality runs. Each rule is assigned an attribute that contributes to the calculation of the indicator. For example, rules that have the Completeness attribute will contribute to the calculation of the corresponding indicator on the Overview tab. To filter data quality results, choose the Applicable column dropdown menu and choose your desired filter parameter.

You can also visualize column-level data quality starting on the Schema tab. When data quality is enabled for the asset, the data quality results become available, providing insightful quality scores that reflect the integrity and reliability of each column within the dataset. When you choose one of the data quality result links, you’re redirected to the data quality detail page, filtered by the selected column.

Data quality historical results in Amazon DataZone

Data quality can change over time for many reasons:

  • Data formats may change because of changes in the source systems
  • As data accumulates over time, it may become outdated or inconsistent
  • Data quality can be affected by human errors in data entry, data processing, or data manipulation

In Amazon DataZone, you can now track data quality over time to confirm reliability and accuracy. By analyzing the historical report snapshot, you can identify areas for improvement, implement changes, and measure the effectiveness of those changes.

Enable AWS Glue Data Quality when creating a new Amazon DataZone data source

In this section, we walk through the steps to enable AWS Glue Data Quality when creating a new Amazon DataZone data source.
Prerequisites

To follow along, you should have a domain for Amazon DataZone, an Amazon DataZone project, and a new Amazon DataZone environment (with a DataLakeProfile). For instructions, refer to Amazon DataZone quickstart with AWS Glue data.

You also need to define and run a ruleset against your data, which is a set of data quality rules in AWS Glue Data Quality. To set up the data quality rules and for more information on the topic, refer to the following posts:

  • Part 1: Getting started with AWS Glue Data Quality from the AWS Glue Data Catalog
  • Part 2: Getting started with AWS Glue Data Quality for ETL Pipelines
  • Part 3: Set up data quality rules across multiple datasets using AWS Glue Data Quality
  • Part 4: Set up alerts and orchestrate data quality rules with AWS Glue Data Quality
  • Part 5: Visualize data quality score and metrics generated by AWS Glue Data Quality
  • Part 6: Measure performance of AWS Glue Data Quality for ETL pipelines

After you create the data quality rules, make sure that Amazon DataZone has the permissions to access the AWS Glue database managed through AWS Lake Formation. For instructions, see Configure Lake Formation permissions for Amazon DataZone.

In our example, we have configured a ruleset against a table containing patient data within a healthcare synthetic dataset generated using Synthea. Synthea is a synthetic patient generator that creates realistic patient data and associated medical records that can be used for testing healthcare software applications. The ruleset contains 27 individual rules (one of them failing), so the overall data quality score is 96%.

If you use Amazon DataZone managed policies, there is no action needed, because these will get automatically updated with the needed actions. Otherwise, you need to allow Amazon DataZone to have the required permissions to list and get AWS Glue Data Quality results, as shown in the Amazon DataZone user guide.
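A ruleset like the one described can also be created programmatically with the AWS Glue API. A minimal sketch follows; the database, table, and rules are hypothetical placeholders, far smaller than the 27-rule ruleset used in this post:

```python
# import boto3  # uncomment to create the ruleset

# A small DQDL (Data Quality Definition Language) ruleset. The real ruleset
# described in this post has 27 rules against the Synthea patients table.
DQDL_RULES = """Rules = [
    IsComplete "birthdate",
    IsUnique "id",
    ColumnValues "healthcare_coverage" >= 0
]"""

def build_ruleset_request(name: str, database: str, table: str) -> dict:
    """Build CreateDataQualityRuleset parameters targeting a catalog table."""
    return {
        "Name": name,
        "Ruleset": DQDL_RULES,
        "TargetTable": {"DatabaseName": database, "TableName": table},
    }

ruleset_request = build_ruleset_request(
    "patients_ruleset", "healthcare_db", "patients"  # hypothetical names
)
# boto3.client("glue").create_data_quality_ruleset(**ruleset_request)
```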
Create a data source with data quality enabled

In this section, we create a data source and enable data quality. You can also update an existing data source to enable data quality. We use this data source to import metadata information related to our datasets. Amazon DataZone will also import data quality information related to the (one or more) assets contained in the data source.

  1. On the Amazon DataZone console, choose Data sources in the navigation pane.
  2. Choose Create data source.
  3. For Name, enter a name for your data source.
  4. For Data source type, select AWS Glue.
  5. For Environment, choose your environment.
  6. For Database name, enter a name for the database.
  7. For Table selection criteria, choose your criteria.
  8. Choose Next.
  9. For Data quality, select Enable data quality for this data source. If data quality is enabled, Amazon DataZone will automatically fetch data quality scores from AWS Glue at each data source run.
  10. Choose Next.

Now you can run the data source. While running the data source, Amazon DataZone imports the last 100 AWS Glue Data Quality run results. This information is now visible on the asset page and will be visible to all Amazon DataZone users after the asset is published.

Enable data quality for an existing data asset

In this section, we enable data quality for an existing asset. This might be useful for users that already have data sources in place and want to enable the feature afterwards.

Prerequisites

To follow along, you should have already run the data source and produced an AWS Glue table data asset. Additionally, you should have defined a ruleset in AWS Glue Data Quality over the target table in the Data Catalog. For this example, we ran the data quality job multiple times against the table, producing the related AWS Glue Data Quality scores, as shown in the following screenshot.
Import data quality scores into the data asset

Complete the following steps to import the existing AWS Glue Data Quality scores into the data asset in Amazon DataZone:

  1. Within the Amazon DataZone project, navigate to the Inventory data pane and choose the data source. If you choose the Data quality tab, you can see that there’s still no information on data quality, because the AWS Glue Data Quality integration is not enabled for this data asset yet.
  2. On the Data quality tab, choose Enable data quality.
  3. In the Data quality section, select Enable data quality for this data source.
  4. Choose Save.

Now, back on the Inventory data pane, you can see a new tab: Data quality. On the Data quality tab, you can see data quality scores imported from AWS Glue Data Quality.

Ingest data quality scores from an external source using Amazon DataZone APIs

Many organizations already use systems that calculate data quality by performing tests and assertions on their datasets. Amazon DataZone now supports importing third-party originated data quality scores via API, allowing users that navigate the web portal to view this information.

In this section, we simulate a third-party system pushing data quality scores into Amazon DataZone via APIs through Boto3 (the Python SDK for AWS). For this example, we use the same synthetic dataset as earlier, generated with Synthea. The following diagram illustrates the solution architecture.

The workflow consists of the following steps:

  1. Read a dataset of patients in Amazon Simple Storage Service (Amazon S3) directly from Amazon EMR using Spark. The dataset is created as a generic S3 asset collection in Amazon DataZone.
  2. In Amazon EMR, perform data validation rules against the dataset.
  3. The metrics are saved in Amazon S3 to have a persistent output.
  4. Use Amazon DataZone APIs through Boto3 to push custom data quality metadata.
  5. End-users can see the data quality scores by navigating to the data portal.
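The push step of this workflow boils down to a single PostTimeSeriesDataPoints call. A hedged sketch of the request shape follows; the form name, domain, and asset identifiers are hypothetical, and the content is a JSON document matching the amazon.datazone.DataQualityResultFormType form type:

```python
import datetime
import json

# import boto3  # uncomment to send the data points

def dq_form_payload(passing_pct: float, evaluations: list) -> dict:
    """Build one time series form entry carrying data quality results."""
    content = {
        "passingPercentage": passing_pct,
        "evaluationsCount": len(evaluations),
        "evaluations": evaluations,
    }
    return {
        "formName": "PydeequRuleSet1",  # hypothetical name for this result set
        "typeIdentifier": "amazon.datazone.DataQualityResultFormType",
        "timestamp": datetime.datetime.now(datetime.timezone.utc),
        "content": json.dumps(content),
    }

form = dq_form_payload(
    85.7,  # hypothetical overall score
    [{"description": "Completeness birthdate", "status": "PASS",
      "applicableFields": ["birthdate"], "types": ["Completeness"]}],
)
# boto3.client("datazone").post_time_series_data_points(
#     domainIdentifier="dzd_abc123",   # hypothetical domain
#     entityIdentifier="asset_456",    # hypothetical asset
#     entityType="ASSET",
#     forms=[form],
# )
```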
Prerequisites

We use Amazon EMR Serverless and Pydeequ to run a fully managed Spark environment. To learn more about Pydeequ as a data testing framework, see Testing Data quality at scale with Pydeequ.

To allow Amazon EMR to send data to the Amazon DataZone domain, make sure that the IAM role used by Amazon EMR has the permissions to do the following:

  • Read from and write to the S3 buckets
  • Call the post_time_series_data_points action for Amazon DataZone:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Action": [
                "datazone:PostTimeSeriesDataPoints"
            ],
            "Resource": [
                "<datazone_domain_arn>"
            ]
        }
    ]
}
```

Make sure that you added the EMR role as a project member in the Amazon DataZone project. On the Amazon DataZone console, navigate to the Project members page and choose Add members. Add the EMR role as a contributor.

Ingest and analyze PySpark code

In this section, we analyze the PySpark code that we use to perform data quality checks and send the results to Amazon DataZone. You can download the complete PySpark script.

To run the script entirely, you can submit a job to EMR Serverless. The service will take care of scheduling the job and automatically allocating the resources needed, enabling you to track the job run statuses throughout the process. You can submit a job to EMR within the Amazon EMR console using EMR Studio, or programmatically using the AWS CLI or one of the AWS SDKs.

In Apache Spark, a SparkSession is the entry point for interacting with DataFrames and Spark’s built-in functions. The script starts by initializing a SparkSession:

```python
with SparkSession.builder.appName("PatientsDataValidation") \
        .config("spark.jars.packages", pydeequ.deequ_maven_coord) \
        .config("spark.jars.excludes", pydeequ.f2j_maven_coord) \
        .getOrCreate() as spark:
```

We read a dataset from Amazon S3.
For increased modularity, you can use the script input to refer to the S3 path:

```python
s3inputFilepath = sys.argv[1]
s3outputLocation = sys.argv[2]

df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(s3inputFilepath)  # s3://<bucket_name>/patients/patients.csv
```

Next, we set up a metrics repository. This can be helpful to persist the run results in Amazon S3:

```python
metricsRepository = FileSystemMetricsRepository(spark, s3_write_path)
```

Pydeequ allows you to create data quality rules using the builder pattern, which is a well-known software engineering design pattern, concatenating instructions to instantiate a VerificationSuite object:

```python
key_tags = {'tag': 'patient_df'}
resultKey = ResultKey(spark, ResultKey.current_milli_time(), key_tags)

check = Check(spark, CheckLevel.Error, "Integrity checks")

checkResult = VerificationSuite(spark) \
    .onData(df) \
    .useRepository(metricsRepository) \
    .addCheck(
        check.hasSize(lambda x: x >= 1000) \
        .isComplete("birthdate") \
        .isUnique("id") \
        .isComplete("ssn") \
        .isComplete("first") \
        .isComplete("last") \
        .hasMin("healthcare_coverage", lambda x: x == 1000.0)) \
    .saveOrAppendResult(resultKey) \
    .run()

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()
```

The following is the output for the data validation rules:

```
+----------------+-----------+------------+----------------------------------------------------+-----------------+----------------------------------------------------+
|check           |check_level|check_status|constraint                                          |constraint_status|constraint_message                                  |
+----------------+-----------+------------+----------------------------------------------------+-----------------+----------------------------------------------------+
|Integrity checks|Error      |Error       |SizeConstraint(Size(None))                          |Success          |                                                    |
|Integrity checks|Error      |Error       |CompletenessConstraint(Completeness(birthdate,None))|Success          |                                                    |
|Integrity checks|Error      |Error       |UniquenessConstraint(Uniqueness(List(id),None))     |Success          |                                                    |
|Integrity checks|Error      |Error       |CompletenessConstraint(Completeness(ssn,None))      |Success          |                                                    |
|Integrity checks|Error      |Error       |CompletenessConstraint(Completeness(first,None))    |Success          |                                                    |
|Integrity checks|Error      |Error       |CompletenessConstraint(Completeness(last,None))     |Success          |                                                    |
|Integrity checks|Error      |Error       |MinimumConstraint(Minimum(healthcare_coverage,None))|Failure          |Value: 0.0 does not meet the constraint requirement!|
+----------------+-----------+------------+----------------------------------------------------+-----------------+----------------------------------------------------+
```

At this point, we want to insert these data quality values in Amazon DataZone. To do so, we use the post_time_series_data_points function in the Boto3 Amazon DataZone client. The PostTimeSeriesDataPoints DataZone API allows you to insert new time series data points for a given asset or listing, without creating a new revision.

At this point, you might also want to have more information on which fields are sent as input for the API. You can use the APIs to obtain the specification for Amazon DataZone form types; in our case, it’s amazon.datazone.DataQualityResultFormType.
You can also use the AWS CLI to invoke the API and display the form structure:

aws datazone get-form-type \
    --domain-identifier <your_domain_id> \
    --form-type-identifier amazon.datazone.DataQualityResultFormType \
    --region <domain_region> \
    --output text \
    --query 'model.smithy'

This output helps identify the required API parameters, including fields and value limits:

$version: "2.0"
namespace amazon.datazone

structure DataQualityResultFormType {
    @amazon.datazone#timeSeriesSummary
    @range(min: 0, max: 100)
    passingPercentage: Double

    @amazon.datazone#timeSeriesSummary
    evaluationsCount: Integer

    evaluations: EvaluationResults
}

@length(min: 0, max: 2000)
list EvaluationResults {
    member: EvaluationResult
}

@length(min: 0, max: 20)
list ApplicableFields {
    member: String
}

@length(min: 0, max: 20)
list EvaluationTypes {
    member: String
}

enum EvaluationStatus {
    PASS,
    FAIL
}

string EvaluationDetailType

map EvaluationDetails {
    key: EvaluationDetailType
    value: String
}

structure EvaluationResult {
    description: String
    types: EvaluationTypes
    applicableFields: ApplicableFields
    status: EvaluationStatus
    details: EvaluationDetails
}

To send the appropriate form data, we need to convert the Pydeequ output to match the DataQualityResultFormType contract. This can be achieved with a Python function that processes the results.

For each DataFrame row, we extract information from the constraint column. For example, take the following code:

CompletenessConstraint(Completeness(birthdate,None))

We convert it to the following:

{
    "constraint": "CompletenessConstraint",
    "statisticName": "Completeness_custom",
    "column": "birthdate"
}

Make sure to send an output that matches the KPIs that you want to track.
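As an illustration of what such a processing function might look like, the following sketch parses a constraint string into the flattened shape shown above. The helper name and the regular expression are our own assumptions, not part of the original script:

```python
import re

def parse_constraint(constraint: str) -> dict:
    """Parse a Pydeequ constraint string such as
    'CompletenessConstraint(Completeness(birthdate,None))' into the
    flattened form used for the Amazon DataZone form input.
    Hypothetical helper: the name and parsing rules are illustrative only.
    """
    match = re.match(r"(\w+Constraint)\((\w+)\(([^,)]*)", constraint)
    if not match:
        # Fall back to the raw string when the pattern is unfamiliar
        return {"constraint": constraint, "statisticName": None, "column": None}
    constraint_name, statistic, column = match.groups()
    return {
        "constraint": constraint_name,
        # Mirror the _custom suffix convention used for the KPIs
        "statisticName": f"{statistic}_custom",
        # Size(None) carries no column, so map the literal "None" to None
        "column": None if column == "None" else column,
    }

print(parse_constraint("CompletenessConstraint(Completeness(birthdate,None))"))
# → {'constraint': 'CompletenessConstraint', 'statisticName': 'Completeness_custom', 'column': 'birthdate'}
```

The same function handles column-less constraints such as SizeConstraint(Size(None)) by returning None for the column.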
In our case, we append _custom to the statistic name, which results in the following format for the KPIs:

Completeness_custom
Uniqueness_custom

In a real-world scenario, you might want to set a value that matches your data quality framework in relation to the KPIs that you want to track in Amazon DataZone.

After applying a transformation function, we have a Python object for each rule evaluation:

...,
{
    'applicableFields': ["healthcare_coverage"],
    'types': ["Minimum_custom"],
    'status': 'FAIL',
    'description': 'MinimumConstraint - Minimum - Value: 0.0 does not meet the constraint requirement!'
},
...

We also use the constraint_status column to compute the overall score:

(number of successful evaluations / total number of evaluations) * 100

In our example, this results in a passing percentage of 85.71%. We set this value in the passingPercentage input field, along with the other information related to the evaluations, in the input of the Boto3 method post_time_series_data_points:

import boto3

# Instantiate the client library to communicate with Amazon DataZone Service
datazone = boto3.client(
    service_name='datazone',
    region_name=<Region(String) example: us-east-1>
)

# Perform the API operation to push the data quality information to Amazon DataZone
datazone.post_time_series_data_points(
    domainIdentifier=<DataZone domain ID>,
    entityIdentifier=<DataZone asset ID>,
    entityType='ASSET',
    forms=[
        {
            "content": json.dumps({
                "evaluationsCount": <Number of evaluations (number)>,
                "evaluations": [<List of objects {
                    'description': <Description (String)>,
                    'applicableFields': [<List of columns involved (String)>],
                    'types': [<List of KPIs (String)>],
                    'status': <FAIL/PASS (string)>
                }>],
                "passingPercentage": <Score (number)>
            }),
            "formName": <Form name (String) example: PydeequRuleSet1>,
            "typeIdentifier": "amazon.datazone.DataQualityResultFormType",
            "timestamp": <Date (timestamp)>
        }
    ]
)

Boto3 invokes the Amazon DataZone APIs.
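The score computation can be sketched in plain Python. The helper below works on an ordinary list of statuses rather than the Spark DataFrame, and its name is our own:

```python
def passing_percentage(constraint_statuses):
    """Overall score: (successful evaluations / total evaluations) * 100,
    rounded to two decimals. Illustrative helper; the original script reads
    the statuses from the Pydeequ constraint_status column."""
    total = len(constraint_statuses)
    if total == 0:
        return 0.0
    passed = sum(1 for status in constraint_statuses if status == "Success")
    return round(passed / total * 100, 2)

# Seven evaluations from the example run above: six passed, one failed
print(passing_percentage(["Success"] * 6 + ["Failure"]))  # → 85.71
```

With six of seven constraints passing, this reproduces the 85.71% figure sent as passingPercentage.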
In these examples, we used Boto3 and Python, but you can choose any of the AWS SDKs in the language you prefer.

After setting the appropriate domain and asset IDs and running the method, we can check on the Amazon DataZone console that the asset data quality is now visible on the asset page. We can observe that the overall score matches the API input value. We can also see that we were able to add customized KPIs on the overview tab through custom types parameter values.

With the new Amazon DataZone APIs, you can load data quality rules from third-party systems into a specific data asset. With this capability, Amazon DataZone allows you to extend the types of indicators present in AWS Glue Data Quality (such as completeness, minimum, and uniqueness) with custom indicators.

Clean up

We recommend deleting any potentially unused resources to avoid incurring unexpected costs. For example, you can delete the Amazon DataZone domain and the EMR application you created during this process.

Conclusion

In this post, we highlighted the latest features of Amazon DataZone for data quality, empowering end users with enhanced context and visibility into their data assets. Furthermore, we delved into the seamless integration between Amazon DataZone and AWS Glue Data Quality. You can also use the Amazon DataZone APIs to integrate with external data quality providers, enabling you to maintain a comprehensive and robust data strategy within your AWS environment.

To learn more about Amazon DataZone, refer to the Amazon DataZone User Guide.

About the Authors

Andrea Filippo is a Partner Solutions Architect at AWS supporting Public Sector partners and customers in Italy. He focuses on modern data architectures and helping customers accelerate their cloud journey with serverless technologies.

Emanuele is a Solutions Architect at AWS, based in Italy after living and working for more than 5 years in Spain.
He enjoys helping large companies with the adoption of cloud technologies, and his area of expertise is mainly focused on data analytics and data management. Outside of work, he enjoys traveling and collecting action figures.

Varsha Velagapudi is a Senior Technical Product Manager with Amazon DataZone at AWS. She focuses on improving data discovery and curation required for data analytics. She is passionate about simplifying customers' AI/ML and analytics journey to help them succeed in their day-to-day tasks. Outside of work, she enjoys nature and outdoor activities, reading, and traveling.

View the full article
  5. In March 2024, we announced the general availability of generative artificial intelligence (AI)-generated data descriptions in Amazon DataZone. In this post, we share what we heard from our customers that led us to add AI-generated data descriptions, and we discuss specific customer use cases addressed by this capability. We also detail how the feature works and what criteria were applied for the model and prompt selection while building on Amazon Bedrock.

Amazon DataZone enables you to discover, access, share, and govern data at scale across organizational boundaries, reducing the undifferentiated heavy lifting of making data and analytics tools accessible to everyone in the organization. With Amazon DataZone, data users like data engineers, data scientists, and data analysts can share and access data across AWS accounts using a unified data portal, allowing them to discover, use, and collaborate on this data across their teams and organizations. Additionally, data owners and data stewards can make data discovery simpler by adding business context to data while balancing access governance to the data in the user interface.

What we hear from customers

Organizations are adopting enterprise-wide data discovery and governance solutions like Amazon DataZone to unlock the value from petabytes, and even exabytes, of data spread across multiple departments, services, on-premises databases, and third-party sources (such as partner solutions and public datasets). Data consumers need detailed descriptions of the business context of a data asset and documentation about its recommended use cases to quickly identify the relevant data for their intended use case. Without the right metadata and documentation, data consumers overlook valuable datasets relevant to their use case or spend more time going back and forth with data producers to understand the data and its relevance, or worse, misuse the data for a purpose it was not intended for.
For instance, a dataset designated for testing might mistakenly be used for financial forecasting, resulting in poor predictions. Data producers find it tedious and time consuming to maintain extensive and up-to-date documentation on their data and to respond to continued questions from data consumers. As data proliferates across the data mesh, these challenges only intensify, often resulting in under-utilization of their data.

Introducing generative AI-powered data descriptions

With AI-generated descriptions in Amazon DataZone, data consumers have recommended descriptions to identify data tables and columns for analysis, which enhances data discoverability and cuts down on back-and-forth communications with data producers. Data consumers have more contextualized data at their fingertips to inform their analysis. The automatically generated descriptions enable a richer search experience for data consumers because search results are now also based on detailed descriptions, possible use cases, and key columns. This feature also elevates data discovery and interpretation by providing recommendations on analytical applications for a dataset, giving customers additional confidence in their analysis.

Because data producers can generate contextual descriptions of data, its schema, and data insights with a single click, they are incentivized to make more data available to data consumers. With the addition of automatically generated descriptions, Amazon DataZone helps organizations interpret their extensive and distributed data repositories. The following is an example of the asset summary and use cases detailed description.

Use cases served by generative AI-powered data descriptions

The automatically generated descriptions capability in Amazon DataZone streamlines relevant descriptions, provides usage recommendations, and ultimately enhances the overall efficiency of data-driven decision-making.
It saves organizations time on catalog curation and speeds up discovery of relevant use cases for the data. It offers the following benefits:

  • Aid search and discovery of valuable datasets – With the clarity provided by automatically generated descriptions, data consumers are less likely to overlook critical datasets through enhanced search and faster understanding, so every valuable insight from the data is recognized and utilized.
  • Guide data application – Misapplying data can lead to incorrect analyses, missed opportunities, or skewed results. Automatically generated descriptions offer AI-driven recommendations on how best to use datasets, helping customers apply them in contexts where they are appropriate and effective.
  • Increase efficiency in data documentation and discovery – Automatically generated descriptions streamline the traditionally tedious and manual process of data cataloging. This reduces the need for time-consuming manual documentation, making data more easily discoverable and comprehensible.

Solution overview

The AI recommendations feature in Amazon DataZone was built on Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models. To generate high-quality descriptions and impactful use cases, we use the available metadata on the asset, such as the table name, column names, and optional metadata provided by the data producers. The recommendations don't use any data that resides in the tables unless explicitly provided by the user as content in the metadata.

To get customized generations, we first infer the domain corresponding to the table (such as the automotive industry, finance, or healthcare), which then guides the rest of the workflow towards generating customized descriptions and use cases. The generated table description contains information about how the columns are related to each other, as well as the overall meaning of the table, in the context of the identified industry segment.
The table description also contains a narrative-style description of the most important constituent columns. The use cases provided are also tailored to the identified domain, suitable not just for expert practitioners from the specific domain, but also for generalists.

The generated descriptions are composed from LLM-produced outputs for the table description, column descriptions, and use cases, generated in a sequential order. For instance, the column descriptions are generated first by jointly passing the table name, schema (the list of column names and their data types), and other available optional metadata. The obtained column descriptions are then used in conjunction with the table schema and metadata to obtain table descriptions, and so on. This follows a consistent order, similar to the one a human would follow when trying to understand a table. The following diagram illustrates this workflow.

Evaluating and selecting the foundation model and prompts

Amazon DataZone manages the model selection for the recommendation generation. The models used can be updated or changed from time to time. Selecting the appropriate models and prompting strategies is a critical step in confirming the quality of the generated content, while also achieving low costs and low latencies. To realize this, we evaluated our workflow using multiple criteria on datasets that spanned more than 20 different industry domains before finalizing a model. Our evaluation mechanisms can be summarized as follows:

  • Tracking automated metrics for quality assessment – We tracked a combination of more than 10 supervised and unsupervised metrics to evaluate essential quality factors such as informativeness, conciseness, reliability, semantic coverage, coherence, and cohesiveness. This allowed us to capture and quantify the nuanced attributes of generated content, confirming that it meets our high standards for clarity and relevance.
  • Detecting inconsistencies and hallucinations – Next, we addressed the challenge of the reliability of content generated by LLMs through our self-consistency-based hallucination detection. This identifies any potential non-factuality in the generated content, and also serves as a proxy for confidence scores, as an additional layer of quality assurance.
  • Using large language models as judges – Lastly, our evaluation process incorporates a method of judgment: using multiple state-of-the-art large language models (LLMs) as evaluators. By using bias-mitigation techniques and aggregating the scores from these advanced models, we can obtain a well-rounded assessment of the content's quality.

The approach of using LLMs as a judge, hallucination detection, and automated metrics brings diverse perspectives into our evaluation, as a proxy for expert human evaluations.

Getting started with generative AI-powered data descriptions

To get started, log in to the Amazon DataZone data portal. Go to your asset in your data project and choose Generate summary to obtain the detailed description of the asset and its columns. Amazon DataZone uses the available metadata on the asset to generate the descriptions. You can optionally provide additional context as metadata in the readme section or metadata form content on the asset for more customized descriptions. For detailed instructions, refer to New generative AI capabilities for Amazon DataZone further simplify data cataloging and discovery (preview). For API instructions, see Using machine learning and generative AI.

Amazon DataZone AI recommendations for descriptions is generally available in Amazon DataZone domains provisioned in the following AWS Regions: US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Frankfurt).

For pricing, you will be charged for input and output tokens when generating column descriptions, asset descriptions, and analytical use cases in AI recommendations for descriptions.
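For the API route mentioned earlier, a description-generation request might be shaped as follows. Note that the start_metadata_generation_run operation and the field names shown here are assumptions based on the boto3 DataZone client; verify them against the current SDK documentation before use:

```python
def build_description_request(domain_id: str, asset_id: str, revision: str) -> dict:
    """Shape a metadata-generation request for a DataZone asset.
    All operation and field names here are assumptions; check the
    boto3 DataZone client documentation for the authoritative contract."""
    return {
        "domainIdentifier": domain_id,
        "type": "BUSINESS_DESCRIPTIONS",
        "target": {
            "type": "ASSET",
            "identifier": asset_id,
            "revision": revision,
        },
    }

# The request could then be passed to the client, for example:
# import boto3
# datazone = boto3.client("datazone", region_name="us-east-1")
# datazone.start_metadata_generation_run(**build_description_request("dzd_example", "asset_id", "1"))
print(build_description_request("dzd_example", "asset_id", "1")["type"])  # → BUSINESS_DESCRIPTIONS
```

Separating the request shaping from the client call keeps the payload easy to inspect and unit test before any AWS credentials are involved.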
For more details, see Amazon DataZone Pricing.

Conclusion

In this post, we discussed the challenges and key use cases for the new AI recommendations for descriptions feature in Amazon DataZone. We detailed how the feature works and how the model and prompt selection were done to provide the most useful recommendations. If you have any feedback or questions, leave them in the comments section.

About the Authors

Varsha Velagapudi is a Senior Technical Product Manager with Amazon DataZone at AWS. She focuses on improving data discovery and curation required for data analytics. She is passionate about simplifying customers' AI/ML and analytics journey to help them succeed in their day-to-day tasks. Outside of work, she enjoys playing with her 3-year-old, reading, and traveling.

Zhengyuan Shen is an Applied Scientist at Amazon AWS, specializing in advancements in AI, particularly in large language models and their application in data comprehension. He is passionate about leveraging innovative ML scientific solutions to enhance products and services, thereby simplifying the lives of customers through a seamless blend of science and engineering. Outside of work, he enjoys cooking, weightlifting, and playing poker.

Balasubramaniam Srinivasan is an Applied Scientist at Amazon AWS, working on foundational models for structured data and natural sciences. He enjoys enriching ML models with domain-specific knowledge and inductive biases to delight customers. Outside of work, he enjoys playing and watching tennis and soccer.

View the full article
  6. Amazon DataZone is used by customers to catalog, discover, analyze, share, and govern data at scale across organizational boundaries with governance and access controls. Today, AWS announces the general availability of a new generative AI-based capability in Amazon DataZone to improve data discovery, data understanding, and data usage by enriching the business data catalog. With a single click, data producers can generate comprehensive business data descriptions and context, highlight impactful columns, and include recommendations on analytical use cases. View the full article
  7. Amazon DataZone is used by customers to catalog, discover, analyze, share, and govern data at scale across organizational boundaries with governance and access controls. Today, Amazon DataZone has introduced several enhancements to its Amazon Redshift integration, simplifying the process of publishing and subscribing to Amazon Redshift tables and views. These updates streamline the experience for both data producers and consumers, allowing them to quickly create data warehouse environments using pre-configured credentials and connection parameters provided by their DataZone administrators. Additionally, these enhancements grant administrators greater control over who can use the resources within their AWS accounts and Amazon Redshift clusters, and for what purpose. View the full article