Showing results for tags 'data quality'.

data quality Data Quality Management Techniques and Best Practices

Hevo Data posted a topic in Databases, Data Engineering & Data Science

Many organizations today heavily rely on data to make business-related decisions. Data is an invaluable asset that helps you substantiate your convictions with evidence and facilitates stakeholder buy-in. However, ensuring your data is of high quality is paramount as it directly correlates to the accuracy of the desired results. Implementing data quality management techniques can […]View the full article

snowflake New Snowflake Features Released in March 2024

Snowflake posted a topic in Databases, Data Engineering & Data Science

In March, Snowflake announced exciting releases, including advances in AI and ML with new features in Snowflake Cortex, new governance and privacy features in Snowflake Horizon, and broader developer support with the Snowflake CLI. Read on to learn more about everything we announced last month. Snowflake Cortex LLM Functions – in public preview Snowflake Cortex is an intelligent, fully managed service that delivers state-of-the-art large language models (LLMs) as serverless SQL/Python functions; there are no integrations to set up, data to move or GPUs to provision. In Snowflake Cortex, there are task-specific functions that teams can use to quickly and cost-effectively execute complex tasks, such as translation, sentiment analysis and summarization. Additionally, to build custom apps, teams can use the complete function to run custom prompts using LLMs from Mistral AI, Meta and Google. Learn more. Streamlit Streamlit 1.26 – in public preview We’re excited to announce support for Streamlit version 1.26 within Snowflake. This update, in preview, expands your options for building data apps directly in Snowflake’s secure environment. Now you can leverage the latest features and functionalities available in Streamlit 1.26.0 — including st.chat_input and st.chat_message, two powerful primitives for creating conversational interfaces within your data apps. This addition allows users to interact with your data applications using natural language, making them more accessible and user-friendly. You can also utilize the new features of Streamlit 1.26.0 to create even more interactive and informative data visualizations and dashboards. To learn more and get started, head over to the Snowflake documentation. Snowflake Horizon Sensitive Data Custom Classification – in public preview In addition to using standard classifiers in Snowflake, customers can now also write their own classifiers using SQL with custom logic to define what data is sensitive to their organization. This is an important enhancement to data classification and provides the necessary extensibility that customers need to detect and classify more of their data. Learn more. Data Quality Monitoring – in public preview Data Quality Monitoring is a built-in solution with out-of-the-box metrics, like null counts, time since the object was last updated and count of rows inserted into an object. Customers can even create custom metrics to monitor the quality of data. They can then effectively monitor and report on data quality by defining the frequency it is automatically measured and configure alerts to receive email notifications when quality thresholds are violated. Learn more. Snowflake Data Clean Rooms – generally available in select regions Snowflake Data Clean Rooms allow customers to unlock insights and value through secure data collaboration. Launched as a Snowflake Native App on Snowflake Marketplace, Snowflake Data Clean Rooms are now generally available to customers in AWS East, AWS West and Azure West. Snowflake Data Clean Rooms make it easy to build and use data clean rooms for both technical and non-technical users, with no additional access fees set by Snowflake. Find out more in this blog. DevOps on Snowflake Snowflake CLI – public preview The new Snowflake CLI is an open source tool that empowers developers with a flexible and extensible interface for managing the end-to-end lifecycle of applications across various workloads (Snowpark, Snowpark Container Services, Snowflake Native Applications and Streamlit in Snowflake). It offers features such as user-defined functions, stored procedures, Streamlit integration and direct SQL execution. Learn more. Snowflake Marketplace Snowflake customers can tap into Snowflake Marketplace for access to more than 2,500 live and ready-to-query third-party data, apps and AI products all in one place (as of April 10, 2024). Here are all the providers who launched on Marketplace in March: AI/ML Products Brillersys – Time Series Data Generator Atscale, Inc. – Semantic Modeling Data paretos GmbH – Demand Forecasting App Connectors/SaaS Data HALitics – eCommerce Platform Connector Developer Tools DataOps.live – CI/CD, Automation and DataOps Data Governance, Quality and Cost Optimization Select Labs US Inc. – Snowflake Performance & Cost Optimization Foreground Data Solutions Inc – PII Data Detector CareEvolution – Data Format Transformation Merse, Inc – Snowflake Performance & Cost Optimization Qbrainx – Snowflake Performance & Cost Optimization Yuki – Snowflake Performance Optimization DATAN3RD LLC – Data Quality App Third-Party Data Providers Upper Hand – Sports Facilities & Athletes Data Sporting Group – Sportsbook Data Quiet Data – UK Company Data Manifold Data Mining – Demographics Data in Canada SESAMm – ESG Controversy Data KASPR Datahaus – Internet Quality & Anomaly Data Blitzscaling – Blockchain Data Starlitics – ETF and Mutual Fund Data SFR Analytics – Geographic Data SignalRank – Startup Data GfK SE – Purchasing Power Data —- Forward-Looking Statement This post contains express and implied forward-looking statements, including statements regarding (i) Snowflake’s business strategy, (ii) Snowflake’s products, services, and technology offerings, including those that are under development or not generally available, (iii) market growth, trends, and competitive considerations, and (iv) the integration, interoperability, and availability of Snowflake’s products with and on third-party platforms. These forward-looking statements are subject to a number of risks, uncertainties, and assumptions, including those described under the heading “Risk Factors” and elsewhere in the Quarterly Reports on Form 10-Q and Annual Reports of Form 10-K that Snowflake files with the Securities and Exchange Commission. In light of these risks, uncertainties, and assumptions, actual results could differ materially and adversely from those anticipated or implied in the forward-looking statements. As a result, you should not rely on any forward-looking statements as predictions of future events. © 2024 Snowflake Inc. All rights reserved. Snowflake, the Snowflake logo, and all other Snowflake product, feature, and service names mentioned herein are registered trademarks or trademarks of Snowflake Inc. in the United States and other countries. All other brand names or logos mentioned or used herein are for identification purposes only and may be the trademarks of their respective holder(s). Snowflake may not be associated with, or be sponsored or endorsed by, any such holder(s). The post New Snowflake Features Released in March 2024 appeared first on Snowflake. View the full article

April 16, 2024
- 1
- snowflake cortex
- llms
- (and 9 more)
  Tagged with:

data quality Data Quality Toolkit review

RudderStack posted a topic in Databases, Data Engineering & Data Science

A summary of our Data Quality Toolkit, a set of features to help you guarantee customer data quality from the source. View the full article

data quality Data quality best practices: Bridging the dev data divide

RudderStack posted a topic in Databases, Data Engineering & Data Science

How to bridge the dev / data divide through alignment, collaboration, early enforcement, and transparency.View the full article

amazon datazone Amazon DataZone launches integration with AWS Glue Data Quality

Amazon Web Services posted a topic in Databases, Data Engineering & Data Science

Amazon DataZone is used by customers to catalog, discover, analyze, share, and govern data at scale across organizational boundaries with governance and access controls. Today, Amazon DataZone launches integration with AWS Glue Data Quality and offers APIs to integrate data quality metrics from third party data quality solutions. This integration helps Amazon DataZone customers gain trust in their data and make confident business decisions. View the full article

April 3, 2024
- aws glue
- aws glue data quality
- (and 1 more)
  Tagged with:

amazon datazone Amazon DataZone now integrates with AWS Glue Data Quality and external data quality solutions

Amazon Web Services posted a topic in Databases, Data Engineering & Data Science

Today, we are pleased to announce that Amazon DataZone is now able to present data quality information for data assets. This information empowers end-users to make informed decisions as to whether or not to use specific assets. Many organizations already use AWS Glue Data Quality to define and enforce data quality rules on their data, validate data against predefined rules, track data quality metrics, and monitor data quality over time using artificial intelligence (AI). Other organizations monitor the quality of their data through third-party solutions. Amazon DataZone now integrates directly with AWS Glue to display data quality scores for AWS Glue Data Catalog assets. Additionally, Amazon DataZone now offers APIs for importing data quality scores from external systems. In this post, we discuss the latest features of Amazon DataZone for data quality, the integration between Amazon DataZone and AWS Glue Data Quality and how you can import data quality scores produced by external systems into Amazon DataZone via API. Challenges One of the most common questions we get from customers is related to displaying data quality scores in the Amazon DataZone business data catalog to let business users have visibility into the health and reliability of the datasets. As data becomes increasingly crucial for driving business decisions, Amazon DataZone users are keenly interested in providing the highest standards of data quality. They recognize the importance of accurate, complete, and timely data in enabling informed decision-making and fostering trust in their analytics and reporting processes. Amazon DataZone data assets can be updated at varying frequencies. As data is refreshed and updated, changes can happen through upstream processes that put it at risk of not maintaining the intended quality. Data quality scores help you understand if data has maintained the expected level of quality for data consumers to use (through analysis or downstream processes). From a producer’s perspective, data stewards can now set up Amazon DataZone to automatically import the data quality scores from AWS Glue Data Quality (scheduled or on demand) and include this information in the Amazon DataZone catalog to share with business users. Additionally, you can now use new Amazon DataZone APIs to import data quality scores produced by external systems into the data assets. With the latest enhancement, Amazon DataZone users can now accomplish the following: Access insights about data quality standards directly from the Amazon DataZone web portal View data quality scores on various KPIs, including data completeness, uniqueness, accuracy Make sure users have a holistic view of the quality and trustworthiness of their data. In the first part of this post, we walk through the integration between AWS Glue Data Quality and Amazon DataZone. We discuss how to visualize data quality scores in Amazon DataZone, enable AWS Glue Data Quality when creating a new Amazon DataZone data source, and enable data quality for an existing data asset. In the second part of this post, we discuss how you can import data quality scores produced by external systems into Amazon DataZone via API. In this example, we use Amazon EMR Serverless in combination with the open source library Pydeequ to act as an external system for data quality. Visualize AWS Glue Data Quality scores in Amazon DataZone You can now visualize AWS Glue Data Quality scores in data assets that have been published in the Amazon DataZone business catalog and that are searchable through the Amazon DataZone web portal. If the asset has AWS Glue Data Quality enabled, you can now quickly visualize the data quality score directly in the catalog search pane. By selecting the corresponding asset, you can understand its content through the readme, glossary terms, and technical and business metadata. Additionally, the overall quality score indicator is displayed in the Asset Details section. A data quality score serves as an overall indicator of a dataset’s quality, calculated based on the rules you define. On the Data quality tab, you can access the details of data quality overview indicators and the results of the data quality runs. The indicators shown on the Overview tab are calculated based on the results of the rulesets from the data quality runs. Each rule is assigned an attribute that contributes to the calculation of the indicator. For example, rules that have the Completeness attribute will contribute to the calculation of the corresponding indicator on the Overview tab. To filter data quality results, choose the Applicable column dropdown menu and choose your desired filter parameter. You can also visualize column-level data quality starting on the Schema tab. When data quality is enabled for the asset, the data quality results become available, providing insightful quality scores that reflect the integrity and reliability of each column within the dataset. When you choose one of the data quality result links, you’re redirected to the data quality detail page, filtered by the selected column. Data quality historical results in Amazon DataZone Data quality can change over time for many reasons: Data formats may change because of changes in the source systems As data accumulates over time, it may become outdated or inconsistent Data quality can be affected by human errors in data entry, data processing, or data manipulation In Amazon DataZone, you can now track data quality over time to confirm reliability and accuracy. By analyzing the historical report snapshot, you can identify areas for improvement, implement changes, and measure the effectiveness of those changes. Enable AWS Glue Data Quality when creating a new Amazon DataZone data source In this section, we walk through the steps to enable AWS Glue Data Quality when creating a new Amazon DataZone data source. Prerequisites To follow along, you should have a domain for Amazon DataZone, an Amazon DataZone project, and a new Amazon DataZone environment (with a DataLakeProfile). For instructions, refer to Amazon DataZone quickstart with AWS Glue data. You also need to define and run a ruleset against your data, which is a set of data quality rules in AWS Glue Data Quality. To set up the data quality rules and for more information on the topic, refer to the following posts: Part 1: Getting started with AWS Glue Data Quality from the AWS Glue Data Catalog Part 2: Getting started with AWS Glue Data Quality for ETL Pipelines Part 3: Set up data quality rules across multiple datasets using AWS Glue Data Quality Part 4: Set up alerts and orchestrate data quality rules with AWS Glue Data Quality Part 5: Visualize data quality score and metrics generated by AWS Glue Data Quality Part 6: Measure performance of AWS Glue Data Quality for ETL pipelines After you create the data quality rules, make sure that Amazon DataZone has the permissions to access the AWS Glue database managed through AWS Lake Formation. For instructions, see Configure Lake Formation permissions for Amazon DataZone. In our example, we have configured a ruleset against a table containing patient data within a healthcare synthetic dataset generated using Synthea. Synthea is a synthetic patient generator that creates realistic patient data and associated medical records that can be used for testing healthcare software applications. The ruleset contains 27 individual rules (one of them failing), so the overall data quality score is 96%. If you use Amazon DataZone managed policies, there is no action needed because these will get automatically updated with the needed actions. Otherwise, you need to allow Amazon DataZone to have the required permissions to list and get AWS Glue Data Quality results, as shown in the Amazon DataZone user guide. Create a data source with data quality enabled In this section, we create a data source and enable data quality. You can also update an existing data source to enable data quality. We use this data source to import metadata information related to our datasets. Amazon DataZone will also import data quality information related to the (one or more) assets contained in the data source. On the Amazon DataZone console, choose Data sources in the navigation pane. Choose Create data source. For Name, enter a name for your data source. For Data source type, select AWS Glue. For Environment, choose your environment. For Database name, enter a name for the database. For Table selection criteria, choose your criteria. Choose Next. For Data quality, select Enable data quality for this data source. If data quality is enabled, Amazon DataZone will automatically fetch data quality scores from AWS Glue at each data source run. Choose Next. Now you can run the data source. While running the data source, Amazon DataZone imports the last 100 AWS Glue Data Quality run results. This information is now visible on the asset page and will be visible to all Amazon DataZone users after publishing the asset. Enable data quality for an existing data asset In this section, we enable data quality for an existing asset. This might be useful for users that already have data sources in place and want to enable the feature afterwards. Prerequisites To follow along, you should have already run the data source and produced an AWS Glue table data asset. Additionally, you should have defined a ruleset in AWS Glue Data Quality over the target table in the Data Catalog. For this example, we ran the data quality job multiple times against the table, producing the related AWS Glue Data Quality scores, as shown in the following screenshot. Import data quality scores into the data asset Complete the following steps to import the existing AWS Glue Data Quality scores into the data asset in Amazon DataZone: Within the Amazon DataZone project, navigate to the Inventory data pane and choose the data source. If you choose the Data quality tab, you can see that there’s still no information on data quality because AWS Glue Data Quality integration is not enabled for this data asset yet. On the Data quality tab, choose Enable data quality. In the Data quality section, select Enable data quality for this data source. Choose Save. Now, back on the Inventory data pane, you can see a new tab: Data quality. On the Data quality tab, you can see data quality scores imported from AWS Glue Data Quality. Ingest data quality scores from an external source using Amazon DataZone APIs Many organizations already use systems that calculate data quality by performing tests and assertions on their datasets. Amazon DataZone now supports importing third-party originated data quality scores via API, allowing users that navigate the web portal to view this information. In this section, we simulate a third-party system pushing data quality scores into Amazon DataZone via APIs through Boto3 (Python SDK for AWS). For this example, we use the same synthetic dataset as earlier, generated with Synthea. The following diagram illustrates the solution architecture. The workflow consists of the following steps: Read a dataset of patients in Amazon Simple Storage Service (Amazon S3) directly from Amazon EMR using Spark. The dataset is created as a generic S3 asset collection in Amazon DataZone. In Amazon EMR, perform data validation rules against the dataset. The metrics are saved in Amazon S3 to have a persistent output. Use Amazon DataZone APIs through Boto3 to push custom data quality metadata. End-users can see the data quality scores by navigating to the data portal. Prerequisites We use Amazon EMR Serverless and Pydeequ to run a fully managed Spark environment. To learn more about Pydeequ as a data testing framework, see Testing Data quality at scale with Pydeequ. To allow Amazon EMR to send data to the Amazon DataZone domain, make sure that the IAM role used by Amazon EMR has the permissions to do the following: Read from and write to the S3 buckets Call the post_time_series_data_points action for Amazon DataZone: { "Version": "2012-10-17", "Statement": [ { "Sid": "Statement1", "Effect": "Allow", "Action": [ "datazone:PostTimeSeriesDataPoints" ], "Resource": [ "<datazone_domain_arn>" ] } ] } Make sure that you added the EMR role as a project member in the Amazon DataZone project. On the Amazon DataZone console, navigate to the Project members page and choose Add members. Add the EMR role as a contributor. Ingest and analyze PySpark code In this section, we analyze the PySpark code that we use to perform data quality checks and send the results to Amazon DataZone. You can download the complete PySpark script. To run the script entirely, you can submit a job to EMR Serverless. The service will take care of scheduling the job and automatically allocating the resources needed, enabling you to track the job run statuses throughout the process. You can submit a job to EMR within the Amazon EMR console using EMR Studio or programmatically, using the AWS CLI or using one of the AWS SDKs. In Apache Spark, a SparkSession is the entry point for interacting with DataFrames and Spark’s built-in functions. The script will start initializing a SparkSession: with SparkSession.builder.appName("PatientsDataValidation") \ .config("spark.jars.packages", pydeequ.deequ_maven_coord) \ .config("spark.jars.excludes", pydeequ.f2j_maven_coord) \ .getOrCreate() as spark: We read a dataset from Amazon S3. For increased modularity, you can use the script input to refer to the S3 path: s3inputFilepath = sys.argv[1] s3outputLocation = sys.argv[2] df = spark.read.format("csv") \ .option("header", "true") \ .option("inferSchema", "true") \ .load(s3inputFilepath) #s3://<bucket_name>/patients/patients.csv Next, we set up a metrics repository. This can be helpful to persist the run results in Amazon S3. metricsRepository = FileSystemMetricsRepository(spark, s3_write_path) Pydeequ allows you to create data quality rules using the builder pattern, which is a well-known software engineering design pattern, concatenating instruction to instantiate a VerificationSuite object: key_tags = {'tag': 'patient_df'} resultKey = ResultKey(spark, ResultKey.current_milli_time(), key_tags) check = Check(spark, CheckLevel.Error, "Integrity checks") checkResult = VerificationSuite(spark) \ .onData(df) \ .useRepository(metricsRepository) \ .addCheck( check.hasSize(lambda x: x >= 1000) \ .isComplete("birthdate") \ .isUnique("id") \ .isComplete("ssn") \ .isComplete("first") \ .isComplete("last") \ .hasMin("healthcare_coverage", lambda x: x == 1000.0)) \ .saveOrAppendResult(resultKey) \ .run() checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult) checkResult_df.show() The following is the output for the data validation rules: +----------------+-----------+------------+----------------------------------------------------+-----------------+----------------------------------------------------+ |check |check_level|check_status|constraint |constraint_status|constraint_message | +----------------+-----------+------------+----------------------------------------------------+-----------------+----------------------------------------------------+ |Integrity checks|Error |Error |SizeConstraint(Size(None)) |Success | | |Integrity checks|Error |Error |CompletenessConstraint(Completeness(birthdate,None))|Success | | |Integrity checks|Error |Error |UniquenessConstraint(Uniqueness(List(id),None)) |Success | | |Integrity checks|Error |Error |CompletenessConstraint(Completeness(ssn,None)) |Success | | |Integrity checks|Error |Error |CompletenessConstraint(Completeness(first,None)) |Success | | |Integrity checks|Error |Error |CompletenessConstraint(Completeness(last,None)) |Success | | |Integrity checks|Error |Error |MinimumConstraint(Minimum(healthcare_coverage,None))|Failure |Value: 0.0 does not meet the constraint requirement!| +----------------+-----------+------------+----------------------------------------------------+-----------------+----------------------------------------------------+ At this point, we want to insert these data quality values in Amazon DataZone. To do so, we use the post_time_series_data_points function in the Boto3 Amazon DataZone client. The PostTimeSeriesDataPoints DataZone API allows you to insert new time series data points for a given asset or listing, without creating a new revision. At this point, you might also want to have more information on which fields are sent as input for the API. You can use the APIs to obtain the specification for Amazon DataZone form types; in our case, it’s amazon.datazone.DataQualityResultFormType. You can also use the AWS CLI to invoke the API and display the form structure: aws datazone get-form-type --domain-identifier <your_domain_id> --form-type-identifier amazon.datazone.DataQualityResultFormType --region <domain_region> --output text --query 'model.smithy' This output helps identify the required API parameters, including fields and value limits: $version: "2.0" namespace amazon.datazone structure DataQualityResultFormType { @amazon.datazone#timeSeriesSummary @range(min: 0, max: 100) passingPercentage: Double @amazon.datazone#timeSeriesSummary evaluationsCount: Integer evaluations: EvaluationResults } @length(min: 0, max: 2000) list EvaluationResults { member: EvaluationResult } @length(min: 0, max: 20) list ApplicableFields { member: String } @length(min: 0, max: 20) list EvaluationTypes { member: String } enum EvaluationStatus { PASS, FAIL } string EvaluationDetailType map EvaluationDetails { key: EvaluationDetailType value: String } structure EvaluationResult { description: String types: EvaluationTypes applicableFields: ApplicableFields status: EvaluationStatus details: EvaluationDetails } To send the appropriate form data, we need to convert the Pydeequ output to match the DataQualityResultsFormType contract. This can be achieved with a Python function that processes the results. For each DataFrame row, we extract information from the constraint column. For example, take the following code: CompletenessConstraint(Completeness(birthdate,None)) We convert it to the following: { "constraint": "CompletenessConstraint", "statisticName": "Completeness_custom", "column": "birthdate" } Make sure to send an output that matches the KPIs that you want to track. In our case, we are appending _custom to the statistic name, resulting in the following format for KPIs: Completeness_custom Uniqueness_custom In a real-world scenario, you might want to set a value that matches with your data quality framework in relation to the KPIs that you want to track in Amazon DataZone. After applying a transformation function, we have a Python object for each rule evaluation: ..., { 'applicableFields': ["healthcare_coverage"], 'types': ["Minimum_custom"], 'status': 'FAIL', 'description': 'MinimumConstraint - Minimum - Value: 0.0 does not meet the constraint requirement!' },... We also use the constraint_status column to compute the overall score: (number of success / total number of evaluation) * 100 In our example, this results in a passing percentage of 85.71%. We set this value in the passingPercentage input field along with the other information related to the evaluations in the input of the Boto3 method post_time_series_data_points: import boto3 # Instantiate the client library to communicate with Amazon DataZone Service # datazone = boto3.client( service_name='datazone', region_name=<Region(String) example: us-east-1> ) # Perform the API operation to push the Data Quality information to Amazon DataZone # datazone.post_time_series_data_points( domainIdentifier=<DataZone domain ID>, entityIdentifier=<DataZone asset ID>, entityType='ASSET', forms=[ { "content": json.dumps({ "evaluationsCount":<Number of evaluations (number)>, "evaluations": [<List of objects { 'description': <Description (String)>, 'applicableFields': [<List of columns involved (String)>], 'types': [<List of KPIs (String)>], 'status': <FAIL/PASS (string)> }> ], "passingPercentage":<Score (number)> }), "formName": <Form name(String) example: PydeequRuleSet1>, "typeIdentifier": "amazon.datazone.DataQualityResultFormType", "timestamp": <Date (timestamp)> } ] ) Boto3 invokes the Amazon DataZone APIs. In these examples, we used Boto3 and Python, but you can choose one of the AWS SDKs developed in the language you prefer. After setting the appropriate domain and asset ID and running the method, we can check on the Amazon DataZone console that the asset data quality is now visible on the asset page. We can observe that the overall score matches with the API input value. We can also see that we were able to add customized KPIs on the overview tab through custom types parameter values. With the new Amazon DataZone APIs, you can load data quality rules from third-party systems into a specific data asset. With this capability, Amazon DataZone allows you to extend the types of indicators present in AWS Glue Data Quality (such as completeness, minimum, and uniqueness) with custom indicators. Clean up We recommend deleting any potentially unused resources to avoid incurring unexpected costs. For example, you can delete the Amazon DataZone domain and the EMR application you created during this process. Conclusion In this post, we highlighted the latest features of Amazon DataZone for data quality, empowering end-users with enhanced context and visibility into their data assets. Furthermore, we delved into the seamless integration between Amazon DataZone and AWS Glue Data Quality. You can also use the Amazon DataZone APIs to integrate with external data quality providers, enabling you to maintain a comprehensive and robust data strategy within your AWS environment. To learn more about Amazon DataZone, refer to the Amazon DataZone User Guide. About the Authors Andrea Filippo is a Partner Solutions Architect at AWS supporting Public Sector partners and customers in Italy. He focuses on modern data architectures and helping customers accelerate their cloud journey with serverless technologies. Emanuele is a Solutions Architect at AWS, based in Italy, after living and working for more than 5 years in Spain. He enjoys helping large companies with the adoption of cloud technologies, and his area of expertise is mainly focused on Data Analytics and Data Management. Outside of work, he enjoys traveling and collecting action figures. Varsha Velagapudi is a Senior Technical Product Manager with Amazon DataZone at AWS. She focuses on improving data discovery and curation required for data analytics. She is passionate about simplifying customers’ AI/ML and analytics journey to help them succeed in their day-to-day tasks. Outside of work, she enjoys nature and outdoor activities, reading, and traveling. View the full article

April 3, 2024
- aws glue
- aws glue data quality
- (and 1 more)
  Tagged with:

data quality Understanding Data Quality and Why Teams Struggle with It

TDS posted a topic in Databases, Data Engineering & Data Science

Data quality: the catch-all term for business logic, reliability, validity, and consistency Continue reading on Towards Data Science » View the full article

March 10, 2024

amazon sagemaker data wrangler Get insights into Data and Data Quality with Amazon SageMaker Data Wrangler

Amazon Web Services posted a topic in Databases, Data Engineering & Data Science

Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization from a single visual interface. With SageMaker Data Wrangler’s data selection tool, you can quickly select data from multiple data sources, such as Amazon S3, Amazon Athena, Amazon Redshift, AWS Lake Formation, Amazon SageMaker Feature Store, Databricks Delta Lake, and Snowflake. View the full article

Sign In

Search the Community

Search By Tags

Search By Author

Content Type

Forums

Calendars

Find results in...

Find results that contain...

Date Created

Start

End

Last Updated

Start

End

Filter by number of...

Minimum number of comments

Minimum number of replies

Minimum number of reviews

Minimum number of views

Joined

Start

End

Group

Website URL

LinkedIn Profile URL

About Me

Cloud Platforms

Cloud Experience

Development Experience

Current Role

Skills

Certifications

Favourite Tools

Interests

Forum Statistics