Showing results for tags 'data lakes'.
-
In the early 2000s, organizations started dealing with more semi-structured and unstructured data, including images, videos, log files, text, and sensor data. They needed a storage solution that was more flexible than a data warehouse. That’s when the data lake emerged, striving to be one of the most beneficial platforms for modern data […] View the full article
-
Amazon DataZone is used by customers to catalog, discover, analyze, share, and govern data at scale across organizational boundaries with governance and access controls. Today, Amazon DataZone has introduced an integration with AWS Lake Formation hybrid mode. This integration enables customers to easily publish and share their AWS Glue tables through Amazon DataZone, without the need to register them in AWS Lake Formation first. Hybrid mode allows customers to start managing permissions on their AWS Glue tables through AWS Lake Formation, while continuing to maintain any existing IAM permissions on these tables. View the full article
-
Tagged with: aws lake formation, integration (and 1 more)
-
This post is co-written with Andries Engelbrecht and Scott Teal from Snowflake.

Businesses are constantly evolving, and data leaders are challenged every day to meet new requirements. For many enterprises and large organizations, it is not feasible to have one processing engine or tool to deal with the various business requirements. They understand that a one-size-fits-all approach no longer works, and recognize the value in adopting scalable, flexible tools and open data formats to support interoperability in a modern data architecture to accelerate the delivery of new solutions.

Customers are using AWS and Snowflake to develop purpose-built data architectures that provide the performance required for modern analytics and artificial intelligence (AI) use cases. Implementing these solutions requires data sharing between purpose-built data stores. This is why Snowflake and AWS are delivering enhanced support for Apache Iceberg to enable and facilitate data interoperability between data services. Apache Iceberg is an open-source table format that provides reliability, simplicity, and high performance for large datasets with transactional integrity between various processing engines.

In this post, we discuss the following:

- Advantages of Iceberg tables for data lakes
- Two architectural patterns for sharing Iceberg tables between AWS and Snowflake:
  - Manage your Iceberg tables with AWS Glue Data Catalog
  - Manage your Iceberg tables with Snowflake
- The process of converting existing data lake tables to Iceberg tables without copying the data

Now that you have a high-level understanding of the topics, let’s dive into each of them in detail.

Advantages of Apache Iceberg

Apache Iceberg is a distributed, community-driven, Apache 2.0-licensed, 100% open-source data table format that helps simplify data processing on large datasets stored in data lakes.
Data engineers use Apache Iceberg because it’s fast, efficient, and reliable at any scale and keeps records of how datasets change over time. Apache Iceberg offers integrations with popular data processing frameworks such as Apache Spark, Apache Flink, Apache Hive, Presto, and more. Iceberg tables maintain metadata to abstract large collections of files, providing data management features including time travel, rollback, data compaction, and full schema evolution, reducing management overhead.

Originally developed at Netflix before being open sourced to the Apache Software Foundation, Apache Iceberg was a blank-slate design to solve common data lake challenges like user experience, reliability, and performance. It is now supported by a robust community of developers focused on continually improving the project and adding new features that serve real user needs and provide optionality.

Transactional data lakes built on AWS and Snowflake

Snowflake provides various integrations for Iceberg tables with multiple storage options, including Amazon S3, and multiple catalog options, including AWS Glue Data Catalog and Snowflake. AWS provides integrations for various AWS services with Iceberg tables as well, including AWS Glue Data Catalog for tracking table metadata. Combining Snowflake and AWS gives you multiple options to build out a transactional data lake for analytical and other use cases such as data sharing and collaboration. By adding a metadata layer to data lakes, you get a better user experience, simplified management, and improved performance and reliability on very large datasets.

Manage your Iceberg table with AWS Glue

You can use AWS Glue to ingest, catalog, transform, and manage the data on Amazon Simple Storage Service (Amazon S3). AWS Glue is a serverless data integration service that allows you to visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes in Iceberg format.
With AWS Glue, you can discover and connect to more than 70 diverse data sources and manage your data in a centralized data catalog. Snowflake integrates with AWS Glue Data Catalog to access the Iceberg table catalog and the files on Amazon S3 for analytical queries. This greatly improves performance and reduces compute cost in comparison to external tables on Snowflake, because the additional metadata improves pruning in query plans.

You can use this same integration to take advantage of the data sharing and collaboration capabilities in Snowflake. This can be very powerful if you have data in Amazon S3 and need to enable Snowflake data sharing with other business units, partners, suppliers, or customers.

The following architecture diagram provides a high-level overview of this pattern.

The workflow includes the following steps:

- AWS Glue extracts data from applications, databases, and streaming sources. AWS Glue then transforms it and loads it into the data lake in Amazon S3 in Iceberg table format, while inserting and updating the metadata about the Iceberg table in AWS Glue Data Catalog.
- The AWS Glue crawler generates and updates Iceberg table metadata and stores it in AWS Glue Data Catalog for existing Iceberg tables on an S3 data lake.
- Snowflake integrates with AWS Glue Data Catalog to retrieve the snapshot location.
- In the event of a query, Snowflake uses the snapshot location from AWS Glue Data Catalog to read Iceberg table data in Amazon S3.
- Snowflake can query across Iceberg and Snowflake table formats. You can share data for collaboration with one or more accounts in the same Snowflake region. You can also use data in Snowflake for visualization using Amazon QuickSight, or use it for machine learning (ML) and artificial intelligence (AI) purposes with Amazon SageMaker.
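As a rough illustration of the first step in the workflow above (a Spark-based job writing Iceberg tables to Amazon S3 while tracking them in AWS Glue Data Catalog), the sketch below assembles the Spark settings that Apache Iceberg uses to register a Glue-backed catalog. The catalog name `glue_catalog`, the bucket `example-bucket`, and the helper `iceberg_glue_catalog_conf` are assumptions made for illustration and do not come from the post; the Snowflake side of the integration is not shown.

```python
def iceberg_glue_catalog_conf(catalog_name, warehouse_uri):
    """Return Spark settings that register an Iceberg catalog backed by AWS Glue.

    catalog_name and warehouse_uri are placeholders chosen by the caller.
    """
    if not warehouse_uri.startswith("s3://"):
        raise ValueError("warehouse_uri must be an s3:// location")
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables Iceberg SQL extensions such as CALL procedures.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers the named catalog as an Iceberg catalog.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Table metadata pointers are tracked in AWS Glue Data Catalog.
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        # Data and metadata files are read from and written to Amazon S3.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse_uri,
    }


# Hypothetical names, for illustration only.
conf = iceberg_glue_catalog_conf("glue_catalog", "s3://example-bucket/warehouse/")
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

Keeping the settings in a plain dictionary makes it easy to pass them to a Spark session builder or to join them into a job's `--conf` argument, depending on how the job is launched.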
Manage your Iceberg table with Snowflake

A second pattern also provides interoperability across AWS and Snowflake, but implements data engineering pipelines for ingestion and transformation to Snowflake. In this pattern, data is loaded to Iceberg tables by Snowflake through integrations with AWS services like AWS Glue or through other sources like Snowpipe. Snowflake then writes data directly to Amazon S3 in Iceberg format for downstream access by Snowflake and various AWS services, and Snowflake manages the Iceberg catalog that tracks snapshot locations across tables for AWS services to access.

Like the previous pattern, you can use Snowflake-managed Iceberg tables with Snowflake data sharing, but you can also use S3 to share datasets in cases where one party does not have access to Snowflake.

The following architecture diagram provides an overview of this pattern with Snowflake-managed Iceberg tables.

This workflow consists of the following steps:

- In addition to loading data via the COPY command, Snowpipe, and the native Snowflake connector for AWS Glue, you can integrate data via Snowflake Data Sharing.
- Snowflake writes Iceberg tables to Amazon S3 and updates metadata automatically with every transaction.
- Iceberg tables in Amazon S3 are queried by Snowflake for analytical and ML workloads using services like QuickSight and SageMaker.
- Apache Spark services on AWS can access snapshot locations from Snowflake via a Snowflake Iceberg Catalog SDK and directly scan the Iceberg table files in Amazon S3.

Comparing solutions

These two patterns highlight options available to data personas today to maximize their data interoperability between Snowflake and AWS using Apache Iceberg. But which pattern is ideal for your use case?

If you’re already using AWS Glue Data Catalog and only require Snowflake for read queries, then the first pattern can integrate Snowflake with AWS Glue and Amazon S3 to query Iceberg tables.
If you’re not already using AWS Glue Data Catalog and require Snowflake to perform reads and writes, then the second pattern is likely a good solution that allows for storing and accessing data from AWS.

Considering that reads and writes will probably operate on a per-table basis rather than the entire data architecture, it is advisable to use a combination of both patterns.

Migrate existing data lakes to a transactional data lake using Apache Iceberg

You can convert existing Parquet, ORC, and Avro-based data lake tables on Amazon S3 to Iceberg format to reap the benefits of transactional integrity while improving performance and user experience. There are several Iceberg table migration options (SNAPSHOT, MIGRATE, and ADD_FILES) for migrating existing data lake tables in-place to Iceberg format, which is preferable to rewriting all of the underlying data files, a costly and time-consuming effort with large datasets. In this section, we focus on ADD_FILES, because it’s useful for custom migrations.

With the ADD_FILES option, you can use AWS Glue to generate Iceberg metadata and statistics for an existing data lake table and create new Iceberg tables in AWS Glue Data Catalog for future use without needing to rewrite the underlying data. For instructions on generating Iceberg metadata and statistics using AWS Glue, refer to Migrate an existing data lake to a transactional data lake using Apache Iceberg or Convert existing Amazon S3 data lake tables to Snowflake Unmanaged Iceberg tables using AWS Glue.

This option requires that you pause data pipelines while converting the files to Iceberg tables, which is a straightforward process in AWS Glue because the destination just needs to be changed to an Iceberg table.

Conclusion

In this post, you saw the two architecture patterns for implementing Apache Iceberg in a data lake for better interoperability across AWS and Snowflake. We also provided guidance on migrating existing data lake tables to Iceberg format.
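To make the ADD_FILES migration option discussed above more concrete, here is a minimal sketch that builds the Spark SQL statement for Apache Iceberg's `add_files` procedure, which registers existing data files in an Iceberg table without rewriting them. The helper `build_add_files_call`, the catalog name, the table name, and the S3 path are hypothetical and used only for illustration. The sketch assumes the target Iceberg table already exists and that the statement would be run from a Spark session with the Iceberg SQL extensions enabled; it is not the exact procedure from the referenced AWS Glue guides.

```python
SUPPORTED_SOURCE_FORMATS = {"parquet", "orc", "avro"}


def build_add_files_call(catalog, target_table, source_path, file_format="parquet"):
    """Build a CALL statement for Iceberg's add_files procedure.

    The statement adds the files under source_path to target_table
    in place, so the underlying data files are not copied.
    """
    if file_format not in SUPPORTED_SOURCE_FORMATS:
        raise ValueError(f"unsupported source format: {file_format}")
    return (
        f"CALL {catalog}.system.add_files("
        f"table => '{target_table}', "
        f"source_table => '`{file_format}`.`{source_path}`')"
    )


# Hypothetical catalog, table, and path names.
statement = build_add_files_call(
    "glue_catalog", "sales_db.orders_iceberg", "s3://example-bucket/orders/"
)
print(statement)
```

In a Spark job the resulting string could be passed to `spark.sql(...)`. As the post notes, pipelines writing to the source location should be paused while the files are registered.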
Sign up for AWS Dev Day on April 10 to get hands-on not only with Apache Iceberg, but also with streaming data pipelines with Amazon Data Firehose and Snowpipe Streaming, and generative AI applications with Streamlit in Snowflake and Amazon Bedrock.

About the Authors

Andries Engelbrecht is a Principal Partner Solutions Architect at Snowflake and works with strategic partners. He is actively engaged with strategic partners like AWS supporting product and service integrations as well as the development of joint solutions with partners. Andries has over 20 years of experience in the field of data and analytics.

Deenbandhu Prasad is a Senior Analytics Specialist at AWS, specializing in big data services. He is passionate about helping customers build modern data architectures on the AWS Cloud. He has helped customers of all sizes implement data management, data warehouse, and data lake solutions.

Brian Dolan joined Amazon as a Military Relations Manager in 2012 after his first career as a Naval Aviator. In 2014, Brian joined Amazon Web Services, where he helped Canadian customers from startups to enterprises explore the AWS Cloud. Most recently, Brian was a member of the Non-Relational Business Development team as a Go-To-Market Specialist for Amazon DynamoDB and Amazon Keyspaces before joining the Analytics Worldwide Specialist Organization in 2022 as a Go-To-Market Specialist for AWS Glue.

Nidhi Gupta is a Sr. Partner Solution Architect at AWS. She spends her days working with customers and partners, solving architectural challenges. She is passionate about data integration and orchestration, serverless and big data processing, and machine learning. Nidhi has extensive experience leading the architecture design and production release and deployments for data workloads.

Scott Teal is a Product Marketing Lead at Snowflake and focuses on data lakes, storage, and governance.

View the full article
-
Tagged with: data lakes, amazon s3 (and 2 more)
-
A comparative overview of data warehouses, data lakes, and data marts to help you make informed decisions on data storage solutions for your data architecture. View the full article
-
Tagged with: data warehouses, data lakes (and 1 more)
-
I’m excited to announce today a new capability of Amazon Managed Streaming for Apache Kafka (Amazon MSK) that allows you to continuously load data from an Apache Kafka cluster to Amazon Simple Storage Service (Amazon S3). We use Amazon Kinesis Data Firehose—an extract, transform, and load (ETL) service—to read data from a Kafka topic, transform the records, and write them to an Amazon S3 destination. Kinesis Data Firehose is entirely managed and you can configure it with just a few clicks in the console. No code or infrastructure is needed. Kafka is commonly used for building real-time data pipelines that reliably move massive amounts of data between systems or applications. It provides a highly scalable and fault-tolerant publish-subscribe messaging system. Many AWS customers have adopted Kafka to capture streaming data such as click-stream events, transactions, IoT events, and application and machine logs, and have applications that perform real-time analytics, run continuous transformations, and distribute this data to data lakes and databases in real time. However, deploying Kafka clusters is not without challenges... View the full article
-
AWS Glue for Apache Spark now supports three open source data lake storage frameworks: Apache Hudi, Apache Iceberg, and Linux Foundation Delta Lake. These frameworks allow you to read and write data in Amazon Simple Storage Service (Amazon S3) in a transactionally consistent manner. AWS Glue is a serverless, scalable data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources. This feature removes the need to install a separate connector and reduces the configuration steps required to use these frameworks in AWS Glue for Apache Spark jobs. View the full article
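As a hedged illustration of the reduced configuration, the sketch below assembles job arguments that select one or more of the three frameworks. It assumes the native support is switched on through the `--datalake-formats` job parameter with the values `hudi`, `delta`, and `iceberg`, and that multiple values are comma-separated; check the AWS Glue documentation for your Glue version before relying on this. The helper `datalake_job_arguments` is hypothetical.

```python
SUPPORTED_FORMATS = ("hudi", "delta", "iceberg")


def datalake_job_arguments(*formats):
    """Build Glue job arguments that enable the requested data lake frameworks."""
    chosen = []
    for name in formats:
        key = name.lower()
        if key not in SUPPORTED_FORMATS:
            raise ValueError(f"unsupported data lake format: {name}")
        if key not in chosen:  # ignore duplicates, keep first-seen order
            chosen.append(key)
    if not chosen:
        raise ValueError("at least one data lake format is required")
    # Assumption: multiple frameworks are passed as a comma-separated value.
    return {"--datalake-formats": ",".join(chosen)}


print(datalake_job_arguments("iceberg"))
print(datalake_job_arguments("Hudi", "delta", "hudi"))
```

The returned dictionary has the shape of the default arguments supplied when creating or updating a Glue job. Each framework may still need its own Spark settings, which are not covered here.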
-
In this post we’ll explore the concepts of data lake, data hub and data lab. There are many opinions and interpretations of these concepts, and they are broadly comparable. In fact, many might say they’re synonymous and we’re just splitting hairs. But let’s look again carefully. We can discern some subtle trends in the way people are doing things, and find distinctions in these expressions.

Welcome to the Data Lake

Lakes are tranquil, large pools of cool water, right? Well, possibly. I grew up in Scotland, where lakes are called lochs, and rumours of monsters that lurk in the depths of ancient lochs abound. Scotland also has salt water sea lochs, full of stinging jellyfish. But one thing is for sure – lakes, lochs, call them what you will – they’re popular places to go fishing.

In current technology vernacular, a data lake is essentially a very large body of cool data, typically hundreds of terabytes to petabytes in size. The data lake differs from other cool storage systems such as MAIDs (Massive Array of Idle Disks), storage vaults and tape archives, because the data remains online and fully accessible on low-cost storage media like Apache HDFS, Ceph, or Amazon Simple Storage Service (S3). This makes it an interesting and cost-effective solution for performing ad-hoc research, analysis and reporting on the aggregated data – essentially enabling data “fishing expeditions”, as well as being the feedstock for applications using deep learning or other data-intensive artificial intelligence approaches. The “big data” need not be restored from a tape or extracted from a vault or deep storage solution in order to be queried, which are tasks that usually come with a significant cost.
Data in the lake can take many forms; the most popular format is semi-structured machine data – for example telemetry data (system, application usage and activity logs, user tracking, things like that), log data (weblogs, crash logs, network element logs, application logs, firewall logs, industrial machine data and so on) and data feeds (like stock ticker data, weather data, etc.). Another popular format is system of record (SoR) data – operational database extracts, data warehouse change data capture, and so on. And many data lakes capture vast amounts of unstructured data (free-text – like chat or audio transcriptions, document scans, binary photographs and images like x-rays, binary audio – like call centre recordings, and binary video – like security camera recordings).

It’s also important to know that data lake managers often like to adopt the so-called “schema-on-read” strategy for the datasets forming the lake. Basically, this means the data is stored in the lake untreated, in full fidelity. This might seem to go against all data storage best practices, where data normalization for efficiency and integrity is one of the major tenets. However, the reasoning is sound – the volumes of data involved make guaranteeing integrity via relational modelling hard to achieve whilst still assuring timely access to the data. And any storage efficiency induced savings are massively offset by the upfront labour cost of engineering the data. Finally, treating the data often implies discarding or summarizing data, which may be undesirable since it might preclude future applications and use cases (for example some data mining or AI use cases), so the value of the upfront data modelling and engineering exercise is uncertain.

Whilst residual processing data like weblogs and crashlogs might be considered low value at small scale, in aggregate and over long time spans this kind of data can be extremely valuable input.
For example, the data can be used to drive research and business excellence, as feedstock for new and innovative (e.g. AI) products, and to guide informed business decisions.

It should be noted that data lakes are typically used to store so-called “cool” data – by which we mean data that is infrequently accessed and rarely modified; whilst “hot” data – that is, data that is frequently accessed and updated – is usually stored elsewhere (for example in an OLTP database).

I am fearless, and therefore powerful: the Data Lab

Since the storage cost per GB is pretty low, storage efficiency is less of a concern than accessibility. Exposing data verbatim in a data lake, so that data scientists and analysts can do whatever feature engineering or modelling they need to get the data into shape for a given project or product, drives agility at the cost of dataset duplication. All of this lowers the upfront costs associated with advanced data experimentation and research; and thus places agility, innovation and the rigour of data-driven or empirical business practices within the reach of any organisation – large or small – that has the appetite to build up a data lake.

Cue the data science lab. Data labs are an emerging shared service paradigm – a kind of “knowledge services” team or division, focused on delivering advanced analysis, forecasting, war-gaming, digital twins, machine learning (ML) applications and artificial intelligence (AI) tools. These services are typically delivered as short projects, assisting all parts of the business that may need their services – from marketing to manufacturing, the executive team to the people team. The data lab might therefore make use of a data lake, but it is, according to our definition, a different paradigm.

The nerve center: the Data Hub

Data aggregated in large pools can be quite useful, as we have learned. Not without its share of asset management cost of course, but undoubtedly a useful resource with many opportunities.
And making good use of these data lakes can be accelerated by assembling a skilled team in a data lab. We’ve written about cool data. But what, you’re no doubt wondering, about hot data? What if we want to tap into the data feeds that we have, and use them to make predictions or take informed business decisions based on how we’re doing right now?

In our vocabulary, this is the realm of the data hub. A data hub is a high-capacity, high-throughput integration point, such as an Apache Kafka messaging system, that can be used for monitoring, inspecting, routing, and acting upon data in motion. The idea is that all the evented data feeds that the organisation has are hooked up to the data hub, where data analytics or predictive models execute online on the data.

As the data hub is an online solution acting on data feeds, care should be taken to distinguish between batch data and data feeds. Data hubs are not well suited to processing batch data. Whilst it is possible to use change-data-capture techniques to turn system-of-record style, batch-oriented data into a data feed, unless context is provided this kind of data may deliver minimal business value in the data hub in exchange for a lot of hard work.

Data is or data are?

So there we have it. Subtle, nuanced and perhaps mildly contentious; our definitions of a data lake, a data lab and a data hub are noticeably different. To sum it up:

- Use a data lake when you want to store big data long term but still want to be able to process it for analysis, reporting, research and ML/AI model training
- Use a data lab when you want an expert team of data scientists, engineers and analysts to help you quickly get value from your data
- Use a data hub when you want to have a more real-time operational view on your business and use it to drive automated analysis, predictions, reporting and decisions online using hot-path data

View the full article
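The “schema-on-read” strategy described in the data lake section can be illustrated with a small sketch: raw records are kept untreated, and each consumer projects them onto the fields it needs only when reading. The sample log lines, field names and the helper `read_with_schema` are invented for illustration and do not come from the article; a real lake would read files from HDFS, Ceph or S3 rather than an in-memory list.

```python
import json

# Raw, untreated records as they might land in the lake (hypothetical samples).
RAW_EVENTS = [
    '{"ts": "2024-01-01T00:00:00Z", "level": "ERROR", "msg": "disk full", "host": "web-1"}',
    '{"ts": "2024-01-01T00:00:05Z", "level": "INFO", "msg": "login", "user": "alice"}',
    "not json at all",
]


def read_with_schema(raw_lines, schema):
    """Yield records projected onto `schema` at read time.

    `schema` maps field names to cast functions. Fields missing from a
    record become None, and lines that cannot be parsed are skipped.
    """
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        projected = {}
        for field, cast in schema.items():
            value = record.get(field)
            projected[field] = None if value is None else cast(value)
        yield projected


# Two consumers apply different schemas to the same untouched data.
ops_view = list(read_with_schema(RAW_EVENTS, {"level": str, "msg": str}))
user_view = list(read_with_schema(RAW_EVENTS, {"user": str}))
print(ops_view)
print(user_view)
```

Because nothing is discarded at write time, a later use case can define a third schema over the same raw records, which is the flexibility the article argues for, at the cost of doing the parsing work on every read.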
-
Tagged with: data lakes, data labs (and 1 more)
-