Search the Community
Showing results for tags 's3'.
This post is co-written with Andries Engelbrecht and Scott Teal from Snowflake.

Businesses are constantly evolving, and data leaders are challenged every day to meet new requirements. For many enterprises and large organizations, it is not feasible to have one processing engine or tool to deal with the various business requirements. They understand that a one-size-fits-all approach no longer works, and recognize the value in adopting scalable, flexible tools and open data formats to support interoperability in a modern data architecture to accelerate the delivery of new solutions.

Customers are using AWS and Snowflake to develop purpose-built data architectures that provide the performance required for modern analytics and artificial intelligence (AI) use cases. Implementing these solutions requires data sharing between purpose-built data stores. This is why Snowflake and AWS are delivering enhanced support for Apache Iceberg to enable and facilitate data interoperability between data services. Apache Iceberg is an open-source table format that provides reliability, simplicity, and high performance for large datasets with transactional integrity between various processing engines.

In this post, we discuss the following:

- Advantages of Iceberg tables for data lakes
- Two architectural patterns for sharing Iceberg tables between AWS and Snowflake:
  - Manage your Iceberg tables with AWS Glue Data Catalog
  - Manage your Iceberg tables with Snowflake
- The process of converting existing data lake tables to Iceberg tables without copying the data

Now that you have a high-level understanding of the topics, let's dive into each of them in detail.

Advantages of Apache Iceberg

Apache Iceberg is a distributed, community-driven, Apache 2.0-licensed, 100% open-source data table format that helps simplify data processing on large datasets stored in data lakes. Data engineers use Apache Iceberg because it's fast, efficient, and reliable at any scale, and because it keeps records of how datasets change over time. Apache Iceberg offers integrations with popular data processing frameworks such as Apache Spark, Apache Flink, Apache Hive, Presto, and more.

Iceberg tables maintain metadata to abstract large collections of files, providing data management features including time travel, rollback, data compaction, and full schema evolution, which reduces management overhead. Originally developed at Netflix before being open sourced to the Apache Software Foundation, Apache Iceberg was a blank-slate design to solve common data lake challenges like user experience, reliability, and performance. It is now supported by a robust community of developers focused on continually improving the project and adding new features, serving real user needs and providing them with optionality.

Transactional data lakes built on AWS and Snowflake

Snowflake provides various integrations for Iceberg tables with multiple storage options, including Amazon S3, and multiple catalog options, including AWS Glue Data Catalog and Snowflake. AWS provides integrations for various AWS services with Iceberg tables as well, including AWS Glue Data Catalog for tracking table metadata. Combining Snowflake and AWS gives you multiple options to build out a transactional data lake for analytical and other use cases such as data sharing and collaboration. By adding a metadata layer to data lakes, you get a better user experience, simplified management, and improved performance and reliability on very large datasets.
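To make these table-format features concrete, here is a minimal PySpark sketch of creating an Iceberg table, evolving its schema, and querying an earlier snapshot. The catalog name, warehouse path, table, and snapshot ID are hypothetical, and the session is assumed to have the Iceberg Spark runtime package available.

```python
from pyspark.sql import SparkSession

# Hypothetical Hadoop-style catalog; in practice this could be Glue- or Snowflake-managed
spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3://my-bucket/warehouse/")
    .getOrCreate()
)

spark.sql("CREATE TABLE demo.db.events (id bigint, payload string) USING iceberg")
spark.sql("INSERT INTO demo.db.events VALUES (1, 'hello')")

# Full schema evolution: add a column without rewriting existing data files
spark.sql("ALTER TABLE demo.db.events ADD COLUMN source string")

# Time travel: list snapshots, then query the table as of an earlier one
spark.sql("SELECT snapshot_id, committed_at FROM demo.db.events.snapshots").show()
spark.sql("SELECT * FROM demo.db.events VERSION AS OF 8744736658442914487").show()
```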
Manage your Iceberg table with AWS Glue

You can use AWS Glue to ingest, catalog, transform, and manage the data on Amazon Simple Storage Service (Amazon S3). AWS Glue is a serverless data integration service that allows you to visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes in Iceberg format. With AWS Glue, you can discover and connect to more than 70 diverse data sources and manage your data in a centralized data catalog.

Snowflake integrates with AWS Glue Data Catalog to access the Iceberg table catalog and the files on Amazon S3 for analytical queries. This greatly improves performance and compute cost in comparison to external tables on Snowflake, because the additional metadata improves pruning in query plans. You can use this same integration to take advantage of the data sharing and collaboration capabilities in Snowflake. This can be very powerful if you have data in Amazon S3 and need to enable Snowflake data sharing with other business units, partners, suppliers, or customers.

The following architecture diagram provides a high-level overview of this pattern. The workflow includes the following steps:

1. AWS Glue extracts data from applications, databases, and streaming sources. AWS Glue then transforms it and loads it into the data lake in Amazon S3 in Iceberg table format, while inserting and updating the metadata about the Iceberg table in AWS Glue Data Catalog (see the sketch after this list).
2. The AWS Glue crawler generates and updates Iceberg table metadata and stores it in AWS Glue Data Catalog for existing Iceberg tables on an S3 data lake.
3. Snowflake integrates with AWS Glue Data Catalog to retrieve the snapshot location.
4. In the event of a query, Snowflake uses the snapshot location from AWS Glue Data Catalog to read Iceberg table data in Amazon S3.
5. Snowflake can query across Iceberg and Snowflake table formats. You can share data for collaboration with one or more accounts in the same Snowflake region.
6. You can also use data in Snowflake for visualization using Amazon QuickSight, or use it for machine learning (ML) and artificial intelligence (AI) purposes with Amazon SageMaker.
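As an illustration of step 1, the following is a hedged sketch of the Spark portion of an AWS Glue job that writes an Iceberg table registered in AWS Glue Data Catalog. The catalog alias, database, table, and bucket names are hypothetical, and it assumes a Glue 4.0 job launched with the --datalake-formats job parameter set to iceberg.

```python
from pyspark.sql import SparkSession

# Register an Iceberg catalog backed by AWS Glue Data Catalog and S3
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.warehouse", "s3://my-data-lake/warehouse/")
    .config("spark.sql.catalog.glue.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "2024-01-01")], ["order_id", "order_date"])

# Writing creates/updates the table and its metadata in Glue Data Catalog;
# Snowflake can then query it through a Glue catalog integration (steps 3-4)
df.writeTo("glue.sales.orders").using("iceberg").createOrReplace()
```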
Manage your Iceberg table with Snowflake

A second pattern also provides interoperability across AWS and Snowflake, but implements data engineering pipelines for ingestion and transformation to Snowflake. In this pattern, data is loaded to Iceberg tables by Snowflake through integrations with AWS services like AWS Glue or through other sources like Snowpipe. Snowflake then writes data directly to Amazon S3 in Iceberg format for downstream access by Snowflake and various AWS services, and Snowflake manages the Iceberg catalog that tracks snapshot locations across tables for AWS services to access.

Like the previous pattern, you can use Snowflake-managed Iceberg tables with Snowflake data sharing, but you can also use S3 to share datasets in cases where one party does not have access to Snowflake.

The following architecture diagram provides an overview of this pattern with Snowflake-managed Iceberg tables. This workflow consists of the following steps:

1. In addition to loading data via the COPY command, Snowpipe, and the native Snowflake connector for AWS Glue, you can integrate data via Snowflake Data Sharing.
2. Snowflake writes Iceberg tables to Amazon S3 and updates metadata automatically with every transaction.
3. Iceberg tables in Amazon S3 are queried by Snowflake for analytical and ML workloads using services like QuickSight and SageMaker.
4. Apache Spark services on AWS can access snapshot locations from Snowflake via the Snowflake Iceberg Catalog SDK and directly scan the Iceberg table files in Amazon S3.

Comparing solutions

These two patterns highlight options available to data personas today to maximize their data interoperability between Snowflake and AWS using Apache Iceberg. But which pattern is ideal for your use case? If you're already using AWS Glue Data Catalog and only require Snowflake for read queries, then the first pattern can integrate Snowflake with AWS Glue and Amazon S3 to query Iceberg tables. If you're not already using AWS Glue Data Catalog and require Snowflake to perform reads and writes, then the second pattern is likely a good solution that allows for storing and accessing data from AWS. Considering that reads and writes will probably operate on a per-table basis rather than across the entire data architecture, it is advisable to use a combination of both patterns.

Migrate existing data lakes to a transactional data lake using Apache Iceberg

You can convert existing Parquet-, ORC-, and Avro-based data lake tables on Amazon S3 to Iceberg format to reap the benefits of transactional integrity while improving performance and user experience. There are several Iceberg table migration options (SNAPSHOT, MIGRATE, and ADD_FILES) for migrating existing data lake tables in place to Iceberg format, which is preferable to rewriting all of the underlying data files, a costly and time-consuming effort with large datasets. In this section, we focus on ADD_FILES, because it's useful for custom migrations. With the ADD_FILES option, you can use AWS Glue to generate Iceberg metadata and statistics for an existing data lake table and create new Iceberg tables in AWS Glue Data Catalog for future use, without needing to rewrite the underlying data. For instructions on generating Iceberg metadata and statistics using AWS Glue, refer to Migrate an existing data lake to a transactional data lake using Apache Iceberg or Convert existing Amazon S3 data lake tables to Snowflake Unmanaged Iceberg tables using AWS Glue. This option requires that you pause data pipelines while converting the files to Iceberg tables, which is a straightforward process in AWS Glue because the destination just needs to be changed to an Iceberg table.
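As a rough sketch of what an ADD_FILES-style migration looks like with Iceberg's add_files Spark procedure (table and path names are hypothetical, and the "glue" catalog is assumed to be configured as in the earlier sketch):

```python
# Create the target Iceberg table, then attach existing Parquet files to it
# without rewriting them; add_files generates Iceberg metadata and statistics
spark.sql("""
    CREATE TABLE glue.sales.orders_iceberg (order_id bigint, order_date string)
    USING iceberg
""")
spark.sql("""
    CALL glue.system.add_files(
        table => 'sales.orders_iceberg',
        source_table => '`parquet`.`s3://my-data-lake/legacy/orders/`'
    )
""")
```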
Conclusion

In this post, you saw two architectural patterns for implementing Apache Iceberg in a data lake for better interoperability across AWS and Snowflake. We also provided guidance on migrating existing data lake tables to Iceberg format.

Sign up for AWS Dev Day on April 10 to get hands-on not only with Apache Iceberg, but also with streaming data pipelines with Amazon Data Firehose and Snowpipe Streaming, and generative AI applications with Streamlit in Snowflake and Amazon Bedrock.

About the Authors

Andries Engelbrecht is a Principal Partner Solutions Architect at Snowflake. He is actively engaged with strategic partners like AWS, supporting product and service integrations as well as the development of joint solutions with partners. Andries has over 20 years of experience in the field of data and analytics.

Deenbandhu Prasad is a Senior Analytics Specialist at AWS, specializing in big data services. He is passionate about helping customers build modern data architectures on the AWS Cloud. He has helped customers of all sizes implement data management, data warehouse, and data lake solutions.

Brian Dolan joined Amazon as a Military Relations Manager in 2012 after his first career as a Naval Aviator. In 2014, Brian joined Amazon Web Services, where he helped Canadian customers from startups to enterprises explore the AWS Cloud. Most recently, Brian was a member of the Non-Relational Business Development team as a Go-To-Market Specialist for Amazon DynamoDB and Amazon Keyspaces before joining the Analytics Worldwide Specialist Organization in 2022 as a Go-To-Market Specialist for AWS Glue.

Nidhi Gupta is a Senior Partner Solutions Architect at AWS. She spends her days working with customers and partners, solving architectural challenges. She is passionate about data integration and orchestration, serverless and big data processing, and machine learning. Nidhi has extensive experience leading the architecture design and production release and deployments for data workloads.

Scott Teal is a Product Marketing Lead at Snowflake and focuses on data lakes, storage, and governance.

View the full article
Tagged with: data lakes, amazon s3 (and 2 more)
Amazon Kinesis Data Firehose now integrates with Amazon MSK to offer a fully managed solution that simplifies the processing and delivery of streaming data from Amazon MSK Apache Kafka clusters into data lakes stored on Amazon S3. With just a few clicks, Amazon MSK customers can continuously load data from their desired Apache Kafka clusters to their Amazon S3 bucket, eliminating the need to develop or run their own connector applications. View the full article
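A hedged boto3 sketch of what setting this up programmatically might look like (all names, ARNs, and roles are placeholders; the console flow achieves the same result):

```python
import boto3

firehose = boto3.client("firehose")
firehose.create_delivery_stream(
    DeliveryStreamName="msk-to-s3",
    DeliveryStreamType="MSKAsSource",  # read from an Amazon MSK cluster
    MSKSourceConfiguration={
        "MSKClusterARN": "arn:aws:kafka:us-east-1:111122223333:cluster/demo/uuid",
        "TopicName": "clickstream",
        "AuthenticationConfiguration": {
            "RoleARN": "arn:aws:iam::111122223333:role/firehose-msk-role",
            "Connectivity": "PRIVATE",
        },
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-s3-role",
        "BucketARN": "arn:aws:s3:::my-delivery-bucket",
    },
)
```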
Amazon S3 now provides the Last-Modified time of delete markers in the response headers of S3 Head and Get APIs. For buckets that use S3 Versioning, when a customer issues a delete request without a versionId specified, S3 adds a delete marker on the latest version of the object to protect data from accidental deletions. With Last-Modified information added to S3 Head and Get API response headers for delete markers, customers can more easily track changes in their buckets. View the full article
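For example, a HEAD request on a key whose latest version is a delete marker returns a 404 whose headers include the delete marker flag and, now, its Last-Modified time. A small boto3 sketch (bucket and key are placeholders):

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
try:
    s3.head_object(Bucket="my-versioned-bucket", Key="reports/q1.csv")
except ClientError as err:
    headers = err.response["ResponseMetadata"]["HTTPHeaders"]
    if headers.get("x-amz-delete-marker") == "true":
        # Newly surfaced: when the delete marker was created
        print("Delete marker created:", headers.get("last-modified"))
```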
Managing large amounts of data can be overwhelming, but with the right tools and knowledge, it doesn't have to be. Amazon Simple Storage Service (S3), an object storage service from Amazon, provides industry-leading scalability, data availability, security, and performance. It's one of Amazon's most popular services with a variety of use cases ranging from static website hosting to storing media files and CI/CD pipeline artifacts. This blog post, based on the AWS S3 course offered by KodeKloud, will help you understand how AWS S3 works and its features ... View the full article
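As a taste of the basics such a course covers, here is a minimal boto3 sketch of the core S3 workflow (bucket and key names are placeholders):

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket="my-example-bucket-123456")  # bucket names are globally unique
s3.put_object(Bucket="my-example-bucket-123456", Key="hello.txt", Body=b"Hello, S3!")

obj = s3.get_object(Bucket="my-example-bucket-123456", Key="hello.txt")
print(obj["Body"].read().decode())  # -> Hello, S3!
```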
Tagged with: storage, data management (and 2 more)
This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

A new week starts, and Spring is almost here! If you're curious about AWS news from the previous seven days, I got you covered.

Last Week's Launches

Here are the launches that got my attention last week:

- Amazon S3 – Last week was AWS Pi Day 2023, celebrating 17 years of innovation since Amazon S3 was introduced on March 14, 2006. For the occasion, the team released many new capabilities: S3 Object Lambda now provides aliases that are interchangeable with bucket names and can be used with Amazon CloudFront to tailor content for end users. S3 now supports datasets that are replicated across multiple AWS accounts with cross-account support for S3 Multi-Region Access Points. You can now create and configure replication rules to automatically replicate S3 objects from one AWS Outpost to another. Amazon S3 has also simplified private connectivity from on-premises networks: with private DNS for S3, on-premises applications can use AWS PrivateLink to access S3 over an interface endpoint, while requests from your in-VPC applications access S3 using gateway endpoints. We also released Mountpoint for Amazon S3, a high-performance open-source file client. Read more in the blog. Note that Mountpoint isn't a general-purpose networked file system and comes with some restrictions on file operations.
- Amazon Linux 2023 – Our new Linux-based operating system is now generally available. Sébastien's post is full of tips and info.
- Application Auto Scaling – You can now use arithmetic operations and mathematical functions to customize the metrics used with Target Tracking policies, so you can scale based on your own application-specific metrics. Read how it works with Amazon ECS services.
- AWS Data Exchange for Amazon S3 – Now generally available: you can share and find data files directly from S3 buckets, without the need to create or manage copies of the data.
- Amazon Neptune – Now offers a graph summary API to help understand important metadata about property graphs (PG) and resource description framework (RDF) graphs. Neptune also added support for slow query logs to help identify queries that need performance tuning.
- Amazon OpenSearch Service – The team introduced security analytics, which provides new threat monitoring, detection, and alerting features. The service now supports OpenSearch version 2.5, which adds several new features such as support for Point in Time Search and improvements to observability and geospatial functionality.
- AWS Lake Formation and Apache Hive on Amazon EMR – Introduced fine-grained access controls that allow data administrators to define and enforce table- and column-level security for customers accessing data via Apache Hive running on Amazon EMR.
- Amazon EC2 M1 Mac Instances – You can now update guest environments to a specific or the latest macOS version without having to tear down and recreate the existing macOS environments.
- AWS Chatbot – Now integrates with Microsoft Teams to simplify the way you troubleshoot and operate your AWS resources.
- Amazon GuardDuty RDS Protection for Amazon Aurora – Now generally available to help profile and monitor access activity to Aurora databases in your AWS account without impacting database performance.
- AWS Database Migration Service – Now supports validation to ensure that data is migrated accurately to S3, and can now generate an AWS Glue Data Catalog when migrating to S3.
- AWS Backup – You can now back up and restore virtual machines running on VMware vSphere 8 and with multiple vNICs.
- Amazon Kendra – There are new connectors to index documents and search for information across these new content sources: Confluence Server, Confluence Cloud, Microsoft SharePoint OnPrem, and Microsoft SharePoint Cloud. This post shows how to use the Amazon Kendra connector for Microsoft Teams.

For a full list of AWS announcements, be sure to keep an eye on the What's New at AWS page.

Other AWS News

A few more blog posts you might have missed:

- Women founders Q&A – We're talking to six women founders and leaders about how they're making impacts in their communities, industries, and beyond.
- What you missed at the 2023 IMAGINE: Nonprofit conference – Where hundreds of nonprofit leaders, technologists, and innovators gathered to learn and share how AWS can drive a positive impact for people and the planet.
- Monitoring load balancers using Amazon CloudWatch anomaly detection alarms – The metrics emitted by load balancers provide crucial and unique insight into service health, service performance, and end-to-end network performance.
- Extend geospatial queries in Amazon Athena with user-defined functions (UDFs) and AWS Lambda – Using a solution based on Uber's Hexagonal Hierarchical Spatial Index (H3) to divide the globe into equally sized hexagons.
- How cities can use transport data to reduce pollution and increase safety – A guest post by Rikesh Shah, outgoing head of open innovation at Transport for London.

For AWS open-source news and updates, here's the latest newsletter curated by Ricardo to bring you the most recent updates on open-source projects, posts, events, and more.

Upcoming AWS Events

Here are some opportunities to meet:

- AWS Public Sector Day 2023 (March 21, London, UK) – An event dedicated to helping public sector organizations use technology to achieve more with less through the current challenging conditions.
- Women in Tech at Skills Center Arlington (March 23, VA, USA) – Let's celebrate the history and legacy of women in tech.

The AWS Summits season is warming up! You can sign up here to know when registration opens in your area.

That's all from me for this week. Come back next Monday for another Week in Review!

— Danilo

View the full article
Tagged with: women in tech, s3 (and 23 more)
The new Amazon S3 condition key enables you to write policies that help you control the use of server-side encryption with customer-provided keys (SSE-C). Using Amazon S3 condition keys, you can specify conditions when granting permissions in the optional ‘Condition’ element of a bucket or an IAM policy. One such condition is to require server-side encryption (SSE) using your preferred encryption method. View the full article
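As an illustration, the following bucket policy denies uploads that use SSE-C by keying on the presence of the SSE-C algorithm header. Treat the exact condition key as an assumption to verify against the announcement; the bucket name is a placeholder.

```python
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenySSECUploads",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::my-bucket/*",
        # Deny when the SSE-C algorithm header is present on the request
        "Condition": {"Null": {"s3:x-amz-server-side-encryption-customer-algorithm": "false"}},
    }],
}
boto3.client("s3").put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(policy))
```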
AWS Backup for Amazon S3 now enables you to copy your Amazon S3 backups across AWS Regions and AWS accounts. With backups of Amazon S3 in multiple AWS Regions, you can maintain separable, protected copies of your backup data to help meet the compliance requirements for data protection and disaster recovery. In addition, backups across AWS accounts provide an additional layer of protection against inadvertent or unauthorized actions. View the full article
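A sketch of a backup plan rule with a cross-Region copy action (vault names and ARNs are placeholders; a backup selection assigning your S3 resources to the plan is still required):

```python
import boto3

backup = boto3.client("backup")
backup.create_backup_plan(BackupPlan={
    "BackupPlanName": "s3-daily-with-dr-copy",
    "Rules": [{
        "RuleName": "daily",
        "TargetBackupVaultName": "primary-vault",
        "ScheduleExpression": "cron(0 5 * * ? *)",
        "Lifecycle": {"DeleteAfterDays": 35},
        # Copy each recovery point to a vault in another Region (or account)
        "CopyActions": [{
            "DestinationBackupVaultArn":
                "arn:aws:backup:us-west-2:111122223333:backup-vault:dr-vault"
        }],
    }],
})
```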
AWS Control Tower has updated its Region deny guardrail to include additional AWS global service APIs to support retrieving configuration settings, dashboard information, and an interactive chat agent. The Region deny guardrail, 'Deny access to AWS based on the requested AWS Region', assists you in limiting access to AWS services and operations for enrolled accounts in your AWS Control Tower environment. It helps ensure that any customer data you upload to AWS services is located only in the AWS Regions that you specify. You can select the AWS Region or Regions in which your customer data is stored and processed. View the full article
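For context, the Region deny guardrail is implemented as a service control policy that denies requests outside your allowed Regions while exempting global services. The following is an illustrative simplification, not the managed policy itself; the exemption list and Regions are placeholders.

```python
# Illustrative shape of a Region-deny service control policy
region_deny_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "RegionDeny",
        "Effect": "Deny",
        # Global services are exempted so they keep working from any Region
        "NotAction": ["iam:*", "organizations:*", "route53:*", "cloudfront:*",
                      "support:*", "sts:*"],
        "Resource": "*",
        "Condition": {
            "StringNotEquals": {"aws:RequestedRegion": ["eu-west-1", "eu-central-1"]}
        },
    }],
}
```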
Tagged with: control tower, s3 (and 4 more)
We are pleased to announce a new capability in Amazon Macie that allows for one-click, temporary retrieval of up to 10 examples of sensitive data found in Amazon Simple Storage Service (Amazon S3) by Amazon Macie. This new capability enables you to more easily view and understand which contents of an S3 object were identified as sensitive, so you can review, validate, and quickly take action as needed. All sensitive data examples captured with this new capability are encrypted using customer-managed AWS Key Management Service (AWS KMS) keys and are temporarily viewable within the Amazon Macie console after being retrieved. View the full article
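A sketch of retrieving those examples for a finding with boto3 (the finding ID is a placeholder, and reveal must first be enabled with a customer-managed KMS key via UpdateRevealConfiguration):

```python
import boto3

macie = boto3.client("macie2")
result = macie.get_sensitive_data_occurrences(findingId="example-finding-id")

# The response maps each detected data type to the retrieved example snippets
for data_type, occurrences in result.get("sensitiveDataOccurrences", {}).items():
    print(data_type, occurrences)
```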
AWS Backup Audit Manager now allows you to audit and report on the compliance of your data protection policies for Amazon S3 and AWS Storage Gateway. Using AWS Backup Audit Manager, you can now continuously evaluate the backup activity of your Amazon S3 and AWS Storage Gateway resources and generate audit reports that can help you demonstrate compliance with organizational best practices or regulatory standards. View the full article
AWS Certificate Manager (ACM) Private Certificate Authority (CA) now supports using S3 Block Public Access when storing certificate revocation lists (CRLs) in S3 buckets. View the full article
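A sketch of enabling this on an existing private CA by keeping CRL objects private instead of publicly readable (the CA ARN and bucket name are placeholders):

```python
import boto3

pca = boto3.client("acm-pca")
pca.update_certificate_authority(
    CertificateAuthorityArn=(
        "arn:aws:acm-pca:us-east-1:111122223333:certificate-authority/example-id"
    ),
    RevocationConfiguration={
        "CrlConfiguration": {
            "Enabled": True,
            "S3BucketName": "my-crl-bucket",
            "ExpirationInDays": 7,
            # Works with S3 Block Public Access: CRL objects stay private
            "S3ObjectAcl": "BUCKET_OWNER_FULL_CONTROL",
        }
    },
)
```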
Amazon Textract is a fully managed machine learning service that makes it easy to extract text and data from virtually any document. Amazon Textract offers both synchronous and asynchronous APIs so you can choose the best fit for each use case. With the asynchronous APIs, you can retrieve the extracted information using the GetDocumentTextDetection or the GetDocumentAnalysis APIs. Today, we are introducing an additional option to direct the Textract output to your own Amazon S3 buckets. With this new option, you can specify the Amazon S3 bucket name, and also a prefix to be added to the output file. You can still choose to use the Get APIs if you prefer. This new Amazon S3 output option provides you with greater flexibility to integrate Amazon Textract into your broader technical architectures. View the full article
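A sketch of starting an asynchronous job with the new output option (bucket and document names are placeholders):

```python
import boto3

textract = boto3.client("textract")
job = textract.start_document_text_detection(
    DocumentLocation={"S3Object": {"Bucket": "my-input-bucket", "Name": "invoice.pdf"}},
    # New: write results to your own bucket, under an optional prefix
    OutputConfig={"S3Bucket": "my-output-bucket", "S3Prefix": "textract-results/"},
)
print(job["JobId"])  # you can still poll GetDocumentTextDetection if preferred
```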
AWS Database Migration Service (AWS DMS) helps you migrate databases to AWS quickly and securely. With this launch, AWS DMS now supports Amazon S3 folder partitions based on transaction commit dates when using Amazon S3 as a target. Using date-based folder partitioning, you can write data from a single source table to a time-hierarchy folder structure in Amazon S3. By partitioning the S3 folder, you can better manage your S3 objects, limit the size of each S3 folder, and optimize data lake queries or other subsequent operations. View the full article
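A sketch of an S3 target endpoint with date-based folder partitioning turned on (names and the role ARN are placeholders):

```python
import boto3

dms = boto3.client("dms")
dms.create_endpoint(
    EndpointIdentifier="s3-target-partitioned",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "BucketName": "my-dms-bucket",
        "ServiceAccessRoleArn": "arn:aws:iam::111122223333:role/dms-s3-role",
        "DataFormat": "parquet",
        "DatePartitionEnabled": True,        # e.g. .../2024/03/15/<file>.parquet
        "DatePartitionSequence": "YYYYMMDD",
        "DatePartitionDelimiter": "SLASH",
    },
)
```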
Amazon EMR Release 6.2 now supports improved Apache HBase performance on Amazon S3 with persistent HFile tracking, and Apache Hive ACID transactions on HDFS and Amazon S3. EMR 6.2 also includes performance improvements to the EMR Runtime for Apache Spark and to PrestoDB. View the full article
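A sketch of launching an EMR 6.2 cluster with HBase storing its data on S3 (bucket, instance sizes, and roles are placeholders):

```python
import boto3

emr = boto3.client("emr")
emr.run_job_flow(
    Name="hbase-on-s3",
    ReleaseLabel="emr-6.2.0",
    Applications=[{"Name": "HBase"}],
    Configurations=[
        # Store HBase data in S3 rather than HDFS
        {"Classification": "hbase", "Properties": {"hbase.emr.storageMode": "s3"}},
        {"Classification": "hbase-site",
         "Properties": {"hbase.rootdir": "s3://my-hbase-bucket/"}},
    ],
    Instances={"MasterInstanceType": "m5.xlarge", "SlaveInstanceType": "m5.xlarge",
               "InstanceCount": 3},
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```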
1 reply. Tagged with: apache hbase, amazon s3 (and 5 more)
Amazon S3 Replication now gives you the ability to replicate data from one source bucket to multiple destination buckets in the same, or different AWS Regions. S3 Replication (multi-destination) is intended for customers that want to create and maintain multiple copies of their data in one or more AWS Regions. Additionally, when replicating to multiple destinations, you can use Amazon CloudWatch metrics to track replication progress for each region pair. View the full article
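A sketch of a replication configuration with two destination buckets; multi-destination rules need distinct priorities (the role and bucket ARNs are placeholders):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_replication(
    Bucket="source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111122223333:role/replication-role",
        "Rules": [
            {"ID": "to-us-west-2", "Status": "Enabled", "Priority": 1,
             "Filter": {}, "DeleteMarkerReplication": {"Status": "Disabled"},
             "Destination": {"Bucket": "arn:aws:s3:::replica-usw2"}},
            {"ID": "to-eu-west-1", "Status": "Enabled", "Priority": 2,
             "Filter": {}, "DeleteMarkerReplication": {"Status": "Disabled"},
             "Destination": {"Bucket": "arn:aws:s3:::replica-euw1"}},
        ],
    },
)
```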
Amazon S3 now delivers strong read-after-write consistency automatically for all applications. Unlike other cloud providers, Amazon S3 delivers strong read-after-write consistency for any storage request, without changes to performance or availability, without sacrificing regional isolation for applications, and at no additional cost. View the full article
Amazon S3 Bucket Keys reduce the request costs of Amazon S3 server-side encryption (SSE) with AWS Key Management Service (KMS) by up to 99% by decreasing the request traffic from S3 to KMS. With a few clicks in AWS Management Console and no changes to your client applications, you can configure your buckets to use an S3 Bucket Key for KMS-based encryption on new objects. View the full article
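Enabling it programmatically is a one-flag change to the bucket's default encryption (the bucket name and key ARN are placeholders):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_encryption(
    Bucket="my-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/example-id",
            },
            "BucketKeyEnabled": True,  # cuts S3-to-KMS request traffic
        }]
    },
)
```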
Amazon S3 Replication now gives you the flexibility of replicating object metadata changes for two-way replication between buckets. With this new feature, replica modification sync, you can easily replicate metadata changes like object access control lists (ACLs), object tags, or object locks on the replicated objects. This two-way replication is important if you want to build shared datasets across multiple regions and keep all object and object metadata changes in sync. View the full article
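A sketch of one side of such a setup with replica modification sync enabled; a mirror-image rule on the peer bucket completes the two-way sync (names and ARNs are placeholders):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_replication(
    Bucket="bucket-a",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111122223333:role/replication-role",
        "Rules": [{
            "ID": "metadata-sync", "Status": "Enabled", "Priority": 1,
            "Filter": {}, "DeleteMarkerReplication": {"Status": "Disabled"},
            # Replicate metadata changes (ACLs, tags, locks) made on replicas
            "SourceSelectionCriteria": {"ReplicaModifications": {"Status": "Enabled"}},
            "Destination": {"Bucket": "arn:aws:s3:::bucket-b"},
        }],
    },
)
```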
Starting today, Amazon Simple Storage Service (S3) customers can discover a curated collection of third-party software built for Amazon S3 from within the S3 Management Console. Customers can choose from free or paid software products across SaaS, AMI, CFT, and Container product types, spanning across a wide range of popular categories including Storage, Back-up and Recovery, Data Integration and Analytics, Observability and Monitoring, Security and Threat Detection, and Permissions. View the full article
Amazon Textract is a machine learning service that makes it easy to extract printed text, handwriting, and data from virtually any document. Today, we are pleased to announce that Amazon Textract supports encryption of its asynchronous API output stored in your Amazon S3 buckets using your own AWS Key Management Service (KMS) Customer Master Keys (CMKs). With this feature, you have the flexibility to manage which encryption keys are used to protect your data and text extracted by Amazon Textract. For more information on how to accomplish this, please read our newest blog post. View the full article
Amazon S3 Storage Lens delivers organization-wide visibility into your object storage usage and activity trends, and makes actionable recommendations to improve cost-efficiency and apply data protection best practices. S3 Storage Lens is the first cloud storage analytics solution to provide a single view of object storage usage and activity across tens to hundreds of accounts in an AWS organization, with drill-downs to generate insights at the account, bucket, or even prefix level. Drawing from more than 14 years of experience helping customers optimize storage, S3 Storage Lens analyzes organization-wide metrics to deliver contextual recommendations to find ways to reduce your storage costs and apply best practices on data protection. View the full article
In a high-severity data breach involving more than 10,000,000 files, Prestige Software, a hotel reservation platform based in Spain, exposed the banking details of over a million customers. The company provides automated online booking services to customers looking to reserve hotels for their next vacation or work trip. View the full article
AWS X-Ray now supports trace context propagation for Amazon Simple Storage Service (S3) enabling customers to view end-to-end requests when using Amazon S3. AWS X-Ray traces user requests as they travel through your entire application. It aggregates the data generated by the individual services like AWS Lambda, Amazon EC2 and the many resources that make up your application, providing you an end-to-end view of how your application is performing. View the full article
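Getting S3 calls into a trace is mostly a matter of patching the SDK; a minimal sketch (the bucket name is a placeholder):

```python
import boto3
from aws_xray_sdk.core import patch_all, xray_recorder

patch_all()  # instruments boto3/botocore so AWS calls emit X-Ray subsegments

with xray_recorder.in_segment("s3-demo"):
    boto3.client("s3").list_objects_v2(Bucket="my-bucket")
```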