Jump to content

Search the Community

Showing results for tags 'amazon data firehose'.

  • Search By Tags

    Type tags separated by commas.
  • Search By Author

Content Type


Forums

  • General
    • General Discussion
    • Artificial Intelligence
    • DevOpsForum News
  • DevOps & SRE
    • DevOps & SRE General Discussion
    • Databases, Data Engineering & Data Science
    • Development & Programming
    • CI/CD, GitOps, Orchestration & Scheduling
    • Docker, Containers, Microservices, Serverless & Virtualization
    • Infrastructure-as-Code
    • Kubernetes & Container Orchestration
    • Linux
    • Logging, Monitoring & Observability
    • Security, Governance, Risk & Compliance
  • Cloud Providers
    • Amazon Web Services
    • Google Cloud Platform
    • Microsoft Azure

Find results in...

Find results that contain...


Date Created

  • Start

    End


Last Updated

  • Start

    End


Filter by number of...

Joined

  • Start

    End


Group


Website URL


LinkedIn Profile URL


About Me


Cloud Platforms


Cloud Experience


Development Experience


Current Role


Skills


Certifications


Favourite Tools


Interests

Found 3 results

  1. Amazon Data Firehose (Firehose) now offers direct integration with Snowflake Snowpipe Streaming. Firehose enables customers to reliably capture, transform, and deliver data streams into Amazon S3, Amazon Redshift, Splunk, and other destinations for analytics. With this new feature, customers can stream clickstream, application, and AWS service logs from multiple sources, including Kinesis Data Streams, to Snowflake. With a few clicks, customers can setup a Firehose stream to deliver data to Snowflake. Firehose automatically scales to stream gigabytes of data, and records are available in Snowflake within seconds. View the full article
  2. Today’s fast-paced world demands timely insights and decisions, which is driving the importance of streaming data. Streaming data refers to data that is continuously generated from a variety of sources. The sources of this data, such as clickstream events, change data capture (CDC), application and service logs, and Internet of Things (IoT) data streams are proliferating. Snowflake offers two options to bring streaming data into its platform: Snowpipe and Snowflake Snowpipe Streaming. Snowpipe is suitable for file ingestion (batching) use cases, such as loading large files from Amazon Simple Storage Service (Amazon S3) to Snowflake. Snowpipe Streaming, a newer feature released in March 2023, is suitable for rowset ingestion (streaming) use cases, such as loading a continuous stream of data from Amazon Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK). Before Snowpipe Streaming, AWS customers used Snowpipe for both use cases: file ingestion and rowset ingestion. First, you ingested streaming data to Kinesis Data Streams or Amazon MSK, then used Amazon Data Firehose to aggregate and write streams to Amazon S3, followed by using Snowpipe to load the data into Snowflake. However, this multi-step process can result in delays of up to an hour before data is available for analysis in Snowflake. Moreover, it’s expensive, especially when you have small files that Snowpipe has to upload to the Snowflake customer cluster. To solve this issue, Amazon Data Firehose now integrates with Snowpipe Streaming, enabling you to capture, transform, and deliver data streams from Kinesis Data Streams, Amazon MSK, and Firehose Direct PUT to Snowflake in seconds at a low cost. With a few clicks on the Amazon Data Firehose console, you can set up a Firehose stream to deliver data to Snowflake. There are no commitments or upfront investments to use Amazon Data Firehose, and you only pay for the amount of data streamed. Some key features of Amazon Data Firehose include: Fully managed serverless service – You don’t need to manage resources, and Amazon Data Firehose automatically scales to match the throughput of your data source without ongoing administration. Straightforward to use with no code – You don’t need to write applications. Real-time data delivery – You can get data to your destinations quickly and efficiently in seconds. Integration with over 20 AWS services – Seamless integration is available for many AWS services, such as Kinesis Data Streams, Amazon MSK, Amazon VPC Flow Logs, AWS WAF logs, Amazon CloudWatch Logs, Amazon EventBridge, AWS IoT Core, and more. Pay-as-you-go model – You only pay for the data volume that Amazon Data Firehose processes. Connectivity – Amazon Data Firehose can connect to public or private subnets in your VPC. This post explains how you can bring streaming data from AWS into Snowflake within seconds to perform advanced analytics. We explore common architectures and illustrate how to set up a low-code, serverless, cost-effective solution for low-latency data streaming. Overview of solution The following are the steps to implement the solution to stream data from AWS to Snowflake: Create a Snowflake database, schema, and table. Create a Kinesis data stream. Create a Firehose delivery stream with Kinesis Data Streams as the source and Snowflake as its destination using a secure private link. To test the setup, generate sample stream data from the Amazon Kinesis Data Generator (KDG) with the Firehose delivery stream as the destination. Query the Snowflake table to validate the data loaded into Snowflake. The solution is depicted in the following architecture diagram. Prerequisites You should have the following prerequisites: An AWS account and access to the following AWS services: AWS Identity and Access Management (IAM) Kinesis Data Streams Amazon S3 Amazon Data Firehose Familiarity with the AWS Management Console. A Snowflake account. A key pair generated and your user configured to connect securely to Snowflake. For instructions, refer to the following: Generate the private key Generate a public key Store the private and public keys securely Assign the public key to a Snowflake user Verify the user’s public key fingerprint An S3 bucket for error logging. The KDG set up. For instructions, refer to Test Your Streaming Data Solution with the New Amazon Kinesis Data Generator. Create a Snowflake database, schema, and table Complete the following steps to set up your data in Snowflake: Log in to your Snowflake account and create the database: create database adf_snf; Create a schema in the new database: create schema adf_snf.kds_blog; Create a table in the new schema: create or replace table iot_sensors (sensorId number, sensorType varchar, internetIP varchar, connectionTime timestamp_ntz, currentTemperature number ); Create a Kinesis data stream Complete the following steps to create your data stream: On the Kinesis Data Streams console, choose Data streams in the navigation pane. Choose Create data stream. For Data stream name, enter a name (for example, KDS-Demo-Stream). Leave the remaining settings as default. Choose Create data stream. Create a Firehose delivery stream Complete the following steps to create a Firehose delivery stream with Kinesis Data Streams as the source and Snowflake as its destination: On the Amazon Data Firehose console, choose Create Firehose stream. For Source, choose Amazon Kinesis Data Streams. For Destination, choose Snowflake. For Kinesis data stream, browse to the data stream you created earlier. For Firehose stream name, leave the default generated name or enter a name of your preference. Under Connection settings, provide the following information to connect Amazon Data Firehose to Snowflake: For Snowflake account URL, enter your Snowflake account URL. For User, enter the user name generated in the prerequisites. For Private key, enter the private key generated in the prerequisites. Make sure the private key is in PKCS8 format. Do not include the PEM header-BEGIN prefix and footer-END suffix as part of the private key. If the key is split across multiple lines, remove the line breaks. For Role, select Use custom Snowflake role and enter the IAM role that has access to write to the database table. You can connect to Snowflake using public or private connectivity. If you don’t provide a VPC endpoint, the default connectivity mode is public. To allow list Firehose IPs in your Snowflake network policy, refer to Choose Snowflake for Your Destination. If you’re using a private link URL, provide the VPCE ID using SYSTEM$GET_PRIVATELINK_CONFIG: select SYSTEM$GET_PRIVATELINK_CONFIG(); This function returns a JSON representation of the Snowflake account information necessary to facilitate the self-service configuration of private connectivity to the Snowflake service, as shown in the following screenshot. For this post, we’re using a private link, so for VPCE ID, enter the VPCE ID. Under Database configuration settings, enter your Snowflake database, schema, and table names. In the Backup settings section, for S3 backup bucket, enter the bucket you created as part of the prerequisites. Choose Create Firehose stream. Alternatively, you can use an AWS CloudFormation template to create the Firehose delivery stream with Snowflake as the destination rather than using the Amazon Data Firehose console. To use the CloudFormation stack, choose Generate sample stream data Generate sample stream data from the KDG with the Kinesis data stream you created: { "sensorId": {{random.number(999999999)}}, "sensorType": "{{random.arrayElement( ["Thermostat","SmartWaterHeater","HVACTemperatureSensor","WaterPurifier"] )}}", "internetIP": "{{internet.ip}}", "connectionTime": "{{date.now("YYYY-MM-DDTHH:m:ss")}}", "currentTemperature": {{random.number({"min":10,"max":150})}} } Query the Snowflake table Query the Snowflake table: select * from adf_snf.kds_blog.iot_sensors; You can confirm that the data generated by the KDG that was sent to Kinesis Data Streams is loaded into the Snowflake table through Amazon Data Firehose. Troubleshooting If data is not loaded into Kinesis Data Steams after the KDG sends data to the Firehose delivery stream, refresh and make sure you are logged in to the KDG. If you made any changes to the Snowflake destination table definition, recreate the Firehose delivery stream. Clean up To avoid incurring future charges, delete the resources you created as part of this exercise if you are not planning to use them further. Conclusion Amazon Data Firehose provides a straightforward way to deliver data to Snowpipe Streaming, enabling you to save costs and reduce latency to seconds. To try Amazon Kinesis Firehose with Snowflake, refer to the Amazon Data Firehose with Snowflake as destination lab. About the Authors Swapna Bandla is a Senior Solutions Architect in the AWS Analytics Specialist SA Team. Swapna has a passion towards understanding customers data and analytics needs and empowering them to develop cloud-based well-architected solutions. Outside of work, she enjoys spending time with her family. Mostafa Mansour is a Principal Product Manager – Tech at Amazon Web Services where he works on Amazon Kinesis Data Firehose. He specializes in developing intuitive product experiences that solve complex challenges for customers at scale. When he’s not hard at work on Amazon Kinesis Data Firehose, you’ll likely find Mostafa on the squash court, where he loves to take on challengers and perfect his dropshots. Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS and has over 20 years of experience working with database and analytics products from enterprise database vendors and cloud providers. He has helped technology companies design and implement data analytics solutions and products. View the full article
  3. You can use Amazon Data Firehose to aggregate and deliver log events from your applications and services captured in Amazon CloudWatch Logs to your Amazon Simple Storage Service (Amazon S3) bucket and Splunk destinations, for use cases such as data analytics, security analysis, application troubleshooting etc. By default, CloudWatch Logs are delivered as gzip-compressed objects. You might want the data to be decompressed, or want logs to be delivered to Splunk, which requires decompressed data input, for application monitoring and auditing. AWS released a feature to support decompression of CloudWatch Logs in Firehose. With this new feature, you can specify an option in Firehose to decompress CloudWatch Logs. You no longer have to perform additional processing using AWS Lambda or post-processing to get decompressed logs, and can deliver decompressed data to Splunk. Additionally, you can use optional Firehose features such as record format conversion to convert CloudWatch Logs to Parquet or ORC, and dynamic partitioning to automatically group streaming records based on keys in the data (for example, by month) and deliver the grouped records to corresponding Amazon S3 prefixes. In this post, we look at how to enable the decompression feature for Splunk and Amazon S3 destinations. We start with Splunk and then Amazon S3 for new streams, then we address migration steps to take advantage of this feature and simplify your existing pipeline. Decompress CloudWatch Logs for Splunk You can use subscription filter in CloudWatch log groups to ingest data directly to Firehose or through Amazon Kinesis Data Streams. Note: For the CloudWatch Logs decompression feature, you need a HTTP Event Collector (HEC) data input created in Splunk, with indexer acknowledgement enabled and the source type. This is required to map to the right source type for the decompressed logs. When creating the HEC input, include the source type mapping (for example, aws:cloudtrail). To create a Firehose delivery stream for the decompression feature, complete the following steps: Provide your destination settings and select Raw endpoint as endpoint type. You can use a raw endpoint for the decompression feature to ingest both raw and JSON-formatted event data to Splunk. For example, VPC Flow Logs data is raw data, and AWS CloudTrail data is in JSON format. Enter the HEC token for Authentication token. To enable decompression feature, deselect Transform source records with AWS Lambda under Transform records. Select Turn on decompression and Turn on message extraction for Decompress source records from Amazon CloudWatch Logs. Select Turn on message extraction for the Splunk destination. Message extraction feature After decompression, CloudWatch Logs are in JSON format, as shown in the following figure. You can see the decompressed data has metadata information such as logGroup, logStream, and subscriptionFilters, and the actual data is included within the message field under logEvents (the following example shows an example of CloudTrail events in the CloudWatch Logs). When you enable message extraction, Firehose will extract just the contents of the message fields and concatenate the contents with a new line between them, as shown in following figure. With the CloudWatch Logs metadata filtered out with this feature, Splunk will successfully parse the actual log data and map to the source type configured in HEC token. Additionally, If you want to deliver these CloudWatch events to your Splunk destination in real time, you can use zero buffering, a new feature that was launched recently in Firehose. You can use this feature to set up 0 seconds as the buffer interval or any time interval between 0–60 seconds to deliver data to the Splunk destination in real time within seconds. With these settings, you can now seamlessly ingest decompressed CloudWatch log data into Splunk using Firehose. Decompress CloudWatch Logs for Amazon S3 The CloudWatch Logs decompression feature for an Amazon S3 destination works similar to Splunk, where you can turn off data transformation using Lambda and turn on the decompression and message extraction options. You can use the decompression feature to write the log data as a text file to the Amazon S3 destination or use with other Amazon S3 destination features like record format conversion using Parquet or ORC, or dynamic partitioning to partition the data. Dynamic partitioning with decompression For Amazon S3 destination, Firehose supports dynamic partitioning, which enables you to continuously partition streaming data by using keys within data, and then deliver the data grouped by these keys into corresponding Amazon S3 prefixes. This enables you to run high-performance, cost-efficient analytics on streaming data in Amazon S3 using services such as Amazon Athena, Amazon EMR, Amazon Redshift Spectrum, and Amazon QuickSight. Partitioning your data minimizes the amount of data scanned, optimizes performance, and reduces costs of your analytics queries on Amazon S3. With the new decompression feature, you can perform dynamic partitioning without any Lambda function for mapping the partitioning keys on CloudWatch Logs. You can enable the Inline parsing for JSON option, scan the decompressed log data, and select the partitioning keys. The following screenshot shows an example where inline parsing is enabled for CloudTrail log data with a partitioning schema selected for account ID and AWS Region in the CloudTrail record. Record format conversion with decompression For CloudWatch Logs data, you can use the record format conversion feature on decompressed data for Amazon S3 destination. Firehose can convert the input data format from JSON to Apache Parquet or Apache ORC before storing the data in Amazon S3. Parquet and ORC are columnar data formats that save space and enable faster queries compared to row-oriented formats like JSON. You can use the features for record format conversion under the Transform and convert records settings to convert the CloudWatch log data to Parquet or ORC format. The following screenshot shows an example of record format conversion settings for Parquet format using an AWS Glue schema and table for CloudTrail log data. When the dynamic partitioning settings are configured, record format conversion works along with dynamic partitioning to create the files in the output format with a partition folder structure in the target S3 bucket. Migrate existing delivery streams for decompression If you want to migrate an existing Firehose stream that uses Lambda for decompression to this new decompression feature of Firehose, refer to the steps outlined in Enabling and disabling decompression. Pricing The Firehose decompression feature decompress the data and charges per GB of decompressed data. To understand decompression pricing, refer to Amazon Data Firehose pricing. Clean up To avoid incurring future charges, delete the resources you created in the following order: Delete the CloudWatch Logs subscription filter. Delete the Firehose delivery stream. Delete the S3 buckets. Conclusion The decompression and message extraction feature of Firehose simplifies delivery of CloudWatch Logs to Amazon S3 and Splunk destinations without requiring any code development or additional processing. For an Amazon S3 destination, you can use Parquet or ORC conversion and dynamic partitioning capabilities on decompressed data. For more information, refer to the following resources: Record Transformation and Format Conversion Enabling and disabling decompression Message extraction after decompression of CloudWatch Logs About the Authors Ranjit Kalidasan is a Senior Solutions Architect with Amazon Web Services based in Boston, Massachusetts. He is a Partner Solutions Architect helping security ISV partners co-build and co-market solutions with AWS. He brings over 25 years of experience in information technology helping global customers implement complex solutions for security and analytics. You can connect with Ranjit on LinkedIn. Phaneendra Vuliyaragoli is a Product Management Lead for Amazon Data Firehose at AWS. In this role, Phaneendra leads the product and go-to-market strategy for Amazon Data Firehose. View the full article
  • Forum Statistics

    43.6k
    Total Topics
    43.1k
    Total Posts
×
×
  • Create New...