Search the Community

Showing results for tags 'amazon cloudwatch'.
  1. Amazon CloudWatch RUM, which enables customers to monitor their web applications by collecting client-side performance and error data in real time, is generally available in the following 5 AWS Regions starting today: Asia Pacific (Hyderabad), Asia Pacific (Melbourne), Europe (Spain), Europe (Zurich), and Middle East (UAE). View the full article
  2. Amazon CloudWatch Container Insights with Enhanced Observability for EKS now auto-discovers critical health metrics from your AWS Trainium and AWS Inferentia accelerators, AWS high-performance network adapters (Elastic Fabric Adapter), and NVIDIA GPUs. You can visualize these out-of-the-box metrics in curated Container Insights dashboards to help monitor your accelerated infrastructure and optimize your AI workloads for operational excellence. View the full article
  3. The Internet has a plethora of moving parts: routers, switches, hubs, terrestrial and submarine cables, and connectors on the hardware side, and complex protocol stacks and configurations on the software side. When something goes wrong that slows or disrupts the Internet in a way that affects your customers, you want to be able to localize and understand the issue as quickly as possible.

New Map
The new Amazon CloudWatch Internet Weather Map is here to help! Built atop a collection of global monitors operated by AWS, the map gives you a broad, global view of Internet weather, with the ability to zoom in and understand performance and availability issues that affect a particular city. To access the map, open the CloudWatch Console, expand Network monitoring on the left, and click Internet Monitor. The map appears and displays weather for the entire world. The red and yellow circles indicate current, active issues that affect availability or performance, respectively. The grey circles represent issues that have been resolved within the last 24 hours, and the blue diamonds represent AWS Regions. The map automatically refreshes every 15 minutes if you leave it on the screen.

Each issue affects a specific city-network: a combination of a location where clients access AWS resources and the Autonomous System Number (ASN) that was used to access the resources. ASNs typically represent individual Internet Service Providers (ISPs). The list to the right of the map shows active events at the top, followed by events that have been resolved in the recent past, looking back up to 24 hours. I can hover my mouse over any of the indicators to see the list of city-networks in the geographic area. If I zoom in a step or two, I can see that those city-networks are spread out over the United States, and I can zoom in even further to see a single city-network.

This information is also available programmatically. The new ListInternetEvents function returns up to 100 performance or availability events per call, with optional filtering by time range, status (ACTIVE or RESOLVED), or type (PERFORMANCE or AVAILABILITY). Each event includes full details, including latitude and longitude.

The new map is accessible from all AWS Regions and there is no charge to use it. Going forward, we have a lot of powerful additions on the roadmap, subject to prioritization based on your feedback. Right now we are thinking about: displaying causes of certain types of outages, such as DDoS attacks, BGP route leaks, and issues with route interconnects; adding a view that is specific to a chosen ISP; and displaying the impact to public SaaS applications. Please feel free to send feedback on this feature to internet-monitor@amazon.com.

CloudWatch Internet Monitor
The information in the map applies to everyone who makes use of applications built on AWS. If you want to understand how internet weather affects your particular AWS applications, and to take advantage of other features such as health event notification and traffic insights, you can make use of CloudWatch Internet Monitor. As my colleague Sébastien wrote when he launched this feature in late 2022: "You told us one of your challenges when monitoring internet-facing applications is to gather data outside of AWS to build a realistic picture of how your application behaves for your customers connected to multiple and geographically distant internet providers. Capturing and monitoring data about internet traffic before it reaches your infrastructure is either difficult or very expensive."

After you review the map, you can click Create monitor to get started with CloudWatch Internet Monitor. You then enter a name for your monitor, choose the AWS resources (VPCs, CloudFront distributions, Network Load Balancers, and Amazon WorkSpaces directories) to monitor, and select the desired percentage of internet-facing traffic to monitor. The monitor will begin to operate within minutes, using entries from your VPC Flow Logs, CloudFront access logs, and other telemetry to identify the most relevant city-networks.

Here are some resources to help you learn more about this feature: Amazon CloudWatch Internet Monitor Preview – End-to-End Visibility into Internet Performance for your Applications; Introducing Amazon CloudWatch Internet Monitor; Easily set up Amazon CloudWatch Internet Monitor; Using Amazon CloudWatch Internet Monitor for enhanced internet observability; Use Amazon CloudWatch Internet Monitor for greater visibility into online experiences; Documentation: Using Amazon CloudWatch Internet Monitor; Video: Amazon CloudWatch Internet Monitor. More questions or comments? Contact internet-monitor@amazon.com. — Jeff; View the full article
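For anyone who wants to script against the map, here is a minimal sketch of calling ListInternetEvents with Python and boto3. It is a sketch under assumptions: the client name (internetmonitor) and the response fields (InternetEvents, ClientLocation) follow my reading of the API reference, so verify them against the current documentation before relying on them.

    import boto3
    from datetime import datetime, timedelta, timezone

    # Query the internet weather map programmatically via the
    # ListInternetEvents API (no monitor required, no charge).
    client = boto3.client("internetmonitor", region_name="us-east-1")

    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=24)  # the map looks back up to 24 hours

    response = client.list_internet_events(
        StartTime=start,
        EndTime=end,
        EventStatus="ACTIVE",       # or "RESOLVED"
        EventType="AVAILABILITY",   # or "PERFORMANCE"
        MaxResults=100,             # up to 100 events per call
    )

    for event in response["InternetEvents"]:
        loc = event["ClientLocation"]  # assumed field names; verify
        print(event["EventType"], event["EventStatus"],
              loc["City"], loc["ASName"], loc["Latitude"], loc["Longitude"])

Filtering server-side by status and type keeps each call within the 100-event page size; page through NextToken for longer windows.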
  4. All AWS customers who navigate to the Amazon CloudWatch Internet Monitor console can now view the internet weather map at no charge. The map shares a 24-hour global snapshot of internet latency and availability outages, letting you see at a glance recent internet issues across the world, including specific cities and service providers. View the full article
  5. Amazon CloudWatch Container Insights now offers observability for Windows containers running on Amazon Elastic Kubernetes Service (EKS), and helps customers collect, aggregate, and summarize metrics and logs from their Windows container infrastructure. With this support, customers can monitor utilization of resources such as CPU, memory, disk, and network, as well as get enhanced observability such as container-level EKS performance metrics, Kube-state metrics and EKS control plane metrics for Windows containers. CloudWatch also provides diagnostic information, such as container restart failures, for faster problem isolation and troubleshooting for Windows containers running on EKS. View the full article
  6. You can now create or associate a monitor for a distribution directly from the Amazon CloudFront console. By adding your distribution to a monitor, you can gain improved visibility into your application's internet performance and availability using Amazon CloudWatch Internet Monitor. You can create a monitor for the distribution, or add the distribution to an existing monitor, directly from the distribution metrics dashboard on the CloudFront console. View the full article
  7. Amazon CloudWatch RUM, which enables customers to monitor their web applications by collecting client-side performance and error data in real time, is generally available in the following 11 AWS Regions starting today: Africa (Cape Town), Asia Pacific (Jakarta), Asia Pacific (Mumbai), Asia Pacific (Osaka), Asia Pacific (Seoul), Canada (Central), Europe (Milan), Europe (Paris), Middle East (Bahrain), South America (Sao Paulo), and US West (N. California). View the full article
  8. Amazon CloudWatch now supports using AWS CloudFormation to manage tags when you create, update, or delete alarms. View the full article
  9. You can now set up cross-account observability for Amazon CloudWatch Internet Monitor, giving you read-only access to monitors from multiple accounts within an AWS Region. Deploying applications using resources in separate accounts is a good practice: it establishes security and billing boundaries between teams and reduces the impact of operational events. For example, when you set up cross-account observability for Internet Monitor, you can access and view performance and availability measurements generated by monitors in different AWS accounts. View the full article
  10. Amazon CloudWatch now supports Anomaly Detection on metrics shared across your accounts. CloudWatch Anomaly Detection now lets you track unexpected changes in metric behavior across multiple accounts from a single monitoring account through CloudWatch cross-account observability. View the full article
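As a sketch of what the underlying calls look like, the snippet below creates a metric anomaly detector and an alarm on its expected band with boto3. The metric, names, and band width are illustrative; the commented-out AccountId field is how a monitoring account would reference a metric shared from a source account under cross-account observability, as I understand the API, so treat it as an assumption to verify.

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    # Train an anomaly detection model on a metric (EC2 CPU, illustrative).
    cloudwatch.put_anomaly_detector(
        SingleMetricAnomalyDetector={
            "Namespace": "AWS/EC2",
            "MetricName": "CPUUtilization",
            "Stat": "Average",
        }
    )

    # Alarm when the metric leaves the expected band (2 standard deviations).
    cloudwatch.put_metric_alarm(
        AlarmName="cpu-anomaly",
        ComparisonOperator="GreaterThanUpperThreshold",
        EvaluationPeriods=3,
        ThresholdMetricId="band",
        Metrics=[
            {
                "Id": "m1",
                # "AccountId": "111122223333",  # source account, if shared
                "MetricStat": {
                    "Metric": {"Namespace": "AWS/EC2",
                               "MetricName": "CPUUtilization"},
                    "Period": 300,
                    "Stat": "Average",
                },
            },
            {"Id": "band", "Expression": "ANOMALY_DETECTION_BAND(m1, 2)"},
        ],
    )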
  11. You can use Amazon Data Firehose to aggregate and deliver log events from your applications and services captured in Amazon CloudWatch Logs to your Amazon Simple Storage Service (Amazon S3) bucket and Splunk destinations, for use cases such as data analytics, security analysis, and application troubleshooting. By default, CloudWatch Logs are delivered as gzip-compressed objects. You might want the data to be decompressed, or want logs delivered to Splunk, which requires decompressed data input, for application monitoring and auditing.

AWS released a feature to support decompression of CloudWatch Logs in Firehose. With this new feature, you can specify an option in Firehose to decompress CloudWatch Logs. You no longer have to perform additional processing using AWS Lambda or post-processing to get decompressed logs, and you can deliver decompressed data to Splunk. Additionally, you can use optional Firehose features such as record format conversion to convert CloudWatch Logs to Parquet or ORC, and dynamic partitioning to automatically group streaming records based on keys in the data (for example, by month) and deliver the grouped records to corresponding Amazon S3 prefixes. In this post, we look at how to enable the decompression feature for Splunk and Amazon S3 destinations. We start with Splunk and then Amazon S3 for new streams, then we address migration steps to take advantage of this feature and simplify your existing pipeline.

Decompress CloudWatch Logs for Splunk
You can use a subscription filter in CloudWatch log groups to ingest data directly into Firehose or through Amazon Kinesis Data Streams. Note: for the CloudWatch Logs decompression feature, you need an HTTP Event Collector (HEC) data input created in Splunk, with indexer acknowledgement enabled and a source type configured. This is required to map the decompressed logs to the right source type. When creating the HEC input, include the source type mapping (for example, aws:cloudtrail). To create a Firehose delivery stream with the decompression feature, complete the following steps: (1) provide your destination settings and select Raw endpoint as the endpoint type (a raw endpoint can ingest both raw and JSON-formatted event data to Splunk; for example, VPC Flow Logs data is raw data while AWS CloudTrail data is in JSON format); (2) enter the HEC token for Authentication token; (3) to enable the decompression feature, deselect Transform source records with AWS Lambda under Transform records; (4) select Turn on decompression and Turn on message extraction for Decompress source records from Amazon CloudWatch Logs.

Message extraction feature
After decompression, CloudWatch Logs are in JSON format. The decompressed data has metadata information such as logGroup, logStream, and subscriptionFilters, and the actual data is included within the message field under logEvents (for example, CloudTrail events delivered through CloudWatch Logs). When you enable message extraction, Firehose extracts just the contents of the message fields and concatenates them with a new line between them. With the CloudWatch Logs metadata filtered out by this feature, Splunk can successfully parse the actual log data and map it to the source type configured in the HEC token.

Additionally, if you want to deliver these CloudWatch events to your Splunk destination in real time, you can use zero buffering, a feature that was launched recently in Firehose. You can set a buffer interval of 0 seconds, or any interval between 0–60 seconds, to deliver data to the Splunk destination within seconds. With these settings, you can seamlessly ingest decompressed CloudWatch log data into Splunk using Firehose.

Decompress CloudWatch Logs for Amazon S3
The CloudWatch Logs decompression feature for an Amazon S3 destination works similarly to the Splunk setup: you turn off data transformation using Lambda and turn on the decompression and message extraction options. You can use the decompression feature to write the log data as a text file to the Amazon S3 destination, or use it with other Amazon S3 destination features like record format conversion to Parquet or ORC, or dynamic partitioning.

Dynamic partitioning with decompression
For the Amazon S3 destination, Firehose supports dynamic partitioning, which enables you to continuously partition streaming data by using keys within the data, and then deliver the data grouped by these keys into corresponding Amazon S3 prefixes. This enables you to run high-performance, cost-efficient analytics on streaming data in Amazon S3 using services such as Amazon Athena, Amazon EMR, Amazon Redshift Spectrum, and Amazon QuickSight. Partitioning your data minimizes the amount of data scanned, optimizes performance, and reduces the cost of your analytics queries on Amazon S3. With the new decompression feature, you can perform dynamic partitioning without any Lambda function for mapping the partitioning keys on CloudWatch Logs. You enable the Inline parsing for JSON option, scan the decompressed log data, and select the partitioning keys. For example, inline parsing can be enabled for CloudTrail log data with a partitioning schema selected for account ID and AWS Region in the CloudTrail record.

Record format conversion with decompression
For CloudWatch Logs data, you can use the record format conversion feature on decompressed data for the Amazon S3 destination. Firehose can convert the input data format from JSON to Apache Parquet or Apache ORC before storing the data in Amazon S3. Parquet and ORC are columnar data formats that save space and enable faster queries compared to row-oriented formats like JSON. You can use the record format conversion settings under Transform and convert records to convert the CloudWatch log data to Parquet or ORC format, using an AWS Glue schema and table (for example, for CloudTrail log data). When dynamic partitioning is configured, record format conversion works along with dynamic partitioning to create the files in the output format with a partition folder structure in the target S3 bucket.

Migrate existing delivery streams for decompression
If you want to migrate an existing Firehose stream that uses Lambda for decompression to this new decompression feature of Firehose, refer to the steps outlined in Enabling and disabling decompression.

Pricing
The Firehose decompression feature decompresses the data and charges per GB of decompressed data. To understand decompression pricing, refer to Amazon Data Firehose pricing.

Clean up
To avoid incurring future charges, delete the resources you created in the following order: delete the CloudWatch Logs subscription filter, delete the Firehose delivery stream, and then delete the S3 buckets.

Conclusion
The decompression and message extraction features of Firehose simplify delivery of CloudWatch Logs to Amazon S3 and Splunk destinations without requiring any code development or additional processing. For an Amazon S3 destination, you can use Parquet or ORC conversion and dynamic partitioning capabilities on decompressed data. For more information, refer to the following resources: Record Transformation and Format Conversion; Enabling and disabling decompression; Message extraction after decompression of CloudWatch Logs.

About the Authors
Ranjit Kalidasan is a Senior Solutions Architect with Amazon Web Services based in Boston, Massachusetts. He is a Partner Solutions Architect helping security ISV partners co-build and co-market solutions with AWS. He brings over 25 years of experience in information technology helping global customers implement complex solutions for security and analytics. You can connect with Ranjit on LinkedIn. Phaneendra Vuliyaragoli is a Product Management Lead for Amazon Data Firehose at AWS. In this role, Phaneendra leads the product and go-to-market strategy for Amazon Data Firehose. View the full article
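If you prefer the API to the console flow above, the following Python/boto3 sketch applies the same decompression and message extraction settings to an existing Splunk-destination stream. The processor types (Decompression, CloudWatchLogProcessing) and the DataMessageExtraction parameter follow my reading of the Firehose API reference; the stream name is a placeholder, and the exact shapes should be verified before use.

    import boto3

    # Sketch: enable CloudWatch Logs decompression + message extraction on
    # an existing Firehose stream via the API instead of the console.
    firehose = boto3.client("firehose", region_name="us-east-1")

    processing_configuration = {
        "Enabled": True,
        "Processors": [
            {"Type": "Decompression"},  # gunzip CloudWatch Logs records
            {
                "Type": "CloudWatchLogProcessing",  # message extraction
                "Parameters": [
                    {"ParameterName": "DataMessageExtraction",
                     "ParameterValue": "true"}
                ],
            },
        ],
    }

    # Fetch the current version id, then apply the destination update.
    desc = firehose.describe_delivery_stream(
        DeliveryStreamName="cw-logs-to-splunk")["DeliveryStreamDescription"]
    firehose.update_destination(
        DeliveryStreamName="cw-logs-to-splunk",
        CurrentDeliveryStreamVersionId=desc["VersionId"],
        DestinationId=desc["Destinations"][0]["DestinationId"],
        SplunkDestinationUpdate={
            "ProcessingConfiguration": processing_configuration},
    )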
  12. Welcome to March's post announcing new training and certification updates, helping equip you and your teams with the skills to work with AWS services and solutions. This month we launched eight new digital training products on AWS Skill Builder, including four new AWS Builder Labs and a free learning plan called Generative AI Developer Kit. We also have three new and one updated AWS Classroom Training courses (two of which have AWS Partner versions), including Developing Generative AI Applications on AWS. A reminder: registration is now open for the new AWS Certified Data Engineer – Associate exam. You can begin preparing with curated exam prep resources, created by the experts at AWS, on AWS Skill Builder. Missed our February course update? Check it out here.

New AWS Skill Builder subscription features
AWS Skill Builder subscriptions are available globally, including Mainland China as of this month, and unlock enhanced AWS Certification exam prep and hands-on AWS Cloud training, including 1,000+ interactive learning and lab experiences like AWS Cloud Quest, AWS Industry Quest, AWS Builder Labs, and AWS Jam challenges. Select plans offer access to AWS Digital Classroom courses to dive deep with expert instruction. Try a 7-day free trial of an Individual subscription. *terms and conditions apply

AWS Builder Labs
Migrate On-Premises Servers to AWS Using Application Migration Service (MGN) (60 min.) is an intermediate-level lab providing you an opportunity to learn how to use AWS Application Migration Service to migrate an existing workload to AWS. Migrate On-premises Databases to AWS Using AWS Database Migration Service (DMS) (75 min.) is an intermediate-level lab providing you an opportunity to learn how to use AWS Database Migration Service to migrate an existing database to Amazon Aurora. Data Modeling for Amazon Neptune (60 min.) is an intermediate-level lab providing you an opportunity to explore the process of modeling data with Amazon Neptune to meet prescribed use cases. Analyzing CloudWatch Logs with Kinesis Data Streams and Kinesis Data Analytics (4 hr.) is an advanced-level, challenge-based lab allowing you to learn how to use Amazon CloudWatch to collect Amazon Elastic Compute Cloud (EC2) system logs and use Amazon Kinesis to analyze the collected data.

AWS Certification exam preparation and updates
Now available: AWS Certified Data Engineer – Associate. Registration is now open for the AWS Certified Data Engineer – Associate. Showcase your knowledge and skills in core data-related AWS services, implementing data pipelines, and providing high-quality data for business insights. Gain confidence going into exam day with trusted exam prep on AWS Skill Builder, including an Official Pretest, available now in all exam languages.

Free digital courses on AWS Skill Builder
The following digital courses on AWS Skill Builder are free to all learners, along with 600+ free digital courses and learning plans.

Digital learning plan
Generative AI Developer Kit (includes labs) (16 hr. 30 min.) is a collection of curated courses, labs, and challenges to develop the skills needed to build generative AI applications. Software developers interested in leveraging large language models without fine-tuning will benefit from this collection. You'll receive an overview of generative AI, learn to plan a generative AI project, get started with Amazon CodeWhisperer and Amazon Bedrock, learn the foundations of prompt engineering, and discover the architecture patterns to build generative AI applications using Amazon Bedrock and LangChain.

Digital courses
Decarbonization with AWS Introduction (15 min.) is a fundamental-level course that teaches you about the AWS Customer Carbon Footprint Tool and other resources that can be used to advance your sustainability goals. You'll learn how businesses use the AWS Customer Carbon Footprint Tool, how it helps you reduce your carbon footprint and achieve decarbonization goals with AWS, and considerations for using the tool across a variety of usage and cost-savings scenarios. Amazon Redshift Introduction (15 min.) is a fundamental-level course that provides an introduction to Amazon Redshift, including its common uses and benefits. AWS Mainframe Modernization – Using Replatform Tools with Amazon AppStream (60 min.) is an intermediate-level course teaching the setup and usage of Micro Focus tools from OpenText, such as Enterprise Analyzer and Enterprise Developer, with Amazon AppStream 2.0.

AWS Classroom Training
Designing and Implementing Storage on AWS is a three-day, intermediate-level course teaching you to select, design, implement, and optimize secure storage solutions to save time and cost, improve performance and scale, and accelerate innovation. You'll explore AWS storage services and solutions for storing, accessing, and protecting your data. An expert AWS instructor will help you understand where, how, and when to take advantage of different storage services. Learn how to best evaluate the appropriate AWS storage service options to meet your use case and business requirements. Build Modern Applications with AWS NoSQL Databases is a one-day, intermediate-level course to help you understand how to build applications that involve complex data characteristics and millisecond performance requirements from your databases. You'll learn to use purpose-built databases to build typical modern applications with diverse access patterns and real-time scaling needs. An AWS Partner version is also available. Running Containers on Amazon Elastic Kubernetes Service (Amazon EKS) is an updated, three-day, intermediate-level course from an expert AWS instructor that teaches container management and orchestration for Kubernetes using Amazon EKS. You'll build an Amazon EKS cluster, configure the environment, deploy the cluster, and add applications to your cluster. You'll also learn how to manage container images using Amazon Elastic Container Registry (ECR) and automate application deployment. Developing Generative AI Applications on AWS is a two-day, advanced-level course that teaches you the basics, benefits, and associated terminology of generative AI. An expert AWS instructor will guide you through planning a generative AI project and the foundations of prompt engineering to develop generative AI applications with AWS services. By the end of the course, you'll have the skills needed to build applications that can generate and summarize text, answer questions, and interact with users through a chatbot interface. An AWS Partner version is also available. View the full article
  13. Amazon CloudWatch Container Insights with Enhanced Observability for EKS now auto-discovers critical health and performance metrics from your NVIDIA GPUs and delivers them in automatic dashboards to enable faster problem isolation and troubleshooting for your AI/ML workloads. Container Insights with Enhanced Observability delivers out-of-the-box trends and patterns on your infrastructure health and removes the overhead of manual dashboard and alarm setup, saving you time and effort. View the full article
  14. Amazon CloudWatch Logs now lets customers use Internet Protocol version 6 (IPv6) addresses for their new and existing domains. Customers moving to IPv6 can simplify their network stack by running their CloudWatch log groups on a dual-stack network that supports both IPv4 and IPv6. View the full article
  15. Amazon CloudWatch Logs is excited to announce support for creating account-level subscription filters using the put-account-policy API. This new capability enables you to deliver real-time log events that are ingested into Amazon CloudWatch Logs to an Amazon Kinesis Data Stream, Amazon Kinesis Data Firehose, or AWS Lambda for custom processing, analysis, or delivery to other destinations using a single account-level subscription filter. View the full article
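Here is a hedged sketch of what that call looks like with boto3. The policyDocument keys follow my reading of the PutAccountPolicy documentation for subscription filter policies, and all ARNs are placeholders.

    import boto3, json

    logs = boto3.client("logs", region_name="us-east-1")

    # Account-level subscription filter: every current and future log group
    # in the account streams matching events to one Kinesis Data Stream.
    policy = {
        "DestinationArn": "arn:aws:kinesis:us-east-1:111122223333:stream/log-events",
        "RoleArn": "arn:aws:iam::111122223333:role/cwl-to-kinesis",
        "FilterPattern": "",       # empty pattern forwards all events
        "Distribution": "Random",
    }

    logs.put_account_policy(
        policyName="account-subscription-filter",
        policyDocument=json.dumps(policy),
        policyType="SUBSCRIPTION_FILTER_POLICY",
        scope="ALL",
    )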
  16. Introduction
We have observed a growing adoption of container services among both startups and established companies. This trend is driven by the ease of deploying applications and migrating from on-premises environments to the cloud. One platform of choice for many of our customers is Amazon Elastic Container Service (Amazon ECS). The powerful simplicity of Amazon ECS allows customers to scale from managing a single task to overseeing their entire enterprise application portfolio, reaching thousands of tasks. Amazon ECS eliminates the management overhead associated with running your own container orchestration service.

When working with customers, we have observed a valuable opportunity to make better use of Amazon ECS events. Lifecycle events offer troubleshooting insights by linking service events with metrics and logs. Amazon ECS displays only the latest 100 events, making it tricky to review them retrospectively. Using Amazon CloudWatch Container Insights resolves this by storing Amazon ECS lifecycle events in an Amazon CloudWatch log group. This integration lets you analyze events retroactively, enhancing operational efficiency. Amazon EventBridge is a serverless event bus that connects applications seamlessly. Along with Container Insights, Amazon ECS can serve as an event source while Amazon CloudWatch Logs acts as the target in Amazon EventBridge. This enables post-incident analysis using Amazon CloudWatch Logs Insights. This post explains how to effectively analyze Amazon ECS service events via Container Insights, Amazon EventBridge, or both, using Amazon CloudWatch Logs Insights queries. These queries can significantly enhance your development and operational workflows.

Prerequisites
To work through the techniques presented in this guide, you must have the following in your account: an Amazon ECS cluster with an active workload, and Amazon EventBridge configured to stream events either to Amazon CloudWatch Logs directly or with Amazon ECS CloudWatch Container Insights enabled. Here is an elaborated guide to set up Amazon EventBridge to stream events to Amazon CloudWatch Logs or Container Insights.

Walkthrough
Useful lifecycle event patterns
The events that Amazon ECS emits can be categorized into four groups:

Container instance state change events – These events are triggered when there is a change in the state of an Amazon ECS container instance. This can happen for various reasons, such as starting or stopping a task, upgrading the Amazon ECS agent, or other scenarios.

Task state change events – These events are emitted whenever there is a change in the state of a task, such as when it transitions from pending to running or from running to stopped. Additionally, events are triggered when a container within a task stops or when a termination notice is received for AWS Fargate Spot capacity.

Service action events – These events provide information about the state of the service and are categorized as info, warning, or error. They are generated when the service reaches a steady state, when the service consistently cannot place a task, when the Amazon ECS APIs are throttled, or when there are insufficient resources to place a task.

Service deployment state change events – These events are emitted when a deployment is in progress, completed, or failed. They are typically triggered by the circuit breaker logic and rollback settings.

For a more detailed explanation and examples of these events and their potential use cases, please refer to the Amazon ECS events documentation. Let's dive into some real-world examples of how to use events for operational support. We've organized these examples into four categories based on event patterns: task patterns, service action patterns, service deployment patterns, and ECS container instance patterns. Each category includes common use cases and demonstrates specific queries and results.

Running an Amazon CloudWatch Logs Insights query
Follow these steps to run the Amazon CloudWatch Logs Insights queries covered in the later sections of this post: open the Amazon CloudWatch console and choose Logs, then choose Logs Insights; choose the log groups containing Amazon ECS events and performance logs; enter the desired query and choose Run to view the results.

Task event patterns
Scenario 1: The operations team needs to investigate the cause of HTTP 5XX (server-side) errors observed in their environment. To do so, they want to confirm whether an Amazon ECS task correctly followed its intended task lifecycle. The team suspects that a task's lifecycle events might be contributing to the 5XX errors, and they need to narrow down the exact source of these issues to implement effective troubleshooting and resolution.

Required query (replace the detail.containers.0.taskArn value with the intended task ARN):

    fields time as Timestamp, `detail-type` as Type, detail.lastStatus as `Last Status`,
           detail.desiredStatus as `Desired Status`, detail.stopCode as StopCode,
           detail.stoppedReason as Reason
    | filter detail.containers.0.taskArn = "arn:aws:ecs:us-east-1:111122223333:task/CB-Demo/6e81bd7083ad4d559f8b0b147f14753f"
    | sort @timestamp desc
    | limit 10

Result: the service events confirm the task lifecycle. From the results we can see the Last Status of the task progressed as follows:

    PROVISIONING > PENDING > ACTIVATING > RUNNING > DEACTIVATING > STOPPING > DEPROVISIONING > STOPPED

This conforms to the documented task lifecycle flow: the task was first deactivated and then stopped. We can also see that the stoppage of this task was initiated by the scheduler (ServiceSchedulerInitiated) because the task failed container health checks. Similarly, the query can fetch the lifecycle details of a task failing load balancer health checks:

    fields time as Timestamp, `detail-type` as Type, detail.lastStatus as `Last Status`,
           detail.desiredStatus as `Desired Status`, detail.stopCode as StopCode,
           detail.stoppedReason as Reason
    | filter detail.containers.0.taskArn = "arn:aws:ecs:us-east-1:111122223333:task/CB-Demo/649e1d63f0db482bafa0087f6a3aa5ed"
    | sort @timestamp desc
    | limit 10

Another example is a task that was stopped manually by calling StopTask: there the action is UserInitiated and the reason is "Task stopped by user". In both cases we can see how the Desired Status (irrespective of who initiated the stop) drives the Last Status of the task.

Scenario 2: Consider a scenario where you encounter frequent task failures within a service and need a means to diagnose the root causes. Tasks might be terminating for various reasons, such as resource limitations or application errors. To address this, you can query the stop reasons for all tasks in the service to uncover underlying issues.

Required query (replace the detail.group value with your intended service name):

    filter `detail-type` = "ECS Task State Change"
      and detail.desiredStatus = "STOPPED"
      and detail.group = "service:circuit-breaker-demo"
    | fields detail.stoppingAt as stoppingAt, detail.stoppedReason as stoppedReason, detail.taskArn as Task
    | sort @timestamp desc
    | limit 200

TIP: if you have service auto scaling enabled and there are frequent scaling events for the service, you can add another filter to exclude scaling-related stops and focus solely on the other stop reasons:

    filter `detail-type` = "ECS Task State Change"
      and detail.desiredStatus = "STOPPED"
      and detail.stoppedReason not like "Scaling activity initiated by"
      and detail.group = "service:circuit-breaker-demo"
    | fields detail.stoppingAt as stoppingAt, detail.stoppedReason as stoppedReason, detail.taskArn as Task
    | sort @timestamp desc
    | limit 200

Result: the results show the stop reasons for tasks within the service, along with their respective task IDs. By analyzing these stop reasons, you can identify the specific issues leading to task terminations. Depending on the stop reasons, potential solutions might involve application tuning, adjusting resource allocations, optimizing task definitions, or fine-tuning scaling strategies.

Scenario 3: Consider a scenario where your security team needs critical information about the usage of specific network interfaces, MAC addresses, or attachment IDs. Amazon ECS automatically provisions and deprovisions Elastic Network Interfaces (ENIs) when tasks start and stop. However, once a task is stopped, there are no readily available records or associations to trace back to a specific task ID from the ENI or the MAC address assigned to it. This poses a challenge in meeting the security team's request, as the automatic nature of ENI management in Amazon ECS may limit historical tracking capabilities for these identifiers.

Required query (query input: the detail.attachments.1.details.1.value filter is the intended ENI ID; also replace the task and cluster ARN patterns with your own):

    fields @timestamp, `detail.attachments.1.details.1.value` as ENIId,
           `detail.attachments.1.status` as ENIStatus, `detail.lastStatus` as TaskStatus
    | filter `detail.attachments.1.details.1.value` = "eni-0e2b348058ae3d639"
    | parse @message "arn:aws:ecs:us-east-1:111122223333:task/CB-Demo/*\"" as TaskId
    | parse @message "arn:aws:ecs:us-east-1:111122223333:cluster/*\"," as Cluster
    | parse @message "service:*\"," as Service
    | display @timestamp, ENIId, ENIStatus, TaskId, Service, Cluster, TaskStatus

To look up by MAC address instead, filter on detail.attachments.1.details.2.value with the intended MAC address:

    fields @timestamp, `detail.attachments.1.details.1.value` as ENIId,
           `detail.attachments.1.details.2.value` as MAC,
           `detail.attachments.1.status` as ENIStatus, `detail.lastStatus` as TaskStatus
    | filter `detail.attachments.1.details.2.value` = '12:eb:5f:5a:83:93'
    | parse @message "arn:aws:ecs:us-east-1:111122223333:task/CB-Demo/*\"" as TaskId
    | parse @message "arn:aws:ecs:us-east-1:111122223333:cluster/*\"," as Cluster
    | parse @message "service:*\"," as Service
    | display @timestamp, ENIId, MAC, ENIStatus, TaskId, Service, Cluster, TaskStatus

Result: looking up by ENI ID, the results show the task, service, and cluster for which the ENI was provisioned, along with the state of the task, so we can correlate them. Just as with the ENI ID, we can query by MAC address and get the same details.

Service action event patterns
Scenario 4: You may need to identify and prioritize resolution for the services with the highest number of faults. To achieve this, you can query for the top N services that are experiencing issues.

Required query:

    filter `detail-type` = "ECS Service Action" and @message like /(?i)(WARN)/
    | stats count(detail.eventName) as countOfWarnEvents by resources.0 as serviceArn, detail.eventName as eventFault
    | sort countOfWarnEvents desc
    | limit 20

Result: by filtering for WARN events and aggregating service-specific occurrences, you can pinpoint the services that require immediate attention. For example, the service ecsdemo-auth-no-sd is facing the SERVICE_TASK_START_IMPAIRED error. Prioritizing resolution efforts this way ensures that you can focus your resources on mitigating the most impactful issues and enhancing the overall reliability of your microservices ecosystem.

Service deployment event patterns
Scenario 5: Since every Amazon ECS service event carries an event type of INFO, WARN, or ERROR, we can use this as a search pattern to analyze our workloads for troubled services.

Required query:

    fields @timestamp as Time, `resources.0` as Service, `detail-type` as `lifecycleEvent`,
           `detail.reason` as `failureReason`, @message
    | filter `detail.eventType` = "ERROR"
    | sort @timestamp desc
    | display Time, Service, lifecycleEvent, failureReason
    | limit 100

Result: in the results, the ecsdemo-backend service is failing to successfully deploy tasks, which activates the Amazon ECS circuit breaker mechanism that stops the deployment of the service. Using the expand arrow to the left of the table, we can get more details about the event.

Scenario 6: In this scenario, you have received a notification from the operations team indicating that, following a recent deployment to an Amazon ECS service, the previous version of the application is still visible. They are experiencing a situation where the new deployment did not replace the old one as expected, leading to confusion and potential issues. The operations team seeks to understand the series of events that occurred during the deployment process to determine what went wrong, identify the source of the issue, and implement the necessary corrective measures to ensure a successful deployment.

Required query (replace the resources.0 value with the intended service ARN):

    fields time as Timestamp, detail.deploymentId as DeploymentId, detail.eventType as Severity,
           detail.eventName as Name, detail.reason as Detail, `detail-type` as EventType
    | filter `resources.0` = "arn:aws:ecs:us-east-1:12345678910:service/CB-Demo/circuit-breaker-demo"
    | sort @timestamp desc
    | limit 10

Result: let's analyze the service events to understand what went wrong during the deployment. By examining the sequence of events, a clear timeline emerges: the service was initially in a steady state (line 7) with a good deployment (ecs-svc/6629184995452776901, line 6). A new deployment (ecs-svc/4503003343648563919) occurs, possibly with a code bug (line 5). A task from this deployment fails to start (line 3). This problematic deployment triggers the circuit breaker logic, which initiates a rollback to the previously known good deployment (ecs-svc/6629184995452776901, line 4). The service eventually returns to a steady state (lines 1 and 2). This sequence of events not only provides a chronological view of what happened but also offers specific insights into the deployments involved and the potential reasons for the issue. By analyzing these service events, the operations team can pinpoint the problematic deployment (ecs-svc/4503003343648563919) and investigate further to identify and address the underlying code issues, ensuring a more reliable deployment process in the future.

ECS container instance event patterns
Scenario 7: You want to track the history of Amazon ECS agent updates for container instances in the cluster. A trackable history ensures compliance with security standards by verifying that the agent has the necessary patches and updates installed, and it also allows for the verification of rollbacks in the event of problematic updates. This information is valuable for operational efficiency and service reliability.

Required query:

    fields @timestamp, detail.agentUpdateStatus as agentUpdateStatus,
           detail.containerInstanceArn as containerInstanceArn,
           detail.versionInfo.agentVersion as agentVersion
    | filter `detail-type` = "ECS Container Instance State Change"
    | sort @timestamp desc
    | limit 200

Result: initially, the container instance operated with ECS agent version 1.75.0. At sequence 9, an update operation was initiated, indicating the presence of a new Amazon ECS agent version, and after a series of update actions the agent update successfully concluded at sequence 1. This offers a clear snapshot of the version transition and update procedure, underlining the importance of tracking Amazon ECS agent updates to ensure the security, reliability, and functionality of the ECS cluster.

Cleaning up
Once you have finished exploring the sample queries, ensure that you disable any Amazon EventBridge rules and Amazon ECS CloudWatch Container Insights so that you do not incur further cost.

Conclusion
In this post, we've explored ways to harness the full potential of Amazon ECS events, a valuable resource for troubleshooting. Amazon ECS provides useful information about tasks, services, deployments, and container instances. Analyzing ECS events in Amazon CloudWatch Logs enables you to identify patterns over time, correlate events with other logs, discover recurring issues, and conduct various forms of analysis. We've outlined straightforward yet powerful methods for searching and utilizing Amazon ECS events, including tracking the lifecycle of tasks to swiftly diagnose unexpected stoppages, identifying tasks with specific network details to bolster security, pinpointing problematic services, understanding deployment issues, and ensuring the Amazon ECS agent is up to date for reliability. This broader perspective on your system's operations equips you to proactively address problems, gain insights into your container performance, facilitate smooth deployments, and fortify your system's security.

Additional references
To learn more about the Amazon CloudWatch query domain-specific language (DSL), visit the documentation (CloudWatch Logs Insights query syntax). You can further set up anomaly detection by processing Amazon ECS events from EventBridge, which is explained in detail in Amazon Elastic Container Service Anomaly Detector using Amazon EventBridge. View the full article
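As a companion to the console steps above, here is a minimal sketch for running any of these queries programmatically with boto3's StartQuery and GetQueryResults. The log group name is a placeholder for wherever your EventBridge rule or Container Insights delivers the ECS events.

    import time
    import boto3
    from datetime import datetime, timedelta, timezone

    logs = boto3.client("logs", region_name="us-east-1")

    LOG_GROUP = "/aws/events/ecs-events"  # placeholder

    QUERY = """
    filter `detail-type` = "ECS Task State Change" and detail.desiredStatus = "STOPPED"
    | fields detail.stoppingAt as stoppingAt, detail.stoppedReason as stoppedReason, detail.taskArn as Task
    | sort @timestamp desc
    | limit 200
    """

    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=24)

    query_id = logs.start_query(
        logGroupName=LOG_GROUP,
        startTime=int(start.timestamp()),
        endTime=int(end.timestamp()),
        queryString=QUERY,
    )["queryId"]

    # Poll until the query finishes, then print each result row.
    while True:
        result = logs.get_query_results(queryId=query_id)
        if result["status"] in ("Complete", "Failed", "Cancelled"):
            break
        time.sleep(1)

    for row in result.get("results", []):
        print({field["field"]: field["value"] for field in row})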
  17. We are excited to announce regular expression support for Amazon CloudWatch Logs filter pattern syntax, making it easier to search and match relevant logs in the AWS GovCloud (US) Regions. Customers use filter pattern syntax today to search logs, extract metrics using metric filters, and send specific logs to other destinations with subscription filters. With today’s launch, customers will be able to further customize these operations to meet their needs with flexible and powerful regular expressions within filter patterns. Now customers can define one filter to match multiple IP subnets or HTTP status codes using a regular expression such as ‘{ $.statusCode=%4[0-9]{2}% }’ rather than having to define multiple filters to cater to each variation, reducing the configuration and management overhead on their logs. View the full article
  18. To make it easy to interact with your operational data, Amazon CloudWatch is introducing natural language query generation for Logs and Metrics Insights. With this capability, powered by generative artificial intelligence (AI), you can describe in English the insights you are looking for, and a Logs or Metrics Insights query will be automatically generated. This feature provides three main capabilities for CloudWatch Logs and Metrics Insights: generating new queries from a description or a question to help you get started easily; query explanation to help you learn the language, including its more advanced features; and refinement of existing queries using guided iterations. Let's see how these work in practice with a few examples. I'll cover logs first and then metrics.

Generate CloudWatch Logs Insights queries with natural language
In the CloudWatch console, I select Logs Insights in the Logs section. I then select the log group of an AWS Lambda function that I want to investigate. I choose the Query generator button to open a new Prompt field where I enter what I need using natural language: "Tell me the duration of the 10 slowest invocations". Then, I choose Generate new query. The following Logs Insights query is automatically generated:

    fields @timestamp, @requestId, @message, @logStream, @duration
    | filter @type = "REPORT" and @duration > 1000
    | sort @duration desc
    | limit 10

I choose Run query to see the results. I find that now there's too much information in the output. I prefer to see only the data I need, so I enter the following sentence in the Prompt and choose Update query: "Show only timestamps and latency". The query is updated based on my input and only the timestamp and duration are returned:

    fields @timestamp, @duration
    | filter @type = "REPORT" and @duration > 1000
    | sort @duration desc
    | limit 10

I run the updated query and get a result that is easier for me to read. Now, I want to know if there are any errors in the log. I enter this sentence in the Prompt and generate a new query: "Count the number of ERROR messages". As requested, the generated query counts the messages that contain the ERROR string:

    fields @message
    | filter @message like /ERROR/
    | stats count()

I run the query and find out that there are more errors than I expected. I need more information, so I use this prompt to update the query and get a better distribution of the errors: "Show the errors per hour". The updated query uses the bin() function to group the results into one-hour intervals:

    fields @timestamp, @message
    | filter @message like /ERROR/
    | stats count(*) by bin(1h)

Let's see a more advanced query about memory usage. I select the log groups of a few Lambda functions and type: "Show invocations with the most over-provisioned memory grouped by log stream". Before generating the query, I choose the gear icon to toggle the options to include my prompt and an explanation as comments. Here's the result (with the explanation split over multiple lines for readability):

    # Show invocations with the most over-provisioned memory grouped by log stream
    fields @logStream,
           @memorySize/1000/1000 as memoryMB,
           @maxMemoryUsed/1000/1000 as maxMemoryUsedMB,
           (@memorySize/1000/1000 - @maxMemoryUsed/1000/1000) as overProvisionedMB
    | stats max(overProvisionedMB) as maxOverProvisionedMB by @logStream
    | sort maxOverProvisionedMB desc
    # This query finds the amount of over-provisioned memory for each log stream by
    # calculating the difference between the provisioned and maximum memory used.
    # It then groups the results by log stream and calculates the maximum
    # over-provisioned memory for each log stream. Finally, it sorts the results
    # in descending order by the maximum over-provisioned memory to show
    # the log streams with the most over-provisioned memory.

Now, I have the information I need to understand these errors. On the other side, I also have EC2 workloads. How are those instances running? Let's look at some metrics.

Generate CloudWatch Metrics Insights queries with natural language
In the CloudWatch console, I select All metrics in the Metrics section. Then, in the Query tab, I use the Editor. If you prefer, the Query generator is also available in the Builder. I choose Query generator as before. Then, I enter what I need using plain English: "Which 10 EC2 instances have the highest CPU utilization?" I choose Generate new query and get a result using the Metrics Insights syntax:

    SELECT AVG("CPUUtilization")
    FROM SCHEMA("AWS/EC2", InstanceId)
    GROUP BY InstanceId
    ORDER BY AVG() DESC
    LIMIT 10

To see the graph, I choose Run. Well, it looks like my EC2 instances are not doing much. This result shows how those instances are using the CPU, but what about storage? I enter this in the prompt and choose Update query: "How about the most EBS writes?" The updated query replaces the average CPU utilization with the sum of bytes written to all EBS volumes attached to each instance. It keeps the limit to only show the top 10 results:

    SELECT SUM("EBSWriteBytes")
    FROM SCHEMA("AWS/EC2", InstanceId)
    GROUP BY InstanceId
    ORDER BY SUM() DESC
    LIMIT 10

I run the query and, by looking at the result, I have a better understanding of how storage is being used by my EC2 instances. Try entering some requests and run the generated queries over your logs and metrics to see how this works with your data.

Things to know
Amazon CloudWatch natural language query generation for logs and metrics is available in preview in the US East (N. Virginia) and US West (Oregon) AWS Regions. There is no additional cost for using natural language query generation during the preview; you only pay for the cost of running the queries, according to CloudWatch pricing. When generating a query, you can include your original request and an explanation of the query as comments. To do so, choose the gear icon in the bottom right corner of the query edit window and toggle those options. This new capability can help you generate and update queries for logs and metrics, saving you time and effort, and it allows engineering teams to scale their operations without worrying about specific data knowledge or query expertise. Use natural language to analyze your logs and metrics with Amazon CloudWatch. — Danilo View the full article
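A side note on running generated Metrics Insights queries outside the console: the GetMetricData API accepts a Metrics Insights SELECT statement as a query expression. A minimal boto3 sketch, reusing the CPU query generated above:

    import boto3
    from datetime import datetime, timedelta, timezone

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=3)

    # Run a Metrics Insights query (SQL-like SELECT) via GetMetricData.
    response = cloudwatch.get_metric_data(
        MetricDataQueries=[
            {
                "Id": "top_cpu",
                "Expression": (
                    'SELECT AVG("CPUUtilization") '
                    'FROM SCHEMA("AWS/EC2", InstanceId) '
                    "GROUP BY InstanceId ORDER BY AVG() DESC LIMIT 10"
                ),
                "Period": 300,
            }
        ],
        StartTime=start,
        EndTime=end,
    )

    # One time series per instance returned by the query.
    for series in response["MetricDataResults"]:
        print(series["Label"], list(zip(series["Timestamps"], series["Values"])))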
  19. Searching through log data to find operational or business insights often feels like looking for a needle in a haystack. It usually requires you to manually filter and review individual log records. To help you with that, Amazon CloudWatch has added new capabilities to automatically recognize and cluster patterns among log records, extract noteworthy content and trends, and notify you of anomalies, using advanced machine learning (ML) algorithms trained on decades of Amazon and AWS operational data. Specifically, CloudWatch now offers the following: the Patterns tab on the Logs Insights page finds recurring patterns in your query results and lets you analyze them in detail, making it easier to find what you're looking for and drill down into new or unexpected content in your logs; the Compare button in the time interval selector on the Logs Insights page lets you quickly compare the query result for the selected time range to a previous period, such as the previous day, week, or month, so it takes less time to see what has changed compared to a previous stable scenario; and the Log Anomalies page in the Logs section of the navigation pane automatically surfaces anomalies found in your logs while they are processed during ingestion. Let's see how these work in practice with a typical troubleshooting journey. I will look at some application logs to find key patterns, compare two time periods to understand what changed, and finally see how detecting anomalies can help discover issues.

Finding recurring patterns in the logs
In the CloudWatch console, I choose Logs Insights from the Logs section of the navigation pane. To start, I select which log groups I want to query. In this case, I select a log group of a Lambda function that I want to inspect and choose Run query. In the Patterns tab, I see the patterns that have been found in these log groups. One of the patterns seems to be an error. I can select it to quickly add it as a filter to my query and focus on the logs that contain this pattern. For now, I choose the magnifying glass icon to analyze the pattern. In the Pattern inspect window, a histogram with the occurrences of the pattern in the selected time period is shown, followed by samples from the logs. The variable parts of the pattern (such as numbers) have been extracted as "tokens." I select the Token values tab to see the values for a token, and I can select a token value to quickly add it as a filter to the query and focus on the logs that contain this pattern with this specific value. I can also look at the Related patterns tab to see other logs that typically occurred at the same time as the pattern I am analyzing. For example, if I am looking at an ERROR log that was always written alongside a DEBUG log showing more details, I would see that relationship there.

Comparing logs with a previous period
To better understand what is happening, I choose the Compare button in the time interval selector. This updates the query to compare results with a previous period. For example, I choose Previous day to see what changed compared to yesterday. In the Patterns tab, I notice that there has actually been a 10 percent decrease in the number of errors, so the current situation might not be too bad. I choose the magnifying glass icon on the pattern with severity type ERROR to see a full comparison of the two time periods. The graph overlaps the occurrences of the pattern over the two periods (now and yesterday, in this case) inside the selected time range (one hour). Errors are decreasing but are still there. To reduce those errors, I make some changes to the application. I come back after some time to compare the logs, and a new ERROR pattern is found that was not present in the previous time period. My update probably broke something, so I roll back to the previous version of the application. For now, I'll keep it as it is because the number of errors is acceptable for my use case.

Detecting anomalies in the logs
I am reassured by the decrease in errors that I discovered comparing the logs. But how can I know if something unexpected is happening? Anomaly detection for CloudWatch Logs looks for unexpected patterns in the logs as they are processed during ingestion, and it can be enabled at the log group level. I select Log groups in the navigation pane and type a filter to see the same log group I was looking at before. I choose Configure in the Anomaly detection column and select an Evaluation frequency of 5 minutes. Optionally, I can use a longer interval (up to 60 minutes) and add patterns to process only specific log events for anomaly detection. After I activate anomaly detection for this log group, incoming logs are constantly evaluated against historical baselines. I wait for a few minutes and, to see what has been found, I choose Log anomalies from the Logs section of the navigation pane. To simplify this view, I can suppress anomalies that I am not interested in following. For now, I choose one of the anomalies to inspect the corresponding pattern, as before. After this additional check, I am convinced there are no urgent issues with my application. With all the insights I collected through these new capabilities, I can now focus on the errors in the logs to understand how to solve them.

Things to know
Amazon CloudWatch automated log pattern analytics is available today in all commercial AWS Regions where Amazon CloudWatch Logs is offered, excluding the China (Beijing), China (Ningxia), and Israel (Tel Aviv) Regions. The patterns and compare query features are charged according to existing Logs Insights query costs; comparing a one-hour time period against another one-hour time period is equivalent to running a single query over a two-hour time period. Anomaly detection is included as part of your log ingestion fees, and there is no additional charge for this feature. For more information, see CloudWatch pricing. Simplify how you analyze logs with CloudWatch automated log pattern analytics. — Danilo View the full article
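The same log group setup can be scripted. Below is a minimal boto3 sketch assuming the CreateLogAnomalyDetector and ListAnomalies APIs as documented at launch; the detector name and log group ARN are placeholders, and the field names should be verified against the current API reference.

    import boto3

    logs = boto3.client("logs", region_name="us-east-1")

    # Enable anomaly detection on a log group, evaluating every 5 minutes
    # (the console flow described above does the same thing).
    detector = logs.create_log_anomaly_detector(
        detectorName="lambda-app-anomalies",
        logGroupArnList=[
            "arn:aws:logs:us-east-1:111122223333:log-group:/aws/lambda/my-function"
        ],
        evaluationFrequency="FIVE_MIN",
    )

    # Later: list what has been surfaced (the Log Anomalies page equivalent).
    anomalies = logs.list_anomalies(
        anomalyDetectorArn=detector["anomalyDetectorArn"])
    for anomaly in anomalies.get("anomalies", []):
        print(anomaly.get("patternString"), anomaly.get("state"))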
  20. Amazon CloudWatch Container Insights now delivers enhanced observability for Amazon Elastic Kubernetes Service (EKS) with out-of-the-box detailed health and performance metrics, including container-level EKS performance metrics, kube-state metrics, and EKS control plane metrics, for faster problem isolation and troubleshooting. View the full article
  21. We are excited to announce regular expression support for Amazon CloudWatch Logs filter pattern syntax, making it easier to search and match relevant logs. Customers use filter pattern syntax today to search logs, extract metrics using metric filters, and send specific logs to other destinations with subscription filters. With today’s launch, customers will be able to further customize these operations to meet their needs with flexible and powerful regular expressions within filter patterns. Now customers can define one filter to match multiple IP subnets or HTTP status codes using a regular expression such as ‘{ $.statusCode=%4[0-9]{2}% }’ rather than having to define multiple filters to cater to each variation, reducing the configuration and management overhead on their logs. View the full article
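To make the example concrete, here is a hedged boto3 sketch that uses the announcement's regular expression in a metric filter, so one filter counts every 4xx status code. The log group, namespace, and metric names are placeholders.

    import boto3

    logs = boto3.client("logs", region_name="us-east-1")

    # Count every 4xx status code with one regex-based filter pattern
    # instead of defining one filter per status code.
    logs.put_metric_filter(
        logGroupName="/my-app/access-logs",  # placeholder
        filterName="http-4xx-count",
        filterPattern="{ $.statusCode = %4[0-9]{2}% }",
        metricTransformations=[
            {
                "metricName": "Http4xxCount",
                "metricNamespace": "MyApp",
                "metricValue": "1",
                "defaultValue": 0,
            }
        ],
    )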
  22. You can now enable CloudWatch Contributor Insights on your AWS PrivateLink-powered VPC endpoint services in 6 new Regions: Asia Pacific (Hyderabad, Jakarta, Melbourne), Europe (Spain, Zurich), and Middle East (UAE). AWS PrivateLink is a fully managed private connectivity service that enables customers to access AWS services, third-party services, or internal enterprise services hosted on AWS in a secure and scalable manner while keeping network traffic private. CloudWatch Contributor Insights analyzes time-series data to report the top contributors and the number of unique contributors in a dataset. View the full article
  23. Amazon CloudWatch Logs is excited to announce a new Logs Insights command, dedup, which enables customers to eliminate duplicate results when analyzing logs. Customers frequently want to query their logs and view only unique results based on one or more fields, and the new dedup command in Amazon CloudWatch Logs Insights queries does exactly that. For example, you can view the most recent error message for each hostname by executing the dedup command on the hostname field. View the full article
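For instance, the hostname example translates into a query like the one below. This sketch assumes your logs expose a hostname field and reuses the StartQuery pattern shown earlier; poll GetQueryResults as in that sketch to read the rows.

    import boto3
    from datetime import datetime, timedelta, timezone

    logs = boto3.client("logs", region_name="us-east-1")

    # Most recent ERROR message per hostname: sort newest-first, then dedup
    # keeps only the first (latest) row for each hostname value.
    QUERY = """
    fields @timestamp, hostname, @message
    | filter @message like /ERROR/
    | sort @timestamp desc
    | dedup hostname
    """

    end = datetime.now(timezone.utc)
    query_id = logs.start_query(
        logGroupName="/my-app/application-logs",  # placeholder
        startTime=int((end - timedelta(hours=1)).timestamp()),
        endTime=int(end.timestamp()),
        queryString=QUERY,
    )["queryId"]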
  24. Amazon CloudWatch Logs is excited to announce support for account-level data protection policy configuration: you can now create a data protection policy that is applied to all existing and future log groups within your AWS account. View the full article
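A hedged sketch of the account-level call with boto3 follows. The policy document schema (Name, Version, and Statement entries with Audit and Deidentify operations) follows my reading of the data protection policy documentation and should be verified; the data identifier shown is illustrative.

    import boto3, json

    logs = boto3.client("logs", region_name="us-east-1")

    # Account-level data protection policy: audit and mask email addresses
    # in every current and future log group in the account.
    policy = {
        "Name": "account-data-protection",
        "Version": "2021-06-01",
        "Statement": [
            {
                "Sid": "audit",
                "DataIdentifier": [
                    "arn:aws:dataprotection::aws:data-identifier/EmailAddress"
                ],
                "Operation": {"Audit": {"FindingsDestination": {}}},
            },
            {
                "Sid": "deidentify",
                "DataIdentifier": [
                    "arn:aws:dataprotection::aws:data-identifier/EmailAddress"
                ],
                "Operation": {"Deidentify": {"MaskConfig": {}}},
            },
        ],
    }

    logs.put_account_policy(
        policyName="account-data-protection",
        policyDocument=json.dumps(policy),
        policyType="DATA_PROTECTION_POLICY",
        scope="ALL",
    )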
  25. We are excited to announce Amazon CloudWatch Logs Live Tail, a new interactive log analytics experience that helps you detect and debug anomalies in applications. You can now view your logs interactively in real time as they're ingested, which helps you analyze and resolve issues across your systems and applications. View the full article
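For completeness, here is a sketch of tailing logs programmatically. It assumes the StartLiveTail API that backs the console experience; the response event shapes are from my reading of the API docs, and the log group ARN is a placeholder.

    import boto3

    logs = boto3.client("logs", region_name="us-east-1")

    # Tail a log group in real time, optionally filtering to ERROR events.
    response = logs.start_live_tail(
        logGroupIdentifiers=[
            "arn:aws:logs:us-east-1:111122223333:log-group:/aws/lambda/my-function"
        ],
        logEventFilterPattern="ERROR",  # optional
    )

    # The response is an event stream; session updates carry batches of
    # newly ingested log events.
    for event in response["responseStream"]:
        if "sessionUpdate" in event:
            for log_event in event["sessionUpdate"]["sessionResults"]:
                print(log_event["timestamp"], log_event["message"])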