Monitoring & Observability

  • Metrics & Time Series Databases (e.g., Prometheus, Grafana, InfluxDB)

  • Logging & Log Management (e.g., ELK Stack, Loki, Splunk)

  • Tracing & Distributed Systems Monitoring (e.g., Jaeger, Zipkin, OpenTelemetry)

  • Alerting & Incident Management (e.g., PagerDuty, Opsgenie)

  • Synthetic Monitoring & Uptime Checks

  1. Troubleshooting within Kubernetes environments can be a daunting task. If only we had a magical artificial intelligence advisor that could gather all the data about what goes on in the system, tell us what’s wrong, and even how to solve it. Wouldn’t it be nice? K8sGPT is a young open source project that uses […] View the full article

    • 0 replies
    • 104 views
  2. Started by Logz.io

    In technology, having “modern” capabilities is standard. Staying ahead of the curve is critical, and keeping outdated technology or processes going can be a recipe for disaster in a complex, ever-changing landscape. Ensuring the smooth functioning and performance of software systems is paramount. This is where modern observability, a sophisticated approach to monitoring and understanding the […] View the full article

    • 0 replies
    • 90 views
  3. Observability isn’t new. But organizations are struggling to adopt mature observability practices, and the impact on business is palpable. Organizations are seeing the value of observability for their applications and infrastructure; the results of our 2024 Observability Pulse survey of 500 global IT professionals reflect that across the board. But respondents are challenged by the notion […] View the full article

    • 0 replies
    • 200 views
  4. Amazon CloudWatch is excited to announce a resource filtering capability for cross-account observability, providing customers with the flexibility to share a subset of their logs or metrics across multiple AWS accounts using configurable filters. View the full article

  5. In today’s data-driven landscape, managing and analyzing vast amounts of data, especially logs, is crucial for organizations to derive insights and make informed decisions. However, handling this data efficiently presents a significant challenge, prompting organizations to seek scalable solutions without the complexity of infrastructure management. Amazon OpenSearch Serverless lets you run OpenSearch in the AWS Cloud, without worrying about scaling infrastructure. With OpenSearch Serverless, you can ingest, analyze, and visualize your time-series data. Without the need for infrastructure provisioning, OpenSearch Serverless simplifies data management and enables you to d…

  6. Amazon OpenSearch Service now supports Amazon Route 53 alias records for defining custom domain endpoints. Alias records provide better flexibility when configuring routing to AWS resources. For more information about Route 53 alias records, see the documentation. View the full article

  7. Starting today, the Amazon CloudWatch metrics for monitoring AWS Config data usage will display only billable usage. With this enhancement, non-billable usage will no longer be displayed in either the Amazon CloudWatch Config metrics or the AWS Config console. This allows you to validate AWS Config setup and usage using Amazon CloudWatch metrics and correlate billable usage with associated costs. View the full article

  8. In today's fast-paced digital landscape, the ability to monitor and observe the health and performance of applications and infrastructure is not just beneficial—it's essential. As systems grow increasingly complex and the volume of data continues to skyrocket, organizations are faced with the challenge of not just managing this information but making sense of it. This is where Grafana steps in. In this blog post, we'll take a comprehensive look at what Grafana is and how it works. Let's get started! ... View the full article

    • 1 reply
    • 447 views
  9. Without a doubt, you’ve heard about the persistent talent gap that has troubled the technology sector in recent years. It’s a problem that isn’t going away, plaguing everyone from engineering teams to IT security pros, and if you work in the industry today you’ve likely experienced it somewhere within your own teams. Despite major changes […] View the full article

    • 0 replies
    • 117 views
  10. Amazon CloudWatch RUM, which enables customers to monitor their web applications by collecting client side performance and error data in real time, is generally available in the following 5 AWS Regions starting today: Asia Pacific (Hyderabad), Asia Pacific (Melbourne), Europe (Spain), Europe (Zurich), and Middle East (UAE). View the full article

  11. Amazon OpenSearch Service adds support for Hebrew and HanLP (Chinese NLP) language analyzer plugins. These are now available as optional plugins that you can associate with your Amazon OpenSearch Service clusters. View the full article

  12. Amazon CloudWatch Container Insights with Enhanced Observability for EKS now auto-discovers critical health metrics from your AWS accelerators Trainium and Inferentia, and AWS high performance network adapters (Elastic Fabric Adapters) as well as NVIDIA GPUs. You can visualize these out-of-the-box metrics in curated Container Insights dashboards to help monitor your accelerated infrastructure and optimize your AI workloads for operational excellence. View the full article

  13. The Internet has a plethora of moving parts: routers, switches, hubs, terrestrial and submarine cables, and connectors on the hardware side, and complex protocol stacks and configurations on the software side. When something goes wrong that slows or disrupts the Internet in a way that affects your customers, you want to be able to localize and understand the issue as quickly as possible. New Map: The new Amazon CloudWatch Internet Weather Map is here to help! Built atop a collection of global monitors operated by AWS, it gives you a broad, global view of Internet weather, with the ability to zoom in and understand performance and availability issues that affect a particular…

  14. All AWS customers who navigate to the Amazon CloudWatch Internet Monitor console can now view the internet weather map, at no charge, which shares a 24-hour global snapshot of internet latency and availability outages. The map lets you see, at a glance, recent internet issues across the world, including specific cities and service providers. View the full article

  15. Amazon Managed Workflows for Apache Airflow (MWAA) now offers larger environment sizes, giving customers of the managed service the ability to define a greater number of workflows in each Apache Airflow environment, supporting more complex tasks that can utilize increased resources. View the full article

  16. Despite advances in the world of observability, log management hasn’t evolved much in recent years. Users are familiar with the experience of Kibana or OpenSearch Dashboards (OSD), but those don’t always meet modern use cases. Logz.io is ready to change the conversation with the introduction of Explore, the new path forward for Log Management for […] View the full article

    • 0 replies
    • 125 views
  17. Today, AWS announces the release of workflow monitor for live video, a media-centric tool to simplify and elevate the monitoring of your video workloads. Accessible via the AWS Elemental MediaLive console and API, workflow monitor discovers and visualizes resources. It creates signal maps showing video across AWS Elemental MediaConnect, MediaLive, and MediaPackage along with Amazon S3 and Amazon CloudFront to provide end-to-end visibility. With the workflow monitor, you can create your own alarm templates or start from a set of recommended alarms, and build custom templates for alarm notifications. View the full article

  18. Amazon CloudWatch Container Insights now offers observability for Windows containers running on Amazon Elastic Kubernetes Service (EKS), and helps customers collect, aggregate, and summarize metrics and logs from their Windows container infrastructure. With this support, customers can monitor utilization of resources such as CPU, memory, disk, and network, as well as get enhanced observability such as container-level EKS performance metrics, Kube-state metrics and EKS control plane metrics for Windows containers. CloudWatch also provides diagnostic information, such as container restart failures, for faster problem isolation and troubleshooting for Windows containers runn…

  19. You can now create or associate a monitor for a distribution directly from the Amazon CloudFront console. By adding your distribution to a monitor, you can gain improved visibility into your application's internet performance and availability using Amazon CloudWatch Internet Monitor. You can create a monitor for the distribution, or add the distribution to an existing monitor, directly from the distribution metrics dashboard on the CloudFront console. View the full article

  20. Amazon OpenSearch Service is now extending the ability to update the number of data nodes without requiring a blue/green deployment for clusters without dedicated cluster manager (master) nodes. This change will allow you to make node count changes faster. Clusters with dedicated cluster manager nodes already supported updating the data node count without a blue/green deployment. View the full article

  21. Amazon CloudWatch RUM, which enables customers to monitor their web applications by collecting client side performance and error data in real time, is generally available in the following 11 AWS Regions starting today: Africa (Cape Town), Asia Pacific (Jakarta), Asia Pacific (Mumbai), Asia Pacific (Osaka), Asia Pacific (Seoul), Canada (Central), Europe (Milan), Europe (Paris), Middle East (Bahrain), South America (Sao Paulo), and US West (N. California). View the full article

  22. Amazon CloudWatch now supports using AWS CloudFormation to manage tags when you create, update, or delete alarms. View the full article

  23. You can now set up cross-account observability for Amazon CloudWatch Internet Monitor, so that you can get read-only access to monitors from multiple accounts within an AWS Region. Deploying applications by using resources in separate accounts is a good practice, to establish security and billing boundaries between teams and reduce the impact of operational events. For example, when you set up cross-account observability for Internet Monitor, you can access and view performance and availability measurements generated by monitors in different AWS accounts. View the full article

  24. Amazon OpenSearch Ingestion now enables you to enrich events with geographical location data from an IP address, allowing you to add context to your observability and security data in real time. Additionally, you can configure mapping templates in Amazon OpenSearch clusters to automatically display these enriched events on a geographical map using OpenSearch Dashboards. View the full article

  25. OR1, the OpenSearch Optimized Instance family, now doubles the maximum allowed storage per instance. OR1 also expands availability to four additional Regions: Canada (Central), Europe (London), Asia Pacific (Hyderabad), and Asia Pacific (Seoul). OR1 delivers up to 30% price-performance improvement over existing instances (based on internal benchmarks), and uses Amazon S3 to provide 11 9s of durability. The new OR1 instances are best suited for indexing-heavy workloads and offer better indexing performance than the existing memory optimized instances available on OpenSearch Service. View the full article

  26. Amazon CloudWatch now supports Anomaly Detection on metrics shared across your accounts. CloudWatch Anomaly Detection now lets you track unexpected changes in metric behavior across multiple accounts from a single monitoring account through CloudWatch cross-account observability. View the full article

  27. Amazon OpenSearch Service is an Apache-2.0-licensed distributed search and analytics suite offered by AWS. This fully managed service allows organizations to secure data, perform keyword and semantic search, analyze logs, alert on anomalies, explore interactive log analytics, implement real-time application monitoring, and gain a more profound understanding of their information landscape. OpenSearch Service provides the tools and resources needed to unlock the full potential of your data. With its scalability, reliability, and ease of use, it’s a valuable solution for businesses seeking to optimize their data-driven decision-making processes and improve overall operationa…

  28. Every software-driven business strives for optimum performance and user experience. Observability, which allows engineering and IT Ops teams to understand the internal state of their cloud applications and infrastructure based on available telemetry data, has emerged as a crucial practice to help engage this process. For years, application performance monitoring (APM) was the de facto practice […] View the full article

    • 0 replies
    • 113 views
  29. You can use Amazon Data Firehose to aggregate and deliver log events from your applications and services captured in Amazon CloudWatch Logs to your Amazon Simple Storage Service (Amazon S3) bucket and Splunk destinations, for use cases such as data analytics, security analysis, application troubleshooting etc. By default, CloudWatch Logs are delivered as gzip-compressed objects. You might want the data to be decompressed, or want logs to be delivered to Splunk, which requires decompressed data input, for application monitoring and auditing. AWS released a feature to support decompression of CloudWatch Logs in Firehose. With this new feature, you can specify an option in…

  30. Data volumes are soaring. Environments are increasingly intricate. The risk of applications and systems encountering breakdowns is sky-high, and the mean time to recovery (MTTR) for production incidents is moving in the wrong direction. Disruptions not only jeopardize critical infrastructure but also have a direct impact on the bottom line of organizations. Swift recovery of […] View the full article

    • 0 replies
    • 100 views
  31. Today, AWS IoT Core for LoRaWAN announces a new fleet monitoring application that enables developers to capture and visualize critical operational and health parameters related to the functioning of LoRaWAN-based gateways and devices. AWS IoT Core for LoRaWAN is a fully managed LoRaWAN Network Server that supports cloud connectivity for LoRaWAN-based wireless devices. Using the new metrics feature, developers can now quickly capture system health data, such as connection signal strength, data rate, and gateway latency, and analyze their fleet’s performance. View the full article

  32. In Part 2 of this series, we discussed how to enable AWS Glue job observability metrics and integrate them with Grafana for real-time monitoring. Grafana provides powerful customizable dashboards to view pipeline health. However, to analyze trends over time, aggregate from different dimensions, and share insights across the organization, a purpose-built business intelligence (BI) tool like Amazon QuickSight may be more effective for your business. QuickSight makes it straightforward for business users to visualize data in interactive dashboards and reports. In this post, we explore how to connect QuickSight to Amazon CloudWatch metrics and build graphs to uncover trends…

  33. Krones provides breweries, beverage bottlers, and food producers all over the world with individual machines and complete production lines. Every day, millions of glass bottles, cans, and PET containers run through a Krones line. Production lines are complex systems with lots of possible errors that could stall the line and decrease the production yield. Krones wants to detect the failure as early as possible (sometimes even before it happens) and notify production line operators to increase reliability and output. So how to detect a failure? Krones equips their lines with sensors for data collection, which can then be evaluated against rules. Krones, as the line manufact…

  34. The topic of continuous profiling has been an ongoing discussion in the observability world for some time. I said back in 2021 that profiling was set to be the next major telemetry signal in observability, and in fact, since then there’s been growing interest in profiles. Startups and large observability vendors have gotten into this […] View the full article

    • 0 replies
    • 88 views
  35. Amazon CloudWatch Logs now supports increased default API quotas. The default quota for ingesting logs has increased from 1,500 to 5,000 Transactions Per Second (TPS) in select regions. The increased quotas are available automatically with no changes required. View the full article

  36. Kubernetes has changed the way many organizations approach the deployment of their applications. But despite its benefits, the additional layers of abstraction and reams of data can cause complexity around Kubernetes monitoring. We’ve seen many of these challenges borne out in the results of the 2024 Observability Pulse survey. In the survey report, 36% […] View the full article

    • 0 replies
    • 105 views
  37. The more data you have, the harder it becomes to read through it, let alone identify trends or crucial patterns. Couple that with a shortage of time, and the ability not only to visualize but also to communicate with your data becomes paramount. To help empower your data analysis like never before, we’re introducing a […] View the full article

    • 0 replies
    • 113 views
  38. Amazon OpenSearch Service has been a long-standing supporter of both lexical and semantic search, facilitated by its utilization of the k-nearest neighbors (k-NN) plugin. By using OpenSearch Service as a vector database, you can seamlessly combine the advantages of both lexical and vector search. The introduction of the neural search feature in OpenSearch Service 2.9 further simplifies integration with artificial intelligence (AI) and machine learning (ML) models, facilitating the implementation of semantic search. Lexical search using TF/IDF or BM25 has been the workhorse of search systems for decades. These traditional lexical search algorithms match user queries with…

  39. Amazon CloudWatch Synthetics, an outside-in monitoring capability to continually verify your customers’ experience using snippets of code called canaries, is extending historical data for canary runs that pass or fail from 7 days to 30 days. Canary run troubleshooting artifacts such as screenshots from the canary run, HAR files, and log files for historical runs can be viewed for up to 30 days to easily pinpoint persistent versus intermittent canary run failure patterns on the CloudWatch console. View the full article

  40. You can now obtain an aggregated picture of the performance and health of your WorkSpaces instances using the Amazon CloudWatch Automatic dashboard. This enables WorkSpaces administrators to quickly start monitoring WorkSpaces metrics and identify issues and their potential causes. You can also use CloudWatch Automatic dashboard as a starting point and create your own custom dashboards to meet your monitoring needs. View the full article

  41. Organizations like yours are increasingly reliant on complex IT infrastructures to support their operations. Pervasive use of Kubernetes and microservices architectures continues to up the ante. Amidst this complexity, achieving comprehensive visibility into systems and applications has become imperative for ensuring performance, reliability, and security, while also becoming ever more challenging. End-to-end, or […] View the full article

    • 0 replies
    • 92 views
  42. Amazon CloudWatch Container Insights with Enhanced Observability for EKS now auto-discovers critical health and performance metrics from your NVIDIA GPUs and delivers them in automatic dashboards to enable faster problem isolation and troubleshooting for your AI/ML workloads. Container Insights with Enhanced Observability delivers out-of-the-box trends and patterns on your infrastructure health and removes the overhead of manual dashboard and alarm setups, saving you time and effort. View the full article

  43. Amazon CloudWatch Synthetics announces new runtime version releases: syn-nodejs-puppeteer-7.0 for NodeJS Runtime and syn-python-selenium-3.0 for Python Runtime. The NodeJS Runtime update includes dependency upgrades to puppeteer (v21.9.0) and Chromium (v121.0.6167.85). The Python Runtime update includes dependency upgrades to Chromium and Chromedriver (v121.0.6167.85). To learn more, see NodeJS release notes and Python release notes. View the full article

  44. Amazon CloudWatch announces support for streaming of daily metrics on CloudWatch Metric Streams. With Metric Streams, you can create a continuous, near real-time stream of metrics to a destination of your choice. You can use Metric Streams to send metrics to your data lake on Amazon Web Services (AWS), such as Amazon Simple Storage Service (Amazon S3), or AWS Partner solutions including Datadog, New Relic, Splunk, Dynatrace and Sumo Logic. This new capability provides additional metrics for streaming, adding daily metrics with timestamps up to two days old. View the full article

  45. We are excited to announce that Amazon OpenSearch Serverless can now scan and search up to 10 TB of time series data across one or more indexes within a collection. OpenSearch Serverless is a serverless deployment option for Amazon OpenSearch Service that makes it simple for you to run search and analytics workloads without having to think about infrastructure management. With support for much larger datasets than before, you can unlock valuable operational insights and make data-driven decisions to troubleshoot application downtime, improve system performance, or identify fraudulent activities. View the full article

  46. We are excited to announce that Amazon OpenSearch Serverless is enhancing access controls for VPC endpoints. With this feature, administrators can attach endpoint policies to control which AWS principals are allowed or denied access to the OpenSearch resources through their VPC endpoint(s). With a VPC endpoint policy, users can also combine actions with AWS principals and resources to gain finer control over allowing or denying traffic through their VPC endpoint(s). View the full article

  47. The .NET platform is taking cloud native deployment and observability seriously, most notably with the .NET Aspire stack unveiled at the recent .NET Conf 2023. In the latest episode of OpenObservability Talks, we reviewed the journey to making .NET a “by default, out of the box observable platform,” as ASP.NET […] View the full article

    • 0 replies
    • 95 views
  48. There’s no debate: in our increasingly AI-driven, lean and data-heavy world, automating key tasks to increase effectiveness and efficiency is the ultimate name of the game. No matter what job you hold today, you’re likely being pushed to not only do more with less, but also perform your work with a tighter focus on […] View the full article

    • 0 replies
    • 98 views
  49. Amazon CloudWatch Logs now lets customers use Internet Protocol version 6 (IPv6) addresses for their new and existing domains. Customers moving to IPv6 can simplify their network stack by running their CloudWatch log groups on a dual-stack network that supports both IPv4 and IPv6. View the full article

  50. Amazon CloudWatch announces a comprehensive set of enhancements to the alarm and dashboard experience. It introduces out-of-the-box, best practice alarm recommendations for 24 AWS services, streamlining your monitoring setup. You can easily view all metrics with recommended alarms using a convenient toggle. Creating alarms is simpler with pre-filled configuration in the alarm wizard or bulk downloading infrastructure-as-code templates for the recommended alarms. View the full article