Gain operator confidence with HCP Consul Central observability features

HashiCorp · November 27, 2023

Recently, one of our customers’ HCP Consul clusters became non-responsive. When the customer reached out to our engineering team for help, they provided charts and data from HCP Consul Central’s new observability features. Using this data, our team quickly found the root cause and recommended a fix. This post takes a closer look at the incident and at the observability features that helped resolve it quickly.

Why observability is critical

End users expect reliable, responsive products, so modern software systems must be high-quality, resilient, and scalable. In turn, this means operators must have confidence in the availability of their services, be able to troubleshoot issues swiftly, and continuously scale delivery without compromising quality standards. It is difficult to reach operational excellence without the right tools, processes, and practices in place. A first step is having the right actionable insights.

Observability — the ability to understand the internal state of a system based on data it generates — is crucial to meeting high quality standards and providing operator confidence. Application metrics, traces, and logs are prime examples of data points operators can use to observe the internal state and make intelligent deductions on the health of a system. The more observable a system is, the faster operators can notice an issue, pinpoint the root cause, and fix it. Beyond reactive troubleshooting, observability helps operators catch trends or abnormalities before they cause a serious problem, making it an essential preventative measure.

HCP Consul Central

Out of the box, HCP Consul Central provides a set of observability features, based on metrics, for both HashiCorp-managed and self-managed Consul infrastructure. The features are available for all users of HCP Consul Central with an Enterprise license and for community users during a 90-day free trial period.

HCP Consul Central’s observability features and workflows are opinionated — informed by extensive research on operator pain points and guidance from our experts. Simple visualizations, built with a set of essential metrics, showcase the overall health of two components: Consul servers and services on the mesh. Consul insights are derived from metrics emitted by the Consul agent on servers. For service mesh insights, a Consul telemetry collector agent must be deployed to collect metrics emitted by the Envoy proxies for services. (For more details on configuration, visit the HCP Consul Central observability guide.)
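To give a feel for the kind of data a Consul agent emits, here is a minimal Go sketch that decodes a payload shaped like the response of the agent's `/v1/agent/metrics` HTTP endpoint and looks up a single gauge. It is an illustration only and does not describe how HCP Consul Central ingests metrics. The sample payload and its values are made up, the `findGauge` helper and the struct names are hypothetical, and real gauge names may carry a host prefix depending on the agent's telemetry configuration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// gauge mirrors one entry of the "Gauges" array in the agent metrics JSON.
type gauge struct {
	Name   string            `json:"Name"`
	Value  float64           `json:"Value"`
	Labels map[string]string `json:"Labels"`
}

// agentMetrics holds only the fields this sketch needs; the real payload
// also contains "Points", "Counters", and "Samples".
type agentMetrics struct {
	Timestamp string  `json:"Timestamp"`
	Gauges    []gauge `json:"Gauges"`
}

// sampleMetrics is invented data for illustration, not output from a real cluster.
const sampleMetrics = `{
  "Timestamp": "2023-11-27 10:00:00 +0000 UTC",
  "Gauges": [
    {"Name": "consul.runtime.alloc_bytes", "Value": 4076336, "Labels": {}},
    {"Name": "consul.runtime.num_goroutines", "Value": 123, "Labels": {}}
  ]
}`

// findGauge (hypothetical helper) returns the value of the named gauge and
// whether it was present in the payload.
func findGauge(raw []byte, name string) (float64, bool, error) {
	var m agentMetrics
	if err := json.Unmarshal(raw, &m); err != nil {
		return 0, false, err
	}
	for _, g := range m.Gauges {
		if g.Name == name {
			return g.Value, true, nil
		}
	}
	return 0, false, nil
}

func main() {
	v, ok, err := findGauge([]byte(sampleMetrics), "consul.runtime.num_goroutines")
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("found:", ok, "value:", v)
}
```

Against a live agent, the same decoding could be applied to the body of an HTTP GET on the metrics endpoint, subject to the agent's ACL and telemetry settings.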

How observability made a difference for one customer

Observability is essential for operators, as illustrated by this recent customer incident. The benefits showcased below are a glimpse of what can be enabled by HCP Consul Central observability, and more is yet to come.

As noted above, while testing configuration changes on a development-tier HCP Consul cluster, a HashiCorp customer recently encountered an issue in which their cluster became non-responsive for a period of time. The team filed a support ticket describing symptoms they were seeing, including failed Consul UI requests.

With the help of HCP Consul Central observability, the customer investigated the issue a level deeper on their own. They reported that around the time of the failed requests, they could see a CPU spike on their HCP Consul Central observability dashboard. Due to the issues they were seeing, the team held off on applying configuration changes in production to prevent any end-user impact.

As an operator of complex infrastructure, when you run into an issue and request help from your vendor, you can expect multiple questions to help everyone understand what might have gone wrong. During this back-and-forth discussion, the customer team provided our engineering team with a detailed description of the issues they were seeing, alongside screenshots from HCP Consul Central observability dashboards.

Together with the customer’s timeline of events, the graphs told the story — a period of normal operation, followed by a CPU spike, a gap in the metrics, and finally recovery to normal operation. This gave our engineers a clear picture of the situation, enabling us to focus our debugging efforts and quickly discover the issue. In the end, HashiCorp’s engineering team reproduced the issue and determined that the root cause was a leaked goroutine that was consuming excessive resources. The customer’s cluster had stabilized when the Consul process restarted and the leaked goroutines were no longer present.

With the root cause detected, HashiCorp’s engineering team provided a path forward for the customer. When deploying configuration changes, the customer team was able to monitor changes using HCP Consul Central’s observability features themselves, and they proceeded to confidently make changes in their production environment.

This incident demonstrates some key benefits of observability. Operators noticed their Consul infrastructure health deteriorating and held off on the production rollout to avoid potential customer impact. The customer’s troubleshooting time was cut significantly because of the rich data they were able to send to our engineers. By the end of the incident, the customer’s operators had greater confidence in their ability to troubleshoot future issues with HCP Consul Central’s observability features.

Scaling systems always involves an element of risk. But with proper observability, you can feel confident in your ability to recover and push the boundaries of what you can achieve.

The vision for HCP Consul Central observability features

While we’re proud of the observability features we’ve built for HCP Consul Central, we’re even more excited about what’s coming next. We know that customers need observability tools that help them see a holistic view of their infrastructure, and we want to help by providing more data points and better insights.

In addition to new metrics, we are looking at how other sources of data — such as logs and traces — along with automatically generated insights, can help users sort through the noise of a busy system to find important signals.

Right now, we’re focused on giving operators enhanced visibility into interservice communication. With varying architectures and increasingly complex software systems, many services interact with each other, and it can be tricky to trace a problem back to its root cause. Our first goal is to shed light on these relationships through a service topology view.

Throughout this journey, we’re committed to reducing friction for Consul operators and empowering them to build with confidence. Get started now using the HCP Consul Central observability guide. To submit feedback on HCP Consul Central observability features, please use this form.
