Troubleshooting a large, complex, distributed enterprise application involves challenges like tracing requests across multiple services, identifying performance bottlenecks across the stack, and understanding cascading failures between dependent services. Customers often need to work with isolated data to identify the underlying cause of a problem. By correlating different signals such as logs, traces, metrics, and other performance indicators, you can gain valuable insight into what caused a problem, where, and why.

Amazon OpenSearch Service is a managed service to deploy, operate, and search data at scale within AWS. Amazon Managed Grafana is a secure data visualization service for querying operational data from multiple sources, including OpenSearch Service. In this post, we show you how to use these services to correlate the various observability signals to improve root cause analysis and thereby reduce Mean Time to Resolution (MTTR). We also provide a reference solution that can be used at scale for proactive monitoring of enterprise applications, so you can address problems before they occur.

Solution overview

The following diagram shows the solution architecture for collecting and correlating various enterprise telemetry signals at scale.

At the core of this architecture are applications composed of microservices (represented by orange boxes) running on Amazon Elastic Kubernetes Service (Amazon EKS). These microservices are instrumented to emit telemetry data in the form of metrics, logs, and traces. This data is exported to the OpenTelemetry Collector, which serves as a central, vendor-agnostic gateway to collect the data uniformly. In this post, we use an OpenTelemetry demo application as a sample enterprise application.

Large enterprise customers typically separate their observability signal data into various stores for scalability, fault isolation, access control, and ease of operation. To aid in these functions, we recommend and use Amazon OpenSearch Ingestion for a serverless, scalable, and fully managed data pipeline. We separate log and trace data and send them to distinct OpenSearch Service domains. The solution also sends the metrics data to Amazon Managed Service for Prometheus. We use Amazon Managed Grafana as a data visualization and analytics platform to query and visualize this data. We also show how to employ correlations as a valuable tool to gain insights from these signals spread across various data stores.

The following sections outline building this architecture at scale.

Prerequisites

Complete the following prerequisite steps:

1. Provision and configure an Amazon Managed Service for Prometheus workspace to receive metrics from the OpenTelemetry Collector.
2. Create two dedicated OpenSearch Service domains (or use existing ones) to ingest logs and traces from the OpenTelemetry Collector.
3. Create an Amazon Managed Grafana workspace and configure data sources to connect to Amazon Managed Service for Prometheus and OpenSearch Service.
4. Set up an EKS cluster to deploy the applications and the OpenTelemetry Collector.
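If you prefer to script the prerequisites, the following is a minimal AWS CLI sketch for creating the Prometheus workspace, one of the OpenSearch Service domains, and the EKS cluster. The resource names, engine version, instance types, and sizes are illustrative assumptions only; adjust them for your environment, and add your own access policies, fine-grained access control, and network settings before using the domains.

# Illustrative names and sizes; not a production-ready configuration
aws amp create-workspace --alias otel-metrics --region us-east-1

# One domain each for logs and traces (repeat with --domain-name otel-traces)
aws opensearch create-domain \
  --domain-name otel-logs \
  --engine-version OpenSearch_2.13 \
  --cluster-config InstanceType=r6g.large.search,InstanceCount=3 \
  --ebs-options EBSEnabled=true,VolumeType=gp3,VolumeSize=100 \
  --region us-east-1

# EKS cluster for the demo application and the OpenTelemetry Collector
eksctl create cluster --name otel-demo-cluster --region us-east-1 --nodes 3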
Create log and trace OpenSearch Ingestion pipelines

Before setting up the ingestion pipelines, you need to create the necessary AWS Identity and Access Management (IAM) policies and roles. This process involves creating two policies for domain and OpenSearch Ingestion access, followed by creating a pipeline role that uses these policies.

Create a policy for ingestion

Complete the following steps to create an IAM policy:

1. Open the IAM console.
2. Choose Policies in the navigation pane, then choose Create policy.
3. On the JSON tab, enter the following policy into the editor (replace {accountId} with your AWS account ID):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "es:DescribeDomain",
      "Resource": "arn:aws:es:*:{accountId}:domain/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "es:ESHttpGet",
        "es:ESHttpHead",
        "es:ESHttpDelete",
        "es:ESHttpPatch",
        "es:ESHttpPost",
        "es:ESHttpPut"
      ],
      "Resource": "arn:aws:es:us-east-1:{accountId}:domain/otel-traces"
    },
    {
      "Effect": "Allow",
      "Action": [
        "es:ESHttpGet",
        "es:ESHttpHead",
        "es:ESHttpDelete",
        "es:ESHttpPatch",
        "es:ESHttpPost",
        "es:ESHttpPut"
      ],
      "Resource": "arn:aws:es:us-east-1:{accountId}:domain/otel-logs"
    }
  ]
}

4. Choose Next, choose Next again, and name your policy domain-policy.
5. Choose Create policy.
6. Create another policy with the name osis-policy using the following JSON (replace {accountId} with your AWS account ID):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "osis:Ingest",
      "Resource": "arn:aws:osis:us-east-1:{accountId}:pipeline/osi-pipeline-otellogs"
    },
    {
      "Effect": "Allow",
      "Action": "osis:Ingest",
      "Resource": "arn:aws:osis:us-east-1:{accountId}:pipeline/osi-pipeline-oteltraces"
    }
  ]
}

Create a pipeline role

Complete the following steps to create a pipeline role:

1. On the IAM console, choose Roles in the navigation pane, then choose Create role.
2. Select Custom trust policy and enter the following policy into the editor (replace {nodegroup_arn} with the ARN of the IAM role used by your EKS node group):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "eks.amazonaws.com",
          "osis-pipelines.amazonaws.com"
        ],
        "AWS": "{nodegroup_arn}"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

3. Choose Next, then search for and select the policies osis-policy and domain-policy you just created.
4. Choose Next and name the role PipelineRole.
5. Choose Create role.

Allow access for the pipeline role in OpenSearch Service domains

To enable access for the pipeline role in the OpenSearch Service domains, complete the following steps:

1. Open the OpenSearch Service console.
2. Choose your domain (either logs or traces).
3. Choose the OpenSearch Dashboards URL and sign in with your credentials.

Then, complete the following steps for each OpenSearch Service domain (the logs and traces domains):

1. In OpenSearch Dashboards, go to the Security section.
2. Choose Roles, then choose all_access.

This procedure uses the all_access role for demonstration purposes only. It grants full administrative privileges to the pipeline role, which violates the principle of least privilege and could pose security risks. For production environments, you should create a custom role with the minimal permissions required for data ingestion, limit permissions to specific indexes and operations, consider implementing index patterns and time-based access controls, and regularly audit role mappings and permissions. For detailed guidance on creating custom roles with appropriate permissions, refer to Security in Amazon OpenSearch Service.

3. Choose Mapped users, then choose Manage mapping.
4. On the Map user page, under Backend roles, add the Amazon Resource Name (ARN) of the role PipelineRole.
5. Choose Map.
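If you prefer to script the role mapping instead of using the Dashboards UI, here is a minimal sketch using the OpenSearch Security REST API. It assumes fine-grained access control with a master user in the internal user database; the credentials, domain endpoint, and account ID are placeholders, and a PUT replaces the entire mapping for the role, so include any backend roles you already have.

# Map PipelineRole to all_access on one domain (repeat for the other domain)
curl -XPUT -u '<master_user>:<master_password>' \
  "https://<logs_domain_endpoint>/_plugins/_security/api/rolesmapping/all_access" \
  -H 'Content-Type: application/json' \
  -d '{"backend_roles": ["arn:aws:iam::{accountId}:role/PipelineRole"], "hosts": [], "users": []}'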
Create a pipeline for logs

Complete the following steps to create a pipeline for logs:

1. Open the OpenSearch Service console.
2. Choose Ingestion pipelines.
3. Choose Create pipeline.
4. Define the pipeline configuration by entering the following:

version: "2"
otel-logs-pipeline:
  source:
    otel_logs_source:
      path: "/v1/logs"
  sink:
    - opensearch:
        hosts: ["{OpenSearch_domain_endpoint}"]
        aws:
          sts_role_arn: "arn:aws:iam::{accountId}:role/PipelineRole"
          region: "us-east-1"
          serverless: false
        index: "observability-otel-logs%{yyyy-MM-dd}"

Replace the placeholders with your own values: {OpenSearch_domain_endpoint} is the endpoint of your logs domain, which you can find by navigating to Managed clusters on the Amazon OpenSearch Service console and choosing the domain; {accountId} is your AWS account ID, and the sts_role_arn is the ARN of the pipeline role (PipelineRole) you created earlier.

Create a pipeline for traces

Complete the following steps to create a pipeline for traces:

1. Open the OpenSearch Service console.
2. Choose Ingestion pipelines.
3. Choose Create pipeline.
4. Define the pipeline configuration by entering the following:

version: "2"
entry-pipeline:
  source:
    otel_trace_source:
      path: "/v1/traces"
  processor:
    - trace_peer_forwarder:
  sink:
    - pipeline:
        name: "span-pipeline"
    - pipeline:
        name: "service-map-pipeline"
span-pipeline:
  source:
    pipeline:
      name: "entry-pipeline"
  processor:
    - otel_traces:
  sink:
    - opensearch:
        index_type: "trace-analytics-raw"
        hosts: ["{OpenSearch_domain_endpoint}"]
        aws:
          sts_role_arn: "arn:aws:iam::{accountId}:role/PipelineRole"
          region: "us-east-1"
service-map-pipeline:
  source:
    pipeline:
      name: "entry-pipeline"
  processor:
    - service_map:
  sink:
    - opensearch:
        index_type: "trace-analytics-service-map"
        hosts: ["{OpenSearch_domain_endpoint}"]
        aws:
          sts_role_arn: "arn:aws:iam::{accountId}:role/PipelineRole"
          region: "us-east-1"

Replace the placeholders with your own values: {OpenSearch_domain_endpoint} is the endpoint of your traces domain, which you can find by navigating to Managed clusters on the Amazon OpenSearch Service console and choosing the domain; {accountId} is your AWS account ID, which you can find by choosing your username in the top-right corner of the AWS Management Console and selecting My Account.

Install the OpenTelemetry demo application in Amazon EKS

Use the EKS cluster you set up earlier along with AWS CloudShell or another tool to complete these steps:

1. Open the AWS Management Console.
2. Choose the CloudShell icon in the top navigation bar, or go directly to the CloudShell console.
3. Wait for the shell environment to initialize; it comes preinstalled with the AWS Command Line Interface (AWS CLI) and other common tools.

Now you can complete the following steps to install the application:

1. Clone the sample repository:

git clone https://github.com/aws-samples/sample-correlation-opensearch-repository

2. Navigate to the Kubernetes deployment directory:

cd deployment_files

3. Deploy the demo application using kubectl apply:

kubectl apply -f .

4. Use a load balancer to expose the frontend service so you can reach the application web URL:

kubectl expose deployment opentelemetry-demo-frontendproxy --type=LoadBalancer --name=frontendproxy

After you have deployed the application, access the frontend application through the load balancer on port 8080. Use your browser to visit http://<LoadBalancerIP>:8080/ to open the OpenTelemetry demo application. By following these steps, you can successfully install and access the demo application on your EKS cluster.
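Before moving on, it can help to confirm that the demo workloads are healthy and to look up the load balancer address for the frontendproxy service created by the kubectl expose command above. A quick sketch, assuming the demo resources run in the otel-demo namespace referenced later in this post (adjust the namespace if your manifests differ):

# List the demo pods and wait for them to reach the Running state
kubectl get pods -n otel-demo

# Retrieve the external DNS name (or IP) assigned to the frontendproxy service
kubectl get svc frontendproxy -n otel-demo \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'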
Configure the OpenTelemetry Collector exporter for logs, traces, and metrics

The OpenTelemetry Collector is a tool that manages the receiving, processing, and exporting of telemetry data from your application to a target repository. In this step, we send logs and traces to OpenSearch Service and metrics to Amazon Managed Service for Prometheus. The OpenTelemetry Collector also works with popular data repositories like Jaeger and a variety of other open source and commercial platforms.

In this section, we configure the OpenTelemetry Collector in the EKS environment where we deployed the demo application, pointing its exporters at AWS managed services instead of the open source backends. Complete the following steps:

1. Open the otel-collector-config ConfigMap in your preferred editor:

kubectl edit configmap opentelemetry-demo-otelcol -n otel-demo

2. Update the exporters section with the following configuration (provide the appropriate Amazon Managed Service for Prometheus endpoint and the OpenSearch Ingestion URLs for logs and traces):

exporters:
  logging: {}
  otlphttp/logs:
    logs_endpoint: "<AWS_OPENSEARCH_LOG_INGESTION_URL>/v1/logs"
    auth:
      authenticator: sigv4auth
    compression: none
  otlphttp/traces:
    traces_endpoint: "<AWS_OPENSEARCH_TRACE_INGESTION_URL>/v1/traces"
    auth:
      authenticator: sigv4auth
    compression: none
  prometheusremotewrite:
    endpoint: "<AWS_MANAGED_PROMETHEUS_ENDPOINT>"
    auth:
      authenticator: sigv4auth

3. Locate the extensions section and update the IAM role ARN in the sigv4auth configuration (replace {accountId} with your AWS account ID; the role is the pipeline role you created earlier):

sigv4auth:
  assume_role:
    arn: "arn:aws:iam::{accountId}:role/PipelineRole"
    sts_region: "us-east-1"
  region: "us-east-1"
  service: "osis"

4. After updating the ConfigMap, restart the OpenTelemetry Collector deployment:

kubectl rollout restart deployment opentelemetry-demo-otelcol -n otel-demo

With these changes, the OpenTelemetry Collector sends trace data and log data to the respective OpenSearch Service domains through the OpenSearch Ingestion pipelines, and metrics data to the Amazon Managed Service for Prometheus endpoint.
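To confirm the collector picked up the new configuration and is exporting without errors, a quick check such as the following can help. It uses the same deployment name and namespace as the kubectl commands above; authentication or export failures, if any, typically appear in the collector logs.

# Wait for the restarted collector pods to become ready
kubectl rollout status deployment opentelemetry-demo-otelcol -n otel-demo

# Scan recent collector logs for export or permission errors
kubectl logs deployment/opentelemetry-demo-otelcol -n otel-demo --since=5m | grep -iE "error|denied"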
Configure Amazon Managed Grafana

Before you can visualize your logs and traces, you need to configure OpenSearch Service as a data source in your Amazon Managed Grafana workspace. This configuration is done through the Amazon Managed Grafana console.

Configure the OpenSearch Service data source

Complete the following steps to configure the OpenSearch Service data source:

1. Open the Amazon Managed Grafana console.
2. Select your workspace and choose the workspace URL to access your Grafana instance.
3. Log in to your Amazon Managed Grafana instance.
4. From the side menu, choose the configuration (gear) icon.
5. On the Configuration menu, choose Data Sources.
6. Choose Add data source.
7. On the Add data source page, select OpenSearch from the list of available data sources.
8. In the Name field, enter a descriptive name for the data source.
9. In the URL field, enter the URL of your OpenSearch Service domain (the domain endpoint), including the protocol and port number.
10. If your OpenSearch cluster is configured with authentication, provide the required credentials in the User and Password fields.
11. If you want to use a specific index pattern for the data source, specify it in the Index name field (for example, logstash-*).
12. Adjust any other settings as needed, such as the Time field name and Time interval.
13. Choose Save & Test to verify the connection to your OpenSearch cluster. If the test is successful, you should see a green notification with the message "Data source is working."
14. Choose Save to save the data source configuration.

Repeat the same steps for both the OpenSearch logs and traces domains.

Configure the Prometheus data source

Complete the following steps to configure the Prometheus data source:

1. Open the Amazon Managed Grafana console.
2. Select your workspace and choose the workspace URL to access your Grafana instance.
3. Log in to your Amazon Managed Grafana instance.
4. From the side menu, choose the configuration (gear) icon.
5. On the Configuration menu, choose Data Sources.
6. Choose Add data source.
7. On the Add data source page, select Amazon Managed Service for Prometheus from the list of available data sources.
8. In the Name field, enter a descriptive name for the data source.
9. The AWS Auth Provider and Default Region fields should be automatically populated based on your Amazon Managed Grafana workspace configuration.
10. In the Workspace field, enter the ID or alias of your Amazon Managed Service for Prometheus workspace.
11. Choose Save & Test to verify the connection to your Amazon Managed Service for Prometheus workspace. If the test is successful, you should see a green notification with the message "Data source is working."
12. Choose Save to save the data source configuration.

Create correlations in Amazon Managed Grafana

To establish connections between your logs and traces data, you need to set up data correlations in Amazon Managed Grafana. This allows you to navigate seamlessly between related logs and traces. Complete the following steps in your Amazon Managed Grafana workspace:

1. Open the Amazon Managed Grafana console.
2. Select your workspace and choose the workspace URL to access your Grafana instance.
3. In the Amazon Managed Grafana portal, on the Administration menu, choose Plugins and Data, then choose Correlations.
4. On the Set up the target for the correlation page, under Target, choose your traces data source (for example, the OpenSearch Service data source otel-traces) from the dropdown list and define the query that will run when the link is followed. You can use variables to query specific field values, for example, traceId: ${__value.raw}.
5. On the Set up the source for the correlation page, choose the logs data source from the dropdown list and enter the name of the field in the logs data source to be linked or correlated with the traces data source, for example, traceId.
6. Choose Save to complete the correlation configuration.

Repeat these steps to create a correlation from metrics in Prometheus to logs in OpenSearch Service.

Validate results

In Amazon Managed Grafana, using the Prometheus data source, locate the desired instance for correlation. The instance ID is displayed as a link. Follow the link to open the corresponding log details in a panel on the right side of the page. With the logs-to-traces correlation configured, you can access trace information directly from the logs page. Choose Traces on the log details panel to view the corresponding trace data. The following screenshot demonstrates the node graph visualization showing the correlation flow: instance metrics to logs to traces.

Clean up

Remove the infrastructure for this solution when not in use to avoid incurring unnecessary costs.
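If you created the resources with the names assumed in this post, a teardown along the following lines can help; the names are illustrative, so replace them with the identifiers of the resources you actually created, and delete the Amazon Managed Grafana workspace from the console if you no longer need it.

# Remove the demo application and the frontend load balancer
kubectl delete svc frontendproxy
kubectl delete -f .    # run from the deployment_files directory

# Remove the ingestion pipelines
aws osis delete-pipeline --pipeline-name osi-pipeline-otellogs
aws osis delete-pipeline --pipeline-name osi-pipeline-oteltraces

# Remove the OpenSearch Service domains and the Prometheus workspace
aws opensearch delete-domain --domain-name otel-logs
aws opensearch delete-domain --domain-name otel-traces
aws amp delete-workspace --workspace-id <workspace_id>

# Remove the EKS cluster if it was created only for this walkthrough
eksctl delete cluster --name otel-demo-cluster --region us-east-1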
Conclusion

In this post, we showed how to use correlation as a helpful tool to gain insight into observability data spread across various stores. Separating logs and traces into dedicated domains provides the following benefits:

- Better resource allocation and scaling based on different workload patterns
- Independent performance optimization for each data type
- Simplified cost tracking and management
- Enhanced security control with separate access policies

You can use this solution as a reference to build a scalable observability solution for your enterprise to detect, investigate, and remediate problems faster. This ability, when combined with next-generation artificial intelligence and machine learning (AI/ML), helps you not only react to problems but also predict and prevent them before they occur. You can learn more about AI/ML with AWS.

About the Authors

Balaji Mohan is a Senior Delivery Consultant specializing in application and data modernization to the cloud. His business-first approach provides seamless transitions, aligning technology with organizational goals. Using cloud-centered architectures, he delivers scalable, agile, and cost-effective solutions, driving innovation and growth.

Senthil Ramasamy is a Senior Database Consultant at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on database services, helping them with database migrations to the AWS Cloud and improving the value of their solutions when using AWS.

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.