-
Thanks to Andre Newman, Senior Reliability Specialist at Gremlin, for his assistance creating this blog post. Chaos engineering is a modern, innovative approach to verifying your application's resilience. This post shows how to apply chaos engineering concepts to HashiCorp Vault using Gremlin and Vault stress testing tools to simulate disruptive events. You'll learn how to collect performance benchmarking results and monitor key metrics, and you'll see how Vault operators can use the results of these tests to iteratively improve resilience and performance in Vault architectures. Running these tests will help you identify reliability risks with Vault before they can bring down your critical apps.

What is HashiCorp Vault? HashiCorp Vault is an identity-based secrets and encryption management system. A secret is anything that you want to tightly control access to, such as API encryption keys, passwords, and certificates. Vault has a deep and broad ecosystem with more than 100 partners and integrations, and it is used by 70% of the top 20 US banks.

Chaos engineering and Vault: Because Vault stores and handles secrets for mission-critical applications, it is a primary target for threat actors. Vault is also a foundational system that keeps your applications running: once you've migrated application secrets into Vault, if all Vault instances go down, the applications receiving secrets from Vault won't be able to run. Any compromise or unavailability of Vault could result in significant damage to an organization's operations, reputation, and finances. Organizations need to plan for and mitigate several possible types of Vault failures, including:

Code and configuration changes that affect application performance
Loss of the leader node
Loss of quorum in the Vault cluster
An unavailable primary cluster
High load on Vault clusters

To mitigate these risks, teams need a more modern approach to testing and validating Vault's resilience. This is where chaos engineering comes in.
Chaos engineering aims to improve systems by identifying hidden problems and reliability risks. This is done by injecting faults — such as high CPU usage or network latency — into systems, observing how the system responds, and then using that information to improve the system. This post illustrates the process by creating and running chaos experiments using Gremlin, a chaos engineering platform. Chaos engineering brings multiple benefits, including:

Improving system performance and resilience
Exposing blind spots using monitoring, observability, and alerts
Proactively validating the resilience of the system in the event of failure
Learning how systems handle different failures
Preparing and educating the engineering team for actual failures
Improving architecture design to handle failures

HashiCorp Vault architecture: Vault supports a multi-server mode for high availability. This mode protects against outages by running multiple Vault servers. High availability (HA) mode is automatically enabled when using a data store that supports it. When running in HA mode, Vault servers have two states: standby and active. For multiple Vault servers sharing a storage backend, only a single instance is active at any time; all other instances are hot standbys. The active server processes all requests, while a standby server redirects any requests it receives to the active server. If the active server is sealed, fails, or loses network connectivity, one of the standby Vault servers becomes the active instance. The Vault service can continue to operate provided that a quorum of available servers remains online. Read more about performance standby nodes in our documentation.

What is chaos engineering? Chaos engineering is the practice of finding reliability risks in systems by deliberately injecting faults into those systems. It helps engineers and operators proactively find shortcomings in their systems, services, and architecture before an outage hits.
With the knowledge gained from chaos testing, teams can address shortcomings, verify resilience, and create a better customer experience. For most teams, chaos engineering leads to increased availability, lower mean time to resolution (MTTR), lower mean time to detection (MTTD), fewer bugs shipped to the product, and fewer outages. Teams that run chaos engineering experiments frequently are also more likely to surpass 99.9% availability. Despite the name, the goal of injecting faults isn't to create chaos but to reduce chaos by surfacing, identifying, and fixing problems. Chaos engineering is also not random or uncontrolled testing: it's a methodical approach that involves planning and forethought. When injecting faults, plan your experiments beforehand and ensure there is a way to halt them, whether manually or by using health checks that monitor the state of systems during an experiment. Chaos engineering is not an alternative to unit tests, integration tests, or performance benchmarking; it complements them and can even run in parallel with them. For example, running chaos engineering tests and performance tests simultaneously can help find problems that occur only under load. This increases the likelihood of finding reliability issues that might surface in production or during high-traffic events.

The 5 stages of chaos engineering: A chaos engineering experiment follows five main steps:

1. Create a hypothesis
2. Define and measure your system's steady state
3. Create and run a chaos experiment
4. Observe your system's response to the experiment
5. Use your observations to improve the system

1. Create a hypothesis: A hypothesis is an educated guess about how your system will behave under certain conditions. How do you expect your system to respond to a type of failure? For example, if Vault loses the leader node in a three-node cluster, Vault should continue responding to requests, and another node should be elected as the leader.
When forming a hypothesis, start small: focus on one specific part of your system. This makes it easier to test that specific system without impacting other systems.

2. Measure your steady state: A system's steady state is its performance and behavior under normal conditions. Determine the metrics that best indicate your system's reliability and monitor those under conditions that your team considers normal. This is the baseline that you'll compare your experiment's results against. Examples of steady-state metrics include vault.core.handle_login_request and vault.core.handle_request. See our Well-Architected Framework for more key metrics.

3. Create and run a chaos experiment: This is where you define the parameters of your experiment. How will you test your hypothesis? For example, when testing a Vault application's response time, you could use a latency experiment to create a slow connection. This is also where you define abort conditions: conditions that indicate you should stop the experiment. For example, if the Vault application latency rises above the experimental threshold values, you should immediately stop the experiment so you can address those unexpected results. Note that an abort doesn't mean the experiment failed; it just means you discovered a different reliability risk than the one you were testing for. Once you have your experiment and abort conditions defined, you can build the experiment in Gremlin.

4. Observe the impact: While the experiment is running, monitor your application's key metrics. See how they compare to your steady state, and interpret what they mean for the test. For example, if running a blackhole experiment on your Vault cluster causes CPU usage to increase rapidly, you might have an overly aggressive retry or timeout configuration on API requests. Or the web app might start delivering HTTP 500 errors to users instead of user-friendly error messages. In either case, there's an undesirable outcome that you need to address.

5.
Iterate and improve: Once you've reviewed the outcomes and compared the metrics, fix the problem. Make any necessary changes to your application or system, deploy the changes, and then validate that your changes fix the problem by repeating this process. This is how you iteratively make your system more resilient, which is a better approach than trying to make sweeping, application-wide fixes all at once.

Implementation: The next section runs through four experiments to test a Vault cluster. Before you can run these experiments, you'll need the following prerequisites:

A Vault HA cluster
A Gremlin account (sign up for a free 30-day trial)
The Vault benchmarking tool
Organizational awareness (let others know you're running experiments on this cluster)
Basic monitoring

Experiment 1: Impact of losing a leader node. In the first experiment, you'll test whether Vault can continue responding to requests if a leader node becomes unavailable. If the active server is sealed, fails, or loses network connectivity, one of the standby Vault servers should become the active instance. You'll use a blackhole experiment to drop network traffic to and from the leader node and then monitor the cluster.

Hypothesis: If Vault loses the leader node in a three-node cluster, Vault should continue responding to requests, and another node should be elected leader.

Get a steady state from the monitoring tool: Our steady state is based on three metrics: the sum of all requests handled by Vault, vault.core.handle_login_request, and vault.core.handle_request. The graphs below show that the sum of requests oscillates around 20K, while handle_login_request and handle_request hover between 1 and 3.

Run the experiment: This experiment runs a blackhole for 300 seconds (5 minutes) on a leader node. Blackhole experiments block network traffic to and from a host and are great for simulating any number of network failures, including misconfigured firewalls, network hardware failures, etc.
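Before launching the blackhole, it can be useful to capture the steady-state metrics above directly from Vault, independent of your monitoring pipeline. A minimal sketch, assuming VAULT_ADDR and VAULT_TOKEN are set and Prometheus-format telemetry is enabled on the cluster:

```shell
# Pull the current request-handling counters from Vault's metrics endpoint.
# The grep pattern matches the Prometheus-renamed forms of the two metrics.
curl -s --header "X-Vault-Token: $VAULT_TOKEN" \
  "$VAULT_ADDR/v1/sys/metrics?format=prometheus" \
  | grep -E 'vault_core_handle_(login_)?request'
```

Sampling these before, during, and after the experiment gives you a baseline to compare against even if the experiment itself disrupts your monitoring agents.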
Setting the blackhole duration to 5 minutes gives us enough time to measure the impact and observe any response from Vault. Here, you can see the ongoing status of the experiment in Gremlin.

Observe: This experiment uses Datadog for metrics. The graphs below show that Vault is responding to requests with a negligible impact on throughput. This means Vault's standby node kicked in and was elected as the new leader. You can confirm this by checking the nodes in your cluster using the vault operator raft command.

Improve cluster design for resilience: Based on these results, no immediate changes are needed, but there's an opportunity to scale up this test. What happens if two nodes fail? Or all three? If this is a genuine concern for your team, try repeating this experiment and selecting additional nodes. You might also try scaling up your cluster to four nodes instead of three — how does this change your results? Keep in mind that Gremlin provides a Halt button for stopping an ongoing experiment if something unexpected happens. Remember your abort conditions, and don't be afraid to stop an experiment if those conditions are met.

Experiment 2: Impact of losing quorum. The next experiment tests whether Vault can continue responding to requests if there is no quorum, using a blackhole experiment to bring two nodes offline. In such a scenario, Vault is unable to add or remove a node or commit additional log entries, resulting in unavailability. This HashiCorp runbook documents the steps needed to bring the cluster back online, which this experiment tests.

Hypothesis: If Vault loses quorum, Vault should stop responding to requests, and following our runbook should bring the cluster back online in a reasonable amount of time.

Get a steady state from Vault: The steady state for this experiment is simple: does Vault respond to requests? We'll test this by retrieving a key.

Run the experiment: Run another blackhole experiment in Gremlin, this time targeting two nodes in the cluster.
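For Experiment 2's steady-state check of retrieving a key, a sketch might look like this (the mount and key path are hypothetical; substitute a key that exists in your cluster):

```shell
# With all three nodes healthy, a read against the KV v2 engine succeeds
# and prints the key's data.
vault kv get secret/chaos-demo/api-key
```

Run the same command once the blackhole is active: with quorum lost, the read fails with an error instead, which is exactly the behavior the hypothesis predicts.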
Observe: Now that the nodes are down, the Vault cluster has lost quorum. Without a quorum, read and write operations cannot be performed within the cluster, and retrieving the same key returns an error this time.

Recovery drill and improvements: Follow the HashiCorp runbook to recover from the loss of two of the three Vault nodes by converting the cluster into a single-node cluster. It takes a few minutes to bring the cluster online, but it works as a temporary measure. A long-term fix might be to adopt a multi-datacenter deployment where you can replicate data across multiple datacenters for performance as well as disaster recovery (DR). HashiCorp recommends using DR clusters to avoid outages and meet service-level agreements (SLAs).

Experiment 3: Testing how Vault handles latency. This next experiment tests Vault's ability to handle high-latency, low-throughput network connections. You test this by adding latency to your leader node, then observing request metrics to see how Vault's functionality is impacted.

Hypothesis: Introducing latency on your cluster's leader node shouldn't cause any application timeouts or cluster failures.

Get KPIs from the monitoring tool: This experiment uses the same Datadog metrics as the first experiment: vault.core.handle_login_request and vault.core.handle_request.

Run the experiment: This time, use Gremlin to add latency. Instead of running a single experiment, create a Scenario, which lets you run multiple experiments sequentially. Gradually increase latency from 100ms to 200ms over 4 minutes, with 5-second breaks between experiments. (This Gremlin blog post explains how a latency attack works.)

Observe: In our test, the experiment introduced some delays in response time, especially in the 95th and 99th percentiles, but all requests were successful. More importantly, the key metrics below show that the cluster remained stable.

Improve cluster design for resilience: To make the cluster even more resilient, add non-voter nodes to the cluster.
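Adding a non-voter node happens when the node joins the Raft cluster. A sketch, assuming Vault Enterprise (the -non-voter flag is an Enterprise feature) and a hypothetical leader address; run this on the new node:

```shell
# Join this node to the existing cluster as a non-voter: it replicates data
# and can serve reads, but does not count toward the Raft quorum.
vault operator raft join -non-voter https://vault-leader.example.internal:8200
```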
A non-voting node has all of Vault's data replicated but does not contribute to the quorum count. This can be used with performance standby nodes to add read scalability to a cluster in cases where a high volume of reads to servers is needed. This way, if one or two nodes have poor performance, or if a large volume of reads saturates a node, these standby nodes can kick in and maintain performance.

Experiment 4: Testing how Vault handles memory pressure. This final experiment tests Vault's ability to handle reads during high memory pressure.

Hypothesis: If you consume memory on a Vault cluster's leader node, applications should switch to reading from performance standby nodes. This should have no impact on performance.

Get metrics from the monitoring tool: For this experiment, the graphs below show telemetry metrics gathered directly from the Vault nodes; specifically, memory allocated to and used by Vault.

Run the experiment: Run a memory experiment to consume 99% of Vault's memory for 5 minutes. This pushes memory usage on the leader node to its limit and holds it there until the experiment ends (or you abort).

Observe: In this example, the leader node kept running, and while there were minor delays in response time, all requests were successful, as seen in the graph below. This means our cluster tolerates high memory usage well.

Improve cluster design for resilience: As in the previous experiment, you can use non-voter nodes and performance standby nodes to add compute capacity to your cluster if needed. These nodes add extra memory but don't contribute to the quorum count. If your cluster runs low on memory, you can add these nodes until usage drops again. Other experiments that might be beneficial include DDoS attacks, cluster failover, and others.

How to build a chaos engineering culture: Teams typically think of reliability in terms of technology and systems. In reality, reliability starts with people.
Getting application developers, site reliability engineers (SREs), incident responders, and other team members to think proactively about reliability is how you start building a culture of reliability. In a culture of reliability, each member of the organization works toward maximizing the availability of their services, processes, and people. Team members focus on improving the availability of their services, reducing the risk of outages, and responding to incidents as quickly as possible to reduce downtime. Reliability culture ultimately focuses on a single goal: providing the best possible customer experience. In practice, building a reliability culture requires several steps, including:

Introducing the concept of chaos engineering to other teams
Showing the value of chaos engineering to your team (you can use the results of these experiments as proof)
Encouraging teams to focus on reliability early in the software development lifecycle, not just at the end
Building a team culture that encourages experimentation and learning, not assigning blame for incidents
Adopting the right tools and practices to support chaos engineering
Using chaos engineering to regularly test systems and processes, automate experiments, and run organized team reliability events (often called "Game Days")

To learn more about adopting chaos engineering practices, read Gremlin's guide, How to train your engineers in chaos engineering, or this S&P Global case study.

Learn more: One of the biggest challenges in adopting a culture of reliability is maintaining the practice. Reliability can't be achieved in a single action: it has to be maintained and validated regularly, and reliability tools need to both enable and support this practice. Chaos engineering is a key component of that. Run experiments on HashiCorp Vault clusters, automate reliability testing, and keep operators aware of the reliability risks in their systems. Want to see how Indeed.com manages Vault reliability testing?
Watch our video All the 9s: Keeping Vault resilient and reliable from HashiConf 2023. If you use HashiCorp Consul, check out our tutorial and interactive lab on Consul and chaos engineering. View the full article
-
HashiCorp and Microsoft have partnered to create Terraform modules that follow Microsoft's Azure Well-Architected Framework and best practices. In a previous blog post, we demonstrated how to accelerate AI adoption on Azure with Terraform. This post covers how to use a simple three-step process to build, secure, and enable OpenAI applications on Azure with HashiCorp Terraform and Vault. The code for this demo can be found on GitHub. You can leverage the Microsoft application outlined in this post and Microsoft Azure Kubernetes Service (AKS) to integrate with OpenAI. You can also read more about how to deploy an application that uses OpenAI on AKS on the Microsoft website.

Key considerations of AI: The rise in AI workloads is driving an expansion of cloud operations. Gartner predicts that cloud infrastructure will grow 26.6% in 2024, as organizations deploying generative AI (GenAI) services look to the public cloud. To create a successful AI environment, orchestrating the seamless integration of artificial intelligence and operations demands a focus on security, efficiency, and cost control.

Security: Data integration, the bedrock of AI, not only requires the harmonious assimilation of diverse data sources but must also include a process to safeguard sensitive information. In this complex landscape, the deployment of public key infrastructure (PKI) and robust secrets management becomes indispensable, adding cryptographic resilience to data transactions and ensuring the secure handling of sensitive information. For more information on the HashiCorp Vault solution, see our use-case page on automated PKI infrastructure. Machine learning models, pivotal in anomaly detection, predictive analytics, and root-cause analysis, not only provide operational efficiency but also serve as sentinels against potential security threats.
Automation and orchestration, facilitated by tools like HashiCorp Terraform, extend beyond efficiency to become critical components in fortifying against security vulnerabilities. Scalability and performance, guided by resilient architectures and vigilant monitoring, ensure adaptability to evolving workloads without compromising on security protocols.

Efficiency and cost control: In response, platform teams are increasingly adopting infrastructure as code (IaC) to enhance efficiency and help control cloud costs. HashiCorp products underpin some of today's largest AI workloads, using infrastructure as code to help eliminate idle resources and overprovisioning, and reduce infrastructure risk.

Automation with Terraform: This post delves into specific Terraform configurations tailored for application deployment within a containerized environment. The first step looks at using IaC principles to deploy infrastructure to efficiently scale AI workloads, reduce manual intervention, and foster a more agile and collaborative AI development lifecycle on the Azure platform. The second step focuses on how to build security and compliance into an AI workflow. The final step shows how to manage application deployment on the newly created resources.

Prerequisites: For this demo, you can use either the Azure OpenAI service or the OpenAI service:

To use Azure OpenAI service, enable it on your Azure subscription using the Request Access to Azure OpenAI Service form.
To use OpenAI, sign up on the OpenAI website.
Step one: Build. First let's look at the Helm provider block in main.tf:

provider "helm" {
  kubernetes {
    host                   = azurerm_kubernetes_cluster.tf-ai-demo.kube_config.0.host
    username               = azurerm_kubernetes_cluster.tf-ai-demo.kube_config.0.username
    password               = azurerm_kubernetes_cluster.tf-ai-demo.kube_config.0.password
    client_certificate     = base64decode(azurerm_kubernetes_cluster.tf-ai-demo.kube_config.0.client_certificate)
    client_key             = base64decode(azurerm_kubernetes_cluster.tf-ai-demo.kube_config.0.client_key)
    cluster_ca_certificate = base64decode(azurerm_kubernetes_cluster.tf-ai-demo.kube_config.0.cluster_ca_certificate)
  }
}

This code uses information from the AKS resource to populate the details in the Helm provider, letting you deploy resources into AKS pods using native Helm charts. With this Helm chart method, you deploy multiple resources using Terraform in the helm_release.tf file. This file sets up HashiCorp Vault, cert-manager, and Traefik Labs' ingress controller within the pods. The Vault configuration shows the Helm set functionality to customize the deployment:

resource "helm_release" "vault" {
  name  = "vault"
  chart = "hashicorp/vault"

  set {
    name  = "server.dev.enabled"
    value = "true"
  }
  set {
    name  = "server.dev.devRootToken"
    value = "AzureA!dem0"
  }
  set {
    name  = "ui.enabled"
    value = "true"
  }
  set {
    name  = "ui.serviceType"
    value = "LoadBalancer"
  }
  set {
    name  = "ui.serviceNodePort"
    value = "null"
  }
  set {
    name  = "ui.externalPort"
    value = "8200"
  }
}

In this demo, the Vault server is customized to be in Dev Mode, have a defined root token, and enable external access to the pod via a load balancer using a specific port. At this stage you should have created a resource group with an AKS cluster and a service bus established. The containerized environment should look like this: If you want to log in to the Vault server at this stage, use the EXTERNAL-IP load balancer address with port 8200 (like this: http://[EXTERNAL_IP]:8200/) and log in using AzureA!dem0.
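From a workstation with the Vault CLI installed, the login described above can also be done at the command line (EXTERNAL_IP stands in for whatever your load balancer reports):

```shell
# Point the CLI at the demo server and authenticate with the dev root token
# defined in the helm_release above. Single quotes keep '!' from shell
# history expansion in interactive shells.
export VAULT_ADDR="http://EXTERNAL_IP:8200"
vault login 'AzureA!dem0'
```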
Step two: Secure. Now that you have established a base infrastructure in the cloud and the microservices environment, you are ready to configure Vault resources to integrate PKI into your environment. This centers around the pki_build.tf.second file, which you need to rename to remove the .second extension so it is treated as a Terraform file. Then perform a terraform apply: because you are adding to the current infrastructure, this adds the elements that set up Vault with a root certificate and issue it within the pod. To do this, use the Vault provider and configure it to define a mount point for the PKI, a root certificate, a role cert URL, an issuer, and the policy needed to build the PKI:

resource "vault_mount" "pki" {
  path                      = "pki"
  type                      = "pki"
  description               = "This is a PKI mount for the Azure AI demo."
  default_lease_ttl_seconds = 86400
  max_lease_ttl_seconds     = 315360000
}

resource "vault_pki_secret_backend_root_cert" "root_2023" {
  backend     = vault_mount.pki.path
  type        = "internal"
  common_name = "example.com"
  ttl         = 315360000
  issuer_name = "root-2023"
}

Using the same Vault provider you can also configure Kubernetes authentication to create a role named "issuer" that binds the PKI policy with a Kubernetes service account named issuer:

resource "vault_auth_backend" "kubernetes" {
  type = "kubernetes"
}

resource "vault_kubernetes_auth_backend_config" "k8_auth_config" {
  backend         = vault_auth_backend.kubernetes.path
  kubernetes_host = azurerm_kubernetes_cluster.tf-ai-demo.kube_config.0.host
}

resource "vault_kubernetes_auth_backend_role" "k8_role" {
  backend                          = vault_auth_backend.kubernetes.path
  role_name                        = "issuer"
  bound_service_account_names      = ["issuer"]
  bound_service_account_namespaces = ["default", "cert-manager"]
  token_policies                   = ["default", "pki"]
  token_ttl                        = 60
  token_max_ttl                    = 120
}

The role connects the Kubernetes service account, issuer, which is created in the default namespace, with the PKI Vault policy. The tokens returned after authentication are valid for 60 seconds (token_ttl is expressed in seconds).
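To see what this role actually grants, here is a sketch of how a pod running under the issuer service account would authenticate against this backend. The JWT path is the standard in-pod service account projection; run it inside a pod, not from your workstation:

```shell
# Read the pod's projected service account token and exchange it for a
# Vault token bound to the "default" and "pki" policies.
JWT=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
vault write auth/kubernetes/login role=issuer jwt="$JWT"
```

In the demo this exchange is performed by cert-manager rather than by hand, but the flow is the same.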
The Kubernetes service account named issuer is created using the Kubernetes provider, discussed in step three, below. These resources configure the model to use HashiCorp Vault to manage the PKI certification process. The image below shows how HashiCorp Vault interacts with cert-manager to issue certificates to be used by the application:

Step three: Enable. The final stage requires another terraform apply, as you are again adding to the environment. You now use app_build.tf.third to build an application. To do this, rename app_build.tf.third to remove the .third extension so it is treated as a Terraform file. The code in app_build.tf uses the Kubernetes provider resource kubernetes_manifest. The manifest values are the HCL (HashiCorp Configuration Language) representation of a Kubernetes YAML manifest. (We converted an existing manifest from YAML to HCL to get the code needed for this deployment. You can do this using Terraform's built-in yamldecode() function or the HashiCorp tfk8s tool.) The code below is an example of a service manifest, converted using the tfk8s tool, that creates a service on port 80 to allow access to the store-admin app:

resource "kubernetes_manifest" "service_tls_admin" {
  manifest = {
    "apiVersion" = "v1"
    "kind"       = "Service"
    "metadata" = {
      "name"      = "tls-admin"
      "namespace" = "default"
    }
    "spec" = {
      "clusterIP" = "10.0.160.208"
      "clusterIPs" = [
        "10.0.160.208",
      ]
      "internalTrafficPolicy" = "Cluster"
      "ipFamilies" = [
        "IPv4",
      ]
      "ipFamilyPolicy" = "SingleStack"
      "ports" = [
        {
          "name"       = "tls-admin"
          "port"       = 80
          "protocol"   = "TCP"
          "targetPort" = 8081
        },
      ]
      "selector" = {
        "app" = "store-admin"
      }
      "sessionAffinity" = "None"
      "type"            = "ClusterIP"
    }
  }
}

Putting it all together: Once you've deployed all the elements and applications, you use the certificate stored in a Kubernetes secret to apply the TLS configuration to inbound HTTPS traffic.
In the example below, you associate "example-com-tls" — which includes the certificate created by Vault earlier — with the inbound IngressRoute deployment using the Terraform manifest:

resource "kubernetes_manifest" "ingressroute_admin_ing" {
  manifest = {
    "apiVersion" = "traefik.containo.us/v1alpha1"
    "kind"       = "IngressRoute"
    "metadata" = {
      "name"      = "admin-ing"
      "namespace" = "default"
    }
    "spec" = {
      "entryPoints" = [
        "websecure",
      ]
      "routes" = [
        {
          "kind"  = "Rule"
          "match" = "Host(`admin.example.com`)"
          "services" = [
            {
              "name" = "tls-admin"
              "port" = 80
            },
          ]
        },
      ]
      "tls" = {
        "secretName" = "example-com-tls"
      }
    }
  }
}

To test access to the OpenAI store-admin site, you need a domain name. You use an FQDN to access the site that you are going to protect using the generated certificate and HTTPS. To set this up, access your AKS cluster. The Kubernetes command-line client, kubectl, is already installed in your Azure Cloud Shell. Enter:

kubectl get svc

You should get the following output:

NAME                       TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)                      AGE
hello                      LoadBalancer   10.0.23.77     20.53.189.251   443:31506/TCP                94s
kubernetes                 ClusterIP      10.0.0.1                       443/TCP                      29h
makeline-service           ClusterIP      10.0.40.79                     3001/TCP                     4h45m
mongodb                    ClusterIP      10.0.52.32                     27017/TCP                    4h45m
order-service              ClusterIP      10.0.130.203                   3000/TCP                     4h45m
product-service            ClusterIP      10.0.59.127                    3002/TCP                     4h45m
rabbitmq                   ClusterIP      10.0.122.75                    5672/TCP,15672/TCP           4h45m
store-admin                LoadBalancer   10.0.131.76    20.28.162.45    80:30683/TCP                 4h45m
store-front                LoadBalancer   10.0.214.72    20.28.162.47    80:32462/TCP                 4h45m
traefik                    LoadBalancer   10.0.176.139   20.92.218.96    80:32240/TCP,443:32703/TCP   29h
vault                      ClusterIP      10.0.69.111                    8200/TCP,8201/TCP            29h
vault-agent-injector-svc   ClusterIP      10.0.31.52                     443/TCP                      29h
vault-internal             ClusterIP      None                           8200/TCP,8201/TCP            29h
vault-ui                   LoadBalancer   10.0.110.159   20.92.217.182   8200:32186/TCP               29h

Look for the traefik entry and note its EXTERNAL-IP (yours will be different from the one shown above).
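With the Traefik EXTERNAL-IP noted, the next step maps it to the demo hostname in your local hosts file. A one-line sketch (the address shown is from the example output above; use your own):

```shell
# Append a hosts entry so admin.example.com resolves to the Traefik
# load balancer. Requires sudo; remove the entry again when finished.
echo "20.92.218.96 admin.example.com" | sudo tee -a /etc/hosts
```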
Then, on your local machine, create a localhost entry for admin.example.com that resolves to that address. For example, on macOS you can use sudo nano /etc/hosts. (If you need more help, search "create localhost entry" for your operating system.) Now you can enter https://admin.example.com in your browser and examine the certificate. This certificate is built from a root certificate authority (CA) held in Vault (example.com) and is valid against this issuer (admin.example.com) to allow for secure access over HTTPS. To verify the right certificate is being issued, expand the certificate details in your browser and view the cert name and serial number. You can then check these in Vault and see whether the common name and serial numbers match. Terraform has configured all of the elements using the three-step approach shown in this post. To test the OpenAI application, follow Microsoft's instructions. Skip to Step 4 and use https://admin.example.com to access the store-admin, and the original store-front load balancer address to access the store-front.

DevOps for AI app development: To learn more and keep up with the latest trends in DevOps for AI app development, check out this Microsoft Reactor session with HashiCorp Co-Founder and CTO Armon Dadgar: Using DevOps and copilot to simplify and accelerate development of AI apps. It covers how developers can use GitHub Copilot with Terraform to create code modules for faster app development. You can get started by signing up for a free Terraform Cloud account. View the full article
-
HCP Vault Radar begins limited beta
Hashicorp posted a topic in Infrastructure-as-Code
At HashiConf last October, we announced HCP Vault Radar’s alpha program. Today, we’re pleased to announce that HCP Vault Radar is entering a limited beta phase. HCP Vault Radar is our new secret scanning product that expands upon Vault’s secrets lifecycle management use cases to include the discovery of unmanaged or leaked secrets. The beta release also debuts new functionality to support role and attribute-based access controls (RBACs/ABACs), as well as new data sources available to scan. HCP Vault Radar (beta) HCP Vault Radar detects unmanaged and leaked secrets so that DevOps or Security teams can take appropriate actions to remediate exposed secrets. Radar scans for secrets, personally identifiable information (PII) or data, and non-inclusive language. It then categorizes and ranks the exposed data discovered by level of risk. Vault Radar evaluates risk according to a range of factors, including: Was the secret found on the latest version of the code/document? Is the secret identified? Is the secret currently active? HCP Vault Radar supports secret scanning from a command line interface (CLI), and is also integrated into the HCP portal for a better user experience that can help prioritize any unmanaged secrets discovered. With the recently added support for scanning Terraform Cloud and Terraform Enterprise, beta Radar customers will be able to scan the following data sources: Git-based version control systems (GitHub, GitLab, BitBucket, etc.) AWS Parameter Store Server file directory structures Confluence HashiCorp Vault Amazon S3 Terraform Cloud (new) Terraform Enterprise (new) JIRA Docker images HashiCorp Vault integration HCP Vault Radar also integrates with Vault to scan supported data sources for the presence of leaked secrets currently in Vault that are actively being used. 
Using additional metadata from the scan and cross-referencing the secrets in Vault Enterprise and Vault Community, Vault Radar gives the secrets it discovers an enhanced risk rating to prioritize which ones may need immediate attention.

Attribute-based and role-based access controls

The limited beta release of HCP Vault Radar also includes RBAC and ABAC capabilities. The primary difference between RBAC and ABAC is how access is granted. RBAC in Vault Radar lets you grant access by role, while ABAC lets an organization define highly granular controls and govern access by user and object characteristics, action types, and more. RBAC roles generally refer to groups of people with common characteristics, such as:

Departments or business units
Security level
Geography
Responsibilities

RBAC and ABAC in HCP Vault Radar can help:

Create a repeatable process for assigning permissions
Audit privileges and make necessary changes
Add or change roles
Reduce the potential for human error when assigning permissions
Comply with regulatory or statutory requirements

Getting started

HCP Vault Radar is an exciting new addition to Vault’s secrets lifecycle management functionality. Vault Radar facilitates automated scanning and ongoing detection of unmanaged secrets in various code repositories and other data sources. This critical functionality further differentiates HashiCorp Vault’s secrets management offering by allowing organizations to take a proactive approach to remediation before a data breach occurs. Please review Vault Radar’s product documentation to learn more.

HCP Vault Radar is currently in a private beta program. To learn more or to be considered for the beta program, sign up to receive HCP Vault Radar updates.

View the full article
-
What is HashiCorp’s Vault?
Why is Vault necessary?
What is a secret in the context of Vault?
Differentiate between static and dynamic secrets.
What is the seal/unseal process in Vault?
What are policies in Vault?
How does Vault store its data?
What is the significance of the Vault token?
Explain the difference between authentication and authorization in the context of Vault.
What is the Transit secrets engine in Vault?
How does Vault handle high availability?
What are namespaces in Vault?

The post List of interview questions along with answer for hashicorp vault appeared first on DevOpsSchool.com.

View the full article
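To illustrate the policies question above: a Vault policy is an HCL document that grants capabilities on paths. A minimal sketch, where the KV v2 path secret/data/app/* is an assumed example, not from the question list:

```hcl
# Grants read and list on KV v2 secrets under app/ -- the path is illustrative
path "secret/data/app/*" {
  capabilities = ["read", "list"]
}
```

A token attached to this policy can read secrets under that prefix but cannot write them or touch any other path, which is the least-privilege model the authentication/authorization question is probing for.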
-
Enterprises leverage public key infrastructure (PKI) to encrypt, decrypt, and authenticate information between servers, digital identities, connected devices, and application services. PKI is used to establish secure communications, mitigating the risk of data theft and protecting proprietary information as organizations increase their reliance on the internet for critical operations. This post explores how public key cryptography enables related keys to encrypt and decrypt information and guarantee the integrity of data transfers... View the full article
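The core PKI workflow the post describes (a root CA issuing and validating server certificates) can be sketched with openssl; the file names and subjects below are illustrative:

```shell
# 1. Root CA: generate a private key and a self-signed certificate
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout ca.key -out ca.crt -subj "/CN=example.com" -days 365

# 2. Server: generate a private key and a certificate signing request (CSR)
openssl req -newkey rsa:2048 -nodes \
  -keyout server.key -out server.csr -subj "/CN=admin.example.com"

# 3. The CA signs the CSR, producing a server certificate chained to the root
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key \
  -CAcreateserial -out server.crt -days 90

# 4. Anyone holding ca.crt can now verify the server certificate
openssl verify -CAfile ca.crt server.crt
```

This is the same trust relationship Vault's PKI secrets engine automates: the root key signs, and clients that trust the root can verify any certificate it issued.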
-
HashiCorp Waypoint is an application release orchestrator that enables developers to deploy, manage, and observe their applications on any infrastructure platform, including HashiCorp Nomad, Kubernetes, or Amazon Elastic Container Service (Amazon ECS), with a few short commands. At HashiConf Global last month, many developers discussed their use of multiple services from HashiCorp and were interested in connecting Waypoint to the rest of the HashiCorp stack. In response to this feedback, we want to highlight how to use Waypoint plugins for HashiCorp Terraform Cloud, Consul, and Vault to automate and simplify the application release process. By connecting to our other products (as well as many third-party tools), Waypoint streamlines an engineer’s workflow and enables teams to work faster. The plugins are available for HCP Waypoint (beta) and Waypoint open source.

»Waypoint Plugins Help Save Time

Typically, the infrastructure team needs to explicitly communicate configuration values via GitHub (or some other method) to the application team as part of the release process, and the application team copies them into its code. With the respective plugins, application developers can use a fixed piece of configuration to grab specific parameters set by the infrastructure team. This removes the back-and-forth between teams that can be a point of failure or miscommunication during the CI/CD process. For example, infrastructure engineers can create an Amazon ECS cluster using Terraform Cloud, and app teams can deploy into that cluster without needing to copy-paste cluster names.
For a closer look at how to pull information into Waypoint, check out these code examples:

Waypoint’s Terraform Cloud Config Sourcer Variable on GitHub
Waypoint Node.js example on GitHub

HashiCorp plugins: Terraform Cloud, Consul, Vault

»Automate Your Application Delivery Workflow with Waypoint

Modern organizations often deploy applications to multiple cloud providers, which dramatically increases the complexity of releases. Multi-cloud or multi-platform environments force application developers to become familiar with multiple platforms and the frequent, unexpected changes to them. When managed the traditional way, via scripts in a continuous integration process, the pipeline is brittle. Application developers find themselves relying heavily on infrastructure teams for routine tasks like checking application health, deploying a specific version, or getting logs. The goal of Waypoint is to remove this dependency by automating how application developers build, deploy, and release software to a wide variety of platforms. The Waypoint plugins for Terraform, Vault, and Consul further this aim by pulling in configuration details without relying so heavily on the infrastructure team.

»Try HCP Waypoint with Your Team

No other application release platform offers these deep connections to the HashiCorp ecosystem tools and helps teams work faster and smarter. Just as important, Waypoint is a highly extensible platform that allows users to build their own plugins or use other plugins created by HashiCorp and our community. Over time we anticipate the number of Waypoint plugins will continue to grow. Try HCP Waypoint for free to get started.

View the full article
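As a rough sketch of the pattern described above, a waypoint.hcl config stanza can pull a Terraform Cloud workspace output with the terraform-cloud config sourcer. The organization, workspace, and output names here are assumptions for illustration; check them against the plugin documentation:

```hcl
# waypoint.hcl -- sketch; organization/workspace/output names are assumptions
config {
  env = {
    # Pull the ECS cluster name from a Terraform Cloud workspace output,
    # instead of copy-pasting it from the infrastructure team
    ECS_CLUSTER = dynamic("terraform-cloud", {
      organization = "my-org"
      workspace    = "ecs-cluster"
      output       = "cluster_name"
    })
  }
}
```

The app team's deployment then reads ECS_CLUSTER from its environment, and the value stays current whenever the infrastructure team re-applies their Terraform workspace.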
-
HashiCorp Consul provides a variety of capabilities, including service mesh, security policies through intentions, and a key-value store. For organizations with multiple teams, how do you empower those teams to use Consul securely? In this post, we generate dynamic Consul tokens with HashiCorp Vault and define access control lists (ACLs) as code using the Consul provider for HashiCorp Terraform.

When multiple teams use Consul, it becomes difficult to correlate manually managed policies in Consul with the identity accessing it. For example, you might temporarily add additional access for a developer to edit services, only to forget that someone has write access. A bad actor might be able to obtain the ACL token and use it to access data or change services. The possibility of an insecure ACL policy or long-lived Consul ACL token increases as more users, nodes, and services require access.

To ensure least-privilege access to Consul, you can use HashiCorp Terraform to define and test Consul ACLs and enable auditing of policy rules. Then, you can configure the Consul secrets engine for HashiCorp Vault to dynamically generate the API tokens associated with the Consul ACL policy, reducing the lifetime of each token and further securing Consul.

Example Scenario

In this scenario, let’s consider multiple teams. Imagine you are an operator administering HashiCorp Consul, which development teams use for service mesh and discovery. You enable ACLs on your production Consul cluster and obtain a Consul token with an acl = "write" policy to start managing its resources. Besides Consul, your organization uses HashiCorp Vault to manage its secrets, Terraform to manage infrastructure, and Terraform Cloud to store state and test the configuration. As an operator, you know that different teams need to be able to access Consul, but not all of them need access to everything.
A new app team needs to view keys to debug consul-template and their connection to other applications, thus requiring read access to any keys related to the app team and to all Consul intentions (to debug network policy). When the app team onboards, they should be able to use your organization’s Vault instance to retrieve a token associated with their team. Let’s see how we can accomplish this by applying Vault and Terraform to the configuration of Consul ACLs.

Set up Vault Access to Consul with Terraform

You will need a Consul token to allow Terraform enough access to configure Consul ACLs. The policy associated with the token must have at least an acl = "write" rule. First, define your Consul address as part of the Terraform provider configuration. The example uses a local Consul instance and scopes the provider to Consul datacenter dc1.

```hcl
# provider.tf
provider "consul" {
  address    = "127.0.0.1:8500"
  datacenter = "dc1"
}
```

Set the CONSUL_HTTP_TOKEN environment variable to the Consul ACL token. The Consul provider for Terraform will use the management token specified in the variable.

```shell
$ export CONSUL_HTTP_TOKEN=aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
```

Next, you need to create a Consul policy and token for Vault. The Consul secrets engine for Vault specifically requires a management token. We attach the token to the global-management policy, which provides unlimited access across Consul. For larger Consul deployments, the management token should be scoped to the datacenter.

```hcl
# consul-policy-vault.tf
data "consul_acl_policy" "management" {
  name = "global-management"
}

resource "consul_acl_token" "vault" {
  description = "ACL token for Consul secrets engine in Vault"
  policies    = [data.consul_acl_policy.management.name]
  local       = true
}
```

To create the policy, plan and apply the Terraform configuration.

```shell
$ terraform init
$ terraform plan
...
Plan: 1 to add, 0 to change, 0 to destroy.

$ terraform apply
consul_acl_token.vault: Creating...
consul_acl_token.vault: Creation complete after 0s

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
```

If you examine the ACL tokens in the Consul UI, you will find that Terraform added the Consul ACL token for Vault.

Configure Consul Secrets Engine for Vault

After creating the Consul ACL token for Vault, use the Vault provider for Terraform to configure HashiCorp Vault with the Consul secrets engine. By enabling the Consul secrets engine, you allow Vault to issue dynamic ACL tokens and attach them to a policy. First, add the Vault provider to provider.tf with the address of the Vault instance. The example uses a local instance of Vault.

```hcl
# provider.tf
provider "consul" {
  address    = "127.0.0.1:8500"
  datacenter = "dc1"
}

provider "vault" {
  address = "http://127.0.0.1:8200"
}
```

Set the VAULT_TOKEN environment variable to a Vault token with least-privilege access to manage secrets engines (create, read, update, delete, list, and sudo on the sys/mounts/* Vault API). The Vault provider for Terraform will use the token specified in the variable.

```shell
$ export VAULT_TOKEN=s.bbbbbbbbbbbbbbbbbbbbbb
```

Next, configure the Consul secrets engine in Vault. Use the consul_acl_token_secret_id Terraform data source to retrieve the secret of the Consul ACL token for Vault. While you can issue a management token for the Consul secrets engine manually, creating it with Terraform allows you to manage and revoke it more dynamically than through the CLI. When using this data source, the Consul token will be reflected in Terraform state. For security reasons, you should treat Terraform state as sensitive data by encrypting it or storing it in Terraform Cloud. If you do not encrypt state, you should encrypt the Consul token with your own PGP or Keybase public key. The example does not encrypt the token because it uses Terraform Cloud to store and encrypt state.
```hcl
# vault.tf
data "consul_acl_token_secret_id" "vault" {
  accessor_id = consul_acl_token.vault.id
}

resource "vault_consul_secret_backend" "consul" {
  path                      = "consul"
  description               = "Manages the Consul backend"
  address                   = "consul:8500"
  token                     = data.consul_acl_token_secret_id.vault.secret_id
  default_lease_ttl_seconds = 3600
  max_lease_ttl_seconds     = 3600
}
```

Pass the token to the vault_consul_secret_backend resource, which specifies the Consul address for Vault to reference and the lease time for Consul tokens. The example sets the default lease to one hour, which means the Consul tokens expire one hour after issuance. Apply the configuration to create the secrets engine.

```shell
$ terraform plan
...
Plan: 1 to add, 0 to change, 0 to destroy.

$ terraform apply
...
vault_consul_secret_backend.consul: Creating...
vault_consul_secret_backend.consul: Creation complete after 0s [id=consul]

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
```

You can tell that the Consul secrets engine has been enabled by examining the list of secrets engines and the roles and configuration at the consul/ path in Vault.

```shell
$ vault secrets list
Path       Type      Accessor           Description
----       ----      --------           -----------
consul/    consul    consul_0605a7f4    Manages the Consul backend

$ vault read consul/config/access
Key        Value
---        -----
address    consul:8500
scheme     http
```

You may require additional automation if you use the Consul secrets engine to issue short-lived tokens for Consul agents or service registration. To rotate tokens for Consul agents, you will need to update the token with the Consul agent API. Similarly, you need to re-register services if you rotate the token associated with a service. In this example, we scope the use of the Consul secrets engine to rotating tokens for reading intentions and keys.
Define the App Team’s Consul ACL Policies

After creating the Consul management token and the configuration for the Consul secrets engine, you can now define the app team’s Consul policies and roles with Terraform and request a dynamic Consul ACL token from Vault. At a minimum, the app team needs to read Consul intentions and keys. You do not want to grant them write access to change any resources at this time.

First, define two policies intended for app team members to view Consul information. The user should be able to view Consul intentions and read from application-related keys. Use the consul_acl_policy resource to create both policies. For fine-grained access control of intentions based on service, you must include its service destination value. Review the Consul documentation for additional information about intention management permissions.

```hcl
# consul-policy-appteam.tf
resource "consul_acl_policy" "intentions_read" {
  name  = "intentions-read"
  rules = <<-RULE
    service_prefix "" {
      policy = "read"
    }
  RULE
}

resource "consul_acl_policy" "app_key_read" {
  name  = "key-read"
  rules = <<-RULE
    key_prefix "app" {
      policy = "list"
    }
  RULE
}

resource "vault_consul_secret_backend_role" "app_team" {
  name    = "app-team"
  backend = vault_consul_secret_backend.consul.path
  policies = [
    consul_acl_policy.intentions_read.name,
    consul_acl_policy.app_key_read.name,
  ]
}
```

If your organization has Terraform Cloud or Enterprise, you can use HashiCorp Sentinel to check that new policies do not grant write access to Consul. This creates a policy check for Consul policies defined as code, enabling teams to self-service their access to Consul within certain boundaries. The following Sentinel policy only allows teams to update Consul policies with read access.
```sentinel
import "tfplan/v2" as tfplan

resources = values(tfplan.planned_values.resources)

consul_acl_policies = filter resources as _, v {
  v.type is "consul_acl_policy"
}

consul_acl_policies_do_not_have_write_rule = rule {
  all consul_acl_policies as consul_acl_policy {
    consul_acl_policy.values.rules not contains "write"
  }
}

main = rule {
  consul_acl_policies_do_not_have_write_rule
}
```

Each policy will be associated with the app team role created by Vault. Use the vault_consul_secret_backend_role resource to associate both policies with a role labeled app-team. Any token with this role will be able to access Consul based on its associated policies. Apply the configuration with Terraform.

```shell
$ terraform plan
...
Plan: 3 to add, 0 to change, 0 to destroy.

$ terraform apply
...
Apply complete! Resources: 3 added, 0 changed, 0 destroyed.
```

Configure App Team Authentication to Vault

To allow any app team developer to authenticate to Vault, you configure the GitHub authentication method. An authentication method allows you to specify an identity provider, like GitHub, to handle application team access to Vault. When you have many teams retrieving various secrets from Vault, an authentication method alleviates the burden of additional configuration.

Add a Terraform configuration that includes the vault_github_auth_backend and vault_github_team resources with a Vault policy limited to read access on the consul/creds/app-team endpoint.

```hcl
# vault-appteam.tf
resource "vault_github_auth_backend" "org" {
  organization = "example"
}

resource "vault_policy" "app_team" {
  name   = "app-team"
  policy = <<-EOT
    path "consul/creds/app-team" {
      capabilities = ["read"]
    }
  EOT
}

resource "vault_github_team" "app_team" {
  backend  = vault_github_auth_backend.org.id
  team     = "app-team"
  policies = [vault_policy.app_team.name]
}
```

Apply the configuration.

```shell
$ terraform init
$ terraform plan
...
Plan: 3 to add, 0 to change, 0 to destroy.

$ terraform apply
...
Apply complete! Resources: 3 added, 0 changed, 0 destroyed.
```

Any user in the app team within the example organization on GitHub can now log into Vault with their personal access token.
With a read-only Vault policy on their team endpoint to the Consul secrets engine, they will not be able to access other API endpoints in Vault.

Issue Dynamic Consul ACL Tokens with Vault

An app team user can request a Consul token from Vault by using the consul/creds/app-team endpoint. When you request credentials, Vault generates a new Consul token with a default lease defined by the secrets engine configuration. For example, the token associated with the app-team role must be renewed after one hour.

First, the app team logs into Vault with their GitHub token.

```shell
$ vault login -method=github token=${GITHUB_TOKEN}
Key                    Value
---                    -----
token                  s.xxxxxxxxxxxxxxxxxxxxxx
token_accessor         iUthJU7SjQQZRAVP0THtvQOu
token_duration         768h
token_renewable        true
token_policies         ["app-team" "default"]
identity_policies      []
policies               ["app-team" "default"]
token_meta_org         example
token_meta_username    some-github-user
```

Then, they can use their Vault token to request a Consul ACL token.

```shell
$ vault read consul/creds/app-team
Key                Value
---                -----
lease_id           consul/creds/app-team/12tQWGMPqmEHAso1ami8D6Pg
lease_duration     1h
lease_renewable    true
accessor           21cd7044-c75a-8801-8c29-9d95959e1e7c
local              false
token              cbbff3b8-c6ad-3db3-dab5-6fdf94df5f97
```

When the app team uses the token, they can only read intentions and keys. They cannot make updates to intentions or keys or retrieve other information like ACL policies from Consul.

```shell
$ consul intention get web db
Source:         web
Destination:    db
Action:         allow

$ consul kv get -recurse app/
app/hi/there:
app/toggles:hello

$ consul intention create web db
Error creating intention "web => db (allow)": Unexpected response code: 403 (Permission denied)

$ consul kv put app/new
Error! Failed writing data: Unexpected response code: 403 (Permission denied)

$ consul acl policy list
Failed to retrieve the policy list: Unexpected response code: 403 (Permission denied)
```

After one hour, the app team will no longer be able to use this token to access intentions and keys.
```shell
$ consul intention get web db
Error: Unexpected response code: 403 (ACL not found)

$ consul kv get -recurse app/
Error querying Consul agent: Unexpected response code: 403
```

To regain access to Consul, the app team developer must either renew the lease with vault lease renew consul/creds/app-team/<lease id> or generate a new token with Vault. If a team does not want to manually renew tokens used by service accounts or automation, they can configure Vault Agent to automatically authenticate and renew the tokens.

Conclusion

Using the Consul and Vault providers for Terraform, you created a management token to enable Vault to issue Consul ACL tokens using the Consul secrets engine. After enabling the Consul secrets engine, you used the Consul provider for Terraform to create policies and attach them to roles in Vault. When they need to debug intentions and test their configuration of consul-template, the app team can log into Vault using their GitHub credentials and request Consul ACL tokens from Vault. Vault will handle the renewal and revocation of the token.

By configuring Vault and Consul with Terraform, you can scale and collaborate on Consul ACL policies to secure the cluster. Changes and updates to the policies are reflected in version control, using infrastructure as code practices to maintain security. The addition of the Consul secrets engine generates ACL tokens on demand and handles the lifetime of the secret.

Learn more about Consul, Vault, and Terraform with the HashiCorp Learn guides. For detailed Consul security recommendations, refer to the Consul Security Model and the complete ACL Guide. Additional documentation on using Terraform to configure Consul and Vault can be found in the Consul provider and Vault provider docs. For resources on configuring Vault, check out the Consul secrets engine, the GitHub auth method, and the Vault policy documentation.

Any questions? I've created a community forum thread so I can respond to them. Feel free to reach out there!
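As noted above, Vault Agent can take over authentication and token renewal for automation. A minimal sketch of an agent configuration, assuming AppRole credentials on disk; all file paths and the template are illustrative assumptions:

```hcl
# vault-agent.hcl -- sketch; file paths and template are assumptions
auto_auth {
  method "approle" {
    config = {
      role_id_file_path   = "/etc/vault/role-id"
      secret_id_file_path = "/etc/vault/secret-id"
    }
  }
}

# Renders the Consul token to a file and keeps renewing the backing lease
template {
  contents    = "{{ with secret \"consul/creds/app-team\" }}{{ .Data.token }}{{ end }}"
  destination = "/etc/consul/app-team.token"
}
```

Running vault agent -config=vault-agent.hcl keeps the rendered token current, so the consuming service only ever reads a fresh Consul token from the destination file.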
View the full article
-