Amazon Web Services
Posted January 13, 2023

Introduction

In Kubernetes, the data plane consists of two layers of scaling: a pod layer and a worker-node layer. Pods can be autoscaled using the Horizontal Pod Autoscaler (HPA) or the Vertical Pod Autoscaler; nodes can be autoscaled using the Cluster Autoscaler (CA) or Karpenter. If the worker nodes are running at full capacity and new pods keep being added, new worker nodes are added to the data plane so that the pending pods can be scheduled on them. However, provisioning new nodes and joining them to the cluster adds lag (approximately 1–2 minutes) to the scaling process. This lag can be minimized or eliminated by over-provisioning the worker nodes.

In this post, we show you how to over-provision worker nodes using dummy pods. Each dummy pod runs a pause container, which the kube-scheduler places according to the pod specification's placement constraints and CPU/memory requests. The pause container then waits for a termination signal, which arrives when Kubernetes needs to preempt its capacity for a higher-priority workload. The real workload's pods have high priority, whereas the dummy pods have the lowest priority. When real workload pods are created, the kube-scheduler evicts the dummy pods from the worker nodes and schedules the real workload's pods in their place. As the dummy pods go into the pending state, Karpenter adds more worker nodes to schedule them. Thus, the lag caused by worker-node start-up time can be eliminated or minimized using dummy pods.

The number of dummy pods to over-provision is a trade-off between the scaling performance you need from the worker nodes and the cost of running the dummy pods. A simple way to achieve over-provisioning is to create a deployment of pods running the pause container and set the replica count to a static value.
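As a sketch of this static approach, such a deployment might look like the following. All names, the replica count, and the resource requests here are illustrative assumptions, not part of the walkthrough below (which builds its own over-provisioning deployment step by step):

```yaml
# Illustrative static over-provisioning deployment (names and sizes are assumptions).
# Each replica reserves 1 vCPU / 500Mi via a pause container; a low-value
# PriorityClass named "low-priority" is assumed to exist, so real workloads
# can preempt these pods immediately.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: static-overprovisioning
spec:
  replicas: 3          # static head-room: tune to your cost/performance trade-off
  selector:
    matchLabels:
      app: static-overprovisioning
  template:
    metadata:
      labels:
        app: static-overprovisioning
    spec:
      priorityClassName: low-priority
      containers:
        - name: pause
          image: registry.k8s.io/pause
          resources:
            requests:
              cpu: "1"
              memory: "500Mi"
```

Each replica holds one node's worth of spare capacity in this sizing; the rest of the post replaces the static `replicas: 3` with a count that scales with the cluster.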
However, as a cluster grows (to hundreds or thousands of nodes), a static replica count for over-provisioning may not be effective. To autoscale the amount of over-provisioning in proportion to the size of the cluster, we can use tools from the Kubernetes ecosystem. One such tool is the cluster-proportional-autoscaler, which resizes the number of replicas of the over-provisioning application according to the number of worker nodes and cores (i.e., vCPUs) in the cluster. Workloads that are latency sensitive and/or see traffic spikes can benefit from this solution.

Please note that this solution applies to Amazon Elastic Kubernetes Service (Amazon EKS) on Amazon Elastic Compute Cloud (Amazon EC2), but not to Amazon EKS on AWS Fargate, because AWS Fargate scales the underlying worker nodes itself and they aren't visible to the customer.

Time to read: 15 minutes
Time to complete: 60 minutes
Cost to complete (estimated): approx. $40 (at publication time) for region us-west-2
Learning level: Expert (400)
Services used: Amazon EKS, Amazon EC2, Amazon EBS

Solution overview

Initially, two services run on Amazon EKS: a high-priority Nginx service and a low-priority dummy service. The solution starts with one replica of each (the Nginx-1 and Dummy-1 pods, respectively), each running on a separate node. HPA is enabled for the high-priority Nginx service, and the Cluster Autoscaler or Karpenter is enabled for autoscaling the worker nodes.

As load on the Nginx service increases, HPA adds a new high-priority pod (Nginx-2), which is initially in the pending state. Because the Nginx-2 pod has higher priority and is pending, the kube-scheduler evicts the dummy service's pod (Dummy-1) to make room for it.
The kube-scheduler then schedules the high-priority pod Nginx-2 on the node from which the dummy pod was evicted. The dummy pod goes into the pending state, so the Cluster Autoscaler or Karpenter adds an additional node (Node 3) to the cluster. As soon as the new node (Node 3) is ready, the kube-scheduler places the pending dummy pod (Dummy-1) on it.

Prerequisites

The prerequisites for this walkthrough are:

- An AWS account
- Permission to create AWS resources (e.g., IAM roles, IAM policies, Amazon EC2 instances, AWS Cloud9, and Amazon EKS clusters)
- Basic knowledge of Kubernetes and Linux shell commands

Walkthrough

This solution consists of the following steps, which create and configure all the necessary resources. It uses Karpenter; however, you can use the Cluster Autoscaler (CA) as well.

- Step 1 – Create an Amazon EKS cluster with Karpenter
- Step 2 – Create a Provisioner with the t3.small on-demand instance type
- Step 3 – Create high and low priority classes
- Step 4 – Deploy the high-priority sample application
- Step 5 – Deploy the low-priority dummy (over-provisioning) application
- Step 6 – Test the scaling
- Step 7 – Deploy the cluster-proportional-autoscaler
- Step 8 – Test the scaling using the proportional autoscaler
- Step 9 – Clean up

Step 1 – Create an Amazon EKS cluster with Karpenter

In this step, we create an Amazon EKS cluster with Karpenter, following the Karpenter getting-started documentation:

- Choose Get Started on the landing page, which takes you to the documentation.
- Expand the Getting Started menu in the left pane.
- Choose Getting Started with eksctl in the left pane.
- Follow the instructions in the right pane through Install Karpenter Helm Chart, paying attention to the following points:
  - In the Create a cluster step, use "Example 1: Create basic cluster".
  - You can skip the "Create the EC2 Spot Service Linked Role" step, because we use on-demand instances for simplicity in this post.
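The Provisioner manifest in the next step references a `CLUSTER_NAME` shell variable, which the Karpenter getting-started instructions have you export. If you created the cluster another way, set it yourself; the value below is a placeholder assumption:

```shell
# Placeholder: replace with the name you gave your EKS cluster.
export CLUSTER_NAME="karpenter-demo"
echo "${CLUSTER_NAME}"
```

Without this variable, the `${CLUSTER_NAME}` references in the AWSNodeTemplate below expand to empty strings and the discovery selectors will not match your subnets and security groups.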
Step 2 – Create a Provisioner with the t3.small on-demand instance type

To simplify the demonstration, we use on-demand instances and restrict the node type and size to t3.small.

```shell
cat <<EOF | kubectl apply -f -
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["t3.small"]
  limits:
    resources:
      cpu: 1000
  providerRef:
    name: default
  ttlSecondsAfterEmpty: 30
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: ${CLUSTER_NAME}
  securityGroupSelector:
    karpenter.sh/discovery: ${CLUSTER_NAME}
EOF
```

Step 3 – Create high and low priority classes

```shell
cat <<EOF | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for high priority service pods only."
EOF
```

```shell
cat <<EOF | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: -1
globalDefault: false
description: "This priority class should be used for dummy service pods only."
EOF
```

Step 4 – Deploy the high-priority sample application

Deploy the sample application named nginx-app. Please note that priorityClassName is high-priority in the following object definition.

```shell
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-app
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: "1"
              memory: "500Mi"
            limits:
              cpu: "1"
              memory: "500Mi"
      priorityClassName: high-priority
EOF
```

Step 5 – Deploy the low-priority dummy (over-provisioning) application

Deploy the dummy application (i.e., dummy-app).
Please note that priorityClassName is low-priority in the following object definition.

```shell
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dummy-app
  namespace: default
  labels:
    app: overprovisioning
spec:
  replicas: 1
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      containers:
        - name: pause
          image: registry.k8s.io/pause
          resources:
            requests:
              cpu: "1"
              memory: "500Mi"
            limits:
              cpu: "1"
              memory: "500Mi"
      priorityClassName: low-priority
EOF
```

After completing the previous steps, you'll see one replica running for each of the nginx and dummy deployments. Each replica runs on a separate t3.small node, which you can verify with the following commands:

```shell
kubectl get po -o wide
kubectl get nodes --selector=karpenter.sh/initialized=true
```

Step 6 – Test the scaling

Watch the pods in a separate terminal window:

```shell
kubectl get po -o wide -w
```

Also, watch the nodes managed by Karpenter in a separate terminal window:

```shell
kubectl get nodes --selector=karpenter.sh/initialized=true -w
```

Now, scale the nginx deployment from one replica to two. Please note that for demonstration purposes we're manually scaling the replicas of nginx-app; in practice, you can use HPA, which automatically increases or decreases the number of replicas to match demand.

```shell
kubectl scale deployment nginx-app --replicas=2
```

The expected results are as follows:

- As nginx-app is scaled from one replica to two, one additional pod is created. Initially, this pod is in the pending state.
- Because the newly created nginx-app pod is pending and has higher priority, the running lower-priority pod (dummy-app) is evicted from its node to make room for it.
- The newly created nginx-app pod is placed on the node from which the dummy-app pod was evicted.
- Because the dummy-app pod is pending and there is no node to schedule it on, Karpenter adds another node.
- Once the newly added node is ready, the dummy-app pod is scheduled onto it.

Thus, the new higher-priority nginx-app pod doesn't wait for a node to be provisioned; it starts as soon as the lower-priority dummy-app pod is evicted. Moreover, the evicted dummy-app pod runs again as soon as Karpenter adds a new node for it.

Step 7 – Deploy the cluster-proportional-autoscaler

Now, we'll deploy the autoscaler, which scales the dummy application in proportion to the cluster size. We'll use the image registry.k8s.io/cpa/cluster-proportional-autoscaler:1.8.5. For details about this image, please refer to the Kubernetes Cluster Proportional Autoscaler Container documentation.

As per our configuration:

- The autoscaler watches the nodes in the cluster with the label karpenter.sh/initialized=true and sets the replica count of the dummy application to one replica per four nodes or per eight cores, whichever is greater.
- The autoscaler has the priority system-cluster-critical and runs in the kube-system namespace on one of the initial nodes created as part of the cluster.
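The linear scaling the autoscaler applies can be sketched as ceiling division on both dimensions, taking the larger result and clamping it to the configured min/max. Here is a small shell sketch of that arithmetic, assuming the values we configure below (8 cores per replica, 4 nodes per replica, min 1, max 3) and an example cluster size:

```shell
# Sketch of the cluster-proportional-autoscaler's "linear" mode arithmetic.
cores_per_replica=8
nodes_per_replica=4
min_replicas=1
max_replicas=3

# Example cluster size: 5 Karpenter-managed nodes with 2 vCPUs each.
nodes=5
cores=10

# Ceiling division for each dimension.
by_cores=$(( (cores + cores_per_replica - 1) / cores_per_replica ))
by_nodes=$(( (nodes + nodes_per_replica - 1) / nodes_per_replica ))

# Take the larger, then clamp to [min, max].
replicas=$(( by_cores > by_nodes ? by_cores : by_nodes ))
replicas=$(( replicas < min_replicas ? min_replicas : replicas ))
replicas=$(( replicas > max_replicas ? max_replicas : replicas ))

echo "$replicas"   # prints 2 for 5 nodes / 10 cores
```

With three nodes and six cores, both terms evaluate to 1, which is why the autoscaler initially leaves the dummy-app replica count unchanged in the test below.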
```shell
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: overprovisioning-autoscaler
  namespace: default
data:
  linear: |-
    {
      "coresPerReplica": 8,
      "nodesPerReplica": 4,
      "min": 1,
      "max": 3,
      "preventSinglePointFailure": false,
      "includeUnschedulableNodes": true
    }
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: overprovisioning-autoscaler
  namespace: kube-system
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: system:overprovisioning-autoscaler
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["list", "watch"]
  - apiGroups: [""]
    resources: ["replicationcontrollers/scale"]
    verbs: ["get", "update"]
  - apiGroups: ["apps"]
    resources: ["deployments/scale", "replicasets/scale"]
    verbs: ["get", "update"]
  # Remove the configmaps rule once the following issue is fixed:
  # kubernetes-incubator/cluster-proportional-autoscaler#16
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "create"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: system:overprovisioning-autoscaler
subjects:
  - kind: ServiceAccount
    name: overprovisioning-autoscaler
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:overprovisioning-autoscaler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning-autoscaler
  namespace: kube-system
  labels:
    k8s-app: overprovisioning-autoscaler
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: overprovisioning-autoscaler
  template:
    metadata:
      labels:
        k8s-app: overprovisioning-autoscaler
    spec:
      priorityClassName: system-cluster-critical
      containers:
        - name: autoscaler
          image: registry.k8s.io/cpa/cluster-proportional-autoscaler:1.8.5
          resources:
            requests:
              cpu: "20m"
              memory: "10Mi"
          command:
            - /cluster-proportional-autoscaler
            - --namespace=default
            - --configmap=overprovisioning-autoscaler
            - --target=deployment/dummy-app
            # When the cluster uses large nodes (with more cores), "coresPerReplica" should dominate.
            # If using small nodes, "nodesPerReplica" should dominate.
            - --default-params={"linear":{"coresPerReplica":8,"nodesPerReplica":4,"preventSinglePointFailure":false,"includeUnschedulableNodes":true}}
            - --nodelabels=karpenter.sh/initialized=true
            - --logtostderr=true
            - --v=2
      tolerations:
        - key: "CriticalAddonsOnly"
          operator: "Exists"
      serviceAccountName: overprovisioning-autoscaler
EOF
```

Step 8 – Test the scaling using the proportional autoscaler

View the pods in a separate terminal window:

```shell
kubectl get po -o wide -w
```

Also, watch the nodes in a separate terminal window:

```shell
kubectl get nodes --selector=karpenter.sh/initialized=true -w
```

Initially, two replicas of nginx-app and one replica of dummy-app are running, each on a separate node provisioned by Karpenter. Thus, three t3.small nodes with the label karpenter.sh/initialized=true are running, with six cores in total. Per the proportional autoscaler's formula, the replica count of the dummy (over-provisioning) app should be one, so the autoscaler makes no change to the replica count.

Now, scale the nginx app to four replicas. Please note we're manually scaling the replicas of nginx-app; in practice, you can use HPA.

```shell
kubectl scale deployment nginx-app --replicas=4
```

The expected results are as follows:

- As nginx-app is scaled to four replicas, two more pods are created for it. Initially, these pods are in the pending state.
- Because the newly created nginx-app pods have higher priority and are pending, the running lower-priority pod (dummy-app) is evicted from its node to make room for them.
- One of the newly created nginx-app pods is placed on the node from which the dummy-app pod was evicted.
- Because the other newly created nginx-app pod and the evicted dummy-app pod are pending and there are no nodes to schedule them on, Karpenter adds two more nodes.
- As the newly added nodes become ready, the nginx-app pod and the dummy-app pod are scheduled, in that order, by the kube-scheduler.
- The cluster now has five nodes with the label karpenter.sh/initialized=true (four nodes for the four nginx-app replicas and one node for the dummy-app replica), with 10 cores in total. Per the proportional autoscaler's formula, the desired replica count of the dummy (over-provisioning) app is now two, so the autoscaler increases the dummy-app replica count from one to two, and one more dummy-app pod is created.
- As the newly created dummy-app pod is pending, Karpenter adds one more node and the pod is placed onto it.
- The cluster now has six nodes with the label karpenter.sh/initialized=true (four nodes for the four nginx-app replicas and two nodes for the two dummy-app replicas). Thus, the amount of over-provisioning has grown in proportion to the cluster.

Use the following command to see the proportional autoscaler's logs. They show that as soon as the autoscaler detected a change in the number of nodes or cores, it scaled the dummy-app (over-provisioning) replicas according to the configured proportion.

```shell
kubectl logs -n kube-system `kubectl get pod -n kube-system --selector=k8s-app=overprovisioning-autoscaler -o name`
```

The final state has four replicas of nginx-app and two replicas of dummy-app, on six nodes with the label karpenter.sh/initialized=true, with each pod on a separate node:

```shell
kubectl get po -o wide
kubectl get nodes --selector=karpenter.sh/initialized=true
```

Cleaning up

Follow the instructions below to remove the demonstration infrastructure from your account.
Delete the deployments and related objects:

```shell
kubectl delete deployment -n kube-system overprovisioning-autoscaler
kubectl delete deployment -n default nginx-app
kubectl delete deployment -n default dummy-app
kubectl delete ClusterRoleBinding system:overprovisioning-autoscaler
kubectl delete ClusterRole system:overprovisioning-autoscaler
kubectl delete ServiceAccount -n kube-system overprovisioning-autoscaler
kubectl delete ConfigMap -n default overprovisioning-autoscaler
kubectl delete PriorityClass low-priority
kubectl delete PriorityClass high-priority
```

Uninstall Karpenter and delete the AWS IAM roles, policies, AWS CloudFormation stack, and the Amazon EKS cluster using the Cleanup instructions in the Karpenter documentation:

- Choose Get Started on the landing page, which takes you to the documentation.
- Expand the Getting Started menu in the left pane.
- Choose Getting Started with eksctl in the left pane.
- In the right pane, scroll down to the Cleanup section and follow the instructions.

Conclusion

In this post, we showed how to use pod priority and dummy pods running a pause container to eliminate or minimize the time spent provisioning worker nodes during scaling. You can apply this technique to latency-sensitive or spiky workloads to scale your service quickly. The number of dummy pods to over-provision is a trade-off between the scaling performance you need from the worker nodes and the cost of running the dummy pods.