Kubernetes Architecture Explained: A Comprehensive Guide For Beginners


This is the ultimate guide to understanding Kubernetes architecture in 2023.

So if you want to:

  1. Understand Kubernetes core components
  2. Understand how each component works

Then you’ll love this new guide.

Note: To understand Kubernetes architecture better, there are some prerequisites. Please check the prerequisites in the Kubernetes learning guide to know more.

Kubernetes Architecture Explained

The first and foremost thing you should understand about Kubernetes is that it is a distributed system: it has multiple components spread across different servers over a network. Together, these servers form a Kubernetes cluster.

A Kubernetes cluster primarily has a control plane and worker nodes.

The control plane has the following components.

  1. kube-apiserver
  2. etcd
  3. kube-scheduler
  4. kube-controller-manager
  5. cloud-controller-manager

The worker nodes have the following components.

  1. kubelet
  2. kube-proxy
  3. Container runtime

Kubernetes Architecture Diagram

The following Kubernetes architecture diagram shows all the components of the Kubernetes cluster and how external systems connect to the Kubernetes cluster.

[Image: Kubernetes architecture diagram]

Kubernetes Control Plane Components

First, let’s take a look at each control plane component and the important concepts behind each component.

1. kube-apiserver

The kube-apiserver is the central hub of the Kubernetes cluster; it exposes the Kubernetes API.

End users and other cluster components talk to the cluster via the API server. Occasionally, monitoring systems and third-party services also talk to the API server to interact with the cluster.

So when you use kubectl to manage the cluster, at the backend you are actually communicating with the API server through HTTP REST APIs. Internal cluster components like the scheduler and controllers talk to the API server over the same REST API, typically using the more efficient protobuf encoding.

[Image: kube-apiserver in the Kubernetes architecture]

The kube-apiserver is responsible for the following:

  1. API management: Exposes the cluster API endpoint and handles all API requests.
  2. Authentication (using client certificates and bearer tokens) and authorization (using RBAC).
  3. Processing API requests and validating data for API objects like pods, services, etc. (admission control).
  4. It is the only component that communicates directly with etcd.
  5. It coordinates all the processes between the control plane and worker node components.

Note: To reduce the cluster attack surface, it is crucial to secure the API server. The Shadowserver Foundation conducted an experiment that discovered about 380,000 publicly accessible Kubernetes API servers.
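
For instance, here is a minimal sketch of how the RBAC authorization that the API server enforces is expressed. The namespace, role, and user names are illustrative, not from this article:

```yaml
# Hypothetical Role granting read-only access to pods in the "default" namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
  - apiGroups: [""]          # "" refers to the core API group
    resources: ["pods"]
    verbs: ["get", "watch", "list"]
---
# Bind the Role to a hypothetical user; the API server checks requests
# from that user against bindings like this before touching etcd.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: default
subjects:
  - kind: User
    name: jane               # illustrative user name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```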

2. etcd

Kubernetes is a distributed system and it needs an efficient distributed database like etcd that supports its distributed nature. It acts as both a backend service discovery and a database. You can call it the brain of the Kubernetes cluster.

etcd is an open-source strongly consistent, distributed key-value store. So what does it mean?

  1. Strongly consistent: If an update is made to one node, strong consistency ensures the update is reflected on all the other nodes in the cluster immediately. Per the CAP theorem, achieving 100% availability together with strong consistency is impossible.
  2. Distributed: etcd is designed to run on multiple nodes as a cluster without sacrificing consistency.
  3. Key-value store: A nonrelational database that stores data as keys and values. It also exposes a key-value API. The datastore is built on top of bbolt, a fork of BoltDB.

etcd uses the Raft consensus algorithm for strong consistency and availability. It works in a leader-follower fashion for high availability and to withstand node failures.

The following table shows the fault tolerance of the etcd cluster.

Cluster Size | Majority | Failure Tolerance
------------ | -------- | -----------------
1            | 1        | 0
2            | 2        | 0
3            | 2        | 1
4            | 3        | 1
5            | 3        | 2
6            | 4        | 2
7            | 4        | 3

To put it simply, when you use kubectl to get kubernetes object details, you are getting it from etcd. Also, when you deploy an object like a pod, an entry gets created in etcd.

In a nutshell, here is what you need to know about etcd.

  1. etcd stores all configurations, states, and metadata of Kubernetes objects (pods, secrets, daemonsets, deployments, configmaps, statefulsets, etc).
  2. Kubernetes uses the etcd’s watch functionality to track the change in the state of an object (Watch() API)
  3. etcd exposes its key-value API using gRPC, and a gRPC gateway (a RESTful proxy) translates HTTP API calls into gRPC messages. This makes it an ideal database for Kubernetes.

Also, etcd is the only stateful component in the control plane.

When it comes to etcd HA architecture, there are two modes.

  1. Stacked etcd: etcd deployed along with control plane nodes
  2. External etcd cluster: Dedicated etcd cluster.
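
As an illustration of the external etcd mode, here is a minimal kubeadm ClusterConfiguration sketch; the endpoints and certificate paths are illustrative assumptions:

```yaml
# Hypothetical kubeadm ClusterConfiguration pointing the control plane
# at an external etcd cluster (endpoints and cert paths are illustrative).
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
etcd:
  external:
    endpoints:
      - https://10.0.0.11:2379
      - https://10.0.0.12:2379
      - https://10.0.0.13:2379
    caFile: /etc/etcd/ca.crt
    certFile: /etc/etcd/apiserver-etcd-client.crt
    keyFile: /etc/etcd/apiserver-etcd-client.key
```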

3. kube-scheduler

The kube-scheduler is responsible for scheduling pods on worker nodes.

When you deploy a pod, you specify the pod requirements such as CPU, memory, volume mounts, etc. The scheduler's primary task is to choose the node that best satisfies those requirements.

In a Kubernetes cluster, there will be more than one worker node. So how does the scheduler select the node out of all worker nodes?

Here is how the scheduler works.

  1. To choose the best node, the Kube-scheduler uses filtering and scoring operations.
  2. In filtering, the scheduler finds the nodes where the pod can be scheduled. For example, if five worker nodes have the resources to run the pod, it selects all five. If no node can run the pod, the pod is unschedulable and moved to the scheduling queue. In a large cluster, say 100 worker nodes, the scheduler does not iterate over every node. A scheduler configuration parameter called percentageOfNodesToScore limits the share of nodes that are checked: the default is around 50%, so the scheduler iterates over roughly half the nodes in a round-robin fashion, and for very large clusters the default drops to as low as 5%. If the worker nodes are spread across multiple zones, the scheduler iterates over nodes in different zones. (A configuration sketch follows this list.)
  3. In the scoring phase, the scheduler ranks the nodes by assigning a score to the filtered worker nodes. It scores the nodes by calling multiple scheduling plugins. Finally, the worker node with the highest rank is selected for scheduling the pod. If all the nodes have the same rank, a node is selected at random.
  4. Once the node is selected, the scheduler creates a binding event in the API server. Meaning an event to bind a pod and node.
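
Here is a minimal, hypothetical KubeSchedulerConfiguration showing where percentageOfNodesToScore is set; the value is illustrative:

```yaml
# Hypothetical scheduler configuration tuning how many nodes are
# considered for scoring per scheduling attempt.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 50   # illustrative value
```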

Things you should know about the scheduler:

  1. It is a controller that listens to pod creation events in the API server.
  2. The scheduler has two phases. Scheduling cycle and the Binding cycle. Together it is called the scheduling context. The scheduling cycle selects a worker node and the binding cycle applies that change to the cluster.
  3. The scheduler always places high-priority pods ahead of low-priority pods for scheduling. Also, in some cases, after a pod starts running on the selected node, it might get evicted or moved to other nodes. If you want to understand more, read the Kubernetes pod priority guide.
  4. You can create custom schedulers and run multiple schedulers in a cluster alongside the native scheduler. When you deploy a pod, you can specify the custom scheduler in the pod manifest (see the manifest sketch after this list). The scheduling decisions are then made based on the custom scheduler logic.
  5. The scheduler has a pluggable scheduling framework. Meaning, you can add your custom plugin to the scheduling workflow.
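
For example, here is a minimal pod manifest sketch selecting a custom scheduler by name; my-custom-scheduler is an illustrative name for a scheduler you would have deployed yourself:

```yaml
# Hypothetical pod that opts out of the default scheduler.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-custom-scheduled
spec:
  schedulerName: my-custom-scheduler   # illustrative scheduler name
  containers:
    - name: nginx
      image: nginx:1.25
```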

4. Kube Controller Manager

What is a controller? Controllers are programs that run infinite control loops: they run continuously, watching the actual and desired state of objects. If the actual and desired state differ, a controller works to bring the Kubernetes resource/object back to the desired state.

As per the official documentation,

In Kubernetes, controllers are control loops that watch the state of your cluster, then make or request changes where needed. Each controller tries to move the current cluster state closer to the desired state.

If you want to create a deployment, you specify the desired state in the manifest YAML file (declarative approach). For example, 2 replicas, one volume mount, configmap, etc. The in-built deployment controller ensures that the deployment is in the desired state all the time. If a user updates the deployment with 5 replicas, the deployment controller recognizes it and ensures the desired state is 5 replicas.
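
For example, a Deployment manifest like this minimal sketch declares the desired state of 2 replicas, and the deployment controller continuously reconciles the cluster toward it (names and image are illustrative):

```yaml
# Minimal Deployment declaring a desired state of 2 replicas.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2              # the desired state the controller maintains
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25
```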

Kube controller manager is the component that manages all the Kubernetes controllers. Kubernetes resources/objects like pods, namespaces, jobs, and replicasets are managed by their respective controllers.

Following is the list of core built-in Kubernetes controllers.

  1. Deployment controller
  2. ReplicaSet controller
  3. DaemonSet controller
  4. Job controller (Kubernetes Jobs)
  5. CronJob controller
  6. Endpoints controller
  7. Namespace controller
  8. Service accounts controller
  9. Node controller

Here is what you should know about the Kube controller manager.

  1. It manages all the controllers and the controllers try to keep the cluster in the desired state.
  2. You can extend Kubernetes with custom controllers associated with a custom resource definition (CRD); a minimal CRD sketch follows this list.
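
Here is a minimal, hypothetical CustomResourceDefinition sketch; a custom controller would watch Backup objects and reconcile them (the group and kind names are invented for illustration):

```yaml
# Hypothetical CRD adding a namespaced "Backup" kind to the cluster API.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: backups.example.com
spec:
  group: example.com
  names:
    kind: Backup
    plural: backups
    singular: backup
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                schedule:
                  type: string   # e.g., a cron-style schedule string
```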

5. Cloud Controller Manager (CCM)

When Kubernetes is deployed in a cloud environment, the cloud controller manager acts as a bridge between the cloud platform's APIs and the Kubernetes cluster.

This way the core Kubernetes components can work independently, and it allows cloud providers to integrate with Kubernetes using plugins (for example, an interface between the Kubernetes cluster and the AWS cloud API).

Cloud Controller Manager contains a set of cloud platform-specific controllers that ensure the desired state of cloud-specific components (nodes, Loadbalancers, storage, etc). Following are the three main controllers that are part of the cloud controller manager.

  1. Node controller: This controller updates node-related information by talking to the cloud provider API, for example, node labels and annotations, hostnames, CPU and memory availability, node health, etc.
  2. Route controller: It is responsible for the configuration of networking routes for pods in different nodes to talk to each other.
  3. Service controller: It takes care of deploying load balancers for kubernetes services, assigning IP addresses, etc.

Examples:

  1. Provisioning cloud load balancer for Kubernetes service
  2. Provisioning cloud storage volumes to be used with pods.
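
For example, assuming the cluster runs on a cloud provider with a working cloud controller manager, creating a Service like this minimal sketch prompts the service controller to provision a cloud load balancer (names and ports are illustrative):

```yaml
# Hypothetical Service; on a cloud provider, the CCM's service controller
# provisions an external load balancer and assigns it an address.
apiVersion: v1
kind: Service
metadata:
  name: web-lb
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
    - port: 80          # port exposed by the load balancer
      targetPort: 8080  # container port on the backend pods
```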

Kubernetes Worker Node Components

Now let’s look at each of the worker node components.

1. Kubelet

Kubelet is an agent component that runs on every node in the cluster. It is responsible for registering worker nodes with the API server and working with podSpecs (pod specifications in YAML or JSON), primarily from the API server. It then brings each podSpec to the desired state by creating the containers it describes.

To put it simply, kubelet is responsible for creating, modifying, and deleting containers for the pod.

Kubelet is also a controller where it watches for pod changes and utilizes the node’s container runtime to pull images, run containers, etc.

Other than podSpecs from the API server, the kubelet can accept podSpecs from a file or an HTTP endpoint. A good example of this is Kubernetes static pods. Static pods are controlled by the kubelet, not the API server. While bootstrapping the control plane, the kubelet starts the api-server, scheduler, and controller manager as static pods from podSpecs located at /etc/kubernetes/manifests.
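
For example, a minimal static pod sketch like the following, dropped into /etc/kubernetes/manifests (the kubelet's default static pod path on kubeadm clusters), is run by the kubelet directly, with no controller or scheduler involved (name and image are illustrative):

```yaml
# Minimal static pod. The kubelet watches the manifests directory and
# starts this pod itself, then reports a mirror pod to the API server.
apiVersion: v1
kind: Pod
metadata:
  name: static-web
spec:
  containers:
    - name: web
      image: nginx:1.25
      ports:
        - containerPort: 80
```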

Following are some of the key things about kubelet.

  1. Kubelet uses the CRI (container runtime interface) gRPC interface to talk to the container runtime.
  2. It also exposes an HTTP endpoint to stream logs and provides exec sessions for clients.
  3. Uses the CSI (container storage interface) gRPC to configure block volumes.

2. Kube proxy

Kube-proxy is a daemon that runs on every node. It is a network proxy that implements the Kubernetes Services concept for pods (a single stable endpoint for a set of pods, with load balancing).

When you expose a deployment using a Service endpoint, kube-proxy creates rules to send traffic to the backend pods grouped under the Service. In other words, all the load balancing and service discovery for Services are handled by kube-proxy.

So, the primary work of Kube-proxy is to load balance traffic to a set of pods under a service exposed via Cluster IP (Internal Virtual IP assigned to a Service) or Node Port (Static port on Node IP address).
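
For example, a Service manifest like this minimal sketch exposes a set of pods; kube-proxy programs rules on every node so that traffic to the ClusterIP, or to the NodePort shown here, is load balanced across the matching pods (names and ports are illustrative):

```yaml
# Hypothetical NodePort Service fronting pods labeled app=web.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: NodePort
  selector:
    app: web
  ports:
    - port: 80          # ClusterIP port
      targetPort: 8080  # container port on the backend pods
      nodePort: 30080   # static port on every node's IP (illustrative)
```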

So how does Kube-proxy work?

Kube-proxy uses any one of the following modes to direct traffic to pods behind a service.

  1. IPTables: In IPTables mode, the traffic is handled by Linux Netfilter, and kube-proxy chooses a backend pod at random for load balancing.
  2. IPVS: For clusters with more than 1,000 Services, IPVS offers a performance improvement. It supports the following load-balancing algorithms for the backend (see the configuration sketch after this list).
    1. rr: round-robin
    2. lc: least connection (smallest number of open connections)
    3. dh: destination hashing
    4. sh: source hashing
    5. sed: shortest expected delay
    6. nq: never queue
  3. Userspace (legacy & not recommended)
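
Here is a minimal, hypothetical kube-proxy configuration selecting IPVS mode with one of the schedulers listed above; the values are illustrative:

```yaml
# Hypothetical kube-proxy configuration choosing IPVS with the
# least-connection scheduler.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "lc"   # one of rr, lc, dh, sh, sed, nq
```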

If you would like to understand the performance difference between kube-proxy IPtables and IPVS mode, read this article.

3. Container Runtime

You probably know about Java Runtime (JRE). It is the software required to run Java programs on a host. In the same way, container runtime is a software component that is required to run containers.

Container runtime runs on all the nodes and you can also call it a container engine. It is responsible for pulling images from container registries, running containers, allocating and isolating resources for containers, and managing the entire lifecycle of a container on a host.

Kubernetes supports multiple container runtimes (CRI-O, Docker Engine, containerd, etc.) that are compliant with the Container Runtime Interface (CRI). This means all these container runtimes implement the CRI interface and expose gRPC CRI APIs (runtime and image service endpoints).

So how does Kubernetes make use of the container runtime?

As we learned in the Kubelet section, the kubelet agent is responsible for interacting with the container runtime using CRI APIs to manage the lifecycle of a container. Also, it gets all the container information from the container runtime and provides it to the control plane.

Let's take the CRI-O container runtime as an example. Here is a high-level overview of how it works with Kubernetes.

[Image: CRI-O container runtime overview]

  1. When there is a new request for a pod from the API server, the kubelet talks to the CRI-O daemon to launch the required containers via the Kubernetes Container Runtime Interface.
  2. CRI-O checks and pulls the required container image from the configured container registry using the containers/image library.
  3. CRI-O then generates OCI runtime specification (JSON) for a container.
  4. CRI-O then launches an OCI-compatible runtime (runc) to start the container process as per the runtime specification.
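
A related, standard Kubernetes object worth knowing here is RuntimeClass, which lets pods select a specific runtime handler configured on the node. This minimal sketch assumes a handler named runc is configured in the node's CRI runtime (e.g., in CRI-O's configuration); the object name is illustrative:

```yaml
# Hypothetical RuntimeClass; "handler" must match a runtime handler
# configured in the node's CRI runtime.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: my-runc        # illustrative name
handler: runc          # CRI handler configured on the node
```

A pod opts in by setting spec.runtimeClassName: my-runc in its manifest.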

Kubernetes Cluster Addon Components

Apart from the core components, a Kubernetes cluster needs the following addon components to be fully operational.

  1. CNI Plugin (Container Network Interface)
  2. CoreDNS
  3. Metrics Server

Conclusion

Understanding Kubernetes architecture helps you with day-to-day Kubernetes implementation and operations.

When implementing a production-level cluster setup, having the right knowledge of Kubernetes components will help you run and troubleshoot applications.

Next, you can start with step-by-step kubernetes tutorials to get hands-on experience with Kubernetes objects and resources.
