TensorFlow on GKE Autopilot with GPU acceleration



With all the recent interest in Machine Learning and Artificial Intelligence, you might be wondering: what’s the best place to run my AI/ML workloads?

This is why we built the Autopilot mode of operation for Google Kubernetes Engine (GKE) with GPU support. Autopilot takes care of all the infrastructure, so you can focus on running AI/ML workloads, whether for inference, training, or any other GPU task. You simply provide the Pod or Job definition with your container, schedule it on Autopilot, and we will provision the right GPU and execute the workload. You’re only billed while the Job is running, too: once it completes (or you terminate it), the charges stop immediately, and we’ll take care of the cleanup.

Sound too good to be true?

In this post, I’ll demo the creation, execution, and teardown of an AI/ML workload. The workload is a TensorFlow-enabled Jupyter notebook running on an NVIDIA T4, which we can use to run a bunch of different AI/ML training examples. Jupyter notebooks are great for learning and experimenting with AI/ML, and we’ll mount a persistent disk so that you can even preserve your work between runs.

Setup

Start by creating a GKE Autopilot cluster. Since GPUs are not available in every region, choose a region with the GPU you want (the config here uses an NVIDIA T4). Regions with GPUs are shown in the Autopilot pricing table.
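
If you want to double-check which zones in your region offer the GPU before creating the cluster, a quick query along these lines should work (the name and zone filters here are just examples for the T4 in us-west1):

gcloud compute accelerator-types list \
  --filter="name=nvidia-tesla-t4 AND zone:us-west1"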

Create the cluster:

CLUSTER_NAME=test-cluster
REGION=us-west1
gcloud container clusters create-auto $CLUSTER_NAME \
  --region $REGION

Installation

Now we can deploy a TensorFlow-enabled Jupyter notebook with GPU acceleration.

The following StatefulSet definition creates an instance of the tensorflow/tensorflow:latest-gpu-jupyter container, which gives us a Jupyter notebook in a TensorFlow environment. It provisions an NVIDIA T4 GPU, and mounts a PersistentVolume at the /tf/saved path so you can save your work and have it persist between restarts. And it runs on Spot, so you save 60-91% (and remember, our work is saved even if it’s preempted).

This is a legit Jupyter Notebook that you can use long term!

# Tensorflow/Jupyter StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: tensorflow
spec:
  selector:
    matchLabels:
      pod: tensorflow-pod
  serviceName: tensorflow
  replicas: 1
  template:
    metadata:
      labels:
        pod: tensorflow-pod
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
        cloud.google.com/gke-spot: "true"
      terminationGracePeriodSeconds: 30
      containers:
      - name: tensorflow-container
        image: tensorflow/tensorflow:latest-gpu-jupyter
        volumeMounts:
        - name: tensorflow-pvc
          mountPath: /tf/saved
        resources:
          requests:
            nvidia.com/gpu: "1"
            ephemeral-storage: 10Gi
        ## Optional: override and set your own token
        # env:
        # - name: JUPYTER_TOKEN
        #   value: "jupyter"
  volumeClaimTemplates:
  - metadata:
      name: tensorflow-pvc
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 100Gi
---
# Headless service for the above StatefulSet
apiVersion: v1
kind: Service
metadata:
  name: tensorflow
spec:
  ports:
  - port: 8888
  clusterIP: None
  selector:
    pod: tensorflow-pod

We also need a load balancer, so we can connect to this notebook from our desktop:

# External service
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-jupyter
spec:
  ports:
  - protocol: "TCP"
    port: 80
    targetPort: 8888
  selector:
    pod: tensorflow-pod
  type: LoadBalancer

Deploy them both like so:

kubectl create -f https://raw.githubusercontent.com/WilliamDenniss/autopilot-examples/main/tensorflow/tensorflow.yaml
kubectl create -f https://raw.githubusercontent.com/WilliamDenniss/autopilot-examples/main/tensorflow/tensorflow-jupyter.yaml

While we’re waiting, we can watch the events in the cluster to make sure it’s going to work, like so (output truncated to show relevant events):

$ kubectl get events -w
LAST SEEN  TYPE     REASON             OBJECT           MESSAGE
5m25s      Warning  FailedScheduling   pod/tensorflow-0  0/3 nodes are available: 2 Insufficient cpu, 2 Insufficient memory, 2 Insufficient nvidia.com/gpu, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
4m24s      Normal   TriggeredScaleUp   pod/tensorflow-0  pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/gke-autopilot-test/zones/us-west1-b/instanceGroups/gk3-test-cluster-nap-1ax02924-9c722205-grp 0->1 (max: 1000)}]
2m13s      Normal   Scheduled          pod/tensorflow-0  Successfully assigned default/tensorflow-0 to gk3-test-cluster-nap-1ax02924-9c722205-lzgj

The way Kubernetes and Autopilot work is that you’ll initially see FailedScheduling; that’s because at the moment you deploy the workload, there is no node that can handle your Pod. Then you’ll see TriggeredScaleUp, which is Autopilot adding that capacity for you, and finally Scheduled once the Pod has the resources it needs. GPU nodes take a little longer than regular CPU nodes to provision, and this container takes a little while to boot. In my case it took about 5 minutes all up, from scheduling the Pod to it being running.
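
If you’d rather block until the Pod is up than watch events scroll by, a kubectl wait along these lines should do it (the 10-minute timeout is my own arbitrary choice, to cover node provisioning plus container boot):

kubectl wait --for=condition=Ready pod/tensorflow-0 --timeout=10m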

Using the Notebook

Now it’s time to connect. First, get the external IP of the load balancer:

$ kubectl get svc
NAME                 TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)        AGE
kubernetes           ClusterIP      10.102.0.1     <none>         443/TCP        20d
tensorflow           ClusterIP      None           <none>         80/TCP         9m4s
tensorflow-jupyter   LoadBalancer   10.102.2.107   34.127.75.81   80:31790/TCP   8m35s
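
If you’re scripting this, you can also pull the external IP straight out of the Service with jsonpath, something like this minimal sketch:

EXTERNAL_IP=$(kubectl get svc tensorflow-jupyter \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "http://$EXTERNAL_IP"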

And browse to it.

[Screenshot: the Jupyter login page]

We can run the command the login page suggests inside the container with kubectl exec:

$ kubectl exec -it sts/tensorflow -- jupyter notebook list
Currently running servers:
http://0.0.0.0:8888/?token=e54a0e8129ca3918db604f5c79e8a9712aa08570e62d2715 :: /tf

Log in by copying the token (in my case, e54a0e8129ca3918db604f5c79e8a9712aa08570e62d2715) into the input box and hitting “Log In”.

Note: if you want to skip this step, you can set your own token in the configuration; just uncomment the env lines in the StatefulSet above and define your own value.

There are two folders: one with some included samples, and “saved”, which is the one we mounted from a persistent disk. I recommend operating out of the “saved” folder to preserve your state between sessions, and moving the included “tensorflow-tutorials” directory into “saved” before getting started. You can use the UI below to move the folder, and to upload your own notebooks.

[Screenshot: the Jupyter file browser, showing the included samples and the “saved” folder]
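
If you prefer the command line to the UI for that move, a one-liner with exec should also work (a sketch, assuming the image’s default /tf layout):

kubectl exec sts/tensorflow -- mv /tf/tensorflow-tutorials /tf/saved/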

Let’s try running a few of the included samples.

[Screenshot: the classification.ipynb example]

[Screenshot: the overfit_and_underfit.ipynb example]

We can upload our own projects too, like the examples in the TensorFlow docs. Just download the notebook from the docs, upload it through the Jupyter UI into the saved/ folder, and run it.

[Screenshot: the TensorFlow basics.ipynb tutorial, utilizing GPU acceleration]
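
If you want to confirm that TensorFlow really does see the GPU from inside the Pod, a quick one-liner sketch like this should do it (list_physical_devices is the standard TensorFlow 2.x call):

kubectl exec sts/tensorflow -- \
  python -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'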

So there it is: we have a reusable TensorFlow Jupyter notebook running on an NVIDIA T4! This isn’t just a toy either; we hooked up a PersistentVolume so your work is saved (even if the StatefulSet is deleted, or the Pod disrupted), and we’re using Spot compute to save some cash. And the entire thing was provisioned from two YAML files, with no need to think about the underlying compute hardware. Neat!

Monitoring & Troubleshooting

If you get a message like “The kernel appears to have died. It will restart automatically.”, then the first step is to tail your logs.

kubectl logs tensorflow-0 -f

A common issue I saw: when trying to run two notebooks at once, I would exhaust the GPU’s memory (CUDA_ERROR_OUT_OF_MEMORY in the logs). The easy fix is to shut down all but the notebook you are actively using.
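
You can quickly confirm whether that’s what happened by grepping the Pod logs, for example:

kubectl logs tensorflow-0 | grep -i cuda_error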

[Screenshot: shutting down idle notebooks in the Jupyter UI]

You can keep an eye on the GPU utilization like so:

$ kubectl exec -it sts/tensorflow -- bash
# watch -d nvidia-smi

[Screenshot: nvidia-smi output showing GPU utilization]

If you need to restart the setup for whatever reason, just delete the Pod and Kubernetes will recreate it. This is very fast on Autopilot, as the GPU-enabled node will hang around in the cluster for a short time.

kubectl delete pod tensorflow-0

What’s Next

To shell into the environment and run arbitrary code (i.e. without using the notebook UI), you can use the following. Just be sure to save any data you want to persist in /tf/saved/.

kubectl exec -it sts/tensorflow -- bash

If you want some more tutorials, check out the TensorFlow tutorials and Keras.

I cloned the Keras repo onto my persistent volume to have all those tutorials in my notebook as well.

$ kubectl exec -it sts/tensorflow -- bash
# cd /tf/saved
# git clone https://github.com/keras-team/keras-io.git
# pip install pandas

If you need any additional Python modules for your notebooks, like Pandas, you can install them the same way. To create a more durable setup, though, you’ll want your own Dockerfile extending the image we used above (let me know if you’d like to see such a recipe in a follow-up post).
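
As a rough illustration, such a Dockerfile could be as small as the sketch below (hypothetical and untested; you’d want to pin the base image tag rather than track latest):

# Extend the image used in the StatefulSet with extra Python modules
FROM tensorflow/tensorflow:latest-gpu-jupyter
RUN pip install pandas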

I ran a few different examples; here’s some of the output:

[Screenshot: the output of the Keras timeseries/ipynb/timeseries_weather_forecasting.ipynb example]

[Screenshot: a random iteration from one epoch of the Keras generative/ipynb/text_generation_with_miniature_gpt.ipynb example]

Cleanup

GPUs are not the cheapest resources, so make sure you delete everything once you’re done! Clean up by removing the StatefulSet and the Services:

kubectl delete sts tensorflow
kubectl delete svc tensorflow tensorflow-jupyter

Again, the nice thing about Autopilot is that deleting the Kubernetes resources (in this case a StatefulSet and LoadBalancer) will end the associated charges.

That just leaves the persistent disk. You can either keep it around (so that if you re-create the StatefulSet above, it will be reattached and your work restored), or, if you no longer need it, go ahead and delete the disk as well.
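
To see what’s there first, list the claims; the name follows the usual <volumeClaimTemplate-name>-<pod-name> pattern:

kubectl get pvc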

kubectl delete persistentvolumeclaim/tensorflow-pvc-tensorflow-0

You can also delete the cluster if you don’t need it anymore:

gcloud container clusters delete $CLUSTER_NAME --region $REGION

Next Steps

So that’s how easy it is to run GPU workloads on Autopilot! 

Just define your Kubernetes workloads, including any GPU resources they need, and we’ll take care of the rest. When you’re done, delete the object and the charges stop right away; no need to worry about node cleanup.

Head over to https://console.cloud.google.com/kubernetes to get started with your own GKE cluster, and if you’re new to Google Cloud, remember to take advantage of the $300 free trial!
