Showing results for tags 'gke'.
  1. Patent-search platform provider IPRally is growing quickly, serving global enterprises, IP law firms, and multiple national patent and trademark offices. As the company grows, so do its technology needs. It continues to train its models for greater accuracy, adding 200,000 searchable records for customer access weekly, and mapping new patents.

With millions of patent documents published annually, and the technical complexity of those documents increasing, it can take even the most seasoned patent professional several hours of research to resolve a case with traditional patent search tools. In 2018, Finnish firm IPRally set out to tackle this problem with a graph-based approach. “Search engines for patents were mostly complicated boolean ones, where you needed to spend hours building a complicated query,” says Juho Kallio, CTO and co-founder of the 50-person firm. “I wanted to build something important and challenging.”

Using machine learning (ML) and natural language processing (NLP), the company has transformed the text of over 120 million global patent documents into document-level knowledge graphs embedded in a searchable vector space. Now, patent researchers can receive relevant results in seconds, with AI-selected highlights of key information and explainable results.

To meet those needs, IPRally built a customized ML platform on Google Kubernetes Engine (GKE) and Ray, an open-source ML framework, balancing efficiency and performance while streamlining machine learning operations (MLOps). The company uses open-source KubeRay to deploy and manage Ray on GKE, which enables it to use cost-efficient NVIDIA GPU Spot instances for exploratory ML research and development. It also uses Google Cloud data building blocks, including Cloud Storage and Compute Engine persistent disks. Next on the horizon is expanding to big data solutions with Ray Data and BigQuery. “Ray on GKE has the ability to support us in the future with any scale and any kind of distributed complex deep learning,” says Kallio.

A custom ML platform built for performance and efficiency

The IPRally engineering team’s primary focus is R&D: how it can continue to improve its Graph AI to make technical knowledge more accessible. With just two DevOps engineers and one MLOps engineer, IPRally was able to build its own customized ML platform with GKE and Ray as key components.

A big proponent of open source, IPRally transitioned everything to Kubernetes when its compute needs grew. However, the team didn’t want to manage Kubernetes itself. That led them to GKE, with its scalability, flexibility, open ecosystem, and support for a diverse set of accelerators. All told, this gives IPRally the right balance of performance and cost, easy management of compute resources, and the ability to efficiently scale down capacity when it isn’t needed. “GKE provides the scalability and performance we need for these complex training and serving needs, and we get the right granularity of control over data and compute,” says Kallio.

One particular GKE capability that Kallio highlights is container image streaming, which has significantly accelerated start-up times. “We have seen that container image streaming in GKE has a significant impact on expediting our application startup time. Image streaming helps us accelerate our start-up time for a training job after submission by 20%,” he shares.
“And, when we are able to reuse an existing pod, we can start up in a few seconds instead of minutes.”

The next layer is Ray, which the company uses to scale the distributed, parallelized Python and Clojure applications it uses for machine learning. To more easily manage Ray, IPRally uses KubeRay, a specialized tool that simplifies Ray cluster management on Kubernetes. IPRally uses Ray for its most advanced tasks, such as massive preprocessing of data and exploratory deep learning in R&D. “Interoperability between Ray and GKE autoscaling is smooth and robust. We can combine computational resources without any constraints,” says Kallio.

The heaviest ML loads are mainly deployed on G2 VMs featuring up to eight NVIDIA L4 Tensor Core GPUs, which deliver cutting-edge performance-per-dollar for AI inference workloads. By leveraging them within GKE, IPRally can create nodes on demand and scale GPU resources as needed, optimizing its operational costs. There is a single Terraform-provisioned Kubernetes cluster in each of the regions where IPRally searches for inexpensive Spot instances; GKE and Ray then step in for compute orchestration and automated scaling.

To further ease MLOps, IPRally built its own thin orchestration layer, IPRay, atop KubeRay and Ray. This layer provides a command-line tool that lets data scientists easily provision a templated Ray cluster that scales efficiently up and down, and run jobs in Ray without needing to know Terraform. This self-service layer reduces friction and allows both engineers and data scientists to focus on higher-value work.

Technology paves the way for strong growth

Through this selection of Google Cloud and open-source frameworks, IPRally has shown that a startup can build an enterprise-grade ML platform without spending millions of dollars. Focusing on a powerful MLOps and automation foundation from its earliest days has paid dividends in efficiency and in the team’s ability to focus on R&D. “Crafting a flexible ML infrastructure from the best parts has been more than worth it,” shares Jari Rosti, an ML engineer at IPRally. “Now, we’re seeing the benefits of that investment multiply as we adapt the infrastructure to the constantly evolving ideas of modern ML. That’s something other young companies can achieve as well by leveraging Google Cloud and Ray.”

Further, the company has been saving 70% of its ML R&D costs by using Spot instances. These affordable instances offer the same quality VMs as on-demand instances but are subject to interruption. Because IPRally’s R&D workloads are fault-tolerant, they are a good fit for Spot instances.

IPRally closed a €10M Series A investment last year, and it’s forging on with ingesting and processing IP documentation from around the globe, with a focus on improving its graph neural network models and building the best AI platform for patent searching. With 3.4 million patents filed in 2022, the third consecutive year of growth, data will keep flowing, and IPRally can continue helping intellectual property professionals find every relevant bit of information. "With Ray on GKE, we've built an ML foundation that is a testament to how powerful Google Cloud is with AI," says Kallio. “And now, we’re prepared to explore far more advanced deep learning and to keep growing.” View the full article
  2. Google Kubernetes Engine (GKE) has emerged as a leading container orchestration platform for deploying and managing containerized applications. But GKE is not limited to running stateless microservices — its flexible design also supports workloads that need to persist data, via tight integrations with block, object, and file storage products including Persistent Disk, Cloud Storage, and Filestore. And for organizations that need even stronger throughput and performance, there’s Hyperdisk, Google Cloud's next-generation block storage service that lets you modify your capacity, throughput, and IOPS performance and tailor it to your workloads, without having to re-deploy your entire stack.

Today, we’re excited to introduce support for Hyperdisk Balanced storage volumes on GKE, which joins the Hyperdisk Extreme and Hyperdisk Throughput options and is a good fit for workloads that typically rely on persistent SSDs — for example, line-of-business applications, web applications, and databases. Hyperdisk Balanced is supported on third-generation and newer instance types; please reference the instance-type compatibility documentation for details.

Understanding Hyperdisk

First, what does it mean to fine-tune throughput, IOPS, and capacity with Hyperdisk?

Tuning IOPS means refining the input/output operations per second (IOPS) performance of a storage device. Hyperdisk allows you to provision only the IOPS you need, and does not share it with other volumes on the same node.

Tuning throughput means adjusting the amount of data that can be transferred or processed in a specified amount of time. Hyperdisk allows you to specify exactly how much throughput a given storage volume should have, without limitations imposed by other volumes on the same node.

Expanding capacity means you can increase the size of your storage volume. Hyperdisk can be provisioned for the exact capacity you need and extended as your storage needs grow.

These Hyperdisk capabilities translate into the following benefits:

First, you can transform your environment's stateful workloads through flexibility, ease of use, scalability, and manageability — with potential cost savings to boot. Imagine the benefit of a storage area network environment without the management overhead.

Second, you can build lower-cost infrastructure by rightsizing the machine types that back your GKE nodes, optimizing your GKE stack while integrating with GKE CI/CD, security, and networking capabilities.

Finally, you get predictability — the consistency that comes with fine-grained tuning for each individual node and its required IOPS. You can also use this to fine-tune storage for ML model building, training, and serving: rather than all Persistent Disks sharing the same node bandwidth, the throughput and IOPS you provision belong to the Hyperdisk volume itself.

Compared with traditional persistent disk, Hyperdisk’s storage performance is decoupled from the node your application is running on. This gives you more flexibility with your IOPS and throughput, while reducing the possibility that overall storage performance is impacted by a noisy neighbor.

On GKE, the following types of Hyperdisk volumes are available:

Hyperdisk Balanced - General-purpose volume type that is the best fit for most workloads, with up to 2.4 GBps of throughput and 160k IOPS. Ideal for line-of-business applications, web applications, databases, or boot disks.
Hyperdisk Throughput - Optimized for cost-efficient, high-throughput workloads, with up to 3 GBps of throughput (>=128 KB IO size). Hyperdisk Throughput is targeted at scale-out analytics (e.g., Kafka, Hadoop, Redpanda) and other throughput-oriented, cost-sensitive workloads, and is supported on GKE Autopilot and GKE Standard clusters.

Hyperdisk Extreme - Specifically optimized for IOPS performance, such as large-scale databases that require high IOPS. Supported on Standard clusters only.

Getting started with Hyperdisk on GKE

To provision Hyperdisk, you first need to make sure your cluster has the necessary StorageClass loaded that references the disk. You can add the desired IOPS and throughput to the StorageClass or go with the defaults, which are 3,600 IOPS and 140 MiBps (see the documentation).

YAML to apply to the GKE cluster:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: balanced-storage
provisioner: pd.csi.storage.gke.io
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
parameters:
  type: hyperdisk-balanced
  provisioned-throughput-on-create: "200Mi" # optional
  provisioned-iops-on-create: "5000" # optional

After you’ve configured the StorageClass, you can create a persistent volume claim that references it (a minimal example of a Pod consuming this claim appears at the end of this item):

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: postgres
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: balanced-storage
  resources:
    requests:
      storage: 1000Gi

Take control of your GKE storage with Hyperdisk

To recap, Hyperdisk lets you take control of your cloud infrastructure's storage, throughput, IOPS, and capacity, combining the speed of existing PD SSD volumes with the flexibility to fine-tune to your specific needs, by decoupling disks from the node that your workload is running on.

For more, check out the following sessions from Next ‘24:

Next generation storage: Designing storage for the future
A primer on data on Kubernetes

And be sure to check out these resources:

How to create a Hyperdisk on GKE
Data management solutions on GKE
Contact your Google Cloud representative

View the full article
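To illustrate the final step, here is a minimal, hypothetical Pod that mounts the postgres claim defined above. The image, environment variable, and mount path are illustrative assumptions, and the Pod must land on a node pool backed by a third-generation or newer machine type that supports Hyperdisk Balanced.

apiVersion: v1
kind: Pod
metadata:
  name: postgres-demo
spec:
  containers:
  - name: postgres
    image: postgres:16                         # illustrative image
    env:
    - name: POSTGRES_PASSWORD                  # demo-only credential, not for production use
      value: "changeme"
    volumeMounts:
    - name: data
      mountPath: /var/lib/postgresql/data      # assumed mount path for this example
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: postgres                      # the PVC created above

Because the StorageClass uses WaitForFirstConsumer, the Hyperdisk volume is only provisioned once this Pod is scheduled, in the same zone as its node.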
  3. Multi-cluster Ingress (MCI) is an advanced feature typically used in cloud computing environments that enables the management of ingress (the entry point for external traffic into a network) across multiple Kubernetes clusters. This functionality is especially useful for applications that are deployed globally across several regions or clusters, offering a unified method to manage access to these applications. MCI simplifies the process of routing external traffic to the appropriate cluster, enhancing both the reliability and scalability of applications.

Here are key features and benefits of Multi-cluster Ingress:

Global Load Balancing: MCI can intelligently route traffic to different clusters based on factors like region, latency, and health of the service. This ensures users are directed to the nearest or best-performing cluster, improving the overall user experience.

Centralized Management: It allows for the configuration of ingress rules from a single point, even though these rules are applied across multiple clusters. This simplification reduces the complexity of managing global applications.

High Availability and Redundancy: By spreading resources across multiple clusters, MCI enhances the availability and fault tolerance of applications. If one cluster goes down, traffic can be automatically rerouted to another healthy cluster.

Cross-Region Failover: In the event of a regional outage or a significant drop in performance within one cluster, MCI can perform automatic failover to another cluster in a different region, ensuring continuous availability of services.

Cost Efficiency: MCI helps optimize resource utilization across clusters, potentially leading to cost savings. Traffic can be routed to clusters where resources are less expensive or more abundant.

Simplified DNS Management: Typically, MCI solutions offer integrated DNS management, automatically updating DNS records based on the health and location of clusters. This removes the need for manual DNS management in a multi-cluster setup.

How does Multi-cluster Ingress (MCI) work?

Multi-cluster Ingress works by managing and routing external traffic into applications running across multiple Kubernetes clusters. This process involves several components and steps to ensure that traffic is efficiently and securely routed to the appropriate destination based on predefined rules and policies. Here’s a high-level overview of how MCI operates:

1. Deployment Across Multiple Clusters

Clusters Preparation: You deploy your application across multiple Kubernetes clusters, often spread across different geographical locations or cloud regions, to ensure high availability and resilience.

Ingress Configuration: Each cluster has its own set of resources and services that need to be exposed externally. With MCI, you configure ingress resources that are aware of the multi-cluster environment.

2. Central Management and Configuration

Unified Ingress Control: A central control plane is used to manage the ingress resources across all participating clusters. This is where you define the rules for how external traffic should be routed to your services.

DNS and Global Load Balancer Setup: MCI integrates with global load balancers and DNS systems to direct users to the closest or most appropriate cluster based on criteria like location, latency, and the health of the clusters.

3. Traffic Routing

Initial Request: When a user makes a request to access the application, DNS resolution directs the request to the global load balancer.
Global Load Balancing: The global load balancer evaluates the request against the configured routing rules and the current state of the clusters (e.g., load, health). It then selects the optimal cluster to handle the request.

Cluster Selection: The criteria for cluster selection can include geographic proximity to the user, the health and capacity of the clusters, and other custom rules defined in the MCI configuration.

Request Forwarding: Once the optimal cluster is selected, the global load balancer forwards the request to an ingress controller in that cluster.

Service Routing: The ingress controller within the chosen cluster then routes the request to the appropriate service based on the path, host, or other headers in the request.

4. Health Checks and Failover

Continuous Monitoring: MCI continuously monitors the health and performance of all clusters and their services. This includes performing health checks and monitoring metrics to ensure each service is functioning correctly.

Failover and Redundancy: In case a cluster becomes unhealthy or is unable to handle additional traffic, MCI automatically reroutes traffic to another healthy cluster, ensuring uninterrupted access to the application.

5. Scalability and Maintenance

Dynamic Scaling: As traffic patterns change or as clusters are added or removed, MCI dynamically adjusts routing rules and load balancing to optimize performance and resource utilization.

Configuration Updates: Changes to the application or its deployment across clusters can be managed centrally through the MCI configuration, simplifying updates and maintenance.

Example Deployment YAML for Multi-cluster Ingress with FrontendConfig and BackendConfig

This example includes: a simple web application deployment; a service to expose the application within the cluster; a MultiClusterService to expose the service across clusters; and a MultiClusterIngress to expose the service externally with FrontendConfig and BackendConfig (a hedged sketch of the multi-cluster resources appears at the end of this item).

The post What is Multi-cluster Ingress (MCI) appeared first on DevOpsSchool.com. View the full article
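As a rough illustration of the GKE-specific resources involved, here is a minimal sketch of a MultiClusterService and MultiClusterIngress for a hypothetical web Deployment listening on port 8080. The names, namespace, and the BackendConfig annotation key are assumptions for this example, not the original post's manifests.

apiVersion: networking.gke.io/v1
kind: MultiClusterService
metadata:
  name: web-mcs
  namespace: web
  annotations:
    # Assumed annotation key for attaching a BackendConfig; verify against the MCI docs.
    beta.cloud.google.com/backend-config: '{"ports": {"8080": "web-backend-config"}}'
spec:
  template:
    spec:
      selector:
        app: web              # matches the Pods of the hypothetical web Deployment
      ports:
      - name: http
        protocol: TCP
        port: 8080
        targetPort: 8080
---
apiVersion: networking.gke.io/v1
kind: MultiClusterIngress
metadata:
  name: web-ingress
  namespace: web
spec:
  template:
    spec:
      backend:
        serviceName: web-mcs  # the MultiClusterService above
        servicePort: 8080

Applying these to the config cluster causes the MCI controller to create derived Services in each member cluster and a global external load balancer that routes to matching Pods wherever they run.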
  4. Google Kubernetes Engine (GKE) offers two ways to perform service discovery and DNS resolution: the in-cluster kube-dns functionality, and the GCP-managed Cloud DNS. Either approach can be combined with the performance-enhancing NodeLocal DNSCache add-on.

New GKE Autopilot clusters use Cloud DNS as a fully managed DNS solution without any configuration required on your part. But for GKE Standard clusters, you have the following DNS provider choices:

kube-dns (the default)
Cloud DNS, configured for either cluster scope or VPC scope
Install and run your own DNS (such as CoreDNS)

In this blog, we break down the differences between the DNS providers for your GKE Standard clusters and guide you to the best solution for your specific situation.

kube-dns

kube-dns is the default DNS provider for GKE Standard clusters, providing DNS resolution for services and pods within the cluster, and the only DNS provider for Autopilot clusters running versions earlier than 1.25.9-gke.400 and 1.26.4-gke.500. If you select this option, GKE deploys the necessary kube-dns components — the kube-dns pods, kube-dns-autoscaler, kube-dns ConfigMap, and kube-dns Service — in the kube-system namespace.

kube-dns is a suitable solution for workloads with moderate DNS query volumes that have stringent DNS resolution latency requirements (e.g., under ~2-4 ms). kube-dns can provide low-latency resolution for all DNS queries because all resolution is performed within the cluster.

If you notice DNS timeouts or failed DNS resolutions for bursty workload traffic patterns when using kube-dns, consider scaling the number of kube-dns pods and enabling NodeLocal DNSCache for the cluster. You can scale the number of kube-dns pods ahead of time by tuning the kube-dns autoscaler to the cluster's DNS traffic patterns (a sample autoscaler configuration appears at the end of this item). Using kube-dns along with NodeLocal DNSCache (discussed below) also reduces the load on the kube-dns pods for DNS resolution of external services.

While scaling up kube-dns and using NodeLocal DNSCache (NLD) helps in the short term, it does not guarantee reliable DNS resolution during sudden traffic spikes. Migrating to Cloud DNS therefore provides a more robust, long-term solution for reliable DNS resolution across varying query volumes. You can update the DNS provider for an existing GKE Standard cluster from kube-dns to Cloud DNS without re-creating the cluster.

For logging DNS queries when using kube-dns, manual effort is required to create a kube-dns debug pod with log-queries enabled.

Cloud DNS

Cloud DNS is a Google-managed service designed for high scalability and availability. It elastically scales to adapt to your DNS query volume, providing consistent and reliable DNS query resolution regardless of traffic volume. Cloud DNS simplifies your operations and minimizes operational overhead, since it is a Google-managed service and does not require you to maintain any additional infrastructure. Cloud DNS supports DNS resolution across the entire VPC, which is not currently possible with kube-dns. Also, when using Multi Cluster Services (MCS) in GKE, Cloud DNS provides DNS resolution for services across your fleet of clusters.
Unlike kube-dns, Cloud DNS provides Pod and Service DNS resolution that auto-scales and offers a 100% service-level agreement, reducing DNS timeouts and providing consistent DNS resolution latency for heavy DNS workloads. Cloud DNS also integrates with Cloud Monitoring, giving you greater visibility into DNS queries for enhanced troubleshooting and analysis.

The Cloud DNS controller automatically provisions DNS records in Cloud DNS for ClusterIP, headless, and ExternalName services. You can configure Cloud DNS to provide GKE DNS resolution in either VPC or cluster (the default) scope. With VPC scope, the DNS records are resolvable within the entire VPC; this is achieved with a private DNS zone that is created automatically. With cluster scope, the DNS records are resolvable only within the cluster.

While Cloud DNS offers enhanced features, it does come with usage-based costs. You save on compute costs and overhead by removing the kube-dns pods when using Cloud DNS; considering typical cluster sizes and workload traffic patterns, Cloud DNS is usually more cost-effective than running kube-dns.

You can migrate clusters from kube-dns to Cloud DNS cluster scope without downtime or changes to your applications. The reverse (migrating from Cloud DNS to kube-dns) is not a seamless operation.

NodeLocal DNSCache

NodeLocal DNSCache is a GKE add-on that you can run in addition to kube-dns or Cloud DNS. The node-local-dns pod is deployed on the GKE nodes after the option has been enabled (subject to a node upgrade procedure). NodeLocal DNSCache (NLD) helps reduce average DNS resolution times by resolving DNS requests locally on the same nodes as the pods, forwarding only the requests it cannot resolve to the other DNS servers in the cluster. This is a great fit for clusters that have heavy internal DNS query loads. Enable NLD during maintenance windows; note that node pools must be re-created for this change to take effect.

Final thoughts

The choice of DNS provider for your GKE Standard cluster has implications for performance and reliability, in addition to your operations and overall service-discovery architecture. It is therefore crucial for GKE Standard users to understand their DNS options, taking into account their application and architecture objectives. Standard GKE clusters let you use either kube-dns or Cloud DNS as your DNS provider, allowing you to optimize for either low-latency DNS resolution or a simple, scalable, and reliable DNS solution.

You can learn more about DNS for your GKE cluster from the GKE documentation. If you have any further questions, feel free to contact us.

We thank the Google Cloud team member who contributed to the blog: Selin Goksu, Technical Solutions Developer, Google

View the full article
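For reference, a minimal sketch of the kube-dns autoscaler tuning mentioned above: the kube-dns-autoscaler in kube-system is driven by a ConfigMap, and lowering nodesPerReplica or coresPerReplica (or raising min) provisions more kube-dns replicas ahead of traffic spikes. The specific values here are illustrative, not recommendations.

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
data:
  # Replicas = max(ceil(cores / coresPerReplica), ceil(nodes / nodesPerReplica)), bounded below by "min".
  linear: |-
    {
      "coresPerReplica": 256,
      "nodesPerReplica": 8,
      "min": 2,
      "preventSinglePointFailure": true
    }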
  5. Autopilot mode for Google Kubernetes Engine (GKE) is our take on a fully managed, Pod-based Kubernetes platform. It provides category-defining features: a fully functional Kubernetes API with support for StatefulSet with block storage, GPUs, and other critical functionality that you don’t often find in nodeless/serverless-style Kubernetes offerings, while still offering a Pod-level SLA and a super-simple developer API. But until now, Autopilot, like other products in this category, did not offer the ability to temporarily burst CPU and memory resources beyond what was requested by the workload.

I’m happy to announce that now, powered by the unique design of GKE Autopilot on Google Cloud, we are able to bring burstable workload support to GKE Autopilot. Bursting allows your Pod to temporarily utilize resources beyond those it requests and is billed for. How does this work, and how can Autopilot offer burstable support, given the Pod-based model? The key is that in Autopilot mode we still group Pods together on Nodes. This is what powers several unique features of Autopilot, such as our flexible Pod sizes. With this change, the capacity of your Pods is pooled, and Pods that set a limit higher than their requests can temporarily burst into this capacity (if it’s available).

With the introduction of burstable support, we’re also introducing another groundbreaking change: 50m CPU Pods — that is, Pods as small as 1/20th of a vCPU. Until now, the smallest Pod we offered was ¼ of a vCPU (250m CPU) — five times bigger. Combined with burst, the door is now open to run high-density-yet-small workloads on Autopilot, without constraining each Pod to its resource requests.

We’re also dropping the 250m CPU resource increment, so you can create any size of Pod you like between the minimum and the maximum size, for example 59m CPU, 302m, 808m, or 7682m (the memory-to-CPU ratio is automatically kept within a supported range, sizing up if needed). With these finer-grained increments, Vertical Pod Autoscaling now works even better, helping you tune your workload sizes automatically. If you want to add a little more automation to figuring out the right resource requirements, give Vertical Pod Autoscaling a try!

Here’s what our customer Ubie had to say:

“GKE Autopilot frees us from managing infrastructure so we can focus on developing and deploying our applications. It eliminates the challenges of node optimization, a critical benefit in our fast-paced startup environment. Autopilot's new bursting feature and lowered CPU minimums offer cost optimization and support multi-container architectures such as sidecars. We were already running many workloads in Autopilot mode, and with these new features, we're excited to migrate our remaining clusters to Autopilot mode.” - Jun Sakata, Head of Platform Engineering, Ubie

A new home for high-density workloads

Autopilot is now a great place to run high-density workloads. Let’s say, for example, you have a multi-tenant architecture with many smaller replicas of a Pod, each running a website that receives relatively little traffic but has the occasional spike. Combining the new burstable feature and the lower 50m CPU minimum, you can now deploy thousands of these Pods in a cost-effective manner (up to 20 per vCPU), while enabling burst so that when a traffic spike comes in, a Pod can temporarily burst into the pooled capacity of these replicas (a minimal manifest illustrating this pattern appears at the end of this item).
What’s even better is that you don’t need to concern yourself with bin-packing these workloads, or the problem where you may have a large node that’s underutilized (for example, a 32-core node running a single 50m CPU Pod). Autopilot takes care of all of that for you, so you can focus on what’s key to your business: building great services.

Calling all startups, students, and solopreneurs

Everyone needs to start somewhere, and if Autopilot wasn’t the perfect home for your workload before, we think it is now. With the 50m CPU minimum size, you can run individual containers in us-central1 for under $2/month each (50m CPU, 50MiB). And thanks to burst, these containers can use a little extra CPU when traffic spikes, so they’re not completely constrained to that low size. And if this workload isn’t mission-critical, you can even run it in Spot mode, where the price is even lower. In fact, thanks to GKE’s free tier, your costs in us-central1 can be as low as $30/month for small workloads (including production-grade load balancing) while tapping into the same world-class infrastructure that powers some of the biggest sites on the internet today.

Importantly, if your startup grows, you can scale in place without needing to migrate — since you’re already running on a production-grade Kubernetes cluster. So you can start small, while being confident that there is nothing limiting about your operating environment. We’re counting on your success as well, so good luck!

And if you’re learning GKE or Kubernetes, this is a great platform to learn on. Nothing beats learning on real production systems — after all, that’s what you’ll be using on the job. With one free cluster per account and all these new features, Autopilot is a fantastic learning environment. Plus, if you delete your Kubernetes workload and networking resources when you’re done for the day, any associated costs cease as well. When you’re ready to resume, just create those resources again, and you’re back at it. You can even persist state between learning sessions by deleting just the compute resources (and keeping the persistent disks). Don’t forget there’s a $300 free trial to cover your initial usage as well!

Next steps:

Learn about Pod bursting in GKE.
Read more about the minimum and maximum resource requests in Autopilot mode.
Learn how to set resource requirements automatically with VPA.
Try GKE’s Autopilot mode for a workload-based API and simpler Day 2 ops with lower TCO.
Going to Next ‘24? Check out session DEV224 to hear Ubie talk about how it uses burstable workloads in GKE.

View the full article
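As a minimal sketch of the high-density pattern described above (not an official example), the Deployment below requests 50m CPU per replica and sets a higher CPU limit so replicas can burst into pooled capacity on Autopilot. The image, replica count, and limit values are illustrative assumptions.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tenant-site              # hypothetical per-tenant web frontend
spec:
  replicas: 20
  selector:
    matchLabels:
      app: tenant-site
  template:
    metadata:
      labels:
        app: tenant-site
    spec:
      containers:
      - name: web
        image: nginx:1.25        # illustrative image
        resources:
          requests:
            cpu: 50m             # the new Autopilot minimum; requests are what you are billed for
            memory: 64Mi
          limits:
            cpu: 250m            # limit > request enables temporary bursting when capacity is available
            memory: 64Mi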
  6. Editor’s note: Stanford University Assistant Professor Paul Nuyujukian and his team at the Brain Inferencing Laboratory explore motor systems neuroscience and neuroengineering applications as part of an effort to create brain-machine interfaces for medical conditions such as stroke and epilepsy. This blog explores how the team is using Google Cloud data storage, computing, and analytics capabilities to streamline the collection, processing, and sharing of that scientific data, for the betterment of science and to adhere to funding agency regulations.

Scientific discovery, now more than ever, depends on large quantities of high-quality data and sophisticated analyses performed on those data. In turn, the ability to reliably capture and store data from experiments and process them in a scalable and secure fashion is becoming increasingly important for researchers. Furthermore, collaboration and peer review are critical components of the processes aimed at making discoveries accessible and useful across a broad range of audiences.

The cornerstones of scientific research are rigor, reproducibility, and transparency — critical elements that ensure scientific findings can be trusted and built upon [1]. Recently, US Federal funding agencies have adopted strict guidelines around the availability of research data, so leveraging data best practices is not only practical and beneficial for science, it is now compulsory [2, 3, 4, 5]. Fortunately, Google Cloud provides a wealth of data storage, computing, and analytics capabilities that can be used to streamline the collection, processing, and sharing of scientific data.

Prof. Paul Nuyujukian and his research team at Stanford’s Brain Inferencing Laboratory explore motor systems neuroscience and neuroengineering applications. Their work involves studying how the brain controls movement and recovers from injury, and working to establish brain-machine interfaces as a platform technology for a variety of brain-related medical conditions, particularly stroke and epilepsy. The relevant data is obtained from experiments on preclinical models and human clinical studies. The raw experimental data collected in these experiments is extremely valuable and virtually impossible to reproduce exactly (not to mention the potential costs involved).

[Fig. 1: Schematic representation of a scientific computation workflow]

To address the challenges outlined above, Prof. Nuyujukian has developed a sophisticated data collection and analysis platform that is in large part inspired by the practices that make up the DevOps approach common in software development [6, Fig. 2]. Keys to the success of this system are standardization, automation, repeatability, and scalability. The platform allows for both standardized analyses and “one-off” or ad-hoc analyses in a heterogeneous computing environment. The critical components of the system are containers, Git, CI/CD (leveraging GitLab Runners), and high-performance compute clusters, both on-premises and in cloud environments such as Google Cloud, in particular Google Kubernetes Engine (GKE) running in Autopilot mode.

[Fig. 2: Leveraging DevOps for Scientific Computing]

Google Cloud provides a secure, scalable, and highly interoperable framework for the various analyses that need to be run on the data collected from scientific experiments (spanning basic science and clinical studies). GitLab Pipelines specify the transformations and analyses that need to be applied to the various datasets.
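To make the pipeline idea concrete, here is a hypothetical, minimal .gitlab-ci.yml sketch of a multi-stage data-transformation pipeline of the kind described; the container image, runner tag, scripts, bucket names, and configuration names are assumptions for illustration only.

stages:
  - ingest
  - preprocess
  - analyze

default:
  image: us-docker.pkg.dev/example-project/analysis/neuro-tools:latest  # hypothetical analysis container
  tags:
    - gke-autopilot            # assumed tag that routes jobs to the GitLab Runner hosted on GKE

ingest:
  stage: ingest
  script:
    - python ingest.py --session "$SESSION_ID" --output gs://example-raw-data/

preprocess:
  stage: preprocess
  needs: ["ingest"]
  script:
    - python preprocess.py --input gs://example-raw-data/ --output gs://example-derived-data/

analyze:
  stage: analyze
  needs: ["preprocess"]
  # Parametrization: fan the same analysis out over several configurations in parallel.
  parallel:
    matrix:
      - CONFIG: ["baseline", "variant-a", "variant-b"]
  script:
    - python run_analysis.py --config "configs/$CONFIG.yaml"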
GitLab Runner instances running on GKE (or other on-premises cluster/high-performance computing environments) are used to execute these pipelines in a scalable and cost-effective manner. Autopilot environments in particular provide substantial advantages to researchers, since they are fully managed and require only minimal customization or ongoing “manual” maintenance. Furthermore, they instantly scale with the demand for analyses that need to be run, even with Spot VM pricing, allowing for cost-effective computation. Then, they scale down to near-zero when idle, and scale up as demand increases again – all without intervention by the researcher.

GitLab pipelines have a clear and well-organized structure defined in YAML files. Data transformations are often multi-stage, and GitLab’s framework explicitly supports such an approach. Defaults can be set for an entire pipeline, such as the various data transformation stages, and can be overwritten for particular stages where necessary. Since the exact steps of a data transformation pipeline can be context- or case-dependent, conditional logic is supported along with dynamic definition of pipelines, e.g., definitions depending on the outcome of previous analysis steps. Critically, different stages of a GitLab pipeline can be executed by different runners, facilitating the execution of pipelines across heterogeneous environments, for example transferring data from experimental acquisition systems and processing them in cloud or on-premises computing spaces [Fig. 3].

[Fig. 3: Architecture of the Google Cloud based scientific computation workflow via GitLab Runners hosted on Google Kubernetes Engine]

Cloud computing resources can provide exceptional scalability, while pipelines allow for parallel execution of stages to take advantage of this scalability, allowing researchers to execute transformations at scale and substantially speed up data processing and analysis. Parametrization of pipelines allows researchers to automate the validation of processing protocols across many acquired datasets or analytical variations, yielding robust, reproducible, and sustainable data analysis workflows.

Collaboration and data sharing is another critical, and now mandatory, aspect of scientific discovery. Multiple generations of researchers, from the same lab or different labs, may interact with particular datasets and analysis workflows over a long period of time. Standardized pipelines like the ones described above can play a central role in providing transparency on how data is collected and how it is processed, since they are essentially self-documenting. That, in turn, allows for scalable and repeatable discovery. Data provenance, for example, is explicitly supported by this framework. Through the extensive use of containers, workflows are also well encapsulated and no longer depend on specifically tuned local computing environments. This consequently leads to increased rigor, reproducibility, and transparency, enabling a large audience to interact productively with datasets and data transformation workflows.

In conclusion, by using the computing, data storage, and transformation technologies available from Google Cloud along with the workflow capabilities of CI/CD engines like GitLab, researchers can build highly capable and cost-effective scientific data-analysis environments that aid efforts to increase rigor, reproducibility, and transparency, while also achieving compliance with relevant government regulations.
References:

[1] Enhancing Reproducibility through Rigor and Transparency
[2] NIH issues a seismic mandate: share data publicly
[3] Final NIH Policy for Data Management and Sharing
[4] FACT SHEET: Biden-Harris Administration Announces New Actions to Advance Open and Equitable Research
[5] MEMORANDUM FOR THE HEADS OF EXECUTIVE DEPARTMENTS AND AGENCIES
[6] Leveraging DevOps for Scientific Computing

View the full article
  7. Managing the growth of your Kubernetes clusters within Google Kubernetes Engine (GKE) just got easier. We've recently introduced the ability to directly monitor and set alerts for crucial scalability limits, providing you with deeper insight and control over your Kubernetes environment. Effective scalability management is essential for avoiding outages and optimizing resource usage in Kubernetes. These new monitoring features bring you:

Peace of mind: Potential capacity issues can be proactively addressed before they cause problems, ensuring uninterrupted operations.
Clearer understanding: Gain a deeper insight into your clusters’ architectural constraints, allowing for informed decision-making.
Optimization opportunities: Analyze usage trends and identify ways to fine-tune your cluster configurations for optimal resource utilization.

Here are the specific limits you can now keep track of:

Etcd database size (GiB): Understand how much space your Kubernetes cluster state is consuming.
Nodes per cluster: Get proactive alerts on your cluster's overall node capacity.
Nodes per node pool (all zones): Manage node distribution and limits across specific node pools.
Pods per cluster (GKE Standard / GKE Autopilot): Ensure you have the pod capacity to support your applications.
Containers per cluster (GKE Standard / GKE Autopilot): Prevent issues by understanding the maximum number of containers your cluster can support.

Get started

You'll find these new quota monitoring and alerting features directly within the Google Cloud console. To get there, you can use the link to a pre-filtered list of GKE quotas, or navigate to the Quotas page under the IAM & Admin section in the console and then filter by the Kubernetes Engine API service. To search for a specific quota, use the Filter table; you can filter by the exact quota name, location, cluster name, or node pool (where applicable). You can also create alerts for a specific quota by following the guide; alerts can be configured to notify you when a quota is approaching or has exceeded its limit.

The Cloud Monitoring API and console let you monitor GKE quota usage in greater depth. The API allows you to programmatically access quota metrics and create custom dashboards and alerts (a hedged example appears at the end of this item), while the console provides a graphical interface for visualizing quota usage over time and configuring alerts that fire when usage reaches a certain threshold. This can help you proactively manage your quotas and avoid unexpected outages. See the guide for more details.

Need more information? Explore the official Google Cloud documentation for more in-depth guidance:

Understanding GKE quotas and limits: Quotas and limits | Google Kubernetes Engine (GKE)
Best practices for planning and designing large clusters: Plan for large GKE clusters | Google Kubernetes Engine (GKE)
Setting up a quota alert: Monitor and alert with quota metrics
Using GKE observability metrics: View observability metrics | Google Kubernetes Engine (GKE)

View the full article
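As a rough sketch of the programmatic route (an assumption-laden example, not the official guide), the policy below could be created with gcloud alpha monitoring policies create --policy-from-file=policy.yaml. It watches quota usage metrics for the Kubernetes Engine API service; the filter labels and threshold value are illustrative and should be adapted from the actual quota metric you care about.

# policy.yaml - hypothetical alert on GKE-related quota usage (all values are placeholders)
displayName: "GKE quota usage approaching limit"
combiner: OR
conditions:
  - displayName: "Kubernetes Engine API quota usage above threshold"
    conditionThreshold:
      # Assumed filter: the serviceruntime quota usage metric, restricted to the Kubernetes Engine API.
      filter: >
        metric.type = "serviceruntime.googleapis.com/quota/allocation/usage"
        AND resource.type = "consumer_quota"
        AND resource.label.service = "container.googleapis.com"
      comparison: COMPARISON_GT
      thresholdValue: 4000        # placeholder, e.g., ~80% of a 5,000-node limit
      duration: 0s
      aggregations:
        - alignmentPeriod: 300s
          perSeriesAligner: ALIGN_MAX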
  8. When developers are innovating quickly, security can be an afterthought. That’s even true for AI/ML workloads, where the stakes are high for organizations trying to protect valuable models and data. When you deploy an AI workload on Google Kubernetes Engine (GKE), you can benefit from the many security tools available in Google Cloud infrastructure. In this blog, we share security insights and hardening techniques for training AI/ML workloads on one framework in particular — Ray.

Ray needs security hardening

As a distributed compute framework for AI applications, Ray has grown in popularity in recent years, and deploying it on GKE is a popular choice that provides flexibility and configurable orchestration. You can read more on why we recommend GKE for Ray.

However, Ray lacks built-in authentication and authorization, which means that if you can successfully send a request to the Ray cluster head, it will execute arbitrary code on your behalf. So how do you secure Ray? Ray's authors state that security should be enforced outside of the Ray cluster, but how do you actually harden it? Running Ray on GKE can help you achieve a more secure, scalable, and reliable Ray deployment by taking advantage of existing global Google infrastructure components, including Identity-Aware Proxy (IAP). We’re also making strides in the Ray community to provide safer defaults for running Ray with Kubernetes using KubeRay. One focus area has been improving Ray component compliance with the restricted Pod Security Standards profile and adding security best practices, such as running the operator as non-root to help prevent privilege escalation.

Security separation supports multi-cluster operation

One key advantage of running Ray inside Kubernetes is the ability to run multiple Ray clusters, with diverse workloads, managed by multiple teams, inside a single Kubernetes cluster. This gives you better resource sharing and utilization, because nodes with accelerators can be used by several teams, and spinning up Ray on an existing GKE cluster avoids waiting on VM provisioning time before workloads can begin execution.

Security plays a supporting role in landing those multi-cluster advantages by using Kubernetes security features to help keep Ray clusters separate. The goal is to avoid accidental denial of service or accidental cross-tenant access. Note that the security separation here is not “hard” multitenancy — it is only sufficient for clusters running trusted code and teams that trust each other with their data. If further isolation is required, consider using separate GKE clusters.

The architecture is shown in the following diagram. Different Ray clusters are separated by namespaces within the GKE cluster, allowing authorized users to make calls to their assigned Ray cluster without accessing others.

[Diagram: Ray on GKE Architecture]

How to secure Ray on GKE

At Google Cloud, we’ve been working on improving the security of KubeRay components and making it easier to spin up a multi-team environment with the help of Terraform templates, including sample security configurations that you can reuse. Below, we’ve summarized fundamental security best practices included in our sample templates:

Namespaces: Separate Ray clusters into distinct namespaces by placing one Ray cluster per namespace to take advantage of Kubernetes policies based on the namespace boundary.
Role-based access control (RBAC): Practice least privilege by creating a Kubernetes Service Account (KSA) per Ray cluster namespace, avoiding the default KSA associated with each Ray cluster namespace, and minimizing permissions down to no RoleBindings until deemed necessary. Optionally, consider setting automountServiceAccountToken: false on the KSA to ensure the KSA’s token is not available to the Ray cluster Pods, since Ray jobs are not expected to call the Kubernetes API.

Resource quotas: Harden against denial of service due to resource exhaustion by setting limits via resource quotas (especially for CPUs, GPUs, TPUs, and memory) on your Ray cluster namespace.

NetworkPolicy: Protect the Ray API as a critical measure of Ray security, since there is no authentication or authorization for submitting jobs. Use Kubernetes NetworkPolicy with GKE Dataplane V2 to control which traffic reaches the Ray components (a minimal example appears at the end of this item).

Security context: Comply with Kubernetes Pod Security Standards by configuring Pods to run with hardened settings that prevent privilege escalation, disallow running as root, and restrict potentially dangerous syscalls.

Workload identity federation: If necessary, secure access from your Ray deployment Pods to other Google Cloud services, such as Cloud Storage, with workload identity federation by referencing your KSA in a Google Cloud IAM policy.

Additional security tools

The following tools and references can provide additional security for your Ray clusters on GKE:

Identity-Aware Proxy (IAP): Control access to your Ray cluster with Google’s distributed global endpoint, with IAP providing user and group authorization and Ray deployed behind a Kubernetes Ingress or Gateway service.

Pod Security Standards (PSS): Turn Pod Security Standards on for each of your namespaces in order to prevent common insecure misconfigurations such as HostVolume mounts. If you need more policy customization, you can also use Policy Controller.

GKE Sandbox: Leverage GKE Sandbox Pods, based on gVisor, to add a second security layer around Pods, further reducing the possibility of breakouts for your Ray clusters. Currently available for CPUs (and GPUs with some limitations).

Cluster hardening: By default, GKE Autopilot already applies many cluster hardening best practices, but there are additional ways to lock down the cluster. The Ray API can be further secured by removing access from the internet using private nodes.

Organization policies: Ensure your organization's clusters meet security and hardening standards by setting custom organization policies — for example, guarantee that all GKE clusters are Autopilot.

Google continues to contribute to the community through our efforts to ensure safe and scalable deployments. We look forward to continued collaboration to ensure Ray runs safely on Kubernetes clusters. Please drop us a line with any feedback at ray-on-gke@google.com or comment on our GitHub repo.

To learn more, check out the following resources:

Terraform templates for hardening Ray on GKE
GKE cluster hardening guide

View the full article
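To illustrate the NetworkPolicy point, here is a minimal sketch that restricts ingress to a Ray head Pod in a per-team namespace: traffic is allowed from Pods in the same namespace (the Ray workers) and, on the job-submission/dashboard port only, from an assumed proxy namespace fronting the cluster. The namespace names are hypothetical, and the head-Pod label assumed here is the one KubeRay typically applies; verify it against your KubeRay version.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-ray-head-ingress
  namespace: ray-team-a                        # hypothetical per-team namespace
spec:
  podSelector:
    matchLabels:
      ray.io/node-type: head                   # assumed KubeRay label selecting the head Pod
  policyTypes:
  - Ingress
  ingress:
  # Ray workers and other Pods in the same namespace may reach the head on any port.
  - from:
    - podSelector: {}
  # Only an (assumed) ingress/IAP proxy namespace may reach the dashboard / Job Submission API port.
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ray-proxy   # hypothetical namespace
    ports:
    - protocol: TCP
      port: 8265                               # default Ray dashboard / job submission port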
  9. As Artificial Intelligence and Machine Learning (AI/ML) models grow larger, training and inference applications demand accelerated compute such as NVIDIA GPUs. Google Kubernetes Engine (GKE) is a fully managed Kubernetes service that simplifies container orchestration and has become the platform of choice to deploy, scale, and manage custom ML platforms. GKE can now automatically install NVIDIA GPU drivers, making it easier for customers to take advantage of GPUs.

Previously, using GPUs with GKE required manually installing the GPU drivers by applying a DaemonSet. While the manual approach has the benefit of giving customers more explicit control over their environment, for other customers it felt like unnecessary friction that delayed the deployment workflow. Going forward, GKE can automatically install GPU drivers on behalf of the customer. With the GA launch of GKE’s automated GPU driver installation, using GPUs is now even easier.

“Automatic driver installation has been a great quality-of-life feature to simplify adding GPUs to GKE node pools. We use it on all of our AI workloads” – Barak Peleg, VP, AI21

Additionally, off-loading the installation management of GPU drivers to Google allows the drivers to be precompiled for the GKE node, which can reduce the time it takes for GPU nodes to start up.

Setting up GPU driver installation

To take advantage of automated GPU driver installation when creating GKE node pools, set the gpu-driver-version option to one of the following values:

default: Install the default driver version for the GKE version.
latest: Install the latest available driver version for the GKE version. Available only for nodes that use Container-Optimized OS.

If you’d prefer to manually install your drivers, specify disabled, which skips automatic driver installation. Currently, if you do not specify anything, the default behavior is the manual approach.

Here’s how to enable GPU driver installation via gcloud:

gcloud container node-pools create POOL_NAME \
    --accelerator type=nvidia-tesla-a100,count=8,gpu-driver-version=latest \
    --region COMPUTE_REGION \
    --cluster CLUSTER_NAME

You can also enable it through the GKE console UI. In the node pool machine configuration, select GPUs. You will then be given the option of ‘Google-managed’ or ‘User-managed’, as shown below. If you select the ‘Google-managed’ option, the necessary GPU drivers will be automatically installed when the node pool is created, eliminating the need for the additional manual step. For the time being, GPU driver installation defaults to ‘User-managed’; in the future, GKE will shift the default selection to ‘Google-managed’ once most users become accustomed to the new approach.

To learn more about using GPUs on GKE, please visit our documentation. View the full article
  10. Background

Ever since generative AI gained prominence in the AI field, organizations ranging from startups to large enterprises have moved to harness its power by making it an integral part of their applications, solutions, and platforms. While the true potential of generative AI lies in creating new content based on learning from existing content, it is becoming important that the content produced has a degree of specificity to a given area or domain. This blog post shows how generative AI models can be adapted to your use cases by demonstrating how to train models on Google Kubernetes Engine (GKE) using NVIDIA accelerated computing and the NVIDIA NeMo framework.

Building generative AI models

In the context of constructing generative AI models, high-quality data (the ‘dataset’) serves as a foundational element. Data in various formats, such as text, code, and images, is processed, enhanced, and analyzed to minimize direct effects on the model's output. Based on the model's modality, this data is fed into a model architecture to enable the model's training process — for example, text for Transformers or images for GANs (Generative Adversarial Networks).

During the training process, the model adjusts its internal parameters so that its output matches the patterns and structures of the data. As the model learns, its performance is monitored by observing a lowering loss on the training set, as well as improved predictions on a test set. Once the performance is no longer improving, the model is considered converged. It may then undergo further refinement, such as reinforcement learning with human feedback (RLHF). Additional hyperparameters, such as learning rate or batch size, can be tuned to improve the rate of model learning. The process of building and customizing a model can be expedited by utilizing a framework that offers the necessary constructs and tooling, thereby simplifying adoption.

NVIDIA NeMo

NVIDIA NeMo is an open-source, end-to-end platform purpose-built for developing custom, enterprise-grade generative AI models. NeMo leverages NVIDIA’s state-of-the-art technology to facilitate a complete workflow, from automated distributed data processing, to training of large-scale bespoke models, to deployment and serving using infrastructure in Google Cloud. NeMo is also available for enterprise-grade deployments with NVIDIA AI Enterprise software, available on Google Cloud Marketplace.

The NeMo framework approaches building AI models with a modular design that encourages data scientists, ML engineers, and developers to mix and match these core components:

Data Curation: extract, deduplicate, and filter information from datasets to generate high-quality training data
Distributed Training: advanced parallelism spreads training workloads across tens of thousands of compute nodes with NVIDIA graphics processing units (GPUs)
Model Customization: adapt several foundational, pre-trained models to specific domains using techniques such as P-tuning, SFT (Supervised Fine-Tuning), and RLHF (Reinforcement Learning from Human Feedback)
Deployment: seamless integration with NVIDIA Triton Inference Server to deliver high accuracy, low latency, and high throughput

The NeMo framework provides guardrails to honor safety and security requirements. It enables organizations to foster innovation, optimize operational efficiency, and establish easy access to software frameworks to start the generative AI journey.
For those interested in deploying NeMo onto an HPC system that may include schedulers like the Slurm workload manager, we recommend using the ML Solution available through the Cloud HPC Toolkit.

Training at scale using GKE

Building and customizing models requires massive compute, quick access to memory and storage, and rapid networking. In addition, there are multiple demands across the infrastructure, ranging from scaling large models, efficient resource utilization, and agility for faster iteration, to fault tolerance and orchestrating distributed workloads.

GKE allows customers to have a more consistent and robust development process by having one platform for all their workloads. As a foundation platform, GKE provides unmatched scalability and compatibility with a diverse set of hardware accelerators, including NVIDIA GPUs, bringing the best of accelerator orchestration to help significantly improve performance and reduce costs. Let’s look at how GKE helps manage the underlying infrastructure with ease (Figure 1):

Compute
Multi-Instance GPUs (MIG): partition a single NVIDIA H100 or A100 Tensor Core GPU into multiple instances, each with high-bandwidth memory, cache, and compute cores
Time-sharing GPUs: a single physical GPU node shared by multiple containers to use resources efficiently and save running costs

Storage
Local SSD: high throughput and I/O requirements
GCS Fuse: allow file-like operations on objects

Networking
GPUDirect-TCPX NCCL plug-in: a transport-layer plugin that enables direct GPU-to-NIC transfers during NCCL communication, improving network performance
Google Virtual Network Interface Card (gVNIC): increases network performance between GPU nodes

Queuing
A Kubernetes-native job queueing system (Kueue) to orchestrate job execution to completion in a resource-constrained environment

GKE is widely embraced by communities, including ISVs (independent software vendors), as a place to land their tools, libraries, and frameworks. GKE democratizes infrastructure by letting teams of different sizes build, train, and deploy AI models.

Solution architecture

Current industry trends in the AI/ML space indicate that more computational power improves models significantly. GKE unleashes this power of Google Cloud’s products and services, along with NVIDIA GPUs, to train and serve models with industry-leading scale. The Reference Architecture in Figure 2 illustrates the major components, tools, and common services used to train the NeMo large language model using GKE:

A GKE cluster, set up as a regional or zonal location, consisting of two node pools: a default node pool to manage common services such as DNS pods and custom controllers, and a managed node pool with the A3 nodes to run the workloads.
A3 nodes, each with 16 local SSDs, 8 NVIDIA H100 Tensor Core GPUs, and associated drivers. In each node, the Filestore CSI driver is enabled to access fully managed NFS storage, and Cloud Storage FUSE to access Google Cloud Storage as a file system.
Kueue batching for workload management; this is recommended for a larger setup used by multiple teams.
Filestore mounted in each node to store outputs and interim and final logs, to view training performance.
A Cloud Storage bucket that contains the training data.
NVIDIA NGC, which hosts the NeMo framework’s training image.
Training logs mounted on Filestore, which can be viewed using TensorBoard to examine the training loss and training step times.
Common services such as Cloud Ops to view logs, IAM to manage security, and Terraform to deploy the entire setup.

A rough sketch of the kind of training Job that runs on this infrastructure follows below.
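For orientation only, here is a hypothetical, stripped-down Kubernetes Job of the sort that could run on the A3 node pool described above; the queue name, image tag, command, and PVC name are assumptions, and the full working manifests live in the walkthrough repository referenced next.

apiVersion: batch/v1
kind: Job
metadata:
  name: nemo-pretrain-demo
  labels:
    kueue.x-k8s.io/queue-name: a3-queue        # assumed Kueue LocalQueue name
spec:
  suspend: true                                # Kueue un-suspends the Job once resources are admitted
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-h100-80gb
      containers:
      - name: nemo
        image: nvcr.io/nvidia/nemo:24.01       # NeMo framework image from NGC; tag is an assumption
        command: ["python", "examples/nlp/language_modeling/megatron_gpt_pretraining.py"]
        resources:
          limits:
            nvidia.com/gpu: 8                  # all 8 H100 GPUs on one A3 node
        volumeMounts:
        - name: results
          mountPath: /results                  # training logs/checkpoints surfaced via Filestore
      volumes:
      - name: results
        persistentVolumeClaim:
          claimName: nemo-results              # hypothetical Filestore-backed PVC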
An end-to-end walkthrough is available in a GitHub repository at https://github.com/GoogleCloudPlatform/nvidia-nemo-on-gke. The walkthrough provides detailed instructions for setting up the above solution in a Google Cloud project and pre-training NVIDIA's NeMo Megatron GPT using the NeMo framework.

Extend further

In scenarios with massive amounts of structured data, enterprises commonly use BigQuery as their central data warehousing platform. Data can be exported to Cloud Storage to train the model. If the data is not available in the intended form, Dataflow can be used to read, transform, and write the data back to BigQuery.

Conclusion

By leveraging GKE, organizations can focus on developing and deploying their models to grow their business, without worrying about the underlying infrastructure. NVIDIA NeMo is highly suited to building custom generative AI models. Together, they provide the scalability, reliability, and ease of use needed to train and serve models. To learn more about GKE, click here. To understand more about NVIDIA NeMo, click here. If you are coming to NVIDIA GTC, please see Google Cloud at booth #808 to see this in action. View the full article
  11. The revolution in generative AI (gen AI) and large language models (LLMs) is leading to larger model sizes and increased demands on the compute infrastructure. Organizations looking to integrate these advancements into their applications increasingly require distributed computing solutions that offer minimal scheduling overhead. As the need for scalable gen AI solutions grows, Ray, an open-source Python framework designed for scaling and distributing AI workloads, has become increasingly popular. Traditional Ray deployments on virtual machines (VMs) have limitations when it comes to scalability, resource efficiency, and infrastructure manageability. One alternative is to leverage the power and flexibility of Kubernetes and deploy Ray on Google Kubernetes Engine (GKE) with KubeRay, an open-source Kubernetes operator that simplifies Ray deployment and management.

"With the help of Ray on GKE, our AI practitioners are able to get easy orchestration of infrastructure resources, flexibility and scalability that their applications need without the headache of understanding and managing the intricacies of the underlying platform." - Nacef Labidi, Head of Infrastructure, InstaDeep

In this blog, we discuss the numerous benefits that running Ray on GKE brings to the table, including scalability, cost-efficiency, fault tolerance and isolation, and portability, as well as resources on how to get started.

Easy scalability and node auto-provisioning

On VMs, Ray's scalability is inherently limited by the number of VMs in the cluster. Autoscaling and node provisioning, configured for specific clouds (example), require detailed knowledge of machine types and network configurations. In contrast, Kubernetes orchestrates infrastructure resources using containers, pods, and VMs as scheduling units, while Ray distributes data-parallel processes within applications, employing actors and tasks for scheduling. KubeRay introduces cloud-agnostic autoscaling to the mix, allowing you to define minimum and maximum replicas within the workerGroupSpec. Based on this configuration, the Ray autoscaler schedules more Kubernetes pods as required by its tasks. And if you choose the GKE Autopilot mode of operation, node provisioning happens automatically, eliminating the need for manual configuration.

Greater efficiency and improved startup latency

GKE offers discount-based savings such as committed use discounts, a new pricing model, and reservations for GPUs in Autopilot mode. In addition, GKE makes it easy to take advantage of cost-saving measures like Spot nodes via YAML configuration. Low startup latency is critical to optimal resource usage, ensuring quick recovery, faster iterations, and elasticity. GKE image streaming lets you initialize and run eligible container images from Artifact Registry without waiting for the full image to download. In testing, the `ray-ml` container image went from `ContainerCreating` to `Running` state in 8.82s, compared to 5m17s without image streaming: that's 35x faster! Image streaming is automatically enabled on Autopilot clusters and available on Standard clusters.

Automated infrastructure management for fault tolerance and isolation

Managing a Ray cluster on VMs offers control over fault tolerance and isolation via detailed VM configuration. However, it lacks the automated, portable self-healing capabilities that Kubernetes provides.
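That declarative model is easy to see in how a KubeRay cluster is described. Below is a minimal, illustrative RayCluster sketch showing the workerGroupSpec bounds mentioned above; the API version, image tag, and group name are assumptions, so check the KubeRay documentation for the exact schema of your release.

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: example-raycluster          # hypothetical name
spec:
  enableInTreeAutoscaling: true     # lets the Ray autoscaler add and remove worker pods
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0          # illustrative Ray image tag
  workerGroupSpecs:
  - groupName: gpu-workers
    replicas: 1
    minReplicas: 0                  # scale to zero when idle
    maxReplicas: 8                  # upper bound the autoscaler can request
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0
          resources:
            limits:
              nvidia.com/gpu: 1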
Kubernetes excels at repeatable automation that is expressed with clear, declarative, and idempotent desired-state configuration. It provides automatic self-healing capabilities, which in KubeRay 2.0 or later extend to preventing the Ray cluster from crashing when the head node goes down. In fact, the Ray Serve docs specifically recommend Kubernetes for production workloads, using the RayService custom resource to automatically handle health checking, status reporting, failure recovery and upgrades. On GKE, the declarative YAML-based approach not only simplifies deployment and management but can also be used to provision security and isolation. This is achieved by integrating Kubernetes RBAC with Google Cloud's Identity and Access Management (IAM), allowing administrators to finely tune the permissions granted to each Ray cluster. For instance, a Ray cluster that requires access to a Google Cloud Storage bucket for data ingestion or model storage can be assigned specific roles that limit its actions to reading and writing to that bucket only. This is configured by specifying the Kubernetes service account (KSA) as part of the pod template for the Ray cluster `workerGroupSpec` and then linking a Google service account with appropriate permissions to the KSA using the Workload Identity annotation.

Easy multi-team sharing with Kubernetes namespaces

Out of the box, Ray does not have any security separation between Ray clusters. With Kubernetes you can leverage namespaces to create a Ray cluster per team, and use Kubernetes Role-Based Access Control (RBAC), Resource Quotas and Network Policies. This creates a namespace-based trust boundary that allows multiple teams to each manage their Ray clusters within a larger shared Kubernetes cluster.

Flexibility and portability

You can use Kubernetes for more than just data and AI. As a general-purpose platform, Kubernetes is portable across clouds and on-premises, and has a rich ecosystem. With Kubernetes, you can mix Ray and non-Ray workloads on the same infrastructure, allowing the central platform team to manage a single common compute layer while leaving infrastructure and resource management to GKE. Think of it as your own personal SRE.

Get started with KubeRay on GKE

In conclusion, running Ray on GKE is a straightforward way to achieve scalability, cost-efficiency, fault tolerance and isolation for your production workloads, all while ensuring cloud portability. You get the flexibility to adapt quickly to changing demands, making it an ideal choice for forward-thinking organizations in an ever-evolving generative AI landscape. To get started with KubeRay on GKE, follow these instructions. The repo has Terraform templates to run KubeRay on GPUs and TPUs, and examples for training and serving. You can also find more tutorials and code samples on the AI/ML on GKE page. View the full article
  12. Creating and managing the security of Kubernetes clusters is a lot like building or renovating a house. Both require making concessions across many areas when trying to find a balance between security, usability and maintainability. For homeowners, these choices include utility and aesthetic options, such as installing floors, fixtures, benchtops, and tiles. They also include security decisions: what types of doors, locks, lights, cameras, and sensors should you install? How should they be connected, monitored, and maintained? Who do you call when there's a problem? Kubernetes clusters are similar: Each cluster is like a house you're constructing. The initial security decisions you make determine how well you can detect and respond to attacks. Wouldn't it be nicer if all of those decisions were made by experts who stuck around and upgraded your house when the technology improved? This is where GKE Autopilot comes in. We use Google Cloud's deep Kubernetes security expertise to configure your clusters to be move-in ready for your production workloads. Autopilot is a great example of Google Cloud's shared fate operating model, where we work to be proactive partners in helping you to achieve your desired security outcomes on our platform. With Google Cloud's security tools we give you the means to run an entire city of Kubernetes clusters more securely. The work we do to configure cluster-level security depends on which mode you choose for the cluster. In Standard mode, Google Cloud handles all of the Kubernetes security configuration, but leaves many node configuration decisions and in-cluster policy configuration up to you. In Autopilot mode, we fully configure nodes, node pools, and in-cluster policy for you according to security best practices, allowing you to focus on workload-specific security.

Say goodbye to node security responsibilities

By fully managing the nodes in Autopilot mode, Google Cloud handles the complexity of the node security surface, while still letting you use the flexible Kubernetes API to fine-tune the configuration. This approach can help solve security challenges, including:

- Keeping nodes patched, balancing speed of patching with availability
- Preventing overprivileged containers that create risk of node breakouts
- Preventing unauthorized access and changes to nodes via SSH, privileged volume types, webhooks, and the certificates API
- Protecting privileged namespaces such as kube-system from unauthorized access
- Allowlisting privileges required for common security tools from our Autopilot partner workloads list

The benefits of this shift in responsibility extend beyond stopping insecure practices — it also makes configuring some important security features much easier. On Autopilot, enabling GKE Sandbox is simple because we manage the nodes. When you set the runtimeClass option in your podspec, Autopilot figures out the correct node configuration to run the sandboxed container for you. Additionally, GKE Security Posture vulnerability scanning is on by default. By taking on the responsibility for node security, we continue to make usability improvements while also tightening host security over time. We add new defenses and increasingly sophisticated detection as technology improves, without you needing to migrate node settings or configuration.
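As an illustration of the GKE Sandbox point above, a pod that opts into the sandbox only needs the runtimeClassName field; everything else in the sketch below (names, image, resource sizes) is a hypothetical placeholder.

apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-app               # hypothetical name
spec:
  runtimeClassName: gvisor          # requests GKE Sandbox; Autopilot provisions compatible nodes
  containers:
  - name: app
    image: us-docker.pkg.dev/my-project/my-repo/my-app:v1   # placeholder image
    resources:
      requests:
        cpu: 500m
        memory: 512Mi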
Built-in cluster policy

In-cluster policy is responsible for making sure Kubernetes API objects like Pods and Services are well-configured, don't create security risks, and are compliant with your organization's security best practices. In particular, you need to avoid pods that have the ability to "break out" to the node and access privileged agents or interfere with other workloads. Typically this work is done by installing tooling like Policy Controller or a third-party product like Gatekeeper. However, even managed policy introduces some overhead:

- Deciding which tooling and vendor you prefer
- Identifying which controls are appropriate for your organization's containers
- Creating exemptions for privileged security and monitoring tooling
- Creating self-serve exemptions for developers needing to debug containers

Autopilot removes that work by installing policies that meet security best practices but allow the majority of workloads to run without modification. We work with a vetted set of third-party security partners to ensure their tools and monitoring work out of the box without customer effort. We have built in some self-service security exemptions that keep workloads protected by default but still allow developers to debug problems. Autopilot's built-in policy implements 93% of the Kubernetes Baseline security standard and 60% of the Restricted standard. While most users will find this to be a good balance of security and usability, if you have additional security or compliance requirements, you can address those using an additional policy tool as mentioned above. The benefits of this policy aren't just more compliance checkboxes. From Jan 2022 to December 2023, the default Autopilot configuration protected clusters against 62% of the container breakout vulnerabilities reported through Google's kCTF VRP and kernelCTF.

GKE security configuration done right

Autopilot helps prevent insecure configuration at the GKE API layer and simplifies policy concerns by standardizing the configuration. This is the building code that reduces the risk of electrical fires. Our goal with GKE has always been to build in security by default, and over the years we've added always-on security in multiple parts of the product. Some highlights include enabling Shielded Nodes and auto-upgrades by default to protect against node tampering and to keep clusters patched. As we introduced those changes, we retained the option to disable some of these security features on GKE Standard, in order to maintain backwards compatibility. This adds overhead for cluster admins: to keep clusters secure at scale, you'd need to configure Organization Policy Service or Terraform validation for each feature. With Autopilot, we started fresh and removed all of those less secure options. Strong security features including Workload Identity, auto-upgrades, and Shielded Nodes are always on and can't be turned off. As a result of ensuring that all new clusters are Autopilot clusters, your security policy management is simplified. Additionally, moving security feature configuration into the workload and out of the node pool API provides a consolidated surface on which to enforce policy.

Try it out today

Autopilot is the default choice for all new GKE clusters, and gives you move-in ready security. It sets up a baseline of in-cluster security policy for you, and allows common, trusted security tools.
It simplifies enforcing cluster security practices by moving responsibility for node security to Google, protecting the node and system components, and removing legacy security options. Try it out by creating a cluster today. View the full article
  13. In today's cloud-native landscape, organizations demand both agile API management and the flexibility of microservices running on Kubernetes. Google Cloud offers powerful solutions for comprehensive API management capabilities with Apigee and Google Kubernetes Engine (GKE) for orchestration of containerized applications. But bridging these environments for streamlined API traffic to your Kubernetes-based microservices can present hurdles. Fortunately, there's Private Service Connect (PSC), which you can use to create private and secure connections from your VPCs to Google, third parties, or your own services. In this scenario, PSC provides a secure, efficient, and elegant solution for the southbound communication from Apigee to your backend targets.

The connectivity challenge

The traditional methods of bridging Apigee and GKE introduce a series of networking complications. Some of the typical challenges hindering streamlined connectivity between Apigee and GKE include:

- Public exposure: Routing API traffic from Apigee to GKE often necessitates exposing GKE services publicly. This introduces security risks by opening potential attack vectors.
- Traffic bottlenecks and cost: Relying exclusively on public internet connectivity can result in unpredictable latency, affecting performance and driving up egress costs.
- Management complexity: Configuring network and VPC Peering connections, ensuring CIDR ranges do not overlap, and managing access permissions across platforms like Apigee and GKE can create a complex administrative burden.

The Private Service Connect approach

PSC fundamentally transforms the way your Apigee deployment interacts with GKE services, unlocking a new level of security and simplicity using:

- Private, internal networking: PSC exposes services from one VPC to another directly via a service attachment and endpoint, without the need for complex VPC Peering or exposing clusters to the public internet. PSC facilitates the establishment of private endpoints within your Google Cloud VPC for your GKE Gateway. This enables Apigee to communicate with your Kubernetes service directly via Google Cloud's internal network, bypassing the public internet entirely and avoiding complex VPC Peering implementations.
- Granular security controls: PSC allows you to expose only the specific services within your GKE cluster that Apigee is authorized to access. This granular approach minimizes risk.
- Performance improvement: Private connectivity over Google Cloud's high-performance network fosters reduced latency and greater reliability, enhancing overall API response times.

Key components in the mix

- Apigee: Google Cloud's robust API management platform simplifies creating, managing, securing, and monitoring APIs, providing policy enforcement, analytics, and monetization capabilities.
- GKE Gateway: GKE's implementation of the Kubernetes Gateway API (a newer, more flexible API specification) is an advanced networking resource that improves upon the Ingress object and provides expanded routing and load balancing features for internal and external traffic within your GKE cluster.
- Private Service Connect: The cornerstone of this integration, PSC sets up a private connection and manages the routing between your Apigee VPC and the VPC hosting your GKE cluster.

Setting up the secure bridge

To establish this connectivity, you must complete the following steps. Here, we focus on the details of GKE Gateway creation, as the other steps are straightforward and well documented.
1. Deploy the internal GKE Gateway: This is a detailed process, but at a high level the following steps must be completed:

- Create a VPC-native cluster. A VPC-native cluster is a cluster whose Pods and Services directly use IP addresses from your Google Cloud VPC network.
- Enable the Gateway API and the HTTP Load Balancing add-on. You can do this during cluster creation or update, as needed.
- Configure a proxy-only subnet. This is a different subnet from the subnet your GKE cluster runs in. Ensure it is created with the REGIONAL_MANAGED_PROXY purpose.
- Create a named IP address. Named IP addresses let you reserve a static IP to be used by your Gateway resource. If you do not specify an IP address on the Gateway, the Gateway controller automatically provides one.

NOTE: At the time of this publication, PSC service attachments do not support forwarding rules that use a shared IP address (SHARED_LOADBALANCER_VIP). This is the default purpose of the IP address used by Gateways if a named IP address is not reserved, or if a named IP address is created without an explicitly defined purpose. To work around this, we instead create a named IP address with the GCE_ENDPOINT purpose using the gcloud command below. The implications of using a GCE_ENDPOINT IP address are:

- The Gateway cannot use the same IP address for both HTTP and HTTPS traffic from Apigee to the Gateway, or perform HTTP-to-HTTPS redirects on traffic to the Gateway on the same address. If the communication from Apigee to the Gateway uses only HTTP or only HTTPS, this implication is not significant. If you need both from Apigee to the Gateway, you will need to handle this with two separate Gateways (one for HTTP and the other for HTTPS), and therefore two named IP addresses.
- This does not refer to the initial traffic from the origin client into Apigee. The external L7 load balancer that handles this initial traffic still uses the SHARED_LOADBALANCER_VIP purpose and can handle both HTTP and HTTPS traffic and perform HTTP-to-HTTPS redirects.
- This step can be skipped once PSC service attachments support forwarding rules that use shared IP addresses of purpose SHARED_LOADBALANCER_VIP. Verify this is still a limitation before completing this step.

gcloud compute addresses create <IP_ADDRESS_NAME> \
    --purpose=GCE_ENDPOINT \
    --region=<COMPUTE_REGION> \
    --subnet=<SUBNET> \
    --project=<PROJECT_ID>

- Create the Gateway resource. A sample YAML file to create the Gateway using a named IP address is outlined below:

kind: Gateway
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: sample-gateway
spec:
  gatewayClassName: gke-l7-rilb
  listeners:
  - name: http
    protocol: HTTP
    port: 80
  addresses:
  - type: NamedAddress
    value: <IP_ADDRESS_NAME>

2. Set up a PSC service attachment: Create a PSC service attachment in the VPC network where the GKE Gateway is deployed. Be sure to use the forwarding rule created for the GKE Gateway to configure the service attachment.

3. Create an Endpoint attachment: This is configured in the Apigee organization.
4. Configure the Gateway routes to your GKE services: The benefit of this approach is that deploying containerized applications and Routes can be decoupled from the Gateway deployment process.

Conclusion

Private Service Connect improves how Apigee interacts with GKE workloads, delivering enhanced security, better performance, and increased operational efficiency. By sidestepping the public internet and complex VPC Peering management and embracing Google Cloud's secure internal infrastructure, you can reduce risk and gain a streamlined and robust backbone for your cloud-native API services. View the full article
  14. As more customers use multiple cloud services or microservices, they face the difficulty of consistently managing and connecting their services across various environments, including on-premises, different clouds, and existing legacy systems. HashiCorp Consul's service mesh addresses this challenge by securely and consistently connecting applications on any runtime, network, cloud platform, or on-premises setup. In the Google Cloud ecosystem, Consul can be deployed across Google Kubernetes Engine (GKE) and Anthos GKE. Now, Consul 1.16 is also supported on GKE Autopilot, Google Cloud's fully managed Kubernetes platform for containerized workloads. Consul 1.17 is currently on track to be supported on GKE Autopilot later this year.

Benefits of GKE Autopilot

In 2021, Google Cloud introduced GKE Autopilot, a streamlined configuration for Kubernetes that follows GKE best practices, with Google managing the cluster configuration. Reducing the complexity that comes with workloads using Kubernetes, Google's GKE Autopilot simplifies operations by managing infrastructure, control plane, and nodes, while reducing operational and maintenance costs. Consul is the latest partner product to be generally available, fleet-wide, on GKE Autopilot. By deploying Consul on GKE Autopilot, customers can connect services and applications across clouds, platforms, and services while realizing the benefits of a simplified Kubernetes experience. The key benefits of using Autopilot include more time to focus on building your application, a strong security posture out-of-the-box, and reduced pricing — paying only for what you use:

- Focus on building and deploying your applications: With Autopilot, Google manages the infrastructure using best practices for GKE. Using Consul, customers can optimize operations through centralized management and automation, saving valuable time and resources for developers.
- Out-of-the-box security: With years of Kubernetes experience, GKE Autopilot implements GKE-hardening guidelines and security best practices, while blocking features deemed less safe (i.e. privileged pod- and host-level access). As a part of HashiCorp's zero trust security solution, Consul enables least-privileged access by using identity-based authorization and service-to-service encryption.
- Pay-as-you-go: GKE Autopilot's pricing model simplifies billing forecasts and attribution because it's based on resources requested by your pods. Visit the Google Cloud and HashiCorp websites to learn more about GKE Autopilot pricing and HashiCorp Consul pricing.

Deploying Consul on GKE Autopilot

Deploying Consul on GKE Autopilot facilitates service networking across a multi-cloud environment or microservices architecture, allowing customers to quickly and securely deploy and manage Kubernetes clusters. With Consul integrated across Google Cloud Kubernetes, including GKE, GKE Autopilot, and Anthos GKE, Consul helps bolster application resilience, increase uptime, accelerate application deployment, and improve security across service-to-service communications for clusters, while reducing overall operational load.
Today, you can deploy Consul service mesh on GKE Autopilot using the following configuration for Helm in your values.yaml file:

global:
  name: consul
connectInject:
  enabled: true
  cni:
    enabled: true
    logLevel: info
    cniBinDir: "/home/kubernetes/bin"
    cniNetDir: "/etc/cni/net.d"

In addition, if you are using a Consul API gateway for north-south traffic, you will need to configure the Helm chart so you can leverage the existing Kubernetes Gateway API resources provided by default when provisioning GKE Autopilot. We recommend the configuration shown below for most deployments on GKE Autopilot as it provides the greatest flexibility by allowing both API gateway and service mesh workflows. Refer to Install Consul on GKE Autopilot for more information.

global:
  name: consul
connectInject:
  enabled: true
  apiGateway:
    manageExternalCRDs: false
    manageNonStandardCRDs: true
  cni:
    enabled: true
    logLevel: info
    cniBinDir: "/home/kubernetes/bin"
    cniNetDir: "/etc/cni/net.d"

Learn more

You can learn more about the process that Google Cloud uses to support HashiCorp Consul workloads on GKE Autopilot clusters with this GKE documentation and resources page. Here's how to get started on Consul:

- Learn more in the Consul documentation.
- Begin using Consul 1.16 by installing the latest Helm chart, and learn how to use a multi-port service in Consul on Kubernetes deployments.
- Try Consul Enterprise by starting a free trial.
- Sign up for HashiCorp-managed HCP Consul.

View the full article
  15. In Kubernetes, persistent volumes were initially managed by in-tree plugins, but this approach hindered development and feature implementation since in-tree plugins were compiled and shipped as part of the Kubernetes source code. To address this, the Container Storage Interface (CSI) was introduced, standardizing how storage systems are exposed to containerized workloads. CSI drivers for standard volumes like Google Cloud Persistent Disk were developed and are continuously evolving, and the implementation of the in-tree plugins is being transitioned to CSI drivers. If you have a Google Kubernetes Engine (GKE) cluster that is still using in-tree volumes, follow the instructions below to learn how to migrate to CSI-provisioned volumes.

Why migrate?

There are various benefits to using the gce-pd CSI driver, including improved deployment automation, customer-managed encryption keys (CMEK), volume snapshots, and more. In GKE version 1.22 and later, CSI Migration is enabled: existing volumes that use the gce-pd provisioner are managed through CSI drivers via transparent migration in the Kubernetes controller backend, and no changes are required to any StorageClass. However, you must use the pd.csi.storage.gke.io provisioner in the StorageClass to enable features like CMEK or volume snapshots. An example StorageClass with an in-tree storage plugin versus a CSI driver:

apiVersion: storage.k8s.io/v1
kind: StorageClass
...
provisioner: kubernetes.io/gce-pd        <--- in-tree
provisioner: pd.csi.storage.gke.io       <--- CSI provisioner

[Please perform the below actions in your test/dev environment first]

Before you begin: To test migration, create a GKE cluster. Once the cluster is ready, check the provisioner of your default storage class. If it's already the CSI provisioner pd.csi.storage.gke.io, change it to gce-pd (in-tree) by following these instructions. Refer to this page if you want to deploy a stateful PostgreSQL database application in a GKE cluster. We will refer to this sample application throughout this blog. Again, make sure that a storage class (standard) with the gce-pd provisioner creates the volumes (PVCs) attached to the pods. As a next step, we will back up this application using Backup for GKE (BfG) and restore the application while changing the provisioner from gce-pd (in-tree) to pd.csi.storage.gke.io (the CSI driver).

Create a backup plan

Please follow this page to ensure you have BfG enabled on your cluster. When you enable the BfG agent in your GKE cluster, BfG provides a CustomResourceDefinition that introduces a new kind of Kubernetes resource: the ProtectedApplication. For more on ProtectedApplication, please visit this page. A sample manifest file:

kind: ProtectedApplication
apiVersion: gkebackup.gke.io/v1alpha2
metadata:
  name: postgresql
  namespace: blog
spec:
  resourceSelection:
    type: Selector
    selector:
      matchLabels:
        app.kubernetes.io/name: postgresql-ha
  components:
  - name: postgresql
    resourceKind: StatefulSet
    resourceNames: ["db-postgresql-ha-postgresql"]
    strategy:
      type: BackupAllRestoreAll
      backupAllRestoreAll: {}

If the Ready to backup status shows as true, your application is ready for backup.
❯ kubectl describe protectedapplication postgresql
......
Status:
  Ready To Backup: true

Let's create a backup plan following these instructions. Up until now, we have only created a backup plan and haven't taken an actual backup. But before we start the backup process, we have to bring down the application.

Bring down the application

We have to bring down the application right before taking its backup (this is where the application downtime starts). We do this to prevent any data loss during the migration. My application is currently exposed via a service db-postgresql-ha-pgpool with the following selectors. We'll patch this service by overriding the above selectors with a null value so that no new request can reach the database. Save this file as patch.yaml and apply it using kubectl:

spec:
  selector:
    app.kubernetes.io/instance: ""
    app.kubernetes.io/name: ""
    app.kubernetes.io/component: ""

❯ kubectl patch service db-postgresql-ha-pgpool --patch-file patch.yaml
service/db-postgresql-ha-pgpool patched

You should no longer be able to connect to your app (i.e., the database) now.

Start a backup manually

Navigate to the GKE Console → Backup for GKE → Backup Plans and click Start a backup as shown below.

Restore from the backup

We will restore this backup to a target cluster. Note that you do have the option to select the same cluster as both your source and your target cluster; the recommendation is to use a new GKE cluster as your target cluster. The restore process completes in the following two steps:

- Create a restore plan
- Restore a backup using the restore plan

Create a restore plan

You can follow these instructions to create a restore plan. While adding the transformation rule(s), we will change the storage class from standard to standard-rwo: Add transformation rules → Add Rule (Rename a PVC's Storage Class). Please see this page for more details. Next, review the configuration and create the plan.

Restore a backup using the (previously created) restore plan

When a backup is restored, the Kubernetes resources are re-created in the target cluster. Navigate to the GKE Console → Backup for GKE → BACKUPS tab to see the latest backup(s). Select the backup you took before bringing down the application to view its details and click SET UP A RESTORE. Fill in all the mandatory fields and click RESTORE. Once done, switch the context to the target cluster and see how BfG has restored the application successfully in the same namespace. The data was restored into new PVCs (verify with kubectl -n blog get pvc). Their storage class is gce-pd-gkebackup-de, which is a special storage class used to provision volumes from the backup. Let's get the details of one of the restored volumes to confirm that BfG has successfully changed the provisioner from in-tree to CSI. The new volumes are created by the CSI provisioner. Great!

Bring up the application

Let's patch the service db-postgresql-ha-pgpool back with the original selectors to bring our application up. Save this patch file as new_patch.yaml and apply it using kubectl. We are able to connect to our database application now.
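For reference, a StorageClass backed by the CSI provisioner, similar to the standard-rwo class the transformation rule maps workloads to, looks roughly like the sketch below; the class name and the commented CMEK parameter are illustrative, not values from this walkthrough.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: balanced-csi-example        # illustrative name
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-balanced
  # Optional customer-managed encryption, one of the features that requires the CSI driver:
  # disk-encryption-kms-key: projects/PROJECT/locations/REGION/keyRings/RING/cryptoKeys/KEY
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true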
Note: This downtime will depend on your application size. For more information, please see this link.

Use it today

Backup for GKE can help you reduce the overhead of this migration with minimal downtime. It can also help you prepare for disaster recovery. View the full article
  16. Filesystem in Userspace (FUSE) is an interface used to export a filesystem to the Linux kernel. Cloud Storage FUSE allows you to mount Cloud Storage buckets as a file system so that applications can access the objects in a bucket using common file I/O operations (e.g., open, read, write, close) rather than using cloud-specific APIs. Cloud Storage FUSE is generally available. Google Kubernetes Engine (GKE) offers turn-key integration with Cloud Storage FUSE using the Cloud Storage FUSE CSI driver. The CSI driver lets you use the Kubernetes API to consume pre-existing Cloud Storage buckets as persistent volumes. Your applications can upload and download objects using Cloud Storage FUSE file system semantics. The Cloud Storage FUSE CSI driver is a fully managed experience powered by the open-source Google Cloud Storage FUSE CSI driver. You get portability, reliability, performance, and out-of-the-box GKE integration.

Powering AI workloads with data portability

Cloud Storage is a common choice for AI/ML workloads because of its near-limitless scale, simplicity, affordability, and performance. While some AI/ML frameworks have libraries that support native object-storage APIs directly, others require file-system semantics. Moreover, many teams want to standardize on file-system semantics for a consistent experience across environments. With the Cloud Storage FUSE CSI driver, GKE workloads can access Cloud Storage buckets mounted as a local file system using the Kubernetes API, providing data portability across various environments. Here is an example showing how to access your Cloud Storage objects using common file-system semantics in a Job workload on GKE:

apiVersion: batch/v1
kind: Job
metadata:
  name: gcs-fuse-csi-job-example
  namespace: my-namespace
spec:
  template:
    metadata:
      annotations:
        gke-gcsfuse/volumes: "true"
    spec:
      serviceAccountName: my-sa
      containers:
      - name: data-transfer
        image: busybox
        command:
        - "/bin/sh"
        - "-c"
        - touch /data/my-file;
          ls /data | grep my-file;
          echo "Hello World!" >> /data/my-file;
          cat /data/my-file | grep "Hello World!";
        volumeMounts:
        - name: gcs-fuse-csi-ephemeral
          mountPath: /data
      volumes:
      - name: gcs-fuse-csi-ephemeral
        csi:
          driver: gcsfuse.csi.storage.gke.io
          volumeAttributes:
            bucketName: my-bucket-name
      restartPolicy: Never
  backoffLimit: 1

You can find more AI/ML workload examples using the CSI driver to consume objects in Cloud Storage buckets:

- Train a model with GPUs on GKE Standard mode
- Serve a model with a single GPU in GKE Autopilot
- TPUs in GKE introduction

With the Cloud Storage FUSE CSI driver, you only need to specify a Cloud Storage bucket and a service account, and GKE takes care of the rest. If you want more control or want to learn more about the underlying design, read on.

Making easy things easy, but hard things possible

You can easily specify the Cloud Storage bucket using in-line CSI ephemeral volumes in your Pod specification, which does not require maintaining separate PersistentVolume (PV) and PersistentVolumeClaim (PVC) objects. If compatibility or authentication requirements make the conventional PV/PVC approach preferable, the CSI driver also supports the standard static provisioning approach.
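For teams that prefer the conventional PV/PVC route, a static-provisioning sketch might look like the following; the bucket, namespace, and object names are placeholders, and the authoritative field list is in the GKE documentation.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: gcs-fuse-csi-pv              # placeholder name
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 5Gi                     # arbitrary value; the bucket itself is not size-limited by this
  storageClassName: example-storage-class
  mountOptions:
  - implicit-dirs
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeHandle: my-bucket-name     # the Cloud Storage bucket to mount
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gcs-fuse-csi-static-pvc      # placeholder name
  namespace: my-namespace
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  volumeName: gcs-fuse-csi-pv
  storageClassName: example-storage-class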
This approach offers the ReadWriteMany access mode, enabling multiple applications to consume data using a single PV/PVC. The CSI driver automatically injects a sidecar container into your Pod specification, so you can focus on your application code and don't have to worry about managing the Cloud Storage FUSE runtime. Additionally, we designed an auto-termination mechanism that terminates the sidecar container when it is no longer needed. This design prevents the sidecar container from blocking the termination of your Kubernetes Job workloads. When the sidecar container is injected, we set default resource allocation and mount options that work well with most lightweight workloads. We also provide the flexibility for you to fine-tune the mount options and resource allocation for the Cloud Storage FUSE runtime.

Tight constraints leading to an innovative design

Running FUSE drivers on Kubernetes has several limitations that could hinder adoption. We developed a new sidecar-based solution to overcome these limitations by leveraging a kernel cross-process file descriptor transfer to cleanly separate privileges, authentication, workload lifecycle, and resource usage. Traditional FUSE CSI drivers run all FUSE instances inside the CSI driver container, which creates a single point of failure. In some cases, FUSE instances run directly on the node's underlying virtual machine (VM), but this can consume reserved system resources. Our solution is different. Cloud Storage FUSE runs alongside workload containers in a sidecar container. This per-Pod model ties the Cloud Storage FUSE lifecycle to the lifecycle of the workload. Since Cloud Storage FUSE runs as part of your workload, you can use Workload Identity to authenticate it. This means you have fine-grained, Pod-level IAM access control without having to manually manage your service account credentials. Additionally, the resources consumed by Cloud Storage FUSE are accounted for by the Pod, which makes it possible to fine-tune performance. Another major challenge of running FUSE drivers in sidecar containers is that the FUSE runtime requires elevated container privileges. This restriction makes the sidecar-based solution almost impossible for GKE Autopilot, because privileged containers are not permitted on Autopilot clusters. Our solution introduced a two-step FUSE mount technique that allows the FUSE sidecar container to run without privileges. Only the CSI driver container needs to be a privileged container. The following diagram illustrates how the two-step FUSE mount process works.

- Step 1: The CSI driver opens the "/dev/fuse" device on the node VM and obtains the file descriptor. It then calls the Linux command mount.fuse3 with the file descriptor to create the mount point. Finally, the CSI driver calls the Linux sendmsg command to send the file descriptor to the sidecar container via a Unix Domain Socket (UDS) in an emptyDir.
- Step 2: In the sidecar container, a launcher process connects to the UDS and calls the Linux recvmsg command to receive the file descriptor. The process then launches the FUSE instance using the file descriptor to serve the FUSE mount point.

Conclusion

The Cloud Storage FUSE CSI driver is a fully managed GKE solution that lets you use the Kubernetes API to access pre-existing Cloud Storage buckets in your applications. The innovative sidecar-based design solves privilege, authentication, resource allocation, and FUSE lifecycle management issues.
See the GKE documentation for details: Access Cloud Storage buckets with the Cloud Storage FUSE CSI driver. Submit issues on the GitHub repository. View the full article
  17. If you run workloads on Kubernetes, chances are you've experienced a "cold start": a delay in launching an application that happens when workloads are scheduled to nodes that haven't hosted the workload before and the pods need to spin up from scratch. The extended startup time can lead to longer response times and a worse experience for your users — especially when the application is autoscaling to handle a surge in traffic.

What's going on during a cold start?

Deploying a containerized application on Kubernetes typically involves several steps, including pulling container images, starting containers, and initializing the application code. These processes all add to the time before a pod can start serving traffic, resulting in increased latency for the first requests served by a new pod. The initial startup time can be significantly longer because the new node has no pre-existing container image. For subsequent requests, the pod is already up and warm, so it can quickly serve requests without extra startup time. Cold starts are frequent when pods are continuously being shut down and restarted, as that forces requests to be routed to new, cold pods. A common solution is to keep warm pools of pods ready to reduce the cold start latency. However, with larger workloads like AI/ML, and especially on expensive and scarce GPUs, the warm pool practice can be very costly. So cold starts are especially prevalent for AI/ML workloads, where it's common to shut down pods after completed requests. Google Kubernetes Engine (GKE) is Google Cloud's managed Kubernetes service, and can make it easier to deploy and maintain complex containerized workloads. In this post, we'll discuss four different techniques to reduce cold start latency on GKE, so you can deliver responsive services.

Techniques to overcome the cold start challenge

1. Use ephemeral storage with local SSDs or larger boot disks

Nodes mount the kubelet and container runtime (docker or containerd) root directories on a local SSD. As a result, the container layer is backed by the local SSD, with the IOPS and throughput documented on About local SSDs. This is usually more cost-effective than increasing the PD size. The following table compares the options and demonstrates that for the same cost, local SSD has ~3x more throughput than PD, allowing the image pull to run faster and reducing the workload's startup latency.

At the same monthly cost:

| Cost | Local SSD storage (GB) | Local SSD read (MB/s) | Local SSD write (MB/s) | PD Balanced storage (GB) | PD Balanced R+W (MB/s) | Local SSD / PD (read) | Local SSD / PD (write) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| $ | 375 | 660 | 350 | 300 | 140 | 471% | 250% |
| $$ | 750 | 1320 | 700 | 600 | 168 | 786% | 417% |
| $$$ | 1125 | 1980 | 1050 | 900 | 252 | 786% | 417% |
| $$$$ | 1500 | 2650 | 1400 | 1200 | 336 | 789% | 417% |

You can create a node pool that uses ephemeral storage with local SSDs in an existing cluster running on GKE version 1.25.3-gke.1800 or later:

gcloud container node-pools create POOL_NAME \
    --cluster=CLUSTER_NAME \
    --ephemeral-storage-local-ssd count=<NUMBER_OF_DISKS> \
    --machine-type=MACHINE_TYPE

For more, see Provision ephemeral storage with local SSDs.

2. Enable container image streaming

Image streaming can allow workloads to start without waiting for the entire image to be downloaded, leading to significant improvements in workload startup time.
For example, with GKE image streaming, the end-to-end startup time (from workload creation to the server being up for traffic) for an NVIDIA Triton Server (5.4GB container image) can be reduced from 191s to 30s. You must use Artifact Registry for your containers and meet the requirements. Image streaming can be enabled on the cluster by running:

gcloud container clusters create CLUSTER_NAME \
    --zone=COMPUTE_ZONE \
    --image-type="COS_CONTAINERD" \
    --enable-image-streaming

To learn more, see Use Image streaming to pull container images.

3. Use Zstandard compressed container images

Zstandard compression is a feature supported in containerd. The Zstandard benchmark shows that zstd decompression is >3x faster than gzip (the current default). Here's how to use the zstd builder in docker buildx:

docker buildx create --name zstd-builder --driver docker-container \
    --driver-opt image=moby/buildkit:v0.10.3
docker buildx use zstd-builder

Here's how to build and push an image:

IMAGE_URI=us-central1-docker.pkg.dev/taohe-gke-dev/example
IMAGE_TAG=v1

<Create your Dockerfile>

docker buildx build --file Dockerfile --output type=image,name=$IMAGE_URI:$IMAGE_TAG,oci-mediatypes=true,compression=zstd,compression-level=3,force-compression=true,push=true .

Please note that Zstandard is incompatible with image streaming. If your application requires the majority of the container image content to be loaded before starting, it's better to use Zstandard. If your application needs only a small portion of the total container image to be loaded to start executing, then try image streaming.

4. Use a preloader DaemonSet to preload the base container on nodes

Last but not least, containerd reuses image layers across different containers if they share the same base container. And the preloader DaemonSet can start running even before the GPU driver is installed (driver installation takes ~30 seconds). That means it can preload required containers before the GPU workload can be scheduled to the GPU node, and start pulling images ahead of time. Below is an example of the preloader DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: container-preloader
  labels:
    k8s-app: container-preloader
spec:
  selector:
    matchLabels:
      k8s-app: container-preloader
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: container-preloader
        k8s-app: container-preloader
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      tolerations:
      - operator: "Exists"
      containers:
      - image: "<CONTAINER_TO_BE_PRELOADED>"
        name: container-preloader
        command: [ "sleep", "inf" ]

Outsmarting the cold start

The cold start challenge is a common problem in container orchestration systems.
With careful planning and optimization, you can mitigate its impact on your applications running on GKE. By using ephemeral storage with local SSDs or larger boot disks, enabling container image streaming or Zstandard compression, and preloading the base container with a DaemonSet, you can reduce cold start delays and ensure a more responsive and efficient system. To learn more about GKE, check out the user guide. View the full article
  18. Google Kubernetes Engine (GKE) traces its roots back to Google's development of Borg in 2004, a Google-internal system managing clusters and applications. In 2014, Google introduced Kubernetes, an open-source platform based on Borg's principles, which gained rapid popularity for automating containerized application deployment. In 2015, Google launched GKE, a managed Kubernetes service on Google Cloud Platform (GCP), streamlining infrastructure management for users. Over the years, GKE saw substantial growth, introducing Autopilot mode in 2017 and surpassing 2 million managed clusters by 2019. Throughout 2020 and 2021, GKE continued evolving with features like Workload Identity and Anthos Config Management, aiding in access management and configurations. In 2022, GKE maintained its status as a leading managed Kubernetes platform, introducing innovations like Anthos Service Mesh and Anthos Multi-Cluster Ingress. GKE's trajectory underscores the significance of containerization and cloud-native tech, poised to further empower the development, deployment, and management of modern applications amidst growing industry adoption. Ubuntu can be used as the base operating system for GKE nodes, and GKE can be used to deploy and manage containerized applications on Ubuntu nodes. Here are some of the benefits of using Ubuntu and GKE together:

- Flexibility: Ubuntu can be used with a variety of Kubernetes distributions, so development teams have a standardized experience across multiple clouds, such as Azure Kubernetes Service (AKS) and Amazon EKS.
- Secure: Timely updates and patches, building on Ubuntu's security track record.
- Stable: Long-term support with a consistent, predictable lifecycle and support commitment. Each GKE version has a customized version of Ubuntu that supports its latest features.
- Seamless: A developer-friendly, smooth experience from development to production.
- Small: Ultra-small, resource-efficient, yet robust images with chiselled (distroless) Ubuntu.

Let's go through the steps you need to deploy Ubuntu nodes in GKE:

1. In the Google Cloud Console, locate Kubernetes Engine.
2. In "Clusters", click CREATE.
3. Choose "Standard: You manage your cluster" in the pop-up window.
4. In the left of the console, click "default-pool".
5. Click "Nodes".
6. Find "Ubuntu with containerd (ubuntu_containerd)" in the "Image type" drop-down menu.
7. Choose "Ubuntu with containerd (ubuntu_containerd)" and click "CREATE" at the bottom.
8. Once the cluster is running, you can check "OS image" in the "Node details".

View the full article
  19. The recent boom in the AI landscape has seen larger and more complex models give rise to mind-blowing AI capabilities across a range of applications. At the same time, these larger models are driving up costs for AI compute astronomically; state-of-the-art LLMs cost tens of millions of dollars (or more) to train, with hundreds of billions of parameters and trillions of tokens of data to learn. ML teams need access to compute that is both scalable and price-efficient. They need the right infrastructure to operationalize ML activities and enhance developer productivity when working with large models. Moreover, they must maintain guardrails for orchestration and deployment to production. Developing, refining, optimizing, deploying, and monitoring ML models can be challenging and complex in the current AI landscape. However, the efficient orchestration, cost-effective performance, and scalability present in Google Kubernetes Engine (GKE), in tandem with Weights & Biases (W&B) Launch's user-friendly interface, simplifies the model development and deployment process for machine learning researchers. This integration seamlessly connects ML researchers to their training and inference infrastructure, making the management and deployment of machine learning models easier. In this blog, we show you how to use W&B Launch to set up access to either GPUs or Cloud Tensor Processing Units (TPUs) on GKE once, and from then easily grant ML researchers frictionless access to compute.

W&B Launch

W&B is an ML developer platform designed to enable ML teams to build, track, and deploy better models faster. As the system of record for ML activities, from experiment tracking to model registry management, W&B improves collaboration, boosts productivity, and overall helps simplify the complexity of modern ML workflows. W&B Launch connects ML practitioners to their cloud compute infrastructure. After a one-time configuration by an ML platform team, ML researchers can then select the target environment in which they want to launch training or inference jobs. W&B Launch automatically packages up all the code and dependencies for that job and sends it to the target environment, taking advantage of more powerful compute or parallelization to execute jobs faster and at greater scale. With jobs packaged up, practitioners can easily rerun jobs with small tweaks, such as changing hyperparameters or training datasets. ML teams also use W&B Launch to automate model evaluation and deployment workflows to manage shared compute resources more efficiently. "We're using W&B Launch to enable easy access to compute resources to dramatically scale our training workloads," said Mike Seddon, Head of Machine Learning and Artificial Intelligence at VisualCortex. "Having that ability to create queues to each cluster and activate them is exactly what we want to do."

Creating a GKE Cluster

GKE offers a fully-managed environment for deploying, managing, and scaling containerized applications using Kubernetes. ML teams often choose GKE over managing an open-source Kubernetes cluster because it provides the industry's only fully managed Kubernetes with a 99.9% Pod-level SLA backed by Google SREs, which reduces operational overhead and can improve an organization's security posture. To start using W&B Launch with GKE, create a GKE cluster with TPUs or with GPUs.

W&B Launch Jobs with GKE

The W&B Launch agent builds container images inside of GKE, capturing all the dependencies for that particular run.
Once W&B Launch is configured with the GKE cluster, ML engineers can easily start training jobs by accessing powerful GPUs or Google Cloud TPUs to accelerate and supercharge AI development. To get started, create an account, install W&B, and start tracking your machine learning experiments in minutes. You can then set up your W&B Launch queue and W&B Launch agent on your GKE cluster. W&B Launch queues must be configured to point to a specific target resource, along with any additional configuration specific to that resource. A Launch queue that points to a GKE cluster would include environment variables or set a custom namespace in its Launch queue configuration. When an agent receives a job from a queue, it also receives the queue configuration. Once you've created your queue, you can set up your W&B Launch agent, a long-running process that polls one or more W&B Launch queues for jobs in first-in, first-out (FIFO) order. The agent then submits the job to the target resource — your Cloud TPU nodes within your GKE cluster — along with the configuration options specified. Check out our documentation for more information on setting up your GKE cluster and agent.

Creating a W&B Launch job

Now that W&B Launch is set up with GKE, job execution can be handled through the W&B UI:

1. Identify the previously executed training run that has been tracked in W&B.
2. Select the specific code version used for the job under the version history tab.
3. You will see the W&B Launch button in the upper right-hand corner of the Python source screen.
4. After clicking the W&B Launch button, you'll be able to change any parameters for the experiment and select the GKE environment under the "Queue" menu.

A common use case for W&B Launch is to execute a number of hyperparameter tuning jobs in parallel. Setting up a hyperparameter sweep is simple: select "Sweep" on the left-hand toolbar, enter the range of the sweep for the hyperparameters, and select the "GKE" queue for the environment.

Conclusion

W&B Launch with GKE is a powerful combination that provides ML researchers and ML platform teams the compute resources and automation they need to rapidly increase the rate of experimentation for AI projects. To learn more, check out the full W&B Launch documentation and this repository of pre-built W&B Launch jobs.
  20. The recent boom in the AI landscape has seen larger and more complex models give rise to mind-blowing AI capabilities across a range of applications. At the same time, these larger models are driving up costs for AI compute astronomically; state-of-the-art LLMs cost tens of millions of dollars (or more) to train, with hundreds of billions of parameters and trillions of tokens of data to learn. ML teams need access to compute that is both scalable and price-efficient. They need the right infrastructure to operationalize ML activities and enhance developer productivity when working with large models. Moreover, they must maintain guardrails for orchestration and deployment to production. Developing, refining, optimizing, deploying, and monitoring ML models can be challenging and complex in the current AI landscape. However, the efficient orchestration, cost-effective performance, and scalability present in Google Kubernetes Engine (GKE), in tandem with Weights & Biases (W&B) Launch's user-friendly interface, simplifies the model development and deployment process for machine learning researchers. This integration seamlessly connects ML researchers to their training and inference infrastructure, making the management and deployment of machine learning models easier. In this blog, we show you how to use W&B Launch to set up access to either GPUs or Cloud Tensor Processing Units (TPUs) on GKE once, and from then easily grant ML researchers frictionless access to compute. W&B LaunchW&B is an ML developer platform designed to enable ML teams to build, track, and deploy better models faster. As the system of record for ML activities, from experiment tracking to model registry management, W&B improves collaboration, boosts productivity, and overall helps simplify the complexity of modern ML workflows. W&B Launch connects ML practitioners to their cloud compute infrastructure. After a one-time configuration by an ML platform team, ML researchers can then select the target environment in which they want to launch training or inference jobs. W&B Launch automatically packages up all the code and dependencies for that job and sends it to the target environment, taking advantage of more powerful compute or parallelization to execute jobs faster and at greater scale. With jobs packaged up, practitioners can easily rerun jobs with small tweaks, such as changing hyperparameters or training datasets. ML teams also use W&B Launch to automate model evaluation and deployment workflows to manage shared compute resources more efficiently. “We’re using W&B Launch to enable easy access to compute resources to dramatically scale our training workloads,” said Mike Seddon, Head of Machine Learning and Artificial Intelligence at VisualCortex. “Having that ability to create queues to each cluster and activate them is exactly what we want to do.” Creating a GKE ClusterGKE offers a fully-managed environment for deploying, managing, and scaling containerized applications using Kubernetes. ML teams often choose GKE over managing an open-source Kubernetes cluster because it provides the industry's only fully managed Kubernetes with a 99.9% Pod-level SLA backed by Google SREs, which reduces operational overhead and can improve an organization's security posture. To start using W&B Launch with GKE, create a GKE cluster with TPUs or with GPUs W&B Launch Jobs with GKEThe W&B Launch agent builds container images inside of GKE, capturing all the dependencies for that particular run. 
Once W&B Launch is configured with the GKE cluster, ML engineers can easily start training jobs on powerful GPUs or Google Cloud TPUs to accelerate AI development. To get started, create an account, install W&B, and start tracking your machine learning experiments in minutes. You can then set up your W&B Launch queue and W&B Launch agent on your GKE cluster.

W&B Launch queues must be configured to point to a specific target resource, along with any additional configuration specific to that resource. A Launch queue that points to a GKE cluster might, for example, include environment variables or set a custom namespace in its queue configuration. When an agent receives a job from a queue, it also receives the queue configuration. Once you’ve created your queue, you can set up your W&B Launch agents: long-running processes that poll one or more W&B Launch queues for jobs in first-in, first-out (FIFO) order. The agent then submits each job, along with the specified configuration options, to the target resource (your Cloud TPU nodes within your GKE cluster). Check out our documentation for more information on setting up your GKE cluster and agent; a minimal command-line sketch of this setup also appears after the conclusion below.

Creating a W&B Launch job

Now that W&B Launch is set up with GKE, job execution can be handled through the W&B UI:

1. Identify the previously executed training run that has been tracked in W&B.
2. Select the specific code version used for the job under the version history tab.
3. Click the W&B Launch button in the upper right-hand corner of the Python source screen.
4. Change any parameters for the experiment as needed and select the GKE environment under the “Queue” menu.

A common use case for W&B Launch is to execute a number of hyperparameter tuning jobs in parallel. Setting up a hyperparameter sweep is simple: select “Sweep” on the left-hand toolbar, enter the range of the sweep for each hyperparameter, and select the “GKE” queue for the environment.

Conclusion

W&B Launch with GKE is a powerful combination that gives ML researchers and ML platform teams the compute resources and automation they need to rapidly increase the rate of experimentation for AI projects. To learn more, check out the full W&B Launch documentation and this repository of pre-built W&B Launch jobs.
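The following is a rough command-line sketch of the install-and-agent flow described above, under the assumption that a Launch queue named gke-tpu-queue already exists for a W&B entity named my-team; both names are placeholders, and the exact agent flags should be verified against the W&B Launch documentation.

    # Install the W&B client and authenticate (prompts for your API key).
    pip install wandb
    wandb login

    # Assumption: start a long-running Launch agent that polls the queue backed
    # by the GKE cluster. Queue and entity names are placeholders; check the
    # flags against the W&B Launch docs before relying on them.
    wandb launch-agent --queue gke-tpu-queue --entity my-team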
  21. Today’s post is the first in a series explaining why administrators and architects should build batch processing platforms on Google Kubernetes Engine (GKE). Kubernetes has emerged as a leading container orchestration platform for deploying and managing containerized applications, speeding up the pace of innovation. It is not limited to running microservices: it also provides a powerful framework for orchestrating batch workloads such as data processing jobs, machine learning model training, scientific simulations, and other compute-intensive tasks. GKE is a managed Kubernetes solution that abstracts the underlying infrastructure and accelerates time-to-value in a cost-effective way. A batch platform processes batch workloads, defined as Kubernetes Jobs, in the order they are received. The batch platform might incorporate a queue to apply your business-case logic...
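To make the idea of a batch workload as a Kubernetes Job concrete, here is a minimal kubectl sketch; the job name, image, and command are placeholders for illustration rather than anything from the series.

    # Hypothetical example: submit a simple batch workload as a Kubernetes Job.
    kubectl create job sample-batch-job --image=busybox -- /bin/sh -c 'echo "processing batch item"; sleep 10'

    # Follow the Job to completion and read its output.
    kubectl get jobs sample-batch-job --watch
    kubectl logs job/sample-batch-job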
  22. Detecting vulnerabilities in open-source software requires a holistic approach, and security best practices recommend scanning early and often throughout your development lifecycle to help maintain an effective security posture. However, scanning only in the CI/CD pipeline or registry can miss artifacts and containers that are deployed to production through other mechanisms, while scanning only at runtime can miss software supply chain vulnerabilities. To address these concerns, Artifact Analysis, in partnership with Google Kubernetes Engine (GKE), is introducing a new offering called Advanced Vulnerability Insights, now in public preview. We’re also expanding the scanning language support for Artifact Registry... View the full article
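As a rough example of "scanning early and often" (an illustration, not the new GKE offering itself), the following gcloud sketch enables Artifact Registry vulnerability scanning and runs an on-demand scan of a locally built image before it is pushed; the image name is a placeholder.

    # Hypothetical example: enable automatic and on-demand vulnerability scanning.
    gcloud services enable containerscanning.googleapis.com ondemandscanning.googleapis.com

    # On-demand scan of a locally built image before pushing it to a registry.
    gcloud artifacts docker images scan my-app:latest
    # The scan returns a resource name that can then be passed to:
    #   gcloud artifacts docker images list-vulnerabilities <scan-resource-name>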
  23. Background

With the increasing popularity of AI-generated content (AIGC), open-source projects based on text-to-image AI models such as MidJourney and Stable Diffusion have emerged. Stable Diffusion is a diffusion model that generates realistic images from given text inputs. In this GitHub repository, we provide three different solutions for deploying Stable Diffusion quickly on Google Cloud: on Vertex AI, on Google Kubernetes Engine (GKE), and on Agones-based platforms, each designed to ensure stable service delivery through elastic infrastructure. This article focuses on the Stable Diffusion deployment on GKE and shows how to improve launch times by up to 4x.

Problem statement

The container image for Stable Diffusion is quite large, approximately 10-20 GB, which slows down image pulling during container startup and consequently affects launch time. In scenarios that require rapid scaling, launching new container replicas can take more than 10 minutes, significantly impacting user experience. During container launch, we see the following events in chronological order:

- Triggering Cluster Autoscaler for scaling + node startup and Pod scheduling: 225 seconds
- Image pull startup: 4 seconds
- Image pulling: 5 minutes 23 seconds
- Pod startup: 1 second
- sd-webui serving: more than 2 minutes

After analyzing this timeline, we can see that the slow startup of the Stable Diffusion WebUI running in the container is primarily due to the heavy dependencies of the runtime, which result in a large container image and prolonged image pulling and pod initialization. We therefore optimize the startup time along three lines:

- Optimizing the Dockerfile: selecting an appropriate base image and minimizing the installation of runtime dependencies to reduce the image size.
- Separating the base environment from the runtime dependencies: accelerating the creation of the runtime environment through PD disk images.
- Leveraging GKE Image Streaming: optimizing image loading time with GKE Image Streaming, and using Cluster Autoscaler to speed up elastic scaling and resizing.

This article introduces a solution that optimizes the startup time of the Stable Diffusion WebUI container by separating the base environment from the runtime dependencies and leveraging high-performance disk images.

Optimizing the Dockerfile

First of all, here's a reference Dockerfile based on the official installation instructions for the Stable Diffusion WebUI: https://github.com/nonokangwei/Stable-Diffusion-on-GCP/blob/main/Stable-Diffusion-UI-Agones/sd-webui/Dockerfile

In the initial container image for Stable Diffusion, we found that besides the NVIDIA runtime base image, there were also numerous installed libraries, dependencies, and extensions. Before optimization, the container image size was 16.3 GB. After analyzing the Dockerfile, we found that the NVIDIA runtime occupies approximately 2 GB, while the PyTorch library is a very large package, taking up around 5 GB; Stable Diffusion and its extensions also occupy some space. Following the principle of a minimal viable environment, we can remove unnecessary dependencies: we use the NVIDIA runtime as the base image and separate the PyTorch library, Stable Diffusion libraries, and extensions from the original image, storing them separately in the file system.
Below is the original Dockerfile snippet:

    # Base image
    FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

    RUN set -ex && \
        apt update && \
        apt install -y wget git python3 python3-venv python3-pip libglib2.0-0 pkg-config libcairo2-dev && \
        rm -rf /var/lib/apt/lists/*

    # Pytorch
    RUN python3 -m pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117

    …

    # Stable Diffusion
    RUN git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git
    RUN git clone https://github.com/Stability-AI/stablediffusion.git /stable-diffusion-webui/repositories/stable-diffusion-stability-ai
    RUN git -C /stable-diffusion-webui/repositories/stable-diffusion-stability-ai checkout cf1d67a6fd5ea1aa600c4df58e5b47da45f6bdbf

    …

    # Stable Diffusion extensions
    RUN set -ex && cd stable-diffusion-webui \
        && git clone https://gitcode.net/ranting8323/sd-webui-additional-networks.git extensions/sd-webui-additional-networks \
        && git clone https://gitcode.net/ranting8323/sd-webui-cutoff extensions/sd-webui-cutoff \
        && git clone https://github.com/toshiaki1729/stable-diffusion-webui-dataset-tag-editor.git extensions/stable-diffusion-webui-dataset-tag-editor

After moving the PyTorch libraries and Stable Diffusion out, only the NVIDIA runtime is retained in the base image. Here is the new Dockerfile:

    FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04
    RUN set -ex && \
        apt update && \
        apt install -y wget git python3 python3-venv python3-pip libglib2.0-0 && \
        rm -rf /var/lib/apt/lists/*

Using PD disk images to store libraries

PD disk images are the cornerstone of instance deployment in Google Cloud. Often referred to as templates or bootstrap disks, these virtual images contain the baseline operating system and all the application software and configuration your instance will have upon first boot. The idea here is to store all the runtime libraries and extensions in a disk image, which in this case has a size of 6.77 GB. The advantage of using a disk image is that it can support up to 1,000 simultaneous disk restores, making it suitable for large-scale scaling and resizing scenarios.

    # Create a disk from the library disk image and attach it to the GKE node.
    gcloud compute disks create sd-lib-disk-$NOW --type=pd-balanced --size=30GB --zone=$ZONE --image=$IMAGE_NAME

    gcloud compute instances attach-disk ${MY_NODE_NAME} --disk=projects/$PROJECT_ID/zones/$ZONE/disks/sd-lib-disk-$NOW --zone=$ZONE

We use a DaemonSet to mount the disk when GKE nodes start: as described in the previous sections, to speed up the initial launch, we mount a persistent disk to the GKE nodes to hold the runtime libraries for Stable Diffusion.

Leveraging GKE Image Streaming and Cluster Autoscaler

In addition, as mentioned earlier, we have also enabled GKE Image Streaming to accelerate the image pulling and loading process.
GKE Image Streaming works by mounting the container's data layer over the network into containerd and backing it with multiple cache layers on the network, in memory, and on disk. Once the Image Streaming mount is prepared, your containers transition from the ImagePulling state to Running in a matter of seconds, regardless of the container size; this effectively parallelizes application startup with the transfer of the required data from the container image. As a result, you get faster container startup times and faster automatic scaling.

We have also enabled the Cluster Autoscaler (CA) feature, which allows the GKE nodes to scale up automatically as requests increase; Cluster Autoscaler determines the number of nodes needed to handle the additional load. When Cluster Autoscaler initiates a new scaling wave and the new GKE nodes are registered in the cluster, the DaemonSet mounts the disk image that contains the runtime dependencies, and the Stable Diffusion Deployment then accesses this disk through a HostPath volume. Additionally, we used the Cluster Autoscaler optimize-utilization profile, which prioritizes utilization over keeping spare resources in the cluster, to reduce scaling time, save costs, and improve machine utilization (a brief gcloud sketch of enabling these options follows after the considerations below).

Final results

The final startup timings, in chronological order, are:

- Triggering Cluster Autoscaler for scaling: 38 seconds
- Node startup and Pod scheduling: 89 seconds
- Mounting the PVC: 4 seconds
- Image pull startup: 10 seconds
- Image pulling: 1 second
- Pod startup: 1 second
- Ability to serve sd-webui (approximately): 65 seconds

Overall, it took approximately 3 minutes to start a new Stable Diffusion container instance and begin serving on a new GKE node. Compared to the previous 12 minutes, this is a significant improvement in startup speed and user experience. Take a look at the full code here: https://github.com/nonokangwei/Stable-Diffusion-on-GCP/tree/main/Stable-Diffusion-UI-Agones/optimizated-init

Considerations

While the technique described above splits up dependencies so that the container image is smaller and the libraries can be loaded from PD disk images, there are some downsides to consider. Packing everything into one container image has the upside of a single immutable, versioned artifact; separating the base environment from the runtime dependencies means you have multiple artifacts to maintain and update. You can mitigate this by building tooling to manage updates to your PD disk images.
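For readers who want to try the cluster-level settings mentioned above, here is a minimal gcloud sketch (not from the original article) that enables Image Streaming and the optimize-utilization autoscaling profile on an existing cluster; the cluster name and zone are placeholders. Note that Image Streaming also requires container images to be hosted in Artifact Registry.

    # Hypothetical example: enable Image Streaming on an existing GKE cluster.
    # Cluster name and zone are placeholders.
    gcloud container clusters update sd-cluster \
        --zone us-central1-a \
        --enable-image-streaming

    # Switch Cluster Autoscaler to the optimize-utilization profile.
    gcloud container clusters update sd-cluster \
        --zone us-central1-a \
        --autoscaling-profile optimize-utilization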
  24. Have you ever wondered what it will cost to run a particular Google Kubernetes Engine (GKE) cluster? How various configuration and feature choices will affect your costs? What impact autoscaling might have on your bill? If you’ve ever tried to estimate this yourself, you know it can be a puzzle, especially if you’re just starting out with Kubernetes and don’t have many reference points from existing infrastructure to help. Today we are launching the GKE cost estimator in Preview, seamlessly integrated into the Google Cloud console... Read Article
  25. Available in GA today, Spot VMs can now be deployed in your Google Cloud projects so you can start saving right away. For an overview of Spot VMs, see our Preview launch blog, and for a deeper dive, check out our Spot VM documentation. Modern applications such as microservices, containerized workloads, and horizontally scalable applications are engineered to persist even when the underlying machine does not. This architecture lets you leverage Spot VMs to access capacity and run applications at a low price: with Spot VMs, you save 60-91% off the price of our on-demand VMs. To make it even easier to use Spot VMs, we’ve incorporated Spot VM support into a variety of tools… Read Article
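As a quick illustration of the Spot model on both Compute Engine and GKE, here is a minimal gcloud sketch; the instance, cluster, and node-pool names, zone, and machine types are placeholders, not values from the announcement.

    # Hypothetical example: create a Spot VM on Compute Engine.
    gcloud compute instances create demo-spot-vm \
        --zone us-central1-a \
        --machine-type e2-standard-4 \
        --provisioning-model SPOT \
        --instance-termination-action DELETE

    # Hypothetical example: add a Spot node pool to an existing GKE cluster so
    # fault-tolerant workloads can run on discounted capacity.
    gcloud container node-pools create spot-pool \
        --cluster demo-cluster \
        --zone us-central1-a \
        --spot \
        --num-nodes 2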