Search the Community
Showing results for tags 'scaling'.
-
Real-time data streaming and event processing present scalability and management challenges. AWS offers a broad selection of managed real-time data streaming services to effortlessly run these workloads at any scale. In this post, Nexthink shares how Amazon Managed Streaming for Apache Kafka (Amazon MSK) empowered them to achieve massive scale in event processing. Experiencing business hyper-growth, Nexthink migrated to AWS to overcome the scaling limitations of on-premises solutions. With Amazon MSK, Nexthink now seamlessly processes trillions of events per day, reaching over 5 GB per second of aggregated throughput. In the following sections, Nexthink introduces their product and the need for scalability. They then highlight the challenges of their legacy on-premises application and present their transition to a cloud-centered software as a service (SaaS) architecture powered by Amazon MSK. Finally, Nexthink details the benefits achieved by adopting Amazon MSK. Nexthink’s need to scale Nexthink is the leader in digital employee experience (DeX). The company is shaping the future of work by providing IT leaders and C-levels with insights into employees’ daily technology experiences at the device and application level. This allows IT to evolve from reactive problem-solving to proactive optimization. The Nexthink Infinity platform combines analytics, monitoring, automation, and more to manage the employee digital experience. By collecting device and application events, processing them in real time, and storing them, our platform analyzes data to solve problems and boost experiences for over 15 million employees across five continents. In just 3 years, Nexthink’s business grew tenfold, and with the introduction of more real-time data our application had to scale from processing 200 MB per second to 5 GB per second and trillions of events daily. To enable this growth, we modernized our application from an on-premises single-tenant monolith to a cloud-based scalable SaaS solution powered by Amazon MSK. The next sections detail our modernization journey, including the challenges we faced and the benefits we realized with our new cloud-centered, AWS-based architecture. The on-premises solution and its challenges Let’s first explore our previous on-premises solution, Nexthink V6, before examining how Amazon MSK addressed its challenges. The following diagram illustrates its architecture. V6 was made up of two monolithic, single-tenant Java and C++ applications that were tightly coupled. The portal was a backend-for-frontend Java application, and the core engine was an in-house C++ in-memory database application that was also handling device connections, data ingestion, aggregation, and querying. By bundling all these functions together, the engine became difficult to manage and improve. V6 also lacked scalability. Initially supporting 10,000 devices, some new tenants had over 300,000 devices. We reacted by deploying multiple V6 engines per tenant, increasing complexity and cost, hampering user experience, and delaying time to market. This also led to longer proof of concept and onboarding cycles, which hurt the business. Furthermore, the absence of a streaming platform like Kafka created dependencies between teams through tight HTTP/gRPC coupling. Additionally, teams couldn’t access real-time events before ingestion into the database, limiting feature development. We also lacked a data buffer, risking potential data loss during outages. Such constraints impeded innovation and increased risks. In summary, although the V6 system served its initial purpose, reinventing it with cloud-centered technologies became imperative to enhance scalability, reliability, and foster innovation by our engineering and product teams. Transitioning to a cloud-centered architecture with Amazon MSK To achieve our modernization goals, after thorough research and iterations, we implemented an event-driven microservices design on Amazon Elastic Kubernetes Service (Amazon EKS), using Kafka on Amazon MSK for distributed event storage and streaming. Our transition from the v6 on-prem solution to the cloud-centered platform was phased over four iterations: Phase 1 – We lifted and shifted from on premises to virtual machines in the cloud, reducing operational complexities and accelerating proof of concept cycles while transparently migrating customers. Phase 2 – We extended the cloud architecture by implementing new product features with microservices and self-managed Kafka on Kubernetes. However, operating Kafka clusters ourselves proved overly difficult, leading us to Phase 3. Phase 3 – We switched from self-managed Kafka to Amazon MSK, improving stability and reducing operational costs. We realized that managing Kafka wasn’t our core competency or differentiator, and the overhead was high. Amazon MSK enabled us to focus on our core application, freeing us from the burden of undifferentiated Kafka management. Phase 4 – Finally, we eliminated all legacy components, completing the transition to a fully cloud-centered SaaS platform. This multi-year journey of learning and transformation took 3 years. Today, after our successful transition, we use Amazon MSK for two key functions: Real-time data ingestion and processing of trillions of daily events from over 15 million devices worldwide, as illustrated in the following figure. Enabling an event-driven system that decouples data producers and consumers, as depicted in the following figure. To further enhance our scalability and resilience, we adopted a cell-based architecture using the wide availability of Amazon MSK across AWS Regions. We currently operate over 10 cells, each representing an independent regional deployment of our SaaS solution. This cell-based approach minimizes the area of impact in case of issues, addresses data residency requirements, and enables horizontal scaling across AWS Regions, as illustrated in the following figure. Benefits of Amazon MSK Amazon MSK has been critical in enabling our event-driven design. In this section, we outline the main benefits we gained from its adoption. Improved data resilience In our new architecture, data from devices is pushed directly to Kafka topics in Amazon MSK, which provides high availability and resilience. This makes sure that events can be safely received and stored at any time. Our services consuming this data inherit the same resilience from Amazon MSK. If our backend ingestion services face disruptions, no event is lost, because Kafka retains all published messages. When our services resume, they seamlessly continue processing from where they left off, thanks to Kafka’s producer semantics, which allow processing messages exactly-once, at-least-once, or at-most-once based on application needs. Amazon MSK enables us to tailor the data retention duration to our specific requirements, ranging from seconds to unlimited duration. This flexibility grants uninterrupted data availability to our application, which wasn’t possible with our previous architecture. Furthermore, to safeguard data integrity in the event of processing errors or corruption, Kafka enabled us to implement a data replay mechanism, ensuring data consistency and reliability. Organizational scaling By adopting an event-driven architecture with Amazon MSK, we decomposed our monolithic application into loosely coupled, stateless microservices communicating asynchronously via Kafka topics. This approach enabled our engineering organization to scale rapidly from just 4–5 teams in 2019 to over 40 teams and approximately 350 engineers today. The loose coupling between event publishers and subscribers empowered teams to focus on distinct domains, such as data ingestion, identification services, and data lakes. Teams could develop solutions independently within their domains, communicating through Kafka topics without tight coupling. This architecture accelerated feature development by minimizing the risk of new features impacting existing ones. Teams could efficiently consume events published by others, offering new capabilities more rapidly while reducing cross-team dependencies. The following figure illustrates the seamless workflow of adding new domains to our system. Furthermore, the event-driven design allowed teams to build stateless services that could seamlessly auto scale based on MSK metrics like messages per second. This event-driven scalability eliminated the need for extensive capacity planning and manual scaling efforts, freeing up development time. By using an event-driven microservices architecture on Amazon MSK, we achieved organizational agility, enhanced scalability, and accelerated innovation while minimizing operational overhead. Seamless infrastructure scaling Nexthink’s business grew tenfold in 3 years, and many new capabilities were added to the product, leading to a substantial increase in traffic from 200 MB per second to 5 GB per second. This exponential data growth was enabled by the robust scalability of Amazon MSK. Achieving such scale with an on-premises solution would have been challenging and expensive, if not infeasible. Attempting to self-manage Kafka imposed unnecessary operational overhead without providing business value. Running it with just 5% of today’s traffic was already complex and required two engineers. At today’s volumes, we estimated needing 6–10 dedicated staff, increasing costs and diverting resources away from core priorities. Real-time capabilities By channeling all our data through Amazon MSK, we enabled real-time processing of events. This unlocked capabilities like real-time alerts, event-driven triggers, and webhooks that were previously unattainable. As such, Amazon MSK was instrumental in facilitating our event-driven architecture and powering impactful innovations. Secure data access Transitioning to our new architecture, we met our security and data integrity goals. With Kafka ACLs, we enforced strict access controls, allowing consumers and producers to only interact with authorized topics. We based these granular data access controls on criteria like data type, domain, and team. To securely scale decentralized management of topics, we introduced proprietary Kubernetes Custom Resource Definitions (CRDs). These CRDs enabled teams to independently manage their own topics, settings, and ACLs without compromising security. Amazon MSK encryption made sure that the data remained encrypted at rest and in transit. We also introduced a Bring Your Own Key (BYOK) option, allowing application-level encryption with customer keys for all single-tenant and multi-tenant topics. Enhanced observability Amazon MSK gave us great visibility into our data flows. The out-of-the-box Amazon CloudWatch metrics let us see the amount and types of data flowing through each topic and cluster. This helped us quantify the usage of our product features by tracking data volumes at the topic level. The Amazon MSK operational metrics enabled effortless monitoring and right-sizing of clusters and brokers. Overall, the rich observability of Amazon MSK facilitated data-driven decisions about architecture and product features. Conclusion Nexthink’s journey from an on-premises monolith to a cloud SaaS was streamlined by using Amazon MSK, a fully managed Kafka service. Amazon MSK allowed us to scale seamlessly while benefiting from enterprise-grade reliability and security. By offloading Kafka management to AWS, we could stay focused on our core business and innovate faster. Going forward, we plan to further improve performance, costs, and scalability by adopting Amazon MSK capabilities such as tiered storage and AWS Graviton-based EC2 instance types. We are also working closely with the Amazon MSK team to prepare for upcoming service features. Rapidly adopting new capabilities will help us remain at the forefront of innovation while continuing to grow our business. To learn more about how Nexthink uses AWS to serve its global customer base, explore the Nexthink on AWS case study. Additionally, discover other customer success stories with Amazon MSK by visiting the Amazon MSK blog category. About the Authors Moe Haidar is a principal engineer and special projects lead @ CTO office of Nexthink. He has been involved with AWS since 2018 and is a key contributor to the cloud transformation of the Nexthink platform to AWS. His focus is on product and technology incubation and architecture, but he also loves doing hands-on activities to keep his knowledge of technologies sharp and up to date. He still contributes heavily to the code base and loves to tackle complex problems. Simone Pomata is Senior Solutions Architect at AWS. He has worked enthusiastically in the tech industry for more than 10 years. At AWS, he helps customers succeed in building new technologies every day. Magdalena Gargas is a Solutions Architect passionate about technology and solving customer challenges. At AWS, she works mostly with software companies, helping them innovate in the cloud. She participates in industry events, sharing insights and contributing to the advancement of the containerization field. View the full article
-
- 1
-
- kafka
- apache kafka
-
(and 3 more)
Tagged with:
-
Managing the growth of your Kubernetes clusters within Google Kubernetes Engine (GKE) just got easier. We've recently introduced the ability to directly monitor and set alerts for crucial scalability limits, providing you with deeper insight and control over your Kubernetes environment. Effective scalability management is essential for avoiding outages and optimizing resource usage in Kubernetes. These new monitoring features bring you: Peace of mind: Potential capacity issues can be proactively addressed before they cause problems, ensuring uninterrupted operations.Clearer understanding: Gain a deeper insight into your clusters’ architectural constraints, allowing for informed decision-making.Optimization opportunities: Analyze usage trends and identify ways to fine-tune your cluster configurations for optimal resource utilization.Here are the specific limits you can now keep track of: Etcd database size (GiB): Understand how much space your Kubernetes cluster state is consuming.Nodes per cluster: Get proactive alerts on your cluster's overall node capacity.Nodes per node pool (all zones): Manage node distribution and limits across specific node pools.Pods per cluster (GKE Standard / GKE Autopilot): Ensure you have the pod capacity to support your applications.Containers per cluster (GKE Standard / GKE Autopilot): Prevent issues by understanding the maximum number of containers your cluster can support. Get startedYou'll find these new quota monitoring and alerting features directly within the Google Cloud console. To get there, you can use the link to a pre-filtered list of GKE quotas or navigate to the Quotas page under the IAM & Admin section in the console and then filter by the Kubernetes Engine API service. To search for a specific quota, use the Filter table. You can filter by the exact quota name, location, cluster name, or node pool (where applicable). You can also create alerts for a specific quota by following the guide. Alerts can be configured to notify you when a quota is approaching or has exceeded its limit. By using the Cloud Monitoring API and console you can monitor GKE quota usage in greater depth. The API allows you to programmatically access quota metrics and create custom dashboards and alerts. The console provides a graphical interface for monitoring quota usage and creating alerts. Custom dashboards can be created to visualize quota usage over time. Alerts can be configured to notify you when quota usage reaches a certain threshold. This can help you proactively manage your quotas and avoid unexpected outages. See the guide for more details. Need more information? Explore the official Google Cloud documentation for more in-depth guidance: Understanding GKE Quotas and Limits: Quotas and limits | Google Kubernetes Engine (GKE)Best practices planning and designing large-size clusters: Plan for large GKE clusters | Google Kubernetes Engine (GKE)Setting up a quota alert: Monitor and alert with quota metricsUsing GKE observability metrics: View observability metrics | Google Kubernetes Engine (GKE)View the full article
-
- scaling
- monitoring
-
(and 2 more)
Tagged with:
-
Docker Compose‘s simplicity — just run compose up — has been an integral part of developer workflows for a decade, with the first commit occurring in 2013, back when it was called Plum. Although the feature set has grown dramatically in that time, maintaining that experience has always been integral to the spirit of Compose. In this post, we’ll walk through how to manage microservice sprawl with Docker Compose by importing subprojects from other Git repos. Maintaining simplicity Now, perhaps more than ever, that simplicity is key. The complexity of modern software development is undeniable regardless of whether you’re using microservices or a monolith, deploying to the cloud or on-prem, or writing in JavaScript or C. Compose has not kept up with this “development sprawl” and is even sometimes an obstacle when working on larger, more complex projects. Maintaining Compose to accurately represent your increasingly complex application can require its own expertise, often resulting in out-of-date configuration in YAML or complex makefile tasks. As an open source project, Compose serves everyone from home lab enthusiasts to transcontinental corporations, which is no small feat, and our commitment to maintaining Compose’s signature simplicity for all users hasn’t changed. The increased flexibility afforded by Compose watch and include means your project no longer needs to be one-size-fits-all. Now, it’s possible to split your project across Git repos and import services as needed, customizing their configuration in the process. Application architecture Let’s take a look at a hypothetical application architecture. To begin, the application is split across two Git repos: backend — Backend in Python/Flask frontend — Single-page app (SPA) frontend in JavaScript/Node.js While working on the frontend, the developers run without using Docker or Compose, launching npm start on their laptops directly and proxy API requests to a shared staging server (as opposed to running the backend locally). Meanwhile, while working on the backend, developers and CI (for integration tests) share a Compose file and rely on command-line tools like cURL to manually test functionality locally. We’d like a flexible configuration that enables each group of developers to use their optimal workflow (e.g., leveraging hot reload for the frontend) while also allowing reuse to share project configuration between repos. At first, this seems like an impossible situation to resolve. Frontend We can start by adding a compose.yaml file to frontend: services: frontend: pull_policy: build build: context: . environment: BACKEND_HOST: ${BACKEND_HOST:-https://staging.example.com} ports: - 8000:8000 Note: If you’re wondering what the Dockerfile looks like, take a look at this samples page for an up-to-date example of best practices generated by docker init. This is a great start! Running docker compose up will now build the Node.js frontend and make it accessible at http://localhost:8000/. The BACKEND_HOST environment variable can be used to control where upstream API requests are proxied to and defaults to our shared staging instance. Unfortunately, we’ve lost the great developer experience afforded by hot module reload (HMR) because everything is inside the container. By adding a develop.watch section, we can preserve that: services: frontend: pull_policy: build build: context: . environment: BACKEND_HOST: ${BACKEND_HOST:-https://staging.example.com} ports: - 8000:8000 develop: watch: - path: package.json action: rebuild - path: src/ target: /app/src action: sync Now, while working on the frontend, developers continue to benefit from the rapid iteration cycles due to HMR. Whenever a file is modified locally in the src/ directory, it’s synchronized into the container at /app/src. If the package.json file is modified, the entire container is rebuilt, so that the RUN npm install step in the Dockerfile will be re-executed and install the latest dependencies. The best part is the only change to the workflow is running docker compose watch instead of npm start. Backend Now, let’s set up a Compose file in backend: services: backend: pull_policy: build build: context: . ports: - 1234:8080 develop: watch: - path: requirements.txt action: rebuild - path: ./ target: /app/ action: sync include: - path: git@github.com:myorg/frontend.git env_file: frontend.env frontend.env BACKEND_HOST=http://backend:8080 Much of this looks very similar to the frontend compose.yaml. When files in the project directory change locally, they’re synchronized to /app inside the container, so the Flask dev server can handle hot reload. If the requirements.txt is changed, the entire container is rebuilt, so that the RUN pip install step in the Dockerfile will be re-executed and install the latest dependencies. However, we’ve also added an include section that references the frontend project by its Git repository. The custom env_file points to a local path (in the backend repo), which sets BACKEND_HOST so that the frontend service container will proxy API requests to the backend service container instead of the default. Note: Remote includes are an experimental feature. You’ll need to set COMPOSE_EXPERIMENTAL_GIT_REMOTE=1 in your environment to use Git references. With this configuration, developers can now run the full stack while keeping the frontend and backend Compose projects independent and even in different Git repositories. As developers, we’re used to sharing code library dependencies, and the include keyword brings this same reusability and convenience to your Compose development configurations. What’s next? There are still some rough edges. For example, the remote project is cloned to a temporary directory, which makes it impractical to use with watch mode when imported, as the files are not available for editing. Enabling bigger and more complex software projects to use Compose for flexible, personal environments is something we’re continuing to improve upon. If you’re a Docker customer using Compose across microservices or repositories, we’d love to hear how we can better support you. Get in touch! Learn more Get the latest release of Docker Desktop. Vote on what’s next! Check out our public roadmap. Have questions? The Docker community is here to help. New to Docker? Get started. View the full article
-
Cloud-native architecture is a transformative approach to designing and managing applications. This type of architecture embraces the concepts of modularity, scalability, and rapid deployment, making it highly suitable for modern software development. Though the cloud-native ecosystem is vast, Kubernetes stands out as its beating heart. It serves as a container orchestration platform that helps with automatic deployments and the scaling and management of microservices. Some of these features are crucial for building true cloud-native applications... View the full article
-
WebSocket is a common communication protocol used in web applications to facilitate real-time bi-directional data exchange between client and server. However, when the server has to maintain a direct connection with the client, it can limit the server’s ability to scale down when there are long-running clients. This scale down can occur when nodes are underutilized during periods of low usage. In this post, we demonstrate how to redesign a web application to achieve auto scaling even for long-running clients, with minimal changes to the original application... View the full article
-
- api gateway
- aws
-
(and 6 more)
Tagged with:
-
Cloud-native technologies are becoming increasingly ubiquitous, and Kubernetes is at the forefront of this movement. Today, Kubernetes is seeing widespread adoption across organizations in a variety of different industries. When implemented properly, Kubernetes can help these organizations achieve higher availability, scalability, and resiliency for their workloads. Combining Kubernetes with the attributes of cloud computing—such as unparalleled scalability and elasticity—can help organizations enhance their containerized applications’ resiliency and availability. As detailed in this introductory post, Karpenter‘s objective is to make sure that your cluster’s workloads have the compute they need, no more and no less, right when they need it. In its most recent updates, Karpenter added support for more advanced scheduling constraints, such as pod affinity and anti-affinity, topology spread, node affinity, node selection, and resource requests. This post will specifically delve into podAffinity, podAntiAffinity, and volume topology awareness and elaborate on the use cases that they’re best suited for... View the full article
-
- kubernetes
- scheduling
-
(and 2 more)
Tagged with:
-
Years before Amazon Elastic Kubernetes Service (EKS) was released, our customers told us they wanted a service that would simplify Kubernetes management. Many of them were running self-managed clusters on Amazon Elastic Computer Cloud (EC2) and were having challenges upgrading, scaling, and maintaining the Kubernetes control plane. When EKS launched in 2018, it aimed to reduce our customers’ operational burden by offering a managed control plane for Kubernetes. This initially included automated upgrades, patching, and backups, which we often refer to as “undifferentiated heavy lifting.” We analyzed volumes of data to create a control plane that would work for the vast majority of our customers. However, as usage of EKS grew, we discovered there were customers who occasionally exceeded the provisioned capacity of the cluster. When this happened, they had to file a ticket with AWS support to have their cluster control plane resized. This was not an ideal user experience. Today, the control plane is scaled automatically when certain metrics are exceeded. At first, we used basic metrics such as CPU/memory for scaling. As we learned how the control plane behaved under different conditions, we adjusted our metrics to make scaling more responsive. Now we use a variety of metrics to scale the control plane, including the number of worker nodes and the size of the etcd database. These enhancements are a great example of the flywheel effect where AWS releases a feature in response to customer feedback, solicits feedback from end users about its impact, and uses that feedback to continue improving the customer experience... View the full article
-
Forum Statistics
63.6k
Total Topics61.7k
Total Posts