Showing results for tags 'performance tuning'.
-
Account reconciliation is an important step to ensure the completeness and accuracy of financial statements. Specifically, companies must reconcile balance sheet accounts that could contain significant or material misstatements. Accountants go through each account in the general ledger and verify that the balance listed is complete and accurate. When discrepancies are found, accountants investigate and take appropriate corrective action.

As part of Amazon’s FinTech organization, we offer a software platform that empowers the internal accounting teams at Amazon to conduct account reconciliations. To optimize the reconciliation process, these users require high-performance transformations with the ability to scale on demand, as well as the ability to process variable file sizes ranging from a few MBs to more than 100 GB. It’s not always possible to fit such data onto a single machine or process it with a single program in a reasonable time frame, and the computation has to be fast enough to provide a practical service in which programming logic and underlying details (data distribution, fault tolerance, and scheduling) can be separated. Distributed data processing solutions achieve this by running the same function simultaneously across groups of elements of a dataset, on multiple machines or threads.

This encouraged us to reinvent our reconciliation service, powered by AWS services including Amazon EMR and the Apache Spark distributed processing framework, using PySpark. The new service enables users to process files over 100 GB containing up to 100 million transactions in less than 30 minutes. The reconciliation service has become a powerhouse for data processing, and users can now seamlessly perform a variety of operations, such as Pivot, JOIN (like an Excel VLOOKUP operation), arithmetic operations, and more, providing a versatile and efficient solution for reconciling vast datasets. This enhancement is a testament to the scalability and speed achieved through the adoption of distributed data processing solutions.

In this post, we explain how we integrated Amazon EMR to build a highly available and scalable system that enabled us to run a high-volume financial reconciliation process.

Architecture before migration

The following diagram illustrates our previous architecture. Our legacy service was built with Amazon Elastic Container Service (Amazon ECS) on AWS Fargate. We processed the data sequentially using Python. Because this approach lacked parallel processing capability, we frequently had to increase the cluster size vertically to support larger datasets. For context, 5 GB of data with 50 operations took around 3 hours to process.

The service was configured to scale horizontally to five ECS instances that polled messages from Amazon Simple Queue Service (Amazon SQS), which fed the transformation requests. Each instance was configured with 4 vCPUs and 30 GB of memory. However, we couldn’t improve the performance of an individual run by adding capacity, because each request was processed sequentially, picking chunks of data from Amazon Simple Storage Service (Amazon S3). For example, a VLOOKUP operation where two files are to be joined required both files to be read into memory chunk by chunk to obtain the output. This became an obstacle for users because they had to wait for long periods of time to process their datasets.
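To make this bottleneck concrete, the legacy pattern looked roughly like the following. This is a simplified, hypothetical sketch rather than the production code; the file names, column names, and chunk size are illustrative assumptions (in the real service, the data was read from Amazon S3).

import pandas as pd

# The smaller "lookup" side of the VLOOKUP has to be held in memory for the whole run.
lookup = pd.read_csv("ledger_balances.csv")  # hypothetical file

# The larger file is streamed chunk by chunk and joined sequentially on one machine.
results = []
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):  # hypothetical file
    # Equivalent of an Excel VLOOKUP: enrich each transaction with the ledger balance.
    results.append(chunk.merge(lookup, on="account_id", how="left"))

pd.concat(results).to_csv("reconciliation_output.csv", index=False)

Because every chunk is read and joined one after another on a single machine, adding more ECS instances only helps with additional requests; it does not make any single large request finish sooner.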
As part of our re-architecture and modernization, we wanted to achieve the following:

- High availability – The data processing clusters should be highly available, providing three 9s of availability (99.9%)
- Throughput – The service should handle 1,500 runs per day
- Latency – It should be able to process 100 GB of data within 30 minutes
- Heterogeneity – The cluster should be able to support a wide variety of workloads, with files ranging from a few MBs to hundreds of GBs
- Query concurrency – The implementation demands the ability to support a minimum of 10 degrees of concurrency
- Reliability of jobs and data consistency – Jobs need to run reliably and consistently to avoid breaking Service Level Agreements (SLAs)
- Cost-effective and scalable – It must be scalable based on the workload, making it cost-effective
- Security and compliance – Given the sensitivity of the data, it must support fine-grained access control and appropriate security implementations
- Monitoring – The solution must offer end-to-end monitoring of the clusters and jobs

Why Amazon EMR

Amazon EMR is the industry-leading cloud big data solution for petabyte-scale data processing, interactive analytics, and machine learning (ML) using open-source frameworks such as Apache Spark, Apache Hive, and Presto. With these frameworks and related open-source projects, you can process data for analytics purposes and BI workloads. Amazon EMR lets you transform and move large amounts of data in and out of other AWS data stores and databases, such as Amazon S3 and Amazon DynamoDB.

A notable advantage of Amazon EMR lies in its effective use of parallel processing with PySpark, marking a significant improvement over traditional sequential Python code. This approach streamlines the deployment and scaling of Apache Spark clusters, allowing for efficient parallelization on large datasets. The distributed computing infrastructure not only enhances performance, but also enables the processing of vast amounts of data at high speed. PySpark facilitates Excel-like operations on DataFrames, and the higher-level abstraction of DataFrames simplifies intricate data manipulations, reducing code complexity. Combined with automatic cluster provisioning, dynamic resource allocation, and integration with other AWS services, Amazon EMR proves to be a versatile solution suitable for diverse workloads, ranging from batch processing to ML. The inherent fault tolerance in PySpark and Amazon EMR promotes robustness, even in the event of node failures, making it a scalable, cost-effective, and high-performance choice for parallel data processing on AWS.

Amazon EMR extends its capabilities beyond the basics, offering a variety of deployment options to cater to diverse needs. Whether it’s Amazon EMR on EC2, Amazon EMR on EKS, Amazon EMR Serverless, or Amazon EMR on AWS Outposts, you can tailor your approach to specific requirements. For those seeking a serverless environment for Spark jobs, integrating AWS Glue is also a viable option. In addition to supporting various open-source frameworks, including Spark, Amazon EMR provides flexibility in choosing deployment modes, Amazon Elastic Compute Cloud (Amazon EC2) instance types, scaling mechanisms, and numerous cost-saving optimization techniques. Amazon EMR stands as a dynamic force in the cloud, delivering unmatched capabilities for organizations seeking robust big data solutions.
Its seamless integration, powerful features, and adaptability make it an indispensable tool for navigating the complexities of data analytics and ML on AWS.

Redesigned architecture

The following diagram illustrates our redesigned architecture. The solution operates under an API contract, where clients can submit transformation configurations, defining the set of operations alongside the S3 dataset location for processing. The request is queued through Amazon SQS, then directed to Amazon EMR via a Lambda function. This process initiates the creation of an Amazon EMR step for the Spark framework implementation on a dedicated EMR cluster. Although Amazon EMR accommodates an unlimited number of steps over a long-running cluster’s lifetime, only 256 steps can be running or pending simultaneously. For optimal parallelization, the step concurrency is set at 10, allowing 10 steps to run concurrently. In case of request failures, the Amazon SQS dead-letter queue (DLQ) retains the event.

Spark processes the request, translating Excel-like operations into PySpark code for an efficient query plan. Resilient DataFrames store input, output, and intermediate data in-memory, optimizing processing speed, reducing disk I/O cost, enhancing workload performance, and delivering the final output to the specified Amazon S3 location.
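To give a concrete sense of what that translation can look like, the following is a minimal PySpark sketch of two of the Excel-like operations mentioned above: a VLOOKUP-style join and a pivot. The dataset locations, column names, and specific operations are illustrative assumptions, not the code the service actually generates.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("reconciliation-example").getOrCreate()

# Hypothetical input locations and column names, for illustration only.
txns = spark.read.parquet("s3://example-bucket/input/transactions/")
ledger = spark.read.parquet("s3://example-bucket/input/ledger/")

# Excel VLOOKUP equivalent: a distributed left join on the lookup key.
enriched = txns.join(
    ledger.select("account_id", "ledger_balance"),
    on="account_id",
    how="left",
)

# Arithmetic operation, e.g. the variance investigated during reconciliation.
enriched = enriched.withColumn("variance", F.col("amount") - F.col("ledger_balance"))

# Excel Pivot equivalent: total transaction amount per account, pivoted by currency.
pivoted = enriched.groupBy("account_id").pivot("currency").agg(F.sum("amount"))

# Write the result to the output location specified in the request.
pivoted.write.mode("overwrite").parquet("s3://example-bucket/output/run-id/")

Because the join, arithmetic, and pivot are expressed as DataFrame transformations, Spark can combine them into a single query plan and execute it in parallel across the cluster, rather than reading both files chunk by chunk on one machine.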
We define our SLA in two dimensions: latency and throughput. Latency is defined as the amount of time taken to perform one job against a deterministic dataset size and the number of operations performed on the dataset. Throughput is defined as the maximum number of simultaneous jobs the service can perform without breaching the latency SLA of one job. The overall scalability SLA of the service depends on the balance of horizontal scaling of elastic compute resources and vertical scaling of individual servers.

Because we had to run 1,500 processes per day with minimal latency and high performance, we chose the Amazon EMR on EC2 deployment mode with managed scaling enabled to support processing variable file sizes. The EMR cluster configuration provides many different selections:

- EMR node types – Primary, core, or task nodes
- Instance purchasing options – On-Demand Instances, Reserved Instances, or Spot Instances
- Configuration options – EMR instance fleet or uniform instance group
- Scaling options – Auto Scaling or Amazon EMR managed scaling

Based on our variable workload, we configured an EMR instance fleet (for best practices, see Reliability). We also decided to use Amazon EMR managed scaling to scale the core and task nodes (for scaling scenarios, refer to Node allocation scenarios). Lastly, we chose memory-optimized AWS Graviton instances, which provide up to 30% lower cost and up to 15% improved performance for Spark workloads.

The following code provides a snapshot of our cluster configuration:

    Concurrent steps: 10

    EMR Managed Scaling:
      minimumCapacityUnits: 64
      maximumCapacityUnits: 512
      maximumOnDemandCapacityUnits: 512
      maximumCoreCapacityUnits: 512

    Master Instance Fleet:
      r6g.xlarge
      - 4 vCore, 30.5 GiB memory, EBS only storage
      - EBS Storage: 250 GiB
      - Maximum Spot price: 100% of On-Demand price
      - Each instance counts as 1 unit
      r6g.2xlarge
      - 8 vCore, 61 GiB memory, EBS only storage
      - EBS Storage: 250 GiB
      - Maximum Spot price: 100% of On-Demand price
      - Each instance counts as 1 unit

    Core Instance Fleet:
      r6g.2xlarge
      - 8 vCore, 61 GiB memory, EBS only storage
      - EBS Storage: 100 GiB
      - Maximum Spot price: 100% of On-Demand price
      - Each instance counts as 8 units
      r6g.4xlarge
      - 16 vCore, 122 GiB memory, EBS only storage
      - EBS Storage: 100 GiB
      - Maximum Spot price: 100% of On-Demand price
      - Each instance counts as 16 units

    Task Instances:
      r6g.2xlarge
      - 8 vCore, 61 GiB memory, EBS only storage
      - EBS Storage: 100 GiB
      - Maximum Spot price: 100% of On-Demand price
      - Each instance counts as 8 units
      r6g.4xlarge
      - 16 vCore, 122 GiB memory, EBS only storage
      - EBS Storage: 100 GiB
      - Maximum Spot price: 100% of On-Demand price
      - Each instance counts as 16 units
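For readers who want to see how such a cluster could be expressed through the EMR API, the following is a hedged boto3 sketch that mirrors the configuration above (step concurrency, managed scaling, and Graviton instance fleets). The cluster name, release label, IAM roles, and capacity numbers are illustrative assumptions, not our exact production setup.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="reconciliation-cluster",            # hypothetical name
    ReleaseLabel="emr-6.9.0",                 # assumed release label
    Applications=[{"Name": "Spark"}],
    StepConcurrencyLevel=10,                  # up to 10 steps run concurrently
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "InstanceFleetUnits",
            "MinimumCapacityUnits": 64,
            "MaximumCapacityUnits": 512,
            "MaximumOnDemandCapacityUnits": 512,
            "MaximumCoreCapacityUnits": 512,
        }
    },
    Instances={
        "KeepJobFlowAliveWhenNoSteps": True,  # long-running cluster that accepts steps
        "InstanceFleets": [
            {
                "Name": "Primary",
                "InstanceFleetType": "MASTER",
                "TargetOnDemandCapacity": 1,
                "InstanceTypeConfigs": [
                    {"InstanceType": "r6g.xlarge", "WeightedCapacity": 1},
                    {"InstanceType": "r6g.2xlarge", "WeightedCapacity": 1},
                ],
            },
            {
                "Name": "Core",
                "InstanceFleetType": "CORE",
                "TargetOnDemandCapacity": 64,
                "InstanceTypeConfigs": [
                    {"InstanceType": "r6g.2xlarge", "WeightedCapacity": 8},
                    {"InstanceType": "r6g.4xlarge", "WeightedCapacity": 16},
                ],
            },
            # A TASK fleet for the task instances would follow the same pattern.
        ],
    },
    JobFlowRole="EMR_EC2_DefaultRole",        # assumed default roles
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])

Individual transformation requests are then submitted to this cluster as EMR steps (for example, with add_job_flow_steps from the Lambda function described earlier), and the step concurrency setting controls how many of them run at the same time.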
Performance

With our migration to Amazon EMR, we were able to achieve a system performance capable of handling a variety of datasets, ranging from as small as 273 B to as large as 88.5 GB, with a p99 of 491 seconds (approximately 8 minutes). The following figure illustrates the variety of file sizes processed. The following figure shows our latency.

To compare against sequential processing, we took two datasets containing 53 million records and ran a VLOOKUP operation against each other, along with 49 other Excel-like operations. This took 26 minutes to process in the new service, compared to 5 days in the legacy service, a performance improvement of almost 300 times over the previous architecture.

Considerations

Keep in mind the following when considering this solution:

- Right-sizing clusters – Although Amazon EMR is resizable, it’s important to right-size the clusters. Right-sizing mitigates a slow cluster, if undersized, or higher costs, if the cluster is oversized. To anticipate these issues, you can calculate the number and type of nodes that will be needed for the workloads.
- Parallel steps – Running steps in parallel allows you to run more advanced workloads, increase cluster resource utilization, and reduce the amount of time taken to complete your workload. The number of steps allowed to run at one time is configurable and can be set when a cluster is launched and at any time after the cluster has started. You need to consider and optimize the CPU and memory usage per job when multiple jobs are running in a single shared cluster.
- Job-based transient EMR clusters – If applicable, it is recommended to use a job-based transient EMR cluster, which delivers superior isolation, ensuring that each job operates within its dedicated environment. This approach optimizes resource utilization, helps prevent interference between jobs, and enhances overall performance and reliability. The transient nature enables efficient scaling, providing a robust and isolated solution for diverse data processing needs.
- EMR Serverless – EMR Serverless is the ideal choice if you prefer not to handle the management and operation of clusters. It allows you to effortlessly run applications using the open-source frameworks available within EMR Serverless, offering a straightforward and hassle-free experience.
- Amazon EMR on EKS – Amazon EMR on EKS offers distinct advantages, such as faster startup times and improved scalability that resolves compute capacity challenges, which is particularly beneficial for Graviton and Spot Instance users. The inclusion of a broader range of compute types enhances cost-efficiency, allowing tailored resource allocation. Furthermore, Multi-AZ support provides increased availability. These features provide a robust solution for managing big data workloads with improved performance, cost optimization, and reliability across various computing scenarios.

Conclusion

In this post, we explained how Amazon optimized its high-volume financial reconciliation process with Amazon EMR for higher scalability and performance. If you have a monolithic application that depends on vertical scaling to process additional requests or datasets, migrating it to a distributed processing framework such as Apache Spark and choosing a managed service such as Amazon EMR for compute may help reduce the runtime to meet your delivery SLA, and may also help reduce the total cost of ownership (TCO).

As we embrace Amazon EMR for this particular use case, we encourage you to explore further possibilities in your data innovation journey. Consider evaluating AWS Glue, along with other Amazon EMR deployment options such as EMR Serverless or Amazon EMR on EKS, to discover the best AWS service tailored to your unique use case. The future of the data innovation journey holds exciting possibilities and advancements to be explored further.

About the Authors

Jeeshan Khetrapal is a Sr. Software Development Engineer at Amazon, where he develops fintech products based on cloud computing serverless architectures that are responsible for companies’ IT general controls, financial reporting, and controllership for governance, risk, and compliance.

Sakti Mishra is a Principal Solutions Architect at AWS, where he helps customers modernize their data architecture and define their end-to-end data strategy, including data security, accessibility, governance, and more. He is also the author of the book Simplify Big Data Analytics with Amazon EMR. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family.

View the full article
-
Ubuntu 23.10 experimental image with x86-64-v3 instruction set now available on Azure

Canonical is enabling enterprises to evaluate the performance of their most critical workloads in an experimental Ubuntu image on Azure compiled with x86-64-v3, a microarchitecture level that has the potential for performance gains. Developers can use this image to characterise workloads, which can help inform planning for a transition to x86-64-v3 and provide valuable input to the community working to make widespread adoption of x86-64-v3 a reality.

The x86-64-v3 instruction set enables hardware features that have been added by chip vendors since the original instruction set architecture (ISA), commonly known as x86-64-v1, x86-64, or amd64. Canonical Staff Engineer Michael Hudson-Doyle recently wrote about the history of the x86-64/amd64 instruction sets, what these v1 and v3 microarchitecture levels represent, and how Canonical is evaluating their performance. While fully backwards compatible, later versions of these feature groups are not available on all hardware, so when deciding on an ISA image you must choose between maximising the supported hardware and getting access to more recent hardware capabilities. Canonical plans to continue supporting x86-64-v1, as there is a significant amount of legacy hardware deployed in the field. However, we also want to enable users to take advantage of newer x86-64-v3 hardware features that provide the opportunity for performance improvements the industry isn’t yet capitalising on.

Untapped performance and power benefits

Intel and Canonical partner closely to ensure that Ubuntu takes full advantage of the advanced hardware features Intel silicon offers, and the Ubuntu image on Azure is an interim step towards giving the industry access to the capabilities of x86-64-v3 and understanding the benefits that it offers. Intel has made x86-64-v3 available since Intel Haswell was first announced a decade ago. Support in their low-power processor family is more recent, arriving in the Gracemont microarchitecture, which first appeared in the 12th generation of Intel Core processors. Similarly, AMD has had examples since 2015, and emulators such as QEMU have supported x86-64-v3 since 2022. Yet, despite this broad base of hardware availability, distro support for the features in the x86-64-v3 microarchitecture level is not widespread.

In the spirit of enabling Ubuntu everywhere and ensuring that users can benefit from the unique features of different hardware families, Canonical feels strongly about enabling a transition to x86-64-v3 while remaining committed to our many users on hardware that doesn’t support v3. x86-64-v3 is available in a significant amount of hardware and provides the opportunity for performance improvements that are currently being left on the table. This is why we believe that v3 is the next logical microarchitecture level to offer in Ubuntu, and Michael’s blog post explains in greater detail why v3 should be chosen instead of v2 or v4.

Not just a porting exercise

The challenge with enabling the transition to v3 is that, while we expect a broad range of performance improvements depending on the workload, the results are much more nuanced. From Canonical’s early benchmarking we see that certain workloads benefit significantly from the adoption of x86-64-v3; however, there are outliers that regress and need further analysis. Canonical continues to benchmark, with plans to evaluate different compilers, compiler parameters, and configurations of host OS and guest OS. In certain cases, such as the Glibc Log2 benchmark, we have reproducibly seen up to a 60% improvement. On the other hand, we also see benchmarks that regress significantly. When digging in, we found unexpected behaviour in the compiled code. For example, in one of the benchmarks we observed an excessive number of moves between registers, leading to much worse performance due to the increased latency. In another case, we noticed a large increase in code size: enabling x86-64-v3 on optimised SSE code caused the compiler to expand it into 17x more instructions, due to a possible bug during the translation to VEX encoding.

With community efforts, these outliers could be resolved, but doing so will require interdisciplinary collaboration. This also underscores the necessity of benchmarking different types of workloads, so that we can understand their specific performance and bottlenecks. That’s why we believe it’s important to enable workloads to run on Azure, so that a broader community can give feedback and enable further optimisation.
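Before benchmarking a workload, it can also be useful to confirm that the host or VM under test actually exposes the hardware features that make up x86-64-v3 (roughly AVX, AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, and XSAVE). The following is a rough Python sketch; the flag names are an approximation of how the Linux kernel reports these features in /proc/cpuinfo and should be treated as an assumption rather than an authoritative list.

# Approximate x86-64-v3 feature check; LZCNT is reported by Linux as part of "abm".
REQUIRED_V3_FLAGS = {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "abm", "movbe", "xsave"}

def cpu_flags() -> set:
    """Return the CPU feature flags reported for the first CPU in /proc/cpuinfo."""
    with open("/proc/cpuinfo") as cpuinfo:
        for line in cpuinfo:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

missing = REQUIRED_V3_FLAGS - cpu_flags()
print("x86-64-v3 capable" if not missing else f"missing flags: {sorted(missing)}")

If any of these flags are missing, binaries built for x86-64-v3 may fail with illegal-instruction errors on that machine, so such hosts are better served by the standard x86-64-v1 images.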
Try Ubuntu 23.10 with x86-64-v3 on Azure today

The community now has access to resources on Azure to easily evaluate the performance of x86-64-v3 for their workloads, so that they can understand the benefits of migrating and identify where improvements are still required. What is being shared today is experimental and for evaluation and benchmarking purposes only, which means that it won’t receive security updates or the other maintenance updates you would expect for an image used in production.

When x86-64-v3 is introduced for production workloads, there will be a benefit to being able to run both v3 and v1 depending on the workload and hardware available. As is usually the case, the answer to the question of whether to run on a v3 image or a v1 image is ‘it depends’. This image provides the tools to answer that cost, power, and performance optimisation problem.

In addition to the availability of the cloud image on Azure, we’ve also previously posted on the availability of Ubuntu 23.04 rebuilt to target the x86-64-v3 microarchitecture level, and made installer images available from that archive. These are additional tools that the community can use to benchmark when cloud environments can’t be targeted.

To access and use the image on Azure, follow the instructions in our discourse post. Please be sure to leave your feedback there, or contact us directly to discuss your use case.

Further reading

Optimising Ubuntu performance on amd64 architecture
Trying out Ubuntu 23.04 on x86-64-v3 rebuild for yourself

View the full article
-
Performance tuning in Snowflake is the process of optimizing configuration settings and SQL queries to improve the efficiency and speed of data operations. It involves adjusting various settings and writing queries in a way that reduces execution time and resource consumption, ultimately leading to cost savings and enhanced user satisfaction. Performance tuning is crucial in Snowflake for several reasons: View the full article
-