Showing results for tags 'big data'.

Found 5 results

  1. This post is co-written with Amir Souchami and Fabian Szenkier from Unity.

Aura from Unity (formerly known as ironSource) is the market standard for creating rich device experiences that engage and retain customers. With a powerful set of solutions, Aura enables complete digital transformation, letting operators promote key services outside the store, directly on-device.

Amazon Redshift is a recommended service for online analytical processing (OLAP) workloads such as cloud data warehouses, data marts, and other analytical data stores. You can use simple SQL to analyze structured and semi-structured data, operational databases, and data lakes to deliver the best price/performance at any scale.

The Amazon Redshift data sharing feature provides instant, granular, and high-performance access, without data copies or data movement, across multiple Redshift data warehouses in the same or different AWS accounts and across AWS Regions. Data sharing provides live access to data, so you always see the most up-to-date and consistent information as it is updated in the data warehouse.

Amazon Redshift Serverless makes it straightforward to run and scale analytics in seconds without having to set up and manage data warehouse clusters. Redshift Serverless automatically provisions and intelligently scales data warehouse capacity to deliver fast performance for even the most demanding and unpredictable workloads, and you pay only for what you use. You can load your data and start querying right away in the Amazon Redshift Query Editor or in your favorite business intelligence (BI) tool, and continue to enjoy the best price/performance and familiar SQL features in an easy-to-use, zero-administration environment.

In this post, we describe Aura’s successful and swift adoption of Redshift Serverless, which reduced the runtime of their bidding advertisement campaign pipeline from 24 hours to 2 hours. We explore why Aura chose this solution and what technological challenges it helped solve.

Aura’s initial data pipeline

Aura is a pioneer in using Redshift RA3 clusters with data sharing for extract, transform, and load (ETL) and BI workloads. One of Aura’s operations is bidding advertisement campaigns. These campaigns are optimized by an AI-based bid process that requires running hundreds of analytical queries per campaign. The queries run on data that resides in an RA3 provisioned Redshift cluster. The integrated pipeline comprises various AWS services:

  • Amazon Elastic Container Registry (Amazon ECR) for storing Amazon Elastic Kubernetes Service (Amazon EKS) Docker images
  • Amazon Managed Workflows for Apache Airflow (Amazon MWAA) for pipeline orchestration
  • Amazon DynamoDB for storing job-related configuration such as service connection strings and batch sizes
  • Amazon Managed Streaming for Apache Kafka (Amazon MSK) for streaming recently changed and added advertisement campaigns
  • EKSPodOperator in Amazon MWAA for triggering an EKS pod task that runs the data preparation queries for each ad campaign on Aura’s main Redshift provisioned cluster
  • Amazon Redshift provisioned for running ETL jobs, a BI layer, and analytical queries per ad campaign
  • An Amazon Simple Storage Service (Amazon S3) bucket for storing the Redshift query results
  • Amazon MWAA with Amazon EKS for running machine learning (ML) training on the query results using a Python-based ML algorithm

A diagram in the original post illustrates this architecture.
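The post describes the orchestration only at an architectural level. Purely as an illustration, here is a minimal, hypothetical Amazon MWAA DAG sketch of that pattern, using the Airflow Amazon provider's EksPodOperator (the post refers to it as EKSPodOperator); the cluster name, ECR image, script, and campaign list are invented placeholders, not Aura's configuration.

```python
# Hypothetical MWAA DAG: one EKS pod per ad campaign runs the preparation
# and main queries against Redshift. All names and images are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.eks import EksPodOperator

CAMPAIGNS = ["campaign_a", "campaign_b"]  # in practice, consumed from Amazon MSK

with DAG(
    dag_id="campaign_query_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    for campaign in CAMPAIGNS:
        EksPodOperator(
            task_id=f"run_queries_{campaign}",
            cluster_name="analytics-eks",          # placeholder EKS cluster name
            pod_name=f"queries-{campaign}",
            namespace="default",
            # Placeholder ECR image containing the query-runner code
            image="123456789012.dkr.ecr.eu-west-1.amazonaws.com/query-runner:latest",
            cmds=["python", "run_campaign_queries.py"],
            arguments=["--campaign-id", campaign],
            get_logs=True,
        )
```

A real pipeline would derive the campaign list dynamically (for example from the MSK topic mentioned above) and wire these tasks into the downstream ML-training steps.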
Challenges of the initial architecture

The queries for each campaign run in two steps. First, a preparation query filters and aggregates raw data, preparing it for the subsequent operation. This is followed by the main query, which carries out the logic on the preparation query's result set.

As the number of campaigns grew, Aura’s Data team needed to run hundreds of concurrent queries for each of these steps. The existing provisioned cluster was already heavily utilized with data ingestion, ETL, and BI workloads, so the team looked for cost-effective ways to isolate this workload with dedicated compute resources. They evaluated a variety of options, including unloading data to Amazon S3 and a multi-cluster architecture using data sharing and Redshift Serverless. The team gravitated towards the multi-cluster architecture with data sharing because it requires no query rewrites, provides dedicated compute for this specific workload, avoids duplicating or moving data from the main cluster, and offers high concurrency and automatic scaling. Lastly, it is billed in a pay-for-what-you-use model, and provisioning is straightforward and quick.

Proof of concept

After evaluating the options, Aura’s Data team decided to conduct a proof of concept using Redshift Serverless as a consumer of their main Redshift provisioned cluster, sharing just the tables needed to run the required queries. Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs). A single RPU provides 16 GB of memory, and a serverless endpoint can range from 8 RPU to 512 RPU.

Aura’s Data team started the proof of concept with a 256 RPU Redshift Serverless endpoint and gradually lowered the RPU count to reduce costs while keeping query runtime below the required target. Eventually, the team settled on a 128 RPU (2 TB RAM) Redshift Serverless endpoint as the base capacity, while relying on the Redshift Serverless autoscaling feature, which lets hundreds of concurrent queries run by automatically scaling RPUs up as needed.

Aura’s new solution with Redshift Serverless

After a successful proof of concept, the production setup included adding code to switch between the provisioned Redshift cluster and the Redshift Serverless endpoint. The switch is driven by a configurable threshold based on the number of queries waiting to be processed in a specific MSK topic consumed at the beginning of the pipeline: small-scale campaign queries still run on the provisioned cluster, and large-scale queries use the Redshift Serverless endpoint.

The new solution uses an Amazon MWAA pipeline that fetches configuration information from a DynamoDB table, consumes jobs that represent ad campaigns, and then runs hundreds of EKS jobs triggered using EKSPodOperator. Each job runs the two serial queries (the preparation query followed by the main query, which writes its results to Amazon S3). This happens several hundred times concurrently using Redshift Serverless compute resources. The process then initiates another set of EKSPodOperator tasks to run the AI training code on the data saved in Amazon S3. A diagram in the original post illustrates the solution architecture.

Outcome

The overall runtime of the pipeline was reduced from 24 hours to just 2 hours, a 12-times improvement. This integration of Redshift Serverless, coupled with data sharing, cut pipeline duration by over 90% without requiring data duplication or query rewriting.
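The post does not show the switching code, so the following is a speculative sketch of how a pipeline step might route a batch of campaign queries to the provisioned cluster or the Redshift Serverless endpoint based on a threshold read from DynamoDB; the table, attribute names, and endpoints are hypothetical.

```python
# Speculative sketch: choose a Redshift endpoint based on queue depth.
# The DynamoDB table, key names, and endpoints are illustrative placeholders.
import boto3

dynamodb = boto3.resource("dynamodb")
config_table = dynamodb.Table("pipeline-config")  # hypothetical config table


def pick_redshift_endpoint(pending_campaigns: int) -> str:
    """Return the Redshift host to use for this batch of campaign queries."""
    # Assumes a config item with this key exists in the table.
    item = config_table.get_item(Key={"config_id": "campaign-routing"})["Item"]
    threshold = int(item["serverless_threshold"])  # e.g. switch at N waiting campaigns
    if pending_campaigns >= threshold:
        # Large batches go to the Redshift Serverless workgroup endpoint.
        return item["serverless_endpoint"]
    # Small batches stay on the provisioned RA3 cluster.
    return item["provisioned_endpoint"]


# Example: 500 campaigns waiting on the MSK topic -> route to Serverless.
host = pick_redshift_endpoint(pending_campaigns=500)
print(f"Routing campaign queries to {host}")
```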
Moreover, the introduction of a dedicated consumer as an exclusive compute resource significantly eased the load on the producer cluster, so small-scale queries now run even faster.

“Redshift Serverless and data sharing enabled us to provision and scale our data warehouse capacity to deliver fast performance, high concurrency and handle challenging ML workloads with very minimal effort.” – Amir Souchami, Aura’s Principal Technical Systems Architect

Learnings

Aura’s Data team is highly focused on working in a cost-effective manner and has therefore implemented several cost controls in their Redshift Serverless endpoint:

  • Limit the overall spend by setting a maximum RPU-hour usage limit (per day, week, or month) for the workgroup. Aura configured that limit so that when it is reached, Amazon Redshift sends an alert to the relevant Amazon Redshift administrator team. The feature can also write an entry to a system table and even turn off user queries.
  • Use a maximum RPU configuration, which defines the upper limit of compute resources that Redshift Serverless can use at any given time. When the maximum RPU limit is set for the workgroup, Redshift Serverless scales within that limit to continue running the workload.
  • Implement query monitoring rules that prevent wasteful resource utilization and runaway costs caused by poorly written queries.
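As a rough illustration of the first control in the list above, this sketch uses boto3's Redshift Serverless client to create an RPU-hour usage limit on a workgroup; the ARN and amount are placeholders, and parameter names and allowed values should be verified against the current AWS documentation.

```python
# Rough sketch: cap monthly RPU-hours for a Redshift Serverless workgroup.
# The ARN and limit value are placeholders; verify parameters against the
# current boto3 / Redshift Serverless API documentation before using.
import boto3

serverless = boto3.client("redshift-serverless")

response = serverless.create_usage_limit(
    resourceArn="arn:aws:redshift-serverless:eu-west-1:123456789012:workgroup/example-wg",
    usageType="serverless-compute",  # limit compute usage, measured in RPU-hours
    amount=5000,                     # maximum RPU-hours per period (placeholder)
    period="monthly",                # daily and weekly periods are also supported
    breachAction="log",              # alternatives include emitting a metric or deactivating
)
print(response)  # inspect the created usage limit
```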
Conclusion

A data warehouse is a crucial part of any modern data-driven company, enabling you to answer complex business questions and provide insights. The evolution of Amazon Redshift allowed Aura to adapt quickly to business requirements by combining data sharing between provisioned and Redshift Serverless data warehouses. Aura’s journey with Redshift Serverless underscores the potential of strategic technology integration in driving efficiency and operational excellence.

If Aura’s journey has sparked your interest and you are considering implementing a similar solution in your organization, here are some strategic steps to consider:

  • Start by thoroughly understanding your organization’s data needs and how such a solution can address them.
  • Reach out to AWS experts, who can provide you with guidance based on their own experiences, and consider engaging in seminars, workshops, or online forums that discuss these technologies. The following resources are recommended for getting started:
    • Redshift Serverless and data sharing workshop
    • Redshift Serverless overview
  • Implement a proof of concept. Such hands-on experience will provide valuable insights before moving to production.

Elevate your Redshift expertise. Already enjoying the power of Amazon Redshift? Enhance your data journey with the latest features and expert guidance. Reach out to your dedicated AWS account team for personalized support, discover cutting-edge capabilities, and unlock even greater value from your data with Amazon Redshift.

About the Authors

Amir Souchami is the Chief Architect of Aura from Unity, focusing on creating resilient and performant cloud systems and mobile apps at large scale.

Fabian Szenkier is the ML and Big Data Architect at Aura by Unity, working on modern AI/ML solutions and state-of-the-art data engineering pipelines at scale.

Liat Tzur is a Senior Technical Account Manager at Amazon Web Services. She serves as the customer’s advocate and assists her customers in achieving cloud operational excellence in alignment with their business goals.

Adi Jabkowski is a Sr. Redshift Specialist in EMEA, part of the Worldwide Specialist Organization (WWSO) at AWS.

Yonatan Dolan is a Principal Analytics Specialist at Amazon Web Services. He is located in Israel and helps customers harness AWS analytical services to leverage data, gain insights, and derive value.

View the full article
  2. Organizations today rely heavily on big data to drive decision-making and strategize for the future, adapting to an ever-expanding array of data sources, both internal and external. This reliance extends to a variety of tools used to harness the data effectively. In the modern business environment, with an estimated 2.5 quintillion bytes of data generated daily, big data is undoubtedly pivotal in understanding and developing all aspects of an organization's goals. However, known for its vast volume and rapid collection, big data can overwhelm and lead to analysis paralysis if not managed and analyzed objectively. When dissected thoughtfully, though, it can provide the critical insights necessary for strategic advancement.

The evolution of big data in business strategy

In the past, businesses primarily focused on structured data from internal systems; today, they navigate a sea of unstructured data from varied sources. This transition is fueled by key market trends, such as the exponential growth of Internet of Things (IoT) devices and the increasing reliance on cloud computing. Big data analytics has become essential for organizations aiming to derive meaningful insights from this vast, complex data landscape, transcending traditional business intelligence to offer predictive and prescriptive analytics.

Several market trends are driving this big data revolution. The surge in digital transformation initiatives, accelerated by the global pandemic, has seen a significant increase in data creation and usage. Businesses are integrating and analyzing new data sources, moving beyond basic analytics to embrace more sophisticated techniques. Now, it is about refining data strategies to align closely with specific business goals and outcomes. The increasing sophistication of analytics tools, capable of handling the 5 Vs of big data (volume, variety, velocity, veracity, and value), is enabling businesses to tap into the true potential of big data, transforming it from a raw resource into a valuable tool for strategic decision-making.

Practical applications of big data across industries

Big data's influence is evident across various sectors, each utilizing it uniquely for growth and innovation:

  • Transportation: GPS applications use data from satellites and government sources for optimized route planning and traffic management. Aviation analytics process data from flights (about 1,000 gigabytes per transatlantic flight) to enhance fuel efficiency and safety.
  • Healthcare: Wearable devices and embedded sensors collect valuable patient data in real time for predicting epidemic outbreaks and improving patient engagement.
  • Banking and Financial Services: Banks monitor the purchasing behavior of credit cardholders to detect potential fraud, and big data analytics is used for risk management and customer relationship management optimization.
  • Government: Agencies like the IRS and SSA use data analysis to identify tax fraud and fraudulent disability claims. The CDC uses big data to track the spread of infectious diseases.
  • Media and Entertainment: Companies like Amazon Prime and Spotify use big data analytics to recommend personalized content to users.

Implementing big data strategies within organizations requires a nuanced approach. First, identifying relevant data sources and integrating them into a cohesive analytics system is crucial.
For instance, banks have leveraged big data for fraud detection and customer relationship optimization, analyzing patterns in customer transactions and interactions. Additionally, big data aids in personalized marketing, with companies like Amazon using customer data to tailor marketing strategies, leading to more effective ad placements.

The key lies in aligning big data initiatives with specific business objectives, moving beyond mere data collection to generating actionable insights. Organizations need to invest in the right tools and skills to analyze data, ensuring data-driven strategies are central to their decision-making processes. Implementing these strategies can lead to more informed decisions, improved customer experiences, and enhanced operational efficiency.

Navigating data privacy and security concerns

Addressing data privacy and security in big data is crucial, given the legal and ethical implications. With regulations like the GDPR imposing fines for non-compliance, companies must ensure adherence to legal standards. With 81% of consumers increasingly concerned about how their data is used online, robust data governance is essential: companies should establish clear policies for data handling and conduct regular compliance audits.

For data security, a multi-layered approach is essential. Practices include encrypting data, implementing strong access controls, and conducting vulnerability assessments. Advanced analytics for threat detection and a zero-trust security model are also crucial to maintain data integrity and mitigate risks.

Big data predictions and preparations

In the next decade, big data is set to undergo significant transformations, driven by advancements in AI and machine learning. IDC forecasts suggest the global datasphere will reach 175 zettabytes by 2025, underscoring the growing volume and complexity of data. To stay ahead, businesses must invest in scalable data infrastructure and enhance their workforce's analytical skills. Adapting to emerging data privacy regulations and maintaining robust data governance will also be vital. With this proactive approach, businesses will be well placed to use big data successfully, ensuring continued innovation and competitiveness in a data-centric future.

This article was produced as part of TechRadarPro's Expert Insights channel, where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing, find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro

View the full article
  3. I’ve read and watched more than a few articles about ChatGPT in the last couple of months. It seems the large language model AI hype machine just can’t stop. As somebody with a passion for music production, some of the more interesting things I’ve seen included a guy using ChatGPT to build a virtual effect plugin for his DAW (digital audio workstation) that emulates an Ibanez Tube Screamer guitar effects pedal, and a video about getting ChatGPT to write MIDI music scores using Python notebooks. As I’m working on bringing to market a solution for running Spark on Kubernetes, it got me thinking…

May the prompt be with you

Can I get ChatGPT to output a Spark job? Well, there’s only one way to find out, so I signed up for a ChatGPT account over at OpenAI and fired up a prompt. Feeling a bit like a naughty hacker, I was in. I typed in my command:

"Write a pyspark job that ranks Linux distributions by popularity based on issues reported on stackoverflow"

And the output immediately began spewing down my screen. But now the $64k question: will it work? Examining the output, it won’t, because ChatGPT hasn’t provided us with code to scrape StackOverflow.com for the information we need. Let’s see:

"Write a pyspark job to scrape Stackoverflow for the Linux distribution issue report data used as input to the previous job"

ChatGPT comes back with a Python script (not a PySpark job, but OK) to scrape StackOverflow.com. So I fired up an editor and pasted it in. Perhaps needless to say, StackOverflow seems to have changed its HTML layout template since the last time ChatGPT was trained, because the Python script didn’t work out of the box, and tweaks were needed.

When I was a kid in the early 1980s, publishers would sell computer magazines and books with code listings for games in BASIC that you could type into your ZX Spectrum yourself. Alas, they were always full of bugs and would never run first time, and due to the unusual way code had to be input on a Spectrum, this usually meant spending a fair few hours entering the commands before finding out. I’m getting the feeling that ChatGPT might be going the same way. Better get a cup of tea and a biscuit; I feel this is going to be a session.

ChatGPT vs hand-edited script

OK, nice try ChatGPT, but this is going to need a bit of tweaking. After changing the target HTML entities and CSS classes that the script needs to find and process (and lightly restructuring things), I’m able to scrape the data I need from StackOverflow. Here are the original and adapted code listings:

[Listing: Original web scraper listing from ChatGPT]
[Listing: Corrected web scraper listing]

Time to make some parallel, distributed sparks fly

Alright, now that we have the data we need, will the PySpark job that ChatGPT made us actually work? Let’s give it a whirl. Immediately, it won’t, because the fields in the CSV have different names from what the job expects. But that’s an easy tweak, and adapting it wasn’t as bad as I feared. Here’s ChatGPT’s listing, adapted for my needs:

[Listing: Adapted ChatGPT output PySpark job]

The result

Drumroll please, time to find out which is the most popular distro. Of course, no surprises: it’s Ubuntu that gets the most questions, because it’s Ubuntu that gets the most use. In the end there were quite a few changes I needed to make to get a working job, but clearly there’s potential for this technology, especially if you’re new to Spark and data engineering in general. It can give you a starter job quite quickly, but expect to make changes.
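The article's listings are only available in the original post, but the core of a job like the one described is short. Below is a minimal PySpark sketch (not the article's code) that assumes the scraper produced a CSV with a hypothetical distribution column, one row per scraped StackOverflow question.

```python
# Minimal sketch: rank Linux distributions by number of scraped questions.
# Assumes a CSV produced by the scraper with a "distribution" column,
# one row per StackOverflow question; this is not the post's exact listing.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("distro-popularity").getOrCreate()

questions = spark.read.csv("stackoverflow_questions.csv", header=True)

ranking = (
    questions.groupBy("distribution")
    .agg(F.count("*").alias("question_count"))
    .orderBy(F.desc("question_count"))
)

ranking.show()  # Ubuntu comes out on top, as in the article
spark.stop()
```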
If you’re interested in Spark…

You might like to check out our Charmed Spark solution for running Spark on Kubernetes. We recently shipped the beta and are looking for feedback. To get started, visit the Charmed Spark documentation pages and install the spark-client snap. Let us know what you think at https://chat.charmhub.io/charmhub/channels/data-platform, or file bug reports and feature requests on GitHub. View the full article
  4. Learn what big data is, how data is processed and visualized, and key big data terms to know. Read more at Enable Sysadmin. The post Big data basics: What sysadmins need to know appeared first on Linux.com. View the full article
  5. We’re excited to announce that Amazon SageMaker now supports Apache Spark as a pre-built big data processing container. You can now use this container with Amazon SageMaker Processing and take advantage of a fully managed Spark environment for data processing or feature engineering workloads. View the full article
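As a brief, hedged illustration of what this enables, the sketch below launches a Spark job on SageMaker Processing with the SageMaker Python SDK's PySparkProcessor; the IAM role ARN, script name, S3 paths, and framework version are placeholders to adapt to your environment.

```python
# Illustrative sketch: run a PySpark script on SageMaker Processing using the
# pre-built Spark container. Role ARN, script, paths, and version are placeholders.
from sagemaker.spark.processing import PySparkProcessor

processor = PySparkProcessor(
    base_job_name="spark-feature-engineering",
    framework_version="3.1",  # check the currently supported Spark versions
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_type="ml.m5.xlarge",
    instance_count=2,
)

processor.run(
    submit_app="preprocess.py",  # your PySpark script
    arguments=[
        "--input", "s3://example-bucket/raw/",
        "--output", "s3://example-bucket/features/",
    ],
)
```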