Showing results for tags 'big data'.
-
Building big data applications on open source software has become considerably simpler since the advent of projects like Data on EKS, an open source project from AWS that provides blueprints for building data and machine learning (ML) applications on Amazon Elastic Kubernetes Service (Amazon EKS). In the realm of big data, securing data in cloud applications is crucial. This post explores the deployment of Apache Ranger for permission management within the Hadoop ecosystem on Amazon EKS. We show how Ranger integrates with Hadoop components like Apache Hive, Spark, Trino, YARN, and HDFS, providing secure and efficient data management in a cloud environment. Join us as we navigate these advanced security strategies in the context of Kubernetes and cloud computing.

Overview of solution

The Amber Group’s Data on EKS Platform (DEP) is a Kubernetes-based, cloud-centered big data platform that revolutionizes the way we handle data in EKS environments. Developed by Amber Group’s Data Team, DEP integrates with familiar components like Apache Hive, Spark, Flink, Trino, HDFS, and more, making it a versatile and comprehensive solution for data management and BI platforms.

The following diagram illustrates the solution architecture.

Effective permission management is crucial for several key reasons:

- Enhanced security – With proper permission management, sensitive data is only accessible to authorized individuals, which safeguards against unauthorized access and potential security breaches. This is especially important in industries handling large volumes of sensitive or personal data.
- Operational efficiency – By defining clear user roles and permissions, organizations can streamline workflows and reduce administrative overhead. This simplifies managing user access, saves time for data security administrators, and minimizes the risk of configuration errors.
- Scalability and compliance – As businesses grow and evolve, a scalable permission management system helps with smoothly adjusting user roles and access rights. This adaptability is essential for maintaining compliance with data privacy regulations like GDPR and HIPAA, making sure that the organization’s data practices are legally sound and up to date.
- Addressing big data challenges – Big data comes with unique challenges, like managing large volumes of rapidly evolving data across multiple platforms. Effective permission management helps tackle these challenges by controlling how data is accessed and used, protecting data integrity and minimizing the risk of data breaches.

Apache Ranger is a comprehensive framework designed for data governance and security in Hadoop ecosystems. It provides a central place to define, administer, and manage security policies consistently across Hadoop components. Ranger specializes in fine-grained access control, offering detailed management of user permissions and auditing capabilities.

Ranger’s architecture is designed to integrate smoothly with big data tools such as Hadoop, Hive, HBase, and Spark. The key components of Ranger include:

- Ranger Admin – This is the central component where all security policies are created and managed. It provides a web-based user interface for policy management and an API for programmatic configuration.
- Ranger UserSync – This service is responsible for syncing user and group information from a directory service like LDAP or AD into Ranger.
- Ranger plugins – These are installed on each component of the Hadoop ecosystem (like Hive and HBase). Plugins pull policies from the Ranger Admin service and enforce them locally.
- Ranger Auditing – Ranger captures access audit logs and stores them for compliance and monitoring purposes. It can integrate with external tools for advanced analytics on these audit logs.
- Ranger Key Management Store (KMS) – Ranger KMS provides encryption and key management, extending Hadoop’s HDFS Transparent Data Encryption (TDE).

The following flowchart illustrates the priority levels for matching policies.

The priority levels are as follows:

- Deny list takes precedence over allow list
- Deny list exclude has a higher priority than deny list
- Allow list exclude has a higher priority than allow list

Our Amazon EKS-based deployment includes the following components:

- S3 buckets – We use Amazon Simple Storage Service (Amazon S3) for scalable and durable Hive data storage
- MySQL database – The database stores Hive metadata, facilitating efficient metadata retrieval and management
- EKS cluster – The cluster comprises three distinct node groups: platform, Hadoop, and Trino, each tailored for specific operational needs
- Hadoop cluster applications – These applications include HDFS for distributed storage and YARN for managing cluster resources
- Trino cluster application – This application enables us to run distributed SQL queries for analytics
- Apache Ranger – Ranger serves as the central security management tool for access policy across the big data components
- OpenLDAP – This is integrated as the LDAP service to provide a centralized user information repository, essential for user authentication and authorization
- Other cloud services resources – Other resources include a dedicated VPC for network security and isolation

By the end of this deployment process, we will have realized the following benefits:

- A high-performing, scalable big data platform that can handle complex data workflows with ease
- Enhanced security through centralized management of authentication and authorization, provided by the integration of OpenLDAP and Apache Ranger
- Cost-effective infrastructure management and operation, thanks to the containerized nature of services on Amazon EKS
- Compliance with stringent data security and privacy regulations, due to Apache Ranger’s policy enforcement capabilities

Deploy a big data cluster on Amazon EKS and configure Ranger for access control

In this section, we outline the process of deploying a big data cluster on Amazon EKS and configuring Ranger for access control. We use AWS CloudFormation templates for quick deployment of a big data environment on Amazon EKS with Apache Ranger.

Complete the following steps:

Upload the provided template to AWS CloudFormation, configure the stack options, and launch the stack to automate the deployment of the entire infrastructure, including the EKS cluster and Apache Ranger integration. After a few minutes, you’ll have a fully functional big data environment with robust security management ready for your analytical workloads, as shown in the following screenshot.

On the AWS web console, find the name of your EKS cluster. In this case, it’s dep-demo-eks-cluster-ap-northeast-1. Use the name to update your kubeconfig and check the pods, substituting your own cluster name and Region. For example:

    aws eks update-kubeconfig --name dep-eks-cluster-ap-northeast-1 --region ap-northeast-1
    ## Check pod status.
    kubectl get pods --namespace hadoop
    kubectl get pods --namespace platform
    kubectl get pods --namespace trino

After Ranger Admin is successfully forwarded to port 6080 of localhost, go to localhost:6080 in your browser. Log in with user name admin and the password you entered earlier.

By default, two policies, Hive and Trino, have already been created, and all access has been granted to the LDAP user you created (depadmin in this case). Also, the LDAP user sync service is set up and will automatically sync all users from the LDAP service created in this template.

Example permission configuration

In a practical application within a company, permissions for tables and fields in the data warehouse are divided based on business departments, isolating sensitive data for different business units. This provides data security and supports the orderly conduct of daily business operations. The following screenshots show an example business configuration.
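The policy priority levels listed earlier in this post (deny before allow, with each exclude list overriding its own list) can be read as a small decision procedure. The following is a minimal sketch of that reading, not Ranger's actual evaluation code or API: the `PolicySketch` class, the `evaluate` function, the user names, and the `NOT_DETERMINED` outcome for unmatched users are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PolicySketch:
    """Hypothetical, simplified stand-in for one Ranger policy (not Ranger's API)."""
    allow: set = field(default_factory=set)
    allow_exclude: set = field(default_factory=set)
    deny: set = field(default_factory=set)
    deny_exclude: set = field(default_factory=set)


def evaluate(policy: PolicySketch, user: str) -> str:
    """Apply the priority order from the post to a single user."""
    # The deny list is checked first; a deny-exclude entry overrides it.
    if user in policy.deny and user not in policy.deny_exclude:
        return "DENIED"
    # The allow list is checked next; an allow-exclude entry overrides it.
    if user in policy.allow and user not in policy.allow_exclude:
        return "ALLOWED"
    # Nothing matched: this sketch only reports that no decision was reached.
    return "NOT_DETERMINED"


# Made-up users chosen so that each rule is exercised once.
policy = PolicySketch(
    allow={"analyst", "intern", "contractor"},
    allow_exclude={"intern"},
    deny={"contractor", "auditor"},
    deny_exclude={"auditor"},
)
decisions = {u: evaluate(policy, u) for u in ("analyst", "intern", "contractor", "auditor")}
print(decisions)
```

In this sketch, `contractor` is denied even though the allow list also names that user, which is the "deny list takes precedence over allow list" rule, while `intern` and `auditor` fall through because an exclude entry cancels the list that would otherwise have matched.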
The following is an example of an Apache Ranger permission configuration.

The following screenshots show users associated with roles.

Using Hive and Spark as examples, we can compare query results before and after permission configuration.

The following screenshot shows an example of Hive SQL (running on Superset) with privileges denied.

The following screenshot shows an example of Spark SQL (running in an IDE) with privileges denied.

The following screenshot shows an example of Spark SQL (running in an IDE) with permissions granted.

Based on this example and your enterprise requirements, you can manage permissions in the data warehouse flexibly and effectively.

Conclusion

This post provided a guide to permission management in big data, particularly on Amazon EKS using Apache Ranger, to equip you with the essential knowledge and tools for robust data security and management. By implementing the strategies and understanding the components detailed in this post, you can effectively manage permissions and implement data security and compliance in your big data environments.

About the Authors

Yuzhu Xiao is a Senior Data Development Engineer at Amber Group with extensive experience in cloud data platform architecture. He has many years of experience in AWS Cloud platform data architecture and development, primarily focusing on efficiency optimization and cost control of enterprise cloud architectures.

Xin Zhang is an AWS Solutions Architect, responsible for solution consulting and design based on the AWS Cloud platform. He has rich experience in R&D and architecture practice in the fields of system architecture, data warehousing, and real-time computing.
-
This post is co-written with Amir Souchami and Fabian Szenkier from Unity.

Aura from Unity (formerly known as ironSource) is the market standard for creating rich device experiences that engage and retain customers. With a powerful set of solutions, Aura enables complete digital transformation, letting operators promote key services outside the store, directly on-device.

Amazon Redshift is a recommended service for online analytical processing (OLAP) workloads such as cloud data warehouses, data marts, and other analytical data stores. You can use simple SQL to analyze structured and semi-structured data, operational databases, and data lakes to deliver the best price/performance at any scale. The Amazon Redshift data sharing feature provides instant, granular, and high-performance access without data copies and data movement across multiple Redshift data warehouses in the same or different AWS accounts and across AWS Regions. Data sharing provides live access to data so that you always see the most up-to-date and consistent information as it’s updated in the data warehouse.

Amazon Redshift Serverless makes it straightforward to run and scale analytics in seconds without the need to set up and manage data warehouse clusters. Redshift Serverless automatically provisions and intelligently scales data warehouse capacity to deliver fast performance for even the most demanding and unpredictable workloads, and you pay only for what you use. You can load your data and start querying right away in the Amazon Redshift Query Editor or in your favorite business intelligence (BI) tool and continue to enjoy the best price/performance and familiar SQL features in an easy-to-use, zero administration environment.

In this post, we describe Aura’s successful and swift adoption of Redshift Serverless, which allowed them to optimize their overall bidding advertisement campaigns’ time to market from 24 hours to 2 hours.
We explore why Aura chose this solution and what technological challenges it helped solve.

Aura’s initial data pipeline

Aura is a pioneer in using Redshift RA3 clusters with data sharing for extract, transform, and load (ETL) and BI workloads. One of Aura’s operations is bidding advertisement campaigns. These campaigns are optimized by using an AI-based bid process that requires running hundreds of analytical queries per campaign. These queries are run on data that resides in an RA3 provisioned Redshift cluster.

The integrated pipeline comprises various AWS services:

- Amazon Elastic Container Registry (Amazon ECR) for storing Amazon Elastic Kubernetes Service (Amazon EKS) Docker images
- Amazon Managed Workflows for Apache Airflow (Amazon MWAA) for pipeline orchestration
- Amazon DynamoDB for storing job-related configuration such as service connection strings and batch sizes
- Amazon Managed Streaming for Apache Kafka (Amazon MSK) for streaming last changed and added advertisement campaigns
- EKSPodOperator in Amazon MWAA for triggering an EKS pod task that runs the data preparation queries for each ad campaign on Aura’s main Redshift provisioned cluster
- Amazon Redshift provisioned for running ETL jobs, a BI layer, and analytical queries per ad campaign
- An Amazon Simple Storage Service (Amazon S3) bucket for storing the Redshift query results
- Amazon MWAA with Amazon EKS for running machine learning (ML) training on the query results using a Python-based ML algorithm

The following diagram illustrates this architecture.

Challenges of the initial architecture

The queries for each campaign run in the following manner: First, a preparation query filters and aggregates raw data, preparing it for the subsequent operation. This is followed by the main query, which carries out the logic according to the preparation query result set.

As the number of campaigns grew, Aura’s Data team was required to run hundreds of concurrent queries for each of these steps.
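The execution pattern described here, two queries run in sequence for each campaign and many campaigns run at once, can be sketched with a thread pool. This is only an illustration of the pattern under stated assumptions: `run_query` is a hypothetical stand-in that echoes its input instead of calling a warehouse, and the campaign IDs and worker count are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def run_query(sql: str) -> str:
    """Hypothetical stand-in for a real warehouse call; it only echoes the SQL."""
    return f"result of [{sql}]"


def run_campaign(campaign_id: int) -> str:
    # Step 1: the preparation query filters and aggregates raw data.
    prepared = run_query(f"prepare campaign {campaign_id}")
    # Step 2: the main query works on the preparation result, so it waits for step 1.
    return run_query(f"main over {prepared}")


campaign_ids = range(1, 6)
with ThreadPoolExecutor(max_workers=5) as pool:
    # map() returns results in input order even though the jobs overlap in time.
    results = list(pool.map(run_campaign, campaign_ids))

print(results[0])
```

Within one campaign the steps stay serial, while separate campaigns proceed independently, which is why the number of concurrent queries grows with the number of campaigns.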
Aura’s existing provisioned cluster was already heavily utilized with data ingestion, ETL, and BI workloads, so they were looking for cost-effective ways to isolate this workload with dedicated compute resources. The team evaluated a variety of options, including unloading data to Amazon S3 and a multi-cluster architecture using data sharing and Redshift Serverless. The team gravitated towards the multi-cluster architecture with data sharing, as it requires no query rewrite, allows for dedicated compute for this specific workload, avoids the need to duplicate or move data from the main cluster, and provides high concurrency and automatic scaling. Lastly, it’s billed in a pay-for-what-you-use model, and provisioning is straightforward and quick.

Proof of concept

After evaluating the options, Aura’s Data team decided to conduct a proof of concept using Redshift Serverless as a consumer of their main Redshift provisioned cluster, sharing just the relevant tables for running the required queries. Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs). A single RPU provides 16 GB of memory, and a serverless endpoint can range from 8 RPU to 512 RPU.

Aura’s Data team started the proof of concept using a 256 RPU Redshift Serverless endpoint and gradually lowered the RPU to reduce costs while making sure the query runtime was below the required target. Eventually, the team decided to use a 128 RPU (2 TB RAM) Redshift Serverless endpoint as the base RPU, while using the Redshift Serverless auto scaling feature, which allows hundreds of concurrent queries to run by automatically upscaling the RPU as needed.

Aura’s new solution with Redshift Serverless

After a successful proof of concept, the production setup included adding code to switch between the provisioned Redshift cluster and the Redshift Serverless endpoint.
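As a rough sketch of such a switch, together with the RPU sizing arithmetic from the proof of concept, the logic could look like the following. This is not Aura's code: the function names, the example threshold of 200, and the choice to treat a count equal to the threshold as large-scale are assumptions for illustration. Only the 16 GB per RPU figure and the 128 RPU base setting come from the post.

```python
RPU_MEMORY_GB = 16  # from the post: a single RPU provides 16 GB of memory


def rpu_memory_gb(rpus: int) -> int:
    """Total memory, in GB, for a given base RPU setting."""
    return rpus * RPU_MEMORY_GB


def choose_target(pending_queries: int, threshold: int) -> str:
    """Pick where a batch should run; the labels returned are illustrative."""
    if pending_queries < 0:
        raise ValueError("pending_queries must be non-negative")
    # Assumption: a count at or above the threshold counts as large-scale.
    return "serverless" if pending_queries >= threshold else "provisioned"


print(rpu_memory_gb(128))       # 128 * 16 = 2048 GB, i.e. 2 TB
print(choose_target(40, 200))   # small batch stays on the provisioned cluster
print(choose_target(350, 200))  # large batch goes to the serverless endpoint
```

Keeping the threshold as a parameter rather than a constant mirrors the post's description of it as configurable, so it can be tuned without changing the routing code.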
This was done using a configurable threshold based on the number of queries waiting to be processed in a specific MSK topic consumed at the beginning of the pipeline. Small-scale campaign queries would still run on the provisioned cluster, and large-scale queries would use the Redshift Serverless endpoint.

The new solution uses an Amazon MWAA pipeline that fetches configuration information from a DynamoDB table, consumes jobs that represent ad campaigns, and then runs hundreds of EKS jobs triggered using EKSPodOperator. Each job runs the two serial queries (the preparation query followed by a main query, which outputs the results to Amazon S3). This happens several hundred times concurrently using Redshift Serverless compute resources. Then the process initiates another set of EKSPodOperator operators to run the AI training code based on the data result that was saved on Amazon S3.

The following diagram illustrates the solution architecture.

Outcome

The overall runtime of the pipeline was reduced from 24 hours to just 2 hours, a 12-times improvement. This integration of Redshift Serverless, coupled with data sharing, cut pipeline duration by more than 90% without requiring data duplication or query rewrites. Moreover, the introduction of a dedicated consumer as an exclusive compute resource significantly eased the load on the producer cluster, enabling small-scale queries to run even faster.

“Redshift Serverless and data sharing enabled us to provision and scale our data warehouse capacity to deliver fast performance, high concurrency and handle challenging ML workloads with very minimal effort.” – Amir Souchami, Aura’s Principal Technical Systems Architect.

Learnings

Aura’s Data team is highly focused on working in a cost-effective manner and has therefore implemented several cost controls in their Redshift Serverless endpoint:

- Limit the overall spend by setting a maximum RPU-hour usage limit (per day, week, month) for the workgroup.
  Aura configured that limit so that when it is reached, Amazon Redshift sends an alert to the relevant Amazon Redshift administrator team. This feature also allows writing an entry to a system table and even turning off user queries.
- Use a maximum RPU configuration, which defines the upper limit of compute resources that Redshift Serverless can use at any given time. When the maximum RPU limit is set for the workgroup, Redshift Serverless scales within that limit to continue to run the workload.
- Implement query monitoring rules that prevent wasteful resource utilization and runaway costs caused by poorly written queries.

Conclusion

A data warehouse is a crucial part of any modern data-driven company, enabling you to answer complex business questions and provide insights. The evolution of Amazon Redshift allowed Aura to quickly adapt to business requirements by combining data sharing between provisioned and Redshift Serverless data warehouses. Aura’s journey with Redshift Serverless underscores the vast potential of strategic tech integration in driving efficiency and operational excellence.

If Aura’s journey has sparked your interest and you are considering implementing a similar solution in your organization, here are some strategic steps to consider:

- Start by thoroughly understanding your organization’s data needs and how such a solution can address them.
- Reach out to AWS experts, who can provide you with guidance based on their own experiences. Consider engaging in seminars, workshops, or online forums that discuss these technologies. The following resources are recommended for getting started: Redshift Serverless and data sharing workshop; Redshift Serverless overview.
- An important part of this journey would be to implement a proof of concept. Such hands-on experience will provide valuable insights before moving to production.

Elevate your Redshift expertise. Already enjoying the power of Amazon Redshift?
Enhance your data journey with the latest features and expert guidance. Reach out to your dedicated AWS account team for personalized support, discover cutting-edge capabilities, and unlock even greater value from your data with Amazon Redshift.

About the Authors

Amir Souchami is the Chief Architect of Aura from Unity, focusing on creating resilient and performant cloud systems and mobile apps at major scale.

Fabian Szenkier is the ML and Big Data Architect at Aura by Unity. He works on building modern AI/ML solutions and state-of-the-art data engineering pipelines at scale.

Liat Tzur is a Senior Technical Account Manager at Amazon Web Services. She serves as the customer’s advocate and assists her customers in achieving cloud operational excellence in alignment with their business goals.

Adi Jabkowski is a Sr. Redshift Specialist in EMEA, part of the Worldwide Specialist Organization (WWSO) at AWS.

Yonatan Dolan is a Principal Analytics Specialist at Amazon Web Services. He is located in Israel and helps customers harness AWS analytical services to leverage data, gain insights, and derive value.
-
Organizations today rely heavily on big data to drive decision-making and strategize for the future, adapting to an ever-expanding array of data sources, both internal and external. This reliance extends to a variety of tools used to harness this data effectively. In the modern business environment, with an estimated 2.5 quintillion bytes of data generated daily, big data is undoubtedly pivotal in understanding and developing all aspects of an organization's goals. However, known for its vast volume and rapid collection, big data can overwhelm and lead to analysis paralysis if not managed and analyzed objectively. But, when dissected thoughtfully, it can provide the critical insights necessary for strategic advancement.

The evolution of big data in business strategy

In the past, businesses primarily focused on structured data from internal systems, but today, they navigate a sea of unstructured data from varied sources. This transition is fueled by key market trends, such as the exponential growth of Internet of Things (IoT) devices and the increasing reliance on cloud computing. Big data analytics has become essential for organizations aiming to derive meaningful insights from this vast, complex data landscape, transcending traditional business intelligence to offer predictive and prescriptive analytics.

Several market trends are driving this big data revolution. The surge in digital transformation initiatives, accelerated by the global pandemic, has brought a significant increase in data creation and usage. Businesses are integrating and analyzing new data sources, moving beyond basic analytics to embrace more sophisticated techniques. Now, it is about refining data strategies to align closely with specific business goals and outcomes.
The increasing sophistication of analytics tools, capable of handling the 5 Vs of big data - volume, variety, velocity, veracity, and vulnerability - is enabling businesses to tap into the true potential of big data, transforming it from a raw resource into a valuable tool for strategic decision-making.

Practical applications of big data across industries

Big data's influence is evident across various sectors, each utilizing it uniquely for growth and innovation:

- Transportation: GPS applications use data from satellites and government sources for optimized route planning and traffic management. Aviation analytics process data from flights (about 1,000 gigabytes per transatlantic flight) to enhance fuel efficiency and safety.
- Healthcare: Wearable devices and embedded sensors are often employed to collect valuable patient data in real-time for predicting epidemic outbreaks and improving patient engagement.
- Banking and Financial Services: Banks monitor the purchase behavioral pattern of credit cardholders to detect potential fraud. Big data analytics are used for risk management and customer relationship management optimization.
- Government: Agencies like the IRS and SSA use data analysis to identify tax fraud and fraudulent disability claims. The CDC uses big data to track the spread of infectious diseases.
- Media and Entertainment: Companies like Amazon Prime and Spotify use big data analytics to recommend personalized content to users.

Implementing big data strategies within organizations requires a nuanced approach. First, identifying relevant data sources and integrating them into a cohesive analytics system is crucial. For instance, banks have leveraged big data for fraud detection and customer relationship optimization, analyzing patterns in customer transactions and interactions. Additionally, big data aids in personalized marketing, with companies like Amazon using customer data to tailor marketing strategies, leading to more effective ad placements.
The key lies in aligning big data initiatives with specific business objectives, moving beyond mere data collection to generating actionable insights. Organizations need to invest in the right tools and skills to analyze data, ensuring data-driven strategies are central to their decision-making processes. Implementing these strategies can lead to more informed decisions, improved customer experiences, and enhanced operational efficiency.

Navigating data privacy and security concerns

Addressing data privacy and security in big data is crucial, given the legal and ethical implications. With regulations like the GDPR imposing fines for non-compliance, companies must ensure adherence to legal standards. 81% of consumers are increasingly concerned about online data usage, highlighting the need for robust data governance. Companies should establish clear policies for data handling and conduct regular compliance audits.

For data security, a multi-layered approach is essential. Practices include encrypting data, implementing strong access controls, and conducting vulnerability assessments. Advanced analytics for threat detection and a zero-trust security model are also crucial to maintain data integrity and mitigate risks.

Big data predictions and preparations

In the next decade, big data is set to undergo significant transformations, driven by advancements in AI and machine learning. IDC forecasts suggest the global data sphere will reach 175 zettabytes by 2025, underscoring the growing volume and complexity of data. To stay ahead, businesses must invest in scalable data infrastructure and enhance their workforce's analytical skills. Adapting to emerging data privacy regulations and maintaining robust data governance will also be vital. With this proactive approach, businesses will be set to successfully utilize big data, ensuring continued innovation and competitiveness in a data-centric future.
This article was produced as part of TechRadarPro's Expert Insights channel where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro
-
I’ve read and watched more than a few articles about ChatGPT in the last couple of months. It seems the large language model AI hype machine just can’t stop. As somebody with a passion for music production, some of the more interesting things I’ve seen included a guy using ChatGPT to build a virtual effect plugin for his DAW (digital audio workstation) that emulates an Ibanez Tube Screamer guitar effects pedal, and this video about getting ChatGPT to write MIDI music scores using Python notebooks. As I’m working on bringing to market a solution for running Spark on Kubernetes, it got me thinking…

May the prompt be with you

Can I get ChatGPT to output a Spark job? Well, there’s only one way to find out, so I signed up for a ChatGPT account over at OpenAI and fired up a prompt. Feeling a bit like a naughty hacker, I was in. I typed in my command:

    Write a pyspark job that ranks Linux distributions by popularity based on issues reported on stackoverflow

And the output immediately began spewing down my screen. But now the $64k question. Will it work?

Examining the output, it won’t work, because ChatGPT hasn’t provided us with code to scrape StackOverflow.com for the information we need. Let’s see:

    Write a pyspark job to scrape Stackoverflow for the Linux distribution issue report data used as input to the previous job

ChatGPT comes back with a Python script (not a PySpark job, but OK) to scrape StackOverflow.com. So I fired up an editor and pasted it in. Perhaps needless to say, StackOverflow seems to have changed its HTML layout template since the last time ChatGPT was trained, because the Python script didn’t work out of the box, and tweaks were needed.

When I was a kid in the early 1980s, publishers would sell computer magazines and books with code listings for games in BASIC that you could program into your ZX Spectrum yourself.
Alas, they were always full of bugs and would never run first time, and due to the unusual way code had to be input on a Spectrum, this usually meant spending a fair few hours inputting the commands before finding out. I’m getting the feeling that ChatGPT might be going the same way. Better get a cup of tea and a biscuit; I feel this is going to be a session.

ChatGPT vs hand edited script

OK, nice try ChatGPT, but this is going to need a bit of tweaking. I needed to change the target HTML entities and CSS classes that the script needs to find and process (and lightly restructure things). With those changes, I’m able to scrape the data I need from StackOverflow. Here are the original and the adapted code listings.

Original web scraper listing from ChatGPT

Corrected web scraper listing

Time to make some parallel, distributed sparks fly

Alright, so now that we have the data we need, will that PySpark job that ChatGPT made us actually work? Let’s give it a whirl. Well, immediately, it won’t work, because the fields in the CSV have different names from what the job expects. But that’s an easy tweak. Here’s ChatGPT’s listing, but adapted for my needs. That wasn’t as bad as I feared.

Adapted ChatGPT output PySpark job

The result

Drumroll please, time to find out which is the most popular distro:

Of course, no surprises: it’s Ubuntu that gets the most questions, because it’s Ubuntu that gets the most use. In the end there were quite a few changes that I needed to make to get a working job, but clearly there’s potential for this technology, especially if you’re new to Spark and data engineering in general. It can give you a starter job quite quickly, but expect to make changes.

If you’re interested in Spark…

You might like to check out our Charmed Spark solution for running Spark on Kubernetes. We recently shipped the Beta and are looking for feedback. To get started, visit the Charmed Spark documentation pages and install the spark-client snap.
Let us know what you think at https://chat.charmhub.io/charmhub/channels/data-platform or file bug reports and feature requests in GitHub.
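The code listings from this post are not reproduced here, so as a rough illustration of the ranking step only, the following plain-Python sketch counts questions per distribution and sorts them. It is not the ChatGPT output or the adapted PySpark job: the sample rows are made up, and the `distro` column name and the `rank_distros` function are assumptions.

```python
import csv
import io
from collections import Counter

# Made-up sample rows standing in for the scraped CSV; the column names are assumptions.
SAMPLE_CSV = """distro,title
ubuntu,Cannot install package
debian,Network interface missing
ubuntu,Snap fails to refresh
fedora,Kernel update breaks boot
ubuntu,Wrong Python version
debian,Locale not set
"""


def rank_distros(csv_text: str) -> list:
    """Count questions per distribution and sort from most to least frequent."""
    rows = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row["distro"] for row in rows)
    # Sort by count descending, then by name, so ties come out in a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_distros(SAMPLE_CSV)
for position, (distro, count) in enumerate(ranking, start=1):
    print(position, distro, count)
```

In PySpark, the same group, count, and sort shape would typically be written with `groupBy`, `count`, and `orderBy` on a DataFrame, which lets the counting run in parallel across a cluster. As the post notes, the field names in the CSV have to match what the job expects.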
-
We’re excited to announce Amazon SageMaker now supports Apache Spark as a pre-built big data processing container. You can now use this container with Amazon SageMaker Processing and take advantage of a fully managed Spark environment for data processing or feature engineering workloads.
-
Forum Statistics
63.6k
Total Topics61.7k
Total Posts