Search the Community
Showing results for tags 'data engineering'.
-
AWS data engineering involves designing and implementing data solutions on the Amazon Web Services (AWS) platform. For those aspiring to become AWS data engineers, cracking the interview is somehow difficult. Don’t worry, we’re here to help you! In this blog, we present a comprehensive collection of top AWS data engineer interview questions for you. These questions have been carefully selected to cover a wide range of topics and concepts that are relevant to the AWS Data Engineer role. Understanding the concepts behind these questions would help you to successfully go through the interview. If you are planning to become AWS Data Engineer, I would recommend you to pass AWS Data Engineer Certification exam. This exam could potentially cover many topics related to the data engineer role. Let’s dive in! Top 25 AWS Data Engineer Interview Questions and Answers Below are some AWS data engineer questions and answers that you might encounter during an interview: 1. What is the role of a data engineer at AWS? As an AWS Data Engineer, your core responsibility is to plan, create, manage, and enhance an organization’s data infrastructure. This covers everything from assembling systems for data processing and storage to connecting diverse data sources and ensuring the efficiency and dependability of the data pipeline. 2. What are the common challenges faced by AWS data Engineers? Data engineers at AWS frequently deal with issues including handling complicated data pipelines, managing massive amounts of data, integrating various data sources, and maintaining the performance and dependability of the data infrastructure. Working with remote systems, addressing privacy and security issues, and handling real-time data processing could present additional difficulties. 3. What are the tools used for data engineering? The following are some of the tools that are employed for doing the data engineering tasks: Data ingestion Storage Data integration Data visualization tools Data warehouse 4. What exactly is Amazon S3? Amazon Simple Storage Service (Amazon S3), is an object storage service that offers scalable and affordable data storage. Data lakes, backup and recovery, and disaster recovery are among its frequent uses. 5. What does Amazon EC2 do? A web service called Amazon Elastic Compute Cloud (Amazon EC2) offers scalable computing capability in the cloud. Batch processing, web and application hosting, and other compute-intensive operations are among its frequent uses. 6. What is Amazon Redshift? Amazon Redshift is a fully managed data warehouse that helps to process large volumes of data easily and affordably. It is frequently utilized for corporate intelligence and data warehousing activities. 7. What is Amazon Glue, and how does it make the Extract, Transform, and Load (ETL) process easier? AWS Data migration between data stores is made simple with a fully managed ETL solution called AWS Glue. It eliminates manual coding by automating the extract, transform, and load procedures. Glue crawlers can find and categorize information from various data sources. Glue’s ETL processes can transform and upload the data into target data storage. This speeds up the creation of data pipelines and streamlines the ETL process. 8. What is the role of Amazon Quicksight in data visualization for AWS data engineering solutions? Amazon QuickSight is a fully managed business intelligence service that can generate and distribute interactive reports and dashboards. 
QuickSight can be used in data engineering to display data produced by data pipelines and connect to a variety of data sources, including those on AWS. It offers a user-friendly interface for building visualizations, enabling people to learn from their data without requiring a deep understanding of code or analysis. 9. Describe the idea behind AWS Data Pipeline and how it helps to coordinate data activities. AWS Data Pipeline is a web service that facilitates the coordination and automation of data transfer and transformation across various AWS services and data sources that are located on-premises. It makes complex data processing processes easier to handle by enabling you to build and schedule data-driven workflows. When it comes to data engineering, data pipelines are especially helpful for organizing tasks like data extraction, transformation, and loading (ETL). 10. How do data engineering migrations benefit from the use of AWS DMS (Database Migration Service)? AWS DMS makes it easier to move databases to and from Amazon Web Services. DMS is frequently used in data engineering to migrate databases, either across different cloud database systems or from on-premises databases to the cloud. By controlling schema conversion, and data replication, and guaranteeing little downtime throughout the move, DMS streamlines the process. 11. How does AWS Glue support schema evolution in data engineering? AWS Glue facilitates the evolution of schemas by permitting modifications to data structures over time. Glue can dynamically adapt its understanding of the data structure whenever fresh data with varied schemas arrives. Because datasets may vary over time, flexibility is essential in data engineering. Glue’s ability to adjust to schema changes makes managing dynamic, changing data easier. 12. Describe the role that AWS Data Lakes play in contemporary data engineering architectures. AWS Centralized repositories known as “data lakes” let you store data of any size, both structured and unstructured. They enable effective data processing, analysis, and storage, which lays the groundwork for developing analytics and machine learning applications. Data Lakes are essential for managing and processing heterogeneous datasets from several sources in data engineering. 13. How can AWS CodePipeline be utilized to automate a CI/CD pipeline for a multi-tier application effectively? Automating CI/CD pipeline for a multi-tier application can be done effectively by following the below steps: Pipeline Creation: Begin by establishing a pipeline within AWS CodePipeline, specifying the source code repository, whether it’s GitHub, AWS CodeCommit, or another source. Build Stage Definition: Incorporate a build stage into the pipeline, connecting to a building service such as AWS CodeBuild. This stage will handle tasks like code compilation, testing, and generating deployable artifacts. Deployment Stage Setup: Configure deployment stages tailored to each tier of the application. Utilize AWS services like CodeDeploy for automated deployments to Amazon EC2 instances, AWS Elastic Beanstalk for web applications, or AWS ECS for containerized applications. Incorporate Approval Steps (Optional): Consider integrating manual approval steps before deployment stages, particularly for critical environments. This ensures quality control and allows stakeholders to verify changes before deployment. Continuous Monitoring and Improvement: Monitor the pipeline’s performance and adjust as needed. 
Emphasize gathering feedback and iterating on the deployment process to enhance efficiency and effectiveness over time. 14. How to handle continuous integration and deployment in AWS DevOps? Managing continuous integration and deployment in AWS DevOps involves using AWS Developer Tools effectively. Start by storing and versioning your application’s source code using these tools. Next, employ services like AWS CodePipeline to orchestrate the build, testing, and deployment processes. CodePipeline is the core, integrating seamlessly with AWS CodeBuild for compiling and testing code, and AWS CodeDeploy for automating deployments across different environments. This structured approach ensures smooth and automated workflows for continuous integration and delivery. 15. What is AWS Glue Spark Runtime, and how does it utilize Apache Spark for distributed data processing? AWS Glue Spark Runtime is the foundational runtime engine for AWS Glue ETL jobs. It utilizes Apache Spark, an open-source distributed computing framework, to process extensive datasets concurrently. By integrating with Spark, Glue can horizontally scale and effectively manage intricate data transformations within data engineering workflows. 16. What role does AWS Glue Data Wrangler play in automating and visualizing data transformations within ETL workflows? AWS Glue Data Wrangler streamlines and visually represents data transformations by offering a user-friendly interface for constructing data preparation workflows. It furnishes pre-configured transformations and enables users to design ETL processes visually, eliminating the need for manual code writing. In the realm of data engineering, Data Wrangler expedites and simplifies the creation of ETL jobs, thereby broadening its accessibility to a wider user base. 17. What is the purpose of AWS Glue Schema Evolution? AWS Glue Schema Evolution serves as a capability that enables the Glue Data Catalog to adjust to changes in the structure of the source data over time. Whenever modifications occur to the schema of the source data, Glue can automatically revise its comprehension of the schema. This capability facilitates ETL jobs to effortlessly handle evolving data. Such functionality is paramount in data engineering for effectively managing dynamic and evolving datasets. 18. What is the importance of AWS Glue DataBrew’s data profiling features? AWS Glue DataBrew’s data profiling features enable users to examine and grasp the attributes of datasets thoroughly. Profiling encompasses insights into data types, distributions, and potential quality concerns. In the realm of data engineering, data profiling proves valuable for obtaining a holistic understanding of the data and pinpointing areas necessitating cleaning or transformation. 19. What is the role of AWS Glue Dev Endpoint? AWS Glue Dev Endpoint serves as a development endpoint enabling users to iteratively develop, test, and debug ETL scripts interactively, utilizing tools such as PySpark or Scala. It furnishes an environment for executing and validating code before deployment in production ETL jobs. In the domain of data engineering, the Dev Endpoint streamlines the development and debugging phases, thereby enhancing the efficiency of ETL script development. 20. What is AWS Glue Crawler? The role of AWS Glue Crawler is pivotal in data engineering as it handles the automatic discovery and cataloging of data metadata. By scanning and extracting schema details from diverse data repositories, it populates the Glue Data Catalog. 
This component is vital for maintaining a centralized and current metadata repository, facilitating streamlined data discovery and processing workflows. 21. What is an operational data store (ODS)? An operational data store (ODS) serves as a centralized database that gathers and organizes data from multiple sources in a structured manner. It acts as a bridge between source systems and data warehouses or data marts, facilitating operational reporting and analysis. Incremental data loading is a strategy employed to update data in a target system with efficiency. Instead of reloading all data each time, only the new or modified data since the last update is processed. This method minimizes data transfer and processing requirements, leading to enhanced performance and reduced resource usage. 22. What are the stages and types of ETL testing ETL testing is vital for ensuring the accuracy, completeness, and reliability of data processing pipelines. Here are the common stages and types of ETL testing: Data source testing: This stage involves validating the data sources to ensure that they are reliable and accurate. It includes verifying data integrity and confirming that the data meets the expected quality standards. Data transformation testing: In this stage, the focus is on ensuring that the data transformations are applied correctly as per the defined business rules. It involves verifying that the data is transformed accurately and consistently according to the requirements. Data load testing: This stage involves testing the loading of data into the target system. It includes verifying the integrity of the data loaded into the target system and ensuring that it matches the source data. End-to-end testing: This comprehensive testing stage validates the entire ETL process from source to target. It includes testing the entire data flow, including data extraction, transformation, and loading, to ensure that the process is functioning correctly and producing the expected results. By performing these stages and types of ETL testing, organizations can ensure the reliability and accuracy of their data processing pipelines, leading to better decision-making and improved business outcomes. 23. How does AWS support the creation and management of data lakes? AWS offers a variety of services and tools designed specifically for building and maintaining data lakes, which serve as centralized repositories for storing structured, semi-structured, and unstructured data in its raw format. These include: Amazon S3: A highly scalable object storage service that allows for the storage and retrieval of data within a data lake. AWS Glue: A fully managed ETL (Extract, Transform, Load) service that facilitates data integration and transformation tasks within the data lake environment. AWS Lake Formation: A specialized service aimed at simplifying the process of building and managing secure data lakes on the AWS platform. You can take advantage of the AWS data engineer practice test to become familiar with the above AWS services. 24. What are the partitioning and data loading techniques employed in AWS Redshift? In AWS Redshift, partitioning is a method utilized to segment large datasets into smaller partitions based on specific criteria such as date, region, or product category. This enhances query performance by reducing the volume of data that needs to be scanned. 
Regarding data loading techniques, AWS Redshift supports: Bulk data loading: This involves importing large volumes of data from sources like Amazon S3 or other external data repositories. Continuous data ingestion: Redshift enables ongoing data ingestion using services like Amazon Kinesis or AWS Database Migration Service (DMS), ensuring real-time updates to the data warehouse. Automatic compression and columnar storage: Redshift employs automatic compression and columnar storage techniques to optimize data storage and retrieval efficiency. 25. What is AWS Redshift and what are its key components? AWS Redshift is a fully managed data warehousing solution provided by AWS, capable of handling petabyte-scale data. Its critical components include: Clusters: These are groups of nodes (compute resources) responsible for storing and processing data within Redshift. Leader node: This node serves as the coordinator, managing and distributing queries across the compute nodes within the cluster. Compute nodes: These nodes are dedicated to executing queries and performing various data processing tasks within the Redshift environment. Conclusion Hope this article provides a comprehensive roadmap of AWS Cloud Data Engineer interview questions suitable for candidates at different levels of expertise. It covers questions ranging from beginners who are just starting to explore AWS to seasoned professionals aiming to advance their careers. These interview questions not only equip you to address interview questions but also encourage you to delve deeply into the AWS platform, enriching your comprehension and utilization of its extensive capabilities. Make use of the AWS data engineer practice exam to experience the real-time exam settings and boost your confidence level. Enhance your AWS Data Engineer interview readiness with our AWS hands-on labs and AWS Sandboxes! View the full article
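To make the Glue answers above (questions 7 and 20) more concrete, here is a minimal boto3 sketch of triggering a crawler and an ETL job. This is an illustration rather than part of the original article: the crawler and job names are placeholders and would need to exist in your account already.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Catalog newly arrived S3 data, then kick off the ETL job that consumes it.
glue.start_crawler(Name="sales-raw-crawler")         # placeholder crawler name
run = glue.start_job_run(JobName="sales-etl-job")    # placeholder Glue job name

status = glue.get_job_run(JobName="sales-etl-job", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])  # e.g. RUNNING, SUCCEEDED, FAILED
```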
-
The generative AI revolution is transforming the way that teams work, and Databricks Assistant leverages the best of these advancements. It allows you... View the full article
Tagged with: data engineering, ai (and 1 more)
-
In today’s data-driven world, developer productivity is essential for organizations to build effective and reliable products, accelerate time to value, and fuel ongoing innovation. To deliver on these goals, developers must have the ability to manipulate and analyze information efficiently. Yet while SQL applications have long served as the gateway to access and manage data, Python has become the language of choice for most data teams, creating a disconnect. Recognizing this shift, Snowflake is taking a Python-first approach to bridge the gap and help users leverage the power of both worlds. Our previous Python connector API, primarily available for those who need to run SQL via a Python script, enabled a connection to Snowflake from Python applications. This traditional SQL-centric approach often challenged data engineers working in a Python environment, requiring context-switching and limiting the full potential of Python’s rich libraries and frameworks. Since the previous Python connector API mostly communicated via SQL, it also hindered the ability to manage Snowflake objects natively in Python, restricting data pipeline efficiency and the ability to complete complex tasks. Snowflake’s new Python API (in public preview) marks a significant leap forward, offering a more streamlined, powerful solution for using Python within your data pipelines — and furthering our vision to empower all developers, regardless of experience, with a user-friendly and approachable platform. A New Era: Introducing Snowflake’s Python API With the new Snowflake Python API, readily available through pip install snowflake, developers no longer need to juggle between languages or grapple with cumbersome syntax. They can effortlessly leverage the power of Python for a seamless, unified experience across Snowflake workloads encompassing data engineering, Snowpark, machine learning and application development. This API is a testament to Snowflake’s commitment to a Python-first approach, offering a plethora of features designed to streamline workflows and enhance developer productivity. Key benefits of the new Snowflake Python API include: Simplified syntax and intuitive API design: Featuring a Pythonic design, the API is built on the foundation of REST APIs, which are known for their clarity and ease of use. This allows developers to interact with Snowflake objects naturally and efficiently, minimizing the learning curve and reducing development time. Rich functionality and support for advanced operations: The API goes beyond basic operations, offering comprehensive functionality for managing various Snowflake resources and performing complex tasks within your Python environment. This empowers developers to maximize the full potential of Snowflake through intuitive REST API calls. Enhanced performance and improved scalability: Designed with performance in mind, the API leverages the inherent scalability of REST APIs, enabling efficient data handling and seamless scaling to meet your growing data needs. This allows your applications to handle large data sets and complex workflows efficiently. Streamlined integration with existing tools and frameworks: The API seamlessly integrates with popular Python data science libraries and frameworks, enabling developers to leverage their existing skill sets and workflows effectively. This integration allows developers to combine the power of Python libraries with the capabilities of Snowflake through familiar REST API structures. 
By prioritizing the developer experience and offering a comprehensive, user-friendly solution, Snowflake’s new Python API paves the way for a more efficient, productive and data-driven future. Getting Started with the Snowflake Python API Our Quickstart guide makes it easy to see how the Snowflake Python API can manage Snowflake objects. The API allows you to create, delete and modify tables, schemas, warehouses, tasks and much more. In this Quickstart, you’ll learn how to perform key actions — from installing the Snowflake Python API to retrieving object data and managing Snowpark Container Services. Dive in to experience how the enhanced Python API streamlines your data workflows and unlocks the full potential of Python within Snowflake. To get started, explore the comprehensive API documentation, which will guide you through every step. We recommend that Python developers prioritize the new API for data engineering tasks since it offers a more intuitive and efficient approach compared to the legacy SQL connector. While the Python API connector remains available for specific SQL use cases, the new API is designed to be your go-to solution. By general availability, we aim to achieve feature parity, empowering you to complete 100% of your data engineering tasks entirely through Python. This means you’ll only need to use SQL commands if you truly prefer them or for rare unsupported functionalities. The New Wave of Native DevOps on Snowflake The Snowflake Python API release is among a series of native DevOps tools becoming available on the Snowflake platform — all of which aim to empower developers of every experience level with a user-friendly and approachable platform. These benefits extend far beyond the developer team. The 2023 Accelerate State of DevOps Report, the annual report from Google Cloud’s DevOps Research and Assessment (DORA) team, reveals that a focus on user-centricity around the developer experience leads to a 40% increase in organizational performance. With intuitive tools for data engineers, data scientists and even citizen developers, Snowflake strives to enhance these advantages by fostering collaboration across your data and delivery teams. By offering the flexibility and control needed to build unique applications, Snowflake aims to become your one-stop shop for data — minimizing reliance on third-party tools for core development lifecycle use cases and ultimately reducing your total cost of ownership. We’re excited to share more innovations soon, making data even more accessible for all. For a deeper dive into Snowflake’s Python API and other native Snowflake DevOps features, register for the Snowflake Data Cloud Summit 2024. Or, experience these features firsthand at our free Dev Day event on June 6th in the Demo Zone. The post Snowflake’s New Python API Empowers Data Engineers to Build Modern Data Pipelines with Ease appeared first on Snowflake. View the full article
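As a rough idea of what the new API looks like in practice, here is a minimal sketch based on the Quickstart pattern described above. The module paths and signatures reflect the public preview and may change; the connection parameters, database and warehouse names are placeholders, so treat this as an outline rather than a definitive implementation.

```python
from snowflake.core import Root
from snowflake.core.database import Database
from snowflake.core.warehouse import Warehouse
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
}
session = Session.builder.configs(connection_parameters).create()
root = Root(session)

# Create Snowflake objects with plain Python instead of SQL strings.
root.databases.create(Database(name="DEMO_DB"))
root.warehouses.create(Warehouse(name="DEMO_WH", warehouse_size="SMALL"))

# Enumerate existing databases.
for db in root.databases.iter():
    print(db.name)
```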
Tagged with: python, data engineering (and 2 more)
-
Modern teams need a more robust data integration solution than GTM. Here are 4 of the reasons GTM and data engineers don’t get along plus a better solution.View the full article
Tagged with: data engineering, google tag manager (and 2 more)
-
When it comes to Reverse ETL, the business use cases usually get all the attention. Here, we focus on how it makes data engineering easier. View the full article
Tagged with: data engineering, event streaming (and 1 more)
-
Data engineering, particularly with Amazon Web Services (AWS), has evolved into an appealing and financially rewarding career path. The growing need for data engineers has elevated the salary spectrum within the field. But before diving in, there is an important question to answer: "What does an AWS Data Engineer salary look like?" No need to fret! Keep reading this blog. This article covers the essential information about the AWS Certified Data Engineer Associate, delving into typical salaries and thoroughly examining the factors that influence the salary range. To shine in data engineering, taking the AWS Certified Data Engineer Associate certification can be an ideal choice. Let's get started!

Role of Data Engineers in Today's World

Data engineers play a crucial role in developing the systems that handle data storage, extraction, and processing. They are responsible for constructing and maintaining databases for applications while overseeing the infrastructure necessary for their operation. As a data engineer, your responsibilities may include managing a SQL data store and a MongoDB NoSQL data warehouse, ensuring accessibility and functionality. Collaborating within a team alongside software engineers, developers, data analysts, and designers, data engineers contribute their expertise to gather and manipulate data, driving essential business objectives. The specific duties of a data engineer can vary across organizations and include tasks such as:

- Designing efficient data store indexes
- Selecting appropriate storage technologies (SQL or NoSQL)
- Maintaining data stores
- Replicating data across multiple machines
- Tuning data warehouses
- Creating and validating query plans
- Identifying patterns in historical data
- Analyzing and optimizing database performance

AWS Certified Data Engineer - Associate Certification: An Overview

AWS has recently introduced the AWS Certified Data Engineer - Associate certification exam. It serves as an excellent entry point for individuals seeking to delve into advanced specialty themes in AWS, even without prior data experience. Conversely, for those already engaged in data-related roles, this certification offers a valuable opportunity to deepen their comprehension of AWS by utilizing specialized services they likely already engage with. While acquiring these skills was always possible without formal certification, the introduction of a structured certification pathway not only encourages learners to pursue certification but also motivates training providers to address skill gaps by offering specialized guidance and resources.

This certification confirms your expertise in core AWS data services, assessing your skills in configuring data pipelines, proficiently managing monitoring and troubleshooting, and optimizing performance while adhering to industry best practices. Obtaining the AWS Data Engineer certification can therefore significantly enhance earning potential by validating proficiency in core AWS data services, data pipeline configuration, and effective management of monitoring and troubleshooting.

Role of an AWS Certified Data Engineer Associate

For those new to the field of data engineering, enrolling in the AWS Data Engineer Certification Beta course is a valuable option.
The AWS Certified Data Engineer Associate certification exam (DEA-C01) follows the Associate-level tests for Solutions Architects, Developers, and SysOps Administrators, making it the company's fourth Associate-level certification. AWS Certified Data Engineer Associates are responsible for tasks such as:

- Ingesting and transforming data
- Orchestrating data pipelines while applying programming concepts
- Operationalizing, maintaining, and monitoring data pipelines
- Identifying the most suitable data storage solution, crafting effective data models, and organizing data schemas efficiently
- Overseeing the entire lifecycle of data, from creation to disposal
- Evaluating and ensuring the quality of data through thorough analysis
- Enforcing proper measures such as authentication, authorization, data encryption, privacy, and governance for effective data management

Also read: AWS Data Engineer Associate Certification guide

AWS Data Engineer Salary

AWS Data Engineer Salaries in India

The average annual salary for an AWS Data Engineer in India is ₹21,20,567, with additional cash compensation averaging ₹13,87,883.

AWS Data Engineer Salaries in the USA

The average annual salary for an AWS Data Engineer in the United States is $129,716. In hourly terms, this averages approximately $62.36 per hour, which is equivalent to $2,494 per week or $10,809 per month.

Factors Influencing AWS Data Engineer Associate Salary

The salary of an AWS Certified Data Engineer Associate is influenced by a variety of factors that collectively shape the compensation landscape for professionals in this field. Understanding these factors is crucial both for aspiring data engineers and for those looking to negotiate their salaries. Here are the key elements:

Experience Level

Salaries often differ based on experience level: entry-level candidates earn less than those with several years of hands-on experience in the data engineering field. For example, an entry-level AWS Certified Data Engineer Associate can earn an average of $124,786 per year, while a senior AWS Data Engineer salary is typically around $175,000 per year.

Certification

Even though formal education is significant, taking relevant certification courses can help you move into data engineering roles. Your value as an AWS Data Engineer Associate can be maximized if you also hold certifications such as the AWS Certified Big Data – Specialty certification, Google Professional Data Engineer, and similar credentials.

Location

Location stands out as one of the paramount factors influencing an AWS Data Engineer salary. The geographical setting where a professional works significantly shapes the compensation landscape, reflecting variations in living costs, market demands, and economic conditions. The locations where data engineers can earn the highest salaries include Seattle, Maryland, and Washington, with average salaries of over $211,350 per year.

Skill Set

To shine in the data engineering field, you must possess the following skills:

ETL Tools: Understanding and utilizing various ETL tools is crucial for effective data management. These tools enable professionals to extract data from diverse sources, transform it according to specific requirements, and load it into databases or data warehouses. Examples of popular ETL tools include Apache NiFi, Talend, Informatica, and Microsoft SSIS.
SQL: SQL (Structured Query Language) is a fundamental language for interacting with databases. Given that substantial volumes of data are typically stored in expansive data warehouses, proficiency in SQL is imperative. It empowers data engineers to retrieve, manipulate, and manage data efficiently.

Python: Programming languages play a pivotal role in performing ETL tasks and data management activities. Python stands out as one of the most versatile and widely used languages for these purposes. Data engineers often leverage Python for scripting, automation, and executing various data-related tasks.

Big Data Tools and Cloud Storage: Dealing with extensive datasets is a common aspect of a data engineer's role. Therefore, familiarity with big data tools such as Hadoop and Spark, and cloud storage solutions like Amazon S3, Google Cloud Storage, and Azure Data Lake Storage (ADLS), is crucial. These tools streamline the handling and processing of large-scale data.

Query Engines: Proficiency in query engines like Apache Spark and Apache Flink is essential for running queries against sizable datasets. These engines enable data engineers to process and analyze data efficiently, making them indispensable tools in the data engineering toolkit.

Data Warehousing Concepts: Data engineers are responsible for maintaining data warehouses, making it vital to have a comprehensive understanding of data warehousing concepts. This includes knowledge of key components such as the Enterprise Data Warehouse (EDW), Operational Data Store (ODS), and Data Mart. Mastery of these concepts ensures effective data storage, organization, and retrieval.

Proficiency in tools such as Python and SQL can secure an average salary of 8.5 and 8.6 LPA, respectively, for a data engineer. Following closely are skills related to Hadoop and ETL, garnering an average salary of 9 LPA. To maximize earnings, expertise in Amazon Web Services (AWS) and Apache Spark is pivotal, as they can lead to average salaries of 9.8 and 10 LPA, respectively.

Employer

Data engineers are paid more when working at larger firms such as Google, Apple, and Meta. According to salary studies, data engineers at Cognizant in India earn an average of about ₹819,207 per year, while IBM pays its AWS data engineers around ₹950,000 per year.

Job Title Variations

Data engineer job titles differ based on the company, the tasks involved, and the skills required. Here are some titles that data engineers can hold, with average salaries:

- Enterprise Data Architect: $172,872
- AI Engineer: $126,774
- Cloud Data Engineer: $116,497
- Hadoop Engineer: $143,322
- Database Architect: $143,601
- Data Science Engineer: $127,966
- Big Data Engineer: $116,675
- Information Systems Engineer: $92,340

How to Improve Your AWS Certified Data Engineer Associate Salary

To enhance your AWS Certified Data Engineer Associate salary, consider the following strategies:

Continuous Learning: Stay updated on the latest AWS technologies and best practices. Attend training sessions, webinars, and workshops to expand your knowledge.

Earn Additional Certifications: Obtain other relevant certifications to demonstrate a diverse skill set. This can make you more valuable to employers and potentially lead to a higher salary.

Gain Practical Experience: Apply your knowledge through hands-on projects and real-world scenarios. Practical experience is highly valued and can set you apart in the job market.
Build a Strong Professional Network: Connect with other professionals in the field, attend industry events, and participate in online forums. Networking can open up new opportunities and provide insights into salary trends.

Showcase Your Achievements: Highlight your accomplishments on your resume and LinkedIn profile. Quantify your impact on projects and emphasize how your skills have positively contributed to business objectives.

Negotiation Skills: Develop effective negotiation skills when discussing salary with potential employers. Research industry salary benchmarks and be prepared to make a compelling case for your value.

Specialize in High-Demand Areas: Focus on specialized areas within AWS that are in high demand. This could include specific data analytics tools, machine learning, or database management skills.

Seek Leadership Roles: Transitioning into leadership positions can often lead to higher salaries. Develop your leadership skills and take on responsibilities that demonstrate your ability to lead and manage teams.

Stay Informed About Market Trends: Keep track of industry trends and market demands. If you can align your skills with emerging technologies and trends, you may find yourself in higher demand.

Key AWS Services to Prioritize for the DEA-C01 Exam

To prepare effectively for the AWS Data Engineer Associate exam, focus on specific concepts and AWS services to optimize study time and avoid unnecessary topics. It is highly recommended to dedicate more time to understanding the following AWS services:

- Amazon Athena
- Amazon Redshift
- Amazon QuickSight
- Amazon EMR (Amazon Elastic MapReduce)
- AWS Lake Formation
- Amazon EventBridge
- AWS Glue
- Amazon Kinesis
- Amazon Managed Service for Apache Flink
- Amazon Managed Streaming for Apache Kafka (Amazon MSK)
- Amazon OpenSearch Service

FAQs

Is it worth getting a certificate in the data engineering field?
Yes, obtaining a data engineering certificate is worthwhile for several reasons. It is a proven and effective way to enhance your earning potential. Additionally, it demonstrates to potential employers that you are committed to staying updated on the latest advancements in the field and showcases your dedication to continuous learning.

Does an AWS Data Engineer require coding knowledge?
No, it is not necessary to know coding if you want to become an AWS Data Engineer.

What skills are required for an AWS data engineer?
To become an AWS Data Engineer, you need the following skills to execute data engineering tasks effectively: SQL skills, data modelling, Hadoop for big data, Python, and AWS cloud services.

What is AWS Data Engineering?
AWS Data Engineering entails gathering data from various sources to be stored, processed, analyzed, and visualized, and creating pipelines on the AWS platform.

Conclusion

We hope this blog clarifies the AWS Certified Data Engineer salary and the factors that influence it. As the demand for skilled data engineers continues to rise, obtaining the AWS Certified Data Engineer Associate credential not only validates your expertise but also enhances your earning potential. To delve deeper into the data engineering world, try our hands-on labs and sandbox. View the full article
Tagged with: salaries, aws certified (and 4 more)
-
This week on KDnuggets: Discover GitHub repositories from machine learning courses, bootcamps, books, tools, interview questions, cheat sheets, MLOps platforms, and more to master ML and secure your dream job • Data engineers must prepare and manage the infrastructure and tools necessary for the whole data workflow in a data-driven company • And much, much more!View the full article
Tagged with: github, github repos (and 3 more)
-
free courses Datacamp Free Access Week
James posted a topic in Databases, Data Engineering & Data Science
Access all of Datacamp's 460+ data and AI courses, career tracks & certifications ... https://www.datacamp.com/freeweek
Tagged with: datacamp, data engineering (and 9 more)
-
data engineering Modern Data Engineering
TDS posted a topic in Databases, Data Engineering & Data Science
Platform-Specific Tools and Advanced Techniques

Photo by Christopher Burns on Unsplash

The modern data ecosystem keeps evolving and new data tools emerge now and then. In this article, I want to talk about crucial things that affect data engineers, and discuss how to use this knowledge to power advanced analytics pipelines and operational excellence. I'd like to discuss some popular data engineering questions:

- Modern data engineering (DE): what is it?
- Does your DE work well enough to fuel advanced data pipelines and business intelligence (BI)?
- Are your data pipelines efficient?
- What is required from the technological point of view to enable operational excellence?

Back in October, I wrote about the rise of the data engineer: the role, its challenges, responsibilities, daily routine and how to become successful in this field (How to Become a Data Engineer). The data engineering landscape is constantly changing, but the major trends seem to remain the same. As a data engineer, I am tasked with designing efficient data processes almost every day, so here are a few things to consider that can help us answer these questions.

Modern data engineering trends:

- ETL vs ELT
- Simplified data connectors and API integrations
- ETL frameworks explosion
- Data infrastructure as code
- Data Mesh and decentralized data management
- Democratization of business intelligence pipelines using AI
- Focus on data literacy

ELT vs ETL

Popular SQL data transformation tools like Dataform and DBT made a significant contribution to the popularisation of the ELT approach [1]. It simply makes sense to perform required data transformations, such as cleansing, enrichment and extraction, in the place where the data is being stored. Often that is a data warehouse solution (DWH) in the central part of our infrastructure. Cloud platform leaders have made DWH (Snowflake, BigQuery, Redshift, Firebolt) infrastructure management really simple, and in many scenarios they will outperform a dedicated in-house infrastructure management team in terms of cost-effectiveness and speed.

Data warehouse example. Image by author.

It might also be a datalake at the center, depending on the type of data platform and the tools we use. In that case, SQL often stops being an option, making it difficult to query the data for users who are not familiar with programming. Tools like Databricks, Tabular and Galaxy try to solve this problem, and it really feels like the future. Indeed, datalakes can store all types of data, including unstructured data, and we still need to be able to analyse these datasets.

Datalake example. Image by author.

Just imagine transactionally consistent datalake tables with point-in-time snapshot isolation. I previously wrote about this in one of my stories on the Apache Iceberg table format [2]: Introduction to Apache Iceberg Tables.
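To make the ELT idea concrete, here is a minimal sketch (not from the original article) of pushing a transformation down into the warehouse with the BigQuery Python client; the project, dataset and table names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()  # relies on default application credentials

# Hypothetical dataset/table names used purely for illustration.
sql = """
CREATE OR REPLACE TABLE analytics.daily_orders AS
SELECT order_date, COUNT(*) AS orders, SUM(amount) AS revenue
FROM raw.orders
GROUP BY order_date
"""

job = client.query(sql)  # the transformation runs inside the warehouse engine
job.result()             # wait for completion
```

The heavy lifting happens inside the warehouse; the Python script only submits the SQL, which is exactly the pattern tools like DBT and Dataform orchestrate at scale.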
Simplified data integrations

Managed solutions like Fivetran and Stitch were built to manage third-party API integrations with ease. These days many companies choose this approach to simplify data interactions with their external data sources, and it is the right way to go for data analyst teams that are not familiar with coding. Indeed, why would we build a data connector from scratch if it already exists and is being managed in the cloud?

The downside of this approach is its pricing model, though. Very often it is row-based and can become quite expensive at an enterprise level of data ingestion, i.e. big data pipelines. This is where open-source alternatives come into play. Frameworks like Airbyte and Meltano might be an easy and quick solution to deploy a data source integration microservice.

And if you don't have time to learn a new ETL framework, you can create a simple data connector yourself. If you know a bit of Python it is a trivial task. In one of my previous articles, Python for Data Engineers, I wrote about how easy it is to create a microservice that pulls data from the NASA API [3]. Consider this code snippet for app.py:

```python
import requests

session = requests.Session()

url = "https://api.nasa.gov/neo/rest/v1/feed"
apiKey = "your_api_key"
requestParams = {
    'api_key': apiKey,
    'start_date': '2023-04-20',
    'end_date': '2023-04-21'
}

response = session.get(url, params=requestParams, stream=True)
print(response.status_code)
```

It can be deployed on any cloud vendor platform and scheduled to run with the required frequency. It's always a good practice to use something like Terraform to deploy our data pipeline applications.

ETL frameworks explosion

We can witness a "Cambrian explosion" of various ETL frameworks for data extraction and transformation. It's no surprise that many of them are open-source and Python-based.

Luigi [8] is one of them and it helps to create ETL pipelines. It was created by Spotify to manage massive data processing workloads. It has a command line interface and great visualization features. However, even basic ETL pipelines require a certain level of Python programming skills. From my experience, I can tell that it's great for strict and straightforward pipelines. I find it particularly difficult to implement complex branching logic in Luigi, but it works great in many scenarios.
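As an illustrative sketch (not from the original article), a two-task Luigi pipeline looks like this; the file names and task logic are made up:

```python
import datetime

import luigi


class ExtractTask(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"raw_{self.date}.csv")

    def run(self):
        # Pretend extraction step: write a tiny CSV.
        with self.output().open("w") as f:
            f.write("id,value\n1,Foo\n2,Bar\n")


class TransformTask(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return ExtractTask(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"clean_{self.date}.csv")

    def run(self):
        # Row-level transformation: lower-case every line.
        with self.input().open() as src, self.output().open("w") as dst:
            for line in src:
                dst.write(line.strip().lower() + "\n")


if __name__ == "__main__":
    luigi.build([TransformTask(date=datetime.date(2024, 1, 1))], local_scheduler=True)
```

Each task declares its inputs via requires() and its outputs via output(), and Luigi only re-runs tasks whose targets are missing, which is what makes it well suited to the strict, linear pipelines mentioned above.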
Python ETL (PETL) [9] is one of the most widely used open-source ETL frameworks for straightforward data transformations. It is invaluable for working with tables, extracting data from external data sources and performing basic ETL on data. In many ways it is similar to Pandas, but the latter has more analytics capabilities under the hood. PETL is great for aggregation and row-level ETL.

Bonobo [10] is another open-source, lightweight data processing tool which is great for rapid development, automation and parallel execution of batch-processing data pipelines. What I like about it is that it makes it really easy to work with various data file formats, i.e. SQL, XML, XLS, CSV and JSON. It will be a great tool for those with minimal Python knowledge. Among other benefits, I like that it works well with semi-complex data schemas. It is ideal for simple ETL and can run in Docker containers (it has a Docker extension).

Pandas is an absolute beast in the world of data and there is no need to cover its capabilities in this story. It's worth mentioning that its data frame transformations have been included in one of the basic methods of data loading for many modern data warehouses. Consider this data loading sample for the BigQuery data warehouse solution:

```python
import json

import pandas
from google.cloud import bigquery
from google.oauth2 import service_account

...
# Authenticate BigQuery client:
service_acount_str = config.get('BigQuery')  # Use config
credentials = service_account.Credentials.from_service_account_info(service_acount_str)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)
...


def load_table_from_dataframe(table_schema, table_name, dataset_id):
    #! source data file format must be outer array JSON:
    """
    [
      {"id":"1"},
      {"id":"2"}
    ]
    """
    blob = """
    [
      {"id":"1","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]},
      {"id":"2","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]}
    ]
    """
    body = json.loads(blob)
    print(pandas.__version__)

    table_id = client.dataset(dataset_id).table(table_name)
    job_config = bigquery.LoadJobConfig()
    schema = create_schema_from_yaml(table_schema)
    job_config.schema = schema

    df = pandas.DataFrame(
        body,
        # In the loaded table, the column order reflects the order of the
        # columns in the DataFrame.
        columns=["id", "first_name", "last_name", "dob", "addresses"],
    )
    df['addresses'] = df.addresses.astype(str)
    df = df[['id', 'first_name', 'last_name', 'dob', 'addresses']]
    print(df)

    load_job = client.load_table_from_dataframe(
        df,
        table_id,
        job_config=job_config,
    )
    load_job.result()
    print("Job finished.")
```
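The snippet above calls a create_schema_from_yaml helper that the excerpt doesn't show. A minimal version might look like the following; the YAML layout (name/type/mode per column) is an assumption for illustration, not the author's original code:

```python
import yaml
from google.cloud import bigquery


def create_schema_from_yaml(table_schema):
    """Build BigQuery SchemaField objects from a YAML string.

    Assumed YAML layout, one entry per column:
      - name: id
        type: STRING
        mode: NULLABLE
    """
    return [
        bigquery.SchemaField(
            field["name"],
            field["type"],
            mode=field.get("mode", "NULLABLE"),
        )
        for field in yaml.safe_load(table_schema)
    ]
```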
Apache Airflow, for example, is not an ETL tool per se, but it helps to organize our ETL pipelines into a nice visualization of dependency graphs (DAGs) that describe the relationships between tasks. A typical Airflow architecture includes a metadata-based scheduler, executors, workers and tasks. For example, we can run ml_engine_training_op after we export data into cloud storage (bq_export_op) and make this workflow run daily or weekly.

ML model training using Airflow. Image by author.

Consider the example below. It creates a simple data pipeline graph to export data into a cloud storage bucket and then trains an ML model using MLEngineTrainingOperator.

```python
"""DAG definition for recommendation_bespoke model training."""

import datetime

import airflow
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.bigquery_to_gcs import BigQueryToCloudStorageOperator
from airflow.hooks.base_hook import BaseHook
from airflow.operators.app_engine_admin_plugin import AppEngineVersionOperator
from airflow.operators.ml_engine_plugin import MLEngineTrainingOperator


def _get_project_id():
    """Get project ID from default GCP connection."""
    extras = BaseHook.get_connection('google_cloud_default').extra_dejson
    key = 'extra__google_cloud_platform__project'
    if key in extras:
        project_id = extras[key]
    else:
        raise ValueError('Must configure project_id in google_cloud_default '
                         'connection from Airflow Console')
    return project_id


PROJECT_ID = _get_project_id()

# Data set constants, used in BigQuery tasks. You can change these
# to conform to your data.
DATASET = 'staging'  # 'analytics'
TABLE_NAME = 'recommendation_bespoke'

# GCS bucket names and region, can also be changed.
BUCKET = 'gs://rec_wals_eu'
REGION = 'us-central1'  # 'europe-west2'  # 'us-east1'
JOB_DIR = BUCKET + '/jobs'

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(2),
    'email': ['mike.shakhomirov@gmail.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 5,
    'retry_delay': datetime.timedelta(minutes=5)
}

# Default schedule interval using cronjob syntax - can be customized here
# or in the Airflow console.
schedule_interval = '00 21 * * *'

dag = DAG('recommendations_training_v6',
          default_args=default_args,
          schedule_interval=schedule_interval)

dag.doc_md = __doc__

#
# Task Definition
#

# BigQuery training data export to GCS
training_file = BUCKET + '/data/recommendations_small.csv'  # just a few records for staging

t1 = BigQueryToCloudStorageOperator(
    task_id='bq_export_op',
    source_project_dataset_table='%s.recommendation_bespoke' % DATASET,
    destination_cloud_storage_uris=[training_file],
    export_format='CSV',
    dag=dag
)

# ML Engine training job
job_id = 'recserve_{0}'.format(datetime.datetime.now().strftime('%Y%m%d%H%M'))
job_dir = BUCKET + '/jobs/' + job_id
output_dir = BUCKET
delimiter = ','
data_type = 'user_groups'
master_image_uri = 'gcr.io/my-project/recommendation_bespoke_container:tf_rec_latest'

training_args = ['--job-dir', job_dir,
                 '--train-file', training_file,
                 '--output-dir', output_dir,
                 '--data-type', data_type]

master_config = {"imageUri": master_image_uri}

t3 = MLEngineTrainingOperator(
    task_id='ml_engine_training_op',
    project_id=PROJECT_ID,
    job_id=job_id,
    training_args=training_args,
    region=REGION,
    scale_tier='CUSTOM',
    master_type='complex_model_m_gpu',
    master_config=master_config,
    dag=dag
)

t3.set_upstream(t1)
```

Bubbles [11] is another open-source tool for ETL in the Python world. It's great for rapid development and I like how it works with metadata to describe data pipelines. The creators of Bubbles call it an "abstract framework" and say that it can be used from many other programming languages, not exclusively from Python.

There are many other tools with more specific applications, i.e. extracting data from web pages (PyQuery, BeautifulSoup, etc.) and parallel data processing. That can be a topic for another story, but I wrote about some of them before, e.g. the joblib library [12].
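As a quick illustration of the parallel-processing side (not from the original article), joblib can fan a pure-Python transformation out across several worker processes; the chunks and the cleaning function here are made up:

```python
from joblib import Parallel, delayed


def clean_chunk(rows):
    # CPU-bound, row-level cleanup applied to one chunk of records.
    return [r.strip().lower() for r in rows]


chunks = [[" Foo ", "BAR"], ["Baz ", " QUX"]]  # pretend these came from a large file
results = Parallel(n_jobs=2)(delayed(clean_chunk)(chunk) for chunk in chunks)
print(results)  # [['foo', 'bar'], ['baz', 'qux']]
```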
Data infrastructure as code

Infrastructure as code (IaC) is a popular and very functional approach for managing data platform resources. Even for data it is pretty much a standard right now, and it definitely looks great on your CV, telling your potential employers that you are familiar with DevOps standards. Using tools like Terraform (platform agnostic) and CloudFormation, we can integrate our development work and deployments (operations) with ease.

In general, we would want to have staging and production data environments for our data pipelines. It helps to test our pipelines and facilitates collaboration between teams. Consider the diagram below. It explains how data environments work.

Data environments. Image by author.

Often we might need an extra sandbox for testing purposes, or to run data transformation unit tests when our ETL services trigger CI/CD workflows. I previously wrote about this here: Infrastructure as Code for Beginners.

Using AWS CloudFormation template files we can describe required resources and their dependencies, so we can launch and configure them together as a single stack. If you are a data professional, this approach will definitely help you work with different data environments and replicate data platform resources faster and more consistently, without errors. The problem is that many data practitioners are not familiar with IaC, and that creates a lot of errors during the development process.
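As an illustrative sketch (not from the original article) of driving IaC from Python, here is how a CloudFormation stack for a staging data environment could be created with boto3; the template file, stack name and parameters are placeholders:

```python
import boto3

cf = boto3.client("cloudformation", region_name="eu-west-1")

with open("data_pipeline_stack.yaml") as f:  # hypothetical template describing buckets, roles, etc.
    template_body = f.read()

cf.create_stack(
    StackName="staging-data-pipeline",  # placeholder stack name
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_NAMED_IAM"],
    Parameters=[{"ParameterKey": "Environment", "ParameterValue": "staging"}],
)

# Block until the whole environment is ready.
cf.get_waiter("stack_create_complete").wait(StackName="staging-data-pipeline")
```

Promoting the same template with Environment=production is then a one-parameter change, which is what makes replicating environments consistent and repeatable.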
Data Mesh and decentralized data management

The data space has evolved significantly during the last decade and now we have lots of data tools and frameworks. Data Mesh describes a state where we have different data domains (company departments) with their own teams and shared data resources. Each team has its own goals, KPIs, data roles and responsibilities.

For a long period of time, data bureaucracy has been a real pain for many companies. This data platform type [4] might seem a bit chaotic, but it was meant to become a successful and efficient choice for companies where decentralization enables different teams to access cross-domain datasets and run analytics or ETL tasks on their own.

Indeed, Snowflake might be your favourite data warehouse solution if you are a data analyst and not familiar with Spark, yet you might often want to read datalake data without data engineering help. In this scenario, a set of metadata records on datasets can be extremely useful, and that's why Data Mesh is so successful. It equips users with knowledge about data, its origins, and how other teams can make the best of datasets they weren't previously aware of. Sometimes datasets and data source connections become very intricate, and it is always a good practice to have a single-source-of-truth repository with metadata and dataset descriptions.

In one of my previous stories [5] I wrote about the role of SQL as a unified querying language for teams and data. Indeed, it is analytical, self-descriptive and can even be dynamic, which makes it a perfect tool for all data users. Often it all turns into a big mes(s/h). This fact makes SQL-based templating engines like DBT, Jinja and Dataform very popular. Just imagine you have an SQL-like platform where all datasets and their transformations are described and defined thoroughly [6].

Dataform's dependency graph and metadata. Image by author.

It can be a big challenge to understand how data teams relate to data sources and schemas; very often it is all tangled in a spaghetti of dataset dependencies and ETL transformations. Data engineering plays a critical role in mentoring, improving data literacy and empowering the rest of the company with state-of-the-art data processing techniques and best practices.

Democratization of Business Intelligence pipelines using AI

Improving data accessibility has always been a popular topic in the data space, but it is interesting to see how the whole data pipeline design process is becoming increasingly accessible to teams that weren't familiar with data before. Now almost every department can utilize built-in AI capabilities to create complex BI transformations on data. All they need is to describe what they want, BI-wise, in their own words.

For example, BI tools like ThoughtSpot use AI with an intuitive "Google-like search interface" [7] to gain insights from data stored in any modern DWH solution such as Google BigQuery, Redshift, Snowflake or Databricks. The Modern Data Stack includes BI tools that help with data modelling and visualization, and many of them already have these built-in AI capabilities to gain data insights faster based on user behaviour. I believe it's a fairly easy task to integrate GPT and BI, and in the next couple of years we will see many new products using this tech. GPT can pre-process text data to generate a SQL query that understands your intent and answers your question.

Conclusion

In this article, I tried to give a very high-level overview of the major data trends that affect the data engineering role these days. Data Mesh and templated SQL with dependency graphs to facilitate data literacy have democratized the whole analytics process. Advanced data pipelines with intricate ETL techniques and transformations can now be transparent for everyone in the organisation. Data pipelines are becoming increasingly accessible to other teams, and they don't need to know programming to learn and understand the complexity of ETL; Data Mesh and metadata help to solve this problem. From my experience, I keep seeing more and more people learning SQL to contribute to the transformation layer. Companies born during the "advanced data analytics" age have the luxury of easy access to cloud vendor products and their managed services, which definitely helps to acquire the required data skills and improve them to gain a competitive advantage.

Recommended read:

[1] https://medium.com/towards-data-science/data-pipeline-design-patterns-100afa4b93e3
[2] https://towardsdatascience.com/introduction-to-apache-iceberg-tables-a791f1758009
[3] https://towardsdatascience.com/python-for-data-engineers-f3d5db59b6dd
[4] https://medium.com/towards-data-science/data-platform-architecture-types-f255ac6e0b7
[5] https://medium.com/towards-data-science/advanced-sql-techniques-for-beginners-211851a28488
[6] https://medium.com/towards-data-science/easy-way-to-create-live-and-staging-environments-for-your-data-e4f03eb73365
[7] https://docs.thoughtspot.com/cloud/latest/search-sage
[8] https://github.com/spotify/luigi
[9] https://petl.readthedocs.io/en/stable/
[10] https://www.bonobo-project.org
[11] http://bubbles.databrewery.org/
[12] https://medium.com/towards-data-science/how-to-become-a-data-engineer-c0319cb226c2

Modern Data Engineering was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story. View the full article
-
AWS Data Engineering is a vital element in the AWS Cloud to deliver ultimate data solutions to end users. Data Engineering on AWS assists big data professionals in managing Data Pipelines, Data Transfer, and Data Storage. AWS data engineers do the same jobs as general data engineers but they exclusively work in the Amazon Web Services cloud platform. To succeed in Data Engineering on AWS, one should have a solid understanding of AWS and data engineering principles. To nurture your Data engineering skills from the foundational level, it is better to take AWS Data Engineer Certification... View the full article
-
Airflow has been adopted by many Cloudera Data Platform (CDP) customers in the public cloud as the next-generation orchestration service to set up and operationalize complex data pipelines. Today, customers have deployed hundreds of Airflow DAGs in production performing various data transformation and preparation tasks with differing levels of complexity. Combined with Cloudera Data Engineering's (CDE) first-class job management APIs and centralized monitoring, this is delivering new value for modernizing enterprises. As we mentioned before, instead of relying on one custom monolithic process, customers can develop modular data transformation steps that are more reusable and easier to debug, which can then be orchestrated with gluing logic at the level of the pipeline.

That's why we are excited to announce the next evolutionary step on this modernization journey, lowering the barrier even further for data practitioners looking for flexible pipeline orchestration: CDE's completely new pipeline authoring UI for Airflow.

Until now, the setup of such pipelines still required knowledge of Airflow and the associated Python configuration (a minimal example of such a hand-written DAG is sketched at the end of this section). This presented challenges for users building the more complex multi-step pipelines that are typical of DE workflows. We wanted to hide those complexities from users, making multi-step pipeline development as self-service as possible and providing an easier path to developing, deploying, and operationalizing true end-to-end data pipelines.

Easing development friction

We started out by interviewing customers to understand where the most friction exists in their pipeline development workflows today. In the process, several key themes emerged:

Low/No-code: By far the biggest barrier for new users is creating custom Airflow DAGs. Writing code is error prone and requires trial and error, so any way to minimize coding and manual configuration dramatically streamlines the development process.

Long tail of operators: Although Airflow offers hundreds of operators, users tend to use only a subset of them. Making the most commonly used ones as readily available as possible is critical to reducing development friction.

Templates: Airflow DAGs are a great way to isolate pipelines and monitor them independently, which is more operationally friendly for DE teams. But when we looked across Airflow DAGs we often noticed similar patterns, where the majority of the operations were identical except for a series of configurations like table names and directories: the 80/20 rule clearly at play.

This laid the foundation for some of the key design principles we applied to our authoring experience.

Pipeline Authoring UI for Airflow

With the CDE pipeline authoring UI, any CDE user, irrespective of their level of Airflow expertise, can create multi-step pipelines with a combination of out-of-the-box operators (CDEOperator, CDWOperator, BashOperator, PythonOperator). More advanced users can still deploy their own custom Airflow DAGs as before, or use the pipeline authoring UI to bootstrap their projects for further customization (as we describe later, the pipeline engine generates Airflow code which can be used as a starting point for more complex scenarios). And once the pipeline has been developed through the UI, users can deploy and manage these data pipeline jobs like other CDE applications through the API/CLI/UI.
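For context, below is a minimal sketch of the kind of hand-written Airflow DAG the authoring UI abstracts away. It is illustrative only and uses stock BashOperator/PythonOperator tasks with made-up task names; the CDE-specific operators and the code CDE itself generates will differ.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def prepare():
    # Placeholder data preparation step
    print("preparing inputs")


with DAG(
    dag_id="example_multi_step_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="ingest", bash_command="echo 'ingest raw files'")
    transform = PythonOperator(task_id="prepare", python_callable=prepare)
    report = BashOperator(task_id="report", bash_command="echo 'build report'")

    # The dependencies the UI lets you draw with click & drag
    ingest >> transform >> report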
Figure 1: "Editor" screen for authoring Airflow pipelines, with operators (left), canvas (middle), and context-sensitive configuration panel (right)

The "Editor" is where all the authoring operations take place: a central interface to quickly sequence together your pipelines. It was critical to make the interactions as intuitive as possible to avoid slowing down the flow of the user. The user is presented with a blank canvas and click & drop operators: a palette focused on the most commonly used operators on the left, and a context-sensitive configuration panel on the right. As the user drops new operators onto the canvas, they can specify dependencies through an intuitive click & drag interaction. Clicking on an existing operator within the canvas brings it into focus, which triggers an update to the configuration panel on the right. Hovering over any operator highlights each side with four dots, inviting the user to use a click & drag action to create a connection with another operator.

Figure 2: Creating dependencies with simple click & drag

Pipeline Engine

To make the authoring UI as flexible as possible, a translation engine was developed that sits in between the user interface and the final Airflow job. Each "box" (step) on the canvas serves as a task in the final Airflow DAG. Multiple steps comprise the overall pipeline, which is stored as pipeline definition files in the CDE resource of the job. This intermediate definition can easily be integrated with source code management, such as Git, as needed. When the pipeline is saved in the editor screen, a final translation is performed whereby the corresponding Airflow DAG is generated and loaded into the Airflow server. This makes our pipeline engine flexible enough to support a multitude of orchestration services; today we support Airflow, but in the future it can be extended to meet other requirements. An additional benefit is that this can also serve to bootstrap more complex pipelines. The generated Airflow Python code can be modified by end users to accommodate custom configurations and then uploaded as a new job. This way users don't have to start from scratch, but rather build an outline of what they want to achieve, output the skeleton Python code, and then customize it.

Templatizing Airflow

Airflow provides a way to templatize pipelines, and with CDE we have integrated that with our APIs to allow job parameters to be pushed down to Airflow as part of the execution of the pipeline. A simple example of this would be parameterizing the SQL query within the CDW operator. Using the special {{..}} syntax, the developer can include placeholders for different parts of the query, for example the SELECT expression or the table being referenced in the FROM section:

SELECT {{ dag_run.conf['conf1'] }} FROM {{ dag_run.conf['conf2'] }} LIMIT 100

This can be entered through the configuration pane in the UI as shown here. Once the pipeline is saved and the Airflow job generated, it can be programmatically triggered through the CDE CLI/API with the configuration override options:

$ cde job run --config conf1='column1, sum(1)' --config conf2='default.txn' --name example_airflow_job

The same Airflow job can now be used to generate different SQL reports.
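To show where those placeholders end up, here is a rough sketch of a parameterized DAG: templated operator fields are rendered by Airflow at trigger time from dag_run.conf, which is what the --config flags above populate. This is illustrative only; the DAG that CDE actually generates, and its CDW operator, will look different.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_airflow_job",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # triggered on demand with runtime configuration
    catchup=False,
) as dag:
    run_query = BashOperator(
        task_id="run_query",
        # bash_command is a templated field, so the {{ ... }} placeholders are
        # replaced with the values passed via dag_run.conf when the job is run.
        bash_command=(
            "echo \"SELECT {{ dag_run.conf['conf1'] }} "
            "FROM {{ dag_run.conf['conf2'] }} LIMIT 100\""
        ),
    )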
Looking forward

With early design partners, we already have enhancements in the works to continue improving the experience. Some of them include:

More operators: as we mentioned earlier, there is a small set of highly used operators, and we want to ensure the most commonly used ones are easily accessible to the user. Additionally, the introduction of more CDP operators that integrate with CML (machine learning) and COD (operational database) is critical for a complete end-to-end orchestration service.

UI improvements to make the experience even smoother. These span common usability improvements like pan and zoom and undo/redo operations, and a mechanism to add comments to make more complex pipelines easier to follow.

Auto-discovery, which can be powerful when applied to help autocomplete various configurations, such as referencing a pre-defined Spark job for the CDE task or the Hive virtual warehouse endpoint for the CDW query task.

Ready-to-use pipelines: although parameterized Airflow jobs are a great way to develop reusable pipelines, we want to make this even easier to specify through the UI. There are also opportunities for us to provide ready-to-use pipeline definitions that capture very common patterns, such as detecting files on an S3 bucket, running data transformation with Spark, and performing data mart creation with Hive.

With this Technical Preview release, any CDE customer can test drive the new authoring interface by setting up the latest CDE service. When creating a Virtual Cluster, a new option will allow enabling the Airflow authoring UI. Stay tuned for more developments in the coming months and until then, happy pipeline building!

The post Introducing Self-Service, No-Code Airflow Authoring UI in Cloudera Data Engineering appeared first on Cloudera Blog.

View the full article
-
What is Cloudera Data Engineering (CDE)?

Cloudera Data Engineering is a serverless service for Cloudera Data Platform (CDP) that allows you to submit jobs to auto-scaling virtual clusters. CDE enables you to spend more time on your applications and less time on infrastructure. CDE allows you to create, manage, and schedule Apache Spark jobs without the overhead of creating and maintaining Spark clusters. In addition, you can define virtual clusters with a range of CPU and memory resources, and the cluster scales up and down as needed to execute your Spark workloads, helping control your cloud costs.

This managed, serverless Spark service helps our customers in a number of ways:

Auto-scaling of compute to eliminate static infrastructure costs. This ensures that customers do not have to maintain a large infrastructure footprint and hence reduces total cost of ownership.

Ability for business users to easily control their own compute needs with the click of a button, without IT intervention.

Complete view of job performance, logging and debugging through a single pane of glass to enable efficient development on Spark.

Refer to the following Cloudera blog to understand the full potential of Cloudera Data Engineering.

Why should technology partners care about CDE?

Unlike traditional data engineering workflows that have relied on a patchwork of tools for preparing, operationalizing, and debugging data pipelines, Cloudera Data Engineering is designed for efficiency and speed, seamlessly integrating and securing data pipelines to any CDP service including Machine Learning, Data Warehouse, Operational Database, or any other analytic tool in your business. Partner tools that use CDP as their backend store can leverage this new service to ensure their customers take advantage of a serverless architecture for Spark.

ISV partners, like Precisely, support Cloudera's hybrid vision. Precisely Data Integration, Change Data Capture and Data Quality tools support CDP Public Cloud as well as CDP Private Cloud. Precisely end customers can now design a pipeline once and deploy it anywhere. Data pipelines that are bursty in nature can leverage the public cloud CDE service, while longer-running persistent loads can run on-prem. This ensures that the right data pipelines are running on the most cost-effective engines available in the market today.

Using the CDE Integration API

CDE provides a robust API for integration with your existing continuous integration/continuous delivery platforms. The Cloudera Data Engineering service API is documented in Swagger. You can view the API documentation and try out individual API calls by accessing the API DOC link in any virtual cluster:

1. In the CDE web console, select an environment.
2. Click the Cluster Details icon in any of the listed virtual clusters.
3. Click the link under API DOC.

For further details on the API, please refer to the following doc link here. A rough sketch of calling the API from a script follows below.
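As a rough illustration of that integration surface, the sketch below lists jobs on a virtual cluster with Python. Everything here is an assumption for illustration: CDE_VCLUSTER_ENDPOINT is the virtual cluster's .../dex/api/v1 URL (the same kind of endpoint the CLI examples below use), ACCESS_TOKEN stands in for whatever credentials your CDP environment requires, and the GET /jobs call and response shape are inferred from the POST /api/v1/jobs endpoint shown later. Check the Swagger doc for the exact paths, schemas and authentication flow.

import os

import requests

# Hypothetical settings: the virtual cluster API endpoint and an access token
# obtained according to your CDP environment's authentication flow.
CDE_VCLUSTER_ENDPOINT = os.environ["CDE_VCLUSTER_ENDPOINT"]  # e.g. https://<id>.../dex/api/v1
ACCESS_TOKEN = os.environ["CDE_ACCESS_TOKEN"]

# List jobs on the virtual cluster (assumed GET /jobs; see the Swagger doc).
response = requests.get(
    f"{CDE_VCLUSTER_ENDPOINT}/jobs",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    timeout=30,
)
response.raise_for_status()

# The response schema is assumed here for illustration only.
for job in response.json().get("jobs", []):
    print(job.get("name"), job.get("type"))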
Custom base image for Kubernetes

Partners who need to run their own business logic and require custom binaries or packages on the Spark engine platform can now leverage this feature of Cloudera Data Engineering. We believe customized engine images allow greater flexibility for our partners to build cloud-native integrations and could potentially be leveraged by our enterprise customers as well. The following set of steps describes how to run Spark jobs with dependencies on external libraries and packages. The libraries and packages are installed on top of the base image to make them available to the Spark executors.

First, obtain the latest CDE CLI:

a) Create a virtual cluster
b) Go to the virtual cluster details page
c) Download the CLI

Learn more on how to use the CLI here.

Run Spark jobs on a customized container image: Overview

Custom images are based on the base dex-spark-runtime image, which is accessible from the Cloudera Docker repository. Users can then layer their packages and custom libraries on top of the base image. The final image is uploaded to a Docker repository, which is then registered with CDE as a job resource. New jobs are defined with references to the resource, which automatically downloads the custom runtime image to run the Spark drivers and executors.

Run Spark jobs on a customized container image: Steps

1. Pull the "dex-spark-runtime" image from "docker.repository.cloudera.com":

$ docker pull container.repository.cloudera.com/cloudera/dex/dex-spark-runtime:<version>

Note: "docker.repository.cloudera.com" is behind the paywall and requires credentials to access; please ask your account team to provide them.

2. Create your "custom-dex-spark-runtime" image, based on the "dex-spark-runtime" image:

$ docker build --network=host -t <company-registry>/custom-dex-spark-runtime:<version> . -f Dockerfile

Dockerfile example:

FROM docker.repository.cloudera.com/<company-name>/dex-spark-runtime:<version>
USER root
RUN yum install ${YUM_OPTIONS} <package-to-install> && yum clean all && rm -rf /var/cache/yum
RUN dnf install ${DNF_OPTIONS} <package-to-install> && dnf clean all && rm -rf /var/cache/dnf
USER ${DEX_UID}

3. Push the image to your company Docker registry:

$ docker push <company-registry>/custom-dex-spark-runtime:<version>

4. Create an ImagePullSecret in the DE cluster for the company's Docker registry (optional).

REST API:

# POST /api/v1/credentials
{
  "name": "<company-registry-basic-credentials>",
  "type": "docker",
  "uri": "<company-registry>",
  "secret": {
    "username": "foo",
    "password": "bar"
  }
}

CDE CLI:

=== credential ===
./cde credential create --type=docker-basic --name=docker-sandbox-cred --docker-server=https://docker-sandbox.infra.cloudera.com --docker-username=foo --tls-insecure --user srv_dex_mc --vcluster-endpoint https://gbz7t69f.cde-vl4zqll4.dex-a58x.svbr-nqvp.int.cldr.work/dex/api/v1

Note: credentials are stored as a Kubernetes "Secret" and are never stored by the DEX API.

5. Register "custom-dex-spark-runtime" in DE as a "Custom Spark Runtime Image" resource.

REST API:

# POST /api/v1/resources
{
  "name": "",
  "type": "custom-spark-runtime-container-image",
  "engine": "spark2",
  "image": "<company-registry>/custom-dex-spark-runtime:<version>",
  "imagePullSecret": "<company-registry-basic-credentials>"
}

CDE CLI:

=== runtime resources ===
./cde resource create --type="custom-runtime-image" --image-engine="spark2" --name="custom-dex-qe-1_1" --image-credential=docker-sandbox-cred --image="docker-sandbox.infra.cloudera.com/dex-qe/custom-dex-qe:1.1" --tls-insecure --user srv_dex_mc --vcluster-endpoint https://gbz7t69f.cde-vl4zqll4.dex-a58x.svbr-nqvp.int.cldr.work/dex/api/v1

6. You should now be able to define Spark jobs referencing the custom-dex-spark-runtime resource.

REST API:

# POST /api/v1/jobs
{
  "name": "spark-custom-image-job",
  "spark": {
    "imageResource": "CustomSparkImage-1",
    ...
  }
  ...
}

CDE CLI:

=== job create ===
./cde job create --type spark --name cde-job-docker --runtime-image-resource-name custom-dex-qe-1_1 --application-file /tmp/numpy_app.py --num-executors 1 --executor-memory 1G --driver-memory 1G --tls-insecure --user srv_dex_mc --vcluster-endpoint https://gbz7t69f.cde-vl4zqll4.dex-a58x.svbr-nqvp.int.cldr.work/dex/api/v1
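The job above points at /tmp/numpy_app.py, whose contents are not shown in this post. Purely as a hypothetical illustration, an application that relies on a package baked into the custom image (numpy in this case) might look like the following PySpark sketch.

import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("custom-image-numpy-demo").getOrCreate()

# This UDF only works if numpy is installed on the executors, i.e. layered
# into the custom dex-spark-runtime image registered above.
@udf(returnType=DoubleType())
def np_sqrt(x):
    return float(np.sqrt(x))

df = spark.range(10).withColumn("sqrt_id", np_sqrt("id"))
df.show()

spark.stop()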
7. Once the job is created, trigger it either through the Web UI or by running the following command in the CLI:

$> cde job run --name cde-job-docker

In conclusion

We introduced the "Custom Base Image" feature as part of our Design Partner Program to elicit feedback from our ISV partners. The response has been overwhelmingly positive, and building custom integrations with our cloud-native CDE offering has never been easier. As a partner, you can leverage Spark running on Kubernetes infrastructure for free. You can launch a trial of CDE on CDP in minutes here, giving you a hands-on introduction to data engineering innovations in the Public Cloud.

References:

https://www.cloudera.com/tutorials/cdp-getting-started-with-cloudera-data-engineering.html

The post Cloudera Data Engineering – Integration steps to leverage Spark on Kubernetes appeared first on Cloudera Blog.

View the full article
-
Tagged with: cloudera, data engineering (and 2 more)