Search the Community

Showing results for tags 'data engineering'.

Found 19 results

  1. Learn why your Shopify success demands data engineering expertise and how to start doing more with your Shopify data. View the full article
  2. In today’s data-driven world, developer productivity is essential for organizations to build effective and reliable products, accelerate time to value, and fuel ongoing innovation. To deliver on these goals, developers must be able to manipulate and analyze information efficiently. Yet while SQL applications have long served as the gateway to access and manage data, Python has become the language of choice for most data teams, creating a disconnect. Recognizing this shift, Snowflake is taking a Python-first approach to bridge the gap and help users leverage the power of both worlds.

Our previous Python connector API, aimed primarily at those who need to run SQL from a Python script, enabled a connection to Snowflake from Python applications. This SQL-centric approach often challenged data engineers working in a Python environment, requiring context switching and limiting the full potential of Python’s rich libraries and frameworks. Because the previous connector communicated mostly via SQL, it also hindered the ability to manage Snowflake objects natively in Python, restricting data pipeline efficiency and the ability to complete complex tasks.

Snowflake’s new Python API (in public preview) marks a significant leap forward, offering a more streamlined, powerful solution for using Python within your data pipelines — and furthering our vision to empower all developers, regardless of experience, with a user-friendly and approachable platform.

A New Era: Introducing Snowflake’s Python API

With the new Snowflake Python API, readily available through pip install snowflake, developers no longer need to juggle languages or grapple with cumbersome syntax. They can leverage the power of Python for a seamless, unified experience across Snowflake workloads encompassing data engineering, Snowpark, machine learning and application development. Key benefits of the new Snowflake Python API include:

- Simplified syntax and intuitive API design: Featuring a Pythonic design, the API is built on the foundation of REST APIs, which are known for their clarity and ease of use. This allows developers to interact with Snowflake objects naturally and efficiently, minimizing the learning curve and reducing development time.
- Rich functionality and support for advanced operations: The API goes beyond basic operations, offering comprehensive functionality for managing various Snowflake resources and performing complex tasks within your Python environment.
- Enhanced performance and improved scalability: Designed with performance in mind, the API leverages the inherent scalability of REST APIs, enabling efficient data handling and seamless scaling as your data needs grow, so your applications can handle large data sets and complex workflows.
- Streamlined integration with existing tools and frameworks: The API integrates with popular Python data science libraries and frameworks, letting developers combine their existing skill sets and workflows with the capabilities of Snowflake.
By prioritizing the developer experience and offering a comprehensive, user-friendly solution, Snowflake’s new Python API paves the way for a more efficient, productive and data-driven future.

Getting Started with the Snowflake Python API

Our Quickstart guide makes it easy to see how the Snowflake Python API can manage Snowflake objects. The API allows you to create, delete and modify tables, schemas, warehouses, tasks and much more. In this Quickstart, you’ll learn how to perform key actions — from installing the Snowflake Python API to retrieving object data and managing Snowpark Container Services (a minimal sketch of this object-management workflow appears at the end of this post). Dive in to experience how the enhanced Python API streamlines your data workflows and unlocks the full potential of Python within Snowflake. To get started, explore the comprehensive API documentation, which will guide you through every step.

We recommend that Python developers prioritize the new API for data engineering tasks, since it offers a more intuitive and efficient approach than the legacy SQL connector. While the Python connector remains available for specific SQL use cases, the new API is designed to be your go-to solution. By general availability, we aim to achieve feature parity so that you can complete all of your data engineering tasks entirely through Python, reaching for SQL commands only when you truly prefer them or for the rare unsupported functionality.

The New Wave of Native DevOps on Snowflake

The Snowflake Python API release is part of a series of native DevOps tools becoming available on the Snowflake platform — all of which aim to empower developers of every experience level with a user-friendly and approachable platform. These benefits extend far beyond the developer team. The 2023 Accelerate State of DevOps Report, the annual report from Google Cloud’s DevOps Research and Assessment (DORA) team, reveals that a focus on user-centricity in the developer experience leads to a 40% increase in organizational performance. With intuitive tools for data engineers, data scientists and even citizen developers, Snowflake strives to amplify these advantages by fostering collaboration across your data and delivery teams. By offering the flexibility and control needed to build unique applications, Snowflake aims to become your one-stop shop for data — minimizing reliance on third-party tools for core development lifecycle use cases and ultimately reducing your total cost of ownership. We’re excited to share more innovations soon, making data even more accessible for all.

For a deeper dive into Snowflake’s Python API and other native Snowflake DevOps features, register for the Snowflake Data Cloud Summit 2024. Or, experience these features firsthand at our free Dev Day event on June 6th in the Demo Zone. The post Snowflake’s New Python API Empowers Data Engineers to Build Modern Data Pipelines with Ease appeared first on Snowflake. View the full article
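To make the object-management workflow described above concrete, here is a minimal, hedged sketch using the public-preview package installed via pip install snowflake. The account credentials, the DEMO_DB and DEMO_WH names, and the exact module paths and signatures are illustrative assumptions drawn from the preview documentation, not code from the original post, and may differ in your environment.

    # Hypothetical sketch: managing Snowflake objects natively in Python with the
    # public-preview Snowflake Python API (pip install snowflake). Module paths,
    # field names and object names below are assumptions, not verified snippets.
    import snowflake.connector
    from snowflake.core import Root
    from snowflake.core.database import Database
    from snowflake.core.warehouse import Warehouse

    # Placeholder connection details; supply your own account, user and password.
    connection = snowflake.connector.connect(
        account="<account_identifier>",
        user="<user>",
        password="<password>",
    )

    root = Root(connection)  # entry point into the object model

    # Create a database and a warehouse without writing any SQL.
    root.databases.create(Database(name="DEMO_DB"))
    root.warehouses.create(Warehouse(name="DEMO_WH", warehouse_size="XSMALL", auto_suspend=60))

    # Read object metadata back as Python objects.
    for db in root.databases.iter():
        print(db.name)

Because the API returns plain Python objects, the same pattern extends to schemas, tables and tasks, which is what the Quickstart walks through in more detail.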
  3. The only data engineering roadmap you need for an introduction to concepts, tools, and techniques to collect, store, transform, analyze, and model data. View the full article
  4. A thoughtful look at why data and engineering teams are best suited to own customer data platform implementation and management. View the full article
  5. Read RudderStack CEO Soumyadeb Mitra's insights on the changes ahead in the field of data as the data engineering megatrend impacts every industry. View the full article
  6. Modern teams need a more robust data integration solution than GTM. Here are 4 of the reasons GTM and data engineers don’t get along, plus a better solution. View the full article
  7. When it comes to Reverse ETL, the business use cases usually get all the attention. Here, we focus on how it makes data engineering easier. View the full article
  8. Expectations for data are higher than ever and come from a broad array of end users. These best practices will help you deliver better data products. View the full article
  9. Generative AI is all the rage. In this article, we dive into some practical examples for data engineers. Continue reading on Towards Data Science » View the full article
  10. Data pipelines that would turn you into a decorated data professional. Continue reading on Towards Data Science » View the full article
  11. Data engineering, particularly with Amazon Web Services (AWS), has evolved into an appealing and financially rewarding career path. The growing need for data engineers has elevated the salary spectrum within the field. But before diving in, there is an important question to answer: “What does an AWS Data Engineer salary look like?” Keep reading: this article covers the essential information about the AWS Certified Data Engineer Associate, delving into the associated salary and examining the factors that influence the salary range. To shine in data engineering, taking the AWS Certified Data Engineer Associate certification can be an ideal choice. Let’s get started!

Role of Data Engineers in Today’s World

Data engineers play a crucial role in developing the systems that handle data storage, extraction, and processing. They are responsible for constructing and maintaining databases for applications while overseeing the infrastructure necessary for their operation. As a data engineer, your responsibilities may include managing a SQL data store and a MongoDB NoSQL data warehouse, ensuring accessibility and functionality. Collaborating within a team alongside software engineers, developers, data analysts, and designers, data engineers contribute their expertise to gather and manipulate data, driving essential business objectives. The specific duties of a data engineer can vary across organizations and include tasks such as:

- Designing efficient data store indexes
- Selecting appropriate storage technologies (SQL or NoSQL)
- Maintaining data stores
- Replicating data across multiple machines
- Tuning data warehouses
- Creating and validating query plans
- Identifying patterns in historical data
- Analyzing and optimizing database performance

AWS Certified Data Engineer – Associate Certification: An Overview

AWS has recently introduced the AWS Certified Data Engineer – Associate certification exam. It serves as an excellent entry point for individuals seeking to delve into advanced specialty themes in AWS, even without prior data experience. Conversely, for those already engaged in data-related roles, this certification offers a valuable opportunity to deepen their comprehension of AWS by using specialized services they likely already work with. While acquiring these skills was always possible without formal certification, the introduction of a structured certification pathway not only encourages learners to pursue certification but also motivates training providers to address skill gaps with specialized guidance and resources. The certification confirms your expertise in core AWS data services, assessing your skills in configuring data pipelines, managing monitoring and troubleshooting, and optimizing performance while adhering to industry best practices. Because it validates those skills, it can also significantly enhance your earning potential.

Role of an AWS Certified Data Engineer Associate

For those new to the field of data engineering, enrolling in the AWS Data Engineer Certification Beta course is a valuable option.
The AWS Certified Data Engineer Associate Certification Exam (DEA-C01) follows the Associate-level tests for Solutions Architects, Developers, and SysOps Administrators, making it the company’s fourth Associate-level certification. An AWS Certified Data Engineer Associate is responsible for tasks such as:

- Ingesting and transforming data
- Orchestrating data pipelines and applying programming concepts when deploying them
- Operationalizing, maintaining, and monitoring data pipelines
- Identifying the most suitable data storage solution, crafting effective data models, and organizing data schemas efficiently
- Overseeing the entire lifecycle of data, from creation to disposal
- Evaluating and ensuring the quality of data through thorough analysis
- Enforcing proper measures such as authentication, authorization, data encryption, privacy, and governance for effective data management

Also Read: AWS Data Engineer Associate Certification guide

AWS Data Engineer Salary

AWS Data Engineer salaries in India: The average annual salary for an AWS Data Engineer in India is ₹21,20,567, with average additional cash compensation of ₹13,87,883.

AWS Data Engineer salaries in the USA: The average annual salary for an AWS Data Engineer in the United States is $129,716. That works out to approximately $62.36 per hour, $2,494 per week, or $10,809 per month.

Factors Influencing AWS Data Engineer Associate Salary

The salary of an AWS Certified Data Engineer Associate is influenced by a variety of factors that collectively shape the compensation landscape for professionals in this field. Understanding these factors is crucial both for aspiring data engineers and for those looking to negotiate their salaries. Key elements include:

Experience Level: Salaries differ based on experience level; entry-level candidates earn less than those with several years of hands-on data engineering experience. For example, an entry-level AWS Certified Data Engineer Associate earns an average of $124,786 per year, while a senior AWS Data Engineer typically earns around $175,000 per year.

Certification: Even though education is significant, relevant certifications can help you move into data engineering jobs. The AWS Data Engineer Associate salary can be maximized if you also hold certifications such as the AWS Certified Big Data – Specialty certification or Google Professional Data Engineer.

Location: Location stands out as one of the paramount factors influencing the AWS Data Engineer salary. The geographical setting where a professional works significantly shapes compensation, reflecting variations in living costs, market demand, and economic conditions. The locations where data engineers can earn the highest salaries include Seattle, Maryland, and Washington, with average salaries of over $211,350 per year.

Skill Set: To shine in the data engineering field, you should possess the following skills:

- ETL Tools: Understanding and utilizing various ETL tools is crucial for effective data management. These tools enable professionals to extract data from diverse sources, transform it according to specific requirements, and load it into databases or data warehouses. Examples of popular ETL tools include Apache NiFi, Talend, Informatica, and Microsoft SSIS.
- SQL: SQL (Structured Query Language) is the fundamental language for interacting with databases. Given that substantial volumes of data are typically stored in expansive data warehouses, proficiency in SQL is imperative. It empowers data engineers to retrieve, manipulate, and manage data efficiently.
- Python: Programming languages play a pivotal role in ETL tasks and data management activities, and Python stands out as one of the most versatile and widely used languages for these purposes. Data engineers often use Python for scripting, automation, and various data-related tasks.
- Big Data Tools and Cloud Storage: Dealing with extensive datasets is a common part of a data engineer’s role, so familiarity with big data tools such as Hadoop and Spark, and with cloud storage solutions like Amazon S3, Google Cloud Storage, and Azure Data Lake Storage (ADLS), is crucial. These tools streamline the handling and processing of large-scale data.
- Query Engines: Proficiency in query engines like Apache Spark and Apache Flink is essential for running queries against sizable datasets. These engines enable data engineers to process and analyze data efficiently, making them indispensable tools in the data engineering toolkit.
- Data Warehousing Concepts: Data engineers maintain data warehouses, so a comprehensive understanding of data warehousing concepts is vital. This includes knowledge of key components such as the Enterprise Data Warehouse (EDW), Operational Data Store (ODS), and Data Mart. Mastery of these concepts ensures effective data storage, organization, and retrieval.

Possessing proficiency in tools such as Python and SQL can secure a data engineer an average salary of 8.5 and 8.6 LPA, respectively. Following closely are skills related to Hadoop and ETL, which garner an average salary of 9 LPA. To maximize earnings, expertise in Amazon Web Services (AWS) and Apache Spark is pivotal, as they can lead to average salaries of 9.8 and 10 LPA, respectively.

Employer: Data engineers are paid more at larger firms such as Google, Apple, and Meta. In India, data engineers who work at Cognizant earn an average of about ₹819,207 per year, while IBM pays AWS data engineers around ₹950,000 per year.

Job Title Variations: Data engineer job titles differ based on the company, the tasks involved, and the skills required. Here are some job titles that data engineers can hold, with average salaries:

- Enterprise Data Architect: $172,872
- AI Engineer: $126,774
- Cloud Data Engineer: $116,497
- Hadoop Engineer: $143,322
- Database Architect: $143,601
- Data Science Engineer: $127,966
- Big Data Engineer: $116,675
- Information Systems Engineer: $92,340

How to Improve Your AWS Certified Data Engineer Associate Salary

To enhance your AWS Certified Data Engineer Associate salary, consider the following strategies:

- Continuous Learning: Stay updated on the latest AWS technologies and best practices. Attend training sessions, webinars, and workshops to expand your knowledge.
- Earn Additional Certifications: Obtain other relevant certifications to demonstrate a diverse skill set. This can make you more valuable to employers and potentially lead to a higher salary.
- Gain Practical Experience: Apply your knowledge through hands-on projects and real-world scenarios. Practical experience is highly valued and can set you apart in the job market.
- Build a Strong Professional Network: Connect with other professionals in the field, attend industry events, and participate in online forums. Networking can open up new opportunities and provide insights into salary trends.
- Showcase Your Achievements: Highlight your accomplishments on your resume and LinkedIn profile. Quantify your impact on projects and emphasize how your skills have contributed to business objectives.
- Negotiation Skills: Develop effective negotiation skills for discussing salary with potential employers. Research industry salary benchmarks and be prepared to make a compelling case for your value.
- Specialize in High-Demand Areas: Focus on specialized areas within AWS that are in high demand, such as specific data analytics tools, machine learning, or database management skills.
- Seek Leadership Roles: Transitioning into leadership positions can often lead to higher salaries. Develop your leadership skills and take on responsibilities that demonstrate your ability to lead and manage teams.
- Stay Informed About Market Trends: Keep track of industry trends and market demands. If you can align your skills with emerging technologies and trends, you may find yourself in higher demand.

Key AWS Services to Prioritize for the DEA-C01 Exam

To prepare effectively for the AWS Data Engineer Associate exam, focus on specific concepts and AWS services to optimize study time and avoid unnecessary topics. It is highly recommended to dedicate more time to the following AWS services:

- Amazon Athena
- Amazon Redshift
- Amazon QuickSight
- Amazon EMR (Elastic MapReduce)
- AWS Lake Formation
- Amazon EventBridge
- AWS Glue
- Amazon Kinesis
- Amazon Managed Service for Apache Flink
- Amazon Managed Streaming for Apache Kafka (Amazon MSK)
- Amazon OpenSearch Service

FAQs

Is it worth getting a certificate in the data engineering field? Yes. A data engineering certificate is a proven way to enhance your earning potential. It also demonstrates to potential employers that you are committed to staying updated on the latest advancements in the field and dedicated to continuous learning.

Does an AWS Data Engineer require coding knowledge? Deep software engineering expertise is not strictly required, but as the skills above suggest, working knowledge of SQL and Python is expected for most AWS data engineering tasks.

What skills are required for an AWS data engineer? To become an AWS Data Engineer, you need the following skills to execute data engineering tasks effectively:

- SQL skills
- Data modelling
- Hadoop for big data
- Python
- AWS cloud services

What is AWS Data Engineering? AWS Data Engineering entails gathering data from various sources to be stored, processed, analyzed, and visualized, and building pipelines on the AWS platform.

Conclusion

We hope this blog clarifies the AWS Certified Data Engineer salary and the factors that influence it. As the demand for skilled data engineers continues to rise, obtaining the AWS Certified Data Engineer Associate credential not only validates your expertise but also enhances your earning potential. To dive deeper into the data engineering world, try our hands-on labs and sandbox. View the full article
  12. This week on KDnuggets: Discover GitHub repositories from machine learning courses, bootcamps, books, tools, interview questions, cheat sheets, MLOps platforms, and more to master ML and secure your dream job • Data engineers must prepare and manage the infrastructure and tools necessary for the whole data workflow in a data-driven company • And much, much more! View the full article
  13. Tips to prepare for a job interview. Continue reading on Towards Data Science » View the full article
  14. Access all of Datacamp's 460+ data and AI courses, career tracks & certifications ... https://www.datacamp.com/freeweek
  15. Platform-Specific Tools and Advanced Techniques

Photo by Christopher Burns on Unsplash

The modern data ecosystem keeps evolving and new data tools emerge now and then. In this article, I want to talk about crucial things that affect data engineers and discuss how to use this knowledge to power advanced analytics pipelines and operational excellence. I’d like to discuss some popular data engineering questions:

- Modern data engineering (DE): what is it?
- Does your DE work well enough to fuel advanced data pipelines and business intelligence (BI)?
- Are your data pipelines efficient?
- What is required from a technological point of view to enable operational excellence?

Back in October, I wrote about the rise of the data engineer: the role, its challenges, responsibilities, daily routine and how to become successful in this field (How to Become a Data Engineer [12]). The data engineering landscape is constantly changing, but the major trends seem to remain the same. As a data engineer, I am tasked with designing efficient data processes almost every day, so here are a few things to consider that can help us answer these questions.

Modern data engineering trends

- ELT vs ETL
- Simplified data connectors and API integrations
- ETL frameworks explosion
- Data infrastructure as code
- Data Mesh and decentralized data management
- Democratization of business intelligence pipelines using AI
- Focus on data literacy

ELT vs ETL

Popular SQL data transformation tools like Dataform and DBT made a significant contribution to the popularisation of the ELT approach [1]. It simply makes sense to perform required data transformations, such as cleansing, enrichment and extraction, in the place where the data is stored. Often that is a data warehouse (DWH) at the centre of our infrastructure. Cloud platform leaders have made DWH infrastructure management (Snowflake, BigQuery, Redshift, Firebolt) really simple, and in many scenarios they will outperform a dedicated in-house infrastructure management team in terms of cost-effectiveness and speed.

(Figure: data warehouse example. Image by author.)

The centre might also be a data lake, depending on the type of data platform and the tools we use. In that case SQL often stops being an option, making it difficult to query the data for users who are not familiar with programming. Tools like Databricks, Tabular and Galaxy try to solve this problem, and it really feels like the future. Indeed, data lakes can store all types of data, including unstructured data, and we still need to be able to analyse these datasets.

(Figure: data lake example. Image by author.)

Just imagine transactionally consistent data lake tables with point-in-time snapshot isolation. I previously wrote about this in one of my stories on the Apache Iceberg table format [2].

Simplified data integrations

Managed solutions like Fivetran and Stitch were built to manage third-party API integrations with ease. These days many companies choose this approach to simplify data interactions with their external data sources, and it is the right way to go for data analyst teams that are not familiar with coding. Indeed, why would we build a data connector from scratch if it already exists and is managed in the cloud? The downside of this approach is its pricing model: very often it is row-based and can become quite expensive at an enterprise level of data ingestion, i.e. big data pipelines. This is where open-source alternatives come into play.
Frameworks like Airbyte and Meltano might be an easy and quick way to deploy a data source integration microservice. If you don’t have time to learn a new ETL framework, you can create a simple data connector yourself; if you know a bit of Python, it is a trivial task. In one of my previous articles I wrote about how easy it is to create a microservice that pulls data from the NASA API [3]. Consider this code snippet for app.py:

    import requests

    session = requests.Session()

    url = "https://api.nasa.gov/neo/rest/v1/feed"
    apiKey = "your_api_key"
    requestParams = {
        'api_key': apiKey,
        'start_date': '2023-04-20',
        'end_date': '2023-04-21'
    }
    response = session.get(url, params=requestParams, stream=True)
    print(response.status_code)

It can be deployed on any cloud vendor platform and scheduled to run with the required frequency. It is always good practice to use something like Terraform to deploy our data pipeline applications.

ETL frameworks explosion

We can witness a “Cambrian explosion” of various ETL frameworks for data extraction and transformation. It is no surprise that many of them are open source and Python-based.

Luigi [8] is one of them and it helps to create ETL pipelines. It was created by Spotify to manage massive data processing workloads. It has a command line interface and great visualization features. However, even basic ETL pipelines require a certain level of Python programming skills. From my experience, it is great for strict and straightforward pipelines: I find it particularly difficult to implement complex branching logic using Luigi, but it works great in many scenarios.

Python ETL (PETL) [9] is one of the most widely used open-source ETL frameworks for straightforward data transformations. It is invaluable for working with tables, extracting data from external data sources and performing basic ETL on data. In many ways it is similar to Pandas, but the latter has more analytics capabilities under the hood. PETL is great for aggregation and row-level ETL.

Bonobo [10] is another open-source, lightweight data processing tool which is great for rapid development, automation and parallel execution of batch-processing data pipelines. What I like about it is that it makes it really easy to work with various data file formats, i.e. SQL, XML, XLS, CSV and JSON. It will be a great tool for those with minimal Python knowledge. Among other benefits, I like that it works well with semi-complex data schemas. It is ideal for simple ETL and can run in Docker containers (it has a Docker extension).

Pandas is an absolute beast in the world of data and there is no need to cover its capabilities in this story. It is worth mentioning that its data frame transformations have been included as one of the basic methods of data loading for many modern data warehouses. Consider this data loading sample into the BigQuery data warehouse solution:

    import json

    import pandas
    from google.cloud import bigquery
    from google.oauth2 import service_account

    ...
    # Authenticate BigQuery client:
    service_account_str = config.get('BigQuery')  # Use config
    credentials = service_account.Credentials.from_service_account_info(service_account_str)
    client = bigquery.Client(credentials=credentials, project=credentials.project_id)
    ...

    def load_table_from_dataframe(table_schema, table_name, dataset_id):
        # ! source data file format must be outer array JSON:
        # [
        #     {"id":"1"},
        #     {"id":"2"}
        # ]
        blob = """
        [
            {"id":"1","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]},
            {"id":"2","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]}
        ]
        """
        body = json.loads(blob)
        print(pandas.__version__)

        table_id = client.dataset(dataset_id).table(table_name)
        job_config = bigquery.LoadJobConfig()
        schema = create_schema_from_yaml(table_schema)  # helper defined elsewhere
        job_config.schema = schema

        df = pandas.DataFrame(
            body,
            # In the loaded table, the column order reflects the order of the
            # columns in the DataFrame.
            columns=["id", "first_name", "last_name", "dob", "addresses"],
        )
        df['addresses'] = df.addresses.astype(str)
        df = df[['id', 'first_name', 'last_name', 'dob', 'addresses']]
        print(df)

        load_job = client.load_table_from_dataframe(
            df,
            table_id,
            job_config=job_config,
        )
        load_job.result()
        print("Job finished.")

Apache Airflow, for example, is not an ETL tool per se, but it helps to organize our ETL pipelines into a nice visualization of dependency graphs (DAGs) that describe the relationships between tasks. A typical Airflow architecture includes a metadata-based scheduler, executors, workers and tasks. For example, we can run ml_engine_training_op after we export data into cloud storage (bq_export_op) and make this workflow run daily or weekly.

(Figure: ML model training using Airflow. Image by author.)

Consider the example below. It creates a simple data pipeline graph to export data into a cloud storage bucket and then trains an ML model using MLEngineTrainingOperator:

    """DAG definition for recommendation_bespoke model training."""

    import datetime

    import airflow
    from airflow import DAG
    from airflow.contrib.operators.bigquery_operator import BigQueryOperator
    from airflow.contrib.operators.bigquery_to_gcs import BigQueryToCloudStorageOperator
    from airflow.hooks.base_hook import BaseHook
    from airflow.operators.app_engine_admin_plugin import AppEngineVersionOperator
    from airflow.operators.ml_engine_plugin import MLEngineTrainingOperator


    def _get_project_id():
        """Get project ID from default GCP connection."""
        extras = BaseHook.get_connection('google_cloud_default').extra_dejson
        key = 'extra__google_cloud_platform__project'
        if key in extras:
            project_id = extras[key]
        else:
            raise ValueError('Must configure project_id in google_cloud_default '
                             'connection from Airflow Console')
        return project_id

    PROJECT_ID = _get_project_id()

    # Data set constants, used in BigQuery tasks. You can change these
    # to conform to your data.
    DATASET = 'staging'  # 'analytics'
    TABLE_NAME = 'recommendation_bespoke'

    # GCS bucket names and region, can also be changed.
    BUCKET = 'gs://rec_wals_eu'
    REGION = 'us-central1'  # 'europe-west2'  # 'us-east1'
    JOB_DIR = BUCKET + '/jobs'

    default_args = {
        'owner': 'airflow',
        'depends_on_past': False,
        'start_date': airflow.utils.dates.days_ago(2),
        'email': ['mike.shakhomirov@gmail.com'],
        'email_on_failure': True,
        'email_on_retry': False,
        'retries': 5,
        'retry_delay': datetime.timedelta(minutes=5)
    }

    # Default schedule interval using cronjob syntax - can be customized here
    # or in the Airflow console.
    schedule_interval = '00 21 * * *'

    dag = DAG('recommendations_training_v6',
              default_args=default_args,
              schedule_interval=schedule_interval)

    dag.doc_md = __doc__

    #
    # Task Definition
    #

    # BigQuery training data export to GCS
    training_file = BUCKET + '/data/recommendations_small.csv'  # just a few records for staging

    t1 = BigQueryToCloudStorageOperator(
        task_id='bq_export_op',
        source_project_dataset_table='%s.recommendation_bespoke' % DATASET,
        destination_cloud_storage_uris=[training_file],
        export_format='CSV',
        dag=dag
    )

    # ML Engine training job
    job_id = 'recserve_{0}'.format(datetime.datetime.now().strftime('%Y%m%d%H%M'))
    job_dir = BUCKET + '/jobs/' + job_id
    output_dir = BUCKET
    delimiter = ','
    data_type = 'user_groups'
    master_image_uri = 'gcr.io/my-project/recommendation_bespoke_container:tf_rec_latest'

    training_args = ['--job-dir', job_dir,
                     '--train-file', training_file,
                     '--output-dir', output_dir,
                     '--data-type', data_type]

    master_config = {"imageUri": master_image_uri}

    t3 = MLEngineTrainingOperator(
        task_id='ml_engine_training_op',
        project_id=PROJECT_ID,
        job_id=job_id,
        training_args=training_args,
        region=REGION,
        scale_tier='CUSTOM',
        master_type='complex_model_m_gpu',
        master_config=master_config,
        dag=dag
    )

    t3.set_upstream(t1)

Bubbles [11] is another open-source tool for ETL in the Python world. It is great for rapid development and I like how it works with metadata to describe data pipelines. The creators of Bubbles call it an “abstract framework” and say that it can be used from many other programming languages, not exclusively from Python. There are many other tools with more specific applications, i.e. extracting data from web pages (PyQuery, BeautifulSoup, etc.) and parallel data processing. That can be a topic for another story, but I have written about some of them before, e.g. the joblib library [12].

Data infrastructure as code

Infrastructure as code (IaC) is a popular and very functional approach for managing data platform resources. Even for data, it is pretty much a standard right now, and it definitely looks great on your CV, telling your potential employers that you are familiar with DevOps standards. Using tools like Terraform (platform agnostic) and CloudFormation, we can integrate our development work and deployments (operations) with ease. In general, we want staging and production data environments for our data pipelines. This helps to test our pipelines and facilitates collaboration between teams. Consider the diagram below; it explains how data environments work.

(Figure: data environments. Image by author.)

Often we might need an extra sandbox for testing purposes, or to run data transformation unit tests when our ETL services trigger CI/CD workflows (a minimal sketch of such a test follows below).
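To make the CI/CD point above concrete, here is a minimal sketch of the kind of data transformation unit test an ETL service could run in a staging environment before deployment. The clean_orders function, the column names and the expected values are hypothetical examples, not code from the original article.

    # Hypothetical example: a small pandas transformation plus a pytest-style unit
    # test that a CI/CD pipeline could execute before promoting an ETL service.
    import pandas as pd

    def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
        """Drop rows without an order_id and normalise the amount column."""
        out = df.dropna(subset=["order_id"]).copy()
        out["amount"] = out["amount"].astype(float).round(2)
        return out

    def test_clean_orders_drops_missing_ids_and_rounds_amounts():
        raw = pd.DataFrame(
            {"order_id": ["a1", None, "a3"], "amount": ["10.004", "3.2", "7.1"]}
        )
        result = clean_orders(raw)
        assert list(result["order_id"]) == ["a1", "a3"]
        assert result["amount"].tolist() == [10.0, 7.1]

Running pytest against tests like this in the staging environment keeps broken transformations out of the production pipeline.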
On the IaC side, I previously wrote about this here: Infrastructure as Code for Beginners. Using AWS CloudFormation template files, we can describe the required resources and their dependencies so that we can launch and configure them together as a single stack. If you are a data professional, this approach will definitely help you work with different data environments and replicate data platform resources faster and more consistently, without errors. The problem is that many data practitioners are not familiar with IaC, and that creates a lot of errors during the development process.

Data Mesh and decentralized data management

The data space has evolved significantly during the last decade and now we have lots of data tools and frameworks. Data Mesh describes the state where we have different data domains (company departments) with their own teams and shared data resources. Each team has its own goals, KPIs, data roles and responsibilities. For a long period of time, data bureaucracy has been a real pain for many companies. This data platform type [4] might seem a bit chaotic, but it was meant to become a successful and efficient choice for companies where decentralization enables different teams to access cross-domain datasets and run analytics or ETL tasks on their own. Indeed, Snowflake might be your favourite data warehouse solution if you are a data analyst not familiar with Spark, but wanting to read data lake data without data engineering help is a common problem. In this scenario, a set of metadata records describing datasets can be extremely useful, and that is why Data Mesh is so successful: it gives users knowledge about the data, its origins, and how other teams can make the best of datasets they weren’t previously aware of.

Sometimes datasets and data source connections become very intricate, and it is always good practice to have a single-source-of-truth repository with metadata and dataset descriptions. In one of my previous stories [5] I wrote about the role of SQL as a unified querying language for teams and data. Indeed, it is analytical, self-descriptive and can even be dynamic, which makes it a perfect tool for all data users. Often it all turns into a big mes(s/h). This fact makes SQL-based templating engines like DBT, Jinja and Dataform very popular. Just imagine an SQL-like platform where all datasets and their transformations are described and defined thoroughly [6].

(Figure: Dataform’s dependency graph and metadata. Image by author.)

It can be a big challenge to understand how data teams relate to data sources and schemas; very often it is all tangled in a spaghetti of dataset dependencies and ETL transformations. Data engineering plays a critical role in mentoring, improving data literacy and empowering the rest of the company with state-of-the-art data processing techniques and best practices.

Democratization of Business Intelligence pipelines using AI

Improving data accessibility has always been a popular topic in the data space, but it is interesting to see how the whole data pipeline design process is becoming increasingly accessible to teams that weren’t familiar with data before. Now almost every department can use built-in AI capabilities to create complex BI transformations on data. All they need to do is describe what they want, BI-wise, in their own words. For example, BI tools like ThoughtSpot use AI with an intuitive “Google-like search interface” [7] to gain insights from data stored in any modern DWH solution such as Google BigQuery, Redshift, Snowflake or Databricks.

The Modern Data Stack includes BI tools that help with data modelling and visualization, and many of them already have built-in AI capabilities to surface data insights faster based on user behaviour. I believe it is a fairly easy task to integrate GPT and BI, and in the next couple of years we will see many new products using this tech. GPT can pre-process text data to generate a SQL query that understands your intent and answers your question.

Conclusion

In this article, I tried to give a very high-level overview of the major data trends that affect the data engineering role these days. Data Mesh and templated SQL with dependency graphs to facilitate data literacy have democratized the whole analytics process. Advanced data pipelines with intricate ETL techniques and transformations can now be transparent to everyone in the organisation. Data pipelines are becoming increasingly accessible to other teams, and they don’t need to know programming to learn and understand the complexity of ETL; Data Mesh and metadata help to solve this problem. From my experience, I keep seeing more and more people learning SQL to contribute to the transformation layer. Companies born during the “advanced data analytics” age have the luxury of easy access to cloud vendor products and their managed services, which definitely helps to acquire the required data skills and improve them to gain a competitive advantage.

Recommended read

[1] https://medium.com/towards-data-science/data-pipeline-design-patterns-100afa4b93e3
[2] https://towardsdatascience.com/introduction-to-apache-iceberg-tables-a791f1758009
[3] https://towardsdatascience.com/python-for-data-engineers-f3d5db59b6dd
[4] https://medium.com/towards-data-science/data-platform-architecture-types-f255ac6e0b7
[5] https://medium.com/towards-data-science/advanced-sql-techniques-for-beginners-211851a28488
[6] https://medium.com/towards-data-science/easy-way-to-create-live-and-staging-environments-for-your-data-e4f03eb73365
[7] https://docs.thoughtspot.com/cloud/latest/search-sage
[8] https://github.com/spotify/luigi
[9] https://petl.readthedocs.io/en/stable/
[10] https://www.bonobo-project.org
[11] http://bubbles.databrewery.org/
[12] https://medium.com/towards-data-science/how-to-become-a-data-engineer-c0319cb226c2

Modern Data Engineering was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story. View the full article
  16. Advanced ETL techniques for beginners ... View the full article
  17. AWS Data Engineering is a vital element of the AWS Cloud for delivering end-to-end data solutions to end users. Data engineering on AWS assists big data professionals in managing data pipelines, data transfer, and data storage. AWS data engineers do the same jobs as general data engineers, but they work exclusively on the Amazon Web Services cloud platform. To succeed in data engineering on AWS, one should have a solid understanding of both AWS and data engineering principles. To nurture your data engineering skills from the foundational level, it is a good idea to take the AWS Data Engineer certification... View the full article
  18. Airflow has been adopted by many Cloudera Data Platform (CDP) customers in the public cloud as the next-generation orchestration service to set up and operationalize complex data pipelines. Today, customers have deployed hundreds of Airflow DAGs in production performing various data transformation and preparation tasks, with differing levels of complexity. This, combined with Cloudera Data Engineering’s (CDE) first-class job management APIs and centralized monitoring, is delivering new value for modernizing enterprises. As we mentioned before, instead of relying on one custom monolithic process, customers can develop modular data transformation steps that are more reusable and easier to debug, which can then be orchestrated with glueing logic at the level of the pipeline. That’s why we are excited to announce the next evolutionary step on this modernization journey by lowering the barrier even further for data practitioners looking for flexible pipeline orchestration — introducing CDE’s completely new pipeline authoring UI for Airflow.

Until now, the setup of such pipelines still required knowledge of Airflow and the associated Python configurations. This presented challenges for users building more complex multi-step pipelines that are typical of DE workflows. We wanted to hide those complexities from users, making multi-step pipeline development as self-service as possible and providing an easier path to developing, deploying, and operationalizing true end-to-end data pipelines.

Easing development friction

We started out by interviewing customers to understand where the most friction exists in their pipeline development workflows today. In the process, several key themes emerged:

- Low/no-code: By far the biggest barrier for new users is creating custom Airflow DAGs. Writing code is error prone and requires trial and error. Any way to minimize coding and manual configuration will dramatically streamline the development process.
- Long tail of operators: Although Airflow offers hundreds of operators, users tend to use only a subset of them. Making the most commonly used operators as readily available as possible is critical to reducing development friction.
- Templates: Airflow DAGs are a great way to isolate pipelines and monitor them independently, making them more operationally friendly for DE teams. But when we looked across Airflow DAGs we often noticed similar patterns, where the majority of the operations were identical except for a series of configurations like table names and directories – the 80/20 rule clearly at play.

This laid the foundation for some of the key design principles we applied to our authoring experience.

Pipeline Authoring UI for Airflow

With the CDE pipeline authoring UI, any CDE user, irrespective of their level of Airflow expertise, can create multi-step pipelines with a combination of out-of-the-box operators (CDEOperator, CDWOperator, BashOperator, PythonOperator). More advanced users can still deploy their own custom Airflow DAGs as before, or use the pipeline authoring UI to bootstrap their projects for further customization (as we describe later, the pipeline engine generates Airflow code, which can be used as a starting point for more complex scenarios). And once a pipeline has been developed through the UI, users can deploy and manage these data pipeline jobs like other CDE applications through the API, CLI or UI.
(Figure 1: “Editor” screen for authoring Airflow pipelines, with operators (left), canvas (middle), and context-sensitive configuration panel (right).)

The “Editor” is where all the authoring operations take place — a central interface to quickly sequence together your pipelines. It was critical to make the interactions as intuitive as possible to avoid slowing down the flow of the user. The user is presented with a blank canvas with click-and-drop operators: a palette focused on the most commonly used operators on the left, and a context-sensitive configuration panel on the right. As the user drops new operators onto the canvas, they can specify dependencies through an intuitive click-and-drag interaction. Clicking on an existing operator within the canvas brings it into focus, which triggers an update to the configuration panel on the right. Hovering over any operator highlights each side with four dots, inviting the user to use a click-and-drag action to create a connection with another operator.

(Figure 2: Creating dependencies with a simple click and drag.)

Pipeline Engine

To make the authoring UI as flexible as possible, a translation engine was developed that sits between the user interface and the final Airflow job. Each “box” (step) on the canvas serves as a task in the final Airflow DAG. Multiple steps comprise the overall pipeline and are stored as pipeline definition files in the CDE resource of the job. This intermediate definition can easily be integrated with source code management, such as Git, as needed. When the pipeline is saved in the editor screen, a final translation is performed whereby the corresponding Airflow DAG is generated and loaded into the Airflow server. This makes our pipeline engine flexible enough to support a multitude of orchestration services; today we support Airflow, but in the future it can be extended to meet other requirements. An additional benefit is that this can also serve to bootstrap more complex pipelines: the generated Airflow Python code can be modified by end users to accommodate custom configurations and then uploaded as a new job. This way users don’t have to start from scratch; they can build an outline of what they want to achieve, output the skeleton Python code, and then customize it.

Templatizing Airflow

Airflow provides a way to templatize pipelines, and with CDE we have integrated that with our APIs to allow job parameters to be pushed down to Airflow as part of the execution of the pipeline. A simple example of this would be parameterizing the SQL query within the CDW operator. Using the special syntax {{..}}, the developer can include placeholders for different parts of the query, for example the SELECT expression or the table being referenced in the FROM section:

    SELECT {{ dag_run.conf['conf1'] }} FROM {{ dag_run.conf['conf2'] }} LIMIT 100

This can be entered through the configuration panel in the UI as shown here. Once the pipeline is saved and the Airflow job generated, it can be programmatically triggered through the CDE CLI/API with the configuration override options:

    $ cde job run --config conf1='column1, sum(1)' --config conf2='default.txn' --name example_airflow_job

The same Airflow job can now be used to generate different SQL reports. (A minimal hand-written sketch of a parameterized DAG like this appears at the end of this post.)

Looking forward

With early design partners we already have enhancements in the works to continue improving the experience. Some of them include:

- More operators – as we mentioned earlier, there is a small set of highly used operators, and we want to ensure the most commonly used ones are easily accessible to the user. Additionally, the introduction of more CDP operators that integrate with CML (machine learning) and COD (operational database) is critical for a complete end-to-end orchestration service.
- UI improvements to make the experience even smoother. These span common usability improvements like pan and zoom and undo-redo operations, and a mechanism to add comments to make more complex pipelines easier to follow.
- Auto-discovery can be powerful when applied to help autocomplete various configurations, such as referencing a pre-defined Spark job for the CDE task or the Hive virtual warehouse endpoint for the CDW query task.
- Ready-to-use pipelines – although parameterized Airflow jobs are a great way to develop reusable pipelines, we want to make them even easier to specify through the UI. There are also opportunities for us to provide ready-to-use pipeline definitions that capture very common patterns, such as detecting files in an S3 bucket, running data transformation with Spark, and performing data mart creation with Hive.

With this Technical Preview release, any CDE customer can test drive the new authoring interface by setting up the latest CDE service. When creating a Virtual Cluster, a new option will allow the enablement of the Airflow authoring UI. Stay tuned for more developments in the coming months and, until then, happy pipeline building! The post Introducing Self-Service, No-Code Airflow Authoring UI in Cloudera Data Engineering appeared first on Cloudera Blog. View the full article
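For readers who want to see roughly what the generated code looks like, here is a minimal hand-written Airflow DAG that combines the out-of-the-box BashOperator and PythonOperator mentioned above and reads a parameter pushed in at trigger time via dag_run.conf. This is an illustrative sketch only, not the code CDE’s pipeline engine actually emits; the task ids, the target_table key and the Airflow 2-style imports are assumptions for the example.

    # Illustrative sketch: a two-step DAG of the kind the authoring UI produces,
    # parameterized through dag_run.conf at trigger time (e.g. via `cde job run`).
    # Task ids, the 'target_table' key and the logic below are hypothetical.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator

    def report_table(**context):
        # Read a runtime parameter supplied as a configuration override.
        table = context["dag_run"].conf.get("target_table", "default.txn")
        print(f"Running report against {table}")

    with DAG(
        dag_id="example_authoring_ui_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,  # triggered on demand
        catchup=False,
    ) as dag:
        extract = BashOperator(
            task_id="extract_step",
            bash_command="echo 'extract data here'",
        )
        report = PythonOperator(
            task_id="report_step",
            python_callable=report_table,
        )
        extract >> report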
  19. What is Cloudera Data Engineering (CDE)?

Cloudera Data Engineering is a serverless service for Cloudera Data Platform (CDP) that allows you to submit jobs to auto-scaling virtual clusters. CDE enables you to spend more time on your applications and less time on infrastructure. CDE allows you to create, manage, and schedule Apache Spark jobs without the overhead of creating and maintaining Spark clusters. In addition, you can define virtual clusters with a range of CPU and memory resources, and the cluster scales up and down as needed to execute your Spark workloads, helping control your cloud costs. A managed, serverless Spark service helps our customers in a number of ways:

- Auto-scaling of compute to eliminate static infrastructure costs. This ensures that customers do not have to maintain a large infrastructure footprint, reducing total cost of ownership.
- The ability for business users to easily control their own compute needs with the click of a button, without IT intervention.
- A complete view of job performance, logging and debugging through a single pane of glass, enabling efficient development on Spark.

Refer to the following Cloudera blog to understand the full potential of Cloudera Data Engineering.

Why should technology partners care about CDE?

Unlike traditional data engineering workflows that have relied on a patchwork of tools for preparing, operationalizing, and debugging data pipelines, Cloudera Data Engineering is designed for efficiency and speed — seamlessly integrating and securing data pipelines to any CDP service including Machine Learning, Data Warehouse, Operational Database, or any other analytic tool in your business. Partner tools that use CDP as their backend store can leverage this new service to ensure their customers can take advantage of a serverless architecture for Spark. ISV partners, like Precisely, support Cloudera’s hybrid vision: Precisely Data Integration, Change Data Capture and Data Quality tools support CDP Public Cloud as well as CDP Private Cloud, so Precisely end customers can now design a pipeline once and deploy it anywhere. Data pipelines that are bursty in nature can leverage the public cloud CDE service, while longer-running persistent loads can run on-prem. This ensures that the right data pipelines are running on the most cost-effective engines available in the market today.

Using the CDE Integration API

CDE provides a robust API for integration with your existing continuous integration/continuous delivery platforms. The Cloudera Data Engineering service API is documented in Swagger. You can view the API documentation and try out individual API calls by accessing the API DOC link in any virtual cluster:

- In the CDE web console, select an environment.
- Click the Cluster Details icon in any of the listed virtual clusters.
- Click the link under API DOC.

For further details on the API, please refer to the following doc link here.

Custom base image for Kubernetes

Partners who need to run their own business logic and require custom binaries or packages on the Spark engine platform can now leverage this feature of Cloudera Data Engineering. We believe customized engine images allow our partners greater flexibility to build cloud-native integrations, and they could potentially be leveraged by our enterprise customers as well. The following steps describe how to run Spark jobs with dependencies on external libraries and packages.
The libraries and packages are installed on top of the base image to make them available to the Spark executors.

First, obtain the latest CDE CLI:

- Create a virtual cluster.
- Go to the virtual cluster details page.
- Download the CLI.

Learn more on how to use the CLI here.

Run Spark jobs on a customized container image – overview

Custom images are based on the base dex-spark-runtime image, which is accessible from the Cloudera Docker repository. Users can then layer their packages and custom libraries on top of the base image. The final image is uploaded to a Docker repository, which is then registered with CDE as a job resource. New jobs are defined with references to the resource, which automatically downloads the custom runtime image to run the Spark drivers and executors.

Run Spark jobs on a customized container image – steps

1. Pull the “dex-spark-runtime” image from “docker.repository.cloudera.com”:

    $ docker pull container.repository.cloudera.com/cloudera/dex/dex-spark-runtime:<version>

Note: “docker.repository.cloudera.com” is behind the paywall and will require credentials to access; please ask your account team to provide them.

2. Create your “custom-dex-spark-runtime” image, based on the “dex-spark-runtime” image:

    $ docker build --network=host -t <company-registry>/custom-dex-spark-runtime:<version> . -f Dockerfile

Dockerfile example:

    FROM docker.repository.cloudera.com/<company-name>/dex-spark-runtime:<version>
    USER root
    RUN yum install ${YUM_OPTIONS} <package-to-install> && yum clean all && rm -rf /var/cache/yum
    RUN dnf install ${DNF_OPTIONS} <package-to-install> && dnf clean all && rm -rf /var/cache/dnf
    USER ${DEX_UID}

3. Push the image to your company Docker registry:

    $ docker push <company-registry>/custom-dex-spark-runtime:<version>

4. Create an ImagePullSecret in the DE cluster for the company’s Docker registry (optional).

REST API:

    # POST /api/v1/credentials
    {
      "name": "<company-registry-basic-credentials>",
      "type": "docker",
      "uri": "<company-registry>",
      "secret": {
        "username": "foo",
        "password": "bar"
      }
    }

CDE CLI:

    === credential ===
    ./cde credential create --type=docker-basic --name=docker-sandbox-cred --docker-server=https://docker-sandbox.infra.cloudera.com --docker-username=foo --tls-insecure --user srv_dex_mc --vcluster-endpoint https://gbz7t69f.cde-vl4zqll4.dex-a58x.svbr-nqvp.int.cldr.work/dex/api/v1

Note: credentials will be stored as a Kubernetes “Secret” and are never stored by the DEX API.

5. Register “custom-dex-spark-runtime” in DE as a “Custom Spark Runtime Image” resource.

REST API:

    # POST /api/v1/resources
    {
      "name": "",
      "type": "custom-spark-runtime-container-image",
      "engine": "spark2",
      "image": <company-registry>/custom-dex-spark-runtime:<version>,
      "imagePullSecret": <company-registry-basic-credentials>
    }

CDE CLI:

    === runtime resources ===
    ./cde resource create --type="custom-runtime-image" --image-engine="spark2" --name="custom-dex-qe-1_1" --image-credential=docker-sandbox-cred --image="docker-sandbox.infra.cloudera.com/dex-qe/custom-dex-qe:1.1" --tls-insecure --user srv_dex_mc --vcluster-endpoint https://gbz7t69f.cde-vl4zqll4.dex-a58x.svbr-nqvp.int.cldr.work/dex/api/v1

6. You should now be able to define Spark jobs referencing the custom dex-spark-runtime image. (A Python sketch of these REST calls using the requests library appears at the end of this post.)

REST API:

    # POST /api/v1/jobs
    {
      "name": "spark-custom-image-job",
      "spark": {
        "imageResource": "CustomSparkImage-1",
        ...
      }
      ...
    }

CDE CLI:

    === job create ===
    ./cde job create --type spark --name cde-job-docker --runtime-image-resource-name custom-dex-qe-1_1 --application-file /tmp/numpy_app.py --num-executors 1 --executor-memory 1G --driver-memory 1G --tls-insecure --user srv_dex_mc --vcluster-endpoint https://gbz7t69f.cde-vl4zqll4.dex-a58x.svbr-nqvp.int.cldr.work/dex/api/v1

7. Once the job is created, trigger it to run through the Web UI or by running the following command in the CLI:

    $> cde job run --name cde-job-docker

In conclusion

We introduced the “Custom Base Image” feature as part of our Design Partner Program to elicit feedback from our ISV partners. The response has been overwhelmingly positive, and building custom integrations with our cloud-native CDE offering has never been easier. As a partner, you can leverage Spark running on Kubernetes infrastructure for free. You can launch a trial of CDE on CDP in minutes here, giving you a hands-on introduction to data engineering innovations in the Public Cloud.

References: https://www.cloudera.com/tutorials/cdp-getting-started-with-cloudera-data-engineering.html

The post Cloudera Data Engineering – Integration steps to leverage Spark on Kubernetes appeared first on Cloudera Blog. View the full article
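To round out the API snippets above, here is a hedged Python sketch that issues the same POST /api/v1/resources and POST /api/v1/jobs calls with the requests library instead of the CDE CLI. The base URL and token handling are placeholders, the payloads simply mirror the (partially elided) JSON bodies shown above, and a real job definition would need the additional fields that the original post elides.

    # Hypothetical sketch: registering a custom runtime image resource and a job
    # through the CDE REST API shown above, using Python's requests library.
    # The endpoint base URL, token handling and payload values are placeholders.
    import requests

    VCLUSTER_ENDPOINT = "https://<your-virtual-cluster>/dex/api/v1"
    ACCESS_TOKEN = "<token obtained from your CDP authentication flow>"

    headers = {
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "Content-Type": "application/json",
    }

    # Step 5 equivalent: register the custom Spark runtime image as a resource.
    resource_payload = {
        "name": "custom-dex-spark-runtime",
        "type": "custom-spark-runtime-container-image",
        "engine": "spark2",
        "image": "<company-registry>/custom-dex-spark-runtime:<version>",
        "imagePullSecret": "<company-registry-basic-credentials>",
    }
    resp = requests.post(f"{VCLUSTER_ENDPOINT}/resources",
                         json=resource_payload, headers=headers)
    resp.raise_for_status()

    # Step 6 equivalent: define a Spark job referencing that resource.
    # A real payload needs the additional fields elided ("...") in the post above.
    job_payload = {
        "name": "spark-custom-image-job",
        "spark": {"imageResource": "custom-dex-spark-runtime"},
    }
    resp = requests.post(f"{VCLUSTER_ENDPOINT}/jobs",
                         json=job_payload, headers=headers)
    resp.raise_for_status()
    print("Job registered:", resp.status_code)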