Jump to content

Search the Community

Showing results for tags 'etl'.

  • Search By Tags

    Type tags separated by commas.
  • Search By Author

Content Type


Forums

  • General
    • General Discussion
    • Artificial Intelligence
    • DevOpsForum News
  • DevOps & SRE
    • DevOps & SRE General Discussion
    • Databases, Data Engineering & Data Science
    • Development & Programming
    • CI/CD, GitOps, Orchestration & Scheduling
    • Docker, Containers, Microservices, Serverless & Virtualization
    • Infrastructure-as-Code
    • Kubernetes & Container Orchestration
    • Linux
    • Logging, Monitoring & Observability
    • Security, Governance, Risk & Compliance
  • Cloud Providers
    • Amazon Web Services
    • Google Cloud Platform
    • Microsoft Azure

Find results in...

Find results that contain...


Date Created

  • Start

    End


Last Updated

  • Start

    End


Filter by number of...

Joined

  • Start

    End


Group


Website URL


LinkedIn Profile URL


About Me


Cloud Platforms


Cloud Experience


Development Experience


Current Role


Skills


Certifications


Favourite Tools


Interests

Found 20 results

  1. ETL is like doing a jigsaw puzzle with no picture on the box. You don’t know what the final result should look like, but you must make all the pieces fit together. Extract, transform, and load (ETL) processes play a critical role in any data-driven organization. It is likely that your ETL project will not […]View the full article
  2. A fundamental requirement for any data-driven organization is to have a streamlined data delivery mechanism. With organizations collecting data at a rate like never before, devising data pipelines for adequate flow of information for analytics and Machine Learning tasks becomes crucial for businesses. As organizations gather information from multiple sources and data can come in […]View the full article
  3. Databricks, AWS and Google Cloud are among the top ETL tools for seamless data integration, featuring AI, real-time processing and visual mapping to enhance business intelligence.View the full article
  4. ETL processes often involve aggregating data from various sources into a data warehouse or data lake. Bucketing can be used during the transformation phase to aggregate data into predefined buckets or intervals. For example, you might want to aggregate daily sales data into monthly buckets or hourly sensor readings into daily buckets. It plays a […]View the full article
  5. According to Expert Market Research, the global big data & Analytics market is expected to grow at a CAGR of 10%. This report also forecasts that the global investment in big data and analytics will reach $450 Billion by 2026. Hence, considering the massive scope of the analytics and the amount of data you deal […]View the full article
  6. Organizations are constantly on the lookout for simple solutions to integrate their company data from several sources into a centralized location, then analyze it to make informed decisions. This process is termed Data Integration. One of the most popular Data Integration techniques is ETL (Extract, Transform and Load). You will get to know more about […]View the full article
  7. Launching Reverse ETL and ETL - work with data to and from your warehouse and cloud sources with RudderStack. View the full article
  8. Learn about how configuring your warehouse as a data source can fully unlock your data’s value, how Reverse ETL works, and how to set it up in RudderStackView the full article
  9. In this update, we announce a beta version of Reverse ETL Mirror Sync Mode. Read the post for more information, and sign up to try Mirror Mode today.View the full article
  10. We've rebranded Warehouse Actions to Reverse ETL and are excited to announce many new features including a visual data mapper, custom SQL models, and more.View the full article
  11. When it comes to Reverse ETL, the business use cases usually get all the attention. Here, we focus on how it makes data engineering easier. View the full article
  12. Reverse ETL definition and guide. Learn everything you need to know about reverse-ETL, data activation and operational analytics. View the full article
  13. Explore data transformations, ETL processes and ETL tools in this introduction to ETL for data management professionals.View the full article
  14. ETL, or Extract, Transform, Load, serves as the backbone for data-driven decision-making in today's rapidly evolving business landscape. However, traditional ETL processes often suffer from challenges like high operational costs, error-prone execution, and difficulty scaling. Enter automation—a strategy not merely as a facilitator but a necessity to alleviate these burdens. So, let's dive into the transformative impact of automating ETL workflows, the tools that make it possible, and methodologies that ensure robustness. The Evolution of ETL Gone are the days when ETL processes were relegated to batch jobs that ran in isolation, churning through records in an overnight slog. The advent of big data and real-time analytics has fundamentally altered the expectations from ETL processes. As Doug Cutting, the co-creator of Hadoop, aptly said, "The world is one big data problem." This statement resonates more than ever as we are bombarded with diverse, voluminous, and fast-moving data from myriad sources. View the full article
  15. Get insights into the day-to-day challenges of builders. In this issue, Peter Reitz from our partner tecRacer talks about how to build Serverless ETL (extract, transform and load) pipelines with the help of Amazon Managed Workflows for Apache Airflow (MWAA) and Amazon Athena. /images/2022/06/diary.jpg If you prefer a video or podcast instead of reading, here you go. JavaScript is disabled. Please visit YouTube.com to watch the video. Do you prefer listening to a podcast episode over reading a blog post? Here you go! What sparked your interest in cloud computing? Computers have always held a great fascination for me. I taught myself how to program. That’s how I ended up working as a web developer during my economics studies. When I first stumbled upon Amazon Web Services, I was intrigued by the technology and wide variety of services. How did you grow into the role of a cloud consultant? After completing my economics degree, I was looking for a job. By chance, a job ad drew my attention to a vacancy for a cloud consultant at tecRacer. To be honest, my skills didn’t match the requirements very well. But because I found the topic exciting, I applied anyway. Right from the job interview, I felt right at home at tecRacer in Duisburg. Since I had no experience with AWS, there was a lot to learn within the first months. My first goal was to achieve the AWS Certified Solutions Architect - Associate certification. The entire team supported and motivated me during this intensive learning phase. After that, I joined a small team working on a project for one of our consulting clients. This allowed me to gain practical experience at a very early stage. What does your day-to-day work as a cloud consultant at tecRacer look like? As a cloud consultant, I work on projects for our clients. I specialize in machine learning and data analytics. Since tecRacer has a 100% focus on AWS, I invest in my knowledge of related AWS services like S3, Athena, EMR, SageMaker, and more. I work remotely or at our office in Hamburg and am at the customer’s site every now and then. For example, to analyze the requirements for a project in workshops. What project are you currently working on? I’m currently working on building an ETL pipeline. Several data providers upload CSV files to an S3 bucket. My client’s challenge is to extract and transform 3 billion data points and store them in a way that allows efficient data analytics. This process can be roughly described as follows. Fetch CSV files from S3. Parse CSV files. Filter, transform, and enrich columns. Partition data to enable efficient queries in the future. Transform to a file format optimized for data analytics. Upload data to S3. I’ve been implementing similar data pipelines in the past. My preferred solution consists of the following building blocks: S3 storing the input data. Apache Airflow to orchestrate the ETL pipeline. Athena to extract, transform, and load the data. S3 storing the output data. /images/2022/09/serverless-etl-airflow-athena.png How do you build an ETL pipeline based on Athena? Amazon Athena enables me to query data stored on S3 on-demand using SQL. The remarkable thing about Athena is that the service is serverless, which means we only have to pay for the processed data when running a query. There are no idle costs except the S3 storage costs. As mentioned before, in my current project, the challenge is to extract data from CSV files and store the data in a way that is optimized for data analytics. My approach is transforming the CSV files into more efficient formats such as Parquet. The Parquet file format is designed for efficient data analysis and organizes data in rows, not columns, as CSV does. Therefore, Athena skips fetching and processing all other columns when querying only a subset of the available columns. Also, Parquet compresses the data to minimize storage and network consumption. I like using Athena for ETL jobs because of its simplicity and pay-per-use pricing mode. The CREATE TABLE AS SELECT (CTAS) statement implements ETL as described in the following: Extract: Load data from CSV files stored on S3 (SELECT FROM "awsmp"."cas_daily_business_usage_by_instance_type") Transform: Filter and enrich columns. (SELECT product_code, SUM(estimated_revenue) AS revenue, concat(year, '-', month, '-01') as date) Load: Store results in Parquet file format on S3 (CREATE TABLE monthly_recurring_revenue). CREATE TABLE monthly_recurring_revenue WITH ( format = 'Parquet', external_location = 's3://demo-datalake/monthly_recurring_revenue/', partitioned_by = ARRAY['date'] ) AS SELECT product_code, SUM(estimated_revenue) AS revenue, concat(year, '-', month, '-01') as date FROM ( SELECT year, month, day, product_code, estimated_revenue FROM "awsmp"."cas_daily_business_usage_by_instance_type" ORDER BY year, month, day ) GROUP BY year, month, product_code ORDER BY year, month, product_code Besides converting the data into the Parquet file format, the statement also partitions the data. This means the keys of the objects start with something like date=2022-08-01, which allows Athena to only fetch relevant files from S3 when querying by date. Why did you choose Athena instead of Amazon EMR to build an ETL pipeline? I’ve been using EMR for some projects in the past. But nowadays, I prefer adding Athena to the mix, wherever feasible. That’s because compared to EMR, Athena is a lightweight solution. Using Athena is less complex than running jobs on EMR. For example, I prefer using SQL to transform data instead of writing Python code. It takes me less time to build an ETL pipeline with Athena compared to EMR. Also, accessing Athena is much more convenient as all functionality is available via the AWS Management Console and API. In contrast, it requires a VPN connection to interact efficiently with EMR when developing a pipeline. I prefer a serverless solution due to its cost implications. With Athena, our customer only pays for the processed data. There are no idling costs. As an example, I migrated a workload from EMR to Athena, which reduced costs from $3,000 to $100. What is Apache Airflow? Apache Airflow is a popular open-source project providing a workflow management platform for data engineering pipelines. As a data engineer, I describe an ETL pipeline in Python as a directed acyclic graph (DAG). Here is a straightforward directed acyclic graph (DAG). The workflow consists of two steps: Creating an Athena query. Awaiting results from the Athena query. from airflow.models import DAG from airflow.providers.amazon.aws.operators.athena import AWSAthenaOperator from airflow.providers.amazon.aws.sensors.athena import AthenaSensor with DAG(dag_id='demo') as dag: read_table = AWSAthenaOperator( task_id='read_table', query='SELECT * FROM "sampledb"."elb_logs" limit 10;', output_location='s3://aws-athena-query-results-486555357186-eu-west-1/airflow/', database='sampledb' ) await_query = AthenaSensor( task_id='await_query', query_execution_id=read_table.output, ) Airflow allows you to run a DAG manually, via an API, or based on a schedule. /images/2022/09/airflow-demo.png Airflow consists of multiple components: Scheduler Worker Web Server PostgreSQL database Redis in-memory database Operating such a distributed system is complex. Luckily, AWS provides a managed service called Amazon Managed Workflows for Apache Airflow (MWAA), which we use in my current project. What does your development workflow for Airflow DAGs look like? We built a deployment pipeline for the project I’m currently involved in. So you can think of developing the ETL pipeline like any other software delivery process. The engineer pushes changes of DAGs to a Git repository. The deployment pipeline validates the Python code. The deployment pipeline spins up a container based on aws-mwaa-local-runner and verifies whether all dependencies are working as expected. The deployment pipeline runs an integration test. The deployment pipeline uploads the DAGs to S3. Airflow refreshes the DAGs. The deployment pipeline significantly speeds up the ETL pipeline’s development process, as many issues are spotted before deploying to AWS. Why do you use Airflow instead of AWS Step Functions? In general, Airflow is similar to AWS Step Functions. However, there are two crucial differences. First, Airflow is a popular choice for building ETL pipelines. Therefore, many engineers in the field of data analytics have already gained experience with the tool. And besides, the open-source community creates many integrations that help build ETL pipelines. Second, unlike Step Functions, Airflow is not only available on AWS. Being able to move ETL pipelines to another cloud vendor or on-premises is a plus. Why do you specialize in machine learning and data analytics? I enjoy working with data. Being able to answer questions by analyzing huge amounts of data and enabling better decisions backed by data motivates me. Also, I’m a huge fan of Athena. It’s one of the most powerful services offered by AWS. On top of that, machine learning, in general, and reinforced learning, in particular, fascinates me, as it allows us to recognize correlations that were not visible before. Would you like to join Peter’s team to implement solutions with the help of machine learning and data analytics? tecRacer is hiring Cloud Consultants focusing on machine learning and data analytics. Apply now! View the full article
  16. Auto Scaling in AWS Glue Streaming ETL is now generally available. AWS Glue Streaming ETL jobs can now dynamically scale resources up and down based on the input stream. Auto Scaling helps customers reduce the cost and manual effort required to optimize resources by allocating the right resources necessary for Streaming ETL jobs. View the full article
  17. AWS Glue streaming ETL (Extract Transform and Load) can now detect compressed data streaming from Amazon Kinesis, Amazon Managed Streaming for Apache Kafka (Amazon MSK), and self managed Apache Kafka. It can then automatically decompresses this data without customers having to write code, saving them development hours. AWS Glue Streaming ETL jobs continuously consume data from streaming sources, cleans and transforms the data in-flight, and makes it available for analysis in seconds. Customers compress data prior to streaming in-order to improve performance and to avoid throttling limits by Amazon Kinesis and Amazon MSK. Prior to this feature, customers had to write user defined functions to uncompress data from a stream, which is time consuming. View the full article
  18. Streaming extract, transform, and load (ETL) jobs in AWS Glue can now automatically detect the schema of incoming records and gracefully handle schema changes on a per-record basis. Previously, you needed to specify the schema of incoming data using the AWS Glue Data Catalog and update ETL scripts to handle schema changes. The AWS Glue job can now do both for you, saving time on reworking code and increasing the flexibility of your ETL jobs. View the full article
  19. Streaming extract, transform, and load (ETL) jobs in AWS Glue can now read data encoded in the Apache Avro format. Previously, streaming ETL jobs could read data in the JSON, CSV, Parquet, and XML formats. With the addition of Avro, streaming ETL jobs now support all the same formats as batch AWS Glue jobs. View the full article
  20. Whether you are building a data lake, a data analytics pipeline, or a simple data feed, you may have small volumes of data that need to be processed and refreshed regularly. This post shows how you can build and deploy a micro extract, transform, and load (ETL) pipeline to handle this requirement. In addition, you configure a reusable Python environment to build and deploy micro ETL pipelines using your source of data... View the full article
  • Forum Statistics

    43.2k
    Total Topics
    42.5k
    Total Posts
×
×
  • Create New...