Showing results for tags 'dbt'.

Found 5 results

  1. Data + AI Summit
    Experience everything that Summit has to offer. Attend all the parties, build your session schedule, enjoy the keynotes, and then watch it all again on demand:
      • Expo access to 150+ partners and hundreds of Databricks experts
      • 500+ breakout sessions and keynotes
      • 20+ hands-on trainings
      • Four days of food and beverage
      • Networking events and parties
      • On-demand session streaming after the event
    Join leading experts, researchers, and open source contributors, from Databricks and across the data and AI community, who will speak at Data + AI Summit, with over 500 sessions covering everything from data warehousing and governance to the latest in generative AI. Join thousands of data leaders, engineers, scientists, and architects to explore the convergence of data and AI, and the latest advances in Apache Spark™, Delta Lake, MLflow, PyTorch, dbt, Presto/Trino, and much more. You'll also get a first look at new products and features in the Databricks Data Intelligence Platform. Connect with thousands of data and AI community peers and grow your professional network in social meetups, on the Expo floor, or at our event party.
    Register: https://dataaisummit.databricks.com/flow/db/dais2024/landing/page/home
    Further details: https://www.databricks.com/dataaisummit/
  2. An overview of the architecture, data, and modeling you need to assess contribution to conversion in multi-touch customer journeys. View the full article
  3. A Complete Guide to Effectively Scale Your Data Pipelines and Data Products with Contract Testing and dbt. View the full article
  4. We ran a $12K experiment to test the cost and performance of Serverless warehouses and dbt concurrent threads, and obtained unexpected results. By: Jeff Chou, Stewart Bryson. Image by Los Muertos Crew.

Databricks' SQL warehouse products are a compelling offering for companies looking to streamline their production SQL queries and warehouses. However, as usage scales up, the cost and performance of these systems become crucial to analyze. In this blog we take a technical deep dive into the cost and performance of their serverless SQL warehouse product, using the industry-standard TPC-DI benchmark. We hope data engineers and data platform managers can use the results presented here to make better decisions about their data infrastructure.

What are Databricks' SQL warehouse offerings?

Before we dive into a specific product, let's take a step back and look at the options available today. Databricks currently offers three warehouse options:

  • SQL Classic: the most basic warehouse; runs inside the customer's cloud environment
  • SQL Pro: improved performance and good for exploratory data science; runs inside the customer's cloud environment
  • SQL Serverless: "best" performance, with the compute fully managed by Databricks

From a cost perspective, both Classic and Pro run inside the user's cloud environment. This means you get two bills for your Databricks usage: one for the pure Databricks cost (DBUs) and one from your cloud provider (e.g., the AWS EC2 bill). To really understand the cost comparison, let's look at an example cost breakdown of running on a Small warehouse, based on the reported instance types:

Cost comparison of jobs compute and the various SQL warehouse options. Prices shown are on-demand list prices; spot prices vary and were chosen based on prices at the time of this publication. Image by author.

In the table above, we compare on-demand vs. spot costs as well.
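As a back-of-the-envelope illustration of the two-bill structure described above, here is a minimal sketch of the total hourly cost calculation. All rates and DBU counts below are placeholders, not the actual table values, which depend on region, contract, and instance choice:

```python
def hourly_cost(dbu_per_hour, dbu_rate, cloud_per_hour=0.0):
    """Total hourly cost = Databricks bill (DBUs x rate) + cloud provider bill.

    For SQL Serverless the cloud component is zero, because the compute
    is fully managed by Databricks. All numbers here are illustrative
    placeholders, not quoted prices.
    """
    return dbu_per_hour * dbu_rate + cloud_per_hour

# Hypothetical Small-warehouse comparison (placeholder numbers):
classic = hourly_cost(dbu_per_hour=12, dbu_rate=0.22, cloud_per_hour=2.0)  # two bills
serverless = hourly_cost(dbu_per_hour=12, dbu_rate=0.70)  # one all-in bill
```

The point of the model is simply that Classic/Pro costs are split across two invoices, while the serverless rate already folds the cloud component in.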
You can see from the table that the serverless option has no cloud component, because it's all managed by Databricks. Serverless could be cost effective compared to Pro if you were using all on-demand instances, but if cheap spot nodes are available, Pro may be cheaper. Overall, the pricing for serverless is fairly reasonable in my opinion, since it also includes the cloud costs, although it's still a "premium" price. We also included the equivalent jobs compute cluster, which is the cheapest option across the board. If cost is a concern to you, you can run SQL queries in jobs compute as well!

Pros and cons of Serverless

The Databricks serverless option is a fully managed compute platform, pretty much identical to how Snowflake runs, where all of the compute details are hidden from users. At a high level there are pros and cons:

Pros:
  • You don't have to think about instances or configurations
  • Spin-up time is much shorter than starting a cluster from scratch (5-10 seconds from our observations)

Cons:
  • Enterprises may have security concerns with all of the compute running inside Databricks
  • Enterprises may not be able to leverage their cloud contracts, which may include special discounts on specific instances
  • There is no ability to optimize the cluster, so you don't know whether the instances and configurations Databricks picks are actually good for your job
  • The compute is a black box: users have no idea what is going on or what changes Databricks is implementing under the hood, which may make stability an issue

Because of the inherent black-box nature of serverless, we were curious to explore the various tunable parameters people do still have and their impact on performance. So let's dive into what we explored:

Experiment setup

We tried to take a "practical" approach to this study, and simulate what a real company might do when they want to run a SQL warehouse.
Since dbt is such a popular tool in the modern data stack, we decided to sweep and evaluate two parameters:

  • Warehouse size: 2X-Small, X-Small, Small, Medium, Large, X-Large, 2X-Large, 3X-Large, 4X-Large
  • dbt threads: 4, 8, 16, 24, 32, 40, 48

We picked these two because they are both "universal" tuning parameters for any workload, and they both impact the compute side of the job. dbt threads in particular effectively tune the parallelism of your job as it runs through your DAG.

The workload we selected is the popular TPC-DI benchmark, with a scale factor of 1000. This workload is particularly interesting because it's actually an entire pipeline, which mimics real-world data workloads more closely. For example, our dbt DAG, shown below, is quite complicated, and changing the number of dbt threads could have an impact here.

dbt DAG from our TPC-DI benchmark. Image by author.

As a side note, Databricks has a fantastic open source repo that helps quickly set up the TPC-DI benchmark within Databricks entirely. (We did not use this, since we are running with dbt.)

To get into the weeds of how we ran the experiment: we used Databricks Workflows with a Task Type of dbt as the "runner" for the dbt CLI, and all the jobs were executed concurrently, so there should be no variance due to unknown environmental conditions on the Databricks side. Each job spun up a new SQL warehouse and tore it down afterwards, running in a unique schema in the same Unity Catalog. We used the Elementary dbt package to collect the execution results and ran a Python notebook at the end of each run to collect those metrics into a centralized schema. Costs were extracted via Databricks System Tables, specifically the Billable Usage tables. Try this experiment yourself by cloning the GitHub repo.

Results

Below are the cost and runtime vs. warehouse size graphs.
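For a sense of scale: the sweep described in the setup covers 9 warehouse sizes times 7 thread counts, 63 distinct runs in total. A minimal sketch of enumerating that grid (illustrative only; the authors' actual harness used Databricks Workflows, not this code):

```python
from itertools import product

# The two "universal" tuning parameters swept in the experiment.
WAREHOUSE_SIZES = ["2X-Small", "X-Small", "Small", "Medium", "Large",
                   "X-Large", "2X-Large", "3X-Large", "4X-Large"]
DBT_THREADS = [4, 8, 16, 24, 32, 40, 48]

def sweep_grid():
    """Enumerate every (warehouse size, dbt thread count) combination."""
    return [
        {"warehouse_size": size, "threads": threads}
        for size, threads in product(WAREHOUSE_SIZES, DBT_THREADS)
    ]

configs = sweep_grid()
print(len(configs))  # 63 runs in total
```

Each of these configurations maps to one concurrent Workflows job in the experiment, which is why the cost and runtime scatter plots below contain so many points.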
We can see below that the runtime stops scaling once you reach the medium-sized warehouse. Anything larger than a Medium had essentially no impact on runtime (or was perhaps worse). This is a typical scaling trend: cluster scaling is not infinite, and there is always a point at which adding more compute yields diminishing returns. For the CS enthusiasts out there, this is just the fundamental principle of Amdahl's law.

One unusual observation is that the Medium warehouse outperformed the next three sizes up (Large to 2X-Large). We repeated this particular data point a few times and obtained consistent results, so it is not a strange fluke. Because of the black-box nature of serverless, we unfortunately don't know what's going on under the hood and are unable to give an explanation.

Runtime in minutes across warehouse sizes. Image by author.

Because scaling stops at Medium, we can see in the cost graph below that costs start to skyrocket after the Medium warehouse size: you are throwing more expensive machines at the job while the runtime remains constant, so you are paying for extra horsepower with zero benefit.

Cost in $ across warehouse sizes. Image by author.

The graph below shows the relative change in runtime as we change the number of threads and warehouse size. For values above the zero horizontal line, the runtime increased (a bad thing).

The percent change in runtime as threads increase.
Image by author.

The data here is a bit noisy, but there are some interesting insights based on the size of the warehouse:

  • 2X-Small: increasing the number of threads usually made the job run longer.
  • X-Small to Large: increasing the number of threads usually made the job run about 10% faster, although the gains were fairly flat, so continuing to increase the thread count had no value.
  • 2X-Large: there was an actual optimal number of threads, 24, as seen in the clear parabolic curve.
  • 3X-Large: there was a very unusual spike in runtime at a thread count of 8. Why? No clue.

To put everything together, the plot below shows cost vs. duration of the total job. The different colors represent the different warehouse sizes, and the size of each bubble is the number of dbt threads.

Cost vs. duration of the jobs. Bubble size represents the number of threads. Image by author.

In the plot above we see the typical trend that larger warehouses generally lead to shorter durations but higher costs. However, we do spot a few unusual points:

  • Medium is the best: from a pure cost and runtime perspective, Medium is the best warehouse to choose.
  • Impact of dbt threads: for the smaller warehouses, changing the number of threads changed the duration by roughly +/- 10% but did not change the cost much. For larger warehouses, the number of threads impacted both cost and runtime quite significantly.

Conclusion

In summary, our top five lessons learned about Databricks SQL Serverless + dbt are:

  • Rules of thumb are bad: we cannot simply rely on "rules of thumb" about warehouse size or the number of dbt threads.
Some expected trends do exist, but they are not consistent or predictable; the outcome is entirely dependent on your workload and data.
  • Huge variance: for the exact same workload, costs ranged from $5 to $45 and runtimes from 2 minutes to 90 minutes, all due to different combinations of thread count and warehouse size.
  • Serverless scaling has limits: serverless warehouses do not scale infinitely; eventually larger warehouses cease to provide any speedup and only cause increased costs with no benefit.
  • Medium is great(?): we found the Medium serverless SQL warehouse outperformed many of the larger warehouse sizes on both cost and job duration for the TPC-DI benchmark. We have no clue why.
  • Jobs clusters may be cheapest: if costs are a concern, switching to standard jobs compute with notebooks may be substantially cheaper.

The results reported here reveal that black-box "serverless" systems can produce some unusual performance anomalies. Since it's all behind Databricks' walls, we have no idea what is happening. Perhaps it's all running on giant Spark-on-Kubernetes clusters; maybe they have special deals with Amazon on certain instances. Either way, the unpredictable nature makes controlling cost and performance tricky. Because each workload is unique across so many dimensions, we can't rely on "rules of thumb", or on costly experiments that are only true for a workload in its current state. The chaotic nature of serverless systems raises the question of whether they need a closed-loop control system to keep them in check.

As an introspective note, the business model of serverless is truly compelling.
Assuming Databricks is a rational business that does not want to decrease its revenue but does want to lower its costs, one must ask: "Is Databricks incentivized to improve the compute under the hood?" The problem is this: if they make serverless 2x faster, then all of a sudden their revenue from serverless drops by 50%, which is a very bad day for Databricks. If they could make it 2x faster and then increase the DBU price by 2x to counteract the speedup, they would remain revenue neutral (this is actually what they did for Photon). So Databricks is really incentivized to decrease its internal costs while keeping customer runtimes about the same. While this is great for Databricks, it's difficult to pass any serverless acceleration technology on to the user in the form of a cost reduction.

Interested in learning more about how to improve your Databricks pipelines? Reach out to Jeff Chou and the rest of the Sync Team.

Resources
  • Try this experiment yourself and clone the GitHub repo

Related content
  • Why Your Data Pipelines Need Closed-Loop Feedback Control
  • Are Databricks clusters with Photon and Graviton instances worth it?
  • Is Databricks's autoscaling cost efficient?
  • Introducing Gradient — Databricks optimization made easy

"5 Lessons Learned from Testing Databricks SQL Serverless + DBT" was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story. View the full article
  5. Introduction

dbt allows data teams to produce trusted data sets for reporting, ML modeling, and operational workflows using SQL, with a simple workflow that follows software engineering best practices like modularity, portability, and continuous integration/continuous delivery (CI/CD). We're excited to announce the general availability of the open source dbt adapters for all the engines in CDP: Apache Hive, Apache Impala, and Apache Spark, with added support for Apache Livy and Cloudera Data Engineering. Using these adapters, Cloudera customers can use dbt to collaborate, test, deploy, and document their data transformation and analytic pipelines on CDP Public Cloud, CDP One, and CDP Private Cloud.

Cloudera's mission, values, and culture have long centered on using open source engines on open data and table formats to enable customers to build flexible and open data lakes. Recently, we became the first and only open data lakehouse with support for multiple engines on the same data, with the general availability of Apache Iceberg in Cloudera Data Platform (CDP). To make it easy to start using dbt on CDP, we've packaged our open source adapters and dbt Core in a fully tested and certified downloadable package. We've also made it simple to integrate dbt seamlessly with CDP's governance, security, and SDX capabilities. With this announcement, we welcome our customers' data teams to streamline data transformation pipelines in their open data lakehouse, using any engine on top of data in any format and in any form factor, and to deliver high-quality data that their business can trust.

The open data lakehouse

In an organization with multiple teams and business units, there is a variety of data stacks, with tools and query engines based on the preferences and requirements of different users.
When different use cases require different query engines on the same data, complicated data replication mechanisms need to be set up and maintained so that data is consistently available to different teams. A key aspect of an open lakehouse is giving data teams the freedom to use multiple engines over the same data, eliminating the need for data replication across use cases.

However, different teams and business units have different processes for building and managing their data transformation and analytics pipelines. This variety can result in a lack of standardization, leading to data duplication and inconsistency. That's why there is a growing need for a central, transparent, version-controlled repository with a consistent software development lifecycle (SDLC) experience for data transformation pipelines across data teams, business functions, and engines. Streamlining the SDLC has been shown to speed up the delivery of data projects and to increase transparency and auditability, leading to a more trusted, data-driven organization.

Cloudera builds dbt adapters for all engines in the open data lakehouse

dbt offers this consistent SDLC experience for data transformation pipelines and, in doing so, has become widely adopted at companies large and small. Anyone who knows SQL can now build production-grade pipelines with ease.

Figure 1. dbt used in transformation pipelines on data warehouses (image source: https://github.com/dbt-labs/dbt-core)

To date, dbt has only been available on proprietary cloud data warehouses, with very little interoperability between engines. For example, transformations performed in one engine are not visible to other engines, because there was no common storage or metadata store. Cloudera has built dbt adapters for all of the engines in the open data lakehouse.
Companies can now use dbt Core to consolidate all of their transformation pipelines across different engines into a single version-controlled repository with a consistent SDLC across teams. Cloudera also makes it easy to deploy dbt as a packaged application running within CDP, using Cloudera Machine Learning or Cloudera Data Science Workbench, which gives customers a consistent experience whether they use CDP on premises or in the cloud. In addition, because dbt simply submits queries to the underlying engines in CDP, customers get the full governance capabilities provided by SDX, such as automatic lineage capture, auditing, and impact analysis.

The combination of Cloudera's open data lakehouse and dbt supercharges the ability of data teams to collaboratively build, test, document, and deploy data transformation pipelines using any engine and in any form factor. The packaged offering within CDP and the integration with SDX provide the critical security and governance guarantees that Cloudera customers rely on.

Figure 2. dbt end-to-end SDLC on the CDP open lakehouse

How to get started with dbt within CDP

The dbt integration with CDP is brought to you by Cloudera's Innovation Accelerator, a cross-functional team that identifies new industry trends and creates new products and partnerships that dramatically improve the lives of our customers' data practitioners. To find out more, here is a selection of links for getting started:

  • Repository of the latest Python packages and Docker images with dbt and all the Cloudera-supported adapters
  • Handbooks to run dbt as a packaged application in CDP:
    • CDP Public Cloud via Cloudera Machine Learning
    • CDP Private Cloud via Cloudera Data Science Workbench
  • Getting-started guides for the open source adapters supported by Cloudera: dbt-impala, dbt-hive, dbt-spark-livy, dbt-spark-cde

To learn more, contact us at innovation-feedback@cloudera.com.
The post Cloudera's Open Data Lakehouse Supercharged with dbt Core™ appeared first on Cloudera Blog. View the full article