Search the Community
Showing results for tags 'cloudera'.
-
Introduction

dbt allows data teams to produce trusted data sets for reporting, ML modeling, and operational workflows using SQL, with a simple workflow that follows software engineering best practices like modularity, portability, and continuous integration/continuous delivery (CI/CD). We're excited to announce the general availability of the open source adapters for dbt for all the engines in CDP—Apache Hive, Apache Impala, and Apache Spark, with added support for Apache Livy and Cloudera Data Engineering. Using these adapters, Cloudera customers can use dbt to collaborate, test, deploy, and document their data transformation and analytic pipelines on CDP Public Cloud, CDP One, and CDP Private Cloud.

Cloudera's mission, values, and culture have long centered on using open source engines on open data and table formats to enable customers to build flexible and open data lakes. Recently, we became the first and only open data lakehouse with support for multiple engines on the same data, with the general availability of Apache Iceberg in Cloudera Data Platform (CDP). To make it easy to start using dbt on CDP, we've packaged our open source adapters and dbt Core in a fully tested and certified downloadable package. We've also made it simple to integrate dbt seamlessly with CDP's governance, security, and SDX capabilities. With this announcement, we welcome our customers' data teams to streamline data transformation pipelines in their open data lakehouse, using any engine on top of data in any format and in any form factor, and to deliver high-quality data that their business can trust.

The Open Data Lakehouse

In an organization with multiple teams and business units, there is a variety of data stacks with tools and query engines based on the preferences and requirements of different users. When different use cases require different query engines on the same data, complicated data replication mechanisms need to be set up and maintained so that data is consistently available to different teams. A key aspect of an open lakehouse is giving data teams the freedom to use multiple engines over the same data, eliminating the need for data replication for different use cases.

However, different teams and business units have different processes for building and managing their data transformation and analytics pipelines. This variety can result in a lack of standardization, leading to data duplication and inconsistency. That's why there is a growing need for a central, transparent, version-controlled repository with a consistent Software Development Lifecycle (SDLC) experience for data transformation pipelines across data teams, business functions, and engines. Streamlining the SDLC has been shown to speed up the delivery of data projects and increase transparency and auditability, leading to a more trusted, data-driven organization.

Cloudera builds dbt adapters for all engines in the open data lakehouse

dbt offers this consistent SDLC experience for data transformation pipelines and, in doing so, has become widely adopted in companies large and small. Anyone who knows SQL can now build production-grade pipelines with ease.

Figure 1. dbt used in transformation pipelines on data warehouses (Image source: https://github.com/dbt-labs/dbt-core)

Until now, dbt was only available on proprietary cloud data warehouses, with very little interoperability between different engines.
For example, transformations performed in one engine were not visible to other engines, because there was no common storage or metadata store. Cloudera has built dbt adapters for all of the engines in the open data lakehouse. Companies can now use dbt Core to consolidate all of their transformation pipelines across different engines into a single version-controlled repository, with a consistent SDLC across teams.

Cloudera also makes it easy to deploy dbt as a packaged application running within CDP, using Cloudera Machine Learning and Cloudera Data Science Workbench. This gives customers a consistent experience whether they run CDP on premises or in the cloud. In addition, because dbt simply submits queries to the underlying engines in CDP, customers get the full governance capabilities provided by SDX, such as automatic lineage capture, auditing, and impact analysis.

The combination of Cloudera's open data lakehouse and dbt supercharges the ability of data teams to collaboratively build, test, document, and deploy data transformation pipelines using any engine and in any form factor. The packaged offering within CDP and the integration with SDX provide the critical security and governance guarantees that Cloudera customers rely on.

Figure 2. dbt end-to-end SDLC on CDP Open Lakehouse

How to get started with dbt within CDP

The dbt integration with CDP is brought to you by Cloudera's Innovation Accelerator, a cross-functional team that identifies new industry trends and creates new products and partnerships that dramatically improve the lives of our Cloudera customers' data practitioners. To find out more, here is a selection of links for getting started:

Repository of the latest Python packages and Docker images with dbt and all the Cloudera-supported adapters
Handbooks to run dbt as a packaged application in CDP: CDP Public Cloud via Cloudera Machine Learning, and CDP Private Cloud via Cloudera Data Science Workbench
Getting started guides for the open source adapters supported by Cloudera: dbt-impala, dbt-hive, dbt-spark-livy, dbt-spark-cde

To learn more, contact us at innovation-feedback@cloudera.com.

The post Cloudera's Open Data Lakehouse Supercharged with dbt Core(tm) appeared first on Cloudera Blog. View the full article
-
In this article, we described the step-by-step process to install Cloudera Manager as per industry practices. In Part 2, we already went through the Cloudera prerequisites; make sure all the servers ... The post How to Install and Configure Cloudera Manager on CentOS/RHEL 7 - Part 3 first appeared on Tecmint: Linux Howtos, Tutorials & Guides. View the full article
-
Airflow has been adopted by many Cloudera Data Platform (CDP) customers in the public cloud as the next-generation orchestration service to set up and operationalize complex data pipelines. Today, customers have deployed hundreds of Airflow DAGs in production performing various data transformation and preparation tasks, with differing levels of complexity. Combined with Cloudera Data Engineering's (CDE) first-class job management APIs and centralized monitoring, this is delivering new value for modernizing enterprises. As we mentioned before, instead of relying on one custom monolithic process, customers can develop modular data transformation steps that are more reusable and easier to debug, which can then be orchestrated with gluing logic at the level of the pipeline.

That's why we are excited to announce the next evolutionary step on this modernization journey, lowering the barrier even further for data practitioners looking for flexible pipeline orchestration: introducing CDE's completely new pipeline authoring UI for Airflow.

Until now, the setup of such pipelines still required knowledge of Airflow and the associated Python configurations. This presented challenges for users building the more complex multi-step pipelines that are typical of DE workflows. We wanted to hide those complexities from users, making multi-step pipeline development as self-service as possible and providing an easier path to developing, deploying, and operationalizing true end-to-end data pipelines.

Easing development friction

We started out by interviewing customers to understand where the most friction exists in their pipeline development workflows today. In the process, several key themes emerged:

Low/no-code: By far the biggest barrier for new users is creating custom Airflow DAGs. Writing code is error prone and requires trial and error. Any way to minimize coding and manual configuration dramatically streamlines the development process.

Long tail of operators: Although Airflow offers hundreds of operators, users tend to use only a subset of them. Making the most commonly used operators as readily available as possible is critical to reducing development friction.

Templates: Airflow DAGs are a great way to isolate pipelines and monitor them independently, making them more operationally friendly for DE teams. But when we looked across Airflow DAGs we often noticed similar patterns, where the majority of the operations were identical except for a series of configurations like table names and directories; the 80/20 rule was clearly at play.

This laid the foundation for some of the key design principles we applied to our authoring experience.

Pipeline Authoring UI for Airflow

With the CDE pipeline authoring UI, any CDE user, irrespective of their level of Airflow expertise, can create multi-step pipelines with a combination of out-of-the-box operators (CDEOperator, CDWOperator, BashOperator, PythonOperator). More advanced users can still deploy their own custom Airflow DAGs as before, or use the pipeline authoring UI to bootstrap their projects for further customization (as we describe later, the pipeline engine generates Airflow code that can be used as a starting point for more complex scenarios; a sketch of what such generated code might look like follows below). And once the pipeline has been developed through the UI, users can deploy and manage these data pipeline jobs like other CDE applications through the API/CLI/UI.
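To make the idea of generated Airflow code concrete, here is a minimal, hypothetical sketch of the kind of DAG the pipeline engine might emit for a simple two-step pipeline (a CDE Spark step followed by a shell step). This is an illustration only: the UI's CDE step is assumed to map to the CDEJobRunOperator from Cloudera's Airflow provider (shown later on this page), and the DAG id, job name, and connection name are placeholders; the code CDE actually generates may differ.

# Hypothetical example of a generated two-step Airflow DAG.
# job_name and connection_id values are placeholders, not from this post.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator

with DAG(
    dag_id="example_generated_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:

    # Step 1: run a pre-defined Spark job in CDE.
    ingest = CDEJobRunOperator(
        task_id="ingest_data",
        job_name="ingest_data_spark",   # existing CDE Spark job (placeholder)
        connection_id="cde",            # CDE Jobs API connection (placeholder)
    )

    # Step 2: a simple shell step downstream of the Spark job.
    notify = BashOperator(
        task_id="notify",
        bash_command="echo 'pipeline finished'",
    )

    # Dependency created in the UI via click and drag.
    ingest >> notify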
Figure 1: "Editor" screen for authoring Airflow pipelines, with operators (left), canvas (middle), and context-sensitive configuration panel (right)

The "Editor" is where all the authoring operations take place — a central interface to quickly sequence together your pipelines. It was critical to make the interactions as intuitive as possible to avoid slowing down the flow of the user. The user is presented with a blank canvas and click & drop operators: a palette of the most commonly used operators on the left, and a context-sensitive configuration panel on the right. As the user drops new operators onto the canvas, they can specify dependencies through an intuitive click-and-drag interaction. Clicking an existing operator within the canvas brings it into focus, which triggers an update to the configuration panel on the right. Hovering over any operator highlights each side with four dots, inviting the user to use a click-and-drag action to create a connection with another operator.

Figure 2: Creating dependencies with a simple click & drag

Pipeline Engine

To make the authoring UI as flexible as possible, a translation engine was developed that sits between the user interface and the final Airflow job. Each "box" (step) on the canvas serves as a task in the final Airflow DAG. Multiple steps comprise the overall pipeline and are stored as pipeline definition files in the CDE resource of the job. This intermediate definition can easily be integrated with source code management, such as Git, as needed. When the pipeline is saved in the editor screen, a final translation is performed whereby the corresponding Airflow DAG is generated and loaded into the Airflow server. This makes our pipeline engine flexible enough to support a multitude of orchestration services. Today we support Airflow, but in the future it can be extended to meet other requirements. An additional benefit is that this can also serve to bootstrap more complex pipelines. The generated Airflow Python code can be modified by end users to accommodate custom configurations and then uploaded as a new job. This way users don't have to start from scratch, but rather build an outline of what they want to achieve, output the skeleton Python code, and then customize it.

Templatizing Airflow

Airflow provides a way to templatize pipelines, and with CDE we have integrated that with our APIs to allow job parameters to be pushed down to Airflow as part of the execution of the pipeline. A simple example would be parameterizing the SQL query within the CDW operator. Using the special syntax {{..}}, the developer can include placeholders for different parts of the query, for example the SELECT expression or the table referenced in the FROM clause.

SELECT {{ dag_run.conf['conf1'] }} FROM {{ dag_run.conf['conf2'] }} LIMIT 100

This can be entered through the configuration pane in the UI. Once the pipeline is saved and the Airflow job generated, it can be programmatically triggered through the CDE CLI/API with the configuration override options:

$ cde job run --config conf1='column1, sum(1)' --config conf2='default.txn' --name example_airflow_job

The same Airflow job can now be used to generate different SQL reports.

Looking forward

With early design partners, we already have enhancements in the works to continue improving the experience. Some of them include:

More operators – as we mentioned earlier, there is a small set of highly used operators. We want to ensure these most commonly used ones are easily accessible to the user.
Additionally, the introduction of more CDP operators that integrate with CML (machine learning) and COD (operational database) is critical for a complete end-to-end orchestration service.

UI improvements to make the experience even smoother. These span common usability improvements like pan and zoom and undo/redo operations, as well as a mechanism to add comments to make more complex pipelines easier to follow.

Auto-discovery can be powerful when applied to help autocomplete various configurations, such as referencing a pre-defined Spark job for the CDE task or the Hive virtual warehouse endpoint for the CDW query task.

Ready-to-use pipelines – although parameterized Airflow jobs are a great way to develop reusable pipelines, we want to make this even easier to specify through the UI. There are also opportunities for us to provide ready-to-use pipeline definitions that capture very common patterns, such as detecting files in an S3 bucket, running data transformation with Spark, and performing data mart creation with Hive.

With this Technical Preview release, any CDE customer can test drive the new authoring interface by setting up the latest CDE service. When creating a Virtual Cluster, a new option allows enabling the Airflow authoring UI. Stay tuned for more developments in the coming months, and until then, happy pipeline building!

The post Introducing Self-Service, No-Code Airflow Authoring UI in Cloudera Data Engineering appeared first on Cloudera Blog. View the full article
-
Many customers looking to modernize their pipeline orchestration have turned to Apache Airflow, a flexible and scalable workflow manager for data engineers. With hundreds of open source operators, Airflow makes it easy to deploy pipelines in the cloud and interact with a multitude of services on premises, in the cloud, and across cloud providers for a true hybrid architecture.

Apache Airflow providers are a set of packages that allow services to define operators in their Directed Acyclic Graphs (DAGs) to access external systems. A provider can be used to make HTTP requests, connect to an RDBMS, check file systems (such as S3 object storage), invoke cloud provider services, and much more. They were already part of Airflow 1.x, but starting with Airflow 2.x they are separate Python packages maintained by each service provider, allowing more flexibility in Airflow releases. Using provider operators that are tested by a community of users reduces the overhead of writing and maintaining custom code in bash or Python, and simplifies the DAG configuration as well. Airflow users can avoid writing custom code to connect to a new system and simply use the off-the-shelf providers.

Until now, customers managing their own Apache Airflow deployment who wanted to use Cloudera Data Platform (CDP) data services like Data Engineering (CDE) and Data Warehousing (CDW) had to build their own integrations. Users either needed to install and configure a CLI binary and install credentials locally in each Airflow worker, or had to add custom code to retrieve the API tokens and make REST calls from Python with the correct configurations. This has now become very simple and secure with our release of the Cloudera Airflow provider, which gives users the best of Airflow and CDP data services. This blog post describes how to install and configure the Cloudera Airflow provider in under 5 minutes and start creating pipelines that tap into the auto-scaling Spark service in CDE and the Hive service in CDW in the public cloud.

Step 0: Skip if you already have Airflow

We assume that you already have an Airflow instance up and running. For those who do not, or who want a local development installation, here is a basic setup of Airflow 2.x to run a proof of concept:

# we use this version in our example but any version should work
pip install apache-airflow[http,hive]==2.1.2
airflow db init
airflow users create \
  --username admin \
  --firstname Cloud \
  --lastname Era \
  --password admin \
  --role Admin \
  --email airflow@cloudera.com

Step 1: Cloudera Provider Setup (1 minute)

Installing the Cloudera Airflow provider is a matter of running a pip command and restarting your Airflow service:

# install the Cloudera Airflow provider
pip install cloudera-airflow-provider
# Start/Restart Airflow components
airflow scheduler & airflow webserver

Step 2: CDP Access Setup (1 minute)

If you already have a CDP access key, you can skip this section. If not, as a first step you will need to create one on the Cloudera Management Console. It is pretty simple to create: click your "Profile" in the pane on the left-hand side of the CDP Management Console, which brings you to your profile page, directly on the "Access Keys" tab. Then click "Generate Access Key" (also available from the pop-up menu) and it will generate the key pair. Do not forget to copy the private key or to download the credentials file. As a side note, these same credentials can be used when running the CDE CLI.
Step 3: Airflow Connection Setup (1 minute)

To be able to talk to CDP data services, you need to set up connectivity for the operators to use. This follows a similar pattern to other providers: setting up a connection within the Admin page. CDE provides a managed Spark service that can be accessed via a simple REST endpoint in a CDE Virtual Cluster, called the Jobs API (learn how to set up a Virtual Cluster here). Set up a connection to a CDE Jobs API in your Airflow as follows:

# Create connection from the CLI (can also be done from the UI):
# Airflow 2.x:
airflow connections add 'cde' \
  --conn-type 'cloudera_data_engineering' \
  --conn-host '<CDE_JOBS_API_ENDPOINT>' \
  --conn-login "<ACCESS_KEY>" \
  --conn-password "<PRIVATE_KEY>"
# Airflow 1.x:
airflow connections add 'cde' \
  --conn-type 'http' \
  --conn-host '<CDE_JOBS_API_ENDPOINT>' \
  --conn-login "<ACCESS_KEY>" \
  --conn-password "<PRIVATE_KEY>"

Please note that the connection name can be anything; 'cde' is just used here as an example.

For CDW, the connection must be defined using workload credentials as follows (please note that for CDW only username/password authentication is available through our Airflow operator for now; we are adding access key support in an upcoming release):

airflow connections add 'cdw' \
  --conn-type 'hive' \
  --conn-host '<HOSTNAME (base hostname of the JDBC URL that can be copied from the CDW UI, without port and protocol)>' \
  --conn-schema '<DATABASE_SCHEMA (by default 'default')>' \
  --conn-login "<WORKLOAD_USERNAME>" \
  --conn-password "<WORKLOAD_PASSWORD>"

With only a few steps, your Airflow connection setup is done!

Step 4: Running your DAG (2 minutes)

Two operators are supported in the Cloudera provider. The CDEJobRunOperator allows you to run Spark jobs on a CDE cluster. Additionally, the CDWOperator allows you to tap into a Virtual Warehouse in CDW to run Hive jobs (a hedged sketch of a DAG combining both operators appears near the end of this post).

CDEJobRunOperator

The CDE operator assumes that the Spark job to be triggered has already been created within CDE in your CDP Public Cloud environment; follow these steps to create a job. Once you have prepared a job, you can invoke it from your Airflow DAG using a CDEJobRunOperator. First, make sure to import the library:

from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator

Then use the operator task as follows:

cde_task = CDEJobRunOperator(
    dag=dag,
    task_id="process_data",
    job_name='process_data_spark',
    connection_id='cde'
)

The connection_id 'cde' references the connection you defined in Step 3. Copy your new DAG into Airflow's dag folder as shown below:

# if you followed the Airflow setup in step 0, you will need to create the dag folder
mkdir airflow/dags
# Copy dag to dag folder
cp /tmp/cde_demo/cde/cde.py airflow/dags

Alternatively, Git can be used to manage and automate your DAGs as part of a CI/CD pipeline; see the Airflow DAG Git integration guide.

We are all set! Now we simply need to run the DAG. To trigger it via the Airflow CLI, run the following:

airflow dags trigger <dag_id>

Alternatively, trigger it through the UI. We can monitor the Spark job that was triggered through the CDE UI and, if needed, view logs and performance profiles.

What's Next

As customers continue to adopt Airflow as their next-generation orchestration, we will expand the Cloudera provider to leverage other data services within CDP, such as running machine learning models within CML, helping accelerate the deployment of Edge-to-AI pipelines.
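To tie the pieces together, here is a minimal, hedged sketch of a complete DAG that chains the CDEJobRunOperator shown in Step 4 with a CDWOperator Hive query. The CDWOperator import path and its parameter names (cli_conn_id, hql, schema) are assumptions modeled on the CDE operator and may differ from the actual provider; the connection names 'cde' and 'cdw' are the ones created in Step 3, and the job name and query are placeholders.

# Hypothetical end-to-end DAG combining the two operators.
# CDWOperator import path and arguments are assumptions, not confirmed by this post.
from datetime import datetime

from airflow import DAG
from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator
from cloudera.cdp.airflow.operators.cdw_operator import CDWOperator  # assumed path

with DAG(
    dag_id="cde_cdw_demo",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:

    # Run the pre-created Spark job in CDE (see Step 4).
    process_data = CDEJobRunOperator(
        task_id="process_data",
        job_name="process_data_spark",   # placeholder CDE job name
        connection_id="cde",             # connection from Step 3
    )

    # Run a Hive query against the CDW Virtual Warehouse (parameters assumed).
    report = CDWOperator(
        task_id="build_report",
        cli_conn_id="cdw",               # connection from Step 3 (assumed parameter name)
        hql="SELECT COUNT(*) FROM default.txn",  # placeholder query
        schema="default",
    )

    # Build the report only after the Spark job has finished.
    process_data >> report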
Take a test drive of Airflow in Cloudera Data Engineering yourself today to learn about its benefits and how it could help you streamline complex data workflows. The post Supercharge your Airflow Pipelines with the Cloudera Provider Package appeared first on Cloudera Blog. View the full article
-
What is Cloudera Data Engineering (CDE)?

Cloudera Data Engineering is a serverless service for Cloudera Data Platform (CDP) that allows you to submit jobs to auto-scaling virtual clusters. CDE enables you to spend more time on your applications and less time on infrastructure. CDE allows you to create, manage, and schedule Apache Spark jobs without the overhead of creating and maintaining Spark clusters. In addition, you can define virtual clusters with a range of CPU and memory resources, and the cluster scales up and down as needed to execute your Spark workloads, helping control your cloud costs.

A managed, serverless Spark service helps our customers in a number of ways:

Auto-scaling of compute to eliminate static infrastructure costs. This ensures that customers do not have to maintain a large infrastructure footprint, reducing total cost of ownership.
The ability for business users to easily control their own compute needs with the click of a button, without IT intervention.
A complete view of job performance, logging, and debugging through a single pane of glass, enabling efficient development on Spark.

Refer to the following Cloudera blog to understand the full potential of Cloudera Data Engineering.

Why should technology partners care about CDE?

Unlike traditional data engineering workflows that have relied on a patchwork of tools for preparing, operationalizing, and debugging data pipelines, Cloudera Data Engineering is designed for efficiency and speed — seamlessly integrating and securing data pipelines to any CDP service, including Machine Learning, Data Warehouse, Operational Database, or any other analytic tool in your business. Partner tools that use CDP as their backend store can leverage this new service to ensure their customers can take advantage of a serverless architecture for Spark.

ISV partners like Precisely support Cloudera's hybrid vision. Precisely Data Integration, Change Data Capture, and Data Quality tools support CDP Public Cloud as well as CDP Private Cloud. Precisely end customers can now design a pipeline once and deploy it anywhere. Data pipelines that are bursty in nature can leverage the public cloud CDE service, while longer-running persistent loads can run on-prem. This ensures that the right data pipelines are running on the most cost-effective engines available in the market today.

Using the CDE Integration API

CDE provides a robust API for integration with your existing continuous integration/continuous delivery platforms. The Cloudera Data Engineering service API is documented in Swagger. You can view the API documentation and try out individual API calls by accessing the API DOC link in any virtual cluster:

In the CDE web console, select an environment.
Click the Cluster Details icon in any of the listed virtual clusters.
Click the link under API DOC.

For further details on the API, please refer to the following doc link here.

Custom base image for Kubernetes

Partners who need to run their own business logic and require custom binaries or packages on the Spark engine platform can now leverage this feature of Cloudera Data Engineering. We believe customized engine images allow greater flexibility for our partners to build cloud-native integrations and could potentially be leveraged by our enterprise customers as well. The following steps describe how to run Spark jobs with dependencies on external libraries and packages.
The libraries and packages will be installed on top of the base image to make them available to the Spark executors.

First, obtain the latest CDE CLI:
a) Create a virtual cluster
b) Go to the virtual cluster details page
c) Download the CLI
Learn more on how to use the CLI here.

Run Spark jobs on a customized container image – Overview

Custom images are based on the base dex-spark-runtime image, which is accessible from the Cloudera Docker repository. Users can then layer their packages and custom libraries on top of the base image. The final image is uploaded to a Docker repository, which is then registered with CDE as a job resource. New jobs are defined with references to the resource, which automatically downloads the custom runtime image to run the Spark drivers and executors.

Run Spark jobs on a customized container image: Steps

1. Pull the "dex-spark-runtime" image from "docker.repository.cloudera.com"

$ docker pull container.repository.cloudera.com/cloudera/dex/dex-spark-runtime:<version>

Note: "docker.repository.cloudera.com" is behind the paywall and requires credentials to access; please ask your account team to provide them.

2. Create your "custom-dex-spark-runtime" image, based on the "dex-spark-runtime" image

$ docker build --network=host -t <company-registry>/custom-dex-spark-runtime:<version> . -f Dockerfile

Dockerfile example:

FROM docker.repository.cloudera.com/<company-name>/dex-spark-runtime:<version>
USER root
RUN yum install ${YUM_OPTIONS} <package-to-install> && yum clean all && rm -rf /var/cache/yum
RUN dnf install ${DNF_OPTIONS} <package-to-install> && dnf clean all && rm -rf /var/cache/dnf
USER ${DEX_UID}

3. Push the image to your company Docker registry

$ docker push <company-registry>/custom-dex-spark-runtime:<version>

4. Create an ImagePullSecret in the DE cluster for the company's Docker registry (optional)

REST API:

# POST /api/v1/credentials
{
  "name": "<company-registry-basic-credentials>",
  "type": "docker",
  "uri": "<company-registry>",
  "secret": {
    "username": "foo",
    "password": "bar"
  }
}

CDE CLI:

=== credential ===
./cde credential create --type=docker-basic --name=docker-sandbox-cred --docker-server=https://docker-sandbox.infra.cloudera.com --docker-username=foo --tls-insecure --user srv_dex_mc --vcluster-endpoint https://gbz7t69f.cde-vl4zqll4.dex-a58x.svbr-nqvp.int.cldr.work/dex/api/v1

Note: Credentials will be stored as a Kubernetes "Secret" and are never stored by the DEX API.

5. Register "custom-dex-spark-runtime" in DE as a "Custom Spark Runtime Image" resource

REST API:

# POST /api/v1/resources
{
  "name": "",
  "type": "custom-spark-runtime-container-image",
  "engine": "spark2",
  "image": "<company-registry>/custom-dex-spark-runtime:<version>",
  "imagePullSecret": "<company-registry-basic-credentials>"
}

CDE CLI:

=== runtime resources ===
./cde resource create --type="custom-runtime-image" --image-engine="spark2" --name="custom-dex-qe-1_1" --image-credential=docker-sandbox-cred --image="docker-sandbox.infra.cloudera.com/dex-qe/custom-dex-qe:1.1" --tls-insecure --user srv_dex_mc --vcluster-endpoint https://gbz7t69f.cde-vl4zqll4.dex-a58x.svbr-nqvp.int.cldr.work/dex/api/v1

6. You should now be able to define Spark jobs referencing the custom-dex-spark-runtime

REST API:

# POST /api/v1/jobs
{
  "name": "spark-custom-image-job",
  "spark": {
    "imageResource": "CustomSparkImage-1",
    ...
  }
  ...
}

CDE CLI:

=== job create ===
./cde job create --type spark --name cde-job-docker --runtime-image-resource-name custom-dex-qe-1_1 --application-file /tmp/numpy_app.py --num-executors 1 --executor-memory 1G --driver-memory 1G --tls-insecure --user srv_dex_mc --vcluster-endpoint https://gbz7t69f.cde-vl4zqll4.dex-a58x.svbr-nqvp.int.cldr.work/dex/api/v1

7. Once the job is created, trigger it either through the Web UI or by running the following command in the CLI:

$> cde job run --name cde-job-docker
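The job above references /tmp/numpy_app.py as its application file. As an illustration only, here is a minimal, hypothetical sketch of what such a PySpark application might look like when it relies on a package installed in the custom image (numpy in this case); the actual application used in the original example is not shown in this post.

# Hypothetical /tmp/numpy_app.py: a minimal PySpark job that uses numpy,
# a package assumed to be installed in the custom runtime image.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("numpy-app").getOrCreate()

# Build a small DataFrame from numpy-generated values to confirm the
# package is importable alongside Spark in the custom image.
values = [(int(i), float(x)) for i, x in enumerate(np.linspace(0.0, 1.0, 10))]
df = spark.createDataFrame(values, ["id", "value"])

df.show()
spark.stop()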
In conclusion

We introduced the "Custom Base Image" feature as part of our Design Partner Program to elicit feedback from our ISV partners. The response has been overwhelmingly positive, and building custom integrations with our cloud-native CDE offering has never been easier. As a partner, you can leverage Spark running on Kubernetes infrastructure for free. You can launch a trial of CDE on CDP in minutes here, giving you a hands-on introduction to data engineering innovations in the public cloud.

References:
https://www.cloudera.com/tutorials/cdp-getting-started-with-cloudera-data-engineering.html

The post Cloudera Data Engineering – Integration steps to leverage Spark on Kubernetes appeared first on Cloudera Blog. View the full article
-
Tagged with:
- cloudera
- data engineering
(and 2 more)
-
Forum Statistics
67.7k Total Topics
65.6k Total Posts