Search the Community
Showing results for tags 'apache hive'.
-
This post series is about mastering offline data pipeline's best practices, focusing on the potent combination of Apache Airflow and data processing engines like Hive and Spark. In Part 1 of our series explored the strategies for enhancing Airflow data pipelines using Apache Hive on AWS EMR. Our primary objective was to attain cost efficiency and establish effective job configurations. In this concluding Part 2, we will extensively explore Apache Spark, another pivotal element in our comprehensive data engineering toolkit. By optimizing the Airflow job parameters specifically for Spark, there is a substantial potential for enhancing performance and realizing substantial cost savings. Why Apache Spark in Airflow? Apache Spark is a really important framework and tool for data processing in companies all about data. It's genuinely outstanding at processing massive amounts of data quickly and efficiently. It's especially great for complex data analytics with fast query performance and advanced analytics capabilities. This makes Spark a preferred choice for enterprises handling vast amounts of data and requiring real-time analytics. View the full article
-
- best practices
- scheduling
-
(and 2 more)
Tagged with:
-
Welcome to the first post in our exciting series on mastering offline data pipeline's best practices, focusing on the potent combination of Apache Airflow and data processing engines like Hive and Spark. This post focuses on elevating our data engineering game, streamlining your data workflows, and significantly cutting computing costs. The need to optimize offline data pipeline optimization has become a necessity with the growing complexity and scale of modern data pipelines. In this kickoff post, we delve into the intricacies of Apache Airflow and AWS EMR, a managed cluster platform for big data processing. Working together, they form the backbone of many modern data engineering solutions. However, they can become a source of increased costs and inefficiencies without the right optimization strategies. Let's dive into the journey to transform your data workflows and embrace cost-efficiency in your data engineering environment. View the full article
-
- data piplines
- best practices
-
(and 1 more)
Tagged with:
-
AWS Lake Formation and the Glue Data Catalog now extend data cataloging, data sharing and fine-grained access control support for customers using a self-managed Apache Hive Metastore (HMS) as their data catalog. Previously, customers had to replicate their metadata into the AWS Glue Data Catalog in order use Lake Formation permissions and data sharing capabilities. Now, customers can integrate their HMS metadata within AWS, allowing them to discover data alongside native tables in the Glue data catalog, manage permissions and sharing from Lake Formation, and query data using AWS analytics services. View the full article
-
- aws glue
- aws glue data catalog
-
(and 1 more)
Tagged with:
-
This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS! A new week starts, and Spring is almost here! If you’re curious about AWS news from the previous seven days, I got you covered. Last Week’s Launches Here are the launches that got my attention last week: Amazon S3 – Last week there was AWS Pi Day 2023 celebrating 17 years of innovation since Amazon S3 was introduced on March 14, 2006. For the occasion, the team released many new capabilities: S3 Object Lambda now provides aliases that are interchangeable with bucket names and can be used with Amazon CloudFront to tailor content for end users. S3 now support datasets that are replicated across multiple AWS accounts with cross-account support for S3 Multi-Region Access Points. You can now create and configure replication rules to automatically replicate S3 objects from one AWS Outpost to another. Amazon S3 has also simplified private connectivity from on-premises networks: with private DNS for S3, on-premises applications can use AWS PrivateLink to access S3 over an interface endpoint, while requests from your in-VPC applications access S3 using gateway endpoints. We released Mountpoint for Amazon S3, a high performance open source file client. Read more in the blog. Note that Mountpoint isn’t a general-purpose networked file system, and comes with some restrictions on file operations. Amazon Linux 2023 – Our new Linux-based operating system is now generally available. Sébastien’s post is full of tips and info. Application Auto Scaling – Now can use arithmetic operations and mathematical functions to customize the metrics used with Target Tracking policies. You can use it to scale based on your own application-specific metrics. Read how it works with Amazon ECS services. AWS Data Exchange for Amazon S3 is now generally available – You can now share and find data files directly from S3 buckets, without the need to create or manage copies of the data. Amazon Neptune – Now offers a graph summary API to help understand important metadata about property graphs (PG) and resource description framework (RDF) graphs. Neptune added support for Slow Query Logs to help identify queries that need performance tuning. Amazon OpenSearch Service – The team introduced security analytics that provides new threat monitoring, detection, and alerting features. The service now supports OpenSearch version 2.5 that adds several new features such as support for Point in Time Search and improvements to observability and geospatial functionality. AWS Lake Formation and Apache Hive on Amazon EMR – Introduced fine-grained access controls that allow data administrators to define and enforce fine-grained table and column level security for customers accessing data via Apache Hive running on Amazon EMR. Amazon EC2 M1 Mac Instances – You can now update guest environments to a specific or the latest macOS version without having to tear down and recreate the existing macOS environments. AWS Chatbot – Now Integrates With Microsoft Teams to simplify the way you troubleshoot and operate your AWS resources. Amazon GuardDuty RDS Protection for Amazon Aurora – Now generally available to help profile and monitor access activity to Aurora databases in your AWS account without impacting database performance AWS Database Migration Service – Now supports validation to ensure that data is migrated accurately to S3 and can now generate an AWS Glue Data Catalog when migrating to S3. AWS Backup – You can now back up and restore virtual machines running on VMware vSphere 8 and with multiple vNICs. Amazon Kendra – There are new connectors to index documents and search for information across these new content: Confluence Server, Confluence Cloud, Microsoft SharePoint OnPrem, Microsoft SharePoint Cloud. This post shows how to use the Amazon Kendra connector for Microsoft Teams. For a full list of AWS announcements, be sure to keep an eye on the What's New at AWS page. Other AWS News A few more blog posts you might have missed: Women founders Q&A – We’re talking to six women founders and leaders about how they’re making impacts in their communities, industries, and beyond. What you missed at that 2023 IMAGINE: Nonprofit conference – Where hundreds of nonprofit leaders, technologists, and innovators gathered to learn and share how AWS can drive a positive impact for people and the planet. Monitoring load balancers using Amazon CloudWatch anomaly detection alarms – The metrics emitted by load balancers provide crucial and unique insight into service health, service performance, and end-to-end network performance. Extend geospatial queries in Amazon Athena with user-defined functions (UDFs) and AWS Lambda – Using a solution based on Uber’s Hexagonal Hierarchical Spatial Index (H3) to divide the globe into equally-sized hexagons. How cities can use transport data to reduce pollution and increase safety – A guest post by Rikesh Shah, outgoing head of open innovation at Transport for London. For AWS open-source news and updates, here’s the latest newsletter curated by Ricardo to bring you the most recent updates on open-source projects, posts, events, and more. Upcoming AWS Events Here are some opportunities to meet: AWS Public Sector Day 2023 (March 21, London, UK) – An event dedicated to helping public sector organizations use technology to achieve more with less through the current challenging conditions. Women in Tech at Skills Center Arlington (March 23, VA, USA) – Let’s celebrate the history and legacy of women in tech. The AWS Summits season is warming up! You can sign up here to know when registration opens in your area. That’s all from me for this week. Come back next Monday for another Week in Review! — Danilo View the full article
-
- women in tech
- s3
- (and 23 more)
-
Amazon EMR is excited to announce a new capability that enables users to apply AWS Lake Formation based table and column level permissions on Amazon S3 data lake for write operations (i.e., INSERT INTO, INSERT OVERWRITE) with Apache Hive jobs submitted using Amazon EMR Steps API. This feature allows data administrators to define and enforce fine-grained table and column level security for customers accessing data via Apache Hive running on Amazon EMR. View the full article
-
We are excited to launch two new features that help enforce access controls with Amazon EMR on EC2 clusters (EMR Clusters). These features are supported with jobs that are submitted to the cluster using the EMR Steps API. First is Runtime Role with EMR Steps. A Runtime Role is an AWS Identity and Access Management (IAM) role that you associate with an EMR Step. An EMR Step uses this role to access AWS resources. The second is integration with AWS Lake Formation to apply table and column-level access controls for Apache Spark and Apache Hive jobs with EMR Steps. View the full article
-
- iam
- lake formation
-
(and 7 more)
Tagged with:
-
You can now use Amazon Athena to query views stored in your self-managed Apache Hive metastores. Hive views are defined using the Hive Query Language (HiveQL) which is not fully compatible with Athena's standard SQL. With this new capability, Athena automatically handles HiveQL syntax differences so you can query Hive views without changing your view definitions or maintaining a complex translation layer. View the full article
-
Amazon EMR Release 6.2 now supports improved Apache HBase performance on Amazon S3 with persistent HFile tracking, and Apache Hive ACID transactions on HDFS and Amazon S3. EMR 6.2 contains performance improvements to EMR Runtime for Apache Spark, and PrestoDB performance improvements. View the full article
- 1 reply
-
- apache hbase
- amazon s3
-
(and 5 more)
Tagged with:
-
Apache Hive supports transactional tables which provide ACID guarantees. There has been a significant amount of work that has gone into hive to make these transactional tables highly performant. Apache Spark provides some capabilities to access hive external tables but it cannot access hive managed tables. To access hive managed tables from spark Hive Warehouse Connector needs to be used. We are happy to announce Spark Direct Reader mode in Hive Warehouse Connector which can read hive transactional tables directly from the filesystem. This feature has been available from CDP-Public-Cloud-2.0 (7.2.0.0) and CDP-DC-7.1 (7.1.1.0) releases onwards. Hive Warehouse Connector (HWC) was available to provide access to managed tables in hive from spark, however since this involved communication with LLAP there was an additional hop to get the data and process it in spark vs the ability of spark to directly read the data from FileSystem for External tables. This leads to performance degradation in accessing data from managed tables vs external tables. Additionally a lot of use cases for HWC were associated with ETL jobs where a super user was running these jobs to update data in multiple tables hence authorization was not a strong business need for this case. HWC Spark Direct Reader is an additional mode available in HWC which tries to address the above concerns. This article describes the usage of spark direct reader to consume hive transactional table data in a spark application. It also introduces the methods and APIs to read hive transactional tables into spark dataframes. Finally, it demonstrates the transaction handling and semantics while using this reader. HWC Spark Direct Reader is derived from Qubole Spark Acid Connector. Prerequisites Following are the prerequisites to be able to query hive managed tables from Spark Direct Reader – Connectivity to HMS (Hive Metastore) which means the spark application should be able to access hive metastore using thrift URI. This URI is determined by hive config hive.metastore.uris The User launching spark application must have Read and Execute permissions on hive warehouse location on the filesystem. The location is determined by hive config hive.metastore.warehouse.dir Use cases This section focuses on the usage of the Spark Direct Reader to read transactional tables. Consider that we have a hive transactional table emp_acid which contains information about the employees. With auto-translate extension enabled The connector has an auto-translate rule which is a spark extension rule which automatically instructs spark to use spark direct reader in case of managed tables so that the user does not need to specify it explicitly. See employee data in table emp_acid scala> spark.sql("select * from emp_acid").show +------+----------+--------------------+-------------+--------------+-----+-----+-------+ |emp_id|first_name| e_mail|date_of_birth| city|state| zip|dept_id| +------+----------+--------------------+-------------+--------------+-----+-----+-------+ |677509| Lois|lois.walker@hotma… | 3/29/1981| Denver | CO|80224| 4| |940761| Brenda|brenda.robinson@g...| 7/31/1970| Stonewall | LA |71078| 5| |428945| Joe|joe.robinson@gmai… | 6/16/1963| Michigantown| IN |46057| 3| ………. ………. ………. Using transactional tables in conjunction with other data sources Spark direct reader works seamlessly with other data sources as well, like in the below example we are joining emp_acid table with an external table dept_ext to find out corresponding departments of employees. scala> sql("select e.emp_id, e.first_name, d.name department from emp_acid e join dept_ext d on e.dept_id = d.id").show +------+----------+-----------+ |emp_id|first_name| department| +------+----------+-----------+ |677509| Lois | HR | |940761| Brenda| FINANCE| |428945| Joe | ADMIN | Here direct reader is used to fetch the data of emp_acid table since it’s transactional table, the data of dept_ext table is fetched by spark’s native reader scala> sql("select e.emp_id, e.first_name, d.name department from emp_acid e join dept_ext d on e.dept_id = d.id").explain == Physical Plan == *(2) Project [emp_id#288, first_name#289, name#287 AS department#255] +- *(2) BroadcastHashJoin [dept_id#295], [id#286], Inner, BuildRight :- *(2) Filter isnotnull(dept_id#295) : +- *(2) Scan HiveAcidRelation(org.apache.spark.sql.SparkSession@1444fa42,default.emp_acid,Map(transactional -> true, numFilesErasureCoded -> 0, bucketing_version -> 2, transient_lastDdlTime -> 1594830632, transactional_properties -> default, table -> default.emp_acid)) [emp_id#288,first_name#289,dept_id#295] PushedFilters: [IsNotNull(dept_id)], ReadSchema: struct<emp_id:int,first_name:string,dept_id:int> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#169] +- *(1) Project [id#286, name#287] +- *(1) Filter isnotnull(id#286) +- *(1) FileScan orc default.dept_ext[id#286,name#287] Batched: true, Format: ORC, Location: InMemoryFileIndex[hdfs://anurag-hwc-1.anurag-hwc.root.hwx.site:8020/warehouse/tablespace/external..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:int,name:string> Configurations to enable auto-translate To turn on auto translate feature, we need to specify spark sql extension like spark.sql.extensions=com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension Using Hive Warehouse Connector executeQuery() API If you are already using Hive Warehouse Connector in your spark application then you can continue to use executeQuery() API and switch to Spark Direct Reader just by adding some configurations. Code is similar to what we need to use with Hive Warehouse Connector. Queries shown in sections 3.1.1 and 3.1.2 can be done like the following. scala> val hive = com.hortonworks.hwc.HiveWarehouseSession.session(spark).build() scala> hive.executeQuery("select * from emp_acid").show scala> hive.executeQuery("select e.emp_id, e.first_name, d.name department from emp_acid e join dept_ext d on e.dept_id = d.id").show Configurations to use Spark Direct Reader via Hive Warehouse Connector API spark.datasource.hive.warehouse.read.via.llap=false spark.sql.hive.hwc.execution.mode=spark spark.sql.extensions=com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension Behind the scenes – Read Architecture Spark Direct Reader is built on top DataSource V1 APIs exposed by spark which allows us to plug in custom data sources to spark. In our case the custom data source is the layer which enables us to read hive ACID tables. Following diagram depicts the high level read process and transaction management in connector Spark Driver parses the query and for each ACID table starts a read txn. Transaction snapshot for each table is stored separately and is used for generating the split. Spark driver serializes and sends the partition info and txn snapshot to executors. Executors read the specific split using the transaction snapshot. Processed and transformed data is sent back to the driver. Driver commits the read transactions started. Note on transactions Currently the connector only supports single table transaction consistency i.e. one new transaction is opened for each table involved in query. This means if multiple tables are involved in the query then all may not use the same snapshot of data. Transactions when single table t1 is involved scala> spark.sql(“select * from t1”).show 20/07/08 05:41:39 INFO transaction.HiveAcidTxn: Begin transaction {"id":"174","validTxns":"174:9223372036854775807::"} 20/07/08 05:41:39 INFO transaction.HiveAcidTxn: Lock taken for lockInfo com.qubole.spark.hiveacid.transaction.LockInfo@37f5c5fd in transaction with id 174 .... .... 20/07/08 05:41:47 INFO transaction.HiveAcidTxn: End transaction {"id":"174","validTxns":"174:9223372036854775807::"} abort = false Transactions when multiple tables t3 and t4 are involved scala> spark.sql("select * from default.t3 join default.t4 on default.t3.a = default.t4.a").show 20/07/08 05:43:36 INFO transaction.HiveAcidTxn: Begin transaction {"id":"175","validTxns":"175:9223372036854775807::"} 20/07/08 05:43:36 INFO transaction.HiveAcidTxn: Lock taken for lockInfo com.qubole.spark.hiveacid.transaction.LockInfo@67cc7aee in transaction with id 175 .... .... 20/07/08 05:43:36 INFO transaction.HiveAcidTxn: Begin transaction {"id":"176","validTxns":"176:175:175:"} 20/07/08 05:43:36 INFO transaction.HiveAcidTxn: Lock taken for lockInfo com.qubole.spark.hiveacid.transaction.LockInfo@a8c0c6d in transaction with id 176 .... .... 20/07/08 05:43:53 INFO transaction.HiveAcidTxn: End transaction {"id":"175","validTxns":"175:9223372036854775807::"} abort = false 20/07/08 05:43:53 INFO transaction.HiveAcidTxn: End transaction {"id":"176","validTxns":"176:175:175:"} abort = false Notice different transactions 175 and 176 when two different tables t3 and t4 are present in the query. API to close transactions explicitly To commit or abort transactions, we have a sql listener which does it whenever a dataframe operation or spark sql query finishes. In some cases when .explain() / .rdd() / .cache() are invoked on a dataframe, it opens a transaction and never closes it since technically they are not spark sql queries so the sql listener does not kick in. To handle this scenario and to be able to close the transactions manually, an explicit API is exposed which can be invoked like the following. scala> com.qubole.spark.hiveacid.transaction.HiveAcidTxnManagerObject.commitTxn(spark) Or if you are using Hive Warehouse Connector’s session (say ‘hive’ is the instance) scala> hive.commitTxn Configuration Summary To use Spark Direct Reader, we need the following configurations. Property Value Description spark.hadoop.hive.metastore.uris thrift://<host>:<port> Hive metastore URI spark.sql.extensions com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension Extension needed to auto-translate to work spark.kryo.registrator com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator For using kryo serialization. spark.sql.hive.hwc.execution.mode spark spark.datasource.hive.warehouse.read.via.llap false Hive Warehouse connector jar should be supplied to spark-shell or spark-submit using –jars option while launching the application. For instance, spark-shell can be launched like the following. spark-shell --jars /opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-<version>.jar \ --conf "spark.sql.extensions=com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension" \ --conf "spark.datasource.hive.warehouse.read.via.llap=false" \ --conf "spark.sql.hive.hwc.execution.mode=spark" \ --conf "spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator" \ --conf "spark.hadoop.hive.metastore.uris=<metastore_uri>" Further Information and Resources Cloudera Data Warehouse HWC Spark Direct Reader for accessing Hive data Integrating Apache Hive with Apache Spark The post Enabling high-speed Spark direct reader for Apache Hive ACID tables appeared first on Cloudera Blog. View the full article
-
Forum Statistics
63.7k
Total Topics61.7k
Total Posts