Spark dynamic allocation how to configure and use it

Preamble

Since we started putting Spark jobs in production we have asked ourselves how many executors, how many cores per executor and how much executor memory we should allocate. What if we allocate too much and waste resources? And could we improve the response time if we allocated more?

In other words, those spark-submit parameters (we have a Hortonworks Hadoop cluster and so are using YARN):

  • --executor-memory MEM: Memory per executor (e.g. 1000M, 2G) (Default: 1G).
  • --executor-cores NUM: Number of cores per executor. (Default: 1 in YARN mode, or all available cores on the worker in standalone mode)
  • --num-executors NUM: Number of executors to launch (Default: 2). If dynamic allocation is enabled, the initial number of executors will be at least NUM.

And in fact, as written in the above description of --num-executors, Spark dynamic allocation partially answers this question.

Spark dynamic allocation is a feature allowing your Spark application to automatically scale the number of executors up and down. And only the number of executors: the memory size and the number of cores of each executor must still be set explicitly in your application or on the spark-submit command line. So the promise is that your application will dynamically be able to request more executors and release them back to the cluster pool based on its workload. Of course, if using YARN, you will be tightly bound to the resources allocated to the queue to which you have submitted your application (--queue parameter of spark-submit).
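As an illustration, here is a minimal PySpark sketch (assuming YARN and the external shuffle service discussed below) of how the same sizing knobs plus the dynamic allocation bounds can be expressed as SparkConf properties instead of spark-submit flags; the values are arbitrary examples, not recommendations:

from pyspark import SparkConf, SparkContext

# Executor size stays fixed; only the executor count varies between min and max.
conf = SparkConf().setAppName("Dynamic allocation sketch").\
        set("spark.dynamicAllocation.enabled", "true").\
        set("spark.shuffle.service.enabled", "true").\
        set("spark.dynamicAllocation.minExecutors", "1").\
        set("spark.dynamicAllocation.maxExecutors", "10").\
        set("spark.executor.cores", "1").\
        set("spark.executor.memory", "1g")

sc = SparkContext.getOrCreate(conf)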

This blog post has been written using Hortonworks Data Platform (HDP) 3.1.4 and so Spark2 2.3.2.

Spark dynamic allocation setup

As written in the official documentation, the shuffle jar must be added to the classpath of all NodeManagers. If, like me, you are running HDP 3, you will probably discover, as I did, that everything is already configured. The jar of this external shuffle library is:

[root@server jars]# ll /usr/hdp/current/spark2-client/jars/*shuffle*
-rw-r--r-- 1 root root 67763 Aug 23  2019 /usr/hdp/current/spark2-client/jars/spark-network-shuffle_2.11-2.3.2.3.1.4.0-315.jar

And in Ambari the YARN configuration was also already done:

[Screenshot: spark_dynamic_allocation01]

Remark:
We still have old Spark 1 variables; you should now concentrate only on the spark2_xx variables. Similarly, it is spark2_shuffle that must be appended to yarn.nodemanager.aux-services.

Then again quoting official documentation you have two parameters to set inside your application to have the feature activated:

There are two requirements for using this feature. First, your application must set spark.dynamicAllocation.enabled to true. Second, you must set up an external shuffle service on each worker node in the same cluster and set spark.shuffle.service.enabled to true in your application.

This part was not obvious to me, but as it is written, spark.dynamicAllocation.enabled and spark.shuffle.service.enabled must not only be set at cluster level but also in your application or as spark-submit parameters! I would even say that setting those parameters in Ambari makes no difference, but as you can see below all was done by default in my HDP 3.1.4 cluster:

[Screenshot: spark_dynamic_allocation02]
[Screenshot: spark_dynamic_allocation03]

For the complete list of parameters refer to the official Spark dynamic allocation parameter list.

Spark dynamic allocation testing

For the testing code I have done a mix in PySpark of multiple test codes I have seen around on the Internet. Using Python saves me a boring sbt compilation phase before testing…

The source code is (spark_dynamic_allocation.py):

# from pyspark.sql import SparkSession
from pyspark import SparkConf
from pyspark import SparkContext
# from pyspark_llap import HiveWarehouseSession
from time import sleep

def wait_x_seconds(x):
  # Each task sleeps x*10 seconds so the tasks finish, and executors become idle, one after the other
  sleep(x*10)

conf = SparkConf().setAppName("Spark dynamic allocation").\
        set("spark.dynamicAllocation.enabled", "true").\
        set("spark.shuffle.service.enabled", "true").\
        set("spark.dynamicAllocation.initialExecutors", "1").\
        set("spark.dynamicAllocation.executorIdleTimeout", "5s").\
        set("spark.executor.cores", "1").\
        set("spark.executor.memory", "512m")

sc = SparkContext.getOrCreate(conf)

# spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
# spark.stop()

sc.parallelize(range(1,6), 5).foreach(wait_x_seconds)

exit()

So in short I run five parallel tasks that will each wait x*10 seconds where x goes from 1 to 5 (range(1,6)). We start with one executor and expect Spark to scale up and then down as the shorter timers end in order (10 seconds, 20 seconds, ...). I have also exaggerated a bit in the parameters: spark.dynamicAllocation.executorIdleTimeout is changed to 5s so that I can see the executors being killed in my example (the default is 60s).

The command to execute it is below; the Hive Warehouse Connector is not really mandatory here but it has become a habit. Notice that I do not specify anything on the command line as everything is set up in the Python script:

spark-submit --master yarn --queue llap --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar \
--py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip spark_dynamic_allocation.py

By default our spark-submit is in INFO mode, and the important part of the output is:

.
20/04/09 14:34:14 INFO Utils: Using initial executors = 1, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
20/04/09 14:34:16 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.75.37.249:36332) with ID 1
20/04/09 14:34:16 INFO ExecutorAllocationManager: New executor 1 has registered (new total is 1)
.
.
20/04/09 14:34:17 INFO ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 2)
20/04/09 14:34:18 INFO ExecutorAllocationManager: Requesting 2 new executors because tasks are backlogged (new desired total will be 4)
20/04/09 14:34:19 INFO ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 5)
20/04/09 14:34:20 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.75.37.249:36354) with ID 2
20/04/09 14:34:20 INFO ExecutorAllocationManager: New executor 2 has registered (new total is 2)
20/04/09 14:34:20 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, yarn01.domain.com, executor 2, partition 1, PROCESS_LOCAL, 7869 bytes)
20/04/09 14:34:20 INFO BlockManagerMasterEndpoint: Registering block manager yarn01.domain.com:29181 with 114.6 MB RAM, BlockManagerId(2, yarn01.domain.com, 29181, None)
20/04/09 14:34:20 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on yarn01.domain.com:29181 (size: 3.7 KB, free: 114.6 MB)
20/04/09 14:34:21 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.75.37.249:36366) with ID 3
20/04/09 14:34:21 INFO ExecutorAllocationManager: New executor 3 has registered (new total is 3)
20/04/09 14:34:21 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, yarn01.domain.com, executor 3, partition 2, PROCESS_LOCAL, 7869 bytes)
20/04/09 14:34:21 INFO BlockManagerMasterEndpoint: Registering block manager yarn01.domain.com:44000 with 114.6 MB RAM, BlockManagerId(3, yarn01.domain.com, 44000, None)
20/04/09 14:34:21 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on yarn01.domain.com:44000 (size: 3.7 KB, free: 114.6 MB)
20/04/09 14:34:22 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.75.37.249:36376) with ID 5
20/04/09 14:34:22 INFO ExecutorAllocationManager: New executor 5 has registered (new total is 4)
20/04/09 14:34:22 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, yarn01.domain.com, executor 5, partition 3, PROCESS_LOCAL, 7869 bytes)
20/04/09 14:34:22 INFO BlockManagerMasterEndpoint: Registering block manager yarn01.domain.com:32822 with 114.6 MB RAM, BlockManagerId(5, yarn01.domain.com, 32822, None)
20/04/09 14:34:22 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on yarn01.domain.com:32822 (size: 3.7 KB, free: 114.6 MB)
20/04/09 14:34:27 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, yarn01.domain.com, executor 1, partition 4, PROCESS_LOCAL, 7869 bytes)
20/04/09 14:34:27 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 10890 ms on yarn01.domain.com (executor 1) (1/5)
20/04/09 14:34:27 INFO PythonAccumulatorV2: Connected to AccumulatorServer at host: 127.0.0.1 port: 31354
20/04/09 14:34:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.75.37.248:57764) with ID 4
20/04/09 14:34:29 INFO ExecutorAllocationManager: New executor 4 has registered (new total is 5)
20/04/09 14:34:29 INFO BlockManagerMasterEndpoint: Registering block manager worker01.domain.com:38365 with 114.6 MB RAM, BlockManagerId(4, worker01.domain.com, 38365, None)
20/04/09 14:34:34 INFO ExecutorAllocationManager: Request to remove executorIds: 4
20/04/09 14:34:34 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 4
20/04/09 14:34:34 INFO YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 4
20/04/09 14:34:34 INFO ExecutorAllocationManager: Removing executor 4 because it has been idle for 5 seconds (new desired total will be 4)
20/04/09 14:34:38 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 4.
20/04/09 14:34:38 INFO DAGScheduler: Executor lost: 4 (epoch 0)
20/04/09 14:34:38 INFO BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
20/04/09 14:34:38 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(4, worker01.domain.com, 38365, None)
20/04/09 14:34:38 INFO BlockManagerMaster: Removed 4 successfully in removeExecutor
20/04/09 14:34:38 INFO YarnScheduler: Executor 4 on worker01.domain.com killed by driver.
20/04/09 14:34:38 INFO ExecutorAllocationManager: Existing executor 4 has been removed (new total is 4)
20/04/09 14:34:41 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 20892 ms on yarn01.domain.com (executor 2) (2/5)
20/04/09 14:34:46 INFO ExecutorAllocationManager: Request to remove executorIds: 2
20/04/09 14:34:46 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 2
20/04/09 14:34:46 INFO YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 2
20/04/09 14:34:46 INFO ExecutorAllocationManager: Removing executor 2 because it has been idle for 5 seconds (new desired total will be 3)
20/04/09 14:34:48 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 2.
20/04/09 14:34:48 INFO DAGScheduler: Executor lost: 2 (epoch 0)
20/04/09 14:34:48 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
20/04/09 14:34:48 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, yarn01.domain.com, 29181, None)
20/04/09 14:34:48 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
20/04/09 14:34:48 INFO YarnScheduler: Executor 2 on yarn01.domain.com killed by driver.
20/04/09 14:34:48 INFO ExecutorAllocationManager: Existing executor 2 has been removed (new total is 3)
20/04/09 14:34:52 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 30897 ms on yarn01.domain.com (executor 3) (3/5)
20/04/09 14:34:57 INFO ExecutorAllocationManager: Request to remove executorIds: 3
20/04/09 14:34:57 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 3
20/04/09 14:34:57 INFO YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 3
20/04/09 14:34:57 INFO ExecutorAllocationManager: Removing executor 3 because it has been idle for 5 seconds (new desired total will be 2)
20/04/09 14:34:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 3.
20/04/09 14:34:59 INFO DAGScheduler: Executor lost: 3 (epoch 0)
20/04/09 14:34:59 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
20/04/09 14:34:59 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, yarn01.domain.com, 44000, None)
20/04/09 14:34:59 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
20/04/09 14:34:59 INFO YarnScheduler: Executor 3 on yarn01.domain.com killed by driver.
20/04/09 14:34:59 INFO ExecutorAllocationManager: Existing executor 3 has been removed (new total is 2)
20/04/09 14:35:03 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 40831 ms on yarn01.domain.com (executor 5) (4/5)
20/04/09 14:35:08 INFO ExecutorAllocationManager: Request to remove executorIds: 5
20/04/09 14:35:08 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 5
20/04/09 14:35:08 INFO YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 5
20/04/09 14:35:08 INFO ExecutorAllocationManager: Removing executor 5 because it has been idle for 5 seconds (new desired total will be 1)
20/04/09 14:35:10 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 5.
20/04/09 14:35:10 INFO DAGScheduler: Executor lost: 5 (epoch 0)
20/04/09 14:35:10 INFO BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
20/04/09 14:35:10 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, yarn01.domain.com, 32822, None)
20/04/09 14:35:10 INFO BlockManagerMaster: Removed 5 successfully in removeExecutor
20/04/09 14:35:10 INFO YarnScheduler: Executor 5 on yarn01.domain.com killed by driver.
20/04/09 14:35:10 INFO ExecutorAllocationManager: Existing executor 5 has been removed (new total is 1)
20/04/09 14:35:17 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 50053 ms on yarn01.domain.com (executor 1) (5/5)
20/04/09 14:35:17 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
.

We clearly see the allocation and removal of executors but it is even more clear with the Spark UI web interface:

[Screenshot: spark_dynamic_allocation04]

The executors dynamically added (in blue) contrast well with the ones dynamically removed (in red)…

One of my colleagues asked me what happens if, by mistake, he allocates too many initial executors and this over-allocation wastes resources. I have tried this by specifying a deliberately high value in my code, for example:

set("spark.dynamicAllocation.initialExecutors", "1").\

And Spark dynamic allocation was really clever, de-allocating the unneeded executors almost instantly:

[Screenshot: spark_dynamic_allocation05]

INSERT OVERWRITE does not delete old directories

Preamble

In one of our processes we daily overwrite a table (a partition of this table to be accurate) and, by sheer luck, we noticed that the table size kept increasing until it became bigger than its sibling history table!! We did a quick check on HDFS and saw that old files had not been deleted…

I have been able to reproduce the issue with a simple example and I think I have found the open bug for this… It is pretty amazing to hit such a bug, as I feel Hadoop has reached a good maturity level…

We are running Hortonworks Data Platform (HDP) 3.1.4. So Hive release is 3.1.0.

INSERT OVERWRITE test case

I have the below creation script (the database name is yannick):

drop table yannick.test01 purge;
create table yannick.test01(val int, descr string) partitioned by (fab string, lot_partition string) stored as orc;
insert into table yannick.test01 partition(fab='GVA', lot_partition='TEST') values(1, 'One');

Initially from HDFS standpoint things are crystal clear:

[hdfs@client ~]$ hdfs dfs -ls /warehouse/tablespace/managed/hive/yannick.db/test01/fab=GVA/lot_partition=TEST/
Found 1 items
drwxrwx---+  - hive hadoop          0 2020-04-06 14:55 /warehouse/tablespace/managed/hive/yannick.db/test01/fab=GVA/lot_partition=TEST/delta_0000001_0000001_0000
[hdfs@client ~]$ hdfs dfs -ls /warehouse/tablespace/managed/hive/yannick.db/test01/fab=GVA/lot_partition=TEST/*
Found 2 items
-rw-rw----+  3 hive hadoop          1 2020-04-06 14:55 /warehouse/tablespace/managed/hive/yannick.db/test01/fab=GVA/lot_partition=TEST/delta_0000001_0000001_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        696 2020-04-06 14:55 /warehouse/tablespace/managed/hive/yannick.db/test01/fab=GVA/lot_partition=TEST/delta_0000001_0000001_0000/bucket_00000

So one directory with one ORC file.

I have then tried to find out, from the Hive standpoint, which directory(ies) are used by this table. I initially tried querying our Hive metastore (MySQL) directly:

mysql> select TBLS.TBL_NAME, PARTITIONS.PART_NAME, SDS.LOCATION
    -> from SDS, TBLS, PARTITIONS, DBS
    -> where TBLS.TBL_NAME='test01'
    -> and DBS.NAME = 'yannick'
    -> and TBLS.DB_ID = DBS.DB_ID
    -> and PARTITIONS.SD_ID = SDS.SD_ID
    -> and TBLS.TBL_ID = PARTITIONS.TBL_ID
    -> order by 1,2;
+----------+----------------------------+------------------------------------------------------------------------------------------------------------------+
| TBL_NAME | PART_NAME                  | LOCATION                                                                                                         |
+----------+----------------------------+------------------------------------------------------------------------------------------------------------------+
| test01   | fab=GVA/lot_partition=TEST | hdfs://namenode01.domain.com:8020/warehouse/tablespace/managed/hive/yannick.db/test01/fab=GVA/lot_partition=TEST |
+----------+----------------------------+------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

But only the root folder is given, and I have not been able to find a table in the Hive metastore displaying this level of detail. The solution simply comes from Hive virtual columns:

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> SELECT input__file__name FROM yannick.test01 WHERE fab="GVA" and lot_partition="TEST";
+----------------------------------------------------+
|                 input__file__name                  |
+----------------------------------------------------+
| hdfs://namenode01.domain.com:8020/warehouse/tablespace/managed/hive/yannick.db/test01/fab=GVA/lot_partition=TEST/delta_0000001_0000001_0000/bucket_00000 |
+----------------------------------------------------+
1 row selected (0.431 seconds)

INSERT OVERWRITE does not delete old directories

If I INSERT OVERWRITE into this table, in the exact same partition, I expect Hive to do the HDFS cleaning automatically, and I surely do not expect the old folder to be kept forever. Unfortunately this is exactly what happens when I insert overwrite into the same partition:

insert overwrite table yannick.test01 partition(fab='GVA', lot_partition='TEST') values(2,'Two');

If I select used ORC files I get, as expected:

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> SELECT input__file__name FROM yannick.test01 WHERE fab="GVA" and lot_partition="TEST";
INFO  : Compiling command(queryId=hive_20200406150529_c05aac38-6933-4a8e-b7ec-6ae9016e67f0): SELECT input__file__name FROM yannick.test01 WHERE fab="GVA" and lot_partition="TEST"
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:input__file__name, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20200406150529_c05aac38-6933-4a8e-b7ec-6ae9016e67f0); Time taken: 0.114 seconds
INFO  : Executing command(queryId=hive_20200406150529_c05aac38-6933-4a8e-b7ec-6ae9016e67f0): SELECT input__file__name FROM yannick.test01 WHERE fab="GVA" and lot_partition="TEST"
INFO  : Completed executing command(queryId=hive_20200406150529_c05aac38-6933-4a8e-b7ec-6ae9016e67f0); Time taken: 0.002 seconds
INFO  : OK
+----------------------------------------------------+
|                 input__file__name                  |
+----------------------------------------------------+
| hdfs://namenode01.domain.com:8020/warehouse/tablespace/managed/hive/yannick.db/test01/fab=GVA/lot_partition=TEST/base_0000002/bucket_00000 |
+----------------------------------------------------+
1 row selected (0.211 seconds)

But if you look at HDFS level:

[hdfs@client ~]$ hdfs dfs -ls /warehouse/tablespace/managed/hive/yannick.db/test01/fab=GVA/lot_partition=TEST/
Found 2 items
drwxrwx---+  - hive hadoop          0 2020-04-06 15:04 /warehouse/tablespace/managed/hive/yannick.db/test01/fab=GVA/lot_partition=TEST/base_0000002
drwxrwx---+  - hive hadoop          0 2020-04-06 14:55 /warehouse/tablespace/managed/hive/yannick.db/test01/fab=GVA/lot_partition=TEST/delta_0000001_0000001_0000
[hdfs@client ~]$ hdfs dfs -ls /warehouse/tablespace/managed/hive/yannick.db/test01/fab=GVA/lot_partition=TEST/*
Found 2 items
-rw-rw----+  3 hive hadoop          1 2020-04-06 15:04 /warehouse/tablespace/managed/hive/yannick.db/test01/fab=GVA/lot_partition=TEST/base_0000002/_orc_acid_version
-rw-rw----+  3 hive hadoop        696 2020-04-06 15:04 /warehouse/tablespace/managed/hive/yannick.db/test01/fab=GVA/lot_partition=TEST/base_0000002/bucket_00000
Found 2 items
-rw-rw----+  3 hive hadoop          1 2020-04-06 14:55 /warehouse/tablespace/managed/hive/yannick.db/test01/fab=GVA/lot_partition=TEST/delta_0000001_0000001_0000/_orc_acid_version
-rw-rw----+  3 hive hadoop        696 2020-04-06 14:55 /warehouse/tablespace/managed/hive/yannick.db/test01/fab=GVA/lot_partition=TEST/delta_0000001_0000001_0000/bucket_00000

This can also be done with the File View interface in Ambari:

[Screenshot: insert_overwrite01]

The former directory has not been deleted, and this happens every time you insert overwrite…

We have also tried to play with the auto.purge table property:

As of Hive 2.3.0 (HIVE-15880), if the table has TBLPROPERTIES (“auto.purge”=”true”) the previous data of the table is not moved to Trash when INSERT OVERWRITE query is run against the table. This functionality is applicable only for managed tables (see managed tables) and is turned off when “auto.purge” property is unset or set to false.

For example:

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> ALTER TABLE yannick.test01 SET TBLPROPERTIES ("auto.purge" = "true");
No rows affected (0.174 seconds)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> describe formatted yannick.test01;
+-------------------------------+----------------------------------------------------+-----------------------------+
|           col_name            |                     data_type                      |           comment           |
+-------------------------------+----------------------------------------------------+-----------------------------+
| # col_name                    | data_type                                          | comment                     |
| val                           | int                                                |                             |
| descr                         | string                                             |                             |
|                               | NULL                                               | NULL                        |
| # Partition Information       | NULL                                               | NULL                        |
| # col_name                    | data_type                                          | comment                     |
| fab                           | string                                             |                             |
| lot_partition                 | string                                             |                             |
|                               | NULL                                               | NULL                        |
| # Detailed Table Information  | NULL                                               | NULL                        |
| Database:                     | yannick                                            | NULL                        |
| OwnerType:                    | USER                                               | NULL                        |
| Owner:                        | hive                                               | NULL                        |
| CreateTime:                   | Mon Apr 06 14:55:48 CEST 2020                      | NULL                        |
| LastAccessTime:               | UNKNOWN                                            | NULL                        |
| Retention:                    | 0                                                  | NULL                        |
| Location:                     | hdfs://namenode01.domain.com:8020/warehouse/tablespace/managed/hive/yannick.db/test01 | NULL                        |
| Table Type:                   | MANAGED_TABLE                                      | NULL                        |
| Table Parameters:             | NULL                                               | NULL                        |
|                               | COLUMN_STATS_ACCURATE                              | {\"BASIC_STATS\":\"true\"}  |
|                               | auto.purge                                         | true                        |
|                               | bucketing_version                                  | 2                           |
|                               | last_modified_by                                   | hive                        |
|                               | last_modified_time                                 | 1586180747                  |
|                               | numFiles                                           | 0                           |
|                               | numPartitions                                      | 0                           |
|                               | numRows                                            | 0                           |
|                               | rawDataSize                                        | 0                           |
|                               | totalSize                                          | 0                           |
|                               | transactional                                      | true                        |
|                               | transactional_properties                           | default                     |
|                               | transient_lastDdlTime                              | 1586180747                  |
|                               | NULL                                               | NULL                        |
| # Storage Information         | NULL                                               | NULL                        |
| SerDe Library:                | org.apache.hadoop.hive.ql.io.orc.OrcSerde          | NULL                        |
| InputFormat:                  | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat    | NULL                        |
| OutputFormat:                 | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat   | NULL                        |
| Compressed:                   | No                                                 | NULL                        |
| Num Buckets:                  | -1                                                 | NULL                        |
| Bucket Columns:               | []                                                 | NULL                        |
| Sort Columns:                 | []                                                 | NULL                        |
| Storage Desc Params:          | NULL                                               | NULL                        |
|                               | serialization.format                               | 1                           |
+-------------------------------+----------------------------------------------------+-----------------------------+
43 rows selected (0.172 seconds)

But here the problem is not about being able to recover data from the Trash in case of a human error: the previous data is simply not deleted at all…

Workarounds we have found, even if we are sad not to be able to rely on such a basic feature (a PySpark sketch of the first one follows the list):

  • TRUNCATE TABLE yannick.test01 PARTITION(fab="GVA", lot_partition="TEST");
  • ALTER TABLE yannick.test01 DROP PARTITION(fab="GVA", lot_partition="TEST");
  • Deleting the old folder manually works, but this is quite dangerous and not natural at all…
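Below is a minimal PySpark sketch of the first workaround through the Hive Warehouse Connector used elsewhere on this blog. It is only an illustration, assuming HWC is on the classpath and the script runs in a session where spark is already defined (pyspark shell or spark-submit): the partition is truncated first, so no old directories are left behind, and then reloaded.

from pyspark_llap import HiveWarehouseSession

hive = HiveWarehouseSession.session(spark).build()
# Empty the target partition first so no old delta/base directories remain on HDFS...
hive.executeUpdate("truncate table yannick.test01 partition(fab='GVA', lot_partition='TEST')")
# ...then reload it (here with the same dummy row as in the test case)
hive.executeUpdate("insert into yannick.test01 partition(fab='GVA', lot_partition='TEST') values(2, 'Two')")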

Spark lineage issue and how to handle it with Hive Warehouse Connector

Preamble

One of my teammates submitted an interesting issue to me. In a Spark script he was reading a table partition, doing some operations on the resulting DataFrame and then trying to overwrite the modified DataFrame back into the same partition.

Obviously this was hanging, hence this blog post… Internally this hanging problem is not a bug but a feature called Spark lineage. It avoids, for example, losing or corrupting your data in case of a crash of your process.

We are using HDP 3.1.4 and so Spark 2.3.2.3.1.4.0-315, so the below script will use the Hive Warehouse Connector (HWC).

Test case and problem

The creation script of my small test table is the following:

drop table yannick.test01 purge;
create table yannick.test01(val int, descr string) partitioned by (fab string, lot_partition string) stored as orc;

insert into yannick.test01 partition(fab='GVA', lot_partition='TEST') values(1,'One');
insert into yannick.test01 partition(fab='GVA', lot_partition='TEST') values(2,'Two');
insert into yannick.test01 partition(fab='GVA', lot_partition='TEST') values(3,'Three');

In Beeline it gives:

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> select * from yannick.test01;
+-------------+---------------+-------------+-----------------------+
| test01.val  | test01.descr  | test01.fab  | test01.lot_partition  |
+-------------+---------------+-------------+-----------------------+
| 1           | One           | GVA         | TEST                  |
| 2           | Two           | GVA         | TEST                  |
| 3           | Three         | GVA         | TEST                  |
+-------------+---------------+-------------+-----------------------+
3 rows selected (0.317 seconds)

To simulate a modification of the current partition in a DataFrame and the write back, I have written the below PySpark script:

>>> from pyspark_llap import HiveWarehouseSession
>>> from pyspark.sql.functions import *
>>> hive = HiveWarehouseSession.session(spark).build()
>>> df01=hive.executeQuery("select * from yannick.test01");
>>> df02=df01.withColumn('val',col('val')+1)
>>> df02.show()
+---+-----+---+-------------+
|val|descr|fab|lot_partition|
+---+-----+---+-------------+
|  4|Three|GVA|         TEST|
|  3|  Two|GVA|         TEST|
|  2|  One|GVA|         TEST|
+---+-----+---+-------------+

I have added 1 to the val column values. But when I try to write the DataFrame back, this is the command that hangs:

df02.write.mode('overwrite').format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR).option("partition", "fab,lot_partition").option('table','yannick.test01').save()

While if you try to append the data it works well:

df02.write.mode('append').format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR).option("partition", "fab,lot_partition").option('table','yannick.test01').save()

My teammate implemented a solution that writes to a temporary table (through HWC) and then launches another process (because doing the two operations in a single script hangs) that selects from this temporary table to insert overwrite back into the final table. Working, but not sexy.

Spark lineage solution

In fact, as we will see, there is no magic solution, and overall this Spark lineage is a good thing. I have simply found a way to do it all in one single script. And, at least, it simplifies our scheduling.

I write this intermediate DataFrame in a Spark ORC table (versus a Hive table accessible through HWC):

>>> df02.write.format('orc').mode('overwrite').saveAsTable('temporary_table')
>>> df03=sql('select * from temporary_table');
>>> df03.show()
+---+-----+---+-------------+
|val|descr|fab|lot_partition|
+---+-----+---+-------------+
|  4|Three|GVA|         TEST|
|  3|  Two|GVA|         TEST|
|  2|  One|GVA|         TEST|
+---+-----+---+-------------+

Later you can use things like:

sql('select * from temporary_table').show()
sql('show tables').show()
sql('drop table temporary_table purge')

I was also wondering where those tables go, because I did not see them in the default database of the traditional Hive managed table directory:

[hdfs@client_node ~]$ hdfs dfs -ls  /warehouse/tablespace/managed/
Found 1 items
drwxrwxrwx+  - hive hadoop          0 2020-03-09 16:33 /warehouse/tablespace/managed/hive

The destination is set by these Ambari/Spark parameters:

[Screenshot: spark_lineage01]
[hdfs@server ~]$ hdfs dfs -ls /apps/spark/warehouse
Found 1 items
drwxr-xr-x   - mfgdl_ingestion hdfs          0 2020-03-20 14:21 /apps/spark/warehouse/temporary_table
And finally, still in the same script, I insert overwrite the final Hive table back from this temporary Spark table:

sql('select * from temporary_table').write.mode('overwrite').format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR).\
option("partition", "fab,lot_partition").option('table','yannick.test01').save()

It also clarifies (really?) a bit the story of the Spark metastore vs the Hive metastore…
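To summarize, here is a minimal end-to-end PySpark sketch of this single-script workflow. It only stitches together the commands shown above; the table, partition columns and HWC connector are those of my test environment, so adapt them to yours:

from pyspark_llap import HiveWarehouseSession
from pyspark.sql.functions import col

hive = HiveWarehouseSession.session(spark).build()

# 1. Read the Hive partition through HWC and apply the transformation
df01 = hive.executeQuery("select * from yannick.test01")
df02 = df01.withColumn('val', col('val') + 1)

# 2. Break the lineage by materializing the result in a Spark-managed ORC table
df02.write.format('orc').mode('overwrite').saveAsTable('temporary_table')

# 3. Read the Spark table back and overwrite the original Hive partition through HWC
sql('select * from temporary_table').write.mode('overwrite').\
    format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR).\
    option("partition", "fab,lot_partition").option('table', 'yannick.test01').save()

# 4. Drop the intermediate Spark table
sql('drop table temporary_table purge')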

Spark lineage second problem and partial solution

The previous solution went well until my teammate told me that it was still hanging and shared his code with me. I noticed he was using, for performance reasons, the persist() function when reading the source table into a DataFrame:

scala> val df01=hive.executeQuery("select * from yannick.test01").persist()

I have found a mention of this in the Close HiveWarehouseSession operations documentation:

Spark can invoke operations, such as cache(), persist(), and rdd(), on a DataFrame you obtain from running a HiveWarehouseSession executeQuery() or table(). The Spark operations can lock Hive resources. You can release any locks and resources by calling the HiveWarehouseSession close().

So I tried the below Spark Scala code:

scala> import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession

scala> import com.hortonworks.hwc.HiveWarehouseSession._
import com.hortonworks.hwc.HiveWarehouseSession._

scala> val HIVE_WAREHOUSE_CONNECTOR="com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector"
HIVE_WAREHOUSE_CONNECTOR: String = com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector

scala> val hive = HiveWarehouseSession.session(spark).build()
hive: com.hortonworks.spark.sql.hive.llap.HiveWarehouseSessionImpl = com.hortonworks.spark.sql.hive.llap.HiveWarehouseSessionImpl@25f3207e

scala> val df01=hive.executeQuery("select * from yannick.test01").persist()
df01: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [val: int, descr: string ... 2 more fields]

scala> df01.show()
20/03/24 18:29:51 WARN TaskSetManager: Stage 0 contains a task of very large size (439 KB). The maximum recommended task size is 100 KB.
+---+-----+---+-------------+
|val|descr|fab|lot_partition|
+---+-----+---+-------------+
|  3|Three|GVA|         TEST|
|  1|  One|GVA|         TEST|
|  2|  Two|GVA|         TEST|
+---+-----+---+-------------+


scala> val df02=df01.withColumn("val",$"val" + 1)
df02: org.apache.spark.sql.DataFrame = [val: int, descr: string ... 2 more fields]

scala> df02.show()
20/03/24 18:30:16 WARN TaskSetManager: Stage 1 contains a task of very large size (439 KB). The maximum recommended task size is 100 KB.
+---+-----+---+-------------+
|val|descr|fab|lot_partition|
+---+-----+---+-------------+
|  4|Three|GVA|         TEST|
|  2|  One|GVA|         TEST|
|  3|  Two|GVA|         TEST|
+---+-----+---+-------------+


scala> hive.close()

scala> val hive = HiveWarehouseSession.session(spark).build()
hive: com.hortonworks.spark.sql.hive.llap.HiveWarehouseSessionImpl = com.hortonworks.spark.sql.hive.llap.HiveWarehouseSessionImpl@1c9f274d

scala> df02.write.mode("overwrite").format(HIVE_WAREHOUSE_CONNECTOR).option("partition", "fab,lot_partition").option("table","yannick.test01").save()
20/03/24 18:33:44 WARN TaskSetManager: Stage 2 contains a task of very large size (439 KB). The maximum recommended task size is 100 KB.

And voilà, my table has been correctly written back without any hanging. It looked like a marvelous solution until I received feedback from my teammate who is using PySpark:

>>> hive.close()
Traceback (most recent call last):
  File "", line 1, in 
AttributeError: 'HiveWarehouseSessionImpl' object has no attribute 'close'

I have had a look at the HWC source code and apparently the close() function has not been exposed for use in PySpark… Definitely, since we moved to HDP 3, this HWC implementation does not look very mature and we have already identified many issues with it…

If someone has found something interesting please share, and I might come back to this article if we find a sexier solution…

How to install and configure a standalone TEZ UI with HDP 3.x

Preamble

We recently upgraded our Hadoop platform to Hortonworks Data Platform (HDP) 3.1.4 (the latest freely available edition) and Ambari 2.7.4. We had tons of issues during and even after the migration, so I might publish a few articles around this painful period. One annoying drawback is that the TEZ UI is gone in Ambari 2.7.x:

[Screenshot: tez-ui01]

I have found many Cloudera and Stack Overflow forum discussions partially explaining what to do, but they are most of the time contradictory, so I decided to write my own with the bare minimum of what is really required.

TEZ UI installation

The official Apache Tez UI installation page steps are very easy. I have personally downloaded the latest release available at the time of writing this article, so TEZ UI 0.9.2, at this URL: https://repository.apache.org/content/repositories/releases/org/apache/tez/tez-ui/0.9.2/

I have taken the tez-ui-0.9.2.war file.

I have installed the Apache httpd of my RedHat release with the yum command (start it with systemctl start httpd, and ensure firewalld is stopped or well configured):

[root@ambari_server tez-ui]# yum list httpd
Loaded plugins: product-id, search-disabled-repos, subscription-manager
Installed Packages
httpd.x86_64                                                          2.4.6-67.el7_4.6                                         @rhel-7-server-rpms
Available Packages
httpd.x86_64                                                          2.4.6-80.el7                                             Server  

Then I have unzipped the tez-ui-0.9.2.war file in the /var/www/html/tez-ui directory (/var/www/html being the default Apache DocumentRoot directory, but this can be changed in the Apache configuration):

[root@ambari_server tez-ui]# pwd
/var/www/html/tez-ui
[root@ambari_server tez-ui]# ll
total 4312
drwxr-xr-x 4 root root     151 Mar 19  2019 assets
drwxr-xr-x 2 root root      25 Feb 21 16:20 config
drwxr-xr-x 2 root root    4096 Mar 19  2019 fonts
-rw-r--r-- 1 root root    1777 Mar 19  2019 index.html
drwxr-xr-x 3 root root      75 Mar 19  2019 META-INF
-rw-r--r-- 1 root root 2200884 Feb 21 16:20 tez-ui-0.9.2.war
drwxr-xr-x 3 root root      36 Mar 19  2019 WEB-INF  

Edit the /var/www/html/tez-ui/config/configs.env file and change only two parameters (you need to uncomment them as well by removing the leading //):

  • timeline must be set to Ambari value of yarn.timeline-service.webapp.address (YARN configuration)
  • rm must be set to Ambari value of yarn.resourcemanager.webapp.address (YARN configuration)

The two above parameter values look like server.domain.com:8188 and server.domain.com:8088. Ensure with a web browser that they return the list of completed and running applications on your cluster.

Ambari values to copy/paste:

[Screenshot: tez-ui02]

And voilà, installation done; you can access the TEZ UI at this URL: http://ambari_server.domain.com/tez-ui/

TEZ UI configuration

The previous part was really straightforward, but the Hadoop configuration was quite cumbersome as many articles say opposite things…

Some are even quite dangerous with HDP 3.1.4… For example, setting hive.exec.failure.hooks = org.apache.hadoop.hive.ql.hooks.ATSHook, hive.exec.post.hooks = org.apache.hadoop.hive.ql.hooks.ATSHook,org.apache.atlas.hive.hook.HiveHook and hive.exec.pre.hooks = org.apache.hadoop.hive.ql.hooks.ATSHook corrupted Beeline access to databases…

What I have set in the Custom tez-site part is the below parameter list:

  • tez.history.logging.service.class = org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService
  • tez.tez-ui.history-url.base = http://ambari_server.domain.com/tez-ui/
  • tez.am.tez-ui.history-url.template = __HISTORY_URL_BASE__?viewPath=/#/tez-app/__APPLICATION_ID__

Those other parameters must also have the below values, but as those are the defaults this should not be an issue:

  • yarn.timeline-service.enabled = true (in YARN and in TEZ configuration)
  • hive_timeline_logging_enabled = true
  • yarn.acl.enable = false

Remark:
If you want to keep yarn.acl.enable = true (which sounds like a good idea) you might need to add something to yarn.admin.acl, i.e. set it to activity_analyzer,yarn,admin,operator,…. The only issue is that so far I have not found what to add. See the next chapter for a trick to work around this issue…

If you have multiple servers in your landscape and your Resource Manager server is not the one running TEZ UI you might need to set as well:

  • yarn.timeline-service.http-cross-origin.allowed-origins = *
  • yarn.timeline-service.hostname = Timeline Service server name (FQDN)

To be able to display queries in the Hive Queries panel of the TEZ UI home page I have also set:

  • hive.exec.failure.hooks = org.apache.hadoop.hive.ql.hooks.ATSHook
  • hive.exec.post.hooks = org.apache.hadoop.hive.ql.hooks.ATSHook
  • hive.exec.pre.hooks = org.apache.hadoop.hive.ql.hooks.ATSHook

Then, going to http://ambari_server.domain.com/tez-ui/, I was only seeing old pre-upgrade queries. To unlock the situation I went to the YARN Resource Manager and, in one of the running queries, I selected the ApplicationMaster link:

[Screenshot: tez-ui03]

And I got redirected to TEZ UI:

[Screenshot: tez-ui04]

And starting from this the TEZ UI home page (http://ambari_server.domain.com/tez-ui/) was correctly displaying running figures.

With the extra hook parameters the Hive Queries panel also displays the query list:

[Screenshot: tez-ui05]

Or more precisely:

[Screenshot: tez-ui06]

TEZ UI bug correction trick

If at some point you encounter an issue and no figures are displayed, you can get some help from the developer tools of your browser (Firefox or Chrome). Here is a screenshot using Firefox:

[Screenshot: tez-ui07]

Here, using the Network tab and looking at the response from my NameNode, I see that http://namenode.domain.com:8188/ws/v1/timeline/TEZ_DAG_ID?limit=11&_=1583236988485 is returning an empty JSON answer:

{"entities":[]}

So obviously nothing is displayed… This is how I saw that yarn.acl.enable was playing an important role in displaying this resource…
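To automate this check outside of the browser developer tools, a small Python snippet like the one below can help. It is only a sketch assuming the requests module is installed and the Timeline Server is not Kerberos-protected; the host, port and user name are placeholders to adapt to your environment:

import requests

# Hypothetical host/port: use your yarn.timeline-service.webapp.address value
url = "http://namenode.domain.com:8188/ws/v1/timeline/TEZ_DAG_ID"
# Passing an explicit user.name is the trick mentioned just below
response = requests.get(url, params={"limit": 11, "user.name": "whatever"})
entities = response.json().get("entities", [])
print("Number of TEZ DAG entities returned: %d" % len(entities))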

I also noticed that accessing something like http://namenode.domain.com:8188/ws/v1/timeline/TEZ_DAG_ID?user.name=whatever resolves the issue for… well, some time, not even for the current session of your web browser… There is something weird here…

I also have to say that the documentation on this part is not very well done. Strange, knowing the increasing interest in security nowadays…

PySpark and Spark Scala Jupyter kernels cluster integration

Preamble

Even if the standard tool for your data scientists on a Hortonworks Data Platform (HDP) is the Zeppelin Notebook, this population will most probably want to use Jupyter Lab/Notebook, which has quite some momentum in this domain.

As you might guess, with the new Hive Warehouse Connector (HWC) to access Hive tables in Spark comes a bunch of problems to correctly configure Jupyter Lab/Notebook…

In short the idea is to add additional Jupyter kernels on top of the default Python 3 one. To do this, either you create them on your own by writing a kernel.json file, or you install one of the packages that help you integrate the language you wish.

In this article I assume that you already have a working Anaconda installation on your server. The installation is pretty straightforward, just execute the Anaconda3-2019.10-Linux-x86_64.sh shell script (in my case) and acknowledge the licence information displayed.

JupyterHub installation

If like me you are behind a corporate proxy, the first thing to do is to configure it to be able to download conda packages over the Internet:

(base) [root@server ~]# cat .condarc
auto_activate_base: false

proxy_servers:
    http: http://account:password@proxy_server.domain.com:proxy_port/
    https: http://account:password@proxy_server.domain.com:proxy_port/


ssl_verify: False

Create the JupyterHub conda environment with (chosen name is totally up to you):

(base) [root@server ~]# conda create --name jupyterhub
Collecting package metadata (current_repodata.json): done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 4.7.12
  latest version: 4.8.2

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /opt/anaconda3/envs/jupyterhub



Proceed ([y]/n)? y

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate jupyterhub
#
# To deactivate an active environment, use
#
#     $ conda deactivate

If, like me, you received a warning about an obsolete release of conda, upgrade it with:

(base) [root@server ~]# conda update -n base -c defaults conda
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/anaconda3

  added / updated specs:
    - conda


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    backports.functools_lru_cache-1.6.1|             py_0          11 KB
    conda-4.8.2                |           py37_0         2.8 MB
    future-0.18.2              |           py37_0         639 KB
    ------------------------------------------------------------
                                           Total:         3.5 MB

The following packages will be UPDATED:

  backports.functoo~                               1.5-py_2 --> 1.6.1-py_0
  conda                                       4.7.12-py37_0 --> 4.8.2-py37_0
  future                                      0.17.1-py37_0 --> 0.18.2-py37_0


Proceed ([y]/n)? y


Downloading and Extracting Packages
future-0.18.2        | 639 KB    | ######################################################################################### | 100%
backports.functools_ | 11 KB     | ######################################################################################### | 100%
conda-4.8.2          | 2.8 MB    | ######################################################################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

(base) [root@server ~]# conda -V
conda 4.8.2

Activate your newly created Conda environment and install jupyterhub and notebook inside it:

(base) [root@server ~]# conda activate jupyterhub
(jupyterhub) [root@server ~]# conda install -c conda-forge jupyterhub
(jupyterhub) [root@server ~]# conda install -c conda-forge notebook

Remark
It is also possible to install the newest JupyterLab in JupyterHub instead of Jupyter Notebook. If you do so you have to set c.Spawner.default_url = '/lab' to instruct JupyterHub to load JupyterLab instead of Jupyter Notebook. Below I will try to mix screenshots, but clearly the future is JupyterLab and not Jupyter Notebook. JupyterHub just provides a multi-user environment.

Install JupyterLab with:

conda install -c conda-forge jupyterlab

Execute JupyterHub by just typing the command jupyterhub and access its URL at http://server.domain.com:8000. All options can obviously be configured...

As an example, here is how to activate HTTPS for your JupyterHub using a (free) self-signed certificate. It is not optimal but better than plain HTTP...

Generate the config file using:

(jupyterhub) [root@server ~]# jupyterhub --generate-config -f jupyterhub_config.py
Writing default config to: jupyterhub_config.py

Generate the key and certificate using the below command (taken from the OpenSSL Cookbook):

(jupyterhub) [root@server ~]# openssl req -new -newkey rsa:2048 -x509 -nodes -keyout root-ocsp.key -out root-ocsp.csr
Generating a RSA private key
...................................+++++
........................................+++++
writing new private key to 'root-ocsp.key'
-----
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [AU]:CH
State or Province Name (full name) [Some-State]:Geneva
Locality Name (eg, city) []:Geneva
Organization Name (eg, company) [Internet Widgits Pty Ltd]:Company Name
Organizational Unit Name (eg, section) []:
Common Name (e.g. server FQDN or YOUR name) []:servername.domain.com
Email Address []:

Finally I have configured only the below parameters in the jupyterhub_config.py configuration file:

c.JupyterHub.bind_url = 'https://servername.domain.com'
c.JupyterHub.ssl_cert = 'root-ocsp.csr'
c.JupyterHub.ssl_key = 'root-ocsp.key'
c.Spawner.default_url = '/lab' # Unset to keep Jupyter Notebook

Simply execute the jupyterhub command to run JupyterHub; of course, creating a service that starts at server boot is more than recommended.

Then, accessing the HTTPS URL, you see the login window without the HTTP warning (you will have to add the self-signed certificate as a trusted one in your browser).

Jupyter kernels manual configuration

Connect with an existing OS account created on the server where JupyterHub is running:

[Screenshot: jupyterhub01]

Create a new Python 3 notebook (yannick_python.ipynb in my example, but as you can see I have many others):

[Screenshot: jupyterhub02]

And this below dummy example should work:

[Screenshot: jupyterhub03]

Obviously our goal here is not to do Python but Spark. To manually create a PySpark kernel, create the kernel directory in the kernels folder of your JupyterHub installation:

[root@server ~]# cd /opt/anaconda3/envs/jupyterhub/share/jupyter/kernels
(jupyterhub) [root@server kernels]# ll
total 0
drwxr-xr-x 2 root root 69 Feb 18 14:20 python3
(jupyterhub) [root@server kernels]# mkdir pyspark_kernel

Then create the below kernel.json file in it:

[root@server pyspark]# cat kernel.json
{
  "display_name": "PySpark_kernel",
  "language": "python",
  "argv": [ "/opt/anaconda3/bin/python", "-m", "ipykernel", "-f", "{connection_file}" ],
  "env": {
    "SPARK_HOME": "/usr/hdp/current/spark2-client/",
    "PYSPARK_PYTHON": "/opt/anaconda3/bin/python",
    "PYTHONPATH": "/usr/hdp/current/spark2-client/python/:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip",
    "PYTHONSTARTUP": "/usr/hdp/current/spark2-client/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master yarn --queue llap --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip pyspark-shell"
  }
}

You can confirm it is taken into account with:

(jupyterhub) [root@server ~]# jupyter-kernelspec list
Available kernels:
  pyspark_kernel    /opt/anaconda3/envs/jupyterhub/share/jupyter/kernels/pyspark_kernel
  python3           /opt/anaconda3/envs/jupyterhub/share/jupyter/kernels/python3

If you create a new Jupyter notebook and choose PySpark_kernel you should be able to execute the below sample code:

jupyterhub04
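
The screenshot content is not reproduced here, but a minimal sanity check in a notebook using this PySpark_kernel could be along the lines of the sketch below (the spark session is created by the pyspark/shell.py startup script referenced in kernel.json):

# `spark` is the SparkSession created on YARN by the PYTHONSTARTUP script.
print(spark.version)

# Tiny DataFrame round trip to confirm the executors are reachable.
df = spark.range(100)
print(df.count())   # expected output: 100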

So far I have not found how to manually create a Spark Scala kernel; any insight is welcome...

Sparkmagic Jupyter kernels configuration

In the list of available Jupyter kernels there is one that (IMHO) comes up more often than the competition: Sparkmagic.

Install it with:

conda install -c conda-forge sparkmagic

Once done, install the Sparkmagic kernels with:

(jupyterhub) [root@server kernels]# cd /opt/anaconda3/envs/jupyterhub/lib/python3.8/site-packages/sparkmagic/kernels
(jupyterhub) [root@server kernels]# ll
total 28
-rw-rw-r-- 2 root root    46 Jan 23 14:36 __init__.py
-rw-rw-r-- 2 root root 20719 Jan 23 14:36 kernelmagics.py
drwxr-xr-x 2 root root    72 Feb 18 16:48 __pycache__
drwxr-xr-x 3 root root   104 Feb 18 16:48 pysparkkernel
drwxr-xr-x 3 root root   102 Feb 18 16:48 sparkkernel
drwxr-xr-x 3 root root   103 Feb 18 16:48 sparkrkernel
drwxr-xr-x 3 root root    95 Feb 18 16:48 wrapperkernel
(jupyterhub) [root@server kernels]# jupyter-kernelspec install sparkrkernel
[InstallKernelSpec] Installed kernelspec sparkrkernel in /usr/local/share/jupyter/kernels/sparkrkernel
(jupyterhub) [root@server kernels]# jupyter-kernelspec install sparkkernel
[InstallKernelSpec] Installed kernelspec sparkkernel in /usr/local/share/jupyter/kernels/sparkkernel
(jupyterhub) [root@server kernels]# jupyter-kernelspec install pysparkkernel
[InstallKernelSpec] Installed kernelspec pysparkkernel in /usr/local/share/jupyter/kernels/pysparkkernel
(jupyterhub) [root@server kernels]# jupyter-kernelspec list
Available kernels:
  pysparkkernel    /opt/anaconda3/envs/jupyterhub/share/jupyter/kernels/pysparkkernel
  python3          /opt/anaconda3/envs/jupyterhub/share/jupyter/kernels/python3
  sparkkernel      /opt/anaconda3/envs/jupyterhub/share/jupyter/kernels/sparkkernel
  sparkrkernel     /opt/anaconda3/envs/jupyterhub/share/jupyter/kernels/sparkrkernel

Enable the server extension with:

(jupyterhub) [root@server ~]# jupyter serverextension enable --py sparkmagic
Enabling: sparkmagic
- Writing config: /root/.jupyter
    - Validating...
      sparkmagic 0.15.0 OK

In the home directory of the account with which you will connect to JupyterHub, create a .sparkmagic directory and, inside it, a config.json file that is a copy of the provided example configuration.
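
A minimal sketch of that step; the source location of the example file is an assumption (Sparkmagic ships it as example_config.json), so adapt the path to your installation:

# Create the per-user Sparkmagic configuration from the provided example file.
mkdir -p ~/.sparkmagic
cp /path/to/example_config.json ~/.sparkmagic/config.json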

In this file, for each kernel_xxx_credentials section, modify at least the url to match your Livy server name and port:

"kernel_python_credentials" : {
  "username": "",
  "password": "",
  "url": "http://livyserver.domain.com:8999",
  "auth": "None"
},

And the part on which I have spent quite a lot of time: the session_configs section, as below, to add the HWC connector information:

  "session_configs": {
  "driverMemory": "1000M",
  "executorCores": 2,
  "conf": {"spark.jars": "file:///usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar",
           "spark.submit.pyFiles": "file:///usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip"}
},

If you then create a new notebook using the PySpark or Spark kernel, depending on whether you want to use Python or Scala, you should be able to run the examples below.

If you use Jupyter Notebook the first command to execute is the magic command %load_ext sparkmagic.magics, then create a session using the magic command %manage_spark and select either Scala or Python (the question of the R language remains, but I do not use it). If you use JupyterLab you can directly start to work, as the %manage_spark command does not work there: the Livy session should be automatically created while executing the first command. It should also be the same with Jupyter Notebook, but I had a few issues with this, so...

A few other magic commands are quite interesting:

  • %%help to get the list of available commands
  • %%info to see if your Livy session is still active (many issues can come from this)

PySpark example (the SQL part is still broken due to HWC):

jupyterhub05
jupyterhub06
jupyterhub07
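
The screenshots are not reproduced here; the kind of PySpark code run in such a notebook is sketched below (database and table names are hypothetical, the spark session is provided by the Livy-backed kernel):

# HWC session built on top of the SparkSession created by Livy.
from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()

hive.showDatabases().show(10, False)

# Read a Hive managed table through LLAP (hypothetical database/table).
df = hive.executeQuery("select * from some_db.some_table limit 10")
df.show()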

Scala example (same story for SQL):

jupyterhub08
jupyterhub09
jupyterhub10

Hive Warehouse Connector integration in Zeppelin Notebook https://blog.yannickjaquier.com/hadoop/hive-warehouse-connector-integration-in-zeppelin-notebook.html Thu, 21 May 2020 13:44:17 +0000

Table of contents

Preamble

How to configure Hive Warehouse Connector (HWC) integration in Zeppelin Notebook? Since we upgraded to Hortonworks Data Platform (HDP) 3.1.4 we have had to fight everywhere to integrate the new HWC way of working with Spark. The connector in itself is not very complex to use: it requires only a small modification of your Spark code (Scala or Python). What is complex is its integration in the existing tools, as well as how to now run pyspark and spark-shell (Scala) with it in your environment.

On top of this, as clearly stated in the official documentation, you need HWC and LLAP (Live Long And Process, or Low-Latency Analytical Processing) to read Hive managed tables from Spark, which is one of the first operations you will do in most of your Spark scripts. So in Ambari activate and configure it (number of nodes, memory, concurrent queries):

hwc01

HWC on the command line

Before jumping to Zeppelin let's quickly see how you now execute Spark on the command line with HWC.

You should have already configured the Spark parameters below (as well as reverted the hive.fetch.task.conversion value, see my other article on this):

  • spark.hadoop.hive.llap.daemon.service.hosts = @llap0
  • spark.hadoop.hive.zookeeper.quorum = zookeeper01.domain.com:2181,zookeeper03.domain.com:2181,zookeeper03.domain.com:2181
  • spark.sql.hive.hiveserver2.jdbc.url = jdbc:hive2://zookeeper01.domain.com:2181,zookeeper03.domain.com:2181,zookeeper03.domain.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive

To execute spark-shell you must now add the jar file below (in the example I'm in YARN mode on the llap YARN queue):

spark-shell --master yarn --queue llap --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar

To test it is working you can use the test code below:

val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(spark).build()
hive.showDatabases().show(100, false)

To execute pyspark you must add the jar file below (I'm in YARN mode on the llap YARN queue) as well as the Python zip file:

pyspark --master yarn --queue llap --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip

To test all is good you can use the script below:

from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()
hive.showDatabases().show(10, False)

To use spark-submit you do exactly the same thing:

spark-submit --master yarn --queue llap --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip script_to_execute
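
The difference with the interactive shells is that a submitted script has to build its own SparkSession. A minimal sketch of such a script_to_execute (the application name is arbitrary):

# Sketch of a script submitted with the command above.
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

# pyspark and spark-shell create the session for you, spark-submit does not.
spark = SparkSession.builder.appName("hwc_smoke_test").getOrCreate()

hive = HiveWarehouseSession.session(spark).build()
hive.showDatabases().show(10, False)

spark.stop()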

Zeppelin Spark2 interpreter configuration

To make the Spark2 Zeppelin interpreter work with HWC you need to add two parameters to its configuration:

  • spark.jars = /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar
  • spark.submit.pyFiles = /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip

Note:
Notice the uppercase F in spark.submit.pyFiles; I have lost half a day because some blog articles on the internet misspell it.

Save and restart the interpreter, and if you test the code below in a new notebook you should get a result (and not only the default database):

hwc02

Remark:
As you can see above, the %spark2.sql option of the interpreter is not configured to use HWC. So far I have not found how to overcome this. This could be an issue because this interpreter option is a nice feature to build graphs from direct SQL commands…

It also works well in Scala:

hwc03

Zeppelin Livy2 interpreter configuration

You might wonder why configure the Livy2 interpreter when the first one is working fine. Well, I had to configure it for the JupyterHub Sparkmagic configuration that we will see in another blog post, so here it is…

To make the Livy2 Zeppelin interpreter work with HWC you need to add two parameters to its configuration:

  • livy.spark.jars = file:///usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar
  • livy.spark.submit.pyFiles = file:///usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip

Note:
The second parameter was just a good guess from the Spark2 interpreter setting. Also note the file:// prefix to instruct the interpreter to look on the local file system and not on HDFS.

Also ensure the Livy URL is set to the server (and port) of your Livy process (a quick curl check is sketched after the parameter):

  • zeppelin.livy.url = http://livyserver.domain.com:8999
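
A quick way to confirm the interpreter can reach that endpoint is to list the active Livy sessions from the Zeppelin host (a sketch using the URL configured above):

# Should return a JSON document with the currently running Livy sessions.
curl -s http://livyserver.domain.com:8999/sessions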

I have also faced this issue:

org.apache.zeppelin.livy.LivyException: Error with 400 StatusCode: "requirement failed: Local path /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar cannot be added to user sessions."
	at org.apache.zeppelin.livy.BaseLivyInterpreter.callRestAPI(BaseLivyInterpreter.java:755)
	at org.apache.zeppelin.livy.BaseLivyInterpreter.createSession(BaseLivyInterpreter.java:337)
	at org.apache.zeppelin.livy.BaseLivyInterpreter.initLivySession(BaseLivyInterpreter.java:209)
	at org.apache.zeppelin.livy.LivySharedInterpreter.open(LivySharedInterpreter.java:59)
	at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
	at org.apache.zeppelin.livy.BaseLivyInterpreter.getLivySharedInterpreter(BaseLivyInterpreter.java:190)
	at org.apache.zeppelin.livy.BaseLivyInterpreter.open(BaseLivyInterpreter.java:163)
	at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:617)
	at org.apache.zeppelin.scheduler.Job.run(Job.java:188)
	at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:140)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

To solve it, add the configuration below to your Livy2 server (Custom livy2-conf in Ambari):

  • livy.file.local-dir-whitelist = /usr/hdp/current/hive_warehouse_connector/

Working in Scala:

hwc04

In Python:

hwc05

But same as for the Spark2 interpreter, the %livy2.sql option is broken for the exact same reason:

hwc06
