Preamble
How do you configure Hive Warehouse Connector (HWC) integration in a Zeppelin notebook? Since we upgraded to Hortonworks Data Platform (HDP) 3.1.4 we have had to fight everywhere to integrate this new way of working in Spark with HWC. The connector itself is not very complex to use; it only requires a small modification of your Spark code (Scala or Python). What is complex is its integration in the existing tools, as well as how to now run pyspark and spark-shell (Scala) with HWC in your environment.
On top of this, as clearly stated in the official documentation, you need HWC and LLAP (Live Long And Process, or Low-Latency Analytical Processing) to read Hive managed tables from Spark, which is one of the first operations you will do in most of your Spark scripts. So activate and configure it in Ambari (number of nodes, memory, concurrent queries):
HWC on the command line
Before jumping to Zeppelin, let's quickly see how you now execute Spark on the command line with HWC.
You should already have configured the below Spark parameters (as well as reverted the hive.fetch.task.conversion value, see my other article on this); if you prefer to set them per application instead, see the sketch after this list:
- spark.hadoop.hive.llap.daemon.service.hosts = @llap0
- spark.hadoop.hive.zookeeper.quorum = zookeeper01.domain.com:2181,zookeeper02.domain.com:2181,zookeeper03.domain.com:2181
- spark.sql.hive.hiveserver2.jdbc.url = jdbc:hive2://zookeeper01.domain.com:2181,zookeeper02.domain.com:2181,zookeeper03.domain.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive
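These parameters are normally set cluster-wide in Ambari, so the session created by spark-shell or pyspark already carries them. If you would rather set them per application (for example for a specific spark-submit job), here is a minimal PySpark sketch, assuming the same host names as above and a hypothetical application name:

from pyspark.sql import SparkSession

# Sketch only: these values simply mirror the cluster-wide settings listed above.
spark = (SparkSession.builder
         .appName("hwc-example")  # hypothetical name
         .config("spark.hadoop.hive.llap.daemon.service.hosts", "@llap0")
         .config("spark.hadoop.hive.zookeeper.quorum",
                 "zookeeper01.domain.com:2181,zookeeper02.domain.com:2181,zookeeper03.domain.com:2181")
         .config("spark.sql.hive.hiveserver2.jdbc.url",
                 "jdbc:hive2://zookeeper01.domain.com:2181,zookeeper02.domain.com:2181,zookeeper03.domain.com:2181/;"
                 "serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive")
         .getOrCreate())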
To execute spark-shell you must now add the below jar file (in the example I'm in YARN mode on the llap YARN queue):
spark-shell --master yarn --queue llap --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar
To test that it is working you can use the below test code:
val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(spark).build()
hive.showDatabases().show(100, false)
To execute pyspark you must add the below jar file (I'm in YARN mode on the llap YARN queue) and Python file:
pyspark --master yarn --queue llap --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip
To check that all is good you can use the below script:
from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()
hive.showDatabases().show(10, False)
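Once listing databases works, the next step is usually reading a Hive managed table through LLAP. A minimal sketch, with hypothetical database and table names:

from pyspark_llap import HiveWarehouseSession

hive = HiveWarehouseSession.session(spark).build()
# The query is executed through HiveServer2 Interactive (LLAP) and comes back as a Spark DataFrame.
df = hive.executeQuery("SELECT * FROM my_database.my_managed_table LIMIT 10")  # hypothetical names
df.show()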
To use spark-submit you do exactly the same thing:
spark-submit --master yarn --queue llap --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip script_to_execute
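Here script_to_execute stands for your own Python script. A minimal sketch of what such a script could look like, assuming the Spark parameters above are already set cluster-wide:

from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

# Unlike pyspark, spark-submit does not create a SparkSession for you.
spark = SparkSession.builder.appName("hwc-batch-example").getOrCreate()  # hypothetical name

hive = HiveWarehouseSession.session(spark).build()
hive.showDatabases().show(10, False)

spark.stop()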
Zeppelin Spark2 interpreter configuration
To make the Zeppelin Spark2 interpreter work with HWC you need to add two parameters to its configuration:
- spark.jars = /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar
- spark.submit.pyFiles = /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip
Note:
Notice the uppercase F in spark.submit.pyFiles; I lost half a day because some blog articles on the internet misspell it.
Save and restart the interpreter; if you test the below code in a new notebook you should get a result (and not only the default database names):
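A sketch of such a test paragraph, using the PySpark flavour of the interpreter:

%spark2.pyspark
from pyspark_llap import HiveWarehouseSession

hive = HiveWarehouseSession.session(spark).build()
hive.showDatabases().show(10, False)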
Remark:
As you can see above, the %spark2.sql option of the interpreter is not configured to use HWC. So far I have not found a way to overcome this. This could be an issue because this interpreter option is a nice feature for making graphs from direct SQL commands…
It also works well in Scala:
Zeppelin Livy2 interpreter configuration
You might wonder why I am configuring the Livy2 interpreter when the first one is working fine. Well, I had to configure it for the JupyterHub Sparkmagic configuration that we will see in another blog post, so here it is…
To make the Zeppelin Livy2 interpreter work with HWC you need to add two parameters to its configuration:
- livy.spark.jars = file:///usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar
- livy.spark.submit.pyFiles = file:///usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip
Note:
The second parameter was just a good guess based on the Spark2 interpreter setting. Also note the file:// prefix to instruct the interpreter to look on the local file system and not on HDFS.
Also ensure the Livy URL is correctly set to the server (and port) of your Livy process:
- zeppelin.livy.url = http://livyserver.domain.com:8999
I have also faced this issue:
org.apache.zeppelin.livy.LivyException: Error with 400 StatusCode: "requirement failed: Local path /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar cannot be added to user sessions."
    at org.apache.zeppelin.livy.BaseLivyInterpreter.callRestAPI(BaseLivyInterpreter.java:755)
    at org.apache.zeppelin.livy.BaseLivyInterpreter.createSession(BaseLivyInterpreter.java:337)
    at org.apache.zeppelin.livy.BaseLivyInterpreter.initLivySession(BaseLivyInterpreter.java:209)
    at org.apache.zeppelin.livy.LivySharedInterpreter.open(LivySharedInterpreter.java:59)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
    at org.apache.zeppelin.livy.BaseLivyInterpreter.getLivySharedInterpreter(BaseLivyInterpreter.java:190)
    at org.apache.zeppelin.livy.BaseLivyInterpreter.open(BaseLivyInterpreter.java:163)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:617)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:188)
    at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:140)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
To solve it, add the below configuration to your Livy2 server (Custom livy2-conf in Ambari):
- livy.file.local-dir-whitelist = /usr/hdp/current/hive_warehouse_connector/
Working in Scala:
In Python:
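A sketch of the equivalent %livy2.pyspark paragraph, which is the same test but going through the Livy server:

%livy2.pyspark
from pyspark_llap import HiveWarehouseSession

hive = HiveWarehouseSession.session(spark).build()
hive.showDatabases().show(10, False)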
But, same as for the Spark2 interpreter, the %livy2.sql option of the interpreter is broken for the exact same reason:
References
- Apache Spark-Apache Hive connection configuration
- HiveWarehouseConnector
- How to Submit Spark Application through Livy REST API
- Invoke Livy with pyFiles attribute
- Spark HWC integration – HDP 3 Secure cluster