How to install and configure a standalone TEZ UI with HDP 3.x

Preamble

We recently upgraded our Hadoop platform to Hortonworks Data Platform (HDP) 3.1.4 (the latest freely available edition) and Ambari 2.7.4. We had tons of issues during and even after the migration, so I might publish a few articles about this painful period. One annoying drawback is that TEZ UI is gone in Ambari 2.7.x:

[Screenshot: tez-ui01]

I have found many Cloudera and Stack Overflow forum discussions partially explaining what to do, but they are most of the time contradictory, so I decided to write my own with the bare minimum of what is really required.

TEZ UI installation

The steps on the official Apache Tez UI installation page are very easy. I have personally downloaded the latest release available at the time of writing this article, TEZ UI 0.9.2, at this URL: https://repository.apache.org/content/repositories/releases/org/apache/tez/tez-ui/0.9.2/

I have taken the tez-ui-0.9.2.war file.

I have installed the Apache HTTP server of my RedHat release with the yum command (start it with systemctl start httpd, and ensure firewalld is stopped or well configured):

[root@ambari_server tez-ui]# yum list httpd
Loaded plugins: product-id, search-disabled-repos, subscription-manager
Installed Packages
httpd.x86_64                                                          2.4.6-67.el7_4.6                                         @rhel-7-server-rpms
Available Packages
httpd.x86_64                                                          2.4.6-80.el7                                             Server  

Then I have unzipped the tez-ui-0.9.2.war file in the /var/www/html/tez-ui directory (/var/www/html being the default Apache DocumentRoot, but this can be changed in the Apache configuration).
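The commands behind this step boil down to something like the following (a sketch: the URL is the one given above, and a .war file is a plain zip archive so unzip can extract it):

[root@ambari_server ~]# mkdir -p /var/www/html/tez-ui
[root@ambari_server ~]# cd /var/www/html/tez-ui
# Download the war file from the Apache repository URL mentioned above
[root@ambari_server tez-ui]# wget https://repository.apache.org/content/repositories/releases/org/apache/tez/tez-ui/0.9.2/tez-ui-0.9.2.war
# A war file is a zip archive, so unzip extracts it in place
[root@ambari_server tez-ui]# unzip tez-ui-0.9.2.war

Which leaves the directory with the content below: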

[root@ambari_server tez-ui]# pwd
/var/www/html/tez-ui
[root@ambari_server tez-ui]# ll
total 4312
drwxr-xr-x 4 root root     151 Mar 19  2019 assets
drwxr-xr-x 2 root root      25 Feb 21 16:20 config
drwxr-xr-x 2 root root    4096 Mar 19  2019 fonts
-rw-r--r-- 1 root root    1777 Mar 19  2019 index.html
drwxr-xr-x 3 root root      75 Mar 19  2019 META-INF
-rw-r--r-- 1 root root 2200884 Feb 21 16:20 tez-ui-0.9.2.war
drwxr-xr-x 3 root root      36 Mar 19  2019 WEB-INF  

Edit the /var/www/html/tez-ui/config/configs.env file and change only two parameters (you need to uncomment them as well, i.e. remove the leading //):

  • timeline must be set to Ambari value of yarn.timeline-service.webapp.address (YARN configuration)
  • rm must be set to Ambari value of yarn.resourcemanager.webapp.address (YARN configuration)

The two above parameter values look like server.domain.com:8188 and server.domain.com:8088. Ensure with a web browser that they return the list of completed and running applications on your cluster.
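For illustration, the relevant part of configs.env ends up looking something like this (a sketch of the hosts section only; replace the host names and ports with the values taken from your own Ambari configuration):

ENV = {
  hosts: {
    // Timeline Server address (yarn.timeline-service.webapp.address)
    timeline: "http://server.domain.com:8188",
    // Resource Manager address (yarn.resourcemanager.webapp.address)
    rm: "http://server.domain.com:8088",
  },
};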

Ambari values to copy/paste:

[Screenshot: tez-ui02]

And voilà, installation done: you can access TEZ UI at this URL: http://ambari_server.domain.com/tez-ui/

TEZ UI configuration

The previous part was really straightforward, but the Hadoop configuration was quite cumbersome as many articles say opposite things…

Some are even quite dangerous with HDP 3.1.4… For example, setting hive.exec.failure.hooks = org.apache.hadoop.hive.ql.hooks.ATSHook, hive.exec.post.hooks = org.apache.hadoop.hive.ql.hooks.ATSHook,org.apache.atlas.hive.hook.HiveHook and hive.exec.pre.hooks = org.apache.hadoop.hive.ql.hooks.ATSHook corrupted beeline access to databases…

What I have set in the Custom tez-site section is the below parameter list:

  • tez.history.logging.service.class = org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService
  • tez.tez-ui.history-url.base = http://ambari_server.domain.com/tez-ui/
  • tez.am.tez-ui.history-url.template = __HISTORY_URL_BASE__?viewPath=/#/tez-app/__APPLICATION_ID__

The other parameters below must also have these values, but as they are the defaults this should not be an issue:

  • yarn.timeline-service.enabled = true (in YARN and in TEZ configuration)
  • hive_timeline_logging_enabled = true
  • yarn.acl.enable = false

Remark:
If you want to keep yarn.acl.enable = true (which sounds like a good idea) you might need to add something to yarn.admin.acl, so set it to activity_analyzer,yarn,admin,operator,…. The only issue is that so far I have not found what to add. See the next chapter for a trick to work around this issue…

If you have multiple servers in your landscape and your Resource Manager server is not the one running TEZ UI you might need to set as well:

  • yarn.timeline-service.http-cross-origin.allowed-origins = *
  • yarn.timeline-service.hostname = Timeline Service server name (FQDN)

To be able to display queries in the Hive Queries panel of the TEZ UI home page I have also set:

  • hive.exec.failure.hooks = org.apache.hadoop.hive.ql.hooks.ATSHook
  • hive.exec.post.hooks = org.apache.hadoop.hive.ql.hooks.ATSHook
  • hive.exec.pre.hooks = org.apache.hadoop.hive.ql.hooks.ATSHook

Then, going to http://ambari_server.domain.com/tez-ui/, I was only seeing old pre-upgrade queries. To unlock the situation I went to the YARN Resource Manager and in one of the running queries I selected the ApplicationMaster link:

[Screenshot: tez-ui03]

And I got redirected to TEZ UI:

[Screenshot: tez-ui04]

And starting from this the TEZ UI home page (http://ambari_server.domain.com/tez-ui/) was correctly displaying running figures.

With the extra hook parameters the Hive Queries panel is also displaying the query list:

[Screenshot: tez-ui05]

Or more precisely:

[Screenshot: tez-ui06]

TEZ UI bug correction trick

If at a point in time you encounter an issue and no figures are displayed you can get some help from the developer tool of your browser (Firefox or Chrome). Here is a screenshot using Firefox:

[Screenshot: tez-ui07]

Here using the Network tab and the response from my NameNode I see that http://namenode.domain.com:8188/ws/v1/timeline/TEZ_DAG_ID?limit=11&_=1583236988485 is returning an empty json answer:

{"entities":[]}

So obviously nothing is displayed… This is how I have seen that yarn.acl.enable was playing an important role to display this resource…

I also noticed that trying to access something like http://namenode.domain.com:8188/ws/v1/timeline/TEZ_DAG_ID?user.name=whatever resolves the issue for… well, some time, not even the current session of your web browser… There is something weird here…
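The same check can be scripted with curl instead of the browser developer tools (a sketch reusing the URLs above):

[root@ambari_server ~]# curl "http://namenode.domain.com:8188/ws/v1/timeline/TEZ_DAG_ID?limit=11"
{"entities":[]}
[root@ambari_server ~]# curl "http://namenode.domain.com:8188/ws/v1/timeline/TEZ_DAG_ID?limit=11&user.name=whatever"

If the first call returns the empty entities list while the second one returns DAG entries, the ACLs are most probably the culprit.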

I also have to say that the documentation on this part is not very good. Strange, knowing the increasing interest in security nowadays…

PySpark and Spark Scala Jupyter kernels cluster integration

Preamble

Even if the standard tool for your data scientists in a Hortonworks Data Platform (HDP) is the Zeppelin Notebook, this population will most probably want to use Jupyter Lab/Notebook, which has quite a momentum in this domain.

As you might guess, with the new Hive Warehouse Connector (HWC) required to access Hive tables in Spark comes a bunch of problems to correctly configure Jupyter Lab/Notebook…

In short, the idea is to add additional Jupyter kernels on top of the default Python 3 one. To do this, either you create them on your own by writing a kernel.json file, or you install one of the packages that help you integrate the language you wish.

In this article I assume that you already have a working Anaconda installation on your server. The installation is pretty straightforward: just execute the Anaconda3-2019.10-Linux-x86_64.sh shell script (in my case) and acknowledge the licence information displayed.

JupyterHub installation

If, like me, you are behind a corporate proxy, the first thing to do is to configure it to be able to download conda packages over the Internet:

(base) [root@server ~]# cat .condarc
auto_activate_base: false

proxy_servers:
    http: http://account:password@proxy_server.domain.com:proxy_port/
    https: http://account:password@proxy_server.domain.com:proxy_port/


ssl_verify: False

Create the JupyterHub conda environment with (chosen name is totally up to you):

(base) [root@server ~]# conda create --name jupyterhub
Collecting package metadata (current_repodata.json): done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 4.7.12
  latest version: 4.8.2

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /opt/anaconda3/envs/jupyterhub



Proceed ([y]/n)? y

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate jupyterhub
#
# To deactivate an active environment, use
#
#     $ conda deactivate

If, like me, you received a warning about an obsolete release of conda, upgrade it with:

(base) [root@server ~]# conda update -n base -c defaults conda
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/anaconda3

  added / updated specs:
    - conda


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    backports.functools_lru_cache-1.6.1|             py_0          11 KB
    conda-4.8.2                |           py37_0         2.8 MB
    future-0.18.2              |           py37_0         639 KB
    ------------------------------------------------------------
                                           Total:         3.5 MB

The following packages will be UPDATED:

  backports.functoo~                               1.5-py_2 --> 1.6.1-py_0
  conda                                       4.7.12-py37_0 --> 4.8.2-py37_0
  future                                      0.17.1-py37_0 --> 0.18.2-py37_0


Proceed ([y]/n)? y


Downloading and Extracting Packages
future-0.18.2        | 639 KB    | ######################################################################################### | 100%
backports.functools_ | 11 KB     | ######################################################################################### | 100%
conda-4.8.2          | 2.8 MB    | ######################################################################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

(base) [root@server ~]# conda -V
conda 4.8.2

Activate your newly created Conda environment and install jupyterhub and notebook inside it:

(base) [root@server ~]# conda activate jupyterhub
(jupyterhub) [root@server ~]# conda install -c conda-forge jupyterhub
(jupyterhub) [root@server ~]# conda install -c conda-forge notebook

Remark
It is also possible to install the newer JupyterLab in JupyterHub instead of Jupyter Notebook. If you do so you have to set c.Spawner.default_url = '/lab' to instruct JupyterHub to load JupyterLab instead of Jupyter Notebook. Below I will try to mix screenshots of both, but clearly the future is JupyterLab rather than Jupyter Notebook. JupyterHub is just providing a multi-user environment.

Install JupyterLab with:

conda install -c conda-forge jupyterlab

Execute JupyterHub by just typing the jupyterhub command and access its URL at http://server.domain.com:8000. All options can obviously be configured...

As an example, here is how to activate HTTPS for your JupyterHub using a (free) self-signed certificate. It is not optimal but better than plain HTTP...

Generate the config file using:

(jupyterhub) [root@server ~]# jupyterhub --generate-config -f jupyterhub_config.py
Writing default config to: jupyterhub_config.py

Generate the key and certificate using the below command (taken from the OpenSSL Cookbook):

(jupyterhub) [root@server ~]# openssl req -new -newkey rsa:2048 -x509 -nodes -keyout root-ocsp.key -out root-ocsp.csr
Generating a RSA private key
...................................+++++
........................................+++++
writing new private key to 'root-ocsp.key'
-----
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [AU]:CH
State or Province Name (full name) [Some-State]:Geneva
Locality Name (eg, city) []:Geneva
Organization Name (eg, company) [Internet Widgits Pty Ltd]:Company Name
Organizational Unit Name (eg, section) []:
Common Name (e.g. server FQDN or YOUR name) []:servername.domain.com
Email Address []:

Finally I have configured only the below parameters in the jupyterhub_config.py configuration file:

c.JupyterHub.bind_url = 'https://servername.domain.com'
c.JupyterHub.ssl_cert = 'root-ocsp.csr'
c.JupyterHub.ssl_key = 'root-ocsp.key'
c.Spawner.default_url = '/lab' # Unset to keep Jupyter Notebook

Simply execute the jupyterhub command to run JupyterHub; of course, creating a service that starts at server boot is more than recommended.
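As an illustration, a minimal systemd unit could look like this (a sketch: the conda environment path matches this installation, but the location of jupyterhub_config.py is an assumption to adapt to yours):

[root@server ~]# cat /etc/systemd/system/jupyterhub.service
[Unit]
Description=JupyterHub
After=network.target

[Service]
User=root
# Path of the jupyterhub binary inside the conda environment created above
ExecStart=/opt/anaconda3/envs/jupyterhub/bin/jupyterhub -f /root/jupyterhub_config.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
[root@server ~]# systemctl daemon-reload
[root@server ~]# systemctl enable jupyterhub
[root@server ~]# systemctl start jupyterhub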

Then, accessing the HTTPS URL, you see the login window without the HTTP warning (you will have to add the self-signed certificate as trusted in your browser).

Jupyter kernels manual configuration

Connect with an existing OS account created on the server where JupyterHub is running:

[Screenshot: jupyterhub01]

Create a new Python 3 notebook (yannick_python.ipynb in my example, but as you can see I have many others):

[Screenshot: jupyterhub02]

And this below dummy example should work:

[Screenshot: jupyterhub03]

Obviously our goal here is not to do Python but Spark. To manually create a PySpark kernel, create the kernel directory in the home installation of your JupyterHub:

[root@server ~]# cd /opt/anaconda3/envs/jupyterhub/share/jupyter/kernels
(jupyterhub) [root@server kernels]# ll
total 0
drwxr-xr-x 2 root root 69 Feb 18 14:20 python3
(jupyterhub) [root@server kernels]# mkdir pyspark_kernel

Then create the below kernel file:

[root@server pyspark]# cat kernel.json
{
  "display_name": "PySpark_kernel",
  "language": "python",
  "argv": [ "/opt/anaconda3/bin/python", "-m", "ipykernel", "-f", "{connection_file}" ],
  "env": {
    "SPARK_HOME": "/usr/hdp/current/spark2-client/",
    "PYSPARK_PYTHON": "/opt/anaconda3/bin/python",
    "PYTHONPATH": "/usr/hdp/current/spark2-client/python/:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip",
    "PYTHONSTARTUP": "/usr/hdp/current/spark2-client/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master yarn --queue llap --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip pyspark-shell"
  }
}

You can confirm it is taken into account with:

(jupyterhub) [root@server ~]# jupyter-kernelspec list
Available kernels:
  pyspark_kernel    /opt/anaconda3/envs/jupyterhub/share/jupyter/kernels/pyspark_kernel
  python3           /opt/anaconda3/envs/jupyterhub/share/jupyter/kernels/python3

If you create a new Jupyter Notebook and choose PySpark_kernel you should be able to execute below sample code:

[Screenshot: jupyterhub04]

So far I have not yet found how to manually create a Spark Scala kernel; any insight is welcome...

Sparkmagic Jupyter kernels configuration

In the list of available Jupyter kernels there is one that (IMHO) comes up more often than the competition: Sparkmagic.

Install it with:

conda install -c conda-forge sparkmagic

Once done, install the Sparkmagic kernels with:

(jupyterhub) [root@server kernels]# cd /opt/anaconda3/envs/jupyterhub/lib/python3.8/site-packages/sparkmagic/kernels
(jupyterhub) [root@server kernels]# ll
total 28
-rw-rw-r-- 2 root root    46 Jan 23 14:36 __init__.py
-rw-rw-r-- 2 root root 20719 Jan 23 14:36 kernelmagics.py
drwxr-xr-x 2 root root    72 Feb 18 16:48 __pycache__
drwxr-xr-x 3 root root   104 Feb 18 16:48 pysparkkernel
drwxr-xr-x 3 root root   102 Feb 18 16:48 sparkkernel
drwxr-xr-x 3 root root   103 Feb 18 16:48 sparkrkernel
drwxr-xr-x 3 root root    95 Feb 18 16:48 wrapperkernel
(jupyterhub) [root@server kernels]# jupyter-kernelspec install sparkrkernel
[InstallKernelSpec] Installed kernelspec sparkrkernel in /usr/local/share/jupyter/kernels/sparkrkernel
(jupyterhub) [root@server kernels]# jupyter-kernelspec install sparkkernel
[InstallKernelSpec] Installed kernelspec sparkkernel in /usr/local/share/jupyter/kernels/sparkkernel
(jupyterhub) [root@server kernels]# jupyter-kernelspec install pysparkkernel
[InstallKernelSpec] Installed kernelspec pysparkkernel in /usr/local/share/jupyter/kernels/pysparkkernel
(jupyterhub) [root@server kernels]# jupyter-kernelspec list
Available kernels:
  pysparkkernel    /opt/anaconda3/envs/jupyterhub/share/jupyter/kernels/pysparkkernel
  python3          /opt/anaconda3/envs/jupyterhub/share/jupyter/kernels/python3
  sparkkernel      /opt/anaconda3/envs/jupyterhub/share/jupyter/kernels/sparkkernel
  sparkrkernel     /opt/anaconda3/envs/jupyterhub/share/jupyter/kernels/sparkrkernel

Enable the server extension with:

(jupyterhub) [root@server ~]# jupyter serverextension enable --py sparkmagic
Enabling: sparkmagic
- Writing config: /root/.jupyter
    - Validating...
      sparkmagic 0.15.0 OK

In the home directory of the account with which you will connect to JupyterHub, create a .sparkmagic directory and inside it a config.json file that is a copy of the provided example config.json.
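Something along these lines, executed as the OS user that will log in to JupyterHub (a sketch: the source path of example_config.json is hypothetical; you can also fetch it from the Sparkmagic GitHub repository):

(base) [user@server ~]$ mkdir -p ~/.sparkmagic
(base) [user@server ~]$ cp /path/to/example_config.json ~/.sparkmagic/config.json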

In this file, modify at least the url of each kernel_xxx_credentials section to match your Livy server name and port:

"kernel_python_credentials" : {
  "username": "",
  "password": "",
  "url": "http://livyserver.domain.com:8999",
  "auth": "None"
},

And the part on which I have spent quite a lot of time: the session_configs section, as below (to add the HWC connector information):

  "session_configs": {
  "driverMemory": "1000M",
  "executorCores": 2,
  "conf": {"spark.jars": "file:///usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar",
           "spark.submit.pyFiles": "file:///usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip"}
},

If you then create a new notebook using PySpark or Spark, depending on whether you want to use Python or Scala, you should be able to run the below examples.

If you use Jupyter Notebook, the first command to execute is the magic command %load_ext sparkmagic.magics; then create a session using the magic command %manage_spark and select either Scala or Python (there remains the question of the R language, but I do not use it). If you use JupyterLab you can directly start to work, as the %manage_spark command does not work there: the Livy session should be automatically created while executing the first command. It should also be the same with Jupyter Notebook, but I had a few issues with this so... (see the short cell example after the list below).

A few other magic commands are quite interesting:

  • %%help to get the list of available commands
  • %%info to see if your Livy session is still active (many issues can come from this)
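Putting it together, the first cells of a Jupyter Notebook session with Sparkmagic typically look like this (a sketch; not needed as-is in JupyterLab, as explained above):

%load_ext sparkmagic.magics
%manage_spark

And later on, in its own cell, to check that the Livy session is still alive:

%%info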

PySpark example (the SQL part is still broken due to HWC):

[Screenshot: jupyterhub05]
[Screenshot: jupyterhub06]
[Screenshot: jupyterhub07]

Scala example (same story for SQL):

[Screenshot: jupyterhub08]
[Screenshot: jupyterhub09]
[Screenshot: jupyterhub10]

Hive Warehouse Connector integration in Zeppelin Notebook

Preamble

How to configure Hive Warehouse Connector (HWC) integration in Zeppelin Notebook? Since we upgraded to Hortonworks Data Platform (HDP) 3.1.4 we have had to fight everywhere to integrate the new way of working in Spark with HWC. The connector in itself is not very complex to use: it requires only a small modification of your Spark code (Scala or Python). What's complex is its integration in the existing tools, as well as how to now run pyspark and spark-shell (Scala) with it in your environment.

On top of this, as clearly stated in the official documentation, you need HWC and LLAP (Live Long And Process, or Low-Latency Analytical Processing) to read Hive managed tables from Spark, which is one of the first operations you will do in most of your Spark scripts. So in Ambari activate and configure it (number of nodes, memory, concurrent queries):

[Screenshot: hwc01]

HWC on the command line

Before jumping to Zeppelin let's quickly see how you now execute Spark on the command line with HWC.

You should have already configured the below Spark parameters (as well as reverting back the hive.fetch.task.conversion value; see my other article on this):

  • spark.hadoop.hive.llap.daemon.service.hosts = @llap0
  • spark.hadoop.hive.zookeeper.quorum = zookeeper01.domain.com:2181,zookeeper03.domain.com:2181,zookeeper03.domain.com:2181
  • spark.sql.hive.hiveserver2.jdbc.url = jdbc:hive2://zookeeper01.domain.com:2181,zookeeper03.domain.com:2181,zookeeper03.domain.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive

To execute spark-shell you must now add the below jar file (in the example I'm in YARN mode on the llap YARN queue):

spark-shell --master yarn --queue llap --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar

To test it’s working you might use the below test code:

val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(spark).build()
hive.showDatabases().show(100, false)

To execute pyspark you must add the below jar file (I'm in YARN mode on the llap YARN queue) and Python zip file:

pyspark --master yarn --queue llap --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip

To test all is good you can use the below script:

from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()
hive.showDatabases().show(10, False)

To use spark-submit you do exactly the same thing:

spark-submit --master yarn --queue llap --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip script_to_execute

Zeppelin Spark2 interpreter configuration

To make the Spark2 Zeppelin interpreter work with HWC you need to add two parameters to its configuration:

  • spark.jars = /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar
  • spark.submit.pyFiles = /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip

Note:
Notice the uppercase F in spark.submit.pyFiles; I have lost half a day because some blog articles on the internet misspell it.

Save and restart the interpreter, and if you test the below code in a new notebook you should get a result (and not just the default database names):

[Screenshot: hwc02]

Remark:
As you can see above the %spark2.sql option of the interpreter is not configured to use HWC. So far I have not found how to overcome this. This could be an issue because this interpreter option is a nice feature to make graphs from direct SQL commands…

It also works well in Scala:

[Screenshot: hwc03]

Zeppelin Livy2 interpreter configuration

You might wonder why to configure the Livy2 interpreter when the first one is working fine. Well, I had to configure it for the JupyterHub Sparkmagic configuration that we will see in another blog post, so here it is…

To make the Livy2 Zeppelin interpreter work with HWC you need to add two parameters to its configuration:

  • livy.spark.jars = file:///usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar
  • livy.spark.submit.pyFiles = file:///usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip

Note:
The second parameter was just a good guess from the Spark2 interpreter setting. Also note the file:// prefix to instruct the interpreter to look on the local file system and not on HDFS.

Also ensure the Livy URL is well set up to the server (and port) of your Livy process (a quick check is shown right after the parameter):

  • zeppelin.livy.url = http://livyserver.domain.com:8999
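A quick way to verify the Livy endpoint is reachable from the Zeppelin host is to query its REST API; GET /sessions simply lists the current Livy sessions (a sketch, the output below being an example with no session open):

[root@server ~]# curl http://livyserver.domain.com:8999/sessions
{"from":0,"total":0,"sessions":[]}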

I have also faced this issue:

org.apache.zeppelin.livy.LivyException: Error with 400 StatusCode: "requirement failed: Local path /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar cannot be added to user sessions."
	at org.apache.zeppelin.livy.BaseLivyInterpreter.callRestAPI(BaseLivyInterpreter.java:755)
	at org.apache.zeppelin.livy.BaseLivyInterpreter.createSession(BaseLivyInterpreter.java:337)
	at org.apache.zeppelin.livy.BaseLivyInterpreter.initLivySession(BaseLivyInterpreter.java:209)
	at org.apache.zeppelin.livy.LivySharedInterpreter.open(LivySharedInterpreter.java:59)
	at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
	at org.apache.zeppelin.livy.BaseLivyInterpreter.getLivySharedInterpreter(BaseLivyInterpreter.java:190)
	at org.apache.zeppelin.livy.BaseLivyInterpreter.open(BaseLivyInterpreter.java:163)
	at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:617)
	at org.apache.zeppelin.scheduler.Job.run(Job.java:188)
	at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:140)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

To solve it, add the below configuration to your Livy2 server (Custom livy2-conf in Ambari):

  • livy.file.local-dir-whitelist = /usr/hdp/current/hive_warehouse_connector/

Working in Scala:

[Screenshot: hwc04]

In Python:

[Screenshot: hwc05]

But same as for the Spark2 interpreter, the %livy2.sql option is broken for the exact same reason:

[Screenshot: hwc06]

Setup Spark and Intellij on Windows to access a remote Hadoop cluster

Preamble

Developing directly on a Hadoop cluster is not the best development environment you could dream of. Of course it makes testing your code easy as you can instantly submit it, but in the worst case your editor will be vi. Obviously what new-generation developers want is a clever editor running on their Windows (Linux?) desktop. By clever editor I mean one having syntax completion, suggesting packages to import when using a procedure, flagging unused variables if any, and so on… Small teaser: Intellij IDEA from JetBrains is a serious contender nowadays, apparently much better than Eclipse which is more or less dying…

We, at last, improved the editor part a bit by installing VSCode and editing our files with the SSH FS plugin. In clear: we edit, through SSH, files located directly on the server and then, using a shell terminal (MobaXterm, Putty, …), we are able to submit scripts directly onto our Hadoop cluster. This works if you do Python (even if a PySpark plugin is not available in VSCode), but if you do Scala you also have to run an sbt command, each time, to compile your Scala program. A bit cumbersome…

As said above, Intellij IDEA from JetBrains is a very good Scala editor and the community edition does a pretty decent job. But then difficulties arise, as this software lacks remote SSH editing… So it works well for pure Scala scripts, but if you need to access your Hive tables you have to work a bit to configure it…

This blog post is all about this: building a productive Scala/PySpark development environment on your Windows desktop and accessing the Hive tables of a Hadoop cluster.

Spark installation and configuration

Before going into small details I first tried to make a raw Spark installation work on my Windows machine. I started by downloading it from the official web site:

[Screenshot: spark_installation01]

And unzipped it in the default D:\spark-2.4.4-bin-hadoop2.7 directory. You must set the SPARK_HOME environment variable to the directory where you unzipped Spark. For convenience you also need to add D:\spark-2.4.4-bin-hadoop2.7\bin to the path of your Windows account (restart PowerShell after) and confirm it's all good with:

$env:path
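For reference, SPARK_HOME can also be set permanently from PowerShell with setx (a sketch, the path being the one used above; the bin directory can be appended to the Path variable through the Windows environment variables dialog; reopen PowerShell afterwards):

PS D:\> setx SPARK_HOME "D:\spark-2.4.4-bin-hadoop2.7"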

Then issue spark-shell in a PowerShell session; you should get an error like:

19/11/15 17:15:37 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
        at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
        at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
        at org.apache.hadoop.util.Shell.(Shell.java:387)
        at org.apache.hadoop.util.StringUtils.(StringUtils.java:80)
        at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
        at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
        at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
        at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
        at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
        at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
        at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
        at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2422)
        at org.apache.spark.SecurityManager.(SecurityManager.scala:79)
        at org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:348)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$secMgr$1(SparkSubmit.scala:348)
        at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
        at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
        at scala.Option.map(Option.scala:146)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:355)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Take winutils.exe, the Windows binary for your Hadoop version, from https://github.com/steveloughran/winutils and put it in the D:\spark-2.4.4-bin-hadoop2.7\bin folder.
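From PowerShell, the download can be scripted like this (a sketch: the path inside the GitHub repository is an assumption based on its layout; pick the directory matching your Hadoop version):

PS D:\> Invoke-WebRequest -Uri "https://github.com/steveloughran/winutils/raw/master/hadoop-2.7.1/bin/winutils.exe" -OutFile "D:\spark-2.4.4-bin-hadoop2.7\bin\winutils.exe"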

I have downloaded the one for Hadoop 2.7.1 (linked to my Hadoop cluster). You must also set the HADOOP_HOME environment variable to D:\spark-2.4.4-bin-hadoop2.7. Then the spark-shell command should execute with no error. You can do the suggested testing right now:

PS D:\> spark-shell
19/11/15 17:29:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://client_machine.domain.com:4040
Spark context available as 'sc' (master = local[*], app id = local-1573835367367).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark
res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@28532753

scala> sc
res1: org.apache.spark.SparkContext = org.apache.spark.SparkContext@44846c76
scala> val textFile = spark.read.textFile("D:/spark-2.4.4-bin-hadoop2.7/README.md")
textFile: org.apache.spark.sql.Dataset[String] = [value: string]

scala> textFile.count()
res2: Long = 105

scala> textFile.filter(line => line.contains("Spark")).count()
res3: Long = 20

And that's it: you have a local working Spark environment with all the Spark clients: Spark shell, PySpark, Spark SQL, SparkR… I had to install Python 3.7.5 as it was not working with Python 3.8.0; this might be corrected soon:

PS D:\> pyspark
Python 3.7.5 (tags/v3.7.5:5c02a39a0b, Oct 15 2019, 00:11:34) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
19/11/15 17:28:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Python version 3.7.5 (tags/v3.7.5:5c02a39a0b, Oct 15 2019 00:11:34)
SparkSession available as 'spark'.
>>> textFile = spark.read.text("D:/spark-2.4.4-bin-hadoop2.7/README.md")
>>> textFile.count()
105

To access Hive running on my cluster I have taken the file called hive-site.xml from the /etc/spark2/conf directory of one of my edge nodes where Spark was installed and put it in the D:\spark-2.4.4-bin-hadoop2.7\conf folder. And it worked pretty well:

scala> val databases=spark.sql("show databases")
databases: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> databases.show(10,false)
+------------------+
|databaseName      |
+------------------+
|activity          |
|admin             |
|alexandre_c300    |
|audit             |
|big_ews_processing|
|big_ews_raw       |
|big_ews_refined   |
|big_fdc_processing|
|big_fdc_raw       |
|big_fdc_refined   |
+------------------+
only showing top 10 rows

But it failed for a more complex example:

scala> val df01=spark.sql("select * from admin.tbl_concatenate")
df01: org.apache.spark.sql.DataFrame = [id_process: string, date_job_start: string ... 9 more fields]

scala> df01.printSchema()
root
 |-- id_process: string (nullable = true)
 |-- date_job_start: string (nullable = true)
 |-- date_job_stop: string (nullable = true)
 |-- action: string (nullable = true)
 |-- database_name: string (nullable = true)
 |-- table_name: string (nullable = true)
 |-- partition_name: string (nullable = true)
 |-- number_files_before: string (nullable = true)
 |-- partition_size_before: string (nullable = true)
 |-- number_files_after: string (nullable = true)
 |-- partition_size_after: string (nullable = true)


scala> df01.show(10,false)
java.lang.IllegalArgumentException: java.net.UnknownHostException: DataLakeHdfs
  at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
  at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
  at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
  at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:678)
  at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:619)

To solve this I had to copy hdfs-site.xml from the /etc/hadoop/conf directory of one of my edge nodes where the HDFS client has been installed, and it worked much better:

scala> val df01=spark.sql("select * from admin.tbl_concatenate")
df01: org.apache.spark.sql.DataFrame = [id_process: string, date_job_start: string ... 9 more fields]

scala> df01.show(10,false)
19/11/15 17:45:01 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because UNIX Domain sockets are not available on Windows.
+----------+--------------+--------------+----------+----------------+------------------+----------------------------+-------------------+---------------------+------------------+--------------------+
|id_process|date_job_start|date_job_stop |action    |database_name   |table_name        |partition_name              |number_files_before|partition_size_before|number_files_after|partition_size_after|
+----------+--------------+--------------+----------+----------------+------------------+----------------------------+-------------------+---------------------+------------------+--------------------+
|10468     |20190830103323|20190830103439|compaction|prod_ews_refined|tbl_bin_param_stat|fab=C2WF/lot_partition=Q1111|3                  |1606889              |3                 |1606889             |
|10468     |20190830103859|20190830104224|compaction|prod_ews_refined|tbl_bin_param_stat|fab=C2WF/lot_partition=Q7180|7                  |37136990             |2                 |37130614            |
|10468     |20190830104251|20190830104401|compaction|prod_ews_refined|tbl_bin_param_stat|fab=C2WF/lot_partition=Q7210|4                  |22095872             |3                 |22094910            |
|10468     |20190830104435|20190830104624|compaction|prod_ews_refined|tbl_bin_param_stat|fab=C2WF/lot_partition=Q7250|4                  |73357246             |2                 |73352589            |
|10468     |20190830104659|20190830104759|compaction|prod_ews_refined|tbl_bin_param_stat|fab=C2WF/lot_partition=Q7280|1                  |1696312              |1                 |1696312             |
|10468     |20190830104845|20190830104952|compaction|prod_ews_refined|tbl_bin_param_stat|fab=C2WF/lot_partition=Q7350|3                  |3439184              |3                 |3439184             |
|10468     |20190830105023|20190830105500|compaction|prod_ews_refined|tbl_bin_param_stat|fab=C2WF/lot_partition=Q7371|6                  |62283893             |2                 |62274587            |
|10468     |20190830105532|20190830105718|compaction|prod_ews_refined|tbl_bin_param_stat|fab=C2WF/lot_partition=Q7382|5                  |25501396             |3                 |25497826            |
|10468     |20190830110118|20190830110316|compaction|prod_ews_refined|tbl_bin_param_stat|fab=C2WF/lot_partition=Q8030|3                  |74338924             |3                 |74338924            |
|10468     |20190830110413|20190830110520|compaction|prod_ews_refined|tbl_bin_param_stat|fab=C2WF/lot_partition=Q8039|3                  |5336123              |2                 |5335855             |
+----------+--------------+--------------+----------+----------------+------------------+----------------------------+-------------------+---------------------+------------------+--------------------+
only showing top 10 rows

And that's it: you have built a working local Spark environment able to access your Hadoop cluster figures. Of course your Spark power is limited to your client machine and you cannot use the full power of your cluster, but to develop scripts this is more than enough…

Sbt installation and configuration

If you write Scala scripts, the next step, to submit them using spark-submit and leave interactive mode, is to compile them and produce a jar file usable by spark-submit. I have to say this is a drawback versus developing in Python, where your PySpark script is directly usable as-is. Speed is apparently not a criterion either, as PySpark has the same performance as Scala. The only reason to do Scala, except being hype, is the functionalities that are more advanced in Scala than in Python…

The installation is as simple as downloading it from the sbt official web site. I have chosen to take the zip file and uncompressed it in the d:\sbt directory. For convenience add d:\sbt\bin to your environment path. I have configured my corporate proxy in the sbtconfig.txt file located in d:\sbt\conf as follows:

-Dhttp.proxyHost=proxy.domain.com
-Dhttp.proxyPort=8080
-Dhttp.proxyUser=account
-Dhttp.proxyPassword=password
Then, to test the installation, I created a minimal project following the sbt getting started guide:

PS D:\> mkdir foo-build


    Directory: D:\


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-----       18/11/2019     12:07                foo-build


PS D:\> cd .\foo-build\
PS D:\foo-build> ni build.sbt


    Directory: D:\foo-build


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----       18/11/2019     12:08              0 build.sbt
PS D:\foo-build> sbt

[info] Updated file D:\foo-build\project\build.properties: set sbt.version to 1.3.3
[info] Loading project definition from D:\foo-build\project
nov. 18, 2019 12:08:39 PM lmcoursier.internal.shaded.coursier.cache.shaded.org.jline.utils.Log logr
WARNING: Unable to create a system terminal, creating a dumb terminal (enable debug logging for more information)
[info] Updating
  | => foo-build-build / update 0s
[info] Resolved  dependencies
[warn]
[warn]  Note: Unresolved dependencies path:
[error] sbt.librarymanagement.ResolveException: Error downloading org.scala-sbt:sbt:1.3.3
[error]   Not found
[error]   Not found
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo1.maven.org/maven2/org/scala-sbt/sbt/1.3.3/sbt-1.3.3.pom
[error]   not found: C:\Users\yjaquier\.ivy2\local\org.scala-sbt\sbt\1.3.3\ivys\ivy.xml
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/org.scala-sbt/sbt/1.3.3/ivys/ivy.xml
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt/1.3.3/ivys/ivy.xml
[error] Error downloading org.scala-lang:scala-library:2.12.10
[error]   Not found
[error]   Not found
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo1.maven.org/maven2/org/scala-lang/scala-library/2.12.10/scala-library-2.12.10.pom
[error]   not found: C:\Users\yjaquier\.ivy2\local\org.scala-lang\scala-library\2.12.10\ivys\ivy.xml
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/org.scala-lang/scala-library/2.12.10/ivys/ivy.xml
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo.typesafe.com/typesafe/ivy-releases/org.scala-lang/scala-library/2.12.10/ivys/ivy.xml
[error]         at lmcoursier.CoursierDependencyResolution.unresolvedWarningOrThrow(CoursierDependencyResolution.scala:245)
[error]         at lmcoursier.CoursierDependencyResolution.$anonfun$update$34(CoursierDependencyResolution.scala:214)
[error]         at scala.util.Either$LeftProjection.map(Either.scala:573)
[error]         at lmcoursier.CoursierDependencyResolution.update(CoursierDependencyResolution.scala:214)
[error]         at sbt.librarymanagement.DependencyResolution.update(DependencyResolution.scala:60)
[error]         at sbt.internal.LibraryManagement$.resolve$1(LibraryManagement.scala:52)
[error]         at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$12(LibraryManagement.scala:102)
[error]         at sbt.util.Tracked$.$anonfun$lastOutput$1(Tracked.scala:69)
[error]         at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$20(LibraryManagement.scala:115)
[error]         at scala.util.control.Exception$Catch.apply(Exception.scala:228)
[error]         at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$11(LibraryManagement.scala:115)
[error]         at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$11$adapted(LibraryManagement.scala:96)
[error]         at sbt.util.Tracked$.$anonfun$inputChanged$1(Tracked.scala:150)
[error]         at sbt.internal.LibraryManagement$.cachedUpdate(LibraryManagement.scala:129)
[error]         at sbt.Classpaths$.$anonfun$updateTask0$5(Defaults.scala:2946)
[error]         at scala.Function1.$anonfun$compose$1(Function1.scala:49)
[error]         at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:62)
[error]         at sbt.std.Transform$$anon$4.work(Transform.scala:67)
[error]         at sbt.Execute.$anonfun$submit$2(Execute.scala:281)
[error]         at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:19)
[error]         at sbt.Execute.work(Execute.scala:290)
[error]         at sbt.Execute.$anonfun$submit$1(Execute.scala:281)
[error]         at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:178)
[error]         at sbt.CompletionService$$anon$2.call(CompletionService.scala:37)
[error]         at java.util.concurrent.FutureTask.run(Unknown Source)
[error]         at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
[error]         at java.util.concurrent.FutureTask.run(Unknown Source)
[error]         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
[error]         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
[error]         at java.lang.Thread.run(Unknown Source)
[error] (update) sbt.librarymanagement.ResolveException: Error downloading org.scala-sbt:sbt:1.3.3
[error]   Not found
[error]   Not found
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo1.maven.org/maven2/org/scala-sbt/sbt/1.3.3/sbt-1.3.3.pom
[error]   not found: C:\Users\yjaquier\.ivy2\local\org.scala-sbt\sbt\1.3.3\ivys\ivy.xml
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/org.scala-sbt/sbt/1.3.3/ivys/ivy.xml
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt/1.3.3/ivys/ivy.xml
[error] Error downloading org.scala-lang:scala-library:2.12.10
[error]   Not found
[error]   Not found
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo1.maven.org/maven2/org/scala-lang/scala-library/2.12.10/scala-library-2.12.10.pom
[error]   not found: C:\Users\yjaquier\.ivy2\local\org.scala-lang\scala-library\2.12.10\ivys\ivy.xml
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/org.scala-lang/scala-library/2.12.10/ivys/ivy.xml
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo.typesafe.com/typesafe/ivy-releases/org.scala-lang/scala-library/2.12.10/ivys/ivy.xml
Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore? q

In the file D:\foo-build\project\build.properties I had to change the version as follows, as apparently there is an issue with the sbt 1.3.3 repository:

sbt.version=1.2.8

This time it went well:

PS D:\foo-build> sbt

[info] Loading project definition from D:\foo-build\project
[info] Updating ProjectRef(uri("file:/D:/foo-build/project/"), "foo-build-build")...
[info] Done updating.
[info] Loading settings for project foo-build from build.sbt ...
[info] Set current project to foo-build (in build file:/D:/foo-build/)
[info] sbt server started at local:sbt-server-2b55615bd8de0c5ac354
sbt:foo-build>

As suggested in sbt official documentation I have issued:

sbt:foo-build> ~compile
[info] Updating ...
[info] Done updating.
[info] Compiling 1 Scala source to D:\foo-build\target\scala-2.12\classes ...
[info] Done compiling.
[success] Total time: 1 s, completed 18 nov. 2019 12:58:30
1. Waiting for source changes in project foo-build... (press enter to interrupt)

And create the Hello.scala file where suggested; it is instantly compiled:

[info] Compiling 1 Scala source to D:\foo-build\target\scala-2.12\classes ...
[info] Done compiling.
[success] Total time: 1 s, completed 18 nov. 2019 12:58:51
2. Waiting for source changes in project foo-build... (press enter to interrupt)

You can run it using:

sbt:foo-build> run
[info] Running example.Hello
Hello
[success] Total time: 1 s, completed 18 nov. 2019 13:06:23
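For completeness, the Hello.scala created above follows the sbt getting started guide and looks like this (a sketch, matching the example.Hello object run above):

PS D:\foo-build> type .\src\main\scala\example\Hello.scala
package example

object Hello extends App {
  println("Hello")
}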

Create a jar file that can be submitted via spark-submit with:

sbt:foo-build> package
[info] Packaging D:\foo-build\target\scala-2.12\foo-build_2.12-0.1.0-SNAPSHOT.jar ...
[info] Done packaging.
[success] Total time: 0 s, completed 18 nov. 2019 16:15:06

Intellij IDEA from JetBrains

Java JDK is a prerequisite, I have personally installed release 1.8.0_231.

Once you have installed Intellij IDEA (the community edition does the job for basic things), the first thing you have to do is to install the Scala plugin (you will have to configure your corporate proxy):

[Screenshot: spark_installation02]

Then create your first Scala project with Create New Project:

[Screenshot: spark_installation03]

I have chosen Sbt 1.2.8 and Scala 2.12.10 as I had many issues with latest versions:

[Screenshot: spark_installation04]

In the src/main/scala folder create a new Scala class (left mouse click then New/Scala Class) and then select Object to create (for example) a file called scala_testing.scala. As an example:

[Screenshot: spark_installation05]

Insert (for example) below Scala code in Scala object script you have just created:

import org.apache.spark.sql.SparkSession

object scala_testing {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("spark_scala_yannick")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()

    spark.sparkContext.setLogLevel("WARN")

    val databases=spark.sql("show databases")

    databases.show(100,false)
  }
}

In build.sbt file add the following dependencies to have a file looking like:

name := "scala_testing"

version := "0.1"

scalaVersion := "2.12.10"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.4"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.4.4"

Sbt proxy settings are not inherited (the feature has been requested) from the Intellij default proxy settings, so you have to add the below VM parameters:

[Screenshot: spark_installation06]

Remark:
-Dsbt.repository.secure=false is surely not the best idea in the world, but I had issues configuring my proxy server in HTTPS. -Dsbt.gigahorse=false is to overcome a concurrency issue while downloading multiple packages at the same time… It might not be required for you…

Once the build process is completed (dependencies downloaded), if you execute the script you will end up with:

+------------+
|databaseName|
+------------+
|default     |
+------------+

We are still lacking the Hive configuration. To add it, go to File/Project Structure and add a new Java library:

spark_installation07
spark_installation07

Browse to the Spark conf directory we configured in the Spark installation and configuration chapter:

spark_installation08
spark_installation08

Choose Class as category:

spark_installation09
spark_installation09

I have finally selected all modules (not fully sure it’s required):

spark_installation10
spark_installation10

If you re-execute the Scala script you should now get the complete list of Hive databases of your Hadoop cluster:

+--------------------------+
|databaseName              |
+--------------------------+
|activity                  |
|admin                     |
|audit                     |
|big_processing            |
|big_raw                   |
|big_refined               |
|big_processing            |
.
.

Conclusion

I have also tried to submit directly from Intellij in YARN mode (so onto the Hadoop cluster), but even if I have been able to see my application in the YARN Resource Manager I have never succeeded in making it run normally. The YARN scheduler accepts my request but never satisfies it…

Any inputs are welcome…

References

The post Setup Spark and Intellij on Windows to access a remote Hadoop cluster appeared first on IT World.

On the importance to have good Hive statistics on your tables
https://blog.yannickjaquier.com/hadoop/on-the-importance-to-have-good-hive-statistics-on-your-tables.html
Mon, 23 Mar 2020 09:25:58 +0000


Table of contents

Preamble

Hive statistics: if, like me, you come from the RDBMS world, I will not offend you by re-explaining the importance of good statistics (table, columns, …) in helping the optimizer choose the best approach to execute your SQL statements. In this post you will see a concrete example of why it is also super important in a Hadoop cluster!

We have the (bad) habit of incrementally loading figures into our partitioned Hive tables without gathering statistics each time. In short, this is clearly not a best practice!

Then, to create a more refined table, we join two tables, and the initial question was: do we need to put all predicates in the JOIN ON clause, or can they be added in a traditional WHERE clause, with the Hive optimizer being clever enough to do what is usually called predicate pushdown?

We have the Hortonworks Hadoop edition called HDP 2.6.4, so we are running Hive 1.2.1000.

The problematic queries

We were trying to load a third table from two sources with (query01):

select dpf.fab,dpf.start_t,dpf.finish_t,dpf.lot_id,dpf.wafer_id,dpf.flow_id,dpf.die_x,dpf.die_y,dpf.part_seq,dpf.parameters,dpf.ingestion_date,dpf.lot_partition
from prod_ews_refined.tbl_die_param_flow dpf join prod_ews_refined.tbl_die_bin db
on dpf.fab="C2WF"
and dpf.lot_partition="Q926"
and dpf.die_partition=9
and db.pass_fail="F"
and db.fab="C2WF"
and db.lot_partition="Q926"
and db.die_partition=9
and dpf.fab=db.fab
and dpf.lot_partition=db.lot_partition
and dpf.start_t=db.start_t
and dpf.finish_t= db.finish_t
and dpf.lot_id=db.lot_id
and dpf.wafer_id=db.wafer_id
and dpf.flow_id=db.flow_id
and dpf.die_x=db.die_x
and dpf.die_y=db.die_y
and dpf.part_seq=db.part_seq
and dpf.ingestion_date<"201911291611"
and db.ingestion_date<"201911291611";

And we compared it with its sister query, written more logically I would say, but in fact not following best practices (query02):

select dpf.fab,dpf.start_t,dpf.finish_t,dpf.lot_id,dpf.wafer_id,dpf.flow_id,dpf.die_x,dpf.die_y,dpf.part_seq,dpf.parameters,dpf.ingestion_date,dpf.lot_partition
from prod_ews_refined.tbl_die_param_flow dpf join prod_ews_refined.tbl_die_bin db
on dpf.fab=db.fab
and dpf.lot_partition=db.lot_partition
and dpf.die_partition=db.die_partition
and dpf.start_t=db.start_t
and dpf.finish_t= db.finish_t
and dpf.lot_id=db.lot_id
and dpf.wafer_id=db.wafer_id
and dpf.flow_id=db.flow_id
and dpf.die_x=db.die_x
and dpf.die_y=db.die_y
and dpf.part_seq=db.part_seq
where dpf.fab="C2WF"
and dpf.lot_partition="Q926"
and dpf.die_partition="9"
and db.fab="C2WF"
and db.lot_partition="Q926"
and db.pass_fail="F"
and db.die_partition="9"
and dpf.ingestion_date<"201911291611"
and db.ingestion_date<"201911291611";

The explain plan of query01 is below (use the EXPLAIN Hive command in Beeline to generate it):

Explain
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan not optimized by CBO.

Vertex dependency in root stage
Map 2 <- Map 1 (BROADCAST_EDGE)

Stage-0
    Fetch Operator
      limit:-1
      Stage-1
          Map 2
          File Output Operator [FS_3747670]
            compressed:false
            Statistics:Num rows: 10066 Data size: 541406213 Basic stats: COMPLETE Column stats: NONE
            table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
            Select Operator [SEL_3747669]
                outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11"]
                Statistics:Num rows: 10066 Data size: 541406213 Basic stats: COMPLETE Column stats: NONE
                Map Join Operator [MAPJOIN_3747675]
                |  condition map:[{"":"Inner Join 0 to 1"}]
                |  HybridGraceHashJoin:true
                |  keys:{"Map 2":"'C2WF' (type: string), 'Q926' (type: string), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)","Map 1":"'C2WF' (type: string), 'Q926' (type: string), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)"}
                |  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9"]
                |  Statistics:Num rows: 10066 Data size: 541406213 Basic stats: COMPLETE Column stats: NONE
                |<-Map 1 [BROADCAST_EDGE]
                |  Reduce Output Operator [RS_3747666]
                |     key expressions:'C2WF' (type: string), 'Q926' (type: string), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)
                |     Map-reduce partition columns:'C2WF' (type: string), 'Q926' (type: string), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)
                |     sort order:++++++++++
                |     Statistics:Num rows: 9151 Data size: 492187456 Basic stats: COMPLETE Column stats: NONE
                |     value expressions:parameters (type: array), ingestion_date (type: string)
                |     Filter Operator [FIL_3747673]
                |        predicate:((ingestion_date < '201911291611') and start_t is not null and finish_t is not null and lot_id is not null and wafer_id is not null and flow_id is not null and die_x is not null and die_y is not null and part_seq is not null) (type: boolean)
                |        Statistics:Num rows: 9151 Data size: 492187456 Basic stats: COMPLETE Column stats: NONE
                |        TableScan [TS_3747655]
                |           alias:dpf
                |           Statistics:Num rows: 7027779 Data size: 377989801184 Basic stats: COMPLETE Column stats: NONE
                |<-Filter Operator [FIL_3747674]
                      predicate:((pass_fail = 'F') and (ingestion_date < '201911291611') and start_t is not null and finish_t is not null and lot_id is not null and wafer_id is not null and flow_id is not null and die_x is not null and die_y is not null and part_seq is not null) (type: boolean)
                      Statistics:Num rows: 4584 Data size: 5321735 Basic stats: COMPLETE Column stats: NONE
                      TableScan [TS_3747656]
                        alias:db
                        Statistics:Num rows: 7040091 Data size: 8173102308 Basic stats: COMPLETE Column stats: NONE


43 rows selected.    

Notice the "Column stats: NONE" written everywhere in the explain plan...

Or graphically using the Hive View in Ambari. To do so, copy/paste the query without the EXPLAIN keyword and push the Visual Explain button. It will generate the plan without executing the query:

hive_statistics01
hive_statistics01

So giving:

hive_statistics02
hive_statistics02

For query02 the explain plan is a bit longer, but we know by experience that plan length has nothing to do with execution time:

Explain
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan not optimized by CBO.

Vertex dependency in root stage
Map 2 <- Map 1 (BROADCAST_EDGE)

Stage-0
    Fetch Operator
      limit:-1
      Stage-1
          Map 2
          File Output Operator [FS_3747687]
            compressed:false
            Statistics:Num rows: 20969516 Data size: 3690634816 Basic stats: COMPLETE Column stats: PARTIAL
            table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
            Select Operator [SEL_3747686]
                outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11"]
                Statistics:Num rows: 20969516 Data size: 3690634816 Basic stats: COMPLETE Column stats: PARTIAL
                Map Join Operator [MAPJOIN_3747710]
                |  condition map:[{"":"Inner Join 0 to 1"}]
                |  HybridGraceHashJoin:true
                |  keys:{"Map 2":"'C2WF' (type: string), 'Q926' (type: string), 9 (type: int), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)","Map 1":"'C2WF' (type: string), 'Q926' (type: string), 9 (type: int), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)"}
                |  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9"]
                |  Statistics:Num rows: 20969516 Data size: 53094814512 Basic stats: COMPLETE Column stats: PARTIAL
                |<-Map 1 [BROADCAST_EDGE]
                |  Reduce Output Operator [RS_3747681]
                |     key expressions:'C2WF' (type: string), 'Q926' (type: string), 9 (type: int), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)
                |     Map-reduce partition columns:'C2WF' (type: string), 'Q926' (type: string), 9 (type: int), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)
                |     sort order:+++++++++++
                |     Statistics:Num rows: 9151 Data size: 1647180 Basic stats: COMPLETE Column stats: PARTIAL
                |     value expressions:parameters (type: array), ingestion_date (type: string)
                |     Filter Operator [FIL_3747690]
                |        predicate:(start_t is not null and finish_t is not null and lot_id is not null and wafer_id is not null and flow_id is not null and die_x is not null and die_y is not null and part_seq is not null and (ingestion_date < '201911291611')) (type: boolean)
                |        Statistics:Num rows: 9151 Data size: 1647180 Basic stats: COMPLETE Column stats: PARTIAL
                |        TableScan [TS_3747678]
                |           alias:dpf
                |           Statistics:Num rows: 7027779 Data size: 377989801184 Basic stats: COMPLETE Column stats: PARTIAL
                |  Dynamic Partitioning Event Operator [EVENT_3747703]
                |     Statistics:Num rows: 1477 Data size: 265860 Basic stats: COMPLETE Column stats: PARTIAL
                |     Group By Operator [GBY_3747702]
                |        keys:_col0 (type: string)
                |        outputColumnNames:["_col0"]
                |        Statistics:Num rows: 1477 Data size: 265860 Basic stats: COMPLETE Column stats: PARTIAL
                |        Select Operator [SEL_3747701]
                |           outputColumnNames:["_col0"]
                |           Statistics:Num rows: 9151 Data size: 1647180 Basic stats: COMPLETE Column stats: PARTIAL
                |            Please refer to the previous Filter Operator [FIL_3747690]
                |  Dynamic Partitioning Event Operator [EVENT_3747706]
                |     Statistics:Num rows: 1477 Data size: 265860 Basic stats: COMPLETE Column stats: PARTIAL
                |     Group By Operator [GBY_3747705]
                |        keys:_col0 (type: string)
                |        outputColumnNames:["_col0"]
                |        Statistics:Num rows: 1477 Data size: 265860 Basic stats: COMPLETE Column stats: PARTIAL
                |        Select Operator [SEL_3747704]
                |           outputColumnNames:["_col0"]
                |           Statistics:Num rows: 9151 Data size: 1647180 Basic stats: COMPLETE Column stats: PARTIAL
                |            Please refer to the previous Filter Operator [FIL_3747690]
                |  Dynamic Partitioning Event Operator [EVENT_3747709]
                |     Statistics:Num rows: 1477 Data size: 265860 Basic stats: COMPLETE Column stats: PARTIAL
                |     Group By Operator [GBY_3747708]
                |        keys:_col0 (type: int)
                |        outputColumnNames:["_col0"]
                |        Statistics:Num rows: 1477 Data size: 265860 Basic stats: COMPLETE Column stats: PARTIAL
                |        Select Operator [SEL_3747707]
                |           outputColumnNames:["_col0"]
                |           Statistics:Num rows: 9151 Data size: 1647180 Basic stats: COMPLETE Column stats: PARTIAL
                |            Please refer to the previous Filter Operator [FIL_3747690]
                |<-Filter Operator [FIL_3747691]
                      predicate:(start_t is not null and finish_t is not null and lot_id is not null and wafer_id is not null and flow_id is not null and die_x is not null and die_y is not null and part_seq is not null and (pass_fail = 'F') and (ingestion_date < '201911291611')) (type: boolean)
                      Statistics:Num rows: 4583 Data size: 824940 Basic stats: COMPLETE Column stats: PARTIAL
                      TableScan [TS_3747679]
                        alias:db
                        Statistics:Num rows: 7040091 Data size: 8173102308 Basic stats: COMPLETE Column stats: PARTIAL


73 rows selected.    

Or graphically:

hive_statistics03
hive_statistics03

Even if we would expect the Hive optimizer to be clever and apply predicate pushdown, the explain plans are different, and the one with the WHERE clause looks less efficient on paper.

Overall it is NOT AT ALL a best practice to put the filters in a separate WHERE clause of a JOIN operation, because the filter will be applied after the join and as such increases the volume of rows to be joined! It will even produce wrong results in the particular case of an OUTER JOIN, as clearly explained in the official documentation: with a LEFT OUTER JOIN, for example, a WHERE predicate on a column of the right-hand table discards the NULL-extended rows and silently turns the query into an inner join.

hive_statistics04
hive_statistics04

I have tried to execute the two queries replacing the column list with a COUNT(*) to avoid the network penalty and, to be honest, I have not seen any big difference. Of course the execution time is linked to the usage of the YARN queue, and as the cluster is already widely used I was not alone on it:

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> select count(*)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> from prod_ews_refined.tbl_die_param_flow dpf join prod_ews_refined.tbl_die_bin db  on
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> dpf.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.pass_fail="F"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.fab=db.fab
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition=db.lot_partition
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.start_t=db.start_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.finish_t= db.finish_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_id=db.lot_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.wafer_id=db.wafer_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.flow_id=db.flow_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_x=db.die_x
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_y=db.die_y
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.part_seq=db.part_seq
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.ingestion_date<"201911291611"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.ingestion_date<"201911291611";
INFO  : Tez session hasn't been created yet. Opening session
INFO  : Dag name: select count(*)
from ..._date<"201911291611"(Stage-1)
INFO  : Setting tez.task.scale.memory.reserve-fraction to 0.30000001192092896
INFO  : Status: Running (Executing on YARN cluster with App id application_1574697549080_17084)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      6          6        0        0       0       0
Map 2 ..........   SUCCEEDED      6          6        0        0       0       0
Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 726.66 s
--------------------------------------------------------------------------------
+---------+--+
|   _c0   |
+---------+--+
| 322905  |
+---------+--+
1 row selected (735.944 seconds)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> select count(*)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> from prod_ews_refined.tbl_die_param_flow dpf join prod_ews_refined.tbl_die_bin db
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> on dpf.fab=db.fab
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition=db.lot_partition
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_partition = db.die_partition
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.start_t=db.start_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.finish_t= db.finish_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_id=db.lot_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.wafer_id=db.wafer_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.flow_id=db.flow_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_x=db.die_x
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_y=db.die_y
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.part_seq=db.part_seq
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> where dpf.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.pass_fail="F"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.ingestion_date<"201911291611"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.ingestion_date<"201911291611";
INFO  : Session is already open
INFO  : Dag name: select count(*)
from ..._date<"201911291611"(Stage-1)
INFO  : Setting tez.task.scale.memory.reserve-fraction to 0.30000001192092896
INFO  : Status: Running (Executing on YARN cluster with App id application_1574697549080_17084)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      6          6        0        0       0       0
Map 2 ..........   SUCCEEDED      6          6        0        0       0       0
Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 354.95 s
--------------------------------------------------------------------------------
+---------+--+
|   _c0   |
+---------+--+
| 322905  |
+---------+--+
1 row selected (355.482 seconds)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> select count(*)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> from prod_ews_refined.tbl_die_param_flow dpf join prod_ews_refined.tbl_die_bin db  on
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> dpf.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.pass_fail="F"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.fab=db.fab
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition=db.lot_partition
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.start_t=db.start_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.finish_t= db.finish_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_id=db.lot_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.wafer_id=db.wafer_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.flow_id=db.flow_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_x=db.die_x
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_y=db.die_y
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.part_seq=db.part_seq
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.ingestion_date<"201911291611"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.ingestion_date<"201911291611";
INFO  : Session is already open
INFO  : Dag name: select count(*)
from ..._date<"201911291611"(Stage-1)
INFO  : Setting tez.task.scale.memory.reserve-fraction to 0.30000001192092896
INFO  : Tez session was closed. Reopening...
INFO  : Session re-established.
INFO  : Status: Running (Executing on YARN cluster with App id application_1574697549080_17169)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      6          6        0        0       0       0
Map 2 ..........   SUCCEEDED      6          6        0        0       0       0
Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 1108.09 s
--------------------------------------------------------------------------------
+---------+--+
|   _c0   |
+---------+--+
| 322905  |
+---------+--+
1 row selected (1135.466 seconds)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> select count(*)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> from prod_ews_refined.tbl_die_param_flow dpf join prod_ews_refined.tbl_die_bin db
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> on dpf.fab=db.fab
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition=db.lot_partition
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_partition = db.die_partition
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.start_t=db.start_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.finish_t= db.finish_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_id=db.lot_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.wafer_id=db.wafer_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.flow_id=db.flow_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_x=db.die_x
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_y=db.die_y
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.part_seq=db.part_seq
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> where dpf.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.pass_fail="F"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.ingestion_date<"201911291611"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.ingestion_date<"201911291611";
INFO  : Session is already open
INFO  : Dag name: select count(*)
from ..._date<"201911291611"(Stage-1)
INFO  : Setting tez.task.scale.memory.reserve-fraction to 0.30000001192092896
INFO  : Status: Running (Executing on YARN cluster with App id application_1574697549080_17169)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      6          6        0        0       0       0
Map 2 ..........   SUCCEEDED      6          6        0        0       1       0
Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 1273.38 s
--------------------------------------------------------------------------------
+---------+--+
|   _c0   |
+---------+--+
| 322905  |
+---------+--+
1 row selected (1273.899 seconds)

The problem is that for other partitions of the source tables the query was not running at all and failed with the classical Out Of Memory (OOM) error, leaving us with empty partitions in the final tables...

The problem is gone with good Hive statistics

We noticed the Column stats: PARTIAL and Column stats: NONE mentions in the explain plans, so I decided to check column statistics with a command like:

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> describe formatted prod_ews_refined.tbl_die_param_flow ingestion_date partition(fab='C2WF',lot_partition='Q926',die_partition='9');
+-------------------------+-----------------------+-----------------------+-------+------------+-----------------+--------------+--------------+------------+-------------+----------+--+
|        col_name         |       data_type       |          min          |  max  | num_nulls  | distinct_count  | avg_col_len  | max_col_len  | num_trues  | num_falses  | comment  |
+-------------------------+-----------------------+-----------------------+-------+------------+-----------------+--------------+--------------+------------+-------------+----------+--+
| # col_name              | data_type             | comment               |       | NULL       | NULL            | NULL         | NULL         | NULL       | NULL        | NULL     |
|                         | NULL                  | NULL                  | NULL  | NULL       | NULL            | NULL         | NULL         | NULL       | NULL        | NULL     |
| ingestion_date          | string                | from deserializer     | NULL  | NULL       | NULL            | NULL         | NULL         | NULL       | NULL        | NULL     |
+-------------------------+-----------------------+-----------------------+-------+------------+-----------------+--------------+--------------+------------+-------------+----------+--+

As you can see we had no statistics on columns, due to our obvious lack of an ANALYZE command after each ingestion of new figures...

I gathered them with:

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> analyze table tbl_die_bin partition(fab="C2WF",lot_partition="Q926",die_partition=9) compute statistics for columns;
INFO  : Tez session hasn't been created yet. Opening session
INFO  : Dag name: analyze table tbl_die_bin partitio...columns(Stage-0)
INFO  : Status: Running (Executing on YARN cluster with App id application_1574697549080_20216)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      9          9        0        0       0       0
Reducer 2 ......   SUCCEEDED     31         31        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 64.75 s
--------------------------------------------------------------------------------
No rows affected (73.7 seconds)

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> analyze table tbl_die_param_flow partition(fab="C2WF",lot_partition="Q926",die_partition=9) compute statistics for columns start_t,finish_t,lot_id,wafer_id,flow_id,die_x,die_y,part_seq,ingestion_date;
INFO  : Tez session hasn't been created yet. Opening session
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED     39         39        0        0       0       0
Reducer 2 ......   SUCCEEDED    253        253        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 198.48 s
--------------------------------------------------------------------------------
No rows affected (369.772 seconds)

There is the special case of the array<string> datatype, whose columns you must explicitly remove from the column list to avoid this:

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> analyze table tbl_die_param_flow partition(fab="C2WF",lot_partition="Q926",die_partition=9) compute statistics for columns;
Error: Error while compiling statement: FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array< string> is passed. (state=42000,code=40000)

We now have accurate statistics:

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> describe formatted prod_ews_refined.tbl_die_param_flow ingestion_date partition(fab="C2WF",lot_partition="Q926",die_partition=9);
+-------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+--+
|        col_name         |       data_type       |          min          |          max          |       num_nulls       |    distinct_count     |      avg_col_len      |      max_col_len      |       num_trues       |      num_falses       |        comment        |
+-------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+--+
| # col_name              | data_type             | min                   | max                   | num_nulls             | distinct_count        | avg_col_len           | max_col_len           | num_trues             | num_falses            | comment               |
|                         | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  |
| ingestion_date          | string                |                       |                       | 0                     | 510                   | 12.0                  | 12                    |                       |                       | from deserializer     |
+-------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+--+
3 rows selected (0.487 seconds)

The explain plan for query01 and query02 is now exactly the same:

Explain
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan not optimized by CBO.

Vertex dependency in root stage
Map 1 <- Map 2 (BROADCAST_EDGE)

Stage-0
   Fetch Operator
      limit:-1
      Stage-1
         Map 1
         File Output Operator [FS_3981081]
            compressed:false
            Statistics:Num rows: 100157 Data size: 76720262 Basic stats: COMPLETE Column stats: PARTIAL
            table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
            Select Operator [SEL_3981080]
               outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11"]
               Statistics:Num rows: 100157 Data size: 76720262 Basic stats: COMPLETE Column stats: PARTIAL
               Map Join Operator [MAPJOIN_3981086]
               |  condition map:[{"":"Inner Join 0 to 1"}]
               |  HybridGraceHashJoin:true
               |  keys:{"Map 2":"'C2WF' (type: string), 'Q926' (type: string), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)","Map 1":"'C2WF' (type: string), 'Q926' (type: string), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)"}
               |  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9"]
               |  Statistics:Num rows: 100157 Data size: 251394070 Basic stats: COMPLETE Column stats: PARTIAL
               |<-Map 2 [BROADCAST_EDGE] vectorized
               |  Reduce Output Operator [RS_3981089]
               |     key expressions:'C2WF' (type: string), 'Q926' (type: string), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)
               |     Map-reduce partition columns:'C2WF' (type: string), 'Q926' (type: string), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)
               |     sort order:++++++++++
               |     Statistics:Num rows: 1184068 Data size: 799245900 Basic stats: COMPLETE Column stats: COMPLETE
               |     Filter Operator [FIL_3981088]
               |        predicate:((pass_fail = 'F') and (ingestion_date < '201911291611') and start_t is not null and finish_t is not null and lot_id is not null and wafer_id is not null and flow_id is not null and die_x is not null and die_y is not null and part_seq is not null) (type: boolean)
               |        Statistics:Num rows: 1184068 Data size: 799245900 Basic stats: COMPLETE Column stats: COMPLETE
               |        TableScan [TS_3981067]
               |           alias:db
               |           Statistics:Num rows: 7104410 Data size: 8247850962 Basic stats: COMPLETE Column stats: COMPLETE
               |<-Filter Operator [FIL_3981084]
                     predicate:((ingestion_date < '201911291611') and start_t is not null and finish_t is not null and lot_id is not null and wafer_id is not null and flow_id is not null and die_x is not null and die_y is not null and part_seq is not null) (type: boolean)
                     Statistics:Num rows: 2364032 Data size: 1394778880 Basic stats: COMPLETE Column stats: PARTIAL
                     TableScan [TS_3981066]
                        alias:dpf
                        Statistics:Num rows: 7092098 Data size: 379282949029 Basic stats: COMPLETE Column stats: PARTIAL


42 rows selected.

Or graphically:

hive_statistics05
hive_statistics05

And the execution time is greatly reduced (by around a factor of 10):

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> select count(*)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> from prod_ews_refined.tbl_die_param_flow dpf join prod_ews_refined.tbl_die_bin db
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> on dpf.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.pass_fail="F"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.fab=db.fab
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition=db.lot_partition
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.start_t=db.start_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.finish_t= db.finish_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_id=db.lot_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.wafer_id=db.wafer_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.flow_id=db.flow_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_x=db.die_x
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_y=db.die_y
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.part_seq=db.part_seq
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.ingestion_date<"201911291611"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.ingestion_date<"201911291611";
INFO  : Session is already open
INFO  : Dag name: select count(*)
from ..._date<"201911291611"(Stage-1)
INFO  : Setting tez.task.scale.memory.reserve-fraction to 0.30000001192092896
INFO  : Tez session was closed. Reopening...
INFO  : Session re-established.
INFO  : Status: Running (Executing on YARN cluster with App id application_1574697549080_23102)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED     38         38        0        0       0       0
Map 3 ..........   SUCCEEDED      6          6        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 146.74 s
--------------------------------------------------------------------------------
+---------+--+
|   _c0   |
+---------+--+
| 322905  |
+---------+--+
1 row selected (153.287 seconds)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> select count(*)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> from prod_ews_refined.tbl_die_param_flow dpf join prod_ews_refined.tbl_die_bin db
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> on dpf.fab=db.fab
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition=db.lot_partition
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_partition = db.die_partition
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.start_t=db.start_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.finish_t= db.finish_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_id=db.lot_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.wafer_id=db.wafer_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.flow_id=db.flow_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_x=db.die_x
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_y=db.die_y
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.part_seq=db.part_seq
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> where dpf.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.pass_fail="F"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.ingestion_date<"201911291611"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.ingestion_date<"201911291611";
INFO  : Session is already open
INFO  : Dag name: select count(*)
from ..._date<"201911291611"(Stage-1)
INFO  : Setting tez.task.scale.memory.reserve-fraction to 0.30000001192092896
INFO  : Status: Running (Executing on YARN cluster with App id application_1574697549080_23102)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED     38         38        0        0       0       0
Map 3 ..........   SUCCEEDED      6          6        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 126.12 s
--------------------------------------------------------------------------------
+---------+--+
|   _c0   |
+---------+--+
| 322905  |
+---------+--+
1 row selected (126.614 seconds)
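
The lesson for us is to make statistics gathering part of the ingestion flow itself. As a minimal sketch (assuming $HIVE_JDBC_URL holds your HiveServer2 connection string and that the partition values of the freshly loaded partition are known; adapt table, column list and partition spec to your environment), the loading script could simply end with:

# Hypothetical post-ingestion step: refresh column statistics on the partition that was just loaded
beeline -u "$HIVE_JDBC_URL" \
  -e "analyze table prod_ews_refined.tbl_die_bin partition(fab='C2WF',lot_partition='Q926',die_partition=9) compute statistics for columns;"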

References

The post On the importance to have good Hive statistics on your tables appeared first on IT World.

How to use Livy server to submit Spark job through a REST interface
https://blog.yannickjaquier.com/hadoop/how-to-use-livy-server-to-submit-spark-job-through-a-rest-interface.html
Mon, 24 Feb 2020 09:02:10 +0000


Table of contents

Preamble

Since we started our Hadoop journey, and more particularly since we started developing Spark jobs in Scala and Python, having an efficient development environment has always been a challenge.

What we currently do is remote editing via SSH FS plugins in VSCode and submitting scripts in a shell terminal directly from one of our edge nodes.

VSCode is a wonderful tool but it lacks the code completion, suggestions and tips that increase your productivity, at least for PySpark and Scala. Recently I succeeded in configuring the community edition of Intellij IDEA to submit jobs from my desktop using the data of our Hadoop cluster. Aside from this configuration I have also set up a local Spark environment as well as the sbt compiler for Scala jobs. I will soon share an article on this…

One of my teammates suggested the use of Livy, so I decided to have a look, even if in the end I have been a little disappointed by its capabilities…

Livy is a REST interface through which you interact with a Spark cluster. In our Hortonworks HDP 2.6 installation the Livy server comes pre-installed, and in short I had nothing to do to install or configure it. If you are in a different configuration you might have to install and configure the Livy server yourself.

Configuration

On our Hadoop cluster the Livy server came with the Spark installation and is already configured as such:

livy01
livy01

To find out on which server it runs, and whether the process is running, follow the link on the Spark service home page as follows:

livy02
livy02

You finally get the server name and the process status where the Livy server is running:

livy03
livy03

You can access it using your preferred browser:

livy04
livy04

Curl testing

Curl is, as you know, a tool to access resources across a network. Here we will use it to reach the Livy REST API. Interactive commands against this REST API can be issued using Scala, Python and R.

Even if the official documentation is built around Python scripts, I have been able to resolve an annoying error using curl. I started with:

# curl -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" livy_server.domain.com:8999/sessions

Error 400

HTTP ERROR: 400
Problem accessing /sessions. Reason:

    Missing Required Header for CSRF protection.

Powered by Jetty://

I found that the livy.server.csrf_protection.enabled parameter was set to true in my configuration, so I had to specify an extra request header using the X-Requested-By: parameter:

# curl -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" -H "X-Requested-By: Yannick" livy_server.domain.com:8999/sessions
{"id":8,"appId":null,"owner":null,"proxyUser":null,"state":"starting","kind":"pyspark","appInfo":{"driverLogUrl":null,"sparkUiUrl":null},
"log":["stdout: ","\nstderr: ","\nYARN Diagnostics: "]}

It returns that session id 8 has been created for me, which can be confirmed graphically:

livy05
livy05

Session is in idle state:

# curl http://livy_server.domain.com:8999/sessions/8
{"id":8,"appId":"application_1565718945091_261784","owner":null,"proxyUser":null,"state":"idle","kind":"pyspark",
"appInfo":{"driverLogUrl":"http://data_node01.domain.com:8042/node/containerlogs/container_e212_1565718945091_261784_01_000001/livy",
"sparkUiUrl":"http://name_node01.domain.com:8088/proxy/application_1565718945091_261784/"},
"log":["stdout: ","\nstderr: ","Warning: Master yarn-cluster is deprecated since 2.0. Please use master \"yarn\" with specified deploy mode instead.","\nYARN Diagnostics: "]}

Let's submit a bit of work to it:

# curl http://livy_server.domain.com:8999/sessions/8/statements -H "X-Requested-By: Yannick" -X POST -H 'Content-Type: application/json' -d '{"code":"2 + 2"}'
{"id":0,"code":"2 + 2","state":"waiting","output":null,"progress":0.0}

We can get the result using:

# curl http://livy_server.domain.com:8999/sessions/8/statements
{"total_statements":1,"statements":[{"id":0,"code":"2 + 2","state":"available","output":{"status":"ok","execution_count":0,"data":{"text/plain":"4"}},"progress":1.0}]}

Or graphically:

livy06
livy06
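
When scripting this, you will typically have to poll until the statement state switches from waiting or running to available. Below is a minimal sketch against the same endpoint as above (session id 8 and the 5-second interval are arbitrary, and the grep check is deliberately naive as it matches any available statement in the session):

# poll the statements endpoint until a result is available, then print it
while ! curl -s http://livy_server.domain.com:8999/sessions/8/statements | grep -q '"state":"available"'; do
  sleep 5
done
curl -s http://livy_server.domain.com:8999/sessions/8/statements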

To optionally clean up the session:

# curl http://livy_server.domain.com:8999/sessions/8 -H "X-Requested-By: Yannick" -X DELETE
{"msg":"deleted"}

Python testing

Python is what the Livy documentation pushes by default to test the service. I started by installing the requests package (as well as upgrading pip):

PS D:\> python -m pip install --upgrade pip --user
Collecting pip
  Downloading https://files.pythonhosted.org/packages/00/b6/9cfa56b4081ad13874b0c6f96af8ce16cfbc1cb06bedf8e9164ce5551ec1/pip-19.3.1-py2.py3-none-any.whl (1.4MB)
     |████████████████████████████████| 1.4MB 2.2MB/s
Installing collected packages: pip
  Found existing installation: pip 19.2.3
    Uninstalling pip-19.2.3:
Installing collected packages: pip
Successfully installed pip-19.3.1
WARNING: You are using pip version 19.2.3, however version 19.3.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

Requests package installation:

PS D:\> python -m pip install requests --user
Collecting requests
  Downloading https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl (57kB)
     |████████████████████████████████| 61kB 563kB/s
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (from requests)
  Downloading https://files.pythonhosted.org/packages/b4/40/a9837291310ee1ccc242ceb6ebfd9eb21539649f193a7c8c86ba15b98539/urllib3-1.25.7-py2.py3-none-any.whl (125kB)
     |████████████████████████████████| 133kB 6.4MB/s
Collecting certifi>=2017.4.17 (from requests)
  Downloading https://files.pythonhosted.org/packages/18/b0/8146a4f8dd402f60744fa380bc73ca47303cccf8b9190fd16a827281eac2/certifi-2019.9.11-py2.py3-none-any.whl (154kB)
     |████████████████████████████████| 163kB 3.3MB/s
Collecting chardet<3.1.0,>=3.0.2 (from requests)
  Downloading https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl (133kB)
     |████████████████████████████████| 143kB 6.4MB/s
Collecting idna<2.9,>=2.5 (from requests)
  Downloading https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl (58kB)
     |████████████████████████████████| 61kB 975kB/s
Installing collected packages: urllib3, certifi, idna, chardet, requests
  WARNING: The script chardetect.exe is installed in 'C:\Users\yjaquier\AppData\Roaming\Python\Python38\Scripts' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed certifi-2019.9.11 chardet-3.0.4 idna-2.8 requests-2.22.0 urllib3-1.25.7

The Python testing has been done with Python 3.8 installed on my Windows 10 machine. Obviously you can also see graphically what’s going on, but as it is exactly the same as with curl I will not share the output again.

Session creation:

>>> import json, pprint, requests, textwrap
>>> host = 'http://livy_server.domain.com:8999'
>>> data = {'kind': 'spark'}
>>> headers = {'Content-Type': 'application/json', 'X-Requested-By': 'Yannick'}
>>> r = requests.post(host + '/sessions', data=json.dumps(data), headers=headers)
>>> r.json()
{'id': 9, 'appId': None, 'owner': None, 'proxyUser': None, 'state': 'starting', 'kind': 'spark', 'appInfo': {'driverLogUrl': None, 'sparkUiUrl': None},
'log': ['stdout: ', '\nstderr: ', '\nYARN Diagnostics: ']}

Session status:

>>> session_url = host + r.headers['location']
>>> r = requests.get(session_url, headers=headers)
>>> r.json()
{'id': 9, 'appId': 'application_1565718945091_261793', 'owner': None, 'proxyUser': None, 'state': 'idle', 'kind': 'spark',
'appInfo': {'driverLogUrl': 'http://data_node08.domain.com:8042/node/containerlogs/container_e212_1565718945091_261793_01_000001/livy',
'sparkUiUrl': 'http://sparkui_server.domain.com:8088/proxy/application_1565718945091_261793/'},
'log': ['stdout: ', '\nstderr: ', 'Warning: Master yarn-cluster is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.', '\nYARN Diagnostics: ']}

Remark:
Note the URL to directly access your Spark UI server to get an even better description of your workload…

Job submission:

>>> statements_url = session_url + '/statements'
>>> data = {'code': '1 + 1'}
>>> r = requests.post(statements_url, data=json.dumps(data), headers=headers)
>>> r.json()
{'id': 0, 'code': '1 + 1', 'state': 'waiting', 'output': None, 'progress': 0.0}

Job result:

>>> statement_url = host + r.headers['location']
>>> r = requests.get(statement_url, headers=headers)
>>> pprint.pprint(r.json())
{'code': '1 + 1',
 'id': 0,
 'output': {'data': {'text/plain': 'res0: Int = 2'},
            'execution_count': 0,
            'status': 'ok'},
 'progress': 1.0,
 'state': 'available'}

Optional session deletion:

>>> r = requests.delete(session_url, headers=headers)
>>> r.json()
{'msg': 'deleted'}

References

The post How to use Livy server to submit Spark job through a REST interface appeared first on IT World.
