PySpark and Spark Scala Jupyter kernels cluster integration

Preamble

Even if the standard notebook tool for data scientists on a Hortonworks Data Platform (HDP) cluster is Zeppelin Notebook, they will most probably want to use Jupyter Lab/Notebook, which has quite a lot of momentum in this domain.

As you might guess, the new Hive Warehouse Connector (HWC) used to access Hive tables from Spark brings a bunch of problems when it comes to configuring Jupyter Lab/Notebook correctly…

In short, the idea is to add Jupyter kernels on top of the default Python 3 one. You can either create them yourself by writing a kernel.json file, or install one of the packages that integrate the language you want.

In this article I assume that you already have a working Anaconda installation on your server. The installation is pretty straightforward, just execute the Anaconda3-2019.10-Linux-x86_64.sh shell script (in my case) and acknowledge the licence information displayed.

JupyterHub installation

If, like me, you are behind a corporate proxy, the first thing to do is to configure it so that conda packages can be downloaded over the Internet:

(base) [root@server ~]# cat .condarc
auto_activate_base: false
 
proxy_servers:
    http: http://account:password@proxy_server.domain.com:proxy_port/
    https: http://account:password@proxy_server.domain.com:proxy_port/
 
 
ssl_verify: False

Create the JupyterHub conda environment with (chosen name is totally up to you):

(base) [root@server ~]# conda create --name jupyterhub
Collecting package metadata (current_repodata.json): done
Solving environment: done
 
 
==> WARNING: A newer version of conda exists. <==
  current version: 4.7.12
  latest version: 4.8.2
 
Please update conda by running
 
    $ conda update -n base -c defaults conda
 
 
 
## Package Plan ##
 
  environment location: /opt/anaconda3/envs/jupyterhub
 
 
 
Proceed ([y]/n)? y
 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate jupyterhub
#
# To deactivate an active environment, use
#
#     $ conda deactivate

If, like me, you received a warning about an obsolete release of conda, upgrade it with:

(base) [root@server ~]# conda update -n base -c defaults conda
Collecting package metadata (current_repodata.json): done
Solving environment: done
 
## Package Plan ##
 
  environment location: /opt/anaconda3
 
  added / updated specs:
    - conda
 
 
The following packages will be downloaded:
 
    package                    |            build
    ---------------------------|-----------------
    backports.functools_lru_cache-1.6.1|             py_0          11 KB
    conda-4.8.2                |           py37_0         2.8 MB
    future-0.18.2              |           py37_0         639 KB
    ------------------------------------------------------------
                                           Total:         3.5 MB
 
The following packages will be UPDATED:
 
  backports.functoo~                               1.5-py_2 --> 1.6.1-py_0
  conda                                       4.7.12-py37_0 --> 4.8.2-py37_0
  future                                      0.17.1-py37_0 --> 0.18.2-py37_0
 
 
Proceed ([y]/n)? y
 
 
Downloading and Extracting Packages
future-0.18.2        | 639 KB    | ######################################################################################### | 100%
backports.functools_ | 11 KB     | ######################################################################################### | 100%
conda-4.8.2          | 2.8 MB    | ######################################################################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
 
(base) [root@server ~]# conda -V
conda 4.8.2

Activate your newly created Conda environment and install jupyterhub and notebook inside it:

(base) [root@server ~]# conda activate jupyterhub
(jupyterhub) [root@server ~]# conda install -c conda-forge jupyterhub
(jupyterhub) [root@server ~]# conda install -c conda-forge notebook

Remark
It is also possible to install the newer JupyterLab in JupyterHub instead of Jupyter Notebook. If you do so, you have to set c.Spawner.default_url = '/lab' to instruct JupyterHub to load JupyterLab instead of Jupyter Notebook. Below I mix screenshots of both, but clearly the future is JupyterLab and not Jupyter Notebook. JupyterHub itself just provides a multi-user environment.

Install JupyterLab with:

conda install -c conda-forge jupyterlab

Start JupyterHub by simply typing the jupyterhub command and access it at http://server.domain.com:8000. All options can obviously be configured…

As an example, here is how to activate HTTPS for your JupyterHub using a self-signed certificate (free). It is not optimal but still better than plain HTTP…

Generate the config file using:

(jupyterhub) [root@server ~]# jupyterhub --generate-config -f jupyterhub_config.py
Writing default config to: jupyterhub_config.py

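Generate the key and certificate using the command below (taken from the OpenSSL Cookbook). Because of the -x509 option the output file is a self-signed certificate, not a certificate signing request, despite its .csr extension: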

(jupyterhub) [root@server ~]# openssl req -new -newkey rsa:2048 -x509 -nodes -keyout root-ocsp.key -out root-ocsp.csr
Generating a RSA private key
...................................+++++
........................................+++++
writing new private key to 'root-ocsp.key'
-----
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [AU]:CH
State or Province Name (full name) [Some-State]:Geneva
Locality Name (eg, city) []:Geneva
Organization Name (eg, company) [Internet Widgits Pty Ltd]:Company Name
Organizational Unit Name (eg, section) []:
Common Name (e.g. server FQDN or YOUR name) []:servername.domain.com
Email Address []:

Finally, I have set only the parameters below in the jupyterhub_config.py configuration file (three for HTTPS, plus the optional default URL):

c.JupyterHub.bind_url = 'https://servername.domain.com'
c.JupyterHub.ssl_cert = 'root-ocsp.csr'
c.JupyterHub.ssl_key = 'root-ocsp.key'
c.Spawner.default_url = '/lab' # Unset to keep Jupyter Notebook
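
Before starting JupyterHub it can help to confirm that the certificate and key files really belong together. The following is only a minimal sketch using Python's standard ssl module; the function name check_cert_pair is hypothetical, and the file names are the ones generated above.

```python
import ssl


def check_cert_pair(cert_path, key_path):
    """Return True if the certificate and private key load as a matching pair."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    try:
        # Raises ssl.SSLError on a mismatch and FileNotFoundError on a missing
        # file; both are subclasses of OSError.
        context.load_cert_chain(certfile=cert_path, keyfile=key_path)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    # File names assumed to be in the current directory, as in the article.
    print(check_cert_pair("root-ocsp.csr", "root-ocsp.key"))
```

A result of False means JupyterHub would most likely fail to start its HTTPS listener with these two files.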

Simply execute the jupyterhub command from the directory that contains jupyterhub_config.py (or pass the file with -f) to run JupyterHub. Of course, creating a service that starts at server boot is more than recommended.

Then, when accessing the HTTPS URL, you see the login window without the HTTP warning (you will have to add the self-signed certificate as trusted in your browser).

Jupyter kernels manual configuration

Log in with an existing OS account on the server where JupyterHub is running.

Create a new Python 3 notebook (yannick_python.ipynb in my example, but I have many others).

At this point a dummy Python example should work.

Obviously our goal here is not to do Python but Spark. To manually create a PySpark kernel, create the kernel directory under the kernels directory of your JupyterHub installation:

[root@server ~]# cd /opt/anaconda3/envs/jupyterhub/share/jupyter/kernels
(jupyterhub) [root@server kernels]# ll
total 0
drwxr-xr-x 2 root root 69 Feb 18 14:20 python3
(jupyterhub) [root@server kernels]# mkdir pyspark_kernel

Then create the kernel file below in that directory:

(jupyterhub) [root@server pyspark_kernel]# cat kernel.json
{
  "display_name": "PySpark_kernel",
  "language": "python",
  "argv": [ "/opt/anaconda3/bin/python", "-m", "ipykernel", "-f", "{connection_file}" ],
  "env": {
    "SPARK_HOME": "/usr/hdp/current/spark2-client/",
    "PYSPARK_PYTHON": "/opt/anaconda3/bin/python",
    "PYTHONPATH": "/usr/hdp/current/spark2-client/python/:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip",
    "PYTHONSTARTUP": "/usr/hdp/current/spark2-client/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master yarn --queue llap --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip pyspark-shell"
  }
}
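
If you prefer not to hand-edit the long PYSPARK_SUBMIT_ARGS string, the same file can be generated. This is a sketch only: the helper names build_kernel_spec and write_kernel_spec are hypothetical, and the paths and versions are simply the ones used in this article, so adapt them to your HDP release.

```python
import json
from pathlib import Path

# Values taken from the kernel.json above; adjust to your own cluster.
SPARK_HOME = "/usr/hdp/current/spark2-client"
HWC_HOME = "/usr/hdp/current/hive_warehouse_connector"
HWC_VERSION = "1.0.0.3.1.4.0-315"
PYTHON_BIN = "/opt/anaconda3/bin/python"


def build_kernel_spec(display_name="PySpark_kernel", queue="llap"):
    """Build the kernel.json content as a Python dictionary."""
    submit_args = " ".join([
        "--master yarn",
        f"--queue {queue}",
        f"--jars {HWC_HOME}/hive-warehouse-connector-assembly-{HWC_VERSION}.jar",
        f"--py-files {HWC_HOME}/pyspark_hwc-{HWC_VERSION}.zip",
        "pyspark-shell",
    ])
    return {
        "display_name": display_name,
        "language": "python",
        "argv": [PYTHON_BIN, "-m", "ipykernel", "-f", "{connection_file}"],
        "env": {
            "SPARK_HOME": f"{SPARK_HOME}/",
            "PYSPARK_PYTHON": PYTHON_BIN,
            "PYTHONPATH": f"{SPARK_HOME}/python/:{SPARK_HOME}/python/lib/py4j-0.10.7-src.zip",
            "PYTHONSTARTUP": f"{SPARK_HOME}/python/pyspark/shell.py",
            "PYSPARK_SUBMIT_ARGS": submit_args,
        },
    }


def write_kernel_spec(kernel_dir, spec):
    """Write spec as kernel.json inside kernel_dir, creating it if needed."""
    kernel_dir = Path(kernel_dir)
    kernel_dir.mkdir(parents=True, exist_ok=True)
    target = kernel_dir / "kernel.json"
    target.write_text(json.dumps(spec, indent=2) + "\n")
    return target
```

For example, write_kernel_spec("/opt/anaconda3/envs/jupyterhub/share/jupyter/kernels/pyspark_kernel", build_kernel_spec()) would recreate the file shown above.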

You can confirm it is taken into account with:

(jupyterhub) [root@server ~]# jupyter-kernelspec list
Available kernels:
  pyspark_kernel    /opt/anaconda3/envs/jupyterhub/share/jupyter/kernels/pyspark_kernel
  python3           /opt/anaconda3/envs/jupyterhub/share/jupyter/kernels/python3
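
The listing only tells you where each kernel lives, so a typo inside kernel.json may go unnoticed until the kernel fails to start. As a rough sketch (the function name check_kernel_specs and the checks it performs are my own choices, not a Jupyter API), the files can be sanity-checked with the standard library:

```python
import json
from pathlib import Path


def check_kernel_specs(kernels_root):
    """Map each kernel directory name to a list of problems (empty if none)."""
    report = {}
    for kernel_dir in sorted(Path(kernels_root).iterdir()):
        if not kernel_dir.is_dir():
            continue
        problems = []
        spec_file = kernel_dir / "kernel.json"
        if not spec_file.is_file():
            problems.append("kernel.json is missing")
        else:
            try:
                spec = json.loads(spec_file.read_text())
            except json.JSONDecodeError as exc:
                problems.append(f"invalid JSON: {exc}")
            else:
                # The launcher needs the placeholder to receive the connection file.
                if "{connection_file}" not in spec.get("argv", []):
                    problems.append("argv lacks {connection_file}")
                if not spec.get("display_name"):
                    problems.append("display_name is empty")
        report[kernel_dir.name] = problems
    return report
```

Calling check_kernel_specs("/opt/anaconda3/envs/jupyterhub/share/jupyter/kernels") returns a dictionary in which every kernel should map to an empty list.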

If you create a new Jupyter Notebook and choose PySpark_kernel, you should be able to run sample code against the cluster.
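
A minimal sketch of such sample code could look like the function below. It assumes that spark is the session created by shell.py at kernel startup, that the HWC settings (HiveServer2 JDBC URL and so on) are already present in your Spark configuration, and that the pyspark_llap module comes from the pyspark_hwc zip passed with --py-files. The function name hwc_smoke_test is hypothetical.

```python
def hwc_smoke_test(spark):
    """List Hive databases through the Hive Warehouse Connector."""
    # pyspark_llap ships in the pyspark_hwc zip passed with --py-files, so it
    # is only importable inside the PySpark kernel, hence the local import.
    from pyspark_llap import HiveWarehouseSession

    hive = HiveWarehouseSession.session(spark).build()
    databases = hive.showDatabases()
    databases.show()
    return databases
```

In a notebook cell, calling hwc_smoke_test(spark) should print the list of Hive databases if the connector is correctly configured.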

So far I have not found how to create a Spark Scala kernel manually; any insight is welcome…

Sparkmagic Jupyter kernels configuration

Among the available Jupyter kernels there is one that (IMHO) comes up more often than the competition: Sparkmagic.

Install it with:

conda install -c conda-forge sparkmagic

Once done, install the Sparkmagic kernels with:

(jupyterhub) [root@server kernels]# cd /opt/anaconda3/envs/jupyterhub/lib/python3.8/site-packages/sparkmagic/kernels
(jupyterhub) [root@server kernels]# ll
total 28
-rw-rw-r-- 2 root root    46 Jan 23 14:36 __init__.py
-rw-rw-r-- 2 root root 20719 Jan 23 14:36 kernelmagics.py
drwxr-xr-x 2 root root    72 Feb 18 16:48 __pycache__
drwxr-xr-x 3 root root   104 Feb 18 16:48 pysparkkernel
drwxr-xr-x 3 root root   102 Feb 18 16:48 sparkkernel
drwxr-xr-x 3 root root   103 Feb 18 16:48 sparkrkernel
drwxr-xr-x 3 root root    95 Feb 18 16:48 wrapperkernel
(jupyterhub) [root@server kernels]# jupyter-kernelspec install sparkrkernel
[InstallKernelSpec] Installed kernelspec sparkrkernel in /usr/local/share/jupyter/kernels/sparkrkernel
(jupyterhub) [root@server kernels]# jupyter-kernelspec install sparkkernel
[InstallKernelSpec] Installed kernelspec sparkkernel in /usr/local/share/jupyter/kernels/sparkkernel
(jupyterhub) [root@server kernels]# jupyter-kernelspec install pysparkkernel
[InstallKernelSpec] Installed kernelspec pysparkkernel in /usr/local/share/jupyter/kernels/pysparkkernel
(jupyterhub) [root@server kernels]# jupyter-kernelspec list
Available kernels:
  pysparkkernel    /opt/anaconda3/envs/jupyterhub/share/jupyter/kernels/pysparkkernel
  python3          /opt/anaconda3/envs/jupyterhub/share/jupyter/kernels/python3
  sparkkernel      /opt/anaconda3/envs/jupyterhub/share/jupyter/kernels/sparkkernel
  sparkrkernel     /opt/anaconda3/envs/jupyterhub/share/jupyter/kernels/sparkrkernel

Enable the server extension with:

(jupyterhub) [root@server ~]# jupyter serverextension enable --py sparkmagic
Enabling: sparkmagic
- Writing config: /root/.jupyter
    - Validating...
      sparkmagic 0.15.0 OK

In the home directory of the account you will use to connect to JupyterHub, create a .sparkmagic directory and, inside it, a config.json file copied from the example configuration provided with Sparkmagic.

In this file, modify at least the url of each kernel_xxx_credentials section so that it points to your Livy server name and port:

"kernel_python_credentials" : {
  "username": "",
  "password": "",
  "url": "http://livyserver.domain.com:8999",
  "auth": "None"
},

The part on which I spent quite a lot of time is the session_configs section, set as below to add the HWC connector information:

"session_configs": {
  "driverMemory": "1000M",
  "executorCores": 2,
  "conf": {
    "spark.jars": "file:///usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar",
    "spark.submit.pyFiles": "file:///usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip"
  }
},
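
Since the same Livy URL has to be repeated in every kernel_xxx_credentials section, these edits can also be scripted. The sketch below is only one way to do it: the helper names patch_sparkmagic_config and patch_file are hypothetical, the URL and paths are the ones from this article, and other keys such as driverMemory are left as they are in your file.

```python
import json
from pathlib import Path

# Values from this article; replace with your own Livy server and HWC release.
LIVY_URL = "http://livyserver.domain.com:8999"
HWC_HOME = "file:///usr/hdp/current/hive_warehouse_connector"
HWC_VERSION = "1.0.0.3.1.4.0-315"


def patch_sparkmagic_config(config):
    """Return a copy of config with the Livy URL and HWC settings applied."""
    patched = json.loads(json.dumps(config))  # cheap deep copy of JSON data
    for name, section in patched.items():
        if name.startswith("kernel_") and name.endswith("_credentials"):
            section["url"] = LIVY_URL
    conf = patched.setdefault("session_configs", {}).setdefault("conf", {})
    conf["spark.jars"] = f"{HWC_HOME}/hive-warehouse-connector-assembly-{HWC_VERSION}.jar"
    conf["spark.submit.pyFiles"] = f"{HWC_HOME}/pyspark_hwc-{HWC_VERSION}.zip"
    return patched


def patch_file(path):
    """Rewrite a Sparkmagic config.json file in place."""
    path = Path(path)
    patched = patch_sparkmagic_config(json.loads(path.read_text()))
    path.write_text(json.dumps(patched, indent=2) + "\n")
```

For the account used to log in to JupyterHub this would be patch_file(Path.home() / ".sparkmagic" / "config.json").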

If you then create a new notebook using the PySpark or Spark kernel, depending on whether you want to use Python or Scala, you should be able to run the examples below.

If you use Jupyter Notebook, the first command to execute is the magic command %load_ext sparkmagic.magics; then create a session with the magic command %manage_spark, selecting either Scala or Python (the question of the R language remains, but I do not use it). If you use JupyterLab you can start working directly, as the %manage_spark command does not work there. The Livy session should be created automatically when the first command is executed. It should be the same with Jupyter Notebook, but I had a few issues with this so…
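
When in doubt about whether a Livy session was really created, you can ask Livy directly through its REST interface. The following is a sketch under assumptions: the server URL is the one from the config.json above, no authentication is required (auth set to None), and the helper names summarize_sessions and fetch_sessions are hypothetical.

```python
import json
from urllib.request import urlopen

# Same Livy endpoint as in the Sparkmagic config.json; adjust as needed.
LIVY_URL = "http://livyserver.domain.com:8999"


def summarize_sessions(payload):
    """Turn a Livy GET /sessions response into (id, kind, state) tuples."""
    return [
        (session.get("id"), session.get("kind"), session.get("state"))
        for session in payload.get("sessions", [])
    ]


def fetch_sessions(base_url=LIVY_URL, timeout=10):
    """Query Livy for its current sessions and summarize them."""
    with urlopen(f"{base_url}/sessions", timeout=timeout) as response:
        return summarize_sessions(json.load(response))
```

Running fetch_sessions() from the JupyterHub server should list one entry per notebook session once its first cell has been executed.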

A few other magic commands are quite interesting: