Fetch Zookeeper information from Python with Kazoo to connect to Hive

Preamble

From the very beginning, when I joined our Hadoop project (HortonWorks), I saw that all our Python scripts connected directly to our Hive server with PyHive, bypassing Zookeeper coordination. I also noticed that every Beeline client connection was correctly using (obviously) the HiveServer2 JDBC URL. I left this point open for later, until we decided to improve our High Availability (HA) by running a few components on multiple servers. And when it came to our HiveServer2, which now runs on two edge nodes of our Hadoop cluster, I decided to dig this thread out again…

The right way to connect to HiveServer2 is to first get the current status and configuration from Zookeeper and then use this information in PyHive (for example) to make a Hive connection. Zookeeper acts here as a configuration keeper as well as an availability watcher, which means Zookeeper will not return information about a dead HiveServer2.
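To make the idea concrete, here is a minimal sketch of the "ask Zookeeper, then pick a server" step. It assumes each HiveServer2 instance registers a child znode whose name looks like serverUri=host:port;version=...;sequence=... (the format I get on my cluster, shown in the testing section). The function names and the example.com host names are made up for illustration.

```python
import random


def parse_server_uri(znode_name):
    """Extract (host, port) from 'serverUri=host:port;version=...;sequence=...'."""
    # Turn 'key1=value1;key2=value2' into a dictionary
    fields = dict(item.split('=', 1) for item in znode_name.split(';') if '=' in item)
    host, port = fields['serverUri'].rsplit(':', 1)
    return host, int(port)


def pick_hiveserver2(znode_names):
    """Return the (host, port) of one registered HiveServer2, chosen at random."""
    names = list(znode_names)
    if not names:
        raise RuntimeError('No HiveServer2 instance is registered in Zookeeper')
    return parse_server_uri(random.choice(names))


# Hypothetical znode names, for illustration only
children = [
    'serverUri=hive01.example.com:10000;version=1.2.1;sequence=0000000001',
    'serverUri=hive02.example.com:10000;version=1.2.1;sequence=0000000002',
]
print(pick_hiveserver2(children))
```

Picking at random is only one possible policy. It spreads clients over the available servers without any extra coordination.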

Digging a bit on the Internet, I quickly came to the obvious conclusion that the Kazoo Python package was a must-try!

This blog post has been written using kazoo 2.5.0, Python 3.7.3. My Hadoop cluster is HortonWorks Data Platform (HDP) 2.6.4. All developed scripts are running on a Fedora 30 virtual machine.

Kazoo development environment installation

Anaconda is the preferred Python 3.7 environment manager, so I started by downloading and installing it for Python 3.7 on my Fedora virtual machine. The release I have installed is:

[root@fedora1 ~]# anaconda --version
anaconda 30.25.6

It also gives you access to conda, the command-line environment manager:

(base) [root@fedora1 ~]# conda --version
conda 4.6.14
(base) [root@fedora1 ~]# conda info
 
     active environment : base
    active env location : /opt/anaconda3
            shell level : 1
       user config file : /root/.condarc
 populated config files : /root/.condarc
          conda version : 4.6.14
    conda-build version : 3.17.8
         python version : 3.7.3.final.0
       base environment : /opt/anaconda3  (writable)
           channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/free/linux-64
                          https://repo.anaconda.com/pkgs/free/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /opt/anaconda3/pkgs
                          /root/.conda/pkgs
       envs directories : /opt/anaconda3/envs
                          /root/.conda/envs
               platform : linux-64
             user-agent : conda/4.6.14 requests/2.21.0 CPython/3.7.3 Linux/5.0.11-300.fc30.x86_64 fedora/30 glibc/2.29
                UID:GID : 0:0
             netrc file : None
           offline mode : False

As I am behind a corporate proxy, I had to customize my .condarc profile a little:

(base) [root@fedora1 ~]# cat  .condarc
ssl_verify: False
proxy_servers:
    http: http://proxy_user:proxy_password@proxy_server:proxy_port
    https: https://proxy_user:proxy_password@proxy_server:proxy_port

Remark
I also had to set ssl_verify to False to avoid certificate errors caused by my proxy server…

I create a kazoo conda environment with:

(base) [root@fedora1 ~]# conda create -n kazoo
Collecting package metadata: done
Solving environment: done
 
## Package Plan ##
 
  environment location: /opt/anaconda3/envs/kazoo
 
 
 
Proceed ([y]/n)? y
 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate kazoo
#
# To deactivate an active environment, use
#
#     $ conda deactivate

Activate it with:

[root@fedora1 ~]# conda activate kazoo
(kazoo) [root@fedora1 ~]#

As you have seen, I was working as root, which is a very bad idea, so it is better to use a normal account (and keep root for more important tasks, see below). To do so, initialize Conda with (my shell is bash):

[yjaquier@fedora1 ~]$ /opt/anaconda3/bin/conda init bash
no change     /opt/anaconda3/condabin/conda
no change     /opt/anaconda3/bin/conda
no change     /opt/anaconda3/bin/conda-env
no change     /opt/anaconda3/bin/activate
no change     /opt/anaconda3/bin/deactivate
no change     /opt/anaconda3/etc/profile.d/conda.sh
no change     /opt/anaconda3/etc/fish/conf.d/conda.fish
no change     /opt/anaconda3/shell/condabin/Conda.psm1
no change     /opt/anaconda3/shell/condabin/conda-hook.ps1
no change     /opt/anaconda3/lib/python3.7/site-packages/xonsh/conda.xsh
no change     /opt/anaconda3/etc/profile.d/conda.csh
modified      /home/yjaquier/.bashrc
 
==> For changes to take effect, close and re-open your current shell. <==

Log off and log on again, then use the newly created environment with:

(base) [yjaquier@fedora1 ~]$ conda activate kazoo
(kazoo) [yjaquier@fedora1 ~]$

You also need to configure conda (.condarc) for this account in the same way as above…

For package search and management, your reference will be https://anaconda.org. For example, a search for Python is available at https://anaconda.org/search?q=python.

If you open the most downloaded Python package (good practice in my opinion), you will find the command to install it:

conda install -c conda-forge python

But then you cannot modify the environment with your own account, and adding packages must be done by root, which is, in my opinion, a very good practice:

EnvironmentNotWritableError: The current user does not have write permissions to the target environment.
  environment location: /opt/anaconda3/envs/kazoo
  uid: 1000
  gid: 100

So, executing with the root account (in the kazoo conda environment):

(kazoo) [root@fedora1 ~]# conda install -c conda-forge python
Collecting package metadata: done
Solving environment: done
 
## Package Plan ##
 
  environment location: /opt/anaconda3/envs/kazoo
 
  added / updated specs:
    - python
 
 
The following packages will be downloaded:
 
    package                    |            build
    ---------------------------|-----------------
    bzip2-1.0.6                |    h14c3975_1002         415 KB  conda-forge
    certifi-2019.3.9           |           py37_0         149 KB  conda-forge
    pip-19.1                   |           py37_0         1.8 MB  conda-forge
    python-3.7.3               |       h5b0a415_0        35.7 MB  conda-forge
    setuptools-41.0.1          |           py37_0         616 KB  conda-forge
    wheel-0.33.1               |           py37_0          34 KB  conda-forge
    ------------------------------------------------------------
                                           Total:        38.7 MB
 
The following NEW packages will be INSTALLED:
 
  bzip2              conda-forge/linux-64::bzip2-1.0.6-h14c3975_1002
  ca-certificates    conda-forge/linux-64::ca-certificates-2019.3.9-hecc5488_0
  certifi            conda-forge/linux-64::certifi-2019.3.9-py37_0
  libffi             conda-forge/linux-64::libffi-3.2.1-he1b5a44_1006
  libgcc-ng          pkgs/main/linux-64::libgcc-ng-8.2.0-hdf63c60_1
  libstdcxx-ng       pkgs/main/linux-64::libstdcxx-ng-8.2.0-hdf63c60_1
  ncurses            conda-forge/linux-64::ncurses-6.1-hf484d3e_1002
  openssl            conda-forge/linux-64::openssl-1.1.1b-h14c3975_1
  pip                conda-forge/linux-64::pip-19.1-py37_0
  python             conda-forge/linux-64::python-3.7.3-h5b0a415_0
  readline           conda-forge/linux-64::readline-7.0-hf8c457e_1001
  setuptools         conda-forge/linux-64::setuptools-41.0.1-py37_0
  sqlite             conda-forge/linux-64::sqlite-3.26.0-h67949de_1001
  tk                 conda-forge/linux-64::tk-8.6.9-h84994c4_1001
  wheel              conda-forge/linux-64::wheel-0.33.1-py37_0
  xz                 conda-forge/linux-64::xz-5.2.4-h14c3975_1001
  zlib               conda-forge/linux-64::zlib-1.2.11-h14c3975_1004
 
 
Proceed ([y]/n)? y
 
 
Downloading and Extracting Packages
python-3.7.3         | 35.7 MB   | #################################################################################################################################################################### | 100%
certifi-2019.3.9     | 149 KB    | #################################################################################################################################################################### | 100%
wheel-0.33.1         | 34 KB     | #################################################################################################################################################################### | 100%
setuptools-41.0.1    | 616 KB    | #################################################################################################################################################################### | 100%
pip-19.1             | 1.8 MB    | #################################################################################################################################################################### | 100%
bzip2-1.0.6          | 415 KB    | #################################################################################################################################################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

I have also installed Kazoo using:

(kazoo) [root@fedora1 ~]# conda install -c conda-forge kazoo

I have also installed PyHive to connect to Hive and Pandas to manipulate data structures (in the end I did not use it):

conda install -c anaconda pyhive
conda install -c conda-forge pandas

Python source code

The small source code I have written (kazoo_testing.py in the examples below) mainly comes, for the Kazoo part, from the official documentation, so do not hesitate to visit it:

from kazoo.client import KazooClient, KazooState


def my_listener(state):
    """Report Zookeeper connection state changes."""
    if state == KazooState.LOST:
        # Register somewhere that the session was lost
        print('Connection lost !!')
    elif state == KazooState.SUSPENDED:
        # Handle being disconnected from Zookeeper
        print('Connection suspended !!')
    else:
        # Handle being connected/reconnected to Zookeeper
        print('Connected !!')


# Comma-separated list of the Zookeeper servers (host:port)
zk = KazooClient(hosts='zookeeper_server01.domain.com:2181,zookeeper_server02.domain.com:2181,zookeeper_server03.domain.com:2181')

zk.add_listener(my_listener)
# Wait at most 5 seconds for the connection to be established
zk.start(timeout=5)

# Display the znodes at the root of Zookeeper
print(zk.get_children('/'))

# Display the registered HiveServer2 instances
print(zk.get_children('/hiveserver2'))

# Each child is named serverUri=<host>:<port>;version=<version>;sequence=<sequence>
for hiveserver2 in zk.get_children('/hiveserver2'):
    array01 = hiveserver2.split(';')[0].split('=')[1].split(':')
    hive_hostname = array01[0]
    hive_port = array01[1]
    print('Hive hostname: ' + hive_hostname)
    print('Hive port: ' + hive_port)

The list of Zookeeper servers can be taken from the Hive page in Ambari, where you can copy/paste the so-called HIVESERVER2 JDBC URL.
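If you prefer not to retype the server list, a small helper can extract it from the JDBC URL. This is only a sketch: it assumes the URL has the usual service discovery shape jdbc:hive2://zk1:2181,zk2:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2, and the function name, the example.com hosts and the fallback namespace are my own assumptions.

```python
def parse_hive_jdbc_url(jdbc_url):
    """Return (zookeeper_hosts, znode_path) from a HiveServer2 discovery JDBC URL."""
    prefix = 'jdbc:hive2://'
    if not jdbc_url.startswith(prefix):
        raise ValueError('Not a HiveServer2 JDBC URL: ' + jdbc_url)
    # Everything before the first '/' is the comma-separated host list
    hosts, _, tail = jdbc_url[len(prefix):].partition('/')
    # The tail is '<database>;key=value;key=value', so skip the database part
    params = dict(item.split('=', 1) for item in tail.split(';')[1:] if '=' in item)
    # Assumption: fall back to 'hiveserver2' when no namespace is given
    namespace = '/' + params.get('zooKeeperNamespace', 'hiveserver2')
    return hosts, namespace


# Hypothetical URL, for illustration only
url = ('jdbc:hive2://zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/'
       ';serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2')
zookeeper_hosts, znode_path = parse_hive_jdbc_url(url)
print(zookeeper_hosts)
print(znode_path)
```

The first value can be passed as the hosts parameter of KazooClient and the second one as the path given to get_children.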

The above source code does not include the PyHive connection, but once you have the Hive host name and port you can easily connect with something like this (the configuration parameter is optional):

import pandas as pd
from pyhive import hive

# Hive connection (hive_port comes from Zookeeper as a string)
connection = hive.connect(
    host=hive_hostname,
    port=int(hive_port),
    configuration={'tez.queue.name': 'your_yarn_queue_name'},
    username='your_account'
)

pandas01 = pd.read_sql("select * from ...", connection)

print(pandas01.sample(10))
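Since the point of the exercise is High Availability, it may be worth trying the next server when a connection attempt fails, because a server that has just stopped can still be listed in Zookeeper for a short while. Below is a minimal sketch of that idea. The helper name is mine, and the connect argument is any function taking a host and a port (a small wrapper around hive.connect, for example); the fake_connect function only stands in for it so the example runs without a cluster.

```python
import random


def connect_with_failover(endpoints, connect):
    """Try each (host, port) in random order and return the first connection."""
    candidates = list(endpoints)
    random.shuffle(candidates)
    last_error = None
    for host, port in candidates:
        try:
            return connect(host, port)
        except Exception as error:
            print('Cannot connect to %s:%d (%s), trying next' % (host, port, error))
            last_error = error
    raise ConnectionError('No HiveServer2 instance could be reached') from last_error


def fake_connect(host, port):
    # Stand-in for hive.connect, used only to demonstrate the failover
    if host == 'hive01.example.com':
        raise OSError('connection refused')
    return 'connected to %s:%d' % (host, port)


# Hypothetical endpoints, for illustration only
endpoints = [('hive01.example.com', 10000), ('hive02.example.com', 10000)]
print(connect_with_failover(endpoints, fake_connect))
```

With PyHive, the connect argument could be something like lambda host, port: hive.connect(host=host, port=port, username='your_account').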

Kazoo testing

I am lucky enough to have two HiveServer2 instances configured in my Hortonworks Hadoop cluster, which is, by the way, strongly suggested if you aim to be Highly Available (HA). When the two HiveServer2 processes are up and running I get the result below:

(kazoo) [yjaquier@fedora1 ~]$ python kazoo_testing.py
Connected !!
['registry', 'cluster', 'brokers', 'storm', 'zookeeper', 'infra-solr', 'hbase-unsecure', 'tracers', 'hadoop-ha', 'admin', 'isr_change_notification',
 'accumulo', 'logsearch', 'controller_epoch', 'hiveserver2', 'druid', 'rmstore', 'ambari-metrics-cluster', 'consumers', 'config']
['serverUri=hiveserver201.domain.com:10000;version=1.2.1000.2.6.4.0-91;sequence=0000000042', 'serverUri=hiveserver202.domain.com:10000;version=1.2.1000.2.6.4.0-91;sequence=0000000043']
Hive hostname: hiveserver201.domain.com
Hive port: 10000
Hive hostname: hiveserver202.domain.com
Hive port: 10000

If I stop the first HiveServer2 then, after a while (I suppose the time for Zookeeper to get and propagate the information), I finally get:

(kazoo) [yjaquier@fedora1 ~]$ python kazoo_testing.py
Connected !!
['registry', 'cluster', 'brokers', 'storm', 'zookeeper', 'infra-solr', 'hbase-unsecure', 'tracers', 'hadoop-ha', 'admin', 'isr_change_notification',
 'accumulo', 'logsearch', 'controller_epoch', 'hiveserver2', 'druid', 'rmstore', 'ambari-metrics-cluster', 'consumers', 'config']
['serverUri=hiveserver202.domain.com:10000;version=1.2.1000.2.6.4.0-91;sequence=0000000042']
Hive hostname: hiveserver202.domain.com
Hive port: 10000
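Instead of re-running the script to see the list change, Kazoo can call a function each time the children of a znode change, through its ChildrenWatch recipe. Here is a sketch of such a callback. The helper name and the short child names are made up, and the registration on a real client is only shown as a comment because it needs a started KazooClient.

```python
def make_children_callback(known):
    """Build a callback that reports HiveServer2 registrations and removals."""
    def on_children_change(children):
        current = set(children)
        for name in sorted(current - known):
            print('Registered: ' + name)
        for name in sorted(known - current):
            print('Removed: ' + name)
        # Remember the new list for the next notification
        known.clear()
        known.update(current)
    return on_children_change


# With a started KazooClient named zk, the callback would be registered with:
#   zk.ChildrenWatch('/hiveserver2', make_children_callback(set()))

# Simulation of two notifications with hypothetical child names
registered = set()
callback = make_children_callback(registered)
callback(['serverUri=hive01.example.com:10000', 'serverUri=hive02.example.com:10000'])
callback(['serverUri=hive02.example.com:10000'])
```

The script must stay alive for the watch to fire, so this suits a long-running process better than a one-shot script.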
