Hadoop backup: what parts to backup and how to do it?

Preamble


Hadoop backup is a wide and highly important subject, and most probably, like me, you have been surprised by the poor availability of official documentation, which is most probably why you have landed here trying to find a first answer! Needless to say, this blog post is far from complete, so please do not hesitate to submit a comment and I will enrich this document with great pleasure!

One of the main difficulties with Hadoop is its scale-out nature, which makes it hard to understand what is merely nice to back up and what is REALLY important to back up.

I have split the article into three parts:

  • The first part is what you MUST back up to be able to survive a major issue.
  • The second part is what is not required to be backed up.
  • The third part is what is nice to back up.

Also, I repeat, I am interested in any comment you might have that would help enrich this document or correct any mistake…

Mandatory parts to backup

Configuration files

All files under /etc and /usr/hdp on edge nodes (so not on your worker nodes). In principle you could recreate them from scratch, but you surely do not want to lose multiple months or years of fine tuning, do you?

Theoretically all your configuration files are saved when saving the Ambari server meta info, but if you have a corporate tool to back up your host OS it is worth including the two directories above, as it is sometimes much simpler to restore a single file with those tools…

Those edge nodes are:

  • Master nodes
  • Management nodes
  • Client nodes
  • Utilities node (Hive, …)
  • Analytics Nodes

In other words, all nodes except the worker nodes.
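A minimal sketch of such a backup, assuming a /backup target directory and daily scheduling (both assumptions to adapt to your environment):

# Hypothetical daily archive of the two configuration directories
[root@mgmtserver ~]# tar -czf /backup/$(hostname -s)_config_$(date +%Y%m%d).tar.gz /etc /usr/hdp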

Ambari server meta info

[root@mgmtserver ~]# ambari-server backup /tmp/ambari-server-backup.zip
Using python  /usr/bin/python
Backing up Ambari File System state... *this will not backup the server database*
Backup requested.
Backup process initiated.
Creating zip file...
Zip file created at /tmp/ambari-server-backup.zip
Backup complete.
Ambari Server 'backup' completed successfully.
[root@mgmtserver ~]# ll /tmp/ambari-server-backup.zip
-rw-r--r-- 1 root root 2444590592 Dec  3 17:01 /tmp/ambari-server-backup.zip

To restore this backup after a major crash, the command is:

[root@mgmtserver ~]# ambari-server restore /tmp/ambari-server-backup.zip

NameNode metadata

As the Hadoop wiki page says, this component is key:

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

It is not an Oracle database, though, and a continuous backup is not possible:

Regardless of the solution, a full, up-to-date continuous backup of the namespace is not possible. Some of the most recent data is always lost. HDFS is not an Online Transaction Processing (OLTP) system. Most data can be easily recreated if you re-run Extract, Transform, Load (ETL) or processing jobs.

The always-working procedure to back up your NameNode is really simple:

[hdfs@namenode_primary ~]$ hdfs dfsadmin -saveNamespace
saveNamespace: Safe mode should be turned ON in order to create namespace image.
[hdfs@namenode_primary ~]$ hdfs dfsadmin -safemode enter
Safe mode is ON
[hdfs@namenode_primary ~]$ hdfs dfsadmin -safemode get
Safe mode is ON
[hdfs@namenode_primary ~]$ hdfs dfsadmin -saveNamespace
Save namespace successful
[hdfs@namenode_primary ~]$ hdfs dfsadmin -safemode leave
Safe mode is OFF
[hdfs@namenode_primary ~]$ hdfs dfsadmin -safemode get
Safe mode is OFF
[hdfs@namenode_primary ~]$ hdfs dfsadmin -fetchImage /tmp
19/01/07 12:57:10 INFO namenode.TransferFsImage: Opening connection to http://namenode_primary.domain.com:50070/imagetransfer?getimage=1&txid=latest
19/01/07 12:57:10 INFO namenode.TransferFsImage: Image Transfer timeout configured to 60000 milliseconds
19/01/07 12:57:10 INFO namenode.TransferFsImage: Combined time for fsimage download and fsync to all disks took 0.04s. The fsimage download took 0.04s at 167097.56 KB/s. Synchronous (fsync) write to disk of /tmp/fsimage_0000000000002300573 took 0.00s.

Then you can put the file that has been copied to the /tmp directory in a safe place (tape, SAN, NFS, …). But this approach has the drawback of putting your entire cluster in read-only mode (safe mode), so on a 24/7 production cluster this is surely not something you can accept…

All your running processes will fail with something like:

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot complete file /apps/hive/warehouse/database.db/table_orc/.hive-staging_hive_2019-02-12_06-12-16_976_7305679997226277861-21596/_task_tmp.-ext-10002/_tmp.000199_0. Name node is in safe mode.
It was turned on manually. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.

In the initial releases of Hadoop the NameNode was a Single Point Of Failure (SPOF), as you could only have what is called a secondary NameNode. The secondary NameNode handles an important CPU-intensive task called checkpointing. Checkpointing is the operation of combining the edit log files (edits_xx files) and the latest fsimage file to create an up-to-date HDFS filesystem metadata snapshot (fsimage_xxx file). But the secondary NameNode cannot be used as a failover of the primary NameNode, so in case of failure it can only be used to rebuild the primary NameNode, not to take over its role.

In Hadoop 2.0 this limitation is gone: in High Availability (HA) mode you can have a standby NameNode that does the same job as the secondary NameNode and can also take over the role of the primary NameNode with a simple switchover.

If for any reason this checkpoint operation has not happened for a long time, you will receive the scary NameNode Last Checkpoint Ambari alert:

[Image: hadoop_backup02]

This alert will also trigger the Ambari warning below when you try to stop the NameNode process (on restart, the NameNode reads the latest fsimage and re-applies all the edit log files generated since):

[Image: hadoop_backup03]

Needless to say, having your NameNode service in High Availability (active/standby) is strongly suggested!

Whether your NameNode is in HA or not, here is a list of important parameters to consider, with the values we have chosen (maybe I should decrease the checkpoint period value):

  • dfs.namenode.name.dir = /hadoop/hdfs
  • dfs.namenode.checkpoint.period = 21600 (in seconds i.e. 6 hours)
  • dfs.namenode.checkpoint.txns = 1000000
  • dfs.namenode.checkpoint.check.period = 60

In this configuration, on your standby or secondary NameNode, every dfs.namenode.checkpoint.period seconds or every dfs.namenode.checkpoint.txns transactions, whichever is reached first, a new checkpoint file is created, and the cool thing is that this latest checkpoint is copied back to your active NameNode. Below, the checkpoint at 07:08 is the periodic automatic checkpoint, while the one at 06:15 is the one we explicitly triggered with an hdfs dfsadmin -saveNamespace command.

On standby NameNode:

[root@namenode_standby ~]# ll -rt /hadoop/hdfs/current/fsimage*
-rw-r--r-- 1 hdfs hadoop 650179252 Feb 13 06:15 /hadoop/hdfs/current/fsimage_0000000000520456166
-rw-r--r-- 1 hdfs hadoop        62 Feb 13 06:15 /hadoop/hdfs/current/fsimage_0000000000520456166.md5
-rw-r--r-- 1 hdfs hadoop 650235574 Feb 13 07:08 /hadoop/hdfs/current/fsimage_0000000000520466841
-rw-r--r-- 1 hdfs hadoop        62 Feb 13 07:08 /hadoop/hdfs/current/fsimage_0000000000520466841.md5

On the active NameNode:

[root@namenode_primary ~]# ll -rt /hadoop/hdfs/current/fsimage*
-rw-r--r-- 1 hdfs hadoop        62 Feb 13 06:15 /hadoop/hdfs/current/fsimage_0000000000520456198.md5
-rw-r--r-- 1 hdfs hadoop 650179470 Feb 13 06:15 /hadoop/hdfs/current/fsimage_0000000000520456198
-rw-r--r-- 1 hdfs hadoop 650235574 Feb 13 07:08 /hadoop/hdfs/current/fsimage_0000000000520466841
-rw-r--r-- 1 hdfs hadoop        62 Feb 13 07:08 /hadoop/hdfs/current/fsimage_0000000000520466841.md5

So in a NameNode HA cluster you can simply copy the dfs.namenode.name.dir directory regularly to a safe place (tape, NFS, …) and you are not obliged to enter the impacting safe mode.
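As a hedged sketch of such a regular copy, taken here on the standby NameNode (the /backup destination path is an assumption):

# Hypothetical archive of the latest checkpoints plus the VERSION file
[root@namenode_standby ~]# tar -czf /backup/fsimage_$(date +%Y%m%d%H%M).tar.gz /hadoop/hdfs/current/fsimage_* /hadoop/hdfs/current/VERSION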

If at some point in time you do not have Ambari and/or you want to script it, here are the commands to get your active and standby NameNode servers:

[hdfs@namenode_primary ~]$ hdfs getconf -confKey dfs.ha.namenodes.mycluster
nn1,nn2
[hdfs@namenode_primary ~]$ hdfs getconf -confKey dfs.namenode.rpc-address.mycluster.nn1
namenode_standby.domain.com:8020
[hdfs@namenode_primary ~]$ hdfs getconf -confKey dfs.namenode.rpc-address.mycluster.nn2
namenode_primary.domain.com:8020
[hdfs@namenode_primary ~]$ hdfs haadmin -getServiceState nn1
standby
[hdfs@namenode_primary ~]$ hdfs haadmin -getServiceState nn2
active
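Putting it together, a small hedged sketch that loops over the configured NameNodes and prints the RPC address of the active one (the cluster name mycluster is taken from the example above):

#!/bin/bash
# Print the rpc-address of the currently active NameNode (assumes HA cluster "mycluster")
for nn in $(hdfs getconf -confKey dfs.ha.namenodes.mycluster | tr ',' ' ')
do
  if [[ $(hdfs haadmin -getServiceState $nn) == "active" ]]
  then
    hdfs getconf -confKey dfs.namenode.rpc-address.mycluster.$nn
  fi
done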

Ambari repository database

Our Ambari repository database is a PostgreSQL one; if you have chosen MySQL, refer to the next chapter.

Backup with Point In Time Recovery (PITR) capability

As clearly explained in the documentation, there is a tool to do it called pg_basebackup. To use it you need to put your PostgreSQL instance in write-ahead log (WAL) archiving mode, which is the equivalent of MySQL binary logging or Oracle archive log mode. This is done by setting three parameters in the postgresql.conf file:

  • wal_level = replica
  • archive_mode = on
  • archive_command = ‘test ! -f /var/lib/pgsql/backups/%f && cp %p /var/lib/pgsql/backups/%f’

Remark:
The archive command chosen here is just an example that copies WAL files to a backup directory, which you obviously need to save to a secure place.

If this is not done you will end up with the error message below:

[postgres@fedora1 ~]$ pg_basebackup --pgdata=/tmp/pgbackup01
pg_basebackup: could not get write-ahead log end position from server: ERROR:  could not open file "./.postgresql.conf.swp": Permission denied
pg_basebackup: removing data directory "/tmp/pgbackup01"

Once done and the instance restarted, you can make an online backup that can be used to perform PITR with:

[postgres@fedora1 ~]$ pg_basebackup --pgdata=/tmp/pgbackup01
[postgres@fedora1 ~]$ ll /tmp/pgbackup01
total 52
-rw------- 1 postgres postgres   206 Nov 30 18:02 backup_label
drwx------ 6 postgres postgres   120 Nov 30 18:02 base
-rw------- 1 postgres postgres    30 Nov 30 18:02 current_logfiles
drwx------ 2 postgres postgres  1220 Nov 30 18:02 global
drwx------ 2 postgres postgres    80 Nov 30 18:02 log
drwx------ 2 postgres postgres    40 Nov 30 18:02 pg_commit_ts
drwx------ 2 postgres postgres    40 Nov 30 18:02 pg_dynshmem
-rw------- 1 postgres postgres  4414 Nov 30 18:02 pg_hba.conf
-rw------- 1 postgres postgres  1636 Nov 30 18:02 pg_ident.conf
drwx------ 2 postgres postgres    40 Nov 30 18:02 pg_log
drwx------ 4 postgres postgres   100 Nov 30 18:02 pg_logical
drwx------ 4 postgres postgres    80 Nov 30 18:02 pg_multixact
drwx------ 2 postgres postgres    40 Nov 30 18:02 pg_notify
drwx------ 2 postgres postgres    40 Nov 30 18:02 pg_replslot
drwx------ 2 postgres postgres    40 Nov 30 18:02 pg_serial
drwx------ 2 postgres postgres    40 Nov 30 18:02 pg_snapshots
drwx------ 2 postgres postgres    40 Nov 30 18:02 pg_stat
drwx------ 2 postgres postgres    40 Nov 30 18:02 pg_stat_tmp
drwx------ 2 postgres postgres    40 Nov 30 18:02 pg_subtrans
drwx------ 2 postgres postgres    40 Nov 30 18:02 pg_tblspc
drwx------ 2 postgres postgres    40 Nov 30 18:02 pg_twophase
-rw------- 1 postgres postgres     3 Nov 30 18:02 PG_VERSION
drwx------ 3 postgres postgres    80 Nov 30 18:02 pg_wal
drwx------ 2 postgres postgres    60 Nov 30 18:02 pg_xact
-rw------- 1 postgres postgres    88 Nov 30 18:02 postgresql.auto.conf
-rw------- 1 postgres postgres 22848 Nov 30 18:02 postgresql.conf
[postgres@fedora1 pg_wal]$ ll /var/lib/pgsql/backups/
total 32772
-rw------- 1 postgres postgres 16777216 Nov 30 18:02 000000010000000000000002
-rw------- 1 postgres postgres 16777216 Nov 30 18:02 000000010000000000000003
-rw------- 1 postgres postgres      302 Nov 30 18:02 000000010000000000000003.00000060.backup
[postgres@fedora1 pg_wal]$ cat /var/lib/pgsql/backups/000000010000000000000003.00000060.backup
START WAL LOCATION: 0/3000060 (file 000000010000000000000003)
STOP WAL LOCATION: 0/3000130 (file 000000010000000000000003)
CHECKPOINT LOCATION: 0/3000098
BACKUP METHOD: streamed
BACKUP FROM: master
START TIME: 2018-11-30 18:02:03 CET
LABEL: pg_basebackup base backup
STOP TIME: 2018-11-30 18:02:03 CET
[postgres@fedora1 pg_wal]$ ll /var/lib/pgsql/data/pg_wal/
total 49156
-rw------- 1 postgres postgres 16777216 Nov 30 18:02 000000010000000000000002
-rw------- 1 postgres postgres 16777216 Nov 30 18:02 000000010000000000000003
-rw------- 1 postgres postgres      302 Nov 30 18:02 000000010000000000000003.00000060.backup
-rw------- 1 postgres postgres 16777216 Nov 30 18:02 000000010000000000000004
drwx------ 2 postgres postgres      133 Nov 30 18:02 archive_status
[postgres@fedora1 pg_wal]$ ll /var/lib/pgsql/data/pg_wal/archive_status/
total 0
-rw------- 1 postgres postgres 0 Nov 30 18:02 000000010000000000000002.done
-rw------- 1 postgres postgres 0 Nov 30 18:02 000000010000000000000003.00000060.backup.done
-rw------- 1 postgres postgres 0 Nov 30 18:02 000000010000000000000003.done

You can also directly generate TAR files with:

[postgres@fedora1 pg_wal]$ pg_basebackup --pgdata=/tmp/pgbackup02 --format=t
[postgres@fedora1 pg_wal]$ ll /tmp/pgbackup02
total 48128
-rw-r--r-- 1 postgres postgres 32500224 Nov 30 18:11 base.tar
-rw------- 1 postgres postgres 16778752 Nov 30 18:11 pg_wal.tar
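For completeness, a hedged sketch of the restore side on this PostgreSQL 10 instance: stop the server, replace the data directory with the base backup, then let PostgreSQL replay the archived WAL files up to the chosen point in time via a recovery.conf file (the target time is an assumption; from PostgreSQL 12 onwards recovery.conf is replaced by postgresql.conf settings plus a recovery.signal file):

[postgres@fedora1 ~]$ cat /var/lib/pgsql/data/recovery.conf
restore_command = 'cp /var/lib/pgsql/backups/%f %p'
recovery_target_time = '2018-11-30 18:30:00'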

Backup with no PITR capability

This method is simply based on the creation of a dump file, using either pg_dump or pg_dumpall.

At this stage, either you do everything with the postgres Linux account, which is able to connect in a passwordless fashion thanks to the default pg_hba.conf file:

# TYPE  DATABASE        USER            ADDRESS                 METHOD

# "local" is for Unix domain socket connections only
local   all             postgres                                     peer
# IPv4 local connections:
host    all             postgres             127.0.0.1/32            ident
# IPv6 local connections:
host    all            postgres             ::1/128                 ident
# Allow replication connections from localhost, by a user with the
# replication privilege.
local   replication     postgres                                     peer
host    replication     postgres             127.0.0.1/32            ident
host    replication     postgres             ::1/128                 ident

Or you set it up for another account that has fewer privileges, for example the owner of the database you want to back up. I initially tried with PGPASSWORD but this apparently no longer works in recent releases of PostgreSQL (10.6 being the release I used to test the feature):

[postgres@fedora1 ~]$ export PGPASSWORD='secure_password'
[postgres@fedora1 ~]$ echo $PGPASSWORD
secure_password
[postgres@fedora1 ~]$ psql --dbname=ambari --username=ambari --password
Password for user ambari:

Our Ambari repository runs an older release (9.2.23), but to prepare for the future it is better to move to a password file. A password file is a file called ~/.pgpass with the structure below:

hostname:port:database:username:password

I have created it like:

[postgres@fedora1 ~]$ ll /var/lib/pgsql/.pgpass
-rw-r--r-- 1 postgres postgres 37 Nov 30 15:12 /var/lib/pgsql/.pgpass
[postgres@fedora1 ~]$ cat /var/lib/pgsql/.pgpass
localhost:5432:ambari:ambari:secure_password

The file permissions must be 0600 or more restrictive, or you will get:

[postgres@fedora1 ~]$ psql --dbname=ambari --username=ambari
WARNING: password file "/var/lib/pgsql/.pgpass" has group or world access; permissions should be u=rw (0600) or less
Password for user ambari:
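Tightening the permissions fixes it:

[postgres@fedora1 ~]$ chmod 0600 /var/lib/pgsql/.pgpass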

Then you can connect without specifying a password:

[postgres@fedora1 ~]$ psql --dbname=ambari --username=ambari
psql (10.6)
Type "help" for help.

ambari=>

All this allows you to back up all databases with:

[postgres@fedora1 ~]$ pg_dumpall --file=/tmp/pgbackup.sql
[postgres@fedora1 ~]$ ll /tmp/pgbackup.sql
-rw-r--r-- 1 postgres postgres 3768 Nov 30 16:55 /tmp/pgbackup.sql

Or just the Ambari one with:

[postgres@fedora1 ~]$ pg_dump --file=/tmp/pgbackup_ambari.sql ambari
[postgres@fedora1 ~]$ ll /tmp/pgbackup_ambari.sql
-rw-r--r-- 1 postgres postgres 1117 Nov 30 16:57 /tmp/pgbackup_ambari.sql

Hive repository database

Our Hive repository database is a MySQL one; if you have chosen PostgreSQL, refer to the previous chapter.

Backup with Point In Time Recovery (PITR) capability

You must activate binary logging by setting the log-bin parameter in the my.cnf file with something like (see https://blog.yannickjaquier.com/mysql/mysql-replication-with-global-transaction-identifiers-gtid-hands-on.html):

log-bin = /mysql/logs/mysql01/mysql-bin

You should end up with below configuration:

+---------------------------------+------------------------------------+
| Variable_name                   | Value                              |
+---------------------------------+------------------------------------+
| log_bin                         | ON                                 |
| log_bin_basename                | /mysql/logs/mysql01/mysql-bin      |
| log_bin_index                   | /mysql/logs/mysql01/mysql-bin.index |
+---------------------------------+------------------------------------+

First of all, you must regularly back up the MySQL binary logs!

Before any online backup (snapshot), do the following to reset the binary logs:

mysql> show binary logs;
+------------------+-----------+
| Log_name         | File_size |
+------------------+-----------+
| mysql-bin.001087 |       242 |
| mysql-bin.001088 |       242 |
| mysql-bin.001089 |       242 |
| mysql-bin.001090 |      9638 |
| mysql-bin.001091 |      1538 |
| mysql-bin.001092 |       242 |
| mysql-bin.001093 |       242 |
| mysql-bin.001094 |      1402 |
| mysql-bin.001095 |      4314 |
| mysql-bin.001096 |      2304 |
| mysql-bin.001097 |       120 |
+------------------+-----------+
11 rows in set (0.00 sec)

mysql> flush logs;
Query OK, 0 rows affected (0.41 sec)

mysql> show binary logs;
+------------------+-----------+
| Log_name         | File_size |
+------------------+-----------+
| mysql-bin.001088 |       242 |
| mysql-bin.001089 |       242 |
| mysql-bin.001090 |      9638 |
| mysql-bin.001091 |      1538 |
| mysql-bin.001092 |       242 |
| mysql-bin.001093 |       242 |
| mysql-bin.001094 |      1402 |
| mysql-bin.001095 |      4314 |
| mysql-bin.001096 |      2304 |
| mysql-bin.001097 |       167 |
| mysql-bin.001098 |       120 |
+------------------+-----------+
11 rows in set (0.00 sec)

mysql> purge binary logs to 'mysql-bin.001098';
Query OK, 0 rows affected (0.00 sec)

mysql> show binary logs;
+------------------+-----------+
| Log_name         | File_size |
+------------------+-----------+
| mysql-bin.001098 |       120 |
+------------------+-----------+
1 row in set (0.00 sec)

Then take the snapshot while keeping tables under a read lock, with something like:

FLUSH TABLES WITH READ LOCK;
\! lvcreate --snapshot --size 100M --name lvol98_save /dev/vg00/lvol98 or any snapshot command
UNLOCK TABLES;
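On the restore side, a hedged sketch: restore the snapshot, then replay the saved binary logs up to the failure point with mysqlbinlog (the file name and stop time are assumptions):

[mysql@server1 ~] mysqlbinlog --stop-datetime="2019-01-08 12:00:00" /mysql/logs/mysql01/mysql-bin.001098 | mysql --user=root -p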

Backup with no PITR capability

If you do not want to activate binary logging and manage the logs, or if you can afford to lose multiple hours of transactions, you can simply perform a MySQL dump, even just once a week when your cluster is stabilized. Use a command like the one below to create a simple dump file:

[mysql@server1 ~] mysqldump --user=root -p --single-transaction --all-databases > /tmp/backup.sql
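Restoring such a dump is then a simple:

[mysql@server1 ~] mysql --user=root -p < /tmp/backup.sql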

Not mandatory parts to backup

JournalNodes

From Cloudera official documentation:

High-availability clusters use JournalNodes to synchronize active and standby NameNodes. The active NameNode writes to each JournalNode with changes, or “edits,” to HDFS namespace metadata. During failover, the standby NameNode applies all edits from the JournalNodes before promoting itself to the active state.

Those JournalNodes are installed only if your NameNode is in HA mode. They are the preferred method to handle shared storage between your primary and standby NameNodes; this method is called Quorum Journal Manager (QJM).

Each time a new edits file is created or modified on the primary NameNode, it is also written to a majority (quorum) of the JournalNodes. The standby NameNode constantly monitors the JournalNodes for changes and applies them to its own namespace, so it is ready to take over from the primary NameNode in case of failure. All JournalNodes store more or less the same files (edits_xx files and an edits_inprogress_xx file) as the NameNodes, except that they do not hold the fsimage_xx checkpoint results. You must have three or more (an odd number of) JournalNodes for high availability and to handle split-brain scenarios.

The working directory of JournalNodes is defined by:

  • dfs.journalnode.edits.dir = /var/qjn

On one JournalNode the actual directory will be (the cluster name chosen at installation appears in the path):

[root@journalnode01 ~]# ll -rt /var/qjn//current
.
.
.
-rw-r--r-- 1 hdfs hadoop 1006436 Jan 18 12:13 edits_0000000000433896848-0000000000433901168
-rw-r--r-- 1 hdfs hadoop  133375 Jan 18 12:15 edits_0000000000433901169-0000000000433901822
-rw-r--r-- 1 hdfs hadoop  133652 Jan 18 12:17 edits_0000000000433901823-0000000000433902395
-rw-r--r-- 1 hdfs hadoop  918778 Jan 18 12:19 edits_0000000000433902396-0000000000433906383
-rw-r--r-- 1 hdfs hadoop  801672 Jan 18 12:21 edits_0000000000433906384-0000000000433910273
-rw-r--r-- 1 hdfs hadoop   76329 Jan 18 12:23 edits_0000000000433910274-0000000000433910699
-rw-r--r-- 1 hdfs hadoop   90404 Jan 18 12:25 edits_0000000000433910700-0000000000433911201
-rw-r--r-- 1 hdfs hadoop   48435 Jan 18 12:27 edits_0000000000433911202-0000000000433911468
-rw-r--r-- 1 hdfs hadoop  882923 Jan 18 12:29 edits_0000000000433911469-0000000000433915208
-rw-r--r-- 1 hdfs hadoop 1048576 Jan 18 12:31 edits_inprogress_0000000000433915209
-rw-r--r-- 1 hdfs hadoop       8 Jan 18 12:31 committed-txid

So, as such, JournalNodes do not contain any required information that cannot be rebuilt from the NameNode, so there is nothing to back up.

Parts nice to backup

HDFS

In essence your Hadoop cluster has surely been built to handle terabytes, not to say petabytes, of data, so backing up all your HDFS data is technically not possible. But HDFS replicates each data block (of dfs.blocksize in size, 128 MB by default) multiple times (the parameter is dfs.replication, set to 3 in my case), and you have surely configured what is called rack awareness, meaning your worker nodes are physically located in different racks in your computer room.

So, in other words, if you lose one or multiple worker nodes, or even a complete rack of your Hadoop cluster, this is going to be completely transparent to your applications. At worst you might suffer from a performance decrease, but with no interruption to production (ITP).

But what if you lose the entire data center where your Hadoop cluster is located? We initially had the idea to split our cluster between two data centers geographically separated by 20-30 kilometers (12 to 18 miles), but this would require a (dedicated) low-latency high-speed link (dark fiber or similar) between the two data centers, which is most probably not cost effective…

This is why the most implemented architecture is a second, smaller cluster on a remote site where you try to maintain a copy of your main Hadoop cluster. This copy can be done with the provided Hadoop tool called DistCp, or simply by running the exact same ingestion process on this failover cluster…
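As an illustration, a minimal DistCp sketch (the cluster addresses and paths are assumptions; -update only copies what differs on the target):

[hdfs@clientnode ~]$ hadoop distcp -update hdfs://primarycluster:8020/data hdfs://failovercluster:8020/data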

Running the same ingestion process on two distinct clusters might sound like a bad idea, but if you store your source raw files on a low-cost NFS filer then, first, you can easily back them up to tape. Second, you can feed the exact same copy to two (or more) Hadoop clusters, and in case of a crash or consistency issue you are able to restart the ingestion from the raw files. The secondary cluster can then be, with no issue, smaller than the primary one, as only ingestion will run on it. Interactive queries and users will remain on the primary cluster…

Here I have not mentioned HDFS snapshots at all because, for me, they are not a backup solution at all! They are no different from an NFS snapshot, and the only case they handle is human error. In case of a hardware failure or a data center failure an HDFS snapshot will be of no help, as you will lose it at the same time as the crash…

HDFS capacity planning computation and analysis

Preamble

In our ramp-up period we wanted to estimate the HDFS space already consumed, as well as how this used space is split. This would help us build an HDFS capacity plan and know which investments would be needed. I have found tons of documents on how to do it, but the snapshot “issue” I hit was a nice discovery…

HDFS capacity planning first estimation

The real two first commands you would use are:

[hdfs@clientnode ~]$ hdfs dfs -df -h /
Filesystem                          Size    Used  Available  Use%
hdfs://DataLakeHdfs               89.5 T  22.4 T     62.5 T   25%

And:

[hdfs@clientnode ~]$ hdfs dfs -du -s -h /
5.9 T  /

You can drill down directories size with:

[hdfs@clientnode ~]$ hdfs dfs -du -h /
169.0 G   /app-logs
466.7 M   /apps
12.5 G    /ats
3.1 T     /data
710.4 M   /hdp
0         /livy2-recovery
0         /mapred
16.8 M    /mr-history
1004.4 M  /spark2-history
2.1 T     /tmp
479.7 G   /user

In HDFS you have dfs.datanode.du.reserved, which specifies a reserved space in bytes per volume. This is set to 1304332800 bytes (about 1243.9 MB) in my environment.

I also have the HDFS parameters below, which will be part of the formula:

  • dfs.datanode.du.reserved = 1304332800 bytes: reserved space in bytes per volume. Always leave this much space free for non-DFS use.
  • dfs.blocksize = 128 MB: the default block size for new files, in bytes. You can use the following suffixes (case insensitive): k (kilo), m (mega), g (giga), t (tera), p (peta), e (exa) to specify the size (such as 128k, 512m, 1g, etc.), or provide the complete size in bytes (such as 134217728 for 128 MB).
  • dfs.replication = 3: default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.
  • fs.trash.interval = 360 (minutes): number of minutes after which the checkpoint gets deleted. If zero, the trash feature is disabled. This option may be configured both on the server and the client. If trash is disabled server side then the client side configuration is checked. If trash is enabled on the server side then the value configured on the server is used and the client configuration value is ignored.
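You can verify these values from the command line with hdfs getconf, for example (the outputs shown correspond to the values listed above):

[hdfs@clientnode ~]$ hdfs getconf -confKey dfs.blocksize
134217728
[hdfs@clientnode ~]$ hdfs getconf -confKey dfs.replication
3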

You can have a complete report, with more precise numbers than the hdfs dfs -df -h / command, for all your worker nodes using the command below:

[hdfs@clientnode ~]$ hdfs dfsadmin -report
Configured Capacity: 98378048588800 (89.47 TB)
Present Capacity: 93368566571440 (84.92 TB)
DFS Remaining: 68685157293611 (62.47 TB)
DFS Used: 24683409277829 (22.45 TB)
DFS Used%: 26.44%
Under replicated blocks: 20
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (5):

Name: 10.75.144.13:50010 (worker3.domain.com)
Hostname: worker3.domain.com
Rack: /AH/26
Decommission Status : Normal
Configured Capacity: 19675609717760 (17.89 TB)
DFS Used: 3676038734820 (3.34 TB)
Non DFS Used: 0 (0 B)
DFS Remaining: 14998265052417 (13.64 TB)
DFS Used%: 18.68%
DFS Remaining%: 76.23%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 16
Last contact: Wed Oct 24 14:57:06 CEST 2018
Last Block Report: Wed Oct 24 11:48:58 CEST 2018


Name: 10.75.144.12:50010 (worker2.domain.com)
Hostname: worker2.domain.com
Rack: /AH/26
Decommission Status : Normal
Configured Capacity: 19675609717760 (17.89 TB)
DFS Used: 3884987861604 (3.53 TB)
Non DFS Used: 0 (0 B)
DFS Remaining: 14789450082223 (13.45 TB)
DFS Used%: 19.75%
DFS Remaining%: 75.17%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 14
Last contact: Wed Oct 24 14:57:06 CEST 2018
Last Block Report: Wed Oct 24 09:44:51 CEST 2018


Name: 10.75.144.14:50010 (worker4.domain.com)
Hostname: worker4.domain.com
Rack: /AH/27
Decommission Status : Normal
Configured Capacity: 19675609717760 (17.89 TB)
DFS Used: 6604991718895 (6.01 TB)
Non DFS Used: 0 (0 B)
DFS Remaining: 12068909438191 (10.98 TB)
DFS Used%: 33.57%
DFS Remaining%: 61.34%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 22
Last contact: Wed Oct 24 14:57:06 CEST 2018
Last Block Report: Wed Oct 24 12:36:28 CEST 2018


Name: 10.75.144.11:50010 (worker1.domain.com)
Hostname: worker1.domain.com
Rack: /AH/26
Decommission Status : Normal
Configured Capacity: 19675609717760 (17.89 TB)
DFS Used: 3983207846801 (3.62 TB)
Non DFS Used: 0 (0 B)
DFS Remaining: 14690022328249 (13.36 TB)
DFS Used%: 20.24%
DFS Remaining%: 74.66%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 32
Last contact: Wed Oct 24 14:57:06 CEST 2018
Last Block Report: Wed Oct 24 13:50:10 CEST 2018


Name: 10.75.144.15:50010 (worker5.domain.com)
Hostname: worker5.domain.com
Rack: /AH/27
Decommission Status : Normal
Configured Capacity: 19675609717760 (17.89 TB)
DFS Used: 6534183115709 (5.94 TB)
Non DFS Used: 0 (0 B)
DFS Remaining: 12138510392531 (11.04 TB)
DFS Used%: 33.21%
DFS Remaining%: 61.69%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 40
Last contact: Wed Oct 24 14:57:04 CEST 2018
Last Block Report: Wed Oct 24 10:41:56 CEST 2018

So far, if I do the computation, I get 5.9 TB * 3 (dfs.replication) = 17.7 TB; I am quite a bit below the 22.4 TB used reported by the hdfs dfs -df -h / command… Where have the 4.7 TB gone? Quite a few TB, isn't it?

HDFS snapshot situation

Then, after a bit of investigation, I had the idea to check if HDFS snapshots had been created on my HDFS:

[hdfs@clientnode ~]$ hdfs lsSnapshottableDir -help
Usage:
hdfs lsSnapshottableDir:
        Get the list of snapshottable directories that are owned by the current user.
        Return all the snapshottable directories if the current user is a super user.

[hdfs@clientnode ~]$ hdfs lsSnapshottableDir
drwxr-xr-x 0 hdfs hdfs 0 2018-07-13 18:14 1 65536 /

You can get snapshot(s) name(s) with:

[hdfs@clientnode ~]$ hdfs dfs -ls /.snapshot
Found 1 items
drwxr-xr-x   - hdfs hdfs          0 2018-07-13 18:14 /.snapshot/s20180713-101304.832

Computing the snapshot size this way is not reliable because, when a snapshot entry is just a pointer to an original (unmodified) block, the size of the original block is still counted:

[hdfs@clientnode ~]$ hdfs dfs -du -h /.snapshot
3.3 T  /.snapshot/s20180713-101304.832
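To see what actually differs between a snapshot and the current state of the filesystem, you can use the hdfs snapshotDiff command (the dot denotes the current state):

[hdfs@clientnode ~]$ hdfs snapshotDiff / s20180713-101304.832 .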

You can also get a graphical view using the NameNode UI in Ambari:

[Image: hdfs_capacity_planning01]

Here we are: a snapshot of the HDFS root directory has been created… I rate this as tricky because you do not see it with an hdfs dfs -du command:

[hdfs@clientnode ~]$ hdfs dfs -du -h /
169.0 G   /app-logs
466.7 M   /apps
4.7 G     /ats
3.1 T     /data
710.4 M   /hdp
0         /livy2-recovery
0         /mapred
0         /mr-history
1004.4 M  /spark2-history
2.1 T     /tmp
173.9 G   /user

I have also performed an HDFS filesystem check to be sure everything is fine and no blocks have been marked as corrupted:

[hdfs@clientnode ~]$ hdfs fsck /
.
.

........................
/user/training/.staging/job_1519657336782_0105/job.jar:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1073754565_13755. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
.
/user/training/.staging/job_1519657336782_0105/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1073754566_13756. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
.....
/user/training/.staging/job_1519657336782_0105/libjars/hive-hcatalog-core.jar:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1073754564_13754. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
.
/user/training/.staging/job_1536057043538_0001/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1085621525_11894367. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
.....
/user/training/.staging/job_1536057043538_0002/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1085621527_11894369. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
.....
/user/training/.staging/job_1536057043538_0004/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1085621593_11894435. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
.....
/user/training/.staging/job_1536057043538_0023/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1085622064_11894906. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
.....
/user/training/.staging/job_1536057043538_0025/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1085622086_11894928. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
.....
/user/training/.staging/job_1536057043538_0027/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1085622115_11894957. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
.....
/user/training/.staging/job_1536057043538_0028/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1085622133_11894975. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
.....
/user/training/.staging/job_1536642465198_0002/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1086397707_12670663. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
.....
/user/training/.staging/job_1536642465198_0003/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1086397706_12670662. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
.....
/user/training/.staging/job_1536642465198_0004/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1086397708_12670664. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
.....
/user/training/.staging/job_1536642465198_0005/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1086397718_12670674. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
.....
/user/training/.staging/job_1536642465198_0006/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1086397720_12670676. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
.....
/user/training/.staging/job_1536642465198_0007/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1086397721_12670677. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
.....
/user/training/.staging/job_1536642465198_2307/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1086509846_12782817. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
....

Status: HEALTHY
 Total size:    5981414347660 B (Total open files size: 455501 B)
 Total dirs:    740032
 Total files:   3766023
 Total symlinks:                0 (Files currently being written: 17)
 Total blocks (validated):      3781239 (avg. block size 1581866 B) (Total open file blocks (not validated): 17)
 Minimally replicated blocks:   3781239 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       20 (5.2892714E-4 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0000105
 Corrupt blocks:                0
 Missing replicas:              100 (8.8153436E-4 %)
 Number of data-nodes:          5
 Number of racks:               2
FSCK ended at Wed Oct 17 16:12:48 CEST 2018 in 61172 milliseconds


The filesystem under path '/' is HEALTHY

After delete of HDFS snapshot

Get the snapshot name(s) and delete them as below; I have also forbidden further creation of any snapshot on the root directory (it does not make sense in my opinion):

[hdfs@clientnode  ~]$ hdfs dfsadmin -disallowSnapshot /
disallowSnapshot: The directory / has snapshot(s). Please redo the operation after removing all the snapshots.
[hdfs@clientnode ~]$ hdfs dfs -ls /.snapshot
Found 1 items
drwxr-xr-x   - hdfs hdfs          0 2018-07-13 18:14 /.snapshot/s20180713-101304.832
[hdfs@clientnode ~]$ hdfs dfs -deleteSnapshot / s20180713-101304.832
[hdfs@clientnode ~]$ hdfs dfsadmin -disallowSnapshot /
Disallowing snaphot on / succeeded
[hdfs@clientnode ~]$ hdfs lsSnapshottableDir

After a cleaning phase I reached the stable situation below:

[hdfs@clientnode ~]$ hdfs dfs -df -h /
Filesystem                          Size    Used  Available  Use%
hdfs://DataLakeHdfs               89.5 T  16.8 T     68.1 T   19%
[hdfs@clientnode ~]$ hdfs dfs -du -s -h /
5.5 T  /

So the computation is now more accurate, as 5.5 * 3 = 16.5, which is close to the reported 16.8.

As you may have noticed, my /tmp directory is 2.1 TB, which is quite a lot of space for a temporary directory. For me, all the occupied space was in directories under /tmp/hive. It turned out to be leftovers of aborted Hive queries that can be safely deleted (we currently have one directory of 1.7 TB!!!):

  • hive.exec.scratchdir: this directory is used by Hive to store the plans for the different map/reduce stages of a query, as well as the intermediate outputs of these stages. Hive 0.14.0 and later: HDFS root scratch directory for Hive jobs, which gets created with write-all (733) permission. For each connecting user, an HDFS scratch directory ${hive.exec.scratchdir}/ is created with ${hive.scratch.dir.permission}. Default value: /tmp//hive (Hive 0.8.0 and earlier), /tmp/hive- (as of Hive 0.8.1 to 0.14.0), /tmp/hive (Hive 0.14.0 and later).
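A hedged cleanup sketch for these leftover scratch directories (check first that no Hive query is currently running, and adapt the path to your hive.exec.scratchdir value):

[hdfs@clientnode ~]$ hdfs dfs -ls /tmp/hive
[hdfs@clientnode ~]$ hdfs dfs -rm -r -skipTrash /tmp/hive/*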

ORC versus Parquet compression and response time

Preamble

For its open source positioning we have chosen to install a Hortonworks HDP 2.6 Hadoop cluster. In the initial phase of our Hadoop project, ORC was chosen as the default storage format for our very first Hive tables.

The performance of our queries is, obviously, a key factor we consider. This is why we started to look at Live Long And Process (LLAP) and realized it was not so easy to handle in our small initial cluster. Then the merger between Hortonworks and Cloudera happened, and we decided to move all our tables to the Parquet storage format with the clear objective of using Impala from Cloudera.

But at some point we started to study disk space usage a bit (again, linked to our small initial cluster) and realized that Parquet tables were much bigger than their ORC counterparts. All our Hive tables make heavy use of partitioning, for performance and to ease cleanup by simply dropping old partitions…

There are plenty of articles comparing the Parquet and ORC (and other) storage formats, but if you read them carefully to the end there will most probably be a disclaimer stating that the comparison is tightly linked to the nature of the data. In other words, your data model and figures are unique, and you really have no other option than testing by yourself; this blog post is here to provide a few tricks to achieve this…

Our cluster is running HDP-2.6.4.0 with Ambari version 2.6.1.0.

ORC versus Parquet compression

On one partition of one table we observed:

  • Parquet = 33.9 G
  • ORC = 2.4 G

Digging further, we saw that ORC compression can easily be configured in Ambari, and we set it to zlib:

[Image: orc_vs_parquet01]

Meanwhile the default Parquet compression is (apparently) uncompressed, which is obviously not really good from a compression perspective.

Digging into multiple (contradictory) blog posts and the official documentation, plus personal testing, I have been able to draw the table below:

  • orc.compress: default ZLIB; possible values NONE, ZLIB or SNAPPY
  • parquet.compression: default UNCOMPRESSED; possible values UNCOMPRESSED, GZIP or SNAPPY

Remark:
I have seen many blog posts suggesting parquet.compress as the Parquet compression property, but in my opinion this one does not work…

To change the compression algorithm when creating a table, use the TBLPROPERTIES keyword like:

STORED AS PARQUET TBLPROPERTIES("parquet.compression"="GZIP");
STORED AS ORC TBLPROPERTIES("orc.compress"="SNAPPY")
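For an existing table you can change the property with ALTER TABLE, keeping in mind it only affects newly written files; existing data must be rewritten (with an INSERT OVERWRITE, for example):

ALTER TABLE default.test SET TBLPROPERTIES("parquet.compression"="GZIP");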

So, as an example, the test table I have built for my testing is something like:

drop table default.test purge;

CREATE TABLE default.test(
  column04 string,
  column05 string,
  column06 int,
  column07 array<string>)
PARTITIONED BY (column01 string, column02 string, column03 int)
STORED AS PARQUET
TBLPROPERTIES("parquet.compression"="SNAPPY");

To get the size of your test table (replace database_name and table_name with real values), just use something like this (check the value of hive.metastore.warehouse.dir for the /apps/hive/warehouse part):

[hdfs@server01 ~]$ hdfs dfs -du -s -h /apps/hive/warehouse/database_name/table_name

Then I copied my source Parquet table to this test table using the six combinations of storage format and compression algorithm, and the results are:

  • ORC (orc.compress): NONE = 12.9 G, ZLIB = 2.4 G, SNAPPY = 3.2 G
  • Parquet (parquet.compression): UNCOMPRESSED = 33.9 G, GZIP = 7.3 G, SNAPPY = 11.5 G

Or graphically:

[Image: orc_vs_parquet02]

To do the copy I have written a shell script that dynamically copies each partition of the source table to a destination table having the same layout:

#!/bin/bash
#
# Y.JAQUIER  06/02/2019  Creation
#
# -----------------------------------------------------------------------------
#
# This job copies one table to another one
# Useful when migrating from Parquet to ORC for example
#
# Destination table must have been created before using this script
#
# -----------------------------------------------------------------------------
#

###############################################################################
# Function to execute a query on hive
# Param 1 : Query string
###############################################################################
function execute_query
{
  #echo "Execute query: $1"
  MYSTART=$(date)
  beeline -u "jdbc:hive2://${HIVE_CONNEXION}?tez.queue.name=${HIVE_QUEUE}" -n ${HIVE_USER} --incremental=true --silent=true --fastConnect=true -e "$1"
  status=$?
  MYEND=$(date)
  MYDURATION=$(expr $(date -d "$MYEND" +%s) - $(date -d "$MYSTART" +%s))
  echo "Executed in $MYDURATION second(s)"
  #echo "MEASURE|$FAB|${INGESTION_DATE}|$1|$(date -d "$MYSTART" +%Y%m%d%H%M%S)|$(date -d "$MYEND" +%Y%m%d%H%M%S)|$MYDURATION"
  if [[ $status != "0" ]]
  then
    echo " !!!!!!!! Error Execution Query $1 !!!!!!!! "
    exit -1
  fi
}

###############################################################################
# Function to execute a query on hive in CSV output file
# Param 1 : Query string
# Param 2 : output file
###############################################################################
function execute_query_to_csv
{
  #echo "Execute query to csv: $1 => $2"
  MYSTART=$(date)
  beeline -u "jdbc:hive2://${HIVE_CONNEXION}?tez.queue.name=${HIVE_QUEUE}" -n ${HIVE_USER} --outputformat=csv2 --silent=true --verbose=false \
  --showHeader=false --fastConnect=true -e "$1" > $2
  status=$?
  MYEND=$(date)
  MYDURATION=$(expr $(date -d "$MYEND" +%s) - $(date -d "$MYSTART" +%s))
  #echo "MEASURE|$FAB|${INGESTION_DATE}|$1|$(date -d "$MYSTART" +%Y%m%d%H%M%S)|$(date -d "$MYEND" +%Y%m%d%H%M%S)|$MYDURATION"
  if [[ $status != "0" ]]
  then
    echo " !!!!!!!! Error Execution Query to csv : $1 => $2 !!!!!!!! "
    exit -1
  fi
}

###############################################################################
# Print the help
###############################################################################
function print_help
{
  echo "syntax:"
  echo "$0 source_database.source_table destination_database.destination_table partition_filter_pattern_and_option (not mandatory)"
  echo "Source and destination table must exists"
  echo "Destination table data will be overwritten !!"
}

###############################################################################
# Main
###############################################################################
HIVE_CONNEXION="..."
HIVE_QUEUE=...
HIVE_USER=...

TABLE_SOURCE=$1
TABLE_DESTINATON=$2
PARTITION_FILTER=$3

if [[ $# -lt 2 ]]
then
  print_help
  exit 0 
fi

echo "This will overwrite $TABLE_DESTINATON table by $TABLE_SOURCE table data !"
echo "The destination table MUST be created first !"
read -p "Do you wish to continue [Y | N] ? " answer
case $answer in
  [Yy]* ) ;;
  [Nn]* ) exit 0;;
  * ) echo "Please answer yes or no."; exit 0;;
esac

# Generate partitions list
execute_query_to_csv "show partitions $1;" partition_list.$$.csv

# Filter partition list based on the given regular expression
if [[ $PARTITION_FILTER != "" ]]
then
  grep $PARTITION_FILTER partition_list.$$.csv > partition_list1.$$.csv
  mv partition_list1.$$.csv partition_list.$$.csv
fi

partition_number=$(cat partition_list.$$.csv | wc -l)

# Generate column list (with partition columns which must be removed)
execute_query_to_csv "show columns from $1;" column_list.$$.csv

# First partition column
while read line
do
  first_partition_column=$(echo $line | awk -F "=" '{print $1}')
  break
done < partition_list.$$.csv

# Columns list without partition columns
columns_list_without_partitions=""
while read line
do
  if [[ $line = $first_partition_column ]]
  then
    break
  fi
  columns_list_without_partitions+="$line,"
done < column_list.$$.csv

# Remove trailing comma
columns_length=${#columns_list_without_partitions}
columns_list_without_partitions=${columns_list_without_partitions:0:$(($columns_length-1))}

echo "The source table has $partition_number partition(s)"

# Generate list of all insert partition by partition
i=1
while read line
do
  #echo $line
  echo "Partition ${i}:"
  j=1
  query1="insert overwrite table $TABLE_DESTINATON partition ("
  query2=""
  query3=""
  IFS="/"
  read -r -a partition_list <<< "$line"

  # We fetch all partition columns
  for partition_column_list in "${partition_list[@]}"
  do
    IFS="="
    read -r -a partition_columns <<< "$partition_column_list"
    # First insert is with WHERE and we must enclosed columns value with double quote
    if [[ $j -eq 1 ]]
    then
      query3+="where ${partition_columns[0]}=\"${partition_columns[1]}\" "
    else
      query3+="and ${partition_columns[0]}=\"${partition_columns[1]}\" "
    fi
    query2+="${partition_columns[0]}=\"${partition_columns[1]}\","
    j=$((j+1))
  done
  IFS=""
  i=$((i+1))
  query2_length=${#query2}
  query2_length=$((query2_length-1))
  query2=${query2:0:$query2_length}
  final_query=$query1$query2") select "$columns_list_without_partitions" from $TABLE_SOURCE "$query3
  #Executing the query comment out the execute query to test it before running
  echo $final_query
  execute_query $final_query
done < partition_list.$$.csv
rm partition_list.$$.csv
rm column_list.$$.csv
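Usage is then, for example (the script name and the database/table names are placeholders):

[hdfs@clientnode ~]$ ./copy_table.sh source_db.table01_pqt source_db.table01_orc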

So clearly, for the nature of our data, the ORC storage format cannot be beaten when it comes to disk usage...

I took additional figures when we migrated the live tables of our Spotfire data model:

[hdfs@server01 ~]$ hdfs dfs -du -s -h /apps/hive/warehouse/database.db/table01*
564.2 G  /apps/hive/warehouse/database.db/table01_orc
3.6 T  /apps/hive/warehouse/database.db/table01_pqt
[hdfs@server01 ~]$ hdfs dfs -du -s -h /apps/hive/warehouse/database.db/table02*
121.3 M  /apps/hive/warehouse/database.db/table02_pqt
5.6 M  /apps/hive/warehouse/database.db/table02_orc

ORC versus Parquet response time

But what about response time? To check this I extracted a typical query used by Spotfire and executed it on the Parquet (UNCOMPRESSED) and ORC (ZLIB) tables:

Iteration Parquet (s) ORC (s)
Run 1 6.553 0.077
Run 2 5.291 0.066
Run 3 1.915 0.065
Run 4 2.987 0.074
Run 5 1.825 0.070
Run 6 2.720 0.092
Run 7 3.989 0.062
Run 8 4.526 0.079
Run 9 3.385 0.082
Run 10 3.588 0.176
Average 3.6779 0.0843

So, on average over my ten runs, ORC is a factor of 43-44 times faster...

I would explain this by the fact that there is much less data to read from disk for the ORC tables and, again, this is linked to our data node hardware, where we have many more CPU threads than disk spindles (the ratio of one thread per physical disk is not followed at all). If you are low on CPU and have plenty of disks (which is also not a good practice for a Hadoop cluster) you might experience different results...

HDFS balancer options to speed up balance operations

Preamble

We started to receive the Ambari alerts below:

  • Percent DataNodes With Available Space
    affected: [2], total: [5]
  • DataNode Storage
    Remaining Capacity:[4476139956751], Total Capacity:[77% Used, 19675609717760]

In itself the DataNode Storage alert is not super serious because it is sent far in advance (> 75%), but it anyway tells you that you are reaching the storage limit of your cluster. One drawback we have seen is that the impacted DataNodes lose contact with the Ambari server and we are often obliged to restart the process.

On our small Hadoop cluster, two nodes are more filled than the three others…

This should be easy to solve with the command below:

[hdfs@clientnode ~]$ hdfs balancer

HDFS Balancer

We issued the HDFS balancer command with no options, but after a very long run (almost a week) we ended up with a still unbalanced situation. We even tried to rerun the command, but it completed very quickly (less than 2 seconds) and left us with two DataNodes still more filled than the three others.

[hdfs@clientnode ~]$ hdfs dfsadmin -report
Configured Capacity: 98378048588800 (89.47 TB)
Present Capacity: 93358971611260 (84.91 TB)
DFS Remaining: 31894899799432 (29.01 TB)
DFS Used: 61464071811828 (55.90 TB)
DFS Used%: 65.84%
Under replicated blocks: 24
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (5):

Name: 192.168.1.3:50010 (datanode03.domain.com)
Hostname: datanode03.domain.com
Rack: /AH/26
Decommission Status : Normal
Configured Capacity: 19675609717760 (17.89 TB)
DFS Used: 11130853114413 (10.12 TB)
Non DFS Used: 0 (0 B)
DFS Remaining: 7534254091791 (6.85 TB)
DFS Used%: 56.57%
DFS Remaining%: 38.29%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 25
Last contact: Tue Jan 08 12:51:44 CET 2019
Last Block Report: Tue Jan 08 06:52:34 CET 2019


Name: 192.168.1.2:50010 (datanode02.domain.com)
Hostname: datanode02.domain.com
Rack: /AH/26
Decommission Status : Normal
Configured Capacity: 19675609717760 (17.89 TB)
DFS Used: 11269739413291 (10.25 TB)
Non DFS Used: 0 (0 B)
DFS Remaining: 7403207769673 (6.73 TB)
DFS Used%: 57.28%
DFS Remaining%: 37.63%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 33
Last contact: Tue Jan 08 12:51:44 CET 2019
Last Block Report: Tue Jan 08 11:30:59 CET 2019


Name: 192.168.1.4:50010 (datanode04.domain.com)
Hostname: datanode04.domain.com
Rack: /AH/27
Decommission Status : Normal
Configured Capacity: 19675609717760 (17.89 TB)
DFS Used: 14226431394146 (12.94 TB)
Non DFS Used: 0 (0 B)
DFS Remaining: 4448006323316 (4.05 TB)
DFS Used%: 72.30%
DFS Remaining%: 22.61%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 14
Last contact: Tue Jan 08 12:51:43 CET 2019
Last Block Report: Tue Jan 08 12:12:55 CET 2019


Name: 192.168.1.1:50010 (datanode01.domain.com)
Hostname: datanode01.domain.com
Rack: /AH/26
Decommission Status : Normal
Configured Capacity: 19675609717760 (17.89 TB)
DFS Used: 10638187881052 (9.68 TB)
Non DFS Used: 0 (0 B)
DFS Remaining: 8035048514823 (7.31 TB)
DFS Used%: 54.07%
DFS Remaining%: 40.84%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 20
Last contact: Tue Jan 08 12:51:43 CET 2019
Last Block Report: Tue Jan 08 09:38:50 CET 2019


Name: 192.168.1.5:50010 (datanode05.domain.com)
Hostname: datanode05.domain.com
Rack: /AH/27
Decommission Status : Normal
Configured Capacity: 19675609717760 (17.89 TB)
DFS Used: 14198860008926 (12.91 TB)
Non DFS Used: 0 (0 B)
DFS Remaining: 4474383099829 (4.07 TB)
DFS Used%: 72.16%
DFS Remaining%: 22.74%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 29
Last contact: Tue Jan 08 12:51:45 CET 2019
Last Block Report: Tue Jan 08 11:50:32 CET 2019

The NameNode UI gives a clean graphical picture:

hdfs_balancer01

Two datanodes are still more filled than the three others.
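
If you prefer the command line to the NameNode UI, a quick sketch to extract the per-node utilization from the report above is:

[hdfs@clientnode ~]$ hdfs dfsadmin -report | grep -E 'Hostname|DFS Used%'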

Digging into the official HDFS balancer documentation we then found two interesting parameters: -source and -threshold.

-source is easily understandable with the example below from the official documentation (which I prefer to quote here, given the acquisition of Hortonworks by Cloudera):

The following table shows an example, where the average utilization is 25% so that D2 is within the 10% threshold. It is unnecessary to move any blocks from or to D2. Without specifying the source nodes, HDFS Balancer first moves blocks from D2 to D3, D4 and D5, since they are under the same rack, and then moves blocks from D1 to D2, D3, D4 and D5.
By specifying D1 as the source node, HDFS Balancer directly moves blocks from D1 to D3, D4 and D5.

Datanodes (with the same capacity)  Utilization  Rack
D1                                  95%          A
D2                                  30%          B
D3, D4, and D5                      0%           B

This is also explained in Storage group pairing policy:

The HDFS Balancer selects over-utilized or above-average storage as source storage, and under-utilized or below-average storage as target storage. It pairs a source storage group with a target storage group (source → target) in a priority order depending on whether or not the source and the target storage reside in the same rack.

And this rack awareness story is exactly what we have, as displayed in the server list of Ambari:

hdfs_balancer02

-threshold is also an interesting parameter to be stricter with nodes above or below the average utilization…
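
For example, a stricter run (a sketch; -threshold is expressed in percentage points around the cluster average utilization) would be:

[hdfs@clientnode ~]$ hdfs balancer -threshold 5

This asks the balancer to bring every DataNode within 5 percentage points of the average DFS Used% (65.84% in our case), instead of the default 10.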

So we tried, unsuccessfully, the command below:

[hdfs@clientnode ~]$ hdfs balancer -source datanode04.domain.com,datanode05.domain.com -threshold 1

We also found many other “more aggressive” options, listed below:

DataNode Configuration Properties:

Property                                   Default         Background Mode   Fast Mode
dfs.datanode.balance.max.concurrent.moves  5               4 x (# of disks)  4 x (# of disks)
dfs.datanode.balance.max.bandwidthPerSec   1048576 (1 MB)  use default       10737418240 (10 GB)

Balancer Configuration Properties:

Property                                   Default              Background Mode    Fast Mode
dfs.datanode.balance.max.concurrent.moves  5                    # of disks         4 x (# of disks)
dfs.balancer.moverThreads                  1000                 use default        20,000
dfs.balancer.max-size-to-move              10737418240 (10 GB)  1073741824 (1 GB)  107374182400 (100 GB)
dfs.balancer.getBlocks.min-block-size      10485760 (10 MB)     use default        104857600 (100 MB)
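
Note that the DataNode balancing bandwidth can also be changed at runtime, without restarting the DataNodes, through the dfsadmin sub-command (a sketch with an assumed 100 MB/s value):

[hdfs@clientnode ~]$ hdfs dfsadmin -setBalancerBandwidth 104857600

The value is in bytes per second and applies until the next DataNode restart.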

So we tried again:

[hdfs@clientnode ~]$ hdfs balancer -Ddfs.balancer.movedWinWidth=5400000 -Ddfs.balancer.moverThreads=50 -Ddfs.balancer.dispatcherThreads=200 -threshold 1 \
-source datanode04.domain.com,datanode05.domain.com 1>/tmp/balancer-out.log 2>/tmp/balancer-err.log

But again it did not change anything special, and both runs completed very quickly…

So clearly in our case the rack awareness story is a blocking factor. One mistake we have made is to have an odd number of DataNodes: this 2-3 configuration across two racks is clearly not a good idea. Of course we could remove the rack awareness configuration to get a well balanced cluster, but we do not want to lose the extra high availability we have with it. So the only available plan is to buy new DataNodes or add more disks to our existing nodes, as we have fewer disks than threads…

References

The post HDFS balancer options to speed up balance operations appeared first on IT World.

]]>
https://blog.yannickjaquier.com/hadoop/hdfs-balancer-options-to-speed-up-balance-operations.html/feed 0
JournalNode Web UI time out critical error on port 8480 https://blog.yannickjaquier.com/hadoop/journalnode-web-ui-critical-error.html https://blog.yannickjaquier.com/hadoop/journalnode-web-ui-critical-error.html#respond Fri, 07 Jun 2019 10:11:29 +0000 https://blog.yannickjaquier.com/?p=4549 Preamble One of our three JournalNodes was constantly failing for an alert on JournalNode Web UI saying the connection to http://journalnode3.domain.com:8480 has timed out. In Ambari the configured alert is this one: JournalNodes is: High-availabilty clusters use JournalNodes to synchronize active and standby NameNodes. The active NameNode writes to each JournalNode with changes, or “edits,” […]

The post JournalNode Web UI time out critical error on port 8480 appeared first on IT World.

]]>

Table of contents

Preamble

One of our three JournalNodes was constantly raising an alert on the JournalNode Web UI check, saying the connection to http://journalnode3.domain.com:8480 has timed out.

In Ambari the configured alert is this one:

journalnode_web_ui01

JournalNodes are described as follows:

High-availabilty clusters use JournalNodes to synchronize active and standby NameNodes. The active NameNode writes to each JournalNode with changes, or “edits,” to HDFS namespace metadata. During failover, the standby NameNode applies all edits from the JournalNodes before promoting itself to the active state.

Obviously the URL is also not responding when accessing it through a web browser…

JournalNode Web UI time out resolution

The port was correctly used by a listening process:

[root@journalnode3 ~]# netstat -an | grep 8480 |grep LISTEN
tcp        0      0 0.0.0.0:8480            0.0.0.0:*               LISTEN

I found this interesting article called Frequent journal node connection timeout alerts.

As suggested, and as expected, the curl command does not return anything and ends up with a timeout:

[root@journalnode3 ~]# curl -v http://journalnode3.domain.com:8480 --max-time 4 | tail -4 

I also had tens of CLOSE_WAIT connections to this port, seen using the suggested command below:

[root@journalnode3 ~]# netstat -putane | grep -i 8480
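
To quickly count them (a sketch, same netstat options as above):

[root@journalnode3 ~]# netstat -putane | grep -i 8480 | grep -c CLOSE_WAIT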

You can find the JournalNode edits directory with the dfs.journalnode.edits.dir HDFS variable, which is set to /var/qjn for me.
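
If you do not want to hunt through Ambari or the XML configuration files, a quick way to retrieve it from any node (a sketch using the standard hdfs getconf sub-command) is:

[hdfs@clientnode ~]$ hdfs getconf -confKey dfs.journalnode.edits.dir
/var/qjn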

The files in the directory of the problematic node were all out of date:

.
-rw-r--r-- 1 hdfs hadoop 182246 Oct 31 13:43 edits_0000000000175988225-0000000000175989255
-rw-r--r-- 1 hdfs hadoop 595184 Oct 31 13:45 edits_0000000000175989256-0000000000175992263
-rw-r--r-- 1 hdfs hadoop 216550 Oct 31 13:47 edits_0000000000175992264-0000000000175993354
-rw-r--r-- 1 hdfs hadoop 472885 Oct 31 13:49 edits_0000000000175993355-0000000000175995694
-rw-r--r-- 1 hdfs hadoop 282984 Oct 31 13:51 edits_0000000000175995695-0000000000175997143
-rw-r--r-- 1 hdfs hadoop      8 Oct 31 13:51 committed-txid
-rw-r--r-- 1 hdfs hadoop 626688 Oct 31 13:51 edits_inprogress_0000000000175997144

Versus the directory on one of the two healthy JournalNodes:

.
-rw-r--r-- 1 hdfs hadoop  174901 Nov  8 11:28 edits_0000000000184771705-0000000000184772755
-rw-r--r-- 1 hdfs hadoop  418119 Nov  8 11:30 edits_0000000000184772756-0000000000184774838
-rw-r--r-- 1 hdfs hadoop  342889 Nov  8 11:32 edits_0000000000184774839-0000000000184776640
-rw-r--r-- 1 hdfs hadoop  270983 Nov  8 11:34 edits_0000000000184776641-0000000000184778154
-rw-r--r-- 1 hdfs hadoop  593676 Nov  8 11:36 edits_0000000000184778155-0000000000184781027
-rw-r--r-- 1 hdfs hadoop 1048576 Nov  8 11:37 edits_inprogress_0000000000184781028
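
A quick way to spot the lagging node (a sketch, assuming the same dfs.journalnode.edits.dir on the three servers, hypothetical host names and passwordless SSH) is to compare the most recent edit files on each JournalNode:

# show the three most recently modified edit files on each JournalNode
for host in journalnode1 journalnode2 journalnode3; do
  echo "== $host"
  ssh $host 'ls -lt /var/qjn/*/current | head -3'
done

The out-of-date node will show much older modification dates than the two others.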

In the /var/log/hadoop/hdfs log directory I have also seen the error messages below in the hadoop-hdfs-journalnode-server1.domain.com.log file:

2018-11-08 12:18:20,949 WARN  namenode.FSImage (EditLogFileInputStream.java:scanEditLog(359)) - Caught exception after scanning through 0 ops from
/var/qjn/ManufacturingDataLakeHdfs/current/edits_inprogress_0000000000175997144 while determining its valid length. Position was 626688
java.io.IOException: Can't scan a pre-transactional edit log.
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4974)
        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)
        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:355)
        at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:551)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:192)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:152)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:90)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:99)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.startLogSegment(JournalNodeRpcServer.java:165)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.startLogSegment(QJournalProtocolServerSideTranslatorPB.java:186)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25425)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)

I stopped the JournalNode process via Ambari on the problematic server and moved the out-of-date edits_inprogress_xxx file:

[root@journalnode3 current]# pwd
/var/qjn/Hdfsproject/current
[root@journalnode3 current]# mv edits_inprogress_0000000000175997144 edits_inprogress_0000000000175997144.bak

Then I restarted the JournalNode process on this server via Ambari… It took a while to recover the situation, but after a few days the log generation decreased and the problematic JournalNode was able to cope with the latency… You might also delete the entire directory; I have not tested this personally but it should work…

Watch the log directory size on the problematic JournalNode server as it may fill up very fast…
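
Something like the following can help keep an eye on it (a sketch using the standard watch and du commands):

[root@journalnode3 ~]# watch -n 60 'du -sh /var/log/hadoop/hdfs'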

References

  • A Journal Node is in Bad Health in Cloudera Manager with “java.io.IOException: Can’t scan a pre-transactional edit log” (Doc ID 2160881.1)
  • HDFS Metadata Directories Explained

The post JournalNode Web UI time out critical error on port 8480 appeared first on IT World.

]]>
https://blog.yannickjaquier.com/hadoop/journalnode-web-ui-critical-error.html/feed 0
YARN command line for low level management of applications https://blog.yannickjaquier.com/hadoop/yarn-command-line-manage-applications.html https://blog.yannickjaquier.com/hadoop/yarn-command-line-manage-applications.html#respond Fri, 10 May 2019 07:24:26 +0000 https://blog.yannickjaquier.com/?p=4566 Preamble We recently implemented LLAP in our Hadoop cluster. To do this as you can see in the Hortonworks configuration guideline we have dedicated a YARN queue 100% to run LLAP queries and this YARN queue dedicated to LLAP has a greater priority than the other YARN queues. The queue we have used was already […]

The post YARN command line for low level management of applications appeared first on IT World.

]]>

Table of contents

Preamble

We recently implemented LLAP in our Hadoop cluster. To do this, as you can see in the Hortonworks configuration guideline, we have dedicated a YARN queue 100% to running LLAP queries, and this YARN queue dedicated to LLAP has a greater priority than the other YARN queues. The queue we used already existed and was used to perform business reporting queries.

We unfortunately observed a bad side effect of this change. We had plenty of scripts (mainly in Python) connecting to the Hive server directly and not specifying any queue. Those scripts were also not going through ZooKeeper and, in any case, not specifying the new JDBC Hive2 URL to connect to HiveServer2 Interactive, i.e. LLAP. The (unexpected) result was multiple applications automatically allocated to the LLAP queue in non-LLAP mode, so having NO resources available to be launched…
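
For reference, the kind of JDBC URL those scripts should have used looks like the sketch below (hypothetical ZooKeeper host names; on HDP the default ZooKeeper namespace for HiveServer2 Interactive is hiveserver2-hive2):

jdbc:hive2://zk1.domain.com:2181,zk2.domain.com:2181,zk3.domain.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2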

The solution at that stage is to kill them or to try to move them to a different queue… Part of this can be done with the Resource Manager UI, but for fine-grained modifications like queue allocation or application priority you must use the YARN command line!

The edition we have chosen is the Hortonworks one and we have installed release HDP-2.6.4.0.

Problematic situation and YARN command line first trial

From the Resource Manager UI we can see that many applications have been launched and have landed in the higher priority queue for LLAP (spotfire). I started to suspect an issue because the applications were waiting indefinitely. Application ID application_1543251372086_1684, for example, was launched multiple hours before (at the time of the screen shot) and was still at 0% completion:

yarn_command_line01

The running applications have no allocated resources and we don’t even see them in TEZ view:

yarn_command_line02

You can do almost everything with the Resource Manager UI, but it is always nice to have the YARN command line equivalents, in case one day you would like to develop a few scripts to ease your life (the queue move below is also NOT accessible from the graphical interface). The YARN command line is also faster if you need to check the status of multiple applications. Start by listing all applications, running and waiting to be run (ACCEPTED), with:

[yarn@mgmtserver ~]$ yarn application -list
18/11/28 14:43:04 INFO client.AHSProxy: Connecting to Application History server at masternode01.domain.com/192.168.1.1:10200
18/11/28 14:43:04 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
18/11/28 14:43:04 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):24
                Application-Id      Application-Name        Application-Type          User           Queue                   State             Final-State             Progress                        Tracking-URL
application_1543251372086_1684  HIVE-bd74b31a-9151-43bb-a787-6159d79b0970                        TEZ      training        spotfire                 RUNNING               UNDEFINED                   0% http://worker03.domain.com:40313/ui/
application_1543251372086_2692  HIVE-3f923b84-4c06-4b4e-b395-e325a4721714                        TEZ          hive        spotfire                 RUNNING               UNDEFINED                   0% http://worker02.domain.com:34894/ui/
application_1543251372086_2735  HIVE-258d99b0-f275-4dae-853b-20da9e803fef                        TEZ      mfgaewsp        spotfire                 RUNNING               UNDEFINED                   0% http://worker01.domain.com:38295/ui/
application_1543251372086_2774  HIVE-abbb82a4-af93-403d-abc2-58fa6053a217                        TEZ      training        spotfire                 RUNNING               UNDEFINED                   0% http://worker05.domain.com:39203/ui/
application_1543251372086_2929  HIVE-c3d5d016-30f9-4237-ae1a-2858945070ae                        TEZ      training        spotfire                 RUNNING               UNDEFINED                   0% http://worker04.domain.com:43126/ui/
application_1543251372086_2930  HIVE-37c2b156-43e2-43a5-86b5-fa96f29486cf                        TEZ      training        spotfire                ACCEPTED               UNDEFINED                   0%                                 N/A
application_1543251372086_2954  HIVE-de98d006-f1ee-43b6-9928-99c2b8338c2d                        TEZ    mfgdl_ingestion   spotfire                ACCEPTED               UNDEFINED                   0%                                 N/A
application_1543251372086_2955  HIVE-0ddab9a1-b335-4fd0-a717-db7ba5d7a6fd                        TEZ    mfgdl_ingestion   spotfire                ACCEPTED               UNDEFINED                   0%                                 N/A
application_1543251372086_2968  HIVE-ba458900-e9ee-48ad-b0a1-839be5208a81                        TEZ      training        spotfire                ACCEPTED               UNDEFINED                   0%                                 N/A
application_1543251372086_2969  HIVE-e59da0bc-bf0b-428c-ad0c-3e262825bbc2                        TEZ      training        spotfire                ACCEPTED               UNDEFINED                   0%                                 N/A
application_1543251372086_2967  HIVE-770714ee-8c6f-46b3-9374-b768b39b7f01                        TEZ      training        spotfire                ACCEPTED               UNDEFINED                   0%                                 N/A
application_1543251372086_2952  HIVE-01e31000-829c-44b3-a0ff-7113c5864e86                        TEZ      mfgaewsp        spotfire                ACCEPTED               UNDEFINED                   0%                                 N/A
application_1543251372086_2953  HIVE-b2089514-18e1-48e2-b8b1-85c48b407ab2                        TEZ      mfgaewsp        spotfire                ACCEPTED               UNDEFINED                   0%                                 N/A
application_1543251372086_2946  HIVE-c04f98fd-b2a4-4c6d-9bc5-f2481e111f8c                        TEZ    mfgdl_ingestion  ingestion                 RUNNING               UNDEFINED                0.91% http://worker01.domain.com:43019/ui/
application_1543251372086_2970  HIVE-a3fdd952-d5ed-424e-861c-7e0ac4572d37                        TEZ      training        spotfire                ACCEPTED               UNDEFINED                   0%                                 N/A
application_1543251372086_2971  HIVE-8d10cc75-bdfb-4b41-aa2b-2771f401681e                        TEZ      training        spotfire                ACCEPTED               UNDEFINED                   0%                                 N/A
application_1543251372086_3025  HIVE-960aa95c-59f1-4f0f-9feb-6c615b8163cd                        TEZ      mfgaewsp        spotfire                ACCEPTED               UNDEFINED                   0%                                 N/A
application_1543251372086_3023  HIVE-17224232-ebc8-4a3b-bd78-c183840c521c                        TEZ      mfgaewsp        spotfire                ACCEPTED               UNDEFINED                   0%                                 N/A
application_1543251372086_3100  HIVE-d4aabb8b-b239-4b2e-8bcd-e16880114842                        TEZ    mfgdl_ingestion  ingestion                ACCEPTED               UNDEFINED                   0%                                 N/A
application_1543251372086_3101  HIVE-9e63b3da-af08-40b0-8f74-4fd596089d51                        TEZ    mfgdl_ingestion  ingestion                ACCEPTED               UNDEFINED                   0%                                 N/A
application_1543251372086_3098  HIVE-04686451-e693-45c7-9d79-038cffa25a80                        TEZ    mfgdl_ingestion  ingestion                ACCEPTED               UNDEFINED                   0%                                 N/A
application_1543251372086_3099  HIVE-e12314df-8369-44c6-87cd-5b1c56b97a10                        TEZ    mfgdl_ingestion  ingestion                ACCEPTED               UNDEFINED                   0%                                 N/A
application_1543251372086_0019                 llap0       org-apache-slider          hive        spotfire                 RUNNING               UNDEFINED                 100%   http://worker03.domain.com:39758
application_1542883016916_0125  Thrift JDBC/ODBC Server                SPARK          hive       analytics                 RUNNING               UNDEFINED                  10%             http://192.168.1.2:4040

To list only the running ones:

[yarn@mgmtserver ~]$ yarn application -appStates RUNNING -list
18/11/28 17:15:27 INFO client.AHSProxy: Connecting to Application History server at masternode01.domain.com/192.168.1.1:10200
18/11/28 17:15:27 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
18/11/28 17:15:28 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
Total number of applications (application-types: [] and states: [RUNNING]):11
                Application-Id      Application-Name        Application-Type          User           Queue                   State             Final-State             Progress                        Tracking-URL
application_1543251372086_1684  HIVE-bd74b31a-9151-43bb-a787-6159d79b0970                        TEZ      training        spotfire                 RUNNING               UNDEFINED                   0% http://worker03.domain.com:40313/ui/
application_1543251372086_2692  HIVE-3f923b84-4c06-4b4e-b395-e325a4721714                        TEZ          hive        spotfire                 RUNNING               UNDEFINED                   0% http://worker02.domain.com:34894/ui/
application_1543251372086_2735  HIVE-258d99b0-f275-4dae-853b-20da9e803fef                        TEZ      mfgaewsp        spotfire                 RUNNING               UNDEFINED                   0% http://worker01.domain.com:38295/ui/
application_1543251372086_2774  HIVE-abbb82a4-af93-403d-abc2-58fa6053a217                        TEZ      training        spotfire                 RUNNING               UNDEFINED                   0% http://worker05.domain.com:39203/ui/
application_1543251372086_2929  HIVE-c3d5d016-30f9-4237-ae1a-2858945070ae                        TEZ      training        spotfire                 RUNNING               UNDEFINED                   0% http://worker04.domain.com:43126/ui/
application_1543251372086_2946  HIVE-c04f98fd-b2a4-4c6d-9bc5-f2481e111f8c                        TEZ    mfgdl_ingestion  ingestion                 RUNNING               UNDEFINED                0.91% http://worker01.domain.com:43019/ui/
application_1543251372086_0019                 llap0       org-apache-slider          hive        spotfire                 RUNNING               UNDEFINED                 100%   http://worker03.domain.com:39758
application_1542883016916_0125  Thrift JDBC/ODBC Server                SPARK          hive       analytics                 RUNNING               UNDEFINED                  10%             http://192.168.1.2:4040
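
You can also filter by application type, which is handy to separate the TEZ sessions from the Slider or Spark ones (a sketch combining the two filters):

[yarn@mgmtserver ~]$ yarn application -list -appTypes TEZ -appStates ACCEPTED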

If I take the LLAP process I get its application attempt with:

[yarn@mgmtserver ~]$ yarn applicationattempt -list application_1543251372086_0019
18/11/28 17:30:39 INFO client.AHSProxy: Connecting to Application History server at masternode01.domain.com/192.168.1.1:10200
18/11/28 17:30:39 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
18/11/28 17:30:39 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
Total number of application attempts :1
         ApplicationAttempt-Id                 State                        AM-Container-Id                            Tracking-URL
appattempt_1543251372086_0019_000001                 RUNNING    container_e185_1543251372086_0019_01_000001     http://masternode01.domain.com:8088/proxy/application_1543251372086_0019/

And the list of (LLAP) containers with:

[yarn@mgmtserver ~]$ yarn container -list appattempt_1543251372086_0019_000001
18/11/28 15:03:41 INFO client.AHSProxy: Connecting to Application History server at masternode01.domain.com/192.168.1.1:10200
18/11/28 15:03:41 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
18/11/28 15:03:41 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
Total number of containers :4
                  Container-Id            Start Time             Finish Time                   State                    Host       Node Http Address                                LOG-URL
container_e185_1543251372086_0019_01_000001     Mon Nov 26 18:05:35 +0100 2018                   N/A                 RUNNING    worker03.domain.com:45454      http://worker03.domain.com:8042        http://worker03.domain.com:8042/node/containerlogs/container_e185_1543251372086_0019_01_000001/hive
container_e185_1543251372086_0019_01_000005     Mon Nov 26 18:05:41 +0100 2018                   N/A                 RUNNING    worker04.domain.com:45454      http://worker04.domain.com:8042        http://worker04.domain.com:8042/node/containerlogs/container_e185_1543251372086_0019_01_000005/hive
container_e185_1543251372086_0019_01_000002     Mon Nov 26 18:05:40 +0100 2018                   N/A                 RUNNING    worker03.domain.com:45454      http://worker03.domain.com:8042        http://worker03.domain.com:8042/node/containerlogs/container_e185_1543251372086_0019_01_000002/hive
container_e185_1543251372086_0019_01_000003     Mon Nov 26 18:05:40 +0100 2018                   N/A                 RUNNING    worker01.domain.com:45454      http://worker01.domain.com:8042        http://worker01.domain.com:8042/node/containerlogs/container_e185_1543251372086_0019_01_000003/hive

At that stage it means that any application not using one of the above containers is NOT using LLAP… Is that the case for the indefinitely waiting one we have seen above?

YARN command line to the rescue

Get more details of a particular application (one you suspect to be stuck) with:

[yarn@mgmtserver ~]$ yarn application -status application_1543251372086_1684
18/11/28 14:52:22 INFO client.AHSProxy: Connecting to Application History server at masternode01.domain.com/192.168.1.1:10200
18/11/28 14:52:22 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
18/11/28 14:52:22 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
Application Report :
        Application-Id : application_1543251372086_1684
        Application-Name : HIVE-bd74b31a-9151-43bb-a787-6159d79b0970
        Application-Type : TEZ
        User : training
        Queue : spotfire
        Application Priority : null
        Start-Time : 1543339326789
        Finish-Time : 0
        Progress : 0%
        State : RUNNING
        Final-State : UNDEFINED
        Tracking-URL : http://worker03.domain.com:40313/ui/
        RPC Port : 42387
        AM Host : worker03.domain.com
        Aggregate Resource Allocation : 61513551 MB-seconds, 15017 vcore-seconds
        Log Aggregation Status : NOT_START
        Diagnostics :
        Unmanaged Application : false
        Application Node Label Expression : 
        AM container Node Label Expression : 

The application is in state RUNNING, in the YARN queue spotfire (LLAP), and its progress is 0%.

Get the application attempt (in order to get container list) with:

[yarn@mgmtserver ~]$ yarn applicationattempt -list application_1543251372086_1684
18/11/28 14:53:41 INFO client.AHSProxy: Connecting to Application History server at masternode01.domain.com/192.168.1.1:10200
18/11/28 14:53:41 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
18/11/28 14:53:41 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
Total number of application attempts :1
         ApplicationAttempt-Id                 State                        AM-Container-Id                            Tracking-URL
appattempt_1543251372086_1684_000001                 RUNNING    container_e185_1543251372086_1684_01_000001     http://masternode01.domain.com:8088/proxy/application_1543251372086_1684/

Get container list with:

[yarn@mgmtserver ~]$ yarn container -list appattempt_1543251372086_1684_000001
18/11/28 14:53:55 INFO client.AHSProxy: Connecting to Application History server at masternode01.domain.com/192.168.1.1:10200
18/11/28 14:53:55 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
18/11/28 14:53:55 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
Total number of containers :1
                  Container-Id            Start Time             Finish Time                   State                    Host       Node Http Address                                LOG-URL
container_e185_1543251372086_1684_01_000001     Wed Nov 28 10:42:02 +0100 2018                   N/A                 RUNNING    worker03.domain.com:45454      http://worker03.domain.com:8042        http://worker03.domain.com:8042/node/containerlogs/container_e185_1543251372086_1684_01_000001/training

Here we see that this application is in the spotfire queue but is not running in one of the LLAP containers, hence the issue…

Get the status of a container with:

[yarn@mgmtserver ~]$ yarn container -status container_e185_1543251372086_1684_01_000001
18/11/28 14:54:25 INFO client.AHSProxy: Connecting to Application History server at masternode01.domain.com/192.168.1.1:10200
18/11/28 14:54:25 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
18/11/28 14:54:25 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
Container Report :
        Container-Id : container_e185_1543251372086_1684_01_000001
        Start-Time : 1543398122796
        Finish-Time : 0
        State : RUNNING
        LOG-URL : http://worker03.domain.com:8042/node/containerlogs/container_e185_1543251372086_1684_01_000001/training
        Host : worker03.domain.com:45454
        NodeHttpAddress : http://worker03.domain.com:8042
        Diagnostics : null

Now you can decide to abruptly kill it with:

[yarn@mgmtserver ~]$ yarn application -kill application_1543251372086_1684
18/11/28 17:50:51 INFO client.AHSProxy: Connecting to Application History server at masternode01.domain.com/192.168.1.1:10200
18/11/28 17:50:51 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
18/11/28 17:50:51 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
Killing application application_1543251372086_1684
18/11/28 17:50:51 INFO impl.YarnClientImpl: Killed application application_1543251372086_1684
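
If many applications are stuck in the same queue, you might script the cleanup (a minimal Bash sketch, assuming the spotfire queue and that every listed application must really be killed; double check the list before running such a loop):

# list ACCEPTED applications, keep those in the spotfire queue, kill them one by one
for app in $(yarn application -list -appStates ACCEPTED 2>/dev/null | grep spotfire | awk '{print $1}'); do
  yarn application -kill $app
done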

Or, instead of killing, you can move the application to a non-LLAP queue that has free resources with the -movetoqueue option:

[yarn@mgmtserver ~]$ yarn application -queue analytics -movetoqueue application_1543251372086_1684
18/11/28 15:05:53 INFO client.AHSProxy: Connecting to Application History server at masternode01.domain.com/192.168.1.1:10200
18/11/28 15:05:53 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
18/11/28 15:05:53 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
Moving application application_1543251372086_1684 to queue analytics
Successfully completed move.

In the report above, the Start-Time 1543339326789 is Tue Nov 27 18:22:06 +0100 2018, which is almost 24 hours before I took the screen shot. You can convert the timestamp to a human-readable date with many available web sites, or directly in Bash with something like (man date for more information):

[yarn@mgmtserver ~]$ date --date='@1543339327' +'%c'
Tue 27 Nov 2018 06:22:07 PM CET
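
Note that Start-Time is in milliseconds since the epoch while date expects seconds, hence the three digits dropped above; Bash arithmetic can do the division for you (a sketch):

[yarn@mgmtserver ~]$ date --date="@$((1543339326789 / 1000))" +'%c'
Tue 27 Nov 2018 06:22:06 PM CET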

After a very short period I finally got the result below (I wanted to go faster by changing the application priority, but my release is too old):

[yarn@mgmtserver ~]$ yarn application -status application_1543251372086_1684
18/11/28 15:08:54 INFO client.AHSProxy: Connecting to Application History server at masternode01.domain.com/192.168.1.1:10200
18/11/28 15:08:54 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
18/11/28 15:08:54 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
Application Report :
        Application-Id : application_1543251372086_1684
        Application-Name : HIVE-bd74b31a-9151-43bb-a787-6159d79b0970
        Application-Type : TEZ
        User : training
        Queue : analytics
        Application Priority : null
        Start-Time : 1543339326789
        Finish-Time : 1543414091398
        Progress : 100%
        State : FINISHED
        Final-State : SUCCEEDED
        Tracking-URL : http://mgmtserver.domain.com:8080/#/main/view/TEZ/tez_cluster_instance?viewPath=%2F%23%2Ftez-app%2Fapplication_1543251372086_1684
        RPC Port : 42387
        AM Host : worker03.domain.com
        Aggregate Resource Allocation : 67097515 MB-seconds, 16378 vcore-seconds
        Log Aggregation Status : SUCCEEDED
        Diagnostics : Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0

        Unmanaged Application : false
        Application Node Label Expression : 
        AM container Node Label Expression : 
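
For completeness, on more recent Hadoop releases (2.8 and later) the priority change I was after can be done with something like the sketch below (not available on our HDP-2.6.4.0 cluster):

[yarn@mgmtserver ~]$ yarn application -appId application_1543251372086_1684 -updatePriority 10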

References

The post YARN command line for low level management of applications appeared first on IT World.

]]>
https://blog.yannickjaquier.com/hadoop/yarn-command-line-manage-applications.html/feed 0