Hive Warehouse Connector integration in Zeppelin Notebook




Preamble

How do you configure Hive Warehouse Connector (HWC) integration in Zeppelin Notebook? Since we upgraded to Hortonworks Data Platform (HDP) 3.1.4 we have had to fight everywhere to integrate the new way of working with Hive in Spark through HWC. The connector itself is not very complex to use: it requires only a small modification of your Spark code (Scala or Python). What is complex is its integration in the existing tools, as well as how to now launch pyspark and spark-shell (Scala) so that HWC is available in your environment.

On top of this, as clearly stated in the official documentation, you need HWC and LLAP (Live Long And Process, or Low-Latency Analytical Processing) to read Hive managed tables from Spark, which is one of the first operations you will do in most of your Spark scripts. So activate and configure LLAP in Ambari (number of nodes, memory, concurrent queries):

[Image: hwc01]

HWC on the command line

Before jumping to Zeppelin, let's quickly see how you now execute Spark on the command line with HWC.

You should have already configured the Spark parameters below (as well as reverted the hive.fetch.task.conversion value back; see my other article on this):

  • spark.hadoop.hive.llap.daemon.service.hosts = @llap0
  • spark.hadoop.hive.zookeeper.quorum = zookeeper01.domain.com:2181,zookeeper03.domain.com:2181,zookeeper03.domain.com:2181
  • spark.sql.hive.hiveserver2.jdbc.url = jdbc:hive2://zookeeper01.domain.com:2181,zookeeper03.domain.com:2181,zookeeper03.domain.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive

To execute spark-shell you must now add the jar file below (in the example I am in YARN mode on the llap YARN queue):

spark-shell --master yarn --queue llap --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar

To test that it is working you can use the test code below:

val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(spark).build()
hive.showDatabases().show(100, false)
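Reading an actual Hive managed table goes through the same session object. As a minimal sketch, assuming the connector's executeQuery and table methods, and with purely hypothetical database and table names:

// Reuses the "hive" session built above; my_database.my_managed_table is hypothetical
val df = hive.executeQuery("select * from my_database.my_managed_table limit 10")
df.show(10, false)
// Or read a whole table after selecting the database:
// hive.setDatabase("my_database")
// hive.table("my_managed_table").show(10, false)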

To execute pyspark you must add the jar file below (again in YARN mode on the llap YARN queue) as well as the Python zip file:

pyspark --master yarn --queue llap --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip

To test that all is good you can use the script below:

from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()
hive.showDatabases().show(10, False)

To use spark-submit you do exactly the same thing:

spark-submit --master yarn --queue llap --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip script_to_execute

Zeppelin Spark2 interpreter configuration

To make the Spark2 Zeppelin interpreter work with HWC you need to add two parameters to its configuration:

  • spark.jars = /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar
  • spark.submit.pyFiles = /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip

Note:
Notice the uppercase F in spark.submit.pyFiles; I lost half a day because some blog articles on the internet misspell it.

Save and restart the interpreter; if you then test the code below in a new notebook you should get a real result (and not only the default database name):

[Image: hwc02]

Remark:
As you can see above the %spark2.sql option of the interpreter is not configured to use HWC. So far I have not found how to overcome this. This could be an issue because this interpreter option is a nice feature to make graphs from direct SQL commands…

It also works well in Scala:

[Image: hwc03]
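As the screenshots are not reproduced here, the Scala paragraph was along these lines (a sketch; the exact output obviously depends on your databases):

%spark2
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder
val hive = HiveWarehouseBuilder.session(spark).build()
hive.showDatabases().show(100, false)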

Zeppelin Livy2 interpreter configuration

You might wonder why to configure the Livy2 interpreter when the first one is working fine. Well, I had to configure it for the JupyterHub Sparkmagic setup that we will see in another blog post, so here it is…

To make the Livy2 Zeppelin interpreter work with HWC you need to add two parameters to its configuration:

  • livy.spark.jars = file:///usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar
  • livy.spark.submit.pyFiles = file:///usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip

Note:
The second parameter was just a good guess based on the Spark2 interpreter setting. Also note the file:// prefix to instruct the interpreter to look on the local file system and not on HDFS.

Also ensure the Livy URL is correctly set to the server (and port) of your Livy process:

  • zeppelin.livy.url = http://livyserver.domain.com:8999

I have also faced this issue:

org.apache.zeppelin.livy.LivyException: Error with 400 StatusCode: "requirement failed: Local path /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar cannot be added to user sessions."
	at org.apache.zeppelin.livy.BaseLivyInterpreter.callRestAPI(BaseLivyInterpreter.java:755)
	at org.apache.zeppelin.livy.BaseLivyInterpreter.createSession(BaseLivyInterpreter.java:337)
	at org.apache.zeppelin.livy.BaseLivyInterpreter.initLivySession(BaseLivyInterpreter.java:209)
	at org.apache.zeppelin.livy.LivySharedInterpreter.open(LivySharedInterpreter.java:59)
	at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
	at org.apache.zeppelin.livy.BaseLivyInterpreter.getLivySharedInterpreter(BaseLivyInterpreter.java:190)
	at org.apache.zeppelin.livy.BaseLivyInterpreter.open(BaseLivyInterpreter.java:163)
	at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:617)
	at org.apache.zeppelin.scheduler.Job.run(Job.java:188)
	at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:140)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

To solve it, add the configuration below to your Livy2 server (Custom livy2-conf in Ambari):

  • livy.file.local-dir-whitelist = /usr/hdp/current/hive_warehouse_connector/

Working in Scala:

[Image: hwc04]

In Python:

[Image: hwc05]

But, same as for the Spark2 interpreter, the %livy2.sql option is broken for the exact same reason:

[Image: hwc06]



Setup Spark and Intellij on Windows to access a remote Hadoop cluster




Preamble

Developing directly on a Hadoop cluster is not the development environment you would dream of. Of course it makes testing your code easy, as you can submit it instantly, but in the worst case your editor will be vi. Obviously what new generation developers want is a clever editor running on their Windows (or Linux?) desktop. By clever editor I mean one with syntax completion, one that suggests packages to import when you use a procedure, that flags unused variables, and so on… Small teaser: Intellij IDEA from JetBrains is a serious contender nowadays, apparently much better than Eclipse, which is more or less dying…

We did, at last, improve the editor part a bit by installing VSCode and editing our files with the SSH FS plugin. In other words we edit, through SSH, files located directly on the server, and then, using a shell terminal (MobaXterm, Putty, …), we are able to submit scripts directly onto our Hadoop cluster. This works if you do Python (even if a PySpark plugin is not available in VSCode), but if you do Scala you also have to run an sbt command, each time, to compile your Scala program. A bit cumbersome…

As said above, Intellij IDEA from JetBrains is a very good Scala language editor and the community edition does a pretty decent job. But then difficulties arise, as this software lacks remote editing over SSH… So it works well for pure Scala scripts, but if you need to access your Hive tables you have to work a bit to configure it…

This blog post is all about this: building a productive Scala/PySpark development environment on your Windows desktop that can access the Hive tables of a Hadoop cluster.

Spark installation and configuration

Before going into the small details I first tried to make a raw Spark installation work on my Windows machine. I started by downloading it from the official web site:

[Image: spark_installation01]

And unzipped it in the default D:\spark-2.4.4-bin-hadoop2.7 directory. You must set the SPARK_HOME environment variable to the directory where you unzipped Spark. For convenience you also need to add D:\spark-2.4.4-bin-hadoop2.7\bin to the path of your Windows account (restart PowerShell afterwards) and confirm it is all good with:

$env:path

Then issue spark-shell in a PowerShell session; you should get an error like:

19/11/15 17:15:37 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
        at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
        at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
        at org.apache.hadoop.util.Shell.(Shell.java:387)
        at org.apache.hadoop.util.StringUtils.(StringUtils.java:80)
        at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
        at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
        at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
        at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
        at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
        at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
        at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
        at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2422)
        at org.apache.spark.SecurityManager.(SecurityManager.scala:79)
        at org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:348)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$secMgr$1(SparkSubmit.scala:348)
        at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
        at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
        at scala.Option.map(Option.scala:146)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:355)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Take winutils.exe, the Windows binary for your Hadoop version, from https://github.com/steveloughran/winutils and put it in the D:\spark-2.4.4-bin-hadoop2.7\bin folder.

I downloaded the one for Hadoop 2.7.1 (matching my Hadoop cluster). You must also set the HADOOP_HOME environment variable to D:\spark-2.4.4-bin-hadoop2.7. Then the spark-shell command should execute with no error. You can run the suggested testing right now:

PS D:\> spark-shell
19/11/15 17:29:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://client_machine.domain.com:4040
Spark context available as 'sc' (master = local[*], app id = local-1573835367367).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark
res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@28532753

scala> sc
res1: org.apache.spark.SparkContext = org.apache.spark.SparkContext@44846c76
scala> val textFile = spark.read.textFile("D:/spark-2.4.4-bin-hadoop2.7/README.md")
textFile: org.apache.spark.sql.Dataset[String] = [value: string]

scala> textFile.count()
res2: Long = 105

scala> textFile.filter(line => line.contains("Spark")).count()
res3: Long = 20

And that's it: you have a local working Spark environment with all the Spark clients: Spark shell, PySpark, Spark SQL, SparkR… I had to install Python 3.7.5 as it was not working with Python 3.8.0; this might be corrected soon:

PS D:\> pyspark
Python 3.7.5 (tags/v3.7.5:5c02a39a0b, Oct 15 2019, 00:11:34) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
19/11/15 17:28:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Python version 3.7.5 (tags/v3.7.5:5c02a39a0b, Oct 15 2019 00:11:34)
SparkSession available as 'spark'.
>>> textFile = spark.read.text("D:/spark-2.4.4-bin-hadoop2.7/README.md")
>>> textFile.count()
105

To access Hive running on my cluster I took the hive-site.xml file from the /etc/spark2/conf directory of one of my edge nodes where Spark is installed and put it in the D:\spark-2.4.4-bin-hadoop2.7\conf folder. And it worked pretty well:

scala> val databases=spark.sql("show databases")
databases: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> databases.show(10,false)
+------------------+
|databaseName      |
+------------------+
|activity          |
|admin             |
|alexandre_c300    |
|audit             |
|big_ews_processing|
|big_ews_raw       |
|big_ews_refined   |
|big_fdc_processing|
|big_fdc_raw       |
|big_fdc_refined   |
+------------------+
only showing top 10 rows

But it failed for a more complex example:

scala> val df01=spark.sql("select * from admin.tbl_concatenate")
df01: org.apache.spark.sql.DataFrame = [id_process: string, date_job_start: string ... 9 more fields]

scala> df01.printSchema()
root
 |-- id_process: string (nullable = true)
 |-- date_job_start: string (nullable = true)
 |-- date_job_stop: string (nullable = true)
 |-- action: string (nullable = true)
 |-- database_name: string (nullable = true)
 |-- table_name: string (nullable = true)
 |-- partition_name: string (nullable = true)
 |-- number_files_before: string (nullable = true)
 |-- partition_size_before: string (nullable = true)
 |-- number_files_after: string (nullable = true)
 |-- partition_size_after: string (nullable = true)


scala> df01.show(10,false)
java.lang.IllegalArgumentException: java.net.UnknownHostException: DataLakeHdfs
  at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
  at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
  at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
  at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:678)
  at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:619)

To solve this I had to copy hdfs-site.xml from the /etc/hadoop/conf directory of one of my edge nodes where the HDFS client is installed, and it worked much better:

scala> val df01=spark.sql("select * from admin.tbl_concatenate")
df01: org.apache.spark.sql.DataFrame = [id_process: string, date_job_start: string ... 9 more fields]

scala> df01.show(10,false)
19/11/15 17:45:01 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because UNIX Domain sockets are not available on Windows.
+----------+--------------+--------------+----------+----------------+------------------+----------------------------+-------------------+---------------------+------------------+--------------------+
|id_process|date_job_start|date_job_stop |action    |database_name   |table_name        |partition_name              |number_files_before|partition_size_before|number_files_after|partition_size_after|
+----------+--------------+--------------+----------+----------------+------------------+----------------------------+-------------------+---------------------+------------------+--------------------+
|10468     |20190830103323|20190830103439|compaction|prod_ews_refined|tbl_bin_param_stat|fab=C2WF/lot_partition=Q1111|3                  |1606889              |3                 |1606889             |
|10468     |20190830103859|20190830104224|compaction|prod_ews_refined|tbl_bin_param_stat|fab=C2WF/lot_partition=Q7180|7                  |37136990             |2                 |37130614            |
|10468     |20190830104251|20190830104401|compaction|prod_ews_refined|tbl_bin_param_stat|fab=C2WF/lot_partition=Q7210|4                  |22095872             |3                 |22094910            |
|10468     |20190830104435|20190830104624|compaction|prod_ews_refined|tbl_bin_param_stat|fab=C2WF/lot_partition=Q7250|4                  |73357246             |2                 |73352589            |
|10468     |20190830104659|20190830104759|compaction|prod_ews_refined|tbl_bin_param_stat|fab=C2WF/lot_partition=Q7280|1                  |1696312              |1                 |1696312             |
|10468     |20190830104845|20190830104952|compaction|prod_ews_refined|tbl_bin_param_stat|fab=C2WF/lot_partition=Q7350|3                  |3439184              |3                 |3439184             |
|10468     |20190830105023|20190830105500|compaction|prod_ews_refined|tbl_bin_param_stat|fab=C2WF/lot_partition=Q7371|6                  |62283893             |2                 |62274587            |
|10468     |20190830105532|20190830105718|compaction|prod_ews_refined|tbl_bin_param_stat|fab=C2WF/lot_partition=Q7382|5                  |25501396             |3                 |25497826            |
|10468     |20190830110118|20190830110316|compaction|prod_ews_refined|tbl_bin_param_stat|fab=C2WF/lot_partition=Q8030|3                  |74338924             |3                 |74338924            |
|10468     |20190830110413|20190830110520|compaction|prod_ews_refined|tbl_bin_param_stat|fab=C2WF/lot_partition=Q8039|3                  |5336123              |2                 |5335855             |
+----------+--------------+--------------+----------+----------------+------------------+----------------------------+-------------------+---------------------+------------------+--------------------+
only showing top 10 rows

And that's it: you have built a working local Spark environment able to access the data of your Hadoop cluster. Of course the Spark power is limited to your client machine and you cannot use the full power of your cluster, but to develop scripts this is more than enough…

Sbt installation and configuration

If you write Scala scripts, the next step to submit them with spark-submit and leave interactive mode is to compile them and produce a jar file usable by spark-submit. I have to say this is a drawback versus developing in Python, where your PySpark script is directly usable as-is. Speed is apparently not a criterion either, as PySpark has the same performance as Scala. The only reason to do Scala, except being hype, is that its functionalities are more advanced than Python's…

The installation is as simple as downloading it from the sbt official web site. I chose the zip file and uncompressed it in the d:\sbt directory. For convenience add d:\sbt\bin to your environment path. I configured my corporate proxy in the sbtconfig.txt file located in d:\sbt\conf as follows:

-Dhttp.proxyHost=proxy.domain.com
-Dhttp.proxyPort=8080
-Dhttp.proxyUser=account
-Dhttp.proxyPassword=password
Then, following the sbt getting started documentation, create a test project:

PS D:\> mkdir foo-build


    Directory: D:\


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-----       18/11/2019     12:07                foo-build


PS D:\> cd .\foo-build\
PS D:\foo-build> ni build.sbt


    Directory: D:\foo-build


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----       18/11/2019     12:08              0 build.sbt
PS D:\foo-build> sbt

[info] Updated file D:\foo-build\project\build.properties: set sbt.version to 1.3.3
[info] Loading project definition from D:\foo-build\project
nov. 18, 2019 12:08:39 PM lmcoursier.internal.shaded.coursier.cache.shaded.org.jline.utils.Log logr
WARNING: Unable to create a system terminal, creating a dumb terminal (enable debug logging for more information)
[info] Updating
  | => foo-build-build / update 0s
[info] Resolved  dependencies
[warn]
[warn]  Note: Unresolved dependencies path:
[error] sbt.librarymanagement.ResolveException: Error downloading org.scala-sbt:sbt:1.3.3
[error]   Not found
[error]   Not found
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo1.maven.org/maven2/org/scala-sbt/sbt/1.3.3/sbt-1.3.3.pom
[error]   not found: C:\Users\yjaquier\.ivy2\local\org.scala-sbt\sbt\1.3.3\ivys\ivy.xml
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/org.scala-sbt/sbt/1.3.3/ivys/ivy.xml
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt/1.3.3/ivys/ivy.xml
[error] Error downloading org.scala-lang:scala-library:2.12.10
[error]   Not found
[error]   Not found
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo1.maven.org/maven2/org/scala-lang/scala-library/2.12.10/scala-library-2.12.10.pom
[error]   not found: C:\Users\yjaquier\.ivy2\local\org.scala-lang\scala-library\2.12.10\ivys\ivy.xml
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/org.scala-lang/scala-library/2.12.10/ivys/ivy.xml
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo.typesafe.com/typesafe/ivy-releases/org.scala-lang/scala-library/2.12.10/ivys/ivy.xml
[error]         at lmcoursier.CoursierDependencyResolution.unresolvedWarningOrThrow(CoursierDependencyResolution.scala:245)
[error]         at lmcoursier.CoursierDependencyResolution.$anonfun$update$34(CoursierDependencyResolution.scala:214)
[error]         at scala.util.Either$LeftProjection.map(Either.scala:573)
[error]         at lmcoursier.CoursierDependencyResolution.update(CoursierDependencyResolution.scala:214)
[error]         at sbt.librarymanagement.DependencyResolution.update(DependencyResolution.scala:60)
[error]         at sbt.internal.LibraryManagement$.resolve$1(LibraryManagement.scala:52)
[error]         at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$12(LibraryManagement.scala:102)
[error]         at sbt.util.Tracked$.$anonfun$lastOutput$1(Tracked.scala:69)
[error]         at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$20(LibraryManagement.scala:115)
[error]         at scala.util.control.Exception$Catch.apply(Exception.scala:228)
[error]         at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$11(LibraryManagement.scala:115)
[error]         at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$11$adapted(LibraryManagement.scala:96)
[error]         at sbt.util.Tracked$.$anonfun$inputChanged$1(Tracked.scala:150)
[error]         at sbt.internal.LibraryManagement$.cachedUpdate(LibraryManagement.scala:129)
[error]         at sbt.Classpaths$.$anonfun$updateTask0$5(Defaults.scala:2946)
[error]         at scala.Function1.$anonfun$compose$1(Function1.scala:49)
[error]         at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:62)
[error]         at sbt.std.Transform$$anon$4.work(Transform.scala:67)
[error]         at sbt.Execute.$anonfun$submit$2(Execute.scala:281)
[error]         at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:19)
[error]         at sbt.Execute.work(Execute.scala:290)
[error]         at sbt.Execute.$anonfun$submit$1(Execute.scala:281)
[error]         at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:178)
[error]         at sbt.CompletionService$$anon$2.call(CompletionService.scala:37)
[error]         at java.util.concurrent.FutureTask.run(Unknown Source)
[error]         at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
[error]         at java.util.concurrent.FutureTask.run(Unknown Source)
[error]         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
[error]         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
[error]         at java.lang.Thread.run(Unknown Source)
[error] (update) sbt.librarymanagement.ResolveException: Error downloading org.scala-sbt:sbt:1.3.3
[error]   Not found
[error]   Not found
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo1.maven.org/maven2/org/scala-sbt/sbt/1.3.3/sbt-1.3.3.pom
[error]   not found: C:\Users\yjaquier\.ivy2\local\org.scala-sbt\sbt\1.3.3\ivys\ivy.xml
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/org.scala-sbt/sbt/1.3.3/ivys/ivy.xml
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt/1.3.3/ivys/ivy.xml
[error] Error downloading org.scala-lang:scala-library:2.12.10
[error]   Not found
[error]   Not found
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo1.maven.org/maven2/org/scala-lang/scala-library/2.12.10/scala-library-2.12.10.pom
[error]   not found: C:\Users\yjaquier\.ivy2\local\org.scala-lang\scala-library\2.12.10\ivys\ivy.xml
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/org.scala-lang/scala-library/2.12.10/ivys/ivy.xml
[error]   download error: Caught java.io.IOException: Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required" (Unable to tunnel through proxy. Proxy returns "HTTP/1.1 407 Proxy Authentication Required") while downloading https://repo.typesafe.com/typesafe/ivy-releases/org.scala-lang/scala-library/2.12.10/ivys/ivy.xml
Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore? q

I had to change the D:\foo-build\project\build.properties file as follows, as apparently there is an issue with the sbt 1.3.3 repository:

sbt.version=1.2.8

This time it went well:

PS D:\foo-build> sbt

[info] Loading project definition from D:\foo-build\project
[info] Updating ProjectRef(uri("file:/D:/foo-build/project/"), "foo-build-build")...
[info] Done updating.
[info] Loading settings for project foo-build from build.sbt ...
[info] Set current project to foo-build (in build file:/D:/foo-build/)
[info] sbt server started at local:sbt-server-2b55615bd8de0c5ac354
sbt:foo-build>

As suggested in the sbt official documentation I issued:

sbt:foo-build> ~compile
[info] Updating ...
[info] Done updating.
[info] Compiling 1 Scala source to D:\foo-build\target\scala-2.12\classes ...
[info] Done compiling.
[success] Total time: 1 s, completed 18 nov. 2019 12:58:30
1. Waiting for source changes in project foo-build... (press enter to interrupt)

Then create the Hello.scala file where suggested.
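The file, taken more or less verbatim from the sbt getting started guide, looks like this (a sketch, placed in src/main/scala/example/Hello.scala):

// Minimal sbt "getting started" example: an App that just prints Hello
package example

object Hello extends App {
  println("Hello")
}

As soon as the file is saved it is instantly compiled: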

[info] Compiling 1 Scala source to D:\foo-build\target\scala-2.12\classes ...
[info] Done compiling.
[success] Total time: 1 s, completed 18 nov. 2019 12:58:51
2. Waiting for source changes in project foo-build... (press enter to interrupt)

You can run it using:

sbt:foo-build> run
[info] Running example.Hello
Hello
[success] Total time: 1 s, completed 18 nov. 2019 13:06:23

Create a jar file that can be submitted via spark-submit with:

sbt:foo-build> package
[info] Packaging D:\foo-build\target\scala-2.12\foo-build_2.12-0.1.0-SNAPSHOT.jar ...
[info] Done packaging.
[success] Total time: 0 s, completed 18 nov. 2019 16:15:06

Intellij IDEA from JetBrains

A Java JDK is a prerequisite; I personally installed release 1.8.0_231.

Once you have installed Intellij IDEA (the community edition does the job for basic things), the first thing to do is install the Scala plugin (you will have to configure your corporate proxy):

[Image: spark_installation02]

Then create your first Scala project with Create New Project:

[Image: spark_installation03]

I chose sbt 1.2.8 and Scala 2.12.10 as I had many issues with the latest versions:

[Image: spark_installation04]

In the src/main/scala folder create a new Scala class (mouse left button click, then New/Scala Class) and then select Object to create, for example, a file called scala_testing.scala. As an example:

[Image: spark_installation05]

Insert, for example, the Scala code below in the Scala object you have just created:

import org.apache.spark.sql.SparkSession

object scala_testing {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("spark_scala_yannick")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()

    spark.sparkContext.setLogLevel("WARN")

    val databases=spark.sql("show databases")

    databases.show(100,false)
  }
}

In the build.sbt file add the following dependencies to end up with a file looking like:

name := "scala_testing"

version := "0.1"

scalaVersion := "2.12.10"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.4"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.4.4"
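A side note, not strictly part of this setup: if you later package such a script for spark-submit (as in the sbt chapter above), a common variant is to flag the Spark artifacts as provided so that the jar does not embed Spark itself; for local runs from the IDE keep them on the compile classpath as shown above. A sketch:

// build.sbt variant for cluster submission only (sketch): Spark is supplied
// by the cluster at runtime, so its dependencies are marked "provided"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.4" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.4.4" % "provided"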

Sbt proxy settings are not inherited from the Intellij default proxy settings (the feature has been requested), so you have to add the VM parameters below:

[Image: spark_installation06]

Remark:
-Dsbt.repository.secure=false is surely not the best idea in the world, but I had issues configuring my proxy server in https. -Dsbt.gigahorse=false is there to overcome a concurrency issue while downloading multiple packages at the same time… It might not be required for you…

Once the build process is completed (dependencies downloaded), if you execute the script you will end up with:

+------------+
|databaseName|
+------------+
|default     |
+------------+

We are still lacking the Hive configuration. To add it, go to File/Project Structure and add a new Java library:

[Image: spark_installation07]

Browse to the Spark conf directory we configured in the Spark installation and configuration chapter:

[Image: spark_installation08]

Choose Class as category:

[Image: spark_installation09]

I have finally selected all modules (not fully sure it’s required):

[Image: spark_installation10]

If you now re-execute the Scala script you should get the complete list of Hive databases of your Hadoop cluster:

+--------------------------+
|databaseName              |
+--------------------------+
|activity                  |
|admin                     |
|audit                     |
|big_processing            |
|big_raw                   |
|big_refined               |
|big_processing            |
.
.

Conclusion

I also tried to submit directly from Intellij in YARN mode (so onto the Hadoop cluster), but even though I was able to see my application in the YARN Resource Manager, I never succeeded in making it run normally. The YARN scheduler takes my request but never satisfies it…
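For reference, the kind of change attempted looks roughly like the sketch below; this is not a working recipe, the queue name is hypothetical, and the cluster configuration files (copied locally as in the Spark installation chapter) must be reachable through HADOOP_CONF_DIR / YARN_CONF_DIR:

import org.apache.spark.sql.SparkSession

// Sketch only: same builder as before but targeting YARN in client mode;
// "default" is a hypothetical queue name, adapt it to your cluster
val spark = SparkSession.builder.appName("spark_scala_yannick")
  .master("yarn")
  .config("spark.submit.deployMode", "client")
  .config("spark.yarn.queue", "default")
  .enableHiveSupport()
  .getOrCreate()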

Any inputs are welcome…



On the importance to have good Hive statistics on your tables




Preamble

Hive statistics: if, like me, you come from the RDBMS world, I will not offend you by re-explaining the importance of good statistics (table, columns, …) to help the optimizer choose the best approach to execute your SQL statements. In this post you will see a concrete example of why it is also super important in a Hadoop cluster!

We have the (bad) habit of loading data incrementally into our Hive partitioned tables without gathering statistics each time. In short, this is clearly not a best practice!

Then, to populate a more refined table, we join two tables, and the initial question was: do we need to put all predicates in the JOIN ON clause, or can they be added in a traditional WHERE clause with the Hive optimizer being clever enough to do what is usually called predicate pushdown?

We have the Hortonworks Hadoop edition called HDP 2.6.4, so we are running Hive 1.2.1000.

The problematic queries

We were trying to load a third table from two sources with (query01):

select dpf.fab,dpf.start_t,dpf.finish_t,dpf.lot_id,dpf.wafer_id,dpf.flow_id,dpf.die_x,dpf.die_y,dpf.part_seq,dpf.parameters,dpf.ingestion_date,dpf.lot_partition
from prod_ews_refined.tbl_die_param_flow dpf join prod_ews_refined.tbl_die_bin db
on dpf.fab="C2WF"
and dpf.lot_partition="Q926"
and dpf.die_partition=9
and db.pass_fail="F"
and db.fab="C2WF"
and db.lot_partition="Q926"
and db.die_partition=9
and dpf.fab=db.fab
and dpf.lot_partition=db.lot_partition
and dpf.start_t=db.start_t
and dpf.finish_t= db.finish_t
and dpf.lot_id=db.lot_id
and dpf.wafer_id=db.wafer_id
and dpf.flow_id=db.flow_id
and dpf.die_x=db.die_x
and dpf.die_y=db.die_y
and dpf.part_seq=db.part_seq
and dpf.ingestion_date<"201911291611"
and db.ingestion_date<"201911291611";

And comparing it with its sister query, which I would say is written more logically but in fact does not follow best practices (query02):

select dpf.fab,dpf.start_t,dpf.finish_t,dpf.lot_id,dpf.wafer_id,dpf.flow_id,dpf.die_x,dpf.die_y,dpf.part_seq,dpf.parameters,dpf.ingestion_date,dpf.lot_partition
from prod_ews_refined.tbl_die_param_flow dpf join prod_ews_refined.tbl_die_bin db
on dpf.fab=db.fab
and dpf.lot_partition=db.lot_partition
and dpf.die_partition=db.die_partition
and dpf.start_t=db.start_t
and dpf.finish_t= db.finish_t
and dpf.lot_id=db.lot_id
and dpf.wafer_id=db.wafer_id
and dpf.flow_id=db.flow_id
and dpf.die_x=db.die_x
and dpf.die_y=db.die_y
and dpf.part_seq=db.part_seq
where dpf.fab="C2WF"
and dpf.lot_partition="Q926"
and dpf.die_partition="9"
and db.fab="C2WF"
and db.lot_partition="Q926"
and db.pass_fail="F"
and db.die_partition="9"
and dpf.ingestion_date<"201911291611"
and db.ingestion_date<"201911291611";

The explain plan of query01 is (use the EXPLAIN Hive command to generate it in Beeline):

Explain
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan not optimized by CBO.

Vertex dependency in root stage
Map 2 <- Map 1 (BROADCAST_EDGE)

Stage-0
    Fetch Operator
      limit:-1
      Stage-1
          Map 2
          File Output Operator [FS_3747670]
            compressed:false
            Statistics:Num rows: 10066 Data size: 541406213 Basic stats: COMPLETE Column stats: NONE
            table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
            Select Operator [SEL_3747669]
                outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11"]
                Statistics:Num rows: 10066 Data size: 541406213 Basic stats: COMPLETE Column stats: NONE
                Map Join Operator [MAPJOIN_3747675]
                |  condition map:[{"":"Inner Join 0 to 1"}]
                |  HybridGraceHashJoin:true
                |  keys:{"Map 2":"'C2WF' (type: string), 'Q926' (type: string), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)","Map 1":"'C2WF' (type: string), 'Q926' (type: string), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)"}
                |  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9"]
                |  Statistics:Num rows: 10066 Data size: 541406213 Basic stats: COMPLETE Column stats: NONE
                |<-Map 1 [BROADCAST_EDGE]
                |  Reduce Output Operator [RS_3747666]
                |     key expressions:'C2WF' (type: string), 'Q926' (type: string), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)
                |     Map-reduce partition columns:'C2WF' (type: string), 'Q926' (type: string), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)
                |     sort order:++++++++++
                |     Statistics:Num rows: 9151 Data size: 492187456 Basic stats: COMPLETE Column stats: NONE
                |     value expressions:parameters (type: array), ingestion_date (type: string)
                |     Filter Operator [FIL_3747673]
                |        predicate:((ingestion_date < '201911291611') and start_t is not null and finish_t is not null and lot_id is not null and wafer_id is not null and flow_id is not null and die_x is not null and die_y is not null and part_seq is not null) (type: boolean)
                |        Statistics:Num rows: 9151 Data size: 492187456 Basic stats: COMPLETE Column stats: NONE
                |        TableScan [TS_3747655]
                |           alias:dpf
                |           Statistics:Num rows: 7027779 Data size: 377989801184 Basic stats: COMPLETE Column stats: NONE
                |<-Filter Operator [FIL_3747674]
                      predicate:((pass_fail = 'F') and (ingestion_date < '201911291611') and start_t is not null and finish_t is not null and lot_id is not null and wafer_id is not null and flow_id is not null and die_x is not null and die_y is not null and part_seq is not null) (type: boolean)
                      Statistics:Num rows: 4584 Data size: 5321735 Basic stats: COMPLETE Column stats: NONE
                      TableScan [TS_3747656]
                        alias:db
                        Statistics:Num rows: 7040091 Data size: 8173102308 Basic stats: COMPLETE Column stats: NONE


43 rows selected.    

Notice the "Column stats: NONE" written everywhere in the explain plan...

Or graphically, using the Hive View in Ambari. To do so, copy/paste the query without the EXPLAIN keyword and push the Visual Explain button. It will generate the plan without executing the query:

[Image: hive_statistics01]

Which gives:

[Image: hive_statistics02]

For query02 the explain plan is a bit longer, but we know by experience that this has nothing to do with execution time:

Explain
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan not optimized by CBO.

Vertex dependency in root stage
Map 2 <- Map 1 (BROADCAST_EDGE)

Stage-0
    Fetch Operator
      limit:-1
      Stage-1
          Map 2
          File Output Operator [FS_3747687]
            compressed:false
            Statistics:Num rows: 20969516 Data size: 3690634816 Basic stats: COMPLETE Column stats: PARTIAL
            table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
            Select Operator [SEL_3747686]
                outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11"]
                Statistics:Num rows: 20969516 Data size: 3690634816 Basic stats: COMPLETE Column stats: PARTIAL
                Map Join Operator [MAPJOIN_3747710]
                |  condition map:[{"":"Inner Join 0 to 1"}]
                |  HybridGraceHashJoin:true
                |  keys:{"Map 2":"'C2WF' (type: string), 'Q926' (type: string), 9 (type: int), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)","Map 1":"'C2WF' (type: string), 'Q926' (type: string), 9 (type: int), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)"}
                |  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9"]
                |  Statistics:Num rows: 20969516 Data size: 53094814512 Basic stats: COMPLETE Column stats: PARTIAL
                |<-Map 1 [BROADCAST_EDGE]
                |  Reduce Output Operator [RS_3747681]
                |     key expressions:'C2WF' (type: string), 'Q926' (type: string), 9 (type: int), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)
                |     Map-reduce partition columns:'C2WF' (type: string), 'Q926' (type: string), 9 (type: int), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)
                |     sort order:+++++++++++
                |     Statistics:Num rows: 9151 Data size: 1647180 Basic stats: COMPLETE Column stats: PARTIAL
                |     value expressions:parameters (type: array), ingestion_date (type: string)
                |     Filter Operator [FIL_3747690]
                |        predicate:(start_t is not null and finish_t is not null and lot_id is not null and wafer_id is not null and flow_id is not null and die_x is not null and die_y is not null and part_seq is not null and (ingestion_date < '201911291611')) (type: boolean)
                |        Statistics:Num rows: 9151 Data size: 1647180 Basic stats: COMPLETE Column stats: PARTIAL
                |        TableScan [TS_3747678]
                |           alias:dpf
                |           Statistics:Num rows: 7027779 Data size: 377989801184 Basic stats: COMPLETE Column stats: PARTIAL
                |  Dynamic Partitioning Event Operator [EVENT_3747703]
                |     Statistics:Num rows: 1477 Data size: 265860 Basic stats: COMPLETE Column stats: PARTIAL
                |     Group By Operator [GBY_3747702]
                |        keys:_col0 (type: string)
                |        outputColumnNames:["_col0"]
                |        Statistics:Num rows: 1477 Data size: 265860 Basic stats: COMPLETE Column stats: PARTIAL
                |        Select Operator [SEL_3747701]
                |           outputColumnNames:["_col0"]
                |           Statistics:Num rows: 9151 Data size: 1647180 Basic stats: COMPLETE Column stats: PARTIAL
                |            Please refer to the previous Filter Operator [FIL_3747690]
                |  Dynamic Partitioning Event Operator [EVENT_3747706]
                |     Statistics:Num rows: 1477 Data size: 265860 Basic stats: COMPLETE Column stats: PARTIAL
                |     Group By Operator [GBY_3747705]
                |        keys:_col0 (type: string)
                |        outputColumnNames:["_col0"]
                |        Statistics:Num rows: 1477 Data size: 265860 Basic stats: COMPLETE Column stats: PARTIAL
                |        Select Operator [SEL_3747704]
                |           outputColumnNames:["_col0"]
                |           Statistics:Num rows: 9151 Data size: 1647180 Basic stats: COMPLETE Column stats: PARTIAL
                |            Please refer to the previous Filter Operator [FIL_3747690]
                |  Dynamic Partitioning Event Operator [EVENT_3747709]
                |     Statistics:Num rows: 1477 Data size: 265860 Basic stats: COMPLETE Column stats: PARTIAL
                |     Group By Operator [GBY_3747708]
                |        keys:_col0 (type: int)
                |        outputColumnNames:["_col0"]
                |        Statistics:Num rows: 1477 Data size: 265860 Basic stats: COMPLETE Column stats: PARTIAL
                |        Select Operator [SEL_3747707]
                |           outputColumnNames:["_col0"]
                |           Statistics:Num rows: 9151 Data size: 1647180 Basic stats: COMPLETE Column stats: PARTIAL
                |            Please refer to the previous Filter Operator [FIL_3747690]
                |<-Filter Operator [FIL_3747691]
                      predicate:(start_t is not null and finish_t is not null and lot_id is not null and wafer_id is not null and flow_id is not null and die_x is not null and die_y is not null and part_seq is not null and (pass_fail = 'F') and (ingestion_date < '201911291611')) (type: boolean)
                      Statistics:Num rows: 4583 Data size: 824940 Basic stats: COMPLETE Column stats: PARTIAL
                      TableScan [TS_3747679]
                        alias:db
                        Statistics:Num rows: 7040091 Data size: 8173102308 Basic stats: COMPLETE Column stats: PARTIAL


73 rows selected.    

Or graphically:

[Image: hive_statistics03]

Even if we would expect the Hive optimizer to be clever and apply predicate pushdown, the explain plans are different and the one with the WHERE clause looks less efficient on paper.

Overall it is NOT AT ALL a best practice to put a separate WHERE clause alongside a JOIN operation, because the filter will be applied after the join and as such increases the volume of rows to be joined! It will even produce wrong results in the particular case of an OUTER JOIN, as clearly explained in the official documentation:

[Image: hive_statistics04]

I tried to execute the two queries replacing the column list by a COUNT(*) to avoid the network penalty and, to be honest, I have not seen any big difference. Of course the execution time is linked to the load on the YARN queue, and as the cluster is already widely used I was not alone on it:

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> select count(*)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> from prod_ews_refined.tbl_die_param_flow dpf join prod_ews_refined.tbl_die_bin db  on
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> dpf.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.pass_fail="F"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.fab=db.fab
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition=db.lot_partition
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.start_t=db.start_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.finish_t= db.finish_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_id=db.lot_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.wafer_id=db.wafer_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.flow_id=db.flow_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_x=db.die_x
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_y=db.die_y
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.part_seq=db.part_seq
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.ingestion_date<"201911291611"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.ingestion_date<"201911291611";
INFO  : Tez session hasn't been created yet. Opening session
INFO  : Dag name: select count(*)
from ..._date<"201911291611"(Stage-1)
INFO  : Setting tez.task.scale.memory.reserve-fraction to 0.30000001192092896
INFO  : Status: Running (Executing on YARN cluster with App id application_1574697549080_17084)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      6          6        0        0       0       0
Map 2 ..........   SUCCEEDED      6          6        0        0       0       0
Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 726.66 s
--------------------------------------------------------------------------------
+---------+--+
|   _c0   |
+---------+--+
| 322905  |
+---------+--+
1 row selected (735.944 seconds)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> select count(*)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> from prod_ews_refined.tbl_die_param_flow dpf join prod_ews_refined.tbl_die_bin db
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> on dpf.fab=db.fab
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition=db.lot_partition
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_partition = db.die_partition
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.start_t=db.start_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.finish_t= db.finish_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_id=db.lot_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.wafer_id=db.wafer_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.flow_id=db.flow_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_x=db.die_x
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_y=db.die_y
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.part_seq=db.part_seq
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> where dpf.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.pass_fail="F"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.ingestion_date<"201911291611"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.ingestion_date<"201911291611";
INFO  : Session is already open
INFO  : Dag name: select count(*)
from ..._date<"201911291611"(Stage-1)
INFO  : Setting tez.task.scale.memory.reserve-fraction to 0.30000001192092896
INFO  : Status: Running (Executing on YARN cluster with App id application_1574697549080_17084)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      6          6        0        0       0       0
Map 2 ..........   SUCCEEDED      6          6        0        0       0       0
Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 354.95 s
--------------------------------------------------------------------------------
+---------+--+
|   _c0   |
+---------+--+
| 322905  |
+---------+--+
1 row selected (355.482 seconds)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> select count(*)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> from prod_ews_refined.tbl_die_param_flow dpf join prod_ews_refined.tbl_die_bin db  on
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> dpf.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.pass_fail="F"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.fab=db.fab
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition=db.lot_partition
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.start_t=db.start_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.finish_t= db.finish_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_id=db.lot_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.wafer_id=db.wafer_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.flow_id=db.flow_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_x=db.die_x
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_y=db.die_y
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.part_seq=db.part_seq
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.ingestion_date<"201911291611"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.ingestion_date<"201911291611";
INFO  : Session is already open
INFO  : Dag name: select count(*)
from ..._date<"201911291611"(Stage-1)
INFO  : Setting tez.task.scale.memory.reserve-fraction to 0.30000001192092896
INFO  : Tez session was closed. Reopening...
INFO  : Session re-established.
INFO  : Status: Running (Executing on YARN cluster with App id application_1574697549080_17169)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      6          6        0        0       0       0
Map 2 ..........   SUCCEEDED      6          6        0        0       0       0
Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 1108.09 s
--------------------------------------------------------------------------------
+---------+--+
|   _c0   |
+---------+--+
| 322905  |
+---------+--+
1 row selected (1135.466 seconds)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> select count(*)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> from prod_ews_refined.tbl_die_param_flow dpf join prod_ews_refined.tbl_die_bin db
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> on dpf.fab=db.fab
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition=db.lot_partition
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_partition = db.die_partition
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.start_t=db.start_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.finish_t= db.finish_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_id=db.lot_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.wafer_id=db.wafer_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.flow_id=db.flow_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_x=db.die_x
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_y=db.die_y
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.part_seq=db.part_seq
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> where dpf.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.pass_fail="F"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.ingestion_date<"201911291611"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.ingestion_date<"201911291611";
INFO  : Session is already open
INFO  : Dag name: select count(*)
from ..._date<"201911291611"(Stage-1)
INFO  : Setting tez.task.scale.memory.reserve-fraction to 0.30000001192092896
INFO  : Status: Running (Executing on YARN cluster with App id application_1574697549080_17169)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      6          6        0        0       0       0
Map 2 ..........   SUCCEEDED      6          6        0        0       1       0
Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 1273.38 s
--------------------------------------------------------------------------------
+---------+--+
|   _c0   |
+---------+--+
| 322905  |
+---------+--+
1 row selected (1273.899 seconds)

The problem is that for other partitions of the source tables the query was not running at all and failed with the classical Out Of Memory (OOM) error, leaving us with empty partitions in the final tables...

The problem went away with good Hive statistics

We noticed the Column stats: PARTIAL and Column stats: NONE mentions in the explain plans, so I decided to check column statistics with a command like:

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> describe formatted prod_ews_refined.tbl_die_param_flow ingestion_date partition(fab='C2WF',lot_partition='Q926',die_partition='9');
+-------------------------+-----------------------+-----------------------+-------+------------+-----------------+--------------+--------------+------------+-------------+----------+--+
|        col_name         |       data_type       |          min          |  max  | num_nulls  | distinct_count  | avg_col_len  | max_col_len  | num_trues  | num_falses  | comment  |
+-------------------------+-----------------------+-----------------------+-------+------------+-----------------+--------------+--------------+------------+-------------+----------+--+
| # col_name              | data_type             | comment               |       | NULL       | NULL            | NULL         | NULL         | NULL       | NULL        | NULL     |
|                         | NULL                  | NULL                  | NULL  | NULL       | NULL            | NULL         | NULL         | NULL       | NULL        | NULL     |
| ingestion_date          | string                | from deserializer     | NULL  | NULL       | NULL            | NULL         | NULL         | NULL       | NULL        | NULL     |
+-------------------------+-----------------------+-----------------------+-------+------------+-----------------+--------------+--------------+------------+-------------+----------+--+

As you can see we had no statistics on these columns, due to our obvious lack of an analyze command after the ingestion of new figures...

I gathered them with:

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> analyze table tbl_die_bin partition(fab="C2WF",lot_partition="Q926",die_partition=9) compute statistics for columns;
INFO  : Tez session hasn't been created yet. Opening session
INFO  : Dag name: analyze table tbl_die_bin partitio...columns(Stage-0)
INFO  : Status: Running (Executing on YARN cluster with App id application_1574697549080_20216)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      9          9        0        0       0       0
Reducer 2 ......   SUCCEEDED     31         31        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 64.75 s
--------------------------------------------------------------------------------
No rows affected (73.7 seconds)

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> analyze table tbl_die_param_flow partition(fab="C2WF",lot_partition="Q926",die_partition=9) compute statistics for columns start_t,finish_t,lot_id,wafer_id,flow_id,die_x,die_y,part_seq,ingestion_date;
INFO  : Tez session hasn't been created yet. Opening session
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED     39         39        0        0       0       0
Reducer 2 ......   SUCCEEDED    253        253        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 198.48 s
--------------------------------------------------------------------------------
No rows affected (369.772 seconds)

There is the special case of the array<string> datatype: such columns must be explicitly removed from the column list to avoid this error:

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> analyze table tbl_die_param_flow partition(fab="C2WF",lot_partition="Q926",die_partition=9) compute statistics for columns;
Error: Error while compiling statement: FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<string> is passed. (state=42000,code=40000)
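
To avoid maintaining this column list by hand, one possible way is to generate it while filtering out complex types. Below is a minimal PySpark sketch, assuming the table is readable from the spark session of your PySpark shell; the table and partition specification reuse the names above, and the generated statement is then run in beeline:

# Minimal sketch: build an analyze ... for columns statement that skips
# complex types (array/map/struct), which the command rejects, as well as
# the partition columns. Assumes an existing SparkSession named spark.
table = "prod_ews_refined.tbl_die_param_flow"
partition_spec = 'fab="C2WF",lot_partition="Q926",die_partition=9'
partition_cols = {"fab", "lot_partition", "die_partition"}

# dtypes returns (column_name, data_type) pairs, e.g. ("die_x", "int")
primitive_cols = [name for name, dtype in spark.table(table).dtypes
                  if name not in partition_cols
                  and not dtype.startswith(("array", "map", "struct"))]

statement = ("analyze table {0} partition({1}) "
             "compute statistics for columns {2};"
             .format(table, partition_spec, ",".join(primitive_cols)))
print(statement)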

Now, with accurate statistics:

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> describe formatted prod_ews_refined.tbl_die_param_flow ingestion_date partition(fab="C2WF",lot_partition="Q926",die_partition=9);
+-------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+--+
|        col_name         |       data_type       |          min          |          max          |       num_nulls       |    distinct_count     |      avg_col_len      |      max_col_len      |       num_trues       |      num_falses       |        comment        |
+-------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+--+
| # col_name              | data_type             | min                   | max                   | num_nulls             | distinct_count        | avg_col_len           | max_col_len           | num_trues             | num_falses            | comment               |
|                         | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  |
| ingestion_date          | string                |                       |                       | 0                     | 510                   | 12.0                  | 12                    |                       |                       | from deserializer     |
+-------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+--+
3 rows selected (0.487 seconds)

The explain plan for query01 and query02 is exactly the same:

Explain
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan not optimized by CBO.

Vertex dependency in root stage
Map 1 <- Map 2 (BROADCAST_EDGE)

Stage-0
   Fetch Operator
      limit:-1
      Stage-1
         Map 1
         File Output Operator [FS_3981081]
            compressed:false
            Statistics:Num rows: 100157 Data size: 76720262 Basic stats: COMPLETE Column stats: PARTIAL
            table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
            Select Operator [SEL_3981080]
               outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11"]
               Statistics:Num rows: 100157 Data size: 76720262 Basic stats: COMPLETE Column stats: PARTIAL
               Map Join Operator [MAPJOIN_3981086]
               |  condition map:[{"":"Inner Join 0 to 1"}]
               |  HybridGraceHashJoin:true
               |  keys:{"Map 2":"'C2WF' (type: string), 'Q926' (type: string), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)","Map 1":"'C2WF' (type: string), 'Q926' (type: string), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)"}
               |  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9"]
               |  Statistics:Num rows: 100157 Data size: 251394070 Basic stats: COMPLETE Column stats: PARTIAL
               |<-Map 2 [BROADCAST_EDGE] vectorized
               |  Reduce Output Operator [RS_3981089]
               |     key expressions:'C2WF' (type: string), 'Q926' (type: string), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)
               |     Map-reduce partition columns:'C2WF' (type: string), 'Q926' (type: string), start_t (type: string), finish_t (type: string), lot_id (type: string), wafer_id (type: string), flow_id (type: string), die_x (type: int), die_y (type: int), part_seq (type: int)
               |     sort order:++++++++++
               |     Statistics:Num rows: 1184068 Data size: 799245900 Basic stats: COMPLETE Column stats: COMPLETE
               |     Filter Operator [FIL_3981088]
               |        predicate:((pass_fail = 'F') and (ingestion_date < '201911291611') and start_t is not null and finish_t is not null and lot_id is not null and wafer_id is not null and flow_id is not null and die_x is not null and die_y is not null and part_seq is not null) (type: boolean)
               |        Statistics:Num rows: 1184068 Data size: 799245900 Basic stats: COMPLETE Column stats: COMPLETE
               |        TableScan [TS_3981067]
               |           alias:db
               |           Statistics:Num rows: 7104410 Data size: 8247850962 Basic stats: COMPLETE Column stats: COMPLETE
               |<-Filter Operator [FIL_3981084]
                     predicate:((ingestion_date < '201911291611') and start_t is not null and finish_t is not null and lot_id is not null and wafer_id is not null and flow_id is not null and die_x is not null and die_y is not null and part_seq is not null) (type: boolean)
                     Statistics:Num rows: 2364032 Data size: 1394778880 Basic stats: COMPLETE Column stats: PARTIAL
                     TableScan [TS_3981066]
                        alias:dpf
                        Statistics:Num rows: 7092098 Data size: 379282949029 Basic stats: COMPLETE Column stats: PARTIAL


42 rows selected.

Or graphically:

hive_statistics05

And the execution time is greatly reduced (by roughly a factor of 10):

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> select count(*)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> from prod_ews_refined.tbl_die_param_flow dpf join prod_ews_refined.tbl_die_bin db
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> on dpf.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.pass_fail="F"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.fab=db.fab
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition=db.lot_partition
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.start_t=db.start_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.finish_t= db.finish_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_id=db.lot_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.wafer_id=db.wafer_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.flow_id=db.flow_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_x=db.die_x
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_y=db.die_y
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.part_seq=db.part_seq
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.ingestion_date<"201911291611"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.ingestion_date<"201911291611";
INFO  : Session is already open
INFO  : Dag name: select count(*)
from ..._date<"201911291611"(Stage-1)
INFO  : Setting tez.task.scale.memory.reserve-fraction to 0.30000001192092896
INFO  : Tez session was closed. Reopening...
INFO  : Session re-established.
INFO  : Status: Running (Executing on YARN cluster with App id application_1574697549080_23102)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED     38         38        0        0       0       0
Map 3 ..........   SUCCEEDED      6          6        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 146.74 s
--------------------------------------------------------------------------------
+---------+--+
|   _c0   |
+---------+--+
| 322905  |
+---------+--+
1 row selected (153.287 seconds)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> select count(*)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> from prod_ews_refined.tbl_die_param_flow dpf join prod_ews_refined.tbl_die_bin db
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> on dpf.fab=db.fab
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition=db.lot_partition
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_partition = db.die_partition
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.start_t=db.start_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.finish_t= db.finish_t
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_id=db.lot_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.wafer_id=db.wafer_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.flow_id=db.flow_id
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_x=db.die_x
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_y=db.die_y
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.part_seq=db.part_seq
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> where dpf.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.fab="C2WF"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.lot_partition="Q926"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.pass_fail="F"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.die_partition="9"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and dpf.ingestion_date<"201911291611"
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> and db.ingestion_date<"201911291611";
INFO  : Session is already open
INFO  : Dag name: select count(*)
from ..._date<"201911291611"(Stage-1)
INFO  : Setting tez.task.scale.memory.reserve-fraction to 0.30000001192092896
INFO  : Status: Running (Executing on YARN cluster with App id application_1574697549080_23102)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED     38         38        0        0       0       0
Map 3 ..........   SUCCEEDED      6          6        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 126.12 s
--------------------------------------------------------------------------------
+---------+--+
|   _c0   |
+---------+--+
| 322905  |
+---------+--+
1 row selected (126.614 seconds)

References

The post On the importance to have good Hive statistics on your tables appeared first on IT World.

How to use Livy server to submit Spark job through a REST interface https://blog.yannickjaquier.com/hadoop/how-to-use-livy-server-to-submit-spark-job-through-a-rest-interface.html Mon, 24 Feb 2020 09:02:10 +0000


Table of contents

Preamble

Since we started our Hadoop journey, and more particularly started developing Spark jobs in Scala and Python, having an efficient development environment has always been a challenge.

What we currently do is remote editing via SSH FS plugins in VSCode and submitting scripts in a shell terminal directly from one of our edge nodes.

VSCode is a wonderful tool but it lacks the code completion, suggestions and tips that increase your productivity, at least for PySpark and Scala. Recently I succeeded in configuring the Community Edition of IntelliJ IDEA to submit jobs from my desktop using the data from our Hadoop cluster. Alongside this configuration I have also set up a local Spark environment as well as the sbt compiler for Scala jobs. I will soon share an article on this…

One of my teammates suggested using Livy, so I decided to have a look, even if in the end I have been a little disappointed by its capabilities…

Livy is a REST interface through which you interact with a Spark cluster. In our Hortonworks HDP 2.6 installation the Livy server comes pre-installed, so in short I had nothing to do to install or configure it. If you are in a different configuration you might have to install and configure the Livy server yourself.

Configuration

On our Hadoop cluster the Livy server came with the Spark installation and is already configured as such:

livy01

To find out on which server it runs, and whether the process is running, follow the link from the Spark service home page as follows:

livy02

You finally get the name of the server where the Livy server is running, and the status of the process:

livy03

You can access it using your preferred browser:

livy04
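
If you prefer a programmatic check, listing the existing sessions through the REST API works as well. A minimal Python sketch, assuming the requests package (its installation is shown later in this post) and the same hostname and port as in the rest of the examples:

# Quick sanity check: list the existing Livy sessions through the REST API.
import requests

r = requests.get("http://livy_server.domain.com:8999/sessions")
r.raise_for_status()
print(r.json())  # something like {'from': 0, 'total': 0, 'sessions': []}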

Curl testing

Curl is, as you know, a tool to access resources across a network. Here we will use it to access the Livy REST API resources. Interactive commands through this REST API can be issued in Scala, Python and R.

Even if the official documentation is built around Python scripts, I was able to resolve an annoying error using curl. I started with:

# curl -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" livy_server.domain.com:8999/sessions



Error 400 


HTTP ERROR: 400

Problem accessing /sessions. Reason:

    Missing Required Header for CSRF protection.


Powered by Jetty://

I found that the livy.server.csrf_protection.enabled parameter was set to true in my configuration, so I had to specify an extra request header using the X-Requested-By: parameter:

# curl -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" -H "X-Requested-By: Yannick" livy_server.domain.com:8999/sessions
{"id":8,"appId":null,"owner":null,"proxyUser":null,"state":"starting","kind":"pyspark","appInfo":{"driverLogUrl":null,"sparkUiUrl":null},
"log":["stdout: ","\nstderr: ","\nYARN Diagnostics: "]}

It returns that session id 8 has been created for me, which can be confirmed graphically:

livy05

Session is in idle state:

# curl http://livy_server.domain.com:8999/sessions/8
{"id":8,"appId":"application_1565718945091_261784","owner":null,"proxyUser":null,"state":"idle","kind":"pyspark",
"appInfo":{"driverLogUrl":"http://data_node01.domain.com:8042/node/containerlogs/container_e212_1565718945091_261784_01_000001/livy",
"sparkUiUrl":"http://name_node01.domain.com:8088/proxy/application_1565718945091_261784/"},
"log":["stdout: ","\nstderr: ","Warning: Master yarn-cluster is deprecated since 2.0. Please use master \"yarn\" with specified deploy mode instead.","\nYARN Diagnostics: "]}

Let's submit a bit of work to it:

# curl http://livy_server.domain.com:8999/sessions/8/statements -H "X-Requested-By: Yannick" -X POST -H 'Content-Type: application/json' -d '{"code":"2 + 2"}'
{"id":0,"code":"2 + 2","state":"waiting","output":null,"progress":0.0}

We can get the result using:

# curl http://livy_server.domain.com:8999/sessions/8/statements
{"total_statements":1,"statements":[{"id":0,"code":"2 + 2","state":"available","output":{"status":"ok","execution_count":0,"data":{"text/plain":"4"}},"progress":1.0}]}

Or graphically:

livy06

To optionally clean up the session:

# curl http://livy_server.domain.com:8999/sessions/8 -H "X-Requested-By: Yannick" -X DELETE
{"msg":"deleted"}

Python testing

Python is what the Livy documentation pushes by default to test the service. I started by installing the requests package (as well as upgrading pip):

PS D:\> python -m pip install --upgrade pip --user
Collecting pip
  Downloading https://files.pythonhosted.org/packages/00/b6/9cfa56b4081ad13874b0c6f96af8ce16cfbc1cb06bedf8e9164ce5551ec1/pip-19.3.1-py2.py3-none-any.whl (1.4MB)
     |████████████████████████████████| 1.4MB 2.2MB/s
Installing collected packages: pip
  Found existing installation: pip 19.2.3
    Uninstalling pip-19.2.3:
Installing collected packages: pip
Successfully installed pip-19.3.1
WARNING: You are using pip version 19.2.3, however version 19.3.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

Requests package installation:

PS D:\> python -m pip install requests --user
Collecting requests
  Downloading https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl (57kB)
     |████████████████████████████████| 61kB 563kB/s
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (from requests)
  Downloading https://files.pythonhosted.org/packages/b4/40/a9837291310ee1ccc242ceb6ebfd9eb21539649f193a7c8c86ba15b98539/urllib3-1.25.7-py2.py3-none-any.whl (125kB)
     |████████████████████████████████| 133kB 6.4MB/s
Collecting certifi>=2017.4.17 (from requests)
  Downloading https://files.pythonhosted.org/packages/18/b0/8146a4f8dd402f60744fa380bc73ca47303cccf8b9190fd16a827281eac2/certifi-2019.9.11-py2.py3-none-any.whl (154kB)
     |████████████████████████████████| 163kB 3.3MB/s
Collecting chardet<3.1.0,>=3.0.2 (from requests)
  Downloading https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl (133kB)
     |████████████████████████████████| 143kB 6.4MB/s
Collecting idna<2.9,>=2.5 (from requests)
  Downloading https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl (58kB)
     |████████████████████████████████| 61kB 975kB/s
Installing collected packages: urllib3, certifi, idna, chardet, requests
  WARNING: The script chardetect.exe is installed in 'C:\Users\yjaquier\AppData\Roaming\Python\Python38\Scripts' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed certifi-2019.9.11 chardet-3.0.4 idna-2.8 requests-2.22.0 urllib3-1.25.7

Python testing was done using Python 3.8 installed on my Windows 10 machine. Obviously you can also see graphically what's going on but, as it is exactly the same as with curl, I will not share the output again.

Session creation:

>>> import json, pprint, requests, textwrap
>>> host = 'http://livy_server.domain.com:8999'
>>> data = {'kind': 'spark'}
>>> headers = {'Content-Type': 'application/json', 'X-Requested-By': 'Yannick'}
>>> r = requests.post(host + '/sessions', data=json.dumps(data), headers=headers)
>>> r.json()
{'id': 9, 'appId': None, 'owner': None, 'proxyUser': None, 'state': 'starting', 'kind': 'spark', 'appInfo': {'driverLogUrl': None, 'sparkUiUrl': None},
'log': ['stdout: ', '\nstderr: ', '\nYARN Diagnostics: ']}

Session status:

>>> session_url = host + r.headers['location']
>>> r = requests.get(session_url, headers=headers)
>>> r.json()
{'id': 9, 'appId': 'application_1565718945091_261793', 'owner': None, 'proxyUser': None, 'state': 'idle', 'kind': 'spark',
'appInfo': {'driverLogUrl': 'http://data_node08.domain.com:8042/node/containerlogs/container_e212_1565718945091_261793_01_000001/livy',
'sparkUiUrl': 'http://sparkui_server.domain.com:8088/proxy/application_1565718945091_261793/'},
'log': ['stdout: ', '\nstderr: ', 'Warning: Master yarn-cluster is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.', '\nYARN Diagnostics: ']}

Remark:
Note the URL to directly access your Spark UI server for an even better description of your workload…

Job submission:

>>> statements_url = session_url + '/statements'
>>> data = {'code': '1 + 1'}
>>> r = requests.post(statements_url, data=json.dumps(data), headers=headers)
>>> r.json()
{'id': 0, 'code': '1 + 1', 'state': 'waiting', 'output': None, 'progress': 0.0}

Job result:

>>> statement_url = host + r.headers['location']
>>> r = requests.get(statement_url, headers=headers)
>>> pprint.pprint(r.json())
{'code': '1 + 1',
 'id': 0,
 'output': {'data': {'text/plain': 'res0: Int = 2'},
            'execution_count': 0,
            'status': 'ok'},
 'progress': 1.0,
 'state': 'available'}

Optional session deletion:

>>> r = requests.delete(session_url, headers=headers)
>>> r.json()
{'msg': 'deleted'}
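
For non-interactive use, the calls above can be wrapped into a small script that waits for the session and the statement to become ready instead of assuming they are. A minimal sketch under the same host, header and endpoint assumptions as above (the polling interval and timeout are arbitrary values):

# Minimal sketch: create a session, wait for it, run one statement, clean up.
import json
import time

import requests

HOST = "http://livy_server.domain.com:8999"
HEADERS = {"Content-Type": "application/json", "X-Requested-By": "Yannick"}

def wait_for(url, field, expected, timeout=600, interval=5):
    # Poll a Livy resource until the given field reaches the expected value.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if requests.get(url, headers=HEADERS).json()[field] == expected:
            return
        time.sleep(interval)
    raise TimeoutError("{0} never reached state {1}".format(url, expected))

# Create a Scala (spark) session and wait until it is idle.
r = requests.post(HOST + "/sessions", data=json.dumps({"kind": "spark"}),
                  headers=HEADERS)
session_url = HOST + r.headers["location"]
wait_for(session_url, "state", "idle")

# Submit a statement and wait until its result is available.
r = requests.post(session_url + "/statements",
                  data=json.dumps({"code": "1 + 1"}), headers=HEADERS)
statement_url = HOST + r.headers["location"]
wait_for(statement_url, "state", "available")
print(requests.get(statement_url, headers=HEADERS).json()["output"])

# Optional cleanup.
requests.delete(session_url, headers=HEADERS)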

References

The post How to use Livy server to submit Spark job through a REST interface appeared first on IT World.

Hive concatenate command issues and workaround https://blog.yannickjaquier.com/hadoop/hive-concatenate-command-issues-and-workaround.html Fri, 24 Jan 2020 08:30:12 +0000


Table of contents

Preamble

To maintain good performance we have developed a script (to be shared in another blog post) to concatenate the partitions of our Hive tables every week. It helps in reducing the number of ORC files per partition and, as such, in reducing the number of Map and Reduce jobs required to access the partitions in Hive queries.
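
To give a rough idea before the dedicated post, such a script can be as simple as issuing one alter table ... concatenate statement per partition through beeline. A minimal Python sketch, where the JDBC URL, table name and partition list are placeholders to adapt to your environment:

# Minimal sketch of a weekly concatenation loop: one alter table ...
# concatenate statement per partition, submitted through beeline.
# The JDBC URL, table name and partition list below are placeholders.
import subprocess

jdbc_url = "jdbc:hive2://zookeeper01.domain.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"
table = "prod_ews_refined.tbl_wafer_param_stats"
partitions = ['fab="CTM8",lot_partition="58053"',
              'fab="CTM8",lot_partition="58054"']

for spec in partitions:
    sql = "alter table {0} partition ({1}) concatenate;".format(table, spec)
    subprocess.run(["beeline", "-u", jdbc_url, "-e", sql], check=True)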

It all went well until one week we started to get error messages for partitions of one of our tables. The funny situation was that for some partitions it still worked well while for others we got an error message.

This table was created in Hive but we fill the partitions in a PySpark script. So the partitions got created by Spark, because obviously it is when you insert rows that the partition directories are created in HDFS…

Our cluster is running HDP-2.6.4.0.

The concerned stack versions are:

  • Hive 1.2.1000
  • Spark 2.2.0.2.6.4.0-91

Concatenate command failing with an access rights error

The complete error message is the following:

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> alter table prod_ews_refined.tbl_wafer_param_stats partition (fab="CTM8",lot_partition="58053") concatenate;
INFO  : Session is already open
INFO  : Dag name: hive_20190830162358_1591b2d1-7489-4f60-8b9f-725fb50a4648
INFO  : Tez session was closed. Reopening...
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
File Merge ....      RUNNING      5          4        0        1       4       0
--------------------------------------------------------------------------------
VERTICES: 00/01  [====================>>------] 80%   ELAPSED TIME: 5.52 s
--------------------------------------------------------------------------------
ERROR : Status: Failed
ERROR : Vertex failed, vertexName=File Merge, vertexId=vertex_1565718945091_75870_2_00, diagnostics=[Task failed, taskId=task_1565718945091_75870_2_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: Hive Runtime Error while closing operators
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
        at org.apache.hadoop.hive.ql.exec.tez.MergeFileTezProcessor.run(MergeFileTezProcessor.java:42)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:347)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:194)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:185)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:185)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:181)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Hive Runtime Error while closing operators
        at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.close(MergeFileRecordProcessor.java:184)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:164)
        ... 14 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to close AbstractFileMergeOperator
        at org.apache.hadoop.hive.ql.exec.AbstractFileMergeOperator.closeOp(AbstractFileMergeOperator.java:272)
        at org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.closeOp(OrcFileMergeOperator.java:250)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:620)
        at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.close(MergeFileRecordProcessor.java:176)
        ... 15 more
Caused by: org.apache.hadoop.security.AccessControlException: Permission denied: user=mfgdl_ingestion, access=EXECUTE, inode="/apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/.hive-staging_hive_2019-08-30_16-23-58_034_3892822656895858873-3012/_tmp.-ext-10000/000000_0_copy_1/000000_0_copy_7":mfgdl_ingestion:hadoop:-rw-r--r--
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:353)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:292)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:238)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1950)
        at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:108)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4142)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1137)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:866)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2167)
        at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1442)
        at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1438)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1454)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1447)
        at org.apache.hadoop.hive.ql.exec.Utilities.moveFile(Utilities.java:1807)
        at org.apache.hadoop.hive.ql.exec.Utilities.renameOrMoveFiles(Utilities.java:1843)
        at org.apache.hadoop.hive.ql.exec.AbstractFileMergeOperator.closeOp(AbstractFileMergeOperator.java:258)
        ... 18 more
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=mfgdl_ingestion, access=EXECUTE, inode="/apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/.hive-staging_hive_2019-08-30_16-23-58_034_3892822656895858873-3012/_tmp.-ext-10000/000000_0_copy_1/000000_0_copy_7":mfgdl_ingestion:hadoop:-rw-r--r--
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:353)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:292)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:238)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1950)
        at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:108)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4142)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1137)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:866)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)

        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
        at org.apache.hadoop.ipc.Client.call(Client.java:1498)
        at org.apache.hadoop.ipc.Client.call(Client.java:1398)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:823)
        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:291)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:203)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:185)
        at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2165)
        ... 26 more
], TaskAttempt 1 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.io.IOException: java.lang.reflect.InvocationTargetException
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
        at org.apache.hadoop.hive.ql.exec.tez.MergeFileTezProcessor.run(MergeFileTezProcessor.java:42)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:347)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:194)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:185)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:185)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:181)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: java.lang.reflect.InvocationTargetException
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:266)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:140)
        at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
        at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.run(MergeFileRecordProcessor.java:150)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:150)
        ... 14 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedConstructorAccessor18.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:252)
        ... 18 more
Caused by: java.io.FileNotFoundException: File does not exist: /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/000000_0_copy_5
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2025)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1996)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
        at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1240)
        at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1225)
        at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1213)
        at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:309)
        at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:274)
        at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:266)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1538)
        at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:332)
        at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:327)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:340)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:786)
        at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractFileTail(ReaderImpl.java:355)
        at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:319)
        at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:241)
        at org.apache.hadoop.hive.ql.io.orc.OrcFileStripeMergeRecordReader.<init>(OrcFileStripeMergeRecordReader.java:47)
        at org.apache.hadoop.hive.ql.io.orc.OrcFileStripeMergeInputFormat.getRecordReader(OrcFileStripeMergeInputFormat.java:37)
        at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:67)
        ... 22 more
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/000000_0_copy_5
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2025)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1996)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)

        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
        at org.apache.hadoop.ipc.Client.call(Client.java:1498)
        at org.apache.hadoop.ipc.Client.call(Client.java:1398)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:272)
        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:291)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:203)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:185)
        at com.sun.proxy.$Proxy12.getBlockLocations(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1238)
        ... 39 more
], TaskAttempt 2 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.io.IOException: java.lang.reflect.InvocationTargetException
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
        at org.apache.hadoop.hive.ql.exec.tez.MergeFileTezProcessor.run(MergeFileTezProcessor.java:42)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:347)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:194)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:185)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:185)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:181)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: java.lang.reflect.InvocationTargetException
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:266)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:140)
        at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
        at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.run(MergeFileRecordProcessor.java:150)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:150)
        ... 14 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:252)
        ... 18 more
Caused by: java.io.FileNotFoundException: File does not exist: /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/000000_0_copy_2
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2025)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1996)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
        at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1240)
        at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1225)
        at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1213)
        at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:309)
        at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:274)
        at org.apache.hadoop.hdfs.DFSInputStream.&lt;init&gt;(DFSInputStream.java:266)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1538)
        at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:332)
        at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:327)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:340)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:786)
        at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractFileTail(ReaderImpl.java:355)
        at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.&lt;init&gt;(ReaderImpl.java:319)
        at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:241)
        at org.apache.hadoop.hive.ql.io.orc.OrcFileStripeMergeRecordReader.&lt;init&gt;(OrcFileStripeMergeRecordReader.java:47)
        at org.apache.hadoop.hive.ql.io.orc.OrcFileStripeMergeInputFormat.getRecordReader(OrcFileStripeMergeInputFormat.java:37)
        at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.&lt;init&gt;(CombineHiveRecordReader.java:67)
        ... 23 more
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/000000_0_copy_2
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2025)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1996)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)

        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
        at org.apache.hadoop.ipc.Client.call(Client.java:1498)
        at org.apache.hadoop.ipc.Client.call(Client.java:1398)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:272)
        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:291)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:203)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:185)
        at com.sun.proxy.$Proxy12.getBlockLocations(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1238)
        ... 40 more
], TaskAttempt 3 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.io.IOException: java.lang.reflect.InvocationTargetException
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
        at org.apache.hadoop.hive.ql.exec.tez.MergeFileTezProcessor.run(MergeFileTezProcessor.java:42)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:347)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:194)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:185)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:185)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:181)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: java.lang.reflect.InvocationTargetException
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:266)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:140)
        at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
        at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.run(MergeFileRecordProcessor.java:150)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:150)
        ... 14 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:252)
        ... 18 more
Caused by: java.io.FileNotFoundException: File does not exist: /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/000000_0_copy_2
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2025)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1996)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
        at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1240)
        at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1225)
        at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1213)
        at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:309)
        at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:274)
        at org.apache.hadoop.hdfs.DFSInputStream.&lt;init&gt;(DFSInputStream.java:266)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1538)
        at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:332)
        at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:327)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:340)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:786)
        at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractFileTail(ReaderImpl.java:355)
        at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.&lt;init&gt;(ReaderImpl.java:319)
        at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:241)
        at org.apache.hadoop.hive.ql.io.orc.OrcFileStripeMergeRecordReader.&lt;init&gt;(OrcFileStripeMergeRecordReader.java:47)
        at org.apache.hadoop.hive.ql.io.orc.OrcFileStripeMergeInputFormat.getRecordReader(OrcFileStripeMergeInputFormat.java:37)
        at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.&lt;init&gt;(CombineHiveRecordReader.java:67)
        ... 23 more
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/000000_0_copy_2
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2025)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1996)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)

        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
        at org.apache.hadoop.ipc.Client.call(Client.java:1498)
        at org.apache.hadoop.ipc.Client.call(Client.java:1398)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:272)
        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:291)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:203)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:185)
        at com.sun.proxy.$Proxy12.getBlockLocations(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1238)
        ... 40 more
]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:0, Vertex vertex_1565718945091_75870_2_00 [File Merge] killed/failed due to:OWN_TASK_FAILURE]
ERROR : DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.DDLTask (state=08S01,code=2)

Meanwhile, as mentioned, for some other partitions the same command still works fine:

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> alter table prod_ews_refined.tbl_wafer_param_stats partition (fab="CTM8",lot_partition="59591") concatenate;
INFO  : Session is already open
INFO  : Dag name: hive_20190830145138_334957cb-f329-43af-953b-d03e213c9b03
INFO  : Tez session was closed. Reopening...
INFO  : Session re-established.
INFO  : Status: Running (Executing on YARN cluster with App id application_1565718945091_74778)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
File Merge .....   SUCCEEDED      4          4        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 31.15 s
--------------------------------------------------------------------------------
INFO  : Loading data to table prod_ews_refined.tbl_wafer_param_stats partition (fab=CTM8, lot_partition=59591) from hdfs://ManufacturingDataLakeHdfs/apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=59591/.hive-staging_hive_2019-08-30_14-51-38_835_3100511055765258809-3012/-ext-10000
INFO  : Partition prod_ews_refined.tbl_wafer_param_stats{fab=CTM8, lot_partition=59591} stats: [numFiles=4, totalSize=107827]
No rows affected (56.472 seconds)

If we extract the root cause from the long error message above, it is:

Caused by: org.apache.hadoop.security.AccessControlException: Permission denied: user=mfgdl_ingestion, access=EXECUTE, inode="/apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/.hive-staging_hive_2019-08-30_16-23-58_034_3892822656895858873-3012/_tmp.-ext-10000/000000_0_copy_1/000000_0_copy_7":mfgdl_ingestion:hadoop:-rw-r--r--

Clearly there is an access rights problem with the concatenate command. What is puzzling is that the command is launched with the exact same user that created and filled the partition and, more importantly, the error happens randomly on only a few partitions… No doubt we are hitting a bug, though it is hard to say whether it is Spark or Hive related…
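To quickly spot which files of a suspicious partition were written without the execute bit, a small script like the one below can help. This is only a minimal sketch of mine, not part of the original ingestion code: it assumes the hdfs client is available in the PATH and reuses the partition path of this article:

import subprocess

# Partition that randomly fails on concatenate (path taken from the error above)
partition = "/apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053"

# hdfs dfs -ls lines look like: -rw-r--r-- 3 owner group size date time path
output = subprocess.run(["hdfs", "dfs", "-ls", partition],
                        capture_output=True, text=True, check=True).stdout
for line in output.splitlines():
    fields = line.split()
    if len(fields) >= 8 and fields[0].startswith("-") and "x" not in fields[0]:
        print("missing execute bit:", fields[0], fields[-1])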

HDFS default access rights with Hive and Spark

I first noticed that for the failing partition the rights on the ORC files are not all the same (a mix of 755, rwxr-xr-x and 644, rw-r--r--):

hdfs@clientnode:~$ hdfs dfs -ls /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053
Found 35 items
.
.
-rwxr-xr-x   3 mfgdl_ingestion hadoop       1959 2019-08-25 18:12 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/000001_0_copy_8
-rwxr-xr-x   3 mfgdl_ingestion hadoop       2563 2019-08-25 18:12 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/000002_0_copy_1
-rwxr-xr-x   3 mfgdl_ingestion hadoop       1967 2019-08-25 18:11 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/000002_0_copy_2
-rwxr-xr-x   3 mfgdl_ingestion hadoop       2190 2019-08-25 18:08 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/000002_0_copy_3
-rwxr-xr-x   3 mfgdl_ingestion hadoop       1985 2019-08-25 18:11 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/000002_0_copy_4
-rwxr-xr-x   3 mfgdl_ingestion hadoop       3508 2019-08-26 20:19 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/000004_0
-rw-r--r--   3 mfgdl_ingestion hadoop       2009 2019-08-30 06:18 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/part-00004-9be63b72-9ccc-48f9-a422-b9a6420a3f6f.c000.zlib.orc
-rw-r--r--   3 mfgdl_ingestion hadoop       1959 2019-08-27 21:16 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/part-00007-dd3cce93-8301-42b4-add0-570ee27a5d66.c000.zlib.orc
-rw-r--r--   3 mfgdl_ingestion hadoop       2133 2019-08-27 21:16 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/part-00030-dd3cce93-8301-42b4-add0-570ee27a5d66.c000.zlib.orc
-rw-r--r--   3 mfgdl_ingestion hadoop       2137 2019-08-30 06:18 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/part-00047-9be63b72-9ccc-48f9-a422-b9a6420a3f6f.c000.zlib.orc
.
.
.

While for the working partition they are all the same (755, rwxr-xr-x):

hdfs@clientnode:~$ hdfs dfs -ls /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=59591
Found 4 items
-rwxr-xr-x   3 mfgdl_ingestion hadoop      41397 2019-08-30 14:52 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=59591/000000_0
-rwxr-xr-x   3 mfgdl_ingestion hadoop      39713 2019-08-30 14:52 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=59591/000001_0
-rwxr-xr-x   3 mfgdl_ingestion hadoop      21324 2019-08-30 14:52 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=59591/000002_0
-rwxr-xr-x   3 mfgdl_ingestion hadoop       5393 2019-08-30 14:52 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=59591/000003_0

I have also tried to check whether all the files really belong to the partition (no ghost file) and apparently they do. From Hive we can only verify that the number of files (totalNumberFiles) is consistent; we cannot get the actual list of HDFS files that make up the partition:

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> use prod_ews_refined;
No rows affected (0.438 seconds)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> show table extended like tbl_wafer_param_stats partition (fab="CTM8",lot_partition="58053");
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                                                                                                                                                                           tab_name                                                                                                                                                                                                                            |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| tableName:tbl_wafer_param_stats                                                                                                                                                                                                                                                                                                                                                                                                                               |
| owner:mfgdl_ingestion                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| location:hdfs://ManufacturingDataLakeHdfs/apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053                                                                                                                                                                                                                                                                                                                          |
| inputformat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat                                                                                                                                                                                                                                                                                                                                                                                                   |
| outputformat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat                                                                                                                                                                                                                                                                                                                                                                                                 |
| columns:struct columns { string start_t, string finish_t, string lot_id, string wafer_id, string flow_id, i32 param_id, string param_name, float param_low_limit, float param_high_limit, string param_unit, string ingestion_date, float param_p01, float param_q1, float param_median, float param_q3, float param_p99, float param_min_value, double param_avg_value, float param_max_value, double param_stddev, i64 nb_dies_tested, i64 nb_dies_failed}  |
| partitioned:true                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| partitionColumns:struct partition_columns { string fab, string lot_partition}                                                                                                                                                                                                                                                                                                                                                                                 |
| totalNumberFiles:34                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| totalFileSize:88754                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| maxFileSize:16710                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| minFileSize:1954                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| lastAccessTime:1567138817781                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| lastUpdateTime:1567179342705                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
15 rows selected (0.444 seconds)

Remark
I have tried the IN | FROM database_name syntax as described in the official Hive documentation:

SHOW TABLE EXTENDED [IN|FROM database_name] LIKE 'identifier_with_wildcards' [PARTITION(partition_spec)];

But I have not been able to make it work, so I finally decided to use the USE database_name statement instead…
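Still, the totalNumberFiles and totalFileSize values reported above can be cross-checked from the HDFS side with a tiny script. Again a minimal sketch under the same assumptions (hdfs client available); note that leftover .hive-staging directories, if any, may slightly inflate the HDFS figures:

import subprocess

partition = "/apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053"

# hdfs dfs -count prints: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
output = subprocess.run(["hdfs", "dfs", "-count", partition],
                        capture_output=True, text=True, check=True).stdout
dir_count, file_count, content_size, path = output.split()
print("HDFS side: {} files, {} bytes (compare with totalNumberFiles/totalFileSize)".format(file_count, content_size))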

As mentioned, this table is filled using PySpark with code like:

dataframe.write.mode('append').format('orc').option("compression","zlib").partitionBy('fab','lot_partition').saveAsTable("prod_ews_refined.tbl_wafer_param_stats")
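As a side note, and not something this article changes, the number of small ORC files produced per partition (and therefore the need for concatenate) can be reduced at write time by repartitioning on the partition columns before the write. A minimal sketch, with dataframe being the same DataFrame as above:

(dataframe
    .repartition("fab", "lot_partition")
    .write.mode("append")
    .format("orc")
    .option("compression", "zlib")
    .partitionBy("fab", "lot_partition")
    .saveAsTable("prod_ews_refined.tbl_wafer_param_stats"))

With hash partitioning on the same columns, each (fab, lot_partition) combination is handled by a single task, so each run adds at most one file per partition directory.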

I have checked the default HDFS mask:

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> set fs.permissions.umask-mode;
+--------------------------------+--+
|              set               |
+--------------------------------+--+
| fs.permissions.umask-mode=022  |
+--------------------------------+--+
1 row selected (0.061 seconds)

It means that, by default, files are created with 644, rw-r--r-- (666 - 022) and directories are created with 755, rwxr-xr-x (777 - 022).
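For reference, the umask is really applied as a bit mask rather than a decimal subtraction (with 022 the result is the same); a tiny worked example:

umask = 0o022
print(oct(0o666 & ~umask))   # 0o644 -> rw-r--r-- for new files
print(oct(0o777 & ~umask))   # 0o755 -> rwxr-xr-x for new directories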

But digging a bit inside the tree of my database:

hdfs@clientnode:~$ hdfs dfs -ls /apps/hive/warehouse/prod_ews_refined.db/
Found 7 items
drwxrwxrwx   - mfgdl_ingestion hadoop          0 2019-09-02 10:12 /apps/hive/warehouse/prod_ews_refined.db/tbl_bin_param_stat
drwxrwxrwx   - mfgdl_ingestion hadoop          0 2019-08-31 01:38 /apps/hive/warehouse/prod_ews_refined.db/tbl_die_bin
drwxrwxrwx   - mfgdl_ingestion hadoop          0 2019-09-02 10:36 /apps/hive/warehouse/prod_ews_refined.db/tbl_die_param_flow
drwxrwxrwx   - mfgdl_ingestion hadoop          0 2019-09-02 10:34 /apps/hive/warehouse/prod_ews_refined.db/tbl_die_sp
drwxrwxrwx   - mfgdl_ingestion hadoop          0 2019-08-30 19:26 /apps/hive/warehouse/prod_ews_refined.db/tbl_stdf
drwxr-xr-x   - mfgdl_ingestion hadoop          0 2019-09-02 10:39 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats
drwxrwxrwx   - mfgdl_ingestion hadoop          0 2019-08-30 12:02 /apps/hive/warehouse/prod_ews_refined.db/tbl_wsr_map
hdfs@clientnode:~$ hdfs dfs -ls -d /apps/hive/warehouse/prod_ews_refined.db
drwxrwxrwx   - mfgdl_ingestion hadoop          0 2019-07-16 23:17 /apps/hive/warehouse/prod_ews_refined.db
hdfs@clientnode:~$ hdfs dfs -ls /apps/hive/
Found 1 items
drwxrwxrwx   - hive hadoop          0 2019-09-02 10:09 /apps/hive/warehouse

The parameter below explains why sub-directories always end up with 777; I have seen that this parameter applies more to files than to sub-directories (!!):

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> set hive.warehouse.subdir.inherit.perms;
+-------------------------------------------+--+
|                    set                    |
+-------------------------------------------+--+
| hive.warehouse.subdir.inherit.perms=true  |
+-------------------------------------------+--+
1 row selected (0.103 seconds)

So this explains why the HDFS files (ORC in my case) of tables created and filled by Hive get 777 on both directories and files. For partitions created by Spark, the default HDFS mask (fs.permissions.umask-mode) applies, so 755 (rwxr-xr-x) for directories and 644 (rw-r--r--) for files is the expected behavior.

So to solve the execute permission issue on the partition files I issued:

hdfs@clientnode:~$ hdfs dfs -chmod 755 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/part*

And the concatenate went well this time:

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> alter table prod_ews_refined.tbl_wafer_param_stats partition (fab="CTM8",lot_partition="58053") concatenate;
INFO  : Session is already open
INFO  : Dag name: hive_20190902152242_ac7ab990-fdfd-4094-89b4-4926c49364ee
INFO  : Tez session was closed. Reopening...
INFO  : Session re-established.
INFO  : Status: Running (Executing on YARN cluster with App id application_1565718945091_85673)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
File Merge .....   SUCCEEDED      5          5        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 12.87 s
--------------------------------------------------------------------------------
INFO  : Loading data to table prod_ews_refined.tbl_wafer_param_stats partition (fab=CTM8, lot_partition=58053) from hdfs://ManufacturingDataLakeHdfs/apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/.hive-staging_hive_2019-09-02_15-22-42_356_8162954430835088222-15910/-ext-10000
INFO  : Partition prod_ews_refined.tbl_wafer_param_stats{fab=CTM8, lot_partition=58053} stats: [numFiles=8, numRows=57, totalSize=65242, rawDataSize=42834]
No rows affected (29.993 seconds)

With the number of files drastically reduced:

hdfs@clientnode:~$ hdfs dfs -ls /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053
Found 9 items
drwxr-xr-x   - mfgdl_ingestion hadoop          0 2019-08-29 19:20 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/.hive-staging_hive_2019-08-29_18-10-26_174_7966264139569655341-114
-rwxr-xr-x   3 mfgdl_ingestion hadoop      20051 2019-09-02 15:23 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/000000_0
-rwxr-xr-x   3 mfgdl_ingestion hadoop      19020 2019-09-02 15:23 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/000001_0
-rwxr-xr-x   3 mfgdl_ingestion hadoop       2009 2019-08-30 06:18 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/000001_0_copy_1
-rwxr-xr-x   3 mfgdl_ingestion hadoop       1959 2019-08-27 21:16 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/000001_0_copy_2
-rwxr-xr-x   3 mfgdl_ingestion hadoop       2163 2019-08-27 03:42 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/000001_0_copy_3
-rwxr-xr-x   3 mfgdl_ingestion hadoop      14413 2019-09-02 15:23 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/000002_0
-rwxr-xr-x   3 mfgdl_ingestion hadoop       3508 2019-09-02 15:23 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/000003_0
-rwxr-xr-x   3 mfgdl_ingestion hadoop       2119 2019-09-02 15:23 /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053/000004_0
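Since other partitions of the table can be affected in exactly the same way, the chmod fix can be scripted instead of being applied by hand. A minimal sketch (my own helper, not part of the ingestion code, same assumptions as before) that gives the execute bit back to every data file written with 644, skipping the staging directories:

import subprocess

table_dir = "/apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats"

# hdfs dfs -ls -R lists every entry below the table directory, partitions included
output = subprocess.run(["hdfs", "dfs", "-ls", "-R", table_dir],
                        capture_output=True, text=True, check=True).stdout
to_fix = [line.split()[-1] for line in output.splitlines()
          if line.startswith("-rw-") and "/.hive-staging" not in line]
for path in to_fix:
    subprocess.run(["hdfs", "dfs", "-chmod", "755", path], check=True)
    print("fixed:", path)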

To go further

To try to clarify this story of default access rights with hive.warehouse.subdir.inherit.perms and fs.permissions.umask-mode, I created a test table in the default database with those parameters set respectively to true and 022:

drop table yannick01 purge;

create table default.yannick01
(
value string
)
partitioned by(fab string, int_partition int)
stored as orc;

I insert a dummy record with:

insert into default.yannick01 partition (fab="CTM8", int_partition=1) values ("One");

As expected I get 777 on every directory and file because my /apps/hive/warehouse directory is 777:

hdfs@clientnode:~$ hdfs dfs -ls -d /apps/hive/warehouse/yannick01
drwxrwxrwx   - mfgdl_ingestion hadoop          0 2019-09-03 11:21 /apps/hive/warehouse/yannick01

hdfs@clientnode:~$ hdfs dfs -ls /apps/hive/warehouse/yannick01
Found 1 items
drwxrwxrwx   - mfgdl_ingestion hadoop          0 2019-09-02 18:03 /apps/hive/warehouse/yannick01/fab=CTM8

hdfs@clientnode:~$ hdfs dfs -ls /apps/hive/warehouse/yannick01/fab=CTM8
Found 1 items
drwxrwxrwx   - mfgdl_ingestion hadoop          0 2019-09-02 18:03 /apps/hive/warehouse/yannick01/fab=CTM8/int_partition=1

hdfs@clientnode:~$ hdfs dfs -ls /apps/hive/warehouse/yannick01/fab=CTM8/int_partition=1
Found 1 items
-rwxrwxrwx   3 mfgdl_ingestion hadoop        214 2019-09-02 18:04 /apps/hive/warehouse/yannick01/fab=CTM8/int_partition=1/000000_0

Now if I re-execute the creation and insert statements after setting this:

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> set hive.warehouse.subdir.inherit.perms=false;
No rows affected (0.003 seconds)
0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> set hive.warehouse.subdir.inherit.perms;
+--------------------------------------------+--+
|                    set                     |
+--------------------------------------------+--+
| hive.warehouse.subdir.inherit.perms=false  |
+--------------------------------------------+--+
1 row selected (0.005 seconds)

I get:

hdfs@clientnode:~$ hdfs dfs -ls -d /apps/hive/warehouse/yannick01
drwxrwxrwx   - mfgdl_ingestion hadoop          0 2019-09-03 11:21 /apps/hive/warehouse/yannick01

hdfs@clientnode:~$ hdfs dfs -ls /apps/hive/warehouse/yannick01/fab=CTM8
Found 1 items
drwxrwxrwx   - mfgdl_ingestion hadoop          0 2019-09-03 11:22 /apps/hive/warehouse/yannick01/fab=CTM8/int_partition=1

hdfs@clientnode:~$ hdfs dfs -ls /apps/hive/warehouse/yannick01/fab=CTM8/int_partition=1
Found 1 items
-rw-r--r--   3 mfgdl_ingestion hadoop        223 2019-09-03 11:22 /apps/hive/warehouse/yannick01/fab=CTM8/int_partition=1/000000_0

And the concatenate command works fine because the directory containing the ORC HDFS files has 777. So it is not strictly required to set the execute permission on the ORC files themselves: granting write for group and others (hdfs dfs -chmod g+w,o+w, or simply hdfs dfs -chmod 777) on the directory where the files are stored also works. But clearly this is much less secure…

0: jdbc:hive2://zookeeper01.domain.com:2181,zoo> alter table default.yannick01 partition (fab="CTM8", int_partition=1) concatenate;
INFO  : Session is already open
INFO  : Dag name: hive_20190904163205_31445337-20a1-42d3-80ac-e08028d6a2a1
INFO  : Status: Running (Executing on YARN cluster with App id application_1565718945091_100063)

INFO  : Loading data to table default.yannick01 partition (fab=CTM8, int_partition=1) from hdfs://ManufacturingDataLakeHdfs/apps/hive/warehouse/yannick01/fab=CTM8/int_partition=1/.hive-staging_hive_2019-09-04_16-32-05_102_6350061968665706892-148/-ext-10000
INFO  : Partition default.yannick01{fab=CTM8, int_partition=1} stats: [numFiles=2, numRows=1, totalSize=158170, rawDataSize=87]
No rows affected (3.291 seconds)
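For completeness, the directory-level alternative mentioned above boils down to a single command per partition; a minimal sketch, same assumptions as the previous scripts:

import subprocess

partition = "/apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=CTM8/lot_partition=58053"
# Open up the partition directory itself instead of touching each ORC file (less secure)
subprocess.run(["hdfs", "dfs", "-chmod", "g+w,o+w", partition], check=True)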

Worked well till…

It all worked as expected until we hit a partition with a lot of files; even after setting the correct rights it failed with:

ERROR : Status: Failed
ERROR : Vertex failed, vertexName=File Merge, vertexId=vertex_1565718945091_143224_2_00, diagnostics=[Task failed, taskId=task_1565718945091_143224_2_00_000001, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: Hive Runtime Error while closing operators
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
        at org.apache.hadoop.hive.ql.exec.tez.MergeFileTezProcessor.run(MergeFileTezProcessor.java:42)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:347)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:194)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:185)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:185)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:181)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Hive Runtime Error while closing operators
        at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.close(MergeFileRecordProcessor.java:184)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:164)
        ... 14 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to close AbstractFileMergeOperator
        at org.apache.hadoop.hive.ql.exec.AbstractFileMergeOperator.closeOp(AbstractFileMergeOperator.java:272)
        at org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.closeOp(OrcFileMergeOperator.java:250)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:620)
        at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.close(MergeFileRecordProcessor.java:176)
        ... 15 more
Caused by: org.apache.hadoop.ipc.RemoteException(java.lang.IllegalStateException): Current inode is not a directory: 000001_0_copy_21(INodeFile@1491d045), parentDir=_tmp.-ext-10000/
        at org.apache.hadoop.hdfs.server.namenode.INode.asDirectory(INode.java:331)
        at org.apache.hadoop.hdfs.server.namenode.FSDirRenameOp.verifyFsLimitsForRename(FSDirRenameOp.java:117)
        at org.apache.hadoop.hdfs.server.namenode.FSDirRenameOp.unprotectedRenameTo(FSDirRenameOp.java:189)
        at org.apache.hadoop.hdfs.server.namenode.FSDirRenameOp.renameTo(FSDirRenameOp.java:492)
        at org.apache.hadoop.hdfs.server.namenode.FSDirRenameOp.renameToInt(FSDirRenameOp.java:73)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameTo(FSNamesystem.java:3938)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rename(NameNodeRpcServer.java:993)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.rename(ClientNamenodeProtocolServerSideTranslatorPB.java:587)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)

        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
        at org.apache.hadoop.ipc.Client.call(Client.java:1498)
        at org.apache.hadoop.ipc.Client.call(Client.java:1398)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at com.sun.proxy.$Proxy11.rename(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.rename(ClientNamenodeProtocolTranslatorPB.java:529)
        at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:291)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:203)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:185)
        at com.sun.proxy.$Proxy12.rename(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.rename(DFSClient.java:2006)
        at org.apache.hadoop.hdfs.DistributedFileSystem.rename(DistributedFileSystem.java:732)
        at org.apache.hadoop.hive.ql.exec.Utilities.moveFile(Utilities.java:1815)
        at org.apache.hadoop.hive.ql.exec.Utilities.renameOrMoveFiles(Utilities.java:1843)
        at org.apache.hadoop.hive.ql.exec.AbstractFileMergeOperator.closeOp(AbstractFileMergeOperator.java:258)
        ... 18 more
], TaskAttempt 1 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.io.IOException: java.lang.reflect.InvocationTargetException
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
        at org.apache.hadoop.hive.ql.exec.tez.MergeFileTezProcessor.run(MergeFileTezProcessor.java:42)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:347)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:194)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:185)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:185)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:181)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: java.lang.reflect.InvocationTargetException
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:266)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:140)
        at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
        at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.run(MergeFileRecordProcessor.java:150)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:150)
        ... 14 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedConstructorAccessor18.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:252)
        ... 18 more
Caused by: java.io.FileNotFoundException: File does not exist: /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats_old/fab=C2WF/lot_partition=Q9053/000001_0_copy_993
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2025)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1996)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
        at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1240)
        at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1225)
        at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1213)
        at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:309)
        at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:274)
        at org.apache.hadoop.hdfs.DFSInputStream.&lt;init&gt;(DFSInputStream.java:266)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1538)
        at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:332)
        at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:327)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:340)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:786)
        at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractFileTail(ReaderImpl.java:355)
        at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:319)
        at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:241)
        at org.apache.hadoop.hive.ql.io.orc.OrcFileStripeMergeRecordReader.<init>(OrcFileStripeMergeRecordReader.java:47)
        at org.apache.hadoop.hive.ql.io.orc.OrcFileStripeMergeInputFormat.getRecordReader(OrcFileStripeMergeInputFormat.java:37)
        at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:67)
        ... 22 more
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats_old/fab=C2WF/lot_partition=Q9053/000001_0_copy_993
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2025)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1996)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)

        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
        at org.apache.hadoop.ipc.Client.call(Client.java:1498)
        at org.apache.hadoop.ipc.Client.call(Client.java:1398)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:272)
        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:291)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:203)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:185)
        at com.sun.proxy.$Proxy12.getBlockLocations(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1238)
        ... 39 more
], TaskAttempt 2 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.io.IOException: java.lang.reflect.InvocationTargetException
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
        at org.apache.hadoop.hive.ql.exec.tez.MergeFileTezProcessor.run(MergeFileTezProcessor.java:42)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:347)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:194)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:185)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:185)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:181)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: java.lang.reflect.InvocationTargetException
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:266)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:140)
        at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
        at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.run(MergeFileRecordProcessor.java:150)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:150)
        ... 14 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:252)
        ... 18 more
Caused by: java.io.FileNotFoundException: File does not exist: /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats_old/fab=C2WF/lot_partition=Q9053/000001_0_copy_969
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2025)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1996)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
        at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1240)
        at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1225)
        at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1213)
        at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:309)
        at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:274)
        at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:266)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1538)
        at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:332)
        at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:327)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:340)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:786)
        at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractFileTail(ReaderImpl.java:355)
        at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:319)
        at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:241)
        at org.apache.hadoop.hive.ql.io.orc.OrcFileStripeMergeRecordReader.<init>(OrcFileStripeMergeRecordReader.java:47)
        at org.apache.hadoop.hive.ql.io.orc.OrcFileStripeMergeInputFormat.getRecordReader(OrcFileStripeMergeInputFormat.java:37)
        at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:67)
        ... 23 more
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats_old/fab=C2WF/lot_partition=Q9053/000001_0_copy_969
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2025)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1996)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)

        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
        at org.apache.hadoop.ipc.Client.call(Client.java:1498)
        at org.apache.hadoop.ipc.Client.call(Client.java:1398)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:272)
        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:291)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:203)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:185)
        at com.sun.proxy.$Proxy12.getBlockLocations(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1238)
        ... 40 more
], TaskAttempt 3 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.io.IOException: java.lang.reflect.InvocationTargetException
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
        at org.apache.hadoop.hive.ql.exec.tez.MergeFileTezProcessor.run(MergeFileTezProcessor.java:42)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:347)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:194)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:185)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:185)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:181)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: java.lang.reflect.InvocationTargetException
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:266)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:140)
        at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
        at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.run(MergeFileRecordProcessor.java:150)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:150)
        ... 14 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:252)
        ... 18 more
Caused by: java.io.FileNotFoundException: File does not exist: /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats_old/fab=C2WF/lot_partition=Q9053/000001_0_copy_969
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2025)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1996)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
        at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1240)
        at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1225)
        at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1213)
        at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:309)
        at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:274)
        at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:266)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1538)
        at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:332)
        at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:327)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:340)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:786)
        at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractFileTail(ReaderImpl.java:355)
        at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:319)
        at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:241)
        at org.apache.hadoop.hive.ql.io.orc.OrcFileStripeMergeRecordReader.<init>(OrcFileStripeMergeRecordReader.java:47)
        at org.apache.hadoop.hive.ql.io.orc.OrcFileStripeMergeInputFormat.getRecordReader(OrcFileStripeMergeInputFormat.java:37)
        at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:67)
        ... 23 more
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats_old/fab=C2WF/lot_partition=Q9053/000001_0_copy_969
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2025)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1996)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)

        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
        at org.apache.hadoop.ipc.Client.call(Client.java:1498)
        at org.apache.hadoop.ipc.Client.call(Client.java:1398)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:272)
        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:291)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:203)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:185)
        at com.sun.proxy.$Proxy12.getBlockLocations(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1238)
        ... 40 more
]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:1, Vertex vertex_1565718945091_143224_2_00 [File Merge] killed/failed due to:OWN_TASK_FAILURE]
ERROR : DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.DDLTask (state=08S01,code=2)

I initially thought it could be an HDFS filesystem issue, so I ran fsck on the partition directory:

hdfs@clientnode:~$ hdfs fsck /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats_old/fab=C2WF/lot_partition=Q9053/
Connecting to namenode via http://namenode01.domain.com:50070/fsck?ugi=hdfs&path=%2Fapps%2Fhive%2Fwarehouse%2Fprod_ews_refined.db%2Ftbl_wafer_param_stats_old%2Ffab%3DC2WF%2Flot_partition%3DQ9053
FSCK started by hdfs (auth:SIMPLE) from /10.75.144.5 for path /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats_old/fab=C2WF/lot_partition=Q9053 at Mon Sep 16 17:01:29 CEST 2019

Status: HEALTHY
 Total size:    159181622 B
 Total dirs:    1
 Total files:   21778
 Total symlinks:                0
 Total blocks (validated):      21778 (avg. block size 7309 B)
 Minimally replicated blocks:   21778 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          8
 Number of racks:               2
FSCK ended at Mon Sep 16 17:01:29 CEST 2019 in 377 milliseconds


The filesystem under path '/apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats_old/fab=C2WF/lot_partition=Q9053' is HEALTHY
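Since fsck reports the partition as healthy, a quick complementary check is to test directly for the exact file mentioned in the error message (a small sketch using standard HDFS commands; the path is simply copied from the stack trace above):

# returns 0 if the file exists, non-zero otherwise
hdfs dfs -test -e /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats_old/fab=C2WF/lot_partition=Q9053/000001_0_copy_993; echo $?
# or list the partition and search for the copy number reported in the error
hdfs dfs -ls /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats_old/fab=C2WF/lot_partition=Q9053 | grep copy_993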

To read all the logs in a convenient manner I strongly encourage using Tez View. To do so, note the application id displayed when executing the command in beeline:

.
.
INFO  : Status: Running (Executing on YARN cluster with App id application_1565718945091_161035)
.
.

Then access the application through a URL of the form http://resourcemanager01.domain.com:8088/cluster/app/application_1565718945091_142464 and click on the ApplicationMaster link in the displayed page. Finally, open the Dag tab and its details; you should see something like:

concatenate01
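If you did not capture the application id from the beeline output, you can also list the YARN applications currently running and pick the right one from there (standard YARN CLI, nothing specific to this issue):

# list running applications with their ids, names and queues, keeping the Hive/Tez ones
yarn application -list -appStates RUNNING | grep -i -E 'tez|hive'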

You can also generate a text file using:

[yarn@clientnode ~]$ yarn logs -applicationId application_1565718945091_141922 > /tmp/yan.txt
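The resulting file quickly becomes huge, so a first pass is usually to filter it for error markers before reading anything in detail (a simple sketch on the file generated above):

# keep only the error-looking lines and count how often each distinct message appears
grep -E 'ERROR|Exception' /tmp/yan.txt | sort | uniq -c | sort -rn | head -20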

In this huge amount of logs I obviously saw plenty of strange error messages, but none of them led to a clear conclusion:

  • |OrcFileMergeOperator|: Incompatible ORC file merge! Writer version mismatch for
  • |tez.RecordProcessor|: Hit error while closing operators – failing tree
  • |tez.TezProcessor|: java.lang.RuntimeException: Hive Runtime Error while closing operators
  • org.apache.hadoop.ipc.RemoteException(java.lang.IllegalStateException): Current inode is not a directory:

But the number of files in the partition was nonetheless decreasing between two successive checks:

hdfs@clientnode:~$ hdfs dfs -ls /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=C2WF/lot_partition=Q9053 | wc -l
21693

hdfs@clientnode:~$ hdfs dfs -ls /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=C2WF/lot_partition=Q9053 | wc -l
21642
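To follow this evolution without retyping the command, a small loop recording the file count over time is enough (a quick sketch; adjust the interval and stop it with Ctrl+C):

while true; do
  echo "$(date '+%H:%M:%S') $(hdfs dfs -ls /apps/hive/warehouse/prod_ews_refined.db/tbl_wafer_param_stats/fab=C2WF/lot_partition=Q9053 | wc -l) files"
  sleep 60
done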

So this is clearly another bug.

Last but not least, one of my teammates noticed a strange behavior while doing a simple count on a partition. He generated a pure Hive table by inserting the rows of the one created by Spark:

0: jdbc:hive2://zookeeer01.domain.com:2181,zoo> select count(*) from prod_ews_refined.tbl_hive_generated where fab="R8WF" and lot_partition="G3473";
+------+--+
| _c0  |
+------+--+
| 137  |
+------+--+
1 row selected (0.045 seconds)
0: jdbc:hive2://zookeeer01.domain.com:2181,zoo> select count(*) from prod_ews_refined.tbl_spark_generated where fab="R8WF" and lot_partition="G3473";
+------+--+
| _c0  |
+------+--+
| 130  |
+------+--+
1 row selected (0.058 seconds)

But when exporting the rows of the Spark-generated table to a CSV file (--outputformat=csv2), the output is the correct one:

Connected to: Apache Hive (version 1.2.1000.2.6.4.0-91)
Driver: Hive JDBC (version 1.2.1000.2.6.4.0-91)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Tez session hasn't been created yet. Opening session
INFO  : Dag name: select * from prod_ew...ot_partition="G3473"(Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id application_1565718945091_133078)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      2          2        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 2.11 s     
--------------------------------------------------------------------------------
137 rows selected (7.998 seconds)
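For reference, such an export can be scripted directly from the shell; a hedged sketch where the JDBC URL and the output file are placeholders for your own HiveServer2 connection string and destination:

beeline -u "$JDBC_URL" --silent=true --outputformat=csv2 \
  -e 'select * from prod_ews_refined.tbl_spark_generated where fab="R8WF" and lot_partition="G3473"' > /tmp/export.csv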

So that is one more bug, which has led to the project of upgrading our cluster to the latest HDP version to prepare the migration to Cloudera, as Hortonworks is dead…

In the meantime, the unsatisfactory workaround we have implemented is to fill a pure Hive table with an INSERT AS SELECT from the Spark-generated table…
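A minimal sketch of that workaround, run through beeline; the table names are the ones used above, but the dynamic partition setting, the JDBC URL and the exact column layout are assumptions to adapt to your own schema:

beeline -u "$JDBC_URL" -e "
  SET hive.exec.dynamic.partition.mode=nonstrict;
  INSERT INTO TABLE prod_ews_refined.tbl_hive_generated PARTITION (fab, lot_partition)
  SELECT * FROM prod_ews_refined.tbl_spark_generated;
"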

How to handle HDFS blocks with corrupted replicas or under replicated https://blog.yannickjaquier.com/hadoop/how-to-handle-hdfs-blocks-with-corrupted-replicas-or-under-replicated.html Wed, 25 Dec 2019 14:00:23 +0000

Table of contents

Preamble

In Ambari, and in HDFS more precisely, there are two widgets that will immediately catch your eye if they are not equal to zero: Blocks With Corrupted Replicas and Under Replicated Blocks. In the graphical interface this looks like:

hdfs_blocks01

Managing under replicated blocks

You get a complete list of impacted files with this command. The egrep part simply removes all the lines made only of dots:

hdfs@client_node:~$ hdfs fsck / | egrep -v '^\.+'
Connecting to namenode via http://namenode01.domain.com:50070/fsck?ugi=hdfs&path=%2F
FSCK started by hdfs (auth:SIMPLE) from /10.75.144.5 for path / at Tue Jul 23 14:58:19 CEST 2019
/tmp/hive/training/dda2d815-27d3-43e0-9d3b-4aea3983c1a9/hive_2018-09-18_12-13-21_550_5155956994526391933-12/_tez_scratch_dir/split_Map_1/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1087250107_13523212. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/ambari-qa/.staging/job_1525749609269_2679/libjars/hive-shims-0.23-1.2.1000.2.6.4.0-91.jar:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1074437588_697325. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/ambari-qa/.staging/job_1525749609269_2679/libjars/hive-shims-common-1.2.1000.2.6.4.0-91.jar:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1074437448_697185. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/hdfs/.staging/job_1541585350344_0008/job.jar:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1092991377_19266614. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/hdfs/.staging/job_1541585350344_0008/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1092991378_19266615. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/hdfs/.staging/job_1541585350344_0041/job.jar:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1093022580_19297823. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/hdfs/.staging/job_1541585350344_0041/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1093022581_19297824. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/training/.staging/job_1519657336782_0105/job.jar:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1073754565_13755. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/training/.staging/job_1519657336782_0105/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1073754566_13756. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/training/.staging/job_1519657336782_0105/libjars/hive-hcatalog-core.jar:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1073754564_13754. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/training/.staging/job_1536057043538_0001/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1085621525_11894367. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/training/.staging/job_1536057043538_0002/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1085621527_11894369. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/training/.staging/job_1536057043538_0004/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1085621593_11894435. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/training/.staging/job_1536057043538_0023/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1085622064_11894906. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/training/.staging/job_1536057043538_0025/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1085622086_11894928. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/training/.staging/job_1536057043538_0027/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1085622115_11894957. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/training/.staging/job_1536057043538_0028/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1085622133_11894975. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/training/.staging/job_1536642465198_0002/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1086397707_12670663. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/training/.staging/job_1536642465198_0003/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1086397706_12670662. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/training/.staging/job_1536642465198_0004/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1086397708_12670664. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/training/.staging/job_1536642465198_0005/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1086397718_12670674. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/training/.staging/job_1536642465198_0006/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1086397720_12670676. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/training/.staging/job_1536642465198_0007/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1086397721_12670677. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
/user/training/.staging/job_1536642465198_2307/job.split:  Under replicated BP-1711156358-10.75.144.1-1519036486930:blk_1086509846_12782817. Target Replicas is 10 but found 8 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
 Total size:    25053842686185 B (Total open files size: 33152660866 B)
 Total dirs:    1500114
 Total files:   12517972
 Total symlinks:                0 (Files currently being written: 268)
 Total blocks (validated):      12534979 (avg. block size 1998714 B) (Total open file blocks (not validated): 325)
 Minimally replicated blocks:   12534979 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       24 (1.9146422E-4 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     2.9715698
 Corrupt blocks:                0
 Missing replicas:              48 (1.2886386E-4 %)
 Number of data-nodes:          8
 Number of racks:               2
FSCK ended at Tue Jul 23 15:01:08 CEST 2019 in 169140 milliseconds


The filesystem under path '/' is HEALTHY

All those files have the strange “Target Replicas is 10 but found 8 live replica(s)” message while our default replication factor is 3, so I decided to set it back to 3 with:

hdfs@client_node:~$ hdfs dfs -setrep 3 /user/training/.staging/job_1536642465198_2307/job.split
Replication 3 set: /user/training/.staging/job_1536642465198_2307/job.split

And it solved the issue (24 to 23 under replicated blocks):

 Total size:    25009165459559 B (Total open files size: 1006863 B)
 Total dirs:    1500150
 Total files:   12528930
 Total symlinks:                0 (Files currently being written: 47)
 Total blocks (validated):      12545325 (avg. block size 1993504 B) (Total open file blocks (not validated): 43)
 Minimally replicated blocks:   12545325 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       23 (1.8333523E-4 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     2.9715445
 Corrupt blocks:                0
 Missing replicas:              46 (1.2339374E-4 %)
 Number of data-nodes:          8
 Number of racks:               2
FSCK ended at Tue Jul 23 15:49:27 CEST 2019 in 155922 milliseconds

Which we can also see graphically in Ambari:

hdfs_blocks02
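Rather than fixing the files one by one, the fsck output itself can be scripted; a sketch that assumes the output format shown above, with the file path before the first colon:

# grab every file reporting the bogus replication target and reset it to 3
hdfs fsck / | grep 'Target Replicas is 10' | awk -F':' '{print $1}' | sort -u \
  | xargs -r -n1 hdfs dfs -setrep 3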

There is also the option to delete a file if it is of no interest or if, like in my case, it is an old temporary file like this one from my list:

hdfs@client_node:~$ hdfs dfs -ls -d /tmp/hive/training/dda2d815*
drwx------   - training hdfs          0 2018-09-18 12:53 /tmp/hive/training/dda2d815-27d3-43e0-9d3b-4aea3983c1a9
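Before removing it you may want to check how much space the directory actually occupies (standard HDFS command):

hdfs dfs -du -s -h /tmp/hive/training/dda2d815-27d3-43e0-9d3b-4aea3983c1a9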

Then delete it, skipping the trash (no recovery possible; if you are unsure, keep the trash activated, but you will then have to wait for HDFS to purge the trash after fs.trash.interval minutes):

hdfs@client_node:~$ hdfs dfs -rm -r -f -skipTrash /tmp/hive/training/dda2d815-27d3-43e0-9d3b-4aea3983c1a9
Deleted /tmp/hive/training/dda2d815-27d3-43e0-9d3b-4aea3983c1a9
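As a side note, if you prefer to keep the trash enabled you can check how long deleted files are retained before being purged (the property value is expressed in minutes):

hdfs getconf -confKey fs.trash.interval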

Which obviously gives the same kind of result (one more under-replicated block gone):

 Total size:    25019302903260 B (Total open files size: 395112 B)
 Total dirs:    1439889
 Total files:   12474754
 Total symlinks:                0 (Files currently being written: 69)
 Total blocks (validated):      12491149 (avg. block size 2002962 B) (Total open file blocks (not validated): 57)
 Minimally replicated blocks:   12491149 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       22 (1.7612471E-4 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     2.971359
 Corrupt blocks:                0
 Missing replicas:              44 (1.185481E-4 %)
 Number of data-nodes:          8
 Number of racks:               2
FSCK ended at Tue Jul 23 16:18:51 CEST 2019 in 141604 milliseconds

Managing blocks with corrupted replicas

Simulating corrupted blocks is not a piece of cake and, even if my Ambari display is showing corrupted blocks, in reality I have none:

hdfs@client_node:~$ hdfs fsck -list-corruptfileblocks
Connecting to namenode via http://namenode01.domain.com:50070/fsck?ugi=hdfs&listcorruptfileblocks=1&path=%2F
The filesystem under path '/' has 0 CORRUPT files

On a TI environment that plenty of people have played with, I finally got the (hopefully) rare corrupted blocks message. In real life it should not happen as, by default, your HDFS replication factor is 3 (it even prevented me from starting HBase):

[hdfs@client_node ~]$ hdfs dfsadmin -report
Configured Capacity: 6139207680 (5.72 GB)
Present Capacity: 5701219000 (5.31 GB)
DFS Remaining: 2659930112 (2.48 GB)
DFS Used: 3041288888 (2.83 GB)
DFS Used%: 53.34%
Replicated Blocks:
        Under replicated blocks: 0
        Blocks with corrupt replicas: 0
        Missing blocks: 10
        Missing blocks (with replication factor 1): 0
        Pending deletion blocks: 0
Erasure Coded Block Groups:
        Low redundancy block groups: 0
        Block groups with corrupt internal blocks: 0
        Missing block groups: 0
        Pending deletion blocks: 0
.
.

More precisely:

[hdfs@client_node ~]$ hdfs fsck -list-corruptfileblocks
Connecting to namenode via http://namenode01.domain.com:50070/fsck?ugi=hdfs&listcorruptfileblocks=1&path=%2F
The list of corrupt files under path '/' are:
blk_1073744038  /hdp/apps/3.0.0.0-1634/mapreduce/mapreduce.tar.gz
blk_1073744039  /hdp/apps/3.0.0.0-1634/mapreduce/mapreduce.tar.gz
blk_1073744040  /hdp/apps/3.0.0.0-1634/mapreduce/mapreduce.tar.gz
blk_1073744041  /hdp/apps/3.0.0.0-1634/yarn/service-dep.tar.gz
blk_1073744042  /apps/hbase/data/hbase.version
blk_1073744043  /apps/hbase/data/hbase.id
blk_1073744044  /apps/hbase/data/data/hbase/meta/1588230740/.regioninfo
blk_1073744045  /apps/hbase/data/data/hbase/meta/.tabledesc/.tableinfo.0000000001
blk_1073744046  /apps/hbase/data/MasterProcWALs/pv2-00000000000000000001.log
blk_1073744047  /apps/hbase/data/WALs/datanode01.domain.com,16020,1576839178870/datanode01.domain.com%2C16020%2C1576839178870.1576839191114
The filesystem under path '/' has 10 CORRUPT files
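Before deleting anything you can ask fsck for more details on one of the impacted files, to see which blocks are missing and where they were supposed to live (standard fsck options):

hdfs fsck /apps/hbase/data/hbase.version -files -blocks -locations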

Here is why I was not able to start components related to HBase and even the HBase daemon was failing:

[hdfs@client_node ~]$ hdfs dfs -cat /apps/hbase/data/hbase.version
19/12/20 15:41:12 WARN hdfs.DFSClient: No live nodes contain block BP-369465004-10.75.46.68-1539340329592:blk_1073744042_3218 after checking nodes = [], ignoredNodes = null
19/12/20 15:41:12 INFO hdfs.DFSClient: No node available for BP-369465004-10.75.46.68-1539340329592:blk_1073744042_3218 file=/apps/hbase/data/hbase.version
19/12/20 15:41:12 INFO hdfs.DFSClient: Could not obtain BP-369465004-10.75.46.68-1539340329592:blk_1073744042_3218 from any node:  No live nodes contain current block Block locations: Dead nodes: . Will get new block locations from namenode and retry...
19/12/20 15:41:12 WARN hdfs.DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 2312.7779354694017 msec.
19/12/20 15:41:14 WARN hdfs.DFSClient: No live nodes contain block BP-369465004-10.75.46.68-1539340329592:blk_1073744042_3218 after checking nodes = [], ignoredNodes = null
19/12/20 15:41:14 INFO hdfs.DFSClient: No node available for BP-369465004-10.75.46.68-1539340329592:blk_1073744042_3218 file=/apps/hbase/data/hbase.version
19/12/20 15:41:14 INFO hdfs.DFSClient: Could not obtain BP-369465004-10.75.46.68-1539340329592:blk_1073744042_3218 from any node:  No live nodes contain current block Block locations: Dead nodes: . Will get new block locations from namenode and retry...
19/12/20 15:41:14 WARN hdfs.DFSClient: DFS chooseDataNode: got # 2 IOException, will wait for 8143.439259842901 msec.
19/12/20 15:41:22 WARN hdfs.DFSClient: No live nodes contain block BP-369465004-10.75.46.68-1539340329592:blk_1073744042_3218 after checking nodes = [], ignoredNodes = null
19/12/20 15:41:22 INFO hdfs.DFSClient: No node available for BP-369465004-10.75.46.68-1539340329592:blk_1073744042_3218 file=/apps/hbase/data/hbase.version
19/12/20 15:41:22 INFO hdfs.DFSClient: Could not obtain BP-369465004-10.75.46.68-1539340329592:blk_1073744042_3218 from any node:  No live nodes contain current block Block locations: Dead nodes: . Will get new block locations from namenode and retry...
19/12/20 15:41:22 WARN hdfs.DFSClient: DFS chooseDataNode: got # 3 IOException, will wait for 11760.036759939097 msec.
19/12/20 15:41:34 WARN hdfs.DFSClient: No live nodes contain block BP-369465004-10.75.46.68-1539340329592:blk_1073744042_3218 after checking nodes = [], ignoredNodes = null
19/12/20 15:41:34 WARN hdfs.DFSClient: Could not obtain block: BP-369465004-10.75.46.68-1539340329592:blk_1073744042_3218 file=/apps/hbase/data/hbase.version No live nodes contain current block Block locations: Dead nodes: . Throwing a BlockMissingException
19/12/20 15:41:34 WARN hdfs.DFSClient: No live nodes contain block BP-369465004-10.75.46.68-1539340329592:blk_1073744042_3218 after checking nodes = [], ignoredNodes = null
19/12/20 15:41:34 WARN hdfs.DFSClient: Could not obtain block: BP-369465004-10.75.46.68-1539340329592:blk_1073744042_3218 file=/apps/hbase/data/hbase.version No live nodes contain current block Block locations: Dead nodes: . Throwing a BlockMissingException
19/12/20 15:41:34 WARN hdfs.DFSClient: DFS Read
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-369465004-10.75.46.68-1539340329592:blk_1073744042_3218 file=/apps/hbase/data/hbase.version
        at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:870)
        at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:853)
        at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:832)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:564)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:754)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:820)
        at java.io.DataInputStream.read(DataInputStream.java:100)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:94)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:68)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:129)
        at org.apache.hadoop.fs.shell.Display$Cat.printToStdout(Display.java:101)
        at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:96)
        at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
        at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:303)
        at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:285)
        at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:269)
        at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:120)
        at org.apache.hadoop.fs.shell.Command.run(Command.java:176)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:328)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:391)
cat: Could not obtain block: BP-369465004-10.75.46.68-1539340329592:blk_1073744042_3218 file=/apps/hbase/data/hbase.version

Here there is no particular miracle to solve it, other than deleting the impacted files:

[hdfs@client_node ~]$ hdfs fsck / -delete
Connecting to namenode via http://namenode01.domain.com:50070/fsck?ugi=hdfs&delete=1&path=%2F
FSCK started by hdfs (auth:SIMPLE) from /10.75.46.69 for path / at Fri Dec 20 15:41:53 CET 2019

/apps/hbase/data/MasterProcWALs/pv2-00000000000000000001.log: MISSING 1 blocks of total size 34358 B.
/apps/hbase/data/WALs/datanode01.domain.com,16020,1576839178870/datanode01.domain.com%2C16020%2C1576839178870.1576839191114: MISSING 1 blocks of total size 98 B.
/apps/hbase/data/data/hbase/meta/.tabledesc/.tableinfo.0000000001: MISSING 1 blocks of total size 996 B.
/apps/hbase/data/data/hbase/meta/1588230740/.regioninfo: MISSING 1 blocks of total size 32 B.
/apps/hbase/data/hbase.id: MISSING 1 blocks of total size 42 B.
/apps/hbase/data/hbase.version: MISSING 1 blocks of total size 7 B.
/hdp/apps/3.0.0.0-1634/mapreduce/mapreduce.tar.gz: MISSING 3 blocks of total size 306766434 B.
/hdp/apps/3.0.0.0-1634/yarn/service-dep.tar.gz: MISSING 1 blocks of total size 92348160 B.
Status: CORRUPT
 Number of data-nodes:  3
 Number of racks:               1
 Total dirs:                    780
 Total symlinks:                0

Replicated Blocks:
 Total size:    1404802556 B
 Total files:   190 (Files currently being written: 1)
 Total blocks (validated):      47 (avg. block size 29889416 B)
  ********************************
  UNDER MIN REPL'D BLOCKS:      10 (21.276596 %)
  MINIMAL BLOCK REPLICATION:    1
  CORRUPT FILES:        8
  MISSING BLOCKS:       10
  MISSING SIZE:         399150127 B
  ********************************
 Minimally replicated blocks:   37 (78.723404 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     2.3617022
 Missing blocks:                10
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)

Erasure Coded Block Groups:
 Total size:    0 B
 Total files:   0
 Total block groups (validated):        0
 Minimally erasure-coded block groups:  0
 Over-erasure-coded block groups:       0
 Under-erasure-coded block groups:      0
 Unsatisfactory placement block groups: 0
 Average block group size:      0.0
 Missing block groups:          0
 Corrupt block groups:          0
 Missing internal blocks:       0
FSCK ended at Fri Dec 20 15:41:54 CET 2019 in 306 milliseconds


The filesystem under path '/' is CORRUPT

Finally solved (with data loss of course):

[hdfs@client_node ~]$ hdfs fsck -list-corruptfileblocks
Connecting to namenode via http://namenode01.domain.com:50070/fsck?ugi=hdfs&listcorruptfileblocks=1&path=%2F
The filesystem under path '/' has 0 CORRUPT files

And:

[hdfs@client_node ~]$ hdfs dfsadmin -report
Configured Capacity: 6139207680 (5.72 GB)
Present Capacity: 5701216450 (5.31 GB)
DFS Remaining: 2659930112 (2.48 GB)
DFS Used: 3041286338 (2.83 GB)
DFS Used%: 53.34%
Replicated Blocks:
        Under replicated blocks: 0
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Pending deletion blocks: 0
Erasure Coded Block Groups:
        Low redundancy block groups: 0
        Block groups with corrupt internal blocks: 0
        Missing block groups: 0
        Pending deletion blocks: 0
