AWS EMR 5.11.0 - Apache Hive on Spark

Posted 2020-03-26 05:15

Question:

I am trying to set up Apache Hive on Spark on AWS EMR 5.11.0 (Apache Spark 2.2.1, Apache Hive 2.3.2). The YARN logs show the error below:

18/01/28 21:55:28 ERROR ApplicationMaster: User class threw exception: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
    at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:47)
    at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
    at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)

hive-server2.log:
2018-01-28T21:56:50,109 ERROR [HiveServer2-Background-Pool: Thread-68([])]: client.SparkClientImpl (SparkClientImpl.java:<init>(112)) - Timed out waiting for client to connect. Possible reasons include network issues, errors in remote driver or the cluster has no available resources, etc. Please check YARN or Spark driver's logs for further information.
java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out waiting for client connection.
    at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:41) ~[netty-all-4.0.52.Final.jar:4.0.52.Final]
    at org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:109) ~[hive-exec-2.3.2-amzn-0.jar:2.3.2-amzn-0]
    at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80) ~[hive-exec-2.3.2-amzn-0.jar:2.3.2-amzn-0]
    at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.createRemoteClient(RemoteHiveSparkClient.java:101) ~[hive-exec-2.3.2-amzn-0.jar:2.3.2-amzn-0]
    at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<init>(RemoteHiveSparkClient.java:97) ~[hive-exec-2.3.2-amzn-0.jar:2.3.2-amzn-0]
    at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:73) ~[hive-exec-2.3.2-amzn-0.jar:2.3.2-amzn-0]
    at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:62) ~[hive-exec-2.3.2-amzn-0.jar:2.3.2-amzn-0]
    at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:115) ~[hive-exec-2.3.2-amzn-0.jar:2.3.2-amzn-0]
    at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:126) ~[hive-exec-2.3.2-amzn-0.jar:2.3.2-amzn-0]

Also:
2018-01-28T21:56:50,110 ERROR [HiveServer2-Background-Pool: Thread-68([])]: spark.SparkTask (SessionState.java:printError(1126)) - Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark client.)'
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.
    at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:64)
    at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:115)
    at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:126)
    at org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:103)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1232)
    at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:255)
    at org.apache.hive.service.cli.operation.SQLOperation.access$800(SQLOperation.java:91)
    at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:348)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:362)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out waiting for client connection.

Could anyone point out what I may be missing in the configuration?

Answer 1:

Sorry, but Hive on Spark is not yet supported on EMR. I have not tried it myself yet, but I think the likely cause of your errors might be a mismatch between the version of Spark supported on EMR and the version of Spark upon which Hive depends. The last time I checked, Hive did not support Spark 2.x when running Hive on Spark. Given that your first error is a NoSuchFieldError, it seems like a version mismatch is the most likely cause. The timeout error may be a red herring.
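
As a quick way to confirm the suspected version mismatch, the following commands (standard tools on an EMR master node; a sketch, not part of the original answer) print the installed Hive and Spark versions:

# Print the installed Hive and Spark versions on the EMR master node.
hive --version
spark-submit --version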



Answer 2:

EMR's Spark supports Hive version 1.2.1, not Hive 2.x. Could you please check the Hive jar versions available in the /usr/lib/spark/jars/ directory? SPARK_RPC_SERVER_ADDRESS was only added in Hive 2.x.
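
For example, listing the Hive jars that Spark ships with should show the mismatch (a sketch; the path is the EMR default mentioned above, and the versions appear in the file names):

# List the Hive jars bundled with Spark on the EMR master node.
ls /usr/lib/spark/jars/ | grep -i hive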



Answer 3:

The sbt (or pom.xml) dependencies should look like the following:

"org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",

"org.apache.spark" %% "spark-sql" % sparkVersion % "provided",

"org.apache.spark" %% "spark-hive" % sparkVersion % "provided",

I am running a data warehouse (Hive) on EMR, and the Spark application stores its data into the DWH.
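
For context, a build-and-submit flow for such an application might look like the sketch below; the class name and jar path are placeholders, not taken from the answer above:

# Hypothetical flow: package the app with sbt, then submit it to YARN on the EMR master node.
sbt package
spark-submit --class com.example.DwhApp --master yarn target/scala-2.11/dwh-app_2.11-0.1.jar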



Answer 4:

I was able to run Hive on Spark by launching it like this:

HIVE_AUX_JARS_PATH=$(find /usr/lib/spark/jars/ -name '*.jar' -and -not -name '*slf4j-log4j12*' -printf '%p:' | head -c-1) hive

Then, before issuing other SQL queries:

SET hive.execution.engine = spark;

To make that persistent, add the line

export HIVE_AUX_JARS_PATH=$(find /usr/lib/spark/jars/ -name '*.jar' -and -not -name '*slf4j-log4j12*' -printf '%p:' | head -c-1)

to /home/hadoop/.bashrc.

And in /etc/hive/conf/hive-site.xml, set:

<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
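
To verify the setup end to end, a quick check like the following should work (a sketch; the table name is a placeholder, not from the answer above):

# Pick up the HIVE_AUX_JARS_PATH export added above, then run a test query on the Spark engine.
source /home/hadoop/.bashrc
hive -e "SET hive.execution.engine=spark; SELECT count(*) FROM my_table;"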