Question:

I'm attempting to run a pyspark script on BigInsights on Cloud 4.2 Enterprise that accesses a Hive table.

First I create the hive table:

[biadmin@bi4c-xxxxx-mastermanager ~]$ hive
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 2.147 seconds
hive> LOAD DATA LOCAL INPATH '/usr/iop/4.2.0.0/hive/doc/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Loading data to table default.pokes
Table default.pokes stats: [numFiles=1, numRows=0, totalSize=5812, rawDataSize=0]
OK
Time taken: 0.49 seconds
hive>

Then I create a simple pyspark script:

[biadmin@bi4c-xxxxxx-mastermanager ~]$ cat test_pokes.py
from pyspark import SparkContext

sc = SparkContext()

from pyspark.sql import HiveContext
hc = HiveContext(sc)

pokesRdd = hc.sql('select * from pokes')
print( pokesRdd.collect() )

I attempt to execute with:

[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit \
    --master yarn-cluster \
    --deploy-mode cluster \
    --jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar, \
           /usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar, \
           /usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
    test_pokes.py

However, I encounter the error:

Traceback (most recent call last):
  File "test_pokes.py", line 8, in <module>
    pokesRdd = hc.sql('select * from pokes')
  File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/pyspark.zip/pyspark/sql/context.py", line 580, in sql
  File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/pyspark.zip/pyspark/sql/utils.py", line 51, in deco
pyspark.sql.utils.AnalysisException: u'Table not found: pokes; line 1 pos 14'
End of LogType:stdout

If I run spark-submit on its own, without the YARN cluster options, the table is found:

[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit test_pokes.py
…
…
16/12/21 13:09:13 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 18962 bytes result sent to driver
16/12/21 13:09:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 168 ms on localhost (1/1)
16/12/21 13:09:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/12/21 13:09:13 INFO DAGScheduler: ResultStage 0 (collect at /home/biadmin/test_pokes.py:9) finished in 0.179 s
16/12/21 13:09:13 INFO DAGScheduler: Job 0 finished: collect at /home/biadmin/test_pokes.py:9, took 0.236558 s
[Row(foo=238, bar=u'val_238'), Row(foo=86, bar=u'val_86'), Row(foo=311, bar=u'val_311')
…
…

See my previous question related to this issue: hive spark yarn-cluster job fails with: "ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory"

This question is similar to this other question: Spark can access Hive table from pyspark but not from spark-submit. However, unlike that question I am using HiveContext.

Update: see here for the final solution https://stackoverflow.com/a/41272260/1033422

Answer 1:

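This happens because the spark-submit job cannot find hive-site.xml, so it cannot connect to the Hive metastore. In cluster deploy mode the driver runs inside a YARN container on a cluster node, so the Hive configuration has to be shipped with the job. Add --files /usr/iop/4.2.0.0/hive/conf/hive-site.xml to your spark-submit command.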
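
As a rough illustration of that advice, the sketch below assembles the full command for the script in the question. The helper name build_submit_command is hypothetical, and the paths are the ones quoted in the question, so adjust them for your cluster. It also joins the --jars entries with commas and no spaces, since spark-submit expects that option as a single comma-separated value.

```python
import shlex


def build_submit_command(script, hive_site, jars=()):
    """Return the spark-submit argument list for a YARN cluster-mode job.

    Hypothetical helper for illustration only; it just builds the list
    and does not run anything.
    """
    cmd = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        # Ship the Hive configuration with the job so the driver can
        # locate the metastore when it runs on a cluster node.
        "--files", hive_site,
    ]
    if jars:
        # --jars takes one comma-separated value with no spaces.
        cmd += ["--jars", ",".join(jars)]
    cmd.append(script)
    return cmd


# Paths taken from the question; they are assumptions about your install.
hive_home = "/usr/iop/4.2.0.0/hive"
command = build_submit_command(
    "test_pokes.py",
    hive_home + "/conf/hive-site.xml",
    jars=[
        hive_home + "/lib/datanucleus-api-jdo-3.2.6.jar",
        hive_home + "/lib/datanucleus-core-3.2.10.jar",
        hive_home + "/lib/datanucleus-rdbms-3.2.9.jar",
    ],
)

# Print a shell-ready version of the command.
print(" ".join(shlex.quote(part) for part in command))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run(command) if you prefer to launch the job from Python.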

Answer 2:

It looks like you are affected by this bug: https://issues.apache.org/jira/browse/SPARK-15345.

I had a similar issue with Spark 1.6.2 and 2.0.0 on HDP-2.5.0.0.
My goal was to create a dataframe from a Hive SQL query, under these conditions:

Python API,
cluster deploy-mode (driver program running on one of the executor nodes),
YARN to manage the executor JVMs (instead of a standalone Spark master instance).

The initial tests gave these results:

1. spark-submit --deploy-mode client --master local ... => WORKING
2. spark-submit --deploy-mode client --master yarn ... => WORKING
3. spark-submit --deploy-mode cluster --master yarn ... => NOT WORKING

In case #3, the driver running on one of the executor nodes could not find the database. The error was:

pyspark.sql.utils.AnalysisException: 'Table or view not found: `database_name`.`table_name`; line 1 pos 14'

Fokko Driesprong's answer listed above worked for me.
With the command below, the driver running on the executor node was able to access a Hive table in a database other than default:

$ /usr/hdp/current/spark2-client/bin/spark-submit \
--deploy-mode cluster --master yarn \
--files /usr/hdp/current/spark2-client/conf/hive-site.xml \
/path/to/python/code.py

This is the Python code I used to test with Spark 1.6.2 and Spark 2.0.0. (Change SPARK_VERSION to 1 to test with Spark 1.6.2, and update the paths in the spark-submit command accordingly.)

SPARK_VERSION = 2  # set to 1 to test with Spark 1.6.2
APP_NAME = 'spark-sql-python-test_SV,' + str(SPARK_VERSION)


def spark1():
    # Spark 1.x entry point: a HiveContext built on top of a SparkContext
    from pyspark.sql import HiveContext
    from pyspark import SparkContext, SparkConf

    conf = SparkConf().setAppName(APP_NAME)
    sc = SparkContext(conf=conf)
    hc = HiveContext(sc)

    query = 'select * from database_name.table_name limit 5'
    df = hc.sql(query)
    printout(df)


def spark2():
    # Spark 2.x entry point: a SparkSession with Hive support enabled
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName(APP_NAME).enableHiveSupport().getOrCreate()
    query = 'select * from database_name.table_name limit 5'
    df = spark.sql(query)
    printout(df)


def printout(df):
    # Show the dataframe, its row count, and the rows collected to the driver
    print('\n########################################################################')
    df.show()
    print(df.count())

    df_list = df.collect()
    print(df_list)
    print(df_list[0])
    print(df_list[1])
    print('########################################################################\n')


def main():
    # Pick the entry point that matches the Spark version under test
    if SPARK_VERSION == 1:
        spark1()
    elif SPARK_VERSION == 2:
        spark2()


if __name__ == '__main__':
    main()
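
If the table is still not found, it may help to check whether the driver can see hive-site.xml at all. The sketch below is a hypothetical diagnostic, not part of the test script above. It assumes that files passed with --files end up in the working directory of the YARN container, and it also looks in the directories named by the SPARK_CONF_DIR and HIVE_CONF_DIR environment variables when they are set.

```python
import os


def find_hive_site(extra_dirs=()):
    """Return the path of the first hive-site.xml found, or None.

    Hypothetical helper: it checks the current working directory first,
    then SPARK_CONF_DIR and HIVE_CONF_DIR if set, then any extra_dirs.
    """
    candidates = [os.getcwd()]
    for var in ("SPARK_CONF_DIR", "HIVE_CONF_DIR"):
        value = os.environ.get(var)
        if value:
            candidates.append(value)
    candidates.extend(extra_dirs)

    for directory in candidates:
        path = os.path.join(directory, "hive-site.xml")
        if os.path.isfile(path):
            return path
    return None


# In cluster mode this line appears in the driver's stdout in the YARN logs.
print("hive-site.xml visible to driver:", find_hive_site())
```

Calling find_hive_site() near the top of the PySpark script and reading the value from the YARN application log shows whether the --files option took effect before any Hive query runs.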