Spark SQL fails when a specified partition path does not exist

I am using the Hive Metastore in EMR. I am able to query the table manually through HiveQL, but when I use the same table in a Spark job, it says Input path does not exist: s3://

Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: s3://....

I have deleted that partition's path in s3://.., but the query still works in Hive without dropping the partition at the table level. However, it does not work in PySpark.

Here is my full code

from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext
from pyspark.sql import SparkSession

sc = SparkContext(appName = "test")
sqlContext = SQLContext(sparkContext=sc)
sqlContext.sql("select count(*) from logan_test.salary_csv").show()
print("done..")

I submitted my job as below so that it uses the Hive catalog tables.

spark-submit --files /usr/lib/hive/conf/hive-site.xml test.py

1 Answer

I have had a similar error with HDFS, where the metastore kept a partition for the table but the directory was missing.

Check S3... If the path is missing, or you deleted it, you need to run MSCK REPAIR TABLE from Hive. Sometimes this doesn't work, and you actually do need a DROP PARTITION.
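
If you prefer to issue those statements from Spark rather than the Hive CLI, a minimal sketch could look like the following (assuming a Hive-enabled SparkSession; the table name comes from the question, while the partition column and value are hypothetical placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repair").enableHiveSupport().getOrCreate()

# Re-register partitions for directories that still exist in S3
spark.sql("MSCK REPAIR TABLE logan_test.salary_csv")

# If the directory was removed on purpose, drop the stale partition instead.
# The partition column (year) and its value here are hypothetical placeholders.
spark.sql("ALTER TABLE logan_test.salary_csv DROP IF EXISTS PARTITION (year='2018')")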

The relevant property is spark.sql.hive.verifyPartitionPath. It is false by default; when set to true, Spark checks the partition paths and skips the ones that no longer exist. You can set configuration properties by passing a SparkConf object to SparkContext:

from pyspark import SparkConf, SparkContext

# Skip partition directories that no longer exist instead of failing the job
conf = SparkConf().setAppName("test").set("spark.sql.hive.verifyPartitionPath", "true")
sc = SparkContext(conf=conf)
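
For completeness, a rough sketch of re-running the query from the question against that context, mirroring the SQLContext pattern the question already uses:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sparkContext=sc)
sqlContext.sql("select count(*) from logan_test.salary_csv").show()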

Or, the Spark 2 way is to use a SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("test") \
    .config("spark.sql.hive.verifyPartitionPath", "true") \
    .enableHiveSupport() \
    .getOrCreate()
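
The query from the question can then be run directly on the session:

spark.sql("select count(*) from logan_test.salary_csv").show()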