I have been trying to run this code in PySpark:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
datumDF = sqlContext.createDataFrame(datumX, schema)
But I keep getting this error:
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o44))
I spin up clusters on AWS and log in with this command:
/User/Downloads/spark-1.5.2-bin-hadoop2.6/ec2/spark-ec2 -k name -i /User/Desktop/pemfile.pem login clustername
However, all the docs I've found involve the build commands listed further down, which live in this folder:
/users/downloads/spark-1.5.2/
I've run them anyway, and tried logging into AWS using the ec2 command in that folder afterwards. Still, I just got the same error.
I run export SPARK_HIVE=TRUE before running these commands on my local machine, but I've seen messages saying it's deprecated and will be ignored anyway.
Build with Hive using Maven:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
Build with Hive using sbt:
build/sbt -Pyarn -Phadoop-2.3 assembly
And another one I found:
./sbt/sbt -Phive assembly
I also took the hive-site.xml file and put it in both the /Users/Downloads/spark-1.5.2-bin-hadoop2.6/conf folder and the /Users/Downloads/spark-1.5.2/conf folder.
Still no luck.
I can't seem to run the Hive commands no matter what I build it with or how I log in. Is there anything obvious I'm missing?
I too had the same error when using a HiveContext on an EC2 cluster built with the ec2 scripts that come with the Spark package (v1.5.2 in my case). Through much trial and error, I found that building an EC2 cluster with the options below got the right version of Hadoop with Hive properly built, so that I can use a HiveContext in my PySpark jobs.
The key parameters are --spark-version, set to 1.5.2, and --hadoop-major-version, set to yarn, even though you aren't going to use YARN to submit jobs, as this forces the Hadoop build to be 2.4. Of course, adjust the other parameters as appropriate for your desired cluster.
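As a rough illustration of how those two parameters fit into a launch invocation, here is a minimal Python sketch that only assembles and prints the command line; it does not run anything. The helper name build_launch_command, the slave count, and the reuse of the key name, pem path and cluster name from the question are all assumptions for the sake of the example. Only --spark-version 1.5.2 and --hadoop-major-version yarn come from the description above.

```python
import shlex


def build_launch_command(spark_ec2, key_name, identity_file, cluster_name, slaves=2):
    """Assemble a spark-ec2 launch command as an argument list.

    Hypothetical helper for illustration; the two version flags are the
    parameters the answer identifies as the important ones.
    """
    return [
        spark_ec2,
        "-k", key_name,                    # EC2 key pair name (placeholder)
        "-i", identity_file,               # path to the .pem file (placeholder)
        "-s", str(slaves),                 # number of slaves (assumed value)
        "--spark-version", "1.5.2",        # key parameter from the answer
        "--hadoop-major-version", "yarn",  # selects the Hadoop 2.4 build
        "launch", cluster_name,
    ]


cmd = build_launch_command(
    "/User/Downloads/spark-1.5.2-bin-hadoop2.6/ec2/spark-ec2",
    "name",
    "/User/Desktop/pemfile.pem",
    "clustername",
)

# Print a shell-ready command line that can be reviewed before running it.
print(" ".join(shlex.quote(part) for part in cmd))
```

The printed line can be pasted into a terminal once the placeholders match your own key pair, identity file and cluster name.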