I have this strange behavior. My use case is to write a Spark DataFrame to a Hive partitioned table by using
sqlContext.sql("INSERT OVERWRITE TABLE <table> PARTITION (<partition column>) SELECT * FROM <temp table from dataframe>")
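For context, the full pattern looks roughly like this (a minimal sketch with placeholder table and column names; it assumes a HiveContext and that dynamic partitioning is enabled):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext()
    sqlContext = HiveContext(sc)

    # dynamic partitioning has to be enabled for INSERT OVERWRITE ... PARTITION (col)
    sqlContext.sql("SET hive.exec.dynamic.partition = true")
    sqlContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    df = sqlContext.table("source_db.source_table")   # any DataFrame works here
    df.registerTempTable("tmp_df")                     # temp table over the DataFrame

    sqlContext.sql("""
        INSERT OVERWRITE TABLE target_db.target_table PARTITION (part_col)
        SELECT * FROM tmp_df
    """)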
The strange thing is that this works when using the pyspark shell from a host A, but the exact same code, connected to the same cluster and using the same Hive table, does not work in Jupyter notebooks. It returns:
java.lang.NoSuchMethodException: org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions
exception. So it looks to me like some jar mismatch between the host where the pyspark shell is launched and the host where Jupyter is launched. My questions are: how can I determine, from code, which version of the corresponding jar is being used in the pyspark shell and in the Jupyter notebook (I have no access to the Jupyter server)? And why can two distinct versions be in use if both the pyspark shell and Jupyter connect to the same cluster?
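For reference, the kind of check I am after could look something like this from inside a running session (a rough sketch through the py4j gateway; it assumes the usual sc/sqlContext handles exist, and from a Livy-backed notebook it would report the jar on the host running the driver, not the notebook machine):

    # ask the JVM which jar the Hive class was actually loaded from
    hive_cls = sc._jvm.java.lang.Class.forName(
        "org.apache.hadoop.hive.ql.metadata.Hive")
    code_source = hive_cls.getProtectionDomain().getCodeSource()
    print(code_source.getLocation().toString())   # e.g. file:/.../hive-exec-<version>.jar

    # Spark's own version, and (depending on the build) the Hive version it was built against
    print(sc.version)
    print(sqlContext.getConf("spark.sql.hive.version", "not set"))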
Update: after some research I found that Jupyter is using "Livy", and the Livy host uses hive-exec-2.0.1.jar, while the host where we use the pyspark shell uses hive-exec-1.2.1000.2.5.3.58-3.jar. So I downloaded both jars from the Maven repository and decompiled them. Although the loadDynamicPartitions method exists in both, the method signatures (parameters) differ: in the Livy version the boolean holdDDLTime parameter is missing.
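The same comparison can presumably also be done from a live session instead of decompiling, by listing the loadDynamicPartitions overloads of whatever Hive class the driver actually loaded (again a sketch through the py4j gateway):

    hive_cls = sc._jvm.java.lang.Class.forName(
        "org.apache.hadoop.hive.ql.metadata.Hive")
    # print every public loadDynamicPartitions signature found on the driver classpath
    for method in hive_cls.getMethods():
        if method.getName() == "loadDynamicPartitions":
            print(method.toString())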
I had a similar problem; try getting the Maven dependencies from Cloudera.