I am using two Jupyter notebooks for different stages of an analysis. In my Scala notebook, I write some of my cleaned data to Parquet:
partitionedDF.select("noStopWords","lowerText","prediction").write.save("swift2d://xxxx.keystone/commentClusters.parquet")
I then go to my Python notebook to read in the data:
df = spark.read.load("swift2d://xxxx.keystone/commentClusters.parquet")
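For reference, I would expect the explicit-format variants of that read to behave the same way (same path as the write above):

df = spark.read.parquet("swift2d://xxxx.keystone/commentClusters.parquet")
# or, spelling out the format instead of relying on the default
df = spark.read.format("parquet").load("swift2d://xxxx.keystone/commentClusters.parquet")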
and I get the following error:
AnalysisException: u'Unable to infer schema for ParquetFormat at swift2d://RedditTextAnalysis.keystone/commentClusters.parquet. It must be specified manually;'
I have looked at the Spark documentation, and I don't think I should be required to specify a schema. Has anyone run into something like this? Should I be doing something differently when I save/load? The data is landing in Object Storage.
edit: I'm using Spark 2.0 for both the read and the write.
edit2: This was done in a project in Data Science Experience.