I wrote a DataFrame with PySpark into HDFS with this command:
df.repartition(col("year"))\
.write.option("maxRecordsPerFile", 1000000)\
.parquet('/path/tablename', mode='overwrite', partitionBy=["year"], compression='snappy')
When taking a look at HDFS I can see that the files are properly lying there. However, when I try to read the table with Hive or Impala, the table cannot be found.
What is going wrong here? Am I missing something?
Interestingly, df.write.format('parquet').saveAsTable("tablename")
works properly.
This is expected behaviour from Spark:

df.write.parquet("")

writes the data to the HDFS location but does not create any table in the Hive metastore, whereas

df.write.saveAsTable("")

creates the table in Hive and writes the data into it. That is why you cannot find the table in Hive after calling df.write.parquet("").