I have two parquet part files, part-00043-0bfd7e28-6469-4849-8692-e625c25485e2-c000.snappy.parquet (from a run on 14 Nov 2017) and part-00199-64714828-8a9e-4ae1-8735-c5102c0a834d-c000.snappy.parquet (from a run on 16 Nov 2017). Both have the same schema, which I verified by printing the schema of each.

My problem: the files have, say, 10 columns, and all of them come through correctly if I read the two files separately with Spark. But if I put both files in one folder and read them together, the total row count is correct (the sum of the rows from the two files), yet most of the columns from the second file come back null. Only 2 or 3 columns have proper values (the values are present in the file, since they show up correctly when I read it alone). What am I missing here? Here is the code I used for testing:
import org.apache.spark.sql.SparkSession

def initSparkConfig: SparkSession = {
  val sparkSession: SparkSession = SparkSession
    .builder()
    .appName("test")
    .master("local")
    // set SQL/Parquet options on the builder so they take effect
    // (SparkContext.getConf returns a copy, so setting on it does nothing)
    .config("spark.hadoop.parquet.enable.summary-metadata", "false")
    .config("spark.sql.parquet.mergeSchema", "false")
    .config("spark.sql.parquet.filterPushdown", "false")
    .config("spark.sql.hive.metastorePartitionPruning", "true")
    .getOrCreate()
  sparkSession.sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
  sparkSession
}
val sparkSession = initSparkConfig

sparkSession.read.parquet("/test_spark/").createOrReplaceTempView("table")
sparkSession.sql("select * from table").show()
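For reference, this is roughly how I verified that the two schemas match, assuming both part files sit directly under /test_spark/:

val df1 = sparkSession.read
  .parquet("/test_spark/part-00043-0bfd7e28-6469-4849-8692-e625c25485e2-c000.snappy.parquet")
val df2 = sparkSession.read
  .parquet("/test_spark/part-00199-64714828-8a9e-4ae1-8735-c5102c0a834d-c000.snappy.parquet")

// both print the same fields
df1.printSchema()
df2.printSchema()

// structural equality also holds
println(df1.schema == df2.schema)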
Update:

If I read the two files separately, union them, and query the result, all columns are populated without any issues.
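A minimal sketch of that workaround, reusing df1 and df2 from the snippet above:

// positional union; unionByName would match columns by name instead
val unioned = df1.union(df2)
unioned.createOrReplaceTempView("table_union")
sparkSession.sql("select * from table_union").show()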
Update 2:

If I set mergeSchema = true while reading, it throws an exception: Found duplicate column(s) in the data schema and the partition schema: [list of the columns that come back null]. It also flags one of the filter columns as ambiguous.
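The read that triggers the exception looks like this (a sketch; the option can also be set per read instead of globally):

sparkSession.read
  .option("mergeSchema", "true") // per-read equivalent of spark.sql.parquet.mergeSchema
  .parquet("/test_spark/")
  .createOrReplaceTempView("table_merged")
// fails with: Found duplicate column(s) in the data schema and the partition schema: [...]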