Spark issues reading parquet files

Posted 2019-07-27 00:12

I have 2 parquet part files, part-00043-0bfd7e28-6469-4849-8692-e625c25485e2-c000.snappy.parquet (a part file from the 2017 Nov 14th run) and part-00199-64714828-8a9e-4ae1-8735-c5102c0a834d-c000.snappy.parquet (a part file from the 2017 Nov 16th run), and both have the same schema (which I verified by printing the schema).

My problem is that I have, say, 10 columns which come through properly if I read these 2 files separately using Spark. But if I put the files in one folder and try to read them together, the total count is correct (the sum of the rows from the 2 files), but most of the columns from the 2nd file are null. Only some 2 or 3 columns have proper values (the values are present in the file, since they show up properly if I read it alone). What am I missing here? Here is the code I used for testing:

import org.apache.spark.sql.SparkSession

def initSparkConfig: SparkSession = {

    // SQL options must be set on the builder before getOrCreate();
    // calling sparkContext.getConf.set afterwards has no effect.
    val sparkSession: SparkSession = SparkSession
      .builder()
      .appName("test")
      .master("local")
      .config("spark.hadoop.parquet.enable.summary-metadata", "false")
      .config("spark.sql.parquet.mergeSchema", "false")
      .config("spark.sql.parquet.filterPushdown", "false")
      .config("spark.sql.hive.metastorePartitionPruning", "true")
      .getOrCreate()

    sparkSession.sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

    sparkSession
  }

val sparkSession = initSparkConfig
sparkSession.read.parquet("/test_spark/").createOrReplaceTempView("table")
sparkSession.sql("select * from table").show 

Update:

If I read both files separately, do a union, and then read that, all columns get populated without any issues.

Update 2:

If I set mergeSchema = true while reading, it throws an exception: Found duplicate column(s) in the data schema and the partition schema: [list of the columns that were coming back null], and it flags one of the filter columns as ambiguous.

1 Answer

疯言疯语
#2 · 2019-07-27 00:44

Turns out that the schemas were not an exact match. The column names that were coming back null differed in case (some characters in the middle). Parquet column names are case-sensitive, so this was causing all the issues: Spark was trying to read columns that were not there at all.
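A quick way to surface case-only mismatches like this is to diff the two files' column names case-insensitively before mixing them. A minimal sketch in plain Scala (the column lists here are hypothetical stand-ins for what you'd get from `df.schema.fieldNames` on each file):

```scala
// Find column names that match case-insensitively but differ in exact case
// between two schemas. Inputs would typically come from df.schema.fieldNames.
def caseMismatches(a: Seq[String], b: Seq[String]): Seq[(String, String)] = {
  // Index the second schema's columns by their lowercased name.
  val bByLower = b.map(c => c.toLowerCase -> c).toMap
  a.flatMap { col =>
    bByLower.get(col.toLowerCase) match {
      case Some(other) if other != col => Some(col -> other) // same column, different case
      case _                           => None
    }
  }
}

// Hypothetical column names from the Nov 14th and Nov 16th part files:
val nov14Cols = Seq("userId", "eventTime", "country")
val nov16Cols = Seq("userid", "eventTime", "Country")

// These are the columns that would silently read as null when the files are mixed:
caseMismatches(nov14Cols, nov16Cols)
// → Seq(("userId","userid"), ("country","Country"))
```

Any pair this returns is a column that will read correctly from each file alone but come back null when both files are read together.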
