Spark issues reading parquet files

Posted 2019-07-27 00:12

I have 2 parquet part files, part-00043-0bfd7e28-6469-4849-8692-e625c25485e2-c000.snappy.parquet (a part file from the 2017 Nov 14th run) and part-00199-64714828-8a9e-4ae1-8735-c5102c0a834d-c000.snappy.parquet (a part file from the 2017 Nov 16th run), and both have the same schema (which I verified by printing the schema).

My problem is that I have, say, 10 columns which come through properly if I read these 2 files separately using Spark. But if I put the files in one folder and try to read them together, the total count is correct (the sum of the rows from the 2 files), but most of the columns from the 2nd file are null. Only some 2 or 3 columns have proper values (the values are present in the file, since they show up properly if I read it alone). What am I missing here? Here is the code I used for testing:

import org.apache.spark.sql.SparkSession

def initSparkConfig: SparkSession = {

    // SQL options must be set on the builder before getOrCreate();
    // calling sparkContext.getConf.set afterwards has no effect.
    val sparkSession: SparkSession = SparkSession
      .builder()
      .appName("test")
      .master("local")
      .config("spark.hadoop.parquet.enable.summary-metadata", "false")
      .config("spark.sql.parquet.mergeSchema", "false")
      .config("spark.sql.parquet.filterPushdown", "false")
      .config("spark.sql.hive.metastorePartitionPruning", "true")
      .getOrCreate()

    sparkSession.sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

    sparkSession
  }

val sparkSession = initSparkConfig
sparkSession.read.parquet("/test_spark/").createOrReplaceTempView("table")
sparkSession.sql("select * from table").show 

Update:

If I read both files separately, do a union, and then read that, all columns get populated without any issues.

Update 2:

If I set mergeSchema = true while reading, it throws an exception: Found duplicate column(s) in the data schema and the partition schema: [list of the columns that were coming back null], and it flags one of the filter columns as ambiguous.

1 Answer

疯言疯语
#2 · 2019-07-27 00:44

Turns out that the schemas were not an exact match. The column names that were coming back null differed in case (some characters in the middle). Parquet column names are case-sensitive, so this was causing all the issues: Spark was trying to read columns that were not there at all.
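A quick way to surface case-only mismatches like this is to diff the two files' column names case-insensitively before mixing them. A minimal sketch in plain Scala (the column lists here are hypothetical stand-ins for what you'd get from `df.schema.fieldNames` on each file):

```scala
// Find column names that match case-insensitively but differ in exact case
// between two schemas. Inputs would typically come from df.schema.fieldNames.
def caseMismatches(a: Seq[String], b: Seq[String]): Seq[(String, String)] = {
  // Index the second schema's columns by their lowercased name.
  val bByLower = b.map(c => c.toLowerCase -> c).toMap
  a.flatMap { col =>
    bByLower.get(col.toLowerCase) match {
      case Some(other) if other != col => Some(col -> other) // same column, different case
      case _                           => None
    }
  }
}

// Hypothetical column names from the Nov 14th and Nov 16th part files:
val nov14Cols = Seq("userId", "eventTime", "country")
val nov16Cols = Seq("userid", "eventTime", "Country")

// These are the columns that would silently read as null when the files are mixed:
caseMismatches(nov14Cols, nov16Cols)
// → Seq(("userId","userid"), ("country","Country"))
```

Any pair this returns is a column that will read correctly from each file alone but come back null when both files are read together.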
