Spark Fixed Width File Import with a Large Number of Columns

Posted 2020-02-07 07:13

Question:

I receive a fixed-width .txt source file from which I need to extract 20K columns. Given the lack of libraries for processing fixed-width files in Spark, I have developed code that extracts the fields from fixed-width text files.

The code reads the text file as an RDD with

sparkContext.textFile("abc.txt") 

then reads a JSON schema to get the column names and the width of each column (a sketch of this step appears after the list below).

  • In the function, I read the fixed-length string and, using the start and end positions, call substring to build an array of field values.

  • Map the function over the RDD.

  • Convert the resulting RDD to a DataFrame, map the column names onto it, and write it to Parquet.
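
For reference, here is a minimal sketch of the schema-reading step. It assumes a hypothetical layout of one {"name", "length"} JSON object per line; the actual schema file format is not shown here, so treat the field names as placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("FixedWidthImport").getOrCreate()
import spark.implicits._

// Hypothetical schema file, one JSON object per line, e.g.
//   {"name": "acct_id", "length": 12}
val schema = spark.read.json("schema.json")
  .select("name", "length").as[(String, Long)].collect()

val columnSeq    = schema.map(_._1).toSeq        // column names
val colLengthSeq = schema.map(_._2.toInt).toSeq  // column widths
val colCount     = columnSeq.length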

The representative code:

val rdd1 = spark.sparkContext.textFile("file1")

def substrString(line: String, colLength: Seq[Int]): Seq[String] = {
  var now = 0
  val collector = new Array[String](colLength.length)
  for (k <- 0 until colLength.length) {
    // cut the next field out of the fixed-width record, then advance the offset
    collector(k) = line.substring(now, now + colLength(k))
    now = now + colLength(k)
  }
  collector.toSeq
}
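
A quick driver-side sanity check of the function, using a hypothetical sample record and widths:

substrString("AAABBBBCC", Seq(3, 4, 2))
// returns Seq("AAA", "BBBB", "CC")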


val stringArray = rdd1.map(substrString(_, colLengthSeq))
// colLengthSeq holds the column widths read from the schema file



stringArray.toDF("StringCol")
  .select((0 until colCount).map(j => $"StringCol"(j).as(columnSeq(j))): _*)
  .write.mode("overwrite").parquet("C:\\home\\")
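
For completeness, the written output can be spot-checked by reading the Parquet back; this is just a verification step, not part of the job itself:

spark.read.parquet("C:\\home\\").printSchema()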

This code works fine for files with a small number of columns; however, it takes a lot of time and resources with 20K columns. The runtime grows as the number of columns increases.

Has anyone faced this kind of issue with a large number of columns? I need suggestions on performance tuning: how can I tune this job or code?