Possible to mark records that aren't the same length as the header as bad records?

Posted 2019-08-27 03:01

I am reading a file into a dataframe like this

val df = spark.read
   .option("sep", props.inputSeperator)
   .option("header", "true")
   .option("badRecordsPath", "/mnt/adls/udf_databricks/error")
   .csv(inputLoc)

The file is setup like this

col_a|col_b|col_c|col_d
1|first|last|
2|this|is|data
3|ok
4|more||stuff
5|||

Now, Spark will read all of this as acceptable data. However, I want `3|ok` to be marked as a bad record because its size does not match the header size. Is this possible?

2 Answers
Bombasti
#2 · 2019-08-27 03:37

The option below is supported by the Databricks implementation of Spark. I don't see a schema mapping in your code; could you add one and try?

.option("badRecordsPath", "/mnt/adls/udf_databricks/error")

Change your code like below,

import org.apache.spark.sql.types._

val customSchema = StructType(Array(
    StructField("col_a", StringType, true),
    StructField("col_b", StringType, true),
    StructField("col_c", StringType, true),
    StructField("col_d", StringType, true)))

val df = spark.read
   .option("sep", props.inputSeperator)
   .option("header", "true")
   .option("badRecordsPath", "/mnt/adls/udf_databricks/error")
   .schema(customSchema)
   .csv(inputLoc)

For more details, you can refer to the Databricks documentation on `badRecordsPath`.
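Note that `badRecordsPath` is specific to the Databricks runtime. On open-source Spark, a similar effect can be sketched with `PERMISSIVE` mode and a corrupt-record column. This is a sketch, not part of the original answer; the `_corrupt_record` column name is just the conventional default, configurable via `columnNameOfCorruptRecord`:

```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.col

// Schema with one extra string column that captures the raw text of malformed rows.
val schemaWithCorrupt = StructType(Array(
  StructField("col_a", StringType, true),
  StructField("col_b", StringType, true),
  StructField("col_c", StringType, true),
  StructField("col_d", StringType, true),
  StructField("_corrupt_record", StringType, true)))

val df = spark.read
  .option("sep", "|")
  .option("header", "true")
  .option("mode", "PERMISSIVE")                           // keep malformed rows instead of dropping/failing
  .option("columnNameOfCorruptRecord", "_corrupt_record") // raw malformed row lands here
  .schema(schemaWithCorrupt)
  .csv(inputLoc)

// Rows like 3|ok have fewer tokens than the schema, so _corrupt_record is non-null.
val bad  = df.filter(col("_corrupt_record").isNotNull)
val good = df.filter(col("_corrupt_record").isNull).drop("_corrupt_record")
```

The `bad` DataFrame can then be written wherever you keep rejected records, playing the role `badRecordsPath` plays on Databricks.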

Thanks, Karthick

[Banned account]
#3 · 2019-08-27 03:41
// Read the raw file, take the header's field count, and write out any line
// whose field count differs. Splitting with limit -1 keeps trailing empty fields,
// so a row like 1|first|last| still counts as four fields.
val a = spark.sparkContext.textFile(pathOfYourFile)
val size = a.first.split("\\|", -1).length
a.filter(i => i.split("\\|", -1).size != size)
 .saveAsTextFile("/mnt/adls/udf_databricks/error")
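The length check in this answer can also be tested without a Spark cluster. Here is a minimal plain-Scala sketch of the same logic (the `RecordCheck` name is made up for illustration), partitioning lines into good and bad by comparing each row's field count against the header's:

```scala
// Hypothetical helper mirroring the answer's logic: split each line on "|"
// (limit -1 keeps trailing empty fields) and compare against the header width.
object RecordCheck {
  def partition(lines: Seq[String]): (Seq[String], Seq[String]) = {
    val width = lines.head.split("\\|", -1).length
    // partition returns (matching, non-matching) = (good, bad)
    lines.tail.partition(_.split("\\|", -1).length == width)
  }
}

val lines = Seq("col_a|col_b|col_c|col_d", "1|first|last|", "3|ok")
val (good, bad) = RecordCheck.partition(lines)
// good == Seq("1|first|last|"), bad == Seq("3|ok")
```

Note that `1|first|last|` passes because the trailing empty field is preserved by the `-1` limit; `3|ok` fails with only two fields against a four-field header.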