Spark: Reading data frame from a list of paths

Posted 2020-04-20 06:48

Question:

I am trying to load a DataFrame from a list of paths in Spark. If files exist at every one of the paths, the code works fine. If at least one path is empty, it throws an error.

This is my code:

val paths = List("path1", "path2")
val df = spark.read.json(paths: _*)

I looked at two other options:

  1. Build a single regex string that contains all the paths.
  2. Build a new list from the master list of paths by checking whether Spark can read each one, as in the snippet below.


import scala.util.Try

val readablePaths = paths.filter { path =>
  Try(spark.read.json(path)).isSuccess   // keep only the paths Spark can actually read
}

The first approach won't work for my case because I can't build a single regex out of the paths I have to read. The second approach works, but I am afraid it will degrade performance because it has to read from all the paths twice.

Please suggest an approach to solve this issue.

Note:

  1. All the paths are in HDFS.
  2. Each path is itself a regex string that will read from multiple files (see the illustrative example below).
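
For illustration, such a list of paths might look something like this (the directory layout and file names here are hypothetical, not from the question):

val paths = List(
  "/data/input/2020-04-1[0-9]/*.json",
  "/data/input/2020-04-2[0-9]/*.json"
)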

Answer 1:

As mentioned in the comments, you can use the HDFS FileSystem API to get the list of paths that actually exist, based on your patterns (as long as each one is a valid glob/regex).

import org.apache.hadoop.fs._

val path = Array("path_prefix/folder1[2-8]/*", "path_prefix/folder2[2-8]/*")

val fs: FileSystem = FileSystem.get(sc.hadoopConfiguration)  // sc = SparkContext

val paths = path.flatMap(p => fs.globStatus(new Path(p)).map(_.getPath.toString))

This way, even if, say, /path_prefix/folder13 is empty, its contents will not be listed in the variable paths, which will be an Array[String] containing all the available files matching the patterns.

Finally, you can do:

spark.read.json(paths : _*)
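
One caveat worth adding (my own assumption, not part of the original answer): if none of the patterns match anything at all, paths will be empty, and calling spark.read.json with zero paths typically fails schema inference. A minimal guard, assuming an empty DataFrame is an acceptable fallback in that case:

// Read only when at least one concrete path was found;
// otherwise fall back to an empty DataFrame (assumption: that is acceptable downstream).
val df = if (paths.nonEmpty) spark.read.json(paths: _*)
         else spark.emptyDataFrame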


Answer 2:

Adding (copying) a dummy file with zero length to each directory in the path list is a pragmatic technical workaround that functionally achieves what you want. I have run into the empty-directory problem before and alleviated it this way; it may not be possible in your case... A sketch of the idea follows.
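
A minimal sketch of that workaround, assuming the directories already exist in HDFS and that an empty placeholder file named _dummy.json is acceptable in your layout (the file name and the use of the FileSystem API here are my own assumptions, not part of the original answer):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

List("path1", "path2").foreach { dir =>
  val dummy = new Path(dir, "_dummy.json")          // hypothetical placeholder file name
  if (!fs.exists(dummy)) fs.create(dummy).close()   // create a zero-length file
}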