I am trying to load a DataFrame from a list of paths in Spark. If files exist at all of the paths, the code works fine, but if even one path is empty (or matches nothing), it throws an error.
This is my code:
val paths = List("path1", "path2")
val df = spark.read.json(paths: _*)
I looked at two other options:
- Build a single glob pattern string that covers all the paths.
- Build a filtered list from the master list of paths by checking whether Spark can read each one:
import scala.util.Try
import scala.collection.mutable.ListBuffer

val validPaths = ListBuffer[String]()
for (path <- paths) {
  if (Try(spark.read.json(path)).isSuccess) {
    validPaths += path  // keep only the paths Spark can actually read
  }
}
The first approach won't work in my case because I can't build a single pattern out of the paths I have to read. The second approach works, but I suspect it will degrade performance, since every path is read twice: once for the validity check and once for the actual load.
Please suggest an approach to solve this issue.
Note:
- All the paths are in HDFS.
- Each path is itself a glob pattern that matches multiple files.
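One way to avoid the double read is to validate the paths with Hadoop's FileSystem API instead of attempting a full Spark read. `FileSystem.globStatus` expands a glob pattern and only lists file metadata, so it is far cheaper than `spark.read.json`. This is a sketch, assuming all paths live on the cluster's default HDFS filesystem:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Use the same Hadoop configuration Spark itself uses.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Keep only patterns that match at least one file.
// globStatus returns null (nonexistent base path) or an empty
// array (glob with no matches) when nothing is found.
val validPaths = paths.filter { p =>
  val matches = fs.globStatus(new Path(p))
  matches != null && matches.nonEmpty
}

val df = spark.read.json(validPaths: _*)
```

This touches only the NameNode for metadata, so the JSON data itself is read exactly once. If the paths could span multiple filesystems, you would instead resolve each one with `new Path(p).getFileSystem(conf)` before calling `globStatus`.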