I have a directory of directories on HDFS, and I want to iterate over the directories. Is there any easy way to do this with Spark using the SparkContext object?
You can try with globStatus as well:
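For example, going through the JVM from PySpark (a sketch only; the glob pattern below is a placeholder, not from the original post):

```python
# Sketch: list directories matching a glob through the Hadoop FileSystem API.
# `sc` is the active SparkContext (e.g. from the pyspark shell);
# "/base/dir/*" is a placeholder pattern.
hadoop = sc._jvm.org.apache.hadoop

fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
for status in fs.globStatus(hadoop.fs.Path("/base/dir/*")):
    if status.isDirectory():
        print(status.getPath().toString())
```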
Here's a PySpark version if someone is interested:
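A sketch of the idea, with the warehouse path as a placeholder rather than the table's real location:

```python
# Sketch: list a table directory's files via the Hadoop FileSystem API from PySpark.
# `sc` is the active SparkContext; the path is a placeholder for wherever
# disc_mrt.unified_fact actually stores its files.
hadoop = sc._jvm.org.apache.hadoop

conf = sc._jsc.hadoopConfiguration()
fs = hadoop.fs.FileSystem.get(conf)

for f in fs.listStatus(hadoop.fs.Path("/path/to/disc_mrt.db/unified_fact")):
    print(f.getPath().toString())
```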
In this particular case I get a list of all the files that make up the disc_mrt.unified_fact Hive table.
Other methods of the FileStatus object, such as getLen() to get the file size, are described here:
Class FileStatus
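For instance, reusing the fs handle from the sketch above:

```python
# getLen() returns each file's size in bytes; getName() is just the final path component.
for f in fs.listStatus(hadoop.fs.Path("/path/to/disc_mrt.db/unified_fact")):
    print(f.getPath().getName(), f.getLen())
```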
@Tagar didn't show how to connect to a remote HDFS, but this answer did:
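Roughly along these lines (a sketch; the namenode address and directory are placeholders):

```python
# Sketch: connect to a remote HDFS explicitly by URI instead of the default FileSystem.
# "hdfs://namenode-host:8020" and "/some/dir" are placeholders.
jvm = sc._gateway.jvm

URI = jvm.java.net.URI
Path = jvm.org.apache.hadoop.fs.Path
FileSystem = jvm.org.apache.hadoop.fs.FileSystem
Configuration = jvm.org.apache.hadoop.conf.Configuration

fs = FileSystem.get(URI("hdfs://namenode-host:8020"), Configuration())
for status in fs.listStatus(Path("/some/dir")):
    print(status.getPath().toString())
```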
This did the job for me.
This worked for me (Spark version 1.5.0-cdh5.5.2).
You can use org.apache.hadoop.fs.FileSystem. Specifically, FileSystem.listFiles([path], true).
And with Spark...
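For example, from PySpark through the JVM gateway (a sketch; the directory is a placeholder):

```python
# Sketch: listFiles(path, true) walks every file under the directory tree.
# `sc` is the active SparkContext; "/base/dir" is a placeholder.
hadoop = sc._jvm.org.apache.hadoop

fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
it = fs.listFiles(hadoop.fs.Path("/base/dir"), True)  # True -> recurse into subdirectories
while it.hasNext():
    print(it.next().getPath().toString())
```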
Edit: It's worth noting that good practice is to get the FileSystem that is associated with the Path's scheme.
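For instance (a sketch; the URI is a placeholder):

```python
# Sketch: let the Path's scheme (hdfs://, s3a://, file://, ...) choose the FileSystem.
path = sc._jvm.org.apache.hadoop.fs.Path("hdfs:///base/dir")  # placeholder URI
fs = path.getFileSystem(sc._jsc.hadoopConfiguration())
```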