Spark iterate HDFS directory

Posted 2019-01-14 01:48

I have a directory of directories on HDFS, and I want to iterate over the directories. Is there any easy way to do this with Spark using the SparkContext object?

6 Answers
趁早两清
#2 · 2019-01-14 02:24

You can try globStatus as well:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// url points at the directory to list
val listStatus = FileSystem.get(new URI(url), sc.hadoopConfiguration)
  .globStatus(new Path(url))

for (urlStatus <- listStatus) {
  println("urlStatus get Path: " + urlStatus.getPath())
}
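
If there is just one level of subdirectories, a wildcard in the glob pattern lists everything under them in a single call. A minimal sketch, assuming a hypothetical hdfs:///data layout:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// hypothetical root holding one level of subdirectories
val root = "hdfs:///data"

val fs = FileSystem.get(new URI(root), sc.hadoopConfiguration)
// globStatus expands the wildcard and returns one FileStatus per match
for (status <- fs.globStatus(new Path(root + "/*"))) {
  println(status.getPath)
}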
我只想做你的唯一
#3 · 2019-01-14 02:26

Here's the PySpark version, if someone is interested:

# access the Hadoop FileSystem API through Spark's py4j gateway
hadoop = sc._jvm.org.apache.hadoop

fs = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration()
path = hadoop.fs.Path('/hivewarehouse/disc_mrt.db/unified_fact/')

for f in fs.get(conf).listStatus(path):
    print(f.getPath())

In this particular case I get a list of all the files that make up the disc_mrt.unified_fact Hive table.

Other methods of the FileStatus object, such as getLen() to get the file size, are described here:

Class FileStatus
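
For example, getLen() makes it easy to total up the size of a directory. A minimal Scala sketch of that idea, reusing the path from above (the same methods are reachable through the py4j gateway):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
// sum the sizes of the files directly under the table directory
val totalBytes = fs.listStatus(new Path("/hivewarehouse/disc_mrt.db/unified_fact/"))
  .filter(_.isFile)
  .map(_.getLen)
  .sum
println(s"total size: $totalBytes bytes")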

我欲成王,谁敢阻挡
#4 · 2019-01-14 02:29

@Tagar didn't say how to connect to a remote HDFS, but this does:

URI           = sc._gateway.jvm.java.net.URI
Path          = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem    = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration


fs = FileSystem.get(URI("hdfs://somehost:8020"), Configuration())

status = fs.listStatus(Path('/some_dir/yet_another_one_dir/'))

for fileStatus in status:
    print(fileStatus.getPath())
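
If the directories are nested more than one level deep, the same listStatus call can be applied recursively. A minimal Scala sketch of that idea, with hypothetical host and path (the logic translates directly to the py4j version above):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, FileStatus, Path}

// recursively collect every entry under a root directory
def walk(fs: FileSystem, path: Path): Seq[FileStatus] =
  fs.listStatus(path).toSeq.flatMap { status =>
    if (status.isDirectory) status +: walk(fs, status.getPath)
    else Seq(status)
  }

val fs = FileSystem.get(new URI("hdfs://somehost:8020"), sc.hadoopConfiguration)
walk(fs, new Path("/some_dir")).foreach(s => println(s.getPath))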
Summer. ? 凉城
#5 · 2019-01-14 02:36

This did the job for me:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

FileSystem.get(new URI("hdfs://HAservice:9000"), sc.hadoopConfiguration)
  .listStatus(new Path("/tmp/"))
  .foreach(x => println(x.getPath))
虎瘦雄心在
#6 · 2019-01-14 02:39
import org.apache.hadoop.fs.{FileSystem, Path}

FileSystem.get(sc.hadoopConfiguration)
  .listStatus(new Path("hdfs:///tmp"))
  .foreach(x => println(x.getPath))

This worked for me.

Spark version 1.5.0-cdh5.5.2
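
Since the question is about iterating over the directories themselves, the result of listStatus can be filtered with isDirectory. A minimal sketch, reusing the hdfs:///tmp path from above:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
// keep only the entries that are themselves directories
fs.listStatus(new Path("hdfs:///tmp"))
  .filter(_.isDirectory)
  .foreach(status => println(status.getPath))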

Deceive 欺骗
#7 · 2019-01-14 02:48

You can use org.apache.hadoop.fs.FileSystem. Specifically, FileSystem.listFiles([path], true)

And with Spark...

FileSystem.get(sc.hadoopConfiguration).listFiles(..., true)

Edit

It's worth noting that it is good practice to get the FileSystem associated with the Path's scheme:

path.getFileSystem(sc.hadoopConfiguration).listFiles(path, true)
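
Note that listFiles returns a RemoteIterator[LocatedFileStatus] rather than an array, so it has to be drained with hasNext/next. A minimal sketch, with a hypothetical starting path:

import org.apache.hadoop.fs.Path

val path = new Path("hdfs:///tmp")   // hypothetical starting directory
val fs = path.getFileSystem(sc.hadoopConfiguration)

// the boolean flag makes listFiles descend into subdirectories
val files = fs.listFiles(path, true)
while (files.hasNext) {
  val status = files.next()          // LocatedFileStatus
  println(s"${status.getPath} (${status.getLen} bytes)")
}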