Spark Scala list folders in directory

I want to list all folders within an HDFS directory using Scala/Spark. In Hadoop I can do this with the command: hadoop fs -ls hdfs://sandbox.hortonworks.com/demo/

I tried it with:

val conf = new Configuration()
val fs = FileSystem.get(new URI("hdfs://sandbox.hortonworks.com/"), conf)

val path = new Path("hdfs://sandbox.hortonworks.com/demo/")

val files = fs.listFiles(path, false)

But it does not seem to look in the Hadoop directory, as I cannot find my folders/files.

I also tried with:

FileSystem.get(sc.hadoopConfiguration).listFiles(new Path("hdfs://sandbox.hortonworks.com/demo/"), true)

But this also does not help.

Do you have any other idea?

PS: I also checked this thread: Spark iterate HDFS directory, but it does not work for me, as it does not seem to search the HDFS directory, only the local file system with the scheme file://.

Tags: scala hadoop apache-spark

7 answers

一纸荒年 Trace。

#2 · 2019-01-19 10:23

Azure Blob Storage is mapped to an HDFS location, so the usual Hadoop operations work against it.

On Azure Portal, go to Storage Account, you will find following details:

Storage account
Key -
Container -
Path pattern – /users/accountsdata/
Date format – yyyy-mm-dd
Event serialization format – json
Format – line separated

The path pattern here is the HDFS path. You can log in (for example with PuTTY) to the Hadoop edge node and run:

hadoop fs -ls /users/accountsdata

The above command lists all the files. In Scala you can use:

import scala.sys.process._ 

val lsResult = Seq("hadoop","fs","-ls","/users/accountsdata/").!!


看我几分像从前

#3 · 2019-01-19 10:29

Because you're using Scala, you may also be interested in the following:

import scala.sys.process._
val lsResult = Seq("hadoop","fs","-ls","hdfs://sandbox.hortonworks.com/demo/").!!

This will, unfortunately, return the entire output of the command as a single string, so parsing it down to just the filenames takes some effort. (Use fs.listStatus instead.) But if you need to run other commands that are easy on the command line and you are unsure how to do them in Scala, just call the command line through scala.sys.process._. (Use a single ! if you only want the return code.)
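As a rough illustration of the parsing effort mentioned above, here is a minimal sketch that extracts paths from the string returned by !!. It assumes the usual eight-column layout of hadoop fs -ls (permissions, replication, owner, group, size, date, time, path) preceded by a "Found N items" line; the names HdfsLsParser and LsEntry are made up for this example and are not part of any library.

```scala
object HdfsLsParser {
  /** One row of `hadoop fs -ls` output: the path and whether it is a directory. */
  final case class LsEntry(path: String, isDirectory: Boolean)

  /** Parses the text printed by `hadoop fs -ls`.
    * Assumption: each entry line has eight whitespace-separated columns,
    * with the path last. Splitting with a limit of 8 keeps spaces in paths.
    */
  def parse(lsOutput: String): List[LsEntry] =
    lsOutput.split("\\r?\\n").toList
      .map(_.trim)
      .filter(line => line.nonEmpty && !line.startsWith("Found "))
      .map(_.split("\\s+", 8))
      .collect {
        // A leading 'd' in the permission string marks a directory
        case fields if fields.length == 8 =>
          LsEntry(fields(7), fields(0).startsWith("d"))
      }

  /** Keeps only the directory paths. */
  def directories(lsOutput: String): List[String] =
    parse(lsOutput).filter(_.isDirectory).map(_.path)
}
```

With the lsResult value from the snippet above, HdfsLsParser.directories(lsResult) would give just the folder paths. This depends on the CLI output format staying the same, which is one more reason to prefer fs.listStatus where possible.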


聊天终结者

#4 · 2019-01-19 10:35

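import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Demo").getOrCreate()
val path = new Path("enter your directory path")
// Resolve the file system from the path itself, using Spark's Hadoop configuration
val fs: FileSystem = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
val it = fs.listLocatedStatus(path)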

This creates an iterator, it, over org.apache.hadoop.fs.LocatedFileStatus entries, one for each file or subdirectory directly under the path.
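Since the question asks for folders only, the iterator still has to be drained and filtered. The following is a minimal sketch of one way to do that; SubdirLister, drain and subdirectories are hypothetical names for this example, and it assumes a Hadoop 2+ client where FileStatus.isDirectory is available.

```scala
import org.apache.hadoop.fs.{FileSystem, Path, RemoteIterator}
import scala.collection.mutable.ListBuffer

object SubdirLister {
  /** Drains a Hadoop RemoteIterator into a Scala List. */
  def drain[T](it: RemoteIterator[T]): List[T] = {
    val buf = ListBuffer.empty[T]
    while (it.hasNext) buf += it.next()
    buf.toList
  }

  /** Returns the immediate subdirectories of `dir` (non-recursive). */
  def subdirectories(fs: FileSystem, dir: Path): List[Path] =
    drain(fs.listLocatedStatus(dir))
      .filter(_.isDirectory) // assumption: Hadoop 2+ API
      .map(_.getPath)
}
```

Calling SubdirLister.subdirectories(fs, path) with the fs and path values from the snippet above should return only the folders. Note that FileSystem.listFiles, used in the question, is documented to return files only, which may be why the folders did not show up there.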


在下西门庆

#5 · 2019-01-19 10:40

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HDFSProgram extends App {
  val uri = new URI("hdfs://HOSTNAME:PORT")
  val fs = FileSystem.get(uri, new Configuration())
  val filePath = new Path("/user/hive/")
  val status = fs.listStatus(filePath)
  status.map(sts => sts.getPath).foreach(println)
}

This is sample code to list the HDFS files and folders present under /user/hive/.


仙女界的扛把子

#6 · 2019-01-19 10:43

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// url is the HDFS path (or glob pattern) to match; sc is the SparkContext
val listStatus = FileSystem.get(new URI(url), sc.hadoopConfiguration)
  .globStatus(new Path(url))

for (urlStatus <- listStatus) {
  println("urlStatus get Path:" + urlStatus.getPath())
}


狗以群分

#7 · 2019-01-19 10:44

We are using Hadoop 1.4, which doesn't have the listFiles method, so we use listStatus to get directories. It has no recursive option, but a recursive lookup is easy to implement.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val status = fs.listStatus(new Path(YOUR_HDFS_PATH))
status.foreach(x => println(x.getPath))
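The recursive lookup mentioned above could look roughly like the sketch below, which walks the tree using nothing but listStatus. RecursiveDirs and allDirectories are hypothetical names for this example. It uses FileStatus.isDirectory, which is an assumption about the client version: older Hadoop 1.x releases expose isDir instead.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

object RecursiveDirs {
  /** Returns every directory below `root`, depth-first, using only listStatus. */
  def allDirectories(fs: FileSystem, root: Path): List[Path] =
    fs.listStatus(root).toList
      .filter(_.isDirectory) // on Hadoop 1.x this would be _.isDir
      .flatMap { st =>
        // Keep this directory, then descend into it
        st.getPath :: allDirectories(fs, st.getPath)
      }
}
```

For example, RecursiveDirs.allDirectories(fs, new Path(YOUR_HDFS_PATH)).foreach(println) would print every nested folder. The recursion is not tail-recursive, so for extremely deep trees an explicit work queue may be a safer choice.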

