How to count partitions with FileSystem API?

Posted 2020-04-21 06:15

Question:

I am using Hadoop version 2.7 and its FileSystem API. The question is "how to count partitions with the API?", but to frame it as a software problem, I am copying here a Spark-Shell script... The concrete question about the script is:

Is the variable parts below counting the number of table partitions, or something else?

import org.apache.hadoop.fs.{FileSystem, Path}

val warehouse = "/apps/hive/warehouse"  // the Hive default location for all databases
val db_regex  = """\.db$""".r   // filter for names like "*.db"
val tab_regex = """\.hive\-staging_""".r    // signature of Hive staging directories (not tables)

val trStrange = "[\\s/]+|[^\\x00-\\x7F]+|[\\p{Cntrl}&&[^\r\n\t]]+|\\p{C}+".r // whitespace, slashes, non-ASCII and control characters, to be masked
def cutPath(thePath: String, toCut: Boolean = true): String =
  if (toCut) trStrange.replaceAllIn(thePath.replaceAll("^.+/", ""), "@") else thePath

val fs_get = FileSystem.get( sc.hadoopConfiguration )
fs_get.listStatus( new Path(warehouse) ).foreach( lsb => {
    val b = lsb.getPath.toString
    if (db_regex.findFirstIn(b).isDefined) 
       fs_get.listStatus( new Path(b) ).foreach( lst => {
            val lstPath = lst.getPath
            val t = lstPath.toString
            var parts = -1   // -1 signals that the listing failed
            var size = -1L   // -1 signals that the listing failed
            if (tab_regex.findFirstIn(t).isEmpty) {
              try {
                  val pp = fs_get.listStatus( lstPath )
                  parts = pp.length // !HERE! partitions?
                  size = 0L         // listing succeeded, start accumulating
                  pp.foreach( p => {
                     try { // SUPPOSING that size is the number of bytes of table t
                        size = size + fs_get.getContentSummary(p.getPath).getLength
                     } catch { case _: Throwable => }
                  })
              } catch { case _: Throwable => }
              println(s"${cutPath(b)},${cutPath(t)},$parts,$size")
            }
        })
}) // end of the warehouse loop
System.exit(0)  // exit spark-shell

This is only an example to show the focus: the correct scan and semantic interpretation of the Hive default-database filesystem structure, using the Hadoop FileSystem API. The script sometimes needs some memory, but it works fine. Run it with
sshell --driver-memory 12G --executor-memory 18G -i teste_v2.scala > output.csv


Note: the aim here is not to count partitions by any other method (e.g. HQL DESCRIBE or the Spark schema), but to use the API for it... For control and for data-quality checks, the API is important as a kind of "lower-level measurement".

Answer 1:

Hive structures its metadata as database > tables > partitions > files. This typically translates into the filesystem directory structure <hive.warehouse.dir>/database.db/table/partition/.../files, where /partition/.../ signifies that tables can be partitioned by multiple columns, creating nested levels of subdirectories. (By convention, a partition is a directory named .../partition_column=value. For example, a table partitioned by year and month would keep its files under paths like .../mydb.db/sales/year=2020/month=04/.)

So it seems your script will be printing the number of files (parts) and their total length (size) for each single-column partitioned table in each of your databases, if I'm not mistaken. Strictly speaking, parts = pp.length counts the first-level entries under the table directory; that equals the number of partitions only when the table is partitioned by a single column (for a table partitioned by, say, year and month, it would count only the distinct year=... directories).
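
If you need the true partition count for multi-column partitioned tables with the same API, you have to descend to the leaf directories. A minimal sketch, assuming the column=value naming convention above (the helper countLeafPartitions is my own, not part of the original script):

import org.apache.hadoop.fs.{FileSystem, Path}

// Counts leaf partition directories under a table path by recursing
// through "column=value" subdirectories; returns 0 for unpartitioned tables.
def countLeafPartitions(fs: FileSystem, dir: Path): Int = {
  val partDirs = fs.listStatus(dir)
    .filter(s => s.isDirectory && s.getPath.getName.contains("="))
  if (partDirs.isEmpty) 0
  else partDirs.map { d =>
    val below = countLeafPartitions(fs, d.getPath)
    if (below == 0) 1 else below  // no deeper "=" directories: d itself is a leaf
  }.sum
}

In your script, countLeafPartitions(fs_get, lstPath) could then replace pp.length for tables partitioned by more than one column.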

As an alternative, I'd suggest you look at the hdfs dfs -count command to see if it suits your needs, and maybe wrap it in a simple shell script to loop through the databases and tables.
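
If you prefer to stay inside the FileSystem API instead of shelling out, FileSystem.getContentSummary returns in a single call the same three numbers that hdfs dfs -count prints: directory count, file count and content size. A minimal sketch (the table path is a made-up example):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val table = new Path("/apps/hive/warehouse/mydb.db/sales")  // hypothetical table path

val cs = fs.getContentSummary(table)  // recursive summary of the whole subtree
println(s"dirs=${cs.getDirectoryCount} files=${cs.getFileCount} bytes=${cs.getLength}")

Note that getDirectoryCount includes the table directory itself and every intermediate level, so on its own it is not the partition count.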