I know that from the terminal, one can do a find
command to find files such as:
find . -type d -name "*something*" -maxdepth 4
But, when I am in the hadoop file system, I have not found a way to do this.
hadoop fs -find ....
throws an error.
How do people traverse files in Hadoop? I'm using Hadoop 2.6.0-cdh5.4.1.
If you are using the Cloudera stack, try the find tool, org.apache.solr.hadoop.HdfsFindTool. Set the command to a bash variable once, then use it like the local find.
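A sketch, assuming the Cloudera parcel jar path quoted later in this thread (adjust it to your CDH install):

# Set the command to a bash variable:
COMMAND='hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-job.jar org.apache.solr.hadoop.HdfsFindTool'

# Usage then mirrors the local find(1), e.g. directories matching a pattern:
${COMMAND} -find /some/path -type d -name "*something*"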
If you don't have the Cloudera parcels available, you can pipe hadoop fs -ls -R through awk.
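A minimal sketch of that approach (the slash-count depth test and /some/path are my assumptions, not a drop-in script):

# Recursively list, keep directories (permission string starts with 'd')
# whose path matches the pattern; the slash count approximates -maxdepth:
hadoop fs -ls -R /some/path | awk '
  $1 ~ /^d/ {                      # directories only
    path = $NF                     # last field is the full HDFS path
    depth = gsub(/\//, "/", path)  # count slashes without changing path
    if (depth <= 4 && path ~ /something/) print path
  }'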
That's almost equivalent to the
find . -type d -name "*something*" -maxdepth 4
command.

hadoop fs -find was introduced in Apache Hadoop 2.7.0. Most likely you're using an older version, hence you don't have it yet; see HADOOP-8989 for more information. In the meantime you can use hdfs dfs -ls -R, e.g.:

hdfs dfs -ls -R /demo/order*.*
but that's not as powerful as find, of course, and lacks some basics. From what I understand, people have been writing scripts around it to get past this limitation.
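For reference, on 2.7.0 and later the built-in version looks like this (/some/path is a placeholder; note that -name/-iname and -print/-print0 are the only supported expressions, so there is no -type or -maxdepth):

hadoop fs -find /some/path -name "*something*" -print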
Adding HdfsFindTool as an alias in .bash_profile makes it easy to use whenever you need it.
-- add the lines below to your profile:

alias hdfsfind='hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-job.jar org.apache.solr.hadoop.HdfsFindTool'
alias hdfs='hadoop fs'
-- you can now use it as follows (here I'm using the find tool to get, per HDFS source folder, the file name and record count):

$> cnt=1; for ff in `hdfsfind -find /dev/abc/*/2018/02/16/*.csv -type f`; do
     pp=`echo ${ff} | awk -F"/" '{print $7}'`   # source folder (7th path component)
     fn=`basename ${ff}`                        # file name
     fcnt=`hdfs -cat ${ff} | wc -l`             # record count
     echo "${cnt}=${pp}=${fn}=${fcnt}"
     cnt=`expr ${cnt} + 1`
   done

-- simple ways to get folder/file details:

$> hdfsfind -find /dev/abc/ -type f -name "*.csv"
$> hdfsfind -find /dev/abc/ -type d -name "toys"