Can hadoop fs -ls be used to find all directories older than N days (from the current date)?
I am trying to write a clean up routine to find and delete all directories on HDFS (matching a pattern) which were created N days prior to the current date.
This script lists all the directories that are older than [days]:
#!/bin/bash
usage="Usage: $0 [days]"

if [ ! "$1" ]
then
  echo "$usage"
  exit 1
fi

now=$(date +%s)

# Recursively list HDFS entries (-lsr is deprecated on newer Hadoop; use 'hadoop fs -ls -R' there),
# keep only directories (lines starting with 'd'), and print those whose
# modification date is more than [days] days old.
hadoop fs -lsr | grep "^d" | while read -r f; do
  dir_date=$(echo "$f" | awk '{print $6}')
  difference=$(( ( $now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60) ))
  if [ "$difference" -gt "$1" ]; then
    echo "$f"
  fi
done
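If you also want to remove the matches (as the question asks), one rough sketch is to print just the path, which sits in field 8 of the lsr output, and prefix it with the delete command so you can review the list before actually removing anything:

# Sketch: same filter as above, but emit the directory path (field 8) prefixed
# with the delete command. Drop the leading 'echo' only after checking the output.
hadoop fs -lsr | grep "^d" | while read -r f; do
  dir_date=$(echo "$f" | awk '{print $6}')
  dir_path=$(echo "$f" | awk '{print $8}')
  age_days=$(( ( $(date +%s) - $(date -d "$dir_date" +%s) ) / 86400 ))
  if [ "$age_days" -gt "$1" ]; then
    echo hadoop fs -rm -r "$dir_path"
  fi
done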
If you happen to be using the CDH distribution of Hadoop, it comes with a very useful HdfsFindTool command, which behaves like Linux's find command.

If you're using the default parcel location, here's how you'd do it:
hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-*-job.jar \
org.apache.solr.hadoop.HdfsFindTool -find PATH -mtime +N
Replace PATH with the search path and N with the number of days.
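For example, the tool mirrors most GNU find expressions, so restricting the match to directories with -type d should work; assuming a hypothetical /user/logs path and a 30-day cutoff, the invocation would look roughly like this:

hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-*-job.jar \
    org.apache.solr.hadoop.HdfsFindTool -find /user/logs -type d -mtime +30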
For real clusters it is not a good idea to use ls: on a large namespace it is slow and puts load on the NameNode. If you have admin rights, it is more suitable to work from the fsimage instead.

I modified the script above to illustrate the idea.
First, fetch the fsimage from the NameNode:
curl "http://localhost:50070/getimage?getimage=1&txid=latest" > img.dump
Then convert it to text (the output format is the same as lsr gives):
hdfs oiv -i img.dump -o fsimage.txt
Script:
#!/bin/bash
usage="Usage: dir_diff.sh [days]"

if [ ! "$1" ]
then
  echo "$usage"
  exit 1
fi

now=$(date +%s)

# Fetch the latest fsimage from the NameNode and convert it to text.
curl "http://localhost:50070/getimage?getimage=1&txid=latest" > img.dump
hdfs oiv -i img.dump -o fsimage.txt

# Keep only directories (lines starting with 'd') whose modification date
# is more than [days] days old.
grep "^d" fsimage.txt | while read -r f; do
  dir_date=$(echo "$f" | awk '{print $6}')
  difference=$(( ( $now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60) ))
  if [ "$difference" -gt "$1" ]; then
    echo "$f"
  fi
done
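Assuming the script is saved as dir_diff.sh and made executable, a run with a 30-day cutoff (30 is just an example value) would look like this:

chmod +x dir_diff.sh
./dir_diff.sh 30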
hdfs dfs -ls /hadoop/path/*.txt | awk '$6 < "2017-10-24"'
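A rough variation on the same one-liner: compute the cutoff with date -d instead of hard-coding it, print only the path column (field 8 of the listing), and pipe the result into hdfs dfs -rm. The /hadoop/path glob and the 30-day window below are placeholders:

# Sketch: delete files whose modification date (field 6) is older than the cutoff.
cutoff=$(date -d "-30 days" +%F)   # e.g. 2017-09-24
hdfs dfs -ls /hadoop/path/*.txt \
  | awk -v c="$cutoff" '$6 < c {print $8}' \
  | xargs -r hdfs dfs -rm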