Finding directories older than N days in HDFS

Published 2019-01-18 13:21

Question:

Can hadoop fs -ls be used to find all directories older than N days (from the current date)?

I am trying to write a clean up routine to find and delete all directories on HDFS (matching a pattern) which were created N days prior to the current date.

Answer 1:

This script lists all the directories that are older than [days]:

#!/bin/bash
usage="Usage: $0 [days]"

if [ ! "$1" ]
then
  echo $usage
  exit 1
fi

now=$(date +%s)
# -lsr is deprecated in newer Hadoop releases; use "hadoop fs -ls -R" there.
hadoop fs -lsr | grep "^d" | while read -r f; do
  # Field 6 of the listing is the modification date (YYYY-MM-DD).
  dir_date=$(echo "$f" | awk '{print $6}')
  # Age of the directory in whole days.
  difference=$(( ( now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60) ))
  if [ "$difference" -gt "$1" ]; then
    echo "$f"
  fi
done
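
The question also asks about deleting the matches. A minimal sketch of that extension, assuming the pattern is passed as a second grep-style argument and that field 8 of the listing is the directory path (both are assumptions beyond the answer above; keep the echo dry-run until the output looks right):

#!/bin/bash
# Sketch: delete directories older than [days] whose path matches [pattern].
# The [pattern] argument and the delete step are not part of the original answer.
usage="Usage: $0 [days] [pattern]"

if [ -z "$1" ] || [ -z "$2" ]
then
  echo "$usage"
  exit 1
fi

now=$(date +%s)
hadoop fs -ls -R | grep "^d" | grep "$2" | while read -r f; do
  dir_date=$(echo "$f" | awk '{print $6}')
  dir_path=$(echo "$f" | awk '{print $8}')
  difference=$(( ( now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60) ))
  if [ "$difference" -gt "$1" ]; then
    echo hadoop fs -rm -r "$dir_path"   # remove "echo" to actually delete
  fi
done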


Answer 2:

If you happen to be using the CDH distribution of Hadoop, it comes with a very useful HdfsFindTool command, which behaves like Linux's find command.

If you're using the default parcel location, here's how you'd do it:

hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-*-job.jar \
org.apache.solr.hadoop.HdfsFindTool -find PATH -mtime +N

Where you'd replace PATH with the search path and N with the number of days.
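
For example, to list everything under a hypothetical /user/hive/warehouse path that was modified more than 30 days ago (path and day count are placeholders, not values from the answer):

hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-*-job.jar \
  org.apache.solr.hadoop.HdfsFindTool -find /user/hive/warehouse -mtime +30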



Answer 3:

For real clusters it is not a good idea to use ls: a recursive listing makes the NameNode walk the entire namespace. If you have admin rights, it is more suitable to work from the fsimage.

I have modified the script above to illustrate the idea.

First, fetch the fsimage:

curl "http://localhost:50070/getimage?getimage=1&txid=latest" > img.dump

Then convert it to text (the output format is the same as lsr gives):

hdfs oiv -i img.dump -o fsimage.txt
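
On newer Hadoop releases the getimage servlet is gone and the image viewer defaults to XML output; a rough equivalent there (verify the exact flags against your release) would be:

# Fetch the latest fsimage from the NameNode into the current directory
hdfs dfsadmin -fetchImage .
# Render it with one line per path; substitute the actual file name fetched above
hdfs oiv -p Delimited -i fsimage_0000000000000000000 -o fsimage.txt

Note that the Delimited format differs from the lsr-style listing, so the parsing in the script below would need adjusting.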

Script:

#!/bin/bash
usage="Usage: dir_diff.sh [days]"

if [ ! "$1" ]
then
  echo $usage
  exit 1
fi

now=$(date +%s)
# Fetch the latest fsimage from the NameNode and render it as text
curl "http://localhost:50070/getimage?getimage=1&txid=latest" > img.dump
hdfs oiv -i img.dump -o fsimage.txt
# Walk the text listing instead of hitting the NameNode with a recursive ls
grep "^d" fsimage.txt | while read -r f; do
  dir_date=$(echo "$f" | awk '{print $6}')
  difference=$(( ( now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60) ))
  if [ "$difference" -gt "$1" ]; then
    echo "$f"
  fi
done


Answer 4:

hdfs dfs -ls /hadoop/path/*.txt | awk '$6 < "2017-10-24"'
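
This filters the listing on the date column (field 6) against a hard-coded cutoff. The cutoff can be computed from the current date instead; a small variation assuming GNU date (the /hadoop/path/*.txt glob is kept from the answer):

# List entries whose modification date is more than 30 days old
cutoff=$(date -d "-30 days" +%F)
hdfs dfs -ls /hadoop/path/*.txt | awk -v c="$cutoff" '$6 < c'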



Tags: hadoop hdfs