Remove all files older than X days, but keep at least the Y youngest

Posted 2020-03-13 09:17

Question:

I have a script that removes DB dumps that are older than say X=21 days from a backup dir:

DB_DUMP_DIR=/var/backups/dbs
RETENTION=$((21*24*60))  # 3 weeks

find ${DB_DUMP_DIR} -type f -mmin +${RETENTION} -delete

But if for whatever reason the DB dump job fails to complete for a while, all dumps will eventually be thrown away. So as a safeguard I want to keep at least the youngest Y=7 dumps, even if all or some of them are older than 21 days.

I'm looking for something more elegant than this spaghetti:

DB_DUMP_DIR=/var/backups/dbs
RETENTION=$((21*24*60))  # 3 weeks
KEEP=7

find ${DB_DUMP_DIR} -type f -printf '%T@ %p\n' |  # list all dumps with epoch
sort -n |                                         # sort by epoch, oldest 1st
head --lines=-${KEEP} |                           # drop the youngest/bottom 7 dumps
while read -r date filename ; do                  # loop through the rest
    find "$filename" -mmin +${RETENTION} -delete  # delete if older than 21 days
done

(This snippet might have minor bugs - ignore them. It's to illustrate what I can come up with myself, and why I don't like it.)

Edit: The find option "-mtime" is off by one: "-mtime +21" actually means "at least 22 days old". That always confused me, so I use -mmin instead. Still off by one, but only by a minute.
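A quick way to see the rounding, as a sketch assuming GNU touch and find (probe is just a scratch file):

touch -d '21 days ago' probe
find . -name probe -mtime +21             # no match: the age truncates to exactly 21 days, and +21 means "more than 21"
find . -name probe -mmin +$((21*24*60))   # no match yet either; it starts matching one minute later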

Answer 1:

Use find to get all files that are old enough to delete, filter out the $KEEP youngest with tail, then pass the rest to xargs.

find ${DB_DUMP_DIR} -type f -mmin +$RETENTION -printf '%T@ %p\n' |
  sort -nr | tail -n +$((KEEP+1)) |
  cut -d' ' -f2- | xargs -r echo

Replace echo with rm if the reported list of files is the list you want to remove.

(I assume none of the dump files have newlines in their names.)
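If newlines in names were a concern, a NUL-delimited variant might look like this (a sketch assuming GNU findutils and coreutils, whose sort, tail, and cut all accept -z/--zero-terminated):

find ${DB_DUMP_DIR} -type f -mmin +$RETENTION -printf '%T@\t%p\0' |
  sort -znr |                 # newest first, NUL-terminated records
  tail -zn +$((KEEP+1)) |     # skip the $KEEP youngest candidates
  cut -z -f2- |               # drop the timestamp field (TAB-delimited)
  xargs -0 -r echo            # echo -> rm once the list looks right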



Answer 2:

I'm opening a second answer because I have a different solution, one using awk: add the 21-day period (in seconds) to each file's timestamp, subtract the current time, and remove the ones that come out negative (after sorting and dropping the newest 7 from the list):

DB_DUMP_DIR=/var/backups/dbs
RETENTION=$((21*24*60*60))  # 3 weeks, in seconds
CURR_TIME=$(date +%s)

find ${DB_DUMP_DIR} -type f -printf '%T@ %p\n' | \
  awk '{ print int($1) - '${CURR_TIME}' + '${RETENTION}' ":" $2 }' | \
  sort -n | head -n -7 | grep '^-' | cut -d ':' -f 2- | xargs -r rm -f
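To see the arithmetic on a single, hypothetical input line (a dump 25 days old):

echo "$(( $(date +%s) - 25*24*60*60 )) dump_old.tar.gz" |
  awk -v now="$(date +%s)" '{ print int($1) - now + 21*24*60*60 ":" $2 }'
# prints roughly -345600:dump_old.tar.gz -- negative, so grep '^-' would select it for deletion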


Answer 3:

You can use -mtime instead of -mmin which means you don't have to calculate the number of minutes in a day:

find $DB_DUMP_DIR -type f -mtime +21

Instead of deleting them, you could use the stat command to sort the files in order:

find $DB_DUMP_DIR -type f -mtime +21 | while read -r file
do
    stat -f "%-10m %40N" "$file"
done | sort -rn | awk 'NR > 7 {print $2}'

This will list all files older than 21 days, but not the seven youngest that are older than 21 days.

From there, you could feed this into xargs to do the remove:

find $DB_DUMP_DIR -type f -mtime +21 | while read -r file
do
    stat -f "%-10m %40N" "$file"
done | sort -rn | awk 'NR > 7 {print $2}' | xargs rm

Of course, this is all assuming that you don't have spaces in your file names. If you do, you'll have to take a slightly different tack.

This will also keep the seven youngest files over 21 days old. You might have files younger than that, in which case you don't really want to keep those seven old ones as well. However, you could simply run the same sequence again (except remove the -mtime parameter):

find $DB_DUMP_DIR -type f | while read -r file
do
    stat -f "%-10m %40N" "$file"
done | sort -rn | awk 'NR > 7 {print $2}' | xargs rm

You need to look at your stat command to see what the options are for the format. This varies from system to system. The one I used is for OS X. Linux is different.
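For instance, on Linux with GNU coreutils, a rough equivalent of the BSD invocation above would be:

stat -c "%Y %n" "$file"   # GNU stat: %Y = mtime in seconds since the epoch, %n = file name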


Let's take a slightly different approach. I haven't thoroughly tested this, but:

If all of the files are in the same directory, and none of the file names have whitespace in them:

ls -t | awk 'NR > 7 {print $0}'

Will print out all of the files except for the seven youngest files. Maybe we can go with that?

current_seconds=$(date +%s)   # Seconds since the epoch
((days = 60 * 60 * 24 * 21))  # Number of seconds in 21 days
((oldest_allowed = current_seconds - days)) # Oldest allowed timestamp
ls -t | awk 'NR > 7 {print $0}' | while read -r file
do
    date=$(stat -f "%Dm" "$file")             # mtime as epoch seconds (BSD stat)
    (( date < oldest_allowed )) && rm "$file"
done

The ls ... | awk will shave off the seven youngest. After that, we use stat to get each file's modification date. Since the date is in seconds since the epoch, we have to calculate what 21 days before the current time is, also in seconds since the epoch.

After that, it's pretty simple. We look at the date of the file. If it's older than the cutoff (i.e., its timestamp is lower), we can delete it.

As I said, I haven't thoroughly tested this, but it will delete all files over 21 days old, and only files over 21 days old, while always keeping the seven youngest.



Answer 4:

You could do the loop yourself:

t21=$(date -d "21 days ago" +%s)
cd "$DB_DUMP_DIR"
for f in *; do
    if (( $(stat -c %Y "$f") <= $t21 )); then
        echo rm "$f"
    fi
done

I'm assuming you have GNU date.
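On BSD or macOS, where date -d behaves differently, the -v adjustment flag should give the same result:

t21=$(date -v-21d +%s)   # 21 days ago as seconds since the epoch (BSD date)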



Answer 5:

None of these answers quite worked for me, so I adapted chepner's answer and came up with this, which simply retains the newest $KEEP backups.

find ${DB_DUMP_DIR} -printf '%T@ %p\n' | # print entries with modification time
  sort -n |                              # sort in date-ascending order
  head -n -$KEEP |                       # drop the $KEEP most recent entries
  awk '{ print $2 }' |                   # select the file paths
  xargs -r rm                            # remove the files
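One caveat: awk '{ print $2 }' truncates paths at the first space. If paths can contain spaces (but not newlines), a variant using cut and GNU xargs' -d option should cope:

find ${DB_DUMP_DIR} -printf '%T@ %p\n' |
  sort -n | head -n -$KEEP |
  cut -d' ' -f2- |       # keep the whole path, embedded spaces included
  xargs -r -d '\n' rm    # newline-delimited arguments (GNU xargs)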

I believe chepner's code retains the $KEEP oldest, rather than the youngest.



Answer 6:

Here is a Bash function that should do the trick. I couldn't easily avoid two invocations of find, but other than that it was a relative success:

#  A "safe" function for removing backups older than REMOVE_AGE + 1 day(s), always keeping at least the ALWAYS_KEEP youngest
remove_old_backups() {
    local file_prefix="${backup_file_prefix:-$1}"
    local temp=$(( REMOVE_AGE+1 ))  # for inverting the mtime argument: it's quirky ;)
    # We consider backups made on the same day to be one (commonly these are temporary backups in manual intervention scenarios)
    local keeping_n=`/usr/bin/find . -maxdepth 1 \( -name "$file_prefix*.tgz" -or -name "$file_prefix*.gz" \) -type f -mtime -"$temp" -printf '%Td-%Tm-%TY\n' | sort -d | uniq | wc -l`
    local extra_keep=$(( $ALWAYS_KEEP-$keeping_n ))

    /usr/bin/find . -maxdepth 1 \( -name "$file_prefix*.tgz" -or -name "$file_prefix*.gz" \) -type f -mtime +$REMOVE_AGE -printf '%T@ %p\n' |  sort -n | head -n -$extra_keep | cut -d ' ' -f2 | xargs -r rm
}

It takes a backup_file_prefix env variable, or it can be passed as the first argument, and expects the environment variables ALWAYS_KEEP (minimum number of files to keep) and REMOVE_AGE (number of days to pass to -mtime). It expects a gz or tgz extension. There are a few other assumptions, as you can see in the comments, mostly in the name of safety.
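A hypothetical invocation (the function searches the current directory, hence the cd):

export REMOVE_AGE=21 ALWAYS_KEEP=7
cd /var/backups/dbs && remove_old_backups "mydb_dump"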

Thanks to ireardon and his answer (which doesn't quite answer the question) for the inspiration!

Happy safe backup management :)



Answer 7:

Experimenting with the solutions given in the other answers, I found many bugs or unwanted behaviors.

Here is the solution I finally came up with:

  # Sample variable values
  BACKUP_PATH='/data/backup'
  DUMP_PATTERN='dump_*.tar.gz'
  NB_RETENTION_DAYS=10
  NB_KEEP=2                    # keep at least the 2 most recent files in all cases

  find ${BACKUP_PATH} -name "${DUMP_PATTERN}" \
    -mtime +${NB_RETENTION_DAYS} > /tmp/obsolete_files

  find ${BACKUP_PATH} -name "${DUMP_PATTERN}" \
    -printf '%T@ %p\n' | \
    sort -n            | \
    tail -n ${NB_KEEP} | \
    awk '{ print $2 }'   > /tmp/files_to_keep

  grep -F -v -f /tmp/files_to_keep /tmp/obsolete_files > /tmp/files_to_delete

  xargs -r rm < /tmp/files_to_delete
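One subtlety worth noting: grep -F -f matches substrings, so a kept path that happens to be a substring of an obsolete one would shield it from deletion. Adding -x restricts matching to whole lines:

  grep -F -x -v -f /tmp/files_to_keep /tmp/obsolete_files > /tmp/files_to_delete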

The ideas are:

  • Most of the time, I just want to keep files that are no older than NB_RETENTION_DAYS days.
  • However, shit happens, and when for some reason there are no recent files anymore (the backup scripts are broken), I don't want to remove the NB_KEEP most recent ones, for safety (NB_KEEP should be at least 1).

In my case, I have 2 backups a day and set NB_RETENTION_DAYS to 10 (thus, I normally have 20 files). One could think that I would thus set NB_KEEP=20, but in fact I chose NB_KEEP=2, and here's why:

Let's imagine my backup scripts are broken and I haven't had a backup for a month. I really don't care about having the 20 latest files if they are all more than 30 days old; having at least one is what I want. However, being able to easily identify that there is a problem is very important (obviously my monitoring system is really blind, but that's another point), and a backup folder with 10 times fewer files than usual might just ring a bell...