Question:
I would like to write a script that finds duplicate mp3's by content and not by file name. I am wondering how one goes about reading a file's inner data for the sake of comparison. Thank you.
Answer 1:
cmp can be used to compare binary files:

cmp file1.mp3 file2.mp3
if [[ $? -eq 0 ]]; then echo "Matched"; fi

The cmp command exits with status 0 if the files are identical, 1 if they differ, and 2 if an error occurs.
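A minimal sketch built on that idea, assuming the mp3 files sit in the current directory (cmp -s suppresses output, so only the exit status is checked):

#!/bin/bash
# Compare every pair of .mp3 files in the current directory.
# cmp -s is silent; only the exit status (0 = identical) matters.
shopt -s nullglob
files=( *.mp3 )
for (( i = 0; i < ${#files[@]}; i++ )); do
  for (( j = i + 1; j < ${#files[@]}; j++ )); do
    if cmp -s -- "${files[i]}" "${files[j]}"; then
      echo "Duplicate content: ${files[i]} == ${files[j]}"
    fi
  done
done

Note this compares every pair, so it is O(n²) in the number of files; the size/md5sum approaches in the other answers scale better for large collections.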
Answer 2:
This first command line lists all files having the same size and the same md5sum, searching from the current directory:
find . -type f -printf '%11s ' -exec md5sum '{}' ';' |
sort | uniq -w44 --all-repeated=separate
The second command line is:
- faster, because it computes the md5sum only for the files having the same size
- more robust, because it handles file names containing special characters such as spaces or newlines

It is therefore also more complex:
find . -type f -printf '%11s %P\0' |
LC_ALL=C sort -z |
uniq -Dzw11 |
while IFS= read -r -d '' line
do
md5sum "${line:12}"
done |
uniq -w32 --all-repeated=separate |
tee duplicated.log
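(The final tee both displays the duplicated groups and saves them to duplicated.log.)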
Some explanations:
# Print file size/md5sum/name in one line (size aligned in 11 characters)
find . -printf '%11s ' -exec md5sum '{}' ';'
# Print duplicated lines considering the first 44 characters only
# 44 characters = size (11 characters) + one space + md5sum (32 characters)
uniq -w44 --all-repeated=separate
# Print size and path/filename terminated by a null character
find . -printf '%11s %P\0'
# Sort lines separated by a null character (-z) instead of a newline character
# based on native byte value (LC_ALL=C) instead of locale
LC_ALL=C sort -z
# Read lines separated by null character
IFS= read -r -d '' line
# Skip the first 12 characters (size and space)
# in order to obtain the rest: path/filename
"${line:12}"
Answer 3:
If the files are really byte-for-byte identical, you can start by searching for files of the same size. If their size is the same, you can investigate further (e.g. compare their md5sum). If the files just contain the same song but use a different codec/compression/whatever, bash is probably not the right tool for the task.
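For the byte-identical case, the size-first idea is easy to sketch in bash, reusing the md5sum grouping idiom from the answer above (the search directory is an assumption, and re-running find once per duplicated size makes this a sketch rather than an optimized solution):

#!/bin/bash
# List sizes that occur more than once,
# then hash only the files having one of those sizes.
find . -type f -printf '%s\n' | sort -n | uniq -d |
while read -r size
do
  find . -type f -size "${size}c" -exec md5sum '{}' ';'
done |
sort | uniq -w32 --all-repeated=separate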
Answer 4:
I use this script for my photos, but it can be used for other files.
- First I transfer pictures from my phone/camera to the directory newfiles
- Then I run this script from my pictures root directory
- On detection of duplicated files, the script keeps one file and moves the other ones to the directory ../garbage
- When choosing which copy to move, the script moves the file from newfiles in priority
Caution: This script does not compare file content, but it detects files having the same size & name (this is OK for camera files). My other answer is based on content comparison (md5sum).
#!/bin/bash
# If a file from directory 'newfiles' has the same size & name
# as a file in another directory,
# then move the file from 'newfiles' to 'garbage'
find newfiles/ -type f -printf '%s %f\n' |
while read -r SIZE f
do
  find . -name "$f" -size ${SIZE}c |
  grep -v 'newfiles' &&
  find . -name "$f" -size ${SIZE}c -path '*newfiles*' -exec mv -v '{}' ../garbage ';' &&
  echo
done
# Detect all other duplicated files
# Keep the first occurrence and move all the others to 'garbage'
find . -type f -printf '%s %f\n' |
LC_ALL=C sort |  # LC_ALL=C disables locale => sort is faster
uniq -dc |       # keep duplicates and count the number of occurrences
while read -r n SIZE f
do
  echo -e "\n_____ $n files\t$SIZE bytes\tname: $f"
  find . -name "$f" -size ${SIZE}c |
  tail -n +2 |   # skip the first occurrence (the one we keep)
  xargs mv -v -t ../garbage
done
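Two assumptions worth stating if you try this script: mv -t fails when the target directory does not exist, so create it first (mkdir -p ../garbage), and xargs splits its input on whitespace, so this part shares the same "no special characters in file names" caveat mentioned above.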