Identifying .mp3 files by content, not by name, with a shell script

Posted 2019-07-13 11:17

I would like to write a script that finds duplicate mp3s by content and not by file name. I am wondering how one goes about inspecting a file's inner data for the sake of comparison. Thank you.

Tags: bash shell
4 answers
别忘想泡老子
#2 · 2019-07-13 11:40

I use this script for my photos, but it can be used for other files.

  • First I transfer pictures from my phone/camera to the directory newfiles
  • Then I run this script from my pictures root directory
    • When it detects duplicated files, the script keeps one file and moves the others to the directory ../garbage
    • When one of the duplicates is in newfiles, the script moves that copy first

Caution: This script does not compare file content; it detects files that have the same size and name (this is OK for camera files). My other answer is based on content comparison (md5sum).

#!/bin/bash

# If a file in directory 'newfiles' has the same size & name
# as a file in another directory,
# then move the file from 'newfiles' to 'garbage'
find newfiles/ -type f -printf '%s %f\n' |
while read -r SIZE f
do
   find . -name "$f" -size "${SIZE}c" |
     grep -v 'newfiles' &&
     find . -name "$f" -size "${SIZE}c" -path '*newfiles*' -exec mv -v '{}' ../garbage ';' &&
     echo
done

# Detect all other duplicated files
# Keep the first occurrence and move all others to 'garbage'
find . -type f -printf '%s %f\n' |
  LC_ALL=C sort |  # LC_ALL=C disables locale => sort is faster
  uniq -dc      |  # keep duplicates and count number of occurrences
  while read -r n SIZE f
  do
    echo -e "\n_____ $n files\t$SIZE bytes\tname: $f"
    find . -type f -name "$f" -size "${SIZE}c" |
       tail -n +2 |  # skip the first occurrence so that one copy is kept
       xargs -r -d '\n' mv -v -t ../garbage
  done
我命由我不由天
#3 · 2019-07-13 11:46

cmp can be used to compare binary files.

cmp file1.mp3 file2.mp3
if [[ $? -eq 0 ]]; then echo "Matched"; fi

The cmp command returns 0 if the files are identical, 1 if they differ, and 2 if an error occurs (for example, when a file cannot be read).
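
As a minimal sketch of testing that exit status directly: the helper name same_content and the throwaway sample files are hypothetical, not part of the answer above. The -s option silences cmp so that only the status is used.

```shell
#!/bin/bash

# same_content FILE1 FILE2
# Hypothetical helper: succeeds (status 0) only when both files are
# byte-for-byte identical. cmp exits 0 for identical files, 1 for
# different files and 2 when a file cannot be read.
same_content() {
  cmp -s -- "$1" "$2"
}

# Example usage with throwaway files (the contents are made up)
tmpdir=$(mktemp -d)
printf 'abc' > "$tmpdir/a.mp3"
printf 'abc' > "$tmpdir/b.mp3"
printf 'abd' > "$tmpdir/c.mp3"

if same_content "$tmpdir/a.mp3" "$tmpdir/b.mp3"; then echo "a and b: Matched"; fi
if ! same_content "$tmpdir/a.mp3" "$tmpdir/c.mp3"; then echo "a and c: Differ"; fi

rm -rf "$tmpdir"
```

Putting cmp in the if condition avoids the separate $? test and keeps the comparison and the decision on one line.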

小情绪 Triste *
#4 · 2019-07-13 11:46

If the files are really byte-for-byte identical, you can start by searching for files of the same size. If their sizes match, you can investigate further (e.g. compare their md5sum). If the files just contain the same song but use a different codec, compression, or the like, bash is probably not the right tool for the task.
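
The size-first idea could be sketched roughly as follows. The function name find_dups is hypothetical, the sketch confirms candidates with cmp rather than md5sum, it relies on GNU find's -printf, and it assumes file names contain no tab or newline characters.

```shell
#!/bin/bash

# find_dups DIR
# Hypothetical helper: prints "first == duplicate" for byte-identical
# files under DIR.
# Step 1: list "size<TAB>path" and sort numerically so that files of
#         equal size end up on adjacent lines.
# Step 2: compare each file only against earlier files of the same size.
# Assumption: file names contain no tab or newline characters.
find_dups() {
  local dir=$1 size path prev_size="" candidate
  local -a group=()
  find "$dir" -type f -printf '%s\t%p\n' |
    LC_ALL=C sort -n |
    while IFS=$'\t' read -r size path; do
      if [[ $size != "$prev_size" ]]; then
        group=()          # new size: start a fresh candidate group
        prev_size=$size
      fi
      for candidate in "${group[@]}"; do
        if cmp -s -- "$candidate" "$path"; then
          printf '%s == %s\n' "$candidate" "$path"
          break           # one match is enough to flag the duplicate
        fi
      done
      group+=("$path")
    done
}
```

Calling find_dups . from a music directory lists the pairs. Files whose size is unique are never read, which is where the time saving comes from.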

疯言疯语
#5 · 2019-07-13 11:47

The first command line lists all files under the current directory that have the same size and the same md5sum

find . -type f -printf '%11s ' -exec md5sum '{}' ';' | 
  sort | uniq -w44 --all-repeated=separate

The second command line is

  • faster, because it calculates the md5sum only for files that have the same size
  • more robust, because it handles filenames containing special characters such as spaces or newlines

It is therefore also more complex

find . -type f -printf '%11s %P\0' |
  LC_ALL=C sort -z |
  uniq -Dzw11 |
  while IFS= read -r -d '' line
  do
    md5sum "${line:12}"
  done |
  LC_ALL=C sort |  # bring identical checksums next to each other for uniq
  uniq -w32 --all-repeated=separate |
  tee duplicated.log

Some explanations

# Print file size/md5sum/name in one line (size aligned in 11 characters)
find . -type f -printf '%11s ' -exec md5sum '{}' ';'

# Print duplicated lines considering the first 44 characters only
# 44 characters = size (11 characters) + one space + md5sum (32 characters)
uniq -w44 --all-repeated=separate

# Print size and path/filename terminated by a null character
find . -type f -printf '%11s %P\0'

# Sort lines separated by a null character (-z) instead of a newline character
# based on native byte value (LC_ALL=C) instead of the locale
LC_ALL=C sort -z

# Read lines separated by null character
IFS= read -r -d '' line

# Skip the first 12 characters (size and space) 
# in order to obtain the rest: path/filename
"${line:12}"
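
To see why skipping 12 characters recovers the path, here is a small sketch that builds one such line by hand. The size 4096 and the path are made-up sample values, and printf merely stands in for find's '%11s %P' format.

```shell
#!/bin/bash

# Build a line the way find's '%11s %P' format does:
# the size right-aligned in 11 columns, one space, then the path.
# The size and path below are made-up sample values.
line=$(printf '%11s %s' 4096 'music/my song.mp3')

size=${line:0:11}    # first 11 characters: the space-padded size
path=${line:12}      # everything after the 12-character prefix

echo "size=[$size]"
echo "path=[$path]"
```

Because the prefix has a fixed width, the path is recovered by position rather than by splitting on whitespace, so spaces inside the file name survive.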