Detect duplicate MP3 files with different bitrates

Posted 2019-01-08 16:13

How can I detect (preferably with Python) duplicate MP3 files that are the same song but may be encoded at different bitrates and may carry incorrect ID3 tags?

I know I can compute an MD5 checksum of the files' content, but that won't work across different bitrates, and I don't know whether ID3 tags affect the MD5 checksum. Should I re-encode the MP3 files that have a different bitrate and then compute the checksum? What do you recommend?

9 answers
祖国的老花朵
#2 · 2019-01-08 16:15

I don't think simple checksums will ever work:

  1. ID3 tags will affect the MD5
  2. Different encoders will encode the same song in different ways - so the checksums will be different
  3. Different bit-rates will produce different checksums
  4. Re-encoding an mp3 to a different bit-rate will probably sound terrible and will certainly be different to the original audio compressed in one step.

I think you'll have to compare ID3 tags, song length, and filenames.
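To illustrate point 1 only, here is a minimal sketch that hashes an MP3 file's bytes after skipping a leading ID3v2 tag and a trailing ID3v1 tag. The helper names `audio_payload` and `audio_md5` are made up for this example. It removes the influence of those tags on the checksum, but it does nothing about points 2 and 3: two encodings of the same song will still hash differently.

```python
import hashlib


def audio_payload(data: bytes) -> bytes:
    """Return the file bytes with a leading ID3v2 and trailing ID3v1 tag removed."""
    start, end = 0, len(data)
    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4-byte synchsafe size.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)  # 7 useful bits per byte
        start = 10 + size
        if data[5] & 0x10:  # footer-present flag adds another 10 bytes
            start += 10
    # ID3v1 tag: fixed 128-byte trailer that begins with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return data[start:end]


def audio_md5(data: bytes) -> str:
    """MD5 of the tag-stripped payload (hypothetical helper for this sketch)."""
    return hashlib.md5(audio_payload(data)).hexdigest()
```

Calling `audio_md5(open(path, "rb").read())` on two copies of the same encode that differ only in their tags should give the same digest. Other tag formats (APE tags, for instance) are not handled here.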

Emotional °昔
#3 · 2019-01-08 16:16

For tag issues, Picard may indeed be a very good bet. If, having identified two potentially duplicate files, what you want is to extract bitrate information from them, have a look at mp3guessenc.

干净又极端
#4 · 2019-01-08 16:24

You can use AcoustID, the successor to the PUID fingerprints used by MusicBrainz:

AcoustID is an open source project that aims to create a free database of audio fingerprints with mapping to the MusicBrainz metadata database and provide a web service for audio file identification using this database...

...fingerprints along with some metadata necessary to identify the songs to the AcoustID database...

You will find various client libraries and examples for the webservice at https://acoustid.org/
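As a rough sketch of how two fingerprints might be compared locally: Chromaprint, the fingerprinter behind AcoustID, can output a fingerprint as a list of 32-bit integers (for example via `fpcalc -raw`), and a common way to compare two such lists is to count differing bits. The function names and the 0.15 threshold below are assumptions made for illustration, not part of any AcoustID API, and the sketch assumes both fingerprints start at the same point in the audio.

```python
def bit_error_rate(fp_a, fp_b):
    """Fraction of differing bits over the overlapping part of two fingerprints."""
    n = min(len(fp_a), len(fp_b))
    if n == 0:
        raise ValueError("fingerprints must not be empty")
    # XOR leaves a 1 wherever the two 32-bit values disagree.
    differing = sum(
        bin((a ^ b) & 0xFFFFFFFF).count("1") for a, b in zip(fp_a, fp_b)
    )
    return differing / (32 * n)


def probably_same_recording(fp_a, fp_b, threshold=0.15):
    """Hypothetical decision rule; the threshold is a guess to be tuned."""
    return bit_error_rate(fp_a, fp_b) <= threshold
```

A low error rate suggests the same recording even when the bitrates differ; files with different leading silence would need an alignment step that this sketch leaves out.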

我命由我不由天
#5 · 2019-01-08 16:25

I'd use length as my primary heuristic. That's what iTunes does when it's trying to identify a CD using the Gracenote database. Measure the lengths in milliseconds rather than seconds. Remember, this is only a heuristic: you should definitely listen to any detected duplicates before deleting them.
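A minimal sketch of that heuristic, assuming you already have each track's duration in milliseconds (a tag library such as mutagen could supply it, e.g. `MP3(path).info.length` in seconds). The function name `group_by_duration` and the 500 ms tolerance are assumptions for illustration.

```python
def group_by_duration(tracks, tolerance_ms=500):
    """Group (name, duration_ms) pairs whose durations lie close together.

    Tracks are sorted by duration; a new group starts whenever the gap to
    the previous track exceeds tolerance_ms. Only groups with at least two
    members are returned, as candidate duplicates.
    """
    groups, current = [], []
    for name, duration in sorted(tracks, key=lambda t: t[1]):
        if current and duration - current[-1][1] > tolerance_ms:
            if len(current) > 1:
                groups.append([n for n, _ in current])
            current = []
        current.append((name, duration))
    if len(current) > 1:
        groups.append([n for n, _ in current])
    return groups


candidates = group_by_duration([
    ("a.mp3", 215000),
    ("b.mp3", 215300),
    ("c.mp3", 180000),
    ("d.mp3", 300000),
])
# candidates == [["a.mp3", "b.mp3"]]
```

Because neighbours are chained, a long run of tracks each slightly longer than the last can end up in one group, so the result is only a list of candidates to check by ear.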

Explosion°爆炸
#6 · 2019-01-08 16:30

This is exactly the question that people at the old AudioScrobbler, and currently at MusicBrainz, have been working on for a long time. For the time being, the Python project that can aid in your quest is Picard, which will tag audio files (not only MPEG 1 Layer 3 files) with a GUID (actually, several of them); from then on, matching the tags is quite simple.

If you prefer to do it as a project of your own, libofa might be of help.

The star"
#7 · 2019-01-08 16:32

As the others said, simple checksums won't detect duplicates with different bitrates or ID3 tags. What you need is an audio fingerprint algorithm. The Python Audioprocessing Suite has such an algorithm, but I can't say anything about how reliable it is.

http://rudd-o.com/new-projects/python-audioprocessing
