How to find duplicate files in large filesystem wh

I am trying to avoid duplicates in my mp3 collection (quite large). I want to check for duplicates by checking file contents, instead of looking for same file name. I have written the code below to do this but it throws a MemoryError after about a minute. Any suggestions on how I can get this to work?

import os
import hashlib

walk = os.walk('H:\MUSIC NEXT GEN')

mySet = set()
dupe  = []

hasher = hashlib.md5()

for dirpath, subdirs, files in walk:
    for f in files:
        fileName =  os.path.join(dirpath, f)
        with open(fileName, 'rb') as mp3:
            buf = mp3.read()
            hasher.update(buf)
            hashKey = hasher.hexdigest()
            print hashKey
            if hashKey in mySet:
                dupe.append(fileName)
            else:
                mySet.add(hashKey)


print 'Dupes: ' + str(dupe)

标签： python hash duplicates out-of-memory

2条回答

做个烂人

2楼-- · 2019-08-14 21:03

You probably have a huge file that can't be read at once like you try with mp3.read(). Read smaller parts instead. Putting it into a nice little function also helps keeping your main program clean. Here's a function I've been using myself for a while now (just slightly polished it now) for a tool probably similar to yours:

import hashlib

def filehash(filename):
    with open(filename, mode='rb') as file:
        hasher = hashlib.md5()
        while True:
            buffer = file.read(1 << 20)
            if not buffer:
                return hasher.hexdigest()
            hasher.update(buffer)

Update: A readinto version:

buffer = bytearray(1 << 20)
def filehash(filename):
    with open(filename, mode='rb') as file:
        hasher = hashlib.md5()
        while True:
            n = file.readinto(buffer)
            if not n:
                return hasher.hexdigest()
            hasher.update(buffer if n == len(buffer) else buffer[:n])

With a 1GB file already cached in memory and ten attempts, this took on average 5.35 seconds. The read version took on average 6.07 seconds. In both versions, the Python process occupied about 10MB of RAM during the run.

I'll probably stick with the read version, as I prefer its simplicity and because in my real use cases, the data isn't already cached in RAM and I use sha256 (so the overall time goes up significantly and makes the little advantage of readinto even more irrelevant).

0人赞添加讨论(0) 举报

迷人小祖宗

3楼-- · 2019-08-14 21:12

hasher.update appends the content to the previous. You may want to create a new hasher for each file

0人赞添加讨论(0) 举报

How to find duplicate files in large filesystem wh

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间