Checking for corrupted files in directory with hun

2020-04-20 07:09发布

问题:

So I have 600,000+ images. My estimate is that roughly 5-10% of these are corrupted. I'm generating a log of exactly which images this pertains to.

Using Python, my approach thus far is this:

def img_validator(source):
    files = get_paths(source)  # A list of complete paths to each image
    invalid_files = []
    for img in files:
        try:
            im = Image.open(img)
            im.verify()
            im.close()
        except (IOError, OSError, Image.DecompressionBombError):
            invalid_files.append(img)

     # Write invalid_files to file

The first 200-250K images are quite fast to process, only around 1-2 hours. I left the process running overnight (at the time it was at 230K), 8 hours later it was only at 310K, but still progressing.

Anyone got an idea of why that is? At first I thought it might be due to the images being stored on an HDD, but that doesn't really make sense seeing as it was very fast the first 200-250k.

回答1:

If you have that many images, I would suggest you use multiprocessing. I created 100,000 files of which 5% were corrupt and checked them like this:

#!/usr/bin/env python3

import glob
from multiprocessing import Pool
from PIL import Image

def CheckOne(f):
    try:
        im = Image.open(f)
        im.verify()
        im.close()
        # DEBUG: print(f"OK: {f}")
        return
    except (IOError, OSError, Image.DecompressionBombError):
        # DEBUG: print(f"Fail: {f}")
        return f

if __name__ == '__main__':
    # Create a pool of processes to check files
    p = Pool()

    # Create a list of files to process
    files = [f for f in glob.glob("*.jpg")]

    print(f"Files to be checked: {len(files)}")

    # Map the list of files to check onto the Pool
    result = p.map(CheckOne, files)

    # Filter out None values representing files that are ok, leaving just corrupt ones
    result = list(filter(None, result)) 
    print(f"Num corrupt files: {len(result)}")

Sample Output

Files to be checked: 100002
Num corrupt files: 5001

That takes 1.6 seconds on my 12-core CPU with NVME disk, but should still be noticeably faster for you.