So I have 600,000+ images. My estimate is that roughly 5-10% of these are corrupted, and I'm generating a log of exactly which images are affected.
Using Python, my approach so far is this:
```python
from PIL import Image

def img_validator(source):
    files = get_paths(source)  # a list of full paths to each image
    invalid_files = []
    for img in files:
        try:
            im = Image.open(img)
            im.verify()
            im.close()
        except (IOError, OSError, Image.DecompressionBombError):
            invalid_files.append(img)
    # Write invalid_files to file
```
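For context, the final step above can be sketched so that each invalid path is appended to the log the moment it is found, rather than held in memory until the end; that way, if a long run dies or has to be killed, no results are lost. This is only a sketch: `get_paths` is assumed (as in my code) to return full paths to every file under `source`, and the log file name is arbitrary.

```python
import os
from PIL import Image

def get_paths(source):
    """Assumed helper: walk the directory tree and return full file paths."""
    return [os.path.join(root, name)
            for root, _dirs, names in os.walk(source)
            for name in names]

def img_validator(source, log_path="invalid_files.log"):
    """Append each invalid image path to log_path as soon as it is found."""
    files = get_paths(source)
    with open(log_path, "a") as log:
        for img in files:
            try:
                with Image.open(img) as im:
                    im.verify()
            except (IOError, OSError, Image.DecompressionBombError):
                log.write(img + "\n")
                log.flush()  # persist immediately so an interrupted run keeps its results
```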
The first 200-250K images are quite fast to process, taking only around 1-2 hours. I left the process running overnight (it was at 230K at the time); 8 hours later it had only reached 310K, though it was still progressing.
Does anyone have an idea why that is? At first I thought it might be because the images are stored on an HDD, but that doesn't really make sense given how fast the first 200-250K were.