I'm looking for ways to find image duplicates by fingerprinting. I understand that this is done by applying hash functions on images, and each image would have a unique hash value.
I am fairly new to image processing and don't know much about hashing. How exactly am I supposed to apply hash functions and generate hash values?
Thanks in advance
You need to be careful with hashing: some image formats, such as JPEG and PNG, store dates/times and other information within images, and that will make two identical images appear to be different to normal tools such as md5 and cksum.
Here is an example. Make two images, both identical red squares of 128x128, at the command line in Terminal with ImageMagick:
Now check their MD5 sums:
Or their checksums:
Oops, they are different according to both md5 and cksum. Why? Because the dates are 1 second apart.
I would suggest you use ImageMagick to checksum "just the image data" and not the metadata - unless, of course, the date is important to you:
Now they are both the same, because the image is the same - just the metadata differs.
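To make the effect concrete without ImageMagick, here is a rough standard-library Python sketch. It is not what ImageMagick does internally; it only illustrates the idea. It builds two tiny red PNGs that differ only in their tIME (timestamp) chunk, then compares a whole-file MD5 against an MD5 of just the header and the decompressed image stream. The helper names (`chunk`, `red_square_png`, `pixel_stream_md5`) are made up for this example, and the "pixel-only" hash is a simplification: it would still differ between files that encode the same pixels with different PNG filters, which a real decoder-based checksum would not.

```python
import hashlib
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def chunk(kind: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)

def red_square_png(size: int, second: int) -> bytes:
    """Build a red RGB PNG whose tIME chunk records the given second."""
    ihdr = struct.pack(">IIBBBBB", size, size, 8, 2, 0, 0, 0)
    row = b"\x00" + b"\xff\x00\x00" * size      # filter byte 0, then red pixels
    idat = zlib.compress(row * size)
    stamp = struct.pack(">HBBBBB", 2020, 1, 1, 12, 0, second)
    return (PNG_SIGNATURE + chunk(b"IHDR", ihdr) + chunk(b"tIME", stamp)
            + chunk(b"IDAT", idat) + chunk(b"IEND", b""))

def pixel_stream_md5(png: bytes) -> str:
    """MD5 of IHDR plus the decompressed IDAT stream; other chunks are ignored."""
    pos = len(PNG_SIGNATURE)
    header, compressed = b"", b""
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        kind = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        if kind == b"IHDR":
            header = data
        elif kind == b"IDAT":
            compressed += data
        pos += 12 + length                      # length + type + data + CRC
    return hashlib.md5(header + zlib.decompress(compressed)).hexdigest()

a = red_square_png(128, second=0)
b = red_square_png(128, second=1)
print(hashlib.md5(a).hexdigest() == hashlib.md5(b).hexdigest())  # False
print(pixel_stream_md5(a) == pixel_stream_md5(b))                # True
```

The two files differ by a single timestamp byte (plus its CRC), which is enough to change the whole-file MD5, while the image-only hash stays the same.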
Of course, you may be more interested in "Perceptual Hashing" where you just get an idea if two images "look similar". If so, look here.
Or you may be interested in allowing slight differences in brightness, or orientation, or cropping - which is another topic altogether.
There are many ways you can achieve this, but the simplest would be to convert the image to a base64 string and then use a standard hashing library. In Python it will look something like:
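A minimal sketch of that approach for Python 3, assuming `hashlib` for the MD5 step. The function name `hash_image_file` is just an illustrative choice; the variable names `encoded_string` and `m` follow the explanation in this answer.

```python
import base64
import hashlib

def hash_image_file(path):
    """Return the MD5 hex digest of a file's base64-encoded contents."""
    with open(path, "rb") as image_file:
        # Read the raw bytes and encode them as one long base64 string.
        encoded_string = base64.b64encode(image_file.read())
    m = hashlib.md5()
    m.update(encoded_string)    # feed the data into the hash object
    return m.hexdigest()        # 32-character hex string
```

Calling `hash_image_file("foo.jpg")` on two byte-identical files gives the same digest. Note that `hashlib` accepts raw bytes directly, so the base64 step is not strictly required; it is kept here to match the description.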
If you just think of a hash function as turning one (possibly large) string into another you should be alright. In the above code m.update() feeds encoded_string (a very large base64 string) into the hash, and m.hexdigest() returns the result as a much smaller hex string.
You can read the Python docs for the md5 library here (in Python 3, MD5 is provided by the hashlib module), but there should be something similar in whatever language you are using.
If you're interested in finding near duplicates, which includes images that have been resized, you could apply difference hashing. More on hashing here. The code below is edited from a Real Python blog post to make it work in Python 3. It uses the hashing library linked to above, which has information on different kinds of hashing. You should be able to just copy and paste the scripts and run them both directly from the command line without editing the scripts.
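To show what a difference hash actually computes, here is a pure-Python sketch that works on a grayscale image given as a list of rows. It is not the hashing library's implementation: the helper names (`shrink`, `dhash`, `hamming`) are invented for this example, the downscaling is a simple box average, and the bit convention (1 when the left pixel is brighter than its right neighbour) is one of several in use.

```python
def shrink(pixels, width, height):
    """Box-average a 2D list of grayscale values down to width x height."""
    src_h, src_w = len(pixels), len(pixels[0])
    out = []
    for y in range(height):
        y0 = y * src_h // height
        y1 = max(y0 + 1, (y + 1) * src_h // height)
        row = []
        for x in range(width):
            x0 = x * src_w // width
            x1 = max(x0 + 1, (x + 1) * src_w // width)
            block = [pixels[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def dhash(pixels, hash_size=8):
    """Difference hash: one bit per horizontally adjacent pair of cells."""
    small = shrink(pixels, hash_size + 1, hash_size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (left > right)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

# A left-to-right ramp, the same ramp scaled up 2x, and its negative.
gradient = [[x for x in range(72)] for _ in range(64)]
doubled = [[v for v in row for _ in (0, 1)] for row in gradient for _ in (0, 1)]
inverted = [[255 - v for v in row] for row in gradient]

print(hamming(dhash(gradient), dhash(doubled)))   # 0: resizing kept the hash
print(hamming(dhash(gradient), dhash(inverted)))  # 64: every bit differs
```

Because the hash only records whether brightness rises or falls between neighbouring cells, the resized copy gets the same hash. For near duplicates you would accept a small Hamming distance rather than requiring an exact match.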
This first script (index.py) creates a difference hash for each image, and then puts the hash in a shelf, or persistent dictionary that you can access later like a database, together with the image filename(s) that have that hash:
Run on the command line:
Run in Jupyter notebook
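The indexing step described above can be sketched with the standard `shelve` module as follows. This is not the original index.py: `build_index` and `file_md5` are hypothetical names, and `file_md5` (an MD5 of the file bytes) is only a stand-in so the example runs without an imaging library. To find resized copies you would pass a function that returns a difference hash as a string instead.

```python
import hashlib
import os
import shelve

def file_md5(path):
    """Stand-in hash: MD5 of the file bytes (swap in a difference hash)."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def build_index(image_paths, shelf_path, hash_func=file_md5):
    """Store a mapping of hash -> list of filenames in a shelf on disk."""
    with shelve.open(shelf_path) as db:
        for path in image_paths:
            key = hash_func(path)
            # Reassign instead of appending in place, so the shelf
            # actually writes the updated list back to disk.
            db[key] = db.get(key, []) + [os.path.basename(path)]
```

Images that share a hash end up in the same list, which is what makes the later lookup a single dictionary access.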
Now that everything is stored in a shelf, you can query the shelf with an image filename you want to check, and it will print out the file names of images that match, and also open the matching images (search.py):
Run on the command line to look through image_directory for images with the same hash as ./directory/someimage.jpg:
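The lookup step can be sketched in the same spirit. Again this is not the original search.py: `find_matches` and `file_md5` are hypothetical names, the MD5-of-bytes function is only a stand-in for a difference hash, and the sketch returns the matching filenames instead of opening the images. It assumes the shelf was built with the same hash function and maps each hash to a list of filenames.

```python
import hashlib
import shelve

def file_md5(path):
    """Stand-in hash: MD5 of the file bytes (swap in a difference hash)."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def find_matches(query_path, shelf_path, hash_func=file_md5):
    """Return the filenames stored under the query image's hash."""
    # Open read-only: a lookup should never modify the index.
    with shelve.open(shelf_path, flag="r") as db:
        return db.get(hash_func(query_path), [])
```

An empty list means no indexed image has the same hash as the query.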
Again, this is modified from the Real Python blog post linked above (which is written for Python 2.7) and should work out of the box! Just change the command line as you need to. If I remember correctly, the Python 2/3 issue was just with argparse and not the image libraries.