Image Comparison by Fingerprinting

Published 2019-02-25 13:57

Question:

I'm looking for ways to find duplicate images by fingerprinting. I understand that this is done by applying a hash function to each image, so that each image gets a unique hash value.

I am fairly new to image processing and don't know much about hashing. How exactly am I supposed to apply hash functions and generate hash values?

Thanks in advance

Answer 1:

You need to be careful with hashing: some image formats, such as JPEG and PNG, store dates/times and other information inside the file, and that will make two identical images appear different to normal tools such as md5 and cksum.

Here is an example. Make two identical 128x128 red squares at the command line with ImageMagick:

convert -size 128x128 xc:red a.png
convert -size 128x128 xc:red b.png

Now check their MD5 sums:

md5 [ab].png
MD5 (a.png) = b4b82ba217f0b36e6d3ba1722f883e59
MD5 (b.png) = 6aa398d3aaf026c597063c5b71b8bd1a

Or their checksums:

cksum [ab].png
4158429075 290 a.png
3657683960 290 b.png

Oops, they are different according to both md5 and cksum. Why? Because the embedded timestamps are 1 second apart.

I would suggest you use ImageMagick to checksum "just the image data" and not the metadata - unless, of course, the date is important to you:

identify -format %# a.png
e74164f4bab2dd8f7f612f8d2d77df17106bac77b9566aa888d31499e9cf8564

identify -format %# b.png
e74164f4bab2dd8f7f612f8d2d77df17106bac77b9566aa888d31499e9cf8564

Now they are both the same, because the image is the same - just the metadata differs.
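
If you would rather do the same thing in Python, a sketch along these lines hashes only the decoded pixel data with Pillow and hashlib, so the metadata never enters the digest. It won't reproduce ImageMagick's signature above (a different digest is used), but it has the same property: identical pixels give identical hashes. The filenames are just the two files created earlier:

import hashlib
from PIL import Image

def pixel_md5(path):
    # Hash only the decoded pixel bytes, ignoring any embedded metadata
    with Image.open(path) as img:
        return hashlib.md5(img.tobytes()).hexdigest()

print(pixel_md5("a.png"))
print(pixel_md5("b.png"))   # same digest as a.png, despite the different timestamps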

Of course, you may be more interested in "Perceptual Hashing" where you just get an idea if two images "look similar". If so, look here.
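
To give a flavour of that, the imagehash package (the same library used in the answer further down) can compare perceptual hashes and treat a small Hamming distance as "looks similar". A rough sketch; the filenames and the 5-bit threshold are just illustrative choices, not recommendations:

from PIL import Image
import imagehash

# Perceptual hashes of two images (filenames are placeholders)
h1 = imagehash.average_hash(Image.open("photo1.jpg"))
h2 = imagehash.average_hash(Image.open("photo2.jpg"))

# Subtracting two ImageHash objects gives the Hamming distance in bits
distance = h1 - h2
if distance <= 5:
    print("images look similar, distance =", distance)
else:
    print("images look different, distance =", distance)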

Or you may be interested in allowing slight differences in brightness, or orientation, or cropping - which is another topic altogether.



Answer 2:

There are many ways you can achieve this, but the simplest would be to read the image file, base64-encode it, and then use a standard hashing library. In Python it will look something like this:

import base64
import hashlib

with open("foo.png", "rb") as image_file:
    # base64-encode the raw file bytes, then hash the encoded string
    encoded_string = base64.b64encode(image_file.read())
    m = hashlib.md5()
    m.update(encoded_string)
    fingerprint = m.hexdigest()
    print(fingerprint)

If you just think of a hash function as turning one (possibly large) string into another, you should be alright. In the code above, m.update() feeds encoded_string (a very large base64 string) into the hash, and m.hexdigest() returns the resulting short hex string.

You can read the Python docs for the hashlib library here, but there should be something similar in whatever language you are using.
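
To turn that fingerprint into a duplicate finder, you can hash every file in a folder and group the paths by digest; any group with more than one path is a set of byte-identical files. A sketch of that idea, using hashlib as above (the images/*.png pattern is just an example):

import base64
import glob
import hashlib
from collections import defaultdict

def fingerprint(path):
    # Same recipe as above: base64-encode the file bytes, then MD5 them
    with open(path, "rb") as image_file:
        return hashlib.md5(base64.b64encode(image_file.read())).hexdigest()

# Group every matching file by its fingerprint
groups = defaultdict(list)
for path in glob.glob("images/*.png"):
    groups[fingerprint(path)].append(path)

# Any fingerprint shared by more than one file is a set of exact duplicates
for digest, paths in groups.items():
    if len(paths) > 1:
        print(digest, paths)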



Answer 3:

If you're interested in finding near duplicates, which includes images that have been resized, you could apply difference hashing. More on hashing here. The code below is edited from the Real Python blog post to make it work in Python 3. It uses the imagehash library linked to above, which has information on the different kinds of hashing. You should be able to copy and paste both scripts and run them directly from the command line without editing them.
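
For intuition, a difference hash shrinks the image to a tiny grayscale grid and records whether each pixel is brighter than its right-hand neighbour, which is why resizing barely changes the result. The toy sketch below shows the idea; imagehash.dhash does essentially this, although its exact bit ordering and formatting may differ:

from PIL import Image

def toy_dhash(path, hash_size=8):
    # Grayscale, then shrink to (hash_size + 1) x hash_size pixels so each
    # row gives hash_size left/right comparisons
    img = Image.open(path).convert("L").resize((hash_size + 1, hash_size))
    pixels = list(img.getdata())

    bits = ""
    for row in range(hash_size):
        for col in range(hash_size):
            left = pixels[row * (hash_size + 1) + col]
            right = pixels[row * (hash_size + 1) + col + 1]
            bits += "1" if left > right else "0"

    # Pack the comparison bits into a hex string (16 chars for the default 8x8)
    return "{:0{}x}".format(int(bits, 2), hash_size * hash_size // 4)

print(toy_dhash("someimage.jpg"))   # example filename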

This first script (index.py) creates a difference hash for each image and then stores it in a shelf (a persistent dictionary you can access later like a database), together with the filename(s) of the image(s) that have that hash:

from PIL import Image
import imagehash
import argparse
import shelve
import glob

# This is just so you can run it from the command line
ap = argparse.ArgumentParser()
ap.add_argument('-d', '--dataset', required = True,
                help = 'path to input dataset of images')

ap.add_argument('-s', '--shelve', required = True,
                help = 'output shelve database')
args = ap.parse_args()

# open the shelve database
db = shelve.open(args.shelve, writeback = True)

# loop over the image dataset
for imagePath in glob.glob(args.dataset + '/*.jpg'):
    # load the image and compute its difference hash
    image = Image.open(imagePath)
    h = str(imagehash.dhash(image))
    print(h)

    # extract the filename from the path and update the database, using the hash
    # as the key and appending the filename to the list of values

    filename = imagePath[imagePath.rfind('/') + 1:]
    db[h] = db.get(h, []) + [filename]

db.close()

Run on the command line:

python index.py --dataset ./image_directory --shelve db.shelve

Run in a Jupyter notebook:

%run index.py --dataset ./image_directory --shelve db.shelve
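
Once the index exists, you can also list every group of images that share a hash straight from the shelf, without choosing a query image first. A small sketch, assuming the db.shelve file built by index.py above:

import shelve

# Every hash that maps to more than one filename is a group of (near) duplicates
with shelve.open("db.shelve") as db:
    for h, filenames in db.items():
        if len(filenames) > 1:
            print(h, "->", filenames)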

Now that everything is stored in a shelf, you can query it with the filename of an image you want to check; the second script will print the filenames of images with a matching hash and also open them (search.py):

from PIL import Image
import imagehash
import argparse
import shelve

# arguments for command line
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required=True,
            help="path to dataset of images")
ap.add_argument("-s", "--shelve", required=True,
            help="output the shelve database")
ap.add_argument("-q", "--query", required=True,
            help="path to the query image")
args = ap.parse_args()

# open the shelve database
db = shelve.open(args.shelve)

# Load the query image, compute its difference hash, and grab the images
# from the database that have the same hash value
query = Image.open(args.query)
h = str(imagehash.dhash(query))
filenames = db.get(h, [])
print("found {} images".format(len(filenames)))

# loop over the images
for filename in filenames:
    print(filename)
    image = Image.open(args.dataset + "/" + filename)
    image.show()

# close the shelve database
db.close()

Run on the command line to look through image_directory for images with the same hash as ./directory/someimage.jpg:

python search.py --dataset ./image_directory --shelve db.shelve --query ./directory/someimage.jpg

Again, this is modified from the Real Python blog post linked above, which is written for Python 2.7, and it should work out of the box! Just change the command-line arguments as you need to. If I remember correctly, the Python 2/3 issue was just with argparse and not the image libraries.
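
One caveat: the shelf lookup in search.py only finds images whose difference hash matches the query exactly. If you want to catch near duplicates whose hashes differ by a few bits, you can compare Hamming distances against every stored hash instead. A sketch of that variant, reusing the same shelf; the query path and the 5-bit threshold are just placeholders:

from PIL import Image
import imagehash
import shelve

# Hash the query image (placeholder path)
query_hash = imagehash.dhash(Image.open("./directory/someimage.jpg"))

with shelve.open("db.shelve") as db:
    for stored_hex, filenames in db.items():
        # Rebuild an ImageHash from the stored hex string; subtraction gives
        # the Hamming distance in bits
        distance = query_hash - imagehash.hex_to_hash(stored_hex)
        if distance <= 5:
            print(distance, filenames)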