Search for similar images with different filenames

Posted 2019-04-02 09:11

Question:

I have 2 directories with lots and lots of images, say: color/ and gray/. In color/ images are named: image1.png image2.png, etc.

I know that gray/ contains the same images, but in grayscale, and the file names and the order of the files are different (e.g. file_01.png, but this IS NOT the same image as image1.png).

Is it possible to make a comparison of images in both directories and copy color/ files to a results/ directory with gray/ file names?

Example:

directory        | directory           | directory
"color/"         | "gray/"             | "results/"
(color images)   | (grayscale images)  | (color images with grayscale names)
-----------------+---------------------+--------------------------------------
color/image1.png | gray/file324.png    | results/file324.png (in color; the
                 |                     | same image as color/image1.png)

I hope this is not very confusing, but I don't know how to explain it better.

I have tried with ImageMagick, and it seems that its compare command could work for this, but I'm unable to write a bash script or something similar that does it well.

Another way to say it: I want all color/*.jpg files copied into the results/ folder under the names of the correctly matching gray/*.jpg files.

EDIT (some notes):

  1. The three images are IDENTICAL in size and content. The only differences are that two are in color and one is in grayscale, and, of course, the file names.
  2. I uploaded a zip file with one sample image under its current names (folder "img1" is the color folder and folder "img2" is the grayscale folder) and the expected result ("img3" is the results folder), here: http://www.mediafire.com/?9ug944v6h7t3ya8

Answer 1:

If I understood the requirement correctly, we need to:

  • find for each grayscale image named XYZ that is in folder gray/...
  • ...the matching color image named ABC that is in folder color/ and...
  • ...copy ABC to folder results/ under the new name XYZ

So the basic algorithm I suggest is this:

  1. Convert all images in folder color/ to grayscale and store result in folder gray-reference/. Keep the original names:

    mkdir gray-reference
    convert  color/img123.jpg  -colorspace gray  gray-reference/img123.jpg
    
  2. For each grayscale image in gray-reference/ make a comparison with each grayscale image in folder gray/. If you find a match, copy the image with the same name as the reference from color/ to results/, giving it the name of the matching gray/ image. One possible comparison command which creates a visual representation of differences is this:

    compare  gray-reference/img123.jpg  gray/imgABC.jpg  -compose src delta.jpg
    

The real trick is the comparison (as in step 2) of the two grayscale images. ImageMagick has a handy command to compare two (similar) images pixel by pixel and write the results into a 'delta' image:

compare  reference.png  test.png  -compose src  delta.png

If the comparison is for color images, in the delta image...

  • ...each pixel that was equal appears in white, while...
  • ...each pixel that was different appears in a highlight color (defaults to red).

See also my answer "ImageMagick: 'Diff' an Image" for an illustrated example of this technique.

If we directly compared a gray image with a color image pixel by pixel we would of course find that almost every single pixel is different (resulting in an all-red "delta" picture). Hence my proposal from step 1 above to first convert the color image to grayscale.

If we compare two grayscale images, the resulting delta image is in grayscale too. Hence the highlight color can't be the default red. We had better set it to 'black' so that the differences are easier to see.

Now, if our grayscale conversion of the color image produced a different sort of gray than the existing gray images have (our grays could be slightly lighter or darker than the existing ones, for example because different color profiles were applied), the delta picture could still come out all "red", or rather all highlight color. However, I tested this with your sample images, and the results are good:

 convert  color/image1.jpg  -colorspace gray  image1-gray.jpg  
 compare                  \
    gray/file324.jpg      \
    image1-gray.jpg       \
   -highlight-color black \
   -compose src           \
    delta.jpg

delta.jpg consists of 98% white pixels. I'm not sure whether all the others of your thousands of grayscale images were derived from the color originals with the same settings. Therefore we add a small fuzz factor when running the compare command, which allows for some deviation in color when two pixels are compared:

compare  -fuzz 3%  reference.png  test.png  -compose src  delta.png

Since this algorithm is to be executed many thousands of times (maybe several million times, given the number of images you talk about), we should think about performance and time the compare command. This is especially a concern because your sample images are rather large (3072x2048 pixels -- 6 megapixels), so each comparison could take a while.

My timing results on a MacBook Pro were these:

time (convert  color/image1.jpg  -colorspace gray  image1-gray.jpg ;
      compare                   \
         gray/file324.jpg       \
         image1-gray.jpg        \
        -highlight-color black  \
        -fuzz 3%                \
        -compose src            \
         delta100-fuzz.jpg)

  real  0m6.085s
  user  0m2.616s
  sys   0m0.598s

6 seconds for: 1 conversion of a large color image to grayscale, plus 1 comparison of two large grayscale images.

You talked about 'thousands of images'. Assuming 3000 images, based on this timing, the processing of all the images would require (3000*3000)/2 comparisons (4.5 million) and (3000*3000*6)/2 seconds (27 million sec). That's a total of 312 days to complete all comparisons. Too long, if you ask me.
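As a rough cross-check of this estimate, the arithmetic can be sketched in a few lines of shell. The helper name estimate_days is made up for illustration, and the model (about n*n/2 comparisons at a fixed cost per comparison) is simply the assumption used in the paragraph above:

```shell
#!/bin/sh
# Hypothetical helper (not part of ImageMagick): estimate the total run time
# in days for n gray images matched against n color images, assuming about
# n*n/2 comparisons at a fixed cost per comparison.
estimate_days() {
    # $1 = number of images per directory, $2 = seconds per comparison
    LC_ALL=C awk -v n="$1" -v s="$2" \
        'BEGIN { printf "%.1f\n", (n * n / 2) * s / 86400 }'
}

estimate_days 3000 6    # prints 312.5
```

Plugging in a different per-comparison time shows how strongly the total depends on making the single comparison cheaper.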

What could we do to improve the performance?

Well, my first idea is to reduce the size of the images. If we compare smaller images instead of 3072x2048 sized ones, the comparison should return the result faster. (However, we will also spend additional time on first scaling down our test images, but hopefully much less time than we later save when comparing the smaller images.)

time (convert color/image1.jpg  -colorspace gray  -scale 6.25%  image1-gray.jpg  ;
      convert gray/file324.jpg                    -scale 6.25%  file324-gray.jpg ;
      compare                  \
         file324-gray.jpg      \
         image1-gray.jpg       \
        -highlight-color black \
        -fuzz 3%               \
        -compose src           \
         delta6.25-fuzz.jpg)

   real  0m0.670s
   user  0m0.584s
   sys   0m0.074s

That's much better! We shaved off almost 90% of processing time, which gives hope to complete the job in 35 days if you use a MacBook Pro.

The improvement is only logical: by reducing the image dimension to 6.25% of the original the resulting images are only 192x128 pixels -- a reduction from 6 million pixels to 24.5 thousand pixels, a ratio of 256:1.
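The numbers can be checked with plain shell integer arithmetic. This is only a sketch: scaled_dims is a made-up helper, the scale is passed in hundredths of a percent to stay within integers, and ImageMagick's own rounding for sizes that do not divide evenly may differ from the truncation used here:

```shell
#!/bin/sh
# Hypothetical helper: pixel dimensions after scaling by a percentage that is
# given in hundredths of a percent (625 means 6.25%). Uses truncating integer
# division, which is exact for the 3072x2048 sample image.
scaled_dims() {
    # $1 = width, $2 = height, $3 = scale in hundredths of a percent
    echo "$(( $1 * $3 / 10000 ))x$(( $2 * $3 / 10000 ))"
}

scaled_dims 3072 2048 625                  # prints 192x128
echo $(( (3072 * 2048) / (192 * 128) ))    # prints 256 (pixel-count ratio)
```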

(NOTE: The -thumbnail and the -resize parameters would work a little bit faster than -scale does. However, this speed increase is a trade-off against quality loss. That quality loss would probably make the comparison much less reliable...)

Instead of creating a visually inspectable delta image from the compared images, we can tell ImageMagick to print out some statistics. To get the number of different pixels, we can use the AE metric. The command with its results is this:

time (convert color/image1.jpg -colorspace gray -scale 6.25% image1-gray.jpg  ;
     convert gray/file324.jpg                   -scale 6.25% file324-gray.jpg ;
     compare -metric AE  file324-gray.jpg image1-gray.jpg -fuzz 3% null: 2>&1 )
0 

  real  0m0.640s
  user  0m0.574s
  sys   0m0.073s

This means we have 0 differing pixels -- a result that we could directly use inside a shell script!
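To use that number in a script, the metric output has to be captured and tested. The following is only a sketch: ae_is_zero and images_match are made-up helper names, and the tolerance for extra text after the count (your ImageMagick version may print more than the bare number) is an assumption, so check what your version actually prints:

```shell
#!/bin/sh
# Hypothetical helper: succeed when the text printed by "compare -metric AE"
# starts with a count of exactly 0 differing pixels.
ae_is_zero() {
    # Keep only the text before the first space; this tolerates a trailing
    # blank or extra fields after the count (assumed output format).
    [ "${1%% *}" = "0" ]
}

# Hypothetical wrapper: compare two already down-scaled grayscale images.
# $1 and $2 are image files, $3 is an optional fuzz factor (default 3%).
# compare writes the metric to stderr, hence the 2>&1.
images_match() {
    ae_is_zero "$(compare -metric AE "$1" "$2" -fuzz "${3:-3%}" null: 2>&1)"
}

# Usage (requires ImageMagick and the two files from the example above):
#   images_match file324-gray.jpg image1-gray.jpg && echo "same image"
```

Wrapping the test in a function keeps the string comparison in one place, so a different output format only needs to be handled once.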

Building blocks for a Shell script

So here are the building blocks for a shell script to do the automatic comparison:

  1. Convert color images from 'color/' directory to grayscale ones, scale them down to 6.25% and save results in 'reference-color/' directory:

    # Estimated time required to convert 1000 images of size 3072x2048:
    #   500 seconds
    mkdir -p reference-color
    for i in color/*.jpg; do
        convert  "${i}"  -colorspace gray  -scale 6.25%  "reference-color/$(basename "${i}")"
    done
    
  2. Scale down images from 'gray/' directory and save results in 'reference-gray/' directory:

    # Estimated time required to convert 1000 images of size 3072x2048:
    #    250 seconds
    mkdir -p reference-gray
    for i in gray/*.jpg; do
        convert  "${i}"  -scale 6.25%  "reference-gray/$(basename "${i}")"
    done
    
  3. Compare each image from directory 'reference-gray/' with images from directory 'reference-color' until a match is found:

    # Estimated time required to compare 1 image with 1000 images:
    #    300 seconds
    # If we have 1000 images, we need to conduct a total of 1000*1000/2
    # comparisons to find all matches;
    #    that is, we need about 2 days to accomplish all.
    # If we have 3000 images, we need a total of 3000*3000/2 comparisons
    # to find all matches;
    #    this requires about 20 days.
    #
    mkdir -p results
    for i in reference-gray/*.jpg ; do

        for j in reference-color/*.jpg ; do

            # compare the two grayscale reference images
            if [ "x0" = "x$(compare  -metric AE  "${i}"  "${j}" -fuzz 3%  null: 2>&1)" ]; then

                # if we found a match, then create the copy under the required name
                cp "color/$(basename "${j}")"  "results/$(basename "${i}")"

                # if we found a match, then remove the matched color reference image (we do not want to compare again with this one)
                rm -f "${j}"

                # if we found a match, break from within this loop and start the next one
                break

            fi

        done

    done
    

Caveat: Do not blindly rely on these building blocks. They are untested. I do not have a directory of multiple suitable images available to test this, and I do not want to create one myself just for this exercise. Proceed with caution!



Answer 2:

You should check whether a perceptual hash technique such as pHash gives good results on your concrete data.

A perceptual hash will give you a reliable similarity measure since the underlying algorithms are robust enough to take into account changes/transformations such as contrast adjustment or different compression/formats - which is not the case with standard cryptographic hash functions such as MD5.

In addition, you can check whether pHash works for your own images by using its convenient web-based demo interface.
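To give an idea of what such a hash looks like, here is a sketch of a much simpler relative of pHash, the so-called average hash, built from the ImageMagick tools already used in this thread. It is not the pHash algorithm itself; the function names ahash and hamming are invented for this example, and the 8x8 size is just a customary choice. Two images are considered similar when the Hamming distance between their 64-bit hashes is small:

```shell
#!/bin/sh
# Average hash (a simpler relative of pHash, NOT pHash itself):
# shrink to 8x8 gray pixels, then emit one bit per pixel telling whether
# that pixel is brighter than the mean. Requires ImageMagick's convert.
ahash() {
    convert "$1" -colorspace gray -resize '8x8!' -depth 8 gray:- |
        od -An -v -tu1 |
        awk '{ for (i = 1; i <= NF; i++) { v[n++] = $i + 0; sum += $i } }
             END { if (n == 0) exit 1
                   mean = sum / n
                   for (i = 0; i < n; i++) printf "%d", (v[i] > mean)
                   print "" }'
}

# Hamming distance between two equal-length bit strings (pure awk).
hamming() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        d = 0
        for (i = 1; i <= length(a); i++)
            if (substr(a, i, 1) != substr(b, i, 1)) d++
        print d
    }'
}

hamming 1010 1001    # prints 2
# Usage with real files (the threshold of 5 bits is a hypothetical choice):
#   [ "$(hamming "$(ahash gray/file324.png)" "$(ahash color/image1.png)")" -le 5 ]
```

Because each image is hashed only once, matching N gray images against N color images then costs N*N cheap string comparisons instead of N*N image comparisons.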



Answer 3:

Kurt's solution works very well after some tweaking and fiddling with the -fuzz option! :) The value for -fuzz that finally worked well is 50%! I tried 3, 10, 19, 20, 24, 25, 30 and 40% with no success, probably because the gray images were generated previously with a different method, so the grays are different. Also, the images are all of different sizes, some of them relatively small, so scaling by a percentage produces bad results. I used -resize 200x instead, so that all the reference images were more or less the same size. This is the bash script I finally used:

    # this bash script assumes the existence of two dirs: color/ and gray/
    # each one with images to compare

    echo Starting...
    echo Checking directories...
    if [ ! -d color ]; then
        echo Error: the directory color does not exist!
        exit 1;
    fi
    if [ ! -d gray ]; then
        echo Error: the directory gray does not exist!
        exit 1;
    fi

    echo Directories exist. Proceeding...

    mkdir reference-color
    echo creating reference-color...
    for i in color/*.png; do
        convert  "${i}"  -colorspace gray  -resize 200x  reference-color/$(basename "${i}")
    done
    echo reference-color created...

    mkdir reference-gray
    echo creating reference-gray...
    for i in gray/*.png; do
        convert  "${i}"  -resize 200x  reference-gray/$(basename "${i}")
    done
    echo reference-gray created...

    mkdir results
    echo created results directory...

    echo ...ready.

    echo "-------------------------"
    echo "|  starting comparison  |"
    echo "-------------------------"

    for i in reference-gray/*.png; do
        echo comparing image $i 

        for j in reference-color/*.png; do

            # compare the two grayscale reference images

            if [ "x0" == "x$(compare  -metric AE "${i}"  "${j}" -fuzz 50% null: 2>&1)" ]; then

                # if we found a match, then create the copy under the required name
                echo Found a similar one. Copying and renaming it...
                cp color/$(basename "${j}")  results/$(basename "${i}")

                # if we found a match, then remove the respective reference image (we do not want to compare again with this one)
                echo Deleting references...
                rm -rf "${i}"
                rm -rf "${j}"
                echo "--------------------------------------------------------------"

                # if we found a match, break from within this loop and start the next one
                break ;

            fi

        done

    done
    echo Cleaning...
    rm -rf reference-color
    rm -rf reference-gray
    echo Finished!

The measured time (for 180 images, using ImageMagick under Cygwin; native Linux ImageMagick is probably faster, but I don't know yet) was:

real    5m29.308s
user    2m25.481s
sys     3m1.573s

I uploaded a file with the script and the set of test images, if anyone is interested: http://www.mediafire.com/?1ez0gs6bw3rqbe4 (it is compressed in 7z format).

Thanks again!