I have 2 directories with lots and lots of images, say: color/ and gray/. In color/ the images are named image1.png, image2.png, etc.
I know that gray/ contains the same images, but in gray-scale, and that the file names and the order of the files are different (e.g. file_01.png, but this IS NOT the same image as image1.png).
Is it possible to make a comparison of images in both directories and copy color/ files to a results/ directory with gray/ file names?
Example:
directory | directory | directory
"color/" | "gray/" | "results/"
(color images) | (grayscale images) | (color images with gray-scale names)
-----------------+---------------------+----------------------------------------
color/image1.png | gray/file324.png | results/file324.png (in color: ==>
| this and image1.png are the same image)
I hope this is not very confusing, but I don't know how to explain it better.
I have tried with ImageMagick, and it seems that its compare command could work for this, but I'm unable to write a bash script or something similar that does it well.
Another way to say it: I want all color/*.jpg copied into the results/*.jpg folder using the correctly matching gray/*.jpg names.
EDIT (some notes):
1. The three images are IDENTICAL in size and content. The only difference is that two are in color and one is in gray-scale. And the name of the files, of course.
2. I uploaded a zip file with one sample image with their current names (folder "img1" is the color folder and folder "img2" is the grayscale folder) and the expected result ("img3" is the results folder), here: http://www.mediafire.com/?9ug944v6h7t3ya8
If I understood the requirement correctly, we need to:
- find for each grayscale image named XYZ that is in folder gray/...
- ...the matching color image named ABC that is in folder color/ and...
- ...copy ABC to folder results/ under the new name XYZ
So the basic algorithm I suggest is this:
1. Convert all images in folder color/ to grayscale and store the results in folder gray-reference/. Keep the original names:
mkdir gray-reference
convert color/img123.jpg -colorspace gray gray-reference/img123.jpg
2. Compare each grayscale image in gray-reference/ with each grayscale image in folder gray/. If you find a match, copy the color/ image that has the same name as the reference image to results/, giving it the name of the matching gray/ image. One possible comparison command, which creates a visual representation of the differences, is this:
compare gray-reference/img123.jpg gray/imgABC.jpg -compose src delta.jpg
The real trick is the comparison (as in step 2) of the two grayscale images. ImageMagick has a handy command to compare two (similar) images pixel by pixel and write the results into a 'delta' image:
compare reference.png test.png -compose src delta.png
If the comparison is for color images, in the delta image...
- ...each pixel that was equal appears in white, while...
- ...each pixel that was different appears in a highlight color (defaults to red).
See also my answer "ImageMagick: 'Diff' an Image" for an illustrated example of this technique.
If we directly compared a gray image with a color image pixel by pixel we would of course find that almost every single pixel is different (resulting in an all-red "delta" picture). Hence my proposal from step 1 above to first convert the color image to grayscale.
If we compare two grayscale images, the resulting delta image is in grayscale too. Hence the default highlight color can't be red. We had better set it to 'black' in order to see it clearly.
Now, if our grayscale conversion of the color image produced a different sort of gray than the existing gray images have (ours could be slightly lighter or darker because a different color profile was applied), the delta picture could still come out all-"red", or rather all-highlight-color. However, I tested this with your sample images, and the results are good:
convert color/image1.jpg -colorspace gray image1-gray.jpg
compare \
gray/file324.jpg \
image1-gray.jpg \
-highlight-color black \
-compose src \
delta.jpg
delta.jpg consists of 98% white pixels. I'm not sure whether all of your other thousands of grayscale images were derived from the color originals with the same settings. Therefore we add a small fuzz factor when running the compare command, which allows for some deviation in color when two pixels are compared:
compare -fuzz 3% reference.png test.png -compose src delta.png
Since this algorithm is to be executed many thousands of times (maybe several million times, given the number of images you talk about), we should consider performance and time the compare command. This is especially a concern since your sample images are rather large (3072x2048 pixels -- 6 Mega-Pixels), and the comparison could take a while.
My timing results on a MacBook Pro were these:
time (convert color/image1.jpg -colorspace gray image1-gray.jpg ;
compare \
gray/file324.jpg \
image1-gray.jpg \
-highlight-color black \
-fuzz 3% \
-compose src \
delta100-fuzz.jpg)
real 0m6.085s
user 0m2.616s
sys 0m0.598s
6 seconds for: 1 conversion of a large color image to grayscale, plus 1 comparison of two large grayscale images.
You talked about 'thousands of images'. Assuming 3000 images and this timing, processing all the images would require (3000*3000)/2 comparisons (4.5 million) and (3000*3000*6)/2 seconds (27 million seconds). That's a total of about 312 days to complete all comparisons. Too long, if you ask me.
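The estimate can be reproduced with plain shell integer arithmetic. This is only a sketch: the variable names are made up for illustration, and the figures (3000 images, 6 seconds per pair) are the assumptions stated above.

```shell
#!/bin/sh
# Rough run-time estimate for pairwise matching, using the figures above.
n=3000            # assumed number of images per directory
secs_per_pair=6   # measured: one conversion plus one comparison

pairs=$(( n * n / 2 ))                    # on average half of all pairs get tested
total_secs=$(( pairs * secs_per_pair ))
days=$(( total_secs / 86400 ))            # integer division, rounds down

echo "comparisons: ${pairs}"
echo "seconds:     ${total_secs}"
echo "days:        ${days}"
```

Changing secs_per_pair lets you see quickly how much a faster comparison would help.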
What could we do to improve the performance?
Well, my first idea is to reduce the size of the images. If we compare smaller images instead of 3072x2048 sized ones, the comparison should return the result faster. (We will spend some additional time on scaling down the test images first, but hopefully much less than we save later when comparing the smaller images.) Here is the timing:
time (convert color/image1.jpg -colorspace gray -scale 6.25% image1-gray.jpg ;
convert gray/file324.jpg -scale 6.25% file324-gray.jpg ;
compare \
file324-gray.jpg \
image1-gray.jpg \
-highlight-color black \
-fuzz 3% \
-compose src \
delta6.25-fuzz.jpg)
real 0m0.670s
user 0m0.584s
sys 0m0.074s
That's much better! We shaved off almost 90% of processing time, which gives hope to complete the job in 35 days if you use a MacBook Pro.
The improvement is only logical: by reducing the image dimension to 6.25% of the original the resulting images are only 192x128 pixels -- a reduction from 6 million pixels to 24.5 thousand pixels, a ratio of 256:1.
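The pixel counts are easy to check in the shell. A small sketch with illustrative variable names; it relies on 6.25% being exactly 1/16 per side:

```shell
#!/bin/sh
# 6.25% is exactly 1/16 per side, so the pixel count shrinks by a factor of 16*16.
w=3072
h=2048
small_w=$(( w / 16 ))
small_h=$(( h / 16 ))

full_pixels=$(( w * h ))
small_pixels=$(( small_w * small_h ))
ratio=$(( full_pixels / small_pixels ))

echo "${w}x${h} = ${full_pixels} pixels"
echo "${small_w}x${small_h} = ${small_pixels} pixels"
echo "ratio ${ratio}:1"
```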
(NOTE: The -thumbnail and the -resize parameters would work a little bit faster than -scale does. However, this speed increase is a trade-off against quality loss. That quality loss would probably make the comparison much less reliable...)
Instead of creating a visually inspectable delta image from the compared images, we can tell ImageMagick to print out some statistics. To get the number of different pixels, we can use the AE metric. The command with its results is this:
time (convert color/image1.jpg -colorspace gray -scale 6.25% image1-gray.jpg ;
convert gray/file324.jpg -scale 6.25% file324-gray.jpg ;
compare -metric AE file324-gray.jpg image1-gray.jpg -fuzz 3% null: 2>&1 )
0
real 0m0.640s
user 0m0.574s
sys 0m0.073s
This means we have 0 differing pixels -- a result that we could directly use inside a shell script!
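As a sketch of how that result might be consumed in a script, the check can be wrapped in two small functions. Both function names are hypothetical. The parsing also makes one assumption: that the AE output starts with the pixel count. Some ImageMagick versions may print additional text after the number, so only the first word is inspected.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when compare's AE output means "0 differing pixels".
# Assumption: the output begins with the pixel count; anything after the
# first space (extra text printed by some versions) is ignored.
ae_is_zero() {
    first_word=${1%% *}        # drop everything from the first space onwards
    [ "${first_word}" = "0" ]
}

# Hypothetical wrapper around the compare call; the metric is written to
# stderr, hence the 2>&1. An error message never starts with "0", so a
# failed compare counts as "no match".
images_match() {
    ae_is_zero "$(compare -metric AE -fuzz 3% "$1" "$2" null: 2>&1)"
}

# Usage, with the file names from the sample above:
#   if images_match file324-gray.jpg image1-gray.jpg; then echo "match"; fi
```

Keeping the parsing separate from the compare call makes it possible to test the decision logic without ImageMagick installed.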
Building blocks for a Shell script
So here are the building blocks for a shell script to do the automatic comparison:
Convert color images from 'color/' directory to grayscale ones, scale them down to 6.25% and save results in 'reference-color/' directory:
# Estimated time required to convert 1000 images of size 3072x2048:
# 500 seconds
mkdir -p reference-color
for i in color/*.jpg; do
    convert "${i}" -colorspace gray -scale 6.25% "reference-color/$(basename "${i}")"
done
Scale down images from 'gray/' directory and save results in 'reference-gray/' directory:
# Estimated time required to convert 1000 images of size 3072x2048:
# 250 seconds
mkdir -p reference-gray
for i in gray/*.jpg; do
    convert "${i}" -scale 6.25% "reference-gray/$(basename "${i}")"
done
Compare each image from directory 'reference-gray/' with images from directory 'reference-color' until a match is found:
# Estimated time required to compare 1 image with 1000 images:
# 300 seconds
# If we have 1000 images, we need to conduct a total of 1000*1000/2
# comparisons to find all matches;
# that is, we need about 2 days to accomplish all.
# If we have 3000 images, we need a total of 3000*3000/2 comparisons
# to find all matches;
# at 0.3 seconds per comparison this requires about 16 days.
#
# the results directory must exist before we copy into it
mkdir -p results
for i in reference-gray/*.jpg ; do
    for j in reference-color/*.jpg ; do
        # compare the two grayscale reference images
        if [ "x0" = "x$(compare -metric AE "${i}" "${j}" -fuzz 3% null: 2>&1)" ]; then
            # if we found a match, then create the copy under the required name
            cp "color/$(basename "${j}")" "results/$(basename "${i}")"
            # if we found a match, then remove the matched color reference image (we do not want to compare again with this one)
            rm -f "${j}"
            # if we found a match, break from within this loop and start the next one
            break
        fi
    done
done
Caveat: Do not blindly rely on these building blocks. They are untested. I do not have a directory of multiple suitable images available to test this, and I do not want to create one myself just for this exercise. Proceed with caution!
You should check whether a perceptual hash technique such as pHash gives good results on your concrete data.
A perceptual hash should give you a reliable similarity measure, since the underlying algorithms are robust enough to tolerate changes/transformations such as contrast adjustment or different compression/formats, which is not the case with standard cryptographic hash functions such as MD5.
In addition, you can validate whether pHash works by trying its convenient web-based demo interface on your own images.
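To illustrate the general idea only, here is a sketch of a much simpler relative of pHash, usually called an average hash. It is not the pHash algorithm or library. Both function names are hypothetical, and the ImageMagick pipeline in the usage comment is an untested assumption about how one might obtain 64 gray values per image.

```shell
#!/bin/sh
# average_hash: takes gray values (0-255) as arguments and prints one bit per
# value: 1 if the value is above the mean, else 0. Runs in a subshell so its
# variables do not leak. Expects at least one value.
average_hash() (
    sum=0
    for v in "$@"; do sum=$(( sum + v )); done
    mean=$(( sum / $# ))
    bits=""
    for v in "$@"; do
        if [ "$v" -gt "$mean" ]; then bits="${bits}1"; else bits="${bits}0"; fi
    done
    printf '%s\n' "$bits"
)

# hamming_distance: counts the positions at which two equally long bit
# strings differ. A small distance suggests similar images.
hamming_distance() (
    a=$1; b=$2; d=0
    while [ -n "$a" ]; do
        # ${a%"${a#?}"} is the first character of $a
        [ "${a%"${a#?}"}" = "${b%"${b#?}"}" ] || d=$(( d + 1 ))
        a=${a#?}; b=${b#?}
    done
    printf '%s\n' "$d"
)

# Possible usage (assumption, untested): shrink to 8x8 gray pixels, dump raw bytes.
#   h1=$(average_hash $(convert color/image1.jpg -colorspace gray -resize '8x8!' -depth 8 gray:- | od -An -v -tu1))
#   h2=$(average_hash $(convert gray/file324.jpg -colorspace gray -resize '8x8!' -depth 8 gray:- | od -An -v -tu1))
#   hamming_distance "$h1" "$h2"
```

Because each image is reduced to a short bit string once, the pairwise step becomes a cheap string comparison instead of an image comparison.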
Kurt's solution works very well after some tweaking and fiddling with the -fuzz option! :) The value for -fuzz that finally worked well is 50%! I tried 3, 10, 19, 20, 24, 25, 30 and 40% with no success, probably because the gray images were previously generated with a different method, so the grays are different. Also, the images are all of different sizes, some of them relatively small, so scaling by percentage produces bad results. I used -resize 200x instead, so all the reference images were more or less the same size. Finally, this was the bash script I used:
# this bash assumes the existence of two dirs: color/ and gray/
# each one with images to compare
echo Starting...
echo Checking directories...
if [ ! -d color ]; then
echo Error: the directory color does not exist!
exit 1;
fi
if [ ! -d gray ]; then
echo Error: the directory gray does not exist!
exit 1;
fi
echo Directories exist. Proceeding...
mkdir reference-color
echo creating reference-color...
for i in color/*.png; do
    convert "${i}" -colorspace gray -resize 200x "reference-color/$(basename "${i}")"
done
echo reference-color created...
mkdir reference-gray
echo creating reference-gray...
for i in gray/*.png; do
    convert "${i}" -resize 200x "reference-gray/$(basename "${i}")"
done
echo reference-gray created...
mkdir results
echo created results directory...
echo ...ready.
echo "-------------------------"
echo "| starting comparison |"
echo "-------------------------"
for i in reference-gray/*.png; do
echo comparing image $i
for j in reference-color/*.png; do
# compare the two grayscale reference images
if [ "x0" = "x$(compare -metric AE "${i}" "${j}" -fuzz 50% null: 2>&1)" ]; then
# if we found a match, then create the copy under the required name
echo Found a similar one. Copying and renaming it...
cp "color/$(basename "${j}")" "results/$(basename "${i}")"
# if we found a match, then remove the respective reference image (we do not want to compare again with this one)
echo Deleting references...
rm -rf "${i}"
rm -rf "${j}"
echo "--------------------------------------------------------------"
# if we found a match, break from within this loop and start the next one
break ;
fi
done
done
echo Cleaning...
rm -rf reference-color
rm -rf reference-gray
echo Finished!
The timing was as follows (for 180 images, using ImageMagick in Cygwin; it is probably better with native Linux ImageMagick, but I don't know yet):
real 5m29.308s
user 2m25.481s
sys 3m1.573s
I uploaded a file with the script and the set of test images, if anyone is interested: http://www.mediafire.com/?1ez0gs6bw3rqbe4 (It is compressed in 7z format.)
Thanks again!