I was trying to implement SIFT for my project and i have got the keypoints. I tried taking the euclidean distance of each keypoint of an image with all the keypoints of the same image but scaled down. It so happens that the distance of 1 keypoint of the query image with rest of the keypoints in the database image has kind of very similar values. How do i select the nearest neighbor and how can i be sure that this is the correct match.
Euclidean distance was calculated as ,for i=1 to 128 sqrt[(pi-qi)^2] for p = 1 to number of keypoints in the database.
any idea on how to proceed will be very much appreciated
I'm guessing you are trying to use SIFT for image retrieval since you mentioned you have a database of images which you are comparing a query image to.
You should realize that comparing every SIFT descriptor vector in a query image with every other descritpor vectors in the image database is not feasible as this will require exponential number of comparisons. The current popular approach to do image retrieval using SIFT descriptors is the bag of words model borrowed from document retrieval.
First, what you want to do is given an image, represent it using a SINGLE vector which can be compared to the vectors of the other images in the database. This is unlike your current approach where each image has many SIFT descriptor vectors (one for each keypoint). In the bag of words (BOW) model, you first need to create what is called a visual codebook (or sometimes referred to as dictionary). You do this by:
- Selecting representative images from your image database
- Collect all the SIFT descriptors from the images in 1)
- Cluster these descriptors using K Means into k number of clusters
where k is a number you set. The center of these clusters are the
"visual words" i.e. representative features in your database of
images.
For every image in the database, you are going to create a vector
v
that counts how frequently the different features in the dictionary occurs so each image would be represented by a vector in the form of: <# times feature 1 in dictionary occur, ... feature 2 in dictionary occur..., ..., ... feature k in dictionary occur>
i.e., a k dimensional vector. You obtain this vector for a image by:
4.1. Extracting SIFT descriptors in the image
4.2. For each SIFT descriptor in the image, find the closest cluster center (using Euclidean distance) in the codebook/dictionary and increment its corresponding count in the vector v
by 1.
E.g., you have a 5 cluster dictionary (e.g., k = 5) and an image has 3 SIFT descriptors. 2 of them are closest to the first cluster center and 1 is closest to the fifth cluster center. Your vector v
would be v = <2, 0, 0, 0, 1>
. Since v
counts the number of times the representative vectors occur in an image, v
is sometimes also referred to as a frequency histogram.
At this stage, you might want to normalize the histogram by dividing each entry by the sum of all the entries so that images with very different number of SIFT keypoints found can be made comparable.
Now, to compare 2 images, you compare this new vector v
instead of the SIFT descriptors themselves. Comparison can be done using euclidean distance (also known as L2 distance). It has been found that using Chi Square distance or Hellinger distance may improve results. See the details described in this page.
Basically comparing the SIFT descriptors themselves in an image to those of another is not feasible because you will end up with multiple SIFT descriptors in an image, and their number varies depending on how you extract them.
What you want is a common basis for comparison and that is done in the BOW model by matching the descriptors to a common codebook/dictionary which accounts for the representative features in a database of images.