So I am stumped by something that should be simple:
I have written a SOM for a simple 'play' two-dimensional data set. Here is the data:
You can make out 3 clusters by yourself.
Now, there are two things that confuse me. The first is that the tutorial I am following normalizes the data before the SOM gets to work on it, meaning it normalizes each data vector to have length 1 (Euclidean norm). If I do that, the data looks like this:
(This is because all the data has been projected onto the unit circle).
So, my question(s) are as follows:
1) Is this correct? Projecting the data down onto the unit circle seems like a bad idea, because you can no longer make out the 3 clusters... Is this a fact of life for SOMs (i.e., that they only work on the unit circle)?
2) The second, related question is that not only are the data normalized to have length 1, but so are the weight vectors of each output unit after every iteration. I understand that this is done so that the weight vectors don't 'blow up', but it seems wrong to me, since the whole point of the weight vectors is to retain distance information. If you normalize them, you lose the ability to 'cluster' properly. For example, how can the SOM possibly distinguish the cluster on the lower left from the cluster on the upper right, since they project onto the unit circle the same way?
I am very confused by this. Should data be normalized to unit length in SOMs? Should the weight vectors be normalized as well?
Thanks!
EDIT
Here is the data, saved as a .mat file for MATLAB. It is a simple 2 dimensional data set.
Whether you should normalize the input data depends on what the data represent. Let's say you are clustering two-dimensional (or three-dimensional) input data in which each data vector represents a spatial point: the first dimension is the x coordinate and the second the y coordinate. In this case you don't normalize the input data, because the input features (the dimensions) are comparable to each other.
If you are again clustering in a two-dimensional space, but each input vector represents the age and the annual income of a person (the first feature/dimension is the age and the second is the annual income), then you must normalize the input features, because they represent different things (different measurement units) on completely different scales. Let's examine these input vectors: D1(25, 30000), D2(50, 30000) and D3(25, 60000). Both D2 and D3 double one of the features compared to D1. Keep in mind that SOM uses Euclidean distance measures. Distance(D1, D2) = 25 and Distance(D1, D3) = 30000.
It's rather "unfair" to the first input feature (age): although it is doubled, you get a much smaller distance than in the second case (D1, D3).
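To see this concretely, here is a minimal MATLAB sketch (the vectors are the ones from the example above; norm gives the Euclidean distance):

% Age / annual-income example: distances before any normalization
D1 = [25 30000];
D2 = [50 30000];
D3 = [25 60000];
norm(D1 - D2)   % = 25,    age doubled
norm(D1 - D3)   % = 30000, income doubled

The income dimension completely dominates the distance, which is exactly what normalization is meant to fix.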
Check this, which also has a similar example.
If you are going to normalize your input data, you normalize each feature/dimension (each column of your input data table). Quoting from the som_normalize manual:
"Normalizations are always one-variable operations"
Also check this for a brief explanation of normalization, and if you want to read more, try this (chapter 7 is what you want).
EDIT:
The most common normalization methods are scaling each dimension's data to [0, 1], or transforming it to have zero mean and standard deviation 1. The first is done by subtracting from each input the minimum value of its dimension (column) and then dividing by that dimension's maximum value minus its minimum value:
Xi,norm = (Xi - Xmin)/(Xmax-Xmin)
Yi,norm = (Yi - Ymin)/(Ymax-Ymin)
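As a rough MATLAB sketch of the first method (assuming your points sit in an N-by-2 matrix X, one row per point, and a MATLAB version with implicit expansion, i.e. R2016b or later; otherwise use bsxfun):

% Scale each column (feature) of X to [0, 1]
Xmin = min(X);                          % per-column minima (1-by-2 row vector)
Xmax = max(X);                          % per-column maxima
Xscaled = (X - Xmin) ./ (Xmax - Xmin);  % each column now lies in [0, 1]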
In the second method you subtract the mean value of each dimension and then divide by its standard deviation:
Xi,norm = (Xi - Xmean)/(Xsd)
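And a sketch of the second method under the same assumptions (X is the N-by-2 data matrix):

% Transform each column of X to zero mean and unit standard deviation
Xz = (X - mean(X)) ./ std(X);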
Each method has pros and cons; for example, the first is very sensitive to outliers in the data. You should choose after examining the statistical characteristics of your dataset.
Projecting onto the unit circle is not actually a normalization method but more of a dimensionality reduction method, since after the projection you could replace each data point with a single number (e.g. its angle). You don't have to do this.
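To see why it behaves like a dimensionality reduction, here is a small MATLAB sketch (same assumption of an N-by-2 matrix X with implicit expansion): after row-wise normalization, only the angle of each point is left.

% Project every row of X onto the unit circle
Xunit = X ./ sqrt(sum(X.^2, 2));        % each row now has Euclidean norm 1
theta = atan2(Xunit(:,2), Xunit(:,1));  % the angle is the only information that remains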
In the SOM training algorithm, several different measures are used to calculate the distance between vectors (patterns and weights). To name a couple of them (perhaps the most widely used): Euclidean distance and the dot product. If you normalize vectors and weights to unit length, the two are equivalent, which allows the network to learn in the most effective way. If, for instance, you do not normalize your current data, the network will process points from different parts of the input space with different bias (larger values will have a larger effect). This is why normalization to unit length is important and considered an appropriate step in most cases (specifically if the dot product is used as the measure).
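A small sketch of that equivalence: for unit-length vectors a and b, the squared Euclidean distance equals 2 - 2*dot(a, b), so ranking candidate weight vectors by distance or by dot product picks the same winner.

% For unit vectors, Euclidean distance and dot product carry the same information
a = [3 4];  a = a / norm(a);
b = [1 2];  b = b / norm(b);
norm(a - b)^2       % squared Euclidean distance
2 - 2 * dot(a, b)   % identical value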
Your source data should be prepared before they can be normalized to the unit circle. You should map the data into the [-1, 1] range on both axes. There exist several algorithms for this; one of them uses the simple formulae:
mult_factor = 2 / (max - min);
offset_factor = 1 - 2 * max / (max - min),
where min and max are the minimal and maximal values in your data set, or the domain boundaries, if they are known beforehand. Every dimension is processed separately; in your case, these are the X and Y coordinates.
Xnew(i) = Xold(i) * Xmult_factor + Xoffset_factor, i = 1..N
Ynew(i) = Yold(i) * Ymult_factor + Yoffset_factor, i = 1..N
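In MATLAB this could look like the following sketch (assuming an N-by-2 data matrix X and implicit expansion; min and max are taken column-wise):

% Map every column of X into [-1, 1] using the factors above
mn = min(X);
mx = max(X);
mult_factor   = 2 ./ (mx - mn);
offset_factor = 1 - 2 .* mx ./ (mx - mn);
Xmapped = X .* mult_factor + offset_factor;   % every column now lies in [-1, 1]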
No matter what the actual values of min and max are before the mapping (they could be [0, 1], as in your case, or [-3.6, 10]), after the mapping the data will fall into the range [-1, 1]. In fact, the formulae above are specific to converting data into the range [-1, 1], because they are just a special case of the general conversion from one range into another:
data[i] = (data[i] - old_min) * (new_max - new_min) / (old_max - old_min) + new_min;
After the mapping you can proceed with the normalization to the unit circle, and this way you'll finally get a circle centered at [0, 0].
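Putting the two steps together, a sketch of the whole preparation (same assumptions as above; note that a row that maps exactly to [0, 0] would need special handling, since it has no direction):

% Map each column to [-1, 1], then normalize each row to unit length
Xmapped = X .* mult_factor + offset_factor;   % from the step above
r = sqrt(sum(Xmapped.^2, 2));                 % per-row Euclidean norms
Xcircle = Xmapped ./ r;                       % rows now lie on the unit circle around [0, 0]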
You can find more information on this page. Though the site is not about neural networks in general, this specific page provides good explanations of SOM, including descriptive graphs on data normalization.