Dimensionality Reduction using Self Organizing Map

Posted 2019-07-07 09:20

I have been working on Self Organizing Maps (SOM) for the past few months, but I am still confused about the dimensionality reduction part. Can you suggest a simple way to understand how SOMs actually work on a real-world dataset (for example, one from the UCI repository)?

1 answer
闹够了就滚
#2 · 2019-07-07 09:45

OK, so first of all, have a look at some previous related questions, which will give you a better understanding of the dimensionality reduction and visualization properties of the SOM: "Plotting the Kohonen map - Understanding the visualization" and "Interpreting a Self Organizing Map".

Second, here is a simple case to test the properties of the SOM:

  • Create a simple dataset with 3 features that contains 3 different clusters;
  • Train the SOM on this dataset and visualize the result.

I will use MATLAB to show how to do it and what you can extract from the learning process.

CODE:

% create a dataset with 3 clusters and 3 features
x=[ones(1000,1)*0.5,zeros(1000,1),zeros(1000,1)]; 
x=[x;[zeros(1000,1),ones(1000,1)*0.5,zeros(1000,1)]]; 
x=[x;[zeros(1000,1),zeros(1000,1),ones(1000,1)*0.5]]; 
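x=x+rand(3000,3)*0.2; % add uniform noise in [0, 0.2] to every feature
x=x'; % transpose: "train" expects one sample per column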

% define a 20x20 SOM with the MATLAB "selforgmap" function, and train it with the "train" function
net = selforgmap([20 20]); 
[net,tr] = train(net,x);

% display the weight planes, the neighbour distances, and the number of hits
figure, plotsomplanes(net)
figure, plotsomnd(net)
figure, plotsomhits(net,x)

OUTPUT:

So in the first figure you can already see a compression of the 3000x3 dataset into a 20x20x3 map (9000 values down to 1200, a reduction of about 7.5 times). You can also see that the component planes can easily be compressed even further, into 3 single classes.
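The size comparison above can be checked with a few lines of arithmetic. This is only a sketch using the sizes from the example; the remark about `net.IW{1}` holding the trained node weights is an assumption about the toolbox network object and is worth verifying on your own installation.

```matlab
% Sizes taken from the example above
nSamples  = 3000;      % samples in the dataset
nFeatures = 3;         % features per sample
mapSize   = [20 20];   % SOM grid

datasetValues = nSamples * nFeatures;        % values stored in the dataset
mapValues     = prod(mapSize) * nFeatures;   % values stored in the node weights
ratio         = datasetValues / mapValues;   % compression factor

fprintf('dataset: %d values, map: %d values, ratio: %.1f\n', ...
        datasetValues, mapValues, ratio);

% Assumption: after training, the node weights should be available as
% W = net.IW{1}; with one row per node (400 rows, 3 columns here).
```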

This is even more evident when you look at the neighbour distance and hit maps (figures 2 and 3, respectively):

In figure 2, the more a node differs from its neighbour (measured as the Euclidean distance between the weights of the node and the weights of its neighbour), the darker the colour between the two nodes. As such, we can see 3 regions of closely related nodes. We could threshold this image to obtain 3 different regions (the 3 clusters) and then compute the mean weights of each region.
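To make the neighbour distance concrete, here is a minimal sketch in plain MATLAB that does not need the toolbox. It assumes a rectangular grid for simplicity (as far as I know `selforgmap` uses a hexagonal topology by default, so the plot from `plotsomnd` has more neighbours per node), and the codebook `W` is a hypothetical toy example, not the trained network.

```matlab
% Toy codebook: a 4x4 grid of 3-D weights (hypothetical values).
% Left half of the grid sits near one cluster, right half near another.
rows = 4; cols = 4;
W = zeros(rows, cols, 3);
W(:, 1:2, :) = repmat(reshape([0.6 0.1 0.1], 1, 1, 3), rows, 2);
W(:, 3:4, :) = repmat(reshape([0.1 0.6 0.1], 1, 1, 3), rows, 2);

% Euclidean distance between each node and its right / lower neighbour
dRight = sqrt(sum((W(:, 1:end-1, :) - W(:, 2:end, :)).^2, 3));  % rows x (cols-1)
dDown  = sqrt(sum((W(1:end-1, :, :) - W(2:end, :, :)).^2, 3));  % (rows-1) x cols

% Thresholding the distances marks the border between regions
% (the threshold 0.3 is an arbitrary choice for this toy data)
border = dRight > 0.3;

disp(dRight)   % large values only between columns 2 and 3
disp(border)
```

Large distances line up along the border between the two halves, which is what the dark lines in the neighbour distance plot show.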

Figure 3 shows how many samples from the dataset were assigned to each node. As can be seen, the 3 regions above show a fairly homogeneous distribution of samples (which makes sense given that the 3 clusters have the same number of samples), and the interface nodes (the ones that separate the 3 regions) do not map any sample. Again, we could threshold this image to obtain 3 different regions (the 3 clusters) and then compute the mean weights of each region.
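The hit count can be reproduced by hand: assign every sample to its nearest node and count. Below is a small sketch in plain MATLAB with a hypothetical 3-node codebook and 4 made-up samples (note that samples are stored one per row here, unlike the transposed `x` passed to `train`). With the toolbox network, I believe `vec2ind(net(x))` returns the winning node of each sample, but treat that as an assumption to check.

```matlab
% Toy codebook: one row per node (hypothetical values)
W = [0.6 0.1 0.1;
     0.1 0.6 0.1;
     0.1 0.1 0.6];

% Toy samples: one row per sample (hypothetical values)
X = [0.55 0.12 0.08;
     0.62 0.05 0.15;
     0.10 0.58 0.12;
     0.09 0.11 0.61];

nNodes   = size(W, 1);
nSamples = size(X, 1);
bmu = zeros(nSamples, 1);                     % best matching unit per sample
for i = 1:nSamples
    diffs = W - repmat(X(i, :), nNodes, 1);   % node weights minus sample
    [~, bmu(i)] = min(sum(diffs.^2, 2));      % index of the nearest node
end

hits = accumarray(bmu, 1, [nNodes 1]);        % number of samples per node
disp(hits')
```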

So, in sum, with this dataset and some easy post-processing you could reduce your dataset from 3000x3 to a 3x3 matrix.
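One possible version of that post-processing, as a rough sketch: drop the nodes without hits, label each remaining node with a region, and average the weights per region. The codebook `W` and the `hits` vector are hypothetical toy values, and labelling a node by its dominant feature only works because each cluster in this toy dataset is dominated by a different feature. For other data you would need a real region labelling (for example the thresholded distance map, or a clustering of the node weights).

```matlab
% Toy codebook: 7 nodes, 3 features (hypothetical values)
W = [0.62 0.10 0.09;
     0.58 0.12 0.11;
     0.10 0.61 0.08;
     0.12 0.59 0.10;
     0.09 0.11 0.60;
     0.11 0.09 0.62;
     0.35 0.35 0.10];                 % an interface node between two regions
hits = [10; 12; 9; 11; 13; 8; 0];     % samples mapped to each node

% Keep only nodes that received samples (drops the empty interface node)
active = hits > 0;
Wa = W(active, :);

% Assumption specific to this toy data: the dominant feature is the region label
[~, region] = max(Wa, [], 2);

% Mean weight vector of each region: 3 regions x 3 features
C = zeros(3, 3);
for k = 1:3
    C(k, :) = mean(Wa(region == k, :), 1);
end
disp(C)
```

Each row of `C` is then a prototype of one cluster, which is the 3x3 summary of the dataset.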
