I'm learning R and I have to cluster numeric data with a timestamp field. One of the parameters is a time, and since the data is strictly day-night dependent, I want to take into account the "spherical" nature of this data.
As far as I saw from the manual, libraries such as skmeans cannot handle "cylindrical" data but only "spherical" data (i.e. where all the components are in polar coordinates).
My idea for a suitable solution is the following: I can decompose the HOUR column (0-24) into two columns X and Y and express the time in polar coordinates, such that x^2+y^2=1. That way, k-means with Euclidean distance should have no problem interpreting the data.
Am I right?
k-means should use squared Euclidean distance.
But indeed: projecting your data into a meaningful Euclidean space is an easy way to avoid this kind of problem.
However, be aware that your mean will no longer lie on the cylinder. In many cases, you can just scale the mean back onto the desired cylinder. But the mean might become 0, in which case no meaningful rescaling is possible.
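A minimal sketch of that rescaling step, looking only at the two time coordinates and assuming the hour was mapped to (cos, sin) of an angle covering 24 hours. The `centers` values are made up for illustration and would normally come from the `centers` element of a `kmeans` fit:

```r
# Illustrative cluster centers in (x, y) space; in practice these would be
# the time columns of fit$centers from a kmeans() result.
centers <- rbind(c(0.9, 0.1), c(-0.8, 0.05))

# Length of each center vector. A value near 0 means the cluster is spread
# all around the clock, so it has no meaningful "mean hour".
r <- sqrt(rowSums(centers^2))

# Scale each center back onto the unit circle (undefined when r is 0).
on_circle <- centers / r

# Recover the hour of day from the angle; %% 24 maps negative values into 0-24.
center_hour <- (atan2(centers[, 2], centers[, 1]) * 24 / (2 * pi)) %% 24
```

Checking `r` before dividing is the practical guard against the degenerate case: a center close to the origin should be reported as having no defined hour rather than rescaled.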
The other option is kernel k-means. As your desired distance is Euclidean after a data transformation, you can also "kernelize" this transformation and use kernel k-means. But in your particular case it may actually be faster to transform your data explicitly. Kernel k-means will likely only pay off with much more complex transformations (say, to an infinite-dimensional vector space).
Here is such a mapping of `h` to `m`, where `h` is the time in hours (and fractions of an hour). Then we try `kmeans`, and at least in this test it seems to work:
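A minimal sketch of such a test, assuming simulated observations concentrated around midnight and around noon. The data, the seed, and the helper name `hour_to_xy` are illustrative and not taken from the original post:

```r
# Map hour of day (0-24, may be fractional) onto the unit circle.
# hour_to_xy is an illustrative helper name, not part of any package.
hour_to_xy <- function(h) {
  angle <- 2 * pi * h / 24
  cbind(x = cos(angle), y = sin(angle))
}

set.seed(42)
# Simulated data: one group around midnight (wrapping past 24), one around noon.
h <- c(rnorm(100, mean = 0, sd = 1) %% 24,
       rnorm(100, mean = 12, sd = 1))
m <- hour_to_xy(h)

# Plain k-means on the transformed coordinates.
fit <- kmeans(m, centers = 2, nstart = 10)

# Cross-tabulate cluster labels against the simulated groups.
table(cluster = fit$cluster, group = rep(c("midnight", "noon"), each = 100))
```

Running k-means on the raw hour instead would tend to split the midnight group, since values such as 23.5 and 0.5 are far apart on the 0-24 scale but adjacent on the circle.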