I have a dataset that contains both categorical (nominal and ordinal) and numerical attributes. I want to calculate the (dis)similarity matrix across my observations using these mixed attributes. Using the daisy() function of the cluster package in R, I can easily get a dissimilarity matrix as follows:
if(!require("cluster")) { install.packages("cluster"); require("cluster") }
data(flower)
as.matrix(daisy(flower, metric = "gower"))
This uses the Gower metric to deal with the nominal variables. Is there a Python equivalent of the daisy() function in R?
Or maybe any other module function that allows using the Gower metric or something similar to calculate the (dis)similarity matrix for a dataset with mixed (nominal, numeric) attributes?
I believe you are looking for scipy.spatial.distance.pdist.
If you implement a function that computes the Gower distance on a single pair of observations, you can pass that function to pdist and it will apply it to every pair and return the pairwise distances in condensed form (scipy.spatial.distance.squareform converts them to a square matrix). It does not appear that the Gower distance is one of the built-in options.
Likewise, if a single observation has mixed attributes, you can just define your own function which, say, uses something like the Euclidean distance on the subset of numerical attributes, a Gower distance on the subset of categorical attributes, and adds them -- or any other implementation of what it means to you, for your application, to compute the distance between two isolated observations.
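As a rough sketch of that idea, the snippet below builds a pairwise Gower-style function and hands it to pdist. It assumes the categorical columns have already been encoded as numeric codes (pdist works on numeric arrays), that there are no missing values, and that every numeric column has a non-zero range. The names make_gower, is_categorical and ranges are made up for illustration and are not part of SciPy.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def make_gower(is_categorical, ranges):
    """Return a function computing a Gower-style distance between two rows.

    is_categorical: boolean flag per column (hypothetical helper argument).
    ranges: max - min per column, used to scale numeric differences.
    """
    is_categorical = np.asarray(is_categorical, dtype=bool)
    ranges = np.asarray(ranges, dtype=float)

    def gower(u, v):
        # Categorical columns: 0 if the codes match, 1 otherwise.
        cat = (u[is_categorical] != v[is_categorical]).astype(float)
        # Numeric columns: absolute difference scaled by the column range.
        num = np.abs(u[~is_categorical] - v[~is_categorical]) / ranges[~is_categorical]
        # Unweighted average over all columns.
        return (cat.sum() + num.sum()) / u.size

    return gower

# Toy data: column 0 is a category code, column 1 is numeric.
X = np.array([
    [0, 1.0],
    [0, 3.0],
    [1, 5.0],
])
is_cat = [True, False]
ranges = X.max(axis=0) - X.min(axis=0)

# pdist returns a condensed vector; squareform expands it to a square matrix.
D = squareform(pdist(X, metric=make_gower(is_cat, ranges)))
print(D)
```

For the toy data, rows 0 and 1 share a category and differ by half the numeric range, so their distance is (0 + 0.5) / 2 = 0.25.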
For clustering in Python, you usually want to work with scikit-learn (formerly scikits.learn). This question and answer page discusses exactly this problem of using a custom distance measure (in your case Gower) with scikit-learn -- which does not appear possible.
You could use one of the choices provided by pdist along with the implementation at that linked answer page -- or you could implement a function for the Gower similarity and use that. But if you want the out-of-the-box clustering tools from scikits, it does not appear to be directly possible.
Implementing a Gower function to use with pdist won't be enough on its own.
Internally, pdist performs several numerical transformations that will fail if you pass it a matrix with mixed data.
I implemented the Gower function according to the original paper, along with the adaptations needed in the pdist module (I could not simply override the functions, because the defs in the pdist module are private).
The results I have obtained with this so far match those from R's daisy function.
The source code is available in this Jupyter notebook:
https://sourceforge.net/projects/gower-distance-4python/files/
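Separately from the notebook linked above, here is a minimal sketch of how the numeric-conversion problem could be sidestepped by computing the matrix directly with pandas and NumPy instead of going through pdist. It is not the linked implementation. It assumes every non-numeric column is nominal (ordinal columns are not ranked the way daisy ranks them), that there are no missing values, and that all columns are weighted equally. The function name gower_matrix is hypothetical.

```python
import numpy as np
import pandas as pd

def gower_matrix(df):
    """Gower-style dissimilarity matrix for a DataFrame with mixed columns."""
    n = len(df)
    total = np.zeros((n, n))
    for col in df.columns:
        values = df[col]
        if pd.api.types.is_numeric_dtype(values):
            x = values.to_numpy(dtype=float)
            span = x.max() - x.min()
            if span == 0:
                # A constant column adds no distance between any pair.
                continue
            # Absolute difference scaled by the column range.
            total += np.abs(x[:, None] - x[None, :]) / span
        else:
            # Nominal column: 0 where categories match, 1 where they differ.
            codes = pd.factorize(values)[0]
            total += (codes[:, None] != codes[None, :]).astype(float)
    # Unweighted average over all columns.
    return total / df.shape[1]

df = pd.DataFrame({
    "color": ["red", "red", "blue"],
    "height": [1.0, 3.0, 5.0],
})
G = gower_matrix(df)
print(G)
```

Because the categorical columns never pass through a float conversion, string-valued columns can be used as they are. The broadcasting builds full n-by-n arrays per column, so this is only practical for moderately sized datasets.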