I was surprised to find out that clara from library(cluster) allows NAs, but the function documentation says nothing about how it handles these values.
So my questions are:
- How does clara handle NAs?
- Can this somehow be used for kmeans (NAs not allowed)?
[Update] I found these lines of code in the clara function:
inax <- is.na(x)
valmisdat <- 1.1 * max(abs(range(x, na.rm = TRUE)))
x[inax] <- valmisdat
which replace the missing values with valmisdat. I'm not sure I understand the reason for using this formula. Any ideas? Would it be more "natural" to treat NAs in each column separately, perhaps replacing them with the column mean or median?
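To see what those three lines do, here is a minimal sketch on a made-up matrix (the data are an assumption for illustration only). It suggests that valmisdat is not an imputed value but a marker chosen to lie outside the observed range, so it cannot collide with any real entry:

```r
# Toy data, invented for illustration: two columns with one NA each.
x <- matrix(c(1, NA, 3,
              -4, 5, NA), nrow = 3)

inax <- is.na(x)                                      # remember where the NAs were
valmisdat <- 1.1 * max(abs(range(x, na.rm = TRUE)))   # 1.1 * 5 here
x[inax] <- valmisdat                                  # overwrite NAs with the marker

# The marker is strictly larger in magnitude than every observed value.
observed <- x[!inax]
all(abs(observed) < valmisdat)
```

Because the marker is computed over the whole matrix rather than per column, a single value works for every variable, which may be why a per-column mean or median is not used at this stage.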
Although not stated explicitly, I believe that NAs are handled in the manner described in the ?daisy help page. The Details section has:
In the daisy algorithm, missing values in a row of x are not included in the dissimilarities
involving that row.
Given that clara() uses the same code internally, that is how I understand NAs in the data are handled: they simply don't take part in the computation. This is a reasonably standard way of proceeding in such cases and is, for example, used in the definition of Gower's generalised similarity coefficient.
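A quick way to check this behaviour is to call daisy() on data containing an NA. The sketch below uses invented data; rows 1 and 3 agree on the only variable they share, so their dissimilarity should come out as zero even though row 3 is incomplete:

```r
library(cluster)

# Toy data, invented for illustration: row 3 is missing variable b.
x <- data.frame(a = c(1, 4, 1, 7),
                b = c(2, 6, NA, 9))

d <- as.matrix(daisy(x, metric = "euclidean"))

d[1, 2]    # both rows complete: ordinary Euclidean distance
d[1, 3]    # only variable a is compared, and the two rows agree on it
anyNA(d)   # no dissimilarity is missing, despite the NA in the data
```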
Update: The C sources for clara.c clearly indicate that this (the above) is how NAs are handled by clara() (lines 350-356 in ./src/clara.c):
if (has_NA && jtmd[j] < 0) { /* x[,j] has some Missing (NA) */
    /* in the following line (Fortran!), x[-2] ==> seg.fault
       {BDR to R-core, Sat, 3 Aug 2002} */
    if (x[lj] == valmd[j] || x[kj] == valmd[j]) {
        continue /* next j */;
    }
}
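The same skip logic can be sketched in R. This is only an illustration of the idea, not the package's code: the helper name sum_sq_skipping and its arguments are hypothetical, with col_has_na standing in for the jtmd[j] < 0 test and sentinel for valmd.

```r
# Hypothetical helper: accumulate squared differences between two observations,
# skipping any variable where either observation holds the missing-value marker.
sum_sq_skipping <- function(xl, xk, col_has_na, sentinel) {
  dsum <- 0
  nobs <- 0
  for (j in seq_along(xl)) {
    if (col_has_na[j] && (xl[j] == sentinel[j] || xk[j] == sentinel[j])) {
      next  # this variable is missing for at least one of the two rows
    }
    nobs <- nobs + 1
    dsum <- dsum + (xl[j] - xk[j])^2
  }
  list(dsum = dsum, nobs = nobs)
}

# 5.5 plays the role of the marker; variables 2 and 3 are skipped.
res <- sum_sq_skipping(xl = c(1, 5.5, 3),
                       xk = c(2, 4, 5.5),
                       col_has_na = c(FALSE, TRUE, TRUE),
                       sentinel = rep(5.5, 3))
```

Only the first variable contributes here, so res$nobs records that a single variable was actually compared.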
I'm not sure whether kmeans can handle missing data by ignoring the missing values in a row.
There are two steps in kmeans:
- calculating the distance between each observation and each current cluster mean.
- updating the cluster means based on the newly calculated distances.
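The two steps above can be sketched as a single hand-written iteration on complete data. This is a toy illustration with invented data and starting centers, not the algorithm that stats::kmeans actually runs:

```r
# Toy data, invented for illustration: two obvious groups.
x <- matrix(c( 0,  0,
               0,  1,
              10, 10,
              10, 11), ncol = 2, byrow = TRUE)
centers <- x[c(1, 3), , drop = FALSE]   # assumed starting centers

# Step 1: squared Euclidean distance from every observation to every center,
# then assign each observation to its nearest center.
d2 <- sapply(seq_len(nrow(centers)),
             function(k) rowSums(sweep(x, 2, centers[k, ])^2))
assignment <- max.col(-d2, ties.method = "first")

# Step 2: recompute each center as the column means of its members.
new_centers <- t(sapply(seq_len(nrow(centers)),
                        function(k) colMeans(x[assignment == k, , drop = FALSE])))
```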
When we have missing data in our observations:
Step 1 can be handled by adjusting the distance metric appropriately, as clara/pam/daisy do. But Step 2 can only be performed if we have some value for each column of an observation. Therefore imputation might be the next best option for kmeans to deal with missing data.
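A minimal sketch of that route, assuming simulated data and per-column median imputation (both are illustrative choices; impute_median is a hypothetical helper, and a more careful imputation method may be preferable in practice):

```r
set.seed(1)

# Simulated data, invented for illustration: two groups of 10 points each.
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))
x[c(3, 15), 1] <- NA
x[8, 2] <- NA

# kmeans(x, centers = 2) would stop with an error here because of the NAs.

# Hypothetical helper: replace the NAs in one column with that column's median.
impute_median <- function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
}
x_imp <- apply(x, 2, impute_median)

fit <- kmeans(x_imp, centers = 2, nstart = 5)
```

Note that imputing with a column-wide median pulls the filled-in points toward the centre of the data, which can blur cluster boundaries when many values are missing.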
Looking at the clara C code, I noticed that when there are missing values in the observations, the sum of squares is "reduced" in proportion to the number of missing values, which I think is wrong. Line 646 of clara.c reads "dsum *= (nobs / pp)": it counts the number of non-missing values in each pair of observations (nobs), divides it by the number of variables (pp), and multiplies the sum of squares by the result. I think it should be done the other way around, i.e. "dsum *= (pp / nobs)", so that a sum taken over fewer variables is scaled up rather than down.
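A small worked example makes the difference concrete. The vectors below are invented: the two observations differ by 1 on every variable they share, so with all pp = 4 variables observed the sum of squares would be 4.

```r
pp <- 4                       # total number of variables
a <- c(0, 0, NA, NA)          # toy observations, invented for illustration
b <- c(1, 1, 1, NA)

ok   <- !is.na(a) & !is.na(b) # variables observed in both rows
nobs <- sum(ok)               # 2
dsum <- sum((a[ok] - b[ok])^2)  # 2, from the two shared variables

shrunk   <- dsum * (nobs / pp)  # 1: scaling as quoted from clara.c
inflated <- dsum * (pp / nobs)  # 4: scaling proposed above
```

Only the pp / nobs version recovers the value expected from complete data. As far as I know, later releases of cluster added a correct.d argument to clara() that concerns exactly this distance computation when NAs are present; check ?clara for your installed version.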