I have a dataset consisting of text tokens (words, different kinds of identification numbers, and some additional types) that I want to cluster using an unsupervised algorithm.
Given some features that I extract from the text (# of characters, # of digits, # of alphas, some regexes, etc.), algorithms such as kmeans (just as an example, I am not bound to kmeans) work fine, but I want to add some more details such as the Levenshtein distance, which I can use with hclust.
However, I can't find a starting point for combining the two kinds of data: pairwise data that is defined between two observations (such as distance metrics) and per-observation data (such as the number of characters in each token).
Did I miss something simple, is this even possible, or was I just looking at the wrong kind of algorithm?
Below is a small example dataset and the approaches I have tried so far.
MWE Data
# create some data
set.seed(123)
x <- sapply(1:20, function(i) {
  paste(c(
    sample(LETTERS, sample(1:10, 1), replace = TRUE),
    sample(1:9, sample(1:10, 1), replace = TRUE),
    sample(LETTERS[1:10], 2)
  ), collapse = "")
})
head(x)
#> [1] "UKW1595595761IC" "I9769675632JI" "UAMTFIG44DB" "GM814HB"
#> [5] "FDTXJR4CH" "VVULT7152464BC"
# apply the different algorithms
# 1. K-means
df <- data.frame(x)
df$nchars <- nchar(x)
df$n_nums <- nchar(gsub("[^[:digit:]]", "", x))
# etc.
kclust <- kmeans(df[, 2:3], centers = 2)
pairs(df, col=c(2:3)[kclust$cluster])
# 2. Levenshtein distance and hclust
distance <- adist(x)
rownames(distance) <- x
hc <- hclust(as.dist(distance))
plot(hc)
# 3. Combination of adist(x) and the df-variables
# ???
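If you want a method for combining the Levenshtein distance with something like the Euclidean distance, you can combine the two distance matrices, since both are n-by-n over the same observations, and pass the result to hclust.
Of course you can weight the two matrices however you like. Because the two metrics live on different scales, it usually helps to rescale them to a common range before weighting.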
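A minimal sketch of that combination, under some assumptions: the toy tokens, the weight `w`, and the helper `rescale01` are made up for illustration, and dividing each matrix by its maximum is just one way to put the two distances on a comparable scale.

```r
# Toy tokens (hypothetical; any character vector works)
tokens <- c("AB12", "AB13", "XYZ9876", "XYZ9877", "HELLO", "HELLP")

# Per-token features, built the same way as in the question
feats <- data.frame(
  nchars = nchar(tokens),
  n_nums = nchar(gsub("[^[:digit:]]", "", tokens))
)

# Two n x n distance matrices over the same observations
d_lev <- adist(tokens)                  # pairwise Levenshtein distance
d_euc <- as.matrix(dist(scale(feats)))  # Euclidean on standardized features

# Assumption: rescale each matrix to [0, 1] so neither one dominates
rescale01 <- function(m) m / max(m)

w <- 0.5  # hypothetical weight given to the Levenshtein part
d_comb <- w * rescale01(d_lev) + (1 - w) * rescale01(d_euc)
dimnames(d_comb) <- list(tokens, tokens)

# Hierarchical clustering on the combined distance
hc <- hclust(as.dist(d_comb), method = "average")
groups <- cutree(hc, k = 3)
```

Moving `w` towards 1 makes the clustering follow the string similarity more closely, while moving it towards 0 lets the extracted features dominate.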
If you want to combine k-means and hierarchical clustering, I know of one way to do that: perform hierarchical clustering on a matrix, cut the tree into k groups, calculate the mean of each group, and pass those means as the starting centroids for k-means.
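The steps above could look roughly like this. The feature matrix `m` is simulated purely for illustration, and the linkage method (`"ward.D2"`) and `k = 2` are assumptions rather than recommendations.

```r
set.seed(1)

# Hypothetical numeric feature matrix: two separated groups of 20 points
m <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
)
colnames(m) <- c("f1", "f2")

k <- 2

# 1. Hierarchical clustering, cut into k groups
hc <- hclust(dist(m), method = "ward.D2")
init <- cutree(hc, k = k)

# 2. Mean of each group: per-group column sums divided by group sizes
centers <- rowsum(m, init) / as.vector(table(init))

# 3. Use those means as the starting centroids for k-means
km <- kmeans(m, centers = centers)
```

Note that `kmeans` requires the initial centers to be distinct, so this would fail if two hierarchical groups happened to have identical means.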
If you want to combine k-means with Levenshtein, I'm afraid I don't know how to do that: k-means needs coordinates to compute cluster means, so it doesn't make much sense to pass it a distance matrix. Maybe k-medoids could work, since it operates on dissimilarities directly?
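As a rough sketch of the k-medoids idea, `pam` from the `cluster` package accepts a dissimilarity object directly. The toy tokens and `k = 3` are assumptions for illustration; the same call should also accept a combined (weighted) distance matrix in place of the pure Levenshtein one.

```r
library(cluster)  # recommended package that ships with standard R installations

# Toy tokens (hypothetical)
tokens <- c("AB12", "AB13", "XYZ9876", "XYZ9877", "HELLO", "HELLP")

# Pairwise Levenshtein distances
d_lev <- adist(tokens)
dimnames(d_lev) <- list(tokens, tokens)

# Partitioning around medoids on the dissimilarity matrix
pm <- pam(as.dist(d_lev), k = 3, diss = TRUE)

pm$clustering      # cluster assignment per token
tokens[pm$id.med]  # the tokens chosen as medoids
```

Unlike k-means centroids, the medoids are actual observations, so each cluster is represented by a real token from the data.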