I have a dataset of text tokens (words, different kinds of identification numbers, and some additional types) that I want to group using an unsupervised algorithm, i.e. clustering.
Given some features that I extract from the text (number of characters, number of digits, number of letters, some regexes, etc.), algorithms such as kmeans (just as an example, I am not bound to kmeans) work fine, but I want to add more information, such as the Levenshtein distance, which I can use with hclust.
However, I cannot find a starting point for combining the two different kinds of data: data that describes a pair of observations (such as a distance metric) and data that describes a single observation (such as the number of characters in each token).
Did I miss something simple, is this even possible, or was I just looking for the wrong algorithm?
Below is a small example dataset and the approaches I have tried so far.
MWE Data
# create some data
set.seed(123)
x <- sapply(1:20, function(i) {
  paste(c(
    sample(LETTERS, sample(1:10, 1), replace = TRUE),
    sample(1:9, sample(1:10, 1), replace = TRUE),
    sample(LETTERS[1:10], 2)
  ), collapse = "")
})
head(x)
#> [1] "UKW1595595761IC" "I9769675632JI" "UAMTFIG44DB" "GM814HB"
#> [5] "FDTXJR4CH" "VVULT7152464BC"
# apply the different algorithms
# 1. K-means
df <- data.frame(x)
df$nchars <- nchar(x)
df$n_nums <- nchar(gsub("[^[:digit:]]", "", x))
# etc.
kclust <- kmeans(df[, 2:3], centers = 2)
pairs(df, col=c(2:3)[kclust$cluster])
# 2. Levenshtein distance and hclust
distance <- adist(x)
rownames(distance) <- x
hc <- hclust(as.dist(distance))
plot(hc)
# 3. Combination of adist(x) and the df-variables
# ???
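
To make step 3 more concrete, here is a rough sketch of the kind of combination I have in mind, not a claim that it is the right method. It assumes that the per-token features can be turned into a pairwise distance matrix (Euclidean distance on standardised columns), that both matrices can be rescaled to [0, 1], and that a weighted sum of the two is a meaningful dissimilarity. The weight `w`, the helper `rescale01`, and the choice of average linkage are my own assumptions for illustration.

```r
# Recreate the example data from above
set.seed(123)
x <- sapply(1:20, function(i) {
  paste(c(
    sample(LETTERS, sample(1:10, 1), replace = TRUE),
    sample(1:9, sample(1:10, 1), replace = TRUE),
    sample(LETTERS[1:10], 2)
  ), collapse = "")
})

# Per-token features (one value per observation)
feats <- data.frame(
  nchars = nchar(x),
  n_nums = nchar(gsub("[^[:digit:]]", "", x))
)

# Pairwise feature distance: Euclidean on standardised columns
d_feat <- as.matrix(dist(scale(feats)))

# Pairwise string distance: Levenshtein via adist()
d_lev <- adist(x)

# Hypothetical helper: rescale a distance matrix to [0, 1]
# (assumes the matrix has at least one non-zero entry)
rescale01 <- function(m) m / max(m)

# Assumption: equal weighting of the two sources; w is a tuning parameter
w <- 0.5
d_comb <- w * rescale01(d_feat) + (1 - w) * rescale01(d_lev)
dimnames(d_comb) <- list(x, x)

# Cluster on the combined dissimilarity
hc_comb <- hclust(as.dist(d_comb), method = "average")
groups <- cutree(hc_comb, k = 2)
table(groups)
```

Since hclust only needs a dissimilarity object, anything that accepts a precomputed distance matrix could be swapped in here. What I cannot judge is whether such a weighted sum is statistically sensible or how `w` should be chosen.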