Here is my code. Example 1:
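library(tm)  # provides VCorpus, VectorSource, TermDocumentMatrix, inspect
a <- c("ab cd de", "ENERGIZER A23 12V ALKALINE BATTERi")
a1 <- VCorpus(VectorSource(a))
# stemming = TRUE relies on the SnowballC package being installed
a2 <- TermDocumentMatrix(a1, control = list(stemming = TRUE))
inspect(a2)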
The result is:
         Docs
Terms     1 2
  12v     0 1
  a23     0 1
  alkalin 0 1
  batteri 0 1
  energ   0 1
It looks like the first string in a is ignored entirely: none of its terms (ab, cd, de) appear in the matrix.
Example 2:
a <- c("abcd cde de", "ENERGIZER A23 12V ALKALINE BATTERi")
a1 <- VCorpus(VectorSource(a))
# same call as before; only the first string has changed
a2 <- TermDocumentMatrix(a1, control = list(stemming = TRUE))
inspect(a2)
The result is:
         Docs
Terms     1 2
  12v     0 1
  a23     0 1
  abcd    1 0
  alkalin 0 1
  batteri 0 1
  cde     1 0
  energ   0 1
We can see that two of the substrings (abcd, cde) are kept, while the shortest one (de) is still missing. The result is the same if I drop the stemming option from the control list. So I am curious whether this is by design in tm: are terms shorter than 3 characters dropped? I do not think this is a good default. A short string, such as an abbreviation, can easily be useful.
If so, is there a parameter that changes this behavior? Thanks a lot.
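From a quick look at ?termFreq, there seems to be a wordLengths control option whose default is c(3, Inf), which would match what I am seeing. Below is a minimal sketch of what I plan to try, assuming that option is honored when passed through the control list of TermDocumentMatrix. The variable names corpus and tdm are just mine, and I left stemming out to keep the test isolated.

```r
library(tm)

a <- c("ab cd de", "ENERGIZER A23 12V ALKALINE BATTERi")
corpus <- VCorpus(VectorSource(a))

# Assumption: wordLengths = c(min, max) sets the term lengths that are kept.
# The documented default is c(3, Inf); lowering the minimum to 1 should
# keep two-letter terms such as "ab", "cd", and "de".
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))

inspect(tdm)
```

If this works as I hope, the short terms from the first string should show up as rows in the matrix alongside the terms from the second string.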