Lemmatization using a txt file with lemmas in R

Posted 2019-04-15 02:22

Question:

I would like to use an external txt file with Polish lemmas, structured as follows (lemma lists for many other languages are available at http://www.lexiconista.com/datasets/lemmatization/):

Abadan  Abadanem
Abadan  Abadanie
Abadan  Abadanowi
Abadan  Abadanu
abadańczyk  abadańczycy
abadańczyk  abadańczyka
abadańczyk  abadańczykach
abadańczyk  abadańczykami
abadańczyk  abadańczyki
abadańczyk  abadańczykiem
abadańczyk  abadańczykom
abadańczyk  abadańczyków
abadańczyk  abadańczykowi
abadańczyk  abadańczyku
abadanka    abadance
abadanka    abadanek
abadanka    abadanką
abadanka    abadankach
abadanka    abadankami

Which packages, and with what syntax, would allow me to use such a txt database to lemmatize my bag of words? I realize that for English there is WordNet, but there is no such luck for those who need this functionality for less common languages.

If no such package exists, can this database be converted so that it works with any package that provides lemmatization? Perhaps by converting it to a wide form, for instance the form used by the free AntConc concordancer (http://www.laurenceanthony.net/software/antconc/):

Abadan -> Abadanem, Abadanie, Abadanowi, Abadanu
abadańczyk -> abadańczycy, abadańczyka, abadańczykach 
etc.
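
The long-to-wide conversion asked about here can be sketched in base R. This is only a minimal illustration under stated assumptions: the data frame `lemmas`, its column names, and the file names in the comments are made up for the example, and the rows are a few entries from the listing above.

```r
# Illustrative sample of the two-column list (lemma, inflected form).
# Assumption: a real file would be read with something like
#   lemmas <- read.delim("lemmas-pl.txt", header = FALSE,
#                        col.names = c("lemma", "word"),
#                        stringsAsFactors = FALSE)
# where "lemmas-pl.txt" is a hypothetical file name.
lemmas <- data.frame(
  lemma = c("Abadan", "Abadan", "Abadan", "abadanka", "abadanka"),
  word  = c("Abadanem", "Abadanie", "Abadanu", "abadance", "abadanek"),
  stringsAsFactors = FALSE
)

# Group inflected forms by lemma, keeping the order in which lemmas first appear.
groups <- split(lemmas$word, factor(lemmas$lemma, levels = unique(lemmas$lemma)))

# Build one "lemma -> form1, form2, ..." line per lemma.
wide <- paste(names(groups), "->",
              vapply(groups, paste, character(1), collapse = ", "))

# writeLines(wide, "lemmas-wide.txt")  # hypothetical output file
print(wide)
```

Whether a given tool accepts exactly this layout should be checked against its own documentation.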

In brief: how can lemmatization with lemmas in a txt file be done in any of the known CRAN R text-mining packages, and how should such a txt file be formatted?

UPDATE: Dear @DmitriySelivanov, I got rid of all diacritical marks, and now I would like to apply it to the tm corpus "docs":

docs <- tm_map(docs, function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm")) 

I also tried it as a tokenizer:

LemmaTokenizer <- function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm")

docsTDM <-
  DocumentTermMatrix(docs, control = list(wordLengths = c(4, 25), tokenize=LemmaTokenizer)) 

It throws an error:

 Error in lemma_hashmap[[tokens]] : 
  attempt to select more than one element in vectorIndex 

The function works like a charm on a character vector of texts, though.

Answer 1:

My guess is that this task has nothing to do with text-mining packages. You just need to replace each word found in the second column with the word in the first column. You can do this by creating a hashmap (for example, https://github.com/nathan-russell/hashmap).
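
The replacement idea itself does not depend on any particular package. As a rough sketch, a named character vector in base R can play the role of the lookup table; the names `lemmas`, `lookup`, and `tokens` are invented for this illustration, and the sample rows come from the question.

```r
# Small sample of (lemma, inflected form) pairs taken from the question.
lemmas <- data.frame(
  lemma = c("Abadan", "abadanka", "abadanka"),
  word  = c("Abadanem", "abadance", "abadanek"),
  stringsAsFactors = FALSE
)

# Named vector: names are inflected forms, values are lemmas.
lookup <- setNames(lemmas$lemma, lemmas$word)

tokens <- c("Abadanem", "abadanek", "OutOfVocabulary")

# Indexing by name yields NA for tokens that are not in the table.
hits <- unname(lookup[tokens])

# Keep the original token wherever no lemma was found.
result <- ifelse(is.na(hits), tokens, hits)
print(result)
```

For a lemma list with millions of rows, a dedicated hash table is likely to be faster than name-based indexing.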

Below is an example of how you can create a "lemmatizing" tokenizer, which you can easily use in text2vec (and, I guess, in quanteda as well).

Contributions toward creating such a "lemmatizing" package are very welcome; it would be very useful.

library(hashmap)
library(data.table)
txt = 
  "Abadan  Abadanem
  Abadan  Abadanie
  Abadan  Abadanowi
  Abadan  Abadanu
  abadańczyk  abadańczycy
  abadańczyk  abadańczykach
  abadańczyk  abadańczykami
  "
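# fread() treats input that contains newlines as literal text, not a file name
dt = fread(txt, header = FALSE, col.names = c("lemma", "word"))
# keys are the inflected forms, values are their lemmas
lemma_hm = hashmap(dt$word, dt$lemma)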

lemma_hm[["Abadanu"]]
#"Abadan"


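# Tokenize each text, then replace every token found in the hashmap with its lemma.
lemma_tokenizer = function(x, lemma_hashmap, 
                           tokenizer = text2vec::word_tokenizer) {
  tokens_list = tokenizer(x)  # list with one character vector of tokens per text
  for(i in seq_along(tokens_list)) {
    tokens = tokens_list[[i]]
    # vectorized lookup; tokens missing from the hashmap come back as NA
    replacements = lemma_hashmap[[tokens]]
    ind = !is.na(replacements)
    # out-of-vocabulary tokens are left unchanged
    tokens_list[[i]][ind] = replacements[ind]
  }
  tokens_list
}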
texts = c("Abadanowi abadańczykach OutOfVocabulary", 
          "abadańczyk Abadan OutOfVocabulary")
lemma_tokenizer(texts, lemma_hm)

#[[1]]
#[1] "Abadan"          "abadańczyk"      "OutOfVocabulary"
#[[2]]
#[1] "abadańczyk"      "Abadan"          "OutOfVocabulary"
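
As for the error in the question's update, one likely cause (an assumption based only on the calls shown there) is that `lemma_hashmap="lemma_hm"` passes a character string rather than the hashmap object. Indexing a plain string with `[[` and several tokens fails in base R, which the following sketch reproduces without any packages:

```r
# A character string standing in for the hashmap, as in the question's calls.
lemma_hashmap <- "lemma_hm"
tokens <- c("Abadanowi", "abadanka")

# `[[` on an atomic vector accepts a single index only, so this raises an error.
outcome <- tryCatch(lemma_hashmap[[tokens]], error = function(e) e)

print(inherits(outcome, "error"))
print(conditionMessage(outcome))
```

If that is indeed the cause, passing the object itself, i.e. `lemma_hashmap = lemma_hm` without quotes, should avoid this particular error. Whether tm then accepts the list that lemma_tokenizer returns is a separate matter that is not tested here.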