A lemmatizing function using a hash dictionary does not work with tm package in R


Question:

I would like to lemmatize Polish text using a large external dictionary (in the format of the txt variable below). Unfortunately, the popular text mining packages do not offer a Polish option. The answer https://stackoverflow.com/a/45790325/3480717 by @DmitriySelivanov works well with a simple vector of texts. (I have also removed Polish diacritics from both the dictionary and the corpus.)

Unfortunately it does not work with the corpus format generated by tm. Let me paste Dmitriy's code:

library(hashmap)
library(data.table)
txt = 
  "Abadan  Abadanem
  Abadan  Abadanie
  Abadan  Abadanowi
  Abadan  Abadanu
  abadańczyk  abadańczycy
  abadańczyk  abadańczykach
  abadańczyk  abadańczykami
  "
dt = fread(txt, header = F, col.names = c("lemma", "word"))
lemma_hm = hashmap(dt$word, dt$lemma)

lemma_hm[["Abadanu"]]
#"Abadan"


lemma_tokenizer = function(x, lemma_hashmap, 
                           tokenizer = text2vec::word_tokenizer) {
  tokens_list = tokenizer(x)
  for(i in seq_along(tokens_list)) {
    tokens = tokens_list[[i]]
    replacements = lemma_hashmap[[tokens]]
    ind = !is.na(replacements)
    tokens_list[[i]][ind] = replacements[ind]
  }
  tokens_list
}
texts = c("Abadanowi abadańczykach OutOfVocabulary", 
          "abadańczyk Abadan OutOfVocabulary")
lemma_tokenizer(texts, lemma_hm)

#[[1]]
#[1] "Abadan"          "abadańczyk"      "OutOfVocabulary"
#[[2]]
#[1] "abadańczyk"      "Abadan"          "OutOfVocabulary"

Now I would like to apply it to a tm corpus, "docs". Here is an example of the syntax I would use with the tm package on a tm-generated corpus:

docs <- tm_map(docs, function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm"))

Another syntax that I tried:

LemmaTokenizer <- function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm")

docsTDM <-
  DocumentTermMatrix(docs, control = list(wordLengths = c(4, 25), tokenize=LemmaTokenizer))

It throws an error:

 Error in lemma_hashmap[[tokens]] : 
  attempt to select more than one element in vectorIndex 

The function works with a vector of texts but will not work with a tm corpus. Thanks in advance for suggestions (even for using this function with another text mining package if it will not work with tm).

Answer 1:

I see two problems here: 1) your custom function returns a list, while it should return a vector of strings; and 2) you are passing the wrong lemma_hashmap argument.

A quick workaround to fix the first problem is to use paste() and sapply() before returning the function result.

lemma_tokenizer = function(x, lemma_hashmap, 
                           tokenizer = text2vec::word_tokenizer) {
  tokens_list = tokenizer(x)
  for(i in seq_along(tokens_list)) {
    tokens = tokens_list[[i]]
    replacements = lemma_hashmap[[tokens]]
    ind = !is.na(replacements)
    tokens_list[[i]][ind] = replacements[ind]
  }

  # paste together, return a vector
  sapply(tokens_list, (function(i){paste(i, collapse = " ")}))
}

We can run the same example from your post.

texts = c("Abadanowi abadańczykach OutOfVocabulary", 
          "abadańczyk Abadan OutOfVocabulary")
lemma_tokenizer(texts, lemma_hm)
[1] "Abadan abadańczyk OutOfVocabulary" "abadańczyk Abadan OutOfVocabulary"

Now, we can use tm_map. Just make sure to use lemma_hm (i.e., the variable) and not "lemma_hm" (a string) as argument.

docs <- SimpleCorpus(VectorSource(texts))
out <- tm_map(docs, (function(x) {lemma_tokenizer(x, lemma_hashmap=lemma_hm)}))
out[[1]]$content
[1] "Abadan abadańczyk OutOfVocabulary"


Answer 2:

For Polish lemmatization, please refer to this script https://github.com/MarcinKosinski/trigeR5/blob/master/R/lematyzacja.R, which uses this polmorfologik dictionary https://github.com/MarcinKosinski/trigeR5/tree/master/dicts (stop words are also included there).



Answer 3:

Try using quanteda's dictionary() function after creating a dictionary that maps each variant (as a dictionary value) to its lemma (as a dictionary key). Below, it looks up your values and then pastes the tokens back into a text. (If you wanted tokens, you would not need the last paste() operation.)

txt <-  
    "Abadan  Abadanem
Abadan  Abadanie
Abadan  Abadanowi
Abadan  Abadanu
abadańczyk  abadańczycy
abadańczyk  abadańczykach
abadańczyk  abadańczykami"

list_temp <- strsplit(readLines(textConnection(txt)), "\\s+")
list_temp2 <- lapply(list_temp, "[", 2)
names(list_temp2) <- sapply(list_temp, "[", 1)

library("quanteda")
polish_lemma_dict <- dictionary(list_temp2)
# Dictionary object with 7 key entries.
# - Abadan:
#   - abadanem
# - Abadan:
#   - abadanie
# - Abadan:
#   - abadanowi
# - Abadan:
#   - abadanu
# - abadańczyk: 
#   - abadańczycy
# - abadańczyk:
#   - abadańczykach
# - abadańczyk:
#   - abadańczykami

texts <- c("Abadanowi abadańczykach OutOfVocabulary", 
           "abadańczyk Abadan OutOfVocabulary")

The texts can now be converted into tokens, and quanteda's tokens_lookup() function can be used to replace the dictionary values (inflected words) with the dictionary keys (lemmas). In the last step, I have pasted the tokens back together; you can skip this if you want tokens rather than a full text.

require(magrittr)
texts %>%
    tokens() %>%
    tokens_lookup(dictionary = polish_lemma_dict, exclusive = FALSE, capkeys = FALSE) %>%
    as.character() %>%
    paste(collapse = " ")
# [1] "Abadan abadańczyk OutOfVocabulary abadańczyk Abadan OutOfVocabulary"


Answer 4:

Here is the complete, imperfect code in which I used the answer. Credit goes to many people; I have listed all sources at the bottom. It is very rough, I realise, but it does the job for me, i.e. I can use a txt lemma dictionary and my own stop words to classify Polish texts. Thanks to Damiano Fantini, Dmitriy Selivanov and many others.

#----1. Set up. ----
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))


library(readtext)
library(tm)
library(proxy)
library(stringi)
library(stringr)
library(hashmap)
library(data.table)
library(text2vec)

# For reading n-grams
library(RWeka) #(*)
BigramTokenizer <- 
           function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3)) #(*)


#----2. Read data. ----
stopwordsPL <- as.vector(str_split(readLines("polish.stopwords.text",encoding = "UTF-8"), pattern = " ",simplify = T))


docs <- VCorpus(DirSource(pattern="txt"))
titles <- rownames(summary(docs))

docs <- tm_map(docs, removeWords, words=stopwordsPL)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, function(x) stri_trans_general(x, "Latin-ASCII"))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)

# for English texts it would be simpler
# docs <- tm_map(docs, removeWords, stopwords("english")) #can add other words to remove
# docs <- tm_map(docs, stemDocument, "english")

#====3. Lemmatize ====
# # Dictionary from http://www.lexiconista.com/datasets/lemmatization/
# lemmadict_file = "lemmatization-pl.text"
# dt = fread(file= lemmadict_file, header = F, col.names = c("lemma", "word"), data.table=T, encoding="UTF-8")
# # I threw away Polish letters, maybe changing locales may help.
# dt$lemma <- stri_trans_general(dt$lemma, "Latin-ASCII;lower")
# dt$word <- stri_trans_general(dt$word, "Latin-ASCII;lower")
# dt <- unique(dt)
# 
# # Creating hash dictionary
# lemma_hm = hashmap(dt$word, dt$lemma)
# 
# # Test if it works
# lemma_hm[["mnozyl"]]
# # [1] "mnozyc"
# 
# save_hashmap(lemma_hm, file="lemma_hm", overwrite = TRUE, compress = TRUE)

lemma_hm <- load_hashmap(file="lemma_hm")

lemma_tokenizer = function(x, lemma_hashmap, 
                           tokenizer = text2vec::word_tokenizer) {
  tokens_list = tokenizer(x)
  for(i in seq_along(tokens_list)) {
    tokens = tokens_list[[i]]
    replacements = lemma_hashmap[[tokens]]
    ind = !is.na(replacements)
    tokens_list[[i]][ind] = replacements[ind]
  }
  # paste together, return a vector
  sapply(tokens_list, (function(i){paste(i, collapse = " ")}))
}

docs <- tm_map(docs, (function(x) {lemma_tokenizer(x, lemma_hashmap=lemma_hm)}))
docs <- tm_map(docs, PlainTextDocument)

#====4. Create document term matrix====

docsTDM <-
  DocumentTermMatrix(docs, control = list(wordLengths = c(5, 25),tokenize = BigramTokenizer))  #  tokenize=LemmaTokenizer, tokenize = BigramTokenizer (*)


docsTDM$dimnames

#====5. Remove sparse and common words====

docsTDM <- removeSparseTerms(docsTDM, .90)

# https://stackoverflow.com/questions/25905144/removing-overly-common-words-occur-in-more-than-80-of-the-documents-in-r

removeCommonTerms <- function (x, pct) 
{
  stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")), 
            is.numeric(pct), pct > 0, pct < 1)
  m <- if (inherits(x, "DocumentTermMatrix")) 
    t(x)
  else x
  t <- table(m$i) < m$ncol * (pct)
  termIndex <- as.numeric(names(t[t]))
  if (inherits(x, "DocumentTermMatrix")) 
    x[, termIndex]
  else x[termIndex, ]
}


docsTDM <-
  removeCommonTerms(docsTDM, .8) #remove terms that are in >=80% of the documents
docsTDM$dimnames


#====6. Cluster data (hclust). ====


docsdissim <- dist(as.matrix(docsTDM), method = "cosine")

docsdissim2 <- as.matrix(docsdissim)
dim(docsdissim2)

rownames(docsdissim2) <- titles
colnames(docsdissim2) <- titles

h <- hclust(docsdissim, method = "ward.D2")

plot(h, labels = titles, sub = "")

# Library hclust with p-values (pvclust)

library(pvclust)

h_pv <- pvclust(docsdissim2, method.hclust = "ward.D2", method.dist ="correlation")

plot(h_pv)

data.frame(cutree(tree = h_pv$hclust, k = 4))


# pvclust provides two types of p-values: the AU (Approximately Unbiased) p-value and the BP
# (Bootstrap Probability) value. The AU p-value, computed by multiscale bootstrap resampling,
# is a better approximation to an unbiased p-value than the BP value computed by normal
# bootstrap resampling. For a cluster with AU p-value > 0.95 we can assume the cluster exists
# and may be stably observed if we increase the number of observations.
# (http://stat.sys.i.kyoto-u.ac.jp/prog/pvclust/)
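
# Optional: a minimal sketch using pvclust's pvrect()/pvpick() to flag the
# clusters whose AU p-value exceeds 0.95 on the dendrogram above.
plot(h_pv)
pvrect(h_pv, alpha = 0.95, pv = "au")   # draw rectangles around well-supported clusters
pvpick(h_pv, alpha = 0.95, pv = "au")   # list the members of those clusters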

#==== Literature:====
# Original article:
# http://www.rexamine.com/2014/06/text-mining-in-r-automatic-categorization-of-wikipedia-articles/

# Updates to make it work after some functions became obsolete:
# https://stackoverflow.com/questions/34423823/r-automatic-categorization-of-wikipedia-articles
# https://stackoverflow.com/questions/34372166/error-using-termdocumentmatrix-and-dist-functions-in-r
#
# Based on that:
# http://brazenly.blogspot.co.uk/2015/02/r-categorization-clustering-of.html
#
# Sparse terms:
# https://stackoverflow.com/questions/28763389/how-does-the-removesparseterms-in-r-work

# Lemmatizing function:
# https://stackoverflow.com/questions/46122591/a-lemmatizing-function-using-a-hash-dictionary-does-not-work-with-tm-package-in
# https://stackoverflow.com/questions/45762559/lemmatization-using-txt-file-with-lemmes-in-r/45790325#45790325