Keeping Turkish characters with the text mining pa

2019-08-14 03:54发布

问题:

let me start this by saying that I'm still pretty much a beginner with R. Currently I am trying out basic text mining techniques for Turkish texts, using the tm package. I have, however, encountered a problem with the display of Turkish characters in R.

Here's what I did:

docs <- VCorpus(DirSource("DIRECTORY", encoding = "UTF-8"), readerControl = list(language = "tur"))
writeLines(as.character(docs), con="documents.txt")

My thinking being, that setting the language to Turkish and the encoding to UTF-8 (which is the original encoding of the text files) should make the display of the Turkish characters İ, ı, ğ, Ğ, ş and Ş possible. Instead the output converts these charaters to I, i, g, G, s and S respectively and saves it to an ANSI-Encoding, which cannot display these characters.

writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"))

also saves the file without the characters in ANSI encoding.

This seems to not only be an issue with the output file.

writeLines(as.character(docs[[1]])

for example yields a line that should read "Okul ve cami açılışları umutları artırdı" but instead reads "Okul ve cami açilislari umutlari artirdi"

After reading this: UTF-8 file output in R I also tried the following code:

writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"), useBytes=T)

which didn't change the results.

All of this is on Windows 7 with both the most recent version of R and RStudio.

Is there a way to fix this? I am probably missing something obvious, but any help would be appreciated.