Let me start by saying that I'm still pretty much a beginner with R. I am currently trying out basic text mining techniques for Turkish texts using the tm package, but I have run into a problem with the display of Turkish characters in R.
Here's what I did:
docs <- VCorpus(DirSource("DIRECTORY", encoding = "UTF-8"), readerControl = list(language = "tur"))
writeLines(as.character(docs), con="documents.txt")
My thinking was that setting the language to Turkish and the encoding to UTF-8 (the original encoding of the text files) should preserve the Turkish characters İ, ı, ğ, Ğ, ş and Ş. Instead, the output converts these characters to I, i, g, G, s and S respectively and saves the file in ANSI encoding, which cannot represent these characters.
writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"))
also saves the file in ANSI encoding, with the Turkish characters stripped.
This seems to not only be an issue with the output file.
writeLines(as.character(docs[[1]]))
for example yields a line that should read "Okul ve cami açılışları umutları artırdı" but instead reads "Okul ve cami açilislari umutlari artirdi".
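To narrow down whether the characters are already lost when tm reads the files or only later when they are displayed or written, this is the kind of comparison I plan to run (the file name here is just a placeholder for one of my actual documents):

raw_text <- readLines("DIRECTORY/document1.txt", encoding = "UTF-8")  # read one file directly, bypassing tm
head(raw_text)                        # does the raw text still contain ı, ş, ğ?
head(content(docs[[1]]))              # what tm actually stores for the same document
Encoding(head(raw_text))              # declared encoding of the raw text
Encoding(head(content(docs[[1]])))    # declared encoding of the corpus content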
After reading this: "UTF-8 file output in R", I also tried the following code:
writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"), useBytes=T)
which didn't change the results.
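Would an explicit UTF-8 file connection be the right direction instead? This is just a sketch of what I mean; I haven't confirmed that it changes anything:

out <- file("documents.txt", open = "w", encoding = "UTF-8")  # declare UTF-8 on the connection itself
writeLines(as.character(docs), con = out)
close(out)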
All of this is on Windows 7 with the most recent versions of R and RStudio.
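In case the locale is relevant, I can also add the output of these commands to the question:

Sys.getlocale("LC_CTYPE")   # the Windows locale R is running under
sessionInfo()               # R version and attached packages, including tm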
Is there a way to fix this? I am probably missing something obvious, but any help would be appreciated.