Arabic text mining using R [closed]

Posted 2019-03-21 14:45

I am a new user and would like some help with my work in R. I am doing Arabic text mining and would appreciate help from anyone with experience in this field. So far I have failed to normalize the Arabic text, and R doesn't even print Arabic characters in the console. I am stuck, and I don't know whether I should switch tools, for example by doing the mining in Weka, or take some other approach. Has anyone achieved anything in mining Arabic text with R, and can you advise me?
By the way, I am analyzing a data set of Arabic tweets. It took me one month to fetch the data, and I don't know how long pre-processing the text will take.

1 answer
chillily
#2 · 2019-03-21 15:07

I don't have much experience in this area, but I do not have problems with Arabic characters when I try this:

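```{r}
library(tm)
library(tm.plugin.webmining)
library(SnowballC)

# Build a corpus from Google News results for an Arabic search term
corpus <- WebCorpus(GoogleNewsSource("سلام"))
corpus
inspect(corpus)

# Term-document matrix of the retrieved documents
tdm <- TermDocumentMatrix(corpus)
```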

Make sure to install the proper fonts on your OS and IDE.
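
If Arabic still does not print in the console, it may be worth looking at the session's locale as well as the fonts. Below is a minimal base-R sketch, not a guaranteed fix. The variable name `word` is only for illustration, and what the locale calls report depends on your system:

```{r}
# Show the character-type locale and whether the session uses UTF-8
Sys.getlocale("LC_CTYPE")
l10n_info()[["UTF-8"]]

# The same word as above, written with Unicode escapes so that the
# script itself stays ASCII and does not depend on the file encoding
word <- "\u0633\u0644\u0627\u0645"

Encoding(word)  # strings built from \u escapes are marked as UTF-8
nchar(word)     # 4 characters
print(word)     # how this renders depends on the console, locale and font
```

If `l10n_info()` reports a non-UTF-8 session (common on older Windows setups), garbled console output does not necessarily mean the data itself is damaged.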

```{r}
library(tm)

y <- dget("file") # load the data extracted from MongoDB with the rmongodb package
a <- y$tweet_text # keep only the text of the tweets in the dataset
text_df <- data.frame(a, stringsAsFactors = FALSE) # store the text in a data frame
# Note: recent versions of tm expect "doc_id" and "text" columns in the data frame
myCorpus_df <- Corpus(DataframeSource(text_df)) # build a corpus from the data frame
```

On OS X, Arabic characters are displayed correctly:

```{r}
str(myCorpus_df[1:2])
```

List of 2
 $ 1:List of 2
  ..$ content: chr "The CHRONICLE EYE  Ahrar al#Sham is clearly fighting #ISIS where its men storm some #Manbij buildings #Aleppo "
  ..$ meta   :List of 7
  .. ..$ author       : chr(0) 
  .. ..$ datetimestamp: POSIXlt[1:1], format: "2014-07-03 22:42:18"
  .. ..$ description  : chr(0) 
  .. ..$ heading      : chr(0) 
  .. ..$ id           : chr "1"
  .. ..$ language     : chr "en"
  .. ..$ origin       : chr(0) 
  .. ..- attr(*, "class")= chr "TextDocumentMeta"
  ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"


 $ 2:List of 2
  ..$ content: chr "RT @######## جبهة النصرة مهاجرينها وأنصارها  مقراتها مكان آمن لكل من يخشى على نفسه الآذى "
  ..$ meta   :List of 7
  .. ..$ author       : chr(0) 
  .. ..$ datetimestamp: POSIXlt[1:1], format: "2014-07-03 22:42:18"
  .. ..$ description  : chr(0) 
  .. ..$ heading      : chr(0) 
  .. ..$ id           : chr "2"
  .. ..$ language     : chr "en"
  .. ..$ origin       : chr(0) 
  .. ..- attr(*, "class")= chr "TextDocumentMeta"
  ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
 - attr(*, "class")= chr [1:2] "VCorpus" "Corpus"

When I check the encoding of an Arabic word on both operating systems (OS X and Windows 7), it appears to be correctly encoded:

```{r}
Encoding("لمياه_و_الإصحا")
```

[1] "UTF-8"

This may also be helpful: Reading arabic data text in R and plot()
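
As for the normalization mentioned in the question, once the text is confirmed to be UTF-8 a light pass can be tried with plain regular expressions in base R. This is only a rough sketch and not part of the original answer: `normalize_arabic` is a hypothetical helper, and the rules chosen here (dropping short-vowel marks and tatweel, unifying alef variants) are assumptions about what "normalize" should mean for your tweets.

```{r}
# Hypothetical helper: light normalization of Arabic text (an assumption,
# adjust the rules to your own data)
normalize_arabic <- function(x) {
  x <- enc2utf8(x)
  # remove diacritics (tanween, short vowels, shadda, sukun): U+064B..U+0652
  x <- gsub("[\u064B-\u0652]", "", x, perl = TRUE)
  # remove tatweel (kashida), the elongation character U+0640
  x <- gsub("\u0640", "", x, fixed = TRUE)
  # map alef with hamza above/below and alef with madda to bare alef
  x <- gsub("[\u0623\u0625\u0622]", "\u0627", x, perl = TRUE)
  x
}

# Inputs written with Unicode escapes: a fully vowelled word,
# a word with tatweel, and a word starting with alef-hamza
samples <- c(
  "\u0627\u0644\u0633\u064E\u0651\u0644\u064E\u0627\u0645\u064F",
  "\u0633\u0640\u0644\u0627\u0645",
  "\u0623\u062D\u0645\u062F"
)
cleaned <- normalize_arabic(samples)
print(cleaned)
```

The same function could be applied to the tweet text before building the corpus, e.g. on the `a` vector above.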
