Apply Encoding to Entire Data.Table

Posted 2019-04-28 09:24

I have the following file read into a data.table like so:

raw <- fread("avito_train.tsv", nrows=1000)

Then, if I change the encoding of a particular column and row like this:

Encoding(raw$title[2]) <- "UTF-8"

It works perfectly.

But, how can I apply the encoding to all columns, and all rows?

I checked the fread documentation but there doesn't appear to be any encoding option. Also, I tried Encoding(raw) but that gives me an error (a character vector argument expected).

Edit: This article details more information on foreign text in RStudio on Windows http://quantifyingmemory.blogspot.com/2013/01/r-and-foreign-characters.html

3 Answers
啃猪蹄的小仙女
#2 · 2019-04-28 10:09

I tried this:

Encoding(raw$title) <- "UTF-8"

This sets the encoding for the entire column, which will work fine for now. I'm still open to other options that would do this automatically on import.

Bombasti
#3 · 2019-04-28 10:19

Sadly, there does not seem to be a way of doing this while importing (yet) with fread.

While you seem to have figured it out already, I'll post a way of setting the encoding of the entire dt after import.

One way to do it is to loop over all the character columns of the data.table:

for (name in colnames(raw[, sapply(raw, is.character), with = FALSE])) {
  Encoding(raw[[name]]) <- "UTF-8"
}

The colnames(...) call first selects the character columns (with = FALSE is needed so that data.table treats the logical vector as a column selector) and then returns their names for the loop. In short, this applies what you already found to work, but across all character columns.

The column names themselves may also need their encoding set, including those of the integer and numeric columns that the loop above skips. No loop is needed for that, since the names can be marked as one vector:

Encoding(colnames(raw)) <- "UTF-8"
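The two steps above could be wrapped into one helper. This is only a sketch: mark_utf8 is a hypothetical name, not a data.table function, and it assumes the text really is UTF-8 bytes that R has left marked as "unknown". It uses set() and setnames() so the table is updated by reference rather than copied.

```r
library(data.table)

# Hypothetical helper (not part of data.table): marks every character
# column and the column names as UTF-8, modifying `dt` by reference.
mark_utf8 <- function(dt) {
  chr_cols <- names(dt)[vapply(dt, is.character, logical(1))]
  for (col in chr_cols) {
    # `Encoding<-` only changes the declared encoding, not the bytes
    set(dt, j = col, value = `Encoding<-`(dt[[col]], "UTF-8"))
  }
  setnames(dt, `Encoding<-`(names(dt), "UTF-8"))
  invisible(dt)
}

# Toy data: the UTF-8 bytes of "é", initially with unknown encoding
x <- rawToChar(as.raw(c(0xc3, 0xa9)))
dt <- data.table(id = 1:2, title = c(x, "abc"))
mark_utf8(dt)
Encoding(dt$title)
```

Pure ASCII strings such as "abc" always report "unknown", so only values containing non-ASCII bytes will show the new mark.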
女痞
#4 · 2019-04-28 10:27

This has been recently implemented in the devel version of data.table, v1.9.5. This'll be soon pushed to CRAN (as v1.9.6). Could you please give the devel version a try to see if that solves this for you?

fread() has gained an encoding argument, specifically for issues on Windows.

require(data.table) # v1.9.5+
fread("file.txt", encoding="UTF-8")

should solve the issue. There's no file for me to test. If it doesn't solve your issue, please file an issue on the project page, with a reproducible example/file.
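In the absence of the original file, here is a small self-contained check of the argument. It is a sketch on made-up data: the temporary file and its contents are assumptions for illustration, and it needs a data.table version that has the encoding argument.

```r
library(data.table)

# Write a tiny tab-separated file containing UTF-8 text (illustrative data)
tmp <- tempfile(fileext = ".tsv")
con <- file(tmp, open = "wb")
writeLines(enc2utf8(c("id\ttitle", "1\tcaf\u00e9")), con, useBytes = TRUE)
close(con)

# Ask fread to mark the strings it reads as UTF-8
dt <- fread(tmp, encoding = "UTF-8")
Encoding(dt$title)

unlink(tmp)
```

If Encoding() reports "UTF-8" for the title column here, the same call should remove the need for any post-import loop on the real file.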
