Importing an Excel file with Greek characters into

I am having some trouble importing the following file: http://www.kuleuven.be/bio/ento/temp/test.xlsx into R in the correct encoding. In particular,

library("xlsx")
read.xlsx("test.xlsx",1,header=F,colClasses=c("character"),encoding="UTF-8")

gives me

                                             X1
1                                     a-cadinol
2                                  a-calacorene
3                       a-caryophyllene alcohol
4                                   a-curcumene
5                                      a-elemol
6                                   a-muurolene
7                           a-terpineol acetate
8  ß-4-dimethyl-3-cyclohexane-1-ethanol acetate
9                                  ß-bisabolene
10                                  ß-bisabolol
11                                 ß-bourbonene
12                      ß-caryophyllene alcohol
13                                ß-cyclocitral
14                                   ß-farnesol
15                                   ß-selinene
16                         ß-sesquiphellandrene
17                            <U+03B3>-cadinene
18  <U+03B3>-Carboethoxy-<U+03B3>-butyrolactone
19        <U+03B3>-ethyl-<U+03B3>-butyrolactone
20                            <U+03B3>-eudesmol
21                           <U+03B3>-muurolene
22                         <U+03B3>-nonalactone
23                         <U+03B3>-octalactone
24                            <U+03B3>-selinene
25                       <U+03B3>-undecalactone
26                                   d-cadinene
27                                    d-cadinol
28                                  d-muurolene
29                              d-undecalactone

but the a-, <U+03B3>- and d- should be alpha-, gamma- and delta-

Any thoughts on how I could import my file in the correct encoding?

I am working on Windows, and iconvlist() gives me

  [1] "437"                     "850"                     "852"                     "855"                     "857"                    
  [6] "860"                     "861"                     "862"                     "863"                     "865"                    
 [11] "866"                     "869"                     "ANSI_X3.4-1968"          "ANSI_X3.4-1986"          "ASCII"                  
 [16] "ASMO-708"                "BIG-5"                   "BIG-FIVE"                "big5"                    "BIG5"                   
 [21] "big5-hkscs"              "BIG5-HKSCS"              "big5hkscs"               "BIG5HKSCS"               "CP-GR"                  
 [26] "CP-IS"                   "cp1025"                  "CP1125"                  "CP1133"                  "CP1200"                 
 [31] "CP12000"                 "CP12001"                 "CP1201"                  "CP1250"                  "CP1251"                 
 [36] "CP1252"                  "CP1253"                  "CP1254"                  "CP1255"                  "CP1256"                 
 [41] "CP1257"                  "CP1258"                  "CP1361"                  "CP154"                   "CP367"                  
 [46] "CP437"                   "CP50221"                 "CP51932"                 "CP65001"                 "CP737"                  
 [51] "CP775"                   "CP819"                   "CP850"                   "CP852"                   "CP853"                  
 [56] "CP855"                   "CP857"                   "CP858"                   "CP860"                   "CP861"                  
 [61] "CP862"                   "CP863"                   "CP864"                   "CP865"                   "cp866"                  
 [66] "CP866"                   "CP869"                   "CP874"                   "cp875"                   "CP932"                  
 [71] "CP936"                   "CP949"                   "CP950"                   "CSASCII"                 "CSIBM855"               
 [76] "CSIBM857"                "CSIBM860"                "CSIBM861"                "CSIBM863"                "CSIBM864"               
 [81] "CSIBM865"                "CSIBM866"                "CSIBM869"                "csISO2022JP"             "CSISOLATIN1"            
 [86] "CSPC775BALTIC"           "CSPC850MULTILINGUAL"     "CSPC862LATINHEBREW"      "CSPC8CODEPAGE437"        "CSPCP852"               
 [91] "CSPTCP154"               "CSWINDOWS31J"            "CYRILLIC-ASIAN"          "DOS-720"                 "DOS-862"                
 [96] "EUC-CN"                  "euc-jp"                  "euc-kr"                  "EUC-KR"                  "EUCCN"                  
[101] "eucjp"                   "euckr"                   "GB18030"                 "gb2312"                  "GBK"                    
[106] "hz-gb-2312"              "IBM-CP1133"              "IBM-Thai"                "IBM00858"                "IBM00924"               
[111] "IBM01047"                "IBM01140"                "IBM01141"                "IBM01142"                "IBM01143"               
[116] "IBM01144"                "IBM01145"                "IBM01146"                "IBM01147"                "IBM01148"               
[121] "IBM01149"                "IBM037"                  "IBM1026"                 "IBM273"                  "IBM277"                 
[126] "IBM278"                  "IBM280"                  "IBM284"                  "IBM285"                  "IBM290"                 
[131] "IBM297"                  "IBM367"                  "IBM420"                  "IBM423"                  "IBM424"                 
[136] "IBM437"                  "IBM437"                  "IBM500"                  "ibm737"                  "ibm775"                 
[141] "IBM775"                  "IBM819"                  "ibm850"                  "IBM850"                  "ibm852"                 
[146] "IBM852"                  "IBM855"                  "IBM855"                  "ibm857"                  "IBM857"                 
[151] "IBM860"                  "IBM860"                  "ibm861"                  "IBM861"                  "IBM862"                 
[156] "IBM863"                  "IBM863"                  "IBM864"                  "IBM864"                  "IBM865"                 
[161] "IBM865"                  "IBM866"                  "ibm869"                  "IBM869"                  "IBM870"                 
[166] "IBM871"                  "IBM880"                  "IBM905"                  "iso-2022-jp"             "iso-2022-jp"            
[171] "ISO-2022-JP"             "ISO-2022-JP-MS"          "iso-2022-kr"             "ISO-8859-1"              "iso-8859-13"            
[176] "iso-8859-15"             "iso-8859-2"              "iso-8859-3"              "iso-8859-4"              "iso-8859-5"             
[181] "iso-8859-6"              "iso-8859-7"              "iso-8859-8"              "iso-8859-8-i"            "iso-8859-9"             
[186] "ISO-IR-100"              "ISO-IR-6"                "ISO_646.IRV:1991"        "ISO_8859-1"              "ISO_8859-1:1987"        
[191] "ISO2022-JP"              "ISO2022-JP-MS"           "iso2022-kr"              "ISO646-US"               "iso8859-1"              
[196] "ISO8859-1"               "iso8859-13"              "iso8859-15"              "iso8859-2"               "iso8859-3"              
[201] "iso8859-4"               "iso8859-5"               "iso8859-6"               "iso8859-7"               "iso8859-8"              
[206] "iso8859-8-i"             "iso8859-9"               "Johab"                   "JOHAB"                   "koi8-r"                 
[211] "koi8-u"                  "ks_c_5601-1987"          "L1"                      "latin-9"                 "LATIN1"                 
[216] "latin2"                  "latin3"                  "latin4"                  "latin5"                  "latin7"                 
[221] "latin9"                  "mac"                     "mac-centraleurope"       "mac-is"                  "macarabic"              
[226] "maccentraleurope"        "maccroatian"             "maccyrillic"             "macgreek"                "machebrew"              
[231] "maciceland"              "macintosh"               "macis"                   "macroman"                "macromania"             
[236] "macthai"                 "macturkish"              "macukraine"              "macukrainian"            "MS-ANSI"                
[241] "MS-ARAB"                 "MS-CYRL"                 "MS-EE"                   "MS-GREEK"                "MS-HEBR"                
[246] "MS-TURK"                 "MS50221"                 "MS51932"                 "MS932"                   "MS936"                  
[251] "PT154"                   "PTCP154"                 "SHIFFT_JIS"              "SHIFFT_JIS-MS"           "shift-jis"              
[256] "shift_jis"               "SJIS"                    "SJIS-MS"                 "SJIS-OPEN"               "SJIS-WIN"               
[261] "UCS-2"                   "UCS-2BE"                 "UCS-2LE"                 "UCS-4"                   "UCS-4BE"                
[266] "UCS-4BE"                 "UCS-4LE"                 "UCS-4LE"                 "UCS2"                    "UCS2BE"                 
[271] "UCS2LE"                  "UCS4"                    "UCS4BE"                  "UCS4LE"                  "UHC"                    
[276] "unicodeFFFE"             "US"                      "US-ASCII"                "UTF-16"                  "UTF-16BE"               
[281] "UTF-16LE"                "UTF-32"                  "UTF-32BE"                "UTF-32LE"                "UTF-8"                  
[286] "UTF16"                   "UTF16BE"                 "UTF16LE"                 "UTF32"                   "UTF32BE"                
[291] "UTF32LE"                 "UTF8"                    "WINBALTRIM"              "windows-1250"            "windows-1251"           
[296] "windows-1252"            "windows-1253"            "windows-1254"            "windows-1255"            "windows-1256"           
[301] "windows-1257"            "windows-1258"            "WINDOWS-31J"             "WINDOWS-50221"           "WINDOWS-51932"          
[306] "windows-874"             "WINDOWS-932"             "WINDOWS-936"             "x-Chinese_CNS"           "x-cp20001"              
[311] "x-cp20003"               "x-cp20004"               "x-cp20005"               "x-cp20261"               "x-cp20269"              
[316] "x-cp20936"               "x-cp20949"               "x-cp50227"               "x-EBCDIC-KoreanExtended" "x-Europa"               
[321] "x-IA5"                   "x-IA5-German"            "x-IA5-Norwegian"         "x-IA5-Swedish"           "x-iscii-as"             
[326] "x-iscii-be"              "x-iscii-de"              "x-iscii-gu"              "x-iscii-ka"              "x-iscii-ma"             
[331] "x-iscii-or"              "x-iscii-pa"              "x-iscii-ta"              "x-iscii-te"              "x-mac-arabic"           
[336] "x-mac-ce"                "x-mac-chinesesimp"       "x-mac-chinesetrad"       "x-mac-croatian"          "x-mac-cyrillic"         
[341] "x-mac-greek"             "x-mac-hebrew"            "x-mac-icelandic"         "x-mac-japanese"          "x-mac-korean"           
[346] "x-mac-romanian"          "x-mac-thai"              "x-mac-turkish"           "x-mac-ukrainian"         "x_Chinese-Eten"

I tried with many of these, to no avail... Unfortunately, I also don't know what encoding Excel saved my file in...

Also, is there any easy function in R that would allow me to convert all Greek alpha, beta, gamma and delta's (in original encoding) to "alpha", "beta", "gamma" and "delta" (ie written out in full)? Or to do the reverse, ie convert "alpha", "beta", "gamma" etc written out in full to single Greek characters?

EDIT: regarding my last question I tried

togreek=function(compname) {
  n=as.character(compname,encoding="UTF-8")
  n=gsub("alpha","\u03B1",n)
  n=gsub("beta","\u03B2",n)
  n=gsub("gamma","\u03B3",n)
  n=gsub("delta","\u03B4",n)
  n=gsub("epsilon","\u03B5",n)
  n
}

tolatin=function(compname) {
  n=as.character(compname,encoding="UTF-8")
  n=gsub("\u03B1","alpha",n)
  n=gsub("\u03B2","beta",n)
  n=gsub("\u03B3","gamma",n)
  n=gsub("\u03B4","delta",n)
  n=gsub("\u03B5","epsilon",n)
  n
}

The function tolatin seems to work:

library("xlsx")
test=read.xlsx("test.xlsx",1,header=F,colClasses=c("character"),encoding="UTF-8")
tolatin(test$X1)
 [1] "alpha-cadinol"                                   "alpha-calacorene"                                "alpha-caryophyllene alcohol"                    
 [4] "alpha-curcumene"                                 "alpha-elemol"                                    "alpha-muurolene"                                
 [7] "alpha-terpineol acetate"                         "beta-4-dimethyl-3-cyclohexane-1-ethanol acetate" "beta-bisabolene"                                
[10] "beta-bisabolol"                                  "beta-bourbonene"                                 "beta-caryophyllene alcohol"                     
[13] "beta-cyclocitral"                                "beta-farnesol"                                   "beta-selinene"                                  
[16] "beta-sesquiphellandrene"                         "gamma-cadinene"                                  "gamma-Carboethoxy-gamma-butyrolactone"          
[19] "gamma-ethyl-gamma-butyrolactone"                 "gamma-eudesmol"                                  "gamma-muurolene"                                
[22] "gamma-nonalactone"                               "gamma-octalactone"                               "gamma-selinene"                                 
[25] "gamma-undecalactone"                             "delta-cadinene"                                  "delta-cadinol"                                  
[28] "delta-muurolene"                                 "delta-undecalactone"

But if I then convert back to Greek characters I again run into problems:

togreek(tolatin(test$X1))

 [1] "Î±-cadinol"                                   "Î±-calacorene"                                "Î±-caryophyllene alcohol"                    
 [4] "Î±-curcumene"                                 "Î±-elemol"                                    "Î±-muurolene"                                
 [7] "Î±-terpineol acetate"                         "ß-4-dimethyl-3-cyclohexane-1-ethanol acetate" "ß-bisabolene"                                
[10] "ß-bisabolol"                                  "ß-bourbonene"                                 "ß-caryophyllene alcohol"                     
[13] "ß-cyclocitral"                                "ß-farnesol"                                   "ß-selinene"                                  
[16] "ß-sesquiphellandrene"                         "<U+03B3>-cadinene"                            "<U+03B3>-Carboethoxy-<U+03B3>-butyrolactone" 
[19] "<U+03B3>-ethyl-<U+03B3>-butyrolactone"        "<U+03B3>-eudesmol"                            "<U+03B3>-muurolene"                          
[22] "<U+03B3>-nonalactone"                         "<U+03B3>-octalactone"                         "<U+03B3>-selinene"                           
[25] "<U+03B3>-undecalactone"                       "d-cadinene"                                   "d-cadinol"                                   
[28] "d-muurolene"                                  "d-undecalactone"

Any thoughts what I am doing wrong?

标签： r excel import character-encoding iconv

2条回答

Evening l夕情丶

2楼-- · 2019-01-25 15:53

Try this:

library(stringi)

?stringi

Apply this to see your file codification

stri_enc_detect("path-to-your-file/your-file.csv", filter_angle_brackets = T)

stri_enc_detect2("path-to-your-file/your-file.csv", locale = NULL)

The first one works for me.

Then apply the result in the read.csv(), as the example bellow:

    df <- read.csv("path-to-your-file/your-file.csv",header = TRUE, sep = ";", 
quote = "\"", na.strings = "", dec = ".", skip = 2, check.names = T, 
fileEncoding = "YOUR RESULT OF stri_enc_detect", encoding = "UTF-8")

0人赞添加讨论(0) 举报

老娘就宠你

3楼-- · 2019-01-25 16:14

Try this:
Sys.setlocale(category = "LC_ALL", locale = "Greek")

0人赞添加讨论(0) 举报

Importing an Excel file with Greek characters into

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间