WinXP-x32, R-2.13.0
Dear list,
I have a problem that (I think) relates to the interaction between Windows and R.
I am trying to scrape a table with data on the Hawai'ian Islands. This is my R code:
library(XML)
u <- "http://en.wikipedia.org/wiki/Hawaii"
tables <- readHTMLTable(u)
Islands <- tables[[5]]
The output is (first set of columns):
> Islands
  Island Nickname Location
1 Hawaiʻi[7] The Big Island 19°34′N 155°30′W / 19.567°N 155.5°W / 19.567; -155.5
2 Maui[8] The Valley Isle 20°48′N 156°20′W / 20.8°N 156.333°W / 20.8; -156.333
3 KahoÊ»olawe[9] The Target Isle 20°33′N 156°36′W / 20.55°N 156.6°W / 20.55; -156.6
4 LÄnaÊ»i[10] The Pineapple Isle 20°50′N 156°56′W / 20.833°N 156.933°W / 20.833; -156.933
5 MolokaÊ»i[11] The Friendly Isle 21°08′N 157°02′W / 21.133°N 157.033°W / 21.133; -157.033
6 OÊ»ahu[12] The Gathering Place 21°28′N 157°59′W / 21.467°N 157.983°W / 21.467; -157.983
7 KauaÊ»i[13] The Garden Isle 22°05′N 159°30′W / 22.083°N 159.5°W / 22.083; -159.5
8 NiÊ»ihau[14] The Forbidden Isle 21°54′N 160°10′W / 21.9°N 160.167°W / 21.9; -160.167
As you can see, there are "weird" characters in there. I have also tried readHTMLTable(u, encoding = "UTF-16") and readHTMLTable(u, encoding = "UTF-8"), but that didn't help.
It seems to me that there may be an issue with the interaction of the Windows settings of the character set and R.
sessionInfo()
gives
> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252 LC_MONETARY=Dutch_Netherlands.1252
[4] LC_NUMERIC=C LC_TIME=Dutch_Netherlands.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] XML_3.2-0.2
I have also attempted to let R use another setting by entering Sys.setlocale("LC_ALL", "en_US.UTF-8"), but this yields the response:
> Sys.setlocale("LC_ALL", "en_US.UTF-8")
[1] ""
Warning message:
In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
OS reports request to set locale to "en_US.UTF-8" cannot be honored
In addition, I have attempted to make the change directly from the Windows command prompt, using chcp 65001 and variations of that, but that didn't change anything.
I noticed from searching the web that others have the issue as well, but I have not been able to find a solution. It looks like this is an issue of how Windows and R interact. Unfortunately, all three computers at my disposal have this problem. It occurs both under WinXP-x32 and under Win7-x86.
Is there a way to make R override the windows settings or can the issue be solved otherwise? I have also tried other websites, and the issue occurs every time when there is an é, ü, ä, î, et cetera in the text-to-be-scraped.
Thank you, Roger
Not quite an answer:
If you look at the Wikipedia page and change the encoding in your browser (in IE, View -> Encoding; in Firefox, View -> Character Encoding) to Western (ISO-8859-1) or Western (Windows-1252), then you see the silly characters. That ought to mean that you can use iconv to change the encoding and fix your problems. Unfortunately, it doesn't work. It may be possible to get the correct text by using a different conversion (iconvlist() shows all the possibilities). It is also possible to simply strip out the offending characters, though this isn't ideal.
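For illustration, here is a small sketch of the re-declaration and stripping ideas mentioned above. It works on a hard-coded byte sequence (the UTF-8 bytes of one island name) rather than on the scraped table, so whether the same steps repair the real readHTMLTable output on Windows is an assumption and not something confirmed in this thread.

```r
# UTF-8 bytes for "Kaho<U+02BB>olawe"; U+02BB (the okina) is encoded as CA BB.
bytes <- as.raw(c(0x4b, 0x61, 0x68, 0x6f, 0xca, 0xbb,
                  0x6f, 0x6c, 0x61, 0x77, 0x65))
s <- rawToChar(bytes)

# Declared as latin1, the two bytes CA BB are read as two separate
# characters (U+00CA and U+00BB), which is the garbling seen in the table.
wrong <- s
Encoding(wrong) <- "latin1"

# Declared as UTF-8, the same bytes are read as the single character U+02BB.
right <- s
Encoding(right) <- "UTF-8"

# Fallback: drop everything that cannot be represented in ASCII.
stripped <- iconv(right, from = "UTF-8", to = "ASCII", sub = "")

print(nchar(wrong))   # one character more than the correctly declared string
print(nchar(right))
print(stripped)
```

If a data frame column has the same problem, the Encoding(x) <- "UTF-8" step would be applied to that column (after as.character() if it is a factor). Note that a Windows-1252 console may still be unable to display U+02BB even when the string itself is stored correctly.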
I was unable to replicate the error; however, looking at the help files is useful.
On Windows you should use locale names like "English" or "Dutch_Netherlands.1252" to change these settings.
I tried to replicate your state.
However, I do not get the funny characters in the console; in my own locale the ʻ was marked as , but all functionality remained.
These funny characters can be read without trouble and found in the table.
If you still have problems, the cause must lie elsewhere. However, to change the locale under Windows you have to use different names than on Linux or OS X (see your own locale info for an example). On Windows, "Dutch" is probably enough.
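To see which locale names a given machine accepts, something like the following sketch may help. The helper name try_locale and the list of candidate names are made up for this example; which names succeed depends entirely on the operating system, so no particular result is implied.

```r
# Hypothetical helper: try a locale name for LC_CTYPE, report whether the OS
# accepted it, and restore the previous setting afterwards.
try_locale <- function(name) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old))
  # Sys.setlocale() returns "" (with a warning) when the request is refused.
  res <- suppressWarnings(Sys.setlocale("LC_CTYPE", name))
  !identical(res, "")
}

# Windows-style names, a Unix-style name, and the always-available "C" locale.
candidates <- c("Dutch", "Dutch_Netherlands.1252", "English",
                "en_US.UTF-8", "C")
accepted <- vapply(candidates, try_locale, logical(1))
print(accepted)
```

A name reported as FALSE here is one for which Sys.setlocale() would give the "cannot be honored" warning shown in the question.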