I'm importing xls
files using gdata
. I am converting date columns using as.Date
to convert the date
As per the manual for as.Date
, the date origin is platform dependent, and so I am determining which origin to use accordingly
.origin <- ifelse(Sys.info()[['sysname']] == "Windows", "1899-12-30", "1904-01-01")
as.Date(myData$Date, origin=.origin)
However, I'm wondering if I should be considering the platform where the file is being read or the platform where it was written?
For what it's worth, I am currently testing the code on a linux box with no excel, and the correct Dates are produced by using origin="1904-01-01"
Quoting `?as.Date'
## date given as number of days since 1900-01-01 (a date in 1989)
as.Date(32768, origin = "1900-01-01")
## Excel is said to use 1900-01-01 as day 1 (Windows default) or
## 1904-01-01 as day 0 (Mac default), but this is complicated by Excel
## treating 1900 as a leap year.
## So for dates (post-1901) from Windows Excel
as.Date(35981, origin = "1899-12-30") # 1998-07-05
## and Mac Excel
as.Date(34519, origin = "1904-01-01") # 1998-07-05
## (these values come from http://support.microsoft.com/kb/214330)
You could try out the (extremely) new exell package: https://github.com/hadley/exell. It loads excel dates into POSIXct, correctly choosing the origin based on whether the file was written by Windows or Mac Excel.
Yes, you should be considering where the file was written. Excel-Windows appears capable of distinguishing Mac-written dates from Win-written dates, but you are getting evidence that these are Mac-originated .xls files.
The safest method would be to work within the version of Excel on which the data was entered and to use the format menu to bring up a dialog box from which you choose as-Date and a custom format of yyyy-mm-dd. Then save as a csv file and you will be able to import into R with the colClasses vector "Date" in the proper column position. But that sounds as though it is an option not available.
I suppose it doesn't apply to you on a linux box so this is just a Mac-whine: The gdata-package gives deprecation warnings and then fails to install the XLSX support files on R 3.0.0 with the ordinary Perl 5.8 installation in '/opt/local/bin/perl'. This despite 'gdata::findPerl` being able to find it successfully.
At this point I think the question should be redirected to inquiring whether you could coax gdata functions into inspecting the properties of the files. After looking at the codebase for xls reading, I rather doubt it, since do not see any mention of inspecting for different xls versions.
Near the end of a blank xls file created with a Mac version of Excel, looking with a text editor I see:
Worksheets˛ˇˇˇˇˇ ¿F$Microsoft Excel 97 - 2004 Worksheet˛ˇˇˇ8FIBExcel.Sheet.8˛ˇ
‡ÖüÚ˘Oh´ë+'≥Ÿ0îHPhħ
∞ºƒ'David WinsemiusDavid WinsemiusMicrosoft Macintosh Excel@ê˚á!Ë+Œ@ê'å-Ë+ŒG»˛ˇˇˇPICT¿Kġ
The other difference was that the Windows version inspected the same way was had "Excel 2003 Worksheet" as teh type of worksheet, whereas it was "Excel 97 - 2004" for the Mac version.
So maybe you can coerce R into bypassing all the errors that get triggered when reading or grepping during scanning for "Macintosh". Maybe Linux-R is more resistant to that sort of thing?
Error: invalid multibyte string at '<ff>'
I also got a bunch of warnings from grep that suggested I might not be able to "see" into some of the strings:
Warning message:
In grep("Macintosh", lin) : input string 1 is invalid in this locale
You might be able to highjack some more robust code from the Perl code in xls2csv.pl which is part of the gdata package.