annoying “feature” (or bugs?) for RODBC

2019-02-13 21:24发布

问题:

RODBC is the main library in R to import data from a database into R. RODBC seems to have the ability of "guess" the datatype of the column which I find it particularly annoying.

I have uploaded a file test.xls here, or you may create a xls file yourself:

  1. create 2 columns, first column named col_a and the second column named col_b.
  2. type whatever you like in col_a, I typed letters on this column for 92 rows
  3. at the 92th row of col_b, type a number there, I typed "1923" without changing the data type (i.e. not using ')
  4. try to import the xls file into R using the following script:

library(RODBC)

setwd("C:/Users/hke775/Documents/Enoch/MISC/R_problems/RODBC")
channel <- odbcConnectExcel("test.xls",readOnly=TRUE)
dummy.df <- sqlFetch(channel,"Sheet1")
odbcClose(channel)

You will see that in dummy.df, col_b is all NA, the 1923 in this column is gone.

If you want to see the 1923 again, you can change the 1st row of col_b to a number, and it is back again.

This is very annoying as I don't prefer modifying data manually. I need to use other package to do the xls importing, but I can't find other packages do as smooth as RODBC (I tried gdata and xlsReadWrite).

Did I missing anything in the sqlFetch command, and cause the trouble? Thanks.

回答1:

Please don't blame R or RODBC for Microsoft's bugs... ;)

However, due to a bug in the ODBC driver, specifying the Rows to Scan (MaxScanRows) setting currently has no effect. In other words, the Excel ODBC driver (MDAC 2.1 and later) always scans the first 8 rows in the specified data source in order to determine each column's datatype.

For additional information about the Rows to Scan bug, including a simple workaround, click the article number below to view the article in the Microsoft Knowledge Base:

189897 XL97: Data Truncated to 255 Characters with Excel ODBC Driver

I tried the fix in KB189897 by setting the TypeGuessRows value to 0 and look what happens!

> library(RODBC)
> channel <- odbcConnectExcel("test.xls",readOnly=TRUE)
> tail(dummy.df <- sqlFetch(channel,"Sheet1"))
   col_a col_b
87     c    NA
88     d    NA
89     e    NA
90     f    NA
91     g    NA
92     h  1923
> odbcClose(channel)

Please, no up-votes or check marks... just send cash. :)