rodbc character encoding error with PostgreSQL

2019-04-24 18:45发布

I'm getting a new error which I've never gotten before when connecting from R to a GreenPlum PostgreSQL database using RODBC. I've gotten the error using both EMACS/ESS and RStudio, and the RODBC call has worked as is in the past.

library(RODBC)
gp <- odbcConnect("greenplum", believeNRows = FALSE)
data <- sqlQuery(gp, "select * from mytable")

> data
[1] "22P05 7 ERROR: character 0xc280 of encoding \"UTF8\" has no equivalent in  "WIN1252\";\nError while executing the query" 
[2] "[RODBC] ERROR: Could not SQLExecDirect 'select * from mytable'"

EDIT: Just tried querying another table and did get results. So I guess it's not an RODBC problem but a PostgreSQL table encoding problem.

R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RODBC_1.3-2
> 

4条回答
戒情不戒烟
2楼-- · 2019-04-24 19:00

0xc280 is a control element ( U+0080 in Unicode) that is causing trouble pretty often when using SQL and the likes. The problem often lies in the conversion chain that invariably happens when you use different applications that use different encoding schemes. Windows has UTF-8 included by now, so it's not strictly a Windows problem. I believe the problem arises before R reads the data in.

In fact, in the chain the character sequence 0x80 in UNICODE will be mapped to 0xc280 in UTF-8. This is supposed to be a control sequence, and cannot be printed. But chances are big that the 0x80 is in fact not UNICODE, but Windows Latin-1 or Latin-2. In that case, the 0x80 represents the euro sign. That might explain how it ends up in your data. Check if you can find something like that in the data, that would explain something already.

My guess is that the solution will not lie at the R-end of this workchain, but before that. It will try automatic conversion, but this one is reported to fail in some cases (also for SQL and Oracle btw). Check in which encoding you're working in Postgresql, and try to use any of the latin types. There might be other links involved (a Putty or similar terminal for example). I'm pretty sure all the encodings there are ISO8859-1, which is Latin-1. Somewhere UTF-8 gets thrown in between, and when the 0x80 character gets wrongly mapped to 0xc280, you get trouble.

So check the encodings in your complete workchain, and make sure that they all match. If they don't, the automatic conversion done between each step is bound to give trouble for some characters.

Hope this helps.

查看更多
狗以群分
3楼-- · 2019-04-24 19:05

I might have posted this response elswhere but here goes.

I get similar error when connecting to Postgres DB from MS SQL Management client. Tyring to fix the source data is almost impossible in my case.

My Scenario:

  1. Trying to connect to Postgress using MS SQL Linked Objects via an ODBC System DSN, and see errors such as "ERROR: character 0xc280 of encoding "UTF8" has no equivalent in"WIN1252";
  2. Select statements on some tables work and others throw this error.

Fix: Use an ODBC driver that supports Unicode. I am using an ODBC driver from PostgreSQL Global Development Group. Go to Configure DSN/Manage DSN and select the Unicode driver.

Good luck.

查看更多
乱世女痞
4楼-- · 2019-04-24 19:12

First, the issue arises because R is trying to convert to a Windows locale that supports UTF8. Unfortunately, Brian Ripley has reported numerous times that Windows has no UTF8 locales. From hours spent searching the web, StackOverflow, Microsoft, etc., I have come to the conclusion that Microsoft hates UTF-8 Windows won't support UTF8.

As a result, I'm not sure that there's an easy solution to this, if there is any solution at all. The best I can recommend is to wrap some kind of conversion on the server side, look at filtering the data if you can, or try a different language, if appropriate (e.g. Chinese, Japanese, Korean).

If you do decide to wrap a converter, unicode.org recommends this ICU toolkit.

查看更多
淡お忘
5楼-- · 2019-04-24 19:19

By default Greenplum use UTF8 for character encoding. You could check this by logging in to Greenplum server and launching psql - console client for Greenplum. In this console application you could issue command: \l to list all of the databases configured in the Greenplum - this should also describe character set for database.

I think your prblem is that R doesnt support UTF8 for chars (You use different locale) But you could use On-the-fly transcoding in ODBC driver. Not sure about all ODBC drivers but DataDirect drivers support extra option in odbc.ini file (usually located in user home directory) - IANAAppCodePage.

You could find appropriate code for this parameter on this link: http://www.iana.org/assignments/character-sets

Here is the example od ODBC.ini content:

[ODBC]
Driver=/opt/odbc/lib/S0gplm60.so
IANAAppCodePage=2252
AlternateServers=
ApplicationUsingThreads=1
ConnectionReset=0
ConnectionRetryCount=0
ConnectionRetryDelay=3
Database=mysdb
EnableDescribeParam=1
ExtendedColumnMetadata=0
FailoverGranularity=0
FailoverMode=0
FailoverPreconnect=0
FetchRefCursor=1
FetchTSWTZasTimestamp=0
FetchTWFSasTime=0
HostName=192.168.1.100
InitializationString=
LoadBalanceTimeout=0
LoadBalancing=0
LoginTimeout=15
LogonID=
MaxPoolSize=100
MinPoolSize=0
Password=
Pooling=0
PortNumber=5432  
QueryTimeout=0
ReportCodepageConversionErrors=0
TransactionErrorBehavior=1
XMLDescribeType=-10
查看更多
登录 后发表回答