Encoding problems with R XML via SPARQL

Posted 2019-05-10 02:48

Question:

I'm running into an encoding problem with the SPARQL package for R. I'm running the following code:

library(SPARQL)

rights_query <- '
PREFIX dc:  <http://purl.org/dc/elements/1.1/>
PREFIX edm: <http://www.europeana.eu/schemas/edm/>
PREFIX ore: <http://www.openarchives.org/ore/terms/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT DISTINCT ?edmrights ?provider (COUNT(*) as ?count)
WHERE {
?agg rdf:type ore:Aggregation .
?agg edm:rights ?edmrights .
#?agg dc:rights ?dcrights .
?agg edm:dataProvider ?provider .

?proxy ore:proxyIn ?agg .
?proxy edm:type "IMAGE" .
}
GROUP BY ?edmrights ?provider
ORDER BY ?provider DESC(?count)'

eur <- "http://europeana.ontotext.com/sparql"

eur_data <- SPARQL(eur, rights_query)$results
write.csv(eur_data, "results.csv")

The code runs without any errors or warnings, but the resulting data frame (as viewed in RStudio) and the CSV both clearly have encoding problems.

For example, the last entry ought to be partly Cyrillic: Чувашский государственный художественный музей / Chouvashia State Art Museum

However it comes out looking like this: ЧÑваÑÑкий гоÑÑдаÑÑÑвеннÑй ÑÑдожеÑÑвеннÑй мÑзей / Chouvashia State Art Museum

I've inspected the XML returned by the SPARQL query. It passes XML validation, and contains the proper "UTF-8" encoding declaration. The R XML package (which is what the R SPARQL package uses to parse XML output into a data frame) ought to recognize this, right?
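One way to test whether the garbling is a declaration problem rather than corrupted data is to look at the encoding mark R has attached to the strings. The sketch below rests on an assumption about the cause, not a confirmed diagnosis: it simulates UTF-8 bytes that have been marked as Latin-1 and shows that re-declaring them with `Encoding<-` restores the text without changing any bytes. The sample string and variable names are made up for illustration.

```r
# Hypothetical sample value; \u escapes keep this script ASCII-only.
original <- "\u0427\u0443\u0432\u0430\u0448\u0441\u043a\u0438\u0439"

# The UTF-8 byte sequence for the string (two bytes per Cyrillic letter).
utf8_bytes <- charToRaw(enc2utf8(original))

# Simulate the suspected failure: the right bytes carrying the wrong label.
garbled <- rawToChar(utf8_bytes)
Encoding(garbled) <- "latin1"

# Re-declare the encoding. No bytes are converted, only the mark changes.
repaired <- garbled
Encoding(repaired) <- "UTF-8"

Encoding(garbled)                                # the mark that causes the mojibake
Encoding(repaired)                               # the corrected mark
identical(charToRaw(repaired), utf8_bytes)       # bytes are untouched
```

If a column of the real data frame behaves like `garbled` here, setting `Encoding(column) <- "UTF-8"` may be enough. If the bytes themselves differ from valid UTF-8, the damage happened earlier and this will not help.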

You can inspect the entire XML output, as well as the CSV file. I am running R 3.1.0 via RStudio, on OS X Mavericks. I have set RStudio's default character encoding to UTF-8.

Answer 1:

Instead of SPARQL.R, in this case I would consider something like the following:

Sys.setenv(LANG = "ru")
library(RCurl)
library(XML)

# Placeholders: substitute your own endpoint URL and query string.
url <- "<SPARQL endpoint query URL>"
xquery <- "<your query which may contain windows-1251 chars on Windows 7+>"
# xquery <- iconv(xquery, "CP1251", "UTF-8")  # may be required on Windows 7+
param <- "query"
extrastr <- ""

# Ask for SPARQL results XML and tell RCurl to treat the response as UTF-8.
resp <- getURL(
    url = paste(url, "?", param, "=",
                gsub("\\+", "%2B", URLencode(xquery, reserved = TRUE)),
                extrastr, sep = ""),
    httpheader = c(Accept = "application/sparql-results+xml"),
    .encoding = "UTF-8")

xmlResp <- xmlParse(resp)
xmlRespRoot <- xmlRoot(xmlResp)
nRows <- xmlSize(xmlRespRoot[2]$results)

# Build a character matrix: one row per <result>, one column per <binding>.
xRespList <- NULL
if (nRows > 0) {
    for (i in 1:nRows) {
        xrow <- NULL
        tmprow <- (xmlRespRoot[2]$results)[i]$result
        nCols <- xmlSize(tmprow)
        for (j in 1:nCols) {
            # Read each value explicitly as UTF-8.
            xrow <- cbind(xrow, xmlValue(tmprow[j]$binding, encoding = "UTF-8"))
        }
        xRespList <- rbind(xRespList, xrow)
    }
}
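
The same extraction can also be sketched with XPath instead of positional indexing, which keeps each value tied to its variable name even when a row lacks a binding. This is only a sketch under stated assumptions: the inline response is a hand-written stand-in for what an endpoint might return, `get_cell` is a hypothetical helper, and the prefix `s` is an arbitrary local name for the SPARQL results namespace.

```r
library(XML)

# Hand-written stand-in for an endpoint response (not real Europeana data).
sample_xml <- enc2utf8(paste0(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<sparql xmlns="http://www.w3.org/2005/sparql-results#">',
  '<head><variable name="provider"/><variable name="count"/></head>',
  '<results>',
  '<result>',
  '<binding name="provider"><literal>\u041c\u0443\u0437\u0435\u0439 / Museum</literal></binding>',
  '<binding name="count"><literal>42</literal></binding>',
  '</result>',
  '<result>',
  '<binding name="provider"><literal>Archive</literal></binding>',
  '</result>',
  '</results></sparql>'
))

# Parse the text as UTF-8; the default namespace needs a prefix for XPath.
doc <- xmlParse(sample_xml, asText = TRUE, encoding = "UTF-8")
ns <- c(s = "http://www.w3.org/2005/sparql-results#")

# Column names come from the <head> section, rows from <results>.
vars <- xpathSApply(doc, "//s:head/s:variable", xmlGetAttr, "name", namespaces = ns)
rows <- getNodeSet(doc, "//s:results/s:result", namespaces = ns)

# Hypothetical helper: value of one variable in one row, NA if unbound.
get_cell <- function(row, var) {
  hit <- getNodeSet(row, sprintf("./s:binding[@name='%s']", var), namespaces = ns)
  if (length(hit) == 0) NA_character_ else xmlValue(hit[[1]])
}

# One character column per variable; convert types afterwards as needed.
cells <- lapply(vars, function(v) vapply(rows, get_cell, character(1), var = v))
results <- as.data.frame(setNames(cells, vars), stringsAsFactors = FALSE)
```

All columns come back as character, so a numeric column such as `count` would still need `as.integer()` before sorting or summing.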