Is there a way to locate an encoding problem within an XML file? I'm trying to parse such a file (let's call it doc) with the XML library in R, but there seems to be a problem with the encoding.
xmlInternalTreeParse(doc, asText=TRUE)
Error: Document labelled UTF-16 but has UTF-8 content.
Error: Input is not proper UTF-8, indicate encoding!
Error: Premature end of data in tag ...
This is followed by a list of tags that supposedly end prematurely. However, I'm pretty sure that no tag in this document ends prematurely.
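One quick check at this point is to compare the encoding named in the XML declaration with what the bytes of the string actually look like. This is a minimal sketch in base R, assuming the document is held in a single character string and the declaration uses double quotes; `describe_encoding` and `sample_doc` are hypothetical names, not part of the XML package:

```r
# Hypothetical helper: report the declared encoding and a few byte-level facts.
describe_encoding <- function(x) {
  bytes <- charToRaw(x)
  # Only the first bytes are needed to read the XML declaration.
  prolog <- rawToChar(head(bytes, 100L))
  m <- regmatches(prolog, regexpr('encoding="[^"]+"', prolog, useBytes = TRUE))
  declared <- if (length(m) == 1L) {
    sub('encoding="([^"]+)"', "\\1", m)
  } else {
    NA_character_  # no double-quoted encoding attribute found
  }
  list(
    declared   = declared,
    valid_utf8 = validUTF8(x),                   # TRUE if the bytes form valid UTF-8
    ascii_only = all(as.integer(bytes) < 128L),  # TRUE if no byte has the high bit set
    n_bytes    = length(bytes)
  )
}

# Hypothetical sample standing in for the real document.
sample_doc <- paste0('<?xml version="1.0" encoding="utf-16"?>', "<a>text</a>")
str(describe_encoding(sample_doc))
```

If the real string declares utf-16 but turns out to be ASCII or valid UTF-8, that would be consistent with the first error message. If `valid_utf8` is FALSE, the content is in some other 8-bit encoding or is damaged.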
Ok, so next try:
doc <- iconv(doc, to="UTF-8")
doc <- sub("utf-16", "utf-8", doc)
xmlInternalTreeParse(doc, asText=TRUE)
Error: Premature end of data in tag...
Again a list of tags follows, along with line numbers. I've checked those lines and can't find any errors.
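Since the parser keeps reporting premature ends, it may be worth ruling out that the string itself is cut off before it reaches the parser. This is a minimal sketch in base R, assuming the document is one character string and the name of its root element is known; `ends_with_root_close` and the sample strings are hypothetical:

```r
# Hypothetical helper: TRUE if `x` ends with the closing tag of `root`,
# ignoring trailing whitespace. It works on bytes, so an invalid
# encoding inside the string does not affect the check.
ends_with_root_close <- function(x, root) {
  closing <- charToRaw(paste0("</", root, ">"))
  bytes <- charToRaw(x)
  whitespace <- c(9L, 10L, 13L, 32L)  # tab, LF, CR, space
  while (length(bytes) > 0L &&
         as.integer(bytes[length(bytes)]) %in% whitespace) {
    bytes <- bytes[-length(bytes)]
  }
  length(bytes) >= length(closing) &&
    identical(tail(bytes, length(closing)), closing)
}

# Hypothetical samples: one complete document, one cut off mid-tag.
complete <- "<root><a>1</a></root>\n"
cut_off  <- "<root><a>1</a"

ends_with_root_close(complete, "root")
ends_with_root_close(cut_off, "root")

# Byte length of the string, to compare with the length stored at the source.
nchar(cut_off, type = "bytes")
```

If the real document fails this check, the problem would lie in how the string is retrieved rather than in its encoding. That is only one possibility, not an established cause.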
Another suspicion: the "µ" character that occurs in the document might be causing the error. So next try:
doc <- iconv(doc, to="UTF-8")
doc <- gsub("µ", "micro", doc)
doc <- sub("utf-16", "utf-8", doc)
xmlInternalTreeParse(doc, asText=TRUE)
Error: Premature end of data in tag...
Any other suggestions for debugging?
EDIT: After spending two days trying to fix the error, I still haven't found a solution. However, I think I have narrowed down the possible causes. Here is what I've found:
- copying the XML string from the source database into a file and saving it as a separate xml file in Notepad++ --> Document labelled UTF-16 but has UTF-8 content.
- changing <?xml version="1.0" encoding="utf-16"?> to <?xml version="1.0" encoding="utf-8"?> (or encoding="latin1") within this file --> no error
- reading the XML string from the database via doc <- sqlQuery(myconn, query.text, stringsAsFactors = FALSE); doc <- doc[1,1], manipulating it with str_sub(doc, 35, 36) <- "8" or str_sub(doc, 31, 36) <- "latin1" and then trying to parse it with xmlInternalTreeParse(doc) --> Premature end of data in tag...
- reading the XML string from the database as above and then trying to parse it with xmlInternalTreeParse(doc) --> Document labelled UTF-16 but has UTF-8 content. Input is not proper UTF-8, indicate encoding ! Bytes: 0xE4 0x64 0x2E 0x20 Premature end of data in tag... (list of tags follows).
- reading the XML string from the database as above and parsing with xmlInternalTreeParse(doc, encoding="latin1") --> Premature end of data in tag...
- using doc <- iconv(doc[1,1], to="UTF-8") or to="latin1" before parsing doesn't change anything
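The bytes quoted by the parser (0xE4 0x64 0x2E 0x20) start with 0xE4, which announces a three-byte UTF-8 sequence but is followed by plain ASCII bytes; in latin1, 0xE4 is "ä". To find where such bytes sit in the string, a byte-level scan could be used. This is a minimal sketch in base R, not a full UTF-8 validator: it checks only lead bytes and continuation bytes, and `invalid_utf8_offsets` and `sample_doc` are hypothetical names:

```r
# Hypothetical helper: 1-based byte offsets at which `x` is not valid UTF-8.
# Simplified: overlong forms and surrogates are not rejected.
invalid_utf8_offsets <- function(x) {
  b <- as.integer(charToRaw(x))
  n <- length(b)
  bad <- integer(0)
  i <- 1L
  while (i <= n) {
    lead <- b[i]
    # Expected sequence length, derived from the lead byte.
    len <- if (lead < 0x80) {
      1L
    } else if (lead >= 0xC2 && lead <= 0xDF) {
      2L
    } else if (lead >= 0xE0 && lead <= 0xEF) {
      3L
    } else if (lead >= 0xF0 && lead <= 0xF4) {
      4L
    } else {
      0L  # stray continuation byte or illegal lead byte
    }
    ok <- len > 0L && i + len - 1L <= n
    if (ok && len > 1L) {
      cont <- b[(i + 1L):(i + len - 1L)]
      ok <- all(cont >= 0x80 & cont <= 0xBF)
    }
    if (ok) {
      i <- i + len
    } else {
      bad <- c(bad, i)
      i <- i + 1L
    }
  }
  bad
}

# Hypothetical sample containing the byte sequence from the error message.
sample_doc <- rawToChar(c(charToRaw("<a>m"), as.raw(0xE4), charToRaw("d. </a>")))
offsets <- invalid_utf8_offsets(sample_doc)
offsets
sprintf("%02X", as.integer(charToRaw(sample_doc))[offsets])
```

If the offsets in the real document all point at latin1 letters, calling iconv with an explicit from="latin1" might behave differently from the calls above, because the default from="" means the encoding of the current locale.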
I would appreciate any suggestions very much.