I have a problem when calling htmlParse() on an XHTML document.
When it loads into R as an 'externalptr', I can see that one line has been added at the top of the file:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
I don't want this line to appear because it breaks my application. I would like to remove it from within the htmlParse() call, rather than having to delete it manually for each XHTML file I have.
Any suggestions? I've tried changing some of the parameters passed to htmlParse(), but so far I have not found one that does this.
If it helps, here are the first lines of the XHTML I parse:
<?xml version="1.0" encoding="utf-8" ?>
<html dir="ltr" xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" xml:lang="es">
<head>
<meta charset="utf-8" />
I tried calling xmlRoot() and then saving the result with saveXML(), passing the prefix <?xml version="1.0" encoding="utf-8" ?> as a parameter.
There was also an encoding problem, but that's another story. On Windows it didn't work; on Ubuntu it finally worked.
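For anyone trying the same thing, here is a rough sketch of the xmlRoot() plus saveXML() workaround. It is not a definitive solution. The input string, the variable names and the temporary output file are all made up for the example. It prepends the XML declaration by hand rather than passing prefix to saveXML(), because I have not checked how prefix is handled when a single node is saved instead of the whole document.

```r
# Sketch only: serialise the root <html> node instead of the whole document,
# so the DOCTYPE that htmlParse() attaches to the document is not written out.
# `xhtml_text` and `output_file` are hypothetical stand-ins for real files.
library(XML)

xhtml_text <- paste0(
  '<?xml version="1.0" encoding="utf-8" ?>\n',
  '<html dir="ltr" xmlns="http://www.w3.org/1999/xhtml" xml:lang="es">\n',
  '<head><meta charset="utf-8" /><title>Demo</title></head>\n',
  '<body><p>Hola</p></body>\n',
  '</html>'
)

# Parse from a string; for a file on disk, pass the path and drop asText.
doc  <- htmlParse(xhtml_text, asText = TRUE, encoding = "UTF-8")
root <- xmlRoot(doc)

# With no file argument, saveXML() on a node returns that node's markup
# as a character string, without the document-level DOCTYPE.
body_markup <- saveXML(root, encoding = "UTF-8")

# Prepend the XML declaration manually (assumption: this is the prefix wanted).
output_text <- paste0('<?xml version="1.0" encoding="utf-8" ?>\n', body_markup)

# Write through a connection with an explicit encoding, which may help with
# platform differences such as the Windows/Ubuntu one mentioned above.
output_file <- tempfile(fileext = ".xhtml")
con <- file(output_file, open = "w", encoding = "UTF-8")
writeLines(output_text, con)
close(con)
```

Because the document was read with the HTML parser, the serialised markup may follow HTML conventions rather than strict XHTML ones, so it is worth checking that the application accepts the result.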
Thank you all.