Reading malformed XML with Nokogiri: Unescaped Amp

Posted 2020-03-27 03:43

Question:

I am trying to read an XML file from a third party with Nokogiri in my Rails project. One of the nodes I have to parse contains a URL with unescaped ampersands (like foo.com/index.html?page=1&query=bar).

I understand that this is considered malformed XML, and Nokogiri just tries to parse it anyway, resulting in foo.com/index.html?page=1=bar.

How can I obtain the full URL? Can I tweak Nokogiri? Would you run a search-and-replace pass before parsing, or is there a better practice?

Answer 1:

I had the same issue parsing SVGs with image links containing ampersands.

Parsing the SVG as HTML first seems to handle the links correctly: when the result is serialized again, each bare & comes out escaped.

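require "nokogiri"
# raw_svg holds the SVG source as a string. Round-trip it through the
# lenient HTML parser so the serialized output has escaped ampersands.
fixed_svg = Nokogiri::HTML.fragment(raw_svg).to_html
# proceed with XML parsing
svg = Nokogiri::XML(fixed_svg)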
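
The search-and-replace pass mentioned in the question could look roughly like the sketch below. It is only one possible approach, not something Nokogiri provides: escape_bare_ampersands and BARE_AMPERSAND are hypothetical names, and the regex assumes the input has no CDATA sections or comments where a literal & is legitimate. An ampersand followed by something that looks like an entity (for example &copy; inside a query string) is left untouched, so it can still be misread.

```ruby
# Hypothetical helper, not part of Nokogiri: escape every "&" that does not
# already start an entity reference (&name;) or a character reference
# (&#38; or &#x26;).
BARE_AMPERSAND = /&(?!(?:[A-Za-z_][\w.-]*|#\d+|#x\h+);)/

def escape_bare_ampersands(raw_xml)
  raw_xml.gsub(BARE_AMPERSAND, "&amp;")
end

# Sample input modeled on the URL from the question.
raw = "<link>foo.com/index.html?page=1&query=bar</link>"
fixed = escape_bare_ampersands(raw)
puts fixed
```

The fixed string can then be handed to Nokogiri::XML(fixed), after which the text of the link node should contain the full URL with a plain &.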