Unescape the ampersand (&) via XMLStarlet - Buggin

Posted 2019-07-28 06:17

Question:

This is an annoying task that ought to be simple. Following this guide, I wrote this:

#!/bin/bash

content=$(wget "https://example.com/" -O -)
ampersand=$(echo '\&')

xmllint --html --xpath '//*[@id="table"]/tbody' - <<<"$content" 2>/dev/null |
    xmlstarlet sel -t \
        -m "/tbody/tr/td" \
            -o "https://example.com" \
            -v "a//@href" \
            -o "/?A=1" \
            -o "$ampersand" \
            -o "B=2" -n \

I successfully extract each link from the table and everything is concatenated correctly. However, instead of a plain & I get this at the end of each link:

https://example.com/hello-world/?A=1\&amp;B=2

But actually, I was looking for something like:

https://example.com/hello-world/?A=1&B=2

The idea was to escape the character with a backslash (\&) so that it would be left alone. Initially I put it directly into the command as -o "\&" \ instead of -o "$ampersand" \, without the ampersand=$(echo '\&') line. The result was the same.

If I remove the backslash, it still outputs:

https://example.com/hello-world/?A=1&amp;B=2

The only difference is that the \ before &amp; is gone.

Why?

I'm sure I'm missing something basic.

Answer 1:

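Sorry, I can't reproduce your result, but why not just make a substitution? Filter your results through

sed 's/\\&amp;/\&/g'

by adding it to the end of your pipe. It replaces every \&amp; with &. If your output contains &amp; without the leading backslash, drop the \\ from the pattern.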
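As a rough sketch of that filter, the variant below makes the leading backslash optional so it handles both outputs shown in the question. The function name unescape_amp is made up for illustration, and -E (extended regular expressions) is assumed to be available, as it is in GNU and BSD sed.

```shell
#!/bin/bash
# Hypothetical helper: turn "\&amp;" or "&amp;" back into a plain "&".
unescape_amp() {
    # '\\?' matches an optional literal backslash before "&amp;".
    # In the replacement, '\&' is a literal ampersand; a bare '&' would
    # mean "the whole match" and change nothing.
    sed -E 's/\\?&amp;/\&/g'
}

# Sample lines copied from the question's two outputs.
fixed=$(printf '%s\n' \
    'https://example.com/hello-world/?A=1\&amp;B=2' \
    'https://example.com/hello-world/?A=1&amp;B=2' | unescape_amp)
printf '%s\n' "$fixed"
```

Both sample lines should come out as https://example.com/hello-world/?A=1&B=2. Note that this is plain text substitution, so it would also rewrite an &amp; that legitimately belongs in the data.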



Answer 2:

&amp; is the correct way to print & in an XML document, but since you just want a plain URL, your output should not be XML. You therefore need to switch to text mode by passing --text or -T to the sel command.

Your example input doesn't quite work because example.com doesn't have any table elements, but here is a working example building links from p elements instead.

content=$(wget 'https://example.com/' -O -)
xmlstarlet fo --html <<<"$content" |
    xmlstarlet sel -T -t \
        -m '//p[a]' \
            --if 'not(starts-with(a//@href,"http"))' \
              -o 'https://example.com/' \
            --break \
            -v 'a//@href' \
            -o '/?A=1' \
            -o '&' \
            -o 'B=2' -n

The output is

http://www.iana.org/domains/example/?A=1&B=2
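To see the two output modes side by side without fetching anything, here is a small sketch that runs the question's template against an inline fragment. The fragment, the variable names, and the build_links helper are all made up for illustration; only the sel options already used above are assumed.

```shell
#!/bin/bash
# Hypothetical inline input standing in for the scraped table body.
xml='<tbody><tr><td><a href="/hello-world">x</a></td></tr></tbody>'

build_links() {
    # "$@" carries extra sel options such as -T; the template mirrors the question.
    xmlstarlet sel "$@" -t \
        -m '/tbody/tr/td' \
            -o 'https://example.com' \
            -v 'a//@href' \
            -o '/?A=1' -o '&' -o 'B=2' -n
}

# Only run when xmlstarlet is installed.
if command -v xmlstarlet >/dev/null 2>&1; then
    xml_mode=$(build_links <<<"$xml")      # default mode: output is XML
    text_mode=$(build_links -T <<<"$xml")  # text mode: output is plain text
    printf 'default: %s\ntext:    %s\n' "$xml_mode" "$text_mode"
fi
```

The default line should show the escaped &amp; described in the question, while the -T line should carry a bare &.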


Answer 3:

As you have already seen, backslash-escaping isn't the solution here. I can think of two possible options:

Extract the hrefs (you probably don't need both xmllint and xmlstarlet for this), then use a standard text processing tool such as sed to add the start and the end:

sed 's,^,https://example.com/,; s,$,/?A=1\&B=2,'

Alternatively, pipe the output of what you've currently got to xmlstarlet unesc, which will change &amp; into &.
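For the first option, a minimal sketch of the sed step on its own might look like this. The sample hrefs are invented and assumed to be relative paths without a leading slash, one per line, which is what the sed expression above expects.

```shell
#!/bin/bash
# Hypothetical sample: relative hrefs, one per line, as an extraction step might print them.
hrefs='hello-world
foo/bar'

# The first expression prepends the base URL; the second appends the query string.
# '\&' in the replacement is a literal ampersand.
links=$(printf '%s\n' "$hrefs" | sed 's,^,https://example.com/,; s,$,/?A=1\&B=2,')
printf '%s\n' "$links"
```

This should print https://example.com/hello-world/?A=1&B=2 and https://example.com/foo/bar/?A=1&B=2. If the extracted hrefs already start with a slash, the trailing slash in the prefix would need to go.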