Is there a way to escape non-alphanumeric characte

Posted 2019-07-22 11:47

Question:

I have an anchor tag:

file.html#stuff-morestuff-CHP-1-SECT-2.1

Trying to pull the referenced content in Nokogiri:

documentFragment.at_css('#stuff-morestuff-CHP-1-SECT-2.1')

fails with the error:

unexpected '.1' after '[#<Nokogiri::CSS::Node:0x007fd1a7df9b40 @type=:CONDITIONAL_SELECTOR, @value=[#<Nokogiri::CSS::Node:0x007fd1a7df9b90 @type=:ELEMENT_NAME, @value=["*"]>, #<Nokogiri::CSS::Node:0x007fd1a7df9cd0 @type=:ID, @value=["#unixnut4-CHP-1-SECT-2"]>]>]' (Nokogiri::CSS::SyntaxError)

Just trying to talk through this: I think Nokogiri is complaining about the .1 in the selector ID, because . is not valid in an HTML id.

I don't own the content, so I really don't want to go through and fix all the bad IDs if it is avoidable. Is there a way to escape non-alphanumeric characters in a selector passed to a Nokogiri .css() call?

Answer 1:

Assuming your HTML looks something like this:

<div id='stuff-morestuff-CHP-1-SECT-2.1'>foo</div>

The string in question, stuff-morestuff-CHP-1-SECT-2.1, is a valid HTML ID, but it isn’t a valid CSS selector — the . character isn’t valid there.

You should be able to escape the . with a backslash character, i.e. this is a valid CSS selector:

#stuff-morestuff-CHP-1-SECT-2\.1

Unfortunately, this doesn’t seem to work in Nokogiri; there may be a bug in the CSS-to-XPath translation that it performs. (It does work in the browser.)
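When trying the escaped selector from Ruby, it may be worth first confirming that the backslash actually reaches Nokogiri, because Ruby string literals treat backslashes differently depending on the quotes used. The sketch below is plain Ruby and only inspects the strings; it does not show whether a given Nokogiri version accepts the escaped selector, which would still need to be checked separately.

```ruby
# Single quotes keep the backslash: the string contains "\." literally.
single = '#stuff-morestuff-CHP-1-SECT-2\.1'

# Double quotes treat "\." as an escape sequence for "." and drop the
# backslash, so the selector arrives unescaped.
double = "#stuff-morestuff-CHP-1-SECT-2\.1"

# In double quotes the backslash itself has to be escaped.
explicit = "#stuff-morestuff-CHP-1-SECT-2\\.1"

puts single.inspect
puts double.inspect
puts explicit.inspect
```

Only the single-quoted and the explicitly escaped forms carry the backslash through to whatever parses the selector.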

You can get around this by just checking the id attribute directly:

documentFragment.at_css('*[id="stuff-morestuff-CHP-1-SECT-2.1"]')
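If the ids come from anchor links such as the one in the question, the attribute selector can be built from the link itself. The following is a minimal sketch in plain Ruby; id_attribute_selector is a hypothetical helper name, not part of Nokogiri, and it assumes the id contains no double quotes or backslashes (it raises rather than guess how those would need to be escaped).

```ruby
# Hypothetical helper (not part of Nokogiri): turn a link such as
# "file.html#stuff-morestuff-CHP-1-SECT-2.1" into an attribute selector
# that compares the id as a plain string.
def id_attribute_selector(href)
  # Everything after the first "#" is the fragment, i.e. the target id.
  fragment = href.split('#', 2)[1]
  return nil if fragment.nil? || fragment.empty?

  # Assumption: ids with quotes or backslashes are out of scope here.
  if fragment.match?(/["\\]/)
    raise ArgumentError, "unsupported character in id: #{fragment.inspect}"
  end

  %(*[id="#{fragment}"])
end

selector = id_attribute_selector('file.html#stuff-morestuff-CHP-1-SECT-2.1')
puts selector
```

The resulting string can then be passed to at_css in the same way as the hand-written selector above.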

Even if backslash escaping worked, checking the id attribute like this would probably still be the simpler option when the value starts with a digit. That is valid in HTML5, but a CSS ID selector cannot begin with an unescaped digit, so the leading digit would have to be written as a code point escape (for example, #\31 23 for the id 123).

You could also use XPath, which has an id function that you can use here:

documentFragment.xpath("id('stuff-morestuff-CHP-1-SECT-2.1')")
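When the id is not hard-coded, the XPath expression has to be assembled from a string, and XPath 1.0 string literals have no escape character for quotes. A rough sketch of one way to handle that in plain Ruby follows; xpath_literal is a hypothetical helper name, not a Nokogiri method.

```ruby
# Hypothetical helper (not part of Nokogiri): quote a Ruby string as an
# XPath 1.0 string literal.
def xpath_literal(value)
  # No single quote inside: wrap in single quotes.
  return "'#{value}'" unless value.include?("'")
  # Single quotes but no double quotes: wrap in double quotes.
  return %("#{value}") unless value.include?('"')

  # Both kinds of quote: split on ' and glue the pieces back together
  # with concat(), passing each ' as its own double-quoted literal.
  parts = value.split("'", -1).map { |part| "'#{part}'" }
  "concat(#{parts.join(%q(, "'", ))})"
end

anchor_id = 'stuff-morestuff-CHP-1-SECT-2.1'
expression = "id(#{xpath_literal(anchor_id)})"
puts expression
```

For the id in the question this produces the same expression as the hand-written call above, so it can be passed to xpath unchanged.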