I fetch an HTML fragment like
"<li>市&nbsp;场&nbsp;价"
which contains "&nbsp;", but after calling to_s on the Nokogiri NodeSet it becomes
"<li>市 场 价"
I want to keep the original HTML fragment, so I tried setting the :save_with option for the to_s method, but that failed.
Has anyone encountered the same problem and can help me? Thank you in advance.
I think the problem is how you're looking at the string. It will look like a space, but it's not quite the same:
A regular space is 32, 0x20, or ' '. 160 is the decimal value for a non-breaking space, which is what "&nbsp;" converts to after you use Nokogiri's various inner_text, content, text, or to_s methods. It's no longer an HTML entity encoding, but it's still a non-breaking space. I think Nokogiri's conversion from the entity encoding is the appropriate behavior when asking for a stringification.
There might be a flag to tell Nokogiri NOT to decode the value, but I'm not aware of one offhand. You can check on Nokogiri's mailing list that I mentioned in the comment above to see if there is a flag. I can also see an advantage in Nokogiri not doing the decode, so if there isn't such a flag, it would occasionally be nice to have one.
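To see the difference between the two characters described above, here is a small plain-Ruby check. It does not use Nokogiri; the decoded string is a hand-written stand-in (an assumption) for the kind of text Nokogiri returns once the entity has been decoded.

```ruby
# A regular space and a non-breaking space look alike but are different characters.
regular_space = " "        # U+0020
nbsp          = "\u00A0"   # U+00A0, the character that &nbsp; decodes to

puts regular_space.ord          # 32
puts nbsp.ord                   # 160
puts regular_space == nbsp      # false

# Assumption: a stand-in for text returned after Nokogiri decodes the entities.
decoded = "市\u00A0场\u00A0价"
puts decoded.include?(" ")      # false, it contains no regular spaces
puts decoded.count("\u00A0")    # 2
```

Printing the string to a terminal will usually show what looks like ordinary spaces, which is why inspecting the code points is more reliable than looking at the output.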
Now, all that said, I think the to_html method SHOULD return the value in its entity-encoded form, since a non-breaking space is a nasty thing to encounter in an HTML stream. I think you should mention that on the mailing list, or maybe even file it as a bug. I think it's an inappropriate result.
http://groups.google.com/group/nokogiri-talk/msg/0b81ef0dc180dc74
Your sample text isn't ASCII-8BIT, so try changing that encoding string to the Unicode character set name and see if inner_html will return an entity-encoded value.
I encountered a similar situation, and what I came up with was a bit of a hack, but it seems to work well.
In my case, I wanted the nbsp to be a regular space. I think in your case you want them turned back into "&nbsp;", so you could do something like:
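A minimal sketch of that kind of substitution, in plain Ruby. The helper names restore_nbsp and nbsp_to_space are hypothetical, and the input string is an assumed stand-in for what to_s on the NodeSet returns; with Nokogiri you would pass the real serialized string instead.

```ruby
NBSP = "\u00A0" # the non-breaking space character (decimal 160)

# Hypothetical helper: turn decoded non-breaking spaces back into the entity.
def restore_nbsp(html)
  html.gsub(NBSP, "&nbsp;")
end

# Hypothetical helper for the other case: replace them with regular spaces.
def nbsp_to_space(html)
  html.gsub(NBSP, " ")
end

# Assumption: a stand-in for the string returned by node_set.to_s.
serialized = "<li>市\u00A0场\u00A0价</li>"

puts restore_nbsp(serialized)   # <li>市&nbsp;场&nbsp;价</li>
puts nbsp_to_space(serialized)  # <li>市 场 价</li>
```

This is a post-processing hack on the serialized string rather than a Nokogiri option, so it only restores non-breaking spaces; any other entities that were decoded stay decoded.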