I'm trying to parse an HTML page with Nokogiri, but I'm having trouble with the text: I cannot get rid of unwanted characters. Whenever I obtain a String while parsing, I try to clean it up as much as possible by converting runs of whitespace and non-printable characters into single spaces. After a lot of modifications, I use this method, without success:
require 'cgi'

def clear_string(str)
  CGI::unescapeHTML(str).gsub(/\s+/mu, " ").strip
end
For instance, suppose this HTML fragment (copy-pasted from http://www.gisa.cat/gisa/servlet/HomeLicitation?licitationID=1061525):
<tr>
<td><span class="linkred2">Tramitació:</span></td>
<td> ordinària </td>
</tr>
Here are some intermediate example outputs, shown by NetBeans 7.0, using Nokogiri and clear_string (the method defined above):
row.at("td[1]").text # => "Tramitació:"
row.at("td[2]").text # => " ordinària "
clear_string(row.at("td[2]").text) # => " ordinària"
row.at("td[2]").text.scan(/./mu) # => ["\302\240", "o", "r", "d", "i", "n", "\303\240", "r", "i", "a", " "]
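As far as I can tell, the leading "\302\240" is the UTF-8 byte pair for U+00A0 (NO-BREAK SPACE), which Nokogiri presumably decodes from an &nbsp; entity in the HTML. A quick sketch in irb (assuming Ruby 1.9+; the string literal syntax differs in 1.8) shows that Ruby's \s does not cover that character, while the POSIX class [[:space:]] does:

```ruby
# U+00A0 (no-break space) followed by the visible text, as Nokogiri returns it.
str = "\u00A0ordinària "

# \s only matches ASCII whitespace, so neither strip nor /\s+/ touches U+00A0.
str =~ /\A\s/           # => nil (no match)
# The POSIX bracket class is Unicode-aware for UTF-8 strings.
str =~ /\A[[:space:]]/  # => 0 (match at the first character)

# Normalizing with [[:space:]] instead of \s removes the no-break space too.
str.gsub(/[[:space:]]+/, " ").strip  # => "ordinària"
```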
I don't know why strip doesn't get rid of the leading space. Moreover, the parsing result, after applying clear_string, is dumped into a YAML file using YAML::dump. The dumped contents for the two texts are, respectively:
"Tramitaci\xC3\xB3:"
!binary |
wqBvcmRpbsOgcmlh
The first one seems more or less OK, but I don't know how to fix the second case.
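For what it's worth, I suspect the !binary form is the YAML engine falling back to base64 because of the U+00A0 byte sequence; once that character is removed, the cleaned UTF-8 string seems to round-trip as plain text. A minimal check, assuming a modern Ruby where Psych is the default YAML engine:

```ruby
require 'yaml'

# After the no-break space is stripped out, the string is ordinary
# UTF-8 text and Psych dumps it without resorting to !binary.
clean = "ordinària"
dumped = YAML.dump(clean)

YAML.load(dumped) == clean  # => true (round-trips as plain text)
```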