Ruby (1.8.7): How to get rid of non-printable characters

Posted 2019-07-29 12:26

Question:

I'm trying to parse an HTML page with Nokogiri, but I'm having trouble with the text: I cannot get rid of unwanted characters. Whenever I extract a String while parsing, I try to clean it as much as possible by collapsing runs of whitespace and non-printable characters into a single space. After many modifications, I'm using this method, without success:

require 'cgi'

def clear_string(str)
  # Decode HTML entities, collapse whitespace runs into one space, trim both ends
  CGI::unescapeHTML(str).gsub(/\s+/mu," ").strip
end

For instance, suppose this HTML fragment (copy-pasted from http://www.gisa.cat/gisa/servlet/HomeLicitation?licitationID=1061525):

<tr>
    <td><span class="linkred2">Tramitaci&oacute;:</span></td>
    <td>&nbsp;ordinària </td>
</tr>

Some intermediate outputs shown by NetBeans 7.0, using Nokogiri and clear_string (the method defined above):

row.at("td[1]").text # => "Tramitació:"
row.at("td[2]").text # => " ordinària "
clear_string(row.at("td[2]").text) # => " ordinària"
row.at("td[2]").text.scan(/./mu) # => ["\302\240", "o", "r", "d", "i", "n", "\303\240", "r", "i", "a", " "]

I don't know why strip doesn't remove the leading space. Moreover, the parsed result, after applying clear_string, is dumped to a YAML file using YAML::dump. Its contents for the two texts are, respectively:

"Tramitaci\xC3\xB3:"
!binary |
  wqBvcmRpbsOgcmlh

The first one seems more or less OK, but I don't know how to fix the second case.

Answer 1:

One way to convert text from one character set to another is to use Iconv. For example, if all you want is to convert UTF-8 to ASCII, you could do something like this:

require 'iconv'

s = "ordinària"
Iconv.conv('ASCII//TRANSLIT', 'UTF8', s)
=> "ordinaria"

The TRANSLIT switch tells Iconv to try to transliterate (approximately match) unconvertible characters. If you would rather drop unconvertible characters entirely, use the IGNORE switch instead:

Iconv.conv('ASCII//IGNORE', 'UTF8', s)
=> "ordinria"
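
Iconv shipped with Ruby 1.8.7 but was removed from the standard library in Ruby 2.0. As a rough sketch for newer Rubies (an assumption that goes beyond the 1.8.7 setup in the question), String#encode can drop characters that have no ASCII equivalent, much like IGNORE. It does not transliterate. The helper name ascii_only is made up for illustration.

```ruby
# Hypothetical helper (not from the original answer): drop every character
# that has no ASCII equivalent. Requires Ruby 1.9 or later.
def ascii_only(str)
  str.encode('US-ASCII', :invalid => :replace, :undef => :replace, :replace => '')
end

s = "ordin\u00E0ria" # "ordinària", written with an escape to avoid source-encoding issues
puts ascii_only(s)
```

This prints ordinria, the same result as the IGNORE example.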

Note that with TRANSLIT alone, Iconv raises an exception if it finds something it can't convert. To avoid that, you can combine TRANSLIT and IGNORE like so:

Iconv.conv('ASCII//TRANSLIT//IGNORE', 'UTF8', s)
=> "ordinaria"
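
As for the leading character in the question: "\302\240" is the UTF-8 encoding of U+00A0 (no-break space, which is what &nbsp; decodes to), and the question's output shows that neither \s nor strip removed it. A minimal sketch that avoids Iconv altogether, assuming the input is UTF-8, is to replace those two bytes with an ordinary space before collapsing whitespace. The helper name clean_text is hypothetical.

```ruby
# Hypothetical helper (not from the original post). Assumes str is UTF-8.
def clean_text(str)
  nbsp = "\xC2\xA0" # UTF-8 bytes of U+00A0, shown as "\302\240" in the question
  with_spaces = str.gsub(nbsp, ' ')     # no-break spaces become ordinary spaces
  with_spaces.gsub(/\s+/, ' ').strip    # collapse whitespace runs, trim both ends
end

puts clean_text("\xC2\xA0ordin\xC3\xA0ria ")
```

Here clean_text("\xC2\xA0ordin\xC3\xA0ria ") returns "ordinària" with no leading or trailing space, and the accented letter is kept rather than converted to ASCII.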