I'm trying to parse an HTML page with Nokogiri, but I'm having trouble with the text: I cannot get rid of unwanted characters. Whenever I obtain a String while parsing, I try to clean it up as much as possible by converting runs of whitespace and non-printable characters into single spaces. After a lot of modifications, I use this method, without success:
require 'cgi'

def clear_string(str)
  CGI::unescapeHTML(str).gsub(/\s+/mu, " ").strip
end
For instance, suppose this HTML fragment (copy-pasted from http://www.gisa.cat/gisa/servlet/HomeLicitation?licitationID=1061525):
<tr>
<td><span class="linkred2">Tramitació:</span></td>
<td> ordinària </td>
</tr>
Here are some intermediate example outputs, shown by NetBeans 7.0, using Nokogiri and clear_string (the method defined above):
row.at("td[1]").text # => "Tramitació:"
row.at("td[2]").text # => " ordinària "
clear_string(row.at("td[2]").text) # => " ordinària"
row.at("td[2]").text.scan(/./mu) # => ["\302\240", "o", "r", "d", "i", "n", "\303\240", "r", "i", "a", " "]
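As far as I can tell, the leading "\302\240" is the UTF-8 byte pair for U+00A0 (NO-BREAK SPACE), which Nokogiri presumably decodes from an &nbsp; entity in the HTML. A quick sketch in irb (assuming Ruby 1.9+; the string literal syntax differs in 1.8) shows that Ruby's \s does not cover that character, while the POSIX class [[:space:]] does:

```ruby
# U+00A0 (no-break space) followed by the visible text, as Nokogiri returns it.
str = "\u00A0ordinària "

# \s only matches ASCII whitespace, so neither strip nor /\s+/ touches U+00A0.
str =~ /\A\s/           # => nil (no match)
# The POSIX bracket class is Unicode-aware for UTF-8 strings.
str =~ /\A[[:space:]]/  # => 0 (match at the first character)

# Normalizing with [[:space:]] instead of \s removes the no-break space too.
str.gsub(/[[:space:]]+/, " ").strip  # => "ordinària"
```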
I don't know why strip doesn't get rid of the leading space. Moreover, the parsing result, after applying clear_string, is dumped into a YAML file using YAML::dump. The dumped contents for the two texts are, respectively:
"Tramitaci\xC3\xB3:"
!binary |
wqBvcmRpbsOgcmlh
The first one seems more or less OK, but I don't know how to fix the second case.
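For what it's worth, I suspect the !binary form is the YAML engine falling back to base64 because of the U+00A0 byte sequence; once that character is removed, the cleaned UTF-8 string seems to round-trip as plain text. A minimal check, assuming a modern Ruby where Psych is the default YAML engine:

```ruby
require 'yaml'

# After the no-break space is stripped out, the string is ordinary
# UTF-8 text and Psych dumps it without resorting to !binary.
clean = "ordinària"
dumped = YAML.dump(clean)

YAML.load(dumped) == clean  # => true (round-trips as plain text)
```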