Equivalent of Iconv.conv("UTF-8//IGNORE", ...) in Ruby

I'm reading data from a remote source, and occasionally get some characters in another encoding. They're not important.

I'd like to get a "best guess" UTF-8 string, and ignore the invalid data.

The main goal is to get a string I can use without running into errors such as:

Encoding::UndefinedConversionError: "\xFF" from ASCII-8BIT to UTF-8:
invalid byte sequence in utf-8

Tags: ruby encoding utf-8 iconv

6 answers

淡お忘

#2 · 2019-03-12 22:55

String#chars or String#each_char can also be used.

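# Table 3-8. Use of U+FFFD in UTF-8 Conversion
# (http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf)
# The trailing + keeps both lines in one expression; a line that starts
# with + would be parsed as a separate statement.
str = "\x61"+"\xF1\x80\x80"+"\xE1\x80"+"\xC2" +
      "\x62"+"\x80"+"\x63"+"\x80"+"\xBF"+"\x64"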

p [
  'abcd' == str.chars.collect { |c| (c.valid_encoding?) ? c : '' }.join,
  'abcd' == str.each_char.map { |c| (c.valid_encoding?) ? c : '' }.join
]

String#scrub can be used since Ruby 2.1.

p [
  'abcd' == str.scrub(''),
  'abcd' == str.scrub{ |c| '' }
]
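
For the case in the question, where the bytes arrive tagged as ASCII-8BIT, a minimal sketch might relabel the data before scrubbing it. This assumes the payload is meant to be UTF-8 and that Ruby 2.1 or later is available; the sample bytes and variable names are made up for illustration.

```ruby
# Hypothetical payload: the UTF-8 bytes for "caf" plus e-acute, then one
# stray 0xFF byte, tagged as binary (ASCII-8BIT) like raw network data.
raw = "caf\xC3\xA9 \xFF ok".b

# force_encoding only relabels the bytes; scrub('') then drops the invalid ones.
clean = raw.dup.force_encoding(Encoding::UTF_8).scrub('')

p clean.valid_encoding?  # true
```

The dup keeps the original binary string untouched, since force_encoding modifies its receiver in place.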

ゆ、 Hurt°

#3 · 2019-03-12 23:02

To ignore all parts of the string that aren't correctly UTF-8 encoded, the following (as you originally posted) almost does what you want.

string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")

The caveat is that encode doesn't do anything if it thinks the string is already UTF-8. So you need to change encodings, going via an encoding that can still represent the full set of Unicode characters that UTF-8 can encode. (If you don't, you'll corrupt any characters that aren't in that encoding; 7-bit ASCII would be a really bad choice!) So go via UTF-16:

string.encode('UTF-16', :invalid => :replace, :replace => '').encode('UTF-8')
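
As a rough illustration of that round trip, the sketch below wraps the line above in a helper and runs it on a made-up string. The helper name and the sample bytes are assumptions for the example only.

```ruby
# Hypothetical helper name; the body is the UTF-16 round trip shown above.
def strip_invalid_via_utf16(string)
  string.encode('UTF-16', :invalid => :replace, :replace => '').encode('UTF-8')
end

# UTF-8 literal containing a stray byte (\xFF) and a truncated sequence (\xE1\x80).
dirty = "a\xFFb\xE1\x80c"
clean = strip_invalid_via_utf16(dirty)

p clean.valid_encoding?  # true
```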

甜甜的少女心

#4 · 2019-03-12 23:05

I have not had luck with the one-line uses of String#encode à la string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?"). They do not work reliably for me.

But I wrote a pure-Ruby "backfill" of String#scrub for MRI 1.9 or 2.0, or any other Ruby that does not offer String#scrub.

https://github.com/jrochkind/scrub_rb

It makes String#scrub available in rubies that don't have it; if loaded in MRI 2.1, it will do nothing and you'll still be using the built-in String#scrub, so it can allow you to easily write code that will work on any of these platforms.

Its implementation is somewhat similar to some of the other char-by-char solutions proposed in other answers, but it does not use exceptions for flow control (don't do that), is tested, and provides an API compatible with MRI 2.1 String#scrub.
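
To give a feel for the general idea (this is not the gem's code), a sketch of a helper that prefers the built-in String#scrub and otherwise filters character by character, without rescuing exceptions, could look like this. The method name is hypothetical.

```ruby
# Hypothetical helper, not taken from scrub_rb: prefer the built-in
# String#scrub and fall back to filtering characters one by one.
def drop_invalid_bytes(str)
  if str.respond_to?(:scrub)
    str.scrub('')
  else
    # each_char yields invalid bytes as separate strings that fail valid_encoding?.
    str.each_char.select { |c| c.valid_encoding? }.join
  end
end

p drop_invalid_bytes("ab\xC2cd")  # "abcd"
```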

放荡不羁爱自由

#5 · 2019-03-12 23:14

I thought this was it:

string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")

will replace all unknowns with '?'.

To ignore all unknowns, :replace => '':

string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")

Edit:

I'm not sure this is reliable. I've gone into paranoid-mode, and have been using:

string.encode("UTF-8", ...).force_encoding('UTF-8')

The script seems to be running OK now. But I'm pretty sure I'd gotten errors with this earlier.

Edit 2:

Even with this, I continue to get intermittent errors. Not every time, mind you. Just sometimes.
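
The error message in the question mentions ASCII-8BIT, so one possible explanation for failures like these is that some strings arrive tagged as binary rather than as UTF-8. The sketch below, using made-up sample data, shows how encode behaves on such a string; it is an illustration, not a diagnosis of the script above.

```ruby
# Hypothetical sample: the UTF-8 bytes for "caf" plus e-acute, tagged as binary.
binary = "caf\xC3\xA9".b

# A plain conversion raises, because bytes >= 0x80 have no mapping from ASCII-8BIT.
error_class =
  begin
    binary.encode('UTF-8')
    nil
  rescue Encoding::UndefinedConversionError => e
    e.class
  end

# :undef => :replace avoids the error, but drops every non-ASCII byte.
lossy = binary.encode('UTF-8', :undef => :replace, :replace => '')

# Relabeling keeps the bytes; valid_encoding? tells whether they really are UTF-8.
relabeled = binary.dup.force_encoding('UTF-8')

p error_class                # Encoding::UndefinedConversionError
p lossy                      # "caf"
p relabeled.valid_encoding?  # true
```

Checking str.encoding and str.valid_encoding? on the inputs that fail is a cheap way to tell these two situations apart.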

够拽才男人

#6 · 2019-03-12 23:18

This works great for me:

"String".encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "").force_encoding('UTF-8')

孤傲高冷的网名

#7 · 2019-03-12 23:18

With a bit of help from @masakielastic, I have solved this problem for my own purposes using the String#chars method.

The trick is to break the string down into individual characters, each handled in its own block, so that Ruby can fail on one character at a time.

Ruby needs to fail when it confronts binary data and the like. If you don't allow Ruby to go ahead and fail, it's a tough road when it comes to this stuff. So I use the String#chars method to break the given string into an array of characters. Then I pass each character into a sanitizing method that allows the code to have "microfailures" (my coinage) within the string.

So, given a "dirty" string, let's say you used File#read on a picture (my case):

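# Characters that are kept: ASCII letters, digits, and some punctuation.
ALLOWED = Regexp.union(/[a-zA-Z0-9]/, " ", ".", "?", "-", "+", "/", ",", "(", ")")

# Defined before the block below so that it exists when the block calls it.
def num_or_letter?(char)
  !!(char =~ ALLOWED)
end

dirty = File.read(filepath)
clean_chars = dirty.chars.select do |c|
  begin
    num_or_letter?(c)
  rescue ArgumentError
    # Matching a regexp against an invalid byte sequence raises; drop that char.
    false
  end
end
clean = clean_chars.join("")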

Allowing the code to fail somewhere along the way seems to be the best way to move through it. So long as you contain those failures within blocks, you can grab whatever is readable by the UTF-8-only-accepting parts of Ruby.
