Why do I get an “Invalid Byte Sequence in UTF-8” e

Posted 2020-05-01 04:22

I'm writing a Ruby script to process a large text file, and keep getting an odd encoding error. Here's the situation:

input_data = File.new(in_path, 'r').read
p input_data.encoding.name   #   UTF-8 
break_char = "\r".encode("UTF-8")
p break_char # "\r"
p break_char.encoding.name # "UTF-8" 
p Encoding.compatible?(input_data, break_char) # #<Encoding:UTF-8>

Calling input_data.split(break_char) then raises: in 'split': invalid byte sequence in UTF-8 (ArgumentError)

I read http://blog.grayproductions.net/articles/ruby_19s_string and looked at other solutions to apparently the same problem, but still can't work out why it's happening when I believe I am controlling the encodings.

I'm on OS X, using Ruby 1.9.2.

Reply #2 · 2020-05-01 04:56

Evidently your input file is not UTF-8 (or at least, not entirely). If you don't care about non-ASCII characters, you can simply treat the file as ASCII-8BIT (binary) encoded. BTW, your separator (break_char) is not causing the problem, as a carriage return is encoded the same way in UTF-8 as in ASCII. The example below uses a comma as the separator, which behaves the same way.

fname = 'test.in'

# create an example file and fill it with an invalid UTF-8 sequence
File.open(fname, 'w') do |f|
  f.write "\xc3\x28"
end

# then try to read and parse it
s = File.open(fname) do |f| # file opened with the default encoding (UTF-8 here)
#s = File.open(fname, 'r:ascii-8bit') do |f| # file opened as ASCII-8BIT
  f.read
end
p s.split ','
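
The binary-read approach can be wrapped in a small helper. This is only a sketch: the name split_raw is hypothetical, and it assumes you are happy to work with raw bytes (ASCII-8BIT) and do not need the non-ASCII characters decoded.

```ruby
# Hypothetical helper: read a file as raw bytes and split it on a separator.
# Binary strings are never "invalid", so no UTF-8 validation can fail here.
def split_raw(path, separator)
  data = File.open(path, 'rb') { |f| f.read } # tagged ASCII-8BIT
  # Tag the separator as binary too, so both strings share one encoding.
  data.split(separator.dup.force_encoding(Encoding::BINARY))
end
```

The pieces that come back are still tagged ASCII-8BIT, so any non-ASCII bytes in them stay undecoded until you pick an encoding for them.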

Reply #3 · 2020-05-01 04:56

Try reading the file in binary mode, which tags the data as ASCII-8BIT instead of UTF-8:

input_data = File.open("path/your_file.pdf", "rb") {|io| io.read}


Reply #4 · 2020-05-01 05:03

Here are 2 common situations and how to deal with them:

Situation 1

You have a UTF-8 input file that may contain a few invalid bytes.
Remove the invalid bytes:

test = "Partly valid\xE4 UTF-8 encoding: äöüß"
File.open( 'input_file', 'w' ) {|f| f.write(test)}

# Assumes Ruby 2.1+ (String#scrub); drops every byte that is not valid UTF-8
str  = File.read( 'input_file' ).scrub( '' )
# => "Partly valid UTF-8 encoding: äöüß"
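
A minimal sketch of the removal step as a reusable method. The name strip_invalid_utf8 is hypothetical. String#scrub only exists in Ruby 2.1 and later, so the UTF-16 round trip is included as an assumed fallback for older versions such as 1.9.

```ruby
# Hypothetical helper: return a copy of str with all invalid UTF-8 bytes removed.
def strip_invalid_utf8(str)
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  if utf8.respond_to?(:scrub)
    utf8.scrub('') # Ruby 2.1+
  else
    # Older Rubies: transcoding to UTF-16 and back drops the invalid bytes.
    utf8.encode(Encoding::UTF_16BE, invalid: :replace, replace: '')
        .encode(Encoding::UTF_8)
  end
end
```

Passing '' as the replacement deletes the bad bytes; pass '?' instead if you would rather see where they were.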

Situation 2

You have an input file that could be in either UTF-8 or ISO-8859-1 encoding.
Check whether it is valid UTF-8 and, if it is not, convert it from ISO-8859-1 to UTF-8:

test = "String in ISO-8859-1 encoding: \xE4\xF6\xFC\xDF"
File.open( 'input_file', 'w' ) {|f| f.write(test)}

str  = File.read( 'input_file' )

unless str.valid_encoding?
  str.encode!( 'UTF-8', 'ISO-8859-1', invalid: :replace )
end #unless
# => "String in ISO-8859-1 encoding: äöüß"


  • The code snippets above assume that Ruby treats all your strings as UTF-8 by default. Although this is almost always the case, you can make sure that your string literals are UTF-8 by starting your scripts with # encoding: UTF-8. (Strings read from files are tagged with Encoding.default_external instead.)

  • It is programmatically possible to detect invalid byte sequences in most multi-byte encodings like UTF-8 (in Ruby, see #valid_encoding?). However, it is NOT possible (or at least extremely hard) to programmatically detect invalid data in single-byte encodings like ISO-8859-1, because nearly every byte value maps to some character there. Thus the above code snippet does not work the other way around, i.e. detecting whether a String is valid ISO-8859-1.

  • Even though UTF-8 has become increasingly popular as the default encoding in computer systems, ISO-8859-1 and other Latin-1 flavors are still very popular in Western countries, especially in North America. Be aware that there are several single-byte encodings that are very similar to ISO-8859-1 but differ from it slightly. Examples: CP1252 (a.k.a. Windows-1252), ISO-8859-15.
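
The last two notes can be illustrated with a short sketch. The helper names valid_as? and decode_bytes are hypothetical; the encodings are the ones named above.

```ruby
# Hypothetical helper: would these raw bytes be valid in the given encoding?
def valid_as?(bytes, encoding)
  bytes.dup.force_encoding(encoding).valid_encoding?
end

# Hypothetical helper: interpret raw bytes in source_encoding, convert to UTF-8.
def decode_bytes(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
end

latin1_bytes = "\xE4\xF6".dup.force_encoding(Encoding::BINARY)

valid_as?(latin1_bytes, 'UTF-8')      # false: not a valid UTF-8 sequence
valid_as?(latin1_bytes, 'ISO-8859-1') # true: any byte passes as ISO-8859-1

# The same byte decodes to different characters in the similar encodings:
euro_byte = "\x80".dup.force_encoding(Encoding::BINARY)
decode_bytes(euro_byte, 'Windows-1252') # euro sign, U+20AC
decode_bytes(euro_byte, 'ISO-8859-1')   # control character, U+0080
```

Because the ISO-8859-1 check always passes, a validity test can only tell you that data is not UTF-8. It cannot tell you which single-byte encoding the data is actually in.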

[ruby] [encoding] [utf8] [file-encoding] [character-encoding]

Reply #5 · 2020-05-01 05:08

I fail to get an error here on Linux even when the input file is not UTF-8. (I'm using Ruby 1.9.2, as well.)

Logically, either this problem is linked with OS X, or it has something to do with your input data. Does it happen regardless of the data in the input file?

(I realise that this is not a proper answer, but I lack the rep to add a comment. And since no-one has responded yet, I thought it better than nothing...)

Reply #6 · 2020-05-01 05:08

You read the file using the default encoding your system provides. So Ruby tags the string as UTF-8, which doesn't mean it really contains UTF-8 data. Try running file <input file> to guess what encoding the file actually uses, then tell Ruby it's that one (unclean: force_encoding(<encoding>); clean: tell the File object what encoding it is, though I don't know how to do that) and then use encode!("UTF-8") to convert it to UTF-8.
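
One way to do the "clean" variant is the external:internal encoding form of the mode string accepted by File.open. This is a sketch only: the helper name read_as_utf8 is hypothetical, and the source encoding is whatever you determined for your own file.

```ruby
# Hypothetical helper: read a file stored in source_encoding and get UTF-8 back.
# In the mode string "r:EXTERNAL:INTERNAL", Ruby transcodes from the external
# (on-disk) encoding to the internal one while reading.
def read_as_utf8(path, source_encoding)
  File.open(path, "r:#{source_encoding}:UTF-8") { |f| f.read }
end
```

For example, read_as_utf8(in_path, 'ISO-8859-1').split("\r") would then operate on valid UTF-8, provided the file really is ISO-8859-1.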
