Rails/Ruby invalid byte sequence in UTF-8 even after force_encoding

Posted 2019-09-02 07:47

Question:

I'm trying to iterate over a remote nginx log file (compressed .gz file) in Rails and I'm getting this error at some point in the file:

HTTPArgumentError: invalid byte sequence in UTF-8

I also tried forcing the encoding, although it seems the encoding was already UTF-8:

logfile = logfile.force_encoding("UTF-8")

The method that I'm using:

  require "open-uri"   # needed so open() can fetch an http:// URL
  require "zlib"

  def remote_update
    uri = "http://" + self.url + "/localhost.access.log.2.gz"
    source = open(uri)
    gz = Zlib::GzipReader.new(source)
    logfile = gz.read

    # prints UTF-8
    print logfile.encoding.name
    logfile = logfile.force_encoding("UTF-8")

    # prints UTF-8
    print logfile.encoding.name

    logfile.each_line do |line|
      print line[/\/someregex\/1\/(.*)\//, 1]
    end
  end

I'm really trying to understand why this is happening (I looked through other SO threads with no success). What's wrong here?

Update:

Added exception's trace:

HTTPArgumentError: invalid byte sequence in UTF-8
    from /Users/T/workspace/sample_app/app/models/server.rb:25:in `[]'
    from /Users/T/workspace/sample_app/app/models/server.rb:25:in `block in remote_update'
    from /Users/T/workspace/sample_app/app/models/server.rb:24:in `each_line'
    from /Users/T/workspace/sample_app/app/models/server.rb:24:in `remote_update'
    from (irb):2
    from /Users/T/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/railties-4.2.5/lib/rails/commands/console.rb:110:in `start'
    from /Users/T/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/railties-4.2.5/lib/rails/commands/console.rb:9:in `start'

Answer 1:

force_encoding doesn't change the actual string data: it only changes the label that tells Ruby which encoding to use when interpreting the bytes.

If the data is not in fact UTF-8, or contains invalid UTF-8 sequences, then force_encoding won't help. force_encoding is basically only useful when you receive raw data from somewhere, you already know what encoding it is in, and you want to tell Ruby what that encoding is.
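
A minimal illustration of the difference, using bytes that are (for the sake of example) really Latin-1:

bytes = "caf\xE9".b                        # "café" encoded as ISO-8859-1, tagged as BINARY
utf8  = bytes.dup.force_encoding("UTF-8")  # same bytes, only the label changes
utf8.valid_encoding?                       # => false, \xE9 is not a valid UTF-8 sequence

# encode, by contrast, converts the data, given the real source encoding:
fixed = bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
fixed                                      # => "café"
fixed.valid_encoding?                      # => true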

The first thing to do is determine the actual encoding in use. The charlock_holmes gem can detect encodings. A trickier case would be a file that is a mish-mash of encodings, but hopefully that isn't the case here (if it were, handling each line separately might work).
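
For example, a rough sketch of detection with charlock_holmes (API as documented in the gem's README; the detected name and confidence depend entirely on your data):

require 'charlock_holmes'

raw = gz.read   # raw bytes from the question's GzipReader (or File.binread on a local copy)
detection = CharlockHolmes::EncodingDetector.detect(raw)
# detection is a hash, e.g. {:type => :text, :encoding => "ISO-8859-1", :confidence => 70, ...}

if detection && detection[:encoding]
  logfile = raw.force_encoding(detection[:encoding]).encode("UTF-8")
end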



Answer 2:

If you have a string that is correctly tagged with its encoding and you want to transcode it to valid UTF-8, replacing any invalid characters along the way, you can use something like:

str.encode!('UTF-8', invalid: :replace, undef: :replace, replace: '?')

If you have a string tagged as UTF-8 that nevertheless contains invalid UTF-8 byte sequences, you can clean it up by transcoding via the 'binary' source encoding:

str.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '?')

Both will give you a UTF-8 string with any invalid characters replaced by question marks, which should then pass. You can also pass replace: '' to strip the bad characters entirely, or leave the option off and you'll get the \uFFFD Unicode replacement character instead.
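
Applied to the loop from the question, the cleanup could be dropped in right after reading the gzip stream (a sketch using the 'binary'-source variant above):

logfile = gz.read

# Replace anything that isn't valid UTF-8 before the regex match in String#[] can raise.
logfile = logfile.encode("UTF-8", "binary",
                         invalid: :replace, undef: :replace, replace: "?")

logfile.each_line do |line|
  print line[/\/someregex\/1\/(.*)\//, 1]
end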

My guess is that the source file, before it was gzipped, had some binary data, corruption, or invalid UTF-8 logged into it.

This question has been asked and answered on Stack Overflow before. The following blog post has good information:

https://robots.thoughtbot.com/fight-back-utf-8-invalid-byte-sequences

And here's a prior example of a SO answer:

https://stackoverflow.com/a/18454435/506908