Is there any way to clean a file of “invalid byte sequence in UTF-8” errors?

Question:

I have tried everything before posting to Stack Overflow. I really hope someone can help, but I'm pretty desperate.

So, I have a service that uploads data to our database via XML feeds provided by customers. Often these XML files claim to be UTF-8 encoded but clearly have quite a few invalid byte sequences. I can clean up these files and import them perfectly into our database by simply running the following Linux command before importing:

tr -cd '^[:print:]' < original.xml > clean.xml

Simply running this one Linux command allows me to import all of the data into my database using Nokogiri in Rails.

The problem is that we're deploying on Heroku and I can't preprocess the file with a Linux command. I've spent the last week searching the Internet for native Rails-based solutions to this problem, but none of them work. Before I run through all the suggestions I've tried, here is my original code:

require 'nokogiri'
require 'open-uri'   # so Kernel#open can also handle URLs

data = []
data_source = ARGV[0]
data_file = open(data_source)
data_string = data_file.read
doc = Nokogiri::XML.parse(data_string)
doc.xpath(".//job").each do |node|
  # Build one hash per <job> element, keyed by child element name
  hash = node.element_children.each_with_object(Hash.new) do |e, h|
    h[e.name.gsub(/ /, "_").strip.downcase.to_sym] = e.content
  end
  data.push(hash)
end

Running this on the raw file produces an error: "Invalid byte sequence in UTF-8"
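For reference, here is a minimal way to reproduce that error in plain Ruby (a sketch; \xC3 is a UTF-8 lead byte that is invalid without a continuation byte after it):

    s = "valid \xC3\x28 invalid"   # \xC3 is not followed by a continuation byte
    s.valid_encoding?              # => false
    s.gsub(/x/, "y")               # => ArgumentError: invalid byte sequence in UTF-8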

Here are all the helpful suggestions I've tried but all have failed.

  1. Use Coder

    Coder.clean!(data_string, "UTF-8")

  2. Force Encoding

    data_string.force_encoding('BINARY').encode('UTF-8', :undef => :replace, :replace => '')

  3. Convert to UTF-16 and back to UTF-8

    data_string.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
    data_string.encode!('UTF-8', 'UTF-16')

  4. Use valid_encoding?

    data_string.chars.select{|i| i.valid_encoding?}.join

    No characters are removed; generates "invalid byte sequence" errors.

  5. Specify encoding on opening the file

I actually wrote a function that tries every encoding possible until it can open the file without errors and convert to UTF-8 (@file_encodings is an array of every possible file encoding):

@file_encodings.each do |enc|
  print "#{enc}..."
  conv_str = "r:#{enc}:utf-8"   # read as enc, transcode to UTF-8
  begin
    data_file = File.open(fname, conv_str)
    data_string = data_file.read
  rescue
    data_file = nil
    data_string = ""
  end
  data_string = data_string.encode("utf-8", :invalid => :replace, :undef => :replace, :replace => "")

  unless data_string.blank?
    print "\n#{enc} detected!\n"
    return data_string
  end
end
  6. Use Regexp to remove non-printables:

    data_string.gsub!(/[^[:print:]]/,"")
    data_string.gsub!(/[[:cntrl:]&&[^\n\r]]/,"")

(I also tried variants including /[^a-zA-Z0-9~`!@#$%^&*()-_=+[{]}\|;:'",<.>/\?]/)

For ALL of the above, the results are the same... either "invalid byte sequence" errors occur or the file is cut off halfway through after reading only 4400 rows.

So, why does the Linux "tr" command work perfectly, and yet NONE of these suggestions can do the job in Rails?
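My best guess at the difference: in the C locale, tr -cd '[:print:]' deletes raw bytes outside printable ASCII without ever decoding them, while Ruby's string and regexp methods have to decode the UTF-8 first and raise on invalid bytes. A byte-level sketch that mimics tr in Ruby (it keeps printable ASCII plus newlines, so, like tr, it also drops legitimate multibyte characters) would be:

    clean = data_string.bytes
                       .select { |b| (0x20..0x7E).cover?(b) || b == 0x0A || b == 0x0D }
                       .pack("C*")               # reassemble the surviving bytes
                       .force_encoding("UTF-8")  # all ASCII, so valid UTF-8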

What I ended up doing is extremely inelegant, but gets the job done. I inspected each row that stopped Nokogiri (row.last) and looked for strange characters. Each one I found I added to a character class and then gsub!ed it out, like this (the control characters won't print here, but you get the idea):

data_string.gsub!(/[Crazy Control Characters]/,"")

But the purist in me insists there should be a more elegant, general solution.


Answer 1:

Ruby 2.1 has a new method called String#scrub, which is exactly what you need.

If the string contains an invalid byte sequence, invalid bytes are replaced with the given replacement character; otherwise it returns self. If a block is given, invalid bytes are replaced with the value returned by the block.

Check the docs for more info.

http://ruby-doc.org/core-2.1.0/String.html#method-i-scrub
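For example, a quick sketch of its three forms on a string containing one bad byte:

    bad = "abc\xC3\x28def"
    bad.scrub        # => "abc\uFFFD(def"  (default replacement character)
    bad.scrub("")    # => "abc(def"        (drop the invalid bytes)
    bad.scrub { |bytes| "<#{bytes.unpack('H*')[0]}>" }  # => "abc<c3>(def"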



Answer 2:

I found this on Stack Overflow in answer to another question, and it worked fine for me too. Assuming data_string is your XML:

data_string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
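A sketch of how this might slot into the original import; File.binread is an assumption here, chosen so that the 'binary' source encoding is truthful (the file is read as raw ASCII-8BIT bytes):

    data_string = File.binread(data_source)   # raw bytes, tagged ASCII-8BIT
    data_string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
    doc = Nokogiri::XML.parse(data_string)

Note that converting from binary this way drops every byte above 0x7F, including legitimate multibyte UTF-8 characters.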



Answer 3:

Try using a combination of force_encoding("ISO-8859-1") and encode("utf-8"). This helped me once.

data_string.force_encoding("ISO-8859-1").encode("utf-8", replace: nil)


Answer 4:

Thanks for the responses. After testing all sorts of combinations of the different tools, I did find something that works. I hope this is helpful to other people who share the same frustration.

data_string.encode!("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "" )
data_string.gsub!(/[[:cntrl:]&&[^\n\r]]/,"")

As you can see, it's a combination of the "encode" method and a regexp to remove control characters (except for newlines).

My testing revealed that the file I was importing had TWO problems: (1) invalid UTF-8 byte sequences; and (2) unprintable control characters that forced Nokogiri to stop parsing before the end of the file. I had to fix both problems, in that order, otherwise gsub! throws the "invalid byte sequence" error.

Note that the first line in the code above could be substituted with EITHER of the following with the same successful result:

Coder.clean!(data_string,'UTF-8')

or

data_string.scrub!("")

This worked perfectly for me.
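Putting it all together, a minimal end-to-end sketch (the clean_xml helper name is just for illustration):

    require 'nokogiri'

    # Illustrative helper wrapping the two-step fix above.
    def clean_xml(raw)
      cleaned = raw.encode("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "")
      cleaned.gsub(/[[:cntrl:]&&[^\n\r]]/, "")  # strip control characters except \n and \r
    end

    doc = Nokogiri::XML.parse(clean_xml(File.read(ARGV[0])))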