I have tried everything before posting to StackOverflow I really hope someone can help, but I'm pretty desperate.
So, I have a service that uploads data to our database via XML feeds provided by customers. Often these XML files claim to be UTF-8 encoded but clearly have quite a few invalid byte sequences. I can clean up these files and import them perfectly into our database by simply running the following Linux command before importing:
tr -cd '^[:print:]' < original.xml > clean.xml
Simply running this one Linux command allows me to import all of the data into my database using Nokogiri in Rails.
The problem is that we're deploying on Heroku and I can't preprocess the file with a Linux command. I've spent the last week searching the Internet for native Rails based solutions to this problem, but none of them work. Before I run through all the suggestions I've tried, here is my original code:
data_source = ARGV[0]
data_file = open data_source
data_string = data_file.read
doc = Nokogiri::XML.parse(data_string)
doc.xpath(".//job").each do |node|
hash = node.element_children.each_with_object(Hash.new) do |e, h|
h[e.name.gsub(/ /,"_").strip.downcase.to_sym] = e.content
data.push(newrow)
end
end
Running this on the raw file produces an error: "Invalid byte sequence in UTF-8"
Here are all the helpful suggestions I've tried but all have failed.
Use Coder
Coder.clean!(data_string, "UTF-8")
Force Encoding
data_string.force_encoding('BINARY').encode('UTF-8', :undef => :replace, :replace => '')
Convert to UTF-16 and back to UTF-8
data_string.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '') data_string.encode!('UTF-8', 'UTF-16')
Use valid_encoding?
data_string.chars.select{|i| i.valid_encoding?}.join
No characters are removed; generates "invalid byte sequence" errors.
Specify encoding on opening the file
I actually wrote a function that tries every encoding possible until it can open the file without errors and convert to UTF-8 (@file_encodings is an array of every possible file encoding):
@file_encodings.each do |enc|
print "#{enc}..."
conv_str = "r:#{enc}:utf-8"
begin
data_file = File.open(fname, conv_str)
data_string = data_file.read
rescue
data_file = nil
data_string = ""
end
data_string = data_string.encode(enc, :invalid => :replace, :undef => :replace, :replace => "")
unless data_string.blank? print "\n#{enc} detected!\n"
return data_string
end
Use Regexp to remove non-printables:
data_string.gsub!(/[^[:print:]]/,"") data_string.gsub!(/[[:cntrl:]&&[^\n\r]]/,"")
(I also tried variants including /[^a-zA-Z0-9~`!@#$%^&*()-_=+[{]}\|;:'",<.>/\?]/)
For ALL of the above, the results are the same... either "invalid byte sequence" errors occur or the file is cut off halfway through after reading only 4400 rows.
So, why does the Linux "tr" command work perfectly and yet NONE of these suggestions can do the job in Rails.
What I ended up doing is extremely inelegant, but gets the job done. I inspected each row that stopped Nokogiri (row.last) and looked for strange characters. Each one I found I added to a character class and then gsub!ed it out, like this (the control characters won't print here, but you get the idea):
data_string.gsub!(/[Crazy Control Characters]/,"")
But the purist in me insists there should be a more elegant, general solution.
(I've indented all my code four spaces, but the editor doesn't seem to be picking that up.)