Despite numerous SO threads on the topic, I'm having trouble parsing a CSV file. It's a .csv file downloaded from the AdWords Keyword Planner. Previously, AdWords had the option of exporting data as 'plain CSV' (which could be parsed with the Ruby CSV library); now the options are either AdWords CSV or Excel CSV. BOTH of these formats cause this problem (illustrated by a terminal session):
file = File.open('public/uploads/testfile.csv')
=> #<File:public/uploads/testfile.csv>
file.read.encoding
=> #<Encoding:UTF-8>
require 'csv'
=> true
CSV.foreach(file) { |row| puts row }
ArgumentError: invalid byte sequence in UTF-8
Let's change the encoding and see if that helps:
file.close
=> nil
file = File.open("public/uploads/testfile.csv", "r:ISO-8859-1")
=> #<File:public/uploads/testfile.csv>
file.read.encoding
=> #<Encoding:ISO-8859-1>
CSV.foreach(file) { |row| puts row }
ArgumentError: invalid byte sequence in UTF-8
Let's try using a different CSV library:
require 'smarter_csv'
=> true
file.close
=> nil
file = SmarterCSV.process('public/uploads/testfile.csv')
ArgumentError: invalid byte sequence in UTF-8
Is this a no-win situation? Do I have to roll my own CSV parser?
I'm using Ruby 1.9.3p374. Thanks!
UPDATE 1:
Using the suggestions in the comments, here's the current version:
file_contents = File.open("public/uploads/new-format/testfile-adwords.csv", 'rb').read
require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
  file_contents.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
  file_contents.encode!('UTF-8', 'UTF-16')
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end
file_contents.gsub!(/\0/, '') # needed because otherwise, I get "string contains null byte (ArgumentError)"
CSV.foreach(file_contents, :headers => true, :header_converters => :symbol) do |row|
  puts row
end
This doesn't work - now I get a "file name too long" error.
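Presumably this is because CSV.foreach expects a file name, so the whole contents string gets treated as a path; CSV.parse accepts a string directly. A sketch of that change:

# Sketch: hand the already-read (and cleaned-up) string to CSV.parse instead
# of CSV.foreach, which wants a file name rather than the data itself.
CSV.parse(file_contents, :headers => true, :header_converters => :symbol) do |row|
  puts row
end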
There are two things to solve here when dealing with the AdWords Keyword Planner download: one is the encoding, and the other is the fact that the delimiters are tabs and not commas! So stepping over the CSV file is as simple as this:
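Something along these lines (a sketch; the path is the one from the question, and treating the download as little-endian UTF-16 with a BOM is an assumption about this particular export):

require 'csv'

# Sketch: tab-separated file, read as UTF-16LE with the BOM stripped, each
# field converted to UTF-8 for printing. Path and encoding are assumptions.
CSV.foreach("public/uploads/testfile.csv",
            :col_sep => "\t",
            :encoding => "bom|utf-16le") do |row|
  puts row.map { |field| field && field.encode("utf-8") }.inspect
end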
FYI: The \t must be in double quotes so it is interpreted as a tab and not as the literal string \t.

Looking at the file in question, the byte order mark fffe at the start suggests the file encoding is little-endian UTF-16, and the 00 bytes at every other position back this up. This would suggest that you should be able to do this:
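Presumably something like this (a sketch; the path and tab separator are carried over from above):

# Sketch: read the file as plain UTF-16LE, without any BOM handling.
CSV.foreach("public/uploads/testfile.csv",
            :col_sep => "\t",
            :encoding => "utf-16le") do |row|
  puts row.inspect
end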
However that gives me invalid byte sequence in UTF-16LE (ArgumentError) coming from inside the CSV library. I think this is due to IO#gets only returning a single byte for some reason when faced with the BOM when called in CSV, resulting in the invalid UTF-16. You can get CSV to strip off the BOM by using bom|utf-16le as the encoding:
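For instance (a sketch; here the encoding goes in the mode string of CSV.open, with the path and tab separator assumed as before):

# Sketch: the bom| prefix makes Ruby consume the byte order mark, so the
# rest of the stream is valid UTF-16LE as far as CSV is concerned.
CSV.open("public/uploads/testfile.csv", "rb:bom|utf-16le", :col_sep => "\t") do |csv|
  csv.each { |row| puts row.inspect }
end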
You might prefer to convert the string to a more familiar encoding instead, in which case you could do:
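For example (a sketch: slurp the file with the BOM stripped, transcode to UTF-8, then parse the resulting string; the path and tab separator are assumptions):

# Sketch: read as UTF-16LE minus the BOM, convert to UTF-8, then parse.
contents = File.read("public/uploads/testfile.csv", :encoding => "bom|utf-16le").encode("utf-8")

CSV.parse(contents, :col_sep => "\t") do |row|
  puts row.inspect
end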
Both of these appear to work okay.
Converting the file to UTF-8 first and then reading it also works nicely:
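For instance (a sketch that assumes the command-line iconv tool is available; the exact invocation and the converted file name are assumptions):

# Sketch: produce a UTF-8 copy with iconv (which detects and drops the BOM),
# then parse the converted file as ordinary tab-separated UTF-8.
system("iconv -f utf-16 -t utf-8 public/uploads/testfile.csv > public/uploads/testfile-utf8.csv")

CSV.foreach("public/uploads/testfile-utf8.csv", :col_sep => "\t", :headers => true) do |row|
  puts row.inspect
end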
Iconv seems to understand correctly that the file has a BOM at the start and strips it off when converting.