How can I convert a string from windows-1252 to ut

Posted 2019-01-11 20:32

Question:

I'm migrating some data from MS Access 2003 to MySQL 5.0 using Ruby 1.8.6 on Windows XP (writing a Rake task to do this).

Turns out the Windows string data is encoded as windows-1252 and Rails and MySQL are both assuming utf-8 input so some of the characters, such as apostrophes, are getting mangled. They wind up as "a"s with an accent over them and stuff like that.

Does anyone know of a tool, library, system, methodology, ritual, spell, or incantation to convert a windows-1252 string to utf-8?

Answer 1:

For Ruby 1.8.6, it appears you can use Ruby Iconv, part of the standard library:

Iconv documentation

According to this helpful article, you can at least purge the unwanted win-1252 characters (that is, any bytes that are not valid UTF-8) from your string like so:

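require 'iconv'

# UTF-8 to UTF-8 with //IGNORE drops byte sequences that are not valid UTF-8.
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
# The appended space (removed again by [0..-2]) lets Iconv also catch a bad byte at the very end.
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]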

One might then attempt to do a full conversion like so:

ic = Iconv.new('UTF-8', 'WINDOWS-1252')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]
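
Iconv was removed from the standard library in Ruby 2.0, so on newer Rubies the purge step has to be done another way. A minimal sketch using String#scrub (Ruby 2.1+), assuming the input is meant to be UTF-8; the helper name purge_invalid_utf8 is made up for illustration and is not from the article:

```ruby
# Hypothetical helper: drop byte sequences that are not valid UTF-8,
# roughly what the 'UTF-8//IGNORE' converter does.
def purge_invalid_utf8(untrusted_string)
  # dup so the caller's string is not re-tagged; scrub('') deletes invalid bytes.
  untrusted_string.dup.force_encoding('UTF-8').scrub('')
end

# 0xE9 is "é" in windows-1252 but is not valid UTF-8 on its own here.
puts purge_invalid_utf8("abc\xE9def")
```

Like the Iconv version, this only throws the offending bytes away; it does not recover the characters they stood for.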


Answer 2:

If you're on Ruby 1.9...

string_in_windows_1252 = database.get(...)
# => "Fåbulous"

string_in_windows_1252.encoding
# => #<Encoding:Windows-1252>

string_in_utf_8 = string_in_windows_1252.encode('UTF-8')
# => "Fåbulous"

string_in_utf_8.encoding
# => #<Encoding:UTF-8>
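
The encode call above only transcodes correctly when the string is already tagged as windows-1252. If the database driver hands back the raw bytes tagged as binary or UTF-8, they probably need to be re-tagged with force_encoding first. A minimal sketch under that assumption; win1252_to_utf8 is a hypothetical helper name, and replacing unmappable bytes with '?' is just one possible policy:

```ruby
# Hypothetical helper: treat the bytes as windows-1252 and transcode to UTF-8.
def win1252_to_utf8(raw)
  raw.dup
     .force_encoding('Windows-1252') # re-tag only, the bytes are unchanged
     .encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
end

# Bytes as they might arrive from the database, tagged as binary.
raw = "It\x92s F\xE5bulous".b
utf8 = win1252_to_utf8(raw)
puts utf8           # 0x92 maps to U+2019 (curly apostrophe), 0xE5 to U+00E5
puts utf8.encoding
```

force_encoding changes only the label on the string, while encode rewrites the bytes, so the order of the two calls matters.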


Answer 3:

Hi,

I had the exact same problem.

These tips helped me get going:

Always check for the proper encoding name so that you feed your conversion tools correctly. When in doubt, you can get a list of the encodings supported by iconv or recode using:

$ recode -l

or

$ iconv -l
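
Ruby 1.9 and later keep their own list of encoding names, which can be checked from irb in the same spirit. A minimal sketch; known_encoding? is a hypothetical helper, not part of Ruby:

```ruby
# Hypothetical helper: true if this Ruby recognises the encoding name.
def known_encoding?(name)
  Encoding.find(name) # raises ArgumentError for an unknown name
  true
rescue ArgumentError
  false
end

puts known_encoding?('windows-1252')
puts known_encoding?('CP1252')
# All names Ruby accepts for the same encoding:
puts Encoding.find('CP1252').names.inspect
```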

Always start from your original file and encode a sample to work with:

$ recode windows-1252..u8 < original.txt > sample_utf8.txt

or

$ iconv -f windows-1252 -t utf8 original.txt -o sample_utf8.txt

Install Ruby 1.9, because it helps you A LOT when it comes to encodings. Even if you don't use it in your program, you can always start an irb1.9 session and poke at the strings to see what the output is.

In Ruby 1.9 the 'mode' parameter of File.open also accepts encoding names. Use it! This article helped a lot: http://blog.nuclearsquid.com/writings/ruby-1-9-encodings

File.open('original.txt', 'r:windows-1252:utf-8')
# The mode string carries all the encoding options: r:windows-1252 means read the file as windows-1252, and :utf-8 means transcode it to UTF-8 internally.
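
To see the mode string at work, here is a minimal sketch that writes a few windows-1252 bytes to a temporary file and reads them back as UTF-8. The sample bytes and the helper name read_win1252_file are illustrative assumptions:

```ruby
require 'tempfile'

# Hypothetical helper: read a windows-1252 file and return its text as UTF-8.
def read_win1252_file(path)
  # External encoding windows-1252, internal encoding utf-8.
  File.open(path, 'r:windows-1252:utf-8') { |f| f.read }
end

Tempfile.create('sample') do |tmp|
  tmp.binmode
  tmp.write("caf\xE9 \x93quoted\x94".b) # raw windows-1252 bytes
  tmp.flush
  text = read_win1252_file(tmp.path)
  puts text.encoding        # encoding tag of the returned string
  puts text.valid_encoding? # whether the transcoded text is well formed
end
```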

Have fun and swear a lot!



Answer 4:

If you want to convert a file named win1252_file, on a Unix OS, run:

$ iconv -f windows-1252 -t utf-8 win1252_file > utf8_file

You should be able to do the same on Windows with Cygwin.



Answer 5:

If you're NOT on Ruby 1.9, and assuming yhager's command works, you could try

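require 'fileutils'

# Write the raw bytes in binary mode. (Appending each byte with `file << byte`
# would write the byte's decimal digits, e.g. "146", not the byte itself.)
File.open('/tmp/w1252', 'wb') do |file|
  file.write(my_windows_1252_string)
end

# Let the iconv command-line tool do the conversion.
`iconv -f windows-1252 -t utf-8 /tmp/w1252 > /tmp/utf8`

my_utf_8_string = File.read('/tmp/utf8')

# Clean up both temporary files.
FileUtils.rm(['/tmp/w1252', '/tmp/utf8'])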