Windows-1252 to UTF-8 encoding-第2页回答

I've copied certain files from a Windows machine to a Linux machine. So all the Windows encoded (windows-1252) files need to be converted to UTF-8. The files which are already in UTF-8 should not be changed. I'm planning to use the recode utility for that. How can I specify that the recode utility should only convert windows-1252 encoded files and not the UTF-8 files?

Example usage of recode:

recode windows-1252.. myfile.txt

This would convert myfile.txt from windows-1252 to UTF-8. Before doing this, I would like to know that myfile.txt is actually windows-1252 encoded and not UTF-8 encoded. Otherwise, I believe this would corrupt the file.

标签： encoding utf-8 character-encoding windows-1252

10条回答

看我几分像从前

2楼-- · 2019-01-13 04:07

If you are sure your files are either UTF-8 or Windows 1252 (or Latin1), you can take advantage of the fact that recode will exit with an error if you try to convert an invalid file.

While utf8 is valid Win-1252, the reverse is not true: win-1252 is NOT valid UTF-8. So:

recode utf8..utf16 <unknown.txt >/dev/null || recode cp1252..utf8 <unknown.txt >utf8-2.txt

Will spit out errors for all cp1252 files, and then proceed to convert them to UTF8.

I would wrap this into a cleaner bash script, keeping a backup of every converted file.

Before doing the charset conversion, you may wish to first ensure you have consistent line-endings in all files. Otherwise, recode will complain because of that, and may convert files which were already UTF8, but just had the wrong line-endings.

0人赞添加讨论(0) 举报

甜甜的少女心

3楼-- · 2019-01-13 04:10

If you want to rename multiple files in a single command ‒ let's say you want to convert all *.txt files ‒ here is the command:

find . -name "*.txt" -exec iconv -f WINDOWS-1252 -t UTF-8 {} -o {}.ren \; -a -exec mv {}.ren {} \;

0人赞添加讨论(0) 举报

乱世女痞

4楼-- · 2019-01-13 04:11

Here's a transcription of another answer I gave to a similar question:

If you apply utf8_encode() to an already UTF8 string it will return a garbled UTF8 output.

I made a function that addresses all this issues. It´s called Encoding::toUTF8().

You dont need to know what the encoding of your strings is. It can be Latin1 (iso 8859-1), Windows-1252 or UTF8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF8.

I did it because a service was giving me a feed of data all messed up, mixing UTF8 and Latin1 in the same string.

Usage:

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download:

https://github.com/neitanod/forceutf8

Update:

I've included another function, Encoding::fixUFT8(), wich will fix every UTF8 string that looks garbled.

Usage:

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂÃÂ©dÃÂÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football

Update: I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().

0人赞添加讨论(0) 举报

叛逆

5楼-- · 2019-01-13 04:11

You can change the encoding of a file with an editor such as notepad++. Just go to Encoding and select what you want.

I always prefer the Windows 1252

0人赞添加讨论(0) 举报

上一页 1 2

Windows-1252 to UTF-8 encoding

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间