I want to transcode a bunch of files from US-ASCII to UTF-8.
For that, I am using iconv:
iconv -f US-ASCII -t UTF-8 file.php > file-utf8.php
The thing is that my original files are US-ASCII encoded, which makes the conversion not happen. Apparently that occurs because ASCII is a subset of UTF-8...
http://www.linuxquestions.org/questions/linux-software-2/iconv-us-ascii-to-utf-8-or-iso-8859-15-a-705054/
And quoting:
There's no need for the textfile to appear otherwise until non-ASCII characters are introduced
True. If I introduce a non-ASCII character in the file and save it, let's say with Eclipse, the file encoding (charset) is switched to UTF-8.
In my case, I'd like to force iconv to transcode the files to UTF-8 anyway, whether they contain non-ASCII characters or not.
Note: The reason is that my PHP code (non-ASCII files...) is dealing with some non-ASCII strings, which causes the strings not to be well interpreted (French):
Il était une fois... l'homme série animée mythique d'Albert
Barillé (Procidis), 1ère
...
EDIT
- US-ASCII is a subset of UTF-8 (see Ned's answer below), which means that US-ASCII files are actually already encoded in UTF-8
- My problem came from somewhere else
Short Answer
- file only guesses at the file encoding and may be wrong (especially in cases where special characters only appear late in large files).
- you can use hexdump to look at bytes of non-7-bit-ascii text and compare against code tables for common encodings (iso-8859-*, utf-8) to decide for yourself what the encoding is (see the sketch after this list).
- iconv will use whatever input/output encoding you specify regardless of what the contents of the file are. If you specify the wrong input encoding, the output will be garbled.
- even after running iconv, file may not report any change due to the limited way in which file attempts to guess at the encoding. For a specific example, see my long answer.
- 7-bit ascii (aka us-ascii) is identical at a byte level to utf-8 and the 8-bit ascii extensions (iso-8859-*). So if your file only has 7-bit characters, then you can call it utf-8, iso-8859-* or us-ascii, because at a byte level they are all identical. It only makes sense to talk about utf-8 and other encodings (in this context) once your file has characters outside the 7-bit ascii range.
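Putting those pieces together, the sequence below is a minimal sketch of the whole workflow (the file name myfile and the iso-8859-1 guess are placeholders, not taken from the question):
$ file -b --mime-encoding myfile                  # step 1: get file's guess (may be wrong)
$ pcregrep -no '[^\x00-\x7F]' myfile | head -n1   # step 2: locate the first non-7-bit-ascii character
$ pcregrep -o '[^\x00-\x7F]' myfile | head -n1 | hexdump -v -e '1/1 "%02x\n"'   # step 3: dump its bytes
$ iconv -f iso-8859-1 -t utf-8 myfile > myfile-utf8   # step 4: convert from the encoding you identified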
Long Answer
I ran into this today and came across your question. Perhaps I can add a little more information to help other people who run into this issue.
First, the term ASCII is overloaded, and that leads to confusion.
7-bit ASCII only includes 128 characters (00-7F or 0-127 in decimal). 7-bit ASCII is also referred to as US-ASCII.
https://en.wikipedia.org/wiki/ASCII
UTF-8 uses the same encoding as 7-bit ASCII for its first 128 characters. So a text file that only contains characters from that first-128 range will be identical at a byte level whether encoded with UTF-8 or 7-bit ASCII.
https://en.wikipedia.org/wiki/UTF-8#Codepage_layout
The term extended ascii (or high ascii) refers to eight-bit or larger character encodings that include the standard seven-bit ASCII characters, plus additional characters.
https://en.wikipedia.org/wiki/Extended_ASCII
ISO-8859-1 (aka "ISO Latin 1") is a specific 8-bit ASCII extension standard that covers most characters for Western Europe. There are other ISO standards for Eastern European languages and Cyrillic languages. ISO-8859-1 includes characters like Ö, é, ñ and ß for German and Spanish. "Extension" means that ISO-8859-1 includes the 7-bit ASCII standard and adds characters to it by using the 8th bit. So for the first 128 characters, it is equivalent at a byte level to ASCII- and UTF-8-encoded files. However, when you start dealing with characters beyond the first 128, you are no longer UTF-8 equivalent at the byte level, and you must do a conversion if you want your "extended ascii" file to be UTF-8 encoded.
https://en.wikipedia.org/wiki/Extended_ASCII#ISO_8859_and_proprietary_adaptations
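You can see that subset relationship for yourself by round-tripping a pure-ASCII file through iconv and confirming that not a single byte changes (a minimal sketch; the file contents are arbitrary):
$ printf 'hello world\n' > ascii-only.txt
$ iconv -f us-ascii -t utf-8 ascii-only.txt > as-utf8.txt
$ cmp ascii-only.txt as-utf8.txt && echo "byte-identical"
byte-identical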
One lesson I learned today is that we can't trust file to always give a correct interpretation of a file's character encoding.
https://en.wikipedia.org/wiki/File_%28command%29
The command tells only what the file looks like, not what it is (in the case where file looks at the content). It is easy to fool the program by putting a magic number into a file the content of which does not match it. Thus the command is not usable as a security tool other than in specific situations.
file looks for magic numbers in the file that hint at the type, but these can be wrong; there is no guarantee of correctness. file also tries to guess the character encoding by looking at the bytes in the file. Basically, file has a series of tests that help it guess at the file type and encoding.
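As a tiny demonstration of that guessing, here is a sketch with three throwaway files (the exact labels file prints can vary between versions):
$ printf 'abc\n' > t-ascii && file -b --mime-encoding t-ascii        # pure 7-bit bytes
us-ascii
$ printf 'abc \xd6\n' > t-latin1 && file -b --mime-encoding t-latin1 # a lone high byte
iso-8859-1
$ printf 'abc \xc3\x96\n' > t-utf8 && file -b --mime-encoding t-utf8 # a valid utf-8 sequence
utf-8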
My file is a large CSV file. file reports this file as us-ascii encoded, which is WRONG.
$ ls -lh
total 850832
-rw-r--r-- 1 mattp staff 415M Mar 14 16:38 source-file
$ file -b --mime-type source-file
text/plain
$ file -b --mime-encoding source-file
us-ascii
My file has umlauts in it (i.e. Ö). The first non-7-bit-ascii character doesn't show up until over 100k lines into the file. I suspect this is why file doesn't realize the file encoding isn't US-ASCII.
$ pcregrep -no '[^\x00-\x7F]' source-file | head -n1
102321:�
I'm on a Mac, so I'm using PCRE's grep. With GNU grep you could use the -P option. Alternatively, on a Mac one could install coreutils (via Homebrew or otherwise) in order to get GNU grep.
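For reference, the GNU grep equivalent of the search above (assuming a grep built with PCRE support) would be:
$ grep -noP '[^\x00-\x7F]' source-file | head -n1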
I didn't dig into the source code of file, and the man page doesn't discuss the text-encoding detection in detail, but I am guessing that file doesn't look at the whole file before guessing the encoding.
Whatever my file's encoding is, these non-7-bit-ASCII characters break things. My German CSV file is ;-separated, and extracting a single column doesn't work:
$ cut -d";" -f1 source-file > tmp
cut: stdin: Illegal byte sequence
$ wc -l *
3081673 source-file
102320 tmp
3183993 total
Note the cut error, and that my "tmp" file has only 102320 lines, with the first special character being on line 102321.
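As an aside, the "Illegal byte sequence" error from the BSD cut on a Mac can often be worked around by forcing the C locale, though that only silences the error rather than fixing the encoding (a sketch):
$ LC_ALL=C cut -d";" -f1 source-file > tmp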
Let's take a look at how these non-ASCII characters are encoded. I dump the first non-7-bit-ascii character into hexdump, do a little formatting, remove the newlines (0a) and take just the first few bytes.
$ pcregrep -o '[^\x00-\x7F]' source-file | head -n1 | hexdump -v -e '1/1 "%02x\n"'
d6
0a
Another way: I know the first non-7-bit-ASCII character is at position 85 on line 102321. I grab that line and tell hexdump to take the two bytes starting at position 85. You can see the special (non-7-bit-ASCII) character represented by a '.', and that the next byte is 'M'... so this is a single-byte character encoding.
$ tail -n +102321 source-file | head -n1 | hexdump -C -s85 -n2
00000055 d6 4d |.M|
00000057
In both cases, we see the special character is represented by d6. Since this character is an Ö, which is a German letter, I am guessing that ISO-8859-1 should include it. Sure enough, you can see that "d6" matches (https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout).
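You can also sanity-check that mapping from the shell by decoding the single byte yourself (a one-liner sketch; bash's printf understands \x escapes):
$ printf '\xd6\n' | iconv -f iso-8859-1 -t utf-8
Ö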
Important question... how do I know this character is an Ö without being sure of the file encoding? The answer is context. I opened the file, read the text, and determined what character it is supposed to be. If I open the file in vim, it displays as an Ö, because vim does a better job of guessing the character encoding (in this case) than file does.
So, my file seems to be ISO-8859-1. In theory, I should check the rest of the non-7-bit-ASCII characters to make sure ISO-8859-1 is a good fit... There is nothing that forces a program to use only a single encoding when writing a file to disk (other than good manners).
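One way to spot-check would be to list the distinct non-7-bit-ascii bytes in the file and look each one up in the ISO-8859-1 code table (a sketch reusing the earlier pipeline, filtering out the 0a newlines that separate the matches):
$ pcregrep -o '[^\x00-\x7F]' source-file | hexdump -v -e '1/1 "%02x\n"' | sort -u | grep -v '^0a$'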
I'll skip that check and move on to the conversion step.
$ iconv -f iso-8859-1 -t utf8 source-file > output-file
$ file -b --mime-encoding output-file
us-ascii
Hmm. file still tells me this file is US-ASCII even after converting. Let's check with hexdump again.
$ tail -n +102321 output-file | head -n1 | hexdump -C -s85 -n2
00000055 c3 96 |..|
00000057
That's definitely a change. Note that we have two bytes of non-7-bit-ASCII (represented by the '.' on the right), and that the hex code for the two bytes is now c3 96. If we take a look, we seem to have UTF-8 now (c3 96 is the correct UTF-8 encoding of Ö): http://www.utf8-chartable.de/
But file still reports our file as us-ascii? Well, I believe this goes back to the point about file not looking at the whole file, and the fact that the first non-7-bit-ASCII characters don't occur until deep in the file.
I'll use sed to stick an Ö at the beginning of the file and see what happens.
$ sed '1s/^/Ö\'$'\n/' source-file > test-file
$ head -n1 test-file
Ö
$ head -n1 test-file | hexdump -C
00000000 c3 96 0a |...|
00000003
Cool, we have an umlaut. Note, though, that the encoding is c3 96 (UTF-8). Hmm.
Checking our other umlauts in the same file again:
$ tail -n +102322 test-file | head -n1 | hexdump -C -s85 -n2
00000055 d6 4d |.M|
00000057
ISO-8859-1. Oops! It just goes to show how easy it is to get the encodings screwed up.
Let's try converting our new test file, with the umlaut at the front, and see what happens.
$ iconv -f iso-8859-1 -t utf8 test-file > test-file-converted
$ head -n1 test-file-converted | hexdump -C
00000000 c3 83 c2 96 0a |.....|
00000005
$ tail -n +102322 test-file-converted | head -n1 | hexdump -C -s85 -n2
00000055 c3 96 |..|
00000057
Oops. That first umlaut, which was UTF-8, was interpreted as ISO-8859-1, since that is what we told iconv. The second umlaut was correctly converted from d6 to c3 96.
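As an aside, if an entire file ends up double-encoded like that first line, the damage can often be undone by converting in the opposite direction: decoding as UTF-8 and re-encoding as ISO-8859-1 turns c3 83 c2 96 back into c3 96. A sketch with hypothetical file names (it assumes the whole file was double-encoded and every character fits in Latin-1):
$ iconv -f utf-8 -t iso-8859-1 double-encoded-file > repaired-file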
I'll try again, this time using vim to do the Ö insertion instead of sed. vim seemed to detect the encoding better ("latin1" aka ISO-8859-1), so perhaps it will insert the new Ö with a consistent encoding.
$ vim source-file
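# (inside vim: the Ö was presumably inserted on line 1 and the buffer saved as test-file-2; that step is not shown)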
$ head -n1 test-file-2
�
$ head -n1 test-file-2 | hexdump -C
00000000 d6 0d 0a |...|
00000003
$ tail -n +102322 test-file-2 | head -n1 | hexdump -C -s85 -n2
00000055 d6 4d |.M|
00000057
Looks good. It looks like ISO-8859-1 for both the new and the old umlauts.
Now the test.
$ file -b --mime-encoding test-file-2
iso-8859-1
$ iconv -f iso-8859-1 -t utf8 test-file-2 > test-file-2-converted
$ file -b --mime-encoding test-file-2-converted
utf-8
Boom! Moral of the story: Don't trust file to always guess your encoding right. It is easy to mix encodings within the same file. When in doubt, look at the hex.
A hack (also prone to failure) that would address this specific limitation of file when dealing with large files would be to shorten the file to make sure that the special characters appear early in the file, so file is more likely to find them.
$ first_special=$(pcregrep -o1 -n '()[^\x00-\x7F]' source-file | head -n1 | cut -d":" -f1)
$ tail -n +$first_special source-file > /tmp/source-file-shorter
$ file -b --mime-encoding /tmp/source-file-shorter
iso-8859-1
Update
Christos Zoulas updated file to make the number of bytes looked at configurable. A one-day turnaround on the feature request, awesome!
http://bugs.gw.com/view.php?id=533
https://github.com/file/file/commit/d04de269e0b06ccd0a7d1bf4974fed1d75be7d9e
The feature was released in file version 5.26.
Looking at more of a large file before making a guess about the encoding takes time. However, it is nice to have the option for specific use cases where a better guess may outweigh the extra time/IO.
Use the following option:
-P, --parameter name=value
Set various parameter limits.
Name Default Explanation
bytes 1048576 max number of bytes to read from file
Something like...
file_to_check="myfile"
bytes_to_scan=$(wc -c < "$file_to_check")
file -b --mime-encoding -P bytes="$bytes_to_scan" "$file_to_check"
...should do the trick if you want to force file to look at the whole file before making a guess. Of course, this only works if you have file 5.26 or newer.
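A quick way to check which version you have (the -P parameter support appeared in 5.26):
$ file --version | head -n1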
I haven't built/tested the latest releases yet. Most of my machines currently have file 5.04 (2010)... hopefully someday this release will make it down from upstream.