Question:
I need to find the encoding of all files that are placed in a directory. Is there a way to find the encoding used? The file command is not able to do this.
The encoding that is of interest to me is ISO-8859-1. If the encoding is anything else, I want to move the file to another directory.
Answer 1:
Sounds like you're looking for enca. It can guess and even convert between encodings. Just look at the man page.
Or, failing that, use file -i (Linux) or file -I (OS X). That will output MIME-type information for the file, which will also include the character-set encoding. I found a man page for it, too :)
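A quick sketch of both tools on a hypothetical file.txt (enca wants a language hint; none keeps the guess language-independent; output comments are illustrative):
enca -L none file.txt    # guesses the encoding, e.g. ISO-8859-1
file -i file.txt         # e.g. file.txt: text/plain; charset=iso-8859-1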
Answer 2:
file -bi <file name>
If you'd like to do this for a bunch of files:
for f in `find | egrep -v Eliminate`; do echo "$f" ' -- ' `file -bi "$f"` ; done
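Note that the backtick loop splits on whitespace; if your file names may contain spaces, a sketch of the same listing with find -exec is safer (it drops the egrep filtering above, which you could add back with a -path test):
find . -type f -exec file -i {} +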
Answer 3:
uchardet - An encoding detector library ported from Mozilla.
Usage:
~> uchardet file.java
UTF-8
Various Linux distributions (Debian/Ubuntu, OpenSuse-packman, ...) provide binaries.
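Since the question is about moving anything that is not ISO-8859-1, here is a minimal sketch built on uchardet (assuming it reports such files as ISO-8859-1; the target directory other/ is a hypothetical name):
mkdir -p other
for f in *; do
    [ -f "$f" ] || continue                                 # skip directories
    [ "$(uchardet "$f")" != "ISO-8859-1" ] && mv -- "$f" other/
done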
Answer 4:
Here is an example script using file -I and iconv, which works on Mac OS X. For your question, you would use mv instead of iconv:
#!/bin/bash
# 2016-02-08
# check encoding and convert files
for f in *.java
do
    # file -I (capital I on OS X) prints "name: type; charset=...";
    # the two cuts leave just the charset value
    encoding=`file -I "$f" | cut -f 2 -d";" | cut -f 2 -d=`
    case $encoding in
        iso-8859-1)
            iconv -f iso8859-1 -t utf-8 "$f" > "$f.utf8"
            mv "$f.utf8" "$f"
            ;;
    esac
done
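And a sketch of the mv variant the question actually asks for (the target directory name non_iso/ is hypothetical):
mkdir -p non_iso
for f in *.java
do
    encoding=`file -I "$f" | cut -f 2 -d";" | cut -f 2 -d=`
    case $encoding in
        iso-8859-1)
            ;;                   # already ISO-8859-1; leave it in place
        *)
            mv "$f" non_iso/     # anything else goes to the other directory
            ;;
    esac
done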
Answer 5:
It is really hard to determine whether a file is iso-8859-1. If a text contains only 7-bit characters, it could be iso-8859-1, but you can't tell. If it contains 8-bit characters, the upper-region characters exist in other encodings as well. Therefore you would have to use a dictionary to get a better guess of which word is meant, and determine from there which letter it must be. Finally, if you detect that a file might be utf-8, you can be fairly sure it is not iso-8859-1.
Encoding detection is one of the hardest things to do, because nothing in the file itself tells you for certain which encoding was used.
Answer 6:
If you're talking about XML files (ISO-8859-1), the XML declaration inside them specifies the encoding: <?xml version="1.0" encoding="ISO-8859-1" ?>
So you can use regular expressions (e.g. with perl) to check every file for such a declaration.
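For instance, a sketch that pulls the declared encoding out of the first line of each file (assuming the declaration sits on line one):
for f in *.xml; do
    enc=$(head -n 1 "$f" | grep -oi 'encoding="[^"]*"' | cut -d'"' -f2)
    echo "$f: ${enc:-no declaration found}"
done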
More information can be found here: How to Determine Text File Encoding.
Answer 7:
With Python, you can use the chardet module: https://github.com/chardet/chardet
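The package also installs a chardetect command-line tool, so you don't have to write any Python yourself (the output shown is illustrative):
chardetect myfile.txt
# myfile.txt: ISO-8859-1 with confidence 0.73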
Answer 8:
This is not something you can do in a foolproof way. One possibility would be to examine every character in the file to ensure that it doesn't contain any characters in the ranges 0x00-0x1f or 0x7f-0x9f but, as I said, this may be true for any number of files, including at least one other variant of ISO 8859.
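A sketch of that byte-range check with GNU grep (-P enables byte escapes; tab, LF, and CR are excluded from the ranges because they are normal in text files):
LC_ALL=C grep -qP '[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]' "$f" && echo "$f: contains control bytes"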
Another possibility is to look for specific words in the file in all of the languages supported and see if you can find them.
So, for example, find the equivalent of the English "and", "but", "to", "of" and so on in all the supported languages of 8859-1 and see if they have a large number of occurrences within the file.
I'm not talking about literal translation such as:
English   French
-------   ------
of        de, du
and       et
the       le, la, les
although that's possible. I'm talking about common words in the target language (for all I know, Icelandic has no word for "and" - you'd probably have to use their word for "fish" [sorry that's a little stereotypical, I didn't mean any offense, just illustrating a point]).
Answer 9:
I know you're interested in a more general answer, but what's good in ASCII is usually good in other encodings. Here is a Python one-liner to determine if standard input is ASCII. (I'm pretty sure this works in Python 2, but I've only tested it on Python 3.)
python -c 'from sys import exit,stdin;exit() if 128>max(c for l in open(stdin.fileno(),"rb") for c in l) else exit("Not ASCII")' < myfile.txt
Answer 10:
In Debian you can also use encguess:
$ encguess test.txt
test.txt US-ASCII
Answer 11:
In Cygwin, this looks like it works for me:
find -type f -name "<FILENAME_GLOB>" | while read <VAR>; do (file -i "$<VAR>"); done
Example:
find -type f -name "*.txt" | while read file; do (file -i "$file"); done
You could pipe that to awk and create an iconv command to convert everything to utf8, from any source encoding supported by iconv.
Example:
find -type f -name "*.txt" | while read file; do (file -i "$file"); done | awk -F '[:=]' '{print "iconv -f "$3" -t utf8 \""$1"\" > \""$1"_utf8\""}' | bash
Answer 12:
You can extract the encoding of a single file with the file command. For example, with a sample.html file:
$ file sample.html
sample.html: HTML document, UTF-8 Unicode text, with very long lines
$ file -b sample.html
HTML document, UTF-8 Unicode text, with very long lines
$ file -bi sample.html
text/html; charset=utf-8
$ file -bi sample.html | awk -F'=' '{print $2 }'
utf-8
Answer 13:
I am using the following script to:
- find all files that match FILTER with SRC_ENCODING
- create a backup of them
- convert them to DST_ENCODING
- (optionally) remove the backups
#!/bin/bash -xe
SRC_ENCODING="iso-8859-1"
DST_ENCODING="utf-8"
FILTER="*.java"
echo "Find all files that match the encoding $SRC_ENCODING and filter $FILTER"
FOUND_FILES=$(find . -iname "$FILTER" -exec file -i {} \; | grep "$SRC_ENCODING" | grep -Eo '^.*\.java')
for FILE in $FOUND_FILES ; do
    ORIGINAL_FILE="$FILE.$SRC_ENCODING.bkp"

    echo "Backup original file to $ORIGINAL_FILE"
    mv "$FILE" "$ORIGINAL_FILE"

    echo "converting $FILE from $SRC_ENCODING to $DST_ENCODING"
    iconv -f "$SRC_ENCODING" -t "$DST_ENCODING" "$ORIGINAL_FILE" -o "$FILE"
done
echo "Deleting backups"
find . -iname "*.$SRC_ENCODING.bkp" -exec rm {} \;
Answer 14:
With Perl, use Encode::Detect.
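A minimal sketch, assuming the module is installed from CPAN (Encode::Detect::Detector provides a detect() function that takes raw octets and returns the charset name, or undef if it can't tell):
perl -MEncode::Detect::Detector -e 'local $/; print Encode::Detect::Detector::detect(<STDIN>) // "unknown", "\n"' < myfile.txt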