How do I determine file encoding in OS X?

2020-01-26 12:42发布

I'm trying to enter some UTF-8 characters into a LaTeX file in TextMate (which says its default encoding is UTF-8), but LaTeX doesn't seem to understand them.

Running cat my_file.tex shows the characters properly in Terminal. Running ls -al shows something I've never seen before: an "@" by the file listing:

-rw-r--r--@  1 me      users      2021 Feb 11 18:05 my_file.tex

(And, yes, I'm using \usepackage[utf8]{inputenc} in the LaTeX.)

I've found iconv, but that doesn't seem to be able to tell me what the encoding is -- it'll only convert once I figure it out.

15条回答
淡お忘
2楼-- · 2020-01-26 13:17

Which LaTeX are you using? When I was using teTeX, I had to manually download the unicode package and add this to my .tex files:

% UTF-8 stuff
\usepackage[notipa]{ucs}
\usepackage[utf8x]{inputenc}
\usepackage[T1]{fontenc}

Now, I've switched over to XeTeX from the TeXlive 2008 package (here), it is even more simple:

% UTF-8 stuff
\usepackage{fontspec}
\usepackage{xunicode}

As for detection of a file's encoding, you could play with file(1) (but it is rather limited) but like someone else said, it is difficult.

查看更多
▲ chillily
3楼-- · 2020-01-26 13:17

I implemented the bash script below, it works for me.

It first tries to iconv from the encoding returned by file --mime-encoding to utf-8.

If that fails, it goes through all encodings and shows the diff between the original and re-encoded file. It skips over encodings that produce a large diff output ("large" as defined by the MAX_DIFF_LINES variable or the second input argument), since those are most likely the wrong encoding.

If "bad things" happen as a result of using this script, don't blame me. There's a rm -f in there, so there be monsters. I tried to prevent adverse effects by using it on files with a random suffix, but I'm not making any promises.

Tested on Darwin 15.6.0.

#!/bin/bash

if [[ $# -lt 1 ]]
then
  echo "ERROR: need one input argument: file of which the enconding is to be detected."
  exit 3
fi

if [ ! -e "$1" ]
then
  echo "ERROR: cannot find file '$1'"
  exit 3
fi

if [[ $# -ge 2 ]]
then
  MAX_DIFF_LINES=$2
else
  MAX_DIFF_LINES=10
fi


#try the easy way
ENCOD=$(file --mime-encoding $1 | awk '{print $2}')
#check if this enconding is valid
iconv -f $ENCOD -t utf-8 $1 &> /dev/null
if [ $? -eq 0 ]
then
  echo $ENCOD
  exit 0
fi

#hard way, need the user to visually check the difference between the original and re-encoded files
for i in $(iconv -l | awk '{print $1}')
do
  SINK=$1.$i.$RANDOM
  iconv -f $i -t utf-8 $1 2> /dev/null > $SINK
  if [ $? -eq 0 ]
  then
    DIFF=$(diff $1 $SINK)
    if [ ! -z "$DIFF" ] && [ $(echo "$DIFF" | wc -l) -le $MAX_DIFF_LINES ]
    then
      echo "===== $i ====="
      echo "$DIFF"
      echo "Does that make sense [N/y]"
      read $ANSWER
      if [ "$ANSWER" == "y" ] || [ "$ANSWER" == "Y" ]
      then
        echo $i
        exit 0
      fi
    fi
  fi
  #clean up re-encoded file
  rm -f $SINK
done

echo "None of the encondings worked. You're stuck."
exit 3
查看更多
啃猪蹄的小仙女
4楼-- · 2020-01-26 13:19

Using the -I (that's a capital i) option on the file command seems to show the file encoding.

file -I {filename}
查看更多
登录 后发表回答