iconv any encoding to UTF-8

Posted 2020-02-09 01:29

I am trying to point iconv at a directory so that every file in it is converted to UTF-8, regardless of its current encoding.

I am using the script below, but it requires you to specify the encoding you are converting FROM. How can I make it autodetect the current encoding?

dir_iconv.sh

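#!/bin/bash

ICONVBIN='/usr/bin/iconv' # path to iconv binary

if [ $# -lt 3 ]
then
    echo "Usage: $0 dir from_charset to_charset"
    exit 1
fi

# Quote every expansion so file names with spaces survive.
for f in "$1"/*
do
    if test -f "$f"
    then
        echo -e "\nConverting $f"
        # Keep the original as $f.old and write the converted copy to $f.
        /bin/mv "$f" "$f.old"
        "$ICONVBIN" -f "$2" -t "$3" "$f.old" > "$f"
    else
        echo -e "\nSkipping $f - not a regular file"
    fi
done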

Command line:

sudo convert/dir_iconv.sh convert/books CURRENT_ENCODING utf8

7 answers
beautiful°
#2 · 2020-02-09 01:43

Here is my solution to convert all files in place:

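#!/bin/bash

# Install the detector and the converter (Debian/Ubuntu, needs root).
apt-get -y install recode uchardet > /dev/null
# "$1" is the directory to convert.
find "$1" -type f | while IFS= read -r FFN
do
    encoding=$(uchardet "$FFN")
    echo "$FFN: $encoding"
    # Rename x-mac-* charsets to the mac* form before passing them to recode.
    enc=$(echo "$encoding" | sed 's#^x-mac-#mac#')
    recode "$enc..UTF-8" "$FFN"
done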

https://gist.github.com/demofly/25f856a96c29b89baa32

Put it into convert-dir-to-utf8.sh and run:

bash convert-dir-to-utf8.sh /path/to/my/trash/dir

Note that the sed call is a workaround for Mac encodings, whose names differ between uchardet and recode. Many uncommon encodings need similar workarounds.
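If more of these renames pile up, it may be easier to keep them in one place. Below is a minimal sketch, not part of the original script: normalize_charset is a hypothetical helper, only the x-mac- rule comes from the script above, and treating an empty result, "unknown" or "ascii/unknown" as "nothing usable detected" is an assumption about what the detector may print.

```shell
#!/bin/bash
# Hypothetical helper: map a detected charset name to the name the
# converter expects. Prints the mapped name, or returns 1 when the
# detector gave nothing usable.
normalize_charset() {
    case $1 in
        # Same rename as the sed call: x-mac-foo becomes macfoo.
        x-mac-*) printf 'mac%s\n' "${1#x-mac-}" ;;
        # Assumed "no result" values; adjust to what your detector prints.
        ""|unknown|ascii/unknown) return 1 ;;
        # Everything else is passed through unchanged.
        *) printf '%s\n' "$1" ;;
    esac
}
```

Inside the loop it could be used as enc=$(normalize_charset "$encoding") || continue, so that files without a usable detection result are skipped rather than handed to recode.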

干净又极端
#3 · 2020-02-09 01:51

Here's my answer... =D


#!/bin/bash

# grep -Iq . lets through only non-empty files that are not binary.
find <YOUR_FOLDER_PATH> -type f -exec grep -Iq . {} \; -print0 |
while IFS= read -r -d '' LINE_FILE; do
    CHARSET=$(uchardet "$LINE_FILE")
    echo "Converting ($CHARSET) $LINE_FILE"

    # NOTE: Convert/reconvert to utf8. By Questor
    # Writing the output over the input with -o can truncate large files
    # with some iconv versions.
    iconv -f "$CHARSET" -t utf8 "$LINE_FILE" -o "$LINE_FILE"

    # NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
    # [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
    # https://stackoverflow.com/a/45240995/3223785 ]
    sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"

done
# [Refs.: https://justrocketscience.com/post/handle-encodings , 
# https://stackoverflow.com/a/9612232/3223785 , 
# https://stackoverflow.com/a/13659891/3223785 ]

FURTHER QUESTION: I do not know whether my approach is the safest. Some files were not converted correctly (characters were lost) or were truncated. I suspect the cause is either the iconv tool or the charset reported by uchardet. The solution at https://stackoverflow.com/a/22841847/3223785 (@demofly) caught my attention because it could be safer.
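One possible cause of the truncation is that iconv is asked to write to the same file it is still reading. A minimal sketch of a way around that, assuming a temporary file is acceptable: convert_via_tempfile is a hypothetical helper, not part of the scripts in this answer, and it only copies the result back once iconv has finished successfully.

```shell
#!/bin/bash
# Hypothetical helper: convert FILE from CHARSET to UTF-8 without writing
# to the file that iconv is still reading.
# Usage: convert_via_tempfile CHARSET FILE
convert_via_tempfile() {
    local charset=$1 file=$2 tmp status
    tmp=$(mktemp) || return 1
    # Convert into the temporary file; copy the bytes back only on success,
    # so a failed conversion leaves the original untouched. Copying with
    # cat (instead of mv) keeps the original file's mode and owner.
    iconv -f "$charset" -t UTF-8 "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f -- "$tmp"
    return "$status"
}
```

In the loop above, the iconv line would then become convert_via_tempfile "$CHARSET" "$LINE_FILE". This does not help when uchardet reports the wrong charset, so the lost characters may have a different cause.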


Another version, now based on @demofly's answer...

#!/bin/bash

find <YOUR_FOLDER_PATH> -type f -exec grep -Iq . {} \; -print0 |
while IFS= read -r -d '' LINE_FILE; do
    CHARSET=$(uchardet "$LINE_FILE")
    REENCSED=$(echo "$CHARSET" | sed 's#^x-mac-#mac#')
    echo "\"$CHARSET\" \"$LINE_FILE\""

    # NOTE: Convert/reconvert to utf8. By Questor
    recode "$REENCSED..UTF-8" "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP

    STDERR_OP=$(cat STDERR_OP)
    rm -f STDERR_OP
    if [ -n "$STDERR_OP" ] ; then

        # NOTE: recode printed an error, so fall back to iconv. By Questor
        iconv -f "$CHARSET" -t utf8 "$LINE_FILE" -o "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP

        STDERR_OP=$(cat STDERR_OP)
        rm -f STDERR_OP
    fi

    # NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
    # [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
    # https://stackoverflow.com/a/45240995/3223785 ]
    sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"

    if [ -n "$STDERR_OP" ] ; then
        echo "ERROR: \"$STDERR_OP\""
    fi
    STDOUT_OP=$(cat STDOUT_OP)
    rm -f STDOUT_OP
    if [ -n "$STDOUT_OP" ] ; then
        echo "RESULT: \"$STDOUT_OP\""
    fi
done
# [Refs.: https://justrocketscience.com/post/handle-encodings , 
# https://stackoverflow.com/a/9612232/3223785 , 
# https://stackoverflow.com/a/13659891/3223785 ]

Hybrid solution with recode and vim...

#!/bin/bash

find <YOUR_FOLDER_PATH> -type f -exec grep -Iq . {} \; -print0 |
while IFS= read -r -d '' LINE_FILE; do
    CHARSET=$(uchardet "$LINE_FILE")
    REENCSED=$(echo "$CHARSET" | sed 's#^x-mac-#mac#')
    echo "\"$CHARSET\" \"$LINE_FILE\""

    # NOTE: Convert/reconvert to utf8. By Questor
    recode "$REENCSED..UTF-8" "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP

    STDERR_OP=$(cat STDERR_OP)
    rm -f STDERR_OP STDOUT_OP
    if [ -n "$STDERR_OP" ] ; then

        # NOTE: recode printed an error, so let vim rewrite the file as utf8. By Questor
        bash -c "</dev/tty vim -u NONE +\"set binary | set noeol | set nobomb | set encoding=utf-8 | set fileencoding=utf-8 | wq\" \"$LINE_FILE\""

    else

        # NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
        # [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
        # https://stackoverflow.com/a/45240995/3223785 ]
        sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"

    fi
done

NOTE: This solution produced the highest number of perfect conversions, and it left no truncated files.

WARNING: Back up your files and use a merge tool to check and compare the changes. Problems will probably appear!

TIP: The command sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE" shows up as a difference in the merge tool. Consider running the conversion without it first, comparing the results, and stripping the BOM afterwards.
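To see in advance which files the BOM-stripping sed command would change, the first three bytes can be checked directly. This is a minimal sketch, assuming head -c and od are available; has_utf8_bom is a hypothetical helper that is not part of the scripts above.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if FILE starts with the UTF-8 BOM,
# i.e. the three bytes EF BB BF.
has_utf8_bom() {
    # Dump the first three bytes as hex and strip od's spacing.
    [ "$(head -c 3 -- "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}
```

For example, has_utf8_bom "$LINE_FILE" && echo "BOM: $LINE_FILE" lists the affected files without modifying anything.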

NOTE: The find search returns every non-binary file in "YOUR_FOLDER_PATH" and its subfolders.

Thanks!

叼着烟拽天下
#4 · 2020-02-09 01:57

The enca command doesn't work for my Simplified Chinese text file with GB2312 encoding.

Instead, I use the following function to convert the text file. You can of course redirect the output into a file.

It requires the chardet and iconv commands.

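detection_cat ()
{
    # Detect the encoding, then keep only the name that sits between
    # ": " and " (confid..." in chardet's output.
    DET_OUT=$(chardet "$1")
    ENC=$(echo "$DET_OUT" | sed "s|^.*: \(.*\) (confid.*$|\1|")
    # Print the file converted to the locale's encoding on stdout.
    iconv -f "$ENC" "$1"
}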
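The sed expression depends on the exact wording of chardet's output line. The sketch below does the same extraction with shell parameter expansion and accepts two layouts. The "(confidence: ...)" layout is inferred from the sed pattern above; the "with confidence ..." layout is an assumption about what other chardet versions may print, and chardet_encoding is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: extract the encoding name from one line of chardet
# output. Assumed layouts:
#   book.txt: GB2312 (confidence: 0.99)
#   book.txt: GB2312 with confidence 0.99
chardet_encoding() {
    local line=$1
    # Cut the trailing confidence part in either layout.
    line=${line%" (confidence"*}
    line=${line%" with confidence"*}
    # What follows the last ": " is the encoding name.
    printf '%s\n' "${line##*: }"
}
```

It could replace the sed line with ENC=$(chardet_encoding "$DET_OUT").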
欢心
#5 · 2020-02-09 01:58

You can get what you need using the standard utilities file and awk. Example:

file -bi .xsession-errors gives me: "text/plain; charset=us-ascii"

so file -bi .xsession-errors |awk -F "=" '{print $2}' gives me "us-ascii"

I use it in scripts like so:

CHARSET="$(file -bi "$i"|awk -F "=" '{print $2}')"

if [ "$CHARSET" != utf-8 ]; then

        iconv -f "$CHARSET" -t utf8 "$i" -o outfile

fi
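For binary files, file reports charset=binary, and for text it cannot classify it reports charset=unknown-8bit. iconv cannot convert from either. A minimal sketch of a parser that rejects both follows; charset_from_mime is a hypothetical helper, not part of the snippet above.

```shell
#!/bin/bash
# Hypothetical helper: extract the charset from a "file -bi" style string
# such as "text/plain; charset=us-ascii". Returns 1 when there is no
# charset that iconv could use.
charset_from_mime() {
    local mime=$1 charset
    case $mime in
        *charset=*) charset=${mime##*charset=} ;;
        *) return 1 ;;
    esac
    case $charset in
        # Reported for binary data and for unclassifiable 8-bit text.
        binary|unknown-8bit) return 1 ;;
    esac
    printf '%s\n' "$charset"
}
```

In a loop this could be used as CHARSET=$(charset_from_mime "$(file -bi "$i")") || continue. Alternatively, file -b --mime-encoding "$i" prints only the charset, which makes the awk step unnecessary.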
相关推荐>>
#6 · 2020-02-09 02:03

Maybe you are looking for enca:

Enca is an Extremely Naive Charset Analyser. It detects character set and encoding of text files and can also convert them to other encodings using either a built-in converter or external libraries and tools like libiconv, librecode, or cstocs.

Currently it supports Belarusian, Bulgarian, Croatian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Russian, Slovak, Slovene, Ukrainian, Chinese, and some multibyte encodings independently on language.

Note that in general, autodetection of current encoding is a difficult process (the same byte sequence can be correct text in multiple encodings). enca uses heuristics based on the language you tell it to detect (to limit the number of encodings). You can use enconv to convert text files to a single encoding.
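As a rough illustration of the language hint, here is a minimal sketch of a wrapper around enca. to_utf8_enca is a hypothetical name, and the sketch assumes enca's -L option (language) and -x option (convert to the given encoding); check enca --help on your system before relying on it, and try it on copies first because the conversion happens in place.

```shell
#!/bin/bash
# Hypothetical wrapper: convert FILE to UTF-8 in place with enca.
# LANGUAGE is the hint enca uses to narrow down the candidate encodings,
# for example "chinese" or "russian".
# Usage: to_utf8_enca LANGUAGE FILE
to_utf8_enca() {
    local language=$1 file=$2
    [ -f "$file" ] || return 2                  # not a regular file
    command -v enca > /dev/null || return 127   # enca is not installed
    enca -L "$language" -x UTF-8 "$file"
}
```

A call might look like to_utf8_enca russian notes.txt, where notes.txt is a placeholder file name.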

何必那么认真
#7 · 2020-02-09 02:03

Combining all of the above: go to the directory and create dir2utf8.sh:

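#!/bin/bash
# Convert all files in the current directory to UTF-8.

for f in *
do
    if test -f "$f"; then
        echo -e "\nConverting $f"
        CHARSET="$(file -bi "$f" | awk -F "=" '{print $2}')"
        if [ "$CHARSET" != utf-8 ]; then
            # Convert into a temporary file first: letting iconv write over
            # its own input can truncate the file with some iconv versions.
            iconv -f "$CHARSET" -t utf8 "$f" -o "$f.tmp" && mv "$f.tmp" "$f"
        fi
    else
        echo -e "\nSkipping $f - not a regular file"
    fi
done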