iconv any encoding to UTF-8

I am trying to point iconv to a directory and all files will be converted UTF-8 regardless of the current encoding

I am using this script but you have to specify what encoding you are going FROM. How can I make it autdetect the current encoding?

dir_iconv.sh

#!/bin/bash

ICONVBIN='/usr/bin/iconv' # path to iconv binary

if [ $# -lt 3 ]
then
    echo "$0 dir from_charset to_charset"
    exit
fi

for f in $1/*
do
    if test -f $f
    then
        echo -e "\nConverting $f"
        /bin/mv $f $f.old
        $ICONVBIN -f $2 -t $3 $f.old > $f
    else
        echo -e "\nSkipping $f - not a regular file";
    fi
done

terminal line

sudo convert/dir_iconv.sh convert/books CURRENT_ENCODING utf8

标签： linux ubuntu encoding utf-8 iconv

7条回答

beautiful°

2楼-- · 2020-02-09 01:43

Here is my solution to inplace all files:

#!/bin/bash

apt-get -y install recode uchardet > /dev/null
find "$1" -type f | while read FFN # 'dir' should be changed...
do
    encoding=$(uchardet "$FFN")
    echo "$FFN: $encoding"
    enc=`echo $encoding | sed 's#^x-mac-#mac#'`
    set +x
    recode $enc..UTF-8 "$FFN"
done

https://gist.github.com/demofly/25f856a96c29b89baa32

put it into convert-dir-to-utf8.sh and run:

bash convert-dir-to-utf8.sh /pat/to/my/trash/dir

Note that sed is a workaround for mac encodings here. Many uncommon encodings need workarounds like this.

0人赞添加讨论(0) 举报

干净又极端

3楼-- · 2020-02-09 01:51

Here's my answer... =D

#!/bin/bash

find <YOUR_FOLDER_PATH> -name '*' -type f -exec grep -Iq . {} \; -print0 | 
while IFS= read -r -d $'\0' LINE_FILE; do
    CHARSET=$(uchardet $LINE_FILE)
    echo "Converting ($CHARSET) $LINE_FILE"

    # NOTE: Convert/reconvert to utf8. By Questor
    iconv -f "$CHARSET" -t utf8 "$LINE_FILE" -o "$LINE_FILE"

    # NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
    # [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
    # https://stackoverflow.com/a/45240995/3223785 ]
    sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"

done
# [Refs.: https://justrocketscience.com/post/handle-encodings , 
# https://stackoverflow.com/a/9612232/3223785 , 
# https://stackoverflow.com/a/13659891/3223785 ]

FURTHER QUESTION: I do not know if my approach is the safest. I say this because I noticed that some files are not correctly converted (characters will be lost) or are "truncated". I suspect that this has to do with the "iconv" tool or with the charset information obtained with the "uchardet" tool. I was curious about the solution presented at https://stackoverflow.com/a/22841847/3223785 ( @demofly ) because it could be safer.

Another answer, now based on @demofly 's answer...

#!/bin/bash

find <YOUR_FOLDER_PATH> -name '*' -type f -exec grep -Iq . {} \; -print0 | 
while IFS= read -r -d $'\0' LINE_FILE; do
    CHARSET=$(uchardet $LINE_FILE)
    REENCSED=`echo $CHARSET | sed 's#^x-mac-#mac#'`
    echo "\"$CHARSET\" \"$LINE_FILE\""

    # NOTE: Convert/reconvert to utf8. By Questor
    recode $REENCSED..UTF-8 "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP

    STDERR_OP=$(cat STDERR_OP)
    rm -f STDERR_OP
    if [ -n "$STDERR_OP" ] ; then

        # NOTE: Convert/reconvert to utf8. By Questor
        iconv -f "$CHARSET" -t utf8 "$LINE_FILE" -o "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP

        STDERR_OP=$(cat STDERR_OP)
        rm -f STDERR_OP
    fi

    # NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
    # [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
    # https://stackoverflow.com/a/45240995/3223785 ]
    sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"

    if [ -n "$STDERR_OP" ] ; then
        echo "ERROR: \"$STDERR_OP\""
    fi
    STDOUT_OP=$(cat STDOUT_OP)
    rm -f STDOUT_OP
    if [ -n "$STDOUT_OP" ] ; then
        echo "RESULT: \"$STDOUT_OP\""
    fi
done
# [Refs.: https://justrocketscience.com/post/handle-encodings , 
# https://stackoverflow.com/a/9612232/3223785 , 
# https://stackoverflow.com/a/13659891/3223785 ]

Hybrid solution with recode and vim...

#!/bin/bash

find <YOUR_FOLDER_PATH> -name '*' -type f -exec grep -Iq . {} \; -print0 | 
while IFS= read -r -d $'\0' LINE_FILE; do
    CHARSET=$(uchardet $LINE_FILE)
    REENCSED=`echo $CHARSET | sed 's#^x-mac-#mac#'`
    echo "\"$CHARSET\" \"$LINE_FILE\""

    # NOTE: Convert/reconvert to utf8. By Questor
    recode $REENCSED..UTF-8 "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP

    STDERR_OP=$(cat STDERR_OP)
    rm -f STDERR_OP
    if [ -n "$STDERR_OP" ] ; then

        # NOTE: Convert/reconvert to utf8. By Questor
        bash -c "</dev/tty vim -u NONE +\"set binary | set noeol | set nobomb | set encoding=utf-8 | set fileencoding=utf-8 | wq\" \"$LINE_FILE\""

    else

        # NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
        # [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
        # https://stackoverflow.com/a/45240995/3223785 ]
        sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"

    fi
done

NOTE: This was the solution with the highest number of perfect conversions. Additionally we did not have any truncated files.

WARNING: Make a backup of your files and use a merge tool to check/compare the changes. Problems probably will appear!

TIP: The command sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE" can be executed after a preliminary comparison with the merge tool after a conversion without it since it can cause "differences".

NOTE: The search using "find" brings all non-binary files from "YOUR_FOLDER_PATH" and its subfolders.

Thanks!

0人赞添加讨论(0) 举报

叼着烟拽天下

4楼-- · 2020-02-09 01:57

enca command doesn't work for my Simplified-Chinese text file with GB2312 encoding.

Instead, I use the following function to convert the text file for me. You could of course re-direct the output into a file.

It requires chardet and iconv commands.

detection_cat () 
{
    DET_OUT=$(chardet $1);
    ENC=$(echo $DET_OUT | sed "s|^.*: \(.*\) (confid.*$|\1|");
    iconv -f $ENC $1
}

0人赞添加讨论(0) 举报

欢心

5楼-- · 2020-02-09 01:58

You can get what you need using standard gnu utils file and awk. Example:

file -bi .xsession-errors gives me: "text/plain; charset=us-ascii"

so file -bi .xsession-errors |awk -F "=" '{print $2}' gives me "us-ascii"

I use it in scripts like so:

CHARSET="$(file -bi "$i"|awk -F "=" '{print $2}')"

if [ "$CHARSET" != utf-8 ]; then

        iconv -f "$CHARSET" -t utf8 "$i" -o outfile

fi

0人赞添加讨论(0) 举报

iconv any encoding to UTF-8

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间