I am trying to point iconv at a directory so that all files in it get converted to UTF-8, regardless of their current encoding.
I am using the script below, but you have to specify which encoding you are converting FROM. How can I make it autodetect the current encoding?
dir_iconv.sh
#!/bin/bash
ICONVBIN='/usr/bin/iconv' # path to iconv binary
if [ $# -lt 3 ]
then
    echo "Usage: $0 dir from_charset to_charset"
    exit 1
fi
for f in "$1"/*
do
    if test -f "$f"
    then
        echo -e "\nConverting $f"
        /bin/mv "$f" "$f.old"
        $ICONVBIN -f "$2" -t "$3" "$f.old" > "$f"
    else
        echo -e "\nSkipping $f - not a regular file"
    fi
done
Terminal command:
sudo convert/dir_iconv.sh convert/books CURRENT_ENCODING utf8
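For reference, one way to make a loop like the one above detect the source charset itself is to ask `file` per file. This is only a sketch, not the asker's script: it assumes GNU `file`, whose `-b --mime-encoding` output (e.g. `iso-8859-1`, `utf-8`, `binary`) is usually, but not always, a name iconv accepts.

```shell
# Sketch: convert every regular file in a directory to UTF-8,
# auto-detecting the source charset with GNU `file`.
convert_dir_to_utf8() {
    local f enc
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        enc=$(file -b --mime-encoding "$f")   # e.g. "iso-8859-1"
        # skip binaries and files that are already UTF-8 or plain ASCII
        case "$enc" in binary|utf-8|us-ascii) continue ;; esac
        iconv -f "$enc" -t UTF-8 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
}
```

Usage would then be `convert_dir_to_utf8 convert/books`, with no FROM charset argument.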
Here is my solution for converting all files in place:
https://gist.github.com/demofly/25f856a96c29b89baa32
Put it into
convert-dir-to-utf8.sh
and run it. Note that sed is a workaround for Mac encodings here; many uncommon encodings need workarounds like this.
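The linked gist is not reproduced here, but an in-place converter in the same spirit might look like the sketch below. Assumptions (mine, not the gist's): GNU `file`, GNU `sed`, and iconv; the `sed` line here strips a UTF-8 BOM as one example of an encoding workaround.

```shell
# Sketch of an in-place UTF-8 converter (not the linked gist itself).
to_utf8_inplace() {
    local enc
    enc=$(file -b --mime-encoding "$1")
    [ "$enc" = "binary" ] && return 0          # leave binaries alone
    if [ "$enc" != "utf-8" ]; then
        iconv -f "$enc" -t UTF-8 "$1" > "$1.tmp" && mv "$1.tmp" "$1"
    fi
    # example workaround: strip a leading UTF-8 BOM if present (GNU sed)
    sed -i '1s/^\xEF\xBB\xBF//' "$1"
}
```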
Here's my answer... =D
FURTHER QUESTION: I do not know if my approach is the safest. I say this because I noticed that some files were not converted correctly (characters were lost) or were "truncated". I suspect this has to do with the "iconv" tool or with the charset information obtained with the "uchardet" tool. I was curious about the solution presented at https://stackoverflow.com/a/22841847/3223785 (@demofly) because it could be safer.
Another answer, now based on @demofly's answer...
Hybrid solution with recode and vim...
NOTE: This was the solution with the highest number of perfect conversions. Additionally, we did not have any truncated files.
WARNING: Make a backup of your files and use a merge tool to check/compare the changes. Problems will probably appear!
TIP: The command
sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"
can be executed after a first conversion without it and a preliminary comparison with the merge tool, since the BOM removal itself can cause "differences".
NOTE: The search using "find" brings back all non-binary files from "YOUR_FOLDER_PATH" and its subfolders.
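The exact "find" invocation is not shown above; one common way to select only non-binary files under a folder and its subfolders (an illustration, not necessarily the command the answer used) is:

```shell
# Print regular files under the given folder that grep considers text:
# -I makes grep treat binary files as non-matching, -q suppresses output.
list_text_files() {
    find "$1" -type f -exec grep -Iq . {} \; -print
}
```

For example: `list_text_files "YOUR_FOLDER_PATH"`.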
Thanks!
The enca command doesn't work for my Simplified-Chinese text file with GB2312 encoding.
Instead, I use the following function to convert the text file for me. You could, of course, redirect the output into a file.
It requires the chardet and iconv commands.
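The function itself was elided; a sketch of what it might look like, assuming the `chardetect` CLI shipped with the Python chardet package (which prints a line like `FILE: GB2312 with confidence 0.99`):

```shell
# Sketch: detect a file's encoding with chardetect, then convert with iconv.
# Output goes to stdout; redirect it into a file if needed.
convert_with_chardet() {
    local enc
    enc=$(chardetect "$1" | awk '{print $2}')   # second field is the encoding
    iconv -f "$enc" -t UTF-8 "$1"
}
```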
You can get what you need using the standard GNU utils file and awk. Example:
file -bi .xsession-errors
gives me: "text/plain; charset=us-ascii", so
file -bi .xsession-errors | awk -F "=" '{print $2}'
gives me "us-ascii". I use it in scripts like so:
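The script usage was elided; a small wrapper in that style might be (a sketch, assuming the charset name that `file` prints is acceptable to iconv):

```shell
# Sketch: feed the charset detected by `file -bi` straight into iconv.
to_utf8() {
    local enc
    enc=$(file -bi "$1" | awk -F "=" '{print $2}')   # e.g. "us-ascii"
    iconv -f "$enc" -t UTF-8 "$1"
}
```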
Maybe you are looking for enca. Note that, in general, autodetection of the current encoding is a difficult process (the same byte sequence can be valid text in multiple encodings). enca uses heuristics based on the language you tell it to detect (to limit the number of candidate encodings). You can use enconv to convert text files to a single encoding.
Compiling all of them: go to the dir and create dir2utf8.sh:
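The dir2utf8.sh script itself is elided; compiling the approaches shown above (find for non-binary files, `file` for detection, iconv for conversion), a sketch might look like this (my illustration, not the original script):

```shell
# dir2utf8.sh (sketch): recursively convert all text files under a
# directory to UTF-8, combining the detection tricks shown above.
dir2utf8() {
    find "$1" -type f -exec grep -Iq . {} \; -print |
    while IFS= read -r f; do
        enc=$(file -b --mime-encoding "$f")
        case "$enc" in utf-8|us-ascii) continue ;; esac   # already fine
        iconv -f "$enc" -t UTF-8 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
}
```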