I have a bunch of Arabic, English, and Russian files that are encoded in UTF-8. When I try to process these files with a Perl script, I get this error:
Malformed UTF-8 character (fatal)
Manually checking the content of these files, I found some strange characters in them. Now I'm looking for a way to automatically remove these characters from the files.
Is there any way to do this?
Your method must read the data byte by byte and understand how UTF-8 builds each character out of one or more bytes. The simplest method is to use an editor that will read anything but only write valid UTF-8. TextPad is one choice.
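To see which bytes are actually the problem before removing anything, something like the following may help. It is only a sketch, written in Python rather than Perl, and the function name `find_invalid_utf8` is made up for illustration. It relies on the fact that Python's `UnicodeDecodeError` reports the start and end offsets of the bytes it could not decode.

```python
def find_invalid_utf8(data: bytes) -> list[tuple[int, bytes]]:
    """Return (offset, offending_bytes) for each invalid UTF-8 sequence in data."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            # Try to decode everything from the current position onward.
            # (Slicing copies the data, which is fine for a quick diagnostic.)
            data[pos:].decode("utf-8")
            break  # the rest of the data is valid
        except UnicodeDecodeError as err:
            # err.start / err.end are relative to the slice we decoded.
            start = pos + err.start
            end = pos + err.end
            bad.append((start, data[start:end]))
            pos = end  # resume scanning right after the bad bytes
    return bad
```

Reading a file with `open(path, "rb").read()` and passing the result to this function gives a list of byte offsets you can compare against what you found by hand.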
will do the job.
This command:
will clean up your UTF-8 file, skipping all the invalid characters.
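If you would rather do the same kind of clean-up from a script, here is a rough Python sketch of the idea (not the command above, and not Perl). It decodes the input with the `ignore` error handler, which silently drops any bytes that do not form valid UTF-8, and writes the result to a new file. The function name `clean_utf8_file` and the chunk size are my own assumptions.

```python
import codecs


def clean_utf8_file(src_path: str, dst_path: str, chunk_size: int = 65536) -> None:
    """Copy src_path to dst_path, dropping bytes that are not valid UTF-8."""
    # An incremental decoder buffers a multi-byte character that happens
    # to be split across two chunks instead of treating it as invalid.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while chunk := src.read(chunk_size):
            dst.write(decoder.decode(chunk).encode("utf-8"))
        # Flush the decoder; a truncated sequence at end of file is dropped.
        dst.write(decoder.decode(b"", final=True).encode("utf-8"))
```

Writing to a separate output file keeps the original around, so you can diff the two and check that only the strange characters were removed.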