How to remove non UTF-8 characters from text file

Posted 2019-01-07 04:02

I have a bunch of Arabic, English, Russian files which are encoded in utf-8. Trying to process these files using a Perl script, I get this error:

Malformed UTF-8 character (fatal)

Manually checking the content of these files, I found some strange characters in them. Now I'm looking for a way to automatically remove these characters from the files.

Is there any way to do it?

Tags: linux bash text utf-8 character-encoding

3 answers

The star\"

#2 · 2019-01-07 05:01

Any method has to read the file byte by byte and understand how UTF-8 builds characters out of bytes. The simplest approach is to use an editor that will read any input but write only valid UTF-8 characters. TextPad is one choice.
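To make the byte-wise view concrete, here is a minimal sketch in Python (my choice of language, not something the question uses). The helper name `find_invalid_utf8` is hypothetical. It relies on Python's strict UTF-8 decoder, which reports the byte offsets of the first sequence it cannot decode, and it simply restarts after each bad sequence.

```python
def find_invalid_utf8(data: bytes) -> list:
    """Return (start, end) byte offsets of sequences that are not valid UTF-8."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding raises on the first invalid sequence it meets.
            data[pos:].decode("utf-8")
            break  # the rest of the buffer is valid
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we decoded.
            bad.append((pos + err.start, pos + err.end))
            pos += err.end  # resume right after the bad bytes
    return bad


print(find_invalid_utf8(b"ok \xff end"))  # [(3, 4)]
```

Re-slicing the buffer on every error is wasteful for files with many bad bytes, so treat this as a way to inspect what is wrong rather than as a fast cleaner.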


何必那么认真

#3 · 2019-01-07 05:01

cat foo.txt | strings -n 8 > bar.txt

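will remove the invalid bytes, but with side effects: strings keeps only runs of at least 8 printable characters, so shorter runs are dropped, and with the default 7-bit setting of GNU strings it also drops non-ASCII text such as Arabic or Russian.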


冷血范

#4 · 2019-01-07 05:06

This command:

iconv -f utf-8 -t utf-8 -c file.txt

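will clean up your UTF-8 file by skipping all the invalid characters. The cleaned text goes to standard output; file.txt itself is left unchanged, so redirect the output to a new file to keep it.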

-f is the source format
-t the target format
-c skips any invalid sequence
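If iconv is not available, roughly the same effect can be sketched in Python. This is an assumption-laden equivalent, not what iconv does internally: the function names `strip_invalid_utf8` and `clean_file` are hypothetical, and the whole file is read into memory, which is fine for ordinary text files but not for very large ones.

```python
def strip_invalid_utf8(data: bytes) -> bytes:
    """Drop byte sequences that are not valid UTF-8, keeping everything else."""
    # errors="ignore" silently skips undecodable bytes, similar in spirit to iconv -c.
    return data.decode("utf-8", errors="ignore").encode("utf-8")


def clean_file(src_path: str, dst_path: str) -> None:
    """Read src_path as raw bytes and write the cleaned bytes to dst_path."""
    with open(src_path, "rb") as src:
        cleaned = strip_invalid_utf8(src.read())
    with open(dst_path, "wb") as dst:
        dst.write(cleaned)
```

For example, `clean_file("file.txt", "file.clean.txt")` would leave the original untouched and write the cleaned copy next to it. Valid Arabic, Russian, and English text passes through unchanged.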
