I have a large text file that contains a few Unicode characters that make LaTeX crash. How can I find non-ASCII characters in a file with sed or similar tools in bash on Linux?
Answer 1:
Try:
nonascii() { LANG=C grep --color=always '[^ -~]\+'; }
Which can be used like:
printf 'ŨTF8\n' | nonascii
Within [], a leading ^ means "not", so [^ -~] matches any character that is not between space and ~. Apart from control characters (such as tab), which it also matches, this finds the non-ASCII characters. It is a more portable, though slightly less accurate, version of [^\x00-\x7f] below. The \+ means "one or more" and makes the color wrap the complete multibyte character(s), rather than being inserted around each byte, which would corrupt the multibyte sequence.
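Since the goal is to fix a file that LaTeX chokes on, line numbers are often more useful than color. A minimal sketch of that variant follows. The name nonascii_lines is hypothetical, and the -a and -n options are additions that are not part of the function above:

```shell
# Hypothetical helper (the name nonascii_lines is not from the answer above).
# Prints "LINE:TEXT" for every line of the given file that contains a byte
# outside the printable ASCII range (space through ~).
nonascii_lines() {
    # LC_ALL=C makes grep compare raw bytes; -n adds line numbers;
    # -a treats the input as text even if grep would guess "binary".
    LC_ALL=C grep -a -n '[^ -~]' "$1"
}

# Demo on a throwaway file: line 2 contains a UTF-8 encoded e-acute.
sample=$(mktemp)
printf 'plain line\ncaf\303\251\n' > "$sample"
nonascii_lines "$sample"
rm -f "$sample"
```

Like the original pattern, this also reports lines that contain tabs or other control characters, so some hits may be plain ASCII.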
Answer 2:
Try this command:
grep -P '[^\x00-\x7f]' file
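The same pattern can serve as a quick check before running LaTeX. The following is a sketch only: it assumes a grep that supports -P (for example GNU grep built with PCRE), and the function name has_nonascii is made up for illustration:

```shell
# Hypothetical helper (the name is illustrative). Succeeds (exit status 0)
# if the file contains at least one character outside the range 0x00-0x7f.
# Assumes a grep with -P support, such as GNU grep built with PCRE.
has_nonascii() {
    # -q suppresses output; only the exit status is used.
    grep -q -P '[^\x00-\x7f]' "$1"
}

# Demo on a throwaway file: line 2 contains a UTF-8 encoded e-acute.
sample=$(mktemp)
printf 'plain line\ncaf\303\251\n' > "$sample"
if has_nonascii "$sample"; then
    # -n prints the line number in front of each offending line.
    grep -n -P '[^\x00-\x7f]' "$sample"
fi
rm -f "$sample"
```

Unlike [^ -~], this pattern does not flag tabs or other ASCII control characters.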