grep/regex can't find accented word

I'm trying to build a regex that picks out the words in a file whose letters all come from the letters of a given word.

My problem is that the regex can't find accented words, and my text file contains a lot of them.

My command line is:

cat input/words.txt | grep '^[éra]\{1,4\}$' > output/words_era.txt
cat input/words.txt | grep '^[carroça]\{1,7\}$' > output/words_carroca.txt

And the content of file is:

carroça
éra
éssa
roça
roco
rato
onça
orça
roca

How can I fix it?

Tags: regex unicode grep cat non-ascii-characters

4 answers

The star\"

Reply #2 · 2019-02-23 23:58

I found a related question whose answer seems to work.

So if you try something like:

cat input/words.txt | LANG=C grep '^[éra]\{1,4\}$' > output/words_era.txt

Does that produce what you expect?
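One caveat worth checking, sketched here under the assumption that the file is UTF-8: in the byte-oriented C locale each accented letter occupies two bytes, so the repetition count in the pattern counts bytes rather than letters. The variable names below are only for illustration, and the words are written with octal escapes so the result does not depend on how this snippet itself is encoded.

```shell
# Assumption: the words are stored as UTF-8.
era=$(printf '\303\251ra')          # "éra":     é is the bytes C3 A9
carroca=$(printf 'carro\303\247a')  # "carroça": ç is the bytes C3 A7

# wc -c counts bytes regardless of the locale.
era_bytes=$(printf '%s' "$era" | wc -c | tr -d ' ')
carroca_bytes=$(printf '%s' "$carroca" | wc -c | tr -d ' ')

echo "era: $era_bytes bytes"          # era: 4 bytes
echo "carroca: $carroca_bytes bytes"  # carroca: 8 bytes
```

If grep matches byte by byte under LANG=C, `\{1,4\}` still admits the 4-byte éra, but `\{1,7\}` would be too short for the 8-byte carroça, so the upper bound may need raising. Note also that LC_ALL, when set, overrides LANG.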

放荡不羁爱自由

Reply #3 · 2019-02-24 00:10

Assuming everything is UTF-8, I’d usually just use something like

perl -CSAD -Mutf8 -nle 'print if /^[carroça]{1,7}$/' filenames

because then I know what it’s doing.

Reply #4 · 2019-02-24 00:12

Try as @dule said, but with LANG=en_US.iso88591:

cat input/words.txt | LANG=en_US.iso88591 grep '^[éra]\{1,4\}$' > output/words_era.txt

放我归山

Reply #5 · 2019-02-24 00:19

If your file is encoded in ISO-8859-1 but your system locale is UTF-8, this will not work.

Either convert the file to UTF-8 or change your system locale to ISO-8859-1.

# convert from ISO-8859-1 to the environmental locale before grepping
# output will be in the current locale
$ iconv -f 8859_1 input/words.txt | grep ...

# run grep with an ISO-8859-1 locale
# output will be in ISO-8859-1 encoding
$ cat input/words.txt | env LC_ALL=en_US grep ...
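To see why the encoding matters, the following sketch builds a one-word ISO-8859-1 sample, converts it with iconv, and dumps the bytes of both versions. The scratch directory and file names are hypothetical; only `iconv`, `od`, and `tr` are assumed to be installed.

```shell
# Hypothetical scratch directory and file names, used only for this illustration.
tmpdir=$(mktemp -d)

# "éra" in ISO-8859-1: é is the single byte 0xE9 (octal 351).
printf '\351ra\n' > "$tmpdir/words_latin1.txt"

# Convert to UTF-8, where é becomes the two bytes 0xC3 0xA9.
iconv -f ISO-8859-1 -t UTF-8 "$tmpdir/words_latin1.txt" > "$tmpdir/words_utf8.txt"

# Dump both files as hex bytes to compare the encodings.
latin1_hex=$(od -An -tx1 "$tmpdir/words_latin1.txt" | tr -d ' \n')
utf8_hex=$(od -An -tx1 "$tmpdir/words_utf8.txt" | tr -d ' \n')
echo "ISO-8859-1: $latin1_hex"   # ISO-8859-1: e972610a
echo "UTF-8:      $utf8_hex"     # UTF-8:      c3a972610a
```

A pattern typed in a UTF-8 terminal contains the bytes c3 a9 for é, so it cannot match the lone e9 byte in the unconverted file.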
