What regex to find files with CJK characters using the find command?

Posted 2019-08-08 17:11

Question:

The files I'm looking for are of the form cmn-我.flac, where the CJK character is variable.

Using the find command, what regexp should I use to find all files with a single CJK character in their name?


Hint: the following regexp finds all files, both with and without CJK characters:

find ./ -regex '.*\..*'  # ex: cmn-我.flac

Then:

find ./ -regex "cmn-.*[\x4e00-\x9fa5]*\.flac"   # the `-` breaks => fails 
find ./ -regex ".*[\x4e00-\x9fa5]*\.flac"       # finds with n CJK characters => we get closer!
find ./ -regex ".*[\x4e00-\x9fa5]{1}\.flac"     # the `{1}` breaks => fails. 
find ./ -regex ".*[\x4e00-\x9fa5]?\.flac"       # the `?` breaks => fails. 

How can I make it work?

Answer 1:

I think you're on the right track and need to look a bit more closely at the find man page (e.g. -regextype).

Can't reproduce

find ./ -regex "cmn-.*[\x4e00-\x9fa5]*\.xml"
# find: Invalid range end

find's version

First, be sure to check which version of find you're using; there are some differences between implementations:

find --version

This gives:

find (GNU findutils) 4.4.2
…

Explanation

Looking at the -regextype option, I only see POSIX-style regular expression types: emacs (default), posix-awk, posix-basic, posix-egrep and posix-extended.

None of these supports hex escapes such as \x4e00, so [\x4e00-\x9fa5] is not read as a range of code points (compare Perl regular expressions with POSIX ones).
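
To make the -regextype pointer concrete, here is a minimal sketch using ASCII-only names. It assumes GNU find (-regextype is a GNU extension); the helper name list_double_a and the file names are made up for illustration. With posix-extended, the interval {2} is treated as an operator, and the pattern still has to match the whole path that find prints, not just the file name:

```shell
# list_double_a DIR: hypothetical helper, for illustration only.
# Assumes GNU find, since -regextype is a GNU extension.
list_double_a() {
    # -regextype has to appear before the -regex it should apply to.
    # '.*/' consumes the directory part of the path, because -regex is
    # matched against the whole path. 'a{2}' means exactly two 'a's.
    find "$1" -regextype posix-extended -regex '.*/a{2}\.txt'
}
```

In a directory holding a.txt, aa.txt and aaa.txt, calling list_double_a . should print only ./aa.txt.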



Answer 2:

  1. There was an error in the regex, outside of the CJK matching part: -regex is matched against the whole path, not just the file name. The form to match is not

    cmn-我.flac

    but rather:

    ./cmn-我.flac

  2. The following command works, matching ./cmn-*.flac where * is any single character, including a CJK one:

    find ./ -regex "./cmn-.\.flac"

  3. The following should match ./cmn-*.flac where * is any single CJK character:

    << Not yet found! Help welcome! >>
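
One possible direction for item 3, offered as a sketch rather than a confirmed answer: since find's regex types have no hex escapes, let find list the candidates and let grep do the CJK test on raw UTF-8 bytes. This swaps find -regex for a find | grep pipeline. It assumes the file names are UTF-8 encoded and contain no newlines. It also covers the whole CJK Unified Ideographs block (U+4E00 to U+9FFF), which is slightly wider than the \x4e00-\x9fa5 range in the question. The helper name find_single_cjk is made up for illustration:

```shell
# find_single_cjk DIR: hypothetical helper, not part of find or grep.
# Lists files named cmn-X.flac under DIR where X is exactly one character
# from U+4E00..U+9FFF. Assumes UTF-8 file names without newlines.
find_single_cjk() {
    # In UTF-8, every code point in U+4E00..U+9FFF takes three bytes:
    #   E4      B8..BF  80..BF   (U+4E00..U+4FFF)
    #   E5..E9  80..BF  80..BF   (U+5000..U+9FFF)
    # printf turns the octal escapes into those raw bytes
    # (344=E4, 270=B8, 277=BF, 345=E5, 351=E9, 200=80).
    pattern=$(printf '/cmn-(\344[\270-\277]|[\345-\351][\200-\277])[\200-\277]\\.flac$')
    # LC_ALL=C makes grep compare single bytes, whatever the user's locale is.
    find "$1" -name 'cmn-*.flac' | LC_ALL=C grep -E "$pattern"
}
```

Called as find_single_cjk ., this is meant to print ./cmn-我.flac while skipping names such as ./cmn-a.flac or names with two CJK characters between cmn- and .flac.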