Regex for accepting only Persian characters

Posted 2019-01-02 18:43

I'm working on a form in which one of its custom validators should accept only Persian characters. I used the following code:

    var myregex = new Regex(@"^[\u0600-\u06FF]+$");
    // Valid only if every character falls in the U+0600-U+06FF range
    args.IsValid = myregex.IsMatch(mytextBox.Text);

But it seems to work only for Arabic characters and doesn't cover all Persian characters (it lacks these four: گ, چ, پ, ژ). Is there a way to solve this problem?

7 answers
ら面具成の殇う
#2 · 2019-01-02 18:56

In addition to the accepted answer (https://stackoverflow.com/a/22565376/790811), we should also allow the zero-width non-joiner (نیم فاصله in Persian). Unfortunately, two characters are used for it in practice. One is the standard ZWNJ; the other is not standard for this purpose (it is formally the right-to-left mark) but is widely used:

  1. \u200C : http://en.wikipedia.org/wiki/Zero-width_non-joiner
  2. \u200F : Right-to-left mark (http://unicode-table.com/en/#200F)

So the final regex can be:

^[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F]+$

If you also want to allow spaces, you can use this:

^[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F ]+$

You can test it in JavaScript like this:

/^[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F ]+$/.test('ای‌پسر تو چه می‌دانی؟')
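
The two patterns above could be wrapped in a small helper so that the calling code only decides whether spaces are allowed. This is a minimal sketch: the function name `isPersianText` and the `allowSpace` parameter are hypothetical and not part of the answer, and the character classes are copied from the patterns above.

```javascript
// Character classes copied from the answer: Arabic block, U+FB8A,
// the extra Persian letters, ZWNJ (U+200C) and RLM (U+200F).
const PERSIAN_ONLY = /^[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F]+$/;
const PERSIAN_WITH_SPACE = /^[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F ]+$/;

// Hypothetical helper: returns true when every character of `input`
// is in the chosen class. An empty string is rejected because of `+`.
function isPersianText(input, allowSpace = false) {
  const pattern = allowSpace ? PERSIAN_WITH_SPACE : PERSIAN_ONLY;
  return pattern.test(input);
}

// "می‌دانی" written with escapes, including the ZWNJ (U+200C)
const word = '\u0645\u06CC\u200C\u062F\u0627\u0646\u06CC';
// "سلام دنیا" (two words separated by an ordinary space)
const phrase = '\u0633\u0644\u0627\u0645 \u062F\u0646\u06CC\u0627';

console.log(isPersianText(word));         // true
console.log(isPersianText(phrase));       // false, space not allowed
console.log(isPersianText(phrase, true)); // true
console.log(isPersianText('abc', true));  // false
```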
琉璃瓶的回忆
#3 · 2019-01-02 19:03

What you currently have in your regex is the standard Arabic character range. For additional characters, you need to add them to the regex separately. Here are their codes:

ژ \u0698
پ \u067E
چ \u0686
گ \u06AF

So all in all you should have

^[\u0600-\u06FF\u0698\u067E\u0686\u06AF]+$
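
As a quick sanity check, the four code points listed above can be tested one by one against a pattern. The sketch below (JavaScript, with hypothetical variable names) runs them against the original `\u0600-\u06FF` range. Numerically all four lie between U+0600 and U+06FF, so if the original validator still rejects some input, it may be worth checking what the text box actually contains, for example presentation forms or a zero-width non-joiner.

```javascript
// The range from the question and the four letters from the answer.
const arabicBlock = /^[\u0600-\u06FF]+$/;
const extraLetters = {
  zhe: '\u0698', // ژ
  pe: '\u067E',  // پ
  che: '\u0686', // چ
  gaf: '\u06AF', // گ
};

// Test each letter on its own against the range.
const results = {};
for (const [name, ch] of Object.entries(extraLetters)) {
  results[name] = arabicBlock.test(ch);
}

console.log(results); // { zhe: true, pe: true, che: true, gaf: true }
```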
临风纵饮
#4 · 2019-01-02 19:07

Note: persianRex is written in JavaScript; however, you can read the source code and copy the character lists from it.

Detecting Persian characters is a tricky task due to the variety of keyboard layouts and operating systems. I faced the same challenge some time ago and decided to write an open source library to fix this issue.

You can fix your issue like this: persianRex.text.test(yourInput); // returns true or false

Here is the full documentation: http://imanmh.github.io/persianRex/

梦该遗忘
#5 · 2019-01-02 19:13

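Farsi, Dari and Tajik are out of my bailiwick, but a little rummaging through the Unicode code charts tells me that Arabic covers 5 Unicode code blocks:

  1. Arabic (U+0600-U+06FF)
  2. Arabic Supplement (U+0750-U+077F)
  3. Arabic Extended-A (U+08A0-U+08FF)
  4. Arabic Presentation Forms-A (U+FB50-U+FDFF)
  5. Arabic Presentation Forms-B (U+FE70-U+FEFF)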

You can get at them (at least some of them) in regular expressions using named blocks instead of explicit code point ranges: \p{IsArabicPresentationForms-A} will give you the 4th Unicode block in the preceding list.

You might also read Persian Computing in Unicode: http://behdad.org/download/Publications/persiancomputing/a007.pdf

深知你不懂我心
#6 · 2019-01-02 19:15

I'm not sure regex is the right way to do this; the problem is not specific to Persian or Arabic and applies equally to Chinese or Russian text. Perhaps you could check whether each character exists in your code page: if it is not in the code page, I doubt the user can enter it with an input device.

    var encoding = Encoding.GetEncoding(1256);   // code page 1256
    var expected = "گ چ پ ژ";
    var bytes = encoding.GetBytes(expected);
    // Round trip: string -> bytes -> string should reproduce the input
    Assert.AreEqual(expected, encoding.GetString(bytes));

The test performs a round trip: the input string is converted to bytes and back, and the result should match the original. The link shows the supported code pages.

弹指情弦暗扣
#7 · 2019-01-02 19:17

The named blocks, e.g. \p{Arabic}, cover the entire Arabic script, not just the Persian characters.
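
To illustrate the point in JavaScript: as far as I know, ECMAScript has no block properties, and the closest equivalent is the script property `\p{Script=Arabic}` with the `u` flag (this assumes an ES2018-capable engine). The sketch below, with a hypothetical variable name, shows that it accepts Arabic-only letters such as Arabic Kaf and Arabic Yeh just as readily as Persian ones.

```javascript
// Matches strings made up only of characters whose script is Arabic.
// Requires Unicode property escapes (ES2018) and the `u` flag.
const arabicScript = /^\p{Script=Arabic}+$/u;

console.log(arabicScript.test('\u067E')); // true: Persian peh (پ)
console.log(arabicScript.test('\u06A9')); // true: Persian kaf, keheh (ک)
console.log(arabicScript.test('\u0643')); // true: Arabic kaf (ك), not used in Persian
console.log(arabicScript.test('\u064A')); // true: Arabic yeh (ي), not used in Persian
console.log(arabicScript.test('a'));      // false: Latin letter
```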

The presentation forms (U+FB50-U+FDFF) should not be used in text and should be converted to the standard range (U+0600-U+06FF).
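
One way to do that conversion in JavaScript is Unicode compatibility normalization via `String.prototype.normalize('NFKC')`, which maps presentation forms back to their base letters. This is only a sketch under that assumption: the function name `toStandardArabicRange` is hypothetical, and NFKC also rewrites other compatibility characters (full-width Latin letters, for example), so it may change more than the Arabic forms.

```javascript
// Hypothetical helper: fold presentation forms into the standard range
// using compatibility normalization (NFKC).
function toStandardArabicRange(text) {
  return text.normalize('NFKC');
}

const standardRange = /^[\u0600-\u06FF]+$/;

// U+FB8A is ARABIC LETTER JEH ISOLATED FORM, a presentation form of ژ (U+0698).
const presentationForm = '\uFB8A';
const normalized = toStandardArabicRange(presentationForm);

console.log(standardRange.test(presentationForm)); // false
console.log(normalized === '\u0698');              // true
console.log(standardRange.test(normalized));       // true
```

Normalizing the input before validation would let the stricter patterns drop presentation forms such as `\uFB8A` from the character class.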

In order to only cover Persian we need the following:

  • The subset of Farsi characters out of the standard Arabic range, i.e. (U+0621-U+0624, U+0626-U+063A, U+0641-U+0642, U+0644-U+0648)
  • The standard Arabic diacritics (U+064B-U+0652)
  • The 2 additional diacritics (U+0654, U+0670)
  • The 4 extra Farsi characters "گ چ پ ژ" (U+067E, U+0686, U+0698, U+06AF)
  • U+06A9: Persian Kaf (formally: "Arabic Letter Keheh"; different notation from Arabic Kaf)
  • U+06CC: Farsi Yeh (a different notation from the Arabic Yeh)
  • U+200C: Zero-Width-Non-Joiner

So, the resulting regexp would be:

^[\u0621-\u0624\u0626-\u063A\u0641-\u0642\u0644-\u0648\u064B-\u0652\u067E\u0686\u0698\u06AF\u06CC\u06A9\u0654\u0670\u200C]+$
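
A JavaScript sketch of the character class built directly from the bullet list above (every code point written with four hex digits) shows the practical effect: Persian spellings pass, while the same word typed with Arabic Kaf (U+0643) is rejected. The variable names are hypothetical.

```javascript
// Character class assembled from the bullet list: Farsi subset of the
// Arabic range, diacritics, the four extra letters, keheh, Farsi yeh, ZWNJ.
const persianOnly = new RegExp(
  '^[' +
    '\\u0621-\\u0624\\u0626-\\u063A\\u0641-\\u0642\\u0644-\\u0648' + // base letters
    '\\u064B-\\u0652\\u0654\\u0670' +                                // diacritics
    '\\u067E\\u0686\\u0698\\u06AF' +                                 // پ چ ژ گ
    '\\u06A9\\u06CC' +                                               // keheh, Farsi yeh
    '\\u200C' +                                                      // ZWNJ
  ']+$'
);

const persianKetab = '\u06A9\u062A\u0627\u0628'; // "کتاب" with Persian kaf (U+06A9)
const arabicKitab = '\u0643\u062A\u0627\u0628';  // "كتاب" with Arabic kaf (U+0643)
const withZwnj = '\u0645\u06CC\u200C\u062F\u0627\u0646\u06CC'; // "می‌دانی"

console.log(persianOnly.test(persianKetab)); // true
console.log(persianOnly.test(arabicKitab));  // false
console.log(persianOnly.test(withZwnj));     // true
```

Note that this class contains no space or punctuation, so multi-word input would need those added explicitly.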

See also the exemplar characters for Persian listed here:

http://unicode.org/cldr/trac/browser/trunk/common/main/fa.xml
