在Perl多语言文本排序,在Windows下使用的语言环境(Multilingual text so

2019-07-20 08:28发布

我建立一个软件在不同的语言排序书索引。 它使用Perl和键关于区域。 我正在开发它在Unix,但它需要移植到Windows。 如若这一工作原则上,或依靠现场,我是不是找错了树? 底线,Windows实际上是我需要这个工作,但我更舒适的在我的UNIX环境中发展。

Answer 1:

假设你的出发点是Unicode,因为你一直很小心,不管其母语的编码可能是所有输入数据进行解码,那么很容易使用到Unicode::Collate模块为出发点。

如果你想现场剪裁,那么你可能要开始Unicode::Collate::Locale来代替。

解码成Unicode

如果您在全UTF8环境下运行,这是很容易的,但如果你受的随机所谓的“语言环境”的沧桑(或更糟的是,丑陋的东西微软称为“代码页”),那么你可能想拿到CPAN Encode::Locale模块,为您排忧解难。 例如:

 use Encode;
 use Encode::Locale;

 # use "locale" as an arg to encode/decode
 @ARGV = map { decode(locale =>  $_) } @ARGV;

 # or as a stream for binmode or open
 binmode $some_fh, ":encoding(locale)";

 binmode STDIN,  ":encoding(console_in)"  if -t STDIN;
 binmode STDOUT, ":encoding(console_out)"  if -t STDOUT;
 binmode STDERR, ":encoding(console_out)"  if -t STDERR;

(如果是我的话,我只想用":utf8"为输出)。


标准排序,加上语言环境和剪裁

问题的关键是,一旦你拥有了一切解码成内部Perl的格式,你可以使用Unicode::CollateUnicode::Collate::Locale就可以了。 这些都可以很容易的:

   use v5.14;
   use utf8;
   use Unicode::Collate;
   my @exes = qw( x⁷ x⁰ x⁸ x³ x⁶ x⁵ x⁴ x² x⁹ x¹ );
   @exes = Unicode::Collate->new->sort(@exes);
   say "@exes";

   # prints: x⁰ x¹ x² x³ x⁴ x⁵ x⁶ x⁷ x⁸ x⁹

或者,他们可以很花哨。 这是一个试图处理书名:这条重要文章和零垫数字。

my $collator = Unicode::Collate->new(
    --upper_before_lower => 1,
    --preprocess => {
        local $_ = shift;
        s/^ (?: The | An? ) \h+ //x;  # strip articles
        s/ ( \d+ ) / sprintf "%020d", $1 /xeg;
        return $_;
    };
);

现在只需要使用对象的sort方法与排序。

有时候,你需要打开内而外的那种。 例如:

 my $collator = Unicode::Collate->new();
 for my $rec (@recs) {
     $rec->{NAME_key} = 
        $collator->getSortKey( $rec->{NAME} );
 }
 @srecs = sort {
     $b->{AGE}       <=>  $a->{AGE}
                     ||
     $a->{NAME_key}  cmp  $b->{NAME_key}
 } @recs;

你必须这样做的原因是因为你在各种领域的记录进行排序。 二进制排序键让你使用cmp运营商对已经通过你的选择/自定义Collat​​or对象数据。

对于整理器对象的完整构造了这一切的正式语法:

      $Collator = Unicode::Collate->new(
         UCA_Version => $UCA_Version,
         alternate => $alternate, # alias for 'variable'
         backwards => $levelNumber, # or \@levelNumbers
         entry => $element,
         hangul_terminator => $term_primary_weight,
         highestFFFF => $bool,
         identical => $bool,
         ignoreName => qr/$ignoreName/,
         ignoreChar => qr/$ignoreChar/,
         ignore_level2 => $bool,
         katakana_before_hiragana => $bool,
         level => $collationLevel,
         minimalFFFE => $bool,
         normalization  => $normalization_form,
         overrideCJK => \&overrideCJK,
         overrideHangul => \&overrideHangul,
         preprocess => \&preprocess,
         rearrange => \@charList,
         rewrite => \&rewrite,
         suppress => \@charList,
         table => $filename,
         undefName => qr/$undefName/,
         undefChar => qr/$undefChar/,
         upper_before_lower => $bool,
         variable => $variable,
      );

但你平时不担心那些几乎任何。 事实上,如果你想具体国家的语言环境中使用CLDR数据裁缝,你应该只使用Unicode::Collate::Locale ,增加了正好一个参数的构造函数: locale => $country_code

 use Unicode::Collate::Locale;
 $coll = Unicode::Collate::Locale->
           new(locale => "fr");
 @french_text = $coll->sort(@french_text);

见多么容易那是什么?

但你可以做其他的很酷的事情了。

 use Unicode::Collate::Locale;
 my $Collator = new Unicode::Collate::Locale::
                 locale => "de__phonebook",
                 level  => 1,
                 normalization => undef,
                ;

 my $full = "Ich müß Perl studieren.";
 my $sub = "MUESS";
 if (my ($pos,$len) = $Collator->index($full, $sub)) {
     my $match = substr($full, $pos, $len);
     say "Found match of literal ‹$sub› in ‹$full› as ‹$match›";

 }

在运行时,上面写着:

 Found match of literal ‹MUESS› in ‹Ich müß Perl studieren.› as ‹müß›

以下是可用的语言环境中的v0.96中Unicode::Collate::Locale模块,从它的手册页采取:

 locale name       description
--------------------------------------------------------------
 af                Afrikaans
 ar                Arabic
 as                Assamese
 az                Azerbaijani (Azeri)
 be                Belarusian
 bg                Bulgarian
 bn                Bengali
 bs                Bosnian
 bs_Cyrl           Bosnian in Cyrillic (tailored as Serbian)
 ca                Catalan
 cs                Czech
 cy                Welsh
 da                Danish
 de__phonebook     German (umlaut as 'ae', 'oe', 'ue')
 ee                Ewe
 eo                Esperanto
 es                Spanish
 es__traditional   Spanish ('ch' and 'll' as a grapheme)
 et                Estonian
 fa                Persian
 fi                Finnish (v and w are primary equal)
 fi__phonebook     Finnish (v and w as separate characters)
 fil               Filipino
 fo                Faroese
 fr                French
 gu                Gujarati
 ha                Hausa
 haw               Hawaiian
 hi                Hindi
 hr                Croatian
 hu                Hungarian
 hy                Armenian
 ig                Igbo
 is                Icelandic
 ja                Japanese [1]
 kk                Kazakh
 kl                Kalaallisut
 kn                Kannada
 ko                Korean [2]
 kok               Konkani
 ln                Lingala
 lt                Lithuanian
 lv                Latvian
 mk                Macedonian
 ml                Malayalam
 mr                Marathi
 mt                Maltese
 nb                Norwegian Bokmal
 nn                Norwegian Nynorsk
 nso               Northern Sotho
 om                Oromo
 or                Oriya
 pa                Punjabi
 pl                Polish
 ro                Romanian
 ru                Russian
 sa                Sanskrit
 se                Northern Sami
 si                Sinhala
 si__dictionary    Sinhala (U+0DA5 = U+0DA2,0DCA,0DA4)
 sk                Slovak
 sl                Slovenian
 sq                Albanian
 sr                Serbian
 sr_Latn           Serbian in Latin (tailored as Croatian)
 sv                Swedish (v and w are primary equal)
 sv__reformed      Swedish (v and w as separate characters)
 ta                Tamil
 te                Telugu
 th                Thai
 tn                Tswana
 to                Tonga
 tr                Turkish
 uk                Ukrainian
 ur                Urdu
 vi                Vietnamese
 wae               Walser
 wo                Wolof
 yo                Yoruba
 zh                Chinese
 zh__big5han       Chinese (ideographs: big5 order)
 zh__gb2312han     Chinese (ideographs: GB-2312 order)
 zh__pinyin        Chinese (ideographs: pinyin order) [3]
 zh__stroke        Chinese (ideographs: stroke order) [3]
 zh__zhuyin        Chinese (ideographs: zhuyin order) [3]

   Locales according to the default UCA rules include chr (Cherokee), de (German), en (English), ga (Irish), id (Indonesian),
   it (Italian), ka (Georgian), ms (Malay), nl (Dutch), pt (Portuguese), st (Southern Sotho), sw (Swahili), xh (Xhosa), zu
   (Zulu).

   Note

   [1] ja: Ideographs are sorted in JIS X 0208 order.  Fullwidth and halfwidth forms are identical to their regular form.  The
   difference between hiragana and katakana is at the 4th level, the comparison also requires "(variable => 'Non-ignorable')",
   and then "katakana_before_hiragana" has no effect.

   [2] ko: Plenty of ideographs are sorted by their reading. Such an ideograph is primary (level 1) equal to, and secondary
   (level 2) greater than, the corresponding hangul syllable.

   [3] zh__pinyin, zh__stroke and zh__zhuyin: implemented alt='short', where a smaller number of ideographs are tailored.

   Note: 'pinyin' is in latin, 'zhuyin' is in bopomofo.

因此,在总结,主要伎俩是让你的本地数据解码成一个统一的Unicode表示,则使用确定性的排序,可能是定制的,不依赖于正确的行为,用户的控制台窗口的随机设置。


注:所有这些例子中,除了手册页引文,是从精心的Perl编程 4版解禁,其作者的一种许可。 :)



Answer 2:

的Win32 :: OLE :: NLS ,您可以访问系统中的一部分。 它为您提供CompareString和必要的工具,以获得必要的区域设置ID。

如果你想/需要找到系统文件,底层系统调用被命名为CompareStringEx



文章来源: Multilingual text sorting in Perl, on Windows, using locale