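I am building software that sorts book indexes in different languages. It uses Perl and keys off the locale. I am developing it on Unix, but it needs to be portable to Windows. Should this work in principle, or, by relying on the locale, am I barking up the wrong tree? Bottom line: Windows is where I actually need this to work, but I am more comfortable developing in my UNIX environment.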
Answer 1:
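Assuming that your starting point is Unicode, because you have been careful to decode all incoming data no matter what its native encoding might be, it is easy to use the Unicode::Collate module as a starting point.
If you want locale tailoring, then you probably want to start with Unicode::Collate::Locale instead.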
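Decode into Unicode
If you run in an all-UTF8 environment, this is easy. If you are subject to the vicissitudes of random so-called "locales" (or even worse, the ugly things Microsoft calls "code pages"), then you might want to get the CPAN Encode::Locale module to handle some of this for you. For example: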
use Encode;
use Encode::Locale;
# use "locale" as an arg to encode/decode
@ARGV = map { decode(locale => $_) } @ARGV;
# or as a stream for binmode or open
binmode $some_fh, ":encoding(locale)";
binmode STDIN, ":encoding(console_in)" if -t STDIN;
binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
binmode STDERR, ":encoding(console_out)" if -t STDERR;
(If it were me, I would just use ":utf8" for the output.)
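When the source encoding is known in advance rather than taken from the environment, the decoding step can also be spelled out by hand. A minimal sketch, assuming a made-up index entry that arrives as Windows-1252 (cp1252) bytes; the sample data and variable names are illustrative only:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: "café" as raw cp1252 bytes (0xE9 is e-acute).
my $bytes = "caf\xE9";

# Decode the bytes into Perl's internal character representation.
my $text = decode('cp1252', $bytes);

# Encode explicitly on the way out, here as UTF-8.
my $utf8 = encode('UTF-8', $text);

printf "%d characters, %d bytes as UTF-8\n", length($text), length($utf8);
```

The point of the sketch is that collation should only ever see the decoded `$text`, never the raw bytes.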
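Standard sorting, plus locales and tailoring
The point is, once you have everything decoded into Perl's internal format, you can use Unicode::Collate and Unicode::Collate::Locale on it. These can be really easy: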
use v5.14;
use utf8;
use Unicode::Collate;
my @exes = qw( x⁷ x⁰ x⁸ x³ x⁶ x⁵ x⁴ x² x⁹ x¹ );
@exes = Unicode::Collate->new->sort(@exes);
say "@exes";
# prints: x⁰ x¹ x² x³ x⁴ x⁵ x⁶ x⁷ x⁸ x⁹
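To see why the collator matters, it may help to compare it with Perl's built-in sort, which orders strings by code point. A small sketch with made-up words:

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

my @words = qw(apple Zebra banana Éclair);

# Built-in sort compares code points: uppercase ASCII first, accented letters last.
my @by_codepoint = sort @words;

# The default Unicode Collation Algorithm orders by letter first, then accent and case.
my @by_uca = Unicode::Collate->new->sort(@words);

print "code point order: @by_codepoint\n";   # Zebra apple banana Éclair
print "UCA order:        @by_uca\n";         # apple banana Éclair Zebra
```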
Or they can be pretty fancy. Here is one that tries to deal with book titles: it strips leading articles and zero-pads numbers.
my $collator = Unicode::Collate->new(
    upper_before_lower => 1,
    preprocess => sub {
        local $_ = shift;
        s/^ (?: The | An? ) \h+ //x; # strip leading articles
        s/ ( \d+ ) / sprintf "%020d", $1 /xeg; # zero-pad numbers
        return $_;
    },
);
Now just use that object's sort method to sort with.
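As a minimal sketch of how such a title collator behaves, here it is applied to a few sample titles (the titles are made up for illustration). Note that preprocessing only affects the sort keys; the strings that come back are the originals:

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new(
    upper_before_lower => 1,
    preprocess => sub {
        my $str = shift;
        $str =~ s/^ (?: The | An? ) \h+ //x;              # strip leading articles
        $str =~ s/ ( \d+ ) / sprintf "%020d", $1 /xeg;    # zero-pad numbers
        return $str;
    },
);

# Illustrative sample data.
my @titles = ("The Hobbit", "Catch 22", "A Tale of Two Cities", "Catch 9");
my @sorted = $collator->sort(@titles);

print "$_\n" for @sorted;
# Catch 9
# Catch 22
# The Hobbit
# A Tale of Two Cities
```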
Sometimes you need to turn the sort inside out. For example:
my $collator = Unicode::Collate->new();
for my $rec (@recs) {
$rec->{NAME_key} =
$collator->getSortKey( $rec->{NAME} );
}
my @srecs = sort {
$b->{AGE} <=> $a->{AGE}
||
$a->{NAME_key} cmp $b->{NAME_key}
} @recs;
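The reason you have to do this is that you are sorting records on several different fields. The binary sort key lets you use the cmp operator on data that has already been through your chosen or custom collator object.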
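A self-contained sketch of the same idea with sample records (names and ages are made up), sorting by age descending and then by name in collation order:

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# Illustrative records.
my @recs = (
    { NAME => "Zoe",   AGE => 30 },
    { NAME => "émile", AGE => 30 },
    { NAME => "adam",  AGE => 45 },
    { NAME => "Eve",   AGE => 30 },
);

# Compute each sort key once, up front.
my $collator = Unicode::Collate->new;
$_->{NAME_key} = $collator->getSortKey( $_->{NAME} ) for @recs;

# Oldest first; ties broken by name, comparing the binary keys with plain cmp.
my @srecs = sort {
    $b->{AGE} <=> $a->{AGE}
        ||
    $a->{NAME_key} cmp $b->{NAME_key}
} @recs;

my @names = map { $_->{NAME} } @srecs;
print "@names\n";   # adam émile Eve Zoe
```

Comparing the raw names with cmp instead would put "émile" after "Zoe", because é has a higher code point than any ASCII letter.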
The full constructor for the collator object has this formal syntax:
$Collator = Unicode::Collate->new(
UCA_Version => $UCA_Version,
alternate => $alternate, # alias for 'variable'
backwards => $levelNumber, # or \@levelNumbers
entry => $element,
hangul_terminator => $term_primary_weight,
highestFFFF => $bool,
identical => $bool,
ignoreName => qr/$ignoreName/,
ignoreChar => qr/$ignoreChar/,
ignore_level2 => $bool,
katakana_before_hiragana => $bool,
level => $collationLevel,
minimalFFFE => $bool,
normalization => $normalization_form,
overrideCJK => \&overrideCJK,
overrideHangul => \&overrideHangul,
preprocess => \&preprocess,
rearrange => \@charList,
rewrite => \&rewrite,
suppress => \@charList,
table => $filename,
undefName => qr/$undefName/,
undefChar => qr/$undefChar/,
upper_before_lower => $bool,
variable => $variable,
);
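Of these parameters, level is probably the one most often worth setting. A short sketch, assuming the default collation table, of how it controls which differences count when comparing strings:

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

my $primary   = Unicode::Collate->new(level => 1);   # base letters only
my $secondary = Unicode::Collate->new(level => 2);   # base letters and accents
my $default   = Unicode::Collate->new;               # all levels

# Level 1 ignores both accents and case.
my $l1_equal        = $primary->eq("resume", "RÉSUMÉ")   ? 1 : 0;   # 1

# Level 2 still ignores case, but accents now matter.
my $l2_case_equal   = $secondary->eq("resume", "RESUME") ? 1 : 0;   # 1
my $l2_accent_equal = $secondary->eq("resume", "résumé") ? 1 : 0;   # 0

# With all levels enabled, case differences matter too.
my $default_equal   = $default->eq("resume", "RESUME")   ? 1 : 0;   # 0

print "$l1_equal $l2_case_equal $l2_accent_equal $default_equal\n";
```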
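But you usually don't have to worry about almost any of those. In fact, if you want country-specific locale tailoring using the CLDR data, you should just use Unicode::Collate::Locale, which adds exactly one more parameter to the constructor: locale => $country_code.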
use Unicode::Collate::Locale;
$coll = Unicode::Collate::Locale->
new(locale => "fr");
@french_text = $coll->sort(@french_text);
See how easy that is?
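To make the effect of tailoring concrete, here is a small sketch comparing the default order with the Swedish tailoring, where ö sorts after z. The word list is just sample data:

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

my @words = qw(zon öga ost);

# Default UCA order treats ö as o plus a diaeresis.
my @default = Unicode::Collate->new->sort(@words);

# The Swedish tailoring treats ö as a separate letter after z.
my @swedish = Unicode::Collate::Locale->new(locale => "sv")->sort(@words);

print "default: @default\n";   # öga ost zon
print "sv:      @swedish\n";   # ost zon öga
```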
But you can do other really cool things, too.
use Unicode::Collate::Locale;
my $Collator = new Unicode::Collate::Locale::
locale => "de__phonebook",
level => 1,
normalization => undef,
;
my $full = "Ich müß Perl studieren.";
my $sub = "MUESS";
if (my ($pos,$len) = $Collator->index($full, $sub)) {
my $match = substr($full, $pos, $len);
say "Found match of literal ‹$sub› in ‹$full› as ‹$match›";
}
When run, that says:
Found match of literal ‹MUESS› in ‹Ich müß Perl studieren.› as ‹müß›
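Besides index, the collator also offers gmatch and gsubst for collation-aware matching and replacement. A minimal sketch with a made-up string; as with index, these methods require normalization => undef and no preprocess:

```perl
use strict;
use warnings;
use Unicode::Collate;

# Level 1 ignores case and accents; normalization must be off for searching.
my $collator = Unicode::Collate->new(
    level         => 1,
    normalization => undef,
);

my $text = "Perl, PERL, and perl";

# Collect every part of $text that matches "perl" at level 1.
my @hits = $collator->gmatch($text, "perl");

# Replace every match in place; the return value is the replacement count.
my $count = $collator->gsubst($text, "perl", "Perl");

print scalar(@hits), " matches: @hits\n";   # 3 matches: Perl PERL perl
print "$count replaced: $text\n";           # 3 replaced: Perl, Perl, and Perl
```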
Here are the locales available as of v0.96 of the Unicode::Collate::Locale module, taken from its manpage:
locale name         description
--------------------------------------------------------------
af                  Afrikaans
ar                  Arabic
as                  Assamese
az                  Azerbaijani (Azeri)
be                  Belarusian
bg                  Bulgarian
bn                  Bengali
bs                  Bosnian
bs_Cyrl             Bosnian in Cyrillic (tailored as Serbian)
ca                  Catalan
cs                  Czech
cy                  Welsh
da                  Danish
de__phonebook       German (umlaut as 'ae', 'oe', 'ue')
ee                  Ewe
eo                  Esperanto
es                  Spanish
es__traditional     Spanish ('ch' and 'll' as a grapheme)
et                  Estonian
fa                  Persian
fi                  Finnish (v and w are primary equal)
fi__phonebook       Finnish (v and w as separate characters)
fil                 Filipino
fo                  Faroese
fr                  French
gu                  Gujarati
ha                  Hausa
haw                 Hawaiian
hi                  Hindi
hr                  Croatian
hu                  Hungarian
hy                  Armenian
ig                  Igbo
is                  Icelandic
ja                  Japanese [1]
kk                  Kazakh
kl                  Kalaallisut
kn                  Kannada
ko                  Korean [2]
kok                 Konkani
ln                  Lingala
lt                  Lithuanian
lv                  Latvian
mk                  Macedonian
ml                  Malayalam
mr                  Marathi
mt                  Maltese
nb                  Norwegian Bokmal
nn                  Norwegian Nynorsk
nso                 Northern Sotho
om                  Oromo
or                  Oriya
pa                  Punjabi
pl                  Polish
ro                  Romanian
ru                  Russian
sa                  Sanskrit
se                  Northern Sami
si                  Sinhala
si__dictionary      Sinhala (U+0DA5 = U+0DA2,0DCA,0DA4)
sk                  Slovak
sl                  Slovenian
sq                  Albanian
sr                  Serbian
sr_Latn             Serbian in Latin (tailored as Croatian)
sv                  Swedish (v and w are primary equal)
sv__reformed        Swedish (v and w as separate characters)
ta                  Tamil
te                  Telugu
th                  Thai
tn                  Tswana
to                  Tonga
tr                  Turkish
uk                  Ukrainian
ur                  Urdu
vi                  Vietnamese
wae                 Walser
wo                  Wolof
yo                  Yoruba
zh                  Chinese
zh__big5han         Chinese (ideographs: big5 order)
zh__gb2312han       Chinese (ideographs: GB-2312 order)
zh__pinyin          Chinese (ideographs: pinyin order) [3]
zh__stroke          Chinese (ideographs: stroke order) [3]
zh__zhuyin          Chinese (ideographs: zhuyin order) [3]
Locales according to the default UCA rules include chr (Cherokee), de (German), en (English), ga (Irish), id (Indonesian),
it (Italian), ka (Georgian), ms (Malay), nl (Dutch), pt (Portuguese), st (Southern Sotho), sw (Swahili), xh (Xhosa), zu
(Zulu).
Note
[1] ja: Ideographs are sorted in JIS X 0208 order. Fullwidth and halfwidth forms are identical to their regular form. The
difference between hiragana and katakana is at the 4th level, the comparison also requires "(variable => 'Non-ignorable')",
and then "katakana_before_hiragana" has no effect.
[2] ko: Plenty of ideographs are sorted by their reading. Such an ideograph is primary (level 1) equal to, and secondary
(level 2) greater than, the corresponding hangul syllable.
[3] zh__pinyin, zh__stroke and zh__zhuyin: implemented alt='short', where a smaller number of ideographs are tailored.
Note: 'pinyin' is in latin, 'zhuyin' is in bopomofo.
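So in summary, the main trick is to get your local data decoded into a uniform Unicode representation, then use deterministic sorting, possibly tailored, that does not rely on the random settings of the user's console window for correct behavior.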
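Putting the pieces together, the whole pipeline might be sketched as follows. The helper name sort_entries, the cp1252 input, and the sample words are assumptions made for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Collate::Locale;

# Hypothetical helper: decode raw bytes, sort with a locale tailoring,
# and re-encode the result as UTF-8.
sub sort_entries {
    my ($encoding, $locale, @raw) = @_;
    my @text     = map { decode($encoding, $_) } @raw;
    my $collator = Unicode::Collate::Locale->new(locale => $locale);
    return map { encode('UTF-8', $_) } $collator->sort(@text);
}

# Sample input as cp1252 bytes; 0xF6 is o-diaeresis.
my @raw    = ("zon", "\xF6ga", "ost");
my @sorted = sort_entries('cp1252', 'sv', @raw);

# In Swedish order: "ost", "zon", then "öga" (now as UTF-8 bytes).
print join("\n", @sorted), "\n";
```

Because the order depends only on the decoded text and the named tailoring, the same input sorts the same way on Unix and on Windows.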
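Note: All these examples, apart from the manpage citation, are lovingly lifted from the 4th edition of Programming Perl, by its author's kind permission. :)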
Answer 2:
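Win32::OLE::NLS gives you access to that part of the system. It provides CompareString and the tools needed to obtain the required locale ID.
In case you want or need to find the system documentation, the underlying system call is named CompareStringEx.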
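A rough sketch of what that could look like, untested here and resting on two assumptions: that CompareString takes (LCID, FLAGS, STR1, STR2), and that it follows the Win32 convention of returning 1, 2, or 3 for less, equal, or greater (0 on failure). The wrapper name nls_compare is hypothetical, and string encoding details are glossed over:

```perl
use strict;
use warnings;

# Hypothetical wrapper: returns -1, 0, or 1 like cmp, or nothing when the
# Windows NLS API is unavailable or the call fails.
sub nls_compare {
    my ($left, $right) = @_;
    return unless $^O eq 'MSWin32';          # only meaningful on Windows

    require Win32::OLE::NLS;
    my $lcid   = Win32::OLE::NLS::GetUserDefaultLCID();
    # Flags of 0 request the default comparison (assumption).
    my $result = Win32::OLE::NLS::CompareString($lcid, 0, $left, $right);

    return unless $result;                   # 0 signals an error
    return $result - 2;                      # map 1/2/3 to -1/0/1
}

my $order = nls_compare("apple", "banana");
print defined $order ? "result: $order\n" : "Windows NLS not available here\n";
```

Unlike Unicode::Collate, the result here depends on the Windows version and the user's locale settings, so it will not necessarily match what you see during development on Unix.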