Unicode::Collate::Locale - Linguistic tailoring for DUCET via Unicode::Collate
use Unicode::Collate::Locale;
#construct
$Collator = Unicode::Collate::Locale->
new(locale => $locale_name, %tailoring);
#sort
@sorted = $Collator->sort(@not_sorted);
#compare
$result = $Collator->cmp($a, $b); # returns 1, 0, or -1.
Note: Strings in @not_sorted, $a and $b are interpreted according to Perl's Unicode support. See perlunicode, perluniintro, perlunitut, perlunifaq, utf8. Otherwise you can use preprocess (see Unicode::Collate) or decode them beforehand.
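The note above can be illustrated with a minimal sketch that decodes raw UTF-8 bytes before collating. The sample words and variable names are assumptions chosen for illustration; only Encode's decode/encode and the sort method from the synopsis are used.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Collate::Locale;

# Raw UTF-8 bytes, as they might arrive from a file or a socket
# (hypothetical sample data: "pez", "pe\x{F1}a" as bytes, "pena").
my @raw = ("pez", "pe\xC3\xB1a", "pena");

# Decode to Perl character strings before handing them to the collator.
my @decoded = map { decode('UTF-8', $_) } @raw;

my $collator = Unicode::Collate::Locale->new(locale => 'es');
my @sorted   = $collator->sort(@decoded);

# Encode again only at the output boundary.
print encode('UTF-8', $_), "\n" for @sorted;
```

Decoding at the input boundary keeps the collator working on characters rather than bytes; passing a preprocess routine to the constructor is the alternative the note mentions.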
This module provides linguistic tailoring for DUCET by taking advantage of Unicode::Collate.
The new method returns a collator object.
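The parameter list for the constructor is a hash, which can include a special key locale whose value (case-insensitive) stands for a Unicode base language code (two or three letters). For example, Unicode::Collate::Locale->new(locale => 'ES') returns a collator tailored for Spanish.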
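$locale_name may be suffixed with a Unicode script code (four letters), a Unicode region (territory) code, and a Unicode language variant code. These codes are case-insensitive and separated with '_' or '-'. For example, en_US stands for English in the USA, az_Cyrl for Azerbaijani in the Cyrillic script, and es_ES_traditional for Spanish in Spain (traditional).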
If $locale_name is not available, a fallback is selected in the following order:
1. language with a variant code
2. language with a script code
3. language with a region code
4. language
5. default
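The fallback order above can be observed through getlocale, which reports the locale that was actually accepted. This is a small sketch; resolved_locale is a hypothetical helper introduced only for this example, and the exact result for a given name may depend on the installed module version.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Hypothetical helper: build a collator and report the accepted locale.
sub resolved_locale {
    my ($name) = @_;
    return Unicode::Collate::Locale->new(locale => $name)->getlocale;
}

# Names are case-insensitive; unavailable combinations fall back
# towards the plain language and finally to 'default'.
for my $name ('ES', 'es_ES_traditional', 'en_US') {
    printf "%-20s => %s\n", $name, resolved_locale($name);
}
```

Here 'ES' resolves to the Spanish tailoring, 'es_ES_traditional' falls back to the language-with-variant entry, and 'en_US' ends at 'default' because English follows the default rules.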
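Tailoring tags provided by Unicode::Collate are allowed as long as they are not used for locale support. In particular, the table tag can never be tailored, since it is reserved for DUCET.
However, entry is allowed, even if it is used for locale support, in order to add or override mappings.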
For example, a collator for Spanish that ignores diacritics and case differences (i.e. level 1), with reversed case ordering and no normalization:
Unicode::Collate::Locale->new(
level => 1,
locale => 'es',
upper_before_lower => 1,
normalization => undef
)
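As a usage sketch for the constructor call above, the following compares a few strings with such a collator. The sample words are assumptions; eq and cmp are the comparison methods inherited from Unicode::Collate.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my $collator = Unicode::Collate::Locale->new(
    level              => 1,
    locale             => 'es',
    upper_before_lower => 1,
    normalization      => undef,
);

# At level 1, case and accent differences are ignored
# ("\x{F3}" is o with acute).
my $same = $collator->eq("COMO", "c\x{F3}mo");

# n and n-with-tilde ("\x{F1}") are distinct letters in Spanish,
# so they still differ at the primary level.
my $order = $collator->cmp("pena", "pe\x{F1}a");

print "equal at level 1: ", ($same ? "yes" : "no"), "\n";
print "pena vs pe\\x{F1}a: $order\n";
```

Note that with level => 1 the upper_before_lower setting has no visible effect on these comparisons, because case is a lower-level difference.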
Overriding a behavior already tailored by locale is not allowed when such a tailoring is passed to new().
Unicode::Collate::Locale->new(
locale => 'da',
upper_before_lower => 0, # causes error as reserved by 'da'
)
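The failure described above can be caught with eval. This is a minimal sketch; the exact wording of the error message is not assumed here, only that construction fails.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Attempt to override a tag that the 'da' locale reserves.
my $collator = eval {
    Unicode::Collate::Locale->new(
        locale             => 'da',
        upper_before_lower => 0,
    );
};
my $error = $@;

print defined $collator ? "constructed\n" : "failed: $error\n";
```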
However, change(), inherited from Unicode::Collate, allows a tailoring that is reserved by locale to be changed. Examples:
new(locale => 'fr_ca')->change(backwards => undef)
new(locale => 'da')->change(upper_before_lower => 0)
new(locale => 'ja')->change(overrideCJK => undef)
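A short sketch of the second example, showing the effect of change() on a Danish collator. The two-letter input is an assumption chosen to make the case ordering visible.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my $danish = Unicode::Collate::Locale->new(locale => 'da');

# The 'da' locale sorts uppercase before lowercase.
my @locale_order = $danish->sort('a', 'A');

# change() may adjust a tag that new() would refuse to override.
$danish->change(upper_before_lower => 0);
my @changed_order = $danish->sort('a', 'A');

print "locale order:  @locale_order\n";
print "changed order: @changed_order\n";
```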
Unicode::Collate::Locale is a subclass of Unicode::Collate, and methods other than new are inherited from Unicode::Collate.
The additional methods are listed below.
$Collator->getlocale
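Returns the language code that was accepted and is actually used for collation. If linguistic tailoring is not provided for the language code you passed (intentionally for some languages, or because the implementation is incomplete), this method returns the string 'default', meaning no special tailoring.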
$Collator->locale_version
(Since Unicode::Collate::Locale 0.87) Returns the version number (perhaps matching /\d\.\d\d/) of the locale, which is that of the corresponding Locale/*.pl.
Note: The Locale/*.pl that a collator uses should be identified by the combination of the return values of getlocale and locale_version.
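A minimal sketch of identifying the tailoring in use by combining both methods. The locale name 'sv_SE' is an assumption for illustration; the can() guard covers module versions older than 0.87, where locale_version is not available.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my $collator = Unicode::Collate::Locale->new(locale => 'sv_SE');

# Locale actually accepted (after any fallback).
my $locale = $collator->getlocale;

# Version of the locale data, if this module version reports one.
my $version = $collator->can('locale_version')
    ? $collator->locale_version
    : undef;

printf "locale=%s version=%s\n",
    $locale, defined $version ? $version : '(unknown)';
```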
locale name description
--------------------------------------------------------------
af Afrikaans
ar Arabic
as Assamese
az Azerbaijani (Azeri)
be Belarusian
bn Bengali
bs Bosnian (tailored as Croatian)
bs_Cyrl Bosnian in Cyrillic (tailored as Serbian)
ca Catalan
cs Czech
cu Church Slavic
cy Welsh
da Danish
de__phonebook German (umlaut as 'ae', 'oe', 'ue')
de_AT_phonebook Austrian German (umlaut primary greater)
dsb Lower Sorbian
ee Ewe
eo Esperanto
es Spanish
es__traditional Spanish ('ch' and 'll' as a grapheme)
et Estonian
fa Persian
fi Finnish (v and w are primary equal)
fi__phonebook Finnish (v and w as separate characters)
fil Filipino
fo Faroese
fr_CA Canadian French
gu Gujarati
ha Hausa
haw Hawaiian
he Hebrew
hi Hindi
hr Croatian
hu Hungarian
hy Armenian
ig Igbo
is Icelandic
ja Japanese [1]
kk Kazakh
kl Kalaallisut
kn Kannada
ko Korean [2]
kok Konkani
lkt Lakota
ln Lingala
lt Lithuanian
lv Latvian
mk Macedonian
ml Malayalam
mr Marathi
mt Maltese
nb Norwegian Bokmål
nn Norwegian Nynorsk
nso Northern Sotho
om Oromo
or Oriya
pa Punjabi
pl Polish
ro Romanian
sa Sanskrit
se Northern Sami
si Sinhala
si__dictionary Sinhala (U+0DA5 = U+0DA2,0DCA,0DA4)
sk Slovak
sl Slovenian
sq Albanian
sr Serbian
sr_Latn Serbian in Latin (tailored as Croatian)
sv Swedish (v and w are primary equal)
sv__reformed Swedish (v and w as separate characters)
ta Tamil
te Telugu
th Thai
tn Tswana
to Tonga
tr Turkish
ug_Cyrl Uyghur in Cyrillic
uk Ukrainian
ur Urdu
vi Vietnamese
vo Volapük
wae Walser
wo Wolof
yo Yoruba
zh Chinese
zh__big5han Chinese (ideographs: big5 order)
zh__gb2312han Chinese (ideographs: GB-2312 order)
zh__pinyin Chinese (ideographs: pinyin order) [3]
zh__stroke Chinese (ideographs: stroke order) [3]
zh__zhuyin Chinese (ideographs: zhuyin order) [3]
--------------------------------------------------------------
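Locales that follow the default UCA rules include am (Amharic) without [reorder Ethi], bg (Bulgarian) without [reorder Cyrl], chr (Cherokee) without [reorder Cher], de (German), en (English), fr (French), ga (Irish), id (Indonesian), it (Italian), ka (Georgian) without [reorder Geor], mn (Mongolian) without [reorder Cyrl Mong], ms (Malay), nl (Dutch), pt (Portuguese), ru (Russian) without [reorder Cyrl], sw (Swahili), zu (Zulu).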
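Notes
[1] ja: Ideographs are sorted in JIS X 0208 order. Fullwidth and halfwidth forms are identical to their regular forms. The difference between hiragana and katakana is at the 4th level; the comparison also requires (variable => 'Non-ignorable'), and then katakana_before_hiragana has no effect.
[2] ko: Many ideographs are sorted by their reading. Such an ideograph is primary (level 1) equal to, and secondary (level 2) greater than, the corresponding hangul syllable.
[3] zh__pinyin, zh__stroke and zh__zhuyin: alt='short' is implemented, in which a smaller number of ideographs are tailored.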
variant code       alias
------------------------------------------
dictionary         dict
phonebook          phone     phonebk
reformed           reform
traditional        trad
------------------------------------------
big5han            big5
gb2312han          gb2312
pinyin
stroke
zhuyin
------------------------------------------
Note: 'pinyin' is Han in Latin, 'zhuyin' is Han in Bopomofo.
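The variant aliases can be tried out with a small sketch that selects the German phonebook tailoring through its short alias and compares it with the default ordering. The two sample words are assumptions; "\x{C4}" is A with diaeresis.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

my $default   = Unicode::Collate::Locale->new(locale => 'de');
my $phonebook = Unicode::Collate::Locale->new(locale => 'de-phone');

my @words = ("\x{C4}rger", "Afrika");

# Default rules: the umlaut differs from 'A' only at a lower level.
my @by_default = $default->sort(@words);

# Phonebook rules: the umlaut is treated as 'ae' at the primary level.
my @by_phonebook = $phonebook->sort(@words);

print "accepted: ", $phonebook->getlocale, "\n";
print "default:   @by_default\n";
print "phonebook: @by_phonebook\n";
```

Under the phonebook tailoring the first word compares like "Aerger" and therefore sorts before "Afrika"; under the default rules it compares like "Arger" and sorts after it.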
Installation of Unicode::Collate::Locale requires Collate/Locale.pm, Collate/Locale/*.pm, Collate/CJK/*.pm and Collate/allkeys.txt. On building, Unicode::Collate::Locale does not require any of data/*.txt, gendata/* and mklocale. Tests for Unicode::Collate::Locale are named t/loc_*.t.
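Even if a certain letter is tailored, its equivalents are not always tailored as well. For example, even though W is tailored, fullwidth W (U+FF37), W with acute (U+1E82), etc. are not tailored. The result may depend on whether the source strings are normalized or not, and on whether they are decomposed or composed. Thus (normalization => undef) is less preferred.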
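To illustrate why leaving normalization enabled is the safer choice, this sketch compares a composed and a decomposed spelling of the same word with a collator that keeps its default normalization. The sample word is an assumption; NFC and NFD come from the core Unicode::Normalize module.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);
use Unicode::Collate::Locale;

# Default normalization is left in place (no normalization => undef).
my $collator = Unicode::Collate::Locale->new(locale => 'es');

my $composed   = NFC("pe\x{F1}a");   # n with tilde as one code point
my $decomposed = NFD("pe\x{F1}a");   # 'n' followed by a combining tilde

my $result = $collator->cmp($composed, $decomposed);

printf "lengths: %d vs %d, cmp: %d\n",
    length $composed, length $decomposed, $result;
```

The two strings differ in length, yet they compare as equal because the collator normalizes both before looking up collation elements.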
The order of any groups, including scripts, is not changed.
locale based CLDR or other reference
--------------------------------------------------------------------
af 30 = 1.8.1
ar 30 = 28 ("compat" wo [reorder Arab]) = 1.9.0
as 30 = 28 (without [reorder Beng..]) = 23
az 30 = 24 ("standard" wo [reorder Latn Cyrl])
be 30 = 28 (without [reorder Cyrl])
bn 30 = 28 ("standard" wo [reorder Beng..]) = 2.0.1
bs 30 = 28 (type="standard": [import hr])
bs_Cyrl 30 = 28 (type="standard": [import sr])
ca 30 = 23 (alt="proposed" type="standard")
cs 30 = 1.8.1 (type="standard")
cu 34 = 30 (without [reorder Cyrl])
cy 30 = 1.8.1
da 22.1 = 1.8.1 (type="standard")
de__phonebook 30 = 2.0 (type="phonebook")
de_AT_phonebook 30 = 27 (type="phonebook")
dsb 30 = 26
ee 30 = 21
eo 30 = 1.8.1
es 30 = 1.9.0 (type="standard")
es__traditional 30 = 1.8.1 (type="traditional")
et 30 = 26
fa 22.1 = 1.8.1
fi 22.1 = 1.8.1 (type="standard" alt="proposed")
fi__phonebook 22.1 = 1.8.1 (type="phonebook")
fil 30 = 1.9.0 (type="standard") = 1.8.1
fo 22.1 = 1.8.1 (alt="proposed" type="standard")
fr_CA 30 = 1.9.0
gu 30 = 28 ("standard" wo [reorder Gujr..]) = 1.9.0
ha 30 = 1.9.0
haw 30 = 24
he 30 = 28 (without [reorder Hebr]) = 23
hi 30 = 28 (without [reorder Deva..]) = 1.9.0
hr 30 = 28 ("standard" wo [reorder Latn Cyrl]) = 1.9.0
hu 22.1 = 1.8.1 (alt="proposed" type="standard")
hy 30 = 28 (without [reorder Armn]) = 1.8.1
ig 30 = 1.8.1
is 22.1 = 1.8.1 (type="standard")
ja 22.1 = 1.8.1 (type="standard")
kk 30 = 28 (without [reorder Cyrl])
kl 22.1 = 1.8.1 (type="standard")
kn 30 = 28 ("standard" wo [reorder Knda..]) = 1.9.0
ko 22.1 = 1.8.1 (type="standard")
kok 30 = 28 (without [reorder Deva..]) = 1.8.1
lkt 30 = 25
ln 30 = 2.0 (type="standard") = 1.8.1
lt 22.1 = 1.9.0
lv 22.1 = 1.9.0 (type="standard") = 1.8.1
mk 30 = 28 (without [reorder Cyrl])
ml 22.1 = 1.9.0
mr 30 = 28 (without [reorder Deva..]) = 1.8.1
mt 22.1 = 1.9.0
nb 22.1 = 2.0 (type="standard")
nn 22.1 = 2.0 (type="standard")
nso [*] 26 = 1.8.1
om 22.1 = 1.8.1
or 30 = 28 (without [reorder Orya..]) = 1.9.0
pa 22.1 = 1.8.1
pl 30 = 1.8.1
ro 30 = 1.9.0 (type="standard")
sa [*] 1.9.1 = 1.8.1 (type="standard" alt="proposed")
se 22.1 = 1.8.1 (type="standard")
si 30 = 28 ("standard" wo [reorder Sinh..]) = 1.9.0
si__dictionary 30 = 28 ("dictionary" wo [reorder Sinh..]) = 1.9.0
sk 22.1 = 1.9.0 (type="standard")
sl 22.1 = 1.8.1 (type="standard" alt="proposed")
sq 22.1 = 1.8.1 (alt="proposed" type="standard")
sr 30 = 28 (without [reorder Cyrl])
sr_Latn 30 = 28 (type="standard": [import hr])
sv 22.1 = 1.9.0 (type="standard")
sv__reformed 22.1 = 1.8.1 (type="reformed")
ta 22.1 = 1.9.0
te 30 = 28 (without [reorder Telu..]) = 1.9.0
th 22.1 = 22
tn [*] 26 = 1.8.1
to 22.1 = 22
tr 22.1 = 1.8.1 (type="standard")
uk 30 = 28 (without [reorder Cyrl])
ug_Cyrl https://en.wikipedia.org/wiki/Uyghur_Cyrillic_alphabet
ur 22.1 = 1.9.0
vi 22.1 = 1.8.1
vo 30 = 25
wae 30 = 2.0
wo [*] 1.9.1 = 1.8.1
yo 30 = 1.8.1
zh 22.1 = 1.8.1 (type="standard")
zh__big5han 22.1 = 1.8.1 (type="big5han")
zh__gb2312han 22.1 = 1.8.1 (type="gb2312han")
zh__pinyin 22.1 = 2.0 (type='pinyin' alt='short')
zh__stroke 22.1 = 1.9.1 (type='stroke' alt='short')
zh__zhuyin 22.1 = 22 (type='zhuyin' alt='short')
--------------------------------------------------------------------
[*] http://www.unicode.org/repos/cldr/tags/latest/seed/collation/
The Unicode::Collate::Locale module for perl was written by SADAHIRO Tomoyuki.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.