NAME

Unicode::Collate::Locale - Linguistic tailoring for DUCET via Unicode::Collate

SYNOPSIS

use Unicode::Collate::Locale;

#construct
$Collator = Unicode::Collate::Locale->
    new(locale => $locale_name, %tailoring);

#sort
@sorted = $Collator->sort(@not_sorted);

#compare
$result = $Collator->cmp($a, $b); # returns 1, 0, or -1.

Note: Strings in @not_sorted, $a and $b are interpreted according to Perl's Unicode support. See perlunicode, perluniintro, perlunitut, perlunifaq and utf8. Otherwise, you can use preprocess (see Unicode::Collate) or decode the strings beforehand.
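
As a minimal sketch of the "decode beforehand" option, the following assumes the input arrives as UTF-8 encoded bytes and uses the core Encode module to turn them into character strings before sorting. The sample words and the choice of the sv locale are illustrative assumptions, not part of the module's documentation.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Collate::Locale;

# Raw UTF-8 bytes, as they might arrive from a file or a socket.
# "\xC3\xA4" is the UTF-8 encoding of U+00E4 (a with diaeresis).
my @bytes = ("zebra", "\xC3\xA4pple", "apple");

# Decode to Perl character strings before handing them to the collator.
my @words = map { decode('UTF-8', $_) } @bytes;

my $sv     = Unicode::Collate::Locale->new(locale => 'sv');
my @sorted = $sv->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @sorted;
```

Under the Swedish tailoring, the word starting with U+00E4 sorts after "zebra", whereas the untailored DUCET order would place it right after "apple".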

DESCRIPTION

This module provides linguistic tailoring for the DUCET by building on Unicode::Collate.

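Constructor

The new method returns a collator object.

The parameter list for the constructor is a hash, which can include a special key locale whose value (case-insensitive) stands for a Unicode base language code (two or three letters). For example, Unicode::Collate::Locale->new(locale => 'ES') returns a collator tailored for Spanish.

$locale_name may be suffixed with a Unicode script code (four letters), a Unicode region (territory) code, and a Unicode language variant code. These codes are case-insensitive and separated with '_' or '-'. For example, en_US stands for English in the USA, az_Cyrl for Azerbaijani in the Cyrillic script, and es_ES_traditional for Spanish in Spain (traditional).

If $locale_name is not available, a fallback is selected in the following order: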

1. language with a variant code
2. language with a script code
3. language with a region code
4. language
5. default
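
The fallback order can be observed with getlocale, which reports the locale that was actually accepted. The sketch below is only an illustration: the requested names are arbitrary examples, and the results depend on which tailorings the installed version provides (here assumed to match the locale list in this document).

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Requested names; none of the region-qualified forms has its own entry
# in the list of tailorable locales, so each one falls back.
my @requests = ('es_MX', 'es_ES_traditional', 'sv-SE', 'en_US');

my %accepted;
for my $name (@requests) {
    my $collator = Unicode::Collate::Locale->new(locale => $name);
    $accepted{$name} = $collator->getlocale;
    printf "%-20s -> %s\n", $name, $accepted{$name};
}
```

With the locale list given in this document, es_MX and sv-SE fall back to the plain language, es_ES_traditional falls back to the language with a variant code, and en_US ends at default.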

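Tailoring tags provided by Unicode::Collate are allowed as long as they are not used for locale support. In particular, the table tag is never tailorable, since it is reserved for the DUCET.

However, entry is allowed, even if it is used for locale support, to add or override mappings.

For example, a collator for Spanish that ignores diacritics and case differences (i.e. level 1), with reversed case ordering and no normalization: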

Unicode::Collate::Locale->new(
    level => 1,
    locale => 'es',
    upper_before_lower => 1,
    normalization => undef
)
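
A short sketch of what this collator does in practice, using the eq method inherited from Unicode::Collate. The sample strings are assumptions chosen for illustration: one pair differs only in case and an accent, the other pair differs in n versus n with tilde, which the Spanish tailoring treats as distinct letters.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my $es = Unicode::Collate::Locale->new(
    level              => 1,
    locale             => 'es',
    upper_before_lower => 1,
    normalization      => undef,
);

# "\x{D3}" is O with acute, "\x{F1}" is n with tilde.
my $same_word   = $es->eq("camion", "CAMI\x{D3}N");  # case and accent only
my $same_letter = $es->eq("ano", "a\x{F1}o");        # n vs. n with tilde

print "camion eq CAMION (accented): ", ($same_word   ? "yes" : "no"), "\n";
print "ano eq ano (with n-tilde):   ", ($same_letter ? "yes" : "no"), "\n";
```

At level 1 the first pair compares equal, while the second does not, because the difference between n and n with tilde is a primary one under the es tailoring.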

Overriding a behavior that locale has already tailored is disallowed when such a tailoring is passed to new():

Unicode::Collate::Locale->new(
    locale => 'da',
    upper_before_lower => 0, # causes error as reserved by 'da'
)
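
Since the constructor raises an error in this case, calling code may want to trap it. A minimal sketch using eval follows; the exact wording of the error message is not specified here and should not be relied upon.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Attempt to override a tag reserved by the 'da' tailoring.
my $collator = eval {
    Unicode::Collate::Locale->new(
        locale             => 'da',
        upper_before_lower => 0,
    );
};
my $error = $@;

print defined $collator ? "constructed\n" : "rejected: $error";
```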

However, change(), inherited from Unicode::Collate, does accept a tailoring that is reserved by locale. Examples:

new(locale => 'fr_ca')->change(backwards => undef)
new(locale => 'da')->change(upper_before_lower => 0)
new(locale => 'ja')->change(overrideCJK => undef)
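
The second of these examples can be sketched in full as follows. It compares an uppercase and a lowercase letter before and after the change; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my $da = Unicode::Collate::Locale->new(locale => 'da');

# The 'da' tailoring reserves upper_before_lower, placing uppercase first.
my $before = $da->cmp('A', 'a');

# change() may still switch it off after construction.
$da->change(upper_before_lower => 0);
my $after = $da->cmp('A', 'a');

print "before change: $before\n";
print "after change:  $after\n";
```

Before the change 'A' sorts ahead of 'a' (cmp returns -1); afterwards the order is reversed (cmp returns 1).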

Methods

Unicode::Collate::Locale is a subclass of Unicode::Collate, and methods other than new are inherited from Unicode::Collate.

Here is a list of additional methods:

$Collator->getlocale

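Returns the language code actually accepted and used for collation. If no linguistic tailoring is provided for the language code you passed (intentionally for some languages, or because the implementation is incomplete), this method returns the string 'default', meaning no special tailoring.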

$Collator->locale_version

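(Since Unicode::Collate::Locale 0.87) Returns the version number (perhaps matching /\d\.\d\d/) of the locale, as that of Locale/*.pl.

Note: The Locale/*.pl file that a collator uses should be identified by the combination of the return values of getlocale and locale_version.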
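
A small sketch that reports both values for two requested locales. It assumes a module version of 0.87 or later, so that locale_version exists, and it guards against an undefined version, which is assumed to be possible when no tailoring file is in use.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my %accepted;
for my $name ('sv', 'de') {
    my $collator = Unicode::Collate::Locale->new(locale => $name);
    my $version  = $collator->locale_version;

    $accepted{$name} = $collator->getlocale;
    printf "%s: locale=%s version=%s\n",
        $name, $accepted{$name},
        defined $version ? $version : '(none)';
}
```

Here sv has its own tailoring, while de follows the default UCA rules and is therefore reported as default.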

A list of tailorable locales

  locale name       description
--------------------------------------------------------------
  af                Afrikaans
  ar                Arabic
  as                Assamese
  az                Azerbaijani (Azeri)
  be                Belarusian
  bn                Bengali
  bs                Bosnian (tailored as Croatian)
  bs_Cyrl           Bosnian in Cyrillic (tailored as Serbian)
  ca                Catalan
  cs                Czech
  cu                Church Slavic
  cy                Welsh
  da                Danish
  de__phonebook     German (umlaut as 'ae', 'oe', 'ue')
  de_AT_phonebook   Austrian German (umlaut primary greater)
  dsb               Lower Sorbian
  ee                Ewe
  eo                Esperanto
  es                Spanish
  es__traditional   Spanish ('ch' and 'll' as a grapheme)
  et                Estonian
  fa                Persian
  fi                Finnish (v and w are primary equal)
  fi__phonebook     Finnish (v and w as separate characters)
  fil               Filipino
  fo                Faroese
  fr_CA             Canadian French
  gu                Gujarati
  ha                Hausa
  haw               Hawaiian
  he                Hebrew
  hi                Hindi
  hr                Croatian
  hu                Hungarian
  hy                Armenian
  ig                Igbo
  is                Icelandic
  ja                Japanese [1]
  kk                Kazakh
  kl                Kalaallisut
  kn                Kannada
  ko                Korean [2]
  kok               Konkani
  lkt               Lakota
  ln                Lingala
  lt                Lithuanian
  lv                Latvian
  mk                Macedonian
  ml                Malayalam
  mr                Marathi
  mt                Maltese
  nb                Norwegian Bokmal
  nn                Norwegian Nynorsk
  nso               Northern Sotho
  om                Oromo
  or                Oriya
  pa                Punjabi
  pl                Polish
  ro                Romanian
  sa                Sanskrit
  se                Northern Sami
  si                Sinhala
  si__dictionary    Sinhala (U+0DA5 = U+0DA2,0DCA,0DA4)
  sk                Slovak
  sl                Slovenian
  sq                Albanian
  sr                Serbian
  sr_Latn           Serbian in Latin (tailored as Croatian)
  sv                Swedish (v and w are primary equal)
  sv__reformed      Swedish (v and w as separate characters)
  ta                Tamil
  te                Telugu
  th                Thai
  tn                Tswana
  to                Tonga
  tr                Turkish
  ug_Cyrl           Uyghur in Cyrillic
  uk                Ukrainian
  ur                Urdu
  vi                Vietnamese
  vo                Volapük
  wae               Walser
  wo                Wolof
  yo                Yoruba
  zh                Chinese
  zh__big5han       Chinese (ideographs: big5 order)
  zh__gb2312han     Chinese (ideographs: GB-2312 order)
  zh__pinyin        Chinese (ideographs: pinyin order) [3]
  zh__stroke        Chinese (ideographs: stroke order) [3]
  zh__zhuyin        Chinese (ideographs: zhuyin order) [3]
--------------------------------------------------------------

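Locales that follow the default UCA rules include am (Amharic) without [reorder Ethi], bg (Bulgarian) without [reorder Cyrl], chr (Cherokee) without [reorder Cher], de (German), en (English), fr (French), ga (Irish), id (Indonesian), it (Italian), ka (Georgian) without [reorder Geor], mn (Mongolian) without [reorder Cyrl Mong], ms (Malay), nl (Dutch), pt (Portuguese), ru (Russian) without [reorder Cyrl], sw (Swahili) and zu (Zulu).

Note

[1] ja: Ideographs are sorted in JIS X 0208 order. Fullwidth and halfwidth forms are identical to their regular forms. The difference between hiragana and katakana is at the 4th level; comparing at that level also requires (variable => 'Non-ignorable'), and then katakana_before_hiragana has no effect.

[2] ko: Many ideographs are sorted by their reading. Such an ideograph is primary (level 1) equal to, and secondary (level 2) greater than, the corresponding hangul syllable.

[3] zh__pinyin, zh__stroke and zh__zhuyin: alt='short' is implemented, where a smaller number of ideographs are tailored.

A list of variant codes and their aliases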

  variant code       alias
------------------------------------------
  dictionary         dict
  phonebook          phone     phonebk
  reformed           reform
  traditional        trad
------------------------------------------
  big5han            big5
  gb2312han          gb2312
  pinyin
  stroke
  zhuyin
------------------------------------------

Note: 'pinyin' is Han in Latin, 'zhuyin' is Han in Bopomofo.
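
As an illustration of the aliases in the table above, the sketch below requests the traditional Spanish variant through its trad alias and compares the result with the standard Spanish tailoring. The word list is an assumption chosen to show the treatment of 'ch'.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my $standard    = Unicode::Collate::Locale->new(locale => 'es');
my $traditional = Unicode::Collate::Locale->new(locale => 'es_trad');

my @words = qw(dama cosa chico);

my @std  = $standard->sort(@words);
my @trad = $traditional->sort(@words);

print "accepted: ", $traditional->getlocale, "\n";
print "es:       @std\n";
print "es_trad:  @trad\n";
```

With the standard tailoring "chico" precedes "cosa"; with the traditional one, 'ch' is a single grapheme sorted after 'c', so "cosa" comes first.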

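INSTALL

Installation of Unicode::Collate::Locale requires Collate/Locale.pm, Collate/Locale/*.pm, Collate/CJK/*.pm and Collate/allkeys.txt. On building, Unicode::Collate::Locale does not require any of data/*.txt, gendata/* and mklocale. Tests for Unicode::Collate::Locale are named t/loc_*.t.

CAVEAT

Tailoring is not maximum

Even if a certain letter is tailored, its equivalents are not always tailored along with it. For example, even though W is tailored, fullwidth W (U+FF37), W with acute (U+1E82), etc. are not tailored. The result may depend on whether the source strings are normalized, and on whether they are decomposed or composed. Thus (normalization => undef) is less preferred.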
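
The following sketch probes this caveat with the sv locale, where v and w are primary equal according to the locale list. It assumes, following the statement above, that the fullwidth form of W is not covered by the tailoring; actual results may vary with the installed version.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Level 1 compares primary weights only, so case and accents are ignored.
my $sv = Unicode::Collate::Locale->new(locale => 'sv', level => 1);

my $ascii_w     = $sv->eq('v', 'w');          # w is tailored by 'sv'
my $fullwidth_w = $sv->eq('v', "\x{FF37}");   # FULLWIDTH LATIN CAPITAL LETTER W

print "v eq w:           ", ($ascii_w     ? "yes" : "no"), "\n";
print "v eq fullwidth W: ", ($fullwidth_w ? "yes" : "no"), "\n";
```

If the fullwidth form is indeed left untailored, the first comparison reports equality and the second does not, even though both involve a form of the same letter.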

Collation reordering is not supported

The order of any groups, including scripts, is not changed.

Reference

  locale            based CLDR or other reference
--------------------------------------------------------------------
  af                30 = 1.8.1
  ar                30 = 28 ("compat" wo [reorder Arab]) = 1.9.0
  as                30 = 28 (without [reorder Beng..]) = 23
  az                30 = 24 ("standard" wo [reorder Latn Cyrl])
  be                30 = 28 (without [reorder Cyrl])
  bn                30 = 28 ("standard" wo [reorder Beng..]) = 2.0.1
  bs                30 = 28 (type="standard": [import hr])
  bs_Cyrl           30 = 28 (type="standard": [import sr])
  ca                30 = 23 (alt="proposed" type="standard")
  cs                30 = 1.8.1 (type="standard")
  cu                34 = 30 (without [reorder Cyrl])
  cy                30 = 1.8.1
  da                22.1 = 1.8.1 (type="standard")
  de__phonebook     30 = 2.0 (type="phonebook")
  de_AT_phonebook   30 = 27 (type="phonebook")
  dsb               30 = 26
  ee                30 = 21
  eo                30 = 1.8.1
  es                30 = 1.9.0 (type="standard")
  es__traditional   30 = 1.8.1 (type="traditional")
  et                30 = 26
  fa                22.1 = 1.8.1
  fi                22.1 = 1.8.1 (type="standard" alt="proposed")
  fi__phonebook     22.1 = 1.8.1 (type="phonebook")
  fil               30 = 1.9.0 (type="standard") = 1.8.1
  fo                22.1 = 1.8.1 (alt="proposed" type="standard")
  fr_CA             30 = 1.9.0
  gu                30 = 28 ("standard" wo [reorder Gujr..]) = 1.9.0
  ha                30 = 1.9.0
  haw               30 = 24
  he                30 = 28 (without [reorder Hebr]) = 23
  hi                30 = 28 (without [reorder Deva..]) = 1.9.0
  hr                30 = 28 ("standard" wo [reorder Latn Cyrl]) = 1.9.0
  hu                22.1 = 1.8.1 (alt="proposed" type="standard")
  hy                30 = 28 (without [reorder Armn]) = 1.8.1
  ig                30 = 1.8.1
  is                22.1 = 1.8.1 (type="standard")
  ja                22.1 = 1.8.1 (type="standard")
  kk                30 = 28 (without [reorder Cyrl])
  kl                22.1 = 1.8.1 (type="standard")
  kn                30 = 28 ("standard" wo [reorder Knda..]) = 1.9.0
  ko                22.1 = 1.8.1 (type="standard")
  kok               30 = 28 (without [reorder Deva..]) = 1.8.1
  lkt               30 = 25
  ln                30 = 2.0 (type="standard") = 1.8.1
  lt                22.1 = 1.9.0
  lv                22.1 = 1.9.0 (type="standard") = 1.8.1
  mk                30 = 28 (without [reorder Cyrl])
  ml                22.1 = 1.9.0
  mr                30 = 28 (without [reorder Deva..]) = 1.8.1
  mt                22.1 = 1.9.0
  nb                22.1 = 2.0   (type="standard")
  nn                22.1 = 2.0   (type="standard")
  nso           [*] 26 = 1.8.1
  om                22.1 = 1.8.1
  or                30 = 28 (without [reorder Orya..]) = 1.9.0
  pa                22.1 = 1.8.1
  pl                30 = 1.8.1
  ro                30 = 1.9.0 (type="standard")
  sa            [*] 1.9.1 = 1.8.1 (type="standard" alt="proposed")
  se                22.1 = 1.8.1 (type="standard")
  si                30 = 28 ("standard" wo [reorder Sinh..]) = 1.9.0
  si__dictionary    30 = 28 ("dictionary" wo [reorder Sinh..]) = 1.9.0
  sk                22.1 = 1.9.0 (type="standard")
  sl                22.1 = 1.8.1 (type="standard" alt="proposed")
  sq                22.1 = 1.8.1 (alt="proposed" type="standard")
  sr                30 = 28 (without [reorder Cyrl])
  sr_Latn           30 = 28 (type="standard": [import hr])
  sv                22.1 = 1.9.0 (type="standard")
  sv__reformed      22.1 = 1.8.1 (type="reformed")
  ta                22.1 = 1.9.0
  te                30 = 28 (without [reorder Telu..]) = 1.9.0
  th                22.1 = 22
  tn            [*] 26 = 1.8.1
  to                22.1 = 22
  tr                22.1 = 1.8.1 (type="standard")
  uk                30 = 28 (without [reorder Cyrl])
  ug_Cyrl           https://en.wikipedia.org/wiki/Uyghur_Cyrillic_alphabet
  ur                22.1 = 1.9.0
  vi                22.1 = 1.8.1
  vo                30 = 25
  wae               30 = 2.0
  wo            [*] 1.9.1 = 1.8.1
  yo                30 = 1.8.1
  zh                22.1 = 1.8.1 (type="standard")
  zh__big5han       22.1 = 1.8.1 (type="big5han")
  zh__gb2312han     22.1 = 1.8.1 (type="gb2312han")
  zh__pinyin        22.1 = 2.0   (type='pinyin' alt='short')
  zh__stroke        22.1 = 1.9.1 (type='stroke' alt='short')
  zh__zhuyin        22.1 = 22    (type='zhuyin' alt='short')
--------------------------------------------------------------------

[*] http://www.unicode.org/repos/cldr/tags/latest/seed/collation/

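AUTHOR

The Unicode::Collate::Locale module for Perl was written by SADAHIRO Tomoyuki. This module is copyright SADAHIRO Tomoyuki, Japan. All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

Unicode Collation Algorithm - UTS #10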

http://www.unicode.org/reports/tr10/

The Default Unicode Collation Element Table (DUCET)

http://www.unicode.org/Public/UCA/latest/allkeys.txt

Unicode Locale Data Markup Language (LDML) - UTS #35

http://www.unicode.org/reports/tr35/

CLDR - Unicode Common Locale Data Repository

http://cldr.unicode.org/

Unicode::Collate
Unicode::Normalize