src/gen-tag-table.py

Log

Author Commit Date CI Message
David Corbett 0fbbf749 2025-09-10T21:11:44 [ot-tags] Update IANA subtags to 2025-08-25 (#5537)
David Corbett 31b22016 2024-10-03T14:16:54 [ot-tags] Update IANA and OT language registries
David Corbett c2b5b7b9 2024-06-01T12:48:17 [ot-tags] Update IANA and OT language registries
David Corbett 86942e9a 2024-03-08T18:12:56 [ot-tags] Let Võro fall back to Estonian
David Corbett 88868411 2024-03-08T18:11:45 [ot-tags] Remove obsolete overrides
David Corbett f3727c47 2024-04-04T19:04:59 Recognize ot_languages2’s disambiguation priority
David Corbett 0692d23c 2024-03-07T17:30:56 Update IANA Language Subtag Registry to 2024-03-07
Behdad Esfahbod a7960bdf 2022-06-17T15:10:20 [config] Add HB_NO_LANGUAGE_LONG and enable in TINY profile
    Disables 3letter language tags and more complex ones.
    Fixes https://github.com/harfbuzz/harfbuzz/issues/3664
David Corbett e3e685e5 2022-05-18T15:05:55 [ot-tags] Fix `min_subtag_len` calculations
Behdad Esfahbod e24797ae 2022-05-18T11:10:10 [ot-tags] Follow-up to previous commit
    Part of https://github.com/harfbuzz/harfbuzz/issues/3591
Behdad Esfahbod f5d619be 2022-05-18T11:04:52 [ot-tags] Further gate the slow complex case, and add more tests
    Part of https://github.com/harfbuzz/harfbuzz/issues/3591

    Still 'zh-trad' is the slowest case.

    Benchmark                                                Time     CPU      Iterations
    -------------------------------------------------------------------------------------
    BM_hb_ot_tags_from_script_and_language/COMMON zh_trad    136 ns   136 ns   5107838
    BM_hb_ot_tags_from_script_and_language/COMMON ab_abcd    115 ns   115 ns   6103104
    BM_hb_ot_tags_from_script_and_language/COMMON ab_abc     25.4 ns  25.3 ns  27674482
    BM_hb_ot_tags_from_script_and_language/COMMON abcdef_XY  20.2 ns  20.1 ns  34795719
    BM_hb_ot_tags_from_script_and_language/COMMON abcd_XY    19.4 ns  19.3 ns  36390401
    BM_hb_ot_tags_from_script_and_language/COMMON cxy_CN     33.5 ns  33.4 ns  20998939
    BM_hb_ot_tags_from_script_and_language/COMMON exy_CN     25.1 ns  25.0 ns  27705832
    BM_hb_ot_tags_from_script_and_language/COMMON zh_CN      34.2 ns  34.1 ns  20564356
    BM_hb_ot_tags_from_script_and_language/COMMON en_US      15.5 ns  15.5 ns  45032204
    BM_hb_ot_tags_from_script_and_language/LATIN en_US       15.9 ns  15.8 ns  44412379
    BM_hb_ot_tags_from_script_and_language/COMMON none       4.72 ns  4.71 ns  149101665
    BM_hb_ot_tags_from_script_and_language/LATIN none        4.72 ns  4.70 ns  149254498
Behdad Esfahbod 3df8017e 2022-05-17T17:29:39 [ot-tag] Optimize subtag_matches() more
Behdad Esfahbod 909f00ac 2022-05-17T15:51:41 [ot-tags] Further speed up language bsearch()
    Using an integer tag to bsearch, instead of string.

    Part of: https://github.com/harfbuzz/harfbuzz/issues/3591

    Before:
    Benchmark                                                Time     CPU      Iterations
    -------------------------------------------------------------------------------------
    BM_hb_ot_tags_from_script_and_language/COMMON abcd_XY    8.11 ns  8.08 ns  87067795
    BM_hb_ot_tags_from_script_and_language/COMMON zh_CN      53.6 ns  53.5 ns  13042418
    BM_hb_ot_tags_from_script_and_language/COMMON en_US      24.2 ns  24.1 ns  29052731
    BM_hb_ot_tags_from_script_and_language/LATIN en_US       24.4 ns  24.3 ns  28736769
    BM_hb_ot_tags_from_script_and_language/COMMON none       4.43 ns  4.41 ns  160370413
    BM_hb_ot_tags_from_script_and_language/LATIN none        4.35 ns  4.34 ns  160578191

    After:
    Benchmark                                                Time     CPU      Iterations
    -------------------------------------------------------------------------------------
    BM_hb_ot_tags_from_script_and_language/COMMON abcd_XY    7.97 ns  7.95 ns  85208363
    BM_hb_ot_tags_from_script_and_language/COMMON zh_CN      41.7 ns  41.6 ns  16945817
    BM_hb_ot_tags_from_script_and_language/COMMON en_US      16.1 ns  16.0 ns  43613523
    BM_hb_ot_tags_from_script_and_language/LATIN en_US       16.5 ns  16.4 ns  42568107
    BM_hb_ot_tags_from_script_and_language/COMMON none       4.30 ns  4.29 ns  164055469
    BM_hb_ot_tags_from_script_and_language/LATIN none        4.29 ns  4.27 ns  163793591
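The idea in 909f00ac, binary-searching on an integer key instead of a string, can be sketched in Python roughly as follows. This is not HarfBuzz's C implementation: `pack_tag`, `lookup`, and the three-entry sample table are hypothetical names and data chosen only for illustration.

```python
from bisect import bisect_left


def pack_tag(subtag):
    """Pack the first four ASCII characters into one big-endian integer.

    Shorter subtags are padded with spaces, so "en" becomes b"en  ".
    """
    raw = subtag.encode("ascii")[:4].ljust(4, b" ")
    return int.from_bytes(raw, "big")


# Hypothetical sample data: language subtag -> OpenType language system tag.
_SAMPLE = {"en": "ENG ", "fa": "FAR ", "zh": "ZHS "}

# Sort once by the packed integer key so lookups can use binary search.
_TABLE = sorted((pack_tag(lang), ot_tag) for lang, ot_tag in _SAMPLE.items())
_KEYS = [key for key, _ in _TABLE]


def lookup(subtag):
    """Return the OpenType tag for a language subtag, or None if absent."""
    key = pack_tag(subtag)
    index = bisect_left(_KEYS, key)
    if index < len(_KEYS) and _KEYS[index] == key:
        return _TABLE[index][1]
    return None
```

Each probe of the search then compares two integers rather than walking two strings character by character, which is the kind of saving the commit's before/after numbers suggest.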
Behdad Esfahbod 15be0ded 2022-05-17T14:57:08 [ot-tags] Optimize lang_matches()
    Part of https://github.com/harfbuzz/harfbuzz/issues/3591

    Before:
    Benchmark                                                Time     CPU      Iterations
    -------------------------------------------------------------------------------------
    BM_hb_ot_tags_from_script_and_language/COMMON abcd_XY    8.67 ns  8.64 ns  80324382
    BM_hb_ot_tags_from_script_and_language/COMMON zh_CN      91.2 ns  90.9 ns  7674131
    BM_hb_ot_tags_from_script_and_language/COMMON en_US      41.1 ns  41.0 ns  17174093
    BM_hb_ot_tags_from_script_and_language/LATIN en_US       41.3 ns  41.2 ns  17000876
    BM_hb_ot_tags_from_script_and_language/COMMON none       4.56 ns  4.55 ns  153914130
    BM_hb_ot_tags_from_script_and_language/LATIN none        4.53 ns  4.52 ns  153830303

    After:
    Benchmark                                                Time     CPU      Iterations
    -------------------------------------------------------------------------------------
    BM_hb_ot_tags_from_script_and_language/COMMON abcd_XY    8.24 ns  8.21 ns  84078465
    BM_hb_ot_tags_from_script_and_language/COMMON zh_CN      77.5 ns  77.2 ns  9059230
    BM_hb_ot_tags_from_script_and_language/COMMON en_US      38.8 ns  38.7 ns  17790692
    BM_hb_ot_tags_from_script_and_language/LATIN en_US       37.6 ns  37.5 ns  18648293
    BM_hb_ot_tags_from_script_and_language/COMMON none       4.50 ns  4.49 ns  155573267
    BM_hb_ot_tags_from_script_and_language/LATIN none        4.49 ns  4.47 ns  156456653
Behdad Esfahbod dd3c858f 2022-05-17T14:28:28 [ot-tags] Speed up hb_ot_tags_from_language()
    Part of https://github.com/harfbuzz/harfbuzz/issues/3591

    "After that, bulk of the time I suppose is spent in binary-searching the language table. I suggest we split the language table in 2-letter and 3-letter tags, to speed-up the vast majority of cases that are 2-letter."

    benchmark-ot, before:
    Benchmark                                                Time     CPU      Iterations
    -------------------------------------------------------------------------------------
    BM_hb_ot_tags_from_script_and_language/COMMON zh_CN      112 ns   111 ns   6286271
    BM_hb_ot_tags_from_script_and_language/COMMON en_US      60.6 ns  60.4 ns  11671176
    BM_hb_ot_tags_from_script_and_language/LATIN en_US       61.3 ns  61.1 ns  11442645
    BM_hb_ot_tags_from_script_and_language/COMMON none       4.75 ns  4.74 ns  146997235
    BM_hb_ot_tags_from_script_and_language/LATIN none        4.65 ns  4.64 ns  150938747

    After:
    Benchmark                                                Time     CPU      Iterations
    -------------------------------------------------------------------------------------
    BM_hb_ot_tags_from_script_and_language/COMMON zh_CN      89.5 ns  89.2 ns  7747649
    BM_hb_ot_tags_from_script_and_language/COMMON en_US      38.5 ns  38.4 ns  18199432
    BM_hb_ot_tags_from_script_and_language/LATIN en_US       39.0 ns  38.9 ns  18049238
    BM_hb_ot_tags_from_script_and_language/COMMON none       4.53 ns  4.52 ns  154895110
    BM_hb_ot_tags_from_script_and_language/LATIN none        4.54 ns  4.52 ns  154762105
Behdad Esfahbod 9baccb98 2022-05-17T13:34:34 [ot-tags] Speed up hb_ot_tags_from_complex_language()
    Part of https://github.com/harfbuzz/harfbuzz/issues/3591

    2. All the subtag_matches outside the switch match long strings (>= 6 or so). As such, check the tag for such length before going into any of them.

    benchmark-ot, before:
    Benchmark                                                Time     CPU      Iterations
    -------------------------------------------------------------------------------------
    BM_hb_ot_tags_from_script_and_language/COMMON zh_CN      172 ns   171 ns   4083155
    BM_hb_ot_tags_from_script_and_language/COMMON en_US      120 ns   119 ns   5849947
    BM_hb_ot_tags_from_script_and_language/LATIN en_US       113 ns   112 ns   5840326
    BM_hb_ot_tags_from_script_and_language/COMMON none       4.66 ns  4.64 ns  151396224
    BM_hb_ot_tags_from_script_and_language/LATIN none        4.66 ns  4.64 ns  149019593

    After:
    Benchmark                                                Time     CPU      Iterations
    -------------------------------------------------------------------------------------
    BM_hb_ot_tags_from_script_and_language/COMMON zh_CN      112 ns   112 ns   6357763
    BM_hb_ot_tags_from_script_and_language/COMMON en_US      60.5 ns  60.3 ns  11475091
    BM_hb_ot_tags_from_script_and_language/LATIN en_US       54.9 ns  54.8 ns  12575690
    BM_hb_ot_tags_from_script_and_language/COMMON none       4.61 ns  4.59 ns  152388450
    BM_hb_ot_tags_from_script_and_language/LATIN none        4.66 ns  4.64 ns  151497600
David Corbett ae9afd97 2021-10-03T20:09:33 Let BCP 47 tag "mo" fall back to OT tag 'ROM '
David Corbett a184c5f8 2022-01-30T13:28:23 Don’t always inherit from macrolanguages
    If an OpenType tag maps to a BCP 47 macrolanguage, that is presumably to support the use of the macrolanguage as a vague stand-in for one of its individual languages. For example, "ar" and "zh" are often used for "arb" and "cmn". When the OpenType tag maps to a macrolanguage and some but not all of its individual languages, that indicates that the OpenType tag only corresponds to the listed individual languages (which may be referred to using the macrolanguage subtag) but not the missing individual languages.

    In particular, INUK (Nunavik Inuktitut) is mapped to "ike" (Eastern Canadian Inuktitut) and "iu" (Inuktitut) but not to "ikt" (Inuinnaqtun), so "ikt" should not inherit the INUK mapping from its macrolanguage "iu".
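The inheritance rule described in a184c5f8 can be illustrated with a small Python sketch. It is a simplified reading of the commit message, not the code in gen-tag-table.py: `expand_macrolanguages` and both input dictionaries are hypothetical, and the sample data covers only the "iu" case from the message plus "fa" as a macrolanguage listed without any of its individual languages.

```python
def expand_macrolanguages(ot_to_bcp47, macro_members):
    """Add individual languages to each OpenType tag's BCP 47 set.

    A macrolanguage passes its OpenType tag on to all of its individual
    languages only when none of them is listed explicitly for that tag.
    If some are listed, the list is taken as complete and nothing is added.
    """
    expanded = {}
    for ot_tag, langs in ot_to_bcp47.items():
        result = set(langs)
        for lang in langs:
            members = macro_members.get(lang, set())
            if members and not (members & langs):
                result |= members
        expanded[ot_tag] = result
    return expanded


# Hypothetical sample inputs for illustration.
OT_TO_BCP47 = {
    "INUK": {"ike", "iu"},  # macrolanguage plus one of its individual languages
    "FAR ": {"fa"},         # macrolanguage only
}
MACRO_MEMBERS = {
    "iu": {"ike", "ikt"},
    "fa": {"pes", "prs"},
}

EXPANDED = expand_macrolanguages(OT_TO_BCP47, MACRO_MEMBERS)
```

With these inputs "ikt" stays out of the INUK set because "ike" is listed explicitly, while both individual languages of "fa" are added to the 'FAR ' set.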
David Corbett 0b1bf89c 2022-01-28T22:27:51 Replace “[family]” with “[collection]”
    Not all language collections are language families.
David Corbett 0e31595e 2022-01-28T22:26:38 Infer tag mappings for unregistered macrolanguages
    Every macrolanguage not mentioned in the OT language system tag registry is mapped to every tag of its individual languages, if those have registered tags.
David Corbett 2404617a 2021-12-08T21:10:22 Update language system tag registry to OT 1.9
David Corbett d18915f9 2021-03-28T10:09:13 Reformat gen-tag-table.py
David Corbett e19de65e 2021-03-08T13:12:47 Update hb-ot-tag-table.hh (#2890)
David Corbett b2e7bb2a 2020-10-27T19:50:33 Don’t map BCP 47 to coincidentally similar OT tag
David Corbett e1df2c52 2020-10-26T19:16:35 Map ISO 639 code qul to language system tag 'QUH '
David Corbett 17da41bd 2020-11-17T14:29:05 Update language system tag registry to OT 1.8.4
David Corbett 27170e05 2020-10-28T18:02:55 Fix names for language tag in gen-tag-table.py
    A BCP 47 language tag with both a script subtag and a region subtag would be printed as a human-readable name in hb-ot-tag-table.hh as if it only had its language subtag.
David Corbett dec52006 2020-10-10T14:49:55 Map BCP 47 tags to all macrolanguages
    The general rule is that if a BCP 47 macrolanguage maps to an OpenType language system tag, all its individual languages map to it too. Previously, a tag like "prs" (Dari) would not map to the language system tag ('FAR ') of its macrolanguage ("fa") because "prs" already has its own language system tag ('DRI '). That exception has been removed: now "prs" maps to 'DRI ' and falls back to 'FAR '.
David Corbett 1d53268d 2020-10-10T14:46:36 Fix two-way mapping of "man" and 'MNK '
David Corbett ab38cf67 2020-10-10T14:21:20 Map hy-arevmda to 'HYE ' instead of HYE0
David Corbett 916c5a90 2020-10-10T14:15:16 Consistently emit BCP 47 subtag scope suffixes
David Corbett ac3f859a 2020-09-09T11:49:56 Demote unregistered vendor-specific language tags
David Corbett 91fe20f0 2020-09-04T09:18:19 Disambiguate OT tags when primary tag is not first
Ebrahim Byagowi ad87155f 2020-05-29T00:11:19 minor, use py3's open(encoding=)
Ebrahim Byagowi 7554f618 2020-05-28T22:51:29 minor, use sys.exit print shorthand
Ebrahim Byagowi 08f1d95a 2020-05-28T15:01:15 minor, move scripts manuals to __doc__
David Corbett 7a961692 2020-04-01T17:26:07 Update IANA Language Subtag Registry to 2020-05-12
David Corbett fd748fac 2020-03-15T15:59:31 Update to Unicode 13.0.0
Ebrahim Byagowi e17fd0d9 2020-02-23T23:58:39 [tools] More on py3 compatibility
Ebrahim Byagowi 8c652f72 2020-02-19T16:32:44 Minor, switch to https links where possible
Ebrahim Byagowi bbcbcafc 2020-02-19T16:21:47 [tool] Minor, move input files link
Ebrahim Byagowi 8d199077 2020-02-19T14:56:55 Remove python2 support from tests/utils scripts
Evgeniy Reizner 4dc87365 2020-02-09T18:39:33 Add links to files used by python scripts. Closes #2150
David Corbett 6745a600 2019-04-16T17:29:34 Comment out ot_languages where fallback suffices
David Corbett 1ce11b44 2019-04-16T10:04:45 Reduce LangTag from 3 language system tags to 1
David Corbett 3f887747 2018-07-19T13:48:07 Switch on the first char of a complex language tag
    This results in a tenfold speed-up for the common case of tags that are not complex, in the sense of `hb_ot_tags_from_complex_language`.
David Corbett a754d441 2018-07-16T21:14:48 Map Quechua languages to closest ones with tags
    OpenType only officially maps four ISO 639 codes to Quechua languages, but prior versions of HarfBuzz also mapped qu to 'QUZ '. Because qu is a macrolanguage, the mapping now applies to all individual Quechua languages. OpenType calls 'QUZ ' "Quechua", but it really corresponds to Cusco Quechua, so the individual Quechua languages should not all necessarily be mapped to it.
David Corbett 7c7cb2a9 2018-01-20T15:53:09 Match extlang subtags
    If the second subtag of a BCP 47 tag is three letters long, it denotes an extended language. The tag converter ignores the language subtag and uses the extended language instead. There are some grandfathered exceptions, which are handled earlier.
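The extlang rule from 7c7cb2a9 can be sketched in a few lines of Python. `effective_language` is a hypothetical helper written for this illustration, not a function from HarfBuzz, and it deliberately leaves out the grandfathered exceptions that the commit says are handled earlier.

```python
def effective_language(tag):
    """Return the subtag that should drive the OpenType lookup.

    If the second subtag is exactly three letters, it is treated as an
    extended language and replaces the primary language subtag.
    Grandfathered tags are NOT handled in this sketch.
    """
    subtags = tag.lower().split("-")
    if len(subtags) > 1:
        second = subtags[1]
        # Require letters so that three-digit region codes such as "419"
        # are not mistaken for an extended language.
        if len(second) == 3 and second.isascii() and second.isalpha():
            return second
    return subtags[0]
```

For example, "zh-yue" resolves to "yue", while "zh-HK" and "es-419" keep their primary language subtags.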
David Corbett 2f1f961c 2017-12-08T22:45:52 Autogenerate the BCP 47 to OpenType mappings
    The new script, gen-tag-table.py, generates `ot_languages` automatically from the [OpenType language system tag registry][ot] and the [IANA Language Subtag Registry][bcp47] with some manual modifications. If an OpenType tag maps to a BCP 47 macrolanguage, all the macrolanguage's individual languages are mapped to the same OpenType tag, except for individual languages with their own OpenType mappings. Deprecated BCP 47 tags are canonicalized.

    [ot]: https://docs.microsoft.com/en-us/typography/opentype/spec/languagetags
    [bcp47]: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

    Some OpenType tags correspond to multiple ISO 639 codes. The mapping from ISO 639 codes lists OpenType tags in priority order, such that more specific or more likely tags appear first.

    Some OpenType tags have no corresponding ISO 639 code in the registry so their mappings use BCP 47 subtags besides the language. For example, any BCP 47 tag with a fonipa variant subtag is mapped to 'IPPH', and 'IPPH' is mapped back to und-fonipa.

    Other OpenType tags have no corresponding ISO 639 code because it is not clear what they are for. HarfBuzz just ignores these tags. One such ignored tag is 'ZHP ' (Chinese Phonetic). It probably means zh-Latn. However, it is used in Microsoft JhengHei and Microsoft YaHei with the script tag 'hani', implying that it is not a romanization scheme after all. It would be simple enough to add this mapping to gen-tag-table.py once a definitive mapping is determined.

    The manual modifications are mainly either obvious mappings that the OpenType registry omits or mappings for compatibility with previous versions of HarfBuzz. Some of the old mappings were discarded, though, for homophonous language names. For example, OpenType maps 'KUI ' to kxu; previous versions of HarfBuzz also mapped it to kvd, because kvd and kxu both happen to be called "Kui".

    gen-tag-table.py also generates a function to convert multi-subtag tags like el-polyton and zh-HK to OpenType tags, replacing `ot_languages_zh` and the hard-coded list of special cases in `hb_ot_tags_from_language`. It also generates a function to convert OpenType tags to BCP 47, replacing the hard-coded list of special cases in `hb_ot_tag_to_language`.
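The fonipa example in 2f1f961c describes a two-way mapping that does not depend on the language subtag. A minimal Python sketch of that one case might look like this; `ot_tag_from_variant` and `language_from_ot_tag` are hypothetical names, and the real generated functions cover many more cases and are written in C.

```python
def ot_tag_from_variant(tag):
    """Return 'IPPH' when any subtag after the first is 'fonipa'.

    Simplification: this sketch does not treat subtags after a
    private-use 'x' singleton specially.
    """
    subtags = tag.lower().split("-")
    if "fonipa" in subtags[1:]:
        return "IPPH"
    return None


# Reverse direction: the OpenType tag maps back to a language-neutral tag.
_OT_TO_BCP47 = {"IPPH": "und-fonipa"}


def language_from_ot_tag(ot_tag):
    """Return the BCP 47 tag for an OpenType tag, or None if unknown."""
    return _OT_TO_BCP47.get(ot_tag)
```

The round trip is lossy by design: "en-fonipa" maps to 'IPPH', but 'IPPH' maps back to "und-fonipa" because the OpenType tag carries no language information.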
David Corbett bca7a169 2018-09-10T12:05:51 Update language system tag registry to OT 1.8.3