kc3-lang/libiconv/NOTES

    Commit

  • Author : Bruno Haible
    Date : 2024-10-22 12:28:36
    Hash : f017cafc
    Message : Spelling fixes. From Mike Taves <mwtoews@gmail.com> in
      <https://lists.gnu.org/archive/html/bug-gnu-libiconv/2024-10/msg00002.html>.
      * NOTES: Fix grammar.
      * lib/iconv_open1.h: Fix a typo in comments.
      * lib/iso2022_jp2.h: Likewise.
      * lib/iso2022_jpms.h: Fix English spelling in comments.

  • NOTES
    Q: Why does libiconv support encoding XXX? Why does libiconv not support
       encoding ZZZ?
    
    A: libiconv, as an internationalization library, supports those character
       sets and encodings which are in widespread use in at least one territory
       of the world.
    
       Hint1: On http://www.w3c.org/International/O-charset-lang.html you find a
       page "Languages, countries, and the charsets typically used for them".
       From this table, we can conclude that the following are in active use:
    
         ISO-8859-1, CP1252   Afrikaans, Albanian, Basque, Catalan, Danish, Dutch,
                              English, Faroese, Finnish, French, Galician, German,
                              Icelandic, Irish, Italian, Norwegian, Portuguese,
                              Scottish, Spanish, Swedish
         ISO-8859-2           Croatian, Czech, Hungarian, Polish, Romanian, Slovak,
                              Slovenian
         ISO-8859-3           Esperanto, Maltese
         ISO-8859-5           Bulgarian, Byelorussian, Macedonian, Russian,
                              Serbian, Ukrainian
         ISO-8859-6           Arabic
         ISO-8859-7           Greek
         ISO-8859-8           Hebrew
         ISO-8859-9, CP1254   Turkish
         ISO-8859-10          Inuit, Lapp
         ISO-8859-13          Latvian, Lithuanian
         ISO-8859-15          Estonian
         KOI8-R               Russian
         SHIFT_JIS            Japanese
         ISO-2022-JP          Japanese
         EUC-JP               Japanese
    
       Ordered by frequency on the web (1997):
         ISO-8859-1, CP1252   96%
         SHIFT_JIS             1.6%
         ISO-2022-JP           1.2%
         EUC-JP                0.4%
         CP1250                0.3%
         CP1251                0.2%
         CP850                 0.1%
         MACINTOSH             0.1%
         ISO-8859-5            0.1%
         ISO-8859-2            0.0%
    
       Hint2: The character sets mentioned in the XFree86 4.0 locale.alias file.
    
         ISO-8859-1           Afrikaans, Basque, Breton, Catalan, Danish, Dutch,
                              English, Estonian, Faroese, Finnish, French,
                              Galician, German, Greenlandic, Icelandic,
                              Indonesian, Irish, Italian, Lithuanian, Norwegian,
                              Occitan, Portuguese, Scottish, Spanish, Swedish,
                              Walloon, Welsh
         ISO-8859-2           Albanian, Croatian, Czech, Hungarian, Polish,
                              Romanian, Serbian, Slovak, Slovenian
         ISO-8859-3           Esperanto
         ISO-8859-4           Estonian, Latvian, Lithuanian
         ISO-8859-5           Bulgarian, Byelorussian, Macedonian, Russian,
                              Serbian, Ukrainian
         ISO-8859-6           Arabic
         ISO-8859-7           Greek
         ISO-8859-8           Hebrew
         ISO-8859-9           Turkish
         ISO-8859-14          Breton, Irish, Scottish, Welsh
         ISO-8859-15          Basque, Breton, Catalan, Danish, Dutch, Estonian,
                              Faroese, Finnish, French, Galician, German,
                              Greenlandic, Icelandic, Irish, Italian, Lithuanian,
                              Norwegian, Occitan, Portuguese, Scottish, Spanish,
                              Swedish, Walloon, Welsh
         KOI8-R               Russian
         KOI8-U               Russian, Ukrainian
         EUC-JP (alias eucJP)      Japanese
         ISO-2022-JP (alias JIS7)  Japanese
         SHIFT_JIS (alias SJIS)    Japanese
         U90                       Japanese
         S90                       Japanese
         EUC-CN (alias eucCN)      Chinese
         EUC-TW (alias eucTW)      Chinese
         BIG5                      Chinese
         EUC-KR (alias eucKR)      Korean
         ARMSCII-8                 Armenian
         GEORGIAN-ACADEMY          Georgian
         GEORGIAN-PS               Georgian
         TIS-620 (alias TACTIS)    Thai
         MULELAO-1                 Laotian
         IBM-CP1133                Laotian
         VISCII                    Vietnamese
         TCVN                      Vietnamese
         NUNACOM-8                 Inuktitut
    
       Hint3: The character sets supported by Netscape Communicator 4.
    
         Where is this documented? For the complete picture, I had to use
         "strings netscape" and then a lot of guesswork. For a quick take,
         look at the "View - Character set" menu of Netscape Communicator 4.6:
    
         ISO-8859-{1,2,5,7,9,15}
         WINDOWS-{1250,1251,1253}
         KOI8-R               Cyrillic
         CP866                Cyrillic
         Autodetect           Japanese  (EUC-JP, ISO-2022-JP, ISO-2022-JP-2, SJIS)
         EUC-JP               Japanese
         SHIFT_JIS            Japanese
         GB2312               Chinese
         BIG5                 Chinese
         EUC-TW               Chinese
         Autodetect           Korean    (EUC-KR, ISO-2022-KR, but not JOHAB)
    
         UTF-8
         UTF-7
    
       Hint4: The character sets supported by Microsoft Internet Explorer 4.
    
         ISO-8859-{1,2,3,4,5,6,7,8,9}
         WINDOWS-{1250,1251,1252,1253,1254,1255,1256,1257}
         KOI8-R               Cyrillic
         KOI8-RU              Ukrainian
         ASMO-708             Arabic
         EUC-JP               Japanese
         ISO-2022-JP          Japanese
         SHIFT_JIS            Japanese
         GB2312               Chinese
         HZ-GB-2312           Chinese
         BIG5                 Chinese
         EUC-KR               Korean
         ISO-2022-KR          Korean
         WINDOWS-874          Thai
         WINDOWS-1258         Vietnamese
    
         UTF-8
         UTF-7
         UNICODE             actually UNICODE-LITTLE
         UNICODEFEFF         actually UNICODE-BIG
    
         and various DOS character sets: DOS-720, DOS-862, IBM852, CP866.
    
       We take the union of all these four sets. The result is:
    
       European and Semitic languages
         * ASCII.
           We implement this because it is occasionally useful to know or to
           check whether some text is entirely ASCII (i.e. if the conversion
           ISO-8859-x -> UTF-8 is trivial).
         * ISO-8859-{1,2,3,4,5,6,7,8,9,10}
           We implement these because they are widely used, except for
           ISO-8859-4, which appears to have been superseded by ISO-8859-13 in
           the Baltic countries. But it's an ISO standard anyway.
         * ISO-8859-13
           We implement this because it's a standard in Lithuania and Latvia.
         * ISO-8859-14
           We implement this because it's an ISO standard.
         * ISO-8859-15
           We implement this because it's increasingly used in Europe, because
           of the Euro symbol.
         * ISO-8859-16
           We implement this because it's an ISO standard.
         * KOI8-R, KOI8-U
           We implement these because they appear to be the predominant
           encodings on Unix in Russia and Ukraine, respectively.
         * KOI8-RU
           We implement this because MSIE4 supports it.
         * KOI8-T
           We implement this because it is the locale encoding in glibc's Tajik
           locale.
         * PT154
           We implement this because it is the locale encoding in glibc's Kazakh
           locale.
         * RK1048
           We implement this because it's a standard in Kazakhstan.
         * CP{1250,1251,1252,1253,1254,1255,1256,1257}
           We implement these because they are the predominant Windows encodings
           in Europe.
         * CP850
           We implement this because it is mentioned as occurring on the web
           in the aforementioned statistics.
         * CP862
           We implement this because Ron Aaron says it is sometimes used in web
           pages and emails.
         * CP866
           We implement this because Netscape Communicator does.
         * CP1131
           We implement this because it is the locale encoding of a Belarusian
           locale in FreeBSD and Mac OS X.
         * Mac{Roman,CentralEurope,Croatian,Romania,Cyrillic,Greek,Turkish} and
           Mac{Hebrew,Arabic}
           We implement these because the Sun JDK does, and because Mac users
           don't deserve to be punished.
         * Macintosh
           We implement this because it is mentioned as occurring on the web
           in the aforementioned statistics.
       Japanese
         * EUC-JP, SHIFT_JIS, ISO-2022-JP
           We implement these because they are widely used. EUC-JP and SHIFT_JIS
           are used more for files, whereas ISO-2022-JP is recommended for email.
         * CP932
           We implement this because it is the Microsoft variant of SHIFT_JIS,
           used on Windows.
         * ISO-2022-JP-2
           We implement this because it's the common way to represent mails which
           make use of JIS X 0212 characters.
         * ISO-2022-JP-1
           We implement this because it's in the RFCs, but I don't think it is
           really used.
         * ISO-2022-JP-MS
           We implement this because Microsoft Outlook Express / Microsoft MimeOLE
           sends emails in this encoding.
         * U90, S90
           We DON'T implement these because I have no information about what
           they are or who uses them.
       Simplified Chinese
         * EUC-CN = GB2312
           We implement this because it is the widely used representation
           of simplified Chinese.
         * GBK
           We implement this because it appears to be used on Solaris and Windows.
         * GB18030
           We implement this because it is an official requirement in the
           People's Republic of China.
         * ISO-2022-CN
           We implement this because it is in the RFCs, but I have no idea
           whether it is really used.
         * ISO-2022-CN-EXT
           We implement this because it's in the RFCs, but I don't think it is
           really used.
         * HZ = HZ-GB-2312
           We implement this because the RFCs recommend it for Usenet postings,
           and because MSIE4 supports it.
       Traditional Chinese
         * EUC-TW
           We implement it because it appears to be used on Unix.
         * BIG5
           We implement it because it is the de-facto standard for traditional
           Chinese.
         * CP950
           We implement this because it is the Microsoft variant of BIG5, used
           on Windows.
         * BIG5+
           We DON'T implement this because it doesn't appear to be in wide use.
           Only the CWEX fonts use this encoding. Furthermore, the conversion
           tables in the big5p package are not coherent: If you convert directly,
           you get different results than when you convert via GBK.
         * BIG5-HKSCS
           We implement it because it is the de-facto standard for traditional
           Chinese in Hong Kong.
       Korean
         * EUC-KR
           We implement this because it appears to be the widely used
           representation for Korean.
         * CP949
           We implement this because it is the Microsoft variant of EUC-KR, used
           on Windows.
         * ISO-2022-KR
           We implement it because it is in the RFCs and because MSIE4 supports
           it, but I have no idea whether it's really used.
         * JOHAB
           We implement this because it is apparently used on Windows as a locale
           encoding (codepage 1361).
         * ISO-646-KR
           We DON'T implement this because, although it is an old ASCII variant,
           its glyph for 0x7E is not clear: RFC 1345 and unicode.org's JOHAB.TXT
           say it's a tilde, but Ken Lunde's "CJKV information processing" says
           it's an overline. And it is not ISO-IR registered.
       Armenian
         * ARMSCII-8
           We implement it because XFree86 supports it.
       Georgian
         * Georgian-Academy, Georgian-PS
           We implement these because both appear to be used for Georgian;
           XFree86 supports them.
       Thai
         * ISO-8859-11, TIS-620
           We implement these because they seem to be the standard for Thai.
         * CP874
           We implement this because MSIE4 supports it.
         * MacThai
           We implement this because the Sun JDK does, and because Mac users
           don't deserve to be punished.
       Laotian
         * MuleLao-1, CP1133
           We implement these because XFree86 supports them. I have no idea which
           one is used more widely.
       Vietnamese
         * VISCII, TCVN
           We implement these because XFree86 supports them.
         * CP1258
           We implement this because MSIE4 supports it.
       Other languages
         * NUNACOM-8 (Inuktitut)
           We DON'T implement this because it isn't part of Unicode yet, and
           therefore doesn't convert to anything except itself.
       Platform specifics
         * HP-ROMAN8, NEXTSTEP
           We implement these because they were the native character set on HPs
           and NeXTs for a long time, and libiconv is intended to be usable on
           these old machines.
       Full Unicode
         * UTF-8, UCS-2, UCS-4
           We implement these. Obviously.
         * UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE
           We implement these because they are the preferred internal
           representation of strings in Unicode aware applications. These are
           non-ambiguous names, known to glibc. (glibc doesn't have
           UCS-2-INTERNAL and UCS-4-INTERNAL.)
         * UTF-16, UTF-16BE, UTF-16LE
           We implement these, because UTF-16 is still the favourite encoding of
           the president of the Unicode Consortium (for political reasons), and
           because they appear in RFC 2781.
         * UTF-32, UTF-32BE, UTF-32LE
           We implement these because they are part of Unicode 3.1.
         * UTF-7
           We implement this because it is essential functionality for mail
           applications.
         * C99
           We implement it because it's used for C and C++ programs and because
           it's a nice encoding for debugging.
         * JAVA
           We implement it because it's used for Java programs and because it's
           a nice encoding for debugging.
         * UNICODE (little endian), UNICODEFEFF (big endian)
           We DON'T implement these because they are stupid and not standardized.
       Full Unicode, in terms of 'uint16_t' or 'uint32_t'
       (with machine dependent endianness and alignment)
         * UCS-2-INTERNAL, UCS-4-INTERNAL
           We implement these because they are the preferred internal
           representation of strings in Unicode aware applications.
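
       The ASCII rationale above (checking whether some text is entirely
       ASCII) can be illustrated with a minimal sketch. It is not part of
       libiconv: the function name is_entirely_ascii is made up for this
       example, the input is assumed to be ISO-8859-1, and the code assumes
       that the iconv implementation reports an unconvertible character with
       EILSEQ, as GNU libiconv and glibc do.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only.
   Returns 1 if the ISO-8859-1 text converts to ASCII without loss,
   0 if some character is not representable in ASCII,
   -1 on any other error (e.g. the converter is not available). */
int is_entirely_ascii(const char *text, size_t len)
{
    iconv_t cd = iconv_open("ASCII", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        return -1;

    char *in = (char *)text;   /* iconv() takes char **, but only reads the input */
    size_t inleft = len;
    int result = 1;

    while (inleft > 0) {
        char buf[256];         /* scratch output; its contents are discarded */
        char *out = buf;
        size_t outleft = sizeof buf;

        if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
            if (errno == E2BIG)
                continue;      /* scratch buffer full: reuse it for the next chunk */
            /* Assumption: EILSEQ signals an unconvertible character. */
            result = (errno == EILSEQ) ? 0 : -1;
            break;
        }
    }

    iconv_close(cd);
    return result;
}
```

       A result of 1 means the ISO-8859-x -> UTF-8 conversion of that text
       would be trivial, since ASCII bytes are unchanged in UTF-8.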
    
    Q: Support encodings mentioned in RFC 1345?
    A: No, they are not in use any more. Supporting ISO-646 variants is pointless
       since ISO-8859-* have been adopted.
    
    Q: Support EBCDIC?
    A: Available through --enable-extra-encodings.
       Why? Because several people (Ulrich Schwab, Calvin Buckley) have shown
       interest in these encodings, by preparing forks of GNU libiconv.
    
    Q: How do I add a new character set?
    A: 1. Explain the "why" in this file, above.
       2. You need to have a conversion table from/to Unicode. Transform it into
       the format used by the mapping tables found on ftp.unicode.org: each line
       contains the character code, in hex, with 0x prefix, then whitespace,
       then the Unicode code point, in hex, 4 hex digits, with 0x prefix. '#'
       counts as a comment delimiter until end of line.
       Please also send your table to Mark Leisher <mleisher@crl.nmsu.edu> so he
       can include it in his collection.
       3. If it's an 8-bit character set, use the '8bit_tab_to_h' program in the
       tools directory to generate the C code for the conversion. You may tweak
       the resulting C code if you are not satisfied with its quality, but this
       is rarely needed.
       If it's a two-dimensional character set (with rows and columns), use the
       'cjk_tab_to_h' program in the tools directory to generate the C code for
       the conversion. You will need to modify the main() function to recognize
       the new character set name, with the proper dimensions, but that shouldn't
       be too hard. This yields the CCS. The CES you have to write by hand.
       4. Store the resulting C code file in the lib directory. Add a #include
       directive to converters.h, and add an entry to the encodings.def file.
       5. Compile the package, and test your new encoding using a program like
       iconv(1) or clisp(1).
       6. Augment the testsuite: Add a line to tests/Makefile.in. For a stateless
       encoding, create the complete table as a TXT file. For a stateful encoding,
       provide a text snippet encoded using your new encoding and its UTF-8
       equivalent.
       7. Update the README and man/iconv_open.3, to mention the new encoding.
       Add a note in the NEWS file.
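
       The table format described in step 2 can be sketched as a small line
       parser. This is only an illustration of the format, not code from the
       tools directory: the function name parse_mapping_line and its return
       convention are invented for this example, and lines that carry a
       character code without a Unicode value are simply treated as malformed.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
   Parses one line of a mapping table: character code in hex with 0x prefix,
   whitespace, Unicode code point in hex with 0x prefix; '#' starts a comment.
   Returns 1 if a mapping was read, 0 for a blank or comment-only line,
   -1 if the line is malformed or too long. */
int parse_mapping_line(const char *line, unsigned long *code, unsigned long *ucs)
{
    char buf[256];
    size_t n = strcspn(line, "#\r\n");    /* cut at the comment or line end */
    if (n >= sizeof buf)
        return -1;
    memcpy(buf, line, n);
    buf[n] = '\0';

    /* Nothing but whitespace left: blank or comment-only line. */
    if (buf[strspn(buf, " \t")] == '\0')
        return 0;

    /* The trailing %c only matches if unexpected text follows the two fields. */
    char extra;
    int fields = sscanf(buf, " 0x%lx 0x%lx %c", code, ucs, &extra);
    return (fields == 2) ? 1 : -1;
}
```

       For example, the line "0xA4 0x20AC # EURO SIGN" yields the pair
       (0xA4, 0x20AC), while a line consisting only of a comment is skipped.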
    
    Q: What about bidirectional text? Should it be tagged or reversed when
       converting from ISO-8859-8 or ISO-8859-6 to Unicode? Qt appears to do
       this, see qt-2.0.1/src/tools/qrtlcodec.cpp.
    A: After reading RFC 1556: I don't think so. Support for ISO-8859-8-I and
       ISO-8859-8-E remains to be implemented.
       On the other hand, a page on www.w3c.org says that ISO-8859-8 in *email*
       is visually encoded, ISO-8859-8 in *HTML* is logically encoded, i.e.
       the same as ISO-8859-8-I. I'm confused.
    
    Other character sets not implemented:
    "MNEMONIC" = "csMnemonic"
    "MNEM" = "csMnem"
    "ISO-10646-UCS-Basic" = "csUnicodeASCII"
    "ISO-10646-Unicode-Latin1" = "csUnicodeLatin1" = "ISO-10646"
    "ISO-10646-J-1"
    "UNICODE-1-1" = "csUnicode11"
    "csWindows31Latin5"
    
    Other aliases not implemented (and not implemented in glibc-2.1 either):
      From MSIE4:
        ISO-8859-1: alias ISO8859-1
        ISO-8859-2: alias ISO8859-2
        KSC_5601: alias KS_C_5601
        UTF-8: aliases UNICODE-1-1-UTF-8 UNICODE-2-0-UTF-8
    
    
    Q: How can I integrate libiconv into my package?
    A: Just copy the entire libiconv package into a subdirectory of your package.
       At configuration time, call libiconv's configure script with the
       appropriate --srcdir option and maybe --enable-static or --disable-shared.
       Then "cd libiconv && make && make install-lib libdir=... includedir=...".
       'install-lib' is a special (not GNU standardized) target which installs
       only the include file - in $(includedir) - and the library - in $(libdir) -
       and does not use other directory variables. After "installing" libiconv
       in your package's build directory, building of your package can proceed.
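
       Once the include file and the library are in place, a small smoke test
       can confirm that your package really links against a working iconv.
       The following is a sketch under assumptions, not something shipped with
       libiconv: the helper name latin1_to_utf8 is hypothetical, and only the
       standard iconv_open(), iconv() and iconv_close() entry points from
       <iconv.h> are used.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only.
   Converts len bytes of ISO-8859-1 text to UTF-8 into out (capacity outcap).
   Returns the number of bytes written, or (size_t)-1 on failure
   (converter unavailable, or output buffer too small). */
size_t latin1_to_utf8(const char *text, size_t len, char *out, size_t outcap)
{
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *in = (char *)text;   /* iconv() takes char **, but only reads the input */
    size_t inleft = len;
    char *dst = out;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &in, &inleft, &dst, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return outcap - outleft;   /* bytes actually produced */
}
```

       Converting the four bytes "caf\xe9" should produce five bytes, because
       the e-acute (0xE9 in ISO-8859-1) becomes the two bytes 0xC3 0xA9 in
       UTF-8.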
    
    Q: Why is the testsuite so big?
    A: Because some of the tests are very comprehensive.
       If you don't feel like using the testsuite, you can simply remove the
       tests/ directory.