Author :
Daniel Veillard
Date :
2000-07-14 12:10:59
Hash :
be40c8b2
Message :
First version of the encoding doc, Daniel.
doc/encoding.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>Libxml Internationalization support</title>
<meta name="GENERATOR" content="amaya V3.2">
<meta http-equiv="Content-Type" content="text/html">
</head>
<body bgcolor="#ffffff">
<h1 align="center">Libxml Internationalization support</h1>
<p>Location: <a
href="http://xmlsoft.org/encoding.html">http://xmlsoft.org/encoding.html</a></p>
<p>Libxml home page: <a href="http://xmlsoft.org/">http://xmlsoft.org/</a></p>
<p>Mailing-list archive: <a
href="http://xmlsoft.org/messages/">http://xmlsoft.org/messages/</a></p>
<p>Version: $Revision$</p>
<p>Table of Contents:</p>
<ol>
<li><a href="#What">What does internationalization support mean?</a></li>
<li><a href="#internal">The internal encoding, how and why</a></li>
<li><a href="#implemente">How is it implemented?</a></li>
<li><a href="#Default">Default supported encodings</a></li>
<li><a href="#extend">How to extend the existing support</a></li>
</ol>
<h2><a name="What">What does internationalization support mean?</a></h2>
<p>XML was designed from the start to support any character set by using
Unicode. Any conformant XML parser has to support the UTF-8 and UTF-16
default encodings, which can both express the full Unicode range. UTF-8 is a
variable-length encoding whose main advantages are that it reuses the same
encoding for ASCII and saves space for Western languages, but it is a bit
more complex to handle in practice. UTF-16 uses 2 bytes per character (and
sometimes combines two such 2-byte units into a surrogate pair to encode a
single character); it makes implementation easier, but looks like overkill
for Western languages. Moreover, the XML specification allows documents to be
encoded in other encodings, on the condition that they are clearly labelled
as such. For example, the following is a well-formed XML document encoded in
ISO-Latin-1 (ISO-8859-1) and using the accented letters that we French like,
in both markup and content:</p>
<pre><?xml version="1.0" encoding="ISO-8859-1"?>
<tr