|
46f05ea4
|
2025-05-09T00:21:47
|
|
html: Rework meta charset handling
Don't use encoding from meta tags when serializing. Only use the value
in `doc->encoding`, matching the XML serializer. This is the actual
encoding used when parsing.
Stop modifying the input document by setting meta tags before
serializing. Meta tags are now injected during serialization.
Add full support for <meta charset=""> which is also used when adding
meta tags.
Align with HTML5 and implement the "algorithm for extracting a character
encoding from a meta element". Only modify the encoding substring in
Content-Type meta tags.
Only switch encoding once when parsing.
Fix htmlSaveFileFormat with a NULL encoding not to declare a misleading
UTF-8 charset.
Fixes #909.
|
|
f933c898
|
2012-09-07T19:32:12
|
|
Keep non-significant blanks node in HTML parser
For https://bugzilla.gnome.org/show_bug.cgi?id=681822
Regardless if the option HTML_PARSE_NOBLANKS is set or not, blank nodes
are removed from a HTML document, for example:
<html>
<head>
<title>This is a test.</title>
</head>
<body>
<p>This is a test.</p>
</body>
</html>
is read as:
<html><head><title>This is a test.</title></head><body>
<p>This is a test.</p>
</body></html>
This changes the default behaviour but the old behaviour is available
as expected when using the parser flag HTML_PARSE_NOBLANKS
Based on original patch from Igor Ignatyuk <igor_ignatiouk@hotmail.com>
* HTMLparser.c: change various places in the parser where ignorable_space
SAX callback was called without checking for the parser flag preference
* xmllint.c: make sure we use the new flag even for HTML parsing
* result/HTML/*: this modifies the output of a number of tests
|
|
868d92da
|
2012-05-10T15:34:57
|
|
Add HTML parser support for HTML5 meta charset encoding declaration
For https://bugzilla.gnome.org/show_bug.cgi?id=655218
http://www.w3.org/TR/2011/WD-html5-20110525/semantics.html#the-meta-element
"""
The charset attribute specifies the character encoding used by the document.
This is a character encoding declaration. If the attribute is present in an XML
document, its value must be an ASCII case-insensitive match for the string
"UTF-8" (and the document is therefore forced to use UTF-8 as its
encoding).
"""
However, while <meta http-equiv="Content-Type" content="text/html;
charset=utf8"> works, <meta charset="utf8"> does not.
While libxml2 HTML parser is not tuned for HTML5, this is a simple
addition
Also added a testcase
|