result/HTML/html5_enc.html


Log

Author Commit Date CI Message
Nick Wellnhofer 46f05ea4 2025-05-09T00:21:47 html: Rework meta charset handling Don't use encoding from meta tags when serializing. Only use the value in `doc->encoding`, matching the XML serializer. This is the actual encoding used when parsing. Stop modifying the input document by setting meta tags before serializing. Meta tags are now injected during serialization. Add full support for <meta charset=""> which is also used when adding meta tags. Align with HTML5 and implement the "algorithm for extracting a character encoding from a meta element". Only modify the encoding substring in Content-Type meta tags. Only switch encoding once when parsing. Fix htmlSaveFileFormat with a NULL encoding not to declare a misleading UTF-8 charset. Fixes #909.
Daniel Veillard f933c898 2012-09-07T19:32:12 Keep non-significant blanks node in HTML parser For https://bugzilla.gnome.org/show_bug.cgi?id=681822 Regardless if the option HTML_PARSE_NOBLANKS is set or not, blank nodes are removed from a HTML document, for example: <html> <head> <title>This is a test.</title> </head> <body> <p>This is a test.</p> </body> </html> is read as: <html><head><title>This is a test.</title></head><body> <p>This is a test.</p> </body></html> This changes the default behaviour but the old behaviour is available as expected when using the parser flag HTML_PARSE_NOBLANKS Based on original patch from Igor Ignatyuk <igor_ignatiouk@hotmail.com> * HTMLparser.c: change various places in the parser where ignorable_space SAX callback was called without checking for the parser flag preference * xmllint.c: make sure we use the new flag even for HTML parsing * result/HTML/*: this modifies the output of a number of tests
Denis Pauk 868d92da 2012-05-10T15:34:57 Add HTML parser support for HTML5 meta charset encoding declaration For https://bugzilla.gnome.org/show_bug.cgi?id=655218 http://www.w3.org/TR/2011/WD-html5-20110525/semantics.html#the-meta-element """ The charset attribute specifies the character encoding used by the document. This is a character encoding declaration. If the attribute is present in an XML document, its value must be an ASCII case-insensitive match for the string "UTF-8" (and the document is therefore forced to use UTF-8 as its encoding). """ However, while <meta http-equiv="Content-Type" content="text/html; charset=utf8"> works, <meta charset="utf8"> does not. While libxml2 HTML parser is not tuned for HTML5, this is a simple addition Also added a testcase