kmx git

Commit	Date	Message
6b50d8c8	2025-06-08T13:05:22	html: Add missing call to grow parser in htmlParseComment Otherwise, long chains of short comments could exhaust the input buffer when pull parsing.
70335c41	2025-06-06T03:29:57	html: Don't stop on unsupported encoding Continue to parse unlike in the XML case.
416da89d	2025-06-04T20:49:16	html: Make htmlCtxtReset call xmlCtxtReset The two implementations shouldn't diverge.
c6206c93	2025-06-05T21:06:11	html: Ignore ASCII-incompatible encoding in meta tag After successfully parsing an ASCII-encoded meta tag, switching to an encoding that isn't ASCII-compatible cannot work.
6a6a46f0	2025-05-28T16:02:41	doc: Fix autolink errors Fix links, remove links to internal functions.
7bd8d1d9	2025-05-28T15:53:38	doc: Prefix autolinks with '#' Use `#func` instead of `func()` to ignore parameters and make all autolinks work.
c5b45fbc	2025-05-16T16:54:09	doc: Misc fixes
6f4b4527	2025-05-15T23:43:32	parser: Stop using ctxt->linenumbers I think this was used to avoid setting the `line` member before it was added (20+ years ago).
258d8706	2025-05-15T17:49:49	codegen: Consolidate tools for code generation Move tools, source files and output tables into codegen directory. Rename some files. Adjust tools to match modified files. Remove generation date and source files from output. Distribute all tools and sources.
adfbeb7e	2025-05-14T04:58:21	doc: Stop using *Ptr typedefs in documentation
a40f36e7	2025-05-14T04:04:28	include: Stop using *Ptr typedefs in public headers
2d83a84c	2025-05-14T00:29:19	doc: Misc improvements
f0983199	2025-05-12T13:00:20	html: Map some encodings according to HTML5 Windows-1252 is a superset of ISO-8859-1 and should be used instead. Same for ASCII. Also map UCS-2 and UTF-16 to UTF-16LE.
05b8fe0a	2025-04-12T23:10:40	html: Don't escape RAWTEXT and PLAINTEXT Align with HTML5.
809ded58	2025-04-12T22:50:56	html: Add more empty elements Add empty HTML5 elements <bgsound>, <keygen>, <source>, <track> and <wbr>. Make <embed> an empty element.
c7c49643	2025-05-09T15:26:15	html: Move DTD creation to endDocument SAX callback
46f05ea4	2025-05-09T00:21:47	html: Rework meta charset handling Don't use encoding from meta tags when serializing. Only use the value in `doc->encoding`, matching the XML serializer. This is the actual encoding used when parsing. Stop modifying the input document by setting meta tags before serializing. Meta tags are now injected during serialization. Add full support for <meta charset=""> which is also used when adding meta tags. Align with HTML5 and implement the "algorithm for extracting a character encoding from a meta element". Only modify the encoding substring in Content-Type meta tags. Only switch encoding once when parsing. Fix htmlSaveFileFormat with a NULL encoding not to declare a misleading UTF-8 charset. Fixes #909.
f3a080bc	2025-05-07T14:32:42	html: Ignore U+0000 in body text Align with HTML5. Fixes #908.
9bbffec5	2025-05-06T17:42:46	doc: Move brief to top, params to bottom of doc comments
b7274fb0	2025-05-03T16:34:02	doc: Misc fixes to HTML parser docs
4a010875	2025-05-03T15:38:15	doc: Move parser option docs to enum
cb1635a6	2025-05-02T19:05:25	doc: Use @since command
e78e05c9	2025-05-02T17:32:51	doc: Fix autolinks to functions Unfortunately, autolinks in .c files aren't converted by Doxygen for some reason.
f7c41287	2025-05-02T15:57:17	doc: Remove more comment block headers
e525564f	2025-05-01T19:20:06	doc: Remove empty lines at start of block These lines were left over after automatic conversion.
e549622b	2025-04-28T15:11:24	doc: Convert documentation to Doxygen Automated conversion based on a few regexes.
69879da8	2025-04-28T14:04:30	doc: Remove email addresses from documentation Also remove authorship information from generated files, hash.c and globals.c which were rewritten.
61890e39	2025-04-27T21:50:15	doc: Prepare for conversion to Doxygen Fix many params in internal functions (not really necessary but Doxygen warns about that in XML mode). Fix formatting in a few corner cases that automatic conversion can't handle. Rearrange some DOC_DISABLE blocks.
4ba1f923	2025-04-18T17:28:24	html: Avoid HTML_PARSE_HTML5 clashing with XML_PARSE_NOENT There are several users that pass invalid XML parser options to the HTML parser. Choose a value that is less likely to clash.
b8018afa	2025-04-09T23:30:47	html: Fix documentation of parser options
2ecc08f6	2025-04-09T21:11:47	html: Deprecate more functions
69b83bb6	2025-03-10T02:18:51	encoding: Detect truncated multi-byte sequences with ICU Unlike iconv or the internal converters, ICU consumes truncated multi-byte sequences at the end of an input buffer. We currently check for a non-empty raw input buffer to detect truncated sequences, so this fails with ICU. It might be possible to inspect the pivot buffer pointers, but it seems cleaner to implement a `flush` flag for some encoding and I/O functions. After flushing, we can check for U_TRUNCATED_CHAR_FOUND with ICU, or detect remaining input with other converters. Also fix detection of truncated sequences for HTML, XML content and DTDs with iconv.
8873a498	2025-03-09T16:21:13	html: Fix areBlanks check Short-lived regression from 71122421.
5f0b1378	2025-03-08T22:07:15	parser: Add more parser context accessors Fixes #763.
5237d90f	2025-03-07T21:15:20	html: Process data before switching encoding This reduces the amount of data to convert and avoids issues with EOF detection. Also reset EOF flag after switching encoding as a precaution.
0b27097a	2025-03-04T12:55:25	encoding: Rename unprefixed public functions
5ed4eafd	2025-02-22T14:51:39	html: Don't invoke SAX callbacks if parser was stopped
63dfcca6	2024-12-16T01:34:29	fuzz: Reduce initial array size
b8234e8c	2025-02-19T12:53:32	html: Fix check for partial named character references Digits are allowed after the first character.
7a61c32b	2025-02-13T23:09:28	html: Use enum instead of magic values for insertion modes
8cf6129b	2025-02-13T18:20:46	html: Stop implying <p> start tags Only <html>, <head> or <body> should be implied. Opening extra <p> tags has always been a libxml2 quirk.
71122421	2025-02-13T14:04:10	html: Make implied <p> tags more deterministic libxml2's HTML parser adds <p> start tags in some situations. This behavior, which doesn't follow any standard, was added in 2000, see here: http://veillard.com/XML/messages/0655.html Text nodes that only contain whitespace don't imply a <p> tag, but the whitespace check cannot work reliably if we're parsing partial text data which can happen with both pull and push parser. The logic in `areBlanks` is hard to follow. The checks involving `CUR` depend on the position of the input pointer and seem dubious. It's also possible that the behavior changed inadvertently with a later commit. As a result, it's hard to come up with good test cases. We now process leading whitespace before creating implied tags. This is more in line with HTML5 and should avoid at least some issues with partial text data. For example, parsing the string "<head> x" used to result in: <html> <head></head> <body><p> x</p></body> </html> And now results in: <html> <head> </head> <body><p>x</p></body> </html> Except for the implied <p> tag, this matches HTML5.
8d7e38d5	2025-02-01T22:41:53	fuzz: Ignore encodings when fuzzing on Apple Not long ago, Apple decided to replace GNU libiconv with a patched up version of FreeBSD's iconv implementation in their operating systems. Unfortunately, the quality of both the original implementation as well as Apple's patches is so abysmal that you routinely find issues when fuzzing your own code.
68be036f	2025-02-01T22:09:18	fuzz: Disable HTML encoding detection for now This doesn't work with the push parser.
c13fcc19	2025-02-01T19:36:06	html: Chunk text data in push parser Follow the logic of the XML parser and chunk large text nodes.
08028572	2025-02-01T18:21:47	html: Make data parsing modes work with push parser This can't be solved with a simple scan for a terminator. Instead, we make htmlParseCharData handle incomplete data if the "partial" flag is set.
4be1e8be	2025-02-01T15:00:26	html: Simplify htmlParseTryOrFinish a little
12732592	2025-02-01T00:36:12	html: Remove unused epilog state
70bf754e	2025-02-01T00:17:01	html: Fix pull-parsing of incomplete end tags Handle this HTML5 quirk in htmlParseEndTag.
4a776c78	2025-01-31T23:57:44	html: Use htmlParseElementInternal in push parser
ba153737	2025-01-31T22:51:59	html: Fix corner case when push-parsing HTML5 comments
e48fb5e4	2025-01-31T22:08:13	html: Handle incomplete UTF-8 when push-parsing For now, incomplete UTF-8 is always an error in push mode. Eventually, we could pass chunked data to the character handler when push-parsing. Then we'd have to handle incomplete sequences.
6bb2ea8e	2025-02-01T14:58:06	html: Adjust xmlDetectEncoding for HTML Don't check for UTF-32 or EBCDIC. We now perform BOM sniffing and the first step of the HTML5 prescan algorithm (detect UTF-16 XML declarations). The rest of the algorithm still has to be implemented.
227d8f73	2025-01-31T21:05:22	html: Support encoding auto-detection in push parser Align with pull parser.
641fb1ac	2025-01-31T20:41:28	html: Fix state update in push parser
a86a8ae9	2025-01-31T20:09:54	html: Fix push-parsing of empty documents Also simplify end-of-document handling in push parser. Align with pull parser.
ca819160	2025-01-03T20:50:08	include: Use intptr_t to cast between pointers and ints
53c131f6	2024-12-26T20:29:58	doc: Make apibuild.py work again
0447275e	2024-12-15T21:17:07	html: Check reallocations for overflow
6548ba11	2024-12-13T16:37:40	parser: Fix argument checks in xmlCtxtParse* - Raise invalid argument error. - Free input stream if ctxt is NULL.
497081ba	2024-11-17T20:25:07	parser: Remove remaining calls to xml{Push|Pop}Input
0f4f8900	2024-11-17T20:13:14	parser: Rename inputPush to xmlCtxtPushInput
225ed707	2024-09-26T22:38:24	html: Accelerate htmlParseCharData
20799979	2024-09-26T17:09:40	html: Handle numeric character references directly
0bc4608c	2024-09-15T20:28:49	html: Use hash table to check for duplicate attributes
24a6149f	2024-09-15T19:18:40	html: Make sure that character data mode is reset
c32397d5	2024-09-12T22:39:05	html: Improve character class macros
e8406554	2024-09-12T15:21:03	html: Rewrite parsing of most data
f77ec16d	2024-09-12T01:45:34	html: Optimize htmlParseCharData
440bd64c	2024-09-12T04:01:38	html: Optimize htmlParseHTMLName
6040785a	2024-09-12T23:12:01	html: Deprecate AutoClose API
188cad68	2024-09-12T02:51:20	html: Remove obsolete content model
0144f662	2024-09-12T02:30:10	html: Remove obsolete code
575be6c1	2024-09-12T01:40:07	html: Fix line numbers with CRs
be874d78	2024-09-11T19:47:07	html: Ignore unexpected DOCTYPE declarations
462bf0b7	2024-09-11T19:06:06	html: Rework options Introduce htmlCtxtSetOptions, see similar changes made to XML parser. Add HTML_PARSE_HUGE alias. Support HTML_PARSE_BIG_LINES.
42c3823d	2024-09-11T19:05:09	html: Update comment
9f04cce6	2024-09-11T17:43:07	html: Remove unused or useless return codes htmlParseStartTag should always succeed (except for malloc failures).
e179f3ec	2024-09-11T17:29:59	html: Stop reporting syntax errors It doesn't make much sense to keep the old syntax error handling which doesn't conform to HTML5. Handling HTML5 parser errors is rather involved and not essential for parsers.
27752f75	2024-09-11T15:06:55	html: Fix EOF handling in start tags
b19d3539	2024-09-11T15:03:49	html: Fix EOF handling in comments
17e56ac5	2024-09-11T14:24:58	html: Fix parsing of end tags
24a09033	2024-09-09T02:53:14	html: Fix bogus end tags
bca64854	2024-09-09T02:30:18	html: Allow U+000C FORM FEED as whitespace
6edf1a64	2024-09-09T02:09:20	html: Fix DOCTYPE parsing
9678163f	2024-09-09T02:01:19	html: Don't check for valid XML characters
a6955c13	2024-09-08T23:19:49	html: Parse numeric character references according to HTML5
4eeac309	2024-09-08T22:20:20	html: Start to fix EOF and U+0000 handling
e062a4a9	2024-09-08T20:40:36	html: Add HTML5 parser option This option passes tokenizer output directly to the SAX callbacks, making it possible to test the tokenizer against the html5lib test suite. This will produce unbalanced calls to the startElement and endElement callbacks, but it's the only way to support a SAX like interface for HTML5. It can be used for filtering or rewriting HTML5, for example. A HTML5 tree builder could then be implemented on top of the SAX callbacks.
17da54c5	2024-09-08T19:16:12	html: Normalize newlines
341dc78f	2024-09-08T19:11:14	html: Deduplicate code in htmlCurrentChar
3adb396d	2024-09-07T15:18:13	html: Parse bogus comments instead of ignoring them Also treat XML processing instructions as bogus comments.
84440175	2024-09-07T14:21:12	html: Add missing calls to htmlCheckParagraph()
86d6b9b0	2024-09-07T04:18:06	html: Deduplicate some code
0d324bde	2024-09-07T03:45:09	html: Simplify node info accounting
ccb61f59	2024-09-07T03:15:50	html: Remove duplicate calls to htmlAutoClose
f9ed30e9	2024-09-06T17:49:04	html: HTML5 character data states
59511792	2024-09-03T15:52:44	html: Parse named character references according to HTML5
d5cd0f07	2022-07-15T17:00:36	html: Prefer SKIP(1) over NEXT in HTML parser Use SKIP(1) where it's safe to avoid a function call.
dc2d4983	2023-05-04T17:47:38	html: Rework htmlLookupSequence Rename to htmlLookupString and use strstr for increased performance.
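
Commit f0983199 above describes mapping some encoding labels according to HTML5: ISO-8859-1 and ASCII become Windows-1252, while UCS-2 and UTF-16 become UTF-16LE. The following is only a minimal sketch of such a mapping, not the code from that commit. The names `labelEquals` and `sketchMapHtmlEncoding` are hypothetical and are not libxml2 functions, and the table covers only the labels named in the commit message.

```c
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive ASCII comparison; returns 1 if the labels are equal. */
static int
labelEquals(const char *a, const char *b) {
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    /* Equal only if both strings ended at the same position. */
    return *a == *b;
}

/*
 * Hypothetical helper (an assumption for illustration, not libxml2 API):
 * return the encoding name to use for a given label. Labels without a
 * mapping are returned unchanged; NULL is passed through.
 */
static const char *
sketchMapHtmlEncoding(const char *label) {
    static const struct {
        const char *from;
        const char *to;
    } map[] = {
        { "ISO-8859-1", "windows-1252" },
        { "ASCII",      "windows-1252" },
        { "UCS-2",      "UTF-16LE" },
        { "UTF-16",     "UTF-16LE" },
    };
    size_t i;

    if (label == NULL)
        return NULL;

    for (i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
        if (labelEquals(label, map[i].from))
            return map[i].to;
    }

    return label;
}
```

A linear scan over a small static table keeps the sketch easy to read. The real parser may well organize this differently.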


kc3-lang/libxml2/HTMLparser.c
