|
4ba1f923
|
2025-04-18T17:28:24
|
|
html: Avoid HTML_PARSE_HTML5 clashing with XML_PARSE_NOENT
There are several users that pass invalid XML parser options to the
HTML parser. Choose a value that is less likely to clash.
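
A minimal sketch of why a shared bit position is a problem. The `DEMO_` names and the bit positions below are hypothetical illustrations, not values copied from the libxml2 headers:

```c
#include <assert.h>

/* Hypothetical option bits, for illustration only. */
enum demo_xml_opts {
    DEMO_XML_NOENT = 1 << 1
};

enum demo_html_opts {
    DEMO_HTML_HTML5_OLD = 1 << 1,   /* same bit as DEMO_XML_NOENT */
    DEMO_HTML_HTML5_NEW = 1 << 26   /* high bit, less likely to be reused */
};

/* Returns 1 if the caller's option mask has the given HTML5 bit set. */
int demo_html5_enabled(int options, int html5_bit) {
    return (options & html5_bit) != 0;
}
```

A caller that mistakenly passes the XML option to an HTML entry point would switch on the HTML5 mode with the old bit, but not with the new one.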
|
|
b8018afa
|
2025-04-09T23:30:47
|
|
html: Fix documentation of parser options
|
|
2ecc08f6
|
2025-04-09T21:11:47
|
|
html: Deprecate more functions
|
|
69b83bb6
|
2025-03-10T02:18:51
|
|
encoding: Detect truncated multi-byte sequences with ICU
Unlike iconv or the internal converters, ICU consumes truncated multi-
byte sequences at the end of an input buffer. We currently check for a
non-empty raw input buffer to detect truncated sequences, so this fails
with ICU.
It might be possible to inspect the pivot buffer pointers, but it seems
cleaner to implement a `flush` flag for some encoding and I/O functions.
After flushing, we can check for U_TRUNCATED_CHAR_FOUND with ICU, or
detect remaining input with other converters.
Also fix detection of truncated sequences for HTML, XML content and
DTDs with iconv.
|
|
8873a498
|
2025-03-09T16:21:13
|
|
html: Fix areBlanks check
Short-lived regression from 71122421.
|
|
5f0b1378
|
2025-03-08T22:07:15
|
|
parser: Add more parser context accessors
Fixes #763.
|
|
5237d90f
|
2025-03-07T21:15:20
|
|
html: Process data before switching encoding
This reduces the amount of data to convert and avoids issues with EOF
detection.
Also reset EOF flag after switching encoding as a precaution.
|
|
0b27097a
|
2025-03-04T12:55:25
|
|
encoding: Rename unprefixed public functions
|
|
5ed4eafd
|
2025-02-22T14:51:39
|
|
html: Don't invoke SAX callbacks if parser was stopped
|
|
63dfcca6
|
2024-12-16T01:34:29
|
|
fuzz: Reduce initial array size
|
|
b8234e8c
|
2025-02-19T12:53:32
|
|
html: Fix check for partial named character references
Digits are allowed after the first character.
|
|
7a61c32b
|
2025-02-13T23:09:28
|
|
html: Use enum instead of magic values for insertion modes
|
|
8cf6129b
|
2025-02-13T18:20:46
|
|
html: Stop implying <p> start tags
Only <html>, <head> or <body> should be implied. Opening extra <p> tags
has always been a libxml2 quirk.
|
|
71122421
|
2025-02-13T14:04:10
|
|
html: Make implied <p> tags more deterministic
libxml2's HTML parser adds <p> start tags in some situations. This
behavior, which doesn't follow any standard, was added in 2000, see
here: http://veillard.com/XML/messages/0655.html
Text nodes that only contain whitespace don't imply a <p> tag, but the
whitespace check cannot work reliably if we're parsing partial text data,
which can happen with both the pull and the push parser.
The logic in `areBlanks` is hard to follow. The checks involving `CUR`
depend on the position of the input pointer and seem dubious. It's also
possible that the behavior changed inadvertently with a later commit.
As a result, it's hard to come up with good test cases.
We now process leading whitespace before creating implied tags. This is
more in line with HTML5 and should avoid at least some issues with
partial text data.
For example, parsing the string "<head> x" used to result in:
<html>
<head></head>
<body><p> x</p></body>
</html>
And now results in:
<html>
<head> </head>
<body><p>x</p></body>
</html>
Except for the implied <p> tag, this matches HTML5.
|
|
8d7e38d5
|
2025-02-01T22:41:53
|
|
fuzz: Ignore encodings when fuzzing on Apple
Not long ago, Apple decided to replace GNU libiconv with a patched up
version of FreeBSD's iconv implementation in their operating systems.
Unfortunately, the quality of both the original implementation as well
as Apple's patches is so abysmal that you routinely find issues when
fuzzing your own code.
|
|
68be036f
|
2025-02-01T22:09:18
|
|
fuzz: Disable HTML encoding detection for now
This doesn't work with the push parser.
|
|
c13fcc19
|
2025-02-01T19:36:06
|
|
html: Chunk text data in push parser
Follow the logic of the XML parser and chunk large text nodes.
|
|
08028572
|
2025-02-01T18:21:47
|
|
html: Make data parsing modes work with push parser
This can't be solved with a simple scan for a terminator. Instead, we
make htmlParseCharData handle incomplete data if the "partial" flag is
set.
|
|
4be1e8be
|
2025-02-01T15:00:26
|
|
html: Simplify htmlParseTryOrFinish a little
|
|
12732592
|
2025-02-01T00:36:12
|
|
html: Remove unused epilog state
|
|
70bf754e
|
2025-02-01T00:17:01
|
|
html: Fix pull-parsing of incomplete end tags
Handle this HTML5 quirk in htmlParseEndTag.
|
|
4a776c78
|
2025-01-31T23:57:44
|
|
html: Use htmlParseElementInternal in push parser
|
|
ba153737
|
2025-01-31T22:51:59
|
|
html: Fix corner case when push-parsing HTML5 comments
|
|
e48fb5e4
|
2025-01-31T22:08:13
|
|
html: Handle incomplete UTF-8 when push-parsing
For now, incomplete UTF-8 is always an error in push mode.
Eventually, we could pass chunked data to the character handler when
push-parsing. Then we'd have to handle incomplete sequences.
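
One way to detect an incomplete sequence at the end of a chunk is sketched below. This is an illustration only, not the code from this commit; `demo_utf8_incomplete_tail` is a hypothetical helper, and it only inspects the tail of the buffer without validating the rest:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the number of bytes at the end of buf that form the start of a
 * UTF-8 sequence whose remaining bytes have not arrived yet, or 0 if the
 * buffer does not end in such a prefix.
 */
size_t demo_utf8_incomplete_tail(const unsigned char *buf, size_t len) {
    size_t back;

    assert(buf != NULL || len == 0);

    /* The lead byte of a truncated sequence is at most 3 bytes from the end. */
    for (back = 1; back <= 3 && back <= len; back++) {
        unsigned char c = buf[len - back];
        size_t need;

        if ((c & 0xC0) == 0x80)
            continue;                   /* continuation byte, keep looking */

        if ((c & 0xE0) == 0xC0)
            need = 2;
        else if ((c & 0xF0) == 0xE0)
            need = 3;
        else if ((c & 0xF8) == 0xF0)
            need = 4;
        else
            return 0;                   /* ASCII or invalid lead byte */

        return back < need ? back : 0;
    }

    return 0;
}
```

A push parser could treat a nonzero result as an error, as this commit does, or hold the bytes back until the next chunk arrives.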
|
|
6bb2ea8e
|
2025-02-01T14:58:06
|
|
html: Adjust xmlDetectEncoding for HTML
Don't check for UTF-32 or EBCDIC.
We now perform BOM sniffing and the first step of the HTML5 prescan
algorithm (detect UTF-16 XML declarations). The rest of the algorithm
still has to be implemented.
|
|
227d8f73
|
2025-01-31T21:05:22
|
|
html: Support encoding auto-detection in push parser
Align with pull parser.
|
|
641fb1ac
|
2025-01-31T20:41:28
|
|
html: Fix state update in push parser
|
|
a86a8ae9
|
2025-01-31T20:09:54
|
|
html: Fix push-parsing of empty documents
Also simplify end-of-document handling in push parser.
Align with pull parser.
|
|
ca819160
|
2025-01-03T20:50:08
|
|
include: Use intptr_t to cast between pointers and ints
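
The pattern can be sketched as follows. The helper names are hypothetical; the round trip relies on implementation-defined conversions that behave as expected on mainstream platforms:

```c
#include <assert.h>
#include <stdint.h>

/* Store a small integer in a void * slot, e.g. a generic user-data field. */
void *demo_int_to_ptr(int value) {
    return (void *) (intptr_t) value;
}

/* Recover the integer stored by demo_int_to_ptr. */
int demo_ptr_to_int(void *ptr) {
    return (int) (intptr_t) ptr;
}
```

Going through `intptr_t` avoids compiler warnings about casts between pointers and integers of different size.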
|
|
53c131f6
|
2024-12-26T20:29:58
|
|
doc: Make apibuild.py work again
|
|
0447275e
|
2024-12-15T21:17:07
|
|
html: Check reallocations for overflow
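
A minimal sketch of an overflow-checked growth step. `demo_grow` is a hypothetical helper and does not reproduce the checks added by this commit:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Double the capacity of an array of elem_size-byte elements.
 * Returns 0 on success and -1 if the new size would overflow size_t or
 * the allocation fails. *buf and *cap are left unchanged on failure.
 */
int demo_grow(void **buf, size_t *cap, size_t elem_size) {
    size_t new_cap;
    void *tmp;

    assert(buf != NULL && cap != NULL && elem_size > 0);

    if (*cap > SIZE_MAX / 2)
        return -1;                      /* doubling would overflow */
    new_cap = *cap ? *cap * 2 : 8;

    if (new_cap > SIZE_MAX / elem_size)
        return -1;                      /* byte count would overflow */

    tmp = realloc(*buf, new_cap * elem_size);
    if (tmp == NULL)
        return -1;

    *buf = tmp;
    *cap = new_cap;
    return 0;
}
```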
|
|
6548ba11
|
2024-12-13T16:37:40
|
|
parser: Fix argument checks in xmlCtxtParse*
- Raise invalid argument error.
- Free input stream if ctxt is NULL.
|
|
497081ba
|
2024-11-17T20:25:07
|
|
parser: Remove remaining calls to xml{Push|Pop}Input
|
|
0f4f8900
|
2024-11-17T20:13:14
|
|
parser: Rename inputPush to xmlCtxtPushInput
|
|
225ed707
|
2024-09-26T22:38:24
|
|
html: Accelerate htmlParseCharData
|
|
20799979
|
2024-09-26T17:09:40
|
|
html: Handle numeric character references directly
|
|
0bc4608c
|
2024-09-15T20:28:49
|
|
html: Use hash table to check for duplicate attributes
|
|
24a6149f
|
2024-09-15T19:18:40
|
|
html: Make sure that character data mode is reset
|
|
c32397d5
|
2024-09-12T22:39:05
|
|
html: Improve character class macros
|
|
e8406554
|
2024-09-12T15:21:03
|
|
html: Rewrite parsing of most data
|
|
f77ec16d
|
2024-09-12T01:45:34
|
|
html: Optimize htmlParseCharData
|
|
440bd64c
|
2024-09-12T04:01:38
|
|
html: Optimize htmlParseHTMLName
|
|
6040785a
|
2024-09-12T23:12:01
|
|
html: Deprecate AutoClose API
|
|
188cad68
|
2024-09-12T02:51:20
|
|
html: Remove obsolete content model
|
|
0144f662
|
2024-09-12T02:30:10
|
|
html: Remove obsolete code
|
|
575be6c1
|
2024-09-12T01:40:07
|
|
html: Fix line numbers with CRs
|
|
be874d78
|
2024-09-11T19:47:07
|
|
html: Ignore unexpected DOCTYPE declarations
|
|
462bf0b7
|
2024-09-11T19:06:06
|
|
html: Rework options
Introduce htmlCtxtSetOptions, see similar changes made to XML parser.
Add HTML_PARSE_HUGE alias. Support HTML_PARSE_BIG_LINES.
|
|
42c3823d
|
2024-09-11T19:05:09
|
|
html: Update comment
|
|
9f04cce6
|
2024-09-11T17:43:07
|
|
html: Remove unused or useless return codes
htmlParseStartTag should always succeed (except for malloc failures).
|
|
e179f3ec
|
2024-09-11T17:29:59
|
|
html: Stop reporting syntax errors
It doesn't make much sense to keep the old syntax error handling which
doesn't conform to HTML5.
Handling HTML5 parser errors is rather involved and not essential for
parsers.
|
|
27752f75
|
2024-09-11T15:06:55
|
|
html: Fix EOF handling in start tags
|
|
b19d3539
|
2024-09-11T15:03:49
|
|
html: Fix EOF handling in comments
|
|
17e56ac5
|
2024-09-11T14:24:58
|
|
html: Fix parsing of end tags
|
|
24a09033
|
2024-09-09T02:53:14
|
|
html: Fix bogus end tags
|
|
bca64854
|
2024-09-09T02:30:18
|
|
html: Allow U+000C FORM FEED as whitespace
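
For reference, HTML5 treats five characters as ASCII whitespace. A hypothetical predicate (not the macro used in the parser) could look like this:

```c
#include <assert.h>

/* HTML5 ASCII whitespace: TAB, LF, FORM FEED, CR and SPACE. */
int demo_is_html_ws(int c) {
    return c == 0x09 || c == 0x0A || c == 0x0C || c == 0x0D || c == 0x20;
}
```

Note that U+000B (vertical tab) is not part of this set.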
|
|
6edf1a64
|
2024-09-09T02:09:20
|
|
html: Fix DOCTYPE parsing
|
|
9678163f
|
2024-09-09T02:01:19
|
|
html: Don't check for valid XML characters
|
|
a6955c13
|
2024-09-08T23:19:49
|
|
html: Parse numeric character references according to HTML5
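
A partial sketch of the HTML5 rules for the value of a numeric character reference. `demo_fix_numeric_ref` is a hypothetical helper; the remapping of 0x80 to 0x9F to their Windows-1252 counterparts, which HTML5 also requires, is left out for brevity:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_REPLACEMENT_CHAR 0xFFFDu

/* Map the numeric value of a reference like &#x41; to a code point. */
uint32_t demo_fix_numeric_ref(uint32_t code) {
    if (code == 0)
        return DEMO_REPLACEMENT_CHAR;   /* null character reference */
    if (code > 0x10FFFF)
        return DEMO_REPLACEMENT_CHAR;   /* outside the Unicode range */
    if (code >= 0xD800 && code <= 0xDFFF)
        return DEMO_REPLACEMENT_CHAR;   /* surrogate */
    return code;
}
```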
|
|
4eeac309
|
2024-09-08T22:20:20
|
|
html: Start to fix EOF and U+0000 handling
|
|
e062a4a9
|
2024-09-08T20:40:36
|
|
html: Add HTML5 parser option
This option passes tokenizer output directly to the SAX callbacks,
making it possible to test the tokenizer against the html5lib test
suite.
This will produce unbalanced calls to the startElement and endElement
callbacks, but it's the only way to support a SAX-like interface for
HTML5. It can be used for filtering or rewriting HTML5, for example.
An HTML5 tree builder could then be implemented on top of the SAX
callbacks.
|
|
17da54c5
|
2024-09-08T19:16:12
|
|
html: Normalize newlines
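
HTML5 normalizes CR LF pairs and lone CRs to LF. A minimal in-place sketch, assuming the whole input is available in one buffer (a CR at the very end of a chunk would need extra care in a streaming parser); `demo_normalize_newlines` is a hypothetical helper:

```c
#include <assert.h>
#include <stddef.h>

/* Rewrite CR LF and lone CR as LF in place. Returns the new length. */
size_t demo_normalize_newlines(char *buf, size_t len) {
    size_t in = 0, out = 0;

    assert(buf != NULL || len == 0);

    while (in < len) {
        char c = buf[in++];

        if (c == '\r') {
            if (in < len && buf[in] == '\n')
                in++;                   /* swallow the LF of a CR LF pair */
            c = '\n';
        }
        buf[out++] = c;
    }

    return out;
}
```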
|
|
341dc78f
|
2024-09-08T19:11:14
|
|
html: Deduplicate code in htmlCurrentChar
|
|
3adb396d
|
2024-09-07T15:18:13
|
|
html: Parse bogus comments instead of ignoring them
Also treat XML processing instructions as bogus comments.
|
|
84440175
|
2024-09-07T14:21:12
|
|
html: Add missing calls to htmlCheckParagraph()
|
|
86d6b9b0
|
2024-09-07T04:18:06
|
|
html: Deduplicate some code
|
|
0d324bde
|
2024-09-07T03:45:09
|
|
html: Simplify node info accounting
|
|
ccb61f59
|
2024-09-07T03:15:50
|
|
html: Remove duplicate calls to htmlAutoClose
|
|
f9ed30e9
|
2024-09-06T17:49:04
|
|
html: HTML5 character data states
|
|
59511792
|
2024-09-03T15:52:44
|
|
html: Parse named character references according to HTML5
|
|
d5cd0f07
|
2022-07-15T17:00:36
|
|
html: Prefer SKIP(1) over NEXT in HTML parser
Use SKIP(1) where it's safe to avoid a function call.
|
|
dc2d4983
|
2023-05-04T17:47:38
|
|
html: Rework htmlLookupSequence
Rename to htmlLookupString and use strstr for increased performance.
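
The idea can be sketched as below, assuming a NUL-terminated input buffer. `demo_lookup_string` is a hypothetical stand-in and not the actual htmlLookupString:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the offset of the first occurrence of needle in the
 * NUL-terminated buffer buf, or -1 if it is not present (yet).
 */
ptrdiff_t demo_lookup_string(const char *buf, const char *needle) {
    const char *hit;

    assert(buf != NULL && needle != NULL);

    hit = strstr(buf, needle);
    return hit ? hit - buf : -1;
}
```

A push parser can use a negative result as a signal to wait for more data before parsing a construct such as a comment.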
|
|
637215a4
|
2023-05-04T17:16:51
|
|
html: Always terminate doctype declarations on '>'
Align with the HTML5 spec. This allows removing the old quote handling in
htmlLookupSequence.
|
|
72e29f9a
|
2023-05-04T17:03:22
|
|
html: Fix quadratic behavior in push parser
Fix quadratic behavior related to unquoted attribute values. We really
have to replicate parts of the HTML5 state machine to find the end of
tags reliably.
Fixes #533.
|
|
a80f8b64
|
2023-05-04T15:59:31
|
|
html: Allow attributes in end tags
Attributes are syntactically allowed in HTML5 end tags but otherwise
ignored.
|
|
f2272c23
|
2023-05-04T15:33:27
|
|
html: Handle unexpected-solidus-in-tag according to HTML5
|
|
939b53ee
|
2023-05-04T15:25:24
|
|
html: Stop skipping tag content
Tag and attribute names should always be parsed successfully now.
|
|
dcb2abb2
|
2023-05-04T15:16:29
|
|
html: Parse tag and attribute names according to HTML5
HTML5 allows basically all characters in tag and attribute names.
|
|
5d36664f
|
2024-07-16T00:35:53
|
|
memory: Deprecate xmlGcMemSetup
|
|
8af55c8d
|
2024-07-06T22:14:21
|
|
parser: Rename new input API functions
These weren't made public yet.
|
|
d74ca594
|
2024-07-06T22:04:06
|
|
parser: Rename internal xmlNewInput functions
|
|
4f329dc5
|
2024-07-10T03:27:47
|
|
parser: Implement xmlCtxtParseContent
This implements xmlCtxtParseContent, a better alternative to
xmlParseInNodeContext or xmlParseBalancedChunkMemory. It accepts a
parser context and a parser input, making it a lot more versatile.
xmlParseInNodeContext is now implemented in terms of
xmlCtxtParseContent. This makes sure that xmlParseInNodeContext never
modifies the target document, improving thread safety.
xmlParseInNodeContext is also more lenient now with regard to undeclared
entities.
Fixes #727.
|
|
2e63656e
|
2024-07-07T19:21:46
|
|
parser: Check return value of inputPush
inputPush typically doesn't fail because we pre-allocate the input
table. The return value should be checked nevertheless.
|
|
fdfeecfe
|
2024-07-02T21:54:26
|
|
parser: Reenable ctxt->directory
Unused internally, but used in downstream code.
Should fix #753.
|
|
30ef7755
|
2024-07-02T04:02:16
|
|
parser: Don't use deprecated xmlCopyChar
|
|
dd8e3785
|
2024-06-28T21:15:27
|
|
HTML: Rework UTF8ToHtml
Optimize code. Check for XML_ENC_ERR_SPACE. Use error macros.
|
|
f505dcae
|
2024-06-26T14:11:34
|
|
tree: Remove underscores from xmlRegisterCallbacks
|
|
1112699c
|
2024-06-17T02:42:18
|
|
legacy: Remove most legacy functions from public headers
Also remove warning messages.
|
|
039ce1e8
|
2024-06-14T16:41:43
|
|
parser: Pass global object to sax->setDocumentLocator
Revert part of commit c011e760.
Fixes #732.
|
|
89fcae4d
|
2024-06-11T16:19:58
|
|
parser: Don't report malloc failures when creating context
We don't want messages to stderr before an error handler could be set on
a parser context.
|
|
e75e878e
|
2024-05-20T13:58:22
|
|
doc: Update and fix documentation
|
|
a4c2b723
|
2024-05-05T17:26:31
|
|
io: Don't set close callback in xmlParserInputBufferCreateFd
|
|
05654cfe
|
2024-04-28T17:54:20
|
|
html: Deprecate htmlHandleOmittedElem
|
|
aa04838e
|
2024-03-26T14:10:58
|
|
html: Use binary search in htmlEntityValueLookup
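
A sketch of a value-to-name lookup with the standard `bsearch`. The table below is a tiny hypothetical excerpt sorted by code point, not the table used by libxml2:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_entity {
    unsigned value;
    const char *name;
};

/* Must stay sorted by value for the binary search to work. */
static const struct demo_entity demo_entities[] = {
    { 34, "quot" }, { 38, "amp" }, { 60, "lt" }, { 62, "gt" },
    { 160, "nbsp" }, { 169, "copy" }
};

/* Compare a search key (unsigned *) against a table entry. */
static int demo_entity_cmp(const void *key, const void *elem) {
    unsigned k = *(const unsigned *) key;
    unsigned v = ((const struct demo_entity *) elem)->value;

    return (k > v) - (k < v);
}

/* Return the entity name for a code point, or NULL if there is none. */
const char *demo_entity_name(unsigned value) {
    const struct demo_entity *e = bsearch(&value, demo_entities,
            sizeof(demo_entities) / sizeof(demo_entities[0]),
            sizeof(demo_entities[0]), demo_entity_cmp);

    return e ? e->name : NULL;
}
```

Compared with a linear scan, the lookup cost grows logarithmically with the table size.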
|
|
3efbe916
|
2024-01-05T00:11:29
|
|
parser: Mark 'token' member as unused in xmlParserCtxt
|
|
b82fd81d
|
2024-01-04T23:25:06
|
|
parser: Rework xmlCtxtParseDocument
Make xmlCtxtParseDocument take a parser input which can be popped after
parsing.
|
|
7e0bbbc1
|
2023-12-27T18:33:30
|
|
parser: New input API
Provide a new set of functions to create xmlParserInputs. These can be
used for the document entity or from external entity loaders.
- Don't require xmlParserInputBuffer.
- All functions take a base URI.
- All functions take an encoding as string.
- xmlNewInputURL also takes a public ID.
- xmlNewInputMemory takes a size_t.
- Optimization hints for memory buffers.
Improve documentation.
Only call xmlInitParser before allocating a new parser context.
Call xmlCtxtUseOptions as early as possible.
|
|
6a9a88a1
|
2023-12-26T03:13:05
|
|
parser: Move progressive flag into input struct
|
|
d944a415
|
2023-12-26T02:10:35
|
|
parser: Fix in-parameter-entity and in-external-dtd checks
Use ctxt->input->entity instead of ctxt->inputNr to determine whether
we are inside a parameter entity.
Stop using ctxt->external to check whether we're in an external DTD.
This is signaled by ctxt->inSubset == 2.
|
|
477a7ed8
|
2023-12-28T19:06:32
|
|
html: Abort earlier on fatal errors
|