kmx git

Commit	Date	Message
e0dd330b	2023-09-29T00:18:44	parser: Use hash tables to avoid quadratic behavior Use a hash table to lookup namespaces by prefix. The hash table stores an index into the namespace table. Auxiliary data for namespaces is stored in a separate array along the main namespace table. Use a hash table to verify attribute uniqueness. The hash table stores an index into the attribute table. Reuse hash value from the dictionary to avoid computing them twice. See #346.
a873191c	2023-09-25T14:51:35	parser: Introduce xmlParseQNameHashed
8c084ebd	2023-09-21T22:57:33	doc: Make apibuild.py happy
11a1839d	2023-09-20T17:54:48	globals: Move remaining globals back to correct header files This undoes a lot of damage.
a77f9ab8	2023-09-20T16:57:22	globals: Don't include SAX2.h from globals.h
2e6c49a7	2023-09-20T14:43:14	globals: Don't store xmlParserVersion in global state This is a constant.
a07ec7c1	2023-09-18T17:39:13	threads: Move library initialization code to threads.c This allows to consolidate the initialization code since the global init lock was already implemented in threads.c.
4e1c13eb	2023-09-18T14:45:10	debug: Remove debugging code This is barely useful these days and only clutters the code base.
c19771c1	2023-09-18T00:54:39	globals: Move code from threads.c to globals.c Move all code that handles globals to the place where it belongs.
d7cfe356	2023-09-14T20:52:24	parser: Avoid undefined behavior in xmlParseStartTag2 Instead of using arithmetic on dangling pointers, store ptrdiff_t values in void pointers which is at least implementation-defined.
57cfd221	2023-09-01T14:52:04	dict: Use xoroshiro64** as PRNG Stop using rand_r. This enables hash randomization on all platforms.
53050b1d	2023-08-29T20:06:43	parser: More fixes to push parser error handling
bbd918b2	2023-08-29T15:56:37	parser: Fix detection of null bytes Also suppress misleading extra errors. Fixes #122.
c6083a32	2023-08-29T16:30:22	parser: Improve error handling in push parser - Report errors earlier - Align error messages with pull parser
1edae30f	2023-08-29T15:58:22	parser: Don't check inputNr in xmlParseTryOrFinish There's no apparent reason for this check. inputNr should always be 1 here.
e48f2695	2023-08-29T17:41:18	parser: Remove push parser debugging code
ed3bd052	2023-08-20T20:48:10	parser: Allow to set maximum amplification factor
855818bd	2023-08-08T15:21:37	parser: Check for truncated multi-byte sequences When decoding input data, check whether the "raw" buffer is empty after parsing the document. Otherwise, the input ends with a truncated multi-byte sequence which shouldn't be silently ignored.
95e81a36	2023-08-08T15:21:31	parser: Decode all data in xmlCharEncInput Even with flush set to true, xmlCharEncInput didn't guarantee to decode all data. This complicated the push parser. Remove the flush flag and always decode all available data. Also fix ICU code where the flush flag has a different meaning. Always set flush to false and retry even with empty input buffers.
834b8123	2023-08-08T15:21:28	parser: Stream data when reading from memory Don't create a copy of the whole input buffer. Read the data chunk by chunk to save memory. Historically, it was probably envisioned to read data from memory without additional copying. This doesn't work reliably with the current design of the XML parser which requires a terminating null byte at the end of input buffers. This led to xmlReadMemory interfaces, which expect pointer and size arguments, being changed to make a zero-terminated copy of the input buffer. Interfaces based on xmlReadDoc, which actually expect a zero-terminated string and would make zero-copy operation work, were then simplified to rely on xmlReadMemory, resulting in an unnecessary copy. To avoid copying (possibly gigabytes) of memory temporarily, we now stream in-memory input just like content read from files in a chunk-by-chunk fashion (using a somewhat outdated INPUT_CHUNK size of 250 bytes). As a side effect, we also avoid another copy of the whole input when handling non-UTF-8 data which was made possible by some earlier commits. Interfaces expecting zero-terminated strings now make use of strnlen which unfortunately isn't part of the standard C library and only mandated since POSIX 2008.
5aff27ae	2023-08-08T15:21:25	parser: Optimize xmlLoadEntityContent Load entity content via xmlParserInputBufferGrow, avoiding a copy. This also fixes an entity size accounting error.
facc2a06	2023-08-08T15:21:21	parser: Don't overwrite EOF parser state
59fa0bb3	2023-08-08T15:21:14	parser: Simplify input pointer updates The base member always points to the beginning of the buffer.
c88ab7e3	2023-08-08T15:19:54	parser: Don't reinitialize parser input members The parser input struct should already be initialized.
ec7be506	2023-08-08T15:19:46	parser: Rework encoding detection Introduce XML_INPUT_HAS_ENCODING flag for xmlParserInput which is set when xmlSwitchEncoding is called. The parser can use the flag to reliably detect whether an encoding was already set via user override, BOM or other auto-detection. In this case, the encoding declaration won't be used to switch the encoding. Before, an inscrutable mix of ctxt->charset, ctxt->input->encoding and ctxt->input->buf->encoder was used. Introduce private helper functions to switch encodings used by both the XML and HTML parser: - xmlDetectEncoding which skips over the BOM, allowing to remove the BOM checks from other encoding functions. - xmlSetDeclaredEncoding, replacing htmlCheckEncodingDirect, which warns about encoding mismatches. If users override the encoding, store the declared instead of the actual encoding in xmlDoc. In this case, the actual encoding is known and the raw value from the doc is more useful. Also use the input flags to store the ISO-8859-1 fallback state. Restrict the fallback to cases where no encoding was specified. (The fallback is only useful in recovery mode and these days broken UTF-8 is probably more likely than ISO-8859-1, so it might eventually be removed completely.) The 'charset' member of xmlParserCtxt is now unused. The 'encoding' member of xmlParserInput is now unused. The 'standalone' member of xmlParserInput is renamed to 'flags'. A new parser state XML_PARSER_XML_DECL is added for the push parser.
d38e73f9	2023-08-08T15:19:44	parser: Always create UTF-8 in xmlParseReference It seems that this code path could only be triggered after an encoding error in recovery mode. Creating char-ref nodes is unnecessary and typically unexpected.
131d0dc0	2023-08-08T15:19:39	parser: Don't use 'standalone' member of xmlParserInput The standalone declaration is only parsed in the main input stream.
d9ec182b	2023-08-08T15:19:36	parser: Don't detect encoding in xmlCtxtResetPush The encoding will be detected in xmlParseTryOrFinish.
90bcbcfc	2023-07-20T21:08:01	parser: Fix potential use-after-free in xmlParseCharDataInternal Return immediately if a SAX handler stops the parser. Fixes #569.
e0f3016f	2023-05-18T17:31:44	parser: Fix regression when push parsing UTF-8 sequences Partial UTF-8 sequences are allowed when push parsing. Fixes #542.
235b15a5	2023-05-08T17:58:02	SAX: Always initialize SAX1 element handlers Follow-up to commit d0c3f01e. A parser context will be initialized to SAX version 2, but this can be overridden with XML_PARSE_SAX1 later, so we must initialize the SAX1 element handlers as well. Change the check in xmlDetectSAX2 to only look for XML_SAX2_MAGIC, so we don't switch to SAX1 if the SAX2 element handlers are NULL.
d0c3f01e	2023-05-06T17:47:37	parser: Fix old SAX1 parser with custom callbacks For some reason, xmlCtxtUseOptionsInternal set the start and end element SAX handlers to the internal DOM builder functions when XML_PARSE_SAX1 was specified. This means that custom SAX handlers could never work with that flag because these functions would receive the wrong user data argument and crash immediately. Fixes #535.
320f5084	2023-04-30T18:25:09	parser: Improve handling of encoding and IO errors Make sure that xmlCharEncInput, xmlParserInputBufferPush and xmlParserInputBufferGrow set the correct error code in the xmlParserInputBuffer. Handle errors when calling these functions.
fc69cf56	2023-04-30T17:51:29	parser: Move xmlFatalErr to parserInternals.c
3ffcc03b	2023-03-13T19:38:41	parser: Deprecate more internal functions
250faf3c	2023-04-20T12:35:21	parser: Fix regression in xmlParserNodeInfo accounting Commit 62150ed2 broke begin_pos and begin_line when extra node info was recorded. Fixes #523.
9282b084	2023-04-19T21:55:24	parser: Fix regression in memory pull parser with encoding Revert another change from commit 98840d40. Decode the whole buffer when reading from memory and switching to the initial encoding. Add some comments about potential improvements.
86105c04	2023-04-15T18:04:03	Fix use-after-free in xmlParseContentInternal() * parser.c: (xmlParseCharData): - Check if the parser has stopped before advancing `ctxt->input->cur`. This only occurs if a custom SAX error handler calls xmlStopParser() on fatal errors. Fixes #518.
b4d46cee	2023-04-12T15:10:01	parser: Remove first line handling in xmlParseChunk After reworking EBCDIC detection, this isn't necessary.
98840d40	2023-03-21T19:07:12	parser: Rework EBCDIC code page detection To detect EBCDIC code pages, we used to switch the encoding twice and had to be very careful not to decode data after the XML declaration before the second switch. This relied on a hard-coded expected size of the XML declaration and was complicated and unreliable. Now we convert the first 200 bytes to EBCDIC-US and parse the encoding declaration manually.
3eb9f5ca	2023-03-21T13:19:31	parser: Limit name length in xmlParseEncName
04d1bedd	2023-03-21T13:08:44	parser: Rework shrinking of input buffers Don't try to grow the input buffer in xmlParserShrink. This makes sure that no memory allocations are made and the function always succeeds. Remove unnecessary invocations of SHRINK. Invoke SHRINK at the end of DTD parsing loops. Shrink before growing.
067986fa	2023-03-18T14:44:28	parser: Fix regressions from previous commits - Fix memory leak in xmlParseNmtoken. - Fix buffer overread after htmlParseCharDataInternal.
3e85d7b7	2023-03-17T13:15:35	parser: Rely on CUR_CHAR/NEXT to grow the input buffer The input buffer is now grown reliably when calling CUR_CHAR (xmlCurrentChar) or NEXT (xmlNextChar). This allows to remove many other invocations of GROW.
c81d0d04	2023-03-17T12:39:35	malloc-fail: Add more error checks when parsing names xmlParseName and similar functions must return NULL if an error occurs. Found by OSS-Fuzz, see #344.
b167c731	2023-03-14T14:42:36	parser: Fix short-lived regression causing infinite loops Fix 3eb6bf03. We really have to halt the parser, so the input buffer gets reset.
2099441f	2023-03-13T17:51:13	parser: Stop calling xmlParserInputShrink Introduce xmlParserShrink which takes a parser context to simplify error handling.
cabde70f	2023-03-12T19:07:23	parser: Simplify calculation of available buffer space
b75976e0	2023-03-12T19:06:19	parser: Use size_t when subtracting input buffer pointers Avoid integer overflows.
9a6ca816	2023-03-12T19:03:11	parser: Check for integer overflow when updating checkIndex Unfortunately, checkIndex is a long, not a size_t. Check for integer overflow before updating the value.
bd63d730	2023-03-12T17:40:55	html: Impose some length limits Impose length limits on names, attribute values, PIs and comments, similar to the XML parser.
3eb6bf03	2023-03-12T16:47:15	parser: Stop calling xmlParserInputGrow Introduce xmlParserGrow which takes a parser context to simplify error handling.
207ebdfd	2023-03-12T14:43:01	malloc-fail: Fix out-of-bounds read in xmlGROW Short-lived regression from 56cc2211.
56cc2211	2023-03-09T22:27:58	parser: Merge xmlParserInputGrow into xmlGROW Simplifies the code and makes error handling easier.
14604a44	2023-03-09T22:10:44	malloc-fail: Fix out-of-bounds read in xmlCurrentChar Found by OSS-Fuzz.
3f69fc80	2023-03-08T13:58:49	parser: Tighten expansion limits - Lower the amount of expansion which is always allowed from 10MB to 1MB. - Lower the maximum amplification factor from 10 to 5. - Lower the "fixed cost" from 50 to 20.
5d55315e	2023-02-18T17:29:07	parser: Fix OOB read when formatting error message Don't try to print characters beyond the end of the buffer. Found by OSS-Fuzz.
f8852184	2023-02-14T13:03:13	malloc-fail: Fix memory leak in xmlParseEntityDecl Found with libFuzzer, see #344.
e6d22f92	2023-01-23T01:48:37	malloc-fail: Fix reallocation in inputPush Store xmlRealloc result in temporary variable to avoid null deref in error handler. Found with libFuzzer, see #344.
6fd89041	2023-01-22T19:42:41	malloc-fail: Fix use-after-free in xmlParseStartTag2 Fix error handling in xmlCtxtGrowAttrs. Found with libFuzzer, see #344.
d1b87856	2023-01-22T17:42:09	malloc-fail: Fix infinite loop in xmlParseTextDecl Memory errors can set `instate` to `XML_PARSER_EOF` which results in `NEXT` making no progress. Found with libFuzzer, see #344.
bd9de3a3	2023-01-22T16:52:39	malloc-fail: Fix null deref in xmlAddDefAttrs Found with libFuzzer, see #344.
33d4a0fe	2023-01-22T15:41:00	parser: Fix progress check in xmlParseExternalSubset Avoid infinite loop. Short-lived regression from f61b8a62. Found with libFuzzer.
74aa61e0	2023-01-22T13:09:03	parser: Halt parser on DTD errors If we try to continue parsing after an error in the internal or external subset, entity expansion accounting gets more complicated. Simply halt the parser. Found with libFuzzer.
d320a683	2023-01-17T13:50:51	parser: Fix entity check in attributes Don't set the "checked" flag when checking entities in default attribute values. These entities could reference other entities which weren't defined yet, so the check isn't reliable. This fixes a short-lived regression which could lead to a call stack overflow later in xmlStringGetNodeList.
59b33661	2022-12-27T14:15:51	error: Limit number of parser errors Reporting errors is expensive and some abusive test cases can generate an error for each invalid input byte. This causes the parser to spend most of the time with error handling. Limit the number of errors and warnings to 100.
66e9fd66	2022-12-25T21:26:17	parser: Fix infinite loop with push parser in recovery mode Short-lived regression from commit b1f9c193. Found by OSS-Fuzz.
49b54d7e	2022-12-25T15:06:51	parser: Fix null deref in xmlStringDecodeEntitiesInt Short-lived regression.
1865668b	2022-12-23T22:44:40	parser: Fix accounting of consumed input bytes Only add consumed bytes if - we're not parsing an entity - we're parsing external parameter entities for the first time. Always ignore internal parameter entities.
bc18f4a6	2022-12-23T21:55:38	parser: Lower entity nesting limit with XML_PARSE_HUGE The old limit of 1024 could lead to excessively deep call stacks. This could probably be set much lower without causing issues.
dd62e541	2022-12-23T21:53:30	parser: Don't increase depth twice when parsing internal entities Fix xmlParseBalancedChunkMemoryInternal.
a41b09c7	2022-12-23T21:29:28	parser: Improve detection of entity loops Set a flag to detect entity loops at once instead of processing until the depth limit is exceeded.
d972393f	2022-12-23T21:01:20	parser: Only report a single entity error Don't report errors multiple times for nested entity references.
077df27e	2022-12-22T15:22:01	parser: Fix integer overflow of input ID Applies a patch from Chromium. Also stop incrementing input ID of subcontexts. This isn't necessary. Fixes #465.
0bd4e4e0	2022-12-21T19:21:30	xmlParseStartTag2() contains typo when checking for default definitions for an attribute in a namespace * parser.c: (xmlParseStartTag2): - Fix index into defaults->values. It is only correct the first time through the loop when i == 0. Fixes #467.
b47ebf04	2022-12-21T00:02:47	parser: Deprecate xmlString*DecodeEntities These are internal functions.
ec6633af	2022-12-20T03:09:11	parser: Remove useless ent->etype test in xmlParseReference If ent->etype is invalid, ret can't equal XML_ERR_OK.
7ee7f036	2022-12-20T02:06:38	parser: Remove useless ent->children tests in xmlParseReference The if-block before always returns if ent->children == NULL.
ce76ebfd	2022-12-19T20:56:23	entities: Stop counting entities This was only used in the old version of xmlParserEntityCheck.
a3c8b180	2022-12-19T20:51:52	entities: Add entity flag for loop check
463bbeec	2022-12-19T18:39:45	entities: Rework entity amplification checks This commit implements robust detection of entity amplification attacks, better known as the "billion laughs" attack. We now limit the size of the document after substitution of entities to 10 times the size before expansion. This guarantees linear behavior by definition. There already was a similar check before, but the accounting of "sizeentities" (size of external entities) and "sizeentcopy" (size of all copies created by entity references) wasn't accurate. We also need saturation arithmetic since we're historically limited to "unsigned long" which is 32-bit on many platforms. A maximum of 10 MB of substitutions is always allowed. This should make use cases like DITA work which have caused problems in the past. The old checks based on the number of entities were removed. This is accounted for by adding a fixed cost to each entity reference. Entity amplification checks are now enabled even if XML_PARSE_HUGE is set. This option is mainly used to allow larger text nodes. Most users were unaware that it also disabled entity expansion checks. Some of the limits might be adjusted later. If this change turns out to affect legitimate use cases, we can add a separate parser option to disable the checks. Fixes #294. Fixes #345.
7e3f469b	2022-12-19T15:59:49	entities: Use flags to store '<' check results Instead of abusing the LSB of the "checked" member, store the result of testing for occurrence of '<' character in "flags". Also use the flags in xmlParseStringEntityRef instead of rescanning every time.
481d79d4	2022-12-19T15:26:46	entities: Add XML_ENT_PARSED flag To check whether an entity was already parsed, the code previously tested whether "checked" was non-zero or "children" was non-null. The "children" check could be unreliable because an empty entity also results in an empty (NULL) node list. Use a separate flag to make this check more reliable.
4b959ee1	2022-12-01T13:23:09	Remove hacky heuristic from b2dc5675e94aa6b5557ba63f7d66b0f08dd17e4d Checking whether the context is close to the parent context by hardcoding 250 is not portable (I noticed tests were failing on Morello since the value is 288 there due to pointers being 128 bits). Instead we should ensure that the XML_VCTXT_USE_PCTXT flag is not set in cases where the user data is not actually a parser context (or ideally add a separate field, but that would be an ABI break). From what I can see in the source, the XML_VCTXT_USE_PCTXT flag is only set if the userData field points to a valid context, and if this is not the case the flag should be cleared when changing userData rather than relying on the offset between the two. Looking at the history, I think d7cb33cf44aa688f24215c9cd398c1a26f0d25ff fixed most of the need for this workaround, but it looks like there are a few more locations that need updating. This commit changes two more places to set/clear/copy the XML_VCTXT_USE_PCTXT flag, so this heuristic should not be needed anymore. I've also dropped two = NULL assignments in xmllint since they are not needed after a call to memset(). There was also an uninitialized vctxt.flags (and other fields) in `xmlShellValidate()`, which I've fixed by adding a memset() call.
c62c0d82	2022-12-01T12:58:11	Correctly relocate internal pointers after realloc() Adding an offset to a deallocated pointer and assuming that it can be dereferenced is undefined behaviour. When running libxml2 on CHERI-enabled systems such as Arm Morello this results in the creation of an out-of-bounds pointer that cannot be dereferenced and therefore crashes at runtime. The effect of this UB is not just limited to architectures such as CHERI, incorrect relocation of pointers after realloc can in fact cause FORTIFY_SOURCE errors with recent GCC: https://developers.redhat.com/articles/2022/09/17/gccs-new-fortification-level
c16fd705	2022-11-25T14:52:37	xpath: Make init function private
53ab3840	2022-11-25T14:26:59	encoding: Make init function private
05c3a458	2022-11-25T14:15:43	tests: Check that xmlInitParser doesn't allocate memory
78c0391b	2022-11-25T13:55:39	parser: Register atexit handler in locked section
ed053c50	2022-11-25T12:27:14	dict: Make init/cleanup functions private
7010d877	2022-11-25T12:06:27	threads: Rework initialization Make init/cleanup functions private. Merge xmlOnceInit into xmlInitThreadsInternal.
9dbf1374	2022-11-24T20:52:57	parser: Make some module init/cleanup functions private
cecd364d	2022-11-24T16:38:47	parser: Don't call *DefaultSAXHandlerInit from xmlInitParser Change the default handler definitions to match the result after calling the initialization functions. This makes sure that no thread-local variables are accessed when calling xmlInitParser.
b1f9c193	2022-11-22T21:39:01	parser: Fix push parser with unterminated CDATA sections Short-lived regression found by OSS-Fuzz.
0e193f0d	2022-11-21T22:09:19	parser: Remove dangerous check in xmlParseCharData If this check succeeds, xmlParseCharData could be called over and over again without making progress, resulting in an infinite loop. It's only important to check for XML_PARSER_EOF which is done later. Related to #441.
94ca36c2	2022-11-21T22:07:11	parser: Restore parser state in xmlParseCDSect Fixes #441.
a8b31e68	2022-11-21T21:35:01	parser: Fix progress check when parsing character data Skip over zero bytes to guarantee progress. Short-lived regression.
c63900fb	2022-11-21T20:11:35	parser: Check terminate flag when push parsing CDATA sections Found by OSS-Fuzz.
a781ee33	2022-11-21T20:10:42	Revert "parser: Add overflow checks to xmlParseLookup functions" This reverts commit bfc55d688427972d093be010a8c2ef265375fcb2. It's better to fix the root cause.
bfc55d68	2022-11-21T18:29:54	parser: Add overflow checks to xmlParseLookup functions Short-lived regression found by OSS-Fuzz.
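
Commit 463bbeec in the table above describes the entity amplification check in prose: expanded output is limited to a multiple of the consumed input, a fixed amount of expansion is always allowed, each entity reference carries a fixed cost, and the counters use saturation arithmetic because they are "unsigned long". The following is a minimal sketch of that idea, not the library's actual code. All names (sketch_sat_add, sketch_charge_entity, sketch_expansion_ok) and the exact byte value chosen for "10 MB" are assumptions made for illustration; the factor 10 and the fixed cost 50 are the initial values mentioned in the messages for 463bbeec and 3f69fc80.

```c
#include <assert.h>
#include <limits.h>

/* Illustrative constants only; "10 MB" is assumed to mean 10,000,000 bytes. */
#define SKETCH_ALLOWED_EXPANSION 10000000UL
#define SKETCH_FIXED_COST        50UL

/* Add two counters, clamping at ULONG_MAX instead of wrapping around. */
static unsigned long
sketch_sat_add(unsigned long a, unsigned long b) {
    return (b > ULONG_MAX - a) ? ULONG_MAX : a + b;
}

/* Charge one entity reference: its length plus a fixed per-reference cost. */
static unsigned long
sketch_charge_entity(unsigned long expanded, unsigned long entity_len) {
    return sketch_sat_add(expanded,
                          sketch_sat_add(entity_len, SKETCH_FIXED_COST));
}

/*
 * Return 1 if the expansion is acceptable, 0 if it looks like an
 * amplification attack. Dividing 'expanded' by the factor avoids the
 * overflow that 'consumed * max_ampl' could cause; the integer division
 * makes the check slightly lenient by up to max_ampl - 1 bytes.
 */
static int
sketch_expansion_ok(unsigned long consumed, unsigned long expanded,
                    unsigned long max_ampl) {
    assert(max_ampl != 0);
    if (expanded <= SKETCH_ALLOWED_EXPANSION)
        return 1;               /* small expansions are always allowed */
    if (expanded == ULONG_MAX)
        return 0;               /* counter saturated: treat as excessive */
    return expanded / max_ampl <= consumed;
}
```

Treating a saturated counter as excessive is a design choice of this sketch: once the counter has been clamped, the true size is unknown, so failing closed seems the safer reading of the commit message.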
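
Commit 57cfd221 in the table above replaces rand_r with xoroshiro64** for hash randomization. For readers unfamiliar with that generator, here is a sketch of one step, following the public-domain reference algorithm by Blackman and Vigna. The function names are hypothetical, and how the library seeds or stores the state is not described in the commit message, so nothing here should be read as the library's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value left by k bits (0 < k < 32). */
static uint32_t
sketch_rotl32(uint32_t x, int k) {
    return (x << k) | (x >> (32 - k));
}

/*
 * One step of xoroshiro64**: returns a 32-bit pseudo-random value and
 * advances the two-word state in place. The state must not be all zero,
 * otherwise the generator only ever produces zero.
 */
static uint32_t
sketch_xoroshiro64ss_next(uint32_t s[2]) {
    uint32_t s0 = s[0];
    uint32_t s1 = s[1];
    uint32_t result;

    assert(s0 != 0 || s1 != 0);

    /* The "**" scrambler: multiply, rotate, multiply. */
    result = sketch_rotl32(s0 * 0x9E3779BBu, 5) * 5u;

    /* The xoroshiro state transition. */
    s1 ^= s0;
    s[0] = sketch_rotl32(s0, 26) ^ s1 ^ (s1 << 9);
    s[1] = sketch_rotl32(s1, 13);

    return result;
}
```

Because the generator is self-contained and needs only 64 bits of state, it does not depend on platform-specific functions, which matches the commit's remark about enabling hash randomization on all platforms.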
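
Commit e6d22f92 in the table above fixes inputPush by storing the xmlRealloc result in a temporary variable. The general C pattern behind that fix can be sketched as follows, using plain realloc and a hypothetical helper named sketch_grow_array; this is not the code from the commit.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Double the capacity of an int array. On failure, return -1 and leave
 * *arr and *cap untouched, so the caller (for example an error handler)
 * can still use or free the old buffer. On success, return 0.
 */
static int
sketch_grow_array(int **arr, size_t *cap) {
    size_t new_cap;
    int *tmp;

    assert(arr != NULL && cap != NULL);

    new_cap = (*cap > 0) ? *cap * 2 : 4;
    if (new_cap < *cap || new_cap > SIZE_MAX / sizeof(int))
        return -1;              /* size computation would overflow */

    /* Keep the result in a temporary: realloc returns NULL on failure
     * but does not free the original block. Assigning directly to *arr
     * would lose the old pointer and leave NULL behind. */
    tmp = realloc(*arr, new_cap * sizeof(int));
    if (tmp == NULL)
        return -1;

    *arr = tmp;
    *cap = new_cap;
    return 0;
}
```

The same reasoning applies to the pointer relocation issue in c62c0d82: any pointer into the old block should be recomputed from an offset saved before the call, rather than adjusted after the old block has been released.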

e0dd330b

2023-09-29T00:18:44

parser: Use hash tables to avoid quadratic behavior Use a hash table to lookup namespaces by prefix. The hash table stores an index into the namespace table. Auxiliary data for namespaces is stored in a separate array along the main namespace table. Use a hash table to verify attribute uniqueness. The hash table stores an index into the attribute table. Reuse hash value from the dictionary to avoid computing them twice. See #346.

a873191c

2023-09-25T14:51:35

parser: Introduce xmlParseQNameHashed

8c084ebd

2023-09-21T22:57:33

doc: Make apibuild.py happy

11a1839d

2023-09-20T17:54:48

globals: Move remaining globals back to correct header files This undoes a lot of damage.

a77f9ab8

2023-09-20T16:57:22

globals: Don't include SAX2.h from globals.h

2e6c49a7

2023-09-20T14:43:14

globals: Don't store xmlParserVersion in global state This is a constant.

a07ec7c1

2023-09-18T17:39:13

threads: Move library initialization code to threads.c This allows to consolidate the initialization code since the global init lock was already implemented in threads.c.

4e1c13eb

2023-09-18T14:45:10

debug: Remove debugging code This is barely useful these days and only clutters the code base.

c19771c1

2023-09-18T00:54:39

globals: Move code from threads.c to globals.c Move all code that handles globals to the place where it belongs.

d7cfe356

2023-09-14T20:52:24

parser: Avoid undefined behavior in xmlParseStartTag2 Instead of using arithmetic on dangling pointers, store ptrdiff_t values in void pointers which is at least implementation-defined.

57cfd221

2023-09-01T14:52:04

dict: Use xoroshiro64** as PRNG Stop using rand_r. This enables hash randomization on all platforms.

53050b1d

2023-08-29T20:06:43

parser: More fixes to push parser error handling

bbd918b2

2023-08-29T15:56:37

parser: Fix detection of null bytes Also suppress misleading extra errors. Fixes #122.

c6083a32

2023-08-29T16:30:22

parser: Improve error handling in push parser - Report errors earlier - Align error messages with pull parser

1edae30f

2023-08-29T15:58:22

parser: Don't check inputNr in xmlParseTryOrFinish There's no apparent reason for this check. inputNr should always be 1 here.

e48f2695

2023-08-29T17:41:18

parser: Remove push parser debugging code

ed3bd052

2023-08-20T20:48:10

parser: Allow to set maximum amplification factor

855818bd

2023-08-08T15:21:37

parser: Check for truncated multi-byte sequences When decoding input data, check whether the "raw" buffer is empty after parsing the document. Otherwise, the input ends with a truncated multi-byte sequence which shouldn't be silently ignored.

95e81a36

2023-08-08T15:21:31

parser: Decode all data in xmlCharEncInput Even with flush set to true, xmlCharEncInput didn't guarantee to decode all data. This complicated the push parser. Remove the flush flag and always decode all available data. Also fix ICU code where the flush flag has a different meaning. Always set flush to false and retry even with empty input buffers.

834b8123

2023-08-08T15:21:28

parser: Stream data when reading from memory Don't create a copy of the whole input buffer. Read the data chunk by chunk to save memory. Historically, it was probably envisioned to read data from memory without additional copying. This doesn't work reliably with the current design of the XML parser which requires a terminating null byte at the end of input buffers. This lead to xmlReadMemory interfaces, which expect pointer and size arguments, being changed to make a zero-terminated copy of the input buffer. Interfaces based on xmlReadDoc, which actually expect a zero-terminated string and would make zero-copy operation work, were then simplified to rely on xmlReadMemoryi, resulting in an unnecessary copy. To avoid copying (possibly gigabytes) of memory temporarily, we now stream in-memory input just like content read from files in a chunk-by-chunk fashion (using a somewhat outdated INPUT_CHUNK size of 250 bytes). As a side effect, we also avoid another copy of the whole input when handling non-UTF-8 data which was made possible by some earlier commits. Interfaces expecting zero-terminated strings now make use of strnlen which unfortunately isn't part of the standard C library and only mandated since POSIX 2008.

5aff27ae

2023-08-08T15:21:25

parser: Optimize xmlLoadEntityContent Load entity content via xmlParserInputBufferGrow, avoiding a copy. This also fixes an entity size accounting error.

facc2a06

2023-08-08T15:21:21

parser: Don't overwrite EOF parser state

59fa0bb3

2023-08-08T15:21:14

parser: Simplify input pointer updates The base member always points to the beginning of the buffer.

c88ab7e3

2023-08-08T15:19:54

parser: Don't reinitialize parser input members The parser input struct should already be initialized.

ec7be506

2023-08-08T15:19:46

parser: Rework encoding detection Introduce XML_INPUT_HAS_ENCODING flag for xmlParserInput which is set when xmlSwitchEncoding is called. The parser can use the flag to reliably detect whether an encoding was already set via user override, BOM or other auto-detection. In this case, the encoding declaration won't be used to switch the encoding. Before, an inscrutable mix of ctxt->charset, ctxt->input->encoding and ctxt->input->buf->encoder was used. Introduce private helper functions to switch encodings used by both the XML and HTML parser: - xmlDetectEncoding which skips over the BOM, allowing to remove the BOM checks from other encoding functions. - xmlSetDeclaredEncoding, replacing htmlCheckEncodingDirect, which warns about encoding mismatches. If users override the encoding, store the declared instead of the actual encoding in xmlDoc. In this case, the actual encoding is known and the raw value from the doc is more useful. Also use the input flags to store the ISO-8859-1 fallback state. Restrict the fallback to cases where no encoding was specified. (The fallback is only useful in recovery mode and these days broken UTF-8 is probably more likely than ISO-8859-1, so it might eventually be removed completely.) The 'charset' member of xmlParserCtxt is now unused. The 'encoding' member of xmlParserInput is now unused. The 'standalone' member of xmlParserInput is renamed to 'flags'. A new parser state XML_PARSER_XML_DECL is added for the push parser.

d38e73f9

2023-08-08T15:19:44

parser: Always create UTF-8 in xmlParseReference It seems that this code path could only be triggered after an encoding error in recovery mode. Creating char-ref nodes is unnecessary and typically unexpected.

131d0dc0

2023-08-08T15:19:39

parser: Don't use 'standalone' member of xmlParserInput The standalone declaration is only parsed in the main input stream.

d9ec182b

2023-08-08T15:19:36

parser: Don't detect encoding in xmlCtxtResetPush The encoding will be detected in xmlParseTryOrFinish.

90bcbcfc

2023-07-20T21:08:01

parser: Fix potential use-after-free in xmlParseCharDataInternal Return immediately if a SAX handler stops the parser. Fixes #569.

e0f3016f

2023-05-18T17:31:44

parser: Fix regression when push parsing UTF-8 sequences Partial UTF-8 sequences are allowed when push parsing. Fixes #542.

235b15a5

2023-05-08T17:58:02

SAX: Always initialize SAX1 element handlers Follow-up to commit d0c3f01e. A parser context will be initialized to SAX version 2, but this can be overridden with XML_PARSE_SAX1 later, so we must initialize the SAX1 element handlers as well. Change the check in xmlDetectSAX2 to only look for XML_SAX2_MAGIC, so we don't switch to SAX1 if the SAX2 element handlers are NULL.

d0c3f01e

2023-05-06T17:47:37

parser: Fix old SAX1 parser with custom callbacks For some reason, xmlCtxtUseOptionsInternal set the start and end element SAX handlers to the internal DOM builder functions when XML_PARSE_SAX1 was specified. This means that custom SAX handlers could never work with that flag because these functions would receive the wrong user data argument and crash immediately. Fixes #535.

320f5084

2023-04-30T18:25:09

parser: Improve handling of encoding and IO errors Make sure that xmlCharEncInput, xmlParserInputBufferPush and xmlParserInputBufferGrow set the correct error code in the xmlParserInputBuffer. Handle errors when calling these functions.

fc69cf56

2023-04-30T17:51:29

parser: Move xmlFatalErr to parserInternals.c

3ffcc03b

2023-03-13T19:38:41

parser: Deprecate more internal functions

250faf3c

2023-04-20T12:35:21

parser: Fix regression in xmlParserNodeInfo accounting Commit 62150ed2 broke begin_pos and begin_line when extra node info was recorded. Fixes #523.

9282b084

2023-04-19T21:55:24

parser: Fix regression in memory pull parser with encoding Revert another change from commit 98840d40. Decode the whole buffer when reading from memory and switching to the initial encoding. Add some comments about potential improvements.

86105c04

2023-04-15T18:04:03

Fix use-after-free in xmlParseContentInternal() * parser.c: (xmlParseCharData): - Check if the parser has stopped before advancing `ctxt->input->cur`. This only occurs if a custom SAX error handler calls xmlStopParser() on fatal errors. Fixes #518.

b4d46cee

2023-04-12T15:10:01

parser: Remove first line handling in xmlParseChunk After reworking EBCDIC detection, this isn't necessary.

98840d40

2023-03-21T19:07:12

parser: Rework EBCDIC code page detection To detect EBCDIC code pages, we used to switch the encoding twice and had to be very careful not to decode data after the XML declaration before the second switch. This relied on a hard-coded expected size of the XML declaration and was complicated and unreliable. Now we convert the first 200 bytes to EBCDIC-US and parse the encoding declaration manually.

3eb9f5ca

2023-03-21T13:19:31

parser: Limit name length in xmlParseEncName

04d1bedd

2023-03-21T13:08:44

parser: Rework shrinking of input buffers Don't try to grow the input buffer in xmlParserShrink. This makes sure that no memory allocations are made and the function always succeeds. Remove unnecessary invocations of SHRINK. Invoke SHRINK at the end of DTD parsing loops. Shrink before growing.

067986fa

2023-03-18T14:44:28

parser: Fix regressions from previous commits - Fix memory leak in xmlParseNmtoken. - Fix buffer overread after htmlParseCharDataInternal.

3e85d7b7

2023-03-17T13:15:35

parser: Rely on CUR_CHAR/NEXT to grow the input buffer The input buffer is now grown reliably when calling CUR_CHAR (xmlCurrentChar) or NEXT (xmlNextChar). This makes it possible to remove many other invocations of GROW.

c81d0d04

2023-03-17T12:39:35

malloc-fail: Add more error checks when parsing names xmlParseName and similar functions must return NULL if an error occurs. Found by OSS-Fuzz, see #344.

b167c731

2023-03-14T14:42:36

parser: Fix short-lived regression causing infinite loops Fix 3eb6bf03. We really have to halt the parser, so the input buffer gets reset.

2099441f

2023-03-13T17:51:13

parser: Stop calling xmlParserInputShrink Introduce xmlParserShrink which takes a parser context to simplify error handling.

cabde70f

2023-03-12T19:07:23

parser: Simplify calculation of available buffer space

b75976e0

2023-03-12T19:06:19

parser: Use size_t when subtracting input buffer pointers Avoid integer overflows.

9a6ca816

2023-03-12T19:03:11

parser: Check for integer overflow when updating checkIndex Unfortunately, checkIndex is a long, not a size_t. Check for integer overflow before updating the value.
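
The check described in this message can be illustrated with a minimal sketch. The helper name `set_check_index` and the choice to reset the index to 0 on overflow are assumptions made for this example; the commit message only states that the value is checked before the `long` is updated.

```c
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical helper, not a libxml2 function.
 * Stores a size_t offset into a long index only if it fits.
 * Returns 0 on success. On overflow it resets the index to 0
 * (an assumed fallback) and returns -1.
 */
static int set_check_index(long *check_index, size_t offset)
{
    if (offset > (size_t)LONG_MAX) {
        *check_index = 0;
        return -1;
    }
    *check_index = (long)offset;
    return 0;
}
```

Comparing in the unsigned domain before the cast avoids converting an out-of-range value to `long`, which would otherwise yield an implementation-defined result.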

bd63d730

2023-03-12T17:40:55

html: Impose some length limits Impose length limits on names, attribute values, PIs and comments, similar to the XML parser.

3eb6bf03

2023-03-12T16:47:15

parser: Stop calling xmlParserInputGrow Introduce xmlParserGrow which takes a parser context to simplify error handling.

207ebdfd

2023-03-12T14:43:01

malloc-fail: Fix out-of-bounds read in xmlGROW Short-lived regression from 56cc2211.

56cc2211

2023-03-09T22:27:58

parser: Merge xmlParserInputGrow into xmlGROW Simplifies the code and makes error handling easier.

14604a44

2023-03-09T22:10:44

malloc-fail: Fix out-of-bounds read in xmlCurrentChar Found by OSS-Fuzz.

3f69fc80

2023-03-08T13:58:49

parser: Tighten expansion limits - Lower the amount of expansion which is always allowed from 10MB to 1MB. - Lower the maximum amplification factor from 10 to 5. - Lower the "fixed cost" from 50 to 20.

5d55315e

2023-02-18T17:29:07

parser: Fix OOB read when formatting error message Don't try to print characters beyond the end of the buffer. Found by OSS-Fuzz.

f8852184

2023-02-14T13:03:13

malloc-fail: Fix memory leak in xmlParseEntityDecl Found with libFuzzer, see #344.

e6d22f92

2023-01-23T01:48:37

malloc-fail: Fix reallocation in inputPush Store xmlRealloc result in temporary variable to avoid null deref in error handler. Found with libFuzzer, see #344.
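
The pattern named in this message, keeping the `realloc` result in a temporary until it is known to be non-NULL, might look like the sketch below. The `int_stack` type and `stack_push` function are hypothetical and only stand in for the parser's input stack; they are not taken from libxml2.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable stack used only for illustration. */
typedef struct {
    int *items;
    size_t len;
    size_t cap;
} int_stack;

/* Pushes a value, growing the array if needed. Returns 0 or -1. */
static int stack_push(int_stack *s, int value)
{
    if (s->len == s->cap) {
        size_t new_cap = s->cap ? s->cap * 2 : 4;
        int *tmp;

        /* Reject capacities whose byte size would overflow size_t. */
        if (new_cap < s->cap || new_cap > SIZE_MAX / sizeof(*tmp))
            return -1;

        /* Keep the result in a temporary: on failure s->items is
         * still valid and can be used or freed by error handling. */
        tmp = realloc(s->items, new_cap * sizeof(*tmp));
        if (tmp == NULL)
            return -1;

        s->items = tmp;
        s->cap = new_cap;
    }
    s->items[s->len++] = value;
    return 0;
}
```

Assigning the result directly to `s->items` would overwrite the only pointer to the old block with NULL on failure, which both leaks the block and sets up a null dereference in any code that runs afterwards.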

6fd89041

2023-01-22T19:42:41

malloc-fail: Fix use-after-free in xmlParseStartTag2 Fix error handling in xmlCtxtGrowAttrs. Found with libFuzzer, see #344.

d1b87856

2023-01-22T17:42:09

malloc-fail: Fix infinite loop in xmlParseTextDecl Memory errors can set `instate` to `XML_PARSER_EOF` which results in `NEXT` making no progress. Found with libFuzzer, see #344.

bd9de3a3

2023-01-22T16:52:39

malloc-fail: Fix null deref in xmlAddDefAttrs Found with libFuzzer, see #344.

33d4a0fe

2023-01-22T15:41:00

parser: Fix progress check in xmlParseExternalSubset Avoid infinite loop. Short-lived regression from f61b8a62. Found with libFuzzer.

74aa61e0

2023-01-22T13:09:03

parser: Halt parser on DTD errors If we try to continue parsing after an error in the internal or external subset, entity expansion accounting gets more complicated. Simply halt the parser. Found with libFuzzer.

d320a683

2023-01-17T13:50:51

parser: Fix entity check in attributes Don't set the "checked" flag when checking entities in default attribute values. These entities could reference other entities which weren't defined yet, so the check isn't reliable. This fixes a short-lived regression which could lead to a call stack overflow later in xmlStringGetNodeList.

59b33661

2022-12-27T14:15:51

error: Limit number of parser errors Reporting errors is expensive and some abusive test cases can generate an error for each invalid input byte. This causes the parser to spend most of the time with error handling. Limit the number of errors and warnings to 100.

66e9fd66

2022-12-25T21:26:17

parser: Fix infinite loop with push parser in recovery mode Short-lived regression from commit b1f9c193. Found by OSS-Fuzz.

49b54d7e

2022-12-25T15:06:51

parser: Fix null deref in xmlStringDecodeEntitiesInt Short-lived regression.

1865668b

2022-12-23T22:44:40

parser: Fix accounting of consumed input bytes Only add consumed bytes if - we're not parsing an entity - we're parsing external parameter entities for the first time. Always ignore internal parameter entities.

bc18f4a6

2022-12-23T21:55:38

parser: Lower entity nesting limit with XML_PARSE_HUGE The old limit of 1024 could lead to excessively deep call stacks. This could probably be set much lower without causing issues.

dd62e541

2022-12-23T21:53:30

parser: Don't increase depth twice when parsing internal entities Fix xmlParseBalancedChunkMemoryInternal.

a41b09c7

2022-12-23T21:29:28

parser: Improve detection of entity loops Set a flag to detect entity loops at once instead of processing until the depth limit is exceeded.

d972393f

2022-12-23T21:01:20

parser: Only report a single entity error Don't report errors multiple times for nested entity references.

077df27e

2022-12-22T15:22:01

parser: Fix integer overflow of input ID Applies a patch from Chromium. Also stop incrementing input ID of subcontexts. This isn't necessary. Fixes #465.

0bd4e4e0

2022-12-21T19:21:30

xmlParseStartTag2() contains typo when checking for default definitions for an attribute in a namespace * parser.c: (xmlParseStartTag2): - Fix index into defaults->values. It is only correct the first time through the loop when i == 0. Fixes #467.

b47ebf04

2022-12-21T00:02:47

parser: Deprecate xmlString*DecodeEntities These are internal functions.

ec6633af

2022-12-20T03:09:11

parser: Remove useless ent->etype test in xmlParseReference If ent->etype is invalid, ret can't equal XML_ERR_OK.

7ee7f036

2022-12-20T02:06:38

parser: Remove useless ent->children tests in xmlParseReference The if-block before always returns if ent->children == NULL.

ce76ebfd

2022-12-19T20:56:23

entities: Stop counting entities This was only used in the old version of xmlParserEntityCheck.

a3c8b180

2022-12-19T20:51:52

entities: Add entity flag for loop check

463bbeec

2022-12-19T18:39:45

entities: Rework entity amplification checks This commit implements robust detection of entity amplification attacks, better known as the "billion laughs" attack. We now limit the size of the document after substitution of entities to 10 times the size before expansion. This guarantees linear behavior by definition. There already was a similar check before, but the accounting of "sizeentities" (size of external entities) and "sizeentcopy" (size of all copies created by entity references) wasn't accurate. We also need saturation arithmetic since we're historically limited to "unsigned long" which is 32-bit on many platforms. A maximum of 10 MB of substitutions is always allowed. This should make use cases like DITA work which have caused problems in the past. The old checks based on the number of entities were removed. This is accounted for by adding a fixed cost to each entity reference. Entity amplification checks are now enabled even if XML_PARSE_HUGE is set. This option is mainly used to allow larger text nodes. Most users were unaware that it also disabled entity expansion checks. Some of the limits might be adjusted later. If this change turns out to affect legitimate use cases, we can add a separate parser option to disable the checks. Fixes #294. Fixes #345.
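
A rough sketch of the accounting this message describes is shown below. The constants reflect the figures in this commit message and a later one (factor 10, 10 MB always allowed, fixed cost of 50 per reference); the macro names, function names, and the exact comparison are assumptions for illustration and are not copied from libxml2.

```c
#include <limits.h>

/* Assumed names; values as stated in the commit messages. */
#define ALLOWED_EXPANSION 10000000UL /* always-allowed substitutions */
#define MAX_AMPLIFICATION 10UL       /* expanded size vs. input size */
#define ENTITY_FIXED_COST 50UL       /* charged per entity reference */

/* Saturating addition: clamps at ULONG_MAX instead of wrapping. */
static unsigned long sat_add(unsigned long a, unsigned long b)
{
    return (a > ULONG_MAX - b) ? ULONG_MAX : a + b;
}

/* Charges one entity reference of the given size to the running total. */
static unsigned long charge_entity(unsigned long expanded,
                                   unsigned long entity_size)
{
    return sat_add(expanded, sat_add(entity_size, ENTITY_FIXED_COST));
}

/* Returns 1 if the expansion is excessive relative to consumed input. */
static int amplification_exceeded(unsigned long consumed,
                                  unsigned long expanded)
{
    if (expanded <= ALLOWED_EXPANSION)
        return 0;
    /* A saturated counter is always treated as excessive. */
    if (expanded == ULONG_MAX)
        return 1;
    /* Divide instead of multiplying so the check cannot overflow. */
    return expanded / MAX_AMPLIFICATION > consumed;
}
```

Because every reference adds at least the fixed cost, a document consisting of many references to tiny or empty entities still accumulates expansion cost, which is how a size-based limit can replace the older count-based checks.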

7e3f469b

2022-12-19T15:59:49

entities: Use flags to store '<' check results Instead of abusing the LSB of the "checked" member, store the result of testing for occurrence of '<' character in "flags". Also use the flags in xmlParseStringEntityRef instead of rescanning every time.

481d79d4

2022-12-19T15:26:46

entities: Add XML_ENT_PARSED flag To check whether an entity was already parsed, the code previously tested whether "checked" was non-zero or "children" was non-null. The "children" check could be unreliable because an empty entity also results in an empty (NULL) node list. Use a separate flag to make this check more reliable.

4b959ee1

2022-12-01T13:23:09

Remove hacky heuristic from b2dc5675e94aa6b5557ba63f7d66b0f08dd17e4d Checking whether the context is close to the parent context by hardcoding 250 is not portable (I noticed tests were failing on Morello since the value is 288 there due to pointers being 128 bits). Instead we should ensure that the XML_VCTXT_USE_PCTXT flag is not set in cases where the user data is not actually a parser context (or ideally add a separate field, but that would be an ABI break). From what I can see in the source, the XML_VCTXT_USE_PCTXT flag is only set if the userData field points to a valid context, and if this is not the case the flag should be cleared when changing userData rather than relying on the offset between the two. Looking at the history, I think d7cb33cf44aa688f24215c9cd398c1a26f0d25ff fixed most of the need for this workaround, but it looks like there are a few more locations that need updating. This commit changes two more places to set/clear/copy the XML_VCTXT_USE_PCTXT flag, so this heuristic should not be needed anymore. I've also dropped two = NULL assignments in xmllint since they are not needed after a call to memset(). There was also an uninitialized vctxt.flags (and other fields) in `xmlShellValidate()`, which I've fixed by adding a memset() call.

c62c0d82

2022-12-01T12:58:11

Correctly relocate internal pointers after realloc() Adding an offset to a deallocated pointer and assuming that it can be dereferenced is undefined behaviour. When running libxml2 on CHERI-enabled systems such as Arm Morello this results in the creation of an out-of-bounds pointer that cannot be dereferenced and therefore crashes at runtime. The effect of this UB is not just limited to architectures such as CHERI, incorrect relocation of pointers after realloc can in fact cause FORTIFY_SOURCE errors with recent GCC: https://developers.redhat.com/articles/2022/09/17/gccs-new-fortification-level
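
One way to relocate an internal pointer without touching the freed block is to record its offset before calling `realloc` and rebuild it from the new base afterwards. The `text_buf` type and `text_buf_resize` function below are hypothetical and serve only to illustrate the idea; they do not reproduce the actual change.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical buffer with an internal cursor pointing into it. */
typedef struct {
    char *base;  /* start of the allocation */
    char *cur;   /* cursor inside [base, base + size] */
    size_t size; /* allocated size in bytes */
} text_buf;

/* Resizes the buffer and relocates the cursor. Returns 0 or -1. */
static int text_buf_resize(text_buf *b, size_t new_size)
{
    /* Compute the offset while the old block is still valid. */
    size_t cur_off = b->base ? (size_t)(b->cur - b->base) : 0;
    char *tmp;

    /* The cursor must still fit inside the resized block. */
    if (new_size == 0 || new_size < cur_off)
        return -1;

    tmp = realloc(b->base, new_size);
    if (tmp == NULL)
        return -1;

    /* Derive the new cursor from the new base, never from the old one. */
    b->base = tmp;
    b->cur = tmp + cur_off;
    b->size = new_size;
    return 0;
}
```

The contrasting pattern, `cur += new_base - old_base`, performs arithmetic on a pointer to a deallocated block, which is the undefined behavior the commit message refers to.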

c16fd705

2022-11-25T14:52:37

xpath: Make init function private

53ab3840

2022-11-25T14:26:59

encoding: Make init function private

05c3a458

2022-11-25T14:15:43

tests: Check that xmlInitParser doesn't allocate memory

78c0391b

2022-11-25T13:55:39

parser: Register atexit handler in locked section

ed053c50

2022-11-25T12:27:14

dict: Make init/cleanup functions private

7010d877

2022-11-25T12:06:27

threads: Rework initialization Make init/cleanup functions private. Merge xmlOnceInit into xmlInitThreadsInternal.

9dbf1374

2022-11-24T20:52:57

parser: Make some module init/cleanup functions private

cecd364d

2022-11-24T16:38:47

parser: Don't call *DefaultSAXHandlerInit from xmlInitParser Change the default handler definitions to match the result after calling the initialization functions. This makes sure that no thread-local variables are accessed when calling xmlInitParser.

b1f9c193

2022-11-22T21:39:01

parser: Fix push parser with unterminated CDATA sections Short-lived regression found by OSS-Fuzz.

0e193f0d

2022-11-21T22:09:19

parser: Remove dangerous check in xmlParseCharData If this check succeeds, xmlParseCharData could be called over and over again without making progress, resulting in an infinite loop. It's only important to check for XML_PARSER_EOF which is done later. Related to #441.

94ca36c2

2022-11-21T22:07:11

parser: Restore parser state in xmlParseCDSect Fixes #441.

a8b31e68

2022-11-21T21:35:01

parser: Fix progress check when parsing character data Skip over zero bytes to guarantee progress. Short-lived regression.

c63900fb

2022-11-21T20:11:35

parser: Check terminate flag when push parsing CDATA sections Found by OSS-Fuzz.

a781ee33

2022-11-21T20:10:42

Revert "parser: Add overflow checks to xmlParseLookup functions" This reverts commit bfc55d688427972d093be010a8c2ef265375fcb2. It's better to fix the root cause.

bfc55d68

2022-11-21T18:29:54

parser: Add overflow checks to xmlParseLookup functions Short-lived regression found by OSS-Fuzz.
