parser.c


Log

Author Commit Date CI Message
Nick Wellnhofer 54c70ed5 2023-12-18T19:31:29 parser: Improve error handling Introduce xmlCtxtSetErrorHandler allowing to set a structured error for a parser context. There already was the "serror" SAX handler but this always receives the parser context as argument. Start to use xmlRaiseMemoryError. Remove useless arguments from memory error functions. Rename xmlErrMemory to xmlCtxtErrMemory. Remove a few calls to xmlGenericError. Remove support for runtime entity debugging.
Nick Wellnhofer 1c106edf 2023-12-13T23:56:19 parser: Allow recovery in xmlParseInNodeContext Should fix #645.
Nick Wellnhofer 862e9ce0 2023-12-13T14:53:44 malloc-fail: Fix use-of-uninitialized-value in xmlParseConditionalSections Short-lived regression.
Nick Wellnhofer c2bbeed1 2023-12-12T23:51:32 io: Fix memory lifetime issue with input buffers xmlParserInputBufferCreateMem must make a copy of the buffer. This fixes a regression from 2.11 which could cause reads from freed memory depending on the use case. Undeprecate xmlParserInputBufferCreateStatic which can avoid copying the whole buffer.
Nick Wellnhofer f19a9510 2023-12-10T17:50:22 parser: Report malloc failures Fix many places where malloc failures aren't reported. Make xmlErrMemory public. This is useful for custom external entity loaders. Introduce new API function xmlSwitchEncodingName. Change the way how we store whether the the parser is stopped. This used to be signaled by setting ctxt->instate to XML_PARSER_EOF which was misdesigned and error-prone. Set ctxt->disableSAX to 2 instead and introduce a macro PARSER_STOPPED. Also stop to remove parser inputs in xmlHaltParser. This allows to remove many checks of ctxt->instate. Introduce xmlErrParser to handle errors if a parser context is available.
Nick Wellnhofer 7d446e97 2023-12-08T12:13:49 parser: Fix namespaces redefined from default attributes This regressed in commit e0dd330b. Also fixes a long-standing issue where namespaces from default attributes weren't added if they match an existing namespace. Fixes #643.
Nick Wellnhofer c011e760 2023-12-06T01:09:31 globals: Remove unused globals from thread storage Setting these deprecated globals hasn't had an effect for a long time. Make them constants. This reduces the size of per-thread storage from ~700 to ~250 bytes.
Nick Wellnhofer 7f00273c 2023-12-01T19:21:17 parser: Fix invalid free in xmlParseBalancedChunkMemoryRecover Set the dictionary for newDoc in xmlParseBalancedChunkMemoryRecover. This is a long-standing bug which was masked by - xmlParseBalancedChunkMemoryRecover changing the document of the root node. This is a really bad idea, resulting in a mismatch between ctxt->myDoc and ctxt->node->doc. - SAX2.c preferring ctxt->node->doc over ctxt->myDoc until commit a31e1b06. Fixes #641.
Nick Wellnhofer c7629c9e 2023-11-30T16:52:34 parser: Clarify documentation regarding xmlReadMemory buffer size Fixes #638.
Nick Wellnhofer 43b511fa 2023-11-26T14:31:39 parser: Make CRLF increment line number Partial revert of cb927e85 fixing CRLFs not incrementing the line number. This requires to rework xmlParseQNameHashed. The original implementation prompted the change to xmlCurrentChar which really shouldn't modify the 'cur' pointer as side effect. But the NEXTL macro relies on this behavior. Ultimately, we should reintroduce the change to xmlCurrentChar and fix the NEXTL macro. This will lead to single CRs incrementing the line number as well which seems more consistent. Fixes #628.
Nick Wellnhofer aca37d8c 2023-11-20T15:20:37 parser: Only enable SAX2 if there are SAX2 element handlers This reverts part of commit 235b15a5 for backward compatibility and adds some comments trying to clarify the whole mess. Fixes #623.
Nick Wellnhofer 529df196 2023-11-15T12:10:25 parser: Don't overwrite error state in xmlParseTextDecl Fixes a null deref in xmlLoadEntityContent found by OSS-Fuzz.
Nick Wellnhofer 70cc45b8 2023-11-05T00:49:40 parser: Improve attribute hash table There's no need to grow the hash table dynamically. The size is known which simplifies the implementation.
Nick Wellnhofer 58598494 2023-11-04T23:47:33 parser: Fix combination of hash values This bug resulted in a stuck bit in hash values which can have a severe performance impact.
Nick Wellnhofer 7a2d412f 2023-10-31T20:15:38 parser: Copy default namespace in xmlParseBalancedChunkMemory
Nick Wellnhofer e0c2f14d 2023-10-31T13:53:15 parser: Copy namespaces in xmlParseBalancedChunkMemory Reenable copying of namespaces but don't set SAX data. This should match the old behavior.
Nick Wellnhofer 02856674 2023-10-22T15:56:46 parser: Remove redundant IS_CHAR check in xmlCurrentChar
Nick Wellnhofer c082ef46 2023-08-09T16:59:36 parser: Stop switching to ISO-8859-1 on encoding errors Use U+FFFD Replacement Character if invalid UTF-8 is encountered in recovery mode. Also rewrite xmlNextChar and xmlCurrentChar. Fixes #598.
Nick Wellnhofer 572ecc17 2023-10-22T13:59:55 parser: Fix buffer shrinking when push parsing Short-lived regression from b76d81da.
Nick Wellnhofer 86ef190e 2023-10-14T22:43:25 parser: Fix stack handling in xmlParseTryOrFinish After commit e0dd330b, this latent bug could cause use-after-free errors in rare circumstances like using the reader API with recovery and XIncludes.
Nick Wellnhofer 514ab399 2023-10-11T13:25:49 parser: Don't overwrite error state in xmlParseTextDecl If a memory allocation fails, this could cause a null deref after recent changes. Found by OSS-Fuzz.
Nick Wellnhofer 821a0370 2023-10-09T15:20:00 parser: Fix memory leak in xmlLoadEntityContent Found by OSS-Fuzz.
Nick Wellnhofer 4fc5340e 2023-10-08T14:17:46 parser: Also grow comment buffer if SAX is disabled Fix short-lived regression from 8afd321a, found by OSS-Fuzz.
Nick Wellnhofer 36374bc9 2023-10-08T14:08:44 parser: Fix error handling in xmlLoadEntityContent Backup more members of context struct. Fix small accounting error.
Nick Wellnhofer b76d81da 2023-10-06T11:50:29 parser: Fix regression when push parsing parameter entities Short-lived regression from 834b8123. Also shrink parameter entity buffers when push parsing.
Nick Wellnhofer 134d2ad8 2023-10-06T00:31:44 parser: Protect against quadratic default attribute expansion
Nick Wellnhofer 7615fae6 2023-10-05T23:52:55 parser: Make XML_PARSE_NSCLEAN option work again
Nick Wellnhofer 0ba22c05 2023-10-05T22:05:04 parser: Support encoded external PEs in entity values Corner case which was never supported.
Nick Wellnhofer 8afd321a 2023-10-05T22:02:56 parser: Missing checks for disableSAX
Nick Wellnhofer 97e99f41 2023-10-05T17:11:24 parser: Acknowledge that entities with namespaces are broken Entities which reference out-of-scope namespace have always been broken. xmlParseBalancedChunkMemoryInternal tried to reuse the namespaces currently in scope but these namespaces were ignored by the SAX handler. Besides, there could be different namespaces in scope when expanding the entity again. For example: <!DOCTYPE doc [ <!ENTITY ent "<ns:elem/>"> ]> <doc> <decl1 xmlns:ns="urn:ns1"> &ent; </decl1> <decl2 xmlns:ns="urn:ns2"> &ent; </decl2> </doc> Add some comments outlining possible solutions to this problem. For now, we stop copying namespaces to the temporary parser context in xmlParseBalancedChunkMemoryInternal. This has never really worked and the recent changes contained a partial fix which uncovered other problems like a use-after-free with the XML Reader interface, found by OSS-Fuzz.
Nick Wellnhofer eb69c1d3 2023-10-02T12:16:05 parser: Fix initialization of namespace data Move initialization to xmlInitSAXParserCtxt. Also add missing XML_HIDDEN to xmlParserNsFree. Fixes #597.
Nick Wellnhofer fc496793 2023-10-02T12:05:36 parser: Fix error handling in xmlParseQNameHashed Short-lived regression found by OSS-Fuzz.
Nick Wellnhofer 6dd87f5e 2023-09-30T17:11:25 malloc-fail: Fix memory leak in xmlParseBalancedChunkMemoryInternal Short-lived regression found by OSS-Fuzz.
Nick Wellnhofer e0dd330b 2023-09-29T00:18:44 parser: Use hash tables to avoid quadratic behavior Use a hash table to lookup namespaces by prefix. The hash table stores an index into the namespace table. Auxiliary data for namespaces is stored in a separate array along the main namespace table. Use a hash table to verify attribute uniqueness. The hash table stores an index into the attribute table. Reuse hash value from the dictionary to avoid computing them twice. See #346.
Nick Wellnhofer a873191c 2023-09-25T14:51:35 parser: Introduce xmlParseQNameHashed
Nick Wellnhofer 8c084ebd 2023-09-21T22:57:33 doc: Make apibuild.py happy
Nick Wellnhofer 11a1839d 2023-09-20T17:54:48 globals: Move remaining globals back to correct header files This undoes a lot of damage.
Nick Wellnhofer a77f9ab8 2023-09-20T16:57:22 globals: Don't include SAX2.h from globals.h
Nick Wellnhofer 2e6c49a7 2023-09-20T14:43:14 globals: Don't store xmlParserVersion in global state This is a constant.
Nick Wellnhofer a07ec7c1 2023-09-18T17:39:13 threads: Move library initialization code to threads.c This allows to consolidate the initialization code since the global init lock was already implemented in threads.c.
Nick Wellnhofer 4e1c13eb 2023-09-18T14:45:10 debug: Remove debugging code This is barely useful these days and only clutters the code base.
Nick Wellnhofer c19771c1 2023-09-18T00:54:39 globals: Move code from threads.c to globals.c Move all code that handles globals to the place where it belongs.
Nick Wellnhofer d7cfe356 2023-09-14T20:52:24 parser: Avoid undefined behavior in xmlParseStartTag2 Instead of using arithmetic on dangling pointers, store ptrdiff_t values in void pointers which is at least implementation-defined.
Nick Wellnhofer 57cfd221 2023-09-01T14:52:04 dict: Use xoroshiro64** as PRNG Stop using rand_r. This enables hash randomization on all platforms.
Nick Wellnhofer 53050b1d 2023-08-29T20:06:43 parser: More fixes to push parser error handling
Nick Wellnhofer bbd918b2 2023-08-29T15:56:37 parser: Fix detection of null bytes Also suppress misleading extra errors. Fixes #122.
Nick Wellnhofer c6083a32 2023-08-29T16:30:22 parser: Improve error handling in push parser - Report errors earlier - Align error messages with pull parser
Nick Wellnhofer 1edae30f 2023-08-29T15:58:22 parser: Don't check inputNr in xmlParseTryOrFinish There's no apparent reason for this check. inputNr should always be 1 here.
Nick Wellnhofer e48f2695 2023-08-29T17:41:18 parser: Remove push parser debugging code
Nick Wellnhofer ed3bd052 2023-08-20T20:48:10 parser: Allow to set maximum amplification factor
Nick Wellnhofer 855818bd 2023-08-08T15:21:37 parser: Check for truncated multi-byte sequences When decoding input data, check whether the "raw" buffer is empty after parsing the document. Otherwise, the input ends with a truncated multi-byte sequence which shouldn't be silently ignored.
Nick Wellnhofer 95e81a36 2023-08-08T15:21:31 parser: Decode all data in xmlCharEncInput Even with flush set to true, xmlCharEncInput didn't guarantee to decode all data. This complicated the push parser. Remove the flush flag and always decode all available data. Also fix ICU code where the flush flag has a different meaning. Always set flush to false and retry even with empty input buffers.
Nick Wellnhofer 834b8123 2023-08-08T15:21:28 parser: Stream data when reading from memory Don't create a copy of the whole input buffer. Read the data chunk by chunk to save memory. Historically, it was probably envisioned to read data from memory without additional copying. This doesn't work reliably with the current design of the XML parser which requires a terminating null byte at the end of input buffers. This lead to xmlReadMemory interfaces, which expect pointer and size arguments, being changed to make a zero-terminated copy of the input buffer. Interfaces based on xmlReadDoc, which actually expect a zero-terminated string and would make zero-copy operation work, were then simplified to rely on xmlReadMemoryi, resulting in an unnecessary copy. To avoid copying (possibly gigabytes) of memory temporarily, we now stream in-memory input just like content read from files in a chunk-by-chunk fashion (using a somewhat outdated INPUT_CHUNK size of 250 bytes). As a side effect, we also avoid another copy of the whole input when handling non-UTF-8 data which was made possible by some earlier commits. Interfaces expecting zero-terminated strings now make use of strnlen which unfortunately isn't part of the standard C library and only mandated since POSIX 2008.
Nick Wellnhofer 5aff27ae 2023-08-08T15:21:25 parser: Optimize xmlLoadEntityContent Load entity content via xmlParserInputBufferGrow, avoiding a copy. This also fixes an entity size accounting error.
Nick Wellnhofer facc2a06 2023-08-08T15:21:21 parser: Don't overwrite EOF parser state
Nick Wellnhofer 59fa0bb3 2023-08-08T15:21:14 parser: Simplify input pointer updates The base member always points to the beginning of the buffer.
Nick Wellnhofer c88ab7e3 2023-08-08T15:19:54 parser: Don't reinitialize parser input members The parser input struct should already be initialized.
Nick Wellnhofer ec7be506 2023-08-08T15:19:46 parser: Rework encoding detection Introduce XML_INPUT_HAS_ENCODING flag for xmlParserInput which is set when xmlSwitchEncoding is called. The parser can use the flag to reliably detect whether an encoding was already set via user override, BOM or other auto-detection. In this case, the encoding declaration won't be used to switch the encoding. Before, an inscrutable mix of ctxt->charset, ctxt->input->encoding and ctxt->input->buf->encoder was used. Introduce private helper functions to switch encodings used by both the XML and HTML parser: - xmlDetectEncoding which skips over the BOM, allowing to remove the BOM checks from other encoding functions. - xmlSetDeclaredEncoding, replacing htmlCheckEncodingDirect, which warns about encoding mismatches. If users override the encoding, store the declared instead of the actual encoding in xmlDoc. In this case, the actual encoding is known and the raw value from the doc is more useful. Also use the input flags to store the ISO-8859-1 fallback state. Restrict the fallback to cases where no encoding was specified. (The fallback is only useful in recovery mode and these days broken UTF-8 is probably more likely than ISO-8859-1, so it might eventually be removed completely.) The 'charset' member of xmlParserCtxt is now unused. The 'encoding' member of xmlParserInput is now unused. The 'standalone' member of xmlParserInput is renamed to 'flags'. A new parser state XML_PARSER_XML_DECL is added for the push parser.
Nick Wellnhofer d38e73f9 2023-08-08T15:19:44 parser: Always create UTF-8 in xmlParseReference It seems that this code path could only be triggered after an encoding error in recovery mode. Creating char-ref nodes is unnecessary and typically unexpected.
Nick Wellnhofer 131d0dc0 2023-08-08T15:19:39 parser: Don't use 'standalone' member of xmlParserInput The standalone declaration is only parsed in the main input stream.
Nick Wellnhofer d9ec182b 2023-08-08T15:19:36 parser: Don't detect encoding in xmlCtxtResetPush The encoding will be detected in xmlParseTryOrFinish.
Nick Wellnhofer 90bcbcfc 2023-07-20T21:08:01 parser: Fix potential use-after-free in xmlParseCharDataInternal Return immediately if a SAX handler stops the parser. Fixes #569.
Nick Wellnhofer e0f3016f 2023-05-18T17:31:44 parser: Fix regression when push parsing UTF-8 sequences Partial UTF-8 sequences are allowed when push parsing. Fixes #542.
Nick Wellnhofer 235b15a5 2023-05-08T17:58:02 SAX: Always initialize SAX1 element handlers Follow-up to commit d0c3f01e. A parser context will be initialized to SAX version 2, but this can be overridden with XML_PARSE_SAX1 later, so we must initialize the SAX1 element handlers as well. Change the check in xmlDetectSAX2 to only look for XML_SAX2_MAGIC, so we don't switch to SAX1 if the SAX2 element handlers are NULL.
Nick Wellnhofer d0c3f01e 2023-05-06T17:47:37 parser: Fix old SAX1 parser with custom callbacks For some reason, xmlCtxtUseOptionsInternal set the start and end element SAX handlers to the internal DOM builder functions when XML_PARSE_SAX1 was specified. This means that custom SAX handlers could never work with that flag because these functions would receive the wrong user data argument and crash immediately. Fixes #535.
Nick Wellnhofer 320f5084 2023-04-30T18:25:09 parser: Improve handling of encoding and IO errors Make sure that xmlCharEncInput, xmlParserInputBufferPush and xmlParserInputBufferGrow set the correct error code in the xmlParserInputBuffer. Handle errors when calling these functions.
Nick Wellnhofer fc69cf56 2023-04-30T17:51:29 parser: Move xmlFatalErr to parserInternals.c
Nick Wellnhofer 3ffcc03b 2023-03-13T19:38:41 parser: Deprecate more internal functions
Nick Wellnhofer 250faf3c 2023-04-20T12:35:21 parser: Fix regression in xmlParserNodeInfo accounting Commit 62150ed2 broke begin_pos and begin_line when extra node info was recorded. Fixes #523.
Nick Wellnhofer 9282b084 2023-04-19T21:55:24 parser: Fix regression in memory pull parser with encoding Revert another change from commit 98840d40. Decode the whole buffer when reading from memory and switching to the initial encoding. Add some comments about potential improvements.
David Kilzer 86105c04 2023-04-15T18:04:03 Fix use-after-free in xmlParseContentInternal() * parser.c: (xmlParseCharData): - Check if the parser has stopped before advancing `ctxt->input->cur`. This only occurs if a custom SAX error handler calls xmlStopParser() on fatal errors. Fixes #518.
Nick Wellnhofer b4d46cee 2023-04-12T15:10:01 parser: Remove first line handling in xmlParseChunk After reworking EBCDIC detection, this isn't necessary.
Nick Wellnhofer 98840d40 2023-03-21T19:07:12 parser: Rework EBCDIC code page detection To detect EBCDIC code pages, we used to switch the encoding twice and had to be very careful not to decode data after the XML declaration before the second switch. This relied on a hard-coded expected size of the XML declaration and was complicated and unreliable. Now we convert the first 200 bytes to EBCDIC-US and parse the encoding declaration manually.
Nick Wellnhofer 3eb9f5ca 2023-03-21T13:19:31 parser: Limit name length in xmlParseEncName
Nick Wellnhofer 04d1bedd 2023-03-21T13:08:44 parser: Rework shrinking of input buffers Don't try to grow the input buffer in xmlParserShrink. This makes sure that no memory allocations are made and the function always succeeds. Remove unnecessary invocations of SHRINK. Invoke SHRINK at the end of DTD parsing loops. Shrink before growing.
Nick Wellnhofer 067986fa 2023-03-18T14:44:28 parser: Fix regressions from previous commits - Fix memory leak in xmlParseNmtoken. - Fix buffer overread after htmlParseCharDataInternal.
Nick Wellnhofer 3e85d7b7 2023-03-17T13:15:35 parser: Rely on CUR_CHAR/NEXT to grow the input buffer The input buffer is now grown reliably when calling CUR_CHAR (xmlCurrentChar) or NEXT (xmlNextChar). This allows to remove many other invocations of GROW.
Nick Wellnhofer c81d0d04 2023-03-17T12:39:35 malloc-fail: Add more error checks when parsing names xmlParseName and similar functions must return NULL if an error occurs. Found by OSS-Fuzz, see #344.
Nick Wellnhofer b167c731 2023-03-14T14:42:36 parser: Fix short-lived regression causing infinite loops Fix 3eb6bf03. We really have to halt the parser, so the input buffer gets reset.
Nick Wellnhofer 2099441f 2023-03-13T17:51:13 parser: Stop calling xmlParserInputShrink Introduce xmlParserShrink which takes a parser context to simplify error handling.
Nick Wellnhofer cabde70f 2023-03-12T19:07:23 parser: Simplify calculation of available buffer space
Nick Wellnhofer b75976e0 2023-03-12T19:06:19 parser: Use size_t when subtracting input buffer pointers Avoid integer overflows.
Nick Wellnhofer 9a6ca816 2023-03-12T19:03:11 parser: Check for integer overflow when updating checkIndex Unfortunately, checkIndex is a long, not a size_t. Check for integer overflow before updating the value.
Nick Wellnhofer bd63d730 2023-03-12T17:40:55 html: Impose some length limits Impose length limits on names, attribute values, PIs and comments, similar to the XML parser.
Nick Wellnhofer 3eb6bf03 2023-03-12T16:47:15 parser: Stop calling xmlParserInputGrow Introduce xmlParserGrow which takes a parser context to simplify error handling.
Nick Wellnhofer 207ebdfd 2023-03-12T14:43:01 malloc-fail: Fix out-of-bounds read in xmlGROW Short-lived regression from 56cc2211.
Nick Wellnhofer 56cc2211 2023-03-09T22:27:58 parser: Merge xmlParserInputGrow into xmlGROW Simplifies the code and makes error handling easier.
Nick Wellnhofer 14604a44 2023-03-09T22:10:44 malloc-fail: Fix out-of-bounds read in xmlCurrentChar Found by OSS-Fuzz.
Nick Wellnhofer 3f69fc80 2023-03-08T13:58:49 parser: Tighten expansion limits - Lower the amount of expansion which is always allowed from 10MB to 1MB. - Lower the maximum amplification factor from 10 to 5. - Lower the "fixed cost" from 50 to 20.
Nick Wellnhofer 5d55315e 2023-02-18T17:29:07 parser: Fix OOB read when formatting error message Don't try to print characters beyond the end of the buffer. Found by OSS-Fuzz.
Nick Wellnhofer f8852184 2023-02-14T13:03:13 malloc-fail: Fix memory leak in xmlParseEntityDecl Found with libFuzzer, see #344.
Nick Wellnhofer e6d22f92 2023-01-23T01:48:37 malloc-fail: Fix reallocation in inputPush Store xmlRealloc result in temporary variable to avoid null deref in error handler. Found with libFuzzer, see #344.
Nick Wellnhofer 6fd89041 2023-01-22T19:42:41 malloc-fail: Fix use-after-free in xmlParseStartTag2 Fix error handling in xmlCtxtGrowAttrs. Found with libFuzzer, see #344.
Nick Wellnhofer d1b87856 2023-01-22T17:42:09 malloc-fail: Fix infinite loop in xmlParseTextDecl Memory errors can set `instate` to `XML_PARSER_EOF` which results in `NEXT` making no progress. Found with libFuzzer, see #344.
Nick Wellnhofer bd9de3a3 2023-01-22T16:52:39 malloc-fail: Fix null deref in xmlAddDefAttrs Found with libFuzzer, see #344.
Nick Wellnhofer 33d4a0fe 2023-01-22T15:41:00 parser: Fix progress check in xmlParseExternalSubset Avoid infinite loop. Short-lived regression from f61b8a62. Found with libFuzzer.
Nick Wellnhofer 74aa61e0 2023-01-22T13:09:03 parser: Halt parser on DTD errors If we try to continue parsing after an error in the internal or external subset, entity expansion accounting gets more complicated. Simply halt the parser. Found with libFuzzer.
Nick Wellnhofer d320a683 2023-01-17T13:50:51 parser: Fix entity check in attributes Don't set the "checked" flag when checking entities in default attribute values. These entities could reference other entities which weren't defined yet, so the check isn't reliable. This fixes a short-lived regression which could lead to a call stack overflow later in xmlStringGetNodeList.
Nick Wellnhofer 59b33661 2022-12-27T14:15:51 error: Limit number of parser errors Reporting errors is expensive and some abusive test cases can generate an error for each invalid input byte. This causes the parser to spend most of the time with error handling. Limit the number of errors and warnings to 100.
Nick Wellnhofer 66e9fd66 2022-12-25T21:26:17 parser: Fix infinite loop with push parser in recovery mode Short-lived regression from commit b1f9c193. Found by OSS-Fuzz.