SAX2.c


Log

Author Commit Date CI Message
Nick Wellnhofer c5b45fbc 2025-05-16T16:54:09 doc: Misc fixes
Nick Wellnhofer 6f4b4527 2025-05-15T23:43:32 parser: Stop using ctxt->linenumbers I think this was used to avoid setting the `line` member before it was added (20+ years ago).
Nick Wellnhofer adfbeb7e 2025-05-14T04:58:21 doc: Stop using *Ptr typedefs in documentation
Nick Wellnhofer a40f36e7 2025-05-14T04:04:28 include: Stop using *Ptr typedefs in public headers
Nick Wellnhofer 5ce48ec1 2025-05-15T22:51:54 SAX2: Rework xmlSAX2Text Simplify and make more readable.
Nick Wellnhofer 442c1903 2025-05-09T18:52:36 doc: Fix some damage from automated conversions Add some newlines, fix returns.
Nick Wellnhofer ad390a5d 2025-05-09T15:34:53 parser: Set doc properties in endDocument SAX handler
Nick Wellnhofer c7c49643 2025-05-09T15:26:15 html: Move DTD creation to endDocument SAX callback
Nick Wellnhofer 9bbffec5 2025-05-06T17:42:46 doc: Move brief to top, params to bottom of doc comments
Nick Wellnhofer e78e05c9 2025-05-02T17:32:51 doc: Fix autolinks to functions Unfortunately, autolinks in .c files aren't converted by Doxygen for some reason.
Nick Wellnhofer e525564f 2025-05-01T19:20:06 doc: Remove empty lines at start of block These lines were left over after automatic conversion.
Nick Wellnhofer e549622b 2025-04-28T15:11:24 doc: Convert documentation to Doxygen Automated conversion based on a few regexes.
Nick Wellnhofer 69879da8 2025-04-28T14:04:30 doc: Remove email addresses from documentation Also remove authorship information from generated files, hash.c and globals.c which were rewritten.
Nick Wellnhofer 61890e39 2025-04-27T21:50:15 doc: Prepare for conversion to Doxygen Fix many params in internal functions (not really necessary but Doxygen warns about that in XML mode). Fix formatting in a few corner cases that automatic conversion can't handle. Rearrange some DOC_DISABLE blocks.
Nick Wellnhofer 8696ebe1 2025-03-11T14:32:35 parser: Fix ignorableWhitespace callback If ignorableWhitespace differs from the "characters" callback, we have to check for blanks as well. Regressed with 1f5b537.
Nick Wellnhofer 6ab430ca 2025-02-22T21:17:42 Remove unnecessary #includes
Nick Wellnhofer 6dfa68ac 2025-02-22T14:49:51 SAX2: Fix ctxt->nodemem check In some error cases and maybe other situations, nodemem can have a value of -1.
Nick Wellnhofer 63dfcca6 2024-12-16T01:34:29 fuzz: Reduce initial array size
Nick Wellnhofer 1f5b5371 2025-01-31T16:21:20 parser: Improve handling of NOBLANKS option Don't change the SAX handler. Use a helper function to invoke "characters" SAX callback. The old code didn't advance the input pointer consistently before invoking the callback. There was also some inconsistency wrt to ctxt->space handling. I don't understand the ctxt->space thing, but now we always behave like the non-complex case before.
Nick Wellnhofer ca819160 2025-01-03T20:50:08 include: Use intptr_t to cast between pointers and ints
Nick Wellnhofer 5c9abbf8 2024-12-09T17:17:32 SAX2: Fix xmlSAX2ResolveEntity if systemId is NULL Passing a NULL systemId results in snprintf("%s", NULL) which crashes on some platforms. Regressed with commit 4ff2dccf. Note that systemId should never be NULL during normal parsing. It can only be NULL if API functions are called with a NULL systemId. Should fix #825.
Nick Wellnhofer 497081ba 2024-11-17T20:25:07 parser: Remove remaining calls to xml{Push|Pop}Input
Nick Wellnhofer 467f4445 2024-10-30T14:03:39 SAX2: Add NULL check for ctxt->myDoc
Nick Wellnhofer 0ce7bfe5 2024-09-12T01:44:18 html: Try to avoid passing XML options to HTML parser
Nick Wellnhofer be874d78 2024-09-11T19:47:07 html: Ignore unexpected DOCTYPE declarations
Nick Wellnhofer 8ae06d52 2024-08-29T00:07:27 SAX2: Don't merge CDATA sections The Document Object Model (DOM) Level 3 Core Specification says: > Adjacent CDATASection nodes are not merged by use of the normalize > method of the Node interface. Fixes #412.
Nick Wellnhofer 1e5375c1 2024-07-06T15:15:57 SAX2: Check return value of xmlPushInput Fix null deref in case of malloc failure.
Nick Wellnhofer 80aabea1 2024-07-03T11:55:38 SAX2: Reenable 'directory' as base URI fallback Apparently, some users overwrite this member manually to set a base URI for memory streams. Fixes #753.
Nick Wellnhofer 842a0448 2024-07-03T11:46:06 valid: Restore ID lookup Revert a change from d025cfbb and don't overwrite ID table entries, so that the first attribute will be returned if there are duplicate IDs. This requires two other changes: - Attributes in entity content are never added to the ID table. This seems reasonable. - Remove the optimization to skip ID lookup when copying and the target document has an empty ID table. This also seems more correct since the document could have ID declarations nevertheless or we could be copying xml:ids into the document for the first time. Fixes #757.
Nick Wellnhofer f9065261 2024-07-02T23:43:28 SAX2: Fix HTML IDs Short-lived regression. Fixes #755.
Nick Wellnhofer 866be54e 2024-07-02T04:27:53 parser: Don't use deprecated xmlSplitQName
Nick Wellnhofer 16e7ecd4 2024-07-01T16:01:24 xinclude: Check URI length Don't report long URIs as OOM errors.
Nick Wellnhofer f505dcae 2024-06-26T14:11:34 tree: Remove underscores from xmlRegisterCallbacks
Nick Wellnhofer 8b1f79ce 2024-06-26T04:30:38 SAX2: Make xmlSAXDefaultVersion a no-op
Nick Wellnhofer 5cf5b542 2024-06-26T04:30:10 SAX2: Deprecate xmlSAX2StartElement
Nick Wellnhofer 860fb460 2024-06-17T20:58:27 SAX2: Fix null deref after malloc failure Short-lived regression.
Nick Wellnhofer faae3a91 2024-06-16T23:21:55 SAX2: Split out legacy SAX1 handling Split xmlSAX2StartElement into two functions handling legacy SAX1 and HTML.
Nick Wellnhofer 11c3f84b 2024-06-15T23:57:39 SAX2: Always make xmlSAX2{Start,End}Element public Simplify symbol availability logic.
Nick Wellnhofer 52384043 2024-06-11T19:10:41 parser: Pass resource type to resource loader
Nick Wellnhofer 64ad2725 2024-06-11T03:51:43 parser: Introduce per-context resource loader
Nick Wellnhofer 4ff2dccf 2024-05-10T02:04:52 SAX2: Warn if URI resolution failed
Nick Wellnhofer 71a7a33e 2024-05-03T00:44:42 parser: Fix base URI of internal parameter entities Search parent inputs of internal parameter entities for base URI. Fixes a long-standing bug, which manifested in a different way after commit 955c177f. Reproduce with xmllint --noent xmlconf/eduni/errata-2e/E18.xml
Nick Wellnhofer af2bda4e 2024-04-05T13:09:45 SAX2: Also check URI length before resolving We don't want to exceed the size limit of 1 MB in uri.c. Such errors can't be distinguished from malloc failures.
Nick Wellnhofer 2cc7f710 2024-03-29T11:55:20 SAX2: Fix xmlSAX2EntityDecl with empty base Short-lived regression.
Nick Wellnhofer 730de88b 2024-03-28T15:42:02 SAX2: Optimize appending children xmlSAX2AppendChild can make several assumptions which make appending nodes more efficient. Also handle line numbers in xmlSAX2AppendChild.
Nick Wellnhofer 05c147c3 2024-03-22T13:03:37 SAX2: Report malloc failure in xmlSAX2AttributeNs
Nick Wellnhofer 6a49bb77 2024-03-17T17:16:55 tree: Introduce xmlSearchNsSafe After the failed experiment with a static XML namespace, introduce versions of xmlSearchNs that report malloc failures. Optimize the no-document case by only adding the XML namespace declaration if it wasn't found in an ancestor.
Nick Wellnhofer 047ea3ec 2024-03-17T16:23:31 Revert "tree: Allocate XML namespace statically" This reverts commit 2840e33c5e4b51589a0b96e8102638eeaea6df72.
Nick Wellnhofer 9f049afa 2024-03-11T15:57:14 tree: Refactor element creation and parsing of attribute values Replace xmlStringGetNodeList and xmlStringLenGetNodeList with xmlNodeParseContentInternal which also updates an optional parent node. Don't look up entities a second time via xmlNewReference.
Nick Wellnhofer 2840e33c 2024-03-04T07:34:25 tree: Allocate XML namespace statically
Nick Wellnhofer 84a71860 2024-02-26T15:14:28 xmlreader: Fix xmlTextReaderConstEncoding Regression from commit f1c1f5c6. Fixes #697.
Nick Wellnhofer 7dc8600a 2024-02-20T12:32:17 SAX2: Report malloc failure in xmlCheckDefaultedAttributes
Nick Wellnhofer 2e19d0ef 2024-01-26T11:39:51 SAX2: Make sure that OOM errors aren't overwritten
Nick Wellnhofer 57c68759 2024-01-07T20:44:40 SAX2: Limit entity URI length to 2000 bytes Avoid quadratic behavior when loading entities with long URIs multiple times. This limitation could be dropped if we cached external entities.
Nick Wellnhofer 02cc5c36 2024-01-05T04:17:14 parser: Add XML_PARSE_NO_XXE parser option
Nick Wellnhofer 9912c369 2024-01-02T17:23:59 SAX2: Enforce size limit in xmlSAX2Text with XML_PARSE_HUGE
Nick Wellnhofer 37c6618b 2023-12-30T02:50:34 parser: Rework parsing of attribute and entity values Don't use a separate function to handle "complex" attributes. Validate UTF-8 byte sequences without decoding. This should improve performance considerably when parsing multi-byte UTF-8 sequences. Use a string buffer to avoid unnecessary allocations and copying when expanding entities. Normalize attribute values in a single pass while expanding entities. Be more lenient in recovery mode. If no entity substitution was requested, validate entities without expanding. Fixes #596. Also fixes #655.
Nick Wellnhofer 6a9a88a1 2023-12-26T03:13:05 parser: Move progressive flag into input struct
Nick Wellnhofer d944a415 2023-12-26T02:10:35 parser: Fix in-parameter-entity and in-external-dtd checks Use in ctxt->input->entity instead of ctxt->inputNr to determine whether we are inside a parameter entity. Stop using ctxt->external to check whether we're in an external DTD. This is signaled by ctxt->inSubset == 2.
Nick Wellnhofer 5f319304 2023-12-28T19:05:51 SAX2: Fix error code Today I learned that the TSCII character encoding [1] can blow up the size of text 12 times when converted to UTF-8: $ printf '\x82' |iconv -f TSCII -t UTF-8 |hexdump -C 00000000 e0 ae b8 e0 af 8d e0 ae b0 e0 af 80 0000000c [1] https://en.wikipedia.org/wiki/Tamil_Script_Code_for_Information_Interchange
Nick Wellnhofer 955c177f 2023-12-23T00:58:36 parser: Stop using 'directory' struct member This was only used as a pointless fallback for URI resolution.
Nick Wellnhofer 13043691 2023-12-20T00:33:34 parser: Rename xmlErrParser to xmlCtxtErr
Nick Wellnhofer 54c70ed5 2023-12-18T19:31:29 parser: Improve error handling Introduce xmlCtxtSetErrorHandler allowing to set a structured error for a parser context. There already was the "serror" SAX handler but this always receives the parser context as argument. Start to use xmlRaiseMemoryError. Remove useless arguments from memory error functions. Rename xmlErrMemory to xmlCtxtErrMemory. Remove a few calls to xmlGenericError. Remove support for runtime entity debugging.
Nick Wellnhofer e58ea29f 2023-12-10T18:10:42 SAX2: Report malloc failures Fix many places where malloc failures aren't reported. Improve error handling when parsing entity declarations. Fixes #308.
Nick Wellnhofer 7f00273c 2023-12-01T19:21:17 parser: Fix invalid free in xmlParseBalancedChunkMemoryRecover Set the dictionary for newDoc in xmlParseBalancedChunkMemoryRecover. This is a long-standing bug which was masked by - xmlParseBalancedChunkMemoryRecover changing the document of the root node. This is a really bad idea, resulting in a mismatch between ctxt->myDoc and ctxt->node->doc. - SAX2.c preferring ctxt->node->doc over ctxt->myDoc until commit a31e1b06. Fixes #641.
Nick Wellnhofer a31e1b06 2023-11-04T20:21:54 SAX2: Fix quadratic behavior in xmlSAX2AttributeNs The last missing piece to make parsing of attributes O(n).
Nick Wellnhofer e0dd330b 2023-09-29T00:18:44 parser: Use hash tables to avoid quadratic behavior Use a hash table to lookup namespaces by prefix. The hash table stores an index into the namespace table. Auxiliary data for namespaces is stored in a separate array along the main namespace table. Use a hash table to verify attribute uniqueness. The hash table stores an index into the attribute table. Reuse hash value from the dictionary to avoid computing them twice. See #346.
Nick Wellnhofer da274bfa 2023-09-21T01:29:40 build: Fix build when certain modules are disabled
Nick Wellnhofer 9b5cce7a 2023-09-21T00:44:50 include: Remove more unnecessary includes
Nick Wellnhofer 699299ca 2023-09-20T18:54:39 globals: Stop including globals.h
Nick Wellnhofer a77f9ab8 2023-09-20T16:57:22 globals: Don't include SAX2.h from globals.h
Nick Wellnhofer 4e1c13eb 2023-09-18T14:45:10 debug: Remove debugging code This is barely useful these days and only clutters the code base.
Nick Wellnhofer cde44997 2023-08-27T16:35:23 SAX2: Allow multiple top-level elements When parsing with HTML_PARSE_NOIMPLIED, the result document can contain multiple top-level elements. Rework xmlSAX2StartElement to simply add the element as a child of ctxt->node or ctxt->myDoc. Don't invoke xmlAddSibling for non-element parents. The context node should always be an element node. Fixes #584.
Nick Wellnhofer f1c1f5c6 2023-08-16T19:43:02 parser: Revert change to doc->encoding Fixes #579.
Nick Wellnhofer cb717d7e 2023-08-09T16:52:02 parser: Update line number after coalescing text nodes This should make the line number of text nodes deterministic. Before, it depended on the callback sequence which depends on the size of chunks fed to the parser.
Nick Wellnhofer ec7be506 2023-08-08T15:19:46 parser: Rework encoding detection Introduce XML_INPUT_HAS_ENCODING flag for xmlParserInput which is set when xmlSwitchEncoding is called. The parser can use the flag to reliably detect whether an encoding was already set via user override, BOM or other auto-detection. In this case, the encoding declaration won't be used to switch the encoding. Before, an inscrutable mix of ctxt->charset, ctxt->input->encoding and ctxt->input->buf->encoder was used. Introduce private helper functions to switch encodings used by both the XML and HTML parser: - xmlDetectEncoding which skips over the BOM, allowing to remove the BOM checks from other encoding functions. - xmlSetDeclaredEncoding, replacing htmlCheckEncodingDirect, which warns about encoding mismatches. If users override the encoding, store the declared instead of the actual encoding in xmlDoc. In this case, the actual encoding is known and the raw value from the doc is more useful. Also use the input flags to store the ISO-8859-1 fallback state. Restrict the fallback to cases where no encoding was specified. (The fallback is only useful in recovery mode and these days broken UTF-8 is probably more likely than ISO-8859-1, so it might eventually be removed completely.) The 'charset' member of xmlParserCtxt is now unused. The 'encoding' member of xmlParserInput is now unused. The 'standalone' member of xmlParserInput is renamed to 'flags'. A new parser state XML_PARSER_XML_DECL is added for the push parser.
Nick Wellnhofer d38e73f9 2023-08-08T15:19:44 parser: Always create UTF-8 in xmlParseReference It seems that this code path could only be triggered after an encoding error in recovery mode. Creating char-ref nodes is unnecessary and typically unexpected.
Nick Wellnhofer b8961df6 2023-05-09T03:25:24 SAX: Always validate xml:ids The behavior shouldn't depend on mostly random configuration options.
Nick Wellnhofer 235b15a5 2023-05-08T17:58:02 SAX: Always initialize SAX1 element handlers Follow-up to commit d0c3f01e. A parser context will be initialized to SAX version 2, but this can be overridden with XML_PARSE_SAX1 later, so we must initialize the SAX1 element handlers as well. Change the check in xmlDetectSAX2 to only look for XML_SAX2_MAGIC, so we don't switch to SAX1 if the SAX2 element handlers are NULL.
Nick Wellnhofer 250faf3c 2023-04-20T12:35:21 parser: Fix regression in xmlParserNodeInfo accounting Commit 62150ed2 broke begin_pos and begin_line when extra node info was recorded. Fixes #523.
Nick Wellnhofer d7d0bc65 2023-03-31T16:47:48 SAX2: Ignore namespaces in HTML documents In commit 21ca8829, we started to ignore namespaces in HTML element names but we still called xmlSplitQName, effectively stripping the namespace prefix. This would cause elements like <o:p> being parsed as <p>. Now we leave the name untouched. Fixes #508.
Nick Wellnhofer cb4334b7 2023-02-14T18:10:14 malloc-fail: Fix memory leak in xmlSAX2StartElementNs Found with libFuzzer, see #344.
Nick Wellnhofer 0c5f40b7 2023-01-22T13:27:41 malloc-fail: Fix null deref in xmlSAX2AttributeInternal Found with libFuzzer, see #344.
Nick Wellnhofer b3b53dcc 2023-01-22T11:28:46 malloc-fail: Fix null deref in xmlSAX2Text Found with libFuzzer, see #344.
Nick Wellnhofer 463bbeec 2022-12-19T18:39:45 entities: Rework entity amplification checks This commit implements robust detection of entity amplification attacks, better known as the "billion laughs" attack. We now limit the size of the document after substitution of entities to 10 times the size before expansion. This guarantees linear behavior by definition. There already was a similar check before, but the accounting of "sizeentities" (size of external entities) and "sizeentcopy" (size of all copies created by entity references) wasn't accurate. We also need saturation arithmetic since we're historically limited to "unsigned long" which is 32-bit on many platforms. A maximum of 10 MB of substitutions is always allowed. This should make use cases like DITA work which have caused problems in the past. The old checks based on the number of entities were removed. This is accounted for by adding a fixed cost to each entity reference. Entity amplification checks are now enabled even if XML_PARSE_HUGE is set. This option is mainly used to allow larger text nodes. Most users were unaware that it also disabled entity expansion checks. Some of the limits might be adjusted later. If this change turns out to affect legitimate use cases, we can add a separate parser option to disable the checks. Fixes #294. Fixes #345.
Nick Wellnhofer cecd364d 2022-11-24T16:38:47 parser: Don't call *DefaultSAXHandlerInit from xmlInitParser Change the default handler definitions to match the result after calling the initialization functions. This makes sure that no thread-local variables are accessed when calling xmlInitParser.
Nick Wellnhofer 68a6518c 2022-11-15T18:23:33 parser: Rewrite push parser boundary checks Remove inaccurate xmlParseCheckTransition check. Remove non-incremental xmlParseGetLasts check. Add functions that check for several boundary constructs more accurately, keeping track of progress in ctxt->checkIndex. Fixes #439.
Nick Wellnhofer 7ceaee94 2022-11-02T16:05:05 malloc-fail: Fix memory leak in xmlSAX2ExternalSubset Found with libFuzzer, see #344.
Nick Wellnhofer 81621b1f 2022-09-02T18:38:33 Fix compiler warnings in SAX2.c
Nick Wellnhofer ad338ca7 2022-09-01T01:18:30 Remove explicit integer casts Remove explicit integer casts as final operation - in assignments - when passing arguments - when returning values Remove casts - to the same type - from certain range-bound values The main motivation is that these explicit casts don't change the result of operations and only render UBSan's implicit-conversion checks useless. Removing these casts allows UBSan to detect cases where truncation or sign-changes occur unexpectedly. Document some explicit casts as truncating and add a few missing ones.
Nick Wellnhofer aeb69fd3 2022-09-01T02:33:16 Fix overflow check in SAX2.c
Nick Wellnhofer 0f568c0b 2022-08-26T01:22:33 Consolidate private header files Private functions were previously declared - in header files in the root directory - in public headers guarded with IN_LIBXML - in libxml.h - redundantly in source files that used them. Consolidate all private header files in include/private.
Nick Wellnhofer 0e49f882 2022-08-24T05:25:37 Mark most SAX1 functions as deprecated No compiler warnings generated yet.
Nick Wellnhofer 4b184240 2022-08-22T14:11:15 Remove htmlDefaultSAXHandler from non-SAX1 build This matches long-standing behavior of the XML counterpart.
Nick Wellnhofer 3e7b4f37 2022-05-20T23:28:25 Avoid calling xmlSetTreeDoc Create text nodes with xmlNewDocText or set the document directly to avoid xmlSetTreeDoc being called when the node is inserted.
Nick Wellnhofer 40483d0c 2022-03-06T13:55:48 Deprecate module init and cleanup functions These functions shouldn't be part of the public API. Most init functions are only thread-safe when called from xmlInitParser. Global variables should only be cleaned up by calling xmlCleanupParser.
Nick Wellnhofer 4a8c71eb 2022-03-04T03:35:57 Remove DOCBparser This code has been broken and deprecated since version 2.6.0, released in 2003. Because of a bug in commit 961b535c, DOCBparser.c was never compiled since 2012. I couldn't find a Debian package using any of its symbols, so it seems safe to remove this module.
Nick Wellnhofer c41bc10d 2022-02-22T19:57:12 Fix unused variable warnings with disabled features
Nick Wellnhofer 346c3a93 2022-02-20T18:46:42 Remove elfgcchack.h The same optimization can be enabled with -fno-semantic-interposition since GCC 5. clang has always used this option by default.
Nick Wellnhofer e03590c9 2022-02-08T02:42:30 Don't add IDs containing unexpanded entity references When parsing without entity substitution, IDs or IDREFs containing unexpanded entity reference like "abc&x;def" could be created. We could try to expand these entities like in validation mode, but it seems safer to honor the request not to expand entities. We silently ignore such IDs for now.