encoding.c


Log

Author Commit Date Message
Nick Wellnhofer 7bd8d1d9 2025-05-28T15:53:38 doc: Prefix autolinks with '#' Use `#func` instead of `func()` to ignore parameters and make all autolinks work.
Nick Wellnhofer 258d8706 2025-05-15T17:49:49 codegen: Consolidate tools for code generation Move tools, source files and output tables into codegen directory. Rename some files. Adjust tools to match modified files. Remove generation date and source files from output. Distribute all tools and sources.
Nick Wellnhofer adfbeb7e 2025-05-14T04:58:21 doc: Stop using *Ptr typedefs in documentation
Nick Wellnhofer a40f36e7 2025-05-14T04:04:28 include: Stop using *Ptr typedefs in public headers
Nick Wellnhofer 2d83a84c 2025-05-14T00:29:19 doc: Misc improvements
Nick Wellnhofer b0234633 2025-05-13T20:19:39 encoding: Preserve original encoding label When using built-in encodings, the label would be normalized which causes various issues. We now create a copy of the handler with the original name. This is somewhat dangerous as it will require users to free built-in encodings with xmlCharEncCloseFunc. But to handle the general case, this was already required. Fixes #916 in another way than originally proposed.
Nick Wellnhofer 19b99311 2025-05-12T21:07:41 encoding: Fix -Wswitch warning
Nick Wellnhofer f0983199 2025-05-12T13:00:20 html: Map some encodings according to HTML5 Windows-1252 is a superset of ISO-8859-1 and should be used instead. Same for ASCII. Also map UCS-2 and UTF-16 to UTF-16LE.
Nick Wellnhofer 93f67106 2025-05-12T12:27:54 encoding: Add HTML5 aliases
Nick Wellnhofer 628006f4 2025-05-12T11:47:40 encoding: Add windows-1252 Fixes #915.
Nick Wellnhofer 777e2adf 2025-05-09T23:53:03 io: Consolidate escaping code Use generated table approach of xmlSerializeText for xmlEscapeText. Move most code to xmlIO.c.
Nick Wellnhofer 9bbffec5 2025-05-06T17:42:46 doc: Move brief to top, params to bottom of doc comments
Nick Wellnhofer 80b6429f 2025-05-04T19:13:24 doc: Misc fixes to encoding docs
Nick Wellnhofer cb1635a6 2025-05-02T19:05:25 doc: Use @since command
Nick Wellnhofer e78e05c9 2025-05-02T17:32:51 doc: Fix autolinks to functions Unfortunately, autolinks in .c files aren't converted by Doxygen for some reason.
Nick Wellnhofer e525564f 2025-05-01T19:20:06 doc: Remove empty lines at start of block These lines were left over after automatic conversion.
Nick Wellnhofer e549622b 2025-04-28T15:11:24 doc: Convert documentation to Doxygen Automated conversion based on a few regexes.
Nick Wellnhofer 69879da8 2025-04-28T14:04:30 doc: Remove email addresses from documentation Also remove authorship information from generated files, hash.c and globals.c which were rewritten.
Nick Wellnhofer 97ffa77d 2025-04-10T17:36:58 encoding: Deprecate non-thread-safe functions
Nick Wellnhofer b3492259 2025-03-14T00:01:11 include: Change some return types from int to enum This also affects some new functions from 2.13.
Nick Wellnhofer 84c6524e 2025-03-13T19:45:35 encoding: Support input-only and output-only converters Make it possible to open an encoding handler only for input or output. This avoids the creation of unnecessary converters. Should also fix #863.
Nick Wellnhofer 69b83bb6 2025-03-10T02:18:51 encoding: Detect truncated multi-byte sequences with ICU Unlike iconv or the internal converters, ICU consumes truncated multi-byte sequences at the end of an input buffer. We currently check for a non-empty raw input buffer to detect truncated sequences, so this fails with ICU. It might be possible to inspect the pivot buffer pointers, but it seems cleaner to implement a `flush` flag for some encoding and I/O functions. After flushing, we can check for U_TRUNCATED_CHAR_FOUND with ICU, or detect remaining input with other converters. Also fix detection of truncated sequences for HTML, XML content and DTDs with iconv.
Nick Wellnhofer ef44c240 2025-03-10T14:15:35 encoding: Fix memory leak in xmlCharEncNewCustomHandler Short-lived regression.
Nick Wellnhofer 87c9e000 2025-03-09T22:20:23 encoding: Rework custom encoding implementation API
Nick Wellnhofer 38f47507 2025-03-05T21:06:05 encoding: Make conversion callbacks more type-safe
Nick Wellnhofer a846d964 2025-03-05T16:49:42 encoding: Remove compatibility struct members
Nick Wellnhofer 0b27097a 2025-03-04T12:55:25 encoding: Rename unprefixed public functions
Nick Wellnhofer 3793eaad 2025-02-16T13:54:56 fuzz: Fix build
Nick Wellnhofer 9c16a153 2025-02-13T18:41:33 Revert "include: Make most IS_* macros private" This reverts commit 84a6c82ff83d04963d6e1c5cd18ded68ea02d99f.
Nick Wellnhofer cfc854b8 2025-02-11T00:21:12 fuzz: Work around glibc iconv() bug
Nick Wellnhofer c4f760be 2025-02-01T15:29:56 encoding: Handle iconv() returning EOPNOTSUPP on Apple iconv() really shouldn't return undocumented error codes.
Nick Wellnhofer cdfb54ff 2025-01-31T18:38:40 Fix typos
Nick Wellnhofer 6ec616ba 2025-01-24T18:26:55 encoding: Don't allow POSIX indicator suffixes in encoding names Suffixes like "//IGNORE" change the behavior of iconv. Also add comment on how we currently rely on GNU libiconv behavior which technically violates the POSIX spec.
Nick Wellnhofer fbaacfe2 2025-01-16T15:57:35 encoding: Clean up UCS-4 encodings Use "UCS-*" instead of "ISO-10646-UCS-*". While the XML spec recommends "ISO-10646-UCS-2" and "ISO-10646-UCS-4", GNU iconv doesn't understand these names. Ignore UCS4_2143 and UCS4_3412 which were never supported.
Nick Wellnhofer df0f16fa 2024-12-15T21:34:59 encoding: Check reallocations for overflow
Nick Wellnhofer dae160c6 2024-09-13T12:08:20 encoding: Fix table entry for "UTF16"
Nick Wellnhofer 6e503eb7 2024-09-10T03:32:37 encoding: Handle more ICU error codes U_ILLEGAL_ESCAPE_SEQUENCE and U_UNSUPPORTED_ESCAPE_SEQUENCE can occur with ISO-2022.
Nick Wellnhofer 55d36c59 2024-09-10T03:11:18 encoding: Fix error code in xmlUconvConvert Broke in 46ec621e.
Nick Wellnhofer 34c9108f 2024-07-07T18:38:31 encoding: Add sizeOut argument to xmlCharEncInput When push parsing, we want to convert as much of the input as possible. When pull parsing memory buffers, we want to convert data chunk by chunk to save memory.
Nick Wellnhofer 1cfc5b80 2024-07-12T03:07:57 entities: Rework serialization of numeric character references
Nick Wellnhofer 69f12d6d 2024-07-13T00:17:18 encoding: Deprecate xmlByteConsumed This was only used by Chromium/WebKit to detect whether xmlParseContent really succeeded. It's a horrible, overcomplicated hack. See 8c5848bd and #767.
Nick Wellnhofer d0997956 2024-07-10T22:26:19 encoding: Readd some UTF-8 validation to encoders This isn't strictly needed but avoids generating invalid UTF-16 and unsigned integer overflows.
Nick Wellnhofer f48eefe3 2024-07-09T14:09:15 encoding: Rework xmlByteConsumed Don't loop infinitely if input buffer is too large. Allocate conversion buffer on the heap.
Nick Wellnhofer f86d17c1 2024-07-04T15:14:54 encoding: Fix xmlParseCharEncoding Make "UTF-16" return the UTF16LE handler as before. Fix error return.
Nick Wellnhofer 46ec621e 2024-07-03T15:48:01 encoding: Clarify xmlUconvConvert
Nick Wellnhofer 48fec242 2024-07-03T15:11:20 encoding: Remove duplicate code Fix recent commit.
Nick Wellnhofer 71fb2579 2024-07-03T14:35:49 encoding: Fix ICU build
Nick Wellnhofer 9a4770ef 2024-07-02T02:18:03 doc: Improve documentation
Nick Wellnhofer 0b0dd989 2024-06-28T23:13:38 parser: Fix EBCDIC detection
Nick Wellnhofer 37a9ff11 2024-06-28T22:42:46 encoding: Simplify xmlCharEncCloseFunc
Nick Wellnhofer 1167c334 2024-06-28T21:51:21 encoding: Don't include iconv.h from libxml/encoding.h
Nick Wellnhofer 30be984a 2024-06-28T20:37:47 encoding: Rework ISO-8859-X conversion Optimize code. Pass tables as context parameter. Check for XML_ENC_ERR_SPACE.
Nick Wellnhofer 282ec1d5 2024-06-28T19:06:57 encoding: Rework xmlCharEncodingHandler layout Reuse some of the old members. The "input" and "output" function pointers are actually of type xmlCharEncConvFunc, accepting an additional argument. For default handlers, this argument is unused, so this should work with most ABIs. For iconv handlers, these function pointers used to be NULL but now point to a function which requires the extra argument. "iconv_in" and "iconv_out" are made void pointers. "uconv_in" and "uconv_out" are renamed and made void pointers. This is unlikely to cause issues. We now expect that the built-in conversion functions correctly report XML_ENC_ERR_SPACE. For UTF8ToHtml and the ISO-8859-X code, this will be done in the following commits.
Nick Wellnhofer 57e37dff 2017-06-17T21:43:48 encoding: Rework UTF-16 conversion functions Optimize UTF-16 conversion functions. Avoid misaligned memory access. Don't rely on 'sizeof(short) == 2'. Check for XML_ENC_ERR_SPACE. Add some tests for UTF-16 conversion.
Nick Wellnhofer bb8e81c7 2024-06-28T04:36:14 encoding: Rework simple conversions function Use a single function for ASCII conversion. Optimize code. Check for XML_ENC_ERR_SPACE.
Nick Wellnhofer 501e5d19 2024-06-28T04:10:03 encoding: Stop using XML_ENC_ERR_PARTIAL
Nick Wellnhofer c59c2449 2024-06-27T23:32:58 encoding: Support custom implementations
Nick Wellnhofer 1e3da9f4 2024-06-27T21:37:18 encoding: Start with callbacks
Nick Wellnhofer 6d8427dc 2024-06-27T20:39:52 encoding: Rework encoding lookup Add missing xmlCharEncoding enum values. Simplify and speed up encoding lookup by using a table mapping names to xmlCharEncoding enums and binary search. Rearrange the default handler table to match the enum layout. For some encodings we now only lookup the provided or most canonical name instead of trying several names, expecting that iconv or ICU handle aliases: IBM037 (EBCDIC), UCS-2, UCS-4, Shift_JIS.
Nick Wellnhofer f4e63f7a 2024-06-27T15:15:06 Regenerate libxml2-api.xml and testapi.c
Nick Wellnhofer b1a416bf 2024-06-27T12:00:45 encoding: Restore old lookup order in xmlOpenCharEncodingHandler When looking up encodings with xmlLookupCharEncodingHandler, the returned handler can have a different name than requested (capitalization, internal aliases). This should eventually be fixed. For now we revert part of commit 5b893fa9, start the lookup with xmlFindHandler and add an explicit check for UTF-8. Should fix the encoding name issue mentioned in #749.
Nick Wellnhofer c4d8343b 2024-06-24T19:41:32 encoding: Make xmlFindCharEncodingHandler return UTF-8 handler xmlFindCharEncodingHandler must always return a handler. Remove UTF-8 handler from default handler list. Fixes 5b893fa9.
Nick Wellnhofer 5b893fa9 2024-06-22T19:15:17 encoding: Fix encoding lookup with xmlOpenCharEncodingHandler Make xmlOpenCharEncodingHandler call xmlParseCharEncoding first so we prefer our own handlers for names like "UTF8". Only UTF-16 needs an exception. Make callers check the return value. For UTF-8, a NULL encoding doesn't mean an error. Remove unnecessary UTF-8 check from htmlFindOutputEncoder. Don't try to look up ASCII handler since the HTML handler is always available. Fix return code of xmlParseCharEncoding. Should fix #744.
Rosen Penev 2def7b4b 2024-06-18T13:55:34 clang-tidy: move assignments out of if Found with bugprone-assignment-in-if-condition Signed-off-by: Rosen Penev <rosenp@gmail.com>
Nick Wellnhofer 63ce5f9a 2024-04-28T17:32:35 Make some globals const
Nick Wellnhofer 072facc4 2024-03-18T14:17:57 encoding: Don't shrink input too early in xmlCharEncOutput Some exotic encodings like ISO646-FR don't support '#' characters, so encoding a character reference can actually fail. Don't skip the offending input in this case so the error will be reported on the next call.
Nick Wellnhofer 0821efc8 2024-01-02T18:33:57 encoding: Check whether encoding handlers support input/output The "HTML" encoding handler doesn't support input which could lead to a wrong error report.
Nick Wellnhofer 023aecc4 2023-12-13T23:45:53 encoding: Support ASCII in xmlLookupCharEncodingHandler Return our built-in ASCII handler. This was never implemented and triggered the new and stricter error checks.
Nick Wellnhofer bd5ad030 2023-12-10T14:56:21 encoding: Report malloc failures Introduce new API functions that return a separate error code if a memory allocation fails: xmlOpenCharEncodingHandler, xmlLookupCharEncodingHandler. Fix a few places where malloc failures weren't reported.
Nick Wellnhofer 89d19534 2023-10-28T03:04:59 encoding: Fix decoding of large chunks After 95e81a36, we must support XML_ENC_ERR_SPACE when using built-in encoding handlers. Should fix #610.
Nick Wellnhofer 1734d27d 2023-10-02T15:04:18 encoding: Suppress -Wcast-align warnings
Nick Wellnhofer 0533daf5 2023-09-29T02:45:20 encoding: Fix infinite loop in xmlCharEncInput Short-lived regression from 95e81a36.
Nick Wellnhofer 8c084ebd 2023-09-21T22:57:33 doc: Make apibuild.py happy
Nick Wellnhofer 699299ca 2023-09-20T18:54:39 globals: Stop including globals.h
Nick Wellnhofer 7909ff08 2023-09-20T17:38:26 include: Remove unnecessary includes Don't include tree.h from encoding.h. Don't include parser.h from xmlIO.h.
Nick Wellnhofer 507f11ed 2023-08-16T15:43:47 encoding: Remove debugging code
Nick Wellnhofer 95e81a36 2023-08-08T15:21:31 parser: Decode all data in xmlCharEncInput Even with flush set to true, xmlCharEncInput didn't guarantee to decode all data. This complicated the push parser. Remove the flush flag and always decode all available data. Also fix ICU code where the flush flag has a different meaning. Always set flush to false and retry even with empty input buffers.
Nick Wellnhofer 4ee08155 2023-08-08T15:19:51 encoding: Move rawconsumed accounting to xmlCharEncInput
Nick Wellnhofer b236b7a5 2023-06-08T21:53:05 parser: Halt parser when growing buffer results in OOM Fix short-lived regression from previous commit. It might be safer to make xmlBufSetInputBaseCur use the original buffer even in case of errors. Found by OSS-Fuzz.
Nick Wellnhofer db21cd5d 2023-06-06T14:25:30 malloc-fail: Handle malloc failures in xmlAddEncodingAlias Avoid memory errors if an allocation fails. See #344. Fixes #553.
Nick Wellnhofer 2f12e3a9 2023-04-30T18:46:05 encoding: Stop calling xmlEncodingErr This invokes the global error handler which should be avoided.
Nick Wellnhofer 320f5084 2023-04-30T18:25:09 parser: Improve handling of encoding and IO errors Make sure that xmlCharEncInput, xmlParserInputBufferPush and xmlParserInputBufferGrow set the correct error code in the xmlParserInputBuffer. Handle errors when calling these functions.
Nick Wellnhofer 3ff6abbf 2023-02-22T17:11:20 encoding: Rework error codes Use an enum instead of magic numbers. Fix a few error codes. Simplify handling of "space" and "partial" errors. See #506.
Nick Wellnhofer 33fb297b 2023-04-15T16:53:00 encoding: Fix compiler warning in ICU build
Nick Wellnhofer a6b9e55a 2023-03-26T15:42:02 encoding: Fix error code in asciiToUTF8 Use correct error code when invalid ASCII bytes are encountered. Found by OSS-Fuzz.
Nick Wellnhofer 98840d40 2023-03-21T19:07:12 parser: Rework EBCDIC code page detection To detect EBCDIC code pages, we used to switch the encoding twice and had to be very careful not to decode data after the XML declaration before the second switch. This relied on a hard-coded expected size of the XML declaration and was complicated and unreliable. Now we convert the first 200 bytes to EBCDIC-US and parse the encoding declaration manually.
Nick Wellnhofer 1c5e1fc1 2023-02-14T13:56:21 malloc-fail: Check for malloc failure in xmlFindCharEncodingHandler Don't return encoding handlers with a NULL name. Found with libFuzzer, see #344.
Nick Wellnhofer d18f9c11 2023-02-14T13:50:46 malloc-fail: Fix leak of xmlCharEncodingHandler Also free handler if its name is NULL. Found with libFuzzer, see #344.
Nick Wellnhofer 3cc900f0 2023-02-16T11:50:52 encoding: Cast toupper argument to unsigned char Fixes undefined behavior. Also cast return value explicitly to fix implicit-integer-sign-change checks.
Nick Wellnhofer 2355eac5 2023-01-22T14:52:06 malloc-fail: Fix null deref if growing input buffer fails Also add some error checks. Found with libFuzzer, see #344.
Nick Wellnhofer 0f54af74 2022-12-08T18:36:45 encoding.c: Fix for documentation generator Top-level macro invocations throw off the documentation parser.
Nick Wellnhofer 53ab3840 2022-11-25T14:26:59 encoding: Make init function private
Nick Wellnhofer 3e9d5e4f 2022-11-25T14:19:36 encoding: Remove unused variable xmlDefaultCharEncodingHandler
Nick Wellnhofer 1406b20f 2022-11-24T19:14:33 encoding: Allocate default handlers statically
Nick Wellnhofer 2059df53 2022-11-14T22:27:58 buf: Deprecate static/immutable buffers
Nick Wellnhofer ad338ca7 2022-09-01T01:18:30 Remove explicit integer casts Remove explicit integer casts as final operation (in assignments, when passing arguments, when returning values). Remove casts to the same type and from certain range-bound values. The main motivation is that these explicit casts don't change the result of operations and only render UBSan's implicit-conversion checks useless. Removing these casts allows UBSan to detect cases where truncation or sign-changes occur unexpectedly. Document some explicit casts as truncating and add a few missing ones.
Nick Wellnhofer 0f568c0b 2022-08-26T01:22:33 Consolidate private header files Private functions were previously declared in header files in the root directory, in public headers guarded with IN_LIBXML, in libxml.h, and redundantly in source files that used them. Consolidate all private header files in include/private.
David Kilzer c14cac8b 2022-05-25T18:13:07 xmlBufAvail() should return length without including a byte for NUL terminator * buf.c: (xmlBufAvail): - Return the number of bytes available in the buffer, but do not include a byte for the NUL terminator so that it is reserved. * encoding.c: (xmlCharEncFirstLineInput): (xmlCharEncInput): (xmlCharEncOutput): * xmlIO.c: (xmlOutputBufferWriteEscape): - Remove code that subtracts 1 from the return value of xmlBufAvail(). It was implemented inconsistently anyway.
David Kilzer 21561e83 2016-05-20T15:21:43 Mark more static data as `const` Similar to 8f5710379, mark more static data structures with `const` keyword. Also fix placement of `const` in encoding.c. Original patch by Sarah Wilkin.
Nick Wellnhofer 40483d0c 2022-03-06T13:55:48 Deprecate module init and cleanup functions These functions shouldn't be part of the public API. Most init functions are only thread-safe when called from xmlInitParser. Global variables should only be cleaned up by calling xmlCleanupParser.