Hash :
dcb80b92
Author :
Date :
2021-02-20T20:30:43
Fix slow parsing of HTML with encoding errors Under certain circumstances, the HTML parser would try to guess and switch input encodings multiple times, leading to slow processing of documents with encoding errors. The repeated scanning of the input buffer when guessing encodings could even lead to quadratic behavior. The code htmlCurrentChar probably assumed that if there's an encoding handler, it is guaranteed to produce valid UTF-8. This holds true in general, but if the detected encoding was "UTF-8", the UTF8ToUTF8 encoding handler simply invoked memcpy without checking for invalid UTF-8. This still must be fixed, preferably by not using this handler at all. Also leave a note that switching encodings twice seems impossible to implement correctly. Add a check when handling UTF-8 encoding errors in htmlCurrentChar to avoid this situation, even if encoders produce invalid UTF-8. Found by OSS-Fuzz.
XML toolkit from the GNOME project
Full documentation is available on-line at
http://xmlsoft.org/
This code is released under the MIT Licence see the Copyright file.
To build on an Unixised setup:
./configure ; make ; make install
if the ./configure file does not exist, run ./autogen.sh instead.
To build on Windows:
see instructions on win32/Readme.txt
To assert build quality:
on an Unixised setup:
run make tests
otherwise:
There is 3 standalone tools runtest.c runsuite.c testapi.c, which
should compile as part of the build or as any application would.
Launch them from this directory to get results, runtest checks
the proper functioning of libxml2 main APIs while testapi does
a full coverage check. Report failures to the list.
To report bugs, follow the instructions at:
http://xmlsoft.org/bugs.html
A mailing-list xml@gnome.org is available, to subscribe:
http://mail.gnome.org/mailman/listinfo/xml
The list archive is at:
http://mail.gnome.org/archives/xml/
All technical answers asked privately will be automatically answered on
the list and archived for public access unless privacy is explicitly
required and justified.
Daniel Veillard
$Id$