README.md: Update.
diff --git a/README.md b/README.md
index b64ead8..93f8f4f 100644
--- a/README.md
+++ b/README.md
@@ -26,26 +26,32 @@ MD4C is C Markdown parser with the following features:
* **Compliance:** Generally MD4C aims to be compliant to the latest version of
[CommonMark specification](http://spec.commonmark.org/). Right now we are
- quite close to CommonMark 0.27.
+ very close to CommonMark 0.27.
-* **Extensions:** If explicitly enabled, the parser supports some commonly
- requested and accepted extensions. See below.
+* **Extensions:** MD4C supports some commonly requested and accepted extensions.
+ See below.
* **Compactness:** MD4C is implemented in one source file and one header file.
* **Embedding:** MD4C is easy to reuse in other projects, its API is very
- straightforward.
+ straightforward: There is actually just one function, `md_parse()`.
+
+* **Push model:** MD4C parses the complete document and calls callback
+ functions provided by the application for each start/end of block, start/end
+ of a span, and with any textual contents.
* **Portability:** MD4C builds and works on Windows and Linux, and it should
- be fairly trivial to build it also on other systems.
+ be fairly simple to make it run on most other systems as well.
-* **Encoding:** MD4C can compiled to recognize ASCII-only control characters,
- UTF-8 and, on Windows, also UTF-16 little endian, i.e. what is commonly called
- Unicode on Windows.
+* **Encoding:** MD4C can be compiled to recognize ASCII-only control characters,
+ UTF-8 and, on Windows, also UTF-16 little endian, i.e. what is commonly
+ called just "Unicode" on Windows. See more details below.
* **Permissive license:** MD4C is available under the MIT license.
-* **Performance:** MD4C is quite fast.
+* **Performance:** MD4C is very fast. Preliminary tests show it is quite a bit
+ faster than [Hoedown](https://github.com/hoedown/hoedown) or
+ [Cmark](https://github.com/jgm/cmark).
## Using MD4C
@@ -54,8 +60,10 @@ The parser is implemented in a single C source file `md4c.c` and its
accompanying header `md4c.h`.
The main provided function is `md_parse()`. It takes a text in Markdown syntax
-as an input and a renderer structure which holds pointers to few callback
-functions. As `md_parse()` eats the input, it calls appropriate callbacks
+as an input and a pointer to a renderer structure which holds pointers to a few
+callback functions.
+
+As `md_parse()` processes the input, it calls the appropriate callbacks
allowing application to convert it into another format or render it onto
the screen.
@@ -68,19 +76,25 @@ directory which implements a conversion utility from Markdown to HTML.
## Markdown Extensions
-By default, MD4C recognizes only elements defined by CommonMark specification.
+The default behavior is to recognize only elements defined by the CommonMark
+specification.
-However with appropriate flags enabling it, behavior of MD4C parse can be tuned
-to enable some extensions or allowing some deviations from the specification.
+However, with appropriate renderer flags, the behavior can be tuned to enable
+some extensions or to allow some deviations from the specification.
* With the flag `MD_FLAG_COLLAPSEWHITESPACE`, non-trivial whitespace is
collapsed into a single space.
+
* With the flag `MD_FLAG_TABLES`, GitHub-style tables are supported.
+
* With the flag `MD_FLAG_PERMISSIVEURLAUTOLINKS` permissive URL autolinks
(not enclosed in '<' and '>') are supported.
+
* With the flag `MD_FLAG_PERMISSIVEAUTOLINKS`, ditto for e-mail autolinks.
+
* With the flag `MD_FLAG_NOHTMLSPANS` or `MD_FLAG_NOHTML`, raw inline HTML
or raw HTML blocks respectively are disabled.
+
* With the flag `MD_FLAG_NOINDENTEDCODEBLOCKS`, indented code blocks are
disabled.
@@ -88,26 +102,41 @@ to enable some extensions or allowing some deviations from the specification.
## Input/Output Encoding
The CommonMark specification generally assumes UTF-8 input, but under closer
-inspection Unicode is actually used on very few occasions.
+inspection Unicode is actually used on very few occasions:
+
+ * Classification of a Unicode character as Unicode whitespace or Unicode
+ punctuation. This is used for detection of word boundaries when processing
+ emphasis and strong emphasis.
+
+ * Unicode case folding. This is used to perform case-independent matching
+ of link labels when resolving reference links.
+
+MD4C uses this property of the standard and its implementation is, to a large
+degree, encoding-agnostic. Most of the code only assumes that the encoding of
+your choice is compatible with ASCII, i.e. that the codepoints below 128 have
+the same numeric values as ASCII.
-MD4C uses this property of the standard and its implementation is to a large
-degree encoding-agnostic, just with the assumption the encoding of your choice
-is compatible with ASCII.
+All input MD4C does not understand is treated as text and sent to the callbacks
+unchanged.
-By default MD4C simply only understands the ASCII characters as those making
-the marks in the document, and all the other input (the text) is provided
-as it is on the input.
+The behavior of MD4C in the isolated situations where the encoding really
+matters is determined by preprocessor macros:
-That said, the Unicode is supported too:
+ * If the preprocessor macro `MD4C_USE_UNICODE` is defined, MD4C assumes UTF-8
+ in those specific situations.
- * If you predefine macro `MD4C_USE_UNICODE`, MD4C performs parsing of UTF-8
- locally where it does matter.
+ * On Windows, if the preprocessor macro `MD4C_USE_WIN_UNICODE` is defined, MD4C
+ assumes little-endian UTF-16 and uses `WCHAR` instead of `char`. This allows
+ usage of MD4C directly within Unicode applications on Windows, without any
+ text conversion.
- * On Windows, if you predefine macro `MD4C_USE_WIN_UNICODE`, MD4C shall use
- `WCHAR` instead of `char` and will assume UTF16-LE encoding.
+ * When neither macro is defined, an ASCII-only approach is used even in
+ the listed situations. This effectively means that non-ASCII whitespace or
+ punctuation characters won't be recognized as such and that case-folding is
+ performed only on ASCII letters (i.e. `[a-zA-Z]`).
-It should be relatively easy to add support for any other encoding, as long as
-its codepoints below 128 are compatible with ASCII.
+(Adding support for yet another encoding should be relatively simple due to
+the isolation of the respective code.)
## License