Commit 4585088ad7201a452d4df0ee0a31c21155664109

Martin Mitas 2020-11-13T10:16:34

md_analyze_permissive_url_autolink: Better GFM compatibility. The autolinks now allow unmatched parentheses; only trailing closing parentheses are handled specially, to deal with the situation where the whole autolink sits inside an outer pair of parentheses. Somehow our tests were broken and avoided the cases with unmatched parentheses inside the auto-link. That's now fixed and in sync with the GFM specs too. Fixes #135.
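
The trimming rule described above can be sketched in isolation roughly as follows. This is only a minimal illustration, not the code in md4c.c: the helper name `trim_autolink_end` is hypothetical, and it assumes the candidate text has already been cut at the first whitespace, which the real function handles as part of its own scan.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of MD4C. Given a candidate run of text,
 * return the length of the auto-link after trailing punctuation and
 * unmatched trailing ')' have been trimmed. */
static size_t
trim_autolink_end(const char* text, size_t len)
{
    size_t n_opened = 0;    /* '(' still waiting for a matching ')' */
    size_t n_excess = 0;    /* ')' seen with no '(' left to match */
    size_t i;

    /* Unmatched ')' no longer ends the scan; it is only counted. */
    for(i = 0; i < len; i++) {
        if(text[i] == '(') {
            n_opened++;
        } else if(text[i] == ')') {
            if(n_opened > 0)
                n_opened--;
            else
                n_excess++;
        }
    }

    /* Trim from the end: punctuation always, ')' only while some of the
     * closers seen were unmatched. */
    while(len > 0) {
        char ch = text[len-1];

        if(ch != '\0'  &&  strchr("?!.,:*_~", ch) != NULL) {
            len--;
        } else if(ch == ')'  &&  n_excess > 0) {
            len--;
            n_excess--;
        } else {
            break;
        }
    }

    return len;
}
```

With this sketch, an unmatched `)` in the interior of the path stays part of the link, while one at the very end is dropped, so a link written inside an outer pair of parentheses does not swallow the closing one.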

diff --git a/CHANGELOG.md b/CHANGELOG.md
index e9affd4..2dd35ee 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -9,6 +9,10 @@ Fixes:
 * [#131](https://github.com/mity/md4c/issues/131):
   Fix handling of a reference image nested in a reference link.
 
+* [#135](https://github.com/mity/md4c/issues/135):
+  Handle unmatched parentheses inside permissive URL and WWW auto-links
+  in a way more compatible with GFM.
+
 
 ## Version 0.4.6
 
diff --git a/src/md4c.c b/src/md4c.c
index f0af787..12ed351 100644
--- a/src/md4c.c
+++ b/src/md4c.c
@@ -3810,6 +3810,7 @@ md_analyze_permissive_url_autolink(MD_CTX* ctx, int mark_index)
     int has_underscore_in_last_seg = FALSE;
     int has_underscore_in_next_to_last_seg = FALSE;
     int n_opened_parenthesis = 0;
+    int n_excess_parenthesis = 0;
 
     /* Check for domain. */
     while(off < ctx->size) {
@@ -3848,17 +3849,28 @@ md_analyze_permissive_url_autolink(MD_CTX* ctx, int mark_index)
             if(n_opened_parenthesis > 0)
                 n_opened_parenthesis--;
             else
-                break;
+                n_excess_parenthesis++;
         }
 
         off++;
     }
-    /* These cannot be last char In such case they are more likely normal
-     * punctuation. */
-    if(ISANYOF(off-1, _T("?!.,:*_~")))
-        off--;
 
-    /* Ok. Lets call it auto-link. Adapt opener and create closer to zero
+    /* Trim trailing punctuation from the end. */
+    while(TRUE) {
+        if(ISANYOF(off-1, _T("?!.,:*_~"))) {
+            off--;
+        } else if(CH(off-1) == ')'  &&  n_excess_parenthesis > 0) {
+            /* Unmatched ')' can be in the interior of the path but not at the
+             * end of it, so the auto-link may be safely nested in a parenthesis
+             * pair. */
+            off--;
+            n_excess_parenthesis--;
+        } else {
+            break;
+        }
+    }
+
+    /* Ok. Let's call it an auto-link. Adapt opener and create closer to zero
      * length so all the contents becomes the link text. */
     MD_ASSERT(closer->ch == 'D');
     opener->end = opener->beg;
diff --git a/test/permissive-url-autolinks.txt b/test/permissive-url-autolinks.txt
index 03a2069..f75830d 100644
--- a/test/permissive-url-autolinks.txt
+++ b/test/permissive-url-autolinks.txt
@@ -5,7 +5,7 @@ With the flag `MD_FLAG_PERMISSIVEURLAUTOLINKS`, MD4C enables more permissive rec
 of URLs and transform them to autolinks, even if they do not exactly follow the syntax
 of autolink as specified in CommonMark specification.
 
-This is standard CommonMark autolink:
+This is a standard CommonMark autolink:
 
 ```````````````````````````````` example
 Homepage: <https://github.com/mity/md4c>
diff --git a/test/permissive-www-autolinks.txt b/test/permissive-www-autolinks.txt
index 2830722..046de9d 100644
--- a/test/permissive-www-autolinks.txt
+++ b/test/permissive-www-autolinks.txt
@@ -8,11 +8,11 @@ of autolink as specified in CommonMark specification.
 These do not have to be enclosed in `<` and `>`, and they even do not need
 any preceding scheme specification.
 
-The WWW autolink will be recognized when a valid domain is found.
-
-A valid domain consists of the text `www.`, followed by alphanumeric characters,
-nderscores (`_`), hyphens (`-`) and periods (`.`). There must be at least one
-period, and no underscores may be present in the last two segments of the domain.
+The WWW autolink will be recognized when the text `www.` is found followed by a
+valid domain. A valid domain consists of segments of alphanumeric characters,
+underscores (`_`) and hyphens (`-`) separated by periods (`.`). There must be
+at least one period, and no underscores may be present in the last two segments
+of the domain.
 
 The scheme `http` will be inserted automatically:
 
@@ -64,9 +64,9 @@ the only parentheses are in the interior of the autolink, no special rules are
 applied:
 
 ```````````````````````````````` example
-www.google.com/search?q=(business)+ok
+www.google.com/search?q=(business))+ok
 .
-<p><a href="http://www.google.com/search?q=(business)+ok">www.google.com/search?q=(business)+ok</a></p>
+<p><a href="http://www.google.com/search?q=(business))+ok">www.google.com/search?q=(business))+ok</a></p>
 ````````````````````````````````
 
 If an autolink ends in a semicolon (`;`), we check to see if it appears to