Edit

kc3-lang/harfbuzz/docs/usermanual-clusters.xml

Branch :

  • Show log

    Commit

  • Author : Behdad Esfahbod
    Date : 2023-02-09 12:53:17
    Hash : 96d9e862
    Message : [docs] Improve cluster-level docs

  • docs/usermanual-clusters.xml
  • <?xml version="1.0"?>
    <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
                   "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
      <!ENTITY % local.common.attrib "xmlns:xi  CDATA  #FIXED 'http://www.w3.org/2003/XInclude'">
      <!ENTITY version SYSTEM "version.xml">
    ]>
    <chapter id="clusters">
      <title>Clusters</title>
      <section id="clusters-and-shaping">
        <title>Clusters and shaping</title>
        <para>
          In text shaping, a <emphasis>cluster</emphasis> is a sequence of
          characters that needs to be treated as a single, indivisible
          unit. A single letter or symbol can be a cluster of its
          own. Other clusters correspond to longer subsequences of the
          input code points &mdash; such as a ligature or conjunct form
          &mdash; and require the shaper to ensure that the cluster is not
          broken during the shaping process.
        </para>
        <para>
          A cluster is distinct from a <emphasis>grapheme</emphasis>,
          which is the smallest unit of meaning in a writing system or
          script.
        </para>
        <para>
          The definitions of the two terms are similar. However, clusters
          are only relevant for script shaping and glyph layout. In
          contrast, graphemes are a property of the underlying script, and
          are of interest when client programs implement orthographic 
          or linguistic functionality.
        </para>
        <para>
          For example, two individual letters are often two separate
          graphemes. When two letters form a ligature, however, they
          combine into a single glyph. They are then part of the same
          cluster and are treated as a unit by the shaping engine &mdash;
          even though the two original, underlying letters remain separate
          graphemes.
        </para>
        <para>
          HarfBuzz is concerned with clusters, <emphasis>not</emphasis>
          with graphemes &mdash; although client programs using HarfBuzz
          may still care about graphemes for other reasons from time to time.
        </para>
        <para>
          During the shaping process, there are several shaping operations
          that may merge adjacent characters (for example, when two code
          points form a ligature or a conjunct form and are replaced by a
          single glyph) or split one character into several (for example,
          when decomposing a code point through the
          <literal>ccmp</literal> feature). Operations like these alter
          clusters; HarfBuzz tracks the changes to ensure that no clusters
          get lost or broken during shaping. 
        </para>
        <para>
          HarfBuzz records cluster information independently from how
          shaping operations affect the individual glyphs returned in an
          output buffer. Consequently, a client program using HarfBuzz can
          utilize the cluster information to implement features such as:
        </para>
        <itemizedlist>
          <listitem>
    	<para>
    	  Correctly positioning the cursor within a shaped text run,
    	  even when characters have formed ligatures, composed or
    	  decomposed, reordered, or undergone other shaping operations.
    	</para>
          </listitem>
          <listitem>
    	<para>
    	  Correctly highlighting a text selection that includes some,
    	  but not all, of the characters in a word. 
    	</para>
          </listitem>
          <listitem>
    	<para>
    	  Applying text attributes (such as color or underlining) to
    	  part, but not all, of a word.
    	</para>
          </listitem>
          <listitem>
    	<para>
    	  Generating output document formats (such as PDF) with
    	  embedded text that can be fully extracted.
    	</para>
          </listitem>
          <listitem>
    	<para>
    	  Determining the mapping between input characters and output
    	  glyphs, such as which glyphs are ligatures.
    	</para>
          </listitem>
          <listitem>
    	<para>
    	  Performing line-breaking, justification, and other
    	  line-level or paragraph-level operations that must be done
    	  after shaping is complete, but which require examining
    	  character-level properties.
    	</para>
          </listitem>
        </itemizedlist>
      </section>
      <section id="working-with-harfbuzz-clusters">
        <title>Working with HarfBuzz clusters</title>
        <para>
          When you add text to a HarfBuzz buffer, each code point must be
          assigned a <emphasis>cluster value</emphasis>.
        </para>
        <para>
          This cluster value is an arbitrary number; HarfBuzz uses it only
          to distinguish between clusters. Many client programs will use
          the index of each code point in the input text stream as the
          cluster value. This is for the sake of convenience; the actual
          value does not matter.
        </para>
        <para>
          Some of the shaping operations performed by HarfBuzz &mdash;
          such as reordering, composition, decomposition, and substitution
          &mdash; may alter the cluster values of some characters. The
          final cluster values in the buffer at the end of the shaping
          process will indicate to client programs which subsequences of
          glyphs represent a cluster and, therefore, must not be
          separated.
        </para>
        <para>
          In addition, client programs can query the final cluster values
          to discern other potentially important information about the
          glyphs in the output buffer (such as whether or not a ligature
          was formed).
        </para>
        <para>
          For example, if the initial sequence of cluster values was:
        </para>
        <programlisting>
          0,1,2,3,4
        </programlisting>
        <para>
          and the final sequence of cluster values is:
        </para>
        <programlisting>
          0,0,3,3
        </programlisting>
        <para>
          then there are two clusters in the output buffer: the first
          cluster includes the first two glyphs, and the second cluster
          includes the third and fourth glyphs. It is also evident that a
          ligature or conjunct has been formed, because there are fewer
          glyphs in the output buffer (four) than there were code points
          in the input buffer (five).
        </para>
        <para>
          Although client programs using HarfBuzz are free to assign
          initial cluster values in any manner they choose to, HarfBuzz
          does offer some useful guarantees if the cluster values are
          assigned in a monotonic (either non-decreasing or non-increasing)
          order.
        </para>
        <para>
          For buffers in the left-to-right (LTR)
          or top-to-bottom (TTB) text flow direction,
          HarfBuzz will preserve the monotonic property: client programs
          are guaranteed that monotonically increasing initial cluster
          values will be returned as monotonically increasing final
          cluster values.
        </para>
        <para>
          For buffers in the right-to-left (RTL)
          or bottom-to-top (BTT) text flow direction,
          the directionality of the buffer itself is reversed for final
          output as a matter of design. Therefore, HarfBuzz inverts the
          monotonic property: client programs are guaranteed that
          monotonically increasing initial cluster values will be
          returned as monotonically <emphasis>decreasing</emphasis> final
          cluster values.
        </para>
        <para>
          Client programs can adjust how HarfBuzz handles clusters during
          shaping by setting the
          <literal>cluster_level</literal> of the
          buffer. HarfBuzz offers three <emphasis>levels</emphasis> of
          clustering support for this property:
        </para>
        <itemizedlist>
          <listitem>
    	<para><emphasis>Level 0</emphasis> is the default.
    	</para>
    	<para>
    	  The distinguishing feature of level 0 behavior is that, at
    	  the beginning of processing the buffer, all code points that
    	  are categorized as <emphasis>marks</emphasis>,
    	  <emphasis>modifier symbols</emphasis>, or
    	  <emphasis>Emoji extended pictographic</emphasis> modifiers,
    	  as well as the <emphasis>Zero Width Joiner</emphasis> and
    	  <emphasis>Zero Width Non-Joiner</emphasis> code points, are
    	  assigned the cluster value of the closest preceding code
    	  point from <emphasis>different</emphasis> category.
    	</para>
    	<para>
    	  In essence, whenever a base character is followed by a mark
    	  character or a sequence of mark characters, those marks are
    	  reassigned to the same initial cluster value as the base
    	  character. This reassignment is referred to as
    	  "merging" the affected clusters. This behavior is based on
    	  the Grapheme Cluster Boundary specification in <ulink
    	  url="https://www.unicode.org/reports/tr29/#Regex_Definitions">Unicode
    	  Technical Report 29</ulink>.
    	</para>
    	<para>
    	  This cluster level is suitable for code that likes to use
    	  HarfBuzz cluster values as an approximation of the Unicode
    	  Grapheme Cluster Boundaries as well.
    	</para>
    	<para>
    	  Client programs can specify level 0 behavior for a buffer by
    	  setting its <literal>cluster_level</literal> to
    	  <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_GRAPHEMES</literal>. 
    	</para>
          </listitem>
          <listitem>
    	<para>
    	  <emphasis>Level 1</emphasis> tweaks the old behavior
    	  slightly to produce better results. Therefore, level 1
    	  clustering is recommended for code that is not required to
    	  implement backward compatibility with the old HarfBuzz.
    	</para>
    	<para>
    	  <emphasis>Level 1</emphasis> differs from level 0 by not merging the
    	  clusters of marks and other modifier code points with the
    	  preceding "base" code point's cluster. By preserving the
    	  separate cluster values of these marks and modifier code
    	  points, script shapers can perform additional operations
    	  that might lead to improved results (for example, coloring
    	  mark glyphs differently than their base).
    	</para>
    	<para>
    	  Client programs can specify level 1 behavior for a buffer by
    	  setting its <literal>cluster_level</literal> to
    	  <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS</literal>. 
    	</para>
          </listitem>
          <listitem>
    	<para>
    	  <emphasis>Level 2</emphasis> differs significantly in how it
    	  treats cluster values. In level 2, HarfBuzz never merges
    	  clusters.
    	</para>
    	<para>
    	  This difference can be seen most clearly when HarfBuzz processes
    	  ligature substitutions and glyph decompositions. In level 0
    	  and level 1, ligatures and glyph decomposition both involve
    	  merging clusters; in level 2, neither of these operations
    	  triggers a merge.
    	</para>
    	<para>
    	  Client programs can specify level 2 behavior for a buffer by
    	  setting its <literal>cluster_level</literal> to
    	  <literal>HB_BUFFER_CLUSTER_LEVEL_CHARACTERS</literal>. 
    	</para>
          </listitem>
        </itemizedlist>
        <para>
          As mentioned earlier, client programs using HarfBuzz often
          assign initial cluster values in a buffer by reusing the indices
          of the code points in the input text. This gives a sequence of
          cluster values that is monotonically increasing (for example,
          0,1,2,3,4).
        </para>
        <para>
          It is not <emphasis>required</emphasis> that the cluster values
          in a buffer be monotonically increasing. However, if the initial
          cluster values in a buffer are monotonic and the buffer is
          configured to use cluster level 0 or 1, then HarfBuzz
          guarantees that the final cluster values in the shaped buffer
          will also be monotonic. No such guarantee is made for cluster
          level 2.
        </para>
        <para>
          In levels 0 and 1, HarfBuzz implements the following conceptual
          model for cluster values:
        </para>
        <itemizedlist spacing="compact">
          <listitem>
    	<para>
              If the sequence of input cluster values is monotonic, the
    	  sequence of cluster values will remain monotonic.
    	</para>
          </listitem>
          <listitem>
    	<para>
              Each cluster value represents a single cluster.
    	</para>
          </listitem>
          <listitem>
    	<para>
              Each cluster contains one or more glyphs and one or more
              characters.
    	</para>
          </listitem>
        </itemizedlist>
        <para>
          In practice, this model offers several benefits. Assuming that
          the initial cluster values were monotonically increasing
          and distinct before shaping began, then, in the final output:
        </para>
        <itemizedlist spacing="compact">
          <listitem>
    	<para>
    	  All adjacent glyphs having the same final cluster
    	  value belong to the same cluster.
    	</para>
          </listitem>
          <listitem>
    	<para>
              Each character belongs to the cluster that has the highest
    	  cluster value <emphasis>not larger than</emphasis> its
    	  initial cluster value.
    	</para>
          </listitem>
        </itemizedlist>
      </section>
    
      <section id="a-clustering-example-for-levels-0-and-1">
        <title>A clustering example for levels 0 and 1</title>
        <para>
          The basic shaping operations affect clusters in a predictable
          manner when using level 0 or level 1: 
        </para>
        <itemizedlist>
          <listitem>
    	<para>
    	  When two or more clusters <emphasis>merge</emphasis>, the
    	  resulting merged cluster takes as its cluster value the
    	  <emphasis>minimum</emphasis> of the incoming cluster values.
    	</para>
          </listitem>
          <listitem>
    	<para>
    	  When a cluster <emphasis>decomposes</emphasis>, all of the
    	  resulting child clusters inherit as their cluster value the
    	  cluster value of the parent cluster.
    	</para>
          </listitem>
          <listitem>
    	<para>
    	  When a character is <emphasis>reordered</emphasis>, the
    	  reordered character and all clusters that the character
    	  moves past as part of the reordering are merged into one cluster.
    	</para>
          </listitem>
        </itemizedlist>
        <para>
          The functionality, guarantees, and benefits of level 0 and level
          1 behavior can be seen with some examples. First, let us examine
          what happens with cluster values when shaping involves cluster
          merging with ligatures and decomposition.
        </para>
    
        <para>
          Let's say we start with the following character sequence (top row) and
          initial cluster values (bottom row):
        </para>
        <programlisting>
          A,B,C,D,E
          0,1,2,3,4
        </programlisting>
        <para>
          During shaping, HarfBuzz maps these characters to glyphs from
          the font. For simplicity, let us assume that each character maps
          to the corresponding, identical-looking glyph:
        </para>
        <programlisting>
          A,B,C,D,E
          0,1,2,3,4
        </programlisting>
        <para>
          Now if, for example, <literal>B</literal> and <literal>C</literal>
          form a ligature, then the clusters to which they belong
          &quot;merge&quot;. This merged cluster takes for its cluster
          value the minimum of all the cluster values of the clusters that
          went in to the ligature. In this case, we get:
        </para>
        <programlisting>
          A,BC,D,E
          0,1 ,3,4
        </programlisting>
        <para>
          because 1 is the minimum of the set {1,2}, which were the
          cluster values of <literal>B</literal> and
          <literal>C</literal>. 
        </para>
        <para>
          Next, let us say that the <literal>BC</literal> ligature glyph
          decomposes into three components, and <literal>D</literal> also
          decomposes into two components. Whenever a cluster decomposes,
          its components each inherit the cluster value of their parent: 
        </para>
        <programlisting>
          A,BC0,BC1,BC2,D0,D1,E
          0,1  ,1  ,1  ,3 ,3 ,4
        </programlisting>
        <para>
          Next, if <literal>BC2</literal> and <literal>D0</literal> form a
          ligature, then their clusters (cluster values 1 and 3) merge into
          <literal>min(1,3) = 1</literal>:
        </para>
        <programlisting>
          A,BC0,BC1,BC2D0,D1,E
          0,1  ,1  ,1    ,1 ,4
        </programlisting>
        <para>
          Note that the entirety of cluster 3 merges into cluster 1, not
          just the <literal>D0</literal> glyph. This reflects the fact
          that the cluster <emphasis>must</emphasis> be treated as an
          indivisible unit.
        </para>
        <para>
          At this point, cluster 1 means: the character sequence
          <literal>BCD</literal> is represented by glyphs
          <literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any
          further.
        </para>
      </section>
      <section id="reordering-in-levels-0-and-1">
        <title>Reordering in levels 0 and 1</title>
        <para>
          Another common operation in some shapers is glyph
          reordering. In order to maintain a monotonic cluster sequence
          when glyph reordering takes place, HarfBuzz merges the clusters
          of everything in the reordering sequence.
        </para>
        <para>
          For example, let us again start with the character sequence (top
          row) and initial cluster values (bottom row):
        </para>
        <programlisting>
          A,B,C,D,E
          0,1,2,3,4
        </programlisting>
        <para>
          If <literal>D</literal> is reordered to the position immediately
          before <literal>B</literal>, then HarfBuzz merges the
          <literal>B</literal>, <literal>C</literal>, and
          <literal>D</literal> clusters &mdash; all the clusters between
          the final position of the reordered glyph and its original
          position. This means that we get:
        </para>
        <programlisting>
          A,D,B,C,E
          0,1,1,1,4
        </programlisting>
        <para>
          as the final cluster sequence.
        </para>
        <para>
          Merging this many clusters is not ideal, but it is the only
          sensible way for HarfBuzz to maintain the guarantee that the
          sequence of cluster values remains monotonic and to retain the
          true relationship between glyphs and characters.
        </para>
      </section>
      <section id="the-distinction-between-levels-0-and-1">
        <title>The distinction between levels 0 and 1</title>
        <para>
          The preceding examples demonstrate the main effects of using
          cluster levels 0 and 1. The only difference between the two
          levels is this: in level 0, at the very beginning of the shaping
          process, HarfBuzz merges the cluster of each base character
          with the clusters of all Unicode marks (combining or not) and
          modifiers that follow it.
        </para>
        <para>
          For example, let us start with the following character sequence
          (top row) and accompanying initial cluster values (bottom row):
        </para>
        <programlisting>
          A,acute,B
          0,1    ,2
        </programlisting>
        <para>
          The <literal>acute</literal> is a Unicode mark. If HarfBuzz is
          using cluster level 0 on this sequence, then the
          <literal>A</literal> and <literal>acute</literal> clusters will
          merge, and the result will become:
        </para>
        <programlisting>
          A,acute,B
          0,0    ,2
        </programlisting>
        <para>
          This merger is performed before any other script-shaping
          steps.
        </para>
        <para>
          This initial cluster merging is the default behavior of the
          Windows shaping engine, and the old HarfBuzz codebase copied
          that behavior to maintain compatibility. Consequently, it has
          remained the default behavior in the new HarfBuzz codebase.
        </para>
        <para>
          But this initial cluster-merging behavior makes it impossible
          for client programs to implement some features (such as to
          color diacritic marks differently from their base
          characters). That is why, in level 1, HarfBuzz does not perform
          the initial merging step.
        </para>
        <para>
          For client programs that rely on HarfBuzz cluster values to
          perform cursor positioning, level 0 is more convenient. But
          relying on cluster boundaries for cursor positioning is wrong: cursor
          positions should be determined based on Unicode grapheme
          boundaries, not on shaping-cluster boundaries. As such, using
          level 1 clustering behavior is recommended. 
        </para>
        <para>
          One final facet of levels 0 and 1 is worth noting. HarfBuzz
          currently does not allow any
          <emphasis>multiple-substitution</emphasis> GSUB lookups to 
          replace a glyph with zero glyphs (in other words, to delete a
          glyph).
        </para>
        <para>
          But, in some other situations, glyphs can be deleted. In
          those cases, if the glyph being deleted is the last glyph of its
          cluster, HarfBuzz makes sure to merge the deleted glyph's
          cluster with a neighboring cluster.
        </para>
        <para>
          This is done primarily to make sure that the starting cluster of the
          text always has the cluster index pointing to the start of the text
          for the run; more than one client program currently relies on this
          guarantee.
        </para>
        <para>
          Incidentally, Apple's CoreText does something different to
          maintain the same promise: it inserts a glyph with id 65535 at
          the beginning of the glyph string if the glyph corresponding to
          the first character in the run was deleted. HarfBuzz might do
          something similar in the future.
        </para>
      </section>
      <section id="level-2">
        <title>Level 2</title>
        <para>
          HarfBuzz's level 2 cluster behavior uses a significantly
          different model than that of level 0 and level 1.
        </para>
        <para>
          The level 2 behavior is easy to describe, but it may be
          difficult to understand in practical terms. In brief, level 2 
          performs no merging of clusters whatsoever.
        </para>
        <para>
          This means that there is no initial base-and-mark merging step
          (as is done in level 0), and it means that reordering moves and
          ligature substitutions do not trigger a cluster merge.
        </para>
        <para>
          Only one shaping operation directly affects clusters when using
          level 2:
        </para>
        <itemizedlist>
          <listitem>
    	<para>
    	  When a cluster <emphasis>decomposes</emphasis>, all of the
    	  resulting child clusters inherit as their cluster value the
    	  cluster value of the parent cluster.
    	</para>
          </listitem>
        </itemizedlist>
        <para>
          When glyphs do form a ligature (or when some other feature
          substitutes multiple glyphs with one glyph) the cluster value
          of the first glyph is retained as the cluster value for the
          resulting ligature.
        </para>
        <para>
          This occurrence sounds similar to a cluster merge, but it is
          different. In particular, no subsequent characters &mdash;
          including marks and modifiers &mdash; are affected. They retain
          their previous cluster values. 
        </para>
        <para>
          Level 2 cluster behavior is ultimately less complex than level 0
          or level 1, but there are several cases for which processing
          cluster values produced at level 2 may be tricky. 
        </para>
        <section id="ligatures-with-combining-marks-in-level-2">
          <title>Ligatures with combining marks in level 2</title>
          <para>
    	The first example of how HarfBuzz's level 2 cluster behavior
    	can be tricky is when the text to be shaped includes combining
    	marks attached to ligatures.
          </para>
          <para>
    	Let us start with an input sequence with the following
    	characters (top row) and initial cluster values (bottom row):
          </para>
          <programlisting>
    	A,acute,B,breve,C,circumflex
    	0,1    ,2,3    ,4,5
          </programlisting>
          <para>
    	If the sequence <literal>A,B,C</literal> forms a ligature,
    	then these are the cluster values HarfBuzz will return under
    	the various cluster levels:
          </para>
          <para>
    	Level 0:
          </para>
          <programlisting>
    	ABC,acute,breve,circumflex
    	0  ,0    ,0    ,0
          </programlisting>
          <para>
    	Level 1:
          </para>
          <programlisting>
    	ABC,acute,breve,circumflex
    	0  ,0    ,0    ,5
          </programlisting>
          <para>
    	Level 2:
          </para>
          <programlisting>
    	ABC,acute,breve,circumflex
    	0  ,1    ,3    ,5
          </programlisting>
          <para>
    	Making sense of the level 2 result is the hardest for a client
    	program, because there is nothing in the cluster values that
    	indicates that <literal>B</literal> and <literal>C</literal>
    	formed a ligature with <literal>A</literal>.
          </para>
          <para>
    	In contrast, the "merged" cluster values of the mark glyphs
    	that are seen in the level 0 and level 1 output are evidence
    	that a ligature substitution took place. 
          </para>
        </section>
        <section id="reordering-in-level-2">
          <title>Reordering in level 2</title>
          <para>
    	Another example of how HarfBuzz's level 2 cluster behavior
    	can be tricky is when glyphs reorder. Consider an input sequence
    	with the following characters (top row) and initial cluster
    	values (bottom row):
          </para>
          <programlisting>
    	A,B,C,D,E
    	0,1,2,3,4
          </programlisting>
          <para>
    	Now imagine <literal>D</literal> moves before
    	<literal>B</literal> in a reordering operation. The cluster
    	values will then be:
          </para>
          <programlisting>
    	A,D,B,C,E
    	0,3,1,2,4
          </programlisting>
          <para>
    	Next, if <literal>D</literal> forms a ligature with
    	<literal>B</literal>, the output is:
          </para>
          <programlisting>
    	A,DB,C,E
    	0,3 ,2,4
          </programlisting>
          <para>
    	However, in a different scenario, in which the shaping rules
    	of the script instead caused <literal>A</literal> and
    	<literal>B</literal> to form a ligature
    	<emphasis>before</emphasis> the <literal>D</literal> reordered, the
    	result would be:
          </para>
          <programlisting>
    	AB,D,C,E
    	0 ,3,2,4   
          </programlisting>
          <para>
    	There is no way for a client program to differentiate between
    	these two scenarios based on the cluster values
    	alone. Consequently, client programs that use level 2 might
    	need to undertake additional work in order to manage cursor
    	positioning, text attributes, or other desired features.
          </para>
        </section>
        <section id="other-considerations-in-level-2">
          <title>Other considerations in level 2</title>
          <para>
    	There may be other problems encountered with ligatures under
    	level 2, such as if the direction of the text is forced to
    	the opposite of its natural direction (for example, Arabic text
    	that is forced into left-to-right directionality). But,
    	generally speaking, these other scenarios are minor corner
    	cases that are too obscure for most client programs to need to
    	worry about.
          </para>
        </section>
      </section>
    </chapter>