11# Converting Unicode Code Point to Glyph Keyed Segmentations during IFT Encoding using Subsetter Glyph Closure
22
33Author: Garret Rieger
4- Date: Jan 27, 2025
4+ Date: Jan 27, 2025
5+ Updated: Oct 27, 2025
56
67## Introduction
78
@@ -46,46 +47,51 @@ procedures work in practice.
4647
4748## Status
4849
49- Development of a robust glyph segmentation process that produces performant, low-overhead
50- segmentations is an area of active development in the IFT encoder. The current prototype
51- implementation in [closure_glyph_segmenter.cc](../ift/encoder/closure_glyph_segmenter.cc) can produce
52- segmentations that satisfy the closure requirement, but does not yet necessarily produce ones that
53- are performant.
54-
55- The approach laid out in this document is just one possible approach to solving the problem. This
56- document aims primarily to describe how the prototype implementation in
57- [closure_glyph_segmenter.cc](../ift/encoder/closure_glyph_segmenter.cc) functions, and is not intended to
58- present the final (or only) solution to the problem. There are several unsolved problems and
59- remaining areas for development in this particular approach:
60-
61- * Input segmentation generation: selecting good quality input code point segmentations is critically
62- important to achieving high quality glyph segmentations. A high quality code point segmentation
63- will need to balance keeping interacting code points together with also keeping code points that
64- are commonly used together. This document and the implementation make no attempt to solve this
65- problem yet.
66-
67- * Patch merging: the process of combining patches from a found segmentation together in order to
68- reduce overall overhead (e.g. if there was a patch containing only one glyph the overhead cost of
69- the patch format and network transfer would dominate, therefore it may make sense to merge into
70- another similar patch). A very basic patch merging process has been included in the implementation
71- but there is lots of room for improvement. Notably, it does not yet handle conditional patch
72- merging. Additionally, a more advanced heuristic is needed for selecting which patches to merge.
50+ The current prototype implementation in
51+ [closure_glyph_segmenter.cc](../ift/encoder/closure_glyph_segmenter.cc) can produce segmentations
52+ that satisfy the closure requirement and are performant (via merging). The approach laid
53+ out in this document is just one possible approach to solving the problem. This document aims
54+ primarily to describe how the prototype implementation in
55+ [closure_glyph_segmenter.cc](../ift/encoder/closure_glyph_segmenter.cc) functions, and is not
56+ intended to present the final (or only) solution to the problem. There are several unsolved problems
57+ and remaining areas for development in this particular approach:
58+
59+ * Much of the ongoing work is on the "merger" which is a sub-problem of producing segmentations.
60+ That's discussed in the separate
61+ [closure_glyph_segmentation_merging.md](closure_glyph_segmentation_merging.md) document.
62+ See the implementation status and areas for further development sections for more specifics.
63+
64+ * Running the segmenter currently requires manual configuration to get good results. Configuration
65+ is needed to select appropriate frequency data and settings for parameters controlling merger
66+ behaviour. The goal is to get to the point where good results can be produced with zero
67+ configuration.
68+
69+ * Support for merging segmentations involving multiple overlapping scripts is not yet implemented
70+ (for example creating a segmentation that supports Chinese and Japanese simultaneously).
7371
7472* [Multi segment analysis](#multi-segment-dependencies): the current implementation only does single
7573 segment analysis which in some cases leaves sizable fallback glyph sets. How to implement multi
7674 segment analysis is an open question and more development is needed.
7775
76+ * Input segmentation generation: the glyph segmentation process starts with an existing
77+ codepoint/feature based segmentation. Good results can be achieved by starting with one input
78+ segment per codepoint/feature and letting merging join segments as needed. However, there is still
79+ value in starting with a good quality input segmentation that places commonly used codepoints
80+ together. This can significantly reduce the amount of work the merger needs to do. Therefore it
81+ may be useful to develop functionality that creates a first pass input segmentation based on
82+ codepoint frequency data.
83+
7884* Incorporating dependency information: whatever produces the input code point segments will likely
7985 have discovered dependency information related to those code points. That information can be
8086 reused in this process to narrow selections during patch merging and multi segment
8187 analysis. Future work will look at adding dependency information as an optional input to this
8288 procedure.
8389
84- One of the main downsides of this approach is its reliance on a subsetting closure function, which
85- is computationally costly. Complex fonts can require hundreds of closure operations and as a
86- result can be slow to process. So another area of open research is whether a non-closure based approach
87- could be developed that is computationally cheaper (for example by producing a segmentation by
88- working directly with the substitutions and dependencies encoded in a font).
90+ * One of the main downsides of this approach is its reliance on a subsetting closure function, which
91+ is computationally costly. Complex fonts can require hundreds of closure operations and as a
92+ result can be slow to process. So another area of open research is whether a non-closure based approach
93+ could be developed that is computationally cheaper (for example by producing a segmentation by
94+ working directly with the substitutions and dependencies encoded in a font).
8995
9096## Goals
9197
@@ -100,12 +106,15 @@ The segmentation procedure described in this document aims to achieve the follow
100106 values. The input unicode code point segmentations are used to form the conditions.
101107
102108* Optimize for minimal data transfer by avoiding duplicating glyphs across patches where possible.
109+
110+ * Support optimization of a generated segmentation via merging to reduce network overhead.
103111
104112* The chosen glyph segmentation and activation conditions must satisfy the closure requirement:
105113
106114 The set of glyphs contained in patches loaded for a font subset definition (a set of Unicode
107115 code points and a set of layout feature tags) through the patch map tables must be a superset of
108116 those in the glyph closure of the font subset definition.
117+
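The closure requirement above can be illustrated with a small sketch. This is not taken from the encoder: the toy font, the `toy_closure` function, and the modelling of an activation condition as a conjunction of code point segments are all assumptions made for illustration. In practice the closure would come from a font subsetter.

```python
from typing import Callable, FrozenSet, Iterable, List, Set, Tuple

# Hypothetical model: a patch is (condition, glyph ids). The condition is a list
# of code point segments; the patch activates when the requested subset
# intersects every segment in the list (a conjunction of segments).
Patch = Tuple[List[FrozenSet[int]], Set[int]]

# Toy font (assumption): 'f' -> glyph 1, 'i' -> glyph 2, ligature f+i -> glyph 3.
TOY_CMAP = {0x66: 1, 0x69: 2}


def toy_closure(codepoints: Iterable[int]) -> Set[int]:
    """Stand-in for a subsetter glyph closure over the toy font."""
    glyphs = {TOY_CMAP[cp] for cp in codepoints if cp in TOY_CMAP}
    if {1, 2} <= glyphs:
        glyphs.add(3)  # the ligature glyph becomes reachable
    return glyphs


def satisfies_closure_requirement(
    subset: Set[int],
    patches: List[Patch],
    initial_glyphs: Set[int],
    closure: Callable[[Iterable[int]], Set[int]],
) -> bool:
    """True if the glyphs loaded for `subset` are a superset of its closure."""
    loaded = set(initial_glyphs)
    for condition, glyphs in patches:
        if all(subset & segment for segment in condition):
            loaded |= glyphs
    return closure(subset) <= loaded


s1, s2 = frozenset({0x66}), frozenset({0x69})
# Two exclusive patches plus a conditional patch carrying the ligature glyph.
good_patches: List[Patch] = [([s1], {1}), ([s2], {2}), ([s1, s2], {3})]
# Without the conditional patch the ligature glyph is never loaded.
bad_patches: List[Patch] = [([s1], {1}), ([s2], {2})]
```

With these definitions, `satisfies_closure_requirement({0x66, 0x69}, good_patches, set(), toy_closure)` holds, while the same call with `bad_patches` fails because glyph 3 is in the closure but in no loaded patch.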
109118
110119## Subsetter Glyph Closures
111120
@@ -227,90 +236,19 @@ data in the initial font. In these cases leaving them in the fallback patch may
227236
228237## Merging
229238
230- In some cases the patch set produced above may result in some patches that contain a small number of
231- glyphs. Because there is a per patch overhead cost (from network and patch format overhead) it may
232- be desirable to merge some patches together in order to meet some minimum size target. Patches can
233- be merged so long as it's done in a way that preserves the glyph closure requirement.
234-
235- There are two types of patches that can be merged: exclusive and conditional. The procedure for
236- merging is dependent on the type. The next two sections provide some guidelines for merging the two
237- types together.
238-
239- ### Merging Input Segments
240-
241- This section outlines a procedure to find and merge input code point segments in order to increase
242- the sizes of one or more exclusive patches. It searches for other input segments that interact with
243- the one that needs to be enlarged. Keeping interacting code points together in a single segment is
244- preferable since it reduces the number of conditional patches needed, thereby reducing overall overhead.
239+ When starting with an input segmentation that is fine grained (for example using one segment per
240+ code point) the resulting glyph segmentation may involve a large number of patches. This results in
241+ excessive network overhead when loading the patches. Performance can be increased by selectively merging
242+ patches together to reduce overhead. This is a complex problem as it needs to be done in a way that
243+ avoids excessive transfers of glyph data that isn't needed.
245244
246- Starting with:
247-
248- * A set of patches and activation conditions produced by the "Segmenting Glyphs Based on Closure
249- Analysis" procedure.
250- * A desired minimum and maximum patch size in bytes.
251-
252- Then, for each segment $s_i$ in $s_1$ through $s_n$ ordered by expected frequency of use (high to low):
253-
254- 1. If the associated exclusive patch does not exist or it is larger than the minimum size skip this segment.
255-
256- 2. Locate one or more segments to merge:
257-
258- a. Sort all glyph patches by their conditions. First on the number of segments in the condition ascending and
259- then by the segment frequency (high to low).
260-
261- b. Return the set of segments in the condition of the first patch from that list where $s_i$
262- appears somewhere in the patch’s condition.
263-
264- c. If no such patch was found then instead select another exclusive patch which has the closest
265- frequency to $s_i$ and return the associated segment.
245+ The segmenter currently implements a cost based merging algorithm which selects merges that minimize
246+ an overall cost function. This process is documented in detail in
247+ [closure_glyph_segmentation_merging.md](closure_glyph_segmentation_merging.md).
266248
267- 3. Generate a new $s'_i$ which is the union of $s_i$ and all returned segments from step 2. Add it
268- to the input segment list.
269-
270- 4. Remove the $s_i$ and all segments from step 2 from all per glyph conditions and the input segment list.
271-
272- 5. Re-run the "Segment Closure Analysis" closure test on $s'_i$ and update per glyph conditions as needed.
273-
274- 6. Based on the updated per glyph conditions, recompute the overall patch and condition sets
275- following "Segmenting Glyphs Based on Closure Analysis". If the new patch for $s'_i$ is larger
276- than the maximum allowed size, undo the changes from steps 3-6 and go back to step 2 to continue
277- searching for more candidate segments. Ignore the previously selected group.
278-
279- 7. Re-run this process to find and fix the next patch which is too small. If none remain then the
280- process is finished.
281-
282-
283- ### Merging Conditional Patches
284-
285- At a high level two conditional patches can be merged together by creating a new patch containing
286- the union of the glyphs in the two and assigning a new condition which is a superset of the two
287- original conditions. Merging patches in this way avoids duplicating glyphs, but results in more
288- relaxed overall activation conditions meaning some of the glyphs will be loaded when not strictly
289- needed.
290-
291- Merged conditions for conditional patches can be created by adding a disjunction between the two
292- overall conditions. For example if the two conditions were ($s_1 \wedge s_2$), ($s_2 \wedge s_3$)
293- then a new condition $(s_1 \wedge s_2) \vee (s_2 \wedge s_3)$ could be used for a combined
294- patch. This condition will activate the combined patch when either of the original conditions would
295- have matched. When selecting patches to merge, patches that have a smaller symmetric difference
296- between the segments in their conditions should be prioritized as that will minimize the widening of
297- the activation condition.
298-
299- There are also two other options for merging conditional patches, though these are generally less
300- preferable than the merging procedure described above:
301-
302- 1. Move the patch's glyphs into the initial font or merge with the fallback patch. This removes the
303- patch at the cost of always loading the glyph's data. This may be useful when there is a very
304- small patch with a wide activation condition. In this case it may not be desirable to merge with
305- other larger conditional patches due to excessive widening of their activation conditions.
306-
307- 2. Duplicate the patch's glyphs into the segments that make up the patch's condition. This
308- eliminates the patch at the cost of duplicating glyph data. It may be useful in cases of small
309- patches with narrow activation conditions.
310-
311- More research is needed in this area, and ultimately it's likely that a selection heuristic which
312- takes into account segment frequency to assess the impact of condition widening will need to be
313- developed.
249+ The best segmentation results so far have been obtained by starting with one input segment per
250+ code point in the font and then letting the merger figure out how to best place them together in a way
251+ that minimizes overall cost.
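
The trade-off that a cost based merger has to weigh can be shown with a toy model. This is only a sketch under stated assumptions and is not the cost function used by the segmenter (that is described in the separate merging document): the function names, the fixed per-request overhead, and the assumption that segments are used independently are all hypothetical.

```python
def expected_cost(prob: float, size_bytes: float, overhead_bytes: float) -> float:
    """Expected bytes transferred for one patch: its size plus a fixed
    per-request overhead, weighted by the probability the patch is needed."""
    return prob * (size_bytes + overhead_bytes)


def merge_delta(p1: float, b1: float, p2: float, b2: float,
                overhead_bytes: float) -> float:
    """Change in expected cost if two exclusive patches are merged.

    Assumes (hypothetically) that the two segments are used independently,
    so the merged patch is needed with probability 1 - (1 - p1) * (1 - p2).
    A negative result means the merge reduces expected cost.
    """
    separate = (expected_cost(p1, b1, overhead_bytes)
                + expected_cost(p2, b2, overhead_bytes))
    p_either = 1.0 - (1.0 - p1) * (1.0 - p2)
    merged = expected_cost(p_either, b1 + b2, overhead_bytes)
    return merged - separate


# Two tiny patches that are both almost always needed: overhead dominates.
small_common = merge_delta(0.9, 100, 0.9, 100, overhead_bytes=500)
# A common large patch and a rarely needed large patch: wasted glyph data dominates.
common_plus_rare = merge_delta(0.9, 50_000, 0.01, 50_000, overhead_bytes=500)
```

Under this model `small_common` is negative (merging saves a request's worth of overhead) and `common_plus_rare` is positive (merging forces most users to download glyph data they rarely need), which is the balance between overhead and unneeded transfers described above.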
314252
315253## Multi Segment Dependencies
316254