Add a canonical encoding of identifiers as numbers and use it in clightgen #353

xavierleroy · 2020-05-02T12:10:56Z

This PR is a possible solution to the need expressed in #311 and to some extent #222.

The problem

Within CompCert, identifiers (names of C functions, variables, types, etc) are represented by unique positive numbers, sometimes called "atoms".

In the original implementation, atoms 1, 2, ..., N are assigned to identifiers as they are encountered. The resulting number are small and are efficient when used as keys in data structures such as PTrees. However, the mapping from C source-level identifiers to atoms differs between compilation units. This is not a problem for CompCert but complicates CompCert-based verification tools
that need to combine several compilation units.

The proposed solution

This commit introduces an alternate implementation of atoms, suggested by @andrew-appel. The choice between implementations is governed by the Boolean reference Camlcoq.use_canonical_atoms.

In the alternate implementation, identifiers are converted to bit sequences via a Huffman encoding, then the bits are represented as positive numbers. The same identifier is always represented by the
same number. However, the numbers are usually bigger than in the original implementation, making PTree operations slower: lookups and updates take time linear in the length of the identifier, instead of logarithmic time in the number of identifiers encountered.

The CompCert compiler (the ccomp executable) still uses the original implementation, but the clightgen tool used in conjunction with the VST program logic can use either implementations:

The alternate "canonical atoms" implementation is used by default, and also if the -canonical-idents option is given.
The original implementation is used if the -short-idents option is given.

This should make it possible to compare performance and convenience of the ttwo approaches on VST.

…htgen

Within CompCert, identifiers (names of C functions, variables, types,
etc) are represented by unique positive numbers, sometimes called
"atoms".

In the original implementation, atoms 1, 2, ..., N are assigned
to identifiers as they are encountered. The resulting number
are small and are efficient when used as keys in data structures
such as PTrees. However, the mapping from C source-level identifiers
to atoms differs between compilation units. This is not a problem
for CompCert but complicates CompCert-based verification tools
that need to combine several compilation units.

This commit introduces an alternate implementation of atoms, suggested
by Andrew Appel. The choice between implementations is governed by
the Boolean reference Camlcoq.use_canonical_atoms.

In the alternate implementation, identifiers are converted to bit
sequences via a Huffman encoding, then the bits are represented as
positive numbers. The same identifier is always represented by the
same number. However, the numbers are usually bigger than in the
original implementation, making PTree operations slower: lookups and
updates take time linear in the length of the identifier, instead of
logarithmic time in the number of identifiers encountered.

The CompCert compiler (the ccomp executable) still uses the original
implementation, but the clightgen tool used in conjunction with the
VST program logic can use either implementations:

The alternate "canonical atoms" implementation is used by default,
and also if the -canonical-idents option is given.
The original implementation is used if the -short-idents option is
given.

…htgen Within CompCert, identifiers (names of C functions, variables, types, etc) are represented by unique positive numbers, sometimes called "atoms". In the original implementation, atoms 1, 2, ..., N are assigned to identifiers as they are encountered. The resulting number are small and are efficient when used as keys in data structures such as PTrees. However, the mapping from C source-level identifiers to atoms differs between compilation units. This is not a problem for CompCert but complicates CompCert-based verification tools that need to combine several compilation units. This commit introduces an alternate implementation of atoms, suggested by Andrew Appel. The choice between implementations is governed by the Boolean reference `Camlcoq.use_canonical_atoms`. In the alternate implementation, identifiers are converted to bit sequences via a Huffman encoding, then the bits are represented as positive numbers. The same identifier is always represented by the same number. However, the numbers are usually bigger than in the original implementation, making PTree operations slower: lookups and updates take time linear in the length of the identifier, instead of logarithmic time in the number of identifiers encountered. The CompCert compiler (the `ccomp` executable) still uses the original implementation, but the `clightgen` tool used in conjunction with the VST program logic can use either implementations: - The alternate "canonical atoms" implementation is used by default, and also if the `-canonical-idents` option is given. - The original implementation is used if the `-short-idents` option is given.

andrew-appel · 2020-05-02T19:42:35Z

In this message I report on how the new string->positive mapping affects the performance of VST-Floyd, the proof automation system of the Verified Software Toolchain. In many places, VST-Floyd uses computational PTrees (binary tries) indexed by identifiers, that is, positive numbers. As the identifiers grow longer, the tries grow larger, and construction and lookup become slower. VST-Floyd runs entirely in Coq, so performance in extracted Ocaml is not relevant. PTree lookups are part of proofs, checked by Qed; that is, much of the performance cost may come from how Coq handles proof trees.

There are different PTrees for different name spaces: extern global variables, struct/union names, nonaddressable locals (temps, in Clight). The most frequent lookups are for temps, so it is most important that those bitstrings (identifiers) be short.

Even though it is desirable that names be short, we also strongly desire that extern global variable names should be consistent between compilation units. Hence, even if canonical-idents is slightly slower than short-idents, it would be preferable.

In addition to the "traditional" short-idents scheme (as in CompCert 3.7), and the new canonical-idents scheme (as in this pull request), I also measured a "hybrid-idents" scheme, in which global variables (and struct/union names) are canonical, but temp names are short.

I measured the performance of VST/progs/verif_strlib.v, which verifies implementations of C library functions strlen, strchr, strcat, strcmp, strcpy; the file verifies each function twice, two different proof scripts using slightly different proof-automation styles.

short-idents: 316 seconds
canonical-idents: 324 seconds [see my next message for worse news]
hybrid-idents: 263 seconds

I don't have formal measurements of this, but I believe that canonical-idents uses more memory than short-idents, and in a make -j situation, the performance loss would be increased.

Below are excerpts of preambles of strlib.v, compiled from strlib.c using clightgen -normalize. From this we learn:

The new -canonical-idents scheme does not re-use compiler-introduced temporaries (_t'1, _t'2, etc.), as short-idents does; instead, it uses rather large numbers for compiler temporaries.
The -short-idents scheme "wastes" the smallest positive numbers on external functions that are never referenced in Clight source code. The -hybrid-idents scheme is faster because it uses smaller positive numbers for local variables even than -short-idents.

-short-idents
Definition ___builtin_annot : ident := 9%positive.
Definition ___builtin_annot_intval : ident := 10%positive.
Definition ___builtin_bswap : ident := 2%positive.
Definition ___builtin_bswap16 : ident := 4%positive.
Definition ___builtin_bswap32 : ident := 3%positive.
Definition ___builtin_bswap64 : ident := 1%positive.
Definition ___builtin_clz : ident := 35%positive.
Definition ___builtin_clzl : ident := 36%positive.
Definition ___builtin_clzll : ident := 37%positive.
(*  . . . [many lines deleted] . . .   *)
Definition _str : ident := 52%positive.
Definition _str1 : ident := 62%positive.
Definition _str2 : ident := 63%positive.
Definition _strcat : ident := 61%positive.
Definition _strchr : ident := 56%positive.
Definition _strcmp : ident := 66%positive.
Definition _strcpy : ident := 59%positive.
Definition _strlen : ident := 67%positive.
Definition _t'1 : ident := 69%positive.
Definition _t'2 : ident := 70%positive.
Definition _t'3 : ident := 71%positive.

-canonical-idents
Definition _str : ident := $"str".
Definition _str1 : ident := $"str1".
Definition _str2 : ident := $"str2".
Definition _strcat : ident := $"strcat".
Definition _strchr : ident := $"strchr".
Definition _strcmp : ident := $"strcmp".
Definition _strcpy : ident := $"strcpy".
Definition _strlen : ident := $"strlen".
Definition _t'1 : ident := 128%positive.
Definition _t'17413854 : ident := 17413981%positive.
Definition _t'17413855 : ident := 17413982%positive.
Definition _t'24494863 : ident := 24494990%positive.
Definition _t'24494864 : ident := 24494991%positive.
Definition _t'374494 : ident := 374621%positive.

-hybrid-idents
Definition _str : ident := 11%positive.
Definition _str1 : ident := 12%positive.
Definition _str2 : ident := 13%positive.
Definition _strcat : ident := $"strcat".
Definition _strchr : ident := $"strchr".
Definition _strcmp : ident := $"strcmp".
Definition _strcpy : ident := $"strcpy".
Definition _strlen : ident := $"strlen".
Definition _t'1 : ident := 14%positive.
Definition _t'17413854 : ident := 15%positive.
Definition _t'17413855 : ident := 16%positive.
Definition _t'24494863 : ident := 17%positive.
Definition _t'24494864 : ident := 18%positive.
Definition _t'374494 : ident := 19%positive.

andrew-appel · 2020-05-03T12:17:03Z

I tested also on some larger function-bodies, with more local variables, than the simple strlib functions. These numbers are from verifications of function bodies from the SHA-2 verification; the C code is in VST sha/sha.c.

Bottom line: canonical-idents slows down VST by 33%, compared to "hybrid" or "short" identifiers. I vigorously request the "hybrid" mode.

hybrid	short	canon	file
20.54	16.89	30.99	verif_sha_init.v
38.67	37.20	55.36	verif_sha_update3.v
25.95	29.10	36.18	verif_sha_update4.v
36.83	37.18	50.60	verif_sha_update.v
23.09	25.90	28.35	verif_sha_bdo4.v
23.23	28.16	32.63	verif_sha_bdo7.v
21.43	25.11	29.51	verif_sha_bdo8.v
7.48	7.56	7.16	verif_sha_bdo.v
22.99	20.78	28.67	verif_sha_final2.v
84.05	75.74	106.24	verif_sha_final3.v
39.48	36.78	51.52	verif_sha_final.v
8.80	9.90	11.08	verif_addlength.v
10.86	13.26	16.19	verif_SHA256.v
363.4	363.6	484.4	TOTAL

484.4 / 363.5 = 1.33

xavierleroy · 2020-05-05T08:24:53Z

OK, l'ets forget about this PR, then.

xavierleroy · 2020-05-06T15:06:13Z

Looks like I misunderstood: performance on VST is not that great, indeed, but the proposed canonical encoding is better than nothing, so let's reopen.

xavierleroy · 2020-05-11T08:09:38Z

Any feedback from @alxest (regarding PR #311) and @jeremie-koenig (regarding PR #222) ?

alxest · 2020-05-12T07:18:15Z

Thank you for your attention to this matter, @xavierleroy and @andrew-appel !
canonical-idents definitely solves my problem.
If hybrid-idents provides consistent mapping for all global definitions, that is also sufficient for my problem.

One litmus test might be this property.

a.c && b.c are linkable (using gnu linker)
a.cl = clightgen(a.c)
b.cl = clightgen(b.c)
-----------------------------------------------------------------------
a.cl && b.cl are linkable (using CompCert's Linker_prog extract)

This is desirable for the following scenarios.

Modular verification in CompCertM style.
Here, we verify a.cl, b.cl modularly, and then combine these results together to establish verification for ab.cl (a.cl && b.cl linked using CompCert's Linker_prog).
In this scenario, we have some maintenance burden because of inconsistent string-to-ident mapping.
- Basically, I would like to run clightgen for each module separately. However, the resulting .cl files are not linkable. Therefore, I should manually give consistent mapping (like this) and modify a.cl, b.cl accordingly.
- As a workaround, I can run clightgen only once with all modules as input. This way, the resulting .cl files are linkable. However, it also has drawbacks:
  - Adding a local temporary in a module affects others, and this breaks incremental verification (we need to recompile the whole project).
  - Not applicable in verification in an open setting. For instance, we might want to verify sort library function where extern cmp is later provided by the clients.
Supporting thin-LTO style optimizations, where each module is thin-linked (only analysis results, not function bodies) in the middle of the compilation. If we have a consistent mapping for global definitions, we can avoid unnecessary complications/TCB.

xavierleroy · 2020-05-14T14:43:34Z

I think the "litmus test" holds, up to a known limitation with static variables or functions: if two compilation units define a static with the same name, the system linker succeeds, but CompCert's Linker_prog fails, as documented here:

CompCert/common/Linking.v

Lines 238 to 250 in 4a67662

    
           (** Linking two compilation units.  Compilation units are represented like 
        
             whole programs using the type [program F V].  If a name has 
        
             a global definition in one unit but not in the other, this definition 
        
             is left unchanged in the result of the link.  If a name has 
        
             global definitions in both units, and is public (not static) in both, 
        
             the two definitions are linked as per [Linker_def] above. 
        
             If one or both definitions are static (not public), we should ideally 
        
             rename it so that it can be kept unchanged in the result of the link. 
        
             This would require a general notion of renaming of global identifiers 
        
             in programs that we do not have yet.  Hence, as a first step, linking 
        
             is undefined if static definitions with the same name appear in both 
        
             compilation units. *)

jeremie-koenig · 2020-05-14T16:07:18Z

I'm not on top of this issue at the moment, but I think this would be quite useful for CertiKOS --- currently we have to maintain a big global table of identifiers. Since we don't do a lot of PTree lookups in our proofs, I don't expect any performance issues to be a concern for us at this point.

xavierleroy · 2020-05-19T08:25:07Z

Thank you all for the feedback. There is agreement that the new code is useful, so I'll merge now.

xavierleroy closed this May 5, 2020

xavierleroy reopened this May 6, 2020

xavierleroy merged commit 0eba6f6 into master May 19, 2020

xavierleroy deleted the canonical-atoms branch May 19, 2020 08:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add a canonical encoding of identifiers as numbers and use it in clightgen #353

Add a canonical encoding of identifiers as numbers and use it in clightgen #353

Uh oh!

xavierleroy commented May 2, 2020

Uh oh!

andrew-appel commented May 2, 2020 •

edited

Loading

Uh oh!

andrew-appel commented May 3, 2020

Uh oh!

xavierleroy commented May 5, 2020

Uh oh!

xavierleroy commented May 6, 2020

Uh oh!

xavierleroy commented May 11, 2020

Uh oh!

alxest commented May 12, 2020 •

edited

Loading

Uh oh!

xavierleroy commented May 14, 2020

Uh oh!

jeremie-koenig commented May 14, 2020

Uh oh!

xavierleroy commented May 19, 2020

Uh oh!

Uh oh!

Add a canonical encoding of identifiers as numbers and use it in clightgen #353

Add a canonical encoding of identifiers as numbers and use it in clightgen #353

Uh oh!

Conversation

xavierleroy commented May 2, 2020

The problem

The proposed solution

Uh oh!

andrew-appel commented May 2, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andrew-appel commented May 3, 2020

Uh oh!

xavierleroy commented May 5, 2020

Uh oh!

xavierleroy commented May 6, 2020

Uh oh!

xavierleroy commented May 11, 2020

Uh oh!

alxest commented May 12, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

xavierleroy commented May 14, 2020

Uh oh!

jeremie-koenig commented May 14, 2020

Uh oh!

xavierleroy commented May 19, 2020

Uh oh!

Uh oh!

andrew-appel commented May 2, 2020 •

edited

Loading

alxest commented May 12, 2020 •

edited

Loading