Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a canonical encoding of identifiers as numbers and use it in clightgen #353

Merged
merged 1 commit into from
May 19, 2020

Conversation

xavierleroy
Copy link
Contributor

This PR is a possible solution to the need expressed in #311 and to some extent #222.

The problem

Within CompCert, identifiers (names of C functions, variables, types, etc) are represented by unique positive numbers, sometimes called "atoms".

In the original implementation, atoms 1, 2, ..., N are assigned to identifiers as they are encountered. The resulting number are small and are efficient when used as keys in data structures such as PTrees. However, the mapping from C source-level identifiers to atoms differs between compilation units. This is not a problem for CompCert but complicates CompCert-based verification tools
that need to combine several compilation units.

The proposed solution

This commit introduces an alternate implementation of atoms, suggested by @andrew-appel. The choice between implementations is governed by the Boolean reference Camlcoq.use_canonical_atoms.

In the alternate implementation, identifiers are converted to bit sequences via a Huffman encoding, then the bits are represented as positive numbers. The same identifier is always represented by the
same number. However, the numbers are usually bigger than in the original implementation, making PTree operations slower: lookups and updates take time linear in the length of the identifier, instead of logarithmic time in the number of identifiers encountered.

The CompCert compiler (the ccomp executable) still uses the original implementation, but the clightgen tool used in conjunction with the VST program logic can use either implementations:

  • The alternate "canonical atoms" implementation is used by default, and also if the -canonical-idents option is given.
  • The original implementation is used if the -short-idents option is given.

This should make it possible to compare performance and convenience of the ttwo approaches on VST.

…htgen

Within CompCert, identifiers (names of C functions, variables, types,
etc) are represented by unique positive numbers, sometimes called
"atoms".

In the original implementation, atoms 1, 2, ..., N are assigned
to identifiers as they are encountered. The resulting number
are small and are efficient when used as keys in data structures
such as PTrees. However, the mapping from C source-level identifiers
to atoms differs between compilation units. This is not a problem
for CompCert but complicates CompCert-based verification tools
that need to combine several compilation units.

This commit introduces an alternate implementation of atoms, suggested
by Andrew Appel. The choice between implementations is governed by
the Boolean reference Camlcoq.use_canonical_atoms.

In the alternate implementation, identifiers are converted to bit
sequences via a Huffman encoding, then the bits are represented as
positive numbers. The same identifier is always represented by the
same number. However, the numbers are usually bigger than in the
original implementation, making PTree operations slower: lookups and
updates take time linear in the length of the identifier, instead of
logarithmic time in the number of identifiers encountered.

The CompCert compiler (the ccomp executable) still uses the original
implementation, but the clightgen tool used in conjunction with the
VST program logic can use either implementations:

  • The alternate "canonical atoms" implementation is used by default,
    and also if the -canonical-idents option is given.
  • The original implementation is used if the -short-idents option is
    given.

…htgen

Within CompCert, identifiers (names of C functions, variables, types,
etc) are represented by unique positive numbers, sometimes called
"atoms".

In the original implementation, atoms 1, 2, ..., N are assigned
to identifiers as they are encountered.  The resulting number
are small and are efficient when used as keys in data structures
such as PTrees.  However, the mapping from C source-level identifiers
to atoms differs between compilation units.  This is not a problem
for CompCert but complicates CompCert-based verification tools
that need to combine several compilation units.

This commit introduces an alternate implementation of atoms, suggested
by Andrew Appel.  The choice between implementations is governed by
the Boolean reference `Camlcoq.use_canonical_atoms`.

In the alternate implementation, identifiers are converted to bit
sequences via a Huffman encoding, then the bits are represented as
positive numbers.  The same identifier is always represented by the
same number.  However, the numbers are usually bigger than in the
original implementation, making PTree operations slower: lookups and
updates take time linear in the length of the identifier, instead of
logarithmic time in the number of identifiers encountered.

The CompCert compiler (the `ccomp` executable) still uses the original
implementation, but the `clightgen` tool used in conjunction with the
VST program logic can use either implementations:
- The alternate "canonical atoms" implementation is used by default,
  and also if the `-canonical-idents` option is given.
- The original implementation is used if the `-short-idents` option is
  given.
@andrew-appel
Copy link

andrew-appel commented May 2, 2020

In this message I report on how the new string->positive mapping affects the performance of VST-Floyd, the proof automation system of the Verified Software Toolchain. In many places, VST-Floyd uses computational PTrees (binary tries) indexed by identifiers, that is, positive numbers. As the identifiers grow longer, the tries grow larger, and construction and lookup become slower. VST-Floyd runs entirely in Coq, so performance in extracted Ocaml is not relevant. PTree lookups are part of proofs, checked by Qed; that is, much of the performance cost may come from how Coq handles proof trees.

There are different PTrees for different name spaces: extern global variables, struct/union names, nonaddressable locals (temps, in Clight). The most frequent lookups are for temps, so it is most important that those bitstrings (identifiers) be short.

Even though it is desirable that names be short, we also strongly desire that extern global variable names should be consistent between compilation units. Hence, even if canonical-idents is slightly slower than short-idents, it would be preferable.

In addition to the "traditional" short-idents scheme (as in CompCert 3.7), and the new canonical-idents scheme (as in this pull request), I also measured a "hybrid-idents" scheme, in which global variables (and struct/union names) are canonical, but temp names are short.

I measured the performance of VST/progs/verif_strlib.v, which verifies implementations of C library functions strlen, strchr, strcat, strcmp, strcpy; the file verifies each function twice, two different proof scripts using slightly different proof-automation styles.

short-idents: 316 seconds
canonical-idents: 324 seconds [see my next message for worse news]
hybrid-idents: 263 seconds

I don't have formal measurements of this, but I believe that canonical-idents uses more memory than short-idents, and in a make -j situation, the performance loss would be increased.

Below are excerpts of preambles of strlib.v, compiled from strlib.c using clightgen -normalize. From this we learn:

  • The new -canonical-idents scheme does not re-use compiler-introduced temporaries (_t'1, _t'2, etc.), as short-idents does; instead, it uses rather large numbers for compiler temporaries.
  • The -short-idents scheme "wastes" the smallest positive numbers on external functions that are never referenced in Clight source code. The -hybrid-idents scheme is faster because it uses smaller positive numbers for local variables even than -short-idents.
-short-idents
Definition ___builtin_annot : ident := 9%positive.
Definition ___builtin_annot_intval : ident := 10%positive.
Definition ___builtin_bswap : ident := 2%positive.
Definition ___builtin_bswap16 : ident := 4%positive.
Definition ___builtin_bswap32 : ident := 3%positive.
Definition ___builtin_bswap64 : ident := 1%positive.
Definition ___builtin_clz : ident := 35%positive.
Definition ___builtin_clzl : ident := 36%positive.
Definition ___builtin_clzll : ident := 37%positive.
(*  . . . [many lines deleted] . . .   *)
Definition _str : ident := 52%positive.
Definition _str1 : ident := 62%positive.
Definition _str2 : ident := 63%positive.
Definition _strcat : ident := 61%positive.
Definition _strchr : ident := 56%positive.
Definition _strcmp : ident := 66%positive.
Definition _strcpy : ident := 59%positive.
Definition _strlen : ident := 67%positive.
Definition _t'1 : ident := 69%positive.
Definition _t'2 : ident := 70%positive.
Definition _t'3 : ident := 71%positive.
-canonical-idents
Definition _str : ident := $"str".
Definition _str1 : ident := $"str1".
Definition _str2 : ident := $"str2".
Definition _strcat : ident := $"strcat".
Definition _strchr : ident := $"strchr".
Definition _strcmp : ident := $"strcmp".
Definition _strcpy : ident := $"strcpy".
Definition _strlen : ident := $"strlen".
Definition _t'1 : ident := 128%positive.
Definition _t'17413854 : ident := 17413981%positive.
Definition _t'17413855 : ident := 17413982%positive.
Definition _t'24494863 : ident := 24494990%positive.
Definition _t'24494864 : ident := 24494991%positive.
Definition _t'374494 : ident := 374621%positive.
-hybrid-idents
Definition _str : ident := 11%positive.
Definition _str1 : ident := 12%positive.
Definition _str2 : ident := 13%positive.
Definition _strcat : ident := $"strcat".
Definition _strchr : ident := $"strchr".
Definition _strcmp : ident := $"strcmp".
Definition _strcpy : ident := $"strcpy".
Definition _strlen : ident := $"strlen".
Definition _t'1 : ident := 14%positive.
Definition _t'17413854 : ident := 15%positive.
Definition _t'17413855 : ident := 16%positive.
Definition _t'24494863 : ident := 17%positive.
Definition _t'24494864 : ident := 18%positive.
Definition _t'374494 : ident := 19%positive.

@andrew-appel
Copy link

I tested also on some larger function-bodies, with more local variables, than the simple strlib functions. These numbers are from verifications of function bodies from the SHA-2 verification; the C code is in VST sha/sha.c.

Bottom line: canonical-idents slows down VST by 33%, compared to "hybrid" or "short" identifiers. I vigorously request the "hybrid" mode.

hybrid short canon file
20.54 16.89 30.99 verif_sha_init.v
38.67 37.20 55.36 verif_sha_update3.v
25.95 29.10 36.18 verif_sha_update4.v
36.83 37.18 50.60 verif_sha_update.v
23.09 25.90 28.35 verif_sha_bdo4.v
23.23 28.16 32.63 verif_sha_bdo7.v
21.43 25.11 29.51 verif_sha_bdo8.v
7.48 7.56 7.16 verif_sha_bdo.v
22.99 20.78 28.67 verif_sha_final2.v
84.05 75.74 106.24 verif_sha_final3.v
39.48 36.78 51.52 verif_sha_final.v
8.80 9.90 11.08 verif_addlength.v
10.86 13.26 16.19 verif_SHA256.v
363.4 363.6 484.4 TOTAL

484.4 / 363.5 = 1.33

@xavierleroy
Copy link
Contributor Author

OK, l'ets forget about this PR, then.

@xavierleroy xavierleroy closed this May 5, 2020
@xavierleroy
Copy link
Contributor Author

Looks like I misunderstood: performance on VST is not that great, indeed, but the proposed canonical encoding is better than nothing, so let's reopen.

@xavierleroy xavierleroy reopened this May 6, 2020
@xavierleroy
Copy link
Contributor Author

Any feedback from @alxest (regarding PR #311) and @jeremie-koenig (regarding PR #222) ?

@alxest
Copy link

alxest commented May 12, 2020

Thank you for your attention to this matter, @xavierleroy and @andrew-appel !
canonical-idents definitely solves my problem.
If hybrid-idents provides consistent mapping for all global definitions, that is also sufficient for my problem.

One litmus test might be this property.

a.c && b.c are linkable (using gnu linker)
a.cl = clightgen(a.c)
b.cl = clightgen(b.c)
-----------------------------------------------------------------------
a.cl && b.cl are linkable (using CompCert's Linker_prog extract)

This is desirable for the following scenarios.

  • Modular verification in CompCertM style.
    Here, we verify a.cl, b.cl modularly, and then combine these results together to establish verification for ab.cl (a.cl && b.cl linked using CompCert's Linker_prog).
    In this scenario, we have some maintenance burden because of inconsistent string-to-ident mapping.
    • Basically, I would like to run clightgen for each module separately. However, the resulting .cl files are not linkable. Therefore, I should manually give consistent mapping (like this) and modify a.cl, b.cl accordingly.
    • As a workaround, I can run clightgen only once with all modules as input. This way, the resulting .cl files are linkable. However, it also has drawbacks:
      • Adding a local temporary in a module affects others, and this breaks incremental verification (we need to recompile the whole project).
      • Not applicable in verification in an open setting. For instance, we might want to verify sort library function where extern cmp is later provided by the clients.
  • Supporting thin-LTO style optimizations, where each module is thin-linked (only analysis results, not function bodies) in the middle of the compilation. If we have a consistent mapping for global definitions, we can avoid unnecessary complications/TCB.

@xavierleroy
Copy link
Contributor Author

I think the "litmus test" holds, up to a known limitation with static variables or functions: if two compilation units define a static with the same name, the system linker succeeds, but CompCert's Linker_prog fails, as documented here:

CompCert/common/Linking.v

Lines 238 to 250 in 4a67662

(** Linking two compilation units. Compilation units are represented like
whole programs using the type [program F V]. If a name has
a global definition in one unit but not in the other, this definition
is left unchanged in the result of the link. If a name has
global definitions in both units, and is public (not static) in both,
the two definitions are linked as per [Linker_def] above.
If one or both definitions are static (not public), we should ideally
rename it so that it can be kept unchanged in the result of the link.
This would require a general notion of renaming of global identifiers
in programs that we do not have yet. Hence, as a first step, linking
is undefined if static definitions with the same name appear in both
compilation units. *)

@jeremie-koenig
Copy link
Contributor

I'm not on top of this issue at the moment, but I think this would be quite useful for CertiKOS --- currently we have to maintain a big global table of identifiers. Since we don't do a lot of PTree lookups in our proofs, I don't expect any performance issues to be a concern for us at this point.

@xavierleroy
Copy link
Contributor Author

Thank you all for the feedback. There is agreement that the new code is useful, so I'll merge now.

@xavierleroy xavierleroy merged commit 0eba6f6 into master May 19, 2020
@xavierleroy xavierleroy deleted the canonical-atoms branch May 19, 2020 08:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants