Add a canonical encoding of identifiers as numbers and use it in clightgen #353
Within CompCert, identifiers (names of C functions, variables, types, etc.) are represented by unique positive numbers, sometimes called "atoms".
In the original implementation, atoms 1, 2, ..., N are assigned to identifiers as they are encountered. The resulting numbers are small and are efficient when used as keys in data structures such as PTrees. However, the mapping from C source-level identifiers to atoms differs between compilation units. This is not a problem for CompCert but complicates CompCert-based verification tools that need to combine several compilation units.
This commit introduces an alternate implementation of atoms, suggested by Andrew Appel. The choice between implementations is governed by the Boolean reference `Camlcoq.use_canonical_atoms`.
In the alternate implementation, identifiers are converted to bit sequences via a Huffman encoding, then the bits are represented as positive numbers. The same identifier is always represented by the same number. However, the numbers are usually bigger than in the original implementation, making PTree operations slower: lookups and updates take time linear in the length of the identifier, instead of logarithmic time in the number of identifiers encountered.
The CompCert compiler (the `ccomp` executable) still uses the original implementation, but the `clightgen` tool used in conjunction with the VST program logic can use either implementation:
- The alternate "canonical atoms" implementation is used by default, and also if the `-canonical-idents` option is given.
- The original implementation is used if the `-short-idents` option is given.
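The lookup-cost remark in the commit message can be illustrated with a small binary trie keyed by positive integers. This is only a sketch in Python, not CompCert's Coq `PTree`: the names `Node`, `trie_set`, and `trie_get` are hypothetical, and the sketch assumes keys are consumed from the low-order bit, one bit per trie level, so that a lookup follows one edge per bit below the leading 1.

```python
class Node:
    """One node of a binary trie keyed by positive integers (illustrative only)."""
    __slots__ = ("value", "zero", "one")

    def __init__(self):
        self.value = None
        self.zero = None  # subtree for keys whose next bit is 0
        self.one = None   # subtree for keys whose next bit is 1


def trie_set(root, key, value):
    """Store value under key, creating nodes along the key's bit path."""
    if key < 1:
        raise ValueError("keys must be positive")
    node = root
    while key > 1:
        attr = "one" if key & 1 else "zero"
        child = getattr(node, attr)
        if child is None:
            child = Node()
            setattr(node, attr, child)
        node = child
        key >>= 1
    node.value = value


def trie_get(root, key):
    """Return (value, steps), where steps is the number of edges followed."""
    if key < 1:
        raise ValueError("keys must be positive")
    node, steps = root, 0
    while key > 1 and node is not None:
        node = node.one if key & 1 else node.zero
        key >>= 1
        steps += 1
    return (node.value if node is not None else None), steps


root = Node()
short_key = 5                   # a small atom, as sequential numbering would give
long_key = (1 << 40) | 12345    # a made-up 41-bit atom standing in for a canonical one
trie_set(root, short_key, "temp")
trie_set(root, long_key, "global")

print(trie_get(root, short_key))  # ('temp', 2)
print(trie_get(root, long_key))   # ('global', 40)
```

The step count equals the bit length of the key minus one, which is why atoms derived from the full text of an identifier cost more per lookup than small sequential atoms.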
In this message I report on how the new string->positive mapping affects the performance of VST-Floyd, the proof automation system of the Verified Software Toolchain.
In many places, VST-Floyd uses computational PTrees (binary tries) indexed by identifiers, that is, positive numbers. As the identifiers grow longer, the tries grow larger, and construction and lookup become slower. VST-Floyd runs entirely in Coq, so performance in extracted OCaml is not relevant. PTree lookups are part of proofs, checked by Qed; that is, much of the performance cost may come from how Coq handles proof trees.
There are different PTrees for different name spaces: extern global variables, struct/union names, nonaddressable locals (temps, in Clight). The most frequent lookups are for temps, so it is most important that those bitstrings (identifiers) be short.
Even though it is desirable that names be short, we also strongly desire that extern global variable names should be consistent between compilation units. Hence, even if canonical-idents is slightly slower than short-idents, it would be preferable.
In addition to the "traditional" short-idents scheme (as in CompCert 3.7), and the new canonical-idents scheme (as in this pull request), I also measured a "hybrid-idents" scheme, in which global variables (and struct/union names) are canonical, but temp names are short.
I measured the performance of VST/progs/verif_strlib.v, which verifies implementations of C library functions strlen, strchr, strcat, strcmp, strcpy; the file verifies each function twice, with two different proof scripts using slightly different proof-automation styles.
short-idents: 316 seconds
I don't have formal measurements of this, but I believe that canonical-idents uses more memory than short-idents, and in a
Below are excerpts of preambles of strlib.v, compiled from strlib.c using clightgen -normalize. From this we learn:
I also tested on some larger function bodies, with more local variables than the simple strlib functions. These numbers are from verifications of function bodies from the SHA-2 verification; the C code is in VST sha/sha.c. Bottom line: canonical-idents slows down VST by 33%, compared to "hybrid" or "short" identifiers. I vigorously request the "hybrid" mode.
484.4 / 363.5 = 1.33
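The "hybrid" scheme requested above could look roughly like the following Python sketch: globals and struct/union names get a name-derived number, while temps get small sequential numbers. Everything here is an assumption made for illustration. `HybridIdents` and `stand_in_canonical` are hypothetical names, the stand-in encoding is a plain byte encoding rather than the Huffman encoding of this PR, and the sketch keeps the two name spaces in separate tables, so it does not address possible collisions between a temp atom and a global atom.

```python
def stand_in_canonical(name: str) -> int:
    """Stand-in canonical encoding: a sentinel byte followed by the name's bytes.

    This is NOT the encoding used in the PR; it only has the property that the
    same name always yields the same positive number.
    """
    return int.from_bytes(b"\x01" + name.encode("ascii"), "big")


class HybridIdents:
    """Per-compilation-unit numbering: canonical globals, short temps."""

    def __init__(self):
        self._temps = {}

    def global_atom(self, name: str) -> int:
        # Depends only on the name, so it agrees across compilation units.
        return stand_in_canonical(name)

    def temp_atom(self, name: str) -> int:
        # Small numbers in order of first encounter, local to this unit.
        if name not in self._temps:
            self._temps[name] = len(self._temps) + 1
        return self._temps[name]


unit_a, unit_b = HybridIdents(), HybridIdents()

# The global `strlen` gets the same atom in both units.
print(unit_a.global_atom("strlen") == unit_b.global_atom("strlen"))  # True

# Temps stay short, at the price of being unit-specific.
print(unit_a.temp_atom("i"), unit_a.temp_atom("j"))  # 1 2
print(unit_b.temp_atom("j"))                         # 1
```

The intent is that the most frequent lookups (temps) use keys of a few bits, while names that must agree between compilation units remain canonical.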
OK, let's forget about this PR, then.
Looks like I misunderstood: performance on VST is not that great, indeed, but the proposed canonical encoding is better than nothing, so let's reopen.
Any feedback from @alxest (regarding PR #311) and @jeremie-koenig (regarding PR #222)?
Thank you for your attention to this matter, @xavierleroy and @andrew-appel ! One litmus test might be this property.
This is desirable for the following scenarios.
I think the "litmus test" holds, up to a known limitation with static variables or functions: if two compilation units define a static with the same name, the system linker succeeds, but CompCert's Linker_prog fails, as documented here: Lines 238 to 250 in 4a67662
I'm not on top of this issue at the moment, but I think this would be quite useful for CertiKOS: currently we have to maintain a big global table of identifiers. Since we don't do a lot of PTree lookups in our proofs, I don't expect any performance issues to be a concern for us at this point.
Thank you all for the feedback. There is agreement that the new code is useful, so I'll merge now. |
This PR is a possible solution to the need expressed in #311 and to some extent #222.
The problem
Within CompCert, identifiers (names of C functions, variables, types, etc.) are represented by unique positive numbers, sometimes called "atoms".
In the original implementation, atoms 1, 2, ..., N are assigned to identifiers as they are encountered. The resulting numbers are small and are efficient when used as keys in data structures such as PTrees. However, the mapping from C source-level identifiers to atoms differs between compilation units. This is not a problem for CompCert but complicates CompCert-based verification tools that need to combine several compilation units.
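The mismatch between compilation units can be seen in a few lines. The sketch below is in Python purely for illustration; `SequentialAtoms` and `atoms_for_unit` are hypothetical names, and the two identifier lists are made up.

```python
class SequentialAtoms:
    """Assigns atoms 1, 2, ..., N in order of first encounter (illustrative only)."""

    def __init__(self):
        self._table = {}

    def intern(self, name: str) -> int:
        # A new identifier gets the next unused positive number.
        if name not in self._table:
            self._table[name] = len(self._table) + 1
        return self._table[name]


def atoms_for_unit(identifiers):
    """Number the identifiers of one compilation unit with a fresh table."""
    table = SequentialAtoms()
    return {name: table.intern(name) for name in identifiers}


# Two hypothetical compilation units that both mention `strlen`.
unit_a = atoms_for_unit(["main", "strlen", "buf"])
unit_b = atoms_for_unit(["strlen", "helper"])

print(unit_a["strlen"], unit_b["strlen"])  # 2 1
```

Both units stay internally consistent, but a tool that combines them has to reconcile the two numberings of `strlen`.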
The proposed solution
This commit introduces an alternate implementation of atoms, suggested by @andrew-appel. The choice between implementations is governed by the Boolean reference `Camlcoq.use_canonical_atoms`.
In the alternate implementation, identifiers are converted to bit sequences via a Huffman encoding, then the bits are represented as positive numbers. The same identifier is always represented by the same number. However, the numbers are usually bigger than in the original implementation, making PTree operations slower: lookups and updates take time linear in the length of the identifier, instead of logarithmic time in the number of identifiers encountered.
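As a rough illustration of the idea, the Python sketch below builds a Huffman code over an assumed character-frequency table, concatenates the per-character codes, and reads the result as a positive number. The weights in `ALPHABET_WEIGHTS`, the leading sentinel bit, and all function names are assumptions made for this example; the actual code table and bit-to-positive convention used in CompCert are not shown in this thread.

```python
import heapq
import string
from itertools import count


def huffman_code(freqs):
    """Build a prefix-free code {symbol: bitstring} from a frequency table."""
    tiebreak = count()  # unique counter so heap entries never compare dicts
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in sorted(freqs.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, left = heapq.heappop(heap)
        f1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f0 + f1, next(tiebreak), merged))
    return heap[0][2]


# Assumed weights: lowercase letters are treated as most frequent.
ALPHABET_WEIGHTS = {c: 4 for c in string.ascii_lowercase}
ALPHABET_WEIGHTS.update({c: 2 for c in string.ascii_uppercase + string.digits})
ALPHABET_WEIGHTS["_"] = 3

CODE = huffman_code(ALPHABET_WEIGHTS)
DECODE = {bits: ch for ch, bits in CODE.items()}


def canonical_atom(name: str) -> int:
    """Map an identifier to a positive number that depends only on its text.

    A leading 1 bit is prepended (an assumption of this sketch) so that
    leading zero bits of the encoding are not lost. Characters outside the
    assumed alphabet raise KeyError.
    """
    bits = "".join(CODE[ch] for ch in name)
    return int("1" + bits, 2)


def decode_atom(atom: int) -> str:
    """Invert canonical_atom by reading one prefix-free code at a time."""
    bits = bin(atom)[3:]  # drop the "0b" prefix and the sentinel bit
    out, current = [], ""
    for b in bits:
        current += b
        if current in DECODE:
            out.append(DECODE[current])
            current = ""
    if current:
        raise ValueError("not a valid encoded identifier")
    return "".join(out)


atom = canonical_atom("strlen")
print(atom.bit_length(), decode_atom(atom))
```

Because the number depends only on the identifier's characters, any two runs (or compilation units) agree on it. The bit length grows with the identifier's length, which is the source of the slower PTree operations noted above.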
The CompCert compiler (the `ccomp` executable) still uses the original implementation, but the `clightgen` tool used in conjunction with the VST program logic can use either implementation:
- The alternate "canonical atoms" implementation is used by default, and also if the `-canonical-idents` option is given.
- The original implementation is used if the `-short-idents` option is given.
This should make it possible to compare performance and convenience of the two approaches on VST.