Make the string-ident maps available from Coq code #222
Now that the string intern tables can be queried from Coq code, we no longer need to allocate explicit ident values to symbols, but can instead use their original names explicitly.
In particular, use the name [AST.string_of_ident] and declare it in the same places, so that a merge would be straightforward.
We used to have
Even though this PR looks pretty good, I'm still worried about potential problems with:
Regarding "different runs of CompCert or of clightgen can use different mappings of identifiers to strings", I have the following suggestion (discussed last week with Jeremie and others): For external idents only, CompCert and clightgen should derive the "positive" ident from the bits of the ASCII string. (There's a bijection; perhaps that's ident_of_string/string_of_ident, I haven't looked.) For local idents, CompCert and clightgen should NOT do this. The reason is that looking up huge positive numbers in PTrees is quite slow (proportional to the log2 of the number), and we don't want to do that for all the local variables. For the locals, CompCert/clightgen should continue to use the smallest possible positive values. This solution still needs a bit more thought before implementing, but I think it could be compatible with this current pull request.
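To make the cost argument above concrete, here is a minimal OCaml sketch of one possible string-to-positive embedding. It models a positive as the list of its bits below the implicit leading 1, so the list length is the depth at which the key would sit in a PTree. The 8-bits-per-character encoding and the names bits_of_string, bits_of_small_positive and ptree_depth are assumptions made for illustration; this is not CompCert's actual ident_of_string.

```ocaml
(* Hypothetical embedding: a positive is modeled as the list of its bits
   below the implicit leading 1, most significant bit first. *)
let bits_of_char (c : char) : bool list =
  let n = Char.code c in
  List.init 8 (fun i -> (n lsr (7 - i)) land 1 = 1)

(* Concatenate the 8-bit codes of all characters. Strings of different
   lengths give different bit counts, so the embedding is injective. *)
let bits_of_string (s : string) : bool list =
  List.concat_map bits_of_char (List.of_seq (String.to_seq s))

(* Bits of a small positive n >= 1, such as those allocated for locals. *)
let bits_of_small_positive (n : int) : bool list =
  let rec go n acc =
    if n <= 1 then acc else go (n lsr 1) ((n land 1 = 1) :: acc)
  in
  go n []

(* Depth of the key in a binary trie indexed by these bits. *)
let ptree_depth (bits : bool list) : int = List.length bits

let () =
  Printf.printf "depth for \"main\": %d\n" (ptree_depth (bits_of_string "main"));
  Printf.printf "depth for ident 42: %d\n" (ptree_depth (bits_of_small_positive 42))
```

Under these assumptions a four-character global name already sits 32 levels deep, against 5 levels for a small allocated positive, which is the gap the suggestion tries to avoid for locals.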
I think we're trying to work around a Coq limitation rather than addressing it upfront. The limitation, I think, is the lack of computationally-efficient character strings: one that extracts to OCaml strings (and not lists of characters) + string operations (fast comparison), supports computation within Coq, and has the "..." syntactic sugar. Then we could use strings instead of ident for global, named things, and AVL maps or red-black maps instead of PTree for maps.
Thanks for the background and suggestions. In addition to the subpar performance of the primitive operations for extracted Coq strings, the performance boost from string interning itself may be significant (@andrew-appel pointed this out to me). In any case, I've been approaching the problem with the mindset that the official name and external representation of a symbol should always be a
Re: different mappings across executions
In that spirit, different runs of CompCert or clightgen may indeed use different mappings, but these mappings will only be transient and It is true that if we're not careful, we could break this assumption (either now or in future changes). As far as soundness goes, the fact that
Possible solution. I believe we could mitigate this by:
I will try to come up with an implementation of this soonish; managing uninterned symbols in a good way and updating SimplExpr seem to be the most subtle / labor-intensive parts of that.
Re: computability within Coq
As far as the computability of The problem with this approach is that its state would have to be passed around everywhere. The types of The "cleverness" of my pull request is in recognizing that this statefulness can be hidden from Coq (though I recognize that "clever" is not always a positive thing from a software engineering and correctness perspective). But I'm afraid there's a trade-off here where we can only accomplish this by implementing the string interning in ML, barring some future change in proof assistant technology.
Re: embedding string into ident
It is possible to define a bijection or embedding of One thing that could be worth exploring would be to offer two options for implementing the string-ident mapping: the clean, constant, pure one, computable within Coq, implemented as an actual bijection between Another middle-of-the-road approach might be to forgo the performance benefits of string interning: have
Instead, we use regular symbols, but make sure that they are not legal C identifiers, so that they cannot conflict with those parsed from the source program. Specifically, the temporaries introduced by the SimplExpr pass are of the form "$1", "$2", etc. (which is similar in effect to the old behavior when they are printed out), and temporaries introduced to separate memory loads in clightgen normalization are of the form "@1", "@2", etc.
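The naming scheme described above can be sketched in OCaml as follows. The functions is_c_identifier and make_fresh are hypothetical names introduced only for this illustration, and the identifier check follows the basic C grammar (letters, digits, underscore), ignoring compiler extensions that accept other characters.

```ocaml
(* A basic C identifier starts with a letter or '_' and continues with
   letters, digits, or '_'. Compiler-specific extensions are ignored. *)
let is_c_identifier (s : string) : bool =
  let is_start c =
    (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c = '_' in
  let is_rest c = is_start c || (c >= '0' && c <= '9') in
  let n = String.length s in
  let rec rest i = i >= n || (is_rest s.[i] && rest (i + 1)) in
  n > 0 && is_start s.[0] && rest 1

(* Hypothetical generator: each call to make_fresh creates an independent
   counter, and the returned closure yields prefix ^ "1", prefix ^ "2", ... *)
let make_fresh (prefix : char) : unit -> string =
  let counter = ref 0 in
  fun () ->
    incr counter;
    Printf.sprintf "%c%d" prefix !counter

let () =
  let fresh = make_fresh '$' in
  let a = fresh () in
  let b = fresh () in
  Printf.printf "%s %s legal_c_identifier=%b\n" a b (is_c_identifier a)
```

Because every generated name starts with a character outside the identifier alphabet, it cannot collide with a name parsed from the source program, whatever counter value is reached.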
AST.ident values can now only be created through Ident.of_string, and can only be used through Ident.to_string. This guarantees that the concrete positive <-> string mapping used is immaterial and that ident values can never be communicated with the outside world. This discipline is enforced both in the Coq and OCaml code: the Camlcoq functions intern_string and extern_atom now use the sealed Ident module (and type) as extracted from Coq. Independently, the FastIdent OCaml module provides the string interning backend necessary for Symbols.v, but it cannot be used to bypass abstraction. The new structure also makes it easy to switch out the implementation of the Ident module in the future, or provide an alternative one, for instance if we want to implement a version of Ident where of_string and to_string can compute within Coq, sacrificing performance.
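As a rough illustration of the sealing discipline described in this commit message, the OCaml sketch below hides the representation of identifiers behind a signature, so that clients can only go through of_string and to_string. The signature name IDENT, the equal function, and the Hashtbl-based interning are assumptions for the example; they are not the extracted Ident module or the FastIdent backend.

```ocaml
(* Hypothetical signature: t is abstract, so client code cannot observe
   or forge the concrete number behind an identifier. *)
module type IDENT = sig
  type t
  val of_string : string -> t
  val to_string : t -> string
  val equal : t -> t -> bool   (* assumed helper, not named in the source *)
end

module Ident : IDENT = struct
  type t = int

  (* Interning tables, filled on demand. *)
  let table : (string, int) Hashtbl.t = Hashtbl.create 17
  let names : (int, string) Hashtbl.t = Hashtbl.create 17
  let next = ref 1

  let of_string s =
    match Hashtbl.find_opt table s with
    | Some id -> id
    | None ->
      let id = !next in
      incr next;
      Hashtbl.add table s id;
      Hashtbl.add names id s;
      id

  (* Every value of type t was produced by of_string, so the lookup succeeds. *)
  let to_string id = Hashtbl.find names id

  let equal (a : t) (b : t) = a = b
end

let () =
  let foo = Ident.of_string "foo" in
  let bar = Ident.of_string "bar" in
  Printf.printf "%s %s same=%b\n"
    (Ident.to_string foo) (Ident.to_string bar) (Ident.equal foo bar)
```

Since the numbers never escape the module, the order in which strings happen to be interned cannot influence anything a client observes, and a different implementation of the same signature could be swapped in later.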
I realize I'm jumping into something that is probably a dead thread, but I recently started using VST and this seems to be a big issue for the type of verification that I'd like to do. I'm probably the least knowledgeable about this topic given that I only very recently started using VST, but for my purposes I need but it seems like the basic issue here is a trade-off between performance and functionality. From what I can tell:
Benefits of (small)
Benefits of
Is there anything that I'm missing? If the trade-off really is performance vs functionality, then has anyone compared the runtimes/memory usage? I assume that there is some significant factor going on here. Is the critical thing performance of the extracted code, or performance of the Gallina code?
I did a little bit of performance checks on a few representations (this is doing 1000 equality checks of morally 20 character strings):
If map inserts/lookups are the real underlying issue, then I could try some similar benchmarks if you'd like. I'm certain that small positives will be faster than the alternatives but I'm interested to know by how much (and how much it matters). Further, if there is a separate representation of locals and globals (e.g. split
A few years ago I made a performance measurement of this question. I compared the performance of VST with 5-to-10-bit positives, as we have now, versus 60-bit positives, still using the PTree representation for lookup tables. The performance of VST as a whole slowed down by a factor of three. This is probably because VST uses PTrees very heavily in its internal computational reflection. The 60-bit positives were to simulate ASCII strings as sequences of bits. Therefore, I concluded that the use of ASCII strings for identifiers directly would cause severe performance degradation of VST. You can do this experiment yourself! Simply edit the header of a .v file produced by clightgen, to make all the positive numbers (used for identifiers) much longer, and see how long it now takes to run that file through a VST verification.
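The effect of key length on a PTree-style lookup can be seen in a small OCaml model. This is a simplified binary trie keyed by the bits of a positive integer, written for illustration only; it is not CompCert's PTree from Maps.v, it assumes 64-bit OCaml integers, and it only counts visited nodes rather than reproducing the VST timing measurement.

```ocaml
(* Simplified positive-keyed binary trie (a model, not CompCert's PTree). *)
type 'a ptree = Leaf | Node of 'a ptree * 'a option * 'a ptree

(* Keys are positive ints consumed from the least significant bit;
   reaching k = 1 (the leading bit) marks the end of the key. *)
let rec set (k : int) (v : 'a) (t : 'a ptree) : 'a ptree =
  let (l, o, r) =
    match t with
    | Leaf -> (Leaf, None, Leaf)
    | Node (l, o, r) -> (l, o, r)
  in
  if k = 1 then Node (l, Some v, r)
  else if k land 1 = 0 then Node (set (k lsr 1) v l, o, r)
  else Node (l, o, set (k lsr 1) v r)

(* Returns the value found (if any) and the number of nodes visited. *)
let get (k : int) (t : 'a ptree) : 'a option * int =
  let rec go k t steps =
    match t with
    | Leaf -> (None, steps)
    | Node (l, o, r) ->
      if k = 1 then (o, steps + 1)
      else if k land 1 = 0 then go (k lsr 1) l (steps + 1)
      else go (k lsr 1) r (steps + 1)
  in
  go k t 0

let () =
  let small = 42 in                   (* a 6-bit positive *)
  let large = (1 lsl 59) lor 42 in    (* a 60-bit positive; needs 64-bit ints *)
  let t = set small "small" (set large "large" Leaf) in
  Printf.printf "small key: %d nodes visited\n" (snd (get small t));
  Printf.printf "large key: %d nodes visited\n" (snd (get large t))
```

In this model each lookup visits one node per bit of the key, so the 60-bit key costs ten times as many steps as the 6-bit one; how that translates into overall VST slowdown depends on how much of the run time is spent in such lookups.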
Given that the space of strings is so sparse, I would imagine that something more efficient than PTree would be necessary. My thought was AVL trees from the standard library would be a better choice for this. It does require comparison on strings at every node though so that is a trade-off. If globals and locals were syntactically distinct, e.g. locals using small positive and globals using string, do you think there would be much overhead?
Most of the PTree lookups are on locals, but there are still some on globals. You can also do this experiment yourself! Edit the .v file output from clightgen, just to make the positives for globals huge, but not for locals, and see how much VST slows down.
The "size_arguments" function and its properties can be systematically derived from the "loc_arguments" function and its properties. Before, the RISC-V port used this derivation, and all other ports used hand-written "size_arguments" functions and proofs. This commit moves the definition of "size_arguments" to the platform-independent file backend/Conventions.v, using the systematic derivation, and removes the platform-specific definitions. This reduces code and proof size, and makes it easier to change the calling conventions.
This addresses the problem discussed here, and was inspired by a conversation with @andrew-appel, @lennartberinger, @scuellar, @pwilke. Since this is largely an independent issue I decided to create a new pull request.
Problem summary: When we use CompCert as a library for verification and directly define ASTs from within Coq, we need to hardcode some AST.ident values for the symbols we manipulate (as done by clightgen, for example). If the program is ever to be printed out, we would need to seed the ML tables with the corresponding strings we want (as done in CertiKOS). This is inelegant and a maintenance burden, since there is a potential for out-of-date, inconsistent, or conflicting mappings between independently developed modules.
The solution implemented here allows Coq code to access the ML functions intern_string and extern_atom defined in Camlcoq.ml. They are axiomatized as a total, bijective mapping between string and ident. In practice, the mapping is built on demand, but if we're careful we can make sure the statefulness is not observable.
Interface
As declared in AST.v:
Changes to clightgen
I updated clightgen to take advantage of this new interface and removed the ident allocation logic. Before, it would emit a .v file that would look like:
After my change, it can mention symbols directly:
As an extra benefit, if this AST or its compiled form was ever to be printed out by the ML code, we would get the correct output of "foo" and "bar" rather than "$42" and "$43", with no additional effort. This is relevant for CertiKOS at least.
As a potential issue, note that we no longer have concrete values for identifiers, which may create issues if VST or CertiKOS use reflection tactics which expect to be able to compare idents. Nevertheless, equal symbols will still unify, and we can still provide a pre-extraction decision procedure for (#x = #y) specifically, by using string_dec x y and the two axioms.
Changes to PrintClight
I also modified PrintClight.ml to use extern_atom for names of temporaries instead of handling them as a special case and printing out their numerical ident values. Temporaries introduced by the SimplExpr pass will have no corresponding string, and extern_atom will make up a fresh name for them, starting with "$1". Variables transformed into temporaries by SimplLocals would retain their names (but currently the Clight code is printed out before SimplLocals).
This change is not essential but the fresh names look better than the (larger, seemingly arbitrary) numerical ident values printed out by temp_name.