Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the string-ident maps available from Coq code #222

Closed
wants to merge 6 commits into from

Conversation

jeremie-koenig
Copy link
Contributor

This addresses the problem discussed here, and was inspired by a conversation with @andrew-appel, @lennartberinger, @scuellar, @pwilke. Since this is largely an independent issue I decided to create a new pull request.

Problem summary: When we use CompCert as a library for verification and directly define ASTs from within Coq, we need to hardcode some AST.ident values for the symbols we manipulate (as done by clightgen for example). If the program is ever to be printed out, we would need to seed the ML tables with the corresponding strings we want (as done in CertiKOS). This is inelegant and a maintenance burden, since there is a potential for out-of-date, inconsistent, or conflicting mappings between independently developed modules.

The solution implemented here allows Coq code to access to the ML functions intern_string and extern_atom defined in Camlcoq.ml. They are axiomatized as a total, bijective maping between string and ident. In practice, the mapping is built on-demand, but if we're careful we can make sure the statefulness is not observable.

Interface

As declared in AST.v:

Parameter ident_of_string: string -> ident.
Parameter string_of_ident: ident -> string.

Axiom string_of_ident_of_string:
  forall s, string_of_ident (ident_of_string s) = s.
Axiom ident_of_string_of_ident:
  forall i, ident_of_string (string_of_ident i) = i.

Notation "# s" := (ident_of_string s) (at level 1, format "'#' s").

Changes to clightgen

I updated clightgen to take advantage of this new interface and removed the ident allocation logic. Before, it would emit a .v file that would look like:

...
Definition _foo := 42%positive.
Definition _bar := 43%positive.
...
Definition myast := ... _foo ... _bar ... _foo ...

After my change, it can mention symbols directly:

Definition myast := ... #"foo" ... #"bar" ... #"foo" ...

As an extra benefit, if this AST or its compiled form was ever to be printed out by the ML code, we would get the correct output of "foo" and "bar" rather than "$42" and "$43", with no additional effort. This is relevant for CertiKOS at least.

As a potential issue, note that we no longer have concrete values for identifiers, which may create issues if VST or CertiKOS use reflection tactics which expect to be able to compare idents. Nevertheless, equal symbols will still unify, and we can still provide a pre-extraction decision procedure for (#x = #y) specifically, by using string_dec x y and the two axioms.

Changes to PrintClight

I also modified PrintClight.ml to use extern_atom for names of temporaries instead of handling them as a special case and printing out their numerical ident values. Temporaries introduced by the SimplExpr pass will have no corresponding string, and extern_atom will make up a fresh name for them, starting with "$1". Variables transformed into temporaries by SimplLocals would retain their names (but currently the Clight code is printed out before SimplLocals).

This change is not essential but the fresh names look better than the (larger, seemingly arbitrary) numerical ident values printed out by temp_name.

Now that the string intern tables can be queried from Coq code, we no
longer need to allocate explicit ident values to symbols, but can
instead use their original names explicitely.
jeremie-koenig added a commit to CertiKOS/compcert that referenced this pull request Feb 19, 2018
In particular, use the name [AST.string_of_ident] and declare it in the
same places, so that a merge would be straightforward.
@xavierleroy
Copy link
Contributor

We used to have AST.ident_of_string (but not string_of_ident) but it was causing trouble and was removed, see commit 7a6bb90 .

Even though this PR looks pretty good, I'm still worried of potential problems with:

  • the fact that ident_of_string and string_of_ident do not compute within Coq;
  • the fact that different runs of CompCert or of clightgen can use different mappings of identifiers to strings.

@andrew-appel
Copy link

Regarding "different runs of CompCert or of clightgen can use different mappings of identifiers to strings", I have the following suggestion (discussed last week with Jeremie and others):

For external idents only, CompCert and clightgen should derive the "positive" ident from the bits of the Ascii string. (There's a bijection . . . perhaps that's ident_of_string/string_of_ident, I haven't looked).

For local idents, CompCert and clightgen should NOT do this. The reason is that looking up huge positive numbers in PTrees is quite slow (proportional to the log2 of the number), and we don't want to do that for all the local variables. For the locals, CompCert/clightgen should continue to use the smallest possible positive values.

This solution still needs a bit more thought before implementing, but I think it could be compatible with this current pull request.

@xavierleroy
Copy link
Contributor

I think we're trying to work around a Coq limitation rather than addressing it upfront. The limitation, I think, is the lack of computationally-efficient character strings: one that extracts to OCaml strings (and not lists of characters) + string operations (fast comparison), supports computation within Coq, and has the "..." syntactic suger. Then we could use strings instead of ident for global, named things, and AVL maps or red-black maps instead of PTree for maps.

@jeremie-koenig
Copy link
Contributor Author

jeremie-koenig commented Feb 20, 2018

Thanks for the background and suggestions.

In addition to the subpar performance of the primitive operations for extracted Coq strings, the performance boost from string interning itself may be significant (@andrew-appel pointed this out to me).

In any case, I've been approaching the problem with the mindset that the official name and external representation of a symbol should always be a string, but within the context of a given execution of Coq/CompCert we can use ident as a small, efficient representation.

Re: different mappings across executions

In that spirit, different runs of CompCert or clightgen may indeed use different mappings, but these mappings will only be transient and ident values should never be exchanged with the outside world (hence my changes to PrintClight.ml and ExportClight.ml).

It is true that if we're not careful, we could break this assumption (either now or in future changes). As far as soundness goes, the fact that string_of_ident/ident_of_string are parameters in Coq prevents the proofs from depending on any specific mapping. But there remain some borderline iffy phenomena:

  • in the way the SimplExpr pass allocates fresh uninterned symbols, in particular observing !next_atom and using the order on positive in related invariants/proofs;
  • in the way that those uninterned symbols are then assigned an external representation whenever string_of_ident/extern_atom is called (by clightgen for instance), where that representation will depend on whether a given "$n" has been passed to ident_of_string/intern_string before.

Possible solution. I believe we could mitigate this by:

  1. making sure the critical assumptions are explicitely documented in comments wherever relevant, so that future changes to CompCert are less likely to inadvertently break them, and so that if they do there will be no ambiguity as to where the bug lies;
  2. as much as possible, enforcing these assumptions mechanically by using the Coq / Ocaml module system to hide the implementation of ident and restrict the ways in which we can create and use ident values.

I will try to come up with an implementation of this soonish; managing uninterned symbols in a good way and updating SimplExpr seem to be the most subtle / labor-intensive parts of that.

Re: computability within Coq

As far as the computability of string_of_ident and ident_of_string within Coq goes, I think the only way to achieve this while retaining the performance of string interning would be to implement the string intern pool within Coq.

The problem with this approach is that its state would have to be passed around everywhere. The types of string_of_ident and ident_of_string, and their logical characterization, would become more complicated. Even using some well-designed monad of some sort this would be far from straightforward, and a major change to CompCert.

The "cleverness" of my pull request is in recognizing that this statefulness can be hidden from Coq (though I recognize that "clever" is not always a positive thing from a software engineering and correctness perspective). But I'm afraid there's a trade-off here where we can only accomplish this by implementing the string interning in ML, barring some future change in proof assistant technology.

Re: embedding string into ident

It is possible to define a bijection or embedding of string into positive without much difficulty (both are bit strings give or take), but I think if we were to do that it may be more straightforward to use string in the first place wherever we would be inclined to use ident-embedded strings. We can use IMap to define string-indexed maps with similar performance.


One thing that could be worth exploring would be to offer two options for implementing the string-ident mapping: the clean, constant, pure one, computable within Coq, implemented as an actual bijection between string and positive in the way suggested by @andrew-appel; and an efficient one implemented in ML. The user could choose which one they want at ./configure time, depending on their particular requirements and level of paranoia. A potentially tricky part is that some of the ML code seems to assume that the ident values we manipulate fit into machine integers.

Another middle-of-the-road approach might be to forgo the performance benefits of string interning: have ident be efficient strings with primitive operations implemented in ML, and build some appropriate maps on top of that in Coq. However this would still not compute in Coq, and it's unclear to me what the performance implications would be.

Instead, we use regular symbols, but make sure that they are not legal
C identifiers, so that they cannot conflict with those parsed from the
source program. Specifically, the temporaries introduced by the
SimplExpr pass are of the form "$1", "$2", etc. (which is similar in
effect to the old behavior when they are printed out), and temporaries
introduced to separate memory loads in clightgen normalization are of
the form "@1", "@2", etc.
AST.ident values can now only be created through Ident.of_string, and
can only be used through Ident.to_string. This guarantees that the
concrete positive <-> string mapping used is immaterial and that ident
values can never be communicated with the outside world.

This discipline is enforced both in the Coq and Ocaml code: the
Camlcoq functions intern_string and extern_atom now use the sealed
Ident module (and type) as extracted from Coq. Independently, the
FastIdent Ocaml module provides the string interning backend necessary
for Symbols.v, but it cannot be used to bypass abstraction.

The new structure also makes it easy to switch out the implementation
of the Ident module in the future, or provide an alternative one, for
instance if we want to implement a version of Ident where of_string
and to_string can compute within Coq, sacrificing performance.
@gmalecha
Copy link

gmalecha commented Oct 2, 2018

I realize I'm jumping into something that is probably a dead thread, but I recently started using VST and this seems to be a big issue for the type of verification that I'd like to do.

I'm probably the least knowledgeable about this topic given that I only very recently started using VST, but for my purposes I need but it seems like the basic issue here is a trade-off between performance and functionality. From what I can tell:

Benefits of (small) positive

  • They are fast to compare,
  • They have efficient maps (these maps turn out to be canonical, but since no one has mentioned it, it doesn't seem to be important).
  • They are already implemented

Benefits of string

  • Strings are used by object files for exporting named symbols (necessary for linking + separate verification),
  • Because of the the above, they are stable across runs of clightgen and other tools.

Is there anything that I'm missing?

If the trade-off really is performance vs functionality, then has anyone compared the runtimes/memory usage? I assume that there is some significant factor going on here. Is the critical thing performance of the extracted code, or performance of the Gallina code?

  • Ocaml performance should be tweakable for string by extracting strings into native Ocaml strings. Is there some reason to not do this, e.g. you don't want it in the TCB? Or are there some issues with this?
  • If Coq/Gallina performance is an issue, then what is the reduction mechanism being used (cbv, lazy, vm_compute, native_compute) and where is the problem? Are you running the compiler? Doing VST proofs? Running something else, e.g. Verasco?

I did a little bit of performance checks on a few representations (this is doing 1000 equality checks of morally 20 character strings):

type time (compute/cbv) time (vm_compute)
positive 0.088 0.011
string 0.825 0.084
string (boolean) 0.079 0.037
ascii=pair half_word 0.337 0.026
ascii=pair half_word (boolean) 0.035 0.01

positive is pretty fast and is certainly simple, but equality on it can certainly be made much faster by using the boolean equality rather than the sumbool equality. Using boolean equality and a more compact version of ascii (a pair of 16 constructor variants) performance can be comparable to positive.

If map inserts/lookups are the real underlying issue, then I could try some similar benchmarks if you'd like. I'm certain that small positives will be faster than the alternatives but I'm interested to know by how much (and how much it matters). Further, if there is a separate representation of locals and globals (e.g. split Evar into Elocal (_ : ident) and Eglobal (_ : string)) than in the common case it seems like most accesses will hit locals (which are small) and only a few (mostly function calls) will be references to globals. With the split representation, locals could retain the small, fast positive representation and we would only pay map lookups on globals.

@andrew-appel
Copy link

A few years ago I made a performance measurement of this question. I compared the performance of VST with 5-to-10-bit positives, as we have now, versus 60-bit positives, still using the PTree representation for lookup tables. The performance of VST as a whole slowed down by a factor of three. This is probably because VST uses PTrees very heavily in its internal computational reflection.
[I don't remember if it was 60-bit positives, or longer ones, sorry.]

The 60-bit positives were to simulate ASCII strings as sequences of bits. Therefore, I concluded that the use of ASCII strings for identifiers directly would cause severe performance degradation of VST.

You can do this experiment yourself! Simply edit the header of a .v file produced by clightgen, to make all the positive numbers (used for identifiers) much longer, and see how long it now takes to run that file through a VST verification.

@gmalecha
Copy link

gmalecha commented Oct 2, 2018

Given that the space of strings is so sparse, I would imagine that something more efficient than PTree would be necessary. My thought was AVL trees from the standard library would be a better choice for this. It does require comparison on strings at every node though so that is a trade-off.

If globals and locals were syntactically distinct, e.g. locals using small positive and globals using string, do you think there would be much overhead?

@andrew-appel
Copy link

Most of the PTree-lookups are on locals, but there is still some on globals. You can also do this experiment yourself! Edit the .v file output from clightgen, just to make the positives for globals huge, but not for locals, and see how much VST slows down.

xavierleroy added a commit that referenced this pull request Feb 24, 2020
The "size_arguments" function and its properties can be systematically
derived from the "loc_arguments" function and its properties.

Before, the RISC-V port used this derivation, and all other ports
used hand-written "size_arguments" functions and proofs.

This commit moves the definition of "size_arguments" to the
platform-independent file backend/Conventions.v, using the systematic
derivation, and removes the platform-specific definitions.

This reduces code and proof size, and makes it easier to change the
calling conventions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants