|
| 1 | +# Simplicity Human-Readable Encoding |
| 2 | + |
| 3 | +This module defines a human-readable encoding for Simplicity programs. This encoding |
| 4 | +is intended to be the encoding used for storage and interchange of "commitment-time" |
| 5 | +Simplicity programs, i.e. programs which are unpruned and have no witnesses. |
| 6 | + |
| 7 | +The following parts of the encoding are incomplete/undesigned: |
| 8 | + |
| 9 | +1. It does not support witness data; in future we would like to support partial/full |
| 10 | + witness population, as well as population of disconnected expressions. |
| 11 | +2. `2^n` when it appears in types is a single lexer token and you cannot put spaces |
| 12 | + into it. I'm not sure if I want to fix this or not. |
| 13 | + |
| 14 | +With that said, the rest of the document defines the encoding. |
| 15 | + |
| 16 | +## Syntax |
| 17 | + |
| 18 | +The syntax is defined in `src/human_encoding/parse/ast.rs`. It currently uses the |
| 19 | +`santiago` parser generator, but we would like to move away from this, probably to |
| 20 | +an ad-hoc parser, to avoid poor asymptotic behavior and to get better error messages. |
| 21 | + |
| 22 | +Comments are started by `--` and end at the next newline. This is the only aspect |
| 23 | +in which whitespace is significant. |
| 24 | + |
| 25 | +Simplicity expressions are composed as a series of **definitions** of the form: |
| 26 | + |
| 27 | + NAME := EXPRESSION |
| 28 | + |
| 29 | +and **type bounds** of the form |
| 30 | + |
| 31 | + NAME : TYPE -> TYPE |
| 32 | + |
| 33 | +where these may be combined into the singular form `NAME := EXPRESSION : TYPE -> TYPE`. |
| 34 | +Whitespace is not significant. Each definition or type bound is self-delimiting, so |
| 35 | +there are no semicolons or other separators, but by convention each one should be |
| 36 | +separated by at least one newline. |
| 37 | + |
| 38 | +Here NAME is |
| 39 | + |
| 40 | +* Any sequence matching the regex `[a-zA-Z_\-.'][0-9a-zA-Z_\-.']*`; i.e. combination |
| 41 | + of alphanumerics, `-`, `_`, `'`, and `.` that does not start with a numeral; |
| 42 | +* EXCEPT for the following reserved symbols, which may not be used: |
| 43 | + * `_`; |
| 44 | + * `assertl`, `assertr`, `case`, `comp`, `const`, `disconnect`, `drop`, `fail`, `iden`, `injl`, `injr`, `pair`, `take`, `unit`, `witness`; |
| 45 | + * anything beginning with `prim`; and |
| 46 | + * anything beginning with `jet_`. |
| 47 | + |
| 48 | +and EXPRESSION is |
| 49 | + |
| 50 | +* a NAME; |
| 51 | +* a HOLE (defined below); |
| 52 | +* `unit`, `iden`, or `witness`; |
| 53 | +* `injl`, `injr`, `take`, or `drop` followed by another EXPRESSION; |
| 54 | +* `case`, `comp`, or `pair` followed by two EXPRESSIONs; |
| 55 | +* `assertl` followed by an EXPRESSION, a literal `#`, and another EXPRESSION; |
| 56 | +* `assertr` followed by a literal `#` and two EXPRESSIONs; |
| 57 | +* a jet, which begins with `jet_` and must belong to the list of jets (FIXME define this list); |
| 58 | +* `const` followed by a VALUE (defined below); |
| 59 | +* `fail` followed by an ENTROPY (defined below); or |
| 60 | +* `(` followed by another EXPRESSION followed by `)`. |
| 61 | + |
| 62 | +Note that while we allow parenthesis to help group parts of expressions for human |
| 63 | +understanding, they are never needed for disambiguation and are essentially |
| 64 | +ignored by the parser. |
| 65 | + |
| 66 | +A HOLE is the literal `?` followed by a NAME. It indicates an expression that has |
| 67 | +yet to be defined. Holes have a different namespace than other names. |
| 68 | + |
| 69 | +A VALUE is one of |
| 70 | +* the literal `_`, which is interpreted as the empty value; |
| 71 | +* a binary literal `0b[01]+` which is interpreted as a sequence of bits; or |
| 72 | +* a hex literal `0x[0-9a-f]+` which is interpreted as a sequence of 4-bit nybbles |
| 73 | + |
| 74 | +An ENTROPY is a VALUE whose size is between 128 and 512 bits inclusive. Internally |
| 75 | +it is 0-padded out to 512 bits. |
| 76 | + |
| 77 | +Finally, TYPE is |
| 78 | + |
| 79 | +* a literal `_`, indicating no type bound; |
| 80 | +* a NAME; |
| 81 | +* a literal `1`, indicating the unit type; |
| 82 | +* a literal `2`, indicating the bit type; |
| 83 | +* `2^n`, where `n` is any power of two, in decimal with no spaces or punctuation; |
| 84 | +* `(` followed by another TYPE followed by `)`; |
| 85 | +* another TYPE, followed by `+`, followed by another TYPE; or |
| 86 | +* another TYPE, followed by `*`, followed by another TYPE. |
| 87 | + |
| 88 | +Here `*` has higher precedence than `+`, and both `+` and `*` are left-associative. |
| 89 | + |
| 90 | +## Namespaces |
| 91 | + |
| 92 | +There are three independent namespaces: one for NAMEs, one for HOLEs, and one for |
| 93 | +TYPEs. They all have the same rules for valid symbols, except that `_` is reserved |
| 94 | +(may not be used) for NAMEs and `_` has a special meaning (no type bound) for |
| 95 | +TYPEs. |
| 96 | + |
| 97 | +## Semantics: Definitions |
| 98 | + |
| 99 | +As above, Simplicity expressions are a series of **definitions** of the form |
| 100 | + |
| 101 | + NAME := EXPRESSION |
| 102 | + |
| 103 | +or |
| 104 | + |
| 105 | + NAME := EXPRESSION : TYPE -> TYPE |
| 106 | + |
| 107 | +and **type bounds** of the form |
| 108 | + |
| 109 | + NAME : TYPE -> TYPE |
| 110 | + |
| 111 | +We refer to the `NAME` part as the **name**, `EXPRESSION` as the **expression**, |
| 112 | +and `TYPE -> TYPE` as the **type ascription**. If such a named expression appears anywhere |
| 113 | +in a Simplicity encoding, then whenever that name appears in the expression of |
| 114 | +any other named expression, its expression is substituted in place of it. |
| 115 | + |
| 116 | +For a given name, it is permissible to have multiple type bounds, but any name |
| 117 | +which appears must have exactly one definition. That is, it is not permitted to |
| 118 | +have a type bound for a name which isn't defined elsewhere, and it is not permitted |
| 119 | +to have multiple definitions for the same name. |
| 120 | + |
| 121 | +This allows the user to provide any number of type bounds for a given name, each of |
| 122 | +which may be helpful in clarifying a program. |
| 123 | + |
| 124 | +For example, in the encoding |
| 125 | + |
| 126 | + node := unit : _ -> 1 |
| 127 | + main := comp node node |
| 128 | + |
| 129 | +the name `node` is substituted by `unit` both places that it appears. We can see |
| 130 | +that starting from the name `main`, by recursive substitution, we obtain a single |
| 131 | +Simplicity expression. The type checker will ensure that the target type of `node` |
| 132 | +is 1. |
| 133 | + |
| 134 | +In general, we do not need to obtain a single expression. It is permissible to |
| 135 | +encode a "DAG forest". |
| 136 | + |
| 137 | +The name `main` is special for several reasons: |
| 138 | +* An expression named `name` implicitly has the type ascription `1 -> 1`. That is, it must always be a program. |
| 139 | +* To generate a commitment-time program from an expression, it must be called `main`. |
| 140 | +* Type ascriptions for `main` and its children are enforced **after** type inference has completed, so they act as sanity checks but cannot change the output of type inference. For other expressions, type ascriptions are enforced **before** and may guide inference. |
| 141 | + |
| 142 | +No cycles are allowed; if a name occurs anywhere in the expansion of its expression, |
| 143 | +this is an error. |
| 144 | + |
| 145 | +## Semantics: Expressions |
| 146 | + |
| 147 | +Expressions may be |
| 148 | + |
| 149 | +* a NAME, which simply refers to another expression; |
| 150 | +* a HOLE, which is described in the next section; |
| 151 | +* one of the core combinators `unit`, `iden`, `comp`, `injl`, `injr`, `case`, `take`, `drop`, `pair`, followed by subexpression(s) as needed; |
| 152 | +* the `disconnect` combinator followed by an expression and a hole; |
| 153 | +* the `witness` combinator which currently allows no subexpressions; |
| 154 | +* the assertions, `assertl` or `assertr`, which take two subexpressions, one of which will be hidden in the decoded program. The hidden subexpression should be prefixed by `#` which indicates to the parser to take the CMR of that expression, not the expression itself. |
| 155 | +* `fail` followed by a 128-to-512-bit entropy value, which should occur only in the pruned branch of an assertion, though this is not enforced; |
| 156 | +* `const` followed by a value, which is a "constant-word jet" and is equivalent to constructing the given value by a tree of `pair`s whose leaves are `injl unit` (0) or `injr unit` (1); |
| 157 | + |
| 158 | +Expressions have an associated **type arrow**, which is inferred by the type checker as |
| 159 | +expressions are built up. If a combinator's children's types are incompatible for that |
| 160 | +combinator, an error is raised. |
| 161 | + |
| 162 | +After the type arrow for a named expression is fully inferred, any type ascriptions for |
| 163 | +that name are applied, and an error is raised if this fails. In this way, a user can |
| 164 | +provide type ascriptions which act as sanity checks and documentation for sub-parts of |
| 165 | +a Simplicity expression. |
| 166 | + |
| 167 | +## Semantics: Holes |
| 168 | + |
| 169 | +Holes are of the form `?NAME`; there may be whitespace after the `?` but by convention it is |
| 170 | +omitted. Holes must have unique names, but live in a separate namespace from ordinary names, |
| 171 | +so they cannot conflict with the names of expressions. |
| 172 | + |
| 173 | +Holes have two meanings: |
| 174 | +* When they occur as the right child of a `disconnect` combinator, they give a name to a disconnected expression. `disconnect` combinators are **required** to have holes for right children. Any other expression form is an error. |
| 175 | +* In all other contexts, they indicate an incomplete part of the program which can be typechecked but not much else. |
| 176 | + |
| 177 | +In all cases, holes are typechecked before any errors are reported, and the assembler will |
| 178 | +report their types. This allows the use of holes as placeholders during development, where |
| 179 | +users can learn the required type of the eventual expression by querying the typechecker. |
| 180 | + |
| 181 | +When assembling or computing CMRs of incomplete programs (any program with a hole outside |
| 182 | +of the right child of a `disconnect` node), errors will be reported for every hole. |
| 183 | +error messages will include the holes' type arrows. |
| 184 | + |
| 185 | +## Semantics: Type Ascriptions |
| 186 | + |
| 187 | +Type ascriptions are of the form `TYPE -> TYPE`. We refer to the first type as the **source |
| 188 | +type** and the second as the **target type**. Here each TYPE is one of |
| 189 | + |
| 190 | +* the literal `_`, which indicates that no additional checks should be done on the appropriate type; |
| 191 | +* the literals `1`, `2` or `2^n` indicate that the appropriate type must be the unit, bit, or n-bit word type; |
| 192 | +* an arbitrary NAME, which simply gives a name to the appropriate type. (If the same name is used in multiple places, the type-checker will check that the same type appears in each place.); |
| 193 | +* any pair of TYPEs separated by `+` or `*`, which indicate a sum or product bound respectively; or |
| 194 | +* any TYPE surrounded by `(` or `)`. |
| 195 | + |
| 196 | +Note that the NAMEs used for types are in a separate namespace from the NAMEs used in |
| 197 | +named expression, and from HOLEs. These three uses of names do not interact. |
| 198 | + |
| 199 | +Note also that unlike the case for EXPRESSIONs, parentheses may be meaningful. Absent parentheses, |
| 200 | +`*` has higher precedence than `+` and both operators are left-associative. |
| 201 | + |
| 202 | +The interpretation of type ascriptions depends on whether they appear within the `main` expression. |
| 203 | +If so, type inference is completed and all free types set to unit, and **then** all type ascriptions |
| 204 | +are applied. In this case, the type ascriptions cannot change the final types, but if they are |
| 205 | +incompatible with the final types, an error is raised. |
| 206 | + |
| 207 | +Outside of `main`, type ascriptions are applied in parallel with type inference, and before free |
| 208 | +types are set to unit. This means that type ascription can change types that would otherwise be |
| 209 | +free, and the type of the resulting expression will not necessarily match its type if the |
| 210 | +expression were to be pulled into `main`. |
| 211 | + |
0 commit comments