-
Just to clarify and drive home an important point: The memory used to represent the inferred types from Simplicity programs is linear in the size of the program, but only because of the natural sharing of type subexpressions induced by the type inference algorithm. The unshared size of inferred types can be exponential in the size of the program. So it is, in fact, the clever sharing that saves us here.
-
This post is a semi-structured dump of some thoughts I've had while trying to combinatorially generate Simplicity structures.
Simplicity types live in a set $\mathbb{T}$ which admits a very simple recursive definition: the unit type $\mathbf{1}$ is in $\mathbb{T}$, and whenever $A$ and $B$ are in $\mathbb{T}$, so are the sum $A + B$ and the product $A \times B$.
Then for each type there is a natural notion of size $t \mapsto |t|$ which expresses the bitlength of a value of type $t$ when encoded in the bit machine: $|\mathbf{1}| = 0$, $|A + B| = 1 + \max(|A|, |B|)$, and $|A \times B| = |A| + |B|$.
This has the curious effect that for any particular size $n$ there are infinitely many types whose size is $n$, because you can always take arbitrarily many products with the unit type, which does not increase the bitlength of a type.
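For instance, $|A| = |A \times \mathbf{1}| = |(A \times \mathbf{1}) \times \mathbf{1}| = \cdots$, since $|\mathbf{1}| = 0$ and products just add bitlengths.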
There are two other notions of size which may be relevant:
Naively, this looks alarming -- we have three notions of size which are decoupled from each other, and the "in-memory" size may be exponentially larger than the "on-chain" size reflected in the Simplicity cost model. In fact, there is no problem here -- it turns out that every Simplicity combinator introduces at most 1 free type and at most 4 total type bounds (each of which can, at most, "turn into" a single sum or product). And since every combinator requires some minimum number of bits to encode (I think 4), the in-memory footprint of the types inferred for a given program is actually linear in the encoded size of that program. Phew.
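For a concrete instance (sketching how I understand inference, so the exact bookkeeping may differ): for $\mathsf{pair}\ s\ t$, inference equates the sources of $s$, $t$, and the composite, and bounds the composite's target by the product of the targets of $s$ and $t$ -- that single product is the only new type node the combinator can create.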
This actually introduces yet another notion of the "size" of a type -- how many bounds were required to construct it. This is a murky definition, since you can't really assign a particular bound to a particular type. But when type-inferring a Simplicity program, the number of bounds is proportional to the amount of memory used during inference.
From a software testing point of view, we would like to generate a diverse set of types (and values which inhabit them). For simplicity's sake it would be nice if this generation looked like "pick a size $n$ and recursively generate a type of size $n$". To do this we need to pick a notion of "size". For the sake of not blowing up our test harness, the most natural choice would be the "in-memory size", although of all the sizes above, this is the only one that doesn't reflect some on-chain property of a Simplicity program or value.
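To make that concrete, here is a minimal Rust sketch (hypothetical -- the `Ty` enum and `gen_type` function aren't part of any existing library) that generates a type with exactly `n` sum/product nodes. Note that it is not uniform over types of that size; a real harness would probably want to count types of each size (or use something like a Boltzmann sampler) to fix that.

```rust
use rand::Rng;

/// Unshared (tree) representation of a Simplicity type.
#[derive(Debug, Clone)]
enum Ty {
    Unit,
    Sum(Box<Ty>, Box<Ty>),
    Product(Box<Ty>, Box<Ty>),
}

/// Generate a random type with exactly `n` Sum/Product nodes (the
/// "in-memory" size, ignoring sharing). Not uniform over types of
/// size `n`: the split point and constructor are chosen independently.
fn gen_type<R: Rng>(rng: &mut R, n: usize) -> Ty {
    if n == 0 {
        return Ty::Unit;
    }
    // Spend one node on this constructor, then split the remaining
    // n - 1 nodes between the two children.
    let left = rng.gen_range(0..n); // 0 ..= n - 1
    let right = n - 1 - left;
    let a = Box::new(gen_type(rng, left));
    let b = Box::new(gen_type(rng, right));
    if rng.gen_bool(0.5) {
        Ty::Sum(a, b)
    } else {
        Ty::Product(a, b)
    }
}

fn main() {
    let mut rng = rand::thread_rng();
    println!("{:?}", gen_type(&mut rng, 6));
}
```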
An interesting alternate strategy might be to combinatorially generate bounds, and from there produce types, and from there choose values (which is "easy" to do uniformly once you know the target type). This just occurred to me as I was typing this, and it might also resolve another problem: how can we combinatorially generate well-typed Simplicity programs? The problem with this is that while there's a simple recursive definition of "untyped Simplicity", there is no recursive definition (I'm pretty sure) of "well-typed Simplicity". So while there's a wealth of literature on efficiently generating recursive structures, it is inapplicable to the problem of generating well-typed Simplicity, so you wind up doing a lot of very inefficient backtracking and type inference and re-inference.
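And for the "choose values uniformly once you know the target type" step, a sketch of the obvious approach (again with hypothetical names, and with the caveat that the counts overflow quickly for wide types): count the inhabitants of each type and weight each sum branch by its count.

```rust
use rand::Rng;

/// Same `Ty` as in the previous sketch.
#[derive(Debug, Clone)]
enum Ty {
    Unit,
    Sum(Box<Ty>, Box<Ty>),
    Product(Box<Ty>, Box<Ty>),
}

/// Values inhabiting a `Ty`.
#[derive(Debug, Clone)]
enum Val {
    Unit,
    Left(Box<Val>),
    Right(Box<Val>),
    Pair(Box<Val>, Box<Val>),
}

/// Number of inhabitants: 1 for the unit type, count(A) + count(B) for
/// sums, count(A) * count(B) for products. A u128 overflows quickly for
/// wide types; a real generator would use bignums or sample bit-by-bit.
fn count(ty: &Ty) -> u128 {
    match ty {
        Ty::Unit => 1,
        Ty::Sum(a, b) => count(a) + count(b),
        Ty::Product(a, b) => count(a) * count(b),
    }
}

/// Sample a value of `ty` uniformly at random by weighting each sum
/// branch by how many values it contains.
fn gen_value<R: Rng>(rng: &mut R, ty: &Ty) -> Val {
    match ty {
        Ty::Unit => Val::Unit,
        Ty::Sum(a, b) => {
            if rng.gen_range(0..count(a) + count(b)) < count(a) {
                Val::Left(Box::new(gen_value(rng, a)))
            } else {
                Val::Right(Box::new(gen_value(rng, b)))
            }
        }
        Ty::Product(a, b) => {
            Val::Pair(Box::new(gen_value(rng, a)), Box::new(gen_value(rng, b)))
        }
    }
}

fn main() {
    let mut rng = rand::thread_rng();
    // A two-bit word: (1 + 1) x (1 + 1).
    let bit = || Ty::Sum(Box::new(Ty::Unit), Box::new(Ty::Unit));
    let two_bits = Ty::Product(Box::new(bit()), Box::new(bit()));
    println!("{:?}", gen_value(&mut rng, &two_bits));
}
```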
An alternate strategy is: