From bdf31df8a3f7fb3b4a43c5dec1fea4e2e205a530 Mon Sep 17 00:00:00 2001 From: Iulia Bastys Date: Mon, 5 May 2025 00:11:25 +0200 Subject: [PATCH 1/2] Update Lesson 1.3 README. --- .../k-tutorial/1_basic/03_parsing/README.md | 506 +++++++++++------- 1 file changed, 314 insertions(+), 192 deletions(-) diff --git a/k-distribution/k-tutorial/1_basic/03_parsing/README.md b/k-distribution/k-tutorial/1_basic/03_parsing/README.md index 752eb258777..0b987e1f4a7 100644 --- a/k-distribution/k-tutorial/1_basic/03_parsing/README.md +++ b/k-distribution/k-tutorial/1_basic/03_parsing/README.md @@ -5,31 +5,35 @@ copyright: Copyright (c) Runtime Verification, Inc. All Rights Reserved. # Lesson 1.3: BNF Syntax and Parser Generation The purpose of this lesson is to explain the full syntax and semantics of -**productions** in K as well as how productions and other syntactic -**sentences** can be used to define grammars for use parsing both rules as well -as programs. +**productions** in K, as well as how productions and other syntactic +**sentences** can be used to define grammars for parsing both rules and +programs. In this context, you'll also learn about two additional types +of productions, **brackets** and **tokens**. ## K's approach to parsing -K's grammar is divided into two components: the **outer syntax** of K and the -**inner syntax** of K. Outer syntax refers to the parsing of **requires**, +K's grammar is divided into two components: one **outer syntax** and one +**inner syntax**. Outer syntax refers to the parsing of **requires**, **modules**, **imports**, and **sentences** in a K definition. Inner syntax refers to the parsing of **rules** and **programs**. Unlike the outer syntax of K, which is predetermined, much of the inner syntax of K is defined by you, the developer. When rules or programs are parsed, they are parsed within the context of a module. Rules are parsed in the context of the module in which they exist, whereas programs are parsed in the context of the -**main syntax module** of a K definition. The productions and other syntactic -sentences in a module are used to construct the grammar of the module, which -is then used to perform parsing. +**main syntax module** of a K definition. + +Recall that a K definition consists of several modules, which in turn consist +each of several sentences (productions, rules, etc.). Sentences within a +module form the grammar of that module, and this grammar is used for parsing +programs in the language you defined. ## Basic BNF productions -To illustrate how this works, we will consider a simple K definition which -defines a relatively basic calculator capable of evaluating Boolean expressions -containing and, or, not, and xor. +To illustrate how this works, let's consider the K module below which defines +a logical calculator for evaluating Boolean expressions containing operations +AND, OR, NOT, and XOR. -Input the following program into your editor as file `lesson-03-a.k`: +Input the code below into your editor as file `lesson-03-a.k`: ```k module LESSON-03-A @@ -43,87 +47,127 @@ module LESSON-03-A endmodule ``` -You will notice that the productions in this file look a little different than -the ones from the previous lesson. In point of fact, K has two different -mechanisms for defining productions. We have previously been focused -exclusively on the first mechanism, where the `::=` symbol is followed by an -alphanumeric identifier followed by a comma-separated list of sorts in -parentheses. 
However, this is merely a special case of a more generic mechanism -for defining the syntax of productions using a variant of -[BNF Form](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form). +Observe that the productions in this file look a little different than +what we have seen in the previous lesson. The reason is that K has two +mechanisms for defining productions: +[BNF notation](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form) and +alphanumeric identifiers. In [Lesson 1.2](../02_basics/README.md) we have +seen the latter, where the identifier was followed by a (possibly empty) +list of sorts in parentheses. In this lesson, we will learn about the +former, which is a more generic mechanism for defining the syntax of +productions. -For example, in the previous lesson, we had the following set of productions: +Recall the set of productions from the previous lesson: ```k module LESSON-03-B + syntax Color ::= Yellow() | Blue() syntax Fruit ::= Banana() | Blueberry() syntax Color ::= colorOf(Fruit) [function] + endmodule ``` -It turns out that this is equivalent to the following definition which defines -the same grammar, but using BNF notation: +This definition is equivalent to the following one which specifies the same +grammar, but in BNF notation: ```k module LESSON-03-C + syntax Color ::= "Yellow" "(" ")" | "Blue" "(" ")" syntax Fruit ::= "Banana" "(" ")" | "Blueberrry" "(" ")" syntax Color ::= "colorOf" "(" Fruit ")" [function] + endmodule ``` -In this example, the sorts of the argument to the function are unchanged, but +You can see that the sort of the function's argument is unchanged, but everything else has been wrapped in double quotation marks. This is because in BNF notation, we distinguish between two types of **production items**: -**terminals** and **non-terminals**. A terminal represents simply a literal -string of characters that is verbatim part of the syntax of that production. -A non-terminal, conversely, represents a sort name, where the syntax of that -production accepts any valid term of that sort at that position. - -This is why, when we wrote the program `colorOf(Banana())`, `krun` was able to -execute that program: because it represented a term of sort `Color` that was -parsed and interpreted by K's interpreter. In other words, `krun` parses and -interprets terms according to the grammar defined by the developer. It is -automatically converted into an AST of that term, and then the `colorOf` -function is evaluated using the function rules provided in the definition. - -You can ask yourself: How does K match the strings between the double quotes? -The answer is that K uses Flex to generate a scanner for the grammar. Flex looks -for the longest possible match of a regular expression in the input. If there -are ambiguities between 2 or more regular expressions, it will pick the one with -the highest `prec` attribute. You can learn more about how Flex matching works -[here](https://westes.github.io/flex/manual/Matching.html#Matching). - -Bringing us back to the file `lesson-03-a.k`, we can see that this grammar -has given a simple BNF grammar for expressions over Booleans. We have defined -constructors corresponding to the Boolean values true and false, and functions -corresponding to the Boolean operators for and, or, not, and xor. We have also -given a syntax for each of these functions based on their syntax in the `C` -programming language. As such, we can now write programs in the simple language -we have defined. 
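For instance, assuming the C-style operator spellings `!`, `&&`, `^`, and `||` declared in `lesson-03-a.k`, each of the following one-line programs should be accepted when parsed:

```
!true
true ^ false
false || false
```

Expressions that mix two different operators, such as `true && false || false`, are not yet safe to write; the section on ambiguities below explains why.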
- -Input the following program into your editor as `and.bool` in the same -directory: +**terminals** and **non-terminals**. A terminal represents a literal string +of characters that is a verbatim part of the syntax of that production. +For example, `Banana` is part of the syntax. Conversely, non-terminals, refer +to a sort name, like `Fruit`, and the syntax of the production they +belong to accepts any valid term of that sort at that position. + +In the previous lesson we executed successfully the program `colorOf(Banana())` +using `krun`. `krun` parses and interprets terms according to the grammar we +have defined. `Banana()` is a term of sort `Fruit`, hence a valid argument +for function `colorOf`. Under the hood, the term is automatically converted +into an AST (abstract syntax tree), and then the function `colorOf` is +evaluated using the function rules provided in the definition. + +What about the strings between the double quotes? How does K match them? +The answer is that K uses [Flex](https://github.com/westes/flex) to generate a +scanner for the grammar. Remember that a scanner, or lexical analyzer or lexer, +is a component of an interpreter that breaks down source code into tokens, +which are units such as keywords, variables, and operators. These tokens are +then processed by the parser, which interprets the structure of the code +according to the grammar rules. + +Flex looks for the longest possible match of a regular expression in the input. +If there are ambiguities between two or more regular expressions, it will pick +the one with the highest `prec` attribute. You can learn more about how Flex +matching works in the +[Flex Manual](https://westes.github.io/flex/manual/Matching.html#Matching). + +Returning to module `LESSON-03-A`, we can see that it defines a simple BNF +grammar for expressions over Booleans. We have defined constructors +corresponding to the Boolean values _true_ and _false_, and functions +corresponding to the Boolean operators AND, OR, NOT, and XOR. We have also +given a syntax for each of these functions based on their syntax in the `C` +programming language. As such, we can now write programs in the simple language +we have defined! + +First, let's compile our grammar: + +``` +kompile lesson-03-a.k +``` + +Recall that compilation produces a parser, interpreter, and verifier for the +grammar specified in the K definition. + +Now, save the following program as `and.bool` in the same directory: ``` true && false ``` -We cannot interpret this program yet, because we have not given rules defining -the meaning of the `&&` function yet, but we can parse it. To do this, you can -run (from the same directory): +Interpreting this program by executing ``` -kast --output kore and.bool +krun and.bool +``` + +will raise an error: + ``` +terminate called after throwing an instance of 'std::runtime_error' + what(): No tag found for symbol Lbl'UndsAnd-And-UndsUnds'LESSON-03-A'Unds'Boolean'Unds'Boolean'Unds'Boolean{}. Maybe attempted to evaluate a symbol with no rules? 
+ +/nix/store/2av44963lcqsmkj7hwmjhrg9pzpbkr1i-k-7.1.232-88c9c766d76f624329400cd554fafd3beec16a15/bin-unwrapped/../lib/kframework/k-util.sh: line 114: 384286 Aborted (core dumped) "$@" +[Error] krun: ./lesson-03-a-kompiled/interpreter +/tmp/.krun-2025-04-29-12-58-55-lb8wwXcWVv/tmp.in.0nVZCuXTXv -1 +/tmp/.krun-2025-04-29-12-58-55-lb8wwXcWVv/result.kore +Syntax error at /tmp/.krun-2025-04-29-12-58-55-lb8wwXcWVv/result.kore:1.1: Expected: [, ] Actual: +[Error] krun: kore-print --definition ./lesson-03-a-kompiled --output pretty +/tmp/.krun-2025-04-29-12-58-55-lb8wwXcWVv/result.kore --color on +``` + +This is expected, as we have not given rules defining the meaning of the `&&` +function, and the error message highlights exactly this—_Maybe attempted +to evaluate a symbol with no rules?_ -`kast` is K's just-in-time parser. It will generate a grammar from your K -definition on the fly and use it to parse the program passed on the command -line. The `--output` flag controls how the resulting AST is represented; don't -worry about the possible values yet, just use `kore`. +We cannot interpret the program just yet, but we can _parse_ it. To do this, +run the command below from the same directory: -You ought to get the following AST printed on standard output, minus the +``` +kast --output kore and.bool +``` + +You should see the following AST printed on standard output, minus the formatting: ``` @@ -135,16 +179,24 @@ inj{SortBoolean{}, SortKItem{}}( ) ``` -Don't worry about what exactly this means yet, just understand that it -represents the AST of the program that you just parsed. You ought to be able -to recognize the basic shape of it by seeing the words `true`, `false`, and -`And` in there. This is **Kore**, the intermediate representation of K, and we -will cover it in detail later. +`kast` is K's just-in-time parser, just another tool produced at compile time. +It generates a grammar from your K definition on the fly and uses it to parse +the program passed on the command line. -Note that you can also tell `kast` to print the AST in other formats. For a -more direct representation of the original K, while still maintaining the -structure of an AST, you can say `kast --output kast and.bool`. This will -yield the following output: +The `--output` flag controls how the resulting AST is represented. There are +several possible values for it, and you can see all options by running +`kast --help`. `kore` used above is one of them and denotes KORE, the +intermediate representation of K. +Value `kast` for the flag gives us an AST in a more direct representation of +the original K definition. + +Executing + +``` +kast --output kast and.bool +``` + +yields the following output, minus the formatting: ``` `_&&__LESSON-03-A_Boolean_Boolean_Boolean`( @@ -153,69 +205,81 @@ yield the following output: ) ``` -Note how the first output is largely a name-mangled version of the second -output. The one difference is the presence of the `inj` symbol in the KORE -output. We will talk more about this in later lessons. +Comparing both outputs, you can observe that the former is largely a +name-mangled version of the latter. However, a notable difference is +represented by the `inj` attribute in the KORE output. Keep it in mind for now, +we will talk more about it in future lessons. + +Note that while `krun` also accepts programs as arguments, `kast` only takes +file names as arguments. ### Exercise Parse the expression `false || true` with `--output kast`. 
See if you can predict approximately what the corresponding output would be with -`--output kore`, then run the command yourself and compare it to your -prediction. +`--output kore`, then run the command and compare it to your prediction. ## Ambiguities -Now let's try a slightly more advanced example. Input the following program -into your editor as `and-or.bool`: +Now let's try a slightly more advanced example. Save the following program as +`and-or.bool`: ``` true && false || false ``` -When you try and parse this program, you ought to see the following error: +If you try to parse it, you will see the following error: ``` [Error] Inner Parser: Parsing ambiguity. -1: syntax Boolean ::= Boolean "||" Boolean [function] - -`_||__LESSON-03-A_Boolean_Boolean_Boolean`(`_&&__LESSON-03-A_Boolean_Boolean_Boolean`(`true_LESSON-03-A_Boolean`(.KList),`false_LESSON-03-A_Boolean`(.KList)),`false_LESSON-03-A_Boolean`(.KList)) -2: syntax Boolean ::= Boolean "&&" Boolean [function] - -`_&&__LESSON-03-A_Boolean_Boolean_Boolean`(`true_LESSON-03-A_Boolean`(.KList),`_||__LESSON-03-A_Boolean_Boolean_Boolean`(`false_LESSON-03-A_Boolean`(.KList),`false_LESSON-03-A_Boolean`(.KList))) - Source(./and-or.bool) - Location(1,1,1,23) +1: syntax Boolean ::= Boolean "&&" Boolean [function] + `_&&__LESSON-03-A_Boolean_Boolean_Boolean`(`true_LESSON-03-A_Boolean`(.KList),`_||__LESSON-03-A_Boolean_Boolean_Boolean`(`false_LESSON-03-A_Boolean`(.KList),`false_LESSON-03-A_Boolean`(.KList))) +2: syntax Boolean ::= Boolean "||" Boolean [function] + `_||__LESSON-03-A_Boolean_Boolean_Boolean`(`_&&__LESSON-03-A_Boolean_Boolean_Boolean`(`true_LESSON-03-A_Boolean`(.KList),`false_LESSON-03-A_Boolean`(.KList)),`false_LESSON-03-A_Boolean`(.KList)) + Source(./and-or.bool) + Location(1,1,1,23) + 1 | true && false || false + . ^~~~~~~~~~~~~~~~~~~~~~ ``` -This error is saying that `kast` was unable to parse this program because it is -ambiguous. K's just-in-time parser is a GLL parser, which means it can handle +The error is saying that `kast` was unable to parse the program because it is +ambiguous. K's just-in-time parser is a GLL (generalized +left-to-right, leftmost derivation) parser, which means it can handle the full generality of context-free grammars, including those grammars which are ambiguous. An ambiguous grammar is one where the same string can be parsed as multiple distinct ASTs. In this example, it can't decide whether it should -be parsed as `(true && false) || false` or as `true && (false || false)`. As a -result, it reports the error to the user. +be parsed as `(true && false) || false` or as `true && (false || false)`. + +In Boolean logic and other programming languages such as C, logical AND has +precedence over logical OR. However, grammars defined in K assume all operators +to have the same priority in evaluation, unless specified otherwise. Hence the +ambiguity reported by the parser. You will learn in the next lesson how to set +up precendence of some operators over others and define the logical connectives +_the usual way_. We continue this lesson by showing how to reduce ambiguity by +using brackets. + ## Brackets -Currently there is no way of resolving this ambiguity, making it impossible -to write complex expressions in this language. This is obviously a problem. -The standard solution in most programming languages to this problem is to -use parentheses to indicate the appropriate grouping. K generalizes this notion -into a type of production called a **bracket**. 
A bracket production in K -is any production with the `bracket` attribute. It is required that such a -production only have a single non-terminal, and the sort of the production -must equal the sort of that non-terminal. However, K does not otherwise -impose restrictions on the grammar the user provides for a bracket. With that -being said, the most common type of bracket is one in which a non-terminal -is surrounded by terminals representing some type of bracket such as -`()`, `[]`, `{}`, `<>`, etc. For example, we can define the most common -type of bracket, the type used by the vast majority of programming languages, -quite simply. - -Consider the following modified definition, which we will save to -`lesson-03-d.k`: +With the grammar defined in module `LESSON-03-A` there is no way of resolving +this ambiguity, making it impossible to write complex expressions in our small +language. The standard solution in most programming languages to this problem +is to use parentheses to indicate the appropriate grouping. K generalizes this +notion into a type of production called **bracket**. -```k +A bracket production is any production with the `bracket` attribute. It is +required that such a production only have a single non-terminal, and the sort +of the production must equal the sort of that non-terminal. Otherwise, K does +not impose any restrictions on the grammar provided for a bracket. + +Like in other languages, the most common type of bracket is one in which a +non-terminal is surrounded by terminals representing one of the following +symbols `()`, `[]`, `{}`, or `<>`. For example, we can define the most common +type of bracket, the parentheses, quite simply. + +Consider the following modified definition and save it to file `lesson-03-d.k`: + +```k module LESSON-03-D syntax Boolean ::= "true" | "false" @@ -228,10 +292,8 @@ module LESSON-03-D endmodule ``` -In this definition, if the user does not explicitly define parentheses, the -grammar remains ambiguous and K's just-in-time parser will report an error. -However, you are now able to parse more complex programs by means of explicitly -grouping subterms with the bracket we have just defined. +With this augmented definition, you are now able to parse more complex programs +by explicitly grouping subterms with the bracket we have just defined. Consider `and-or-left.bool`: @@ -239,87 +301,88 @@ Consider `and-or-left.bool`: (true && false) || false ``` -Now consider `and-or-right.bool`: +and `and-or-right.bool`: ``` true && (false || false) ``` -If you parse these programs with `kast`, you will once again get a single -unique AST with no error. If you look, you might notice that the bracket itself -does not appear in the AST. In fact, this is a property unique to brackets: -productions with the bracket attribute are not represented in the parsed AST -of a term, and the child of the bracket is folded immediately into the parent -term. This is the reason for the requirement that a bracket production have -a single non-terminal of the same sort as the production itself. +When parsing these programs with `kast`, you get a unique AST with no error. +If you check the output carefully, you will notice that the bracket itself does +not appear in the AST. In fact, this is a property unique to brackets: +productions with the bracket attribute are not represented in the parsed AST of +a term, and the child of the bracket is folded immediately into the parent +term. 
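For example, a bracketed expression such as `(false || true) && true` should parse, with `--output kast`, into roughly the term below (a sketch, assuming the same label scheme as the outputs shown earlier). Note that nothing in it corresponds to the parentheses themselves:

```
`_&&__LESSON-03-D_Boolean_Boolean_Boolean`(
  `_||__LESSON-03-D_Boolean_Boolean_Boolean`(
    `false_LESSON-03-D_Boolean`(.KList),
    `true_LESSON-03-D_Boolean`(.KList)
  ),
  `true_LESSON-03-D_Boolean`(.KList)
)
```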
This is why a bracket production must have a single non-terminal of the same +sort as the production itself. ### Exercise -Write out what you expect the AST to be arising from parsing these two programs -above with `--output kast`, then parse them yourself and compare them to the -AST you expected. Confirm for yourself that the bracket production does not -appear in the AST. +Write out the AST you expect to be arising from parsing the two programs above +with `--output kast`, then parse them and compare the result to the AST you +expected. Confirm for yourself that the bracket production does not appear in +the AST. ## Tokens -So far we have seen how we can define the grammar of a language. However, -the grammar is not the only relevant part of parsing a language. Also relevant -is the lexical syntax of the language. Thus far, we have implicitly been using -K's automatic lexer generation to generate a token in the scanner for each -terminal in our grammar. However, sometimes we wish to define more complex -lexical syntax. For example, consider the case of integers in C: an integer -consists of a decimal, octal, or hexadecimal number followed by an optional -suffix indicating the type of the literal. - -In theory it would be possible to define this syntax via a grammar, but not -only would it be cumbersome and tedious, you would also then have to deal with -an AST generated for the literal which is not convenient to work with. - -Instead of doing this, K allows you to define **token** productions, where -a production consists of a [regular expression](https://en.wikipedia.org/wiki/Regular_expression) followed by the `token` -attribute, and the resulting AST consists of a typed string containing the -value recognized by the regular expression. - -For example, the builtin integers in K are defined using the following +So far we have seen how to define the grammar of a language and we have +implicitly been using K's automatic lexer generation to generate a token for +each terminal in our grammar. +However, the grammar is not the only relevant part of parsing a language. Also +relevant is the lexical syntax of the language, i.e., how the tokens are +defined and recognized. + +Sometimes we need to define more complex lexical syntax. Consider, for +instance, integers in C: an integer consists of a decimal, octal, +or hexadecimal number, followed by an optional suffix that specifies the type +of the literal. While it’s theoretically possible to define this syntax using +a grammar, doing so would be cumbersome and tedious. Additionally, you'd be +faced with an AST generated for the literal, which is not particularly +convenient to work with. + +As an alternative, K allows you to define **token** productions, which are +[regular expressions](https://en.wikipedia.org/wiki/Regular_expression) followed +by the `token` attribute. The resulting AST would then consist of a typed string +containing the value recognized by the regular expression. + +For example, the built-in integers in K are defined using the following production: ```{.k .exclude} syntax Int ::= r"[\\+\\-]?[0-9]+" [token] ``` -Here we can see that we have defined that an integer is an optional sign -followed by a nonzero sequence of digits. The `r` preceding the terminal -indicates that what appears inside the double quotes is a regular expression, -and the `token` attribute indicates that terms which parse as this production -should be converted into a token by the parser. 
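To make this concrete: in the KAST output format a token appears as a typed string rather than as a constructor application. For a definition whose grammar includes the built-in `Int` sort, the literal `42` would show up in the AST as something like the following (a sketch, assuming K's usual `#token` notation):

```
#token("42", "Int")
```

Compare this with the constructor-style output seen earlier, e.g. `` `true_LESSON-03-A_Boolean`(.KList) ``.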
+An integer is thus an optional sign followed by a nonzero sequence of digits. +The `r` preceding the terminal indicates that what appears inside the double +quotes is a regular expression, and the `token` attribute indicates that terms +which parse as this production should be converted into a token. -It is also possible to define tokens that do not use regular expressions. This -can be useful when you wish to declare particular identifiers for use in your +Before looking at how integers in C can be defined in K, let us mention that it +is also possible to define tokens that do not use regular expressions. This can +be useful when you wish to declare particular identifiers for use in your semantics later. For example: ```{.k .exclude} syntax Id ::= "main" [token] ``` -Here, we declare that `main` is a token of sort `Id`. Instead of being parsed -as a symbol, it gets parsed as a token, generating a typed string in the AST. -This is useful in a semantics of C because the parser generally does not treat -the `main` function in C specially; only the semantics treats it specially. +Here, we declare `main` as a token of sort `Id`. Instead of being parsed as a +symbol, it gets parsed as a token, generating a typed string in the AST. +This can be useful in a semantics of C because the parser typically doesn't +handle the `main` function in any special way; it's only the semantics that +gives it special treatment. -Of course, languages can have more complex lexical syntax. For example, if we -wish to define the syntax of integers in C, we could use the following -production: +The syntax of integers in C has a more complex lexical structure than the one +of built-in integers in K, and a production defining them could look as +follows: ```{.k .exclude} syntax IntConstant ::= r"(([1-9][0-9]*)|(0[0-7]*)|(0[xX][0-9a-fA-F]+))(([uU][lL]?)|([uU]((ll)|(LL)))|([lL][uU]?)|(((ll)|(LL))[uU]?))?" [token] ``` -As you may have noted above, long and complex regular expressions -can be hard to read. They also suffer from the problem that unlike a grammar, -they are not particularly modular. - -We can get around this restriction by declaring explicit regular expressions, -giving them a name, and then referring to them in productions. +This is a long and complex regular expression, hard to read. In addition, +unlike a grammar, it is not particularly modular. However, we can get around +this restriction by declaring _explicit_ regular expressions, giving them a +name, and referring to them in productions. Consider the following (equivalent) way to define the lexical syntax of integers in C: @@ -340,48 +403,107 @@ syntax lexical LongSuffix = r"[lL]" syntax lexical LongLongSuffix = r"ll|LL" ``` -As you can see, this is rather more verbose, but it has the benefit of both -being much easier to read and understand, and also increased modularity. +As you can see, this is rather more verbose, but it has the benefit of being +easier to read and understand, as well as providing increased modularity. + Note that we refer to a named regular expression by putting the name in curly brackets. Note also that only the first sentence actually declares a new piece -of syntax in the language. When the user writes `syntax lexical`, they are only -declaring a regular expression. To declare an actual piece of syntax in the -grammar, you still must actually declare an explicit token production. +of syntax in the language. `syntax lexical` only declares an explicit regular +expression. 
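As a smaller, self-contained illustration of this pattern (a hypothetical snippet, not one of this lesson's files), a decimal-fraction literal could be declared as a token whose regular expression is assembled from a named lexical piece:

```{.k .exclude}
syntax Decimal ::= r"{Digits}\\.{Digits}" [token]

syntax lexical Digits = r"[0-9]+"
```

As before, only the first sentence declares new syntax; `Digits` is merely a named regular expression referred to in curly brackets.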
-One final note: K uses [Flex](https://github.com/westes/flex) to implement -its lexical analysis. As a result, you can refer to the +Finally, recall that K uses [Flex](https://github.com/westes/flex) to implement +its lexical analysis. As such, you can refer to the [Flex Manual](http://westes.github.io/flex/manual/Patterns.html#Patterns) -for a detailed description of the regular expression syntax supported. Note -that for performance reasons, Flex's regular expressions are actually a regular -language, and thus lack some of the syntactic convenience of modern -"regular expression" libraries. If you need features that are not part of the -syntax of Flex regular expressions, you are encouraged to express them via -a grammar instead. +for a detailed description of the regular expression syntax supported. For +performance reasons, Flex's regular expressions are actually a regular +language, and thus lack some of the syntactic convenience of modern "regular +expression" libraries. If you need features that are not part of the syntax of +Flex regular expressions, you are encouraged to express them via a grammar +instead. ## Ahead-of-time parser generation So far we have been entirely focused on K's support for just-in-time parsing, -where the parser is generated on the fly prior to being used. This benefits -from being faster to generate the parser, but it suffers in performance if you -have to repeatedly parse strings with the same parser. For this reason, it is -generally encouraged that when parsing programs, you use K's ahead-of-time -parser generation. K makes use of -[GNU Bison](https://www.gnu.org/software/bison/) to generate parsers. - -By default, you can enable ahead-of-time parsing via the `--gen-bison-parser` -flag to `kompile`. This will make use of Bison's LR(1) parser generator. As +where the parser is generated on the fly prior to being used. This method +offers faster parser generation, but its performance suffers if you have to +repeatedly parse strings with the same parser. For this reason, it is generally +encouraged that when parsing programs to use K's ahead-of-time parser +generation based on tool [GNU Bison](https://www.gnu.org/software/bison/). + +You can enable ahead-of-time parsing via the `--gen-bison-parser` flag to +`kompile`. This will make use of Bison's +[LR(1) parser](https://en.wikipedia.org/wiki/Canonical_LR_parser) generator. As such, if your grammar is not LR(1), it may not parse exactly the same as if -you were to use the just-in-time parser, because Bison will automatically pick +you were to use the just-in-time parser because Bison will automatically pick one of the possible branches whenever it encounters a shift-reduce or reduce-reduce conflict. In this case, you can either modify your grammar to be -LR(1), or you can enable use of Bison's GLR support by instead passing +LR(1), or you can use Bison's GLR support by passing flag `--gen-glr-bison-parser` to `kompile`. Note that if your grammar is ambiguous, the ahead-of-time parser will not provide you with particularly readable error messages at this time. -If you have a K definition named `foo.k`, and it generates a directory when -you run `kompile` called `foo-kompiled`, you can invoke the ahead-of-time -parser you generated by running `foo-kompiled/parser_PGM ` on a file. +`kompile --gen-bison-parser 'lesson-03-a.k'` gives + +``` +[Warning] Compiler: Could not find main syntax module with name +LESSON-03-A-SYNTAX in definition. Use --syntax-module to specify one. Using +LESSON-03-A as default. 
+[Warning] Inner Parser: Skipping modules [ML-SYNTAX] tagged as not-lr1 which +are not supported by Bison. +``` + +We now have a second warning. + +Run + +`lesson-03-a-kompiled/parser_PGM and-or.bool` + +to see that now you don't get an error when parsing. The output, minus +formatting is the following: + +``` +inj{SortBoolean{}, SortKItem{}}( + Lbl'UndsAnd-And-UndsUnds'LESSON-03-A'Unds'Boolean'Unds'Boolean'Unds'Boolean{}( + Lbltrue'Unds'LESSON-03-A'Unds'Boolean{}(), + Lbl'UndsPipePipeUndsUnds'LESSON-03-A'Unds'Boolean'Unds'Boolean'Unds'Boolean{}( + Lblfalse'Unds'LESSON-03-A'Unds'Boolean{}(), + Lblfalse'Unds'LESSON-03-A'Unds'Boolean{}() + ) + ) +) +``` + +Here we get one AST. + +Compare this with the output given when running the same command, but when the +ahead-of-time parser has been enabled with flag `--gen-glr-bison-parser`: + +``` +inj{SortBoolean{}, SortKItem{}}( + Lblamb{SortBoolean{}}( + Lbl'UndsAnd-And-UndsUnds'LESSON-03-A'Unds'Boolean'Unds'Boolean'Unds'Boolean{}( + Lbltrue'Unds'LESSON-03-A'Unds'Boolean{}(), + Lbl'UndsPipePipeUndsUnds'LESSON-03-A'Unds'Boolean'Unds'Boolean'Unds'Boolean{}( + Lblfalse'Unds'LESSON-03-A'Unds'Boolean{}(), + Lblfalse'Unds'LESSON-03-A'Unds'Boolean{}() + ) + ), + Lbl'UndsPipePipeUndsUnds'LESSON-03-A'Unds'Boolean'Unds'Boolean'Unds'Boolean{}( + Lbl'UndsAnd-And-UndsUnds'LESSON-03-A'Unds'Boolean'Unds'Boolean'Unds'Boolean{}( + Lbltrue'Unds'LESSON-03-A'Unds'Boolean{}(), + Lblfalse'Unds'LESSON-03-A'Unds'Boolean{}() + ), + Lblfalse'Unds'LESSON-03-A'Unds'Boolean{}() + ) + ) +) +``` + +In this case, we get both ASTs. + +In general, for a K definition named `foo.k`, and directory `foo-kompiled` created +when running `kompile`, you can invoke the ahead-of-time parser you generated by +executing `foo-kompiled/parser_PGM ` on a file. ## Exercises @@ -404,5 +526,5 @@ express each individual intended meaning using brackets. ## Next lesson -Once you have completed the above exercises, you can continue to +Once you have completed the exercises above, you can continue to [Lesson 1.4: Disambiguating Parses](../04_disambiguation/README.md). From f2b1e507e32f491f69592e5bf76816a98541a8e6 Mon Sep 17 00:00:00 2001 From: Iulia Bastys Date: Mon, 5 May 2025 00:11:25 +0200 Subject: [PATCH 2/2] Update Lesson 1.3 README. --- .../k-tutorial/1_basic/03_parsing/README.md | 167 ++++++++++-------- 1 file changed, 94 insertions(+), 73 deletions(-) diff --git a/k-distribution/k-tutorial/1_basic/03_parsing/README.md b/k-distribution/k-tutorial/1_basic/03_parsing/README.md index 0b987e1f4a7..44a19c2b8c4 100644 --- a/k-distribution/k-tutorial/1_basic/03_parsing/README.md +++ b/k-distribution/k-tutorial/1_basic/03_parsing/README.md @@ -4,14 +4,16 @@ copyright: Copyright (c) Runtime Verification, Inc. All Rights Reserved. # Lesson 1.3: BNF Syntax and Parser Generation -The purpose of this lesson is to explain the full syntax and semantics of -**productions** in K, as well as how productions and other syntactic -**sentences** can be used to define grammars for parsing both rules and -programs. In this context, you'll also learn about two additional types -of productions, **brackets** and **tokens**. +In this lesson we will introduce more key aspects of the syntax and +semantics of **productions** in K, and show how these, along with other +syntactic **sentences** can be used to define grammars for parsing both rules +and programs. In this context, you'll also learn about two additional types +of productions, **brakets** and **tokens**. 
## K's approach to parsing +K's grammar is divided into two components: one **outer syntax** and one +**inner syntax**. Outer syntax refers to the parsing of **requires**, K's grammar is divided into two components: one **outer syntax** and one **inner syntax**. Outer syntax refers to the parsing of **requires**, **modules**, **imports**, and **sentences** in a K definition. Inner syntax @@ -33,7 +35,7 @@ To illustrate how this works, let's consider the K module below which defines a logical calculator for evaluating Boolean expressions containing operations AND, OR, NOT, and XOR. -Input the code below into your editor as file `lesson-03-a.k`: +Save the code below in file `lesson-03-a.k`: ```k module LESSON-03-A @@ -47,15 +49,14 @@ module LESSON-03-A endmodule ``` -Observe that the productions in this file look a little different than +Observe that the productions in this module look a little different than what we have seen in the previous lesson. The reason is that K has two mechanisms for defining productions: [BNF notation](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form) and -alphanumeric identifiers. In [Lesson 1.2](../02_basics/README.md) we have -seen the latter, where the identifier was followed by a (possibly empty) -list of sorts in parentheses. In this lesson, we will learn about the -former, which is a more generic mechanism for defining the syntax of -productions. +alphanumeric identifiers. In [Lesson 1.2](../02_basics/README.md) we presented +the latter, where the identifier was followed by a (possibly empty) list of +sorts in parentheses. In this lesson, we introduce the former, which is a more +generic mechanism for defining the syntax of productions. Recall the set of productions from the previous lesson: @@ -69,8 +70,7 @@ module LESSON-03-B endmodule ``` -This definition is equivalent to the following one which specifies the same -grammar, but in BNF notation: +We can write an equivalent definition in BNF notation as follows: ```k module LESSON-03-C @@ -82,7 +82,7 @@ module LESSON-03-C endmodule ``` -You can see that the sort of the function's argument is unchanged, but +Note that sort `Fruit` of the function's argument is unchanged, but everything else has been wrapped in double quotation marks. This is because in BNF notation, we distinguish between two types of **production items**: **terminals** and **non-terminals**. A terminal represents a literal string @@ -92,25 +92,23 @@ to a sort name, like `Fruit`, and the syntax of the production they belong to accepts any valid term of that sort at that position. In the previous lesson we executed successfully the program `colorOf(Banana())` -using `krun`. `krun` parses and interprets terms according to the grammar we -have defined. `Banana()` is a term of sort `Fruit`, hence a valid argument +using `krun`. `krun` parses and interprets terms according to the grammar +defined. `Banana()` is a term of sort `Fruit`, hence a valid argument for function `colorOf`. Under the hood, the term is automatically converted into an AST (abstract syntax tree), and then the function `colorOf` is evaluated using the function rules provided in the definition. -What about the strings between the double quotes? How does K match them? -The answer is that K uses [Flex](https://github.com/westes/flex) to generate a +How does K match the strings between the double quotes? The answer is that K +uses [Flex](https://github.com/westes/flex) to generate a scanner for the grammar. 
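As a rough illustration of what the scanner does, an input such as `true && false` is chopped into three tokens before the parser ever sees them (an informal sketch, not actual tool output):

```
true && false   -->   "true"  "&&"  "false"
```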
Remember that a scanner, or lexical analyzer or lexer, is a component of an interpreter that breaks down source code into tokens, which are units such as keywords, variables, and operators. These tokens are then processed by the parser, which interprets the structure of the code -according to the grammar rules. - -Flex looks for the longest possible match of a regular expression in the input. -If there are ambiguities between two or more regular expressions, it will pick -the one with the highest `prec` attribute. You can learn more about how Flex -matching works in the -[Flex Manual](https://westes.github.io/flex/manual/Matching.html#Matching). +according to the grammar rules. Flex looks for the longest possible match of a +regular expression in the input. If there are ambiguities between two or more +regular expressions, it will pick the one with the highest `prec` attribute. +You can learn more about how Flex matching works in the +[Flex Manual | Matching](https://westes.github.io/flex/manual/Matching.html#Matching). Returning to module `LESSON-03-A`, we can see that it defines a simple BNF grammar for expressions over Booleans. We have defined constructors @@ -120,22 +118,20 @@ given a syntax for each of these functions based on their syntax in the `C` programming language. As such, we can now write programs in the simple language we have defined! -First, let's compile our grammar: +Save the code below in file `and.bool`: ``` -kompile lesson-03-a.k +true && false ``` -Recall that compilation produces a parser, interpreter, and verifier for the -grammar specified in the K definition. - -Now, save the following program as `and.bool` in the same directory: +Now, let's compile our grammar first: ``` -true && false +kompile lesson-03-a.k ``` -Interpreting this program by executing +Recall that compilation produces a parser, interpreter, and verifier for the +grammar specified in the K definition. Interpreting the program by executing ``` krun and.bool @@ -160,7 +156,7 @@ This is expected, as we have not given rules defining the meaning of the `&&` function, and the error message highlights exactly this—_Maybe attempted to evaluate a symbol with no rules?_ -We cannot interpret the program just yet, but we can _parse_ it. To do this, +While we cannot interpret the program just yet, we can _parse_ it. To do this, run the command below from the same directory: ``` @@ -179,14 +175,16 @@ inj{SortBoolean{}, SortKItem{}}( ) ``` -`kast` is K's just-in-time parser, just another tool produced at compile time. -It generates a grammar from your K definition on the fly and uses it to parse +`kast` is K's just-in-time parser, just another tool generated at compile time. +It produces a grammar from the K definition on the fly and uses it to parse the program passed on the command line. -The `--output` flag controls how the resulting AST is represented. There are -several possible values for it, and you can see all options by running +K allows for several AST representations and you can choose a specific one by +setting the `--output` flag. You can see all possible value options by running `kast --help`. `kore` used above is one of them and denotes KORE, the -intermediate representation of K. +intermediate representation of K. You can learn more about KORE in another +[tutorial](https://github.com/runtimeverification/haskell-backend/blob/master/docs/kore-syntax.md), +currently work-in-progress. Value `kast` for the flag gives us an AST in a more direct representation of the original K definition. 
@@ -206,12 +204,17 @@ yields the following output, minus the formatting: ``` Comparing both outputs, you can observe that the former is largely a -name-mangled version of the latter. However, a notable difference is -represented by the `inj` attribute in the KORE output. Keep it in mind for now, -we will talk more about it in future lessons. +name-mangled version of the latter. A notable difference is represented by the +`inj` attribute in the KORE output and you can learn more about it in the +[KORE tutorial](https://github.com/runtimeverification/haskell-backend/blob/master/docs/kore-syntax.md). + +Note that `kast` also takes expressions as arguments, not only file names, +but not both at the same time. If you want to parse an expression, you need to +use flag `-e` or `--expression`: -Note that while `krun` also accepts programs as arguments, `kast` only takes -file names as arguments. +``` +kast --output kast -e "true && false" +``` ### Exercise @@ -248,15 +251,36 @@ ambiguous. K's just-in-time parser is a GLL (generalized the full generality of context-free grammars, including those grammars which are ambiguous. An ambiguous grammar is one where the same string can be parsed as multiple distinct ASTs. In this example, it can't decide whether it should -be parsed as `(true && false) || false` or as `true && (false || false)`. +be parsed as `(true && false) || false` (Fig. 3-A) or as `true && (false || false)` +(Fig. 3-B). + +Fig. 3-A +``` + || + / \ + && false + / \ +true false +``` + +Fig. 3-B +``` + && + / \ +true || + / \ + false false +``` In Boolean logic and other programming languages such as C, logical AND has -precedence over logical OR. However, grammars defined in K assume all operators -to have the same priority in evaluation, unless specified otherwise. Hence the +precedence over logical OR, rendering the AST in Fig. 3-A the only valid one. +However, grammars defined in K assume all operators to have the same priority +in evaluation, unless specified otherwise. Both ASTs in Fig. 3-A and Fig. 3-B +are possible with the grammar we defined in module `LESSON-3-A`, hence the ambiguity reported by the parser. You will learn in the next lesson how to set up precendence of some operators over others and define the logical connectives -_the usual way_. We continue this lesson by showing how to reduce ambiguity by -using brackets. +_the usual way_. We continue this lesson by showing how to reduce ambiguity +through the use of **brackets**. ## Brackets @@ -275,9 +299,8 @@ not impose any restrictions on the grammar provided for a bracket. Like in other languages, the most common type of bracket is one in which a non-terminal is surrounded by terminals representing one of the following symbols `()`, `[]`, `{}`, or `<>`. For example, we can define the most common -type of bracket, the parentheses, quite simply. - -Consider the following modified definition and save it to file `lesson-03-d.k`: +type of bracket, the parentheses, quite simply. Consider the following modified +definition and save it to file `lesson-03-d.k`: ```k module LESSON-03-D @@ -309,11 +332,11 @@ true && (false || false) When parsing these programs with `kast`, you get a unique AST with no error. If you check the output carefully, you will notice that the bracket itself does -not appear in the AST. In fact, this is a property unique to brackets: -productions with the bracket attribute are not represented in the parsed AST of -a term, and the child of the bracket is folded immediately into the parent -term. 
This is why a bracket production must have a single non-terminal of the same -sort as the production itself. +not appear in the AST. In fact, this is a property unique to bracket +productions: they are not represented in the parsed AST of a term, and the +child of the bracket is folded immediately into the parent term. This is why we +have the requirement mentioned above, that a bracket production must have a +single non-terminal of the same sort as the production itself. ### Exercise @@ -325,19 +348,17 @@ the AST. ## Tokens So far we have seen how to define the grammar of a language and we have -implicitly been using K's automatic lexer generation to generate a token for -each terminal in our grammar. -However, the grammar is not the only relevant part of parsing a language. Also -relevant is the lexical syntax of the language, i.e., how the tokens are -defined and recognized. +implicitly been using K's automatic lexer generation to produce a token for +each terminal in our grammar. However, the grammar is not the only relevant +part of parsing a language. Also relevant is the lexical syntax of the +language, i.e., how the tokens are defined and recognized. Sometimes we need to define more complex lexical syntax. Consider, for -instance, integers in C: an integer consists of a decimal, octal, -or hexadecimal number, followed by an optional suffix that specifies the type -of the literal. While it’s theoretically possible to define this syntax using -a grammar, doing so would be cumbersome and tedious. Additionally, you'd be -faced with an AST generated for the literal, which is not particularly -convenient to work with. +instance, integers in C. They consist of a decimal, octal, or hexadecimal +number, followed by an optional suffix that specifies the type of the literal. +While it’s theoretically possible to define this syntax using a grammar, doing +so would be cumbersome and tedious. Additionally, you'd be faced with an AST +generated for the literal, which is not particularly convenient to work with. As an alternative, K allows you to define **token** productions, which are [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) followed @@ -413,7 +434,7 @@ expression. Finally, recall that K uses [Flex](https://github.com/westes/flex) to implement its lexical analysis. As such, you can refer to the -[Flex Manual](http://westes.github.io/flex/manual/Patterns.html#Patterns) +[Flex Manual | Patterns](http://westes.github.io/flex/manual/Patterns.html#Patterns) for a detailed description of the regular expression syntax supported. For performance reasons, Flex's regular expressions are actually a regular language, and thus lack some of the syntactic convenience of modern "regular @@ -426,9 +447,9 @@ instead. So far we have been entirely focused on K's support for just-in-time parsing, where the parser is generated on the fly prior to being used. This method offers faster parser generation, but its performance suffers if you have to -repeatedly parse strings with the same parser. For this reason, it is generally -encouraged that when parsing programs to use K's ahead-of-time parser -generation based on tool [GNU Bison](https://www.gnu.org/software/bison/). +repeatedly parse strings with the same parser. For this reason, when parsing +programs, it is generally recommended to use K's ahead-of-time parser +generation based on [GNU Bison](https://www.gnu.org/software/bison/). You can enable ahead-of-time parsing via the `--gen-bison-parser` flag to `kompile`. 
This will make use of Bison's