Skip to content

Commit 15668b5

Browse files
committed
Update Lesson 1.3 README.
1 parent cbe2eb9 commit 15668b5

File tree

1 file changed

+94
-73
lines changed
  • k-distribution/k-tutorial/1_basic/03_parsing

1 file changed

+94
-73
lines changed

k-distribution/k-tutorial/1_basic/03_parsing/README.md

+94-73
Original file line numberDiff line numberDiff line change
@@ -4,14 +4,16 @@ copyright: Copyright (c) Runtime Verification, Inc. All Rights Reserved.
44

55
# Lesson 1.3: BNF Syntax and Parser Generation
66

7-
The purpose of this lesson is to explain the full syntax and semantics of
8-
**productions** in K, as well as how productions and other syntactic
9-
**sentences** can be used to define grammars for parsing both rules and
10-
programs. In this context, you'll also learn about two additional types
11-
of productions, **brackets** and **tokens**.
7+
In this lesson we will introduce more key aspects of the syntax and
8+
semantics of **productions** in K, and show how these, along with other
9+
syntactic **sentences** can be used to define grammars for parsing both rules
10+
and programs. In this context, you'll also learn about two additional types
11+
of productions, **brakets** and **tokens**.
1212

1313
## K's approach to parsing
1414

15+
K's grammar is divided into two components: one **outer syntax** and one
16+
**inner syntax**. Outer syntax refers to the parsing of **requires**,
1517
K's grammar is divided into two components: one **outer syntax** and one
1618
**inner syntax**. Outer syntax refers to the parsing of **requires**,
1719
**modules**, **imports**, and **sentences** in a K definition. Inner syntax
@@ -33,7 +35,7 @@ To illustrate how this works, let's consider the K module below which defines
3335
a logical calculator for evaluating Boolean expressions containing operations
3436
AND, OR, NOT, and XOR.
3537

36-
Input the code below into your editor as file `lesson-03-a.k`:
38+
Save the code below in file `lesson-03-a.k`:
3739

3840
```k
3941
module LESSON-03-A
@@ -47,15 +49,14 @@ module LESSON-03-A
4749
endmodule
4850
```
4951

50-
Observe that the productions in this file look a little different than
52+
Observe that the productions in this module look a little different than
5153
what we have seen in the previous lesson. The reason is that K has two
5254
mechanisms for defining productions:
5355
[BNF notation](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form) and
54-
alphanumeric identifiers. In [Lesson 1.2](../02_basics/README.md) we have
55-
seen the latter, where the identifier was followed by a (possibly empty)
56-
list of sorts in parentheses. In this lesson, we will learn about the
57-
former, which is a more generic mechanism for defining the syntax of
58-
productions.
56+
alphanumeric identifiers. In [Lesson 1.2](../02_basics/README.md) we presented
57+
the latter, where the identifier was followed by a (possibly empty) list of
58+
sorts in parentheses. In this lesson, we introduce the former, which is a more
59+
generic mechanism for defining the syntax of productions.
5960

6061
Recall the set of productions from the previous lesson:
6162

@@ -69,8 +70,7 @@ module LESSON-03-B
6970
endmodule
7071
```
7172

72-
This definition is equivalent to the following one which specifies the same
73-
grammar, but in BNF notation:
73+
We can write an equivalent definition in BNF notation as follows:
7474

7575
```k
7676
module LESSON-03-C
@@ -82,7 +82,7 @@ module LESSON-03-C
8282
endmodule
8383
```
8484

85-
You can see that the sort of the function's argument is unchanged, but
85+
Note that sort `Fruit` of the function's argument is unchanged, but
8686
everything else has been wrapped in double quotation marks. This is because
8787
in BNF notation, we distinguish between two types of **production items**:
8888
**terminals** and **non-terminals**. A terminal represents a literal string
@@ -92,25 +92,23 @@ to a sort name, like `Fruit`, and the syntax of the production they
9292
belong to accepts any valid term of that sort at that position.
9393

9494
In the previous lesson we executed successfully the program `colorOf(Banana())`
95-
using `krun`. `krun` parses and interprets terms according to the grammar we
96-
have defined. `Banana()` is a term of sort `Fruit`, hence a valid argument
95+
using `krun`. `krun` parses and interprets terms according to the grammar
96+
defined. `Banana()` is a term of sort `Fruit`, hence a valid argument
9797
for function `colorOf`. Under the hood, the term is automatically converted
9898
into an AST (abstract syntax tree), and then the function `colorOf` is
9999
evaluated using the function rules provided in the definition.
100100

101-
What about the strings between the double quotes? How does K match them?
102-
The answer is that K uses [Flex](https://github.com/westes/flex) to generate a
101+
How does K match the strings between the double quotes? The answer is that K
102+
uses [Flex](https://github.com/westes/flex) to generate a
103103
scanner for the grammar. Remember that a scanner, or lexical analyzer or lexer,
104104
is a component of an interpreter that breaks down source code into tokens,
105105
which are units such as keywords, variables, and operators. These tokens are
106106
then processed by the parser, which interprets the structure of the code
107-
according to the grammar rules.
108-
109-
Flex looks for the longest possible match of a regular expression in the input.
110-
If there are ambiguities between two or more regular expressions, it will pick
111-
the one with the highest `prec` attribute. You can learn more about how Flex
112-
matching works in the
113-
[Flex Manual](https://westes.github.io/flex/manual/Matching.html#Matching).
107+
according to the grammar rules. Flex looks for the longest possible match of a
108+
regular expression in the input. If there are ambiguities between two or more
109+
regular expressions, it will pick the one with the highest `prec` attribute.
110+
You can learn more about how Flex matching works in the
111+
[Flex Manual | Matching](https://westes.github.io/flex/manual/Matching.html#Matching).
114112

115113
Returning to module `LESSON-03-A`, we can see that it defines a simple BNF
116114
grammar for expressions over Booleans. We have defined constructors
@@ -120,22 +118,20 @@ given a syntax for each of these functions based on their syntax in the `C`
120118
programming language. As such, we can now write programs in the simple language
121119
we have defined!
122120

123-
First, let's compile our grammar:
121+
Save the code below in file `and.bool`:
124122

125123
```
126-
kompile lesson-03-a.k
124+
true && false
127125
```
128126

129-
Recall that compilation produces a parser, interpreter, and verifier for the
130-
grammar specified in the K definition.
131-
132-
Now, save the following program as `and.bool` in the same directory:
127+
Now, let's compile our grammar first:
133128

134129
```
135-
true && false
130+
kompile lesson-03-a.k
136131
```
137132

138-
Interpreting this program by executing
133+
Recall that compilation produces a parser, interpreter, and verifier for the
134+
grammar specified in the K definition. Interpreting the program by executing
139135

140136
```
141137
krun and.bool
@@ -160,7 +156,7 @@ This is expected, as we have not given rules defining the meaning of the `&&`
160156
function, and the error message highlights exactly this—_Maybe attempted
161157
to evaluate a symbol with no rules?_
162158

163-
We cannot interpret the program just yet, but we can _parse_ it. To do this,
159+
While we cannot interpret the program just yet, we can _parse_ it. To do this,
164160
run the command below from the same directory:
165161

166162
```
@@ -179,14 +175,16 @@ inj{SortBoolean{}, SortKItem{}}(
179175
)
180176
```
181177

182-
`kast` is K's just-in-time parser, just another tool produced at compile time.
183-
It generates a grammar from your K definition on the fly and uses it to parse
178+
`kast` is K's just-in-time parser, just another tool generated at compile time.
179+
It produces a grammar from the K definition on the fly and uses it to parse
184180
the program passed on the command line.
185181

186-
The `--output` flag controls how the resulting AST is represented. There are
187-
several possible values for it, and you can see all options by running
182+
K allows for several AST representations and you can choose a specific one by
183+
setting the `--output` flag. You can see all possible value options by running
188184
`kast --help`. `kore` used above is one of them and denotes KORE, the
189-
intermediate representation of K.
185+
intermediate representation of K. You can learn more about KORE in another
186+
[tutorial](https://github.com/runtimeverification/haskell-backend/blob/master/docs/kore-syntax.md),
187+
currently work-in-progress.
190188
Value `kast` for the flag gives us an AST in a more direct representation of
191189
the original K definition.
192190

@@ -206,12 +204,17 @@ yields the following output, minus the formatting:
206204
```
207205

208206
Comparing both outputs, you can observe that the former is largely a
209-
name-mangled version of the latter. However, a notable difference is
210-
represented by the `inj` attribute in the KORE output. Keep it in mind for now,
211-
we will talk more about it in future lessons.
207+
name-mangled version of the latter. A notable difference is represented by the
208+
`inj` attribute in the KORE output and you can learn more about it in the
209+
[KORE tutorial](https://github.com/runtimeverification/haskell-backend/blob/master/docs/kore-syntax.md).
210+
211+
Note that `kast` also takes expressions as arguments, not only file names,
212+
but not both at the same time. If you want to parse an expression, you need to
213+
use flag `-e` or `--expression`:
212214

213-
Note that while `krun` also accepts programs as arguments, `kast` only takes
214-
file names as arguments.
215+
```
216+
kast --output kast -e "true && false"
217+
```
215218

216219
### Exercise
217220

@@ -248,15 +251,36 @@ ambiguous. K's just-in-time parser is a GLL (<u>g</u>eneralized
248251
the full generality of context-free grammars, including those grammars which
249252
are ambiguous. An ambiguous grammar is one where the same string can be parsed
250253
as multiple distinct ASTs. In this example, it can't decide whether it should
251-
be parsed as `(true && false) || false` or as `true && (false || false)`.
254+
be parsed as `(true && false) || false` (Fig. 3-A) or as `true && (false || false)`
255+
(Fig. 3-B).
256+
257+
Fig. 3-A
258+
```
259+
||
260+
/ \
261+
&& false
262+
/ \
263+
true false
264+
```
265+
266+
Fig. 3-B
267+
```
268+
&&
269+
/ \
270+
true ||
271+
/ \
272+
false false
273+
```
252274

253275
In Boolean logic and other programming languages such as C, logical AND has
254-
precedence over logical OR. However, grammars defined in K assume all operators
255-
to have the same priority in evaluation, unless specified otherwise. Hence the
276+
precedence over logical OR, rendering the AST in Fig. 3-A the only valid one.
277+
However, grammars defined in K assume all operators to have the same priority
278+
in evaluation, unless specified otherwise. Both ASTs in Fig. 3-A and Fig. 3-B
279+
are possible with the grammar we defined in module `LESSON-3-A`, hence the
256280
ambiguity reported by the parser. You will learn in the next lesson how to set
257281
up precendence of some operators over others and define the logical connectives
258-
_the usual way_. We continue this lesson by showing how to reduce ambiguity by
259-
using brackets.
282+
_the usual way_. We continue this lesson by showing how to reduce ambiguity
283+
through the use of **brackets**.
260284

261285

262286
## Brackets
@@ -275,9 +299,8 @@ not impose any restrictions on the grammar provided for a bracket.
275299
Like in other languages, the most common type of bracket is one in which a
276300
non-terminal is surrounded by terminals representing one of the following
277301
symbols `()`, `[]`, `{}`, or `<>`. For example, we can define the most common
278-
type of bracket, the parentheses, quite simply.
279-
280-
Consider the following modified definition and save it to file `lesson-03-d.k`:
302+
type of bracket, the parentheses, quite simply. Consider the following modified
303+
definition and save it to file `lesson-03-d.k`:
281304

282305
```k
283306
module LESSON-03-D
@@ -309,11 +332,11 @@ true && (false || false)
309332

310333
When parsing these programs with `kast`, you get a unique AST with no error.
311334
If you check the output carefully, you will notice that the bracket itself does
312-
not appear in the AST. In fact, this is a property unique to brackets:
313-
productions with the bracket attribute are not represented in the parsed AST of
314-
a term, and the child of the bracket is folded immediately into the parent
315-
term. This is why a bracket production must have a single non-terminal of the same
316-
sort as the production itself.
335+
not appear in the AST. In fact, this is a property unique to bracket
336+
productions: they are not represented in the parsed AST of a term, and the
337+
child of the bracket is folded immediately into the parent term. This is why we
338+
have the requirement mentioned above, that a bracket production must have a
339+
single non-terminal of the same sort as the production itself.
317340

318341
### Exercise
319342

@@ -325,19 +348,17 @@ the AST.
325348
## Tokens
326349

327350
So far we have seen how to define the grammar of a language and we have
328-
implicitly been using K's automatic lexer generation to generate a token for
329-
each terminal in our grammar.
330-
However, the grammar is not the only relevant part of parsing a language. Also
331-
relevant is the lexical syntax of the language, i.e., how the tokens are
332-
defined and recognized.
351+
implicitly been using K's automatic lexer generation to produce a token for
352+
each terminal in our grammar. However, the grammar is not the only relevant
353+
part of parsing a language. Also relevant is the lexical syntax of the
354+
language, i.e., how the tokens are defined and recognized.
333355

334356
Sometimes we need to define more complex lexical syntax. Consider, for
335-
instance, integers in C: an integer consists of a decimal, octal,
336-
or hexadecimal number, followed by an optional suffix that specifies the type
337-
of the literal. While it’s theoretically possible to define this syntax using
338-
a grammar, doing so would be cumbersome and tedious. Additionally, you'd be
339-
faced with an AST generated for the literal, which is not particularly
340-
convenient to work with.
357+
instance, integers in C. They consist of a decimal, octal, or hexadecimal
358+
number, followed by an optional suffix that specifies the type of the literal.
359+
While it’s theoretically possible to define this syntax using a grammar, doing
360+
so would be cumbersome and tedious. Additionally, you'd be faced with an AST
361+
generated for the literal, which is not particularly convenient to work with.
341362

342363
As an alternative, K allows you to define **token** productions, which are
343364
[regular expressions](https://en.wikipedia.org/wiki/Regular_expression) followed
@@ -413,7 +434,7 @@ expression.
413434

414435
Finally, recall that K uses [Flex](https://github.com/westes/flex) to implement
415436
its lexical analysis. As such, you can refer to the
416-
[Flex Manual](http://westes.github.io/flex/manual/Patterns.html#Patterns)
437+
[Flex Manual | Patterns](http://westes.github.io/flex/manual/Patterns.html#Patterns)
417438
for a detailed description of the regular expression syntax supported. For
418439
performance reasons, Flex's regular expressions are actually a regular
419440
language, and thus lack some of the syntactic convenience of modern "regular
@@ -426,9 +447,9 @@ instead.
426447
So far we have been entirely focused on K's support for just-in-time parsing,
427448
where the parser is generated on the fly prior to being used. This method
428449
offers faster parser generation, but its performance suffers if you have to
429-
repeatedly parse strings with the same parser. For this reason, it is generally
430-
encouraged that when parsing programs to use K's ahead-of-time parser
431-
generation based on tool [GNU Bison](https://www.gnu.org/software/bison/).
450+
repeatedly parse strings with the same parser. For this reason, when parsing
451+
programs, it is generally recommended to use K's ahead-of-time parser
452+
generation based on [GNU Bison](https://www.gnu.org/software/bison/).
432453

433454
You can enable ahead-of-time parsing via the `--gen-bison-parser` flag to
434455
`kompile`. This will make use of Bison's

0 commit comments

Comments
 (0)