@@ -4,14 +4,16 @@ copyright: Copyright (c) Runtime Verification, Inc. All Rights Reserved.
4
4
5
5
# Lesson 1.3: BNF Syntax and Parser Generation
6
6
7
- The purpose of this lesson is to explain the full syntax and semantics of
8
- ** productions** in K, as well as how productions and other syntactic
9
- ** sentences** can be used to define grammars for parsing both rules and
10
- programs. In this context, you'll also learn about two additional types
11
- of productions, ** brackets ** and ** tokens** .
7
+ In this lesson we will introduce more key aspects of the syntax and
8
+ semantics of ** productions** in K, and show how these, along with other
9
+ syntactic ** sentences** can be used to define grammars for parsing both rules
10
+ and programs. In this context, you'll also learn about two additional types
11
+ of productions, ** brakets ** and ** tokens** .
12
12
13
13
## K's approach to parsing
14
14
15
+ K's grammar is divided into two components: one ** outer syntax** and one
16
+ ** inner syntax** . Outer syntax refers to the parsing of ** requires** ,
15
17
K's grammar is divided into two components: one ** outer syntax** and one
16
18
** inner syntax** . Outer syntax refers to the parsing of ** requires** ,
17
19
** modules** , ** imports** , and ** sentences** in a K definition. Inner syntax
@@ -33,7 +35,7 @@ To illustrate how this works, let's consider the K module below which defines
33
35
a logical calculator for evaluating Boolean expressions containing operations
34
36
AND, OR, NOT, and XOR.
35
37
36
- Input the code below into your editor as file ` lesson-03-a.k ` :
38
+ Save the code below in file ` lesson-03-a.k ` :
37
39
38
40
``` k
39
41
module LESSON-03-A
@@ -47,15 +49,14 @@ module LESSON-03-A
47
49
endmodule
48
50
```
49
51
50
- Observe that the productions in this file look a little different than
52
+ Observe that the productions in this module look a little different than
51
53
what we have seen in the previous lesson. The reason is that K has two
52
54
mechanisms for defining productions:
53
55
[ BNF notation] ( https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form ) and
54
- alphanumeric identifiers. In [ Lesson 1.2] ( ../02_basics/README.md ) we have
55
- seen the latter, where the identifier was followed by a (possibly empty)
56
- list of sorts in parentheses. In this lesson, we will learn about the
57
- former, which is a more generic mechanism for defining the syntax of
58
- productions.
56
+ alphanumeric identifiers. In [ Lesson 1.2] ( ../02_basics/README.md ) we presented
57
+ the latter, where the identifier was followed by a (possibly empty) list of
58
+ sorts in parentheses. In this lesson, we introduce the former, which is a more
59
+ generic mechanism for defining the syntax of productions.
59
60
60
61
Recall the set of productions from the previous lesson:
61
62
@@ -69,8 +70,7 @@ module LESSON-03-B
69
70
endmodule
70
71
```
71
72
72
- This definition is equivalent to the following one which specifies the same
73
- grammar, but in BNF notation:
73
+ We can write an equivalent definition in BNF notation as follows:
74
74
75
75
``` k
76
76
module LESSON-03-C
@@ -82,7 +82,7 @@ module LESSON-03-C
82
82
endmodule
83
83
```
84
84
85
- You can see that the sort of the function's argument is unchanged, but
85
+ Note that sort ` Fruit ` of the function's argument is unchanged, but
86
86
everything else has been wrapped in double quotation marks. This is because
87
87
in BNF notation, we distinguish between two types of ** production items** :
88
88
** terminals** and ** non-terminals** . A terminal represents a literal string
@@ -92,25 +92,23 @@ to a sort name, like `Fruit`, and the syntax of the production they
92
92
belong to accepts any valid term of that sort at that position.
93
93
94
94
In the previous lesson we executed successfully the program ` colorOf(Banana()) `
95
- using ` krun ` . ` krun ` parses and interprets terms according to the grammar we
96
- have defined. ` Banana() ` is a term of sort ` Fruit ` , hence a valid argument
95
+ using ` krun ` . ` krun ` parses and interprets terms according to the grammar
96
+ defined. ` Banana() ` is a term of sort ` Fruit ` , hence a valid argument
97
97
for function ` colorOf ` . Under the hood, the term is automatically converted
98
98
into an AST (abstract syntax tree), and then the function ` colorOf ` is
99
99
evaluated using the function rules provided in the definition.
100
100
101
- What about the strings between the double quotes? How does K match them?
102
- The answer is that K uses [ Flex] ( https://github.com/westes/flex ) to generate a
101
+ How does K match the strings between the double quotes? The answer is that K
102
+ uses [ Flex] ( https://github.com/westes/flex ) to generate a
103
103
scanner for the grammar. Remember that a scanner, or lexical analyzer or lexer,
104
104
is a component of an interpreter that breaks down source code into tokens,
105
105
which are units such as keywords, variables, and operators. These tokens are
106
106
then processed by the parser, which interprets the structure of the code
107
- according to the grammar rules.
108
-
109
- Flex looks for the longest possible match of a regular expression in the input.
110
- If there are ambiguities between two or more regular expressions, it will pick
111
- the one with the highest ` prec ` attribute. You can learn more about how Flex
112
- matching works in the
113
- [ Flex Manual] ( https://westes.github.io/flex/manual/Matching.html#Matching ) .
107
+ according to the grammar rules. Flex looks for the longest possible match of a
108
+ regular expression in the input. If there are ambiguities between two or more
109
+ regular expressions, it will pick the one with the highest ` prec ` attribute.
110
+ You can learn more about how Flex matching works in the
111
+ [ Flex Manual | Matching] ( https://westes.github.io/flex/manual/Matching.html#Matching ) .
114
112
115
113
Returning to module ` LESSON-03-A ` , we can see that it defines a simple BNF
116
114
grammar for expressions over Booleans. We have defined constructors
@@ -120,22 +118,20 @@ given a syntax for each of these functions based on their syntax in the `C`
120
118
programming language. As such, we can now write programs in the simple language
121
119
we have defined!
122
120
123
- First, let's compile our grammar :
121
+ Save the code below in file ` and.bool ` :
124
122
125
123
```
126
- kompile lesson-03-a.k
124
+ true && false
127
125
```
128
126
129
- Recall that compilation produces a parser, interpreter, and verifier for the
130
- grammar specified in the K definition.
131
-
132
- Now, save the following program as ` and.bool ` in the same directory:
127
+ Now, let's compile our grammar first:
133
128
134
129
```
135
- true && false
130
+ kompile lesson-03-a.k
136
131
```
137
132
138
- Interpreting this program by executing
133
+ Recall that compilation produces a parser, interpreter, and verifier for the
134
+ grammar specified in the K definition. Interpreting the program by executing
139
135
140
136
```
141
137
krun and.bool
@@ -160,7 +156,7 @@ This is expected, as we have not given rules defining the meaning of the `&&`
160
156
function, and the error message highlights exactly this&mdash ; _ Maybe attempted
161
157
to evaluate a symbol with no rules?_
162
158
163
- We cannot interpret the program just yet, but we can _ parse_ it. To do this,
159
+ While we cannot interpret the program just yet, we can _ parse_ it. To do this,
164
160
run the command below from the same directory:
165
161
166
162
```
@@ -179,14 +175,16 @@ inj{SortBoolean{}, SortKItem{}}(
179
175
)
180
176
```
181
177
182
- ` kast ` is K's just-in-time parser, just another tool produced at compile time.
183
- It generates a grammar from your K definition on the fly and uses it to parse
178
+ ` kast ` is K's just-in-time parser, just another tool generated at compile time.
179
+ It produces a grammar from the K definition on the fly and uses it to parse
184
180
the program passed on the command line.
185
181
186
- The ` --output ` flag controls how the resulting AST is represented. There are
187
- several possible values for it, and you can see all options by running
182
+ K allows for several AST representations and you can choose a specific one by
183
+ setting the ` --output ` flag. You can see all possible value options by running
188
184
` kast --help ` . ` kore ` used above is one of them and denotes KORE, the
189
- intermediate representation of K.
185
+ intermediate representation of K. You can learn more about KORE in another
186
+ [ tutorial] ( https://github.com/runtimeverification/haskell-backend/blob/master/docs/kore-syntax.md ) ,
187
+ currently work-in-progress.
190
188
Value ` kast ` for the flag gives us an AST in a more direct representation of
191
189
the original K definition.
192
190
@@ -206,12 +204,17 @@ yields the following output, minus the formatting:
206
204
```
207
205
208
206
Comparing both outputs, you can observe that the former is largely a
209
- name-mangled version of the latter. However, a notable difference is
210
- represented by the ` inj ` attribute in the KORE output. Keep it in mind for now,
211
- we will talk more about it in future lessons.
207
+ name-mangled version of the latter. A notable difference is represented by the
208
+ ` inj ` attribute in the KORE output and you can learn more about it in the
209
+ [ KORE tutorial] ( https://github.com/runtimeverification/haskell-backend/blob/master/docs/kore-syntax.md ) .
210
+
211
+ Note that ` kast ` also takes expressions as arguments, not only file names,
212
+ but not both at the same time. If you want to parse an expression, you need to
213
+ use flag ` -e ` or ` --expression ` :
212
214
213
- Note that while ` krun ` also accepts programs as arguments, ` kast ` only takes
214
- file names as arguments.
215
+ ```
216
+ kast --output kast -e "true && false"
217
+ ```
215
218
216
219
### Exercise
217
220
@@ -248,15 +251,36 @@ ambiguous. K's just-in-time parser is a GLL (<u>g</u>eneralized
248
251
the full generality of context-free grammars, including those grammars which
249
252
are ambiguous. An ambiguous grammar is one where the same string can be parsed
250
253
as multiple distinct ASTs. In this example, it can't decide whether it should
251
- be parsed as ` (true && false) || false ` or as ` true && (false || false) ` .
254
+ be parsed as ` (true && false) || false ` (Fig. 3-A) or as ` true && (false || false) `
255
+ (Fig. 3-B).
256
+
257
+ Fig. 3-A
258
+ ```
259
+ ||
260
+ / \
261
+ && false
262
+ / \
263
+ true false
264
+ ```
265
+
266
+ Fig. 3-B
267
+ ```
268
+ &&
269
+ / \
270
+ true ||
271
+ / \
272
+ false false
273
+ ```
252
274
253
275
In Boolean logic and other programming languages such as C, logical AND has
254
- precedence over logical OR. However, grammars defined in K assume all operators
255
- to have the same priority in evaluation, unless specified otherwise. Hence the
276
+ precedence over logical OR, rendering the AST in Fig. 3-A the only valid one.
277
+ However, grammars defined in K assume all operators to have the same priority
278
+ in evaluation, unless specified otherwise. Both ASTs in Fig. 3-A and Fig. 3-B
279
+ are possible with the grammar we defined in module ` LESSON-3-A ` , hence the
256
280
ambiguity reported by the parser. You will learn in the next lesson how to set
257
281
up precendence of some operators over others and define the logical connectives
258
- _ the usual way_ . We continue this lesson by showing how to reduce ambiguity by
259
- using brackets.
282
+ _ the usual way_ . We continue this lesson by showing how to reduce ambiguity
283
+ through the use of ** brackets** .
260
284
261
285
262
286
## Brackets
@@ -275,9 +299,8 @@ not impose any restrictions on the grammar provided for a bracket.
275
299
Like in other languages, the most common type of bracket is one in which a
276
300
non-terminal is surrounded by terminals representing one of the following
277
301
symbols ` () ` , ` [] ` , ` {} ` , or ` <> ` . For example, we can define the most common
278
- type of bracket, the parentheses, quite simply.
279
-
280
- Consider the following modified definition and save it to file ` lesson-03-d.k ` :
302
+ type of bracket, the parentheses, quite simply. Consider the following modified
303
+ definition and save it to file ` lesson-03-d.k ` :
281
304
282
305
``` k
283
306
module LESSON-03-D
@@ -309,11 +332,11 @@ true && (false || false)
309
332
310
333
When parsing these programs with ` kast ` , you get a unique AST with no error.
311
334
If you check the output carefully, you will notice that the bracket itself does
312
- not appear in the AST. In fact, this is a property unique to brackets:
313
- productions with the bracket attribute are not represented in the parsed AST of
314
- a term, and the child of the bracket is folded immediately into the parent
315
- term. This is why a bracket production must have a single non-terminal of the same
316
- sort as the production itself.
335
+ not appear in the AST. In fact, this is a property unique to bracket
336
+ productions: they are not represented in the parsed AST of a term, and the
337
+ child of the bracket is folded immediately into the parent term. This is why we
338
+ have the requirement mentioned above, that a bracket production must have a
339
+ single non-terminal of the same sort as the production itself.
317
340
318
341
### Exercise
319
342
@@ -325,19 +348,17 @@ the AST.
325
348
## Tokens
326
349
327
350
So far we have seen how to define the grammar of a language and we have
328
- implicitly been using K's automatic lexer generation to generate a token for
329
- each terminal in our grammar.
330
- However, the grammar is not the only relevant part of parsing a language. Also
331
- relevant is the lexical syntax of the language, i.e., how the tokens are
332
- defined and recognized.
351
+ implicitly been using K's automatic lexer generation to produce a token for
352
+ each terminal in our grammar. However, the grammar is not the only relevant
353
+ part of parsing a language. Also relevant is the lexical syntax of the
354
+ language, i.e., how the tokens are defined and recognized.
333
355
334
356
Sometimes we need to define more complex lexical syntax. Consider, for
335
- instance, integers in C: an integer consists of a decimal, octal,
336
- or hexadecimal number, followed by an optional suffix that specifies the type
337
- of the literal. While it’s theoretically possible to define this syntax using
338
- a grammar, doing so would be cumbersome and tedious. Additionally, you'd be
339
- faced with an AST generated for the literal, which is not particularly
340
- convenient to work with.
357
+ instance, integers in C. They consist of a decimal, octal, or hexadecimal
358
+ number, followed by an optional suffix that specifies the type of the literal.
359
+ While it’s theoretically possible to define this syntax using a grammar, doing
360
+ so would be cumbersome and tedious. Additionally, you'd be faced with an AST
361
+ generated for the literal, which is not particularly convenient to work with.
341
362
342
363
As an alternative, K allows you to define ** token** productions, which are
343
364
[ regular expressions] ( https://en.wikipedia.org/wiki/Regular_expression ) followed
@@ -413,7 +434,7 @@ expression.
413
434
414
435
Finally, recall that K uses [ Flex] ( https://github.com/westes/flex ) to implement
415
436
its lexical analysis. As such, you can refer to the
416
- [ Flex Manual] ( http://westes.github.io/flex/manual/Patterns.html#Patterns )
437
+ [ Flex Manual | Patterns ] ( http://westes.github.io/flex/manual/Patterns.html#Patterns )
417
438
for a detailed description of the regular expression syntax supported. For
418
439
performance reasons, Flex's regular expressions are actually a regular
419
440
language, and thus lack some of the syntactic convenience of modern "regular
@@ -426,9 +447,9 @@ instead.
426
447
So far we have been entirely focused on K's support for just-in-time parsing,
427
448
where the parser is generated on the fly prior to being used. This method
428
449
offers faster parser generation, but its performance suffers if you have to
429
- repeatedly parse strings with the same parser. For this reason, it is generally
430
- encouraged that when parsing programs to use K's ahead-of-time parser
431
- generation based on tool [ GNU Bison] ( https://www.gnu.org/software/bison/ ) .
450
+ repeatedly parse strings with the same parser. For this reason, when parsing
451
+ programs, it is generally recommended to use K's ahead-of-time parser
452
+ generation based on [ GNU Bison] ( https://www.gnu.org/software/bison/ ) .
432
453
433
454
You can enable ahead-of-time parsing via the ` --gen-bison-parser ` flag to
434
455
` kompile ` . This will make use of Bison's
0 commit comments