**For a brief overview of Scallion and its purpose, you can watch [this
video](https://tube.switch.ch/videos/f18a2692).** What follows below is
a slightly more detailed description, and an example project you can use
to familiarize yourself with Scallion.
## Introduction to Parser Combinators
The next part of the compiler you will be working on is the parser. The
goal of the parser is to convert the sequence of tokens generated by the
lexer into an Amy *abstract syntax tree* (AST).
There are many approaches to writing parsers, such as:
- Writing the parser by hand directly in the compiler's language using
mutually recursive functions, or
- Writing the parser in a *domain specific language* (DSL) and using a
parser generator (such as Bison) to produce the parser.
Another approach, which we will be using, is *parser combinators*. The
idea behind the approach is very simple:
- Have a set of simple primitive parsers, and
- Have ways to combine them together into more and more complex
parsers. Hence the name *parser combinators*.
Usually, those primitive parsers and combinators are provided as a
library directly in the language used by the compiler. In our case, we
will be working with **Scallion**, a Scala parser combinators library
developed by *LARA*.
Parser combinators have many advantages -- the main one being that
parsers written this way are easy to write, read and maintain.
## Scallion Parser Combinators
### Documentation
In this document, we will introduce parser combinators in Scallion and
showcase how to use them. This document is not intended to be a complete
reference to Scallion. Fortunately, the library comes with a
[comprehensive
API](https://epfl-lara.github.io/scallion/scallion/index.html) which
fulfills that role. Feel free to refer to it while working on your
project!
### Playground Project
We have set up [an example project](scallion-playground.zip) that
implements a lexer and parser for a simple expression language using
Scallion. Feel free to experiment and play with it. The project
showcases the API of Scallion and some of the more advanced combinators.
### Setup
In Scallion, parsers are defined within a trait called `Syntaxes`. This
trait takes as parameters two types:
- The type of tokens,
- The type of *token kinds*. Token kinds represent groups of tokens.
  They abstract away the details found in the actual tokens, such as
  positions or identifier names. Each token has a unique kind.
In our case, the tokens will be of type `Token` that we introduced and
used in the previous project. The token kinds will be `TokenKind`, which
we have already defined for you.
```scala
object Parser extends Pipeline[Iterator[Token], Program]
                 with Parsers {

  type Token = myproject.Token
  type Kind = myproject.TokenKind

  // Indicates the kind of the various tokens.
  override def getKind(token: Token): TokenKind = TokenKind.of(token)

  // Your parser implementation goes here.
}
```
The `Parsers` trait (mixed into the `Parser` object above) comes from
Scallion and provides all functions and types you will use to define
your grammar and AST translation.
### Writing Parsers
When writing a parser using parser combinators, one defines many smaller
parsers and combines them together into more and more complex parsers.
The top-level, most complex of those parsers then defines the entire
syntax for the language. In our case, that top-level parser will be
called `program`.
All those parsers are objects of the type `Syntax[A]`. The type
parameter `A` indicates the type of values produced by the parser. For
instance, a parser of type `Syntax[Int]` produces `Int`s and a parser of
type `Syntax[Expr]` produces `Expr`s. Our top-level parser has the
following signature:
```scala
lazy val program: Syntax[Program] = ...
```
Contrary to the types of tokens and token kinds, which are fixed, the
type of values produced is a type parameter of the various `Syntax`s.
This allows your different parsers to produce different types of values.
The various parsers are stored as `val` members of the `Parser` object.
In the case of mutually dependent parsers, we use `lazy val` instead.
```scala
lazy val definition: Syntax[ClassOrFunDef] =
  functionDefinition | abstractClassDefinition | caseClassDefinition

lazy val functionDefinition: Syntax[ClassOrFunDef] = ...

lazy val abstractClassDefinition: Syntax[ClassOrFunDef] = ...

lazy val caseClassDefinition: Syntax[ClassOrFunDef] = ...
```
### Running Parsers
Parsers of type `Syntax[A]` can be converted to objects of type
`Parser[A]`, which have an `apply` method which takes as parameter an
iterator of tokens and returns a value of type `ParseResult[A]`, which
can be one of three things:
- A `Parsed(value, rest)`, which indicates that the parser was
successful and produced the value `value`. The entirety of the input
iterator was consumed by the parser.
- An `UnexpectedToken(token, rest)`, which indicates that the parser
encountered an unexpected token `token`. The input iterator was
consumed up to the erroneous token.
- An `UnexpectedEnd(rest)`, which indicates that the end of the
iterator was reached and the parser could not finish at this point.
The input iterator was completely consumed.
In each case, the additional value `rest` is itself a `Parser[A]`: it
represents the state of the parser after the successful parse or at the
point of error. This parser could be used to provide useful error
messages or even to resume parsing.
```scala
override def run(ctx: Context)(tokens: Iterator[Token]): Program = {
  import ctx.reporter._

  val parser = Parser(program)

  parser(tokens) match {
    case Parsed(result, rest) => result
    case UnexpectedEnd(rest) => fatal("Unexpected end of input.")
    case UnexpectedToken(token, rest) => fatal("Unexpected token: " + token)
  }
}
```
### Parsers and Grammars
As you will see, parsers built using parser combinators will look a lot
like grammars. However, unlike grammars, parsers not only describe the
syntax of your language, but also directly specify how to turn this
syntax into a value. Also, as we will see, parser combinators have a
richer vocabulary than your usual *BNF* grammars.
Interestingly, a lot of concepts that you have seen on grammars, such as
`FIRST` sets and nullability can be straightforwardly transposed to
parsers.
#### FIRST set
In Scallion, parsers offer a `first` method which returns the set of
token kinds that are accepted as a first token.
```
definition.first === Set(def, abstract, case)
```
#### Nullability
Parsers have a `nullable` method which checks for nullability of a
parser. The method returns `Some(value)` if the parser would produce
`value` given an empty input token sequence, and `None` if the parser
would not accept the empty sequence.
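For instance, continuing the `definition` example above (informal `===` notation, as in the `first` example):

```
definition.nullable === None
```

since a definition always starts with at least one keyword token.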
### Basic Parsers
We can now finally have a look at the toolbox we have at our disposal to
build parsers, starting with the basic parsers. Each parser that you
will write, however complex, is a combination of these basic parsers.
The basic parsers play the same role as terminal symbols do in grammars.
#### Elem
The first of the basic parsers is `elem(kind)`. The function `elem`
takes as argument the kind of tokens to be accepted by the parser. The
value produced by the parser is the token that was matched. For
instance, here is how to match against the *end-of-file* token.
```scala
val eof: Syntax[Token] = elem(EOFKind)
```
#### Accept
The function `accept` is a variant of `elem` which directly applies a
transformation to the matched token when it is produced.
```scala
val identifier: Syntax[String] = accept(IdentifierKind) {
  case IdentifierToken(name) => name
}
```
#### Epsilon
The parser `epsilon(value)` is a parser that produces the `value`
without consuming any input. It corresponds to the *𝛆* found in
grammars.
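For instance (a minimal sketch):

```scala
// Produces the value 0 without consuming any tokens.
val zero: Syntax[Int] = epsilon(0)
```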
### Parser Combinators
In this section, we will see how to combine parsers together to create
more complex parsers.
#### Disjunction
The first combinator we have is disjunction, which we write, for parsers
`p1` and `p2`, simply as `p1 | p2`. When both `p1` and `p2` are of type
`Syntax[A]`, the disjunction `p1 | p2` is also of type `Syntax[A]`. The
disjunction operator is associative and commutative.
Disjunction works just as you think it does. If either of the parsers
`p1` or `p2` would accept the sequence of tokens, then the disjunction
also accepts the tokens. The value produced is the one produced by
either `p1` or `p2`.
Note that `p1` and `p2` must have disjoint `first` sets. This
restriction ensures that no ambiguities can arise and that parsing can
be done efficiently.[^1] We will see later how to automatically detect
when this is not the case and how to fix the issue.
#### Sequencing
The second combinator we have is sequencing. We write, for parsers `p1`
and `p2`, the sequence of `p1` and `p2` as `p1 ~ p2`. When `p1` is of
type `Syntax[A]` and `p2` of type `Syntax[B]`, their sequence is of type
`Syntax[A ~ B]`, where `A ~ B` is simply a pair of an `A` and a `B`.
If the parser `p1` accepts a prefix of a sequence of tokens and `p2`
accepts the remaining suffix, the parser `p1 ~ p2` accepts the entire
sequence and produces the pair of values produced by `p1` and `p2`.
Note that the `first` set of `p2` should be disjoint from the `first`
set of all sub-parsers in `p1` that are *nullable* and in trailing
position (available via the `followLast` method). This restriction
ensures that the combinator does not introduce ambiguities.
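As a small sketch (reusing the `kw` helper and `identifier` parser that appear elsewhere in this document), a sequence produces a pair whose type is written with the same `~` notation:

```scala
// A keyword followed by an identifier.
// The produced value is a pair of type Token ~ String.
lazy val classHeader: Syntax[Token ~ String] = kw("class") ~ identifier
```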
#### Transforming Values
The method `map` makes it possible to apply a transformation to the
values produced by a parser. Using `map` does not influence the sequence
of tokens accepted or rejected by the parser, it merely modifies the
value produced. Generally, you will use `map` on a sequence of parsers,
as in:
```scala
lazy val abstractClassDefinition: Syntax[ClassOrFunDef] =
  (kw("abstract") ~ kw("class") ~ identifier).map {
    case kw ~ _ ~ id => AbstractClassDef(id).setPos(kw)
  }
```
The above parser accepts abstract class definitions in Amy syntax. It
does so by accepting the sequence of keywords `abstract` and `class`,
followed by any identifier. The method `map` is used to convert the
produced values into an `AbstractClassDef`. The position of the keyword
`abstract` is used as the position of the definition.
#### Recursive Parsers
It is highly likely that some of your parsers will need to invoke
themselves recursively. In this case, you should indicate that the
parser is recursive using the `recursive` combinator:
```scala
lazy val expr: Syntax[Expr] = recursive {
  ...
}
```
If you were to omit it, a `StackOverflowError` would be thrown during
the initialisation of your `Parser` object.
The `recursive` combinator in itself does not change the behaviour of
the underlying parser. It is there to *tie the knot*[^2].
In practice, it is only required in very few places. To avoid
`StackOverflowError`s during initialisation, you should make sure that
no recursive parser (stored in a `lazy val`) can reenter itself without
going through a `recursive` combinator somewhere along the way.
#### Other Combinators
So far, many of the combinators that we have seen, such as disjunction
and sequencing, directly correspond to constructs found in `BNF`
grammars. Some of the combinators that we will see now are more
expressive and implement useful patterns.
##### Optional parsers using opt
The combinator `opt` makes a parser optional. The value produced by the
parser is wrapped in `Some` if the parser accepts the input sequence and
in `None` otherwise.
```scala
opt(p) === p.map(Some(_)) | epsilon(None)
```
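As a sketch (assuming the `kw` helper and an `expr` parser as defined elsewhere in this document), an optional `else` branch could be written as:

```scala
// Produces Some(body) if an else branch is present, None otherwise.
lazy val elseBranch: Syntax[Option[Expr]] =
  opt(kw("else") ~ expr).map {
    case Some(_ ~ body) => Some(body)
    case None           => None
  }
```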
##### Repetitions using many and many1
The combinator `many` returns a parser that accepts any number of
repetitions of its argument parser, including 0. The variant `many1`
forces the parser to match at least once.
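For instance, reusing the `definition` parser from earlier (a sketch; the repetition produces a `Seq` of the underlying values):

```scala
// Zero or more class or function definitions.
lazy val definitions: Syntax[Seq[ClassOrFunDef]] = many(definition)
```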
##### Repetitions with separators repsep and rep1sep
The combinator `repsep` returns a parser that accepts any number of
repetitions of its argument parser, including 0, separated by another
parser. The variant `rep1sep` forces the parser to match at least
once.
The separator parser is restricted to the type `Syntax[Unit]` to ensure
that important values do not get ignored. You may use `unit()` on a
parser to turn its value into `Unit` if you explicitly want to ignore
the values it produces.
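As a sketch (the `CommaKind` token kind is hypothetical, and `expr` is assumed as elsewhere in this document), a comma-separated argument list could look like:

```scala
// Turn the comma token into a Syntax[Unit] so its value is ignored.
lazy val comma: Syntax[Unit] = elem(CommaKind).unit()

// Zero or more expressions separated by commas.
lazy val arguments: Syntax[Seq[Expr]] = repsep(expr, comma)
```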
##### Binary operators with operators
Scallion also contains combinators to easily build parsers for infix
binary operators, with different associativities and priority levels.
This combinator is defined in an additional trait called `Operators`,
which you should mix into `Parsers` if you want to use the combinator.
By default, it should already be mixed-in.
```scala
val times: Syntax[String] =
  accept(OperatorKind("*")) {
    case _ => "*"
  }

...

lazy val operation: Syntax[Expr] =
  operators(number)(
    // Defines the different operators, by decreasing priority.
    times | div is LeftAssociative,
    plus | minus is LeftAssociative,
    ...
  ) {
    // Defines how to apply the various operators.
    case (lhs, "*", rhs) => Times(lhs, rhs).setPos(lhs)
    ...
  }
```
Documentation for `operators` is [available on this
page](https://epfl-lara.github.io/scallion/scallion/Operators.html).
##### Upcasting
In Scallion, the type `Syntax[A]` is invariant in `A`, meaning that,
even when `A` is a (strict) subtype of some type `B`, we *won't* have
that `Syntax[A]` is a subtype of `Syntax[B]`. To upcast a `Syntax[A]` to
a `Syntax[B]` (when `A` is a subtype of `B`), you should use the
`.up[B]` method.
For instance, you may need to upcast a syntax of type
`Syntax[Literal[_]]` to a `Syntax[Expr]` in your assignment. To do so,
simply use `.up[Expr]`.
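A sketch of that situation (the `literal` parser is hypothetical, assuming `Literal[_]` is a subtype of `Expr` in your AST):

```scala
// literal: Syntax[Literal[_]] cannot be used directly where a
// Syntax[Expr] is expected, so upcast it explicitly.
lazy val literalExpr: Syntax[Expr] = literal.up[Expr]
```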
### LL(1) Checking
In Scallion, non-LL(1) parsers can be written, but the result of
applying such a parser is not specified. In practice, we therefore
restrict ourselves only to LL(1) parsers. The reason behind this is that
LL(1) parsers are unambiguous and can be run in time linear in the input
size.
Writing LL(1) parsers is non-trivial. However, some of the higher-level
combinators of Scallion already alleviate part of this pain. In
addition, LL(1) violations can be detected before the parser is run.
Syntaxes have an `isLL1` method which returns `true` if the parser is
LL(1) and `false` otherwise, without needing to see any input tokens.
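For instance, a sketch of checking the property once, before feeding any tokens to the parser:

```scala
// Statically verify the LL(1) property of the top-level parser.
if (!program.isLL1) {
  // conflicts returns the set of LL1Conflict witnesses.
  program.conflicts.foreach(println)
}
```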
#### Conflict Witnesses
In case your parser is not LL(1), the method `conflicts` of the parser
will return the set of all `LL1Conflict`s. The various conflicts are:
- `NullableConflict`, which indicates that two branches of a
disjunction are nullable.
- `FirstConflict`, which indicates that the `first` set of two
branches of a disjunction are not disjoint.
- `FollowConflict`, which indicates that the `first` set of a nullable
parser is not disjoint from the `first` set of a parser that
directly follows it.
The `LL1Conflict` objects contain fields which can help you pinpoint
the exact location of conflicts in your parser and hopefully help you
fix them.
The helper method `debug` prints a summary of the LL(1) conflicts of a
parser. We added code to the handout skeleton so that, by default, a
report is printed in case of conflicts when you initialise your
parser.
[^1]: Scallion is not the only parser combinator library to exist, far
from it! Many of those libraries do not have this restriction. Those
libraries generally need to backtrack to try the different
alternatives when a branch fails.
[^2]: See [a good explanation of what tying the knot means in the
context of lazy
languages.](https://stackoverflow.com/questions/357956/explanation-of-tying-the-knot)