**For a brief overview of Scallion and its purpose, you can watch [this
video](https://tube.switch.ch/videos/f18a2692).** What follows below is
a slightly more detailed description, and an example project you can use
to familiarize yourself with Scallion.

## Introduction to Parser Combinators

The next part of the compiler you will be working on is the parser. The
goal of the parser is to convert the sequence of tokens generated by the
lexer into an Amy *abstract syntax tree* (AST).

There are many approaches to writing parsers, such as:

-   Writing the parser by hand directly in the compiler's language using
    mutually recursive functions, or
-   Writing the parser in a *domain specific language* (DSL) and using a
    parser generator (such as Bison) to produce the parser.

Another approach, which we will be using, is *parser combinators*. The
idea behind the approach is very simple:

-   Have a set of simple primitive parsers, and
-   Have ways to combine them together into more and more complex
    parsers. Hence the name *parser combinators*.

Usually, those primitive parsers and combinators are provided as a
library directly in the language used by the compiler. In our case, we
will be working with **Scallion**, a Scala parser combinators library
developed by *LARA*.

Parser combinators have many advantages -- the main one being that they
are easy to write, read and maintain.

## Scallion Parser Combinators

### Documentation

In this document, we will introduce parser combinators in Scallion and
showcase how to use them. This document is not intended to be a complete
reference to Scallion. Fortunately, the library comes with a
[comprehensive
API](https://epfl-lara.github.io/scallion/scallion/index.html) which
fulfills that role. Feel free to refer to it while working on your
project!

### Playground Project

We have set up [an example project](scallion-playground.zip) that
implements a lexer and parser for a simple expression language using
Scallion. Feel free to experiment and play with it. The project
showcases the API of Scallion and some of the more advanced combinators.

### Setup

In Scallion, parsers are defined within a trait called `Syntaxes`. This
trait takes as parameters two types:

-   The type of tokens,
-   The type of *token kinds*. Token kinds represent groups of tokens.
    They abstract away all the details found in the actual tokens, such
    as positions or identifier names. Each token has a unique kind.

In our case, the tokens will be of type `Token` that we introduced and
used in the previous project. The token kinds will be `TokenKind`, which
we have already defined for you.

    object Parser extends Pipeline[Iterator[Token], Program]
                     with Parsers {

      type Token = myproject.Token
      type Kind = myproject.TokenKind

      // Indicates the kind of the various tokens.
      override def getKind(token: Token): TokenKind = TokenKind.of(token)
      
      // Your parser implementation goes here.
    }

The `Parsers` trait (mixed into the `Parser` object above) comes from
Scallion and provides all functions and types you will use to define
your grammar and AST translation.

### Writing Parsers

When writing a parser using parser combinators, one defines many smaller
parsers and combines them together into more and more complex parsers.
The top-level, most complex of those parsers then defines the entire
syntax for the language. In our case, that top-level parser will be
called `program`.

All those parsers are objects of the type `Syntax[A]`. The type
parameter `A` indicates the type of values produced by the parser. For
instance, a parser of type `Syntax[Int]` produces `Int`s and a parser of
type `Syntax[Expr]` produces `Expr`s. Our top-level parser has the
following signature:

    lazy val program: Syntax[Program] = ...

Contrary to the types of tokens and token kinds, which are fixed, the
type of values produced is a type parameter of the various `Syntax`s.
This allows your different parsers to produce different types of values.

The various parsers are stored as `val` members of the `Parser` object.
In the case of mutually dependent parsers, we use `lazy val` instead.

    lazy val definition: Syntax[ClassOrFunDef] =
      functionDefinition | abstractClassDefinition | caseClassDefinition
     
    lazy val functionDefinition: Syntax[ClassOrFunDef] = ...

    lazy val abstractClassDefinition: Syntax[ClassOrFunDef] = ...

    lazy val caseClassDefinition: Syntax[ClassOrFunDef] = ...

### Running Parsers

Parsers of type `Syntax[A]` can be converted to objects of type
`Parser[A]`, which have an `apply` method which takes as parameter an
iterator of tokens and returns a value of type `ParseResult[A]`, which
can be one of three things:

-   A `Parsed(value, rest)`, which indicates that the parser was
    successful and produced the value `value`. The entirety of the input
    iterator was consumed by the parser.
-   An `UnexpectedToken(token, rest)`, which indicates that the parser
    encountered an unexpected token `token`. The input iterator was
    consumed up to the erroneous token.
-   An `UnexpectedEnd(rest)`, which indicates that the end of the
    iterator was reached and the parser could not finish at this point.
    The input iterator was completely consumed.

In each case, the additional value `rest` is itself a `Parser[A]`,
representing the state of the parser after the successful parse or at
the point of error. This parser could be used to provide useful error
messages or even to resume parsing.

    override def run(ctx: Context)(tokens: Iterator[Token]): Program = {
      import ctx.reporter._

      val parser = Parser(program)

      parser(tokens) match {
        case Parsed(result, rest) => result
        case UnexpectedEnd(rest) => fatal("Unexpected end of input.")
        case UnexpectedToken(token, rest) => fatal("Unexpected token: " + token)
      }
    }

### Parsers and Grammars

As you will see, parsers built using parser combinators will look a lot
like grammars. However, unlike grammars, parsers not only describe the
syntax of your language, but also directly specify how to turn this
syntax into a value. Also, as we will see, parser combinators have a
richer vocabulary than your usual *BNF* grammars.

Interestingly, a lot of the concepts that you have seen on grammars,
such as `FIRST` sets and nullability, straightforwardly transpose to
parsers.

#### FIRST set

In Scallion, parsers offer a `first` method which returns the set of
token kinds that are accepted as a first token.

    definition.first === Set(def, abstract, case)

#### Nullability

Parsers have a `nullable` method which checks for nullability of a
parser. The method returns `Some(value)` if the parser would produce
`value` given an empty input token sequence, and `None` if the parser
would not accept the empty sequence.
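
For instance, reusing the `definition` parser from above (and the
`epsilon` parser introduced below), we would expect the following, in
the same informal style as the `first` example:

    definition.nullable === None        // a definition needs at least one token
    epsilon(42).nullable === Some(42)   // produces 42 on empty input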

### Basic Parsers

We can now finally have a look at the toolbox at our disposal for
building parsers, starting from the basic parsers. Each parser that you
will write, however complex, is a combination of these basic parsers.
The basic parsers play the same role as terminal symbols do in grammars.

#### Elem

The first of the basic parsers is `elem(kind)`. The function `elem`
takes as argument the kind of tokens to be accepted by the parser. The
value produced by the parser is the token that was matched. For
instance, here is how to match against the *end-of-file* token.

    val eof: Syntax[Token] = elem(EOFKind)

#### Accept

The function `accept` is a variant of `elem` which directly applies a
transformation to the matched token when it is produced.

    val identifier: Syntax[String] = accept(IdentifierKind) {
      case IdentifierToken(name) => name
    }

#### Epsilon

The parser `epsilon(value)` is a parser that produces the `value`
without consuming any input. It corresponds to the *𝛆* found in
grammars.
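
As a minimal sketch, here is a parser that produces a default value
without reading any tokens (the name is illustrative):

    // Produces 0 without consuming any input.
    val zero: Syntax[Int] = epsilon(0)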

### Parser Combinators

In this section, we will see how to combine parsers together to create
more complex parsers.

#### Disjunction

The first combinator we have is disjunction, which we write, for parsers
`p1` and `p2`, simply as `p1 | p2`. When both `p1` and `p2` are of type
`Syntax[A]`, the disjunction `p1 | p2` is also of type `Syntax[A]`. The
disjunction operator is associative and commutative.

Disjunction works just as you think it does. If either of the parsers
`p1` or `p2` would accept the sequence of tokens, then the disjunction
also accepts the tokens. The value produced is the one produced by
either `p1` or `p2`.

Note that `p1` and `p2` must have disjoint `first` sets. This
restriction ensures that no ambiguities can arise and that parsing can
be done efficiently.[^1] We will see later how to automatically detect
when this is not the case and how to fix the issue.
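
As an example, assuming parsers `intLiteral`, `boolLiteral` and
`unitLiteral` of type `Syntax[Literal[_]]` are defined elsewhere (their
names are illustrative), their disjunction accepts any of the three
literal forms:

    // Accepts whichever literal form matches the input.
    lazy val literal: Syntax[Literal[_]] =
      intLiteral | boolLiteral | unitLiteral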

#### Sequencing

The second combinator we have is sequencing. We write, for parsers `p1`
and `p2`, the sequence of `p1` and `p2` as `p1 ~ p2`. When `p1` is of
type `Syntax[A]` and `p2` of type `Syntax[B]`, their sequence is of type
`Syntax[A ~ B]`, where `A ~ B` is simply a pair of an `A` and a `B`.

If the parser `p1` accepts a prefix of a sequence of tokens and `p2`
accepts the remaining suffix, the parser `p1 ~ p2` accepts the entire
sequence and produces the pair of values produced by `p1` and `p2`.

Note that the `first` set of `p2` should be disjoint from the `first`
set of all sub-parsers in `p1` that are *nullable* and in trailing
position (available via the `followLast` method). This restriction
ensures that the combinator does not introduce ambiguities.
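
As a small sketch, combining the `identifier` and `eof` parsers defined
above yields a parser that accepts a single identifier followed by the
end of the input:

    // Produces the pair of the identifier name and the EOF token.
    val idThenEof: Syntax[String ~ Token] = identifier ~ eof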

#### Transforming Values

The method `map` makes it possible to apply a transformation to the
values produced by a parser. Using `map` does not influence the sequence
of tokens accepted or rejected by the parser, it merely modifies the
value produced. Generally, you will use `map` on a sequence of parsers,
as in:

    lazy val abstractClassDefinition: Syntax[ClassOrFunDef] =
      (kw("abstract") ~ kw("class") ~ identifier).map {
        case kw ~ _ ~ id => AbstractClassDef(id).setPos(kw)
      }

The above parser accepts abstract class definitions in Amy syntax. It
does so by accepting the sequence of keywords `abstract` and `class`,
followed by any identifier. The method `map` is used to convert the
produced values into an `AbstractClassDef`. The position of the keyword
`abstract` is used as the position of the definition.

#### Recursive Parsers

It is highly likely that some of your parsers will need to invoke
themselves recursively. In this case, you should indicate that the
parser is recursive using the `recursive` combinator:

    lazy val expr: Syntax[Expr] = recursive {
      ...
    }

If you were to omit it, a `StackOverflowError` would be thrown during
the initialisation of your `Parser` object.

The `recursive` combinator in itself does not change the behaviour of
the underlying parser. It is there to *tie the knot*[^2].

In practice, it is only required in very few places. In order to avoid
`StackOverflowError`s during initialisation, you should make sure that
recursive parsers (stored in `lazy val`s) cannot reenter themselves
without going through a `recursive` combinator somewhere along the way.
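
As a hedged sketch, here is a shape your parsers might take: `expr`
reenters itself through `parenthesizedExpr`, so its body is wrapped in
`recursive` (the names and the `DelimiterKind` token kind are
illustrative assumptions):

    lazy val expr: Syntax[Expr] = recursive {
      literal.up[Expr] | parenthesizedExpr
    }

    // No `recursive` needed here: the cycle already goes through `expr`.
    lazy val parenthesizedExpr: Syntax[Expr] =
      (elem(DelimiterKind("(")) ~ expr ~ elem(DelimiterKind(")"))).map {
        case _ ~ e ~ _ => e
      }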

#### Other Combinators

So far, many of the combinators that we have seen, such as disjunction
and sequencing, directly correspond to constructs found in `BNF`
grammars. Some of the combinators that we will see now are more
expressive and implement useful patterns.

##### Optional parsers using opt

The combinator `opt` makes a parser optional. The value produced by the
parser is wrapped in `Some` if the parser accepts the input sequence and
in `None` otherwise.

    opt(p) === p.map(Some(_)) | epsilon(None)
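
For example, making the `identifier` parser from above optional:

    // Produces Some(name) when an identifier is present, None otherwise.
    val maybeId: Syntax[Option[String]] = opt(identifier)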

##### Repetitions using many and many1

The combinator `many` returns a parser that accepts any number of
repetitions of its argument parser, including 0. The variant `many1`
forces the parser to match at least once.
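
For instance, reusing the `definition` parser from above, a sketch of a
parser for a sequence of zero or more definitions:

    lazy val definitions: Syntax[Seq[ClassOrFunDef]] = many(definition)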

##### Repetitions with separators repsep and rep1sep

The combinator `repsep` returns a parser that accepts any number of
repetitions of its argument parser, including 0, separated by another
parser. The variant `rep1sep` forces the parser to match at least once.

The separator parser is restricted to the type `Syntax[Unit]` to ensure
that important values do not get ignored. You may use the `unit()`
method on a parser to turn its value into `Unit` if you explicitly want
to ignore the values it produces.
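
As a sketch, here is a parser for comma-separated expressions, such as
call arguments; the `DelimiterKind` token kind is an assumption about
your token definitions:

    // The separator must be a Syntax[Unit], hence the call to unit().
    val comma: Syntax[Unit] = elem(DelimiterKind(",")).unit()

    lazy val args: Syntax[Seq[Expr]] = repsep(expr, comma)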

##### Binary operators with operators

Scallion also contains combinators to easily build parsers for infix
binary operators, with different associativities and priority levels.
This combinator is defined in an additional trait called `Operators`,
which you should mix into `Parsers` if you want to use the combinator.
By default, it should already be mixed-in.

    val times: Syntax[String] =
      accept(OperatorKind("*")) {
        case _ => "*"
      }

    ...

    lazy val operation: Syntax[Expr] =
      operators(number)(
        // Defines the different operators, by decreasing priority.
        times | div   is LeftAssociative,
        plus  | minus is LeftAssociative,
        ...
      ) {
        // Defines how to apply the various operators.
        case (lhs, "*", rhs) => Times(lhs, rhs).setPos(lhs)
        ...
      }

Documentation for `operators` is [available on this
page](https://epfl-lara.github.io/scallion/scallion/Operators.html).

##### Upcasting

In Scallion, the type `Syntax[A]` is invariant in `A`, meaning that,
even when `A` is a (strict) subtype of some type `B`, we *won't* have
that `Syntax[A]` is a subtype of `Syntax[B]`. To upcast a `Syntax[A]` to
a `Syntax[B]` (when `A` is a subtype of `B`), you should use the
`.up[B]` method.

For instance, you may need to upcast a syntax of type
`Syntax[Literal[_]]` to a `Syntax[Expr]` in your assignment. To do so,
simply use `.up[Expr]`.
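
For example, reusing the `literal` parser sketched earlier:

    val literalExpr: Syntax[Expr] = literal.up[Expr]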

### LL(1) Checking

In Scallion, non-LL(1) parsers can be written, but the result of
applying such a parser is not specified. In practice, we therefore
restrict ourselves to LL(1) parsers. The reason behind this is that
LL(1) parsers are unambiguous and can be run in time linear in the input
size.

Writing LL(1) parsers is non-trivial. However, some of the higher-level
combinators of Scallion already alleviate part of this pain. In
addition, LL(1) violations can be detected before the parser is run.
Syntaxes have an `isLL1` method which returns `true` if the parser is
LL(1) and `false` otherwise, without needing to see any input tokens.

#### Conflict Witnesses

In case your parser is not LL(1), the method `conflicts` of the parser
will return the set of all `LL1Conflict`s. The various conflicts are:

-   `NullableConflict`, which indicates that two branches of a
    disjunction are nullable.
-   `FirstConflict`, which indicates that the `first` set of two
    branches of a disjunction are not disjoint.
-   `FollowConflict`, which indicates that the `first` set of a nullable
    parser is not disjoint from the `first` set of a parser that
    directly follows it.

The `LL1Conflict` objects contain fields which can help you pinpoint the
exact location of conflicts in your parser and hopefully help you fix
those.

The helper method `debug` prints a summary of the LL(1) conflicts of a
parser. We added code in the handout skeleton so that, by default, a
report is printed in case of conflicts when you initialise your parser.
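
If you want to run such a check yourself, a minimal sketch looks as
follows (the skeleton's actual wiring may differ):

    // Check for LL(1) violations before building the runnable parser.
    if (!program.isLL1) {
      debug(program)  // prints a summary of the conflicts
    }
    val parser = Parser(program)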

[^1]: Scallion is not the only parser combinator library to exist, far
    from it! Many of those libraries do not have this restriction. Those
    libraries generally need to backtrack to try the different
    alternatives when a branch fails.

[^2]: See [a good explanation of what tying the knot means in the
    context of lazy
    languages.](https://stackoverflow.com/questions/357956/explanation-of-tying-the-knot)