**For a brief overview of Scallion and its purpose, you can watch [this video](https://tube.switch.ch/videos/f18a2692).** What follows below is a slightly more detailed description, and an example project you can use to familiarize yourself with Scallion.

## Introduction to Parser Combinators

The next part of the compiler you will be working on is the parser. The goal of the parser is to convert the sequence of tokens generated by the lexer into an Amy *abstract syntax tree* (AST).

There are many approaches to writing parsers, such as:

- Writing the parser by hand directly in the compiler's language using mutually recursive functions, or
- Writing the parser in a *domain specific language* (DSL) and using a parser generator (such as Bison) to produce the parser.

Another approach, which we will be using, is *parser combinators*. The idea behind the approach is very simple:

- Have a set of simple primitive parsers, and
- Have ways to combine them together into more and more complex parsers. Hence the name *parser combinators*.

Usually, those primitive parsers and combinators are provided as a library directly in the language used by the compiler. In our case, we will be working with **Scallion**, a Scala parser combinator library developed by *LARA*.

Parser combinators have many advantages, the main one being that they are easy to write, read, and maintain.

## Scallion Parser Combinators

### Documentation

In this document, we will introduce parser combinators in Scallion and showcase how to use them. This document is not intended to be a complete reference to Scallion. Fortunately, the library comes with a [comprehensive API](https://epfl-lara.github.io/scallion/scallion/index.html) which fulfills that role. Feel free to refer to it while working on your project!

### Playground Project

We have set up [an example project](scallion-playground.zip) that implements a lexer and parser for a simple expression language using Scallion. Feel free to experiment and play with it. The project showcases the API of Scallion and some of the more advanced combinators.

### Setup

In Scallion, parsers are defined within a trait called `Syntaxes`. This trait takes as parameters two types:

- The type of tokens, and
- The type of *token kinds*.

Token kinds represent groups of tokens. They abstract away all the details found in the actual tokens, such as positions or identifier names. Each token has a unique kind.

In our case, the tokens will be of type `Token`, which we introduced and used in the previous project. The token kinds will be `TokenKind`, which we have already defined for you.

```scala
object Parser extends Pipeline[Iterator[Token], Program] with Parsers {

  type Token = myproject.Token
  type Kind = myproject.TokenKind

  // Indicates the kind of the various tokens.
  override def getKind(token: Token): TokenKind = TokenKind.of(token)

  // Your parser implementation goes here.
}
```

The `Parsers` trait (mixed into the `Parser` object above) comes from Scallion and provides all the functions and types you will use to define your grammar and AST translation.

### Writing Parsers

When writing a parser using parser combinators, one defines many smaller parsers and combines them together into more and more complex parsers. The top-level, most complex of those parsers then defines the entire syntax of the language. In our case, that top-level parser will be called `program`.

All those parsers are objects of the type `Syntax[A]`. The type parameter `A` indicates the type of values produced by the parser. For instance, a parser of type `Syntax[Int]` produces `Int`s and a parser of type `Syntax[Expr]` produces `Expr`s.
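As a tiny sketch (using the `epsilon` and `map` constructs described later in this document), the type parameter simply tracks the kind of value a parser yields:

```scala
// epsilon(value) produces its value without consuming any input (see below).
val three: Syntax[Int] = epsilon(3)

// map transforms the produced value, and with it the type parameter (see below).
val asString: Syntax[String] = three.map(_.toString)
```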
Our top-level parser has the following signature:

```scala
lazy val program: Syntax[Program] = ...
```

Contrary to the types of tokens and token kinds, which are fixed, the type of values produced is a type parameter of the various `Syntax`es. This allows your different parsers to produce different types of values.

The various parsers are stored as `val` members of the `Parser` object. In the case of mutually dependent parsers, we use `lazy val` instead.

```scala
lazy val definition: Syntax[ClassOrFunDef] =
  functionDefinition | abstractClassDefinition | caseClassDefinition

lazy val functionDefinition: Syntax[ClassOrFunDef] = ...

lazy val abstractClassDefinition: Syntax[ClassOrFunDef] = ...

lazy val caseClassDefinition: Syntax[ClassOrFunDef] = ...
```

### Running Parsers

Parsers of type `Syntax[A]` can be converted to objects of type `Parser[A]`, which have an `apply` method that takes an iterator of tokens as parameter and returns a value of type `ParseResult[A]`, which can be one of three things:

- A `Parsed(value, rest)`, which indicates that the parser was successful and produced the value `value`. The entirety of the input iterator was consumed by the parser.
- An `UnexpectedToken(token, rest)`, which indicates that the parser encountered an unexpected token `token`. The input iterator was consumed up to the erroneous token.
- An `UnexpectedEnd(rest)`, which indicates that the end of the iterator was reached and the parser could not finish at this point. The input iterator was completely consumed.

In each case, the additional value `rest` is itself a `Parser[A]`. It represents the state of the parser after the successful parse, or at the point of error. This parser could be used to provide useful error messages or even to resume parsing.

```scala
override def run(ctx: Context)(tokens: Iterator[Token]): Program = {
  import ctx.reporter._

  val parser = Parser(program)

  parser(tokens) match {
    case Parsed(result, rest) => result
    case UnexpectedEnd(rest) => fatal("Unexpected end of input.")
    case UnexpectedToken(token, rest) => fatal("Unexpected token: " + token)
  }
}
```

### Parsers and Grammars

As you will see, parsers built using parser combinators will look a lot like grammars. However, unlike grammars, parsers not only describe the syntax of your language, but also directly specify how to turn this syntax into a value. Also, as we will see, parser combinators have a richer vocabulary than your usual *BNF* grammars.

Interestingly, a lot of concepts that you have seen on grammars, such as `FIRST` sets and nullability, can be straightforwardly transposed to parsers.

#### FIRST set

In Scallion, parsers offer a `first` method which returns the set of token kinds that are accepted as a first token.

```scala
definition.first === Set(def, abstract, case)
```

#### Nullability

Parsers have a `nullable` method which checks for the nullability of a parser. The method returns `Some(value)` if the parser would produce `value` given an empty input token sequence, and `None` if the parser would not accept the empty sequence.
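For instance (a small sketch, reusing the `definition` parser above and the `epsilon` parser described below):

```scala
// A definition always starts with a def, abstract, or case token,
// so it does not accept the empty sequence.
definition.nullable == None

// An epsilon parser consumes no input and always produces its value.
epsilon(42).nullable == Some(42)
```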
### Basic Parsers

We can now finally have a look at the toolbox at our disposal for building parsers, starting with the basic parsers. Each parser that you will write, however complex, is a combination of these basic parsers. The basic parsers play the same role as terminal symbols do in grammars.

#### Elem

The first of the basic parsers is `elem(kind)`. The function `elem` takes as argument the kind of tokens to be accepted by the parser. The value produced by the parser is the token that was matched. For instance, here is how to match against the *end-of-file* token:

```scala
val eof: Syntax[Token] = elem(EOFKind)
```

#### Accept

The function `accept` is a variant of `elem` which directly applies a transformation to the matched token when it is produced:

```scala
val identifier: Syntax[String] = accept(IdentifierKind) {
  case IdentifierToken(name) => name
}
```

#### Epsilon

The parser `epsilon(value)` produces the `value` without consuming any input. It corresponds to the *𝛆* found in grammars.

### Parser Combinators

In this section, we will see how to combine parsers together to create more complex parsers.

#### Disjunction

The first combinator we have is disjunction, which we write, for parsers `p1` and `p2`, simply as `p1 | p2`. When both `p1` and `p2` are of type `Syntax[A]`, the disjunction `p1 | p2` is also of type `Syntax[A]`. The disjunction operator is associative and commutative.

Disjunction works just as you think it does: if either of the parsers `p1` or `p2` would accept the sequence of tokens, then the disjunction also accepts the tokens. The value produced is the one produced by whichever of `p1` or `p2` accepted the input.

Note that `p1` and `p2` must have disjoint `first` sets. This restriction ensures that no ambiguities can arise and that parsing can be done efficiently.[^1] We will see later how to automatically detect when this is not the case and how to fix the issue.

#### Sequencing

The second combinator we have is sequencing. We write, for parsers `p1` and `p2`, the sequence of `p1` and `p2` as `p1 ~ p2`. When `p1` is of type `Syntax[A]` and `p2` is of type `Syntax[B]`, their sequence is of type `Syntax[A ~ B]`, where `A ~ B` is simply a pair of an `A` and a `B`.

If the parser `p1` accepts a prefix of a sequence of tokens and `p2` accepts the rest, then the parser `p1 ~ p2` accepts the entire sequence and produces the pair of values produced by `p1` and `p2`.

Note that the `first` set of `p2` should be disjoint from the `first` set of all sub-parsers in `p1` that are *nullable* and in trailing position (available via the `followLast` method). This restriction ensures that the combinator does not introduce ambiguities.

#### Transforming Values

The method `map` makes it possible to apply a transformation to the values produced by a parser. Using `map` does not influence the sequence of tokens accepted or rejected by the parser; it merely modifies the value produced. Generally, you will use `map` on a sequence of parsers, as in:

```scala
lazy val abstractClassDefinition: Syntax[ClassOrFunDef] =
  (kw("abstract") ~ kw("class") ~ identifier).map {
    case kw ~ _ ~ id => AbstractClassDef(id).setPos(kw)
  }
```

The above parser accepts abstract class definitions in Amy syntax. It does so by accepting the sequence of keywords `abstract` and `class`, followed by any identifier. The method `map` is used to convert the produced values into an `AbstractClassDef`. The position of the keyword `abstract` is used as the position of the definition.

#### Recursive Parsers

It is highly likely that some of your parsers will need to recursively invoke themselves. In this case, you should indicate that the parser is recursive using the `recursive` combinator:

```scala
lazy val expr: Syntax[Expr] = recursive {
  ...
}
```

If you were to omit it, a `StackOverflowError` would be triggered during the initialisation of your `Parser` object.

The `recursive` combinator in itself does not change the behaviour of the underlying parser.
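As an illustration (a sketch; the token kinds `OpenParenKind` and `CloseParenKind` and the `variable` parser are hypothetical), two mutually recursive parsers only need `recursive` at one point of the cycle:

```scala
// Hypothetical: an expression is either a variable or a parenthesized expression.
lazy val simpleExpr: Syntax[Expr] = recursive {
  variable | parenthesized
}

// Refers back to simpleExpr; the cycle goes through `recursive` above.
lazy val parenthesized: Syntax[Expr] =
  (elem(OpenParenKind) ~ simpleExpr ~ elem(CloseParenKind)).map {
    case _ ~ e ~ _ => e
  }
```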
The combinator is there to *tie the knot*[^2]. In practice, it is only required in very few places. In order to avoid `StackOverflowError`s during initialisation, you should make sure that all recursive parsers (stored in `lazy val`s) cannot reenter themselves without going through a `recursive` combinator somewhere along the way.

#### Other Combinators

So far, many of the combinators that we have seen, such as disjunction and sequencing, directly correspond to constructs found in `BNF` grammars. Some of the combinators that we will see now are more expressive and implement useful patterns.

##### Optional parsers using opt

The combinator `opt` makes a parser optional. The value produced by the parser is wrapped in `Some` if the parser accepts the input sequence, and is `None` otherwise.

```scala
opt(p) === p.map(Some(_)) | epsilon(None)
```

##### Repetitions using many and many1

The combinator `many` returns a parser that accepts any number of repetitions of its argument parser, including 0. The variant `many1` forces the parser to match at least once.

##### Repetitions with separators repsep and rep1sep

The combinator `repsep` returns a parser that accepts any number of repetitions of its argument parser, including 0, separated by another parser. The variant `rep1sep` forces the parser to match at least once.

The separator parser is restricted to the type `Syntax[Unit]` to ensure that important values do not get ignored. You may use `unit()` on a parser to turn its value into `Unit` if you explicitly want to ignore the values it produces.

##### Binary operators with operators

Scallion also contains combinators to easily build parsers for infix binary operators, with different associativities and priority levels. This combinator is defined in an additional trait called `Operators`, which you should mix into `Parsers` if you want to use the combinator. By default, it should already be mixed in.

```scala
val times: Syntax[String] = accept(OperatorKind("*")) {
  case _ => "*"
}

...

lazy val operation: Syntax[Expr] = operators(number)(
  // Defines the different operators, by decreasing priority.
  times | div is LeftAssociative,
  plus | minus is LeftAssociative,
  ...
) {
  // Defines how to apply the various operators.
  case (lhs, "*", rhs) => Times(lhs, rhs).setPos(lhs)
  ...
}
```

Documentation for `operators` is [available on this page](https://epfl-lara.github.io/scallion/scallion/Operators.html).

##### Upcasting

In Scallion, the type `Syntax[A]` is invariant in `A`, meaning that, even when `A` is a (strict) subtype of some type `B`, we *won't* have `Syntax[A]` be a subtype of `Syntax[B]`. To upcast a `Syntax[A]` to a `Syntax[B]` (when `A` is a subtype of `B`), you should use the `.up[B]` method.

For instance, you may need to upcast a syntax of type `Syntax[Literal[_]]` to a `Syntax[Expr]` in your assignment. To do so, simply use `.up[Expr]`.

### LL(1) Checking

In Scallion, non-LL(1) parsers can be written, but the result of applying such a parser is not specified. In practice, we therefore restrict ourselves to LL(1) parsers only. The reason behind this is that LL(1) parsers are unambiguous and can be run in time linear in the size of the input.

Writing LL(1) parsers is non-trivial. However, some of the higher-level combinators of Scallion already alleviate part of this pain. In addition, LL(1) violations can be detected before the parser is run. Syntaxes have an `isLL1` method which returns `true` if the parser is LL(1) and `false` otherwise, without needing to see any input tokens.
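For instance (a sketch of how you might use this check; the `conflicts` method is described in the next section), you can validate the whole grammar once, before parsing any input:

```scala
// Check the grammar statically, before running the parser on any tokens.
if (!program.isLL1) {
  // Print the conflicts to locate the offending parsers (see next section).
  program.conflicts.foreach(println)
  sys.error("The grammar is not LL(1).")
}
```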
#### Conflict Witnesses

In case your parser is not LL(1), the method `conflicts` of the parser will return the set of all `LL1Conflict`s. The various conflicts are:

- `NullableConflict`, which indicates that two branches of a disjunction are nullable.
- `FirstConflict`, which indicates that the `first` sets of two branches of a disjunction are not disjoint.
- `FollowConflict`, which indicates that the `first` set of a nullable parser is not disjoint from the `first` set of a parser that directly follows it.

The `LL1Conflict` objects contain fields which can help you pinpoint the exact location of conflicts in your parser and hopefully help you fix them.

The helper method `debug` prints a summary of the LL(1) conflicts of a parser. We added code to the handout skeleton so that, by default, a report is output in case of conflicts when you initialise your parser.

[^1]: Scallion is not the only parser combinator library to exist, far from it! Many of those libraries do not have this restriction. Those libraries generally need to backtrack to try the different alternatives when a branch fails.

[^2]: See [this explanation of what tying the knot means in the context of lazy languages](https://stackoverflow.com/questions/357956/explanation-of-tying-the-knot).