diff --git a/labs/lab03_material/scallion-playground.zip b/labs/lab03_material/scallion-playground.zip
new file mode 100644
index 0000000000000000000000000000000000000000..16beb06a5fb27e1bddb5aba39bc855e21e559b30
Binary files /dev/null and b/labs/lab03_material/scallion-playground.zip differ
diff --git a/labs/lab03_material/scallion.md b/labs/lab03_material/scallion.md
new file mode 100644
index 0000000000000000000000000000000000000000..fd0c06ce7c01430f424787b6228eed6548a9620c
--- /dev/null
+++ b/labs/lab03_material/scallion.md
@@ -0,0 +1,405 @@
+**For a brief overview of Scallion and its purpose, you can watch [this
+video](https://tube.switch.ch/videos/f18a2692).** What follows below is
+a slightly more detailed description, and an example project you can use
+to familiarize yourself with Scallion.
+
+## Introduction to Parser Combinators
+
+The next part of the compiler you will be working on is the parser. The
+goal of the parser is to convert the sequence of tokens generated by the
+lexer into an Amy *abstract syntax tree* (AST).
+
+There are many approaches to writing parsers, such as:
+
+- Writing the parser by hand directly in the compiler's language using
+  mutually recursive functions, or
+- Writing the parser in a *domain specific language* (DSL) and using a
+  parser generator (such as Bison) to produce the parser.
+
+Another approach, which we will be using, is *parser combinators*. The
+idea behind the approach is very simple:
+
+- Have a set of simple primitive parsers, and
+- Have ways to combine them together into more and more complex
+  parsers. Hence the name *parser combinators*.
+
+Usually, those primitive parsers and combinators are provided as a
+library directly in the language used by the compiler. In our case, we
+will be working with **Scallion**, a Scala parser combinator library
+developed by *LARA*.
+
+Parser combinators have many advantages -- the main one being that they
+are easy to write, read and maintain.
+
+## Scallion Parser Combinators
+
+### Documentation
+
+In this document, we will introduce parser combinators in Scallion and
+showcase how to use them. This document is not intended to be a complete
+reference to Scallion. Fortunately, the library comes with a
+[comprehensive
+API](https://epfl-lara.github.io/scallion/scallion/index.html) which
+fulfills that role. Feel free to refer to it while working on your
+project!
+
+### Playground Project
+
+We have set up [an example project](scallion-playground.zip) that
+implements a lexer and parser for a simple expression language using
+Scallion. Feel free to experiment and play with it. The project
+showcases the API of Scallion and some of the more advanced combinators.
+
+### Setup
+
+In Scallion, parsers are defined within a trait called `Syntaxes`. This
+trait takes as parameters two types:
+
+- The type of tokens,
+- The type of *token kinds*. Token kinds represent groups of tokens.
+  They abstract away all the details found in the actual tokens, such
+  as positions or identifier names. Each token has a unique kind.
+
+In our case, the tokens will be of the type `Token` that we introduced
+and used in the previous project. The token kinds will be `TokenKind`,
+which we have already defined for you.
+
+    object Parser extends Pipeline[Iterator[Token], Program]
+        with Parsers {
+
+      type Token = myproject.Token
+      type Kind = myproject.TokenKind
+
+      // Indicates the kind of the various tokens.
+      override def getKind(token: Token): TokenKind = TokenKind.of(token)
+
+      // Your parser implementation goes here.
+    }
+
+The `Parsers` trait (mixed into the `Parser` object above) comes from
+Scallion and provides all functions and types you will use to define
+your grammar and AST translation.
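+
+To build some intuition for token kinds, here is a sketch of what a
+mapping like `TokenKind.of` could look like (illustrative only -- the
+actual definition is already provided in the handout, and the
+constructor names below are made up):
+
+    def of(token: Token): TokenKind = token match {
+      // All identifiers share a single kind: the parser only cares
+      // that some identifier is present, not which one.
+      case IdentifierToken(_) => IdentifierKind
+      // Each keyword gets its own kind, since the grammar
+      // distinguishes between keywords.
+      case KeywordToken(name) => KeywordKind(name)
+      case EOFToken()         => EOFKind
+      case _                  => NoKind
+    }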
+
+### Writing Parsers
+
+When writing a parser using parser combinators, one defines many smaller
+parsers and combines them together into more and more complex parsers.
+The top-level, most complex, of those parsers then defines the entire
+syntax for the language. In our case, that top-level parser will be
+called `program`.
+
+All those parsers are objects of the type `Syntax[A]`. The type
+parameter `A` indicates the type of values produced by the parser. For
+instance, a parser of type `Syntax[Int]` produces `Int`s and a parser of
+type `Syntax[Expr]` produces `Expr`s. Our top-level parser has the
+following signature:
+
+    lazy val program: Syntax[Program] = ...
+
+Contrary to the types of tokens and token kinds, which are fixed, the
+type of values produced is a type parameter of the various `Syntax`es.
+This allows your different parsers to produce different types of values.
+
+The various parsers are stored as `val` members of the `Parser` object.
+In the case of mutually dependent parsers, we use `lazy val` instead.
+
+    lazy val definition: Syntax[ClassOrFunDef] =
+      functionDefinition | abstractClassDefinition | caseClassDefinition
+
+    lazy val functionDefinition: Syntax[ClassOrFunDef] = ...
+
+    lazy val abstractClassDefinition: Syntax[ClassOrFunDef] = ...
+
+    lazy val caseClassDefinition: Syntax[ClassOrFunDef] = ...
+
+### Running Parsers
+
+Parsers of type `Syntax[A]` can be converted to objects of type
+`Parser[A]`, whose `apply` method takes as parameter an iterator of
+tokens and returns a value of type `ParseResult[A]`, which can be one of
+three things:
+
+- A `Parsed(value, rest)`, which indicates that the parser was
+  successful and produced the value `value`. The entirety of the input
+  iterator was consumed by the parser.
+- An `UnexpectedToken(token, rest)`, which indicates that the parser
+  encountered an unexpected token `token`. The input iterator was
+  consumed up to the erroneous token.
+- An `UnexpectedEnd(rest)`, which indicates that the end of the
+  iterator was reached and the parser could not finish at this point.
+  The input iterator was completely consumed.
+
+In each case, the additional value `rest` is itself some sort of
+`Parser[A]`. That parser represents the parser after the successful
+parse or at the point of error. This parser could be used to provide
+useful error messages or even to resume parsing.
+
+    override def run(ctx: Context)(tokens: Iterator[Token]): Program = {
+      import ctx.reporter._
+
+      val parser = Parser(program)
+
+      parser(tokens) match {
+        case Parsed(result, rest) => result
+        case UnexpectedEnd(rest) => fatal("Unexpected end of input.")
+        case UnexpectedToken(token, rest) => fatal("Unexpected token: " + token)
+      }
+    }
+
+### Parsers and Grammars
+
+As you will see, parsers built using parser combinators will look a lot
+like grammars. However, unlike grammars, parsers not only describe the
+syntax of your language, but also directly specify how to turn this
+syntax into a value. Also, as we will see, parser combinators have a
+richer vocabulary than your usual *BNF* grammars.
+
+Interestingly, a lot of concepts that you have seen on grammars, such as
+`FIRST` sets and nullability, can be straightforwardly transposed to
+parsers.
+
+#### FIRST set
+
+In Scallion, parsers offer a `first` method which returns the set of
+token kinds that are accepted as a first token.
+
+    definition.first === Set(def, abstract, case)
+
+#### Nullability
+
+Parsers have a `nullable` method which checks for nullability of a
+parser. The method returns `Some(value)` if the parser would produce
+`value` given an empty input token sequence, and `None` if the parser
+would not accept the empty sequence.
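+
+For instance (a sketch in the same informal `===` notation as above,
+using the `epsilon` and `opt` combinators introduced further below):
+
+    definition.nullable === None        // a definition cannot be empty
+    epsilon(42).nullable === Some(42)   // epsilon produces its value on empty input
+    opt(definition).nullable === Some(None)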
+
+### Basic Parsers
+
+We can now finally have a look at the toolbox we have at our disposal to
+build parsers, starting from the basic parsers. Each parser that you
+will write, however complex, is a combination of these basic parsers.
+The basic parsers play the same role as terminal symbols do in grammars.
+
+#### Elem
+
+The first of the basic parsers is `elem(kind)`. The function `elem`
+takes as argument the kind of tokens to be accepted by the parser. The
+value produced by the parser is the token that was matched. For
+instance, here is how to match against the *end-of-file* token.
+
+    val eof: Syntax[Token] = elem(EOFKind)
+
+#### Accept
+
+The function `accept` is a variant of `elem` which directly applies a
+transformation to the matched token when it is produced.
+
+    val identifier: Syntax[String] = accept(IdentifierKind) {
+      case IdentifierToken(name) => name
+    }
+
+#### Epsilon
+
+The parser `epsilon(value)` is a parser that produces the `value`
+without consuming any input. It corresponds to the *ε* found in
+grammars.
+
+### Parser Combinators
+
+In this section, we will see how to combine parsers together to create
+more complex parsers.
+
+#### Disjunction
+
+The first combinator we have is disjunction, which we write, for parsers
+`p1` and `p2`, simply `p1 | p2`. When both `p1` and `p2` are of type
+`Syntax[A]`, the disjunction `p1 | p2` is also of type `Syntax[A]`. The
+disjunction operator is associative and commutative.
+
+Disjunction works just as you think it does. If either of the parsers
+`p1` or `p2` would accept the sequence of tokens, then the disjunction
+also accepts the tokens. The value produced is the one produced by
+either `p1` or `p2`.
+
+Note that `p1` and `p2` must have disjoint `first` sets. This
+restriction ensures that no ambiguities can arise and that parsing can
+be done efficiently.[^1] We will see later how to automatically detect
+when this is not the case and how to fix the issue.
+
+#### Sequencing
+
+The second combinator we have is sequencing. We write, for parsers `p1`
+and `p2`, the sequence of `p1` and `p2` as `p1 ~ p2`. When `p1` is of
+type `Syntax[A]` and `p2` of type `Syntax[B]`, their sequence is of type
+`Syntax[A ~ B]`, where `A ~ B` is simply a pair of an `A` and a `B`.
+
+If the parser `p1` accepts the prefix of a sequence of tokens and `p2`
+accepts the postfix, the parser `p1 ~ p2` accepts the entire sequence
+and produces the pair of values produced by `p1` and `p2`.
+
+Note that the `first` set of `p2` should be disjoint from the `first`
+set of all sub-parsers in `p1` that are *nullable* and in trailing
+position (available via the `followLast` method). This restriction
+ensures that the combinator does not introduce ambiguities.
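+
+For example (a sketch -- `KeywordKind` is an illustrative kind name, and
+`identifier` is the parser defined above), a sequence produces a value
+of the pair type `~`, which can later be taken apart by pattern
+matching:
+
+    val kwClass: Syntax[Token] = elem(KeywordKind("class"))
+
+    // Accepts the keyword `class` followed by an identifier, and
+    // produces both values as a pair.
+    val classAndName: Syntax[Token ~ String] = kwClass ~ identifier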
+
+#### Transforming Values
+
+The method `map` makes it possible to apply a transformation to the
+values produced by a parser. Using `map` does not influence the sequence
+of tokens accepted or rejected by the parser; it merely modifies the
+value produced. Generally, you will use `map` on a sequence of parsers,
+as in:
+
+    lazy val abstractClassDefinition: Syntax[ClassOrFunDef] =
+      (kw("abstract") ~ kw("class") ~ identifier).map {
+        case kw ~ _ ~ id => AbstractClassDef(id).setPos(kw)
+      }
+
+The above parser accepts abstract class definitions in Amy syntax. It
+does so by accepting the sequence of keywords `abstract` and `class`,
+followed by any identifier. The method `map` is used to convert the
+produced values into an `AbstractClassDef`. The position of the keyword
+`abstract` is used as the position of the definition.
+
+#### Recursive Parsers
+
+It is highly likely that some of your parsers will need to invoke
+themselves recursively. In this case, you should indicate that the
+parser is recursive using the `recursive` combinator:
+
+    lazy val expr: Syntax[Expr] = recursive {
+      ...
+    }
+
+If you were to omit it, a `StackOverflowError` would be thrown during
+the initialisation of your `Parser` object.
+
+The `recursive` combinator in itself does not change the behaviour of
+the underlying parser. It is there to *tie the knot*[^2].
+
+In practice, it is only required in very few places. In order to avoid
+`StackOverflowError`s during initialisation, you should make sure that
+recursive parsers (stored in `lazy val`s) cannot reenter themselves
+without going through a `recursive` combinator somewhere along the way.
+
+#### Other Combinators
+
+So far, many of the combinators that we have seen, such as disjunction
+and sequencing, directly correspond to constructs found in `BNF`
+grammars. Some of the combinators that we will see now are more
+expressive and implement useful patterns.
+
+##### Optional parsers using opt
+
+The combinator `opt` makes a parser optional. The value produced by the
+parser is wrapped in `Some` if the parser accepts the input sequence and
+in `None` otherwise.
+
+    opt(p) === p.map(Some(_)) | epsilon(None)
+
+##### Repetitions using many and many1
+
+The combinator `many` returns a parser that accepts any number of
+repetitions of its argument parser, including 0. The variant `many1`
+forces the parser to match at least once.
+
+##### Repetitions with separators repsep and rep1sep
+
+The combinator `repsep` returns a parser that accepts any number of
+repetitions of its argument parser, including 0, separated by another
+parser (see the sketch after the `operators` example below). The variant
+`rep1sep` forces the parser to match at least once.
+
+The separator parser is restricted to the type `Syntax[Unit]` to ensure
+that important values do not get ignored. You may use `unit()` on a
+parser to turn its value into `Unit` if you explicitly want to ignore
+the values it produces.
+
+##### Binary operators with operators
+
+Scallion also contains combinators to easily build parsers for infix
+binary operators, with different associativities and priority levels.
+This combinator is defined in an additional trait called `Operators`,
+which you should mix into `Parsers` if you want to use the combinator.
+By default, it should already be mixed in.
+
+    val times: Syntax[String] =
+      accept(OperatorKind("*")) {
+        case _ => "*"
+      }
+
+    ...
+
+    lazy val operation: Syntax[Expr] =
+      operators(number)(
+        // Defines the different operators, by decreasing priority.
+        times | div is LeftAssociative,
+        plus | minus is LeftAssociative,
+        ...
+      ) {
+        // Defines how to apply the various operators.
+        case (lhs, "*", rhs) => Times(lhs, rhs).setPos(lhs)
+        ...
+      }
+
+Documentation for `operators` is [available on this
+page](https://epfl-lara.github.io/scallion/scallion/Operators.html).
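+
+As promised in the section on `repsep`, here is a sketch of parsing
+comma-separated argument lists (assuming a `CommaKind` token kind and an
+`expr` parser; the names are illustrative):
+
+    // The separator must produce Unit: unit() discards the comma token.
+    val comma: Syntax[Unit] = elem(CommaKind).unit()
+
+    // Zero or more expressions separated by commas.
+    lazy val args: Syntax[Seq[Expr]] = repsep(expr, comma)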
+
+##### Upcasting
+
+In Scallion, the type `Syntax[A]` is invariant in `A`, meaning that,
+even when `A` is a (strict) subtype of some type `B`, we *won't* have
+that `Syntax[A]` is a subtype of `Syntax[B]`. To upcast a `Syntax[A]` to
+a syntax `Syntax[B]` (when `A` is a subtype of `B`), you should use the
+`.up[B]` method.
+
+For instance, you may need to upcast a syntax of type
+`Syntax[Literal[_]]` to a `Syntax[Expr]` in your assignment. To do so,
+simply use `.up[Expr]`.
+
+### LL(1) Checking
+
+In Scallion, non-LL(1) parsers can be written, but the result of
+applying such a parser is not specified. In practice, we therefore
+restrict ourselves to LL(1) parsers. The reason behind this is that
+LL(1) parsers are unambiguous and can be run in time linear in the input
+size.
+
+Writing LL(1) parsers is non-trivial. However, some of the higher-level
+combinators of Scallion already alleviate part of this pain. In
+addition, LL(1) violations can be detected before the parser is run.
+Syntaxes have an `isLL1` method which returns `true` if the parser is
+LL(1) and `false` otherwise, without needing to see any input tokens.
+
+#### Conflict Witnesses
+
+In case your parser is not LL(1), the method `conflicts` of the parser
+will return the set of all `LL1Conflict`s. The various conflicts are:
+
+- `NullableConflict`, which indicates that two branches of a
+  disjunction are nullable.
+- `FirstConflict`, which indicates that the `first` sets of two
+  branches of a disjunction are not disjoint.
+- `FollowConflict`, which indicates that the `first` set of a nullable
+  parser is not disjoint from the `first` set of a parser that
+  directly follows it.
+
+The `LL1Conflict` objects contain fields which can help you pinpoint
+the exact location of conflicts in your parser and hopefully help you
+fix them.
+
+The helper method `debug` prints a summary of the LL(1) conflicts of a
+parser. We added code in the handout skeleton so that, by default, a
+report is printed in case of conflicts when you initialise your parser.
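+
+As an illustration of a `FirstConflict` and a typical fix (a sketch, not
+from the handout; `variable`, `call`, `argumentList` and the AST
+constructors are hypothetical), consider two alternatives that both
+start with an identifier token:
+
+    // Not LL(1): both branches start with an identifier,
+    // so their first sets overlap.
+    // lazy val factor: Syntax[Expr] = variable | call
+
+    // LL(1): factor out the common identifier prefix, then decide
+    // based on whether an argument list follows.
+    lazy val factor: Syntax[Expr] =
+      (identifier ~ opt(argumentList)).map {
+        case name ~ None       => Variable(name)
+        case name ~ Some(args) => Call(name, args)
+      }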
+
+[^1]: Scallion is not the only parser combinator library to exist, far
+    from it! Many of those libraries do not have this restriction. Those
+    libraries generally need to backtrack to try the different
+    alternatives when a branch fails.
+
+[^2]: See [a good explanation of what tying the knot means in the
+    context of lazy
+    languages.](https://stackoverflow.com/questions/357956/explanation-of-tying-the-knot)
diff --git a/labs/labs_03.md b/labs/labs_03.md
new file mode 100644
index 0000000000000000000000000000000000000000..a540c734723aa8fd6e1915f0512b68cd17c1587c
--- /dev/null
+++ b/labs/labs_03.md
@@ -0,0 +1,174 @@
+# Lab 03: Parser
+
+## Introduction
+
+Starting from this week you will work on the second stage of the Amy
+compiler, the parser. The task of the parser is to take a sequence of
+tokens produced by the lexer and transform it into an Abstract Syntax
+Tree (AST).
+
+For this purpose you will write a grammar for Amy programs in a Domain
+Specific Language (DSL) that can be embedded in Scala. Similarly to what
+you have seen in the Lexer lab, each grammar rule will also be
+associated with a transformation function that maps the parse result to
+an AST. The overall grammar will then be used to automatically parse
+sequences of tokens into Amy ASTs, while abstracting away extraneous
+syntactical details, such as commas and parentheses.
+
+As you have seen (and will see) in the lectures, there are various
+algorithms to parse syntax trees corresponding to context-free grammars.
+Any context-free grammar (after some normalization) can be parsed using
+the CYK algorithm. However, this algorithm is rather slow: its
+complexity is in O(n^3 * g), where n is the size of the program and g
+the size of the grammar. On the other hand, inputs of a more restricted
+LL(1) grammar can be parsed in linear time. Thus, the goal of this lab
+will be to develop an LL(1) version of the Amy grammar.
+
+### The Parser Combinator DSL
+
+In the previous lab you already started working with **Silex**, the
+library we used to tokenize program inputs based on a prioritized list
+of regular expressions. In this lab we will start using its companion
+library, **Scallion**: Once an input string has been tokenized, Scallion
+allows us to parse the token stream using the rules of an LL(1) grammar
+and translate it to a target data structure, such as an AST.
+
+To familiarize yourself with the parsing functionality of Scallion,
+please make sure you read the [Introduction to (Scallion) Parser
+Combinators](lab03_material/scallion.md). In it, you will learn how to
+describe grammars in Scallion's parser combinator DSL and how to ensure
+that your grammar lies in LL(1) (which Scallion requires to function
+correctly).
+
+Once you understand parser combinators, you can get to work on your own
+implementation of an Amy parser in `Parser.scala`. Note that in this lab
+you will essentially operate on two data structures: Your parser will
+consume a sequence of `Token`s (defined in `Tokens.scala`) and produce
+an AST (as defined by `NominalTreeModule` in `TreeModule.scala`). To
+accomplish this, you will have to define appropriate parsing rules and
+translation functions for Scallion.
+
+In `Parser.scala` you will already find a number of parsing rules given
+to you, including the starting non-terminal `program`. Others, such as
+`expr`, are stubs (marked by `???`) that you will have to complete
+yourself. Make sure to take advantage of Scallion's various helpers,
+such as the `operators` method, which simplifies defining operators of
+different precedence and associativity.
+
+### An LL(1) grammar for Amy
+
+As usual, the [Amy specification](amy specification) will guide you when
+it comes to deciding what exactly should be accepted by your parser.
+Carefully read Section 2 (*Syntax*).
+
+Note that the EBNF grammar in Figure 2 merely represents an
+over-approximation of Amy's true grammar -- it is too imprecise to be
+useful for parsing: Firstly, the grammar in Figure 2 is ambiguous. That
+is, it allows multiple ways to parse an expression. E.g. `x + y * z`
+could be parsed as either `(x + y) * z` or as `x + (y * z)`. In other
+words, the grammar doesn't enforce either operator precedence or
+associativity correctly. Additionally, the restrictions mentioned
+throughout Section 2 of the specification are not followed.
+
+Your task is thus to come up with appropriate rules that encode Amy's
+true grammar. Furthermore, this grammar should be LL(1) for reasons of
+efficiency. Scallion will read your grammar, examine whether it is in
+LL(1), and, if so, parse input programs. If Scallion determines that the
+grammar is not in LL(1), it will report an error. You can also instruct
+Scallion to generate some counter-examples for you (see the `checkLL1`
+function).
+
+### Translating to ASTs
+
+Scallion will parse a sequence of tokens according to the grammar you
+provide; however, without additional help, it does not know how to build
+Amy ASTs. For instance, a (nonsensical) grammar that only accepts
+sequences of identifier tokens, e.g.
+
+    many(elem(IdentifierKind)): Syntax[Seq[Token]]
+
+will be useful in deciding whether the input matches the expected form,
+but will simply return the tokens unchanged when parsing succeeds.
+
+Scallion does allow you to map parse results from one type to another,
+however. For instance, in the above example we might want to provide a
+function `f(idTokens: Seq[Token]): Seq[Variable]` that transforms the
+identifier tokens into (Amy-AST) variables of those names.
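+
+Such a transformation is attached with `map` (a sketch; `Variable` is
+the nominal AST node, and matching only on `IdentifierToken` is safe
+here because tokens of `IdentifierKind` are always identifiers):
+
+    val variables: Syntax[Seq[Variable]] =
+      many(elem(IdentifierKind)).map(
+        _.map { case id @ IdentifierToken(name) => Variable(name).setPos(id) }
+      )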
+
+For more information on how to use Scallion's `Syntax#map` method,
+please refer to the [Scallion introduction](lab03_material/scallion.md).
+
+## Notes
+
+### Understanding the AST: Nominal vs. Symbolic Trees
+
+If you check the TreeModule file containing the ASTs, you will notice it
+is structured in an unusual way: There is a `TreeModule` class extended
+by `NominalTreeModule` and `SymbolicTreeModule`. The reason for this
+design is that we need two very similar ASTs, but with different types
+representing names in each case: Just after parsing (this assignment),
+all names are just Strings and qualified names are essentially pairs of
+Strings. We call ASTs that only use such String-based names `Nominal` --
+the variant we will be using in this lab. Later, during name analysis,
+these names will be resolved to unique identifiers, e.g. two variables
+that refer to different definitions will be distinct, even if they have
+the same name. For now you can just look at the TreeModule and
+substitute the types that are not defined there (`Name` and
+`QualifiedName`) with their definitions inside `NominalTreeModule`.
+
+### Positions
+
+As you will notice in the code we provide, all generated ASTs have their
+position set. The position of each node of the AST is defined as its
+starting position. It is important that you set the positions in all the
+trees that you create for better error reporting later. Although our
+testing infrastructure cannot directly check for the presence of
+positions, we will check it manually.
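+
+Concretely (a sketch -- `LiteralKind`, `IntLitToken` and the exact AST
+constructors may differ in your handout), the position usually comes
+from the first token of the construct:
+
+    lazy val intLiteral: Syntax[Expr] =
+      accept(LiteralKind) {
+        // The literal node gets the position of the matched token.
+        case tok @ IntLitToken(value) => IntLiteral(value).setPos(tok)
+      }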
+
+### Pretty Printing
+
+Along with the stubs, we provide a printer for Amy ASTs. It will print
+parentheses around all expressions so you can clearly see how your
+parser interprets precedence and associativity. You can use it to test
+your parser, and it will also be used during our testing to compare the
+output of your parser with the reference parser.
+
+## Skeleton
+
+As usual, you can find the skeleton in the git repository. This lab
+builds on your previous work, so -- given your implementation of the
+lexer -- you will only unpack two files from the skeleton.
+
+The structure of your project `src` directory should be as follows:
+
+    amyc
+    ├── Main.scala (updated)
+    │
+    ├── ast (new)
+    │   ├── Identifier.scala
+    │   ├── Printer.scala
+    │   └── TreeModule.scala
+    │
+    ├── lib
+    │   ├── scallion_3.0.6.jar (new)
+    │   └── silex_3.0.6.jar
+    │
+    ├── parsing
+    │   ├── Parser.scala (new)
+    │   ├── Lexer.scala
+    │   └── Tokens.scala
+    │
+    └── utils
+        ├── AmycFatalError.scala
+        ├── Context.scala
+        ├── Document.scala
+        ├── Pipeline.scala
+        ├── Position.scala
+        ├── Reporter.scala
+        └── UniqueCounter.scala
+
+## Deliverables
+
+You have TBD weeks to complete this assignment.
+
+**Deadline: Wednesday TBD October, 23:00**