Commit c8e29099 authored by Samuel Chassot

cleanup

parent 270694d4
Showing with 0 additions and 2082 deletions
# Lab 02: Lexer ([Slides](lab02-slides.pdf))
This assignment is the first stage of the Amy compiler.
## Code Skeleton
In this lab you will start your own compiler from scratch, meaning that you will no longer rely on the compiler frontend which was previously provided to you as a jar file. In this lab you will build the lexical analysis phase (`lexer`). Since practically none of the compiler's code will be shared with the previous lab, the new branch (clplab2) contains a fresh skeleton.
Compared to the previous lab, the structure of your src directory will be as follows:
```
amyc
 ├── Main.scala                       (updated)
 ├── lib
 │    ├── amy-frontend_2.12-1.7.jar   (removed)
 │    └── silex_2.12-0.5.jar          (new)
 ├── parsing                          (new)
 │    ├── Lexer.scala
 │    └── Tokens.scala
 └── utils                            (new)
      ├── AmycFatalError.scala
      ├── Context.scala
      ├── Document.scala
      ├── Env.scala
      ├── Pipeline.scala
      ├── Position.scala
      ├── Reporter.scala
      └── UniqueCounter.scala
```
This lab will focus on the following two files:
* `src/amyc/parsing/Tokens.scala`: list of tokens and token kinds.
* `src/amyc/parsing/Lexer.scala`: skeleton for the `Lexer` phase.
Below you will find the instructions for the second lab assignment, in which you will implement a lexer for the Amy language. If you haven't looked at the [Labs Setup](https://gitlab.epfl.ch/lara/cs320/-/blob/main/labs/labs-setup.md) page yet, please do so before starting out with the assignment.
## A Lexer for Amy
The role of a lexer is to read the input text and convert it to a list of tokens. Tokens are the smallest useful units in a source file: a name referring to a variable, a bracket, a keyword etc. The role of the lexer is to group together those useful units (e.g. return the keyword `else` as a unit, as opposed to the individual characters `e`, `l`, `s`, `e`) and to abstract away all useless information (i.e. whitespace, comments).
## Code structure
You can find the `lexer` in the `Lexer.scala` file. It is based on Scallion and Silex, a pair of Scala libraries which simplify the implementation of parsing pipelines. Silex allows you to transform an input character stream (such as the contents of an Amy source file) into a sequence of Tokens. We are going to take a closer look at Scallion in the next lab, where our goal will be to build Amy's parser. You can find more information on Scallion and Silex [here](https://github.com/epfl-lara/scallion), but we also included a short reference of Silex's API in `Lexer.scala`.
The Lexer has the following components:
* The public method is `run`. It just calls `lexer.spawn(source)` for every input file and concatenates the results.
* `lexer` is the Silex-based definition of tokenization rules. Each rule corresponds to a regular expression matching a prefix of the remaining program input. Silex will compose all of these rules into one finite state machine and apply the maximum-munch rule you've seen in class.
* Whenever a rule is found to match a (maximal) prefix of the remaining input, Scallion will call the transformation function provided using the `|>` operator in the rule. This function is given the matched input characters (`cs`) along with positional information (`range`) and should then produce an instance of `Token`. You can find its definition in `Tokens.scala`, which includes a list of all the different kinds of tokens that your Amy compiler should process. For instance, `KeywordToken("if")` represents an occurrence of the reserved word `if` in a program.
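
To make the maximum-munch rule concrete, here is a minimal, self-contained sketch in plain Scala. It does not use Silex: the `rules` list, the `lex` function and the simplified token classes are all hypothetical stand-ins chosen for illustration. At each position, every rule is tried against the remaining input, the longest match wins, and ties go to the rule listed first (which is how `if` becomes a keyword while `iffy` stays an identifier).

```scala
object MaxMunchSketch {
  // Simplified stand-ins for the token classes of Tokens.scala.
  sealed trait Token
  case class KeywordToken(value: String) extends Token
  case class IdentifierToken(name: String) extends Token
  case class SpaceToken() extends Token
  case class ErrorToken(content: String) extends Token

  // Rules in priority order: a regular expression and a token constructor.
  private val rules: List[(scala.util.matching.Regex, String => Token)] = List(
    ("if|else".r, (s: String) => KeywordToken(s)),
    ("[a-zA-Z][a-zA-Z0-9_]*".r, (s: String) => IdentifierToken(s)),
    ("\\s+".r, (_: String) => SpaceToken())
  )

  def lex(input: String): List[Token] = {
    val out = List.newBuilder[Token]
    var pos = 0
    while (pos < input.length) {
      val rest = input.substring(pos)
      // Keep the longest prefix match; an earlier rule wins on equal length.
      val best = rules.foldLeft(Option.empty[(String, String => Token)]) {
        case (acc, (regex, make)) =>
          regex.findPrefixOf(rest) match {
            case Some(m) if m.nonEmpty && acc.forall(_._1.length < m.length) =>
              Some((m, make))
            case _ => acc
          }
      }
      best match {
        case Some((m, make)) =>
          out += make(m)
          pos += m.length
        case None =>
          // No rule applies: report one bad character and keep lexing.
          out += ErrorToken(rest.take(1))
          pos += 1
      }
    }
    out.result()
  }
}
```

Note how the sketch also keeps going after an unknown character instead of giving up, which mirrors the error-recovery behaviour expected from your lexer.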
For more details on how to write new rules, read the short introduction to Silex's API at the top of `Lexer.scala` or consider the examples on the Scallion website. You can also refer to [Silex's Scaladoc page](https://epfl-lara.github.io/silex/).
Your task is to complete the rules in `Lexer.scala` and implement the filtering of irrelevant tokens.
## Notes
Here are some details you should pay attention to:
* Make sure you recognize keywords as their own token kind. `if`, for instance, should be lexed as a token `KeywordToken("if")`, not as an identifier with the content `if`.
* Make sure you correctly register the position of all tokens. Note the `range` parameter of the transformer functions. Once you have created a token, use `setPos(range._1)` to associate it with its position in the program source.
* In general, it is good to output as many errors as possible (this will be helpful to whoever uses your compiler, including yourself). Your lexer should therefore not give up after the first error, but rather skip the erroneous token, emit an error message, and then continue lexing. Scallion takes care of this for you for the most part. However, there are certain inputs that you might explicitly want to map to `ErrorToken`, such as unclosed multi-line comments.
* The Lexer does not immediately read and return all tokens; it returns an `Iterator[Token]` that will be used by future phases to read tokens on demand.
* Comments and whitespace should not produce tokens. (The most convenient way of doing this in Scallion is to first produce dedicated tokens and then filter them out later; see the related TODO in `Lexer.scala`.)
* Returned tokens should be fresh instances of the appropriate `Token` subclass. Value tokens (tokens that carry a value, such as identifiers) need to be constructed with the appropriate value.
* Make sure to correctly implement the Amy lexing rules for literals and identifiers.
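
The "produce dedicated tokens, then filter them out" idea from the notes above can be sketched as follows. This is only an illustration in plain Scala: the token classes are simplified stand-ins and `dropIrrelevant` is a hypothetical helper name, not part of the skeleton. Because `Iterator.filter` is lazy, tokens are still produced on demand.

```scala
object TokenFilterSketch {
  // Simplified stand-ins for the token classes of Tokens.scala.
  sealed trait Token
  case class IdentifierToken(name: String) extends Token
  case class SpaceToken() extends Token
  case class CommentToken(text: String) extends Token

  // Lazily drops whitespace and comment tokens; all other tokens pass through.
  def dropIrrelevant(tokens: Iterator[Token]): Iterator[Token] =
    tokens.filter {
      case SpaceToken() | CommentToken(_) => false
      case _ => true
    }
}
```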
## Example Output
For reference, here is a possible output for the example under `examples/Hello.scala`. You can always get reference output for the lexer from the reference compiler by typing
```
java -jar amyc-assembly-1.7.jar --printTokens <files>
```
```
KeywordToken(object)(1:1)
IdentifierToken(Hello)(1:8)
DelimiterToken({)(1:14)
IdentifierToken(Std)(2:3)
DelimiterToken(.)(2:6)
IdentifierToken(printString)(2:7)
DelimiterToken(()(2:18)
StringLitToken(Good morning!)(2:19)
DelimiterToken())(2:34)
DelimiterToken(})(3:1)
EOFToken()(4:1)
```
## Deliverables
Deadline: **Friday October 21 at 11pm**.
Submission: push the solved lab 2 to the branch `clplab2` that was created on your Gitlab repo. Do not push the changes to `clplab1`! It may interfere with your submission for lab 1.
You may want to copy the files you changed directly to the new branch, since the two branches don't share a history in git.
File deleted
# Lab 03: Parser ([Slides](lab03-slides.pdf))
## Introduction
Starting from this week you will work on the second stage of the Amy
compiler, the parser. The task of the parser is to take a sequence of
tokens produced by the lexer and transform it into an Abstract Syntax
Tree (AST).
For this purpose you will write a grammar for Amy programs in a Domain
Specific Language (DSL) that can be embedded in Scala. Similarly to what
you have seen in the Lexer lab, each grammar rule will also be
associated with a transformation function that maps the parse result to
an AST. The overall grammar will then be used to automatically parse
sequences of tokens into Amy ASTs, while abstracting away extraneous
syntactical details, such as commas and parentheses.
As you have seen (and will see) in the lectures, there are various
algorithms to parse syntax trees corresponding to context-free grammars.
Any context-free grammar (after some normalization) can be parsed using
the CYK algorithm. However, this algorithm is rather slow: its
complexity is in `O(n^3 * g)`, where n is the size of the program and g
the size of the grammar. On the other hand, inputs of a more restricted
LL(1) grammar can be parsed in linear time. Thus, the goal of this lab
will be to develop an LL(1) version of the Amy grammar.
### The Parser Combinator DSL
In the previous lab you already started working with **Silex**, which
was the library we used to tokenize program inputs based on a
prioritized list of regular expressions. In this lab we will start using
its companion library, **Scallion**: Once an input string has been
tokenized, Scallion allows us to parse the token stream using the rules
of an LL(1) grammar and translate to a target data structure, such as an
AST.
To familiarize yourself with the parsing functionality of Scallion,
please make sure you read the [Introduction to (Scallion) Parser
Combinators](material/scallion.md). In it, you will learn how to describe grammars
in Scallion's parser combinator DSL and how to ensure that your grammar
lies in LL(1) (which Scallion requires to function correctly).
Once you understand parser combinators, you can get to work on your own
implementation of an Amy parser in `Parser.scala`. Note that in this lab
you will essentially operate on two data structures: Your parser will
consume a sequence of `Token`s (defined in `Tokens.scala`) and produce
an AST (as defined by `NominalTreeModule` in `TreeModule.scala`). To
accomplish this, you will have to define appropriate parsing rules and
translation functions for Scallion.
In `Parser.scala` you will already find a number of parsing rules given
to you, including the starting non-terminal `program`. Others, such as
`expr` are stubs (marked by `???`) that you will have to complete
yourself. Make sure to take advantage of Scallion's various helpers
such as the `operators` method that simplifies defining operators of
different precedence and associativity.
### An LL(1) grammar for Amy
As usual, the [Amy specification](/labs/amy-specification/amy-specification.pdf) will guide you when
it comes to deciding what exactly should be accepted by your parser.
Carefully read Section 2 (*Syntax*).
Note that the EBNF grammar in Figure 2 merely represents an
over-approximation of Amy's true grammar; it is too imprecise to be
useful for parsing. Firstly, the grammar in Figure 2 is ambiguous. That
is, it allows multiple ways to parse an expression. E.g. `x + y * z`
could be parsed as either `(x + y) * z` or as `x + (y * z)`. In other
words, the grammar doesn't enforce either operator precedence or
associativity correctly. Additionally, the restrictions mentioned
throughout Section 2 of the specification are not followed.
Your task is thus to come up with appropriate rules that encode Amy's
true grammar. Furthermore, this grammar should be LL(1) for reasons of
efficiency. Scallion will read your grammar, examine if it is in LL(1),
and, if so, parse input programs. If Scallion determines that the
grammar is not in LL(1), it will report an error. You can also instruct
Scallion to generate some counter-examples for you (see the `checkLL1`
function).
### Translating to ASTs
Scallion will parse a sequence of tokens according to the grammar you
provide, however, without additional help, it does not know how to build
Amy ASTs. For instance, a (non-sensical) grammar that only accepts
sequences of identifier tokens, e.g.

    many(elem(IdentifierKind)): Syntax[Seq[Token]]

will be useful in deciding whether the input matches the expected form,
but will simply return the tokens unchanged when parsing succeeds.
Scallion does allow you to map parse results from one type to another,
however. For instance, in the above example we might want to provide a
function `f(idTokens: Seq[Token]): Seq[Variable]` that transforms the
identifier tokens into (Amy-AST) variables of those names.
For more information on how to use Scallion's `Syntax#map` method
please refer to the [Scallion introduction](material/scallion.md).
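
As a rough illustration of such a translation function, here is a self-contained sketch of `f` in plain Scala. The `Token`, `IdentifierToken` and `Variable` definitions below are simplified stand-ins, not the real classes from `Tokens.scala` and `TreeModule.scala`; with Scallion, a function of this shape is what you would hand to `map`.

```scala
object MapSketch {
  // Simplified stand-ins for the real token and AST classes.
  sealed trait Token
  case class IdentifierToken(name: String) extends Token
  case class Variable(name: String)

  // Turns identifier tokens into variables of the same names.
  def f(idTokens: Seq[Token]): Seq[Variable] =
    idTokens.collect { case IdentifierToken(name) => Variable(name) }
}
```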
## Notes
### Understanding the AST: Nominal vs. Symbolic Trees
If you check the TreeModule file containing the ASTs, you will notice it
is structured in an unusual way: There is a `TreeModule` class extended
by `NominalTreeModule` and `SymbolicTreeModule`. The reason for this
design is that we need two very similar ASTs, but with different types
representing names in each case: Just after parsing (this assignment),
all names are just Strings and qualified names are essentially pairs of
Strings. We call ASTs that only use such String-based names `Nominal`;
this is the variant we will be using in this lab. Later, during name
analysis, these names will be resolved to unique identifiers, e.g. two
variables that refer to different definitions will be distinct, even if
they have the same name. For now you can just look at the TreeModule and
substitute the types that are not defined there (`Name` and
`QualifiedName`) with their definitions inside `NominalTreeModule`.
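
The design can be pictured with the following stripped-down sketch. It is not the actual content of `TreeModule.scala`: the two tree classes, the `Identifier` class and the object names `Nominal` and `Symbolic` are assumptions made for illustration. The point is that the tree definitions are written once against abstract name types, and each variant only fixes those types.

```scala
object TreeModuleSketch {
  trait TreeModule {
    // Left abstract here; each variant decides how names are represented.
    type Name
    type QualifiedName

    case class Variable(name: Name)
    case class Call(qname: QualifiedName, args: List[Variable])
  }

  // After parsing: names are plain strings, qualified names are pairs of strings.
  object Nominal extends TreeModule {
    type Name = String
    type QualifiedName = (String, String)
  }

  // Hypothetical unique identifier: compared by reference, not by its string.
  final class Identifier(val name: String)

  // After name analysis: every name is a unique identifier.
  object Symbolic extends TreeModule {
    type Name = Identifier
    type QualifiedName = Identifier
  }
}
```

With these definitions, `Nominal.Variable("x")` equals any other nominal variable named `x`, whereas two symbolic variables built from distinct `Identifier` instances differ even if both identifiers carry the string `x`.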
### Positions
As you will notice in the code we provide, all generated ASTs have their
position set. The position of each node of the AST is defined as its
starting position. It is important that you set the positions in all the
trees that you create for better error reporting later. Although our
testing infrastructure cannot directly check for presence of positions,
we will check it manually.
### Pretty Printing
Along with the stubs, we provide a printer for Amy ASTs. It will print
parentheses around all expressions so you can clearly see how your
parser interprets precedence and associativity. You can use it to test
your parser, and it will also be used during our testing to compare the
output of your parser with the reference parser.
## Skeleton
As usual, you can find the skeleton in the git repository. This lab
builds on your previous work, so, given your implementation of the
lexer, you will only unpack two files from the skeleton.
The structure of your project `src` directory should be as follows:

    amyc
     ├── Main.scala                (updated)
     ├── ast                       (new)
     │    ├── Identifier.scala
     │    ├── Printer.scala
     │    └── TreeModule.scala
     ├── lib
     │    ├── scallion_3.0.6.jar   (new)
     │    └── silex_3.0.6.jar
     ├── parsing
     │    ├── Parser.scala         (new)
     │    ├── Lexer.scala
     │    └── Tokens.scala
     └── utils
          ├── AmycFatalError.scala
          ├── Context.scala
          ├── Document.scala
          ├── Pipeline.scala
          ├── Position.scala
          ├── Reporter.scala
          └── UniqueCounter.scala

## Reference compiler
Recall you can use the [reference compiler](/labs/amy_reference_compiler.md) for any doubts you have on the intended behaviour. For this lab you can use the command:
```
java -jar amyc-assembly-1.7.jar --printTrees <files>
```
## Deliverables
Deadline: **Friday November 4 at 11pm**.
Submission: push the solved lab 3 to the branch `clplab3` that was created on your Gitlab repo. Do not push the changes to other branches! It may interfere with your previous submissions.
You may want to copy the files you changed directly to the new branch, since the two branches don't share a history in git.
File deleted
File deleted
**For a brief overview of Scallion and its purpose, you can watch [this
video](https://tube.switch.ch/videos/f18a2692).** What follows below is
a slightly more detailed description, and an example project you can use
to familiarize yourself with Scallion.
## Introduction to Parser Combinators
The next part of the compiler you will be working on is the parser. The
goal of the parser is to convert the sequence of tokens generated by the
lexer into an Amy *abstract syntax tree* (AST).
There are many approaches to writing parsers, such as:
- Writing the parser by hand directly in the compiler's language using
mutually recursive functions, or
- Writing the parser in a *domain specific language* (DSL) and using a
parser generator (such as Bison) to produce the parser.
Another approach, which we will be using, is *parser combinators*. The
idea behind the approach is very simple:
- Have a set of simple primitive parsers, and
- Have ways to combine them together into more and more complex
parsers. Hence the name *parser combinators*.
Usually, those primitive parsers and combinators are provided as a
library directly in the language used by the compiler. In our case, we
will be working with **Scallion**, a Scala parser combinators library
developed by *LARA*.
Parser combinators have many advantages, the main one being that they
are easy to write, read and maintain.
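
To give a feel for the idea before turning to Scallion, here is a tiny self-contained sketch of parser combinators in plain Scala. It is deliberately *not* Scallion: the type `P` and the functions `elem`, `or`, `seq` and `map` are hypothetical, parsers are plain functions over a list of string tokens, and `or` simply tries its alternatives in order rather than relying on LL(1) properties.

```scala
object CombinatorSketch {
  // A parser consumes a prefix of the tokens and may produce a value plus the rest.
  type P[A] = List[String] => Option[(A, List[String])]

  // Primitive parser: accepts exactly one expected token.
  def elem(expected: String): P[String] = {
    case tok :: rest if tok == expected => Some((tok, rest))
    case _ => None
  }

  // Combinator: try p1, and fall back to p2 if p1 fails.
  def or[A](p1: P[A], p2: P[A]): P[A] = in => p1(in).orElse(p2(in))

  // Combinator: run p1, then p2 on the remaining tokens, and pair the results.
  def seq[A, B](p1: P[A], p2: P[B]): P[(A, B)] = in =>
    p1(in).flatMap { case (a, rest1) =>
      p2(rest1).map { case (b, rest2) => ((a, b), rest2) }
    }

  // Combinator: transform the produced value without changing what is accepted.
  def map[A, B](p: P[A])(f: A => B): P[B] = in =>
    p(in).map { case (a, rest) => (f(a), rest) }

  // Larger parsers are assembled from smaller ones.
  val boolLit: P[Boolean] = map(or(elem("true"), elem("false")))(_ == "true")
  val twoBools: P[(Boolean, Boolean)] = seq(boolLit, boolLit)
}
```

Here `twoBools` is built entirely out of `elem`, `or`, `map` and `seq`, which is the sense in which complex parsers are combinations of simple ones.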
## Scallion Parser Combinators
### Documentation
In this document, we will introduce parser combinators in Scallion and
showcase how to use them. This document is not intended to be a complete
reference to Scallion. Fortunately, the library comes with a
[comprehensive
API](https://epfl-lara.github.io/scallion/scallion/index.html) which
fulfills that role. Feel free to refer to it while working on your
project!
### Playground Project
We have set up [an example project](scallion-playground.zip) that
implements a lexer and parser for a simple expression language using
Scallion. Feel free to experiment and play with it. The project
showcases the API of Scallion and some of the more advanced combinators.
### Setup
In Scallion, parsers are defined within a trait called `Syntaxes`. This
trait takes as parameters two types:
- The type of tokens,
- The type of *token kinds*. Token kinds represent groups of tokens.
They abstract away all the details found in the actual tokens, such
    as for instance positions or identifier names. Each token has a
unique kind.
In our case, the tokens will be of type `Token` that we introduced and
used in the previous project. The token kinds will be `TokenKind`, which
we have already defined for you.

    object Parser extends Pipeline[Iterator[Token], Program]
                  with Parsers {

      type Token = myproject.Token
      type Kind = myproject.TokenKind

      // Indicates the kind of the various tokens.
      override def getKind(token: Token): TokenKind = TokenKind.of(token)

      // Your parser implementation goes here.
    }

The `Parsers` trait (mixed into the `Parser` object above) comes from
Scallion and provides all functions and types you will use to define
your grammar and AST translation.
### Writing Parsers
When writing a parser using parser combinators, one defines many smaller
parsers and combines them together into more and more complex parsers.
The top-level, most complex, of those parsers then defines the entire
syntax for the language. In our case, that top-level parser will be
called `program`.
All those parsers are objects of the type `Syntax[A]`. The type
parameter `A` indicates the type of values produced by the parser. For
instance, a parser of type `Syntax[Int]` produces `Int`s and a parser of
type `Syntax[Expr]` produces `Expr`s. Our top-level parser has the
following signature:

    lazy val program: Syntax[Program] = ...

Contrary to the types of tokens and token kinds, which are fixed, the
type of values produced is a type parameter of the various `Syntax`s.
This allows your different parsers to produce different types of values.
The various parsers are stored as `val` members of the `Parser` object.
In the case of mutually dependent parsers, we use `lazy val` instead.

    lazy val definition: Syntax[ClassOrFunDef] =
      functionDefinition | abstractClassDefinition | caseClassDefinition

    lazy val functionDefinition: Syntax[ClassOrFunDef] = ...
    lazy val abstractClassDefinition: Syntax[ClassOrFunDef] = ...
    lazy val caseClassDefinition: Syntax[ClassOrFunDef] = ...

### Running Parsers
Parsers of type `Syntax[A]` can be converted to objects of type
`Parser[A]`, which have an `apply` method which takes as parameter an
iterator of tokens and returns a value of type `ParseResult[A]`, which
can be one of three things:
- A `Parsed(value, rest)`, which indicates that the parser was
successful and produced the value `value`. The entirety of the input
iterator was consumed by the parser.
- An `UnexpectedToken(token, rest)`, which indicates that the parser
encountered an unexpected token `token`. The input iterator was
consumed up to the erroneous token.
- An `UnexpectedEnd(rest)`, which indicates that the end of the
iterator was reached and the parser could not finish at this point.
The input iterator was completely consumed.
In each case, the additional value `rest` is itself some sort of a
`Parser[A]`. That parser represents the parser after the successful
parse or at the point of error. This parser could be used to provide
useful error messages or even to resume parsing.

    override def run(ctx: Context)(tokens: Iterator[Token]): Program = {
      import ctx.reporter._

      val parser = Parser(program)

      parser(tokens) match {
        case Parsed(result, rest) => result
        case UnexpectedEnd(rest) => fatal("Unexpected end of input.")
        case UnexpectedToken(token, rest) => fatal("Unexpected token: " + token)
      }
    }

### Parsers and Grammars
As you will see, parsers built using parser combinators will look a lot
like grammars. However, unlike grammars, parsers not only describe the
syntax of your language, but also directly specify how to turn this
syntax into a value. Also, as we will see, parser combinators have a
richer vocabulary than your usual *BNF* grammars.
Interestingly, a lot of concepts that you have seen on grammars, such as
`FIRST` sets and nullability can be straightforwardly transposed to
parsers.
#### FIRST set
In Scallion, parsers offer a `first` method which returns the set of
token kinds that are accepted as a first token.

    definition.first === Set(def, abstract, case)

#### Nullability
Parsers have a `nullable` method which checks for nullability of a
parser. The method returns `Some(value)` if the parser would produce
`value` given an empty input token sequence, and `None` if the parser
would not accept the empty sequence.
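
If you want to see how these two notions are computed, the following self-contained sketch defines them over a tiny, non-recursive syntax datatype. Everything in it (`Syn`, `Elem`, `Eps`, `Or`, `Seq2`) is a hypothetical stand-in and not Scallion's implementation; in particular, this `nullable` only answers yes or no instead of returning the produced value, and recursive syntaxes (which need a fixpoint computation) are not handled.

```scala
object FirstNullableSketch {
  sealed trait Syn
  case class Elem(kind: String) extends Syn      // accepts one token of that kind
  case object Eps extends Syn                    // accepts the empty sequence
  case class Or(l: Syn, r: Syn) extends Syn      // disjunction
  case class Seq2(l: Syn, r: Syn) extends Syn    // sequencing

  // Does the syntax accept the empty token sequence?
  def nullable(s: Syn): Boolean = s match {
    case Elem(_)    => false
    case Eps        => true
    case Or(l, r)   => nullable(l) || nullable(r)
    case Seq2(l, r) => nullable(l) && nullable(r)
  }

  // Which token kinds can start an accepted sequence?
  def first(s: Syn): Set[String] = s match {
    case Elem(k)    => Set(k)
    case Eps        => Set.empty
    case Or(l, r)   => first(l) ++ first(r)
    // The right-hand side only contributes if the left-hand side can be skipped.
    case Seq2(l, r) => if (nullable(l)) first(l) ++ first(r) else first(l)
  }
}
```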
### Basic Parsers
We can now finally have a look at the toolbox we have at our disposal
to build parsers, starting from the basic parsers. Each parser that you
will write, however complex, is a combination of these basic parsers.
The basic parsers play the same role as terminal symbols do in grammars.
#### Elem
The first of the basic parsers is `elem(kind)`. The function `elem`
takes as argument the kind of tokens to be accepted by the parser. The
value produced by the parser is the token that was matched. For
instance, here is how to match against the *end-of-file* token.

    val eof: Syntax[Token] = elem(EOFKind)

#### Accept
The function `accept` is a variant of `elem` which directly applies a
transformation to the matched token when it is produced.

    val identifier: Syntax[String] = accept(IdentifierKind) {
      case IdentifierToken(name) => name
    }

#### Epsilon
The parser `epsilon(value)` is a parser that produces the `value`
without consuming any input. It corresponds to the *𝛆* found in
grammars.
### Parser Combinators
In this section, we will see how to combine parsers together to create
more complex parsers.
#### Disjunction
The first combinator we have is disjunction, that we write, for parsers
`p1` and `p2`, simply `p1 | p2`. When both `p1` and `p2` are of type
`Syntax[A]`, the disjunction `p1 | p2` is also of type `Syntax[A]`. The
disjunction operator is associative and commutative.
Disjunction works just as you think it does. If either of the parsers
`p1` or `p2` would accept the sequence of tokens, then the disjunction
also accepts the tokens. The value produced is the one produced by
either `p1` or `p2`.
Note that `p1` and `p2` must have disjoint `first` sets. This
restriction ensures that no ambiguities can arise and that parsing can
be done efficiently.[^1] We will see later how to automatically detect
when this is not the case and how to fix the issue.
#### Sequencing
The second combinator we have is sequencing. We write, for parsers `p1`
and `p2`, the sequence of `p1` and `p2` as `p1 ~ p2`. When `p1` is of
type `Syntax[A]` and `p2` of type `Syntax[B]`, their sequence is of type
`Syntax[A ~ B]`, where `A ~ B` is simply a pair of an `A` and a `B`.
If the parser `p1` accepts the prefix of a sequence of tokens and `p2`
accepts the postfix, the parser `p1 ~ p2` accepts the entire sequence
and produces the pair of values produced by `p1` and `p2`.
Note that the `first` set of `p2` should be disjoint from the `first`
set of all sub-parsers in `p1` that are *nullable* and in trailing
position (available via the `followLast` method). This restriction
ensures that the combinator does not introduce ambiguities.
#### Transforming Values
The method `map` makes it possible to apply a transformation to the
values produced by a parser. Using `map` does not influence the sequence
of tokens accepted or rejected by the parser, it merely modifies the
value produced. Generally, you will use `map` on a sequence of parsers,
as in:

    lazy val abstractClassDefinition: Syntax[ClassOrFunDef] =
      (kw("abstract") ~ kw("class") ~ identifier).map {
        case kw ~ _ ~ id => AbstractClassDef(id).setPos(kw)
      }

The above parser accepts abstract class definitions in Amy syntax. It
does so by accepting the sequence of keywords `abstract` and `class`,
followed by any identifier. The method `map` is used to convert the
produced values into an `AbstractClassDef`. The position of the keyword
`abstract` is used as the position of the definition.
#### Recursive Parsers
It is highly likely that some of your parsers will require to
recursively invoke themselves. In this case, you should indicate that
the parser is recursive using the `recursive` combinator:

    lazy val expr: Syntax[Expr] = recursive {
      ...
    }

If you were to omit it, a `StackOverflowError` would be thrown
during the initialisation of your `Parser` object.
The `recursive` combinator in itself does not change the behaviour of
the underlying parser. It is there to *tie the knot*[^2].
In practice, it is only required in very few places. In order to avoid
stack overflows during initialisation, you should make sure that no
recursive parser (stored in a `lazy val`) is able to reenter itself
without going through a `recursive` combinator somewhere along the way.
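
Why a combinator like `recursive` avoids the problem can be seen in the following self-contained sketch. It is not Scallion's implementation: `Syn`, `Recursive` and this `recursive` function are hypothetical stand-ins. The key ingredient is the by-name parameter, which delays building the body until after the `lazy val` has finished initialising.

```scala
object RecursiveSketch {
  sealed trait Syn
  case class Elem(kind: String) extends Syn
  case class Or(l: Syn, r: Syn) extends Syn
  case class Seq2(l: Syn, r: Syn) extends Syn

  // `body` is passed by name, so it is only evaluated when `inner` is first used.
  final class Recursive(body: => Syn) extends Syn {
    lazy val inner: Syn = body
  }
  def recursive(body: => Syn): Syn = new Recursive(body)

  // expr ::= "num" | "(" expr ")"
  // Initialising `expr` only allocates the Recursive node. The self-reference in
  // the body is resolved later, once `expr` already has a value.
  lazy val expr: Syn = recursive {
    Or(Elem("num"), Seq2(Elem("("), Seq2(expr, Elem(")"))))
  }
}
```

Writing the same `lazy val` without the `recursive { ... }` wrapper would force `expr` again while it is still being initialised, which is the situation that leads to the stack overflow described above.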
#### Other Combinators
So far, many of the combinators that we have seen, such as disjunction
and sequencing, directly correspond to constructs found in `BNF`
grammars. Some of the combinators that we will see now are more
expressive and implement useful patterns.
##### Optional parsers using opt
The combinator `opt` makes a parser optional. The value produced by the
parser is wrapped in `Some` if the parser accepts the input sequence and
in `None` otherwise.

    opt(p) === p.map(Some(_)) | epsilon(None)

##### Repetitions using many and many1
The combinator `many` returns a parser that accepts any number of
repetitions of its argument parser, including 0. The variant `many1`
forces the parser to match at least once.
##### Repetitions with separators repsep and rep1sep
The combinator `repsep` returns a parser that accepts any number of
repetitions of its argument parser, separated by another parser,
including 0. The variant `rep1sep` forces the parser to match at least
once.
The separator parser is restricted to the type `Syntax[Unit]` to ensure
that important values do not get ignored. You may use `unit()` on a
parser to turn its value to `Unit` if you explicitly want to ignore the
values a parser produces.
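
The behaviour of these repetition combinators can be sketched with parsers modelled as plain functions. The code below is a self-contained illustration, not Scallion's implementation: `P` and all the functions are hypothetical re-creations that merely share their names with the Scallion combinators, and they work on lists of string tokens.

```scala
object RepetitionSketch {
  type P[A] = List[String] => Option[(A, List[String])]

  def elem(expected: String): P[String] = {
    case tok :: rest if tok == expected => Some((tok, rest))
    case _ => None
  }

  // Discards the value of a parser, e.g. to use it as a separator.
  def unit[A](p: P[A]): P[Unit] = in => p(in).map { case (_, rest) => ((), rest) }

  // Never fails: produces None without consuming input when p does not match.
  def opt[A](p: P[A]): P[Option[A]] = in => p(in) match {
    case Some((a, rest)) => Some((Some(a), rest))
    case None            => Some((None, in))
  }

  // Zero or more repetitions; stops as soon as p fails or consumes nothing.
  def many[A](p: P[A]): P[List[A]] = in => p(in) match {
    case Some((a, rest)) if rest.length < in.length =>
      many(p)(rest).map { case (as, rest2) => (a :: as, rest2) }
    case _ => Some((Nil, in))
  }

  // One or more repetitions of p, separated by sep.
  def rep1sep[A](p: P[A], sep: P[Unit]): P[List[A]] = in =>
    p(in).flatMap { case (a, rest) =>
      val sepThenP: P[A] = in2 => sep(in2).flatMap { case (_, r) => p(r) }
      many(sepThenP)(rest).map { case (as, rest2) => (a :: as, rest2) }
    }

  // Zero or more repetitions of p, separated by sep.
  def repsep[A](p: P[A], sep: P[Unit]): P[List[A]] = in =>
    rep1sep(p, sep)(in).orElse(Some((Nil, in)))
}
```

For example, `repsep(elem("x"), unit(elem(",")))` accepts both an empty argument list and `x , x`, and in the latter case only the two `x` values end up in the result.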
##### Binary operators with operators
Scallion also contains combinators to easily build parsers for infix
binary operators, with different associativities and priority levels.
This combinator is defined in an additional trait called `Operators`,
which you should mix into `Parsers` if you want to use the combinator.
By default, it should already be mixed-in.

    val times: Syntax[String] =
      accept(OperatorKind("*")) {
        case _ => "*"
      }

    ...

    lazy val operation: Syntax[Expr] =
      operators(number)(
        // Defines the different operators, by decreasing priority.
        times | div is LeftAssociative,
        plus | minus is LeftAssociative,
        ...
      ) {
        // Defines how to apply the various operators.
        case (lhs, "*", rhs) => Times(lhs, rhs).setPos(lhs)
        ...
      }

Documentation for `operators` is [available on this
page](https://epfl-lara.github.io/scallion/scallion/Operators.html).
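
To see what priority levels and left associativity mean for the resulting trees, here is a small self-contained sketch that parses a flat list of string tokens by hand. It does not use Scallion's `operators`; the `Expr` classes, the `levels` table and the `parse` function are assumptions made for this illustration, and every level is treated as left-associative.

```scala
object OperatorsSketch {
  sealed trait Expr
  case class Num(value: Int) extends Expr
  case class BinOp(op: String, lhs: Expr, rhs: Expr) extends Expr

  // Operator levels by decreasing priority, as in the `operators` example above.
  private val levels: Vector[Set[String]] = Vector(Set("*", "/"), Set("+", "-"))

  // Parses operands of the next-higher priority and folds them to the left.
  private def parseLevel(level: Int, tokens: List[String]): (Expr, List[String]) =
    if (level < 0) {
      tokens match {
        case t :: rest => (Num(t.toInt), rest)
        case Nil => throw new IllegalArgumentException("unexpected end of input")
      }
    } else {
      val (first, afterFirst) = parseLevel(level - 1, tokens)
      @annotation.tailrec
      def loop(lhs: Expr, rest: List[String]): (Expr, List[String]) = rest match {
        case op :: more if levels(level).contains(op) =>
          val (rhs, afterRhs) = parseLevel(level - 1, more)
          loop(BinOp(op, lhs, rhs), afterRhs) // left-associative: lhs grows
        case _ => (lhs, rest)
      }
      loop(first, afterFirst)
    }

  // Starts at the lowest priority; leftover tokens are ignored in this sketch.
  def parse(tokens: List[String]): Expr = parseLevel(levels.length - 1, tokens)._1
}
```

With this scheme `1 - 2 - 3` groups as `(1 - 2) - 3`, and `1 + 2 * 3` groups as `1 + (2 * 3)`, which is the grouping the pretty printer's parentheses let you check in your own parser.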
##### Upcasting
In Scallion, the type `Syntax[A]` is invariant in `A`, meaning that,
even when `A` is a (strict) subtype of some type `B`, we *won't* have
that `Syntax[A]` is a subtype of `Syntax[B]`. To upcast a `Syntax[A]` to
a syntax `Syntax[B]` (when `A` is a subtype of `B`), you should use the
`.up[B]` method.
For instance, you may need to upcast a syntax of type
`Syntax[Literal[_]]` to a `Syntax[Expr]` in your assignment. To do so,
simply use `.up[Expr]`.
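
The effect of invariance, and of an explicit upcast method, can be reproduced with a toy container. The sketch below is self-contained and unrelated to Scallion's actual classes: `Syn`, `IntLiteral` and this `up` method are hypothetical, but the typing behaviour is the same as described above.

```scala
object UpcastSketch {
  sealed trait Expr
  case class IntLiteral(value: Int) extends Expr

  // Invariant in A: Syn[IntLiteral] is not a subtype of Syn[Expr].
  final case class Syn[A](value: A) {
    // Explicit upcast to any supertype B of A.
    def up[B >: A]: Syn[B] = Syn[B](value)
  }

  val lit: Syn[IntLiteral] = Syn(IntLiteral(1))

  // val bad: Syn[Expr] = lit      // rejected by the compiler: type mismatch
  val expr: Syn[Expr] = lit.up[Expr]
}
```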
### LL(1) Checking
In Scallion, non-LL(1) parsers can be written, but the result of
applying such a parser is not specified. In practice, we therefore
restrict ourselves only to LL(1) parsers. The reason behind this is that
LL(1) parsers are unambiguous and can be run in time linear in the input
size.
Writing LL(1) parsers is non-trivial. However, some of the higher-level
combinators of Scallion already alleviate part of this pain. In
addition, LL(1) violations can be detected before the parser is run.
Syntaxes have an `isLL1` method which returns `true` if the parser is
LL(1) and `false` otherwise, without needing to see any tokens of
input.
#### Conflict Witnesses
In case your parser is not LL(1), the method `conflicts` of the parser
will return the set of all `LL1Conflict`s. The various conflicts are:
- `NullableConflict`, which indicates that two branches of a
disjunction are nullable.
- `FirstConflict`, which indicates that the `first` set of two
branches of a disjunction are not disjoint.
- `FollowConflict`, which indicates that the `first` set of a nullable
parser is not disjoint from the `first` set of a parser that
directly follows it.
The `LL1Conflict`s objects contain fields which can help you pinpoint
the exact location of conflicts in your parser and hopefully help you
fix those.
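
As an illustration of what the first two kinds of conflicts capture, the following self-contained sketch detects them on a tiny, non-recursive syntax datatype. It is not how Scallion computes conflicts: the `Syn` datatype and the fields of the conflict classes are hypothetical, and follow conflicts are left out to keep the sketch short.

```scala
object ConflictSketch {
  sealed trait Syn
  case class Elem(kind: String) extends Syn
  case object Eps extends Syn
  case class Or(l: Syn, r: Syn) extends Syn
  case class Seq2(l: Syn, r: Syn) extends Syn

  def nullable(s: Syn): Boolean = s match {
    case Elem(_)    => false
    case Eps        => true
    case Or(l, r)   => nullable(l) || nullable(r)
    case Seq2(l, r) => nullable(l) && nullable(r)
  }

  def first(s: Syn): Set[String] = s match {
    case Elem(k)    => Set(k)
    case Eps        => Set.empty
    case Or(l, r)   => first(l) ++ first(r)
    case Seq2(l, r) => if (nullable(l)) first(l) ++ first(r) else first(l)
  }

  sealed trait Conflict
  case class NullableConflict(at: Or) extends Conflict
  case class FirstConflict(at: Or, ambiguous: Set[String]) extends Conflict

  // Collects nullable and first conflicts of every disjunction in the syntax.
  def conflicts(s: Syn): List[Conflict] = s match {
    case or @ Or(l, r) =>
      val bothNullable =
        if (nullable(l) && nullable(r)) List(NullableConflict(or)) else Nil
      val common = first(l) & first(r)
      val sharedFirst =
        if (common.nonEmpty) List(FirstConflict(or, common)) else Nil
      bothNullable ++ sharedFirst ++ conflicts(l) ++ conflicts(r)
    case Seq2(l, r) => conflicts(l) ++ conflicts(r)
    case _ => Nil
  }
}
```

A typical fix is left factoring: `Or(Seq2(Elem("id"), Elem("(")), Elem("id"))` has a first conflict on `id`, while the equivalent `Seq2(Elem("id"), Or(Elem("("), Eps))` has none.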
The helper method `debug` prints a summary of the LL(1) conflicts of a
parser. We added code in the handout skeleton so that, by default, a
report is printed in case of conflicts when you initialise your
parser.
[^1]: Scallion is not the only parser combinator library to exist, far
from it! Many of those libraries do not have this restriction. Those
libraries generally need to backtrack to try the different
alternatives when a branch fails.
[^2]: See [a good explanation of what tying the knot means in the
context of lazy
languages.](https://stackoverflow.com/questions/357956/explanation-of-tying-the-knot)
# Lab 04: Type Checker ([Slides](lab04-slides.pdf))
Parsing concludes the syntactical analysis of Amy programs. Having
successfully constructed an abstract syntax tree for an input program,
compilers typically run one or multiple phases containing checks of a
more semantical nature. Virtually all high-level programming languages
enjoy some form of name analysis, whose purpose is to disambiguate
symbol references throughout the program. Some languages go further and
perform a series of additional checks whose goal is to rule out runtime
errors statically (i.e., during compilation, or in other words, without
executing the program). While the exact rules for those checks vary from
language to language, this part of compilation is typically summarized
as "type checking". Amy, being a statically-typed language, requires
both name and type analysis.
## Prelude: From Nominal to Symbolic Trees
Recall that during parsing we created (abstract syntax) trees of the
*nominal* sort: Names of variables, functions and data types were simply
stored as strings. However, two names used in the program could be the
same, but not refer to one and the same "thing" at runtime. During
name analysis we translate from nominal trees to symbolic ones, to make
it clear whether two names refer to one and the same underlying entity.
That is, we explicitly replace strings by fresh identifiers which will
prevent us from mixing up definitions of the same name, or referring to
things that have not been defined. Amy's name analyzer is provided to
you as part of this lab's skeleton, but you should read the [dedicated
name analyzer page](material/NameAnalysis.md) to understand how it works.
## Introduction to Type Checking
The purpose of this lab is to implement a type checker for Amy. Our type
checking rules will prevent certain errors based on the kind or shape of
values that the program is manipulating. For instance, we should prevent
an integer from being added to a boolean value.
Type checking is the last stage of the compiler frontend. Every program
that reaches the end of this stage without an error is correct (as far
as the compiler is concerned), and every program that does not is wrong.
After type checking we are finally ready to interpret the program or
compile it to binary code!
Typing rules for Amy are presented in detail in the
[Amy specification](/labs/amy-specification/amy-specification.pdf). Make sure to check correct
typing for all expressions and patterns.
## Implementation
The current assignment focuses on the file `TypeChecker.scala`. As
usual, the skeleton and helper methods are given to you, and you will
have to complete the missing parts. In particular, you will write a
compiler phase that checks whether the expressions in a given program
are well-typed and report errors otherwise.
To this end you will implement a simplified form of the Hindley-Milner
(HM) type-inference algorithm that you'll hear about during the
lectures. Note that while not advertised as a feature to users of Amy,
behind the scenes we will perform type inference. It is usually
straightforward to adapt an algorithm for type inference to type
checking, since one can add the user-provided type annotations to the
set of constraints. This is what you will do with HM in this lab.
Compared to the presentation of HM type inference in class, your type
checker can be simplified in another way: since Amy does not feature
higher-order functions or polymorphic data types, types in Amy are
always *simple* in the sense that they are not composed of arbitrary
other types. That is, a type is either a base type (one of `Int`, `Bool`
and `String`) or it is an ADT, which has a proper name (e.g. `List` or
`Option` from the standard library). In the latter case, all the types
in the constructor of the ADT are immediately known. For instance, the
standard library's `List` is really a list of integers, so we know that
the `Cons` constructor takes an `Int` and another `List`.
As a result, your algorithm will never have to deal with complex
constraints over type constructors (such as the function arrow
`A => B`). Instead, your constraints will always be of the form
`T1 = T2` where `T1` and `T2` are either *simple* types or type
variables. This is most important during unification, which otherwise
would have to deal with complex types separately.
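Because every constraint relates two simple types or type variables, unification reduces to substituting variables and comparing types for equality. The following is a minimal sketch under that assumption. All names (`UnifySketch`, `ClassType`, `subst`, `solve`) are hypothetical and do not reproduce the definitions in `TypeChecker.scala`, and errors are returned as values here rather than reported through the compiler's reporter.

```scala
// Hypothetical sketch: names and shapes in TypeChecker.scala may differ.
object UnifySketch {
  sealed trait Type
  case object IntType extends Type
  case object BooleanType extends Type
  case object StringType extends Type
  final case class ClassType(name: String) extends Type // an ADT such as List
  final case class TypeVariable(id: Int) extends Type

  // States that `found` must be equal to `expected`.
  final case class Constraint(found: Type, expected: Type)

  // Replaces type variable number `from` by `to`; other types are unchanged.
  def subst(t: Type, from: Int, to: Type): Type = t match {
    case TypeVariable(id) if id == from => to
    case other                          => other
  }

  // Returns Right(()) if all constraints can be satisfied, Left(message) otherwise.
  @scala.annotation.tailrec
  def solve(constraints: List[Constraint]): Either[String, Unit] = constraints match {
    case Nil => Right(())
    case Constraint(found, expected) :: rest =>
      (found, expected) match {
        case (a, b) if a == b => solve(rest)
        case (TypeVariable(id), t) =>
          solve(rest.map(c => Constraint(subst(c.found, id, t), subst(c.expected, id, t))))
        case (t, TypeVariable(id)) =>
          solve(rest.map(c => Constraint(subst(c.found, id, t), subst(c.expected, id, t))))
        case (a, b) => Left(s"Type error: expected $b, found $a")
      }
  }

  def main(args: Array[String]): Unit = {
    val t0 = TypeVariable(0)
    println(solve(List(Constraint(t0, IntType), Constraint(t0, IntType))))     // Right(())
    println(solve(List(Constraint(t0, IntType), Constraint(t0, BooleanType)))) // a Left with an error message
  }
}
```

No occurs check is needed in this setting, since a type variable can only ever be bound to a simple type or to another variable.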
Your task now is to a) complete the `genConstraints` method which will
traverse a given expression and collect all the necessary typing
constraints, and b) implement the *unification* algorithm as
`solveConstraints`.
Familiarize yourself with the `Constraint` and `TypeVariable` data
structures in `TypeChecker.scala` and then start by implementing
`genConstraints`. The structure of this method will in many cases be
analogous to the AST traversal performed by the name analyzer. Note
that `genConstraints` also takes an *expected type*. For instance, in
the case of addition the expected type of both operands should be `Int`. For
other constructs, such as pattern `match`es, it is not inherently clear
what the type of each `case` body should be. In this case you can create
and pass a fresh type variable.
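The role of the expected type can be sketched on a tiny expression language. Everything below (`GenSketch`, the expression classes, `freshTypeVariable`) is an assumption made for illustration and does not reproduce the lab's trees or helper methods. The `Equals` case shows the fresh-type-variable idea: both operands must agree on a type that is not known in advance.

```scala
// Hypothetical sketch on a tiny expression language; the lab's trees and helpers differ.
object GenSketch {
  sealed trait Type
  case object IntType extends Type
  case object BooleanType extends Type
  final case class TypeVariable(id: Int) extends Type

  private var counter = 0
  def freshTypeVariable(): TypeVariable = { counter += 1; TypeVariable(counter) }

  // States that `found` must be equal to `expected`.
  final case class Constraint(found: Type, expected: Type)

  sealed trait Expr
  final case class IntLiteral(value: Int) extends Expr
  final case class BooleanLiteral(value: Boolean) extends Expr
  final case class Plus(lhs: Expr, rhs: Expr) extends Expr
  final case class Equals(lhs: Expr, rhs: Expr) extends Expr

  def genConstraints(e: Expr, expected: Type): List[Constraint] = e match {
    case IntLiteral(_)     => List(Constraint(IntType, expected))
    case BooleanLiteral(_) => List(Constraint(BooleanType, expected))
    case Plus(l, r) =>
      // The sum is an Int, and so must be both operands.
      Constraint(IntType, expected) :: (genConstraints(l, IntType) ++ genConstraints(r, IntType))
    case Equals(l, r) =>
      // The operand type is unknown, but it must be the same on both sides.
      val operandType = freshTypeVariable()
      Constraint(BooleanType, expected) ::
        (genConstraints(l, operandType) ++ genConstraints(r, operandType))
  }

  def main(args: Array[String]): Unit = {
    // 1 + true yields a constraint BooleanType = IntType, which unification will reject.
    println(genConstraints(Plus(IntLiteral(1), BooleanLiteral(true)), IntType))
  }
}
```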
Once you have a working implementation of both `genConstraints` and
`solveConstraints` you can copy over your previous work on the
interpreter and run the programs produced by your frontend! Don't
forget that to debug your compiler's behavior you can also use the
reference compiler with the `--interpret` flag and then compare the
output.
## Skeleton
As usual, you can find the skeleton for this lab in a new branch of your
group's repository. After merging it with your existing work, the
structure of your project `src` directory should be as follows:

    src/amyc
    ├── Main.scala (updated)
    ├── analyzer (new)
    │   ├── SymbolTable.scala
    │   ├── NameAnalyzer.scala
    │   └── TypeChecker.scala
    ├── ast
    │   ├── Identifier.scala
    │   ├── Printer.scala
    │   └── TreeModule.scala
    ├── interpreter
    │   └── Interpreter.scala
    ├── lib
    │   ├── scallion_3.0.6.jar
    │   └── silex_3.0.6.jar
    ├── parsing
    │   ├── Parser.scala
    │   ├── Lexer.scala
    │   └── Tokens.scala
    └── utils
        ├── AmycFatalError.scala
        ├── Context.scala
        ├── Document.scala
        ├── Pipeline.scala
        ├── Position.scala
        ├── Reporter.scala
        └── UniqueCounter.scala

## Deliverables
Deadline: **Thursday November 17 at 11pm**.
Submission: push the solved lab 4 to the branch `clplab4` that was created on your GitLab repo. Do not push the changes to other branches, as this may interfere with your previous submissions.
You may want to copy the files you changed directly to the new branch, since the two branches don't share a history in Git.
# Name Analysis
In the following, we will briefly discuss the purpose and implementation of the name analyzer phase in Amy. Name analysis has three goals:
* To reject programs that do not follow the Amy naming rules.
* For correct programs, to assign a unique identifier to every name. Remember that trees coming out of the parser contain plain strings wherever a name is expected. This might lead to confusion as to what each name refers to. Therefore, during name analysis, we assign a unique identifier to each name at its definition. Later in the program, every reference to that name will use the same unique identifier.
* To populate the symbol table. The symbol table contains a mapping from identifiers to all information that you could need later in the program for that identifier. For example, for each constructor, the symbol table contains an entry with the argument types, parent, and an index for this constructor.
After name analysis, only name-correct programs should survive, and they should contain unique identifiers that correspond to the correct symbol in the program.
You can always look at the expected output of name analysis for a given program by invoking the reference compiler with the `--printNames` option.
## The Symbol Table
The symbol table contains information for all kinds of entities in the program. In the first half of name analysis, we discover all definitions of symbols, assign each of them a fresh identifier, and store these identifier-definition entries in the symbol table.
The `SymbolTable` API contains three kinds of methods:
* `addX` methods will add a new object to the symbol table. Among other things, these methods turn the strings found in nominal trees into the fresh `Identifier`s we will use to construct symbolic trees.
* `getX` methods which take an `Identifier` as an argument. This is what you will be using to resolve symbols you find in the program, for example, during type checking.
* `getX` methods which take two strings as arguments. These are only useful for name analysis and should not be used later: since during name analysis unique identifiers have not been assigned to everything from the start, sometimes our compiler will need to look up a definition based on its name and the name of its containing module. Of course you should not use these methods once you already have an identifier (in particular, not during type checking).
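The three kinds of methods can be sketched for functions only. This is an assumption-laden illustration: `SymbolTableSketch`, the simplified `FunSig` (types as plain strings) and the method signatures are hypothetical, and the real `SymbolTable` covers modules, types and constructors as well.

```scala
import scala.collection.mutable

// Hypothetical sketch: the real SymbolTable stores more kinds of symbols and richer signatures.
object SymbolTableSketch {
  final case class Identifier(name: String, id: Int)
  final case class FunSig(argTypes: List[String], retType: String, owner: Identifier)

  final class SymbolTable {
    private var counter = 0
    private val modules = mutable.Map.empty[String, Identifier]
    private val defsByName = mutable.Map.empty[(String, String), Identifier]
    private val functions = mutable.Map.empty[Identifier, FunSig]

    private def fresh(name: String): Identifier = { counter += 1; Identifier(name, counter) }

    // addX: turn a plain string into a fresh Identifier and record it.
    def addModule(name: String): Identifier = {
      val id = fresh(name)
      modules(name) = id
      id
    }

    def addFunction(owner: String, name: String, argTypes: List[String], retType: String): Identifier = {
      val ownerId = modules.getOrElse(owner, sys.error(s"Module $owner not found"))
      val id = fresh(name)
      defsByName((owner, name)) = id
      functions(id) = FunSig(argTypes, retType, ownerId)
      id
    }

    // getX by Identifier: what later phases such as type checking use.
    def getFunction(id: Identifier): Option[FunSig] = functions.get(id)

    // getX by (module, name): only needed while identifiers are still being assigned.
    def getFunction(owner: String, name: String): Option[(Identifier, FunSig)] =
      for {
        id  <- defsByName.get((owner, name))
        sig <- functions.get(id)
      } yield (id, sig)
  }
}
```

In this sketch, name analysis would call `addModule` and `addFunction` while discovering definitions, use the two-string lookup while rewriting function bodies, and leave only the `Identifier`-based lookup for later phases.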
## The different tree modules
It is time to talk in detail about the different tree modules in the `TreeModule` file. As explained earlier, our goal is to define two very similar tree modules, with the only difference being how a (qualified) name is represented: In a *nominal* tree, i.e. one coming out of the parser, names are plain strings and qualified names are pairs of strings. On the other hand, in a *symbolic* tree, both kinds of names are unique identifiers.
To represent either kind of tree, we define a single Scala trait called `TreeModule` which defines two *abstract type fields* `Name` and `QualifiedName`. This trait also defines all types we need to represent Amy ASTs. Many of these types depend on the abstract types.
These abstract types are filled in when we instantiate the trait. Further down in the same file you can see that we define two objects, `NominalTreeModule` and `SymbolicTreeModule`, which instantiate the abstract types. In addition, all types within `TreeModule` are conceptually defined separately in each of the two implementations. As a result, there is a type called `NominalTreeModule.Ite` which is *different* from the type called `SymbolicTreeModule.Ite`.
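The abstract-type technique can be illustrated with a much smaller trait. The sketch below is an assumption for illustration only: it defines just two tree classes, and the exact members of the real `TreeModule` are not reproduced.

```scala
// Hypothetical sketch: a two-node version of the TreeModule idea.
object TreeSketch {
  final case class Identifier(name: String, id: Int)

  trait TreeModule {
    // Left abstract here, filled in by each instantiation below.
    type Name
    type QualifiedName

    trait Expr
    case class Variable(name: Name) extends Expr
    case class Call(qname: QualifiedName, args: List[Expr]) extends Expr
  }

  // Nominal trees: names are strings, qualified names are pairs of strings.
  object NominalTreeModule extends TreeModule {
    type Name = String
    type QualifiedName = (String, String)
  }

  // Symbolic trees: both kinds of names are unique identifiers.
  object SymbolicTreeModule extends TreeModule {
    type Name = Identifier
    type QualifiedName = Identifier
  }

  def main(args: Array[String]): Unit = {
    val nominal  = NominalTreeModule.Variable("x")
    val symbolic = SymbolicTreeModule.Variable(Identifier("x", 1))
    println(nominal.name)     // x
    println(symbolic.name.id) // 1
    // val mixed: SymbolicTreeModule.Expr = nominal // rejected by the compiler: different types
  }
}
```

Because the case classes are members of the trait, each object gets its own copy of them, which is why a nominal `Variable` cannot be used where a symbolic one is expected.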
## The NameAnalyzer class
The `NameAnalyzer` class implements Amy's naming rules (section 3.4 of the Amy specification). It takes a nominal program as an input and produces a symbol table and a symbolic program.
Name analysis is split into well-defined steps. The idea is the following: we first discover all definitions in the program in the correct order, i.e., modules, types, constructors, and, finally, functions. We then rewrite function bodies and expressions to refer to the newly-introduced identifiers.
Notice how name analysis takes as input the `NominalTreeModule.Program` output by the Parser, and returns a `SymbolicTreeModule.Program` along with a populated symbol table. During the last step we therefore transform the program and each of its subtrees from `NominalTreeModule.X` into `SymbolicTreeModule.X`. For instance, a `NominalTreeModule.Program` will be transformed into a `SymbolicTreeModule.Program`, a `NominalTreeModule.Ite` into a `SymbolicTreeModule.Ite` and so forth. To save some typing, we have imported NominalTreeModule as `N` and SymbolicTreeModule as `S`. So to refer e.g. to a `Plus` in the original (nominal) tree module we can simply use `N.Plus` -- to refer to one in the symbolic tree module we can use `S.Plus`.
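As a rough illustration of this last step, the sketch below rewrites a nominal expression into a symbolic one. The modules `N` and `S`, the three expression classes and the `locals` map are simplified assumptions; the real `NameAnalyzer` handles many more cases and reports errors through the compiler's reporter instead of throwing.

```scala
// Hypothetical sketch of the N-to-S rewriting step on a tiny tree module.
object TransformSketch {
  final case class Identifier(name: String, id: Int)

  trait TreeModule {
    type Name
    trait Expr
    case class Variable(name: Name) extends Expr
    case class IntLiteral(value: Int) extends Expr
    case class Plus(lhs: Expr, rhs: Expr) extends Expr
  }
  object N extends TreeModule { type Name = String }
  object S extends TreeModule { type Name = Identifier }

  // `locals` maps the names in scope to the identifiers assigned at their definition.
  def transformExpr(e: N.Expr, locals: Map[String, Identifier]): S.Expr = e match {
    case N.IntLiteral(v) => S.IntLiteral(v)
    case N.Plus(l, r)    => S.Plus(transformExpr(l, locals), transformExpr(r, locals))
    case N.Variable(n)   =>
      S.Variable(locals.getOrElse(n, sys.error(s"Variable $n not found")))
  }

  def main(args: Array[String]): Unit = {
    val x = Identifier("x", 1)
    println(transformExpr(N.Plus(N.Variable("x"), N.IntLiteral(2)), Map("x" -> x)))
  }
}
```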
# Lab 05: Code Generation ([Slides](material/lab05-extra.md))
## Introduction
Welcome to the last common assignment for the Amy compiler. At this
point, we are finally done with the frontend: we have translated source
programs to ASTs and have checked that all correctness conditions hold
for our program. We are ready to generate code for our program. In our
case the target language will be *WebAssembly*.
WebAssembly is "a new portable, size- and load-time-efficient format
suitable for compilation to the web" (<http://webassembly.org>).
WebAssembly is designed to be called from JavaScript in browsers and
lends itself to high-performance execution.
For simplicity, we will not use a browser, but execute the resulting
WebAssembly bytecode directly using `nodejs`, which is essentially a
standalone distribution of the Chrome browser's JavaScript engine. When
you run your complete compiler (or the reference compiler) with no
options on program `p`, it will generate the following files under the
`wasmout` directory:
- `p.wat` is the wasm output of the compiler in text format. You can
use this representation to debug your generated code.
- `p.wasm` is the binary output of the compiler. This is what `nodejs`
will use. To translate to the binary format, we use the `wat2wasm`
tool provided by the WebAssembly developers. Note that
this tool performs a purely mechanical translation and thus its
output (for instance, `p.wasm`) corresponds to a binary
representation of `p.wat`.
- `p.js` is a JavaScript wrapper which we will run with nodejs and
which serves as an entry point into your generated binary.
To run the program, simply type `nodejs wasmout/p.js`.
### Installing nodejs and wat2wasm
- You can find directions for your favorite operating system
[here](https://nodejs.org/en/). You should have nodejs 12 or later
(run `nodejs --version` to make sure).
- Once you have installed nodejs, run `npm install deasync` from the
directory you plan to run `amyc` in, i.e. the toplevel directory of
the compiler.
- Install `wat2wasm` using your favorite package manager. The name of
the package is usually `wabt` (`apt install wabt`, `pacman -Sy wabt`, etc.).
If you are not on Linux, you can download it from
<https://github.com/WebAssembly/wabt/releases/tag/1.0.31>, then copy the file
`bin/wat2wasm` (or `bin/wat2wasm.exe` on Windows) from the archive to
\<root of the project\>/bin.
- Make sure the `wat2wasm` executable is visible: either on your system path
or in the \<root of the project\>/bin folder (which you may have to create).
## WebAssembly and Amy
Links to the material will be provided here after the presentation of the lab.
Presentation by Georg Schmid from a few years ago: <https://tube.switch.ch/videos/00568845>, slides <https://lara.epfl.ch/~gschmid/clp20/codegen.pdf>
The lab has changed slightly since then: for instance, `set_global`, `get_global`, `set_local` and `get_local` are outdated and have been replaced by `global.set`, `global.get`, `local.set` and `local.get`. Otherwise the presentation remains a very good resource.
## The assignment code
### Overview
The code for the assignment is divided into two directories: `wasm` for
the modeling of the WebAssembly framework, and `codegen` for
Amy-specific code generation. There is a lot of code here, but your task
is only to implement code generation for Amy expressions within
`codegen/CodeGen.scala`.
- `wasm/Instructions.scala` provides types that describe a subset of
WebAssembly instructions. It also provides a type `Code` to describe
sequences of instructions. You can chain multiple instructions or
`Code` objects together to generate a longer `Code` with the `<:>`
operator.
- `wasm/Function.scala` describes a wasm function.
    - `LocalsHandler` is an object which will create fresh indexes for
      local variables as needed.
    - A `Function` contains a field called `isMain` which is used to
      denote a main function without a return value, which will be
      handled differently when printing, and will be exported to
      JavaScript.
    - The only way to create a `Function` is using `Function.apply`.
      Its last argument is a function from a `LocalsHandler` to
      `Code`. The reason for this unusual choice is to make sure the
      `Function` object is instantiated with the number of local
      variables that will be requested from the `LocalsHandler`. To see
      how it is used, you can look in `codegen/Utils.scala` (but you
      won't have to use it directly).
- `wasm/Module.scala` and `wasm/ModulePrinter.scala` describe a wasm
module, which you can think of as a set of functions and the
corresponding module headers.
- `codegen/Utils.scala` contains a few utility functions (which you
should use!) and implementations of the built-in functions of Amy.
Use the built-ins as examples.
- `codegen/CodeGen.scala` is the focus of the assignment. It contains
code to translate Amy modules, functions and expressions to wasm
code. It is a pipeline and returns a wasm Module.
- `codegen/CodePrinter.scala` is a Pipeline which will print output
files from the wasm module.
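The idea behind chaining with `<:>` can be sketched as follows. This is an illustration under assumptions: the three instructions, the shape of `Code` and the implicit conversion `i2c` are hypothetical and may not match `wasm/Instructions.scala`.

```scala
// Hypothetical sketch of instruction sequences chained with <:>.
object CodeSketch {
  import scala.language.implicitConversions

  sealed trait Instruction
  final case class Const(value: Int) extends Instruction
  final case class GetLocal(index: Int) extends Instruction
  case object Add extends Instruction

  // A sequence of instructions; <:> concatenates two sequences.
  final case class Code(instructions: List[Instruction]) {
    def <:>(that: Code): Code = Code(this.instructions ++ that.instructions)
  }

  // Lets a single instruction be used wherever a Code is expected.
  implicit def i2c(i: Instruction): Code = Code(List(i))

  def main(args: Array[String]): Unit = {
    val code = GetLocal(0) <:> Const(1) <:> Add
    println(code.instructions) // List(GetLocal(0), Const(1), Add)
  }
}
```

Treating code as a value that can be concatenated keeps each case of the code generator small: it only needs to combine the code of its subexpressions with a few instructions of its own.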
### The cgExpr function
The focus of this assignment is the `cgExpr` function, which takes an
expression and generates a `Code` object. It also takes two additional
arguments: (1) a `LocalsHandler`, which you can use to get a new slot for
a local when you encounter a local variable or need a temporary
variable for your computation, and (2) a map `locals` from `Identifier`s to
local slots, i.e. indices, in the wasm world. For example, if `locals`
contains a pair `i -> 4`, we know that `local.get 4` in wasm will push
the value of `i` to the stack. Notice how `locals` is instantiated with
the function parameters in `cgFunction`.
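How the two extra arguments interact can be sketched on a tiny expression language. The code below is not the skeleton's `cgExpr`: it uses strings instead of `Identifier`s, renders instructions as text instead of building `Code`, and the `LocalsHandler` with its `getFreshLocal` method is a simplified assumption.

```scala
// Hypothetical sketch: a text-producing cgExpr for four kinds of expressions.
object CgSketch {
  sealed trait Expr
  final case class IntLiteral(value: Int) extends Expr
  final case class Variable(name: String) extends Expr
  final case class Plus(lhs: Expr, rhs: Expr) extends Expr
  final case class Let(name: String, value: Expr, body: Expr) extends Expr

  // Hands out fresh local slots; slots 0 until `params` hold the function parameters.
  final class LocalsHandler(params: Int) {
    private var next = params
    def getFreshLocal(): Int = { val slot = next; next += 1; slot }
  }

  def cgExpr(e: Expr, locals: Map[String, Int], lh: LocalsHandler): List[String] = e match {
    case IntLiteral(v) => List(s"i32.const $v")
    case Variable(n)   => List(s"local.get ${locals(n)}")
    case Plus(l, r)    => cgExpr(l, locals, lh) ++ cgExpr(r, locals, lh) ++ List("i32.add")
    case Let(n, v, b)  =>
      // Reserve a slot, store the value there, and extend `locals` for the body only.
      val slot = lh.getFreshLocal()
      cgExpr(v, locals, lh) ++ List(s"local.set $slot") ++ cgExpr(b, locals + (n -> slot), lh)
  }

  def main(args: Array[String]): Unit = {
    // Body of f(i, j) = { val res = i + j; res }, with i in slot 0 and j in slot 1.
    val body = Let("res", Plus(Variable("i"), Variable("j")), Variable("res"))
    cgExpr(body, Map("i" -> 0, "j" -> 1), new LocalsHandler(2)).foreach(println)
  }
}
```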
## Skeleton
As usual, you can find the skeleton for this lab in a new branch of your
group's repository. After merging it with your existing work, the
structure of your project `src` directory should be as follows:

    src/amyc
    ├── Main.scala (updated)
    ├── analyzer
    │   ├── SymbolTable.scala
    │   ├── NameAnalyzer.scala
    │   └── TypeChecker.scala
    ├── ast
    │   ├── Identifier.scala
    │   ├── Printer.scala
    │   └── TreeModule.scala
    ├── codegen (new)
    │   ├── CodeGen.scala
    │   ├── CodePrinter.scala
    │   └── Utils.scala
    ├── interpreter
    │   └── Interpreter.scala
    ├── lib
    │   ├── scallion_3.0.6.jar
    │   └── silex_3.0.6.jar
    ├── parsing
    │   ├── Parser.scala
    │   ├── Lexer.scala
    │   └── Tokens.scala
    ├── utils
    │   ├── AmycFatalError.scala
    │   ├── Context.scala
    │   ├── Document.scala
    │   ├── Pipeline.scala
    │   ├── Position.scala
    │   ├── Reporter.scala
    │   └── UniqueCounter.scala
    └── wasm (new)
        ├── Function.scala
        ├── Instructions.scala
        ├── ModulePrinter.scala
        └── Module.scala

## Deliverables
Deadline: **Friday December 9 at 11pm**.
Submission: push the solved lab 5 to the branch `clplab5` that was created on your GitLab repo. Do not push the changes to other branches, as this may interfere with your previous submissions.
You may want to copy the files you changed directly to the new branch, since the two branches don't share a history in Git.
## Demo
```
(func $Factorial_f (param i32 i32) (result i32) (local i32)
  ;;> fn f(i: Int(32), j: Int(32)): Int(32) = {
  ;;|   val res: Int(32) =
  ;;|     (i + j);
  ;;|   res
  ;;| }
  ;;> i
  local.get 0
  ;;> j
  local.get 1
  ;;> (i + j)
  i32.add
  ;;> val res: Int(32)
  local.set 2
  ;;> res
  local.get 2
)
(func $Factorial_fact (param i32) (result i32)
  ;;> fn fact(i: Int(32)): Int(32) = {
  ;;|   (if((i < 2)) {
  ;;|     1
  ;;|   } else {
  ;;|     (i * fact((i - 1)))
  ;;|   })
  ;;| }
  ;;> i
  local.get 0
  ;;> 2
  i32.const 2
  ;;> (i < 2)
  i32.lt_s
  ;;> (if((i < 2)) {
  ;;|   1
  ;;| } else {
  ;;|   (i * fact((i - 1)))
  ;;| })
  if (result i32)
    ;;> 1
    i32.const 1
  else
    ;;> i
    local.get 0
    ;;> fact((i - 1))
    ;;> i
    local.get 0
    ;;> 1
    i32.const 1
    ;;> (i - 1)
    i32.sub
    call $Factorial_fact
    ;;> (i * fact((i - 1)))
    i32.mul
  end
)
```
## WASM basics
### Stack machine
- WASM is a stack-based machine.
- WASM has types. We will use exclusively i32.
- Instructions can push or pop values from the stack.
    - i32.const x : push x to the stack.
    - i32.add : pop 2 values, add them and push the result.
    - drop : pop a value and ignore it.
- Locals can store values inside a function. Useful for val definitions among others.
    - local.get x : push the value of the xth local to the stack.
    - local.set x : pop a value and store it in the xth local.
- Globals store program-wide values.
    - global.get x : push the value of the xth global to the stack.
    - global.set x : pop a value and store it in the xth global.
- Control flow.
    - if : pop a value from the stack; if it is 0, jump to the else branch, otherwise continue.
    - call : pop the arguments from the stack and jump to the function.
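The push and pop behaviour of the first three instructions can be simulated in a few lines. This is a toy model written for illustration: `StackMachineSketch` and its instruction classes are hypothetical and unrelated to the lab's `wasm` package.

```scala
// Hypothetical toy model of a stack machine with three instructions.
object StackMachineSketch {
  sealed trait Instr
  final case class Const(x: Int) extends Instr // i32.const x
  case object Add extends Instr                // i32.add
  case object Drop extends Instr               // drop

  // The head of the list is the top of the stack.
  def step(stack: List[Int], instr: Instr): List[Int] = (instr, stack) match {
    case (Const(x), _)         => x :: stack
    case (Add, b :: a :: rest) => (a + b) :: rest
    case (Drop, _ :: rest)     => rest
    case _                     => sys.error(s"Stack underflow at $instr")
  }

  def run(program: List[Instr]): List[Int] =
    program.foldLeft(List.empty[Int])(step)

  def main(args: Array[String]): Unit = {
    println(run(List(Const(3), Const(4), Add)))  // List(7)
    println(run(List(Const(1), Const(2), Drop))) // List(1)
  }
}
```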
## Function calls
How to call a function:
- Push the required number of arguments on the stack.
- Call the function. The call instruction will pop the arguments and place them in the locals.
- The result will be placed on top of the stack.
```
(func $f (param i32 i32) (result i32)
local.get 0
local.get 1
i32.add
)
(
i32.const 3 ;; arg 0
i32.const 4 ;; arg 1
;; A
call $f
;; B
)
A:
| |
| 4 | <-- arg 1
| 3 | <-- arg 0
|-------|
B:
| |
| |
| 7 | <-- result
|-------|
```
## Store
```
Store 3 at address 48
| |
| |
| |
|--------| <-- bottom of the stack
`i32.const 48`
| |
| |
| 48 | <-- address
|--------| <-- bottom of the stack
`i32.const 3`
| |
| 3 | <-- value
| 48 | <-- address
|--------| <-- bottom of the stack
`i32.store` pops 2 values from the stack
| |
| |
| |
|--------| <-- bottom of the stack
Heap
| address | 0 | 1 | 2 | .. | 47 | 48 | 49 | .. |
|---------|----|----|----|----|----|----|----|----|
| value | 0 | 0 | 0 | .. | 0 | 3 | 0 | .. |
^
value written
```
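The same store can be replayed on a small simulated memory. The sketch below is only a model: `StoreSketch`, the 64-byte buffer and `i32Store` are assumptions for illustration. It relies on the fact that WebAssembly memory is little-endian, so the least significant byte of the value lands at the given address.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Hypothetical model of a 64-byte linear memory; WebAssembly memory is little-endian.
object StoreSketch {
  val memory: ByteBuffer = ByteBuffer.allocate(64).order(ByteOrder.LITTLE_ENDIAN)

  // Models i32.store: pops the value, then the address, and writes 4 bytes.
  def i32Store(stack: List[Int]): List[Int] = stack match {
    case value :: address :: rest =>
      memory.putInt(address, value)
      rest
    case _ => sys.error("i32.store needs an address and a value on the stack")
  }

  def main(args: Array[String]): Unit = {
    val stack = 3 :: 48 :: Nil // top of stack first: value 3, then address 48
    println(i32Store(stack))   // List()
    println(memory.get(48))    // 3
  }
}
```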
## Values
Very similar to Java.
- Ints are represented simply with an i32.
- Bools are represented with an i32, false = 0, true = 1.
- Unit is represented with an i32 with value 0.
### Strings
```
| address | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|---------|----|----|----|----|----|----|----|----|
| value | 104| 101| 108| 108| 111| 33 | 0 | 0 |
| ascii | h | e | l | l | o | ! | \0 | |
| |
| |
| 24 | <-- pointer to string
|--------| <-- bottom of the stack
```
### ADTs
- Store the value on the heap to reduce its size to the size of a pointer.
- Store which constructor the value holds.
```scala
def getList(): List = { ... }
val ls: List = getList();
// What is the size of list here?
// Is it a Nil or a Cons?
```
```
Cons(42, Nil())
| address | value |
|---------|---------|
| 0 | 1 | \
| 1 | | | constructor id.
| 2 | | | Cons
| 3 | | /
| 4 | 42 | \
| 5 | | | first member: int
| 6 | | | 42
| 7 | | /
| 8 | 1234 | \
| 9 | | | second member: pointer to Nil
| 10 | | | 1234
| 11 | | /
Field offset = 4 + 4 * field number
==> Utils.scala:adtField
```
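The offset formula can be written down directly. The names below (`fieldOffset`, `consLayout`) are hypothetical and only illustrate the table; the constructor id 1 for `Cons` and the address 1234 for the `Nil` value are taken from it.

```scala
object AdtLayout {
  // Byte offset of field `i` (0-based) from the start of the ADT value:
  // 4 bytes for the constructor id, then 4 bytes per field.
  def fieldOffset(i: Int): Int = 4 + 4 * i

  // The 32-bit words of Cons(42, Nil()), keyed by byte offset.
  val consLayout: Map[Int, Int] = Map(
    0              -> 1,    // constructor id (Cons)
    fieldOffset(0) -> 42,   // first member: int
    fieldOffset(1) -> 1234  // second member: pointer to Nil
  )
}
```

Reading a field is then a load from `base address + fieldOffset(i)`, which is why every field occupies exactly one 4-byte word.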
## Allocation
Utils.scala:memoryBoundary is the index of a global variable that holds a pointer to the next free bytes.
### Example in pseudocode:
Start of the program:
global.set(memoryBoundary, 0)
We want to allocate "hello!" = 7 bytes (don't forget the null terminator).
Store current memory pointer as pointer to our new string:
hello_string = global.get(memoryBoundary)
Increment the memory boundary by 7 (size of string).
global.set(memoryBoundary, global.get(memoryBoundary) + 7)
### With webassembly instructions:
```
;; With memoryBoundary = 0.
;; Load the current boundary for string
global.get 0
;; Load it again for the arithmetic
global.get 0
;; length of string
i32.const 7
;; base + length = new boundary
i32.add
;; store new boundary
global.set 0
;; now the string pointer is on the stack, we just
;; need to copy the character's bytes into it.
...
```
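This allocation scheme is a bump allocator: memory is never freed, and the boundary only grows. A minimal Scala model, in which a private variable stands in for the global at index `memoryBoundary` and `allocate` is a hypothetical helper name:

```scala
object BumpAllocator {
  // Stands in for the global holding the next free address; starts at 0.
  private var memoryBoundary = 0

  // Returns the address of a fresh block of `size` bytes
  // and advances the boundary past it.
  def allocate(size: Int): Int = {
    val pointer = memoryBoundary
    memoryBoundary += size
    pointer
  }

  val hello = allocate(7) // "hello!" plus the null terminator
  val next  = allocate(4) // the following allocation starts right after it
}
```

The first allocation returns address 0 and the second returns 7, so consecutive blocks never overlap.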
## Pattern matching
A pattern matching expression:
e match {
case p1 => e1
...
case pn => en
}
can be considered to be equivalent to the following pseudocode:
val v = e;
if (matchAndBind(v, p1)) e1
else if (matchAndBind(v, p2)) e2
else if ...
else if (matchAndBind(v, pn)) en
else error("Match error!")
matchAndBind is equivalent to this:
WildcardPattern:
"case _ => ..."
matchAndBind(v, _) = true
IdPattern:
"case id => ..."
matchAndBind(v, id) = { id = v; true }
LiteralPattern:
"case 3 => ..."
matchAndBind(v, lit) = { v == lit }
CaseClassPattern:
"case Cons(x, _) => ..."
matchAndBind(C_1(v_1, ..., v_n), C_2(p_1, ..., p_m)) = {
C_1 == C_2 &&
matchAndBind(v_1, p_1) &&
...
matchAndBind(v_m, p_m)
}
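The rules above can be turned into a small executable model. This sketch is not the code generator you have to write: it works on hypothetical runtime values (`IntV`, `CaseClassV`) and simplified patterns defined only for this example, and it returns the bindings as a map instead of assigning locals.

```scala
object MatchDemo {
  // Hypothetical runtime values: integers and constructor applications.
  sealed trait Value
  case class IntV(v: Int) extends Value
  case class CaseClassV(constr: String, args: List[Value]) extends Value

  // Simplified patterns mirroring the four cases above.
  sealed trait Pattern
  case object Wildcard extends Pattern
  case class IdPat(name: String) extends Pattern
  case class LitPat(v: Int) extends Pattern
  case class CaseClassPat(constr: String, args: List[Pattern]) extends Pattern

  type Bindings = Map[String, Value]
  private val noBindings: Bindings = Map.empty[String, Value]

  // Some(bindings) if `v` matches `p`, None otherwise.
  def matchAndBind(v: Value, p: Pattern): Option[Bindings] = (v, p) match {
    case (_, Wildcard)        => Some(noBindings)
    case (_, IdPat(name))     => Some(Map(name -> v))
    case (IntV(a), LitPat(b)) => if (a == b) Some(noBindings) else None
    case (CaseClassV(c1, vs), CaseClassPat(c2, ps))
        if c1 == c2 && vs.length == ps.length =>
      // Match the fields left to right; stop binding as soon as one fails.
      vs.zip(ps).foldLeft(Option(noBindings)) {
        case (acc, (fv, fp)) =>
          for (bound <- acc; more <- matchAndBind(fv, fp)) yield bound ++ more
      }
    case _ => None
  }
}
```

For example, matching the value for `Cons(42, Nil())` against the pattern `Cons(x, _)` succeeds and binds `x` to 42, while matching it against `Nil()` fails because the constructors differ.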
# Lab 06: Compiler extension project
You have now written a compiler for Amy, a simple functional language.
The final lab project is to design and implement a new functionality of
your own choice on top of the compiler you built so far. In preparation
for this, you should aim to learn about the problem domain by searching
the appropriate literature. The project includes:
- designing and implementing the new functionality
- documenting the results in a written report document
This project has several deadlines, detailed below. Please note that the
first of them (choosing the topic) is already coming up this Thursday!
Use the sessions on Wednesday and Thursday morning to discuss your own
ideas or choices.
## Selecting a Project Topic
**Deadline: Thursday December 1st**
In the following document, we list several project ideas, but you should
also feel free to submit your own. All groups will rank the
projects in order of preference, and we will then do our best to assign
the preferred projects to as many groups as possible. Because not all
projects are equally difficult, we annotated each of them with the
expected workload. The suggested projects cover a wide range of
complexity, and we will evaluate your submissions with that complexity
in mind. For instance, for a project marked with `(1)` (relatively low
complexity) we will be expecting a polished, well-tested and
well-documented extension, whereas projects on the other end (`(3)`) may
be more prototypical. For all submissions, however, we require that you
deliver code that compiles and a set of example input files that
demonstrate the new functionality.
[Project ideas](material/extensions.pdf)
To announce your preferences, [please fill out this form on Moodle before the deadline](https://moodle.epfl.ch/mod/questionnaire/view.php?id=1231114). You'll have to
provide **the numbers corresponding to exactly 5** projects you would like to
work on, in order of descending preference. We will do our best to
assign you the project you are most interested in.
## Project Orientation
We will try to inform you about the project assignment during the usual Wednesday and Thursday sessions. We ask you to be **proactive** and validate with the assistants your understanding of the project goals and the expectations of the end product. Think about the following questions and feel free to ask the assistants about them during the exercise sessions:
- What are the features you will add to the compiler/language?
- What would be some (short) programs highlighting the use of these features?
- What changes might be required in each compiler phase and/or what new phases would you add? (Very roughly)
## Project Presentation
You will present your idea during the last two weeks of the semester (Dec 14/15/19/21/22). We'll announce the concrete
schedule of presentations at a later point. [Instructions on what and how to present your project can be found here.](material/presentation.md)
## Project Implementation and Report
You will develop your project on top of your implementation of Amy. Please push all development on a new branch `clplab6`, ideally building on top of the codegen lab. We will refer to this branch in case of problems with your submission.
**Deadline: Monday January 9th, 23:59**
Submission: content of the clplab6 branch.
Final form of your submission should contain:
- Your implementation, which must, to be graded at all, compile and be able to run non-trivial examples.
- A subdirectory `extension-examples/` which includes **at least 5 examples** that demonstrate your compiler extension in action.
- A subdirectory `report/` which includes a PDF summarizing your extension.
- A subdirectory `slides/` which includes the PDF of the project presentation.
- A README file indicating how we should run and test the implemented functionality, with examples.
**If you did not manage to complete your planned features, or they are
partially implemented, make this clear in your report!**
You are encouraged to use the following (LaTeX) template for your
report:
- [LaTeX sources](material/report-template.tar.gz)
A PDF version of the template with the required section is available
here:
- [PDF Example](material/report-template.pdf)
Although you are not required to use the above template, your report
must contain at least the sections described in it with the appropriate
information. Note that writing this report will take some time, and you
should not leave it to the last minute. The final report is an important
part of the compiler project. If you have questions about the template
or the contents of the report, make sure you ask them early.
A common question is "how long should the report be?". There's no
definitive answer to that. Considering that the report will contain code
examples and a technical description of your implementation, it would be
surprising if it were shorter than 3 pages. Please try to stay within 6
pages. A concise, but well-written report is preferable to a long, but
poorly-written one.
## Compiler Extension Presentation Instructions
Presentations will take place in weeks 13-14.
**The presentation should be 9 minutes long.**
A **Q&A session of 5 minutes** will follow right after the
presentation.
Shortly after, you will receive feedback from us regarding the content
of your presentation, as well as some general feedback on the form.
### Presentation content
Your presentation should summarize your project. In particular, we'd
expect to see
- a basic overview of the features you added to the compiler/language
- some (short) programs highlighting the use of these features, with a
description of how your extended compiler behaves on them
- possibly some theoretical background you had to learn about to
implement the extension
- an overview of the changes you made to each compiler phase and/or
which phases you added
### Presentation style
Here are some useful resources on how to prepare and give talks:
- [How To Speak by Patrick
Winston](https://www.youtube.com/watch?v=Unzc731iCUY)
- [How to give a great research talk by Simon Peyton
Jones](https://www.microsoft.com/en-us/research/academic-program/give-great-research-talk/)
Please do not use Viktor's videos as a model for the presentation, but
instead incorporate as many points of the talk of [Patrick
Winston](https://en.wikipedia.org/wiki/Patrick_Winston) as you believe
apply to your presentation. It is an amazing and entertaining talk,
despite (or because) it is meta-circular: he does as he says. Note:
breaking physical objects or referring to supernatural beings in your
video is not required. Use your own judgment and strike a balance in
being comfortable with what and how you are saying things and trying out
these pieces of advice.
### Additional Guidelines
These are guidelines for your project presentation. You do not have to follow them, but if you have no idea how to start, feel free to use them.
Presentation Structure:
0) Extension Title, Group Number, Names of the group Members
1) Summary: What will we see? (Quickly)
2) An Overview of the extensions with some examples to illustrate
- What new exciting features does the extension make now possible?
- How are these used? 1 example which uses all the new features
3) The main part of the presentation:
- Focusing on Project Features
- For each Feature explain how the different compiler phases were affected
- Example: Feature A allows us to do THIS, with SUCH characteristics, and it affects the following phases LIKE THIS
- Focusing on Compiler Phases
- For each Compiler Phase explain how each feature affected it (Interpreter?, Lexer, Parser, Name Analyzer, Type Checker, Code Generator)
- Example: Phase A is affected in THIS way, because of the following features, leading to SUCH characteristics
- Use examples to help you show the changes
4) Conclusion
- Conclude the presentation content, very quickly
- Present some possible upgrades or expansions to this project, and be critical of your own work
5) Remember to say "Thank You" and transition to the questions section (For example: "Do you have any questions?")
Talking:
- Remember to drink water before the presentation, your mouth will get dry otherwise
- Remember to articulate
- Remember to finish your sentences
- Remember to divide the presentation evenly between the group members
- Prepare and rehearse your speech
Slides:
- Do not put too much text
- Use examples to take your presentation down to earth
Other Tips:
- Do not be scared to have your speech in a piece of {paper | another document | comments on the slides} to be able to {read | refresh} your memory if you need to
- If you realize you do not have the time to say everything you want, focus on the most important things, and only briefly mention the least important parts
package amyc.ast
import amyc.utils.Positioned
// Definitions of symbolic Amy syntax trees
trait TreeModule {
// Common ancestor for all trees
trait Tree extends Positioned
// Expressions
trait Expr extends Tree
// Variables
case class Variable(name: Identifier) extends Expr
// Literals
trait Literal[+T] extends Expr { val value: T }
case class IntLiteral(value: Int) extends Literal[Int]
case class BooleanLiteral(value: Boolean) extends Literal[Boolean]
case class StringLiteral(value: String) extends Literal[String]
case class UnitLiteral() extends Literal[Unit] { val value: Unit = () }
// Binary operators
case class Plus(lhs: Expr, rhs: Expr) extends Expr
case class Minus(lhs: Expr, rhs: Expr) extends Expr
case class Times(lhs: Expr, rhs: Expr) extends Expr
case class Div(lhs: Expr, rhs: Expr) extends Expr
case class Mod(lhs: Expr, rhs: Expr) extends Expr
case class LessThan(lhs: Expr, rhs: Expr) extends Expr
case class LessEquals(lhs: Expr, rhs: Expr) extends Expr
case class And(lhs: Expr, rhs: Expr) extends Expr
case class Or(lhs: Expr, rhs: Expr) extends Expr
case class Equals(lhs: Expr, rhs: Expr) extends Expr
case class Concat(lhs: Expr, rhs: Expr) extends Expr
// Unary operators
case class Not(e: Expr) extends Expr
case class Neg(e: Expr) extends Expr
// Function/ type constructor call
case class Call(qname: Identifier, args: List[Expr]) extends Expr
// The ; operator
case class Sequence(e1: Expr, e2: Expr) extends Expr
// Local variable definition
case class Let(df: ParamDef, value: Expr, body: Expr) extends Expr
// If-then-else
case class Ite(cond: Expr, thenn: Expr, elze: Expr) extends Expr
// Pattern matching
case class Match(scrut: Expr, cases: List[MatchCase]) extends Expr {
require(cases.nonEmpty)
}
// Represents a computational error; prints its message, then exits
case class Error(msg: Expr) extends Expr
// Cases and patterns for Match expressions
case class MatchCase(pat: Pattern, expr: Expr) extends Tree
abstract class Pattern extends Tree
case class WildcardPattern() extends Pattern // _
case class IdPattern(name: Identifier) extends Pattern // x
case class LiteralPattern[+T](lit: Literal[T]) extends Pattern // 42, true
case class CaseClassPattern(constr: Identifier, args: List[Pattern]) extends Pattern // C(arg1, arg2)
// Definitions
trait Definition extends Tree { val name: Identifier }
case class ModuleDef(name: Identifier, defs: List[ClassOrFunDef], optExpr: Option[Expr]) extends Definition
trait ClassOrFunDef extends Definition
case class FunDef(name: Identifier, params: List[ParamDef], retType: TypeTree, body: Expr) extends ClassOrFunDef {
def paramNames = params.map(_.name)
}
case class AbstractClassDef(name: Identifier) extends ClassOrFunDef
case class CaseClassDef(name: Identifier, fields: List[TypeTree], parent: Identifier) extends ClassOrFunDef
case class ParamDef(name: Identifier, tpe: TypeTree) extends Definition
// Types
trait Type
case object IntType extends Type
case object BooleanType extends Type
case object StringType extends Type
case object UnitType extends Type
case class ClassType(qname: Identifier) extends Type
// A wrapper for types that is also a Tree (i.e. has a position)
case class TypeTree(tpe: Type) extends Tree
// All is wrapped in a program
case class Program(modules: List[ModuleDef]) extends Tree
}
// Identifiers represent unique names in Amy
class Identifier private(val name: String)