Newer
Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
# Lab 03: Parser ([Slides](lab03-slides.pdf))
## Introduction
Starting from this week you will work on the second stage of the Amy
compiler, the parser. The task of the parser is to take a sequence of
tokens produced by the lexer and transform it into an Abstract Syntax
Tree (AST).
For this purpose you will write a grammar for Amy programs in a Domain
Specific Language (DSL) that can be embedded in Scala. Similarly to what
you have seen in the Lexer lab, each grammar rule will also be
associated with a transformation function that maps the parse result to
an AST. The overall grammar will then be used to automatically parse
sequences of tokens into Amy ASTs, while abstracting away extraneous
syntactical details, such as commas and parentheses.
As you have seen (and will see) in the lectures, there are various
algorithms to parse syntax trees corresponding to context-free grammars.
Any context-free grammar (after some normalization) can be parsed using
the CYK algorithm. However, this algorithm is rather slow: its
complexity is in O(n\^3 \* g) where n is the size of the program and g
the size of the grammar. On the other hand, a more restricted LL(1)
grammar can parse inputs in linear time. Thus, the goal of this lab will
be to develop an LL(1) version of the Amy grammar.
### The Parser Combinator DSL
In the previous lab you already started working with **Silex**, which
was the library we used to tokenize program inputs based on a
prioritized list of regular expressions. In this lab we will start using
its companion library, **Scallion**: Once an input string has been
tokenized, Scallion allows us to parse the token stream using the rules
of an LL(1) grammar and translate to a target data structure, such as an
AST.
To familiarize yourself with the parsing functionality of Scallion,
please make sure you read the [Introduction to (Scallion) Parser
Combinators](material/scallion.md). In it, you will learn how to describe grammars
in Scallion\'s parser combinator DSL and how to ensure that your grammar
lies in LL(1) (which Scallion requires to function correctly).
Once you understand parser combinators, you can get to work on your own
implementation of an Amy parser in `Parser.scala`. Note that in this lab
you will essentially operate on two data structures: Your parser will
consume a sequence of `Token`s (defined in `Tokens.scala`) and produce
an AST (as defined by `NominalTreeModule` in `TreeModule.scala`). To
accomplish this, you will have to define appropriate parsing rules and
translation functions for Scallion.
In `Parser.scala` you will already find a number of parsing rules given
to you, including the starting non-terminal `program`. Others, such as
`expr` are stubs (marked by `???`) that you will have to complete
yourself. Make sure to take advantage of Scallion\'s various helpers
such as the `operators` method that simplifies defining operators of
different precedence and associativity.
### An LL(1) grammar for Amy
As usual, the [Amy specification](/labs/amy-specification/amy-specification.pdf) will guide you when
it comes to deciding what exactly should be accepted by your parser.
Carefully read Section 2 (*Syntax*).
Note that the EBNF grammar in Figure 2 merely represents an
over-approximation of Amy\'s true grammar \-- it is too imprecise to be
useful for parsing: Firstly, the grammar in Figure 2 is ambiguous. That
is, it allows multiple ways to parse an expression. E.g. `x + y * z`
could be parsed as either `(x + y) * z` or as `x + (y * z)`. In other
words, the grammar doesn\'t enforce either operator precedence or
associativity correctly. Additionally, the restrictions mentioned
throughout Section 2 of the specification are not followed.
Your task is thus to come up with appropriate rules that encode Amy\'s
true grammar. Furthermore, this grammar should be LL(1) for reasons of
efficiency. Scallion will read your grammar, examine if it is in LL(1),
and, if so, parse input programs. If Scallion determines that the
grammar is not in LL(1), it will report an error. You can also instruct
Scallion to generate some counter-examples for you (see the `checkLL1`
function).
### Translating to ASTs
Scallion will parse a sequence of tokens according to the grammar you
provide, however, without additional help, it does not know how to build
Amy ASTs. For instance, a (non-sensical) grammar that only accepts
sequences of identifier tokens, e.g.
many(elem(IdentifierKind)): Syntax[Seq[Token]]
will be useful in deciding whether the input matches the expected form,
but will simply return the tokens unchanged when parsing succeeds.
Scallion does allow you to map parse results from one type to another,
however. For instance, in the above example we might want to provide a
function `f(idTokens: Seq[Token]): Seq[Variable]` that transforms the
identifier tokens into (Amy-AST) variables of those names.
For more information on how to use Scallion\'s `Syntax#map` method
please refer to the [Scallion introduction](material/scallion.md).
## Notes
### Understanding the AST: Nominal vs. Symbolic Trees
If you check the TreeModule file containing the ASTs, you will notice it
is structured in an unusual way: There is a `TreeModule` class extended
by `NominalTreeModule` and `SymbolicTreeModule`. The reason for this
design is that we need two very similar ASTs, but with different types
representing names in each case: Just after parsing (this assignment),
all names are just Strings and qualified names are essentially pairs of
Strings. We call ASTs that only use such String-based names `Nominal`
\-- the variant we will be using in this lab. Later, during name
analysis, these names will be resolved to unique identifiers, e.g. two
variables that refer to different definitions will be distinct, even if
they have the same name. For now you can just look at the TreeModule and
substitute the types that are not defined there (`Name` and
`QualifiedName`) with their definitions inside `NominalTreeModule`.
### Positions
As you will notice in the code we provide, all generated ASTs have their
position set. The position of each node of the AST is defined as its
starting position. It is important that you set the positions in all the
trees that you create for better error reporting later. Although our
testing infrastructure cannot directly check for presence of positions,
we will check it manually.
### Pretty Printing
Along with the stubs, we provide a printer for Amy ASTs. It will print
parentheses around all expressions so you can clearly see how your
parser interprets precedence and associativity. You can use it to test
your parser, and it will also be used during our testing to compare the
output of your parser with the reference parser.
## Skeleton
As usual, you can find the skeleton in the git repository. This lab
builds on your previous work, so \-- given your implementation of the
lexer \-- you will only unpack two files from the skeleton.
The structure of your project `src` directory should be as follows:
amyc
├── Main.scala (updated)
│
├── ast (new)
│ ├── Identifier.scala
│ ├── Printer.scala
│ └── TreeModule.scala
│
├── lib
│ ├── scallion_3.0.6.jar (new)
│ └── silex_3.0.6.jar
│
├── parsing
│ ├── Parser.scala (new)
│ ├── Lexer.scala
│ └── Tokens.scala
│
└── utils
├── AmycFatalError.scala
├── Context.scala
├── Document.scala
├── Pipeline.scala
├── Position.scala
├── Reporter.scala
└── UniqueCounter.scala
## Reference compiler
Recall you can use the [reference compiler](/labs/amy_reference_compiler.md) for any doubts you have on the intended behaviour. For this lab you can use the command:
```
java -jar amyc-assembly-1.7.jar --printTrees <files>
```
## Deliverables
Deadline: **Friday November 4 at 11pm**.
Submission: push the solved lab 3 to the branch `clplab3` that was created on your Gitlab repo. Do not push the changes to other branches! It may interfere with your previous submissions.
You may want to copy the files you changed directly to the new branch, since the two branches don't share a history in git.