For a brief overview of Scallion and its purpose, you can watch this video. What follows below is a slightly more detailed description, and an example project you can use to familiarize yourself with Scallion.
Introduction to Parser Combinators
The next part of the compiler you will be working on is the parser. The goal of the parser is to convert the sequence of tokens generated by the lexer into an Amy abstract syntax tree (AST).
There are many approaches to writing parsers, such as:
- Writing the parser by hand directly in the compiler's language using mutually recursive functions, or
- Writing the parser in a domain-specific language (DSL) and using a parser generator (such as Bison) to produce the parser.
Another approach, which we will be using, is parser combinators. The idea behind the approach is very simple:
- Have a set of simple primitive parsers, and
- Have ways to combine them together into more and more complex parsers. Hence the name parser combinators.
Usually, those primitive parsers and combinators are provided as a library directly in the language used by the compiler. In our case, we will be working with Scallion, a Scala parser combinator library developed by LARA.
Parser combinators have many advantages, the main one being that the resulting parsers are easy to write, read, and maintain.
Scallion Parser Combinators
Documentation
In this document, we will introduce parser combinators in Scallion and showcase how to use them. This document is not intended to be a complete reference to Scallion. Fortunately, the library comes with comprehensive API documentation which fulfills that role. Feel free to refer to it while working on your project!
Playground Project
We have set up an example project that implements a lexer and parser for a simple expression language using Scallion. Feel free to experiment and play with it. The project showcases the API of Scallion and some of the more advanced combinators.
Setup
In Scallion, parsers are defined within a trait called `Syntaxes`. This trait takes two types as parameters:
- The type of tokens,
- The type of token kinds. Token kinds represent groups of tokens. They abstract away all the details found in the actual tokens, such as positions or identifier names. Each token has a unique kind.

In our case, the tokens will be of type `Token`, which we introduced and used in the previous project. The token kinds will be `TokenKind`, which we have already defined for you.
```scala
object Parser extends Pipeline[Iterator[Token], Program]
                 with Parsers {

  type Token = myproject.Token
  type Kind = myproject.TokenKind

  // Indicates the kind of the various tokens.
  override def getKind(token: Token): TokenKind = TokenKind.of(token)

  // Your parser implementation goes here.
}
```
The `Parsers` trait (mixed into the `Parser` object above) comes from Scallion and provides all the functions and types you will use to define your grammar and AST translation.
Writing Parsers
When writing a parser using parser combinators, one defines many smaller parsers and combines them together into more and more complex parsers. The top-level, most complex, of those parsers then defines the entire syntax for the language. In our case, that top-level parser will be called `program`.
All those parsers are objects of the type `Syntax[A]`. The type parameter `A` indicates the type of values produced by the parser. For instance, a parser of type `Syntax[Int]` produces `Int`s and a parser of type `Syntax[Expr]` produces `Expr`s. Our top-level parser has the following signature:
```scala
lazy val program: Syntax[Program] = ...
```
Contrary to the types of tokens and token kinds, which are fixed, the type of values produced is a type parameter of the various `Syntax`es. This allows your different parsers to produce different types of values.
The various parsers are stored as `val` members of the `Parser` object. In the case of mutually dependent parsers, we use `lazy val` instead.
```scala
lazy val definition: Syntax[ClassOrFunDef] =
  functionDefinition | abstractClassDefinition | caseClassDefinition

lazy val functionDefinition: Syntax[ClassOrFunDef] = ...

lazy val abstractClassDefinition: Syntax[ClassOrFunDef] = ...

lazy val caseClassDefinition: Syntax[ClassOrFunDef] = ...
```
Running Parsers
Parsers of type `Syntax[A]` can be converted to objects of type `Parser[A]`, which have an `apply` method that takes an iterator of tokens as parameter and returns a value of type `ParseResult[A]`, which can be one of three things:
- A `Parsed(value, rest)`, which indicates that the parser was successful and produced the value `value`. The entirety of the input iterator was consumed by the parser.
- An `UnexpectedToken(token, rest)`, which indicates that the parser encountered an unexpected token `token`. The input iterator was consumed up to the erroneous token.
- An `UnexpectedEnd(rest)`, which indicates that the end of the iterator was reached and the parser could not finish at this point. The input iterator was completely consumed.
In each case, the additional value `rest` is itself some sort of a `Parser[A]`. That parser represents the parser after the successful parse or at the point of error. This parser could be used to provide useful error messages or even to resume parsing.
```scala
override def run(ctx: Context)(tokens: Iterator[Token]): Program = {
  import ctx.reporter._

  val parser = Parser(program)

  parser(tokens) match {
    case Parsed(result, rest) => result
    case UnexpectedEnd(rest) => fatal("Unexpected end of input.")
    case UnexpectedToken(token, rest) => fatal("Unexpected token: " + token)
  }
}
```
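As noted above, the `rest` parser could be used to make such error messages more informative. As a sketch, assuming the parser returned alongside the error also exposes the `first` method described below, the error case could list the token kinds that would have been accepted:

```scala
// Hypothetical variant of the UnexpectedToken case above.
case UnexpectedToken(token, rest) =>
  fatal("Unexpected token: " + token +
    ", expected one of: " + rest.first.mkString(", "))
```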
Parsers and Grammars
As you will see, parsers built using parser combinators will look a lot like grammars. However, unlike grammars, parsers not only describe the syntax of your language, but also directly specify how to turn this syntax into a value. Also, as we will see, parser combinators have a richer vocabulary than your usual BNF grammars.
Interestingly, a lot of concepts that you have seen on grammars, such as `FIRST` sets and nullability, can be straightforwardly transposed to parsers.
FIRST set
In Scallion, parsers offer a `first` method which returns the set of token kinds that are accepted as a first token.
```scala
definition.first === Set(def, abstract, case)
```
Nullability
Parsers have a `nullable` method which checks for nullability of a parser. The method returns `Some(value)` if the parser would produce `value` given an empty input token sequence, and `None` if the parser would not accept the empty sequence.
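For instance (reusing the `definition` parser from above, which always starts with a keyword, and the `epsilon` parser introduced below):

```scala
// `definition` does not accept the empty sequence of tokens:
definition.nullable === None

// A parser built with `epsilon` produces its value on empty input:
epsilon(42).nullable === Some(42)
```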
Basic Parsers
We can now finally have a look at the toolbox at our disposal for building parsers, starting with the basic parsers. Each parser that you will write, however complex, is a combination of these basic parsers. The basic parsers play the same role as terminal symbols do in grammars.
Elem
The first of the basic parsers is `elem(kind)`. The function `elem` takes as argument the kind of tokens to be accepted by the parser. The value produced by the parser is the token that was matched. For instance, here is how to match against the end-of-file token.
```scala
val eof: Syntax[Token] = elem(EOFKind)
```
Accept
The function `accept` is a variant of `elem` which directly applies a transformation to the matched token when it is produced.
```scala
val identifier: Syntax[String] = accept(IdentifierKind) {
  case IdentifierToken(name) => name
}
```
Epsilon
The parser `epsilon(value)` is a parser that produces the `value` without consuming any input. It corresponds to the 𝛆 found in grammars.
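For example, a parser producing a default value without reading any tokens could be sketched as follows (the `noExprs` name is just for illustration):

```scala
// Accepts the empty sequence of tokens and produces an empty list.
val noExprs: Syntax[List[Expr]] = epsilon(List.empty[Expr])
```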
Parser Combinators
In this section, we will see how to combine parsers together to create more complex parsers.
Disjunction
The first combinator we have is disjunction, which we write, for parsers `p1` and `p2`, simply as `p1 | p2`. When both `p1` and `p2` are of type `Syntax[A]`, the disjunction `p1 | p2` is also of type `Syntax[A]`. The disjunction operator is associative and commutative.
Disjunction works just as you think it does. If either of the parsers `p1` or `p2` would accept the sequence of tokens, then the disjunction also accepts the tokens. The value produced is the one produced by either `p1` or `p2`.
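The `definition` parser shown earlier is an example: all three alternatives are of type `Syntax[ClassOrFunDef]`, and so is their disjunction. As a smaller sketch, with hypothetical parsers `literalExpr` and `variableExpr`, both of type `Syntax[Expr]`:

```scala
// Both alternatives produce an Expr, so the disjunction does too.
lazy val simpleExpr: Syntax[Expr] = literalExpr | variableExpr
```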
Note that `p1` and `p2` must have disjoint `first` sets. This restriction ensures that no ambiguities can arise and that parsing can be done efficiently.¹ We will see later how to automatically detect when this is not the case and how to fix the issue.
Sequencing
The second combinator we have is sequencing. We write, for parsers `p1` and `p2`, the sequence of `p1` and `p2` as `p1 ~ p2`. When `p1` is of type `Syntax[A]` and `p2` of type `Syntax[B]`, their sequence is of type `Syntax[A ~ B]`, where `A ~ B` is simply a pair of an `A` and a `B`.
If the parser `p1` accepts the prefix of a sequence of tokens and `p2` accepts the postfix, the parser `p1 ~ p2` accepts the entire sequence and produces the pair of values produced by `p1` and `p2`.
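As a sketch, a parser for a parenthesized expression could then be written as follows (the token kinds `OpenParenKind` and `CloseParenKind` and the `expr` parser are hypothetical):

```scala
// The produced value is a nested pair of type (Token ~ Expr) ~ Token.
lazy val parenthesized: Syntax[Token ~ Expr ~ Token] =
  elem(OpenParenKind) ~ expr ~ elem(CloseParenKind)
```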
Note that the `first` set of `p2` should be disjoint from the `first` set of all sub-parsers in `p1` that are nullable and in trailing position (available via the `followLast` method). This restriction ensures that the combinator does not introduce ambiguities.