\begin{exercise}{}
For each of the following languages \(L_i\), construct an NFA \(\mathcal{A}_i\)
that recognizes it, i.e. \(L(\mathcal{A}_i) = L_i\):
\begin{enumerate}
\item \(L_1\): binary strings whose value is divisible by 3
\item \(L_2\): binary strings whose value is divisible by 4
\item \(L_3\): \(\{(w_1 \oplus w_2) \mid w_1 \in L_1 \land w_2 \in L_2 \land |w_1| = |w_2|\}\)
\end{enumerate}
where \(\oplus\) is the bitwise-xor operation on binary strings.
\begin{solution}
\begin{enumerate}
\item The language of binary strings divisible by 3. We need two
observations to construct this automaton:
\begin{enumerate}
\item If the automaton has consumed a binary string \(s\) with decimal
value, say, \(val(s) = n\), then we can determine the decimal value of
the string after reading one more character as either \(val(s0) = 2n\)
or \(val(s1) = 2n + 1\).
\item The set of possible values is infinite, so we cannot track the value
exactly with finitely many states; however, it is sufficient to know the
value of the string \emph{modulo 3} to determine if it is divisible.
\end{enumerate}
We construct the automaton \(\mathcal{A}_1 = (Q, \Sigma, \delta, q_{init}, F)\)
where:
\begin{itemize}
\item \(Q = \{q_{init}, q_0, q_1, q_2\}\), representing the initial
state (empty word has no value), and the states corresponding to the
values \(0, 1, 2\) modulo 3.
\item \(\Sigma = \{0, 1\}\), as required.
\item \(\delta = \{(q_i, 0, q_j) \mid 2i \mod 3 = j\} \cup \{(q_i, 1,
q_j) \mid (2i + 1) \mod 3 = j\} \cup \{(q_{init}, 0, q_0), (q_{init}, 1,
q_1)\}\), i.e., there is a transition from \(q_i\) to \(q_j\) on reading
\(0\) if, when the currently known value modulo 3 is \(i\), the next value
is \(j = 2i \mod 3\). We use the fact that \(2n \mod 3 = 2 (n \mod 3)
\mod 3\). The case for reading \(1\) is similar. The transitions from
the initial state go to the states corresponding to the values \(0, 1\)
modulo 3.
For example, if we have read ``1101'', with decimal value \(13\), we
must be in state \(q_1\), as \(13 \mod 3 = 1\). On reading a \(0\), we
have the string ``11010'' with decimal value \(26\), and \(26 \mod 3 =
2\), so we transition to \(q_2\).
The full automaton is below.
\item \(F = \{q_0\}\), as we accept exactly the words whose value is
divisible by 3, and hence equal to \(0\) modulo 3.
\end{itemize}
The automaton is:
\begin{center}
\begin{tikzpicture}[node distance = 2cm, on grid]
\node[state, initial] (qi) {\(q_{init}\)};
\node[state, accepting, right of = qi] (q0) {\(q_0\)};
\node[state, above right of = q0] (q1) {\(q_1\)};
\node[state, below right of = q1] (q2) {\(q_2\)};
\draw[->]
(qi) edge node[above] {\(0\)} (q0)
(qi) edge node[above] {\(1\)} (q1)
%
(q0) edge[loop below] node[below] {\(0\)} (q0)
(q0) edge node[above] {\(1\)} (q1)
%
(q1) edge[bend right] node[below] {\(0\)} (q2)
(q1) edge[bend left] node[below] {\(1\)} (q0)
%
(q2) edge node[above] {\(0\)} (q1)
(q2) edge[loop below] node[below] {\(1\)} (q2)
;
\end{tikzpicture}
\end{center}
\item The language of binary strings whose value is divisible by 4. The
construction is similar to the one above, now with 5 states, tracking the
value modulo 4.
\begin{center}
\begin{tikzpicture}[node distance = 2cm, on grid]
\node[state, initial] (qi) {\(q_{init}\)};
\node[state, accepting, right of = qi] (q0) {\(q_0\)};
\node[state, above of = q0] (q1) {\(q_1\)};
\node[state, right of = q1] (q2) {\(q_2\)};
\node[state, right of = q0] (q3) {\(q_3\)};
\draw[->]
(qi) edge node[above] {\(0\)} (q0)
(qi) edge node[above] {\(1\)} (q1)
%
(q0) edge[loop below] node[below] {\(0\)} (q0)
(q0) edge node[left] {\(1\)} (q1)
%
(q1) edge node[above] {\(0\)} (q2)
(q1) edge[bend right] node[above] {\(1\)} (q3)
%
(q2) edge node[above] {\(0\)} (q0)
(q2) edge[bend left] node[above] {\(1\)} (q1)
%
(q3) edge node[right] {\(0\)} (q2)
(q3) edge[loop below] node[below] {\(1\)} (q3)
;
\end{tikzpicture}
\end{center}
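Both constructions follow the same remainder-tracking recipe. As an
illustrative sketch (not part of the solution), the following Python function
simulates the automaton for an arbitrary modulus \(m\):

```python
# Sketch: simulate the remainder-tracking automaton for any modulus m.
# Remainder i plays the role of state q_i; None plays the role of q_init.

def divisible_by(word: str, m: int) -> bool:
    state = None                       # q_init: nothing read yet
    for c in word:
        n = 0 if state is None else state
        state = (2 * n + int(c)) % m   # val(s0) = 2n, val(s1) = 2n + 1
    return state == 0                  # F = {q_0}

# Cross-check against ordinary arithmetic for m = 3 and m = 4:
assert all(divisible_by(format(n, "b"), m) == (n % m == 0)
           for m in (3, 4) for n in range(64))
```

Note that the empty word is rejected, matching \(q_{init} \notin F\).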
\item To recognize the bitwise-xor of two languages, we construct a product
automaton. To accept a word \(w\), there must exist \(w_1, w_2\) such that
\(w = w_1 \oplus w_2\), \(w_1 \in L_1\), and \(w_2 \in L_2\).
We do not explicitly construct the automaton, but present an argument.
First, consider the truth table for xor:
\begin{center}
\begin{tabular}{c c | c}
\(b_1\) & \(b_2\) & \(b_1 \oplus b_2\) \\
\hline
0 & 0 & 0 \\
0 & 1 & 1 \\
1 & 0 & 1 \\
1 & 1 & 0 \\
\end{tabular}
\end{center}
Notably, given a xor result, we cannot exactly determine the input bits.
In essence, we construct an automaton that, given a string,
non-deterministically simulates the two input automata in parallel on all
possible pairs of input strings. If some such pair of runs accepts, then we
have found a pair of strings that, one, are accepted by the two original
automata, and, two, have the input string as their bitwise-xor.
Formally, the automaton \(\mathcal{A}_3 = (Q, \Sigma, \delta, q_{init},
F)\) has:
\begin{itemize}
\item \(Q = Q_1 \times Q_2\), where \(Q_1\) and \(Q_2\) are the state
sets of \(\mathcal{A}_1\) and \(\mathcal{A}_2\).
\item \(\Sigma = \{0, 1\}\) as before.
\item \(q_{init} = (q_{1, init}, q_{2, init})\) where \(q_{1, init}\)
and \(q_{2, init}\) are the initial states of \(\mathcal{A}_1\) and
\(\mathcal{A}_2\).
\item \(F = F_1 \times F_2\) similarly.
\item \(\delta\) is constructed as follows: for a pair of states \((q_1,
q_2)\), on reading a \(0\), we look at the truth table of xor; two input
pairs, \((0, 0)\) and \((1, 1)\), could have produced this result bit.
Hence, we add transitions stepping both automata simultaneously: a
transition \(((q_1, q_2), 0, (q_1', q_2'))\) corresponding to the possible
inputs \((0, 0)\) if \(\delta_1(q_1, 0, q_1')\) and \(\delta_2(q_2, 0,
q_2')\), and similarly a transition \(((q_1, q_2), 0, (q_1'', q_2''))\)
corresponding to the possible inputs \((1, 1)\) if \(\delta_1(q_1, 1,
q_1'')\) and \(\delta_2(q_2, 1, q_2'')\).
The case for reading a \(1\) is similar, with possible input pairs \((0,
1)\) and \((1, 0)\).
\end{itemize}
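The product construction can likewise be sketched in code. In the illustrative
Python fragment below (not part of the solution), remainders modulo 3 and 4
stand in for the states of \(\mathcal{A}_1\) and \(\mathcal{A}_2\), and the
usual NFA simulation tracks the \emph{set} of reachable state pairs:

```python
from itertools import product

# Sketch of A_3: simulate the two remainder automata in parallel, guessing
# for each output bit which xor preimage ((0,0)/(1,1) for bit 0,
# (0,1)/(1,0) for bit 1) produced it. States are pairs of remainders.

def step(rem: int, bit: int, m: int) -> int:
    return (2 * rem + bit) % m        # val(s0) = 2n, val(s1) = 2n + 1

def xor_product_accepts(word: str) -> bool:
    if not word:
        return False                  # the empty word is in neither L_1 nor L_2
    states = {(0, 0)}
    for c in word:
        b = int(c)
        preimages = [(0, 0), (1, 1)] if b == 0 else [(0, 1), (1, 0)]
        states = {(step(r1, b1, 3), step(r2, b2, 4))
                  for (r1, r2) in states for (b1, b2) in preimages}
    return (0, 0) in states           # some guessed pair landed in F_1 x F_2

# Brute-force cross-check on all words up to length 6:
for n in range(1, 7):
    xors = set()
    for w1 in ("".join(p) for p in product("01", repeat=n)):
        if int(w1, 2) % 3 == 0:
            for w2 in ("".join(p) for p in product("01", repeat=n)):
                if int(w2, 2) % 4 == 0:
                    xors.add("".join("01"[a != b] for a, b in zip(w1, w2)))
    for w in ("".join(p) for p in product("01", repeat=n)):
        assert xor_product_accepts(w) == (w in xors)
```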
\end{enumerate}
\end{solution}
\end{exercise}
% \begin{exercise}{}
% \todo{Suggestion: From Kozen H2.3, recognizing with errors; regular language modulo n Hamming distance is regular}
% \end{exercise}
\begin{exercise}{}
Give a verbal and a set-notational description of the language accepted by
each of the following automata. You can assume that the alphabet is \(\Sigma =
\{a, b\}\).
\begin{enumerate}
\item \(\mathcal{A}_1\)
\begin{center}
\begin{tikzpicture}[node distance = 2cm, on grid]
\node[state, initial] (q0) {\(q_0\)};
\node[state, accepting, right of = q0] (q1) {\(q_1\)};
\node[state, right of = q1] (q2) {\(q_2\)};
\draw[->] (q0) edge node[above] {\(a\)} (q1)
(q1) edge node[above] {\(a\)} (q2)
(q2) edge[loop above] node[above] {\(a, b\)} (q2)
(q0) edge[loop above] node[above] {\(b\)} (q0)
(q1) edge[loop above] node[above] {\(b\)} (q1)
;
\end{tikzpicture}
\end{center}
\item \(\mathcal{A}_2\)
\begin{center}
\begin{tikzpicture}[node distance = 2cm, on grid]
\node[state, accepting, initial] (q0) {\(q_0\)};
\node[state, accepting, right of = q0] (q1) {\(q_1\)};
\node[state, right of = q1] (q2) {\(q_2\)};
\draw[->] (q0) edge node[below] {\(a\)} (q1)
(q1) edge[bend right] node[above] {\(b\)} (q0)
(q2) edge[loop above] node[above] {\(a, b\)} (q2)
(q0) edge[loop above] node[above] {\(b\)} (q0)
(q1) edge node[above] {\(a\)} (q2)
;
\end{tikzpicture}
\end{center}
\end{enumerate}
\begin{solution}
\begin{enumerate}
\item
As a regular expression: \(b^*ab^*\); this is the language of words that
contain exactly one \(a\). In set-notation:
\begin{equation*}
\{w \mid \exists! i.\; 0 \leq i < |w| \land w_{(i)} = a\}
\end{equation*}
\item
As a generalized regular expression (with complement): \((\Sigma^* aa
\Sigma^*)^c\). Without complement: \(b^*(ab^+)^*(a \mid \epsilon)\).
This is the language of words that contain no two consecutive
\(a\)'s. In set-notation:
\begin{equation*}
\{w \mid \forall i.\; (0 \leq i < |w| \land w_{(i)} = a) \implies (i + 1 \geq |w| \lor w_{(i + 1)} \neq a)\}
\end{equation*}
\end{enumerate}
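As a sanity check (not part of the solution), one can run both automata on all
short words and compare against the verbal descriptions. The Python sketch
below transcribes the transition tables from the diagrams above, with \(q_2\)
acting as the sink state in each case:

```python
from itertools import product

# Transition tables transcribed from the diagrams of A_1 and A_2.
A1 = {("q0", "a"): "q1", ("q0", "b"): "q0", ("q1", "a"): "q2",
      ("q1", "b"): "q1", ("q2", "a"): "q2", ("q2", "b"): "q2"}
A2 = {("q0", "a"): "q1", ("q0", "b"): "q0", ("q1", "a"): "q2",
      ("q1", "b"): "q0", ("q2", "a"): "q2", ("q2", "b"): "q2"}

def run(delta, word, finals):
    state = "q0"
    for c in word:
        state = delta[(state, c)]
    return state in finals

# Compare against the verbal descriptions on all words up to length 6.
for n in range(7):
    for w in ("".join(p) for p in product("ab", repeat=n)):
        assert run(A1, w, {"q1"}) == (w.count("a") == 1)       # exactly one a
        assert run(A2, w, {"q0", "q1"}) == ("aa" not in w)     # no "aa"
```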
\end{solution}
\end{exercise}
% \begin{exercise}{}
% Construct a DFA for the following languages:
% \begin{enumerate}
% \item the set of strings over \(\Sigma = \{0, 1\}\) such that the number
% of 0's is a multiple of 2, and the number of 1's is a multiple of 3.
% \note{\cite[HW 1.1.c]{kozen2007automata}}
% \item the set of strings over \(\Sigma = \{0, 1\}\) that, as binary
% numerals, have decimal value less than 42. \note{\cite[Ex 2.1.a]{mogensen2010basics}}
% \end{enumerate}
% For the cases above:
% \begin{enumerate}
% \item is the complement of each language regular as well?
% \item is their union regular?
% \item is their intersection regular?
% \item is their difference regular?
% \end{enumerate}
% \begin{solution}
% The DFA constructions:
% \paragraph*{The set of strings over \(\Sigma = \{0, 1\}\) such that the
% number of 0's is a multiple of 2, and the number of 1's is a
% multiple of 3.} The DFA states count the number of 0's and 1's modulo
% 2 and 3 respectively. The transitions correspond to incrementing
% modulo 2 and 3. A state \(q_{n, m}\) represents the set of words
% such that the number of 0's \(= n \mod 2\) and the number of 1's \(=
% m \mod 3\).
% The state \(q_{0, 0}\) is the desired initial \emph{and}
% final state (why?).
% The transitions are:
% \begin{align*}
% q_{0, 0} \xrightarrow{0} q_{1, 0} & & q_{0, 0} \xrightarrow{1} q_{0, 1} \\
% q_{0, 1} \xrightarrow{0} q_{1, 1} & & q_{0, 1} \xrightarrow{1} q_{0, 2} \\
% q_{0, 2} \xrightarrow{0} q_{1, 2} & & q_{0, 2} \xrightarrow{1} q_{0, 0} \\
% q_{1, 0} \xrightarrow{0} q_{0, 0} & & q_{1, 0} \xrightarrow{1} q_{1, 1} \\
% q_{1, 1} \xrightarrow{0} q_{0, 1} & & q_{1, 1} \xrightarrow{1} q_{1, 2} \\
% q_{1, 2} \xrightarrow{0} q_{0, 2} & & q_{1, 2} \xrightarrow{1} q_{1, 0}
% \end{align*}
% In this manner, DFA states represent a \emph{finite partitioning} of the
% set of words \(\Sigma^*\), and the transitions define how these
% partitions correspond to each other.
% \textbf{Extra}: This is made more apparent in the algorithms for DFA
% minimization and equivalence checking and makes DFAs as interesting as
% they are theoretically. See
% \href{https://en.wikipedia.org/wiki/Myhill%E2%80%93Nerode_theorem}{Myhill-Nerode Theorem}
% or
% \href{https://en.wikipedia.org/wiki/Brzozowski_derivative}{Brzozowski derivatives}
% on Wikipedia.
% \paragraph*{The set of binary strings whose value is less than 42}.
% First, let's attempt to describe this set as a regular expression. The
% binary representation of 42 is \(101010_2\). We write the expression in
% a natural way, decomposing case by case looking at the bits:
% \begin{tikzpicture}[node distance = 3cm, on grid]
% \node (leading) {\(0^*\)};
% \node[below left = 1cm of leading] (leadinglabel) {Leading zeroes};
% \node[above right of = leading] (b50) {\(0 \,\Sigma^5\)};
% \node[below = 0.5cm of b50] (b50label) {\(< 32\)};
% \node[below right of = leading] (b51) {\(1\)};
% \node[below = 0.5cm of b51] (b51label) {\(\geq 32\)};
% \node[above right of = b51] (b40) {\(1 \,\Sigma^4\)};
% \node[below = 0.5cm of b40] (b40label) {\(\geq 48\)};
% \node[below right of = b51] (b41) {\(0\)};
% \node[below = 0.5cm of b41] (b41label) {\(< 48\)};
% \node[right of = b41] (dots) {\(\ldots\)};
% \path[->] (leading) edge node[above left] {read 5th bit as 0} ( b50)
% (leading) edge (b51)
% (b51) edge (b40)
% (b51) edge node[above right] {read 4th bit as 0, continue} (b41)
% (b41) edge (dots)
% ;
% \end{tikzpicture}
% And finally collapse this diagram into an expression:
% %
% \begin{gather*}
% L_{\leq 42} = 0^*(0\,\Sigma^5 \mid 1(0(0\,\Sigma^3 \mid 1(0(0\,\Sigma^1 \mid 10)))))
% \end{gather*}
% The `quick terminating' branches read \(010101\) (one's complement of
% \(42_{10}\)) and the `continuing' branches read \(101010\), i.e.
% \(42_{10}\). This is just a binary search for the number 42!
% The DFA for this regular expression (not provided) looks nearly
% identical to the diagram above. \(L_{< 42}\) requires only a minor
% change at the end.
% This can be done more generally, given any binary string, constructing a
% DFA recognizing all binary strings numerically smaller than it.
% \vspace{1em}
% For the regularity on applying operations:
% \begin{enumerate}
% \item the complements are regular. In the DFA, change all final
% states to non-final, and all previously non-final states to final.
% This is an NFA accepting the complement of the original language.
% \item the union is regular. Take the new NFA \((Q_1 \cup Q_2,
% \Sigma, \delta_1 \cup \delta_2, S_1 \cup S_2, F_1 \cup F_2)\) where
% \(A_1 = (Q_1, \Sigma, \delta_1, S_1, F_1)\) and \(A_2 = (Q_2,
% \Sigma, \delta_2, S_2, F_2)\) are the individual DFAs for the
% languages above. This is the `trivial' union of the automata, and
% accepts exactly the words that are in one or both of the languages.
% \item the intersection is regular. The construction is a bit more
% involved. Intuitively, in the way that the union NFA allows
% accepting runs from either automaton, the intersection NFA must
% simulate the runs of both automata together. Given two automata
% \(A_1 = (Q_1, \Sigma, \delta_1, S_1, F_1)\) and \(A_2 = (Q_2,
% \Sigma, \delta_2, S_2, F_2)\) representing the two languages, define
% the intersection NFA \(A_\cap = (Q_1 \times Q_2, \Sigma \times
% \Sigma, \delta, S_1 \times S_2, F_1 \times F_2)\), where \(\delta\)
% is defined as
% \begin{gather*}
% \delta \subseteq (Q_1 \times Q_2) \times (\Sigma \times \Sigma) \times (Q_1 \times Q_2) \\
% \delta((q_1, q_2), (a_1, a_2), (q_1', q_2')) \iff \delta_1(q_1, a_1, q_1') \land \delta_2(q_2, a_2, q_2')~.
% \end{gather*}
% Thus, \(A_\cap\) operates simultaneously on a state from \(A_1\),
% and a state from \(A_2\). There is a transition from \((q_1, q_2)\)
% to \((q_1', q_2')\) when reading the letters \((a_1, a_2)\)
% simultaneously if the transitions on \emph{both} sides are defined.
% Discuss and convince yourself that the language of \(A_\cap\) is
% precisely the intersection of the languages of \(A_1\) and \(A_2\).
% \item the difference is regular. It follows from the complement and
% intersection regularities as \(L_1 \setminus L_2 = L_1 \cap \bar
% L_2\).
% \end{enumerate}
% This has some interesting consequences! Consider the predicate \(x \geq
% 42 \land x > 14\). We can construct DFAs representing the individual
% conditions, and to represent the logical-and, take their intersection.
% This gives us a way to solve some arithmetic constraints using automata!
% \textbf{Extra}: Automata (of different kinds) have been extensively
% studied as ways to solve many classes of constraints. Regarding
% arithmetic, there is the seminal result by B\"uchi, Weak Second-order
% Arithmetic and Finite Automata \cite{buchi1960weak}.
% \end{solution}
% \end{exercise}
% \begin{exercise}{}
% \todo{Optionally, ask them to construct the NFA? But it's too simple}
% Consider the following NFA, \(A\):
% %
% \begin{center}
% \begin{tikzpicture}[node distance = 2cm, on grid]
% \node[state, initial] (q0) {\(q_0\)};
% \node[state, right of = q0] (q1) {\(q_1\)};
% \node[state, right of = q1] (q2) {\(q_2\)};
% \node[state, accepting, right of = q2] (q3) {\(q_3\)};
% \draw[->, loop] (q0) edge node[above] {\(a, b\)} (q0);
% \path[->] (q0) edge node[above] {\(a\)} (q1)
% (q1) edge node[above] {\(a\)} (q2)
% (q2) edge node[above] {\(a\)} (q3)
% ;
% \end{tikzpicture}
% \end{center}
% Describe the language \(L(A)\). Is \(A\) a DFA (barring missing
% transitions)? If not, determinize it using the subset construction seen in
% class
% %
% \footnote{This is called subset / product / powerset construction by
% different authors. Wikipedia for quick reference:
% \url{https://en.wikipedia.org/wiki/Powerset_construction}}.
% \begin{solution}
% \(L(A)\) is the set of strings over \(\{a, b\}\) ending in three a's. As
% a regular expression, \(L(A) = (a \mid b)^*aaa\).
% It is not deterministic due to the two outgoing transitions from \(q_0\)
% labelled \(a\).
% For the subset construction, we define the power set automaton \(A' =
% (2^Q, \Sigma, \delta', S', F')\), where \(Q = \{q_0, q_1, q_2, q_3\}\).
% The set of initial states \(S'\) only contains the singleton
% \(\{q_0\}\), as we start in solely the initial state \(q_0\). So, \(S' =
% \{\{q_0\}\}\).
% The set of final states \(F'\) consists of those sets that contain a
% final state, here only \(q_3\). Thus, \(F' = \{s \in 2^Q \mid q_3 \in s\}\). This
% corresponds to `angelic' executions, i.e., a word is accepted if there
% is \emph{a} run of it that leads to a final state.
% We draw the set of states \(2^Q\). The states are represented compactly
% due to space constraints: \(s_{23}\) represents the set \(\{q_2,
% q_3\}\), for example. The empty set is marked by \(s_\emptyset\).
% %
% \begin{center}
% \begin{tikzpicture}[node distance = 2cm, on grid]
% \node[state] (q0000) {\(s_\emptyset\)};
% \node[state, right of = q0000] (q0001) {\(s_0\)};
% \node[state, right of = q0001] (q0011) {\(s_{01}\)};
% \node[state, right of = q0011] (q0010) {\(s_1\)};
% \node[state, below of = q0000] (q0100) {\(s_2\)};
% \node[state, right of = q0100] (q0101) {\(s_{02}\)};
% \node[state, right of = q0101] (q0111) {\(s_{012}\)};
% \node[state, right of = q0111] (q0110) {\(s_{12}\)};
% \node[state, accepting, below of = q0100] (q1100) {\(s_{23}\)};
% \node[state, accepting, right of = q1100] (q1101) {\(s_{023}\)};
% \node[state, accepting, right of = q1101] (q1111) {\(s_{0123}\)};
% \node[state, accepting, right of = q1111] (q1110) {\(s_{123}\)};
% \node[state, accepting, below of = q1100] (q1000) {\(s_3\)};
% \node[state, accepting, right of = q1000] (q1001) {\(s_{03}\)};
% \node[state, accepting, right of = q1001] (q1011) {\(s_{013}\)};
% \node[state, accepting, right of = q1011] (q1010) {\(s_{13}\)};
% \node[above = 1.5cm of q0001] (start) {start};
% \draw[->] (start) -- (q0001);
% \end{tikzpicture}
% \end{center}
% % not relevant anymore
% % The states are drawn as a
% % \href{https://en.wikipedia.org/wiki/Karnaugh_map}{Karnaugh Map} on 4
% % bits for consistency. Each adjacent cell differs by only `one bit'.
% %
% Finally, we add the transitions corresponding to the outgoing
% transitions \(t_0\)--\(t_4\). We define the new transition relation as
% %
% \begin{gather*}
% \forall\; s, s' \in 2^Q.\; \delta'(s, a, s') \iff s' = \{p \mid \exists\; q \in s.\; \delta(q, a, p)\}
% \end{gather*}
% Example: what is the transition from the set \(\{q_0, q_2\}\) on \(a\)?
% From \(q_0\) we can transition to \(q_0\) (by \(t_0\)) or to \(q_1\) (by
% \(t_2\)), and from \(q_2\), we can only transition to \(q_3\) (by
% \(t_4\)). Thus, we have the transition \(\{q_0, q_2\} \xrightarrow{a}
% \{q_0, q_1\} \cup \{q_3\}\), or in the notation as above, \(s_{02}
% \xrightarrow{a} s_{013}\). We do this for every state and every letter
% (\(16 \times 2\)).
% The resulting automaton is too complicated to draw fully while
% maintaining any readability! Below, we draw all the relevant
% transitions. Any missing transitions go to the dead state
% \(s_\emptyset\) (marked by the incoming arrow labelled `several').
% %
% % \begin{center}
% % \begin{tikzpicture}[node distance = 2cm, on grid]
% % \node[state] (q0000) {\(s_\emptyset\)};
% % \node[state, right of = q0000] (q0001) {\(s_0\)};
% % \node[state, right of = q0001] (q0011) {\(s_{01}\)};
% % \node[state, right of = q0011] (q0010) {\(s_1\)};
% % \node[state, below of = q0000] (q0100) {\(s_2\)};
% % \node[state, right of = q0100] (q0101) {\(s_{02}\)};
% % \node[state, right of = q0101] (q0111) {\(s_{012}\)};
% % \node[state, right of = q0111] (q0110) {\(s_{12}\)};
% % \node[state, accepting, below of = q0100] (q1100) {\(s_{23}\)};
% % \node[state, accepting, right of = q1100] (q1101) {\(s_{023}\)};
% % \node[state, accepting, right of = q1101] (q1111) {\(s_{0123}\)};
% % \node[state, accepting, right of = q1111] (q1110) {\(s_{123}\)};
% % \node[state, accepting, below of = q1100] (q1000) {\(s_3\)};
% % \node[state, accepting, right of = q1000] (q1001) {\(s_{03}\)};
% % \node[state, accepting, right of = q1001] (q1011) {\(s_{013}\)};
% % \node[state, accepting, right of = q1011] (q1010) {\(s_{13}\)};
% % \node[above = 1.5cm of q0001] (start) {start};
% % \draw[->] (start) -- (q0001);
% % \path[->, color = red]
% % (q0001.45) edge[out=0, in=90, loop] node[above] {\calpha} (q0001.45)
% % (q0011.45) edge[out=0, in=90, loop] node[above] {\calpha} (q0011.45)
% % (q0101.45) edge[out=0, in=90, loop] node[above] {\calpha} (q0101.45)
% % (q0111.45) edge[out=0, in=90, loop] node[above] {\calpha} (q0111.45)
% % (q1111.45) edge[out=0, in=90, loop] node[above] {\calpha} (q1111.45)
% % (q1101.45) edge[out=0, in=90, loop] node[above] {\calpha} (q1101.45)
% % (q1001.45) edge[out=0, in=90, loop] node[above] {\calpha} (q1001.45)
% % (q1011.45) edge[out=0, in=90, loop] node[above] {\calpha} (q1011.45)
% % ;
% % \path[->, color = blue]
% % (q0001.225) edge[out=180, in=270, loop] node[below left] {\cbeta} (q0001.225)
% % (q0011.225) edge[out=180, in=270, loop] node[below left] {\cbeta} (q0011.225)
% % (q0101.225) edge[out=180, in=270, loop] node[below left] {\cbeta} (q0101.225)
% % (q0111.225) edge[out=180, in=270, loop] node[below left] {\cbeta} (q0111.225)
% % (q1111.225) edge[out=180, in=270, loop] node[below left] {\cbeta} (q1111.225)
% % (q1101.225) edge[out=180, in=270, loop] node[below left] {\cbeta} (q1101.225)
% % (q1001.225) edge[out=180, in=270, loop] node[below left] {\cbeta} (q1001.225)
% % (q1011.225) edge[out=180, in=270, loop] node[below left] {\cbeta} (q1011.225)
% % ;
% % \path[->, color = OliveGreen]
% % (q0001) edge[] node[below] {\cgamma} (q0011)
% % (q0011.315) edge[in=-90, out=0, loop] node[below] {\cgamma} (q0011.315)
% % (q0101) edge[] node[below] {\cgamma} (q0111)
% % (q0111.315) edge[in=-90, out=0, loop] node[below] {\cgamma} (q0111.315)
% % (q1111.315) edge[in=-90, out=0, loop] node[below] {\cgamma} (q1111.315)
% % (q1101) edge[] node[below] {\cgamma} (q1111)
% % (q1001) edge[] node[below] {\cgamma} (q1011)
% % (q1011.315) edge[in=-90, out=0, loop] node[below] {\cgamma} (q1011.315)
% % ;
% % \path[->, color = purple]
% % (q0010) edge[] node[right] {\ceta} (q0110)
% % (q0011) edge[] node[right] {\ceta} (q0111)
% % (q0110.135) edge[out=90, in=180, loop] node[above] {\ceta} (q0110.135)
% % (q0111.135) edge[out=90, in=180, loop] node[above] {\ceta} (q0111.135)
% % (q1010) edge[] node[right] {\ceta} (q1110)
% % (q1011) edge[] node[right] {\ceta} (q1111)
% % (q1110.135) edge[out=90, in=180, loop] node[above] {\ceta} (q1110.135)
% % (q1111.135) edge[out=90, in=180, loop] node[above] {\ceta} (q1111.135)
% % ;
% % \path[->, color = orange]
% % (q0100) edge[] node[left] {\cpi} (q1100)
% % (q0101) edge[] node[left] {\cpi} (q1101)
% % (q0110) edge[] node[left] {\cpi} (q1110)
% % (q0111) edge[] node[left] {\cpi} (q1111)
% % (q1100.0) edge[in=0, out=90, loop] node[above right] {\cpi} (q1100.0)
% % (q1101.0) edge[in=0, out=90, loop] node[above right] {\cpi} (q1101.0)
% % (q1110.0) edge[in=0, out=90, loop] node[above right] {\cpi} (q1110.0)
% % (q1111.0) edge[in=0, out=90, loop] node[above right] {\cpi} (q1111.0)
% % ;
% % \end{tikzpicture}
% % \end{center}
% % \begin{center}
% % \begin{tikzpicture}[node distance = 2cm, on grid]
% % \node[state] (q0000) {\(s_\emptyset\)};
% % \node[state, right of = q0000] (q0001) {\(s_0\)};
% % \node[state, right of = q0001] (q0011) {\(s_{01}\)};
% % \node[state, right of = q0011] (q0010) {\(s_1\)};
% % \node[state, below of = q0000] (q0100) {\(s_2\)};
% % \node[state, right of = q0100] (q0101) {\(s_{02}\)};
% % \node[state, right of = q0101] (q0111) {\(s_{012}\)};
% % \node[state, right of = q0111] (q0110) {\(s_{12}\)};
% % \node[state, accepting, below of = q0100] (q1100) {\(s_{23}\)};
% % \node[state, accepting, right of = q1100] (q1101) {\(s_{023}\)};
% % \node[state, accepting, right of = q1101] (q1111) {\(s_{0123}\)};
% % \node[state, accepting, right of = q1111] (q1110) {\(s_{123}\)};
% % \node[state, accepting, below of = q1100] (q1000) {\(s_3\)};
% % \node[state, accepting, right of = q1000] (q1001) {\(s_{03}\)};
% % \node[state, accepting, right of = q1001] (q1011) {\(s_{013}\)};
% % \node[state, accepting, right of = q1011] (q1010) {\(s_{13}\)};
% % \node[above = 1.5cm of q0001] (start) {start};
% % \draw[->] (start) -- (q0001);
% % % (q0000)
% % % (q0001)
% % % (q0010)
% % % (q0011)
% % % (q0100)
% % % (q0101)
% % % (q0110)
% % % (q0111)
% % % (q1000)
% % % (q1001)
% % % (q1010)
% % % (q1011)
% % % (q1100)
% % % (q1101)
% % % (q1110)
% % % (q1111)
% % \path[->]
% % (q0000) edge[loop] node[above] {\(a\)} (q0000)
% % (q0001) edge[] node[above] {\(a\)} (q0011)
% % (q0010) edge[] node[above] {\(a\)} (q0100)
% % (q0011) edge[] node[above] {\(a\)} (q0111)
% % (q0100) edge[] node[above] {\(a\)} (q1000)
% % (q0101) edge[] node[above] {\(a\)} (q1011)
% % (q0110) edge[] node[above] {\(a\)} (q1100)
% % (q0111) edge[] node[above] {\(a\)} (q1111)
% % (q1000) edge[] node[above] {\(a\)} (q0000)
% % (q1001) edge[] node[above] {\(a\)} (q0011)
% % (q1010) edge[] node[above] {\(a\)} (q0100)
% % (q1011) edge[] node[above] {\(a\)} (q0111)
% % (q1100) edge[] node[above] {\(a\)} (q1000)
% % (q1101) edge[] node[above] {\(a\)} (q1011)
% % (q1110) edge[] node[above] {\(a\)} (q1100)
% % (q1111) edge[loop] node[above] {\(a\)} (q1111)
% % ;
% % \path[->]
% % (q0000) edge[loop] node[right] {\(b\)} (q0000)
% % (q0001) edge[] node[right] {\(b\)} (q0001)
% % (q0010) edge[] node[right] {\(b\)} (q0000)
% % (q0011) edge[] node[right] {\(b\)} (q0001)
% % (q0100) edge[] node[right] {\(b\)} (q0000)
% % (q0101) edge[] node[right] {\(b\)} (q0001)
% % (q0110) edge[] node[right] {\(b\)} (q0000)
% % (q0111) edge[] node[right] {\(b\)} (q0001)
% % (q1000) edge[] node[right] {\(b\)} (q0000)
% % (q1001) edge[] node[right] {\(b\)} (q0001)
% % (q1010) edge[] node[right] {\(b\)} (q0000)
% % (q1011) edge[] node[right] {\(b\)} (q0001)
% % (q1100) edge[] node[right] {\(b\)} (q0000)
% % (q1101) edge[] node[right] {\(b\)} (q0001)
% % (q1110) edge[] node[right] {\(b\)} (q0000)
% % (q1111) edge[loop] node[right] {\(b\)} (q0001)
% % ;
% % \end{tikzpicture}
% % \end{center}
% \begin{center}
% \begin{tikzpicture}[node distance = 2cm, on grid]
% \node[state] (q0000) {\(s_\emptyset\)};
% \node[state, right of = q0000] (q0001) {\(s_0\)};
% \node[state, right of = q0001] (q0011) {\(s_{01}\)};
% \node[state, right of = q0011] (q0010) {\(s_1\)};
% \node[state, below of = q0000] (q0100) {\(s_2\)};
% \node[state, right of = q0100] (q0101) {\(s_{02}\)};
% \node[state, right of = q0101] (q0111) {\(s_{012}\)};
% \node[state, right of = q0111] (q0110) {\(s_{12}\)};
% \node[state, accepting, below of = q0100] (q1100) {\(s_{23}\)};
% \node[state, accepting, right of = q1100] (q1101) {\(s_{023}\)};
% \node[state, accepting, right of = q1101] (q1111) {\(s_{0123}\)};
% \node[state, accepting, right of = q1111] (q1110) {\(s_{123}\)};
% \node[state, accepting, below of = q1100] (q1000) {\(s_3\)};
% \node[state, accepting, right of = q1000] (q1001) {\(s_{03}\)};
% \node[state, accepting, right of = q1001] (q1011) {\(s_{013}\)};
% \node[state, accepting, right of = q1011] (q1010) {\(s_{13}\)};
% \node[above = 1.5cm of q0001] (start) {start};
% \draw[->] (start) -- (q0001);
% \node[left = 1.5cm of q0000] (several) {several};
% \draw[->] (several) -- (q0000);
% % (q0000)
% % (q0001)
% % (q0010)
% % (q0011)
% % (q0100)
% % (q0101)
% % (q0110)
% % (q0111)
% % (q1000)
% % (q1001)
% % (q1010)
% % (q1011)
% % (q1100)
% % (q1101)
% % (q1110)
% % (q1111)
% \path[->]
% (q0000.north) edge[out=45, in=135, loop] node[above] {\(a, b\)} (q0000.north)
% (q0001) edge[] node[above] {\(a\)} (q0011)
% (q0010) edge[] node[above] {\(a\)} (q0100)
% (q0011) edge[] node[right] {\(a\)} (q0111)
% (q0100) edge[bend right] node[left] {\(a\)} (q1000)
% (q0101) edge[] node[above] {\(a\)} (q1011)
% (q0110) edge[] node[above] {\(a\)} (q1100)
% (q0111) edge[] node[right] {\(a\)} (q1111)
% (q1001) edge[] node[above] {\(a\)} (q0011)
% (q1010) edge[] node[above] {\(a\)} (q0100)
% (q1011) edge[bend right] node[right] {\(a\)} (q0111)
% (q1100) edge[] node[right] {\(a\)} (q1000)
% (q1101) edge[] node[above] {\(a\)} (q1011)
% (q1110) edge[bend left] node[below] {\(a\)} (q1100)
% (q1111.315) edge[out=270, in=0, loop] node[right] {\(a\)} (q1111.315)
% ;
% \path[->]
% (q0001.45) edge[out=0, in=90, loop] node[right] {\(b\)} (q0001.45)
% (q0011.north) edge[bend right] node[above] {\(b\)} (q0001.north)
% (q0101) edge[] node[right] {\(b\)} (q0001)
% (q0111) edge[] node[right] {\(b\)} (q0001)
% (q1001) edge[bend left] node[left] {\(b\)} (q0001)
% (q1011) edge[] node[right] {\(b\)} (q0001)
% (q1101) edge[bend left = 45] node[above left] {\(b\)} (q0001)
% (q1111) edge[] node[right] {\(b\)} (q0001)
% ;
% \end{tikzpicture}
% \end{center}
% After removing the remaining dead states and transitions for visibility,
% we can see the DFA:
% \begin{center}
% \begin{tikzpicture}[node distance = 2cm, on grid]
% \node[state, initial] (q0) {\(s_0\)};
% \node[state, right of = q0] (q1) {\(s_{01}\)};
% \node[state, right of = q1] (q2) {\(s_{012}\)};
% \node[state, accepting, right of = q2] (q3) {\(s_{0123}\)};
% \draw[->, in=180, out=90, loop] (q0.135) edge node[above] {\(b\)} (q0.135);
% \draw[->, in=45, out=-45, loop] (q3.0) edge node[right] {\(a\)} (q3.0);
% \path[->] (q0) edge node[above] {\(a\)} (q1)
% (q1) edge node[above] {\(a\)} (q2)
% (q2) edge node[above] {\(a\)} (q3)
% ;
% \path[->] (q1) edge[bend left] node[below left] {\(b\)} (q0)
% (q2) edge[bend right] node[above] {\(b\)} (q0)
% (q3) edge[bend left] node[below] {\(b\)} (q0)
% ;
% \end{tikzpicture}
% \end{center}
% This is an `eager' version of our original NFA. It accepts the same
% language. However, since it cannot be `non-deterministic' by
% definition, every time it comes across an \(a\), it must `guess' that it
% is the first of three ending \(a\)'s, and backtrack if it is not the
% case.
% % Although equally powerful, the DFA more accurately captures how a
% % human-written parser would operate in practice on this language.
% \end{solution}
% \end{exercise}
\section{Languages and Automata}
\begin{exercise}{}
Consider the following languages defined by regular expressions:
\begin{enumerate}
\item \(\{a,ab\}^*\)
\item \(\{aa\}^* \cup \{aaa\}^*\)
\item \(a^+b^+\)
\end{enumerate}
and the following languages defined in set-builder notation:
\begin{enumerate}
\renewcommand{\theenumi}{\Alph{enumi}}
\item \(\{w \mid \forall i.\; (0 \le i < |w| \land w_{(i)} = b) \implies (i > 0 \land w_{(i - 1)} = a)\}\) % 1
\item \(\{w \mid \forall i. 0 \le i < |w| - 1 \implies w_{(i)} = b \implies w_{(i + 1)} = a\}\) % wrong
\item \(\{w \mid \exists i. 0 < i < |w| \land w_{(i)} = b \land w_{(i - 1)} = a\}\) % wrong
\item \(\{w \mid (|w| \mod 2 = 0 \lor |w| \mod 3 = 0) \land \forall i.\; 0 \leq i < |w| \implies w_{(i)} = a\}\) % 2
\item \(\{w \mid \forall i. 0 \le i < |w| - 1 \land w_{(i)} = a \implies w_{(i + 1)} = b\}\) % wrong
\item \(\{w \mid \exists i.\; 0 \le i < |w| - 1 \land
(\forall y.\; 0 \leq y \leq i \implies w_{(y)} = a) \land (\forall y.\; i < y < |w| \implies w_{(y)} = b) \}\) % 3
\end{enumerate}
For each pair (e.g. 1-A), check whether the two languages are equal, providing
a proof if they are, and a counterexample word that is in one but not the
other if unequal.
\begin{solution}
Equal language pairs: \(1 \mapsto A, 2 \mapsto D, 3 \mapsto F\).
Counterexamples (\(\cdot^\star\) marks a word that is in the letter-labelled
language but not the number-labelled one; unstarred words are in the
number-labelled language but not the letter-labelled one):
\begin{center}
\begin{tabular}{c c c c c c c}
& A & B & C & D & E & F \\
1 & - & a & a & a & aa & a \\
2 & ab\(^\star\) & ba\(^\star\)& ab\(^\star\)& - & ab\(^\star\)& aa \\
3 & abb & abb & aba\(^\star\) & aaabb & aab & - \\
\end{tabular}
\end{center}
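Such claims can be spot-checked mechanically before attempting a proof. The
Python sketch below (illustrative only; the greedy decomposition is justified
by the shape of \(\{a, ab\}\): each block starts with \(a\), optionally
followed by \(b\)) compares membership in \(\{a,ab\}^*\) with the predicate of
language A on all short words:

```python
from itertools import product

def in_star(w: str) -> bool:
    # Greedy check: decompose w into blocks from {a, ab}, preferring "ab".
    i = 0
    while i < len(w):
        if w[i] != "a":
            return False
        i += 2 if w[i:i + 2] == "ab" else 1
    return True

def pred_A(w: str) -> bool:
    # Every b is immediately preceded by an a.
    return all(i > 0 and w[i - 1] == "a"
               for i in range(len(w)) if w[i] == "b")

# The two languages agree on all words over {a, b} up to length 8.
for n in range(9):
    for w in ("".join(p) for p in product("ab", repeat=n)):
        assert in_star(w) == pred_A(w)
```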
We prove the first case as an example.
\begin{equation*}
\{a,ab\}^* = \{w \mid \forall i. 0 \le i \le |w| \land w_{(i)} = b \implies (i > 0 \land w_{(i - 1)} = a)\}
\end{equation*}
We must prove both directions, i.e. that \(\{a,ab\}^* \subseteq \{w \mid
P(w)\}\) and that \(\{w \mid P(w)\} \subseteq \{a,ab\}^*\).
\noindent
\textbf{Forward}: \(\{a,ab\}^* \subseteq \{w \mid P(w)\}\):
We must show that for all \(w \in \{a,ab\}^*\), \(P(w)\) holds. For any \(i
\in \naturals\), given that \(0 \le i \le |w| \land w_{(i)} = b\), we need
to show that \(i > 0 \land w_{(i - 1)} = a\).
From the definition of \(*\) on sets of words, we know that there must exist
\(n \leq |w|\) words \(w_1, \ldots, w_n \in \{a, ab\}\) such that \(w = w_1
\ldots w_n\). The index \(i\) must be in the range of one of these words,
i.e. there exist \(1 \leq m \leq n\) and \(0 \leq j < |w_m|\) such that
\(w_{(i)} = w_{m(j)}\).
We know that \(w_{(i)} = b\) and \(w_{m} \in \{a, ab\}\) by assumption. The
case \(w_m = a\) is a contradiction, since it cannot contain \(b\). Thus,
\(w_m = ab\). We know that \(w_{(i)} = w_{m(j)} = b\), so \(j = 1\). Thus,
\(w_{(i - 1)} = w_{m(j - 1)} = w_{m(0)} = a\), as required. Since \(i - 1
\geq 0\), being an index into \(w\), \(i > 0\) holds as well. Hence,
\(P(w)\) holds.
\noindent
\textbf{Backward}: \(\{w \mid P(w)\} \subseteq \{a,ab\}^*\):
We must show that for all \(w\) such that \(P(w)\) holds, \(w \in
\{a,ab\}^*\). We know by the definition of \(*\) again, that \(w \in \{a,
ab\}^*\) if and only if there exist \(n \leq |w|\) words \(w_1, \ldots, w_n \in
\{a, ab\}\) such that \(w = w_1 \ldots w_n\). We attempt to show that if
\(P(w)\) holds, then \(w\) admits such a decomposition.
We proceed by induction on the length of \(w\).
\noindent
\textit{Base Case \(|w| = 0\)}: The empty word has a decomposition \(w =
\epsilon\) (with \(n = 0\) in the decomposition). QED.
\noindent
\textit{Base Case \(|w| = 1\)}: The word \(w\) is either \(a\) or \(b\). We know
that \(P(w)\) holds, so \(w = a\) (why?). The decomposition is \(w = a\),
with \(n = 1\) and \(w_1 = a\). QED.
\noindent
\textit{Inductive Case \(|w| > 1\)}:
Induction hypothesis: for all words \(v\) such that \(|v| < |w|\) and
\(P(v)\) holds, \(v\) admits a decomposition into words in \(\{a, ab\}\),
and thus \(v \in \{a, ab\}^*\).
We need to show that if \(P(w)\) holds, then \(w\) admits such a
decomposition as well. Split the proof based on the first two characters of
\(w\). There are four possibilities. We give the name \(v\) to the rest of
\(w\).
\begin{enumerate}
\item \(w = aav\): let \(u = av\), so that \(w = au\). We show that
\(P(u)\) holds. Suppose \(u_{(i)} = b\) for some \(0 \le i < |u|\). Since
\(u_{(0)} = a\), we must have \(i > 0\). Moreover, \(u_{(i)} = w_{(i +
1)}\), so \(P(w)\) gives \(w_{(i)} = a\), that is, \(u_{(i - 1)} = a\).
Hence \(P(u)\) holds. (Note that we peel off only a single \(a\) here:
\(P(w)\) does not in general imply \(P(v)\), as \(w = aab\) with \(v = b\)
shows.)
Since \(|u| = |w| - 1 < |w|\) and \(P(u)\) holds, the induction hypothesis
applies: \(u\) has a decomposition into words in \(\{a, ab\}\), \(u =
v_1\ldots v_m\) for some \(m \leq |u|\) and \(v_i \in \{a, ab\}\).
We can now construct a decomposition for \(w\), \(w = w_1\ldots w_{m+1}\)
such that \(w_1 = a\) and \(w_{i + 1} = v_i\) for \(1 \le i \le m\). Since
\(m \leq |u|\) and \(|u| = |w| - 1\), \(m + 1 \leq |w|\). QED.
\item \(w = abv\): we show that \(P(v)\) holds. Suppose \(v_{(i)} = b\).
If \(i = 0\), then \(w_{(2)} = b\), so \(P(w)\) gives \(w_{(1)} = a\),
contradicting \(w_{(1)} = b\); hence \(i > 0\). Then \(v_{(i)} = w_{(i +
2)} = b\), so \(P(w)\) gives \(w_{(i + 1)} = a\), that is, \(v_{(i - 1)} =
a\). Hence \(P(v)\) holds, and by the induction hypothesis, \(v\) has a
decomposition into words in \(\{a, ab\}\), \(v = v_1\ldots v_m\) for some
\(m \leq |v|\) and \(v_i \in \{a, ab\}\).
We can now construct a decomposition for \(w\), \(w = w_1\ldots w_{m+1}\)
such that \(w_1 = ab\) and \(w_{i + 1} = v_i\) for \(1 \le i \le m\). Since
\(m \leq |v|\) and \(|v| = |w| - 2\), \(m + 1 \leq |w|\). QED.
\item \(w = bav\) or \(w = bbv\): \(P(w)\) cannot hold (set \(i = 0\)), so
the statement is vacuously true.
\end{enumerate}
\end{solution}
\end{exercise}
% flavour text for lexer constructions
\section{Lexing}
% In the lectures, we have seen how to manually construct a lexer for small
% regular expressions. We often use tools that generate lexers from regular
% expressions. You will see one such tool, Silex, while building the Amy lexer.
% % about automata for lexing
% Lexing frameworks process the description of tokens for a given language, and
% may use a variety of techniques to construct the final lexer. The result is a
% program that accepts a string and returns a list of tokens. One way to do this
% automatically is by constructing and composing automata.
Consider a simple arithmetic language that allows you to compute one arithmetic
expression, construct conditionals, and let-bind expressions. An example program
is:
\begin{lstlisting}
let x = 3 in
let y = ite (x > 0) (x * x) 0 in
(2 * x) + y
\end{lstlisting}
The lexer for this language must recognize the following tokens:
\begin{align*}
\texttt{keyword}: &\quad \texttt{let} \mid \texttt{in} \mid \texttt{ite}\\
\texttt{op}: &\quad \texttt{+} \mid \texttt{-} \mid \texttt{*} \mid \texttt{/} \\
\texttt{comp}: &\quad \texttt{>} \mid \texttt{<} \mid \texttt{==} \mid \texttt{<=} \mid \texttt{>=} \\
\texttt{equal}: &\quad \texttt{=} \\
\texttt{lparen}: &\quad \texttt{(} \\
\texttt{rparen}: &\quad \texttt{)} \\
\texttt{id}: &\quad letter \cdot (letter \mid digit)^* \\
\texttt{number}: &\quad digit^+ \\
\texttt{skip}: &\quad \texttt{whitespace}
\end{align*}
For simplicity, \(letter\) is a shorthand for the set of all English lowercase
letters \(\{a - z\}\) and \(digit\) is a shorthand for the set of all decimal
digits \(\{0 - 9\}\).
% \todo{if we allow an \texttt{ite} keyword with operators, we can ask them how
% chained operators would be parsed, eg: \texttt{<===>=<===}. Is this interesting?}
\begin{exercise}{}
For each of the tokens above, construct an NFA that recognizes strings matching
its regular expression.
\begin{solution}
The construction is similar in each case, following translation of regular
expressions to automata. For example:
\begin{itemize}
\item \texttt{keyword}: \texttt{let} $\mid$ \texttt{in} $\mid$ \texttt{ite}
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
\node[state,initial] (q_0) {$q_0$};
%
\node[state] (ql_1) [above right=of q_0] {$q_l$};
\node[state] (ql_2) [right=of ql_1] {$q_e$};
\node[state,accepting] (ql_3) [right=of ql_2] {$q_{let}$};
%
\node[state] (qin_1) [right=of q_0] {$q_{i1}$};
\node[state,accepting] (qin_2) [right=of qin_1] {$q_{in}$};
%
\node[state] (qite_1) [below right=of q_0] {$q_{i2}$};
\node[state] (qite_2) [right=of qite_1] {$q_t$};
\node[state,accepting] (qite_3) [right=of qite_2] {$q_{ite}$};
%
\path[->]
(q_0) edge node {\texttt{l}} (ql_1)
(ql_1) edge node {\texttt{e}} (ql_2)
(ql_2) edge node {\texttt{t}} (ql_3)
%
(q_0) edge node {\texttt{i}} (qin_1)
(qin_1) edge node {\texttt{n}} (qin_2)
%
(q_0) edge node {\texttt{i}} (qite_1)
(qite_1) edge node {\texttt{t}} (qite_2)
(qite_2) edge node {\texttt{e}} (qite_3)
;
\end{tikzpicture}
\end{center}
\item \texttt{id}: \texttt{letter} $\cdot$ (\texttt{letter} $\mid$ \texttt{digit})$^*$
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=3cm,on grid,auto]
\node[state,initial] (q_0) {$q_0$};
%
\node[state] (q1) [accepting, right=of q_0] {$q_1$};
%
\path[->]
(q_0) edge node {\texttt{letter}} (q1)
(q1) edge[loop above] node {\texttt{letter}} (q1)
(q1) edge[loop below] node {\texttt{digit}} (q1)
;
\end{tikzpicture}
\end{center}
\end{itemize}
The other cases are similar.
\end{solution}
\end{exercise}
A lexer is constructed by combining the NFAs for each of the tokens. At each
step, the lexer tries the NFAs in the order the tokens are listed above, and
emits a token for the first NFA that accepts a prefix of the remaining input,
consuming the longest prefix that this NFA accepts (maximal munch within the
token class). Thus, tokens listed first have higher priority. We then continue
lexing the remaining string. You may assume that the lexer drops any
\texttt{skip} tokens.
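This scheme can be sketched in Python. This is only an approximation for
experimentation, not the Silex-generated lexer: Python's \texttt{re} module
stands in for the NFAs, with the alternatives inside each class ordered so
that the longest one is tried first, and the helper name \texttt{lex} is ours.

```python
import re

# Token classes in priority order. Inside each class, longer alternatives
# come first so that re's leftmost alternation yields the class's longest
# accepted prefix (approximating per-class maximal munch).
TOKENS = [
    ("keyword", r"let|ite|in"),
    ("op",      r"[+\-*/]"),
    ("comp",    r">=|<=|==|>|<"),
    ("equal",   r"="),
    ("lparen",  r"\("),
    ("rparen",  r"\)"),
    ("id",      r"[a-z][a-z0-9]*"),
    ("number",  r"[0-9]+"),
    ("skip",    r"\s+"),
]

def lex(s):
    out, pos = [], 0
    while pos < len(s):
        for name, pat in TOKENS:
            m = re.match(pat, s[pos:])   # anchored at the current position
            if m:
                if name != "skip":       # drop whitespace tokens
                    out.append((name, m.group(0)))
                pos += m.end()
                break
        else:
            raise ValueError(f"lexing error at position {pos}")
    return out
```

For example, \texttt{lex("let y = 12")} yields
\texttt{[("keyword", "let"), ("id", "y"), ("equal", "="), ("number", "12")]}.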
\begin{exercise}{}
For each of the following strings, write down the sequence of tokens that
would be produced by the constructed lexer, if it succeeds.
\begin{enumerate}
\item \texttt{let x = 5 in x + 3}
\item \texttt{let5x2}
\item \texttt{xin}
\item \texttt{==>}
\item \texttt{<===><==}
\end{enumerate}
\begin{solution}
\begin{enumerate}
\item \texttt{[keyword("let"), id("x"), equal("="), number("5"), keyword("in"), id("x"), op("+"), number("3")]}
\item \texttt{[keyword("let"), number("5"), id("x2")]}
\item \texttt{[id("xin")]}
\item \texttt{[comp("=="), comp(">")]}
\item \texttt{[comp("<="), comp("=="), comp(">"), comp("<="), equal("=")]}
\end{enumerate}
\end{solution}
\end{exercise}
\begin{exercise}{}
Construct a string that would be lexed differently if we ran all the NFAs in
parallel and, instead of using token priority, simply picked the longest match
among them.
\begin{solution}
There are many possible solutions. The key is to notice which tokens have
overlapping prefixes.
An example is \texttt{letx1}, which would be lexed as
\texttt{[keyword("let"), id("x1")]} if we check acceptance in order of
priority, but as \texttt{[id("letx1")]} if we run them in parallel.
\end{solution}
\end{exercise}
\documentclass[a4paper]{article}
\input{../macro}
\ifdefined\ANSWERS
\if\ANSWERS1
\printanswers
\fi
\fi
\title{CS 320 \\ Computer Language Processing\\Exercise Set 1}
\author{}
\date{February 28, 2025}
\begin{document}
\maketitle
% languages as sets
\input{ex/languages.tex}
% regex
% automata
\input{ex/dfa.tex}
% regex to automata
% constructing lexers
\input{ex/lexer.tex}
\bibliographystyle{plain}
\bibliography{../biblio}
\end{document}
\begin{exercise}{}
For each of the following languages, give a context-free grammar that
generates it:
\begin{enumerate}
\item \(L_1 = \{a^nb^m \mid n, m \in \naturals \land n \geq 0 \land m \geq n\}\)
\item \(L_2 = \{a^nb^mc^{n+m} \mid n, m \in \naturals\}\)
\item \(L_3 = \{w \in \{a, b\}^* \mid \exists m \in \naturals.\; |w| = 2m +
1 \land w_{(m+1)} = a \}\) (\(w\) is of odd length, has \(a\) in the middle)
\end{enumerate}
\begin{solution}
\begin{enumerate}
\item \(L_1 = \{a^nb^m \mid n, m \in \naturals \land n \geq 0 \land m \geq n\}\)
\begin{align*}
S &::= aSb \mid B\\
B &::= bB \mid \epsilon
\end{align*}
\item \(L_2 = \{a^nb^mc^{n+m} \mid n, m \in \naturals\}\)
\begin{align*}
S &::= aSc \mid B\\
B &::= bBc \mid \epsilon
\end{align*}
A small tweak to \(L_1\)'s grammar allows us to keep track of addition
precisely here. Could we do something similar for \(\{a^nb^nc^n \mid n \in
\naturals\}\)? (open-ended discussion)
\item \(L_3 = \{w \in \{a, b\}^* \mid \exists m \in \naturals.\; |w| = 2m +
1 \land w_{(m+1)} = a \}\)
\begin{align*}
S &::= aSb \mid bSa \mid aSa \mid bSb \mid a
\end{align*}
Note that after each recursive step, the length of the inner string has
the same parity (i.e. odd).
\end{enumerate}
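Such grammar constructions can be sanity-checked by brute-force enumeration of
derivations up to a length bound. A sketch in Python for the first grammar
(helper names are ours; uppercase letters stand for non-terminals):

```python
# Enumerate every word of length at most max_len generated by
#   S ::= aSb | B,  B ::= bB | ε
# and compare against { a^n b^m : m >= n }. Pruning on the terminal count is
# safe because rule applications never decrease it.
def words_up_to(max_len):
    rules = {"S": ["aSb", "B"], "B": ["bB", ""]}
    done, frontier = set(), {"S"}
    while frontier:
        nxt = set()
        for form in frontier:
            i = next((k for k, c in enumerate(form) if c.isupper()), -1)
            if i < 0:
                done.add(form)          # no non-terminal left: a word
                continue
            for rhs in rules[form[i]]:  # expand the leftmost non-terminal
                new = form[:i] + rhs + form[i + 1:]
                if sum(c.islower() for c in new) <= max_len:
                    nxt.add(new)
        frontier = nxt
    return done

expected = {"a" * n + "b" * m
            for n in range(7) for m in range(7) if m >= n and n + m <= 6}
assert words_up_to(6) == expected
```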
\end{solution}
\end{exercise}
\begin{exercise}{}
Consider the following context-free grammar \(G\):
\begin{align*}
A &::= -A \\
A &::= A - \textit{id} \\
A &::= \textit{id} \\
\end{align*}
\begin{enumerate}
\item Show that \(G\) is ambiguous, i.e., there is a string that has two
different possible parse trees with respect to \(G\).
\item Make two different unambiguous grammars recognizing the same words,
\(G_p\), where prefix-minus binds more tightly, and \(G_i\), where
infix-minus binds more tightly.
\item Show the parse trees for the string you produced in (1) with respect
to \(G_p\) and \(G_i\).
\item Produce a regular expression that recognizes the same language as
\(G\).
\end{enumerate}
\begin{solution}
\begin{enumerate}
\item An example string is \(- \textit{id} - \textit{id}\). It can be
parsed as either \(-(\textit{id} - \textit{id})\) or \((- \textit{id}) -
\textit{id}\). The corresponding parse trees are:
\begin{center}
\begin{forest}
[\(A\)
[\(A\)
[\(-\)]
[\(\textit{id}\)]
]
[\(-\)]
[\(\textit{id}\)]
]
\end{forest}
\hspace{10ex}
\begin{forest}
[\(A\)
[\(-\)]
[\(A\)
[\(A\)
[\(\textit{id}\)]
]
[\(-\)]
[\(\textit{id}\)]
]
]
\end{forest}
\end{center}
Left: prefix binds tighter, right: infix binds tighter.
\item \(G_p\):
\begin{align*}
A &::= B \mid A - \textit{id} \\
B &::= -B \mid \textit{id}
\end{align*}
\(G_i\):
\begin{align*}
A &::= C \mid -A \\
C &::= \textit{id} \mid C - \textit{id}
\end{align*}
\item Parse trees for \(- \textit{id} - \textit{id}\) with respect to \(G_p\) (left)
and \(G_i\) (right):
\begin{center}
\begin{forest}
[\(A\)
[\(A\)
[\(B\)
[\(-\)]
[\(B\)
[\(\textit{id}\)]
]
]
]
[\(-\)]
[\(\textit{id}\)]
]
\end{forest}
\hspace{10ex}
\begin{forest}
[\(A\)
[\(-\)]
[\(A\)
[\(C\)
[\(C\)
[\(\textit{id}\)]
]
[\(-\)]
[\(\textit{id}\)]
]
]
]
\end{forest}
\end{center}
\item \(L(G) = L(-^*\textit{id} (-\textit{id})^*)\). Note: \(()\) are part
of the regular expression syntax, not parentheses in the string.
\end{enumerate}
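The equivalence between the grammar and the regular expression can be
spot-checked by brute force. A sketch in Python, writing \texttt{i} for the
token \(\textit{id}\) (the helper \texttt{in\_G} is ours and transcribes the
three rules directly):

```python
import re
from itertools import product

# A ::= -A | A - id | id, with "i" standing for id. A word is in L(G) iff
# it is "i", or "-" followed by a word of L(G), or a word of L(G) followed
# by "-i"; this mirrors the three productions.
def in_G(w, memo={}):
    if w not in memo:
        memo[w] = (w == "i"
                   or (w.startswith("-") and in_G(w[1:]))
                   or (w.endswith("-i") and in_G(w[:-2])))
    return memo[w]

regex = re.compile(r"-*i(?:-i)*\Z")  # -^* id (- id)^*
words = ("".join(p) for n in range(1, 8) for p in product("-i", repeat=n))
assert all(in_G(w) == bool(regex.match(w)) for w in words)
```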
\end{solution}
\end{exercise}
\begin{exercise}{}
Consider the two following grammars \(G_1\) and \(G_2\):
\begin{align*}
G_1: & \\
S &::= S(S)S \mid \epsilon \\
G_2: & \\
R &::= RR \mid (R) \mid \epsilon
\end{align*}
\noindent
Prove that:
\begin{enumerate}
\item \(L(G_1) \subseteq L(G_2)\), by showing that for every parse tree in
\(G_1\), there exists a parse tree yielding the same word in \(G_2\).
\item (Bonus) \(L(G_2) \subseteq L(G_1)\), by showing that there exist
equivalent parse trees or derivations.
\end{enumerate}
\begin{solution}
\begin{enumerate}
\item \(L(G_1) \subseteq L(G_2)\).
We give a recursive transformation of parse trees in \(G_1\) producing
parse trees in \(G_2\).
\begin{enumerate}
\item \textbf{Base case:} The smallest parse tree is the \(\epsilon\)
production, which can be transformed as (left to right):
\begin{center}
\begin{forest}
[\(S\)
[\(\epsilon\)]
]
\end{forest}
\hspace{8ex}
\begin{forest}
[\(R\)
[\(\epsilon\)]
]
\end{forest}
\end{center}
\item \textbf{Recursive case:} Rule \(S ::= S(S)S\). The parse tree transformation is:
\begin{center}
\begin{forest}
[\(S\)
[\(S_1\)]
[\((_2\)]
[\(S_3\)]
[\()_4\)]
[\(S_5\)]
]
\end{forest}
\hspace{10ex}
\begin{forest}
[\(R\)
[\(R_1\)]
[\(R\)
[\(R\)
[\((_2\)]
[\(R_3\)]
[\()_4\)]
]
[\(R_5\)]
]
]
\end{forest}
\end{center}
The nodes are numbered to check that the order of children (left to
right) does not change. This ensures that the word yielded by the tree
is the same. The transformation is applied recursively to the children
\(S_1, S_3, S_5\) to obtain \(R_1, R_3, R_5\).
Verify that the tree on the right is indeed a parse tree in \(G_2\).
\end{enumerate}
\item \(L(G_2) \subseteq L(G_1)\).
Straightforward induction on parse trees does not work easily. The rule
\(R ::= RR\) in \(G_2\) is not directly expressible in \(G_1\) by a simple
transformation of parse trees. However, we can note that, in fact, adding
this rule to \(G_1\) does not change the language!
Consider the grammar \(G_1'\) defined by \(S ::= SS \mid S(S)S \mid
\epsilon\). We must show that for every two words \(v\) and \(w\) in
\(L(G_1)\), \(vw\) is in \(L(G_1)\), and so adding the rule \(S ::= SS\)
does not change the language.
We induct on the length \(|v| + |w|\).
\begin{enumerate}
\item \textbf{Base case:} \(|v| + |w| = 0\). \(v = w = vw = \epsilon \in
L(G_1)\). QED.
\item \textbf{Inductive case:} \(|v| + |w| = n + 1\). The (strong)
induction hypothesis is that for every \(v', w' \in L(G_1)\) with \(|v'| +
|w'| \leq n\), \(v'w' \in L(G_1)\).
From the grammar, we know that either \(v = \epsilon\) or \(v = x(y)z\)
for \(x, y, z \in L(G_1)\). If \(v = \epsilon\), then \(w = vw \in
L(G_1)\). In the second case, \(vw = x(y)zw\). However, \(zw \in
L(G_1)\) by the inductive hypothesis, as \(|z| + |w| < n \).
Thus, \(vw = x(y)z'\) for \(z' \in L(G_1)\). Finally, since \(x, y, z'
\in L(G_1)\), it follows from the grammar rules that \(vw = x(y)z' \in
L(G_1)\).
\end{enumerate}
Thus, \(L(G_1) = L(G_1')\). It can now be shown, just as in the first part,
that \(L(G_2) \subseteq L(G_1') = L(G_1)\).
\end{enumerate}
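Both inclusions can also be checked empirically on short words. A brute-force
sketch in Python, using the membership characterization \(w \in L(G_1) \iff w
= \epsilon \lor w = x(y)z\) with \(x, y, z \in L(G_1)\), together with the
standard fact that \(G_2\) generates exactly the balanced parenthesis strings
(helper names are ours):

```python
from itertools import product

def balanced(w):  # L(G2): balanced strings over ( and )
    depth = 0
    for c in w:
        depth += 1 if c == "(" else -1
        if depth < 0:
            return False
    return depth == 0

def in_G1(w, memo={}):
    # w ∈ L(G1) iff w = ε or w = x(y)z with x, y, z ∈ L(G1)
    if w == "":
        return True
    if w not in memo:
        memo[w] = any(
            w[i] == "(" and w[j] == ")"
            and in_G1(w[:i]) and in_G1(w[i + 1:j]) and in_G1(w[j + 1:])
            for i in range(len(w)) for j in range(i + 1, len(w)))
    return memo[w]

words = ("".join(p) for n in range(9) for p in product("()", repeat=n))
assert all(in_G1(w) == balanced(w) for w in words)
```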
\end{solution}
\end{exercise}
\begin{exercise}{}
Consider a context-free grammar \(G = (A, N, S, R)\). Define the reversed
grammar \(rev(G) = (A, N, S, rev(R))\), where \(rev(R)\) is the set of rules
produced from \(R\) by reversing the right-hand side of each rule, i.e.,
for each rule \(n ::= p_1 \ldots p_n\) in \(R\), there is a rule \(n ::=
p_n \ldots p_1\) in \(rev(R)\), and vice versa. The terminals,
non-terminals, and start symbol of the grammar remain the same.
For example, \(S ::= abS \mid \epsilon\) becomes \(S ::= Sba \mid \epsilon\).
Is it the case that for every context-free grammar \(G\) defining a language
\(L\), the language defined by \(rev(G)\) is the same as the language of
reversed strings of \(L\), \(rev(L) = \{rev(w) \mid w \in L\}\)? Give a proof
or a counterexample.
\begin{solution}
Consider any word \(w\) in the original language. Looking at the definition
of a language \(L(G)\) defined by a grammar \(G\):
\begin{equation*}
w \in L(G) \iff \exists T.\; w = yield(T) \land isParseTree(G, T)
\end{equation*}
There must exist a parse tree \(T\) for \(w\) with respect to \(G\). We must
show that there exists a parse tree for \(rev(w)\) with respect to the
reversed grammar \(G_r = rev(G)\) as well.
We propose that this is precisely the tree \(T_r = mirror(T)\). Thus, we
need to show that \(rev(w) = yield(T_r)\) and that \(isParseTree(G_r,
T_r)\).
\begin{enumerate}
\item \(rev(w) = yield(T_r)\): \(yield(\cdot)\) of a tree is the word
obtained by reading its leaves from left to right. Thus, the yield of the
mirror of a tree \(yield(mirror(\cdot))\) is the word obtained by reading
the leaves of the original tree from right to left. Thus, \(yield(T_r) =
yield(mirror(T)) = rev(yield(T)) = rev(w)\).
\item \(isParseTree(G_r, T_r)\): We need to show that \(T_r\) is a parse
tree with respect to \(G_r\). Consider the definition of a parse tree:
\begin{enumerate}
\item The root of \(T_r\) is the start symbol of \(G_r\): the root of
\(T_r = mirror(T)\) is the same as that of \(T\). Since \(T\)'s root
node must be the start symbol of \(G\), it is also the root symbol of
\(T_r\). \(G\) and \(G_r\) share the same start symbol in our
transformation.
\item The leaves are labelled by the elements of \(A\): the mirror
transformation does not alter the set or the label of leaves, only their
order. This property transfers from \(T\) to \(T_r\) as well.
\item Each non-leaf node is labelled by a non-terminal symbol: the
mirror transformation does not alter the label of non-leaf nodes either,
so this property transfers from \(T\) to \(T_r\) as well.
\item If a non-leaf node has children that are labelled \(p_1, \ldots,
p_n\) left-to-right, then there is a rule \((n ::= p_1 \ldots p_n)\) in
the grammar: consider any non-leaf node in \(T_r\), labelled \(n\), with
children labelled left-to-right \(p_1, \ldots, p_n\). By the definition
of \(mirror\), the original tree \(T\) must have the same node labelled
\(n\), with the reversed list of children left-to-right, \(p_n, \ldots,
p_1\). Since \(T\) is a parse tree for \(G\), \(n ::= p_n \ldots p_1\)
is a valid rule in \(G\), and by the reverse transformation, \(n ::= p_1
\ldots p_n\) must be a rule in \(G_r\). Thus, the property is satisfied.
\end{enumerate}
\end{enumerate}
Thus, both properties are satisfied. Therefore, the language defined by the
reversed grammar is the reversed language of the original grammar.
\end{solution}
\end{exercise}
\begin{exercise}{}
Recall the pumping lemma for regular languages:
For any language \(L \subseteq \Sigma^*\), if \(L\) is regular, there exists a
strictly positive constant \(p \in \naturals\) such that every word \(w \in
L\) with \(|w| \geq p\) can be written as \(w = xyz\) such that:
\begin{itemize}
\item \(x, y, z \in \Sigma^*\)
\item \(|y| > 0\)
\item \(|xy| \leq p\), and
\item \(\forall i \in \naturals.\; xy^iz \in L\)
\end{itemize}
Consider the language \(L = \{w \in \{a\}^* \mid |w| \text{ is prime}\}\).
Show that \(L\) is not regular by using the pumping lemma.
\begin{solution}
\(L = \{w \in \{a\}^* \mid |w| \text{ is prime}\}\) is not a regular
language.
To the contrary, assume it is regular, and so there exists a constant
\(p\) such that the pumping conditions hold for this language.
Consider the word \(w = a^{n} \in L\), for some prime \(n \geq p\). By the
pumping lemma, we can write \(w = xyz\) such that \(|y| > 0\), \(|xy| \leq
p\), and \(xy^iz \in L\) for all \(i \geq 0\).
Let \(|xz| = m\) and \(|y| = k\), so that \(m + k = n\) and \(|xy^iz| = m +
ik = n + (i - 1)k\) for all \(i\). By the pumping lemma, every such length
must be prime. Choose \(i = n + 1\): the resulting length is \(n + nk =
n(1 + k)\). Since \(n \geq 2\) (it is prime) and \(1 + k \geq 2\) (as
\(|y| > 0\)), this length is a product of two factors greater than \(1\),
and hence not prime. This is a contradiction.
Thus, this language is not regular.
\end{solution}
\end{exercise}
\documentclass[a4paper]{article}
\input{../macro}
\ifdefined\ANSWERS
\if\ANSWERS1
\printanswers
\fi
\fi
\title{CS 320 \\ Computer Language Processing\\Exercise Set 2}
\author{}
\date{March 7, 2025}
\begin{document}
\maketitle
\input{ex/pumping}
\input{ex/cfg}
\end{document}
% Compiler Design 3.9
\begin{exercise}{}
Compute \(\nullable\), \(\first\), and \(\follow\) for the non-terminals \(A\)
and \(B\) in the following grammar:
%
\begin{align*}
A &::= BAa \\
A &::= \\
B &::= bBc \\
B &::= AA
\end{align*}
Remember to extend the grammar with an extra start production for the
computation of \(\follow\).
\begin{solution}
\begin{enumerate}
\item \(\nullable\): we get the constraints
\begin{gather*}
\nullable(A) = \nullable(BAa) \lor \nullable(\epsilon) \\
\nullable(B) = \nullable(bBc) \lor \nullable(AA)
\end{gather*}
We can solve these to get \(\nullable(A) = \nullable(B) = true\).
\item \(\first\): we get the constraints (given that both \(A\) and \(B\)
are nullable):
\begin{align*}
\first(A) &= \first(BAa) \cup \first(\epsilon) \\
&= \first(B) \cup \first(A) \cup \{a\} \\
\first(B) &= \first(bBc) \cup \first(AA) \\
&= \{b\} \cup \first(A) \cup \first(A) \\
&= \{b\} \cup \first(A)
\end{align*}
Starting from \(\first(A) = \first(B) = \emptyset\), we iteratively
compute the fixpoint to get \(\first(A) = \first(B) = \{a, b\}\).
\item \(\follow\): we add a production \(A' ::= A~\mathbf{EOF}\), and get
the constraints (in order of productions):
\begin{gather*}
\{\mathbf{EOF}\} \subseteq \follow(A) \\
\\
\first(A) \subseteq \follow(B) \\
\{a\} \subseteq \follow(B) \\
\{a\} \subseteq \follow(A) \\
\\
\{c\} \subseteq \follow(B) \\
\\
\first(A) \subseteq \follow(A) \\
\follow(B) \subseteq \follow(A)
\end{gather*}
Substituting the computed \(\first\) sets, and computing a fixpoint, we
get \(\follow(A) = \{a, b, c,\mathbf{EOF}\}\) and \(\follow(B) = \{a, b,
c\}\).
\end{enumerate}
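The fixpoint iteration for \(\nullable\) and \(\first\) can be mechanized. A
sketch in Python (uppercase symbols are non-terminals, lowercase ones are
terminals; the helper name \texttt{analyze} is ours):

```python
# Iterate the nullable/first constraints to a fixpoint for
#   A ::= BAa | ε,  B ::= bBc | AA
RULES = {"A": [["B", "A", "a"], []], "B": [["b", "B", "c"], ["A", "A"]]}

def analyze(rules):
    nullable = {n: False for n in rules}
    first = {n: set() for n in rules}
    changed = True
    while changed:
        changed = False
        for n, alts in rules.items():
            for alt in alts:
                # nullable(n) if every symbol of some alternative is nullable
                if not nullable[n] and all(s in rules and nullable[s] for s in alt):
                    nullable[n] = changed = True
                # first(n): scan symbols up to the first non-nullable one
                for s in alt:
                    add = first[s] if s in rules else {s}
                    if not add <= first[n]:
                        first[n] |= add
                        changed = True
                    if not (s in rules and nullable[s]):
                        break
    return nullable, first

nullable, first = analyze(RULES)
```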
\end{solution}
\end{exercise}
% Compiler design 3.11
\begin{exercise}{}
Given the following grammar for arithmetic expressions:
\begin{align*}
S &::= Exp~\mathbf{EOF} \\
Exp &::= Term~ Add \\
Add &::= +~ Term~ Add \\
Add &::= -~ Term~ Add \\
Add &::= \\
Term &::= Factor~ Mul \\
Mul &::= *~ Factor~ Mul \\
Mul &::= /~ Factor~ Mul \\
Mul &::= \\
Factor &::= \mathbf{num} \\
Factor &::= (Exp) \\
\end{align*}
\begin{enumerate}
\item Compute \(\nullable\), \(\first\), \(\follow\) for each of the
non-terminals in the grammar.
\item Check if the grammar is LL(1). If not, modify the grammar to make it
so.
\item Build the LL(1) parsing table for the grammar.
\item Using your parsing table, parse or attempt to parse (till error) the
following strings, assuming that \(\mathbf{num}\) matches any natural
number:
\begin{enumerate}
\item \((3 + 4) * 5 ~\mathbf{EOF}\)
\item \(2 + + ~\mathbf{EOF}\)
\item \(2 ~\mathbf{EOF}\)
\item \(2 * 3 + 4 ~\mathbf{EOF}\)
\item \(2 + 3 * 4 ~\mathbf{EOF}\)
\end{enumerate}
\end{enumerate}
\begin{solution}
\begin{enumerate}
\item We can compute the \(\nullable\), \(\first\), and \(\follow\) sets as:
\begin{enumerate}
\item \(\nullable\):
%
\begin{align*}
\nullable(S) &= false \\
\nullable(Exp) &= false \\
\nullable(Add) &= true \\
\nullable(Term) &= false \\
\nullable(Mul) &= true \\
\nullable(Factor) &= false
\end{align*}
\item \(\first\): we have constraints:
%
\begin{align*}
\first(S) &= \first(Exp) \\
\first(Exp) &= \first(Term) \\
\first(Add) &= \{+\} \cup \{-\} \cup \emptyset \\
\first(Term) &= \first(Factor) \\
\first(Mul) &= \{*\} \cup \{/\} \cup \emptyset \\
\first(Factor) &= \{\mathbf{num}\} \cup \{(\}
\end{align*}
%
which can be solved to get:
%
\begin{align*}
\first(S) &= \{\mathbf{num}, (\} \\
\first(Exp) &= \{\mathbf{num}, (\} \\
\first(Add) &= \{+, -\} \\
\first(Term) &= \{\mathbf{num}, (\} \\
\first(Mul) &= \{*, /\} \\
\first(Factor) &= \{\mathbf{num}, (\}
\end{align*}
\item \(\follow\): we have constraints (for each rule, except
empty/terminal rules):
\begin{multicols}{2}
\allowdisplaybreaks
\begin{align*}
\{\mathbf{EOF}\} &\subseteq \follow(Exp) \\
&\\
\first(Add) &\subseteq \follow(Term) \\
\follow(Exp) &\subseteq \follow(Term) \\
\follow(Exp) &\subseteq \follow(Add) \\
&\\
\first(Add) &\subseteq \follow(Term) \\
\follow(Add) &\subseteq \follow(Term) \\
&\\
\first(Add) &\subseteq \follow(Term) \\
\follow(Add) &\subseteq \follow(Term) \\
&\\
\first(Mul) &\subseteq \follow(Factor) \\
\follow(Term) &\subseteq \follow(Factor) \\
\follow(Term) &\subseteq \follow(Mul) \\
&\\
\first(Mul) &\subseteq \follow(Factor) \\
\follow(Mul) &\subseteq \follow(Factor) \\
&\\
\first(Mul) &\subseteq \follow(Factor) \\
\follow(Mul) &\subseteq \follow(Factor) \\
&\\
\{)\} &\subseteq \follow(Exp) \\
\end{align*}
\end{multicols}
The fixpoint can again be computed to get:
\begin{align*}
\follow(S) &= \{\} \\
\follow(Exp) &= \{), \mathbf{EOF}\} \\
\follow(Add) &= \{), \mathbf{EOF}\} \\
\follow(Term) &= \{+, -, ), \mathbf{EOF}\} \\
\follow(Mul) &= \{+, -, ), \mathbf{EOF}\} \\
\follow(Factor) &= \{+, -, *, /, ), \mathbf{EOF}\}
\end{align*}
\end{enumerate}
\item The grammar is LL(1): there are no conflicts, as demonstrated by the
parsing table below, in which every cell contains at most one rule.
\item LL(1) parsing table:
\begin{center}
\begin{tabular}{c|c|c|c|c|c|c|c|c}
& \(\mathbf{num}\) & \(+\) & \(-\) & \(*\) & \(/\) & \((\) & \()\) & \(\mathbf{EOF}\) \\
\hline
\(S\) & 1 & & & & & 1 & &\\
\(Exp\) & 1 & & & & & 1 & &\\
\(Add\) & & 1 & 2 & & & & 3 & 3 \\
\(Term\) & 1 & & & & & 1 & & \\
\(Mul\) & & 3 & 3 & 1 & 2 & & 3 & 3 \\
\(Factor\) & 1 & & & & & 2 & & \\
\end{tabular}
\end{center}
\item Parsing the strings:
\begin{enumerate}
\item \((3 + 4) * 5 ~\mathbf{EOF}\) \checkmark
\item \(2 + + ~\mathbf{EOF}\) --- fails on the second \(+\). The
corresponding error cell in the parsing table is \((Term, +)\).
\item \(2 ~\mathbf{EOF}\) \checkmark
\item \(2 * 3 + 4 ~\mathbf{EOF}\) \checkmark
\item \(2 + 3 * 4 ~\mathbf{EOF}\) \checkmark
\end{enumerate}
Example step-by-step LL(1) parsing state for \(2 * 3 + 4\):
\begin{center}
\begin{tabular}{c c c}
Lookahead & Stack & Next Rule \\
\hline
\(2\) & \(S\) & \(S ::= Exp ~\mathbf{EOF}\)\\
\(2\) & \(Exp ~ \mathbf{EOF}\) & \(Exp ::= Term~Add\)\\
\(2\) & \(Term ~ Add ~ \mathbf{EOF}\) & \(Term ::= Factor~Mul\)\\
\(2\) & \(Factor ~ Mul ~ Add ~ \mathbf{EOF}\) & \(Factor ::= \mathbf{num}\)\\
\(2\) & \(\mathbf{num} ~ Mul ~ Add ~ \mathbf{EOF}\) & \(match(\mathbf{num})\)\\
\(*\) & \(Mul ~ Add ~ \mathbf{EOF}\) & \(Mul ::= *~Factor~Mul\)\\
\(*\) & \(* ~Factor ~ Mul ~ Add ~ \mathbf{EOF}\) & \(match(*)\)\\
\(3\) & \(Factor ~ Mul ~ Add ~ \mathbf{EOF}\) & \(Factor ::= \mathbf{num}\)\\
\(3\) & \(\mathbf{num} ~ Mul ~ Add ~ \mathbf{EOF}\) & \(match(\mathbf{num})\)\\
\(+\) & \(Mul ~ Add ~ \mathbf{EOF}\) & \(Mul ::=\)\\
\(+\) & \(Add ~ \mathbf{EOF}\) & \(Add ::= +~Term~Add\)\\
\(+\) & \(+ ~Term ~Add ~ \mathbf{EOF}\) & \(match(+)\)\\
\(4\) & \(Term ~Add ~ \mathbf{EOF}\) & \(Term ::= Factor~Mul\)\\
\(4\) & \(Factor ~Mul ~Add ~ \mathbf{EOF}\) & \(Factor ::= \mathbf{num}\)\\
\(4\) & \(\mathbf{num} ~Mul ~Add ~ \mathbf{EOF}\) & \(match(\mathbf{num})\)\\
\(\mathbf{EOF}\) & \(Mul ~Add ~ \mathbf{EOF}\) & \(Mul ::= \)\\
\(\mathbf{EOF}\) & \(Add ~ \mathbf{EOF}\) & \(Add ::= \)\\
\(\mathbf{EOF}\) & \(\mathbf{EOF}\) & \(match(\mathbf{EOF})\)\\
\end{tabular}
\end{center}
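The table-driven parsing loop itself can be transcribed directly. A sketch in
Python (tokens are written as strings, with \texttt{num} and \texttt{EOF} as
in the table; the rule numbers follow the order of productions per
non-terminal, and the helper name \texttt{parses} is ours):

```python
# LL(1) table-driven recognizer for the arithmetic grammar above.
RULES = {
    ("S", 1):      ["Exp", "EOF"],
    ("Exp", 1):    ["Term", "Add"],
    ("Add", 1):    ["+", "Term", "Add"],
    ("Add", 2):    ["-", "Term", "Add"],
    ("Add", 3):    [],
    ("Term", 1):   ["Factor", "Mul"],
    ("Mul", 1):    ["*", "Factor", "Mul"],
    ("Mul", 2):    ["/", "Factor", "Mul"],
    ("Mul", 3):    [],
    ("Factor", 1): ["num"],
    ("Factor", 2): ["(", "Exp", ")"],
}
TABLE = {
    "S":      {"num": 1, "(": 1},
    "Exp":    {"num": 1, "(": 1},
    "Add":    {"+": 1, "-": 2, ")": 3, "EOF": 3},
    "Term":   {"num": 1, "(": 1},
    "Mul":    {"*": 1, "/": 2, "+": 3, "-": 3, ")": 3, "EOF": 3},
    "Factor": {"num": 1, "(": 2},
}

def parses(tokens):
    stack, pos = ["S"], 0
    while stack:
        top = stack.pop(0)  # leftmost symbol of the predicted form
        look = tokens[pos] if pos < len(tokens) else None
        if top in TABLE:                   # non-terminal: consult the table
            rule = TABLE[top].get(look)
            if rule is None:
                return False               # error cell
            stack = RULES[(top, rule)] + stack
        elif top == look:                  # terminal: match the lookahead
            pos += 1
        else:
            return False
    return pos == len(tokens)
```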
\end{enumerate}
\end{solution}
\end{exercise}
\begin{exercise}{}
If \(L\) is a regular language, then the set of prefixes of words in \(L\) is
also a regular language. Given this fact, from a regular expression for \(L\),
we should be able to obtain a regular expression for the set of all prefixes
of words in \(L\) as well.
We want to do this with a function \(\prefixes\) that is recursive over the
structure of the regular expression for \(L\), i.e. of the form:
%
\begin{align*}
\prefixes(\epsilon) &= \epsilon \\
\prefixes(a) &= a \mid \epsilon \\
\prefixes(r \mid s) &= \prefixes(r) \mid \prefixes(s) \\
\prefixes(r \cdot s) &= \ldots \\
\prefixes(r^*) &= \ldots \\
\prefixes(r^+) &= \ldots
\end{align*}
\begin{enumerate}
\item Complete the definition of \(\prefixes\) above by filling in the
missing cases.
\item Use this definition to find:
\begin{enumerate}
\item \(\prefixes(ab^*c)\)
\item \(\prefixes((a \mid bc)^*)\)
\end{enumerate}
\end{enumerate}
\begin{solution}
The computation for \(\prefixes(\cdot)\) is similar to the computation of
\(\first(\cdot)\) for grammars.
\begin{enumerate}
\item The missing cases:
\begin{enumerate}
\item \(\prefixes(r \cdot s) = \prefixes(r) \mid r \cdot \prefixes(s)\).
Either we have read \(r\) partially, or we have read all of \(r\), and a
part of \(s\).
\item \(\prefixes(r^*) = r^*\cdot\prefixes(r)\). We can
consider \(r^* = \epsilon \mid r \mid rr \mid \ldots\), and apply the
rules for union and concatenation. Intuitively, if the word has \(n \ge
0\) instances of \(r\), we can read \(m < n\) instances of \(r\), and
then a prefix of the next instance of \(r\).
\item \(\prefixes(r^+) = r^* \cdot \prefixes(r)\). Same as
previous. Why does the empty case still appear?
\end{enumerate}
\item The prefix computations are:
\begin{enumerate}
\item \(\prefixes(ab^*c) = \epsilon \mid a \mid ab^*(b \mid c \mid \epsilon)\). Computation:
\begin{align*}
\prefixes(ab^*c) &= \prefixes(a) \mid a\cdot\prefixes(b^*c) & [\text{concatenation}]\\
&= (a \mid \epsilon) \mid a\cdot\prefixes(b^*c) &[a]\\
&= (a \mid \epsilon) \mid a\cdot(\prefixes(b^*) \mid b^*\prefixes(c)) &[\text{concatenation}]\\
&= (a \mid \epsilon) \mid a\cdot(\prefixes(b^*) \mid b^*(c \mid \epsilon)) &[c]\\
&= (a \mid \epsilon) \mid a\cdot(b^*\prefixes(b) \mid b^*(c \mid \epsilon)) &[\text{star}]\\
&= (a \mid \epsilon) \mid a\cdot(b^*(b \mid \epsilon) \mid b^*(c \mid \epsilon)) &[b]\\
&= (a \mid \epsilon) \mid a\cdot(b^*(b \mid c \mid \epsilon)) &[\text{rewrite}]\\
&= \epsilon \mid a \mid a\cdot(b^*(b \mid c \mid \epsilon)) & [\text{rewrite}]\\
\end{align*}
\item \(\prefixes((a \mid bc)^*) = (a \mid bc)^*(\epsilon \mid a \mid b \mid bc)\).
\end{enumerate}
\end{enumerate}
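The recursive definition of \(\prefixes\) can be transcribed directly over a
small regex AST and cross-checked against a brute-force notion of prefix. A
sketch in Python (the AST encoding and the helpers \texttt{prefixes} and
\texttt{to\_re} are ours):

```python
import re
from itertools import product

# Regexes as nested tuples: ("eps",), ("chr", c), ("alt", r, s),
# ("cat", r, s), ("star", r).
def prefixes(r):
    tag = r[0]
    if tag == "eps":
        return ("eps",)
    if tag == "chr":
        return ("alt", r, ("eps",))
    if tag == "alt":
        return ("alt", prefixes(r[1]), prefixes(r[2]))
    if tag == "cat":
        return ("alt", prefixes(r[1]), ("cat", r[1], prefixes(r[2])))
    return ("cat", ("star", r[1]), prefixes(r[1]))  # star

def to_re(r):  # compile the AST to a Python regex, for testing only
    tag = r[0]
    if tag == "eps":
        return "()"
    if tag == "chr":
        return r[1]
    if tag == "alt":
        return f"(?:{to_re(r[1])}|{to_re(r[2])})"
    if tag == "cat":
        return to_re(r[1]) + to_re(r[2])
    return f"(?:{to_re(r[1])})*"

# ab*c as an AST
abc = ("cat", ("chr", "a"), ("cat", ("star", ("chr", "b")), ("chr", "c")))
full = re.compile(to_re(abc) + r"\Z")
pref = re.compile(to_re(prefixes(abc)) + r"\Z")

# w is a prefix of some word in L(ab*c) iff some suffix completes it.
strings = ["".join(p) for n in range(5) for p in product("abc", repeat=n)]
for w in strings:
    is_prefix = any(full.match(w + s) for s in strings)
    assert bool(pref.match(w)) == is_prefix
```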
\end{solution}
\end{exercise}
% this language is not LL 1 actually, I think
% \begin{exercise}{}
% Consider the following grammar of \(\mathbf{if}-\mathbf{then}-\mathbf{else}\) expressions with assignments:
% %
% \begin{align*}
% stmt &::= \mathbf{if} ~id = id~ \mathbf{then} ~stmt ~optStmt \\
% &::= \{ stmt^* \} \\
% &::= id = id; \\
% optStmt &::= \epsilon \mid \mathbf{else} ~stmt \\
% \end{align*}
% \begin{enumerate}
% \item Show that the grammar is ambiguous.
% \item Is the grammar LL(1)?
% \end{enumerate}
% \end{exercise}
\begin{exercise}{}
Argue that the following grammar is \emph{not} LL(1). Produce an equivalent
LL(1) grammar.
\begin{equation*}
E ::= \mathbf{num} + E \mid \mathbf{num} - E \mid \mathbf{num}
\end{equation*}
\begin{solution}
The grammar is clearly not LL(1): on seeing a token \(\mathbf{num}\), we
cannot decide whether to continue parsing as \(\mathbf{num} + E\),
\(\mathbf{num} - E\), or \(\mathbf{num}\) alone.
The problem is the common prefix shared by the rules. We can factor it out
by introducing a new non-terminal \(T\). This transformation is known as
\emph{left factorization}.
\begin{align*}
E &::= \mathbf{num} ~T \\
T &::= + E \mid - E \mid \epsilon
\end{align*}
% without changing the terms or the overall "structure" of the grammar, we
% have logically partitioned it to fit within our parsing schema.
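A recursive-descent parser makes the LL(1) property of the factored grammar concrete: a single token of lookahead inside \(T\) selects the alternative. A minimal Python sketch (illustrative, not part of the handout; representing tokens as the strings \texttt{num}, \texttt{+}, \texttt{-} is an assumption):

```python
# Illustrative recursive-descent parser for the left-factored grammar
#   E ::= num T        T ::= + E | - E | eps
# Tokens are assumed to be the strings "num", "+", "-".

def parse_E(tokens, i=0):
    """Parse an E starting at position i; return the position after it."""
    if i >= len(tokens) or tokens[i] != "num":
        raise SyntaxError(f"expected num at position {i}")
    return parse_T(tokens, i + 1)

def parse_T(tokens, i):
    # One token of lookahead picks the alternative: this is the LL(1) step.
    if i < len(tokens) and tokens[i] in ("+", "-"):
        return parse_E(tokens, i + 1)
    return i  # T ::= eps

def accepts(tokens):
    try:
        return parse_E(tokens) == len(tokens)
    except SyntaxError:
        return False
```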
\end{solution}
\end{exercise}
\begin{exercise}{}
Consider the following grammar:
\begin{equation*}
S ::= S(S) \mid S[S] \mid () \mid [\;]
\end{equation*}
Check whether the same transformation as the previous case can be applied to
produce an LL(1) grammar. If not, argue why, and suggest a different
transformation.
\begin{solution}
Applying left factorization to the grammar, we get:
\begin{align*}
S &::= S ~T \mid () \mid [\;] \\
T &::= (S) \mid [S]
\end{align*}
This is still not LL(1): on reading a token ``\((\)'', we cannot decide
whether it begins the base case \(()\) of the whole expression, or the
base case of an \(S\) that is followed by a \(T\).
The deeper problem is that the grammar is left-recursive: a
recursive-descent parser for this grammar would call the procedure for
\(S\) again as its first action, and thus loop forever on the first rule.
This is a consequence of our parsers being top-down, left-to-right. We
can fix this by \emph{moving} the recursion to the right, a
transformation generally called \emph{left recursion elimination}.
The transformed grammar (the general method is explained below):
\begin{align*}
S &::= ()S' \mid [\;]S' \\
S' &::= (S)S' \mid [S]S' \mid \epsilon
\end{align*}
To eliminate left-recursion in general, consider a non-terminal \(A ::=
A\alpha \mid \beta\), where \(\beta\) does not start with \(A\) (not
left-recursive). We can remove the left recursion by introducing a new
non-terminal, \(A'\), such that:
\begin{align*}
A &::= \beta A' \\
A' &::= \alpha A' \mid \epsilon
\end{align*}
i.e., for the left-recursive rule \(A\alpha\), we instead attempt to parse
an \(\alpha\) followed by the rest. In exchange, the base case \(\beta\) now
expects an \(A'\) to follow it.
%
Note that \(\beta\) can be empty as well.
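The rule above can be written as a small grammar transformation. The Python sketch below (illustrative, not part of the handout; representing each alternative as a tuple of symbols is an assumption) handles only \emph{direct} left recursion:

```python
# Illustrative sketch: direct left-recursion elimination.
#   A ::= A a1 | ... | A ak | b1 | ... | bm   becomes
#   A ::= b1 A' | ... | bm A'     A' ::= a1 A' | ... | ak A' | eps
# Each alternative is a tuple of symbols; () stands for epsilon.

def eliminate_direct_left_recursion(nt, alternatives):
    # Split alternatives into left-recursive ones (their alpha parts)
    # and the remaining beta alternatives.
    rec   = [alt[1:] for alt in alternatives if alt and alt[0] == nt]
    other = [alt     for alt in alternatives if not (alt and alt[0] == nt)]
    ntp = nt + "'"                       # fresh non-terminal A'
    return {
        nt:  [beta + (ntp,) for beta in other],
        ntp: [alpha + (ntp,) for alpha in rec] + [()],
    }
```

Applied to \(S ::= S(S) \mid S[S] \mid () \mid [\;]\), it produces exactly the transformed grammar shown above.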
Intuitively, we are shifting the direction in which we look for instances of
\(A\). Consider a partial derivation starting from \(\beta \alpha \alpha
\alpha\). The original version of the grammar would complete the parsing as:
\begin{center}
\begin{forest}
[\(A\)
[\(A\)
[\(A\)
[\(A\)
[\(\beta\)]
]
[\(\alpha\)]
]
[\(\alpha\)]
]
[\(\alpha\)]
]
\end{forest}
\end{center}
but with the new grammar, we parse it as:
\begin{center}
\begin{forest}
[\(A\)
[\(\beta\)]
[\(A'\)
[\(\alpha\)]
[\(A'\)
[\(\alpha\)]
[\(A'\)
[\(\alpha\)]
[\(A'\)
[\(\epsilon\)]
]
]
]
]
]
\end{forest}
\end{center}
There are two main pitfalls to remember with left-recursion elimination:
\begin{enumerate}
\item it may need to be applied several times until the grammar stops
changing, as one pass may introduce new (indirectly) left-recursive
rules (consider \(A ::= AA\alpha \mid \epsilon\)).
\item it may require \emph{inlining} some non-terminals when the left
recursion is \emph{indirect}. For example, consider \(A ::= B\alpha\) and
\(B ::= A\beta\): neither rule is directly left-recursive, but inlining
\(B\) yields \(A ::= A\beta\alpha\), to which the elimination applies.
\end{enumerate}
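A recursive-descent parser for the transformed grammar now terminates, consuming one character of lookahead at each decision point. An illustrative Python sketch (not part of the handout; parsing the input as a plain string of bracket characters is an assumption):

```python
# Illustrative recursive-descent parser for the transformed grammar
#   S ::= () S' | [] S'       S' ::= ( S ) S' | [ S ] S' | eps

def expect(s, i, c):
    if i >= len(s) or s[i] != c:
        raise SyntaxError(f"expected {c!r} at position {i}")
    return i + 1

def parse_S(s, i=0):
    # The lookahead character picks the rule; each rule then consumes input.
    if i < len(s) and s[i] == "(":
        i = expect(s, i + 1, ")")
    elif i < len(s) and s[i] == "[":
        i = expect(s, i + 1, "]")
    else:
        raise SyntaxError(f"expected '(' or '[' at position {i}")
    return parse_Sp(s, i)

def parse_Sp(s, i):
    if i < len(s) and s[i] == "(":
        i = expect(s, parse_S(s, i + 1), ")")
        return parse_Sp(s, i)
    if i < len(s) and s[i] == "[":
        i = expect(s, parse_S(s, i + 1), "]")
        return parse_Sp(s, i)
    return i  # S' ::= eps; the next character belongs to the caller

def accepts(s):
    try:
        return parse_S(s) == len(s)
    except SyntaxError:
        return False
```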
\end{solution}
\end{exercise}
\documentclass[a4paper]{article}
\input{../macro}
\ifdefined\ANSWERS
\if\ANSWERS1
\printanswers
\fi
\fi
\DeclareMathOperator{\prefixes}{prefixes}
\DeclareMathOperator{\first}{first}
\DeclareMathOperator{\nullable}{nullable}
\DeclareMathOperator{\follow}{follow}
\title{CS 320 \\ Computer Language Processing\\Exercise Set 3}
\author{}
\date{March 19, 2025}
\begin{document}
\maketitle
% prefixes of regular expressions
\input{ex/prefix}
% compute nullable follow first for CFGs
\input{ex/compute}
% build ll1 parsing table, parse or attempt to parse some strings
\input{ex/table}
\end{document}
\begin{exercise}{}
For each of the following pairs of grammars, show that they are equivalent by
identifying them with inductive relations, and proving that the inductive
relations contain the same elements.
\begin{enumerate}
\item
\(A_1 : S ::= S + S \mid \num \) \\
\(A_2 : R ::= \num ~R' \text{ and } R' ::= + R~ R' \mid \epsilon\)
\item
\(B_1 : S ::= S(S)S \mid \epsilon \) \\
\(B_2 : R ::= RR \mid (R) \mid \epsilon\)
\end{enumerate}
\begin{solution}
\begin{enumerate}
\item \(A_2\) is the result of left-recursion elimination on \(A_1\).
First, expressing them as inductive relations, with rules named as on the
right:
%
\addtolength{\jot}{1ex}
\begin{gather*}
\AxiomC{\phantom{\(w_1 \in S\)}}
\RightLabel{\(S_{num}\)}
\UnaryInfC{\(\num \in S\)}
\DisplayProof
\quad
\AxiomC{\(w_1 \in S\)}
\AxiomC{\(w_2 \in S\)}
\RightLabel{\(S_+\)}
\BinaryInfC{\(w_1 + w_2 \in S\)}
\DisplayProof \\
%
\AxiomC{\(w \in S\)}
\RightLabel{\(A_{1}^{start}\)}
\UnaryInfC{\(w \in A_1\)}
\DisplayProof \\
%
\AxiomC{\(w \in R'\)}
\RightLabel{\(R_{num}\)}
\UnaryInfC{\(\num~ w\in R\)}
\DisplayProof \\
%
\AxiomC{\(w \in R\)}
\AxiomC{\(w' \in R'\)}
\RightLabel{\(R'_{+}\)}
\BinaryInfC{\(+w ~ w' \in R'\)}
\DisplayProof
\quad
\AxiomC{\phantom{\(w' \in R'\)}}
\RightLabel{\(R'_{\epsilon}\)}
\UnaryInfC{\(\epsilon \in R'\)}
\DisplayProof \\
%
\AxiomC{\(w \in R\)}
\RightLabel{\(A_{2}^{start}\)}
\UnaryInfC{\(w \in A_2\)}
\DisplayProof
\end{gather*}
We must show that for any word \(w\), \(w \in A_1\) if and only if \(w \in
A_2\). For this, it must be the case that there is a derivation tree for
\(w \in A_1\) (equivalently, \(w \in S\)) if and only if there is a
derivation tree for \(w \in A_2\) (equivalently, \(w \in R\)) according to
the inference rules above.
\begin{enumerate}
\item \(w \in S \implies w \in R\): we induct on the depth of the
derivation tree.
\begin{itemize}
\item Base case: derivation tree of depth 1. The tree must be
\begin{gather*}
\AxiomC{}
\RightLabel{\(S_{num}\)}
\UnaryInfC{\(\num \in S\)}
\DisplayProof
\end{gather*}
We can show that there is a corresponding derivation tree for \(w \in R\):
\begin{gather*}
\AxiomC{}
\RightLabel{\(R'_{\epsilon}\)}
\UnaryInfC{\(\epsilon \in R'\)}
\RightLabel{\(R_{num}\)}
\UnaryInfC{\(\num \in R\)}
\DisplayProof
\end{gather*}
\item Inductive case: derivation tree of depth \(n+1\), given that for
every derivation of depth \( \le n\) of \(w' \in S\) for any \(w'\), there
is a corresponding derivation of \(w' \in R\). The last rule applied
in the derivation must be \(S_+\):
\begin{gather*}
\AxiomC{\ldots}
\UnaryInfC{\(w_1 \in S\)}
\AxiomC{\ldots}
\UnaryInfC{\(w_2 \in S\)}
\RightLabel{\(S_+\)}
\BinaryInfC{\(w_1 + w_2 \in S\)}
\DisplayProof
\end{gather*}
By the inductive hypothesis, since \(w_1 \in S\) and \(w_2 \in S\)
have a derivation tree of smaller depth, there are derivation trees
for \(w_1 \in R\) and \(w_2 \in R\). In particular, the derivation for
\(w_1 \in R\) must end with the rule \(R_{num}\) (the only rule
concluding membership in \(R\)), so there must be a derivation tree for
\(\num ~w_1' \in R\) with \(w_1' \in R'\) and \(\num~w_1' = w_1\). We
have the following pieces:
\begin{gather*}
\AxiomC{\ldots}
\UnaryInfC{\(w_1' \in R'\)}
\RightLabel{\(R_{num}\)}
\UnaryInfC{\(\num ~w_1' \in R\)}
\DisplayProof
\quad
\AxiomC{\ldots}
\UnaryInfC{\(w_2 \in R\)}
\DisplayProof
\end{gather*}
To show that \(w_1 + w_2 \in R\), i.e. \(\num ~w_1' + w_2 \in R\), we
must first show that \(w_1' + w_2 \in R'\), as required by the rule
\(R_{num}\). Note that words in \(R'\) are of the form \((+ \num)^*\).
We will prove separately, at the end, that \(w_1' + w_2 \in R'\)
whenever \(w_1' \in R'\) and \(w_2 \in R\) (\(R'_{Lemma}\)). Assuming
this, we can construct the derivation tree for \(w_1 + w_2 \in R\):
\begin{gather*}
\AxiomC{\ldots}
\UnaryInfC{\(w_1' \in R'\)}
\AxiomC{\ldots}
\UnaryInfC{\(w_2 \in R\)}
\RightLabel{\(R'_{Lemma}\)}
\BinaryInfC{\(w_1' + w_2 \in R'\)}
\RightLabel{\(R_{num}\)}
\UnaryInfC{\(\num ~w_1' + w_2 \in R\)}
\DisplayProof
\end{gather*}
\(\num ~w_1' + w_2 = w_1 + w_2 = w\), as required.
Finally, we will show the required lemma. We will prove a stronger
property \(R'_{concat}\) first, that for any pair of words \(w_1, w_2
\in R'\), \(w_1 ~w_2 \in R'\) as well. We induct on the derivation of
\(w_1 \in R'\).
Base case: derivation ends with \(R'_\epsilon\). Then \(w_1 =
\epsilon\), and \(w_1 ~w_2 = w_2 \in R'\) by assumption.
Inductive case: derivation ends with \(R'_+\). Then \(w_1 = + v v'\)
for some \(v \in R\) and \(v' \in R'\):
\begin{gather*}
\AxiomC{\ldots}
\UnaryInfC{\(v \in R\)}
\AxiomC{\ldots}
\UnaryInfC{\(v' \in R'\)}
\RightLabel{\(R'_+\)}
\BinaryInfC{\(+ v ~v' \in R'\)}
\DisplayProof
\end{gather*}
Since \(v' \in R'\) has a smaller derivation tree than \(w_1\), by the
inductive hypothesis, we can prove that \(v'~w_2 \in R'\). We get:
\begin{gather*}
\AxiomC{\ldots}
\UnaryInfC{\(v \in R\)}
\AxiomC{\ldots}
\UnaryInfC{\(v' \in R'\)}
\AxiomC{\ldots}
\UnaryInfC{\(w_2 \in R'\)}
\RightLabel{\(R'_{concat}\)}
\BinaryInfC{\(v' ~w_2 \in R'\)}
\RightLabel{\(R'_+\)}
\BinaryInfC{\(+ v ~v' ~w_2 \in R'\)}
\DisplayProof
\end{gather*}
So, \(R'_{concat}\) is proven. We can now show \(R'_{Lemma}\), i.e.
\(w_1' + w_2 \in R'\) if \(w_1' \in R'\) and \(w_2 \in R\), as:
\begin{gather*}
\AxiomC{\ldots}
\UnaryInfC{\(w_1' \in R'\)}
\AxiomC{\ldots}
\UnaryInfC{\(w_2 \in R\)}
\AxiomC{}
\RightLabel{\(R'_\epsilon\)}
\UnaryInfC{\(\epsilon \in R'\)}
\RightLabel{\(R'_+\)}
\BinaryInfC{\(+ w_2 \in R'\)}
\RightLabel{\(R'_{concat}\)}
\BinaryInfC{\(w_1' + w_2 \in R'\)}
\DisplayProof
\end{gather*}
Thus, the proof is complete.
\end{itemize}
\item \(w \in R \implies w \in S\): we induct on the depth of the
derivation tree for \(w \in R\). This direction is simpler than the other,
but the general method is similar.
\begin{itemize}
\item Base case: derivation tree of depth 2 (minimum). The tree must be
\begin{gather*}
\AxiomC{}
\RightLabel{\(R'_{\epsilon}\)}
\UnaryInfC{\(\epsilon \in R'\)}
\RightLabel{\(R_{num}\)}
\UnaryInfC{\(\num \in R\)}
\DisplayProof
\end{gather*}
We have the corresponding derivation tree for \(w \in S\):
\begin{gather*}
\AxiomC{}
\RightLabel{\(S_{num}\)}
\UnaryInfC{\(\num \in S\)}
\DisplayProof
\end{gather*}
\item Inductive case: derivation tree of depth \(n+1\), given that for
every derivation of depth \(\le n\) of \(w' \in R\) for any \(w'\),
there is a corresponding derivation of \(w' \in S\). The last rules
applied must be \(R_{num}\) and \(R'_{+}\) (otherwise the derivation
would be of the base case):
\begin{gather*}
\AxiomC{\ldots}
\UnaryInfC{\(w_1 \in R\)}
\AxiomC{\ldots}
\UnaryInfC{\(w_2 \in R'\)}
\RightLabel{\(R'_{+}\)}
\BinaryInfC{\(+ w_1 ~w_2 \in R'\)}
\RightLabel{\(R_{num}\)}
\UnaryInfC{\(\num + w_1~ w_2 \in R\)}
\DisplayProof
\end{gather*}
%
where \(w = \num + w_1 ~w_2\). However, we are somewhat stuck here, as
we have no way to relate \(R'\) and \(S\). We will separately show that
if \(+w' \in R'\), then there is a derivation of \(w' \in S\) (lemma
\(R'_{S}\)). This will allow us to complete the proof:
\begin{gather*}
\AxiomC{}
\RightLabel{\(S_{num}\)}
\UnaryInfC{\(\num \in S\)}
\AxiomC{\ldots}
\UnaryInfC{\(+w_1 ~w_2 \in R'\)}
\RightLabel{\(R'_{S}\)}
\UnaryInfC{\(w_1 ~w_2 \in S\)}
\RightLabel{\(S_{+}\)}
\BinaryInfC{\(\num + w_1 ~w_2 \in S\)}
\DisplayProof
\end{gather*}
The proof of the lemma \(R'_S\) is by induction again, and not shown
here. This completes the original proof.
\end{itemize}
\end{enumerate}
\item The argument is similar to Exercise Set 2 Problem 4 (the same pair
of grammars). \(B_1 \subseteq B_2\) as relations can be seen by
producing a derivation tree in \(B_2\) for each possible case in
\(B_1\). For the other direction, \(B_2 \subseteq B_1\), it is
convenient to first prove that \(B_1\) is closed under concatenation,
i.e., if \(w_1, w_2 \in B_1\) then there is a derivation tree for
\(w_1 ~ w_2 \in B_1\).
\end{enumerate}
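As a sanity check (not a substitute for the inductive proof), the equivalence of \(A_1\) and \(A_2\) can be tested by enumerating both languages up to a bounded length. An illustrative Python sketch (not part of the handout), writing \texttt{n} for the token \(\num\):

```python
# Illustrative bounded-language comparison for
#   A1:  S ::= S + S | num        A2:  R ::= num R',  R' ::= + R R' | eps
# "n" stands for the token num; words are strings over {"n", "+"}.

def lang_S(n):
    """Words of A1 of length <= n, as a fixpoint of S ::= S + S | num."""
    out = {"n"}
    while True:
        nxt = out | {a + "+" + b for a in out for b in out
                     if len(a) + len(b) + 1 <= n}
        if nxt == out:
            return out
        out = nxt

def lang_R(n):
    """Words of A2 of length <= n, as a mutual fixpoint of R and R'."""
    R, Rp = set(), {""}
    while True:
        R2  = {"n" + w for w in Rp if len(w) + 1 <= n}
        Rp2 = Rp | {"+" + a + b for a in R2 for b in Rp
                    if len(a) + len(b) + 1 <= n}
        if (R2, Rp2) == (R, Rp):
            return R
        R, Rp = R2, Rp2
```

Both fixpoints converge because the sets are bounded by the length cutoff; comparing them for a few cutoffs agrees with the proof above.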
\end{solution}
\end{exercise}