syntax include

Sunday, June 27, 2010

Abstract Parsing

My first prototype of an abstract parser seems to be working. It uses a very simple abstract stack representation that is very imprecise and needs to be replaced eventually – but it serves as great starting point. After all, it does give a sound approximation of the parse tree!

Suppose we are using this simple grammar for arithmetic expressions:
Tokens:
identifier = ['a'..'z']+;
integer_literal = ['0'..'9']+;
plus = '+';
minus = '-';
star = '*';
div = '/';
lparen = '(';
rparen = ')';

Productions:
exp = plusexp;
plusexp = plusexp plus mulexp | plusexp minus mulexp | mulexp;
mulexp = mulexp star unexp | mulexp div unexp | unexp;
unexp = minus unexp | primary;
primary = identifier | lparen exp rparen | integer_literal;

Now let the possible inputs to our language be given by the regular expression: a * (5|(6 + b)) (where the bold-faced symbols are control characters).
Tokenizing the expression with the abstract lexer gives us the automaton:

Feeding that to the abstract parser now yields the – rather inaccurate – graph of parse trees:
Okay so you probably can't make out the text on the nodes due the image being kinda small here on BlogSpot – but anyway, the gray nodes are tokens, the white boxes are productions, and the diamond-shaped nodes are choices.

Unfortunately the abstract stack representation is much too weak to create a precise syntax graph. The graph above can unfold to the correct syntax tree, but also the parse trees corresponding to "a", "6 + b", "6 + a", and just "6". This will obviously do us no good in the long run.

Once I manage to fix the precision issues, there is still the hurdle of performing the abstract equivalent of converting a parse tree to an abstract syntax tree. What can I call that process? Abstract, abstract syntax tree construction? Yikes.

No comments:

Post a Comment