syntax include

Monday, July 5, 2010

AST Graph and Abstract Type-Checking

Translation from parse graphs to AST graphs (or so I call them) proved to be more of an engineering challenge than a clever use of formal methods. The only reason compilers translate the parse tree to an AST is because the AST is more convenient to code against. Thus, the sole purpose to translating a parse graph to an AST graph must be to simplify the coding of analyses such as type-checking and side-effect analysis.

SableCC generates a ton of classes to represent the AST in a type-strong fashion, and so I hacked away at the rather monstrous SableCC source code to make it generate classes that represent a graph instead of a tree. In addition to being a graph, they needs to support choices and sequences. At first, I made one ChoiceNode class that was generic, such that ChoiceNode<PExp> was a choice node that was guaranteed to unfold to a node of some subtype of PExp. It became apparent, however, that Java's type system is simply not strong enough to strongly represent this type of structure; or at least not in a way that would gain me any convenience. The amount  of unsafe typecasts that were added because of the generics vastly outweighed the few they might spare me.

I then went a different route. Avoiding generic types, I generated separate classes to represent different types of choice nodes, so the class that unfolds to a PExp node is distinct from the class that unfolds to, say, a PStatement node. This is quite convenient, because any analysis will certainly have to treat such nodes very differently, and any type parameter would be invisible at runtime. A sequence of nodes is strictly distinguished from just a node, and as such, there are also separate choice nodes for PExp sequences, PStatement sequences, and so on. Sequences can be created using EmptySequence, SingletonSequence, and Concat nodes, of which there also exist one for each production in the AST.

These classes validate at runtime that their children have the correct type, and although short of being statically typed, the rigid structure of the AST graph has proven terrific for analysis. I believe any synthesized property in the AST can be approximated with a monotone backward analysis on the AST graph. Inherited properties can likewise be approximated with a montone forward analysis.

I managed to implement abstract type-checking for a simple expression-based language with integers and booleans in just 250 lines of Java code! For example, suppose we want to analyse the input:

if foo (<|+) bar then 34 else 45
The AST graph we get using abstract lexing, abstract parsing and AST graph transformation is:
The input can be two different strings, one of which contains a type error, since foo+bar is not a boolean as required by the if-then-else construct. And my type-analysis actually detects the type error! The error message it can produce at the moment is less than informative however:

Invalid type [INTEGER] in node asf.mini.node.PExp$Choice@6ef2ea42

Okay, so there is room for improvement there. I like to bash compilers of certain languages of producing useless error messages, and wouldn't sleep well at night if my program output something like that. At some point, it should probably generate an example string that demonstrates the type error it wants to report, and then also point out where in that string the type error occurs.