Monday, July 5, 2010

AST Graph and Abstract Type-Checking

Translation from parse graphs to AST graphs (as I call them) proved to be more of an engineering challenge than a clever use of formal methods. The only reason compilers translate the parse tree to an AST is that the AST is more convenient to code against. Thus, the sole purpose of translating a parse graph to an AST graph must be to simplify the coding of analyses such as type-checking and side-effect analysis.

SableCC generates a ton of classes to represent the AST in a type-strong fashion, so I hacked away at the rather monstrous SableCC source code to make it generate classes that represent a graph instead of a tree. In addition to being a graph, the nodes need to support choices and sequences. At first, I made one generic ChoiceNode class, such that ChoiceNode&lt;PExp&gt; was a choice node guaranteed to unfold to a node of some subtype of PExp. It became apparent, however, that Java's type system is simply not strong enough to represent this type of structure; or at least not in a way that would gain me any convenience. The number of unsafe typecasts the generics introduced vastly outweighed the few they might have spared me.

I then went a different route. Avoiding generic types, I generated separate classes to represent the different types of choice nodes, so the class that unfolds to a PExp node is distinct from the class that unfolds to, say, a PStatement node. This is quite convenient, because any analysis will certainly have to treat such nodes very differently, and a type parameter would be invisible at runtime anyway. A sequence of nodes is strictly distinguished from a single node, so there are also separate choice nodes for PExp sequences, PStatement sequences, and so on. Sequences can be created using EmptySequence, SingletonSequence, and Concat nodes, of which there also exists one for each production in the AST.
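A minimal sketch of this scheme (class and method names are hypothetical; the actual generated code looks different) might be:

```java
import java.util.ArrayList;
import java.util.List;

abstract class Node {}

// Base classes for two production types in the AST.
abstract class PExp extends Node {}
abstract class PStatement extends Node {}

// A choice node specific to PExp: children are validated at runtime.
class PExpChoice extends PExp {
    final List<PExp> alternatives = new ArrayList<>();
    void addAlternative(Node n) {
        if (!(n instanceof PExp))
            throw new IllegalArgumentException("PExpChoice expects a PExp child");
        alternatives.add((PExp) n);
    }
}

// Sequences of PExp nodes, built from empty/singleton/concat constructors.
abstract class PExpSeq extends Node {}
class PExpEmptySequence extends PExpSeq {}
class PExpSingletonSequence extends PExpSeq {
    final PExp item;
    PExpSingletonSequence(PExp item) { this.item = item; }
}
class PExpConcat extends PExpSeq {
    final PExpSeq left, right;
    PExpConcat(PExpSeq left, PExpSeq right) { this.left = left; this.right = right; }
}

class IntLiteralExp extends PExp {
    final int value;
    IntLiteralExp(int v) { value = v; }
}

public class NodeDemo {
    public static void main(String[] args) {
        PExpChoice c = new PExpChoice();
        c.addAlternative(new IntLiteralExp(34));
        c.addAlternative(new IntLiteralExp(45));
        System.out.println(c.alternatives.size());
        PExpSeq seq = new PExpConcat(new PExpSingletonSequence(new IntLiteralExp(1)),
                                     new PExpEmptySequence());
        System.out.println(seq instanceof PExpConcat);
    }
}
```

The key point is that each choice class only accepts children of its own production type, so type errors in the graph construction surface immediately rather than during a later analysis.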

These classes validate at runtime that their children have the correct type, and although this falls short of static typing, the rigid structure of the AST graph has proven terrific for analysis. I believe any synthesized property in the AST can be approximated with a monotone backward analysis on the AST graph. Inherited properties can likewise be approximated with a monotone forward analysis.
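As an illustration of the idea (a toy, not my actual implementation; node names are made up), the possible types of an expression node can be computed as a least fixed point, where a choice node's type set is the join (union) of its alternatives' type sets:

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: a synthesized property (the set of possible types) computed as a
// monotone fixed-point analysis over a graph with choice nodes.
public class TypeFlow {
    enum Ty { INTEGER, BOOLEAN }

    public static void main(String[] args) {
        // choice unfolds to plusExp (an INTEGER) or boolLit (a BOOLEAN)
        Map<String, List<String>> children = new HashMap<>();
        children.put("choice", List.of("plusExp", "boolLit"));
        children.put("plusExp", List.of());
        children.put("boolLit", List.of());

        Map<String, EnumSet<Ty>> types = new HashMap<>();
        types.put("plusExp", EnumSet.of(Ty.INTEGER));
        types.put("boolLit", EnumSet.of(Ty.BOOLEAN));
        types.put("choice", EnumSet.noneOf(Ty.class));

        // Iterate to a fixed point: each node's type set absorbs its
        // children's type sets. Termination follows from monotonicity and
        // the finite lattice of type sets.
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String n : children.keySet()) {
                EnumSet<Ty> joined = EnumSet.copyOf(types.get(n).isEmpty()
                        ? EnumSet.noneOf(Ty.class) : types.get(n));
                joined.addAll(types.get(n));
                for (String c : children.get(n)) joined.addAll(types.get(c));
                if (!joined.equals(types.get(n))) {
                    types.put(n, joined);
                    changed = true;
                }
            }
        }
        System.out.println(types.get("choice"));
    }
}
```

A condition node whose computed type set contains INTEGER would then trigger exactly the kind of type error reported below.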

I managed to implement abstract type-checking for a simple expression-based language with integers and booleans in just 250 lines of Java code! For example, suppose we want to analyse the input:

if foo (<|+) bar then 34 else 45
The AST graph we get using abstract lexing, abstract parsing and AST graph transformation is:
The input can be two different strings, one of which contains a type error, since foo+bar is not a boolean as required by the if-then-else construct. And my type-analysis actually detects the type error! The error message it can produce at the moment is less than informative, however:

Invalid type [INTEGER] in node asf.mini.node.PExp$Choice@6ef2ea42

Okay, so there is room for improvement there. I like to bash the compilers of certain languages for producing useless error messages, and I wouldn't sleep well at night if my own program output something like that. At some point, it should probably generate an example string that demonstrates the type error it wants to report, and then also point out where in that string the type error occurs.

Monday, June 28, 2010

First take on Git and EGit

EGit, Eclipse's Git plugin, still seems a bit underdeveloped, but since it's officially in the incubation phase I'm not complaining. It simply cannot perform Git's push command against a GitHub repository (loosely corresponding to a commit in SVN); to do that, I have to use the command line. It also has trouble recognizing the .gitignore file, marking ignored files as untracked rather than ignoring them entirely. The plugin does allow me to ignore all untracked files, though, so it's not a showstopper for me.

Git's separation of commit and push commands is amazing, and though I'm still very new to Git, I can already feel the extra freedom it gives me. EGit is bound to catch up eventually, so learning Git now is probably a good investment of my time.

Subversive woes

The new Subversive plugin that ships with Eclipse Helios has some pretty new icons, but apart from that it seems to be garbage. It does not recognize the projects that I placed under SVN version control using Galileo. After some fooling around and a lot of googling, I found myself desperate, so I exported the project and re-committed it to SVN under the new Helios plugin. A crappy solution, but at least the project was under version control.

That was not the end of it, though. Even though I committed to a freshly created SVN repository, made just for Helios's poor plugin, to which only a single machine ever made commits, it soon started complaining about files that were "already present on path foo/bar/baz". SVN has given me trouble before, but this time I was shocked at the extent of the plugin's inability to do anything! I opened Explorer and used the trusty old TortoiseSVN to address the issue. I managed to back up the folder, revert+update it, reinsert the files, and commit them. Tried and true, and it works.

When I returned to Helios, it claimed that the project was no longer under version control. What the hell? What kind of alien format does the plugin use?!

Maybe it's time to ditch SVN and try Git instead – something I have wanted to do for a while. Git seems more robust, but it really comes down to how well the Eclipse plugins work. Given my current opinion of the Subversive plugin for Helios, though, EGit would have to be really miserable for me to go back to Subversive.

Abstract Parsing #2

Sometimes you get an idea. You play with it for a while, but after some time, unforeseen gotchas come to your attention. You rework the idea accordingly and then repeat. This can go on for a week or more, until you arrive at something that works – or maybe you ditch the idea at some point.

At other times, you get an idea that kicks ass – it is elegant, clean, easy to understand, works in theory, and when you implement it, everything actually works the first time around. I had an idea for how to upgrade the abstract stack representation (mentioned in the previous post) that I was planning to try out tomorrow, but my other plans got marooned, so I gave it a go. As it turned out, it only took about 30 minutes of coding, and once it compiled, everything actually worked right off the bat. The parse graph for the previous example is now as accurate as it can possibly get:
I could hardly stop laughing in disbelief when I saw that graph! If you don't think this graph is beautiful, something must be wrong with you – or then again, maybe you're just a normal person.

Sunday, June 27, 2010

Abstract Parsing

My first prototype of an abstract parser seems to be working. It uses a very simple abstract stack representation that is very imprecise and needs to be replaced eventually – but it serves as a great starting point. After all, it does give a sound approximation of the parse tree!

Suppose we are using this simple grammar for arithmetic expressions:
Tokens:
identifier = ['a'..'z']+;
integer_literal = ['0'..'9']+;
plus = '+';
minus = '-';
star = '*';
div = '/';
lparen = '(';
rparen = ')';

Productions:
exp = plusexp;
plusexp = plusexp plus mulexp | plusexp minus mulexp | mulexp;
mulexp = mulexp star unexp | mulexp div unexp | unexp;
unexp = minus unexp | primary;
primary = identifier | lparen exp rparen | integer_literal;

Now let the possible inputs to our language be given by the regular expression: a * (5|(6 + b)) (where the bold-faced symbols are control characters).
Tokenizing the expression with the abstract lexer gives us the automaton:

Feeding that to the abstract parser now yields the – rather inaccurate – graph of parse trees:
Okay, so you probably can't make out the text on the nodes due to the image being kinda small here on BlogSpot – but anyway, the gray nodes are tokens, the white boxes are productions, and the diamond-shaped nodes are choices.

Unfortunately the abstract stack representation is much too weak to create a precise syntax graph. The graph above can unfold to the correct syntax tree, but also the parse trees corresponding to "a", "6 + b", "6 + a", and just "6". This will obviously do us no good in the long run.
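To make "unfolding" concrete, here is a tiny sketch (hypothetical node classes, not my actual representation) of how a graph with choice nodes unfolds to a set of concrete strings; an over-approximate graph simply unfolds to more strings than the true input language contains:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: unfolding a parse graph with choice nodes into the set of
// concrete strings it represents.
public class Unfold {
    interface GNode { List<String> unfold(); }

    // A token leaf unfolds to exactly its own text.
    record Token(String text) implements GNode {
        public List<String> unfold() { return List.of(text); }
    }

    // A choice node unfolds to the union of its alternatives' unfoldings.
    record Choice(List<GNode> alts) implements GNode {
        public List<String> unfold() {
            List<String> out = new ArrayList<>();
            for (GNode a : alts) out.addAll(a.unfold());
            return out;
        }
    }

    // A sequence unfolds to every concatenation of its parts' unfoldings.
    record Seq(List<GNode> parts) implements GNode {
        public List<String> unfold() {
            List<String> out = List.of("");
            for (GNode p : parts) {
                List<String> next = new ArrayList<>();
                for (String pre : out)
                    for (String s : p.unfold()) next.add(pre + s);
                out = next;
            }
            return out;
        }
    }

    public static void main(String[] args) {
        // The (5 | 6 + b) part of the example input as a choice node.
        GNode g = new Choice(List.of(
                new Token("5"),
                new Seq(List.of(new Token("6"), new Token(" + "), new Token("b")))));
        System.out.println(g.unfold());
    }
}
```

A precise graph for the example would unfold to exactly the strings matching the input regular expression; the weak stack representation instead produces a graph whose unfolding also covers the bogus parses listed above.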

Once I manage to fix the precision issues, there is still the hurdle of performing the abstract equivalent of converting a parse tree to an abstract syntax tree. What can I call that process? Abstract, abstract syntax tree construction? Yikes.

Friday, June 25, 2010

Abstract Lexing

As my exam project in my Abstract Interpretation course, I developed a technique for performing lexical analysis of languages of strings rather than one concrete string. The application of this is to reason about dynamically generated strings such as SQL, XPath and JavaScript eval-statements.

As a rather pathetic example, consider the JavaScript code below:
var x = 23;
var y = 10;
var op = usePlus ? "+" : "-";
var z = eval(x + " " + op + " " + y);

Obviously the result is an integer and the eval call has no side-effects, but how can a program analysis conclude this? String analysis of the eval contents may deduce that the argument belongs to the language [1-9][0-9]* (+|-) [1-9][0-9]*. Using abstract lexing with JavaScript's lexical structure, we can deduce that the tokenization of that string is always NUMBER (MINUS|PLUS) NUMBER after whitespace truncation. The language of token strings is represented using a two-level automaton, a construct similar to a finite-state automaton that can represent tokens both by their name and the concrete text they are made of.
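The abstract lexer establishes this for the whole (infinite) language at once, but the property is easy to check on individual members. A toy tokenizer (hypothetical, and much simpler than JavaScript's real lexical structure) illustrates what the token-name sequence looks like:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: any concrete member of [1-9][0-9]* (+|-) [1-9][0-9]* tokenizes to
// NUMBER (PLUS|MINUS) NUMBER after whitespace is skipped.
public class TokenCheck {
    static List<String> tokenize(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if (Character.isDigit(c)) {
                while (i < s.length() && Character.isDigit(s.charAt(i))) i++;
                out.add("NUMBER");
            } else if (c == '+') { out.add("PLUS"); i++; }
            else if (c == '-') { out.add("MINUS"); i++; }
            else throw new IllegalArgumentException("unexpected: " + c);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("23 + 10"));
        System.out.println(tokenize("23 - 10"));
    }
}
```

The two-level automaton plays the role of this token list for the abstract case: it captures both the token names and the languages of concrete texts behind them.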

From this point, hopefully an abstract parser can create some kind of syntax graph representing the syntax trees that may result from parsing such a language of token strings. From this syntax graph, we should be able to perform abstract type-checking and start to reason about the behaviour of the eval-statement, and the type of the returned value.

I admit the example above isn't particularly motivating, but I don't like long posts. Information about the string analysis preceding the lexical analysis can be found at the JSA website.