EGit, Eclipse's Git plugin, still seems a bit underdeveloped, but since it's officially in the incubation phase I'm not complaining. It simply cannot push to a GitHub repository (push being the Git operation that loosely corresponds to an SVN commit), so for that I have to use the command line. It also has trouble honoring the .gitignore file: it marks ignored files as untracked rather than hiding them entirely. The plugin does let me exclude all untracked files, though, so it's not a showstopper for me.
Git's separation of commit and push commands is amazing, and though I'm still very new to Git, I can already feel the extra freedom it gives me. EGit is bound to catch up eventually, so learning Git now is probably a good investment of my time.
Monday, June 28, 2010
Subversive woes
The new Subversive plugin that ships with Eclipse Helios has some pretty new icons, but apart from that it seems to be garbage. It does not recognize the projects that I placed under SVN version control using Galileo. After some fooling around and a lot of googling, I found myself desperate, so I exported the project and re-committed it to SVN under the new Helios plugin. A crappy solution, but at least the project was under version control again.
That was not the end of it, though. Even though I had committed to a freshly created SVN repository – set up just for Helios's poor plugin, and only ever committed to from a single machine – it soon started complaining about files that were "already present on path foo/bar/baz". SVN has given me trouble before, but this time I was shocked by the extent of the plugin's inability to do anything about it! I opened Explorer and used the trusty old TortoiseSVN to address the issue: I backed up the folder, reverted and updated it, reinserted the files, and committed them. Tried and true, and it works.
When I returned to Helios, it claimed that the project was no longer under version control. What the hell? What kind of alien format does the plugin use?!
Maybe it's time to ditch SVN and try Git instead – something I have wanted to do for a while. Git seems more robust, but in practice it comes down to how well the Eclipse plugins work. Given my current opinion of the Subversive plugin for Helios, though, EGit would have to be truly miserable to drive me back to Subversive.
Abstract Parsing #2
Sometimes you get an idea. You play with it for a while, but sooner or later some unforeseen gotchas come to your attention. You rework the idea accordingly and repeat. This can go on for a week or more, until you either arrive at something that works or ditch the idea altogether.
At other times, you get an idea that kicks ass – it is elegant, clean, easy to understand, works in theory, and when you implement it, everything actually works the first time around. I had an idea for how to upgrade the abstract stack representation (mentioned in the previous post) that I was planning to try out tomorrow, but my other plans fell through, so I gave it a go. As it turned out, it took only about 30 minutes of coding, and once it compiled, everything worked right off the bat. The parse graph for the previous example is now as accurate as it can possibly get:
I could hardly stop laughing in disbelief when I saw that graph! If you don't think this graph is beautiful, something must be wrong with you – or then again, maybe you're just a normal person.
Sunday, June 27, 2010
Abstract Parsing
My first prototype of an abstract parser seems to be working. It uses a very simple abstract stack representation that is very imprecise and will eventually need to be replaced – but it serves as a great starting point. After all, it does give a sound approximation of the parse tree!
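For a flavour of what "very simple" can mean here, the sketch below shows one extremely coarse abstraction in JavaScript – to be clear, the names and the particular representation are illustrative, not necessarily what my prototype does. The stack is abstracted to the set of grammar symbols that may occur somewhere on it, forgetting both order and multiplicity:

    function AbstractStack() {
      // The abstraction: just the set of symbols that MAY occur on the stack.
      // Order and multiplicity are forgotten entirely.
      this.symbols = {};
    }

    // Pushing a symbol records that it may now occur on the stack.
    AbstractStack.prototype.push = function (symbol) {
      this.symbols[symbol] = true;
    };

    // Popping cannot remove anything: the symbol may still occur deeper
    // down in some concrete stack, so we can only check that the pop is possible.
    AbstractStack.prototype.mayPop = function (symbol) {
      return this.symbols[symbol] === true;
    };

    // Joining two abstract stacks (e.g. where parse states merge) is set union.
    // Returns true if anything changed, so a fixed-point loop knows when to stop.
    AbstractStack.prototype.join = function (other) {
      var changed = false;
      for (var s in other.symbols) {
        if (!this.symbols.hasOwnProperty(s)) {
          this.symbols[s] = true;
          changed = true;
        }
      }
      return changed;
    };

With a representation this coarse, it is no wonder that the resulting parse graph over-approximates as much as it does.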
Suppose we are using this simple grammar for arithmetic expressions:
Tokens:
  identifier      = ['a'..'z']+;
  integer_literal = ['0'..'9']+;
  plus            = '+';
  minus           = '-';
  star            = '*';
  div             = '/';
  lparen          = '(';
  rparen          = ')';

Productions:
  exp     = plusexp;
  plusexp = plusexp plus mulexp
          | plusexp minus mulexp
          | mulexp;
  mulexp  = mulexp star unexp
          | mulexp div unexp
          | unexp;
  unexp   = minus unexp
          | primary;
  primary = identifier
          | lparen exp rparen
          | integer_literal;
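As a quick sanity check of the grammar, here is a leftmost derivation of the concrete string "a * 5":

    exp
      => plusexp
      => mulexp
      => mulexp star unexp
      => unexp star unexp
      => primary star unexp
      => identifier("a") star unexp
      => identifier("a") star primary
      => identifier("a") star integer_literal("5")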
Now let the possible inputs to our language be given by the regular expression a * (5|(6 + b)), where the bold-faced symbols are regular-expression metacharacters rather than characters of the input.
Tokenizing the expression with the abstract lexer gives us the automaton:
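In textual form – reconstructed from the input language above rather than copied from the tool's output – the automaton accepts exactly the two possible token strings:

    (0) --identifier "a"--> (1) --star "*"--> (2) --integer_literal "5"--> (3) accept
                                              (2) --integer_literal "6"--> (4) --plus "+"--> (5) --identifier "b"--> (6) accept

that is, identifier star integer_literal and identifier star integer_literal plus identifier.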
Feeding that to the abstract parser now yields the – rather inaccurate – graph of parse trees:
Okay, so you probably can't make out the text on the nodes due to the image being kinda small here on BlogSpot – but anyway: the gray nodes are tokens, the white boxes are productions, and the diamond-shaped nodes are choices.
Unfortunately, the abstract stack representation is much too weak to create a precise syntax graph. The graph above can unfold to the correct syntax tree, but also to the parse trees corresponding to "a", "6 + b", "6 + a", and just "6". This will obviously do us no good in the long run.
Once I manage to fix the precision issues, there is still the hurdle of performing the abstract equivalent of converting a parse tree to an abstract syntax tree. What can I call that process? Abstract, abstract syntax tree construction? Yikes.
Friday, June 25, 2010
Abstract Lexing
As my exam project in my Abstract Interpretation course, I developed a technique for performing lexical analysis of a language of strings rather than a single concrete string. The application is to reason about dynamically generated code such as SQL queries, XPath expressions, and the arguments of JavaScript eval-statements.
As a rather pathetic example, consider the JavaScript code below:
    var x = 23;
    var y = 10;
    var op = usePlus ? "+" : "-";
    var z = eval(x + " " + op + " " + y);
Obviously the result is an integer and the eval call has no side-effects, but how can a program analysis conclude this? String analysis of the eval contents may deduce that the argument belongs to the language [1-9][0-9]* (+|-) [1-9][0-9]*. Using abstract lexing with JavaScript's lexical structure, we can deduce that the tokenization of that string is always NUMBER (MINUS|PLUS) NUMBER after whitespace truncation. The language of token strings is represented using a two-level automaton, a construct similar to a finite-state automaton that can represent tokens both by their name and the concrete text they are made of.
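To make "two-level automaton" slightly more concrete, here is a JavaScript sketch of how such a structure could be laid out for the NUMBER (MINUS|PLUS) NUMBER result above. The field names and encoding are mine, chosen for illustration – they are not the actual implementation's data structures. The outer level is an automaton over token names, and every outer transition carries an inner character-level automaton describing the concrete texts that token occurrence may have:

    // An inner (character-level) automaton accepting exactly the string s.
    function literal(s) {
      return {
        start: 0,
        accept: [s.length],
        // One transition per character of s.
        transitions: s.split("").map(function (c, i) {
          return { from: i, chars: c, to: i + 1 };
        })
      };
    }

    // Inner automaton for the number texts [1-9][0-9]*.
    function numberTexts() {
      return {
        start: 0,
        accept: [1],
        transitions: [
          { from: 0, chars: "123456789", to: 1 },
          { from: 1, chars: "0123456789", to: 1 }
        ]
      };
    }

    // The outer level: token names on the transitions, each paired with an
    // inner automaton for the token's possible concrete texts. This one
    // represents the token-string language NUMBER (MINUS|PLUS) NUMBER.
    var twoLevelAutomaton = {
      start: 0,
      accept: [3],
      transitions: [
        { from: 0, token: "NUMBER", texts: numberTexts(), to: 1 },
        { from: 1, token: "PLUS",   texts: literal("+"),  to: 2 },
        { from: 1, token: "MINUS",  texts: literal("-"),  to: 2 },
        { from: 2, token: "NUMBER", texts: numberTexts(), to: 3 }
      ]
    };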
From here, the hope is that an abstract parser can create some kind of syntax graph representing the syntax trees that may result from parsing such a language of token strings. From that syntax graph, we should be able to perform abstract type-checking and start to reason about the behaviour of the eval-statement and the type of the value it returns.
I admit the example above isn't particularly motivating, but I don't like long posts. Information about the string analysis preceding the lexical analysis can be found at the JSA website.