Friday, June 25, 2010

Abstract Lexing

As my exam project in my Abstract Interpretation course, I developed a technique for performing lexical analysis on languages of strings rather than a single concrete string. This makes it possible to reason about dynamically generated strings such as SQL, XPath and JavaScript eval-statements.

As a rather pathetic example, consider the JavaScript code below:
var x = 23;
var y = 10;
var op = usePlus ? "+" : "-";
var z = eval(x + " " + op + " " + y);

Obviously the result is an integer and the eval call has no side-effects, but how can a program analysis conclude this? String analysis of the eval contents may deduce that the argument belongs to the language [1-9][0-9]* (+|-) [1-9][0-9]*. Using abstract lexing with JavaScript's lexical structure, we can deduce that the tokenization of that string is always NUMBER (MINUS|PLUS) NUMBER after whitespace truncation. The language of token strings is represented using a two-level automaton, a construct similar to a finite-state automaton that can represent tokens both by their name and the concrete text they are made of.
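
To make the tokenization step concrete, here is a small sketch of an ordinary lexer for the token classes above (the function and rule names are mine, not the project's). Abstract lexing lifts exactly this per-string computation to a whole regular language of input strings at once.

// A minimal sketch, not the project's implementation: a concrete lexer
// for the token classes NUMBER, PLUS and MINUS, with whitespace dropped.
function lex(input) {
  var rules = [
    { name: "NUMBER", re: /^[1-9][0-9]*/ },
    { name: "PLUS",   re: /^\+/ },
    { name: "MINUS",  re: /^-/ },
    { name: "WS",     re: /^\s+/ }
  ];
  var tokens = [];
  while (input.length > 0) {
    var matched = false;
    for (var i = 0; i < rules.length; i++) {
      var m = rules[i].re.exec(input);
      if (m) {
        if (rules[i].name !== "WS") {            // whitespace truncation
          tokens.push({ name: rules[i].name, text: m[0] });
        }
        input = input.slice(m[0].length);
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error("lexical error at: " + input);
  }
  return tokens;
}

// Every member of [1-9][0-9]* (+|-) [1-9][0-9]* lexes to NUMBER (PLUS|MINUS) NUMBER,
// e.g. lex("23 + 10") => [NUMBER "23", PLUS "+", NUMBER "10"]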

From this point, hopefully an abstract parser can create some kind of syntax graph representing the syntax trees that may result from parsing such a language of token strings. From this syntax graph, we should be able to perform abstract type-checking and start to reason about the behaviour of the eval-statement, and the type of the returned value.
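
As a rough illustration of that last step, here is a hypothetical sketch (again, the names are mine, not an actual abstract parser) of how a token string of the shape NUMBER (PLUS|MINUS) NUMBER could be parsed into a single syntax node and given the type "number", which is the kind of conclusion an abstract type-checker would draw for every token string in the language.

// A minimal sketch of parsing one representative token string and
// assigning a type to the resulting syntax node.
function parseBinary(tokens) {
  if (tokens.length === 3 &&
      tokens[0].name === "NUMBER" &&
      (tokens[1].name === "PLUS" || tokens[1].name === "MINUS") &&
      tokens[2].name === "NUMBER") {
    return { kind: "BinaryExpr", op: tokens[1].name,
             left: tokens[0], right: tokens[2] };
  }
  throw new Error("parse error");
}

function typeOf(node) {
  // Two numeric literals combined with + or - always yield a number,
  // so the eval result in the example can be given the type "number".
  if (node.kind === "BinaryExpr") return "number";
  return "unknown";
}

// Reusing lex from the sketch above:
// typeOf(parseBinary(lex("23 + 10"))) => "number"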

I admit the example above isn't particularly motivating, but I don't like long posts. Information about the string analysis preceding the lexical analysis can be found at the JSA website.
