Tracking down a parser problem

In our Blackbox project, we are collecting a lot of data from programming novices as they learn, including the Java source code they are writing. We now have a reasonable amount of data: 5 million successful compilations (and 5 million unsuccessful). I’ve been doing some simple analysis on the data we have collected; in this post I will describe a bug in the analysis that took me a little while to track down.

The analysis is very straightforward. I pull all of the successfully-compiled files out of a disk cache and then process each one (by parsing it and doing my analysis on the resulting syntax tree), then finally pull all of the results together to produce a summary. The analysis runs on a multi-core machine so I run several worker threads in parallel, each running one task at a time until the work queue is empty. I encountered a problem where sometimes the analysis was very slow to finish (or didn’t finish at all). Tracking down a problem that occurs at the end of your hours-long analysis is definitely a frustrating and time-consuming business.
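For concreteness, the worker harness is nothing more elaborate than a fixed pool of threads draining a shared queue. A minimal sketch of that shape in Haskell, using the async and stm libraries (runAnalysis and analyse are placeholder names here, not my actual code), might look like this:

import Control.Concurrent.Async (replicateConcurrently_)
import Control.Concurrent.STM

-- A fixed pool of workers draining a shared queue of files.
-- 'analyse' stands in for the real parse-and-scan step.
runAnalysis :: Int -> [FilePath] -> (String -> r) -> IO [r]
runAnalysis nWorkers files analyse = do
    queue   <- newTVarIO files      -- work queue: files still to process
    results <- newTVarIO []         -- per-file results, collected as we go
    let worker = do
            next <- atomically $ do -- atomically pop one file off the queue
                q <- readTVar queue
                case q of
                    []     -> return Nothing
                    f : fs -> writeTVar queue fs >> return (Just f)
            case next of
                Nothing -> return () -- queue empty: this worker is done
                Just f  -> do
                    src <- readFile f
                    let r = analyse src
                    -- seq forces (at least the top of) the result here,
                    -- so the work happens inside this worker thread
                    r `seq` atomically (modifyTVar' results (r :))
                    worker
    replicateConcurrently_ nWorkers worker
    readTVarIO results

(Compiled with -threaded, the workers run in parallel; the point that matters for this story is simply that each worker is tied up until its current task finishes.)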

Early on I looked at time spent in garbage collection, and at the possibility of a livelock in my worker framework. But eventually I realised that the problem did not lie in the framework or environment; it lay in the processing tasks themselves. Some of my processing tasks were simply taking an incredibly long time to finish. This sounds like an obvious cause, but it was quite surprising: the task was very small. It simply parsed a source file and scanned the resulting tree. There should be no possibility of an infinite loop in there, given that the input is a finite source file. Why would most of these tasks finish near-instantly, while a small handful took near-forever?

Once I realised the problem was in the processing task, I also realised that it was certain inputs that were tying up the worker threads. With enough bad inputs, each worker would eventually be occupied trying to process a difficult input, until my program pretty much ground to a halt. Looking at the difficult inputs gave a clue to the problem. Here is the gist of one problematic input, which was implementing something like a game of twenty questions:

boolean q1 = nextAnswer();
if (q1) {
    return "...";
}
if (!q1) {
    boolean q2 = nextAnswer();
    if (q2) {
        return "...";
    }
    if (!q2) {
        boolean q3 = nextAnswer();
        if (q3) {
            return "...";
        }
        if (!q3) {
            ... // Lots more nested ifs for q4 onwards
        }
    }
}

And so on, with the if statements nested more than twenty levels deep. (Like I said: the data comes from programming novices.) So clearly the problem resided with if statements. Now, parsing if statements in many C-like languages runs into a well-known issue, the dangling-else problem. If you have this:

if (a)
if (b)
blah();
else
bleh();

Does that else attach to the first if or the second? The usual answer is that it attaches to the innermost if. However, if you use a classic declarative BNF approach to writing the grammar, you will write an ambiguous grammar. So C and Java rewrite their grammar to be unambiguous. Here’s Java’s rewrite, from the Java Language Specification:

StatementNoShortIf:
    StatementWithoutTrailingSubstatement
    LabeledStatementNoShortIf
    IfThenElseStatementNoShortIf
    WhileStatementNoShortIf
    ForStatementNoShortIf

IfThenStatement:
    if ( Expression ) Statement

IfThenElseStatement:
    if ( Expression ) StatementNoShortIf else Statement

IfThenElseStatementNoShortIf:
    if ( Expression ) StatementNoShortIf else StatementNoShortIf

This rewrite essentially ensures that a StatementNoShortIf will never end in a (non-brace-enclosed) if-without-else, thus removing the ambiguity. And it turns out that the Java parsing library I was using follows this grammar exactly. Effectively, they had copied the Java Language Specification directly into a parsing library, which doesn’t sound unreasonable as a way to ensure that you parse all of Java correctly. Here’s the relevant snippet of code that uses the Parsec parsing combinator library (comments added by me):

ifStmt = do
 tok KW_If -- Consume if keyword
 e <- parens exp -- then parenthesised expression
 -- Then try the following two alternatives in turn.
 -- First, statement with else:
 (try $ do th <- stmtNSI -- Stmt with no short if
           tok KW_Else -- Consume else keyword
           el <- stmt -- Parse another statement
           return $ IfThenElse e th el) <|>
   -- Second alternative: if-without-else:
   (do th <- stmt 
       return $ IfThen e th)

The practical consequence of this code is that every if-without-else triggers two parse attempts: one that looks for an else afterwards (but must parse the body of the if before it finds out there isn’t one), and one that re-parses the same body but doesn’t look for an else. By itself, this double parse is not that big a problem: sure, it takes time to parse the body twice, especially if the body is large, but computers are fast enough that it’s not a massive issue.

That is, until you nest multiple if statements. Our novice programmer nested them 24 deep. So we first attempt to parse the outermost if looking for an else, but we make the same attempt with each of the 23 ifs nested inside it. The innermost one will fail (our novice isn’t using “else”s) and re-parse; then the second innermost one will fail and re-parse, and so on. Suddenly our parser is making something like 2^24 attempts to parse the innermost if body. It was this exponential blow-up that caused our parsing tasks to run near-forever. One small fix later to avoid parsing the body twice, and the parsing was running in reasonable time on the problematic inputs, effectively going from O(2^N) to O(N) in the nesting depth N. Sanity restored. (You can now argue about the relative merits of LL and LR parsers and so on, if you’re so inclined.)

Conclusion

These sorts of articles are usually meant to end with a lesson learned. Take your pick: don’t blindly copy BNF into a parsing library; beware of potential problems in the libraries you are using; have analytics on how long your processing tasks are taking, and flag up or kill off the outliers; and finally, be better at narrowing down the problem than I was (a few print statements at the right points pinpointed the problem, but I went down a few blind alleys first). I am certainly learning that when you have a large dataset like ours, the odd corner cases in your code will be found out. This is both a blessing and a curse: you have to write better code, but at least you have a massive source of input data to test your algorithm on.
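On the analytics point, even something as blunt as wrapping each task in a time limit would have flagged the problem inputs long before the whole run ground to a halt. Here is a sketch using System.Timeout (the function name, the limit and the logging are placeholders of mine, not what Blackbox actually does):

import Control.Exception (evaluate)
import System.Timeout (timeout)

-- Run one analysis task, but give up (and report the offending file)
-- if it takes longer than the given number of microseconds.
analyseWithLimit :: Int -> FilePath -> (String -> r) -> IO (Maybe r)
analyseWithLimit limitMicros file analyse = do
    src    <- readFile file
    -- evaluate forces only to WHNF; a real version would force the
    -- result fully (e.g. with deepseq) so the timeout covers all the work
    result <- timeout limitMicros (evaluate (analyse src))
    case result of
        Nothing -> do
            putStrLn ("Gave up on outlier: " ++ file)
            return Nothing
        Just r  -> return (Just r)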


Comments

  1. How exactly did you fix the problem? I don’t think parsec has cut operators or other things to prevent backtracking.

    • My solution, which I’m reasonably confident in, is to just do the obvious original idea: parse the if, the expression, and the body. Then I optionally parse an else (and if there is one, the body) or just return the if-without-else, should there not be an else. This avoids the double parse of the body, and because each parse of an if is greedy, it should still obey the binds-to-the-innermost-if rule. This is doable because Parsec favours the leftmost alternative, unlike BNF which is declarative and theoretically has no preference between two alternatives.
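      In Parsec terms, the change amounts to something like this (just a sketch of that approach, reusing tok, parens, exp and stmt from the snippet in the post; it is not the library’s actual patch):

      ifStmt = do
          tok KW_If                 -- Consume if keyword
          e  <- parens exp          -- then parenthesised expression
          th <- stmt                -- Parse the body exactly once
          -- Only now look for an optional else; the body above is
          -- never re-parsed, whether or not an else turns up.
          mEl <- optionMaybe (tok KW_Else >> stmt)
          return $ case mEl of
              Just el -> IfThenElse e th el
              Nothing -> IfThen e th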
