Judging Programming Performance, Programmatically

One aspect that is shared between programming games and programming education is the need to measure progress. In both cases a player/pupil is given some sort of problem description or set of requirements, and they must write a program to satisfy this. If the program meets all the requirements, that’s usually a success (modulo code quality). But if the program doesn’t do what it is supposed to, how do you judge its merit?

When marking programming assessments, educators usually try to judge if the student had the right idea, or if they were working along the right lines. Perhaps fittingly, judging the quality of a part-finished program is a task which needs a human. No-one has yet invented a metric that can judge program quality: it either works or it doesn’t, and if it doesn’t, judging progress is hard.

This is really the reason why there are so few large-scale programming contests. Either the submissions must be manually judged (which doesn’t scale well, at least without crowd-sourcing), or there needs to be some way of automatically scoring and ranking them. But if all you can test is working/not-working, there’s no way to form a fine-grained scale to pick a winner! So the programming contests that do exist (like the ICFP contest) tend to be based around an open-ended problem with a score attached, such as writing a control program for a Mars rover that must reach its destination in as little time as possible. What gets measured is the outcome, not the program itself.
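To make outcome-based scoring concrete, here is a minimal sketch in Python. It assumes each submission has already been run and reduced to a single number (the time taken to reach the destination), with None standing for a run that never arrived. The function name and the sample data are made up for illustration; they are not taken from any real contest.

```python
from typing import Optional


def rank_by_outcome(results: dict[str, Optional[float]]) -> list[str]:
    """Rank entrants by time taken (lower is better).

    Entries that never reached the destination (None) are placed last,
    in alphabetical order, because there is no outcome to compare them on.
    """
    finished = sorted((t, name) for name, t in results.items() if t is not None)
    failed = sorted(name for name, t in results.items() if t is None)
    return [name for _, name in finished] + failed


# Hypothetical results: seconds taken, or None if the rover never arrived.
results = {"alice": 41.5, "bob": None, "carol": 37.2}
print(rank_by_outcome(results))  # ['carol', 'alice', 'bob']
```

Note that the ranking never looks at the source code of a submission. That is what makes it automatable, and also why it says nothing about how close a failed entry came to working.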

So what do programming games do when they want to score your efforts? SpaceChem opts for two easily measured metrics: program speed, and program size. As a game mechanic, this works reasonably well. Trying to speed up your programs really taxes your brain and adds a new dimension to the game. However, in terms of programming education, this is a pretty terrible idea. Not only does it support students’ misconceptions that speed is important, but to speed up programs you must do all sorts of dirty tricks and learn more complicated, intricate ways of programming. The same goes for optimising for program size. So as good a programming game as SpaceChem is, it demonstrates that measuring programs by these simple metrics is a bad idea.
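Here is a rough sketch of why these two metrics are so easy to measure, and how they pull against each other. The tiny instruction set below is invented for this post and has nothing to do with how SpaceChem works internally: "size" is simply the number of instructions, and "speed" is the number of instructions executed.

```python
def run(program, n):
    """Run a toy program; return (result, cycles executed).

    Hypothetical instruction set:
      ("add", k)   add k to the accumulator
      ("dec", _)   decrement the counter n
      ("jnz", pc)  jump to instruction pc if n is not zero
    """
    acc, pc, cycles = 0, 0, 0
    while pc < len(program):
        op, arg = program[pc]
        cycles += 1
        if op == "add":
            acc += arg
            pc += 1
        elif op == "dec":
            n -= 1
            pc += 1
        elif op == "jnz":
            pc = arg if n != 0 else pc + 1
        else:
            raise ValueError(f"unknown instruction: {op}")
    return acc, cycles


# Two ways to add 2 four times: a loop, and the same work unrolled by hand.
looped = [("add", 2), ("dec", 0), ("jnz", 0)]
unrolled = [("add", 2)] * 4

for name, program in [("looped", looped), ("unrolled", unrolled)]:
    result, cycles = run(program, 4)
    print(name, "result:", result, "size:", len(program), "cycles:", cycles)
```

Both programs compute 8. The loop is smaller (3 instructions against 4) but takes 12 cycles, while the unrolled version takes 4. Chasing the speed score rewards the copy-and-paste version, which is exactly the habit you would not want to encourage in a student.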

Presenting Performance

Having said that, one really nice thing about SpaceChem’s scores is the way they are presented. Lots of games just present your score and global rank. But being #148502 in the world is not really very exciting, nor is it particularly motivating. So SpaceChem presents histograms, showing everyone’s score, and where yours sits:

As well as being statistically sensible (a point estimate for the mean would not tell the full story), there is motivation to improve. The diagram above shows that my solution was clearly taking a lot longer than most people, and it looks like there’s a fairly common solution (the tall blue bar on the left) that I have missed. I wonder if it would work to hand out histograms like this to school/university classes after an assessment?
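For a class, something like this would not need much machinery. Below is a minimal sketch that prints a text histogram of integer scores and marks the bin containing one student's score. The function name, the bin width, and the sample scores are all assumptions made for illustration.

```python
from collections import Counter


def text_histogram(scores, mine, bin_width=10):
    """Return histogram lines for integer scores, marking the bin holding `mine`.

    Assumes `scores` is a non-empty list of integers and that `mine` is one
    of them; otherwise no bin may be marked.
    """
    bins = Counter((s // bin_width) * bin_width for s in scores)
    my_bin = (mine // bin_width) * bin_width
    lines = []
    for start in range(min(bins), max(bins) + bin_width, bin_width):
        marker = "  <- you" if start == my_bin else ""
        label = f"{start:>4}-{start + bin_width - 1:<4}"
        lines.append(f"{label} {'#' * bins[start]}{marker}")
    return lines


# Hypothetical cycle counts for one puzzle; the last one is "mine".
scores = [12, 15, 18, 33, 57]
for line in text_histogram(scores, mine=57):
    print(line)
```

Empty bins are printed too, so the gap between a cluster of common solutions and an outlier stays visible, which is the part a bare rank hides.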
