Permanent Registered Reports Track at Computer Science Education

Recently, Aleata Hubbard Cheuoua, Eva Marinus and I acted as guest editors of a special issue at the Computer Science Education journal. The special issue was distinctive because it only accepted replication studies, and reviewed them as Registered Reports. Registered Reports are a new way to publish science: you peer review the study’s design, before the study has been carried out:

(Image Creative Commons Attribution-NoDerivatives 4.0 International License from COS.)

This minimises all kinds of questionable research practices (such as fishing for significant results) and means that changes can be made at a meaningful stage. Often in classical peer review, reviewers ask for changes (e.g. “you should have collected data on prior experience”) that cannot be addressed after data collection has been completed.

I don’t want to spend a long time writing about Registered Reports here, but the bottom line is: we completed the special issue, it went well, and you can read our editorial, the issue itself, and a whole paper that we just published at ICER 2022 about the experience.

Permanent Registered Reports Track

As a result of the special issue and the paper, Aleata, Eva and I have been appointed as Associate Editors at Computer Science Education to handle a permanent Registered Reports track. The addition of the track is currently being processed by the publisher: we have written submission instructions that will appear on the journal website shortly, and a new track is being added to the submission system. In the meantime, the draft of the submission instructions is in this Google Doc (although it may change slightly before becoming official).

We would like to invite all researchers in computer science education research to consider whether to make their next study a registered report. It doesn’t fit all models of research (hence it’s an extra track, not a replacement) but for many studies it fits very well. If you’re at the planning stage for a new study, you could write up your planned method and analysis, submit it as a registered report, and get expert feedback plus an accept-in-principle before you’ve even begun collecting the data. What’s not to like?

If you have any questions about registered reports at Computer Science Education, please feel free to contact Aleata, Eva and me. We are grateful to the editors-in-chief, Jan Vahrenhold and Brian Dorn for commissioning the special issue and for adding this as a permanent part of their journal.

Reflections on Dagstuhl organisation

Last week I attended Dagstuhl seminar 22302, which I co-organised. I thought it might be useful to record some thoughts on what went well and what could have gone better from an organiser’s point of view. (Note: there were three other organisers; I speak only for myself and the others may well have different opinions.)



A Dagstuhl seminar is an invite-only event on a specific computer science related topic (proposed by the organisers) that is held at the Dagstuhl conference centre in rural Germany and subsidised by the German government. Attendees pay 50 EUR a day for accommodation and three meals a day, with no registration fee. So the cost is 250 EUR for the week, plus travel expenses to get there.

Not everyone knows the Dagstuhl model and, with hindsight, that includes the invitees. We invited seven non-academics, but only one of them attended. There are all kinds of possible reasons for this: maybe they didn’t have a week to spare, maybe the case is hard to make in their company, COVID risks, etc. But maybe some just didn’t understand the low cost or the overall model; at least one invitee said they’d come next year instead, but each Dagstuhl is a one-off. We perhaps should have made a clearer pitch to those unlikely to be familiar with Dagstuhl.

This plays into a related point: we deliberately invited many people outside the usual community for educational programming languages and systems, to try to broaden the field. But I think in general many of those “external” attendees declined, and instead the group photo at the top of this post contains mostly familiar faces. In hindsight maybe we should have been more pro-active in reaching out to the “new” people and explaining why we wanted them there and how we thought they might benefit, in order to also benefit from their attendance.

We invited 42 people and 18 attended, which is probably about right considering this was organised in the time of COVID (Dagstuhl say 50% acceptance is typical). When travel decisions needed to be made last winter, some universities were prohibiting travel and some participants were not willing to take the uncertain risk. We had the option to run a hybrid Dagstuhl but we decided the benefits of in-person attendance outweighed the loss of some attendees who could not make it in person, and personally I am still happy with that decision.



One temptation at conferences is to jam the program full of content. Based particularly on our co-organiser Shriram’s experience and advice (which is publicly available) we ended up having 2 morning sessions of 1.5 hours, a 2 hour afternoon session after lunch, followed by 30 minute refreshments and then a 2 hour break before dinner. After dinner on the first three days was a 1 hour panel. This felt about right to me, and I appreciated the longer break in the middle of the day. In hindsight maybe we could have put the 2 hour break after lunch (several participants, me included, felt like a nap after lunch!) and then put the afternoon session before dinner, but it was still good to have the break.

One of the challenges of Dagstuhl is to avoid just making it into a standard conference. We did have talks: the first three days were spent on short (15 minutes incl Q&A, followed by some joint Q&A) talks to get everyone aware of each other’s work. Every attendee gave at least one talk — on a topic they proposed, via surveys sent out before the event. The last two days were spent doing responsive activities: two break-out sessions with topics decided on Wednesday, a sort of lightning talk session with last-minute sign-ups and a Google Doc collaboration session. I think that was a good balance although we ended up being a little late finalising and communicating the schedule to the participants.

That Google Doc session is something I’d seen work well before: everyone in the room loads up a Google Doc on a topic (e.g. brainstorming ideas for new research projects) and starts typing. Many ideas are generated and then potentially combined, and people can add comments on what others have written. We wrote two documents in parallel with about 20 people in the room and each was 3-4 pages after about 20 minutes. I think in general it’s a good idea: it generates useful content that people can refer back to, and it’s a good way for more junior/introverted/reticent people to contribute without having to talk in front of the whole room. I think in hindsight there were two mistakes: one is that I asked people to suggest ideas for Google Doc topics before they had experienced such a session, so they were suggesting topics for something they didn’t really understand. Once the actual Doc writing was underway I think people understood how it worked and saw the benefits. The other mistake is I had allocated most of a 1.5 hour session but the actual Doc writing dried up after about 20 minutes. So better would be either a shorter session, or maybe two consecutive Docs: one on a pre-determined topic to demonstrate the idea, then brainstorming topics for the next Doc, followed by writing it.



Our programme left time for social activities which are a useful part of any conference: time to relax, to get to know other researchers and discuss work (and non-work) informally in smaller groups. It was nice to see a variety of activities: some running and walking, some music, some games, all optional and led by the attendees. Often these were adjacent — playing The Mind is an interesting challenge when you are being serenaded by Let It Go on ukulele just to your left. Personally, I enjoyed the chance to run with Mark Guzdial and Shriram, to play Wavelength and more with many of the attendees, and some good mini-snooker sessions with Michael Lee.


It was an interesting experience to organise a Dagstuhl. It broadly splits into five phases: writing the topic proposal to Dagstuhl, planning the invitee list, organising the programme, organising during the event, and writing the report afterwards (not done yet). The invitees and programme are the part I remember taking the most effort, to go from a blank sheet in both cases to agreement among the four organisers. The actual during-event organisation was surprisingly smooth; I was able to enjoy it just as much as if I was an attendee, which I had worried would not be the case.

I’d recommend submitting a Dagstuhl proposal to other computer science education researchers, especially those looking to initiate and strengthen collaborations in a particular area. The next proposal deadline is in November, so now is a perfect time to reach out and start putting together an organiser team (typically 3-5 people). As an extra motivation, something I only learned while checking out yesterday is that attending is free for organisers.

Errordle: a serious game

In January 2022, the online game Wordle soared in popularity. It’s a word-guessing game where you guess a five-letter word and are told for each guess whether a letter is in the right place in the target word, present in the solution but in a different place, or not present at all in the target word. I fully expect that its popularity will crash soon enough, like most fads, although not before the NYT acquired it.

At the end of January 2022 I (virtually) attended a Dagstuhl seminar on the human factors of error messages, and one of the organisers, Brett Becker, put a call out for fun activities for the opening Sunday night session. So I came up with Errordle: an error message guessing game. I was fortunate that others had already written an open-source clone of Wordle that I could adapt.

Errordle has the same basis as Wordle, but you guess words in a target Java error message, rather than letters in a target word. As a hint, you are shown the single line of Java code on which the error occurs, according to the compiler. (Technically: the line where the error starts, in the case of multi-line errors.) Your job is to then guess the error message by putting together words that can occur in the error message.
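As a sketch of how the word-level feedback might work (my own illustration, not the actual Errordle code), the Wordle mechanic translates to words roughly like this:

```python
def score_guess(guess, target):
    """Score each guessed word against the words of the target error message:
    'green' = right word in the right position, 'yellow' = word occurs
    elsewhere in the message, 'grey' = word not in the message.
    This naive sketch ignores duplicate-word subtleties."""
    result = []
    for i, word in enumerate(guess):
        if i < len(target) and word == target[i]:
            result.append('green')
        elif word in target:
            result.append('yellow')
        else:
            result.append('grey')
    return result

# Guessing against the real Java error "illegal start of expression":
feedback = score_guess(['illegal', 'expression', 'here', 'of'],
                       ['illegal', 'start', 'of', 'expression'])
# feedback == ['green', 'yellow', 'grey', 'yellow']
```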

Errordle is now publicly available and you can play here: If you’re interested, read on for details about the data, some privacy and technical considerations, and some reflections on it.

Blackbox Dataset

The source code lines and error messages in Errordle come from the Blackbox Mini subset of our Blackbox dataset. This means that they are all real error messages that occurred in the wild, for some random user of our BlueJ Java IDE who has opted in to data collection. I did handpick the final set for this public version, both for privacy reasons and to provide a varied challenge.

The Blackbox dataset records participants’ full source code with light anonymisation, and as such some of the data can still contain identifying details. For this reason the dataset is not public, and we cannot expose the complete source code that produces the error. So unfortunately you will only see the single line of code that is exposed, and may well believe that the error cannot possibly be produced by that line. More on that at the end.

The errors are not the full range of Java error messages. I restricted the errors to those that are five words or less, and it turns out that this excludes all type-related error messages, and only leaves some syntax error messages, which tend to be terser.

Some Reflections

Participants at the Dagstuhl seminar seemed to enjoy the game. I ran a wrap-up session late in the conference where I provided my own reflections:

  • The “keyboard” of the words in the errors that you use to make a guess shows the words learners are confronted with. On the whole, they are a mix of technical terms (class, interface) and negative words (illegal, missing).
  • There are many options for the exact wording, even knowing what the error should be and with a limited selection of words. There is no noticeable consistency between the messages.
  • Many errors are not complete sentences. (Although this is slightly biased by the restriction of five words or less!)
  • The line indicated by the compiler as the error location is often not the cause of the syntax error. You often need the preceding lines to fully understand the error. I’ve deliberately made the game frustrating by only showing one line; this is what Java learners have to deal with if they focus on the red underline showing the error and do not realise that they need to check the lines above.

So, go and play:

CSE Special Issue: Registered Report Replications

There is often much hand-wringing about research quality. Researchers are familiar with many of the issues: the file-drawer effect, HARKing, p-hacking, over-emphasis on novelty/surprise. We know some possible solutions to these problems, but putting them into practice is difficult when they are in conflict with existing practices. For example, registered reports involve reviewing a paper at the design stage, before data collection. This avoids a focus on novelty of results and allows meaningful changes to the study in response to reviews. Making sure to accept replication studies can make our knowledge base sturdier, shoring up existing foundations before moving on to new studies.

So: what if we took these great ideas and combined them into a journal special issue? Aleata Hubbard Cheuoua, Eva Marinus and I will be guest-editing a special issue of the Computer Science Education journal that will only accept replication studies that are published as registered reports. This means you will submit your introduction, motivation, and research/analysis design for a study which has not yet taken place. This will be reviewed, and if you get a positive decision, the paper is accepted in-principle, regardless of the results. The study must be a replication, which means much of your research design will already be decided for you. So the upfront effort is low — although don’t forget that if you get accepted, you must actually do the study!

Because this is a new way of doing things for our area, we are issuing a call for papers and a call for reviewers; there will be training for the reviewers. The full text of the call will appear soon on the CSE journal website, but in the mean time here is the call and a draft of our further details webpage. If you have any queries, please do contact Aleata, Eva or me.

New Blackbox Mini dataset

This post introduces our new Blackbox Mini dataset of novice programmer behaviour. First, let’s whizz through a quick bit of Blackbox background, because Blackbox Mini doesn’t make sense if you don’t know what Blackbox is. We develop BlueJ, a beginner’s IDE for Java. All users of BlueJ are asked to opt-in to recording their activity data for research purposes. This data is automatically uploaded to the Blackbox dataset, which collects things like debugger use, testing activity, compile errors, source code edits and more. Blackbox is stored in an SQL database with a documented schema. The Blackbox dataset has been collecting since 2013, and is now quite large (the central table has 4 billion rows). Blackbox access is available to other researchers on request. Phew! So that’s Blackbox.

Introducing Blackbox Mini

Two years ago we ran a survey of researchers who were using Blackbox. One item of feedback was that the database was unwieldy for the common use case. Many researchers wanted to look at source code and compile errors, but were then faced with an SQL database which was so large that even simple queries could take a long time and produce a flood of data. It felt like there was room for improvement in the way the data was presented to researchers. This is the motivation behind a new dataset: Blackbox Mini.

Blackbox Mini is a small subset of the original Blackbox data, with an additional simplified way to access the source code. The intention is that this will make it easier for new researchers to get started. I’ll dig into each of those features in turn.

A small subset

The subsetting is fairly straightforward: for simplicity, I took the initial part of the Blackbox dataset, with the first million events from 2013 (rather than the full 4 billion since). This does mean it lacks more recent data such as BlueJ 4’s change to compilation behaviour, but it was technically by far the easiest way to extract a subset.

The subset is stored in a database with an identical schema to the full dataset. The advantage of a smaller database is that queries should execute much more quickly, and the results are less likely to be overwhelming. Because the schemas are identical, queries developed and tested on the smaller database can be run as-is on the larger database afterwards.
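As a toy illustration of why identical schemas help (the `events` table and its columns here are invented for this sketch, not the real Blackbox schema), the exact same query text can be pointed at either database:

```python
import sqlite3

def make_db(rows):
    """Create an in-memory database with a toy, invented schema
    and insert some event rows."""
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE events (id INTEGER, event_type TEXT)')
    conn.executemany('INSERT INTO events VALUES (?, ?)', rows)
    return conn

mini = make_db([(1, 'compile'), (2, 'edit')])
full = make_db([(1, 'compile'), (2, 'edit'), (3, 'compile')])

# Identical schemas mean the same query string runs unchanged on both.
query = "SELECT COUNT(*) FROM events WHERE event_type = 'compile'"
(mini_count,) = mini.execute(query).fetchone()
(full_count,) = full.execute(query).fetchone()
```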

An additional simplified way to access the source code

The more interesting part is the new data format, which solves a lot of previous problems. A lot of researchers are attracted to Blackbox to try to analyse the source code and compile errors, which are a bit buried away in our database. And even when you get the source code out of Blackbox, you have the general problem that you need a parser to perform any analyses. So our problems are:

1. The source code is not so easy to access, and common tasks such as “find me all the sequential versions of this source file and the compile errors at each stage” are not obvious.
2. You need to find a Java parser compatible with your analysis language of choice (be it Python, Java itself, Scheme, etc).
3. Not all the parsers are up to date with Java 8 (which added lambda syntax).
4. Many parsers are only designed for syntactically valid code, but much of the Blackbox data is syntactically invalid. It’s hard to analyse the causes of compiler errors if you can’t handle any invalid code.
5. Even once you have a syntax tree, doing analysis on tree structures can be laborious in many languages.

Two years ago I was at a Dagstuhl and met a researcher who was working on a project called SrcML. SrcML takes in Java source code and produces an XML document that represents the parse tree. For example, this fragment (abbreviated for space reasons):

public static void main(String [ ]  args)
    int test1 = 96;    //test score 1

Turns into this piece of XML:

<function><type><specifier>public</specifier> <specifier>static</specifier> <name>void</name></type> <name>main</name><parameter_list>(<parameter><decl><type><name><name>String</name><index>[ ]</index></name></type> <name>args</name></decl></parameter>)</parameter_list>
        <decl_stmt><decl><type><name>int</name></type> <name>test1</name> <init>= <expr><literal type="number">96</literal></expr></init></decl>;</decl_stmt>         <comment type="line">//test score 1</comment>

This may not seem like it’s doing much besides changing to an isomorphic format, but it actually helps mitigate all of our problems:

1. The format can be a directory full of XML files, no database needed.
2. XML libraries are generally easier to find for your programming language of choice than a Java parsing library.
3. Only SrcML has to be up-to-date with Java 8 (although admittedly it doesn’t support Java 10’s var keyword).
4. SrcML is somewhat robust to syntactically invalid code.
5. XML already has a query language, XPath, designed to allow easy querying of tree structures, which we can use for analysis.
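For instance, Python's built-in ElementTree supports enough of XPath for this kind of querying; here is a minimal sketch on a hand-made, simplified srcML-style fragment:

```python
import xml.etree.ElementTree as ET

# A hand-made fragment in the style of the srcML output shown above
xml = ('<unit><decl_stmt><decl><type><name>int</name></type> '
       '<name>test1</name></decl></decl_stmt></unit>')
root = ET.fromstring(xml)

# XPath-style query: the type name of every declaration
types = [el.text for el in root.findall('.//decl/type/name')]
# types == ['int']
```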

So Blackbox Mini provides source code in a simple file-based format. There is a sub-directory for each Blackbox project. Within that is an XML file corresponding to each source file in that project. The XML file contains a sequential list of historical versions of that file, one version for each compilation attempt, and we use XML attributes and tags to list the compile errors and the (SrcML-processed) source code for each version.

This means that everything you need to know for the most common use case (looking at source code and compile errors) is in the XML files. At a workshop at the abbreviated SIGCSE 2020 I worked through some examples of performing analysis from Python. Here is a complete Python 2.7 script for counting the frequencies of types used in declarations in all of the Blackbox Mini source code:

import xml.etree.ElementTree as ET
import os
from collections import Counter

acc = Counter()
for (dirpath, dirnames, filenames) in os.walk('/mini/srcml'):
    for filename in filenames:
        root = ET.parse(os.path.join(dirpath, filename)).getroot()
        for name_el in root.findall('.//decl/type/name'):
            if name_el.text is not None:
                acc[name_el.text] += 1
print(acc.most_common(10))

That takes about twenty minutes to run on the full dataset (I suggest using an even smaller subset during development). The top ten types are: ‘int’: 526817, ‘String’: 172710, ‘double’: 111872, ‘Color’: 42114, ‘boolean’: 34433, ‘ActionEvent’: 15868, ‘Scanner’: 15614, ‘JButton’: 15272, ‘Graphics’: 13729, ‘JPanel’: 13085. There are further questions to address before interpreting this result: the types are counted at each compile, and I believe a few much-compiled long programs are biasing the data a bit. Should you normalise by source file, by project, by user or a combination? What I like is that Blackbox Mini lets me move forwards to the interesting research considerations much more quickly, rather than being stuck on the technical aspects of getting the data.
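For example, normalising by source file (counting each type at most once per file) is only a small change to the tallying step; the per-file lists below are made up for illustration:

```python
from collections import Counter

# Hypothetical per-file type occurrences (invented data; in practice
# these would be gathered during the XML walk)
files = {
    'Proj1/Foo.xml': ['int', 'int', 'String'],
    'Proj2/Bar.xml': ['int', 'double'],
}

per_file = Counter()
for type_names in files.values():
    per_file.update(set(type_names))  # each type counted once per file
# per_file == Counter({'int': 2, 'String': 1, 'double': 1})
```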


My hope is that this new presentation of a Blackbox subset allows more researchers to start analysing the Blackbox dataset. The Blackbox Mini dataset lives on the same machine as the Blackbox dataset, so if you already have access to Blackbox, you also already have access to Blackbox Mini (I’ll add some resources for this to our Blackroom community site). If you’re a researcher interested in getting access to the data, send an email to and we will give you the simple form to fill in to sign up.

UK and Ireland Computing Education Research Conference

This year, Brett Becker and I will be acting as program co-chairs of the new UK and Ireland Computing Education Research (UKICER) conference. It will be held at the University of Kent on 5-6th September 2019. The committee includes Janet Carter as conference chair, along with Sally Fincher, Quintin Cutts and Steven Bradley. Steven has been running the Computing Education Practice conference at Durham for the past few years, which has grown impressively. We believe that there is a growing community of computing education researchers in the UK and Ireland, but we do not have a local conference to support this community’s growth. Our hope is that this sister conference to CEP will provide a useful outlet to share Irish and British computing education research, and encourage research collaborations.

The conference will run roughly from lunchtime on the 5th to lunchtime on the 6th, with some collaboration-building events beforehand and some workshops afterwards. We thus invite submissions of research papers (max 6 pages, ACM format), and proposals for 1-2 hour workshops, by the beginning of June. More details are available on the conference website. Please feel free to send any questions to and please do share this news with anyone you think might be interested in submitting or attending. We hope to see a variety of researchers and educators for all age groups.

While My Guitar Gently Beeps

In an email this week, Greg Wilson asked me why music playing was centred around individual tutoring but programming education was not, i.e., why professional musicians still take lessons but professional programmers do not. That reminded me of some articles I’d recently read about Fender (this BBC article and this [paywalled] FT article). I wondered about similarities and differences between learning music and learning programming — especially end-user programming, where people do small bits of coding in larger systems, like formulas in spreadsheets, scripting in image editors and so on.

Gender Balance, and Goals

Like just about every other industry on the planet, guitar manufacturers have generally neglected the female demand for their product. Turns out half of all guitar buyers are women, and half of those buy electric (unlike the stereotype of women playing acoustic). It also turned out that many purchasers didn’t want to become performers — they just wanted to play by themselves. In his SIGCSE keynote, Mark Guzdial mentioned how many more end-user programmers there are than professional programmers, yet the former are treated like a pale imitation of the latter.

Drop-out Rates

You may think computing courses have high drop-out rates, but Fender discovered that within a year, 90% of purchasers have stopped trying to play, and many of those stopped within the first three months. In Fender’s case they have a clear economic motivation to fix this — if no-one can learn guitar they are not going to buy another (and they’re going to ebay the one they did buy). So Fender were driven to aid learning by making apps and tutorials. In this, perhaps, computing is not so bad — there is a wealth of material out there for learning to program.

One difference between music and programming is that we can change what programming looks like. We can’t change the core programming concepts but we have the ability to make our tools more helpful and less painful. However, the guitar is inherently unmodifiable. Sure, we can allow guitar music creation using simpler tools — keyboards and MPCs and computerised music generation — but that’s not a guitar.

The invention of those other tools did lead to an explosion in music creation though. A bunch of UK dance and electronica bands in the 90s got started by messing around with electronics and computers and seeing what came out, and similar patterns led to the rise in hip-hop producers in the US at the same time. The tools did make a difference in who could make music and how easily.

Reality Bites

My uncle said about his efforts to learn the guitar: “I pick it up for a few hours and I can make noise, but then I get frustrated that I’m not playing like Eric Clapton and put it down again.” We have expectations about what we can achieve and we all want a quick win rather than a long painful process. The programming equivalent is perhaps the young kids who want to make the next Call of Duty but then find they’re struggling to draw a turtle graphics hexagon on screen. Again, programming can build helper libraries and layers of abstraction to make this process much easier, more so than music playing where reality bites: when you pick up a violin for the first time, nothing is going to stop your ears bleeding as you try to hit a note. The musical alternative is to use an easier instrument, but in music and programming, such beginner’s tools are often sneered at.


Music teaching is a mixture of formal and informal. Some learn in schools, some have one-to-one tuition, some teach themselves via the Internet. There is a temptation among those who learned programming by themselves to describe formal school teaching as unnecessary — why doesn’t everyone do what I did and teach themselves? But just because some learn by themselves doesn’t mean that others wouldn’t benefit from widely available teaching. I learned programming by myself with a handful of books for years before I took any programming classes. But what I now wonder is: how much faster could I have learned with a teacher? Just because you can learn a certain way doesn’t mean it’s the best way.

This is a good excuse to trot out an anecdote about Jamie Smith, the producer member of the British band The xx:

[Smith] was already being hailed as a visionary figure by his label boss at XL Recordings, Richard Russell. “I found him really inspiring as a beatmaker in quite a specific way,” says Russell. “He was playing the MPC – which is a piece of studio equipment you’re supposed to use for recording and sequencing – as an instrument. That idea sort of blew my mind: that you could play something live and still have the sounds I love, the sample sounds. I started doing it myself straight away, on Bobby Womack’s album, on the stuff I was doing with Damon Albarn.” (“I didn’t realise you weren’t supposed to do that with it,” shrugs Smith. “I still don’t know how to use one properly.”)


You might say there’s three types of feedback in learning. One is the obvious signs of success — does the code compile? Does the guitar make the right note? Another is longer-term pain points — is the code hard to make changes to (e.g. because of lots of repeated code)? Is it hard to play the notes at the right pace? The third is higher-level features that aren’t obvious without external feedback — is your code well structured for others to understand? Are you using an interesting variety of musical techniques?

The lack of useful feedback for the second and third items often crops up in end-user programs. You find replicated chunks of near-identical code because the programmers don’t know about functions, repeated variables like x1, x2, x3 because they don’t know about arrays, and so on. It’s an unknown unknown — how do you get people to write better code when they don’t know what better is, and don’t know what techniques are available for doing so?
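A minimal Python illustration of that anti-pattern and the array-style alternative:

```python
# Anti-pattern: numbered variables, one per value
x1, x2, x3 = 4, 8, 15
total = x1 + x2 + x3

# The same logic with a list, which scales to any number of values
xs = [4, 8, 15]
total = sum(xs)
# total == 27 either way
```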


Learning has similar challenges in programming, music and other domains. There is a diverse set of ways to learn: self-taught, tutored, formally taught in classrooms, informally taught by peers, and so on. If we focus too much on one and assume it fits everyone, we can miss out on a lot of potential learners. What I’m struck by when observing some of our Blackbox participants is that programming can be a slow, frustrating and painful experience for many who are learning by themselves. There is no one solution to improving learning: it needs to be a combination of language design, tool design, pedagogy, materials and more.

Congratulations to Mark Guzdial

Research is ultimately about discovering new knowledge, and sharing it. A long-standing criticism of academic researchers is that they tend to falter on the sharing part. Far too many researchers think that publishing a paper in a paywalled journal or conference is sufficient. The modern world allows us to engage in many more ways with other researchers and the larger public — blogs, twitter, youtube and so on — yet many researchers still do not publicise their research findings well.

I’m on my way to SIGCSE 2019, the biggest conference in computer science education. On Friday morning the keynote will be given by Mark Guzdial, who will receive the SIGCSE Award for Outstanding Contribution to CS Education. Mark is a prolific researcher in his own right — SIGCSE statistics this week showed he is the second most frequent author there:

I wouldn’t mind betting that Mark is one of the most well-known computer science education researchers worldwide, but the reason for that is not just his own papers — it’s his dedication to disseminating research.

Mark runs the blog as a solo effort. You might know the blog even if you don’t know his name — unlike my decision to plaster my face on the side of this blog, his name is modestly buried on the about page. It is a remarkable act of dedication to run a blog for over ten years as Mark has, with regular content several times a week for hundreds of weeks on end. In trying to find the age of his blog, I reached page 345 (!) to find his first wordpress post in 2009, let alone the ones before that on a different host. He not only posts about his own research but is generous in discussing and crediting the research of others, and is also active in discussions on twitter.

There is a surprising disconnect in education between research and teaching. Part of the reason that dodgy pedagogy proliferates — learning styles, etc — is that research findings rarely make it into the hands of teachers. Platforms like Mark’s blog are a way to correct this serious issue. It’s also something that universities rarely give credit for. Public engagement is too often a “nice to have” tickbox that’s rated well below citation counts, teacher ratings and the other metrics du jour. So it’s nice to view this SIGCSE award as recognition for something that Mark has done for so long without due appreciation.

Mark is very generous with his time and is a welcoming, even-handed presence in the field. At the end of every SIGCSE the feedback form asks for proposals for future SIGCSE award winners. I’ve been writing Mark’s name in for years, and he is a worthy winner. I look forward to his keynote, which he will no doubt blog about afterwards.

Code highlighting: the lowlights

Syntax highlighting is such a ubiquitous feature in program editors that we often give it very little thought. It has even become a visual marker of program code: you can tell something is code if it is in a fixed-width font and some of the words are consistently coloured. It’s clearly a popular feature, but is it actually helpful?

The latest paper on this (paywalled, alas) is by Hannebauer et al, which I found via Greg Wilson. They set ~400 participants a variety of comprehension and editing tasks and found zero difference in correctness between having syntax highlighting on and off during the tasks. So it doesn’t look like it helps with programming.
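To make concrete what is being switched on and off in such studies: at its simplest, syntax highlighting is just token classification plus colour. Here’s a toy sketch (my own illustration, not from the paper) that colours Java keywords with ANSI escape codes — real editors use full lexers, but the principle is the same:

```python
import re

# Hypothetical minimal keyword highlighter: classify tokens, colour by class.
KEYWORDS = {"class", "public", "private", "void", "if", "else",
            "while", "return", "new"}
BLUE, RESET = "\033[34m", "\033[0m"

def highlight(line: str) -> str:
    # \b word boundaries so we only colour whole keywords, not "classes"
    pattern = r"\b(" + "|".join(KEYWORDS) + r")\b"
    return re.sub(pattern, lambda m: BLUE + m.group(1) + RESET, line)

print(highlight("public void run() { if (done) return; }"))
```

Everything beyond this — nested scopes, string literals, comments — adds complexity, but the feature under study really is this straightforward at heart.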

There are two ways to take this result. One is that since the feature is ineffective, we should stop wasting effort on building it into our IDEs. The other way to view null results is that since it makes no difference either way, we are free to choose based on other considerations. The authors of the paper imply that people seem to like syntax colouring, which may well be for aesthetic reasons. And if it doesn’t get in the way, why not make it look prettier?

The authors end with a suggestion that highlighting syntax keywords may not be the most effective use of colour, and propose a few schemes of their own, such as using colouring to do a live git blame display. I’d say the obvious other use of colour is for scope highlighting, where a coloured background indicates the extents of code blocks. BlueJ has both syntax highlighting and scope highlighting, which can be a bit busy:

When we made the Stride editor, we left out syntax highlighting and just kept the scope highlighting provided by the frame outlines, which seemed less visually noisy:

We haven’t done a study to look at the effect of this scope highlighting. But David Weintrop and Uri Wilensky did something similar when they looked at multiple choice questions shown in text-based form (with syntax highlighting) versus block-based form (which is effectively scope highlighting), and the non-null effects favoured blocks over text — although the highlighting is not the only difference:

Their paper is available online (unpaywalled).

So although syntax highlighting of tokens does not seem to make a difference, scope highlighting may aid comprehension. (If anyone wants to study this directly, you can toggle syntax and scope highlighting on and off in BlueJ, so be our guest…)
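Scope highlighting is similarly simple in principle: the editor just needs to know the nesting depth at each point and map depth to a background colour. A toy sketch (my own illustration, assuming brace-delimited code — BlueJ’s real implementation is of course more involved):

```python
# Hypothetical sketch: compute brace-nesting depth at the start of each line.
# An editor could map each depth to a background colour of increasing
# intensity to draw nested scope boxes.
def scope_depths(code: str) -> list[int]:
    depths, depth = [], 0
    for line in code.splitlines():
        depths.append(depth)  # depth at the start of this line
        depth += line.count("{") - line.count("}")
    return depths

java = """class A {
    void m() {
        if (x) {
        }
    }
}"""
print(scope_depths(java))  # → [0, 1, 2, 3, 2, 1]
```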

What makes a [computing education] research paper?

Yesterday on Twitter, Jens Moenig had some kind words to say about our journal paper on Stride and complained about its repeated rejection from other journals as a symptom of incorrect criteria for accepting computing education research papers (head to twitter for the full thread):

The core issue was this: the original Stride journal paper was a long detailed description of the design of our Stride editor and the decisions involved, with very minimal evaluation. In general, should this be accepted as a paper?

The case against accepting

Computing education is full of tools. There are lots of block-based editors and beginners’ IDEs and learning assistants and so on. I’ll admit that — even as I work on making new tools — when I come to review a paper with a new tool, I do roll my eyes briefly and wonder if yet another tool is needed. The problem for our field is that we have a lot of tool-makers, but few tool evaluators. There are far fewer researchers (like David Weintrop, for example) who perform detailed comparisons between tools that they did not write themselves. The field has a glut of unevaluated tools, which is surely not helpful for someone wondering which tool to use, and we can’t be sure that any of the tools actually aid learning. In this light, rejecting our paper seems reasonable: yet another paper on a new tool with no evaluation.

The case for accepting

There are two main arguments I see for accepting the paper. One is that the design description itself can be of value. As someone who builds tools I find it very useful to talk to other designers, like Jens and John Maloney, to find out why they made certain design decisions. I can use their tools — Scratch, Snap, GP — but that doesn’t explain the full story behind the design choices. Jens’ point is that he found it useful to read our decisions in order to improve the decisions they make in their tools. This type of exchange is beneficial for the field — the question then is whether these design descriptions should be considered computing education research papers by themselves, or should be put somewhere else (some kind of design journal? or something like a tools paper track?).

The other argument for accepting design by itself is the amount of work. Our paper with design alone was 25-30 pages, which pushes the limit for most journals, and was the summation of three years’ work. A full evaluation would add another year and another 10 pages. Should this be one mega piece of work, or two separate bits of work? It can only be two papers if the first design paper can get accepted by itself. (The counter-argument is that there’s no guarantee the second paper ever appears…)

I will say that there are differences in quality of writing about design. A lot of papers I see on tools fall into the trap of describing technical details which do not generalise (e.g. we used web server X and hooked it up to cloud Y for storage) rather than discussing design decisions and trade-offs and user considerations. They also tend, due to page limits, to have minimal descriptions and pictures of the system as a whole. I had to look quite far back in time for the related work section in the Stride paper, and I can confirm that your paper will outlast your tool, so it needs to be useful to someone who does not have the tool itself. I’ve written in more detail in a previous post about this issue.


I’m not interested in grousing about one particular paper, but this is an issue that we run into repeatedly in our research. Our team has a lot of expertise in building tools but not much in evaluation. Should we be able to publish our designs as a computing education research paper, or should it always be coupled with an evaluation? It would be much easier for us if we were able to only publish designs, but I’m sympathetic to arguments on both sides.

At the ICER doctoral consortium a few years ago, Sally Fincher and Mark Guzdial ran an exercise to ask the students and discussants what computing education research comprised. I said that it was investigations of student learning and that our tools-only approach was on the periphery. What surprised me was that all the people doing such investigations said that computing education research was tool-building, and their investigations were peripheral. I think perhaps this tension between tools and evaluation is inevitable in our field — but maybe it’s also useful.