New Blackbox Mini dataset

This post introduces our new Blackbox Mini dataset of novice programmer behaviour. First, let’s whizz through a quick bit of Blackbox background, because Blackbox Mini doesn’t make sense if you don’t know what Blackbox is. We develop BlueJ, a beginner’s IDE for Java. All users of BlueJ are asked to opt-in to recording their activity data for research purposes. This data is automatically uploaded to the Blackbox dataset, which collects things like debugger use, testing activity, compile errors, source code edits and more. Blackbox is stored in an SQL database with a documented schema. The Blackbox dataset has been collecting since 2013, and is now quite large (the central table has 4 billion rows). Blackbox access is available to other researchers on request. Phew! So that’s Blackbox.

Introducing Blackbox Mini

Two years ago we ran a survey of researchers who were using Blackbox. One item of feedback was that the database was unwieldy for the common use case. Many researchers wanted to look at source code and compile errors, but were then faced with an SQL database which was so large that even simple queries could take a long time and produce a flood of data. It felt like there was room for improvement in the way the data was presented to researchers. This is the motivation behind a new dataset: Blackbox Mini.

Blackbox Mini is a small subset of the original Blackbox data, with an additional simplified way to access the source code. The intention is that this will make it easier for new researchers to get started. I’ll dig into each of those features in turn.

A small subset

The subsetting is fairly straightforward: for simplicity, I took the initial part of the Blackbox dataset, with the first million events from 2013 (rather than the full 4 billion since). This does mean it lacks more recent data, such as anything reflecting BlueJ 4’s change to compilation behaviour, but it was technically by far the easiest way to extract a subset.

The subset is stored in a database with an identical schema to the full dataset. The advantage of a smaller database is that queries should execute much more quickly, and the results are less likely to be overwhelming. Because the schemas are identical, queries developed and tested on the smaller database can be run as-is on the larger database afterwards.
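For example, you might prototype something like the following against the mini database and only point it at the full database once it does what you want. This is just a minimal sketch: the connection details below are placeholders, and the table name master_events is my assumption for the central events table, so check the schema documentation for the real names.

import mysql.connector  # or whichever SQL client your language of choice provides

# Placeholder connection details: substitute the real host and credentials
conn = mysql.connector.connect(host='localhost', database='blackbox_mini',
                               user='researcher', password='secret')
cur = conn.cursor()
# Assumed table name for the central events table; verify against the schema docs
cur.execute("SELECT COUNT(*) FROM master_events")
print(cur.fetchone()[0])
conn.close()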

An additional simplified way to access the source code

The more interesting part is the new data format, which addresses several long-standing problems. Many researchers are attracted to Blackbox because they want to analyse the source code and compile errors, which are somewhat buried in our database. And even once you get the source code out of Blackbox, you face the general problem that you need a parser to perform any analyses. So our problems are:

1. The source code is not so easy to access, and common tasks such as “find me all the sequential versions of this source file and the compile errors at each stage” are not obvious.
2. You need to find a Java parser compatible with your analysis language of choice (be it Python, Java itself, Scheme, etc).
3. Not all the parsers are up to date with Java 8 (which added lambda syntax).
4. Many parsers are only designed for syntactically valid code, but much of the Blackbox data is syntactically invalid. It’s hard to analyse the causes of compiler errors if you can’t handle any invalid code.
5. Even once you have a syntax tree, doing analysis on tree structures can be laborious in many languages.

Two years ago I was at a Dagstuhl and met a researcher who was working on a project called SrcML. SrcML takes in Java source code and produces an XML document that represents the parse tree. For example, this fragment (abbreviated for space reasons):

public static void main(String [ ]  args)
{
    int test1 = 96;    //test score 1

Turns into this piece of XML:

<function><type><specifier>public</specifier> <specifier>static</specifier> <name>void</name></type> <name>main</name><parameter_list>(<parameter><decl><type><name><name>String</name><index>[ ]</index></name></type> <name>args</name></decl></parameter>)</parameter_list>
    <block>{<block_content>
        <decl_stmt><decl><type><name>int</name></type> <name>test1</name> <init>= <expr><literal type="number">96</literal></expr></init></decl>;</decl_stmt>         <comment type="line">//test score 1</comment>

This may not seem like it’s doing much besides converting to an isomorphic format, but it actually helps mitigate all of our problems:

1. The format can be a directory full of XML files, no database needed.
2. XML libraries are generally easier to find for your programming language of choice than a Java parsing library.
3. Only SrcML has to be up-to-date with Java 8 (although admittedly it doesn’t support Java 10’s var keyword).
4. SrcML is somewhat robust to syntactically invalid code.
5. XML already has a query language, XPath, designed to allow easy querying of tree structures, which we can use for analysis.
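To give a flavour of that last point, here is a minimal sketch using Python’s built-in ElementTree library, which supports a useful subset of XPath. The filename is a placeholder, and the element names follow the SrcML fragment above:

import xml.etree.ElementTree as ET

# Parse one SrcML-converted file (placeholder filename)
root = ET.parse('Example.xml').getroot()

# XPath-style query: the name of every method declared anywhere in the file
method_names = [name.text for name in root.findall('.//function/name')]
print(method_names)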

So Blackbox Mini provides source code in a simple file-based format. There is a sub-directory for each Blackbox project. Within that is an XML file corresponding to each source file in that project. The XML file contains a sequential list of historical versions of that file, one version for each compilation attempt, and we use XML attributes and tags to record the compile errors and the (SrcML-processed) source code for each version.

This means that everything you need to know for the most common use case (looking at source code and compile errors) is in the XML files. At a workshop at the abbreviated SIGCSE 2020 I worked through some examples of performing analysis from Python. Here is a complete Python 2.7 script for counting the frequencies of types used in declarations in all of the Blackbox Mini source code:

import xml.etree.ElementTree as ET
import os
from collections import Counter

acc = Counter()
# Walk every SrcML-processed XML file in the Blackbox Mini directory tree
for (dirpath, dirnames, filenames) in os.walk('/mini/srcml'):
    for filename in filenames:
        root = ET.parse(os.path.join(dirpath, filename)).getroot()
        # Count the type name from every declaration (fields, locals, parameters)
        for name_el in root.findall('.//decl/type/name'):
            if name_el.text is not None:
                acc.update([name_el.text])
print(acc)

That takes about twenty minutes to run on the full dataset (I suggest using an even smaller subset during development). The top ten types are: ‘int’: 526817, ‘String’: 172710, ‘double’: 111872, ‘Color’: 42114, ‘boolean’: 34433, ‘ActionEvent’: 15868, ‘Scanner’: 15614, ‘JButton’: 15272, ‘Graphics’: 13729, ‘JPanel’: 13085. There are further questions to resolve before interpreting this result: the types are counted at each compile, and I believe a few much-compiled long programs are biasing the data a bit. Should you normalise by source file, by project, by user or a combination? What I like is that Blackbox Mini lets me move forward to the interesting research considerations much more quickly, rather than being stuck on the technical aspects of getting the data.
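As a sketch of just one of those normalisation options, counting each type at most once per source file needs only a small change to the script above: collapse each file’s type names into a set before adding them to the counter.

import xml.etree.ElementTree as ET
import os
from collections import Counter

acc = Counter()
for (dirpath, dirnames, filenames) in os.walk('/mini/srcml'):
    for filename in filenames:
        root = ET.parse(os.path.join(dirpath, filename)).getroot()
        # A set, so each type is counted at most once per source file
        types_in_file = {name_el.text
                         for name_el in root.findall('.//decl/type/name')
                         if name_el.text is not None}
        acc.update(types_in_file)
print(acc)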

Summary

My hope is that this new presentation of a Blackbox subset allows more researchers to start analysing the Blackbox data. The Blackbox Mini dataset lives on the same machine as the Blackbox dataset, so if you already have access to Blackbox, you also already have access to Blackbox Mini (I’ll add some resources for this to our Blackroom community site). If you’re a researcher interested in getting access to the data, send an email to blackbox-admin@bluej.org and we will send you the simple sign-up form to fill in.

UK and Ireland Computing Education Research Conference

This year, Brett Becker and I will be acting as program co-chairs of the new UK and Ireland Computing Education Research (UKICER) conference. It will be held at the University of Kent on 5-6th September 2019. The committee includes Janet Carter as conference chair, along with Sally Fincher, Quintin Cutts and Steven Bradley. Steven has been running the Computing Education Practice conference at Durham for the past few years, which has grown impressively. We believe that there is a growing community of computing education researchers in the UK and Ireland, but we do not have a local conference to support this community’s growth. Our hope is that this sister conference to CEP will provide a useful outlet to share Irish and British computing education research, and encourage research collaborations.

The conference will run roughly from lunchtime on the 5th to lunchtime on the 6th, with some collaboration-building events beforehand and some workshops afterwards. We thus invite submissions of research papers (max 6 pages, ACM format), and proposals for 1-2 hour workshops, by the beginning of June. More details are available on the conference website. Please feel free to send any questions to ukicer2019@easychair.org and please do share this news with anyone you think might be interested in submitting or attending. We hope to see a variety of researchers and educators for all age groups.

While My Guitar Gently Beeps

In an email this week, Greg Wilson asked me why music playing was centred around individual tutoring but programming education was not, i.e., why professional musicians still take lessons but professional programmers do not. That reminded me of some recent coverage of Fender I’d read (this BBC article and this [paywalled] FT article). I wondered about similarities and differences between learning music and learning programming — especially end-user programming where people do small bits of coding in larger systems, like formulas in spreadsheets, scripting in image editors and so on.

Gender Balance, and Goals

Like just about every other industry on the planet, guitar manufacturers have generally neglected the female demand for their product. Turns out half of all guitar buyers are women, and half of those buy electric (unlike the stereotype of women playing acoustic). It also turned out that many purchasers didn’t want to become performers — they just wanted to play by themselves. In his SIGCSE keynote, Mark Guzdial mentioned how many more end-user programmers there are than professional programmers, yet the former are treated like a pale imitation of the latter.

Drop-out Rates

You may think computing courses have high drop-out rates, but Fender discovered that within a year, 90% of purchasers have stopped trying to play, and many of those stopped within the first three months. In Fender’s case they have a clear economic motivation to fix this — if no-one can learn guitar they are not going to buy another (and they’re going to ebay the one they did buy). So Fender were driven to aid learning by making apps and tutorials. In this, perhaps, computing is not so bad — there is a wealth of material out there for learning to program.

One difference between music and programming is that we can change what programming looks like. We can’t change the core programming concepts but we have the ability to make our tools more helpful and less painful. However, the guitar is inherently unmodifiable. Sure, we can allow guitar music creation using simpler tools — keyboards and MPCs and computerised music generation — but that’s not a guitar.

The invention of those other tools did lead to an explosion in music creation though. A bunch of UK dance and electronica bands in the 90s got started by messing around with electronics and computers and seeing what came out, and similar patterns led to the rise in hip-hop producers in the US at the same time. The tools did make a difference in who could make music and how easily.

Reality Bites

My uncle said about his efforts to learn the guitar: “I pick it up for a few hours and I can make noise, but then I get frustrated that I’m not playing like Eric Clapton and put it down again.” We have expectations about what we can achieve and we all want a quick win rather than a long painful process. The programming equivalent is perhaps the young kids who want to make the next Call of Duty but then find they’re struggling to draw a turtle graphics hexagon on screen. Again, programming can build helper libraries and layers of abstraction to make this process much easier, more so than music playing where reality bites: when you pick up a violin for the first time, nothing is going to stop your ears bleeding as you try to hit a note. The musical alternative is to use an easier instrument, but in music and programming, such beginner’s tools are often sneered at.

Lessons

Music teaching is a mixture of formal and informal. Some learn in schools, some have one-to-one tuition, some learn for themselves via the Internet. There is a temptation among those who learned programming by themselves to describe formal school teaching as unnecessary — why doesn’t everyone do what I did and teach themselves? But just because some people teach themselves doesn’t mean that others wouldn’t benefit from widely available teaching. I learned programming by myself with a handful of books for years before I took any programming classes. But what I now wonder is: how much faster could I have learned with a teacher? Just because you can learn a certain way doesn’t mean it’s best.

This is a good excuse to trot out an anecdote about Jamie Smith, the producer member of the British band The xx:

[Smith] was already being hailed as a visionary figure by his label boss at XL Recordings, Richard Russell. “I found him really inspiring as a beatmaker in quite a specific way,” says Russell. “He was playing the MPC – which is a piece of studio equipment you’re supposed to use for recording and sequencing – as an instrument. That idea sort of blew my mind: that you could play something live and still have the sounds I love, the sample sounds. I started doing it myself straight away, on Bobby Womack’s album, on the stuff I was doing with Damon Albarn.” (“I didn’t realise you weren’t supposed to do that with it,” shrugs Smith. “I still don’t know how to use one properly.”)

Feedback

You might say there are three types of feedback in learning. One is the obvious signs of success — does the code compile? Does the guitar make the right note? Another is longer-term pain points — is the code hard to make changes to (e.g. because of lots of repeated code)? Is it hard to play the notes at the right pace? The third is higher-level features that aren’t obvious without external feedback — is your code well structured for others to understand? Are you using an interesting variety of musical techniques?

The lack of useful feedback on the second and third items often crops up in end-user programs. You find replicated chunks of near-identical code because the programmers don’t know about functions, and repeated variables like x1, x2, x3 because they don’t know about arrays. It’s an unknown unknown — how do you get people to write better code when they don’t know what better is, and don’t know what techniques are available for doing so?
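To make this concrete, here is the kind of pattern I mean, as a made-up Python fragment rather than real Blackbox data, next to the list-based version that nobody has told the end-user programmer about:

# The pattern you often see: one variable per value, and code repeated for each
score1 = 72
score2 = 85
score3 = 91
total = score1 + score2 + score3

# The same idea with a list, which scales beyond three values
scores = [72, 85, 91]
total = sum(scores)
print(total)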

Summary

Learning has similar challenges in programming, music and other domains. There is a diverse set of ways to learn: self-taught, tutored, formally taught in classrooms, informally taught by peers, and so on. If we focus too much on one and assume it fits everyone, we can miss out on a lot of potential learners. What I’m struck by when observing some of our Blackbox participants is that programming can be a slow, frustrating and painful experience for many who are learning by themselves. There is no single solution to improving learning: it needs to be a combination of language design, tool design, pedagogy, materials and more.

Congratulations to Mark Guzdial

Research is ultimately about discovering new knowledge, and sharing it. A long-standing criticism of academic researchers is that they tend to falter on the sharing part. Far too many researchers think that publishing a paper in a paywalled journal or conference is sufficient. The modern world allows us to engage in many more ways with other researchers and the larger public — blogs, twitter, youtube and so on — yet many researchers still do not publicise their research findings well.

I’m on my way to SIGCSE 2019, the biggest conference in computer science education. On Friday morning the keynote will be given by Mark Guzdial, who will receive the SIGCSE Award for Outstanding Contribution to CS Education. Mark is a prolific researcher in his own right — SIGCSE statistics this week showed he is the second most frequent author there:

I wouldn’t mind betting that Mark is one of the most well-known computer science education researchers worldwide, but the reason for that is not just his own papers — it’s his dedication to disseminating research.

Mark runs the computinged.wordpress.com blog as a solo effort. You might know the blog even if you don’t know his name — unlike my decision to plaster my face on the side of this blog, his name is modestly buried on the about page. It is a remarkable act of dedication to run a blog for over ten years as Mark has, with regular content several times a week for hundreds of weeks on end. In trying to find the age of his blog, I reached page 345 (!) to find his first wordpress post in 2009, let alone the ones before that on a different host. He not only posts about his own research but is generous in discussing and crediting the research of others, and is also active in discussions on twitter.

There is a surprising disconnect in education between research and teaching. Part of the reason that dodgy pedagogy proliferates — learning styles, etc — is that research findings rarely make it into the hands of teachers. Platforms like Mark’s blog are a way to correct this serious issue. It’s also something that universities rarely give credit for. Public engagement is too often a “nice to have” tickbox that’s rated well below citation counts, teacher ratings and the other matriculated metrics du jour. So it’s nice to view this SIGCSE award as recognition for something that Mark has done for so long without due appreciation.

Mark is very generous with his time and is a welcoming, even-handed presence in the field. At the end of every SIGCSE the feedback form asks for proposals for future SIGCSE award winners. I’ve been writing Mark’s name in for years, and he is a worthy winner. I look forward to his keynote, which he will no doubt blog about afterwards.

Code highlighting: the lowlights

Syntax highlighting is such a ubiquitous feature in program editors that we often give it very little thought. It even serves as a visual marker of program code: you can tell something is code if it is in a fixed-width font and some of the words are consistently coloured. It’s clearly a popular feature, but is it actually helpful?

The latest paper on this (paywalled, alas) is by Hannebauer et al, which I found via Greg Wilson. They set ~400 participants a variety of comprehension and editing tasks and found zero difference in correctness between having syntax highlighting on and off during the tasks. So it doesn’t look like it helps with programming.

There are two ways to take this result. One is that since the feature is ineffective, we should stop wasting effort on building it into our IDEs. The other way to view null results is that since it makes no difference either way, we are free to choose based on other considerations. The authors of the paper imply that people seem to like syntax colouring, which may well be for aesthetic reasons. And if it doesn’t get in the way, why not make it look prettier?

The authors end with a suggestion that highlighting syntax keywords may not be the most effective use of colour, and they propose a few schemes of their own, such as using colour for a live git blame display. I’d say the obvious other use of colour is scope highlighting, where a coloured background indicates the extents of code blocks. BlueJ has both syntax highlighting and scope highlighting, which can be a bit busy:

When we made the Stride editor, we left out syntax highlighting and just kept the scope highlighting provided by the frame outlines, which seemed less visually noisy:

We haven’t done a study to look at the effect of this scope highlighting. But David Weintrop and Uri Wilensky did something similar when they looked at multiple choice questions shown in text-based form (with syntax highlighting) versus block-based form (which is effectively scope highlighting), and the non-null effects showed a superiority of blocks over text, although the highlighting is not the only difference:

Their paper is available online (unpaywalled).

So although syntax highlighting of tokens does not seem to make a difference, scope highlighting may aid comprehension. (If anyone wants to study this directly, you can toggle syntax and scope highlighting on and off in BlueJ, so be our guest…)

What makes a [computing education] research paper?

Yesterday on Twitter, Jens Moenig had some kind words to say about our journal paper on Stride and complained about its repeated rejection from other journals as a symptom of incorrect criteria for accepting computing education research papers (head to twitter for the full thread):

The core issue was this: the original Stride journal paper was a long detailed description of the design of our Stride editor and the decisions involved, with very minimal evaluation. In general, should this be accepted as a paper?

The case against accepting

Computing education is full of tools. There are lots of block-based editors and beginners’ IDEs and learning assistants and so on. I’ll admit that — even as I work on making new tools — when I come to review a paper with a new tool, I do roll my eyes briefly and wonder if yet another tool is needed. The problem for our field is that we have a lot of tool-makers, but few tool evaluators. There are far fewer researchers (like David Weintrop, for example) who perform detailed comparisons between tools that they did not write themselves. The field has a glut of unevaluated tools, which is surely not helpful for someone wondering which tool to use, and we can’t be sure that any of the tools actually aid learning. In this light, rejecting our paper seems reasonable: yet another paper on a new tool with no evaluation.

The case for accepting

There are two main arguments I see for accepting the paper. One is that the design description itself can be of value. As someone who builds tools I find it very useful to talk to other designers, like Jens and John Maloney, to find out why they made certain design decisions. I can use their tools — Scratch, Snap, GP — but that doesn’t explain the full story behind the design choices. Jens’ point is that he found it useful to read about our decisions in order to improve the decisions they make in their tools. This type of exchange is beneficial for the field — the question then is whether these design descriptions should be considered computing education research papers by themselves, or whether they should be put somewhere else (some kind of design journal? or things like a tools paper track?).

The other argument for accepting design by itself is the amount of work. Our paper with design alone was 25-30 pages, which pushes the limit for most journals, and was the summation of three years’ work. A full evaluation would add another year and another 10 pages. Should this be one mega piece of work, or two separate bits of work? It can only be two papers if the first design paper can get accepted by itself. (The counter-argument is that there’s no guarantee the second paper ever appears…)

I will say that there are differences in quality of writing about design. A lot of papers I see on tools fall into the trap of describing technical details which do not generalise (e.g. we used web server X and hooked it up to cloud Y for storage) rather than discussing design decisions and trade-offs and user considerations. They also tend, due to page limits, to have minimal descriptions and pictures of the system as a whole. I had to look quite far back in time for the related work section in the Stride paper, and I can confirm that your paper will outlast your tool, so it needs to be useful to someone who does not have the tool itself. I’ve written in more detail in a previous post about this issue.

Summary

I’m not interested in grousing about one particular paper, but this is an issue that we run into repeatedly in our research. Our team has a lot of expertise in building tools but not much in evaluation. Should we be able to publish our designs as a computing education research paper, or should it always be coupled with an evaluation? It would be much easier for us if we were able to only publish designs, but I’m sympathetic to arguments on both sides.

At the ICER doctoral consortium a few years ago, Sally Fincher and Mark Guzdial ran an exercise to ask the students and discussants what computing education research comprised. I said that it was investigations of student learning and that our tools-only approach was on the periphery. What surprised me was that all the people doing such investigations said that computing education research was tool-building, and their investigations were peripheral. I think perhaps this tension between tools and evaluation is inevitable in our field — but maybe it’s also useful.

Pedagogy of Programming Tools

If you want to teach programming, you have several decisions you need to make. You need to choose:

  • a programming language, such as Java, Python, Javascript, Ruby,
  • a programming environment, which may be something like Notepad + command-line, or a full-blown IDE like Visual Studio,
  • a context, such as making games, media computation, website creation, robotics, and
  • a pedagogical approach, such as what you will teach, in what order, and using which activities.

Not everyone thinks about that last item in terms of explicitly deciding on a pedagogical approach. But as soon as you start making decisions such as: “what task do I start with?”, you are implicitly deciding. Do you start with “What is a variable?” or “Here’s how to print ‘Hello World’” or “This is the syntax of a function call”? Do you teach automated testing? Do you start with a blank program or start by modifying an existing program? You have always chosen a pedagogical approach, whether you realise it or not.

What’s interesting about the four items above is that they all interact with each other. The top three clearly so: you can’t write Java in IDLE, for example, and you may find your robot of choice doesn’t support Javascript. But the tool and the language you choose will affect the available pedagogy and vice versa. Programming tools are not pedagogy-neutral. Your tool determines which programming-related activities are easy and which are hard, which in turn will affect how you use the tool to teach.

Code tracing is a useful skill, but doing it in an environment with a debugger that shows variable values step-by-step makes it much easier than in Notepad+command-line. Parsons problems (where you drag bits of pre-written code into order) are easier in Scratch than in a text editor. BlueJ lets you call methods on objects via the context menu without writing any code, whereas an IDE like IntelliJ does not. It’s useful to understand what pedagogies your tool supports or makes difficult when making a choice.

In our latest Greenfoot Live video, my colleague Hamza and I sat down for half an hour to do some Greenfoot programming and talk about pedagogical strategies in Greenfoot: ways you can use it to teach, and what pedagogical approaches we have in mind when designing the tool. I’m quite pleased with how it turned out, and I think it’s worth watching:

Whether you agree with our particular pedagogical philosophies or not, next time you choose a programming language and tool, be aware of its impact on what teaching approaches and activities it can support well, and which activities it will make hard for you to engage in.