A major part of research is acquiring and sharing knowledge. This is not as straightforward as it should be, for political/business reasons (see: journal publishers, paywalls and open access), but technically it is at least simple. You write a paper consisting of words and pictures; other people download it and read it. Knowledge has been transmitted. Where life gets much more difficult is a newer but fast-growing part of research: sharing software.
One use of software in research is as a tool for doing analysis. This affects all the natural sciences, and there are issues with how to gain credit for producing software (see the new proposed policy on software citation, and the new journal of open source software). But within computer science there is an additional research role for software: sometimes the software is part of the research output. Nowhere is this more apparent than in areas like human-computer interaction research.
Publishing software interface research
The typical form of a modern paper about a new software interface is to provide a description of the interface, followed by an evaluation of the interface with human participants. Thus the research output is two-fold: design and science. Putting these both into a paper might seem sufficient.
However, accurately describing an interface design in text is a difficult task — the medium is just ill-suited. (Much like writing about music being compared to dancing about architecture.) It is difficult to describe the function of all interactions with the system: you’d write an endless series of “when the user presses left here they return to the home screen”; something almost akin to the original program code. You’d also need to describe not only the intended interactions but what happens when the user does something wrong. You also can’t use pure text: images are surely necessary to portray an interface. And that’s not to mention emergent properties which affect software’s usability, like the speed of the interface. Ultimately, if you want to understand the design of a software interface, there’s very little substitute for just using the interface.
Research Software Archaeology
Recently, I needed to write a detailed related work section for our work on frame-based editing. One of the challenges of publishing this work is that it is similar to work on the structured editors of the 1980s, which have largely failed to catch on. And additionally, it seems every reviewer knows a different editor, so each one seems to come back with “how is this different to structured editor X that I used in the 1980s?” 
So I end up searching for details about the design of 1980s structured editors. If there’s no paper and no software, there’s not really any way to find out about the design. If there is a paper, I hope that it has a reasonably detailed description of the editor (for example, the write-up of the Cornell Program Synthesizer). Regardless, I also try to search for a runnable version of the software. Ha!
There are few editors from that period which are available to run on a modern machine. Some were simply never released, partly because pre-Internet, sharing software was awkward. Some are unavailable due to their age: many of the structured editors were designed for processors or operating systems which are no longer available. So some editors seem to be totally lost — I can’t find any leads on downloading a copy of the Cornell Program Synthesizer, for example. Some other editors have a tantalising binary distribution which often cannot be run: for example, Boxer’s Mac binary.
I did have one or two successes, such as getting a version of the GENIE editor running in an emulator. And it was a revelation that greatly pushed forward my understanding of old structured editors. By modern standards, they were awful. The papers’ descriptions didn’t make clear how tedious and fiddly the navigation was, how unhelpful the editor was, how awkward it was to deal with errors. Running the software was an absolutely crucial step to comparing our work to theirs. It allowed me to understand the design and critique the editor’s operation for myself, rather than relying on the authors’ incomplete descriptions of their own software.
For all the other editors which I couldn’t run, there are these reviewers asking the perfectly valid question in research: “How does your work relate to previous work X?” And the honest answer is: I don’t know. Perhaps nobody can know any more — the paper wasn’t very detailed and the software is lost in time. This is no way to do research.
The solution to all of this is readily apparent: if your software is part of the research output, you must publish the software. And a binary is insufficient; binaries too easily bit rot, refusing to run on modern systems with no way to fix them. Source code is what is needed.
This week Andy Ko made available his Citrus/Barista structured editor from the 2000s. I downloaded and ran it: the binary did run, but it spat out repeated exceptions and I wasn’t sure if that was impairing the functioning of the software. Thankfully, Ko didn’t just publish a binary: he published the source code. For this, I salute him. I went to modify the source code and it turned out not to compile with a modern Java compiler. After some tweaks I got it compiling, and then fixed the exception. Because the code was on GitHub, one accepted pull request later the software in his repository now compiles and runs on a modern machine. This is exactly how software research should be.
Published source code for software is crucial to allow later researchers to use, evaluate and compare the software. I fully understand that everyone feels antsy about publishing source code. If I’m honest, the Citrus source code is a bit confusing, somewhat lacking in documentation and the software seems a little rickety. But that’s how research software usually is; my research code for our Blackbox work is the same. I’m not particularly proud of that code, but my recognition that sharing the source is important marginally outstripped my embarrassment at its quality. Research code will almost always be shaky and iffy. It’s usually written by a single person (often not a professional software developer) for a single purpose, so it’s likely to be hacky and not well documented. Let’s all accept that research code is bad, and agree to share anyway.
 It’s interesting to note that when the reviewers ask how our work compares, they are implicitly asking about the design, not the science. Given that almost all venues will only accept science or design+science, it’s curious that most of the comparison to related work is about comparing the design. This is at least partly because the science quickly becomes outdated in software interfaces. Even if the older editor papers had performed rigorous evaluations (which they almost exclusively did not), the results don’t necessarily persist. If someone told you that editor X had been evaluated as easy to use and as good as text editors, tested on a 25-line text terminal on a 1980s thin-client Unix machine, would you say that was useful in evaluating editor X against a modern editor? Would it even be worth comparing the usability of our editor directly against a 1980s editor? I doubt it; the usefulness of the previous work is more in comparing our design to theirs, not so much our scientific evaluation against theirs.
 Given that we make software — BlueJ and Greenfoot — which we encourage people to use, I should point out that they are actually stable, reliable, and fairly well engineered! And open source, to boot. The setup of our research group and funding allows us to do this, making us blessed compared to other researchers. Quality software in research is of course possible, and the preferred option, but we must recognise that it is a rarity, and not let that get in the way of sharing.