I work in the team developing BlueJ, a tool (an IDE) for helping beginners to learn object-oriented programming. BlueJ has over 1.8 million active users each year*. What are they all doing, exactly? Most of them are novices, learning to program. What mistakes are they making? Where are they getting stuck?
Various studies have looked at this issue specifically for BlueJ, e.g. Jadud (2007), Ragonis and Ben-Ari (2005). Other studies have looked at student behaviour in different tools but at a larger scale, e.g. Edwards et al. (2009). These studies have the advantage (not to be underestimated) of being able to relate observed behaviour to individual characteristics, to perform interviews, close-up observations and so on. What most of these studies lack is significant scale: Jadud had around 100 participants, Ragonis and Ben-Ari around 50, and Edwards et al. around 1,100. It isn't the case that more participants automatically means better research, far from it. But larger numbers can convey some benefits.
Imagine that a particular problem or misconception affects, say, one student in a hundred. A teacher with a class of twenty will see such a student once every five years. A researcher with a study of 100 will see, on average, one such student. A researcher with a large database, say 100,000 users, will have 1,000 such students. It's clear who has the best chance of analysing this problem. Large-scale data also has the advantage of generalising beyond some of the confounds in smaller-scale studies. An individual smaller study can be confounded by the institution in which it is run, or the cohort of students, or the handful of teachers involved. A larger study with participants from hundreds or thousands of institutions goes a long way towards overcoming these biases that can be problematic at smaller scales.
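The arithmetic behind that comparison can be sketched in a few lines of Python (the 1-in-100 prevalence is just the illustrative figure used above, not a measured rate):

```python
# Expected number of students exhibiting a problem that affects
# 1 in 100 learners, at the three scales mentioned above.
rate = 1 / 100  # illustrative prevalence, not a measured figure

scales = {
    "class of 20 (one year)": 20,
    "study of 100 participants": 100,
    "database of 100,000 users": 100_000,
}

for label, n in scales.items():
    print(f"{label}: expect {n * rate:g} affected student(s)")
```

The class of 20 yields an expected 0.2 affected students per year, i.e. roughly one every five years, while the large database yields 1,000 at once.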
The other disadvantage of existing studies is that there is a lot of duplicated effort. Each study collects its own data and then throws it away again. While there are ethical concerns about personal data, if we could share the anonymous parts then we could maximise data re-use and minimise effort.
The Blackbox project
We have begun a data collection project, called Blackbox, to record the actions of BlueJ users. We're inviting all BlueJ users (with the latest version, 3.1.0, onwards) to take part. About two months into the project, we already have 25,000 users who have agreed to take part, with 1,000 sending us data each day. Based on current estimates, I expect that in November 2013 we should see around 5,000 users sending data each day, with a total of over 100,000 users. Rather than hoarding the data, we are making it available to other computing education researchers for use in their own research, so that we can all benefit from this project.
This is not a panacea for programming research — in some respects this data is incredibly limited, as it is deliberately anonymised and carries no data at all about the person sitting in front of the computer. But we are receiving a lot of program code, IDE interactions and compilation results which could prove very useful for various research projects. For example, my colleague Davin is investigating improving the quality of Java error messages for beginners. The Blackbox project will provide a lot of data on what errors users encounter, and what they subsequently do about the error, which should prove helpful to him.
Today I gave a lightning talk at the ICER conference about this project, and I’m also preparing a paper for next year’s SIGCSE conference with some more details and initial results. Meanwhile, if any researchers are interested in using this data for their own research, you can email us to ask about getting access at firstname.lastname@example.org.
(Incidentally, I think this sort of large-scale data collection is likely to be a growth area in computing education research in the next few years. I saw another lightning talk about looking at the Khan Academy data; there is also data from a lot of MOOCs, and things like Scratch 2.0 store all programs in the cloud as they are edited. The difficult aspect is getting useful, sensible results out of such a large data set. I suspect there may be a Gartner hype-cycle effect: an initial rush of papers that just tack “big data” on to some studies that suffer from the lack of context on the data, followed by some push back. As I said, it’s not a magic bullet, but deployed sensibly could form a useful research tool.)
* The 1.8 million figure is an underestimate for the total BlueJ users, but it’s a pretty accurate figure for “BlueJ users with full Internet access”, which is what matters for our Blackbox estimates. If they don’t have Internet access, they can’t take part anyway.